XML Will Make It Easier
by Ralph Kimball
By now, most of us in the data warehouse community have heard of
extensible markup language (XML). We have been told that it will
substantially extend HTML for much of our Web communications, and it will
make the content of these communications far more transparent. That all
sounds fine.
But what is XML? What will it really do for us? How far are we along in
realizing the promised benefits? What does a data warehouse project
manager need to know about XML?
What Is XML?
XML is a language for data communication. Superficially, XML takes the
form of plain text documents containing both data and metadata but no
formatting information. XML is expressed with much of the same notation as
HTML, but departs from the architecture of an HTML document. HTML contains
data and formatting information, but no metadata.
These differences between XML and HTML are crucial to understanding how
XML will affect data warehousing. XML metadata consists of tags
unambiguously identifying each item in an XML document. For instance, an
invoice coded in XML contains sequences such as:
<Customer Name = "Bob" Address = "123
Main Street" City = "Philadelphia" />
Here "Customer" is an XML element defined between the sending
and receiving parties before the invoice is transmitted. The Customer
element has been defined to contain some XML attributes, including Name,
Address, City, and possibly others.
You can see that the XML coded invoice therefore contains both metadata
(the element and attribute tags) as well as data (the values Bob, 123 Main
Street, and Philadelphia). Notice that the XML does not tell an
application such as a Web browser how to lay out the invoice on the
screen. This information is deliberately omitted from the XML document.
If we had sent the same invoice coded as an HTML document, we would
have produced a superficially similar looking document consisting of tags
and data values, but the HTML tags would be restricted to formatting the
document; they would not convey the meaning of the data. There is no way
in HTML to, for example, declare the components of an invoice.
So, if you receive an XML coded invoice, how do you know what the tags
mean? And how do you format it in your Web browser?
DTDs, XML Schemas, And XSLT
In XML, the standard way for two parties to define the set of possible
tags is by exchanging a special document known as a document type
definition (DTD). The DTD declaration for our customer example could be
expressed as:
<!ELEMENT Customer (Name, Address,
City?, State?, Postalcode?)>
<!ELEMENT Name (#PCDATA)>
plus similar lines for Address, City, State, and Postalcode.
Here the question marks after City, State, and Postalcode indicate that
these fields are optional. The #PCDATA declaration indicates that Name is
an unformatted text string.
Notice that the DTD contains somewhat less information than a SQL
CREATE TABLE statement. In particular, there is no field length.
DTDs have until now been the basis for setting up a "metadata
understanding" between two parties exchanging XML, and a number of
industry groups have been defining standard DTDs for their subject areas.
But as data warehousers, we don't get enough from DTDs to build a SQL
table. To rectify this problem, the World Wide Web Consortium (W3C)
standards organization has been working on an industrial strength
successor to DTDs known as XML schemas. XML schemas will contain much more
database-oriented information about data types and how the XML elements
relate to each other - or in other words, how tables can be joined.
After participating parties agree on an XML schema and one receives XML
content from the other, the recipient uses a parsing engine to render the
information content via another specification called extensible stylesheet
language transformations (XSLT). Actually, XSLT is a general mechanism for
translating one XML document into another XML document, but its most
visible use is for turning XML into HTML for final on-screen presentation.
Where XML Will Make Life Easier
XML is intended to be the "dial tone" for computers to
exchange information. Although XML is independent from the Web, most of
the interesting uses will piggyback on the Web's hypertext transfer
protocol (HTTP). Here, in my opinion, are the places where XML will affect
data warehousing: legacy data extraction, input transaction capture,
direct storage of XML, and agnostic front-end information delivery.
Legacy data extraction. If a data source can expose its data in
XML format, then parties sharing a common XML schema can transmit and
receive the data. Although older legacy applications may need to be
retrofitted with XML writers, many more modern systems will be equipped to
write data in XML format. Relational databases, such as Oracle, DB2, and
Microsoft SQL Server 2000 already support query output and bulk data
transfer directly in XML form, with no intermediate application required.
The ability to read remote data via XML greatly simplifies data
extraction, because custom parsers for the remote data source do not have
to be written. This, in turn, promotes a more distributed and
opportunistic approach to a spread-out "data webhouse."
Input transaction capture. In many cases, the data warehouse is
focused on capturing the original transaction inputs and using them as the
source of data, rather than waiting for an extract or a query from the
production system. This transaction capture capability has been growing in
importance with the increased interest in realtime data warehouses.
A serendipitous side effect of the growing use of XML is that the
transaction flow from a data input terminal to the production system can
be siphoned off so that the data warehouse receives the transactions in
parallel with the production system. If the data is expressed in XML and
the XML schemas are shared, then the data warehouse can update in parallel
with the production system, without waiting for a conventional extract.
Direct storage of XML. One of the more interesting points of
overlap between XML and relational databases is the possibility of storing
XML documents directly as relations. Direct storage can take a couple of
forms. An XML document can be a kind of replacement for a SQL INSERT
or UPDATE statement, where the data in the XML document ends
up in relational tables. Or relational systems will implement a new XML
native data type. This transformation is more than cosmetic: XML content
can be far more hierarchical than most relational systems support. Vendors
are discussing hybrid scenarios in which data in XML form could be joined
inside the database directly to data in conventional relational tables. An
XML data type in a relational system would open up whole new classes of
hierarchical applications.
Agnostic front-end information delivery. The widespread
deployment of XML will be the final step in removing query and reporting
tools from end users' machines. An XML data transfer plus an associated
XSLT formatting specification may be enough to produce any desired user
interface presentation on a remote browser. What makes this interesting is
that different end-user clients can use different XSLT formatting
specifications while drawing data from the same database server. All that
the server needs to do is expose its query results through XML. All that
is needed at the client end is a Web browser.
The State of XML Implementation
XML is a set of open standards, defined through W3C. Although visiting
the W3C's Web site (www.w3c.org)
is interesting, browsing it is like drinking from a fire hose. There are
dozens of subcommittees, vendors, and industry groups working on various
aspects of XML, DTDs, and myriad other standards. But be aware that many
of these standards are part of the plumbing of XML, and are of more
interest to software developers in the vendor community than to data
warehouse project managers.
Honestly, as a data warehouse project leader, I don't think you should
dig deeply into the alphabet soup of all these committees and their work
at this time. Rather, I would continue to wait for your primary, familiar
vendors to complete their XML capabilities, and then I would try to take
advantage of the capabilities described in this column. Ultimately all of
this should make it easier to build up your data warehouse by connecting
remote, unrelated systems while at the same time doing less custom
software development.
|