DTDs and Schemas: XML Validation Structures

Originally published December 2003 [ Publisher Link ]

The term "valid XML" is often used when writing XML code, but each programmer seems to perceive what is valid differently. For some developers, valid XML means that a structure conforms to all of XML's syntax rules: every element is properly closed and nested, and every attribute value is quoted. Loosely speaking this can be considered valid XML, but the more appropriate term is well-formed XML.
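As a quick illustration of well-formedness alone, consider the sketch below. The element and attribute names are made up for this example and do not come from any real vocabulary. The document itself is well-formed, and the trailing comment lists variants that a parser would reject.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Well-formed: every element is closed, elements nest properly,
     and the attribute value is quoted. All names are illustrative. -->
<note priority="high">
  <to>Reader</to>
  <body>Hello</body>
</note>
<!-- Variants that are NOT well-formed:
     <note priority=high>                  attribute value without quotes
     <to>Reader<body>Hello</to></body>     improperly nested elements
     <body>Hello                           element never closed
-->
```

Nothing here says which elements are allowed or in what order; that is the job of the contracts discussed next.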

For other developers, "valid XML" means an XML fragment that, besides being well-formed, also complies with a specific contract that indicates, among other things, the order in which each element must appear among its siblings and how many times each element may occur inside its parent.

In this article we will delve into the specifics of the two most widely used contracts for creating valid XML: document type definitions (DTDs) and schemas.

Document Type Definitions: An SGML inheritance

DTDs were originally conceived for SGML (Standard Generalized Markup Language), an older and broader markup language than XML. When XML was developed, it adopted DTDs as its mechanism for checking the validity of an XML structure.

The following fragment shows an XML structure for which we will later define both a DTD and a schema:

<?xml version="1.0" encoding="UTF-8"?>
<article>
  <date>2003-12-12</date>
  <subject>XML</subject>
  <sites main="devchannel.org" alternate="newforge.org"/>
</article>

Notice that this XML snippet is well-formed. However, if we needed to check its data structure in more detail, to guarantee that it works in an application or can be relied on by third parties, an XML validating structure would be in order. The following DTD can be used to validate the XML code:

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT article (date, subject, sites*)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT sites EMPTY>
<!ATTLIST sites
    main      CDATA #REQUIRED
    alternate CDATA #IMPLIED
>

The declarations in the DTD indicate the following:

<!ELEMENT article (date, subject, sites*)>: The article element must contain exactly one date element followed by exactly one subject element, in that order, and may then contain the sites element zero or more times.
<!ELEMENT date (#PCDATA)> and <!ELEMENT subject (#PCDATA)>: Both the date and subject elements contain only text (parsed character data), with no child elements.
<!ELEMENT sites EMPTY>: The sites element cannot contain any element or text.
<!ATTLIST sites main CDATA #REQUIRED alternate CDATA #IMPLIED>: The sites element can carry two attributes: main, which is required for validation, and alternate, which is optional.
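To have a validating parser apply these declarations, the XML document has to point at the DTD through a document type declaration. The sketch below assumes the DTD above was saved next to the document under the file name article.dtd. That name is an assumption made for this example, as is the second sites entry.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- "article.dtd" is an assumed file name for the DTD shown above -->
<!DOCTYPE article SYSTEM "article.dtd">
<article>
  <date>2003-12-12</date>
  <subject>XML</subject>
  <sites main="devchannel.org" alternate="newforge.org"/>
  <!-- "alternate" is #IMPLIED, so it may be left out -->
  <sites main="example.org"/>
</article>
```

A document that omitted the date element, or placed subject before date, would still be well-formed but would fail validation against this DTD.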

DTDs are older XML validating structures, and they pose some drawbacks when it comes to using them in conjunction with XML, especially when compared to the use of schemas:

Written in a non-XML syntax: DTD declarations use their own notation, similar to EBNF (Extended Backus-Naur Form), which is non-intuitive for some developers and a burden for those still coping with XML's syntax.
No support for granular values: Although a DTD can define the order and number of occurrences of elements in an XML structure, it cannot restrict content at a granular level, such as a number range or a string pattern. A DTD can say whether an element holds text or child elements, but not what kind of text.
No support for namespaces: Through a namespace you can unequivocally distinguish between two elements that have the same name. A DTD cannot. If you define an <address> element in a DTD, no other element can be defined with this name.
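The namespace limitation is easier to see with a small instance document. In the sketch below, both namespace URIs, the prefixes, and the element names are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Both namespace URIs and all element names are made up for illustration -->
<contact xmlns:post="http://example.org/postal"
         xmlns:net="http://example.org/network">
  <post:address>12 Main Street</post:address>
  <net:address>192.0.2.10</net:address>
</contact>
```

A namespace-aware processor treats the two address elements as different because they belong to different namespaces. A DTD only sees the literal names post:address and net:address, so its declarations would stop matching as soon as a document author chose different prefixes.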

The DTD shown earlier was just a run-through of the most basic DTD syntax, and by no means a comprehensive look at its full potential, but it illustrates the use of DTDs in validating XML.

Schemas: An XML creation

Unlike DTDs, schemas were created by the W3C (XML's governing body) specifically with XML in mind. They offer a newer and more XML-centric way to define an XML fragment.

Schemas address the majority of the shortcomings of DTDs, in particular the drawbacks described earlier:

Written in XML: Schemas are XML, so if you are proficient at writing XML you will easily grasp a schema.
Support for granular values: You can restrict a value in a number of ways, from a specific string pattern to a numeric range or a particular date.
Support for namespaces: You can define elements with the same name in a single validating contract, distinguished through the use of namespaces.
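As a minimal sketch of what granular restrictions look like, the schema below declares two hypothetical elements, rating and code, that are not part of this article's example. One is limited to a numeric range and the other to a string pattern.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical element: an integer limited to the range 1 to 5 -->
  <xsd:element name="rating">
    <xsd:simpleType>
      <xsd:restriction base="xsd:integer">
        <xsd:minInclusive value="1"/>
        <xsd:maxInclusive value="5"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

  <!-- Hypothetical element: a string shaped like "ABC-1234" -->
  <xsd:element name="code">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="[A-Z]{3}-[0-9]{4}"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

</xsd:schema>
```

A DTD could only declare both elements as #PCDATA, which would accept any text at all.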

The following listing shows a schema for the same XML fragment described in the previous section:


<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- article: one date, then one subject, then any number of sites -->
  <xsd:element name="article">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="date" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="subject" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="sites" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- Date with format YYYY-MM-DD (year-month-day) -->
  <xsd:element name="date" type="xsd:date"/>

  <xsd:element name="subject" type="xsd:string"/>

  <!-- sites: empty element carrying two string attributes -->
  <xsd:element name="sites">
    <xsd:complexType>
      <xsd:attribute name="main" type="xsd:string" use="required"/>
      <xsd:attribute name="alternate" type="xsd:string" use="optional"/>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>

Notice the verbosity of a schema's syntax when compared to that of a DTD. It may be self-explanatory to some XML practitioners, but for novices let's describe each of its parts:

<xsd:element name="article">: Inside the definition of the article element we find the <xsd:complexType> declaration, which in turn contains an <xsd:sequence>. As its name implies, the sequence fixes the order in which the nested elements must appear. Notice that each of these declarations carries the attributes minOccurs and maxOccurs, which indicate the minimum and maximum number of times (respectively) the element can be present. Both default to 1 when omitted.
<xsd:element name="date" type="xsd:date"/>: Declares that the content of the element named date must conform to the schema date data type, which uses the international (ISO 8601) format YYYY-MM-DD (year-month-day).
<xsd:element name="subject" type="xsd:string"/>: Declares that the subject element contains a string.
<xsd:element name="sites">: Nested within this declaration we once again find xsd:complexType, which indicates that the sites element can carry two attributes, one named main, which is required, and another called alternate, which is optional, both holding a string value. These characteristics come from the use and type attributes defined in each <xsd:attribute> element.
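An instance document usually tells a validating parser where to find its schema. Since the schema above declares no target namespace, one common way is the xsi:noNamespaceSchemaLocation attribute. The sketch below assumes the schema was saved as article.xsd next to the document, and that file name is an assumption made for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- "article.xsd" is an assumed file name for the schema shown above -->
<article xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="article.xsd">
  <date>2003-12-12</date>
  <subject>XML</subject>
  <sites main="devchannel.org" alternate="newforge.org"/>
</article>
```

The attribute is only a hint, and many parsers also let you supply the schema programmatically. Either way, a date written as 12/12/2003 would be rejected by the xsd:date type, something the DTD version could not detect.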

Much like DTDs, schemas cannot be fully illustrated in a single example, so I encourage you to explore their extensive syntax in Web resources like the W3C's XML Schema Recommendation.

What to use: DTDs or schemas?

In all likelihood your choice between DTDs and schemas will depend on what you are doing with XML. However, keep the following in mind:

DTDs won't go away any time soon: DTDs are entrenched in widely deployed applications. For example, deployment descriptors for JSP/Servlet and EJB components were defined by DTDs through J2EE 1.3, and although J2EE 1.4 moves its own descriptors to schemas, its containers must still accept the older DTD-based ones.
SGML derivatives are still in use: When using widely adopted technology like HTML you still have to deal with DTDs to a certain extent.
Newer XML technologies rely on schemas: Some newer-generation XML technologies are based strictly on schemas. This is especially the case for Web Services (SOAP) applications.
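XHTML, the XML reformulation of HTML, is a handy illustration of how visible DTDs remain in everyday markup. The minimal page below is only a sketch with placeholder text. It references the XHTML 1.0 Strict DTD published by the W3C through its document type declaration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Example page</title>
  </head>
  <body>
    <p>Validated against the XHTML 1.0 Strict DTD.</p>
  </body>
</html>
```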

Although concentrating your efforts on writing and understanding schemas is probably the wiser choice, you may still be required to at least read DTDs, so knowing your way around both XML validating structures is the best way to go.