Schema Languages

Grammars

When we design a language of our own, whether it's a programming language we're sketching out on a piece of paper or a markup vocabulary for modeling some data, we have a certain shape in mind. We can formalize this shape by explicitly defining its syntax, so that documents (instance documents) can be validated against the shape (the schema). In programming languages this is often done with BNF or EBNF.

Let’s check out some examples:
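As a rough illustration, here is a toy grammar written in EBNF-style notation inside a comment, followed by a small hand-written validator that checks whether a string has the shape the grammar describes. The grammar, the rule names (expr, term, number) and the function name is_valid are all made up for this sketch; they are not taken from any real language definition.

```python
# Hypothetical toy grammar in EBNF-style notation (an assumption for illustration):
#
#   expr   = term , { ( "+" | "-" ) , term } ;
#   term   = number | "(" , expr , ")" ;
#   number = digit , { digit } ;
#   digit  = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;


def is_valid(text: str) -> bool:
    """Return True if text matches the expr rule above (whitespace is not allowed)."""
    pos = 0  # index of the next unread character

    def expr() -> bool:
        nonlocal pos
        if not term():
            return False
        # { ( "+" | "-" ) , term } : zero or more repetitions
        while pos < len(text) and text[pos] in "+-":
            pos += 1
            if not term():
                return False
        return True

    def term() -> bool:
        nonlocal pos
        # "(" , expr , ")" : this alternative is what allows nesting
        if pos < len(text) and text[pos] == "(":
            pos += 1
            if not expr():
                return False
            if pos < len(text) and text[pos] == ")":
                pos += 1
                return True
            return False
        return number()

    def number() -> bool:
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] in "0123456789":
            pos += 1
        return pos > start  # at least one digit is required

    # The whole input must be consumed for the document to be valid.
    return expr() and pos == len(text)


print(is_valid("1+(2-3)"))  # True
print(is_valid("1+"))       # False
```

Each function corresponds to one rule of the grammar, which is why this style (recursive descent) is a common first step from a written grammar to a working validator.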

Regular Expressions

Regular expressions (which you may already be familiar with) look quite similar to grammar notations like BNF and EBNF. Let’s take a look at those now:
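A minimal sketch using Python's standard re module shows the resemblance: grouping with parentheses, repetition with * and +, and character classes play roughly the role that grouping, repetition and alternation play in EBNF. The pattern and the names FLAT_SUM and is_flat_sum are invented for this example.

```python
import re

# One or more digits, followed by zero or more "+digits" or "-digits" parts.
# In EBNF-style notation this would read roughly as:
#   sum = number , { ( "+" | "-" ) , number } ;
FLAT_SUM = re.compile(r"[0-9]+([+-][0-9]+)*")


def is_flat_sum(text: str) -> bool:
    """Return True if the whole string is a flat sum such as 1+2-3."""
    return FLAT_SUM.fullmatch(text) is not None


print(is_flat_sum("1+2-3"))    # True
print(is_flat_sum("1+(2-3)"))  # False: the pattern has no way to describe nesting
```

Note that the pattern describes a flat sequence only. A rule that refers back to itself, such as a parenthesized expression inside an expression, has no direct counterpart in this notation.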

While we consider regular expressions, it’s worth pointing out that they should (generally) not be used for parsing things like HTML or XML. As mentioned in this oft-cited rant, HTML and XML are not regular languages and therefore must be parsed by something with more power (something like a SAX or DOM parser). While there are some techniques for parsing XML with regular expressions, and while most regular expression implementations are actually more powerful than the name suggests (features such as backreferences let them match languages that are not regular), a solid rule of thumb is to avoid regular expressions when parsing XML or HTML and opt for more sophisticated tools.
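The following sketch shows one way this goes wrong in practice. It runs a naive regular expression and Python's standard xml.etree.ElementTree parser over the same small document with nested elements. The document and its list/item element names are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document: an <item> nested inside another <item>.
doc = "<list><item>a<item>b</item></item><item>c</item></list>"

# Naive approach: match from an opening <item> to the nearest </item>.
# The pattern cannot pair each opening tag with its own closing tag.
naive = re.findall(r"<item>(.*?)</item>", doc)
print(naive)  # ['a<item>b', 'c']

# A real XML parser tracks nesting, so every element is recovered intact.
root = ET.fromstring(doc)
texts = [item.text for item in root.iter("item")]
print(texts)  # ['a', 'b', 'c']
```

The regular expression silently returns a fragment of markup as if it were content, while the parser returns each element's own text. A greedy pattern would fail differently, but it would still fail, because the nesting depth is not bounded.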

Defining Schemas in XML

XML has a variety of languages for specifying the schema of a document. We’ll look closely at DTD (Document Type Definition), a simple language that is part of the XML standard itself. We’ll also glance at XML Schema to see what more powerful schema languages can provide.