XML I
Jagdish Gangolly
State University of New York at Albany
Introduction to XML
Structured Data vs. Documents
Document: Structure, Form, and Content
Markup languages
Basic HTML & its shortcomings
Why XML
Fundamentals of DTDs
XML I: Structured Data vs. Documents I
In structured data, the schema (metadata) is separate from the data
In documents, the metadata (pertaining to form, structure, as well as content) is contained in the document itself. Sometimes, therefore, such data is called self-describing.
XML I: Structured Data vs. Documents II
Two views of documents
Physical view: a byte-stream
Logical view: an abstract data structure (a tree of nodes)
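The two views can be illustrated with a minimal Python sketch using the standard-library xml.etree.ElementTree module. The tiny letter document and the helper name outline are assumptions made up for this illustration; they are not part of the slides.

```python
import xml.etree.ElementTree as ET

# Physical view: the document is just a stream of bytes.
# (Hypothetical sample document, invented for this sketch.)
raw = b"<letter><date>2001-05-01</date><salutation>Dear Sir</salutation></letter>"

# Logical view: the same document parsed into a tree of nodes.
root = ET.fromstring(raw)

def outline(node, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(root)
print(len(raw), "bytes in the physical view")
print("\n".join(tree_lines))
```

The byte count describes only the physical stream, while the indented outline reflects the parent/child relationships of the logical tree.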
XML I: Document: Structure, Form, and Content I
Structure: Data about the way the document is structured. For example, a letter might consist of
sender’s name and address
addressee’s name and address
date
salutation
paragraphs in the letter
closing
signature
XML I: Document: Structure, Form, and Content II
Form: Data about how the document should appear to the reader (Formatting). For example, in a letter, you will need to specify things like
Left-justification of sender’s name
Right-justification of sender’s address
Bolding of sender’s name, italicising of sender’s address
Italicising/bold-ing of certain text
…
XML I: Document: Structure, Form, and Content III
Content: The semantics of the document content. For example, if it is a business letter, you may want to tag the letter content to specify its semantics. For example:
If the letter refers to a purchase order, the meaning of such reference must be indicated by the tag
If it refers to an invoice, it must be similarly tagged
Such tagging makes it possible to integrate databases consisting of self-describing as well as structured data.
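As a rough sketch of why content tagging helps integration, the Python fragment below pulls purchase-order and invoice references out of a tagged business letter. The element names (purchaseOrder, invoice), the number attribute, and the sample values are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical business letter: structure tags (salutation, paragraph,
# closing) plus content tags (purchaseOrder, invoice) that say what the
# references mean.
letter = """
<letter>
  <salutation>Dear Ms. Rao</salutation>
  <paragraph>We have shipped the goods against
    <purchaseOrder number="PO-1001"/> and enclose
    <invoice number="INV-2002"/>.</paragraph>
  <closing>Sincerely</closing>
</letter>
"""

root = ET.fromstring(letter)

# Because the references are tagged, a program can find them without
# guessing from the surrounding prose.
po_numbers = [po.get("number") for po in root.iter("purchaseOrder")]
invoice_numbers = [inv.get("number") for inv in root.iter("invoice")]
print(po_numbers, invoice_numbers)
```

The extracted numbers could then serve as keys into a structured (for example relational) database of orders and invoices.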
XML I: Markup languages
TeX/LaTeX
SGML
HTML
XML
ebXML, XBRL, …
MathML, XGMML, …
XML I: Basic HTML & its shortcomings
Fixed tagset – no extensibility
Virtually no content tagging
Mostly formatting tags
Lack of discipline in document generation, very forgiving browsers
Almost exclusive preoccupation with how the document looks
Difficult/awkward/inefficient to interface with structured databases
XML I: Why XML?
Extensible. One can develop a custom tagset
Not necessary to have a DTD, but you can specify one
Possible to separate content from structure/form
Possible to develop custom tagsets based on an object model of the domain
Possible to interface efficiently with backend structured (usually relational) databases
Possible to use heterogeneous namespaces and schemas to build modular e-commerce systems
Evolving standards to support e-business
XML I: Fundamentals of DTDs I
Specified using an EBNF (Extended Backus-Naur Form) syntax. With the adoption of the XML Schema specifications, much of what DTDs do will in future be handled by schemas
Constraints:
Well-formedness
Tree structure (a single root element; every other element has exactly one parent)
Attribute values must be quoted
…
Validity (the document has a DTD to which it conforms)
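The well-formedness constraints can be tried out with a small Python sketch. It relies on the standard-library parser, which reports a ParseError for input that is not well-formed; the helper name is_well_formed and the sample strings are assumptions made for this illustration. Note that this parser is non-validating, so it checks well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per constraint.
samples = {
    "single root, quoted attribute": '<person id="p1"><name>Ann</name></person>',
    "unquoted attribute value": "<person id=p1><name>Ann</name></person>",
    "two root elements": "<person/><person/>",
    "overlapping elements": "<b><i>text</b></i>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```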
XML I: Fundamentals of DTDs I
<?xml encoding="UTF-8"?>
<!ELEMENT personnel (person)+>
<!ELEMENT person (name,email*,url*,link?)>
<!ATTLIST person id ID #REQUIRED>
<!ATTLIST person note CDATA #IMPLIED>
<!ATTLIST person contr (true|false) 'false'>
<!ATTLIST person salary CDATA #IMPLIED>
<!ELEMENT name ((family,given)|(given,family))>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT url EMPTY>
<!ATTLIST url href CDATA 'http://'>
<!ELEMENT link EMPTY>
<!ATTLIST link manager IDREF #IMPLIED>
<!ATTLIST link subordinates IDREFS #IMPLIED>
<!NOTATION gif PUBLIC '-//APP/Photoshop/4.0' 'photoshop.exe'>
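To make the DTD above more concrete, here is a hypothetical instance document written to follow it, together with a hand-rolled Python check of two of its constraints: ID values are unique and every IDREF/IDREFS value points at an existing ID. The people and addresses are invented. The standard-library parser used here does not validate against a DTD, so these checks are a partial sketch and not a substitute for a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance intended to conform to the personnel DTD.
doc = """
<personnel>
  <person id="boss">
    <name><family>Boss</family><given>Big</given></name>
    <email>chief@example.com</email>
    <link subordinates="worker1 worker2"/>
  </person>
  <person id="worker1">
    <name><given>One</given><family>Worker</family></name>
    <link manager="boss"/>
  </person>
  <person id="worker2" contr="true">
    <name><family>Worker</family><given>Two</given></name>
    <link manager="boss"/>
  </person>
</personnel>
"""

root = ET.fromstring(doc)

# ID: each person's id must be unique within the document.
ids = [person.get("id") for person in root.iter("person")]
ids_unique = len(ids) == len(set(ids))

# IDREF (manager) and IDREFS (subordinates, whitespace-separated)
# must refer to ids that actually occur in the document.
refs = []
for link in root.iter("link"):
    if link.get("manager"):
        refs.append(link.get("manager"))
    refs.extend((link.get("subordinates") or "").split())
dangling = [ref for ref in refs if ref not in set(ids)]

print("ids unique:", ids_unique)
print("dangling references:", dangling)
```

Both orders of family and given appear in the sample because the name declaration allows either.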
XML I: Fundamentals of DTDs II
DTD Syntax:
XML Declaration & Character Encoding
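<?xml version='1.0' encoding='utf-8' ?>
<? … ?> processing instructions
UTF-8: a variable-width encoding built from 8-bit units; ASCII characters take one byte each, so it is ideal for mostly ASCII data
<!-- … --> comments
Character entities & Pre-declared entities
CDATA: character data; a quoted attribute value that is not parsed for markup (entity references are still expanded)
PCDATA: parsed character data; element content that the parser scans for markup and entity references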
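A short Python sketch of these items: an XML declaration, a comment, the pre-declared entities, and a numeric character reference. The note element and its text are hypothetical. The document is given as bytes so that the encoding named in the declaration is the one the parser applies.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using the XML declaration, a comment,
# pre-declared entities (&amp; &lt; &gt; &quot;) and a character
# reference (&#169; is the copyright sign).
source = b"""<?xml version='1.0' encoding='utf-8'?>
<!-- comments are skipped by the default tree builder -->
<note title="R&amp;D &lt;draft&gt;">profit &gt; loss &amp; &quot;ok&quot; &#169;</note>"""

root = ET.fromstring(source)

title = root.get("title")  # attribute value: entity references expanded
body = root.text           # element content (PCDATA): also expanded

print(title)
print(body)
```

The comment does not appear anywhere in the resulting tree, and the entity references come back as the characters they stand for.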
XML I: Fundamentals of DTDs III
DTD Syntax (Continued):
Element declarations
<!ELEMENT name content-model>
Repetition-factor characters
    * ‘zero or more’, + ‘one or more’, ? ‘zero or one’
Content-model
EMPTY (neither text nor child elements)
    <!ELEMENT br EMPTY>
ANY (combination of text and child elements)
    <!ELEMENT container ANY>
Children-only content models
    <!ELEMENT exchange (greeting, response)>
Mixed content models (text interleaved with the listed child elements)
   <!ELEMENT p (#PCDATA | a | ul | b | i | em)*>
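Children-only content models behave much like regular expressions over the sequence of child element names. The sketch below hand-translates two declarations from the personnel DTD shown earlier, (name,email*,url*,link?) and ((family,given)|(given,family)), into Python regular expressions. The translation and the helper name children_match are illustrative assumptions, not how a real validating parser is implemented.

```python
import re
import xml.etree.ElementTree as ET

# Each child tag is written followed by one space, so the content model
# becomes a regular expression over that string. * + ? keep their meaning.
PERSON_MODEL = re.compile(r"(name )(email )*(url )*(link )?")
NAME_MODEL = re.compile(r"(family given )|(given family )")

def children_match(element, model):
    """Check the element's child sequence against a content-model regex."""
    sequence = "".join(child.tag + " " for child in element)
    return model.fullmatch(sequence) is not None

ok_person = ET.fromstring("<person><name/><email/><email/><link/></person>")
bad_order = ET.fromstring("<person><email/><name/></person>")
no_name = ET.fromstring("<person><email/></person>")
either_order = ET.fromstring("<name><given/><family/></name>")
half_name = ET.fromstring("<name><family/></name>")

print(children_match(ok_person, PERSON_MODEL))
print(children_match(bad_order, PERSON_MODEL))
print(children_match(either_order, NAME_MODEL))
```

The comma in a content model fixes the order of the children, which is why the sample with email before name is rejected.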
XML I: Fundamentals of DTDs IV
DTD Syntax (Continued):
Attribute declaration
    <!ATTLIST element-name attribute-definitions>
   where each attribute definition has
      attribute-name attribute-type default-declaration
        <!ELEMENT multiAttribute EMPTY>
        <!ATTLIST multiAttribute
            name       CDATA             #REQUIRED
            nickname   ID                #REQUIRED
            bfriend    IDREF             #IMPLIED
            penname    NMTOKEN           #IMPLIED
            authors    NMTOKENS          #REQUIRED
            answer     (YES | NO)        "NO"
            method     CDATA             #FIXED "TAXI"
            goto       (DISCO | MOVIES)  #REQUIRED
        >
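The effect of the default declarations can be sketched in Python. The table DECLS is a hand-written mirror of the multiAttribute declaration (it is not parsed from the DTD), the kind label "DEFAULT" and the function name check_and_fill are invented for this sketch, and the sample elements are hypothetical. A validating parser would do this work itself; the standard-library parser used here does not.

```python
import xml.etree.ElementTree as ET

# attribute name -> (kind of default declaration, default value if any)
DECLS = {
    "name": ("#REQUIRED", None),
    "nickname": ("#REQUIRED", None),
    "bfriend": ("#IMPLIED", None),
    "penname": ("#IMPLIED", None),
    "authors": ("#REQUIRED", None),
    "answer": ("DEFAULT", "NO"),      # plain default value
    "method": ("#FIXED", "TAXI"),
    "goto": ("#REQUIRED", None),
}

def check_and_fill(element):
    """Return (attributes with defaults applied, list of error messages)."""
    attrs = dict(element.attrib)
    errors = []
    for attr, (kind, value) in DECLS.items():
        if kind == "#REQUIRED" and attr not in attrs:
            errors.append(f"missing required attribute {attr}")
        elif kind == "DEFAULT":
            attrs.setdefault(attr, value)
        elif kind == "#FIXED":
            # Supplied if absent; if present it must equal the fixed value.
            if attrs.setdefault(attr, value) != value:
                errors.append(f"{attr} must be {value}")
    return attrs, errors

good = ET.fromstring(
    '<multiAttribute name="Al" nickname="al1" authors="ann bob" goto="DISCO"/>')
bad = ET.fromstring('<multiAttribute name="Al" method="BUS"/>')

good_attrs, good_errors = check_and_fill(good)
bad_attrs, bad_errors = check_and_fill(bad)
print(good_attrs)
print(bad_errors)
```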
XML I: Fundamentals of DTDs V
DTD Syntax (Continued):
CDATA: text in quotes
ID: text, but value must be unique in document
IDREF: text equal to the value of an ID in the document
NMTOKEN: restricted text containing only ‘name characters’; cannot contain whitespace
NMTOKENS: whitespace-separated list of NMTOKEN items
(YES | NO): Enumerated type
#REQUIRED: attribute is required
#IMPLIED: attribute is optional
#FIXED: the attribute must always have the specified default value
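A few of these attribute types can be approximated with simple Python checks. In the XML 1.0 specification the items of an NMTOKENS value are separated by whitespace. The regular expression below is a deliberate simplification that accepts only ASCII name characters (the real definition also admits many non-ASCII characters), and the function names are hypothetical.

```python
import re

# Simplified assumption: ASCII letters, digits, '.', '-', '_' and ':' only.
NMTOKEN = re.compile(r"[A-Za-z0-9._:\-]+")

def is_nmtoken(value):
    """NMTOKEN: one token made solely of name characters."""
    return NMTOKEN.fullmatch(value) is not None

def is_nmtokens(value):
    """NMTOKENS: one or more NMTOKEN items separated by whitespace."""
    tokens = value.split()
    return bool(tokens) and all(is_nmtoken(token) for token in tokens)

def is_enumerated(value, allowed=("YES", "NO")):
    """Enumerated type: the value must be one of the listed alternatives."""
    return value in allowed

print(is_nmtoken("pen-name_1"))   # only name characters
print(is_nmtoken("two words"))    # contains whitespace
print(is_nmtokens("ann bob"))     # two whitespace-separated tokens
print(is_enumerated("MAYBE"))     # not among YES | NO
```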