|
|
|
Jagdish Gangolly |
|
State University of New York at Albany |
|
|
|
|
Structured Data vs. Documents |
|
Document: Structure, Form, and Content |
|
Markup languages |
|
Basic HTML & its shortcomings |
|
Why XML |
|
Fundamentals of DTDs |
|
|
|
|
|
|
In structured data, the schema (metadata) is
separate from the data |
|
In documents, the metadata (pertaining to form,
structure, as well as content) is contained in the document itself.
Sometimes, therefore, such data is called self-describing. |
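
As a small illustration of self-describing data, the fragment below carries its field names along as tags, where a relational table would keep them in a separate schema. The element and attribute names (customer, name, city, id) and the sample values are invented for this sketch.

```xml
<!-- A relational row such as (1042, 'Acme Corp', 'Albany') can only be
     interpreted with the help of the table schema, which is stored elsewhere. -->
<!-- Here the metadata travels with the data: each value is labelled by its tag. -->
<customer id="c1042">
  <name>Acme Corp</name>
  <city>Albany</city>
</customer>
```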
|
|
|
|
|
Two views of documents |
|
|
|
Physical view: a byte-stream |
|
|
|
Logical view: an abstract data structure (a tree
of nodes) |
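
The two views can be seen side by side in a tiny document. The element names are made up for this sketch, and the tree drawn in the comment ignores whitespace-only text nodes.

```xml
<!-- Physical view: the characters of this file, stored as a byte-stream. -->
<!-- Logical view (a tree of nodes):
       memo
         to    : text "Staff"
         body  : text "Meeting at 10."
-->
<memo>
  <to>Staff</to>
  <body>Meeting at 10.</body>
</memo>
```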
|
|
|
|
|
Structure: Data about the way the document is
structured. For example, a letter might consist of |
|
sender’s name and address |
|
addressee’s name and address |
|
date |
|
salutation |
|
paragraphs in the letter |
|
closing |
|
signature |
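
The structural parts listed above might be marked up roughly as follows. This is only a sketch: the tag names are one possible choice, and the names, addresses, and date are invented sample data.

```xml
<letter>
  <sender>
    <name>Pat Roe</name>
    <address>12 Elm Street, Albany, NY</address>
  </sender>
  <addressee>
    <name>Acme Corp</name>
    <address>1 Main Street, Troy, NY</address>
  </addressee>
  <date>2001-03-15</date>
  <salutation>Dear Sir or Madam,</salutation>
  <paragraph>Thank you for your prompt delivery.</paragraph>
  <paragraph>We look forward to working with you again.</paragraph>
  <closing>Sincerely,</closing>
  <signature>Pat Roe</signature>
</letter>
```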
|
|
|
|
|
Form: Data about how the document should appear
to the reader (Formatting). For example, in a letter, you will need to
specify things like |
|
Left-justification of sender’s name |
|
Right-justification of sender’s address |
|
Bolding of sender’s name, italicising of
sender’s address |
|
Italicising/bolding of certain text |
|
… |
|
|
|
|
|
Content: The semantics of the document
content. In a business letter, say, you may want to tag parts of
the text to specify what they mean. For example: |
|
If the letter refers to a purchase order, the
meaning of such reference must be indicated by the tag |
|
If it refers to an invoice, it must be similarly
tagged |
|
|
|
Such tagging makes it possible to integrate
databases consisting of self-describing as well as structured data. |
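
A sketch of such content tagging inside a letter paragraph is shown below. The tag names (purchaseOrder, invoice), the number attribute, and the document numbers are hypothetical. Once a reference is tagged this way, a program could match the number against the corresponding record in a structured database.

```xml
<paragraph>
  With reference to your purchase order
  <purchaseOrder number="PO-1001">PO-1001</purchaseOrder>,
  we enclose invoice <invoice number="INV-2002">INV-2002</invoice>.
</paragraph>
```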
|
|
|
|
|
TeX/LaTeX |
|
SGML |
|
HTML |
|
XML |
|
ebXML, XBRL, … |
|
MathML, XGMML, … |
|
|
|
|
Fixed tagset – no extensibility |
|
Virtually no content tagging |
|
Mostly formatting tags |
|
Lack of discipline in document generation, very
forgiving browsers |
|
Almost exclusive preoccupation with how the
document looks |
|
Difficult/awkward/inefficient to interface with
structured databases |
|
|
|
|
|
|
Extensible: one can develop a custom tagset |
|
Not necessary to have a DTD, but you can specify
one |
|
Possible to separate content from structure/form |
|
Possible to develop custom tagsets based on an
object model of the domain |
|
Possible to interface efficiently with backend
structured (usually relational) databases |
|
Possible to use heterogeneous namespaces and
schemas to build modular e-commerce systems |
|
Evolving standards to support e-business |
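
To illustrate mixing tagsets through namespaces, the sketch below combines an order vocabulary with a payment vocabulary in one document. Both namespace URIs and all element and attribute names are hypothetical.

```xml
<!-- Default namespace: the (hypothetical) order vocabulary. -->
<!-- Prefix "pay": a separately maintained (hypothetical) payment vocabulary. -->
<order xmlns="http://example.org/ns/order"
       xmlns:pay="http://example.org/ns/payment">
  <item sku="A-17" quantity="2"/>
  <pay:terms>Net 30</pay:terms>
</order>
```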
|
|
|
|
|
|
Specified using an EBNF (Extended Backus-Naur
Form) syntax. With the adoption of the XML Schema specifications, much of
this is expected to be replaced by schemas in the future |
|
Constraints: |
|
Well-formedness |
|
Tree structure (a single root element; every other
element has exactly one parent) |
|
Attribute values must be quoted |
|
…. |
|
Validity (the document has a DTD and conforms to it) |
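
The well-formedness constraints can be illustrated with a small document. The element names are invented, and the violations appear only inside a comment so that the example itself still parses.

```xml
<?xml version="1.0"?>
<!-- Well-formed: one root element, properly nested tags, quoted attribute values. -->
<order id="o1">
  <item sku="A-17">Widget</item>
</order>
<!-- Not well-formed (shown here only as text):
       <item sku=A-17>Widget</item>     attribute value is not quoted
       <b><i>text</b></i>               elements overlap, so no tree can be built
       <order/><order/> at top level    more than one root element
-->
```

Validity is a further step: the document would also name a DTD in a DOCTYPE declaration and conform to its declarations.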
|
|
|
|
|
|
<?xml encoding="UTF-8"?> |
|
<!ELEMENT personnel (person)+> |
|
|
|
<!ELEMENT person (name,email*,url*,link?)> |
|
<!ATTLIST person id ID #REQUIRED> |
|
<!ATTLIST person note CDATA #IMPLIED> |
|
<!ATTLIST person contr (true|false)
'false'> |
|
<!ATTLIST person salary CDATA #IMPLIED> |
|
|
|
<!ELEMENT name
((family,given)|(given,family))> |
|
|
|
<!ELEMENT family (#PCDATA)> |
|
|
|
<!ELEMENT given (#PCDATA)> |
|
|
|
<!ELEMENT email (#PCDATA)> |
|
|
|
<!ELEMENT url EMPTY> |
|
<!ATTLIST url href CDATA 'http://'> |
|
|
|
<!ELEMENT link EMPTY> |
|
<!ATTLIST link manager IDREF #IMPLIED> |
|
<!ATTLIST link subordinates IDREFS
#IMPLIED> |
|
|
|
<!NOTATION gif PUBLIC '-//APP/Photoshop/4.0'
'photoshop.exe'> |
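
A document that should be valid against the DTD above might look like the following sketch. It assumes the DTD is saved under the hypothetical file name personnel.dtd. All names, addresses, and ID values are invented.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE personnel SYSTEM "personnel.dtd">
<personnel>
  <!-- id is #REQUIRED; contr would default to "false" if omitted -->
  <person id="boss" contr="false">
    <name><family>Doe</family><given>Jane</given></name>
    <email>jane.doe@example.org</email>
    <!-- IDREFS: each token must match the id of some person -->
    <link subordinates="worker1 worker2"/>
  </person>
  <person id="worker1">
    <!-- name allows either order: (family,given) or (given,family) -->
    <name><given>Sam</given><family>Lee</family></name>
    <url href="http://example.org/~sam"/>
    <link manager="boss"/>
  </person>
  <person id="worker2" note="part-time" salary="30000">
    <name><family>Roe</family><given>Pat</given></name>
    <link manager="boss"/>
  </person>
</personnel>
```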
|
|
|
|
|
|
|
DTD Syntax: |
|
|
|
XML Declaration & Character Encoding |
|
<?xml version='1.0' encoding='utf-8' ?> |
|
<? … ?> processing instructions |
|
UTF-8: variable-length encoding built on 8-bit units, ideal for mostly ASCII
data |
|
<!-- … --> comments |
|
Character entities & Pre-declared entities |
|
CDATA: character data in quoted attribute values; not parsed as markup |
|
PCDATA: Element content that is parsed |
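
The pieces of syntax listed above can be seen together in one short document. This is a sketch: the memo vocabulary and the stylesheet file name memo.css are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- The XML declaration, when present, must be the very first thing in the file. -->
<?xml-stylesheet type="text/css" href="memo.css"?>
<memo author="Smith &amp; Jones">
  <!-- The five pre-declared entities are lt, gt, amp, apos and quot. -->
  <!-- Character references give a code point in decimal or hexadecimal form. -->
  <body>If x &lt; y, the fee is &#163;5 (or &#xA3;5 in hexadecimal form).</body>
</memo>
```

The text inside body is parsed character data, so the parser expands the references, while markup characters such as a literal less-than sign must be escaped.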
|
|
|
|
|
|
DTD Syntax (Continued): |
|
Element declarations |
|
<!ELEMENT name content-model> |
|
Repetition-factor characters |
|
* ‘zero or more’, + ‘one or more’, ? ‘zero or one’ |
|
Content-model |
|
EMPTY (neither text nor child elements) |
|
<!ELEMENT br EMPTY> |
|
ANY (combination of text and child elements) |
|
<!ELEMENT container ANY> |
|
Children-only content models |
|
<!ELEMENT exchange (greeting, response)> |
|
Mixed content models (text interleaved with child elements) |
|
<!ELEMENT p (#PCDATA | a | ul | b | i | em)*> |
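
Instances matching the content models declared above might look like this sketch. It assumes that greeting, response, em, a, and b are themselves declared elsewhere as (#PCDATA); those declarations are not part of the slides.

```xml
<!-- ANY: container may hold text and any declared elements -->
<container>
  <!-- EMPTY: no text and no children -->
  <br/>
  <!-- Children-only: one greeting followed by one response, in that order -->
  <exchange>
    <greeting>Hello</greeting>
    <response>Hi there</response>
  </exchange>
  <!-- Mixed: text freely interleaved with the listed elements -->
  <p>Read the <em>draft</em> and the <a>appendix</a> by <b>Friday</b>.</p>
</container>
```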
|
|
|
|
DTD Syntax (Continued): |
|
Attribute declaration |
|
<!ATTLIST element-name attribute-definitions> |
|
where each attribute definition has |
|
attribute-name attribute-type default-declaration |
|
|
|
<!ELEMENT multiAttribute EMPTY> |
|
<!ATTLIST multiAttribute |
|
name CDATA #REQUIRED |
|
nickname ID #REQUIRED |
|
bfriend IDREF #IMPLIED |
|
penname NMTOKEN #IMPLIED |
|
authors NMTOKENS #REQUIRED |
|
answer (YES | NO) "NO" |
|
method CDATA #FIXED "TAXI" |
|
goto (DISCO | MOVIES) #REQUIRED |
|
> |
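
An element instance consistent with this attribute list could be written as in the sketch below. It assumes that multiAttribute is declared EMPTY, and all attribute values are invented.

```xml
<!-- answer is omitted, so a validating parser supplies the default "NO". -->
<!-- bfriend must match some ID in the document; here the only ID is this
     element's own nickname. -->
<!-- method is #FIXED, so if it is given at all it must be "TAXI". -->
<multiAttribute
    name="Alice B. Smith"
    nickname="ali"
    bfriend="ali"
    penname="a.smith"
    authors="smith jones"
    method="TAXI"
    goto="DISCO"/>
```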
|
|
|
|
|
|
|
DTD Syntax (Continued): |
|
CDATA: any character data (arbitrary quoted text) |
|
ID: text, but value must be unique in document |
|
IDREF: text equal to value of an ID in the
document |
|
NMTOKEN: restricted text containing only ‘name
characters’; cannot contain whitespace |
|
NMTOKENS: whitespace-separated list of NMTOKEN items |
|
(YES | NO): Enumerated type |
|
#REQUIRED: attribute is required |
|
#IMPLIED: attribute is optional |
|
#FIXED: the attribute must always have the
specified default value |
|
|
|