\documenttype{article} \surtitle{University at Albany XML-SGML Archive} \title{Extensible Markup Language (\abbr{XML})\brk; Standard Generalized Markup Language (\abbr{SGML})} \author{William F. Hammond} \date{Last revision: \ June 2, 2006} \begin{document} \tableofcontents \section{Basic} \quophrase{Standard Generalized Markup Language} (\abbr{SGML}) is a language for defining markup languages. \abbr{SGML} is defined by the International Organization for Standardization in the document \abbr{ISO} 8879 [1986]. The \abbr{ISO} document is not freely available. A copy of it appears in the book: \begin{menu}\item Charles F. Goldfarb, \emph{The \abbr{SGML} Handbook},\brk;Clarendon Press, Oxford, 1990.\end{menu} \quophrase{Hypertext Markup Language} (\abbr{HTML}), the basic language of the World Wide Web, is a markup language under \abbr{SGML}. \quophrase{Extensible Markup Language} (\abbr{XML}) is a limited form of \abbr{SGML} that is currently under heavy promotion by the World Wide Web Consortium (\abbr{W3C}). It is sometimes perceived as \quophrase{extended \abbr{HTML}}. \abbr{XML} has been designed to be usable on browsing platforms, while full-fledged \abbr{SGML} is usually more suitable for authoring platforms. In fact, \abbr{XML} has for most purposes become the only form of \abbr{SGML} that is suitable for public sharing. Many \abbr{SGML} languages\footnote{ The phrase \quophrase{\abbr{SGML} language} used here, as well as the parallel phrase \quophrase{\abbr{XML} language}, is formally not correct usage. What is called here an \abbr{SGML} (respectively, \abbr{XML}) language is formally known as an \abbr{SGML} (respectively, \abbr{XML}) \emph{application}. Every \abbr{XML} application may also be regarded as an \abbr{SGML} application. There is an identifying correspondence between \emph{applications} in this sense and \emph{document types}. } that are realistically suitable for authors admit rapid automatic translation to nearly equivalent \abbr{XML} languages. 
(Note that an \abbr{XML} language need not contain the \abbr{HTML} tag set nor have any relation to \abbr{HTML}, and \abbr{HTML} is not an \abbr{XML} language although it may be automatically converted to a language under \abbr{XML}.) \section{Classical \abbr{HTML} is not an \abbr{XML} Language} Classical \abbr{HTML} refers to the markup language behind World Wide Web locations from the beginning of the Web at CERN, Geneva, until very recently. The versions of \abbr{W3C} \abbr{HTML} numbered from 2.0 through 4.01 are all languages under \abbr{SGML} that do not fall within \abbr{XML}. Three simple reasons why \abbr{HTML} is not an \abbr{XML} language are: \begin{enumerate} \item In \abbr{HTML} most paragraphs are marked up using an opentag \qquostr{<p>} at the beginning of the paragraph without needing a closetag \qquostr{</p>} at the end, while there must be a closetag for every opentag in \abbr{XML}. \item In \abbr{HTML} tag names are not case-sensitive, while in \abbr{XML} tag names are case-sensitive. (A new standard way of converting \abbr{HTML} into an \abbr{XML} language will specify that tag names all be lower case.) \item In \abbr{HTML} some attribute values need not be placed inside quotation marks, while in \abbr{XML} all attribute values must be quoted. \end{enumerate} Early in the year 2000 a new evolute of \abbr{HTML} referred to as \anch[href="http://www.w3.org/TR/2000/rec-xhtml1-20000126"]{\abbr{XHTML}} --- but bearing the formal document type name \quophrase{html} (lower case characters only) --- acquired the status of \abbr{W3C} Recommendation. \abbr{XHTML}, version 1.0, is an \abbr{XML} language that has the same tag set as \anch[href="http://www.w3.org/TR/1999/REC-html401-19991224" ]{\abbr{HTML} 4.01}. Apart from technical details \abbr{XHTML} 1.0 is almost the same language as \abbr{HTML} 4.01. Because of the technical differences, however, a computer does not need the full weight of an \abbr{SGML} processor to interpret \abbr{XHTML}. This advantage is offset by the fact that it is slightly more difficult for authors to create \abbr{XHTML} than to create classical \abbr{HTML}. \section{The Nature of \abbr{SGML}} While \abbr{SGML} may be described as a language for creating markup languages with a shared syntax, more realistically and more abstractly, an \abbr{SGML} language (formally, an \abbr{SGML} \emph{application}) is a template for processing. For this reason when an \abbr{SGML} document (formally, an \abbr{SGML} \emph{instance}) is written, the author is, in fact, setting its text as organized data. The abstract character of languages under the \abbr{SGML} umbrella makes it possible to use the family to describe computer programs. 
The Extensible Style Language (\abbr{XSL}) described below is an example of such an \abbr{SGML} application that is, in fact, an \abbr{XML} application. \section{Styling and Translating \abbr{XML} documents} In principle, an author may create a personal \abbr{XML} language. To do so the author must be prepared to provide, in addition, (1) companion \quophrase{style sheets} or (2) companion translators. If one uses a language under \abbr{XML} or \abbr{SGML}, one must understand what companion style sheets or translators will be used with that language. A style sheet is a document that is created to provide directions for a processing program, perhaps a printing formatter or a web browser, on the formatting or rendering of a document that is prepared in a markup language. While a translator may be any program, typically a translator is a package of small programs (sometimes called functions) for processing a document under an \abbr{XML} language to some other language, which might be \tex;, \abbr{HTML}, another \abbr{XML}, ... under a general framework for processing \abbr{XML} or \abbr{SGML}. There are free frameworks for writing such programs in various languages. Most of these frameworks require pre-processing parsers, and free parsers are also available. Near-term plans for the development of the World Wide Web anticipate major web browsing programs having the capability to provide finely-tuned rendering of \abbr{XML} documents that are accompanied by a style sheet. Style sheet support for \abbr{HTML} documents is currently available. Limited rendering of \abbr{XML} documents on the World Wide Web is based on \anch[href="http://www.w3.org/Style/CSS/"]{\quophrase{Cascading Style Sheets} (\abbr{CSS})}, which has been in use for customized rendering guidance with \abbr{HTML} browsing programs. 
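The description above of a translator as a package of small functions can be made concrete with a minimal sketch. It is written in Python using only the standard library module \quostr{xml.etree.ElementTree}; the element names \quostr{note}, \quostr{title}, \quostr{para}, and \quostr{em} belong to a hypothetical personal \abbr{XML} language invented for this illustration, and the function names are likewise assumptions rather than part of any particular framework. Each element type gets one small function, and a table maps element names to those functions.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_children(node):
    """Translate the text and child elements of node, in document order."""
    parts = [escape(node.text or "")]
    for child in node:
        parts.append(translate(child))
        # Text that follows a child element belongs to the parent's content.
        parts.append(escape(child.tail or ""))
    return "".join(parts)


# One small function per element of the hypothetical source language.
def do_note(node):
    return "<html><body>" + render_children(node) + "</body></html>"


def do_title(node):
    return "<h1>" + render_children(node) + "</h1>"


def do_para(node):
    return "<p>" + render_children(node) + "</p>"


def do_em(node):
    return "<em>" + render_children(node) + "</em>"


# Dispatch table: element name -> translating function.
HANDLERS = {"note": do_note, "title": do_title, "para": do_para, "em": do_em}


def translate(node):
    """Translate one parsed element (and its subtree) to an HTML string."""
    handler = HANDLERS.get(node.tag)
    if handler is None:
        raise ValueError("no handler for element: " + node.tag)
    return handler(node)


SOURCE = (
    "<note><title>Memo</title>"
    "<para>Tags are <em>case-sensitive</em> in XML.</para></note>"
)

if __name__ == "__main__":
    # The parser runs first; the translator then walks the parsed tree.
    print(translate(ET.fromstring(SOURCE)))
```

Here the call to \quostr{ET.fromstring} plays the role of the pre-processing parser, and retargeting the output to another language would amount to replacing the handler functions while leaving the dispatch logic unchanged.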
A future standard style language for \abbr{XML} documents in World Wide Web browsing programs is called \anch[href="http://www.w3.org/Style/XSL/"]{\quophrase{Extensible Style Language} (\abbr{XSL})}. \abbr{XSL} is a restricted form of \anch[href="http://www.jclark.com/dsssl/"]{\quophrase{Document Style Semantics and Specification Language} (\abbr{DSSSL})} that is written with \abbr{XML} syntax. The specification for \abbr{XSL} was still under draft at \abbr{W3C} on March 1, 2000, while a variant called \anch[href="http://www.w3.org/TR/xslt"]{\quophrase{\abbr{XSL} Transformation Language} (\abbr{XSLT})}, which may be used for \emph{translating} \abbr{XML} languages to other languages (whether \abbr{XML} or not), became a \abbr{W3C} recommendation in late 1999. While \abbr{XSL}-directed formatting offers more precision than is available with \abbr{CSS}-guided formatting, in the overall world of \abbr{XML} processing one should expect formatting based on either \abbr{CSS} or \abbr{XSL} style sheets to be a limited type of formatting. One should expect to obtain the finest typesetting results by going beyond the narrow class of \abbr{XML} translation programs that admit expression in a style sheet language. A relatively new simple example of \abbr{SGML} processing may be found in the system manual under \abbr{SunOS}, version 5.7. Observant users of University at Albany SunStations may have noticed that as of the summer of 1999 most of the system manual in the central \qquostr{/usr/man} area now exists in source form under an \abbr{SGML} document type for the manual rather than, as formerly, in the \emph{nroff} typesetting language. (This is temporarily hampering the operation of the classical \abbr{X11} program \emph{xman} for the affected portions of the system manual; text rendering is not affected.) See the manual page for \quophrase{solbook} and browse \qquostr{/usr/lib/sgml} and \qquostr{/usr/share/lib/sgml} for more information. 
A document created carefully today under a well designed \abbr{XML} or \abbr{SGML} language should admit automatic conversion to future formats once an \abbr{SGML} or \abbr{XML} translator for such conversion has been created. \section{Example Languages under \abbr{XML} and \abbr{SGML}} \begin{enumerate} \item \abbr{CALS} is a language under \abbr{SGML} that is widely used in the U.S. Department of Defense. \item \anch[ href="http://www.oasis-open.org/docbook/"]{\quophrase{DocBook}} is a public language under \abbr{SGML} that may be used by authors. A fall 1999 book, Norman Walsh, \anch[href="http://www.docbook.org/tdg/html/" ]{DocBook: The Definitive Guide} is available online and in bookstores. Walsh maintains a web site \urlanch{http://nwalsh.com/} with a great deal of information about related topics, including an excellent tutorial on \abbr{XSL}. (\bold{Campus UNIX Network only}: A copy of the \anch[href="file:///usr/share/local/xml/docbook/dtd/" ]{DocBook \abbr{DTD}} is available on the local network.) \item The \anch[ href="http://www.tei-c.org/"]{TEI Consortium} has emerged from the \anch[href="http://www.uic.edu/orgs/tei/"]{Text Encoding Initiative} at The University of Illinois at Chicago as custodian of the \abbr{TEI} language definition. \abbr{TEI} is another public language that may be used by authors. Its modular design has led to the creation of the \anch[href="http://www.hcu.ox.ac.uk/TEI/newpizza.html" ]{TEI Pizza Chef} web site at Oxford. A copy of the current \anch[href="http://www.tei-c.org/P4X/"]{TEI Guidelines} in HTML, which includes \anch[href="http://www.tei-c.org/P4X/SG.html"]{\emph{A Gentle Introduction to \abbr{XML}}} is available for \bold{local browsing} on the Sun network from the file system location \anch[ href="file:///usr/share/local/xml/tei/P4X/index.html" ]{\path{/usr/share/local/xml/tei/P4X/index.html}}. \item \abbr{HTML} is a language under \abbr{SGML}. 
\item \abbr{XHTML} (formerly \abbr{HTML}-Voyager) is a language under \abbr{XML}, recommended by the World Wide Web Consortium (\abbr{W3C}), that is designed to be equivalent to \abbr{HTML}. It is intended to be the base for extending \abbr{HTML} to a language under \abbr{XML}. See: \display{\urlanch{http://www.w3.org/TR/xhtml1/}\hsp;.} \item \abbr{MathML}, \emph{Mathematical Markup Language}, is a client platform language under \abbr{XML} that is intended to add mathematical functionality to the World Wide Web. See: \display{\urlanch{http://www.w3.org/Math/}\hsp;.} The \abbr{W3C} Recommendation for \abbr{MathML}, version 2, points to a document type definition at \abbr{W3C} for the implementation of a \anch[href="http://www.w3.org/TR/REC-xml-names/" ]{namespace}-based extension of \abbr{XHTML} that includes \abbr{MathML}. \item The \abbr{W3C} working draft on \emph{Scalable Vector Graphics} (\abbr{SVG}) format proposes an \abbr{XML} language for online graphics. This draft may be found along with other related information at \display{\urlanch{http://www.w3.org/Graphics/SVG/}\hsp;.} \item \anch[href="http://www.cs.rpi.edu/\tld;puninj/XGMML/"]{\abbr{XGMML}}, \emph{eXtensible Graph Markup and Modeling Language}, developed recently in New York's Capital District at \anch[href="http://www.rpi.org/"]{RPI}, is an \abbr{XML} application based on \abbr{GML}, which is used for graph description. See also \display{\urlanch{http://xml.coverpages.org/xgmml.html}\hsp;.} \item Any programming \emph{assembly language} in which each line consists of an operation code followed by parameters is equivalent to an \abbr{XML} language. \item The \emph{device independent} typesetting file format (\abbr{DVI}) associated with the typesetting language \tex; (and with the program \quostr{groff}) is equivalent to an \abbr{XML} language. \end{enumerate} \section{References} The World Wide Web Consortium is the driving force behind \abbr{XML}. 
See:\display{\urlanch{http://www.w3.org/XML/}\eos} A 1998 book on \abbr{XML} is: \begin{menu} \item Charles F. Goldfarb and Paul Prescod,\brk; \emph{The \abbr{XML} Handbook}, Prentice Hall. A second edition has now appeared. \end{menu} A very comprehensive catalogue of information about \abbr{SGML} and \abbr{XML} may be found on the web at \display{\urlanch{http://xml.coverpages.org/}\eos} An interesting and useful web site with ties to Sun Microsystems, one of the principal sponsors of \abbr{XML}, is \display{\urlanch{http://metalab.unc.edu/xml/}\ \eos} An early survey \anch[ href="http://www.w3.org/TR/NOTE-sgml-xml-971215.html" ]{\emph{Comparison of \abbr{SGML} and \abbr{XML}}} is available from \abbr{W3C}. Monitoring the UseNet newsgroups \urlanch{news:comp.text.sgml} and \urlanch{news:comp.text.xml} is an excellent way to have a window on current discussion. One may also seek answers to questions in the newsgroups when the answers cannot be obtained locally through the HelpDesk at \urlanch{mailto:helpdesk@csc.albany.edu}. However, one should first make sure that the question is appropriate to the specific topic of the newsgroup. For example, most questions about creating web pages do not belong in these two newsgroups. Information about the topic of \quophrase{mathematics and \abbr{SGML}} may be found at the (local) \abbr{URL} \display{\urlanch{http://math.albany.edu:8800/hm/sgml/about.html}\eos} \section{Software Available Locally} The University at Albany \abbr{UNIX} Network has several basic, general purpose, freely available tools for working with \abbr{SGML} and \abbr{XML} including: \begin{enumerate} \item The ``open source'' evolute, called \qquostr{onsgmls}, of James Clark's \abbr{SGML} parser \qquostr{nsgmls}, which is an application under the \quostr{OpenSP} C++ library. The public location for \quostr{OpenSP} is the \anch[href="http://openjade.sourceforge.net"]{\softw{OpenJade} Project} at \emph{SourceForge}. 
The public location for information about \quostr{SP} is: \display{\urlanch{http://www.jclark.com/}\hsp;.} Note: \qquostr{onsgmls}, when properly called, may be used to check the structural correctness of an \abbr{HTML} document. At the University at Albany the command \qquostr{validhtml} is an interface to \qquostr{onsgmls} for this method of \abbr{HTML} validation. \item Script interfaces to various Java-based tools of James Clark for handling \abbr{XML} including: \begin{description} \item[\quostr{dtdinst}] a utility to generate an \abbr{XML} instance that models an \abbr{XML} document type definition given in \abbr{DTD} form. \item[\quostr{jcxt}] the engine called ``xt'' for transformations specified in the \abbr{XSLT} language. \item[\quostr{jing}] a utility to validate an \abbr{XML} instance against a document type definition specified in the form of either a \anch[href="http://www.relaxng.org/"]{Relax-NG schema} or a \anch[href="http://www.w3.org/XML/Schema"]{W3C schema}. \item[\quostr{trang}] a utility for translations between various types of \abbr{XML} document type definitions \end{description} \item David Megginson's general purpose \abbr{SGML}-to-anything processor, \qquostr{sgmlspl}, which is an application under his Perl-5 library \qquostr{SGMLSPM}. Local documentation on \qquostr{SGMLSPM/sgmlspl} may be found at: \display{ \urlanch{file:///usr/share/local/xml/html/sgmlspm/index.html}\hsp;} The public location for information about \qquostr{SGMLSPM/sgmlspl} for many years \emph{was} \display{\quostr{http://home.sprynet.com/sprynet/dmeggins/}\hsp;.} That appears to have been superseded by \display{\urlanch{http://www.megginson.com/Software/}\hsp;;} and \quostr{SGMLSPM/sgmlspl} is also available at \anch[ href="http://www.cpan.org/modules/by-authors/David\und;Megginson/" ]{CPAN}. 
\end{enumerate} \section{Miscellaneous} \subsection{XML and Electronic Data Interchange (EDI)} \abbr{XML} offers a standard framework for the general interchange of many kinds of data. The usefulness of \abbr{XML-EDI} lies in how readily the many new tools for handling \abbr{XML} can be adapted to this end. There is a substantial amount of material on this topic in the book by Goldfarb and Prescod cited above. See the web site: \display{ \urlanch{http://www.geocities.com/WallStreet/Floor/5815/}\hsp;.} The World Wide Web Consortium (\abbr{W3C}) has basic information about how one might proceed to model a database in \abbr{XML} at the site: \display{\urlanch{http://www.w3.org/XML/}\hsp;.} \subsection{Library Metadata} The Open Archives Initiative (\urlanch{http://www.openarchives.org/}) has developed a protocol for interoperable handling of library metadata across the network based on records prepared under special purpose \abbr{XML} document types that are defined using the new notion of \anch[href="http://www.w3.org/XML/Schema"]{XML schema}. \subsection{How This Document Was Prepared} This document was prepared in Generalized Extensible \latex;-like Markup (\abbr{GELLMU}), which is the author's user markup interface for \abbr{SGML} languages. Presently the system, still under development, may be used to create both \anch[href="general.ltx"]{standard \latex;} and \anch[href="general.html"]{\abbr{HTML}} versions from a single \anch[href="general.glm"]{\latex;-like source}, a text file. 
The program \softw{latex} may be used to prepare a high quality \anch[href="general.dvi"]{typeset version} in \abbr{DVI} format\footnote{ Donald Knuth's Device Independent Format (\abbr{DVI}) } suitable for printing on this system using the program \softw{dvips}. A variant of \softw{latex} known as \softw{pdflatex} may be used to prepare a different \anch[href="general.pdf" ]{typeset version} in \abbr{PDF} format, and an alternate form of processing to \abbr{HTML} will produce \anch[href="general.xhtml"]{\abbr{XHTML}} extended by \abbr{MathML}. For more information on this system see \display{\urlanch{http://www.albany.edu/\tld;hammond/gellmu}\hsp;.} \end{document}