Introductory Survey

A Bridge for Authors from LaTeX to XML

William F. Hammond

Email: gellmu

Last updated: May 14, 2020



Table of Contents

1  Introduction
2  First Demonstrations
3  Can Content-Level MathML be a Derived Format?
4  Brief Introductions
4.1  Basic GELLMU
4.2  Advanced GELLMU
4.3  Regular GELLMU: The Didactic Production System
4.4  Other Production Systems
5  Materials
6  Relevant Public Discussion and Comment
7  Pointers to a Few Related Things
8  About this Document

1.  Introduction

Generalized Extensible LaTeX-Like Markup (GELLMU) is my concept for using LaTeX-like markup to create documents in an easy plain text format that may be faithfully converted to high-powered documents marked up under SGML. TeX is the classical typesetting markup language (with robust handling of mathematics) that was created by Donald E. Knuth of Stanford University around 1980. The LaTeX document preparation system was created shortly thereafter by Leslie Lamport of Digital Equipment Corporation. LaTeX is a simplified markup interface to TeX designed to let “the user concentrate on the structure of the text” rather than on typesetting. SGML, an abbreviation for Standard Generalized Markup Language (ISO 8879:1986), is the name of a family of markup languages, unspecified in number, designed for efficient automatic text processing with shared tools of a certain type.

During the period 1993-1998 the most familiar example of a markup language in the SGML family was Hypertext Markup Language (HTML), the now familiar language of the World Wide Web. HTML is a rather low-powered member of the SGML family. The notion of “power” for a language under the umbrella of SGML has to do with the number of available translations to other document languages, both within and without SGML.

One of the ideas in my design for GELLMU is that with existing stable freely available SGML tools one may go to almost any presentation format. For the community of mathematicians and scientists, who have become accustomed to using TeX to create finely typeset documents for printing, this design provides a way automatically to create other carefully crafted forms from a single source document without over-burdening Donald Knuth's program TeX.

For typeset printed presentation, SGML-based processing to the language TeX should be optimal, while SGML-based processing to Lamport (v.2) LaTeX is used in didactic examples found below. (See also “jadetex” at The Comprehensive TeX Archive Network (CTAN); brief comment on “jade” may be found below.) Most of the magic is due to Charles Goldfarb, the inventor of SGML, James Clark, the author of “nsgmls”, and David Megginson, the author of “sgmlspl”. The GELLMU to SGML transliterator that I am still writing could have been done in many languages, but ELISP, the language of GNU Emacs, probably the best-documented of all languages, and probably also the most easy-to-debug general purpose language, seemed to be just right for this. Beyond that I am grateful to Richard Stallman for encouragement and answers. Of course, when things do not work, the problems should in no way be attributed even in part to the antecedent work.

2.  First Demonstrations

For a quick look, intended for those who know LaTeX, there is A Silly Little GELLMU Article of about three printed pages. Alongside the HTML form of this article are other versions:

And yes, of course, both HTML versions were generated from the XML version.

3.  Can Content-Level MathML be a Derived Format?

Mathematical Markup Language (MathML) is a language under development by the World Wide Web Consortium (W3C) for (1) the display of mathematics in ordinary web pages and (2) automated interchange of mathematical segments among web-compatible software applications.

Corresponding to (1) and (2) above the W3C has provided presentation and content-level versions of MathML.

While MathML, which is an XML language (formally “application”), is verbose to a point that makes its writing by human authors almost impossible, the W3C project has not undertaken to provide a language suitable for authors. Moreover, one cannot robustly translate well-structured standard LaTeX or TeX math segments into MathML without the discipline of rules that are difficult both to formulate and to enforce.

The concept of generalized LaTeX in the GELLMU Project provides such discipline.
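The verbosity noted above can be seen in a small sketch. The following Python fragment, which uses only the standard library, builds presentation-level and content-level MathML for the LaTeX segment x^2. The helper names are illustrative assumptions and are not part of GELLMU; the element names (msup, mi, mn, apply, power, ci, cn) are standard MathML, and the MathML namespace declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def presentation_square(var: str) -> ET.Element:
    """Presentation MathML for var^2: describes layout (a base with a superscript)."""
    math = ET.Element("math")
    msup = ET.SubElement(math, "msup")
    ET.SubElement(msup, "mi").text = var
    ET.SubElement(msup, "mn").text = "2"
    return math


def content_square(var: str) -> ET.Element:
    """Content MathML for var^2: describes meaning (apply 'power' to var and 2)."""
    math = ET.Element("math")
    apply_ = ET.SubElement(math, "apply")
    ET.SubElement(apply_, "power")
    ET.SubElement(apply_, "ci").text = var
    ET.SubElement(apply_, "cn").text = "2"
    return math


latex_source = "x^2"
pres = ET.tostring(presentation_square("x"), encoding="unicode")
cont = ET.tostring(content_square("x"), encoding="unicode")

print(latex_source)
print(pres)
print(cont)
```

Three characters of LaTeX become several dozen characters of MathML in either form, which is why an author-level language that can be translated automatically is attractive.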

The version of “Regular” GELLMU (see §4.3) in the tarball has, since August 2004, provided translation of generalized LaTeX source markup under the article document type to HTML with presentation-level MathML as well as translation to ordinary LaTeX.

The key question in designing a system sufficient for generating mathematics under an umbrella like content-level MathML, whether by using highly specialized LaTeX or by using an SGML or XML language for authors, is how far authors will be willing to diverge from past habits.

The Math Benchmark Document offers an example of various mathematical segments that one might want to have automatically translated to a language with relative semantics such as content-level MathML.

There is something of an explanation (now in early draft stage), familiar to many research mathematicians but perhaps not to so many computer scientists, of why most legacy TeX/LaTeX markup of mathematics is not ambiguous for robots when augmented by adequate “type” information. Legacy practice has been to include “type” information in paper documents as part of an article's descriptive text. In a few words, mathematicians are usually careful and fussy about notation. GELLMU will eventually provide for “declared symbols” and optional associated alpha-numeric “type” information. Ultimately there should emerge a public formal object, the “mathematical expression” (mathexpr) that is something like the “regular expression” (regexp) that is familiar to users of “ed”, ELISP, “Perl”, etc. One will want a separate, probably simpler syntax for the specification of the type of a mathexpr.

My philosophy, and I think the only realistic philosophy, is that such types for mathexprs should involve relative, rather than absolute, semantics.

One of the most basic types is categorical “morphism”, which is a generalization of a calculus student's notion of “function”; for much that is of interest to many, the notion of function will suffice, provided that each function symbol is understood to imply “domain” and “target” with “target” not always the same as “image” or “range”. Regardless, users may conceptualize “morphisms” as “functions”.
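The distinction between “target” and “image” can be made concrete with a short sketch. The class and function names below are hypothetical and are not a proposed GELLMU syntax; finite sets of integers stand in for arbitrary domains and targets.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Morphism:
    """A function symbol together with its declared domain and target."""
    name: str
    domain: FrozenSet[int]
    target: FrozenSet[int]
    rule: Callable[[int], int]

    def __call__(self, x: int) -> int:
        if x not in self.domain:
            raise ValueError(f"{x} is not in the domain of {self.name}")
        y = self.rule(x)
        if y not in self.target:
            raise ValueError(f"{self.name}({x}) = {y} falls outside the declared target")
        return y

    def image(self) -> FrozenSet[int]:
        """The set of values actually attained; always a subset of the target."""
        return frozenset(self(x) for x in self.domain)


def compose(g: Morphism, f: Morphism) -> Morphism:
    """Return g after f; defined only when the target of f is the domain of g."""
    if f.target != g.domain:
        raise TypeError(f"cannot compose {g.name} after {f.name}: target and domain differ")
    return Morphism(f"{g.name} after {f.name}", f.domain, g.target, lambda x: g(f(x)))


A = frozenset(range(-2, 3))   # {-2, -1, 0, 1, 2}
B = frozenset(range(0, 5))    # {0, 1, 2, 3, 4}
C = frozenset(range(0, 6))    # {0, 1, 2, 3, 4, 5}

square = Morphism("square", A, B, lambda x: x * x)
succ = Morphism("succ", B, C, lambda x: x + 1)

print(sorted(square.image()))                  # values attained by square on A
print(square.image() < square.target)          # True when the image is a proper subset
print(sorted(compose(succ, square).image()))
```

Because composition is checked against the declared target rather than the image, the “type” of a function symbol can be verified without evaluating the function, which is the sense in which the semantics are relative.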

4.  Brief Introductions

To summarize, there are two concepts in this project.

4.1.  Basic GELLMU

This may be useful for some authors familiar with LaTeX who wish to write directly for an SGML or XML document type. It provides rudimentary LaTeX-like commands with single argument syntax. SGML attribute strings may be entered using a single LaTeX-like option.
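As a rough illustration of single-argument command syntax with one option carrying the attribute string, here is a minimal Python sketch of such a translation. It is not the GELLMU Syntactic Translator (which is written in ELISP); the input form \a[...]{...}, the file name example.html, and the treatment of a command with no argument as an empty element are assumptions of the sketch.

```python
import re

_COMMAND = re.compile(r"\\([A-Za-z]+)")


def _translate(src: str, pos: int, top_level: bool):
    """Translate from pos until the matching '}' (or end of input at top level)."""
    out = []
    while pos < len(src):
        ch = src[pos]
        if ch == "}" and not top_level:
            return "".join(out), pos + 1          # consume the closing brace
        m = _COMMAND.match(src, pos)
        if m:
            name, pos = m.group(1), m.end()
            attrs = ""
            if pos < len(src) and src[pos] == "[":
                end = src.index("]", pos)         # option holds the attribute string verbatim
                attrs = " " + src[pos + 1:end]
                pos = end + 1
            if pos < len(src) and src[pos] == "{":
                body, pos = _translate(src, pos + 1, top_level=False)
                out.append(f"<{name}{attrs}>{body}</{name}>")
            else:
                out.append(f"<{name}{attrs}/>")   # assumption: no argument means empty element
        else:
            out.append(ch)
            pos += 1
    if not top_level:
        raise ValueError("unbalanced braces: missing '}'")
    return "".join(out), pos


def translate(src: str) -> str:
    """Translate \\name[attrs]{content} into <name attrs>content</name>, recursively."""
    return _translate(src, 0, top_level=True)[0]


print(translate(r'\a[href="example.html"]{Using \kbd{HTML}}'))
```

The recursion on the braced argument is what lets a command such as \kbd appear inside another command's argument.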

It also offers a LaTeX-like meta-command \newcommand, which provides for macros with arguments. See Using the GELLMU Syntactic Translator to Write HTML. For example, the previous anchor would be marked up in HTML as

<a href="/~hammond/gellmu/ghtml.html"
>Using ... <kbd>HTML</kbd></a> ,

and this is marked up somewhat more succinctly in GELLMU source as

]{Using ... \kbd{HTML}} .

With the newcommand definition for \href


the even more succinct markup

\href{/~hammond/gellmu/ghtml.html}{Using ... \kbd{HTML}}

may be used.
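How a macro with arguments is expanded can be sketched in a few lines of Python. The macro table below, including the body given for \href, is a hypothetical definition chosen for illustration and is not taken from the GELLMU distribution; the function names and the file name example.html are likewise assumptions.

```python
import re


def read_group(src: str, pos: int):
    """Read one balanced {...} group starting at pos; return (content, next_pos)."""
    if pos >= len(src) or src[pos] != "{":
        raise ValueError(f"expected '{{' at position {pos}")
    depth = 0
    for i in range(pos, len(src)):
        if src[i] == "{":
            depth += 1
        elif src[i] == "}":
            depth -= 1
            if depth == 0:
                return src[pos + 1:i], i + 1
    raise ValueError("unbalanced braces")


def expand(src: str, macros: dict) -> str:
    """Expand each \\name{arg1}...{argN} whose name is in macros; copy everything else."""
    out, pos = [], 0
    while pos < len(src):
        for name, (nargs, template) in macros.items():
            head = "\\" + name + "{"
            if src.startswith(head, pos):
                pos += len(head) - 1              # step to the opening brace
                args = []
                for _ in range(nargs):
                    arg, pos = read_group(src, pos)
                    args.append(expand(arg, macros))   # arguments may contain macros too
                # Substitute #1, #2, ... in one pass so argument text is never rescanned.
                out.append(re.sub(r"#([1-9])", lambda m: args[int(m.group(1)) - 1], template))
                break
        else:
            out.append(src[pos])
            pos += 1
    return "".join(out)


# Hypothetical macro table: name -> (number of arguments, replacement text).
MACROS = {"href": (2, r'\a[href="#1"]{#2}')}

print(expand(r"\href{example.html}{Using \kbd{HTML}}", MACROS))
```

Expansion is purely textual: the result is again LaTeX-like source, which is then handed to the ordinary translation step.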


4.2.  Advanced GELLMU

This goes beyond basic LaTeX-like command / argument syntax to provide LaTeX-like multiple argument / option syntax and also what might be called LaTeX-like grammar, including \begin{…} … \end{…} and, if desired, blank lines to initiate paragraphs.
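The blank-line convention is simple to state in code. The following Python sketch splits source text at blank lines and wraps each block in an element; the element name p and the function names are assumptions made for illustration, not the behavior of any particular GELLMU document type.

```python
import re


def split_paragraphs(source: str) -> list:
    """Split text at runs of blank lines and normalize whitespace inside each block."""
    blocks = re.split(r"\n\s*\n", source.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]


def wrap_paragraphs(source: str, tag: str = "p") -> str:
    """Wrap each paragraph in <tag>...</tag>, one paragraph per output line."""
    return "\n".join(f"<{tag}>{text}</{tag}>" for text in split_paragraphs(source))


sample = "First paragraph,\ncontinued.\n\n\nSecond paragraph.\n"
print(wrap_paragraphs(sample))
```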

When desired, advanced GELLMU has knowledge of a few command names, but the author must know the SGML or XML document type.

4.3.  Regular GELLMU: The Didactic Production System

The didactic production system is a beginning at emulating LaTeX with an XML document type. In fact, LaTeX can be modeled more precisely with SGML than with XML.

The didactic production system consists of

There is validation of each stage of output. Indeed, validation of the GELLMU Syntactic Translator's SGML output is very useful for catching author errors. To assist with this there is line number alignment between the source and GELLMU Syntactic Translator output. If necessary [1], one may intervene at any stage of the processing since the output of each stage is quite readable by humans.
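The value of validating every stage, and of keeping line numbers aligned, can be illustrated with a small Python sketch. It is not the actual driver: the stage functions are hypothetical, and an XML well-formedness check from the standard library stands in for real validation against a document type.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[str], str]]


def check_well_formed(text: str) -> None:
    """Stand-in validator: raises ET.ParseError, which carries a (line, column) position."""
    ET.fromstring(text)


def run_pipeline(source: str, stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run each stage in order, validating its output before the next stage starts."""
    results, current = [], source
    for name, stage in stages:
        current = stage(current)
        try:
            check_well_formed(current)
        except ET.ParseError as err:
            line, _column = err.position
            raise ValueError(f"stage {name!r}: invalid output at line {line}") from err
        results.append((name, current))    # keep every intermediate form for inspection
    return results


def mark_lines(text: str) -> str:
    """Hypothetical first stage: one output line per source line, so line numbers align."""
    marked = [f"<l>{line}</l>" for line in text.split("\n")]
    marked[0] = "<doc>" + marked[0]
    marked[-1] = marked[-1] + "</doc>"
    return "\n".join(marked)


def rename_lines(text: str) -> str:
    """Hypothetical second stage: rename the line element."""
    return text.replace("<l>", "<p>").replace("</l>", "</p>")


STAGES = [("mark", mark_lines), ("rename", rename_lines)]
good = run_pipeline("alpha\nbeta\ngamma", STAGES)
print(good[-1][1])
```

Because the first stage emits exactly one output line per source line, an error reported at line 2 of its output points directly at line 2 of the author's source, and the saved intermediate forms make it possible to step in between stages.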

The two document types are parallel; the XML version is intended to be the nearest XML approximation of the SGML version. The SGML version should be regarded as “in-house”, while the XML version is suitable for export. (Usable, though not identical, source may be recovered from the XML document type.)

The document types have been designed for translation to many output formats. I have the intention ultimately to write or find others to write translators from the XML document type to other formats.
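A translator from an XML document type to another format can be quite small when it is driven by per-element rules. The Python sketch below shows the general shape for a LaTeX target; the element names and the rule table are hypothetical and do not describe the actual GELLMU article document type or its translators.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: element name -> (text emitted before, text emitted after).
RULES = {
    "article": ("\\begin{document}\n", "\\end{document}\n"),
    "para": ("", "\n\n"),
    "emph": ("\\emph{", "}"),
    "code": ("\\texttt{", "}"),
}

# Characters that are special to LaTeX and must be escaped in running text.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}", "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
}


def escape(text: str) -> str:
    """Escape LaTeX special characters one character at a time."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)


def to_latex(elem: ET.Element) -> str:
    """Emit the rule for elem around its escaped text and translated children."""
    if elem.tag not in RULES:
        raise ValueError(f"no rule for element <{elem.tag}>")
    before, after = RULES[elem.tag]
    parts = [before, escape(elem.text or "")]
    for child in elem:
        parts.append(to_latex(child))
        parts.append(escape(child.tail or ""))   # text following the child element
    parts.append(after)
    return "".join(parts)


doc = ET.fromstring("<article><para>Profit &amp; <emph>loss</emph> at 5%.</para></article>")
print(to_latex(doc))
```

Targeting a different output format amounts to supplying a different rule table and escape function, while the tree walk stays the same.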

Finally the article document type may have value as a layout vehicle that is useful as an intermediate formatting stage for structure-rich document types such as DocBook and TEI, and I would encourage those who might be so inclined to think about writing translators from such document types to GELLMU article.

4.4.  Other Production Systems

An author may use advanced GELLMU as a front end to many other SGML or XML production systems with appropriate setting of variables for the GELLMU Syntactic Translator.

5.  Materials

All that one should need to get started is in the current tarball. One should look at the user guide, the manual, both listed as “Quick Anchors” above, and the examples. Note that the driver scripts found in the bin directory of the unpacked tarball may need editing for location names.

Note also that the tarball may be installed in a “Windows” system equipped with Cygwin, enhanced by a sufficient array of Cygwin-provided packages, using the Linux driver scripts.

In principle, it should also work on MacOS X, but I have no reports, and I have no idea what might be required to port it to earlier versions of MacOS.

Although the project was begun in June 1998, its alpha release was in July 2001. It will not be considered to have reached beta stage until I have more knowledge about use experience of others.

Some older odds and ends may be found on the GELLMU veterans page, and the very old page for early preview of materials is still available.

6.  Relevant Public Discussion and Comment

My annotations allude, though not entirely precisely, to the article The Cathedral and the Bazaar by Eric Raymond.

Electronic Math Journals
Use “subscribe EMJ” in the BODY of a message.
There is an archive at the host site.
This is a bazaar. Sometimes technical, sometimes economic or legal, sometimes other.
LaTeX Development
Use “subscribe LATEX-L” in the BODY of a message.
Archive location, if any, unknown.
Neither a bazaar, nor a cathedral. Very sophisticated and technical. User questions are not wanted.
MathML and the HTML Math WG
Make your message SUBJECT “subscribe”. Message BODY should be blank.
An archive will be found behind the W3C Math web site.
This is a small bazaar in the nave of a cathedral. The cathedral “chapter” has its own private list. Many chapter members, not all, who speak in the nave seem to feel constrained to representation of the chapter.
UseNet news on SGML (if you get “news”)
A bazaar with many, many important people. Sophisticated and technical; questions about SGML (but not HTML, nor http, nor “the web”, ...) are usually answered well.
UseNet news on XML (if you get “news”)
A recent spin-off from the SGML discussion. Eventually it should operate at much higher volume than the SGML discussion.

7.  Pointers to a Few Related Things

Slides from 2001
A presentation given at The University of Delaware during the 2001 annual meeting of TUG.
Blahtex converts LaTeX-like math markup to MathML for use with MediaWiki, which is wiki implementation software for Wikipedia.
The TBook System for XML Authoring by Torsten Bronger.
MathML, Version 2.0, Second Edition
A W3C recommendation (October 21, 2003). In the fall of 2009 MathML, Version 3 and the MathML for CSS Profile became candidate recommendations at W3C. See
OMDoc: Open Mathematical Documents
A content-based XML markup format by Michael Kohlhase of Universität des Saarlandes and Carnegie Mellon University for mathematics on the Internet that extends OpenMath to the document level. Released November 1, 2000.
Daniele Giacomini's Sgmltexi
Sgmltexi provided the first SGML model of Texinfo, the language of the GNU Documentation System. Since its first release in the year 2000 Texinfo itself has incorporated an XML model.
itex2MML is the TeX-math to MathML converter that at one time had been featured at Paul Gartside's MathZilla site. It is now used with Jacques Distler's very active mathematical physics blog Musings.
David Carlisle's xmltex
xmltex uses TeX, the program, to parse (without validation) an XML document and then set it in TeX, according to user rules written in code for TeX, that govern what is done for each of the tags in the corresponding XML document type definition. The same items are also available at CTAN in “macros/xmltex”.
Sebastian Rahtz's “PassiveTeX”
Uses TeX as a formatting back end for documents prepared under an XML language according to an XSL stylesheet. It is available through CTAN.
TeX4ht, htlatex, … : Work of Eitan Gurari at Ohio State University.

An important way to make HTML and XML versions of TeX and LaTeX documents. This is based on a C program TeX4ht, and on a related macro package for TeX. The macro package causes “TeX, the program”, to add specials to its DVI output. The program TeX4ht operates on a DVI that has been so prepared and makes HTML or XML. (The DVI format has the abstract structure of a classical assembly language. There are several “special” instructions that serve as wildcards. These “specials” are of use only to processors that know about them on a case-by-case basis. They should, in theory, be ignored by processors that do not recognize them.)

In recent editions of TUG's TeXLive a convenient default interface for using TeX4ht to make classical HTML from LaTeX is the command htlatex, while the interface for making HTML with MathML is the command mzlatex. Aside from the standard TeX4ht docs, those interested in this approach might want to consult

Sadly, Eitan Gurari, the author of TeX4ht, died in 2009.

Hyperlatex
An early (mid-90s) package (unfortunately not on CTAN) for the production of LaTeX and HTML from a single specialized LaTeX source document. Hyperlatex is somewhat similar to GELLMU in its use of an Emacs Lisp program for generating HTML, though it seems not to provide a method for conscious writing under other SGML or XML document types.
The LaTeX3 Project
Information is available in the document section of the current LaTeX2e base distribution under the filename “ltx3info.tex” (with DVI and PostScript versions nearby). On the web one may consult the PDF version. Plans for SGML are mentioned in this document. There is also a mailing list on the topic of LaTeX3 development.
TeXML is an XML vocabulary for describing TeX syntax that has evolved from Doug Lovell's TeXML, which became available in the late 1990s.

It's useful for converting XML documents to TeX, LaTeX, or ConTeXt, but it's not useful for translating TeX documents to XML document types. One writes an XSL style sheet to translate an XML document type into TeXML. Another program then translates TeXML to TeX.

Bruce Miller's LaTeXML
LaTeXML is a Perl program for converting LaTeX documents to the LaTeXML XML document type. A separate program is provided for translating the LaTeXML XML document type to XHTML+MathML. While LaTeXML tries to mimic the actions of LaTeX, the program, in typesetting LaTeX documents as DVI or PDF, it does not employ a TeX engine.

LaTeXML is the converter that was used in the project called arXMLiv for converting LaTeX documents at The arXiv to XML.

Smart Documents.
There are various forms of “smartness”. SGML will provide easily for all of them. See Richard Fateman's material on More Versatile Scientific Documents ....
Linux Documentation.
The “How To” documents for Linux systems are based on an SGML language with ancestry in the LaTeX-like language of the QWERTZ document formatting system from the University of Exeter (U.K.) in the early 1990's. The SGMLtools-Lite Project is a recent effort to bring Linux documents under the DocBook language.
Luc Maranget's Hevea
Hevea is a LaTeX to HTML translator, said to produce correct HTML 4.0.
Latex2html and Latex2html-with-MathML.

The familiar Perl package latex2html gained popularity in math departments during the early days of the web not only by translating the LaTeX commands that could be marked up into HTML but also by automatically putting out mathematics in graphic objects housed in "<img>" tags; the graphics were created with subprocesses that used TeX, dvips, and some netpbm utilities. Many features have been added.

A 1998 variant at The Geometry Center offers the option of replacing the graphic objects with MathML objects.

The philosophy of Kernighan and Pike.
If you have never looked at their classic 1984 book, here are a few quoted paragraphs. Don't let their use of a trademark get in your way.

8.  About this Document

This document, which is primarily a web page, is itself a regular GELLMU document (see §4.3). Versions of this document other than the HTML version include the original GELLMU source, its translation to XML (from which the HTML version is derived), the derived translation to XHTML+MathML, and the derived LaTeX source from which a device independent (DVI) file and a file in Adobe's portable document format (PDF) were compiled. The PDF copy, which was generated using the free program pdflatex, is tuned for printing on 8.5 x 11 inch paper by those who have yet to equip themselves (freely) for printing DVI.


  1. But only in very exceptional situations.