Project for the Documentation of the Languages of Mesoamerica (PDLMA)
El Proyecto para la Documentación de las Lenguas de Mesoamérica (PDLMA)

Terrence Kaufman, John Justeson and Roberto Zavala Maldonado, directors

This site presents the aims, history, and results of research by the Project for the Documentation of the Languages of Mesoamerica, internally known as the "Snake Jaguar Project". The pages that describe the aims and history of the Project, and instructions for access to and use of posted materials, are updated at moderately frequent intervals. The project databases will once again be available online in the near future. Papers by project members are posted irregularly.

This page was last revised on April 17, 2001.

Aims and history

In 1993 we began a project to document the lexicon, phonology, and morphosyntax of selected Mije-Sokean languages; by 1995 the project had been extended to all living Mije-Sokean languages. Besides the value of the work for its own sake, this documentation was undertaken in order to facilitate a reconstruction of proto-Mije-Sokean. This reconstruction, and the documentation of the individual Mije-Sokean languages, were to serve as a resource for revising and extending the decipherment of Epi-Olmec writing (Justeson and Kaufman 1993, 1996 [1994], 1997).

In 1995 the Project began research on 5 [JCH, CHI, CHO, LCH, ZEN] of a projected 11 Sapotekan languages. These were to be documented, the ancestral proto-Sapotekan language was to be reconstructed, and the reconstruction, along with the documentation of the individual Sapotekan languages, was to help in the decipherment of Sapoteko hieroglyphic writing, which had been under way since 1992. In 1996 research on 4 more Sapotekan languages [ATE, ZAN, COA, YAI] was started. Work is not yet effectively under way on CUI and YTZ. There are arguably more than 11 Sapotekan languages, and they fall into 6 branches. Since we could not reasonably document them all, we decided to document at least one language from each branch, plus any additional languages that promised to be straightforwardly useful for reconstructing proto-Sapotekan, proto-Sapoteko, and proto-Chatino. The resulting set contained 11 languages.

In 1997 we began work on Matlatzinka [MTL] and Mecayapan Gulf Nawa [MEC].

In 1998 we began work on Tlawika (Okwilteko) [TLW].

In 1999 we began work on Zongolica Nawa [ZNG], Huehuetla Tepewa [HUE] and Otlaltepec Popoloka [OTL].

In 2000 we will begin work on Zapotitlán Totonako [ZPT] and Yatzachi-Zoogocho Northern Sapoteko [ZOO].

The preparation of a dictionary for each language was undertaken by a different linguist; some of these linguists were advanced graduate students, others post-PhD professionals; one is a beginning graduate student.

A major feature of the Project is that a set of specialists in each language family is trained in the context of regular and long-term interaction, helping to generate a body of lore that is tested through discussion and comparison of the results of individual investigation.

Although not all of these languages are radically underdocumented, it is fair to say that there is not yet a theory or model of Mije-Sokean, Sapotekan, or Oto-Pamean grammar. We hope that such a model will emerge from the work we have begun.

Languages being investigated and their abbreviations/codes

Mije-Sokean languages

Mijean branch

OLU Oluta
TOT Totontepec Mije
GUI Guichicovi Mije
SAY Sayula

Sokean branch

TEX Texistepec Gulf Sokean
SOT Soteapan Gulf Sokean
AYA Ayapa Gulf Sokean
COP Copainalá Soke
MAR Santa María Chimalapa Soke
MIG San Miguel Chimalapa Soke

Sapotekan languages

Sapoteko subfamily

JCH Central: Juchitán
CHI Central: Chichicapan
CHO Northern: Choapan
ATP Northern: Atepec
ZOO Northern: Yatzachi-Zoogocho
CUI Southern: Cuixtla
COA Southern: Coatlán
BAL Southern: San Baltasar Loxicha
ZAN Papabuco: Zaniza
LCH Solteco: Lachixío

Chatino subfamily

ZEN Zenzontepec
YAI Yaitepec

Nawa languages and dialects

MEC Mecayapan Gulf Nawa
PAJ Pajapan Gulf Nawa
ZNG Zongolica Nawa
CHN Chontla Eastern Huasteca Nawa
CHC Chicontepec Eastern Huasteca Nawa
AJO Los Ajos Eastern Huasteca Nawa
COX Coxcatlán Western Huasteca Nawa
PAM Tampamolón Western Huasteca Nawa
TUZ Tuzantla Western Huasteca Nawa
CUA Cuatlamayán Western Huasteca Nawa

Oto-Pamean languages

MTL Matlatzinka
TLW Tlawika (Okwilteko)

Totonakan languages

ZPT Zapotitlán Totonako
XCT Xicotepec Totonako
HUE Huehuetla Tepewa

Sources of funding

The work of the Project has been supported by major grants from The National Geographic Society [NGS] and The National Science Foundation [NSF], with smaller amounts of narrowly-targeted and occasional funds from the University of Pittsburgh and SUNY-Albany.

NSF #SBR-9411247 (1994-1995)
NSF #SBR-9511713 (1995-1998)
NSF #SBR-9809985 (1998-2001)
NGS #4190-92 (1992-1993)
NGS #5319-94 (1994)
NGS #5978-97 (1997)
NGS #6317-98 (1998)
NGS #6503-99 (1999)


An online version of each Project dictionary will be accessible at this website. Hardcopy dictionaries for the Mije-Sokean languages will be published in the monograph series of SUNY Albany's Institute for Mesoamerican Studies, which is distributed by the University of Texas Press. We intend to arrange for hardcopy publication of dictionaries from the other language families at a later date.

Data collection

These dictionaries do not claim to be complete, but they are quite large, ranging in size from 5,000 to 10,500 lexical items. Some are the product of 5 solid months of elicitation [8 hours a day, 6 days a week]. Others are based on considerably more elicitation.

These dictionaries are innovative in several ways.

Every effort has been made to collect all the morphemes of the language.

Identifying all the grammatical morphemes of a language is not a major problem. It requires careful elicitation and a fairly extensive collection of texts in various genres.

Getting all the root morphemes requires testing for all possible roots; the possible root shapes must be tested in a fairly large number of grammatical contexts in order to determine both their existence and their proper classification. Our procedures test for all possible monosyllabic roots, and those disyllabic roots ending in a vowel: at least 95% of the total native root stock.
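The idea of testing every possible root shape can be pictured as generating a checklist of candidate forms from a phoneme inventory. The following is only a minimal sketch under stated assumptions: the inventory is a toy one invented for illustration (it is not taken from any PDLMA language), and the function name and the templates "CVC" and "CVCV" are hypothetical stand-ins for whatever root shapes a given language allows.

```python
from itertools import product

# Toy inventory, NOT taken from any PDLMA language; for illustration only.
CONSONANTS = ["p", "t", "k", "s", "m", "n"]
VOWELS = ["a", "e", "i", "o", "u"]


def candidate_roots(template, consonants=CONSONANTS, vowels=VOWELS):
    """Yield every string that fits a template such as "CVC" or "CVCV"."""
    classes = {"C": consonants, "V": vowels}
    # A KeyError here means the template uses a symbol other than C or V.
    slots = [classes[symbol] for symbol in template]
    for combination in product(*slots):
        yield "".join(combination)


# Monosyllabic shapes plus disyllabic shapes ending in a vowel.
checklist = list(candidate_roots("CVC")) + list(candidate_roots("CVCV"))
print(len(checklist))
```

Each generated form would still have to be tried with a speaker in several grammatical contexts; the sketch covers only the enumeration step.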

Systematic elicitation of ethnobiological and ethnomedical terminology and concepts is undertaken. Plant and animal names make up as much as 25% of a neotropical language's lexicon.

Sound symbolism is studied: this constitutes a domain of variable size cross-linguistically, but one that fills out the evidence for, and often tests, models of grammatical structure.

Other semantic domains are also investigated -- given names, surnames, nicknames, place names, kinship, astronomy.

All lexical material and all roots are carefully and repeatedly tested for their grammatical behavior. The results of this testing are used to determine the classes of roots and lexemes, and these classes are encoded for each item. Outside of the Mayan field, this kind of research is novel for Mesoamerican languages, or at least is rarely reported on.

Any existing material that has been collected by linguists and published or otherwise made available to us has been checked with our language consultants.

Older documentation, especially from the colonial period, is gone over for what it might yield.

One feature that these dictionaries do not have is systematic exemplification of all lexemes. The examples that appear in these dictionaries were extracted from texts, offered during elicitation, or elicited with the purpose of establishing grammatical or semantic parameters/features. Different dictionaries coming out of the PDLMA have greater or lesser numbers of examples for the lexical material.

In all cases, the dictionaries are based on a thorough knowledge of the inflexional and derivational patterns of the language.

For many of the languages documented lexically and in texts by the PDLMA, preparations of grammatical descriptions have been undertaken by the linguists primarily responsible for the dictionaries. The preparation of such grammars is not currently part of the scope of the PDLMA, but may become so in the future if circumstances are favorable.

As a component of the Project, the reconstruction of Mije-Sokean phonology, grammar, and lexicon has been undertaken by Kaufman, and has in fact been in progress since 1959, but not continuously. The reconstruction of Sapotekan has been under way since 1965 by Kaufman, also not continuously, and it is hoped and expected that other Project members will be involved in this effort using the documentation produced by the Project. For Nawa, a pan-dialectal dictionary is a logical outcome of the work of this Project, when combined with existing materials from other forms of Nawa.

These dictionaries may appear in more than one edition. All the data that they contain is believed to be accurate. Data that has not been fully checked out is omitted until fully verified; it will appear in later editions. On-line versions of the lexical databases are also being made available.

The research plan for production of the dictionaries was designed by Kaufman, and Kaufman is the final editor of all the dictionaries. Justeson has been responsible for overseeing the development of the databases, and configuring them for printing and on-line access.

Online editions of the dictionaries

As of the time of the start-up of this website,

[1] each online dictionary consists of several thousand lexical entries that are thoroughly edited and vetted. Some lexical material for the language may be on hand that has not been thoroughly checked out and so is not available on this site;

[2] each dictionary is provided with an introduction that explains how it is set up and how to use it.

Eventually, each dictionary will contain a structural sketch of the language it represents.

Each dictionary will also contain an introductory chapter providing a structural outline of the language family the language belongs to.

Several of the dictionaries represent languages for which research is ongoing. Although it is believed that as of the time of their first issue/posting, virtually all native lexical/root morphemes, and essentially all grammatical morphemes are included in the dictionary, it is certain that more lexical material will be uncovered as more texts are collected and analyzed, and as additional semantic fields are looked into.

In light of this, it is expected that at least some of the lexical databases will be updated from time to time. Those interested in keeping abreast of such updates should check this website periodically. It is unlikely that updates will be posted more often than once a year.

The grammatical classifications of lexemes, roots, and affixes are occasionally overdifferentiated and not all redundancies have been eliminated, nor have all specifiable generalizations been worked out. The work is ongoing.

More English glosses than we would like are not fully reliable. We will be working on remedying this for the next edition. The data was gathered through the medium of Spanish, and the Spanish glosses we use were provided by the speakers, and subsequently tweaked as needed. The dictionaries would be adequate if glossed only in Spanish; however, providing English glosses has two advantages -- it makes the material accessible to non-users of Spanish who know English, and it requires the analyst to focus on whether s/he really knows what the word means.

Users of these online dictionaries may feel that the structure and content of our postings could be improved in certain ways. We will be happy to receive such suggestions. We will acknowledge any suggestions that we follow and had not already thought of ourselves. We will not necessarily acknowledge suggestions that we choose not to follow; however, if any such suggestion is made by several different people, we may offer a brief statement as to why we did not follow it. In such cases we will identify the proponents of the positions we do not accept only if they ask to be named.


The phonological representations of the languages documented by PDLMA are practical ASCII-based orthographies with a minimum (ideally an absence) of non-linear diacritics.

Page layout

The alphabetical order (sort order) used in the dictionary is cited at the bottom of each page.
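A dictionary-specific sort order of this kind can be applied mechanically once the alphabet is known. The sketch below assumes a hypothetical alphabet that includes multi-character units; the alphabet, the sample strings, and the function names are invented for illustration and do not reproduce the sort order of any PDLMA dictionary.

```python
# Hypothetical sort order; each dictionary cites its own at the bottom of the page.
SORT_ORDER = ["a", "b", "ch", "e", "i", "k", "m", "n", "o", "p", "s", "t", "tz", "u", "7"]


def tokenize(word, alphabet=SORT_ORDER):
    """Split a word into alphabet units, trying longer units first ("tz" before "t")."""
    units = sorted(alphabet, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(word):
        for unit in units:
            if word.startswith(unit, i):
                tokens.append(unit)
                i += len(unit)
                break
        else:
            raise ValueError(f"no alphabet unit matches {word[i:]!r}")
    return tokens


def sort_key(word, alphabet=SORT_ORDER):
    """Map a word to the list of alphabet positions of its units."""
    rank = {unit: position for position, unit in enumerate(alphabet)}
    return [rank[token] for token in tokenize(word, alphabet)]


# Made-up strings, not real lexical items.
words = ["7aka", "tzama", "tuk", "aka"]
print(sorted(words, key=sort_key))
```

With this toy alphabet the symbol "7" sorts last, whereas a plain ASCII sort would place it first; that difference is the reason for using an explicit key.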

The dictionary entry (not all fields are represented):

Lemmas (entry keys) are cited in alphabetical order in bold type at the left margin. All known morphemes, desinences, and lexical items appear as entries. Roots that do not function as lexical items without some kind of derivational material being added to them are cited with a preceding <%>. Different entries with the same spellings are distinguished with a following [1], [2], etc. Bound morphemes are marked on either the left or right edge by a code for their status: compounded or incorporated (prepound, postpound, incorporee) <=>, derivational/lexical <.>, inflexional <->, shifter <>>, clitic <+>.
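The lemma conventions just described lend themselves to a small parser. The following is a sketch only, and it rests on several assumptions: that the angle brackets in the description are quoting devices rather than part of the markers, that a homograph index is written as a trailing bracketed number, and that a form carries at most one edge code on each side. The class name, the function name, and the sample forms are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Edge codes as listed in the text; the angle brackets there are read as quoting only.
EDGE_CODES = {
    "=": "compounded or incorporated",
    ".": "derivational/lexical",
    "-": "inflexional",
    ">": "shifter",
    "+": "clitic",
}


@dataclass
class Lemma:
    form: str                  # the lemma with all markers stripped
    bound_root: bool           # True when the lemma was cited with a preceding %
    homograph: Optional[int]   # 1, 2, ... when a trailing [n] was present
    left_code: Optional[str]   # status of a marker on the left edge, if any
    right_code: Optional[str]  # status of a marker on the right edge, if any


def parse_lemma(raw: str) -> Lemma:
    text = raw.strip()
    homograph = None
    match = re.search(r"\s*\[(\d+)\]$", text)
    if match:
        homograph = int(match.group(1))
        text = text[: match.start()]
    bound_root = text.startswith("%")
    if bound_root:
        text = text[1:]
    left_code = right_code = None
    if text and text[0] in EDGE_CODES:
        left_code, text = EDGE_CODES[text[0]], text[1:]
    if text and text[-1] in EDGE_CODES:
        right_code, text = EDGE_CODES[text[-1]], text[:-1]
    return Lemma(text, bound_root, homograph, left_code, right_code)


# Made-up forms, not real lexical items.
print(parse_lemma("%kap [2]"))
print(parse_lemma("-pa"))
```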

The representation of the morpheme or lexical item is underlying, or at least with all phonological processes operating at their margins under affixation or cliticization unpacked. Internal phonological processes may be unpacked or not, according to the decision of the individual linguist, in consultation with the overall editor (Kaufman).

Surface phonology. When the surface phonology is fairly different from the underlying representation, it may be provided between forward slashes /ABC/ when taxonomic phonemic, and between square brackets [ABC] when showing allophonic representation.

Variant forms. Unpredictable variant pronunciations are given, when known. As needed, they are also listed as lexical entries, but all detailed information is cross-referenced to the "main" (or canonical) pronunciation.

Grammatical class. Next is provided a code indicating for a lexeme its inflexional behavior and certain aspects of its derivational history. A separate key to grammatical classes is provided for each dictionary.

Principal parts. Any inflected or "shifted" forms needed to establish the grammatical behavior of the lexeme (and support the analysis given) are cited, with codes as to their grammatical content following them in parentheses.

Gloss(es). Senses (distinguishable meanings) of a lexeme are subdivided 1, 2, 3, etc. Each sense is glossed in Spanish, then in English, with a double forward slash between the Spanish and the English.
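Because the Spanish and English glosses are separated by a double forward slash, a gloss string can be split into its two parts mechanically. The sketch below assumes each sense arrives as a single string of the form "Spanish // English"; how sense numbers are written in the actual databases is not specified here, so the senses are simply numbered from a list. The names `Gloss` and `split_gloss` and the sample glosses are hypothetical.

```python
from typing import NamedTuple


class Gloss(NamedTuple):
    spanish: str
    english: str


def split_gloss(sense: str) -> Gloss:
    """Split one sense at the first '//' into its Spanish and English parts."""
    spanish, separator, english = sense.partition("//")
    if not separator:
        raise ValueError(f"no '//' separator in {sense!r}")
    return Gloss(spanish.strip(), english.strip())


# Illustrative senses, numbered 1, 2, ... in the order given.
senses = ["casa // house", "hogar // home"]
glosses = {number: split_gloss(text) for number, text in enumerate(senses, start=1)}
print(glosses[1].english)
```

The same split would apply to the translations of example sentences, which use the same separator.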

Synonyms. Known synonyms are cited, with reference to senses distinguished under Gloss(es), when necessary.

Semantic Field(s). The semantic field(s), especially of ethnobiological terms may be supplied, keyed to any multiple senses of the lexeme.

Example(s). Examples, usually example sentences, are numbered as needed. They appear in either underlying or surface phonological representation, or both. The Spanish and English glosses/translations of the examples are separated by a double forward slash.

Supplemental forms. Forms of a word that do not necessarily occur according to the class it belongs to, but if they do, do not create new lexical items -- such as participles, gerunds, passives, and antipassives -- are cited under the main lexical form, and not given separate entries unless they have special semantics or syntax.

Grammatical class(es) of supplemental form(s).

Gloss(es) of supplemental form(s).

Example(s) of supplemental form(s).

Historical source. When known we cite the historical source, whether an ancestral stage of the language in question, or the source of a borrowing from a known (or suspected/hypothesized) other language.

Root(s). For each lexeme, all its roots are named, classified grammatically, and glossed (unless the lexeme is a single root). A separate key to root classes/types is provided for each dictionary.

Morpheme-by-morpheme gloss. The morpheme-by-morpheme breakdown normally falls out of the representation of the lemma; the morpheme-by-morpheme gloss is also provided here.

Cross-references. References may be made to places in the dictionary where more information or related information is to be found.

Data source. The source for the data in the dictionary is given, whether collected by the compiler, or found in an earlier source. A separate key to data sources is provided for each dictionary.

Superordinate forms. The immediately antecedent lexical form(s) to the current lexical entry is (are) named. Any further information about them should be sought at their position in the lexicon.

Subordinate forms. Lexical items that are based on the current lexical entry are named. Any further information about them should be sought at their position in the lexicon.
