Inf 703 Information Organisation

About the course
    Catalog Description
    A More Honest Description
    Course Objectives
    About the Instructor
    Class Conduct
    Course Evaluation & Grading
    Tentative Schedule

About the course:

   Catalog Description:

Examination of the organization of information from the perspectives of data base systems conceptualization, structure and design; classificatory and data ordering principles that facilitate information retrieval; informetrics including knowledge production and representation patterns, cognitive, semantic and citation/consultation factors.

   A More Honest Description:

The organisation and retrieval of information is examined from the perspective of a number of fields including statistical natural language processing, logic, philosophy, and computing.

   Course Objectives:

This is one of the four proseminars in the Information Science Doctoral program and a crucial foundational seminar, especially for those proposing to major in the Organisation of Knowledge Records specialisation. The objectives of this seminar are:
  • To provide a broad perspective on information organisation and knowledge representation.
  • To develop in-depth knowledge of the statistical analysis of unstructured data (text) to facilitate retrieval.
  • To gain an understanding of, and develop competence in, modeling in the context of corpus-based research.
  • To develop competence in conducting corpus-based empirical research in information organisation and retrieval.
By the end of the semester, you should be able to:
  • Interpret seminal journal articles in the field.
  • Build simple statistical models for text.
  • Statistically analyse corpus data.
  • Work with well-known corpora, their tagging schemes, etc.


  • Course homepage:
  • Course Newsgroup:    sunya.class.inf703
  • Course E-Mail:
  • Meeting Time:    W 4:15 - 7:05
  • Meeting Room:    BA 363

   About the Instructor:

      Instructor: Jagdish S. Gangolly
      Instructor's Office: BA 365C
      Instructor's e-mail:
      Instructor's Homepage:


      The main texts for the course are:

  • Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze (The MIT Press, 1999).
  • Knowledge Representation: Logical, Philosophical, and Computational Foundations, John F. Sowa (Brooks/Cole Thomson Learning, 2000).
  • Modern Applied Statistics with S-Plus, Third Edition, W.N. Venables and B.D. Ripley (Springer-Verlag, 1999).
I shall primarily be using Manning and Schütze (MS) and Sowa (JS), but you may like to refer to the Venables and Ripley book for the projects and homework. I shall introduce S-Plus as we go along. If you are more comfortable with other statistical software such as SPSS or SAS, please feel free to use it, but you will be on your own (i.e., no tech support on my part).

In addition to the above, I might assign additional readings. Should I do so, they will either be placed in the Dewey Library on the downtown campus, or a link to an appropriate site on the internet will be provided. The course page is continuously updated during the semester, so it is important that you visit that page often.

   Class Conduct:

Due to attrition (retirements and sabbaticals), the Inf703 team has been reduced to just one (me) this time. The topical coverage and its orientation in the course, therefore, reflect my own interests. However, I will have the support of guest faculty from the School of Information Science & Policy as well as from the Computer Science and Geography departments, so you will get a broader perspective on the subject matter of the course. I will update (or amend) the schedule as the guest lecture dates become certain.

I shall stay fairly faithful to the texts given in the tentative schedule below. You are expected to do the readings before the class meetings. I also expect you to have attempted the assigned exercises (listed under "Do" in the schedule) before the class meetings. We will discuss some of them in class, and I may ask some of you to present your solutions.

I expect the students in the class to be familiar with the Unix operating system and to have minimum competence in programming (usually provided in Inf 523), or to acquire the minimum proficiency in those areas as we go along, in order to complete the project.

I shall give homework off and on during the semester; all of it is to be done individually. Its purpose is to help you gain a deeper understanding of the material covered in class. Often, such homework will involve analysis of some corpus. I shall provide the default corpus. Should your research interests dictate a different corpus, you should discuss it with me as soon as the homework is assigned.

You are most welcome to use the opportunity provided by the homework & projects in this course to narrow your research interests, survey the literature in the area that might form your dissertation area, or even do some preliminary empirical work to explore or advance your dissertation topic. Students aspiring to specialise in the Organisation of Knowledge Records area are particularly invited to avail themselves of this opportunity.

For computational work, you may use any of the labs on campus to which you have access; you are also welcome to use the Arthur Andersen Lab in BA 363 at the uptown campus. You will need a separate account on its NT network, which you can obtain by sending e-mail to or, stating that you are enrolled in this course.

   Course Evaluation & Grading:

The final course grade will depend on the following components:

100 Points (Homework)
200 Points (Group Project Report & Presentation)
50 Points (Individual Project)

I shall add up the points scored, and the total will form the basis for assigning letter grades in the course.


   Tentative Schedule:

January 19 & 26, February 2, 2000    Introduction: Language & Mathematical Preliminaries

  • Approaches to Language:    Language & cognition as probabilistic phenomena -- Ambiguity of language -- Zipf's laws -- Collocations & Concordances.
  • Probability:    Elementary Probability theory -- random variables -- joint & conditional distributions -- some standard distributions (Binomial, Poisson, and Normal) -- Bayesian decision theory.
  • Information Theory:    Entropy -- joint & conditional entropy -- mutual information -- the Noisy Channel Model -- Cross entropy -- Perplexity.
  • Reading Assignments:    MS    Ch.1 and 2.
  • Do:   MS    Ch.2.1, 2.3, 2.4, 2.6, 2.9, 2.10, 2.12, 2.13, 2.14, 2.15.
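As an illustration of the information-theory topics listed above, here is a minimal Python sketch of entropy and perplexity for a unigram distribution. It is only a sketch under stated assumptions: the course itself uses S-Plus, the toy token list is invented for illustration, and the function names `unigram_entropy` and `perplexity` are hypothetical rather than taken from any course material. Perplexity is computed here simply as 2 raised to the entropy of the estimated distribution.

```python
import math
from collections import Counter


def unigram_entropy(tokens):
    """Entropy (in bits) of the maximum-likelihood unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # H = -sum over word types of p(w) * log2 p(w), with p(w) = count / total
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def perplexity(entropy_bits):
    """Perplexity corresponding to an entropy given in bits."""
    return 2.0 ** entropy_bits


# Hypothetical toy corpus (an assumption, not course data).
toy_tokens = "a b a c a b a d".split()

h = unigram_entropy(toy_tokens)
print(f"entropy    = {h:.2f} bits")      # 1.75 bits for this toy corpus
print(f"perplexity = {perplexity(h):.2f}")
```

The toy distribution has probabilities 1/2, 1/4, 1/8 and 1/8, so the entropy works out to exactly 1.75 bits; on a real corpus the same two functions apply unchanged to the token list.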

February 9 & 16, 2000    Introduction: Linguistic Preliminaries & Corpus Based Work

  • Linguistic Preliminaries:   Parts of speech & Morphology -- Phrase Structure Grammar -- Dependency (Arguments and Adjuncts) -- Phrase structure ambiguity -- Semantics & Pragmatics.
  • Corpus Based Work:   Corpora -- Tokenisation -- Markup Schemes -- Grammatical Tagging.
  • Reading Assignments:   MS    Ch.3, 4.
  • Do:   MS    Ch.3.1--12, Ch.4.3.

February 23 & March 1, 2000    Collocations & n-gram Models

  • Topics:    Likelihood ratios -- Relative frequency ratios -- Mutual information -- Reliability & Discrimination -- Statistical estimators (maximum likelihood, Laplace's law, Lidstone's law, Jeffreys-Perks law) -- heldout estimation, cross-validation.
  • Reading Assignments:   MS   Ch.5, 6.
  • Do:   MS    Ch.5.1, 5.4, 5.5, 5.6, 5.7, 5.9, 5.12, 5.13, 5.17.
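The statistical estimators named in the topics above can be compared with a short Python sketch. It uses Lidstone's law, P = (C + λ) / (N + Bλ), where C is the count of an item, N the number of training tokens, and B the number of bins (here, vocabulary size); λ = 0 gives the maximum likelihood estimate, λ = 1 Laplace's law, and λ = 1/2 the Jeffreys-Perks law. The assumptions are: the toy sentence and the unseen word "dog" are invented for illustration, the function name `lidstone_probability` is hypothetical, and unigrams are used instead of longer n-grams to keep the example small.

```python
from collections import Counter


def lidstone_probability(count, total, bins, lam):
    """Lidstone estimate (count + lam) / (total + bins * lam)."""
    return (count + lam) / (total + bins * lam)


# Hypothetical toy training data (an assumption, not course data).
tokens = "the cat sat on the mat".split()
# The vocabulary includes one word, 'dog', never seen in training.
vocabulary = sorted(set(tokens) | {"dog"})

counts = Counter(tokens)
N = len(tokens)       # number of training tokens
B = len(vocabulary)   # number of bins (word types, including the unseen one)

estimators = [("MLE", 0.0), ("Jeffreys-Perks", 0.5), ("Laplace", 1.0)]
for name, lam in estimators:
    p_seen = lidstone_probability(counts["the"], N, B, lam)
    p_unseen = lidstone_probability(counts["dog"], N, B, lam)
    print(f"{name:15s} P(the) = {p_seen:.4f}   P(dog) = {p_unseen:.4f}")
```

The maximum likelihood estimate assigns zero probability to the unseen word, while the two smoothed estimators move some probability mass from seen to unseen words; the larger λ is, the more mass is moved.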

March 8, 2000    No Class (Spring Break)

March 15, 2000    Guest Lecture on Natural Language Processing (Prof. Andrew Haas)

March 22, 2000    Word Sense Disambiguation & Lexical Acquisition

  • Supervised & Unsupervised Disambiguation:   Bayesian Classification -- Information theoretic disambiguation -- Dictionary based disambiguation -- Disambiguation based on sense definitions -- thesaurus based disambiguation -- Unsupervised disambiguation.
  • Lexical Acquisition:   Evaluation measures (Precision & Recall) -- Verb sub-categorisation -- Attachment ambiguities -- PP attachment -- Selectional preferences -- Semantic similarity -- Vector space measures -- Probabilistic measures -- The role of lexical acquisition in Statistical NLP.
  • Reading Assignments:   MS    Ch.7, 8.

March 29 & April 5, 2000    Clustering, Information Retrieval & Text Categorisation

  • Clustering:   Hierarchical clustering (Agglomerative & Divisive clustering, single-link and complete-link) -- non-hierarchical clustering (K-means, EM algorithms).
  • Information Retrieval:   Evaluation measures -- the Probability Ranking Principle -- the Vector Space Model -- Term Weighting -- Term Distribution Models (Poisson and Two-Poisson models) -- Inverse Document Frequency -- Latent Semantic Indexing -- Singular Value Decomposition -- Discourse segmentation (TextTiling).
  • Text Categorisation:   Decision Trees -- Maximum Entropy Modeling -- Generalised Iterative Scaling -- Perceptrons -- k-nearest neighbour classification.
  • Reading Assignments:   MS    Ch.14, 15, 16.
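The Vector Space Model, term weighting and inverse document frequency listed above can be illustrated with a minimal Python sketch. It is one simple variant only: term frequency multiplied by log(N / document frequency), compared by cosine similarity. The three toy documents are invented for illustration, and the function names `idf_weights`, `tfidf_vector` and `cosine` are hypothetical, not drawn from the course texts.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Inverse document frequency log(N / df) for every term in the collection."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Sparse tf-idf vector for one tokenised document, as a dict."""
    tf = Counter(doc)
    return {term: tf[term] * idf.get(term, 0.0) for term in tf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection (an assumption, not course data).
documents = [
    "information retrieval ranks documents".split(),
    "text retrieval uses term weights".split(),
    "clustering groups similar documents".split(),
]

idf = idf_weights(documents)
vectors = [tfidf_vector(doc, idf) for doc in documents]

for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        print(f"cosine(doc {i}, doc {j}) = {cosine(vectors[i], vectors[j]):.3f}")
```

Documents that share no terms get a similarity of zero, and a term that occurs in every document would get an idf weight of zero, so it would not contribute to any similarity score.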

April 12, 2000    Guest Lecture on Geographic Information Systems (Prof. James Mower)

April 19, 2000    Knowledge Representation I: Logic

  • Logic Preliminaries:   Propositional & Predicate logic -- Boolean operators -- Formation rules -- Rules of inference -- Quantification and rules for quantifiers -- Varieties of logic -- Typed Predicate Logic -- Conceptual graphs -- Knowledge Interchange Format (KIF).
  • Logic & Knowledge Representation:   Conceptual graphs -- Names, Types and Measures.
  • Reading Assignments:   JS    Ch.1.
  • Do:   JS    Ch.1.1, 1.4, 1.7, 1.8.

April 26, 2000    Knowledge Representation II: Ontology

  • Ontological Categories:   Quine's criterion -- CYC categories -- Approaches to categorisation (Aristotle, Kant, Peirce, Husserl, Whitehead, Heidegger).
  • Categories Analysis & Synthesis, etc.:   Contrasts, Distinctions & categories -- Lattice of categories -- Describing physical entities -- Defining abstractions -- Sets, Collections, Types, and Categories.
  • Reading Assignments:   JS   Ch.2.
  • Do:   JS    Ch.2.1, 2.2, 2.3, 2.6, 2.8.

May 3, 2000    Knowledge Representation III: Representations

  • Knowledge Engineering:    Informal specifications -- Formalisation -- Knowledge representation principles -- Ontological commitments -- Representing structure in frames -- Mapping frames to logic -- Frames and Syllogisms -- Multiple inheritance -- Rules and data -- Object-oriented systems -- Natural language semantics -- Levels of representation.
  • Reading Assignments:   JS   Ch.3.
  • Do:    JS    Ch.3.1, 3.3, 3.6, 3.7.

May 10, 2000    Group Project Presentations

Updated March 6, 2000.