About the Course
Catalog Description | A More Honest Description | Course Objectives | Administrivia | About the Instructor | Texts | Class Conduct | Course Evaluation/Grading | Tentative Schedule
Examination of the organization of information from the perspectives of database
systems conceptualization, structure and design; classificatory and
data ordering principles that facilitate information retrieval;
informetrics including knowledge production and representation
patterns, cognitive, semantic and citation/consultation factors.
The
organisation and retrieval of information is examined from the
perspective of a number of fields including statistical natural
language processing, logic, philosophy, and computing.
This is one of the four proseminars in the Information Science Doctoral
program and a crucial foundational seminar, especially for those
proposing to specialise in the Organisation of Knowledge Records.
The objectives of this seminar are:
- To provide a broad perspective on
information organisation and knowledge representation.
- To develop in-depth knowledge of
the statistical analysis of unstructured data (text) to facilitate
retrieval.
- To gain an understanding of, and
develop competence in, modeling in the context of corpus-based
research.
- To develop competence in
conducting corpus-based empirical research in information
organisation and retrieval.
By the end of the semester,
you should be able to:
- Interpret seminal journal articles
in the field.
- Build simple statistical models
for text.
- Statistically analyse corpus data.
- Demonstrate familiarity with well-known
corpora, their tagging schemes, etc.
- Course homepage: http://www.albany.edu/acc/courses/inf703.spring98.html
- Course Newsgroup:
sunya.class.inf703
- Course E-Mail:
inf703@cnsunix.albany.edu
- Meeting Time:
W 4:15 - 7:05
- Meeting Room:
BA 363
- Instructor: Jagdish S. Gangolly
- Instructor's Office: BA 365C
- Instructor's E-Mail: gangolly@cnsunix.albany.edu
- Instructor's Homepage: http://www.albany.edu/acc/gangolly
The
main texts for the course are:
- Foundations of Statistical Natural
Language Processing, Christopher D.
Manning and Hinrich Schütze (The MIT
Press, 1999).
- Knowledge Representation: Logical,
Philosophical, and Computational Foundations, John
F. Sowa (Brooks/Cole Thomson Learning, 2000).
- Modern Applied Statistics with
S-Plus, Third Edition, W.N. Venables
and B.D. Ripley (Springer-Verlag, 1999).
I shall primarily use Manning and Schütze (MS) and Sowa (JS), but you
may like to refer to the Venables and Ripley book for the
projects/homework. I shall introduce S-Plus as we go along. If you are
more comfortable with other statistical software such as SPSS or SAS,
please feel free to use it, but you will be on your own (i.e., no tech
support on my part).
In addition to the above, I might assign additional readings. Should I do
so, they will either be placed in the Dewey Library on the downtown
campus, or I will provide a link to an appropriate site on the internet.
The course page is continuously updated during the semester, and therefore
it is important that you visit that page often.
Due to attrition (retirements and sabbaticals), the Inf703 team has been
reduced to just one (me) this time. The topical coverage and its
orientation in the course therefore reflect my own interests.
However, I will have the support of guest faculty from the School of
Information Science & Policy as well as from the Computer Science and
Geography departments, so you will get a broader perspective on the
subject matter of the course. I will update (or amend) the schedule as
the guest lecture dates become certain.
I shall stay fairly
faithful to the texts given in the tentative schedule below. You are
expected to do the readings before the class meetings. I also
expect you to have attempted the assigned exercises (listed under "Do"
in the schedule) before the class meetings. We will discuss some of
them in class, and I may ask some of you to present your solutions.
I expect the students in the class to be familiar with the Unix
operating system and to have minimum competence in programming (usually
provided in Inf 523), or to acquire that minimum proficiency as
we go along, in order to complete the project.
I shall give homework
from time to time during the semester; all of it is to be done individually.
Its purpose is to help you gain a deeper understanding of the material
covered in class. Often, such homework will involve analysis of
some corpus. I shall provide the default corpus. Should your
research interests dictate a different corpus, you should discuss it
with me as soon as such homework is assigned.
You are most
welcome to use the opportunity provided by the homework & projects
in this course to narrow your research interests, survey the
literature in the area that might form your dissertation area, or even
do some preliminary empirical work to explore or advance your
dissertation topic. Students aspiring to specialise in the Organisation
of Knowledge Records area are particularly invited to avail themselves of
this opportunity.
For computational work, you may use any of
the labs on campus to which you have access. You are also welcome
to use the Arthur Andersen Lab in BA363 on the uptown campus. You will
need a separate account on its NT network, which you can obtain by
sending e-mail to bc8273@cnsunix.albany.edu or
ub8279@cnsunix.albany.edu, stating that you are enrolled in this
course.
The final course grade will depend on the following components:
- Homework: 100 Points
- Group Project Report & Presentation: 200 Points
- Individual Project: 50 Points
I shall add up the points scored, and the total will form the
basis for assigning letter grades in the course.
January 19 & 26, February 2, 2000
Introduction: Language & Mathematical
Preliminaries
- Approaches to Language:
Language & cognition as probabilistic phenomena -- Ambiguity
of language -- Zipf's laws -- Collocations & Concordances.
- Probability:
Elementary Probability theory -- random variables -- joint &
conditional distributions -- some standard distributions (Binomial,
Poisson, and Normal) -- Bayesian decision theory.
- Information Theory:
Entropy -- joint & conditional entropy -- mutual information
-- the Noisy Channel Model -- Cross entropy -- Perplexity.
- Reading Assignments:
MS Ch.1 and 2.
- Do: MS Ch.2.1,
2.3, 2.4, 2.6, 2.9, 2.10, 2.12, 2.13, 2.14, 2.15.
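As a warm-up for the information theory topics above, here is a minimal Python sketch of entropy and perplexity computed from the empirical unigram distribution of a token list. The function names and the toy sentence are illustrative assumptions, not course material, and the sketch does not reproduce any particular exercise from MS.

```python
import math
from collections import Counter

def entropy(tokens):
    """Shannon entropy, in bits, of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # H = -sum_w p(w) * log2 p(w), with p(w) estimated by relative frequency.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def perplexity(tokens):
    """Perplexity of the same distribution: 2 raised to the entropy."""
    return 2 ** entropy(tokens)

# Toy corpus (a hypothetical example, not a course corpus).
toy_tokens = "the cat sat on the mat the end".split()
print(entropy(toy_tokens), perplexity(toy_tokens))
```

A distribution that is uniform over k word types has entropy log2(k) bits and perplexity k, which is a convenient sanity check when you try this on a real corpus.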
February 9 & 16, 2000 Introduction:
Linguistic Preliminaries & Corpus Based Work
- Linguistics Preliminaries: Parts
of speech & Morphology -- Phrase Structure Grammar -- Dependency
(Arguments and Adjuncts) -- Phrase structure ambiguity -- Semantics
& Pragmatics.
- Corpus Based Work: Corpora
-- Tokenisation -- Markup Schemes -- Grammatical Tagging.
- Reading Assignments: MS Ch.3,
4.
- Do: MS
Ch.3.1--12, Ch.4.3.
February 23 & March 1, 2000
Collocations & n-gram Models
- Topics: Likelihood ratios --
Relative frequency ratios -- Mutual information -- Reliability &
Discrimination -- Statistical estimators (maximum likelihood,
Laplace's law, Lidstone's law, Jeffreys-Perks law) -- heldout
estimation, cross-validation.
- Reading Assignments: MS Ch.5,
6.
- Do: MS Ch.5.1,
5.4, 5.5, 5.6, 5.7, 5.9, 5.12, 5.13, 5.17.
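The statistical estimators listed above can be viewed as one family. A minimal Python sketch, assuming a unigram model over a fixed vocabulary of B word types and N observed tokens: Lidstone's law estimates P(w) = (C(w) + lambda) / (N + B * lambda), where lambda = 0 gives the maximum likelihood estimate, lambda = 1 gives Laplace's law, and lambda = 0.5 gives the Jeffreys-Perks law. The function name and toy data are hypothetical.

```python
from collections import Counter

def lidstone_estimate(tokens, vocabulary, lam=1.0):
    """Return P(w) = (C(w) + lam) / (N + B * lam) for every w in vocabulary.

    lam = 0   -> maximum likelihood estimate
    lam = 1   -> Laplace's law (add-one)
    lam = 0.5 -> Jeffreys-Perks law
    """
    counts = Counter(tokens)
    n = len(tokens)        # N: number of observed tokens
    b = len(vocabulary)    # B: number of word types (bins)
    return {w: (counts[w] + lam) / (n + b * lam) for w in vocabulary}

# Toy data (illustrative assumption): "c" is in the vocabulary but unseen.
tokens = ["a", "a", "b"]
vocabulary = ["a", "b", "c"]
for lam in (0.0, 0.5, 1.0):
    print(lam, lidstone_estimate(tokens, vocabulary, lam))
```

Note how the unseen word "c" receives zero probability under maximum likelihood but a positive share under the smoothed estimators; that reallocation of probability mass is the point of the comparison in MS Ch.6.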
March 8, 2000 No
Class (Spring Break)
March 15, 2000
Guest Lecture on Natural Language Processing (Prof. Andrew Haas)
March 22, 2000
Word Sense Disambiguation & Lexical Acquisition
- Supervised & Unsupervised
Disambiguation: Bayesian Classification --
Information theoretic disambiguation -- Dictionary based
disambiguation -- Disambiguation based on sense definitions --
thesaurus based disambiguation -- Unsupervised disambiguation.
- Lexical Acquisition: Evaluation
measures (Precision & Recall) -- Verb sub-categorisation --
Attachment ambiguities -- PP attachment -- Selectional preferences
-- Semantic similarity -- Vector space measures -- Probabilistic
measures -- The role of lexical acquisition in Statistical NLP.
- Reading Assignments: MS
Ch.7, 8.
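The evaluation measures named above are simple to compute once system output and a gold standard are both available as sets. A minimal Python sketch, assuming items are identified by hashable labels; the function name, the balanced F measure helper, and the toy sets are illustrative assumptions.

```python
def precision_recall(selected, relevant):
    """Precision = |selected & relevant| / |selected|,
    recall = |selected & relevant| / |relevant|."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

def f_measure(precision, recall):
    """Balanced F measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example (hypothetical): items the system selected vs. items truly relevant.
p, r = precision_recall(selected={1, 2, 3, 4}, relevant={2, 4, 6})
print(p, r, f_measure(p, r))
```

Returning 0.0 when a set is empty is a convention chosen here to avoid division by zero; other conventions are possible.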
March 29 & April 5, 2000
Clustering, Information Retrieval & Text Categorisation
- Clustering: Hierarchical clustering
(Agglomerative & Divisive clustering, single-link and
complete-link) -- non-hierarchical clustering (K-means, EM
algorithms).
- Information Retrieval: Evaluation
measures -- the Probability Ranking Principle -- the Vector Space
Model -- Term Weighting -- Term Distribution Models (Poisson and
Two-Poisson models) -- Inverse Document Frequency -- Latent Semantic
Indexing -- Singular Value Decomposition -- Discourse segmentation
(Text Tiling).
- Text Categorisation: Decision Trees
-- Maximum Entropy Modeling -- Generalised Iterative Scaling --
Perceptrons -- k-nearest neighbour classification.
- Reading Assignments: MS
Ch.14, 15, 16.
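To make the vector space model, term weighting, and inverse document frequency topics above concrete, here is a minimal Python sketch. It assumes one common weighting scheme, raw term frequency times log(N / df); MS Ch.15 discusses several alternatives, and this is not presented as the scheme used in the course. Documents are assumed to be pre-tokenised lists of strings, and all names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenised document to a sparse {term: weight} vector,
    with weight = tf(term, doc) * log(N / df(term))."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents (illustrative assumption, not a course corpus).
docs = [["text", "retrieval"], ["text", "clustering"], ["ontology"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A term that occurs in every document gets weight log(1) = 0 under this scheme, so it contributes nothing to similarity; that is the intended effect of inverse document frequency.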
April 12, 2000
Guest Lecture on Geographic Information Systems (Prof. James
Mower)
April 19, 2000
Knowledge Representation I: Logic
- Logic Preliminaries: Propositional
& Predicate logic -- Boolean operators -- Formation rules --
Rules of inference -- Quantification and rules for quantifiers --
Varieties of logic -- Typed Predicate Logic -- Conceptual graphs --
Knowledge Interchange Format (KIF).
- Logic & Knowledge
Representation: Conceptual graphs -- Names,
Types and Measures.
- Reading Assignments: JS
Ch.1.
- Do: JS
Ch.1.1, 1.4, 1.7, 1.8.
April 26, 2000
Knowledge Representation II: Ontology
- Ontological Categories: Quine's
criterion -- CYC categories -- Approaches to categorisation
(Aristotle, Kant, Peirce, Husserl, Whitehead, Heidegger).
- Categories Analysis & Synthesis, etc.: Contrasts,
Distinctions & categories -- Lattice of categories -- Describing
physical entities -- Defining abstractions -- Sets, Collections,
Types, and Categories.
- Reading Assignments: JS Ch.2.
- Do: JS Ch.2.1,
2.2, 2.3, 2.6, 2.8.
May 3, 2000
Knowledge Representation III: Representations
- Knowledge Engineering: Informal
specifications -- Formalisation -- Knowledge representation
principles -- Ontological commitments -- Representing structure in
frames -- Mapping frames to logic -- Frames and Syllogisms --
Multiple inheritance -- Rules and data -- Object-oriented systems --
Natural language semantics -- Levels of representation.
- Reading Assignments: JS Ch.3.
- Do: JS Ch.3.1,
3.3, 3.6, 3.7.
May 10, 2000 Group
Project Presentations
Updated March 6, 2000.