I. Introduction to HSTA865, call# 7358
    A. Course Evaluation
        What do you want to see?
        How do you want to see it?
        The purpose of the examinations.
        Extra Lectures - LeeAnn McCue?
                                    David Andersen?

    B. Motivation
        1. A brief history of communications
                Samuel Morse - statue at Inventors' Gate in Central Park
                Problems of scale-up - works in the lab, noisy underground
                        Problem getting a grant from Congress
                        Experienced poverty but, eventually, wealth
                Problems with the transatlantic cable
                Western Union just opened an office in Albany
                How fast can transmission occur, and in how much noise? Particularly important for cell phones


                Shannon showed that:
                        a) channel capacity, calculated from the noise characteristics, sets an upper bound on the transmission rate.
                        b) the complexity of a signal (music or voice) limits the amount of compression. ENTROPY = irreducible complexity.
                        c) error-free transmission is possible if entropy < channel capacity.
        2. The biological message (see "Could Information Equal Entropy?")
                Inference that DNA existed was indirect (Mendel)
                Human Genome Project has led to LOTS of message that is not understood
                        Which parts code for protein, perform regulation, or establish morphology?
                        What is the 'meaning' of each coded portion? 'Codons' give rise to protein.
                HIV and noisy transcription leading to mutations -
                        error correction machinery uses redundancy in the code.
        3. Other areas of application: Computer Science (Kolmogorov Complexity),
                                                        Statistics, Investment.
        4. What is ENTROPY? (Analogy to thermodynamics)
                1st Law: conservation of energy during 'process' transformation (work)
                2nd Law: only processes increasing entropy can occur.
                         Systems equilibrate at maximum entropy.
                Change in heat = free energy change + T (change in entropy)



Could Information Equal Entropy?

If someone says that information = uncertainty = entropy, then they are confused, or something was not stated that should have been. Those equalities lead to a contradiction: the entropy of a system increases as the system becomes more disordered, so under this confusion information would correspond to disorder.

If you always take information to be a decrease in uncertainty at the receiver, you will get straightened out:

R = Hbefore - Hafter.

where H is the Shannon uncertainty:
H = - sum (from i = 1 to number of symbols) Pi log2 Pi (bits per symbol)
and Pi is the probability of the ith symbol.
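
As a minimal sketch, the uncertainty formula can be written out in Python. The function name shannon_uncertainty is my own choice for illustration, and I am assuming the usual convention that a symbol with probability 0 contributes nothing to the sum:

```python
import math

def shannon_uncertainty(probs):
    """Return H = -sum(Pi * log2(Pi)) in bits per symbol.

    probs is a list of symbol probabilities that sum to 1.
    Symbols with Pi = 0 are skipped (assumed convention: 0 * log2(0) = 0).
    """
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Four equally likely symbols: the receiver is maximally uncertain.
print(shannon_uncertainty([0.25, 0.25, 0.25, 0.25]))

# One symbol is certain: no uncertainty is left.
print(shannon_uncertainty([1.0, 0.0, 0.0, 0.0]))
```

The first call corresponds to an alphabet of four equally likely letters, and the second to a case where only one letter can ever appear.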

Imagine that we are in communication and that we have agreed on an alphabet. Before I send you a bunch of characters, you are uncertain (Hbefore) as to what I'm about to send. After you receive a character, your uncertainty goes down (to Hafter). Hafter is never zero because of noise in the communication system. Your decrease in uncertainty is the information (R) that you gain.

Since Hbefore and Hafter are state functions, this makes R a function of state. It allows you to lose information (it's called forgetting). You can put information into a computer and then remove it in a cycle.

Many of the statements in the early literature assumed a noiseless channel, so the uncertainty after receipt is zero (Hafter=0). This leads to the SPECIAL CASE where R = Hbefore. But Hbefore is NOT "the uncertainty", it is the uncertainty of the receiver BEFORE RECEIVING THE MESSAGE.
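
To make the difference between the noisy case and the noiseless special case concrete, here is a rough sketch. It assumes a simple model that is not part of the discussion above: equally likely 0s and 1s are sent through a channel that flips each bit with probability error_rate, so that the uncertainty remaining after receipt is the uncertainty of the flip itself. The function names are hypothetical:

```python
import math

def binary_uncertainty(p):
    """Uncertainty in bits of a two-way choice with probabilities p and 1 - p."""
    return sum(-q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

def information_received(error_rate):
    """R = Hbefore - Hafter per bit, under the assumed bit-flipping channel."""
    h_before = binary_uncertainty(0.5)        # receiver expects 0 or 1 equally
    h_after = binary_uncertainty(error_rate)  # uncertainty left by the noise
    return h_before - h_after

for error_rate in (0.0, 0.01, 0.1, 0.5):
    print(error_rate, information_received(error_rate))
```

With error_rate = 0 this reduces to the special case R = Hbefore, and as the noise grows R shrinks, reaching zero when the received bit says nothing about the bit that was sent.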

A way to see this is to work out the information in a bunch of DNA binding sites.

Definition of "binding": many proteins stick to certain special spots on DNA to control genes by turning them on or off. The only thing that distinguishes one spot from another spot is the pattern of letters (nucleotide bases) there. How much information is required to define this pattern?

Here is an aligned listing of the binding sites for the cI and cro proteins of the bacteriophage (i.e., virus) named lambda:

alist 5.66 aligned listing of:
* 96/10/08 19:47:44, 96/10/08 19:31:56, lambda cI/cro sites
piece names from:
* 96/10/08 19:47:44, 96/10/08 19:31:56, lambda cI/cro sites
The alignment is by delila instructions
The book is from:   -101 to 100
This alist list is from: -15 to 15

                       ------                   ++++++
                       111111--------- +++++++++111111
                       5432109876543210123456789012345
                       ...............................
OL1 J02459  35599 +  1 tgctcagtatcaccgccagtggtatttatgt
    J02459  35599 -  2 acataaataccactggcggtgatactgagca
OL2 J02459  35623 +  3 tttatgtcaacaccgccagagataatttatc
    J02459  35623 -  4 gataaattatctctggcggtgttgacataaa
OL3 J02459  35643 +  5 gataatttatcaccgcagatggttatctgta
    J02459  35643 -  6 tacagataaccatctgcggtgataaattatc
OR3 J02459  37959 +  7 ttaaatctatcaccgcaagggataaatatct
    J02459  37959 -  8 agatatttatcccttgcggtgatagatttaa
OR2 J02459  37982 +  9 aaatatctaacaccgtgcgtgttgactattt
    J02459  37982 - 10 aaatagtcaacacgcacggtgttagatattt
OR1 J02459  38006 + 11 actattttacctctggcggtgataatggttg
    J02459  38006 - 12 caaccattatcaccgccagaggtaaaatagt
                                             ^

Each horizontal line represents a DNA sequence, starting with the 5' end on the left, and proceeding to the 3' end on the right. The first sequence begins with: 5' tgctcag ... and ends with ... tttatgt 3'. Each of these twelve sequences is recognized by the lambda repressor protein (called cI) and also by the lambda cro protein.

What makes these sequences special so that these proteins like to stick to them? Clearly there must be a pattern of some kind.

Read the numbers on the top vertically. This is called a "numbar". Notice that position +7 always has a T (marked with the ^). That is, according to this rather limited data set, one or both of the proteins that bind here always require a T at that spot. Since the frequency of T is 1 and the frequencies of other bases there are 0, H(+7) = 0 bits. But as a measure of information, that makes no sense whatsoever! This is a position where the protein requires information to be there.

That is, what is really happening is that the protein has two states. In the BEFORE state, it is somewhere on the DNA, and is able to probe all 4 possible bases. Thus the uncertainty before binding is Hbefore = log2(4) = 2 bits. In the AFTER state, the protein has bound and the uncertainty is lower: Hafter(+7) = 0 bits. The information content, or sequence conservation, of the position is Rsequence(+7) = Hbefore - Hafter = 2 bits. That is a sensible answer. Notice that this gives Rsequence close to zero outside the sites.
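
The same calculation can be sketched for every position of the twelve aligned sequences listed above. This is only an illustration: the names SITES, FIRST_POSITION, uncertainty and rsequence are mine, Hbefore is taken to be 2 bits as in the text, and base frequencies are used directly as probabilities with no correction for the small number of sequences, so the values will be somewhat optimistic.

```python
import math
from collections import Counter

# The twelve aligned lambda cI/cro sites from the listing, positions -15 to +15.
SITES = [
    "tgctcagtatcaccgccagtggtatttatgt",
    "acataaataccactggcggtgatactgagca",
    "tttatgtcaacaccgccagagataatttatc",
    "gataaattatctctggcggtgttgacataaa",
    "gataatttatcaccgcagatggttatctgta",
    "tacagataaccatctgcggtgataaattatc",
    "ttaaatctatcaccgcaagggataaatatct",
    "agatatttatcccttgcggtgatagatttaa",
    "aaatatctaacaccgtgcgtgttgactattt",
    "aaatagtcaacacgcacggtgttagatattt",
    "actattttacctctggcggtgataatggttg",
    "caaccattatcaccgccagaggtaaaatagt",
]
FIRST_POSITION = -15  # coordinate of the leftmost column

def uncertainty(column):
    """Hafter for one column, using observed base frequencies as probabilities."""
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(column).values())

def rsequence(sites, h_before=2.0):
    """Rsequence(l) = Hbefore - Hafter(l) for each aligned position l."""
    return [h_before - uncertainty(column) for column in zip(*sites)]

r = rsequence(SITES)
for position, bits in enumerate(r, start=FIRST_POSITION):
    print(f"{position:+3d} {bits:.2f}")
```

Position +7, where every sequence has a T, comes out at the full 2 bits. Because the listing includes both strands of each site, position -7 is always an A and also comes out at 2 bits.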

If you have uncertainty, information, and entropy confused, I don't think you would be able to work through this problem. For one thing, you would get high information OUTSIDE the sites. Some people have published graphs like this.

A nice way to display binding site data so you can see them and grasp their meaning rapidly is by the sequence logo method. The sequence logo for the example above is at http://www-lecb.ncifcrf.gov/~toms/gallery/hawaii.fig1.gif.

More information about the theory of BEFORE and AFTER states is given in the papers http://www-lecb.ncifcrf.gov/~toms/paper/nano2 , http://www-lecb.ncifcrf.gov/~toms/paper/ccmm and http://www-lecb.ncifcrf.gov/~toms/paper/edmm.

Also there is the problem of finding genes:
A new Fourier transform approach for protein coding measure based on the format of the Z curve