Computational Genomics: A new frontier for statisticians.

Michael Zuker

The development of rapid DNA sequencing techniques two decades ago
allowed molecular biologists to begin creating databases of
nucleic acid and protein sequences. The great competition in recent
years to finish a rough draft of the human genome has resulted in a
tremendous flood of data into public databases that continues
unabated. Despite this accumulation of precise molecular data, few
reliable and robust methods are available to make full use of them.

Starting thirty years ago, computational pioneers from a variety of
backgrounds began to develop methods to compare and analyze these
data. We know how to compare sequences and are just beginning to learn
how to compare entire bacterial genomes. We can infer evolutionary
history with some degree of reliability. We are starting to infer
structure and function from molecular sequences. Much more remains to
be done. The reliability and significance of the above-mentioned
methods often depend on crude Monte Carlo estimates. We need to
develop new methods to assess error rates in the databases, new
algorithms to detect weak signals, and new concepts of clustering for
sequence and structural data. Perhaps the time has come for
professional statisticians to jump in and to lend their expertise to
these problems.