The development of rapid DNA sequencing techniques two decades ago
allowed molecular biologists to begin creating databases of nucleic
acid and protein sequences. The great competition in recent years to
finish a rough draft of the human genome has resulted in a tremendous
flood of data into public databases that continues unabated. Despite
this accumulation of precise molecular data, few reliable and robust
methods are available to make full use of them.
Starting thirty years ago, computational pioneers from a variety of
backgrounds began to develop methods to compare and analyze these
data. We know how to compare sequences and are just beginning to know
how to compare entire bacterial genomes. We can infer evolutionary
history with some degree of reliability. We are starting to infer
structure and function from molecular sequences. Much more remains to
be done. The reliability and significance of the above-mentioned
methods often depend on crude Monte Carlo estimates.
We need to develop new methods to assess error rates in the databases,
new algorithms to detect weak signals and new concepts of clustering
for sequence and structural data. Perhaps the time has come for
professional statisticians to jump in and to lend their expertise to
these problems.