NONLINEAR KERNEL PLS REGRESSION FOR
DNA SPLICE JUNCTION PREDICTION
Robert A. Bress
Decision Sciences and Engineering Systems Department
Rensselaer Polytechnic Institute
Troy, New York 12180
Keywords:
Nonlinear, Kernel, Partial Least Squares, Regression, DNA, Data Mining, Classification, Principal Components Analysis, Splice Junction
The discovery of useful genetic information within the human genome is a
challenging statistical and computational endeavor. Fewer than 1% of
the billions of nucleotides that make up human DNA actually code for
some relevant human feature or biological function. The 3 billion
base pairs of human DNA code for about 100,000 genes. Even within a
single gene, the regions that contain useful information (exons) and
so-called "junk DNA" (introns) are interspersed. Sifting
through the data is further complicated by uncertainty in the
known genetic flags that alert us to the presence of a gene. One
reliable signal of genetic information is the splice junction
motif [1]. These motifs mark the boundaries where coding sequences
of DNA are interrupted by stretches of non-coding sequence, and
vice versa. In over 99% of all cases they consist of the same
sequence of two nucleotides. The problem is to classify each
occurrence of such a motif as either a true splice site or not.
Correctly classified splice sites are powerful indicators
of the presence of coding genetic material. In this presentation,
kernel partial least squares regression is explored as a statistical
alternative to machine learning techniques for splice junction
classification [2, 3].
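As a rough illustration of how kernel partial least squares regression can be used as a classifier, the sketch below follows the NIPALS-style kernel PLS of [2] for a single response and thresholds the regression output at zero. It is a minimal sketch under stated assumptions, not the method used in this work: the GT-centered windows, the toy labelling rule, the one-hot encoding, the Gaussian kernel, and the values of gamma and n_components are all hypothetical choices made for the example.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(window):
    """Flatten a DNA window into a 4-per-base indicator vector."""
    v = np.zeros((len(window), len(ALPHABET)))
    for i, base in enumerate(window):
        v[i, ALPHABET.index(base)] = 1.0
    return v.ravel()

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian kernel between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kpls_fit(K, y, n_components):
    """Kernel PLS for one response; returns dual coefficients and the mean of y."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kc = H @ K @ H                           # kernel centered in feature space
    y_mean = y.mean()
    Y = (y - y_mean).reshape(-1, 1)
    Kd, Yd = Kc.copy(), Y.copy()             # deflated copies
    T, U = [], []
    for _ in range(n_components):
        norm_y = np.linalg.norm(Yd)
        if norm_y < 1e-10:                   # response already fully explained
            break
        u = Yd[:, 0] / norm_y                # with one response, u is the scaled residual
        t = Kd @ u
        t = t / np.linalg.norm(t)            # unit-norm score vector
        T.append(t)
        U.append(u)
        P = np.eye(n) - np.outer(t, t)
        Kd = P @ Kd @ P                      # deflate the kernel matrix
        Yd = P @ Yd                          # deflate the response
    T, U = np.column_stack(T), np.column_stack(U)
    # Dual regression coefficients: U (T' K U)^{-1} T' Y
    alpha = U @ np.linalg.solve(T.T @ Kc @ U, T.T @ Y)
    return alpha.ravel(), y_mean

def kpls_predict(K_new, K_train, alpha, y_mean):
    """Center the new-versus-training kernel like the training kernel, then predict."""
    n = K_train.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    ones_new = np.ones((K_new.shape[0], n)) / n
    Kc = (K_new - ones_new @ K_train) @ H
    return Kc @ alpha + y_mean

# Toy data (hypothetical): random 3-base flanks around a central "GT".
rng = np.random.default_rng(0)
flanks = rng.integers(0, 4, size=(60, 6))
windows = ["".join(ALPHABET[i] for i in f[:3]) + "GT" +
           "".join(ALPHABET[i] for i in f[3:]) for f in flanks]
# Toy labelling rule (an assumption, not real biology): +1 if the base before GT is G.
y = np.array([1.0 if w[2] == "G" else -1.0 for w in windows])

X = np.array([one_hot(w) for w in windows])
K = rbf_kernel(X, X)
alpha, y_mean = kpls_fit(K, y, n_components=3)
scores = kpls_predict(K, K, alpha, y_mean)
labels = np.where(scores >= 0.0, 1, -1)      # threshold the regression output
```

In practice the number of components and the kernel width would be chosen by cross-validation on held-out sequences, and new windows would be scored with kpls_predict using the kernel between the new and the training windows.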
[1] BALDI, P., BRUNAK, S., Bioinformatics: The Machine Learning Approach, second edition, MIT Press, 2001.
[2] ROSIPAL, R., TREJO, L., Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space, Journal of Machine Learning Research, 2001, vol. 2, pp. 97-123.
[3] WOLD, S., SJOSTROM, M., ERIKSSON, L., PLS-Regression: A Basic Tool of Chemometrics, Chemometrics and Intelligent Laboratory Systems, 2001, vol. 58, pp. 109-130.