NONLINEAR KERNEL PLS REGRESSION FOR DNA SPLICE JUNCTION PREDICTION


Robert A. Bress


Decision Sciences and Engineering Systems Department

Rensselaer Polytechnic Institute

Troy, New York 12180



Keywords:

Nonlinear, Kernel, Partial Least Squares, Regression, DNA, Data Mining, Classification, Principal Components Analysis, Splice Junction

ABSTRACT


The discovery of useful genetic information within the human genome is a challenging statistical and computational endeavor. Less than 1% of the billions of nucleotides that make up human DNA actually code for a human feature or biological function. The 3 billion base pairs of human DNA code for about 100,000 genes. Even within a single gene, the regions that contain useful information (exons) are interspersed with so-called "junk DNA" (introns). Sifting through the data is further complicated by uncertainty in the known genetic flags that alert us to the presence of a gene. Splice junction motifs are a reliable signal of the presence of genetic information [1]. These motifs mark the boundaries where a coding sequence of DNA is interrupted by a stretch of non-coding sequence, and vice versa. In over 99% of all cases they consist of the same sequence of two nucleotides. Because the same two nucleotides also occur at many positions that are not splice sites, the problem is to classify each occurrence of the motif as either a true splice site or not. Correctly classified splice sites are powerful indicators of the presence of coding genetic material. In this presentation, kernel partial least squares regression is explored as a statistical alternative to machine learning techniques for splice junction classification [2, 3].
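The classification setup described above can be illustrated with a minimal sketch of single-response kernel PLS in the style of [2]. Everything concrete here is an assumption made for illustration and is not taken from the presentation: the six-base windows around a fixed "GT" dinucleotide, the synthetic labeling rule (a purine at the position after "GT"), the one-hot encoding, the Gaussian kernel and its width, the number of components, and the helper names one_hot, rbf_kernel, kpls_fit and kpls_predict.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Flatten a DNA string into a 4-per-base indicator vector."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x.ravel()

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian kernel between the rows of A and the rows of B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-gamma * d2)

def kpls_fit(K, y, n_components=3):
    """Single-response kernel PLS; returns dual weights and the mean of y."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Kc = J @ K @ J                             # center the data in feature space
    y_mean = y.mean()
    Y = (y - y_mean).reshape(-1, 1)
    Kd, Yd = Kc.copy(), Y.copy()               # deflated working copies
    T, U = [], []
    for _ in range(n_components):
        u_norm = np.linalg.norm(Yd)
        if u_norm < 1e-12:                     # response already fully explained
            break
        u = Yd[:, 0] / u_norm                  # with one response, u is the scaled residual
        t = Kd @ u
        t_norm = np.linalg.norm(t)
        if t_norm < 1e-12:
            break
        t = t / t_norm                         # unit-norm score vector
        T.append(t)
        U.append(u)
        P = np.eye(n) - np.outer(t, t)
        Kd = P @ Kd @ P                        # deflate the kernel matrix
        Yd = P @ Yd                            # deflate the response
    T, U = np.column_stack(T), np.column_stack(U)
    alpha = U @ np.linalg.solve(T.T @ Kc @ U, T.T @ Y)
    return alpha[:, 0], y_mean

def kpls_predict(K_new, K_train, alpha, y_mean):
    """Center the new-versus-training kernel consistently, then predict."""
    n = K_train.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    ones_new = np.ones((K_new.shape[0], n)) / n
    K_new_c = (K_new - ones_new @ K_train) @ J
    return K_new_c @ alpha + y_mean

# Synthetic candidate sites (hypothetical rule, for illustration only).
rng = np.random.default_rng(0)
flanks = rng.choice(list(BASES), size=(240, 4))
windows = ["".join([a, b, "GT", c, d]) for a, b, c, d in flanks]
y = np.array([1.0 if w[4] in "AG" else -1.0 for w in windows])
X = np.array([one_hot(w) for w in windows])
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]

K_tr = rbf_kernel(X_tr, X_tr)
alpha, y_mean = kpls_fit(K_tr, y_tr)
train_pred = kpls_predict(K_tr, K_tr, alpha, y_mean)
test_pred = kpls_predict(rbf_kernel(X_te, X_tr), K_tr, alpha, y_mean)
train_acc = np.mean(np.sign(train_pred) == y_tr)
test_acc = np.mean(np.sign(test_pred) == y_te)
print(f"train accuracy {train_acc:.2f}, held-out accuracy {test_acc:.2f}")
```

The regression output is thresholded at zero to obtain a class label. On real splice junction data the window length, kernel and number of components would have to be chosen by validation rather than fixed as in this sketch.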


[1] BALDI, P., BRUNAK, S., Bioinformatics: The Machine Learning Approach, second edition, MIT Press, 2001.


[2] ROSIPAL, R., TREJO, L., Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space, Journal of Machine Learning Research, 2001, vol. 2, pp. 97-123.


[3] WOLD, S., SJOSTROM, M., ERIKSSON, L., PLS-Regression: A Basic Tool of Chemometrics, Chemometrics and Intelligent Laboratory Systems, 2001, vol. 58, pp. 109-130.