BALSA: Bayesian Algorithm for Local Sequence Alignment

 

Bobbie-Jo M . Webb

Rensselaer Polytechnic Institute

 

 

Abstract

 

Keywords: Sequence alignment, Dynamic programming, Bayesian statistics, Posterior probability

 

Local sequence alignment is a fundamental tool in current biomedical research and despite its long history, it still has a number of well-known shortcomings. Dynamic programming algorithms are the traditional method of choice, but they yield a single alignment, albeit optimal, which can be strongly affected by the scoring matrix and gap penalties chosen. Additionally, the scores obtained are dependent upon the lengths of the sequences aligned, requiring a post-analysis conversion. We undertook a study to examine the utility of Bayesian statistics to overcome these issues. We developed a Bayesian algorithm for local sequence alignment, BALSA, that takes into account the uncertainty associated with all unknown variables by incorporating in its forward sums a series of scoring matrices, gap parameters, and all possible alignments. The algorithm returns a representative sample of alignments and the posterior probabilities of gap penalties and scoring matrices. Furthermore, it inherently adjusts for variations in sequence lengths. BALSA was compared to SSEARCH with E-values, to date the best performing dynamic programming algorithm in the detection of structural neighbors. Using the SCOP databases PDB40D-B, PDB90D-B, and PDB41-90D-B, BALSA detected 19.8%, 41.3%, and 67.2% of remote homologs while SSEARCH detected 18.4%, 38%, and 60% at an error rate of 1% EPQ over the databases respectively.