Minerva Gen*NY*Sis Center for Excellence in Cancer Genomics
University at Albany, State University of New York


Igor B. Kuznetsov

Research supported by funding from
National Institutes of Health



Background

Due to recent technological advances in genomic data acquisition, bioinformatics has become a crucial element of genomics. The main task of bioinformatics is to develop computational tools capable of dealing with diverse types of genomic data, filtering out noise, and finding reliable patterns associated with biological properties of interest. Our laboratory develops bioinformatics methods, along with stand-alone and web-based software, for the analysis of various types of genomic data and applies these methods to genome research. We do this by utilizing a variety of methods from statistics, information theory, classification/pattern recognition, and data mining (Figure 1).

Figure 1. Knowledge-based discovery of structural and functional relationships in genomic data.

 

Research Projects


Conformational flexibility in proteins.

Proteins are dynamic and flexible macromolecules. Upon changes in environment, the protein backbone can undergo significant conformational transitions (Figure 2). Macromolecular flexibility is an important structural property of proteins and is involved in a variety of fundamental biological activities. Conformational transitions are also involved in viral infections and in a number of fatal illnesses, such as neurodegenerative disorders and cancer. One of the most striking examples of the importance of protein flexibility in the development of neurodegenerative disorders is presented by the prion protein (PrP). The ordered C-terminal domain of this protein undergoes significant conformational changes during the transition from a benign, mostly helical conformation to a beta-sheet-rich pathogenic conformation (Prusiner, 1998). Broadly, protein flexibility can be classified into two overlapping categories: (1) disordered flexibility observed in intrinsically unstructured segments that do not have a well-defined folded structure and (2) ordered flexibility observed in segments that adopt at least two different folded conformations.

 

Figure 2. Japanese encephalitis virus envelope protein E: a model of conformational transition upon membrane fusion. A moving loop on the left side represents an artifact resulting from an unstructured fragment missing electron density. Morph constructed using Yale Morph Server.

We developed the Generalized Local Propensity (GLP) profile, a novel quantitative sequence-based measure of conformational flexibility in proteins, and showed that GLP is able to discriminate between different types of conformational flexibility (Kuznetsov and Rackovsky, 2003a). Comparison of the GLP profiles of the ordered C-terminal domain of PrP and its paralog Doppel, which is topologically identical to PrP but does not undergo pathogenic conformational transitions, has shown that these profiles are significantly different and that Doppel does not have flexible segments similar to those of PrP (Figure 3) (Kuznetsov and Rackovsky, 2004). We also identified a potential target for experimental structure determination aimed at obtaining a structural template that can be used to model the pathogenic conformation of the prion protein (Kuznetsov and Rackovsky, 2003b). This finding was featured in the review article "Did the first virus self-assemble from self-replicating prion proteins and RNA?" published in 2007 in the journal "Medical Hypotheses" (link to the article).

Figure 3. The GLP profiles of human prion protein (PrP) and Doppel protein. The three topologically identical regions are shown by dashed boxes.

We also developed software that uses GLP profiles and scan statistics to identify protein segments with a high degree of conformational flexibility (available at http://cfp.rit.albany.edu).
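The window-scanning step behind such a search can be illustrated with a minimal sketch. It assumes a per-residue flexibility profile is already available as a list of numbers; the function name `best_window`, the toy profile values, and the window width are hypothetical, and the sketch does not compute GLP values or the scan-statistic significance used by the actual software.

```python
def best_window(profile, width):
    """Return (start, mean) of the contiguous window with the highest mean score.

    `profile` is a hypothetical list of per-residue flexibility scores;
    this is an illustration only, not the published GLP procedure.
    """
    if width <= 0 or width > len(profile):
        raise ValueError("width must be between 1 and len(profile)")
    # Rolling sum: add the residue entering the window, drop the one leaving it.
    window_sum = sum(profile[:width])
    best_start, best_sum = 0, window_sum
    for start in range(1, len(profile) - width + 1):
        window_sum += profile[start + width - 1] - profile[start - 1]
        if window_sum > best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_sum / width


# Toy profile (made-up values) with a high-scoring stretch at positions 3-5.
toy_profile = [0.1, 0.2, 0.1, 0.9, 0.8, 0.7, 0.2, 0.1]
start, mean = best_window(toy_profile, 3)
print(start, round(mean, 3))
```

A scan statistic would additionally ask how likely such a high-scoring window is to arise by chance anywhere along the sequence, which this sketch leaves out.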

To study ordered conformational transitions, we used a large non-redundant set of experimentally characterized proteins obtained from the Database of Macromolecular Movements (Kuznetsov, 2008; the dataset is available here). We examined the sequence and low-resolution structural properties of positions that exhibit significant changes in backbone conformation, and the utility of these properties for predicting such conformationally variable positions using supervised pattern recognition. The results of this study show that ordered changes in backbone conformation are not limited to solvent accessible loop regions. A considerable fraction of conformationally variable positions is observed in helices and strands, and in buried positions. Conformationally variable positions are less conserved in evolution. Local patterns of (a) sequence neighbors, (b) evolutionary conservation, and (c) solvent accessibility can be used to predict conformationally variable positions with balanced sensitivity and specificity, albeit with large variance at the level of individual proteins. Application of this methodology to the prion protein showed that conformationally variable positions predicted in its ordered C-terminal domain are located within segments presumed to be involved in refolding of PrP. The methodology for predicting residue positions involved in ordered conformational transitions was implemented as a web-server (Kuznetsov and McDuffie, 2008; available at http://flexpred.rit.albany.edu).


Protein-DNA interactions.

Proteins that interact with DNA are involved in a number of fundamental biological activities such as DNA replication, transcription, and repair (Figure 4). A reliable identification of DNA-binding sites on DNA-binding proteins is important for functional annotation, site-directed mutagenesis, and modeling protein-DNA interactions.

Figure 4. Transcription factor HNF3 bound to DNA.

We developed a series of Support Vector Machine (SVM) classifiers for the prediction of DNA-binding sites on DNA-binding proteins (Kuznetsov et al, 2006). Our results indicate that including the profile of evolutionary conservation of sequence positions in the form of a properly scaled Position Specific Scoring Matrix obtained using a non-redundant sequence database significantly improves the accuracy of the prediction of DNA-binding sites. The highest prediction accuracy is achieved using a classifier that utilizes a combination of evolutionary conservation and low-resolution structural information (Figure 5). 

Figure 5. ROC curves of the SVM predictors of DNA-binding sites.
The predictors have the following values of the area under the curve (AUC):
seq-SVM AUC=0.748; seq-str-SVM AUC=0.776;
pssm-SVM AUC=0.836; pssm-str-SVM AUC=0.840.
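
For readers unfamiliar with the AUC values quoted above, the following minimal sketch shows one standard way to compute the area under a ROC curve: as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    labels: 1 for a binding residue, 0 for a non-binding residue.
    scores: classifier outputs, higher meaning "more likely binding".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 7 of the 9 positive/negative pairs are ordered correctly.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6]
auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of binding from non-binding positions.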

We developed DP-BIND, a web-server for predicting DNA-binding sites in a DNA-binding protein from its amino acid sequence (Hwang et al, 2007) (link to the web-server). The web server implements three machine learning methods: support vector machine, kernel logistic regression, and penalized logistic regression. The outputs of all three individual methods are combined into a consensus prediction to help identify positions predicted with a high level of confidence.
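
One simple way to combine three per-residue predictions is a majority vote that also flags unanimous positions. The sketch below is an assumption made for illustration: the exact consensus rule used by DP-BIND is not described here, and the function name `consensus` and the toy inputs are hypothetical.

```python
def consensus(svm, klr, plr):
    """Combine three per-residue binary predictions (1 = binding, 0 = non-binding).

    Returns a list of (label, unanimous) tuples: `label` is the majority vote,
    and `unanimous` is True when all three methods agree. This voting rule is
    an illustrative assumption, not the documented DP-BIND procedure.
    """
    if not (len(svm) == len(klr) == len(plr)):
        raise ValueError("all prediction lists must have the same length")
    result = []
    for votes in zip(svm, klr, plr):
        total = sum(votes)
        label = 1 if total >= 2 else 0
        unanimous = total in (0, 3)
        result.append((label, unanimous))
    return result


# Toy predictions for a four-residue fragment (made-up values).
combined = consensus([1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0])
print(combined)
```

Positions where all three methods agree can then be treated as the high-confidence subset.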

We tested the utility of pattern recognition methods for predicting DNA-binding interfaces, using kernel logistic regression (KLR) predictors of DNA-binding residues as an example. We showed that predictors that utilize sequence properties of proteins can successfully predict DNA-binding residues in proteins from a novel structural class that was not used for training. We used multiple linear regression (MLR) to establish a quantitative relationship between protein properties and the expected accuracy of KLR predictors. The expected accuracy provided by this MLR model is close to the actual accuracy (Figure 6) and can be used to assess the overall confidence of the prediction of DNA-binding interfaces in the case of novel proteins (Gou and Kuznetsov, 2008).

Figure 6. Scatter plot of the observed per-protein accuracy of the KLR predictor obtained from leave-one-class-out cross-validation vs. accuracy estimated using multiple linear regression (MLR).
R-squared = 0.918, F-statistic = 14.25, P-value = 0.0.

Functional implications of compositional bias in protein sequences.

Most biological sequences contain compositionally biased segments in which one or more residue types are significantly over-represented. The function and evolution of these segments are poorly understood. Usually, all types of compositionally biased segments are masked and ignored during sequence analysis. However, it has been shown for a number of proteins that biased segments that contain amino acids with similar chemical properties are involved in a variety of molecular functions and human diseases. A detailed large-scale analysis of the functional implications and evolutionary conservation of different compositionally biased segments requires a sensitive method capable of detecting user-specified types of compositional bias. We developed BIAS, a novel sensitive method for the detection of compositionally biased segments composed of a user-specified set of residue types (Kuznetsov and Hwang, 2006). BIAS is implemented as stand-alone software and as a web-server (Kuznetsov, 2008). BIAS uses discrete scan statistics, which provide a highly accurate correction for multiple tests, to compute analytical estimates of the significance of each compositionally biased segment (Figure 7). The method can take into account global compositional bias when computing analytical estimates of the significance of local clusters. We used BIAS to show that groups of proteins with the same biological function are significantly associated with particular types of compositionally biased segments.

Figure 7. Flowchart that shows application of the BIAS algorithm to search for clusters of negatively charged amino acids in a protein sequence.
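
The idea of scanning a sequence for clusters of a user-specified residue set can be sketched as follows. This is a simplified illustration under stated assumptions, not the BIAS algorithm: it finds the fixed-width window richest in the chosen residues and reports a binomial tail probability for that single window, using the global frequency of those residues in the sequence as the background. It applies no scan-statistic correction for the many overlapping windows examined, so the reported probability overstates significance. The function names, the toy sequence, and the window width are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def densest_window(sequence, residues, width):
    """Return (start, count) of the first window containing the most residues from `residues`."""
    if width <= 0 or width > len(sequence):
        raise ValueError("width must be between 1 and len(sequence)")
    hits = [1 if aa in residues else 0 for aa in sequence]
    count = sum(hits[:width])
    best_start, best_count = 0, count
    for start in range(1, len(hits) - width + 1):
        # Slide the window one position: add the new residue, drop the old one.
        count += hits[start + width - 1] - hits[start - 1]
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count


# Toy sequence (made up) with a run of negatively charged residues D and E.
toy_sequence = "MKTAYDEEDEDLLKGAS"
acidic = {"D", "E"}
start, count = densest_window(toy_sequence, acidic, 6)

# Background probability taken from the whole sequence (global composition).
background = sum(aa in acidic for aa in toy_sequence) / len(toy_sequence)

# Uncorrected probability of seeing at least `count` hits in ONE window of width 6.
p_single_window = binomial_tail(count, 6, background)
print(start, count, p_single_window)
```

A scan statistic replaces the single-window probability with the probability that any window along the sequence reaches the observed count, which is what makes the significance estimate valid across multiple tests.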

Predictive inference using genetic biomarkers.

N-acetyltransferase-2 (NAT2) is an enzyme that catalyzes the acetylation of aromatic and heterocyclic amine carcinogens. Based on the level of NAT2 enzymatic activity, individuals in human populations are divided into three enzymatic phenotypes: rapid (normal activity), intermediate, and slow (reduced activity). Because of its involvement in the detoxification of carcinogens, mutations within NAT2 that affect its enzymatic activity may also modify risk of cancer development. We developed a highly accurate supervised pattern recognition method for inferring the enzymatic activity of NAT2 from single nucleotide polymorphisms (SNPs) found in the NAT2 gene (Kuznetsov et al, 2009). The methodology was implemented as a web-server, NAT2PRED. Given a combination of NAT2 SNPs observed in a particular individual, the web-server assigns one of the three NAT2 phenotypes, slow, intermediate, or rapid, to this individual. The web-server can be used for a fast determination of the NAT2 acetylator phenotype in genetic screens. The results of an independent evaluation of NAT2PRED performed on a worldwide dataset composed of 56 populations are available in a manuscript published in BMC Medical Genetics (Sabbagh et al, 2009).


Integration of multiple sources of genomic data.

High-throughput genome analysis techniques produce an ever-increasing number of heterogeneous large-scale data sets. Studies of these mutually complementary sources of data promise insights into a global picture of the living cell. We developed a bioinformatics methodology for the analysis of multiple heterogeneous sources of ‘omic’ (genomic, proteomic, etc.) data (Hwang and Kuznetsov, 2007). We applied this methodology to study associations among four types of human ‘omic’ data: protein-protein interactions, gene expression, transcription factor binding sites, and functional pathways. The results of our study demonstrated that the proposed approach can be used to identify and rank statistically significant functional associations among genes. We showed that combinations of multiple data types provide additional insights into the properties of functional pathways. The proposed methodology can also be used as a quantitative procedure for evaluating the quality of ‘omic’ data sets.


Comparison of distantly related protein sequences.

Amino acid sequence alignment is a cornerstone sequence comparison method. The quality of alignment critically depends on the choice of the alignment scoring function. Therefore, for a specific alignment problem one needs a way of selecting the best performing scoring function. We designed protein family-specific and fold-specific amino acid similarity matrices that cover the entire SCOP database and an adaptive global sequence alignment procedure that automatically selects an appropriate similarity matrix and optimized gap penalties based on the properties of the sequences being aligned. We also designed a quantitative statistical framework for the comparative evaluation of the performance of alignment scoring functions in global sequence alignment. This framework was applied to study how the existing general-purpose amino acid similarity matrices perform on individual protein families and structural folds and to compare them to the family-specific and fold-specific similarity matrices. The results of this study showed that using protein family-specific similarity matrices significantly improves the quality of the alignment of distantly related homologous sequences (Kuznetsov, 2011). The family-specific matrices and the optimized gap penalties are available at http://taurus.crc.albany.edu/fsm.
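
To make the role of the similarity matrix and gap penalty concrete, here is a minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence with a linear gap penalty). It is a textbook illustration rather than the adaptive procedure described above: the scoring function `toy_similarity`, its values, and the gap penalty are hypothetical stand-ins for a family-specific matrix and its optimized gap penalties.

```python
def global_alignment_score(a, b, similarity, gap):
    """Optimal global alignment score of sequences a and b.

    similarity: function (x, y) -> score for aligning residue x with residue y.
    gap: score (usually negative) charged for each gapped position.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against an empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + similarity(a[i - 1], b[j - 1]),  # align two residues
                score[i - 1][j] + gap,  # gap in b
                score[i][j - 1] + gap,  # gap in a
            )
    return score[-1][-1]


def toy_similarity(x, y):
    """Hypothetical scoring: +2 for identical residues, -1 otherwise."""
    return 2 if x == y else -1


# HEAGAW vs HE-GAW: five matches (+10) and one gap (-2).
alignment_score = global_alignment_score("HEAGAW", "HEGAW", toy_similarity, -2)
print(alignment_score)
```

Swapping `toy_similarity` for a lookup into a different matrix changes which alignment is optimal, which is why the choice of scoring function matters most for distantly related sequences.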
