Correlation Matrix Based Feature Selection with Genetic Algorithms
for In-Silico Drug Design
Mark J. Embrechts (embrem@rpi.edu) and Muhsin Ozdemir
Department of Decision Sciences and Engineering Systems
Rensselaer Polytechnic Institute, Troy, NY 12180
Tel 518-276-4009; FAX 518-276-8227
In-silico design of pharmaceuticals relies on predictive models for a drug-related bio-activity that are built from several hundred descriptive features, often for datasets with relatively few molecules. The process of selecting potential drug candidates with desirable properties can be divided into a feature reduction stage and a predictive modeling stage. Traditional statistical approaches, such as predictive models based on principal components, typically fail for such applications because the bio-activity of interest depends in a highly nonlinear way on the predictive features and because the dataset often contains fewer molecules than predictive features. This presentation explains a novel feature selection method for small datasets with a large number of features, based on genetic algorithms and the correlation matrix of the features with the bio-activity. The genetic algorithm optimizes a correlation-matrix-based objective function by selecting features that are highly correlated with the activity but only weakly correlated with one another. It will be shown that a GA/correlation-matrix-based methodology for feature selection can yield highly accurate models.
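The selection idea can be illustrated with a minimal sketch in Python. The abstract does not state the exact objective function, so the sketch assumes the merit heuristic from correlation-based feature selection (CFS), which rewards high feature-activity correlation and penalizes high feature-feature correlation. The function names (`correlation_merit`, `ga_select`), the GA operators (binary tournament selection, uniform crossover, bit-flip mutation, elitism), and all parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def correlation_merit(mask, r_xy, r_xx):
    """CFS-style merit of a feature subset (an assumed objective, not the authors').

    mask : boolean vector marking the selected features
    r_xy : |correlation| of each feature with the activity
    r_xx : |correlation| matrix between features
    """
    idx = np.flatnonzero(mask)
    k = idx.size
    if k == 0:
        return -np.inf  # an empty subset is never acceptable
    relevance = r_xy[idx].mean()
    if k == 1:
        redundancy = 0.0
    else:
        sub = r_xx[np.ix_(idx, idx)]
        # mean of the off-diagonal entries of the selected sub-matrix
        redundancy = (sub.sum() - np.trace(sub)) / (k * (k - 1))
    return k * relevance / np.sqrt(k + k * (k - 1) * redundancy)


def ga_select(X, y, pop_size=40, generations=60, p_mut=0.02, seed=0):
    """Search for a feature subset with high merit using a simple binary GA.

    Assumes X has no constant columns (their correlations would be undefined).
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # absolute correlation matrix of all features plus the activity
    corr = np.abs(np.corrcoef(np.column_stack([X, y]), rowvar=False))
    r_xx, r_xy = corr[:-1, :-1], corr[:-1, -1]

    pop = rng.random((pop_size, n_features)) < 0.5  # random initial masks
    best_mask, best_fit = pop[0].copy(), -np.inf
    for _ in range(generations):
        fit = np.array([correlation_merit(m, r_xy, r_xx) for m in pop])
        i = int(fit.argmax())
        if fit[i] > best_fit:
            best_fit, best_mask = fit[i], pop[i].copy()
        # binary tournament selection
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = pop[np.where(fit[a] >= fit[b], a, b)]
        # uniform crossover with randomly paired mates
        mates = parents[rng.permutation(pop_size)]
        take_parent = rng.random((pop_size, n_features)) < 0.5
        children = np.where(take_parent, parents, mates)
        # bit-flip mutation
        children ^= rng.random((pop_size, n_features)) < p_mut
        children[0] = best_mask  # elitism: keep the best subset found so far
        pop = children
    return np.flatnonzero(best_mask), best_fit


if __name__ == "__main__":
    # Synthetic stand-in for a small dataset: 30 "molecules", 50 descriptors.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 50))
    y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=30)
    selected, merit = ga_select(X, y)
    print("selected features:", selected, "merit:", merit)
```

Because the fitness depends only on the precomputed correlation matrix, each evaluation is cheap even when the number of descriptors exceeds the number of molecules. The selected subset would then be passed to the predictive modeling stage.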