Data to Knowledge in Pharmaceutical Research

Dr. Ann DeWitt*, Saziye Bayram†, German Enciso‡, Harshini Fernando§, Justin Kao¶, Bernardo Pagnoncelli‖, Deena Schmidt**, and Jaffar Ali Shahul Hameed††

* 3M Corporation
† State University of New York, Buffalo
‡ Rutgers University
§ Texas Tech University
¶ Northwestern University
‖ Pontifícia Universidade Católica, Rio de Janeiro
** Cornell University
†† Mississippi State University

August 18, 2004

Abstract

This report is concerned with the analysis of data from “high-throughput” screening of possible drug compounds. High-throughput screening is a relatively new process that yields thousands of data points at a time, more than can be handled by traditional methods of biological data analysis. We examine a few methods for extracting knowledge from these data and illustrate the use of descriptors for predicting drug activity. Finally, we present suggestions for improving the process and ideas for future work.

1 Introduction

The lengthy process of bringing pharmaceutical products from concept to market begins with drug discovery. A significant part of modern drug discovery is the testing of thousands of compounds in a chemical library for drug-like activity. One of the primary tools for this testing is high-throughput screening (HTS), a highly automated system for assessing the biological activity of thousands of compounds at a time. The main purpose of HTS is to find chemical families that have the desired activity and to show a structure-activity relationship. A secondary purpose of HTS is to draw biological insights from multiple types of biological results, which in turn can prompt further wet-lab experimentation. Because high-throughput screening is a manufacturing process with highly variable output and active compounds comprise only a small fraction of the