Int. J. of Computers, Communications & Control, ISSN 1841-9836, E-ISSN 1841-9844
Vol. V (2010), No. 3, pp. 280-291

Extreme Data Mining: Inference from Small Datasets

Răzvan Andonie
Computer Science Department, Central Washington University, Ellensburg, USA
and Department of Electronics and Computers, Transylvania University of Brașov, Romania
E-mail: andonie@cwu.edu

Abstract: Neural networks have been applied successfully in many fields. However, satisfactory results can only be obtained under large sample conditions. With small training sets, performance may be poor, or the learning task may not be accomplished at all. This deficiency severely limits the applications of neural networks. The main reason small datasets cannot provide enough information is that gaps exist between samples, and even the domain of the samples cannot be guaranteed. Several computational intelligence techniques have been proposed to overcome the limits of learning from small datasets. We have the following goals:
i. To discuss the meaning of "small" in the context of inferring from small datasets.
ii. To overview computational intelligence solutions for this problem.
iii. To illustrate the introduced concepts with a real-life application.

1 Introduction

Small dataset conditions exist in many applications, such as disease diagnosis, fault diagnosis or deficiency detection in biology and biotechnology, mechanics, flexible manufacturing system scheduling, drug design, and short-term load forecasting (an activity conducted on a daily basis by electrical utilities). In this section, we describe a computational chemistry problem, review a class of neural networks to be used, and summarize our previous work in this area.

1.1 A Real-World Problem: Assist Drug Discovery

Current treatments for HIV/AIDS consist of co-administering a protease inhibitor and two reverse transcriptase inhibitors (usually referred to as combination therapy).
This therapy is effective in reducing viremia to very low levels; however, in 30-50% of patients it is ineffective because of resistance development, often caused by viral mutations. Because of resistance and poor bioavailability¹ profiles, as well as the toxicity associated with these therapies, there is an urgent need for more efficient drug design. We focus on inhibitors of the HIV-1 protease enzyme, using the IC50 as the target value. A detailed description of the problem, from a computational chemistry point of view, can be found in our papers [1–3].

The IC50 value represents the concentration of a compound that is required to reduce enzyme activity by 50%. A low IC50 value indicates good inhibitory activity.

The available dataset consists of 196 compounds with experimentally determined IC50 values. Twenty of these molecules are held out as an external test set, used only after training is completed. The remaining 176 molecules are used for training and cross-validation. Our practical goal is to predict the (unknown) IC50 values for 26 novel compounds that are candidate HIV-1 protease inhibitors. We use two IC50 prediction accuracy measures: the Root Mean Squared Error (RMSE) and the Symmetric Mean Absolute Percentage Error (sMAPE).

¹ Bioavailability is the rate at which the drug reaches the systemic circulation.

Copyright © 2006-2010 by CCC Publications
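As an illustration of the two accuracy measures named in Section 1.1, the following Python sketch computes RMSE and sMAPE for a list of measured and predicted values. The text does not state which sMAPE variant is used, so the form below (absolute error divided by the mean of the absolute measured and predicted values, reported in percent) is an assumption; the function names `rmse` and `smape` and the handling of zero-valued pairs are likewise illustrative choices, not the author's implementation.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Squared Error: square root of the mean squared difference."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def smape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """sMAPE in percent, ASSUMED form: 100/n * sum(|a - p| / ((|a| + |p|) / 2))."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        # Assumption: a pair in which both values are zero contributes no error.
        if denom > 0.0:
            total += abs(a - p) / denom
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Hypothetical values for illustration only; not data from the paper.
    measured = [0.5, 2.0, 10.0]
    estimated = [0.6, 1.5, 12.0]
    print("RMSE :", rmse(measured, estimated))
    print("sMAPE:", smape(measured, estimated), "%")
```

RMSE is expressed in the units of the target and is dominated by large absolute errors, whereas sMAPE is a relative measure, so the two can rank the same set of predictions differently when target values span several orders of magnitude.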