AIDS RESEARCH AND HUMAN RETROVIRUSES
Volume 19, Number 2, 2003, pp. 145–149
© Mary Ann Liebert, Inc.

Sequence Note

A New Perspective on V3 Phenotype Prediction

SATISH PILLAI,1,* BENJAMIN GOOD,2,* DOUGLAS RICHMAN,1,2 and JACQUES CORBEIL1,2

ABSTRACT

The particular coreceptor used by a strain of HIV-1 to enter a host cell is highly indicative of its pathology. HIV-1 coreceptor usage is primarily determined by the amino acid sequences of the V3 loop region of the viral envelope glycoprotein. The canonical approach to sequence-based prediction of coreceptor usage was derived via statistical analysis of a less reliable and significantly smaller data set than is presently available. We aimed to produce a superior phenotypic classifier by applying modern machine learning (ML) techniques to the current database of V3 loop sequences with known phenotype. The trained classifiers along with the sequence data are available for public use at the supplementary website: http://genomiac2.ucsd.edu:8080/wetcat/v3.html

THE ENTRY OF HIV-1 into a host cell is a two-stage process. First, the viral envelope glycoprotein binds to the cell surface molecule CD4, inducing a conformational change in the gp120 ectodomain of the protein. Second, the glycoprotein docks to a seven-transmembrane chemokine coreceptor on the cell surface, triggering the presentation of its gp41 transmembrane segment. This sequence of events results in membrane fusion and penetration of the virus into the cytosol.1 The two principal coreceptors used by HIV-1 are CXCR4 and CCR5, members of the CXC and CC chemokine receptor families, respectively.2

The particular coreceptor used by a strain of HIV-1 (CXCR4 vs. CCR5) largely defines its replication kinetics and cytopathology in vitro. Moreover, coreceptor usage is indicative of the pathogenicity, tissue tropism, and transmissibility of a virus in vivo.
Unsurprisingly, the determination of this viral phenotype is critical in a wide variety of HIV research contexts.

Several experiments have been conducted on HIV isolates to pinpoint the genetic basis underlying coreceptor preference. The generation and analysis of chimeric (recombinant) viruses have localized the primary determinant of coreceptor usage to the 35-amino acid V3 loop subregion of the HIV envelope glycoprotein.3 Earlier work involving statistical analysis of V3 loop amino acid sequences and their respective phenotypes suggested that the presence of a positively charged residue at positions 11 and/or 25 of the V3 loop (numbered according to the North American consensus; see Fig. 1) conferred the ability to dock with CXCR4, while CCR5 binding is the default condition.4 To date, this "charge rule" is the most accepted method of sequence-based prediction. However, prediction based on this rule does not always align with experimental determination of coreceptor usage.5 The inaccuracy of the charge rule is most likely due to the comparatively sparse and unreliable data that were available at the time of its creation. Since then, the number of sequences with known phenotype has increased substantially, and the laboratory-based assays used to generate the data have improved. Another possible deficiency of this predictive scheme is that it considers only 2 of the 35 available amino acid positions in the V3 loop.

Modern machine learning (ML) techniques for class prediction can provide advantages over traditional statistics in terms of their abilities to identify and exploit interactions between feature variables. In addition, the rules they generate can often be interpreted with relative ease.6,7 ML has already proven extremely useful in segregating biological sequence data into

1 University of California, San Diego, La Jolla, California 92093.
2 Veterans Administration, San Diego Healthcare System, San Diego, California 92161.
* Satish Pillai and Benjamin Good contributed equally to this work.