POFs: what we don’t know can hurt us

Martin Gollery¹, Jeff Harper², John Cushman², Taliah Mittler² and Ron Mittler²,³

¹ TimeLogic – a Division of Active Motif, Incline Village, NV 89451, USA
² Department of Biochemistry and Molecular Biology, MS200, University of Nevada, Reno, NV 89557, USA
³ Department of Plant Science, Hebrew University of Jerusalem, Givat Ram, Jerusalem 91904, Israel

Over a quarter of all eukaryotic genes encode proteins with obscure features that lack currently defined motifs or domains (POFs). Interestingly, most of the differences in gene repertoire among species were recently found to be attributable to POFs. A comparison of the Arabidopsis, rice and poplar genomes reveals that Arabidopsis contains 5069 POFs, of which 2045 have no obvious homologs in rice or poplar and are likely to be involved in species- or phylogenetic-specific functions in Arabidopsis. The study of POFs is an important endeavor that will shed much needed light on the genetic properties that make any given plant species unique. Furthermore, such studies show that, for many species-specific features, there is a limit to what we can expect to learn from a model plant such as Arabidopsis.

Why are POFs important?

The rapid proliferation of genomic and metagenomic sequencing data has drawn increasing attention to the large number of genes of unknown function, which seem to be an integral part of the genetic blueprint of most organisms [1–6]. On average, 15–40% of the genes in each eukaryotic genome sequenced to date encode proteins with obscure features that lack currently defined motifs or domains (POFs; see Glossary) [2]. The recent metagenomic ocean-sequencing expedition uncovered tens of thousands of proteins with undefined features, highlighting how little we know of the amazing diversity of protein sequences [1]. But what are the roles of POFs?
Is it possible that, buried within each newly sequenced genome, there exists an entire set of pathways and genetic programs of which we are completely unaware? Here, we discuss the role of POFs in plants and highlight the possibility that they have a key role in determining ecologically and agronomically important species-specific features of different plants.

How to provide a measurable definition of the unknown?

Although the term ‘gene of unknown function’ is used broadly, it is difficult to define. How do we quantify the unknown in genes with unknown function?

Using overall amino acid sequence similarity

One way to define a ‘gene of unknown function’ is by similarity, using a nucleotide or amino acid sequence comparison or an equivalent method. If a gene has no homolog in any other genome sequenced to date, or in any of the available databases [e.g. using a BLAST search against all sequences in the National Center for Biotechnology Information (NCBI) database], then it could be defined as an unknown. This type of classification has been used to define orphan open reading frames (ORFans) [3,5,6]. The total number of ORFans in the NCBI database was recently estimated to be 80 000, underscoring the magnitude of the problem that researchers face in understanding these genes and their roles [1]. However, at least two problems are associated with the ORFan definition: (i) many proteins have some degree of similarity to proteins in other genomes, yet this similarity fails to provide clues to a possible gene function; and (ii) the BLAST E-value cutoff used to classify ORFans is often not rigorously defined. Using a cutoff that is too high (e.g. >10⁻²) or too low (e.g. <10⁻³⁰) could drastically change whether a gene is annotated as an ORFan.

A similarity-based definition of an unknown can also cover a protein that has one or more homologs in other genomes or in available databases, where none of these homologs has a known classification.
Such genes are often referred to as ‘expressed proteins with unknown function’ or ‘hypothetical proteins’. At least two problems are also associated with this type of definition: (i) the BLAST E-value cutoff used for the similarity search is often poorly defined; and (ii) the annotation of genes in the available databases is often inaccurate or outdated. The second problem is serious

Opinion TRENDS in Plant Science Vol.12 No.11

Glossary

BLAST: a program that finds protein or nucleotide sequences that are similar to a query sequence. It provides two values: S and E. The S-score is a measure of the similarity between the query and the matched sequence. The E-value is a measure of the reliability of the S-score: it is the number of alignments with a score of at least S that would be expected to occur by chance in a database of the same size.
Domain of unknown function (DUF): a domain that can be identified in a given protein by an HMM search but has no defined function.
Hidden Markov model (HMM): a type of probabilistic model used to align and analyze sequence datasets by generalization from a sequence profile; it is well suited to providing a mathematical framework for profile analysis.
Hypothetical protein: a predicted protein for which there is no experimental evidence that it is expressed in vivo.
Metagenomics: the study of the collective genomes of microorganisms (as opposed to clonal cultures), carried out by sequencing DNA obtained directly from the environment of the microorganisms.
ORFan: orphan open reading frame (ORF) with no detectable sequence similarity to any other sequence in the databases.
Protein families (Pfam): a large collection of multiple sequence alignments and HMMs covering many common protein families.
Protein with defined feature (PDF): a protein that contains at least one previously defined domain or motif.
Protein with obscure features (POF): a protein that lacks currently defined motifs or domains.
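
The sensitivity of the ORFan definition to the BLAST E-value cutoff, noted in the main text, can be illustrated with a minimal sketch. This is not the procedure used by the authors: the function name `classify_orfans`, the gene identifiers and all E-values below are hypothetical, and parsing of real BLAST output is omitted. The sketch only assumes that a gene counts as an ORFan when none of its hits outside its own genome has an E-value at or below the chosen cutoff.

```python
from typing import Dict, List


def classify_orfans(hit_evalues: Dict[str, List[float]], cutoff: float) -> List[str]:
    """Return the sorted IDs of genes with no hit at or below the E-value cutoff.

    hit_evalues maps a gene ID to the E-values of its hits in other genomes
    (an empty list means the search returned no hits at all).
    """
    orfans = []
    for gene_id, evalues in hit_evalues.items():
        # A gene is treated as having a homolog if any hit passes the cutoff.
        has_homolog = any(e <= cutoff for e in evalues)
        if not has_homolog:
            orfans.append(gene_id)
    return sorted(orfans)


# Hypothetical E-values, invented purely for illustration.
HITS = {
    "geneA": [1e-50],   # strong hit
    "geneB": [3e-12],   # moderate hit
    "geneC": [2e-4],    # weak hit
    "geneD": [0.5],     # hit indistinguishable from chance
    "geneE": [],        # no hits
}

for cutoff in (1e-2, 1e-10, 1e-30):
    print(f"cutoff {cutoff:g}: ORFans = {classify_orfans(HITS, cutoff)}")
```

With these invented values, tightening the cutoff from 10⁻² to 10⁻³⁰ moves two more genes (geneC, then geneB) into the ORFan set, which is why a cutoff needs to be stated explicitly whenever ORFan counts are reported or compared.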
Corresponding author: Mittler, R. (ronm@unr.edu).
www.sciencedirect.com 1360-1385/$ – see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.tplants.2007.08.018