POFs: what we don’t know can hurt us

Martin Gollery¹, Jeff Harper², John Cushman², Taliah Mittler² and Ron Mittler²,³

¹ TimeLogic – a Division of Active Motif, Incline Village, NV 89451, USA
² Department of Biochemistry and Molecular Biology, MS200, University of Nevada, Reno, NV 89557, USA
³ Department of Plant Science, Hebrew University of Jerusalem, Givat Ram, Jerusalem 91904, Israel

Over a quarter of all eukaryotic genes encode proteins with obscure features that lack currently defined motifs or domains (POFs). Interestingly, most of the differences in gene repertoire among species were recently found to be attributable to POFs. A comparison of the Arabidopsis, rice and poplar genomes reveals that Arabidopsis contains 5069 POFs, of which 2045 have no obvious homologs in rice or poplar and are likely to be involved in species- or phylogenetic-specific functions in Arabidopsis. The study of POFs is an important endeavor that will shed much needed light on the genetic properties that make any given plant species unique. Furthermore, such studies show that, for many species-specific features, there is a limit to what we can expect to learn from a model plant such as Arabidopsis.

Why are POFs important?

The rapid proliferation of genomic and metagenomic sequencing data has drawn increasing attention to the large number of genes of unknown function, which seem to be an integral part of the genetic blueprint of most organisms [1–6]. On average, 15–40% of the genes in every eukaryotic genome sequenced to date encode proteins with obscure features that lack currently defined motifs or domains (POFs; see Glossary) [2]. The recent metagenomic ocean-sequencing expedition uncovered tens of thousands of proteins with undefined features, highlighting how little we know of the amazing diversity of protein sequences [1]. But what are the roles of POFs?
Is it possible that, buried within each newly sequenced genome, there exists an entire set of pathways and genetic programs of which we are completely unaware? Here, we discuss the role of POFs in plants and highlight the possibility that they have a key role in determining ecologically and agronomically important species-specific features of different plants.

How to provide a measurable definition of the unknown?

Although the term ‘gene of unknown function’ is used broadly, it is difficult to define. How do we quantify the unknown in genes of unknown function?

Using overall amino acid sequence similarity

One way to define a ‘gene of unknown function’ is by similarity, using a nucleotide or amino acid sequence comparison or equivalent. If a gene has no homolog in any other genome sequenced to date, or in any of the available databases [e.g. using a BLAST search against all sequences in the National Center for Biotechnology Information (NCBI) database], then it could be defined as an unknown. This type of classification has been used to define orphan open reading frames (ORFans) [3,5,6]. The total number of ORFans in the NCBI database was recently estimated to be 80 000, underscoring the magnitude of the problem that researchers face in understanding these genes and their roles [1]. However, at least two problems are associated with the ORFan definition: (i) although many proteins have some degree of similarity among different genomes, this similarity fails to provide clues to a possible gene function; and (ii) the BLAST E-value cutoff used to classify ORFans is often not rigorously defined. Using a cutoff value that is too high (e.g. >10⁻²) or too low (e.g. <10⁻³⁰) could drastically change the annotation of an ORFan.

A similarity-based definition of an unknown can also be applied to a protein that has one or more homologs in other genomes or in available databases, but whose homologs have no known classification.
Such genes are often referred to as expressed proteins with unknown function or hypothetical proteins. At least two problems are also associated with this type of definition: (i) the BLAST E-value cutoff used for the similarity search is often poorly defined; and (ii) the annotation of genes in the available databases is often inaccurate or outdated. The second problem is serious

Opinion TRENDS in Plant Science Vol.12 No.11

Glossary

BLAST: a program that finds protein or nucleotide sequences that are similar to a query sequence. It provides two values: S and E. The S-score is a measure of the similarity between the query and a matching database sequence. The E-value is a measure of the reliability of the S-score: it is the number of alignments with a score equal to or greater than the given S-score that are expected to occur by chance.

Domain of unknown function (DUF): a domain that can be identified in a given protein by an HMM search but has no defined function.

Hidden Markov model (HMM): a type of probabilistic model used to align and analyze sequence datasets by generalization from a sequence profile; it is well suited to providing a mathematical framework for profile analysis.

Hypothetical protein: a predicted protein for which there is no experimental evidence that it is expressed in vivo.

Metagenomics: the study of the collective genomes of microorganisms (as opposed to clonal cultures). The technique is to sequence DNA obtained directly from the environment of the microorganisms.

ORFan: orphan open reading frame (ORF) with no detectable sequence similarity to any other sequence in the databases.

Protein families (Pfam): a large collection of multiple sequence alignments and HMMs covering many common protein families.

Protein with defined feature (PDF): a protein that contains at least one previously defined domain or motif.

Protein with obscure features (POF): a protein that lacks currently defined motifs or domains.
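To illustrate why a loosely defined E-value cutoff is a problem for the ORFan definition, the following minimal Python sketch classifies a handful of genes as ORFans under three different cutoffs. The gene names, the E-values and the helper `classify_orfans` are hypothetical and serve only as an illustration; the sketch assumes that the best-hit E-value for each gene has already been extracted from a BLAST search and is not a description of any published pipeline.

```python
from typing import Dict, List, Optional

# Hypothetical best-hit E-values per gene (illustrative numbers, not real data).
# None means the similarity search reported no hit at all.
best_hit_evalues: Dict[str, Optional[float]] = {
    "gene_a": 1e-80,
    "gene_b": 1e-12,
    "gene_c": 5e-4,
    "gene_d": 0.5,
    "gene_e": None,
}


def classify_orfans(best_hits: Dict[str, Optional[float]], cutoff: float) -> List[str]:
    """Return the sorted IDs of genes with no hit at or below the E-value cutoff.

    A gene counts as an ORFan here when it has no hit, or when its best hit
    has an E-value above the cutoff (i.e. the hit is not considered significant).
    """
    return sorted(
        gene
        for gene, evalue in best_hits.items()
        if evalue is None or evalue > cutoff
    )


if __name__ == "__main__":
    # The same genes, three cutoffs: the ORFan set grows as the cutoff tightens.
    for cutoff in (1e-2, 1e-10, 1e-30):
        orfans = classify_orfans(best_hit_evalues, cutoff)
        print(f"cutoff {cutoff:g}: {len(orfans)} ORFans -> {orfans}")
```

With these illustrative values, the number of genes labeled as ORFans rises from two to three to four as the cutoff moves from 10⁻² to 10⁻¹⁰ to 10⁻³⁰, even though the underlying search results are unchanged. An ORFan count is therefore hard to interpret unless the cutoff is reported with it.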
Corresponding author: Mittler, R. (ronm@unr.edu).
www.sciencedirect.com 1360-1385/$ – see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.tplants.2007.08.018