The origin of correlations in metabolomics data

Diogo Camacho, Alberto de la Fuente, and Pedro Mendes*

Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, MC 0477, Washington St., Blacksburg, VA, 24061, USA

Received 19 August 2004; accepted 15 September 2004

A phenomenon observed earlier in the development of metabolomics as a systems biology methodology consists of a small but significant number of metabolites whose levels are highly correlated between biological replicates. Contrary to initial interpretations, these correlations are not necessarily only between neighboring metabolites in the metabolic network. Most metabolites that participate in common reactions are not correlated in this way, while some non-neighboring metabolites are highly correlated. Here we investigate the origin of such correlations using metabolic control analysis and computer simulation of biochemical networks. A series of cases is identified that lead to high correlation between metabolite pairs in replicate measurements. These are (1) chemical equilibrium, (2) mass conservation, (3) asymmetric control distribution, and (4) unusually high variance in the expression of a single gene. The importance of identifying metabolite correlations within a physiological state, and changes of correlation between different states, is discussed in the context of systems biology.

KEY WORDS: metabolomics; correlations; metabolic control analysis; systems biology.

1. Introduction

Large-scale molecular profiling methods, often referred to as "omics", are becoming predominant in molecular biology. They have facilitated the appearance of observation-driven, hypothesis-generating experiments, where the emphasis is predominantly on identifying new phenomena rather than investigating a known one in detail.
These technologies are also becoming important to a systems biology approach, where they are applied in the context of a solid theoretical background and through computational models (Kitano, 2002). Global transcript analysis through microarrays (Liang et al., 2004) is the most commonly used technique, followed closely by large-scale protein identification and quantification (Righetti et al., 2004), and analysis of protein–protein interactions (Janin and Seraphin, 2003; Cho et al., 2004). There are also several approaches for metabolite identification and quantification (Raamsdonk et al., 2001; Reo, 2002; Sumner et al., 2003), commonly known as metabolomics or metabolite profiling (see Fiehn, 2001, for a clarification of terms). The techniques used in metabolomics are predominantly based on chromatography and mass spectrometry (Roessner et al., 2000; Tolstikov et al., 2003), although nuclear magnetic resonance (Reo, 2002), Fourier-transform infra-red spectroscopy (Oliver et al., 1998; Harrigan et al., 2004), and capillary electrophoresis (Baggett et al., 2002; Soga et al., 2002) are also commonly used. As with the other "omic" techniques, metabolomics data are usually in the form of ratios of concentrations; absolute concentrations are rarely obtained except in targeted analyses that cover only a small number of metabolites.

Metabolomics is a crucial tool in systems biology because it monitors the ultimate products of gene expression (Oliver et al., 1998): organic molecules that are not directly encoded in the genome and are synthesized by a diversity of enzymes. Metabolites are produced from other metabolites, resulting in a level of interdependence between their concentrations that does not exist between transcripts or proteins. These constraints result from the structure of the metabolic network (stoichiometry) and, when known, can be used to derive structural biochemical properties of those networks (e.g. Schilling et al., 1999).
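
As a minimal sketch of how stoichiometry constrains metabolite concentrations, the example below builds a toy network and extracts its mass conservation relation from the left null space of the stoichiometric matrix. The four-metabolite network, the reaction set, and the names `N` and `conservation_relations` are illustrative assumptions, not taken from this paper or from the cited work.

```python
import numpy as np

# Hypothetical toy network (an assumption for illustration only):
#   v1: A + ATP -> B + ADP
#   v2: ADP -> ATP
#   v3: (influx) -> A
#   v4: B -> (efflux)
# Rows are metabolites [A, B, ATP, ADP]; columns are reactions v1..v4.
N = np.array([
    [-1,  0,  1,  0],   # A
    [ 1,  0,  0, -1],   # B
    [-1,  1,  0,  0],   # ATP
    [ 1, -1,  0,  0],   # ADP
], dtype=float)


def conservation_relations(stoich, tol=1e-10):
    """Return a basis of the left null space of `stoich`.

    Each returned row g satisfies g @ stoich == 0, so the weighted sum
    g @ concentrations does not change with time, whatever the rates are.
    """
    # The left null space of N is the null space of N transposed.
    _, singular_values, vt = np.linalg.svd(stoich.T)
    rank = int(np.sum(singular_values > tol))
    return vt[rank:]


relations = conservation_relations(N)

# Rescale the single relation so that its largest-magnitude entry is 1.
g = relations[0]
scaled = g / g[np.argmax(np.abs(g))]
print(np.round(scaled, 6))
```

In this toy case the only relation found weights ATP and ADP equally and ignores A and B, meaning the sum of ATP and ADP is fixed by the network structure alone. A singular value decomposition is used here only because it is a simple, numerically stable way to obtain a null space; other approaches (for example, Gaussian elimination) would serve equally well.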
Currently, however, knowledge of the structure of metabolic networks is limited to the primary metabolism of microbial model organisms and some mammalian tissues. Very little is known about secondary metabolism, or even the primary metabolism of many organisms, leaving on the order of one hundred thousand natural products for which no synthetic pathway is known. Thus, stoichiometry-based analysis of metabolomics data is currently limited to a handful of cases. What remains to be seen is whether metabolomics data can actually be used to uncover those unknown metabolic networks.

*To whom correspondence should be addressed. E-mail: mendes@vbi.vt.edu

Metabolomics Vol. 1, No. 1, January 2005 (© 2005) 53
DOI: 10.1007/s11306-005-1107-3
1011-372X/05/0100–0053/0 © 2005 Springer Science+Business Media, Inc.

Metabolomics data can be analyzed with the same methods used in transcriptomics and proteomics, such as clustering (Roessner et al., 2001), principal component analysis (Oliver et al., 1998; Nicholson et al., 1999), or machine learning (Kell, 2002; Ott et al., 2003). Additionally, it may be possible to develop novel analyses by exploring the vast body of existing theory on