Evaluation of integrative clustering methods for the analysis of multi-omics data

Cécile Chauvel, Alexei Novoloaca, Pierre Veyre, Frédéric Reynier and Jérémie Becker

Review article. Briefings in Bioinformatics, 21(2), 2020, 541–552. doi: 10.1093/bib/bbz015. Advance Access Publication Date: 14 February 2019. Submitted: 26 September 2018; Received (in revised form): 12 January 2019.

Corresponding author: Jérémie Becker, BIOASTER Research Institute, 40 avenue Tony Garnier, 69007 Lyon, France. Tel.: +33 4 69 85 19 21; Fax: +33 4 72 70 48 2; E-mail: jeremie.becker@bioaster.org

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Cécile Chauvel, PhD, is a researcher in biostatistics in the Data Management and Analysis unit at Bioaster, Lyon, France.
Alexei Novoloaca is a PhD student in biostatistics in the Epigenetics Group at the International Agency for Research on Cancer, World Health Organization, Lyon, France.
Pierre Veyre is a computer scientist in the Data Management and Analysis unit at Bioaster, Lyon, France.
Frédéric Reynier, PhD, is the head of the Genomics and Transcriptomics unit at Bioaster, Lyon, France.
Jérémie Becker, DPhil, is a researcher in biostatistics in the Genomics and Transcriptomics unit at Bioaster, Lyon, France.
BIOASTER is a technological research institute in microbiology that aims to develop new innovative and high-value technology solutions through collaborative projects. Its main interest lies in tackling antimicrobial resistance, developing new diagnostics, improving vaccines’ safety and efficacy and understanding the involvement of microbiome in human and animal health.

© The authors 2019. Published by Oxford University Press on behalf of the Institute of Mathematics and its Applications. All rights reserved.
Abstract

Recent advances in sequencing, mass spectrometry and cytometry technologies have enabled researchers to collect large-scale omics data from the same set of biological samples. The joint analysis of multiple omics offers the opportunity to uncover coordinated cellular processes acting across different omic layers. In this work, we present a thorough comparison of a selection of recent integrative clustering approaches, including Bayesian approaches (BCC and MDI) and matrix factorization approaches (iCluster, moCluster, JIVE and iNMF). Based on simulations, the methods were evaluated on their sensitivity and on their ability to recover both the correct number of clusters and the simulated clustering at the common and data-specific levels. Standard non-integrative approaches were also included to quantify the added value of integrative methods. For most matrix factorization methods and one Bayesian approach (BCC), the shared and specific structures were successfully recovered with high and moderate accuracy, respectively. The opposite behavior was observed for non-integrative approaches, which performed well on specific structures only. Finally, we applied the methods to the Cancer Genome Atlas breast cancer data set to check whether results based on experimental data were consistent with those obtained in the simulations.

Key words: benchmark; clustering; data integration; multi-omics; unsupervised analysis

Introduction

The accumulation of large molecular data sets has fueled the development of translational bioinformatics and systems biology, two fields that share a holistic view of omics data. While the former aims to link biological data to clinical data to improve our understanding of disease mechanisms, the latter explores the basic functional properties of living organisms based on the premise that biological processes build upon the interplay between macromolecules.
Both approaches rely on the idea that biological mechanisms (and, more generally, phenotypic traits) can only be fully captured through the study of molecular interactions among different omics layers. Multi-omic approaches have received much attention in recent years for their potential clinical applications. In genome-wide association studies, for example, the mechanisms by which the identified loci influence phenotypes remain generally unknown and are likely to be unveiled using functional