HANDLING HIGHLY CORRELATED GENES OF SINGLE-CELL RNA SEQUENCING DATA IN PREDICTION MODELS

LI XING², SONGWAN JOUN¹, KURT MACKEY¹, MARY LESPERANCE¹, AND XUEKUI ZHANG¹,*

Abstract.

Motivation: Selecting feature genes and predicting cells' phenotypes are typical tasks in the analysis of scRNA-seq data. Many algorithms have been developed for these tasks, but high correlations among genes create challenges specific to scRNA-seq analysis that are not well addressed. Highly correlated genes lead to unreliable prediction models because of technical problems such as multi-collinearity. Most importantly, when a causal gene (whose variants have a true biological effect on the phenotype) is highly correlated with other genes, most algorithms select one gene from the correlated set in a data-driven manner, and the selected gene need not be the causal one. Because the correlation structure among genes can change substantially from one dataset to another, a model that relies on a correlated but non-causal gene may not remain reliable. Hence, it is critical to build a prediction model based on causal genes.

Results: To address the issues discussed above, we propose a grouping algorithm that can be integrated into prediction models. Using real benchmark scRNA-seq data and simulated cell phenotypes, we show that our novel method significantly outperforms standard models in both prediction and feature selection. Our algorithm reports the whole group of correlated genes, allowing researchers either to use their common pattern as a more robust predictor or to conduct follow-up studies to identify the causal genes in the group.

Availability: An R package is being developed and will be available on the Comprehensive R Archive Network (CRAN) when the paper is published.

1. Introduction

Next Generation Sequencing (NGS) technologies have developed rapidly over the past decade. Among all applications of these technologies, single-cell sequencing [Nawy, 2014] is at the forefront of genomic research. Single-cell sequencing examines the genomic information of individual cells with optimized NGS technologies.
It provides a higher resolution of cellular differences and a better understanding of the function of a single cell in the context of its microenvironment. However, the development of analytic tools has trailed the rapid advances in biochemistry and molecular biology [Gawad et al., 2016], and many challenges must still be addressed to fully leverage the information in single-cell sequencing profiles.

Tissues are complex ecosystems containing multiple types of cells. For example, tumors are made of cancer cells and non-cancerous cells, each with its own activation status. This heterogeneous cell composition is called a tumor microenvironment [Aran et al., 2017]. Single-cell RNA-sequencing (scRNA-seq) technology can measure gene expression profiles at the resolution of single cells, which makes it a powerful tool for studying the cell-type composition of various tissues, such as lung [Angelidis et al., 2019], peripheral blood [Newman et al., 2019], and breast tumor [Wagner et al., 2019]. Cells can be grouped by cell type through cluster analysis of the gene expression profiles obtained from scRNA-seq data. The review article [Kiselev et al., 2019] summarizes popular methods and software pipelines for clustering scRNA-seq data.

arXiv:2007.02455v2 [stat.AP] 30 Jul 2020