SCReadCounts: Estimation of cell-level SNVs from scRNA-seq data Prashant NM 1* , Nawaf Alomran 1* , Yu Chen 2 , Hongyu Liu 1 , Pavlos Bousounis 1 , Mercedeh Movassagh 3 , Nathan Edwards 2 , and Anelia Horvath 1,3# *equal contribution, #correspondence 1 McCormick Genomics and Proteomics Center, School of Medicine and Health Sciences, The George Washington University, 20037 Wash- ington, DC, USA, 2 Department of Biochemistry and Molecular & Cellular Biology, Georgetown University, Washington, DC 20057, USA, 3 Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA. Summary: SCReadCounts is a method for a cell-level estimation of the sequencing read counts bearing a par- ticular nucleotide at genomic positions of interest from barcoded scRNA-seq alignments. SCReadCounts gener- ates an array of outputs, including cell-SNV matrices with the absolute variant-harboring read counts, as well as cell-SNV matrices with expressed Variant Allele Fraction (VAFRNA); we demonstrate its application to estimate cell level expression of somatic mutations and RNA-editing on cancer datasets. SCReadCounts is benchmarked against GATK and Samtools and is freely available as a 64-bit self-contained binary distribution (Linux), along with MacOS and Python installation. Availability: https://github.com/HorvathLab/NGS/tree/master/SCReadCounts Supplementary Information: SCReadCounts_Supplementary_Data.zip Introduction Estimation of single nucleotide variants (SNV) expression from single cell RNA sequencing (scRNA-seq) data is an emerging field with quickly expanding applications, including as- sessment of allele expression, transcriptional burst kinetics, quanti- tative loci traits (QTLs), haplotype inference, X-chromosome inac- tivation, and demultiplexing (Vu et al., 2019; Larsson et al., 2019; Reinius et al., 2016; Van Der Wijst et al., 2018; Hongyu Liu; Prashant et al., 2019, 2020; Edsgärd et al., 2016; Xu et al., 2019; Griffiths et al., 2017; D’Antonio-Chronowska et al., 2019). In can- cer, studies on cell-level genetic heterogeneity have been instrumen- tal to trace lineages and resolve subclonal architecture (Vu et al., 2019; Lee et al., 2017; Puram et al., 2017; Venteicher et al., 2017; Müller et al., 2016). Genetically distinct tumor cell populations are shown to exert gene expression (GE) heterogeneity, and to differ in clinical features. However, it is currently challenging to extract ge- netically distinct cells for downstream analyses (Petti et al., 2019). To aid these types of studies, we have developed a tool – SCReadCounts - for cell-level estimation of reference and variant read counts (nvar and nref, respectively), from pooled barcoded scRNA-seq alignments. Provided a list of variant sites, SCReadCounts estimates nvar and nref, calculates expressed Variant Allele Fraction (VAFRNA = nvar / (nvar + nref)) and outputs cell-SNV matrices. The cell-SNV matrices can be used as inputs for a wide range of downstream analyses. We demonstrate the application of SCReadCounts to estimate cell level expression of somatic muta- tions and RNA-editing on cancer datasets from Adrenal Neuroblas- toma (Dong et al., 2020). We also exemplify a downstream applica- tion to correlate VAFRNA to GE using scReQTL (Liu et al., 2020). Results ScReadCounts requires two inputs: a barcoded scRNA-seq alignment, and a list of genomic positions of interest. In the exemplified workflow (Fig.1a), the barcodes and the Unique Molecular Identifiers (UMIs) are processed using UMItools, and the sequencing reads are aligned to the reference genome (GRCh38) us- ing STAR (v.2.5.7b)(Smith et al., 2017; Dobin et al., 2013). The resulting pooled alignments can be filtered to correct for allele-map- ping bias (WASP, Van De Geijn et al., 2015); this filtering utilizes the same list of positions to be used as input for scReadCounts. Ex- amples of positions of interest include SNVs called in the corre- sponding alignments (i.e. using GATK, see Fig.1a), or user-speci- fied lists of coordinates from external sources, such as sets of so- matic mutations (COSMIC) or RNA-editing loci (Auwera Mauricio O. et al., 2002; Tate et al., 2019; Picardi et al., 2017). SCReadCounts generates three main outputs in a tab-separated value format: (1) a table containing nvar and nref with quality and fil- tering metrics for each barcode, (2) a cell-SNV matrix with the ab- solute nvar and nref counts, and, (3) a cell-SNV matrix with the VAFRNA estimated at a user-defined threshold of minimum number of required sequencing reads (minR) (S_Fig.1). Performance We compared the SCReadCounts estimations with the analogous modules of the mpileup utility of Samtools (Li et al., 2009) and the haplotype caller of GATK. SCReadCounts default options generate nearly identical values to mpileup and GATK (Fig.1b and S_Fig.2). SCReadCounts uses, by default, a very simple read-filtering criteria, but it can also be readily configured to achieve scenario-specific, mpileup-consistent, or GATK-consistent results, with optional explicit output of the number of reads discarded by each filtering rule. In regard to efficiency, on our system (2x14 cores CPUs with 1.5TB RAM compute node) processing of a file contain- ing ~5000 cells, ~150mln reads, and ~80K SNVs, requires approxi- mately 4h for the estimation of nvar and nref, and up to 2 minutes for the generation of the cell-SNV matrices. The later enables the users to quickly generate VAFRNA matrices at various minR. Applications We first tested the ability of SCReadCounts to assess the expression of known somatic mutations in the neuroblas- toma scRNA-seq dataset. To do that we extracted from COSMIC 404,693 single nucleotide substitutions located in Cancer Census Genes and not overlapping with known germline SNV loci (S_Table 1). SCReadCounts estimated detectable expression of a number of COSMIC mutations in a low to moderate proportion of the individ- ual cells. An example is COSV67805199 in the gene RHOH, with a variable VAFRNA (minR = 5) across a number of cells from sample SRR10156295 (Fig.1c). Next, we sought to assess if SCReadCounts can detect cell-spe- cific RNA-editing levels. We used the same set of neuroblastoma samples, this time assessing 101,713 single nucleotide editing events from the REDI database (S_Table_2) (Picardi et al., 2017). At minR = 5, SCReadCounts identified multiple loci with variable levels of editing, some with apparent clustering in the two-dimensional space of certain cell-types (cell-types are classified using Seurat and Sin- gleR (Hafemeister and Satija, 2019; D. et al., 2019) An example of RNA-editing in position 14:100846310_A>G) located in the cancer- implicated lincRNA MEG3 is shown on Fig. 1d. Cells with edited MEG3 were seen predominantly in neurons, where they showed clear positional clustering. The levels of editing showed positive correlation (albeit with small effect size) with the expression of the harboring MEG3 (cis-scReQTL, Fig1.e left,), and higher size effect correlations with other genes (trans-scReQTLs), for example the Neuronal Vesicle Trafficking Associated NSG1 (Fig.1e, right). Considerations When applying SCReadCounts, the follow- ing considerations are in place. First, as mentioned earlier, modeling sequencing errors in the UMI is essential. Second, estimation of nvar . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted November 23, 2020. ; https://doi.org/10.1101/2020.11.23.394569 doi: bioRxiv preprint