Neuron Interference: Evidence-Based Batch Effect Removal

Matthew Amodio 1, Ruth Montgomery 2*, Jenna Pappalardo 2*, David Hafler 2*, Smita Krishnaswamy 2 1

* Equal contribution. 1 Department of Computer Science, Yale University. 2 Department of Genetics, Yale University. Correspondence to: Matthew Amodio <matthew.amodio@yale.edu>.

Abstract

New technologies such as single-cell RNA sequencing and mass cytometry are measuring cellular populations in high dimensions, offering unparalleled insights into cellular behavior and enabling new scientific discoveries. However, when these measurements are applied across multiple samples or experimental conditions, the resulting systematic variations, or batch effects, confound biological variation and create a vexing problem in comparing cellular populations. Moreover, these batch effects, which arise from changes in environmental conditions, instrument variation, machine calibration, or differences in human handling, can be complex and highly non-linear transformations. Despite their ubiquity, there are few computational tools designed to correct generally for such effects while maintaining biological differences, and those that exist rely on strong assumptions (such as linear shifts between batches). Here, we propose an entirely novel approach to disentangling biological from batch variation in which we take a specific subpopulation of cells as a control between the batches. This subpopulation can be a population known from prior biology to be unchanged, or a repeatedly measured spike-in. We use an autoencoder to model the variation in the control, and then interfere with neuron activations at inference time to correct for these differences across the entire sample. This technique, which we term neuron interference, is unique in its ability to generalize a batch effect learned on a subpopulation to the entire population.

1. Introduction

As biological researchers, our ability to make tremendous new discoveries is in large part facilitated by improvements in data acquisition: new instruments can generate more measurements per observation, more observations per subject, and more subjects per experiment than ever before. As promising as these technologies are, they come with a major limitation: batch effects. Batch effects are technical artifacts in the data that arise because experimental conditions cannot be replicated exactly. The observed data can be affected by many factors, such as the humidity in the laboratory, the calibration of the measuring instrument, the quality control on a purchased lot of reagent, and exactly how much reagent a technician pipetted into each well. Because our experiments cannot measure everything of interest in one run of the instrument, the datasets from each run must be combined and analyzed together despite the variability from these other sources.

Existing approaches to analyzing this combined data include removing observations determined to be unreliable, or combining the results of separate per-batch analyses with meta-analysis [1, 2, 3]. Another approach is to use a statistical alignment model to remove differences between the samples. For example, one alignment model based on canonical correlation analysis maps both batches to a latent space, assuming a linear latent direction through the genes [4]. A mutual nearest neighbors alignment model imposes the very strong assumption that the shape of the data in each batch is the same and that the batch effect shifts are orthogonal to the direction of the data [5].
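The mutual-nearest-neighbors idea can be made concrete with a toy sketch. Everything below is illustrative and not the published method of [5]: the data are synthetic, the neighbor search is a plain brute-force pass, and a single global correction vector is used in place of the smoothed, pair-local corrections of the real algorithm. It does, however, show the setting such a model assumes, namely a batch effect that is the same constant shift everywhere:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two toy batches drawn from the same underlying population;
# batch 2 carries a constant (hypothetical) technical offset.
b1 = rng.normal(0.0, 1.0, size=(100, 4))
b2 = rng.normal(0.0, 1.0, size=(100, 4)) + 3.0

def mutual_nearest_neighbors(a, b, k=5):
    """Return index pairs (i, j) such that a[i] and b[j] are in each
    other's k-nearest-neighbor sets under Euclidean distance."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    nn_ab = np.argsort(d, axis=1)[:, :k]    # for each a-cell, its k nearest b-cells
    nn_ba = np.argsort(d, axis=0)[:k, :].T  # for each b-cell, its k nearest a-cells
    return [(i, j) for i in range(len(a))
            for j in nn_ab[i] if i in nn_ba[j]]

pairs = mutual_nearest_neighbors(b1, b2)

# Averaging the displacement across MNN pairs estimates the batch
# vector; subtracting it assumes the shift is identical for all cells.
shift = np.mean([b2[j] - b1[i] for i, j in pairs], axis=0)
b2_corrected = b2 - shift
```

When the batch effect is anything other than such a uniform shift, a correction of this form cannot recover the biology, which motivates the evidence-based alternative developed below.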
An alignment model that matches the mean and variance of each dimension of each batch assumes that all differences in the first two moments are batch effects and that no differences in higher moments are [6]. Not only do these methods depend on strong assumptions in order to align the batches, but they then attempt to remove all differences that fit their models of variation.

Instead of assuming that all variation of a specific form is a batch effect, and conversely that all batch effects take that form, we propose an evidence-based method that learns a batch effect model from real data. To do this, we use a control: an invariant subpopulation that is measured in each sample of interest [7, 8, 9, 10, 11]. While the observed changes in the control offer the potential for differentiating between true and technical variation in the sample, it is challenging to use this information to correct the non-control cells, because batch effects can be highly non-linear, non-uniform in the space, and different in each sample [12].

arXiv:1805.12198v1 [q-bio.QM] 30 May 2018
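To make the control-based proposal concrete, the interference step can be sketched in a few lines. This is a sketch of the idea, not the paper's implementation: the "autoencoder" below is a fixed full-rank linear encoder/decoder pair rather than a trained network, and the data, distortion, and population sizes are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: each batch contains an invariant control subpopulation
# (e.g. a spike-in) plus the cells of interest; batch 2 carries an
# affine technical distortion that is unknown to the method.
controls_b1 = rng.normal(0.0, 1.0, size=(50, 5))
cells_b1 = rng.normal(2.0, 1.0, size=(200, 5))

def distort(x):           # hypothetical batch effect on batch 2
    return 1.3 * x + 1.0

controls_b2 = distort(rng.normal(0.0, 1.0, size=(50, 5)))
cells_b2 = distort(rng.normal(2.0, 1.0, size=(200, 5)))

# Stand-in for a trained autoencoder: a fixed full-rank linear map,
# which keeps the sketch runnable without a training loop.
W = rng.normal(size=(5, 5))
encode = lambda x: x @ W                # cells -> "neuron" activations
decode = lambda h: h @ np.linalg.inv(W)

# Neuron interference: measure how the control's internal (neuron)
# activations differ between batches, then subtract that difference
# from every batch-2 cell's activations before decoding.
delta = encode(controls_b2).mean(axis=0) - encode(controls_b1).mean(axis=0)
cells_b2_corrected = decode(encode(cells_b2) - delta)
```

Because the correction is learned from the control alone but applied to all cells, the same mechanism generalizes a batch effect observed on a subpopulation to the entire sample; with a trained non-linear autoencoder in place of the linear stand-in, the interference can capture distortions far richer than a constant shift.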