Machine learned regression for abductive DNA sequencing

David Thornley*, member IEEE, Maxim Zverev and Stavros Petridis, member IEEE
Department of Computing, Imperial College London
180 Queen’s Gate, South Kensington, London SW7 2RH

Abstract

We construct machine learned regressors to predict the behaviour of DNA sequencing data from the fluorescently labelled Sanger method. These predictions are used to assess hypotheses for sequence composition through calculation of likelihood or deviation evidence from the comparison of predictions from the hypothesized sequence with target trace data. We machine learn a means for comparing the measures taken from competing hypotheses for the sequence. This is a machine learned implementation of our proposal for abductive DNA basecalling. The results of the present experiments suggest that neural nets are a more effective means than decision tree regressors for predicting peak sizes, and for assembling evidence for competing hypotheses in this context. This is despite the availability of variance estimates in our decision tree regressors.

1 Introduction

In his thesis of 1993 [5], Blanchard examined sequence dependent variations in Sanger sequencing [13] trace data [7]. He expressed the opinion that this knowledge would not be useful in basecalling. Indeed, no basecalling package in current use leverages this peak height knowledge. The leading third party basecaller, PHRED, uses peak spacing to excellent effect [3] with quantifiable error rates [4], although it tracks overall trends rather than using detailed knowledge [6]. In unrelated work, Thornley counsels against dismissing interest in peak size variation, and explains that the accompanying attempt to reduce peak height variation is throwing away important information [14]. He also provides a simple functional model for peak size variation, and formulates a method of analysis which actively takes advantage of peak size variation.
That approach comprises abduction of basecalls, in which we hypothesize a sequence composition and assess the peak sizes predicted from each hypothesis against the target trace data to find that which fits best [1].

* This work is a deliverable of EPSRC grant GR/S60266/01

The primitive model used in that initial work enabled validation of our suggestion that there is information encoded in the peak size behaviour in the context of a base position. We used a contrived approach which we refer to as blind spot analysis or BSA, in which we isolate the contextual information by entirely omitting the data at the basecalling position, which we refer to hereafter as the pivot. This means that the information normally used for basecalling is omitted. Thus we use only the contextual information which we have proposed is encoded in the repeatable, sequence motif correlated behaviour of DNA sequencing trace data from the Sanger method.

Our exploration of this information using machine learning tools has demonstrated that viable basecalls can be made using contextual information alone by direct classification [17, 18]. In this work we found that the dependencies examined by Lipshutz [9] during work to estimate confidence in existing basecalls relate to the information we seek to exploit.

When a classifier is given access to the data at the pivot, it uses only that information in its decision, effectively ignoring the contextual data provided. This is because high quality data can generally be called using the pivotal data alone. Indeed, this pivotal data provides the only peak heights used in current basecalling methods. To enable comparison of classifier effectiveness in using context information, we excluded the pivot data to perform “blind spot analysis” in the sense introduced in [14].

We now seek to build a machine learned approach to the basecall abduction proposed in [14].
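The abductive comparison described above can be illustrated with a minimal sketch. It is not the method of [14] nor the regressors studied here: the predictor `toy_predict_peaks`, its base heights and its context scaling factors are invented purely for illustration, and a sum of squared deviations stands in for whatever evidence measure is actually used. The sketch hypothesizes each of the four bases at the pivot, predicts peak sizes for the resulting sequence, and keeps the hypothesis whose predictions deviate least from the target trace. Setting `blind_spot=True` omits the pivot position from the comparison, in the spirit of blind spot analysis.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

BASES = "ACGT"

# Invented numbers, for illustration only: a nominal height per base and a
# scaling factor that the preceding base applies to the following peak.
BASE_HEIGHT = {"A": 1.0, "C": 0.8, "G": 1.2, "T": 0.9}
CONTEXT_SCALE = {"A": 1.0, "C": 0.9, "G": 0.7, "T": 1.1}


def toy_predict_peaks(seq: str) -> List[float]:
    """Hypothetical stand-in for a learned peak size regressor."""
    peaks = []
    for i, base in enumerate(seq):
        height = BASE_HEIGHT[base]
        if i > 0:
            height *= CONTEXT_SCALE[seq[i - 1]]  # context effect of predecessor
        peaks.append(height)
    return peaks


def deviation(predicted: Sequence[float], observed: Sequence[float],
              skip: Optional[int] = None) -> float:
    """Sum of squared deviations, optionally ignoring one position."""
    return sum((p - o) ** 2
               for i, (p, o) in enumerate(zip(predicted, observed))
               if i != skip)


def abduce_pivot(left: str, right: str, observed: Sequence[float],
                 predict: Callable[[str], List[float]] = toy_predict_peaks,
                 blind_spot: bool = False) -> Tuple[str, Dict[str, float]]:
    """Score each hypothesized pivot base and return the best fitting one."""
    pivot = len(left)
    scores = {}
    for base in BASES:
        hypothesis = left + base + right
        scores[base] = deviation(predict(hypothesis), observed,
                                 skip=pivot if blind_spot else None)
    best = min(scores, key=scores.get)
    return best, scores


# Usage: a synthetic "trace" generated from a known sequence.
observed = toy_predict_peaks("ACGTA")
print(abduce_pivot("AC", "TA", observed))
print(abduce_pivot("AC", "TA", observed, blind_spot=True))
```

In this toy setting the blind spot call succeeds only because the invented context factors differ for every base, so the peak following the pivot carries the information. Real trace data is noisy, and the quality of the regressor and of the comparison step determines how much of that context can be recovered.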
In this new work we move on from proof of principle toward establishing components for the abductive process as originally intended. The goal of the present work is to find an effective means for regressing peak sizes, and to explore comparison methods. Since we intend to use the resulting regressor in a general basecaller [1], it must use all the information available. We have found that if we supply the regression information at the pivotal position to the hypothesis comparison step, regardless of which regressor or comparison method is used, the success rate is approximately 100%. This is because