An Encoder-Decoder Based Basecaller for Nanopore DNA Sequencing
Abstract
Nanopore DNA sequencing is a method in which DNA bases are determined (basecalled) from electric current signals generated as DNA passes through nanopore sensors. The raw measured signals can be aggregated into event data representing new bases entering the nanopore. This thesis makes two contributions. First, we implemented RNN-based single- and double-strand basecallers for simulated event data to analyze the effect of signal noise. As the signal-to-noise ratio (SNR) decreased from 20 dB to 5 dB, the accuracy of the single-strand basecaller dropped by 9% while the accuracy of the double-strand basecaller dropped by only 0.5%. Second, we implemented an end-to-end single-strand basecaller that processes the raw signal directly using an encoder-decoder model with attention, instead of the CTC-style approach used in available basecallers. We achieved an accuracy of 81.9% for a viral sample and 90.9% for a bacterial sample. Our accuracy is comparable to that of state-of-the-art basecallers while using a considerably smaller model.
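To make the SNR figures concrete, the following is a minimal sketch of how white Gaussian noise could be added to a clean event signal at a target SNR in dB. It is not the simulation procedure used in the thesis, which the abstract does not describe. The function names `add_noise_at_snr` and `measured_snr_db`, and the choice to measure signal power as variance around the mean, are assumptions made for illustration.

```python
import math
import random


def signal_power(signal):
    """Power of the signal, taken here as variance around its mean (an assumption)."""
    mean = sum(signal) / len(signal)
    return sum((x - mean) ** 2 for x in signal) / len(signal)


def add_noise_at_snr(signal, snr_db, rng=None):
    """Return a copy of `signal` with white Gaussian noise at the requested SNR (dB)."""
    rng = rng or random.Random()
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power(signal) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, computed from a clean signal and its noisy version."""
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power(clean) / noise_power)
```

Under this definition, going from 20 dB to 5 dB raises the noise power by a factor of 10^1.5, roughly 32, relative to the signal, which gives a sense of how much harsher the 5 dB condition is.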