Metamorphosis: Changing the shape of genomic data using Kafka
B Lawlor, P Walsh
Department of Computer Science, CIT – Cork Institute of Technology, Ireland
e-mail: brendan.lawlor@mycit.ie
Keywords: Bioinformatics, Big Data, Cloud Computing

A bioinformatic researcher, when seeking novel ways to process genomic data, will naturally look for a standard reference genomic database against which to test those new ideas. Perhaps the best-known genomic reference is NCBI's Blast[1]. The word 'Blast' can refer both to the database and to the search algorithm used to access it, and in fact this ambiguity highlights an underlying problem for the aforementioned researcher: the shape of the Blast genomic data is set by the Blast search algorithm. To make this data available in its basic form – for example, to researchers who wish to test other search algorithms or to process the entire database contents – it must be downloaded, processed and reformatted into flat FASTA files. In either form – algorithm-specific proprietary format or more than 800GB of text files – the genomic reference data is not in good shape.

Genomic data is big data. It has high volume and increasingly high velocity (more database entries, more query sequences). The format in which that data is maintained matters: it should be robust, scalable and amenable to parallelized access. It should also be available in a way that stimulates innovation. If the data is only available through BLAST, then it will only be used in ways that BLAST permits. If instead it is open to exhaustive, streamed, distributed and parallelized access, then novel applications of the data will result.

This paper explores the use of Apache Kafka[2] – a technology conceived not as a database but as a messaging system – for general-purpose, efficient, distributed, stream-ready storage and retrieval of genomic data. Kafka offers scalable, durable, distributed storage with high-speed and flexible access.
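To make the storage idea concrete, the following is a minimal sketch of how FASTA entries might map onto Kafka-style key/value messages, with the accession as the message key and the sequence as the value. The topic name, the parsing helper and the record shape are illustrative assumptions, not details of the paper's implementation; in a real deployment each tuple would be handed to a Kafka producer client.

```python
# Hypothetical sketch: mapping FASTA records to Kafka-style key/value
# messages. Topic name and serialization choices are assumptions.

def parse_fasta(text):
    """Yield (sequence_id, sequence) pairs from FASTA-formatted text."""
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Take the accession (first token of the header) as the key.
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)

def to_kafka_records(fasta_text, topic="genomic-sequences"):
    """Turn FASTA entries into (topic, key, value) tuples, shaped as a
    Kafka producer would send them: accession as key, sequence as value."""
    return [(topic, sid.encode(), seq.encode())
            for sid, seq in parse_fasta(fasta_text)]
```

Keying messages by accession would let Kafka partition the reference database across brokers, so that independent consumers can read partitions in parallel – the exhaustive, streamed access the text describes.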
We consider how this technology could emerge as a primary format for storing genomic sequences, directly accessible by general-purpose programs, but from which secondary algorithm-specific formats could be derived where needed. We present a proof-of-concept implementation of a Kafka genomic database and measure its properties.

The wider software community is exploring improved ways of storing big data that avoid the bottlenecks of centralized transactional databases. A central feature of such systems is the reinterpretation of system state as 'events'. 'Event sourcing'[3] is an increasingly popular and scalable solution to maintaining the state of a system without creating database bottlenecks[4]. Reference genomes can be seen as the system state of many bioinformatics applications, and a Kafka-based log of the adds, updates and deletes of genomic sequences could offer a powerful implementation of this concept. It is interesting to note that event sourcing as described above is analogous to transaction logs – the history of actions maintained by traditional database management systems to guarantee ACID properties[5].
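The event-sourcing idea can be sketched in a few lines: the current reference-genome state is never stored directly, but is rebuilt by replaying an ordered log of add/update/delete events, just as a Kafka consumer would replay a topic from offset zero. The event names and tuple shape below are illustrative assumptions.

```python
# Hypothetical event-sourcing sketch: reference-genome state as a fold
# over an ordered event log. Event types ("add", "update", "delete")
# are illustrative, not taken from the paper's implementation.

def replay(events):
    """Fold a log of (action, accession, sequence) events into the
    current state: a dict mapping accession -> sequence."""
    state = {}
    for action, accession, sequence in events:
        if action in ("add", "update"):
            state[accession] = sequence
        elif action == "delete":
            state.pop(accession, None)
        else:
            raise ValueError(f"unknown event type: {action}")
    return state
```

Because only the latest event per accession matters for the current state, such a log is also a natural fit for Kafka's log compaction, which retains the most recent record for each key.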