Name
SQLite: Grand Unified Genomics File Format?
Description

Since the inception of GA4GH, we’ve recognized the opportunity to formulate genomic data models more abstractly, apart from their manifestation in text/binary file formats. Decoupling the two could streamline how we define APIs and evolve them to accommodate new applications, while reducing friction in deploying general-purpose “big data” technologies in the genomics domain. The pursuit of these goals has been challenging for many reasons, among them a lack of alternative data containers that (i) rise to our demanding performance & compression needs; (ii) preserve neutrality between vendors/platforms/languages; and (iii) add enough further value to reconsider the roadmap for dedicated formats and their tooling. We explore filling this gap with SQLite, a public-domain relational database manager in constant use on billions of devices, universally compatible and appreciated for its speed and flexibility. Our open-source Genomics Extension for SQLite (“GenomicSQLite”) addresses key limitations that previously hindered its use for genomics applications: data compression, genome range SQL queries, and configuration tuning for large datasets. Much work remains on the path to delivering the well-honed tool capabilities that have made current genomics file formats so successful, but the approach can accommodate almost any data model and promotes interoperability with the whole database technology ecosystem.
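
To make "genome range SQL queries" concrete, here is a minimal sketch of an interval-overlap query using only Python's standard `sqlite3` module. It does not use the GenomicSQLite extension or its API; the table `features`, its columns, the sample rows, and the helper names are all hypothetical, and coordinates are assumed to be half-open `[start_pos, end_pos)`.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical feature table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features ("
        " chrom TEXT NOT NULL,"
        " start_pos INTEGER NOT NULL,"  # inclusive start (assumed convention)
        " end_pos INTEGER NOT NULL,"    # exclusive end (assumed convention)
        " name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO features VALUES (?, ?, ?, ?)",
        [
            ("chr1", 100, 200, "a"),
            ("chr1", 150, 400, "b"),
            ("chr1", 500, 600, "c"),
            ("chr2", 100, 200, "d"),
        ],
    )
    # A plain B-tree index; it narrows the scan by chromosome and start,
    # but is not a specialized genomic range index.
    conn.execute(
        "CREATE INDEX features_pos ON features (chrom, start_pos, end_pos)"
    )
    return conn


def overlapping(conn, chrom, qstart, qend):
    """Return names of features overlapping the half-open range [qstart, qend)."""
    rows = conn.execute(
        "SELECT name FROM features"
        " WHERE chrom = ? AND start_pos < ? AND end_pos > ?"
        " ORDER BY start_pos",
        (chrom, qend, qstart),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_demo_db()
    print(overlapping(db, "chr1", 180, 520))
```

Two intervals overlap when each starts before the other ends, which is what the `WHERE` clause encodes. On large datasets a plain index like this one still scans many rows per query, which is one reason dedicated range-query support matters.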
