CeBiTec – Colloquium
Monday, January 14, 2019, 17 c.t.
G2-104, CeBiTec Building
Dr. James M. Hogan
Science and Engineering Faculty, Queensland University of Technology (Australia)

Fast Clustering of Very Large Sequence Collections
The abundance of molecular sequencing data arising from next generation sequencing technologies presents a wide range of computational challenges. Clustering is often seen as an important first step in analysing these data, whether driven by a particular scientific question or by a more general imperative to allow efficient indexing and search. Many of the more commonly used methods for sequence clustering will not scale to the collections that have emerged over recent years, and new approaches are required.
All clustering is dependent on similarity, and issues of scale may appear less through the number of bases in the collection than through the number of sequences or reads to be compared. Like many before us, we seek approaches which support accurate, but lightning-fast comparison of pairs of sequences, but are careful to make these comparisons only when necessary. The problem then decouples into two, almost independent components: sequence representation and efficient construction and storage of the resulting clusters.
In this talk I will present our work based on binary signature representations of sequence data, coupled with two distinct clustering approaches, which we call SigClust and ParkTree. Each of these methods require that sequences be embedded in a binary vector space of some fixed dimension, usually some small multiple of the machine word size (256 is a typical choice). Signature mappings are locality preserving, supporting approximate matches. Pairwise comparison then reduces to a small number of standard machine instructions, proving orders of magnitude faster matching than standard methods. I will describe different methods of signature construction, their applications and limitations, but the essential issue is that each sequence, regardless of length, is mapped to a binary vector of a common size.
Once these representations are obtained, the collection is clustered using SigClust (effectively a k-means based approach) or ParkTree (a balanced, hierarchical k-Tree). We demonstrate the success of our approach on the largest datasets available from the SILVA project, showing that our methods provide clustering quality comparable to those of a number of standard methods while offering performance an order of magnitude faster than these alternatives. Through the use of synthetic data, we show that this new approach is able to handle even extreme scale data sets in convenient timeframes, allowing rapid analyses to be performed over collections of tens of millions of reads in a single run.
[Joint work with Shlomo Geva, Timothy Chappell and Dimitri Perrin. SigClust and ParkTree are available from https://github.com/tchappell/SigClust and https://github.com/tchappell/parktree respectively.]
 
Host: Prof. Dr. Jens Stoye