Arranging and analyzing large amounts of genome data
Over the last few decades, genome data has accumulated at an enormous rate, driven, for example, by population-scale sequencing projects and genome-wide association studies. This accumulation has several consequences. First, one needs to develop data structures that support efficient handling of the data. Second, one needs to provide methodology that supports appropriate arrangement and analysis of the data. Third, one needs to watch closely for the tipping points at which the sheer amount of data triggers a paradigm shift. For example, in certain areas of application the amount of data may now be large enough for modern artificial intelligence approaches to become applicable. Another prominent example is computational pangenomics, which is concerned with data structures that store thousands of genomes as graphs instead of as (thousands of) individual sequences, which reduces storage requirements by several orders of magnitude.
Previous and Current Research
Our group has traditionally been concerned with the analysis of pangenome graphs that support convenient handling of the genomes generated in population-scale studies. In addition, our group has been investing in advanced deep learning techniques for analyzing genome data sets that have reached their tipping points in terms of sheer abundance and availability.
In more detail, we have recently been developing assembly and computational pangenomics techniques that enable the analysis of polyploid organisms and mixed samples, such as those arising in the analysis of pathogens. We have also invested in the use of artificial intelligence to disentangle the hereditary foundations of diseases with a genetic component, such as amyotrophic lateral sclerosis.
Future Projects and Aims
We are currently investigating how to develop "genome embeddings": data formats that, first, compress genome data, second, provide means to evaluate the complexity of genomes and the "richness" of populations, and third, support the seamless integration of genome data into advanced artificial intelligence approaches. We are further investigating how to exploit the accumulating multi-omics datasets in advanced recommendation system frameworks. Last but not least, we continue to develop programs that support the assembly of genomes from mixed samples, such as viral quasispecies or metagenome datasets.