A. Sczyrba: Computational Metagenomics
Over 99% of the microbial species observed in nature cannot be grown in pure culture, making them inaccessible to classical genomic studies. Metagenomics and single cell genomics are two approaches to studying this microbial ‘dark matter’.
Previous and Current Research
Metagenomics, the direct analysis of DNA from a whole environmental community, represents a strategy for discovering genes with diverse functionality. In the past, the identification of new genes with desired activities has relied primarily on relatively low-throughput function-based screening of environmental DNA clone libraries. Current sequencing technologies can generate more than 600 Gbp of sequence data in a single experiment, allowing sequence-based metagenomic discovery of complete genes or even genomes from environmental samples with moderate microbial species complexity.
The cow rumen metagenome, sequenced at the DOE Joint Genome Institute (JGI), is one of the largest metagenomic datasets from a single sample to date (>500 Gbp). The paucity of enzymes that efficiently deconstruct plant polysaccharides represents a major bottleneck for industrial-scale conversion of cellulosic biomass into biofuels. Cow rumen microbes specialize in the degradation of cellulosic plant material and are therefore a promising target for the identification of novel carbohydrate-active genes. Datasets of this size require high-throughput computational techniques to cope with the analysis of billions of sequencing reads. In collaboration with the JGI, we develop high-throughput gene-centric and de novo assembly pipelines for metagenomic datasets. In the case of the cow rumen dataset, we were able to identify more than 27,000 putative carbohydrate-active genes and assemble 15 uncultured microbial genomes using these pipelines.
A complementary approach to sequencing the DNA of a whole microbial community is single cell genomics. Over 99% of the microbial species observed in nature cannot be grown in pure culture, making it impossible to study them using classical genomic methods. DNA sequencing from single amplified genomes of individual cells is a novel approach in genome research that makes it possible to study the genomes of uncultured species from diverse environments. Selective collection techniques such as fluorescence-activated cell sorting (FACS), followed by multiple displacement amplification (MDA) and sequencing, can reveal the genomic sequence of isolated single cells. In this way, a sample can be enriched for one or several organisms.
Future Projects and Aims
Although we managed to assemble a large number of genes and genomes from a metagenome as complex as the cow rumen, there is still a need for metagenome-specific assemblers. Current short read assemblers were designed specifically for the assembly of isolate genomes, and metagenomic datasets pose a number of additional challenges for assembly. We are developing new tools and approaches for the metagenomic assembly problem.
For single cell genomics, we develop automated bioinformatic pipelines that support each step of the analysis. The tremendous bias in genome coverage introduced by the amplification technique poses a challenge for the bioinformatic analysis of the resulting sequence data. The raw data has to be pre-processed to achieve good assembly results. At each step, possible sample contamination has to be identified and removed, a difficult task if the target genome is not closely related to any previously sequenced genome. Finally, genome completeness can be estimated using single copy and core gene analyses.
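The idea behind a single copy gene analysis can be illustrated with a minimal Python sketch. It is not the group's pipeline. It assumes that a set of marker genes expected exactly once per genome is already known, and that marker hits in the assembly have already been identified by some upstream search step. The function name, the marker list, and the example hits are hypothetical and chosen only for illustration.

```python
from collections import Counter


def estimate_completeness(observed_markers, marker_set):
    """Estimate completeness and redundancy from single copy marker hits.

    observed_markers: iterable of marker gene IDs found in the assembly,
                      one entry per hit (assumed to come from an upstream search).
    marker_set:       set of marker IDs expected exactly once per complete genome.
    Returns (completeness, redundancy) as fractions of the marker set size.
    """
    if not marker_set:
        raise ValueError("marker_set must not be empty")

    # Count hits per marker, ignoring genes that are not in the marker set.
    counts = Counter(m for m in observed_markers if m in marker_set)

    # Completeness: fraction of expected markers seen at least once.
    completeness = len(counts) / len(marker_set)

    # Redundancy: extra copies of single copy markers, which may hint at
    # contamination or at assembly artifacts.
    extra_copies = sum(c - 1 for c in counts.values())
    redundancy = extra_copies / len(marker_set)

    return completeness, redundancy


# Illustrative marker set and hits (hypothetical example data).
markers = {"rpoB", "recA", "gyrB", "rpsC", "rplB"}
hits = ["rpoB", "recA", "recA", "rpsC", "unrelated"]

completeness, redundancy = estimate_completeness(hits, markers)
print(f"completeness={completeness:.0%}, redundancy={redundancy:.0%}")
```

In this sketch, markers found more than once are reported separately as redundancy, since duplicated single copy genes are one possible sign of the sample contamination mentioned above.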
A promising approach for future metagenomic studies is the combination of high-throughput metagenome sequencing and large-scale single cell genomics. Both data sources can be combined into bioinformatic “pan-genome” analyses to gain a better understanding of the phylogenetic composition of microbial communities, their population structure and functionality.