The Bloom Filter Trie: alignment-free and reference-free indexing of many similar sequences

Palestrante: Jens Stoye, Bielefeld University.

Data: 28 de setembro de 2016, 14h.

Local: Sala 402, Bloco H, IME, Campus Gragoatá, UFF.

Resumo: High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices.

We developed an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. The BFT was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, the BFT is up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory. We will also mention some other uses of the basic data structure in DNA sequence comparison and compression.

Nota: Trabalho em conjunto com Guillaume Holley e Roland Wittler. O professor Jens Stoye é professor visitante da Pós-graduação da Matemática - UFF - através do projeto PVE/Ciência sem Fronteiras/CAPES coordenado pela professora Simone Dantas.