Visualization and queries for biology in de Bruijn graphs

 Stage · Stage M2  · 6 mois    Bac+5 / Master   CRIStAL · Lille (France)


sequence, genomics, visualization, graph, embedding


Master Internship: Visualization and queries for biology in de Bruijn graphs
This internship takes place in Lille University (campus cité Scientifique, Villeneuve d'Ascq), and will be co supervised by members of team BONSAI (sequence bioinformatics, CRIStAL laboratory) and team LINKS (databases, logical queries, CRIStAL & Inria). It is funded by ANR Full-RNA (ANR-22-CE45-0007).
The de Bruijn graph is a fundamental data structure in computational biology, widely used for tasks such as genome assembly, variant calling, and k-mer set representation. Several fast construction methods for de Bruijn graphs (such as BCALM2[1], Bifrost[2] and recent GGCAT[3]) have popularized their usage in recent years. Bandage[4] is the reference software for visualizing de Bruijn graphs in bioinformatics using a graphical user interface (GUI), but is primarly designed to explore genome assemblies. Prior to Bandage, other works relied on tools like Cytoscape or ad-hoc solutions[5], but none integrated specificities of de Bruijn graphs, such as the possibility to compact the graph into a unitig graph for a more compact representation, or to have a semantic on the nodes' colors (in so-called colored de Bruijn graphs).
Teams BONSAI and LINKS introduced Vizitig, a tool for visualizing and manipulating de Bruijn graphs. Vizitig is implemented using NetworkX, making it compatible with Python ecosystems. Users can explore the graph, search for sequences, and extract relevant information. In particular, colored de Bruijn graphs are currently used as indexes of collections of sequences, to be queried with k-mers of interest. In many applications, these queries are not exactly identical to the graph's sequences (e.g. they include mutations) and ask for a relaxed matching.
We would like to compare, both theoretically and in practice, different approaches to find queries in such graphs, and to implement a solution in Vizitig. Different solutions include sequence alignment on graphs, for which heuristics and optimal algorithms exists[6], and novel methods developed by the collaboration in Lille, based on graph embedding, a machine learning approaching to graph mining[7]. The intern would have opportunities to contribute to Vizitig's development, as well as its benchmarking while integrating query solutions.
Applicants ideally have a background in either programming/computer science/bioinformatics.
Contacts: camille.marchet, charles.paperman, mikael.salson at univ-lille
[1] Chikhi, R., Limasset, A., & Medvedev, P. (2016). Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32(12), i201-i208.
[2] Holley, G., & Melsted, P. (2020). Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome biology, 21(1), 1-20.
[3] Cracco, A., & Tomescu, A. I. (2023). Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT. Genome Research, gr-277615.
[4] Wick, R. R., Schultz, M. B., Zobel, J., & Holt, K. E. (2015). Bandage: interactive visualization of de novo genome assemblies. Bioinformatics, 31(20), 3350-3352.
[5] Jaillard, M., Lima, L., Tournoud, M., Mahé, P., Van Belkum, A., Lacroix, V., & Jacob, L. (2018). A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events. PLoS genetics, 14(11), e1007758.
[6] Rautiainen, M., & Marschall, T. (2020). GraphAligner: rapid and versatile sequence-to-graph alignment. Genome biology, 21(1), 253.
[7] Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 855-864).


Procédure :

Date limite : None


Camille Marchet

Offre publiée le 19 septembre 2023, affichage jusqu'au 30 novembre 2023