Keywords
microbial genomics
pangenomics
machine learning
Description
Prokaryotes (i.e. bacteria and archaea) are a remarkably diverse and ubiquitous group of living organisms. Their impact on the biosphere is immense: they influence human and animal health, soil and ocean biogeochemistry, and much more. Large-scale exploration of microbial genomes has helped uncover the molecular mechanisms underlying this diversity, in particular the role of Mobile Genetic Elements (MGEs).
In recent years, with the explosion of sequencing projects, several bioinformatics approaches have been developed based on the pangenome concept, offering solutions for efficiently managing and exploiting large quantities of data [1]. Pangenomics examines genetic variability across all available genomes of a given group, usually a species, rather than relying on a single reference genome or on pairwise comparisons. In terms of gene content, a distinction is made between the core genome, i.e. the genes present in all individuals, and the accessory (or variable) genome, i.e. the genes present in only some of the genomes, which are therefore likely to explain phenotypic differences. The development of pangenomic methods is thus a response to the challenge of massive data in biology, helping to understand the evolution of microorganisms in relation to epidemiological or environmental data.
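The core/accessory distinction can be illustrated with a minimal sketch. It assumes a toy input in which each genome is already reduced to a set of gene family identifiers; the genome and family names are hypothetical, and the strict "present in all genomes" rule follows the definition given above rather than the partitioning used by any particular tool.

```python
from collections import Counter


def split_core_accessory(genomes):
    """Split gene families into core and accessory sets.

    genomes: dict mapping a genome name to the set of gene family ids it
    contains (hypothetical toy representation).
    """
    # Count in how many genomes each gene family occurs.
    counts = Counter(fam for families in genomes.values() for fam in set(families))
    n_genomes = len(genomes)
    # Strict definition: a core family is present in every genome.
    core = {fam for fam, count in counts.items() if count == n_genomes}
    # Everything else is accessory (variable).
    accessory = set(counts) - core
    return core, accessory


# Hypothetical example data.
genomes = {
    "genome_A": {"famA", "famB", "famC"},
    "genome_B": {"famA", "famB", "famD"},
    "genome_C": {"famA", "famB", "famC", "famE"},
}
core, accessory = split_core_accessory(genomes)
print(sorted(core))       # families shared by all three genomes
print(sorted(accessory))  # families missing from at least one genome
```

In practice a strict threshold is sensitive to incomplete assemblies or missed gene calls, which is one reason real pangenome tools often relax it.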
For several years now, the LABGeM laboratory has been working on a model to represent genomic data as a pangenome graph at the gene family level, enabling the compression of information from thousands of genomes while preserving the chromosomal organization of genes. The PPanGGOLiN software suite [2] (awarded an Open Science Research Prize by the French Ministry of Research in 2023; >160 citations since 2020) has been developed to reconstruct and analyze pangenome graphs. It includes methods such as the identification of regions of genomic plasticity (panRGP method) [3] and their fine description in conserved modules (panModule method) [4], demonstrating their utility for identifying genomic islands and their MGEs. LABGeM is also developing PANORAMA, an innovative tool that exploits the pangenome graphs reconstructed by PPanGGOLiN to identify biological systems using rule-based algorithms, while detecting conserved genomic contexts across the pangenomes of different species.
Current methods for analyzing genomic contexts [5-6] have proven effective in predicting biological functions, but they scale poorly and cannot fully exploit the diversity of genomes available in databases. PANORAMA offers one of the first perspectives in comparative pangenomic analysis of genomic contexts across thousands of genomes, but it relies on predefined algorithmic rules to identify similar biological systems, which limits its ability to discover completely new ones. Transformer-based language models have proven effective in capturing large-scale semantic relationships through attention mechanisms [7] and are beginning to be used to predict and generate new genomic contexts [8-9].
This thesis proposes to exploit artificial intelligence methods, in particular language models, applied to pangenome graphs. By representing their contents as sequences of sentences, where each word corresponds to a functional unit encoded by a gene family, this approach opens up new prospects for revealing complex patterns through learning on large-scale datasets. This will make it possible to predict missing or uncertain annotations, offering insights into gene function and uncharacterized biological processes. The main objectives of this work will be to:
- build a dataset of annotated pangenome graphs at different functional levels, serving as a basis for model training and validation
- evaluate different machine learning methods, including language models, in order to identify the best performing approaches
- apply the developed method to the identification of new biological systems, such as metabolic pathways, macromolecular or defense systems.
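As an illustration of the sentence representation mentioned above, the following minimal sketch turns ordered lists of gene families into "sentences" and hides one "word" so that a model could be trained to recover it. The input format, the family names, the `[MASK]` token and all function names are assumptions made for this example; it is not the method that will be developed in the thesis.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token for a hidden gene family


def contigs_to_sentences(contigs):
    """Turn each contig (ordered list of gene family ids) into one sentence."""
    return [" ".join(families) for families in contigs]


def build_vocab(contigs):
    """Assign an integer id to every gene family; id 0 is reserved for MASK."""
    vocab = {MASK: 0}
    for families in contigs:
        for fam in families:
            vocab.setdefault(fam, len(vocab))
    return vocab


def mask_one(families, rng):
    """Hide one randomly chosen gene family; return (masked list, index, target)."""
    i = rng.randrange(len(families))
    masked = list(families)
    target = masked[i]
    masked[i] = MASK
    return masked, i, target


# Hypothetical toy data: gene families in chromosomal order on two contigs.
contigs = [
    ["famA", "famB", "famC", "famD"],
    ["famA", "famE", "famC"],
]
sentences = contigs_to_sentences(contigs)
vocab = build_vocab(contigs)
masked, index, target = mask_one(contigs[0], random.Random(0))
print(sentences)
print(masked, "-> family to predict:", target)
```

Predicting the hidden family from its neighbours is the kind of task through which missing or uncertain annotations could be inferred from genomic context.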
This work will benefit from the projects and developments carried out within the LABGeM team as well as the expertise in microbial metabolism of our research unit.
This project will be carried out in collaboration with Christophe Ambroise (LaMME laboratory, University of Evry) for statistics and artificial intelligence, and Guillaume Gautreau (MaIAGE, INRAE).
This thesis will take place in Evry and is funded by the CEA.
Candidate profile
- Master’s degree (or equivalent) in Bioinformatics or Computer Science
- Solid background in Machine Learning
- Good programming skills in Python
- Prior knowledge in microbial genomics is a plus, but not mandatory
References
[1] Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Brief Bioinform. 2016. doi:10.1093/bib/bbw089
[2] Gautreau G, et al. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLoS Comput Biol. 2020;16: e1007732. doi:10.1371/journal.pcbi.1007732
[3] Bazin A, et al. panRGP: a pangenome-based method to predict genomic islands and explore their diversity. Bioinformatics. 2020;36: i651–i658. doi:10.1093/bioinformatics/btaa792
[4] Bazin A, et al. panModule: detecting conserved modules in the variable regions of a pangenome graph. bioRxiv. 2021. p. 2021.12.06.471380. doi:10.1101/2021.12.06.471380
[5] Snel B, et al. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res. 2000;28: 3442–3444. doi:10.1093/nar/28.18.3442
[6] Zhang R, et al. De novo discovery of conserved gene clusters in microbial genomes with Spacedust. bioRxiv. 2024. doi:10.1101/2024.10.02.616292
[7] Vaswani A, et al. Attention Is All You Need. arXiv 2023. doi:10.48550/arXiv.1706.03762
[8] Hwang Y, et al. Genomic language model predicts protein co-regulation and function. Nat Commun. 2024;15: 2880. doi:10.1038/s41467-024-46947-9
[9] Nguyen E, et al. Sequence modeling and design from molecular to genome scale with Evo. Science. 2024;386: eado9336. doi:10.1126/science.ado9336