Candidate gene prioritization using knowledge graphs and AI

 Fixed-term contract · PhD thesis · 36 months · Master's degree (Bac+5) · University of Montpellier; LIRMM and IRD · Montpellier (France) · 2135 euros gross / month

 Start date: 1 November 2023


Deep Learning, Graph Neural Network, Bioinformatics, Gene prioritization, Gene Regulation Networks, knowledge graphs, neuro-symbolic AI


Context: To meet the global demand for food in the context of climate change, a better understanding of agronomically important traits, such as yield, quality, and resistance to abiotic and biotic stresses, is crucial to improving crop production capacity. Deciphering the molecular mechanisms that drive a particular trait is one of the most critical research areas in biology. However, these genotype-phenotype interactions are difficult to identify because they occur at different molecular levels in the plant and are strongly influenced by environmental factors (e.g., climate change). Biologists struggle to find relevant information because it is often dispersed across several online databases, each with its own data model, scale, and means of access. Today's major challenges are therefore to develop methods that integrate these heterogeneous data and enrich biological knowledge. Scientists also need methods to mine this mass of data and highlight the information that identifies key genes. To this end, we developed the AgroLD platform [1], a knowledge graph that uses Semantic Web technologies to integrate heterogeneous agronomic data from the genome to the phenome (i.e., from the set of genes to the set of phenotypes observed in a plant organism). AgroLD is under active development. To date, it contains more than 900 million triples resulting from the integration of around 100 datasets gathered in 33 named graphs.
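To make the notions of triples and named graphs concrete, here is a minimal sketch in plain Python. It does not query AgroLD and does not use its vocabulary: every identifier (gene:G1, encodes, graph:genes, and so on) is hypothetical and chosen only for illustration. It shows how a gene-to-trait path could be followed across facts that originate from different named graphs.

```python
from collections import defaultdict

# Hypothetical quads: (subject, predicate, object, named_graph).
# All identifiers are invented for illustration; they are not AgroLD URIs.
QUADS = [
    ("gene:G1", "encodes", "protein:P1", "graph:genes"),
    ("protein:P1", "involved_in", "trait:drought_tolerance", "graph:annotations"),
    ("gene:G2", "encodes", "protein:P2", "graph:genes"),
    ("protein:P2", "involved_in", "trait:grain_yield", "graph:annotations"),
]


def build_index(quads):
    """Index outgoing (predicate, object) pairs by subject, across all named graphs."""
    index = defaultdict(list)
    for subject, predicate, obj, _graph in quads:
        index[subject].append((predicate, obj))
    return index


def genes_linked_to(trait, quads):
    """Return genes that reach `trait` through an encodes -> involved_in path."""
    index = build_index(quads)
    hits = []
    for subject, edges in index.items():
        for predicate, protein in edges:
            if predicate != "encodes":
                continue
            # .get avoids inserting new keys into the defaultdict while iterating
            if ("involved_in", trait) in index.get(protein, []):
                hits.append(subject)
    return sorted(hits)


print(genes_linked_to("trait:drought_tolerance", QUADS))
```

In a real setting the same question would presumably be asked with a SPARQL query against the knowledge graph rather than with an in-memory list.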

The thesis is proposed within the framework of the DIG-AI ANR project, which aims to develop machine learning methods combined with knowledge graphs such as AgroLD to study the molecular interactions driving phenotype development in crops.

Objective 1: The current challenges relate to the development of methods for the functional analysis of genes, and in particular methods for prioritizing candidate genes. Indeed, the data integrated from databases are incomplete, heterogeneous, and insufficient to infer gene function with good accuracy. One of the first objectives of the thesis will therefore be to develop knowledge extraction methods that extract functional information on genes from scientific documents.

Objective 2: The recent success of graph neural networks (GNNs) suggests the possibility of systematically incorporating multiple sources of information into a heterogeneous network and learning the nonlinear relationship between phenotypes and genes [2]. However, knowledge graphs like AgroLD can be complex and contain noisy information. As proposed by [3, 4], some GNN models can reduce the influence of noisy data on overall prediction performance by assigning low weights to unreliable nodes and edges. The second objective will be to develop an approach adapted to the AgroLD context by building meaningful representations from high-dimensional, complex gene data.
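The idea of down-weighting unreliable edges can be illustrated with a small sketch. This is not the method of [3] or [4]: it is a generic attention-style aggregation in which each edge carries a reliability score, the scores are normalized with a softmax, and a node's new representation is the weighted mean of its neighbors' features. The feature vectors and scores are hypothetical toy values; in a trained model the scores would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(neighbor_features, edge_scores):
    """Weighted mean of neighbor feature vectors, with weights = softmax(edge_scores)."""
    weights = softmax(edge_scores)
    dim = len(neighbor_features[0])
    out = [0.0] * dim
    for weight, features in zip(weights, neighbor_features):
        for i in range(dim):
            out[i] += weight * features[i]
    return weights, out


# Two consistent neighbors and one outlier reached through a low-scoring edge
# (hypothetical values).
neighbors = [[1.0, 0.0], [1.0, 0.2], [-5.0, 9.0]]
scores = [2.0, 2.0, -4.0]

weights, h = aggregate(neighbors, scores)
print("edge weights:", [round(w, 4) for w in weights])
print("aggregated representation:", [round(x, 4) for x in h])
```

Because the third edge receives a weight close to zero, the aggregated representation stays close to the two consistent neighbors even though the outlier's features are large.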

Objective 3: Finally, building on previous candidate gene studies in the biomedical field [5, 6], and because inferring gene regulatory networks (GRNs) can be formulated as a link prediction problem for GNNs [7], the third objective will be to apply GNN models to candidate gene prioritization and GRN inference in order to answer biological questions related to the adaptation of crops to drought stress and plant diseases.
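Framing GRN inference as link prediction can be sketched as follows. The example assumes that node embeddings are already available (in practice they would be produced by a trained GNN encoder; here they are hand-written, hypothetical values) and scores a candidate regulator-target edge as the sigmoid of the dot product of the two embeddings. This decoder is one common choice among several, and the names TF1, geneA, geneB and geneC are invented.

```python
import math

# Hypothetical 2-dimensional embeddings; a real pipeline would learn these.
EMBEDDINGS = {
    "TF1": [1.0, 0.5],
    "geneA": [0.9, 0.4],
    "geneB": [-0.8, 0.1],
    "geneC": [0.1, 0.0],
}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def edge_score(source, target, embeddings=EMBEDDINGS):
    """Probability-like score for the edge source -> target (dot-product decoder)."""
    dot = sum(a * b for a, b in zip(embeddings[source], embeddings[target]))
    return sigmoid(dot)


def rank_targets(regulator, candidates, embeddings=EMBEDDINGS):
    """Rank candidate target genes by decreasing predicted edge score."""
    scored = [(gene, edge_score(regulator, gene, embeddings)) for gene in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_targets("TF1", ["geneA", "geneB", "geneC"])
for gene, score in ranking:
    print(f"{gene}: {score:.3f}")
```

The same ranking step is what turns link prediction into prioritization: the highest-scoring candidates are the ones proposed for follow-up.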


1. Venkatesan A, Tagny Ngompe G, Hassouni NE, Chentli I, Guignon V, Jonquet C, et al. Agronomic Linked Data (AgroLD): A knowledge-based system to enable integrative biology in agronomy. PLOS ONE. 2018;13:1–17. 

2. Zhang X-M, Liang L, Liu L, Tang M-J. Graph Neural Networks and Their Current Applications in Bioinformatics. Front Genet. 2021;12.

3. Neil D, Briody J, Lacoste A, Sim A, Creed P, Saffari A. Interpretable Graph Convolutional Neural Networks for Inference on Noisy Knowledge Graphs. arXiv:1812.00279 [cs, stat]. 2018.

4. Li X, Saude J. Explain Graph Neural Networks to Understand Weighted Graph Features in Node Classification. arXiv:2002.00514 [cs]. 2020.

5. Alshahrani M, Hoehndorf R. Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes. Bioinformatics. 2018;34:i901–7.

6. Chen J, Althagafi A, Hoehndorf R. Predicting candidate genes from phenotypes, functions and anatomical site of expression. Bioinformatics. 2021;37:853–60.

7. Gligorijević V, Barot M, Bonneau R. deepNF: deep network fusion for protein function prediction. Bioinformatics. 2018;34:3873–81.


Application procedure: Applications must be sent before 23 June 2023 and require the following documents: 1) motivation letter, 2) CV (2 pages max), 3) M1 and M2 academic transcripts, 4) references if possible. Applications are to be sent by email to: and

Deadline: 23 June 2023


Pierre Larmande

Offer published on 23 May 2023, displayed until 23 June 2023