Predicting Gene-Phenotype Associations with a Knowledge Graph-based Approach

Internship · M2 internship · 6 months · Bac+4 (four years of higher education) · UMR 232 DIADE, CERES team · Montpellier (France) · 650 / month

Start date: 3 March 2025

Keywords

data integration, knowledge graphs, machine learning, gene prioritization, NLP

Description

Context: A better understanding of genotype-phenotype relationships requires integrating biological information of many kinds. However, this information is often scattered across several online databases with heterogeneous access methods. Biologists find these data difficult to analyze because the volume of information is hard to manage. In this context, we developed the AgroLD platform [1] ( www.agrold.org ). AgroLD is a knowledge graph database covering information on genes, proteins, molecular interactions, and some genetic and phenotypic studies for crop species including rice, Arabidopsis, wheat, and sorghum. AgroLD currently contains 900 million triples created by transforming more than 100 datasets from 15 sources, such as the rice databases of the South Green platform and international databases such as Gramene.org [2] for cereals.

The proposed project aims to develop computational solutions to improve the extraction of valuable information from massive and heterogeneous functional annotation data, available either as raw text or in databases. The project will focus on sorghum, a staple crop for more than 300 million people in semi-arid regions. Sorghum also has highly desirable properties for mitigating the effects of climate change, both in its original production areas and in Europe.


Objectives: The first objective will be to integrate new complementary datasets that can provide functional information. The second objective will be to develop text-mining methods to extract functional information on genes and functional traits from scientific publications. The third objective will be to consolidate and rank functional annotations related to genes or genomic regions using the AgroLD knowledge base. Finally, functional analysis methods (e.g. gene prioritization) will be developed and validated on published data. By augmenting AgroLD with new functional annotations, including those specific to sorghum, the internship aims to establish a comprehensive platform that supports novel aggregation and prioritization strategies applicable to a wide range of plant species.

 

Program:

- Extend data coverage to QTL/GWAS, expression and co-expression, interactome, and metabolic pathway information. 

- Develop text-mining methods using a corpus of scientific publications identified by the partners.

- Extend the data model to cover the new datasets.

- Develop functional analysis methods, including prioritization of candidate genes.

- Validate the functional analysis methods on a use case published in an international journal.
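To give candidates a concrete idea of what candidate-gene prioritization over a knowledge graph can look like, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method used in AgroLD: the triples, identifiers, predicate names, and evidence weights are all hypothetical, and a real workflow would query the AgroLD knowledge base rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object).
# All identifiers and predicate names are invented for illustration.
TRIPLES = [
    ("gene:G1", "located_in", "qtl:Q1"),
    ("gene:G2", "located_in", "qtl:Q1"),
    ("gene:G3", "located_in", "qtl:Q1"),
    ("gene:G1", "annotated_with", "trait:drought_tolerance"),
    ("gene:G1", "mentioned_with", "trait:drought_tolerance"),  # e.g. text-mined
    ("gene:G2", "coexpressed_with", "gene:G1"),
    ("gene:G4", "annotated_with", "trait:drought_tolerance"),
]

# Assumed weights per evidence type (arbitrary values for this sketch).
WEIGHTS = {"annotated_with": 1.0, "mentioned_with": 0.5}
# Assumed share of a co-expressed neighbour's direct score that is propagated.
COEXPRESSION_WEIGHT = 0.25


def candidates_in_region(triples, region):
    """Return the genes located in the given genomic region."""
    return sorted(s for s, p, o in triples if p == "located_in" and o == region)


def direct_scores(triples, trait):
    """Sum the weighted direct evidence linking each gene to the trait."""
    scores = defaultdict(float)
    for s, p, o in triples:
        if o == trait and p in WEIGHTS:
            scores[s] += WEIGHTS[p]
    return scores


def prioritize(triples, region, trait):
    """Rank the region's genes by direct plus co-expression evidence."""
    direct = direct_scores(triples, trait)
    total = {g: direct.get(g, 0.0) for g in candidates_in_region(triples, region)}
    for s, p, o in triples:
        if p == "coexpressed_with":
            # Co-expression is treated as symmetric: each gene receives a
            # fraction of its partner's direct evidence.
            if s in total:
                total[s] += COEXPRESSION_WEIGHT * direct.get(o, 0.0)
            if o in total:
                total[o] += COEXPRESSION_WEIGHT * direct.get(s, 0.0)
    # Highest score first; ties are broken by gene identifier.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = prioritize(TRIPLES, "qtl:Q1", "trait:drought_tolerance")
for gene, score in ranking:
    print(gene, score)
```

In this toy setting, the gene with both a curated annotation and a text-mined mention ranks first, a gene supported only through co-expression ranks second, and a gene with no evidence ranks last. Genes outside the region are ignored as candidates even when they carry trait annotations.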

References:

1. Venkatesan A, Tagny Ngompe G, Hassouni NE, Chentli I, Guignon V, Jonquet C, et al. Agronomic Linked Data (AgroLD): a Knowledge-based System to Enable Integrative Biology in Agronomy. PLoS ONE. 2018;13:17.

2. Tello-Ruiz MK, Naithani S, Stein JC, Gupta P, Campbell M, Olson A, et al. Gramene 2018: Unifying comparative genomics and pathway resources for plant research. Nucleic Acids Res. 2018.

 

Application

Procedure: Applications for this position (CV, motivation letter, latest grade report, references) will be accepted exclusively as a single PDF document sent by email to Pierre LARMANDE (firstname.lastname@ird.fr).

Deadline: None

Contacts

Pierre Larmande

 piNOSPAMerre.larmande@ird.fr

 https://sites.google.com/site/larmandepierre/positions/gene-prio-2025

Offer published on 15 September 2024, displayed until 13 December 2024