Integration of a priori biological knowledge in multi-omics analysis methods

 Internship · M2 internship · 6 months · Bac+5 / Master · Centre National de Recherche en Génomique Humaine (CNRGH) · Évry-Courcouronnes (France) · Yes

 Start date: 1 March 2023


Machine Learning · Statistics · Genomics · Multi-omics


Context. To fully grasp the complexity of a disease, biologists have access to a wide variety of measurements on the human genome, each shedding light on a particular aspect of the underlying molecular mechanisms. Studied separately, these measurements are often not enough to understand the dysregulation at work. This led to the establishment of multi-omics studies, which aim to merge the information coming from different modalities (omics) observed on the same set of genomes in the hope of capturing a meaningful signal.

With the rise of high-throughput sequencing, multi-omics studies, and the expectation that they will finally decipher the molecular mechanisms underlying complex diseases, have steadily moved from a mere concept 20 years ago to a standard experimental design today. Yet analysing such data is hard: each omics suffers from high dimensionality (the human genome comprises approximately 20,000 genes) and potentially missing observations, which, given the low number of samples, call for a better strategy than discarding samples with missing values. At the single-omics level, analysing such data already requires state-of-the-art Machine Learning techniques to overcome these issues. Analysing several omics jointly to unravel commonalities or interactions that would better explain a diagnosis is harder still, and requires new techniques that push the boundaries of Artificial Intelligence.

This has been the focus of the last decade, during which an astonishing number of multi-omics methods have appeared in the literature (Hesami et al., 2022). More recently, benchmarks have been published to assess the capabilities of all these tools (Cantini et al., 2021; Meng et al., 2016; Pierre-Jean et al., 2020; Rappoport and Shamir, 2018).

General Goal. Starting from there, the goal of this internship is to explore an avenue that we believe is overlooked: including more prior biological knowledge in existing tools, which are usually generic and can be used in other application fields.

Objectives. The first objective of this internship is to include biological knowledge in already existing tools.

Firstly, by including group knowledge at the variable level. This would both lead to more interpretable results and allow the model to select groups of variables, which is an interesting way of dealing with high dimensionality. For example, genes could be aggregated into groups corresponding to biological pathways according to classical databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2016) and the Human Metabolome DataBase (HMDB) (Wishart et al., 2013). This prior group knowledge can be included in several multi-omics methods through different strategies: either by dividing each omics into as many matrices as there are pathways, similarly to (Garali et al., 2018), or by encoding the group structure in an appropriate penalty term in the model, as in (Du et al., 2018; Guillemot et al., 2021; Löfstedt et al., 2016).

Secondly, by enlarging the list of methods in current benchmarks, especially in a subcategory of models called joint Dimension Reduction (jDR) models. jDR aims at estimating a lower-dimensional space describing the joint information between omics. However, in almost all benchmarks, the compared jDR methods require this information to be shared by all omics. This assumption helps gain statistical power by looking for phenomena common to all modalities; however, when a mechanism is shared by only a few omics, it may be too constraining and leave these methods unable to recover it. In (Smilde et al., 2022), a distinction is made between jDR methods that estimate a lower-dimensional space based only on Common (C) information across omics, on Common and Distinct (CD) information, or on Common, Distinct and Local (CDL; Local meaning specific to a subset of omics) information. This last category is a recent and active field, with methods such as (Lock et al., 2022; Park and Lock, 2020; Samorodnitsky et al., 2022; Yi et al., 2022) that we wish to include in current benchmark studies.

Combining the two approaches, that is, inserting a group structure and extracting CDL information, is even more interesting, as they are complementary. This would make it possible, for example, to extract subgroups of omics describing a specific interaction based on a small number of biological pathways.
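As an illustration only, the two strategies for injecting pathway groups mentioned in the first objective can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method of any of the cited papers: the helper names `split_by_pathway` and `group_soft_threshold` are hypothetical, the pathway dictionary is a toy stand-in for a KEGG-like annotation, and the groups are assumed not to overlap (real pathways often share genes, which calls for an overlapping-group penalty instead).

```python
import numpy as np

def split_by_pathway(X, gene_names, pathways):
    """Strategy 1: cut one omics matrix (samples x genes) into one block per pathway.

    `pathways` maps a pathway name to a list of gene names (toy annotation here).
    Genes absent from `gene_names` are silently skipped.
    """
    index = {g: j for j, g in enumerate(gene_names)}
    blocks = {}
    for name, genes in pathways.items():
        cols = [index[g] for g in genes if g in index]
        blocks[name] = X[:, cols]
    return blocks

def group_soft_threshold(w, groups, lam):
    """Strategy 2: proximal operator of the group-lasso penalty lam * sum_g ||w_g||_2.

    `groups` is a list of index lists forming non-overlapping groups.
    A whole group is set to zero when its Euclidean norm is at most `lam`,
    otherwise it is shrunk towards zero as a block.
    """
    w = np.asarray(w, dtype=float)
    out = np.zeros_like(w)
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        if norm > lam:
            out[idx] = (1.0 - lam / norm) * w[idx]
    return out
```

For instance, with weights `[3, 4, 0.1, 0.1]`, groups `[[0, 1], [2, 3]]` and `lam = 1`, the first group (norm 5) is shrunk to `[2.4, 3.2]` while the second group is removed entirely, which is the group-level selection effect described above.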

The second objective is to propose new multi-omics methods that make use of more biological information. A major avenue to be explored consists in adding constraints to the models specifying that some variables, even though measured in different omics modalities, are located in the same genomic region, a fact that is almost never taken into account in multi-omics methods. A first way to integrate this knowledge is to define a common scale for all variables, for example the gene scale, and to aggregate all omics to this same scale. Each omics would then be represented along two common dimensions, the sample dimension and the gene dimension, which makes it possible to work with thoroughly studied mathematical objects called tensors (Acar and Yener, 2009; Kolda and Bader, 2009).
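To make the gene-scale idea concrete, here is a minimal Python sketch of how several omics could be aggregated to a common gene axis and stacked into a three-way array (samples x genes x omics). Everything specific in it is an assumption made for illustration: the function names `aggregate_to_genes` and `stack_omics` are hypothetical, the feature-to-gene mapping is supposed to be given, and averaging is just one possible aggregation rule (the offer does not prescribe one).

```python
import numpy as np

def aggregate_to_genes(X, feature_to_gene, genes):
    """Aggregate an omics matrix (samples x features) to the gene scale.

    `feature_to_gene[j]` is the gene assigned to feature j (assumed known,
    e.g. from genomic coordinates). Features mapped to the same gene are
    averaged; genes with no feature in this omics are filled with NaN.
    """
    out = np.full((X.shape[0], len(genes)), np.nan)
    for k, gene in enumerate(genes):
        cols = [j for j, g in enumerate(feature_to_gene) if g == gene]
        if cols:
            out[:, k] = X[:, cols].mean(axis=1)
    return out

def stack_omics(gene_level_blocks):
    """Stack gene-level matrices of identical shape into a 3-way tensor."""
    return np.stack(gene_level_blocks, axis=-1)
```

Once every omics shares the same sample and gene axes, the resulting array can be handed to tensor decomposition methods such as those reviewed in the cited tensor literature. The NaN entries make explicit where an omics carries no measurement for a gene, which ties back to the missing-data issue raised in the context.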

Deliverable. All these developments would be built upon current state-of-the-art benchmarks in order to systematically evaluate the effect of integrating a new biological prior into existing models. Results will be made reproducible for the community, for instance in the form of an R package.

Ultimately, the developments would be evaluated in the context of exploratory analysis on open datasets such as TCGA or in collaborative projects of the CNRGH, such as the France Genomic Medicine Plan (Sanlaville, 2022) for personalized medicine in the field of cancer, or the PROPSY project, recently selected in the 2022 call for proposals for Prioritized Exploratory Research Projects, Programs and Equipments, which aims at identifying new biomarkers for four major mental disorders: Autism, Schizophrenia, Major Depressive Disorder and Bipolar Disorder.

Required Profile.
- M2 or final year of engineering school with a specialty in or knowledge of Computer Science / Statistics / Machine Learning / BioStatistics.
- Working knowledge of programming (R, Python, ...).
- Previous experience with applications to Genomics is a plus.


Application procedure: To apply, please submit a cover letter summarizing your research interests and expertise and a Curriculum Vitae (including contact information) to Arnaud GLOAGUEN and Edith LE FLOCH.

Application deadline: 23 December 2022



Offer published on 24 November 2022, displayed until 23 December 2022