M2 internship - Deep Mendelian Randomization: explaining causality between traits at genome-wide scale

Fixed-term contract (CDD) · M2 internship · 6 months · Bac+5 / Master's level · BioSTM - Faculté de Pharmacie de Paris - Université Paris Cité · Paris (France)

Start date: 1 February 2024


Keywords: deep learning, Mendelian randomization, causal inference, genome-wide association studies, pleiotropy


Deep Mendelian Randomization: explaining causality between different hereditary traits at genome-wide scale


In recent years, technological advances and the availability of large-scale genomic data have driven a marked expansion of genetic research [1]. Much of this work aims to unravel the causal relationships between complex traits and diseases, which is instrumental in understanding the etiology of these conditions and in developing more effective therapeutic interventions.

Traditionally, instrumental variable analysis has been the method of choice for drawing causal inferences about the impact of an exposure, such as a risk factor, on an outcome using observational data. Over the last decade, this approach has grown in popularity, especially with the adoption of genetic variants as instrumental variables, a technique commonly referred to as Mendelian randomization (MR) [2]. Genetic variants are a rich source of potential instrumental variables, offering a powerful means to disentangle complex relationships between exposures and outcomes. However, one significant challenge in MR studies is pleiotropy, where a single genetic variant influences multiple traits or outcomes, and which we have shown to be pervasive [3]. Pleiotropy complicates the interpretation of causal relationships, because the observed association between the genetic variant and the outcome may stem from the variant's influence on a different, correlated trait rather than from the exposure under investigation. Additionally, the assumption of linear relationships between instruments, exposures, and outcomes may not hold in complex biological systems, potentially leading to biased estimates. Furthermore, linkage disequilibrium (LD), the correlation between genetic variants that are physically close to each other on a chromosome, poses another concern in MR. If a genetic variant used as an instrument for an exposure is in LD with other variants that influence the outcome, the estimated causal effect can be biased [4].
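To make the MR setting concrete, here is a minimal sketch of the inverse-variance weighted (IVW) estimator, a standard summary-statistics MR estimator chosen purely for illustration; it is not the method developed in this project. The function name, variable names, and numbers are hypothetical, and the sketch assumes independent variants (no LD) and no horizontal pleiotropy, which are precisely the assumptions discussed above.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Inverse-variance weighted MR estimate from per-variant summary statistics.

    Assumes independent instruments (no LD) and no horizontal pleiotropy.
    """
    # Per-variant Wald ratio: variant-outcome effect over variant-exposure effect.
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    # First-order weights: inverse variance of each Wald ratio.
    weights = [(bx / se) ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    std_error = math.sqrt(1.0 / total)
    return estimate, std_error

# Made-up summary statistics for four variants, built so that the
# outcome effect is exactly half the exposure effect.
bx = [0.10, 0.20, 0.15, 0.30]
by = [0.05, 0.10, 0.075, 0.15]
se = [0.01, 0.02, 0.01, 0.02]
theta, theta_se = ivw_estimate(bx, by, se)
```

When a variant also affects the outcome through another pathway, its Wald ratio no longer equals the causal effect, which is how pleiotropy biases this kind of estimate.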

In response to these challenges, deep learning has emerged in bioinformatics as a promising solution [5]. It offers several advantages, including the ability to exploit spatial dependencies among genetic variants. Deep learning also handles high-dimensional datasets well, a critical point in genetic research, and its flexibility allows it to approximate complex, non-linear relationships, a crucial capability for dissecting intricate genetic associations. Deep learning therefore has considerable potential to improve the precision and reliability of causal inference in MR analyses.

Through our ongoing efforts, we have developed Deep Learning (DL) methods to improve MR techniques, and we remain committed to refining them further and to proposing efficient frameworks. These frameworks aim to address the specific challenges of genetic research and, ultimately, to advance our understanding of causal relationships in this complex field.

Project description

Method development

We have introduced a novel framework based on Deep Learning (DL) aimed at overcoming key limitations in MR. This method leverages the Double Machine Learning (DML) technique, which enables conventional predictive machine learning models to be used for inferring causality between exposures and outcomes. Unlike conventional MR models, which typically focus on a single outcome, our approach handles multiple outcomes within a unified framework. The model is based on a neural network architecture with customized layers and loss functions. Furthermore, to ensure realistic data representation, we have proposed two simulation models (one linear and one partially linear) that account for LD blocks.

Despite these advances, the model is weak at accurately estimating causality parameters, possibly because it struggles to capture spatial dependence within the data. Additionally, the residuals estimated in the final step of the DML process exhibit some level of auto-correlation. Despite our optimization efforts, the model also remains computationally intensive.

During this internship, our primary goal is to advance the current work by introducing a robust framework capable of handling the complexities inherent in genetic data. This framework will address the limitations identified in the current model. Firstly, we will ensure that the simulated data closely mirror the characteristics of real-world genetic datasets, which includes accurately replicating LD patterns, polygenic effects, and trait distributions. This step serves as a benchmark for evaluating both our model and alternative methods. Next, we will focus on developing a customized model. This will involve a comprehensive exploration of different approaches, including various Machine Learning and Deep Learning methodologies coupled with the DML technique, and may involve tree-based or neural network-based models. Simultaneously, we will investigate strategies to optimize the computational performance of the model, such as parallel computing techniques or more efficient algorithms. In addition to performance, we will assess the robustness and stability of the method under data perturbations, using techniques such as bootstrapping or the introduction of different levels of noise.

Furthermore, we will integrate biological networks into our causal inference framework, either by incorporating network-based features or by adding a network regularization term to the model. This not only enhances the interpretability of causal relationships but also provides a biologically meaningful context for understanding the underlying mechanisms. Moreover, this integration has the potential to mitigate challenges associated with pleiotropy.
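For readers unfamiliar with DML, the following minimal sketch illustrates the generic cross-fitted, partially linear version of the technique: predict the exposure and the outcome from covariates with any learner, then regress the outcome residuals on the exposure residuals. It is an illustration under stated assumptions, not the project's neural network model. The function names, the binned-mean learner, and the simulated data (one covariate, true effect 0.5) are all hypothetical.

```python
import math
import random

def fit_binned_mean(x, y, n_bins=20):
    """Piecewise-constant regressor on [0, 1): a stand-in for any ML learner."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for xi, yi in zip(x, y):
        b = min(int(xi * n_bins), n_bins - 1)
        sums[b] += yi
        counts[b] += 1
    overall = sum(y) / len(y)
    # Empty bins fall back to the overall mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda xi: means[min(int(xi * n_bins), n_bins - 1)]

def dml_partially_linear(x, d, y, n_folds=2):
    """Cross-fitted DML estimate of theta in y = theta * d + g(x) + noise."""
    n = len(y)
    res_d = [0.0] * n
    res_y = [0.0] * n
    for k in range(n_folds):
        test = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        # Nuisance models are fitted on the training folds only.
        m_hat = fit_binned_mean([x[i] for i in train], [d[i] for i in train])
        l_hat = fit_binned_mean([x[i] for i in train], [y[i] for i in train])
        # Residuals are computed on the held-out fold.
        for i in test:
            res_d[i] = d[i] - m_hat(x[i])
            res_y[i] = y[i] - l_hat(x[i])
    # Final step: least-squares slope of outcome residuals on exposure residuals.
    num = sum(rd * ry for rd, ry in zip(res_d, res_y))
    den = sum(rd * rd for rd in res_d)
    return num / den

# Hypothetical partially linear simulation: exposure d and outcome y both
# depend non-linearly on x, and the true effect of d on y is 0.5.
rng = random.Random(0)
n = 4000
x = [rng.random() for _ in range(n)]
d = [xi ** 2 + rng.gauss(0.0, 1.0) for xi in x]
y = [0.5 * di + math.sin(3.0 * xi) + rng.gauss(0.0, 0.5) for xi, di in zip(x, d)]
theta_hat = dml_partially_linear(x, d, y)
```

The binned-mean learner is only a placeholder: any predictive model, such as the tree-based or neural network-based models mentioned above, could take its place without changing the residual-on-residual final step.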

Data availability

We have assembled an extensive dataset of publicly available summary statistics on components of the metabolic syndrome. These statistics were sourced from numerous genome-wide association studies covering traits and conditions such as coronary artery disease, type 2 diabetes, waist-hip ratio, and systolic blood pressure. Additionally, we have gathered summary statistics on serum metabolite levels from publicly accessible resources.


The contribution of this internship will be to develop Deep MR methods on a genome-wide scale. The Keras and/or TensorFlow frameworks provide powerful deep learning tools and will be the main means of developing the neural network. Importantly, the full code of the resulting methodology will be made publicly available and highlighted in scientific publications.

The successful candidate

The successful candidate should possess the following qualifications: (1) A Master's degree in computer science, computational biology, statistical genetics, bioinformatics, or an equivalent field. (2) Good programming skills, preferably in Python and/or R, along with an interest in machine learning and deep learning. (3) Strong communication skills and the ability to work effectively in a team. (4) Experience with genome-wide association studies, whole-genome sequencing data, or biological networks is advantageous but not a strict requirement.

Research group

The Master 2 internship will be supervised by Dr. Marie Verbanck, an Assistant Professor (Maître de Conférences) at Université Paris Cité, and a member of the BioSTM unit (Biostatistique, Traitement et Modélisation des données biologiques - UR 7537). Additionally, the internship will be co-supervised by Asma Nouira, a postdoctoral fellow at BioSTM, Université Paris Cité. BioSTM is a research group focused on Data Science, dedicated to developing state-of-the-art statistical methodologies to address real-world biological challenges. The team places a strong emphasis on promoting reproducible and open research practices.


To apply, please send a concise email describing your research interests and experience, together with an up-to-date CV, to Marie Verbanck (marie.verbanck@u-paris.fr) and/or Asma Nouira (asma.nouira@u-paris.fr). Names and contact details for references would be appreciated.


[1]  Teri A. Manolio and Francis S. Collins et al. Finding the missing heritability of complex diseases. Nature, 2009.
[2]  Stephen Burgess and Dylan S. Small et al. A review of instrumental variable estimators for Mendelian randomization. Statistical Methods in Medical Research, 2017.
[3]  Marie Verbanck and Chia-Yen Chen et al. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nature Genetics, 2018.
[4]  Stephen Burgess and Robert A. Scott et al. Using published data in Mendelian randomization: a blueprint for efficient identification of causal risk factors. European Journal of Epidemiology, 2015.
[5]  James Zou and Mikael Huss et al. A primer on deep learning in genomics. Nature Genetics, 2019.


Procedure:

Application deadline: 1 December 2023

Offer published on 14 September 2023, displayed until 1 December 2023