PhD position on Causal Inference and Machine Learning methods for heterogeneous biological data
Fixed-term contract · PhD thesis · 36 months · Master's degree (Bac+5) · Institut Curie · Paris (France)
Starting date: 1 October 2023
Machine learning · Causal inference · Heterogeneous data · Cellular graph representations · 3D images · Single-cell transcriptomics · Developmental biology
Description of the PhD thesis project
This PhD project concerns the development of causal inference and machine learning methods to analyze heterogeneous biological data, with application to 4D imaging and single-cell omics of multicellular systems.
Time-lapse imaging microscopy and single-cell transcriptomics, now routinely used in cell and developmental biology labs, produce massive amounts of video images and gene expression data at single-cell resolution. However, this wealth of heterogeneous data remains largely under-explored due to the lack of unsupervised methods and tools to analyze it. This highlights the need to develop new Machine Learning and Artificial Intelligence strategies to better exploit the richness and complexity of the information contained in space- and time-resolved cell and developmental biology data.
The Isambert lab has developed novel causal inference methods and tools (https://miic.curie.fr, MIIC R package) to learn cause-effect relationships in a variety of biological or biomedical datasets, from single-cell transcriptomic and genomic data (1-3) to medical records of patients (4-6). These machine learning methods combine multivariate information analysis with interpretable graphical models (6-8) and outperform other methods on a broad range of benchmarks, achieving better results with ten to a hundred times fewer samples. These methods have also recently been adapted to analyze time series data such as live-cell time-lapse images of “tumor-on-chips”, which are micro-tumors reconstituted in vitro (9).
The present PhD project will extend these causal inference and unsupervised Machine Learning methods to analyze large-scale heterogeneous data, with applications to time-lapse 3D imaging (i.e. 4D imaging) and single-cell transcriptomic data on 3D multicellular systems such as “gastruloids”, which are early mammalian development models derived from embryonic stem cells. The work will be carried out in collaboration with our biologist and biophysicist partners from the multidisciplinary MecaCell3D consortium.
The first application will use cellular graph representations, where nodes correspond to cells and edges to cell-cell contacts. These graphs offer a very sparse and scalable data structure to embed a variety of cell features on the nodes (e.g. cell shape, volume, pressure, protein expression levels, etc.) and cell interface-specific features on the edges (e.g. cell polarity, contact area, inferred mechanical forces, membrane/cortical protein expression levels, etc.). Such integrative graph representations, which also retain all relative positions of the cells in 3D multicellular systems, will facilitate the application of our inference methods and extend their scope to uncover the interdependence and possible causal relations between all extractable features from 3D multicellular systems. The method will be applied to interpret morphodynamic data on “gastruloids” and developing “organoids”, which are early micro-organs grown in vitro, in collaboration with the Turlier lab (Collège de France), Baroud lab (Inst. Pasteur and Polytechnique) and Lenne lab (Marseille).
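As an illustration of the cellular graph representation described above, here is a minimal Python sketch, not the lab's actual implementation. The class name `CellGraph`, its methods, and all feature names and values (`volume`, `position`, `contact_area`) are hypothetical and chosen only for this example. It stores cell features on nodes and interface features on edges, then flattens the graph into one row per cell-cell contact, a tabular form that a downstream dependence or causal analysis could take as input.

```python
from dataclasses import dataclass, field


@dataclass
class CellGraph:
    """Sparse cell-contact graph: nodes are cells, edges are cell-cell contacts."""

    node_features: dict = field(default_factory=dict)  # cell id -> feature dict
    edge_features: dict = field(default_factory=dict)  # frozenset({a, b}) -> feature dict

    def add_cell(self, cell_id, **features):
        """Register a cell and its node features (e.g. volume, 3D position)."""
        self.node_features[cell_id] = dict(features)

    def add_contact(self, a, b, **features):
        """Register an undirected contact between two known cells."""
        if a == b:
            raise ValueError("a cell cannot be in contact with itself")
        if a not in self.node_features or b not in self.node_features:
            raise KeyError("both cells must be added before their contact")
        self.edge_features[frozenset((a, b))] = dict(features)

    def neighbors(self, cell_id):
        """Return the sorted ids of the cells touching `cell_id`."""
        return sorted(
            next(iter(pair - {cell_id}))
            for pair in self.edge_features
            if cell_id in pair
        )

    def feature_table(self):
        """Flatten to one row per contact: both cells' features plus edge features."""
        rows = []
        for pair, edge in self.edge_features.items():
            a, b = sorted(pair)  # assumes cell ids are mutually comparable
            row = {"cell_a": a, "cell_b": b}
            row.update({f"a_{k}": v for k, v in self.node_features[a].items()})
            row.update({f"b_{k}": v for k, v in self.node_features[b].items()})
            row.update({f"edge_{k}": v for k, v in edge.items()})
            rows.append(row)
        return rows


# Toy example with made-up values: three cells, two contacts.
g = CellGraph()
g.add_cell(0, volume=1200.0, position=(0.0, 0.0, 0.0))
g.add_cell(1, volume=950.0, position=(10.0, 0.0, 0.0))
g.add_cell(2, volume=1100.0, position=(0.0, 10.0, 0.0))
g.add_contact(0, 1, contact_area=35.0)
g.add_contact(0, 2, contact_area=20.0)

table = g.feature_table()
```

Only existing contacts are stored, so memory grows with the number of cell-cell interfaces rather than with the square of the number of cells, which is what makes the representation sparse and scalable. Keeping `position` as a node feature preserves the relative 3D arrangement of the cells.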
The second application will combine causal inference analyses from single-cell transcriptomic data and from morphodynamic features extracted from live-cell images. The method will be applied to analyze “tumor-on-chip” ecosystems, in collaboration with MC Parrini (Institut Curie), and mouse “gastruloids”, in collaboration with the Lescroart lab (Marseille Medical Genetics). The analysis of patient-derived tumor-on-chips will aim at evaluating the determinants of the efficacy of immunotherapies for the patients providing the tumor specimens. The analysis of mouse gastruloids will combine single-cell transcriptomic data with clonal history (barcodes) in order to identify the cell trajectories in time and space that lead to the formation of the gastruloids, and to compare them with the lineage tree in the mouse embryo.
Expected profile of the candidate
Applicants should have a strong background in machine learning, computer science or physics and a keen interest in analyzing complex heterogeneous data of biological interest. Applicants should be proficient in programming and willing to interact with scientists from different disciplines, from data scientists and biophysicists to cell and developmental biologists. Applicants are expected to show a clear capacity for independent and creative thinking. Experience in causal inference analysis is a plus but not required, as long as the applicant has a strong motivation to learn.
Related publications (available as pdf from the link to the PhD project)
1. Verny L, Sella N, Affeldt S, Singh PP, Isambert H, Learning causal networks with latent variables from multivariate information in genomic data. PLoS Comput. Biol. 13(10):e1005662 (2017).
2. Sella N, Verny L, Uguzzoni G, Affeldt S, Isambert H, MIIC online: a web server to reconstruct causal or non-causal networks from non-perturbative data. Bioinformatics 34(13):2311-2313 (2018).
3. Desterke C, Petit L, Sella N, Chevallier N, Cabeli V, Coquelin L, Durand C, Oostendorp RAJ, Isambert H, Jaffredo T, Charbord P, Inferring gene networks in bone marrow Hematopoietic Stem Cell-supporting stromal niche populations. iScience 23(6):101222 (2020).
4. Cabeli V, Verny L, Sella N, Uguzzoni G, Verny M, Isambert H, Learning clinical network from medical records based on information estimates in mixed-type data. PLoS Comput. Biol. 16(5):e1007866 (2020).
5. Sella N, Hamy AS, Cabeli V, Darrigues L, Laé M, Reyal F, Isambert H, Interactive exploration of a global clinical network from a large breast cancer cohort. npj Digital Med. 5, 113 (2022).
6. Ribeiro-Dantas M, Li H, Cabeli V, Dupuis L, Simon F, Hettal L, Hamy AS, Isambert H, Learning interpretable causal networks from very large datasets, application to 400,000 medical records of breast cancer patients, arXiv (2023).
7. Li H, Cabeli V, Sella N, Isambert H, Constraint-based causal structure learning with consistent separating sets. Advances in Neural Information Processing Systems (NeurIPS) 32, 14257 (2019).
8. Cabeli V, Li H, Ribeiro-Dantas M, Simon F, Isambert H, Reliable causal discovery based on mutual information supremum principle for finite dataset. Why21 at Neural Information Processing Systems (NeurIPS) (2021).
9. Simon F, Comes MC, Tocci T, Dupuis L, Cabeli V, Lagrange N, Mencattini A, Parrini MC, Martinelli E, Isambert H, CausalXtract: a flexible pipeline to extract causal effects from live-cell time-lapse imaging data, preprint (2023).
Application procedure: Please send a complete CV, Master's transcripts with marks, and the name(s) of one or more references to firstname.lastname@example.org. Informal inquiries are welcome. Starting date: Sept-Oct 2023; the position will remain open until filled.
Application deadline: 30 June 2023
Offer published on 3 April 2023, displayed until 31 July 2023