LLM for Drug Discovery: Your AI Biologist Assistant

LLM - Python - disease knowledge


Our Company

OPM is a biopharmaceutical company specialized in precision medicine. OPM's mission is to bring innovative therapeutic and diagnostic solutions to treat therapeutic resistance and metastasis evolution. The patient is at the center of our thinking, our unique innovative model, and our investments. For OPM, "our collective success is paramount": there can be no value creation without exchange and dialogue. For us, value creation results from reciprocity, i.e. balanced and fair exchanges at all levels, whether among internal collaborators or with our partners, therapists, patients, experts, and investors.


Over the past year, Large Language Models (LLMs) have taken the world by storm [1]. It was clear early on that these large models could transform many disciplines and fields of application (education, finance, law, and so on). The scientific community was immediately excited about the breakthroughs these models could bring to the medical field (disease understanding, target discovery, drug design, and so on). Applying them to the medical field, though, presents additional challenges: specialized language, high complexity, and domain-specific knowledge [2-3].

The advent of open-source models (LLaMA, Falcon) [4-5] has allowed the scientific community to investigate how to adapt these models to the biomedical field. Notable examples are PMC-LLaMA and ChatDoctor [6-7], in which pre-trained models were further trained on scientific abstracts or medical documents. To truly realize the potential of this scientific revolution, high-quality datasets are needed to train these models for specific medical applications.

Oncodesign Precision Medicine (OPM) is a company focused on identifying therapeutic targets in oncology and developing drugs against resistant cancers. At OPM, we have collected a large corpus of medical information (articles, patents, clinical trials, etc.) and hold a vast amount of oncology patient data collected over the years. This year we trained an LLM specifically on pancreatic cancer, showing how such a model can reach the state of the art on medical questions.

Building on this success, OPM wants to continue investigating the use of LLMs for the identification of new tumor therapeutic targets. The final goal is to train a model capable of providing information about a specific target, potential therapeutic opportunities, and disease understanding. The model is meant to assist in future target selection, investigation, and analysis. It will be tested in real-case scenarios as a biologist assistant and for completing a target dossier.

The objectives of the internship are to deliver:

  • A state-of-the-art review of LLM training, quantization, and deployment technologies.
  • An extensive review of methods that can be implemented.
  • Model fine-tuning and deployment. The resulting model should be able to answer questions about drug discovery, the competitive landscape, disease-related information, and so on.
  • A pipeline to integrate the model with different databases, with test tasks that show the LLM’s ability to find information from different sources.

Missions & activities of the internship

Under the supervision of a Senior Data Scientist with a PhD and an interdisciplinary background in artificial intelligence, immunology, mathematics, genetics, genomics, and bioinformatics, your duties will be the following:

  • Evaluate the state of the art in language models and LLM fine-tuning. Starting from the obtained baseline, we want to refine the LLM by implementing the latest technologies. We will explore strategies for tailoring a model focused on drug discovery, target, and disease knowledge.
  • Deployment of the model. Establish a pipeline for monitoring and deployment of the model (running on a server or alternatives).
  • Biology database integration. Integrate the model with a wide range of biological databases (mutations, pathways, patient data, and so on), so that it can query them, extract information from them, and therefore perform complex tasks.
  • Maintain Git code repositories with well-documented Python scripts and notebooks covering every analysis conducted.
  • Write a report summarizing your findings and contributions.

Expected student background and knowledge

M2 student with an educational background in a relevant field (computational biology, bioinformatics, artificial intelligence, or related).

Essential skills include programming, machine learning, and an understanding of key concepts in molecular biology.

Fluent in French and English.


