M2 internship - Coding/non-coding RNA and AI methods
Internship · M2 internship · 6 months · Bac+5 / Master · Institut de Biochimie et Génétique Cellulaire (IBGC) · Bordeaux (France) · statutory internship compensation
Start date: 3 February 2025
Keywords
non-coding RNA, feature detection, machine learning, classification
Description
Deciphering features of coding and non-coding RNAs to improve ML-based transcript classification
Scientific background
A large fraction of the human genome is non-coding: protein-coding genes account for only about 3% of it. Remarkably, nearly 80% of the genome is nevertheless transcribed into various types of RNA of different sizes and functions, including a large and diverse group of long non-coding RNAs (lncRNAs). Recent studies have highlighted the roles of specific lncRNAs in regulating gene transcription and shaping cell identity, sparking increased interest in 1) completing the lncRNA repertoire and 2) elucidating the biological functions of lncRNAs. Like mRNAs, lncRNAs are transcribed by RNA polymerase II and are frequently spliced and polyadenylated, which makes them difficult to tell apart from mRNAs. Several machine learning (ML) tools have been developed for this purpose, mostly relying on the coding potential of the sequences and their secondary structure. However, these tools struggle to accurately classify small or fragmented RNA segments produced by de novo RNA assembly. Furthermore, the number of newly identified sequences increases every year, which strongly suggests that the lncRNA catalog is still incomplete.
Objectives
Building on published approaches and current work in the team, we are interested in tackling this transcript classification problem. We hypothesize that the difficulty in computational identification of lncRNAs is partly due to unidentified characteristics in their sequence. The aim of the internship is to fill this gap by extracting and understanding specific characteristics that discriminate coding from non-coding RNA sequences. More specifically, the student will work to:
(1) Statistically investigate apparent features (GC content, k-mer frequencies, secondary structure, etc.) in coding and non-coding sequences.
(2) Identify and compare machine learning tools previously used for transcript classification, and benchmark them on a test dataset.
(3) Study classification discrepancies to gain insights into why previous approaches underperform.
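As an illustration of the kind of sequence features mentioned in objective (1), here is a minimal Python sketch that computes GC content and normalized k-mer frequencies for a single transcript. It is not the team's pipeline: the function names `gc_content` and `kmer_frequencies` are hypothetical, and the choices made here (ignoring ambiguous bases such as N, treating U as T, defaulting to k = 3) are assumptions for the example only.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper().replace("U", "T")
    # Assumption: ambiguous symbols (e.g. N) are ignored, not counted.
    counted = [base for base in seq if base in ALPHABET]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def kmer_frequencies(seq: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of every possible k-mer, using overlapping windows."""
    seq = seq.upper().replace("U", "T")
    valid = set(ALPHABET)
    # Count overlapping windows; skip any window containing an ambiguous base.
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    total = sum(counts.values())
    # Report all 4**k k-mers so that every sequence yields a vector of equal length.
    return {
        "".join(kmer): (counts["".join(kmer)] / total if total else 0.0)
        for kmer in product(ALPHABET, repeat=k)
    }


if __name__ == "__main__":
    example = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
    print(f"GC content: {gc_content(example):.3f}")
    freqs = kmer_frequencies(example, k=3)
    top = sorted(freqs.items(), key=lambda item: item[1], reverse=True)[:5]
    print("Most frequent 3-mers:", top)
```

Feature vectors of this kind, computed separately for annotated coding and non-coding transcripts, could then be compared with standard statistical tests or used as classifier inputs.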
Skills
We are looking for either a biology student with a strong interest in data analysis and bioinformatics, or a bioinformatics student specializing in omics with basic knowledge of AI. The following skills are a plus:
- Knowledge of cellular, genome and RNA biology
- Knowledge of omics concepts and terminologies (transcriptomics, genomics)
- Knowledge of biological and statistical analysis of high-throughput data
- Python programming
- Basic knowledge of AI
- Mastery of the Unix/Linux environment and the Bash language
Environment
The internship will take place within the Computational Biology and Bioinformatics (CB&B) team at the IBGC in Bordeaux, under the supervision of Daniel García-Ruano and Domitille Chalopin-Fillot. The CB&B team, directed by Dr. Macha Nikolski, is multidisciplinary and brings together engineers, PhD students, postdocs and associate professors from both the biology and the computer science fields.
Team website: https://bordeaux-bioinformatics.fr/
Contacts
Daniel Garcia Ruano (daniel.garciaruano@ibgc.cnrs.fr) and Domitille Chalopin (domitille.chalopin-fillot@u-bordeaux.fr).
Duration of the internship: 6 months
Amount of compensation: statutory internship compensation.
Application
Procedure: Send an email with a CV and a cover letter to daniel.garciaruano@ibgc.cnrs.fr and domitille.chalopin-fillot@u-bordeaux.fr
Deadline: 13 December 2024
Offer published on 19 November 2024, displayed until 13 December 2024