M2 internship - Coding/non-coding RNA and AI methods

 Internship · M2 internship · 6 months · Bac+5 / Master's level · Institut de Biochimie et Génétique Cellulaire (IBGC) · Bordeaux (France) · statutory internship compensation

 Start date: 3 February 2025

Keywords

non-coding RNA, feature detection, machine learning, classification

Description

Deciphering features of coding and non-coding RNAs to improve ML-based transcript classification

Scientific background

A significant fraction of the human genome is non-coding: protein-coding genes account for only 3% of it. Remarkably, nearly 80% of the genome is nevertheless transcribed into various types of RNA of different sizes and functions, including a large and diverse group of long non-coding RNAs (lncRNAs). Recent studies have highlighted the roles of specific lncRNAs in regulating gene transcription and shaping cell identity, sparking increased interest in 1) completing the lncRNA repertoire and 2) elucidating their biological functions. Like mRNAs, lncRNAs are transcribed by RNA polymerase II and are frequently spliced and polyadenylated, which makes them difficult to distinguish from coding transcripts. Several machine learning (ML) tools have been developed for this purpose, mostly relying on the coding potential of the sequences and their secondary structure. However, these tools struggle to accurately classify small or fragmented RNA segments from de novo RNA assembly. Furthermore, the yearly increase in newly identified sequences strongly suggests that the lncRNA catalog is still incomplete.

 

Objectives

Building on published approaches and current work in the team, we are interested in tackling this transcript classification problem. We hypothesize that the difficulty in computationally identifying lncRNAs is partly due to as yet unidentified characteristics of their sequences. The aim of the internship is to fill this gap by extracting and understanding the specific characteristics that discriminate coding from non-coding RNA sequences. More specifically, the student will:

(1) Statistically investigate apparent features (GC content, k-mer frequencies, secondary structure, etc.) in coding and non-coding sequences.

(2) Identify and compare machine learning tools previously used for transcript classification, and benchmark them on a test dataset.

(3) Study classification discrepancies to gain insight into why previous approaches underperform.
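
As an illustration of the kind of feature extraction mentioned in objective (1), here is a minimal Python sketch computing GC content and normalized k-mer frequencies for a single sequence. It is not taken from the team's code: the function names `gc_content` and `kmer_frequencies` are hypothetical, and the choices to treat U as T and to ignore ambiguous bases such as N are assumptions made for the example.

```python
from collections import Counter
from itertools import product
from typing import Dict


def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T/U)."""
    # Assumption: RNA input is accepted by mapping U to T.
    seq = seq.upper().replace("U", "T")
    valid = [base for base in seq if base in "ACGT"]
    if not valid:
        return 0.0
    return sum(base in "GC" for base in valid) / len(valid)


def kmer_frequencies(seq: str, k: int = 3) -> Dict[str, float]:
    """Relative frequency of every possible k-mer over the ACGT alphabet."""
    seq = seq.upper().replace("U", "T")
    # Count all overlapping windows of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate the 4**k possible k-mers so every sequence yields
    # a feature vector of the same length, in the same order.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing ambiguous bases (e.g. N) are excluded from the total.
    total = sum(counts[kmer] for kmer in all_kmers)
    return {
        kmer: (counts[kmer] / total if total else 0.0)
        for kmer in all_kmers
    }
```

Applied to every transcript in a coding set and a non-coding set, these per-sequence values could then be compared between the two groups with standard statistical tests, or used as input features for a classifier.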

 

Skills

We are looking for either a biology student with a strong interest in data analysis and bioinformatics, or a bioinformatics student specializing in omics with basic knowledge of AI. The following skills are a plus:

  • Knowledge of cellular, genome and RNA biology
  • Knowledge of omics concepts and terminologies (transcriptomics, genomics)
  • Knowledge of biological and statistical analysis of high-throughput data
  • Python programming
  • Basic knowledge of AI
  • Proficiency in the Unix/Linux environment and Bash scripting

 

Environment

The internship will take place within the Computational Biology and Bioinformatics (CB&B) team at the IBGC in Bordeaux, under the supervision of Daniel García-Ruano and Domitille Chalopin-Fillot. The CB&B team, directed by Dr. Macha Nikolski, is multidisciplinary and brings together engineers, PhD students, postdocs and associate professors from both biology and computer science.

Team website: https://bordeaux-bioinformatics.fr/

Contacts

Daniel Garcia Ruano (daniel.garciaruano@ibgc.cnrs.fr) and Domitille Chalopin (domitille.chalopin-fillot@u-bordeaux.fr).

Duration of the internship: 6 months

Amount of compensation: statutory internship compensation.

Application

Procedure: Send an email with a CV and a cover letter to daniel.garciaruano@ibgc.cnrs.fr and domitille.chalopin-fillot@u-bordeaux.fr

Deadline: 13 December 2024

Offer published on 19 November 2024, displayed until 13 December 2024