M2 internship - Coding/non-coding RNA and AI methods | SFBI Société Française de Bioinformatique

M2 internship - Coding/non-coding RNA and AI methods

Internship · M2 internship · 6 months · Bac+5 / Master's level · Institut de Biochimie et Génétique Cellulaire (IBGC) · Bordeaux (France) · Statutory internship compensation

Start date: 3 February 2025

Keywords

non-coding RNA, feature detection, machine learning, classification

Description

Deciphering features of coding and non-coding RNAs to improve ML-based transcript classification

Scientific background

Most of the human genome is non-coding, with only 3% consisting of protein-coding genes. Remarkably, nearly 80% is nevertheless transcribed into various types of RNA of different sizes and functions, including a large and diverse group of long non-coding RNAs (lncRNAs). Recent studies have highlighted the roles of specific lncRNAs in regulating gene transcription and shaping cell identity, sparking increased interest in 1) completing the lncRNA repertoire and 2) elucidating their biological functions. Like mRNAs, lncRNAs are transcribed by RNA polymerase II and are frequently spliced and polyadenylated, which makes them difficult to tell apart from mRNAs. Several machine learning (ML) tools have been developed for this purpose, mostly relying on the coding potential of the sequences and their secondary structure. However, these tools struggle to accurately classify short or fragmented RNA segments produced by de novo RNA assembly. Furthermore, the yearly increase in newly identified sequences strongly suggests that the lncRNA catalog is still incomplete.

Objectives

Building on published approaches and current work in the team, we are interested in tackling this transcript classification problem. We hypothesize that the difficulty in computationally identifying lncRNAs is partly due to as yet unidentified characteristics of their sequences. The aim of the internship is to fill this gap by extracting and understanding specific characteristics that discriminate coding from non-coding RNA sequences. More specifically, the student will work to:

(1) Statistically investigate apparent features (GC content, k-mer frequencies, secondary structure, etc.) in coding and non-coding sequences.

(2) Identify and compare machine learning tools previously used for transcript classification, and benchmark them on a test dataset.

(3) Study classification discrepancies to gain insights into why previous approaches underperform.
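
As an illustration of the kind of sequence features mentioned in objective (1), here is a minimal Python sketch that computes GC content and normalized k-mer frequencies for a single transcript. It is not the team's pipeline: the function names `gc_content` and `kmer_frequencies`, the default k = 3, and the choice to treat U as T and to ignore ambiguous bases are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Return the fraction of G/C among the unambiguous bases of a sequence."""
    seq = seq.upper().replace("U", "T")  # assumption: RNA and DNA alphabets are merged
    valid = [base for base in seq if base in "ACGT"]
    if not valid:
        return 0.0
    return sum(base in "GC" for base in valid) / len(valid)


def kmer_frequencies(seq: str, k: int = 3) -> dict[str, float]:
    """Return the relative frequency of every possible k-mer over A, C, G, T.

    Overlapping windows are counted; windows containing ambiguous bases
    (for example N) are ignored. All 4**k k-mers appear in the result, so
    vectors from different transcripts have the same length and order.
    """
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


if __name__ == "__main__":
    toy_transcript = "AUGGCCAUUGUAAUGGGCCGC"  # hypothetical sequence, not real data
    print(f"GC content: {gc_content(toy_transcript):.2f}")
    freqs = kmer_frequencies(toy_transcript, k=3)
    print("Most frequent 3-mers:", sorted(freqs, key=freqs.get, reverse=True)[:3])
```

Feature vectors of this form could then be compared between coding and non-coding sets with standard statistical tests, or passed to a classifier as input.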

Skills

We are looking for either a biology student with a strong interest in data analysis and bioinformatics, or a bioinformatics student specializing in omics with basic knowledge of AI. The following skills are a plus:

Knowledge of cellular, genome and RNA biology
Knowledge of omics concepts and terminologies (transcriptomics, genomics)
Knowledge of biological and statistical analysis of high-throughput data
Python programming
Basic knowledge of AI
Mastery of the Unix/Linux environment and the Bash language

Environment

The internship will take place within the Computational Biology and Bioinformatics (CB&B) team at the IBGC in Bordeaux, under the supervision of Daniel García-Ruano and Domitille Chalopin-Fillot. The CB&B team, directed by Dr. Macha Nikolski, is a multidisciplinary group of engineers, PhD students, postdocs and associate professors from both biology and computer science.

Team website: https://bordeaux-bioinformatics.fr/

Contacts

Daniel García-Ruano (daniel.garciaruano@ibgc.cnrs.fr) and Domitille Chalopin-Fillot (domitille.chalopin-fillot@u-bordeaux.fr).

Duration of the internship: 6 months

Amount of compensation: statutory internship compensation.

Application

Procedure: Send an email with a CV and a cover letter to daniel.garciaruano@ibgc.cnrs.fr and domitille.chalopin-fillot@u-bordeaux.fr

Deadline: 13 December 2024

Offer published on 19 November 2024, displayed until 13 December 2024