ARN et Bioinformatiquedenise/CoursBioinfo/ARN-BIBS-CASM...ADN/ARN ADN ADN/ARN droit droit gauche...

Post on 17-Jun-2020

12 views 1 download

Transcript of ARN et Bioinformatiquedenise/CoursBioinfo/ARN-BIBS-CASM...ADN/ARN ADN ADN/ARN droit droit gauche...

ARN et Bioinformatique

Daniel GautheretV.2006.0 http://www.esil.univ-mrs.fr/~dgaut/Cours

Plan du cours

• Séance 1– Les molécules d’ARN– Structures des ARN

• Secondaire• Tertiaire

– La prédiction de structure. • Energies et covariation

• Séance 2– ARNomique

• Résultats expérimentaux, RFAM– Recherche d’ARN dans les génomes

• ARN connus• ARN inconnus

• Séance 3– TD: l’ARNt en 3D / Mfold / Erpin

1a. Molécules d’ARN

L’ARN messager

Les ARN non-messagers ou ARN non codants (ncRNA)

• ARNt• ARNr• ARNsn • ARNsno • micro-ARN• Autres ARN 4.5S, 10Sa, Spot42, DicF, MicF, OxyS, DsrA, 6S

(procaryotes) produits des gènes XIST, H19, IPW, 7H4, His-1, NTT (mammifères).

ARNr 18S et 28S: promoteurs pol-I. ARNt et ARNr 5S: promoteurs pol-III

ARN de transfert (tRNA)

• Potentiellement 64 différents dans un génomes

• En pratique une quarantaine dans les génomes microbiens. Qq centaines dans les génomes de mammifères.

• Le premier ARN observé en 3D (1974)

ARN ribosomique (rRNA)

• 5S, 16S, 23S pour les procaryotes (120, 1500 et 3000 bp)

• 5.5S, 18S et 28S pour les eucaryotes.

• Complexés avec protéines (~40)

• Observées en quelques exemplaires chez les procaryotes (7 opérons chez coli)

• Les génomes vertébrés peuvent accueillir plusieurs centaines de copies identiques. (clusters de gènes)

• Structure 3D en 1999Source: http://www.cytochemistry.net/Cell-biology/ribosome.htm

Small nuclear RNA (snRNA)

• Petits ARN du noyau impliqués dans l’épissage et la maintenance des télomères

• Complexés à des protéines• Plusieurs sortes nommées U1, U2, U4, U6,

U12…

Source: RFAM

Small nucleolar RNAs (snoRNAs)

• Guide de méthylation des rRNA• Eucaryotes et archae• Deux familles: boite C/D et boite H/ACA• Nomenclature: E2, E3, U3, U14, U23…

Source: plant snoRNA database http://bioinf.scri.sari.ac.uk/cgi-bin/plant_snorna/home

microRNA (miRNA)

Stem loop Stem loop precursor precursor (pre(pre--miRNA70nt)miRNA70nt)

miRNA miRNA transcription transcription unitunit

Dr

Dr

nucleus cytoplasm

()

mRNAmRNAtargettargetMature miRNAMature miRNA

(21(21--22 nt.)iceice 22 nt.)Translation Translation block or block or cleavagecleavage

Polycistronic Polycistronic transcript (pritranscript (pri--miRNA)

miRNA+RISC miRNA+RISC complexcomplex

miRNA)

miRNA

• Se trouvent chez les animaux et les plantes• 300 à 1000 chez l’homme• Un seul miRNA peut cibler 100 gènes ou plus• Recherche de miRNA

– Conservation– Structure

Les génomes à ARN

• Virus à ARN– Double brin, simple brin– En partie codant, en partie non-codant

Source: Wikipedia

1b. Structure des ARN

Les bases

Le ribose ou déoxyribose

O2’

C1’

C3’ C2’

C4’

C5’

O5’

O4’

O3’Credit: Richard Hallick. http://www.blc.arizona.edu/Molecular_Graphics/DNA_Structure/DNA_Tutorial.HTML

2’OH: ribose

Nucléosides

Credit: Richard Hallick. http://www.blc.arizona.edu/Molecular_Graphics/DNA_Structure/DNA_Tutorial.HTML

La chaine d’ADN/ARN

O2’

C1’

C3’ C2’

C4’

C5’

O5’

P

O3’

OBOA

baseO4’

O3’

Credit: Richard Hallick. http://www.blc.arizona.edu/Molecular_Graphics/DNA_Structure/DNA_Tutorial.HTML

Les degrés de liberté du nucléotide

ν0 à ν4 résumés par:Phase+amplitude

Le plissement du sucre (sugar pucker)En raison des interactions entre non covalentes entre substituants du cycle, les susbstituants ont tendance à se palcer le plus loin possible les uns des autres. En conséquence, 4 des 5 atomes du ribose sont approximativement dans un plan, mais le 5ème sort du plan de 0,5 Å environ. 4 conformations majeures: C2': endo, C3'-exo, C3'-endo, C2' -exo.Endo: coté base, exo: loin base

C3’ endo

C2’ endo

L’orientation de la base

N

N

ON

N

O

HH

H

HH

3'

3'G C– Effet sur l ’orientation

des brins (parallèle ou antiparallèle):

Contraintes sur les angles de torsion

– Roues conformationnelles

La double-hélice

(ADN)

Les paires Watson-Crick

La distance entre les deux points d’attachement des paires A - T et G - C sont identiques.

La même géométrie est compatible avec des pairesA – T et T – A ou G – C et C – G

Les sillons

majeur

mineur

Axe de l’hélice

Axe de pseudo-symétrie

Donneurs et accepteurs de ponts-H: l’identité des paires de bases

majeur majeur

mineurmineur

P

OH

OO

donneur

accepteur

(RNA)

Hélices A, B et Z

A: ADN/ARN B: ADN Z: ADN/ARN

Vues axiales des hélices A et B

Les types d’hélicesSelon: force ionique, solvents, degré d’hydratation.

A B Z

ADN/ARN ADN ADN/ARN

droit droit gauche2’endo 2’ endo (py)

3’endo (pu)20 Å 18 Åanti Anti (py)

Syn (pu)aucun12 Å plat

Sens hélice

Déplacement bp/axelargeurSillon majeur 9 Åprof

6 Å étroitlargeurSillon mineur 7,5 Å profondprof

11 10 12Nt/tour

Confo sucre 3’ endoDiamètre 26 Å

Liaison glycosidique anti4 Å3 Å

13,5 Å11 Å3 Å

L’empilement des bases (stacking)

Empilement dans un brin de type B

• L’empilement n’est pas imposé par la double hélice: il est l’un des principaux facteurs de stabilité des A.N.

• Cycles aromatiques séparés d’environ 4 Å

• Causes: hydrophobicité et interactions VdW

• Les séquences YpR et PrY ont un empilement différent: La séquence influe sur la stabilité et le forme de l’hélice.

Paramètres des hélices

On peut définir une double hélice avec les paramètres suivants:Tilt (theta-t): autour de l’axe pseudo-symTwist (t): tour/residu autour axe hélicepropeller Twist (theta-p): entre une base et son appariéeaxial rise (h): élévation/residuDislocation (D): distance entre centre bp et axe héliceRoll (theta-r): angle autour du 3ème axe (axe C6-C8).

ARN: les structures secondaires

Extrait ARNr 23S

Interactions secondaires et

tertiaires

Eléments de structure secondaire

a. épingle à cheveux (hairpin)

b. interne

c. bulge

d. multi-branche

e. duplex (longue-distance)

f. pseudonoeud

Secondaire et tertiaire: un exemple…

– The 1030-1124 Region of 23S rRNAhas seveal tertiary interactions: two canonical lone base pairs (1082:1086and 1087:1102), a base triple ((1092:1099)1072) and three non-canonical tertiary base pairs (1032:1122, 1039:1116, and1040:1115)

L’ARNt

Présence de nombreux nucléosides inhabituels, par ex: inosine (I), pseudouridine (Ψ), dihydrouridine (D), ribothymidine (T), base Y, etc.

L’ARNt: du 2D au 3D

Source: Molecular Biology of the Cell (http://www.garlandscience.com/textbooks/0815332181.asp)

L’ARNtBoucle TΨC

Boucle D

Boucle Variable

Tige Acceptrice

BoucleAnticodon Forme d’un « L »

Interactions dans la structure tertiaire:

Formation des triplets; interactions avec des P – du backbone et avec le 2’OH des riboses

ARNt: interactions tertiaires

Le ribosome

Masse totale: 2.6 x 103 kDa (100 fois plus que le Lysosyme)Composition: 1/3 protéine 2/3 nucléotidesSous-unité 30S : Interaction avec les codons du mRNA et les anticodons du tRNASous-unité 50S : activité peptidyl-transferase et interaction avec le GTP-binding protein.Meilleures structures aujourd’hui :Ribosome 70S de T. thermophilus à 7.8 Å (1999) Sous-unité 30S de T. thermophilus à 4.5 Å (1999)Sous-unité 50S de H. marsimortui à 2.4 Å (2000)

23S: interactions secondaires et tertiaires

(moitié 5’)

23S: interactions

secondaires et tertiaires

(moitié 3’)

Ribosome: les sites de fixation des ARNt

Source: Molecular Biology of the Cell (http://www.garlandscience.com/textbooks/0815332181.asp)

La grande sous-unité (ARN 23S)

crystal structure of the large ribosomal subunit from Haloarcula marismortui at 2.4 angstrom resolution

Grande sous Unité: 35 prot + 2 ARN (23S, 5S)

Nenad Ban, Poul Nissen, Jeffrey Hansen, Peter B. Moore, Thomas A. Steitz Science. 289:878-9, 2000

Protéines de la grande sous-Unité

Repliement des domaines

dominés apr les interactions

inter-hélices

A proposed mechanism of peptide synthesis catalyzed by the ribosome. A2486 (A2451) is shown as the standard tautomer in allsteps, but could be represented as the imino tautomer, which would have a negative unprotonated N3 and a neutral protonated N3. We expect that the electronic distribution is actually between these two extremes. (A) The N3 of A2486 abstracts a proton from the NH2 group as the latter attacks the carbonyl carbon of the peptidyl-tRNA. (B) A protonated N3 stabilizes the tetrahedral carbon intermediate by hydrogen bonding to the oxyanion. (C) The proton is transferred from the N3 to the peptidyl tRNA 3' OH as the newly formed peptide deacylates. Among the variations on this mechanism that should be considered would be a protonated A2486 stabilizing the intermediate, as in (B), with less contribution on acid-base catalysis, as shown in (A) and (C).

stabilisation du carbone tetrahedrique par N+

Site P libéré par le déplacement du peptide sur le tRNA du site A

Site P Site A

Centre peptidyl-

transférase

La stabilisation de l’imino tautomère du A2486 augmente le pKa de A2486 N3 à 7.6

Notez les interactions tertiaires

Principale fonction de la S.U. 50S

Pas de peptide à moins de 18Å

Abstraction proton par N3 de 2486

Attaque nucléophile

The polypeptide exit tunnel. (A) The subunit has been cut in half, roughly bisecting its central protuberance and its peptide tunnel along the entire length. The two halves have been opened like the pages of a book. All ribosome atoms are shown in space-filling representation, with all RNA atoms that do not contact solvent shown in white and all protein atoms that do not contact solvent shown in green. Surface atoms of both protein and RNA are color-coded with carbon yellow, oxygen red, and nitrogen blue. A possible trajectory for a polypeptide passing through the tunnel is shown as a white ribbon. PT, peptidyl transferase site. (B) Detail of the polypeptide exit tunnel showing distribution of polar and nonpolar groups, with atoms colored as in (A), the constriction and bend in the tunnel formed by proteins L4 and L22 (green patches close to PT), and the relatively wide exit of the tunnel. A modeled polypeptide is in white. (C) The tunnel surface is shown with backbone atoms of the RNA color coded by domain. Domains I (yellow), II (light blue), III (orange), IV (green), V (light red), 5S (pink), and proteins are blue.

Tunnel de sortie du

polypeptide

Les motifs ARN

••

--

RNAG

Boucle GNRA

•-

• -

R NA G

G U

G

-

C -U

AA

UA

GC -

• •-interaction GNRA- boucle interne

-•-

YU

U-turn

--

CNGU

?

Boucle UNCG

Plateforme d'adénine

• •-

-AG U/A•

Aempilement

Quelques motifs ARN. Les traits pointillés désignent des interactions tertiaires. Les flèches indiquent le sens 5'->3'. Le point d'interrogation dans le motif d indique que le partenaire de cette interaction tertiaire n'est pas connu.

a

ed

cb

Boucle E

La paire G:U Wobble

La paire Hoogsteen

Boucle TTC de tRNA Phe levure

Le U-turn

Boucle anticodon de tRNA Phe levure

La boucle GNRA

Les triplets de bases

Triple hélice 10-13 + 22-25+ 9+45+46 de tRNA Phe levure

1.c La prédiction des structures secondaires d’ARN

Les Dot-Plots

Algorithme de Zuker et Stiegler

• L’algorithme de programmation dynamique appliqué aux structures secondaires

• recherche les structures de plus basse énergie pour tous les sous-fragments d’une séquence (Zuker and Stiegler, 1981).

• Garantit une solution optimale à l’intérieur du modèle énergétique choisi. N’autorise pas les pseudo-nœuds.

Prédire les structures secondaires: énergies libres d’empilement (stacking free energy) (Turner et al.)

– Dans l’hélice (seulement paires WC et GU):

------------------ ------------------ ------------------ ------------------A C G U A C G U A C G U A C G U

------------------ ------------------ ------------------ ------------------5' --> 3' 5' --> 3' 5' --> 3' 5' --> 3'

UX UX UX UXAY CY GY UY

3' <-- 5' 3' <-- 5' 3' <-- 5' 3' <-- 5'. . . -8.1 . . . . . . . -4.0 . . . .. . -13.3 . . . . . . . -5.3 . . . . .. -10.5 . -4.0 . . . . . -1.8 . -5.3 . . . .

-6.6 . -3.6 . . . . . -3.6 . -3.6 . . . . .

ACGU

– Attention: AU

U

A

A

5 ’

A

U

U

5 ’

5 ’5 ’

Turner’s free energies (suite)

– Terminal mismatches:

Y Y Y Y------------------ ------------------ ------------------ ------------------A C G U A C G U A C G U A C G U

------------------ ------------------ ------------------ ------------------5' --> 3' 5' --> 3' 5' --> 3' 5' --> 3'

AX AX AX AXAY CY GY UY

3' <-- 5' 3' <-- 5' 3' <-- 5' 3' <-- 5'. . . -4.0 . . . -4.3 . . . -3.8 -4.3 -6.0 -6.0 -6.0. . -5.2 . . . -7.2 . . . -7.1 . -2.6 -2.4 -2.4 -2.4. -10.3 . -7.2 . -5.2 . -4.8 . -9.4 . -6.6 -3.4 -6.9 -6.9 -6.9

-4.3 . -4.3 . -2.6 . -2.6 . -3.4 . -3.4 . -3.3 -3.3 -3.3 -3.3

ACGU

A

G

C

A

5 ’

5 ’

Turner’s free energies (suite)

– Dangling ends:

X X X X------------------ ------------------ ------------------ ------------------A C G U A C G U A C G U A C G U

------------------ ------------------ ------------------ ------------------5' --> 3' 5' --> 3' 5' --> 3' 5' --> 3'

AX AX AX AXA C G U

3' <-- 5' 3' <-- 5' 3' <-- 5' 3' <-- 5'. . . . . . . . . . . . -4.9 -0.9 -5.5 -2.3

A

U

A

5 ’

5 ’

Analyse comparative

L’analyse comparative est la façon la plus fiable de déterminer les structures secondaires

Elle repose sur la détection de covariationsElle nécessite plusieurs séquences homologues alignées

Fonctionne même pour les paires non Watson-Crick!

Mesures de la covariation

• Table de contingence: table p normal 53 61

a c g u -53, 61| 0 97 0 3 0 61------+------------------a 3 | 0 0 0 3 0 c 0 | 0 0 0 0 0 g 97 | 0 97 0 0 0 u 0 | 0 0 0 0 0 - 0 | 0 0 0 0 0 53

gc=( 29, 28.03, 96.7%) au=( 1, 0.03, 3.3%)

• Tests:– Chi 2– Information Mutuelle– évènements phylogénétiques

Exemple: covariations d’un nucléotide avec tous les autres

– Position 1 du tRNA contre toutes les autres positions:

Cours 2. ARNomique

Supports:

http://www.esil.univ-mrs.fr/~dgaut/cours/orsay.html

A partir de novembre:

http://rna.igmors.u-psud.fr/

Bacterial genome

• 90% transcribed (~90% coding)

codingintergenic

T

Transcription: T

Vertebrate Genomes: 90% transcribed too!

Intergenic DNAIntrons, UTRs

TT

T

T?Transcription: T Satellite DNA

Transposable elementstRNA, rRNACoding regions: 1.5%

Vertebrate gene: 30kb (coding: 1,5kb)

RFAM: The RNA databaseSam Griffiths-Jones, Simon Moxon, Mhairi Marshall, Ajay

Khanna, Sean R. Eddy and Alex Bateman.

RFAM Stats

• 503 Familles d’ARN– une famille = un groupe d’ARN homologues et

alignables. – miRNA= >20 familles.

• Homme: – ~100 familles – 3000 ARN différents annotés

• E. Coli: ~40 familles

Familles, orthologues etc.

Combien d’autres ARNnc ?

2.1 Approches expérimentales

Cloning– Rnomics*

– Extract total RNA– Isolate small RNAs – Tag & Reverse transcribe– Clone & Sequence

• ~200 ncRNAs identified in mouse • ~100 ncRNA in bacteria

* Huttenhofer et al., 2001

Limitations

• Sensibilité: ncRNA rares, exprimés à des sites précis ou pendant une courte durée

• Non exhaustivité

Autres approches

– Full-length cDNA projects• FANTOM3:

– 100,000 mouse cDNAs– 32,000 non-coding!

– Tiling arrays• Half human transcriptome is polyA-, cytoplasmic and maps

unnanotated loci

Limitations

• Nombreux transcrits non fonctionnels (« fuites » de la transcription)

• Pas de preuve de fonction en tant qu’ARN

2.2 Recherche Bioinformatique de gènes ARN

What’s Special About ncRNA Detection?

s

s

s

BLAST HMM

– No ORF– No Markov model / sequence statistics– ncRNA is defined both by primary and secondary structure– « Substitution matrices » for nucleic acids are terrible compared

to aminoacids counterparts

2.2.a Looking for known ncRNAs

– How can we detect ncRNA genes from known families?

Descriptor-based programs

Rnamot / Rnamotif (Gautheret 91, Macke ‘02)

Palingol (Viari 96)

Patscan (Overbeek ’00)

PatSearch (Pesole ‘01)

h1 s1 h1 s2 h2 s3 h2

h1 5:5 1h2 5:5 NNNNR:YNNNNs1 7:7 NUNNNNNs2 4:40s3 7:7 UUCNNNN

RnaMot descriptor for anticodon+TYC domain of

tRNA

Descriptor-based programs

PROS CONS

Draft descriptors can be quickly sketched and tested

Requires a good prior knowledgeof secondary structure and

sequence constraints

Alignment is not compulsory, although it is very helpful to

have one

Requires basic computer skills to translate biological constraints

into computer script

Biologists decide what features are important or not (see also

CONS!)

Biologists have the responsibility of correctly weighting each

important feature

Probabilistic ncRNA search programs

• Stochastic Context Free Grammars (firstadaptation of CFG to RNA: Searls 94; SCFG: Eddy & Durbin 94)

– Time cost = O(N4) for sequence of length N – Not « practical » for large alignments or genome-

wide searches– Pseudoknots not allowed

describe how to generateany structure in thetraining set

ProductionrulesTraining set

ERPIN: Secondary Structure Profiles

• 16-row matrix captures base correlations and base-pair freqs.

1 2 3A C CA C CG C GA C CA C G

7 8 9G G UG G UU G CG G UC G U

A:AG:AC:AU:AA:GG:GC:GU:G...

4 5 6C G AC A AC G -- G -- G A

AGCU-Sb1,b2 =

log(Fb1b2 /Fb1xFb2)

Alignment

Weight matrices Implemented inPSI-Blast orProsite

Usual 5-row matrix for single strands

Profile Search Strategy

l=10 l=14

Seq

uenc

e (1

4nt

)

best score for forl=14 (0 gaps)

Single-strand profile (14 positions)

best score forl=10 (4 gaps)

Helix score for h5-h3computed from helix profile

GTTCTTGCATGTTTGACGGAACGTTCTTGCATGATTGACGGAACGTTCTTGCATGTTTGACGGAACTTTCCTGCATGCTTGACGGAACTTTAT--CAAGTTCAT-ATAAAATTAT--CGTGCCTTC-ATAATATTAT--CGTGTCTTC-ATAATATTAT--CATGTTTC--ATAAT

Training set

h3h5

Target sequence

ERPIN: Profile-based searchGautheret & Lambert, JMB, 2001, 313, p. 1005.

Training set

A:AG:AC:AU:AA:GG:GC:GU:G

AGCU-

U:U...

Single-strandprofile (5xN)

Helix profile (16xN)

Sb = log(Fb /Eb)

Sb1,b2 = log(Fb1b2 /Fb1xFb2)

Problem with information-poor training sets

A Mir-133 training-set:

(( - ((((((( ------ ((( - (((( ---------------------- )))) ))) ------ ))))))) - ))TC t GGCTGGT caaac- GGA a CCAA gtccgtcttcctgagaggt--- TTGG TCC CCTTCA ACCAGCT a CATG t GGCTGGT caaac- GGA a CCAA gtcaggtgtttctgtgaggt-- TTGG TCC CCTTCA ACCAGAC t ATTG t GGCTGGT aaaac- GGA a CCAA gtcaggtgtttttgtgaggt-- TTGG TCC CCTTCA ACCAGCT a TGTG c GGCTGGT gaaaa- GGA a CCAC atcaacccagaaaaaggat--- TTGG TCC CCTTCA ACCAGCC g CATA t GGCTGGT caaac- GGA a CCAA gtccgtcttccttagaggt--- TTGG TCC CCTTCA ACCAGCT a TTAG t TGCTGGT aaaac- GGA a CCAA gtcgggtgtttgcgagaggt-- TTGG TCC CTTTCA ACCAGCT a CTTG t GGCTGGT caaat- GGA a CCAA gtcaggtgtttctgcgaggt-- TTGG TCC CCTTCA ACCAGCT a CT

100% C:GOther scores = log (obs/expected) = abritrary low value!

What about G:C or A:U in this column? Is it as bad as C:C or A:G?

Pseudocounts

– Principle: fill columns with expected counts, based on a reasonable model

– Example: column c contains 7 C:Gs, we know C:G often substitutes for G:C, let’s allow for someG:Cs.

– We need substitution matrices!

Henikoff & Henikoff pseudocounts

(for aminoacids)

Total # pseudocounts in column c

Counts of a in column c

a substituted by i

With the previous example:

Column c is 100% C:GProbability(C:G)=1, others = 0

Count of A:Ts = Bc * 1 * Probability (C:G | A:T)Count of A:As = Bc * 1 * Probability (C:G | A:A), etc.

RNA substitution matrices

Obtained from euk+archae+bac 16S/18S rRNA alignement

AA AT AG AC TA TT TG TC GA GT GG GC CA CT CG CC6.54e-04 5.20e-06 3.88e-05 4.22e-05 2.13e-05 5.51e-06 1.21e-05 3.84e-05 8.52e-05 1.28e-05 1.76e-04 2.89e-06 1.47e-05 6.47e-06 3.19e-06 4.69e-067.96e-05 9.00e-04 5.19e-05 1.78e-04 1.69e-04 1.43e-04 8.85e-05 1.86e-04 4.15e-05 1.69e-04 1.22e-04 1.99e-04 8.73e-05 2.44e-04 1.25e-04 3.30e-041.00e-04 8.72e-06 1.35e-03 1.27e-04 1.72e-05 5.09e-06 3.10e-05 1.38e-04 5.74e-05 1.59e-05 1.01e-04 8.22e-06 9.99e-06 1.62e-05 1.33e-05 2.56e-054.11e-05 1.13e-05 4.81e-05 9.79e-04 2.79e-06 7.02e-06 2.79e-06 4.47e-05 4.93e-06 1.97e-05 3.05e-05 8.06e-06 5.40e-06 1.55e-05 2.47e-06 7.30e-054.23e-04 2.19e-04 1.33e-04 5.69e-05 1.16e-03 2.21e-04 2.35e-04 2.78e-04 9.59e-05 1.18e-04 1.79e-04 1.08e-04 3.54e-04 2.04e-04 2.24e-04 9.28e-051.05e-05 1.80e-05 3.80e-06 1.38e-05 2.14e-05 9.30e-04 2.57e-05 7.75e-05 5.79e-06 2.33e-05 4.87e-05 1.18e-05 1.57e-05 8.72e-05 1.83e-05 5.25e-041.05e-04 5.03e-05 1.04e-04 2.49e-05 1.03e-04 1.16e-04 1.14e-03 1.80e-04 4.69e-05 4.56e-05 1.25e-04 4.26e-05 1.70e-04 2.15e-04 7.52e-05 3.23e-051.45e-05 4.59e-06 2.03e-05 1.73e-05 5.30e-06 1.52e-05 7.82e-06 1.60e-04 4.55e-06 8.99e-06 4.77e-06 3.66e-06 9.00e-06 6.17e-05 2.95e-06 1.61e-052.57e-04 8.19e-06 6.74e-05 1.53e-05 1.46e-05 9.11e-06 1.63e-05 3.64e-05 1.47e-03 2.50e-05 8.70e-05 2.12e-05 3.02e-05 2.83e-05 4.40e-06 8.02e-061.24e-04 1.07e-04 6.02e-05 1.96e-04 5.81e-05 1.18e-04 5.10e-05 2.31e-04 8.04e-05 1.28e-03 9.39e-05 8.77e-05 2.53e-05 9.12e-05 3.55e-05 4.58e-051.82e-04 8.24e-06 4.08e-05 3.24e-05 9.35e-06 2.61e-05 1.49e-05 1.30e-05 2.97e-05 9.98e-06 5.62e-04 6.96e-06 6.83e-06 8.80e-06 1.32e-05 1.06e-051.14e-04 5.14e-04 1.26e-04 3.27e-04 2.16e-04 2.44e-04 1.94e-04 3.84e-04 2.78e-04 3.57e-04 2.67e-04 1.49e-03 1.07e-04 5.26e-04 2.57e-04 2.87e-041.30e-05 5.04e-06 3.43e-06 4.90e-06 1.58e-05 7.22e-06 1.73e-05 2.10e-05 8.85e-06 2.30e-06 5.85e-06 2.40e-06 5.30e-04 5.30e-05 1.58e-05 7.68e-063.86e-06 9.54e-06 3.78e-06 9.51e-06 6.16e-06 2.71e-05 1.48e-05 9.78e-05 5.60e-06 5.61e-06 5.10e-06 7.94e-06 3.58e-05 2.95e-04 5.22e-06 3.52e-051.04e-04 2.68e-04 1.70e-04 8.32e-05 3.71e-04 3.12e-04 2.83e-04 2.56e-04 4.77e-05 1.19e-04 4.21e-04 2.12e-04 5.86e-04 2.86e-04 1.35e-03 2.50e-042.13e-06 9.81e-06 4.54e-06 3.41e-05 2.12e-06 1.24e-04 1.69e-06 1.94e-05 1.21e-06 2.14e-06 4.70e-06 3.31e-06 3.95e-06 2.68e-05 3.48e-06 5.45e-04

A T G C9.13e-04 8.22e-05 1.05e-04 9.35e-055.57e-05 6.70e-04 7.98e-05 1.41e-046.94e-05 7.78e-05 7.32e-04 5.03e-054.09e-05 9.15e-05 3.33e-05 6.03e-04

AAATAGACTATTTGTCGAGTGGGCCACTCGCC

ATGC

Performance as a function of training set size and pseudocount weight

specificity

sensitivity

From 1 to 19 sequences in training set…

pseudocountspseudocounts + true-counts

pcw =

Setting the pseudocount weight

pseudocountspcw = pseudocounts + true-counts

An E-value for RNA motifs?

E-value:# expected hits of score > S, by chance

Can we guess this? Idea 1:

run against random database and compute score distributionExtrapolate for any score

High Scores are Well Behaved

• Y = 6.6 105 e-0.3 S

• E = Kmn e-λ S

• (extreme value distrib.)

• Score 30 (min):

• 3.8 hits/100mb• Score 40 (ave): • 0.13 hits/100mb10

100

1000

10000

20 25 30 35score

#hits

randommarkov

SECIS hits in700 mb randomized sequences An E-value is possible

1 day to run simulation!

Global Score Distribution is not as Good!

A:AG:AC:AU:AA:GG:GC:GU:GA:CG:CC:CU:CA:UG:UC:U

« Log(0) » or pseudocount

« Finite » scoreHigh scores

Helix profile

Distribution of all scores

U:U

Not GaussianHow can we model it? (same for single-

strand profile)

Discrete convolutions

computation

A:AG:AC:AU:AA:GG:GC:GU:GA:CG:CC:CU:CA:UG:UC:UU:U

simulation

E-values of complete motifs

simulated

computed

Profile-based search

PROS CONS

All constraints in the training set are efficiently exploited, resulting

in highly specific detections

Alignment and secondary structure constraints must be

accurate

No programming is needed Helices of variable length need to be reduced to their shortest

consensus

Scoring system is defined automatically

Program will not depart from initial alignment in terms of motif

sizeE-values are provided for each hit Users still have to decide on

search order and masked elements

Running a successful ncRNA search

Example: the Signal Recognition Particle(SRP) RNA

172 sequences availableAll 3 kingdomsSignature: 50-nt domain IV

Organize ncRNA informationAlignment is a must

Should be structure-based

ClustalW OK only as a first attempt

RNAalifold (Vienna package) can identify covarying basepairs

Secondary Structureannotation

Will help identify sequence/ structureconstraints: helix sizes, conservedbases, etc.

Want to publish your finds in absence of E-value?Prepare Control Procedure

TP TP+FN

Sensitivity: SN = Total « true » objects

TP TP+FP

Specificity : SP = Total predictions

TP and FN: easy to obtain, using training set (leave-one out)

FP: harder! How do you know a hit is false?

Hint: express SP as: FP / Mb in a random sequence

Make it large enough and of same composition (mono & di-nt) as search database (e.g. with the shuffle program)

Using the ERPIN program

>structure000000000000000000000000000000000000001111001000222443333333666555588877777788899996661111443222>AQU.AEO.AGGGUGAACU-CCCCCAGGCCCGAA--AGGGAGCAAGGGUAAGC-CCG>THE.THE.GGCGUGAACC-GGGUCAGGUCCGGA--AGGAAGCAGCCCUAAGC-GCC

erpin srp.epn sequence.fasta -8,8 -nomaskerpin srp.epn sequence.fasta –2,2 -nomaskerpin srp.epn sequence.fasta -2,2 –umask 5 9 -nomask

ERPIN results

Score: based on profile values

E-value: How manyhits expected at thisscore or higher?

No need for random sequence tests!

Looking for MicroRNA Precursors

Stem loop Stem loop precursor precursor (pre(pre--miRNA70nt)miRNA70nt)

miRNA miRNA transcription transcription unitunit

Dr

Dr

nucleus cytoplasm

()

mRNAmRNAtargettargetMature miRNAMature miRNA

(21(21--22 nt.)iceice 22 nt.)Translation Translation block or block or cleavagecleavage

Polycistronic Polycistronic transcript (pritranscript (pri--miRNA)

miRNA+RISC miRNA+RISC complexcomplex

miRNA)

miR Training sets

• >50 miRNA families with different signatures– Can’t use a single profile for all

• 18 training sets built for 18 miRNA families– Using CLUSTALW + Alifold– 10 sequences/family on average

Legendre, Lambert, Gautheret, Bioinformatics 2004

ERPIN vs BLAST

• 20 animal genomes scanned• Compare Erpin and WU-BLAST w/ sensitive

parameters (W=7)• E-value ≤ 0.01

ERPIN WU-BLAST

43 (0) 212 (5) 41 (9)

Analysis of a miR Cluster

miR17 clusterciona

ciona Grey: initial training set

“E” indicates hits identified by ERPIN only,

“EB” indicates hits identified by both ERPIN and BLAST.

• Important homologues missed by WU-BLAST• Profile search a must in miRNA detection

2.2b De novo ncRNA finding

– How can we detect ncRNA genes when noprior sequence/structure data is available?

Thermodynamic Profiling (Le et al. 88)

-20100 bases

-2

100 bases

-3100 bases

-2

100 bases

Profile : - 4

100 bases

window free energy - mean (energy of rnd seq.)Z-score =

Var(energy of rnd seq.)

The problem with thermodynamics

OK for strong local structures (some success in viral genomes)However: true ncRNA (tRNA, rRNA) do not display higher folding energy than random sequences of same composition (di-nt: Rivas & Eddy 2000)The method would not detect many known ncRNAsG+C content alone is a better ncRNA predictor than free energy

G+C content

In high A+T background (thermophilic archaebacteria), ncRNA stand out clearly. Combining (G+C)% and CpG% provides the best discriminant (Schattner ’02). A dozen such predicted ncRNA experimentally confirmed in M. jannaschii and P. furiosus.Does not work in genomes with « normal » G+C contents, except as a complement to other methods (thermodynamics, etc.)

RNAGenie (Carter, Dubchak & Holbrook, 2001)

– Partition E. coli sequences into 2 sets of overlapping 80nt windows :

• 675kb intergenic (negative set) • 8kb true ncRNA (16S,23S,5S,tRNA,other small RNAs)

(positive set)

– Train neural network capturing different « compositional characters »

• nt and di-nt composition• frequency of typical RNA motifs: UNCG, GNRA, AAR, CUAG• Folding free energy

RNAGenie results

– Claim about 80-90% accuracy– Predict 370 new ncRNA– 13 predictions confirmed subsequently– Limitations

• small training set: squewed prediction• Possible contamination of non-ncRNA sequences in

« negative » set.

Comparative Genomics rules!

Wassarman et. al. ‘01: comparaison Escherichia, Salmonella, Klebsiella : 60 ncRNA predicted, 23 confirmedMany ongoing projects in bacteria, Xenopus, Ciona, human

Comparative genomics:a major source of ncRNA in eukaryotes

– 5-6% of mamalian genome under selection vs 1.5% coding (3 times as much as in nematodes)

– Tiling arrays and full length cDNA identify as many transcripts in intergenic as in known genes

– As many polyA- as polyA+

Functional assignment of conserved regions

– Coding exons– Regulatory sequences in exons

and introns– Promoters– ncRNA– Ancestral repeats– Others (matrix attachment,

etc.)

Margulies et al, 2003

Fraction of conserved sequences in.. (AR=ancestral repeats)

Detect this!

Need classification software- QRNA (Rivas & Eddy, 2001)- RNAz (Hofacker & Stadler, 2005)

Q-RNA (Rivas & Eddy 2001)

– Analysis of Blast alignment (SCFG based)

Synonymous mutations

Compensatory mutations

•Model for protein coding gene

•Model for ncRNA

(also include loop probabilities obtained fromtraining set of real ncRNA)

RNA-Alifold (Hofacker, Stadler)

• Originally: secondary structure prediction• Predicts best common structure for a set of

aligned RNAs• Dynamic programming, averaging:

– Energies of aligned bases– Covariation term

RNAz (Washietl et al. 2004)

• Uses multiple alignments. • two basic components:

– (1) Measure for RNA secondary structure conservation based on computing an RNAalifold consensus secondary structure (Structure Conservation Index)

– (2) Measure of thermodynamic stability, based on a zscore normalized w.r.t. both sequence length and base composition and can be calculated without sampling from shuffled sequences (SVM)

– An SVM again to combine 1 and 2

Q-RNA & RNAz

– QRNA • Limited range for similarity (65%-85%): too dissimilar= incorrect

Blast alignments, too similar=no covariation

• Pairwise: limited covariation

– RNAz• Multiple alignment: better use of sequence variation• Already applied to mammals (5-species alignment) & Ciona

Problem: Human/mouse/rat ncRNAs not in this range!

The right species for ncRNA detection?

Human/mouse ncRNA: ~98-100% id18S fugu/xenopus/human: 95% id! Obvious interest for diverse animal models

Can’t observe covariation

Conserved elements detected using various species

Human/mouse tRNAAsp