High-throughput transcriptomics for molecular evolution and population genetics
CBGP, March 2011
Nicolas Galtier
UMR 5554 - Institut des Sciences de l'Evolution - Montpellier
Molecular evolution in the 21st century
We have:
- an enormous amount of data (genomics)
- a robust theoretical framework (population genetics)
⇒ we should understand molecular variation patterns
Yet we do not really know:
- why some species evolve (much) faster than others, proteome-wise
- why GC-content varies within and between genomes
- to what extent population size determines genetic diversity
- etc.
Molecular evolution in the 21st century
Why so many unsolved, basic questions?
- lacking theory
- biased sampling (species, genes)
PopPhyl goals
Injecting species biology/ecology into comparative genomics
Exploring the molecular diversity of nonmodel taxa
Testing predictions of the population genetic theory genome-wide
life history traits: body mass, generation time, abundance, mating system
population genetic parameters: mutation rate, population size, selection, recombination
genomic variation data: within-species, between species
Some specific questions we want to address:
- Why are fast-evolving taxa fast? (mutation, selection)
- Are abundant species more polymorphic than scarce ones?
- Is selection less efficient in selfers than outcrossers?
- How does longevity influence mito vs nuclear DNA evolution?
- Who optimizes codon usage, who does gBGC, and why?
- Is the rate of selective sweeps higher in large populations?
How?
- Target = transcriptome: coding sequences, expression data
- Sampling scheme: focal species (10 individuals) + outgroups (1 or 2 individuals), × 30
- Next-Generation Sequencing technology
For each taxon: 5×10^5 reads of 400 bp (454, pooled individuals) and 5×10^7 reads of 100 bp (Illumina, tagged individuals)
Species sampling
Demosponges, Sponges
Cnidarians, Ctenophores, Rotifers, Acanthocephalans, Entoprocts
Platyhelminths, Nemerteans
Annelids, Molluscs, Ectoprocts, Brachiopods, Chaetognaths, Tardigrades, Onychophorans, Arthropods, Loriciferans, Kinorhynchs, Priapulids, Nematodes, Hemichordates, Echinoderms, Cephalochordates, Urochordates, Vertebrates
Why are tunicates fast-evolving, proteome-wise?
[Figure: tree with tips labelled T, V, E and C]
- higher mutation rate?
- more prevalent adaptive evolution?
- relaxed selective constraint on housekeeping genes?
Data analysis pipeline
[Pipeline diagram]
454 and Solexa reads → assembling → reference transcriptome
transcriptome reads → mapping → SNP calling → SNPs and genotypes
SNPs and genotypes + coding annot. → allele frequencies; πN, πS, dN, dS
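The last step of the pipeline (from genotypes to allele frequencies and diversity) can be sketched as follows. This is only an illustration, not the PopPhyl code: the function names are hypothetical, genotypes are assumed to be two-letter diploid strings such as "AG", and diversity is taken as the usual unbiased heterozygosity at one site.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies at one site from diploid genotype strings such as 'AG'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def site_diversity(genotypes):
    """Unbiased per-site diversity: chance that two distinct sampled chromosomes differ."""
    n = sum(len(genotype) for genotype in genotypes)  # number of sampled chromosomes
    if n < 2:
        return 0.0
    freqs = allele_frequencies(genotypes)
    homozygosity = sum(f * f for f in freqs.values())
    return n / (n - 1) * (1.0 - homozygosity)


# Hypothetical site genotyped in three individuals
site = ["AG", "GG", "AA"]
print(allele_frequencies(site))
print(site_diversity(site))
```

Averaging such per-site values separately over synonymous and non-synonymous positions would give πS and πN; that step needs the coding annotation and is not shown here.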
Assembling transcriptomes from NGS data: a benchmark in Ciona
[Diagram: assembly strategies compared; outputs are labelled c, s and c+s in the figure]
A: 454 reads → Celera
B: 454 reads → Mira
C: 454 reads → Cap3
D: Illumina reads → Abyss → Cap3
E: 454 reads + Illumina reads → merge reads (Abyss, Cap3)
F: 454 reads + Illumina reads → merge contigs (Abyss, Cap3)
F': F + refine
de novo transcriptome assembly: quantitative assessment
     data set    method         contigs  mean length  median length  N50  assembly length (Mb)  touched genes
A    Ciona_454   Celera         25,669   491          438            491  12.6                  7616
B    Ciona_454   Mira           33,196   635          526            650  21.1                  7951
C    Ciona_454   Cap3           24,515   671          540            713  16.5                  7945
D    Ciona_illu  Abyss+Cap3     27,426   574          380            769  15.8                  7704
E    Ciona_mix   merge reads    29,097   571          399            721  16.6                  7982
F    Ciona_mix   merge contigs  27,956   726          529            891  20.3                  8207
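For readers unfamiliar with the columns above, here is a minimal sketch of how such summary statistics are usually computed from a list of contig lengths. The function names and the toy lengths are illustrative assumptions; N50 is taken in its standard sense (the contig length at which the cumulative length, counted from the longest contig down, reaches half of the assembly).

```python
from statistics import mean, median


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of the total length."""
    if not lengths:
        raise ValueError("no contigs")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def summarize(lengths):
    """Summary statistics for one assembly, given contig lengths in bp."""
    return {
        "contigs": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "N50": n50(lengths),
        "assembly_Mb": sum(lengths) / 1e6,
    }


# Toy assembly (hypothetical lengths in bp)
print(summarize([200, 300, 400, 500, 600, 1000]))
```

N50 weights long contigs more heavily than the mean or median do, which is why it can rank assemblies differently from the other two columns.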
[Figure: plots comparing 454 contigs, Illumina contigs and Mix contigs]
Assembling transcriptomes from NGS data: a benchmark using Ciona intestinalis
Predicted contigs are compared to the reference transcriptome with BLAST and sorted by hit pattern (contigs → reference transcripts):
- no hit
- 1→1: full, partial
- m→1: fragments, alleles
- 1→n: chimera, multi
- m→n: full or partial, multi
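The hit-pattern categories can be sketched in code as below. This is an interpretation rather than the procedure used in the benchmark: a "hit" is assumed to be a (contig, reference transcript) pair that survived whatever BLAST filtering was applied, and the function name and identifiers are hypothetical.

```python
from collections import defaultdict


def classify_contigs(contigs, hits):
    """Label each contig as 'no hit', '1→1', 'm→1', '1→n' or 'm→n'.

    hits is an iterable of (contig, reference) pairs.
    """
    refs_of = defaultdict(set)     # contig -> reference transcripts it hits
    contigs_of = defaultdict(set)  # reference transcript -> contigs hitting it
    for contig, ref in hits:
        refs_of[contig].add(ref)
        contigs_of[ref].add(contig)

    labels = {}
    for contig in contigs:
        refs = refs_of.get(contig, set())
        if not refs:
            labels[contig] = "no hit"
            continue
        # Is any of this contig's references also hit by another contig?
        shared = any(len(contigs_of[ref]) > 1 for ref in refs)
        if len(refs) == 1:
            labels[contig] = "m→1" if shared else "1→1"
        else:
            labels[contig] = "m→n" if shared else "1→n"
    return labels


# Hypothetical contigs and hits
contigs = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
hits = [
    ("c1", "r1"),
    ("c2", "r2"), ("c3", "r2"),
    ("c4", "r3"), ("c4", "r4"),
    ("c5", "r5"), ("c5", "r6"), ("c6", "r5"), ("c6", "r6"),
]
for name, label in classify_contigs(contigs, hits).items():
    print(name, label)
```

Telling "full" from "partial", or "fragments" from "alleles", within a category would additionally need alignment coordinates, which this sketch does not use.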
de novo transcriptome assembly: qualitative assessment
Average contig length varies between categories
Improving assemblies by filtering according to length + coverage
[Figure: percentage of correct contigs (axis ticks 60%, 80%) against number of contigs (axis ticks 4000, 8000, 12000)]
de novo transcriptome assembly from NGS data: conclusions
- Illumina > 454 (454 is still useful)
- correct cDNA predictions are a minority in typical assemblies
- existing programs differ substantially in performance (in PopPhyl we retain Cap3 and Abyss)
- contig length + coverage is a reasonable quality criterion
- results vary somewhat across species
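The length + coverage criterion can be sketched as a simple filter. The thresholds below are placeholders chosen for illustration (the slides do not give the values used), and the Contig record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int      # contig length in bp
    coverage: float  # mean read depth over the contig


def filter_contigs(contigs, min_length=500, min_coverage=5.0):
    """Keep contigs that pass both the length and the coverage threshold (assumed values)."""
    return [c for c in contigs if c.length >= min_length and c.coverage >= min_coverage]


# Hypothetical assembly
assembly = [
    Contig("contig1", 1200, 35.0),
    Contig("contig2", 310, 40.0),   # too short
    Contig("contig3", 900, 2.5),    # too shallow
    Contig("contig4", 650, 12.0),
]
kept = filter_contigs(assembly)
print([c.name for c in kept])
```

Raising either threshold trades number of contigs against proportion of correct ones, so in practice the cut-offs would be tuned per species.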
Calling SNPs and genotypes from transcriptome reads
reads (A/C/G/T counts per position and individual):
>contig1
pos  ind1       ind2       ind3
1    5/0/9/0    0/0/8/0    10/0/0/0
2    0/4/0/0    0/7/0/0    0/17/0/0
3    1/0/0/17   0/0/0/6    0/0/0/22
…
>contig2
pos  ind1       ind2       ind3
1    0/0/0/4    0/0/0/8    0/2/0/11
2    34/1/13/0  52/0/45/0  4/0/8/0
…
Calling SNPs and genotypes from transcriptome reads
genotypes (A/C/G/T counts followed by the called genotype):
>contig1
pos  ind1          ind2          ind3
1    5/0/9/0 AG    0/0/8/0 GG    6/0/0/0 AA
2    0/4/0/0 CC    0/7/0/0 CC    0/17/0/0 CC
3    1/0/0/17 TT   0/0/0/6 TT    0/0/0/5 TT
…
>contig2
pos  ind1          ind2          ind3
1    0/0/0/1 TT    0/0/0/8 TT    0/2/0/11 CT(90%)
2    14/1/9/0 AG   8/0/15/0 AG   12/0/0/0 AA
…
Calling SNPs and genotypes from transcriptome reads
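Model M1: sequencing error ε
reads: A:1 C:0 G:6 T:0
genotype [GG]: $P(\text{reads} \mid GG) = 7 \cdot \frac{\varepsilon}{3} \cdot (1-\varepsilon)^{6}$
genotype [AG]: $P(\text{reads} \mid AG) = 7 \cdot \left(\frac{1}{2} - \frac{\varepsilon}{3}\right)^{7}$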
Calling SNPs and genotypes from transcriptome reads
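Model M2: sequencing error ε and allelic bias α
reads: A:1 C:0 G:6 T:0
genotype [GG]: $P(\text{reads} \mid GG) = 7 \cdot \varepsilon \cdot (1-3\varepsilon)^{6}$
genotype [AG]: $P(\text{reads} \mid AG) = 7 \cdot \left[ \frac{q'\,q''^{6}}{2} + \frac{q''\,q'^{6}}{2} \right]$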
A:0 C:3 G:0 T:16
A:4 C:0 G:1 T:0
A:0 C:19 G:2 T:0
A:8 C:0 G:2 T:1
A:0 C:3 G:12 T:0
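A possible reading of model M2 is sketched below. The slide does not define q' and q'' in the extracted text, so their definitions here are an assumption: α is taken as the expected read fraction of the under-expressed allele, q' and q'' as the resulting probabilities of reading the under- and over-represented allele, and the heterozygote likelihood as the average over which allele is the under-represented one. Following the [GG] formula, ε is taken as the probability of misreading a base as one specific other base. Reads showing a third base are ignored for simplicity, and the function names are hypothetical.

```python
from math import comb


def likelihood_hom_m2(n_other, n_allele, eps):
    """Homozygote: n_allele reads match the allele, n_other reads show one other base."""
    return comb(n_other + n_allele, n_other) * eps ** n_other * (1 - 3 * eps) ** n_allele


def likelihood_het_m2(n_first, n_second, eps, alpha):
    """Heterozygote with allelic bias alpha (assumed definition of q' and q'')."""
    q_low = alpha * (1 - 3 * eps) + (1 - alpha) * eps    # q': under-represented allele
    q_high = (1 - alpha) * (1 - 3 * eps) + alpha * eps   # q'': over-represented allele
    orderings = comb(n_first + n_second, n_first)
    # Average over the two possibilities for which allele is under-represented
    return orderings * (
        q_low ** n_first * q_high ** n_second / 2
        + q_high ** n_first * q_low ** n_second / 2
    )


eps = 0.02
# Reads A:1 G:6, as on the slide
print(likelihood_hom_m2(1, 6, eps))
print(likelihood_het_m2(1, 6, eps, alpha=0.5))  # no bias
print(likelihood_het_m2(1, 6, eps, alpha=0.2))  # strong bias
```

Under these assumptions, setting α = 0.5 removes the bias and the heterozygote likelihood reduces to 7(1/2 − ε)^7, the M1 form with the error rate reparameterized; a skewed count such as 1 vs 6 becomes more compatible with a heterozygote when α is small.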
Population genomics of a fast-evolver
focal species: Ciona intestinalis B (8 individuals) outgroup: Ciona intestinalis A (reference sequence)
1602 contigs (>10X in >5 individuals), of average length 138 codons
               M1                   M2
SNPs           30020                29544
error rate     0.021 [0.012-0.038]  0.020 [0.011-0.035]
allelic bias   0                    [0.08-0.5]
nb best model  70 (4.6%)            1532 (95.4%)
stop codons    77 (0.26%)           117 (0.39%)
F_IT           -0.017               -0.054
Population genomics of a fast-evolver
average πS: 0.057 per site (a highly polymorphic species)
average πN: 0.0026 per site
πN/πS : 0.046 (strong level of purifying selection)
dN/dS : 0.103 (high impact of adaptive evolution)
estimated proportion of adaptive non-synonymous substitutions: 54%
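As a rough consistency check on the figures above, the simplest McDonald-Kreitman-style estimator, α = 1 − (πN/πS)/(dN/dS), can be applied to the reported averages. The slides do not say which estimator produced the 54% figure, so the sketch below is an assumption about the form of the calculation, and the function name is hypothetical.

```python
def proportion_adaptive(pi_n, pi_s, dn_ds):
    """Simple MK-style estimate of the proportion of adaptive non-synonymous substitutions."""
    return 1.0 - (pi_n / pi_s) / dn_ds


pi_n, pi_s, dn_ds = 0.0026, 0.057, 0.103  # averages reported on the slide
print(f"piN/piS = {pi_n / pi_s:.3f}")
print(f"alpha   = {proportion_adaptive(pi_n, pi_s, dn_ds):.3f}")
```

From these rounded averages the estimate comes out at roughly 0.55 to 0.56, close to the reported 54%; the small gap is expected if the original estimate used unrounded or per-contig values, or a more elaborate method.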
Why are tunicates fast-evolving, proteome-wise?
[Figure: tree with tips labelled T, V, E and C; legend: adaptive, neutral, deleterious]
- higher mutation rate? YES
- more prevalent adaptive evolution? YES
- relaxed selective constraint on housekeeping genes? NO
→ large Ne, large µ (per year)
Conclusions
- de novo population genomics from NGS transcriptome data is doable
- transcriptome assembly is probably the trickiest step
- major population genomic descriptors are robust to error models
- life history traits apparently impact molecular evolution to some extent
- long-lived, small population-sized species are the best choice for phylogenomics
Subprojects we have started
[Figure labels: VERTEBRATES, INSECTS, MOLLUSCS, UROCHORDATES, CNID., NEM., NEMATODES, ANNELIDS, CRUSTACEANS, SPONG.]
- selfers vs outcrossers in snails and nematodes
- long-lived vs short-lived in insects
- big vs small in amniotes
- phylogeny of turtles
- fast protein evolution in tunicates and nematodes
- extreme longevity
Thanks to:
Philippe Gayral Vincent Cahais Georgia Tsagkogeorga Marion Ballenghien Zef Melo Ferreira Ylenia Chiari Lucy Weinert
Sylvain Glémin Nico Bierne Khalid Belkhir Fred Delsuc Vincent Ranwez
Guillaume Dugas Sébastien Harispe Caroline Benoist
CNRS
ISEM
ERC