genome annotation - WUR · Beyond genome annotation Automated annotation can provide candidate...

Genome annotation Erwin Datema (2011) Sandra Smit (2012, 2013)



Page 2

Genome annotation

AGACAAAGATCCGCTAAATTAAATCTGGACTTCACATATTGAAGTGATATCACACGTTTCTCTAATAATCTCCTCACAATATTATGTTTGGGATGAACTTGTCGTGATTTGCCATTGTAGCAATCACTTGAAGCCAAAATGTAGCAGCTTGGTTATCACAATAGATAGTTAAAGGTGGTACTGACTTACTCCACAAAGGACTATCTATCAACATAGATTGCAACCAATCATCTTCCTCCGAAAGAAGACATAGCTATAAATTCTGATTCATGGTTGAATGAGTAATACATGTCTGCTTCTTTGATTTACAAGAGATCGATGCTCCAGCCAAAGTGAAAATTCAAGCAGTATTTAGACTCATTCGGATCAGAATTCCAAGTAGCATCGCTATAACCTACGAAACATTTGAAATGTAGAATAATGTAGGCCAACATGTATCATACCCTTCAGGCATCTTAAATCACGAATTATGACATTCCAATGTTCAACTCTGGGATTACTAGTAAACTTGCTCAAAGTTCCTAGTGCTTAACCATATCCGGCCTGGTACAATGCATGACATACATAAAACTCCCAACAATCTGAGAATATGTCAGTTGATCGAGAACATCACTAGAGTTTCGCATTAGTTCGATGCTAGGATCAAATGGTGTAGGAGCAAGTTTGTCTCAAAGTGACCAAACCTCTTTAAATTTAAATTTAAATTTAAATTTAAATTTAAACTCAATATAACTTGATTGAATAAGAGTTAGGCCATTCGTTGATCTTATAATTTTGATGCCCAAAATAAATTTATAATGTTATAATACATAAAGACATATTATAACACAGATGTGTTTTGAAATTTACTAAATATGCAAATATCATCACCATTGATTGAGTAGTCATTAGAAATCATTACTCATCTAAATTTTTCATTTCATTATTTTGGAGCTTGCTTTAATCCAAAAAGAGATTTAAAAAGCTTACAGACTTTGTGTTCTTACAGGTATGACAAATACTTCTGATTGTTTCATGTACACTTCTTCATCTAGATCACCATTTAGAAATGCAGTCTTCACATCCATTTGATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCTACCAAGCGAGCTTTGTACTTATCTAAGGTACCATCGATTTTAAATTCTTTCTAAAAGTCCATCTACAATCAATAGATTTACATCCAGAAGGGAGGTCTGATAAAATTCATGTATTGTTTGACATAATAGAGTGCATTTCATCATTAATACCTTCACGCCAGAAAGGATCATCATGTGAAGCCGTTGCTTCAACAAAACTTTCAGGATCCCTTTCAACTAAATAAACTTGAAATTTTGGTCTAAAATCTCTTGGTTTGGCTGATCTTGCACTACGTCTTGATTGATTTTCTTCCAAAGGTTCAACTATTTTCTTTTAAGAGAAGATATA!

Gene: SL6G63120.1

Description: Putative disease resistance gene, Mi-homolog

Domains: NBS, LRR

Best Blast hit: SA004581

Page 3

Gene prediction – the eukaryotic gene model

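[Figure: schematic of a eukaryotic gene lying between two intergenic regions. Labelled parts: promoter, 5’ UTR, initial exon, intron, internal exon, intron, terminal exon, 3’ UTR, poly-A. Splice sites mark the exon-intron boundaries; the protein coding region runs from the start codon (ATG) to the stop codon (*).]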
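The gene model above can be illustrated with a minimal Python sketch. It assumes a forward-strand gene, ignores the UTRs, and uses a made-up toy sequence; the function names are hypothetical and not taken from the slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def spliced_cds(genome, coding_exons):
    """Join coding exons (0-based, half-open intervals) into one coding sequence."""
    return "".join(genome[start:end] for start, end in coding_exons)

def looks_like_complete_cds(cds):
    """True if the sequence starts with ATG, ends with a stop codon,
    and has a length that is a multiple of three."""
    return len(cds) % 3 == 0 and cds[:3] == "ATG" and cds[-3:] in STOP_CODONS

# Toy locus (invented): intergenic, initial exon, intron, terminal exon, intergenic.
genome = "CCCC" + "ATGGCT" + "GTAAGTTTTCAG" + "GGTTAA" + "CCCC"
coding_exons = [(4, 10), (22, 28)]

cds = spliced_cds(genome, coding_exons)
print(cds, looks_like_complete_cds(cds))
```

Removing the intron joins the two exons into one open reading frame that runs from ATG to the stop codon.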

Page 4

Structural gene annotation – alignment–based (1)

- Prediction of gene structures based on alignments
  - Transcripts and proteins provide direct evidence
  - Requires experimental data for each gene

[Figure: two gene structures, each marked with a start codon (ATG) and a stop codon (*)]
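As a rough illustration of how a transcript alignment yields a gene structure, the sketch below treats the aligned blocks as exons, takes the gaps between them as introns, and checks each intron for the canonical GT...AG boundaries. The coordinates and sequence are invented, and the function names are hypothetical.

```python
def introns_from_blocks(blocks):
    """Given sorted exon blocks (0-based, half-open), return the gaps between them."""
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:])]

def has_canonical_splice_sites(genome, intron):
    """Check for a GT donor at the intron start and an AG acceptor at its end."""
    start, end = intron
    return genome[start:start + 2] == "GT" and genome[end - 2:end] == "AG"

# Toy forward-strand example (invented): a transcript aligning in two blocks.
genome = "CCCCATGGCTGTAAGTTTTCAGGGTTAACCCC"
aligned_blocks = [(4, 10), (22, 28)]

introns = introns_from_blocks(aligned_blocks)
for intron in introns:
    print(intron, has_canonical_splice_sites(genome, intron))
```

Real spliced aligners also score mismatches and non-canonical splice sites; this sketch only shows the coordinate logic.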

Page 5

Structural gene annotation – alignment–based (2)

- Genome-to-genome alignment
  - Requires annotated genome of closely related species

Page 6

Example – tomato genome browser

Page 7

Structural gene annotation – ab initio (1)

- Prediction of gene structures based on ‘gene model’
  - Start, stop, splice sites
  - Exon, intron, intergenic length distributions
  - Triplet/hexamer frequencies (coding vs. non-coding)

[Figure: genomic sequence with several candidate start codons (ATG) and stop codons (*) marked]
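The coding vs. non-coding k-mer signal can be sketched as a log-odds score. This is a simplified illustration, not the method of any particular predictor: it ignores reading frame, uses tiny invented training sequences, and sets k=3 so the toy data is sufficient (k=6 would give hexamers).

```python
import math
from collections import Counter
from itertools import product

def kmer_freqs(seqs, k, pseudocount=1.0):
    """Relative k-mer frequencies over all DNA k-mers, smoothed with a pseudocount."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers) + pseudocount * len(kmers)
    return {km: (counts[km] + pseudocount) / total for km in kmers}

def coding_score(seq, coding, noncoding, k):
    """Sum of log-odds (coding vs. non-coding) over all k-mers in seq."""
    score = 0.0
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in coding:  # skip k-mers containing non-ACGT characters
            score += math.log(coding[km] / noncoding[km])
    return score

# Invented training data, for illustration only.
K = 3
coding_model = kmer_freqs(["ATGGCTGCTGCTGGTGCT"], K)
noncoding_model = kmer_freqs(["ATATATATTTATATAAAT"], K)

print(coding_score("GCTGCTGGT", coding_model, noncoding_model, K))
print(coding_score("ATATATTTA", coding_model, noncoding_model, K))
```

A positive score means the sequence resembles the coding training set more than the non-coding one.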

Page 8

Structural gene annotation – ab initio (2)

- Different predictors produce different results
  - Underlying models (HMM, SVM, …)
  - Quality of training
  - Lack of understanding of biology

Page 9

Structural gene annotation – ab initio (3)

- Requirements for ab initio gene predictors
  - Training through verified transcript (and protein) alignments
  - Sufficient sequence context in order to make accurate predictions

- Some properties are common to all eukaryotes
  - Start, stop, splice site consensus

- Many properties differ, even between related species
  - Intron and intergenic length distribution
  - Codon usage
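Codon usage is one of the species-specific properties a predictor has to learn from training data. A minimal sketch of how it might be tabulated from verified coding sequences is shown below; the two "species" and their sequences are invented for illustration.

```python
from collections import Counter

def codon_usage(cds_list):
    """Relative frequency of each codon across in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        counts.update(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Invented toy data: both encode Met-Ala-Ala-Ala-stop with different alanine codons.
usage_a = codon_usage(["ATGGCTGCTGCCTAA"])
usage_b = codon_usage(["ATGGCGGCGGCATAA"])

for codon in sorted(set(usage_a) | set(usage_b)):
    print(codon, usage_a.get(codon, 0.0), usage_b.get(codon, 0.0))
```

A predictor trained on one table would score the other species' genes poorly, which is why training on the target species matters.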

Page 10

Example - differences in intron lengths

[Figure: comparison of intron lengths, with labelled values of ~200 nt, ~300 nt, ~1200 nt and ~2200 nt. Source: Bradnam and Korf, PLoS One, 2008]

Page 11

Generation of a consensus gene structure

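[Figure: evidence tracks labelled GP1, GP2, GP3, EST and prot are combined into a single consensus track labelled gene, drawn above the sequence below]

ATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATACCTTTTGCTACCAAGCGAGCTTTGTACTTATCTAAGG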
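One simple way to picture consensus building is per-base voting: a position belongs to the consensus exon set if enough evidence tracks cover it. The sketch below assumes unweighted votes and invented coordinates; real integration tools typically weight evidence types differently and keep exon boundaries consistent with splice sites.

```python
def consensus_exons(tracks, length, min_support):
    """Return maximal intervals covered by at least min_support tracks.

    tracks: list of exon lists, each exon a 0-based, half-open (start, end);
    exons within one track are assumed not to overlap each other.
    """
    support = [0] * length
    for exons in tracks:
        for start, end in exons:
            for pos in range(start, min(end, length)):
                support[pos] += 1

    consensus, run_start = [], None
    for pos, votes in enumerate(support + [0]):  # sentinel closes a final run
        if votes >= min_support and run_start is None:
            run_start = pos
        elif votes < min_support and run_start is not None:
            consensus.append((run_start, pos))
            run_start = None
    return consensus

# Invented coordinates for three gene predictors plus EST and protein evidence.
evidence = {
    "GP1": [(10, 50), (80, 120)],
    "GP2": [(12, 50), (80, 118)],
    "GP3": [(10, 48), (90, 120)],
    "EST": [(10, 50), (80, 120)],
    "prot": [(13, 49), (80, 119)],
}
gene = consensus_exons(list(evidence.values()), length=150, min_support=3)
print(gene)
```

Raising min_support makes the consensus more conservative, trimming exon ends that only a minority of tracks support.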

Page 12

Example – tomato genome browser

Page 13

Functional gene annotation – alignment–based

- Inferring function through sequence similarity
  - Proteins with similar sequence often share function

- Annotation quality of database sequences
  - Many proteins with unknown function
  - Propagation of erroneous annotation

GQPKSKITHVVFCCTSGVDMPGADYQLTKLLGLRPSVKRLMMYQQG
|||| |: ||||| |||||||||  ||||| |||||:| |||||||
GQPKEKLGHVVFCTTSGVDMPGA--QLTKLMGLRPSIKKLMMYQQG
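The degree of similarity in the alignment above can be quantified as percent identity. The sketch below counts identical, non-gap columns in the two aligned protein strings from the slide; the function name is hypothetical, and dividing by the full alignment length is only one of several conventions in use.

```python
def alignment_identity(a, b, gap="-"):
    """Return (identical columns, total columns) for two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return matches, len(a)

query = "GQPKSKITHVVFCCTSGVDMPGADYQLTKLLGLRPSVKRLMMYQQG"
subject = "GQPKEKLGHVVFCTTSGVDMPGA--QLTKLMGLRPSIKKLMMYQQG"

matches, columns = alignment_identity(query, subject)
print(f"{matches}/{columns} identical columns ({100 * matches / columns:.1f}%)")
```

High identity makes shared function plausible, but as the slide notes, the transferred annotation is only as good as the database entry it comes from.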

Page 14

Functional gene annotation – domain–based

- Inferring function through domain searches
  - Domains are the functional parts of a protein

- Global functional annotation of the protein
  - E.g. kinase, ATP-binding
  - Gene Ontology (GO) terms

[Figure: protein domain architecture with domains labelled NBS, CC, LRR, LRR, LRR]

Page 15

The annotated gene

[Figure: the gene model drawn on the sequence below, together with its annotation]

- blastx: Unknown protein [Arabidopsis thaliana]
- domains: NBS, LRR, LRR, LRR
- Gene Ontology terms
  - GO:0005524 ATP binding
  - GO:0006915 apoptosis
- blastn: Putative disease resistance gene, Mi-homolog

ATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCTACCAAGCAGCTTTGTACTTATCTAAGG

Page 16

Genome annotation: more than gene finding

AGACAAAGATCCGCTAAATTAAATCTGGACTTCACATATTGAAGTGATATCACACGTTTCTCTAATAATCTCCTCACAATATTATGTTTGGGATGAACTTGTCGTGATTTGCCATTGTAGCAATCACTTGAAGCCAAAATGTAGCAGCTTGGTTATCACAATAGATAGTTAAAGGTGGTACTGACTTACTCCACAAAGGACTATCTATCAACATAGATTGCAACCAATCATCTTCCTCCGAAAGAAGACATAGCTATAAATTCTGATTCATGGTTGAATGAGTAATACATGTCTGCTTCTTTGATTTACAAGAGATCGATGCTCCAGCCAAAGTGAAAATTCAAGCAGTATTTAGACTCATTCGGATCAGAATTCCAAGTAGCATCGCTATAACCTACGAAACATTTGAAATGTAGAATAATGTAGGCCAACATGTATCATACCCTTCAGGCATCTTAAATCACGAATTATGACATTCCAATGTTCAACTCTGGGATTACTAGTAAACTTGCTCAAAGTTCCTAGTGCTTAACCATATCCGGCCTGGTACAATGCATGACATACATAAAACTCCCAACAATCTGAGAATATGTCAGTTGATCGAGAACATCACTAGAGTTTCGCATTAGTTCGATGCTAGGATCAAATGGTGTAGGAGCAAGTTTGTCTCAAAGTGACCAAACCTCTTTAAATTTAAATTTAAATTTAAATTTAAATTTAAACTCAATATAACTTGATTGAATAAGAGTTAGGCCATTCGTTGATCTTATAATTTTGATGCCCAAAATAAATTTATAATGTTATAATACATAAAGACATATTATAACACAGATGTGTTTTGAAATTTACTAAATATGCAAATATCATCACCATTGATTGAGTAGTCATTAGAAATCATTACTCATCTAAATTTTTCATTTCATTATTTTGGAGCTTGCTTTAATCCAAAAAGAGATTTAAAAAGCTTACAGACTTTGTGTTCTTACAGGTATGACAAATACTTCTGATTGTTTCATGTACACTTCTTCATCTAGATCACCATTTAGAAATGCAGTCTTCACATCCATTTGATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCTACCAAGCGAGCTTTGTACTTATCTAAGGTACCATCGATTTTAAATTCTTTCTAAAAGTCCATCTACAATCAATAGATTTACATCCAGAAGGGAGGTCTGATAAAATTCATGTATTGTTTGACATAATAGAGTGCATTTCATCATTAATACCTTCACGCCAGAAAGGATCATCATGTGAAGCCGTTGCTTCAACAAAACTTTCAGGATCCCTTTCAACTAAATAAACTTGAAATTTTGGTCTAAAATCTCTTGGTTTGGCTGATCTTGCACTACGTCTTGATTGATTTTCTTCCAAAGGTTCAACTATTTTCTTTTAAGAGAAGATATA!

- Non-coding RNAs
  - tRNA genes
  - rRNA
  - miRNAs
  - …

- Repetitive sequences
  - Interspersed repeats (e.g. transposons)
  - Tandem repeats, SSRs
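Tandem repeats such as SSRs are simple enough to find with a regular expression. The sketch below is a naive illustration using Python's re module: it reports perfect repeats only, and the thresholds are arbitrary assumptions. The test fragment is built to mirror the TTTAAA run visible in the slide's sequence.

```python
import re

def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=3):
    """Find perfect tandem repeats; returns (start, end, unit, copies) tuples.

    The unit is matched lazily, so the shortest repeating unit is preferred.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        hits.append((m.start(), m.end(), unit, (m.end() - m.start()) // len(unit)))
    return hits

# Fragment mirroring the repeat in the slide sequence: five copies of TTTAAA.
fragment = "CCTC" + "TTTAAA" * 5 + "CTCAAT"
ssrs = find_ssrs(fragment)
print(ssrs)
```

Dedicated tandem repeat finders also tolerate mismatches and indels between copies, which this pattern does not.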

Page 17

Repeat identification and masking

- Repeats may contain coding elements
  - E.g. reverse transcriptase in a retrotransposon

- This may result in many ‘false’ gene predictions
  - 56,797 genes predicted in rice
  - 16,220 of these are repeat-related!

- Prior to gene prediction, repeats should be masked
  - Sequence similarity: requires database of known repeats
  - De novo: distinguish between gene families and repeats
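Once repeat intervals are known, masking itself is mechanical. The sketch below shows the two common conventions: hard masking replaces repeat bases with N, while soft masking lowercases them. The function name, coordinates and sequence are illustrative assumptions.

```python
def mask_repeats(seq, repeats, hard=True):
    """Mask repeat intervals (0-based, half-open) in a DNA sequence.

    hard=True replaces bases with 'N'; hard=False lowercases them (soft masking).
    """
    chars = list(seq)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)

sequence = "ACGTACGTAC"
repeat_intervals = [(2, 5)]

print(mask_repeats(sequence, repeat_intervals))              # hard-masked
print(mask_repeats(sequence, repeat_intervals, hard=False))  # soft-masked
```

Soft masking keeps the underlying sequence available, which is useful when a gene prediction legitimately extends into a repeat.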

Page 18

Genome annotation pipeline

[Figure: pipeline diagram. Inputs: genome sequence, repeat database, transcripts. Components shown: repeat masker, three gene predictors, BLAST, integration, domain search, miRNA and tRNA annotation.]

Page 19

The annotated genome

ATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCAC

[Annotation tracks for the sequence above: genes, repeats, tRNAs]

CGAGTCAGCTTCATATACTGCGCGCGATATATATTATCGCGTACGATCGATCGATCTGTACGGGTGACTTATTCGTGTATAGTCTATATCTTCGCTAGCTGATTATCGAGCGTACGTACGT

[Annotation tracks for the sequence above: genes, repeats, tRNAs]

[blastx hits shown for the annotated genes: ABC transporter, Cytochrome P450, MADS box transcription factor]

Page 20

Example – tomato genome browser

Page 21

Beyond genome annotation

- Automated annotation can provide candidate genes
  - Similarity to known genes from other species
  - Targets for crop improvement, treatment of (genetic) diseases, etc.

- Comparative genomics
  - Study the evolution of species
  - What makes a species unique?
  - What makes an individual unique?

Page 22

Activities

- Read “A beginner's guide to eukaryotic genome annotation”
  - Mark Yandell, Daniel Ence
  - Nature Reviews Genetics 2012 vol. 13 (5) pp. 329-42

- Explore the tomato genome annotation
  - http://solgenomics.net/
  - Tomato Genome Project

- Explore the Arabidopsis genome annotation
  - http://www.arabidopsis.org/
  - Tools, gbrowse

Page 23

The End

© Wageningen UR