Genome annotation - WUR

Genome annotation
Erwin Datema (2011), Sandra Smit (2012, 2013)

Genome annotation
[Figure: raw genomic DNA sequence filling the slide, beginning AGACAAAGATCCGCTAAATTAAATCTGG...]
Gene: SL6G63120.1
Description: Putative disease resistance gene, Mi-homolog
Domains: NBS, LRR
Best Blast hit: SA004581

Gene prediction – the eukaryotic gene model
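[Figure: the eukaryotic gene model. Intergenic sequence flanks the gene on both sides. The gene consists of a promoter, 5’ UTR, initial exon, intron, internal exon, intron, terminal exon, 3’ UTR and poly-A signal. Splice sites mark the exon-intron boundaries; the protein coding region runs from the start codon (ATG) to the stop codon (*).]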
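The gene model can also be written down as a small data structure. The sketch below is only an illustration: the class name `GeneModel`, its fields, and the toy sequence are assumptions made for this example and are not part of the lecture. It covers the forward strand only and uses 0-based, end-exclusive coordinates.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Toy gene model on the forward strand (0-based, end-exclusive)."""
    exons: list      # [(start, end), ...]: initial, internal and terminal exons
    cds_start: int   # position of the A of the start codon (ATG)
    cds_end: int     # position just past the stop codon (*)

    def introns(self):
        """Introns are the gaps between consecutive exons."""
        return [(a_end, b_start)
                for (_, a_end), (b_start, _) in zip(self.exons, self.exons[1:])]

    def transcript(self, genome):
        """Spliced transcript: exons joined, introns removed, UTRs kept."""
        return "".join(genome[s:e] for s, e in self.exons)

    def coding_sequence(self, genome):
        """Exon parts that lie between the start and the stop codon."""
        parts = []
        for s, e in self.exons:
            s, e = max(s, self.cds_start), min(e, self.cds_end)
            if s < e:
                parts.append(genome[s:e])
        return "".join(parts)


# Made-up toy locus: intergenic + exon 1 + intron (gt...ag) + exon 2 + intergenic
genome = "ccc" + "GGATGGC" + "gtaaag" + "TTGACC" + "tt"
model = GeneModel(exons=[(3, 10), (16, 22)], cds_start=5, cds_end=20)

print(model.introns())                # intron coordinates
print(model.transcript(genome))       # 5' UTR + coding region + 3' UTR
print(model.coding_sequence(genome))  # from ATG up to and including the stop
```

In this toy locus the UTRs are the exon bases outside the coding region: `GG` before the ATG and `CC` after the stop codon.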

Structural gene annotation – alignment–based (1)
Prediction of gene structures based on alignments
  Transcripts and proteins provide direct evidence
  Requires experimental data for each gene
[Figure: two aligned gene structures, each running from start codon (ATG) to stop codon (*)]

Structural gene annotation – alignment–based (2)
Genome-to-genome alignment
  Requires annotated genome of closely related species

Example – tomato genome browser

Structural gene annotation – ab initio (1)
Prediction of gene structures based on ‘gene model’
  Start, stop, splice sites
  Exon, intron, intergenic length distributions
  Triplet/hexamer frequencies (coding vs. non-coding)
[Figure: candidate start codons (ATG) and stop codons (*) marked along a sequence]
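As an illustration of the hexamer-frequency signal, the sketch below scores a sequence by summing log-odds of its hexamers under a "coding" versus a "non-coding" count table. This is a simplified stand-in, not the model of any particular gene predictor: the function names, the pseudocount, and the two tiny training sequences are assumptions made up for the example. Real predictors train on many verified genes and usually track the reading frame as well.

```python
import math
from collections import Counter


def kmer_counts(seqs, k=6):
    """Count all overlapping k-mers (hexamers by default) in the sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def log_odds_score(seq, coding, noncoding, k=6, pseudo=1.0):
    """Sum of log(P(kmer | coding) / P(kmer | non-coding)) over the k-mers of seq.

    A pseudocount keeps unseen k-mers from producing a zero probability.
    """
    n_kmers = 4 ** k
    c_total = sum(coding.values()) + pseudo * n_kmers
    n_total = sum(noncoding.values()) + pseudo * n_kmers
    score = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        p_coding = (coding[kmer] + pseudo) / c_total
        p_noncoding = (noncoding[kmer] + pseudo) / n_total
        score += math.log(p_coding / p_noncoding)
    return score


# Made-up training data, far too small for real use
coding = kmer_counts(["ATGGCTGCTGCTGCTGCTTAA"])
noncoding = kmer_counts(["ATATATATATATATATATAT"])

print(log_odds_score("GCTGCTGCT", coding, noncoding))  # positive: looks coding
print(log_odds_score("ATATATATA", coding, noncoding))  # negative: looks non-coding
```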

Structural gene annotation – ab initio (2)
Different predictors produce different results
  Underlying models (HMM, SVM, …)
  Quality of training
  Lack of understanding of biology

Structural gene annotation – ab initio (3)
Requirements for ab initio gene predictors
  Training through verified transcript (and protein) alignments
  Sufficient sequence context in order to make accurate predictions
Some properties are common to all eukaryotes
  Start, stop, splice site consensus
Many properties differ, even between related species
  Intron and intergenic length distribution
  Codon usage
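Codon usage is one of the species-specific properties a predictor has to learn from training data. A minimal sketch of how it could be tabulated from a coding sequence follows; the function name `codon_usage` and the example sequence are illustrative assumptions, not data from a real organism.

```python
from collections import Counter


def codon_usage(cds):
    """Relative frequency of each codon in a coding sequence (frame 0)."""
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Made-up coding sequence: ATG GCT GCT GCC TAA
usage = codon_usage("ATGGCTGCTGCCTAA")
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 2))
```

Tables like this, computed over verified genes of two species, can be compared to see how much their codon preferences differ.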

Example - differences in intron lengths
[Figure: intron length comparison across species; the values shown are ~200 nt, ~300 nt, ~1200 nt and ~2200 nt]
Bradnam and Korf, PLoS One. 2008

Generation of a consensus gene structure
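[Figure: three gene predictions (GP1, GP2, GP3), EST alignments and protein alignments (prot) are combined into one consensus gene structure (gene) on the sequence below]
ATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATACCTTTTGCTACCAAGCGAGCTTTGTACTTATCTAAGG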
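One simple way to picture the consensus step is a per-base vote: a position belongs to the consensus exon structure if enough evidence tracks cover it. The sketch below is a deliberately naive illustration under that assumption. The function name, the `min_support` threshold, and all coordinates are made up, and real integration tools typically weight evidence types differently rather than counting them equally.

```python
def consensus_exons(tracks, length, min_support=2):
    """Merge positions covered by at least `min_support` tracks into intervals.

    tracks: list of exon lists, each exon a (start, end) pair,
            0-based and end-exclusive.
    """
    support = [0] * length
    for exons in tracks:
        covered = set()
        for start, end in exons:
            covered.update(range(start, end))
        for pos in covered:          # each track votes at most once per base
            support[pos] += 1

    intervals, start = [], None
    for pos, n in enumerate(support + [0]):   # trailing 0 closes an open interval
        if n >= min_support and start is None:
            start = pos
        elif n < min_support and start is not None:
            intervals.append((start, pos))
            start = None
    return intervals


# Hypothetical evidence tracks for one locus
evidence = {
    "GP1": [(10, 50), (80, 120)],
    "GP2": [(12, 50), (80, 118)],
    "GP3": [(10, 48), (90, 120)],
    "EST": [(10, 50), (80, 120)],
}
tracks = list(evidence.values())

majority = consensus_exons(tracks, length=150, min_support=3)
unanimous = consensus_exons(tracks, length=150, min_support=4)
print(majority)
print(unanimous)
```

Raising `min_support` trades sensitivity for specificity: the unanimous structure keeps only the bases that every track agrees on.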

Example – tomato genome browser

Functional gene annotation – alignment–based
Inferring function through sequence similarity
  Proteins with similar sequence often share function
Annotation quality of database sequences
  Many proteins with unknown function
  Propagation of erroneous annotation
GQPKSKITHVVFCCTSGVDMPGADYQLTKLLGLRPSVKRLMMYQQG
|||| |: ||||| |||||||||  ||||| |||||:| |||||||
GQPKEKLGHVVFCTTSGVDMPGA--QLTKLMGLRPSIKKLMMYQQG
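The similarity in the alignment above can be quantified. The sketch below counts identical columns and gap columns for the two aligned sequences from the slide. The function name `alignment_stats` is an illustrative assumption, and percent identity is computed here over the full alignment length, which is only one of several conventions in use.

```python
def alignment_stats(a, b, gap="-"):
    """Return (identical columns, gap columns, alignment length)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(x == y and x != gap for x, y in zip(a, b))
    gaps = sum(x == gap or y == gap for x, y in zip(a, b))
    return identical, gaps, len(a)


# The two aligned protein fragments shown on the slide
query = "GQPKSKITHVVFCCTSGVDMPGADYQLTKLLGLRPSVKRLMMYQQG"
sbjct = "GQPKEKLGHVVFCTTSGVDMPGA--QLTKLMGLRPSIKKLMMYQQG"

identical, gaps, length = alignment_stats(query, sbjct)
percent_identity = 100.0 * identical / length
print(identical, gaps, length, round(percent_identity, 1))
```

For this pair, 37 of 46 columns are identical (about 80%), with 2 gap columns.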

Functional gene annotation – domain–based
Inferring function through domain searches
  Domains are the functional parts of a protein
Global functional annotation of the protein
  E.g. kinase, ATP-binding
  Gene Ontology (GO) terms
[Figure: example domain architecture: NBS, CC, LRR, LRR, LRR]

The annotated gene
Gene model: [track drawn above the sequence]
blastx: Unknown protein [Arabidopsis thaliana]
Domains: NBS, LRR, LRR, LRR
Gene Ontology terms: GO:0005524 ATP binding; GO:0006915 apoptosis
blastn: Putative disease resistance gene, Mi-homolog
ATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCTACCAAGCAGCTTTGTACTTATCTAAGG

Genome annotation: more than gene finding
[Figure: raw genomic DNA sequence filling the slide, beginning AGACAAAGATCCGCTAAATTAAATCTGG...]
Non-coding RNAs
  tRNA genes
  rRNA
  miRNAs
  …
Repetitive sequences
  Interspersed repeats (e.g. transposons)
  Tandem repeats, SSRs

Repeat identification and masking
Repeats may contain coding elements
  E.g. reverse transcriptase in a retrotransposon
This may result in many ‘false’ gene predictions
  56,797 genes predicted in rice
  16,220 of these are repeat-related!
Prior to gene prediction, repeats should be masked
  Sequence similarity: requires a database of known repeats
  De novo: must distinguish between gene families and repeats
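Masking itself is a simple operation once repeat coordinates are known. The sketch below shows the two common variants, soft masking (lowercase) and hard masking (replace with N), and works out the share of repeat-related predictions from the rice figures above. The function name, the toy sequence, and the repeat interval are assumptions made for illustration. Finding the repeats in the first place is the hard part and is not shown.

```python
def mask_repeats(seq, repeats, hard=False):
    """Mask repeat intervals (0-based, end-exclusive) in a sequence.

    Soft masking lowercases the repeat; hard masking replaces it with N.
    """
    chars = list(seq)
    for start, end in repeats:
        for i in range(start, min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


# Made-up sequence with a tandem repeat in the middle
genome = "ATGGCT" + "TATATATATA" + "GGCTAA"
repeats = [(6, 16)]

soft = mask_repeats(genome, repeats)
hard = mask_repeats(genome, repeats, hard=True)
print(soft)
print(hard)

# Share of repeat-related gene predictions in the rice example
predicted, repeat_related = 56797, 16220
repeat_fraction = repeat_related / predicted
print(round(100 * repeat_fraction, 1), "% of predictions are repeat-related")
```

Soft masking keeps the original bases available to downstream tools, whereas hard masking discards them.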

Genome annotation pipeline
[Figure: pipeline flowchart with the following boxes: genome sequence, repeat database, repeat masker, transcripts, three gene predictors, integration, BLAST, domain search, miRNA, tRNA]

The annotated genome
ATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCAC
[Tracks: genes, repeats, tRNAs]
CGAGTCAGCTTCATATACTGCGCGCGATATATATTATCGCGTACGATCGATCGATCTGTACGGGTGACTTATTCGTGTATAGTCTATATCTTCGCTAGCTGATTATCGAGCGTACGTACGT
[Tracks: genes, repeats, tRNAs]
[blastx hits shown: ABC transporter, Cytochrome P450, MADS box transcription factor]

Example – tomato genome browser

Beyond genome annotation
Automated annotation can provide candidate genes
  Similarity to known genes from other species
  Targets for crop improvement, treatment of (genetic) diseases, etc.
Comparative genomics
  Study the evolution of species
  What makes a species unique?
  What makes an individual unique?

Activities
Read “A beginner's guide to eukaryotic genome annotation”
  Mark Yandell, Daniel Ence
  Nature Reviews Genetics 2012 vol. 13 (5) pp. 329-42
Explore the tomato genome annotation
  http://solgenomics.net/
  Tomato Genome Project
Explore the Arabidopsis genome annotation
  http://www.arabidopsis.org/
  Tools, gbrowse

The End
© Wageningen UR