Giab ashg webinar 160224

Genome in a Bottle Consortium February 24, 2016

Reference Materials for Human Genome Sequencing

Justin Zook, Ph.D. and Marc Salit, Ph.D., National Institute of Standards and Technology

Outline

• Genome in a Bottle (GIAB) products

• Current and future work

• Best practices for using GIAB products to benchmark variant calls

• Genome in a Bottle

– Open consortium to develop well-characterized genomes for benchmarking

– 100-150 public, private, and academic participants at workshops

GIAB Scope

• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls.

• Priority is authoritative characterization of human genomes.

GIAB steering committee, Aug 2015

Well-characterized, stable RMs

• Obtain metrics for validation, QC, QA, PT

• Determine sources and types of bias/error

• Learn to resolve difficult structural variants

• Improve reference genome assembly

• Optimization

• Enable regulated applications

Analytical Performance

• Use well-characterized genomic DNA reference materials to benchmark performance

• Tools to facilitate their use

– With the Global Alliance Data Working Group Benchmarking Team

Generic measurement process: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis

High-confidence SNP/indel calls

• Methods to develop SNP/indel call set described in manuscript

• Broad and quick adoption of call set for benchmarking

– Struck a nerve

Zook et al., Nature Biotechnology, 2014.

Candidate NIST Reference Materials

Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A

Note: RMs 8391 to 8393 are planned for release by end of Q2 2016

Dataset | AJ Son | AJ Parents | Chinese son | Chinese parents | NA12878

Illumina Paired-end: X X X X X
Illumina Long Mate pair: X X X X X
Illumina “moleculo”: X X X X X
Complete Genomics: X X X X X
Complete Genomics LFR: X X X
Ion exome: X X X X
BioNano: X X X X
10X: X X X
PacBio: X X X
SOLiD single end: X X X
Illumina exome: X X X X
Oxford Nanopore: X

Paper describing the data…

Data Release: Real-time, Open, Public Release

Individual Datasets

• Uploaded to GIAB FTP site as data are collected

• Includes raw reads, aligned reads, and variant/reference calls

• 12 datasets described in bioRxiv paper

Integrated High-confidence Calls

• Develop SNP, indel, and homozygous reference calls similar to NA12878

• Developing methods to form high-confidence calls for difficult variant types and regions

• Released calls are versioned

• Preliminary call-sets will be made available to be critiqued

SNP/Indel Integration Method Update

• Implementing refined integration methods

– Developed so others can readily reproduce results

– Consistent results for all GIAB genomes

– Simpler process taking advantage of best practices for each technology

• Validating with released NA12878 RM data

– Preliminary comparisons show minor changes

• Application to PGP trios

– Plan to analyze AJ trio by Q2 2016

– Release of NIST RMs in Q2 2016

– Develop calls for GRCh38

Proposed approach to form high-confidence SV (and non-SV) calls

Generate Candidate Calls

Compare/evaluate calls using Parliament/MetaSV/svclassify/others?; manual inspection

Integrate new and revised calls; manual inspection

Combine integrated calls; manual inspection; targeted experimental validation?

Timeline labels: Aug/Dec 2015; Aug 2015-Jan 2016; Planning in Jan-Feb 2016; Feb 2016 and beyond

Preliminary comparisons of 17 Deletion Callsets: Sensitivity to calls in 2 technologies

NOTE: These are preliminary comparisons of data under active development and likely different from true sensitivity of callers

Preliminary comparisons of 17 Deletion Callsets: Difference between predicted size and median predicted size

NOTE: These are preliminary comparisons of data under active development and likely different from true size accuracy
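The size comparison named on this slide can be sketched as follows. This is a minimal illustration, assuming that the calls for one deletion have already been matched across callsets; the slides do not describe how that matching was done, and the callset names and sizes in the usage example are hypothetical.

```python
from statistics import median


def size_deviation(predicted_sizes):
    """Return each callset's predicted size minus the median predicted size.

    predicted_sizes: dict mapping a callset name (hypothetical here) to the
    deletion size in bp that the callset predicted for one matched deletion.
    """
    med = median(predicted_sizes.values())
    return {name: size - med for name, size in predicted_sizes.items()}


# Hypothetical example: three callsets reporting the same deletion.
example = size_deviation({"callset_a": 100, "callset_b": 110, "callset_c": 95})
```

A positive value means the callset predicted a larger deletion than the median across callsets, and a negative value a smaller one.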

Preliminary comparisons of 17 Deletion Callsets: Number of unique calls

NOTE: These are preliminary comparisons of data under active development without filtering and unique calls may be correct

GeT-RM Browser from NCBI and CDC

• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/

• Allows visualization of data underlying each call

Global Alliance for Genomics and Health Benchmarking Task Team

Progress:

• Initial version of standardized definitions for performance metrics like TP, FP, and FN.

• Continued development of sophisticated benchmarking tools

– vcfeval – Len Trigg

– hap.py – Peter Krusche

– vgraph – Kevin Jacobs

• Standardized intermediate and final file formats

• Standardized bed files with difficult genome contexts for stratification

• github.com/ga4gh/benchmarking-tools

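Stratification with bed files can be illustrated with a small sketch. This is not how vcfeval, hap.py, or vgraph work internally; it is a simplified example that assumes calls have already been labelled TP, FP, or FN, that each bed file holds sorted-able, non-overlapping intervals, and that positions are 1-based as in VCF. The function names and the stratum name are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def parse_bed(lines):
    """Parse BED lines into {chrom: sorted [(start, end), ...]} (0-based, half-open)."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    for chrom in regions:
        regions[chrom].sort()
    return dict(regions)


def in_regions(regions, chrom, pos):
    """True if 1-based position pos lies in an interval (assumes no overlaps)."""
    intervals = regions.get(chrom, [])
    # Find the last interval whose start is <= the 0-based position.
    idx = bisect_right(intervals, (pos - 1, float("inf"))) - 1
    return idx >= 0 and intervals[idx][0] <= pos - 1 < intervals[idx][1]


def stratify(calls, strata):
    """Count labels per stratum; calls is an iterable of (chrom, pos, label)."""
    counts = {name: {"TP": 0, "FP": 0, "FN": 0} for name in strata}
    for chrom, pos, label in calls:
        for name, regions in strata.items():
            if in_regions(regions, chrom, pos):
                counts[name][label] += 1
    return counts


# Hypothetical stratum and calls, for illustration only.
strata = {"difficult_context": parse_bed(["chr1\t100\t200", "chr1\t500\t600"])}
calls = [("chr1", 150, "TP"), ("chr1", 201, "FP"), ("chr1", 550, "FN"), ("chr2", 150, "TP")]
counts = stratify(calls, strata)
```

Per-stratum counts like these can then be turned into sensitivity and precision for each genome context separately.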


Proposed Performance Metrics Definitions

• Define TP/FP/FN/TN in 4 ways depending on required stringency of match:

• Loose match: TP if within x-bp of a true variant

• Allele match: TP if ALT allele matches

• Genotype match: TP if genotype and ALT allele match

• Phasing match: TP if genotype, ALT allele, and phasing with nearby variants all match

• True negatives are difficult to define because an infinite number of potential alleles exist
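The four stringency levels above can be sketched in code. This is a simplified illustration, not the GA4GH definition or any tool's implementation: it compares one query call against one truth call, does not reconcile different representations of the same variant, and treats phasing as a direct comparison of phased genotype strings, which assumes both callsets use the same haplotype orientation. The `Call` class, `match_level`, `is_tp`, and the default `window` of 10 bp (the slide's unspecified x) are all hypothetical.

```python
from dataclasses import dataclass

LEVELS = ["loose", "allele", "genotype", "phasing"]  # least to most stringent


@dataclass(frozen=True)
class Call:
    chrom: str
    pos: int
    ref: str
    alt: str
    gt: str  # VCF-style genotype: "0/1" (unphased) or "0|1" (phased)


def _unphased(gt):
    """Genotype as an order-independent tuple of allele indices."""
    return tuple(sorted(gt.replace("|", "/").split("/")))


def match_level(query, truth, window=10):
    """Return the most stringent level at which query matches truth, or None."""
    if query.chrom != truth.chrom or abs(query.pos - truth.pos) > window:
        return None
    if (query.pos, query.ref, query.alt) != (truth.pos, truth.ref, truth.alt):
        return "loose"
    if _unphased(query.gt) != _unphased(truth.gt):
        return "allele"
    if "|" in query.gt and "|" in truth.gt and query.gt == truth.gt:
        return "phasing"
    return "genotype"


def is_tp(query, truth, required, window=10):
    """True if query matches truth at the required level or a stricter one."""
    level = match_level(query, truth, window)
    return level is not None and LEVELS.index(level) >= LEVELS.index(required)


truth = Call("chr1", 1000, "A", "G", "0|1")
```

With this sketch, a call that matches position and ALT allele but reports a homozygous genotype counts as a TP under allele match and as a non-match under genotype match, which is why the required stringency has to be stated alongside any reported metric.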

Approaches to Benchmarking Variant Calling

• Well-characterized whole genome Reference Materials

• Many samples characterized in clinically relevant regions

• Synthetic DNA spike-ins

• Cell lines with engineered mutations

• Simulated reads

• Modified real reads

• Modified reference genomes

• Confirming results found in real samples over time

Challenges in Benchmarking Small Variant Calling

• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)

• Easiest to benchmark only within high-confidence bed file, but…

• Benchmark calls/regions tend to be biased towards easier variants and regions

– Some clinical tests are enriched for difficult sites

• Challenges with benchmarking complex variants near boundaries of high-confidence regions

• Always manually inspect a subset of FPs/FNs

• Stratification by variant type and region is important

• Always calculate confidence intervals on performance metrics
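One way to put a confidence interval on a metric such as sensitivity or precision is sketched below. The slides do not say which interval method to use; the Wilson score interval is chosen here as an assumption because it behaves reasonably for proportions near 0 or 1 and for small counts. The function name and the example counts are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical counts from one stratum: 95 TP and 5 FN.
tp, fn = 95, 5
sensitivity = tp / (tp + fn)
low, high = wilson_interval(tp, tp + fn)
```

The same function applies to precision by passing TP as `successes` and TP + FP as `total`. Small strata give wide intervals, which is a useful reminder that a metric from a handful of difficult sites carries little certainty.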

Benchmarking on PrecisionFDA

Acknowledgments

• FDA

• Many members of Genome in a Bottle

– New members welcome!

– Sign up on website for email newsletters

GIAB Steering Committee

Marc Salit, Justin Zook, David Mittelman, Andrew Grupe, Michael Eberle, Steve Sherry, Deanna Church, Francisco De La Vega, Christian Olsen, Monica Basehore, Lisa Kalman, Christopher Mason, Elizabeth Mansfield, Liz Kerrigan, Leming Shi, Melvin Limson, Alexander Wait Zaranek, Nils Homer, Fiona Hyland, Steve Lincoln, Don Baldwin, Robyn Temple-Smolkin, Chunlin Xiao, Kara Norman, Luke Hickey

For More Information

www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails

github.com/genome-in-a-bottle – Guide to GIAB data & ftp

www.slideshare.net/genomeinabottle

www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - GeT-RM Browser

Data: http://biorxiv.org/content/early/2015/09/15/026468

Global Alliance Benchmarking Team – https://github.com/ga4gh/benchmarking-tools

Twice yearly public workshops

– Winter at Stanford University, California, USA

– Summer at NIST, Maryland, USA

Justin Zook: [email protected]

Marc Salit: [email protected]
