Efficient Assembly of Large Genomes

Shaun Jackman

Genome Sciences Centre, BC Cancer, Vancouver, Canada
@sjackman · github.com/sjackman · sjackman.ca

2018 Tigmint. BMC Bioinformatics

2017 ABySS 2.0. Genome Research

2016 White Spruce Organelles. Genome Biology and Evolution

2015 UniqTag. PLOS ONE

Efficient Assembly
of Large Genomes

Introduction
ABySS 2.0
Tigmint
UniqTag
ORCA
Organellar genomes of white spruce
Mitochondrial genome of Sitka spruce
Genome assembly of western redcedar
Conclusion

Short Read Genome Assembly

ABySS 1.0 (2009) was the first to assemble
a human genome from short reads (42 bp!)

ABySS 1.0

de Bruijn graph assembler
Stored k-mers in a hash table
Distributed the hash table over many machines
Used MPI to aggregate sufficient memory
Assembles large genomes

ABySS 1.0

	Human	Spruce
Genome size	3 Gbp	20 Gbp
RAM	418 GB	4.3 TB
CPU cores	64	1,380
Wall time	14 hours	12 days
Year	2017	2013
Short DOI	doi:f9x8qp	doi:f4zzrr

Challenges

High memory usage
Interprocess communication is slow
Intermachine communication is really slow

Solution

A memory-efficient data structure
reduces memory usage
Fitting entire graph in a single machine
eliminates intermachine communication
OpenMP rather than MPI
eliminates interprocess communication

ABySS 2.0

ABySS 2.0 (2017) reduces the memory
usage of ABySS by ten fold.

Memory efficient de Bruijn graph using a Bloom filter Memory usage is independent of k — Memory efficient de Bruijn graph using a Bloom filter
Memory usage is independent of k

Navigating a Bloom filter de Bruijn graph Introduced by Minia (Chikhi et al. 2012) — Navigating a Bloom filter de Bruijn graph
Introduced by Minia (Chikhi *et al.* 2012)

Sequencing errors and Bloom filter false positives

Solid reads are extended using the Bloom filter de Bruijn graph to assemble unitigs

ABySS 2.0 reduces memory usage by 10 fold vs ABySS 1.0 for human genome assembly (GIAB HG004 NA24143)

Spruce genome assemblies

ABySS	1.3.5	2.0.0
Spruce species	Interior	Sitka
Machines	115	1
RAM (GB)	4,300	500
CPU cores	1,380	64
CPU time* (years)	6.0	3.2
Wall time* (days)	1.6	18
Year	2013	2017
Short DOI	doi:f4zzrr	NA

* Time of unitig assembly without scaffolding

Contiguity and correctness are comparable

41.9 Mbp NG50 scaffolded with BioNano optical mapping

ABySS 2.0 Conclusions

ABySS 2.0 reduces memory usage by 10 fold
from 418 GB for ABySS 1.0
to 34 GB for ABySS 2.0
for a human genome assembly
High-throughput short-read sequencing
combined with large molecule scaffolding
such as 10X Genomics, BioNano, Hi-C
permits cost effective assembly of large genomes

Linked Reads

Tools for Linked Reads

Align linked reads
Lariat (Long Ranger) · EMA
Structural variants
Long Ranger · GROC-SVs · NAIBR · SVenX · Topsorter
Phase variants
Long Ranger
Genome sequence assembly
Supernova
Scaffolding
ARCS · Architect · Fragscaff · Scaff10x

https://github.com/johandahlberg/awesome-10x-genomics

Contigs and scaffolds
come to an end due to…

repeats
sequencing gaps
structural variation
misassemblies

Misassemblies limit contiguity

particularly for highly contiguous assemblies.

Most scaffolding tools do not correct misassemblies.

Correct misassemblies

Scaffold

Method

Map reads to the assembly
Group reads within d bp of each other (d = 50 kbp)
Infer start and end coordinates of molecules
Construct an interval tree of the molecules
Each w bp region ought to be spanned by n molecules
(w = 1 kbp, n = 20)
Identify regions with fewer than n spanning molecules
Cut sequences at regions with insufficient coverage

Tracks from top to bottom
molecule coverage, molecules, read coverage, reads

https://github.com/JustinChu/JupiterPlot

Human genome assembly (GIAB HG004 NA24143)

Assemble human HG004 with PE, MP, and linked reads
Scaffolding with ARCS improved NGA50 from 3 to 8 Mbp
Tigmint reduced misassemblies by 216 (27% reduction)
Tigmint + ARCS improved NGA50 over five-fold to 16 Mbp

Human genome assemblies (GIAB HG004 NA24143)

Note: Supernova used only linked reads, others PE+MP+LR.

Corrects and improves long read assemblies too!

Sequencing	Nanopore	PacBio
Assembler	Canu	Falcon
NGA50 before Tigmint + ARCS	5.4 Mbp	4.2 Mbp
NGA50 after Tigmint + ARCS	10.9 Mbp	12.0 Mbp
Improvement	2.0x	2.9x

Time and Memory

bwa mem Map reads to assembly
5½ hours, 17 GB RAM, 48 threads
tigmint-molecule Group reads into molecules
3¼ hours, 0.08 GB RAM, 1 thread
tigmint-cut Identify misassemblies and cut sequences
7 minutes, 3.3 GB RAM, 48 threads

Tigmint Conclusions

Scaffolding after correcting with Tigmint yields an assembly both more correct and more contiguous.

Linked reads permit cost-effective assembly of large genomes using high-throughput sequencing.

Western redcedar (Thuja plicata) — Western redcedar (*Thuja plicata*)

Western Redcedar Methods

Trim adapters with Trimadap and NxTrim
Count k-mers with ntCard
Estimate genome size GenomeScope
Assemble PE and MP reads with ABySS 2.0
Correct assembly errors
with Chromium reads using Tigmint
Scaffold with Chromium reads using ARCS
Assess genome completeness using BUSCO

Western Redcedar Results

12.5 Gbp genome size estimated by flow cytometry
(Hizume et al. 2001 https://doi.org/d89svf)
9.8 Gbp genome size estimated by GenomeScope
7.95 Gbp assembled in scaffolds 1 kbp or larger
2.31 Mbp scaffold N50
1.71 Mbp scaffold NG50 (with G=10 Gbp)
Tigmint improved NG50 by 14% over ARCS alone
BUSCO 60.4% of core single-copy genes present
53.9% complete, 6.5% fragmented, 39.6% missing

Efficient Assembly
of Large Genomes

Introduction
ABySS 2.0
Tigmint
UniqTag
ORCA
Organellar genomes of white spruce
Mitochondrial genome of Sitka spruce
Genome assembly of western redcedar
Conclusion

Thesis Committee

Research Supervisors

Inanc Birol, Medical Genetics
Joerg Bohlmann, Michael Smith Laboratories

Committee Members

Steven Hallam, Microbiology & Immunology
Steven Jones, Medical Genetics

University Examiners

Keith Adams, Botany
Patricia Schulte, Zoology

Physlr

Physical Maps of Linked Reads

Physlr map of a plastid genome (120 kbp)

Physlr map of fruit fly (7 chromosomes, 138 Mbp)

Physlr map of zebrafish (25 chromosomes, 1.35 Gbp)

Our Dance Card

👫 Human (3 Gbp)

🌲 Western redcedar (12 Gbp)

🌲 Sitka spruce (20 Gbp)

fin

Supplemental Slides

Publications

Four first-author (or joint) papers
One paper each year from 2015 through 2018
Collaborated on 32 papers since 2009
28 papers with at least 10 citations
One first-author manuscript in review (ORCA)
One first-author manuscript in preparation
(Sitka spruce mitochondrion)
ABySS has been cited over 2,700 times!

First-author Publications

Tigmint: correcting assembly errors using linked reads from large molecules
SD Jackman, L Coombe, J Chu, RL Warren, BP Vandervalk, …
BMC Bioinformatics 2018
ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter
SD Jackman^*, BP Vandervalk^*, H Mohamadi, J Chu, S Yeo, SA Hammond, …
Genome Research 2017
Organellar genomes of white spruce (Picea glauca): assembly and annotation
SD Jackman, RL Warren, EA Gibb, BP Vandervalk, H Mohamadi, J Chu, …
Genome Biology and Evolution 2015
UniqTag: content-derived unique and stable identifiers for gene annotation
SD Jackman, J Bohlmann, I Birol
PLOS ONE 2015

Selected Publications

Assembly of the complete Sitka spruce chloroplast… L Coombe, RL Warren, SD Jackman, C Yang, BP Vandervalk, …, I Birol
PloS one 2016
Spaced seed data structures for de novo assembly
I Birol, J Chu, H Mohamadi, SD Jackman, K Raghavan, …, RL Warren
International journal of genomics 2015
Konnector v2.0: pseudo-long reads from PE sequencing
BP Vandervalk, C Yang, Z Xue, K Raghavan, J Chu, H Mohamadi, SD Jackman, …, I Birol
BMC medical genomics 2015
Sealer: a scalable gap-closing application…
D Paulino, RL Warren, BP Vandervalk, A Raymond, SD Jackman, I Birol
BMC Bioinformatics 2015
On the representation of de Bruijn graphs
R Chikhi, A Limasset, SD Jackman, JT Simpson, P Medvedev
Journal of Computational Biology 2015
Improved white spruce (Picea glauca) genome…
RL Warren, CI Keeling, MMS Yuen, A Raymond, GA Taylor, …, J Bohlmann
The Plant Journal 2015
Assembling the 20Gb white spruce genome…
I Birol, A Raymond, SD Jackman, S Pleasance, R Coope, …, SJM Jones
Bioinformatics 2013

Efficient Assembly of Large Genomes

Shaun Jackman

Efficient Assemblyof Large Genomes

Short Read Genome Assembly

ABySS 1.0

ABySS 1.0

Challenges

Solution

ABySS 2.0

Spruce genome assemblies

ABySS 2.0 Conclusions

Linked Reads

Tools for Linked Reads

Contigs and scaffoldscome to an end due to…

Misassemblies limit contiguity

Method

Time and Memory

Tigmint Conclusions

Western Redcedar Methods

Western Redcedar Results

Efficient Assemblyof Large Genomes

Thesis Committee

Research Supervisors

Committee Members

University Examiners

Physlr

Physical Maps of Linked Reads

Our Dance Card

👫 Human (3 Gbp)

🌲 Western redcedar (12 Gbp)

🌲 Sitka spruce (20 Gbp)

fin

Supplemental Slides

Publications

First-author Publications

Selected Publications

Efficient Assembly
of Large Genomes

Contigs and scaffolds
come to an end due to…

Efficient Assembly
of Large Genomes