PhD thesis committee meeting

Shaun Jackman

2015-06-15

People

Shaun Jackman

Genome Sciences Centre, BC Cancer Agency
Vancouver, Canada
@sjackman
github.com/sjackman
sjackman.ca

Thesis committee

Inanc Birol
Joerg Bohlmann
Steven Hallam
Jenny Bryan

Activities in 2015

rOpenSci Unconference

San Francisco, USA · 2015 March 26–27

RECOMB 2015

Research in Computational Molecular Biology
Warsaw, Poland · 2015 April 10–15

RECOMB 2015 poster

Teaching assistant

Designed and taught two one-week modules

  • Automating data analysis pipelines
    STAT 545 Data wrangling, exploration, and analysis with R
  • Genomic epidemiology
    BIOF 520 Problem-Based Learning In Bioinformatics

Publications

UniqTag

UniqTag: Content-Derived Unique and Stable Identifiers for Gene Annotation

  • Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism
    RL Warren, CI Keeling, MMS Yuen, A Raymond, GA Taylor, …
    The Plant Journal 2015
  • Spaced Seed Data Structures for De Novo Assembly
    I Birol, J Chu, H Mohamadi, SD Jackman, K Raghavan, BP Vandervalk, …
    International Journal of Genomics 2015
  • DIDA: Distributed Indexing Dispatched Alignment
    H Mohamadi, BP Vandervalk, A Raymond, SD Jackman, J Chu, …
    PLOS ONE 2015
  • On the Representation of De Bruijn Graphs
    R Chikhi, A Limasset, SD Jackman, JT Simpson, P Medvedev
    Journal of Computational Biology 2015
  • BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters
    J Chu, S Sadeghi, A Raymond, SD Jackman, KM Nip, R Mar, …
    Bioinformatics 2014
  • Konnector: Connecting paired-end reads using a bloom filter de Bruijn graph
    BP Vandervalk, SD Jackman, A Raymond, H Mohamadi, C Yang, D Attali, …
    Bioinformatics and Biomedicine (BIBM) 2014

Manuscripts

White Spruce Organelles

Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation

Plastid · Mitochondrion

ABySS

Assembly by Spaced Seeds

for the assembly of long reads

ABySS 2.0

DistanceEst

Estimating the distance between two contigs

Homebrew Science

Homebrew | Linuxbrew | Homebrew-science

Dependencies of bioinformatics tools in Homebrew

Single molecule sequencing

PacBio RS II

Oxford Nanopore MinION

Sequencing technologies

Technology | Read length | Error rate
Sanger | 800 bp | 0.1–1%
454 | 700 bp | ~1%
Illumina | 2 x 300 bp | ~0.1%
PacBio | 8–40 kbp | ~13%
Oxford Nanopore | 8–200 kbp | ~15%

A visual comparison

8 kbp MinION read and 2 x 300 bp Illumina read

PacBio circular consensus

Circular consensus

Nanopore 2D

Nanopore 2D reads

Assembly of Single-molecule Sequencing

Assembly Stages

Overlap

Find all significantly overlapping reads

Correct

Re-call the consensus base of each read

Layout

Determine the order and orientation of the reads

Consensus

Call the consensus base of each contig
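
As a rough illustration of the Overlap and Layout stages only, here is a minimal sketch for error-free reads on a single strand. The function names and the greedy merge strategy are assumptions made for this example; they are not taken from any of the assemblers named in this talk, and the Correct and Consensus stages are left out.

```python
def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best exact overlap is shorter than `min_overlap`.
    """
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_layout(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap.

    A toy layout: no error tolerance, no reverse complements, no repeats.
    """
    contigs = list(reads)
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                n = overlap_length(a, b, min_overlap)
                if n > best_n:
                    best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    print(greedy_layout(["ACGTAC", "TACGGA", "GGATTC"], min_overlap=3))
```

Real single-molecule reads need approximate overlaps and a consensus step, which is why the tools listed on the following slides replace each of these toy stages.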

Tools

Overlap

BLASR · DALIGNER · MHAP

Correct

PBDAGCon · Falcon · Dazzler · Nanocorrect

Layout

Celera Assembler · Falcon · Dazzler

Consensus

Quiver · Nanopolish

Pipelines

Assembler | Overlap | Correct | Layout | Consensus
HGAP | BLASR | PBDAGCon | Celera | Quiver
Falcon | DALIGNER | Falcon | Falcon | Quiver
PBcR | MHAP | Falcon | Celera | Quiver
Dazzler | DALIGNER | Dazzler | Dazzler | Quiver
Nanocorr | BLAST | PBDAGCon | Celera | Celera
Nanopolish | DALIGNER | Nanocorrect | Celera | Nanopolish

SMS assembly tools

Assembling large genomes

PacBio-LargeGenomes

Assembling small genomes in 2015

Nanopore + Illumina

Nanocorr Saccharomyces cerevisiae (12 Mbp)

Nanopore only

Nanopolish Escherichia coli (5 Mbp)

Thesis proposal

  1. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation
  2. Scale ABySS to assemble long, accurate reads
  3. Scaffold using SMS and molecular barcoding
  4. Assemble single-molecule sequencing data

Organellar genomes

Completed

  • Assembled cpDNA and mtDNA genomes
  • Annotated genes (mRNA, rRNA, tRNA) and repeats
  • Quantified expression in eight tissues

Plan

  • Analyse RNA-seq data to annotate
    • expressed and spliced ORFs
    • C to U RNA editing
  • Submit annotated genomes to GenBank
  • Complete manuscript with Rene and Ewan

Scaling ABySS

For the assembly of long reads

  • 500 bp overlapping paired-end reads (MiSeq)
  • 10 kbp synthetic long reads (Moleculo)
  • 8–200 kbp corrected single-molecule sequencing

Using memory-efficient data structures

  • Spaced seed de Bruijn graph
  • Bloom filter de Bruijn graph
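
To make the idea of a spaced seed concrete, here is a minimal sketch that keeps only the positions of a k-mer marked "1" in a mask, so that two k-mers differing only at a masked-out position compare equal. The mask "110011" and the function names are hypothetical choices for illustration and are not the seed design used in ABySS.

```python
def apply_spaced_seed(kmer: str, mask: str) -> str:
    """Keep the bases of `kmer` at positions where `mask` is '1'."""
    if len(kmer) != len(mask):
        raise ValueError("k-mer and mask must have the same length")
    return "".join(base for base, m in zip(kmer, mask) if m == "1")


def spaced_seeds(seq: str, mask: str):
    """Spaced seed of every window of `seq` as wide as the mask."""
    width = len(mask)
    return [apply_spaced_seed(seq[i:i + width], mask)
            for i in range(len(seq) - width + 1)]


if __name__ == "__main__":
    mask = "110011"  # hypothetical mask: ignore the two middle bases
    # The two k-mers differ only at a masked-out position.
    print(apply_spaced_seed("ACGTAC", mask), apply_spaced_seed("ACTTAC", mask))
```

The appeal for long, error-prone reads is that a single substitution in a ignored position no longer breaks the match.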

Scaling ABySS

Completed

  • Implemented Bloom filter de Bruijn graph
    • Konnector with Ben
    • Sealer with Daniel
  • Implemented spaced seed de Bruijn graph with Karthika
  • Reduced memory and time requirements for long reads
  • Memory usage is independent of parameter k
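
The point that memory does not depend on k can be seen in a minimal sketch of a Bloom filter de Bruijn graph: the bit array has a fixed size chosen up front, k-mers are only hashed into it, and edges are implicit. The class, the hash construction (blake2b with a seed prefix), and the default sizes are assumptions for this example and do not reflect the ABySS or Konnector implementation; reverse complements and false-positive handling are ignored.

```python
import hashlib


class BloomFilter:
    """A fixed-size bit array queried through several hash functions."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # One bit position per hash function, derived from a seeded digest.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(f"{seed}:{item}".encode(),
                                     digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def build_graph(reads, k, num_bits=1 << 16, num_hashes=3) -> BloomFilter:
    """Insert every k-mer of every read; only membership is stored."""
    bf = BloomFilter(num_bits, num_hashes)
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])
    return bf


def successors(bf: BloomFilter, kmer: str):
    """Edges are implicit: test the four possible one-base extensions."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in bf]


if __name__ == "__main__":
    graph = build_graph(["ACGTACGGA"], k=4)
    print(successors(graph, "ACGT"))
```

A Bloom filter can report k-mers that were never inserted, so a real assembler has to tolerate or filter those false positives.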

Plan

  • Assemble model organisms, human in particular, with ABySS 2.0
  • Compare to ABySS 1.5 and other dBG assemblers

Scaffolding

Order and orient contigs to build scaffolds using…

Completed

  • Illumina mate-pair libraries
  • Illumina Moleculo with Tony

Plan

  • Single-molecule sequencing
  • Molecular barcoding
    10X Genomics GemCode
    for white spruce (Picea glauca)

Assemble SMS

Drosophila melanogaster

  • Assembly with PBcR used 621,000 CPU-hours,
    or 26 days with 1000 cores
  • MHAP reduced assembly time to 700 CPU-hours
  • Efficient algorithms make a big difference!

Homo sapiens

  • Assembly of CHM1 with HGAP used 405,000 CPU-hours
    in one day on Google Cloud!
  • More work to be done for PacBio
  • And lots more work to be done for Nanopore
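
The CPU-hour figures above can be converted to wall-clock time with simple arithmetic. This sketch assumes perfect parallel scaling, which real assemblies do not achieve, and the helper names are made up for the example.

```python
def wall_clock_days(cpu_hours: float, cores: int) -> float:
    """Days of wall-clock time, assuming the work splits evenly across cores."""
    return cpu_hours / cores / 24


def cores_needed(cpu_hours: float, days: float) -> float:
    """Cores required to finish in `days`, under the same assumption."""
    return cpu_hours / (days * 24)


if __name__ == "__main__":
    pbcr_cpu_hours = 621_000   # Drosophila melanogaster with PBcR
    mhap_cpu_hours = 700       # the same assembly with MHAP
    hgap_cpu_hours = 405_000   # Homo sapiens CHM1 with HGAP

    print(f"PBcR on 1000 cores: {wall_clock_days(pbcr_cpu_hours, 1000):.1f} days")
    print(f"MHAP speed-up: {pbcr_cpu_hours / mhap_cpu_hours:.0f}x")
    print(f"Cores for CHM1 in one day: {cores_needed(hgap_cpu_hours, 1):.0f}")
```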

Assemble SMS

Overlap

  • MinHash clustering without alignment
  • Spaced seed clustering with Hamid and Justin
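
A minimal sketch of how MinHash can compare reads without aligning them: each read is reduced to a short sketch of minimum hash values over its k-mers, and the fraction of matching sketch entries estimates the Jaccard similarity of the two k-mer sets. The k-mer size, sketch size, and hash construction here are assumptions for illustration, not the parameters of MHAP or of the work described on this slide.

```python
import hashlib


def kmers(seq: str, k: int) -> set:
    """The set of all k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer: str, seed: int) -> int:
    """A 64-bit hash of `kmer`; each seed acts as a different hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq: str, k: int = 8, num_hashes: int = 64) -> list:
    """For each hash function, keep the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(hash_kmer(x, seed) for x in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a: list, sketch_b: list) -> float:
    """Fraction of sketch positions that agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    read_a = "ACGTACGGATTCAGGCTAACGTTAGC"
    read_b = "CGGATTCAGGCTAACGTTAGCTTGAC"  # overlaps read_a
    print(estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b)))
```

Reads whose sketches share many entries are candidate overlaps; only those pairs need a more expensive comparison.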

Layout

  • Numerical solution with ABySS-DrawGraph

Consensus

  • BWA-MEM and samtools
  • Assembled a small 100 kbp synthetic genome

Plan

  • Continue development with Hamid, Justin and Ben
  • Assemble
    • Nanopore Escherichia coli (5 Mbp)
    • Nanopore Saccharomyces cerevisiae (12 Mbp)
    • PacBio Homo sapiens CHM1 (3 Gbp)

Summary

  1. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation
  2. Scale ABySS to assemble long, accurate reads
  3. Scaffold using SMS and molecular barcoding
  4. Assemble single-molecule sequencing data

fin

Software

BLASR | BWA-MEM | Celera Assembler
DALIGNER | Dazzler | Falcon | HGAP
LAST | MHAP | Nanocorr | Nanocorrect
Nanopolish | PBcR | PBDAGCon | POA
Quiver

More slides

2014

Presentations and posters

Plant and Animal Genome XXII
San Diego, California, USA · 2014 January 10–15
International HPC Summer School 2014
Budapest, Hungary · 2014 June 1–6
Conifer Genome Summit 2014
Forêt Montmorency, Québec, Canada · 2014 June 16–18
HiTSeq and ISMB 2014
Boston, Massachusetts, USA · 2014 July 11–15

Teaching assistant

STAT 540 Statistical Methods for High Dimensional Biology

BIOF 520 Problem-Based Learning In Bioinformatics

Retool for SMS

Alignment

Seed and extend is difficult when
one in seven bases is incorrect

Assembly

de Bruijn graphs also require accurate seeds
Return to overlap, layout, consensus (OLC)
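
To put a number on why exact seeds fail at these error rates, the chance that k consecutive bases are all correct is (1 - e)^k under the simplifying assumption that errors are independent and evenly spread along the read. The error rates below are the approximate values from the sequencing technologies table; the seed lengths are arbitrary examples.

```python
def error_free_seed_probability(error_rate: float, k: int) -> float:
    """Probability that a k-base seed contains no errors.

    Assumes independent, uniformly distributed errors (a simplification).
    """
    return (1.0 - error_rate) ** k


if __name__ == "__main__":
    technologies = [("Illumina", 0.001), ("PacBio", 0.13), ("Oxford Nanopore", 0.15)]
    for name, rate in technologies:
        for k in (12, 20, 32):
            p = error_free_seed_probability(rate, k)
            print(f"{name:16s} k={k:2d}  P(error-free seed) = {p:.4f}")
```

At a 15% error rate, fewer than one in twenty 20-base seeds is error-free, which is the motivation for returning to overlap, layout, consensus.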

Align

  • BLASR · PacBio
  • BWA-MEM (Burrows-Wheeler Aligner) · Heng Li
  • LAST · Computational Biology Research Consortium, Tokyo, Japan

Overlap

Read correction

Layout

Consensus

Assembly Pipeline

  • HGAP (Hierarchical Genome Assembly Process) · PacBio
  • Falcon · PacBio
  • PBcR (PacBio Corrected Reads) · Celera Assembler, JCVI
  • Dazzler (Dresden Azzembler) · Gene Myers
  • Nanocorr · Mike Schatz
  • Nanopolish · Jared Simpson