PhD thesis committee meeting

Shaun Jackman

2015-07-27

People

Shaun Jackman

Genome Sciences Centre, BC Cancer Agency
Vancouver, Canada
@sjackman
github.com/sjackman
sjackman.ca

Thesis committee

Inanc Birol
Joerg Bohlmann
Jenny Bryan
Steven Hallam

Chair

Steven Jones

Activities in 2015

Conferences

rOpenSci Unconference
RECOMB 2015

Teaching assistant

STAT 545 Data wrangling, exploration, and analysis with R
Automating data analysis pipelines

BIOF 520 Problem-Based Learning In Bioinformatics
Genomic epidemiology

Publications

  • UniqTag: Content-derived unique and stable identifiers for gene annotation
    SD Jackman, J Bohlmann, I Birol
    PLOS ONE 2015
  • Sealer: a scalable gap-closing application for finishing draft genomes
    D Paulino, RL Warren, BP Vandervalk, A Raymond, SD Jackman, I Birol
    BMC Bioinformatics 2015
  • Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism
    RL Warren, CI Keeling, MMS Yuen, A Raymond, GA Taylor, …
    The Plant Journal 2015
  • Spaced Seed Data Structures for De Novo Assembly
    I Birol, J Chu, H Mohamadi, SD Jackman, K Raghavan, BP Vandervalk, …
    International Journal of Genomics 2015
  • DIDA: Distributed Indexing Dispatched Alignment
    H Mohamadi, BP Vandervalk, A Raymond, SD Jackman, J Chu, …
    PLOS ONE 2015
  • On the Representation of De Bruijn Graphs
    R Chikhi, A Limasset, SD Jackman, JT Simpson, P Medvedev
    Journal of Computational Biology 2015
  • BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters
    J Chu, S Sadeghi, A Raymond, SD Jackman, KM Nip, R Mar, …
    Bioinformatics 2014
  • Konnector: Connecting paired-end reads using a bloom filter de Bruijn graph
    BP Vandervalk, SD Jackman, A Raymond, H Mohamadi, C Yang, D Attali, …
    Bioinformatics and Biomedicine (BIBM) 2014

Manuscripts

White Spruce Organelles

Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation

Plastid | Mitochondrion

ABySS

Assembly by Spaced Seeds

for the assembly of long reads

ABySS 2.0

DistanceEst

Estimating the distance between two contigs

Homebrew Science

Homebrew | Linuxbrew | Homebrew-science

Homebrew-science

Dependencies of bioinformatics tools in Homebrew

Thesis proposal

  1. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation
  2. Scale ABySS to assemble long, accurate reads
  3. Scaffold using SMS and molecular barcoding
  4. Assemble single-molecule sequencing data

Organellar Genomes of White Spruce: Assembly and Annotation

Organellar genomes

Completed

  • Assembled cpDNA and mtDNA genomes
  • Annotated genes (mRNA, rRNA, tRNA) and repeats
  • Analysed RNA-seq data to quantify
    • transcript abundance in eight tissues
    • expressed ORFs
    • C-to-U RNA editing

Organellar genomes

Plan

  • Analyse RNA-seq data to annotate
    • cryptic ACG start codons
      due to C-to-U RNA editing
    • spliced ORFs
  • Submit annotated genomes to GenBank
  • Complete manuscript

Scale ABySS to assemble long, accurate reads

Scaling ABySS

For the assembly of long reads

  • 500 bp overlapping paired-end reads (MiSeq)
  • 10 kbp synthetic long reads (Moleculo)
  • 8–200 kbp corrected single-molecule sequencing

Using memory-efficient data structures

  • Spaced seed de Bruijn graph
  • Bloom filter de Bruijn graph
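
A minimal sketch of the Bloom filter de Bruijn graph idea named above, not the ABySS 2.0 implementation: the k-mers of the reads are stored in a Bloom filter, and edges are left implicit by querying the four possible one-base extensions of a k-mer. The class and function names, the filter size and the number of hash functions are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """A fixed-size bit array queried with several salted hashes."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            salt = seed.to_bytes(8, "little")
            digest = hashlib.blake2b(item.encode(), salt=salt).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_filter(reads, k, num_bits=1 << 20, num_hashes=3):
    """Insert every k-mer of every read (illustrative default sizes)."""
    bf = BloomFilter(num_bits, num_hashes)
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])
    return bf


def successors(bf, kmer):
    """Edges are implicit: test each of the four possible extensions."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bf]


bf = build_filter(["ACGTAC"], k=3)
print(successors(bf, "ACG"))
```

In this sketch the bit array has a fixed size chosen up front, so its memory footprint does not depend on k. A false positive in the filter would show up as a spurious edge, which a real assembler has to detect and remove.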

Scaling ABySS

Completed

  • Implemented Bloom filter de Bruijn graph
  • Implemented spaced seed de Bruijn graph with Karthika
  • Reduced memory and time requirements for long reads
  • Memory usage is independent of parameter k

Plan

  • Assemble model organisms, human in particular, with ABySS 2.0
  • Compare to ABySS 1.5 and other dBG assemblers

RECOMB 2015

Research in Computational Molecular Biology
Warsaw, Poland · 2015 April 10–15

RECOMB 2015 poster

Scaffold using SMS and molecular barcoding

Scaffolding

Order and orient contigs to build scaffolds using…

Completed

  • Illumina mate-pair libraries
  • Illumina Moleculo with Tony

Plan

  • Single-molecule sequencing
  • Molecular barcoding
    10X Genomics GemCode
    for white spruce (Picea glauca)

Assemble single-molecule sequencing data

Assemble SMS

Drosophila melanogaster

  • Assembly with PBcR used 621,000 CPU-hours,
    or 26 days with 1000 cores
  • MHAP reduced assembly time to 700 CPU-hours
  • Efficient algorithms make a big difference!

Homo sapiens

  • Assembly of CHM1 with HGAP used 405,000 CPU-hours
    in one day on Google Cloud!
  • More work to be done for PacBio
  • And lots more work to be done for Nanopore
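
The CPU-hour figures above can be checked with a few lines of arithmetic. This sketch assumes perfect scaling across cores, which real assemblies do not achieve. The core count derived for the one-day HGAP run is an inference under that assumption, not a figure from the slides.

```python
HOURS_PER_DAY = 24


def wall_clock_days(cpu_hours, cores):
    """Ideal wall-clock time in days, assuming perfect scaling across cores."""
    return cpu_hours / cores / HOURS_PER_DAY


def cores_needed(cpu_hours, days):
    """Cores required to finish in the given number of days, same assumption."""
    return cpu_hours / (days * HOURS_PER_DAY)


pbcr_days = wall_clock_days(621_000, 1000)   # Drosophila with PBcR
mhap_speedup = 621_000 / 700                 # PBcR CPU-hours over MHAP CPU-hours
hgap_cores = cores_needed(405_000, 1)        # CHM1 with HGAP in one day

print(f"PBcR on 1000 cores: {pbcr_days:.1f} days")
print(f"MHAP speed-up: about {mhap_speedup:.0f}x")
print(f"HGAP in one day: about {hgap_cores:,.0f} cores")
```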

Assemble SMS

Overlap

  • Spaced seed clustering with Hamid and Justin

Layout

  • Numerical solution with ABySS-DrawGraph

Consensus

  • BWA-MEM and samtools
  • Assembled a small 100 kbp synthetic genome

Plan

  • Continue development with Hamid, Justin and Ben
  • Assemble
    • Nanopore Escherichia coli (5 Mbp)
    • Nanopore Saccharomyces cerevisiae (12 Mbp)
    • PacBio Homo sapiens CHM1 (3 Gbp)

Thesis proposal

  1. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation
  2. Scale ABySS to assemble long, accurate reads
  3. Scaffold using SMS and molecular barcoding
  4. Assemble single-molecule sequencing data

fin

Shaun Jackman

Genome Sciences Centre, BC Cancer Agency
Vancouver, Canada
@sjackman
github.com/sjackman
sjackman.ca

Software

BLASR | BWA-MEM | Celera Assembler
DALIGNER | Dazzler | Falcon | HGAP
LAST | MHAP | Nanocorr | Nanocorrect
Nanopolish | PBcR | PBDAGCon | POA
Quiver

More slides

Single molecule sequencing

PacBio RS II

Oxford Nanopore MinION

Sequencing technologies

Technology        Read length   Error rate
Sanger            800 bp        0.1–1%
454               700 bp        ~1%
Illumina          2 x 300 bp    ~0.1%
PacBio            8–40 kbp      ~13%
Oxford Nanopore   8–200 kbp     ~15%

A visual comparison

8 kbp MinION read and 2 x 300 bp Illumina read

PacBio circular consensus

Circular consensus

Nanopore 2D

Nanopore 2D reads

Assembly of Single-molecule Sequencing

Assembly Stages

Overlap

Find all significantly overlapping reads

Correct

Re-call the consensus base of each read

Layout

Determine the order and orientation of the reads

Consensus

Call the consensus base of each contig
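
As a toy illustration of the overlap and layout stages only, the sketch below greedily merges error-free reads by their longest exact suffix-prefix overlap. It is not how any of the assemblers in this talk work: real single-molecule reads are noisy, so they need approximate overlaps plus the correction and consensus stages, all of which are omitted here. The function names, the reads and the `min_overlap` default are made up for the example.

```python
def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        # Overlap: compare every ordered pair of sequences.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap_length(a, b, min_overlap)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            break  # no overlaps remain; leave the contigs separate
        # Layout: join the best pair, keeping the overlap only once.
        merged = contigs[i] + contigs[j][length:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs


print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"]))
# → ['ACGTACGGATTC']
```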

Tools

Overlap

BLASR · DALIGNER · MHAP

Correct

PBDAGCon · Falcon · Dazzler · Nanocorrect

Layout

Celera Assembler · Falcon · Dazzler

Consensus

Quiver · Nanopolish

Pipelines

Assembler    Overlap    Correct       Layout    Consensus
HGAP         BLASR      PBDAGCon      Celera    Quiver
Falcon       DALIGNER   Falcon        Falcon    Quiver
PBcR         MHAP       Falcon        Celera    Quiver
Dazzler      DALIGNER   Dazzler       Dazzler   Quiver
Nanocorr     BLAST      PBDAGCon      Celera    Celera
Nanopolish   DALIGNER   Nanocorrect   Celera    Nanopolish

SMS assembly tools

Assembling large genomes

PacBio-LargeGenomes

Assembling small genomes in 2015

Nanopore + Illumina

Nanocorr Saccharomyces cerevisiae (12 Mbp)

Nanopore only

Nanopolish Escherichia coli (5 Mbp)

Retool for SMS

Alignment

Seed and extend is difficult when
one in seven bases is incorrect

Assembly

de Bruijn graphs also require accurate seeds
Return to overlap, layout, consensus (OLC)
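
A rough way to see why exact seeds struggle at this error rate: if one base in seven is wrong and errors are assumed to be independent and evenly spread (a simplification), the chance that a k-base seed contains no error is (6/7)^k. The seed lengths below are illustrative choices, not values from the talk.

```python
def error_free_seed_probability(k, error_rate):
    """Probability that k consecutive bases are all correct,
    assuming independent, uniformly distributed errors."""
    return (1.0 - error_rate) ** k


ERROR_RATE = 1 / 7  # "one in seven bases is incorrect"

for k in (11, 15, 21, 31):  # illustrative seed lengths
    p = error_free_seed_probability(k, ERROR_RATE)
    print(f"k={k:2d}: {p:.1%} of seeds are error-free")
```

Under these assumptions only about one seed in ten survives at k = 15, and fewer than one in a hundred at k = 31. The same reasoning applies to the k-mers of a de Bruijn graph.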

Align

  • BLASR: PacBio
  • BWA-MEM (Burrows-Wheeler Aligner): Heng Li
  • LAST: Computational Biology Research Consortium, Tokyo, Japan

Overlap

Read correction

Layout

Consensus

Assembly Pipeline

  • HGAP (Hierarchical Genome Assembly Process): PacBio
  • Falcon: PacBio
  • PBcR (PacBio Corrected Reads): Celera Assembler, JCVI
  • Dazzler (Dresden Azzembler): Gene Myers
  • Nanocorr: Mike Schatz
  • Nanopolish: Jared Simpson