SIB workshop “Bioinformatics of long read sequencing” hands-on

July 5th, 2016. University of Bern.

https://www.isb-sib.ch/training/upcoming-training-events/training/2016-07-longreads

Contributors

Amina Echchiki, Kamil Jaron, Julien Roux: Evolutionary Bioinformatics Group, UNIL and SIB

Walid Gharib: Interfaculty Bioinformatics Unit, UNIBE and Training and Outreach team, SIB

Introduction

The aim of this practicals session is to get your hands on real long reads sequencing data generated from two different technologies:

The biological material that was sequenced using these two platforms is DNA from the lambda phage. This is not a particularly interesting genomic material for a long read sequencing study, since such a small genome can be assembled easily with short Illumina reads (check this publication for example). However, its genome is small (48kb), which makes it feasible to run an analysis yourself during the limited time of this practicals session.

You will go through different steps, which include the extraction of reads from their native encoding formats (HDF5), quality control, de novo genome assembly, and a mapping of reads to a reference genome.

After DNA fragmentation, the MinION library is prepared by ligating adapters to the double-stranded DNA fragments. One side receives a Y-form adapter, and the other side a hairpin-form adaptor. The adapters are conjugated with motor proteins that help control the translocation speed of DNA through the pore. When the Y-form adapter approaches the pore, one strand starts to be sequenced. This sequence is called template. After the hairpin-form adapter is sequenced, the complement sequence is read. The template and complement sequences are sometimes called 1D reads. The consensus of template and complement sequences is called a 2D read. An illustration of the protocol is provided in this figure:

protocol

For PacBio sequencing (or SMRT sequencing), hairpin-form adapters (SMRTbell) are ligated on both sides of double-stranded DNA fragments, forming a circular sequence. Circular sequences attach to the bottom of the SMRT cell wells, and are continuously read, and a movie is recorded. Nowadays, the total length of movie allows to record up to 75,000 nucleotide basecalls. This implies that short DNA fragments will be read many times, while long fragments may be read just one or twice. The complete sequence is called a polymerase read. After the adapter sequences are removed, the sequence is split into subreads. The consensus of subreads is called a circular consensus read or read of insert.

protocol

How to connect to the Vital-IT cluster?

By now, you should have received a username and password to access the high performance computing cluster of the SIB (Vital-IT). Obviously, you can use your own Vital-IT account if you have one.

Warning Even if you have access to another cluster, the data required for this practical is stored on Vital-IT so all participants should connect to vital-IT.

In order to connect to the cluster and set up your working directory follow the next steps:

For Linux / Mac OSX users

In a terminal, type ssh <username>@prd.vital-it.ch. You will be prompted to input your user password (<username>@prd.vital-it.ch's password:). Type in your password (you will not see what you are typing) and press Enter.

You are in! Jump to Setting up your working directory.

For Windows users

You should first install a ssh client (e.g., PuTTY). If you already have one, we assume you know how to use it. Connect to Vital-IT and jump to Setting up your working directory.

If you do not have a ssh client, follow these steps.

Setting up your working directory

  • On Vital-IT, you must read and write files in the “scratch” directory. Move to /scratch weekly directory:

    cd /scratch/beegfs/weekly/

Warning Be careful, files in this folder are erased after a week. At the end of the practicals, if you want to keep your results, you need to back-up the data, for example in a compressed tarball that you move to your home or archive folder, or to another computer.

  • Create your own directory:

    mkdir <username>; cd <username>
  • You will always be working from this directory. Before launching commands please be sure that you are located in the right directory by typing: pwd. The expected output is: /scratch/beegfs/weekly/<username>

Submitting commands to the cluster

It is not allowed to launch any calculation on the frontal machine where you are connected (prd.vital-it.ch). You need to submit each job for batched execution through a job scheduler that will dispatch it on the cluster nodes. For example:

bsub '[my command line here]'

Or, better, write your commands in a script and submit it with:

bsub < script.sh

Please have a look at this short tutorial to help you write such a script yourself, with the Nano text editor. To save you some typing, a skeleton of a submission script can be found here ;)

1. Read extraction

Formats

The output of MinION and PacBio RS II are stored in Hierarchical Data Format, that is basically an archive format (such as .zip or .tar), but allowing a very quick access to its content (you can find details on wikipedia). Those files include all details about reads and basecalling. What you will need for this practical are just the sequences and their qualities, that can be stored in a simple .fastq.

Tip The files saved in Hierarchical Data Format can be explored using HDF5 command-line tools:

h5dump <HDF_file> | less
h5stat <HDF_file> | less

fast5

The MinION basecaller produces one file per read, which includes the time series of the measured ionic current used for basecalling, the basecalled template and complement read sequences, and the consensus 2D reads.

bas.h5 and bax.h5

The PacBio RS II system produces three bax.h5 files and a bas.h5 per SMRT cell. The three bax.h5 files correspond to the first, second and third part of the movie capturing the SMRT cell. The bas.h5 file contains metadata. PacBio announced a change of data format to more classcial .bam files for the next platform (called Sequel), however all data produced by the RS II platform will be still in the .h5 formats you are going to work with.

Tools

All used tools are already installed on Vital-IT. Before the first use, you will just need to load the package.

Poretools is a toolkit for working with sequencing data from MinION, for the purposes of quality control and downstream analysis. It operates directly on the native .fast5 format and provides format conversion utilities and data exploration tools. Software usage is detailed in the documentation.

pbh5tools is a set of python scripts to extract .fasta and .fastq from bas.h5 and bax.h5 files. These scripts allow filtering based on error rate, read length and read type. Source code is available on GitHub and software usage is detailed in documentation.

Extraction of MinION reads

First, create a working directory for MinION raw reads and create links to the raw sequencing files.

mkdir -p lambda_minion/fast5/
ln -s /scratch/beegfs/monthly/SIB_long_read_workshop/MinION_lambda_reads/*.fast5 \
  lambda_minion/fast5/

Question Can you guess what each file corresponds to? How many reads has this sequencing experiment produced?

To do You will convert the MinION raw reads from their .fast5 to a more classical and readable .fastq. This can be done with the poretools fastq utility. To find out which options to use, please refer to the documentation, or type poretools fastq -h. For now, we are interested in extracting all types of reads (template, complement and 2D), so check out the option --type. Be careful, the output .fastq file is written in the standard output, so you need to save the standard output to a file using >. For example:

module add UHTS/Analysis/poretools/0.5.1
poretools fastq [...] > lambda_minion/all_reads.fastq
## And even better you can add a step to compress the output file:
gzip -9 lambda_minion/all_reads.fastq

Warning Remember, that it is not allowed to launch your command on the frontal machine! You need to write the above commands in a bash script (called for example minion_extract.sh) that you will submit to the cluster with bsub (see Submitting commands to the cluster):

bsub < minion_extract.sh

Question Have a look at the .fastq file created, using less or nano. Looking at header lines (starting with a @), do you identify which reads are 2D, template or complement reads? Do you identify which lines correspond to the quality scores (their header is a + sign).

Focus on a 2D read chosen at random. Look for the corresponding template and complement reads, which should be the next ones in the file (or you can search for the unique hash at the beginning of the header, which is common between 2D, template and complement reads).

Question Do the quality scores seem to be improved in 2D reads? You can refer to this wikipedia page for help on the PHRED quality scores in .fastq files.

Tip You will not do this today, but for basic quality control of the reads, you can also launch the fastqc software (widely used for Illumina data) on this fastq file. The reported statistics are correct, just keep in mind that warning flags in the report are meaningful for short reads, and sometimes not very informative for long reads.

help If you are lost, you can get extracted MinION reads by executing the following script. While the script is running, have a look at it to understand what is done.

bsub < /scratch/beegfs/monthly/SIB_long_read_workshop/scripts/1_Extraction_MinION.sh

Extraction of PacBio RS II reads

First, create a working directory for raw reads and create links to the raw sequencing files.

mkdir -p lambda_RSII/raw_reads/
ln -s /scratch/beegfs/monthly/SIB_long_read_workshop/RSII_lambda_reads/*.h5 \
  lambda_RSII/raw_reads/

Question How many SMRT cells do we have?

To do Extract PacBio subreads using bash5tools.py from pbh5tools. Again, refer to the documentation. The commands will look like this:

module add UHTS/PacBio/pbh5tools/0.8.0

bash5tools.py <file.bas.h5> \
  --outFilePrefix lambda_RSII/<prefix_for_extracted_reads> \
  --readType subreads \
  --outType fastq
# compress the output file:
gzip -9 lambda_RSII/[prefix.fastq]

Put the full command in a script and submit it using bsub:

bsub < pacbio_extract.sh

Question How many subreads were produced?

help If you are lost, you can get extracted PacBio reads by executing

bsub < /scratch/beegfs/monthly/SIB_long_read_workshop/scripts/1_Extraction_RSII.sh

2. Genome assembly

Genome assembly with long reads is, just like the established assembly with short-reads, based on graph theory. But string graphs are used instead of De Brujin graphs, and instead of kmer decomposition, the graph of reads (nodes) and their overlaps (edges) is constructed.

Tools

Canu

Canu is an assembler for noisy long-reads sequences. It is a fork of Celera with added modules for error correction and trimming of reads. Canu was designed for both MinION and PacBio data, and is one of the two leading assemblers for longs reads (together with Falcon). Canu has many parameters, which are not discussed on this workshop, but many details are given in the manual.

Miniasm

Miniam is lightweight assembler. It very fast, but for a cost of simplicity and low parametrization. It can used as a first proxy of the data content, but for a final assembly, another assembler should be considered. As a first step, the overlap of reads is computed in a separated step using a standalone program called minimap.

MinION

You will assemble the lambda phage genome using only the 2D reads. 1D reads are of substantially lower quality, so you would like to avoid using them if there are enough 2D reads. Since this is the case (~3,000 reads, i.e., coverage of ~500X), the assembly should be trivial: the genome is 48.8 kb and we have some reads of 20kb. We expect only one contig!

Canu

To do Choose the appropriate parameters to run Canu:

module add UHTS/Assembler/canu/1.3; # load canu software 

usage: canu -p <assembly-prefix>    # use a specified prefix to name canu output files
            -d <assembly-directory> # use a specified prefix for output directories
            genomeSize=<number>     # expected genome size
                  useGrid=false           # disables automatic submission to cluster
            -nanopore-raw           # specifies that ONT MinION data are used
            <data in fastq format>

Put the full command in a script and submit it using bsub:

bsub < minion_assembly_canu.sh

The output directory contains many files. The most interesting ones are:

  • *.correctedReads.fasta.gz : file containing the input sequences after correction, trim and split based on consensus evidence.

  • *.trimmedReads.fastq : file containing the sequences after correction and final trimming

  • *.layout : file containing informations about read inclusion in the final assembly

  • *.gfa : file containing the assembly graph by Canu

  • *.contigs.fasta: file containing everything that could be assembled and is part of the primary assembly

Question How many contigs were produced? Does the total size seem to match your expectations?

Warning During our tests, the assembly did not always work. Sometimes the job was killed by the cluster, please check carefully the standard output and error files. We think that the memory requirement of Canu might be too big for some machines of the cluster. If this happens to you, rerun the job using the queue dee-hugemem instead of the queue priority. It night be useful to increase the amount of memory requested with the options -v 20000000 -R "rusage[swp=20000]" and -M 10000000 -R "rusage[mem=10000]". If this still does not work, you can see a successful run that we pre-computed in the folder /scratch/beegfs/monthly/jroux/tp_long_reads/lambda_minion/canu/.

PacBio RS II

One PacBio RS II SMRT cell produces between 0.5 and 1 gb. This roughly correspond to a 10,000x coverage of the lambda phage genome which is clearly an overkill. We can decrease the computational load by using a subset of only a few thousand reads for the assembly.

To do Go to the directory including the extracted PacBio reads and extract a subset of the first 3,000 reads.

zcat RSII_reads.fastq.gz | 
  head -12000 | 
  gzip -9 > 
  RSII_reads_subset.fastq

The assembly step is now analogous to the assembly of MinION data.

To do Modify the command performing Canu assembly for PacBio data. Change the type of input reads to -pacbio-raw, the name of the input file and the name of output folder, and launch the assembly.

Question What is the length of the resulting assembly? Is there anything weird going on?

Bonus: Miniasm

Since the assembly with Canu does not seem to be successful, you will try a different assembler, miniasm.

To do You first need to compute the overlaps of reads using minimap. As recommended in the documentation, put the following command and parameters in your submission script:

minimap -S -w 5 -L 100 -m 0 <reads.fq> <reads.fq> | 
  gzip -9 > 
  <overlaps.gz>

To do Then, to perform the assembly using miniasm:

module add UHTS/Analysis/minimap/0.2.r124.dirty; # load module minimap
module add UHTS/Analysis/miniasm/0.2.r137.dirty; # load module minimap

miniasm -f <reads.fq> <overlaps.gz> > <contigs.gfa>
## convert .gfa to .fasta
awk '/^S/{print ">"$2"\n"$3}' <contigs.gfa> | fold > <contigs.fa>

Put these commands in a script and submit it using bsub:

bsub < RSII_assembly_miniasm.sh