de novo genome assembly using Supernova 2.1.0

Vijithkumar V
Sep 6, 2023

Updated: Oct 16, 2023

What is Supernova 2.1.0?

Everything starts with a single DNA source. From it, a single whole-genome library is prepared and sequenced on an Illumina platform such as NovaSeq to generate linked reads. These linked reads then need to be assembled into a whole genome, and that is the job of the Supernova software package. This article covers Supernova 2.1.0.

What is a key feature of the Supernova package?

The key feature of the Supernova software package is that it creates diploid assemblies, so it can represent the maternal and paternal chromosomes separately.

Supernova package includes two processing pipelines and one post-processing pipeline.

In the previous article we saw that the TELLYSIS software package has three processing pipelines: a quality control pipeline (TELL-read), a phasing pipeline (TELL-sort), and an assembly pipeline (TELL-link). The Supernova software package likewise has three important pipelines: supernova mkfastq, supernova run, and supernova mkoutput.

supernova mkfastq generates the demultiplexed FASTQ files. It takes the base call (BCL) files from a single flowcell, including the forward reads, the reverse reads, and the barcode files, together with a samplesheet (a CSV file), and it demultiplexes the BCL files and converts them to FASTQ.

supernova run is the assembly pipeline. It takes the sample-demultiplexed FASTQ files containing the barcoded reads and assembles the genome using a graph-based assembly process. It first builds an assembly from read k-mers with k = 48, that is, from the 48-mers in the reads. It then resolves this assembly using read pairs, to an effective k of 200, and finally, using the barcodes, to an effective k of about 100,000.
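To make the idea of building an assembly graph from k-mers concrete, here is a minimal Python sketch. It is a generic, textbook-style de Bruijn construction on toy data, not Supernova's actual implementation; the function names kmers and de_bruijn_edges, the two toy reads, and k = 4 are all assumptions made only for illustration.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every overlapping substring of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Link each k-mer's (k-1)-base prefix to its (k-1)-base suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy reads and a toy k (assumption); Supernova itself starts from k = 48.
reads = ["ACGTAC", "CGTACG"]
graph = de_bruijn_edges(reads, k=4)
for prefix in sorted(graph):
    print(prefix, "->", sorted(graph[prefix]))
```

Overlapping reads share k-mers, so their edges merge into one path through the graph. A larger k resolves more repeats, which is why the assembly is refined in later stages.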

supernova mkoutput takes the output of supernova run and produces different styles of FASTA files for downstream processing and analysis.

Generating the FASTQ files with supernova mkfastq

supernova mkfastq is a wrapper around Illumina’s bcl2fastq. The path to the flowcell directory is provided as an input parameter. Each flowcell contains lanes, and each lane contains wells. A single whole-genome library is deposited in a single well. Each well carries a lawn of P7 and/or P5 adapters. If only P7 adapters are present, they are associated with one of four unique oligonucleotide sequences. Every well that contains a whole-genome library is therefore studded with adapters and an associated set of four oligonucleotide sequences. These oligonucleotides are unique within a set and across all sets. Along with the path to the flowcell directory, you also need to provide information on which lanes of the flowcell contain the libraries, and the name of the well associated with each set of four oligonucleotide sequences.

This information is provided as a comma-delimited series of the lane number, the name of the sample in each well, and the well name, so that supernova mkfastq can pull out the oligonucleotide set used for each whole-genome library. All the reads in the BCL files that share the set of oligonucleotide sequences given in the samplesheet are demultiplexed into the same sample.
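The grouping step can be pictured with a small Python sketch. It is only a toy model of demultiplexing, not what bcl2fastq or supernova mkfastq do internally. The index sequences below are made up for illustration and are NOT the real 10x sample index sets; the demultiplex function and the "Undetermined" label for unmatched reads are also assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical index sequences: NOT the real 10x sample index sets.
INDEX_SETS = {
    "Library1": {"AAAACCCC", "CCCCGGGG", "GGGGTTTT", "TTTTAAAA"},
    "Library2": {"ACACACAC", "CACACACA", "GTGTGTGT", "TGTGTGTG"},
}


def demultiplex(reads):
    """Group (index_sequence, read_sequence) pairs by the library that owns the index."""
    lookup = {seq: sample for sample, seqs in INDEX_SETS.items() for seq in seqs}
    groups = defaultdict(list)
    for index_seq, read_seq in reads:
        # Reads whose index matches no set go into a catch-all group.
        groups[lookup.get(index_seq, "Undetermined")].append(read_seq)
    return dict(groups)


reads = [("AAAACCCC", "ACGT"), ("GTGTGTGT", "TTGA"), ("NNNNNNNN", "GGCC")]
print(demultiplex(reads))
```

Because each library uses its own set of four index sequences, any of the four is enough to assign a read to its library.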

For example, let us look at this picture. It shows two distinct whole-genome libraries, Library1 and Library2, prepared from two different DNA sources. The two libraries are fed into an Illumina sequencer, let us say a NovaSeq, and end up in two independent wells, either on the same lane or on different lanes of the same flowcell. These wells are studded with P7 and/or P5 adapters, each with a unique set of oligonucleotide sequences.

The well that Library1 was fed into is called A1. The A1 well contains P7 and/or P5 adapters with the same set of oligonucleotide sequences that was used to prepare Library1. Similarly, Library2 was fed into another well, A2. When we provide the BCL files of a flowcell directory to the supernova mkfastq pipeline, it demultiplexes them (the read files and the barcode files) by sample, on the basis of the sets of sample indices. So, we need to provide a samplesheet.

Usually, the samplesheet is a CSV file that contains three columns: 1. the lane number of the flowcell, 2. the sample/library name, and 3. the sample index set name, for example SI-GA-A1, where A1 is the well name. From the name SI-GA-A1, supernova mkfastq automatically pulls out the set of four oligonucleotide sequences and demultiplexes the BCL files by collating all the reads that carry these sample indices. Finally, the demultiplexed reads are converted to FASTQ. In this example, the whole-genome library Library1 was loaded into well SI-GA-A1 on lane 1 of the flowcell, and the second library, Library2, was loaded into well SI-GA-A2 on lanes 1 and 2 of the same flowcell. The BCL files are demultiplexed into two sets of FASTQ files because the two libraries were sample-indexed with two distinct sets of sample indices.
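As a rough sketch, the two-library layout described here could be written out as a simple CSV with a few lines of Python. The column header Lane,Sample,Index and the 1-2 lane range follow the simple CSV format shown elsewhere in this article; the helper name build_samplesheet is an assumption made for illustration.

```python
import csv
import io

# Layout from the text: Library1 on lane 1, Library2 on lanes 1 and 2.
rows = [
    {"Lane": "1", "Sample": "Library1", "Index": "SI-GA-A1"},
    {"Lane": "1-2", "Sample": "Library2", "Index": "SI-GA-A2"},
]


def build_samplesheet(rows):
    """Return the CSV text for a simple three-column samplesheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["Lane", "Sample", "Index"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(build_samplesheet(rows))
```

Writing the returned text to a file gives a samplesheet that can be passed to supernova mkfastq.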

Arguments and options for supernova mkfastq

Because supernova mkfastq is a wrapper around bcl2fastq, it also accepts the arguments of bcl2fastq.

--run: Required. The path to the Illumina run folder, which should contain the BCL run files associated with a flowcell.
--id: Optional. The value given here becomes the name of the folder that supernova mkfastq creates. It defaults to the name of the flowcell referred to by --run.
--samplesheet: The path to an Illumina Experiment Manager-compatible samplesheet. It should contain the following columns: 1. the sample index set names, such as SI-GA-A1 or SI-GA-A2, one for each distinct library loaded into the flowcell, 2. the sample names (of the distinct libraries), and 3. the lane numbers.
--csv: The path to a simple CSV file with three columns: 1. the lanes to be used, for example 1 or 1-2, 2. the sample names, one for each distinct whole-genome library, and 3. the sample index set names used for demultiplexing.
--lanes: A flowcell contains lanes, and each lane contains nanowells. If all the lanes were used for sequencing but you want to demultiplex the BCL files of only certain lanes, specify those lanes in a comma-delimited format, for example 1,2.
--output-dir: The path to the directory in which supernova mkfastq generates the demultiplexed FASTQ files.
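If you launch the pipeline from a script, the options above can be assembled into a command line. The following Python sketch only builds and prints the argument list and does not run anything. It uses only the options listed above; the helper name mkfastq_command and the example paths are assumptions for illustration.

```python
import shlex


def mkfastq_command(run_dir, csv_path, run_id=None, lanes=None, output_dir=None):
    """Assemble a supernova mkfastq argument list from the options above."""
    args = ["supernova", "mkfastq", f"--run={run_dir}", f"--csv={csv_path}"]
    if run_id is not None:
        args.append(f"--id={run_id}")
    if lanes:
        # Lanes are passed in a comma-delimited format, for example 1,2.
        args.append("--lanes=" + ",".join(str(lane) for lane in lanes))
    if output_dir is not None:
        args.append(f"--output-dir={output_dir}")
    return args


command = mkfastq_command(
    "/path/to/tiny_bcl", "tiny-bcl-simple-2.1.0.csv", run_id="tiny-bcl", lanes=[1, 2]
)
print(shlex.join(command))
```

Keeping the command as a list makes it easy to hand over to subprocess.run later without worrying about shell quoting.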

An example

$ supernova mkfastq --id=tiny-bcl \
                     --run=/path/to/tiny_bcl \
                     --csv=tiny-bcl-simple-2.1.0.csv

supernova mkfastq
Copyright (c) 2017 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Martian Runtime - 2.1.1-v2.3.3
Running preflight checks (please wait)...
2017-08-09 16:33:54 [runtime] (ready)           ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
2017-08-09 16:33:57 [runtime] (split_complete)  ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
2017-08-09 16:33:57 [runtime] (run:local)       ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main
2017-08-09 16:34:00 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
...

Here, /path/to/tiny_bcl is the path to the flowcell directory that contains all the BCL files to be demultiplexed.

tiny-bcl-simple-2.1.0.csv is the path to the CSV file, which contains three columns: the lanes associated with each library, the names of the distinct whole-genome libraries, and the sample index set names used to demultiplex the libraries.

Lane,Sample,Index
1,test_sample,SI-GA-A3

Assembly pipeline: supernova run

supernova run is the assembly pipeline of the Supernova software package. It is used for whole-genome assembly of 10x Genomics linked reads sequenced from a Chromium-prepared whole-genome library. That said, supernova run can also work with TELL-Seq linked reads (covered in the previous article).

What are the conditions that need to be met before proceeding with assembly using supernova run?

Supernova should be run with 38-56X coverage of the genome. If you can estimate the genome size and you know the read length and the total number of reads, you can calculate the coverage.
supernova run can handle at most 2.14 billion reads.
The largest genome size that Supernova has been tested on is 4 Gb.
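These conditions can be checked before starting a run with a few lines of Python. This is a minimal sketch, assuming coverage is estimated as (read length x number of reads) / genome size; the function names and the example dataset (150-base reads, 250 million reads, an 800 Mb genome) are hypothetical.

```python
MIN_COVERAGE = 38
MAX_COVERAGE = 56
MAX_READS = 2_140_000_000
MAX_GENOME_SIZE = 4_000_000_000


def estimate_coverage(read_length, num_reads, genome_size):
    """Coverage = (read length x number of reads) / genome size."""
    return read_length * num_reads / genome_size


def check_inputs(read_length, num_reads, genome_size):
    """Return a list of warnings; an empty list means all conditions are met."""
    warnings = []
    coverage = estimate_coverage(read_length, num_reads, genome_size)
    if not MIN_COVERAGE <= coverage <= MAX_COVERAGE:
        warnings.append(f"coverage {coverage:.1f}X is outside 38-56X")
    if num_reads > MAX_READS:
        warnings.append("more than 2.14 billion reads")
    if genome_size > MAX_GENOME_SIZE:
        warnings.append("genome larger than 4 Gb")
    return warnings


# Hypothetical dataset: 150-base reads, 250 million reads, 800 Mb genome.
print(estimate_coverage(150, 250_000_000, 800_000_000))
print(check_inputs(150, 250_000_000, 800_000_000))
```

An empty list from check_inputs means the dataset sits inside the recommended ranges; otherwise each warning names the condition that was missed.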

How to set up the supernova run command, with essential parameters.

The following are the required parameters.

--id: A unique run ID. It becomes the name of the output folder.

--fastqs: The path to the directory that contains the demultiplexed FASTQ files.

--maxreads: The number of reads to use. You can either use all of your reads or specify a number of reads. Note that at most 2.14 billion reads can be used.

Why do we need to set this parameter?

For the supernova run pipeline, the ideal genome coverage for a good-quality assembly is in the range of 38-56X. Coverage somewhat above 56X can still help, but coverage far above that range may be deleterious. If the coverage is far below the lowest recommended value of 38X, supernova run terminates.

If you have more reads than the cutoff of 2.14 billion, you can set --maxreads to the number of reads that corresponds to a coverage of 56X. This requires prior knowledge of the genome size. Supernova also estimates the genome size at the beginning of a run, and you can use that estimate to calculate the number of reads that corresponds to 56X coverage.

For example, let us say you have estimated the genome size to be 800 Mb. What is the number of reads for 56X coverage of this 800 Mb genome?

coverage = (length of reads x number of reads) / genome size

number of reads = (coverage x genome size) / length of reads

With 150-base reads:

→ (56 x 800,000,000) / 150 ≈ 0.299 billion

This is less than the cutoff of 2.14 billion.

Now, let us say the estimated genome size was 8 Gb. In this case, a coverage of 56X corresponds to a read number of

(56 x 8,000,000,000) / 150 ≈ 2.99 billion. This is much higher than the maximum of 2.14 billion. In this case, supernova run would use 2.14 billion reads, which corresponds to a coverage of about 40X.
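The same arithmetic can be scripted. The Python sketch below computes the number of reads for a target coverage and caps it at 2.14 billion. The function names are hypothetical, and the read length of 150 bases is the value assumed in the worked examples here.

```python
MAX_READS = 2_140_000_000


def reads_for_coverage(coverage, genome_size, read_length):
    """Number of reads = (coverage x genome size) / read length, rounded down."""
    return coverage * genome_size // read_length


def maxreads_value(coverage, genome_size, read_length):
    """Reads needed for the target coverage, capped at the 2.14 billion limit."""
    return min(reads_for_coverage(coverage, genome_size, read_length), MAX_READS)


for genome_size in (800_000_000, 8_000_000_000):
    reads = maxreads_value(56, genome_size, 150)
    achieved = reads * 150 / genome_size
    print(f"genome {genome_size:,} bases: --maxreads={reads} (~{achieved:.1f}X)")
```

For the 800 Mb genome the cap is never reached, while for the 8 Gb genome the value is limited to 2.14 billion reads and the achievable coverage drops to roughly 40X.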

Running supernova run

After determining these input arguments, call supernova run:

$ supernova run --id=sample345 \
                --fastqs=/home/jdoe/runs/HAWT7ADXX/outs/fastq_path
