Processing the Raw Reads using Trimmomatic; a Polishing Task Before Genome Assembly

Vijithkumar V
Sep 13, 2023
9 min read

Updated: Oct 16, 2023

Trimmomatic is performed before genome assembly

We have something called raw reads. If it is a paired-end sequencing, we have something called paired-end reads, consisting of an Read1 (R1) and read2 (R2) file.

Can we directly assemble and reconstruct the genome from the raw read files?

It is generally not advised to perform an assembly directly from these raw read files. These read files also contain low-quality reads that can not reliably be used to reconstruct the genome. These raw read files often are contaminated, therefore need to be subjected to a quality check up, before the assembly process.

What is the major source of contamination?

The major source of contamination is the sequencing adapters. Primer sequences can also be present within the reads. These adapter/ primer sequence-containing reads need to be processed for obtaining the valid read sequences. A FastQC analysis will generate a report, bearing information about the presence / absence of adapters in the R1 and R2 files.

💡 Here, you can see the adapter-ligated DNA fragment. In a SE sequencing, this blue-adapter-containing sequence will be read. In Paired-End sequencing, there is another round of cluster-generation. The second adapter-containing DNA fragment will be sequenced.

How do adapter sequences get into the reads?

After the library preparation, the library sequences will be loaded onto an Illumina Flow cell, where the adapter-ligated gDNA sequences (in case of de novo assembly) or RNA fragments (for transcriptome analysis) will be PCR-amplified, and further read. The library sequences will be read for a fixed number of cycles. In the case of Paired-end sequencing, a sequence will be read in the forward and reverse direction, for, let’s say, 150 cycles. This will result in the generation of reads of 150 bases.

💡 Here, the blue fragments, attached to the both ends of the insert, are called adapters. We can call them Read1 adapter and Read2 adapter. The Read1 adapter contains the Read1 primer-binding site, and the Read2 adapter contains the Read2 primer-binding site. These fragments will be read/ sequenced for a fixed number of cycles, in the forward and reverse direction.

Sometimes there can be an inner distance that is left un-sequenced. Broadly speaking, this can happen when the read length (the number of read cycles) is less than the length of the DNA fragment/ insert.

When there is shorter fragment, shorter than the length of the number of read cycles, the sequencing process can over-shoot to the adapter segment of the reverse read. This adapter contamination is called “adapter read-through”. This “adapter read-through” can result in the inclusion of either the full adapter or partial adapter being present in the read.

How to remove adapters or other Illumina technical sequences like primers?

It is very important to remove adapters, over represented sequences and other technical sequences like primers, from the reads. There are multitude of ways by which we can remove contaminating sequences. The popular bioinformatic pipelines that can potentially remove adapters are Trimmomatic, Trim_galore, fastp, etc.

Removing adapters using Trimmomatic

Adapters and other technical sequences can be removed using the Trimmomatic. The basic syntax of the Trimmomatic command is as follows.

java -jar <path to trimmomatic.jar> PE [-threads <threads] [-phred33 | -phred64] [-trimlog <logFile>] >] [-basein <inputBase> | <input 1> <input 2>] [-baseout <outputBase> | <unpaired output 1> <paired output 2> <unpaired output 2> <step 1>

The actual Trimmomatic program is a .jar file, and this needs JAVA to be installed in the machine.

PE flag specifies that the read files are from a paired-end sequencing.

-threads specifies the number of threads in a multi-core processor.

-phred33 specifies the quality encoding of the base in the fastq nucleotide sequences. According to phred33 quality encoding, the quality score of each nucleotide base in the read is represented using ASCII encoding system. Normally, for each nucleotide base, the phred score is calculated as follows:

phred = -10*log(probability that a base is wrong).

This phred score is converted into an ASCII character via adding 33 to each phredscore value.

-trimlog specifies a log file in text format. This log file contains information about the trimmed reads. The log file contains the following information:

The read name
The surviving sequence length.
The amount of sequence trimmed from the start.
The amount trimmed from the end.

Next we need to specify the two input files. The first one is the forward fastq file, and the second one is the reverse fastq file. You may explicitly specify the fastq files. Otherwise, you may enter the name of the forward read followed by -basein flag, and the reverse read will be detected automatically. For example, if the name of the forward read file is Sample_Name_R1_001.fq.gz, the reverse file Sample_Name_R2_001.fq.gz will be detected automatically. If the forward file is Sample_Name.f.fastq, the program will look for Sample_Name.r.fastq for reverse file.

Next, we need to specify the output files.

There are four output files we need to specify.

For paired forward read
For unpaired forward read
For paired reverse reads
For unpaired reverse reads.

The paired forward reads contain all the surviving forward reads. The unpaired forward reads contain all the reads whose partner reads were dropped.

One may specify all the output files. Otherwise, a base-name can be provided following the --baseout flag, and the names of all output files will be created.

For example, one may provide a base-name, mySampleFiltered.fq.gz, and the program will generate the output files, following a particular nomenclature pattern.

o mySampleFiltered_1P.fq.gz - for paired forward reads o mySampleFiltered_1U.fq.gz - for unpaired forward reads o mySampleFiltered_2P.fq.gz - for paired reverse reads o mySampleFiltered_2U.fq.gz - for unpaired reverse reads

All these parameters are called basic parameters, that are associated with setting up the work environment. Now we need to deal with the trimming modules.

The first trimming module is called ILLUMINACLIP. This module accepts various parameters that are associated with finding and trimming the adapters.

ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>

Here, the parameters are:

fastaWithAdaptersEtc: Here we need to specify the path to the fasta file containing the adapter sequence, primer sequence etc.
seedMismatches: This is for a simple match strategy in Trimmomatic. Here, we need to specify the number of maximum mismatch count that would be taken into consideration as a full match. A higher value would result in the generation of more false positive data.
palindromeClipThreshold: Here we need to specify how accurate the match between the two adapter-ligated reads must be for PE palindrome read alignment.
simpleClipThreshold: Here, we need to specify how accurate the match between the adapter and the read must be.

ILLUMINACLIP takes two additional parameters also.

ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>:<minAdapterLength>:<keepBothReads>

minAdapterLength: Here, we need to specify the length of the minimum length of the adapter, in which case the adapter will be removed. This is in addition to the palindromeClipThreshold value. The default value is 8. This means that as long as the minimum length of the adapter is 8 or greater, the adapter read-through will be trimmed. This helps to remove shorter adapter fragments.
keepBothReads: This parameter applies to the palindrome match strategy. When an adapter read-through is found and the adapter removed, then the reverse read is the same as the forward one, except the fact that the latter is a reverse compliment of the former. In such a scenario, the reverse read will be dropped by the system. If keepBothReads is set to True, the reverse read will also be retained.

The second module associated with Trimming is called SLIDINGWINDOW

Here, a moving window of multiple-bases will be assessed for the average base quality. If the average base quality falls below a certain threshold, the sequence will be removed. So, even if the single base has a low quality, and the average base quality is still above the threshold, the sequence won’t be trimmed.

The basic syntax of the SLIDINGWINDOW is:

SLIDINGWINDOW:<windowSize>:<requiredQuality>

Here, the <windowSize> is the number of bases to average across.

<requiredQuality>is the threshold quality value

The third module associated with Trimming: LEADING

This module trims the base with a quality value below a certain threshold.

The basic syntax is:

LEADING:<quality>

where the <quality> is the the value below which a base will be trimmed.

The fourth important module associated with Trimming is, TRAILING

This module takes a single argument of threshold quality value. If the last base in the read has a quality value below the set threshold, it will be removed.

The basic syntax is:

TRAILING:<quality>

The fifth module associated with trimming the read: CROP

With this module in action, bases, regardless of quality, can be trimmed from the end of the read, so that the read, maximally, has a length according to the set-value.

CROP:<length>

where <length> is the final length of the read after bases have been removed.

The sixth module associated with Trimming the read: HEADCROP

With this module in action, bases, regardless of their quality, will be trimmed from the beginning of the read, so that the read, maximally, has a length according to the set-value.

HEADCROP:<length>

where <length> is the number of bases to be removed from the beginning of the read

The seventh module is associated with filtering the reads: MINLEN

MINLEN takes a single argument. This specifies the minimum length of the read to be able to survive. After trimming, if the length of the read is less than the value stipulated in the module, the read will be dropped.

TOPHRED33

This module is associated with the quality of the base. If the quality of the bases are calculated according to the ASCII encoding, TOPHRED33 is a valid module.

More about the ILLUMINACLIP module

ILLUMINACLIP does the finding and the clipping of the adapter.

Technically, the adapter and other sequences like primer can appear in any location within the read, the most common mechanism of adapter contamination is when the sequencer reads a DNA fragment which is less than the length of the read. In this scenario, the read starts with a valid DNA sequence, and since the DNA fragment is shorter, the sequencer continues reading and “reads through” to the adapter of the opposite (reverse) read. This way, an adapter creeps in to the 3’-end of the read. The 5’-end starts with valid DNA sequence. This results in a partial or full adapter sequence in the 3’-end of the read. If the adapter sequence is full, it is easier to identify the adapter and remove it from the read. If it is a partial adapter, it is quite difficult to remove it.

💡 This is how adapters are ligated to the DNA fragments.

💡 How short inserts affect sequencing performance - Illumina Knowledge

This picture shows very clearly how adapters are added to the DNA insert. The adapter complex is comprised of P5/P7 sequences, required to bind to the Flow Cell, Sample Indexes, Read1 and Read 2 primers.

Now, these adapters are attached to the DNA insert. DNA inserts have Adenine over-hangs, and the adapters have Thymine over-hangs. As you see in the picture, Adenine and Thymine hybridize and the phosphate group binds with the 3’-OH, to form the backbone. Thus we have the sequencing reads. Now, the DNA inserts bind to the flow cell for sequencing. Primer binds to the primer-binding site on the read.

How does Trimmomatic trim the adapter sequences

In Single End sequencing method, where the DNA insert is sequenced only in one direction (there is no reverse pair), Trimmomatic uses a Simple strategy. Since the adapter sequence is known to us, continuous sections of each adapter, of 16bp length, will be matched with the read, in every possible positions. If there is a perfect match, or a close-match determined by the argument fed to the <seed mismatches> parameter, the entire alignment between the adapter and the read will be scored.

If <seed mismatches> was fed with an argument of 2 (the default value), even if there is a mismatch in two bases between the “seed” and the read, it will be considered as a perfect match.

Now, clipping of the adapter will happen based on the threshold value we feed, as an argument, to the <simple clip threshold> parameter. For every match, a score will be calculated as follows:

score = log10(probability that there is a random match of nucleotide base)

score = log10(4) ~ 0.602

This, for each match of nucleotide base, the score increases by 0.602. If there is a mis-match, the score reduces by Q/10, where Q is the base-quality, that varies from 0 to 40. Therefore, the penalty value ranges from 0 to 4. The adapter will be trimmed as per the threshold score value we assign to the <simple clip threshold> . If we provide a value of “7”, this is associated with a perfect match of 12 bases. If we want the adapter to be trimmed if there is a perfect match of 25 sequences, the threshold value is “15”

This scenario can be understood well below.

💡 Trimmomatic: a flexible trimmer for Illumina sequence data | Bioinformatics | Oxford Academic (oup.com)

The advantage of the SimpleClipThreshold method is that, it can detect any technical sequence located anywhere within the read. But as you can see in A and D, when there is only a short partial read-through/ match, detecting the adapter reliably is not possible.

When there is a partial read-through at the 3’-end, Palindrome-Match is employed

In palindrome mode, the adapter-ligated forward and the reverse reads are aligned, and the algorithm will calculate the score. If there is a matching score of 30, corresponding to a match of 50 bases, the adapters will be trimmed.

💡 Trimmomatic: a flexible trimmer for Illumina sequence data | Bioinformatics | Oxford Academic (oup.com)

In the palindrome mode, the adapter-ligated forward and the reverse reads are aligned, and the global score will be calculated.

The palindrome mode starts with an alignment of adapters and the start of each read, as you can see in A. If adapters directly joined, with no useful sequence information between them, such reads will be dropped. Then the testing process proceeds by moving the relative position of the reads backwards. When there is a match, both reads will be clipped. Even if the adapter is partial and reaches to the ligated-adapter, the reads will be clipped. The clipping will finish when there is no more over-hanging part that reaches to the adapter.

Life Science