Demo files for Trimmomatic were downloaded from “bioinfogp.cnb.csic.es/files/samples/rnaseq/”
The files were downloaded by using the following command:
wget https://bioinfogp.cnb.csic.es/files/samples/rnaseq/RNA-Seq_Sample_Files.zip
These are paired end-reads. For example,
1M_SRR9336468_1.fastq.gz and 1M_SRR9336468_2.fastq.gzarepaired end reads.
Checking the number of reads in the fastq file
We need to check the number of sequences in the fastq file. We have 1M_SRR9336468_1.fastq.gz and 1M_SRR9336468_2.fastq.gz.
This can be achieved by the following command:
zcat your_file.fastq.gz | wc -l | awk '{print $1/4}’
It turned out that there are 1000000 reads in the fastq file.
For the purpose of testing, we need a simple fastq file, with a very few sequences. Let us cut down the fastq file to have, let’s say 10 reads in the file.
Splitting the fastq file
In order to cut down the fastq file, we will use seqkit split2 module. The following command can be used to achieve this task.
seqkit split2 -1 1M_SRR9336468_1.fastq.gz -2 1M_SRR9336468_2.fastq.gz -s 10 -O out -f -e .gz
This will generate 1 lakh parts of both forward and the reverse reads, containing 10 reads in each. For the sake of this demo-experimentation, we will have only very few fastq files.
Now, let’s add forward and reverse adapter sequences to the respective read files.
Appending the read files with adapter sequences
We will use biopython for the purpose of appending the fastq reads with adapter sequences.
Here is the code that we have crafted.
from Bio.SeqIO.QualityIO import FastqGeneralIterator
"""Now we need to define the forward adapter and the reverse adapter"""
forAdapter = "TACACTCTTTCCCTACACGACGCTCTTCCGATCT"
revAdapter = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"
"""forward file path"""
forPath = r"D:\website\wixSite\articles\trimmomaticDemo\inFiles\1M_SRR9336468_1.part_001.fastq."
"""Open the forward read file"""
forHandle = open(forPath, "r", encoding="utf-8")
"""reverse file path"""
revPath = r"D:\website\wixSite\articles\trimmomaticDemo\inFiles\1M_SRR9336468_2.part_001.fastq."
"""open the reverse file """
revHandle = open(revPath, "r", encoding="utf-8")
#declare enmpty list forward
newForFastq=[]
#declare enmpty list reverse
newRevFastq=[]
"""Iterate through each fastq block; declare itervariables to store header, sequence, and the quality"""
for header, seq, qual in FastqGeneralIterator(forHandle):
seqNewFor = seq.rstrip()+forAdapter
newForFastq.append("@%s\n%s\n+\n%s\n" % (header, seqNewFor, qual))
"""Iterate through each fastq block; declare itervariables to store header, sequence, and the quality"""
for header, seq, qual in FastqGeneralIterator(revHandle):
seqNewRev = seq.rstrip()+revAdapter
newRevFastq.append("@%s\n%s\n+\n%s\n" % (header, seqNewRev, qual))
forPathOut = r"D:\website\wixSite\articles\trimmomaticDemo\outFiles\1M_SRR9336468_1.part_001.fastq"
revPathOut = r"D:\website\wixSite\articles\trimmomaticDemo\outFiles\1M_SRR9336468_2.part_001.fastq"
with open(forPathOut, "w")as forOut:
for line in newForFastq:
forOut.write(line)
with open(revPathOut, "w")as revOut:
for line in newRevFastq:
revOut.write(line)
We have added “TACACTCTTTCCCTACACGACGCTCTTCCGATCT” at the end of every sequence in the forward fastq file. And we have added GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT at the end of every sequence in the reverse fastq file.
Let us look at the first 10 lines of both forward and the reverse file.
Forward
┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/RNA-Seq_Sample_Files/out]
└─$ zcat 1M_SRR9336468_1.part_001.fastq.gz | head -n 10
@SRR9336468.1 1/1
NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTC
+
#AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-F
@SRR9336468.2 2/1
NGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCCCTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTGTCAATGTAGGTAC
+
#AAFFJFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJFFJFJFAJJJJJJJJFFJJJJFJJJJJJJJJJJJAJAJJJJJJJJJJJJFAJJJ7FFJJJF<AFAJFFFJJJJFFFJ-AF
@SRR9336468.3 3/1
NGATGACATGTGGAAATATAACAGAGACAGAATATTTAATTTTATGTATGTAAGCATCGATTGGACACCAGGCTTATTGATGACCTTACTCGTCCAATTTGGCACGGACCGCTTTAACTTGCAAGTAGTTTTGTAAAGCATCAACAGACA
Reverse
┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/RNA-Seq_Sample_Files/out]
└─$ zcat 1M_SRR9336468_2.part_001.fastq.gz | head -n 10
@SRR9336468.1 1/2
NTTGGTATCTACTACAATTCTGGTGAGGTCTGTTGTGCGGGTTCAAGGGTGTATGTTGAAGAATCTATTTACGACAAATTCATTGAAGAGTTCAAAGCCGCTTCTGAATCCATCAAGGTGGGCGACCCATTCGATGAATCTACTTTCCAA
+
#AAAFJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJJJJJJJJJJJJJJJJFJJJ<AFJ-AF7FJJJJJJJJJ<AFJFJJ-77FFJJJAJFFFFFFFFFFJ7-<<JA7F<-AAFAAFJ-<FJ
@SRR9336468.2 2/2
NTCCAAGAGAACCAAGAGATGGTACAAGAATGCCGGTTTGGGATTCAAGACCCCAAAAACCGCTATTGAAGGTTCCTACATTGACAAGAAATGTCCATTCACTGGTTTAGTTTCCATCCGTGGTAAGATCTTGACCGGTACCGTCGTCTC
+
#AAFFJFJFFJJJJJFJJJJJJJJJJJJJJJFJJJJFJJJJJJJFJJJFFJJJJJFJJJJJJJAJAJJJJJJFJJJJJJJJJJJJJJ-FJFFJJJJJJJFFJFJJJFFJJFFAJ<F7FFFJJ7JJF<JFJFAAAAAA-FFJ)--<<AA<F
@SRR9336468.3 3/2
NACACCTCTAATATTAATACCGCCTTAAAAGTGGCTGATAGAGTTAATGCGGGTACGGTCTGGATAAACACTTATAACGATTTCCACCACGCAGTTCCTTTCGGTGGGTTCAATGCATCTGGTTTGGGCAGGGAAATGTCTGTTGATGCT
Here, we can see that the length of the forward and the reverse read is 150bp. The forward and the reverse reads are not reverse complimentary to each other. Normally, when there is an adapter read-through, the forward and the reverse reads are complimentary to each other.
Let’s try working with a single fastq file, for SE sequencing.
Let’s run Trimmomatic.
java -jar trimmomatic-0.39.jar SE -trimlog ~/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt \
/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq \
~/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz \
ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 \
MINLEN:36
So, let us look at the progress of the Trimmomatic.
┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/Trimmomatic/Trimmomatic-0.39]
└─$ java -jar trimmomatic-0.39.jar SE -trimlog ~/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt \
/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq \
~/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz \
ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 \
MINLEN:36
TrimmomaticSE: Started with arguments:
-trimlog /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 MINLEN:36
Automatically using 4 threads
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
ILLUMINACLIP: Using 0 prefix pairs, 1 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Reads: 10 Surviving: 10 (100.00%) Dropped: 0 (0.00%)
TrimmomaticSE: Completed successfully
Here, the report says that all the reads are surviving. No reads have been dropped.
trimlog.txt; The log file of all the reads in the fastq file
Let us look at the log report.
SRR9336468.1 1/1 150 0 150 34
SRR9336468.2 2/1 150 0 150 34
SRR9336468.3 3/1 150 0 150 34
SRR9336468.4 4/1 150 0 150 34
SRR9336468.5 5/1 150 0 150 34
SRR9336468.6 6/1 150 0 150 34
SRR9336468.7 7/1 150 0 150 34
SRR9336468.8 8/1 150 0 150 34
SRR9336468.9 9/1 150 0 150 34
SRR9336468.10 10/1 150 0 150 34
Here, the columns are as follows.
the read name
the surviving sequence length
the location of the first surviving base, aka. the amount trimmed from the start
the location of the last surviving base in the original read
the amount trimmed from the end
The first column contains list of all the reads, prefixed by SRR. The second column contains the length of the surviving reads. The third column contains the position of the fist surviving base. It is zero for all the reads. The fourth column contains the position of the last surviving base. It is 150 for all the reads. The last column contains the amount trimmed from the end of the read. It is 34.
Since MINLEN was set to 36 (means, drop the reads if the length of the read falls below 36), no reads were dropped.
Setting MINLEN to 151
┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/Trimmomatic/Trimmomatic-0.39]
└─$ java -jar trimmomatic-0.39.jar SE -trimlog ~/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq ~/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 MINLEN:151
TrimmomaticSE: Started with arguments:
-trimlog /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 MINLEN:151
Automatically using 4 threads
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
ILLUMINACLIP: Using 0 prefix pairs, 1 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Reads: 10 Surviving: 0 (0.00%) Dropped: 10 (100.00%)
TrimmomaticSE: Completed successfully
Here, we can see that all the 10 reads have been dropped.
SRR9336468.1 1/1 0 0 0 0
SRR9336468.2 2/1 0 0 0 0
SRR9336468.3 3/1 0 0 0 0
SRR9336468.4 4/1 0 0 0 0
SRR9336468.5 5/1 0 0 0 0
SRR9336468.6 6/1 0 0 0 0
SRR9336468.7 7/1 0 0 0 0
SRR9336468.8 8/1 0 0 0 0
SRR9336468.9 9/1 0 0 0 0
SRR9336468.10 10/1 0 0 0 0
Changing the simpleClipThresholdvalue to 21
simpleClipThreshold is a parameter of the ILLUMINACLIP module. So, once a “seed” of 16bp are aligned in every position possible with the read, the program will look for a match between the seed and the read. If a match is established, the full adapter will be aligned with the read and a score will be calculated.
Score = log10(probability that there is random match)
~0.602. So, for 34 bases, the score is less than 21
Look at the log report:
SRR9336468.1 1/1 184 0 184 0
SRR9336468.2 2/1 184 0 184 0
SRR9336468.3 3/1 184 0 184 0
SRR9336468.4 4/1 184 0 184 0
SRR9336468.5 5/1 184 0 184 0
SRR9336468.6 6/1 184 0 184 0
SRR9336468.7 7/1 184 0 184 0
SRR9336468.8 8/1 184 0 184 0
SRR9336468.9 9/1 184 0 184 0
SRR9336468.10 10/1 184 0 184 0
Here, we can see that no trimming has been done.
Changing the simpleClipThresholdvalue to 20
Here, we have 34 sequences, and therefore the maximum permissible score is:
34 x log10(4)
=0.602 x 34
=20.468
~20
Let us look at the log report:
SRR9336468.1 1/1 150 0 150 34
SRR9336468.2 2/1 150 0 150 34
SRR9336468.3 3/1 150 0 150 34
SRR9336468.4 4/1 150 0 150 34
SRR9336468.5 5/1 150 0 150 34
SRR9336468.6 6/1 150 0 150 34
SRR9336468.7 7/1 150 0 150 34
SRR9336468.8 8/1 150 0 150 34
SRR9336468.9 9/1 150 0 150 34
SRR9336468.10 10/1 150 0 150 34
See, here we can see that the trimming has been done!
What it means is that, if you set 20 as the argument forsimpleClipThreshold , even if there is a couple of mismatches, the corresponding reads won’t be clipped.
Changing two bases in the adapter sequence in the read and setting the simpleClipThreshold to 20
@SRR9336468.1 1/1 NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCGG
+ #AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-FJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJ
@SRR9336468.2 2/1 NGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCCCTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTGTCAATGTAGGTACAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
Look at the above fastq sequence. I have purposefully changed two bases in the adapter. This shows that only 32 out of 34 bases score and there are two penalties. So, let’s calculate the score.
score = 0.602 x log10(4)
=0.602 x 32
=19.264
~19
So, when we set the simpleClipThreshold value to 20, the read whose adapter has got changes in two bases fail to meet the threshold that was set, therefore the adapter of the said read won’t get trimmed.
Let’s look at the log report here.
SRR9336468.1 1/1 184 0 184 0
SRR9336468.2 2/1 150 0 150 34
SRR9336468.3 3/1 150 0 150 34
SRR9336468.4 4/1 150 0 150 34
SRR9336468.5 5/1 150 0 150 34
SRR9336468.6 6/1 150 0 150 34
SRR9336468.7 7/1 150 0 150 34
SRR9336468.8 8/1 150 0 150 34
SRR9336468.9 9/1 150 0 150 34
SRR9336468.10 10/1 150 0 150 34
See here, the adapter hasn’t been clipped from the first read.
When the read-through adapters are short and partial
@SRR9336468.1 1/1 NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTCAGATCGGAAGAGCACACGTCTGAACTCCA
+ #AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-FJJJJJJJJJJJJJJFFJJJJJJJJJJJJJ
@SRR9336468.2 2/1 NGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCCCTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTGTCAATGTAGGTACAGATCGGAAGAGCACACGTCTGAA
+ #AAFFJFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJFFJFJFAJJJJJJJJFFJJJJFJJJJJJJJJJJJAJAJJJJJJJJJJJJFAJJJ7FFJJJF<AFAJFFFJJJJFFFJ-AFJJJJJJJJJJJJJJFFJJJJJJJJ
@SRR9336468.3 3/1 NGATGACATGTGGAAATATAACAGAGACAGAATATTTAATTTTATGTATGTAAGCATCGATTGGACACCAGGCTTATTGATGACCTTACTCGTCCAATTTGGCACGGACCGCTTTAACTTGCAAGTAGTTTTGTAAAGCATCAACAGACAAGATCGGAAGAGCACACGT
+ #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJFAJJJJFJJJJJJJJJJJFJJJJFJJJFJFJJFAJJJJFFJJJJF<7FJJ-7AFAFFFJAFJJFFJJJFFFFJFJJJJJJJJJJJJJJJJFFJJJ
@SRR9336468.4 4/1 NATAGTTATAAAAATATATGTATATAAAGTATGTGAGTGTGTGAGTGTGTGAAAAGGATGTTATTAAATAGTATGTTATTATTTCTTGTCGTATCTGGAGTAGTATTTCTTTGAAATTATACCGTCCAGCGTCAAAAGACCCAGAGCGCCAGATCGGAAGAGCA
+ #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJFJJJJFJJJJJJJFFJJJ-FJJJJJF-AFJ<7<<JJJJJ<JJJJJFJJJFJJJJJJJJJJJJJJJJ
@SRR9336468.5 5/1 NGTTGAGTAACTTGACCTTCACACGTAGTTGCAAGTACCGATGCAAAATTTTTACACTGGTGACAACCTTAATGGTATAATCCAATTAGCGACATCACCGAGAAAGTATCCAAGTCGATTTGATGAATAAAATAATAACTTAAACATGCAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
Here, 5, 10, 15, 20 bases have been clipped from the first, second, third and the fourth fastq files. The simpleClipThreshold is set to 20. This quality score is associated with a perfect match of 34 bases.
The following is the log report.
SRR9336468.1 1/1 179 0 179 0
SRR9336468.2 2/1 174 0 174 0
SRR9336468.3 3/1 169 0 169 0
SRR9336468.4 4/1 164 0 164 0
SRR9336468.5 5/1 150 0 150 34
Adapters were not clipped from the first four reads.
The simpleClipThreshold value was set to 15. This is associated with a perfect match of 25 bases.
The log report is:
SRR9336468.1 1/1 150 0 150 29
SRR9336468.2 2/1 174 0 174 0
SRR9336468.3 3/1 169 0 169 0
SRR9336468.4 4/1 164 0 164 0
SRR9336468.5 5/1 150 0 150 34
read 1 has 29 bases in the adapter, and it meets a quality score of 15.
@SRR9336468.1 1/1 NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTC
+ #AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-F
The simpleClipThreshold value was set to 10. This is associated with a perfect match of 16 bases.
Since the first read has 29 bases, the match score will be higher than 10, so the adapter will be clipped. Since the second read has 24 bases, the match score will be still above 10, and therefore will be clipped. The third read has 19 bases, the match score is still above 10, and therefore the adapter will be clipped.
The simpleClipThreshold value was set to 1. This is associated with a perfect match of 2 bases.
But in such a scenario, there are chances of false positives being generated.
Let’s look at the trim log report:
SRR9336468.1 1/1 150 0 150 29
SRR9336468.2 2/1 150 0 150 24
SRR9336468.3 3/1 150 0 150 19
SRR9336468.4 4/1 148 0 148 16
SRR9336468.5 5/1 150 0 150 34
댓글