top of page

Search Results

52 items found for ""

  • de novo genome assembly using Supernova 2.1.0

    What is Supernova 2.1.0? Everything starts with an individual DNA source. From this single DNA source, a single whole genome library will be prepared. This single whole-genome library will be subjected to an Illumina sequencing platform like NovaSeq, and linked reads will be generated. Then these Linked-reads need to be assembled to generate the whole genome. This is achieved by the software package called Supernova. The latest version of Supernova is Supernova 2.1.0. What is a key feature of the Supernova package ? The key feature of the Supernova software package is that it creates diploid assemblies. And it can represent maternal and paternal chromosomes separately. Supernova package includes two processing pipelines and one post-processing pipeline. As we have seen in the previous article about the TELLYSIS software package, where there are three processing pipelines, such as Quality control pipeline (TELL-read), phasing pipeline (TELL-sort), and assembly pipeline (TELL-link), in the case of Supernova software package, there are three important pipelines. supernova mkfastq, supernova run, and supernova mkoutput are the important pipelines. supernova mkfastq is for generating the demultiplexed fastq files. Here, the Base Call (BCL) files - including the forward and the reverse reads, and the barcode files from a single flowcell - and the samplesheet (a csv file) will be provided to demultiplex and convert the files to fastq Supernova run is the assembly pipeline that uses the sample-demultiplexed fastq files to assemble the genome. So, it takes the fastq files containing the barcoded reads, and assemble the genome using the graph-based assembly process. The approach is such that it first builds the assembly using the read k-mers, where k is 48. This means, it uses 48-mers from the reads, and builds an assembly. Then resolve this assembly using read pairs to make 200-mers, and finally 1,00,000-mers. Supernova mkoutput is the pipeline that takes the output from the Supernova run and produces different styles of FASTA files, for downstream processes and analysis. Generating the FASTQ files with supernova mkfastq supernova mkfastq is a wrapper around the Illumina’s bcl2fastq. The path to the flowcell directory is provided as an input parameter. Importantly, we need to understand that each flowcell contains lanes, and each lane contains wells. A single whole genome library is deposited in a single well. In each well, there is a lawn of P7 or/ and P5 adapters. If only P7 adapters are there, these adapters are associated with one of four unique oligonucleotide sequences. Every well that contains a whole genome library is studded with adapters and associated four oligonucleotide sequences. These oligonucleotides are unique among themselves and among the whole sets. When you provide the path to the flowcell directory, you also need to provide a information comprising the lanes in the flowcell that contain the libraries, and the name of the well with a set of four oligonucleotide sequences. The information is provided as a comma-delimited series of lane number, name of the sample in each well, and the well name, for supernova mkfastq to pullout the oligonucleotide set used for the whole genome library. All the BCL files sharing the same set of oligonucleotide sequences as provided in the sample sheet, will be sample-demultiplexed. For example, let us look at this picture. In this picture, there are two individual whole genome libraries. Let us imagine that these two whole genome libraries were prepared from two individual DNA sources. These two libraries are fed into two independent wells, either on the same lane or on a different lane of the same flowcell. These wells are studded with the P7 and/ or P5 adapters, with a unique set of oligonucleotide sequences. In the picture, given above, Library1 and Library2 are the two distinct whole genome libraries, prepared from two different DNA sources. these two libraries are fed into an Illumina sequencer, let us call it NovaSeq , and these libraries are now in two independent wells of the same lane or a different lane of the same flowcell Now, the well, where Library1 was fed into is called A1. The A1 well contains P7 and/or P5 adapters with the same set of oligonucleotide sequences as the ones that were used for the Library1 preparation. Similarly, the Library2 was fed into another well, A2. When we provide the BCL files, associated with a flowcell directory, to the supernova mkfastq pipeline, it is supposed to sample-demultiplex the BCL files (read files and the barcode files) on the basis of the sets of sample indices. So, we need to provide a sample-sheet Usually, the samplesheet is a csv file that contains three columns: 1. Lane number of the flowcell 2. Sample/library name 3. The well name, for example SI-GA-A1 (here, A1) is the well name. Seeing the well name, SI-GA-A1, supernova mkfastq automatically pulls out the set of four oligonucleotide sequences, and demultiplex the BCL files by collating all the reads with these sample indices. Finally, these BCL files are converted into FASTQ. Here, the whole genome library, Library1, was loaded into the well (SI-GA-A1) of the lane 1 (of the flowcell); the second whole genome library, Library2, was loaded into the well (SI-GA-A2), of lane 1, and lane 2 (of the same flowcell). Here, the BCL files were demultiplexed into two sets of FASTQ files, because both the libraries were sample-indexed using two distinct sets of sample indices. Arguments and Options to be fed to the parameters of the supernova mkfastq Supernova mkfastq is a wrapper around the bcl2fastq, and therefore the former accepts arguments to parameters of the latter. —run: this is a required parameter. The path to the Illumina’s run folder needs to be provided here. This folder should contain the BCL run files, associated with a flowcell —id: The argument that provide to this parameter becomes the name of the folder that supernova mkfastq creates. This is an optional parameter and this refers to the name of the flowcell. —samplesheet: This is the path to an Illumina Experiment Manager-compatible samplesheet. This samplesheet should contain the following columns: 1. the sample indices set names, such as SI-GA-A1 or SI-GA-A2, depending upon the number of different libraries loaded into the flowcell 2. Sample names (of distinct libraries) 3. Lane number. —csv: Here, you need to provide path to the csv file. This file should contain three columns of information, i.e. column specifying the lanes to be used. For example, 1, 1-2, etc., sample names, depending upon the number of distinct whole genome libraries, Sample index set names for demultiplexing purpose. —lanes: We know that there are lanes in a flowcell, and each lane contains nanowells. if all the lanes have been utilized for sequencing the libraries, but you want only BCL files of certain lanes to be de-multiplexed , then those lanes can be specified in a comma-delimited format; for example 1,2 —output-dir: This is the path to the directory in which supernova mkfastq is going to generate the demultiplexed fastq files. An Example data $ supernova mkfastq --id=tiny-bcl \\ --run=/path/to/tiny_bcl \\ --csv=tiny-bcl-simple-2.1.0.csv supernova mkfastq Copyright (c) 2017 10x Genomics, Inc. All rights reserved. ------------------------------------------------------------------------------- Martian Runtime - 2.1.1-v2.3.3 Running preflight checks (please wait)... 2017-08-09 16:33:54 [runtime] (ready) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET 2017-08-09 16:33:57 [runtime] (split_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET 2017-08-09 16:33:57 [runtime] (run:local) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main 2017-08-09 16:34:00 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET ... here, /path/to/tiny_bcl is the path to the flowcell directory that contains all the BCL files, to be demultiplexed. tiny-bcl-simple-2.1.0.csv is the path to the csv file, that contains three columns of information, comprising the names of the distinct whole genome libraries, lanes associated with the respective libraries, and sample index set names, for demultiplexing the libraries. Lane,Sample,Index 1,test_sample,SI-GA-A3 Assembly pipeline: supernova run Supernova run is the assembly pipeline, a component of the supernova software package. This is used for the whole genome assembly of the 10x genomics linked-reads, sequenced from the chromium prepared whole genome library. That being said, supernova run can also work with TELL-seq linked-reads (addressed in the previous article). What are the conditions that need to be met before proceeding with assembly using supernova run? Supernova should be run using 38 - 56X coverage of the genome. If you can estimate the size of the genome, know the sequence read length, and the total number of reads, it is possible to calculate the coverage. The maximum number of reads that supernova run can handle is not more than 2.14 billion. The maximum genome size that supernova has been tested for is not more than 4 GB. How to set up the supernova run command, with essential parameters. The following are the required parameters. —id: This is a unique run id. This becomes the name of the output folder. —fastqs: This is the path where the demultiplexed fastqs are located —maxreads: This is to specify the number of reads that you need. You can either use all of your reads or specify the number of reads. We need to note that the maximum number of reads that can be used is 2.14 billion. Why do we need to set this parameter? The ideal genome coverage for a good quality assembly is in the range of 38 - 56X, for supernova run pipeline. If the coverage is more than 56X, it is better, but it might become deleterious if it is way off. If the coverage is far less than the the lowest recommended value of 38X, supernova run would terminate. If you have reads more than the cut off value of 2.14 billion, you may set a value (number of reads) corresponding to a coverage of 56X. This requires that you have a prior knowledge about the genome size. Supernova does estimate the genome size in the beginning, and you can calculate the number of reads corresponding to a coverage of 56X. For example, let us say you have estimated the genome size to be at 800Mb. What would be the number of reads for a 56X coverage of this 800Mb genome? Length of reads x number of reads) / (genome size) = coverage Number of reads = (coverage x genome size ) / length of reads → (56 x 800, 000,000) / 150 = 0.298 billion this is less than the cut off value of 2.14 billion. Now, let us say the genome size estimated was at 8 GB. In this condition, a coverage of 56X should correspond to a read number of (56 x 8,000,000,000) / 150 = 2.98 billion. This is much higher than the recommended value of 2.14 billion. In this condition, supernova run would be considering a number of reads of 2.14 billion, which should correspond to a coverage of 40X. Running supernova run After determining these input arguments, call supernova run: $ supernova run --id=sample345 \\ --fastqs=/home/jdoe/runs/HAWT7ADXX/outs/fastq_path

  • Processing the Raw Reads using Trimmomatic; a Polishing Task Before Genome Assembly

    Trimmomatic is performed before genome assembly We have something called raw reads. If it is a paired-end sequencing, we have something called paired-end reads, consisting of an Read1 (R1) and read2 (R2) file. Can we directly assemble and reconstruct the genome from the raw read files? It is generally not advised to perform an assembly directly from these raw read files. These read files also contain low-quality reads that can not reliably be used to reconstruct the genome. These raw read files often are contaminated, therefore need to be subjected to a quality check up, before the assembly process. What is the major source of contamination? The major source of contamination is the sequencing adapters. Primer sequences can also be present within the reads. These adapter/ primer sequence-containing reads need to be processed for obtaining the valid read sequences. A FastQC analysis will generate a report, bearing information about the presence / absence of adapters in the R1 and R2 files. 💡 Here, you can see the adapter-ligated DNA fragment. In a SE sequencing, this blue-adapter-containing sequence will be read. In Paired-End sequencing, there is another round of cluster-generation. The second adapter-containing DNA fragment will be sequenced. How do adapter sequences get into the reads? After the library preparation, the library sequences will be loaded onto an Illumina Flow cell, where the adapter-ligated gDNA sequences (in case of de novo assembly) or RNA fragments (for transcriptome analysis) will be PCR-amplified, and further read. The library sequences will be read for a fixed number of cycles. In the case of Paired-end sequencing, a sequence will be read in the forward and reverse direction, for, let’s say, 150 cycles. This will result in the generation of reads of 150 bases. 💡 Here, the blue fragments, attached to the both ends of the insert, are called adapters. We can call them Read1 adapter and Read2 adapter. The Read1 adapter contains the Read1 primer-binding site, and the Read2 adapter contains the Read2 primer-binding site. These fragments will be read/ sequenced for a fixed number of cycles, in the forward and reverse direction. Sometimes there can be an inner distance that is left un-sequenced. Broadly speaking, this can happen when the read length (the number of read cycles) is less than the length of the DNA fragment/ insert. When there is shorter fragment, shorter than the length of the number of read cycles, the sequencing process can over-shoot to the adapter segment of the reverse read. This adapter contamination is called “adapter read-through”. This “adapter read-through” can result in the inclusion of either the full adapter or partial adapter being present in the read. How to remove adapters or other Illumina technical sequences like primers? It is very important to remove adapters, over represented sequences and other technical sequences like primers, from the reads. There are multitude of ways by which we can remove contaminating sequences. The popular bioinformatic pipelines that can potentially remove adapters are Trimmomatic, Trim_galore, fastp, etc. Removing adapters using Trimmomatic Adapters and other technical sequences can be removed using the Trimmomatic. The basic syntax of the Trimmomatic command is as follows. java -jar PE [-threads ] >] [-basein | ] [-baseout | The actual Trimmomatic program is a .jar file, and this needs JAVA to be installed in the machine. PE flag specifies that the read files are from a paired-end sequencing. -threads specifies the number of threads in a multi-core processor. -phred33 specifies the quality encoding of the base in the fastq nucleotide sequences. According to phred33 quality encoding, the quality score of each nucleotide base in the read is represented using ASCII encoding system. Normally, for each nucleotide base, the phred score is calculated as follows: phred = -10*log(probability that a base is wrong). This phred score is converted into an ASCII character via adding 33 to each phredscore value. -trimlog specifies a log file in text format. This log file contains information about the trimmed reads. The log file contains the following information: The read name The surviving sequence length. The amount of sequence trimmed from the start. The amount trimmed from the end. Next we need to specify the two input files. The first one is the forward fastq file, and the second one is the reverse fastq file. You may explicitly specify the fastq files. Otherwise, you may enter the name of the forward read followed by -basein flag, and the reverse read will be detected automatically. For example, if the name of the forward read file is Sample_Name_R1_001.fq.gz, the reverse file Sample_Name_R2_001.fq.gz will be detected automatically. If the forward file is Sample_Name.f.fastq, the program will look for Sample_Name.r.fastq for reverse file. Next, we need to specify the output files. There are four output files we need to specify. For paired forward read For unpaired forward read For paired reverse reads For unpaired reverse reads. The paired forward reads contain all the surviving forward reads. The unpaired forward reads contain all the reads whose partner reads were dropped. One may specify all the output files. Otherwise, a base-name can be provided following the --baseout flag, and the names of all output files will be created. For example, one may provide a base-name, mySampleFiltered.fq.gz, and the program will generate the output files, following a particular nomenclature pattern. o mySampleFiltered_1P.fq.gz - for paired forward reads o mySampleFiltered_1U.fq.gz - for unpaired forward reads o mySampleFiltered_2P.fq.gz - for paired reverse reads o mySampleFiltered_2U.fq.gz - for unpaired reverse reads All these parameters are called basic parameters, that are associated with setting up the work environment. Now we need to deal with the trimming modules. The first trimming module is called ILLUMINACLIP. This module accepts various parameters that are associated with finding and trimming the adapters. ILLUMINACLIP:::: Here, the parameters are: fastaWithAdaptersEtc: Here we need to specify the path to the fasta file containing the adapter sequence, primer sequence etc. seedMismatches: This is for a simple match strategy in Trimmomatic. Here, we need to specify the number of maximum mismatch count that would be taken into consideration as a full match. A higher value would result in the generation of more false positive data. palindromeClipThreshold: Here we need to specify how accurate the match between the two adapter-ligated reads must be for PE palindrome read alignment. simpleClipThreshold: Here, we need to specify how accurate the match between the adapter and the read must be. ILLUMINACLIP takes two additional parameters also. ILLUMINACLIP:::::: minAdapterLength: Here, we need to specify the length of the minimum length of the adapter, in which case the adapter will be removed. This is in addition to the palindromeClipThreshold value. The default value is 8. This means that as long as the minimum length of the adapter is 8 or greater, the adapter read-through will be trimmed. This helps to remove shorter adapter fragments. keepBothReads: This parameter applies to the palindrome match strategy. When an adapter read-through is found and the adapter removed, then the reverse read is the same as the forward one, except the fact that the latter is a reverse compliment of the former. In such a scenario, the reverse read will be dropped by the system. If keepBothReads is set to True, the reverse read will also be retained. The second module associated with Trimming is called SLIDINGWINDOW Here, a moving window of multiple-bases will be assessed for the average base quality. If the average base quality falls below a certain threshold, the sequence will be removed. So, even if the single base has a low quality, and the average base quality is still above the threshold, the sequence won’t be trimmed. The basic syntax of the SLIDINGWINDOW is: SLIDINGWINDOW:: Here, the is the number of bases to average across. is the threshold quality value The third module associated with Trimming: LEADING This module trims the base with a quality value below a certain threshold. The basic syntax is: LEADING: where the is the the value below which a base will be trimmed. The fourth important module associated with Trimming is, TRAILING This module takes a single argument of threshold quality value. If the last base in the read has a quality value below the set threshold, it will be removed. The basic syntax is: TRAILING: The fifth module associated with trimming the read: CROP With this module in action, bases, regardless of quality, can be trimmed from the end of the read, so that the read, maximally, has a length according to the set-value. CROP: where is the final length of the read after bases have been removed. The sixth module associated with Trimming the read: HEADCROP With this module in action, bases, regardless of their quality, will be trimmed from the beginning of the read, so that the read, maximally, has a length according to the set-value. HEADCROP: where is the number of bases to be removed from the beginning of the read The seventh module is associated with filtering the reads: MINLEN MINLEN takes a single argument. This specifies the minimum length of the read to be able to survive. After trimming, if the length of the read is less than the value stipulated in the module, the read will be dropped. TOPHRED33 This module is associated with the quality of the base. If the quality of the bases are calculated according to the ASCII encoding, TOPHRED33 is a valid module. More about the ILLUMINACLIP module ILLUMINACLIP does the finding and the clipping of the adapter. Technically, the adapter and other sequences like primer can appear in any location within the read, the most common mechanism of adapter contamination is when the sequencer reads a DNA fragment which is less than the length of the read. In this scenario, the read starts with a valid DNA sequence, and since the DNA fragment is shorter, the sequencer continues reading and “reads through” to the adapter of the opposite (reverse) read. This way, an adapter creeps in to the 3’-end of the read. The 5’-end starts with valid DNA sequence. This results in a partial or full adapter sequence in the 3’-end of the read. If the adapter sequence is full, it is easier to identify the adapter and remove it from the read. If it is a partial adapter, it is quite difficult to remove it. 💡 This is how adapters are ligated to the DNA fragments. 💡 How short inserts affect sequencing performance - Illumina Knowledge This picture shows very clearly how adapters are added to the DNA insert. The adapter complex is comprised of P5/P7 sequences, required to bind to the Flow Cell, Sample Indexes, Read1 and Read 2 primers. Now, these adapters are attached to the DNA insert. DNA inserts have Adenine over-hangs, and the adapters have Thymine over-hangs. As you see in the picture, Adenine and Thymine hybridize and the phosphate group binds with the 3’-OH, to form the backbone. Thus we have the sequencing reads. Now, the DNA inserts bind to the flow cell for sequencing. Primer binds to the primer-binding site on the read. How does Trimmomatic trim the adapter sequences In Single End sequencing method, where the DNA insert is sequenced only in one direction (there is no reverse pair), Trimmomatic uses a Simple strategy. Since the adapter sequence is known to us, continuous sections of each adapter, of 16bp length, will be matched with the read, in every possible positions. If there is a perfect match, or a close-match determined by the argument fed to the parameter, the entire alignment between the adapter and the read will be scored. If was fed with an argument of 2 (the default value), even if there is a mismatch in two bases between the “seed” and the read, it will be considered as a perfect match. Now, clipping of the adapter will happen based on the threshold value we feed, as an argument, to the parameter. For every match, a score will be calculated as follows: score = log10(probability that there is a random match of nucleotide base) score = log10(4) ~ 0.602 This, for each match of nucleotide base, the score increases by 0.602. If there is a mis-match, the score reduces by Q/10, where Q is the base-quality, that varies from 0 to 40. Therefore, the penalty value ranges from 0 to 4. The adapter will be trimmed as per the threshold score value we assign to the . If we provide a value of “7”, this is associated with a perfect match of 12 bases. If we want the adapter to be trimmed if there is a perfect match of 25 sequences, the threshold value is “15” This scenario can be understood well below. 💡 Trimmomatic: a flexible trimmer for Illumina sequence data | Bioinformatics | Oxford Academic (oup.com) The advantage of the SimpleClipThreshold method is that, it can detect any technical sequence located anywhere within the read. But as you can see in A and D, when there is only a short partial read-through/ match, detecting the adapter reliably is not possible. When there is a partial read-through at the 3’-end, Palindrome-Match is employed In palindrome mode, the adapter-ligated forward and the reverse reads are aligned, and the algorithm will calculate the score. If there is a matching score of 30, corresponding to a match of 50 bases, the adapters will be trimmed. 💡 Trimmomatic: a flexible trimmer for Illumina sequence data | Bioinformatics | Oxford Academic (oup.com) In the palindrome mode, the adapter-ligated forward and the reverse reads are aligned, and the global score will be calculated. The palindrome mode starts with an alignment of adapters and the start of each read, as you can see in A. If adapters directly joined, with no useful sequence information between them, such reads will be dropped. Then the testing process proceeds by moving the relative position of the reads backwards. When there is a match, both reads will be clipped. Even if the adapter is partial and reaches to the ligated-adapter, the reads will be clipped. The clipping will finish when there is no more over-hanging part that reaches to the adapter.

  • Trimmomatic | Single End sequencing | Working with demo files

    Demo files for Trimmomatic were downloaded from “bioinfogp.cnb.csic.es/files/samples/rnaseq/” The files were downloaded by using the following command: wget https://bioinfogp.cnb.csic.es/files/samples/rnaseq/RNA-Seq_Sample_Files.zip These are paired end-reads. For example, 1M_SRR9336468_1.fastq.gz and 1M_SRR9336468_2.fastq.gzarepaired end reads. Checking the number of reads in the fastq file We need to check the number of sequences in the fastq file. We have 1M_SRR9336468_1.fastq.gz and 1M_SRR9336468_2.fastq.gz. This can be achieved by the following command: zcat your_file.fastq.gz | wc -l | awk '{print $1/4}’ It turned out that there are 1000000 reads in the fastq file. For the purpose of testing, we need a simple fastq file, with a very few sequences. Let us cut down the fastq file to have, let’s say 10 reads in the file. Splitting the fastq file In order to cut down the fastq file, we will use seqkit split2 module. The following command can be used to achieve this task. seqkit split2 -1 1M_SRR9336468_1.fastq.gz -2 1M_SRR9336468_2.fastq.gz -s 10 -O out -f -e .gz This will generate 1 lakh parts of both forward and the reverse reads, containing 10 reads in each. For the sake of this demo-experimentation, we will have only very few fastq files. Now, let’s add forward and reverse adapter sequences to the respective read files. Appending the read files with adapter sequences We will use biopython for the purpose of appending the fastq reads with adapter sequences. Here is the code that we have crafted. from Bio.SeqIO.QualityIO import FastqGeneralIterator """Now we need to define the forward adapter and the reverse adapter""" forAdapter = "TACACTCTTTCCCTACACGACGCTCTTCCGATCT" revAdapter = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT" """forward file path""" forPath = r"D:\website\wixSite\articles\trimmomaticDemo\inFiles\1M_SRR9336468_1.part_001.fastq." """Open the forward read file""" forHandle = open(forPath, "r", encoding="utf-8") """reverse file path""" revPath = r"D:\website\wixSite\articles\trimmomaticDemo\inFiles\1M_SRR9336468_2.part_001.fastq." """open the reverse file """ revHandle = open(revPath, "r", encoding="utf-8") #declare enmpty list forward newForFastq=[] #declare enmpty list reverse newRevFastq=[] """Iterate through each fastq block; declare itervariables to store header, sequence, and the quality""" for header, seq, qual in FastqGeneralIterator(forHandle): seqNewFor = seq.rstrip()+forAdapter newForFastq.append("@%s\n%s\n+\n%s\n" % (header, seqNewFor, qual)) """Iterate through each fastq block; declare itervariables to store header, sequence, and the quality""" for header, seq, qual in FastqGeneralIterator(revHandle): seqNewRev = seq.rstrip()+revAdapter newRevFastq.append("@%s\n%s\n+\n%s\n" % (header, seqNewRev, qual)) forPathOut = r"D:\website\wixSite\articles\trimmomaticDemo\outFiles\1M_SRR9336468_1.part_001.fastq" revPathOut = r"D:\website\wixSite\articles\trimmomaticDemo\outFiles\1M_SRR9336468_2.part_001.fastq" with open(forPathOut, "w")as forOut: for line in newForFastq: forOut.write(line) with open(revPathOut, "w")as revOut: for line in newRevFastq: revOut.write(line) We have added “TACACTCTTTCCCTACACGACGCTCTTCCGATCT” at the end of every sequence in the forward fastq file. And we have added GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT at the end of every sequence in the reverse fastq file. Let us look at the first 10 lines of both forward and the reverse file. Forward ┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/RNA-Seq_Sample_Files/out] └─$ zcat 1M_SRR9336468_1.part_001.fastq.gz | head -n 10 @SRR9336468.1 1/1 NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTC + #AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-F @SRR9336468.2 2/1 NGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCCCTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTGTCAATGTAGGTAC + #AAFFJFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJFFJFJFAJJJJJJJJFFJJJJFJJJJJJJJJJJJAJAJJJJJJJJJJJJFAJJJ7FFJJJF<AFAJFFFJJJJFFFJ-AF @SRR9336468.3 3/1 NGATGACATGTGGAAATATAACAGAGACAGAATATTTAATTTTATGTATGTAAGCATCGATTGGACACCAGGCTTATTGATGACCTTACTCGTCCAATTTGGCACGGACCGCTTTAACTTGCAAGTAGTTTTGTAAAGCATCAACAGACA Reverse ┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/RNA-Seq_Sample_Files/out] └─$ zcat 1M_SRR9336468_2.part_001.fastq.gz | head -n 10 @SRR9336468.1 1/2 NTTGGTATCTACTACAATTCTGGTGAGGTCTGTTGTGCGGGTTCAAGGGTGTATGTTGAAGAATCTATTTACGACAAATTCATTGAAGAGTTCAAAGCCGCTTCTGAATCCATCAAGGTGGGCGACCCATTCGATGAATCTACTTTCCAA + #AAAFJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJJJJJJJJJJJJJJJJFJJJ<AFJ-AF7FJJJJJJJJJ<AFJFJJ-77FFJJJAJFFFFFFFFFFJ7-<<JA7F<-AAFAAFJ-<FJ @SRR9336468.2 2/2 NTCCAAGAGAACCAAGAGATGGTACAAGAATGCCGGTTTGGGATTCAAGACCCCAAAAACCGCTATTGAAGGTTCCTACATTGACAAGAAATGTCCATTCACTGGTTTAGTTTCCATCCGTGGTAAGATCTTGACCGGTACCGTCGTCTC + #AAFFJFJFFJJJJJFJJJJJJJJJJJJJJJFJJJJFJJJJJJJFJJJFFJJJJJFJJJJJJJAJAJJJJJJFJJJJJJJJJJJJJJ-FJFFJJJJJJJFFJFJJJFFJJFFAJ<F7FFFJJ7JJF<JFJFAAAAAA-FFJ)--<<AA<F @SRR9336468.3 3/2 NACACCTCTAATATTAATACCGCCTTAAAAGTGGCTGATAGAGTTAATGCGGGTACGGTCTGGATAAACACTTATAACGATTTCCACCACGCAGTTCCTTTCGGTGGGTTCAATGCATCTGGTTTGGGCAGGGAAATGTCTGTTGATGCT Here, we can see that the length of the forward and the reverse read is 150bp. The forward and the reverse reads are not reverse complimentary to each other. Normally, when there is an adapter read-through, the forward and the reverse reads are complimentary to each other. Let’s try working with a single fastq file, for SE sequencing. Let’s run Trimmomatic. java -jar trimmomatic-0.39.jar SE -trimlog ~/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt \ /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq \ ~/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz \ ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 \ MINLEN:36 So, let us look at the progress of the Trimmomatic. ┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/Trimmomatic/Trimmomatic-0.39] └─$ java -jar trimmomatic-0.39.jar SE -trimlog ~/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt \ /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq \ ~/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz \ ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 \ MINLEN:36 TrimmomaticSE: Started with arguments: -trimlog /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 MINLEN:36 Automatically using 4 threads Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' ILLUMINACLIP: Using 0 prefix pairs, 1 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Quality encoding detected as phred33 Input Reads: 10 Surviving: 10 (100.00%) Dropped: 0 (0.00%) TrimmomaticSE: Completed successfully Here, the report says that all the reads are surviving. No reads have been dropped. trimlog.txt; The log file of all the reads in the fastq file Let us look at the log report. SRR9336468.1 1/1 150 0 150 34 SRR9336468.2 2/1 150 0 150 34 SRR9336468.3 3/1 150 0 150 34 SRR9336468.4 4/1 150 0 150 34 SRR9336468.5 5/1 150 0 150 34 SRR9336468.6 6/1 150 0 150 34 SRR9336468.7 7/1 150 0 150 34 SRR9336468.8 8/1 150 0 150 34 SRR9336468.9 9/1 150 0 150 34 SRR9336468.10 10/1 150 0 150 34 Here, the columns are as follows. the read name the surviving sequence length the location of the first surviving base, aka. the amount trimmed from the start the location of the last surviving base in the original read the amount trimmed from the end The first column contains list of all the reads, prefixed by SRR. The second column contains the length of the surviving reads. The third column contains the position of the fist surviving base. It is zero for all the reads. The fourth column contains the position of the last surviving base. It is 150 for all the reads. The last column contains the amount trimmed from the end of the read. It is 34. Since MINLEN was set to 36 (means, drop the reads if the length of the read falls below 36), no reads were dropped. Setting MINLEN to 151 ┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/Trimmomatic/Trimmomatic-0.39] └─$ java -jar trimmomatic-0.39.jar SE -trimlog ~/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq ~/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 MINLEN:151 TrimmomaticSE: Started with arguments: -trimlog /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 MINLEN:151 Automatically using 4 threads Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' ILLUMINACLIP: Using 0 prefix pairs, 1 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Quality encoding detected as phred33 Input Reads: 10 Surviving: 0 (0.00%) Dropped: 10 (100.00%) TrimmomaticSE: Completed successfully Here, we can see that all the 10 reads have been dropped. SRR9336468.1 1/1 0 0 0 0 SRR9336468.2 2/1 0 0 0 0 SRR9336468.3 3/1 0 0 0 0 SRR9336468.4 4/1 0 0 0 0 SRR9336468.5 5/1 0 0 0 0 SRR9336468.6 6/1 0 0 0 0 SRR9336468.7 7/1 0 0 0 0 SRR9336468.8 8/1 0 0 0 0 SRR9336468.9 9/1 0 0 0 0 SRR9336468.10 10/1 0 0 0 0 Changing the simpleClipThresholdvalue to 21 simpleClipThreshold is a parameter of the ILLUMINACLIP module. So, once a “seed” of 16bp are aligned in every position possible with the read, the program will look for a match between the seed and the read. If a match is established, the full adapter will be aligned with the read and a score will be calculated. Score = log10(probability that there is random match) ~0.602. So, for 34 bases, the score is less than 21 Look at the log report: SRR9336468.1 1/1 184 0 184 0 SRR9336468.2 2/1 184 0 184 0 SRR9336468.3 3/1 184 0 184 0 SRR9336468.4 4/1 184 0 184 0 SRR9336468.5 5/1 184 0 184 0 SRR9336468.6 6/1 184 0 184 0 SRR9336468.7 7/1 184 0 184 0 SRR9336468.8 8/1 184 0 184 0 SRR9336468.9 9/1 184 0 184 0 SRR9336468.10 10/1 184 0 184 0 Here, we can see that no trimming has been done. Changing the simpleClipThresholdvalue to 20 Here, we have 34 sequences, and therefore the maximum permissible score is: 34 x log10(4) =0.602 x 34 =20.468 ~20 Let us look at the log report: SRR9336468.1 1/1 150 0 150 34 SRR9336468.2 2/1 150 0 150 34 SRR9336468.3 3/1 150 0 150 34 SRR9336468.4 4/1 150 0 150 34 SRR9336468.5 5/1 150 0 150 34 SRR9336468.6 6/1 150 0 150 34 SRR9336468.7 7/1 150 0 150 34 SRR9336468.8 8/1 150 0 150 34 SRR9336468.9 9/1 150 0 150 34 SRR9336468.10 10/1 150 0 150 34 See, here we can see that the trimming has been done! What it means is that, if you set 20 as the argument forsimpleClipThreshold , even if there is a couple of mismatches, the corresponding reads won’t be clipped. Changing two bases in the adapter sequence in the read and setting the simpleClipThreshold to 20 @SRR9336468.1 1/1 NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCGG + #AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-FJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJ @SRR9336468.2 2/1 NGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCCCTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTGTCAATGTAGGTACAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC Look at the above fastq sequence. I have purposefully changed two bases in the adapter. This shows that only 32 out of 34 bases score and there are two penalties. So, let’s calculate the score. score = 0.602 x log10(4) =0.602 x 32 =19.264 ~19 So, when we set the simpleClipThreshold value to 20, the read whose adapter has got changes in two bases fail to meet the threshold that was set, therefore the adapter of the said read won’t get trimmed. Let’s look at the log report here. SRR9336468.1 1/1 184 0 184 0 SRR9336468.2 2/1 150 0 150 34 SRR9336468.3 3/1 150 0 150 34 SRR9336468.4 4/1 150 0 150 34 SRR9336468.5 5/1 150 0 150 34 SRR9336468.6 6/1 150 0 150 34 SRR9336468.7 7/1 150 0 150 34 SRR9336468.8 8/1 150 0 150 34 SRR9336468.9 9/1 150 0 150 34 SRR9336468.10 10/1 150 0 150 34 See here, the adapter hasn’t been clipped from the first read. When the read-through adapters are short and partial @SRR9336468.1 1/1 NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTCAGATCGGAAGAGCACACGTCTGAACTCCA + #AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-FJJJJJJJJJJJJJJFFJJJJJJJJJJJJJ @SRR9336468.2 2/1 NGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCCCTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTGTCAATGTAGGTACAGATCGGAAGAGCACACGTCTGAA + #AAFFJFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJFFJFJFAJJJJJJJJFFJJJJFJJJJJJJJJJJJAJAJJJJJJJJJJJJFAJJJ7FFJJJF<AFAJFFFJJJJFFFJ-AFJJJJJJJJJJJJJJFFJJJJJJJJ @SRR9336468.3 3/1 NGATGACATGTGGAAATATAACAGAGACAGAATATTTAATTTTATGTATGTAAGCATCGATTGGACACCAGGCTTATTGATGACCTTACTCGTCCAATTTGGCACGGACCGCTTTAACTTGCAAGTAGTTTTGTAAAGCATCAACAGACAAGATCGGAAGAGCACACGT + #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJFAJJJJFJJJJJJJJJJJFJJJJFJJJFJFJJFAJJJJFFJJJJF<7FJJ-7AFAFFFJAFJJFFJJJFFFFJFJJJJJJJJJJJJJJJJFFJJJ @SRR9336468.4 4/1 NATAGTTATAAAAATATATGTATATAAAGTATGTGAGTGTGTGAGTGTGTGAAAAGGATGTTATTAAATAGTATGTTATTATTTCTTGTCGTATCTGGAGTAGTATTTCTTTGAAATTATACCGTCCAGCGTCAAAAGACCCAGAGCGCCAGATCGGAAGAGCA + #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJFJJJJFJJJJJJJFFJJJ-FJJJJJF-AFJ<7<<JJJJJ<JJJJJFJJJFJJJJJJJJJJJJJJJJ @SRR9336468.5 5/1 NGTTGAGTAACTTGACCTTCACACGTAGTTGCAAGTACCGATGCAAAATTTTTACACTGGTGACAACCTTAATGGTATAATCCAATTAGCGACATCACCGAGAAAGTATCCAAGTCGATTTGATGAATAAAATAATAACTTAAACATGCAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC Here, 5, 10, 15, 20 bases have been clipped from the first, second, third and the fourth fastq files. The simpleClipThreshold is set to 20. This quality score is associated with a perfect match of 34 bases. The following is the log report. SRR9336468.1 1/1 179 0 179 0 SRR9336468.2 2/1 174 0 174 0 SRR9336468.3 3/1 169 0 169 0 SRR9336468.4 4/1 164 0 164 0 SRR9336468.5 5/1 150 0 150 34 Adapters were not clipped from the first four reads. The simpleClipThreshold value was set to 15. This is associated with a perfect match of 25 bases. The log report is: SRR9336468.1 1/1 150 0 150 29 SRR9336468.2 2/1 174 0 174 0 SRR9336468.3 3/1 169 0 169 0 SRR9336468.4 4/1 164 0 164 0 SRR9336468.5 5/1 150 0 150 34 read 1 has 29 bases in the adapter, and it meets a quality score of 15. @SRR9336468.1 1/1 NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTC + #AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-F The simpleClipThreshold value was set to 10. This is associated with a perfect match of 16 bases. Since the first read has 29 bases, the match score will be higher than 10, so the adapter will be clipped. Since the second read has 24 bases, the match score will be still above 10, and therefore will be clipped. The third read has 19 bases, the match score is still above 10, and therefore the adapter will be clipped. The simpleClipThreshold value was set to 1. This is associated with a perfect match of 2 bases. But in such a scenario, there are chances of false positives being generated. Let’s look at the trim log report: SRR9336468.1 1/1 150 0 150 29 SRR9336468.2 2/1 150 0 150 24 SRR9336468.3 3/1 150 0 150 19 SRR9336468.4 4/1 148 0 148 16 SRR9336468.5 5/1 150 0 150 34

  • Mastering YAML with pyYAML - A Pythonic Guide

    This post is about pyYAML library of Python to work with yaml files.

  • Twitter Round
  • b-facebook

© 2035 by Z-Photography. Powered and secured by Wix

PHONE:
+91 9645304093

EMAIL:
thelifesciencehub2023@gmail.com
Subscribe to our website and receive updates!

Thanks for submitting!

Lifescience, C/o Vijithkumar, Dept. of Biotechnology,

Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India, PIN: 627012.

bottom of page