top of page

Search Results

68 items found for ""

  • RepeatMasker with Omics Box; an easy way to mask repetitive sequences

    Repetitive DNA sequences Repetitive DNA sequences are of two different types: low-complexity repeats and transposable elements. What are low-complexity repeats? Low-complexity repeats are homopolymeric runs of nucleotides. For example, it could be a homopolymeric stretch of Adenine (A), Thymine (T), Cytosine (C), or Guanine (G). A homopolymeric run of adenine looks like this: AAAAAAAAA Transposable elements These are DNA sequences that change positions in the genome. Transposable elements are also called jumping genes. Viral genome, long Interspersed Nuclear Element, and short Interspersed Nuclear Element are the major transposable elements. How do repetitive DNA sequences affect the genome assembly? Repetitive DNA sequences can pose problem while performing annotation of the protein-coding genes. When you don’t mask the repetitive sequences, it can generate many spurious BLAST alignments. Unmasked repeats align with similar non-homologous DNA sequences, thus producing non-reliable data. Gene annotation involves describing the location of genes in a genome. If you keep the repetitive DNA sequences unmasked, the gene annotation process can falsely show the evidence for a gene in a location where, in reality, there is no gene, but exists a repetitive element. Transposable elements are repetitive sequences that can move within a genome. These transposable elements have ORFs, which are sequences in genes that code for proteins. Some of the transposable elements’ ORFs resemble the host genes. Gene predictors are software tools that describe the location and structure of genes in a genome. If the transposable elements are left unmasked, the inaccurate gene predictors will falsely represent transposable elements’ ORFs as host genes. How to mask repetitive sequences in a genome? The repetitive sequences in the genome can be masked by using a software called RepeatMasker. What RepeatMasker does is that it scans the DNA sequences for transposable elements, satellites, and low-complexity repeats. There are two outputs. One is a comprehensive report of the annotation of the repeats in the query sequence, that comprises the different types of repetitive DNA elements found in the input DNA. The second output is the modified DNA whose repetitive DNA sequences are masked. By masking, as I mentioned above, the nucleotides of the repetitive elements are transformed into “N” or “X”, or their lowercase version. RepeatMasker also uses Tandem Repeat Finder (TRF) to find the tandem repeats. How does RepeatMasker get information about the repetitive elements RepeatMasker comes with the Dfam database. This Dfam database harbors information on Transposable Element families. As of Dfam version 3.7, there is information on 3,437,876 Transposable element families, spanning 2,306 species. Dfam website contains a collection of multiple sequence alignments. Each sequence alignment contains multiple members of a Transposable element family. By aligning these representative members of a Transposable element family, HMMs and a consensus sequence of Transposable element for each family has been made. Omics Box also works with RepBase which contains representative repetitive sequences from eukaryotic species. How to perform repeat masking using OmicsBox. The Repeat Masking functionality is under the Genome Analysis. Here, we have to select the input DNA sequences in the fasta format. Then we can set the analysis parameters. Setting the search configuration for Repeat Masker First of all, we need to select the search engine to perform the search for repeats. There are two search engines: 1. RMBlast 2. HMMER RMBlast is an NCBI BLAST-compatible version of Repeat Masker. Basically, it works with NCBI BLAST to find the repetitive sequence within the DNA sequence by searching the query sequence against a nucleotide database. What RMBlast does is it aligns the query DNA sequence against the repetitive sequences in databases like Dfam. It calculates alignment scores to find the similarity between the query DNA sequence and the repetitive sequence. Based on the degree of similarity between the query DNA sequence and the repetitive element, the RMBlast identifies the repetitive sequences and provides this information to Repeat Masker. Another search engine for homology search is HMMER. Using homologous sequences can be searched for the query sequence against a database of sequences, using the profile Hidden Markov Model can detect remote homologs as sensitively as possible, using an underlying probability model. HMMER is now as fast as BLAST. uses the nhmmer program to search one or more nucleotide queries against a database. Using the query sequence, search the target database of sequences, and output the top matches. Selection of Repetitive sequence database RepeatMasker works with a number of databases. Repeat Masker works with two databases at present. They are the Dfam database and RepBase database. Dfam is a database of Transposable elements. If it is selected as the database, then it is not necessary to provide a database file. But if you choose RepBase as the database, it is necessary that we need the database file loaded into the wizard in the EMBL format. The library should be RepBase RepeatMasker edition, and it can be downloaded from https://www.girinst.org/server/RepBase/. One can also use the custom option. In the custom option, one should provide a custom library of repetitive sequences to be masked in the query sequence. The library file should contain repetitive sequences in the fasta format. In the repetitive sequence library, the fasta IDs should be formatted as follows: “>repeatname#class/subclass” For example, let's see how to mask the repetitive elements under the subclass AluY. Alu elements are 300 bp and are classified under the Short Interspersed Nuclear Elements (SINE). So, when you create a custom library for an element, let’s say belonging to AluY, then the fasta should be of the format: >Alu#SINE/AluY Here, Alu is the name of the repetitive element. SINE stands for the name of the class to which the repetitive element Alu belongs. AluY is the name of the subclass. Let us look at another example. Let’s imagine that we have done a genome assembly of a plant species, let's call it Chloropicon primus. The consensus sequence of the transposable element, downloaded from the plant repetitive element database, PlantRep (PlantRep) is as follows. >rnd-2_family-71#DNA/MULE-MuDR ( Recon Family Size = 28, Final Multiple Alignment Size = 24 ) GAGGATTGCANAAGAGGGGCGAAGTNCTNCGATCACCAGCACGTCGTCGA NTGGATCGAGGCNAATTCTTAACCAAGAACAATGTCTCACGACGAACCTG TCATCGACCTTCGCTTCCTCTTCCGATTCCTCCTCCTCCTCTCTCCTTTG CTCTTCCTCCTTCGCTCTTTCTCTTCCGGTGGATCTTCTCCCTCCCGCTG AACCATCAAAACGGCCCGAGGCCACGAGGGGCGAACGAGGACGAGGNGAG AGAGAAACGAAGAGGACGGGAGCGCGCGGAGGAGGGAGAGAGAGAGAGAG AAGNGGAGGACAGGAAAGAAGGAAGCGTCGTCTCNTGCTCTTTCGAACGA GCCCTCGCGCGAAAGAAACGACCCAGTGGCGAGGATCTGGCGACGCGAAA CGCGAAGAAGAGAGGCAGAAAGGAATCGAGGAGTAGATCACCGAGGAAGG Specifying the name of the species One can specify the name of the species. This name must be present in the NCBI Taxonomy database. If you use HMMER as the search engine, it uses Dfam database to search for repetitive sequences. In Dfam database, information on repetitive sequences for organisms such as human beings, Caenorhabditis elegans, mouse, zebrafish, fruit fly, and nematode are present. RMBlast option If RMBlast option is used as the search engine, we need to specify an option for the speed/ sensitivity parameter. There are three options: Rush: When you choose Rush, the speed of database search, alignment, repeat masking and annotation will be 4 - 10 faster than the default option. But it will be 10% less sensitive than the default option. Quick: When you choose this option, the process will be 5 - 10% less sensitive, but will be 2 - 5 times faster than the default process. Slow: When you choose this option, the process will be 0 - 5 times more sensitive, but 2 - 3 times slower than the default. Divergence Cut-off Divergence cut-off can be applied. When you turn on this option, the RepeatMasker will mask the repetitive sequences that are less diverged from the consensus than the value we provide as a divergence cut-off. RepBase, PlantBase, Dfam, etc. repeat databases contain consensus repeat sequences. The repetitive elements are likely to have mutations, and they may diverge from the consensus. When you provide a divergence cut-off value, let's say 20%, the RepeatMasker will mask the repetitive sequence if it is less than 20% diverged from the consensus sequence in the database or in the library file. When you run the commandline of the RepeatMasker as follows RepeatMasker [-options] you can also add the -div parameter where you can mention the percentage divergence; the repeats that have been diverged less than the cut-off percentage value will be masked. Output options There is a set of parameters that need to be provided with a value. Masking option: Here, we need to specify how the repetitive sequences should be masked. Bases of the repetitive sequences can be replaced with "N" or "X" or the lowercase version of the respective bases. Only Alu elements: If the query DNA is from a primate, this option can be used. Alu repeats are 300bp long repeats of the class SINE. Type of Repeats Here, we can set the type of repeats that the RepeatMasker should mask. For example, you may set Interspersed repeats, simple repeats, and low-complexity repeats. Output The fasta output file contains the masked sequence in fasta format. The locations of the masked repeats are provided in the GFF format.

  • Pornography Addiction and Dhat Syndrome: A Case Study

    About porn addiction and dhat syndrome. A case study about how porn addiction causes compulsive masturbation and social withdrawal.

  • Will an LVM partition help me achieve an increased storage space? What is the success rate?

    I was working on a genome assembly project when I was caught off guard by an issue: the assembly process was terminated due to insufficient space on the hard disk. I was assembling the linked reads using Universal Sequencing Technology's TELLYSIS pipeline. After consulting with the company, I was informed that I needed at least 1024 GB of free space before starting the assembly. I am using an IBM server installed with Ubuntu distro, version 22.04. This server is a decade old but has recently undergone an upgrade, increasing its memory capacity to an impressive 248 GB. This ten-year-old machine is equipped with eight SAS hard disks, each with a capacity of 300 GB. This theoretically scales up to 8x300 = 2400 GB or 2.4 TB of storage capacity. However, there are slots for 12 physical disks, of which four slots cannot accommodate additional disks due to a lack of adequate SAS connectivity on the motherboard. A quick and technically less demanding solution I considered was to replace one of the existing 300 GB SAS disks with a higher-capacity one. Upon further investigation and communication with local procurement services, I discovered that the SAS disks compatible with my machine are quite rare in the market, and the available ones are rather expensive. As mentioned earlier, there are eight SAS disks, each with a 300 GB capacity, so the total storage capacity should amount to 2400 GB. However, the system's usable storage capacity is only 1800 GB. Later, it became clear that two sets of four hard disks had been configured in a RAID 5 setup. When four hard disks are RAID 5 configured, a pooled storage capacity equivalent to three hard disks is available for use. In the event of disk failure, the fourth disk acts as a backup. One set of four hard disks is configured as Drive 0, and the next set of four disks is similarly configured as Drive 1. This means that when you write a file to a disk, it is not immediately clear which one of the four disks it is on, or in whose clusters the file was registered. Drive 0 has two partitions: /dev/sda1 and /dev/sda2. Similarly, Drive 1 has two partitions: /dev/sdb1 and /dev/sdb2. Both partitions are created with an ext4 filesystem. While /dev/sda2 is a /boot/ partition, /dev/sdb2 is a general partition. The capacity of /dev/sda2 is 835 GB, and that of /dev/sdb2 is 870 GB. In each of these partitions, there is approximately 750 GB of free space, which is less than the free space required for running the genome assembly. To meet the requirement, I need to install an additional hard disk with a storage capacity of more than 1024 GB. Since these partitions belong to two distinct drives, resizing or merging the partitions is not as straightforward as it is in the Windows OS. This is when I came across LVM (Logical Volume Management) technology. Using LVM, multiple physical disks can be merged to create a logical volume partition with the pooled storage capacity of the disks involved. LVM stands for Large Volume Management. With this technology, physical volumes can be created using desired partitions, a volume group can be formed from these physical volumes, and finally, a logical volume of the desired size can be created from the volume group. The storage capacity of the logical volume partition is accounted for by multiple physical disks. However, there is a risk involved. If any of the physical disks that are part of the volume group fail, the entire volume group fails, rendering the data irrecoverable. So, caution should be exercised, and data backup is essential. However I have a backup, and I want to proceed with creating an LVM partition on the existing physical disks in my system. Here's a simplified protocol for the process: 1. Boot your system from an Ubuntu bootable USB drive. 2. Delete the partitions you want to use for creating Physical Volumes (in this case, /dev/sda2 and /dev/sdb2). 3. Create new primary partitions and convert them into "LVM type." 4. Create a Physical Volume using these partitions. 5. Create a volume group involving /dev/sda2 and /dev/sdb2. 6. Create a logical volume partition of the combined desired size based on the storage capacity of /dev/sda2 and /dev/sdb2 within the volume group. 7. Create an ext4 filesystem for the newly created logical volume. 8. Mount the new logical volume partition. I have yet to implement these steps to create the LVM partition, and I have concerns about the success rate given the age of the SAS disks in the system. What do you think? How about I install Linux on a Virtual Machine using VirtualBox? Can I reduce the risk of data loss in case the logical volume somehow gets damaged? Share your thoughts!

  • Article: The Chromium de novo assembly solution.

    The de novo assembly solution Chromium de novo solution helps to generate a diploid assembly of a diploid genome, which means it generates assembly of both alleles. 10x Linked-reads are generated from a single DNA library using the Chromium’s de novo whole genome library preparation process. Is 10x genomics an easy way to get your de novo genome assembly done? Getting a genome assembled - especially of higher animals or the one that contains more repeats - through an assembly process is not very easy. It is a long and arduous process; sometimes, it may involve an additional upstream process like inbreeding to get a good DNA sample. And once the reads are ready, then a bioinformatician has to assemble the reads. In this context, we have 10x genomics, whose laboratory and bioinformatics processes are a turnkey and cookbook. They provide easy to follow instructions on DNA extraction, library preparation, and sequencing. Once the sequencing is completed, 10x genomics’ assembly software Supernova 2.1.0 can be run to assemble the genome. Supernova 2.1.0 does expect the user to be very proficient in the fundamentals of computer programming; all that the software requires is a batch of all fastq files associated with your library. Chromium de novo assembly preparation does not require much DNA sample. For chromium de novo assembly preparation, approximately 1ng of DNA is required. This means, you do not need to perform inbreeding to clonally select samples, so that you can avoid induction of any heterozygosity, or any complication associated with the mixing of wild samples. For library preparation, for organisms of genome size less than 1.6 gb, 10x genomics library preparation protocols recommends a loading mass of 0.6ng of DNA. if genome size is in the range of 1.6 - 3.2 gb, the loading mass of DNA is supposed to be interpolated between 0.6 and 1.2 ng. For genome size between 3.2Gb to 4.0Gb, the required DNA, to be loaded is 1.2 ng. Any genome of size 4.0 Gb is not recommended. Reduced cost of sequencing. Chromium de novo assembly solution costs much less as compared to other sequencing sequencing technologies, especially long-read sequencing technology. Since it requires a very low quantity of DNA (at nanograms level), and no upstream process involved, the cost is less. So, very importantly, the Chromium de novo library is sequenced using Illumina’s low -cost platforms like NovaSeq, HiSeq X, HiSeq 2500, and therefore the sequencing cost is not very exorbitant. Besides the monitory investment, 10x genomics’ linked-reads can be assembled using the Supernova 2.1.0 software package, which is less programmatically intense. How to make Supernova work for your genome of interest? Supernova has been tested on a wide repertoire of organisms. The scope of applicability of Supernova has been characterized by testing on genomes that vary in multitude features. the smallest genome tested using Supernova is of 140Mb size. And, the largest genome is of 3.2Gb size. A genome of size more than 4.0Gb should be considered experimental because it has not yet been tested out using Supernova. While sequencing genome, it is recommended to have a genome coverage within the range of 38 - 56X. If you know the genome size, one can manually calculate the number of reads that correspond to a recommended coverage. It is also recommended to to have reads not more than 2.14 billion. Also, optimally, the read length should be 150bp. For instance, in my case, the total number of reads generated was 378 million, for a genome of size 2.6 Gb - as estimated by Supernova in the preliminary run. The TELL-Seq library was sequenced using the Illumina’s NovaSeq 6000 model, and the Linked reads are of 150bp. From these information, I was able to calculate the genome coverage, and it as around 21X. This is far lower than the recommended lowest coverage of 38X, and Supernova is not recommended for the assembly. Preparation of long, undamaged DNA is important for a good assembly. We need DNA of a single individual, and DNA from clonal population can also be used. While trying to isolate DNA from the clonal population, it is important to note that DNA from wild individuals should not get mixed. An upstream process of inbreeding is not recommended because only few nanograms of DNA is required. Normally, an intact long DNA molecule can be easily made from the sample types of cell lines and blood. Long genomic DNA molecules, resulting from the fragmentation process, are used for the generation of barcoded-linked reads. These long genomic DNA molecules trapped in a Gel EMulsion beads (GEM beads), and the short reads that generate from them are clonally barcoded. The length of the genomic DNA molecule is key here. This dictates the quality of the assembly. If the length of the genomic DNA molecule is less than 50 Kb, then it is problematic. If it is less than 20Kb, it is highly problematic. For example, in our TELL-Seq library preparation, the genomic DNA molecule had an average length of 16 Kb, with per DNA reads are around 3 -4 in number. This is a serious problematic situation, and can result in a not-so-good quality assembly.

  • de novo genome assembly using Supernova 2.1.0

    What is Supernova 2.1.0? Everything starts with an individual DNA source. From this single DNA source, a single whole genome library will be prepared. This single whole-genome library will be subjected to an Illumina sequencing platform like NovaSeq, and linked reads will be generated. Then these Linked-reads need to be assembled to generate the whole genome. This is achieved by the software package called Supernova. The latest version of Supernova is Supernova 2.1.0. What is a key feature of the Supernova package ? The key feature of the Supernova software package is that it creates diploid assemblies. And it can represent maternal and paternal chromosomes separately. Supernova package includes two processing pipelines and one post-processing pipeline. As we have seen in the previous article about the TELLYSIS software package, where there are three processing pipelines, such as Quality control pipeline (TELL-read), phasing pipeline (TELL-sort), and assembly pipeline (TELL-link), in the case of Supernova software package, there are three important pipelines. supernova mkfastq, supernova run, and supernova mkoutput are the important pipelines. supernova mkfastq is for generating the demultiplexed fastq files. Here, the Base Call (BCL) files - including the forward and the reverse reads, and the barcode files from a single flowcell - and the samplesheet (a csv file) will be provided to demultiplex and convert the files to fastq Supernova run is the assembly pipeline that uses the sample-demultiplexed fastq files to assemble the genome. So, it takes the fastq files containing the barcoded reads, and assemble the genome using the graph-based assembly process. The approach is such that it first builds the assembly using the read k-mers, where k is 48. This means, it uses 48-mers from the reads, and builds an assembly. Then resolve this assembly using read pairs to make 200-mers, and finally 1,00,000-mers. Supernova mkoutput is the pipeline that takes the output from the Supernova run and produces different styles of FASTA files, for downstream processes and analysis. Generating the FASTQ files with supernova mkfastq supernova mkfastq is a wrapper around the Illumina’s bcl2fastq. The path to the flowcell directory is provided as an input parameter. Importantly, we need to understand that each flowcell contains lanes, and each lane contains wells. A single whole genome library is deposited in a single well. In each well, there is a lawn of P7 or/ and P5 adapters. If only P7 adapters are there, these adapters are associated with one of four unique oligonucleotide sequences. Every well that contains a whole genome library is studded with adapters and associated four oligonucleotide sequences. These oligonucleotides are unique among themselves and among the whole sets. When you provide the path to the flowcell directory, you also need to provide a information comprising the lanes in the flowcell that contain the libraries, and the name of the well with a set of four oligonucleotide sequences. The information is provided as a comma-delimited series of lane number, name of the sample in each well, and the well name, for supernova mkfastq to pullout the oligonucleotide set used for the whole genome library. All the BCL files sharing the same set of oligonucleotide sequences as provided in the sample sheet, will be sample-demultiplexed. For example, let us look at this picture. In this picture, there are two individual whole genome libraries. Let us imagine that these two whole genome libraries were prepared from two individual DNA sources. These two libraries are fed into two independent wells, either on the same lane or on a different lane of the same flowcell. These wells are studded with the P7 and/ or P5 adapters, with a unique set of oligonucleotide sequences. In the picture, given above, Library1 and Library2 are the two distinct whole genome libraries, prepared from two different DNA sources. these two libraries are fed into an Illumina sequencer, let us call it NovaSeq , and these libraries are now in two independent wells of the same lane or a different lane of the same flowcell Now, the well, where Library1 was fed into is called A1. The A1 well contains P7 and/or P5 adapters with the same set of oligonucleotide sequences as the ones that were used for the Library1 preparation. Similarly, the Library2 was fed into another well, A2. When we provide the BCL files, associated with a flowcell directory, to the supernova mkfastq pipeline, it is supposed to sample-demultiplex the BCL files (read files and the barcode files) on the basis of the sets of sample indices. So, we need to provide a sample-sheet Usually, the samplesheet is a csv file that contains three columns: 1. Lane number of the flowcell 2. Sample/library name 3. The well name, for example SI-GA-A1 (here, A1) is the well name. Seeing the well name, SI-GA-A1, supernova mkfastq automatically pulls out the set of four oligonucleotide sequences, and demultiplex the BCL files by collating all the reads with these sample indices. Finally, these BCL files are converted into FASTQ. Here, the whole genome library, Library1, was loaded into the well (SI-GA-A1) of the lane 1 (of the flowcell); the second whole genome library, Library2, was loaded into the well (SI-GA-A2), of lane 1, and lane 2 (of the same flowcell). Here, the BCL files were demultiplexed into two sets of FASTQ files, because both the libraries were sample-indexed using two distinct sets of sample indices. Arguments and Options to be fed to the parameters of the supernova mkfastq Supernova mkfastq is a wrapper around the bcl2fastq, and therefore the former accepts arguments to parameters of the latter. —run: this is a required parameter. The path to the Illumina’s run folder needs to be provided here. This folder should contain the BCL run files, associated with a flowcell —id: The argument that provide to this parameter becomes the name of the folder that supernova mkfastq creates. This is an optional parameter and this refers to the name of the flowcell. —samplesheet: This is the path to an Illumina Experiment Manager-compatible samplesheet. This samplesheet should contain the following columns: 1. the sample indices set names, such as SI-GA-A1 or SI-GA-A2, depending upon the number of different libraries loaded into the flowcell 2. Sample names (of distinct libraries) 3. Lane number. —csv: Here, you need to provide path to the csv file. This file should contain three columns of information, i.e. column specifying the lanes to be used. For example, 1, 1-2, etc., sample names, depending upon the number of distinct whole genome libraries, Sample index set names for demultiplexing purpose. —lanes: We know that there are lanes in a flowcell, and each lane contains nanowells. if all the lanes have been utilized for sequencing the libraries, but you want only BCL files of certain lanes to be de-multiplexed , then those lanes can be specified in a comma-delimited format; for example 1,2 —output-dir: This is the path to the directory in which supernova mkfastq is going to generate the demultiplexed fastq files. An Example data $ supernova mkfastq --id=tiny-bcl \\ --run=/path/to/tiny_bcl \\ --csv=tiny-bcl-simple-2.1.0.csv supernova mkfastq Copyright (c) 2017 10x Genomics, Inc. All rights reserved. ------------------------------------------------------------------------------- Martian Runtime - 2.1.1-v2.3.3 Running preflight checks (please wait)... 2017-08-09 16:33:54 [runtime] (ready) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET 2017-08-09 16:33:57 [runtime] (split_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET 2017-08-09 16:33:57 [runtime] (run:local) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main 2017-08-09 16:34:00 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET ... here, /path/to/tiny_bcl is the path to the flowcell directory that contains all the BCL files, to be demultiplexed. tiny-bcl-simple-2.1.0.csv is the path to the csv file, that contains three columns of information, comprising the names of the distinct whole genome libraries, lanes associated with the respective libraries, and sample index set names, for demultiplexing the libraries. Lane,Sample,Index 1,test_sample,SI-GA-A3 Assembly pipeline: supernova run Supernova run is the assembly pipeline, a component of the supernova software package. This is used for the whole genome assembly of the 10x genomics linked-reads, sequenced from the chromium prepared whole genome library. That being said, supernova run can also work with TELL-seq linked-reads (addressed in the previous article). What are the conditions that need to be met before proceeding with assembly using supernova run? Supernova should be run using 38 - 56X coverage of the genome. If you can estimate the size of the genome, know the sequence read length, and the total number of reads, it is possible to calculate the coverage. The maximum number of reads that supernova run can handle is not more than 2.14 billion. The maximum genome size that supernova has been tested for is not more than 4 GB. How to set up the supernova run command, with essential parameters. The following are the required parameters. —id: This is a unique run id. This becomes the name of the output folder. —fastqs: This is the path where the demultiplexed fastqs are located —maxreads: This is to specify the number of reads that you need. You can either use all of your reads or specify the number of reads. We need to note that the maximum number of reads that can be used is 2.14 billion. Why do we need to set this parameter? The ideal genome coverage for a good quality assembly is in the range of 38 - 56X, for supernova run pipeline. If the coverage is more than 56X, it is better, but it might become deleterious if it is way off. If the coverage is far less than the the lowest recommended value of 38X, supernova run would terminate. If you have reads more than the cut off value of 2.14 billion, you may set a value (number of reads) corresponding to a coverage of 56X. This requires that you have a prior knowledge about the genome size. Supernova does estimate the genome size in the beginning, and you can calculate the number of reads corresponding to a coverage of 56X. For example, let us say you have estimated the genome size to be at 800Mb. What would be the number of reads for a 56X coverage of this 800Mb genome? Length of reads x number of reads) / (genome size) = coverage Number of reads = (coverage x genome size ) / length of reads → (56 x 800, 000,000) / 150 = 0.298 billion this is less than the cut off value of 2.14 billion. Now, let us say the genome size estimated was at 8 GB. In this condition, a coverage of 56X should correspond to a read number of (56 x 8,000,000,000) / 150 = 2.98 billion. This is much higher than the recommended value of 2.14 billion. In this condition, supernova run would be considering a number of reads of 2.14 billion, which should correspond to a coverage of 40X. Running supernova run After determining these input arguments, call supernova run: $ supernova run --id=sample345 \\ --fastqs=/home/jdoe/runs/HAWT7ADXX/outs/fastq_path

  • Processing the Raw Reads using Trimmomatic; a Polishing Task Before Genome Assembly

    Trimmomatic is performed before genome assembly We have something called raw reads. If it is a paired-end sequencing, we have something called paired-end reads, consisting of an Read1 (R1) and read2 (R2) file. Can we directly assemble and reconstruct the genome from the raw read files? It is generally not advised to perform an assembly directly from these raw read files. These read files also contain low-quality reads that can not reliably be used to reconstruct the genome. These raw read files often are contaminated, therefore need to be subjected to a quality check up, before the assembly process. What is the major source of contamination? The major source of contamination is the sequencing adapters. Primer sequences can also be present within the reads. These adapter/ primer sequence-containing reads need to be processed for obtaining the valid read sequences. A FastQC analysis will generate a report, bearing information about the presence / absence of adapters in the R1 and R2 files. 💡 Here, you can see the adapter-ligated DNA fragment. In a SE sequencing, this blue-adapter-containing sequence will be read. In Paired-End sequencing, there is another round of cluster-generation. The second adapter-containing DNA fragment will be sequenced. How do adapter sequences get into the reads? After the library preparation, the library sequences will be loaded onto an Illumina Flow cell, where the adapter-ligated gDNA sequences (in case of de novo assembly) or RNA fragments (for transcriptome analysis) will be PCR-amplified, and further read. The library sequences will be read for a fixed number of cycles. In the case of Paired-end sequencing, a sequence will be read in the forward and reverse direction, for, let’s say, 150 cycles. This will result in the generation of reads of 150 bases. 💡 Here, the blue fragments, attached to the both ends of the insert, are called adapters. We can call them Read1 adapter and Read2 adapter. The Read1 adapter contains the Read1 primer-binding site, and the Read2 adapter contains the Read2 primer-binding site. These fragments will be read/ sequenced for a fixed number of cycles, in the forward and reverse direction. Sometimes there can be an inner distance that is left un-sequenced. Broadly speaking, this can happen when the read length (the number of read cycles) is less than the length of the DNA fragment/ insert. When there is shorter fragment, shorter than the length of the number of read cycles, the sequencing process can over-shoot to the adapter segment of the reverse read. This adapter contamination is called “adapter read-through”. This “adapter read-through” can result in the inclusion of either the full adapter or partial adapter being present in the read. How to remove adapters or other Illumina technical sequences like primers? It is very important to remove adapters, over represented sequences and other technical sequences like primers, from the reads. There are multitude of ways by which we can remove contaminating sequences. The popular bioinformatic pipelines that can potentially remove adapters are Trimmomatic, Trim_galore, fastp, etc. Removing adapters using Trimmomatic Adapters and other technical sequences can be removed using the Trimmomatic. The basic syntax of the Trimmomatic command is as follows. java -jar PE [-threads ] >] [-basein | ] [-baseout | The actual Trimmomatic program is a .jar file, and this needs JAVA to be installed in the machine. PE flag specifies that the read files are from a paired-end sequencing. -threads specifies the number of threads in a multi-core processor. -phred33 specifies the quality encoding of the base in the fastq nucleotide sequences. According to phred33 quality encoding, the quality score of each nucleotide base in the read is represented using ASCII encoding system. Normally, for each nucleotide base, the phred score is calculated as follows: phred = -10*log(probability that a base is wrong). This phred score is converted into an ASCII character via adding 33 to each phredscore value. -trimlog specifies a log file in text format. This log file contains information about the trimmed reads. The log file contains the following information: The read name The surviving sequence length. The amount of sequence trimmed from the start. The amount trimmed from the end. Next we need to specify the two input files. The first one is the forward fastq file, and the second one is the reverse fastq file. You may explicitly specify the fastq files. Otherwise, you may enter the name of the forward read followed by -basein flag, and the reverse read will be detected automatically. For example, if the name of the forward read file is Sample_Name_R1_001.fq.gz, the reverse file Sample_Name_R2_001.fq.gz will be detected automatically. If the forward file is Sample_Name.f.fastq, the program will look for Sample_Name.r.fastq for reverse file. Next, we need to specify the output files. There are four output files we need to specify. For paired forward read For unpaired forward read For paired reverse reads For unpaired reverse reads. The paired forward reads contain all the surviving forward reads. The unpaired forward reads contain all the reads whose partner reads were dropped. One may specify all the output files. Otherwise, a base-name can be provided following the --baseout flag, and the names of all output files will be created. For example, one may provide a base-name, mySampleFiltered.fq.gz, and the program will generate the output files, following a particular nomenclature pattern. o mySampleFiltered_1P.fq.gz - for paired forward reads o mySampleFiltered_1U.fq.gz - for unpaired forward reads o mySampleFiltered_2P.fq.gz - for paired reverse reads o mySampleFiltered_2U.fq.gz - for unpaired reverse reads All these parameters are called basic parameters, that are associated with setting up the work environment. Now we need to deal with the trimming modules. The first trimming module is called ILLUMINACLIP. This module accepts various parameters that are associated with finding and trimming the adapters. ILLUMINACLIP:::: Here, the parameters are: fastaWithAdaptersEtc: Here we need to specify the path to the fasta file containing the adapter sequence, primer sequence etc. seedMismatches: This is for a simple match strategy in Trimmomatic. Here, we need to specify the number of maximum mismatch count that would be taken into consideration as a full match. A higher value would result in the generation of more false positive data. palindromeClipThreshold: Here we need to specify how accurate the match between the two adapter-ligated reads must be for PE palindrome read alignment. simpleClipThreshold: Here, we need to specify how accurate the match between the adapter and the read must be. ILLUMINACLIP takes two additional parameters also. ILLUMINACLIP:::::: minAdapterLength: Here, we need to specify the length of the minimum length of the adapter, in which case the adapter will be removed. This is in addition to the palindromeClipThreshold value. The default value is 8. This means that as long as the minimum length of the adapter is 8 or greater, the adapter read-through will be trimmed. This helps to remove shorter adapter fragments. keepBothReads: This parameter applies to the palindrome match strategy. When an adapter read-through is found and the adapter removed, then the reverse read is the same as the forward one, except the fact that the latter is a reverse compliment of the former. In such a scenario, the reverse read will be dropped by the system. If keepBothReads is set to True, the reverse read will also be retained. The second module associated with Trimming is called SLIDINGWINDOW Here, a moving window of multiple-bases will be assessed for the average base quality. If the average base quality falls below a certain threshold, the sequence will be removed. So, even if the single base has a low quality, and the average base quality is still above the threshold, the sequence won’t be trimmed. The basic syntax of the SLIDINGWINDOW is: SLIDINGWINDOW:: Here, the is the number of bases to average across. is the threshold quality value The third module associated with Trimming: LEADING This module trims the base with a quality value below a certain threshold. The basic syntax is: LEADING: where the is the the value below which a base will be trimmed. The fourth important module associated with Trimming is, TRAILING This module takes a single argument of threshold quality value. If the last base in the read has a quality value below the set threshold, it will be removed. The basic syntax is: TRAILING: The fifth module associated with trimming the read: CROP With this module in action, bases, regardless of quality, can be trimmed from the end of the read, so that the read, maximally, has a length according to the set-value. CROP: where is the final length of the read after bases have been removed. The sixth module associated with Trimming the read: HEADCROP With this module in action, bases, regardless of their quality, will be trimmed from the beginning of the read, so that the read, maximally, has a length according to the set-value. HEADCROP: where is the number of bases to be removed from the beginning of the read The seventh module is associated with filtering the reads: MINLEN MINLEN takes a single argument. This specifies the minimum length of the read to be able to survive. After trimming, if the length of the read is less than the value stipulated in the module, the read will be dropped. TOPHRED33 This module is associated with the quality of the base. If the quality of the bases are calculated according to the ASCII encoding, TOPHRED33 is a valid module. More about the ILLUMINACLIP module ILLUMINACLIP does the finding and the clipping of the adapter. Technically, the adapter and other sequences like primer can appear in any location within the read, the most common mechanism of adapter contamination is when the sequencer reads a DNA fragment which is less than the length of the read. In this scenario, the read starts with a valid DNA sequence, and since the DNA fragment is shorter, the sequencer continues reading and “reads through” to the adapter of the opposite (reverse) read. This way, an adapter creeps in to the 3’-end of the read. The 5’-end starts with valid DNA sequence. This results in a partial or full adapter sequence in the 3’-end of the read. If the adapter sequence is full, it is easier to identify the adapter and remove it from the read. If it is a partial adapter, it is quite difficult to remove it. 💡 This is how adapters are ligated to the DNA fragments. 💡 How short inserts affect sequencing performance - Illumina Knowledge This picture shows very clearly how adapters are added to the DNA insert. The adapter complex is comprised of P5/P7 sequences, required to bind to the Flow Cell, Sample Indexes, Read1 and Read 2 primers. Now, these adapters are attached to the DNA insert. DNA inserts have Adenine over-hangs, and the adapters have Thymine over-hangs. As you see in the picture, Adenine and Thymine hybridize and the phosphate group binds with the 3’-OH, to form the backbone. Thus we have the sequencing reads. Now, the DNA inserts bind to the flow cell for sequencing. Primer binds to the primer-binding site on the read. How does Trimmomatic trim the adapter sequences In Single End sequencing method, where the DNA insert is sequenced only in one direction (there is no reverse pair), Trimmomatic uses a Simple strategy. Since the adapter sequence is known to us, continuous sections of each adapter, of 16bp length, will be matched with the read, in every possible positions. If there is a perfect match, or a close-match determined by the argument fed to the parameter, the entire alignment between the adapter and the read will be scored. If was fed with an argument of 2 (the default value), even if there is a mismatch in two bases between the “seed” and the read, it will be considered as a perfect match. Now, clipping of the adapter will happen based on the threshold value we feed, as an argument, to the parameter. For every match, a score will be calculated as follows: score = log10(probability that there is a random match of nucleotide base) score = log10(4) ~ 0.602 This, for each match of nucleotide base, the score increases by 0.602. If there is a mis-match, the score reduces by Q/10, where Q is the base-quality, that varies from 0 to 40. Therefore, the penalty value ranges from 0 to 4. The adapter will be trimmed as per the threshold score value we assign to the . If we provide a value of “7”, this is associated with a perfect match of 12 bases. If we want the adapter to be trimmed if there is a perfect match of 25 sequences, the threshold value is “15” This scenario can be understood well below. 💡 Trimmomatic: a flexible trimmer for Illumina sequence data | Bioinformatics | Oxford Academic (oup.com) The advantage of the SimpleClipThreshold method is that, it can detect any technical sequence located anywhere within the read. But as you can see in A and D, when there is only a short partial read-through/ match, detecting the adapter reliably is not possible. When there is a partial read-through at the 3’-end, Palindrome-Match is employed In palindrome mode, the adapter-ligated forward and the reverse reads are aligned, and the algorithm will calculate the score. If there is a matching score of 30, corresponding to a match of 50 bases, the adapters will be trimmed. 💡 Trimmomatic: a flexible trimmer for Illumina sequence data | Bioinformatics | Oxford Academic (oup.com) In the palindrome mode, the adapter-ligated forward and the reverse reads are aligned, and the global score will be calculated. The palindrome mode starts with an alignment of adapters and the start of each read, as you can see in A. If adapters directly joined, with no useful sequence information between them, such reads will be dropped. Then the testing process proceeds by moving the relative position of the reads backwards. When there is a match, both reads will be clipped. Even if the adapter is partial and reaches to the ligated-adapter, the reads will be clipped. The clipping will finish when there is no more over-hanging part that reaches to the adapter.

  • Trimmomatic | Single End sequencing | Working with demo files

    Demo files for Trimmomatic were downloaded from “bioinfogp.cnb.csic.es/files/samples/rnaseq/” The files were downloaded by using the following command: wget https://bioinfogp.cnb.csic.es/files/samples/rnaseq/RNA-Seq_Sample_Files.zip These are paired end-reads. For example, 1M_SRR9336468_1.fastq.gz and 1M_SRR9336468_2.fastq.gzarepaired end reads. Checking the number of reads in the fastq file We need to check the number of sequences in the fastq file. We have 1M_SRR9336468_1.fastq.gz and 1M_SRR9336468_2.fastq.gz. This can be achieved by the following command: zcat your_file.fastq.gz | wc -l | awk '{print $1/4}’ It turned out that there are 1000000 reads in the fastq file. For the purpose of testing, we need a simple fastq file, with a very few sequences. Let us cut down the fastq file to have, let’s say 10 reads in the file. Splitting the fastq file In order to cut down the fastq file, we will use seqkit split2 module. The following command can be used to achieve this task. seqkit split2 -1 1M_SRR9336468_1.fastq.gz -2 1M_SRR9336468_2.fastq.gz -s 10 -O out -f -e .gz This will generate 1 lakh parts of both forward and the reverse reads, containing 10 reads in each. For the sake of this demo-experimentation, we will have only very few fastq files. Now, let’s add forward and reverse adapter sequences to the respective read files. Appending the read files with adapter sequences We will use biopython for the purpose of appending the fastq reads with adapter sequences. Here is the code that we have crafted. from Bio.SeqIO.QualityIO import FastqGeneralIterator """Now we need to define the forward adapter and the reverse adapter""" forAdapter = "TACACTCTTTCCCTACACGACGCTCTTCCGATCT" revAdapter = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT" """forward file path""" forPath = r"D:\website\wixSite\articles\trimmomaticDemo\inFiles\1M_SRR9336468_1.part_001.fastq." """Open the forward read file""" forHandle = open(forPath, "r", encoding="utf-8") """reverse file path""" revPath = r"D:\website\wixSite\articles\trimmomaticDemo\inFiles\1M_SRR9336468_2.part_001.fastq." """open the reverse file """ revHandle = open(revPath, "r", encoding="utf-8") #declare enmpty list forward newForFastq=[] #declare enmpty list reverse newRevFastq=[] """Iterate through each fastq block; declare itervariables to store header, sequence, and the quality""" for header, seq, qual in FastqGeneralIterator(forHandle): seqNewFor = seq.rstrip()+forAdapter newForFastq.append("@%s\n%s\n+\n%s\n" % (header, seqNewFor, qual)) """Iterate through each fastq block; declare itervariables to store header, sequence, and the quality""" for header, seq, qual in FastqGeneralIterator(revHandle): seqNewRev = seq.rstrip()+revAdapter newRevFastq.append("@%s\n%s\n+\n%s\n" % (header, seqNewRev, qual)) forPathOut = r"D:\website\wixSite\articles\trimmomaticDemo\outFiles\1M_SRR9336468_1.part_001.fastq" revPathOut = r"D:\website\wixSite\articles\trimmomaticDemo\outFiles\1M_SRR9336468_2.part_001.fastq" with open(forPathOut, "w")as forOut: for line in newForFastq: forOut.write(line) with open(revPathOut, "w")as revOut: for line in newRevFastq: revOut.write(line) We have added “TACACTCTTTCCCTACACGACGCTCTTCCGATCT” at the end of every sequence in the forward fastq file. And we have added GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT at the end of every sequence in the reverse fastq file. Let us look at the first 10 lines of both forward and the reverse file. Forward ┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/RNA-Seq_Sample_Files/out] └─$ zcat 1M_SRR9336468_1.part_001.fastq.gz | head -n 10 @SRR9336468.1 1/1 NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTC + #AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-F @SRR9336468.2 2/1 NGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCCCTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTGTCAATGTAGGTAC + #AAFFJFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJFFJFJFAJJJJJJJJFFJJJJFJJJJJJJJJJJJAJAJJJJJJJJJJJJFAJJJ7FFJJJF<AFAJFFFJJJJFFFJ-AF @SRR9336468.3 3/1 NGATGACATGTGGAAATATAACAGAGACAGAATATTTAATTTTATGTATGTAAGCATCGATTGGACACCAGGCTTATTGATGACCTTACTCGTCCAATTTGGCACGGACCGCTTTAACTTGCAAGTAGTTTTGTAAAGCATCAACAGACA Reverse ┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/RNA-Seq_Sample_Files/out] └─$ zcat 1M_SRR9336468_2.part_001.fastq.gz | head -n 10 @SRR9336468.1 1/2 NTTGGTATCTACTACAATTCTGGTGAGGTCTGTTGTGCGGGTTCAAGGGTGTATGTTGAAGAATCTATTTACGACAAATTCATTGAAGAGTTCAAAGCCGCTTCTGAATCCATCAAGGTGGGCGACCCATTCGATGAATCTACTTTCCAA + #AAAFJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJJJJJJJJJJJJJJJJFJJJ<AFJ-AF7FJJJJJJJJJ<AFJFJJ-77FFJJJAJFFFFFFFFFFJ7-<<JA7F<-AAFAAFJ-<FJ @SRR9336468.2 2/2 NTCCAAGAGAACCAAGAGATGGTACAAGAATGCCGGTTTGGGATTCAAGACCCCAAAAACCGCTATTGAAGGTTCCTACATTGACAAGAAATGTCCATTCACTGGTTTAGTTTCCATCCGTGGTAAGATCTTGACCGGTACCGTCGTCTC + #AAFFJFJFFJJJJJFJJJJJJJJJJJJJJJFJJJJFJJJJJJJFJJJFFJJJJJFJJJJJJJAJAJJJJJJFJJJJJJJJJJJJJJ-FJFFJJJJJJJFFJFJJJFFJJFFAJ<F7FFFJJ7JJF<JFJFAAAAAA-FFJ)--<<AA<F @SRR9336468.3 3/2 NACACCTCTAATATTAATACCGCCTTAAAAGTGGCTGATAGAGTTAATGCGGGTACGGTCTGGATAAACACTTATAACGATTTCCACCACGCAGTTCCTTTCGGTGGGTTCAATGCATCTGGTTTGGGCAGGGAAATGTCTGTTGATGCT Here, we can see that the length of the forward and the reverse read is 150bp. The forward and the reverse reads are not reverse complimentary to each other. Normally, when there is an adapter read-through, the forward and the reverse reads are complimentary to each other. Let’s try working with a single fastq file, for SE sequencing. Let’s run Trimmomatic. java -jar trimmomatic-0.39.jar SE -trimlog ~/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt \ /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq \ ~/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz \ ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 \ MINLEN:36 So, let us look at the progress of the Trimmomatic. ┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/Trimmomatic/Trimmomatic-0.39] └─$ java -jar trimmomatic-0.39.jar SE -trimlog ~/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt \ /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq \ ~/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz \ ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 \ MINLEN:36 TrimmomaticSE: Started with arguments: -trimlog /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 MINLEN:36 Automatically using 4 threads Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' ILLUMINACLIP: Using 0 prefix pairs, 1 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Quality encoding detected as phred33 Input Reads: 10 Surviving: 10 (100.00%) Dropped: 0 (0.00%) TrimmomaticSE: Completed successfully Here, the report says that all the reads are surviving. No reads have been dropped. trimlog.txt; The log file of all the reads in the fastq file Let us look at the log report. SRR9336468.1 1/1 150 0 150 34 SRR9336468.2 2/1 150 0 150 34 SRR9336468.3 3/1 150 0 150 34 SRR9336468.4 4/1 150 0 150 34 SRR9336468.5 5/1 150 0 150 34 SRR9336468.6 6/1 150 0 150 34 SRR9336468.7 7/1 150 0 150 34 SRR9336468.8 8/1 150 0 150 34 SRR9336468.9 9/1 150 0 150 34 SRR9336468.10 10/1 150 0 150 34 Here, the columns are as follows. the read name the surviving sequence length the location of the first surviving base, aka. the amount trimmed from the start the location of the last surviving base in the original read the amount trimmed from the end The first column contains list of all the reads, prefixed by SRR. The second column contains the length of the surviving reads. The third column contains the position of the fist surviving base. It is zero for all the reads. The fourth column contains the position of the last surviving base. It is 150 for all the reads. The last column contains the amount trimmed from the end of the read. It is 34. Since MINLEN was set to 36 (means, drop the reads if the length of the read falls below 36), no reads were dropped. Setting MINLEN to 151 ┌──(vijithkumar㉿LAPTOP-2MN1I195)-[~/nextGenSeq/Trimmomatic/Trimmomatic-0.39] └─$ java -jar trimmomatic-0.39.jar SE -trimlog ~/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq ~/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 MINLEN:151 TrimmomaticSE: Started with arguments: -trimlog /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/Trimlog/trimlogFile.txt /mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/1M_SRR9336468_1.part_001.fastq /home/vijithkumar/nextGenSeq/TrimmomaticOutput/singleEnd1/lane1_forward.fq.gz ILLUMINACLIP:/mnt/d/website/wixSite/articles/trimmomaticDemo/outFiles/TruSeq3.fa:2:30:10 MINLEN:151 Automatically using 4 threads Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC' ILLUMINACLIP: Using 0 prefix pairs, 1 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Quality encoding detected as phred33 Input Reads: 10 Surviving: 0 (0.00%) Dropped: 10 (100.00%) TrimmomaticSE: Completed successfully Here, we can see that all the 10 reads have been dropped. SRR9336468.1 1/1 0 0 0 0 SRR9336468.2 2/1 0 0 0 0 SRR9336468.3 3/1 0 0 0 0 SRR9336468.4 4/1 0 0 0 0 SRR9336468.5 5/1 0 0 0 0 SRR9336468.6 6/1 0 0 0 0 SRR9336468.7 7/1 0 0 0 0 SRR9336468.8 8/1 0 0 0 0 SRR9336468.9 9/1 0 0 0 0 SRR9336468.10 10/1 0 0 0 0 Changing the simpleClipThresholdvalue to 21 simpleClipThreshold is a parameter of the ILLUMINACLIP module. So, once a “seed” of 16bp are aligned in every position possible with the read, the program will look for a match between the seed and the read. If a match is established, the full adapter will be aligned with the read and a score will be calculated. Score = log10(probability that there is random match) ~0.602. So, for 34 bases, the score is less than 21 Look at the log report: SRR9336468.1 1/1 184 0 184 0 SRR9336468.2 2/1 184 0 184 0 SRR9336468.3 3/1 184 0 184 0 SRR9336468.4 4/1 184 0 184 0 SRR9336468.5 5/1 184 0 184 0 SRR9336468.6 6/1 184 0 184 0 SRR9336468.7 7/1 184 0 184 0 SRR9336468.8 8/1 184 0 184 0 SRR9336468.9 9/1 184 0 184 0 SRR9336468.10 10/1 184 0 184 0 Here, we can see that no trimming has been done. Changing the simpleClipThresholdvalue to 20 Here, we have 34 sequences, and therefore the maximum permissible score is: 34 x log10(4) =0.602 x 34 =20.468 ~20 Let us look at the log report: SRR9336468.1 1/1 150 0 150 34 SRR9336468.2 2/1 150 0 150 34 SRR9336468.3 3/1 150 0 150 34 SRR9336468.4 4/1 150 0 150 34 SRR9336468.5 5/1 150 0 150 34 SRR9336468.6 6/1 150 0 150 34 SRR9336468.7 7/1 150 0 150 34 SRR9336468.8 8/1 150 0 150 34 SRR9336468.9 9/1 150 0 150 34 SRR9336468.10 10/1 150 0 150 34 See, here we can see that the trimming has been done! What it means is that, if you set 20 as the argument forsimpleClipThreshold , even if there is a couple of mismatches, the corresponding reads won’t be clipped. Changing two bases in the adapter sequence in the read and setting the simpleClipThreshold to 20 @SRR9336468.1 1/1 NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTCAGATCGGAAGAGCACACGTCTGAACTCCAGTCGG + #AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-FJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJ @SRR9336468.2 2/1 NGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCCCTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTGTCAATGTAGGTACAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC Look at the above fastq sequence. I have purposefully changed two bases in the adapter. This shows that only 32 out of 34 bases score and there are two penalties. So, let’s calculate the score. score = 0.602 x log10(4) =0.602 x 32 =19.264 ~19 So, when we set the simpleClipThreshold value to 20, the read whose adapter has got changes in two bases fail to meet the threshold that was set, therefore the adapter of the said read won’t get trimmed. Let’s look at the log report here. SRR9336468.1 1/1 184 0 184 0 SRR9336468.2 2/1 150 0 150 34 SRR9336468.3 3/1 150 0 150 34 SRR9336468.4 4/1 150 0 150 34 SRR9336468.5 5/1 150 0 150 34 SRR9336468.6 6/1 150 0 150 34 SRR9336468.7 7/1 150 0 150 34 SRR9336468.8 8/1 150 0 150 34 SRR9336468.9 9/1 150 0 150 34 SRR9336468.10 10/1 150 0 150 34 See here, the adapter hasn’t been clipped from the first read. When the read-through adapters are short and partial @SRR9336468.1 1/1 NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTCAGATCGGAAGAGCACACGTCTGAACTCCA + #AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-FJJJJJJJJJJJJJJFFJJJJJJJJJJJJJ @SRR9336468.2 2/1 NGTATCTGTTGTACTTTGGAATGTAATGCAAGTAAGCCCTTCTGATGACAATGGTACGGTGCATCTTGGTGGAGACGACGGTACCGGTCAAGATCTTACCACGGATGGAAACTAAACCAGTGAATGGACATTTCTTGTCAATGTAGGTACAGATCGGAAGAGCACACGTCTGAA + #AAFFJFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJFFJFJFAJJJJJJJJFFJJJJFJJJJJJJJJJJJAJAJJJJJJJJJJJJFAJJJ7FFJJJF<AFAJFFFJJJJFFFJ-AFJJJJJJJJJJJJJJFFJJJJJJJJ @SRR9336468.3 3/1 NGATGACATGTGGAAATATAACAGAGACAGAATATTTAATTTTATGTATGTAAGCATCGATTGGACACCAGGCTTATTGATGACCTTACTCGTCCAATTTGGCACGGACCGCTTTAACTTGCAAGTAGTTTTGTAAAGCATCAACAGACAAGATCGGAAGAGCACACGT + #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJFAJJJJFJJJJJJJJJJJFJJJJFJJJFJFJJFAJJJJFFJJJJF<7FJJ-7AFAFFFJAFJJFFJJJFFFFJFJJJJJJJJJJJJJJJJFFJJJ @SRR9336468.4 4/1 NATAGTTATAAAAATATATGTATATAAAGTATGTGAGTGTGTGAGTGTGTGAAAAGGATGTTATTAAATAGTATGTTATTATTTCTTGTCGTATCTGGAGTAGTATTTCTTTGAAATTATACCGTCCAGCGTCAAAAGACCCAGAGCGCCAGATCGGAAGAGCA + #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJFJJJJFJJJJJJJFFJJJ-FJJJJJF-AFJ<7<<JJJJJ<JJJJJFJJJFJJJJJJJJJJJJJJJJ @SRR9336468.5 5/1 NGTTGAGTAACTTGACCTTCACACGTAGTTGCAAGTACCGATGCAAAATTTTTACACTGGTGACAACCTTAATGGTATAATCCAATTAGCGACATCACCGAGAAAGTATCCAAGTCGATTTGATGAATAAAATAATAACTTAAACATGCAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC Here, 5, 10, 15, 20 bases have been clipped from the first, second, third and the fourth fastq files. The simpleClipThreshold is set to 20. This quality score is associated with a perfect match of 34 bases. The following is the log report. SRR9336468.1 1/1 179 0 179 0 SRR9336468.2 2/1 174 0 174 0 SRR9336468.3 3/1 169 0 169 0 SRR9336468.4 4/1 164 0 164 0 SRR9336468.5 5/1 150 0 150 34 Adapters were not clipped from the first four reads. The simpleClipThreshold value was set to 15. This is associated with a perfect match of 25 bases. The log report is: SRR9336468.1 1/1 150 0 150 29 SRR9336468.2 2/1 174 0 174 0 SRR9336468.3 3/1 169 0 169 0 SRR9336468.4 4/1 164 0 164 0 SRR9336468.5 5/1 150 0 150 34 read 1 has 29 bases in the adapter, and it meets a quality score of 15. @SRR9336468.1 1/1 NTGCATTTGAGAGGTTTGTGCACCTTGGAAAGTAGATTCATCGAATGGGTCGCCCACCTTGATGGATTCAGAAGCGGCTTTGAACTCTTCAATGAATTTGTCGTAAATAGATTCTTCAACATACACCCTCGAACCCGCACAACAGACCTC + #AAAFFAJJJJJJAAJJJ7FJJJJFJJJJJJJJJJJJ<FJJJJJJJJJJFJAJJJFJJ7<FJJJJJJJJFJJJFJJJJF7FFJJJJJJJJFFFAJJFFFFFFF-FJFAFFJ-7J7AFJFJJJFJFJJJJ--7FFAJ-F<JJJJJAFJ7-F The simpleClipThreshold value was set to 10. This is associated with a perfect match of 16 bases. Since the first read has 29 bases, the match score will be higher than 10, so the adapter will be clipped. Since the second read has 24 bases, the match score will be still above 10, and therefore will be clipped. The third read has 19 bases, the match score is still above 10, and therefore the adapter will be clipped. The simpleClipThreshold value was set to 1. This is associated with a perfect match of 2 bases. But in such a scenario, there are chances of false positives being generated. Let’s look at the trim log report: SRR9336468.1 1/1 150 0 150 29 SRR9336468.2 2/1 150 0 150 24 SRR9336468.3 3/1 150 0 150 19 SRR9336468.4 4/1 148 0 148 16 SRR9336468.5 5/1 150 0 150 34

  • Mastering YAML with pyYAML - A Pythonic Guide

    This post is about pyYAML library of Python to work with yaml files.

bottom of page