RepeatMasker with Omics Box; an easy way to mask repetitive sequences

Vijithkumar V
Oct 16, 2023

Repetitive DNA sequences

Repetitive DNA sequences are of two different types: low-complexity repeats and transposable elements.

What are low-complexity repeats?

Low-complexity repeats are homopolymeric runs of nucleotides. For example, it could be a homopolymeric stretch of Adenine (A), Thymine (T), Cytosine (C), or Guanine (G). A homopolymeric run of adenine looks like this: AAAAAAAAA
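To make the idea concrete, here is a minimal Python sketch that locates homopolymeric runs in a DNA string. The function name and the `min_length` threshold are assumptions made for illustration; this is not how RepeatMasker itself detects low-complexity regions.

```python
import re

def find_homopolymer_runs(sequence, min_length=6):
    """Return (start, end, base) for every run of a single repeated base.

    Coordinates are 0-based and end-exclusive. min_length is an
    illustrative threshold, not a RepeatMasker setting.
    """
    # ([ACGT]) captures one base; \1{n,} requires at least n more copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1), re.IGNORECASE)
    return [(m.start(), m.end(), m.group(1).upper())
            for m in pattern.finditer(sequence)]

# A toy sequence with a run of nine A's and a run of seven T's.
toy = "GATTAC" + "A" * 9 + "GCTCG" + "T" * 7 + "ACG"
runs = find_homopolymer_runs(toy)
for start, end, base in runs:
    print(f"{base} x {end - start} at positions {start}-{end}")
```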

Transposable elements

These are DNA sequences that can change position in the genome, which is why transposable elements are also called jumping genes. Major examples are Long Interspersed Nuclear Elements (LINEs), Short Interspersed Nuclear Elements (SINEs), and sequences of viral origin.

How do repetitive DNA sequences affect the genome assembly?

Repetitive DNA sequences can pose problems when annotating protein-coding genes. If you don’t mask the repetitive sequences, they can generate many spurious BLAST alignments: unmasked repeats align with similar but non-homologous DNA sequences, producing unreliable data. Gene annotation involves describing the location of genes in a genome. If you leave the repetitive DNA sequences unmasked, the annotation process can falsely show evidence for a gene at a location where, in reality, there is no gene but a repetitive element.

Transposable elements are repetitive sequences that can move within a genome. Many of them contain open reading frames (ORFs), which are stretches of sequence that can code for proteins. Some transposable-element ORFs resemble host genes. Gene predictors are software tools that describe the location and structure of genes in a genome. If transposable elements are left unmasked, gene predictors can wrongly report their ORFs as host genes.

How to mask repetitive sequences in a genome?

The repetitive sequences in a genome can be masked with a software tool called RepeatMasker. RepeatMasker scans DNA sequences for transposable elements, satellites, and low-complexity repeats. It produces two outputs. The first is a detailed annotation report listing the different types of repetitive DNA elements found in the query sequence. The second is a modified copy of the input DNA in which the repetitive sequences are masked. Masking means that the nucleotides of the repetitive elements are replaced with “N” or “X”, or converted to lowercase. RepeatMasker also uses Tandem Repeats Finder (TRF) to find tandem repeats.
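The masking step itself is simple once the repeat coordinates are known. The sketch below illustrates the three masking styles on a toy sequence; the function name, the interval format, and the toy data are assumptions for illustration and do not reflect RepeatMasker's internal code.

```python
def mask_intervals(sequence, intervals, style="N"):
    """Mask 0-based, end-exclusive intervals in a DNA sequence.

    style "N" or "X" replaces the repeat bases with that letter;
    style "lower" converts the repeat bases to lowercase instead.
    """
    if style not in ("N", "X", "lower"):
        raise ValueError("style must be 'N', 'X' or 'lower'")
    bases = list(sequence.upper())
    for start, end in intervals:
        # Clamp each interval to the sequence so bad coordinates cannot crash.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower() if style == "lower" else style
    return "".join(bases)

# Toy input: a run of nine A's (positions 4 to 12) flanked by unique sequence.
seq = "ACGT" + "A" * 9 + "CGT"
repeat = [(4, 13)]
print(mask_intervals(seq, repeat, "N"))
print(mask_intervals(seq, repeat, "X"))
print(mask_intervals(seq, repeat, "lower"))
```

Replacing bases with a letter discards the underlying sequence, while lowercase masking keeps it readable for tools that treat lowercase as masked.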

How does RepeatMasker get information about the repetitive elements?

RepeatMasker comes with the Dfam database, which holds information on transposable element families. As of Dfam version 3.7, it covers 3,437,876 transposable element families spanning 2,306 species.

The Dfam website contains a collection of multiple sequence alignments. Each alignment contains multiple members of a transposable element family. From the alignment of these representative members, HMMs and a consensus sequence have been built for each family. OmicsBox also works with RepBase, which contains representative repetitive sequences from eukaryotic species.
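To give a feel for how a consensus sequence can be derived from aligned family members, here is a deliberately simplified majority-rule sketch. It is only an illustration: the function name and the toy alignment are made up, and Dfam's actual consensus and HMM building is considerably more sophisticated than a per-column vote.

```python
from collections import Counter

def majority_consensus(aligned_sequences, gap="-"):
    """Build a majority-rule consensus from equal-length aligned rows."""
    lengths = {len(row) for row in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned_sequences):
        # Count the bases in this column, ignoring gap characters.
        counts = Counter(base for base in column if base != gap)
        if not counts:
            continue  # a column made only of gaps contributes nothing
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)

# Four made-up copies of one repeat family, already aligned.
alignment = [
    "ACGTAC",
    "ACGAAC",
    "AC-TAC",
    "ACGTTC",
]
print(majority_consensus(alignment))
```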

How to perform repeat masking using OmicsBox

The Repeat Masking functionality is found under Genome Analysis. Here, we select the input DNA sequences in fasta format and then set the analysis parameters.

Setting the search configuration for Repeat Masker

First of all, we need to select the search engine to perform the search for repeats. There are two search engines: 1. RMBlast 2. HMMER

RMBlast is a RepeatMasker-compatible version of the NCBI BLAST suite. It finds repetitive sequences by aligning the query DNA sequence against the repetitive sequences in a database such as Dfam. It calculates alignment scores that reflect the similarity between the query and each repetitive sequence. Based on that degree of similarity, RMBlast identifies the repetitive sequences and passes this information to RepeatMasker.

The other search engine for homology search is HMMER. HMMER searches a database of sequences for homologs of a query sequence using profile Hidden Markov Models (profile HMMs). Because it relies on an underlying probability model, it is designed to detect remote homologs as sensitively as possible. HMMER is now about as fast as BLAST. Its nhmmer program searches one or more nucleotide queries against a nucleotide sequence database and outputs the top matches.

Selection of Repetitive sequence database

RepeatMasker works with two databases at present: the Dfam database and the RepBase database. Dfam is a database of transposable elements. If it is selected, it is not necessary to provide a database file. If you choose RepBase, however, the database file must be loaded into the wizard in the EMBL format. The library should be the RepBase RepeatMasker edition, which can be downloaded from https://www.girinst.org/server/RepBase/.

One can also use the custom option. In the custom option, one should provide a custom library of repetitive sequences to be masked in the query sequence. The library file should contain repetitive sequences in the fasta format. In the repetitive sequence library, the fasta IDs should be formatted as follows:

“>repeatname#class/subclass”

For example, let's see how to mask repetitive elements of the AluY subfamily. Alu elements are about 300 bp long and are classified under the Short Interspersed Nuclear Elements (SINEs). So, when you create a custom library entry for an AluY element, the fasta ID should follow the format:

>AluY#SINE/Alu

Here, AluY is the name of the repetitive element, SINE is the class to which it belongs, and Alu is the subclass.

Let us look at another example.

Let’s imagine that we have done a genome assembly of a plant species, let's call it Chloropicon primus.

The consensus sequence of a transposable element, downloaded from the plant repetitive element database PlantRep, is as follows.

>rnd-2_family-71#DNA/MULE-MuDR ( Recon Family Size = 28, Final Multiple Alignment Size = 24 )
GAGGATTGCANAAGAGGGGCGAAGTNCTNCGATCACCAGCACGTCGTCGA
NTGGATCGAGGCNAATTCTTAACCAAGAACAATGTCTCACGACGAACCTG
TCATCGACCTTCGCTTCCTCTTCCGATTCCTCCTCCTCCTCTCTCCTTTG
CTCTTCCTCCTTCGCTCTTTCTCTTCCGGTGGATCTTCTCCCTCCCGCTG
AACCATCAAAACGGCCCGAGGCCACGAGGGGCGAACGAGGACGAGGNGAG
AGAGAAACGAAGAGGACGGGAGCGCGCGGAGGAGGGAGAGAGAGAGAGAG
AAGNGGAGGACAGGAAAGAAGGAAGCGTCGTCTCNTGCTCTTTCGAACGA
GCCCTCGCGCGAAAGAAACGACCCAGTGGCGAGGATCTGGCGACGCGAAA
CGCGAAGAAGAGAGGCAGAAAGGAATCGAGGAGTAGATCACCGAGGAAGG
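Before loading a custom library, it can help to check that every fasta ID follows the ">repeatname#class/subclass" pattern. The helper below is a small sketch of such a check; the function name and the returned dictionary layout are assumptions made for this example and are not part of RepeatMasker or OmicsBox.

```python
def parse_repeat_header(header):
    """Split '>repeatname#class/subclass [comment]' into its parts."""
    parts = header.lstrip(">").split()
    if not parts:
        raise ValueError("empty fasta header")
    identifier = parts[0]  # anything after the first space is a free-text comment
    name, sep, classification = identifier.partition("#")
    if not sep or not name or not classification:
        raise ValueError(f"missing '#class' part in header: {header!r}")
    repeat_class, _, subclass = classification.partition("/")
    return {"name": name, "class": repeat_class, "subclass": subclass or None}

# The header of the PlantRep consensus sequence shown above.
header = (">rnd-2_family-71#DNA/MULE-MuDR "
          "( Recon Family Size = 28, Final Multiple Alignment Size = 24 )")
info = parse_repeat_header(header)
print(info)
```

For this header, the name is rnd-2_family-71, the class is DNA, and the subclass is MULE-MuDR.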

Specifying the name of the species

One can specify the name of the species, which must be present in the NCBI Taxonomy database. If you use HMMER as the search engine, it uses the Dfam database to search for repetitive sequences. The Dfam database contains information on repetitive sequences for organisms such as human, mouse, zebrafish, fruit fly, and the nematode Caenorhabditis elegans.

RMBlast option

If RMBlast is used as the search engine, we need to choose a speed/sensitivity setting. There are three options besides the default:

Rush: The database search, alignment, repeat masking, and annotation will be 4 - 10 times faster than with the default setting, but about 10% less sensitive.
Quick: The process will be 2 - 5 times faster than the default, but 5 - 10% less sensitive.
Slow: The process will be 0 - 5% more sensitive, but 2 - 3 times slower than the default.
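On the RepeatMasker command line, these settings correspond to the -qq (rush), -q (quick), and -s (slow) flags. The sketch below only assembles an argument list and does not run anything; the helper name and the file name genome.fasta are hypothetical, and whether OmicsBox passes exactly these flags internally is an assumption.

```python
# Mapping from the speed/sensitivity setting to RepeatMasker flags.
SPEED_FLAGS = {
    "default": [],
    "rush": ["-qq"],   # fastest, least sensitive
    "quick": ["-q"],
    "slow": ["-s"],    # slowest, most sensitive
}

def rmblast_command(fasta_path, speed="default"):
    """Build (but do not run) a RepeatMasker argument list using RMBlast."""
    if speed not in SPEED_FLAGS:
        raise ValueError(f"unknown speed setting: {speed!r}")
    return ["RepeatMasker", "-engine", "rmblast", *SPEED_FLAGS[speed], fasta_path]

# genome.fasta is a placeholder file name.
print(" ".join(rmblast_command("genome.fasta", "rush")))
```

The resulting list could be handed to subprocess.run on a machine where RepeatMasker is installed.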

Divergence Cut-off

A divergence cut-off can be applied. When this option is turned on, RepeatMasker masks only those repetitive sequences that are less diverged from the consensus than the value we provide. Repeat databases such as RepBase, PlantRep, and Dfam contain consensus repeat sequences. The copies of a repetitive element in a genome are likely to have accumulated mutations, so they diverge from the consensus. When you provide a divergence cut-off of, say, 20%, RepeatMasker masks a repetitive sequence only if it is less than 20% diverged from the consensus sequence in the database or library file.

When you run RepeatMasker from the command line, the general form is:

RepeatMasker [-options] <seqfiles(s) in fasta format>

Here you can add the -div parameter to specify the percentage divergence; only repeats that are less diverged from the consensus than this cut-off will be masked.
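The cut-off rule can be sketched in a few lines of Python. This is a simplification: it assumes the repeat copy and the consensus are already aligned and of equal length, and it counts only mismatches, whereas RepeatMasker derives divergence from its own alignments. The function names and toy sequences are made up for illustration.

```python
def percent_divergence(copy, consensus):
    """Percentage of mismatching positions between two aligned sequences."""
    if not consensus or len(copy) != len(consensus):
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(copy.upper(), consensus.upper()))
    return 100.0 * mismatches / len(consensus)

def should_mask(copy, consensus, cutoff=20.0):
    """Mask only copies that are less diverged than the cut-off (in percent)."""
    return percent_divergence(copy, consensus) < cutoff

consensus = "ACGTACGTAC"
young_copy = "ACGTACGTTC"   # 1 mismatch in 10 bases: 10% diverged
old_copy = "TCGAAGGTTA"     # 5 mismatches in 10 bases: 50% diverged

for copy in (young_copy, old_copy):
    print(copy, percent_divergence(copy, consensus), should_mask(copy, consensus))
```

With a 20% cut-off, the first copy would be masked and the second would be left unmasked.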

Output options

The following parameters need to be set:

Masking option: Here, we specify how the repetitive sequences should be masked. The bases of the repetitive sequences can be replaced with "N" or "X", or converted to lowercase.
Only Alu elements: This option can be used to mask only Alu elements if the query DNA is from a primate. Alu repeats are about 300 bp long and belong to the class SINE.

Type of Repeats

Here, we can set the types of repeats that RepeatMasker should mask. For example, you may select interspersed repeats, simple repeats, and low-complexity repeats.
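For readers who use RepeatMasker directly, the masking and repeat-type choices roughly correspond to the command-line flags -x, -xsmall, -alu, -noint, -nolow, and -gff. The sketch below assembles such a command without running it. How OmicsBox maps its wizard options to these flags is an assumption, and the helper name and the file name genome.fasta are hypothetical.

```python
# RepeatMasker replaces repeats with N by default; -x uses X, -xsmall uses lowercase.
MASK_FLAGS = {"N": [], "X": ["-x"], "lowercase": ["-xsmall"]}

def build_command(fasta_path, masking="N", interspersed=True,
                  simple_and_low_complexity=True, only_alu=False, gff=True):
    """Build (but do not run) a RepeatMasker argument list."""
    if masking not in MASK_FLAGS:
        raise ValueError(f"unknown masking style: {masking!r}")
    if not interspersed and not simple_and_low_complexity:
        raise ValueError("at least one repeat type must be selected")
    args = ["RepeatMasker", *MASK_FLAGS[masking]]
    if not interspersed:
        args.append("-noint")   # mask only simple and low-complexity repeats
    if not simple_and_low_complexity:
        args.append("-nolow")   # skip simple and low-complexity repeats
    if only_alu:
        args.append("-alu")     # primate DNA only
    if gff:
        args.append("-gff")     # also write repeat locations as a GFF file
    args.append(fasta_path)
    return args

# genome.fasta is a placeholder file name.
print(" ".join(build_command("genome.fasta", masking="lowercase", interspersed=False)))
```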

Output

The fasta output file contains the masked sequence in fasta format. The locations of the masked repeats are provided in the GFF format.
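Once you have the GFF file, a quick sanity check is to total the masked bases per sequence. The sketch below relies only on the standard tab-separated GFF columns (sequence ID in column 1, start and end in columns 4 and 5, 1-based and inclusive). The example lines are invented for illustration and are not real RepeatMasker output, and the function name is hypothetical.

```python
def masked_bases_per_sequence(gff_lines):
    """Sum the lengths of all GFF features for each sequence ID.

    Overlapping features are counted twice; merge intervals first
    if an exact figure is needed.
    """
    totals = {}
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a feature line
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        totals[seqid] = totals.get(seqid, 0) + (end - start + 1)
    return totals

# Invented example lines in GFF layout.
example = [
    "##gff-version 2",
    'chr1\tRepeatMasker\tsimilarity\t101\t400\t12.5\t+\t.\tTarget "Motif:example1" 1 300',
    'chr1\tRepeatMasker\tsimilarity\t1001\t1050\t0.0\t+\t.\tTarget "Motif:(A)n" 1 50',
    'chr2\tRepeatMasker\tsimilarity\t11\t20\t3.1\t-\t.\tTarget "Motif:example2" 5 14',
]
totals = masked_bases_per_sequence(example)
print(totals)
```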
