Vijithkumar V

How to leverage GNU parallel to utilize multiple cores while running AUGUSTUS

For de novo gene prediction in a genome assembly project, AUGUSTUS is a reliable tool. To get started, a single augustus command that takes a species model and the genome FASTA file is enough.

The --species parameter specifies the organism whose trained gene model AUGUSTUS should use; in this case, I used maize. The --progress option displays the progress of the run.
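A minimal sketch of such an invocation, wrapped in a small shell function so the input file can be swapped easily. The function name predict_genes and the file names genome.fasta and predictions.gff are placeholders, not names taken from the original command.

```shell
#!/usr/bin/env bash
# Placeholder: path to your genome assembly.
GENOME="${GENOME:-genome.fasta}"

predict_genes() {
    # --species selects the trained gene model (maize here);
    # --progress=true asks AUGUSTUS to report its progress.
    augustus --species=maize --progress=true "$1"
}

# Usage: predict_genes "$GENOME" > predictions.gff
```

AUGUSTUS writes its predictions to standard output, so the usage line redirects them to a file.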

Unfortunately, when you run the command, it uses only one CPU core:

[Figure: htop on Linux, with colored bars showing CPU and memory usage. Caption: AUGUSTUS utilizing only one CPU core]

AUGUSTUS does not have a built-in parameter to set the number of CPU cores. However, we can leverage GNU Parallel to utilize multiple cores.


Check that GNU Parallel is installed and in the PATH

You can verify GNU Parallel's installation by running the following command in your terminal:

parallel --version

It should print the GNU Parallel version and license information.

Alternatively, on Debian or Ubuntu, dpkg -l | grep parallel should list the installed parallel package.

If GNU Parallel is not found or installed, you can install it on Ubuntu using the following command:

sudo apt-get install parallel
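The check and the install hint can be combined into one helper. This is a sketch with a hypothetical function name, check_gnu_parallel; it relies on GNU Parallel printing "GNU parallel" on the first line of its --version output, which also helps to tell it apart from other tools that ship a command named parallel (for example the one in the moreutils package).

```shell
#!/usr/bin/env bash
check_gnu_parallel() {
    # Is any command called "parallel" reachable through the PATH?
    if ! command -v parallel >/dev/null 2>&1; then
        echo "parallel is not in the PATH; install it with: sudo apt-get install parallel"
        return 1
    fi
    # GNU Parallel identifies itself on the first line of --version.
    if parallel --version 2>/dev/null | head -n 1 | grep -q 'GNU parallel'; then
        echo "GNU Parallel found: $(command -v parallel)"
    else
        echo "A parallel command exists, but it does not look like GNU Parallel"
        return 1
    fi
}

# Usage: check_gnu_parallel
```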


Split the main fasta file into multiple files

To divide a large FASTA file into smaller, manageable files, use a Python script that splits the file in a "fasta-aware" manner: records are never cut in the middle, so each resulting file remains a valid FASTA file. In the script, replace the value of "n" with the number of FASTA files you need.
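The same fasta-aware idea can also be sketched in the shell with awk. This is an alternative to the Python script, not a copy of it. The function name split_fasta and the output directory are assumptions; the subset_N.fasta naming follows the file names used later in this post. Records are dealt out round-robin, so each file receives whole records but the files are balanced by record count, not by sequence length.

```shell
#!/usr/bin/env bash
# split_fasta INPUT N OUTDIR
# Writes whole FASTA records, in turn, to OUTDIR/subset_1.fasta ... OUTDIR/subset_N.fasta.
split_fasta() {
    local input="$1" n="$2" outdir="$3"
    mkdir -p "$outdir"
    awk -v n="$n" -v dir="$outdir" '
        # A header line starts a new record: switch to the next output file.
        /^>/ { out = dir "/subset_" (count % n + 1) ".fasta"; count++ }
        # Header and sequence lines alike go to the current output file.
        out != "" { print > out }
    ' "$input"
}

# Usage: split_fasta genome.fasta 8 split_fasta
```

If the assembly has fewer records than N, fewer than N files are created, so check the output directory before launching the parallel jobs.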

Check the number of CPU cores

The following command can be used to check the number of CPU cores.

cat /proc/cpuinfo

My server is equipped with two Intel Xeon E5-2609 processors, each featuring 4 cores. This configuration provides a total of 8 cores. To verify cores and threads per core, use

lscpu
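cat /proc/cpuinfo prints a long block for every logical CPU, so a compact summary is often easier to read. A minimal sketch, assuming a Linux system with coreutils and util-linux installed:

```shell
#!/usr/bin/env bash
# Logical CPUs available to the current process (hyperthreads count separately).
nproc

# Number of "processor" entries listed in /proc/cpuinfo.
grep -c '^processor' /proc/cpuinfo

# Sockets, cores per socket, and threads per core in one view.
lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core):'
```

The number of physical cores is sockets multiplied by cores per socket. If threads per core is greater than 1, nproc reports more logical CPUs than there are physical cores.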


Automate AUGUSTUS on multiple FASTA files in parallel, across “N” CPU cores

  1. Configure OpenMP for AUGUSTUS. To control thread allocation, set the OMP_NUM_THREADS environment variable, which determines the number of threads allocated to OpenMP-enabled applications. To restrict AUGUSTUS to single-threaded operation, set it to 1 with export OMP_NUM_THREADS=1.

  2. Run parallel instances of AUGUSTUS. With OMP_NUM_THREADS set, you can run multiple instances of AUGUSTUS using GNU Parallel. The general form is parallel --jobs N augustus [options] ::: input_files, where the ::: separator introduces the list of inputs.

    Replace:

    1. N with the desired number of parallel instances (equal to the number of cores, if hyperthreading is not supported)

    2. [options] with AUGUSTUS command-line options

    3. input_files with your input FASTA files

    To maximize efficiency, set N to the number of cores, assuming single-threaded operation (OMP_NUM_THREADS=1) and no hyperthreading support. In my case, with 8 cores and single-threading: parallel --jobs 8 augustus [options] ::: input_files.

  3. Automate the run. Set an input directory variable (INPUT_DIR) to the folder that contains all the split input files, and an output directory variable (OUTPUT_DIR) to the folder where the output files are to be created. Eight parallel instances of AUGUSTUS can then be launched by passing "$INPUT_DIR"/subset_{1..8}.fasta to GNU Parallel as the input list.

  4. In that command, {} stands for each input file in turn (one from the group of split files listed in "$INPUT_DIR"/subset_{1..8}.fasta). The output of each parallel job is redirected individually: a quoted '>' followed by "$OUTPUT_DIR"/{/}.gff3 writes one file per job, where {/} is the input file name without its directory.

  5. Combine the output files. Because the redirection creates a separate output file for each AUGUSTUS task, all the split output files need to be combined into one, for example by concatenating them with cat.
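The steps above can be sketched together as a small bash script. It is an illustration under assumptions, not the original program: the default directory names, the combined.gff3 file name, and the two function names are placeholders, and the split files are assumed to be named subset_1.fasta through subset_8.fasta as in the text.

```shell
#!/usr/bin/env bash
# Placeholders: adjust both directories to your own layout.
INPUT_DIR="${INPUT_DIR:-./split_fasta}"
OUTPUT_DIR="${OUTPUT_DIR:-./augustus_out}"

run_parallel_augustus() {
    mkdir -p "$OUTPUT_DIR"
    # Keep every AUGUSTUS instance single-threaded.
    export OMP_NUM_THREADS=1
    # {} is the input file and {/} its basename. The quoted '>' makes the
    # redirection part of each job instead of applying to parallel itself.
    parallel --jobs 8 augustus --species=maize {} '>' "$OUTPUT_DIR"/{/}.gff3 \
        ::: "$INPUT_DIR"/subset_{1..8}.fasta
}

combine_outputs() {
    # Concatenate the per-subset predictions in subset order.
    cat "$OUTPUT_DIR"/subset_{1..8}.fasta.gff3 > "$OUTPUT_DIR"/combined.gff3
}

# Usage: run_parallel_augustus && combine_outputs
```

Because each AUGUSTUS run numbers its genes from g1, the concatenated file can contain the same gene identifier more than once. Downstream tools that expect unique IDs may need them renamed first.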

The complete program

You can find the complete program here.

