For de novo gene prediction in a genome assembly project, AUGUSTUS is a reliable tool. To get started, use the following command line.
Here, the --species parameter selects the species whose trained gene model AUGUSTUS uses for prediction. In this case, I used maize. The --progress option reports the progress of the run.
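As a rough illustration of the two options described above, here is a minimal Python sketch that assembles and runs a single AUGUSTUS job. It is not the original command from this post: the file names genome.fasta and genome.gff are hypothetical placeholders, and the sketch assumes that augustus is on the PATH and writes its predictions to standard output.

```python
import shlex
import subprocess


def build_augustus_command(fasta_path, species="maize"):
    """Return the argument list for one single-core AUGUSTUS run."""
    return ["augustus", f"--species={species}", "--progress=true", fasta_path]


def run_augustus(fasta_path, out_path, species="maize"):
    """Run AUGUSTUS once and save what it prints to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(build_augustus_command(fasta_path, species),
                       stdout=out, check=True)


# Show the command line that would be executed (nothing is run here).
print(shlex.join(build_augustus_command("genome.fasta")))
```

Calling run_augustus("genome.fasta", "genome.gff") would start the actual prediction, still on a single core.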
Unfortunately, when you run the command, it uses only one CPU core.
AUGUSTUS does not have a built-in parameter to set the number of CPU cores. However, we can leverage GNU Parallel to utilize multiple cores.
Check if GNU Parallel is installed and in the PATH
You can verify GNU Parallel's installation by running the following command in your terminal:
parallel --version
If GNU Parallel is installed, this prints its version information.
Alternatively, on Debian or Ubuntu, run dpkg -l | grep parallel, which should list the parallel package if it is installed.
If GNU Parallel is not found, you can install it on Ubuntu using the following command:
sudo apt-get install parallel
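The same PATH check can also be done from Python, which is convenient if the whole pipeline is driven by a script. This is only a small sketch using the standard library; the helper name find_tool is made up for illustration.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


# Report where (or whether) the two tools used in this post are installed.
for tool in ("parallel", "augustus"):
    path = find_tool(tool)
    print(f"{tool}: {path if path else 'not found on PATH'}")
```

Note that this only confirms that an executable with that name exists; parallel --version is still the way to confirm that it is GNU Parallel.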
Split the main FASTA file into multiple files
To divide a large FASTA file into smaller, manageable files, use the following Python script. The script splits the file in a "FASTA-aware" manner, that is, only at record boundaries, so each resulting file remains a valid FASTA file. In the code, set "n" to the number of FASTA files you need.
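A minimal sketch of such a FASTA-aware splitter is shown here. It is not the original script from this post: it simply deals whole records out to n files in round-robin order, and it assumes the output names subset_1.fasta to subset_n.fasta so that they match the file pattern used later. The function name split_fasta and its parameters are illustrative.

```python
import os


def split_fasta(fasta_path, out_dir, n=8, prefix="subset"):
    """Distribute FASTA records round-robin over n files and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"{prefix}_{i}.fasta") for i in range(1, n + 1)]
    handles = [open(path, "w") for path in paths]
    try:
        current = None       # file receiving the record being read
        record_index = -1    # zero-based index of the current record
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # A header line starts a new record: move to the next file.
                    record_index += 1
                    current = handles[record_index % n]
                if current is not None:
                    # Header and sequence lines of one record stay together.
                    current.write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

For example, split_fasta("genome.fasta", "split_input", n=8) would create eight files in a hypothetical split_input folder. Round-robin splitting balances the number of records, not the total sequence length, and if the input has fewer than n records some output files will be empty, so choose n accordingly.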
Check the number of CPU cores
The following command can be used to check the number of CPU cores.
cat /proc/cpuinfo
My server is equipped with two Intel Xeon E5-2609 processors, each featuring 4 cores. This configuration provides a total of 8 cores. To verify cores and threads per core, use
lscpu
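If you prefer to read the CPU count from a script, Python's standard library offers a quick check. This is a small sketch with made-up helper names; note that both functions count logical CPUs, so on a machine with hyperthreading the numbers are higher than the physical core count reported by lscpu.

```python
import os


def logical_cpu_count():
    """Number of logical CPUs reported by the operating system."""
    return os.cpu_count() or 1


def usable_cpu_count():
    """CPUs this process is allowed to run on (Linux), else the logical count."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return logical_cpu_count()


print("logical CPUs:", logical_cpu_count())
print("usable CPUs:", usable_cpu_count())
```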
Automate AUGUSTUS on multiple FASTA files in parallel, across “N” CPU cores
First of all, we need to configure OpenMP for AUGUSTUS. To control thread allocation for AUGUSTUS, set the OMP_NUM_THREADS environment variable. This variable determines the number of threads used by OpenMP-enabled applications. To restrict AUGUSTUS to single-threaded operation, set:
export OMP_NUM_THREADS=1
Run parallel instances of AUGUSTUS. With OMP_NUM_THREADS set, you can now run multiple instances of AUGUSTUS in parallel using GNU Parallel:
parallel --jobs N augustus [options] {} ::: input_files
Replace:
1. N with the desired number of parallel instances (equal to the number of cores, if hyperthreading is not supported)
2. [options] with AUGUSTUS command-line options
3. input_files with your input FASTA files
GNU Parallel substitutes each file listed after ::: for {} and starts one AUGUSTUS job per file. To maximize efficiency, set N to the number of cores, assuming single-threaded operation (OMP_NUM_THREADS=1) and no hyperthreading support. In my case, with 8 cores and single-threading:
parallel --jobs 8 augustus [options] {} ::: input_files
To automate this, set two variables: INPUT_DIR, the path to the folder that contains all the split input files, and OUTPUT_DIR, the path to the folder where the output files are to be created. Then 8 parallel instances of AUGUSTUS can be run using the following command:
Here, {} is replaced by each input file in turn, taken from the group of split files listed as "$INPUT_DIR"/subset_{1..8}.fasta. The output of each parallel job is redirected to its own file.
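The pieces described above ({} for the input file, a per-job redirect to {/}.gff3, and OMP_NUM_THREADS=1) can be put together from Python roughly as follows. This is a sketch, not the exact command from this post: the --gff3=on option and the species value are assumptions, the function names are illustrative, and the directory paths are assumed to contain no spaces.

```python
import os
import shlex
import subprocess


def build_parallel_command(input_dir, output_dir, n_files=8, n_jobs=8, species="maize"):
    """Return the GNU Parallel argument list that runs AUGUSTUS on every subset."""
    inputs = [os.path.join(input_dir, f"subset_{i}.fasta")
              for i in range(1, n_files + 1)]
    return [
        "parallel", "--jobs", str(n_jobs),
        "augustus", f"--species={species}", "--gff3=on", "{}",
        # The literal ">" reaches GNU Parallel unexpanded, so each job
        # redirects its own output; {/} is the input file name without its folder.
        ">", os.path.join(output_dir, "{/}.gff3"),
        ":::", *inputs,
    ]


def run_parallel_augustus(input_dir, output_dir, n_files=8, n_jobs=8, species="maize"):
    """Run all AUGUSTUS jobs, each limited to one OpenMP thread."""
    os.makedirs(output_dir, exist_ok=True)
    env = dict(os.environ, OMP_NUM_THREADS="1")
    command = build_parallel_command(input_dir, output_dir, n_files, n_jobs, species)
    subprocess.run(command, env=env, check=True)


# Show the command that would be run for two hypothetical subsets.
print(shlex.join(build_parallel_command("split_input", "gff_output", n_files=2, n_jobs=2)))
```

Because {/} keeps the file extension, the output for subset_1.fasta is named subset_1.fasta.gff3.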
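Combine the output files: The quoted '>' "$OUTPUT_DIR"/{/}.gff3 creates a separate output file for each AUGUSTUS task. Quoting '>' hands the redirection to GNU Parallel so that it is applied per job, and {/} expands to the input file's name without its directory. All the split output files then need to be combined. This can be done by: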
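One way to merge the per-subset results is sketched below. It is not the original command from this post: it assumes the output files are named subset_1.fasta.gff3 to subset_n.fasta.gff3 (the names produced by {/}.gff3), concatenates them in numeric order, and keeps only the first "##gff-version" header line. The function name combine_gff3 is illustrative.

```python
import os


def combine_gff3(output_dir, combined_path, n=8, prefix="subset"):
    """Concatenate per-subset GFF3 files in numeric order into combined_path."""
    header_written = False
    with open(combined_path, "w") as combined:
        for i in range(1, n + 1):
            part = os.path.join(output_dir, f"{prefix}_{i}.fasta.gff3")
            with open(part) as handle:
                for line in handle:
                    if line.startswith("##gff-version"):
                        # Keep the version header from the first file only.
                        if header_written:
                            continue
                        header_written = True
                    combined.write(line)
    return combined_path
```

Since every AUGUSTUS run is independent, gene identifiers may repeat across the subsets; this sketch does not renumber them, so check the IDs if a downstream tool requires them to be unique.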
The complete program
You can find the complete program here.