Running BLAST from the command-line

A simple BLAST search

Running BLAST from a terminal window requires only basic knowledge of command-line arguments and is relatively easy. Below follow instructions for running a basic BLAST search on Abel. BLAST is already installed on Abel and is added to the PATH when its module is loaded. If you run BLAST on the command line on a PC or Mac, BLAST has to be installed first. Instead of adding BLAST to the PATH, the programs can also be called by their absolute file names.

module load blast+
This loads the newest BLAST package (blast+ is a newer version of the blast package, which is also available on Abel).

mkdir blast_test
This command creates a directory that will hold our BLAST data.

cd blast_test/
We step into this directory.

echo "MKSPALQPLSMAGLQLMTPAS" > query.fa
We create the query file for the BLAST search. The command writes the sequence "MKSPALQPLSMAGLQLMTPAS" to the file “query.fa”. (For simplicity, the header line normally present in a FASTA file has been omitted; BLAST accepts this format nonetheless.)
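For comparison, a query file that does include the usual header line could be written as sketched below. The identifier “test_query” and the file name “query_with_header.fa” are made-up names for this illustration; any text may follow the “>” character.

```shell
# Write a two-line FASTA file: a header line starting with ">" followed by the sequence.
# "test_query" and "query_with_header.fa" are hypothetical names chosen for this example.
printf '>test_query\n%s\n' "MKSPALQPLSMAGLQLMTPAS" > query_with_header.fa

# Display the file to check its contents
cat query_with_header.fa
```

Such a file can be passed to the -query argument in the same way as “query.fa”.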

blastp -query query.fa -db /work/databases/bio/ncbi/nr -out blastp_out.txt
We run the blastp program with “query.fa” as our query file. The database is set to the “nr” database, which is part of the BLAST installation on Abel. Finally, we specify that the BLAST output is written to the file “blastp_out.txt”. This BLAST search takes about two minutes to run.
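If the hits are to be processed further, a tabular output is often easier to handle than the default report. The sketch below assumes the same query file and database path as in the command above. The e-value cutoff, the hit limit and the output file name “blastp_hits.tsv” are illustrative choices, not values recommended by this page. The search is only started if blastp and the query file are found.

```shell
DB=/work/databases/bio/ncbi/nr   # database path used on Abel (from the example above)
OUT=blastp_hits.tsv              # hypothetical output file name

if command -v blastp >/dev/null 2>&1 && [ -s query.fa ]; then
    # -evalue          : report only hits with an e-value at or below the cutoff (illustrative value)
    # -max_target_seqs : keep at most this many database sequences (illustrative value)
    # -outfmt 6        : tab-separated output, one line per hit
    blastp -query query.fa -db "$DB" -evalue 1e-5 -max_target_seqs 10 -outfmt 6 -out "$OUT"
else
    echo "blastp or query.fa not found; load the blast+ module and create query.fa first" >&2
fi
```

With -outfmt 6, the default columns are query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value and bit score.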

more blastp_out.txt
This command displays the results of the BLAST search.

Creating a custom BLAST database

Often, it is necessary to use a custom database when BLASTing. For instance, the data you want to search through may not yet be deposited in the NCBI “nr” or “nr/nt” databases. Or you may want to restrict the database size, either to speed up the search or to obtain better e-values. A custom database starts out as a single text file containing a number of FASTA records, but BLAST additionally needs binary index files that speed up the search process. These are created with the “makeblastdb” program, which comes as part of the BLAST package. On Abel, this program is added to the PATH when the blast+ module is loaded. It is highly recommended to create a separate folder for the BLAST database, which will contain all the database-associated files.

module load blast+
The blast+ package is loaded.

mkdir sparc
This command creates the folder that will hold our custom database files.

cd sparc
We step into the database folder.

wget "http://diark.org/diark/check_licence?filename=genomes_genbank%2FSphaeroforma_arctica_v1_contigs.zip" -O sparc_contigs.zip
The “wget” command downloads a zipped file containing the genome of the eukaryotic protist Sphaeroforma arctica from the diark.org website. The URL is quoted so that the shell does not interpret the “?” character. The -O argument sets the name under which the downloaded file is saved.

unzip sparc_contigs.zip
This unzips the downloaded genome file.

ls -l
This command displays the files in our current folder (“sparc”). We see two files, the zipped “sparc_contigs.zip” file and the unzipped “Sphaeroforma_arctica_v1_contigs.fasta” file.

rm sparc_contigs.zip
We remove the zipped file, as we do not need it any more.

more Sphaeroforma_arctica_v1_contigs.fasta
This command prints the contents of the “Sphaeroforma_arctica_v1_contigs.fasta” file. It is easy to see that this file contains numerous FASTA sequences. Press “q” to exit the file display.
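Instead of paging through the file, the number of sequences can also be counted, since every FASTA record starts with a “>” header line. The sketch below uses a tiny stand-in file with invented contents (“toy_contigs.fasta”, a hypothetical name) so that it can be tried anywhere.

```shell
# Build a tiny stand-in FASTA file; the two records are invented for this example.
printf '>contig_1\nACGTACGT\n>contig_2\nGGCCTTAA\n' > toy_contigs.fasta

# Count the records: each record begins with a line starting with ">".
grep -c '^>' toy_contigs.fasta        # prints 2

# Show only the header lines of the first few records
grep '^>' toy_contigs.fasta | head -n 5
```

To apply this to the downloaded genome, replace “toy_contigs.fasta” with “Sphaeroforma_arctica_v1_contigs.fasta”.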

makeblastdb -in Sphaeroforma_arctica_v1_contigs.fasta -dbtype nucl
Here the additional index files are created. In addition to specifying the input file name, we also have to state the type of sequences used. The main options here are “nucl” and “prot” for nucleotide (RNA and DNA) or amino acid sequences, respectively.

ls -l
Again, this command prints a list of the files contained in the current directory. We see four files: the original “Sphaeroforma_arctica_v1_contigs.fasta” file and three new files. The new files all retain the file name of the original FASTA file, but append the extensions “nhr”, “nin” and “nsq”.

The BLAST database is now ready to be used. When referring to this database in a BLAST search, it needs to be addressed as “Sphaeroforma_arctica_v1_contigs.fasta” (possibly including the full path) - i.e. NOT as “Sphaeroforma_arctica_v1_contigs” or “Sphaeroforma_arctica_v1_contigs.fasta.nin”.
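A search against the new database could then look like the sketch below. It assumes that the command is run inside the “sparc” folder and that a nucleotide query file exists; “my_query.fa” and “blastn_out.txt” are hypothetical file names that are not part of this page. The search is only started if blastn, the query file and the index files are found.

```shell
DB=Sphaeroforma_arctica_v1_contigs.fasta   # database name, exactly as passed to makeblastdb -in
QUERY=my_query.fa                          # hypothetical nucleotide query file

if command -v blastn >/dev/null 2>&1 && [ -s "$QUERY" ] && [ -e "$DB.nin" ]; then
    # blastn compares a nucleotide query against a nucleotide database
    blastn -query "$QUERY" -db "$DB" -out blastn_out.txt
else
    echo "Skipping: blastn, $QUERY or the index files of $DB are missing" >&2
fi
```

When running the search from another directory, set DB to the full path of the FASTA file name.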