cnv_blast2gff.pl


NAME

cnv_blast2gff.pl - Convert blast output to GFF


VERSION

This documentation refers to program version $Rev: 598 $


SYNOPSIS

Usage

    cnv_blast2gff.pl -i blast_report.bln -o blast_out.gff -s seq_id

Required Arguments

    -i      # Path to the input file
            # If not specified the program will expect input from STDIN
    -o      # Path to the output file
            # If not specified the program will write to STDOUT
    -s      # Name of the query sequence
            # If a name is not specified, the name will be
            # extracted from the blast report or use "seq"


DESCRIPTION

This program will translate a blast report for a single query sequence into the GFF format. Since this uses the general bioperl BLAST parser code, this should also be able to parse output from BLAT or wublast. This code works best for converting a blast report of a single query sequence against a single database.


REQUIRED ARGUMENTS

-i,--infile

Path of the input file. If an input file is not specified, the program will expect input from STDIN.

-o,--outfile

Path of the output file. If an output file is not specified, the program will write output to STDOUT.


OPTIONS

-s,--seqname

Identifier of the sequence that has been used as the query sequence in the blast report. This will be used as the first column in the gff output file.

-p,--program

The blast program used. This will be used to identify the source program in the second column of the GFF output file. Examples of valid values include blastn, blastx, or wublast.

--feature

The type of feature. By default, this is set to exon to facilitate using this blast report in Apollo. It is also possible to set this to an ontology compliant name such as match or expressed_sequence_match.

-d,--database

The name of the database that was blasted against. If provided, this will be appended to the program variable in the second column of the GFF output file.

-m,--align

The alignment format used in the BLAST report to be parsed. The program will assume that you are using the default alignment format for blast. Otherwise, you can specify 'tab' or '8' or '9' for tab delimited blast.
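
As a rough illustration of what converting tab delimited output involves, the sketch below maps the twelve standard columns of NCBI tabular output (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score) onto the nine GFF columns shown in the EXAMPLES section. It is not the code used by cnv_blast2gff.pl. The function name tab_to_gff, the use of the bit score for column six, and the strand test are all assumptions made for illustration.

```shell
# Hypothetical helper, not part of cnv_blast2gff.pl.
# Usage: tab_to_gff SOURCE FEATURE < tabular_blast_report
tab_to_gff() {
    awk -F '\t' -v OFS='\t' -v src="$1" -v feat="$2" '
        /^#/ { next }                       # skip the comment lines of format 9
        NF >= 12 {
            qrev = ($7 + 0 > $8 + 0)        # query coordinates descending?
            srev = ($9 + 0 > $10 + 0)       # subject coordinates descending?
            lo = qrev ? $8 + 0 : $7 + 0     # GFF start is the smaller coordinate
            hi = qrev ? $7 + 0 : $8 + 0     # GFF end is the larger coordinate
            # Assumption: minus strand when exactly one pair is descending
            strand = (qrev != srev) ? "-" : "+"
            print $1, src, feat, lo, hi, $12 + 0, strand, ".", $2
        }'
}
```

For example, tab_to_gff blastn:mips exon < blast_result.tab would write one GFF line per hit to STDOUT, where blast_result.tab is a placeholder file name.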

-e,--maxe

The maximum e value threshold to accept.

-l,--min-len

The minimum length to accept.
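
The effect of the two thresholds can be pictured with a small filter over tab delimited blast output. This is only a sketch: it assumes that "length" means the alignment length (column 4 of the tabular format), that the e-value is in column 11, and that both thresholds are inclusive. The manual does not state how cnv_blast2gff.pl applies them internally, and the function name filter_hits is hypothetical.

```shell
# Hypothetical helper, not part of cnv_blast2gff.pl.
# Keep hits with e-value <= MAXE and alignment length >= MINLEN.
# Usage: filter_hits MAXE MINLEN < tabular_blast_report
filter_hits() {
    awk -F '\t' -v maxe="$1" -v minlen="$2" '
        /^#/ { next }                       # skip the comment lines of format 9
        # Adding 0 forces numeric comparison of values such as 1e-50
        ($11 + 0) <= (maxe + 0) && ($4 + 0) >= (minlen + 0)'
}
```

For example, filter_hits 1e-10 100 < blast_result.tab keeps only hits with an e-value of at most 1e-10 and an alignment of at least 100 positions, where blast_result.tab is a placeholder file name.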

--verbose

Run the program with maximum reporting of error and status messages.

--usage

Short overview of how to use the program from the command line.

--help

Show program usage with summary of options.

--version

Show program version.

--man

Show the full program manual. This uses the perldoc command to print the POD documentation for the program.

-q,--quiet

Run the program with minimal output.


EXAMPLES

The following are examples of how to use the cnv_blast2gff.pl program.

Typical Use

The typical use of this program will be to convert an existing blast output to the GFF file format:

  cnv_blast2gff.pl -i blast_result.bln -o parsed_result.gff

This will generate a GFF format file named parsed_result.gff.

Piping BLAST Result Directly to the Conversion Utility

It is also possible to directly send the blast result to the cnv_blast2gff.pl program using the standard streams.

  blastall -p blastn .. | cnv_blast2gff.pl -o blast_result.gff

This will take the blast output from NCBI's blastall program and convert the output to the gff format.

Combining Blast Results in GFF format

Since the cnv_blast2gff.pl program will write the results to the standard output stream if no output file path is specified, it is possible to use standard unix commands to combine results. Consider the following set of commands:

  cnv_blast2gff.pl -i blast_result01.bln > combined_results.gff
  cnv_blast2gff.pl -i blast_result02.bln >> combined_results.gff
  cnv_blast2gff.pl -i blast_result03.bln >> combined_results.gff

This will combine the blast results from the 01, 02 and 03 searches into a single gff file named combined_results.gff.
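
With many reports, the same pattern can be written as a loop. The sketch below relies only on the -i option and on the default of writing to STDOUT when -o is omitted. The function name combine_blast_reports and the CNV_BLAST2GFF variable (which allows a different path to the converter to be substituted) are assumptions made for illustration and are not part of the program.

```shell
# Hypothetical wrapper: convert each report and append it to one GFF file.
# Usage: combine_blast_reports combined_results.gff blast_result*.bln
combine_blast_reports() {
    combined=$1
    shift
    : > "$combined"                         # start with an empty output file
    for report in "$@"; do
        # Without -o the converter writes to STDOUT, so append that output
        "${CNV_BLAST2GFF:-cnv_blast2gff.pl}" -i "$report" >> "$combined" || return 1
    done
}
```

The function stops at the first report that fails to convert, so a partial combined file is not mistaken for a complete one.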

Specify the Sequence ID with --seqname

The first column in the GFF output file indicates the ID of the sequence that is being annotated. By default, the cnv_blast2gff.pl program will attempt to extract this ID from the blast report. It is also possible to specify this from the command line using the --seqname option. For example, suppose the command

  cnv_blast2gff.pl -i bl_result.bln -o gff_result.gff

generated a gff file like the following:

 HEX3045G05   blast:mips   exon     8537    8667    39   +    .  rire1
 HEX3045G05   blast:mips   exon     9911    9996    38   +    .  rire1
 HEX3045G05   blast:mips   exon     10025   10191   36   +    .  rire1
 HEX3045G05   blast:mips   exon     76161   76235   35   +    .  rire1
 HEX3045G05   blast:mips   exon     81151   81200   34   +    .  rire1
 ...

where HEX3045G05 indicates the sequence ID. This sequence identifier could be modified using the --seqname option:

  cnv_blast2gff.pl -i bl_result.bln -o gff_result.gff --seqname HEX001

This would give the following result:

 HEX001   blast:mips   exon     8537    8667    39       +    .  rire1
 HEX001   blast:mips   exon     9911    9996    38       +    .  rire1
 HEX001   blast:mips   exon     10025   10191   36       +    .  rire1
 HEX001   blast:mips   exon     76161   76235   35       +    .  rire1
 HEX001   blast:mips   exon     81151   81200   34       +    .  rire1
 ...

Specify the Database name with --database

By default the cnv_blast2gff.pl program will identify the database in the second column as a suffix to the blast program, separated by a colon. For example, the command

  cnv_blast2gff.pl -i bl_result.bln -o gff_result.gff

generates a gff file like the following:

 HEX3045G05   blast:mips   exon     8537    8667    39   +    .  rire1
 HEX3045G05   blast:mips   exon     9911    9996    38   +    .  rire1
 HEX3045G05   blast:mips   exon     10025   10191   36   +    .  rire1
 HEX3045G05   blast:mips   exon     76161   76235   35   +    .  rire1
 HEX3045G05   blast:mips   exon     81151   81200   34   +    .  rire1
 ...

The database suffix can be changed using the --database option as follows:

  cnv_blast2gff.pl -i bl_result.bln -o gff_result.gff --database tes

This would modify the gff output to the following:

 HEX3045G05   blast:tes   exon     8537    8667    39    +    .  rire1
 HEX3045G05   blast:tes   exon     9911    9996    38    +    .  rire1
 HEX3045G05   blast:tes   exon     10025   10191   36    +    .  rire1
 HEX3045G05   blast:tes   exon     76161   76235   35    +    .  rire1
 HEX3045G05   blast:tes   exon     81151   81200   34    +    .  rire1
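
The rule for joining the program and database names in the second column can be written in a few lines. This is a sketch of the behavior described above, not the program's own code. The function name gff_source is hypothetical, and returning the program name alone when no database is given is an assumption. The examples above suggest the program can also take the database name from the blast report, which this sketch does not cover.

```shell
# Hypothetical helper: build the GFF source column from a program name
# and an optional database name, joined by a colon.
# Usage: gff_source PROGRAM [DATABASE]
gff_source() {
    if [ -n "${2:-}" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        # Assumption: with no database name, use the program name alone
        printf '%s\n' "$1"
    fi
}
```

Here gff_source blast tes prints blast:tes, matching the second column of the output above.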


DIAGNOSTICS


CONFIGURATION AND ENVIRONMENT

This program does not make use of a configuration file or any variables set in the user's environment.


DEPENDENCIES

Required Software

This program requires output from a BLAST program. Since this program makes use of the BioPerl blast parser, it should be possible to convert local alignment results from any of the following programs:

Required Perl Modules

The following perl modules are required for this program:


BUGS AND LIMITATIONS

Bugs

Limitations


REFERENCE

A manuscript is being submitted describing the DAWGPAWS program. Until this manuscript is published, please refer to the DAWGPAWS SourceForge website when describing your use of this program:

JC Estill and JL Bennetzen. 2009. The DAWGPAWS Pipeline for the Annotation of Genes and Transposable Elements in Plant Genomes. http://dawgpaws.sourceforge.net/


LICENSE

GNU General Public License, Version 3

http://www.gnu.org/licenses/gpl.html

THIS SOFTWARE COMES AS IS, WITHOUT ANY EXPRESS OR IMPLIED WARRANTY. USE AT YOUR OWN RISK.


AUTHOR

James C. Estill <JamesEstill at gmail.com>


HISTORY

STARTED: 08/06/2007

UPDATED: 01/31/2009

VERSION: $Rev: 598 $
