cnv_genemark2gff.pl


NAME

cnv_genemark2gff.pl - Convert genemark output to gff format


VERSION

This documentation refers to program version $Rev: 592 $


SYNOPSIS

Usage

    cnv_genemark2gff.pl -i infile.genemark -o outfile.gff

Required Arguments

    --infile     # Path to the input file to translate
                 # If not provided, assumes input from STDIN
    --outfile    # Path to the output gff file
                 # If not provided, writes output to STDOUT
    --seqname    # The id of the sequence analyzed
                 # Not strictly required; see OPTIONS


DESCRIPTION

Converts the output of the genemark.hmm program to the gff format. The conversion has been tested with output from gmhmme2 and gmhmme3. All exons are currently tagged as 'exon' in the gff output file for compatibility with the Apollo genome annotation curation program.


REQUIRED ARGUMENTS

-i,--infile

Path to the genemark file to translate to gff. If an infile is not specified, then the program will expect input from standard input.

-o,--outfile

Path to the gff output file. If an outfile is not specified, the program will write the gff file to standard output.


OPTIONS

--seqname

This is the value listed as the source sequence in the first column of the gff output file. Although it is not strictly a required variable, leaving it unset produces a placeholder for an unknown source (unknown_src in the EXAMPLES section). It will generally be set to the BAC ID or contig ID.

--program

This is the source program name written to the second column of the gff output file. The default is GeneMarkHMM, but the option accepts any value you choose.

--usage

Show a short overview of how to use the program from the command line.

--help

Show program usage with summary of options.

--version

Show program version.

--man

Show the full program manual. This uses the perldoc command to print the POD documentation for the program.

--verbose

Run the program with maximum output.

--test

Run the program without executing the system commands.


EXAMPLES

Typical Use

The typical use of this program will be to parse a file produced by the genemark.hmm program.

    cnv_genemark2gff.pl -i HEX2493A05_genemark_hv.out --seqname HEX2493A05 \
                        -o HEX2493A05_genemark_hv.gff

This will produce a gff output file similar to the following:

    HEX2493A05 GeneMarkHMM      exon    683     1393    .  +  . RNA0001
    HEX2493A05 GeneMarkHMM      exon    1736    2084    .  +  . RNA0001
    HEX2493A05 GeneMarkHMM      exon    2195    2515    .  +  . RNA0001
    HEX2493A05 GeneMarkHMM      exon    2696    2803    .  +  . RNA0001
    HEX2493A05 GeneMarkHMM      exon    2918    3035    .  +  . RNA0001
    HEX2493A05 GeneMarkHMM      exon    3058    3131    .  +  . RNA0001
    HEX2493A05 GeneMarkHMM      exon    3219    3502    .  +  . RNA0001
    HEX2493A05 GeneMarkHMM      exon    3552    3559    .  +  . RNA0002
    HEX2493A05 GeneMarkHMM      exon    3711    3801    .  +  . RNA0002
    HEX2493A05 GeneMarkHMM      exon    3947    4711    .  +  . RNA0002
    ...

The --seqname option used above allows you to specify the value written in the first column of the gff file. If --seqname is not specified, as in the following:

    cnv_genemark2gff.pl -i HEX2493A05_genemark_hv.out \
                        -o HEX2493A05_genemark_hv.gff

The gff output would be similar to the following:

    unknown_src GeneMarkHMM     exon    683     1393    .  +  . RNA0001
    unknown_src GeneMarkHMM     exon    1736    2084    .  +  . RNA0001
    unknown_src GeneMarkHMM     exon    2195    2515    .  +  . RNA0001
    unknown_src GeneMarkHMM     exon    2696    2803    .  +  . RNA0001
    unknown_src GeneMarkHMM     exon    2918    3035    .  +  . RNA0001
    unknown_src GeneMarkHMM     exon    3058    3131    .  +  . RNA0001
    unknown_src GeneMarkHMM     exon    3219    3502    .  +  . RNA0001
    unknown_src GeneMarkHMM     exon    3552    3559    .  +  . RNA0002
    unknown_src GeneMarkHMM     exon    3711    3801    .  +  . RNA0002
    unknown_src GeneMarkHMM     exon    3947    4711    .  +  . RNA0002
    ...
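
Because the ninth column carries the gene model identifier, the converted file can be summarized with standard shell tools. The following is only a sketch: exons_per_model is a hypothetical helper that is not part of cnv_genemark2gff.pl, and it assumes awk is available and that the columns are separated by whitespace, as in the example output above.

```shell
# exons_per_model is a hypothetical helper, not part of cnv_genemark2gff.pl.
# It counts the rows tagged 'exon' for each identifier in the ninth column.
exons_per_model() {
    # $1 = path to a gff file
    awk '$3 == "exon" { n[$9]++ } END { for (id in n) print id, n[id] }' "$1" | sort
}

# Sample rows copied from the example output above.
gff=$(mktemp)
trap 'rm -f "$gff"' EXIT
cat > "$gff" <<'EOF'
HEX2493A05 GeneMarkHMM      exon    683     1393    .  +  . RNA0001
HEX2493A05 GeneMarkHMM      exon    1736    2084    .  +  . RNA0001
HEX2493A05 GeneMarkHMM      exon    2195    2515    .  +  . RNA0001
HEX2493A05 GeneMarkHMM      exon    2696    2803    .  +  . RNA0001
HEX2493A05 GeneMarkHMM      exon    2918    3035    .  +  . RNA0001
HEX2493A05 GeneMarkHMM      exon    3058    3131    .  +  . RNA0001
HEX2493A05 GeneMarkHMM      exon    3219    3502    .  +  . RNA0001
HEX2493A05 GeneMarkHMM      exon    3552    3559    .  +  . RNA0002
HEX2493A05 GeneMarkHMM      exon    3711    3801    .  +  . RNA0002
HEX2493A05 GeneMarkHMM      exon    3947    4711    .  +  . RNA0002
EOF

exons_per_model "$gff"
```

For the ten rows shown, this prints one line per gene model: RNA0001 with 7 exons and RNA0002 with 3.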

Specify the Training Matrix Used

It is also possible to designate the second column of the gff output file using the --program option. This can be used to specify the training data used for gene predictions, which allows you to later separate gene models derived from different training data sets. For example, if the wheat training matrix was used, the command could be:

    cnv_genemark2gff.pl -i HEX2493A05_genemark_hv.out --seqname HEX2493A05 \
                        -o HEX2493A05_genemark_hv.gff \
                        --program GeneMark:wheat

This will produce output similar to the following:

    HEX2493A05 GeneMark:wheat   exon    683     1393    .  +  . RNA0001
    HEX2493A05 GeneMark:wheat   exon    1736    2084    .  +  . RNA0001
    HEX2493A05 GeneMark:wheat   exon    2195    2515    .  +  . RNA0001
    HEX2493A05 GeneMark:wheat   exon    2696    2803    .  +  . RNA0001
    HEX2493A05 GeneMark:wheat   exon    2918    3035    .  +  . RNA0001
    HEX2493A05 GeneMark:wheat   exon    3058    3131    .  +  . RNA0001
    HEX2493A05 GeneMark:wheat   exon    3219    3502    .  +  . RNA0001
    HEX2493A05 GeneMark:wheat   exon    3552    3559    .  +  . RNA0002
    HEX2493A05 GeneMark:wheat   exon    3711    3801    .  +  . RNA0002
    HEX2493A05 GeneMark:wheat   exon    3947    4711    .  +  . RNA0002
    ...
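
Once several runs have been labeled with --program, the gene models can be separated again by matching on the second column. The sketch below assumes awk is available and that the results of more than one run were combined into a single gff file. The helper name filter_by_source and the GeneMark:rice row are invented for illustration and do not come from the program or its output.

```shell
# filter_by_source is a hypothetical helper, not part of cnv_genemark2gff.pl.
# It prints only the rows whose second column equals the requested label.
filter_by_source() {
    # $1 = source label as given to --program, $2 = path to a gff file
    awk -v src="$1" '$2 == src' "$2"
}

# Stand-in for a combined gff file. The GeneMark:rice row is invented
# for illustration; the wheat rows follow the example output above.
gff=$(mktemp)
trap 'rm -f "$gff"' EXIT
cat > "$gff" <<'EOF'
HEX2493A05 GeneMark:wheat   exon    683     1393    .  +  . RNA0001
HEX2493A05 GeneMark:rice    exon    701     1393    .  +  . RNA0001
HEX2493A05 GeneMark:wheat   exon    1736    2084    .  +  . RNA0001
EOF

filter_by_source GeneMark:wheat "$gff"
```

The comparison is an exact string match, so the label must be spelled exactly as it was passed to --program.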


DIAGNOSTICS

The error messages that can be generated will be listed here.


CONFIGURATION AND ENVIRONMENT

This program does not make use of a configuration file or any variables set in the user environment.


DEPENDENCIES

Required Software

Required Perl Modules


BUGS AND LIMITATIONS

Limitations


SEE ALSO

The program is part of the DAWG-PAWS package of genome annotation programs. See the DAWG-PAWS web page ( http://dawgpaws.sourceforge.net/ ) or the Sourceforge project page ( http://sourceforge.net/projects/dawgpaws ) for additional information about this package.


REFERENCE

A manuscript is being submitted describing the DAWGPAWS program. Until this manuscript is published, please refer to the DAWGPAWS SourceForge website when describing your use of this program:

JC Estill and JL Bennetzen. 2009. The DAWGPAWS Pipeline for the Annotation of Genes and Transposable Elements in Plant Genomes. http://dawgpaws.sourceforge.net/


LICENSE

GNU General Public License, Version 3

http://www.gnu.org/licenses/gpl.html

THIS SOFTWARE COMES AS IS, WITHOUT ANY EXPRESS OR IMPLIED WARRANTY. USE AT YOUR OWN RISK.


AUTHOR

James C. Estill <JamesEstill at gmail.com>


HISTORY

STARTED: 10/30/2007

UPDATED: 03/24/2009

VERSION: $Rev: 592 $
