cnv_repmask2gff.pl


NAME

cnv_repmask2gff.pl - Convert RepeatMasker output to the gff format.


VERSION

This documentation refers to program version $Rev: 600 $


SYNOPSIS

Usage

    cnv_repmask2gff.pl -i infile.out -o outfile.gff

Required Variables

    -i    # Path to the repeatmasker file to convert
    -o    # Path to the gff output file


DESCRIPTION

This program will convert the output from RepeatMasker to the standard gff file format.
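
As a rough illustration of the conversion (not the program's actual Perl code), the Python sketch below maps one data row of a RepeatMasker .out file to a nine-column gff line. It assumes the usual whitespace-delimited .out layout (score, three percentage columns, query name, begin, end, bases left, strand, repeat name). The function name is hypothetical, and the sample row's "(12000)" and trailing values are made up for the example.

```python
def repmask_row_to_gff(line, source="repeatmasker", feature="exon"):
    """Return a tab-delimited gff line for one .out data row, or None otherwise."""
    fields = line.split()
    # Data rows start with an integer score; header and blank lines do not.
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    score, seqname, start, end = fields[0], fields[4], fields[5], fields[6]
    # RepeatMasker reports the complement strand as 'C'.
    strand = "-" if fields[8] == "C" else "+"
    repeat_name = fields[9]
    return "\t".join(
        [seqname, source, feature, start, end, score, strand, ".", repeat_name]
    )


# Hypothetical sample row; only the name, coordinates, score, and repeat
# name correspond to values shown in this manual.
sample = "   25  0.0  0.0  0.0  HEX3045G05  469  493 (12000) C AT_rich  Low_complexity (0) 25 1 1"
print(repmask_row_to_gff(sample))
```

Header lines in the .out file return None in this sketch, so a caller can simply skip them.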


REQUIRED ARGUMENTS

-i,--infile

Path to the input file that contains the RepeatMasker out file to convert to the gff format. If this option is not specified, the program will expect input from STDIN.

-o,--outfile

Path to the gff output file. If an outfile path is not specified, the program will write the output to STDERR.


OPTIONS

-p,--param

The parameter name to append to the source program name. This information will be appended to the second column of the gff output file. Use it to record whether you masked with the same database using a different parameter set, or with a different database.

--program

The program name to use. This is the data in the second column of the gff output file. By default, this is set to 'repeatmasker'. This option allows you to specify other program names if desired.

-s,--seqname

Identifier for the sequence file that was masked with repeatmasker. The out file from repeatmasker may have truncated your original file name, and this option allows you to use the full sequence name.

--plus

Write all of the annotations to be in the positive (plus) strand. By default the program will interpret strand results reported as 'C' to be in the negative strand orientation. Setting the --plus flag will report all RepeatMasker hit results as occurring in the plus strand orientation.

--append

Append the results to an existing gff file. This must be used in conjunction with the --outfile option.

-q,--quiet

Run the program with minimal output.

--test

Run the program without doing the system commands.

--usage

Short overview of how to use program from command line.

--help

Show program usage with summary of options.

--version

Show program version.

--man

Show the full program manual. This uses the perldoc command to print the POD documentation for the program.
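
The effect of --program, --param, and --plus on the gff columns can be pictured with the short Python sketch below. It is only a model of the behavior described in this manual, not the program's own code, and both function names are hypothetical.

```python
def gff_source(program="repeatmasker", param=None):
    """Build the value of gff column 2 from the program name and optional parameter tag."""
    return f"{program}:{param}" if param else program


def gff_strand(rm_strand, plus=False):
    """Build the value of gff column 7 from the RepeatMasker strand field."""
    if plus:
        # With --plus, every hit is reported on the plus strand.
        return "+"
    # Without --plus, 'C' (complement) is reported as the minus strand.
    return "-" if rm_strand == "C" else "+"


print(gff_source())                # default source name
print(gff_source("RM", "TREP"))    # --program RM --param TREP
print(gff_strand("C"))             # default strand handling
print(gff_strand("C", plus=True))  # --plus
```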


EXAMPLES

Typical Use

The typical use of this program is to convert a RepeatMasker output file to the gff format.

  cnv_repmask2gff.pl -i HEX3045G05_TREP.out -o HEX3045G05_TREP.gff

This will produce a GFF file similar to the following:

 HEX3045G05  repeatmasker    exon     469       493     25      -     . AT_rich
 HEX3045G05  repeatmasker    exon     716       754     25      +     . AT_rich
 HEX3045G05  repeatmasker    exon     1764      2069    469     +     . TREP20
 HEX3045G05  repeatmasker    exon     1816      2105    507     +     . TREP58
 HEX3045G05  repeatmasker    exon     1920      2248    450     +     . TREP214
 ...

Identify the Database Used

It may also be useful to run repeatmasker against a number of different databases. You would then want to record the database used in your gff output file. You can do this with the --param option, which stores the database name in the parameter tag. For example, if you used the TREP database as your database for masking:

  cnv_repmask2gff.pl -i rm_result.out -o rm_resout.gff --param TREP

This will append the parameter tag 'TREP' to the source column (col 2) of the gff output file and will produce a GFF file similar to the following:

 HEX3045G05  repeatmasker:TREP  exon   469     493     25     - .      AT_rich
 HEX3045G05  repeatmasker:TREP  exon   716     754     25     + .      AT_rich
 HEX3045G05  repeatmasker:TREP  exon   1764    2069    469    + .      TREP20
 HEX3045G05  repeatmasker:TREP  exon   1816    2105    507    + .      TREP58
 HEX3045G05  repeatmasker:TREP  exon   1920    2248    450    + .      TREP214
 ...

Program Source

It may also be useful to specify a different program source name, depending on the needs of your individual pipeline. You can do this using the --program option. For example, to shorten the full name repeatmasker to 'RM', you could use the following command:

  cnv_repmask2gff.pl -i rm_result.out -o rm_result.gff --program RM

This will change the source id in the second column of the output to RM and will result in a GFF output file similar to the following:

  HEX3045G05    RM     exon     469     493     25      -     . AT_rich
  HEX3045G05    RM     exon     716     754     25      +     . AT_rich
  HEX3045G05    RM     exon     1764    2069    469     +     . TREP20
  HEX3045G05    RM     exon     1816    2105    507     +     . TREP58
  HEX3045G05    RM     exon     1920    2248    450     +     . TREP214
  ...

This can also be used in conjunction with the param tag:

  cnv_repmask2gff.pl -i result.out -o result.gff --program RM --param TREP

This will result in a GFF file similar to the following:

 HEX3045G05   RM:TREP   exon    469     493     25      -     . AT_rich
 HEX3045G05   RM:TREP   exon    716     754     25      +     . AT_rich
 HEX3045G05   RM:TREP   exon    1764    2069    469     +     . TREP20
 HEX3045G05   RM:TREP   exon    1816    2105    507     +     . TREP58
 ...

Specify the Sequence Source

It is possible that the repeatmasker out file will truncate the name of your source sequence. You can restore this to the original name using the --seqname option. For example, if your full name was HEX3045G05_A001 you could specify this as:

  cnv_repmask2gff.pl -i result.out -o result.gff --seqname HEX3045G05_A001

This will result in a GFF output file similar to the following:

 HEX3045G05_A001   repeatmasker exon    469     493     25      -    .  AT_rich
 HEX3045G05_A001   repeatmasker exon    716     754     25      +    .  AT_rich
 HEX3045G05_A001   repeatmasker exon    1764    2069    469     +    .  TREP20
 HEX3045G05_A001   repeatmasker exon    1816    2105    507     +    .  TREP58
 HEX3045G05_A001   repeatmasker exon    1920    2248    450     +    .  TREP214
 ...
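
The effect of the sequence name override on column 1 can be pictured as the following Python sketch. It rewrites the first column of an already formatted, tab-delimited gff line. This is an illustration only: the function name is hypothetical, and the program itself is assumed to set the name while writing each line rather than rewriting it afterwards.

```python
def set_seqname(gff_line, seqname):
    """Return the gff line with column 1 replaced by the full sequence name."""
    cols = gff_line.rstrip("\n").split("\t")
    cols[0] = seqname
    return "\t".join(cols)


truncated = "HEX3045G05\trepeatmasker\texon\t469\t493\t25\t-\t.\tAT_rich"
print(set_seqname(truncated, "HEX3045G05_A001"))
```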


DIAGNOSTICS

The error messages that can be generated will be listed here.


CONFIGURATION AND ENVIRONMENT

This program does not make use of a configuration file or variables set in the user's environment.


DEPENDENCIES

Required Software

Required Perl Modules

This program does not make use of Perl modules outside of the normal suite of modules present in a typical installation of perl.


BUGS AND LIMITATIONS

Bugs

Limitations


REFERENCE

A manuscript is being submitted describing the DAWGPAWS program. Until this manuscript is published, please refer to the DAWGPAWS SourceForge website when describing your use of this program:

JC Estill and JL Bennetzen. 2009. The DAWGPAWS Pipeline for the Annotation of Genes and Transposable Elements in Plant Genomes. http://dawgpaws.sourceforge.net/


LICENSE

GNU General Public License, Version 3

http://www.gnu.org/licenses/gpl.html

THIS SOFTWARE COMES AS IS, WITHOUT ANY EXPRESS OR IMPLIED WARRANTY. USE AT YOUR OWN RISK.


AUTHOR

James C. Estill <JamesEstill at gmail.com>


HISTORY

STARTED: 04/10/2006

UPDATED: 03/24/2009

VERSION: $Rev: 600 $
