DAWGPAWS
User Manual

James C. Estill
April 30, 2009


TABLE OF CONTENTS

I. GETTING STARTED

  1. Install DAWGPAWS
  2. Install Required Software
  3. Install Required Perl Modules
  4. Define Environment Variables
  5. Test Installation

II. ANNOTATION PROCESS

  1. Prepare Sequence Files for Annotation
    1. Split FASTA Files with multiple records
    2. Shorten FASTA file names
    3. Softmask Sequence files
    4. Hardmask Sequence files
  2. Structural Feature Annotation
    1. Annotate Gaps in the Assembly
  3. Gene Annotation De Novo Computational Results
    1. GenScan
    2. GeneMark.hmm
    3. EuGene
    4. GeneID
    5. FGenesh
  4. Transposable Element De Novo Computational Results
    1. LTR_Struc
    2. LTR_FINDER
    3. find_ltr
    4. ltr_seq
    5. LTRHarvest - not currently implemented
    6. Repseek
    7. FINDMITE
    8. Tandem Repeats Finder
  5. Transposable Element Homology Based Computational Results
    1. RepeatMasker
    2. TE Nest
    3. HMMER - TE Models
    4. Oligomer Counts
  6. BLAST Based Homology Searches
    1. NCBI-BLAST
  7. Preparing Computational Results for Apollo
  8. Human Curation of Computational Results

III. ADDITIONAL INFORMATION

  1. Apollo Data Tiers
  2. Computational Results Directory Structure

IV. MANUAL FILES FOR INDIVIDUAL PROGRAMS

V. LITERATURE CITED



I. GETTING STARTED

Getting started with DAWGPAWS may require that you install additional software or define variables in your user environment. The steps involved in preparing to do genome annotation with the DAWGPAWS package are described in detail below.

This user documentation generally assumes that you are running DAWGPAWS on a Unix or Linux machine and that you are operating from the bash shell. Wherever I have written yourname below, simply substitute your user name.



Install DAWGPAWS

The release 1.0 package of the DAWGPAWS suite of programs can be downloaded from the DAWGPAWS SourceForge web site. It is also possible to check out the 'live' version of DAWGPAWS using Subversion.

Download From SourceForge

The easiest way to obtain the most stable version of the DAWGPAWS programs is to download the programs from the SourceForge project downloads page.

Anonymous Checkout via SVN

An alternative way to download the DAWGPAWS programs is to anonymously check out the SVN code from the code repository on SourceForge. This will give you the most recent version of all of the DAWGPAWS programs and documentation, including recent bug fixes that may not yet have been added to the stable release, as well as access to experimental new features as they are added. These programs are constantly changing, however, and using the newest versions may introduce new bugs into your workflow.

Checking out the programs via SVN requires that you have the SVN client program installed, which can be downloaded from http://subversion.tigris.org/project_packages.html. Once you have the SVN client installed, navigate to the location where you want to place the DAWGPAWS programs and check out the code. This is illustrated by the following commands:
  >mkdir /home/yourname/apps
>cd /home/yourname/apps/
>svn co https://dawgpaws.svn.sourceforge.net/svnroot/dawgpaws/trunk dawgpaws
The above command will download working copies of all of the scripts, program manuals, and emacs modes from the SourceForge repository and place them in the dawgpaws directory. The svn 'co' command checks out the version of DAWGPAWS that is the most up to date at the time of your initial download. To update to the most recent version of DAWGPAWS at any time in the future, run the svn update command from within the dawgpaws directory:
  >cd /home/yourname/apps/dawgpaws
>svn update
This will update to the most current version of DAWGPAWS.

Symbolic Link to the Perl Binary

In Linux and the Mac OS X environment, it is possible to run all of the DAWGPAWS programs without needing to type perl before the program name. To make this possible, all of the DAWGPAWS programs assume that your local installation of perl is at /usr/bin/perl. You can see this by opening up a DAWGPAWS program in a text editor. You will see that #!/usr/bin/perl -w is the first line in the program.
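
You can check where the perl binary actually lives on your machine with the standard which command:
  >which perl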

If your local copy of the perl binary is not at the /usr/bin/perl location, you can make a symbolic link using the 'ln' command. In general, a symbolic link is created as
  >ln -s existing_file new_name
where existing_file is the path to the file on your machine, and new_name is the name of the new shortcut. The following example shows how to make a shortcut when your local installation of perl is at /usr/local/bin/perl instead of the expected /usr/bin/perl:
  >ln -s /usr/local/bin/perl /usr/bin/perl



Install Additional Required Software

The DAWGPAWS suite of programs is a set of scripts that provide for high throughput execution of existing genome annotation software. The following is an alphabetical list of software and web services that may be used by DAWGPAWS, with links to the source files for installing the software or to the web services. The operating systems (OS) that the software can be executed on are also listed, as well as references (Ref) to the peer-reviewed publications describing the software. You do not need to install all of these programs to make use of the DAWGPAWS package; this is the complete list of programs that DAWGPAWS can use.


Install Required Perl Modules

In addition to the standard modules included in most installations of Perl, a number of additional Perl modules are required. In particular, the sequence handling and conversion scripts rely on the bioperl modules.
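
You can check whether a given module is installed by asking perl to load it at the command line. The example below tests for Bio::SeqIO from the bioperl distribution, which the DAWGPAWS sequence conversion scripts use; if the module is missing, perl will report that it cannot be located:
  >perl -MBio::SeqIO -e 'print "bioperl ok\n";'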


Define Environment Variables

In the Unix-like operating systems, including Unix, Linux, and Mac OS X, you can define variables in your command line environment. These variables change the way you interact with the command line and can be used to simplify common tasks. Modifying the environment will allow you to use a program at the command line without referring to its full path, and can simplify using DAWGPAWS scripts by letting you define common variables in your environment instead of defining them at the command line every time you use a program. The following information shows you how to define environment variables for use by the DAWGPAWS suite of programs. For more information on the shell and user environment in Linux you can refer to http://www.comptechdoc.org/os/linux/usersguide/linux_ugenvironment.html.

Add the DAWGPAWS code directory to your Path

Adding the DAWGPAWS code directory to your path will allow you to type DAWGPAWS commands without needing to type the entire path to the program. For example, you will be able to just type batch_blast.pl to launch the batch_blast.pl program instead of needing to type /home/yourname/code/dawgpaws/scripts/batch_blast.pl. In the bash shell you would add the following line to your user profile:
  export PATH=$PATH:$HOME/apps/dawgpaws/scripts
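
For the change to take effect in your current shell session, reload the profile you edited; shown here for .bashrc, substitute .profile or .bash_profile if that is where you added the line:
  >source ~/.bashrc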

Define paths to required software

Many of the DAWGPAWS command line programs use external software that may be stored at different locations in different user environments. You can choose to define the location of these external programs at the command line, but you are also able to define these paths in your user environment.

An example of setting these path variables in the bash shell is below:

  #FIND_LTR
export PATH=$PATH:/home/yourname/apps/LTRDeNovo/LTR/tool

The environment options that are available for each program are described in the man page under the Configuration and Environment heading.

Full variable set for the bash shell

In the bash shell you can copy and paste the following into your .bashrc or .profile file. The following assumes that you have installed the applications in a directory named apps in your home directory. You will of course need to modify the directory paths to the true locations of the software on your machine. Note that the $HOME variable below refers to your user home directory (ie /home/jestill/). Using the $HOME variable instead of your actual path makes this profile movable among different machines where your home directory may be at a different location. Be sure to leave $HOME unquoted or inside double quotes so that the shell expands it; single quotes would prevent the expansion.

#-----------------------------+
# SOFTWARE                    |
#-----------------------------+

# DAWGPAWS SCRIPTS
export PATH=$PATH:$HOME/code/dawgpaws/scripts

# FIND_LTR
export PATH=$PATH:$HOME/apps/LTRDeNovo/LTR/tool
export FIND_LTR_ROOT="$HOME/apps/LTRDeNovo/LTR/tool/"

# LTR_FINDER
export TRNA_DB="$HOME/apps/LTR_FINDER."
export PROSITE_DIR="$HOME/apps/LTR_FINDER/ps_scan/"

# LTR_seq
export PATH=$PATH:$HOME/apps/ltr_seq

# EuGene
export EUGENEDIR="$HOME/apps/eugene-3.3"

#-----------------------------+
# DAWGPAWS VARS               |
#-----------------------------+
# The following variables are used directly by DAWGPAWS scripts

# LTR_Finder path
export LTR_FINDER="$HOME/apps/ltr_finder/ltr_finder"

# GENSCAN
export DP_GENSCAN_BIN="$HOME/apps/genscan/genscan"
export DP_GENSCAN_LIB="$HOME/apps/genscan/Maize.smat"

# GeneMark
export GM_BIN_DIR="$HOME/apps/GenMark/genemark_hmm_euk.linux/"
export GM_LIB_DIR="$HOME/apps/GenMark/genemark_hmm_euk.linux/"

# VennMaster
export VMASTER_DIR="$HOME/apps/VennMaster/VennMaster-0.36.0/"
export VMASTER_JAVA_BIN='/usr/java/jre1.6.0/bin/java'

# GeneID
export GENEID_BIN=/usr/local/genome/geneid/geneid_v1.3/geneid/bin/geneid

# LTR_seq
export LTR_SEQ_DIR="$HOME/apps/ltr_seq/"
export LTR_SEQ_BIN="$HOME/apps/ltr_seq/LTR_seq"

# TRF - Tandem Repeats Finder Location
export TRF_BIN="$HOME/apps/bin/trf400.linux.exe"

# DAWGPAWS NCBI-BLAST OPTIONS
export DP_BLAST_BIN='/usr/local/genome/ncbiblast/blast-2.2.13/bin/blastall'
export DP_BLAST_DIR="$HOME/paws/"

# TEnest Options
export TE_NEST_BIN='/home/jestill/Apps/te_nest/TEnest.pl'
#export TE_NEST_DIR='/home/jestill/Apps/te_nest/'
#export DP_WUBLAST_DIR='/usr/local/genome/wu_blast/'

# RepeatMasker
export DP_RM_BIN='RepeatMasker'




Test Your Installation of DAWGPAWS

A number of test programs have been written to check the user environment, to check for required software, and to test that the conversion programs are installed properly. The test files use test data that are part of the DAWGPAWS package. Following Perl conventions, these test scripts are in the t/ directory of DAWGPAWS, and the data required for the tests are in the data subdirectory within this test directory. To run one of these test scripts, navigate to the t/ directory and run the test script as:
  >./dp_cnv_test.t

A number of individual tests are provided for the batch_run programs, the conversion programs, as well as individual options in the DAWGPAWS environment. To run all of these tests at one time, you can use the dp_test_all.t script. It is recommended that you test individual components one at a time as you install the required software. The test files available as of April 30, 2009 are listed below.
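
For example, to run the entire test suite from within the DAWGPAWS test directory:
  >cd /home/yourname/apps/dawgpaws/t
>./dp_test_all.t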


II. ANNOTATION PROCESS

Despite the name, the use of DAWGPAWS for annotation is really more of a process than a pipeline. There is not a single place where you drop in a sequence, push a button, and get a fully annotated genome as output. The series of scripts that comprise DAWGPAWS are designed to be robust across operating systems as much as possible, and are designed to be usable in an a la carte fashion. The following outline of an annotation process indicates the use of the full suite of DAWGPAWS programs; however, a subset of these programs could be chosen for producing computational results for genome annotation.

The process described below assumes that you will be curating the computational evidence using the Apollo Genome Annotation Curation tool. I assume that you are using the game.xml file format as the working copy of your annotation.



A. Preparing Sequence Files for Annotation

The following is an overview of the process used to annotate sequences with the DAWGPAWS set of programs. This assumes that you are running the analysis on a Linux machine, and that you are storing your annotation computational results in the directory
  /home/yourname/projects/wheat_annotation/wheat_analysis/

This directory will have a subdirectory named for every annotated sequence. The subdirectories for the annotated sequence will follow the directory structure indicated in the DAWGPAWS Directory Structure below.

In all of the following command examples, the > character represents the command line prompt.

1. Split MultiFASTA Files

The entire DAWGPAWS process assumes that each FASTA file contains a single record representing a large contig such as an individual BAC, YAC, or chromosome pseudomolecule. If your query sequence file contains multiple fasta records, you will first need to split it into individual files. For example, if you had a multiple record fasta file named multi_fasta.fasta, you could split it into a single fasta file for each record using the following command:
  >cnv_seq2dir.pl -i multi_fasta.fasta -o outdir/
It is actually possible to read input sequences in any record format that is compatible with the bioperl SeqIO module. The format is specified with the -f option. The output files will always be generated in the fasta format.
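
For example, assuming you had a multiple record file in GenBank format (genbank is one of the sequence format names recognized by bioperl's SeqIO; the file name here is a placeholder), you could split it with:
  >cnv_seq2dir.pl -i multi_records.gb -f genbank -o outdir/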

2. Rename FASTA files to a short unique name

Because many of the programs that are used in the DAWGPAWS process have limits on the size of FASTA headers, it may be necessary to first shorten the name of the FASTA files.

Start with a directory in which you have a single fasta file for every BAC to process, at the following directory path:

  /home/YourName/projects/wheat_annotation/wheat_analysis/fasta_orig

Navigate to the wheat_analysis directory using the cd command to change directories:

  >cd /home/YourName/projects/wheat_annotation/wheat_analysis

Issue the following command from the wheat_analysis directory:

  >fasta_shorten.pl -i fasta_orig/ -o fasta_short/ -l 10 --uppercase

This will rename all of the fasta files in the fasta_orig directory and will place the results in the fasta_short directory. The -l option of 10 will shorten the names to 10 characters, and the --uppercase flag will convert all lowercase base calls to uppercase.

3. Soft mask the renamed files using RepeatMasker and the TREP database

For the wheat BACs, the nonredundant TREP database is used.

Navigate to the wheat_annotation directory

  >cd /home/YourName/projects/wheat_annotation

Issue the following command from the wheat_annotation directory

  >batch_repmask.pl -i wheat_analysis/fasta_short/ -o wheat_analysis/
                    -c wheat_analysis/batch_mask.jcfg --engine wublast

This will softmask all fasta files in the fasta_short directory and will place the results in the wheat_analysis directory.

The -c option should point to the location of the batch_repmask configuration file. The --engine wublast option will use the wublast engine for masking the sequences. The default behavior is to mask with crossmatch.

4. Hard mask the softmasked files generated above

Many of the gene prediction programs are not soft-masked "aware", so it is necessary to also make a hard masked copy of the fasta file. This will replace the lowercase letters with an N or X character. The example below shows masking the fasta files in the soft_masked directory with an uppercase N and placing the output in the hard_masked directory.

  >cd /home/YourName/projects/wheat_annotation/wheat_analysis
>batch_hardmask.pl -i soft_masked/ -o hard_masked/ -m N



B. Structural Feature Annotation

At this point, DAWGPAWS can only annotate gaps in the assembly. Structural features are defined here as features of the sequence, such as gaps and GC content, that are not directly related to the annotation of biological sequence features such as genes or transposable elements.

1. Annotate the gaps in the assembly

The following will find gaps in your fasta files.

  >cd /home/YourName/projects/wheat_annotation
>batch_findgaps.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/
The batch_findgaps.pl program is slow and uses an ugly algorithm, but it does work. Results will be placed in gff format in the gff directory as well as in game.xml format in the game directory.


C. Gene Annotation De Novo Computational Results

The information below assumes that you have properly prepared the sequence files for annotation as described above. You can currently use the following five de novo gene annotation programs within DAWGPAWS. The native results from these programs are translated to a gff format that can be used with the Apollo program.

1. GENSCAN gene prediction program

Required Programs:
The GENSCAN program is not part of the TriAnnot web service, thus GENSCAN must be run on your local machine. It is also possible to run GENSCAN on a web server such as http://genes.mit.edu/GENSCAN.html. If you run GENSCAN on a remote web server, you may be able to convert the results to GFF format using the cnv_genscan2gff.pl program. To run GENSCAN locally in batch mode:
  >batch_genscan.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/
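
If you saved GENSCAN output from the web server, the conversion should look something like the following; this example assumes that cnv_genscan2gff.pl follows the same -i and -o conventions as the other DAWGPAWS conversion scripts, and the file names are placeholders:
  >cnv_genscan2gff.pl -i genscan_result.txt -o genscan_result.gff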

2. GeneMark.hmm

Required Programs:
The batch_genemark.pl program allows you to run GeneMark.hmm locally. Note that the GeneMark.hmm program requires that you obtain a license to run it on your local machine.
  >batch_genemark.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/

3. FGenesh

Required Programs:
The FGenesh program can be run in a limited form on the Softberry web site. I do not have access to the code to write a script for running this program in batch mode, so the alternative is to run the program on the Softberry web server and convert the saved results as described below.

If you have output from the softberry web site, you can convert to gff format using the cnv_fgenesh2gff.pl program:
  >cnv_fgenesh2gff.pl -i fgenesh_result.txt -o fgenesh_result.gff
If you did not save this output in text format, the cnv_fgenesh2gff.pl program will attempt to remove HTML tags before converting to gff format. To see the full list of options available in this program, read the cnv_fgenesh2gff.pl man page.

4. GeneID

Required Programs:
To run the geneid program in batch mode you would use the following command.
  >batch_geneid.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/ --param wheat.param.Apr_22_2004
You may also specify the full path to the geneid binary using the --geneid-path option.
  >batch_geneid.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/ --param wheat.param.Apr_22_2004
                   --geneid-path /usr/local/geneid.March_1_2005
The original results from the geneid program will be placed in the directory named geneid/ while the gff formatted results prepared for Apollo will be placed in the gff/ directory.

5. EuGène

Required Programs:
Running the EuGène program in batch mode will make use of the hard masked input fasta files, as well as a parameter file created for your organism. Parameter files in EuGène allow you to specify many of the settings used for gene prediction.
  >batch_eugene.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/ --param wheat_eugene.par
The HTML and image files resulting from EuGène will be placed in the eugene directory. The original gff output from EuGène will also be placed in the eugene directory, while a gff file translated for use in Apollo will be placed in the gff directory. The results from this batch_eugene run will differ from the results on the TriAnnot web server due to differences in the parameter files between the two pipelines.



D. Transposable Element De Novo Computational Results

The information below assumes that you have properly prepared the sequence files for annotation as described above. The following transposable element annotation steps can generally be done in any order, and you may choose to not include some of these programs in your analysis pipeline.

1. LTR_Struc Program for LTR Retrotransposon Prediction

Required Programs:
The LTR_Struc program has not been optimized for use in a high-throughput fashion: it has an executable binary that is only available for the Windows platform, and it does not provide output in a form that can be easily mapped back onto the query sequence. The source code for LTR_Struc is not available, so these limiting factors cannot be remedied directly. The process for running LTR_Struc therefore requires some additional steps, as outlined below:
  1. Prepare sequence files for analysis in the MS Windows environment. The ltrstruc_prep.pl program will convert the UNIX format line endings to DOS formatted files with the *.txt extension and will also create the flist.txt file required by LTR_Struc.
      >cd /home/YourName/projects/wheat_annotation
    >ltrstruc_prep.pl -i wheat_analysis/masked_soft/ -o for_ltrstruc/
  2. Copy the directory of transformed fasta files and flist.txt to a MS Windows machine for analysis and place these in the same directory as the LTR_Struc binary.
  3. Use the batch_ltrstruc.vbs program to run the LTR_Struc program in batch mode. This program is a Visual Basic script that sends the commands to the LTR_Struc command line. It has been tested and is known to work under the Windows XP operating system. You can download batch_ltrstruc.vbs from the DAWGPAWS subversion repository and place it in the same directory as the ltr_struc binary. As written, this will run LTR_Struc under the most stringent conditions.
      C:>cd ltrstrucdir
    C:>Cscript.exe //NoLogo batch_ltrstruc.vbs | cmd.exe
  4. When the LTR_Struc analysis is complete, transfer the LTR_Struc output back to your Linux machine.
  5. Use the cnv_ltrstruc2gff.pl program to convert the LTR_Struc output to a format that can be mapped back onto the query sequence. This script will extract sequence strings as reported by LTR_Struc and will map the strings back onto the query sequence using a Perl based string matching function. This matching function currently assumes only a single match to this string exists in the query sequence.
      >cd /home/YourName/projects/wheat_annotation
    >cnv_ltrstruc2gff.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/
                         --results wheat_analysis/ltrstruc_results/

2. LTR_FINDER Program for LTR Retrotransposon Prediction

Required Programs:
The following assumes that you have obtained a binary of the LTR_FINDER program. It is also possible to generate LTR_FINDER results using the LTR_FINDER web page, but I have not tested the DAWGPAWS scripts on the web page output. It is possible that the cnv_ltrfinder2gff.pl program will work for single fasta files analyzed with the LTR_FINDER web page.

The LTR_FINDER program has been designed with high throughput use in mind: it provides rich output on the location and biology of the LTR retrotransposons that it predicts, and it allows for fine scale control of search parameters. The batch_ltrfinder.pl program provides an interface to run LTR_FINDER with multiple parameter sets for each query sequence, and it converts the output to the GFF file format.
  >cd /home/YourName/projects/wheat_annotation
>batch_ltrfinder.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/
                    -c config/batch_ltrfinder.jcfg

3. find_ltr Program for LTR Retrotransposon Prediction

Required Programs:
It is unclear whether the current license for the find_ltr program allows modification of the source code. If the license prevents modification, this would unfortunately rule out changing parameters that are important for using this program for LTR discovery. I have modified the program code to accept these parameter changes at the command line, and the batch_findltr.pl program depends on these source code modifications. An example of using the find_ltr program in batch mode is illustrated below:
  >cd /home/YourName/projects/wheat_annotation
>batch_findltr.pl -i wheat_analysis/masked_soft -o wheat_analysis
                  -c config/batch_findltr.jcfg
The above example will produce a GFF file for each parameter set in the batch_findltr.jcfg file for each fasta input sequence recognized in the masked_soft input directory.

4. LTR_seq for LTR Retrotransposon Prediction

Required Programs:
Running LTR_seq in batch mode uses the batch_ltrseq.pl program. This program can make use of a configuration file that specifies a name for each parameter set and the LTR_seq config file in which that parameter set is defined. The following example shows batch_ltrseq.pl run with no configuration file; this will run the LTR_seq program using the default parameters:
  >batch_ltrseq.pl -i wheat_analysis/masked_soft -o wheat_analysis
You may also specify a configuration file with the -c option. This config file allows the batch_ltrseq.pl program to run the LTR_seq program for multiple parameter combinations for every fasta file in the input sequence directory. The configuration file is a two column, tab delimited text file: column 1 gives the configuration set name, and column 2 gives the path to the LTR_seq config file.
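
A minimal config file in this format might look like the following, where the parameter set names and LTR_seq config file paths are placeholders of your own choosing, and the two columns are separated by a tab:
  default	config/ltrseq_default.cfg
  stringent	config/ltrseq_stringent.cfg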

An example using batch_ltrseq.pl with a config file is below:
  >batch_ltrseq.pl -i wheat_analysis/masked_soft -o wheat_analysis
                   -c ltrseq_set.jcfg
Please see the batch_ltrseq.pl documentation for more information regarding the batch_ltrseq config file as well as how to designate parameters for the LTR_seq program.

You may also do an individual run of LTR_seq manually and then convert the output to gff using the cnv_ltrseq2gff.pl program.

  >cd /home/YourName/projects/wheat_annotation/wheat_analysis/HEX0014K09
>mkdir ltr_seq
>cnv_ltrseq2gff.pl -i ltr_seq/ltr_seq_out.txt -o gff/HEX0014K09_ltrseq.gff
                   -s HEX0014K09
This will produce a file named HEX0014K09_ltrseq.gff that contains every prediction from the LTR_seq output that was determined to be an LTR retrotransposon. Note that the current version of the LTR_seq program appears to produce duplicate predictions.

5. LTRharvest for LTR Retrotransposon Prediction

Required Programs:
LTRharvest is currently not supported in batch mode as part of the DAWGPAWS package; however, its GFF3 output can be manually opened in the newest version of Apollo.
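
If you want to produce that GFF3 output yourself, LTRharvest is distributed as part of the GenomeTools (gt) package (Ellinghaus et al. 2008). A typical manual run looks something like the following; the sequence and index names are placeholders, and the exact options may vary with your GenomeTools version:
  >gt suffixerator -db HEX0014K09.fasta -indexname HEX0014K09 -tis -suf -lcp -des -ssp -sds -dna
>gt ltrharvest -index HEX0014K09 -gff3 HEX0014K09_ltrharvest.gff3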

6. RepSeek

Information on running RepSeek within DAWGPAWS has not yet been written.

Required Programs:

7. FINDMITE for MITE Discovery

Required Programs:
The batch_findmite program will do a FINDMITE analysis for each parameter set in your configuration file for each query sequence in your input directory. The results from FINDMITE have a VERY high false positive rate, so you will need to further evaluate your results to find the true MITEs in your query sequence.

To run the batch_findmite.pl program you would do the following:
  >cd /home/YourName/projects/wheat_annotation
>batch_findmite.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/
                   -c config/batch_findmite.jcfg --gff
The FINDMITE results will be placed in the findmite directory, while the results converted to the GFF format will be placed in the gff directory.

8. Tandem Repeats Finder (TRF)

Required Programs:
The Tandem Repeats Finder can be run in batch mode. This script works with TRF 4.0; I have not tried to use it with earlier versions of TRF.
  >batch_trf.pl -i wheat_analysis/masked_soft -o wheat_analysis
                -c config/batch_trf.jcfg
The TRF data file will be placed in the trf directory. The gff formatted results will be placed in the gff directory.



E. Transposable Element Homology Based Computational Results

The following computational results for transposable elements all require a database of known TEs for annotation. These are therefore identified as homology based computational results.

1. RepeatMask Sequences

Required Programs:
The process for running RepeatMasker to identify known Transposable Elements is described above.

2. TE Nest

The TE Nest program can annotate nested insertions. You have the option of running the TE Nest program as a web service, or downloading the TE Nest program and running it on a local machine.

TE Nest Web Service

Required Programs:
The TE Nest program is provided as a web page based service from the Plant GDB web server. The TE Nest program uses a homology based search approach to identify nested elements in your query sequence. The fetch_tenest.pl program can be used to bulk download results from the TE Nest web server and convert results from the TE Nest text format to the GFF format.
  1. Submit your sequence jobs to the TE Nest web server http://www.plantgdb.org/PlantGDB-cgi/TE_nest/cgi/displayTE.pl
  2. Record the job id that is returned for your sequence file by adding a row to your config file that indicates (1) the name of your query sequence as used in the rest of your analysis, and (2) the job id that TE Nest assigned to your submission. An example config file is available from the DAWGPAWS SVN site, and a minimal sketch is shown after this list.
  3. When all of your jobs have completed on the TE Nest web server, download the results using the fetch_tenest.pl program.
      >cd /home/YourName/projects/wheat_annotation
    >fetch_tenest.pl -c tenest_results.txt -o wheat_analysis/ --gff
    This will place the results in the wheat_analysis folder with a separate set of folders for each query sequence defined in the config file.
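
A fetch_tenest.pl config file along these lines contains one tab delimited row per submission, with the query name in the first column and the TE Nest job id in the second. The job id below is a made-up placeholder; use whatever id the TE Nest server returned for your submission:
  HEX0014K09	1239871234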

TE Nest Local Installation

An alternative to running TE Nest as a web service is to run TE Nest locally using the batch_tenest.pl program, for example if you want to use your own transposable element database (see below).

Required Programs:
  >cd /home/YourName/projects/wheat_annotation
>batch_tenest.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/ --org wheat
TE Nest will use the maize database by default; valid options for --org include barley, wheat, maize, and rice. This script currently does not convert the output to gff format. You may also create your own database for use with TE Nest using the cnv_fasta2tenest.pl program. This will create the directory structure and files needed to run TE Nest on a set of transposable elements.

3. HMMER - TE Models

Required Programs:
It is possible to use profile hidden Markov models of known transposable elements to find the locations of TEs in your query sequence. This approach was used to identify MITEs and MULEs in the rice genome (Juretic et al. 2004). You can produce your own profile HMMs using multiple alignments of sequences of your choice, or you can test this approach using HMMs developed for rice.
  >cd /home/YourName/projects/wheat_annotation
>batch_hmmer.pl -i wheat_analysis/masked_soft -o wheat_analysis
                -c config/batch_hmmer.jcfg --gff

4. Oligomer Counts

Required Programs:
The seq_oligocount.pl program is at an early stage of development. It currently breaks the query sequence into oligomers of length k and then uses the vmatch program to query these k-mers against an index database to determine oligomer copy number depth in that database. The index database could be an index of the raw shotgun reads, an index of the assembled reads, or an index of an external dataset. The result produced by seq_oligocount.pl is a GFF format file that gives the oligomer copy number of every segment of length k in the query sequence. It would make sense to allow for binning these results over some window, but that is currently not implemented. The process for using this program is outlined below.
  1. Create a persistent index of your database with the mkvtree program. The example below shows how to index the file named my_seqs.fasta.
      >cd /home/YourName/projects/wheat_annotation
    >mkdir seq_index
    >cp my_seqs.fasta seq_index/
    >cd seq_index/
    >mkvtree -db my_seqs.fasta -dna -allout -pl
    This index can be used to generate an index of coverage.
  2. Use the persistent index created above as a database to query your sequence against using the seq_oligocount.pl program. The example below shows how to create an oligo count tier for the fasta file HEX0014K09.fasta where the length of the oligos is 20 bases and the index file is the my_seqs.fasta file from above. The output will be placed in the directory: wheat_analysis/HEX0014K09/
      >cd /home/YourName/projects/wheat_annotation
    >seq_oligocount.pl --infile masked_soft/HEX0014K09.fasta -n HEX0014K09
                       --db seq_index/my_seqs.fasta -k 20
                       --outdir wheat_analysis/HEX0014K09/
The seq_oligocount.pl program will place the GFF formatted results in the gff directory.

The results can be translated to UCSC wiggle format using the cnv_gff2wig.pl program. This program currently takes three arguments: a name, a description, and the filename to process.

  >cnv_gff2wig.pl 20mer_counts 20mers vmatch_out.gff
Note, however, that as currently written this will produce a file format that is not compatible with Apollo; this function is under active development.

As an alternative to converting to the wiggle format, you can use the gff_seg.pl program to segment the raw oligomer counts into features that exceed a threshold value. For example, to generate a gff feature file of segments with oligomers that occur at least 50 times in the index database:
  >gff_seg.pl --infile HEX0014K09_20mer.gff --seg-out HEX0014K09_50.gff --thresh 50
This will produce a gff output file that defines all segments that are represented by 50 or more copies of sequential 20mers in the query index.


F. NCBI-BLAST Homology Searches

Additional gene and TE annotation tiers are added using the NCBI BLAST program.

NOTE: THE FOLLOWING INFORMATION REFERS TO THE TRADITIONAL NCBI-BLAST COMMAND LINE PROGRAM. THE BLAST+ PROGRAM IS NOT CURRENTLY SUPPORTED. -- OCTOBER 20, 2009

1. NCBI-BLAST processes

Homology searches with NCBI-BLAST will use the soft masked files generated above.

First, you will need to prepare sequence files to serve as databases for BLAST queries using the formatdb command. The DAWGPAWS programs use the name that BLAST assigns to your database, so you should indicate the database name (-n) and title (-t) when formatting the database. Choose a shortened form of the database name with no spaces. The following shows how to format the fasta file my_database_of_repeats.fasta using the short name repdb as the database name.

  >formatdb -i my_database_of_repeats.fasta -p F -n repdb -t repdb

For a small number of blast jobs, you can run the BLAST jobs on your own machine using the batch_blast.pl program with the batch_blast_full.jcfg configuration file. The -d argument indicates the location of the directory containing the BLAST databases.

  >batch_blast.pl -i wheat_annotation/soft_masked -o wheat_annotation/
                  -c batch_blast_full.jcfg -d /db/paws/
                  --logfile wheat_annotation/blast_job.log

The blast results will be placed in the 'blast' directory for each contig, and the results translated to the gff format will be placed in the 'gff' directory.

Running BLAST jobs in a cluster computing environment

For larger numbers of BLAST jobs, the BLAST program will generally be run in a parallel cluster computing framework. This will require the following steps:

  1. Split the fasta directory into subdirectories. Each subdirectory will be analyzed on a separate node on the cluster machine.
      >fasta_dirsplit.pl -i wheat_analysis/soft_masked -o wheat_analysis/blast_dirs/
                         -n 16 -b geterdone
  2. Copy the new dirs to the cluster machine.
      >scp -p -r geterdone*/ YourName@yourclustermachine:/scratch/YourName/wheat_in/
  3. Run the blast processes on the cluster machine. The specifics of running the blast processes will depend on the scheduling software used by your cluster. The easiest way to deal with this is to write a shell script which submits the jobs to your queue; then you just need to run that shell script to execute the blast jobs on the cluster. A sketch of such a script is shown after this list.
      >./subjob.sh
  4. Download the BLAST results to your local machine.
      >scp -p -r YourName@yourclustermachine:/scratch/YourName/wheat_out/
                 /home/YourName/projects/wheat_annotation/blast_results
  5. Merge the BLAST results into the wheat_analysis directory.
      >cd /home/YourName/projects/wheat_annotation
    >dir_merge.pl -i blast_results/wheat_out -o wheat_analysis
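
The contents of subjob.sh depend entirely on the scheduler on your cluster. The sketch below assumes a Platform LSF queue (the clust_write_shell.pl script listed later in this manual also targets LSF); the queue name, paths, and the choice to run batch_blast.pl over each split directory are placeholder assumptions to adapt to your site:

#!/bin/bash
# subjob.sh - submit one batch_blast.pl job per split input directory.
# Assumes the LSF scheduler (bsub); adjust the queue name and paths
# for your cluster.
for dir in /scratch/YourName/wheat_in/geterdone*/ ; do
    bsub -q normal -o "$dir"blast.log \
        "batch_blast.pl -i $dir -o /scratch/YourName/wheat_out/ \
         -c batch_blast_full.jcfg -d /db/paws/"
done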


G. Preparing Computational Results for Apollo

1. Audit the computational results

The audit program will move gff files to the gff dir if needed and will alert you to any expected files that could not be found. Currently this program only audits a subset of the expected results from DAWGPAWS with a focus on gene annotation output. This program will probably end up getting deprecated, but I will leave it here for now.

  >cd /home/YourName/projects/wheat_annotation
>batch_audit.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/ --full --color

The --full option will run a complete audit; this currently includes the TriAnnotation output.

The --color option will print error messages about missing files in red font.

2. Concatenate the gff files

This is currently a manual step that needs to be automated.

For each query sequence, navigate to the directory containing the gff output files. For example, for the file HEX0014K09

  >cd /home/YourName/projects/wheat_annotation/wheat_analysis/HEX0014K09/gff/

Concatenate all of the gff files in this directory into a single gff output file:

  >cat *.gff > HEX0014K09.gff

3. Convert concatenated gff file to game.xml format

Converting the concatenated gff file can be done using the batch_convert.pl program. This takes the directory of fasta files as its input. The output will be stored in the game directory, and a copy of all of the game.xml files will also be placed in the base wheat_analysis directory.

  >cd /home/YourName/projects/wheat_annotation/wheat_analysis
>batch_convert.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/

H. Human Curation of Computational Results

The Apollo program can be used for human curation of the computational results generated above. Information on Apollo is available at: http://www.dhgp.org/current/index.html with the installation files available from: http://apollo.berkeleybop.org/current/install.html. Your use of Apollo to annotate genomes will be facilitated by the wheat.tiers file that I have developed for wheat. I have also attempted to make this wheat.tiers file compatible with output from the TriAnnot web service.



III. ADDITIONAL INFORMATION

The following information is provided for additional reference.



1. Apollo Data Tiers

The GFF files produced by the computational results above are intended to be visualized using the Apollo Genome Annotation Curation Tool. The way that Apollo displays data can be modified through the use of tiers files and style files. These files are placed in your home directory in the directory named '.apollo'; the dot in front of the name indicates that this directory is hidden. A tiers file and a style file have been created for visualizing the results for wheat annotation, and these are included as part of the DAWGPAWS SVN package.
I prefer to point to these using symlinks from within my home directory as shown below.
  >cd $HOME/.apollo
>ln -s $HOME/apps/dawgpaws/apollo/wheat.tiers fly.tiers
>ln -s $HOME/apps/dawgpaws/apollo/wheat.style fly.style
These will create files named fly.tiers and fly.style that simply point to the location of the wheat.tiers and wheat.style files in the dawgpaws SVN directory. The advantage of this is that when additional data tiers are added to the computational evidence that I use, I can create a new tiers file and propagate it to all of the annotators using SVN.


2. Computational Results Directory Structure

The DAWGPAWS programs will store the computational results in a predefined directory structure. The software will establish this structure when output from the various annotation programs are produced or parsed.

The example below shows the directory structure of computational results for the BAC identified as HEX0014K09.

IV. MANUAL FILES FOR INDIVIDUAL PROGRAMS

Links to the program manuals for the individual scripts are listed below in alphabetical order. These are extensive help files for each command line program used in the genome annotation process.
batch_blast.pl
Do NCBI-BLAST searches for a set of fasta files
batch_cnv_blast2gff.pl
Convert blast output to GFF format
batch_cnv_ta2ap.pl
Convert TriAnnotation to Apollo GFF
batch_eugene.pl
Run the eugene annotation program in batch mode
batch_findgaps.pl
Annotate gaps in a fasta file
batch_findltr.pl
Run the find_ltr.pl program in batch mode
batch_findmite.pl
Run the findmite program in batch mode
batch_game2gff.pl
Convert game.xml annotations to GFF format
batch_geneid.pl
Run the geneid program in batch mode.
batch_genemark.pl
Run GeneMark.hmm and parse results to a GFF format file
batch_genscan.pl
Run genscan in batch mode and parse results to GFF format
batch_gff2game.pl
Convert GFF files to the game.xml format
batch_hardmask.pl
Hardmask a directory of softmasked fasta files
batch_hmmer.pl
Run the HMMER program in batch mode
batch_ltrfinder.pl
Run the LTRFinder program in batch mode
batch_ltrseq.pl
Run the LTR_seq program in batch mode
batch_repmask.pl
Run RepeatMasker and parse results to a GFF format file
batch_seq_summary.pl
Print summary info for a directory of sequence files
batch_tenest.pl
Run the TE nest program in batch mode on a directory of fasta files.
batch_trf.pl
Run the Tandem Repeats Finder program in batch mode.
clust_write_shell.pl
Write shell scripts for the Platform LSF queuing system
cnv_blast2gff.pl
Convert BLAST output to the gff format.
cnv_fgenesh2gff.pl
Convert a single FGENESH output to GFF format
cnv_findltr2gff.pl
Convert a single findltr output file to GFF format
cnv_findmite2gff.pl
Convert a single findmite output file to GFF format
cnv_game2gff3.pl
Convert a game xml file to the GFF3 format.
cnv_genemark2gff.pl
Convert a single output file from genemark to GFF format
cnv_gff2game.pl
Convert a GFF file to the game.xml format.
cnv_ltrfinder2gff.pl
Convert a single output file from ltrfinder to GFF format
cnv_ltrseq2gff.pl
Convert a single output file from LTR_seq to GFF format
cnv_ltrstruc2gff.pl
Convert output from LTR_struc to GFF format
cnv_repmask2gff.pl
Convert a single output file from RepeatMasker to GFF format
cnv_repseek2gff.pl
Convert a single output from RepSeek to GFF format.
cnv_seq2dir.pl
Convert a multiple record sequence file to multiple files.
cnv_ta2ap.pl
Convert the TriAnnotation GFF3 output to apollo compatible GFF format
cnv_tenest2gff.pl
Convert a single output from the TE Nest to GFF format
dir_merge.pl
Merge directories of DAWGPAWS output to a single directory
fasta_merge.pl
Merge a directory of fasta files into a single fasta file.
fasta_dirsplit.pl
Split a directory of fasta files into n subdirectories
fasta_shorten.pl
Change fasta headers to shorter names
fetch_tenest.pl
Download a set of results from the TE Nest web server
gff_seg.pl
Segment and parse a large gff file
ltrstruc_prep.pl
Create the files needed to run the LTR_struc program
seq_oligocount.pl
Count oligomer redundancy for an input sequence
vennseq.pl
Create Venn Diagrams of sequence features


V. LITERATURE CITED

Achaz, G., F. Boyer, et al. (2007). "Repseek, a tool to retrieve approximate repeats from large DNA sequences." Bioinformatics 23(1): 119-21.

Altschul, S. F., T. L. Madden, et al. (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res 25(17): 3389-402.

Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA sequences." Nucleic Acids Res 27(2): 573-80.

Burge, C. and S. Karlin (1997). "Prediction of complete gene structures in human genomic DNA." J Mol Biol 268(1): 78-94.

Edgar, R. C. and E. W. Myers (2005). "PILER: identification and classification of genomic repeats." Bioinformatics 21 Suppl 1: i152-8.

Ellinghaus, D., S. Kurtz, et al. (2008). "LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons." BMC Bioinformatics 9: 18.

Juretic, N., T. E. Bureau, et al. (2004). "Transposable element annotation of the rice genome." Bioinformatics 20(2): 155-160.

Kalyanaraman, A. and S. Aluru (2006). "Efficient algorithms and software for detection of full-length LTR retrotransposons." J Bioinform Comput Biol 4(2): 197-216.

Kent, W. J. (2002). "BLAT--the BLAST-like alignment tool." Genome Res 12(4): 656-64.

Kronmiller, B. A. and R. P. Wise (2007). "TE nest: Automated chronological annotation and visualization of nested plant transposable elements." Plant Physiol.

Kurtz, S., J. V. Choudhuri, et al. (2001). "REPuter: the manifold applications of repeat analysis on a genomic scale." Nucleic Acids Res 29(22): 4633-42.

Kurtz, S. (2004). Vmatch. http://www.vmatch.de/

Lewis, S. E., S. M. Searle, et al. (2002). "Apollo: a sequence annotation editor." Genome Biol 3(12): RESEARCH0082.

McCarthy, E. M. and J. F. McDonald (2003). "LTR_STRUC: a novel search and identification program for LTR retrotransposons." Bioinformatics 19(3): 362-7.

Parra, G., E. Blanco, et al. (2000). "GeneID in Drosophila." Genome Res 10(4): 511-5.

Quesneville, H., C. M. Bergman, et al. (2005). "Combined evidence annotation of transposable elements in genome sequences." PLoS Comput Biol 1(2): 166-75.

Rho, M., J. H. Choi, et al. (2007). "De novo identification of LTR retrotransposons in eukaryotic genomes." BMC Genomics 8: 90.

Schiex, T., A. Moisan, et al. (2001). EuGene: An Eucaryotic Gene Finder that combines several sources of evidence. Computational Biology. O. Gascuel and M.-F. Sagot: 111-125.

Tu, Z. (2001). "Eight novel families of miniature inverted repeat transposable elements in the African malaria mosquito, Anopheles gambiae." Proc Natl Acad Sci U S A 98(4): 1699-704.

Wicker, T., F. Sabot, et al. (2007). "A unified classification system for eukaryotic transposable elements." Nat Rev Genet 8(12): 973-82.

Xu, Z. and H. Wang (2007). "LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons." Nucleic Acids Res 35(Web Server issue): W265-8.


Author: James Estill
Last Updated: October 20, 2009
