DAWGPAWS
User Manual

James C. Estill
April 30, 2009


TABLE OF CONTENTS

I. GETTING STARTED

  1. Install DAWGPAWS
  2. Install Required Software
  3. Install Required Perl Modules
  4. Define Environment Variables
  5. Test Installation

II. ANNOTATION PROCESS

  1. Prepare Sequence Files for Annotation
    1. Split FASTA Files with multiple records
    2. Shorten FASTA file names
    3. Softmask Sequence files
    4. Hardmask Sequence files
  2. Structural Feature Annotation
    1. Annotate Gaps in the Assembly
  3. Gene Annotation De Novo Computational Results
    1. GenScan
    2. GeneMark.hmm
    3. EuGene
    4. GeneID
    5. FGenesh
  4. Transposable Element De Novo Computational Results
    1. LTR_Struc
    2. LTR_FINDER
    3. find_ltr
    4. ltr_seq
    5. LTRHarvest - not currently implemented
    6. Repseek
    7. FINDMITE
    8. Tandem Repeats Finder
  5. Transposable Element Homology Based Computational Results
    1. RepeatMasker
    2. TE Nest
    3. HMMER - TE Models
    4. Oligomer Counts
  6. BLAST Based Homology Searches
    1. NCBI-BLAST
  7. Preparing Computational Results for Apollo
  8. Human Curation of Computational Results

III. ADDITIONAL INFORMATION

  1. Apollo Data Tiers
  2. Computational Results Directory Structure

IV. MANUAL FILES FOR INDIVIDUAL PROGRAMS

V. LITERATURE CITED



I. GETTING STARTED

Getting started with DAWGPAWS may require that you install additional software or define variables in your user environment. The steps involved in preparing to do genome annotation with the DAWGPAWS package are described in detail below.

This user documentation generally assumes that you are running DAWGPAWS on a Unix or Linux machine and that you are operating from the bash shell. Wherever I have written yourname below, simply substitute your user name.



Install DAWGPAWS

The release 1.0 package of the DAWGPAWS suite of programs can be downloaded from the DAWGPAWS SourceForge web site. It is also possible to check out the 'live' version of DAWGPAWS using Subversion.

Download From SourceForge

The easiest way to obtain the most stable version of the DAWGPAWS programs is to download the programs from the SourceForge project downloads page.

Anonymous Checkout via SVN

An alternative way to download the DAWGPAWS programs is to anonymously check out the SVN code from the code repository on SourceForge. This will give you the most recent version of all of the DAWGPAWS programs and documentation, including recent bug fixes that may not yet have been added to the stable release, as well as access to experimental new features as they are added. These programs are constantly changing, however, and using the newest versions may introduce new bugs into your workflow.

Checking out the programs via SVN requires that you have the SVN client program installed, which can be downloaded from http://subversion.tigris.org/project_packages.html. Once you have the SVN client installed, navigate to the location where you want to place the DAWGPAWS programs and check out the code. This is illustrated by the following commands:
  >mkdir /home/yourname/apps
>cd /home/yourname/apps/
>svn co https://dawgpaws.svn.sourceforge.net/svnroot/dawgpaws/trunk dawgpaws
The above command will download working copies of all of the scripts, program manuals, and emacs modes from the SourceForge repository and place them in the dawgpaws directory. The svn 'co' command checks out the version of DAWGPAWS that is the most up to date at the time of your initial download. To update to the most recent version of DAWGPAWS at any time in the future, run the svn update command from within the dawgpaws directory:
  >cd /home/yourname/apps/dawgpaws
>svn update
This will update to the most current version of DAWGPAWS.

Symbolic Link to the Perl Binary

In Linux and the Mac OS X environment, it is possible to run all of the DAWGPAWS programs without needing to type perl before the program name. To make this possible, all of the DAWGPAWS programs assume that your local installation of perl is at /usr/bin/perl. You can see this by opening up a DAWGPAWS program in a text editor. You will see that #!/usr/bin/perl -w is the first line in the program.
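
You can check where the perl binary actually lives on your machine with the standard which command:
  >which perl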

If your local copy of the perl binary is not at the /usr/bin/perl location, you can make a symbolic link using the 'ln' command. In general, a symbolic link is created as
  >ln -s existing_file new_name
where existing_file is the path to the file on your machine, and new_name is the name of the new shortcut. The following example shows how to make a shortcut when your local installation of perl is at /usr/local/bin/perl instead of the expected /usr/bin/perl:
  >ln -s /usr/local/bin/perl /usr/bin/perl



Install Additional Required Software

The DAWGPAWS suite of programs is a set of scripts that provide for high throughput execution of existing genome annotation software. The following is an alphabetical list of software and web services that may be used by DAWGPAWS, with links to the source files for installing the software or to the web services. The operating systems (OS) that the software can be executed on are also listed, as well as references (Ref) to the peer-reviewed publications describing the software. You do not need to install all of these programs to make use of the DAWGPAWS package; this is the complete list of programs that DAWGPAWS can use.


Install Required Perl Modules

In addition to the standard modules included in most installations of Perl, a number of additional Perl modules are required. In particular, the sequence handling and conversion scripts rely on the bioperl modules.
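
You can check whether a given module is installed by asking perl to load it at the command line. The example below tests for Bio::SeqIO from the bioperl distribution, which the DAWGPAWS sequence conversion scripts use; if the module is missing, perl will report that it cannot be located:
  >perl -MBio::SeqIO -e 'print "bioperl ok\n";'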


Define Environment Variables

In the Unix-like operating systems, including Unix, Linux, and Mac OS X, you can define variables in your command line environment. These variables change the way you interact with the command line and can be used to simplify common tasks. Modifying the environment will allow you to use a program at the command line without referring to its full path, and can simplify using DAWGPAWS scripts by letting you define common variables in your environment instead of defining them at the command line every time you use a program. The following information shows you how to define environment variables for use by the DAWGPAWS suite of programs. For more information on the shell and user environment in Linux you can refer to http://www.comptechdoc.org/os/linux/usersguide/linux_ugenvironment.html.

Add the DAWGPAWS code directory to your Path

Adding the DAWGPAWS code directory to your path will allow you to type DAWGPAWS commands without needing to type the entire path to the program. For example, you will be able to just type batch_blast.pl to launch the batch_blast.pl program instead of needing to type /home/yourname/code/dawgpaws/scripts/batch_blast.pl. In the bash shell you would add the following line to your user profile:
  export PATH=$PATH:$HOME/apps/dawgpaws/scripts
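
For the change to take effect in your current shell session, reload the profile you edited; shown here for .bashrc, substitute .profile or .bash_profile if that is where you added the line:
  >source ~/.bashrc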

Define paths to required software

Many of the DAWGPAWS command line programs use external software that may be stored at different locations in different user environments. You can choose to define the location of these external programs at the command line, but you are also able to define these paths in your user environment.

An example of setting these path variables in the bash shell is below:

  #FIND_LTR
export PATH=$PATH:/home/yourname/apps/LTRDeNovo/LTR/tool

The environment options that are available for each program are described in the man page under the Configuration and Environment heading.

Full variable set for the bash shell

In the bash shell you can copy and paste the following into your .bashrc or .profile file. The following assumes that you have installed the applications in a directory named apps in your home directory. You will of course need to modify the directory paths to the true locations of the software on your machine. Note that the $HOME variable below refers to your user home directory (ie /home/jestill/). Using the $HOME variable instead of your actual path makes this profile movable among different machines where your home directory may be at a different location. Be sure to leave $HOME unquoted or inside double quotes so that the shell expands it; single quotes would prevent the expansion.

#-----------------------------+
# SOFTWARE                    |
#-----------------------------+

# DAWGPAWS SCRIPTS
export PATH=$PATH:$HOME/code/dawgpaws/scripts

# FIND_LTR
export PATH=$PATH:$HOME/apps/LTRDeNovo/LTR/tool
export FIND_LTR_ROOT="$HOME/apps/LTRDeNovo/LTR/tool/"

# LTR_FINDER
export TRNA_DB="$HOME/apps/LTR_FINDER."
export PROSITE_DIR="$HOME/apps/LTR_FINDER/ps_scan/"

# LTR_seq
export PATH=$PATH:$HOME/apps/ltr_seq

# EuGene
export EUGENEDIR="$HOME/apps/eugene-3.3"

#-----------------------------+
# DAWGPAWS VARS               |
#-----------------------------+
# The following variables are used directly by DAWGPAWS scripts

# LTR_Finder path
export LTR_FINDER="$HOME/apps/ltr_finder/ltr_finder"

# GENSCAN
export DP_GENSCAN_BIN="$HOME/apps/genscan/genscan"
export DP_GENSCAN_LIB="$HOME/apps/genscan/Maize.smat"

# GeneMark
export GM_BIN_DIR="$HOME/apps/GenMark/genemark_hmm_euk.linux/"
export GM_LIB_DIR="$HOME/apps/GenMark/genemark_hmm_euk.linux/"

# VennMaster
export VMASTER_DIR="$HOME/apps/VennMaster/VennMaster-0.36.0/"
export VMASTER_JAVA_BIN='/usr/java/jre1.6.0/bin/java'

# GeneID
export GENEID_BIN=/usr/local/genome/geneid/geneid_v1.3/geneid/bin/geneid

# LTR_seq
export LTR_SEQ_DIR="$HOME/apps/ltr_seq/"
export LTR_SEQ_BIN="$HOME/apps/ltr_seq/LTR_seq"

# TRF - Tandem Repeats Finder Location
export TRF_BIN="$HOME/apps/bin/trf400.linux.exe"

# DAWGPAWS NCBI-BLAST OPTIONS
export DP_BLAST_BIN='/usr/local/genome/ncbiblast/blast-2.2.13/bin/blastall'
export DP_BLAST_DIR="$HOME/paws/"

# TEnest Options
export TE_NEST_BIN='/home/jestill/Apps/te_nest/TEnest.pl'
#export TE_NEST_DIR='/home/jestill/Apps/te_nest/'
#export DP_WUBLAST_DIR='/usr/local/genome/wu_blast/'

# RepeatMasker
export DP_RM_BIN='RepeatMasker'




Test Your Installation of DAWGPAWS

A number of test programs have been written to check the user environment, to check for required software, and to test that the conversion programs are installed properly. The test files use test data that are part of the DAWGPAWS package. Following Perl conventions, these test scripts are in the t/ directory of DAWGPAWS, and the data required for the tests are in the data subdirectory within this test directory. To run one of these test scripts, navigate to the t/ directory and run the test script as:
  >./dp_cnv_test.t

A number of individual tests are provided for the batch_run programs, the conversion programs, as well as individual options in the DAWGPAWS environment. To run all of these tests at one time, you can use the dp_test_all.t script. It is recommended that you test individual components one at a time as you install the required software. The test files available as of April 30, 2009 are listed below.
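
For example, to run the entire test suite from within the DAWGPAWS test directory:
  >cd /home/yourname/apps/dawgpaws/t
>./dp_test_all.t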


II. ANNOTATION PROCESS

Despite the name, the use of DAWGPAWS for annotation is really more of a process than a pipeline. There is not a single place where you drop in a sequence, push a button, and get a fully annotated genome as output. The series of scripts that comprise DAWGPAWS are designed to be robust across operating systems as much as possible, and are designed to be usable in an a la carte fashion. The following outline of an annotation process indicates the use of the full suite of DAWGPAWS programs; however, a subset of these programs could be chosen for producing computational results for genome annotation.

The process described below assumes that you will be curating the computational evidence using the Apollo Genome Annotation Curation tool. I assume that you are using the game.xml file format as the working copy of your annotation.



A. Preparing Sequence Files for Annotation

The following is an overview of the process used to annotate sequences with the DAWGPAWS set of programs. This assumes that you are running the analysis on a Linux machine, and that you are storing your annotation computational results in the directory
  /home/yourname/projects/wheat_annotation/wheat_analysis/

This directory will have a subdirectory named for every annotated sequence. The subdirectories for the annotated sequence will follow the directory structure indicated in the DAWGPAWS Directory Structure below.

In all of the following command examples, the > character represents the command line prompt.

1. Split MultiFASTA Files

The entire DAWGPAWS process assumes that each FASTA file contains a single record representing a large contig such as an individual BAC, YAC, or chromosome pseudomolecule. If your query sequence file contains multiple fasta records, you will first need to split it into individual files. For example, if you had a multiple record fasta file named multi_fasta.fasta, you could split it into a single fasta file for each record using the following command:
  >cnv_seq2dir.pl -i multi_fasta.fasta -o outdir/
It is actually possible to read input sequences in any record format that is compatible with the bioperl SeqIO module. The format is specified with the -f option. The output files will always be generated in the fasta format.
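
For example, assuming you had a multiple record file in GenBank format (genbank is one of the sequence format names recognized by bioperl's SeqIO; the file name here is a placeholder), you could split it with:
  >cnv_seq2dir.pl -i multi_records.gb -f genbank -o outdir/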

2. Rename FASTA files to a short unique name

Because many of the programs that are used in the DAWGPAWS process have limits on the size of FASTA headers, it may be necessary to first shorten the name of the FASTA files.

Start with a directory in which you have a single fasta file for every BAC to process, at the following directory path:

  /home/YourName/projects/wheat_annotation/wheat_analysis/fasta_orig

Navigate to the wheat_analysis directory using the cd command to change directories:

  >cd /home/YourName/projects/wheat_annotation/wheat_analysis

Issue the following command from the wheat_analysis directory:

  >fasta_shorten.pl -i fasta_orig/ -o fasta_short/ -l 10 --uppercase

This will rename all of the fasta files in the fasta_orig directory and will place the results in the fasta_short directory. The -l option of 10 will shorten the names to 10 characters, and the --uppercase flag will convert all lowercase base calls to uppercase.

3. Soft mask the renamed files using RepeatMasker and the TREP database

For the wheat BACs, the nonredundant TREP database is used.

Navigate to the wheat_annotation directory

  >cd /home/YourName/projects/wheat_annotation

Issue the following command from the wheat_annotation directory

  >batch_repmask.pl -i wheat_analysis/fasta_short/ -o wheat_analysis/
                    -c wheat_analysis/batch_mask.jcfg --engine wublast

This will softmask all fasta files in the fasta_short directory and will place the results in the wheat_analysis directory.

The -c option should point to the location of the batch_repmask configuration file. The --engine wublast option will use the wublast engine for masking the sequences. The default behavior is to mask with crossmatch.

4. Hard mask the softmasked files generated above

Many of the gene prediction programs are not soft-masked "aware", so it is necessary to also make a hard masked copy of the fasta file. This will replace the lowercase letters with an N or X character. The example below shows masking the fasta files in the soft_masked directory with an uppercase N and placing the output in the hard_masked directory.

  >cd /home/YourName/projects/wheat_annotation/wheat_analysis
>batch_hardmask.pl -i soft_masked/ -o hard_masked/ -m N



B. Structural Feature Annotation

At this point, DAWGPAWS can only annotate gaps in the assembly. Structural features are defined here as features of the sequence, such as gaps and GC content, that are not directly related to the annotation of biological sequence features such as genes or transposable elements.

1. Annotate the gaps in the assembly

The following will find gaps in your fasta files.

  >cd /home/YourName/projects/wheat_annotation
>batch_findgaps.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/
The batch_findgaps.pl program is slow and uses an ugly algorithm, but it does work. Results will be placed in gff format in the gff directory as well as in game.xml format in the game directory.


C. Gene Annotation De Novo Computational Results

The information below assumes that you have properly prepared the sequence files for annotation as described above. You can currently use the following five de novo gene annotation programs within DAWGPAWS. The native results from these programs are translated to a gff format that can be used with the Apollo program.

1. GENSCAN gene prediction program

Required Programs:
The GENSCAN program is not part of the TriAnnot web service, thus GENSCAN must be run on your local machine. It is also possible to run GENSCAN on a web server such as http://genes.mit.edu/GENSCAN.html. If you run GENSCAN on a remote web server, you may be able to convert the results to GFF format using the cnv_genscan2gff.pl program. To run GENSCAN locally in batch mode:
  >batch_genscan.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/
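
If you saved GENSCAN output from the web server, the conversion should look something like the following; this example assumes that cnv_genscan2gff.pl follows the same -i and -o conventions as the other DAWGPAWS conversion scripts, and the file names are placeholders:
  >cnv_genscan2gff.pl -i genscan_result.txt -o genscan_result.gff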

2. GeneMark.hmm

Required Programs:
The batch_genemark.pl program allows you to run GeneMark.hmm locally. Note that the GeneMark.hmm program requires that you obtain a license to run it on your local machine.
  >batch_genemark.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/

3. FGenesh

Required Programs:
The FGenesh program can be run in a limited form on the Softberry web site. I do not have access to the code to write a script for running this program in batch mode, so the alternative is to run the program on the Softberry web server and convert the saved results as described below.

If you have output from the softberry web site, you can convert to gff format using the cnv_fgenesh2gff.pl program:
  >cnv_fgenesh2gff.pl -i fgenesh_result.txt -o fgenesh_result.gff
If you did not save this output in text format, the cnv_fgenesh2gff.pl program will attempt to remove HTML tags before converting to gff format. To see the full list of options available in this program, read the cnv_fgenesh2gff.pl man page.

4. GeneID

Required Programs:
To run the geneid program in batch mode you would use the following command.
  >batch_geneid.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/ --param wheat.param.Apr_22_2004
You may also specify the full path to the geneid binary using the --geneid-path option.
  >batch_geneid.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/ --param wheat.param.Apr_22_2004
                   --geneid-path /usr/local/geneid.March_1_2005
The original results from the geneid program will be placed in the directory named geneid/ while the gff formatted results prepared for Apollo will be placed in the gff/ directory.

5. EuGène

Required Programs:
Running the EuGène program in batch mode will make use of the hard masked input fasta files, as well as a parameter file created for your organism. Parameter files in EuGène allow you to specify many of the settings used for gene prediction.
  >batch_eugene.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/ --param wheat_eugene.par
The HTML and image files resulting from EuGène will be placed in the eugene directory. The original gff output from EuGène will also be placed in the eugene directory, while a gff file translated for use in Apollo will be placed in the gff directory. The results from this batch_eugene run will differ from the results on the TriAnnot web server due to differences in the parameter files between the two pipelines.



D. Transposable Element De Novo Computational Results

The information below assumes that you have properly prepared the sequence files for annotation as described above. The following transposable element annotation steps can generally be done in any order, and you may choose to not include some of these programs in your analysis pipeline.

1. LTR_Struc Program for LTR Retrotransposon Prediction

Required Programs:
The LTR_Struc program has not been optimized for use in a high-throughput fashion: it has an executable binary that is only available for the Windows platform, and it does not provide output in a form that can be easily mapped back onto the query sequence. The source code for LTR_Struc is not available, so these limiting factors cannot be remedied directly. The process for running LTR_Struc therefore requires some additional steps, as outlined below:
  1. Prepare sequence files for analysis in the MS Windows environment. The ltrstruc_prep.pl program will convert the UNIX format line endings to DOS formatted files with the *.txt extension and will also create the flist.txt file required by LTR_Struc.
      >cd /home/YourName/projects/wheat_annotation
    >ltrstruc_prep.pl -i wheat_analysis/masked_soft/ -o for_ltrstruc/
  2. Copy the directory of transformed fasta files and flist.txt to a MS Windows machine for analysis and place these in the same directory as the LTR_Struc binary.
  3. Use the batch_ltrstruc.vbs program to run the LTR_Struc program in batch mode. This program is a Visual Basic script that sends the commands to the LTR_Struc command line. It has been tested and is known to work under the Windows XP operating system. You can download batch_ltrstruc.vbs from the DAWGPAWS subversion repository and place it in the same directory as the ltr_struc binary. As written, this will run LTR_Struc under the most stringent conditions.
      C:>cd ltrstrucdir
    C:>Cscript.exe //NoLogo batch_ltrstruc.vbs | cmd.exe
  4. When the LTR_Struc analysis is complete, transfer the LTR_Struc output back to your Linux machine.
  5. Use the cnv_ltrstruc2gff.pl program to convert the LTR_Struc output to a format that can be mapped back onto the query sequence. This script will extract sequence strings as reported by LTR_Struc and will map the strings back onto the query sequence using a Perl based string matching function. This matching function currently assumes only a single match to this string exists in the query sequence.
      >cd /home/YourName/projects/wheat_annotation
    >cnv_ltrstruc2gff.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/
                         --results wheat_analysis/ltrstruc_results/

2. LTR_FINDER Program for LTR Retrotransposon Prediction

Required Programs:
The following assumes that you have obtained a binary of the LTR_FINDER program. It is also possible to generate LTR_FINDER results using the LTR_FINDER web page, but I have not tested the DAWGPAWS scripts on the web page output. It is possible that the cnv_ltrfinder2gff.pl program will work for single fasta files analyzed with the LTR_FINDER web page.

The LTR_FINDER program has been designed with high throughput use in mind: it provides rich output on the location and biology of the LTR retrotransposons that it predicts, and it allows for fine scale control of search parameters. The batch_ltrfinder.pl program provides an interface to run LTR_FINDER with multiple parameter sets for each query sequence, and it converts the output to the GFF file format.
  >cd /home/YourName/projects/wheat_annotation
>batch_ltrfinder.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/
                    -c config/batch_ltrfinder.jcfg

3. find_ltr Program for LTR Retrotransposon Prediction

Required Programs:
It is unclear whether the current license for the find_ltr program allows modification of the source code. If the license prevents modification, this would unfortunately rule out changing parameters that are important for using this program for LTR discovery. I have modified the program code to accept these parameter changes at the command line, and the batch_findltr.pl program depends on these source code modifications. An example of using the find_ltr program in batch mode is illustrated below:
  >cd /home/YourName/projects/wheat_annotation
>batch_findltr.pl -i wheat_analysis/masked_soft -o wheat_analysis
                  -c config/batch_findltr.jcfg
The above example will produce a GFF file for each parameter set in the batch_findltr.jcfg file for each fasta input sequence recognized in the masked_soft input directory.

4. LTR_seq for LTR Retrotransposon Prediction

Required Programs:
Running LTR_seq in batch mode uses the batch_ltrseq.pl program. This program can make use of a configuration file that specifies a name for each parameter set and the LTR_seq config file in which that parameter set is defined. The following example shows batch_ltrseq.pl run with no configuration file; this will run the LTR_seq program using the default parameters:
  >batch_ltrseq.pl -i wheat_analysis/masked_soft -o wheat_analysis
You may also specify a configuration file with the -c option. This config file allows the batch_ltrseq.pl program to run the LTR_seq program for multiple parameter combinations for every fasta file in the input sequence directory. The configuration file is a two column, tab delimited text file: column 1 gives the configuration set name, and column 2 gives the path to the LTR_seq config file.
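
A minimal config file in this format might look like the following, where the parameter set names and LTR_seq config file paths are placeholders of your own choosing, and the two columns are separated by a tab:
  default	config/ltrseq_default.cfg
  stringent	config/ltrseq_stringent.cfg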

An example using batch_ltrseq.pl with a config file is below:
  >batch_ltrseq.pl -i wheat_analysis/masked_soft -o wheat_analysis
                   -c ltrseq_set.jcfg
Please see the batch_ltrseq.pl documentation for more information regarding the batch_ltrseq config file as well as how to designate parameters for the LTR_seq program.

You may also do an individual run of LTR_seq manually and then convert the output to gff using the cnv_ltrseq2gff.pl program.

  >cd /home/YourName/projects/wheat_annotation/wheat_analysis/HEX0014K09
>mkdir ltr_seq
>cnv_ltrseq2gff.pl -i ltr_seq/ltr_seq_out.txt -o gff/HEX0014K09_ltrseq.gff
                   -s HEX0014K09
This will produce a file named HEX0014K09_ltrseq.gff that contains every prediction from the LTR_seq output that was determined to be an LTR retrotransposon. Note that the current version of the LTR_seq program appears to produce duplicate predictions.

5. LTRharvest for LTR Retrotransposon Prediction

Required Programs:
LTRharvest is currently not supported in batch mode as part of the DAWGPAWS package; however, its GFF3 output can be manually opened in the newest version of Apollo.
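
If you want to produce that GFF3 output yourself, LTRharvest is distributed as part of the GenomeTools (gt) package (Ellinghaus et al. 2008). A typical manual run looks something like the following; the sequence and index names are placeholders, and the exact options may vary with your GenomeTools version:
  >gt suffixerator -db HEX0014K09.fasta -indexname HEX0014K09 -tis -suf -lcp -des -ssp -sds -dna
>gt ltrharvest -index HEX0014K09 -gff3 HEX0014K09_ltrharvest.gff3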

6. RepSeek

Information on running RepSeek within DAWGPAWS has not yet been written.

Required Programs:

7. FINDMITE for MITE Discovery

Required Programs:
The batch_findmite program will do a FINDMITE analysis for each parameter set in your configuration file for each query sequence in your input directory. The results from FINDMITE have a VERY high false positive rate, so you will need to further evaluate your results to find the true MITEs in your query sequence.

To run the batch_findmite.pl program you would do the following:
  >cd /home/YourName/projects/wheat_annotation
>batch_findmite.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/
                   -c config/batch_findmite.jcfg --gff
The FINDMITE results will be placed in the findmite directory, while the results converted to the GFF format will be placed in the gff directory.

8. Tandem Repeats Finder (TRF)

Required Programs:
The Tandem Repeats Finder can be run in batch mode. This script works with TRF 4.0; I have not tried to use it with earlier versions of TRF.
  >batch_trf.pl -i wheat_analysis/masked_soft -o wheat_analysis
                -c config/batch_trf.jcfg
The TRF data file will be placed in the trf directory. The gff formatted results will be placed in the gff directory.



E. Transposable Element Homology Based Computational Results

The following computational results for transposable elements all require a database of known TEs for annotation. These are therefore identified as homology based computational results.

1. RepeatMask Sequences

Required Programs:
The process for running RepeatMasker to identify known Transposable Elements is described above.

2. TE Nest

The TE Nest program can annotate nested insertions. You have the option of running the TE Nest program as a web service, or downloading the TE Nest program and running it on a local machine.

TE Nest Web Service

Required Programs:
The TE Nest program is provided as a web page based service from the Plant GDB web server. The TE Nest program uses a homology based search approach to identify nested elements in your query sequence. The fetch_tenest.pl program can be used to bulk download results from the TE Nest web server and convert results from the TE Nest text format to the GFF format.
  1. Submit your sequence jobs to the TE Nest web server http://www.plantgdb.org/PlantGDB-cgi/TE_nest/cgi/displayTE.pl
  2. Record the job id that is returned for your sequence file by adding a row to your config file that indicates (1) the name of your query sequence as used in the rest of your analysis, and (2) the job id that TE Nest assigned to your submission. An example config file is available from the DAWGPAWS SVN site, and a minimal sketch is shown after this list.
  3. When all of your jobs have completed on the TE Nest web server, download the results using the fetch_tenest.pl program.
      >cd /home/YourName/projects/wheat_annotation
    >fetch_tenest.pl -c tenest_results.txt -o wheat_analysis/ --gff
    This will place the results in the wheat_analysis folder with a separate set of folders for each query sequence defined in the config file.
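
A fetch_tenest.pl config file along these lines contains one tab delimited row per submission, with the query name in the first column and the TE Nest job id in the second. The job id below is a made-up placeholder; use whatever id the TE Nest server returned for your submission:
  HEX0014K09	1239871234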

TE Nest Local Installation

An alternative to running TE Nest as a web service is to run TE Nest locally using the batch_tenest.pl program, for example if you want to use your own transposable element database (see below).

Required Programs:
  >cd /home/YourName/projects/wheat_annotation
>batch_tenest.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/ --org wheat
TE Nest will use the maize database by default; valid options for --org include barley, wheat, maize, and rice. This script currently does not convert the output to gff format. You may also create your own database for use with TE Nest using the cnv_fasta2tenest.pl program. This will create the directory structure and files needed to run TE Nest on a set of transposable elements.

3. HMMER - TE Models

Required Programs:
It is possible to use profile hidden Markov models of known transposable elements to find the locations of TEs in your query sequence. This approach was used to identify MITEs and MULEs in the rice genome (Juretic et al. 2004). You can produce your own profile HMMs using multiple alignments of sequences of your choice, or you can test this approach using HMMs developed for rice.
  >cd /home/YourName/projects/wheat_annotation
>batch_hmmer.pl -i wheat_analysis/masked_soft -o wheat_analysis
                -c config/batch_hmmer.jcfg --gff

4. Oligomer Counts

Required Programs:
The seq_oligocount.pl program is at an early stage of development. It currently breaks the query sequence into oligomers of length k and then uses the vmatch program to query these k-mers against an index database to determine oligomer copy number depth in that database. The index database could be an index of the raw shotgun reads, an index of the assembled reads, or an index of an external dataset. The result produced by seq_oligocount.pl is a GFF format file that gives the oligomer copy number of every segment of length k in the query sequence. It would make sense to allow for binning these results over some window, but that is currently not implemented. The process for using this program is outlined below.
  1. Create a persistent index of your database with the mkvtree program. The example below shows how to index the file named my_seqs.fasta.
      >cd /home/YourName/projects/wheat_annotation
    >mkdir seq_index
    >cp my_seqs.fasta seq_index/
    >cd seq_index/
    >mkvtree -db my_seqs.fasta -dna -allout -pl
    This index can be used to generate an index of coverage.
  2. Use the persistent index created above as a database to query your sequence against using the seq_oligocount.pl program. The example below shows how to create an oligo count tier for the fasta file HEX0014K09.fasta where the length of the oligos is 20 bases and the index file is the my_seqs.fasta file from above. The output will be placed in the directory: wheat_analysis/HEX0014K09/
      >cd /home/YourName/projects/wheat_annotation
    >seq_oligocount.pl --infile masked_soft/HEX0014K09.fasta -n HEX0014K09
                       --db seq_index/my_seqs.fasta -k 20
                       --outdir wheat_analysis/HEX0014K09/
The seq_oligocount.pl program will place the GFF formatted results in the gff directory.

The results can be translated to UCSC wiggle format using the cnv_gff2wig.pl program. This program currently takes three arguments: a name, a description, and the filename to process.

  >cnv_gff2wig.pl 20mer_counts 20mers vmatch_out.gff
Note, however, that as currently written this will produce a file format that is not compatible with Apollo; this function is under active development.

As an alternative to converting to the wiggle format, you can use the gff_seg.pl program to segment the raw oligomer counts into features that exceed a threshold value. For example, to generate a gff feature file of segments with oligomers that occur at least 50 times in the index database:
  >gff_seg.pl --infile HEX0014K09_20mer.gff --seg-out HEX0014K09_50.gff --thresh 50
This will produce a gff output file that defines all segments that are represented by 50 or more copies of sequential 20mers in the query index.


F. NCBI-BLAST Homology Searches

Additional gene and TE annotation tiers are added using the NCBI BLAST program.

NOTE: THE FOLLOWING INFORMATION REFERS TO THE TRADITIONAL NCBI-BLAST COMMAND LINE PROGRAM. THE BLAST+ PROGRAM IS NOT CURRENTLY SUPPORTED. -- OCTOBER 20, 2009

1. NCBI-BLAST processes

Homology searches with NCBI-BLAST will use the soft masked files generated above.

First, you will need to prepare sequence files to serve as databases for BLAST queries using the formatdb command. The DAWGPAWS programs use the name that BLAST assigns to your database, so you should indicate the database name (-n) and title (-t) when formatting the database. Choose a shortened form of the database name with no spaces. The following shows how to format the fasta file my_database_of_repeats.fasta using the short name repdb as the database name.

  >formatdb -i my_database_of_repeats.fasta -p F -n repdb -t repdb

For a small number of blast jobs, you can run the BLAST jobs on your own machine using the batch_blast.pl program with the batch_blast_full.jcfg configuration file. The -d argument indicates the location of the directory containing the BLAST databases.

  >batch_blast.pl -i wheat_annotation/soft_masked -o wheat_annotation/
                  -c batch_blast_full.jcfg -d /db/paws/
                  --logfile wheat_annotation/blast_job.log

The blast results will be placed in the 'blast' directory for each contig, and the results translated to the gff format will be placed in the 'gff' directory.

Running BLAST jobs in a cluster computing environment

For larger numbers of BLAST jobs, the BLAST program will generally be run in a parallel cluster computing framework. This will require the following steps:

  1. Split the fasta directory into subdirectories. Each subdirectory will be analyzed on a separate node on the cluster machine.
      >fasta_dirsplit.pl -i wheat_analysis/soft_masked -o wheat_analysis/blast_dirs/
                         -n 16 -b geterdone
  2. Copy the new dirs to the cluster machine.
      >scp -p -r geterdone*/ YourName@yourclustermachine:/scratch/YourName/wheat_in/
  3. Run the blast processes on the cluster machine. The specifics of running the blast processes will depend on the scheduling software used by your cluster. The easiest way to deal with this is to write a shell script which submits the jobs to your queue; then you just need to run that shell script to execute the blast jobs on the cluster. A sketch of such a script is shown after this list.
      >./subjob.sh
  4. Download the BLAST results to your local machine.
      >scp -p -r YourName@yourclustermachine:/scratch/YourName/wheat_out/
                 /home/YourName/projects/wheat_annotation/blast_results
  5. Merge the BLAST results into the wheat_analysis directory.
      >cd /home/YourName/projects/wheat_annotation
    >dir_merge.pl -i blast_results/wheat_out -o wheat_analysis
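
The contents of subjob.sh depend entirely on the scheduler on your cluster. The sketch below assumes a Platform LSF queue (the clust_write_shell.pl script listed later in this manual also targets LSF); the queue name, paths, and the choice to run batch_blast.pl over each split directory are placeholder assumptions to adapt to your site:

#!/bin/bash
# subjob.sh - submit one batch_blast.pl job per split input directory.
# Assumes the LSF scheduler (bsub); adjust the queue name and paths
# for your cluster.
for dir in /scratch/YourName/wheat_in/geterdone*/ ; do
    bsub -q normal -o "$dir"blast.log \
        "batch_blast.pl -i $dir -o /scratch/YourName/wheat_out/ \
         -c batch_blast_full.jcfg -d /db/paws/"
done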


G. Preparing Computational Results for Apollo

1. Audit the computational results

The audit program will move gff files to the gff dir if needed and will alert you to any expected files that could not be found. Currently this program only audits a subset of the expected results from DAWGPAWS with a focus on gene annotation output. This program will probably end up getting deprecated, but I will leave it here for now.

  >cd /home/YourName/projects/wheat_annotation
>batch_audit.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/ --full --color

The --full option will run a complete audit; this currently includes the TriAnnotation output.

The --color option will print error messages about missing files in red font.

2. Concatenate the gff files

This is currently a manual step that needs to be automated.

For each query sequence, navigate to the directory containing the gff output files. For example, for the file HEX0014K09

  >cd /home/YourName/projects/wheat_annotation/wheat_analysis/HEX0014K09/gff/

Concatenate all of the gff files in this directory into a single gff output file:

  >cat *.gff > HEX0014K09.gff

3. Convert concatenated gff file to game.xml format

Converting the concatenated gff file can be done using the batch_convert.pl program. This takes the directory of fasta files as its input. The output will be stored in the game directory, and a copy of all of the game.xml files will also be placed in the base wheat_analysis directory.

  >cd /home/YourName/projects/wheat_annotation/wheat_analysis
>batch_convert.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/

H. Human Curation of Computational Results

The Apollo program can be used for human curation of the computational results generated above. Information on Apollo is available at: http://www.dhgp.org/current/index.html with the installation files available from: http://apollo.berkeleybop.org/current/install.html. Your use of Apollo to annotate genomes will be facilitated by the wheat.tiers file that I have developed for wheat. I have also attempted to make this wheat.tiers file compatible with output from the TriAnnot web service.



III. ADDITIONAL INFORMATION

The following information is provided for additional reference.



1. Apollo Data Tiers

The GFF files produced by the computational results above are intended to be visualized using the Apollo Genome Annotation Curation Tool. The way that Apollo displays data can be modified through the use of tiers files and style files. These files are placed in your home directory in the directory named '.apollo'; the dot in front of the name indicates that this directory is hidden. A tiers file and a style file have been created for visualizing the results for wheat annotation, and these are included as part of the DAWGPAWS SVN package.
I prefer to point to these using symlinks from within my home directory as shown below.
  >cd $HOME/.apollo
>ln -s $HOME/apps/dawgpaws/apollo/wheat.tiers fly.tiers
>ln -s $HOME/apps/dawgpaws/apollo/wheat.style fly.style
These will create files named fly.tiers and fly.style that simply point to the location of the wheat.tiers and wheat.style files in the dawgpaws SVN directory. The advantage of this is that when additional data tiers are added to the computational evidence that I use, I can create a new tiers file and propagate it to all of the annotators using SVN.


2. Computational Results Directory Structure

The DAWGPAWS programs will store the computational results in a predefined directory structure. The software will establish this structure when output from the various annotation programs are produced or parsed.

The example below shows the directory structure of computational results for the BAC identified as HEX0014K09.

IV. MANUAL FILES FOR INDIVIDUAL PROGRAMS

Links to the program manuals for the individual scripts are listed below in alphabetical order. These are extensive help files for each command line program used in the genome annotation process.
batch_blast.pl
Do NCBI-BLAST searches for a set of fasta files
batch_cnv_blast2gff.pl
Convert blast output to GFF format
batch_cnv_ta2ap.pl
Convert TriAnnotation to Apollo GFF
batch_eugene.pl
Run the eugene annotation program in batch mode
batch_findgaps.pl
Annotate gaps in a fasta file
batch_findltr.pl
Run the find_ltr.pl program in batch mode
batch_findmite.pl
Run the findmite program in batch mode
batch_game2gff.pl
Convert game.xml annotations to GFF format
batch_geneid.pl
Run the geneid program in batch mode.
batch_genemark.pl
Run GeneMark.hmm and parse results to a GFF format file
batch_genscan.pl
Run genscan in batch mode and parse results to GFF format
batch_gff2game.pl
Convert GFF files to the game.xml format
batch_hardmask.pl
Hardmask a directory of softmasked fasta files
batch_hmmer.pl
Run the HMMER program in batch mode
batch_ltrfinder.pl
Run the LTRFinder program in batch mode
batch_ltrseq.pl
Run the LTR_seq program in batch mode
batch_repmask.pl
Run RepeatMasker and parse results to a GFF format file
batch_seq_summary.pl
Print summary info for a directory of sequence files
batch_tenest.pl
Run the TE nest program in batch mode on a directory of fasta files.
batch_trf.pl
Run the Tandem Repeats Finder program in batch mode.
clust_write_shell.pl
Write shell scripts for the Platform LSF queuing system
cnv_blast2gff.pl
Convert BLAST output to the gff format.
cnv_fgenesh2gff.pl
Convert a single FGENESH output to GFF format
cnv_findltr2gff.pl
Convert a single findltr output file to GFF format
cnv_findmite2gff.pl
Convert a single findmite output file to GFF format
cnv_game2gff3.pl
Convert a game xml file to the GFF3 format.
cnv_genemark2gff.pl
Convert a single output file from genemark to GFF format
cnv_gff2game.pl
Convert a GFF file to the game.xml format.
cnv_ltrfinder2gff.pl
Convert a single output file from ltrfinder to GFF format
cnv_ltrseq2gff.pl
Convert a single output file from LTR_seq to GFF format
cnv_ltrstruc2gff.pl
Convert output from LTR_struc to GFF format
cnv_repmask2gff.pl
Convert a single output file from RepeatMasker to GFF format
cnv_repseek2gff.pl
Convert a single output from RepSeek to GFF format.
cnv_seq2dir.pl
Convert a multiple record sequence file to multiple files.
cnv_ta2ap.pl
Convert the TriAnnotation GFF3 output to apollo compatible GFF format
cnv_tenest2gff.pl
Convert a single output from the TE Nest to GFF format
dir_merge.pl
Merge directories of DAWGPAWS output to a single directory
fasta_merge.pl
Merge a directory of fasta files into a single fasta file.
fasta_dirsplit.pl
Split a directory of fasta files into n subdirectories
fasta_shorten.pl
Change fasta headers to shorter names
fetch_tenest.pl
Download a set of results from the TE Nest web server
gff_seg.pl
Segment and parse a large gff file
ltrstruc_prep.pl
Create the files needed to run the LTR_struc program
seq_oligocount.pl
Count oligomer redundancy for an input sequence
vennseq.pl
Create Venn Diagrams of sequence features


V. LITERATURE CITED

Achaz, G., F. Boyer, et al. (2007). "Repseek, a tool to retrieve approximate repeats from large DNA sequences." Bioinformatics 23(1): 119-21.

Altschul, S. F., T. L. Madden, et al. (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res 25(17): 3389-402.

Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA sequences." Nucleic Acids Res 27(2): 573-80.

Burge, C. and S. Karlin (1997). "Prediction of complete gene structures in human genomic DNA." J Mol Biol 268(1): 78-94.

Edgar, R. C. and E. W. Myers (2005). "PILER: identification and classification of genomic repeats." Bioinformatics 21 Suppl 1: i152-8.

Ellinghaus, D., S. Kurtz, et al. (2008). "LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons." BMC Bioinformatics 9: 18.

Juretic, N., T. E. Bureau, et al. (2004). "Transposable element annotation of the rice genome." Bioinformatics 20(2): 155-160.

Kalyanaraman, A. and S. Aluru (2006). "Efficient algorithms and software for detection of full-length LTR retrotransposons." J Bioinform Comput Biol 4(2): 197-216.

Kent, W. J. (2002). "BLAT--the BLAST-like alignment tool." Genome Res 12(4): 656-64.

Kronmiller, B. A. and R. P. Wise (2007). "TE nest: Automated chronological annotation and visualization of nested plant transposable elements." Plant Physiol.

Kurtz, S., J. V. Choudhuri, et al. (2001). "REPuter: the manifold applications of repeat analysis on a genomic scale." Nucleic Acids Res 29(22): 4633-42.

Kurtz, S. (2004). Vmatch. http://www.vmatch.de/

Lewis, S. E., S. M. Searle, et al. (2002). "Apollo: a sequence annotation editor." Genome Biol 3(12): RESEARCH0082.

McCarthy, E. M. and J. F. McDonald (2003). "LTR_STRUC: a novel search and identification program for LTR retrotransposons." Bioinformatics 19(3): 362-7.

Parra, G., E. Blanco, et al. (2000). "GeneID in Drosophila." Genome Res 10(4): 511-5.

Quesneville, H., C. M. Bergman, et al. (2005). "Combined evidence annotation of transposable elements in genome sequences." PLoS Comput Biol 1(2): 166-75.

Rho, M., J. H. Choi, et al. (2007). "De novo identification of LTR retrotransposons in eukaryotic genomes." BMC Genomics 8: 90.

Schiex, T., A. Moisan, et al. (2001). EuGene: An Eucaryotic Gene Finder that combines several sources of evidence. Computational Biology. O. Gascuel and M.-F. Sagot: 111-125.

Tu, Z. (2001). "Eight novel families of miniature inverted repeat transposable elements in the African malaria mosquito, Anopheles gambiae." Proc Natl Acad Sci U S A 98(4): 1699-704.

Wicker, T., F. Sabot, et al. (2007). "A unified classification system for eukaryotic transposable elements." Nat Rev Genet 8(12): 973-82.

Xu, Z. and H. Wang (2007). "LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons." Nucleic Acids Res 35(Web Server issue): W265-8.


Author: James Estill
Last Updated: October 20, 2009
