I. GETTING STARTED
Before using DAWGPAWS, you may need to install required software or
define variables in your user environment. The steps involved in
preparing to do genome annotation with the DAWGPAWS package are
described in detail below.
This user documentation will generally assume that you are running
DAWGPAWS on a Unix or Linux machine and that you are operating from
the bash shell.
Wherever I have written yourname (or YourName) below, simply
substitute your user name.
Install DAWGPAWS
The release 1.0 package of the DAWGPAWS suite of programs can be
downloaded from the DAWGPAWS SourceForge web site. It is also
possible to check out the 'live' version of DAWGPAWS using subversion.
Download From SourceForge
The easiest way to obtain the most stable version of the DAWGPAWS
programs is to download the programs from the SourceForge
project downloads page.
Anonymous Checkout via SVN
An alternative way to obtain the DAWGPAWS programs is to anonymously
check out the code from the SVN repository on SourceForge. This gives
you the most recent version of all of the DAWGPAWS programs and
documentation, including bug fixes that have not yet been added to the
stable release and experimental new features as they are added. These
programs change constantly, however, and the newest versions may
introduce new bugs.
Checking out the programs via SVN requires the SVN client program,
which can be downloaded from
http://subversion.tigris.org/project_packages.html.
Once you have the SVN client program installed, navigate to the
location where you want to place the DAWGPAWS programs and check out
the code. This is illustrated by the following commands:
>mkdir /home/yourname/apps
>cd /home/yourname/apps/
>svn co https://dawgpaws.svn.sourceforge.net/svnroot/dawgpaws/trunk dawgpaws
The above command will download all the working copies of the scripts,
program manuals, and emacs modes from the SourceForge repository. These
will be placed in the following
directory structure:
- dawgpaws
This base directory will be created when you check out dawgpaws for the
first time
- apollo
The tiers and type file for
using the apollo program are installed here
- docs
Additional documentation for
using the DAWGPAWS program and the full text of the GNU GPL license.
- emacs_modes
Major mode files for emacs. These
provide syntax highlighting for the apollo tiers file, types
files and configuration files.
- html
This is a copy of the web pages
that are hosted at http://dawgpaws.sourceforge.net
- htdocs
The base directory of the
html files. This includes the pages that are posted at the SourceForge
web site.
- cgi-bin
Programs run through the
dawgpaws web site are here.
- scripts
The scripts that are the
backbone of DAWGPAWS are stored in this directory
- t
Test files for checking if DAWGPAWS is ready to go
- data
Data files that will be used by the test scripts.
The svn 'co' command will check out the version of DAWGPAWS that is the
most up to date at the time of your initial download. To update to the
most recent version of DAWGPAWS at any time in the future you will
need to run the svn update command from within the dawgpaws
directory:
>cd /home/yourname/apps/dawgpaws
>svn update
This will update to the most current version of DAWGPAWS.
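The checkout and update steps above can be combined in one small helper. The following is a minimal sketch and is not part of DAWGPAWS: the function name dawgpaws_sync and the DRY_RUN variable are hypothetical, and only the repository URL is taken from this manual. It checks out the trunk when no working copy exists and updates it otherwise.

```shell
# Hypothetical helper: check out DAWGPAWS on first use, update it afterwards.
# The repository URL is the one given in this manual.
DAWGPAWS_URL="https://dawgpaws.svn.sourceforge.net/svnroot/dawgpaws/trunk"

dawgpaws_sync() {
    apps_dir="${1:-$HOME/apps}"
    run=""
    # With DRY_RUN=1 the commands are printed instead of executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi

    if [ -d "$apps_dir/dawgpaws/.svn" ]; then
        # A working copy already exists: bring it up to date.
        (cd "$apps_dir/dawgpaws" && $run svn update)
    else
        # First download: create the parent directory, then check out.
        $run mkdir -p "$apps_dir" &&
            $run svn co "$DAWGPAWS_URL" "$apps_dir/dawgpaws"
    fi
}
```

Running DRY_RUN=1 dawgpaws_sync "$HOME/apps" shows the commands that would be issued without touching the disk.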
Symbolic Link to the Perl Binary
In the Linux and Mac OS X environments, it is possible to run all of the
DAWGPAWS programs without typing perl before the program name. To make
this possible, all of the DAWGPAWS programs assume that your local
installation of perl is at /usr/bin/perl. You can see this by opening
a DAWGPAWS program in a text editor: the first line in the program is
#!/usr/bin/perl -w
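Whether a script's first line points at an interpreter that exists on your machine can also be checked from the command line. This is a minimal sketch; the function names shebang_interpreter and check_shebang are hypothetical and not part of DAWGPAWS.

```shell
# Print the interpreter path named on the first line of a script.
shebang_interpreter() {
    # Read only line 1, strip the leading "#!" and any options such as -w.
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Report whether that interpreter exists and is executable on this machine.
check_shebang() {
    interp=$(shebang_interpreter "$1")
    if [ -n "$interp" ] && [ -x "$interp" ]; then
        echo "ok: $1 uses $interp"
    else
        echo "missing: $1 expects '${interp:-no interpreter}'"
        echo "perl found at: $(command -v perl || echo nowhere)"
        return 1
    fi
}
```

For example, check_shebang "$HOME/apps/dawgpaws/scripts/batch_blast.pl" reports a problem when /usr/bin/perl is absent and shows where perl was actually found.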
If your local copy of the perl binary is not at /usr/bin/perl, you can
make a symbolic link using the 'ln' command with the -s option. In
general, the command is used as
>ln -s existing_file new_name
where existing_file is the name of the file on your machine, and
new_name is the name of the new shortcut. The following command example
shows how to make a shortcut when your local installation of perl is at
/usr/local/bin/perl instead of the expected /usr/bin/perl. Writing to
/usr/bin usually requires administrator privileges.
>ln -s /usr/local/bin/perl /usr/bin/perl
Install Additional Required Software
The DAWGPAWS suite of programs are scripts that provide for high
throughput execution of existing genome annotation software. The
following represents an alphabetical list of software and web services
that may be used by DAWGPAWS. This list provides a link to the source
files for installation of the software or links to the web services.
The Operating Systems (OS) that the software can be executed on are also
listed, as well as references (Ref) to the peer reviewed publications
describing the software. You do not need to install all of these
programs to make use of the DAWGPAWS package; this is the complete list
of programs that DAWGPAWS can use.
- Apollo : Genome Annotation Curation
Tool
- BLAST (NCBI-BLAST)
- Cross_Match
Cross_match is part of the Phrap package and is used by RepeatMasker.
It is a general purpose application for comparing any two DNA sequence
sets.
- EuGène
An
open gene finder for eukaryotic organisms. Compared to most existing
gene finders, EuGène is characterized by its ability to simply
integrate arbitrary sources of information in its prediction process.
- find_ltr
The find_ltr program is a de novo LTR Retrotransposon discovery tool
- FINDMITE
Identifies putative MITES based on
structural criteria.
- FGENESH
- GeneID
- GeneMarkHMM
- GENSCAN
- HMMER
HMMER is a freely
distributable implementation of profile HMM software for protein
sequence analysis.
- LTR_FINDER
- Source :A Linux Binary is available upon contacting the
authors : xuzh <at> fudan.edu.cn
- OS: Linux
- Web Query:http://tlife.fudan.edu.cn/ltr_finder/
- Ref:
Xu, Z. and H. Wang (2007). Nucleic Acids Res
35(Web Server issue): W265-8.
- LTR_Seq - Identifies
LTR retrotransposons
based on structural criteria
- LTR_Struc
- RepeatMasker
RepeatMasker is a program
that screens DNA sequences for interspersed repeats and low complexity
DNA sequences
- Tandem Repeats Finder
Locates and displays tandem repeats in DNA sequences.
- TE Nest
TE nest is an annotation and visualization tool for the identification
of transposable elements. TE nest is unique in that it reconstructs
transposable elements separated by nesting of subsequent insertions.
- TriAnnot
The purpose of the TriAnnot web server is to provide a web based automated
annotation system for wheat similar to existing resources for rice.
- Vmatch
A tool for efficiently
solving large scale sequence matching tasks using a persistent index. Vmatch subsumes the
software Reputer (Kurtz et al 2001).
- WU-BLAST
WU-BLAST is a local alignment program similar to NCBI-BLAST, but it is
quite different in its implementation, speed and results.
- Source:http://blast.wustl.edu/licensing/
- OS: Linux32, Linux64, Mac OSX, Solaris32, Solaris64
- Note: WU-BLAST no longer exists, although existing licences are still
valid. It is now owned by http://www.advbiocomp.com and is called AB-BLAST.
Install Required Perl Modules
In addition to the standard modules included in most installations of
Perl, the following Perl modules are required:
- BioPerl
BioPerl is required for some of the scripts. For information on
installing BioPerl
on your system see:
Define Environment Variables
In the Unix-like operating systems including Unix, Linux, and Mac OS X,
you can define variables in your command line environment. These
variables change the way you interact with the command line, and can be
used to simplify common tasks. Modifying the environment will
allow you to use a program at the command line without referring to its
full path, and can simplify using DAWGPAWS scripts by letting you
define common variables in your environment instead of defining them at
the
command line every time you use a program. The following information
shows you how to define environment variables for use by the DAWGPAWS
suite of programs. For more information on the shell and user
environment in Linux you can refer to
http://www.comptechdoc.org/os/linux/usersguide/linux_ugenvironment.html
or
Add the DAWGPAWS code directory to your Path
Adding the DAWGPAWS code directory to your path will allow you to type
the DAWGPAWS commands without needing to type the entire path to the
DAWGPAWS program. For example, you will be able to just type batch_blast.pl
to launch the batch_blast.pl program instead of needing to type
/home/yourname/apps/dawgpaws/scripts/batch_blast.pl.
In the bash shell you would add the following line to your user profile.
export PATH=$PATH:$HOME/apps/dawgpaws/scripts
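If your profile is read more than once in a session, the plain export line appends the same directory repeatedly. A minimal sketch of an append that skips directories already on the path is shown here; the function name path_append is hypothetical.

```shell
# Append a directory to PATH only if it is not already listed.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present: do nothing
        *) PATH="$PATH:$1" ;;     # otherwise add it to the end
    esac
    export PATH
}

path_append "$HOME/apps/dawgpaws/scripts"
```

Wrapping PATH in colons before matching means the first and last entries are compared the same way as those in the middle.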
Define paths to required software
Many of the DAWGPAWS command line programs use external software that
may be stored at different locations in different user environments.
You can choose to define the location of these external programs at the
command
line, but you are also able to define these paths in your user
environment.
An example of setting these path variables in the bash shell is below:
#FIND_LTR
export PATH=$PATH:/home/yourname/apps/LTRDeNovo/LTR/tool
The environment options that are available for each program are
described in the man page under the Configuration
and Environment heading.
Full variable set for the bash shell
In the bash shell you can copy and paste the
following to your .bashrc or .profile file. The following assumes that
you have installed the applications in a directory named apps in your
home directory. You will of course need to modify the
directory paths to the true locations of the software on your machine.
Note that the $HOME variable below refers to your user home directory (e.g. /home/jestill/). Using the $HOME
variable instead of your actual path makes this profile movable among
different machines where your home directory may be at a different
location.
#-----------------------------+
# SOFTWARE                    |
#-----------------------------+
# Values that contain $HOME are double quoted so that the shell expands them.
# DAWGPAWS SCRIPTS
export PATH=$PATH:$HOME/apps/dawgpaws/scripts
# FIND_LTR
export PATH=$PATH:$HOME/apps/LTRDeNovo/LTR/tool
export FIND_LTR_ROOT="$HOME/apps/LTRDeNovo/LTR/tool/"
# LTR_FINDER
export TRNA_DB="$HOME/apps/LTR_FINDER."
export PROSITE_DIR="$HOME/apps/LTR_FINDER/ps_scan/"
# LTR_seq
export PATH=$PATH:$HOME/apps/ltr_seq
# EuGene
export EUGENEDIR="$HOME/apps/eugene-3.3"
#-----------------------------+
# DAWGPAWS VARS               |
#-----------------------------+
# The following variables are used directly by DAWGPAWS scripts
# LTR_Finder path
export LTR_FINDER="$HOME/apps/ltr_finder/ltr_finder"
# GENSCAN
export DP_GENSCAN_BIN="$HOME/apps/genscan/genscan"
export DP_GENSCAN_LIB="$HOME/apps/genscan/Maize.smat"
# GeneMark
export GM_BIN_DIR="$HOME/apps/GenMark/genemark_hmm_euk.linux/"
export GM_LIB_DIR="$HOME/apps/GenMark/genemark_hmm_euk.linux/"
# VennMaster
export VMASTER_DIR="$HOME/apps/VennMaster/VennMaster-0.36.0/"
export VMASTER_JAVA_BIN='/usr/java/jre1.6.0/bin/java'
# GeneID
export GENEID_BIN=/usr/local/genome/geneid/geneid_v1.3/geneid/bin/geneid
# LTR SEQ
export LTR_SEQ_DIR="$HOME/apps/ltr_seq/"
export LTR_SEQ_BIN="$HOME/apps/ltr_seq/LTR_seq"
# TRF - Tandem Repeats Finder Location
export TRF_BIN="$HOME/apps/bin/trf400.linux.exe"
# DAWGPAWS NCBI-BLAST OPTIONS
export DP_BLAST_BIN='/usr/local/genome/ncbiblast/blast-2.2.13/bin/blastall'
export DP_BLAST_DIR="$HOME/paws/"
# TEnest Options
export TE_NEST_BIN='/home/jestill/Apps/te_nest/TEnest.pl'
#export TE_NEST_DIR='/home/jestill/Apps/te_nest/'
#export DP_WUBLAST_DIR='/usr/local/genome/wu_blast/'
# RepeatMasker
export DP_RM_BIN='RepeatMasker'
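After editing your profile, it can help to confirm that each variable points at something that exists. The following is a minimal sketch, separate from the dp_env_test.t script shipped with DAWGPAWS: the function name check_dp_env and the choice of variables to inspect are assumptions made for illustration. DP_RM_BIN is left out because it holds a command name rather than a path.

```shell
# Variables from the profile above whose values should be existing paths.
DP_PATH_VARS="LTR_FINDER DP_GENSCAN_BIN DP_GENSCAN_LIB GM_BIN_DIR GM_LIB_DIR
GENEID_BIN TRF_BIN DP_BLAST_BIN TE_NEST_BIN"

check_dp_env() {
    status=0
    for name in $DP_PATH_VARS; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset:   $name"
        elif [ ! -e "$value" ]; then
            echo "missing: $name=$value"
            status=1
        else
            echo "ok:      $name=$value"
        fi
    done
    return $status
}
```

A value reported as missing that still begins with a literal $HOME usually means the variable was not expanded when it was defined.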
Test Your Installation of DAWGPAWS
A number of test programs have been written to test the user
environment, to check for required software, and to test that the
conversion programs are installed properly. The test files use test
data that are part of the DAWGPAWS package. Following perl conventions,
these test scripts are in the t/ directory of DAWGPAWS, and the data
required for these tests are in the data subdirectory within this test
directory. To run one of these test scripts, navigate to the t/ dir in
DAWGPAWS and then run the test script in the t/ dir as:
>./dp_cnv_test.t
A number of individual tests are provided for the batch_run programs,
the conversion programs as well as individual options in the DAWGPAWS
environment. To run all of these tests at one time, you can use the
dp_test_all.t script. It is recommended that you test an individual
component at a time as you install the required software. The
test files available as of April 30, 2009 are listed below.
- dp_batch_eugene_test.t
Test the batch_eugene.pl program.
- dp_batch_findltr_test.t
Test the batch_findltr.pl program.
- dp_batch_findmite_test.t
Test the batch_findmite.pl
program.
- dp_batch_geneid_test.t
Test the batch_geneid.pl program.
- dp_batch_genscan_test.t
Test the batch_genscan.pl program.
- dp_batch_hmmer_test.t
Test the batch_hmmer.pl program.
- dp_batch_ltrfinder_test.t
Test the batch_ltrfinder.pl
program.
- dp_batch_ltrseq_test.t
Test the batch_ltrseq.pl program
- dp_batch_repmask_test.t
Test the batch_repmask.pl program.
- dp_batch_tenest_test.t
Test the batch_tenest.pl program.
- dp_batch_trf_test.t
Test the batch_trf.pl program.
- dp_cnv_game_test.t
Test of the conversion from gff
to game.xml using the Apollo program to mediate the conversion.
- dp_cnv_gff3_test.t
Test of the conversion from
game.xml to gff3 format using the Apollo program to mediate the
conversion.
- dp_cnv_test.t
Test of the DAWGPAWS conversion
programs.
- dp_env_test.t
Test to see if variables defined
in the user environment are okay.
- dp_findgap_test.t
Test the batch_findgaps.pl
program.
- dp_gff_seg_test.t
Test the gff segmentation program.
- dp_module_test.t
Test to see if modules required
by DAWGPAWS are present.
- dp_test_all.t
Run all of the DAWGPAWS tests. This is a
good way to check if everything is still working, but a bad way to get
started.
- dp_venn_test.t
Test the vennseq.pl program. This
tests the basic crosstab results, as well as checks that the vennmaster
program is installed and runs correctly.
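To work through the individual tests one at a time while still getting a summary, a small loop can be used. This is a minimal sketch, assuming that each test script exits with a non-zero status on failure; the function name run_dp_tests and the TEST_RUNNER variable are hypothetical.

```shell
# Run every *.t file in a directory one at a time and summarize the outcome.
# TEST_RUNNER defaults to perl, since the DAWGPAWS tests are Perl scripts.
run_dp_tests() (
    # Run from inside the test directory so the data/ subdirectory is found.
    cd "${1:-.}" || exit 1
    passed=0
    failed=0
    for t in *.t; do
        [ -e "$t" ] || continue                  # no test files found
        # dp_test_all.t runs everything again, so skip it here.
        [ "$t" = "dp_test_all.t" ] && continue
        if "${TEST_RUNNER:-perl}" "./$t" >/dev/null 2>&1; then
            passed=$((passed + 1))
        else
            failed=$((failed + 1))
            echo "FAILED: $t"
        fi
    done
    echo "passed=$passed failed=$failed"
    [ "$failed" -eq 0 ]
)
```

For example, run_dp_tests "$HOME/apps/dawgpaws/t" lists the failing scripts, which can then be rerun individually to see their full output.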
II. ANNOTATION PROCESS
Despite the name, the use of DAWGPAWS for annotation is really more of
a process than a pipeline. There is not a single place where you drop
in a sequence, push a button and get a fully annotated genome as
output. The series of scripts that comprise DAWGPAWS are designed to
be as robust as possible across operating systems, and to be usable in
an a la carte fashion. The following outline of an annotation process
uses the full suite of DAWGPAWS programs; however, a subset of these
programs could be chosen to produce computational results for genome
annotation.
The process described below assumes that you will be curating the
computational evidence using the Apollo Genome Annotation Curation
tool. I assume that you are using the game.xml file format as the
working copy of your annotation.
A. Preparing Sequence Files for Annotation
The following is
an overview of the process used to annotate sequences with the DAWGPAWS
set of programs. This assumes that you are running the
analysis on a Linux machine, and that you are storing your annotation
computational results in the directory
/home/yourname/projects/wheat_annotation/wheat_analysis/
This
directory will have a subdirectory named for every annotated sequence.
The subdirectories for the annotated sequence will follow the directory
structure indicated in the DAWGPAWS
Directory Structure below.
In all of the following command examples, the > character
represents the command line prompt.
1. Split MultiFASTA Files
The entire DAWGPAWS process assumes that each FASTA file contains a
single record representing a large contig such as an individual BAC,
YAC or chromosome pseudomolecule. If your query sequence file contains
multiple FASTA records, you will first need to split the file into
individual records. For example, if you had a multiple record fasta
file named multi_fasta.fasta, you could split it into a single fasta
file for each record using the following command:
>cnv_seq2dir.pl -i multi_fasta.fasta -o outdir/
It is actually possible to read input sequences in any record format
that is compatible with the bioperl SeqIO
format. The format is specified with the -f option. The
output files will always be generated in the fasta format.
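To make the splitting step concrete, the same idea can be sketched with awk for plain FASTA input. This is an illustration only and is not how cnv_seq2dir.pl is implemented; the function name split_fasta and the rule for naming output files after the first word of each header are assumptions.

```shell
# Write each record of a multi-record FASTA file to its own file.
# Output files are named after the first word of each header line.
split_fasta() {
    infile="$1"
    outdir="$2"
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        /^>/ {
            if (out != "") close(out)         # finish the previous record
            id = substr($1, 2)                # header word without the ">"
            gsub(/[^A-Za-z0-9._-]/, "_", id)  # keep file names safe
            out = dir "/" id ".fasta"
        }
        out != "" { print > out }             # header and sequence lines
    ' "$infile"
}
```

Calling split_fasta multi_fasta.fasta outdir/ produces one file per record. Unlike cnv_seq2dir.pl, this sketch reads only FASTA input.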
2. Rename FASTA files to a short unique name
Because many of the programs that are used in the DAWGPAWS process have
limits on the size of FASTA headers, it may be necessary to first
shorten the name of the FASTA files.
Start with a directory in which you have a single fasta file for
every BAC to process, at the following directory path:
/home/YourName/projects/wheat_annotation/wheat_analysis/fasta_orig
Navigate to the wheat_analysis directory using the cd command to
change directories:
>cd /home/YourName/projects/wheat_annotation/wheat_analysis
Issue the following command from the wheat_analysis directory:
>fasta_shorten.pl -i fasta_orig/ -o fasta_short/ -l 10 --uppercase
This
will rename all of the fasta files in the fasta_orig directory and will
place the results in the fasta_short directory. The -l option of 10
will shorten the names to 10 characters, and the --uppercase flag will
convert all lowercase base calls to uppercase.
3. Soft mask the renamed files using RepeatMasker and the TREP database
For the wheat BACs, the nonredundant
TREP database is used.
Navigate to the wheat_annotation directory
>cd /home/YourName/projects/wheat_annotation
Issue the following command from the wheat_annotation directory
>batch_repmask.pl -i wheat_analysis/fasta_short/ -o wheat_analysis/ \
   -c wheat_analysis/batch_mask.jcfg \
   --engine wublast
This will softmask all fasta files in the fasta_short directory and
will place the results in the wheat_analysis directory.
The
-c option should point to the location of the batch_repmask
configuration
file. The --engine wublast option will use the wublast engine for
masking the sequences. The default behavior is to mask with crossmatch.
4. Hard mask the softmasked files generated above
Many of the gene prediction programs are not soft-masked "aware", so it
is necessary to also make a hard masked copy of the fasta file. This will
replace the lowercase letters with an N or X character. The example
below shows masking the fasta files in the soft_masked directory with
an uppercase N and placing the output in the hard_masked directory.
>cd /home/YourName/projects/wheat_annotation/wheat_analysis
>batch_hardmask.pl -i soft_masked/ -o hard_masked/ -m N
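For a single file, the effect of hard masking can be sketched in a few lines of awk. This illustrates the transformation only and is not the batch_hardmask.pl implementation; the function name hardmask_fasta is hypothetical.

```shell
# Replace lowercase (soft-masked) bases with a mask character, leaving
# FASTA header lines untouched. The mask defaults to N.
hardmask_fasta() {
    awk -v m="${2:-N}" '
        /^>/ { print; next }            # header lines pass through unchanged
        { gsub(/[a-z]/, m); print }     # mask every lowercase base
    ' "$1"
}
```

For example, hardmask_fasta soft_masked/seq.fasta X writes the masked sequence to standard output using X instead of N.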
B. Structural Feature Annotation
Currently, DAWGPAWS can only annotate gaps in the assembly.
Structural features are defined here as features of the sequence, such
as gaps or GC content, that are not directly related to the
annotation of biological sequence features such as genes or
transposable elements.
1. Annotate the gaps in the assembly
The following will find gaps in your fasta files.
>cd /home/YourName/projects/wheat_annotation
>batch_findgaps.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/
The batch_findgaps.pl program is slow and uses an ugly algorithm, but
it does work. Results will be placed in gff format in the gff directory
as well as in game.xml format in the game directory.
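The idea behind gap finding can be sketched as a scan for runs of N in each sequence. This is a minimal illustration and not the algorithm used by batch_findgaps.pl; the function name find_gaps and the values written to the source and feature columns are placeholders.

```shell
# Report runs of N as tab-delimited, GFF-style lines (1-based, inclusive).
# The source ("sketch") and feature ("gap") column values are placeholders.
find_gaps() {
    awk '
        function emit_gaps(   rest, offset) {
            if (name == "") return
            rest = toupper(seq)
            offset = 0
            # Repeatedly locate the next run of N in the unscanned remainder.
            while (match(rest, /N+/)) {
                printf "%s\tsketch\tgap\t%d\t%d\t.\t.\t.\tgap\n",
                    name, offset + RSTART, offset + RSTART + RLENGTH - 1
                offset += RSTART + RLENGTH - 1
                rest = substr(rest, RSTART + RLENGTH)
            }
        }
        /^>/ { emit_gaps(); name = substr($1, 2); seq = ""; next }
        { seq = seq $0 }        # join wrapped sequence lines
        END { emit_gaps() }
    ' "$1"
}
```

Because the sequence is converted to uppercase first, lowercase n is counted as well. A hard masked file would report every masked region as a gap, so this kind of scan only makes sense on soft masked or unmasked sequence.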
C. Gene Annotation De Novo Computational Results
The information below assumes that you have properly prepared the
sequence files for annotation as described above. You can currently use
the following five de novo gene annotation programs within DAWGPAWS.
The native results from these programs are translated to a gff format
that can be used with the Apollo program.
1. GENSCAN gene prediction program
Required Programs:
The GENSCAN program is not part of the TriAnnot web service, thus the
GENSCAN program must be run on your local machine. It is also possible
to run the GENSCAN program on a web server such as http://genes.mit.edu/GENSCAN.html.
If you run GENSCAN on a remote web server, you may be able to convert
the results to GFF format using the program cnv_genscan2gff.pl.
>batch_genscan.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/
2. GeneMarkHMM
Required Programs:
The batch_genemark.pl program allows you to run GeneMark
locally. The GeneMarkHMM program requires that you obtain a license to
run it on your local machine.
>batch_genemark.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/
3. FGeneSH
Required Programs:
The FGeneSH program can be run in a limited form on the Softberry web
site. I do not have access to the code to write a script for running
this program
in batch mode. An alternative to running this program on the softberry
web server.
If you have output from the softberry web site, you can convert to gff
format using the cnv_fgenesh2gff.pl program:
>cnv_fgenesh2gff.pl -i fgenesh_result.txt -o fgenesh_result.gff
If you did not save this output in text format, the cnv_fgenesh2gff.pl
program will attempt to remove HTML tags before converting to gff
format. To see the full list of options available in this program, read
the cnv_fgenesh2gff.pl man page.
4. GeneID
Required Programs:
To run the geneid program in batch mode you would use the following
command.
>batch_geneid.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/ --param wheat.param.Apr_22_2004
You may also specify the full path to the geneid binary using the
--geneid-path option.
>batch_geneid.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/ --param wheat.param.Apr_22_2004 \
   --geneid-path /usr/local/geneid.March_1_2005
The original results from the geneid program will be placed in the
directory named geneid/ while the gff formatted results prepared for
Apollo will be placed in the gff/ directory.
5. EuGène
Required Programs:
Running the EuGène program in batch mode will make use of the hard
masked input fasta files, as well as a parameter file created for your
organism. Parameter files in EuGène allow you to specify many
>batch_eugene.pl -i wheat_analysis/hard_masked/ -o wheat_analysis/ --param wheat_eugene.par
The HTML and image files resulting from EuGène will be placed in the
eugene directory. The original gff output from EuGène will also be
placed in the eugene directory, while a gff file translated for use in
Apollo will be placed in the gff directory. The results from this
batch_eugene run will differ from the results on the TriAnnot web
server due to differences in the parameter files between the two
pipelines.
D. Transposable Element De Novo Computational Results
The information below assumes that you have properly prepared the
sequence files for annotation as
described above. The following transposable element annotation
steps can generally be done in any order, and you may choose to not
include some of these programs in your analysis pipeline.
1. LTR_Struc Program for LTR Retrotransposon Prediction
Required Programs:
The LTR_Struc program has not been optimized for use in a
high-throughput fashion, it has an executable binary that is only
available for the Windows platform, and it does not provide output in a
form that can be easily mapped back onto the query sequence. The
source code for LTR_Struc is not available, so these limiting factors
cannot be modified. The process for running LTR_Struc therefore
requires some additional steps as outlined below:
- Prepare sequence files for analysis in the MS Windows
environment. The ltrstruc_prep.pl program
will convert files with UNIX line endings to DOS formatted files with
the *.txt extension and will also create the flist.txt file required by
LTR_Struc.
>cd /home/YourName/projects/wheat_annotation
>ltrstruc_prep.pl -i wheat_analysis/masked_soft/ -o for_ltrstruc/
- Copy the directory of transformed fasta files and flist.txt to a
MS Windows machine for analysis and place these in the same directory
as the LTR_Struc binary.
- Use the batch_ltrstruc.vbs program to
run the LTR_Struc program in batch mode. This program is a Visual Basic
script that sends the information to the LTR_Struc
command line. It has been tested and is known to work under the Windows
XP operating system. You can download
batch_ltrstruc.vbs from the DAWGPAWS subversion repository
and place it in the same directory as the ltr_struc binary. As
written, this will run LTR_Struc under the most stringent conditions.
C:>cd ltrstrucdir
C:>Cscript.exe //NoLogo batch_ltrstruc.vbs | cmd.exe
- When the LTR_Struc analysis is complete, transfer the LTR_Struc
output back to your Linux machine.
- Use the
cnv_ltrstruc2gff.pl
program to
convert the LTR_Struc output to a format that can be mapped back onto
the query sequence. This script will extract sequence strings as
reported by LTR_Struc and will map the strings back onto the query
sequence using a Perl based string matching function. This matching
function currently assumes only a single match to this string exists in
the query sequence.
>cd /home/YourName/projects/wheat_annotation
>cnv_ltrstruc2gff.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/ \
   --results wheat_analysis/ltrstruc_results/
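The file preparation in the first step above amounts to a line ending conversion plus a file list. The following is a minimal sketch of that idea and not the ltrstruc_prep.pl implementation. The function name prep_for_windows is hypothetical, the input files are assumed to end in .fasta, and flist.txt is assumed to hold one file name per line.

```shell
# Copy each FASTA file to a DOS-format *.txt file and list the copies
# in flist.txt. Assumption: flist.txt holds one file name per line.
prep_for_windows() {
    indir="$1"
    outdir="$2"
    mkdir -p "$outdir" || return 1
    : > "$outdir/flist.txt"
    for f in "$indir"/*.fasta; do
        [ -e "$f" ] || continue
        base=$(basename "$f" .fasta)
        # Drop any existing CR, then end every line with CR+LF.
        awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }' "$f" > "$outdir/$base.txt"
        printf '%s\r\n' "$base.txt" >> "$outdir/flist.txt"
    done
}
```

Removing an existing carriage return before adding one keeps files that are already in DOS format from ending up with doubled line endings.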
2. LTR_FINDER Program for LTR Retrotransposon Prediction
Required Programs:
The following assumes that you have obtained a binary of the LTR_FINDER
program. It is also possible to generate LTR_FINDER results using the LTR_FINDER web page,
but I have not tested the DAWGPAWS scripts on the web page output. It
is possible that the cnv_ltrfinder2gff.pl program will work for single
fasta files analyzed with the LTR_FINDER web page.
The LTR_FINDER program has been designed with high throughput use in
mind. It provides rich output for the location and biology of the LTR
retrotransposons that it predicts, and it allows for fine scale control
of search parameters. The batch_ltrfinder.pl program provides an
interface to run LTR_FINDER for multiple parameter sets for each query
sequence, and it converts output to the GFF file format.
>cd /home/YourName/projects/wheat_annotation
>batch_ltrfinder.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/ \
   -c config/batch_ltrfinder.jcfg
3. find_ltr Program for LTR Retrotransposon Prediction
Required Programs:
It is unclear if the current license for the find_ltr program allows
modification of the source code. If the license prevents
modification of the program source, this would unfortunately include
modifying parameters that are important to using this program for LTR
discovery. I have modified the program code to accept these parameter
changes at the command line, and the batch_findltr.pl program depends
on these source code modifications. An example of using the find_ltr
program in batch mode is illustrated below:
>cd /home/YourName/projects/wheat_annotation
>batch_findltr.pl -i wheat_analysis/masked_soft -o wheat_analysis \
   -c config/batch_findltr.jcfg
The above example will produce a GFF file for each parameter set in the
batch_findltr.jcfg file for each fasta input sequence recognized in the
masked_soft input directory.
4. LTR_seq for LTR Retrotransposon Prediction
Required Programs:
Running LTR_seq in batch mode uses the batch_ltrseq.pl program. This
program can make use of a configuration file that specifies a name for
a parameter set, and the LTR_seq config file that this parameter set is
specified in. The following example shows using batch_ltrseq.pl with no
configuration file. This will run the LTR_seq program using the default
parameters:
>batch_ltrseq.pl -i wheat_analysis/masked_soft -o wheat_analysis
You may also specify a configuration file with the -c option. This
config file will allow the batch_ltrseq.pl program to run the LTR_seq
program for multiple parameter combinations for every fasta file in the
input sequence directory. The configuration file is a two column,
tab delimited text file with the following columns:
- Col. 1 Configuration Set Name
- Col. 2 Config File Path
An example using batch_ltrseq.pl with a config file is below:
>batch_ltrseq.pl -i wheat_analysis/masked_soft -o wheat_analysis \
   -c ltrseq_set.jcfg
Please see the batch_ltrseq.pl
documentation for more information regarding the batch_ltrseq config
file as well as how to designate parameters for the LTR_seq program.
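As a rough illustration of the two column layout, the sketch below writes a small config file and checks that every line has exactly two tab separated columns. The function names, the set names "default" and "strict", and both config file paths are hypothetical placeholders; the batch_ltrseq.pl documentation remains the authority on the file format.

```shell
# Write a two-column, tab-delimited file: parameter set name, then the path
# to an LTR_seq config file. Both rows are hypothetical placeholders.
write_example_config() {
    printf '%s\t%s\n' \
        "default" "$HOME/apps/ltr_seq/default.cfg" \
        "strict"  "$HOME/apps/ltr_seq/strict.cfg" > "$1"
}

# Report any line that does not have exactly two tab-separated columns.
check_two_columns() {
    awk -F'\t' '
        NF != 2 {
            printf "line %d: expected 2 columns, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

A common mistake in hand edited files is a text editor that replaces tabs with spaces; check_two_columns ltrseq_set.jcfg reports such lines as having a single column.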
You may also do
an individual run of LTR_seq manually and then convert the output to
gff using the cnv_ltrseq2gff.pl
program.
>cd /home/YourName/projects/wheat_annotation/wheat_analysis/HEX0014K09
>mkdir ltr_seq
>cnv_ltrseq2gff.pl -i ltr_seq/ltr_seq_out.txt -o gff/HEX0014K09_ltrseq.gff \
   -s HEX0014K09
This will produce a file named HEX0014K09_ltrseq.gff that contains all
of the output from the LTR_seq program that was determined to be an LTR
Retrotransposon. The current version of the LTR_seq program appears to
produce duplicate predictions.
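If the duplicate predictions appear as identical lines in the GFF output, they can be removed with a one line filter. This is a minimal sketch under that assumption; the function name dedup_gff is hypothetical, and predictions that differ in any column are kept.

```shell
# Print a GFF file with exact duplicate lines removed, keeping the first
# occurrence of each line in its original order.
dedup_gff() {
    awk '!seen[$0]++' "$1"
}
```

For example, dedup_gff gff/HEX0014K09_ltrseq.gff > gff/HEX0014K09_ltrseq_unique.gff writes the filtered copy to a new file, where the output file name is only an illustration.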
5. LTRharvest for LTR Retrotransposon Prediction
Required Programs:
LTRharvest is currently not supported to run in batch mode as part of
the DAWGPAWS package. However, the GFF3 output can be manually opened
in the newest version of Apollo.
6. RepSeek
Information on repseek here.
Required Programs:
7. FINDMITE for MITE Discovery
Required Programs:
The batch_findmite program will do a FINDMITE analysis for each
parameter
set in your configuration file for each query sequence in your input
directory. The results from FINDMITE have a VERY high false positive
rate so you will need to further evaluate your results to find the true
MITEs
in your query sequence.
To run the batch_findmite.pl program you would do the following:
>cd /home/YourName/projects/wheat_annotation
>batch_findmite.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/ \
   -c config/batch_findmite.jcfg --gff
The FINDMITE results will be placed in the findmite directory, while the
results converted to the GFF format will be placed in the gff
directory.
8. Tandem Repeats Finder (TRF)
Required Programs:
The Tandem Repeats Finder can be run in batch mode. This script will
work with TRF 4.0; I have not tried to use it with earlier versions of
TRF.
>batch_trf.pl -i wheat_analysis/masked_soft -o wheat_analysis \
   -c config/batch_hmmer.jcfg
The TRF data file will be placed in the directory, TRF. The gff
formatted results will be placed in the gff directory.
E. Transposable Element Homology Based Computational Results
The following computational results for transposable elements all
require a database of known TEs for annotation. These are therefore
identified as homology based computational results.
1. RepeatMask Sequences
Required Programs:
The process for running RepeatMasker to identify known Transposable
Elements is described above.
2. TE Nest
The TE Nest program can annotate nested insertions. You have the option
of running the TE Nest program as a web
service, or downloading the TE Nest program and running it on a local
machine.
TE Nest Web Service
Required Programs:
The TE Nest program is provided as a web
page based service from the Plant GDB web server. The TE Nest
program uses a homology based search approach to
identify nested elements in your query sequence. The fetch_tenest.pl
program can be used to bulk download results from the TE Nest web
server and convert results from the TE Nest text format to the GFF
format.
- Submit your sequence jobs to the TE Nest web server http://www.plantgdb.org/PlantGDB-cgi/TE_nest/cgi/displayTE.pl
- Record the job id that is returned for your sequence file by
adding a row to your config file that indicates (1) the name of your
query sequence as used in the rest of your analysis, and (2) the job id
that TE Nest assigned to your submission. An example config file is
available from the DAWGPAWS SVN site.
- When all of your jobs have completed on the TE Nest web server,
download the results using the fetch_tenest.pl program.
>cd /home/YourName/projects/wheat_annotation
>fetch_tenest.pl -c tenest_results.txt -o wheat_analysis/ --gff
This will place the results in the wheat_analysis folder with a
separate set of folders for each query sequence defined in the config file.
TE Nest Local Installation
An alternative to running TE Nest as a web service is to run TE Nest
locally using the batch_tenest.pl
program. You are required to run the program locally if you are
Required Programs:
- TEnest.pl : http://www.public.iastate.edu/~imagefpc/Subpages/te_nest.html
- batch_tenest.pl - Run the TE Nest program locally in batch mode.
- cnv_fasta2tenest.pl - This program can convert a fasta file of
TEs into a TE Nest database. This program is only required if you want
to generate your own databases for use with TE Nest.
>cd /home/YourName/projects/wheat_annotation
>batch_tenest.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/ --org wheat
TE Nest will use the maize database by default. This script currently
does not convert the output to gff format. Valid options for --org
include barley, wheat, maize, and rice. You may also create your own
database for use with TE Nest using the cnv_fasta2tenest.pl program.
This will create the directory structure and files needed to run TE
Nest on a set of transposable elements.
3. HMMER - TE Models
Required Programs:
It is possible to use profile hidden Markov models of known
transposable elements to find the location of TEs in your query
sequence. This approach was used to identify MITEs and MULEs in the
rice genome (Juretic et al. 2004). You can produce your own profile
HMMs using multiple alignments of sequences of your choice, or you can
test this approach using HMMs developed for rice.
>cd /home/YourName/projects/wheat_annotation
>batch_hmmer.pl -i wheat_analysis/masked_soft -o wheat_analysis
-c config/batch_hmmer.jcfg --gff
4. Oligomer Counts
Required Programs:
The seq_oligocount.pl program is at an early stage of development. It
currently breaks the query sequence into oligomers of length k and then
uses the vmatch program to query these k-mers against an index database
to determine oligo copy number depth in the index database. This index
database could be an index of the raw shotgun reads, an index of the
assembled reads, or a database index of an external dataset. The result
produced by seq_oligocount.pl is a GFF format file that gives the
oligomer copy number of every segment of length k in the query
sequence. It would make sense to allow for binning these results over
some window, but that is currently not implemented. The process for
using this program is outlined below.
- Create a persistent index of your database with the mkvtree
program. The example below shows how to index the file named
my_seqs.fasta.
>cd /home/YourName/projects/wheat_annotation
>mkdir seq_index
>cp my_seqs.fasta seq_index/
>cd seq_index/
>mkvtree -db my_seqs.fasta -dna -allout -pl
This index can be used to generate an index of coverage.
- Use the persistent index created above as a database to query
your sequence against using the seq_oligocount.pl program. The example
below shows how to create an oligo count tier for the fasta file
HEX0014K09.fasta where the length of the oligos is 20 bases and the
index file is the my_seqs.fasta file from above. The output will be
placed in the directory: wheat_analysis/HEX0014K09/
>cd /home/YourName/projects/wheat_annotation
>seq_oligocount.pl --infile masked_soft/HEX0014K09.fasta -n HEX0014K09
--db seq_index/my_seqs.fasta -k 20
--outdir wheat_analysis/HEX0014K09/
The seq_oligocount.pl program will place the GFF formatted results in
the gff directory.
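Since binning over a window is not implemented in seq_oligocount.pl, the
following is only a minimal sketch of how the counts could be averaged
over fixed windows with awk. It assumes that the GFF file is tab
delimited and that the oligomer count is stored in the score column
(column 6). The function name bin_gff_scores is hypothetical and is not
part of DAWGPAWS.

```shell
# bin_gff_scores GFF WINDOW
# Average the score column (6) of a GFF file over fixed windows of
# WINDOW bases, keyed on each feature's start coordinate (column 4).
# ASSUMPTION: tab-delimited GFF with the oligo count in the score column.
bin_gff_scores () {
    awk -F'\t' -v w="$2" '
        /^#/ || NF < 6 { next }            # skip comments and short lines
        {
            bin = int(($4 - 1) / w)        # 0-based window index
            sum[bin] += $6
            n[bin]++
            if (bin > max) max = bin
        }
        END {
            # print window start, window end, mean count
            for (b = 0; b <= max; b++)
                if (n[b] > 0)
                    printf "%d\t%d\t%.2f\n", b * w + 1, (b + 1) * w, sum[b] / n[b]
        }
    ' "$1"
}
```

For example, bin_gff_scores HEX0014K09_20mer.gff 1000 would print one
line per 1000 base window giving the window start, the window end, and
the mean count. Windows that contain no features are skipped.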
The results can be translated to UCSC wiggle format using the
cnv_gff2wig.pl program. This program currently takes three arguments:
a name, a description, and the filename to process.
>cnv_gff2wig.pl 20mer_counts 20mers vmatch_out.gff
However, as currently written this will produce a file format that is
not compatible with Apollo; this function is still under development.
As an alternative to converting to the wiggle format, you can use the
gff_seg.pl program to convert the raw oligomer counts into segment
features that exceed a threshold value. For example, to generate a gff
feature file of segments with oligomers that occur at least 50 times in
the index database:
>gff_seg.pl --infile HEX0014K09_20mer.gff --seg-out HEX0014K09_50.gff --thresh 50
This will produce a gff output file that defines all segments of
sequential 20mers that are each represented by 50 or more copies in the
index database.
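The idea behind this kind of threshold segmentation can be sketched as
follows. This is not the gff_seg.pl implementation, only an illustration
of the technique under the assumptions that the GFF file is tab
delimited, sorted by start position, and stores the count in the score
column (column 6). The function name seg_by_threshold is hypothetical.

```shell
# seg_by_threshold GFF THRESH
# Print the start and end of each run of consecutive features whose
# score (column 6) is at or above THRESH.
# ASSUMPTION: input is tab delimited and sorted by start coordinate.
seg_by_threshold () {
    awk -F'\t' -v t="$2" '
        /^#/ || NF < 6 { next }            # skip comments and short lines
        $6 + 0 >= t + 0 {
            # feature passes the threshold: open or extend a segment
            if (!in_seg) { seg_start = $4; in_seg = 1 }
            seg_end = $5
            next
        }
        # feature below the threshold: close any open segment
        in_seg { print seg_start "\t" seg_end; in_seg = 0 }
        END { if (in_seg) print seg_start "\t" seg_end }
    ' "$1"
}
```

Each output line is one segment, given as a start and end coordinate in
the query sequence.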
F. NCBI-BLAST Homology Searches
Additional gene and TE annotation tiers are added using the NCBI BLAST
program.
NOTE: THE FOLLOWING INFORMATION REFERS TO THE TRADITIONAL NCBI-BLAST
COMMAND LINE PROGRAM. THE BLAST+ PROGRAM IS NOT CURRENTLY SUPPORTED. --
OCTOBER 20, 2009
1. NCBI-BLAST processes
Homology searches with NCBI-BLAST will use the soft masked files
generated above.
First, you will need to prepare sequence files to serve as databases
for BLAST queries. This will use the formatdb command. The DAWGPAWS
program uses the name that BLAST assigns to your database, so you
should indicate the database name and ID when formatting the database.
You should choose a shortened form of the database name with no spaces.
The following shows how to format the fasta file
my_database_of_repeats.fasta using the small name repdb as the database
name.
>formatdb -i my_database_of_repeats.fasta -p F -n repdb -t repdb
For a small number of BLAST jobs, you can run the jobs on your machine
using the batch_blast.pl program with the batch_blast_full.jcfg
configuration file. The -d argument is used to indicate the location of
the directory containing the BLAST databases.
>batch_blast.pl -i wheat_annotation/soft_masked -o wheat_annotation/
-c batch_blast_full.jcfg -d /db/paws/
--logfile wheat_annotation/blast_job.log
The BLAST results will be placed in the 'blast' directory for each
contig, and the results translated to the gff format will be placed in
the 'gff' directory.
Running BLAST jobs in a cluster computing environment
For a larger number of BLAST jobs, the BLAST program will generally be
run in a parallel cluster computing framework. This will require the
following:
- Split the fasta directory into subdirectories. Each subdirectory
will be analyzed on a separate node on the cluster machine.
>fasta_dirsplit.pl -i wheat_analysis/soft_masked -o wheat_analysis/blast_dirs/
-n 16 -b geterdone
- Copy the new dirs to the cluster machine
>scp -p -r geterdone*/ YourName@yourclustermachine:/scratch/YourName/wheat_in/
- Run blast processes on the cluster machine. The specifics of running
the blast processes on your cluster will depend on the scheduling
software used by your cluster. The easiest way to deal with this is to
write a shell script which submits the jobs to your queue. Then you
will just need to run the shell script to execute the blast jobs on the
cluster.
>./subjob.sh
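The contents of subjob.sh depend entirely on your cluster. The following
is only a sketch of what such a script might look like, assuming the
Platform LSF bsub command, the scratch paths used in the scp examples,
and the BLAST database location /db/paws/ from the local example. The
function name submit_blast_jobs and the SUBMIT variable are hypothetical
and will need to be adapted to your own scheduler.

```shell
#!/bin/bash
# subjob.sh - submit one batch_blast.pl job per split directory.
# ASSUMPTIONS: Platform LSF (bsub), and that batch_blast.pl and the
# batch_blast_full.jcfg config file are available on the cluster.

# submit_blast_jobs IN_BASE OUT_BASE [SUBMIT_CMD]
submit_blast_jobs () {
    in_base=$1
    out_base=$2
    submit=${3:-bsub}
    for dir in "$in_base"/geterdone*/; do
        [ -d "$dir" ] || continue          # no split directories found
        name=$(basename "$dir")
        # -J names the job and -o sets the scheduler log file
        $submit -J "$name" -o "$out_base/$name.log" \
            batch_blast.pl -i "$dir" -o "$out_base" \
            -c batch_blast_full.jcfg -d /db/paws/
    done
}

# Set SUBMIT=echo in the environment for a dry run that only prints
# the commands that would be submitted.
submit_blast_jobs /scratch/YourName/wheat_in /scratch/YourName/wheat_out "${SUBMIT:-bsub}"
```

Running the script with SUBMIT=echo ./subjob.sh first is a cheap way to
check the generated commands before anything is sent to the queue.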
- Download BLAST results to your local machine
>scp -p -r YourName@yourclustermachine:/scratch/YourName/wheat_out/
/home/YourName/projects/wheat_annotation/blast_results
- Merge the BLAST results into the wheat_analysis directory
>cd /home/YourName/projects/wheat_annotation
>dir_merge.pl -i blast_results/wheat_out -o wheat_analysis
G. Preparing Computational Results for Apollo
1. Audit the computational results
The audit program will move gff files to the gff directory if needed
and will alert you to any expected files that could not be found.
Currently this program only audits a subset of the expected results
from DAWGPAWS, with a focus on gene annotation output. This program
will probably be deprecated, but I will leave it here for now.
>cd /home/YourName/projects/wheat_annotation
>batch_audit.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/ --full --color
The --full option will run a complete audit. This currently includes
the TriAnnotation output. The --color option will print error messages
about missing files in red font.
2. Concatenate the gff files
This is currently a manual step that needs to be automated.
For each query sequence, navigate to the directory containing the
gff output files. For example, for the file HEX0014K09:
>cd /home/YourName/projects/wheat_annotation/wheat_analysis/HEX0014K09/gff/
Concatenate all of the gff files in this directory into a single
gff output file.
>cat *.gff > HEX0014K09.gff
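Until this step is automated within DAWGPAWS, a small shell function can
repeat the concatenation for every query sequence. The following is a
sketch that assumes each sequence has its own gff subdirectory under the
base output directory, as in the example above. The function name
concat_gff is hypothetical. It writes to a temporary file and skips any
combined file left by an earlier run, so that the output is never
appended to itself.

```shell
# concat_gff BASE_DIR
# For every BASE_DIR/<seq>/gff directory, concatenate the *.gff files
# into BASE_DIR/<seq>/gff/<seq>.gff.
concat_gff () {
    for gff_dir in "$1"/*/gff/; do
        [ -d "$gff_dir" ] || continue      # directory has no gff results
        seq_name=$(basename "$(dirname "$gff_dir")")
        combined="${gff_dir}${seq_name}.gff"
        tmp_file="${gff_dir}${seq_name}.gff.tmp"
        : > "$tmp_file"                    # start with an empty temp file
        for f in "$gff_dir"*.gff; do
            [ -f "$f" ] || continue
            # skip the combined file produced by a previous run
            [ "$f" = "$combined" ] && continue
            cat "$f" >> "$tmp_file"
        done
        mv "$tmp_file" "$combined"
    done
}
```

From the wheat_annotation project directory this could be run as
concat_gff wheat_analysis.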
3. Convert concatenated gff file to game.xml format
Converting the concatenated gff file can be done using the
batch_convert.pl program. This takes as its input the input directory
of fasta files. The output will be stored in the game directory. A copy
of all of the game.xml files will also be placed in the base
wheat_analysis directory.
>cd /home/YourName/projects/wheat_annotation/wheat_analysis
>batch_convert.pl -i wheat_analysis/masked_soft/ -o wheat_analysis/
H. Human Curation of Computational Results
The Apollo program can be used for human curation of the computational
results generated above. Information on Apollo is available at:
http://www.dhgp.org/current/index.html with the installation files
available from: http://apollo.berkeleybop.org/current/install.html.
Your use of Apollo to annotate genomes will be facilitated by the use
of the wheat.tiers file that I have developed for wheat. I have also
attempted to make this wheat.tiers file compatible with output from the
TriAnnot web service.
IV. ADDITIONAL INFORMATION
The following information is provided for additional reference.
1. Apollo Data Tiers
The GFF files produced by the computational results above are intended
to be visualized using the Apollo Genome Annotation Curation Tool. The
way that Apollo displays data can be modified through the use of tiers
files and style files. These files are placed in your home directory in
the directory named '.apollo'. The dot at the front of the name
indicates that this directory is hidden. A tiers file and a style file
have been created for visualizing the
results for wheat annotation. These are included as part of the
DAWGPAWS SVN package, but are also available at:
I prefer to point to these using symlinks from within my home directory
as shown below.
>cd $HOME/.apollo
>ln -s $HOME/apps/dawgpaws/apollo/wheat.tiers fly.tiers
>ln -s $HOME/apps/dawgpaws/apollo/wheat.style fly.style
These will create symlinks named fly.tiers and fly.style that point to
the location of the wheat.tiers and wheat.style files in the dawgpaws
SVN directory. The advantage of this is that when additional data tiers
are added to the computational evidences that I use, I can create a new
tiers file and propagate this to all of the annotators using SVN.
Computational Results Directory Structure
The DAWGPAWS programs will store the computational results in a
predefined directory structure. The software will establish this
structure when output from the various annotation programs is produced
or parsed.
The example below shows the directory structure of computational
results for the BAC identified as HEX0014K09.
- HEX0014K09
  - blast - Results of blast searches
  - eugene - Results of the eugene gene annotation program
  - fgenesh - Results of the fgenesh gene annotation program
  - find_ltr - Results from the find_ltr program
  - findmite - Results from the FINDMITE program
  - game - game.xml files
  - gene - Results from the genscan gene annotation program
  - geneid - Results from the geneid gene annotation program
  - genemark - Results from the genemarkHMM gene annotation program
  - gff - gff format data files
  - hmmer - Results from the hmmsearch program
    - rice_mite - Rice MITE hmm profiles
    - rice_mule - Rice MULE hmm profiles
    - wheat_ltr - Wheat LTR hmm profiles
  - ltr_finder - Results from the ltr_finder program
  - ltr_seq - ltr_seq output
  - ltr_struc - Results from the ltr_struc program
  - repseek - Results from the Repseek program
  - rm - Repeatmasker results
  - ta - TriAnnotation results
  - tenest - Results from TE NEST
  - trf - Results from Tandem Repeats Finder
Instructions for Individual Programs
Links to the program manuals for the individual scripts are listed
below in alphabetical order. These are extensive help files for each
command line program used in the genome annotation process.
- batch_blast.pl - Do NCBI-BLAST searches for a set of fasta files
- batch_cnv_blast2gff.pl - Convert blast output to GFF format
- batch_cnv_ta2ap.pl - Convert TriAnnotation to Apollo GFF
- batch_eugene.pl - Run the eugene annotation program in batch mode
- batch_findgaps.pl - Annotate gaps in a fasta file
- batch_findltr.pl - Run the find_ltr.pl program in batch mode
- batch_findmite.pl - Run the findmite program in batch mode
- batch_game2gff.pl - Convert game.xml annotations to GFF format
- batch_geneid.pl - Run the geneid program in batch mode
- batch_genemark.pl - Run GeneMark.hmm and parse results to a GFF format file
- batch_genscan.pl - Run genscan in batch mode and parse results to GFF format
- batch_gff2game.pl - Convert GFF files to the game.xml format
- batch_hardmask.pl - Hardmask a directory of softmasked fasta files
- batch_hmmer.pl - Run the HMMER program in batch mode
- batch_ltrfinder.pl - Run the LTRFinder program in batch mode
- batch_ltrseq.pl - Run the LTR_seq program in batch mode
- batch_repmask.pl - Run RepeatMasker and parse results to a GFF format file
- batch_seq_summary.pl - Print summary info for a directory of sequence files
- batch_tenest.pl - Run the TE Nest program in batch mode on a directory of fasta files
- batch_trf.pl - Run the Tandem Repeats Finder program in batch mode
- clust_write_shell.pl - Write shell scripts for the Platform LSF queuing system
- cnv_blast2gff.pl - Convert BLAST output to the GFF format
- cnv_fgenesh2gff.pl - Convert a single FGENESH output to GFF format
- cnv_findltr2gff.pl - Convert a single findltr output file to GFF format
- cnv_findmite2gff.pl - Convert a single findmite output file to GFF format
- cnv_game2gff3.pl - Convert a game xml file to the GFF3 format
- cnv_genemark2gff.pl - Convert a single output file from genemark to GFF format
- cnv_gff2game.pl - Convert a GFF file to the game.xml format
- cnv_ltrfinder2gff.pl - Convert a single output file from ltrfinder to GFF format
- cnv_ltrseq2gff.pl - Convert a single output file from LTR_seq to GFF format
- cnv_ltrsruc2gff.pl - Convert output from LTR_struc to GFF format
- cnv_repmask2gff.pl - Convert a single output file from RepeatMasker to GFF format
- cnv_repseek2gff.pl - Convert a single output from RepSeek to GFF format
- cnv_seq2dir.pl - Convert a multiple record sequence file to multiple files
- cnv_ta2ap.pl - Convert the TriAnnotation GFF3 output to Apollo compatible GFF format
- cnv_tenest2gff.pl - Convert a single output from TE Nest to GFF format
- dir_merge.pl - Merge directories of DAWGPAWS output to a single directory
- fasta_merge.pl - Merge a directory of fasta files into a single fasta file
- fasta_dirsplit.pl - Split a directory of fasta files into n subdirectories
- fasta_shorten.pl - Change fasta headers to shorter names
- fetch_tenest.pl - Download a set of results from the TE Nest web server
- gff_seg.pl - Segment and parse a large gff file
- ltrstruc_prep.pl - Create the files needed to run the LTR_struc program
- seq_oligocount.pl - Count oligo redundancy for an input sequence
- vennseq.pl - Create Venn Diagrams of sequence features
LITERATURE CITED
Achaz, G., F. Boyer, et al. (2007). "Repseek, a tool to retrieve
approximate repeats from large DNA sequences." Bioinformatics 23(1):
119-21.
Altschul, S. F., T. L. Madden, et al. (1997). "Gapped BLAST and
PSI-BLAST: a new generation of protein database search programs."
Nucleic Acids Res 25(17): 3389-402.
Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA
sequences." Nucleic Acids Res 27(2): 573-80.
Burge, C. and S. Karlin (1997). "Prediction of complete gene structures
in human genomic DNA." J Mol Biol 268(1): 78-94.
Edgar, R. C. and E. W. Myers (2005). "PILER: identification and
classification of genomic repeats." Bioinformatics 21 Suppl 1: i152-8.
Ellinghaus, D., S. Kurtz, et al. (2008). "LTRharvest, an efficient and
flexible software for de novo detection of LTR retrotransposons." BMC
Bioinformatics 9: 18.
Juretic, N., T. E. Bureau, et al. (2004). "Transposable element
annotation of the rice genome." Bioinformatics 20(2): 155-160.
Kalyanaraman, A. and S. Aluru (2006). "Efficient algorithms and
software for detection of full-length LTR retrotransposons." J
Bioinform Comput Biol 4(2): 197-216.
Kent, W. J. (2002). "BLAT--the BLAST-like alignment tool." Genome Res
12(4): 656-64.
Kurtz, S. (2004). Vmatch. http://www.vmatch.de/
Kronmiller, B. A. and R. P. Wise (2007). "TE nest: Automated
chronological annotation and visualization of nested plant transposable
elements." Plant Physiol.
Lewis, S. E., S. M. Searle, et al. (2002). "Apollo: a sequence
annotation editor." Genome Biol 3(12): RESEARCH0082.
McCarthy, E. M. and J. F. McDonald (2003). "LTR_STRUC: a novel search
and identification program for LTR retrotransposons." Bioinformatics
19(3): 362-7.
Parra, G., E. Blanco, et al. (2000). "GeneID in Drosophila." Genome Res
10(4): 511-5.
Quesneville, H., C. M. Bergman, et al. (2005). "Combined evidence
annotation of transposable elements in genome sequences." PLoS Comput
Biol 1(2): 166-75.
Kurtz, S., J. V. Choudhuri, et al. (2001). "REPuter: the manifold
applications of repeat analysis on a genomic scale." Nucleic Acids Res
29(22): 4633-42.
Rho, M., J. H. Choi, et al. (2007). "De novo identification of LTR
retrotransposons in eukaryotic genomes." BMC Genomics 8: 90.
Schiex, T., A. Moisan, et al. (2001). EuGene: An Eucaryotic Gene Finder
that combines several sources of evidence. Computational Biology. O.
Gascuel and M.-F. Sagot: 111-125.
Tu, Z. (2001). "Eight novel families of miniature inverted repeat
transposable elements in the African malaria mosquito, Anopheles
gambiae." Proc Natl Acad Sci U S A 98(4): 1699-704.
Wicker, T., F. Sabot, et al. (2007). "A unified classification system
for eukaryotic transposable elements." Nat Rev Genet 8(12): 973-82.
Xu, Z. and H. Wang (2007). "LTR_FINDER: an efficient tool for the
prediction of full-length LTR retrotransposons." Nucleic Acids Res
35(Web Server issue): W265-8.