GeneSplicer: A computational method for splice site prediction

Overview

GeneSplicer is a fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes. The system has been trained and tested successfully on Plasmodium falciparum (malaria), Arabidopsis thaliana, human, Drosophila, and rice. Training data sets for human and Arabidopsis thaliana are included. See below for instructions on downloading the complete system, including source code.

System requirements

GeneSplicer is released as source code and was tested on Linux RedHat 6.x+, Sun Solaris, and Alpha OSF1, but should work on any Unix system.

Obtaining GeneSplicer

This software is OSI Certified Open Source Software.

To download the complete GeneSplicer system, just click here.

After downloading, uncompress the distribution file by typing:

      % tar -xzf GeneSplicer.tar.gz
      

A directory named 'GeneSplicer/' will be created, containing the executable, the training data sets, and other supporting files.

Training GeneSplicer

There is no independent program to train GeneSplicer, but the necessary files can be obtained by using the training procedure of GlimmerHMM, which can be downloaded from here.

After running trainGlimmerHMM, create a directory containing the following files from the resulting GlimmerHMM training directory:

  1. acc*
  2. don*
  3. score_*
  4. outex
  5. outin
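The copying step above can be sketched as a small shell function. This is only a minimal sketch: the function name collect_training_files and its two directory arguments are hypothetical, and it assumes the listed files sit at the top level of the GlimmerHMM training directory.

```shell
# Hypothetical helper (not part of GeneSplicer or GlimmerHMM): copy the
# files listed above from a GlimmerHMM training directory into a new one.
# Usage: collect_training_files <glimmerhmm_training_dir> <new_dir>
collect_training_files() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # Patterns follow the list above: acc*, don*, score_*, outex, outin.
    for f in "$src"/acc* "$src"/don* "$src"/score_* "$src"/outex "$src"/outin; do
        # Skip patterns that matched nothing, and anything that is not a regular file.
        [ -f "$f" ] || continue
        cp "$f" "$dest"/ || return 1
    done
}
```

Files such as false.acc and false.don do not match these patterns, so they stay in the original training directory.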

In the same directory, create a file called config_file containing the following values, one per line, in this order:

  • a "high-confidence" threshold for acceptors
  • a "high-confidence" threshold for donors
  • a threshold for acceptors
  • a threshold for donors
  • 1 if there are files like acc<number> among the training files; if there is only acc1.mar, then this line should be 0
  • 1 if there are files like don<number> among the training files; if there is only don1.mar, then this line should be 0
  • a number representing the distance for filtering neighbouring acceptors (usually 60)
  • a number representing the distance for filtering neighbouring donors (usually 60)
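Writing the file could look like the sketch below. It assumes that each value goes on its own line, as the wording "this line should be 0" suggests; the function name write_config_file and its argument order are hypothetical. No threshold values are suggested here, so the caller supplies them, along with the two 0/1 flags.

```shell
# Hypothetical helper (not part of GeneSplicer): write config_file into a
# training directory. Assumption: one value per line, in the order listed above.
# Usage: write_config_file <dir> <acc_high> <don_high> <acc_thr> <don_thr> \
#                          <acc_flag> <don_flag> [acc_dist] [don_dist]
write_config_file() {
    if [ "$#" -lt 7 ]; then
        echo "write_config_file: expected at least 7 arguments" >&2
        return 1
    fi
    dir=$1
    # The filtering distances default to 60, the usual value named above.
    acc_dist=${8:-60}
    don_dist=${9:-60}
    printf '%s\n' "$2" "$3" "$4" "$5" "$6" "$7" "$acc_dist" "$don_dist" \
        > "$dir/config_file"
}
```

Comparing the result with one of the config_file examples shipped in the GeneSplicer distribution is a sensible check before relying on it.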

Ideas for thresholds can be taken from the files false.acc and false.don, for acceptors and donors respectively. These files are created in the initial GlimmerHMM training directory. Please consult the existing config_file files distributed with GeneSplicer to see concrete examples.

Contact Information

You can contact us about GeneSplicer at: mpertea jhu edu

References

M. Pertea, X. Lin, S. L. Salzberg. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res. 2001 Mar 1;29(5):1185-90.

Acknowledgements

The development of GeneSplicer was supported by the NSF under grants KDI-9980088 and DBI-0234704 and by NIH grant R01-LM06845.