Information about sim4cc

sim4cc: A Cross-Species Spliced Alignment Tool

sim4cc is a heuristic sequence alignment tool for comparing a cDNA sequence with a genomic sequence containing a homolog of the gene in another species. It is built on the foundation of sim4 and incorporates several techniques that make it suitable for cross-species comparisons.

A Guide to Using sim4cc's Command Line Options

NAME

sim4cc -- align an expressed DNA sequence with a genomic sequence from a related species, allowing for introns

SYNOPSIS

sim4cc seqfile1 seqfile2 {[AXRKCDHEPNBSZJ]=value}

DESCRIPTION

sim4cc retains many of the uses and characteristics of sim4. Its main application is the cross-species alignment of an expressed DNA sequence (EST, cDNA, mRNA) with a genomic sequence for the gene. It also detects end matches when the two input sequences overlap at one end (i.e., the start of one sequence overlaps the end of the other). If seqfile2 is a database of sequences, the sequence in seqfile1 will be aligned with each of the sequences in seqfile2.

Like its predecessor, sim4cc employs a blast-based technique to first determine the basic matching blocks representing the "exon cores". In stage two, the exon cores are extended into the adjacent as-yet-unmatched fragments using greedy alignment algorithms, and heuristics are used to favor configurations that conform to the splice-site recognition signals (GT-AG, CT-AC). If necessary, the process is repeated with less stringent parameters on the unmatched fragments.
To adapt it for cross-species comparisons, sim4cc has incorporated several algorithmic improvements. Instead of looking for exact k-mer matches, it searches for approximate matches according to an a priori determined pattern, called a spaced seed. In the seed pattern (e.g., 110x10x11), a 1 indicates a position required to match, 0 is a wildcard that allows either a match or a mismatch, and x allows either a match or a transition, but not a transversion. The seeds have been optimized for aligning cDNA and genomic sequences, and for a large number of species comparisons, and therefore it is unlikely that they will require adjustments. sim4cc also incorporates more sophisticated splice site models, and alignment algorithms adapted for cross-species comparison.
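
The seed symbols described above can be illustrated with a minimal sketch. It is not taken from the sim4cc source: the function name `seed_hit` and the assumption that both windows are uppercase and exactly as long as the seed are illustrative only.

```python
# Transitions are substitutions within purines (A<->G) or pyrimidines (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_hit(seed, a, b):
    """Return True if windows a and b satisfy the spaced seed pattern.

    seed uses '1' (must match), '0' (wildcard) and 'x' (match or transition).
    a and b are assumed to be uppercase strings of the same length as seed.
    """
    if not (len(seed) == len(a) == len(b)):
        raise ValueError("seed and both windows must have the same length")
    for code, p, q in zip(seed, a, b):
        if code == "1":
            if p != q:
                return False
        elif code == "x":
            if p != q and (p, q) not in TRANSITIONS:
                return False
        elif code != "0":
            raise ValueError(f"unknown seed symbol: {code!r}")
    return True
```

With the example seed 110x10x11, a T/C difference at the fourth position still counts as a hit, while a T/A difference at the same position does not.
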
By default, sim4cc searches both strands and reports the best match, measured by the number of matching nucleotides found in the alignment. The R command line option can be used to restrict the search to one orientation (strand) only.
Currently, six major alignment display options are supported, controlled by the A option. By default (A=0), only the endpoints, overall similarity, and orientation of the introns are reported. An arrow sign (`->' or `<-') indicates the orientation of the intron (`+' or `-' strand), when the signals flanking the intron have three or more position matches with either the GT-AG or the CT-AC splice recognition signals. When the same number of matches is found for both orientations, the intron is reported as ambiguous, and represented by `--'. The sign `==' marks the absence from the alignment of a cDNA fragment starting at that position. Alternative formats (lav-block format, text, PipMaker-type `exons file', or certain combinations of these options) can be requested by specifying a different value for A.
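
The orientation rule for the default output can be sketched as follows. This is an illustration only, not sim4cc's code: the function names are hypothetical, the intron is represented simply by its own sequence, and returning `--' when neither signal reaches three matches is an assumption, since the text does not cover that case.

```python
def signal_matches(donor, acceptor, signal):
    """Count position matches against a 4-letter signal such as 'GTAG'."""
    return sum(1 for got, want in zip(donor + acceptor, signal) if got == want)


def intron_orientation(intron):
    """Classify an intron by its first two and last two bases."""
    if len(intron) < 4:
        raise ValueError("intron must be at least 4 bases long")
    donor, acceptor = intron[:2].upper(), intron[-2:].upper()
    forward = signal_matches(donor, acceptor, "GTAG")  # '+' strand signal
    reverse = signal_matches(donor, acceptor, "CTAC")  # '-' strand signal
    if forward == reverse:
        return "--"  # same number of matches: ambiguous
    if forward > reverse and forward >= 3:
        return "->"
    if reverse > forward and reverse >= 3:
        return "<-"
    return "--"  # assumption: fewer than three matches is also shown as '--'
```

Because GT-AG and CT-AC share the middle T and A, an intron such as GT...AC scores three matches against both signals and is therefore reported as ambiguous.
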
If the P option is specified with a non-zero value, sim4cc will remove any 3'-end poly-A tails that it detects in the alignment.
Occasionally, sim4cc may miss an internal exon that is surrounded by very large introns, typically longer than 100 Kb. When this is suspected, the H option can be used to reset the exons' weight to compensate for the intron gap penalty.
Ambiguity codes are allowed in sequence data by default, but sim4cc does not distinguish among them. If desired, the B command line option can restrict the set of acceptable characters to A,C,G,T,N and X only.
sim4cc compares the lengths of the input sequences to distinguish between the cDNA (`short') and the genomic (`long') components in the comparison. When seqfile2 contains a collection of sequences, the first entry in the file will be used to determine the type of this and all subsequent comparisons.
In the description below, the term MSP denotes a maximal segment pair, that is, a pair of highly similar fragments in the two sequences, obtained during the blast-like procedure by extending a spaced seed hit by matches and perhaps a few mismatches.

OPTIONS

The algorithm parameters (included in the first two sections below) have already been tuned and do not normally require adjustment by the user.

Alignment parameters:

Z Sets the spaced seed pattern used to identify approximate matches in the first stage of the algorithm. The default seed pattern was optimized for cDNA-to-genomic sequence alignment and for a large number of species comparisons, but can be reset by the user if desired. In that case, a seed of weight 12 or 11, counting 1 for each 1 in the pattern and 0.5 for every x, is recommended.
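
The weight rule can be checked with a few lines of Python before passing a custom seed to Z. The helper name `seed_weight` is hypothetical; only the counting rule (1 per `1`, 0.5 per `x`) comes from the description above.

```python
def seed_weight(seed):
    """Weight of a spaced seed: 1 for each '1', 0.5 for each 'x', 0 for '0'."""
    ones = seed.count("1")
    xs = seed.count("x")
    zeros = seed.count("0")
    if ones + xs + zeros != len(seed):
        raise ValueError("seed may contain only the symbols 1, 0 and x")
    return ones + 0.5 * xs
```

For instance, the seed used in the EXAMPLES section, 111010010100110111, has weight 11, which falls in the recommended range.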

X Controls the limits for terminating word extensions in the blast-like stage of the algorithm. The default value is 12.

K Sets the threshold for the MSP scores when determining the basic `exon cores', during the first stage of the algorithm. (If this option is not specified, the threshold is computed from the lengths of the sequences, using statistical criteria.) For example, a good value for genomic sequences in the range of a few hundred Kb is 16. To avoid spurious matches, however, a larger value may be needed for longer sequences.

C Sets the threshold for the MSP scores when aligning the as-yet-unmatched fragments, during the second stage of the algorithm. By default, the smaller of the constant 12 and a statistics-based threshold is chosen.

J Specifies the splice junction model to be used: J=0 (original: GT-AG PWM), J=1 (GeneSplicer; default), and J=2 (Glimmer). The option J=2 requires the Glimmer model to be present in the working directory.

D Sets the bound for the "diagonal" distance between consecutive MSPs in an exon. The default value is 10.

Context parameters:

R Specifies the direction of the search. If R=0, only the "+" (direct) strand is searched. If R=1, only the "-" (reverse complement) matches are sought. By default (R=2), sim4cc searches both strands and reports the best match, measured by the number of matching pairs in the alignment.
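
As a reminder of what a "-" strand match means, the following sketch computes a reverse complement. It is illustrative only: the helper name is hypothetical, and ambiguity codes other than N and X are passed through unchanged as a simplification.

```python
# Complement table for the basic alphabet; N and X map to themselves.
_COMPLEMENT = str.maketrans("ACGTNX", "TGCANX")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (uppercased)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]
```

Applied to the signal GT-AG written as GTAG, the function returns CTAC, which is why CT-AC is the splice signal expected for introns on the reverse strand.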

A Specifies the format of the output: exon endpoints only (A=0), alignment text (A=1), alignment in lav-block format (A=2), or both exon endpoints and alignment text (A=3 or A=4). Options A=5 and A=6 print the exon and CDS coordinates (the latter, if known) in the `exon file' format required by PipMaker and in the GFF3 format, respectively. If a reverse complement match is found, A=0,1,2,3,5,6 will give its position in the "+" strand of the longer sequence and the "-" strand of the shorter sequence. A=4 will give its position in the "+" strand of the first sequence (seqfile1) and the "-" strand of the second sequence (seqfile2), regardless of which sequence is longer. When the S command line option is used to specify the endpoints of the coding region (CDS) in the mRNA, the exon and CDS annotation in PipMaker format (except for A=6) is appended to the alignment in the user-specified format.

P Specifies whether or not the program should report the fragment of the alignment containing the poly-A tail (if found). By default (P=0) the alignment is displayed as computed; a non-zero value makes sim4cc remove the poly-A tails. When this feature is enabled, all display options produce additional lav alignment headers.

H Resets the MSPs' weight to compensate for very large introns. The default value is H=500, but some introns larger than 100 Kb may require higher values, typically between 1000 and 2500. This option should be used cautiously, generally in cases where an unmatched internal portion of the cDNA may disguise a missed exon within a very large intron. It is not recommended for ESTs, where it may produce spurious exons.

N Requests an additional search for small marginal exons (N=1) guided by the splice-site recognition signals. This option can be used when a high accuracy match is expected. The default value is N=0, specifying no additional search.

B Controls the set of characters allowed in the input sequences. By default (B=1), ambiguity characters (ABCDGHKMNRSTVWXY) are allowed. By specifying B=0, the set of acceptable characters is restricted to A,C,G,T,N and X only.
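
A quick pre-check of an input sequence against the two character sets might look like the sketch below. The function name is hypothetical, and treating lowercase letters like uppercase is an assumption; the two alphabets are the ones listed above.

```python
AMBIGUITY_ALPHABET = set("ABCDGHKMNRSTVWXY")  # allowed with B=1 (default)
STRICT_ALPHABET = set("ACGTNX")               # allowed with B=0


def invalid_characters(sequence, b=1):
    """Return the sorted characters of sequence not allowed under option B."""
    allowed = AMBIGUITY_ALPHABET if b else STRICT_ALPHABET
    # Assumption: case is ignored when checking characters.
    return sorted(set(sequence.upper()) - allowed)
```

An empty list means the sequence uses only characters that the chosen B setting accepts.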

S Allows the user to specify the endpoints of the CDS in the input mRNA, with the syntax: S=n1..n2. If the option S=0..0 is invoked, the endpoints of the longest open reading frame (ORF) are calculated on-the-fly and reported in the output format requested by the user (see option A above). When the second file is an mRNA database, the command line specification for the CDS will apply to the first sequence in the file only.
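
The S=n1..n2 syntax can be parsed as sketched below. This is a hypothetical helper, not part of sim4cc; it assumes 1-based coordinates with n1 no greater than n2, which the text does not state explicitly.

```python
import re


def parse_cds_option(value):
    """Parse the value of S, e.g. '123..1020'.

    Returns (n1, n2), or None when '0..0' requests the longest ORF.
    """
    match = re.fullmatch(r"(\d+)\.\.(\d+)", value)
    if match is None:
        raise ValueError(f"expected n1..n2, got {value!r}")
    n1, n2 = int(match.group(1)), int(match.group(2))
    if (n1, n2) == (0, 0):
        return None  # endpoints are to be computed from the longest ORF
    # Assumption: coordinates are 1-based and ordered.
    if n1 < 1 or n1 > n2:
        raise ValueError(f"invalid CDS endpoints: {n1}..{n2}")
    return n1, n2
```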

EXAMPLES

   
sim4cc cdna genomic
        
sim4cc genomic cdnadb

sim4cc cdna genomic A=3 P=1 

sim4cc cdna1 cdna2 R=1

sim4cc mRNA genomic A=6 S=123..1020
        
sim4cc mRNA genomic A=3 Z=111010010100110111
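
When sim4cc is driven from a script, the command lines above can be assembled programmatically. The sketch below is a hypothetical wrapper that only builds the argument list; the option letters are those listed in the SYNOPSIS, and nothing is executed.

```python
OPTION_LETTERS = set("AXRKCDHEPNBSZJ")  # letters from the SYNOPSIS line


def sim4cc_command(seqfile1, seqfile2, **options):
    """Build an argument list such as ['sim4cc', 'cdna', 'genomic', 'A=3']."""
    args = ["sim4cc", seqfile1, seqfile2]
    for letter, value in options.items():
        if letter not in OPTION_LETTERS:
            raise ValueError(f"unknown sim4cc option: {letter!r}")
        args.append(f"{letter}={value}")
    return args
```

For example, `sim4cc_command("cdna", "genomic", A=3, P=1)` reproduces the third command above, and the resulting list could be handed to `subprocess.run` once the sim4cc executable is on the PATH.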

Download and Installation Procedure

sim4cc was written to work on a Unix (ANSI/POSIX standard) system. It may require a few modifications to compile and run on a different platform.

Download the gzipped tar file sim4cc.tar.gz. (If the web browser alters the file name, rename the package to sim4cc.tar.gz.)

Unpack the downloaded files.

        gunzip < sim4cc.tar.gz | tar -xvf -

The tar file will unpack into a directory named sim4cc.[some-date]. (You'll see what the precise name is while tar is unpacking.) Make that directory current.
```
	cd sim4cc.*
```
Compile.
```
	make
```

You may get some warnings from the compiler. We're interested in seeing these, but they shouldn't cause any problems. The executable will be named `sim4cc'. Copy it to your bin directory, or leave it where it is.

References

sim4cc is described in the paper:

Zhou, L., M. Pertea, A. Delcher, L. Florea (2009) "Sim4cc: a cross-species spliced alignment program", Nucl. Acids Res. 37(11): e80.

Other useful references can be found here.
