Initial Training

The following 2 files, genbank.coord and genbank.fasta, were used to train the GlimmerM system.

genbank.coord has the following format:

sequence_1_name CDS
start_exon_1 end_exon_1
start_exon_2 end_exon_2
start_exon_n end_exon_n

sequence_2_name CDS complement
end_exon_m start_exon_m
end_exon_1 start_exon_1


Here the first gene is on the direct strand of sequence 1 and has n exons, while the second gene is on the reverse strand of sequence 2 and has m exons.
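As an illustration of the layout above, here is a minimal Python sketch of a reader for a file in this format. It is not part of GlimmerM. It assumes that records are separated by blank lines, that the header line holds the sequence name followed by "CDS" and optionally "complement", and that every other line holds two integers. The names parse_coord and Gene are hypothetical, and the sample coordinates are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    """One record of a coord file (hypothetical container for this sketch)."""
    name: str
    complement: bool
    exons: list = field(default_factory=list)  # (first, second) pairs as read


def parse_coord(text):
    """Parse coord-style text into a list of Gene records.

    Assumes blank lines separate records and the first line of each
    record is the header 'name CDS [complement]'.
    """
    genes = []
    current = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            current = None  # a blank line closes the current record
            continue
        if current is None:
            # Header line: sequence name, then CDS, then optional complement.
            current = Gene(fields[0], "complement" in fields[1:])
            genes.append(current)
        else:
            # Coordinate line: keep the two numbers in the order given.
            current.exons.append((int(fields[0]), int(fields[1])))
    return genes


# Invented sample data, laid out like the format description.
sample = """seqA CDS
10 50
100 180

seqB CDS complement
300 220
150 90
"""

for gene in parse_coord(sample):
    strand = "reverse" if gene.complement else "direct"
    print(gene.name, strand, len(gene.exons), "exons")
```

The pairs are stored exactly as written in the file, so a caller that needs ordered coordinates on the reverse strand has to normalize them itself.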

genbank.fasta is a multifasta file with the DNA sequences containing the genes described in genbank.coord.
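A multifasta file can be loaded with a few lines of plain Python. The sketch below is only an illustration: it assumes standard FASTA conventions (a header line starting with ">" followed by sequence lines) and takes the first word of each header as the sequence name. The function name read_multifasta and the sample sequences are hypothetical.

```python
def read_multifasta(text):
    """Return a dict mapping sequence name to sequence string.

    Assumes each record starts with a '>' header line and that the
    first whitespace-delimited word of the header is the name.
    """
    chunks = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {key: "".join(lines) for key, lines in chunks.items()}


# Invented sample data.
fasta_text = """>seqA example description
ACGT
TTGA
>seqB
GGCC
"""

sequences = read_multifasta(fasta_text)
for seq_name, seq in sequences.items():
    print(seq_name, len(seq))
```

The names returned here would have to match the sequence names used in genbank.coord for the two files to be combined.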

Latest Training

pf_trainset_cDNAs : set of 39 GenBank accessions of cDNA sequences encoding full-length genes.
pf_trainset_genes : set of 117 GenBank accessions of genomic sequences encoding full-length genes.
training.exons    : set of GenBank accessions with exon coordinates that were experimentally verified.

Some statistics on the pf_trainset_genes data set are:

Average length of gene: 2446.2
Average length of exon: 1265.9
Average length of intron: 180.3
Max no of introns: 15 for gene PFPRIMSSU
Max length of introns: 922 for gene PFDNACPN
Min length of introns: 73 for gene PFPRIMSSU
Max length of single exons: 12681 for gene PFA245435
Min length of single exons: 294 for gene PFAHMGP
Max length of first exons: 8520 for gene PFARPI
Min length of first exons: 3 for gene PFAALD
Max length of internal exons: 1869 for gene PFU07706
Min length of internal exons: 42 for gene PFPRIMSSU
Max length of last exons: 9707 for gene AF312917
Min length of last exons: 44 for gene AF161264
Max length of gene: 12681 for gene PFA245435
Min length of gene: 294 for gene PFAHMGP
Max length of sequence spanning a gene: 12681
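
Exon and intron lengths of the kind listed above can be derived from exon coordinates. The following Python sketch shows one way to do this. It is not the script that produced the figures above, and it assumes 1-based inclusive coordinates, which the format description does not state. The function names and the sample coordinates are hypothetical.

```python
def exon_lengths(exons):
    """Length of each exon, assuming 1-based inclusive coordinates."""
    return [abs(b - a) + 1 for a, b in exons]


def intron_lengths(exons):
    """Length of the gap between each pair of neighbouring exons.

    Each pair is first normalized to (low, high) and the exons are
    sorted by position, so the order and strand of the input do not
    matter.
    """
    spans = sorted((min(a, b), max(a, b)) for a, b in exons)
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(spans, spans[1:])]


# Invented sample coordinates for a direct-strand and a reverse-strand gene.
direct_gene = [(10, 50), (100, 180)]
reverse_gene = [(300, 220), (150, 90)]

print(exon_lengths(direct_gene), intron_lengths(direct_gene))
print(exon_lengths(reverse_gene), intron_lengths(reverse_gene))
```

Averages, minima and maxima then follow from applying mean, min and max to these lists across all genes in a data set.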