GFF utilities

GFF files
The gffread utility

Extracting transcript sequences
Obtaining gffread

The gffcompare utility

Example: evaluating transcript discovery accuracy
Obtaining gffcompare

Publication

GFF files

Many bioinformatics programs represent genes and transcripts in GFF format (General Feature Format) which simply describes the locations and the attributes of gene and transcript features on the genome (chromosome or scaffolds/contigs).
GFF has many versions, but the two most popular that are GTF2 (Gene Transfer Format, described here) and GFF3 (defined here). Here are a few details about the way these formats are interpreted by StringTie and other bioinformatics programs.

GTF files

As seen in the GTF2 specification, the transcript_id attribute is required, so our GFF parser also expects it, while a gene_id attribute, though not strictly required by our GFF parser, is very useful for grouping alternative transcripts under a gene/locus identifier. An optional gene_name attribute, if found, will be taken and shown as a symbolic gene name or short-form abbreviation (e.g. gene symbols from HGNC or Entrez Gene). Some annotation sources (e.g. Ensembl) place a "human readable" gene name/symbol in the gene_name attribute, like a HUGO symbol (while gene_id might be just an automatically generated numeric identifier for the gene).

Most bioinformatics programs generally expect at least the exon features which are enough to define the overall transcript structure, with optional CDS features to specify the coding segments. Our GFF reader will ignore redundant features like start_codon, stop_codon when whole CDS features were provided, and similarly UTR features are ignored when whole exon features were given. However, it is still possible to provide only CDS and *UTR features and our GFF parser will reassemble the exonic structure accordingly (internally converting these segments to exon segments).

Example of a GTF2 transcript record with minimal attributes:

20	protein_coding	exon	9873504	9874841	.	+	.	gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT00000027448"; gene_name "ZNF366";
20	protein_coding	CDS	9873504	9874841	.	+	0	gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT00000027448"; gene_name "ZNF366";
20	protein_coding	exon	9877488	9877679	.	+	.	gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT00000027448"; gene_name "ZNF366";
20	protein_coding	CDS	9877488	9877679	.	+	0	gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT00000027448"; gene_name "ZNF366";
20	protein_coding	exon	9888412	9888586	.	+	.	gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT00000027448"; gene_name "ZNF366";
20	protein_coding	CDS	9888412	9888586	.	+	0	gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT00000027448"; gene_name "ZNF366";
20	protein_coding	exon	9891475	9891998	.	+	.	gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT00000027448"; gene_name "ZNF366";
20	protein_coding	CDS	9891475	9891995	.	+	2	gene_id "ENSBTAG00000020601"; transcript_id "ENSBTAT00000027448"; gene_name "ZNF366";

The GTF output of programs like StringTie and Cufflinks also have an additional transcript feature line acting as a parent feature for the exon and CDS features which define the transcript structure and have the same transcript_id attribute. This is not required by the GTF2 specification (and makes such files more similar to GFF3, which always have a "parent" feature, see below) but may be useful for holding the global attributes of each transcript (instead of having them duplicated for each exon feature).

GFF3 files

As defined by the GFF3 specification, the parent features (usually transcripts, i.e. "mRNA" features) are required to have an ID attribute, but here again an optional gene_name attribute can be used to specify a common gene name abbreviation. If gene_name is not given, it can be also inferred from the Name or ID attributes of the parent gene feature of the current parent mRNA feature (if given in the input file). CDS and exon features are required to have a Parent attribute whose value must match the value of the ID attribute of a parent transcript feature (usually a "mRNA" feature).

Example of a GFF3 transcript record with minimal attributes:

ctg123	example	mRNA	1300	9950	.	+	.	ID=t_012143;gene_name=EDEN
ctg123	example	exon	1300	1500	.	+	.	Parent=t_012143
ctg123	example	exon	3000	3902	.	+	.	Parent=t_012143
ctg123	example	CDS	3301	3902	.	+	0	Parent=t_012143
ctg123	example	exon	5000	5500	.	+	.	Parent=t_012143
ctg123	example	CDS	5000	5500	.	+	1	Parent=t_012143
ctg123	example	exon	7000	9000	.	+	.	Parent=t_012143
ctg123	example	CDS	7000	7600	.	+	1	Parent=t_012143
ctg123	example	exon	9400	9950	.	+	.	Parent=t_012143

Feature restrictions

For various reasons our GFF parser currently assumes the following limits (maximum values) for the genomic length (span) of gene and transcript features:

genes and transcripts should not span more than 7 Megabases on the genomic sequence
exons should not be longer than 30 Kilobases
introns should not be larger than 6 Megabases

Also, transcript IDs are expected to be unique per GFF input file (though we relaxed this restriction by limiting it to a chromosome/contig scope).

The gffread utility

The program gffread can be used to validate, filter, convert and perform various other operations on GFF files (use gffread -h to see the various usage options). Because the program shares the same GFF parser code with Cufflinks, Stringtie, and gffcompare, it could be used to verify that a GFF file from a certain annotation source is correctly "understood" by these programs. Thus the gffread utility can be used to simply read the transcripts from the file, and optionally print these transcripts back, in either GFF3 (default) or GTF2 format (with the -T option), while discarding any non-essential attributes, optionally fixing some potential issues with the input file(s). The command line for such a quick cleanup and a quick visual inspection of a given GFF file could be:

gffread -E annotation.gtf -o- | less

This will show the minimalist GFF3 re-formatting of the transcript records found in the input file (annotation.gtf in this example) which could be given in either GFF3 or GTF2 format. The -E option directs gffread to "expose" (display warnings about) any potential issues encountered while parsing the input file.

In order to see the GTF2 version of the same transcripts, the -T option should be added:

gffread -E annotation.gff -T -o- | more

The examples above also show that gffread can be used to convert a file between GTF2 and GFF3 file formats.

Extracting transcript sequences

gffread can also be used to generate a FASTA file with the DNA sequences for all transcripts in a GFF file. For this operation a fasta file with the genomic sequences have to be provided as well. For example, one might want to extract the sequence of all transfrags (defined as transcripts or transcript fragments that result from the assembly process) assembled from a StringTie or Cufflinks assembly session. This can be accomplished with a command line like this:

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

The file genome.fa in this example would be a multi-fasta file with the genomic sequences of the target genome. This also requires that every contig or chromosome name found in the 1st column of the input GFF file (transcript.gtf in this example) must have a corresponding sequence entry in chromosomes.fa. This should be the case in our example if genome.fa is the file corresponding to the same genome (index) that was used for mapping the reads with Tophat.

Note that the retrieval of the transcript sequences this way is going to be much faster if a fasta index file (genome.fa.fai in this example) is found in the same directory with the genomic fasta file. Such an index file can be created with the samtools utility prior to running gffread, like this:

samtools faidx genome.fa

Then in subsequent runs using the -g option gffread will find this FASTA index and use it to speed up the extraction of transcript sequences.

Obtaining gffread

project on Github: https://github.com/gpertea/gffread
source package: gffread-0.12.9.tar.gz
precompiled Linux x86_64 binary (statically built on CentOS 5) : gffread-0.12.9.Linux_x86_64.tar.gz
precompiled Apple OS X binary (for OS X 10.7 and higher): gffread-0.12.9.OSX_x86_64.tar.gz

The gffcompare utility

The program gffcompare can be used to compare, merge, annotate and estimate accuracy of one or more GFF files (the "query" files), when compared with a reference annotation (also provided as GFF/GTF). A more detailed documentation for the program and its output files can be found here (gffcompare documentation page)

Example: evaluating transcript discovery accuracy

Gffcompare can be used to evaluate and compare the accuracy of transcript assemblers - in terms of their structural correctness (exon/intron coordinates). This assessment can even be performed in case of more generic "transcript discovery" programs like gene finders. The best way to do this would be to use a simulated data set (where the "reference annotation" is also the set of the expressed transcripts being simulated), but for well annotated reference genomes (human, mouse etc.), gffcompare can be used to evaluate and compare the general accuracy of isoform discovery programs on a real data set, using just the known (reference) annotation of that genome.

As a practical example, let's assume we ran both Cufflinks and StringTie on a mouse RNA-Seq sample and we want to compare the overall accuracy of the two programs. In order to compare the baseline de novo transcript assembly accuracy, both Cufflinks and StringTie should be run without using any reference annotation data (i.e. no -G or -g options were used).

Assuming that Cufflinks' transcript assembly output file name is cufflinks_asm.gtf and StringTie's output is in stringtie_asm.gtf, while the reference annotation would be in a file called mm10.gff, the gffcompare commands would be:

 gffcompare -R -r mm10.gff -o cuffcmp cufflinks_asm.gtf
 gffcompare -R -r mm10.gff -o strtcmp stringtie_asm.gtf

The -R option is used here in order to adjust the sensitivity calculation as to only consider the "expressed" genes, which are those reference genes for which gffcompare found at least one overlapping transfrag in the given assembly data (*_asm.gtf file). (Of course this option would not be needed in the case of simulated RNA-Seq experiments where the reference transcripts would be all "expressed"). Multiple output files will be generated by gffcompare, with the given prefix - in the example above, for the stringtie assemblies, the output files will be:

   strtcmp.combined.gtf
   strtcmp.loci
   strtcmp.stats
   strtcmp.tracking
   strtcmp.stringtie_asm.gtf.refmap
   strtcmp.stringtie_asm.gtf.tmap

Comparing the transcript assembly accuracy of the two programs is done by looking at the Sensitivity and Precision values in the *.stats output files for each program (strtcmp.stats vs cuffcmp.stats in the example above). That output looks like this (partially):

#= Summary for dataset: stringtie_asm.gtf
#     Query mRNAs :   23555 in   17628 loci  (17231 multi-exon transcripts)
#            (3731 multi-transcript loci, ~1.3 transcripts per locus)
# Reference mRNAs :   16628 in   12062 loci  (15850 multi-exon)
# Super-loci w/ reference transcripts:    11552
#-----------------| Sensitivity | Precision  |
        Base level:    82.4     |    76.5    |
        Exon level:    81.2     |    82.9    |
      Intron level:    86.1     |    94.8    |
Intron chain level:    56.9     |    52.4    |
  Transcript level:    55.2     |    38.9    |
       Locus level:    70.1     |    48.0    |

Obtaining gffcompare

project on Github: https://github.com/gpertea/gffcompare
source package: gffcompare-0.12.10.tar.gz
precompiled Linux x86_64 binary (statically built on CentOS 5) : gffcompare-0.12.10.Linux_x86_64.tar.gz
precompiled Apple OS X binary (for OS X 10.7 and higher): gffcompare-0.12.10.OSX_x86_64.tar.gz

Publication

Pertea G and Pertea M. GFF Utilities: GffRead and GffCompare. F1000Research 2020, 9:304 DOI: 10.12688/f1000research.23297.1