GENERIC FEATURE FORMAT VERSION 3
SUMMARY

Author: Lincoln Stein
Date: 23 May 2007
Version: 1.13

Although there are many richer ways of representing genomic features via XML, the stubborn persistence of a variety of ad-hoc tab-delimited flat file formats declares the bioinformatics community's need for a simple format that can be modified with a text editor and processed with shell tools like grep. The GFF format, although widely used, has fragmented into multiple incompatible dialects. When asked why they have modified the published Sanger specification, bioinformaticists frequently answer that the format was insufficient for their needs, and they needed to extend it. The proposed GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats.

The new format:

1) adds a mechanism for representing more than one level of hierarchical grouping of features and subfeatures.
2) separates the ideas of group membership and feature name/id.
3) constrains the feature type field to be taken from a controlled vocabulary.
4) allows a single feature, such as an exon, to belong to more than one group at a time.
5) provides an explicit convention for pairwise alignments.
6) provides an explicit convention for features that occupy disjunct regions.

Online Validator

An online GFF3 validator is available at http://dev.wormbase.org/db/validate_gff3/validate_gff3_online. It is limited to files of 3,000,000 lines or less. If you wish to validate larger files, please use the command-line version which can be downloaded from the same site.

DESCRIPTION OF THE FORMAT

The format consists of 9 columns, separated by tabs (NOT spaces).
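The tab-only rule can be sketched as follows. This is a minimal illustration, not a reference parser: the helper name `split_gff3_line` is hypothetical, and the names in `COLUMNS` are simply the column names used in the descriptions that follow.

```python
# Column names as used in the column descriptions of this document.
COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def split_gff3_line(line):
    """Split one feature line into its 9 tab-separated columns.

    Hypothetical helper: raises ValueError when the column count is wrong,
    which is what happens if spaces were used instead of tabs.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


record = split_gff3_line(
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN\n"
)
print(record["type"], record["start"], record["end"])
```

Splitting on the tab character alone, rather than on arbitrary whitespace, keeps fields that legitimately contain spaces intact.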
The following characters must be escaped using URL escaping conventions (%XX hex codes):

    tab
    newline
    carriage return
    control characters

The following characters have reserved meanings and must be escaped when used in other contexts:

    ; (semicolon)
    = (equals)
    % (percent)
    & (ampersand)
    , (comma)

Unescaped quotation marks, backslashes and other ad-hoc escaping conventions that have been added to the GFF format are explicitly forbidden.

Note that unescaped spaces are allowed within fields, meaning that parsers must split on tabs, not spaces. Undefined fields are replaced with the "." character, as described in the original GFF spec.

Column 1: "seqid"
The ID of the landmark used to establish the coordinate system for the current feature. IDs may contain any characters, but must escape any characters not in the set [a-zA-Z0-9.:^*$@!+_?-|]. In particular, IDs may not contain unescaped whitespace and must not begin with an unescaped ">".

Column 2: "source"
The source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature. Typically this is the name of a piece of software, such as "Genescan" or a database name, such as "Genbank." In effect, the source is used to extend the feature ontology by adding a qualifier to the type creating a new composite type that is a subclass of the type in the type column.

Column 3: "type"
The type of the feature (previously called the "method"). This is constrained to be either: (a) a term from the "lite" sequence ontology, SOFA; or (b) a SOFA accession number. The latter alternative is distinguished using the syntax SO:000000.

Columns 4 & 5: "start" and "end"
The start and end of the feature, in 1-based integer coordinates, relative to the landmark given in column 1. Start is always less than or equal to end. For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark.

Column 6: "score"
The score of the feature, a floating point number. As in earlier versions of the format, the semantics of the score are ill-defined. It is strongly recommended that E-values be used for sequence similarity features, and that P-values be used for ab initio gene prediction features.

Column 7: "strand"
The strand of the feature. + for positive strand (relative to the landmark), - for minus strand, and . for features that are not stranded. In addition, ? can be used for features whose strandedness is relevant, but unknown.

Column 8: "phase"
For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region. This is NOT to be confused with the frame, which is simply start modulo 3. For forward strand features, phase is counted from the start field. For reverse strand features, phase is counted from the end field. The phase is REQUIRED for all CDS features.

Column 9: "attributes"
A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape.

These tags have predefined meanings:

ID
    Indicates the name of the feature. IDs must be unique within the scope of the GFF file.

Name
    Display name for the feature. This is the name to be displayed to the user. Unlike IDs, there is no requirement that the Name be unique within the file.

Alias
    A secondary name for the feature. It is suggested that this tag be used whenever a secondary identifier for the feature is needed, such as locus names and accession numbers. Unlike ID, there is no requirement that Alias be unique within the file.

Parent
    Indicates the parent of the feature. A parent ID can be used to group exons into transcripts, transcripts into genes, and so forth. A feature may have multiple parents. Parent can *only* be used to indicate a part_of relationship.

Target
    Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The format of the value is "target_id start end [strand]", where strand is optional and may be "+" or "-". If the target_id contains spaces, they must be escaped as hex escape %20.

Gap
    The alignment of the feature to the target if the two are not collinear (e.g. contain gaps). The alignment format is taken from the CIGAR format described in the Exonerate documentation (http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl). See "THE GAP ATTRIBUTE" for a description of this format.

Derives_from
    Used to disambiguate the relationship between one feature and another when the relationship is a temporal one rather than a purely structural "part of" one. This is needed for polycistronic genes. See "PATHOLOGICAL CASES" for further discussion.

Note
    A free text note.

Dbxref
    A database cross reference. See the section "Ontology Associations and Db Cross References" for details on the format.

Ontology_term
    A cross reference to an ontology term. See the section "Ontology Associations and Db Cross References" for details.

Multiple attributes of the same type are indicated by separating the values with the comma "," character, as in:

    Parent=AF2312,AB2812,abc-3

Note that attribute names are case sensitive. "Parent" is not the same as "parent".

All attributes that begin with an uppercase letter are reserved for later use. Attributes that begin with a lowercase letter can be used freely by applications.

THE CANONICAL GENE

FIGURE 1
This section describes the representation of a protein-coding gene in GFF3. To illustrate how a canonical gene is represented, consider Figure 1 (figure1.png). This indicates a gene named EDEN extending from position 1000 to position 9000. It encodes three alternatively-spliced transcripts named EDEN.1, EDEN.2 and EDEN.3, the last of which has two alternative translational start sites leading to the generation of two protein coding sequences. There is also an identified transcriptional factor binding site located 50 bp upstream from the transcriptional start site of EDEN.1 and EDEN.2.

Here is how this gene should be described using GFF3:

     0  ##gff-version 3
     1  ##sequence-region ctg123 1 1497228
     2  ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
     3  ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001
     4  ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
     5  ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001;Name=EDEN.2
     6  ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001;Name=EDEN.3
     7  ctg123 . exon 1300 1500 . + . ID=exon00001;Parent=mRNA00003
     8  ctg123 . exon 1050 1500 . + . ID=exon00002;Parent=mRNA00001,mRNA00002
     9  ctg123 . exon 3000 3902 . + . ID=exon00003;Parent=mRNA00001,mRNA00003
    10  ctg123 . exon 5000 5500 . + . ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
    11  ctg123 . exon 7000 9000 . + . ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
    12  ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
    13  ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
    14  ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
    15  ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
    16  ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
    17  ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
    18  ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
    19  ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
    20  ctg123 . CDS 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
    21  ctg123 . CDS 7000 7600 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
    22  ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
    23  ctg123 . CDS 5000 5500 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
    24  ctg123 . CDS 7000 7600 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4

Lines beginning with ## are pragmas that provide meta-information about the document. Blank lines and lines beginning with a single # are ignored.

Line 0 gives the GFF version using the ##gff-version pragma. Line 1 indicates the boundaries of the region being annotated (a 1,497,228 bp region named "ctg123") using the ##sequence-region pragma.

Line 2 defines the boundaries of the gene. Column 9 of this line assigns the gene an ID of gene00001, and a human-readable name of EDEN. Because the gene is not part of a larger feature, it has no Parent.

Line 3 annotates the transcriptional factor binding site. Since it is logically part of the gene, its Parent attribute is gene00001.

Lines 4-6 define this gene's three spliced transcripts, one line for the full extent of each of the mRNAs. These features are necessary to act as parents for the four CDSs which derive from them, as well as the structural parents of the five exons in the alternative splicing set.

Lines 7-11 identify the five exons. The Parent attributes indicate which mRNAs the exons belong to. Notice that several of the exons share the same parents, using the comma symbol to indicate multiple parentage.

Lines 12-24 denote this gene's four CDSs. Each CDS belongs to one of the mRNAs. cds00003 and cds00004, which correspond to alternative start codons, belong to the same mRNA. In both of them the first segment ends two bases into a codon (3301-3902 is 602 bases and 3391-3902 is 512 bases), so the following segments have phase 1.

Note that several of the features, including the gene, its mRNAs and the CDSs, all have Name attributes. This attribute assigns those features a public name, but is not mandatory.
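The phase of each segment of a multi-line CDS follows from the lengths of the segments before it. The sketch below assumes a forward-strand CDS whose segments are listed in transcript order; `cds_phases` is a hypothetical helper, not part of the specification. For reverse-strand features the segments would instead be visited from the highest coordinates down, since phase is then counted from the end field.

```python
def cds_phases(segments, first_phase=0):
    """Return the phase of each CDS segment on the forward strand.

    segments: list of (start, end) pairs, 1-based and inclusive, in
    transcript order. first_phase is the phase of the first segment.
    """
    phases = []
    phase = first_phase
    for start, end in segments:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after skipping `phase` bases and reading whole
        # codons form the start of a codon that the next segment completes.
        leftover = (length - phase) % 3
        phase = (3 - leftover) % 3
    return phases


# Segments of cds00001 from the canonical gene example.
print(cds_phases([(1201, 1500), (3000, 3902), (5000, 5500), (7000, 7600)]))

# Hypothetical segments: 10 bases leave one base of an unfinished codon,
# so the second segment must skip 2 bases to reach the next codon.
print(cds_phases([(1, 10), (21, 30)]))
```

Every segment of cds00001 has a length divisible by three, which is why all four of its lines carry phase 0.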
The ID attributes are only mandatory for those features that have children (the gene and mRNAs), or for those that span multiple lines. The IDs do not have meaning outside the file in which they reside. Hence, a slightly simplified version of this file would look like this:

    ##gff-version 3
    ##sequence-region ctg123 1 1497228
    ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
    ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001
    ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001
    ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001
    ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001
    ctg123 . exon 1300 1500 . + . Parent=mRNA00003
    ctg123 . exon 1050 1500 . + . Parent=mRNA00001,mRNA00002
    ctg123 . exon 3000 3902 . + . Parent=mRNA00001,mRNA00003
    ctg123 . exon 5000 5500 . + . Parent=mRNA00001,mRNA00002,mRNA00003
    ctg123 . exon 7000 9000 . + . Parent=mRNA00001,mRNA00002,mRNA00003
    ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
    ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001
    ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001
    ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001
    ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002
    ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002
    ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002
    ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003
    ctg123 . CDS 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003
    ctg123 . CDS 7000 7600 . + 1 ID=cds00003;Parent=mRNA00003
    ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003
    ctg123 . CDS 5000 5500 . + 1 ID=cds00004;Parent=mRNA00003
    ctg123 . CDS 7000 7600 . + 1 ID=cds00004;Parent=mRNA00003

NOTE 1 - SOFA IDs: If using the SOFA IDs rather than the short names ("mRNA" etc), use the following mappings:

    gene    SO:0000704
    mRNA    SO:0000234
    exon    SO:0000147
    cds     SO:0000316

Other mRNA parts that you might wish to use are:

    intron             SO:0000188    (redundant with exon)
    polyA_sequence     SO:0000610    (part of the three_prime_utr)
    polyA_site         SO:0000553    (part of the gene)
    five_prime_utr     SO:0000204
    three_prime_utr    SO:0000205

NOTE 2 - "Orphan" exons, CDSs, and other features. Ab initio gene prediction programs call hypothetical exons and CDSs that are attached to the genomic sequence and not necessarily to a known transcript. To handle these features, you may either (1) create a placeholder mRNA and use it as the parent for the exon and CDS subfeatures; or (2) attach the exons and CDSs directly to the gene. This is allowed by SO because of the transitive nature of the part_of relationship.

NOTE 3 - UTRs, splice sites and translational start and stop sites. These are implied by the combination of exon and CDS and do not need to be explicitly annotated as part of the canonical gene. In the case of annotating predicted splice or translational start/stop sites independently of a particular gene, it is suggested that they be attached directly to the genomic sequence and not to a gene or a subpart of a gene.

NOTE 4 - CDS features MUST have a defined phase field. Otherwise it is not possible to infer the correct polypeptides corresponding to partially annotated genes.

NOTE 5 - The START and STOP codons are included in the CDS. That is, if the locations of the start and stop codons are known, the first three base pairs of the CDS should correspond to the start codon and the last three should correspond to the stop codon.

REPRESENTING SPLICED NON-CODING TRANSCRIPTS

For spliced non-coding transcripts, such as those produced by some processed snRNAs and viruses, use a parent feature of "noncoding_transcript" and a child of "exon."
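Whether a transcript is coding or non-coding, a reader recovers its structure from column 9. The sketch below parses the attributes column and groups child features under their Parent IDs. It is only an illustration under stated assumptions: `parse_attributes` and `children_by_parent` are hypothetical names, the feature IDs and coordinates in the sample are made up, and percent-decoding is delegated to `urllib.parse.unquote` from the Python standard library, which decodes %XX escapes and leaves "+" untouched.

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse column 9 into a dict mapping each tag to a list of values."""
    attributes = {}
    if column9 in ("", "."):
        return attributes
    for pair in column9.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Split on commas before unescaping so that %2C stays inside a value.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


def children_by_parent(lines):
    """Map each parent ID to the (type, start, end) of its child features."""
    children = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank lines, comments and ## pragmas
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        for parent in attrs.get("Parent", []):
            children[parent].append((cols[2], int(cols[3]), int(cols[4])))
    return dict(children)


# Hypothetical spliced non-coding transcript; IDs and coordinates are made up.
example = [
    "##gff-version 3",
    "ctg123\t.\tnoncoding_transcript\t100\t900\t.\t+\t.\tID=nc00001;Note=spliced%2C non-coding",
    "ctg123\t.\texon\t100\t300\t.\t+\t.\tParent=nc00001",
    "ctg123\t.\texon\t700\t900\t.\t+\t.\tParent=nc00001",
]
print(parse_attributes("ID=nc00001;Note=spliced%2C non-coding"))
print(children_by_parent(example))
```

Storing every value as a list keeps single-valued and multi-valued tags uniform, so a feature with several parents needs no special handling.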

PARENT (PART-OF) RELATIONSHIPS

THE GAP ATTRIBUTE

ALIGNMENTS

TRANSCRIPT-RELATIVE ALIGNMENTS

ONTOLOGY ASSOCIATIONS AND DB CROSS REFERENCES

Two reserved attributes, Ontology_term and Dbxref, can be used to establish links between a GFF3 feature and a data record contained in another database. Ontology_term is reserved for associations to ontologies, such as the Gene Ontology. Dbxref is used for all other cross references. While there is no firm boundary line between these two concepts, curators tend to treat ontology associations differently and hence ontology terms have been given their own reserved attribute label.

The value of both Ontology_term and Dbxref is the ID of the cross referenced object in the form "DBTAG:ID". The DBTAG indicates which database the referenced object can be found in, and ID indicates the identifier of the object within that database. IDs can contain unescaped colons but DBTAGs cannot, so parsing code should split on the first colon encountered in the attribute value.

The format of each type of ID varies from database to database. An authoritative list of databases, their DBTAGs, and the URL transformation rules that can be used to fetch the objects given their IDs can be found at this location:

    ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs

Further details can be found here:

    ftp://ftp.geneontology.org/pub/go/doc/GO.xrf_abbs_spec

Here are some common examples:

* a dbxref to an EMBL sequence accession number: Dbxref="EMBL:AA816246"
* a dbxref to an NCBI gi number: Dbxref="NCBI_gi:10727410"
* an Ontology_term referring to a GO association: Ontology_term="GO:0046703"

OTHER SYNTAX

Comments are preceded by the # symbol. Meta-data and directives are preceded by ##. The following directives are recognized:

##gff-version 3
    The GFF version, always 3 in this spec. This directive must be present, and must be the topmost line of the file.

##sequence-region seqid start end
    The sequence segment referred to by this file, in the format "seqid start end".
    This element is optional, but strongly encouraged because it allows parsers to perform bounds checking on features. There may be multiple ##sequence-region directives, each corresponding to one of the reference sequences referred to in the body of the file.

##feature-ontology URI
    This directive indicates that the GFF3 file uses the ontology of feature types located at the indicated URI or URL. Multiple URIs may be added, in which case they are merged (or raise an exception if they cannot be merged). The URIs for the released sequence ontologies are:

    Release 1: 5/12/2004
        http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.6
    Release 2: 5/16/2005
        http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.12

    This directive may occur several times per file. If no feature ontology is specified, then the most recent release of the Sequence Ontology is assumed. If multiple directives are given and a feature type is matched by multiple ontologies, the matching ontology included by the directive highest in the file wins the reference. The Sequence Ontology itself is always referenced last. The content referenced by URI must be in OBO or DAG-Edit format.

##attribute-ontology URI
    This directive indicates that the GFF3 uses the ontology of attribute names located at the indicated URI or URL. This directive may appear multiple times to load multiple URIs, in which case they are merged (or raise an exception if merging is not possible). Currently no formal attribute ontologies exist, so this attribute is for future extension.

##source-ontology URI
    This directive indicates that the GFF3 uses the ontology of source names located at the indicated URI or URL. This directive may appear multiple times to load multiple URIs, in which case they are merged (or raise an exception if merging is not possible). Currently no formal source ontologies exist, so this attribute is for future extension.

###
    This directive (three # signs in a row) indicates that all forward references to feature IDs that have been seen to this point have been resolved. After seeing this directive, a program that is processing the file serially can close off any open objects that it has created and return them, thereby allowing iterative access to the file. Otherwise, software cannot know that a feature has been fully populated by its subfeatures until the end of the file has been reached. It is recommended that complex features, such as the canonical gene, be terminated with the ### notation.

##FASTA
    This notation indicates that the annotation portion of the file is at an end and that the remainder of the file contains one or more sequences (nucleotide or protein) in FASTA format. This allows features and sequences to be bundled together. Example:

    ##gff-version 3
    ##sequence-region ctg123 1 1497228
    ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
    ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001
    ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
    ctg123 . 5'-UTR 1050 1200 . + . Parent=mRNA00001
    ctg123 . CDS 1201 1500 . + 0 Parent=mRNA00001
    ctg123 . CDS 3000 3902 . + 0 Parent=mRNA00001
    ctg123 . CDS 5000 5500 . + 0 Parent=mRNA00001
    ctg123 . CDS 7000 7600 . + 0 Parent=mRNA00001
    ctg123 . 3'-UTR 7601 9000 . + . Parent=mRNA00001
    ctg123 . cDNA_match 1050 1500 5.8e-42 + . ID=match00001;Target=cdna0123 12 462
    ctg123 . cDNA_match 5000 5500 8.1e-43 + . ID=match00001;Target=cdna0123 463 963
    ctg123 . cDNA_match 7000 9000 1.4e-40 + . ID=match00001;Target=cdna0123 964 2964
    ##FASTA
    >ctg123
    cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
    tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
    tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
    aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
    aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
    cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
    gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
    ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
    aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
    aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
    ...
    >cdna0123
    ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
    agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
    aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
    tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
    gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
    tcaaacagcggctgtaaaaatttgtgattatggttaaagg

    For backward-compatibility with the GFF version output by the Artemis tool, a GFF line that begins with the character > creates an implied ##FASTA directive.

PATHOLOGICAL CASES

The following section discusses how to represent "pathological" cases that arise in prokaryotic and eukaryotic genetics. Most of these have to do with organisms' endlessly creative ways of processing transcripts.

a) Single exon genes

This is the case in which a single unspliced transcript encodes a single CDS.

    ----->XXXXXXX*------>

The preferred representation is to create a gene, a transcript, an exon and a CDS:

    ChrX . gene XXXX YYYY . + . ID=gene01;name=resA
    ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
    ChrX . exon XXXX YYYY . + . Parent=tran01
    ChrX . CDS XXXX YYYY . + . Parent=tran01

Some groups will find this redundant. A valid alternative is to omit the exon feature:

    ChrX . gene XXXX YYYY . + . ID=gene01;name=resA
    ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
    ChrX . CDS XXXX YYYY . + . Parent=tran01

It is not recommended to parent the CDS directly onto the gene, because this will make it impossible to determine the UTRs (since the gene may validly include untranscribed regulatory regions). Also note that mixing the two styles, as in the case of an organism with both spliced and unspliced transcripts, is liable to confuse people working with the GFF3 file.

b) Polycistronic transcripts

This is the case in which a single (possibly spliced) transcript encodes multiple open reading frames that generate independent protein products.

    ----->XXXXXXX*-->BBBBBB*--->ZZZZ*-->AAAAAA*-----

Since the single transcript corresponds to multiple genes that can be identified by genetic analysis, the recommended solution here is to create four "gene" objects and make them the parents of a single transcript. The transcript will contain a single exon (in the unspliced case) and four separate CDSs:

    ChrX . gene XXXX YYYY . + . ID=gene01;name=resA
    ChrX . gene XXXX YYYY . + . ID=gene02;name=resB
    ChrX . gene XXXX YYYY . + . ID=gene03;name=resX
    ChrX . gene XXXX YYYY . + . ID=gene04;name=resZ
    ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02,gene03,gene04
    ChrX . exon XXXX YYYY . + . ID=exon00001;Parent=tran01
    ChrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene01
    ChrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene02
    ChrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene03
    ChrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene04

To disambiguate the relationship between which genes encode which CDSs, you may use the Derives_from relationship.

c) Gene containing an intein

An intein occurs when a portion of the protein is spliced out and the two polypeptide fragments are rejoined to become a functional protein.
The portion that is spliced out is called the "intein," and it may itself have intrinsic molecular activity:

    ----->XXXXXXyyyyyyyyyyXXXXXXX*-------   (yyyyyy is the intein)

The preferred representation is to create one gene, one transcript, one exon, and one CDS. The CDS produces a pre-polypeptide using the "Derives_from" tag, and this polypeptide in turn gives rise to two mature_polypeptides, one each for the intein and the flanking protein:

    ChrX . gene XXXX YYYY . + . ID=gene01;name=resA
    ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
    ChrX . exon XXXX YYYY . + . Parent=tran01
    ChrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01
    ChrX . polypeptide XXXX YYYY . + . ID=poly01;Derives_from=cds01
    ChrX . mature_polypeptide XXXX YYYY . + . ID=poly02;Parent=poly01
    ChrX . mature_polypeptide XXXX YYYY . + . ID=poly02;Parent=poly01
    ChrX . intein XXXX YYYY . + . ID=poly03;Parent=poly01

Because the flanking mature_polypeptide has discontinuous coordinates on the genome, it appears twice with the same ID.

If the intein is immediately degraded, you may not wish to annotate it explicitly, and its line would be deleted from the example. However, if it has molecular activity, it may correspond to a gene, in which case:

    ChrX . gene XXXX YYYY . + . ID=gene01;name=resA
    ChrX . gene XXXX YYYY . + . ID=gene02;name=inteinA
    ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02
    ChrX . exon XXXX YYYY . + . Parent=tran01
    ChrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01
    ChrX . polypeptide XXXX YYYY . + . ID=poly01;Derives_from=cds01
    ChrX . mature_polypeptide XXXX YYYY . + . ID=poly02;Parent=poly01;Derives_from=gene01
    ChrX . mature_polypeptide XXXX YYYY . + . ID=poly02;Parent=poly01;Derives_from=gene01
    ChrX . intein XXXX YYYY . + . ID=poly03;Parent=poly01;Derives_from=gene02

The term "polypeptide" is part of SO. The terms "mature_polypeptide" and "intein" are slated to be added in a pending release.
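A reader that meets the same ID on several lines has to merge those lines into one discontinuous feature. The following is a rough sketch of that merge, assuming the lines have already been reduced to (ID, type, start, end) tuples. `collect_segments` is a hypothetical helper, the coordinates stand in for the XXXX/YYYY placeholders, and the check that all rows of one ID share a type is an assumption added for safety rather than a rule stated in this document.

```python
def collect_segments(rows):
    """Group (feature_id, type, start, end) rows by ID.

    Rows sharing an ID are treated as segments of one discontinuous
    feature, following the convention used in the intein example.
    """
    features = {}
    for feature_id, ftype, start, end in rows:
        entry = features.setdefault(feature_id, {"type": ftype, "segments": []})
        if entry["type"] != ftype:
            # Assumption: one ID should not be shared by different types.
            raise ValueError(f"ID {feature_id!r} reused with a different type")
        entry["segments"].append((start, end))
    for entry in features.values():
        entry["segments"].sort()
    return features


# Hypothetical coordinates standing in for the XXXX/YYYY placeholders.
rows = [
    ("poly01", "polypeptide", 1000, 2500),
    ("poly02", "mature_polypeptide", 1000, 1500),
    ("poly02", "mature_polypeptide", 2101, 2500),
    ("poly03", "intein", 1501, 2100),
]
features = collect_segments(rows)
print(features["poly02"]["segments"])
```

Here poly02 ends up with two segments that flank the intein, while poly01 and poly03 each keep a single segment.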

d) Trans-spliced transcript

This occurs when two genes contribute to a processed transcript via a trans-splicing reaction:

    spliced leader
    =======>----->XXXXXXX*------>

The simplest way to represent this is to show the mRNA as being split across two discontinuous genomic locations:

    ChrX . gene XXXX YYYY . + . ID=gene01;name=my_gene
    ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
    ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01
    ChrX . exon XXXX YYYY . + . Parent=tran01
    ChrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01

However, this does not indicate which part of the transcript comes from the spliced leader. A preferred representation explicitly adds features for the spliced leader gene, the primary_transcript and the spliced_leader_RNA:

    ChrX . gene XXXX YYYY . + . ID=gene01;name=my_gene
    ChrX . gene XXXX YYYY . + . ID=gene02;name=leader_gene
    ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02
    ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02
    ChrX . primary_transcript XXXX YYYY . + . ID=pt01;Parent=tran01;Derives_from=gene01
    ChrX . spliced_leader_RNA XXXX YYYY . + . ID=sl01;Parent=tran01;Derives_from=gene02
    ChrX . exon XXXX YYYY . + . Parent=tran01
    ChrX . CDS XXXX YYYY . + . ID=cds01;Parent=tran01

As shown here, the mRNA derives from two genes ("my_gene" and the leader gene) and occupies disjunct coordinates on the genome. The primary_transcript, which encodes the body of the mRNA, is part of (has as its Parent) this mRNA. The same relationship applies to the spliced leader RNA. The Derives_from relationship is used to indicate which genes produced the primary transcript and spliced leader respectively. The exon and CDS features follow in the normal fashion.

e) Programmed frameshift

This event occurs when the ribosome performs a programmed frameshift during translation in order to skip over an in-frame stop codon. The frameshift may occur forward or backward.

    -------------------------> mRNA
    ========== ============*   CDS

The representation of this is to make the CDS discontinuous:

    ChrX . gene XXXX YYYY . + . ID=gene01;name=my_gene
    ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01;Ontology_term=SO:1000069
    ChrX . exon XXXX YYYY . + . Parent=tran01
    ChrX . CDS XXXX YYYY . + 0 ID=cds01;Parent=tran01
    ChrX . CDS YYYY-1 ZZZZ . + 1 ID=cds01;Parent=tran01

You will also need to adjust the phase field properly so that the CDS translates. It is suggested that the mRNA be tagged with the appropriate SO transcript attributes such as "minus_1_translational_frameshift" (SO:1000069). This will allow all such programmed frameshift mRNAs to be recovered with a query. The accession for "plus_1_translational_frameshift" is SO:1001263.

f) An operon

A classic operon occurs when the genes in a polycistronic transcript are co-regulated by cis-regulatory element(s):

    regulatory element
    * ================================================> operon
      ----->XXXXXXX*-->BBBBBB*--->ZZZZ*-->AAAAAA*-----

It can be indicated in GFF3 in this way:

    ChrX . operon XXXX YYYY . + . ID=operon01;name=my_operon
    ChrX . promoter XXXX YYYY . + . Parent=operon01
    ChrX . gene XXXX YYYY . + . ID=gene01;Parent=operon01;name=resA
    ChrX . gene XXXX YYYY . + . ID=gene02;Parent=operon01;name=resB
    ChrX . gene XXXX YYYY . + . ID=gene03;Parent=operon01;name=resX
    ChrX . gene XXXX YYYY . + . ID=gene04;Parent=operon01;name=resZ
    ChrX . mRNA XXXX YYYY . + . ID=tran01;Parent=gene01,gene02,gene03,gene04
    ChrX . exon XXXX YYYY . + . ID=exon00001;Parent=tran01
    ChrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene01
    ChrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene02
    ChrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene03
    ChrX . CDS XXXX YYYY . + . Parent=tran01;Derives_from=gene04

The regulatory element ("promoter" in this example) is part of the operon via the Parent tag. The four genes are part of the operon, and the resulting mRNA is multiply-parented by the four genes, as in the earlier example.

At the time of this writing, promoters and other cis-regulatory elements cannot be part_of an operon, but this restriction is being reconsidered.
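Because part_of is transitive and a feature may have several parents, the features that contain a given feature form a graph rather than a simple chain. A minimal sketch of walking that graph follows. `ancestors` is a hypothetical helper, the `parents` mapping is written by hand from the Parent attributes of the operon example, and "cds_resA" is a made-up ID, since the CDS lines in that example carry none.

```python
def ancestors(feature_id, parents):
    """Return every feature reachable from feature_id through Parent links."""
    seen = set()
    stack = list(parents.get(feature_id, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited through another parent
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


# Parent links from the operon example; "cds_resA" is a hypothetical ID.
parents = {
    "gene01": ["operon01"],
    "gene02": ["operon01"],
    "gene03": ["operon01"],
    "gene04": ["operon01"],
    "tran01": ["gene01", "gene02", "gene03", "gene04"],
    "cds_resA": ["tran01"],
}
print(sorted(ancestors("cds_resA", parents)))
```

The `seen` set matters here: operon01 is reachable through all four genes, and without it the walk would visit the operon four times.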

Change Log:

1.13 Wed May 23 10:31:01 EDT 2007
    - Insist that CDS include the start and end codon.

1.12 Thu Apr 5 17:32:32 EDT 2007
    - Use "match_part" as the subpart of cDNA_match in the paired EST example.
    - Phase is required for all CDS features.

1.11 Fri Dec 1 16:33:39 EST 2006
    - Clarified definition of phase relative to reverse strand features

1.10 14 September 2006
    - Reformatted for new SO web site.

1.09 Wed Sep 6 17:55:32 EDT 2006
    - Information about the GFF3 validator.

1.08 Tue Jul 18 15:12:11 EDT 2006
    - Added URLs for SO releases.

1.07 Wed May 24 21:59:02 EDT 2006
    - Fixed description of phase (temporarily lost due to CVS glitches)

1.06 Wed May 24 11:44:22 EDT 2006
    - Relaxed escaping rules.
    - Fixed typos found by Gordon Gremme.

1.05 Tue May 23 10:46:25 EDT 2006
    - Fixed all IDs in the examples to make them internally consistent. Previously, some examples did not validate because of inconsistent numbers of zeroes in the identifiers (mRNA00001 vs mRNA0001).

Maintained on SourceForge.
Supported by a grant from the National Human Genome Research Institute.
Member of the Gene Ontology Consortium.