mRNA start used instead of gene start in makeTxDbFromGFF
1
0
Entering edit mode
@timotheeflutre-6727
Last seen 2.4 years ago
France

I am trying to understand how GenomicFeatures::makeTxDbFromGFF works by using a subset of the first example from the official GFF3 specification. (I made minor changes below compare to the example.)

> txt <- paste0("##gff-version 3",
                "\n##sequence-region ctg123 1 1497228",
                "\nctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN",
                "\nctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001",
                "\nctg123\t.\texon\t1050\t1500\t.\t+\t.\tParent=mRNA00001",
                "\nctg123\t.\tCDS\t1201\t1500\t.\t+\t.\tID=cds00001;Parent=mRNA00001")
> cat(txt, file="subset.gff")
> txdb <- makeTxDbFromGFF(file=gff.file.small, format="auto", dataSource=url,
                          organism="Vitis vinifera", taxonomyId=29760,
                          chrominfo=data.frame(chrom="ctg123", length=1497228, is_circular=FALSE))
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK

All looks good except that gene start is wrong:

> genes(txdb)
GRanges object with 1 range and 1 metadata column:
       seqnames       ranges strand |     gene_id
          <Rle>    <IRanges>  <Rle> | <character>
  EDEN   ctg123 [1050, 9000]      + |        EDEN

It should start at 1000, right?

Note that transcript, exon and CDS coordinates are correct:

> transcripts(txdb)
GRanges object with 1 range and 2 metadata columns:
      seqnames       ranges strand |     tx_id     tx_name
         <Rle>    <IRanges>  <Rle> | <integer> <character>
  [1]   ctg123 [1050, 9000]      + |         1   mRNA00001
> exons(txdb)
GRanges object with 1 range and 1 metadata column:
      seqnames       ranges strand |   exon_id
         <Rle>    <IRanges>  <Rle> | <integer>
  [1]   ctg123 [1050, 1500]      + |         1
> cds(txdb)
GRanges object with 1 range and 1 metadata column:
      seqnames       ranges strand |    cds_id
         <Rle>    <IRanges>  <Rle> | <integer>
  [1]   ctg123 [1201, 1500]      + |         1

 

maketxdbfromgff genomicfeatures • 769 views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 2 hours ago
Seattle, WA, United States

Hi Tim,

We don't import the start/end found on the gene lines of a GFF file. In fact, in the SQLite db, unlike the transcripts, exons, and cds tables, the genes table doesn't have the start/end columns. Instead the genes() extractor infers the gene coordinates from the transcripts of a gene. The reason for this db schema is that, historically, we started to support UCSC and Ensembl annotations (via makeTxDbFromUCSC() and makeTxDbFromBiomart()) where storing the gene start/end was not needed. Another reason is that a gene can have transcripts on more than 1 chromosome so there are situations where there is no such thing as a gene start and end. Support for GFF (via makeTxDbFromGFF()) came a couple of years later. In my experience, the start/end reported on the gene line have always matched those inferred from the transcripts. Are you dealing with GFF files where this is not the case?

Thanks,

H.

ADD COMMENT
0
Entering edit mode

Ok, thanks Hervé. I was misled by the example from the official GFF3 specification.

ADD REPLY

Login before adding your answer.

Traffic: 489 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6