Question

mRNA start used instead of gene start in makeTxDbFromGFF

0

Entering edit mode

TimothéeFlutre ▴ 80

@timotheeflutre-6727

Last seen 5.6 years ago

France

I am trying to understand how GenomicFeatures::makeTxDbFromGFF works by using a subset of the first example from the official GFF3 specification. (I made minor changes below compare to the example.)

> txt <- paste0("##gff-version 3",
                "\n##sequence-region ctg123 1 1497228",
                "\nctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN",
                "\nctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001",
                "\nctg123\t.\texon\t1050\t1500\t.\t+\t.\tParent=mRNA00001",
                "\nctg123\t.\tCDS\t1201\t1500\t.\t+\t.\tID=cds00001;Parent=mRNA00001")
> cat(txt, file="subset.gff")
> txdb <- makeTxDbFromGFF(file=gff.file.small, format="auto", dataSource=url,
                          organism="Vitis vinifera", taxonomyId=29760,
                          chrominfo=data.frame(chrom="ctg123", length=1497228, is_circular=FALSE))
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK

All looks good except that gene start is wrong:

> genes(txdb)
GRanges object with 1 range and 1 metadata column:
       seqnames       ranges strand |     gene_id
          <Rle>    <IRanges>  <Rle> | <character>
  EDEN   ctg123 [1050, 9000]      + |        EDEN

It should start at 1000, right?

Note that transcript, exon and CDS coordinates are correct:

> transcripts(txdb)
GRanges object with 1 range and 2 metadata columns:
      seqnames       ranges strand |     tx_id     tx_name
         <Rle>    <IRanges>  <Rle> | <integer> <character>
  [1]   ctg123 [1050, 9000]      + |         1   mRNA00001
> exons(txdb)
GRanges object with 1 range and 1 metadata column:
      seqnames       ranges strand |   exon_id
         <Rle>    <IRanges>  <Rle> | <integer>
  [1]   ctg123 [1050, 1500]      + |         1
> cds(txdb)
GRanges object with 1 range and 1 metadata column:
      seqnames       ranges strand |    cds_id
         <Rle>    <IRanges>  <Rle> | <integer>
  [1]   ctg123 [1201, 1500]      + |         1

maketxdbfromgff genomicfeatures • 1.5k views

ADD COMMENT • link updated 8.6 years ago by Hervé Pagès 16k • written 8.6 years ago by TimothéeFlutre ▴ 80

score 1 · Accepted Answer · 2016-05-05

Hi Tim,

We don't import the start/end found on the gene lines of a GFF file. In fact, in the SQLite db, unlike the transcripts, exons, and cds tables, the genes table doesn't have the start/end columns. Instead the genes() extractor infers the gene coordinates from the transcripts of a gene. The reason for this db schema is that, historically, we started to support UCSC and Ensembl annotations (via makeTxDbFromUCSC() and makeTxDbFromBiomart()) where storing the gene start/end was not needed. Another reason is that a gene can have transcripts on more than 1 chromosome so there are situations where there is no such thing as a gene start and end. Support for GFF (via makeTxDbFromGFF()) came a couple of years later. In my experience, the start/end reported on the gene line have always matched those inferred from the transcripts. Are you dealing with GFF files where this is not the case?

Thanks,

H.