I am trying to understand how GenomicFeatures::makeTxDbFromGFF
works by using a subset of the first example from the official GFF3 specification. (I made minor changes below compare to the example.)
> txt <- paste0("##gff-version 3", "\n##sequence-region ctg123 1 1497228", "\nctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN", "\nctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001", "\nctg123\t.\texon\t1050\t1500\t.\t+\t.\tParent=mRNA00001", "\nctg123\t.\tCDS\t1201\t1500\t.\t+\t.\tID=cds00001;Parent=mRNA00001") > cat(txt, file="subset.gff") > txdb <- makeTxDbFromGFF(file=gff.file.small, format="auto", dataSource=url, organism="Vitis vinifera", taxonomyId=29760, chrominfo=data.frame(chrom="ctg123", length=1497228, is_circular=FALSE)) Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK
All looks good except that gene start is wrong:
> genes(txdb) GRanges object with 1 range and 1 metadata column: seqnames ranges strand | gene_id <Rle> <IRanges> <Rle> | <character> EDEN ctg123 [1050, 9000] + | EDEN
It should start at 1000
, right?
Note that transcript, exon and CDS coordinates are correct:
> transcripts(txdb) GRanges object with 1 range and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] ctg123 [1050, 9000] + | 1 mRNA00001 > exons(txdb) GRanges object with 1 range and 1 metadata column: seqnames ranges strand | exon_id <Rle> <IRanges> <Rle> | <integer> [1] ctg123 [1050, 1500] + | 1 > cds(txdb) GRanges object with 1 range and 1 metadata column: seqnames ranges strand | cds_id <Rle> <IRanges> <Rle> | <integer> [1] ctg123 [1201, 1500] + | 1
Ok, thanks Hervé. I was misled by the example from the official GFF3 specification.