rtracklayer export GRangesList as gtf and change attributes
1
0
Entering edit mode
Jake ▴ 90
@jake-7236
Last seen 9 months ago
United States

I have a GRangesList of exons by geneID that I want to export as a GTF file for some other programs. I have tried the following:

export(unlist(grl), '~/Desktop/exon.gtf')

However, column 3 of the gtf file lists sequence_feature rather than exon and the attributes column lists ID "ENSMUSG..." rather than the typical gene_id "ENSMUSG...". I realize that I can change these pretty easily afterwards with a script, but I was wondering if there is a way to modify the export behavior? Thanks

grangeslist rtracklayer • 3.1k views
ADD COMMENT
2
Entering edit mode
@michael-lawrence-3846
Last seen 9 days ago
United States

There is a bug where exporting a GRangesList to GTF leads to shaping the data as GFF3 (but then exporting it according to the general GFF2 spec). Note that rtracklayer considers GTF to stand for "Gene Transfer Format" (version 2.2 as defined by WashU and endorsed by UCSC), not Ensembl's "General Transfer Format" (which is the same as GFF2).

GTF 2.2 defines very specific constraints and semantics on top of GFF2, where one must indicate the CDS regions (and, separately, the start and stop codon) of every described transcript.  The UTRs are optional. I'll add a simple translation of GRangesList, where each element is assumed to represent the coding portion of a transcript (as output by cdsBy()). I'll also add a convenience for TxDb, which will at least define the required features.

However, all this means that GTF is not the appropriate file format for your specific example, where you have reduced the exons (coding and non-coding) into some other sort of exonic feature. It may be that you could call those features exons and that some tools would work just fine, but the file would not be strictly valid GTF.  The right format to use would be GFF3. 

ADD COMMENT
0
Entering edit mode

I want to do more filtering, but here is the simplest case of what I'm trying to do. I have a TxDb named "gencode" 

library('GenomicFeatures')
library('rtracklayer')
gencode <- loadDb('gencode_m6_basic.sqlite')
exonAll <- exonsBy(gencode, by='gene')
exonAll <- reduce(exonAll)
export(exonAll, 'exon.gtf')

This is what the resulting GTF file looks like. 

##gff-version 2

##source-version rtracklayer 1.28.6

##date 2015-07-19

chr3    rtracklayer     mRNA    108107280       108146146       .       -       .       ID "mRNA1"; Name "ENSMUSG00000000001.4";

chrX    rtracklayer     mRNA    77837901        77853623        .       -       .       ID "mRNA2"; Name "ENSMUSG00000000003.13";

chr16   rtracklayer     mRNA    18780447        18811972        .       -       .       ID "mRNA3"; Name "ENSMUSG00000000028.12";

chr7    rtracklayer     mRNA    142575558       142578120       .       -       .       ID "mRNA4"; Name "ENSMUSG00000000031.13";

chrX    rtracklayer     mRNA    161117193       161258213       .       +       .       ID "mRNA5"; Name "ENSMUSG00000000037.14";

chr11   rtracklayer     mRNA    108395293       108414396       .       +       .       ID "mRNA6"; Name "ENSMUSG00000000049.9";

chr11   rtracklayer     mRNA    121237253       121255856       .       +       .       ID "mRNA7"; Name "ENSMUSG00000000056.7";

chr6    rtracklayer     mRNA    17281185        17289115        .       +       .       ID "mRNA8"; Name "ENSMUSG00000000058.6";

 

ADD REPLY
0
Entering edit mode

Sorry, I will edit the answer to reflect that rtracklayer is incorrectly targeting GFF3 here, not GTF. 

ADD REPLY

Login before adding your answer.

Traffic: 150 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6