Does anybody know of R libraries that extract information from GTF files? For instance, I have DESeq results with Ensembl transcript IDs, and I want to annotate each transcript row with the transcript information from the GTF file. Column 9 of the GTF file is quite complicated because of the many fields delimited by ";". I was wondering if there are tools out there to manipulate GTF files.
Oh, sorry, I'm using the latest Gencode v38 GTF file for human, found here: http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz. It has specific attributes that I don't find in the Ensembl GTFs (version 104). For instance, the Ensembl GTF does not have information for the canonical transcript, but the Gencode v38 version does.
Thank you for the information!
You could do something like
And then all those transcripts would be the canonical ones?
This is great, thank you... it would have taken me hours to figure it out. However, it does not give me the info I need. For example, the Gencode GTF file may have multiple `tag` attributes for each row.
For example, I'm interested in the gene TMEM14C. Here is an example row in the GTF file:
When I select for TMEM14C using the `z` object from `import` and check the `tag` values, I get this:
However, I expect this when extracting all the `tag` attributes for TMEM14C on the command line:
Any ideas for a workaround? Thank you again.
Could you just grep the gtf for lines containing "Ensembl_canonical"?
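A rough sketch of that grep idea in base R, in case it helps. The two GTF lines below are shortened, made-up stand-ins (the IDs and coordinates are hypothetical, not copied from the Gencode file), and `canonical_ids` is just a name chosen for illustration.

```r
# Made-up, shortened GTF lines for illustration only.
gtf_lines <- c(
  'chr6\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "TMEM14C"; tag "basic"; tag "Ensembl_canonical";',
  'chr6\tHAVANA\ttranscript\t100\t800\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; gene_name "TMEM14C"; tag "basic";'
)
# With the real file, the lines could be read instead, e.g.:
# gtf_lines <- readLines(gzfile("gencode.v38.annotation.gtf.gz"))

# TRUE only for rows whose third column is "transcript"
# (header/comment lines have no third column and come out FALSE).
is_transcript <- vapply(
  strsplit(gtf_lines, "\t", fixed = TRUE),
  function(f) isTRUE(f[3] == "transcript"),
  logical(1)
)

# Plain substring match on the attribute text, like grep on the command line.
has_tag <- grepl('tag "Ensembl_canonical"', gtf_lines, fixed = TRUE)

canonical_lines <- gtf_lines[is_transcript & has_tag]

# Pull the transcript_id value out of each matching line.
canonical_ids <- sub('.*transcript_id "([^"]+)".*', "\\1", canonical_lines)
print(canonical_ids)
```

Because this matches on the raw line text, it sees every `tag` on a row, not only the last one.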
Here you go...
It might not be in the gtf, but you can get canonical transcript info from BioMart. It might be easier to query biomart than to parse the gtf.
Correct, I already tried that, thank you. But I'm interested in all the attributes in the GTF file. I was hoping for a simple way in Bioconductor to get them in a neat little data frame so I could join it to various differentially expressed transcript tables from DESeq2. However, rtracklayer does not handle lines with multiple `tag` attributes and only parses the last one. I submitted this as an issue to rtracklayer.
You can always load it as a data frame and manipulate it to GRanges after :)
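For what it's worth, here is a minimal base R sketch of the data frame route that keeps every `tag` by collapsing repeated keys into one comma-separated string. The helper names `parse_attributes` and `gtf_to_df`, the example GTF lines, and the `deseq_res` table are all hypothetical and made up for illustration. It also assumes attribute values never contain a ";" themselves.

```r
# Made-up, shortened GTF lines and a made-up DESeq-style table.
gtf_lines <- c(
  "##description: illustrative header line",
  'chr6\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "TMEM14C"; tag "basic"; tag "Ensembl_canonical";',
  'chr6\tHAVANA\ttranscript\t100\t800\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; gene_name "TMEM14C"; tag "basic";'
)
deseq_res <- data.frame(transcript_id = c("T1", "T2"),
                        log2FoldChange = c(1.5, -0.3))

# Turn one column-9 string into a named character vector.
# Repeated keys (such as tag) are collapsed with commas instead of overwritten.
parse_attributes <- function(attr_string) {
  pieces <- trimws(strsplit(attr_string, ";", fixed = TRUE)[[1]])
  pieces <- pieces[nzchar(pieces)]
  keys <- sub(" .*$", "", pieces)                       # text before first space
  values <- gsub('"', "", sub("^[^ ]+ +", "", pieces),  # text after it, unquoted
                 fixed = TRUE)
  grouped <- split(values, factor(keys, levels = unique(keys)))
  vapply(grouped, paste, character(1), collapse = ",")
}

# Build a data frame with a few fixed columns plus one column per attribute key.
gtf_to_df <- function(lines) {
  lines <- lines[!startsWith(lines, "#")]               # drop header lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  attrs <- lapply(fields, function(f) parse_attributes(f[9]))
  all_keys <- unique(unlist(lapply(attrs, names)))
  attr_cols <- lapply(all_keys, function(k) {
    vapply(attrs,
           function(a) if (k %in% names(a)) a[[k]] else NA_character_,
           character(1))
  })
  names(attr_cols) <- all_keys
  data.frame(
    seqname = vapply(fields, `[`, character(1), 1),
    feature = vapply(fields, `[`, character(1), 3),
    start   = as.integer(vapply(fields, `[`, character(1), 4)),
    end     = as.integer(vapply(fields, `[`, character(1), 5)),
    strand  = vapply(fields, `[`, character(1), 7),
    attr_cols,
    stringsAsFactors = FALSE
  )
}

annot <- gtf_to_df(gtf_lines)
annot$is_canonical <- grepl("Ensembl_canonical", annot$tag, fixed = TRUE)

# Join the annotation onto the (made-up) differential expression table.
annotated <- merge(deseq_res, annot, by = "transcript_id")
print(annotated[, c("transcript_id", "gene_name", "tag", "is_canonical")])
```

On the full Gencode file you would probably want to filter to `feature == "transcript"` before merging, and check that the transcript IDs in your DESeq2 table carry the same version suffix as the ones in the GTF.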