Dear all,
Given a GTF file (for example, gencode.v28.basic.annotation.gtf), what is the simplest way to extract a table with the following information:
-- gene_name
-- gene_id
-- transcript_id
Many thanks!
bogdan
If you can use an Ensembl GTF, one easy and fast way is to use the refGenome package:
library(refGenome)
gtf = ensemblGenome()
read.gtf(gtf, filename="Homo_sapiens.GRCh38.93.gtf")
genes = gtf@ev$gtf[ ,c("gene_name","gene_id","transcript_id")]
Another option is plyranges:
library(plyranges)
gr <- read_gff("your_file.gtf") %>% select(gene_id, gene_name, transcript_id)
I don't see a read_gtf in plyranges, in either release or devel?
Anyway, this is just a two-liner using basic rtracklayer/GenomicRanges functions.
> library(rtracklayer)
> z <- import("ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.basic.annotation.gtf.gz")
> mcols(z)[,c("gene_id","gene_name","transcript_id")]
DataFrame with 1684537 rows and 3 columns
                  gene_id   gene_name     transcript_id
              <character> <character>       <character>
1       ENSG00000223972.5     DDX11L1                NA
2       ENSG00000223972.5     DDX11L1 ENST00000456328.2
3       ENSG00000223972.5     DDX11L1 ENST00000456328.2
4       ENSG00000223972.5     DDX11L1 ENST00000456328.2
5       ENSG00000223972.5     DDX11L1 ENST00000456328.2
...                   ...         ...               ...
1684533 ENSG00000210195.2       MT-TT ENST00000387460.2
1684534 ENSG00000210195.2       MT-TT ENST00000387460.2
1684535 ENSG00000210196.2       MT-TP                NA
1684536 ENSG00000210196.2       MT-TP ENST00000387461.2
1684537 ENSG00000210196.2       MT-TP ENST00000387461.2
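Note that the DataFrame above has one row per GTF feature (gene, transcript, exon, and so on), so each gene/transcript pair is repeated many times and the gene-level rows carry NA in transcript_id. If the goal is one row per transcript, a minimal base-R sketch could look like the following. The helper name unique_transcripts and the toy data frame are made up for illustration; with the real data you would presumably pass as.data.frame(mcols(z)[, c("gene_id","gene_name","transcript_id")]) instead.

```r
# Toy stand-in for the three metadata columns shown above (assumption:
# the real table has the same column names and uses NA on gene-level rows).
tbl <- data.frame(
  gene_id       = c("ENSG00000223972.5", "ENSG00000223972.5", "ENSG00000223972.5",
                    "ENSG00000210196.2", "ENSG00000210196.2"),
  gene_name     = c("DDX11L1", "DDX11L1", "DDX11L1", "MT-TP", "MT-TP"),
  transcript_id = c(NA, "ENST00000456328.2", "ENST00000456328.2",
                    NA, "ENST00000387461.2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows that have a transcript_id and drop
# the repeats contributed by exon/CDS/UTR features.
unique_transcripts <- function(tbl) {
  keep <- !is.na(tbl$transcript_id)
  out <- unique(tbl[keep, c("gene_name", "gene_id", "transcript_id")])
  rownames(out) <- NULL  # renumber rows after filtering
  out
}

tx_table <- unique_transcripts(tbl)
print(tx_table)
```

The columns are returned in the order asked for in the original question (gene_name, gene_id, transcript_id).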
                    
                Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thank you, Jaro. I wish it worked. On my Ubuntu system, using a GTF file from the STAR aligner website, it says:
It needs to be either an Ensembl or a UCSC GTF (for UCSC you'd use gtf=ucscGenome()); that's the limitation. What exactly is the GTF file from the STAR website you describe? Can you post a link to it?
Thank you, Jaro. The links are:
http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/ENSEMBL/homo_sapiens/ENSEMBL.homo_sapiens.release-83/
The file is: Homo_sapiens.GRCh38.83.gtf
http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/GENCODE/GRCh38_Gencode26/
The file is: gencode.v26.primary_assembly.annotation.gtf
In the last analysis, where I mentioned the errors, the GTF files I used were from GENCODE:
https://www.gencodegenes.org/releases/current.html
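For GTF files that a given package refuses to read, one package-free fallback is to parse the attribute column (column 9) directly. The sketch below is only an illustration in base R: gtf_attr and read_gtf_ids are hypothetical helper names, the toy file is made up, and it assumes the usual key "value"; attribute layout used by Ensembl and GENCODE GTFs. It reads the whole file into memory and is not tuned for speed.

```r
# Hypothetical helper: extract one attribute (e.g. gene_id) from GTF column 9.
# Returns NA where the attribute is absent.
gtf_attr <- function(attrs, key) {
  pattern <- paste0('(^|;)[[:space:]]*', key, ' "([^"]*)"')
  m <- regmatches(attrs, regexec(pattern, attrs))
  vapply(m, function(x) if (length(x) == 3L) x[3L] else NA_character_,
         character(1))
}

# Hypothetical helper: one row per "transcript" feature.
read_gtf_ids <- function(file) {
  lines <- readLines(file)
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]  # drop header comments
  fields <- strsplit(lines, "\t", fixed = TRUE)
  type  <- vapply(fields, `[`, character(1), 3L)  # feature type column
  attrs <- vapply(fields, `[`, character(1), 9L)  # attribute column
  attrs <- attrs[type == "transcript"]
  data.frame(
    gene_name     = gtf_attr(attrs, "gene_name"),
    gene_id       = gtf_attr(attrs, "gene_id"),
    transcript_id = gtf_attr(attrs, "transcript_id"),
    stringsAsFactors = FALSE
  )
}

# Toy three-feature GTF written to a temporary file (made-up subset).
a_gene <- 'gene_id "ENSG00000223972.5"; gene_name "DDX11L1";'
a_tx   <- 'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";'
toy <- c(
  "##description: toy example",
  paste("chr1", "HAVANA", "gene", "11869", "14409", ".", "+", ".", a_gene, sep = "\t"),
  paste("chr1", "HAVANA", "transcript", "11869", "14409", ".", "+", ".", a_tx, sep = "\t"),
  paste("chr1", "HAVANA", "exon", "11869", "12227", ".", "+", ".", a_tx, sep = "\t")
)
toy_file <- tempfile(fileext = ".gtf")
writeLines(toy, toy_file)

ids <- read_gtf_ids(toy_file)
print(ids)
```

Filtering on the "transcript" feature type gives one row per transcript without a separate de-duplication step; the regular expression anchors each key at the start of the field or after a semicolon so that gene_id does not match inside a longer key such as havana_gene_id.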