extracting gene names, gene id and transcript id
Bogdan ▴ 670
@bogdan-2367
Last seen 14 months ago
Palo Alto, CA, USA

Dear all,

Given a GTF file (for example, gencode.v28.basic.annotation.gtf), what is the simplest way to extract a table with the following columns:

-- gene_name

-- gene_id

-- transcript_id

many thanks !

bogdan

 

jaro.slamecka ▴ 140
@jaroslamecka-7419
Last seen 7 days ago
Mitchell Cancer Institute, Mobile AL, U…

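If you can use an Ensembl GTF, one easy and fast way is to use the refGenome package:

library(refGenome)
# create an empty Ensembl annotation object and load the GTF into it
gtf = ensemblGenome()
read.gtf(gtf, filename="Homo_sapiens.GRCh38.93.gtf")
# the parsed table is stored in the object's environment; keep the three ID columns
genes = gtf@ev$gtf[ ,c("gene_name","gene_id","transcript_id")]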
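
If the file is not an Ensembl GTF, a package-free fallback is to parse the attribute column directly. The following is only a minimal base R sketch and is not part of refGenome: `gtf_attr()` and `read_gtf_ids()` are hypothetical helper names, the two input lines are toy data, and it assumes the usual nine-column, tab-separated GTF layout with `key "value";` attributes in column 9.

```r
# Pull the value of one key out of GTF attribute strings (column 9).
# Returns NA where the key is absent (e.g. transcript_id on gene rows).
gtf_attr <- function(attr, key) {
  pattern <- paste0('(^|[; ])', key, ' "([^"]*)"')
  hits <- regmatches(attr, regexec(pattern, attr))
  vapply(hits, function(x) if (length(x) == 3) x[3] else NA_character_,
         character(1))
}

# Hypothetical helper: read a GTF (arguments are passed on to read.delim)
# and return one row per feature with the three requested columns.
read_gtf_ids <- function(...) {
  gtf <- read.delim(..., header = FALSE, comment.char = "#", quote = "",
                    stringsAsFactors = FALSE)
  attr <- gtf[[9]]
  data.frame(gene_name     = gtf_attr(attr, "gene_name"),
             gene_id       = gtf_attr(attr, "gene_id"),
             transcript_id = gtf_attr(attr, "transcript_id"),
             stringsAsFactors = FALSE)
}

# Toy input: one gene row and one transcript row (coordinates are made up).
toy <- c(
  paste("chr1", "toy", "gene", 1, 100, ".", "+", ".",
        'gene_id "ENSG00000223972.5"; gene_name "DDX11L1";',
        sep = "\t"),
  paste("chr1", "toy", "transcript", 1, 100, ".", "+", ".",
        'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";',
        sep = "\t")
)
ids <- read_gtf_ids(text = toy)
print(ids)
```

For a real file, the call would be `read_gtf_ids("your_file.gtf")`. Reading a full annotation this way is likely slower than a dedicated importer, so treat it as a fallback rather than a recommendation.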


Thank you, Jaro. I wish it worked. On my Ubuntu system, with a GTF file from the STAR aligner website, it fails with:

"terminate called after throwing an instance of 'std::length_error'

what():  basic_string::_S_create

Aborted (core dumped)"

The file needs to be either an Ensembl or a UCSC GTF (for UCSC you'd use gtf=ucscGenome()); that's the limitation. What exactly is the GTF file from the STAR website you describe? Can you post a link to it?


In the last analysis, where I mentioned the errors, the GTF files I used were from GENCODE:

https://www.gencodegenes.org/releases/current.html

lee.s ▴ 70
@lees-15179
Last seen 5.1 years ago

Another option, with plyranges:

library(plyranges)
# read the GTF as a GRanges and keep only the three ID metadata columns
gr <- read_gff("your_file.gtf") %>% select(gene_id, gene_name, transcript_id)

I don't see a read_gtf in plyranges, in either release or devel?

Anyway, this is just a two-liner using basic rtracklayer/GenomicRanges functions.

> library(rtracklayer)
> z <- import("ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.basic.annotation.gtf.gz")
> mcols(z)[,c("gene_id","gene_name","transcript_id")]
DataFrame with 1684537 rows and 3 columns
                  gene_id   gene_name     transcript_id
              <character> <character>       <character>
1       ENSG00000223972.5     DDX11L1                NA
2       ENSG00000223972.5     DDX11L1 ENST00000456328.2
3       ENSG00000223972.5     DDX11L1 ENST00000456328.2
4       ENSG00000223972.5     DDX11L1 ENST00000456328.2
5       ENSG00000223972.5     DDX11L1 ENST00000456328.2
...                   ...         ...               ...
1684533 ENSG00000210195.2       MT-TT ENST00000387460.2
1684534 ENSG00000210195.2       MT-TT ENST00000387460.2
1684535 ENSG00000210196.2       MT-TP                NA
1684536 ENSG00000210196.2       MT-TP ENST00000387461.2
1684537 ENSG00000210196.2       MT-TP ENST00000387461.2
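
The output above has one row per GTF feature, so gene-level rows carry NA in transcript_id and each transcript repeats once per exon, CDS and so on. If a table with one row per transcript is wanted, the rows can be filtered and de-duplicated. Below is a minimal base R sketch on a toy data frame that mimics the first five rows shown above; with the real object one would presumably start from `as.data.frame(mcols(z)[, c("gene_id","gene_name","transcript_id")])` instead. The variable names `ids` and `tx_table` are made up for illustration.

```r
# Toy stand-in for the first five rows of the DataFrame printed above.
ids <- data.frame(
  gene_id       = rep("ENSG00000223972.5", 5),
  gene_name     = rep("DDX11L1", 5),
  transcript_id = c(NA, rep("ENST00000456328.2", 4)),
  stringsAsFactors = FALSE
)

# Drop gene-level rows (no transcript_id), then collapse repeated rows
# so that each gene/transcript combination appears once.
tx_table <- unique(ids[!is.na(ids$transcript_id), ])
rownames(tx_table) <- NULL
print(tx_table)
```

An alternative with the same intent is to subset the imported ranges to transcript features first (`z[z$type == "transcript"]`), which avoids the de-duplication step.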

Yes, you're right, thanks! The readers use import() as their backend, so read_gff() should still work. I should update plyranges to explicitly include GTF.
