Question

extracting gene names, gene id and transcript id

0

Bogdan ▴ 670

@bogdan-2367

Last seen 6 months ago

Palo Alto, CA, USA

Dear all,

Given a GTF file (for example, gencode.v28.basic.annotation.gtf), what is the simplest way to extract a table with the following information:

-- gene_name

-- gene_id

-- transcript_id

many thanks !

bogdan

gtf • 8.6k views

ADD COMMENT • link updated 5.7 years ago by lee.s ▴ 70 • written 5.7 years ago by Bogdan ▴ 670

1

lee.s ▴ 70

@lees-15179

Last seen 4.5 years ago

Another option with plyranges

library(plyranges)
gr <- read_gff("your_file.gtf") %>% select(gene_id, gene_name, transcript_id)

ADD COMMENT • link 5.7 years ago lee.s ▴ 70

4

I don't see a read_gtf in plyranges, in either release or devel?

Anyway, this is just a two-liner using basic rtracklayer/GenomicRanges functions.

> library(rtracklayer)

> z <- import("ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.basic.annotation.gtf.gz")

> mcols(z)[,c("gene_id","gene_name","transcript_id")]
DataFrame with 1684537 rows and 3 columns
                  gene_id   gene_name     transcript_id
              <character> <character>       <character>
1       ENSG00000223972.5     DDX11L1                NA
2       ENSG00000223972.5     DDX11L1 ENST00000456328.2
3       ENSG00000223972.5     DDX11L1 ENST00000456328.2
4       ENSG00000223972.5     DDX11L1 ENST00000456328.2
5       ENSG00000223972.5     DDX11L1 ENST00000456328.2
...                   ...         ...               ...
1684533 ENSG00000210195.2       MT-TT ENST00000387460.2
1684534 ENSG00000210195.2       MT-TT ENST00000387460.2
1684535 ENSG00000210196.2       MT-TP                NA
1684536 ENSG00000210196.2       MT-TP ENST00000387461.2
1684537 ENSG00000210196.2       MT-TP ENST00000387461.2

ADD REPLY • link 5.7 years ago James W. MacDonald 65k
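
The rtracklayer table above has one row per GTF feature (gene, transcript, exon, and so on), so each gene/transcript pair repeats and gene-level rows carry NA in transcript_id. If one row per transcript is wanted, a minimal sketch along these lines might do it. The tiny GTF written to a temporary file, the helper gtf_row(), and the names gtf_file and tab are assumptions made for illustration only; point import() at the real annotation instead.

```r
library(rtracklayer)

# Hypothetical helper: build one tab-separated GTF line (illustration only).
gtf_row <- function(chr, type, start, end, strand, attrs) {
  paste(chr, "HAVANA", type, start, end, ".", strand, ".", attrs, sep = "\t")
}

# Stand-in records in GENCODE style; coordinates are illustrative.
g1 <- 'gene_id "ENSG00000223972.5"; gene_name "DDX11L1";'
t1 <- 'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";'
g2 <- 'gene_id "ENSG00000210196.2"; gene_name "MT-TP";'
t2 <- 'gene_id "ENSG00000210196.2"; transcript_id "ENST00000387461.2"; gene_name "MT-TP";'

gtf_lines <- c(
  gtf_row("chr1", "gene",       11869, 14409, "+", g1),
  gtf_row("chr1", "transcript", 11869, 14409, "+", t1),
  gtf_row("chr1", "exon",       11869, 12227, "+", t1),
  gtf_row("chr1", "exon",       12613, 12721, "+", t1),
  gtf_row("chrM", "gene",       15956, 16023, "-", g2),
  gtf_row("chrM", "transcript", 15956, 16023, "-", t2),
  gtf_row("chrM", "exon",       15956, 16023, "-", t2)
)

# Assumed file name; replace with the path to a real GTF.
gtf_file <- tempfile(fileext = ".gtf")
writeLines(gtf_lines, gtf_file)

z <- import(gtf_file)

# Keep the three columns, drop gene-level rows (NA transcript_id),
# then collapse the repeated exon/transcript rows.
tab <- as.data.frame(mcols(z)[, c("gene_name", "gene_id", "transcript_id")])
tab <- unique(tab[!is.na(tab$transcript_id), ])
rownames(tab) <- NULL
tab
```

The result is an ordinary data.frame, so write.table(tab, "genes.tsv", sep = "\t", quote = FALSE, row.names = FALSE) would save it as a tab-delimited file (the output file name is again just a placeholder).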

0

Yes, you're right, thanks! The backend of the readers uses import(), so read_gff() should still work. I should update plyranges to explicitly include GTF.

ADD REPLY • link 5.7 years ago lee.s ▴ 70

score 4 · Accepted Answer · 2018-08-28

4

jaro.slamecka ▴ 140

@jaroslamecka-7419

Last seen 19 months ago

Mitchell Cancer Institute, Mobile AL, U…

If you can use an Ensembl GTF, one easy and fast way is to use the refGenome package

library(refGenome)
gtf = ensemblGenome()
read.gtf(gtf, filename="Homo_sapiens.GRCh38.93.gtf")
genes = gtf@ev$gtf[ ,c("gene_name","gene_id","transcript_id")]

ADD COMMENT • link 5.7 years ago jaro.slamecka ▴ 140

0

Thank you Jaro. I wish it worked. On my Ubuntu system, using a GTF file from the STAR aligner website, it says:

"terminate called after throwing an instance of 'std::length_error'

what():  basic_string::_S_create

Aborted (core dumped)"

ADD REPLY • link 5.7 years ago Bogdan ▴ 670

0

It needs to be either Ensembl or UCSC (you'd use it with gtf=ucscGenome()), that's the limitation. What exactly is the GTF file from the STAR website you describe? Can you post a link to it?

ADD REPLY • link 5.7 years ago jaro.slamecka ▴ 140

0

Thank you Jaro. The links are :

http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/ENSEMBL/homo_sapiens/ENSEMBL.homo_sapiens.release-83/

the file is : Homo_sapiens.GRCh38.83.gtf

http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/GENCODE/GRCh38_Gencode26/

the file is : gencode.v26.primary_assembly.annotation.gtf

ADD REPLY • link 5.7 years ago Bogdan ▴ 670

0

During the last analysis, where I mentioned the errors, the GTF files that I used were from GENCODE: