How do I get the 3' UTR sequence per gene rather than transcript? Online resources suggest using columns="gene_name" parameter but it doesn't seem to be functional anymore.
How do I get the 3' UTR sequence per gene rather than transcript? Online resources suggest using columns="gene_name" parameter but it doesn't seem to be functional anymore.
Hi amgtech,
Let's start by extracting all the 3' UTRs:
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
utr3_by_tx <- threeUTRsByTranscript(txdb)
The 3' UTRs in utr3_by_tx
are grouped by transcript. To get them grouped by gene, the idea is to (1) unlist utr3_by_tx
so they end up in a "flat" GRanges object, and then (2) split this "flat" GRanges object by gene.
## Get all 3' UTRs in a GRanges object:
all_utr3 <- unlist(utr3_by_tx, use.names=FALSE)
## The names on 'utr3_by_tx' are the **internal** transcript ids (the
## same you see in the 'tx_id' metadata column of the object returned
## by 'transcripts(txdb)', note that they're stored as integer values
## in the SQLite db). Let's propagate them to 'all_utr3':
mcols(all_utr3)$tx_id <- rep(as.integer(names(utr3_by_tx)), lengths(utr3_by_tx))
all_utr3
# GRanges object with 159975 ranges and 4 metadata columns:
# seqnames ranges strand | exon_id exon_name
# <Rle> <IRanges> <Rle> | <integer> <character>
# [1] chr1 70009-71585 + | 23 <NA>
# [2] chr1 70009-70108 + | 24 <NA>
# [3] chr1 944154-944575 + | 177 <NA>
# ... ... ... ... . ... ...
# [159973] chrUn_GL000213v1 124992-125192 - | 647003 <NA>
# [159974] chrUn_GL000213v1 123983-124252 - | 647002 <NA>
# [159975] chrUn_GL000218v1 51867-53482 - | 647013 <NA>
# exon_rank tx_id
# <integer> <integer>
# [1] 3 9
# [2] 1 10
# [3] 14 74
# ... ... ...
# [159973] 7 226799
# [159974] 8 226799
# [159975] 1 226802
# -------
# seqinfo: 595 sequences (1 circular) from hg38 genome
The problem we're facing is that all_utr3
doesn't contain the gene information, which we need if we want to split the object by gene. So in the next steps we're going to add it i.e. we're going to add the gene_id
metadata column. While we are at it we're also going to add the tx_name
metadata column.
## Map transcripts to genes:
tx2gene <- mcols(transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")))
tx2gene$gene_id <- as.character(tx2gene$gene_id)
tx2gene
# DataFrame with 226811 rows and 3 columns
# tx_id tx_name gene_id
# <integer> <character> <character>
# 1 1 ENST00000456328.2 100287102
# 2 2 ENST00000450305.2 100287102
# 3 3 ENST00000473358.1 NA
# ... ... ... ...
# 226809 226809 ENST00000611690.1 NA
# 226810 226810 ENST00000616830.1 NA
# 226811 226811 ENST00000612925.1 NA
## Add "tx_name" and "gene_id" metadata columns to 'all_utr3':
m <- match(mcols(all_utr3)$tx_id, tx2gene$tx_id)
mcols(all_utr3) <- cbind(mcols(all_utr3), tx2gene[m, -1L, drop=FALSE])
all_utr3
now contains the gene information (in the gene_id
metadata column):
all_utr3
# GRanges object with 159975 ranges and 6 metadata columns:
# seqnames ranges strand | exon_id exon_name
# <Rle> <IRanges> <Rle> | <integer> <character>
# [1] chr1 70009-71585 + | 23 <NA>
# [2] chr1 70009-70108 + | 24 <NA>
# [3] chr1 944154-944575 + | 177 <NA>
# ... ... ... ... . ... ...
# [159973] chrUn_GL000213v1 124992-125192 - | 647003 <NA>
# [159974] chrUn_GL000213v1 123983-124252 - | 647002 <NA>
# [159975] chrUn_GL000218v1 51867-53482 - | 647013 <NA>
# exon_rank tx_id tx_name gene_id
# <integer> <integer> <character> <character>
# [1] 3 9 ENST00000641515.2 <NA>
# [2] 1 10 ENST00000335137.4 79501
# [3] 14 74 ENST00000342066.7 148398
# ... ... ... ... ...
# [159973] 7 226799 ENST00000616157.1 100288966
# [159974] 8 226799 ENST00000616157.1 100288966
# [159975] 1 226802 ENST00000612565.1 389834
# -------
# seqinfo: 595 sequences (1 circular) from hg38 genome
Note that not all transcripts are linked to a gene!
Finally we can get the 3' UTRs grouped by gene by splitting all_utr3
by the gene_id
metadata column:
utr3_by_gene <- split(all_utr3, mcols(all_utr3)$gene_id)
Hope this helps,
H.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
It works, thank you!
Somehow we need to make this easier. It always frustrates me that the gene information is not automatically added to the transcripts.
Same here.
FWIW here is some background:
There was this (questionable) early design decision to not store the gene ids in the
transcript
table of the SQLite db but to store them instead in agene
table (made of thegene_id
and_tx_id
columns) in order to support an hypothetical many-to-many transcript-to-gene relationship. The motivation for this was to support TxDb objects where a given transcript would be linked to multiple genes even though at the time this decision was made (more than 10 years ago) none of the annotations we wanted to support (UCSC tracks, Ensembl BioMart, GFF files) contained that situation. 10 years later we've still not seen it and I'm not sure annotation providers are going to start the "link one transcript to multiple genes" game any time soon. Does this even make sense from a biological point of view?As a consequence of this early design decision,
transcripts()
does not return the gene ids by default in order to avoid the cost of the join between thetranscript
andgene
tables. Yeah, it would probably not make much difference but the idea was also to have thetranscripts()
,exons()
, andcds()
extractors only extract stuff from their corresponding table by default (i.e. from thetranscript
,exon
, andcds
table). So their semantic would be kept simple by default.Another consequence of this early design decision is that, when you request it (e.g. with
transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
), thegene_id
column is returned as a CharacterList. This simply reflects the fact that more than one gene could be associated with a given transcript. Note that thegene_id
column can be turned into a character vector withas.character()
, which is what I did in my code above withtx2gene$gene_id <- as.character(tx2gene$gene_id)
. This makes it a lot easier and straightforward to handle this metada column downstream.Anyway, just wanted to point out that the "
transcripts(txdb)
should return the gene ids by default" story is part of a bigger story. If we want to revisit these things, probably a better place to continue this discussion would be on GitHub.H.
Any updates in the past year and a half?