threeUTRsByTranscript per GENE
1
2
Entering edit mode
amgtech ▴ 10
@amgtech-21740
Last seen 5.0 years ago

How do I get the 3' UTR sequence per gene rather than transcript? Online resources suggest using columns="gene_name" parameter but it doesn't seem to be functional anymore.

GenomicFeatures Transcript gene threeUTRsByTranscript • 2.1k views
ADD COMMENT
3
Entering edit mode
@herve-pages-1542
Last seen 5 hours ago
Seattle, WA, United States

Hi amgtech,

Let's start by extracting all the 3' UTRs:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
utr3_by_tx <- threeUTRsByTranscript(txdb)

The 3' UTRs in utr3_by_tx are grouped by transcript. To get them grouped by gene, the idea is to (1) unlist utr3_by_tx so they end up in a "flat" GRanges object, and then (2) split this "flat" GRanges object by gene.

## Get all 3' UTRs in a GRanges object:
all_utr3 <- unlist(utr3_by_tx, use.names=FALSE)

## The names on 'utr3_by_tx' are the **internal** transcript ids (the
## same you see in the 'tx_id' metadata column of the object returned
## by 'transcripts(txdb)', note that they're stored as integer values
## in the SQLite db). Let's propagate them to 'all_utr3':
mcols(all_utr3)$tx_id <- rep(as.integer(names(utr3_by_tx)), lengths(utr3_by_tx))

all_utr3
# GRanges object with 159975 ranges and 4 metadata columns:
#                    seqnames        ranges strand |   exon_id   exon_name
#                       <Rle>     <IRanges>  <Rle> | <integer> <character>
#        [1]             chr1   70009-71585      + |        23        <NA>
#        [2]             chr1   70009-70108      + |        24        <NA>
#        [3]             chr1 944154-944575      + |       177        <NA>
#        ...              ...           ...    ... .       ...         ...
#   [159973] chrUn_GL000213v1 124992-125192      - |    647003        <NA>
#   [159974] chrUn_GL000213v1 123983-124252      - |    647002        <NA>
#   [159975] chrUn_GL000218v1   51867-53482      - |    647013        <NA>
#            exon_rank     tx_id
#            <integer> <integer>
#        [1]         3         9
#        [2]         1        10
#        [3]        14        74
#        ...       ...       ...
#   [159973]         7    226799
#   [159974]         8    226799
#   [159975]         1    226802
#   -------
#   seqinfo: 595 sequences (1 circular) from hg38 genome

The problem we're facing is that all_utr3 doesn't contain the gene information, which we need if we want to split the object by gene. So in the next steps we're going to add it i.e. we're going to add the gene_id metadata column. While we are at it we're also going to add the tx_name metadata column.

## Map transcripts to genes:
tx2gene <- mcols(transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")))
tx2gene$gene_id <- as.character(tx2gene$gene_id)

tx2gene
# DataFrame with 226811 rows and 3 columns
#            tx_id           tx_name     gene_id
#        <integer>       <character> <character>
# 1              1 ENST00000456328.2   100287102
# 2              2 ENST00000450305.2   100287102
# 3              3 ENST00000473358.1          NA
# ...          ...               ...         ...
# 226809    226809 ENST00000611690.1          NA
# 226810    226810 ENST00000616830.1          NA
# 226811    226811 ENST00000612925.1          NA

## Add "tx_name" and "gene_id" metadata columns to 'all_utr3':
m <- match(mcols(all_utr3)$tx_id, tx2gene$tx_id)
mcols(all_utr3) <- cbind(mcols(all_utr3), tx2gene[m, -1L, drop=FALSE])

all_utr3 now contains the gene information (in the gene_id metadata column):

all_utr3
# GRanges object with 159975 ranges and 6 metadata columns:
#                    seqnames        ranges strand |   exon_id   exon_name
#                       <Rle>     <IRanges>  <Rle> | <integer> <character>
#        [1]             chr1   70009-71585      + |        23        <NA>
#        [2]             chr1   70009-70108      + |        24        <NA>
#        [3]             chr1 944154-944575      + |       177        <NA>
#        ...              ...           ...    ... .       ...         ...
#   [159973] chrUn_GL000213v1 124992-125192      - |    647003        <NA>
#   [159974] chrUn_GL000213v1 123983-124252      - |    647002        <NA>
#   [159975] chrUn_GL000218v1   51867-53482      - |    647013        <NA>
#            exon_rank     tx_id           tx_name     gene_id
#            <integer> <integer>       <character> <character>
#        [1]         3         9 ENST00000641515.2        <NA>
#        [2]         1        10 ENST00000335137.4       79501
#        [3]        14        74 ENST00000342066.7      148398
#        ...       ...       ...               ...         ...
#   [159973]         7    226799 ENST00000616157.1   100288966
#   [159974]         8    226799 ENST00000616157.1   100288966
#   [159975]         1    226802 ENST00000612565.1      389834
#   -------
#   seqinfo: 595 sequences (1 circular) from hg38 genome

Note that not all transcripts are linked to a gene!

Finally we can get the 3' UTRs grouped by gene by splitting all_utr3 by the gene_id metadata column:

utr3_by_gene <- split(all_utr3, mcols(all_utr3)$gene_id)

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

It works, thank you!

ADD REPLY
0
Entering edit mode

Somehow we need to make this easier. It always frustrates me that the gene information is not automatically added to the transcripts.

ADD REPLY
0
Entering edit mode

Same here.

FWIW here is some background:

There was this (questionable) early design decision to not store the gene ids in the transcript table of the SQLite db but to store them instead in a gene table (made of the gene_id and _tx_id columns) in order to support an hypothetical many-to-many transcript-to-gene relationship. The motivation for this was to support TxDb objects where a given transcript would be linked to multiple genes even though at the time this decision was made (more than 10 years ago) none of the annotations we wanted to support (UCSC tracks, Ensembl BioMart, GFF files) contained that situation. 10 years later we've still not seen it and I'm not sure annotation providers are going to start the "link one transcript to multiple genes" game any time soon. Does this even make sense from a biological point of view?

As a consequence of this early design decision, transcripts() does not return the gene ids by default in order to avoid the cost of the join between the transcript and gene tables. Yeah, it would probably not make much difference but the idea was also to have the transcripts(), exons(), and cds() extractors only extract stuff from their corresponding table by default (i.e. from the transcript, exon, and cds table). So their semantic would be kept simple by default.

Another consequence of this early design decision is that, when you request it (e.g. with transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))), the gene_id column is returned as a CharacterList. This simply reflects the fact that more than one gene could be associated with a given transcript. Note that the gene_id column can be turned into a character vector with as.character(), which is what I did in my code above with tx2gene$gene_id <- as.character(tx2gene$gene_id). This makes it a lot easier and straightforward to handle this metada column downstream.

Anyway, just wanted to point out that the "transcripts(txdb) should return the gene ids by default" story is part of a bigger story. If we want to revisit these things, probably a better place to continue this discussion would be on GitHub.

H.

ADD REPLY
0
Entering edit mode

Any updates in the past year and a half?

ADD REPLY

Login before adding your answer.

Traffic: 435 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6