Question

threeUTRsByTranscript per GENE

2

Entering edit mode

amgtech ▴ 10

@amgtech-21740

Last seen 5.6 years ago

How do I get the 3' UTR sequence per gene rather than transcript? Online resources suggest using columns="gene_name" parameter but it doesn't seem to be functional anymore.

GenomicFeatures Transcript gene threeUTRsByTranscript • 2.5k views

ADD COMMENT • link updated 4.1 years ago by Dario Strbenac ★ 1.6k • written 5.6 years ago by amgtech ▴ 10

score 3 · Accepted Answer · 2019-08-27

Hi amgtech,

Let's start by extracting all the 3' UTRs:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
utr3_by_tx <- threeUTRsByTranscript(txdb)

The 3' UTRs in utr3_by_tx are grouped by transcript. To get them grouped by gene, the idea is to (1) unlist utr3_by_tx so they end up in a "flat" GRanges object, and then (2) split this "flat" GRanges object by gene.

## Get all 3' UTRs in a GRanges object:
all_utr3 <- unlist(utr3_by_tx, use.names=FALSE)

## The names on 'utr3_by_tx' are the **internal** transcript ids (the
## same you see in the 'tx_id' metadata column of the object returned
## by 'transcripts(txdb)', note that they're stored as integer values
## in the SQLite db). Let's propagate them to 'all_utr3':
mcols(all_utr3)$tx_id <- rep(as.integer(names(utr3_by_tx)), lengths(utr3_by_tx))

all_utr3
# GRanges object with 159975 ranges and 4 metadata columns:
#                    seqnames        ranges strand |   exon_id   exon_name
#                       <Rle>     <IRanges>  <Rle> | <integer> <character>
#        [1]             chr1   70009-71585      + |        23        <NA>
#        [2]             chr1   70009-70108      + |        24        <NA>
#        [3]             chr1 944154-944575      + |       177        <NA>
#        ...              ...           ...    ... .       ...         ...
#   [159973] chrUn_GL000213v1 124992-125192      - |    647003        <NA>
#   [159974] chrUn_GL000213v1 123983-124252      - |    647002        <NA>
#   [159975] chrUn_GL000218v1   51867-53482      - |    647013        <NA>
#            exon_rank     tx_id
#            <integer> <integer>
#        [1]         3         9
#        [2]         1        10
#        [3]        14        74
#        ...       ...       ...
#   [159973]         7    226799
#   [159974]         8    226799
#   [159975]         1    226802
#   -------
#   seqinfo: 595 sequences (1 circular) from hg38 genome

The problem we're facing is that all_utr3 doesn't contain the gene information, which we need if we want to split the object by gene. So in the next steps we're going to add it i.e. we're going to add the gene_id metadata column. While we are at it we're also going to add the tx_name metadata column.

## Map transcripts to genes:
tx2gene <- mcols(transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")))
tx2gene$gene_id <- as.character(tx2gene$gene_id)

tx2gene
# DataFrame with 226811 rows and 3 columns
#            tx_id           tx_name     gene_id
#        <integer>       <character> <character>
# 1              1 ENST00000456328.2   100287102
# 2              2 ENST00000450305.2   100287102
# 3              3 ENST00000473358.1          NA
# ...          ...               ...         ...
# 226809    226809 ENST00000611690.1          NA
# 226810    226810 ENST00000616830.1          NA
# 226811    226811 ENST00000612925.1          NA

## Add "tx_name" and "gene_id" metadata columns to 'all_utr3':
m <- match(mcols(all_utr3)$tx_id, tx2gene$tx_id)
mcols(all_utr3) <- cbind(mcols(all_utr3), tx2gene[m, -1L, drop=FALSE])

all_utr3 now contains the gene information (in the gene_id metadata column):

all_utr3
# GRanges object with 159975 ranges and 6 metadata columns:
#                    seqnames        ranges strand |   exon_id   exon_name
#                       <Rle>     <IRanges>  <Rle> | <integer> <character>
#        [1]             chr1   70009-71585      + |        23        <NA>
#        [2]             chr1   70009-70108      + |        24        <NA>
#        [3]             chr1 944154-944575      + |       177        <NA>
#        ...              ...           ...    ... .       ...         ...
#   [159973] chrUn_GL000213v1 124992-125192      - |    647003        <NA>
#   [159974] chrUn_GL000213v1 123983-124252      - |    647002        <NA>
#   [159975] chrUn_GL000218v1   51867-53482      - |    647013        <NA>
#            exon_rank     tx_id           tx_name     gene_id
#            <integer> <integer>       <character> <character>
#        [1]         3         9 ENST00000641515.2        <NA>
#        [2]         1        10 ENST00000335137.4       79501
#        [3]        14        74 ENST00000342066.7      148398
#        ...       ...       ...               ...         ...
#   [159973]         7    226799 ENST00000616157.1   100288966
#   [159974]         8    226799 ENST00000616157.1   100288966
#   [159975]         1    226802 ENST00000612565.1      389834
#   -------
#   seqinfo: 595 sequences (1 circular) from hg38 genome

Note that not all transcripts are linked to a gene!

Finally we can get the 3' UTRs grouped by gene by splitting all_utr3 by the gene_id metadata column:

utr3_by_gene <- split(all_utr3, mcols(all_utr3)$gene_id)

Hope this helps,

H.