I have a GRangesList with multiple transcripts per gene (see example below).
GRangesList object of length 9216:
$ENSMUSG00000000001.4
GRanges object with 2 ranges and 4 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank geneId
<Rle> <IRanges> <Rle> | <integer> <character> <integer> <character>
ENSMUST00000000001.4 chr3 [108109403, 108109421] - | 57602 ENSMUSE00000404895.1 8 ENSMUSG00000000001.4
ENSMUST00000000001.4 chr3 [108107280, 108109316] - | 57601 ENSMUSE00000363317.2 9 ENSMUSG00000000001.4
$ENSMUSG00000000028.14
GRanges object with 4 ranges and 4 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank geneId
ENSMUST00000000028.13 chr16 [18781896, 18781897] - | 245213 ENSMUSE00000645072.1 19 ENSMUSG00000000028.14
ENSMUST00000000028.13 chr16 [18780447, 18780573] - | 245211 ENSMUSE00000788192.1 20 ENSMUSG00000000028.14
ENSMUST00000096990.9 chr16 [18781896, 18781897] - | 245213 ENSMUSE00000645072.1 17 ENSMUSG00000000028.14
ENSMUST00000096990.9 chr16 [18780453, 18780573] - | 245212 ENSMUSE00000628024.1 18 ENSMUSG00000000028.14
I'd like to write the nucleotide sequences for all transcripts of a given gene to a fasta file grouped by gene id. Ex)
>ENSMUSG00000000028.14 [transcript 28 sequence][transcript 96990]
However, extractTranscriptSeqs requires that the exons be ordered by genomic location which would disrupt the sequence of each transcript for overlapping multiexon transcripts. I tried the more generic getSeqs, but it inserts transcript ids into the text and names rather than grouping everything just by gene id. Is there a way to do this? Thanks

This looks great. However, when I tried to run it on my own data, I couldn't access the
mcols()$geneIdin order to sort. Here is my full code.library('GenomicFeatures') library('hash') library('BSgenome.Mmusculus.UCSC.mm10') # Load genome and annotation files genome <- BSgenome.Mmusculus.UCSC.mm10 gencode <- loadDb('../gencode_m8_basic.sqlite') # Get transcript <-> gene annotation trans <- transcriptsBy(gencode, by='gene') transcript_gene_hash <- hash(mcols(unlist(trans))$tx_name, names(unlist(trans))) # Get 5' UTRs and 3' UTRs by transcript fiveutrByTx <- fiveUTRsByTranscript(gencode, use.names=T) # Get UTR of each transcript, concatenate together, group by gene # Get all UTRs, GENEID in mcols fiveutr <- unlist(fiveutrByTx) mcols(fiveutr)$geneId <- hash::values(transcript_gene_hash, keys=names(fiveutr)) fiveutr2 <- relist(fiveutr, fiveutrByTx)problem with mcols:
# Order by GENEID > fiveutr2 GRangesList object of length 36036: $ENSMUST00000027036.10 GRanges object with 1 range and 4 metadata columns: seqnames ranges strand | exon_id exon_name exon_rank geneId <Rle> <IRanges> <Rle> | <integer> <character> <integer> <character> ENSMUST00000027036.10 chr1 [4807823, 4807913] + | 17 ENSMUSE00000792454.1 1 ENSMUSG00000025903.14 $ENSMUST00000150971.7 GRanges object with 1 range and 4 metadata columns: seqnames ranges strand | exon_id exon_name exon_rank geneId ENSMUST00000150971.7 chr1 [4807830, 4807913] + | 18 ENSMUSE00000737251.2 1 ENSMUSG00000025903.14 $ENSMUST00000155020.1 GRanges object with 1 range and 4 metadata columns: seqnames ranges strand | exon_id exon_name exon_rank geneId ENSMUST00000155020.1 chr1 [4807892, 4807913] + | 19 ENSMUSE00000730558.1 1 ENSMUSG00000104217.1 ... <36033 more elements> ------- seqinfo: 22 sequences (1 circular) from an unspecified genome; no seqlengthsYea, if you don't have an OrganismDbi package available, then things get annoying, since the TxDb methods for
exonsBy()etc are frustratingly limited.Maybe something like this will work?
library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(BSgenome.Hsapiens.UCSC.hg19) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene tx <- transcripts(txdb, columns=c("tx_id", "gene_id")) exons <- exonsBy(txdb) mcols(exons)$gene_id <- tx$gene_id[match(names(exons), tx$tx_id)] cds <- cdsBy(txdb) utrs <- psetdiff(exons[names(cds)], cds) mcols(utrs)$gene_id <- drop(mcols(utrs)$gene_id) utrs <- utrs[!is.na(mcols(utrs)$gene_id)] utrseqs <- extractTranscriptSeqs(Hsapiens, utrs) resplit <- function(x, f) { widths <- rowsum(lengths(x), f) widths <- setNames(as.integer(widths), rownames(widths)) relist(unlist(x[order(f)], use.names=FALSE), PartitioningByWidth(widths)) } resplit(utrseqs, mcols(utrs)$gene_id)If you want to keep the 3' and 5' UTRs separate, then maybe this code (3' UTR only, repeat for 5'):
tx <- transcripts(txdb, columns=c("tx_name", "gene_id")) utrs <- threeUTRsByTranscript(txdb, use.names=TRUE) mcols(utrs)$gene_id <- drop(tx$gene_id[match(names(utrs), tx$tx_name)]) utrseqs <- extractTranscriptSeqs(Hsapiens, utrs) resplit(utrseqs, mcols(utrs)$gene_id) # resplit defined in previous answerYep, that will do it.
A few more things:
Instead of using
resplit(), you can just dounstrsplit(split(utrseqs, mcols(utrs)$gene_id))for concatenating the UTRs sequences from the same gene. There is one difference though:resplit()will concatenate together all the transcripts that are not linked to a gene (orphan transcripts) but theunstrsplit(split())solution will drop them:The additional sequence in
answer1is from all the orphan transcripts.If you want the
unstrsplit(split())solution to keep them:Finally, in your original question you say that "extractTranscriptSeqs requires that the exons be ordered by genomic location" which is not correct. From its man page:
Note that, for each transcript, the exons must be ordered by
ascending rank, that is, by their position in the transcript.
See
?extractTranscriptSeqsfor more information.Cheers,
H.