Search
Question: Reorder exons on negative strand in GRanges list
0
2.5 years ago by
Jake60
United States
Jake60 wrote:

I am trying to make a fasta file of reduced exon sequences (collapse all isoforms of a gene into one transcript). I'm running into a problem where exons on the negative strand are listed in ascending order based on chromosome coordinates, but this results in the exon sequences being written in the wrong order. I've tried several methods to re-order the negative strand elements, but they were all running very slow. What would be the best way to do this?

Thanks

library('BSgenome.Mmusculus.UCSC.mm10')
library('GenomicFeatures')
genome <- BSgenome.Mmusculus.UCSC.mm10
exonAll <- exonsBy(gencode, by='gene')
exonAll2 <- reduce(exonAll)
seqs <- extractTranscriptSeqs(genome, exonAll2)
writeXStringSet(seqs, '~/Desktop/basic.fa')
modified 2.5 years ago by Michael Lawrence10k • written 2.5 years ago by Jake60
2
2.5 years ago by
United States
Michael Lawrence10k wrote:

A case could be made that extractTranscriptSeqs() should take care of this, but for now you could reverse the elements of exonAll2 when the gene is on the negative strand, e.g.:

exonAll2 <- revElements(exonAll2, any(strand(exonAll2) == "-"))

I'm confused. Why should extractTranscriptSeqs() take care of this? The reduce() call removes the exon_rank information. The resulting GRangesList after the reduce() call does even not contain actual transcript sequences for many genes.

It's true that this operation probably does not make sense biologically, but my point about extractTranscriptSeqs() was that it could sort exons based on their strand, or at least throw an error if there is inconsistency. It's easy to get into that situation if not using a TxDb.

Ah, got it. Thanks.

It appears the case was never made.  revElements certainly works as suggested, but it is dreadfully slow. Is there any other new BioC juju that might make this operation sing (and, would it help that in my case ALL all my exonAll2 have exactly two elements)?

I'm surprised that revElements() on a GRangesList is slow. It's implemented pretty efficiently. Do you have an example?

Here's timing produced by the five statements in the following function, to give you a sense of relative speed on my computer:

user  system elapsed
12.227   0.871  13.100
user  system elapsed
41.688   3.735  45.423
user  system elapsed
636.549  29.291 665.827
user  system elapsed
5.991   0.060   6.052
user  system elapsed
164.827  62.318 227.457

writeIntronFlankingSeqs<-function(genomeName='BSgenome.Hsapiens.UCSC.hg38'
,TxDbName='TxDb.Hsapiens.UCSC.hg38.knownGene'
,flankSize=25
,con=paste(sep='.','sjdb',flankSize,TxDbName,'fa')
) {
## PURPOSE: Write a fasta file holding <flankSize> bases flanking
## each unique intron known to <TxDbName>, as extracted from the
## BSgenome named <genomeName>.
library(rtracklayer)
library(GenomicFeatures)
library(BSgenome)
genome<-getBSgenome(genomeName)
require(TxDbName,char=T)
TxDb<-eval(parse(text=TxDbName))

print(system.timeintron.gr<-unique(unlist(intronsByTranscript(TxDb)))))

print(system.time(
flanking.grl<-
## Kudos: C: Element-wise setdiff when psetdiff fails for GenomicRanges
setdiff(as(trimintron.gr+flankSize),'GRangesList')
,asintron.gr,'GRangesList'))))

stopifnot(all(2==lengths(flanking.grl)))

print(system.time(
flanking.grl<-
## Required for correct 5-3' order in subsequent call to
## extractTranscriptSeqs.  Kudos:
## [[Reorder exons on negative strand in GRanges list][Reorder exons on
## negative strand in GRanges list]] - along with appeal for
## faster approach (THIS IS SLOW).
revElements(flanking.grl,any(strand(flanking.grl)=='-'))
))

print(system.time(
names(flanking.grl)<-
## Creates names of the form:
## chr:start1-end1:strand^chr:start2-end2:strand
do.call(function(...) paste(sep='^',...)
,as.data.frame(matrix(as.character(unlist(flanking.grl))
,ncol=2
,byrow=T)))
))

print(system.time(
writeXStringSet(extractTranscriptSeqs(genome,flanking.grl)
,con)
))

}


Thanks for taking a squiz Michael.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Malcolm Cook1.5k

I will look into the performance issue but I'm wondering whether it wouldn't be better to do this using flank(), which would get the upstream/downstream right based on the strand.

Turns out the issue is with the any(strand(flanking.grl)=='-')). There was a missing fast path for any() on a CompressedRleList. Try IRanges 2.15.19.

Mike - thanks for offering to take a look, and even more so for the "spot on" comment about using flank... you're absolutely right, computing flanking.grl as below is ~100x faster

up.gr<-trim(flank(intron.gr,flankSize,T))
down.gr<-trim(flank(intron.gr,flankSize,F))
flanking.grl<-split(c(up.gr,down.gr),seq_along(up.gr))
names(flanking.grl)<- paste(sep='^',as.character(up.gr),as.character(down.gr))

Any more idiomatic approaches to building that GRangesList are most welcome... I expect there is some unlist/relist juju I've never fully appreciated.

ADD REPLYlink modified 28 days ago • written 28 days ago by Malcolm Cook1.5k