Question: Finding the distance between the end of each ORF in the UTR and the start of the first CDS exon, in transcript coordinates
3 months ago, hauken_heyken wrote:

This is a speedup question, like my last one: my data (several TB) is too big for the approach I would normally use.

How can I match two GRangesList objects by name and calculate the distance within each match (that is, between ranges from the same transcript)? The first list contains several ORFs per transcript, while the CDS list has exactly one range per transcript, namely the first exon.

I have already made these lists, so a solution would be:

# uorfs: list of ORFs in the UTR
# cdsFirstExon: list of first CDS exons, restricted to transcripts that have uORFs
merged = merge(uorfs, cdsFirstExon, by.x = names(uorfs), by.y = names(cdsFirstExon))
distances = merged$uorf.end - merged$cdsFirstExon.start # distances now contains what I want

 

But the merging step is too slow with big data. Is there a vectorized way?

3 months ago, Michael Lawrence (United States) wrote:

Well, merge() is vectorized, but it handles a more general case than you require, so it is slower than something simpler. I think you can reorder the first exons to match the ORF list, flatten them to a GRanges, flatten the ORFs, and expand the first exons accordingly (one copy per ORF). It may be easier and faster in the long run to keep the data as GRanges.

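# Reorder the first exons to match the ORF list, then flatten to a GRanges
cdsFirstExon_gr <- unlist(cdsFirstExon[names(uorfs)])
# Flatten the ORFs: one range per ORF
uorfs_gr <- unlist(uorfs)
# Repeat each first exon once per ORF in its transcript
cdsFirstExon_long <- rep(cdsFirstExon_gr, lengths(uorfs))
# Element-wise distance between each ORF and its transcript's first exon
distance(uorfs_gr, cdsFirstExon_long)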
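
The same reorder, flatten and repeat idea can be sketched in base R on plain integer positions, which may help when the signed difference from the question (rather than the gap that distance() reports) is what is needed. This is only an illustration: the names uorf_ends and cds_start and all values are made up, and it assumes the positions have already been converted to transcript coordinates.

```r
# Toy data (hypothetical): ORF end positions per transcript, in transcript coordinates
uorf_ends <- list(tx1 = c(30L, 75L), tx2 = 40L, tx3 = c(12L, 50L, 90L))
# One first-exon CDS start per transcript, deliberately in a different order
cds_start <- c(tx3 = 120L, tx1 = 100L, tx2 = 60L)

# Reorder the CDS starts to match the ORF list, then repeat each once per ORF
cds_start_long <- rep(cds_start[names(uorf_ends)], lengths(uorf_ends))
# Flatten the ORF ends; the two vectors are now aligned element by element
uorf_end_flat <- unlist(uorf_ends, use.names = FALSE)

# Signed distance from each ORF end to its CDS start (positive if the ORF ends upstream)
distances <- unname(cds_start_long) - uorf_end_flat
print(distances)
# 70 25 20 108 70 30

# Optionally regroup the result by transcript
distances_by_tx <- split(distances,
                         factor(names(cds_start_long), levels = names(uorf_ends)))
```

No merge or per-transcript loop is involved: the only work is one subsetting step, one rep() and one vector subtraction, so the cost grows linearly with the number of ORFs.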

 


 
