Hi,
I did parallel searching overlapped regions from two GRanges
objects, and I did Vectorization
over them, so I got CompressedIntegerList
objects (a.k.a, non-overlapped regions indicates integer(0)
in the list).
Now, I need to sum up two vectors (a.k.a, sum of vector indicates overall overlapped regions number for each). However, I did keep only one (the one with lowest pvalue
if multiple intersection found) and I have new CompressedIntegerList
objects, but it has NAs
index now ( NAs
values indicates non-overlapped regions).
I did sum over two vector, but R consider length of regions with NAs
index as 1
, while R also consider length of region with interger(0)
as 0
. So, I got wrong results if I used regions with NAs
index for vector sum(though I kept only one regiosn with lowest pvalue
, but it has NAs
index, this gives wrong results for vector sum) .
I tried every possible solution from statckoverflow and I could not find solution yet. How can I fix this issue? Thank you
For the sake of clarity of problem, I have this reproducible example:
# set up
a <- GRanges( seqnames=Rle(c("chr1", "chr2", "chr3", "chr4"), c(3, 2, 1, 2)), ranges=IRanges(seq(1, by=9, len=8), seq(7, by=9, len=8)), rangeName=letters[seq(1:8)], score=sample(1:20, 8, replace = FALSE)) b <- GRanges( seqnames=Rle(c("chr1", "chr2", "chr3","chr4"), c(4, 1, 2, 1)), ranges=IRanges(seq(2, by=6, len=8), seq(4, by=6, len=8)), rangeName=letters[seq(1:8)], score=sample(1:20, 8, replace = FALSE)) c <- GRanges( seqnames=Rle(c("chr1", "chr2", "chr3","chr4"), c(2, 3, 1,2)), ranges=IRanges(seq(6, by=7, len=8), seq(8, by=7, len=8)), rangeName=letters[seq(1:8)], score=sample(1:20, 8, replace = FALSE))
# pre-step : pvalue conversion for a, b,c
a$pvalue <- 10^(score(a)/-1) a$score <- NULL b$pvalue <- 10^(score(b)/-1) b$score <- NULL c$pvalue <- 10^(score(c)/-1) c$score <- NULL
#find overlap
ov_ab <- as(findOverlaps(a, b), "List") ov_ac <- as(findOverlaps(a, c), "List")
#keep only one regions with lowest pvalue from multiple intersection
b_idx <- as(which.min(extractList(b$pvalue, ov_ab)), "List") c_idx <- as(which.min(extractList(c$pvalue, ov_ac)), "List")
# count total overlapped number : (all NA must be replaced with integer(0) to proceed vector sum)
b_idx <- c(1,2,1,1,NA, NA, NA, NA) c_idx <- c(1,1, NA, 2, NA, NA, 1, NA)
# my desired output (I did manually, because I tried to replace NA with integer(0) but I always get error) :
a.len <- as(c(1,1,1,1,1,1,1,1), "List") b_one.len <- as(c(1,1,1,1, 0, 0, 0, 0), "List") c_one.len <- as(c(1,1,0, 1, 0, 0,1, 0), "List") totNumOvlap <- a.len + b_one.len + c_one.len
# this is my desire output (vector sum is very crucial for accuracy of overall result) :
totNumOvlp <- c(3,3,2,3,1,1,2,1)
Objective: I must replace NAs
values into integer(0)
in compressedIntegerList objects, then I can able to do vector sum over them.
How can I address this problem ? Is there any alternative way to solve the issue ? Thanks a lot
Best regards:
Jurat
Please correct your code (lengths() rather than lenghts() and format code-blocks with 'Code' rather than 'Inline Code' (continue to use 'Inline Code' for things like
GRanges
in your first paragraph, where the code is part of the paragraph and not stand-alone.Please provide the output that you are expecting -- a vector of length 4, I guess, but with what numbers? Maybe
(lengths(a.S_ab.hitlist) >= 1) + (lengths(a.S_ac.hitlist) >= 1)
?Dear Martin Morgan:
Thanks you for reminding me that format issue of my posting, I have updated my original post.
However, I have followed your strategy from our discussion, I think your suggestion is very efficient to solve my problem. In our algorithm , parallel searching overlapped regions for every regions from
a
overb ,c
, and count overall number of overlapped regions including regions from samplea,
then compare with threshold value and make a decision whether all keep it or discard it. So, counting total overlapped regions for each regions from samplea
in parallel way will effect overall accuracy of my results. Thanks a lotBest regards:
Jurat
I took the
lengths(a)
and added thelengths(ov_ab) >= 1
and thelengths(ov_ac) >= 1
From your solution, I am wondering even I got this result that match with my desired output , but How do I know which regions from
b,c
were kept (the one with lowest pvalue) ? (because inov_ab, ov_ac
hitlist, we got multiple intersection, so we need to only keep one and I must know which one it was).My strategy is firstly I have to keep only one regions out of multiple intersection from
ov_ab, ov_ac
, then I need to expandb_one, c_one
respectively, finally I will do vector sum by usinglenghts
method from S4Vector packages.How do I expand
b_one , c_one
that only kept one regions out of multiple intersection before doing vector sum? Because this part is one of the most important features of our algorithm. Thank you for your effort.Hi Jurat -- I can't really spend more time helping here, so this will be my last reply; maybe others will help further. You know the answer to your current question from
lengths(a) + (lengths(ov_ab) >= 1) + (lengths(ov_ac) >= 1)
. You know the regions that overlap from your earlier work, which you have edited out. Best of luck. MartinI am sorry, I think you did your best to solve my problem, I am grateful for your help. I can figure that out.