Question: How to replace NAs values in "CompressedIntegerList" objects [reproducible example is included] ?
0
3.6 years ago by
Chicago, IL, USA
Jurat Shahidin70 wrote:

Hi,

I did parallel searching overlapped regions from two GRanges objects, and I did Vectorization over them, so I got CompressedIntegerList objects (a.k.a, non-overlapped regions indicates integer(0) in the list).

Now, I need to sum up two vectors (a.k.a, sum of vector indicates overall overlapped regions number for each). However, I did keep only one (the one with lowest pvalue if multiple intersection found) and I have new CompressedIntegerList objects, but it has NAs index now ( NAs values indicates non-overlapped regions).

I did sum over two vector, but R consider length of regions with NAs index as 1, while R also consider length of region with interger(0) as 0. So, I got wrong results if I used regions with NAs index for vector sum(though I kept only one regiosn with lowest pvalue, but it has NAs index, this gives wrong results for vector sum) .

I tried every possible solution from statckoverflow and I could not find solution yet. How can I fix this issue? Thank you

For the sake of clarity of problem, I have this reproducible example:

# set up

a <- GRanges(
seqnames=Rle(c("chr1", "chr2", "chr3", "chr4"), c(3, 2, 1, 2)),
ranges=IRanges(seq(1, by=9, len=8), seq(7, by=9, len=8)),
rangeName=letters[seq(1:8)], score=sample(1:20, 8, replace = FALSE))
b <- GRanges(
seqnames=Rle(c("chr1", "chr2", "chr3","chr4"), c(4, 1, 2, 1)),
ranges=IRanges(seq(2, by=6, len=8), seq(4, by=6, len=8)),
rangeName=letters[seq(1:8)], score=sample(1:20, 8, replace = FALSE))
c <- GRanges(
seqnames=Rle(c("chr1", "chr2", "chr3","chr4"), c(2, 3, 1,2)),
ranges=IRanges(seq(6, by=7, len=8), seq(8, by=7, len=8)),
rangeName=letters[seq(1:8)], score=sample(1:20, 8, replace = FALSE))

# pre-step : pvalue conversion for a, b,c

a$pvalue <- 10^(score(a)/-1) a$score <- NULL
b$pvalue <- 10^(score(b)/-1) b$score <- NULL
c$pvalue <- 10^(score(c)/-1) c$score <- NULL


​#find overlap

ov_ab <- as(findOverlaps(a, b), "List")
ov_ac <- as(findOverlaps(a, c), "List")


#keep only one regions with lowest pvalue from multiple intersection

b_idx <- as(which.min(extractList(b$pvalue, ov_ab)), "List") c_idx <- as(which.min(extractList(c$pvalue, ov_ac)), "List")


# count total overlapped number : (all NA must be replaced with integer(0) to proceed vector sum)

b_idx <- c(1,2,1,1,NA, NA, NA, NA)
c_idx <- c(1,1, NA, 2, NA, NA, 1, NA)

# my desired output (I did manually, because I tried to replace NA with integer(0) but I always get error) :

a.len <- as(c(1,1,1,1,1,1,1,1), "List")
b_one.len <- as(c(1,1,1,1, 0, 0, 0, 0), "List")
c_one.len <- as(c(1,1,0, 1, 0, 0,1, 0), "List")
totNumOvlap <-  a.len + b_one.len + c_one.len

# this is my desire output  (vector sum is very crucial for accuracy of overall result) :

totNumOvlp <- c(3,3,2,3,1,1,2,1)

Objective: I must replace NAs values into integer(0) in compressedIntegerList objects, then I can able to do vector sum over them.

How can I address this problem ? Is there any alternative way to solve the issue ? Thanks a lot

Best regards:

Jurat

granges R bug • 611 views
modified 3.6 years ago • written 3.6 years ago by Jurat Shahidin70
1

Please correct your code (lengths() rather than lenghts() and format code-blocks with 'Code' rather than 'Inline Code' (continue to use 'Inline Code' for things like GRanges in your first paragraph, where the code is part of the paragraph and not stand-alone.

Please provide the output that you are expecting -- a vector of length 4, I guess, but with what numbers? Maybe (lengths(a.S_ab.hitlist) >= 1) + (lengths(a.S_ac.hitlist) >= 1)?

Dear Martin Morgan:

Thanks you for reminding me that format issue of my posting, I have updated my original post.

However, I have followed your strategy from our discussion, I think your suggestion is very efficient to solve my problem. In our algorithm , parallel searching overlapped regions for every regions from a over b ,c , and count overall number of overlapped regions including regions from sample a, then compare with threshold value and make a decision whether all keep it or discard it. So, counting total overlapped regions for each regions from sample a in parallel way will effect overall accuracy of my results. Thanks a lot

Best regards:

Jurat

1

I took the lengths(a) and added the lengths(ov_ab) >= 1 and the lengths(ov_ac) >= 1

> lengths(a) + (lengths(ov_ab) >= 1) + (lengths(ov_ac) >= 1)
[1] 3 3 2 3 1 1 2 1

From your solution, I am wondering even I got this result that match with my desired output , but How do I know which regions from b,c were kept (the one with lowest pvalue) ? (because in ov_ab, ov_ac hitlist, we got multiple intersection, so we need to only keep one and I must know which one it was).

My strategy is firstly I have to keep only one regions out of multiple intersection from ov_ab, ov_ac, then I need to expand b_one, c_one respectively, finally I will do vector sum by using lenghts method from S4Vector packages.

How do I expand b_one , c_one that only kept one regions out of multiple intersection before doing vector sum? Because this part is one of the most important features of our algorithm. Thank you for your effort.

1

Hi Jurat -- I can't really spend more time helping here, so this will be my last reply; maybe others will help further. You know the answer to your current question from lengths(a) + (lengths(ov_ab) >= 1) + (lengths(ov_ac) >= 1). You know the regions that overlap from your earlier work, which you have edited out. Best of luck. Martin