How to replace NAs values in "CompressedIntegerList" objects [reproducible example is included] ?
0
0
Entering edit mode
@jurat-shahidin-9488
Last seen 4.1 years ago
Chicago, IL, USA

Hi,

I did parallel searching overlapped regions from two GRanges objects, and I did Vectorization over them, so I got CompressedIntegerList objects (a.k.a, non-overlapped regions indicates integer(0) in the list).

Now, I need to sum up two vectors (a.k.a, sum of vector indicates overall overlapped regions number for each). However, I did keep only one (the one with lowest pvalue if multiple intersection found) and I have new CompressedIntegerList objects, but it has NAs index now ( NAs values indicates non-overlapped regions).

I did sum over two vector, but R consider length of regions with NAs index as 1, while R also consider length of region with interger(0) as 0. So, I got wrong results if I used regions with NAs index for vector sum(though I kept only one regiosn with lowest pvalue, but it has NAs index, this gives wrong results for vector sum) .

I tried every possible solution from statckoverflow and I could not find solution yet. How can I fix this issue? Thank you

For the sake of clarity of problem, I have this reproducible example:

# set up

a <- GRanges(
  seqnames=Rle(c("chr1", "chr2", "chr3", "chr4"), c(3, 2, 1, 2)),
  ranges=IRanges(seq(1, by=9, len=8), seq(7, by=9, len=8)),
  rangeName=letters[seq(1:8)], score=sample(1:20, 8, replace = FALSE))
b <- GRanges(
  seqnames=Rle(c("chr1", "chr2", "chr3","chr4"), c(4, 1, 2, 1)),
  ranges=IRanges(seq(2, by=6, len=8), seq(4, by=6, len=8)),
  rangeName=letters[seq(1:8)], score=sample(1:20, 8, replace = FALSE))
c <- GRanges(
  seqnames=Rle(c("chr1", "chr2", "chr3","chr4"), c(2, 3, 1,2)),
  ranges=IRanges(seq(6, by=7, len=8), seq(8, by=7, len=8)),
  rangeName=letters[seq(1:8)], score=sample(1:20, 8, replace = FALSE))

# pre-step : pvalue conversion for a, b,c

a$pvalue <- 10^(score(a)/-1)
a$score <- NULL
b$pvalue <- 10^(score(b)/-1)
b$score <- NULL
c$pvalue <- 10^(score(c)/-1)
c$score <- NULL

​#find overlap 

ov_ab <- as(findOverlaps(a, b), "List")
ov_ac <- as(findOverlaps(a, c), "List")

#keep only one regions with lowest pvalue from multiple intersection

b_idx <- as(which.min(extractList(b$pvalue, ov_ab)), "List")
c_idx <- as(which.min(extractList(c$pvalue, ov_ac)), "List")

# count total overlapped number : (all NA must be replaced with integer(0) to proceed vector sum)

b_idx <- c(1,2,1,1,NA, NA, NA, NA)
c_idx <- c(1,1, NA, 2, NA, NA, 1, NA)

# my desired output (I did manually, because I tried to replace NA with integer(0) but I always get error) : 

a.len <- as(c(1,1,1,1,1,1,1,1), "List")     
b_one.len <- as(c(1,1,1,1, 0, 0, 0, 0), "List")           
c_one.len <- as(c(1,1,0, 1, 0, 0,1, 0), "List")             
totNumOvlap <-  a.len + b_one.len + c_one.len

# this is my desire output  (vector sum is very crucial for accuracy of overall result) : 

totNumOvlp <- c(3,3,2,3,1,1,2,1)

Objective: I must replace NAs values into integer(0) in compressedIntegerList objects, then I can able to do vector sum over them. 

How can I address this problem ? Is there any alternative way to solve the issue ? Thanks a lot

Best regards:

Jurat

r bug granges • 1.1k views
ADD COMMENT
1
Entering edit mode

Please correct your code (lengths() rather than lenghts() and format code-blocks with 'Code' rather than 'Inline Code' (continue to use 'Inline Code' for things like GRanges in your first paragraph, where the code is part of the paragraph and not stand-alone.

Please provide the output that you are expecting -- a vector of length 4, I guess, but with what numbers? Maybe (lengths(a.S_ab.hitlist) >= 1) + (lengths(a.S_ac.hitlist) >= 1)?

ADD REPLY
0
Entering edit mode

Dear Martin Morgan:

Thanks you for reminding me that format issue of my posting, I have updated my original post.

However, I have followed your strategy from our discussion, I think your suggestion is very efficient to solve my problem. In our algorithm , parallel searching overlapped regions for every regions from a over b ,c , and count overall number of overlapped regions including regions from sample a, then compare with threshold value and make a decision whether all keep it or discard it. So, counting total overlapped regions for each regions from sample a in parallel way will effect overall accuracy of my results. Thanks a lot

 

Best regards:

Jurat

ADD REPLY
1
Entering edit mode

I took the lengths(a) and added the lengths(ov_ab) >= 1 and the lengths(ov_ac) >= 1

> lengths(a) + (lengths(ov_ab) >= 1) + (lengths(ov_ac) >= 1)
[1] 3 3 2 3 1 1 2 1

 

ADD REPLY
0
Entering edit mode

From your solution, I am wondering even I got this result that match with my desired output , but How do I know which regions from b,c were kept (the one with lowest pvalue) ? (because in ov_ab, ov_ac hitlist, we got multiple intersection, so we need to only keep one and I must know which one it was).

My strategy is firstly I have to keep only one regions out of multiple intersection from ov_ab, ov_ac, then I need to expand b_one, c_one respectively, finally I will do vector sum by using lenghts method from S4Vector packages. 

How do I expand b_one , c_one that only kept one regions out of multiple intersection before doing vector sum? Because this part is one of the most important features of our algorithm. Thank you for your effort. 

 

ADD REPLY
1
Entering edit mode

Hi Jurat -- I can't really spend more time helping here, so this will be my last reply; maybe others will help further. You know the answer to your current question from lengths(a) + (lengths(ov_ab) >= 1) + (lengths(ov_ac) >= 1). You know the regions that overlap from your earlier work, which you have edited out. Best of luck. Martin

ADD REPLY
0
Entering edit mode

I am sorry, I think you did your best to solve my problem, I am grateful for your help. I can figure that out. 

ADD REPLY

Login before adding your answer.

Traffic: 882 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6