Question: Understanding group_by, reduce_ranges,
0
gravatar for Konstantinos Yeles
13 days ago by
University of Salerno, Salerno, Italy
Konstantinos Yeles20 wrote:

Hello Bioconductor,

I'm am trying to understand how the functions of group_by and reduce in plyranges package work. My purpose is to create a Granges object that has all the genomic ranges from BAM files and reduce them, to find the common / union in all samples.

example:

# toy samples taken/modified from BiocWorkshops/fluent-genomic-data-analysis-with-plyranges
set.seed(42)
pos <- sample(1:10000, size = 100)
size <- sample(10:50, size = 100, replace = TRUE)
strand <- sample(c("+","-"), size = 100, replace = TRUE)
rep1 <- data.frame(seqnames = "chr1", 
              start = pos,
              width = size,
              strand =strand, 
              counts = round(rnorm(100, mean = 100,sd = 20)))
rep1 <- as_granges(rep1) 
seqlengths(rep1) <- 248956422
genome(rep1) <- "hg38"
rep2 <- rep1 %>% anchor_center %>% mutate(width = width / 2, counts = counts + 100 ) %>% shift_downstream(8L)
rep3 <- rep1 %>% shift_upstream(11L) %>% mutate(counts = counts / 2)
rep4 <- rep3 %>% anchor_start() %>% mutate(width = width * 1.3, counts = counts + 1000 ) %>% shift_downstream(7L)
rep4[c(1 , 50:70)] <- rep2[c(1, 50:70)] %>% shift_downstream(10L)
rep5 <- rep4 %>% anchor_center() %>% mutate(width = width * 1.5)
rep5[c(10:30)] <- rep2[c(20:40)] %>% anchor_center() %>% mutate(width = width / 2)
intensities <- bind_ranges(rep1, rep2, rep3, rep4, .id = "replicate") %>%
arrange(start)
r1 <- intensities %>% group_by(strand,replicate) %>% reduce_ranges_directed() %>% as_tibble
r2 <- intensities %>% group_by(strand) %>%  reduce_ranges_directed() %>% as_tibble
r3 <- intensities %>% reduce_ranges_directed() %>% as_tibble
r4 <- intensities %>% group_by(strand,replicate) %>% reduce_ranges() %>% as_tibble
r5 <- intensities %>% group_by(strand) %>% reduce_ranges()%>% as_tibble
r6 <- intensities %>%  reduce_ranges()%>% as_tibble
sapply(list(r1,r2,r3,r4,r5,r6), nrow)
[1] 349  81  81 349  81  63
identical(r2$end,r3$end)
[1] TRUE

r1 and r4 are identical, as r2, r3, r5. Although we start from 500 ranges with grouping by strand and replicate it keeps 457 ranges. In my specific case, I have the information about the strand and using whichever of the "r2,r3,r5" ways to collapse the Granges probably it doesn't change anything. In case of a Granges object with multiple chromosomes do we need to groupby(seqnames) or just by using reduceranges_directed() would be enough?

Then I would like to find the overlaps from each sample with the collapsed grange object in order to acquire a matrix with the counts:

r3 <- intensities %>% reduce_ranges_directed()
rep1 <- rep1  %>%  select(rep_1_counts = "counts",everything()) %>% arrange(start)
rep2 <- rep2  %>%  select(rep_2_counts = "counts",everything()) %>% arrange(start)
r3 %>% join_overlap_left_directed( rep1 ) %>% 
mutate(rep_1_counts = replace_na(rep_1_counts,0)) %>%  
  join_overlap_left_directed( rep2 ) %>% 
  mutate(rep_2_counts = replace_na(rep_2_counts,0))



      GRanges object with 164 ranges and 2 metadata columns:
        seqnames    ranges strand | rep_1_counts rep_2_counts
           <Rle> <IRanges>  <Rle> |    <numeric>    <numeric>
    [1]     chr1      5-63      + |          131          231
    [2]     chr1     80-89      + |            0            0
    [3]     chr1    91-168      + |          109          209
    [4]     chr1    91-168      + |          109          214
    [5]     chr1    91-168      + |          109          212
    ...      ...       ...    ... .          ...          ...
  [160]     chr1 9290-9333      - |          109          209
  [161]     chr1 9467-9515      - |          112          212
  [162]     chr1 9544-9585      - |          142          242
  [163]     chr1 9770-9834      - |           69          169
  [164]     chr1 9954-9995      - |          118          218
  -------
  seqinfo: 2 sequences from 2 genomes (NA, hg38)

Why it has now an NA in genomes and how can I remove it? In these toy data, there is the issue of having the same region e.g.: 91-168 with several entries, how is it possible to address it if it will arrise in real data?

Thank you for your time

granges plyranges • 96 views
ADD COMMENTlink modified 12 days ago by lee.s70 • written 13 days ago by Konstantinos Yeles20
Answer: Understanding group_by, reduce_ranges,
2
gravatar for lee.s
12 days ago by
lee.s70
lee.s70 wrote:

Hi Konstantinos,

r1 and r4 are identical, as r2, r3, r5. Although we start from 500 ranges with grouping by strand and replicate it keeps 457 ranges. In my specific case, I have the information about the strand and using whichever of the "r2,r3,r5" ways to collapse the Granges probably it doesn't change anything. In case of a Granges object with multiple chromosomes do we need to groupby(seqnames) or just by using reduceranges_directed() would be enough?

The purpose of reduce_ranges() is to merge overlapping ranges (and optionally apply some transformation based on those ranges metadata). What is considered as overlapping ranges, depends on the groupings and whether you are considering strand (i.e. using directed). So when you do groupby(replicate) then apply reduceranges, ranges within each replicate will be reduce. You don't need to group by seqnames when using reduceranges or reduceranges_directed as ranges on different seqnames are distinct.

Why it has now an NA in genomes and how can I remove it? In these toy data, there is the issue of having the same region e.g.: 91-168 with several entries, how is it possible to address it if it will arrise in real data?

This is a bug in left overlap join that should be fixed in plyranges 1.6.2 (once it gets onto the servers). For ranges, where there were multiple hits you could possibly use groupby + summarise + asgranges or group_by + reduce to aggregate the counts.

Here's an example - we use selfmatch as a short cut for doing group_by(seqnames, start, end, strand), also note that you can use the dplyr scoped verbs to save some type when replacing the NA values.

ans <- r3 %>% 
  join_overlap_left_directed( rep1 ) %>%  
  join_overlap_left_directed( rep2 ) %>% 
  mutate(revmap = selfmatch(.)) %>%
  dplyr::mutate_at(dplyr::vars(dplyr::starts_with('rep')), 
                   function(.) ifelseis.na(.), 0, .)
  ) %>% 
  group_by(revmap) %>% 
  reduce_ranges_directed(
    rep_1_counts = mean(rep_1_counts, na.rm = TRUE),
    rep_2_counts = mean(rep_2_counts, na.rm = TRUE)
  )
ADD COMMENTlink modified 11 days ago • written 12 days ago by lee.s70

Hello lee.s, thank you for your answer. I run the code but I get this:

ans <- r3 %>% 
  join_overlap_left_directed( rep1 ) %>%  
  join_overlap_left_directed( rep2 ) %>% 
  mutate(revmap = selfmatch(.)) %>%
  dplyr::mutate_at(dplyr::vars(dplyr::starts_with('rep')), 
                   function(.) ifelseis.na(.), 0, .) %>% 
  group_by(revmap) %>% 
  reduce_ranges_directed(
    rep_1_counts = mean(rep_1_counts, na.rm = TRUE),
    rep_2_counts = mean(rep_2_counts, na.rm = TRUE)
  )

Error in UseMethod("tblvars") : no applicable method for 'tblvars' applied to an object of class "quosures"

ADD REPLYlink written 12 days ago by Konstantinos Yeles20

what version of plyranges are you using? could you post your sessionInfo or provide a reprex?

ADD REPLYlink modified 12 days ago • written 12 days ago by lee.s70

I found the problem a bracket was missing...

 function(.) ifelseis.na(.), 0, .) %>% 
                   ^
 function(.) ifelse/is.na(.), 0, .) %>%

Regarding reduce_ranges_directed is there a way to use the mean not manually? And when we have more seqlevels the correct way to perform reduce is :

r3 <- intensities %>% reduce_ranges_directed()

without using any grouping?

ADD REPLYlink modified 11 days ago • written 11 days ago by Konstantinos Yeles20
1

it seems like the support site code editor doesn't show the bracket - how strange! What do you mean by the second part?

ADD REPLYlink written 4 days ago by lee.s70

By "manually way" I mean something to use like apply or map to be done for each of the samples instead of writing for each sample rep_1_counts = mean(rep_1_counts, na.rm = TRUE), but in real data, it is very time consuming (just using 3 samples it took 15mins).

reduce_ranges_directed(
    rep_1_counts = mean(rep_1_counts, na.rm = TRUE),
    rep_2_counts = mean(rep_2_counts, na.rm = TRUE)
  )

For that reason, I was searching a way to perform it faster and it seems is better to do it to each sample as tibble instead to a plyrange

ADD REPLYlink modified 4 days ago • written 4 days ago by Konstantinos Yeles20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 448 users visited in the last hour