Question

Using genomic ranges to get counts of non-overlapping (non-redundant) regions

0

Entering edit mode

new_user6 • 0

@new_user6-20726

Last seen 6.2 years ago

I have a list of peaks as below, sample shown for 2 samples but have more than that

myPeaks

 $`HindBrain_1`

      GRanges object with 6906 ranges and 0 metadata columns:
     seqnames            ranges strand
        <Rle>         <IRanges>  <Rle>
 [1]     chr1     180772-180997      *
 [2]     chr1     199827-199960      *
 [3]     chr1     629105-630021      *
 [4]     chr1     630263-630476      *
 [5]     chr1     630757-631369      *
 ...      ...               ...    ...
[6902]     chrY 56769777-56769888      *
[6903]     chrY 56839405-56839501      *
[6904]     chrY 56847748-56847872      *
[6905]     chrY 56850422-56850536      *
[6906]     chrY 56850787-56850976      *

seqinfo: 72 sequences from an unspecified genome; no seqlengths

$`HindBrain_2`

GRanges object with 13456 ranges and 0 metadata columns:

      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]     chr1       10371-10459      *
  [2]     chr1       29279-29387      *
  [3]     chr1     180747-180991      *
  [4]     chr1     199817-199972      *
  [5]     chr1     629105-629884      *
  ...      ...               ...    ...
 [13452]     chrY 26640511-26640600      *
 [13453]     chrY 56838911-56839024      *
 [13454]     chrY 56839408-56839506      *
 [13455]     chrY 56850376-56850536      *
 [13456]     chrY 56850784-56850953      *

seqinfo: 72 sequences from an unspecified genome; no seqlengths

and would like to get a count of the non-redundant regions (based on presence or absence in that sample), with a consensus id, something like this

 GRanges object with 89654 ranges and 2 metadata columns:
      seqnames               ranges strand | HindBrain_1 HindBrain_2
         <Rle>            <IRanges>  <Rle> |   <numeric>   <numeric>
  [1]     chr1   [3130336, 3130413]      * |           0           0
  [2]     chr1   [3400112, 3400250]      * |           0           0
  [3]     chr1   [3433954, 3434095]      * |           0           0
  [4]     chr1   [3515020, 3515102]      * |           0           0
  [5]     chr1   [3575895, 3576078]      * |           0           0
  ...      ...                  ...    ... .         ...         ...
 [89650]     chrY [90805058, 90805151]      * |           0           0
 [89651]     chrY [90807391, 90807562]      * |           0           0
 [89652]     chrY [90808761, 90808841]      * |           0           1
 [89653]     chrY [90812961, 90813237]      * |           1           1
 [89654]     chrY [90829708, 90829782]      * |           0           0

          consensusIDs
               <factor>
  [1]         consensus_1
  [2]         consensus_2
  [3]         consensus_3
  [4]         consensus_4
  [5]         consensus_5
  ...       ...       ...       ...       ...             ...
 [89650]      consensus_89765
 [89651]      consensus_89766
 [89652]      consensus_89767
 [89653]      consensus_89768
 [89654]      consensus_89769

seqinfo: 22 sequences from an unspecified genome; no seqlengths

I tried using soGGi:::runConsensusRegions but the function seems depreciated

genomicranges granges grangeslist R • 2.4k views

ADD COMMENT • link updated 6.2 years ago by James W. MacDonald 68k • written 6.2 years ago by new_user6 • 0

score 1 · Answer 1 · 2019-09-10

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 3 hours ago

United States

It's not deprecated:

> soGGi:::runConsensusRegions
function (testRanges, method = "majority", overlap = "any") 
{
    if (class(testRanges) == "GRangesList" & length(testRanges) > 
        1) {
        reduced <- reduce(unlist(testRanges))
        consensusIDs <- paste0("consensus_", seq(1, length(reduced)))
        mcols(reduced) <- do.call(cbind, lapply(testRanges, function(x) (reduced %over% 
            x) + 0))
        if (method == "majority") {
            reducedConsensus <- reduced[rowSums(as.data.frame(mcols(reduced))) > 
                length(testRanges)/2, ]
        }
        if (method == "none") {
            reducedConsensus <- reduced
        }
        if (is.numeric(method)) {
            reducedConsensus <- reduced[rowSums(as.data.frame(mcols(reduced))) > 
                method, ]
        }
        consensusIDs <- paste0("consensus_", seq(1, length(reducedConsensus)))
        mcols(reducedConsensus) <- cbind(as.data.frame(mcols(reducedConsensus)), 
            consensusIDs)
        return(reducedConsensus)
    }
}
<bytecode: 0x00000000046b0cf8>
<environment: namespace:soGGi>

But it isn't exported either, so why use an unexported function when the exported findconsensusRegions function uses runConsensusRegions to do its work, and appears to do what you want?

ADD COMMENT • link 6.2 years ago James W. MacDonald 68k

0

Entering edit mode

Mainly because my input is a list of peaks, am not sure that findconsensusRegions uses a list of peaks, and did not find any examples for the input in the documentation

ADD REPLY • link 6.2 years ago new_user6 • 0

0

Entering edit mode

So, findconsensusRegions requires a named character vector as input whereas I have a list of Granges objects. Should I just convert them into named character vectors and store them as a list? Tried looking for more documentation but could not find any examples in the vignette

ADD REPLY • link 6.2 years ago new_user6 • 0

1

Entering edit mode

It looks like findconsensusPeaks does do extra things that maybe you don't want. But as you have already noted, the unexported runConsensusPeaks function may do what you want, so maybe that's all you need.

It's generally a bad idea to use unexported functions simply because as such they are not expected to remain the same, nor do the same thing, nor even exist in future iterations of the package. The code for runConsensusPeaks is pretty straightforward, so you might consider writing your own function that does exactly what you want, based on that function, in which case you would then have control over what it does and can adjust as you see fit.

ADD REPLY • link 6.2 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks James. The solution was right there, and I did not notice it. Posting so that someone else might find it useful.

An excerpt from the function soGGi:::runConsensusRegions does the job.

non_overlapping_region_counts<-function(x){
reduced <- reduce(unlist(myGRangesList))
consensusIDs <- paste0("consensus_", seq(1, length(reduced)))
mcols(reduced) <- do.call(cbind, lapply(myGRangesList, function(x) (reduced %over% x) + 0))
reducedConsensus <- reduced
mcols(reducedConsensus) <- cbind(as.data.frame(mcols(reducedConsensus)), consensusIDs)
consensusIDs <- paste0("consensus_", seq(1, length(reducedConsensus)))
return(reducedConsensus)
}

Where the input is of the class "CompressedGRangesList".

Reference to the package http://bioconductor.org/packages/release/bioc/html/soGGi.html

ADD REPLY • link 6.2 years ago new_user6 • 0

1

Entering edit mode

Note that your function won't work, except in the case that you have an object myGRangesList that exists in your .GlobalEnv.

> grlst <- GRangesList(GRanges(c("chr1","chr2"), IRanges(c(1,5), c(6,9))), GRanges(c("chr1","chr2"), IRanges(c(45,34), c(55, 69))))
> grlst
GRangesList object of length 2:
[[1]] 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-6      *
  [2]     chr2       5-9      *

[[2]] 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames ranges strand
  [1]     chr1  45-55      *
  [2]     chr2  34-69      *

-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
> non_overlapping_region_counts(grlst)
Error in unlist(myGRangesList) : object 'myGRangesList' not found
> myGRangesList <- grlst
> non_overlapping_region_counts(grlst)
GRanges object with 4 ranges and 4 metadata columns:
      seqnames    ranges strand |        V1        V2 consensusIDs consensusIDs
         <Rle> <IRanges>  <Rle> | <numeric> <numeric>     <factor>     <factor>
  [1]     chr1       1-6      * |         1         0  consensus_1  consensus_1
  [2]     chr1     45-55      * |         0         1  consensus_2  consensus_2
  [3]     chr2       5-9      * |         1         0  consensus_3  consensus_3
  [4]     chr2     34-69      * |         0         1  consensus_4  consensus_4
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

ADD REPLY • link 6.2 years ago James W. MacDonald 68k

1

Entering edit mode

Put another way, the argument to your function isn't used at all.

> non_overlapping_region_counts()
GRanges object with 4 ranges and 4 metadata columns:
      seqnames    ranges strand |        V1        V2 consensusIDs consensusIDs
         <Rle> <IRanges>  <Rle> | <numeric> <numeric>     <factor>     <factor>
  [1]     chr1       1-6      * |         1         0  consensus_1  consensus_1
  [2]     chr1     45-55      * |         0         1  consensus_2  consensus_2
  [3]     chr2       5-9      * |         1         0  consensus_3  consensus_3
  [4]     chr2     34-69      * |         0         1  consensus_4  consensus_4
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

ADD REPLY • link 6.2 years ago James W. MacDonald 68k

0

Entering edit mode

Sorry about that, I already had a Grangeslist object in my environment, should have mentioned that.

ADD REPLY • link 6.2 years ago new_user6 • 0

0

Entering edit mode

My point is that you have to have a GRangesList called 'myGrangesList' in your .GlobalEnv for your function to work, and that the argument (x) to your function is superfluous. This is not ideal! You normally want your arguments to actually be, you know, arguments, and not to require some exacting set up for the function to work.

ADD REPLY • link 6.2 years ago James W. MacDonald 68k