Question: Using genomic ranges to get counts of non-overlapping (non-redundant) regions
0
gravatar for new_user6
6 days ago by
new_user60
new_user60 wrote:

I have a list of peaks as below, sample shown for 2 samples but have more than that

myPeaks

 $`HindBrain_1`

      GRanges object with 6906 ranges and 0 metadata columns:
     seqnames            ranges strand
        <Rle>         <IRanges>  <Rle>
 [1]     chr1     180772-180997      *
 [2]     chr1     199827-199960      *
 [3]     chr1     629105-630021      *
 [4]     chr1     630263-630476      *
 [5]     chr1     630757-631369      *
 ...      ...               ...    ...
[6902]     chrY 56769777-56769888      *
[6903]     chrY 56839405-56839501      *
[6904]     chrY 56847748-56847872      *
[6905]     chrY 56850422-56850536      *
[6906]     chrY 56850787-56850976      *

seqinfo: 72 sequences from an unspecified genome; no seqlengths

$`HindBrain_2`

GRanges object with 13456 ranges and 0 metadata columns:

      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]     chr1       10371-10459      *
  [2]     chr1       29279-29387      *
  [3]     chr1     180747-180991      *
  [4]     chr1     199817-199972      *
  [5]     chr1     629105-629884      *
  ...      ...               ...    ...
 [13452]     chrY 26640511-26640600      *
 [13453]     chrY 56838911-56839024      *
 [13454]     chrY 56839408-56839506      *
 [13455]     chrY 56850376-56850536      *
 [13456]     chrY 56850784-56850953      *

seqinfo: 72 sequences from an unspecified genome; no seqlengths

and would like to get a count of the non-redundant regions (based on presence or absence in that sample), with a consensus id, something like this

 GRanges object with 89654 ranges and 2 metadata columns:
      seqnames               ranges strand | HindBrain_1 HindBrain_2
         <Rle>            <IRanges>  <Rle> |   <numeric>   <numeric>
  [1]     chr1   [3130336, 3130413]      * |           0           0
  [2]     chr1   [3400112, 3400250]      * |           0           0
  [3]     chr1   [3433954, 3434095]      * |           0           0
  [4]     chr1   [3515020, 3515102]      * |           0           0
  [5]     chr1   [3575895, 3576078]      * |           0           0
  ...      ...                  ...    ... .         ...         ...
 [89650]     chrY [90805058, 90805151]      * |           0           0
 [89651]     chrY [90807391, 90807562]      * |           0           0
 [89652]     chrY [90808761, 90808841]      * |           0           1
 [89653]     chrY [90812961, 90813237]      * |           1           1
 [89654]     chrY [90829708, 90829782]      * |           0           0

          consensusIDs
               <factor>
  [1]         consensus_1
  [2]         consensus_2
  [3]         consensus_3
  [4]         consensus_4
  [5]         consensus_5
  ...       ...       ...       ...       ...             ...
 [89650]      consensus_89765
 [89651]      consensus_89766
 [89652]      consensus_89767
 [89653]      consensus_89768
 [89654]      consensus_89769

seqinfo: 22 sequences from an unspecified genome; no seqlengths

I tried using soGGi:::runConsensusRegions but the function seems depreciated

ADD COMMENTlink modified 6 days ago by James W. MacDonald51k • written 6 days ago by new_user60
Answer: Using genomic ranges to get counts of non-overlapping (non-redundant) regions
1
gravatar for James W. MacDonald
6 days ago by
United States
James W. MacDonald51k wrote:

It's not deprecated:

> soGGi:::runConsensusRegions
function (testRanges, method = "majority", overlap = "any") 
{
    if (class(testRanges) == "GRangesList" & length(testRanges) > 
        1) {
        reduced <- reduce(unlist(testRanges))
        consensusIDs <- paste0("consensus_", seq(1, length(reduced)))
        mcols(reduced) <- do.call(cbind, lapply(testRanges, function(x) (reduced %over% 
            x) + 0))
        if (method == "majority") {
            reducedConsensus <- reduced[rowSums(as.data.frame(mcols(reduced))) > 
                length(testRanges)/2, ]
        }
        if (method == "none") {
            reducedConsensus <- reduced
        }
        if (is.numeric(method)) {
            reducedConsensus <- reduced[rowSums(as.data.frame(mcols(reduced))) > 
                method, ]
        }
        consensusIDs <- paste0("consensus_", seq(1, length(reducedConsensus)))
        mcols(reducedConsensus) <- cbind(as.data.frame(mcols(reducedConsensus)), 
            consensusIDs)
        return(reducedConsensus)
    }
}
<bytecode: 0x00000000046b0cf8>
<environment: namespace:soGGi>

But it isn't exported either, so why use an unexported function when the exported findconsensusRegions function uses runConsensusRegions to do its work, and appears to do what you want?

ADD COMMENTlink written 6 days ago by James W. MacDonald51k

Mainly because my input is a list of peaks, am not sure that findconsensusRegions uses a list of peaks, and did not find any examples for the input in the documentation

ADD REPLYlink written 6 days ago by new_user60

So, findconsensusRegions requires a named character vector as input whereas I have a list of Granges objects. Should I just convert them into named character vectors and store them as a list? Tried looking for more documentation but could not find any examples in the vignette

ADD REPLYlink written 6 days ago by new_user60
1

It looks like findconsensusPeaks does do extra things that maybe you don't want. But as you have already noted, the unexported runConsensusPeaks function may do what you want, so maybe that's all you need.

It's generally a bad idea to use unexported functions simply because as such they are not expected to remain the same, nor do the same thing, nor even exist in future iterations of the package. The code for runConsensusPeaks is pretty straightforward, so you might consider writing your own function that does exactly what you want, based on that function, in which case you would then have control over what it does and can adjust as you see fit.

ADD REPLYlink written 6 days ago by James W. MacDonald51k

Thanks James. The solution was right there, and I did not notice it. Posting so that someone else might find it useful.

An excerpt from the function soGGi:::runConsensusRegions does the job.

non_overlapping_region_counts<-function(x){
reduced <- reduce(unlist(myGRangesList))
consensusIDs <- paste0("consensus_", seq(1, length(reduced)))
mcols(reduced) <- do.call(cbind, lapply(myGRangesList, function(x) (reduced %over% x) + 0))
reducedConsensus <- reduced
mcols(reducedConsensus) <- cbind(as.data.frame(mcols(reducedConsensus)), consensusIDs)
consensusIDs <- paste0("consensus_", seq(1, length(reducedConsensus)))
return(reducedConsensus)
}

Where the input is of the class "CompressedGRangesList".

Reference to the package http://bioconductor.org/packages/release/bioc/html/soGGi.html

ADD REPLYlink modified 6 days ago • written 6 days ago by new_user60
1

Note that your function won't work, except in the case that you have an object myGRangesList that exists in your .GlobalEnv.

> grlst <- GRangesList(GRanges(c("chr1","chr2"), IRanges(c(1,5), c(6,9))), GRanges(c("chr1","chr2"), IRanges(c(45,34), c(55, 69))))
> grlst
GRangesList object of length 2:
[[1]] 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-6      *
  [2]     chr2       5-9      *

[[2]] 
GRanges object with 2 ranges and 0 metadata columns:
      seqnames ranges strand
  [1]     chr1  45-55      *
  [2]     chr2  34-69      *

-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
> non_overlapping_region_counts(grlst)
Error in unlist(myGRangesList) : object 'myGRangesList' not found
> myGRangesList <- grlst
> non_overlapping_region_counts(grlst)
GRanges object with 4 ranges and 4 metadata columns:
      seqnames    ranges strand |        V1        V2 consensusIDs consensusIDs
         <Rle> <IRanges>  <Rle> | <numeric> <numeric>     <factor>     <factor>
  [1]     chr1       1-6      * |         1         0  consensus_1  consensus_1
  [2]     chr1     45-55      * |         0         1  consensus_2  consensus_2
  [3]     chr2       5-9      * |         1         0  consensus_3  consensus_3
  [4]     chr2     34-69      * |         0         1  consensus_4  consensus_4
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
ADD REPLYlink written 6 days ago by James W. MacDonald51k
1

Put another way, the argument to your function isn't used at all.

> non_overlapping_region_counts()
GRanges object with 4 ranges and 4 metadata columns:
      seqnames    ranges strand |        V1        V2 consensusIDs consensusIDs
         <Rle> <IRanges>  <Rle> | <numeric> <numeric>     <factor>     <factor>
  [1]     chr1       1-6      * |         1         0  consensus_1  consensus_1
  [2]     chr1     45-55      * |         0         1  consensus_2  consensus_2
  [3]     chr2       5-9      * |         1         0  consensus_3  consensus_3
  [4]     chr2     34-69      * |         0         1  consensus_4  consensus_4
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
ADD REPLYlink written 6 days ago by James W. MacDonald51k

Sorry about that, I already had a Grangeslist object in my environment, should have mentioned that.

ADD REPLYlink written 6 days ago by new_user60

My point is that you have to have a GRangesList called 'myGrangesList' in your .GlobalEnv for your function to work, and that the argument (x) to your function is superfluous. This is not ideal! You normally want your arguments to actually be, you know, arguments, and not to require some exacting set up for the function to work.

ADD REPLYlink written 6 days ago by James W. MacDonald51k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 234 users visited in the last hour