The support.bioconductor.org editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: behaviour of findOverlaps on GRangesList object
0
gravatar for lovro0
20 months ago by
lovro010
Slovenia/Ljubljana/Clinical institute of medical genetics
lovro010 wrote:

hi,

problem: findOverlaps() used on GRangesList and GRanges object returns a hits object which is too short (10522 instead of 10596 )

    > class(RGRreal)
    [1] "GRangesList"
    attr(,"package")
    [1] "GenomicRanges"
    > class(Mranges)
    [1] "GRanges"
    attr(,"package")
    [1] "GenomicRanges"

Over <- findOverlaps(RGRreal,Mranges)

    > length(Over)
    [1] 10522
    > sum(laply(RGRreal,length))  #laply from plyr package
    [1] 10596

If I use:

RGRcat <- unlist(RGRreal, recursive = TRUE, use.names = TRUE)
Over2 <- findOverlaps(RGRcat,Mranges)

    > length(RGRcat)
    [1] 10596
    > length(Over2)
    [1] 10596

the resulting hits object is the same length as the input, however, i need the grouping by probands inherent to RGRreal which is a list.

I have tried spliting Mranges into a list...

MrangesLs <- split(Mranges,seq(1,length(Mranges)))
Over3 <- findOverlaps(RGRreal,MrangesLs)
    > length(Over3)
    [1] 10522
...but the hits object is still too short.

The goal here is to get a data.frame with genomic loci in columns and proband names in rows, with data values obtained from mcols of RGRreal. This is how i tried to do it:

Result <- matrix(data = NA,nrow = length(RGRreal), ncol = length(Mranges))
Result <- as.data.frame(Result)
names(Result) <- paste0(MERGED$V1,":",MERGED$V2,"-",MERGED$V3)
row.names(Result) <- names(RGRreal)

WhichElement <- mcols(RGRcat)$Mobile.Element

for(i in 1:length(Over)){Result[[from(Over)[i],to(Over)[i]]] <- as.character(WhichElement[[i]])}

The values in the Result are wrong as objects Over and WhichElement are not alligned:
    > length(WhichElement)
    [1] 10596
    > length(Over)
    [1] 10522

I would really appreciate any help that would help me understand why the hits object resulting from findOverlaps is not the same length as the number of genomic ranges in RGRreal.
Alternatively, another method for obtaining the end result would be appreciated as well (i suspect there is a better way to utilise the hits object)

Thank you,
Lovro

ADD COMMENTlink written 20 months ago by lovro010

Why do you think it would give you back the same number of hits as the number of input ranges? One input range might overlap 0, 1, or more ranges. Also, the hits would be in terms of the GRangesList elements (if any range overlaps), not the individual ranges within them.

ADD REPLYlink written 20 months ago by Michael Lawrence10k

I'm sorry, I forgot to mention this. The Mranges object was made as an union of all intervals in RGRreal object. It was produced with "bedtools merge". Therefore each input range should intersect at least the the Mranges interval it contributed to in bedtools merge.

I did this because i wanted to cluster overlapping loci.

edit: is there a way to generate overlap hits object and assign NA to inputs not overlapping any of the features?

ADD REPLYlink modified 20 months ago • written 20 months ago by lovro010

You can use findOverlaps(..., select="first") to get an integer vector back with the index of the first hit in subject, NA otherwise. But depending on how the GRangesList query groups the underlying ranges, there may be some with multiple hits.

ADD REPLYlink written 20 months ago by Michael Lawrence10k

I have tried this and every element of the list produces a hit:

Over12 <- findOverlaps(RGRreal,Mranges, select = "first")

> length(Over12)

[1] 1441

This was expected since every element in RGRreal contains at least one range and all of them were used to create Mranges. Furthermore, unlisted RGRreal (RGRcat) produces a hit for every element (Over2 length is 10596). 

I guess the main question is now where do this ranges get lost when contained in RGRreal list object. I thought that the behaviour of findOverlaps for lists is that every range in the list is independently tested for overlap with the features ?

For the end result i need a list of all the features each element of the RGRreal list overlaps.

ADD REPLYlink written 20 months ago by lovro010

This is the "Over" object:

> Over
Hits object with 10522 hits and 0 metadata columns:
          queryHits subjectHits
          <integer>   <integer>
      [1]         1          17
      [2]         1          42
      [3]         1         687
      [4]         1         879
      [5]         1         880
      ...       ...         ...
  [10518]      1441         151
  [10519]      1441         221
  [10520]      1441         524
  [10521]      1441         631
  [10522]      1441         743
  -------
  queryLength: 1441 / subjectLength: 1225

Almost every range in RGRreal ovelaps a single range in Mranges object but every one of them should. Especially since unlisted RGRreal (RGRcat) does produce a hit for every range. I thought that in this way I can retain the information about the proband from which it originates (original results are 1 data sheet for each proband which were read into a list).

It is possible that I am misunderstanding something here... If so, please enlighten me.

Thank you

ADD REPLYlink written 20 months ago by lovro010

If there are two overlapping ranges within a single element of RGReal, then both of them will overlap one range in MRanges, but there will be only one hit in the output, since they are part of the same element in RGReal.
 

ADD REPLYlink written 20 months ago by Michael Lawrence10k

Oh. Thank you! I wasn't aware that any "within the list element" overlapping check is done. Is there any way I can avoid this?

I am curious how is this useful? A random input is ignored despite having an overlap with a feature just because it also has an overlap with another input that happens to be in the same element of the list? 

ADD REPLYlink written 20 months ago by lovro010

There is no such "overlap check". It's just a natural consequence of the aggregation. The algorithm first finds the overlaps between the unlisted form of the query and the subject. It then aggregates the hits so that they correspond to elements of the GRangesList, not the underlying GRanges. A GRangesList element is considered overlapping a subject range iff any internal range overlaps the subject range. So having multiple internal ranges that overlap the same subject will be aggregated to a single hit. This is useful when dealing with gapped alignments and exons grouped by transcript, among other things.

Why are your ranges in a GRangesList?

ADD REPLYlink written 20 months ago by Michael Lawrence10k

Thank you for all your prompt replies!
Unfortunately i got hit by the following:

"Sorry! Your posting limit of (5) posts per six hours has been reached."

Anyhow:

I have around 1500 probands and a result for each of them is generated by an external algorithm. These results are then parsed into tables and read into a list so further steps can be done with lapply. Tables are made into GRanges with lapplying makeGRangesFromDataFrame. I thought it would be sensible to then coerce these conventional list of GRanges to GRangesList with RGRreal <- GRangesList(RGR) to conform to a standard and then use findOverlaps on it. Unfortunately the behaviour of the function was not as I expected.

If I use findOverlaps on each individual GRanges within a RGR list with lapply, will I encounter the same problem?

Thank you,
Lovro

ADD REPLYlink written 20 months ago by lovro010

You could do that. Or just unlist them. Depends on what you want to do downstream.

ADD REPLYlink written 20 months ago by Michael Lawrence10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 163 users visited in the last hour