I would like to better understand the meaning of the indices that findOverlaps returns for both GRanges and GRangesList objects. I will post here what I think is going on in this function; please correct me if I'm wrong.
Let's assume two GRanges objects:
> g
GRanges object with 4 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   101-113      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  c     chr2   103-113      + |         3  0.777778
  d     chr2   150-200      * |         4  0.666667
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
and
> g2
GRanges object with 2 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   111-120      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
If I use findOverlaps like this:
> findOverlaps(g,g2)
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         2           2
  [3]         3           2
  -------
  queryLength: 4 / subjectLength: 2
Here the query is g and the subject is g2, and the numbers in the queryHits column of this Hits object (1 2 3) are indices into g. Is that correct? So the indices 1 2 3 in this case refer to the first 3 rows of g, which are:
> g
GRanges object with 4 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   101-113      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  c     chr2   103-113      + |         3  0.777778
The subjectHits are the indices of the ranges in g2 that each query range overlaps, is that correct? So the first range of g hits the first range of g2, the second hits the second, and the third hits the second again.
If those statements are correct, I think I understand the indices in both the queryHits and subjectHits columns.
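The reading above can be checked directly in R. The following is a minimal sketch, assuming the GenomicRanges package is installed. The objects are rebuilt from the printed output (the post does not show how they were created, and the GC column and the third seqlevel are left out), so it illustrates the indexing rather than reproducing the original session.

```r
library(GenomicRanges)

# Rebuild g and g2 from the printed output (assumed construction).
g <- GRanges(
  seqnames = c("chr1", "chr2", "chr2", "chr2"),
  ranges   = IRanges(start = c(101, 102, 103, 150),
                     end   = c(113, 112, 113, 200)),
  strand   = c("-", "+", "+", "*"),
  score    = 1:4
)
names(g) <- c("a", "b", "c", "d")

g2 <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges   = IRanges(start = c(111, 102), end = c(120, 112)),
  strand   = c("-", "+"),
  score    = 1:2
)
names(g2) <- c("a", "b")

hits <- findOverlaps(g, g2)

queryHits(hits)    # positions in g (the query):    1 2 3
subjectHits(hits)  # positions in g2 (the subject): 1 2 2

# Use the indices to pull out the overlapping ranges pairwise.
g[queryHits(hits)]
g2[subjectHits(hits)]
```

Each row of the Hits object is one overlapping pair, so subsetting g and g2 with the two index vectors gives the ranges side by side in the same order.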
Now things start to get confusing when using a GRanges object and a GRangesList, for example this GRangesList:
> grl
GRangesList object of length 2:
$txA
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr2   103-106      + |         5      0.45
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
$txB
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr1   107-109      + |         3       0.3
  [2]     chr1   113-115      - |         4       0.5
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
If I use findOverlaps like this:
> findOverlaps(g,grl)
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           2
  [2]         2           1
  [3]         3           1
  -------
  queryLength: 4 / subjectLength: 2
I think the queryHits follow the same logic as before, but subjectHits represents not the row indices of the individual ranges inside the list but rather the indices of the list elements, is that correct? So in this case 2 1 1 means that the first query range hits a genomic region somewhere in element 2 of this GRangesList object, which is $txB in this case.
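A short sketch of the list-element interpretation, again assuming GenomicRanges is installed and rebuilding g and grl from the printed output (the metadata columns are omitted). The comparison against unlist(grl) is added here only for contrast and is not part of the original question.

```r
library(GenomicRanges)

# Rebuild g and grl from the printed output (assumed construction).
g <- GRanges(
  seqnames = c("chr1", "chr2", "chr2", "chr2"),
  ranges   = IRanges(start = c(101, 102, 103, 150),
                     end   = c(113, 112, 113, 200)),
  strand   = c("-", "+", "+", "*")
)
names(g) <- c("a", "b", "c", "d")

grl <- GRangesList(
  txA = GRanges("chr2", IRanges(103, 106), strand = "+"),
  txB = GRanges(c("chr1", "chr1"),
                IRanges(start = c(107, 113), end = c(109, 115)),
                strand = c("+", "-"))
)

# Against a GRangesList, subjectHits indexes the list elements.
hits <- findOverlaps(g, grl)
subjectHits(hits)              # 2 1 1
names(grl)[subjectHits(hits)]  # "txB" "txA" "txA"

# For contrast: flattening the list first gives indices of the
# individual ranges in the unlisted GRanges instead.
flat_hits <- findOverlaps(g, unlist(grl))
subjectHits(flat_hits)
```

With the list as subject, a query range counts as a hit for an element if it overlaps any range in that element, which is why the first query range (chr1, minus strand) maps to txB through its second range only.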
I'm thinking about building a matrix of hits and missing hits using the GenomicRanges package, but I'm not sure whether this function is appropriate for my use case.
I would like to get something like this from findOverlaps(g,grl):
      queryHits       txa       txb
      <integer> <logical> <logical>
  [1]         1         0         1
  [2]         2         1         0
  [3]         3         1         0
  [4]         4         0         0
So in this hypothetical case we have a logical vector for each of txa and txb, where queryHits corresponds to the indices of g, and an index should be included even if it has no hits. For example, index 4 doesn't hit any region in txa or txb, so we have 0 0. The first index hits txb but not txa.
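One possible way to get such a table is to fill a logical matrix from the Hits object. This is only a sketch under the same assumptions as before (GenomicRanges installed, g and grl rebuilt from the printed output); hit_matrix is a name made up for this example, and the result is a plain matrix with one row per range in g rather than a DataFrame.

```r
library(GenomicRanges)

# Rebuild g and grl from the printed output (assumed construction).
g <- GRanges(
  seqnames = c("chr1", "chr2", "chr2", "chr2"),
  ranges   = IRanges(start = c(101, 102, 103, 150),
                     end   = c(113, 112, 113, 200)),
  strand   = c("-", "+", "+", "*")
)
names(g) <- c("a", "b", "c", "d")

grl <- GRangesList(
  txA = GRanges("chr2", IRanges(103, 106), strand = "+"),
  txB = GRanges(c("chr1", "chr1"),
                IRanges(start = c(107, 113), end = c(109, 115)),
                strand = c("+", "-"))
)

hits <- findOverlaps(g, grl)

# Start with "no hit" everywhere: one row per query range,
# one column per list element.
hit_matrix <- matrix(
  FALSE,
  nrow = length(g), ncol = length(grl),
  dimnames = list(names(g), names(grl))
)

# Each (queryHits, subjectHits) pair is a (row, column) position to mark.
hit_matrix[cbind(queryHits(hits), subjectHits(hits))] <- TRUE

hit_matrix
```

Rows without any hit, such as the fourth range here, stay FALSE in every column, which matches the 0 0 row in the desired output. The dense matrix is fine for a handful of list elements but grows as length(g) x length(grl).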
Any help is much appreciated!!! thanks
So if you have a GRangesList that is 100,000 items long, you want to return a DataFrame that has dimension length(gr) x length(grl)? That will be a very sparse DataFrame, and seems pretty inefficient. Perhaps it would be better if you just said what your use case is, and then maybe people can provide input.
Hi James,
I would deeply appreciate it if people could provide input on my understanding of what findOverlaps does when using two GRanges objects and when using a GRanges and a GRangesList, that is, what those indices really represent in both scenarios. My GRangesList will be at most 5 items long for my use case, so I'm not expecting to produce a large sparse matrix from such a computation.
To see if your understanding is correct you can just read the help page, which says
Which I believe already answers your question?