default behaviour of import.bw
1
0
Entering edit mode
Yixin • 0
@33a90a56
Last seen 10 months ago
United States

Hi,

When I'm using the import.bw from rtracklayer, if two adjacent positions have the same scores, they will be put into a single GRanges, i.e., position 1 and 2 in this case

> gr1
GRanges object with 2 ranges and 1 metadata column:
seqnames    ranges strand |     score
<Rle> <IRanges>  <Rle> | <numeric>
[1]     chr2       1-2      * |         1
[2]     chr2         4      * |         1
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths


But the GRanges I wish to get back is like this one,

> gr0
GRanges object with 3 ranges and 1 metadata column:
seqnames    ranges strand |     score
<Rle> <IRanges>  <Rle> | <numeric>
[1]     chr2         1      * |         1
[2]     chr2         2      * |         1
[3]     chr2         4      * |         1
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths


The reason is I want to sum the scores across all sites, base by base, while

> sum(gr0$score) [1] 3 > sum(gr1$score)
[1] 2


I tried to find parameters in the function to control this behaviour but get no luck. Maybe I miss something.

I do know a function makeGRangesBRG from BRGenomics can help me with that but I'm wondering if there is a more native and direct way?

Thanks,

Yixin

import.bw rtracklayer • 438 views
2
Entering edit mode
@james-w-macdonald-5106
Last seen 2 hours ago
United States

Can you clarify? I don't see that behavior. Here is a reproducible example (which ideally you would have supplied).

## Get a Wig file to test

> test_path <- system.file("tests", package = "rtracklayer")
>  test_wig <- file.path(test_path, "step.wig")
> z <- import(test_wig)
> z
UCSC track 'test'
UCSCData object with 19 ranges and 1 metadata column:
seqnames            ranges strand |     score
<Rle>         <IRanges>  <Rle> | <numeric>
[1]    chr19          59104701      * |      10.0
[2]    chr19          59104901      * |      12.5
[3]    chr19          59105401      * |      15.0
[4]    chr19          59105601      * |      17.5
[5]    chr19          59105901      * |      20.0
...      ...               ...    ... .       ...
[15]    chr18 59109521-59109720      * |       500
[16]    chr18 59109821-59110020      * |       400
[17]    chr18 59110121-59110320      * |       300
[18]    chr18 59110421-59110620      * |       200
[19]    chr18 59110721-59110920      * |       100
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths

> zz <- c(GRanges("chr19", IRanges(59104902, width = 1), strand = "*", score = 10.0), z)

## And sort

> zz <- sort(zz)
> zz
GRanges object with 20 ranges and 1 metadata column:
seqnames            ranges strand |     score
<Rle>         <IRanges>  <Rle> | <numeric>
[1]    chr19          59104701      * |      10.0     <------------- This and the next are the same score and adjacent
[2]    chr19          59104702      * |      10.0
[3]    chr19          59104901      * |      12.5
[4]    chr19          59105401      * |      15.0
[5]    chr19          59105601      * |      17.5
...      ...               ...    ... .       ...
[16]    chr18 59109521-59109720      * |       500
[17]    chr18 59109821-59110020      * |       400
[18]    chr18 59110121-59110320      * |       300
[19]    chr18 59110421-59110620      * |       200
[20]    chr18 59110721-59110920      * |       100
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths

## export and convert to BigWig

> export.wig(zz, "test.wig")

## need seqinfo

> si <- seqinfo(GRangesForUCSCGenome("hg19", paste0("chr", 18:19)))
> wigToBigWig("test.wig", si, "test.bw")

## re-import and check

> bw <- import("test.bw")
> as(bw, "data.frame")
seqnames    start      end width strand  score
1     chr18 59108021 59108220   200      * 1000.0
2     chr18 59108321 59108520   200      *  900.0
3     chr18 59108621 59108820   200      *  800.0
4     chr18 59108921 59109120   200      *  700.0
5     chr18 59109221 59109420   200      *  600.0
6     chr18 59109521 59109720   200      *  500.0
7     chr18 59109821 59110020   200      *  400.0
8     chr18 59110121 59110320   200      *  300.0
9     chr18 59110421 59110620   200      *  200.0
10    chr18 59110721 59110920   200      *  100.0
11    chr19 59104701 59104701     1      *   10.0  <----------------- Still adjacent and same score
12    chr19 59104702 59104702     1      *   10.0
13    chr19 59104901 59104901     1      *   12.5
14    chr19 59105401 59105401     1      *   15.0
15    chr19 59105601 59105601     1      *   17.5
16    chr19 59105901 59105901     1      *   20.0
17    chr19 59106081 59106081     1      *   17.5
18    chr19 59106301 59106301     1      *   15.0
19    chr19 59106691 59106691     1      *   12.5
20    chr19 59107871 59107871     1      *   10.0
>

0
Entering edit mode

Thanks for your reply and sorry for the unclarity, James! I'm experimenting it a couple of times more and now realize actually it depends on how the BigWig files are written. It's good to know the behaviour of rtracklayer actually works as expected. See the following reproducible case,

> gr0 <- GRanges(Rle("chr2", 3),
+                IRanges(c(1, 2, 4), width=rep(1, 3)),
+                score = Rle(1, 3))
> seqlengths(gr0) <- 1e4
> gr0
GRanges object with 3 ranges and 1 metadata column:
seqnames    ranges strand | score
<Rle> <IRanges>  <Rle> | <Rle>
[1]     chr2         1      * |     1
[2]     chr2         2      * |     1
[3]     chr2         4      * |     1
-------
seqinfo: 1 sequence from an unspecified genome
>
> gr1 <- GRanges(Rle("chr2", 2),
+                IRanges(c(1, 4), width=c(2, 1)),
+                score = Rle(1, 2))
> seqlengths(gr1) <- 1e4
> gr1
GRanges object with 2 ranges and 1 metadata column:
seqnames    ranges strand | score
<Rle> <IRanges>  <Rle> | <Rle>
[1]     chr2       1-2      * |     1
[2]     chr2         4      * |     1
-------
seqinfo: 1 sequence from an unspecified genome
>
> export(gr0, "gr0.bw")
> import.bw("gr0.bw")
GRanges object with 3 ranges and 1 metadata column:
seqnames    ranges strand |     score
<Rle> <IRanges>  <Rle> | <numeric>
[1]     chr2         1      * |         1
[2]     chr2         2      * |         1
[3]     chr2         4      * |         1
-------
seqinfo: 1 sequence from an unspecified genome
>
> export(gr1, "gr1.bw")
> import.bw("gr1.bw")
GRanges object with 2 ranges and 1 metadata column:
seqnames    ranges strand |     score
<Rle> <IRanges>  <Rle> | <numeric>
[1]     chr2       1-2      * |         1
[2]     chr2         4      * |         1
-------
seqinfo: 1 sequence from an unspecified genome


Since I cannot control how the BigWig files are written, I guess my real problem is, is there any way to force rtracklayer to read gr1 as gr0, or maybe convert gr1 to gr0 using some common packages like GenomicRanges? Imagine it's genomic data so I have ~100M positions in reality.

0
Entering edit mode

I get what you want to do, but I don't understand why. A wiggle file is meant to define a genomic region with a 'score', which can then be used to plot a rectangle on a genome, where the base of the rectangle is defined by the position and width and the height is defined by the score. So if you have a wiggle file (or a bigWig of the same), and there is a region on chr2 from 1-2, with a score of 1, that means there is a genomic region that spans both those bases, the height of which is 1. If the first range was chr2:1-1e6, with a score of 1, would you want a GRanges with 1M rows, each of which having a score of 1? To what end? I mean if you plotted that, you know what it would look like? A box with height 1, from 1 - 1e6 on chr2, which is the same exact thing you would get if you didn't expand to 1M rows.

But maybe I completely misunderstand your use case, so if you can explain it?

0
Entering edit mode

Thanks James, I can see your point! The reason is the data I'm dealing with is quite sparse, most of time I have regions with width 1 and a score corresponding to a single position. However, sometimes the width could be 2, 3, or a few more bases and they may have same scores. I previously didn't notice this fact, so sum(gr0$score) (which is 3) is not equal to sum(gr1$score) (which is 2), and it eventually messes up my analysis.

Based on what you are saying, I think one way is to convert gr1 to gr0 so my sum is something I wish to get. Alternatively, now I think I can do sum(width(gr1) * gr1\$score), which is also 3 in total.

I really appreciate your questions and answers, which help me figure out which steps cause the issues and make me re-think what I actually need!