Search
Question: (Big)wig file of complete chromosome
0
16 months ago by
Assa Yeroslaviz1.4k
Munich, Germany
Assa Yeroslaviz1.4k wrote:

Hi,

I have a list of bigwig files from different data sets. We would like to create a table of all the genomic positions per chromosome to be able to compare the samples with each other.

As the peaks found in each samples are at different chromosomal positions, I would like to complete each chromosomes and add all the positions which have no reads with a 0. As an example

this is my file:

variableStep chrom=chrV
86      1.00
87      1.00
88      1.00
89      1.00
90      1.00
....
variableStep chrom=chrII
1       2.00
2       2.00
3       2.00
4       3.00
5       4.00

In this sample I would like for Chr V to add positions 1-85 with a 0.

And than split the bigwig file into separate chromosomes and creates a data.frame for each ofthe chromosomes.

I know it is possible to read bigwig files into R using rtracklayer and also to subset them per chromosome. What I don't know is if it is possible to add the empty positions into the GRanges object created from the bigwig file.

thanks

Assa

P.S.

I was thinking about it a bit longer now and was wondering if it is possible to create a GRangeList object from all samples and than modify it within this object so that I can modify all samples at once and than reduce the list into a data.frame for the separate chromosomes

thanks again

modified 16 months ago by Michael Lawrence10k • written 16 months ago by Assa Yeroslaviz1.4k
1
16 months ago by
United States
Michael Lawrence10k wrote:

Compute the coverage using the coverage from the file as the weight, and then coerce that back into a GRanges.

Like:

gr_filled <- GRanges(coverage(gr, weight="score"))

Thanks Michael for the neat solution. it is still not what i need though. The coverage function summarize the similar values into one row like that:

GRanges object with 5793227 ranges and 1 metadata column:
seqnames         ranges strand |     score
<Rle>      <IRanges>  <Rle> | <numeric>
[1]     chrI       [ 1, 11]      * |         1
[2]     chrI       [12, 12]      * |         2
[3]     chrI       [13, 13]      * |         5
[4]     chrI       [14, 14]      * |         7
[5]     chrI       [15, 15]      * |         9
...      ...            ...    ... .       ...
[5793223]    chrmt [85674, 85675]      * |         7
[5793224]    chrmt [85676, 85677]      * |         6
[5793225]    chrmt [85678, 85679]      * |         5
[5793226]    chrmt [85680, 85682]      * |         2
[5793227]    chrmt [85683, 85779]      * |         0

Is there a way to expand the GRanges object to contain all the positions separately. I need something like this one here (done manually):

GRanges object with 5793227 ranges and 1 metadata column:
seqnames         ranges strand |     score
<Rle>      <IRanges>  <Rle> | <numeric>
[1]     chrI       [ 1, 1]      * |         1
[2]     chrI       [ 2, 2]      * |         1
[3]     chrI       [ 3, 3]      * |         1
[4]     chrI       [ 4, 4]      * |         1
[5]     chrI       [ 5, 5]      * |         1
...      ...            ...    ... .       ...
[1]     chrI       [10, 10]      * |         1
[2]     chrI       [11, 11]      * |         1
[3]     chrI       [12, 12]      * |         2
[4]     chrI       [13, 13]      * |         5
[5]     chrI       [14, 14]      * |         5
...      ...            ...    ... .       ...

1

You want to use a GPos object. I thought this would be simple, but it turns out to require a couple lines:

gpos <- GPos(gr)
gpos$score <- rep(gr$score, width(gr))

thanks for the suggestion, but it is still not exactly what i need. The code above doesn't take the ends of each chromosome into account. This leads to different number of entries in the two GRanges objects. e.g.

> InputIP.control[seqnames(InputIP.control) == "I"]
GRanges object with 92413 ranges and 1 metadata column:
seqnames           ranges strand |            score
<Rle>        <IRanges>  <Rle> |        <numeric>
[1]        I     [   1, 6303]      * | 17.5913696289062
[2]        I     [6304, 6304]      * | 17.6069602966309
[3]        I     [6305, 6351]      * | 17.5913696289062
[4]        I     [6352, 6352]      * | 17.6825294494629
[5]        I     [6353, 6353]      * | 17.7580890655518
...      ...              ...    ... .              ...
[92409]        I [220379, 222823]      * | 17.5913696289062
[92410]        I [222824, 222824]      * | 17.6069602966309
[92411]        I [222825, 222825]      * | 17.5913696289062
[92412]        I [222826, 222826]      * | 17.6069602966309
[92413]        I [222827, 230210]      * | 17.5913696289062

> InputIgG.control[seqnames(InputIgG.control) == "I"]
GRanges object with 59889 ranges and 1 metadata column:
seqnames           ranges strand |            score
<Rle>        <IRanges>  <Rle> |        <numeric>
[1]        I       [  1, 450]      * |  18.625129699707
[2]        I       [451, 451]      * |  18.647180557251
[3]        I       [452, 452]      * |  18.625129699707
[4]        I       [453, 453]      * | 18.8590793609619
[5]        I       [454, 454]      * |  19.176929473877
...      ...              ...    ... .              ...
[59885]        I [223907, 223907]      * |  18.647180557251
[59886]        I [223908, 223908]      * | 18.7531299591064
[59887]        I [223909, 223910]      * |  18.647180557251
[59888]        I [223911, 223912]      * | 18.7531299591064
[59889]        I [223913, 230207]      * |  18.625129699707


or when changed into GPos objects

> gpos.InputIgG.control[seqnames(gpos.InputIgG.control) == "I"]
GPos object with 230207 positions and 1 metadata column:
seqnames       pos strand |           score
<Rle> <integer>  <Rle> |       <numeric>
[1]        I         1      * | 18.625129699707
[2]        I         2      * | 18.625129699707
[3]        I         3      * | 18.625129699707
[4]        I         4      * | 18.625129699707
[5]        I         5      * | 18.625129699707
...      ...       ...    ... .             ...
[230203]        I    230203      * | 18.625129699707
[230204]        I    230204      * | 18.625129699707
[230205]        I    230205      * | 18.625129699707
[230206]        I    230206      * | 18.625129699707
[230207]        I    230207      * | 18.625129699707

> gpos.InputIP.control[seqnames(gpos.InputIP.control) == "I"]
GPos object with 230210 positions and 1 metadata column:
seqnames       pos strand |            score
<Rle> <integer>  <Rle> |        <numeric>
[1]        I         1      * | 17.5913696289062
[2]        I         2      * | 17.5913696289062
[3]        I         3      * | 17.5913696289062
[4]        I         4      * | 17.5913696289062
[5]        I         5      * | 17.5913696289062
...      ...       ...    ... .              ...
[230206]        I    230206      * | 17.5913696289062
[230207]        I    230207      * | 17.5913696289062
[230208]        I    230208      * | 17.5913696289062
[230209]        I    230209      * | 17.5913696289062
[230210]        I    230210      * | 17.5913696289062


Even though the seqlengths() of the two Granges objects is the same for this chromosome, both of them contain less elements as the sequence length

> seqlengths(gpos.InputIP.control)
I
230218 ...
> seqlengths(gpos.InputIgG.control)
I
230218 ...

Is there a method to expand (extend) every chromosome in each of the GRanges objects based on the seqlengths() of this chromosome? ( What I would like to have for this example are two GPos objects for chromosome I with 230218 positions instead of 230207 and 230210 positions).

I need it so that I can compare the two bigwig files with each other.

1

Hi Assa,

Make sure you start with a GRanges object that has the seqlengths information on it:

library(GenomicRanges)
gr <- GRanges(c("chr1", "chr2", "chr2"),
IRanges(c(4, 3, 21), c(16, 10, 38)),
score=1:3/10,
seqlengths=c(chr1=20, chr2=40))

gr
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <numeric>
#   [1]     chr1  [ 4, 16]      * |       0.1
#   [2]     chr2  [ 3, 10]      * |       0.2
#   [3]     chr2  [21, 38]      * |       0.3
#   -------
#   seqinfo: 2 sequences from an unspecified genome

seqlengths(gr)
# chr1 chr2
#   20   40 

Insert the "no score regions":

gr_filled <- GRanges(coverage(gr, weight="score"))
gr_filled
# GRanges object with 8 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <numeric>
#   [1]     chr1  [ 1,  3]      * |         0
#   [2]     chr1  [ 4, 16]      * |       0.1
#   [3]     chr1  [17, 20]      * |         0
#   [4]     chr2  [ 1,  2]      * |         0
#   [5]     chr2  [ 3, 10]      * |       0.2
#   [6]     chr2  [11, 20]      * |         0
#   [7]     chr2  [21, 38]      * |       0.3
#   [8]     chr2  [39, 40]      * |         0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Turn into a GPos object:

gpos <- GPos(gr_filled)
## Edit: it is MUCH preferred to use an Rle when propagating the
## score to the GPos object:
#mcols(gpos)$score <- rep(mcols(gr_filled)$score, width(gr_filled))
mcols(gpos)$score <- Rle(mcols(gr_filled)$score, width(gr_filled))
gpos
# GPos object with 60 positions and 1 metadata column:
#        seqnames       pos strand | score
#           <Rle> <integer>  <Rle> | <Rle>
#    [1]     chr1         1      * |     0
#    [2]     chr1         2      * |     0
#    [3]     chr1         3      * |     0
#    [4]     chr1         4      * |   0.1
#    [5]     chr1         5      * |   0.1
#    ...      ...       ...    ... .   ...
#   [56]     chr2        36      * |   0.3
#   [57]     chr2        37      * |   0.3
#   [58]     chr2        38      * |   0.3
#   [59]     chr2        39      * |     0
#   [60]     chr2        40      * |     0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Then:

gpos[seqnames(gpos) == "chr1"]
# GPos object with 20 positions and 1 metadata column:
#        seqnames       pos strand | score
#           <Rle> <integer>  <Rle> | <Rle>
#    [1]     chr1         1      * |     0
#    [2]     chr1         2      * |     0
#    [3]     chr1         3      * |     0
#    [4]     chr1         4      * |   0.1
#    [5]     chr1         5      * |   0.1
#    ...      ...       ...    ... .   ...
#   [16]     chr1        16      * |   0.1
#   [17]     chr1        17      * |     0
#   [18]     chr1        18      * |     0
#   [19]     chr1        19      * |     0
#   [20]     chr1        20      * |     0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

gpos[seqnames(gpos) == "chr2"]
# GPos object with 40 positions and 1 metadata column:
#        seqnames       pos strand | score
#           <Rle> <integer>  <Rle> | <Rle>
#    [1]     chr2         1      * |     0
#    [2]     chr2         2      * |     0
#    [3]     chr2         3      * |   0.2
#    [4]     chr2         4      * |   0.2
#    [5]     chr2         5      * |   0.2
#    ...      ...       ...    ... .   ...
#   [36]     chr2        36      * |   0.3
#   [37]     chr2        37      * |   0.3
#   [38]     chr2        38      * |   0.3
#   [39]     chr2        39      * |     0
#   [40]     chr2        40      * |     0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Hope this helps.

H.

I made a small edit to my answer above to use an Rle when propagating the score to the GPos object.

Thanks, this looks very nice. I will try it. I have a question though. I am familiar with the seqlengths when creating a GRanges object. But here I am reading a bedgraph file into R. How do I give the import.bw or import.bedgraph the seqlength() parameter?

I don't know about getting the seqlengths info thru import.bw() or import.bedgraph(), Michael might be able to help with this. However please note that you can always use the seqlengths() setter to add the seqlengths "manually" to the GRanges object returned by import.bw() or import.bedgraph(). For example, since you seem to be working with yeast:

library(GenomeInfoDb)
si <- Seqinfo(genome="sacCer3")
seqlengths(si)
#   chrI   chrII  chrIII   chrIV ...   chrXV  chrXVI    chrM
# 230218  813184  316620 1531933 ... 1091291  948066   85779
seqlengths(gr) <- seqlengths(si)

H.

Thanks, this I know. I thought there might be a better way.

It should come automatically from a BigWig file (it's embedded within the file). For bedGraph, unless the file has the genome indicated in a UCSC-style "track line", you will need to tell it the genome via the genome= argument (for example, "hg38").  It will look up the seqlengths automatically from the genome.

But what if I'm using t he Ensembl annotations? so my chromosome names differs from the standard UCSC naming?

The approach described by Herve is not so bad, in general. You can pass the Seqinfo object to import, e.g.,

gr <- import(file, seqinfo=Seqinfo(genome="sacCer3"))

after trying this example and not understanding, why it doesn't do what I want, I figured it out. The Rle() options looks better, but if I use it, I can't export the data as bedgraph (or bigwig). I get an error message claiming

Scores must be non-NA numeric values

Is there a way to coerce the Rle score column into a numeric values vector?

I fixed this in devel. Work around is just not to use Rles, or coerce them to numeric vectors on export. The export routines are not that optimized for GPos anyway.

yes thanks, I have figured it out by now. I am using now the rep() command instead.