Search
Question: (Big)wig file of complete chromosome
0
gravatar for Assa Yeroslaviz
7 weeks ago by
Assa Yeroslaviz1.3k
Munich, Germany
Assa Yeroslaviz1.3k wrote:

Hi,

 

I have a list of bigwig files from different data sets. We would like to create a table of all the genomic positions per chromosome to be able to compare the samples with each other. 

As the peaks found in each samples are at different chromosomal positions, I would like to complete each chromosomes and add all the positions which have no reads with a 0. As an example

this is my file:

variableStep chrom=chrV
86      1.00
87      1.00
88      1.00
89      1.00
90      1.00
....
variableStep chrom=chrII
1       2.00
2       2.00
3       2.00
4       3.00
5       4.00

In this sample I would like for Chr V to add positions 1-85 with a 0.

And than split the bigwig file into separate chromosomes and creates a data.frame for each ofthe chromosomes. 

I know it is possible to read bigwig files into R using rtracklayer and also to subset them per chromosome. What I don't know is if it is possible to add the empty positions into the GRanges object created from the bigwig file.

thanks

Assa

 

P.S.

I was thinking about it a bit longer now and was wondering if it is possible to create a GRangeList object from all samples and than modify it within this object so that I can modify all samples at once and than reduce the list into a data.frame for the separate chromosomes

thanks again

 

ADD COMMENTlink modified 7 weeks ago by Michael Lawrence9.5k • written 7 weeks ago by Assa Yeroslaviz1.3k
1
gravatar for Michael Lawrence
7 weeks ago by
United States
Michael Lawrence9.5k wrote:

Compute the coverage using the coverage from the file as the weight, and then coerce that back into a GRanges. 

Like:

gr_filled <- GRanges(coverage(gr, weight="score"))

 

ADD COMMENTlink written 7 weeks ago by Michael Lawrence9.5k

Thanks Michael for the neat solution. it is still not what i need though. The coverage function summarize the similar values into one row like that:

GRanges object with 5793227 ranges and 1 metadata column:
            seqnames         ranges strand |     score
               <Rle>      <IRanges>  <Rle> | <numeric>
        [1]     chrI       [ 1, 11]      * |         1
        [2]     chrI       [12, 12]      * |         2
        [3]     chrI       [13, 13]      * |         5
        [4]     chrI       [14, 14]      * |         7
        [5]     chrI       [15, 15]      * |         9
        ...      ...            ...    ... .       ...
  [5793223]    chrmt [85674, 85675]      * |         7
  [5793224]    chrmt [85676, 85677]      * |         6
  [5793225]    chrmt [85678, 85679]      * |         5
  [5793226]    chrmt [85680, 85682]      * |         2
  [5793227]    chrmt [85683, 85779]      * |         0

Is there a way to expand the GRanges object to contain all the positions separately. I need something like this one here (done manually):

GRanges object with 5793227 ranges and 1 metadata column:
            seqnames         ranges strand |     score
               <Rle>      <IRanges>  <Rle> | <numeric>
        [1]     chrI       [ 1, 1]      * |         1
        [2]     chrI       [ 2, 2]      * |         1
        [3]     chrI       [ 3, 3]      * |         1
        [4]     chrI       [ 4, 4]      * |         1
        [5]     chrI       [ 5, 5]      * |         1
        ...      ...            ...    ... .       ...
        [1]     chrI       [10, 10]      * |         1
        [2]     chrI       [11, 11]      * |         1
        [3]     chrI       [12, 12]      * |         2
        [4]     chrI       [13, 13]      * |         5
        [5]     chrI       [14, 14]      * |         5
        ...      ...            ...    ... .       ...

thanks in advance

ADD REPLYlink written 6 weeks ago by Assa Yeroslaviz1.3k

You want to use a GPos object. I thought this would be simple, but it turns out to require a couple lines:

gpos <- GPos(gr)
gpos$score <- rep(gr$score, width(gr))

 

ADD REPLYlink written 6 weeks ago by Michael Lawrence9.5k

thanks for the suggestion, but it is still not exactly what i need. The code above doesn't take the ends of each chromosome into account. This leads to different number of entries in the two GRanges objects. e.g.

> InputIP.control[seqnames(InputIP.control) == "I"]
GRanges object with 92413 ranges and 1 metadata column:
          seqnames           ranges strand |            score
             <Rle>        <IRanges>  <Rle> |        <numeric>
      [1]        I     [   1, 6303]      * | 17.5913696289062
      [2]        I     [6304, 6304]      * | 17.6069602966309
      [3]        I     [6305, 6351]      * | 17.5913696289062
      [4]        I     [6352, 6352]      * | 17.6825294494629
      [5]        I     [6353, 6353]      * | 17.7580890655518
      ...      ...              ...    ... .              ...
  [92409]        I [220379, 222823]      * | 17.5913696289062
  [92410]        I [222824, 222824]      * | 17.6069602966309
  [92411]        I [222825, 222825]      * | 17.5913696289062
  [92412]        I [222826, 222826]      * | 17.6069602966309
  [92413]        I [222827, 230210]      * | 17.5913696289062

> InputIgG.control[seqnames(InputIgG.control) == "I"]
GRanges object with 59889 ranges and 1 metadata column:
          seqnames           ranges strand |            score
             <Rle>        <IRanges>  <Rle> |        <numeric>
      [1]        I       [  1, 450]      * |  18.625129699707
      [2]        I       [451, 451]      * |  18.647180557251
      [3]        I       [452, 452]      * |  18.625129699707
      [4]        I       [453, 453]      * | 18.8590793609619
      [5]        I       [454, 454]      * |  19.176929473877
      ...      ...              ...    ... .              ...
  [59885]        I [223907, 223907]      * |  18.647180557251
  [59886]        I [223908, 223908]      * | 18.7531299591064
  [59887]        I [223909, 223910]      * |  18.647180557251
  [59888]        I [223911, 223912]      * | 18.7531299591064
  [59889]        I [223913, 230207]      * |  18.625129699707

or when changed into GPos objects

> gpos.InputIgG.control[seqnames(gpos.InputIgG.control) == "I"]
GPos object with 230207 positions and 1 metadata column:
           seqnames       pos strand |           score
              <Rle> <integer>  <Rle> |       <numeric>
       [1]        I         1      * | 18.625129699707
       [2]        I         2      * | 18.625129699707
       [3]        I         3      * | 18.625129699707
       [4]        I         4      * | 18.625129699707
       [5]        I         5      * | 18.625129699707
       ...      ...       ...    ... .             ...
  [230203]        I    230203      * | 18.625129699707
  [230204]        I    230204      * | 18.625129699707
  [230205]        I    230205      * | 18.625129699707
  [230206]        I    230206      * | 18.625129699707
  [230207]        I    230207      * | 18.625129699707

> gpos.InputIP.control[seqnames(gpos.InputIP.control) == "I"]
GPos object with 230210 positions and 1 metadata column:
           seqnames       pos strand |            score
              <Rle> <integer>  <Rle> |        <numeric>
       [1]        I         1      * | 17.5913696289062
       [2]        I         2      * | 17.5913696289062
       [3]        I         3      * | 17.5913696289062
       [4]        I         4      * | 17.5913696289062
       [5]        I         5      * | 17.5913696289062
       ...      ...       ...    ... .              ...
  [230206]        I    230206      * | 17.5913696289062
  [230207]        I    230207      * | 17.5913696289062
  [230208]        I    230208      * | 17.5913696289062
  [230209]        I    230209      * | 17.5913696289062
  [230210]        I    230210      * | 17.5913696289062

Even though the seqlengths() of the two Granges objects is the same for this chromosome, both of them contain less elements as the sequence length

> seqlengths(gpos.InputIP.control)
      I   
 230218 ...
> seqlengths(gpos.InputIgG.control)
      I  
 230218 ...

Is there a method to expand (extend) every chromosome in each of the GRanges objects based on the seqlengths() of this chromosome? ( What I would like to have for this example are two GPos objects for chromosome I with 230218 positions instead of 230207 and 230210 positions).

I need it so that I can compare the two bigwig files with each other.

 

ADD REPLYlink written 4 weeks ago by Assa Yeroslaviz1.3k

Hi Assa,

Make sure you start with a GRanges object that has the seqlengths information on it:

library(GenomicRanges)
gr <- GRanges(c("chr1", "chr2", "chr2"),
              IRanges(c(4, 3, 21), c(16, 10, 38)),
              score=1:3/10,
              seqlengths=c(chr1=20, chr2=40))

gr
# GRanges object with 3 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <numeric>
#   [1]     chr1  [ 4, 16]      * |       0.1
#   [2]     chr2  [ 3, 10]      * |       0.2
#   [3]     chr2  [21, 38]      * |       0.3
#   -------
#   seqinfo: 2 sequences from an unspecified genome

seqlengths(gr)
# chr1 chr2 
#   20   40 

Insert the "no score regions":

gr_filled <- GRanges(coverage(gr, weight="score"))
gr_filled
# GRanges object with 8 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <numeric>
#   [1]     chr1  [ 1,  3]      * |         0
#   [2]     chr1  [ 4, 16]      * |       0.1
#   [3]     chr1  [17, 20]      * |         0
#   [4]     chr2  [ 1,  2]      * |         0
#   [5]     chr2  [ 3, 10]      * |       0.2
#   [6]     chr2  [11, 20]      * |         0
#   [7]     chr2  [21, 38]      * |       0.3
#   [8]     chr2  [39, 40]      * |         0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Turn into a GPos object:

gpos <- GPos(gr_filled)
## Edit: it is MUCH preferred to use an Rle when propagating the
## score to the GPos object: 
#mcols(gpos)$score <- rep(mcols(gr_filled)$score, width(gr_filled))
mcols(gpos)$score <- Rle(mcols(gr_filled)$score, width(gr_filled))
gpos
# GPos object with 60 positions and 1 metadata column:
#        seqnames       pos strand | score
#           <Rle> <integer>  <Rle> | <Rle>
#    [1]     chr1         1      * |     0
#    [2]     chr1         2      * |     0
#    [3]     chr1         3      * |     0
#    [4]     chr1         4      * |   0.1
#    [5]     chr1         5      * |   0.1
#    ...      ...       ...    ... .   ...
#   [56]     chr2        36      * |   0.3
#   [57]     chr2        37      * |   0.3
#   [58]     chr2        38      * |   0.3
#   [59]     chr2        39      * |     0
#   [60]     chr2        40      * |     0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Then:

gpos[seqnames(gpos) == "chr1"]
# GPos object with 20 positions and 1 metadata column:
#        seqnames       pos strand | score
#           <Rle> <integer>  <Rle> | <Rle>
#    [1]     chr1         1      * |     0
#    [2]     chr1         2      * |     0
#    [3]     chr1         3      * |     0
#    [4]     chr1         4      * |   0.1
#    [5]     chr1         5      * |   0.1
#    ...      ...       ...    ... .   ...
#   [16]     chr1        16      * |   0.1
#   [17]     chr1        17      * |     0
#   [18]     chr1        18      * |     0
#   [19]     chr1        19      * |     0
#   [20]     chr1        20      * |     0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

gpos[seqnames(gpos) == "chr2"]
# GPos object with 40 positions and 1 metadata column:
#        seqnames       pos strand | score
#           <Rle> <integer>  <Rle> | <Rle>
#    [1]     chr2         1      * |     0
#    [2]     chr2         2      * |     0
#    [3]     chr2         3      * |   0.2
#    [4]     chr2         4      * |   0.2
#    [5]     chr2         5      * |   0.2
#    ...      ...       ...    ... .   ...
#   [36]     chr2        36      * |   0.3
#   [37]     chr2        37      * |   0.3
#   [38]     chr2        38      * |   0.3
#   [39]     chr2        39      * |     0
#   [40]     chr2        40      * |     0
#   -------
#   seqinfo: 2 sequences from an unspecified genome

Hope this helps.

H.

ADD REPLYlink modified 29 days ago • written 29 days ago by Hervé Pagès ♦♦ 12k

I made a small edit to my answer above to use an Rle when propagating the score to the GPos object.

ADD REPLYlink written 29 days ago by Hervé Pagès ♦♦ 12k

Thanks, this looks very nice. I will try it. I have a question though. I am familiar with the seqlengths when creating a GRanges object. But here I am reading a bedgraph file into R. How do I give the import.bw or import.bedgraph the seqlength() parameter? 

ADD REPLYlink written 29 days ago by Assa Yeroslaviz1.3k

I don't know about getting the seqlengths info thru import.bw() or import.bedgraph(), Michael might be able to help with this. However please note that you can always use the seqlengths() setter to add the seqlengths "manually" to the GRanges object returned by import.bw() or import.bedgraph(). For example, since you seem to be working with yeast:

library(GenomeInfoDb)
si <- Seqinfo(genome="sacCer3")
seqlengths(si)
#   chrI   chrII  chrIII   chrIV ...   chrXV  chrXVI    chrM
# 230218  813184  316620 1531933 ... 1091291  948066   85779 
seqlengths(gr) <- seqlengths(si)

H.

ADD REPLYlink modified 29 days ago • written 29 days ago by Hervé Pagès ♦♦ 12k

Thanks, this I know. I thought there might be a better way.

ADD REPLYlink written 29 days ago by Assa Yeroslaviz1.3k

It should come automatically from a BigWig file (it's embedded within the file). For bedGraph, unless the file has the genome indicated in a UCSC-style "track line", you will need to tell it the genome via the genome= argument (for example, "hg38").  It will look up the seqlengths automatically from the genome.

ADD REPLYlink written 29 days ago by Michael Lawrence9.5k

But what if I'm using t he Ensembl annotations? so my chromosome names differs from the standard UCSC naming?

ADD REPLYlink written 28 days ago by Assa Yeroslaviz1.3k

The approach described by Herve is not so bad, in general. You can pass the Seqinfo object to import, e.g.,

gr <- import(file, seqinfo=Seqinfo(genome="sacCer3"))

 

ADD REPLYlink written 28 days ago by Michael Lawrence9.5k

after trying this example and not understanding, why it doesn't do what I want, I figured it out. The Rle() options looks better, but if I use it, I can't export the data as bedgraph (or bigwig). I get an error message claiming

Scores must be non-NA numeric values

Is there a way to coerce the Rle score column into a numeric values vector?

ADD REPLYlink written 18 days ago by Assa Yeroslaviz1.3k

I fixed this in devel. Work around is just not to use Rles, or coerce them to numeric vectors on export. The export routines are not that optimized for GPos anyway.

ADD REPLYlink written 17 days ago by Michael Lawrence9.5k

yes thanks, I have figured it out by now. I am using now the rep() command instead. 

ADD REPLYlink written 17 days ago by Assa Yeroslaviz1.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 267 users visited in the last hour