Question

How to correctly import Bismark coverage files into bsseq?

0

Entering edit mode

enricoferrero ▴ 680

@enricoferrero-6037

Last seen 4.2 years ago

Switzerland

Hi,

I've used Bismark v0.14.1 (http://www.bioinformatics.babraham.ac.uk/projects/bismark/) to align bs-seq reads and call methylated C's. Now I'd like to perform a differential methylation analysis using bsseq v1.6.0.

I've generated the coverage files using bismark_methylation_extractor but I don't understand whether I should use the 1-based files (.cov) or the 0-based ones (.zero.cov).

The bsseq::read.bismark() help says:

        • The genomic co-ordinates of the Bismark output file may be
          zero-based or one-based depending on whether the
          ‘--zero_based’ argument is used.  Furthermore, the default
          co-ordinate system varies by version of Bismark.  bsseq makes
          no assumptions about the basis of the genomic co-ordinates
          and it is left to the user to ensure that the appropriate
          basis is used in the analysis of their data.

which I find a bit ambiguous.

Should I use 1-based or 0-based coverage files from Bismark to perform a differential methylation analysis in bsseq?

Thank you.

bsseq bismark methylation • 6.9k views

ADD COMMENT • link 10.1 years ago enricoferrero ▴ 680

2

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 8 hours ago

United States

I don't think this is ambigous - it's just informative. In essence what this is saying is that there are two different ways to represent sequence positions; zero based, half open, and one based, closed, and that Bismark can output both. Given that fact, it is up to you to decide which you prefer, and to proceed with your analysis based on that preference, because bsseq isn't going to make that choice for you.

So the answer to your question is another question - which do you prefer? I think the convention in Bioconductor is to use one based, closed, so if you have no preference then you might go with that.

ADD COMMENT • link 10.1 years ago James W. MacDonald 68k

0

Entering edit mode

Thank you. I don't really have a preference: I'm just worried that bsseq might interpret the two formats differently. The coverage files are converted to GRanges internally, which are 1-based. My question basically is: Is bsseq assuming that the coverage files are 1-based when converting them to GRanges?

ADD REPLY • link 10.1 years ago enricoferrero ▴ 680

0

Entering edit mode

This shouldn't be a mystery, given that we can inspect the code. In read.bismark() we have this:

if (fileType == "cov" || fileType == "oldBedGraph") {
            out <- read.bismarkCovRaw(thisfile = files[ii], thisSampleName = sampleNames[ii], rmZeroCov = rmZeroCov)

And read.bismarkCovRaw() will almost certainly be hidden in the namespace, so we use the ::: function to get it out.

> bsseq:::read.bismarkCovRaw
function (thisfile, thisSampleName, rmZeroCov)
{
    if (isGzipped(thisfile)) {
        thisfile <- gunzip(thisfile, temporary = TRUE, overwrite = TRUE,
            remove = FALSE)
    }
    out <- fread(thisfile)
    if (ncol(out) != 6L) {
        stop("unknown file format")
    }
    gr <- GRanges(seqnames = out[[1L]], ranges = IRanges(start = out[[2L]],
        width = 1L))
    BSseq(gr = gr, M = as.matrix(out[[5L]]), Cov = as.matrix(out[[5L]] +
        out[[6L]]), sampleNames = thisSampleName, rmZeroCov = rmZeroCov)
}
<environment: namespace:bsseq>
>

Which shows pretty clearly what assumptions are being made.

ADD REPLY • link 10.1 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks! From that code it looks like both 0-based and 1-based files should be imported correctly as only the start coordinate is used.

I still think the docs could have been clearer, as shown by the fact that we had to look at a hidden function to find out - I'll ask Peter to clarify.

ADD REPLY • link 10.1 years ago enricoferrero ▴ 680

0

Entering edit mode

Thanks, Jim. Yes, we left it to the user to decide whether they wish to use 0-based or 1-based co-ordinates. Moreover, we simply construct a GRanges object using the given co-ordinates, without trying to adjust for whether the co-ordinates are 0-based or 1-based. The reason is that bsseq::read.bismark() cannot infer from the .cov file alone whether it is 0-based or 1-based (some versions of Bismark encode this in the file name, but not all as I recall). As noted in the documentation, the default basis of the co-ordinate system has varied over different versions of Bismark.

FWIW, I use 1-based co-ordinates in my analyses, as is the convention for much Bioconductor infrastructure as Jim notes.

Pete (author of bsseq::read.bismark())

ADD REPLY • link 10.1 years ago Peter Hickey ▴ 760

0

Entering edit mode

Hi Peter, thanks for chiming in!

Any chance the help for that function could be clearer? I understand that both 0-based and 1-based files are imported correctly: if so, is it worth saying that more explicitly?

Cheers,

ADD REPLY • link 10.1 years ago enricoferrero ▴ 680

2

Entering edit mode

I think if you look at the code and think about it a bit, you will see that the zero-based data are NOT imported 'correctly', if I interpret that word to mean 'imported as zero-based, half open'. They are instead imported as zero-based, closed, which is, IMO as correct as possible, given that

Bioconductor uses one-based, closed almost exclusively
Bsseq states clearly that no attempt is made to infer the type of input data, and leaves that up to the end user

In other words, zero-based, half open would say that a SNP at the first position of chr1 is at (0,1), whereas one-based, closed, would say it is at (1,1). So if you import zero-based data you get

> GRanges("chr1", IRanges(0, width=1))
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [0, 0]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

And if the data are one-based, closed, you get

> GRanges("chr1", IRanges(1, width=1))
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1    [1, 1]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

So in the former case you have to shift everything to the right by one position to be consistent with, like everything else in Bioconductor. So this is correct inasmuch as it does exactly what the help page says it will do, but not correct if you are thinking you can just blindly import zero-based data and have everything work out for you.

@Peter - Am I wrong to think that you could use the end position from the Bismark output, regardless of the counting scheme used, and thus not need to care if it is zero or one based? I think that would work, but may well be missing some important caveat.

ADD REPLY • link 10.1 years ago James W. MacDonald 68k

0

Entering edit mode

@Jim That's an excellent idea. I'll check that bismark's zero-based co-ordinates were always half-open (I have a sketchy recollection that some versions didn't do this properly) and implement it for the next Bioconductor release.

ADD REPLY • link 10.1 years ago Peter Hickey ▴ 760

1

Entering edit mode

To close this, I investigated the option of using the end co-ordinate as per Jim's suggestions. Unfortunately, the use of open vs. closed, zero-based vs. one-based co-ordinates has not been consistent across the development of bismark. Just by way of example:
v0.7.9 bedGraph output is closed, 0-based, e.g., [665, 665]
v0.12.3 bedGraph output is half-open, zero-based, e.g., (665, 666]
v0.12.3 cov output is closed, 1-based, e.g., [666, 666]
v0.12.3 cov output with --zero_based flag is half-open, 0-based, e.g., (665, 666]

It's very difficult to support all these options, especially since there is no version information in the files we import. So we will continue to leave this decision to the user and have added some comments based on this thread to the docs in the BioC 3.3 release version.

ADD REPLY • link 9.7 years ago Peter Hickey ▴ 760

0

Entering edit mode

Understood. Thanks Peter!

ADD REPLY • link 9.7 years ago enricoferrero ▴ 680

0

Entering edit mode

Got it now, thanks.

So, to answer my initial question, one should use 1-based coverage files to ensure compatibility with GRanges and other Bioconductor components, as 0-based coverage files will result in an inaccurate genomic range (unless you want to manually shift all coordinates after creating the GRanges objects).

I second the option to use the end coordinate from the Bismark output in the future as it may be less problematic (if compatible with older Bismark versions obviously).

ADD REPLY • link 10.1 years ago enricoferrero ▴ 680

0

Entering edit mode

I'm happy to clarify the documentation and suggestions for a sentence or two are most welcome! Like Jim, I don't find the current message ambiguous but I'm biased due my familiarity with the code.

ADD REPLY • link 10.1 years ago Peter Hickey ▴ 760

0

Entering edit mode

In light of the discussion above, I would simply remind the user that Bioconductor and GRanges use a 1-based system and that is the preferred format to import.

See also my own answer: A: How to correctly import Bismark coverage files into bsseq?

Thanks!

ADD REPLY • link 10.1 years ago enricoferrero ▴ 680

score 1 · Accepted Answer · 2015-12-08

1

Entering edit mode

enricoferrero ▴ 680

@enricoferrero-6037

Last seen 4.2 years ago

Switzerland

Use the 1-based coverage files from Bismark output.

Bioconductor uses 1-based ranges so importing the 0-based coverage files would require manually shifting coordinates to the right by one position.

See C: How to correctly import Bismark coverage files into bsseq? for details.

ADD COMMENT • link 10.1 years ago enricoferrero ▴ 680