Search
Question: Finding DMRs using dmrFinder in bsseq
0
gravatar for Maria
6 months ago by
Maria0
Russia / Institute of Biochemistry and Genetics
Maria0 wrote:

Hello!
Could you help me with my challenges?
I have paired-end data obtained with RRBS expriment.
I did the adapter and quality trimming using trim_galore.
Then i the read alignments the methylation extraction using Bismark .
And farther I worked with the produced coverage files, is it all right?

==============================================
After the smothing step (BSmooth(My.bsseq.data, , mc.cores = 1,verbose = TRUE)),  

I see a lot of warnings (more than 50) looks like this:
... 50: In lfpros(x, y, weights = weights, cens = cens, base = base,
... : procv: parameters out of bounds
Is it all right?

==============================================
My data consists of 8 normal samples and 8 tumour samples of cancer patients.

The average coverage of CpGs on the considered chromosome
(for example, 7th chromosome) is 2.4 (from 0.4 to 4.6 in different samples).
Number of CpGs in this chromosome is 1 247 716,
Number of CpGs which are covered by at least 1 read in all 16 samples is  18 543,
Number of CpGs with 0 coverage in all samples is 0...

And something similar is observed in all chromosomes.

There is absent CpGs with 0 coverage in all samples.
Is there something wrong with my data? Is it possible to continue the analysis?

==============================================
According to the document "Analyzing WGBS with the bsseq package",

I want to remove CpGs with little or no coverage before computing t-statistics to avoid finding false positive DMRs.
In your example with 3+3=6 samples you recommended the following:

keepLoci.ex <-
 which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 2) >= 2 &
          rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 2) >= 2).

But what is correct for my data(8+8=16 samples)?

 

I tried this command with the following parameters (1 and 2):
myLoci1 <-
which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 1) >= 2 &
         rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 1) >= 2)
Then length(myLoci1) is equal 320 019; the average coverage of remaining CpGs is 8.89 (from 0.5 to 17.7 per sample!)


I tried this command with another parameters (1 and 5):
myLoci2 <-
which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 1) >= 5 &
         rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 1) >= 5)
Then length(myLoci2) is equal 136 610; the average coverage of remaining CpGs is 18.59 (from 0.7 to 39.2 per sample!)

And also I tried this command with another parameters (2 and 6):
myLoci3 <-
which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 2) >= 6 &
         rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 2) >= 6)
Then length(myLoci3) is equal 68 594; the average coverage of remaining CpGs is 31.08 (from 0.9 to 70.0!!!!)

My data is far from ideal, but you need to somehow try to process them and get DMR. What parameters should I use?

Thank you!

ADD COMMENTlink modified 6 months ago • written 6 months ago by Maria0
0
gravatar for James W. MacDonald
6 months ago by
United States
James W. MacDonald45k wrote:

Most of your questions could be characterized as asking how you should analyze your data. However, this support site is intended to help people with technical problems using Bioconductor software.

Analyzing data is a non-trivial task, in many ways much more difficult than simply figuring out how to use a particular R package. You have to understand your data and the goals of the experiment, determine what assumptions you are willing to (or can) make, and then make an informed decision as to how best to proceed.

It's really not possible for anybody to do that for you on an online forum. If you don't have the required experience to analyze your data, you should strongly consider enlisting a local statistician who does have the experience to help you. If that isn't a possibility, then you need to educate yourself, which will require lots of reading, and is not likely to include much interaction with people on an online forum, unless you have a strictly technical question.

ADD COMMENTlink written 6 months ago by James W. MacDonald45k

Excuse me, yes, it's true.

ADD REPLYlink written 6 months ago by Maria0
0
gravatar for Peter Hickey
6 months ago by
Peter Hickey290
Johns Hopkins University, Baltimore, USA
Peter Hickey290 wrote:

To answer some of your specific questions:

1. The warnings generated during BSmooth() are coming from the underlying locfit R package. Such warnings are quite technical by nature and can be hard to understand, but they often indicate a problem with the choice of smoothing parameters. The default BSmooth() parameters are tuned for analysing CpG methylation from whole-genome bisulfite-sequencing data of human DNA. You may need to vary these parameters to find appropriate values for RRBS data.

2. RRBS is typically used to study a small part of the genome with high sequencing coverage. By design, RRBS only captures a small percentage of CpGs in the genome. Therefore you should expect most CpGs in the reference genome to be uncovered. However, you should also expect most CpGs that have coverage in 1 samples to have coverage in multiple samples. This doesn't appear to be the case in your data. Additionally, the very low sequencing coverage suggests something has gone wrong because RRBS data normally has high coverage (> 30x). It's hard to say where the problem is. I would check with the people who designed the experiment or did the sequencing what depth of coverage they were aiming for. 

Pete

ADD COMMENTlink written 6 months ago by Peter Hickey290

Thank you for your consideration!
1. I tryed to change parameters ns (minimum number of meth.loci in a smoothing window) and h (minimum smoothing window).
The value of ns is equal 70 by default. The above warnings are not occure when I set the meaning of ns to 700 or 800 or more. At the same time, little depends on the parameter h for my data. I can even set it to 0 and no warnings.


2. I know that, unfortunately, my data have a serious problem with quality. I carryed out quality trimming for these data, but this actions did not help in full. But I want to try to extract information (to find DMRs) even for these data.
So I excluded two pairs of samples with low sequencing coverage compared with other samples.

There is an example here: https://www.bioconductor.org/packages/devel/bioc/vignettes/bsseq/inst/doc/bsseq_analysis.pdf
which keep CpGs where at least 2 cancer samples and at least 2 normal samples have at least 2x in coverage.
But why not all 3 cancer and all 3 normal samples?

Yes, it's a personal preference but I want to understand the considerations.
For  my  analysis  I'm going to keep only CpGs where all (i.e.6) cancer samples and all (i.e.6) normal samples have at least 2x in coverage.
I'm sorry that my post is too long!

ADD REPLYlink written 6 months ago by Maria0

1. Setting ns so large is probably not a good idea for RRBS data. As documented, ns is the minimum number of methylation loci in a smooth window. Most RRBS captured regions will contain many, many fewer CpGs, so you will be smoothing over large windows containing several clusters of CpGs with large gaps in between. Although I have not analysed much RRBS data, I suspect even ns = 70 is on the larger end of what is appropriate. 

2. As you say, the choice of cutoff is partly down to personal preference. The idea is not to have too many loci for which only a very small number of samples have coverage. This is not the same as only using loci for which all samples have coverage, which is obviously a more restrictive requirement

ADD REPLYlink written 6 months ago by Peter Hickey290

Thank you very much for your answer!

ADD REPLYlink written 6 months ago by Maria0

Dear Dr. Hickey,
It's all right when I'm analyzing chromosomes 17-22.
But when analyzing the 16 chromosomes, I see the following error when calling the function BSmooth.tstat:

Error in preplot.locfit.raw(object, newdata, where, what, band):
    NA/NaN/Inf in foreign function call (arg 2)
In addition: Warning message:
In lfproc(x , y, weights = weights, cens = cens, base = base, geth = geth, :)

    max_nr reduction problem

The data is about the same as for the other chromosomes.
So I don't understand what could be causing this error.

ADD REPLYlink written 6 months ago by Maria0

It's difficult to know what's causing the error from that output. Are you able to share the chr16 data with me (e.g., via Dropbox)? I could then investigate further

ADD REPLYlink written 6 months ago by Peter Hickey290
0
gravatar for Maria
6 months ago by
Maria0
Russia / Institute of Biochemistry and Genetics
Maria0 wrote:

Yes, thank you!

I sent a link to you by e-mail. There are COV-files for 16 chromosome (and for example for 20 chromosome) in the folder.
Files with "n" letter refer to the normal tissue group, files with "t" letter refer to the tumor tissue group.
So, I compare the two groups of 6 samples each.

ADD COMMENTlink written 6 months ago by Maria0

Thanks, Nastasia. Can you please also post the exact command you ran that worked for chr20 but not for chr16

ADD REPLYlink written 6 months ago by Peter Hickey290

I used the same commands for chromosome 16 and for chromosome 20. I post here the commands for chromosome 16.

list.files(pattern="*.cov") -> cov.files
read.bismark(files=cov.files,sampleNames=gsub(".cov","",cov.files),rmZeroCov=TRUE, strandCollapse=FALSE) ->bsseq.data

bsseq.data[[1]] <- c("normal","cancer","normal","cancer","normal","cancer","normal","cancer","normal","cancer","normal","cancer")
bsseq.data[[2]] <- c("pair1","pair1","pair2","pair2","pair3","pair3","pair4","pair4","pair5","pair5","pair7","pair7")
pData(bsseq.data) <- setNames(pData(bsseq.data),c("Type","pair"))

bsseq.data.smoothed <- BSmooth(bsseq.data, verbose=TRUE)
bsseq.data.smooth <- updateObject(bsseq.data.smoothed)

myBS.cov <- getCoverage(bsseq.data.smooth)
myLoci1 <- which(rowSums(myBS.cov[, bsseq.data$Type == "cancer"] >= 1 >= 4 &
                  rowSums(myBS.cov[, bsseq.data$Type == "normal"] >= 1) >= 4)
bsseq.data.smooth1 <- bsseq.data.smooth[myLoci1,]

BSmooth.tstat(bsseq.data.smooth1,
              group1=c("71n_16", "72n_16", "73n_16", "74n_16", "75n_16", "78n_16"),
              group2=c("71t_16", "72t_16", "73t_16", "74t_16", "75t_16", "78t_16"),
              estimate.var="group2",local.correct=TRUE, verbose=TRUE) -> bsseq.data.tstats1

tcrit1 <- -qt(0.01/2, 10)

dmrs0 <- dmrFinder(bsseq.data.tstats1, cutoff = c(-tcrit1, tcrit1),stat="tstat.corrected",verbose=TRUE)

Thank you!

ADD REPLYlink written 6 months ago by Maria0

I am unable to reproduce the error. However, your example code has a bug: list.files(pattern="*.cov") -> cov.files means that the chr16 and chr20 data will be considered to have come from separate samples. This is not what you want.

I tidied up and simplified your example code and was able to successfully smooth both the chr16 and chr20 data. Please try running my example code on your machine.

Note, I am using the development branch of Bioconductor (3.6) with bsseq v1.13.2. Please share your session info (the output of sessionInfo()), so I know which version of bsseq you are using. Please also format your code, it makes it easier to read and understand.

# Example code
suppressPackageStartupMessages(library(bsseq))
cov_files_16 <- list.files(path = "~/Dropbox/16&20chromosomec_cov",
                           pattern = glob2rx("*16.cov"),
full.names = TRUE)
bsseq_16 <- read.bismark(files = cov_files_16, 
                         sampleNames = gsub(".cov", "", cov_files_16),
                         rmZeroCov = TRUE, 
                         strandCollapse = FALSE)
#> Assuming file type is cov
#> [read.bismark] Joining samples ... done in 10.4 secs
bsseq_16 <- BSmooth(bsseq_16, verbose = TRUE)
#> [BSmooth] preprocessing ... done in 0.6 sec
#> [BSmooth] smoothing by 'sample' (mc.cores = 1, mc.preschedule = FALSE)
#> [BSmooth] smoothing done in 192.0 sec
cov_files_20 <- list.files(path = "~/Dropbox/16&20chromosomec_cov",
                           pattern = glob2rx("*20.cov"),
                           full.names = TRUE)
bsseq_20 <- read.bismark(files = cov_files_20, 
                         sampleNames = gsub(".cov", "", cov_files_20),
                         rmZeroCov = TRUE, 
                         strandCollapse = FALSE,
                         mc.cores = 1)
#> Assuming file type is cov
#> [read.bismark] Joining samples ... done in 6.0 secs
bsseq_20 <- BSmooth(bsseq_20, verbose = TRUE, mc.cores = 1)
#> [BSmooth] preprocessing ... done in 0.5 sec
#> [BSmooth] smoothing by 'sample' (mc.cores = 1, mc.preschedule = FALSE)
#> [BSmooth] smoothing done in 88.6 se
ADD REPLYlink modified 6 months ago • written 6 months ago by Peter Hickey290

Dear Dr. Hickey, thanks for your reply.
I run the following code and I have the following error.

list.files(pattern="*.cov") -> cov.files
read.bismark(files=cov.files,sampleNames=gsub(".cov","",cov.files),rmZeroCov              =TRUE, strandCollapse=FALSE) -> bsseq.data

#> Assuming file type is cov
#> [read.bismark] Reading file '71n_16.cov' ... done in 1.8 secs
#> [read.bismark] Reading file '71t_16.cov' ... done in 1.0 secs
#> [read.bismark] Reading file '72n_16.cov' ... done in 0.5 secs
#> [read.bismark] Reading file '72t_16.cov' ... done in 0.6 secs
#> [read.bismark] Reading file '73n_16.cov' ... done in 0.4 secs
#> [read.bismark] Reading file '73t_16.cov' ... done in 0.9 secs
#> [read.bismark] Reading file '74n_16.cov' ... done in 0.4 secs
#> [read.bismark] Reading file '74t_16.cov' ... done in 0.6 secs
#> [read.bismark] Reading file '75n_16.cov' ... done in 0.3 secs
#> [read.bismark] Reading file '75t_16.cov' ... done in 0.6 secs
#> [read.bismark] Reading file '78n_16.cov' ... done in 0.4 secs
#> [read.bismark] Reading file '78t_16.cov' ... done in 0.3 secs
#> [read.bismark] Joining samples ... done in 10.2 secs

bsseq.data[[1]] <- c("normal","cancer","normal","cancer","normal","cancer","normal","cancer","normal","cancer","normal","cancer")
bsseq.data[[2]] <- c("pair1","pair1","pair2","pair2","pair3","pair3","pair4","pair4","pair5","pair5","pair7","pair7")
pData(bsseq.data) <- setNames(pData(bsseq.data), c("Type", "pair"))

bsseq.data.smoothed <- BSmooth(bsseq.data, verbose=TRUE)
#> [BSmooth] preprocessing ... done in 0.7 sec
#> [BSmooth] smoothing by 'sample' (mc.cores = 1, mc.preschedule = FALSE)
#> [BSmooth] sample 71n_16 (out of 12), done in 32.1 sec
#> [BSmooth] sample 71t_16 (out of 12), done in 23.5 sec
#> [BSmooth] sample 72n_16 (out of 12), done in 22.9 sec
#> [BSmooth] sample 72t_16 (out of 12), done in 22.6 sec
#> [BSmooth] sample 73n_16 (out of 12), done in 26.7 sec
#> [BSmooth] sample 73t_16 (out of 12), done in 36.0 sec
#> [BSmooth] sample 74n_16 (out of 12), done in 23.9 sec
#> [BSmooth] sample 74t_16 (out of 12), done in 24.7 sec
#> [BSmooth] sample 75n_16 (out of 12), done in 21.8 sec
#> [BSmooth] sample 75t_16 (out of 12), done in 27.0 sec
#> [BSmooth] sample 78n_16 (out of 12), done in 28.9 sec
#> [BSmooth] sample 78t_16 (out of 12), done in 28.4 sec
#> [BSmooth] smoothing done in 318.5 sec
#> There were 50 or more warnings (use warnings() to see the first 50)

myBS.cov <- getCoverage(bsseq.data.smooth)
myLoci1 <- which(rowSums(myBS.cov[, bsseq.data$Type == "cancer"] >= 1) >= 4 &
                   rowSums(myBS.cov[, bsseq.data$Type == "normal"] >= 1) >= 4)
bsseq.data.smooth1 <- bsseq.data.smooth[myLoci1,]
BSmooth.tstat(bsseq.data.smooth1,
               group1=c("71n_16", "72n_16", "73n_16", "74n_16", "75n_16", "78n_16"),
               group2=c("71t_16", "72t_16", "73t_16", "74t_16", "75t_16", "78t_16"),
               estimate.var="group2",local.correct=TRUE, verbose=TRUE) -> bsseq.data.tstats1
#> [BSmooth.tstat] preprocessing ... done in 0.3 sec
#> [BSmooth.tstat] computing stats within groups ... done in 0.4 sec
#> [BSmooth.tstat] computing stats across groups ... 
#> Error in preplot.locfit.raw(object, newdata, where, what, band) : 
#>   NA/NaN/Inf in foreign function call (arg 2)
#> In addition: Warning message:
#> In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth,  :
#>   max_nr reduction problem

 

ADD REPLYlink modified 5 months ago • written 5 months ago by Maria0

There were multiple warnings following bsseq.data.smoothed <- BSmooth(bsseq.data, verbose=TRUE). Did you look to see what there were? Above I suggested that because you are analysing RRBS data you would need to explore different parameters to BSmooth() but you've used the defaults. I suspect this is causing downstream issues. 

ADD REPLYlink written 5 months ago by Peter Hickey290

Sorry, forgot to write right away, I'm using the latest version of R and bsseq v1.12.1.

sessionInfo()
#> R version 3.4.0 (2017-04-21)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1

What should I do? Thank you for your attention.

ADD REPLYlink modified 5 months ago • written 5 months ago by Maria0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 262 users visited in the last hour