Search
Question: Finding DMRs using dmrFinder in bsseq
0
gravatar for Maria
10 days ago by
Maria0
Russia / Institute of Biochemistry and Genetics
Maria0 wrote:

Hello!
Could you help me with my challenges?
I have paired-end data obtained with RRBS expriment.
I did the adapter and quality trimming using trim_galore.
Then i the read alignments the methylation extraction using Bismark .
And farther I worked with the produced coverage files, is it all right?

==============================================
After the smothing step (BSmooth(My.bsseq.data, , mc.cores = 1,verbose = TRUE)),  

I see a lot of warnings (more than 50) looks like this:
... 50: In lfpros(x, y, weights = weights, cens = cens, base = base,
... : procv: parameters out of bounds
Is it all right?

==============================================
My data consists of 8 normal samples and 8 tumour samples of cancer patients.

The average coverage of CpGs on the considered chromosome
(for example, 7th chromosome) is 2.4 (from 0.4 to 4.6 in different samples).
Number of CpGs in this chromosome is 1 247 716,
Number of CpGs which are covered by at least 1 read in all 16 samples is  18 543,
Number of CpGs with 0 coverage in all samples is 0...

And something similar is observed in all chromosomes.

There is absent CpGs with 0 coverage in all samples.
Is there something wrong with my data? Is it possible to continue the analysis?

==============================================
According to the document "Analyzing WGBS with the bsseq package",

I want to remove CpGs with little or no coverage before computing t-statistics to avoid finding false positive DMRs.
In your example with 3+3=6 samples you recommended the following:

keepLoci.ex <-
 which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 2) >= 2 &
          rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 2) >= 2).

But what is correct for my data(8+8=16 samples)?

 

I tried this command with the following parameters (1 and 2):
myLoci1 <-
which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 1) >= 2 &
         rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 1) >= 2)
Then length(myLoci1) is equal 320 019; the average coverage of remaining CpGs is 8.89 (from 0.5 to 17.7 per sample!)


I tried this command with another parameters (1 and 5):
myLoci2 <-
which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 1) >= 5 &
         rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 1) >= 5)
Then length(myLoci2) is equal 136 610; the average coverage of remaining CpGs is 18.59 (from 0.7 to 39.2 per sample!)

And also I tried this command with another parameters (2 and 6):
myLoci3 <-
which(rowSums(BS.cov[, BS.cancer.ex$Type == "cancer"] >= 2) >= 6 &
         rowSums(BS.cov[, BS.cancer.ex$Type == "normal"] >= 2) >= 6)
Then length(myLoci3) is equal 68 594; the average coverage of remaining CpGs is 31.08 (from 0.9 to 70.0!!!!)

My data is far from ideal, but you need to somehow try to process them and get DMR. What parameters should I use?

Thank you!

ADD COMMENTlink modified 5 days ago • written 10 days ago by Maria0
0
gravatar for James W. MacDonald
10 days ago by
United States
James W. MacDonald43k wrote:

Most of your questions could be characterized as asking how you should analyze your data. However, this support site is intended to help people with technical problems using Bioconductor software.

Analyzing data is a non-trivial task, in many ways much more difficult than simply figuring out how to use a particular R package. You have to understand your data and the goals of the experiment, determine what assumptions you are willing to (or can) make, and then make an informed decision as to how best to proceed.

It's really not possible for anybody to do that for you on an online forum. If you don't have the required experience to analyze your data, you should strongly consider enlisting a local statistician who does have the experience to help you. If that isn't a possibility, then you need to educate yourself, which will require lots of reading, and is not likely to include much interaction with people on an online forum, unless you have a strictly technical question.

ADD COMMENTlink written 10 days ago by James W. MacDonald43k

Excuse me, yes, it's true.

ADD REPLYlink written 9 days ago by Maria0
0
gravatar for Peter Hickey
10 days ago by
Peter Hickey210
Australia
Peter Hickey210 wrote:

To answer some of your specific questions:

1. The warnings generated during BSmooth() are coming from the underlying locfit R package. Such warnings are quite technical by nature and can be hard to understand, but they often indicate a problem with the choice of smoothing parameters. The default BSmooth() parameters are tuned for analysing CpG methylation from whole-genome bisulfite-sequencing data of human DNA. You may need to vary these parameters to find appropriate values for RRBS data.

2. RRBS is typically used to study a small part of the genome with high sequencing coverage. By design, RRBS only captures a small percentage of CpGs in the genome. Therefore you should expect most CpGs in the reference genome to be uncovered. However, you should also expect most CpGs that have coverage in 1 samples to have coverage in multiple samples. This doesn't appear to be the case in your data. Additionally, the very low sequencing coverage suggests something has gone wrong because RRBS data normally has high coverage (> 30x). It's hard to say where the problem is. I would check with the people who designed the experiment or did the sequencing what depth of coverage they were aiming for. 

Pete

ADD COMMENTlink written 10 days ago by Peter Hickey210

Thank you for your consideration!
1. I tryed to change parameters ns (minimum number of meth.loci in a smoothing window) and h (minimum smoothing window).
The value of ns is equal 70 by default. The above warnings are not occure when I set the meaning of ns to 700 or 800 or more. At the same time, little depends on the parameter h for my data. I can even set it to 0 and no warnings.


2. I know that, unfortunately, my data have a serious problem with quality. I carryed out quality trimming for these data, but this actions did not help in full. But I want to try to extract information (to find DMRs) even for these data.
So I excluded two pairs of samples with low sequencing coverage compared with other samples.

There is an example here: https://www.bioconductor.org/packages/devel/bioc/vignettes/bsseq/inst/doc/bsseq_analysis.pdf
which keep CpGs where at least 2 cancer samples and at least 2 normal samples have at least 2x in coverage.
But why not all 3 cancer and all 3 normal samples?

Yes, it's a personal preference but I want to understand the considerations.
For  my  analysis  I'm going to keep only CpGs where all (i.e.6) cancer samples and all (i.e.6) normal samples have at least 2x in coverage.
I'm sorry that my post is too long!

ADD REPLYlink written 9 days ago by Maria0

1. Setting ns so large is probably not a good idea for RRBS data. As documented, ns is the minimum number of methylation loci in a smooth window. Most RRBS captured regions will contain many, many fewer CpGs, so you will be smoothing over large windows containing several clusters of CpGs with large gaps in between. Although I have not analysed much RRBS data, I suspect even ns = 70 is on the larger end of what is appropriate. 

2. As you say, the choice of cutoff is partly down to personal preference. The idea is not to have too many loci for which only a very small number of samples have coverage. This is not the same as only using loci for which all samples have coverage, which is obviously a more restrictive requirement

ADD REPLYlink written 9 days ago by Peter Hickey210

Thank you very much for your answer!

ADD REPLYlink written 9 days ago by Maria0

Dear Dr. Hickey,
It's all right when I'm analyzing chromosomes 17-22.
But when analyzing the 16 chromosomes, I see the following error when calling the function BSmooth.tstat:

Error in preplot.locfit.raw(object, newdata, where, what, band):
    NA/NaN/Inf in foreign function call (arg 2)
In addition: Warning message:
In lfproc(x , y, weights = weights, cens = cens, base = base, geth = geth, :)

    max_nr reduction problem

The data is about the same as for the other chromosomes.
So I don't understand what could be causing this error.

ADD REPLYlink written 7 days ago by Maria0

It's difficult to know what's causing the error from that output. Are you able to share the chr16 data with me (e.g., via Dropbox)? I could then investigate further

ADD REPLYlink written 6 days ago by Peter Hickey210
0
gravatar for Maria
5 days ago by
Maria0
Russia / Institute of Biochemistry and Genetics
Maria0 wrote:

Yes, thank you!

I sent a link to you by e-mail. There are COV-files for 16 chromosome (and for example for 20 chromosome) in the folder.
Files with "n" letter refer to the normal tissue group, files with "t" letter refer to the tumor tissue group.
So, I compare the two groups of 6 samples each.

ADD COMMENTlink written 5 days ago by Maria0

Thanks, Nastasia. Can you please also post the exact command you ran that worked for chr20 but not for chr16

ADD REPLYlink written 3 days ago by Peter Hickey210

I used the same commands for chromosome 16 and for chromosome 20. I post here the commands for chromosome 16.

list.files(pattern="*.cov") -> cov.files
read.bismark(files=cov.files,sampleNames=gsub(".cov","",cov.files),rmZeroCov=TRUE, strandCollapse=FALSE) ->bsseq.data

bsseq.data[[1]] <- c("normal","cancer","normal","cancer","normal","cancer","normal","cancer","normal","cancer","normal","cancer")
bsseq.data[[2]] <- c("pair1","pair1","pair2","pair2","pair3","pair3","pair4","pair4","pair5","pair5","pair7","pair7")
pData(bsseq.data) <- setNames(pData(bsseq.data),c("Type","pair"))

bsseq.data.smoothed <- BSmooth(bsseq.data, verbose=TRUE)
bsseq.data.smooth <- updateObject(bsseq.data.smoothed)

myBS.cov <- getCoverage(bsseq.data.smooth)
myLoci1 <- which(rowSums(myBS.cov[, bsseq.data$Type == "cancer"] >= 1 >= 4 &
                  rowSums(myBS.cov[, bsseq.data$Type == "normal"] >= 1) >= 4)
bsseq.data.smooth1 <- bsseq.data.smooth[myLoci1,]

BSmooth.tstat(bsseq.data.smooth1,
              group1=c("71n_16", "72n_16", "73n_16", "74n_16", "75n_16", "78n_16"),
              group2=c("71t_16", "72t_16", "73t_16", "74t_16", "75t_16", "78t_16"),
              estimate.var="group2",local.correct=TRUE, verbose=TRUE) -> bsseq.data.tstats1

tcrit1 <- -qt(0.01/2, 10)

dmrs0 <- dmrFinder(bsseq.data.tstats1, cutoff = c(-tcrit1, tcrit1),stat="tstat.corrected",verbose=TRUE)

Thank you!

ADD REPLYlink written 2 days ago by Maria0

I am unable to reproduce the error. However, your example code has a bug: list.files(pattern="*.cov") -> cov.files means that the chr16 and chr20 data will be considered to have come from separate samples. This is not what you want.

I tidied up and simplified your example code and was able to successfully smooth both the chr16 and chr20 data. Please try running my example code on your machine.

Note, I am using the development branch of Bioconductor (3.6) with bsseq v1.13.2. Please share your session info (the output of sessionInfo()), so I know which version of bsseq you are using. Please also format your code, it makes it easier to read and understand.

# Example code
suppressPackageStartupMessages(library(bsseq))
cov_files_16 <- list.files(path = "~/Dropbox/16&20chromosomec_cov",
                           pattern = glob2rx("*16.cov"),
full.names = TRUE)
bsseq_16 <- read.bismark(files = cov_files_16, 
                         sampleNames = gsub(".cov", "", cov_files_16),
                         rmZeroCov = TRUE, 
                         strandCollapse = FALSE)
#> Assuming file type is cov
#> [read.bismark] Joining samples ... done in 10.4 secs
bsseq_16 <- BSmooth(bsseq_16, verbose = TRUE)
#> [BSmooth] preprocessing ... done in 0.6 sec
#> [BSmooth] smoothing by 'sample' (mc.cores = 1, mc.preschedule = FALSE)
#> [BSmooth] smoothing done in 192.0 sec
cov_files_20 <- list.files(path = "~/Dropbox/16&20chromosomec_cov",
                           pattern = glob2rx("*20.cov"),
                           full.names = TRUE)
bsseq_20 <- read.bismark(files = cov_files_20, 
                         sampleNames = gsub(".cov", "", cov_files_20),
                         rmZeroCov = TRUE, 
                         strandCollapse = FALSE,
                         mc.cores = 1)
#> Assuming file type is cov
#> [read.bismark] Joining samples ... done in 6.0 secs
bsseq_20 <- BSmooth(bsseq_20, verbose = TRUE, mc.cores = 1)
#> [BSmooth] preprocessing ... done in 0.5 sec
#> [BSmooth] smoothing by 'sample' (mc.cores = 1, mc.preschedule = FALSE)
#> [BSmooth] smoothing done in 88.6 se
ADD REPLYlink modified 2 days ago • written 2 days ago by Peter Hickey210
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 224 users visited in the last hour