Question: BiSeq trimClusters error
0
3.8 years ago by
kms680
kms680 wrote:

Dear All,

I was hoping you could help me with a problem I am having. I have run RRBS paired-end methyl-sequencing, and am currently analysing the data via an R pipeline. I am searching for Differentially Methylated Regions (DMRs) between 2 groups, analysing by sex, treatment, and age.

I have been using the Bioconductor BiSeq Package. In the manual, it says to perform the function: ‘trimClusters’ to remove the regions which aren’t differentially methylated, to leave me with my final result. But the following error appears:

> locCor.Mcord.all.3 <- estLocCor(vario.sm.Mcord.all.3)
(warning of NaNs produced, but the function seems to allow the pipeline to progress, despite the NaNs).

> clusters.rej.Mcord.all.3 <- testClusters(locCor.Mcord.all.3, FDR.cluster = 0.05)

> clusters.trimmed.Mcord.all.3 <- trimClusters(clusters.rej.Mcord.all.3, FDR.loc = 0.05)
Error in integrate(integrand, lower = z.li[loc], upper = Inf) :
non-finite function value
1: In sqrt(1 - rho.li^2) : NaNs produced
2: In sqrt(1 - rho.li^2) : NaNs produced

I have analysed my clusters.rej. This has a few components, one of which is ‘clusters.rej.Mcord.all.3$clusters.reject’. This has no Na values that I can spot. Other components, such as ‘clusters.rej.Mcord.all.3$clusters.not.reject’ do have rows with p values of ‘Na’, which I have tried my best to remove.

I have 2 questions:
1)  What is the function of ‘trimClusters’? I thought it just meant removing all the clusters that weren’t rejected, leaving those that were.

2) Why am I getting this error message? My supervisor said that perhaps there were CpG regions that didn’t have a high coverage during the initial sequencing, which therefore gave errors in downstream processing. If this is the case, how could I correct this? Failing this, is there a way to remove all the NaNs in the ‘clusters.rej’ to allow the subsequent step to progress?
What could have possibly caused this error?

Please let me know if you require any further information. I am happy to send you, for example, my R pipeline if necessary.

Manu

University of Cambridge

Email: kms68@cam.ac.uk

My sessionInfo() is:

R version 3.2.2 (2015-08-14)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

locale:

[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252

[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C

[5] LC_TIME=English_United Kingdom.1252

attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

[1] BiSeq_1.10.0               Formula_1.2-1              SummarizedExperiment_1.0.1

[4] Biobase_2.30.0             GenomicRanges_1.22.1       GenomeInfoDb_1.6.1

[7] IRanges_2.4.4              S4Vectors_0.8.3            BiocGenerics_0.16.1

loaded via a namespace (and not attached):

[1] futile.logger_1.4.1     XVector_0.10.0          bitops_1.0-6            futile.options_1.0.0

[5] tools_3.2.2             zlibbioc_1.16.0         RSQLite_1.0.0           annotate_1.48.0

[9] lattice_0.20-33         DBI_0.3.1               rtracklayer_1.30.1      Biostrings_2.38.2

[13] lmtest_0.9-34           grid_3.2.2              nnet_7.3-11             globaltest_5.24.0

[17] flexmix_2.3-13          AnnotationDbi_1.32.0    survival_2.38-3         XML_3.98-1.3

[21] BiocParallel_1.4.0      lokern_1.1-5            lambda.r_1.1.7          splines_3.2.2

[25] Rsamtools_1.22.0        modeltools_0.2-21       sfsmisc_1.0-28          GenomicAlignments_1.6.1

[29] xtable_1.8-0            betareg_3.0-5           sandwich_2.3-4          RCurl_1.95-4.7

[33] zoo_1.7-12

biseq error trimclusters • 935 views
modified 3.8 years ago by Katja Hebestreit110 • written 3.8 years ago by kms680
0
3.8 years ago by
United States
Katja Hebestreit110 wrote:

Hi Manu,

Regarding your first question: In a first step BiSeq tests all CpG clusters (= regions of high CpG density) for differential methylation. When a CpG cluster is significant, this doesn't necessarily mean that all CpG sites within this region are differentially methylated. That's why trimClusters tests each CpG site in a significant CpG cluster for methylation difference and gives you back all CpG sites that were actually significantly differentially methylated.

Regarding the error: It is normal that the beta regression model cannot be built for all CpG sites. You will always encounter NAs for CpG sites with, e.g., a variance of zero across all samples (because the methylation was zero or one across all samples). I have never encountered the error for trimClusters(). Have you looked at the variogram?  It is pretty useful for getting an idea how well the data could be modeled. How does the variogram look for your data?

Cheers,
Katja

Dear Katja,

Thank you so much for your swift reply, and for your explanation. I have been reading up about variograms, and have attached a representative one of mine as a URL: http://tinypic.com/r/15guj60/9

These were the steps I performed to create the variogram:

vario.Cordplacebo.all.3 <- makeVariogram(betaResultsNull.Cordplacebo.all.3)
plot(vario.Cordplacebo.all.3$variogram$v)
vario.sm.Cordplacebo.all.3 <- smoothVariogram(vario.Cordplacebo.all.3, sill = 1.15)
lines(vario.sm.Cordplacebo.all.3\$variogram[,c("h", "v.sm")], col = "red", lwd = 1.5)
grid()

The variogram seems ok (to me at least), perhaps the concerning thing is that there is not as much variation between the points as there is in the example variogram in the BiSeq manual. But I don't have much experience of variograms and haven't been able to spot anything else odd.

I have been trying to maniupulate my R script but I am genuinely quite confused about what I'm doing wrong. Please let me know if you want me to send me further information.

Regards,

Manu

0
3.8 years ago by
United States
Katja Hebestreit110 wrote:

Hi Manu,

Thanks for the figure of the variogram. You can interpret the variogram as 1-correlation, which means it has to be between 0 and 1. For some data the variogram tends goes beyond 1, as it does in your example. In cases like this, you should set your sill to 1:

smoothVariogram( vario.Cardplacebo.all.3, sill = 1)

I think this should solve the problem. I should probably change the function so that a sill greater than 1 would get caught.

It is good that you don't have much variation between your points in the variogram. That means you have a lot of data (the example dataset in the manual is pretty small).

Let me know if sill = 1 solves the problem.

Cheers,
Katja

0
3.8 years ago by
kms680
kms680 wrote:

Hi Katja,

Sorry for taking so long to reply. Setting the sill = 1 did indeed solve my problem. Thank you so much for your help!

I have been trying to find a reason why the data would have a sill above 1 - if possible, could you link me a resource or explain why this would be the case? However, this is just curiosity and will not delay my work so there is no pressure on you to do this if you haven't got the time!

Regards,

Manu

0
3.8 years ago by
United States
Katja Hebestreit110 wrote:

Hi Manu,

I am glad it is working now!

The assumption is that the p-values are uniformly distributed under the null hypothesis. This is not always the case in real datasets. For example, the p-values are less well uniformly distributed the viewer samples you have. There can also be a problem with very lowly covered CpG sites (e.g. one read across multiple samples). In these cases the model estimation doesn't work very well and the distribution of your p-values may deviate a little from uniform distribution. This is a normal phenomenon though and it should work fine if you restrict the sill to 1.

Cheers,

Katja

0
3.8 years ago by
United States
Katja Hebestreit110 wrote:

Hi Manu,

I am glad it is working now!

The assumption is that the p-values are uniformly distributed under the null hypothesis. This is not always the case in real datasets. For example, the p-values are less well uniformly distributed the viewer samples you have. There can also be a problem with very lowly covered CpG sites (e.g. one read across multiple samples). In these cases the model estimation doesn't work very well and the distribution of your p-values may deviate a little from uniform distribution. This is a normal phenomenon though and it should work fine if you restrict the sill to 1.

Cheers,

Katja