Question

BiSeq trimClusters error

0

Entering edit mode

kms68 • 0

@kms68-9519

Last seen 10.1 years ago

Dear All,

I was hoping you could help me with a problem I am having. I have run RRBS paired-end methyl-sequencing, and am currently analysing the data via an R pipeline. I am searching for Differentially Methylated Regions (DMRs) between 2 groups, analysing by sex, treatment, and age.

I have been using the Bioconductor BiSeq Package. In the manual, it says to perform the function: ‘trimClusters’ to remove the regions which aren’t differentially methylated, to leave me with my final result. But the following error appears:

> locCor.Mcord.all.3 <- estLocCor(vario.sm.Mcord.all.3)
(warning of NaNs produced, but the function seems to allow the pipeline to progress, despite the NaNs).

> clusters.rej.Mcord.all.3 <- testClusters(locCor.Mcord.all.3, FDR.cluster = 0.05)

> clusters.trimmed.Mcord.all.3 <- trimClusters(clusters.rej.Mcord.all.3, FDR.loc = 0.05)
Error in integrate(integrand, lower = z.li[loc], upper = Inf) :
non-finite function value
In addition: Warning messages:
1: In sqrt(1 - rho.li^2) : NaNs produced
2: In sqrt(1 - rho.li^2) : NaNs produced

I have analysed my clusters.rej. This has a few components, one of which is ‘clusters.rej.Mcord.all.3$clusters.reject’. This has no Na values that I can spot. Other components, such as ‘clusters.rej.Mcord.all.3$clusters.not.reject’ do have rows with p values of ‘Na’, which I have tried my best to remove.

I have 2 questions:
1) What is the function of ‘trimClusters’? I thought it just meant removing all the clusters that weren’t rejected, leaving those that were.

2) Why am I getting this error message? My supervisor said that perhaps there were CpG regions that didn’t have a high coverage during the initial sequencing, which therefore gave errors in downstream processing. If this is the case, how could I correct this? Failing this, is there a way to remove all the NaNs in the ‘clusters.rej’ to allow the subsequent step to progress?
What could have possibly caused this error?

Please let me know if you require any further information. I am happy to send you, for example, my R pipeline if necessary.

Thanks for your help!

Manu

Part 2 BA Undergraduate Student
University of Cambridge

Email: kms68@cam.ac.uk

My sessionInfo() is:

R version 3.2.2 (2015-08-14)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

locale:

[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252

[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C

[5] LC_TIME=English_United Kingdom.1252

attached base packages:

[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:

[1] BiSeq_1.10.0 Formula_1.2-1 SummarizedExperiment_1.0.1

[4] Biobase_2.30.0 GenomicRanges_1.22.1 GenomeInfoDb_1.6.1

[7] IRanges_2.4.4 S4Vectors_0.8.3 BiocGenerics_0.16.1

loaded via a namespace (and not attached):

[1] futile.logger_1.4.1 XVector_0.10.0 bitops_1.0-6 futile.options_1.0.0

[5] tools_3.2.2 zlibbioc_1.16.0 RSQLite_1.0.0 annotate_1.48.0

[9] lattice_0.20-33 DBI_0.3.1 rtracklayer_1.30.1 Biostrings_2.38.2

[13] lmtest_0.9-34 grid_3.2.2 nnet_7.3-11 globaltest_5.24.0

[17] flexmix_2.3-13 AnnotationDbi_1.32.0 survival_2.38-3 XML_3.98-1.3

[21] BiocParallel_1.4.0 lokern_1.1-5 lambda.r_1.1.7 splines_3.2.2

[25] Rsamtools_1.22.0 modeltools_0.2-21 sfsmisc_1.0-28 GenomicAlignments_1.6.1

[29] xtable_1.8-0 betareg_3.0-5 sandwich_2.3-4 RCurl_1.95-4.7

[33] zoo_1.7-12

biseq error trimClusters • 2.5k views

ADD COMMENT • link updated 10.0 years ago by Katja Hebestreit ▴ 130 • written 10.1 years ago by kms68 • 0

score 0 · Answer 1 · 2016-01-14

0

Entering edit mode

Katja Hebestreit ▴ 130

@katja-hebestreit-6495

Last seen 5.6 years ago

United States

Hi Manu,

Regarding your first question: In a first step BiSeq tests all CpG clusters (= regions of high CpG density) for differential methylation. When a CpG cluster is significant, this doesn't necessarily mean that all CpG sites within this region are differentially methylated. That's why trimClusters tests each CpG site in a significant CpG cluster for methylation difference and gives you back all CpG sites that were actually significantly differentially methylated.

Regarding the error: It is normal that the beta regression model cannot be built for all CpG sites. You will always encounter NAs for CpG sites with, e.g., a variance of zero across all samples (because the methylation was zero or one across all samples). I have never encountered the error for trimClusters(). Have you looked at the variogram? It is pretty useful for getting an idea how well the data could be modeled. How does the variogram look for your data?

Cheers,
Katja

ADD COMMENT • link 10.1 years ago Katja Hebestreit ▴ 130

0

Entering edit mode

Dear Katja,

Thank you so much for your swift reply, and for your explanation. I have been reading up about variograms, and have attached a representative one of mine as a URL: http://tinypic.com/r/15guj60/9

These were the steps I performed to create the variogram:

vario.Cordplacebo.all.3 <- makeVariogram(betaResultsNull.Cordplacebo.all.3)
plot(vario.Cordplacebo.all.3$variogram$v)
vario.sm.Cordplacebo.all.3 <- smoothVariogram(vario.Cordplacebo.all.3, sill = 1.15)
lines(vario.sm.Cordplacebo.all.3$variogram[,c("h", "v.sm")], col = "red", lwd = 1.5)
grid()

The variogram seems ok (to me at least), perhaps the concerning thing is that there is not as much variation between the points as there is in the example variogram in the BiSeq manual. But I don't have much experience of variograms and haven't been able to spot anything else odd.

I have been trying to maniupulate my R script but I am genuinely quite confused about what I'm doing wrong. Please let me know if you want me to send me further information.

Regards,

Manu

ADD REPLY • link 10.1 years ago kms68 • 0

score 0 · Answer 2 · 2016-01-18

Hi Manu,

Thanks for the figure of the variogram. You can interpret the variogram as 1-correlation, which means it has to be between 0 and 1. For some data the variogram tends goes beyond 1, as it does in your example. In cases like this, you should set your sill to 1:

smoothVariogram( vario.Cardplacebo.all.3, sill = 1)

I think this should solve the problem. I should probably change the function so that a sill greater than 1 would get caught.

It is good that you don't have much variation between your points in the variogram. That means you have a lot of data (the example dataset in the manual is pretty small).

Let me know if sill = 1 solves the problem.

Cheers,
Katja

score 0 · Answer 3 · 2016-02-07

Hi Katja,

Sorry for taking so long to reply. Setting the sill = 1 did indeed solve my problem. Thank you so much for your help!

I have been trying to find a reason why the data would have a sill above 1 - if possible, could you link me a resource or explain why this would be the case? However, this is just curiosity and will not delay my work so there is no pressure on you to do this if you haven't got the time!

Regards,

Manu

score 0 · Answer 4 · 2016-02-09

Hi Manu,

I am glad it is working now!

The assumption is that the p-values are uniformly distributed under the null hypothesis. This is not always the case in real datasets. For example, the p-values are less well uniformly distributed the viewer samples you have. There can also be a problem with very lowly covered CpG sites (e.g. one read across multiple samples). In these cases the model estimation doesn't work very well and the distribution of your p-values may deviate a little from uniform distribution. This is a normal phenomenon though and it should work fine if you restrict the sill to 1.

Cheers,

Katja

score 0 · Answer 5 · 2016-02-09

Hi Manu,

I am glad it is working now!

The assumption is that the p-values are uniformly distributed under the null hypothesis. This is not always the case in real datasets. For example, the p-values are less well uniformly distributed the viewer samples you have. There can also be a problem with very lowly covered CpG sites (e.g. one read across multiple samples). In these cases the model estimation doesn't work very well and the distribution of your p-values may deviate a little from uniform distribution. This is a normal phenomenon though and it should work fine if you restrict the sill to 1.

Cheers,

Katja