Question

estimateCellCounts Visualisations - Adjusting Beta Values for Plotting

0

Entering edit mode

andrew.j.skelton73 ▴ 370

@andrewjskelton73-7074

Last seen 7 months ago

United Kingdom

Hi,

I've got some blood 450K methylation data and I've used the estimateCellCounts function in Minfi to include as terms in my model for a differential methylation test. I just wanted to be sure that what I'm doing is correct - I want to get adjust beta values for visualisations.

Code:

foo <- estimateCellCounts(rgSet            = raw.idat,
                          meanPlot         = T,
                          returnAll        = T,
                          fixOutliers      = T,
                          mergeManifest    = T)

treatments           <- unique(pData(lumi.norm.filtered)$SampleType)
treatment_arrays     <- as.factor(pData(lumi.norm.filtered)$SampleType)
cell_counts          <- as.data.frame(foo$counts)

design               <- model.matrix(~0 + treatment_arrays
                                        + cell_counts$CD8T
                                        + cell_counts$CD4T
                                        + cell_counts$NK
                                        + cell_counts$Bcell
                                        + cell_counts$Mono
                                        + cell_counts$Gran)
colnames(design)     <- c(treatments, colnames(cell_counts))
fitM                 <- lmFit(getM(lumi.norm.filtered),    design)

#Adjust M Values
mAdj.in     <- getM(lumi.norm.filtered)
mAdj.fit    <- fitM$coefficients[,-c(1,2)]
mAdj        <- as.matrix(mAdj.in) - mAdj.fit %*% t(design[,-c(1,2)])
#Use Lumi's M value to Beta Value conversion
betaAdj     <- m2beta(mAdj) 
#or
#betaAdj     <- ilogit2(mAdj)

Does that seem logical/ sane?

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] FlowSorted.Blood.450k_1.6.0                        RPMM_1.20                                          cluster_2.0.3                                     
 [4] dynamicTreeCut_1.62                                impute_1.42.0                                      IlluminaHumanMethylation450kanno.ilmn12.hg19_0.2.1
 [7] IlluminaHumanMethylation450kmanifest_0.4.0         biomaRt_2.24.0                                     GenomicFeatures_1.20.1                            
[10] AnnotationDbi_1.30.1                               biovizBase_1.16.0                                  ggbio_1.16.1                                      
[13] cowplot_0.5.0                                      gridExtra_2.0.0                                    doParallel_1.0.8                                  
[16] scales_0.2.5                                       reshape2_1.4.1                                     ggplot2_1.0.1                                     
[19] limma_3.24.14                                      lumi_2.20.2                                        minfi_1.14.0                                      
[22] bumphunter_1.8.0                                   locfit_1.5-9.1                                     iterators_1.0.7                                   
[25] foreach_1.4.2                                      Biostrings_2.36.1                                  XVector_0.8.0                                     
[28] GenomicRanges_1.20.5                               GenomeInfoDb_1.4.1                                 IRanges_2.2.5                                     
[31] S4Vectors_0.6.2                                    lattice_0.20-33                                    Biobase_2.28.0                                    
[34] BiocGenerics_0.14.0                               

loaded via a namespace (and not attached):
 [1] nlme_3.1-121             bitops_1.0-6             matrixStats_0.14.2       RColorBrewer_1.1-2       tools_3.2.1              doRNG_1.6               
 [7] nor1mix_1.2-0            affyio_1.36.0            rpart_4.1-10             KernSmooth_2.23-15       Hmisc_3.16-0             DBI_0.3.1               
[13] mgcv_1.8-7               colorspace_1.2-6         nnet_7.3-10              methylumi_2.14.0         GGally_0.5.0             base64_1.1              
[19] preprocessCore_1.30.0    graph_1.46.0             pkgmaker_0.22            labeling_0.3             rtracklayer_1.28.6       genefilter_1.50.0       
[25] quadprog_1.5-5           affy_1.46.1              RBGL_1.44.0              stringr_1.0.0            digest_0.6.8             Rsamtools_1.20.4        
[31] foreign_0.8-65           illuminaio_0.10.0        siggenes_1.42.0          GEOquery_2.34.0          dichromat_2.0-0          BSgenome_1.36.2         
[37] RSQLite_1.0.0            BiocInstaller_1.18.4     mclust_5.0.2             BiocParallel_1.2.9       acepack_1.3-3.3          VariantAnnotation_1.14.6
[43] RCurl_1.95-4.7           magrittr_1.5             Formula_1.2-1            futile.logger_1.4.1      Matrix_1.2-2             Rcpp_0.11.6             
[49] munsell_0.4.2            proto_0.3-10             stringi_0.5-5            nleqslv_2.8              MASS_7.3-43              zlibbioc_1.14.0         
[55] plyr_1.8.3               splines_3.2.1            multtest_2.24.0          annotate_1.46.1          beanplot_1.2             rngtools_1.2.4          
[61] codetools_0.2-14         futile.options_1.0.0     XML_3.98-1.3             latticeExtra_0.6-26      lambda.r_1.1.7           gtable_0.1.2            
[67] reshape_0.8.5            xtable_1.7-4             survival_2.38-3          OrganismDbi_1.10.0       GenomicAlignments_1.4.1  registry_0.3

minfi limma • 3.1k views

ADD COMMENT • link updated 8.7 years ago by Smeeta Shrestha ▴ 20 • written 9.3 years ago by andrew.j.skelton73 ▴ 370

score 2 · Answer 1 · 2015-07-29

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 7 hours ago

The city by the bay

I can't speak for the rest of the code, but there's a couple of things to point out with using raw cell counts as covariates. The first is that you're fitting to M-values, which is a log-transformed ratio of counts. If, for example, there's an increase in the cell count from 10 to 100 between two samples, you're effectively asking for a change in the M-value of 10*B between samples (where B is the estimated coefficient for the cell count covariate). This is equivalent to a 2^(10*B) = 1024^B change in the relevant counts (assuming the original log-transformation was base-2), which is pretty extreme for any non-zero value of B.

Possibly a better solution would be to log-transform the cell counts before using them as covariates. In the example above, the logged counts would have a change of log2(10) between samples. This equates to a change in the M-value of log2(10)*B, which is equivalent to a 10^B change in the relevant counts. This is less extreme and may provide a better fit to the data. For example, if the cell type was such that all bases were methylated, a 10-fold increase in the cell count for this type would result in a 10-fold increase in the methylated counts. This would be modelled by B=1 (ignoring proportionality effects for simplicity).

Another thing to keep in mind is that you're putting a lot of covariates into your design. This may confound the interpretation of your treatment effects in treatment_arrays. For example, if one sample type is associated with greater numbers of one cell type, any sample-specific differences (e.g., the treatment effect of interest) may be absorbed into the estimate of the coefficient for the count covariate of that cell type. As a result, you may end up with few significant features during differential testing for sample effects.

ADD COMMENT • link 9.3 years ago Aaron Lun ★ 28k

1

Entering edit mode

In addition to Aaron's comments I would add that the first thing I would do is to visualize the distributions of cell types between cases and controls. I believe we have great evidence that this estimation procedure is good enough for uncovering systematic differences between two groups (with a decent number of samples). However, assuming you uncover a difference between cases and controls, I am less convinced that we know the best way to address this. It is not clear to me that just adding terms in a linear model is the way to go. You could also try the various methods for reference free correction. Best, Kasper On Wed, Jul 29, 2015 at 8:14 PM, Aaron Lun [bioc] <noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User Aaron Lun <https: support.bioconductor.org="" u="" 6732=""/> wrote Answer: > estimateCellCounts Visualisations - Adjusting Beta Values for Plotting > <https: support.bioconductor.org="" p="" 70559="" #70587="">: > > I can't speak for the rest of the code, but there's a couple of things to > point out with using raw cell counts as covariates. The first is that > you're fitting to M-values, which is a log-transformed ratio of counts. If, > for example, there's an increase in the cell count from 10 to 100 between > two samples, you're effectively asking for a change in the M-value of 10*B > between samples (where B is the estimated coefficient for the cell count > covariate). This is equivalent to a 2^(10*B) change in the relevant > counts (assuming the original transformation was base-2), which is pretty > extreme for any non-zero value of B. > > Possibly a better solution would be to log-transform the cell counts > before using them as covariates. In the example above, the logged counts > would have a change of log2(10) between samples. This equates to a change > in the M-value of log2(10)*B, which is equivalent to a 10^B change in the > relevant counts. This is less extreme and may provide a better fit to the > data. For example, if the cell type was such that all bases were > methylated, a 10-fold increase in the cell count for this type would result > in a 10-fold increase in the methylated counts. This would be modelled by > B=1 (ignoring proportionality effects for simplicity). > > Another thing to keep in mind is that you're putting a lot of covariates > in. This may confound the interpretation of your treatment effects. For > example, if one sample type is associated with greater numbers of one cell > type, any sample-specific differences (e.g., the treatment effect of > interest) may be absorbed into the estimate of the coefficient for the > count covariate of that cell type. As a result, you may end up with few > significant genes during testing for sample effects. > > ------------------------------ > > Post tags: minfi, limma > > You may reply via email or visit > A: estimateCellCounts Visualisations - Adjusting Beta Values for Plotting >

ADD REPLY • link 9.3 years ago Kasper Daniel Hansen ★ 6.5k

0

Entering edit mode

Hi Kasper:

Just wanted to point out that you might want to update the estimateCellCounts.Rd file to provide the correct citation/reference info for the "Accounting for cellular heterogeneity is critical in epigenome-wide association studies" paper. Currently it says that the paper is "Under Review"

ADD REPLY • link 9.3 years ago Steve Lianoglou ★ 13k

score 0 · Answer 2 · 2016-01-21

0

Entering edit mode

Smeeta Shrestha ▴ 20

@smeeta-shrestha-6393

Last seen 8.7 years ago

Hi Andrew ,

Could you please elaborate

1. mAdj.fit <- fitM$coefficients[,-c(1,2)]

2. mAdj        <- as.matrixmAdj.in) - mAdj.fit %*% t(design[,-c(1,2)])

what the [,-c(1,2)] means in the syntax 1 and 2 above.

If i understand correctly does it mean removing columns 1 and 2 from the data matrix ?

Thank you

smeeta

ADD COMMENT • link 8.8 years ago Smeeta Shrestha ▴ 20

0

Entering edit mode

It means you remove the first and second column, correct. What you need to do is remove the columns which contain your treatments. So if you have 3 treatments, and they're the first three columns in your design matrix, then you'd remove 1:3. Check the columns of your design matrix to be sure. On Thursday, 21 January 2016, Smeeta Shrestha [bioc] < noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User Smeeta Shrestha <https: support.bioconductor.org="" u="" 6393=""/> wrote Answer: > estimateCellCounts Visualisations - Adjusting Beta Values for Plotting > <https: support.bioconductor.org="" p="" 70559="" #77248="">: > > Hi Andrew , > > Could you please elaborate > > 1. mAdj.fit <- fitM$coefficients[,-c(1,2)] > > 2. mAdj <- as.matrixmAdj.in) - mAdj.fit %*% t(design[,-c(1,2)]) > > what the [,-c(1,2)] means in the syntax 1 and 2 above. > > > > If i understand correctly does it mean removing columns 1 and 2 from the > data matrix ? > > > > Thank you > > smeeta > > ------------------------------ > > Post tags: minfi, limma > > You may reply via email or visit > A: estimateCellCounts Visualisations - Adjusting Beta Values for Plotting >

ADD REPLY • link 8.8 years ago andrew.j.skelton73 ▴ 370

score 0 · Answer 3 · 2016-02-15

0

Entering edit mode

Smeeta Shrestha ▴ 20

@smeeta-shrestha-6393

Last seen 8.7 years ago

Hi Andrew,

I have a question. Why are you dropping the intercept in your model.

you design is showing

design <- model.matrix (~0+treatment_arrays+ ..........) and not

design <- model.matrix(~treatment_arrays + ..........)

Since I tried to adjust for blood count in similiar manner, but on using two designs I get different "adjusted beta values".

Thank you

smeeta

ADD COMMENT • link 8.7 years ago Smeeta Shrestha ▴ 20

0

Entering edit mode

I dropped the intercept because I've built a contrast matrix, i.e. there are more than 2 sample groupings and I want to compare between them. The OP is simply an example. You will get different adjusted beta values, because you're doing two different things, your models are different. To reiterate, check the columns of your design matrix. I'd suggest that if you don't understand what's happening with your different models, that you consult with a local statistician.

ADD REPLY • link 8.7 years ago andrew.j.skelton73 ▴ 370