Question

minfi - estimateCellCount

Methyl ▴ 50

I'm trying to understand what the estimateCellCounts() function returns when returnAll=TRUE is specified. Currently the method description lacks detail and just says "Should the composition table and the normalized user supplied data be return?". Does anyone know what the returned "counts", "compTable" and "normalizedData" datasets are, please? Is normalizedData the intensities corrected for cell type composition?

score 3 · Answer 1 · 2015-07-24

The issue with R help pages in general is that they are extremely terse, so you have to peruse the information closely to understand exactly what you are being told. So let's look at the help page.

Details:

     This is an implementaion of the Houseman et al (2012) regression
     calibration approachalgorithm to the Illumina 450k microarray for
     deconvoluting heterogeneous tissue sources like blood. For
     example, this function will take an 'RGChannelSet' from a DNA
     methylation (DNAm) study of blood, and return the relative
     proportions of CD4+ and CD8+ T-cells, natural killer cells,
     monocytes, granulocytes, and b-cells in each sample.

     The 'meanPlot' should be used to check for large batch effects in
     the data, reducing the confidence placed in the composition
     estimates. This plot depicts the average DNA methylation across
     the cell-type discrimating probes in both the provided and sorted
     data. The means from the provided heterogeneous samples should be
     within the range of the sorted samples. If the sample means fall
     outside the range of the sorted means, the cell type estimates
     will inflated to the closest cell type. Note that we quantile
     normalize the sorted data with the provided data to reduce these
     batch effects.

Value:

     Matrix of composition estimates across all samples and cell types.

     If 'returnAll=TRUE' a list of a count matrix (see previous
     paragraph), a composition table and the normalized user data in
     form of a GenomicMethylSet.

Looking at the Value section, it says you get a 'Matrix of composition estimates across all samples and cell types.' And then it says if you use returnAll = TRUE, you get a count matrix, a composition table and the normalized user data.

So this is a bit confusing because it seems that count and composition are being used interchangeably here. So let me interpret. If you use returnAll = TRUE, you get a list. The first list item is called 'counts' and contains the composition estimates across all samples. In other words this list item contains the estimated proportion of each cell type in each sample.
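As a rough sketch of what that looks like in practice, the list can be picked apart by name. The element names are the ones quoted in the question. The `res` object below is a hand-made stand-in with invented numbers so that the snippet runs without minfi, and `RGset` in the comment is just a placeholder for your own RGChannelSet.

```r
# Real call (needs minfi and an RGChannelSet called RGset; not run here):
#   res <- estimateCellCounts(RGset, returnAll = TRUE)

# Stand-in with the same element names. Every number here is invented.
cell_types <- c("CD8T", "CD4T", "NK", "Bcell", "Mono", "Gran")
res <- list(
  counts = matrix(
    c(0.10, 0.15, 0.05, 0.05, 0.10, 0.55,
      0.08, 0.20, 0.04, 0.06, 0.07, 0.55),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("sample1", "sample2"), cell_types)
  ),
  compTable = NULL,       # per-CpG table in the real result
  normalizedData = NULL   # GenomicMethylSet in the real result
)

names(res)

# One row per sample, one column per cell type
estimates <- res$counts
estimates["sample1", "Gran"]

# How much of each sample the six cell types account for
round(rowSums(estimates), 2)
```

With returnAll = FALSE (the default described in the Value section) you would get only the matrix that is stored here as `res$counts`.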

You also get two other things. You get the normalized user data in the form of a GenomicMethylSet. If you look at the last sentence of the Details section, you can see what this is. In order to deconvolute your samples, this function uses an existing data set of sorted cell types (the 'sorted data' in the help page) to infer the proportions in your data. In order to best compare the data between these two data sets, there is a quantile normalization performed on a combination of your data and the pre-existing data, and you get your data back, after it has been quantile normalized. So normalizedData has not been corrected for cell type composition; it has only been normalized together with the sorted data.
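To make the normalization step a bit more concrete, here is a minimal sketch of plain quantile normalization on a combined matrix. This is the textbook version on made-up numbers, not minfi's own routine (which works on the raw array data and is more involved). The function name `quantile_normalize` and all the toy objects are hypothetical.

```r
# Generic quantile normalization of the columns of a matrix.
# Assumes no missing values; ties are broken by order of appearance.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each column
  sorted <- apply(x, 2, sort)                         # each column sorted
  target <- rowMeans(sorted)                          # shared target distribution
  out <- apply(ranks, 2, function(r) target[r])       # map ranks onto the target
  dimnames(out) <- dimnames(x)
  out
}

# Toy data: two "user" samples and two "sorted reference" samples
set.seed(1)
probes <- paste0("probe", 1:5)
user <- matrix(runif(10, 0.2, 0.9), ncol = 2,
               dimnames = list(probes, c("user1", "user2")))
sorted_ref <- matrix(runif(10), ncol = 2,
                     dimnames = list(probes, c("ref1", "ref2")))

# Normalize both data sets together, then pull the user samples back out
combined   <- cbind(user, sorted_ref)
normalized <- quantile_normalize(combined)
user_normalized <- normalized[, colnames(user)]
```

After this, every column of `normalized` has the same set of values, only in a different order, which is what makes the user samples and the sorted samples comparable. The `user_normalized` piece plays the role that normalizedData plays in the real return value.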

The composition table is difficult to interpret from the existing information, but luckily there is an example for this function that we can run (which in an ideal world you would have already tried). If we do this, and look at the composition table, it looks like this:

> head(counts$compTable)
               Fstat      p.value      CD8T      CD4T        NK     Bcell
cg13869341  1.048226 4.080828e-01 0.8517652 0.8463667 0.8598454 0.8495618
cg14008030 15.695187 1.330140e-07 0.7375011 0.7356875 0.7437085 0.6345076
cg12045430 11.060340 4.212250e-06 0.1956881 0.2138476 0.1858280 0.1882041
cg20826792  7.235756 1.508050e-04 0.3777702 0.4293842 0.4228932 0.3857610
cg00381604  1.393994 2.546868e-01 0.1597431 0.1820281 0.1780191 0.1504064
cg20253340  0.810480 5.514880e-01 0.6090627 0.6311182 0.6210798 0.6215644
                Mono      Gran       low      high      range
cg13869341 0.8382613 0.8345798 0.8062695 0.8960010 0.08973149
cg14008030 0.6546512 0.5208975 0.4754629 0.8469223 0.37145935
cg12045430 0.2546996 0.2269501 0.1635160 0.2793192 0.11580328
cg20826792 0.4704966 0.4489326 0.3114430 0.4900109 0.17856793
cg00381604 0.1822330 0.2007854 0.1191203 0.2941590 0.17503871
cg20253340 0.6365886 0.6511151 0.5354807 0.6798090 0.14432838

This is wide, so gets wrapped. But looking at the data, we have an Fstat, a p-value, some parameter estimates and the low/high/range values for the parameter estimates. Without doing anything more than scanning the Houseman paper (which you should actually read, if you are planning to use this method, because just blindly running code without having at least a general sense of what is going on is not very scientific and stuff), it seems that we are doing some sort of regression, probably using the existing data as some sort of prior, on each CpG being measured. And this table shows the F-statistic testing for any difference between the cell types at each CpG, the p-value for that test, and the estimated coefficient for each cell type.

Presumably you could sort by F-stat (decreasing, as F-stats are never negative, and big means 'more likely that there is a difference'), or by p-value (increasing because it's the exact opposite for p-values), and then you could see what CpGs helped infer the proportions. Or not, because that's probably sort of boring, and the counts table is all you really wanted in the first place.
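If you did want to do that sorting, a sketch along these lines should work. To keep it self-contained, the data frame below is retyped from the six rows printed above (only the Fstat, p.value, low, high and range columns); with the real object you would start from `counts$compTable` instead. The names `comp`, `by_F` and `by_p` are made up for the example.

```r
# The six rows shown above, retyped by hand
comp <- data.frame(
  Fstat   = c(1.048226, 15.695187, 11.060340, 7.235756, 1.393994, 0.810480),
  p.value = c(4.080828e-01, 1.330140e-07, 4.212250e-06,
              1.508050e-04, 2.546868e-01, 5.514880e-01),
  low     = c(0.8062695, 0.4754629, 0.1635160, 0.3114430, 0.1191203, 0.5354807),
  high    = c(0.8960010, 0.8469223, 0.2793192, 0.4900109, 0.2941590, 0.6798090),
  range   = c(0.08973149, 0.37145935, 0.11580328,
              0.17856793, 0.17503871, 0.14432838),
  row.names = c("cg13869341", "cg14008030", "cg12045430",
                "cg20826792", "cg00381604", "cg20253340")
)

# Largest F-statistic first
by_F <- comp[order(comp$Fstat, decreasing = TRUE), ]

# Smallest p-value first
by_p <- comp[order(comp$p.value), ]

rownames(by_F)
rownames(by_p)

# The range column is just high minus low, up to rounding in the printout
max(abs(comp$range - (comp$high - comp$low)))
```

For these six probes both orderings put cg14008030 at the top, so that is the one that separates the cell types best in this small subset.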

score 0 · Answer 2 · 2015-07-27
