Question: minfi - estimateCellCount
5
4.3 years ago by
Methyl50
United Kingdom
Methyl50 wrote:

I'm trying to understand what the estimateCellCounts() function returns when returnAll=TRUE is specified. Currently the method description lacks detail and just says "Should the composition table and the normalized user supplied data be return?". Does anyone know what the returned "counts", "compTable" and "normalizedData" datasets are, please? Is normalizedData the intensities corrected for cell type composition?

minfi • 1.3k views
modified 4.3 years ago • written 4.3 years ago by Methyl50
3
4.3 years ago by
United States
James W. MacDonald51k wrote:

The issue with R help pages in general is that they are extremely terse, so you have to peruse the information closely to understand exactly what you are being told. So let's look at the help page.

Details:

This is an implementaion of the Houseman et al (2012) regression
calibration approachalgorithm to the Illumina 450k microarray for
deconvoluting heterogeneous tissue sources like blood. For
example, this function will take an 'RGChannelSet' from a DNA
methylation (DNAm) study of blood, and return the relative
proportions of CD4+ and CD8+ T-cells, natural killer cells,
monocytes, granulocytes, and b-cells in each sample.

The 'meanPlot' should be used to check for large batch effects in
the data, reducing the confidence placed in the composition
estimates. This plot depicts the average DNA methylation across
the cell-type discrimating probes in both the provided and sorted
data. The means from the provided heterogeneous samples should be
within the range of the sorted samples. If the sample means fall
outside the range of the sorted means, the cell type estimates
will inflated to the closest cell type. Note that we quantile
normalize the sorted data with the provided data to reduce these
batch effects.

Value:

Matrix of composition estimates across all samples and cell types.

If 'returnAll=TRUE' a list of a count matrix (see previous
paragraph), a composition table and the normalized user data in
form of a GenomicMethylSet.

Looking at the Value section, it says you get a 'Matrix of composition estimates across all samples and cell types.' And then it says that if you use returnAll = TRUE, you get a count matrix, a composition table and the normalized user data.

So this is a bit confusing because it seems that count and composition are being used interchangeably here. So let me interpret. If you use returnAll = TRUE, you get a list. The first list item is called 'counts' and contains the composition estimates across all samples. In other words this list item contains the estimated proportion of each cell type in each sample.
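To make 'estimated proportion of each cell type' a bit more concrete, here is a toy sketch in base R. Everything in it is made up for illustration: the reference profiles, the two hypothetical cell types (TypeA, TypeB) and the mixed sample. It uses plain unconstrained least squares, which is only the general idea; the actual method adds constraints on the estimates and uses many more probes, so this is not minfi's code.

```r
# Hypothetical reference: mean methylation (beta) of 4 CpGs in 2 cell types
ref <- cbind(TypeA = c(0.9, 0.1, 0.8, 0.2),
             TypeB = c(0.2, 0.7, 0.3, 0.9))

# A made-up mixed sample built from 70% TypeA and 30% TypeB
trueProps <- c(TypeA = 0.7, TypeB = 0.3)
mixed <- as.vector(ref %*% trueProps)

# Unconstrained least squares: which mix of the reference
# profiles best reproduces the mixed sample?
est <- qr.solve(ref, mixed)
round(est, 3)
```

With noise-free toy data the estimates recover the proportions used to build the mixture. One row of the 'counts' matrix is the analogous vector for one real sample.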

You also get two other things. You get the normalized user data in the form of a GenomicMethylSet (the 'normalizedData' list item). If you look at the last sentence of the Details section, you can see what this is. In order to deconvolute your samples, this function uses an existing data set of sorted cell types (the 'sorted data' in the Details section) to infer the proportions in your data. To make the two data sets comparable, your data and the pre-existing data are combined and quantile normalized together, and you get your own data back after that normalization.
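If quantile normalization is unfamiliar, the following base R sketch shows the generic textbook version on a tiny made-up matrix. The function name quantile_normalize and the column names are mine, and minfi's own normalization of methylation arrays is more involved than this, so treat it as an illustration of the idea only.

```r
# Toy intensities: rows are probes, columns are samples.
# user1/user2 stand in for "your" samples, ref1 for a reference sample.
x <- cbind(user1 = c(5, 2, 3, 4),
           user2 = c(4, 1, 7, 2),
           ref1  = c(3, 4, 6, 8))

quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")  # rank within each column
  sorted <- apply(m, 2, sort)                         # sort each column
  target <- rowMeans(sorted)                          # mean of each quantile
  out <- apply(ranks, 2, function(r) target[r])       # map ranks to target values
  dimnames(out) <- dimnames(m)
  out
}

# Normalize user and reference samples together ...
normalized <- quantile_normalize(x)
# ... then keep only the user's columns, as the function does with your data
userNormalized <- normalized[, c("user1", "user2"), drop = FALSE]
```

After this, every column has the same set of values (the same distribution), while the ordering of probes within each sample is unchanged.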

The composition table is difficult to interpret from the existing information, but luckily there is an example for this function that we can run (which in an ideal world you would have already tried). If we do this, and look at the composition table, it looks like this:

> head(counts$compTable)
               Fstat      p.value      CD8T      CD4T        NK     Bcell
cg13869341  1.048226 4.080828e-01 0.8517652 0.8463667 0.8598454 0.8495618
cg14008030 15.695187 1.330140e-07 0.7375011 0.7356875 0.7437085 0.6345076
cg12045430 11.060340 4.212250e-06 0.1956881 0.2138476 0.1858280 0.1882041
cg20826792  7.235756 1.508050e-04 0.3777702 0.4293842 0.4228932 0.3857610
cg00381604  1.393994 2.546868e-01 0.1597431 0.1820281 0.1780191 0.1504064
cg20253340  0.810480 5.514880e-01 0.6090627 0.6311182 0.6210798 0.6215644
                Mono      Gran       low      high      range
cg13869341 0.8382613 0.8345798 0.8062695 0.8960010 0.08973149
cg14008030 0.6546512 0.5208975 0.4754629 0.8469223 0.37145935
cg12045430 0.2546996 0.2269501 0.1635160 0.2793192 0.11580328
cg20826792 0.4704966 0.4489326 0.3114430 0.4900109 0.17856793
cg00381604 0.1822330 0.2007854 0.1191203 0.2941590 0.17503871
cg20253340 0.6365886 0.6511151 0.5354807 0.6798090 0.14432838

This is wide, so it gets wrapped. But looking at the data, we have an Fstat, a p-value, some parameter estimates and the low/high/range values for the parameter estimates.

Without doing anything more than scanning the Houseman paper (which you should actually read if you are planning to use this method, because just blindly running code without having at least a general sense of what is going on is not very scientific and stuff), it seems that we are doing some sort of regression, probably using the existing data as some sort of prior, on each CpG being measured. This table shows, for each CpG, the F-statistic testing for any difference among the cell types, the p-value for that test, and the estimated coefficients.

Presumably you could sort by F-stat (decreasing, as F-stats are strictly positive, and big means 'more likely that there is a difference'), or by p-value (increasing, because it's the exact opposite for p-values), and then you could see which CpGs helped infer the proportions.
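As a small sketch of that sorting, here are the six rows from the output above typed into a plain data frame (only the Fstat, p.value, low and high columns, to keep it short). With a real result you would use counts$compTable directly instead of retyping anything.

```r
# Rows copied from the head(counts$compTable) output shown above
compTable <- data.frame(
  Fstat   = c(1.048226, 15.695187, 11.060340, 7.235756, 1.393994, 0.810480),
  p.value = c(4.080828e-01, 1.330140e-07, 4.212250e-06, 1.508050e-04,
              2.546868e-01, 5.514880e-01),
  low     = c(0.8062695, 0.4754629, 0.1635160, 0.3114430, 0.1191203, 0.5354807),
  high    = c(0.8960010, 0.8469223, 0.2793192, 0.4900109, 0.2941590, 0.6798090),
  row.names = c("cg13869341", "cg14008030", "cg12045430",
                "cg20826792", "cg00381604", "cg20253340")
)

# Sort by F-statistic, largest first
byF <- compTable[order(compTable$Fstat, decreasing = TRUE), ]
# Sort by p-value, smallest first
byP <- compTable[order(compTable$p.value), ]

# In the printed rows, 'range' agrees with high - low
compTable$range <- compTable$high - compTable$low

rownames(byF)[1:3]
```

For these six rows both orderings agree, with cg14008030 at the top.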
Or not, because that's probably sort of boring, and the counts table is all you really wanted in the first place.

written 4.3 years ago by James W. MacDonald51k
2
Why don't we all retire and let Jim answer all posts on the support forum. I think this would be much more efficient in the long run.

Best,
Kasper

On Fri, Jul 24, 2015 at 8:46 PM, James W. MacDonald [bioc] <noreply@bioconductor.org> wrote:
> [...]
0
4.3 years ago by
Methyl50
United Kingdom
Methyl50 wrote:

Thanks that explains it better!