Hello,
I've noticed that the number of features included in the data matrix influences the Wald statistic and fold change values in the DESeq2 output. I don't understand why this is so.
The data set I'm using comes from a very simple design: one treatment with two levels (treated and control), each with 6 replicates. There are 3074 features (in this case operational taxonomic units), so I have a 3074 x 12 matrix. Many of the features have very low sequence counts. I assumed I wouldn't get reliable inference from these anyway, so it seemed better to remove them and reduce the number of simultaneous tests. I therefore deleted 2290 features and ran the resulting 784 x 12 matrix through DESeq2. The result was that no features had significantly different sequence counts between the two sample groups. Out of curiosity, I prepared a second matrix from which I deleted only 1990 features, leaving a 1084 x 12 matrix. With this matrix, DESeq2 identified a couple of features with significantly different sequence counts between the two sample groups, even though these features had also been present in the first matrix. The 784 features included in both matrices had the same base mean values in the two DESeq2 outputs, but different Wald statistic and fold change values (and, of course, different p-values).
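For clarity, the filtering step I mean was along these lines (the threshold of 10 total reads is just illustrative, and Barley_matrix is the raw count matrix from my script below):

```r
# keep only features with at least 10 reads summed across all 12 samples
# (the cutoff is an arbitrary choice on my part)
keep <- rowSums(Barley_matrix) >= 10
Barley_matrix_filtered <- Barley_matrix[keep, ]
```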
Could someone please explain why the Wald statistic and fold change values calculated for a feature depend on whether 300 extra features are included in the data matrix or not (when the base mean values stay the same)? Can I trust either of the two sets of results?
Thank you very much!
I used R 3.3.1 and DESeq2 1.12.3 with the default settings. Here is an example of the script I used:
> Barley_DESeq2=DESeqDataSetFromMatrix(countData=Barley_matrix, colData=grouping, design=~Treatment)
> Barley_DESeq2
class: DESeqDataSet
dim: 1084 12
metadata(1): version
assays(1): counts
rownames(1084): SV1 SV2 ... SV1083 SV1084
rowData names(0):
colnames(12): ambient_44 ambient_46 ... elevated_52 elevated_55
colData names(1): Treatment
> Barley_deseqanalysis=DESeq(Barley_DESeq2)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> Barley_deseqresults=results(Barley_deseqanalysis)
> Barley_deseqresults
log2 fold change (MAP): Treatment elevated vs ambient
Wald test p-value: Treatment elevated vs ambient
DataFrame with 1084 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
SV1 2990.215 0.8853490 0.6773994 1.3069822 0.19121874 0.9991334
SV2 2531.758 0.2078180 0.3927295 0.5291632 0.59669225 0.9991334
SV3 2348.142 0.6548006 0.5757752 1.1372504 0.25543363 0.9991334
SV4 2276.444 1.1717034 0.5737804 2.0420763 0.04114396 0.6707282
SV5 1693.010 0.9203465 0.5030663 1.8294737 0.06732867 0.6707282
... ... ... ... ... ... ...
Thank you for the explanation! Am I right to conclude, then, that it is generally better not to remove features with low sequence counts, because they may improve the parameter estimates, and they will be excluded from the final results anyway by the automatic independent filtering?
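In case it helps other readers: the threshold chosen by the automatic independent filtering can be inspected on the results object (object names follow my script above):

```r
res <- results(Barley_deseqanalysis)
# base-mean cutoff and quantile chosen by independent filtering
metadata(res)$filterThreshold
# summary() also reports how many rows were removed by the filter
summary(res)
```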
Thanks for your reply! However, on the dispersion plot (generated with plotDispEsts) the fitted values don't seem to follow the data very well when I use the full matrix, whereas I get a better fit if I remove the features with low sequence counts. I may be misunderstanding this, but doesn't that indicate that trimming the matrix actually improves the parameter estimates?
If the trend is not good, you could try an alternative fitType. I would try this before filtering, but of course it's up to you as the analyst. If filtering gives a better dispersion plot, then that makes sense. And I think you now understand the concepts behind how the results can change with different sets of genes provided to the algorithm.
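For example, trying the alternative trend fits would look something like this (the default in the script above is the parametric fit; object names follow that script):

```r
# local regression fit of the mean-dispersion trend
dds_local <- DESeq(Barley_DESeq2, fitType="local")
plotDispEsts(dds_local)

# or, if neither parametric nor local follows the data, a flat mean fit
dds_mean <- DESeq(Barley_DESeq2, fitType="mean")
plotDispEsts(dds_mean)
```

You can then compare the dispersion plots from the different fits and keep the one where the red trend line tracks the gene-wise estimates best.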