Why does rmoving features with low sequence counts influence the Wald statistic and fold change values in the DESeq2 output?
1
0
Entering edit mode
@szoboszlaym-11050
Last seen 8.4 years ago

Hello,

I've noticed that the number of variables included in the data matrix influence the Wald statistic and fold change values in the DESeq2 output. I don't understand why is this so.

The data set I'm using is from a very simple design: 1 treatment with two levels (treated and control), each with 6 replicates. There are 3074 features (in this case these are operational taxonomic units), so I have a 3074 x 12 matrix. A lot of the features have very low sequence counts. I thought that I won't get reliable inference from these anyway, so it's better to remove them to decrease the number of simultaneous tests I have to run. So I deleted 2290 features and plugged the resulting 784 x 12 matrix into DESeq2. The result was that no features had significantly different sequence counts in the two sample groups. Out of curiosity, I prepared a second matrix from which I only deleted 1990 features leaving an 1084 x 12 matrix. In this matrix, DESeq2 identified a couple of features with significantly different sequence counts in the two sample groups. These features were however included in the first matrix too. The 784 features that were included in both matrices had the same base mean values in the the two DESeq2 outputs, but different values for the Wald statistic and fold change (and of course p-value).

Could someone please explain to me why the Wald statistic and fold change values calculated for a feature depend on whether 300 extra features are included in the data matrix, or not (when the base mean values are still the same)? Can I trust either of the two sets of results?

Thank you very much!

I used R 3.3.1 and DESeq2 1.12.3 with the default settings. Here is an example of the scripts I used:

> Barley_DESeq2=DESeqDataSetFromMatrix(countData=Barley_matrix, colData=grouping, design=~Treatment)
> Barley_DESeq2
class: DESeqDataSet
dim: 1084 12
metadata(1): version
assays(1): counts
rownames(1084): SV1 SV2 ... SV1083 SV1084
rowData names(0):
colnames(12): ambient_44 ambient_46 ... elevated_52 elevated_55
colData names(1): Treatment
> Barley_deseqanalysis=DESeq(Barley_DESeq2)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> Barley_deseqresults=results(Barley_deseqanalysis)
> Barley_deseqresults
log2 fold change (MAP): Treatment elevated vs ambient
Wald test p-value: Treatment elevated vs ambient
DataFrame with 1084 rows and 6 columns
        baseMean log2FoldChange     lfcSE        stat     pvalue      padj
       <numeric>      <numeric> <numeric>   <numeric>  <numeric> <numeric>
SV1     2990.215      0.8853490 0.6773994   1.3069822 0.19121874 0.9991334
SV2     2531.758      0.2078180 0.3927295   0.5291632 0.59669225 0.9991334
SV3     2348.142      0.6548006 0.5757752   1.1372504 0.25543363 0.9991334
SV4     2276.444      1.1717034 0.5737804   2.0420763 0.04114396 0.6707282
SV5     1693.010      0.9203465 0.5030663   1.8294737 0.06732867 0.6707282
...          ...            ...       ...         ...        ...       ...
deseq2 • 757 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 3 hours ago
United States

hi,

First read this post I wrote about how p-values are very sensitive to differences in model parameters:

A: Deseq2 results discrepancies because of versions of R

The second thing to take a look at is the DESeq2 methods in the paper:

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8

In DESeq2 we produce maximum of posterior estimates of fold change and dispersion by forming prior distributions over all the genes. So the genes do effect each other's results, both the fold change and the p-values. (The genes affecting each other's dispersion or variance parameters is also true for other empirical Bayes models like edgeR and limma). By removing genes, you have removed some of the information the method uses to estimate final parameters.

ADD COMMENT
0
Entering edit mode

Thank you for the explanation! Am I right to conclude than, that it is generally better not to remove any features with low sequence counts cause they may improve the parameter estimations, and they will be excluded from the final results anyway due to the automatic independent filtering?

ADD REPLY
0
Entering edit mode
Yes, exactly. The DESeq2 functions will take care of filtering for you.
ADD REPLY
0
Entering edit mode

Thanks for your reply! However, on the dispersion plot (generated with the plotDispEsts script) the fitted values don't seem to follow the data very well if I use the full matrix, but I get a better fit if I remove the features with low sequence counts. I may be misunderstanding this, but doesn't this indicate that trimming the matrix actually improve the parameter estimations?

ADD REPLY
0
Entering edit mode

If the trend is not good, you could try alternative fitType. I would try this before filtering, but of course it's up to you as the analyst. If filtering gives a better dispersion plot then that makes sense. And I think you understand now the concepts behind how the results can change with different sets of genes provided to the algorithm.

ADD REPLY

Login before adding your answer.

Traffic: 922 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6