Hello,
I've noticed that the number of features included in the data matrix influences the Wald statistic and fold change values in the DESeq2 output. I don't understand why this is so.
The data set I'm using comes from a very simple design: one treatment with two levels (treated and control), each with 6 replicates. There are 3074 features (in this case operational taxonomic units), so I have a 3074 x 12 matrix. Many of the features have very low sequence counts. I assumed I wouldn't get reliable inference from these anyway, so it seemed better to remove them and reduce the number of simultaneous tests. I therefore deleted 2290 features and ran the resulting 784 x 12 matrix through DESeq2. The result was that no features had significantly different sequence counts between the two sample groups. Out of curiosity, I prepared a second matrix from which I deleted only 1990 features, leaving a 1084 x 12 matrix. With this matrix, DESeq2 identified a couple of features with significantly different sequence counts between the two sample groups, even though these features had also been present in the first matrix. The 784 features included in both matrices had the same base mean values in the two DESeq2 outputs, but different Wald statistic and fold change values (and, of course, different p-values).
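For clarity, the filtering step I mean was along these lines (the threshold of 10 total reads is just illustrative, and Barley_matrix is the raw count matrix from my script below):

```r
# keep only features with at least 10 reads summed across all 12 samples
# (the cutoff is an arbitrary choice on my part)
keep <- rowSums(Barley_matrix) >= 10
Barley_matrix_filtered <- Barley_matrix[keep, ]
```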
Could someone please explain why the Wald statistic and fold change values calculated for a feature depend on whether 300 extra features are included in the data matrix or not (when the base mean values stay the same)? Can I trust either of the two sets of results?
Thank you very much!
I used R 3.3.1 and DESeq2 1.12.3 with the default settings. Here is an example of the script I used:
> Barley_DESeq2=DESeqDataSetFromMatrix(countData=Barley_matrix, colData=grouping, design=~Treatment)
> Barley_DESeq2
class: DESeqDataSet
dim: 1084 12
metadata(1): version
assays(1): counts
rownames(1084): SV1 SV2 ... SV1083 SV1084
rowData names(0):
colnames(12): ambient_44 ambient_46 ... elevated_52 elevated_55
colData names(1): Treatment
> Barley_deseqanalysis=DESeq(Barley_DESeq2)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> Barley_deseqresults=results(Barley_deseqanalysis)
> Barley_deseqresults
log2 fold change (MAP): Treatment elevated vs ambient
Wald test p-value: Treatment elevated vs ambient
DataFrame with 1084 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
SV1 2990.215 0.8853490 0.6773994 1.3069822 0.19121874 0.9991334
SV2 2531.758 0.2078180 0.3927295 0.5291632 0.59669225 0.9991334
SV3 2348.142 0.6548006 0.5757752 1.1372504 0.25543363 0.9991334
SV4 2276.444 1.1717034 0.5737804 2.0420763 0.04114396 0.6707282
SV5 1693.010 0.9203465 0.5030663 1.8294737 0.06732867 0.6707282
... ... ... ... ... ... ...
Thank you for the explanation! Am I right to conclude, then, that it is generally better not to remove features with low sequence counts, because they may improve the parameter estimates, and they will be excluded from the final results anyway by the automatic independent filtering?
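In case it helps other readers: the threshold chosen by the automatic independent filtering can be inspected on the results object (object names follow my script above):

```r
res <- results(Barley_deseqanalysis)
# base-mean cutoff and quantile chosen by independent filtering
metadata(res)$filterThreshold
# summary() also reports how many rows were removed by the filter
summary(res)
```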
Thanks for your reply! However, on the dispersion plot (generated with plotDispEsts) the fitted values don't seem to follow the data very well when I use the full matrix, whereas I get a better fit if I remove the features with low sequence counts. I may be misunderstanding this, but doesn't that indicate that trimming the matrix actually improves the parameter estimates?
If the trend is not good, you could try an alternative fitType. I would try this before filtering, but of course it's up to you as the analyst. If filtering gives a better dispersion plot, then that makes sense. And I think you now understand the concepts behind how the results can change with different sets of genes provided to the algorithm.
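For example, trying the alternative trend fits would look something like this (the default in the script above is the parametric fit; object names follow that script):

```r
# local regression fit of the mean-dispersion trend
dds_local <- DESeq(Barley_DESeq2, fitType="local")
plotDispEsts(dds_local)

# or, if neither parametric nor local follows the data, a flat mean fit
dds_mean <- DESeq(Barley_DESeq2, fitType="mean")
plotDispEsts(dds_mean)
```

You can then compare the dispersion plots from the different fits and keep the one where the red trend line tracks the gene-wise estimates best.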