Question

discrepance in DESeq2 results with different design structures

0

Entering edit mode

Assa Yeroslaviz ★ 1.5k

@assa-yeroslaviz-1597

Last seen 12 weeks ago

Germany

Hi,

I am doing a DESeq2 analysis using a data set of nine different conditions/time-points.
I first ran the analysis with the complete data set in the count table and than have used the results function to extract the pair-wise results of the DE analysis.

>count_table <- read.delim2("count_table_complete.txt",row.names=1)

>colData <- read.delim2("metaData.txt")
>levels(colData$condition)
[1] "CR4W0h"  "CR4W24h" "CR4W4h"  "CTRL0h"  "CTRL24h" "CTRL4h"  "HP0h"    "HP24h"   "HP4h"   

>cds <- DESeqDataSetFromMatrix (
  countData= count_table[,-1],
  colData    = colData, 
  design     = ~condition )

>fit = DESeq(cds)

>res = results(fit, contrast=c("condition", "CTRL0h" , "HP0h") )
> res
log2 fold change (MAP): condition CTRL0h vs HP0h
Wald test p-value: condition CTRL0h vs HP0h
DataFrame with 39179 rows and 6 columns
                      baseMean log2FoldChange      lfcSE       stat     pvalue      padj
                     <numeric>      <numeric>  <numeric>  <numeric>  <numeric> <numeric>
ENSMUSG00000000001 [B][COLOR="Red"]3093.215856[/COLOR][/B]    0.002814819 0.08192671 0.03435777 0.97259186  0.998355

When doing so, I get only 3 genes with an adjusted p-value below 0.05

>table(res$padj<=0.05)

FALSE  TRUE
1956     3

But if i ran the same analysis with the subset of the data which contains only the columns from the count table for "CTRL0h" and "HP0h" I get many more significant genes.

> count_table <- read.delim2("count_table_complete.txt",row.names=1)
> count_subset <- count_table[,c(2:5,23:26)]
> head(count_subset)
                    C20  C22  C23  C24 HP10 HP11 HP12 HP14
ENSMUSG00000000001 2811 2360 3053 3334 2636 2736 2505 3282

> colData_subset <- colData[c(1:4, 23:26),]  

>cds_subset <- DESeqDataSetFromMatrix (
  countData= count_subset,
  colData    = colData_subset,  
  design     = ~condition )

>fit_subset = DESeq(cds_subset)

>res_subset = results(fit_subset, contrast=c("condition", "CTRL0h" , "HP0h") )
> res_subset
log2 fold change (MAP): condition CTRL0h vs HP0h
Wald test p-value: condition CTRL0h vs HP0h
DataFrame with 39179 rows and 6 columns
                       baseMean log2FoldChange      lfcSE       stat      pvalue      padj
                      <numeric>      <numeric>  <numeric>  <numeric>   <numeric> <numeric>
ENSMUSG00000000001 [B][COLOR="Red"]2810.6322989 [/COLOR][/B]  -0.003504429 0.05504337 -0.0636667 0.949235621 0.9938743

> table(res$padj<= 0.05)

FALSE  TRUE
13122   321

first I see a difference in the baseMean values. I guess this might be because for some of the conditions/TP I have more samples than the others, so that the baseMean is calculated differently (Am I correct in this assumption?)

But I can't figure out why I get such a big different in the number of DE genes between the two approaches.

Is it because of the differences in the independent filtering step, DESeq2 is automatically doing?
I have tried to deactivate the independent filtering in the results function, but it didn't make the results better.

Any other suggestions?

thanks,
Assa

PS

I have posted this question also on seqanswers, but as I got no answers yet I am posting it also here.

deseq2 differential expression independent filtering pre-filtering design matrix • 2.6k views

ADD COMMENT • link 9.0 years ago Assa Yeroslaviz ★ 1.5k

1

Entering edit mode

Jenny Drnevich ★ 2.0k

@jenny-drnevich-2812

Last seen 13 days ago

United States

I just want to second Michael's answer. I just ran into a similar issue today with an experiment with 4 treatments and 2 different tissues from each of 29 individuals. The two different tissues had drastically different amounts of within-group variance, such that when analyzed separately vs. together the number of genes showing treatment differences in the low variance tissue went from ~3000 to only 4 and the number in the high variance tissue went from ~4000 to ~8000. Therefore, a separate analysis for each tissue was more appropriate.

ADD COMMENT • link 9.0 years ago Jenny Drnevich ★ 2.0k

0

Entering edit mode

Assa Yeroslaviz ★ 1.5k

@assa-yeroslaviz-1597

Last seen 12 weeks ago

Germany

So - is there a way to test in advance which of the two methods is better for a specific data set?

Michael mentioned the PCS plot. If, after the plotting, I see that the samples are close (enough) together, does this means that analysing the complete data set in one group should yield more results. I have plotted here two data sets I am working on. The mRNA-Seq data (first image) gave better results using only the pair-wise samples in the DESeq object. the miRNA (2nd. figure) data set was done with the combined samples in one DESeq object to yield better results.

plot_PCA_m_RNAseq_ribominus plot_PCA_miRNA

If I understand it correctly from Michael explanation, the first data set shows a higher within group variance and was therefore better with a pair-wise analysis than the miRNA data set.

is this assumption correct?

Thanks

Assa

ADD COMMENT • link 9.0 years ago Assa Yeroslaviz ★ 1.5k

2

Entering edit mode

There are statistical tests for equality of variance but AFAIK no one ever bothers to do them for microarray/RNA-Seq data. In general, I advise against trying different statistical tests and picking the one that gives you the "best" answer, but equality of variances is an assumption in parametric statistics. If that assumption is violated, then it's a valid reason to switch to a different statistical test. It looks like you have more than two groups in the reduced-variance cluster on the left, plus one replicate of a different treatment? Maybe a sample is mis-labeled? That alone could be driving up the dispersion estimate because otherwise the differences in within-group variation don't look that extreme and the amount of between-group variation is quite large. I wouldn't pull out just two groups, but maybe all the groups in that cluster. However, you won't be able to make any comparisons with the other groups unless you do include them all in one analysis.

You might want to consult with a local statistician to get more custom help.

Cheers,

Jenny

ADD REPLY • link 9.0 years ago Jenny Drnevich ★ 2.0k

score 3 · Accepted Answer · 2015-04-30

The baseMean column is the mean of normalized counts for all samples in the DESeqDataSet (it does not change based on the contrast call).

The reason for the difference in # of DEG is not due to independent filtering, but due to dispersion estimation. The dispersion parameter for the gene is estimated within DESeq() using all the groups, and if some groups have higher dispersions, then it will raise the final estimate of dispersion for the gene. Note that this is before you call results(), so it doesn't know yet that you are going to test one pair of groups or another.

p-values are also tail probabilities and therefore very sensitive to small changes to parameters like dispersion. So it's not surprising to see changes to # DEG actually, if you change the set of samples (here removing some) which are used to estimate the dispersion.

Take a look at the plotPCA for your data (see vignette) and likely the other samples will show more within-group variance.

Subsetting to only pairs of groups for running DESeq() can be useful when the within-group variance is very different across groups.