Hi,
I am doing a DESeq2 analysis using a data set of nine different conditions/time-points.
I first ran the analysis with the complete data set in the count table and than have used the results function to extract the pair-wise results of the DE analysis.
>count_table <- read.delim2("count_table_complete.txt",row.names=1) >colData <- read.delim2("metaData.txt") >levels(colData$condition) [1] "CR4W0h" "CR4W24h" "CR4W4h" "CTRL0h" "CTRL24h" "CTRL4h" "HP0h" "HP24h" "HP4h" >cds <- DESeqDataSetFromMatrix ( countData= count_table[,-1], colData = colData, design = ~condition ) >fit = DESeq(cds) >res = results(fit, contrast=c("condition", "CTRL0h" , "HP0h") ) > res log2 fold change (MAP): condition CTRL0h vs HP0h Wald test p-value: condition CTRL0h vs HP0h DataFrame with 39179 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSMUSG00000000001 [B][COLOR="Red"]3093.215856[/COLOR][/B] 0.002814819 0.08192671 0.03435777 0.97259186 0.998355
When doing so, I get only 3 genes with an adjusted p-value below 0.05
>table(res$padj<=0.05) FALSE TRUE 1956 3
But if i ran the same analysis with the subset of the data which contains only the columns from the count table for "CTRL0h" and "HP0h" I get many more significant genes.
> count_table <- read.delim2("count_table_complete.txt",row.names=1) > count_subset <- count_table[,c(2:5,23:26)] > head(count_subset) C20 C22 C23 C24 HP10 HP11 HP12 HP14 ENSMUSG00000000001 2811 2360 3053 3334 2636 2736 2505 3282 > colData_subset <- colData[c(1:4, 23:26),] >cds_subset <- DESeqDataSetFromMatrix ( countData= count_subset, colData = colData_subset, design = ~condition ) >fit_subset = DESeq(cds_subset) >res_subset = results(fit_subset, contrast=c("condition", "CTRL0h" , "HP0h") ) > res_subset log2 fold change (MAP): condition CTRL0h vs HP0h Wald test p-value: condition CTRL0h vs HP0h DataFrame with 39179 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSMUSG00000000001 [B][COLOR="Red"]2810.6322989 [/COLOR][/B] -0.003504429 0.05504337 -0.0636667 0.949235621 0.9938743 > table(res$padj<= 0.05) FALSE TRUE 13122 321
first I see a difference in the baseMean values. I guess this might be because for some of the conditions/TP I have more samples than the others, so that the baseMean is calculated differently (Am I correct in this assumption?)
But I can't figure out why I get such a big different in the number of DE genes between the two approaches.
Is it because of the differences in the independent filtering step, DESeq2 is automatically doing?
I have tried to deactivate the independent filtering in the results function, but it didn't make the results better.
Any other suggestions?
thanks,
Assa
PS
I have posted this question also on seqanswers, but as I got no answers yet I am posting it also here.
There are statistical tests for equality of variance but AFAIK no one ever bothers to do them for microarray/RNA-Seq data. In general, I advise against trying different statistical tests and picking the one that gives you the "best" answer, but equality of variances is an assumption in parametric statistics. If that assumption is violated, then it's a valid reason to switch to a different statistical test. It looks like you have more than two groups in the reduced-variance cluster on the left, plus one replicate of a different treatment? Maybe a sample is mis-labeled? That alone could be driving up the dispersion estimate because otherwise the differences in within-group variation don't look that extreme and the amount of between-group variation is quite large. I wouldn't pull out just two groups, but maybe all the groups in that cluster. However, you won't be able to make any comparisons with the other groups unless you do include them all in one analysis.
You might want to consult with a local statistician to get more custom help.
Cheers,
Jenny