Question

Analyzing a subgroup of samples: which result should I trust?

Raymond ▴ 20

Dear friends, I have 6 groups of animals (A, B, C, D, E, F), each containing 7 samples. I ran DESeq and compared the DEGs between every pair of groups.

Later, I found that group F is biologically far away from the other groups (all 6 groups were in the same batch). So I ran DESeq again after dropping the samples in group F. I calculated the DEGs again and found that the new DEG table differs from the previous one. For example, for A vs B, I found 373 DEGs (padj < 0.1) when I included all 6 groups. When I removed group F, only 239 DEGs (padj < 0.1) were identified.

The question is: which method should I use in this case?

My code:

# Get the DEGs between group A and B, where dds_6groups contains all samples

res_AB_6groups <- results(dds_6groups,contrast=c("groups","A","B"))

res_1_rmNA <- res_AB_6groups[! is.na(res_AB_6groups$padj),]

res_1_p10<-res_1_rmNA[(res_1_rmNA$padj<0.1),]

rownames(res_1_p10)

#373 DEGs identified

# Recalculate the dds object, removing group F

dds_5groups <- dds_6groups[, !(dds_6groups$groups %in% c("F"))]

dds_5groups$groups <- droplevels(dds_5groups$groups)

dds_5groups <- DESeq(dds_5groups)

# Get the DEGs between group A and B from the new dds object

res_AB_5groups <- results(dds_5groups,contrast=c("groups","A","B"))

res_2_rmNA <- res_AB_5groups[! is.na(res_AB_5groups$padj),]

res_2_p10<-res_2_rmNA[(res_2_rmNA$padj<0.1),]

rownames(res_2_p10) 

#239 DEGs identified

deseq2 rna-seq

score 0 · Answer 1 · 2018-05-02

Remember that gene dispersions are estimated using all samples, which means removing a group from the analysis will affect the dispersion estimation. It appears that when group F was included, the estimated gene dispersions were probably smaller on average, leading to more significant p-values. As for which set of results to trust, you have to decide whether it makes sense to include group F in the dispersion estimation step. I don't see any reason why group F should not be included. Just because there are large differences between group F and other groups, that doesn't mean that the intra-group variance is different enough to justify excluding it. So unless you have a specific reason not to, I would include all the samples from all groups in your analysis.
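If you want to see how much the dispersion estimates actually move when group F is dropped, you can compare them gene by gene. Below is a minimal sketch on simulated counts, not your data: the count matrix, the group F mean shift, and names such as `disp_6`, `disp_5` and `log2_ratio` are assumptions made for illustration. With your own objects you would skip the simulation and call `dispersions()` on `dds_6groups` and `dds_5groups` directly.

```r
# Sketch: compare per-gene dispersion estimates with and without group F.
# The data are simulated; all object names are illustrative.
library(DESeq2)

set.seed(1)
n_genes <- 500
groups <- factor(rep(LETTERS[1:6], each = 7))

# Negative binomial counts: gene means span a wide range, and every group
# shares the same true dispersion (1 / size = 0.2).
mu <- 2^runif(n_genes, 4, 10)
counts <- matrix(rnbinom(n_genes * length(groups), mu = mu, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_along(groups))))

# Make group F "far away" in mean expression only: 4x higher for half the genes.
is_F <- groups == "F"
counts[1:250, is_F] <- rnbinom(250 * sum(is_F), mu = 4 * mu[1:250], size = 5)
storage.mode(counts) <- "integer"

coldata <- data.frame(groups = groups, row.names = colnames(counts))
dds_6groups <- DESeqDataSetFromMatrix(counts, coldata, design = ~ groups)
dds_6groups <- DESeq(dds_6groups, quiet = TRUE)

# Drop group F, drop the unused factor level, and refit.
dds_5groups <- dds_6groups[, dds_6groups$groups != "F"]
dds_5groups$groups <- droplevels(dds_5groups$groups)
dds_5groups <- DESeq(dds_5groups, quiet = TRUE)

# Final per-gene dispersion estimates from each fit, in the same gene order.
disp_6 <- dispersions(dds_6groups)
disp_5 <- dispersions(dds_5groups)

# Values below 0 mean the 6-group fit gave the smaller dispersion.
log2_ratio <- log2(disp_6 / disp_5)
summary(log2_ratio)
```

If the ratios are centred near zero, including group F is doing little harm. A consistent shift in one direction would suggest that group F's within-group variability differs from that of the other groups.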

If you are really concerned with different groups having different dispersions, you could experiment with limma's voomWithQualityWeights function, to see if the estimated weights seem dependent on group membership or not.
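A rough sketch of that check, again on simulated counts rather than your data. The count matrix and the name `sample_weights` are assumptions for illustration, and I am relying on the documented behaviour that `voomWithQualityWeights` stores the estimated sample weights in the `sample.weights` column of the returned object's `targets` data frame. For a real analysis you would normally compute normalization factors first (for example with edgeR) and pass a DGEList rather than a raw matrix.

```r
# Sketch: check whether limma's estimated sample quality weights depend on group.
# The data are simulated; all object names are illustrative.
library(limma)

set.seed(1)
n_genes <- 500
groups <- factor(rep(LETTERS[1:6], each = 7))

# Negative binomial counts with the same dispersion in every group.
mu <- 2^runif(n_genes, 4, 10)
counts <- matrix(rnbinom(n_genes * length(groups), mu = mu, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_along(groups))))

# One coefficient per group (no intercept).
design <- model.matrix(~ 0 + groups)

# voom precision weights combined with per-sample quality weights.
v <- voomWithQualityWeights(counts, design = design, plot = FALSE)

# One weight per sample; lower values flag more variable samples.
sample_weights <- v$targets$sample.weights

# Mean weight per group: a group that sits well below the others
# would hint at higher within-group variability.
tapply(sample_weights, groups, mean)
```

If group F's weights look similar to those of the other groups, that supports keeping all samples in the model.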