I used DESeq2 to perform differential expression between three groups (condition1, condition2 and condition3). Samples also come from three different countries.
PCA (using rlog transformed data) showed a similar pattern of clustering using country and condition (please see link) So I controlled for country as well (~ country + condition).
The MA plot showed two significant genes extremely affected by the condition, but upon inspection of the counts with plotCounts(), I see that there are only a few samples driving the difference. For one gene is even just one sample (please see link).
I checked other DEGs not as extreme and the plot of counts seemed much more reasonable.
Somehow outliers are being allowed to affect the coefficient estimation, but I am not sure why is that. This is how I ran the computations:
des <- model.matrix(~ 0+CountryCode + condition, data = colData(dds)) dds2 <- DESeq(dds, full = des, betaPrior = FALSE) # dds is a deseq data set with design ~1 dds2_cond3_vs_cond1 <- results(dds2, name = "conditionCond3", alpha = 0.05) dds2_cond2_vs_cond1 <- results(dds2, name = "conditionCond2", alpha = 0.05) dds2_cond2_vs_cond3 <-results(dds2, alpha = 0.05, contrast = list("conditionCond2", "conditionCond3"))
I would like ask for guidance on how to deal with those extremely regulated DEGs, and also if the data has been modeled properly, considering the highly overlapping effects of country and and condition. I want to make sure I am not in a "garbage in garbage out" type situation.
Thank you in advance!