I'm trying to do "combine variables" approach in a two-factor experimental design. To create combinations of variables I follow the example from here http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions
Strangely, the list of differentially expressed genes is different based on how I setup colData:
strain <- c('WT', 'WT', 'KO', 'KO', 'KO', 'WT','WT','WT','WT' ,'KO', 'KO','KO')
condition <- c('A','A','A','A','A','B','B','B','B','B','B')
colData <- data.frame(sample = colnames(counts),
group = factor(paste0(strain, "_", condition),
levels=c('WT_A', 'WT_B', 'KO_A', 'KO_B')),
stringsAsFactors = FALSE
)
#same thing without explicitly listing factor levels
colData <- data.frame(sample = colnames(counts),
group = factor(paste0(strain, "_", condition)
stringsAsFactors = FALSE
)
#these two colData objects produce different results when I run
dds <- DESeqDataSetFromMatrix(countData = counts[, colData$sample],
colData = colData,
design =~ group)
dds <- dds[rowMeans(counts(dds)) >= 10, ]
dds <- DESeq(dds)
result <- results(dds, contrast = c('group', 'WT_B', 'WT_A'),
cooksCutoff = FALSE,
independentFiltering = FALSE)
Interestingly, the first approach gives the same result as a classic linear model: strain + condition + strain:condition. Is there some subtle issue with how DESeq expects the factors to be supplied in the design?
The difference in DE gene lists is also quite peculiar. It is not like all genes are scrambled, rather few of them go up and down the rank. Here is the top 10 genes and only one of them is misplaced in each set.
Irx2 Zic2 Ltbp1 Etv4 Akr1b7 Mgam Efhd1 Ugt1a6a Sgms2
Irx2 C1s1 Ltbp1 Etv4 Akr1b7 Mgam Efhd1 Ugt1a6a Sgms2
figure: https://imgur.com/a/ZUlq5g3
Then it seems to be due to which group is the reference level, and likely happening when one or the other group has all zeros, and the beta vector may have non-finite solutions without a prior.
I'd recommend using an LRT instead of Wald test here, by providing
full
andreduced
matrices and specifyingtest="LRT"
when you runDESeq()
. So you can form a full matrix withmodel.matrix(design(dds), colData(dds))
, and then you can remove the column from this matrix that represents the WT B vs A difference and provide that smaller matrix toreduced
.Thank you Michael, you are correct. The misbehaving genes do have all zeroes in some groups. I chose to eliminate them with a more stringent filter criteria. In fact, I removed them after the analysis by detecting unrealistically high logFC values.
Do you think transforming them as (n+1) would be an acceptable alternative solution?
I prefer filtering to adding counts. And also LRT should be less susceptible to these releveling differences.