subtle differences in results based on how Design Table is coded
1
0
Entering edit mode
germaximus • 0
@germaximus-22975
Last seen 4.9 years ago

I'm trying to do "combine variables" approach in a two-factor experimental design. To create combinations of variables I follow the example from here http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions

Strangely, the list of differentially expressed genes is different based on how I setup colData:

strain <- c('WT', 'WT', 'KO', 'KO', 'KO', 'WT','WT','WT','WT' ,'KO', 'KO','KO')
condition <- c('A','A','A','A','A','B','B','B','B','B','B')
colData <- data.frame(sample = colnames(counts),
                      group = factor(paste0(strain, "_", condition),
                              levels=c('WT_A', 'WT_B', 'KO_A', 'KO_B')),
                      stringsAsFactors = FALSE
) 
#same thing without explicitly listing factor levels
colData <- data.frame(sample = colnames(counts),
                      group = factor(paste0(strain, "_", condition)                                  
                      stringsAsFactors = FALSE
) 

#these two colData objects produce different results when I run
dds <- DESeqDataSetFromMatrix(countData = counts[, colData$sample],
                              colData = colData,
                              design =~ group)
dds <- dds[rowMeans(counts(dds)) >= 10, ]
dds <- DESeq(dds)
result <- results(dds, contrast = c('group', 'WT_B', 'WT_A'), 
                       cooksCutoff = FALSE, 
                       independentFiltering = FALSE)

Interestingly, the first approach gives the same result as a classic linear model: strain + condition + strain:condition. Is there some subtle issue with how DESeq expects the factors to be supplied in the design?

The difference in DE gene lists is also quite peculiar. It is not like all genes are scrambled, rather few of them go up and down the rank. Here is the top 10 genes and only one of them is misplaced in each set.

Irx2 Zic2 Ltbp1 Etv4 Akr1b7 Mgam Efhd1 Ugt1a6a Sgms2

Irx2 C1s1 Ltbp1 Etv4 Akr1b7 Mgam Efhd1 Ugt1a6a Sgms2

deseq2 • 535 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

Can you take a look at a plot of the p-values (you don't need to post)? Is it the case that the p-values are the same except for a few genes? Or are the p-values different for most or all genes?

ADD COMMENT
0
Entering edit mode
ADD REPLY
1
Entering edit mode

Then it seems to be due to which group is the reference level, and likely happening when one or the other group has all zeros, and the beta vector may have non-finite solutions without a prior.

I'd recommend using an LRT instead of Wald test here, by providing full and reduced matrices and specifying test="LRT" when you run DESeq(). So you can form a full matrix with model.matrix(design(dds), colData(dds)), and then you can remove the column from this matrix that represents the WT B vs A difference and provide that smaller matrix to reduced.

ADD REPLY
0
Entering edit mode

Thank you Michael, you are correct. The misbehaving genes do have all zeroes in some groups. I chose to eliminate them with a more stringent filter criteria. In fact, I removed them after the analysis by detecting unrealistically high logFC values.

Do you think transforming them as (n+1) would be an acceptable alternative solution?

ADD REPLY
0
Entering edit mode

I prefer filtering to adding counts. And also LRT should be less susceptible to these releveling differences.

ADD REPLY

Login before adding your answer.

Traffic: 611 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6