Question

subtle differences in results based on how Design Table is coded

0

Entering edit mode

germaximus • 0

@germaximus-22975

Last seen 5.1 years ago

I'm trying to do "combine variables" approach in a two-factor experimental design. To create combinations of variables I follow the example from here http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions

Strangely, the list of differentially expressed genes is different based on how I setup colData:

strain <- c('WT', 'WT', 'KO', 'KO', 'KO', 'WT','WT','WT','WT' ,'KO', 'KO','KO')
condition <- c('A','A','A','A','A','B','B','B','B','B','B')
colData <- data.frame(sample = colnames(counts),
                      group = factor(paste0(strain, "_", condition),
                              levels=c('WT_A', 'WT_B', 'KO_A', 'KO_B')),
                      stringsAsFactors = FALSE
) 
#same thing without explicitly listing factor levels
colData <- data.frame(sample = colnames(counts),
                      group = factor(paste0(strain, "_", condition)                                  
                      stringsAsFactors = FALSE
) 

#these two colData objects produce different results when I run
dds <- DESeqDataSetFromMatrix(countData = counts[, colData$sample],
                              colData = colData,
                              design =~ group)
dds <- dds[rowMeans(counts(dds)) >= 10, ]
dds <- DESeq(dds)
result <- results(dds, contrast = c('group', 'WT_B', 'WT_A'), 
                       cooksCutoff = FALSE, 
                       independentFiltering = FALSE)

Interestingly, the first approach gives the same result as a classic linear model: strain + condition + strain:condition. Is there some subtle issue with how DESeq expects the factors to be supplied in the design?

The difference in DE gene lists is also quite peculiar. It is not like all genes are scrambled, rather few of them go up and down the rank. Here is the top 10 genes and only one of them is misplaced in each set.

Irx2 Zic2 Ltbp1 Etv4 Akr1b7 Mgam Efhd1 Ugt1a6a Sgms2

Irx2 C1s1 Ltbp1 Etv4 Akr1b7 Mgam Efhd1 Ugt1a6a Sgms2

deseq2 • 603 views

ADD COMMENT • link updated 5.1 years ago by Michael Love 43k • written 5.1 years ago by germaximus • 0

score 0 · Answer 1 · 2020-02-24

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 8 days ago

United States

Can you take a look at a plot of the p-values (you don't need to post)? Is it the case that the p-values are the same except for a few genes? Or are the p-values different for most or all genes?

ADD COMMENT • link 5.1 years ago Michael Love 43k

0

Entering edit mode

Only a few genes are off

figure: https://imgur.com/a/ZUlq5g3

ADD REPLY • link 5.1 years ago germaximus • 0

1

Entering edit mode

Then it seems to be due to which group is the reference level, and likely happening when one or the other group has all zeros, and the beta vector may have non-finite solutions without a prior.

I'd recommend using an LRT instead of Wald test here, by providing full and reduced matrices and specifying test="LRT" when you run DESeq(). So you can form a full matrix with model.matrix(design(dds), colData(dds)), and then you can remove the column from this matrix that represents the WT B vs A difference and provide that smaller matrix to reduced.

ADD REPLY • link 5.1 years ago Michael Love 43k

0

Entering edit mode

Thank you Michael, you are correct. The misbehaving genes do have all zeroes in some groups. I chose to eliminate them with a more stringent filter criteria. In fact, I removed them after the analysis by detecting unrealistically high logFC values.

Do you think transforming them as (n+1) would be an acceptable alternative solution?

ADD REPLY • link 5.1 years ago germaximus • 0

0

Entering edit mode

I prefer filtering to adding counts. And also LRT should be less susceptible to these releveling differences.

ADD REPLY • link 5.1 years ago Michael Love 43k