Entering edit mode
megapode32559
•
0
@megapode32559-23129
Last seen 4.7 years ago
I have a dataset with 3-factor levels (A, B, C). If I use the data of A, B, C to perform a pairwise comparison between A and B, I got a result very different from that when I just use the data of A and B.
Is there a bug in DESeq()
of DESeq2?
In the example,
deg_output.tsv
has 1857 non-NA padj
deg_output2.tsv
has 16209 non-NA padj
this is a quite big difference.
The data is here.
https://filebin.net/izu2idggjyorzuof
The metadata is here.
$ cat meta_data.tsv
sample_id treatment
A0116300 A
A0216455 A
A0316316 A
A0416309 A
B0516310 B
B0616470 B
B0716468 B
B0816450 B
C0916315 C
C1016460 C
C1116453 C
C1216301 C
The code is here.
==> main.R <==
# vim: set noexpandtab tabstop=2:
count_file='count_data.tsv'
sample_file='meta_data.tsv'
condA='A'
condB='B'
read.tsv = function(file, header = T, sep = "\t", quote = "", check.names = F, comment.char = "", stringsAsFactors = F, ...) {
read.table(file, header = header, sep = sep, quote = quote, check.names = check.names, comment.char = comment.char, stringsAsFactors = stringsAsFactors, ...)
}
sample_table = read.table(sample_file, header=T)
count_table = read.table(count_file, header=T)[, sample_table[[1]], drop=F]
count_table = count_table[rowSums(count_table) > 0, ]
design = as.formula(sprintf('~%s', names(sample_table)[[2]]))
suppressWarnings(suppressPackageStartupMessages(library(DESeq2)))
cds = suppressWarnings(
DESeqDataSetFromMatrix(
countData = count_table
, colData = sample_table
, design = design
)
)
cds = DESeq(cds)
res=data.frame(
results(cds
, contrast=c(names(sample_table)[[2]], condA, condB)
)
)
write.table(
cbind(symbol=rownames(res), res)
, quote = F, sep = "\t", row.names = F
, file='deg_output.tsv'
)
==> main2.R <==
# vim: set noexpandtab tabstop=2:
count_file='count_data.tsv'
sample_file='meta_data.tsv'
condA='A'
condB='B'
read.tsv = function(file, header = T, sep = "\t", quote = "", check.names = F, comment.char = "", stringsAsFactors = F, ...) {
read.table(file, header = header, sep = sep, quote = quote, check.names = check.names, comment.char = comment.char, stringsAsFactors = stringsAsFactors, ...)
}
sample_table = read.table(sample_file, header=T)[1:8, ]
count_table = read.table(count_file, header=T)[, sample_table[[1]], drop=F]
count_table = count_table[rowSums(count_table) > 0, ]
design = as.formula(sprintf('~%s', names(sample_table)[[2]]))
suppressWarnings(suppressPackageStartupMessages(library(DESeq2)))
cds = suppressWarnings(
DESeqDataSetFromMatrix(
countData = count_table
, colData = sample_table
, design = design
)
)
cds = DESeq(cds)
res=data.frame(
results(cds
, contrast=c(names(sample_table)[[2]], condA, condB)
)
)
write.table(
cbind(symbol=rownames(res), res)
, quote = F, sep = "\t", row.names = F
, file='deg_output2.tsv'
)
If linear models were used and the variances are consistent across factor groups, of course, the data of all factors should be used as this would give a more stable estimate of the variance. Even though, it should not have such a big difference in the results when just two-factor data are used.
In the DESeq2 results, there is almost 10x difference. Therefore, I don't think this makes sense.
"Here we diagram an extreme range of within-group variability with a simulated dataset. Typically, it is recommended to run DESeq across samples from all groups, for datasets with multiple groups. However, this simulated dataset shows a case where it would be preferable to compare groups A and B by creating a smaller dataset without the C samples. Group C has much higher within-group variability, which would inflate the per-gene dispersion estimate for groups A and B as well:"
"If I have multiple groups, should I run all together or split into pairs of groups?" Do you refer to this FAQ.
So, doesn't this mean this is a bug in DESeq2? If the variance are not the same among groups, why does it assume so?
But for this specific dataset, could you check what is the reason to cause such a difference? I don't think just citing an FAQ can answer a question for a specific dataset. Thanks.
I recommend that you work with a statistician for questions about analysis of your particular dataset. The point of the support site is for developers to provide guidance on specific software usage questions, not for developers to be universally available for data analysis questions.
You are free to use a different software as you feel necessary.