Question

Big difference between two-factor comparison when 3 factors data or when 2 factor data are used?

0

Entering edit mode

megapode32559 • 0

@megapode32559-23129

Last seen 4.2 years ago

I have a dataset with 3-factor levels (A, B, C). If I use the data of A, B, C to perform a pairwise comparison between A and B, I got a result very different from that when I just use the data of A and B.

Is there a bug in DESeq() of DESeq2?

In the example,

deg_output.tsv has 1857 non-NA padj deg_output2.tsv has 16209 non-NA padj

this is a quite big difference.

The data is here.

https://filebin.net/izu2idggjyorzuof

The metadata is here.

$ cat meta_data.tsv
sample_id   treatment
A0116300    A
A0216455    A
A0316316    A
A0416309    A
B0516310    B
B0616470    B
B0716468    B
B0816450    B
C0916315    C
C1016460    C
C1116453    C
C1216301    C

The code is here.

==> main.R <==
# vim: set noexpandtab tabstop=2:

count_file='count_data.tsv'
sample_file='meta_data.tsv'
condA='A'
condB='B'

read.tsv = function(file, header = T, sep = "\t", quote = "", check.names = F, comment.char = "", stringsAsFactors = F, ...) {
  read.table(file, header = header, sep = sep, quote = quote, check.names = check.names, comment.char = comment.char, stringsAsFactors = stringsAsFactors, ...)
}

sample_table = read.table(sample_file, header=T)
count_table = read.table(count_file, header=T)[, sample_table[[1]], drop=F]
count_table = count_table[rowSums(count_table) > 0, ]
design = as.formula(sprintf('~%s', names(sample_table)[[2]]))

suppressWarnings(suppressPackageStartupMessages(library(DESeq2)))
cds = suppressWarnings(
  DESeqDataSetFromMatrix(
    countData = count_table
    , colData = sample_table
    , design = design
    )
  )

cds = DESeq(cds)
res=data.frame(
  results(cds
    , contrast=c(names(sample_table)[[2]], condA, condB)
    )
  )

write.table(
  cbind(symbol=rownames(res), res)
    , quote = F, sep = "\t", row.names = F
  , file='deg_output.tsv'
  )

==> main2.R <==
# vim: set noexpandtab tabstop=2:

count_file='count_data.tsv'
sample_file='meta_data.tsv'
condA='A'
condB='B'

read.tsv = function(file, header = T, sep = "\t", quote = "", check.names = F, comment.char = "", stringsAsFactors = F, ...) {
  read.table(file, header = header, sep = sep, quote = quote, check.names = check.names, comment.char = comment.char, stringsAsFactors = stringsAsFactors, ...)
}

sample_table = read.table(sample_file, header=T)[1:8, ]
count_table = read.table(count_file, header=T)[, sample_table[[1]], drop=F]
count_table = count_table[rowSums(count_table) > 0, ]
design = as.formula(sprintf('~%s', names(sample_table)[[2]]))

suppressWarnings(suppressPackageStartupMessages(library(DESeq2)))
cds = suppressWarnings(
  DESeqDataSetFromMatrix(
    countData = count_table
    , colData = sample_table
    , design = design
    )
  )

cds = DESeq(cds)
res=data.frame(
  results(cds
    , contrast=c(names(sample_table)[[2]], condA, condB)
    )
  )

write.table(
  cbind(symbol=rownames(res), res)
    , quote = F, sep = "\t", row.names = F
  , file='deg_output2.tsv'
  )

deseq2 • 403 views

ADD COMMENT • link updated 4.1 years ago by Michael Love 42k • written 4.1 years ago by megapode32559 • 0

score 0 · Answer 1 · 2020-03-31

0

Entering edit mode

Michael Love 42k

@mikelove

Last seen 6 hours ago

United States

Nope, not a bug. This is a property of linear models in general. This is also a FAQ in the vignette.

ADD COMMENT • link 4.1 years ago Michael Love 42k

0

Entering edit mode

If linear models were used and the variances are consistent across factor groups, of course, the data of all factors should be used as this would give a more stable estimate of the variance. Even though, it should not have such a big difference in the results when just two-factor data are used.

In the DESeq2 results, there is almost 10x difference. Therefore, I don't think this makes sense.

"Here we diagram an extreme range of within-group variability with a simulated dataset. Typically, it is recommended to run DESeq across samples from all groups, for datasets with multiple groups. However, this simulated dataset shows a case where it would be preferable to compare groups A and B by creating a smaller dataset without the C samples. Group C has much higher within-group variability, which would inflate the per-gene dispersion estimate for groups A and B as well:"

"If I have multiple groups, should I run all together or split into pairs of groups?" Do you refer to this FAQ.

So, doesn't this mean this is a bug in DESeq2? If the variance are not the same among groups, why does it assume so?

But for this specific dataset, could you check what is the reason to cause such a difference? I don't think just citing an FAQ can answer a question for a specific dataset. Thanks.

ADD REPLY • link 4.1 years ago megapode32559 • 0

0

Entering edit mode

I recommend that you work with a statistician for questions about analysis of your particular dataset. The point of the support site is for developers to provide guidance on specific software usage questions, not for developers to be universally available for data analysis questions.

You are free to use a different software as you feel necessary.

ADD REPLY • link 4.1 years ago Michael Love 42k