Question

deseq2 zeros in all treatments, fold change

1

Entering edit mode

mary.cho753 ▴ 10

@marycho753-16544

Last seen 5.8 years ago

Suppose you have 4 treated and 4 untreated samples.

On avg if all 8 columns are zero Deseq2 will output NA across the results for, baseMean log2FoldChange lfcSE stat pvalue.

However, in a few cases I can see numbers for baseMean log2FoldChange lfcSE stat pvalue (though padj will be NA) when all 8 columns are zero.

Is this possible?

Thanks.

deseq2 • 1.8k views

ADD COMMENT • link updated 5.8 years ago by Michael Love 41k • written 5.8 years ago by mary.cho753 ▴ 10

Michael Love · Answer 1 · 2018-07-19

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 23 hours ago

United States

Can you show this happening? The base mean is the mean of normalized counts so equal to zero if the counts are zero for the whole row (gene). These genes are not used in computation of LFC or pvalues.

ADD COMMENT • link 5.8 years ago Michael Love 41k

0

Entering edit mode

Thanks for your reply. I figured out at least what has happened.

If I have 16 samples and I normalize them together. (At least I think I am normalizing them together)

Then when I contrast samples ac (1,2,3,4) and an (5,6,7,8)

If some rows have zeros for raw counts 0,0,0,0,0,0,0,0 - I expected NA for baseMean log2FoldChange lfcSE stat pvalue. This is true the majority of the time.

However, if sample 15 has a normalized value of 3.6 (example 0,0,0,0,0,0,0,0,0,0,0,0,0,3.6,0, then I will get values for baseMean log2FoldChange lfcSE stat pvalue even

when comparing ac and an with the contrast function. Here I thought I was comparing (0,0,0,0,0,0,0,0) and not (0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.6,0) Please see code below. Please let me know if I was not clear of what happened. Is there a mistake in my code or design table? or is this just something that I ignore. Thanks for your quick reply.

Design table

id	condition
cor1	ac
cor2	ac
cor3	ac
cor4	ac
cor5	an
cor6	an
cor7	an
cor8	an
cor9	zc
cor10	zc
cor11	zc
cor12	zc
cor13	zn
cor14	zn
cor15	zn
cor16	zn

####################################################################
library( "DESeq2" )
setwd ("C:/dseq/data")
getwd()

countData <-read.table("C:/dseq/data/my_data.csv",header=T,sep=",")
head(countData)

colData1 <-read.table("C:/dseq/data/cond.csv",header=T,sep=",")
head (colData1)

colData<-data.frame(colData1)[,c("id","condition")]

dds<-DESeqDataSetFromMatrix(countData= countData,colData= colData,design=~condition)

dds$condition<-factor(dds$condition,levels=c("ac","an","zc","zn"))

ddsresults<-DESeq(dds)

test <- results(ddsresults, contrast=c("condition","an","ac"))
plotMA(test,main="DESeq2",ylim=c(-6,6))
write.csv (test, "results_ac_vs_an.csv")

ADD REPLY • link 5.8 years ago mary.cho753 ▴ 10

0

Entering edit mode

I think you can safely ignore this. Unless the counts are all zero then DESeq2 does compute statistics and the mean is not equal to zero. You can filter these genes out manually if you like.

ADD REPLY • link 5.8 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

I got a similar situation. Let's say I use test <- results(ddsresults, contrast=c("condition","an","ac")), I got a very low padj (<0.0001) value for this gene. Normally, I think a gene is differently expressed if its padj < 0.05, however, this dose not make sense in this case. Could you please give some suggestions how to deal with this?

Thanks!

ADD REPLY • link 3.6 years ago Rain Yi ▴ 10

0

Entering edit mode

What is the issue again?

Are you using an LRT or Wald test here?

If you use a Wald test the LFC will be zero and the pvalue equal to 1.

ADD REPLY • link 3.6 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

My experiment has two genotypes, two tissues, five time points and three biological replicates, so in total 60 samples. And I combined the genotype, tissue and time point as one factor (group in the design) for comparison.

sample_info

Sample_id   group
sample1 Genotype1-tissue1-0h
sample2 Genotype1-tissue1-0h
sample3 Genotype1-tissue1-0h
sample4 Genotype1-tissue1-3h
sample5 Genotype1-tissue1-3h
sample6 Genotype1-tissue1-3h
sample7 Genotype1-tissue1-6h
sample8 Genotype1-tissue1-6h
sample9 Genotype1-tissue1-6h
sample10    Genotype1-tissue1-24h
sample11    Genotype1-tissue1-24h
sample12    Genotype1-tissue1-24h
sample13    Genotype1-tissue1-48h
sample14    Genotype1-tissue1-48h
sample15    Genotype1-tissue1-48h
sample16    Genotype2-tissue1-0h
sample17    Genotype2-tissue1-0h
sample18    Genotype2-tissue1-0h
sample19    Genotype2-tissue1-3h
sample20    Genotype2-tissue1-3h
sample21    Genotype2-tissue1-3h
sample22    Genotype2-tissue1-6h
sample23    Genotype2-tissue1-6h
sample24    Genotype2-tissue1-6h
sample25    Genotype2-tissue1-24h
sample26    Genotype2-tissue1-24h
sample27    Genotype2-tissue1-24h
sample28    Genotype2-tissue1-48h
sample29    Genotype2-tissue1-48h
sample30    Genotype2-tissue1-48h
...

my code for DE analysis is

dds <- DESeqDataSetFromTximport(txi = txi, 
                                   colData = sample_info, 
                                   design = ~group)

dds <- dds[rowSums(counts(dds)) >= 25,]
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds, maxit=1000)

The txi object is created by tximport package, and Salmon was used for reads quantification. compare the first two groups

res <- results(dds, name = 'group_Genotype1-tissue1-3h_vs_Genotype1-tissue1-0h', alpha = 0.05)

resLFC <- lfcShrink(dds,coef = 'group_Genotype1-tissue1-3h_vs_Genotype1-tissue1-0h', type = 'apeglm')

resLFC

    baseMean    log2FoldChange  lfcSE   pvalue  padj
gene1   109.3867391 -0.003655068    0.224289647 0.003375813 0.042195941
gene2   107.4892382 -0.002637937    0.224384981 6.77E-05    0.002134097
gene3   43.1487467  -0.004780616    0.224289612 6.99E-06    0.000355125
gene4   156.7601499 -0.008634478    0.224322088 1.36E-27    4.71E-24
gene5   30.52627659 -0.005412231    0.224355353 7.65E-10    1.45E-07
gene6   312.5574935 -0.008088801    0.224300432 1.81E-26    4.54E-23
...

The padj values of these genes are smaller than 0.05, however, when I look at the count, you can see that all the counts values of these six genes in the first six samples are zero, so they should not be DEGs in this comparison.

normalized_count <- t(counts(dds, normalized = T)[paste0('gene', 1:6),])
normalized_count$group <- sample_info$group

normalized_count <- normalized_count[c('group', paste0('gene', 1:6)),]
normalized_count 
    group   gene1   gene2   gene3   gene4   gene5   gene6
sample1 Genotype1-tissue1-0h    0   0   0   0   0   0
sample2 Genotype1-tissue1-0h    0   0   0   0   0   0
sample3 Genotype1-tissue1-0h    0   0   0   0   0   0
sample4 Genotype1-tissue1-3h    0   0   0   0   0   0
sample5 Genotype1-tissue1-3h    0   0   0   0   0   0
sample6 Genotype1-tissue1-3h    0   0   0   0   0   0
sample7 Genotype1-tissue1-6h    0   0   72.731  44.664  0   0
sample8 Genotype1-tissue1-6h    0   135.236 0   188.599 0   0
sample9 Genotype1-tissue1-6h    0   392.548 0   129.881 285.307 17.493
sample10    Genotype1-tissue1-24h   0   0   0   0   0   0
sample11    Genotype1-tissue1-24h   0   0   0   0   0   0
sample12    Genotype1-tissue1-24h   0   606.759 0   0   0   0
sample13    Genotype1-tissue1-48h   0   611.081 0   0   0   0
sample14    Genotype1-tissue1-48h   0   0   0   0   0   0
sample15    Genotype1-tissue1-48h   0   51.556  0   125.036 0   0
sample16    Genotype2-tissue1-0h    0   0   69.414  221.474 0   71.053
sample17    Genotype2-tissue1-0h    56.018  0   158.726 227.126 18.17   139.635
sample18    Genotype2-tissue1-0h    0   0   147.957 143.406 20.438  150.624
sample19    Genotype2-tissue1-3h    17.109  0   24.681  535.139 0   97.378
sample20    Genotype2-tissue1-3h    0   0   77.521  133.851 0   500.539
sample21    Genotype2-tissue1-3h    0   216.741 114.825 604.459 0   460.391
sample22    Genotype2-tissue1-6h    0   0   45.666  731.594 0   339.807
sample23    Genotype2-tissue1-6h    0   0   124.964 549.044 44.797  675.459
sample24    Genotype2-tissue1-6h    0   380.825 345.23  782.227 0   1431.021
sample25    Genotype2-tissue1-24h   10.842  0   0   22.393  0   216.766
sample26    Genotype2-tissue1-24h   0   432.569 166.305 124.141 102.607 477.002
sample27    Genotype2-tissue1-24h   0   0   241.048 86.266  0   995.148
sample28    Genotype2-tissue1-48h   52.191  0   0   154.786 0   449.625
sample29    Genotype2-tissue1-48h   0   0   176.788 129.002 0   676.021
sample30    Genotype2-tissue1-48h   149.033 112.09  275.438 334.19  0   1620.9
...

Could you please give a clue what's happening here? I can send the example data and code to you if you need. Thanks

ADD REPLY • link 3.6 years ago Rain Yi ▴ 10

1

Entering edit mode

If you specify the comparison as a contrast in results it will just set these Wald test pvalues to 1.

Or you can filter out these genes based on small shrunken LFC, e.g. filtering with x[abs(x$log2FoldChange > 0.1),]. Or you could use svalue=TRUE in the lfcShrink function as well which would then give large s-value (aggregate false sign rate) to these genes.

For example:

dds <- makeExampleDESeqDataSet(m=20)
dds$condition <- factor(rep(1:4,each=5))
counts(dds)[1,11:20] <- 0L
dds <- DESeq(dds, quiet=TRUE)
results(dds, contrast=c("condition","4","3"))[1,]

log2 fold change (MLE): condition 4 vs 3
Wald test p-value: condition 4 vs 3
DataFrame with 1 row and 6 columns
       baseMean log2FoldChange     lfcSE      stat    pvalue      padj
      <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
gene1   1.23113              0   1.97322         0         1         1

ADD REPLY • link 3.6 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thank you very much for the quick reply!

All your suggestions works, and I choose the first one because it is the easiest one for me.

One thing I noticed is if I use coef in the lfcShrink function, I have to specify the res= argument in the meantime, otherwise it still give very low padj value (which I think also make sense, but normally I don't specify this if I already provide coef).

res_contrast <- results(dds, contrast = c('group', 'Genotype1-tissue1-3h','Genotype1-tissue1-0h'), alpha = 0.05)

resLFC_coef_specify_res <- lfcShrink(dds, coef = 'group_Genotype1-tissue1-3h_vs_Genotype1-tissue1-0h', res= res_contrast, type = 'apeglm')

resLFC_coef_specify_res

                   baseMean log2FoldChange     lfcSE    pvalue      padj
gene1           156.7601    -0.00771226  0.211948         1         1
gene2           109.3867    -0.00331308  0.211914         1         1
gene3           107.4892    -0.00274913  0.212014         1         1
gene4            43.1487    -0.00428061  0.211916         1         1
gene5            30.5263    -0.00460797  0.211950         1         1
gene6           312.5575    -0.00721755  0.211930         1         1

resLFC_coef_without_res <- lfcShrink(dds, coef = 'group_Genotype1-tissue1-3h_vs_Genotype1-tissue1-0h',  type = 'apeglm')

resLFC_coef_without_res

                   baseMean log2FoldChange     lfcSE      pvalue        padj
gene1           156.7601    -0.00771552  0.212000 5.16948e-28 2.15964e-24
gene2           109.3867    -0.00331893  0.211966 9.93627e-01 9.99544e-01
gene3           107.4892    -0.00274985  0.212066 3.45938e-05 1.70378e-03
gene4            43.1487    -0.00429210  0.211968 9.93237e-01 9.99544e-01
gene5            30.5263    -0.00460696  0.212002 4.22455e-08 6.30313e-06
gene6           312.5575    -0.00722102  0.211982 4.93318e-25 1.12414e-21

ADD REPLY • link updated 3.6 years ago by Michael Love 41k • written 3.6 years ago by Rain Yi ▴ 10

1

Entering edit mode

Yes this is because lfcShrink only produces LFC, posterior SD and svalues (optionally). The table with pvalue has to be either passed in as res, or built internally with a call to results().

ADD REPLY • link 3.6 years ago Michael Love 41k