deseq2 zeros in all treatments, fold change
1
1
Entering edit mode
mary.cho753 ▴ 10
@marycho753-16544
Last seen 5.7 years ago

Suppose you have 4 treated and 4 untreated samples.

On avg if all 8 columns are zero Deseq2 will output NA across the results for, baseMean log2FoldChange lfcSE stat pvalue.

However, in a few cases I can see numbers for baseMean log2FoldChange lfcSE stat pvalue (though padj will be NA) when all 8 columns are zero.

Is this possible?

 

Thanks.

       

 

 

deseq2 • 1.7k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 16 hours ago
United States

Can you show this happening? The base mean is the mean of normalized counts so equal to zero if the counts are zero for the whole row (gene). These genes are not used in computation of LFC or pvalues.

ADD COMMENT
0
Entering edit mode

Thanks for your reply. I figured out at least what has happened.

If I have 16 samples and I normalize them together. (At least I think I am normalizing them together)

Then when I contrast samples ac (1,2,3,4)  and an (5,6,7,8)

If some rows have zeros for raw counts 0,0,0,0,0,0,0,0 - I expected NA for  baseMean log2FoldChange lfcSE stat pvalue. This is true the majority of the time.

However, if sample 15 has a normalized value of 3.6 (example 0,0,0,0,0,0,0,0,0,0,0,0,0,3.6,0, then I will get values for baseMean log2FoldChange lfcSE stat pvalue even

when comparing ac and an with the contrast function. Here I thought I was comparing (0,0,0,0,0,0,0,0) and not (0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.6,0) Please see code below. Please let me know if I was not clear of what happened. Is there a mistake in my code or design table? or is this just something that I ignore.  Thanks for your quick reply. 

 

Design table

id condition
cor1 ac
cor2 ac
cor3 ac
cor4 ac
cor5 an
cor6 an
cor7 an
cor8 an
cor9 zc
cor10 zc
cor11 zc
cor12 zc
cor13 zn
cor14 zn
cor15 zn
cor16 zn

 

 

####################################################################
library( "DESeq2" )
setwd ("C:/dseq/data")
getwd()

countData <-read.table("C:/dseq/data/my_data.csv",header=T,sep=",")
head(countData)

 

colData1 <-read.table("C:/dseq/data/cond.csv",header=T,sep=",")
head (colData1)

colData<-data.frame(colData1)[,c("id","condition")]

 

dds<-DESeqDataSetFromMatrix(countData= countData,colData= colData,design=~condition)

 

dds$condition<-factor(dds$condition,levels=c("ac","an","zc","zn"))

ddsresults<-DESeq(dds)

test <- results(ddsresults, contrast=c("condition","an","ac"))
plotMA(test,main="DESeq2",ylim=c(-6,6))
write.csv (test, "results_ac_vs_an.csv")

 

ADD REPLY
0
Entering edit mode

I think you can safely ignore this. Unless the counts are all zero then DESeq2 does compute statistics and the mean is not equal to zero. You can filter these genes out manually if you like.

ADD REPLY
0
Entering edit mode

Hi Michael,

I got a similar situation. Let's say I use test <- results(ddsresults, contrast=c("condition","an","ac")), I got a very low padj (<0.0001) value for this gene. Normally, I think a gene is differently expressed if its padj < 0.05, however, this dose not make sense in this case. Could you please give some suggestions how to deal with this?

Thanks!

ADD REPLY
0
Entering edit mode

What is the issue again?

Are you using an LRT or Wald test here?

If you use a Wald test the LFC will be zero and the pvalue equal to 1.

ADD REPLY
0
Entering edit mode

Hi Michael,

My experiment has two genotypes, two tissues, five time points and three biological replicates, so in total 60 samples. And I combined the genotype, tissue and time point as one factor (group in the design) for comparison.

sample_info

Sample_id   group
sample1 Genotype1-tissue1-0h
sample2 Genotype1-tissue1-0h
sample3 Genotype1-tissue1-0h
sample4 Genotype1-tissue1-3h
sample5 Genotype1-tissue1-3h
sample6 Genotype1-tissue1-3h
sample7 Genotype1-tissue1-6h
sample8 Genotype1-tissue1-6h
sample9 Genotype1-tissue1-6h
sample10    Genotype1-tissue1-24h
sample11    Genotype1-tissue1-24h
sample12    Genotype1-tissue1-24h
sample13    Genotype1-tissue1-48h
sample14    Genotype1-tissue1-48h
sample15    Genotype1-tissue1-48h
sample16    Genotype2-tissue1-0h
sample17    Genotype2-tissue1-0h
sample18    Genotype2-tissue1-0h
sample19    Genotype2-tissue1-3h
sample20    Genotype2-tissue1-3h
sample21    Genotype2-tissue1-3h
sample22    Genotype2-tissue1-6h
sample23    Genotype2-tissue1-6h
sample24    Genotype2-tissue1-6h
sample25    Genotype2-tissue1-24h
sample26    Genotype2-tissue1-24h
sample27    Genotype2-tissue1-24h
sample28    Genotype2-tissue1-48h
sample29    Genotype2-tissue1-48h
sample30    Genotype2-tissue1-48h
...

my code for DE analysis is

dds <- DESeqDataSetFromTximport(txi = txi, 
                                   colData = sample_info, 
                                   design = ~group)

dds <- dds[rowSums(counts(dds)) >= 25,]
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds, maxit=1000)

The txi object is created by tximport package, and Salmon was used for reads quantification. compare the first two groups

res <- results(dds, name = 'group_Genotype1-tissue1-3h_vs_Genotype1-tissue1-0h', alpha = 0.05)

resLFC <- lfcShrink(dds,coef = 'group_Genotype1-tissue1-3h_vs_Genotype1-tissue1-0h', type = 'apeglm')

resLFC

    baseMean    log2FoldChange  lfcSE   pvalue  padj
gene1   109.3867391 -0.003655068    0.224289647 0.003375813 0.042195941
gene2   107.4892382 -0.002637937    0.224384981 6.77E-05    0.002134097
gene3   43.1487467  -0.004780616    0.224289612 6.99E-06    0.000355125
gene4   156.7601499 -0.008634478    0.224322088 1.36E-27    4.71E-24
gene5   30.52627659 -0.005412231    0.224355353 7.65E-10    1.45E-07
gene6   312.5574935 -0.008088801    0.224300432 1.81E-26    4.54E-23
...

The padj values of these genes are smaller than 0.05, however, when I look at the count, you can see that all the counts values of these six genes in the first six samples are zero, so they should not be DEGs in this comparison.

normalized_count <- t(counts(dds, normalized = T)[paste0('gene', 1:6),])
normalized_count$group <- sample_info$group

normalized_count <- normalized_count[c('group', paste0('gene', 1:6)),]
normalized_count 
    group   gene1   gene2   gene3   gene4   gene5   gene6
sample1 Genotype1-tissue1-0h    0   0   0   0   0   0
sample2 Genotype1-tissue1-0h    0   0   0   0   0   0
sample3 Genotype1-tissue1-0h    0   0   0   0   0   0
sample4 Genotype1-tissue1-3h    0   0   0   0   0   0
sample5 Genotype1-tissue1-3h    0   0   0   0   0   0
sample6 Genotype1-tissue1-3h    0   0   0   0   0   0
sample7 Genotype1-tissue1-6h    0   0   72.731  44.664  0   0
sample8 Genotype1-tissue1-6h    0   135.236 0   188.599 0   0
sample9 Genotype1-tissue1-6h    0   392.548 0   129.881 285.307 17.493
sample10    Genotype1-tissue1-24h   0   0   0   0   0   0
sample11    Genotype1-tissue1-24h   0   0   0   0   0   0
sample12    Genotype1-tissue1-24h   0   606.759 0   0   0   0
sample13    Genotype1-tissue1-48h   0   611.081 0   0   0   0
sample14    Genotype1-tissue1-48h   0   0   0   0   0   0
sample15    Genotype1-tissue1-48h   0   51.556  0   125.036 0   0
sample16    Genotype2-tissue1-0h    0   0   69.414  221.474 0   71.053
sample17    Genotype2-tissue1-0h    56.018  0   158.726 227.126 18.17   139.635
sample18    Genotype2-tissue1-0h    0   0   147.957 143.406 20.438  150.624
sample19    Genotype2-tissue1-3h    17.109  0   24.681  535.139 0   97.378
sample20    Genotype2-tissue1-3h    0   0   77.521  133.851 0   500.539
sample21    Genotype2-tissue1-3h    0   216.741 114.825 604.459 0   460.391
sample22    Genotype2-tissue1-6h    0   0   45.666  731.594 0   339.807
sample23    Genotype2-tissue1-6h    0   0   124.964 549.044 44.797  675.459
sample24    Genotype2-tissue1-6h    0   380.825 345.23  782.227 0   1431.021
sample25    Genotype2-tissue1-24h   10.842  0   0   22.393  0   216.766
sample26    Genotype2-tissue1-24h   0   432.569 166.305 124.141 102.607 477.002
sample27    Genotype2-tissue1-24h   0   0   241.048 86.266  0   995.148
sample28    Genotype2-tissue1-48h   52.191  0   0   154.786 0   449.625
sample29    Genotype2-tissue1-48h   0   0   176.788 129.002 0   676.021
sample30    Genotype2-tissue1-48h   149.033 112.09  275.438 334.19  0   1620.9
...

Could you please give a clue what's happening here? I can send the example data and code to you if you need. Thanks

ADD REPLY
1
Entering edit mode

If you specify the comparison as a contrast in results it will just set these Wald test pvalues to 1.

Or you can filter out these genes based on small shrunken LFC, e.g. filtering with x[abs(x$log2FoldChange > 0.1),]. Or you could use svalue=TRUE in the lfcShrink function as well which would then give large s-value (aggregate false sign rate) to these genes.

For example:

dds <- makeExampleDESeqDataSet(m=20)
dds$condition <- factor(rep(1:4,each=5))
counts(dds)[1,11:20] <- 0L
dds <- DESeq(dds, quiet=TRUE)
results(dds, contrast=c("condition","4","3"))[1,]

log2 fold change (MLE): condition 4 vs 3
Wald test p-value: condition 4 vs 3
DataFrame with 1 row and 6 columns
       baseMean log2FoldChange     lfcSE      stat    pvalue      padj
      <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
gene1   1.23113              0   1.97322         0         1         1
ADD REPLY
0
Entering edit mode

Hi Michael,

Thank you very much for the quick reply!

All your suggestions works, and I choose the first one because it is the easiest one for me.

One thing I noticed is if I use coef in the lfcShrink function, I have to specify the res= argument in the meantime, otherwise it still give very low padj value (which I think also make sense, but normally I don't specify this if I already provide coef).

res_contrast <- results(dds, contrast = c('group', 'Genotype1-tissue1-3h','Genotype1-tissue1-0h'), alpha = 0.05)

resLFC_coef_specify_res <- lfcShrink(dds, coef = 'group_Genotype1-tissue1-3h_vs_Genotype1-tissue1-0h', res= res_contrast, type = 'apeglm')

resLFC_coef_specify_res

                   baseMean log2FoldChange     lfcSE    pvalue      padj
gene1           156.7601    -0.00771226  0.211948         1         1
gene2           109.3867    -0.00331308  0.211914         1         1
gene3           107.4892    -0.00274913  0.212014         1         1
gene4            43.1487    -0.00428061  0.211916         1         1
gene5            30.5263    -0.00460797  0.211950         1         1
gene6           312.5575    -0.00721755  0.211930         1         1

resLFC_coef_without_res <- lfcShrink(dds, coef = 'group_Genotype1-tissue1-3h_vs_Genotype1-tissue1-0h',  type = 'apeglm')

resLFC_coef_without_res

                   baseMean log2FoldChange     lfcSE      pvalue        padj
gene1           156.7601    -0.00771552  0.212000 5.16948e-28 2.15964e-24
gene2           109.3867    -0.00331893  0.211966 9.93627e-01 9.99544e-01
gene3           107.4892    -0.00274985  0.212066 3.45938e-05 1.70378e-03
gene4            43.1487    -0.00429210  0.211968 9.93237e-01 9.99544e-01
gene5            30.5263    -0.00460696  0.212002 4.22455e-08 6.30313e-06
gene6           312.5575    -0.00722102  0.211982 4.93318e-25 1.12414e-21
ADD REPLY
1
Entering edit mode

Yes this is because lfcShrink only produces LFC, posterior SD and svalues (optionally). The table with pvalue has to be either passed in as res, or built internally with a call to results().

ADD REPLY

Login before adding your answer.

Traffic: 750 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6