Question

Large fold changes after deseq2 when one condition has two zeros and small number the other condition has two zeros and a large number

0

Entering edit mode

loliveira • 0

@loliveira-17518

Last seen 5.6 years ago

I am working with 18 RNAseq datasets of 6 conditions in triplicates. I am importing the data from RSEM using TXimport and proceeding with deseq2 - My question is that looking at the results for a preliminary comparison I see that I have high Log2fc (~20) and significant Padjust, however when I go to the counts from the original RSEM file I see the counts for these specific gene (OTC-60486) as CTRL 22.06 , 0 , 0 and INF 0, 1190.1, 0. I would expect to have a high P adjust. However this is the result of the comparison

Gene id baseMean log2FoldChange lfcSE stat pvalue padj

Otc-60486 566.31317 25.6076609 2.63792993 9.70748334 2.80E-22 4.07E-19

Obviously, the zeros are having a big effect since if I replace the zeros for 1 this log2FC is reduced to 1,5 and the Pvalue is no significant. Am I missing something? Please see R season below:

> dir<- "/Users/loliveira/RSEM/OTC"

> samples <- read.table(file.path(dir,"OTC.txt"), header=TRUE)
> samples
sample treatment rep run
1 SG4 Fed_Inf_4yrs a OTC-SG004.genes.results
2 SG5 Fed_Inf_4yrs a OTC-Sg005.genes.results
3 SG6 Fed_Inf_4yrs a OTC-Sg006.genes.results
4 SG10 CTRL b OTC-Sg010.genes.results
5 SG11 CTRL b OTC-Sg011.genes.results
6 SG12 CTRL b OTC-Sg012.genes.results
7 SG16 Fed_Inf_1wk c OTC-Sg016.genes.results
8 SG17 Fed_Inf_1wk c OTC-Sg017.genes.results
9 SG18 Fed_Inf_1wk c OTC-Sg018.genes.results
10 SG22 Fed_Uni_1wk d OTC-Sg022.genes.results
11 SG23 Fed_Uni_1wk d OTC-Sg023.genes.results
12 SG24 Fed_Uni_1wk d OTC-Sg024.genes.results
13 SG28 Fed_Inf_2bm_1wk e OTC-Sg028.genes.results
14 SG29 Fed_Inf_2bm_1wk e OTC-Sg029.genes.results
15 SG30 Fed_Inf_2bm_1wk e OTC-Sg030.genes.results
16 SG34 Fed_Inf_1month f OTC-Sg034.genes.results
17 SG35 Fed_Inf_1month f OTC-Sg035.genes.results
18 SG36 Fed_Inf_1month f OTC-Sg036.genes.results
>
> files <- file.path(dir,samples$run)
> names(files) <- samples$sample
>
> files
SG4 SG5 SG6
"/Users/loliveira/RSEM/OTC/OTC-SG004.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg005.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg006.genes.results"
SG10 SG11 SG12
"/Users/loliveira/RSEM/OTC/OTC-Sg010.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg011.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg012.genes.results"
SG16 SG17 SG18
"/Users/loliveira/RSEM/OTC/OTC-Sg016.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg017.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg018.genes.results"
SG22 SG23 SG24
"/Users/loliveira/RSEM/OTC/OTC-Sg022.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg023.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg024.genes.results"
SG28 SG29 SG30
"/Users/loliveira/RSEM/OTC/OTC-Sg028.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg029.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg030.genes.results"
SG34 SG35 SG36
"/Users/loliveira/RSEM/OTC/OTC-Sg034.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg035.genes.results" "/Users/loliveira/RSEM/OTC/OTC-Sg036.genes.results"
> library(tximport)
> library(readr)
> txi <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> names(txi)
[1] "abundance" "counts" "length" "countsFromAbundance"
>
> head(txi$counts)
SG4 SG5 SG6 SG10 SG11 SG12 SG16 SG17 SG18 SG22 SG23 SG24 SG28 SG29 SG30 SG34 SG35 SG36
Otc-10005 119 140 70 217 33 76.00 153 197 76 47 35 48 60 68 67 81.00 63 158
Otc-10006 12 6 7 68 4 5.00 7 8 4 13 2 1 3 6 4 7.00 5 16
Otc-10009 2 10 16 25 4 9.01 5 5 2 11 10 10 7 10 7 7.01 10 10
Otc-10027 145 179 164 427 58 79.00 69 72 109 62 63 61 88 101 79 83.00 90 81
Otc-10042 16 30 26 71 15 16.00 19 19 24 27 20 12 11 11 16 16.00 20 26
Otc-10050 17 16 11 10 5 9.00 6 9 8 6 2 5 4 4 2 4.00 12 2
>
> library(DESeq2)

> dds <- DESeqDataSetFromTximport(txi,colData = samples, design = ~treatment)
using counts and average transcript lengths from tximport
> dds
class: DESeqDataSet
dim: 10984 18
metadata(1): version
assays(2): counts avgTxLength
rownames(10984): Otc-10005 Otc-10006 ... OtcSigP-9981 OtcSigP-9986
rowData names(0):
colnames(18): SG4 SG5 ... SG35 SG36
colData names(4): sample treatment rep run
>
> keep <- rowSums(counts(dds)) >= 10
> dds <- dds[keep,]
> dds
class: DESeqDataSet
dim: 8376 18
metadata(1): version
assays(2): counts avgTxLength
rownames(8376): Otc-10005 Otc-10006 ... OtcSigP-9936 OtcSigP-9981
rowData names(0):
colnames(18): SG4 SG5 ... SG35 SG36
colData names(4): sample treatment rep run
> dds <- DESeq(dds)
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> res <- results(dds)
> res
log2 fold change (MLE): treatment Fed Uni 1wk vs CTRL
Wald test p-value: treatment Fed Uni 1wk vs CTRL
DataFrame with 8376 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
Otc-10005 90.829530 -0.05694356 0.3738994 -0.1522965 0.8789531 0.9862238
Otc-10006 7.495166 -0.56269995 0.7472859 -0.7529915 0.4514550 0.8895800
Otc-10009 8.529125 0.96111647 0.5616601 1.7112065 0.0870430 0.5468635
Otc-10027 97.147211 -0.16809618 0.2412074 -0.6968949 0.4858686 0.9022474
Otc-10042 20.431021 0.49162397 0.3751679 1.3104104 0.1900570 0.7129306
... ... ... ... ... ... ...
OtcSigP-9829 1.034385 -2.69934118 1.7906619 -1.50745441 0.13169422 NA
OtcSigP-9877 14.310269 -1.16920823 0.6066144 -1.92743229 0.05392578 0.4518551
OtcSigP-9904 59.446744 0.43915855 0.2894595 1.51716731 0.12922445 0.6367444
OtcSigP-9936 9.917122 0.01944894 0.6563387 0.02963248 0.97636016 0.9961168
OtcSigP-9981 12.489498 0.94278120 0.5968313 1.57964433 0.11418834 0.6097460
>
> summary(res)

out of 8376 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 167, 2%
LFC < 0 (down) : 90, 1.1%
outliers [1] : 0, 0%
low counts [2] : 975, 12%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

>
> dds$group <- factor(paste0(dds$rep, dds$treatment))
> design(dds) <- ~ group
> resultsNames(dds)
[1] "Intercept" "treatment_Fed_Inf_1month_vs_CTRL" "treatment_Fed_Inf_1wk_vs_CTRL" "treatment_Fed_Inf_2bm_1wk_vs_CTRL"
[5] "treatment_Fed_Inf_4yrs_vs_CTRL" "treatment_Fed_Uni_1wk_vs_CTRL"
> dds <- DESeq(dds)
using pre-existing normalization factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> resultsNames(dds)
[1] "Intercept" "group_bCTRL_vs_aFed_Inf_4yrs" "group_cFed_Inf_1wk_vs_aFed_Inf_4yrs"
[4] "group_dFed_Uni_1wk_vs_aFed_Inf_4yrs" "group_eFed_Inf_2bm_1wk_vs_aFed_Inf_4yrs" "group_fFed_Inf_1month_vs_aFed_Inf_4yrs"
>
> library("vsn")
Warning message:
package ‘vsn’ was built under R version 3.4.2
> rld <- normTransform(dds)
>
> meanSdPlot(assay(rld))
Warning message:
package ‘hexbin’ was built under R version 3.4.3
>
> plotPCA(rld, intgroup=c("treatment"))
> result = results(dds, contrast=c("group", "cFed_Inf_1wk", "bCTRL"))
> write.table(result, file='Fed_Inf_1wkvsCTRL.txt',sep='\t',quote=FALSE)

Thanks in advance for the feedback

rnaseq rsem deseq2 • 2.5k views

ADD COMMENT • link 5.6 years ago loliveira • 0

0

Entering edit mode

Dear Michael,

Thanks for your reply, I need to do multiple comparing the conditions, but when I try the coef= on the lfcShrink it restrict me to the resultsNames(dds) as below, Since I need more iterations I am considering going for the betaPrior=TRUE.

What would you indicate in this case? How can I work all iterations of the results with lfcShrink using the coef variable? if this was answered before, please point me to the forum post - I couldn't find anything relevant.

Very best and thanks again for the support

resultsNames(dds)

[1] "Intercept" "group_bCTRL_vs_aFed_Inf_4yrs" "group_cFed_Inf_1wk_vs_aFed_Inf_4yrs"
[4] "group_dFed_Uni_1wk_vs_aFed_Inf_4yrs" "group_eFed_Inf_2bm_1wk_vs_aFed_Inf_4yrs" "group_fFed_Inf_1month_vs_aFed_Inf_4yrs"

ADD REPLY • link 5.6 years ago loliveira • 0

Michael Love · Answer 1 · 2018-09-25

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 29 minutes ago

United States

Why aren’t you using lfcShrink to produce posterior effect sizes? This is the recommended workflow, see the vignette or the workflow. You might also want to try lfcThreshold within lfcShrink to specify a minimum effect size.

ADD COMMENT • link 5.6 years ago Michael Love 41k

0

Entering edit mode

Here’s a section where we show how to rearrange for contrasts using apeglm:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#extended-section-on-shrinkage-estimators

ADD REPLY • link 5.6 years ago Michael Love 41k

0

Entering edit mode

Thanks again for the support, I did add the lfcShrink to my datasets and relevel for each combination of variables. The results made much more sense now(Thank you). However, I still have several genes where (ex. below from-Ctrl vs Infected 1wk) several genes are zeros on the CTRL and I have only one replicate as a high positive value. In this case, The LOG2FC reduced from ~25 to ~10, but I still think it is an "artifact". I keep asking myself why are the P adjust values still below <0.05 for this comparisons? Is there a way to recalculate if making it more stringent? I did the IHW but I still get a significant value... I am a bit lost here ...

Also, I couldn't run the lfcThreshold with the apeglm, but I don't think the discrepancy is coming from the FC, but from the P adjust values calculation.

# lfcThreshold error

resLFC <- lfcShrink(dds, coef="group_cFed_Inf_1wk_vs_bCTRL", type="apeglm",lfcThreshold= .1)
using 'apeglm' for LFC shrinkage
Error in apeglm::apeglm(Y = Y, x = design, log.lik = apeglm::logLikNB, :
unused argument (lfcThreshold = 0.1)

#Examples with original and shrink FC.

Counts from RSEM 3 replicates

otc-60486 0,0,0 and 0, 1190.1, 0

Original results with IHW

baseMean log2FoldChange lfcSE stat pvalue padj ` weight

Otc-60486 566.31317 24.7938956 2.63770541 9.39979708 5.47E-21 2.70E-18 1.69356631

> dds$group <- relevel(dds$group, ref = "bCTRL")
> dds <- DESeq(dds)

>resLFC <- lfcShrink(dds, coef="group_cFed_Inf_1wk_vs_bCTRL", type="apeglm")

write.table(resLFC, file='group_cFed_Inf_1wk_vs_bCTRL.txt',sep='\t',quote=FALSE)

# Results with lfcShrink

baseMean log2FoldChange lfcSE pvalue padj `

566.31317 10.9832851 4.10034068 1.40E-19 2.02E-16

ADD REPLY • link updated 5.6 years ago by Michael Love 41k • written 5.6 years ago by loliveira • 0

0

Entering edit mode

I'd guess these have small dispersion estimates because of the count distributions of the other conditions. So there is a very large count and information from other genes saying the the count is likely not due to high dispersion of the gene, and then you have a very large difference in means (0 vs ~400) which is not likely equal to 0 given the estimates from the data.

lfcShrink() doesn't replace the p-value column, unless you specify lfcThreshold or svalue=TRUE. This is available in the latest release (v1.20), but not previously.

Another option would be to filter per comparison to remove the genes where there are not, e.g. 3 samples with a count of 5 or higher:

idx <- dds$condition %in% c("A","B") # levels of two groups of interest:
keep <- rowSums(counts(dds)[,idx] >= 5) >= 3
res <- res[keep,]

ADD REPLY • link 5.6 years ago Michael Love 41k

0

Entering edit mode

Hi Mike,

Thanks for lingering on,

I updated my DESeq2 to the latest version, and did exactly the same run as before, (as below) The first difference I noted was that now I see the citation to the manuscript (Nice paper!), the second is that the results for the lfcShrink changed dramatically. Now that problematic genes all have an irrelevant log2FoldChange

XX

	baseMean	log2FoldChange	lfcSE	pvalue	padj
Otc-60486	566.31317	0.04097335	0.27390458	1.40E-19	2.02E-16

Also now I can run the lfcThreshold, That replaces the P values for the s values. I am unsure how to define the lfcThreshold value that make sense to my dataset. I will read a bit more to see if I find more information.

Thanks for the support

> resLFC <- lfcShrink(dds, coef="group_cFed_Inf_1wk_vs_bCTRL", type="apeglm")
using 'apeglm' for LFC shrinkage. If used in published research, please cite:
Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
sequence count data: removing the noise and preserving large differences.
bioRxiv. https://doi.org/10.1101/303255
> resLFC
log2 fold change (MAP): group cFed Inf 1wk vs bCTRL
Wald test p-value: group cFed Inf 1wk vs bCTRL
DataFrame with 8376 rows and 5 columns
baseMean log2FoldChange lfcSE pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric>
Otc-10005 90.8295300215003 1.21921880536886 0.384682618069075 8.44358364780871e-05 0.00379609448166066
Otc-10006 7.49516608368584 -0.0750104915728319 0.265213683522076 0.413564208091076 NA
Otc-10009 8.52912466546036 -0.108342087836708 0.273789519999909 0.298256656398482 0.7309913282316
Otc-10027 97.1472108432723 -0.00711785369618373 0.17665262570643 0.952230663441284 0.991549155246969
Otc-10042 20.4310212153511 0.110932202547201 0.236218265690278 0.41208242297299 0.79934729777592
... ... ... ... ... ...
OtcSigP-9829 1.03438505866019 -0.034002243697861 0.268577778110672 0.441778223544063 NA
OtcSigP-9877 14.3102685292295 -0.277025457033635 0.409757551028378 0.0568212737905355 0.358965529484449
OtcSigP-9904 59.4467435346105 0.279170607879664 0.25741738666105 0.0884109150084793 0.448447832903963
OtcSigP-9936 9.91712247256658 -0.119224918545589 0.283109314021552 0.235066258566969 0.66878442450563
OtcSigP-9981 12.4894977143121 0.719314822405624 0.780726333153449 0.0127635007879737 0.153875054192443
> write.table(resLFC, file='group_cFed_Inf_1wk_vs_bCTRL.txt',sep='\t',quote=FALSE)

ADD REPLY • link 5.6 years ago loliveira • 0