Question

DESeq2 and betaPrior

0

Entering edit mode

Elena Grassi ▴ 10

@elena-grassi-6138

Last seen 6.6 years ago

Italy, Rozzano, Humanitas University

Hi, I've recently started using DESeq2 with complex experimental designs and I've stumbled across some facts which I cannot wrap my head around.
One of the aims of the experiment is to study the transcriptional response following knockdown of several genes, working both on basal conditions and in an "induced state" were we know that several genes are up/downregulated (due to a treatment).
Right now I've got samples obtained in two different batches with 4 biological replicates for the sh and some more for the wt-controls.
I've decided to use a simple design with a factor representing the treatment and sh condition together as suggested in the documentation and another one to control for the batch effect: "~ batch + condition", where condition is "shgeneX|control_untreated|treated". After fitting the model I call DE genes for all the desired comparisons using results(dds, c("condition", "shgeneX_treated", "control_treated")). I am using betaPrior=TRUE since someone suggested me that this is on average a good idea.

With this setup I obtained some strange calls, namely some genes that are not expressed in the untreated samples are called as DE (FDR corrected for all comparison < 0.05 and |lfc|>1) in comparisons shgeneX_untreated vs control_untreated...I therefore reasoned that my initial filtering on raw reads would be
able to filter out these genes if I worked separately on the treated and untreated samples, building two different DESeq2 objects and so on. I also feared that working with everything together the variance estimates could be bloated due to the high variability between the treated and untreated conditions (they are widely separated the the first principal component).

To make a long story short when I split treated and untreated samples I got very similar pvalues and baseMeans but smaller log fold changes, and therefore the "strange" calls go away (I am not always able to filter out those genes using a filter on raw reads since in some sh samples, which are not outliers for PCA, they still have a reasonable number of reads...). The DE genes with this setup are a subset of the other one.
Since I stumbled in this observation: "The beta prior is estimated from the observed differences between conditions across all genes, so if there are many large differences here, there is less shrinkage." DESeq2 with high dispersions, I tried to use betaPrior=FALSE with the split setup and this made lfc more similar to the first setup (for the curious ones I put some plots here https://www.dropbox.com/sh/5i9r3dsva23xbys/AABXjauLgOdojtYNdSfS_6u3a?dl=0).
My questions, sorry for the long post, are:
- in experiments like this one is betaPrior=T/lfc shrinkage suggested or not? To me it seems that shrinking fold changes in this way would work if the comparisons I am interested in are the ones where the large differences lies, but I am not sure
- considering the fact that I need to perform the low counts filtering on treated and untreated samples separately is it ok to split the analysis or not? Right now I am tempted to use the results obtained with the split setup and betaPrior=T but I would like some insights :)

deseq2 biostatistics • 2.7k views

ADD COMMENT • link 6.6 years ago Elena Grassi ▴ 10

0

Entering edit mode

merged_pruned is a data.frame with samples identifiers as colnames, entrez as rownames and counts as values.

conditions_pruned has columns conditions_sample, rep, comparison and has samples identifiers as rownames.

mincounts is 10, mingroup is 3.

dds <- DESeqDataSetFromMatrix(countData = merged_pruned, colData = conditions_pruned, design = ~ rep + comparison) filterGenes <- rowSums(counts(dds) > mincounts) < mingroup dds <- dds[!filterGenes] dds <- DESeq(dds, parallel=TRUE, betaPrior=TRUE)

res <- results(dds, alpha=0.05, contrast=c("comparison", "Ino80_lps", "ctrl_lps"), parallel=TRUE)

Am I making some rookie mistakes? :)

ADD REPLY • link 6.6 years ago Elena Grassi ▴ 10

score 1 · Accepted Answer · 2017-09-08

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 7 hours ago

United States

hi Elena,

Can you show an example of a gene where you got a small p-value for the sh vs control in untreated comparison but you say also the gene is not expressed in untreated? By not expressed, you mean low counts or 0 counts? You could use plotCounts() for example for such a gene. Also can you show the colData, e.g. as.data.frame(colData(dds))?

ADD COMMENT • link 6.6 years ago Michael Love 41k

0

Entering edit mode

Yup, everything is coming (thanks for the patience in reading :)).

ADD REPLY • link 6.6 years ago Elena Grassi ▴ 10

0

Entering edit mode

One of the strange genes:

the results for one of the UT comparisons:

entrez	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	padj_multi
16175	1385.52457620314	2.38144205539228	0.46061579583222	5.170126767124	2.3393525067864e-07	0.000203380151207798	0.000161511179359231

The colData:

                   condition_sample rep comparison sizeFactor
SID100271 TRCN0000124093 Ino80b_ut   1 Ino80b_ut 1.0864321
SID100272 TRCN0000124089 Ino80b_ut   1 Ino80b_ut 0.9937351
SID100273   TRCN0000096374 Ino80_ut   1   Ino80_ut 1.0711857
SID100274   TRCN0000096377 Ino80_ut   1   Ino80_ut 0.7002019

...

since it's 169 rows maybe you prefer to have it via email :)

Also the counts are huge "long" :) , the one of the reported comparison are:

> counts <- plotCounts(dds, gene="16175", intgroup="comparison", returnData=TRUE, normalized=F)

> counts[counts$comparison == "ctrl_ut",] count comparison SID100335 0.5 ctrl_ut SID100352 0.5 ctrl_ut SID100353 0.5 ctrl_ut SID100356 0.5 ctrl_ut SID100358 0.5 ctrl_ut SID100442 1.5 ctrl_ut SID100461 0.5 ctrl_ut SID100462 11.5 ctrl_ut SID100463 1.5 ctrl_ut SID100465 0.5 ctrl_ut > counts[counts$comparison == "Ino80_ut",] count comparison SID100273 0.5 Ino80_ut SID100274 1.5 Ino80_ut SID100380 0.5 Ino80_ut SID100381 1.5 Ino80_ut

And normalized:

> counts <- plotCounts(dds, gene="16175", intgroup="comparison", returnData=TRUE, normalized=T) > counts[counts$comparison == "Ino80_ut",] count comparison SID100273 0.500000 Ino80_ut SID100274 1.928159 Ino80_ut SID100380 0.500000 Ino80_ut SID100381 1.331427 Ino80_ut > counts[counts$comparison == "ctrl_ut",] count comparison SID100335 0.5000000 ctrl_ut SID100352 0.5000000 ctrl_ut SID100353 0.5000000 ctrl_ut SID100356 0.5000000 ctrl_ut SID100358 0.5000000 ctrl_ut SID100442 0.8369791 ctrl_ut SID100461 0.5000000 ctrl_ut SID100462 7.6661643 ctrl_ut SID100463 0.8833815 ctrl_ut SID100465 0.5000000 ctrl_ut

ADD REPLY • link 6.6 years ago Elena Grassi ▴ 10

0

Entering edit mode

Sorry I realized that I shoud tell you that these data come from the "together" analysis. THe relevant colData:

> d[d$comparison == "ctrl_ut",] condition_sample rep comparison sizeFactor SID100335 SHC002_ut 1 ctrl_ut 0.9642440 SID100352 luc_ut 1 ctrl_ut 1.0615145 SID100353 luc_ut 1 ctrl_ut 1.5108928 SID100356 luc_ut 1 ctrl_ut 1.1745947 SID100358 SHC004_ut 1 ctrl_ut 0.3549336 SID100442 SHC002_ut 2 ctrl_ut 2.9675434 SID100461 luc_ut 2 ctrl_ut 1.4947868 SID100462 luc_ut 2 ctrl_ut 1.5349913 SID100463 luc_ut 2 ctrl_ut 2.6083681 SID100465 SHC004_ut 2 ctrl_ut 0.8363213 > d[d$comparison == "Ino80_ut",] condition_sample rep comparison sizeFactor SID100273 TRCN0000096374 Ino80_ut 1 Ino80_ut 1.0711857 SID100274 TRCN0000096377 Ino80_ut 1 Ino80_ut 0.7002019 SID100380 TRCN0000096374 Ino80_ut 2 Ino80_ut 0.8785057 SID100381 TRCN0000096377 Ino80_ut 2 Ino80_ut 1.2027508

In the split approach this particular gene is filtered out by my expression filters in the ut comparisons.

ADD REPLY • link 6.6 years ago Elena Grassi ▴ 10

0

Entering edit mode

And can you show the code that produced that first line of results?

ADD REPLY • link 6.6 years ago Michael Love 41k

0

Entering edit mode

In the code you posted above, you contrast "Ino80_lps" and "ctrl_lps" but you are showing the normalized counts for ctrl_ut and Ino80_ut. Note that the exact comparison used when calling results() will also appear at the top of the table if you print

> res

to the console. I want to make sure that the comparison you made was over the ctrl_ut and Ino80_ut samples.

ADD REPLY • link 6.6 years ago Michael Love 41k

0

Entering edit mode

Yes sorry my bad, I copy and pasted from my notes the wrong line. Repeating the steps now (on an Rdata with the dds object):

> res <- results(dds, alpha=0.05, contrast=c("comparison", "Ino80_ut", "ctrl_ut"), parallel=TRUE) > head(res) log2 fold change (MAP): comparison Ino80_ut vs ctrl_ut Wald test p-value: comparison Ino80_ut vs ctrl_ut DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> 100017 818.80462 -0.07499959 0.2200352 -0.3408526 0.73321453 0.9350128 100019 111.73721 -0.13345338 0.2312766 -0.5770293 0.56391968 0.8823707 100033459 122.13192 -0.50682085 0.4848992 -1.0452086 0.29592656 0.7507248 100034251 2976.08953 0.15876695 0.3509555 0.4523849 0.65099174 0.9142452 100034361 27.35382 0.33582071 0.2317138 1.4492912 0.14725628 0.5962709 100034726 10.98791 0.54122447 0.3251924 1.6643206 0.09604838 0.5096690 > res[rownames(res)==16175,] log2 fold change (MAP): comparison Ino80_ut vs ctrl_ut Wald test p-value: comparison Ino80_ut vs ctrl_ut DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> 16175 1385.525 2.381442 0.4606158 5.170127 2.339353e-07 0.0002033802

ADD REPLY • link 6.6 years ago Elena Grassi ▴ 10

0

Entering edit mode

With betaPrior=FALSE:

> res_nobeta[rownames(res_nobeta)==16175,] log2 fold change (MLE): comparison Ino80_ut vs ctrl_ut Wald test p-value: comparison Ino80_ut vs ctrl_ut DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> 16175 1385.525 -0.8255021 1.234827 -0.6685165 0.503804 0.8774721

ADD REPLY • link 6.6 years ago Elena Grassi ▴ 10

0

Entering edit mode

OK I'd recommend not using the beta prior here, it seems it's having some bad combination with a prior on all of the batch and condition variables. This (betaPrior=FALSE) is also default behavior with current and future versions of DESeq2.

ADD REPLY • link 6.6 years ago Michael Love 41k

0

Entering edit mode

Yes, I see. Would you consider the split analysis completely wrong or ok-ish given the huge differences between treated and untreated?

ADD REPLY • link 6.6 years ago Elena Grassi ▴ 10

1

Entering edit mode

Splitting the analysis is sometimes good, there is a FAQ that discusses this in the vignette