Question: DESeq2 and betaPrior
0
2.1 years ago by
Elena Grassi10
Italy, Rozzano, Humanitas University
Elena Grassi10 wrote:

Hi, I've recently started using DESeq2 with complex experimental designs and I've stumbled across some facts which I cannot wrap my head around.
One of the aims of the experiment is to study the transcriptional response following knockdown of several genes, working both on basal conditions and in an "induced state" were we know that several genes are up/downregulated (due to a treatment).
Right now I've got samples obtained in two different batches with 4 biological replicates for the sh and some more for the wt-controls.
I've decided to use a simple design with a factor representing the treatment and sh condition together as suggested in the documentation and another one to control for the batch effect: "~ batch + condition", where condition is "shgeneX|control_untreated|treated". After fitting the model I call DE genes for all the desired comparisons using results(dds, c("condition", "shgeneX_treated", "control_treated")). I am using betaPrior=TRUE since someone suggested me that this is on average a good idea.

With this setup I obtained some strange calls, namely some genes that are not expressed in the untreated samples are called as DE (FDR corrected for all comparison < 0.05 and |lfc|>1) in comparisons shgeneX_untreated vs control_untreated...I therefore reasoned that my initial filtering on raw reads would be
able to filter out these genes if I worked separately on the treated and untreated samples, building two different DESeq2 objects and so on. I also feared that working with everything together the variance estimates could be bloated due to the high variability between the treated and untreated conditions (they are widely separated the the first principal component).

To make a long story short when I split treated and untreated samples I got very similar pvalues and baseMeans but smaller log fold changes, and therefore the "strange" calls go away (I am not always able to filter out those genes using a filter on raw reads since in some sh samples, which are not outliers for PCA, they still have a reasonable number of reads...). The DE genes with this setup are a subset of the other one.
Since I stumbled in this observation: "The beta prior is estimated from the observed differences between conditions across all genes, so if there are many large differences here, there is less shrinkage." DESeq2 with high dispersions, I tried to use betaPrior=FALSE with the split setup and this made lfc more similar to the first setup (for the curious ones I put some plots here https://www.dropbox.com/sh/5i9r3dsva23xbys/AABXjauLgOdojtYNdSfS_6u3a?dl=0).
My questions, sorry for the long post, are:
- in experiments like this one is betaPrior=T/lfc shrinkage suggested or not? To me it seems that shrinking fold changes in this way would work if the comparisons I am interested in are the ones where the large differences lies, but I am not sure
- considering the fact that I need to perform the low counts filtering on treated and untreated samples separately is it ok to split the analysis or not? Right now I am tempted to use the results obtained with the split setup and betaPrior=T but I would like some insights :)

deseq2 biostatistics • 774 views
modified 2.1 years ago • written 2.1 years ago by Elena Grassi10

merged_pruned is a data.frame with samples identifiers as colnames, entrez as rownames and counts as values.

conditions_pruned has columns conditions_sample, rep, comparison and has samples identifiers as rownames.

mincounts is 10, mingroup is 3.

dds <- DESeqDataSetFromMatrix(countData = merged_pruned, colData = conditions_pruned, design = ~ rep + comparison) filterGenes <- rowSums(counts(dds) > mincounts) < mingroup dds <- dds[!filterGenes] dds <- DESeq(dds, parallel=TRUE, betaPrior=TRUE) 

res <- results(dds, alpha=0.05, contrast=c("comparison", "Ino80_lps", "ctrl_lps"), parallel=TRUE)

Am I making some rookie mistakes? :)

Answer: DESeq2 and betaPrior
1
2.1 years ago by
Michael Love25k
United States
Michael Love25k wrote:

hi Elena,

Can you show an example of a gene where you got a small p-value for the sh vs control in untreated comparison but you say also the gene is not expressed in untreated? By not expressed, you mean low counts or 0 counts? You could use plotCounts() for example for such a gene. Also can you show the colData, e.g. as.data.frame(colData(dds))?

Yup, everything is coming (thanks for the patience in reading :)).

One of the strange genes:

the results for one of the UT comparisons:

 entrez baseMean log2FoldChange lfcSE stat pvalue padj padj_multi 16175 1385.52457620314 2.38144205539228 0.46061579583222 5.170126767124 2.3393525067864e-07 0.000203380151207798 0.000161511179359231

The colData:

condition_sample rep comparison sizeFactor
SID100271  TRCN0000124093 Ino80b_ut   1  Ino80b_ut  1.0864321
SID100272  TRCN0000124089 Ino80b_ut   1  Ino80b_ut  0.9937351
SID100273   TRCN0000096374 Ino80_ut   1   Ino80_ut  1.0711857
SID100274   TRCN0000096377 Ino80_ut   1   Ino80_ut  0.7002019

...

since it's 169 rows maybe you prefer to have it via email :)

Also the counts are huge "long" :) , the one of the reported comparison are:

> counts <- plotCounts(dds, gene="16175", intgroup="comparison", returnData=TRUE, normalized=F)

> counts[counts$comparison == "ctrl_ut",] count comparison SID100335 0.5 ctrl_ut SID100352 0.5 ctrl_ut SID100353 0.5 ctrl_ut SID100356 0.5 ctrl_ut SID100358 0.5 ctrl_ut SID100442 1.5 ctrl_ut SID100461 0.5 ctrl_ut SID100462 11.5 ctrl_ut SID100463 1.5 ctrl_ut SID100465 0.5 ctrl_ut > counts[counts$comparison == "Ino80_ut",]           count comparison SID100273   0.5   Ino80_ut SID100274   1.5   Ino80_ut SID100380   0.5   Ino80_ut SID100381   1.5   Ino80_ut

And normalized:

> counts <- plotCounts(dds, gene="16175", intgroup="comparison", returnData=TRUE, normalized=T) > counts[counts$comparison == "Ino80_ut",] count comparison SID100273 0.500000 Ino80_ut SID100274 1.928159 Ino80_ut SID100380 0.500000 Ino80_ut SID100381 1.331427 Ino80_ut > counts[counts$comparison == "ctrl_ut",]               count comparison SID100335 0.5000000    ctrl_ut SID100352 0.5000000    ctrl_ut SID100353 0.5000000    ctrl_ut SID100356 0.5000000    ctrl_ut SID100358 0.5000000    ctrl_ut SID100442 0.8369791    ctrl_ut SID100461 0.5000000    ctrl_ut SID100462 7.6661643    ctrl_ut SID100463 0.8833815    ctrl_ut SID100465 0.5000000    ctrl_ut

Sorry I realized that I shoud tell you that these data come from the "together" analysis. THe relevant colData:

> d[d$comparison == "ctrl_ut",] condition_sample rep comparison sizeFactor SID100335 SHC002_ut 1 ctrl_ut 0.9642440 SID100352 luc_ut 1 ctrl_ut 1.0615145 SID100353 luc_ut 1 ctrl_ut 1.5108928 SID100356 luc_ut 1 ctrl_ut 1.1745947 SID100358 SHC004_ut 1 ctrl_ut 0.3549336 SID100442 SHC002_ut 2 ctrl_ut 2.9675434 SID100461 luc_ut 2 ctrl_ut 1.4947868 SID100462 luc_ut 2 ctrl_ut 1.5349913 SID100463 luc_ut 2 ctrl_ut 2.6083681 SID100465 SHC004_ut 2 ctrl_ut 0.8363213 > d[d$comparison == "Ino80_ut",]                  condition_sample rep comparison sizeFactor SID100273 TRCN0000096374 Ino80_ut   1   Ino80_ut  1.0711857 SID100274 TRCN0000096377 Ino80_ut   1   Ino80_ut  0.7002019 SID100380 TRCN0000096374 Ino80_ut   2   Ino80_ut  0.8785057 SID100381 TRCN0000096377 Ino80_ut   2   Ino80_ut  1.2027508

In the split approach this particular gene is filtered out by my expression filters in the ut comparisons.

And can you show the code that produced that first line of results?

In the code you posted above, you contrast "Ino80_lps" and "ctrl_lps" but you are showing the normalized counts for ctrl_ut and Ino80_ut. Note that the exact comparison used when calling results() will also appear at the top of the table if you print

> res

to the console. I want to make sure that the comparison you made was over the ctrl_ut and Ino80_ut samples.

Yes sorry my bad, I copy and pasted from my notes the wrong line. Repeating the steps now (on an Rdata with the dds object):

> res <- results(dds, alpha=0.05, contrast=c("comparison", "Ino80_ut", "ctrl_ut"), parallel=TRUE) > head(res) log2 fold change (MAP): comparison Ino80_ut vs ctrl_ut Wald test p-value: comparison Ino80_ut vs ctrl_ut DataFrame with 6 rows and 6 columns             baseMean log2FoldChange     lfcSE       stat     pvalue      padj            <numeric>      <numeric> <numeric>  <numeric>  <numeric> <numeric> 100017     818.80462    -0.07499959 0.2200352 -0.3408526 0.73321453 0.9350128 100019     111.73721    -0.13345338 0.2312766 -0.5770293 0.56391968 0.8823707 100033459  122.13192    -0.50682085 0.4848992 -1.0452086 0.29592656 0.7507248 100034251 2976.08953     0.15876695 0.3509555  0.4523849 0.65099174 0.9142452 100034361   27.35382     0.33582071 0.2317138  1.4492912 0.14725628 0.5962709 100034726   10.98791     0.54122447 0.3251924  1.6643206 0.09604838 0.5096690 > res[rownames(res)==16175,] log2 fold change (MAP): comparison Ino80_ut vs ctrl_ut Wald test p-value: comparison Ino80_ut vs ctrl_ut DataFrame with 1 row and 6 columns        baseMean log2FoldChange     lfcSE      stat       pvalue         padj       <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric> 16175  1385.525       2.381442 0.4606158  5.170127 2.339353e-07 0.0002033802

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Elena Grassi10

With betaPrior=FALSE:

> res_nobeta[rownames(res_nobeta)==16175,] log2 fold change (MLE): comparison Ino80_ut vs ctrl_ut Wald test p-value: comparison Ino80_ut vs ctrl_ut DataFrame with 1 row and 6 columns        baseMean log2FoldChange     lfcSE       stat    pvalue      padj       <numeric>      <numeric> <numeric>  <numeric> <numeric> <numeric> 16175  1385.525     -0.8255021  1.234827 -0.6685165  0.503804 0.8774721

OK I'd recommend not using the beta prior here,  it seems it's having some bad combination with a prior on all of the batch and condition variables. This (betaPrior=FALSE) is also default behavior with current and future versions of DESeq2.

Yes, I see. Would you consider the split analysis completely wrong or ok-ish given the huge differences between treated and untreated?

1

Splitting the analysis is sometimes good, there is a FAQ that discusses this in the vignette

Must have lost it! Thank you very much for your help!

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.