Question: DESeq2 and betaPrior
0
gravatar for Elena Grassi
2.1 years ago by
Elena Grassi10
Italy, Rozzano, Humanitas University
Elena Grassi10 wrote:

Hi, I've recently started using DESeq2 with complex experimental designs and I've stumbled across some facts which I cannot wrap my head around.
One of the aims of the experiment is to study the transcriptional response following knockdown of several genes, working both on basal conditions and in an "induced state" were we know that several genes are up/downregulated (due to a treatment).
Right now I've got samples obtained in two different batches with 4 biological replicates for the sh and some more for the wt-controls.
I've decided to use a simple design with a factor representing the treatment and sh condition together as suggested in the documentation and another one to control for the batch effect: "~ batch + condition", where condition is "shgeneX|control_untreated|treated". After fitting the model I call DE genes for all the desired comparisons using results(dds, c("condition", "shgeneX_treated", "control_treated")). I am using betaPrior=TRUE since someone suggested me that this is on average a good idea.

With this setup I obtained some strange calls, namely some genes that are not expressed in the untreated samples are called as DE (FDR corrected for all comparison < 0.05 and |lfc|>1) in comparisons shgeneX_untreated vs control_untreated...I therefore reasoned that my initial filtering on raw reads would be
able to filter out these genes if I worked separately on the treated and untreated samples, building two different DESeq2 objects and so on. I also feared that working with everything together the variance estimates could be bloated due to the high variability between the treated and untreated conditions (they are widely separated the the first principal component).

To make a long story short when I split treated and untreated samples I got very similar pvalues and baseMeans but smaller log fold changes, and therefore the "strange" calls go away (I am not always able to filter out those genes using a filter on raw reads since in some sh samples, which are not outliers for PCA, they still have a reasonable number of reads...). The DE genes with this setup are a subset of the other one.
Since I stumbled in this observation: "The beta prior is estimated from the observed differences between conditions across all genes, so if there are many large differences here, there is less shrinkage." DESeq2 with high dispersions, I tried to use betaPrior=FALSE with the split setup and this made lfc more similar to the first setup (for the curious ones I put some plots here https://www.dropbox.com/sh/5i9r3dsva23xbys/AABXjauLgOdojtYNdSfS_6u3a?dl=0).
My questions, sorry for the long post, are:
- in experiments like this one is betaPrior=T/lfc shrinkage suggested or not? To me it seems that shrinking fold changes in this way would work if the comparisons I am interested in are the ones where the large differences lies, but I am not sure
- considering the fact that I need to perform the low counts filtering on treated and untreated samples separately is it ok to split the analysis or not? Right now I am tempted to use the results obtained with the split setup and betaPrior=T but I would like some insights :)

deseq2 biostatistics • 774 views
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Elena Grassi10

merged_pruned is a data.frame with samples identifiers as colnames, entrez as rownames and counts as values.

conditions_pruned has columns conditions_sample, rep, comparison and has samples identifiers as rownames.

mincounts is 10, mingroup is 3.

dds <- DESeqDataSetFromMatrix(countData = merged_pruned, colData = conditions_pruned, design = ~ rep + comparison)
filterGenes <- rowSums(counts(dds) > mincounts) < mingroup
dds <- dds[!filterGenes]
dds <- DESeq(dds, parallel=TRUE, betaPrior=TRUE)

res <- results(dds, alpha=0.05, contrast=c("comparison", "Ino80_lps", "ctrl_lps"), parallel=TRUE)

Am I making some rookie mistakes? :)

ADD REPLYlink written 2.1 years ago by Elena Grassi10
Answer: DESeq2 and betaPrior
1
gravatar for Michael Love
2.1 years ago by
Michael Love25k
United States
Michael Love25k wrote:

hi Elena,

Can you show an example of a gene where you got a small p-value for the sh vs control in untreated comparison but you say also the gene is not expressed in untreated? By not expressed, you mean low counts or 0 counts? You could use plotCounts() for example for such a gene. Also can you show the colData, e.g. as.data.frame(colData(dds))?

ADD COMMENTlink written 2.1 years ago by Michael Love25k

Yup, everything is coming (thanks for the patience in reading :)).

ADD REPLYlink written 2.1 years ago by Elena Grassi10

One of the strange genes:

the results for one of the UT comparisons:

entrez baseMean log2FoldChange lfcSE stat pvalue padj padj_multi
16175 1385.52457620314 2.38144205539228 0.46061579583222 5.170126767124 2.3393525067864e-07 0.000203380151207798 0.000161511179359231

The colData:

                   condition_sample rep comparison sizeFactor
SID100271  TRCN0000124093 Ino80b_ut   1  Ino80b_ut  1.0864321
SID100272  TRCN0000124089 Ino80b_ut   1  Ino80b_ut  0.9937351
SID100273   TRCN0000096374 Ino80_ut   1   Ino80_ut  1.0711857
SID100274   TRCN0000096377 Ino80_ut   1   Ino80_ut  0.7002019

...

since it's 169 rows maybe you prefer to have it via email :)

Also the counts are huge "long" :) , the one of the reported comparison are:

> counts <- plotCounts(dds, gene="16175", intgroup="comparison", returnData=TRUE, normalized=F)

> counts[counts$comparison == "ctrl_ut",]
          count comparison
SID100335   0.5    ctrl_ut
SID100352   0.5    ctrl_ut
SID100353   0.5    ctrl_ut
SID100356   0.5    ctrl_ut
SID100358   0.5    ctrl_ut
SID100442   1.5    ctrl_ut
SID100461   0.5    ctrl_ut
SID100462  11.5    ctrl_ut
SID100463   1.5    ctrl_ut
SID100465   0.5    ctrl_ut
> counts[counts$comparison == "Ino80_ut",]
          count comparison
SID100273   0.5   Ino80_ut
SID100274   1.5   Ino80_ut
SID100380   0.5   Ino80_ut
SID100381   1.5   Ino80_ut

 

And normalized:

> counts <- plotCounts(dds, gene="16175", intgroup="comparison", returnData=TRUE, normalized=T)
> counts[counts$comparison == "Ino80_ut",]
             count comparison
SID100273 0.500000   Ino80_ut
SID100274 1.928159   Ino80_ut
SID100380 0.500000   Ino80_ut
SID100381 1.331427   Ino80_ut
> counts[counts$comparison == "ctrl_ut",]
              count comparison
SID100335 0.5000000    ctrl_ut
SID100352 0.5000000    ctrl_ut
SID100353 0.5000000    ctrl_ut
SID100356 0.5000000    ctrl_ut
SID100358 0.5000000    ctrl_ut
SID100442 0.8369791    ctrl_ut
SID100461 0.5000000    ctrl_ut
SID100462 7.6661643    ctrl_ut
SID100463 0.8833815    ctrl_ut
SID100465 0.5000000    ctrl_ut

 

 

 

ADD REPLYlink written 2.1 years ago by Elena Grassi10

Sorry I realized that I shoud tell you that these data come from the "together" analysis. THe relevant colData:

> d[d$comparison == "ctrl_ut",]
          condition_sample rep comparison sizeFactor
SID100335        SHC002_ut   1    ctrl_ut  0.9642440
SID100352           luc_ut   1    ctrl_ut  1.0615145
SID100353           luc_ut   1    ctrl_ut  1.5108928
SID100356           luc_ut   1    ctrl_ut  1.1745947
SID100358        SHC004_ut   1    ctrl_ut  0.3549336
SID100442        SHC002_ut   2    ctrl_ut  2.9675434
SID100461           luc_ut   2    ctrl_ut  1.4947868
SID100462           luc_ut   2    ctrl_ut  1.5349913
SID100463           luc_ut   2    ctrl_ut  2.6083681
SID100465        SHC004_ut   2    ctrl_ut  0.8363213
> d[d$comparison == "Ino80_ut",]
                 condition_sample rep comparison sizeFactor
SID100273 TRCN0000096374 Ino80_ut   1   Ino80_ut  1.0711857
SID100274 TRCN0000096377 Ino80_ut   1   Ino80_ut  0.7002019
SID100380 TRCN0000096374 Ino80_ut   2   Ino80_ut  0.8785057
SID100381 TRCN0000096377 Ino80_ut   2   Ino80_ut  1.2027508

In the split approach this particular gene is filtered out by my expression filters in the ut comparisons.

ADD REPLYlink written 2.1 years ago by Elena Grassi10

And can you show the code that produced that first line of results?

ADD REPLYlink written 2.1 years ago by Michael Love25k

In the code you posted above, you contrast "Ino80_lps" and "ctrl_lps" but you are showing the normalized counts for ctrl_ut and Ino80_ut. Note that the exact comparison used when calling results() will also appear at the top of the table if you print 

> res

to the console. I want to make sure that the comparison you made was over the ctrl_ut and Ino80_ut samples.

ADD REPLYlink written 2.1 years ago by Michael Love25k

Yes sorry my bad, I copy and pasted from my notes the wrong line. Repeating the steps now (on an Rdata with the dds object):

> res <- results(dds, alpha=0.05, contrast=c("comparison", "Ino80_ut", "ctrl_ut"), parallel=TRUE)
> head(res)
log2 fold change (MAP): comparison Ino80_ut vs ctrl_ut
Wald test p-value: comparison Ino80_ut vs ctrl_ut
DataFrame with 6 rows and 6 columns
            baseMean log2FoldChange     lfcSE       stat     pvalue      padj
           <numeric>      <numeric> <numeric>  <numeric>  <numeric> <numeric>
100017     818.80462    -0.07499959 0.2200352 -0.3408526 0.73321453 0.9350128
100019     111.73721    -0.13345338 0.2312766 -0.5770293 0.56391968 0.8823707
100033459  122.13192    -0.50682085 0.4848992 -1.0452086 0.29592656 0.7507248
100034251 2976.08953     0.15876695 0.3509555  0.4523849 0.65099174 0.9142452
100034361   27.35382     0.33582071 0.2317138  1.4492912 0.14725628 0.5962709
100034726   10.98791     0.54122447 0.3251924  1.6643206 0.09604838 0.5096690
> res[rownames(res)==16175,]
log2 fold change (MAP): comparison Ino80_ut vs ctrl_ut
Wald test p-value: comparison Ino80_ut vs ctrl_ut
DataFrame with 1 row and 6 columns
       baseMean log2FoldChange     lfcSE      stat       pvalue         padj
      <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
16175  1385.525       2.381442 0.4606158  5.170127 2.339353e-07 0.0002033802

 

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Elena Grassi10

With betaPrior=FALSE:

> res_nobeta[rownames(res_nobeta)==16175,]
log2 fold change (MLE): comparison Ino80_ut vs ctrl_ut
Wald test p-value: comparison Ino80_ut vs ctrl_ut
DataFrame with 1 row and 6 columns
       baseMean log2FoldChange     lfcSE       stat    pvalue      padj
      <numeric>      <numeric> <numeric>  <numeric> <numeric> <numeric>
16175  1385.525     -0.8255021  1.234827 -0.6685165  0.503804 0.8774721

 

ADD REPLYlink written 2.1 years ago by Elena Grassi10

 OK I'd recommend not using the beta prior here,  it seems it's having some bad combination with a prior on all of the batch and condition variables. This (betaPrior=FALSE) is also default behavior with current and future versions of DESeq2.

ADD REPLYlink written 2.1 years ago by Michael Love25k

Yes, I see. Would you consider the split analysis completely wrong or ok-ish given the huge differences between treated and untreated?

ADD REPLYlink written 2.1 years ago by Elena Grassi10
1

Splitting the analysis is sometimes good, there is a FAQ that discusses this in the vignette

ADD REPLYlink written 2.1 years ago by Michael Love25k

Must have lost it! Thank you very much for your help!

ADD REPLYlink written 2.1 years ago by Elena Grassi10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 166 users visited in the last hour