lfcShrink and factors with multiple levels
1
1
Entering edit mode
mat.lesche ▴ 110
@matlesche-6835
Last seen 7 months ago
Germany

Hello Michael,

So far I have used DESeq 2 with betaPrior = T in the DESeq() function but now I'm testing it with betaPrior = False and using lfcShrink either with the coef or contrast parameter (DESeq2 version 1.18.0). I notice a couple of things which I think need clarifcation

I used a fairly common experimental design with a factor that has 3 levels and DM is initial reference level.

coldata$Cond <- factor(coldata$Cond, levels = c('DM', 'DP', 'Co'))
ddsdm <- DESeqDataSetFromMatrix(countData = counttab, colData = coldata, design = ~ Cond)
ddsdm <- DESeq(ddsdm, parallel=T, betaPrior = F)

With that I ran a couple of comparison and I'm pasting the command calls and the fold changes and was is shown if one does a head on the result obj

1a) Comparison DM vs DP with DP as denumerator

res_dm_dmVdp <- results(ddsdm, contrast = c('Cond', 'DM', 'DP'), alpha = alpha)

Cond DM vs DP and log2 fold change (MLE): Cond DM vs DP
LFC > 0 (up)     : 576
LFC < 0 (down)   : 401

1b) Applying lfcShrink to the result

res_dm_dmVdp_shrink <- lfcShrink(ddsdm, coef = 2, res = res_dm_dmVdp)

Cond DM vs DP and log2 fold change (MAP): Cond DP vs DM
LFC > 0 (up)     : 401
LFC < 0 (down)   : 576

As one can see the genes that were down regulated are now up and vice versa. p-value and padj values stay the same. I went through the documentation (https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) and tried to find a section which could explain this and didn't find anything. I did however find a post on bioconductor which explains it (DESeq2 lfcShrink() usage of coef vs. usage of contrast) and you explain that this is an intendedbehavior of the function. I think it would be best to describe this better in the manual because many people will just use that program and lfcShrink and may not check it as good as they should and report the wrong results.

2a) This will even more confusion when one has more than two levels in a factor. For example, here I compare Co to DP

res_dm_coVdp <- results(ddsdm, contrast = c('Cond', 'Co', 'DP'), alpha = alpha)

 

Cond Co vs DP and log2 fold change (MLE): Cond Co vs DP
LFC > 0 (up)     : 100
LFC < 0 (down)   : 371

 

2b) adding lfcShrink

res_dm_coVdp_shrink <- lfcShrink(ddsdm, coef = 2, res = res_dm_coVdp)

Cond Co vs DP and log2 fold change (MAP): Cond DP vs DM
LFC > 0 (up)     : 222
LFC < 0 (down)   : 249

and I get the foldchanges for DP vs DM but the pvalues and padj values for Co vs DP

2c) One can make this correct by using contrast instead of coef

res_dm_coVdp_shrink_contrast <- lfcShrink(ddsdm, res = res_dm_coVdp, contrast = c('Cond', 'Co', 'DP'))

 

Cond Co vs DP and log2 fold change (MLE): Cond Co vs DP
LFC > 0 (up)     : 103
LFC < 0 (down)   : 368

and we are back to Co vs DP for fold changes, pvalue and padj. However, 3 genes went from down-regulated to up-regulated. I checked these genes and the highest basemean was 107. So it's lowly expressed but what if the gene has a high expression and it's a good candidate for experimental validation and suddenly the fold change is flipped. I think that's quite hard to explain to biologist. Could you explain why this can happen? I also noticed the the difference in fold change between coef and contrast become higher the lower the expression of a gene is.

Again the different behavior of coef and contrast has already been explained (DESeq2 lfcShrink() usage of coef vs. usage of contrast) but I think it would be good to have it in the documentation too. And maybe show the users a design that relates to DESeq2 version before 1.16.1

3) Lastly my question is how do I use lfcShrink when I have multi level factors and I want to compare Co vs DP, DM vs DP or Co vs DM.  lfcShrink with coef will always give me DP vs DM because that's how it has been set up in the beginning. The only way I can think of is to do a relevel every time

ddsdm_reset <- ddsdm
ddsdm_reset$Cond <- relevel(ddsdm_reset$Cond, ref = "DP")
ddsdm_reset <- DESeq(ddsdm_reset, parallel=T, betaPrior = F)

res_dm_dmVdp_reset <- results(ddsdm_reset, contrast = c('Cond', 'DM', 'DP'), alpha = alpha)

Cond DM vs DP and log2 fold change (MLE): Cond DM vs DP
LFC > 0 (up)     : 576
LFC < 0 (down)   : 401

res_dm_dmVdp_reset_shrink <- lfcShrink(ddsdm_reset, coef = 2, res = res_dm_dmVdp_reset)

Cond DM vs DP and log2 fold change (MLE): Cond DM vs DP
LFC > 0 (up)     : 576
LFC < 0 (down)   : 401

But how do you set it to have Co vs DP for the lfcShrink with the relevel?

You wrote in the two threads (DESeq2 lfcShrink() usage of coef vs. usage of contrast and https://support.bioconductor.org/p/95695/) that the future way is DESeq => results => lfcShrink and coef because that is future proof. So how do I use it for the above shown example.

4) I noticed a result object that has been run with IHW looses the weight columns after passing it to lfcShrink

 

Thanks for taking the time to answer it and if the above is already explained somewhere, then please can you provide a link.

We are already discussing this at our site, especially now that results are changed and one can go different ways (betaPrior set to True or False but then lfcShrink with coef or contrast) and it would good to have a way to guarantee a backwards comparability to older version (because somehow this has to be explained to the biologists) and have the new improvements too.

Best

Mathias 

 

 

 

 

 
deseq2 lfcshrink relevel • 6.5k views
ADD COMMENT
0
Entering edit mode

For 1a and 1b, you've swapped numerator and denominator - "Cond DM vs DP" is output by DESeq in 1a, but 1b reports it's calculating "Cond DP vs DM".  Using 'contrast' you specify the numerator first and the denominator (ie what would be the baseline for the 'coef' approach) second, so the flips you're getting look to be incorrectly specified contrasts.

ADD REPLY
0
Entering edit mode

Hi Gavin,

That's exactly the point I'm trying to make. I'm aware that it is swapped. If I'm interested in DM vs DP and I define it in the results() function, I get the opposite from lfcShrink because the initial reference level is set to DM. It's even more worrying if the initial level is set to Co. You will just get the wrong fold changes. If you are not aware of it and it's not explained in the documentation, you report the wrong results. I might add coef is recommend to be the future go to option
 

ADD REPLY
0
Entering edit mode

Thanks for the feedback. I’ll reply in more depth to your questions today, but on this point, ’coef’ should specify a name or number of resultsNames(dds) according to the docs.

I don’t see what’s misleading then. Aren’t you getting the fold changes for that coefficient you asked for?

 

ADD REPLY
0
Entering edit mode

Hey Michael,

Thanks for the reply. Now the doc to ?lfcShrink makes more sense: "the number of the coefficient (LFC) to shrink, consult resultsNames(dds) after running DESeq(dds, betaPrior=FALSE). only coef or contrast can be specified, not both"

I simply did not make the connection from the text, maybe due to lack of technical vocabulary. It would be good if it's explained in the html on bioconductor. Also some of my colleagues couldn't make sense out of it as well.

coldata$Cond <- factor(coldata$Cond, levels = c('DM', 'DP', 'Co'))
ddsdm <- DESeqDataSetFromMatrix(countData = counttab, colData = coldata, design = ~ Cond)
ddsdm <- DESeq(ddsdm, parallel=T, betaPrior = F)

resultsNames(ddsdm)
[1] "Intercept"     "Cond_DP_vs_DM" "Cond_Co_vs_DM"

How can I use lfcShrink with coef for ('Cond', 'Co', 'DP')? coef will only accept 1,2 or 3 but it's none of those.

I have to say contrast in lfcShrink() looks as a better option at the moment. Because I can define exactly what I want, e.g. c('Cond', 'Co', 'DP'), therefore it's more bullet proof and way easier to use in automated pipelining . On top of it, I know how to interpret the fold changes because I get what I type. Whereas coef gives me different calls depending how the reference level is set and will be hard to interpret if you want to wrap it into automated process.

I don't want to imply that it is wrong, because it isn't. It just how I see it from the user's view. I'm sure there are many people who are using DESeq2 in automated workflows or who simply have scripts that they use all the time and this might lead to errors on the user's side.

 

 

ADD REPLY
1
Entering edit mode
@mikelove
Last seen 3 days ago
United States

Hi Mat,

Thanks for the feedback, your comments have been valuable in the past, and I'm sure we can update the documentation more. I've been spending most of my free time to work on the new methods, and on the primary documentation in the man pages, and now that they are in place, I will have time to add to the DESeq2 vignette.

Regarding, "I simply did not make the connection from the text, maybe due to lack of technical vocabulary. It would be good if it's explained in the html on bioconductor.",

Do you mean that I should put more text about 'coef' and 'contrast' in the section on lfcShrink in the vignette? This is probably a good idea.

If you want to compare all levels of a factor, I'd recommend sticking with the use of 'contrast', when calling lfcShrink(). This will be supported going forward. (Also supported going forward is the use of betaPrior=TRUE, which provides 100% backwards compatibility, avoiding the use of a new function, so that users who don't want to incorporate any additional functionality would not have to change their scripts except for a single argument. I think there hasn't been any change which removes functionality that users had previously.)

Another recommendation from the above, is that, it was possible for a user to overwrite a results table with different LFCs than the p-value, by first using one contrast with results(), then using another with lfcShrink() and providing the previous results table. Two notes about this: the contrast associated with LFC, p-values, etc is always recorded in the mcols(res), and also printed when the results table is printed to console. So at least it can be spotted by eye. But more importantly, in v1.18, the recommended code doesn't provide res anymore, so it is created inside lfcShrink() using the 'contrast' or 'coef'. This is more foolproof, so that users cannot possibly specify different contrasts for the results table and the LFC.

ADD COMMENT
3
Entering edit mode

Hi Michael,

Here are some suggestion for the documentation

?lfcShrink:

"the name or number of the coefficient (LFC) to shrink, consult resultsNames(dds) after running DESeq(dds). note: only coef or contrast can be specified, not both. type="apeglm" requires use of coef."

I would write something like "Consult resultsNames(dds) after running DESeq(dds) to obtain the index number or name of the coeffient (LFC) one wants to shrink. note: only coef or contrast can be specified, not both. coef uses the reference level. type="apeglm" requires use of coef." 

I know that is just flipped but I makes more sense for me now. Additionally, the call of resultsNames(dds) can be shown in the example.

In general, in regards to https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#alternative-shrinkage-estimators and https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#quick-start, I think what would make it more clearer is to explain what lfcShrink and especially what coef = 2 means. 

It is already shown in https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#differential-expression-analysis To be honest I did not make the connection that coef = 2 equals  "condition_treated_vs_untreated".  Maybe it's also good to show the contrast alternative and show them how they can obtains results with betaPrior = True or lfcShrink and contrast which provides backwards compatibility 

Many people will only read the doc and set lfcShrink coef always to 2 and don't know what they are doing. The better way would be to show them the connection to resultsNames(dds) and how that works. I have worked with people who just go through a doc or tutorial and apply what they see because they only use it once or twice. I do this sometimes too and I know it's wrong. But for example if a person runs the tutorial and doesn't has a factor with 3 levels (A,B,C), this person won't make a connection and always type 2 for coef.

I don't want to be too nitpicky but I have talked to other people here and they had the same issues with the lfcShrink method.From my experience it's a problem in the level knowledge and how often I use this knowledge. For example if I talk to a biologist, I won't understand some of their vocabulary and vice versa. I write a report for them and everything is clear to me (because I do this all day long) but they have issues understanding it. 

I know that your focus is on the development of the new methods and it's great that you keep the tutorials up to date! There are way to many other projects which don't have that

ADD REPLY
0
Entering edit mode

Thanks Mat, this is really useful.

Yes, in the vignette it's currently pretty thin on lfcShrink(), I will start working on expanding these sections in the devel branch, and maybe adding another about 'coef' and 'contrast', when they can be used with other arguments.

ADD REPLY
0
Entering edit mode

dear Mat,

I've updated the documentation in the development branch. Both the recommendations you have here for ?lfcShrink and I also wrote a lot in the vignette, to gently introduce lfcShrink() and compare the different methods. The interface for the new functionality is solidifying, hopefully it's becoming easier to use. Thanks for the feedback.

ADD REPLY
0
Entering edit mode

Also, let me know if I missed anything in this answer, and I'll add to it with comments.

ADD REPLY
0
Entering edit mode

Two more notes from above:

"A result object that has been run with IHW loses the weight columns after passing it to lfcShrink"

Thanks for noting this, I'll add a fix for this in devel.

For this code chunk:

res_dm_coVdp_shrink_contrast <- lfcShrink(ddsdm, res = res_dm_coVdp, contrast = c('Cond', 'Co', 'DP'))

Cond Co vs DP and log2 fold change (MLE): Cond Co vs DP
LFC > 0 (up)     : 103
LFC < 0 (down)   : 368

Something doesn't seem right, because it says (MLE) here. The lfcShrink() function would have MAP. Can you review to make sure this the table that was output by lfcShrink()?

ADD REPLY
0
Entering edit mode

I'll have a look and report back. the lfcSE column disappears too. Also I'll write a bigger comment on how to improve the documentation.

ADD REPLY
0
Entering edit mode

With the current release, v1.18, you should get a more reasonable output: the LFC and lfcSE are updated to reflect shrinkage, and any other columns should come along.

ADD REPLY
0
Entering edit mode

re: IHW and then losing the "weight" column. I just noticed, when I rearranged code for 1.18, the current release, I fixed this issue. So this only happens in 1.16, and I can't change the code there anymore, as it's no longer the release version.

ADD REPLY
0
Entering edit mode

That explains it. I only noticed it in version 1.16

ADD REPLY
0
Entering edit mode

This was my mistake and most likely due to copy and paste. I just ran it with 1.16 and 1.18 and it's MAP

ADD REPLY
0
Entering edit mode

One more thing, can you paste the normalized counts for one of these genes that have changing sign of LFC for this contrast when you go from MLE to MAP?

ADD REPLY
0
Entering edit mode
  Co1 DP1 DP2 DM1 DM1
GeneA 0 0.859386 0.0000000 0 321.0731
GeneB 0 0.000000 0.6611592 0 535.1218
GeneC 0 0.859386 0.0000000 0 419.9690

The comparison was Co vs DP and it's a single sample vs two replicates. I would not recommend to use the padj values or even check for DEGs with this data set as it has been explained in your Question Section of the documentation ("Can I use DESeq2 to analyze a dataset without replicates?").  But I was wondering why these are flipped and if this could also happen when it's a proper experiment (3 or more replicates). And I noticed the warning message is not show at the moment. See "If a DESeqDataSet is provided with an experimental design without replicates, a warning is printed, that the samples are treated as replicates for estimation of dispersion."

ADD REPLY
0
Entering edit mode

There are replicates, just not for every condition. This allows dispersion estimation. This is a case where the fitting procedure is being thrown off by the single large count for one sample. And DESeq2's outlier procedure doesn't apply when the number of replicates is 6 or fewer. A more robust test for this design with few replicates is actually the LRT. If you set up:

full <- model.matrix(design(dds), colData(dds))

You can compare DM vs CO and DP vs CO using the code below, which I think would not give these genes significance for DP vs CO. This doesn't help you compare DM vs DP however.

dds2 <- nbinomLRT(dds, full=full, reduced=full[,-2])
results(dds2)
dds3 <- nbinomLRT(dds, full=full, reduced=full[,-3])
results(dds3)
ADD REPLY
0
Entering edit mode

We were thinking about doing a separate analysis for DM vs DP because the samples are quite problematic when it comes to sample extraction. But I could use your shown workflow for the DP vs CO and DM vs CO. I don't really want to recommend these genes as DEGs but I like to show the change in expression and from that conduct a experiment with more samples. But just in case the padj values would be valid to use?

ADD REPLY

Login before adding your answer.

Traffic: 777 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6