Question

Counter-directional LFC shrinkage

0

Entering edit mode

ennui • 0

@5a77cdfc

Last seen 17 months ago

Switzerland

Hi,

First a caveat: I am new to enrichment analyses, following the DESeq2 guide. The data is not mine and I have very little to say in terms of experimental design, I am just trying to improve on a "naive" analysis of producing a ration counts-per-million which exaggerates low-count reads.

I noticed something counter-intuitive in my data, namely that lfcShrink does the opposite of what it should: low-count reads are exaggerated and high-count reads are shrunk.

Two example genes below.

# counts, metadata defined elsewhere
> dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData = metadata,
    design = ~condition
  )
> dds$condition <- relevel(dds$condition, ref = "reference")

# many zeros in this dataset hence poscount
> dds <- DESeq2::DESeq(dds, sfType = "poscount")

> result_unshrunk <- results(dds, name = "condition_condition1_vs_reference")
> result_unshrunk[c('gene1', 'gene2'),]

log2 fold change (MLE): condition1 vs reference 
Wald test p-value: condition1 vs reference 
DataFrame with 2 rows and 6 columns
       baseMean log2FoldChange     lfcSE      stat     pvalue      padj
      <numeric>      <numeric> <numeric> <numeric>  <numeric> <numeric>
gene1   11.0575       -5.49035   2.08908  -2.62811 0.00858602  0.116205
gene2 7537.5202        8.97127   5.23205   1.71468 0.08640462  0.289067

> result_apeglm <- lfcShrink(dds, coef = "condition_condition1_vs_reference", type = "apeglm")
> result_apeglm[c('gene1', 'gene2'),]

log2 fold change (MAP): condition1 vs reference 
Wald test p-value: condition1 vs reference 
DataFrame with 2 rows and 5 columns
       baseMean log2FoldChange     lfcSE     pvalue      padj
      <numeric>      <numeric> <numeric>  <numeric> <numeric>
gene1   11.0575      10.457654        NA 0.00858602  0.116205
gene2 7537.5202       0.123898        NA 0.08640462  0.289067

Looking at the normalized counts below (from plotCounts). It seems obvious to me that gene2 should be the one with higher LFC. Both genes have the same counts in reference, but one has a clearly higher average. In the un-shrunk calculation gene2 is higher, but the relationship reverses under the shrunk model.

Gene 1 - `plotCounts` Gene 2 - `plotCounts`

Now, this is just a small snapshot of the data, which I cannot share, so I appreciate that it's not possible to some up with a definitive answer via a forum. Still, I would appreciate some guidance as to how to diagnose what is really happening?

Below are the distributions from plotMA.

Basic LFC calculation apeglm-shrunk LFCs

Does this give a hint? Perhaps it's the non-symmetric LFC distribution that's the problem? Thanks a lot for any help.

DESeq2 • 1.1k views

ADD COMMENT • link updated 17 months ago by Michael Love 43k • written 17 months ago by ennui • 0

0

Entering edit mode

Is this RNA-seq? The unshrunken MA-plot looks very odd and indicates that normalization is completely offset.

ADD REPLY • link 17 months ago ATpoint ★ 5.0k

0

Entering edit mode

I agree it looks offset (and weird). The data is from NGS, not RNAseq, but we still want to see relative abundance between conditions - I thought the same principles would apply.

ADD REPLY • link 17 months ago ennui • 0

score 0 · Answer 1 · 2024-05-28

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

It's not just the average count but also the dispersion that affects apeglm.

You can try ashr for another type of shrinkage.

ADD COMMENT • link 17 months ago Michael Love 43k

0

Entering edit mode

Thank you for your response. The ashr shrinkage indeed preserves more of the original ranks in the data.

So if I understand correctly, LFC for Gene 2 in the example is strongly shrunk because it's more evenly dispersed between 0 and 10e6 in condition1. Gene 1 is clustered around 0 with a single outlier, hence less dispersion, hence increase rather than decrease in LFC?

ADD REPLY • link 17 months ago ennui • 0

0

Entering edit mode

Gene 1 is hard to predict given the paucity of data for the comparison of interest. What is happening is that the relatively low dispersion from condition 1 and 2 is bringing the dispersion down for the whole gene but then you only have a single count for the reference.

I would actually recommend you not process the data altogether for this dataset (so instead run DESeq separately on condition 1 vs ref and condition 2 vs ref), because I think the model is too confident from pooling dispersion across all samples. For a two group comparison I would guess it would be less confident, more accurately summarizing the data.

ADD REPLY • link 17 months ago Michael Love 43k

score 0 · Answer 2 · 2024-05-28

0

Entering edit mode

swbarnes2 ★ 1.4k

@swbarnes2-14086

Last seen 36 minutes ago

San Diego

The y axes in those plots are logged. The average for condition1 is going to be higher than condition2 because of the very high outliers.

ADD COMMENT • link 17 months ago swbarnes2 ★ 1.4k