EDIT OF POST: ADDITION OF CODE

Question

Different contrast coding strategies in DESeq2 produce slightly different output?

0

Entering edit mode

Ben ▴ 50

@ben-17772

Last seen 3 months ago

United States

Hi!

I am currently playing with different contrast coding strategies in R - especially in DESeq2.

As part of that, I was producing the same DGE analysis using two different contrast coding strategies: (1) using the what is called simplest case for contrasts in the DESeq2::results() help file, and (2) using what is called the most general case.

While I was expecting identical results, I saw a minor fraction of genes with differences.

These differences are in four columns of the results output: (1) log2FoldChange, (2) stat, (3) pvalue, and (4) padj.

Does anyone know why this is the case?

For now, I am skipping the original code part for this example, since it is a rather long script. And again: splitting point of the two different outputs is the DESeq2::results() function, supplying different contrast codings. So based on the fact that more than 17,200 genes are completely identical between the two outputs, I am rather certain that there is no coding issue.

But I am happy to expand on the coding part - of course.

I did attach an image of the genes (again, out of more than 17,200 genes) which will show the genes with differences. I hope the columns are self-explanatory. _general columns refer to the most general case; _DESeq2-specific columns refer to the contrast coding using the c(factor, numerator, denominator) style.

EDIT OF POST: ADDITION OF CODE

As mentioned in the original post above, the splitting point into the two different outputs is the DESeq2::results().

model_formula <- stats::formula(
  x = ~ 1 + donor_id + treatment
)

model_matrix <- stats::model.matrix(
  object = model_formula,
  data = SummarizedExperiment::colData(x = dds)
)

DESeq2::design(object = dds) <- model_formula

dds_estimate_size_factors <- DESeq2::estimateSizeFactors(
  object = dds
)

dds_estimate_dispersions <- DESeq2::estimateDispersions(
  object = dds_estimate_size_factors
)

dds_nbinom_wald_test <- DESeq2::nbinomWaldTest(
  object = dds_estimate_dispersions
)

model_contrasts <- limma::makeContrasts(
  treatment_A_effect =
    treatment_A,
  levels = model_matrix
)

dds_res_1 <- DESeq2::results(
    object = dds_nbinom_wald_test,
    contrast = model_contrasts,
    lfcThreshold = params$lfc_threshold_stats,
    altHypothesis = "greaterAbs",
    independentFiltering = TRUE,
    alpha = 0.1,
    pAdjustMethod = "BH",
    format = "DataFrame",
    tidy = TRUE
  )

tmp_contrasts <- base::list(
  treatment_A_effect = c("treatment", "A", "B")

dds_res_2 <- DESeq2::results(
    object = dds_nbinom_wald_test,
    contrast = tmp_contrasts,
    lfcThreshold = params$lfc_threshold_stats,
    altHypothesis = "greaterAbs",
    independentFiltering = TRUE,
    alpha = 0.1,
    pAdjustMethod = "BH",
    format = "DataFrame",
    tidy = TRUE
  )
)

The result of this code are two tables das_res_1 and dds_res_2, where 17,293 genes are completely identical in terms of their log2FC, p-values etc, and 12 genes are not. This is shown in the image of this post. I did not show the code for making the data frame shown in the image, but this was basic re-arranging the columns using tidyr.

BTW: Treatment "B" is the reference level in the model.

I hope the code addition helps!

Thanks much everybody!!

Genes with differences between the two different contrast coding strategies.

DESeq2 RNASeqData DifferentialExpression • 1.3k views

ADD COMMENT • link updated 3.4 years ago by Michael Love 41k • written 3.4 years ago by Ben ▴ 50

1

Entering edit mode

You should add code to explain what you did rather than just text. Right now it is basically irreproducible.

ADD REPLY • link 3.4 years ago ATpoint ★ 4.0k

0

Entering edit mode

Sure... as mentioned, I can add the code.

I will edit the original post.

Best, Ben

ADD REPLY • link 3.4 years ago Ben ▴ 50

0

Entering edit mode

Okay... Code is added now. Let me know if you have any other questions regarding the code.

ADD REPLY • link 3.4 years ago Ben ▴ 50

score 2 · Accepted Answer · 2020-11-28

2

Entering edit mode

Michael Love 41k

@mikelove

Last seen 54 minutes ago

United States

If the difference is between character and numeric contrasts, I think it is that DESeq2 will "zero out" the LFC when you supply character contrasts and when the two groups specified by the character-based contrast have 0's for all samples. This is not done for numeric contrasts.

ADD COMMENT • link 3.4 years ago Michael Love 41k

0

Entering edit mode

Hi Michael Love!

Thanks for your answer.

You are right, the values are all zero for the specified contrast - I checked.

I expect that explaining why (when all samples of these contrasts show zero counts) there are non-zero log2FC values for these genes will get into the knitty gritty of statistics? Especially one gene (which is shown in the example of my original post) showed a log2FC of -7.49 and non-1 p-values. Sure, p-values are still non-significant, but still... a (for me) unexpected observation.

Thanks again! ...And congrats to your EOSS funding!

Best, Ben

ADD REPLY • link 3.4 years ago Ben ▴ 50

0

Entering edit mode

Yes, so it is possible for the MLE to be non-zero. I'd recommend to use the character-based contrast therefore.

ADD REPLY • link 3.4 years ago Michael Love 41k