Different contrast coding strategies in DESeq2 produce slightly different output?
1
0
Entering edit mode
Ben ▴ 50
@ben-17772
Last seen 6 months ago
United States

Hi!

I am currently playing with different contrast coding strategies in R - especially in DESeq2.

As part of that, I was producing the same DGE analysis using two different contrast coding strategies: (1) using the what is called simplest case for contrasts in the DESeq2::results() help file, and (2) using what is called the most general case.

While I was expecting identical results, I saw a minor fraction of genes with differences.

These differences are in four columns of the results output: (1) log2FoldChange, (2) stat, (3) pvalue, and (4) padj.

Does anyone know why this is the case?

For now, I am skipping the original code part for this example, since it is a rather long script. And again: splitting point of the two different outputs is the DESeq2::results() function, supplying different contrast codings. So based on the fact that more than 17,200 genes are completely identical between the two outputs, I am rather certain that there is no coding issue.

But I am happy to expand on the coding part - of course.

I did attach an image of the genes (again, out of more than 17,200 genes) which will show the genes with differences. I hope the columns are self-explanatory. _general columns refer to the most general case; _DESeq2-specific columns refer to the contrast coding using the c(factor, numerator, denominator) style.

EDIT OF POST: ADDITION OF CODE

As mentioned in the original post above, the splitting point into the two different outputs is the DESeq2::results().

model_formula <- stats::formula(
  x = ~ 1 + donor_id + treatment
)

model_matrix <- stats::model.matrix(
  object = model_formula,
  data = SummarizedExperiment::colData(x = dds)
)

DESeq2::design(object = dds) <- model_formula

dds_estimate_size_factors <- DESeq2::estimateSizeFactors(
  object = dds
)

dds_estimate_dispersions <- DESeq2::estimateDispersions(
  object = dds_estimate_size_factors
)

dds_nbinom_wald_test <- DESeq2::nbinomWaldTest(
  object = dds_estimate_dispersions
)

model_contrasts <- limma::makeContrasts(
  treatment_A_effect =
    treatment_A,
  levels = model_matrix
)

dds_res_1 <- DESeq2::results(
    object = dds_nbinom_wald_test,
    contrast = model_contrasts,
    lfcThreshold = params$lfc_threshold_stats,
    altHypothesis = "greaterAbs",
    independentFiltering = TRUE,
    alpha = 0.1,
    pAdjustMethod = "BH",
    format = "DataFrame",
    tidy = TRUE
  )

tmp_contrasts <- base::list(
  treatment_A_effect = c("treatment", "A", "B")

dds_res_2 <- DESeq2::results(
    object = dds_nbinom_wald_test,
    contrast = tmp_contrasts,
    lfcThreshold = params$lfc_threshold_stats,
    altHypothesis = "greaterAbs",
    independentFiltering = TRUE,
    alpha = 0.1,
    pAdjustMethod = "BH",
    format = "DataFrame",
    tidy = TRUE
  )
)

The result of this code are two tables das_res_1 and dds_res_2, where 17,293 genes are completely identical in terms of their log2FC, p-values etc, and 12 genes are not. This is shown in the image of this post. I did not show the code for making the data frame shown in the image, but this was basic re-arranging the columns using tidyr.

BTW: Treatment "B" is the reference level in the model.

I hope the code addition helps!

Thanks much everybody!!

Genes with differences between the two different contrast coding strategies.

DESeq2 RNASeqData DifferentialExpression • 1.5k views
ADD COMMENT
1
Entering edit mode

You should add code to explain what you did rather than just text. Right now it is basically irreproducible.

ADD REPLY
0
Entering edit mode

Sure... as mentioned, I can add the code.

I will edit the original post.

Best, Ben

ADD REPLY
0
Entering edit mode

Okay... Code is added now. Let me know if you have any other questions regarding the code.

ADD REPLY
2
Entering edit mode
@mikelove
Last seen 14 hours ago
United States

If the difference is between character and numeric contrasts, I think it is that DESeq2 will "zero out" the LFC when you supply character contrasts and when the two groups specified by the character-based contrast have 0's for all samples. This is not done for numeric contrasts.

ADD COMMENT
0
Entering edit mode

Hi Michael Love!

Thanks for your answer.

You are right, the values are all zero for the specified contrast - I checked.

I expect that explaining why (when all samples of these contrasts show zero counts) there are non-zero log2FC values for these genes will get into the knitty gritty of statistics? Especially one gene (which is shown in the example of my original post) showed a log2FC of -7.49 and non-1 p-values. Sure, p-values are still non-significant, but still... a (for me) unexpected observation.

Thanks again! ...And congrats to your EOSS funding!

Best, Ben

ADD REPLY
0
Entering edit mode

Yes, so it is possible for the MLE to be non-zero. I'd recommend to use the character-based contrast therefore.

ADD REPLY

Login before adding your answer.

Traffic: 864 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6