Quickly calculating adjusted Fold Change with DESeq2
Luca • 0
@6367e4e1
Last seen 5 months ago
Italy

Hello!

I'm working on comparing tumor and healthy samples with DESeq2. My analysis only needs the adjusted fold change values that DESeq2 computes. The input data covers a huge number of samples (I'm working with TCGA and GTEx data), and I need to run the calculation many times; each call to DESeq() takes a very, very long time to finish.

Is there a way to speed the process up, given that I only need the adjusted fold changes rather than the whole DESeq2 output?

The full script that I run on the data is available at this permalink.

The specific code snippet is here:

run_deseq <- function(case_frame, control_frame) {

  # Make a dummy metadata frame
  metadata <- data.frame(
    row.names = c(colnames(case_frame), colnames(control_frame)),
    status = factor(
      c(rep("case", ncol(case_frame)), rep("control", ncol(control_frame))),
      levels = c("control", "case")
    )
  )

  # Merge the case and control frames on shared genes
  data <- rowmerge(case_frame, control_frame, all = FALSE) # inner merge

  # Reorder the columns so they match the metadata rows
  data <- data[, row.names(metadata)]

  # The data is stored as log2(x + 1), so un-log it back to (integer) counts
  data <- round(2^data - 1)

  dsq_obj <- DESeq2::DESeqDataSetFromMatrix(data, metadata, ~ status) |>
    DESeq2::DESeq()

  dsq_res <- DESeq2::results(dsq_obj, name = "status_case_vs_control")

  cat("DESeq finished running!\n")
  as.data.frame(dsq_res) |> tibble::rownames_to_column("gene_id")
}

I'm not posting sessionInfo() since nothing is actually broken: the code works, it's just slow.

Thank you!

DifferentialExpression DESeq2

Update: on Mike Love's suggestion, I tried the glmGamPoi option for fitType, but it seems to make the runtime even longer, so it does not look suitable here.
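
For reference, the call I tried looked roughly like this (dds stands in for the DESeqDataSet built inside run_deseq() above; the glmGamPoi package has to be installed for this fitType):

# glmGamPoi dispersion fitting instead of the default parametric fit;
# requires the glmGamPoi package to be installed.
dsq_obj <- DESeq2::DESeq(dds, fitType = "glmGamPoi")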


The DESeq() function can be parallelized via BiocParallel; see ?DESeq and the BPPARAM argument (a sketch follows the benchmark output below). Beyond that, what do you define as "very long"? On simulated data with 50,000 genes and 1,000 samples, as below, DESeq2 runs for about 3.5 min on my machine on a single core (i7 12700k) for a design with a single covariate. Does it run a lot longer for you? Would a linear approach such as voom be an option?

library(DESeq2)

# Simulate a dataset with 50,000 genes and 1,000 samples
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 50000, m = 1000)

fun <- function() DESeq(dds)

# Time a single end-to-end DESeq() run on one core
microbenchmark::microbenchmark(deseq2 = fun(), times = 1)

(...)

Unit: seconds
   expr      min       lq     mean   median       uq      max neval
 deseq2 203.6366 203.6366 203.6366 203.6366 203.6366 203.6366     1
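
For completeness, the parallel call mentioned above would look something like the sketch below; MulticoreParam with 4 workers is just an arbitrary example (on Windows, SnowParam would be used instead):

library(BiocParallel)

# Spread the per-gene fitting across 4 workers (arbitrary choice)
dds <- DESeq(dds, parallel = TRUE, BPPARAM = MulticoreParam(workers = 4))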

Unrelated to this, GTEx and TCGA are completely different studies/batches, so combining them in a single quantitative analysis is highly questionable.


Thanks for the reply! I'm already running these tasks in parallel myself, so more parallelization is not really a viable solution. A single run takes about as long as you report, but the problem is that I need to make many DESeq() calls, so the overall runtime is very long.

I'm now considering voom, as you suggested, as well as the other ideas that have come up.

Unrelated to this, GTEx and TCGA are completely different studies/batches, so combining them in a single quantitative analysis is highly questionable.

I am aware of this, and I'm already working on other data, but I wanted something quick and easy for a first pass.


What do you mean by "adjusted"?

Do you just mean the log2 fold changes calculated by doing inter-sample normalisation and then computing the fold change? Or do you mean the stabilised, shrunken LFCs that DESeq2 is capable of producing? If it's the former, you don't need to run DESeq(); if it's the latter, your code isn't calculating them.

To normalise data using DESeq2's inter-sample normalisation, you can use dds <- estimateSizeFactors(dds) and then recover the normalised counts with counts_data <- counts(dds, normalized = TRUE). No call to DESeq() is necessary.
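
For example, a minimal sketch of the former, assuming data and metadata are the objects built inside your run_deseq() function:

# Size-factor normalisation only: no dispersion estimation and no GLM
# fitting, so this is much faster than a full DESeq() run.
dds <- DESeq2::DESeqDataSetFromMatrix(data, metadata, ~ status)
dds <- DESeq2::estimateSizeFactors(dds)
norm_counts <- DESeq2::counts(dds, normalized = TRUE)

# Plain (unshrunken) log2 fold change of case vs control group means,
# with a pseudocount of 1 to avoid dividing by zero.
case_mean    <- rowMeans(norm_counts[, dds$status == "case", drop = FALSE])
control_mean <- rowMeans(norm_counts[, dds$status == "control", drop = FALSE])
lfc <- log2((case_mean + 1) / (control_mean + 1))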

To get the stabilised, shrunken values, you need to use lfcShrink() on your results object.
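
Something along these lines, applied to the object returned by DESeq() in your function; apeglm is one of the available shrinkage estimators and needs the apeglm package installed, and the coef name must match resultsNames(dsq_obj):

shrunk_res <- DESeq2::lfcShrink(
  dsq_obj,
  coef = "status_case_vs_control",
  type = "apeglm" # requires the apeglm package
)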

I notice that you are "unlogging" your data. This suggests to me that the data you have isn't counts. To run DESeq2 you need counts. It's not enough for the values to be integers; they need to be negative binomially distributed (or approximately so). I suspect DESeq() is taking so long to run because it's struggling to fit the distribution, as your data isn't really count data.


It could also be memory; this matrix is ~200 Mb.

Things could likely also be sped up a lot by reasonable filtering.

See ?edgeR::filterByExpr
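
A rough sketch of how that could slot into the existing function, assuming the data count matrix and metadata frame from run_deseq() above:

# Drop genes without enough counts to be worth testing; this can shrink
# the matrix (and the DESeq() runtime) considerably.
keep <- edgeR::filterByExpr(data, group = metadata$status)
data <- data[keep, ]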


Filtering is something I don't currently do, but probably should. I'll try it out! Thanks.


Hello! Thanks for taking the time to reply.

Do you just mean the log2 fold changes calculated by doing inter-sample normalisation and then computing the fold change? Or do you mean the stabilised, shrunken LFCs that DESeq2 is capable of producing? If it's the former, you don't need to run DESeq(); if it's the latter, your code isn't calculating them.

It's the latter. I was sure that my code calculated them, but now that you point it out I see that it doesn't. Since our datasets are very large, I believe shrinkage of the LFCs would not have a big effect, but the error is there nonetheless.

I notice that you are "unlogging" your data. This suggests to me that the data you have isn't counts. To run DESeq2 you need counts. It's not enough for the values to be integers; they need to be negative binomially distributed (or approximately so). I suspect DESeq() is taking so long to run because it's struggling to fit the distribution, as your data isn't really count data.

The original data is from UCSC Xena (here). The data was log-transformed by Xena, but it is actually estimated counts from RSEM, which I believe is the correct input for DESeq2. Still, since the data comes from two very different consortia, I wonder whether the marked batch effect between them could be what makes DESeq2 "struggle", as you said.

In any case, thanks for pointing out my error.

@mikelove
Last seen 16 hours ago
United States

Seconding ATpoint: we often use linear approaches with limma-voom in the lab for large datasets.

I get 6 minutes for the 50,000 x 1,000 dds.
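
A bare-bones limma-voom version of the same contrast might look roughly like this (a sketch, reusing the data count matrix and metadata frame from the original run_deseq() function; filtering as suggested above still applies):

library(limma)
library(edgeR)

# Same design as the DESeq2 model: ~ status, with "control" as reference
design <- model.matrix(~ status, data = metadata)

# TMM normalisation, voom precision weights, then a linear model fit
dge <- calcNormFactors(DGEList(counts = data))
v   <- voom(dge, design)
fit <- eBayes(lmFit(v, design))

# Moderated statistics and log2 fold changes for case vs control
res <- topTable(fit, coef = "statuscase", number = Inf)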

