LIMMA dupcor() function time issue.
2
0
Entering edit mode
tleona3 • 0
@tleona3-13813
Last seen 17 months ago
United States

I am trying to apply the LIMMA package to a dataset that is much larger than a traditional microarray or RNA-seq experiment. The number of samples is 96, but the number of observations is ~11,127,676. I have been able to run LIMMA on this dataset previously to make different comparisons between samples, however, now that I am trying to incorporate blocking by subject my jobs will time out before completing. In the documentation it says the following:

"The function may take long time to execute as it fits a mixed linear model for each gene for an iterative algorithm."

Taken from a previous question to demonstrate the design setup:

# For simplicity, we'll just use 2 replicates
# per tissue and time-point combination.
tissue <- rep(c("M", "C"), each=10)
time <- rep(rep(c(0, 6, 24, 72, 168), each=2), 2)

# Assume one subject contributes to all time points 
# for simplicity. I know this isn't actually the case,
# but this is fine provided that most subjects contribute
# to at least two samples.
subject <- c(rep(1:2, 5), rep(3:4, 5))
The obvious design matrix is:

g <- factor(paste0(tissue, time))
design <- model.matrix(~0 + g)
colnames(design) <- levels(g)

#block on subject
corfit <- duplicateCorrelation(eset,design,block=eset$subject)
corfit$consensus

When performing the duplicateCorrelation() function with my actual data using different numbers of observations the time it takes is as follows:

#1250: 5.088 sec elapsed
#2500: 9.374 sec elapsed
#5000: 19.896 sec elapsed
#10000: 38.284 sec elapsed
#20000: 74.756 sec elapsed
#40000: 147.583 sec elapsed
#80000: 286.273 sec elapsed
#160000: 697.932 sec elapsed
#320000: 1744.34 sec elapsed

#If I plot the time points it appears to not be linear in terms of time with increasing observation number:
y <- c(5.088, 9.374, 19.896, 38.284, 74.756, 147.583, 286.273, 697.932)
x <- c(1250, 2500, 5000, 10000, 20000, 40000, 80000, 160000)
lm(y~x)
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
  -42.32812      0.00533

I'm doing the following on my laptop (8GB ram, 2.7 GHz i5 processor) but I am running the actual job on a computing cluster with 128GB ram and 16 x 2.6 GHz Intel Xeon E5-2670 CPUs. Assuming this "linear" trend follows, on my computer it should take this much time to run the computation: (0.00533(64417279))/3600 = 16.47 hours. However, when running on the cluster with a 48 hour window it does not complete.

Question: Is there a way to speed up the dupcor() function (run in parallel) or alternative way to block on subject with the current design to still perform contrasts as seen in the previous post (117545)?

Previous post on time series design setup

limma microarray block • 2.2k views
ADD COMMENT
6
Entering edit mode
@ryan-c-thompson-5618
Last seen 9 weeks ago
Icahn School of Medicine at Mount Sinaiā€¦

Looking at the code of duplicateCorrelation, the main time sink is the loop that runs mixedModel2Fit on each row of the input. So the time required should indeed be a linear function of the number of rows. However, the linear trend might break down if you are bumping up against hardware limits. The obvious one is that if you are running out of RAM and triggering swapping, you're going to see a huge slowdown. I don't think that should be happening with 128 GB of RAM, but if it's a shared system, perhaps you only have the use of a fraction of that RAM.

Of course, it would be possible to rewrite the main loop using a parallel lapply, but I'll let someone from the limma team comment on whether the parallel speedup is likely to outweigh the additional overhead. (Also, if limited memory is an issue, parallel computation is just going to make it worse.)

What I can suggest, though, is that you don't necessarily need to run duplicateCorrelation on every row. duplicateCorrelation's only purpose is to estimate the average of the random effect's correlation coefficient across all genes. If your data are well-behaved, you should be able to get a robust estimate of this value from any sufficiently large randomly-selected subset of genes. A typical human genomic dataset might be somewhere around 20k genes, so if you want to be safe, sample around 50k rows at random and run duplicateCorrelation on them. Then, run it on another 10 or so random samples of 50k genes, and see how consistent the estimate is. If different random selections of genes give you very similar correlation values, then you're probably getting a robust estimate of the overall average correlation across the entire dataset, and it's safe to use that correlation value in downstream analyses.

ADD COMMENT
5
Entering edit mode
@gordon-smyth
Last seen 8 hours ago
WEHI, Melbourne, Australia

Ryan is right -- computation time is linear in the number of rows and the solution is just to run duplicateCorrelation on a subset of rows.

The computation time is linear in the number of rows, linear in the number of columns, and quadratic in the number of columns of the design matrix.

Have you already filtered out non expressed rows? That will both speed up the computation and improve the estimate.

You've already run duplicateCorrelation on 320k rows, for which it took half an hour. There's nothing further to be gained by running it on the whole dataset, provided that the first 320k rows are representative of the whole dataset. If your data is ordered in some way then it would be better to select rows randomly or to choose a systematic sample by average expression. I'd think that 20k randomly chosen rows would be more than enough, but you can choose more to be safe.

ADD COMMENT
0
Entering edit mode

I randomly chose 10,000 rows 1,000 times from the entire dataset (since it was ordered in a specific way) to run the duplicateCorrelation function and got a normally distributed corfit$consensus so I used the median value as my parameter for my actual experiment. Thanks for all the help!

ADD REPLY
0
Entering edit mode

That's ok, but there is nothing to be gained by subsetting 1000 times. It would be better to simply subset 100,000 rows once, although the result probably won't be very different from what you already have.

ADD REPLY

Login before adding your answer.

Traffic: 851 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6