advice on limma design when analyzing for final outcome in timecourse data
@adaikalavan-ramasamy-5765
United Kingdom

Dear Prof. Smyth and limma experts,

I would like to clarify if my design and logic in applying limma is correct. I have a timecourse where the blood was taken from participants before vaccination (Day0), after first vaccination (Day1) and after second vaccination eight weeks later (Day57). These participants were subsequently challenged with the infectious pathogen and monitored.

Some individuals were protected while others developed the disease (and treated). Our aim is to look for genes correlated with the protection status.

If we subset the dataset for each timepoint and analyzed for protection status, only a couple of genes are significant. However, when we analyze all time points together with adjustment for timepoint, we obtain many more significant genes. The phenotype data looks something like this:

     SubjectID   Timepoint   Status         Age   Sex   Batch
     1001        Day0        NotProtected   20    M     B1
     1001        Day1        NotProtected   20    M     B2
     1001        Day57       NotProtected   20    M     B3
     1015        Day0        Protected      35    F     B2
     1015        Day1        Protected      35    F     B3
     1015        Day57       Protected      35    F     B1
     ...         ...         ...            ...   ...   ...

 

And here is the model matrix and limma call for analyzing all time points jointly.

   mm <- model.matrix( ~ -1 + Age + Sex + Batch + Status + Timepoint, data=pheno )

   dupCor <- duplicateCorrelation( EXPRS, design=mm, block=pheno$SubjectID )$consensus

   fit <- lmFit( EXPRS, design=mm, block=pheno$SubjectID, correlation=dupCor )

   fit <- eBayes( fit )

I am slightly concerned that we are violating some basic assumption here. Or is the blocking for subject and duplicateCorrelation sufficient? Any advice is greatly appreciated. Many thanks.

 

Regards, Adai

--

Adaikalavan Ramasamy

Senior Leadership Fellow in Bioinformatics

Head of the Transcriptomics Core Facility

The Jenner Institute, University of Oxford

Roosevelt Drive, Oxford OX3 7DQ

 

Email: adaikalavan.ramasamy@ndm.ox.ac.uk

Office: 01865 617 100

Mob: 07906 308 465

http://www.jenner.ac.uk/transcriptomics-facility

limma timecourse
ADD COMMENT
Aaron Lun ★ 28k
@alun
The city by the bay

The Age factor in your design seems a bit strange to me. If this is a numeric vector, the corresponding coefficient would be modelling genes where the (log-)expression is linearly proportional to the age of the subject. This seems unrealistic, e.g., would you really expect to see genes that double (or halve) expression in 50 year olds compared to 25 year olds?

In any case, the age effect is implicitly accounted for when you block on SubjectID. This means that there's no need to specify Age in the design if you're not explicitly interested in it. The same could be argued for Sex, though the specification of this factor in design is more sensible and could be retained (at the cost of residual degrees of freedom for variance estimation).

To address your main concern, blocking on SubjectID with duplicateCorrelation has some (not unreasonable) limitations. For example, it requires homogeneity in the subject effects, i.e., you shouldn't have systematic differences between subjects or an outlier subject. duplicateCorrelation also estimates a common correlation across all genes and uses that consensus value in generalized least squares. This is necessary to improve precision, but it also introduces some (acceptable) bias when the correlations are different between genes. The ideal strategy would consider subject-specific effects on a gene-by-gene basis. This could be done by putting SubjectID in the design, but the (interesting) coefficient for Status would then become unestimable. Thus, blocking with duplicateCorrelation is probably the best approach in this situation.

As an aside, one should avoid subsetting the data prior to a limma analysis. This will reduce the amount of information available to estimate the dispersion, resulting in greater uncertainty and a decrease in the number of DE calls.
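Putting these suggestions together, a minimal sketch of the simplified analysis might look like the following. The toy `pheno` and `EXPRS` objects below stand in for the original data (their values, the subject/batch layout, and the `StatusProtected` coefficient name are all assumptions based on the question):

```r
library(limma)

# Toy stand-ins for EXPRS and pheno from the question (values are made up).
set.seed(1)
pheno <- data.frame(
    SubjectID = rep(c("1001","1015","1020","1033"), each=3),
    Timepoint = rep(c("Day0","Day1","Day57"), 4),
    Status    = rep(c("NotProtected","Protected"), each=6),
    Sex       = rep(c("M","F","M","F"), each=3),
    Batch     = c("B1","B2","B3","B2","B3","B1","B3","B1","B2","B1","B3","B2")
)
EXPRS <- matrix(rnorm(1000*12), nrow=1000,
                dimnames=list(paste0("gene", 1:1000), paste0("s", 1:12)))

# Drop Age (absorbed by the subject blocking); keep Sex/Batch if desired.
mm <- model.matrix( ~ Sex + Batch + Status + Timepoint, data=pheno )
dupCor <- duplicateCorrelation( EXPRS, design=mm, block=pheno$SubjectID )$consensus
fit <- lmFit( EXPRS, design=mm, block=pheno$SubjectID, correlation=dupCor )
fit <- eBayes( fit )
topTable( fit, coef="StatusProtected" )
```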

ADD COMMENT

Dear Aaron,

Thank you very much for the useful answer, and apologies for the delay. We included the Age covariate to account for a weak correlation between age and protection status (probably due to the small sample size). You are right that it is implicit when blocking on SubjectID, but I am not sure how best to account for the age variation otherwise.

Thank you for confirming that duplicateCorrelation is probably the best way to go, and for the advice on subsetting.

Regards, Adai


ADD REPLY
kodream ▴ 20
@kodream-6952
United States

I am also using this approach, with each individual having more than one sample and each sample under a different condition. One problem I have run into is that duplicateCorrelation takes a very long time to run, several days even. Is there any way to expedite the process? It seems like it should be an "embarrassingly parallel" process that could easily be threaded or run on a cluster.

ADD COMMENT

How big is your dataset? A limma analysis taking several days doesn't seem right. For reference, a duplicateCorrelation call with around 16000 genes and 12 libraries takes about 3-5 minutes to complete.

As you might know, there is a for loop across all genes inside duplicateCorrelation. This could indeed be split over multiple cores. Alternatively, it could be C'ified, which has given a noticeable speed boost in other cases (e.g., 2 - 10 fold in edgeR). That's up to the limma developers, though.
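For what it's worth, one way to split that loop yourself is to run duplicateCorrelation on chunks of genes and pool the per-gene atanh correlations afterwards, mirroring the trimmed-mean consensus used internally. A rough sketch, assuming EXPRS, mm and pheno as in the original question (the chunk count and trim value here are assumptions):

```r
library(limma)
library(parallel)

# Split the genes into 12 chunks and estimate correlations per chunk.
idx <- split( seq_len(nrow(EXPRS)), cut(seq_len(nrow(EXPRS)), 12, labels=FALSE) )
ath <- mclapply( idx, function(i)
    duplicateCorrelation( EXPRS[i, ], design=mm,
                          block=pheno$SubjectID )$atanh.correlations,
    mc.cores=12 )

# Pool with the same trimmed mean that duplicateCorrelation applies (trim=0.15),
# then back-transform to get the consensus correlation.
dupCor <- tanh( mean( unlist(ath), trim=0.15 ) )
```

Note that mclapply forks, so this will not parallelise on Windows; the same chunking would work with a PSOCK cluster instead.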

ADD REPLY

I concur with Aaron. It also takes me around 3 minutes (24k genes, 38 subjects x 3 timepoints per subject).

ADD REPLY

47K genes, 468 subjects x 2 conditions

ADD REPLY

Okay, that's pretty big. The simplest option here is to just grit your teeth and bear with it. Run duplicateCorrelation once, and then hold on to the consensus value so that you don't have to run it again. If you're desperate, you could easily rewrite duplicateCorrelation to run on multiple cores. However, I'd warn against doing so, as it'd be painful to keep a custom version of the function up to date.
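In code, that run-once idea might look like this, assuming EXPRS, mm and pheno as in the original question (the cache file name is illustrative):

```r
library(limma)

# Run the expensive step once and reuse the cached consensus thereafter.
if ( file.exists("dupCor.rds") ) {
    dupCor <- readRDS("dupCor.rds")
} else {
    dupCor <- duplicateCorrelation( EXPRS, design=mm,
                                    block=pheno$SubjectID )$consensus
    saveRDS( dupCor, file="dupCor.rds" )
}
fit <- eBayes( lmFit( EXPRS, design=mm, block=pheno$SubjectID,
                      correlation=dupCor ) )
```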

ADD REPLY

I used foreach from the parallel package. 12 threads got it down to about 4 hours, though there seems to be a significant amount of blocking, maybe due to forking. Good enough, I suppose. I get the impression that this part of the code was just tacked on after the fact, since it isn't written in C.

ADD REPLY

There's not a great deal of limma code written in C/C++. The operations per gene are fairly quick for linear modelling, so fast enough is usually good enough. The R code in duplicateCorrelation is just an extension of this philosophy.

Also, foreach seems to come from the foreach package, rather than the parallel package (though the latter is still required for parallel computation with the former).

ADD REPLY

Sure, it just doesn't feel like it was stress tested. Or if any stress testing was done, there wasn't very much of it.

ADD REPLY
