advice on limma design when analyzing for final outcome in timecourse data
@adaikalavan-ramasamy-5765
United Kingdom

Dear Prof. Smyth and limma experts,

I would like to clarify if my design and logic in applying limma is correct. I have a timecourse where the blood was taken from participants before vaccination (Day0), after first vaccination (Day1) and after second vaccination eight weeks later (Day57). These participants were subsequently challenged with the infectious pathogen and monitored.

Some individuals were protected while others developed the disease (and treated). Our aim is to look for genes correlated with the protection status.

If we subset the dataset for each timepoint and analyzed for protection status, only a couple of genes are significant. However, when we analyze all time points together with adjustment for timepoint, we obtain many more significant genes. The phenotype data looks something like this:

     SubjectID   Timepoint   Status         Age   Sex   Batch
     1001        Day0        NotProtected   20    M     B1
     1001        Day1        NotProtected   20    M     B2
     1001        Day57       NotProtected   20    M     B3
     1015        Day0        Protected      35    F     B2
     1015        Day1        Protected      35    F     B3
     1015        Day57       Protected      35    F     B1
     ...         ...         ...            ...   ...   ...

 

And here is the model matrix and limma call for analyzing all time points jointly.

   mm <- model.matrix( ~ -1 + Age + Sex + Batch + Status + Timepoint, data=pheno )

   dupCor <- duplicateCorrelation( EXPRS, design=mm, block=pheno$SubjectID )$consensus

   fit <- lmFit( EXPRS, design=mm, block=pheno$SubjectID, correlation=dupCor )

   fit <- eBayes( fit )

I am slightly concerned that we are violating some basic assumption here. Or is the blocking for subject and duplicateCorrelation sufficient? Any advice is greatly appreciated. Many thanks.

 

Regards, Adai

--

Adaikalavan Ramasamy

Senior Leadership Fellow in Bioinformatics

Head of the Transcriptomics Core Facility

The Jenner Institute, University of Oxford

Roosevelt Drive, Oxford OX3 7DQ

 

Email: adaikalavan.ramasamy@ndm.ox.ac.uk

Office: 01865 617 100

Mob: 07906 308 465

http://www.jenner.ac.uk/transcriptomics-facility

limma timecourse
ADD COMMENT
Aaron Lun ★ 28k
@alun
The city by the bay

The Age factor in your design seems a bit strange to me. If this is a numeric vector, the corresponding coefficient would be modelling genes where the (log-)expression is linearly proportional to the age of the subject. This seems unrealistic, e.g., would you really expect to see genes that double (or halve) expression in 50 year olds compared to 25 year olds?

In any case, the age effect is implicitly accounted for when you block on SubjectID. This means that there's no need to specify Age in the design if you're not explicitly interested in it. The same could be argued for Sex, though the specification of this factor in design is more sensible and could be retained (at the cost of residual degrees of freedom for variance estimation).

To address your main concern, blocking on SubjectID with duplicateCorrelation has some (not unreasonable) limitations. For example, it requires homogeneity in the subject effects, i.e., you shouldn't have systematic differences between subjects or an outlier subject. duplicateCorrelation also estimates a common correlation across all genes and uses that consensus value in generalized least squares. This is necessary to improve precision, but it also introduces some (acceptable) bias when the correlations are different between genes. The ideal strategy would consider subject-specific effects on a gene-by-gene basis. This could be done by putting SubjectID in the design, but the (interesting) coefficient for Status would then become unestimable. Thus, blocking with duplicateCorrelation is probably the best approach in this situation.

As an aside, one should avoid subsetting the data prior to a limma analysis. This will reduce the amount of information available to estimate the dispersion, resulting in greater uncertainty and a decrease in the number of DE calls.
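Putting these suggestions together, a minimal sketch of the simplified analysis might look like the following. The toy `pheno` and `EXPRS` objects below stand in for the original data (their values, the subject/batch layout, and the `StatusProtected` coefficient name are all assumptions based on the question):

```r
library(limma)

# Toy stand-ins for EXPRS and pheno from the question (values are made up).
set.seed(1)
pheno <- data.frame(
    SubjectID = rep(c("1001","1015","1020","1033"), each=3),
    Timepoint = rep(c("Day0","Day1","Day57"), 4),
    Status    = rep(c("NotProtected","Protected"), each=6),
    Sex       = rep(c("M","F","M","F"), each=3),
    Batch     = c("B1","B2","B3","B2","B3","B1","B3","B1","B2","B1","B3","B2")
)
EXPRS <- matrix(rnorm(1000*12), nrow=1000,
                dimnames=list(paste0("gene", 1:1000), paste0("s", 1:12)))

# Drop Age (absorbed by the subject blocking); keep Sex/Batch if desired.
mm <- model.matrix( ~ Sex + Batch + Status + Timepoint, data=pheno )
dupCor <- duplicateCorrelation( EXPRS, design=mm, block=pheno$SubjectID )$consensus
fit <- lmFit( EXPRS, design=mm, block=pheno$SubjectID, correlation=dupCor )
fit <- eBayes( fit )
topTable( fit, coef="StatusProtected" )
```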

ADD COMMENT

Dear Aaron,

Thank you very much for the useful answer, and apologies for the delay. We included the Age covariate to account for a weak correlation between age and protection status (probably due to the small sample size). You are right that it is implicit when blocking on SubjectID, but I am not sure how best to account for the age variation otherwise.

Thank you for confirming that duplicateCorrelation is probably the best way to go, and for the advice on subsetting.

Regards, Adai


ADD REPLY
kodream ▴ 20
@kodream-6952
United States

I am also using this approach, with each individual having more than one sample and each sample under a different condition. One problem I have run into is that duplicateCorrelation takes a very long time to run, several days even. Is there any way to expedite the process? It seems like it should be an "embarrassingly parallel" process that could easily be threaded or run on a cluster.

ADD COMMENT

How big is your dataset? A limma analysis taking several days doesn't seem right. For reference, a duplicateCorrelation call with around 16000 genes and 12 libraries takes about 3-5 minutes to complete.

As you might know, there is a for loop across all genes inside duplicateCorrelation. This could indeed be split over multiple cores. Alternatively, it could be C'ified, which has given a noticeable speed boost in other cases (e.g., 2 - 10 fold in edgeR). That's up to the limma developers, though.
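For what it's worth, one way to split that loop yourself is to run duplicateCorrelation on chunks of genes and pool the per-gene atanh correlations afterwards, mirroring the trimmed-mean consensus used internally. A rough sketch, assuming EXPRS, mm and pheno as in the original question (the chunk count and trim value here are assumptions):

```r
library(limma)
library(parallel)

# Split the genes into 12 chunks and estimate correlations per chunk.
idx <- split( seq_len(nrow(EXPRS)), cut(seq_len(nrow(EXPRS)), 12, labels=FALSE) )
ath <- mclapply( idx, function(i)
    duplicateCorrelation( EXPRS[i, ], design=mm,
                          block=pheno$SubjectID )$atanh.correlations,
    mc.cores=12 )

# Pool with the same trimmed mean that duplicateCorrelation applies (trim=0.15),
# then back-transform to get the consensus correlation.
dupCor <- tanh( mean( unlist(ath), trim=0.15 ) )
```

Note that mclapply forks, so this will not parallelise on Windows; the same chunking would work with a PSOCK cluster instead.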

ADD REPLY

I concur with Aaron. It also takes me around 3 minutes (24k genes, 38 subjects x 3 timepoints per subject).

ADD REPLY

47K genes, 468 subjects x 2 conditions

ADD REPLY

Okay, that's pretty big. The simplest option here is to just grit your teeth and bear with it. Run duplicateCorrelation once, and then hold on to the consensus value so that you don't have to run it again. If you're desperate, you could easily rewrite duplicateCorrelation to run on multiple cores. However, I'd warn against doing so, as it'd be painful to keep a custom version of the function up to date.
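In code, that run-once idea might look like this, assuming EXPRS, mm and pheno as in the original question (the cache file name is illustrative):

```r
library(limma)

# Run the expensive step once and reuse the cached consensus thereafter.
if ( file.exists("dupCor.rds") ) {
    dupCor <- readRDS("dupCor.rds")
} else {
    dupCor <- duplicateCorrelation( EXPRS, design=mm,
                                    block=pheno$SubjectID )$consensus
    saveRDS( dupCor, file="dupCor.rds" )
}
fit <- eBayes( lmFit( EXPRS, design=mm, block=pheno$SubjectID,
                      correlation=dupCor ) )
```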

ADD REPLY

I used foreach from the parallel package. 12 threads got it down to about 4 hours, though there seems to be a significant amount of blocking, maybe due to forking. Good enough, I suppose. I get the impression that this part of the code was just tacked on after the fact, since it isn't written in C.

ADD REPLY

There's not a great deal of limma code written in C/C++. The operations per gene are fairly quick for linear modelling, so fast enough is usually good enough. The R code in duplicateCorrelation is just an extension of this philosophy.

Also, foreach seems to come from the foreach package, rather than the parallel package (though the latter is still required for parallel computation with the former).

ADD REPLY

Sure, it just doesn't feel like it was stress tested. Or if any stress testing was done, there wasn't very much of it.

ADD REPLY
