Question

Time course experiment using limma with voom

mahes.muniandy • 0

@mahesmuniandy-7955

Helsinki

Hi,

I am analysing RNA sequencing data from a weight loss study with three timepoints: before diet, 2 months after diet and 10 months after diet (I'm calling the timepoints TP1, TP2, TP3). My variable of interest is percentage weight lost, so I set it to 0 at TP1 and to (weight1-weight2)/weight1 and (weight1-weight3)/weight1 at TP2 and TP3 respectively. At the moment I am analysing TP2 vs TP1 and TP3 vs TP1 separately. Not all individuals have data from all three timepoints.

My limma model looks like this:

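# Percentage weight lost is the covariate of interest
design <- model.matrix(~ 0 + sequencing_center + sex + study_center + age + perc_weightlost_TP1TP2)
v <- voom(dgeT1T2, design)
corfit <- duplicateCorrelation(v, design, block=SubjectID)
# Repeat voom using the estimated within-subject correlation
v <- voom(dgeT1T2, design, block=SubjectID, correlation=corfit$consensus)
fit <- lmFit(v, design, block=SubjectID, correlation=corfit$consensus)
fit <- eBayes(fit)
# topTable (not topTreat) because the fit comes from eBayes, not treat
TP2vsTP1 <- topTable(fit, adjust.method="BH", coef="perc_weightlost_TP1TP2", number=5000)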

My interest is in finding out how gene expression changes according to percentage weight lost. Should I be putting the time point into my design matrix above? I had figured it was not necessary because the percentage weight lost already captures this information. However, if I want to put all three time points in my analysis, should I go about it as below?

design <- model.matrix(~ 0 + sequencing_center + sex +  study_center + age + TP:perc_weightlost)

I would be grateful for any pointers on how to analyse this type of data.

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /apps/statistics2/R-3.5.1/lib/libRblas.so
LAPACK: /apps/statistics2/R-3.5.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.24.3 limma_3.38.3

loaded via a namespace (and not attached):
[1] compiler_3.5.1  Rcpp_1.0.1      grid_3.5.1      locfit_1.5-9.1
[5] statmod_1.4.30  lattice_0.20-38

Thank you,

Mahes Muniandy,

University of Helsinki

limma voom RNA Seq Time-course experiments • 1.3k views

updated 5.1 years ago by Aaron Lun ★ 28k • written 5.1 years ago by mahes.muniandy • 0

score 1 · Accepted Answer · 2019-06-20

If weight loss and time are well correlated, you're in trouble. With your experimental design, there's no way to distinguish between the effect of time - due to aging or whatever - and the effect of weight loss. The correct way to do it would be to have a control group without any diet, which would provide a baseline for the time effect.

Nonetheless, you might be able to get something out of this data if time and weight loss are not strongly correlated (e.g., some individuals lose more weight than others, or weight loss is not linear with respect to time). You can then put both factors into the model to identify significant effects associated with weight loss that are conditional on the time effect. This reduces the chance that the resulting DE genes are driven by the confounding time effect. However, this comes at the cost of a loss of power, which becomes more severe as time and weight loss become more correlated.

Note that if time and weight loss are perfectly correlated, then you can't put them in the same model together, as the coefficients will not be estimable. Also, TP:perc_weightlost doesn't make sense; just use an additive model.
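To make the estimability point concrete, here is a minimal base-R sketch with made-up numbers. The variable names (`time_months`, `wloss_varied`, `wloss_fixed`) are hypothetical, and time is treated as a numeric covariate in an additive model. It compares a design where weight loss varies between individuals with one where weight loss is an exact linear function of time.

```r
# Hypothetical toy data: 6 individuals, each sampled at 0 and 2 months.
time_months <- rep(c(0, 2), 6)

# Case 1: weight loss differs between individuals at the second time point.
set.seed(1)
wloss_varied <- as.vector(rbind(0, runif(6, 0.02, 0.15)))

# Case 2: everyone loses exactly the same fraction, so weight loss is
# an exact linear function of time (perfect correlation).
wloss_fixed <- 0.05 * time_months / 2

design_varied <- model.matrix(~ time_months + wloss_varied)
design_fixed  <- model.matrix(~ time_months + wloss_fixed)

# All coefficients are estimable only if the rank of the design
# equals its number of columns.
rank_varied <- qr(design_varied)$rank
rank_fixed  <- qr(design_fixed)$rank

cat("varied:", rank_varied, "of", ncol(design_varied), "columns\n")
cat("fixed: ", rank_fixed,  "of", ncol(design_fixed),  "columns\n")

# The closer this correlation is to 1, the less power remains to
# separate the weight-loss effect from the time effect.
cor_varied <- cor(time_months, wloss_varied)
cat("correlation of time and weight loss (varied case):", cor_varied, "\n")
```

The second design is rank deficient, which is what "not estimable" means in practice. On a real design matrix, limma's `nonEstimable()` should report which coefficients are affected.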

Edit:

I hadn't appreciated that you were modeling time as a factor, not a covariate. That's fine, and probably the correct approach, but it changes my answer a bit. Let's mock up a sample table for demonstration purposes.

patient <- gl(8, 2)
time <- rep(c("T0", "T1"), 8) # Two time points, for simplicity.

set.seed(100)
weight0 <- rnorm(8, mean=70)
weight1 <- weight0 - runif(8, 0, 10)
wloss <- as.vector(rbind(0, 1 - weight1/weight0)) # fraction of baseline weight lost; interleaved

Your second proposed design was almost correct. I would instead do:

design <- model.matrix(~ 0 + patient + time + time:wloss)
design <- design[,-10] # remove the all-zero column.

The first 8 coefficients are patient blocking factors, representing the fitted expression of each patient at time zero. The next coefficient is the average log-fold change due to time (T1 versus T0) within each patient, i.e., the change expected for a patient with no weight loss. The final coefficient represents the "effect" (i.e., association) of expression with weight loss at time point 1. As I mentioned before, the last two coefficients are likely to be almost completely confounded, which will reduce your power to detect changes compared to a more carefully designed experiment. This is the correct behavior, because otherwise you would get false positives associated with weight loss that are actually caused by time.
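As a sanity check on that interpretation, the following base-R sketch rebuilds a mock sample table of the same shape and fits the design to a single noise-free, hypothetical gene whose true effects are known. Everything about the gene (`y`, the baseline of 5, the time effect of 1, the slope of 20) is invented for illustration. Weight loss is taken as the fraction of baseline weight lost, and the all-zero column is dropped by content rather than by position.

```r
# Mock sample table (made-up weights).
patient <- gl(8, 2)
time <- factor(rep(c("T0", "T1"), 8))
set.seed(100)
weight0 <- rnorm(8, mean = 70)
weight1 <- weight0 - runif(8, 0, 10)
wloss <- as.vector(rbind(0, 1 - weight1 / weight0))  # fraction lost; 0 at T0

design <- model.matrix(~ 0 + patient + time + time:wloss)
design <- design[, colSums(design != 0) > 0]  # drop the all-zero timeT0:wloss column
print(colnames(design))

# Hypothetical gene, no noise: baseline 5 in every patient, +1 at T1,
# plus 20 per unit fraction of weight lost.
y <- 5 + 1 * (time == "T1") + 20 * wloss

# Least-squares fit of the single gene against the design.
coefs <- qr.coef(qr(design), y)
print(round(coefs, 6))

# Correlation between the time column and the weight-loss column;
# values near 1 mean little power to tell the two effects apart.
overlap <- cor(design[, "timeT1"], design[, "timeT1:wloss"])
cat("correlation between timeT1 and timeT1:wloss:", overlap, "\n")
```

The fit returns the baseline for each patient coefficient, the time effect for `timeT1`, and the weight-loss slope for `timeT1:wloss`. In a real analysis the last coefficient would be the one passed as `coef` when extracting results.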

The above example is fairly easy to extend to three time points. Note that if age and sex are the same for all samples derived from a single patient, they will be redundant with the patient blocking factors - you don't need them. Same for sequencing center, if all samples from a single patient were sequenced at the same location.
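One way the three-time-point extension might look is sketched below in base R. The layout is hypothetical (6 patients, made-up weights, two samples dropped to mimic incomplete follow-up), and `sex` is an invented covariate that is constant within each patient. The sketch also checks the redundancy claim: adding `sex` on top of the patient factors adds a column but no rank.

```r
# Hypothetical layout: 6 patients, three time points each.
n <- 6
patient <- gl(n, 3)
time <- factor(rep(c("T0", "T1", "T2"), n))
sex <- factor(rep(c("F", "M"), each = 3, length.out = 3 * n))  # constant within patient

set.seed(42)
w0 <- rnorm(n, mean = 70)
w1 <- w0 - runif(n, 0, 10)
w2 <- w0 - runif(n, 0, 10)
wloss <- as.vector(rbind(0, 1 - w1 / w0, 1 - w2 / w0))  # interleaved per patient

# Mimic incomplete follow-up: patient 2 lacks T2, patient 4 lacks T1.
keep <- !(seq_along(time) %in% c(6, 11))
d <- data.frame(patient, time, sex, wloss)[keep, ]

design3 <- model.matrix(~ 0 + patient + time + time:wloss, data = d)
design3 <- design3[, colSums(design3 != 0) > 0]  # drops timeT0:wloss
print(colnames(design3))
full_rank <- qr(design3)$rank == ncol(design3)

# Same model with sex added: one more column, no additional rank,
# because sexM is a sum of patient columns.
design_sex <- model.matrix(~ 0 + patient + sex + time + time:wloss, data = d)
design_sex <- design_sex[, colSums(design_sex != 0) > 0]
extra_rank <- qr(design_sex)$rank - qr(design3)$rank
cat("full rank without sex:", full_rank, "\n")
cat("rank gained by adding sex:", extra_rank, "\n")
```

Here `timeT1:wloss` and `timeT2:wloss` are separate weight-loss slopes for each follow-up time point. They could be tested individually or combined in a contrast, depending on the question being asked.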