Hi,
I am performing RNA sequencing data analysis on weight loss data with three timepoints: before diet, 2 months after diet and 10 months after diet (I’m calling the timepoints TP1, TP2, TP3). My variable of interest is percentage weight lost, so I set my percentage weight lost at TP1 at 0 and TP2 and TP3 at (weight1-weight2)/weight1 and (weight1-weight3)/weight1 respectively. At the moment I am analysing my TP2vsTP1 and TP3vsTP1 separately. Not all individuals have data from all three timepoints.
My limma model looks like this:
design <- model.matrix(~ 0 + sequencing_center + sex + study_center + age + perc_weightlost_TP1TP2)
v <- voom(dgeT1T2, design)
corfit <- duplicateCorrelation(v,design,block=SubjectID)
v <- voom(dgeT1T2,design,block= SubjectID,correlation=corfit$consensus)
fit <- lmFit(v,design,block=ID,correlation=corfit$consensus)
fit <- eBayes(fit)
TP2vsTP1=topTreat(fit, adjust="BH", coef="perc_weight_TP1TP2", number=5000)
My interest is in finding out how gene expression changes according to percentage weight lost. Should I be putting the time point into my design matrix above? I had figured it was not necessary because the percentage weight lost already captures this information. However, if I want to put all three time points in my analysis, should I go about it as below?
design <- model.matrix(~ 0 + sequencing_center + sex + study_center + age + TP:perc_weightlost)
I would be grateful for any pointers on how to analyse this type of data.
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /apps/statistics2/R-3.5.1/lib/libRblas.so
LAPACK: /apps/statistics2/R-3.5.1/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.24.3 limma_3.38.3
loaded via a namespace (and not attached):
[1] compiler_3.5.1 Rcpp_1.0.1 grid_3.5.1 locfit_1.5-9.1
[5] statmod_1.4.30 lattice_0.20-38
Thank you,
Mahes Muniandy,
University of Helsinki
Hi Aaron,
Thank you so much for the guidance. Yes, some individuals lose more weight than others and weight loss is not linear with respect to time, plus between timepoints 2 and 3, the diet is different than between timepoints 1 and 2. I realise all this complicates matters and I am quite confused on how to model the data.
If I “put both factors into the model to identify significant effects associated with weight loss that are conditional on the time effect” would it be a simple additive model with a contrast matrix? I'm not quite sure how to do a contrast matrix with a continuous and categorical variable. Or can I make the timepoint the intercept?
I am also unsure about my independent variable (percentage weight lost). The reason I use percentage weight lost is because the heavier one is, the easier it is to lose weight. So, kgs lost is not comparable across the individuals. But for me to use percentage weight lost, I have made the value zero at TP1. In hind-sight, I am thinking this is wrong, because then the model cannot find the association between percentage weight lost and gene expression at TP1. Is this correct?
If I use weight (kgs) at TP1, TP2, TP3, how do I model the effects of starting weight?
Thank you so much, Mahes Muniandy
See my updated answer. Don't worry about the fact that the weight loss is zero at the first time point. There's no need for the model to account for this, because there's nothing there!
Now, the appropriate model would be different if you wanted to account for the association of expression with the starting weight for each patient, in which case you would need to put in weights at the first time point. But this makes the model more complicated (i.e., the coefficients become more difficult to interpret) and only helps if the weight-related association at the first time point disappears at later time points (otherwise the patient blocking factors would be sufficient).
Hi Aaron, Thank you once again. I have tried the below model as suggested and as expected I get fewer significant genes compared to not having the time points in the model.
Just one more question, if I want to find the effects of age and sex between the individuals, can I extend this to the multi-level model with duplicateCorrelation? Many Thanks, Mahes
Yes, if you wanted to do that, block on
patient
induplicateCorrelation()
and put inage
andsex
as additive terms in the design matrix. Those things don't seem that interesting to me, though; perhaps an interaction betweenwloss
andsex
would be more fun.