Adjusting for baseline when having longitudinal data, mixed model and no treatment
Entering edit mode
Anna ▴ 20
Last seen 11 weeks ago

Dear all I have bulk RNAseq data and my goal is to detect genes that change over time after a treatment.I have the following design

I have 7 subjects and 5 time points. One time point is baseline (aka pre treatment), but I actually only have one treatment arm. Given the fact that I have subjects not being measured at some timepoints and that I want to account for intraclass correlation within subjects I go with mixed models. I would also like to use limma to take advantage of voom and ebayes

My question here is the following: . In a linear mixed model like with lmer I would adjust for baseline eg. to increase power . I know that limma is not exactly like lmer since it uses covariance matrix (like compound symmetry I presume, but my question is.. Is there a way to adjust for baseline gene expression in my case?
I had a look at that post . But didnt help me, since there are more like one treatment and different questions



RNAseq mixedmodels baselineadjustement limma • 1.0k views
Entering edit mode

similar to

lmerTest::lmer(log(cpm)~log(cpm_bl)+Time+ 1|Subj, data=data_for_one_gene)

or better

gls( log(cpm) ~ log(cpm_bl) + Time, data=one_gene, correlation=corCompSymm(form=1|Subj)

Where time here is a factor with levels Time2, Time3, Time 4, Time 5 (so no time at baseline)

Also for second equation we need tilt before 1 I think but I have issues with posting it when addint tilt

Entering edit mode
Last seen 3 hours ago
WEHI, Melbourne, Australia

It is fairly straight forward to analysis time-course experiments like you describe in limma. Subject is treated as a block either by including it as a factor in the design matrix or as a random block using duplicateCorrelation. Both approaches are described in the User's Guide.

It would be possible to give more specific advice if you showed the targets frame for your experiment, i.e., the data.frame showing which patient was observed at which time.

Entering edit mode

Dear Gordon, these are my metadata. As you see some subject are missing for timepoint 1

     Subject Timepoint
1      ID1  Baseline
2      ID1         2
3      ID1         3
4      ID1         4
5      ID2  Baseline
6      ID2         2
7      ID2         3
8      ID2         4
9      ID3  Baseline
10     ID3         1
11     ID3         2
12     ID3         3
13     ID3         4
14     ID4  Baseline
15     ID4         2
16     ID4         3
17     ID4         4
18     ID5  Baseline
19     ID5         1
20     ID5         2
21     ID5         3
22     ID5         4
23     ID6  Baseline
24     ID6         2
25     ID6         3
26     ID6         4
27     ID7  Baseline
28     ID7         1
29     ID7         2
30     ID7         3
31     ID7         4
32     ID8  Baseline
33     ID8         2
34     ID8         3
35     ID8         4

Indeed I use duplicateCorrelation (see code below). But my Issue is not that. My issue is that I want to use baseline expression as an independent variable. Reason for that is I believe that baseline has a prognostic effect on the following expression levels (used as dependent variables) so by adjusting for it I want to increase the precision of my estimate. Can I do that? Thanks!

# apply duplicateCorrelation in two rounds
design = model.matrix( ~ Timepoint , metadata)
vobj_tmp = voom( geneExpr, design, plot=FALSE)
dupcor <- duplicateCorrelation(vobj_tmp,design,block=metadata$Subject)

# run voom considering the duplicateCorrelation results
# in order to compute more accurate precision weights
vobj = voom( geneExpr, design, plot=TRUE, block=metadata$Subject, correlation=dupcor$consensus)
dupcor <- duplicateCorrelation(vobj, design, block=metadata$Subject)
# Estimate linear mixed model with a single variance component

# Fit the model for each gene,
# Note: this step uses only the genome-wide average for the random effect
fitDupCor <- lmFit(vobj, design, block=metadata$Subject, correlation=dupcor$consensus)
fit <- eBayes(fitDupCor)
p<-topTable(fit, adjust="BH", number = 100)
Entering edit mode

limma already adjusts for the baseline expression level of each subject. Trying to include baseline expression explicitly in the linear model is neither necessary nor correct.

You have the same two choices as for any blocked experiment. If the Subjects differ substantially in baseline expression, then I would include Subject in the linear model by

design <- model.matrix( ~ Subject + Timepoint)

If the differences between Subjects are relatively subtle, then I would instead use duplicateCorrelation and treat Subject as a random block.

If you are worried about the size of baseline differences between Subjects, then including Subject in the linear model is the way to go. It has a little less statistical power than duplicateCorrelation for contrasts involving Timepoint 1 but it accounts completely for any baseline differences between the Subjects.

Entering edit mode

Dear Gordon ,

Thank you very much

I will do as you suggested

When analyzing other data (like with normally distributed errors) is standard procedure to adjust for baseline expression (and additionally use the blocking factor (eg subject) as a fixed or random effect -depending on the design), This is done mainly because we believe that baseline is affecting the outcome and we use it as prognostic factor in our linear model to increase precision. Only out of scientific curiosity, why is this different in this case? Thanks

Entering edit mode

Perhaps you are thinking of data analyses where the dependent variable is something other than expression itself.

In the context of gene expression analyses, what you claim to be a standard procedure is not even possible. Using expression (baseline or otherwise) to predict itself is a statistical tautology and can never be valid. Adding Subject to the model as a factor and also donor-wise-baseline expression as a covariate will simply lead to a singular design matrix.

Entering edit mode

A no I never referred to transcriptomic data when I was talking about "standard procedure" I was referring to the way we analyze longitudinal expression (or other data that are not count data) from our clinical trials

Baseline is switched from dependent to independent variable (like a continuous number like age, weight etc) and dependent is either the follow up measurement or their FC wrt baseline.

I think equivqalent in limma would be like a gene by gene specificic design matrix where no BL time point would exist but a continuous variable termed BL with the aactuall gene expressiom

Here is an example of what I mean (is not lme4 or gls but I think the idea is pretty clear, though here has not only visit/time but also group (could be eg more than one treatment). Here predep is the baselin expression, postdep the after trt and subject is used as random effect.. That would not yield a singular design matrix

Here is another link on what I mean (think of weight as expression)

In any case thanks for the support

Entering edit mode

equivqalent in limma would be like a gene by gene specificic design matrix where no BL time point would exist but a continuous variable termed BL with the aactuall gene expressiom

I don't think there is any potential gain from such an analysis. The standard limma analysis is simpler, better and more appropriate for your stated aim of detecting DE genes. If you are still not clear on that, I suggest that you chat with a statistician at your own Institution. The issues involved are not specific to limma.

Entering edit mode

Thanks Gordon

Actually, I am working in the clinical statistics department for the last 5 years, even though a bioinformatician, and the "standard analysis" I am referring to is an analysis clinical statisticians do to detect eg DE of proteins. But we can assume log transforming proteins (skewed expression etc) and plugging them in a linear model (very often with a bayesian approach with small subject, but we usually have much more than 8) will work. Actually the question popped up when talking with a statistician

Of course I will go with limma for RNASeq given the nature of the data, that I have 20K and not 100 molecules, and the very small number of subjects. But when you have linear regression, from protein expression, levels of Creatinine, etc and prognostic factors, like BL measurement (before treatment), then those BL measures are used as covariates and not as depended variables, to increase precision, eg more precise results with the same number of subjects (this applies easily only for linear regression). So I wondered why not here with limma

I could do that for RNASeq eg after using vst to plug them in lme4 or gls, but I still prefer to use limma to avoid errors from me becoming creative and also use random effects coz with 8 subject and 5 timepoints and adding 8 subjects as fixed effects I lose a lot of power as you mentioned

Thanks for all the time and the clarifications

Entering edit mode

No, you are not losing power. The two possible limma analyses (fixed or random Subject effects) are both more powerful than the alternative analysis you describe using lme4 or gls with BL as a covariate instead of an observation.

Entering edit mode

Yes. But still I prefer using the covariance matrix over fixed effects for power issues. Or at least I wanted to use it as such before our discussions. Now I need to check more. Thanks again for all the help!


Login before adding your answer.

Traffic: 240 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6