Hi BioC community,
I work on a PCR data set designed like that :
- 21 patients are included in the study
- For each patient, his gene expression profile was measured (by PCR) at the inclusion of the study on a sample of his back which was not treated. We call this gene expression measure the “baseline” measure, the time of this measure is called “T1”.
- After this baseline measurement, two areas are defined in the back of the patient : a left area (also called Z2) and a right area (also called Z1). A treatment (called T) and a control (= no treatment, called NT) are randomized between the two areas. So, each patient has one of the two following sequences :
- Treatment in the left side of his back and no treatment in the right side (sequence called “T.NT” in the following)
- No treatment in the left side of his back and treatment in the right side (sequence called “NT.T” in the following)
- His gene expression was then measured 8 days (called T8) and 22 days (called T22) after the baseline measurement on the two areas treated and no treated.
So, for each of the 21 patients, we have 5 samples (T1, T8Z1, T8Z2, T22Z1, T22Z2) on which we measured gene expression.
The gene expression was measured by PCR (TLDA technology), which gives the Ct of 64 genes, each gene measured in triplicate.
The variable to explain (=the response) is the evolution of gene expression relatively to T1.
So, briefly, the following steps were performed before fitting the model :
- Normalize the dataset to obtain DeltaCt for 63 genes (1 gene was used to normalize the others genes)
- Compute, the gene expression evolution as : DeltaCt_TimeX-DeltaCtT1 (where TimeX represents time8 or time22 according to the sample which is considered)
The first two questions are inter-product analysis, i.e comparison between the two products treatment and no treatment :
1/ Is the product effect significantly associated with the evolution (all times together) ?
2/ For each time (8 and 22) separately : Is the product effect significantly associated with the evolution ?
The last two questions are intra-product analysis, i.e analysis for each product separately :
3/ For each product (treatment and no treatment) separately : Is the time effect significantly associated with the evolution ?
4/For each product and each time (NT.8, NT.22, T.8, T.22) separately : is there a significant association with the evolution ?
These kind of design is very used in our field and an ancova model is routinely used to answer to the 4 questions, when we have only one response variable, for example a clinical variable.
Here is an example of the dataset we use in this clinical context:
> head(clinical)
patientid.f time.f product.f sequence.f side.f T1 evolution
117 8 T T.NT left -0.1780001 0.2070001
117 8 NT T.NT right -0.1780001 0.5546665
110 8 T NT.T right -2.2918336 1.2368339
124 22 T T.NT left -1.7133338 -0.2925002
102 22 NT T.NT right -0.4601666 1.4905005
101 8 T NT.T right -1.0029996 -0.1920007
To be more precise, we fit the following model,
> options(contrasts=c(factor="contr.SAS",ordered="contr.poly"))
> library(lme4)
> library(lmerTest)
> model = lmer(evolution ~ T1 + product.f + side.f + time.f + product.f:time.f + sequence.f + (1|patientid.f), data=clinical)
where :
- the evolution is the response
- T1 is the baseline measurement of the clinical variable (continous parameter allowing for taking into account the baseline level of the clinical parameter in the evolution).
- product is the product factor
- side is a blocking variable indicating the area where the sample was measured
- time is the time of the measure
- sequence is another blocking variable indicating one of the two sequence (NT.T and T.NT) presented above
- A random effect is set on the patient.
The answers of the 4 questions are respectively given by :
1/ anova(model, type=3)
: product effect studied
2/ difflsmeans(model, test.effs="product.f:time.f")
3/ lsmeans(model, test.effs="product.f")
4/ lsmeans(model, test.effs="product.f:time.f")
So my question is mainly : how to transpose this analysis in the context of gene expression with limma, taking into account triplicates in measurement, covariate T1 depending on gene and missing values ?
Here is an extract of the eset used, with 2 genes in triplicates (each line represent a technical replicate, eset is ordered by gene and triplicates), and 5 samples. A cell is the evolution of gene expression relatively to T1. There are missing values.
> eset[1:6,1:5]
P101T08Z01 P101T08Z02 P101T22Z01 P101T22Z02 P102T08Z01
EGFR.1 1.3973325 1.922667 -0.9410000 1.34133212 3.066335
EGFR.2 0.2103322 1.816666 -1.0289993 0.86233203 2.921334
EGFR.3 0.4623330 0.870667 0.8929996 0.78133265 1.499333
HER2.1 -1.9818335 0.820501 -1.6911662 -0.04483382 1.203333
HER2.2 -1.8958330 1.277500 -1.2571662 0.26116594 1.263334
HER2.3 -2.0888338 1.008501 NA 0.19516595 NA
And here the code I propose to answer the four questions :
> product.f[1:10] [1] T NT T NT NT T NT T NT T Levels: NT T > side.f[1:10] [1] right left right left right left right left right left Levels: left right > patientid.f[1:10] [1] 101 101 101 101 102 102 102 102 103 103 21 Levels: 101 102 103 104 105 106 107 108 109 110 111 112 113 115 116 ... 124 > time.f[1:10] [1] 8 8 22 22 8 8 22 22 8 8 Levels: 8 22 > sequence.f[1:10] [1] NT.T NT.T NT.T NT.T T.NT T.NT T.NT T.NT T.NT T.NT Levels: NT.T T.NT > gp <- factor(paste(product.f,time.f,sep=".")) > gp<-factor(gp,levels=c("NT.8","NT.22","T.8","T.22")) # Design matrix taking into account the two block variables and gp > design <- model.matrix(~ side.f + sequence.f + gp) > colnames(design) #[1] "(Intercept)" "side.fleft" "sequence.fT.NT" "gpNT.22" #[5] "gpT.8" "gpT.22" > colnames(design)[1]<-"Intercept" > colnames(design)[4]<-"gpNT.22vsgpNT.8" > colnames(design)[5]<-"gpT.8vsgpNT.8" > colnames(design)[6]<-"gpT.22vsgpNT.8" > library(limma) > corfit <- duplicateCorrelation(eset,design=design,block=patientid.f) > corfit$consensus # [1] 0.5792043 > fit <- lmFit(eset, design,block=patientid.f,correlation=corfit$consensus) > cm <- makeContrasts( productEffect = (gpT.8vsgpNT.8)/2- (gpT.22vsgpNT.8-(gpNT.22vsgpNT.8))/2, productEffectTime8=gpT.8vsgpNT.8, productEffectTime22=gpT.22vsgpNT.8-gpNT.22vsgpNT.8, timeEffectproductT=gpT.22vsgpNT.8-gpT.8vsgpNT.8, timeEffectproductNT=gpNT.22vsgpNT.8, levels=design) > fit <- contrasts.fit(fit, cm) # As recommended in the post DE analysis of PCR array [was: dataset dim for siggenes], I use robust=T and trend=T in eBayes for PCR data > fit <- eBayes(fit,robust=T,trend=T)
Here are finaly the answers to my questions :
1/ > topTable(fit, coef="productEffect")
2/
> topTable(fit, coef="productEffectTime8") > topTable(fit, coef="productEffectTime22")
3/
> topTable(fit, coef="timeEffectproductNT") > topTable(fit, coef="timeEffectproductT")
4/ ????
And so my questions are :
A/ In this model, I do not take into account for the covariate T1 because it depends not only on the patient but also on the gene studied. So how can I do to take into account the baseline expression in the model?
B/ I use duplicateCorrelation for the patientid, to transpose what is done in the clinical setting. In this case, how can I take into account the triplicates ? I tried with “patientid.f” as a block variable to take into account pairing, but sequence and patientid.f are counfounded (all samples from the same patient have the same sequence) and so it is not possible to keep the two informations. Taking patientid as a block variable, there is also the problem of missing values which make the design unbalanced, and in this case it is recommended to use duplicateCorrelation.
C/ How to obtain the answers to question 4, that is to say the coefficients: gpNT.8, gpNT.22, gpT.8 and gpT.22 ?
Many thanks for having read this long post, and thanks in advance for your answers,
Eléonore
Hi Aaron,
Thanks a lot for your answer.
Here are my comments :
In this study, it is not as clear as in my example because there is not an a priori reason to think that left side and right side of the back could influence differently gene expression. But nothing is guaranteed and we want to be consistent between all the studies with the same design, that’s why it is important for us to control site and sequence effect in our model.
So my question is: is it possible (and how ?) to fit in limma a model including 2 block variables (site and sequence), pairing, technical replicates and continuous covariate depending on gene expression ?
If it is not possible, I could give up some of my requirements (in a priority order) : do not include sequence variable, site variable, averaging triplicates, do not include T1 variable… but I hope model including maximum information will be possible !
Thanks again to Aaron and to all the users who could help me !
Eléonore
duplicateCorrelation
. This doesn't happen with the paired design, which will absorb patient-specific effects before variance estimation.avearrays
function.OK, I understand what you mean as for the T1 baseline and patientid.
- For technical replicates, if I use patientid as a block variable as you mention, I can use duplicateCorrelation (for duplicate spots) to capture variation due to triplicates, it is the best way to include pairing and triplicates, isn't it ?
- I do not understand very well how to perform "splitting of the last four coefficients into two groups". Could you explain to me how to do please ?
Many Thanks
Eléonore
Yes, you can block on the triplicates that way if you want. I only suggest averaging because it's simpler and saves time. If the triplicates are highly correlated, averaging will result in little-to-no loss of information.
For the splitting, it's hard to describe without code, but basically, your model would look like:
This will (probably) not initially be of full rank; you need to remove any
gp:sequence.f
coefficients corresponding to the T1 samples, because T1 expression in each patient is represented by the patient blocking factors instead. You should be able to do that by identifying the column name that hasgpT1
in it and dropping it fromdesign
.Thank you very much Aaron.
Here is my script, coud you please check if it is correct ?
Moreover, I also have some questions, if could answer it would be great :
But I do not understand why T1 is included intercept ?
###################################################
eset[1 :7,1 :7]
P101T01Z00 P101T08Z01 P101T08Z02 P101T22Z01 P101T22Z02 P102T01Z00
EGFR.1 2.38633347 2.30200005 3.4173342 1.8256671 3.4190000 0.6233336
EGFR.2 2.67133331 2.61499977 3.3733336 1.8186671 3.4099992 1.5343329
EGFR.3 2.56733322 2.88500023 4.4793326 1.5946662 3.7179998 1.2973334
SFN.1 0.05133438 -0.37700081 0.6793334 -1.1513341 -0.1470000 0.2263343
SFN.2 -0.32466698 -0.33300018 0.4613330 -0.8423341 0.6239999 -0.6716665
SFN.3 -0.58266640 0.04100037 0.5773341 -1.5653337 0.4889997 -0.5826658
SPPR2B.1 0.39233398 0.54599953 2.0713336 -0.7923330 1.4899991 -1.2266668
P102T08Z01
EGFR.1 NA
EGFR.2 -1.97000027
EGFR.3 -2.31700039
SFN.1 -0.09200001
SFN.2 NA
SFN.3 -0.47099972
SPPR2B.1 2.13100147
patientid.f[1 :7]
# [1] P101 P101 P101 P101 P101 P102 P102
# 22 Levels: P101 P102 P103 P104 P105 P106 P107 P108 P109 P110 P111 P112 ... P124
gp.f[1 :7]
# [1] T1 T.08 NT.08 T.22 NT.22 T1 NT.08
# Levels: T1 NT.08 NT.22 T.08 T.22
sequence.f[1 :7]
# [1] NT.T NT.T NT.T NT.T NT.T T.NT T.NT
# Levels: NT.T T.NT
design <- model.matrix(~ patientid.f + gp.f:sequence.f)
colnames(design)
#[1] "(Intercept)" "patientid.fP102"
#[3] "patientid.fP103" "patientid.fP104"
#[5] "patientid.fP105" "patientid.fP106"
#[7] "patientid.fP107" "patientid.fP108"
#[9] "patientid.fP109" "patientid.fP110"
#[11] "patientid.fP111" "patientid.fP112"
#[13] "patientid.fP113" "patientid.fP115"
#[15] "patientid.fP116" "patientid.fP117"
#[17] "patientid.fP118" "patientid.fP119"
#[19] "patientid.fP121" "patientid.fP122"
#[21] "patientid.fP123" "patientid.fP124"
#[23] "gp.fT1:sequence.fNT.T" "gp.fNT.08:sequence.fNT.T"
#[25] "gp.fNT.22:sequence.fNT.T" "gp.fT.08:sequence.fNT.T"
#[27] "gp.fT.22:sequence.fNT.T" "gp.fT1:sequence.fT.NT"
#[29] "gp.fNT.08:sequence.fT.NT" "gp.fNT.22:sequence.fT.NT"
#[31] "gp.fT.08:sequence.fT.NT" "gp.fT.22:sequence.fT.NT"
design<-design[,-c(23,28)]
colnames(design)
#[1] "(Intercept)" "patientid.fP102"
#[3] "patientid.fP103" "patientid.fP104"
#[5] "patientid.fP105" "patientid.fP106"
#[7] "patientid.fP107" "patientid.fP108"
#[9] "patientid.fP109" "patientid.fP110"
#[11] "patientid.fP111" "patientid.fP112"
#[13] "patientid.fP113" "patientid.fP115"
#[15] "patientid.fP116" "patientid.fP117"
#[17] "patientid.fP118" "patientid.fP119"
#[19] "patientid.fP121" "patientid.fP122"
#[21] "patientid.fP123" "patientid.fP124"
#[23] "gp.fNT.08:sequence.fNT.T" "gp.fNT.22:sequence.fNT.T"
#[25] "gp.fT.08:sequence.fNT.T" "gp.fT.22:sequence.fNT.T"
#[27] "gp.fNT.08:sequence.fT.NT" "gp.fNT.22:sequence.fT.NT"
#[29] "gp.fT.08:sequence.fT.NT" "gp.fT.22:sequence.fT.NT"
# To make the names OK for R
colnames(design)<-gsub(":","_",colnames(design),fixed=T)
colnames(design)[1]<-"Intercept"
library(limma)
# Triplicates are ordered by block of 3 in the eset
corfit <- duplicateCorrelation(eset,ndups=3,design=design,spacing=1)
corfit$consensus
# [1] 0.9672377
fit <- lmFit(eset, design,ndups=3,spacing=1,correlation=corfit$consensus)
# To give the name of genes to fit
nom<-gsub(".1","",rownames(eset),fixed=T)
nom<-gsub(".2","",nom,fixed=T)
nom<-gsub(".3","",nom,fixed=T)
rownames(fit$coefficients)<-unique(nom)
#Warning message:
#Partial NA coefficients for 3 probe(s)
cm <- makeContrasts(
productEffect = (gp.fT.08_sequence.fNT.T+gp.fT.22_sequence.fNT.T+gp.fT.08_sequence.fT.NT+gp.fT.22_sequence.fT.NT)/4- (gp.fNT.08_sequence.fNT.T+gp.fNT.22_sequence.fNT.T+gp.fNT.08_sequence.fT.NT+gp.fNT.22_sequence.fT.NT)/4 ,
productEffectTime8=(gp.fT.08_sequence.fNT.T+gp.fT.08_sequence.fT.NT)/2-(gp.fNT.08_sequence.fNT.T+gp.fNT.08_sequence.fT.NT)/2,
productEffectTime22=(gp.fT.22_sequence.fNT.T+gp.fT.22_sequence.fT.NT)/2-(gp.fNT.22_sequence.fNT.T+gp.fNT.22_sequence.fT.NT)/2,
timeEffectproductT=(gp.fT.22_sequence.fNT.T+gp.fT.22_sequence.fT.NT)/2-(gp.fT.08_sequence.fNT.T+gp.fT.08_sequence.fT.NT)/2,
timeEffectproductNT=(gp.fNT.22_sequence.fNT.T+gp.fNT.22_sequence.fT.NT)/2-(gp.fNT.08_sequence.fNT.T+gp.fNT.08_sequence.fT.NT)/2,
gpNT.8=(gp.fNT.08_sequence.fNT.T+gp.fNT.08_sequence.fT.NT)/2,
gpNT.22=(gp.fNT.22_sequence.fNT.T+gp.fNT.22_sequence.fT.NT)/2,
gpT.8=(gp.fT.08_sequence.fNT.T+gp.fT.08_sequence.fT.NT)/2,
gpT.22=(gp.fT.22_sequence.fNT.T+gp.fT.22_sequence.fT.NT)/2,
levels=design)
fit <- contrasts.fit(fit, cm)
fit <- eBayes(fit,robust=T,trend=T)
# 1/ Product effect (all times together)
topTable(fit, coef="productEffect",number=20)
# 2/ Product effect for each time
topTable(fit, coef="productEffectTime8",n=20)
topTable(fit, coef="productEffectTime22",n=20)
# 3/ Time effect for each product separately
topTable(fit, coef="timeEffectproductNT")
topTable(fit, coef="timeEffectproductT")
# 4/Effect for each groupe (product by time)
topTable(fit, coef="gpNT.8",n=20)
topTable(fit, coef="gpNT.22",n=20)
topTable(fit, coef="gpT.8",n=20)
topTable(fit, coef="gpT.22",n=20)
Let's consider an example, where you want to look at the effect of time on "evolution" for the treated samples, i.e., T.08 evolution against T.22 evolution. For your design, the last 8 coefficients directly represent the evolution values of each time/treatment combination over the T1 "baseline". So, to test for an effect of time on evolution, you can just set up your contrasts to compare the relevant evolution values:
Note that I've renamed the column names of design with "_" to replace ":", as the latter doesn't work well with
makeContrasts
. There's no need to explicitly account for the T1 expression in your contrasts, as that is already considered when the coefficients are estimated.For the left/right stuff, the sequence is fully absorbed into the patient blocking factors. There is no need to do anything extra in the model to account for that, because any (spurious) differences due to sequence will be factored out in the patient effects and ignored. The only extra thing that this model does is to have sequence-specific log-fold changes for the different combinations over T1. This allows you to check if the evolution for left/right is different from the evolution of right/left. You can do that by comparing the evolution values with similar contrasts to the ones I've specified above.
Hi Aaron and thanks again for your comments. I think I was not very clear to ask my questions, that’s why here are more details :
In our “clinical” model, we ALSO (in addition to explain the evolution) include T1 as a continuous covariate to take into account that an evolution of 5 for example is not considered the same if your baseline level is 1 or if it is 10. So, by including T1 as a continuous covariate we would like to make comparable the evolution taking into account the baseline values. From what I understand, the model
do not take into account this information, isn’t it ? Is it possible to take into account this information ?
but the problem is that the T1 samples have not “site” information (they are sampled before the randomization of the treatment between right and left sites). The “site” information is more important for us than “sequence” variable, so if we have to choose between the two, we prefer to keep the “site” information in the model. Do you think it is possible ?
I hope it is more clear now, do not hesitate to ask me details if you need.
Many thanks for spending time on helping me,
Eléonore
To answer your second question first; you can re-parametrize the model to use site information rather than sequence information. In effect, you define:
... where each T1 sample just has a
site.f
of""
and every other sample has"left"
or"right"
, depending on what's appropriate. You can then use this in a model with:This should give you the patient blocking factors, plus 8 coefficients at the end that represent the log-fold change of each combination of these factors (left or right, treated or untreated, time 8 or 22) over the T1, i.e., their evolution values. You can then drop these separately, compare them to each other via
makeContrasts
, etc., depending on what DE comparison you want to perform.Note that this model specification already contains all of the information in
sequence.f
, so you don't need to supply the sequence for each patient separately. For example, if you know a patient is treated on the left (based on its factor combinations), then the right must be untreated, so that patient would have sequence T.NT. In fact, the converse is also true; if you have the sequence information in the model (as in the one you've specified in your last post), then you don't need to explicitly supply the site either.Now, for your second question; the short answer is no, and the long answer is yes. For starters, supplying the T1 expression as a covariate will not affect the evolution values. If you tried to do that, all samples for the same patient would have the same T1 value and the same coefficient for the covariate term. That coefficient would cancel out when calculating evolution values within patients. In fact, the T1 covariate will be totally absorbed by the patient blocking factor anyway, such that it wouldn't be estimable (i.e., you can't distinguish between patient-specific effects and an effect correlated to T1 expression levels).
If you want to compute T1-dependent evolution values, simply putting T1 as a covariate in a linear model is not sufficient. Rather, you need a model with interaction terms for
gp.site
and T1. This should be possible to construct by extracting the last 8 coefficients of the design matrix, multiplying them by the T1 expression across patients, and thencbind
'ing the products back onto the matrix, i.e.:These new terms represent the effect of T1 expression on the evolution values, assuming a linear response (specifically, you can imagine fitting a line to the evolution values against T1 for each combination of factors; each of the old 8 terms represent the intercept, while the each of the new 8 terms represent the gradient of this line). I think you'll need at least 5 patients, though, otherwise you won't have enough residual d.f. to estimate the variance afterwards. You could simplify it to have a common gradient by doing:
... before the
cbind
call. Alternatively, you could have a common intercept but that seems less flexible.Hi Aaron,
Thanks again for your help.
I understand the part with gp.site where the reference level is T1 to explain evolution in different
combinations treatment/time/site.
For example treatment effect can be computed by substracting the mean from all the "NT" coefficients (4 terms) to the mean from the 4 treated coefficients, isn't it ?
I also understand that the information to compute the interaction term "product:site" (=the sequence) is present in gp.site . To test the sequence effect, the idea is to test if the treatment effect is the same in the left site (mean of the 2 treated coefficients left site - mean of the 2 NT coefficients left site) compared to the right site.
As for the T1 convariate, I tried to do what you recommend but it is not possible
because expression vector T1 is different between genes... so we have a lot of T1 expression vector
and not only one... May be I do not understand exactly what you mean and I am wrong ?
Thanks
Eléonore
Ah, sorry; I forgot that T1 was the gene expression. Well, that poses a problem, because you'll need different design matrices for each gene, and
limma
doesn't support that, regardless of whether you fit to the raw Ct's or to the evolution values. I don't see any way to get around this. I suppose you should start by seeing if you really need that level of complexity in your model. Just try fitting the design without T1 as a covariate, and if it gives sensible DE genes, then that's probably good enough; there's really no need to make life difficult for yourself.OK for T1.
Are you OK with my computation of treatment effect and sequence effect in the previous comment ?
Thank very much for having spent time on my problem
Eleonore
Yes, your calculations look fine. Keep in mind, though, that the final log-fold change won't be able to capture strange behaviour, e.g., if gene expression goes down in treated right vs. untreated right, but goes up in treated left vs untreated left. The reported change will just be the average effect of treated vs. untreated, which means that this oddness will not be picked up. In short, it's always worth checking that the individual contrasts are behaving like they should (i.e., check left/right treated vs. untreated at each time point separately) to make sure that the average log-fold change can be interpreted sensibly.
Yes, it will safer !