Question

limma with continous covariate depending on gene, 2 block variables, pairing and technical replicates

1

Entering edit mode

eleonoregravier ▴ 70

@eleonoregravier-8219

Last seen 2.6 years ago

France

Hi BioC community,

I work on a PCR data set designed like that :

- 21 patients are included in the study

- For each patient, his gene expression profile was measured (by PCR) at the inclusion of the study on a sample of his back which was not treated. We call this gene expression measure the “baseline” measure, the time of this measure is called “T1”.

- After this baseline measurement, two areas are defined in the back of the patient : a left area (also called Z2) and a right area (also called Z1). A treatment (called T) and a control (= no treatment, called NT) are randomized between the two areas. So, each patient has one of the two following sequences :

- Treatment in the left side of his back and no treatment in the right side (sequence called “T.NT” in the following)

- No treatment in the left side of his back and treatment in the right side (sequence called “NT.T” in the following)

- His gene expression was then measured 8 days (called T8) and 22 days (called T22) after the baseline measurement on the two areas treated and no treated.

So, for each of the 21 patients, we have 5 samples (T1, T8Z1, T8Z2, T22Z1, T22Z2) on which we measured gene expression.

The gene expression was measured by PCR (TLDA technology), which gives the Ct of 64 genes, each gene measured in triplicate.

The variable to explain (=the response) is the evolution of gene expression relatively to T1.

So, briefly, the following steps were performed before fitting the model :

Normalize the dataset to obtain DeltaCt for 63 genes (1 gene was used to normalize the others genes)
Compute, the gene expression evolution as : DeltaCt_TimeX-DeltaCtT1 (where TimeX represents time8 or time22 according to the sample which is considered)

The first two questions are inter-product analysis, i.e comparison between the two products treatment and no treatment :

1/ Is the product effect significantly associated with the evolution (all times together) ?

2/ For each time (8 and 22) separately : Is the product effect significantly associated with the evolution ?

The last two questions are intra-product analysis, i.e analysis for each product separately :

3/ For each product (treatment and no treatment) separately : Is the time effect significantly associated with the evolution ?

4/For each product and each time (NT.8, NT.22, T.8, T.22) separately : is there a significant association with the evolution ?

These kind of design is very used in our field and an ancova model is routinely used to answer to the 4 questions, when we have only one response variable, for example a clinical variable.

Here is an example of the dataset we use in this clinical context:

> head(clinical)

patientid.f time.f product.f sequence.f side.f T1 evolution

117 8 T T.NT left -0.1780001 0.2070001

117 8 NT T.NT right -0.1780001 0.5546665

110 8 T NT.T right -2.2918336 1.2368339

124 22 T T.NT left -1.7133338 -0.2925002

102 22 NT T.NT right -0.4601666 1.4905005

101 8 T NT.T right -1.0029996 -0.1920007

To be more precise, we fit the following model,

> options(contrasts=c(factor="contr.SAS",ordered="contr.poly"))
> library(lme4)
> library(lmerTest)
> model = lmer(evolution ~ T1 + product.f + side.f + time.f + product.f:time.f + sequence.f + (1|patientid.f), data=clinical)

where :

the evolution is the response
T1 is the baseline measurement of the clinical variable (continous parameter allowing for taking into account the baseline level of the clinical parameter in the evolution).
product is the product factor
side is a blocking variable indicating the area where the sample was measured
time is the time of the measure
sequence is another blocking variable indicating one of the two sequence (NT.T and T.NT) presented above
A random effect is set on the patient.

The answers of the 4 questions are respectively given by :

1/ anova(model, type=3) : product effect studied

2/ difflsmeans(model, test.effs="product.f:time.f")

3/ lsmeans(model, test.effs="product.f")

4/ lsmeans(model, test.effs="product.f:time.f")

So my question is mainly : how to transpose this analysis in the context of gene expression with limma, taking into account triplicates in measurement, covariate T1 depending on gene and missing values ?

Here is an extract of the eset used, with 2 genes in triplicates (each line represent a technical replicate, eset is ordered by gene and triplicates), and 5 samples. A cell is the evolution of gene expression relatively to T1. There are missing values.

> eset[1:6,1:5]

P101T08Z01 P101T08Z02 P101T22Z01 P101T22Z02 P102T08Z01

EGFR.1 1.3973325 1.922667 -0.9410000 1.34133212 3.066335

EGFR.2 0.2103322 1.816666 -1.0289993 0.86233203 2.921334

EGFR.3 0.4623330 0.870667 0.8929996 0.78133265 1.499333

HER2.1 -1.9818335 0.820501 -1.6911662 -0.04483382 1.203333

HER2.2 -1.8958330 1.277500 -1.2571662 0.26116594 1.263334

HER2.3 -2.0888338 1.008501 NA 0.19516595 NA

And here the code I propose to answer the four questions :

> product.f[1:10]
[1] T  NT T  NT NT T  NT T  NT T
Levels: NT T
> side.f[1:10]
[1] right left  right left  right left  right left  right left
Levels: left right
> patientid.f[1:10]
[1] 101 101 101 101 102 102 102 102 103 103
21 Levels: 101 102 103 104 105 106 107 108 109 110 111 112 113 115 116 ... 124
> time.f[1:10]
[1] 8  8  22 22 8  8  22 22 8  8
Levels: 8 22
> sequence.f[1:10]
[1] NT.T NT.T NT.T NT.T T.NT T.NT T.NT T.NT T.NT T.NT
Levels: NT.T T.NT

> gp <- factor(paste(product.f,time.f,sep="."))
> gp<-factor(gp,levels=c("NT.8","NT.22","T.8","T.22"))

# Design matrix taking into account the two block variables and gp
> design <- model.matrix(~  side.f + sequence.f + gp)
> colnames(design)
#[1] "(Intercept)"      "side.fleft"     "sequence.fT.NT" "gpNT.22"        
#[5] "gpT.8"            "gpT.22"         
> colnames(design)[1]<-"Intercept"
> colnames(design)[4]<-"gpNT.22vsgpNT.8"
> colnames(design)[5]<-"gpT.8vsgpNT.8"
> colnames(design)[6]<-"gpT.22vsgpNT.8"

> library(limma)
> corfit <- duplicateCorrelation(eset,design=design,block=patientid.f)
> corfit$consensus
# [1] 0.5792043
> fit <- lmFit(eset, design,block=patientid.f,correlation=corfit$consensus)
> cm <- makeContrasts(

                        productEffect = (gpT.8vsgpNT.8)/2- (gpT.22vsgpNT.8-(gpNT.22vsgpNT.8))/2,
                        productEffectTime8=gpT.8vsgpNT.8,
                        productEffectTime22=gpT.22vsgpNT.8-gpNT.22vsgpNT.8,
                        timeEffectproductT=gpT.22vsgpNT.8-gpT.8vsgpNT.8,
                        timeEffectproductNT=gpNT.22vsgpNT.8,
                        levels=design)
> fit <- contrasts.fit(fit, cm)

# As recommended in the post DE analysis of PCR array [was: dataset dim for siggenes], I use robust=T and  trend=T in eBayes for PCR data
> fit <- eBayes(fit,robust=T,trend=T)

Here are finaly the answers to my questions :

1/ > topTable(fit, coef="productEffect")

2/

> topTable(fit, coef="productEffectTime8")

> topTable(fit, coef="productEffectTime22")

3/

> topTable(fit, coef="timeEffectproductNT")

> topTable(fit, coef="timeEffectproductT")

4/ ????

And so my questions are :

A/ In this model, I do not take into account for the covariate T1 because it depends not only on the patient but also on the gene studied. So how can I do to take into account the baseline expression in the model?

B/ I use duplicateCorrelation for the patientid, to transpose what is done in the clinical setting. In this case, how can I take into account the triplicates ? I tried with “patientid.f” as a block variable to take into account pairing, but sequence and patientid.f are counfounded (all samples from the same patient have the same sequence) and so it is not possible to keep the two informations. Taking patientid as a block variable, there is also the problem of missing values which make the design unbalanced, and in this case it is recommended to use duplicateCorrelation.

C/ How to obtain the answers to question 4, that is to say the coefficients: gpNT.8, gpNT.22, gpT.8 and gpT.22 ?

Many thanks for having read this long post, and thanks in advance for your answers,

Eléonore

limma covariate block paired samples technical replicates • 3.6k views

ADD COMMENT • link updated 9.4 years ago by Aaron Lun ★ 28k • written 9.4 years ago by eleonoregravier ▴ 70

score 1 · Answer 1 · 2015-11-18

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 1 hour ago

The city by the bay

Your design seems unnecessarily complicated. Why don't you just block on the patient ID directly in the design? Each patient has T1 and treated/untreated samples at both time points, so the comparisons can be performed within each patient. The design matrix would be something like:

design <- model.matrix(~ 0 + patientid.f + gp)

... where gp has all 5 levels (T1, NT.T8, T.T8, NT.T22, T.T22), releveled to have T1 as the baseline. This should give you one coefficient for each patient, plus an extra four coefficients representing the fold changes of the non-T1 samples over T1. If you're paranoid about the left/right sequence, you can split those 4 coefficients into two sets; one for the left/right patients, and another for the right/left patients. In any case, you can drop any number of these coefficients (or compare them to one another) to test for your product effect over time.

Now, onto your questions. For (A), the model I've suggested above will necessarily include T1. For (B), I'd suggest you just average the triplicate measurements and fit the model to the averages - from what I understand, these triplicates are technical replicates, and the true replication in this setting is that between patients. For (C), you can just compare each time/product combination to the T1 baseline within each patient, again using the above model. Not sure how useful that is, though, as you won't be able to tell whether the time or product effect is driving the increase over the baseline.

ADD COMMENT • link 9.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

Thanks a lot for your answer.

Here are my comments :

The design is deliberately complicated because in our field (dermatology), there is a strong influence of factors like site or sequence in the response. For example where you study moisturizing effect of a cream and you need to use several areas of the arm (from the insight of the elbow to the intern face of the wrist for example), there is a significant difference of dryness between these areas to take into account in your model (as a block variable).

In this study, it is not as clear as in my example because there is not an a priori reason to think that left side and right side of the back could influence differently gene expression. But nothing is guaranteed and we want to be consistent between all the studies with the same design, that’s why it is important for us to control site and sequence effect in our model.

I have a paired analysis but I do not block on the patient ID for 2 reasons :
- patientid is confounded with sequence
- There are missing values and so in this unbalanced design it is recommended to use duplicateCorrelation because random effects approach is statistically more powerful (see post https://mailman.stat.ethz.ch/pipermail/bioconductor/2014-February/057887.html)

If I well understand, the design you propose is based on DeltaCt values (and not evolution values as I presented), else the T1 group would have all its values equal to 0. So your design compare each of the four groups to T1. It is NOT what we want to do, we want to evaluate the effect of treatment and time on evolution values adjusted on the the gene expression baseline T1. That’s why we need to include T1 as a covariate (and not a reference group on which to compare).

As for averaging of triplicates, it is also not the solution recommended because it results in the loss of valuable information about genewise variability (see http://bioinformatics.oxfordjournals.org/content/21/9/2067.full). That’s why we would like to include triplicate information in our analysis

So my question is: is it possible (and how ?) to fit in limma a model including 2 block variables (site and sequence), pairing, technical replicates and continuous covariate depending on gene expression ?

If it is not possible, I could give up some of my requirements (in a priority order) : do not include sequence variable, site variable, averaging triplicates, do not include T1 variable… but I hope model including maximum information will be possible !

Thanks again to Aaron and to all the users who could help me !

Eléonore

ADD REPLY • link 9.4 years ago eleonoregravier ▴ 70

1

Entering edit mode

In my design, you can split the last four coefficients into two groups of four, to account for sequence-specific effects in the relevant patients. This is not affected by the fact that patient and sequence are directly confounding.
While missing values will result in loss of information from some samples, it is arguable that you will also experience variance inflation and loss of power by treating patients as direct replicates with duplicateCorrelation. This doesn't happen with the paired design, which will absorb patient-specific effects before variance estimation.
From what I understand, your "evolution" term refers to differences between the other samples and T1. So, the effect of time or treatment on "evolution" refers to differential differences, which simplifies to comparing the non-T1 groups directly as the T1 expression will cancel out. This is fully accommodated within the paired model; you can just compare any two of the last four coefficients to each other, rather than dropping them for comparison to T1.
If they're technical replicates, the information isn't really that valuable compared to the variance information between patients. Check out the avearrays function.

ADD REPLY • link 9.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

OK, I understand what you mean as for the T1 baseline and patientid.

- For technical replicates, if I use patientid as a block variable as you mention, I can use duplicateCorrelation (for duplicate spots) to capture variation due to triplicates, it is the best way to include pairing and triplicates, isn't it ?

- I do not understand very well how to perform "splitting of the last four coefficients into two groups". Could you explain to me how to do please ?

Many Thanks

Eléonore

ADD REPLY • link 9.4 years ago eleonoregravier ▴ 70

1

Entering edit mode

Yes, you can block on the triplicates that way if you want. I only suggest averaging because it's simpler and saves time. If the triplicates are highly correlated, averaging will result in little-to-no loss of information.

For the splitting, it's hard to describe without code, but basically, your model would look like:

design <- model.matrix(~patientid.f + gp:sequence.f)

This will (probably) not initially be of full rank; you need to remove any gp:sequence.f coefficients corresponding to the T1 samples, because T1 expression in each patient is represented by the patient blocking factors instead. You should be able to do that by identifying the column name that has gpT1 in it and dropping it from design.

ADD REPLY • link 9.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you very much Aaron.

Here is my script, coud you please check if it is correct ?

Moreover, I also have some questions, if could answer it would be great :

I guess from what you have written that the interpretation of the coefficients is:
- Intercept : log average expression of P101 in T1 group,
- Patientid.fP102 : difference of expression between P102 and P101 in T1 group
- gp.fNT.22:sequence.fNT.T : Within the sequence NT.T, difference between NT.22 and T1

But I do not understand why T1 is included intercept ?

I am sorry to insist on this point, but with this design it seems to me that I explain the evolution (because I compare each non-T1 with T1) but I do not understand how I adjust the effect on evolution by taking into account the baseline ?
Finally, How could I take into account the site (left/right) in this model since all the T1 samples do not have “site” information ?

###################################################

eset[1 :7,1 :7]

P101T01Z00 P101T08Z01 P101T08Z02 P101T22Z01 P101T22Z02 P102T01Z00

EGFR.1 2.38633347 2.30200005 3.4173342 1.8256671 3.4190000 0.6233336

EGFR.2 2.67133331 2.61499977 3.3733336 1.8186671 3.4099992 1.5343329

EGFR.3 2.56733322 2.88500023 4.4793326 1.5946662 3.7179998 1.2973334

SFN.1 0.05133438 -0.37700081 0.6793334 -1.1513341 -0.1470000 0.2263343

SFN.2 -0.32466698 -0.33300018 0.4613330 -0.8423341 0.6239999 -0.6716665

SFN.3 -0.58266640 0.04100037 0.5773341 -1.5653337 0.4889997 -0.5826658

SPPR2B.1 0.39233398 0.54599953 2.0713336 -0.7923330 1.4899991 -1.2266668

P102T08Z01

EGFR.1 NA

EGFR.2 -1.97000027

EGFR.3 -2.31700039

SFN.1 -0.09200001

SFN.2 NA

SFN.3 -0.47099972

SPPR2B.1 2.13100147

patientid.f[1 :7]

# [1] P101 P101 P101 P101 P101 P102 P102

# 22 Levels: P101 P102 P103 P104 P105 P106 P107 P108 P109 P110 P111 P112 ... P124

gp.f[1 :7]

# [1] T1 T.08 NT.08 T.22 NT.22 T1 NT.08

# Levels: T1 NT.08 NT.22 T.08 T.22

sequence.f[1 :7]

# [1] NT.T NT.T NT.T NT.T NT.T T.NT T.NT

# Levels: NT.T T.NT

design <- model.matrix(~ patientid.f + gp.f:sequence.f)

colnames(design)

#[1] "(Intercept)" "patientid.fP102"

#[3] "patientid.fP103" "patientid.fP104"

#[5] "patientid.fP105" "patientid.fP106"

#[7] "patientid.fP107" "patientid.fP108"

#[9] "patientid.fP109" "patientid.fP110"

#[11] "patientid.fP111" "patientid.fP112"

#[13] "patientid.fP113" "patientid.fP115"

#[15] "patientid.fP116" "patientid.fP117"

#[17] "patientid.fP118" "patientid.fP119"

#[19] "patientid.fP121" "patientid.fP122"

#[21] "patientid.fP123" "patientid.fP124"

#[23] "gp.fT1:sequence.fNT.T" "gp.fNT.08:sequence.fNT.T"

#[25] "gp.fNT.22:sequence.fNT.T" "gp.fT.08:sequence.fNT.T"

#[27] "gp.fT.22:sequence.fNT.T" "gp.fT1:sequence.fT.NT"

#[29] "gp.fNT.08:sequence.fT.NT" "gp.fNT.22:sequence.fT.NT"

#[31] "gp.fT.08:sequence.fT.NT" "gp.fT.22:sequence.fT.NT"

design<-design[,-c(23,28)]

colnames(design)

#[1] "(Intercept)" "patientid.fP102"

#[3] "patientid.fP103" "patientid.fP104"

#[5] "patientid.fP105" "patientid.fP106"

#[7] "patientid.fP107" "patientid.fP108"

#[9] "patientid.fP109" "patientid.fP110"

#[11] "patientid.fP111" "patientid.fP112"

#[13] "patientid.fP113" "patientid.fP115"

#[15] "patientid.fP116" "patientid.fP117"

#[17] "patientid.fP118" "patientid.fP119"

#[19] "patientid.fP121" "patientid.fP122"

#[21] "patientid.fP123" "patientid.fP124"

#[23] "gp.fNT.08:sequence.fNT.T" "gp.fNT.22:sequence.fNT.T"

#[25] "gp.fT.08:sequence.fNT.T" "gp.fT.22:sequence.fNT.T"

#[27] "gp.fNT.08:sequence.fT.NT" "gp.fNT.22:sequence.fT.NT"

#[29] "gp.fT.08:sequence.fT.NT" "gp.fT.22:sequence.fT.NT"

# To make the names OK for R

colnames(design)<-gsub(":","_",colnames(design),fixed=T)

colnames(design)[1]<-"Intercept"

library(limma)

# Triplicates are ordered by block of 3 in the eset

corfit <- duplicateCorrelation(eset,ndups=3,design=design,spacing=1)

corfit$consensus

# [1] 0.9672377

fit <- lmFit(eset, design,ndups=3,spacing=1,correlation=corfit$consensus)

# To give the name of genes to fit

nom<-gsub(".1","",rownames(eset),fixed=T)

nom<-gsub(".2","",nom,fixed=T)

nom<-gsub(".3","",nom,fixed=T)

rownames(fit$coefficients)<-unique(nom)

#Warning message:

#Partial NA coefficients for 3 probe(s)

cm <- makeContrasts(

productEffect = (gp.fT.08_sequence.fNT.T+gp.fT.22_sequence.fNT.T+gp.fT.08_sequence.fT.NT+gp.fT.22_sequence.fT.NT)/4- (gp.fNT.08_sequence.fNT.T+gp.fNT.22_sequence.fNT.T+gp.fNT.08_sequence.fT.NT+gp.fNT.22_sequence.fT.NT)/4 ,

productEffectTime8=(gp.fT.08_sequence.fNT.T+gp.fT.08_sequence.fT.NT)/2-(gp.fNT.08_sequence.fNT.T+gp.fNT.08_sequence.fT.NT)/2,

productEffectTime22=(gp.fT.22_sequence.fNT.T+gp.fT.22_sequence.fT.NT)/2-(gp.fNT.22_sequence.fNT.T+gp.fNT.22_sequence.fT.NT)/2,

timeEffectproductT=(gp.fT.22_sequence.fNT.T+gp.fT.22_sequence.fT.NT)/2-(gp.fT.08_sequence.fNT.T+gp.fT.08_sequence.fT.NT)/2,

timeEffectproductNT=(gp.fNT.22_sequence.fNT.T+gp.fNT.22_sequence.fT.NT)/2-(gp.fNT.08_sequence.fNT.T+gp.fNT.08_sequence.fT.NT)/2,

gpNT.8=(gp.fNT.08_sequence.fNT.T+gp.fNT.08_sequence.fT.NT)/2,

gpNT.22=(gp.fNT.22_sequence.fNT.T+gp.fNT.22_sequence.fT.NT)/2,

gpT.8=(gp.fT.08_sequence.fNT.T+gp.fT.08_sequence.fT.NT)/2,

gpT.22=(gp.fT.22_sequence.fNT.T+gp.fT.22_sequence.fT.NT)/2,

levels=design)

fit <- contrasts.fit(fit, cm)

fit <- eBayes(fit,robust=T,trend=T)

# 1/ Product effect (all times together)

topTable(fit, coef="productEffect",number=20)

# 2/ Product effect for each time

topTable(fit, coef="productEffectTime8",n=20)

topTable(fit, coef="productEffectTime22",n=20)

# 3/ Time effect for each product separately

topTable(fit, coef="timeEffectproductNT")

topTable(fit, coef="timeEffectproductT")

# 4/Effect for each groupe (product by time)

topTable(fit, coef="gpNT.8",n=20)

topTable(fit, coef="gpNT.22",n=20)

topTable(fit, coef="gpT.8",n=20)

topTable(fit, coef="gpT.22",n=20)

ADD REPLY • link 9.4 years ago eleonoregravier ▴ 70

1

Entering edit mode

Let's consider an example, where you want to look at the effect of time on "evolution" for the treated samples, i.e., T.08 evolution against T.22 evolution. For your design, the last 8 coefficients directly represent the evolution values of each time/treatment combination over the T1 "baseline". So, to test for an effect of time on evolution, you can just set up your contrasts to compare the relevant evolution values:

con <- makeContrasts(gp.fT.22_sequence.fNT.T - gp.fT.08_sequence.fNT.T,
          levels=design) # left/right sequence
con <- makeContrasts(gp.fT.22_sequence.fT.NT - gp.fT.08_sequence.fT.NT, 
          levels=design) # right/left sequence

Note that I've renamed the column names of design with "_" to replace ":", as the latter doesn't work well with makeContrasts. There's no need to explicitly account for the T1 expression in your contrasts, as that is already considered when the coefficients are estimated.

For the left/right stuff, the sequence is fully absorbed into the patient blocking factors. There is no need to do anything extra in the model to account for that, because any (spurious) differences due to sequence will be factored out in the patient effects and ignored. The only extra thing that this model does is to have sequence-specific log-fold changes for the different combinations over T1. This allows you to check if the evolution for left/right is different from the evolution of right/left. You can do that by comparing the evolution values with similar contrasts to the ones I've specified above.

ADD REPLY • link 9.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron and thanks again for your comments. I think I was not very clear to ask my questions, that’s why here are more details :

I very well understand that the 8 last coefficients represent evolution relatively to T1.

In our “clinical” model, we ALSO (in addition to explain the evolution) include T1 as a continuous covariate to take into account that an evolution of 5 for example is not considered the same if your baseline level is 1 or if it is 10. So, by including T1 as a continuous covariate we would like to make comparable the evolution taking into account the baseline values. From what I understand, the model

design <- model.matrix(~patientid.f + gp:sequence.f)

do not take into account this information, isn’t it ? Is it possible to take into account this information ?

As for the “site” information (that is to say, for each sample from the 4 groups T08, NT08, T22 and NT22, the information on which side left or right the treatment was given), we include it as a block variable in our “clinical” model to take into account for non-relevant differences that can occur between the different sites (as I explained with the dryness example above). In the model you propose, only the sequence is taken into account (sequence is NOT the site variable, it is the interaction between product and site). I would like to simply include site as a block variable in the model like (as we do with a batch effect for example) :

design <- model.matrix(~patientid.f + site.f+ gp:sequence.f)

but the problem is that the T1 samples have not “site” information (they are sampled before the randomization of the treatment between right and left sites). The “site” information is more important for us than “sequence” variable, so if we have to choose between the two, we prefer to keep the “site” information in the model. Do you think it is possible ?

I hope it is more clear now, do not hesitate to ask me details if you need.

Many thanks for spending time on helping me,

Eléonore

ADD REPLY • link 9.4 years ago eleonoregravier ▴ 70

1

Entering edit mode

To answer your second question first; you can re-parametrize the model to use site information rather than sequence information. In effect, you define:

gp.site <- factor(paste0(gp, site.f))

... where each T1 sample just has a site.f of "" and every other sample has "left" or "right", depending on what's appropriate. You can then use this in a model with:

design <- model.matrix(~patientid.f + gp.site)

This should give you the patient blocking factors, plus 8 coefficients at the end that represent the log-fold change of each combination of these factors (left or right, treated or untreated, time 8 or 22) over the T1, i.e., their evolution values. You can then drop these separately, compare them to each other via makeContrasts, etc., depending on what DE comparison you want to perform.

Note that this model specification already contains all of the information in sequence.f, so you don't need to supply the sequence for each patient separately. For example, if you know a patient is treated on the left (based on its factor combinations), then the right must be untreated, so that patient would have sequence T.NT. In fact, the converse is also true; if you have the sequence information in the model (as in the one you've specified in your last post), then you don't need to explicitly supply the site either.

Now, for your second question; the short answer is no, and the long answer is yes. For starters, supplying the T1 expression as a covariate will not affect the evolution values. If you tried to do that, all samples for the same patient would have the same T1 value and the same coefficient for the covariate term. That coefficient would cancel out when calculating evolution values within patients. In fact, the T1 covariate will be totally absorbed by the patient blocking factor anyway, such that it wouldn't be estimable (i.e., you can't distinguish between patient-specific effects and an effect correlated to T1 expression levels).

If you want to compute T1-dependent evolution values, simply putting T1 as a covariate in a linear model is not sufficient. Rather, you need a model with interaction terms for gp.site and T1. This should be possible to construct by extracting the last 8 coefficients of the design matrix, multiplying them by the T1 expression across patients, and then cbind'ing the products back onto the matrix, i.e.:

nc <- ncol(design)
evolution.terms <- design[,(nc-7):nc]
evolution.terms <- evolution.terms * T1 # vector of expression
design.new <- cbind(design, evolution.terms)

These new terms represent the effect of T1 expression on the evolution values, assuming a linear response (specifically, you can imagine fitting a line to the evolution values against T1 for each combination of factors; each of the old 8 terms represent the intercept, while the each of the new 8 terms represent the gradient of this line). I think you'll need at least 5 patients, though, otherwise you won't have enough residual d.f. to estimate the variance afterwards. You could simplify it to have a common gradient by doing:

evolution.terms <- rowSums(evolution.terms)

... before the cbind call. Alternatively, you could have a common intercept but that seems less flexible.

ADD REPLY • link 9.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,
Thanks again for your help.
I understand the part with gp.site where the reference level is T1 to explain evolution in different
combinations treatment/time/site.
For example treatment effect can be computed by substracting the mean from all the "NT" coefficients (4 terms) to the mean from the 4 treated coefficients, isn't it ?
I also understand that the information to compute the interaction term "product:site" (=the sequence) is present in gp.site . To test the sequence effect, the idea is to test if the treatment effect is the same in the left site (mean of the 2 treated coefficients left site - mean of the 2 NT coefficients left site) compared to the right site.

As for the T1 convariate, I tried to do what you recommend but it is not possible
because expression vector T1 is different between genes... so we have a lot of T1 expression vector
and not only one... May be I do not understand exactly what you mean and I am wrong ?

Thanks

Eléonore

ADD REPLY • link 9.3 years ago eleonoregravier ▴ 70

0

Entering edit mode

Ah, sorry; I forgot that T1 was the gene expression. Well, that poses a problem, because you'll need different design matrices for each gene, and limma doesn't support that, regardless of whether you fit to the raw Ct's or to the evolution values. I don't see any way to get around this. I suppose you should start by seeing if you really need that level of complexity in your model. Just try fitting the design without T1 as a covariate, and if it gives sensible DE genes, then that's probably good enough; there's really no need to make life difficult for yourself.

ADD REPLY • link 9.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

OK for T1.

Are you OK with my computation of treatment effect and sequence effect in the previous comment ?

Thank very much for having spent time on my problem

Eleonore

ADD REPLY • link 9.3 years ago eleonoregravier ▴ 70

0

Entering edit mode

Yes, your calculations look fine. Keep in mind, though, that the final log-fold change won't be able to capture strange behaviour, e.g., if gene expression goes down in treated right vs. untreated right, but goes up in treated left vs untreated left. The reported change will just be the average effect of treated vs. untreated, which means that this oddness will not be picked up. In short, it's always worth checking that the individual contrasts are behaving like they should (i.e., check left/right treated vs. untreated at each time point separately) to make sure that the average log-fold change can be interpreted sensibly.