Question

limma modeling, paired samples: first disease type disappears from design matrix

Riba Michela ▴ 90

Hi,
I'm writing again, this time about a paired-sample design.
The experiment involves 9 patients, 3 disease stages, and microarray expression data, as described in the following targets file:

library(limma)  # provides readTargets()
target <- readTargets("targetPT.txt")
head(target)
Genotype <- factor(target$Genotype)
Disease <- factor(target$Disease, levels = c("stageA", "stageB", "stageC"))

I have performed a paired samples analysis using

design <- model.matrix(~Genotype+Disease)

to find genes differentially expressed between, for example, stages A and B. However, looking at colnames(fit), I noticed that the first patient and the first disease stage (in alphabetical order) disappear from the fit.

I tried to use

design <- model.matrix(~0+Genotype+Disease)

to make the coefficient hidden in the intercept explicit, but then the first Disease stage disappears.

I tried again

design <- model.matrix(~0+Disease+Genotype)

and again the first patient in alphabetical order disappears

I do not have enough mathematical background to understand exactly which model fits my needs.

I would prefer this last model formula, because it makes all the disease stages explicit, and then use a contrast matrix to extract the genes differentially expressed between stages while accounting for the variability due to different patients.

In any case, I would like to ask what the best way to address this problem is, and what mistakes might lie behind it (e.g. I do not have all disease conditions for all 9 patients).

Thank you very much for your attention,

Michela

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_3.18.13

loaded via a namespace (and not attached):
[1] tools_3.0.2

ADD COMMENT • link updated 9.1 years ago by Gordon Smyth 50k • written 9.9 years ago by Riba Michela ▴ 90

Gordon Smyth · Answer 1 · 2014-06-09

James W. MacDonald 65k

Hi Riba,

The patient and disease don't disappear; they are absorbed into the intercept term. The model you are fitting is called a 'factor effects' model, and all the coefficients are interpreted as differences between a given sample type and the 'baseline', which in this case is the Stage A disease for Genotype 1.

In other words:

> colnames(design)
[1] "(Intercept)"   "DiseasestageB" "DiseasestageC" "Genotypept02" "Genotypept03" "Genotypept04"  "Genotypept06"
[8] "Genotypept09"  "Genotypept10"  "Genotypept13"  "Genotypept14"

DiseasestageB can be interpreted as Stage B - Stage A after controlling for the paired nature of your data. The DiseasestageC coefficient is interpreted analogously.
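
To make the "absorbed into the intercept" point concrete, here is a minimal sketch using a hypothetical, balanced toy layout (3 patients by 3 stages, not the poster's targets file; the patient labels are made up). It only prints the column names that the three formulas from the question produce.

```r
# Hypothetical toy layout: 3 patients x 3 disease stages (assumption, not the real data)
target <- expand.grid(Genotype = c("pt01", "pt02", "pt03"),
                      Disease  = c("stageA", "stageB", "stageC"),
                      stringsAsFactors = FALSE)
Genotype <- factor(target$Genotype)
Disease  <- factor(target$Disease, levels = c("stageA", "stageB", "stageC"))

# With an intercept, the first level of EACH factor is folded into "(Intercept)"
design1 <- model.matrix(~ Genotype + Disease)
colnames(design1)

# Without an intercept, only the FIRST factor in the formula keeps all its levels;
# the second factor still loses its first level
design2 <- model.matrix(~ 0 + Genotype + Disease)
design3 <- model.matrix(~ 0 + Disease + Genotype)
colnames(design2)
colnames(design3)

# All three parameterizations have the same number of columns and the same rank
c(ncol(design1), ncol(design2), ncol(design3))
```

Whichever formula is used, one level has to be dropped to keep the design matrix full rank, so the three designs describe the same fitted model and only the labels of the coefficients change.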

This is a basic concept of linear modeling, and if you are getting tripped up on the basics then I would highly recommend finding a local statistician who can help you.

Best,
Jim

--

James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

ADD COMMENT • link updated 9.1 years ago by Gordon Smyth 50k • written 9.9 years ago by James W. MacDonald 65k

Hi,
thanks for your quick answer, I get the point.

I had in fact already concluded that they were absorbed into the intercept, which is why I went on and put 0+ in the model.

In any case, I am certainly not a statistician, and I cannot get past this point.

I'm not yet convinced about the meaning of DiseasestageB, even if, following your explanation, it is exactly what I'm interested in. The point is that, biologically, the first patient is not a baseline. For this reason I would not choose a model in which both patient and disease are put together in the intercept. Disease stage A is a baseline for disease, but the first patient is not, as often happens in clinical settings.

In this light, could you suggest another formula to extract, in a paired-sample way (accounting for the fact that each patient has their own variability), the genes that differ significantly among the 3 disease stages?

e.g. Disease stage A vs B
Disease stage A vs C
Disease stage B vs C?

Because of the high inter-patient variability, it is very difficult to obtain results with an unpaired, Disease-only model.

Thank you very much for your patience; I hope you can give me some feedback.

Thanks a lot,

Michela

Dr. Michela Riba
Genome Function Unit
Center for Translational Genomics and Bioinformatics
San Raffaele Scientific Institute
Via Olgettina 58
20132 Milano
Italy

lab: +39 02 2643 9114
skype: mic_mir32
riba.michela@gmail.com
riba.michela@hsr.it

ADD REPLY • link updated 9.1 years ago by Gordon Smyth 50k • written 9.9 years ago by Riba Michela ▴ 90

Hi Riba,

> I'm not at the moment convinced about the meaning of DiseaseB even if
> actually following your indication is right what I'm interested in.
> The point is that biologically the first patient is not a baseline.
> For this reason I would not consider a model in which both patient and
> disease are put together in the intercept.
> Disease stageA is a baseline for disease but the first patient is not as
> often in clinical settings happens.

And this is exactly why I suggest you consult with a local statistician. What you have done is perfectly acceptable, but you don't understand enough to realize that.

Baseline in this context has nothing to do with any biological meaning for the term. Instead, it simply means that all other groups are compared to the baseline. You can use relevel() to change the baseline at will.
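
As a small illustration of changing the baseline, the sketch below uses a made-up six-sample `Disease` factor (an assumption for demonstration only) and shows how `relevel()` changes which stage the other coefficients are compared against.

```r
# Hypothetical factor with stageA as the first (baseline) level
Disease <- factor(c("stageA", "stageB", "stageC", "stageA", "stageB", "stageC"),
                  levels = c("stageA", "stageB", "stageC"))
colnames(model.matrix(~ Disease))    # coefficients are relative to stageA

# Make stageB the reference level; the remaining coefficients become A - B and C - B
Disease2 <- relevel(Disease, ref = "stageB")
levels(Disease2)
colnames(model.matrix(~ Disease2))   # coefficients are now relative to stageB
```

The fitted values are the same under either choice; only the comparisons that the individual coefficients represent change.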

And the parameterization you are using to account for pairs requires that one of the subjects be considered a baseline. This is algebraically identical to fitting a conventional paired t-test, and you will not be able to fit a model that accounts for pairs without absorbing one subject into a baseline.
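
The equivalence with a paired t-test can be checked numerically in the two-condition case. The sketch below uses simulated values for a single gene (all numbers and patient labels are invented, and the design is assumed balanced with every patient measured in both stages); it compares the t-statistic for the stage coefficient from `lm()` with the one from `t.test(..., paired = TRUE)`.

```r
set.seed(1)
n <- 6  # hypothetical number of patients, each measured at stageA and stageB

patient <- factor(rep(paste0("pt", 1:n), times = 2))
stage   <- factor(rep(c("stageA", "stageB"), each = n))

# Simulated expression: large patient-to-patient shifts plus a small stage effect
patient_effect <- rnorm(n, sd = 2)
y <- patient_effect[as.integer(patient)] +
  0.5 * (stage == "stageB") +
  rnorm(2 * n, sd = 0.3)

# Additive model with patient as a blocking factor (one patient is the baseline)
fit  <- lm(y ~ patient + stage)
t_lm <- coef(summary(fit))["stagestageB", "t value"]

# Conventional paired t-test on the same values (B minus A within each patient)
t_paired <- t.test(y[stage == "stageB"], y[stage == "stageA"],
                   paired = TRUE)$statistic

# The two statistics agree up to numerical tolerance, on n - 1 degrees of freedom
all.equal(unname(t_lm), unname(t_paired))
df.residual(fit)
```

Which patient ends up as the baseline does not affect the stage coefficient or its t-statistic; it only changes the values of the patient coefficients, which are nuisance terms here.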

Best,
Jim

ADD REPLY • link updated 9.1 years ago by Gordon Smyth 50k • written 9.9 years ago by James W. MacDonald 65k

Hi Michela,

Well, as Jim said, the coefficient "DiseasestageB" is B-A; accordingly, "DiseasestageC" is C-A.

To get B-C, you have to extract the contrast "DiseasestageB - DiseasestageC", which is (B-A) - (C-A) = B-C.

In this factor model you assume that the effects are additive, so "contrasting" two coefficients that are relative to the same genotype base level gives you the difference in mean expression explained by disease independent of genotype.
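
A minimal base-R sketch of this contrast arithmetic for a single simulated gene follows. The data, the 4-patient balanced layout, and the effect sizes are all assumptions for illustration; the contrast vector is built by hand rather than with limma so that the example is self-contained.

```r
set.seed(2)
# Hypothetical balanced layout: 4 patients, each measured at all 3 stages
Genotype <- factor(rep(c("pt01", "pt02", "pt03", "pt04"), times = 3))
Disease  <- factor(rep(c("stageA", "stageB", "stageC"), each = 4),
                   levels = c("stageA", "stageB", "stageC"))
design <- model.matrix(~ Genotype + Disease)

# Simulated expression of one gene with stage shifts of 0, 1 and 3
stage_shift <- c(stageA = 0, stageB = 1, stageC = 3)
y <- rnorm(12) + unname(stage_shift[as.character(Disease)])

# Least-squares fit; coefficients are named after the design columns
b <- lm.fit(design, y)$coefficients

# Contrast vector: +1 on DiseasestageB, -1 on DiseasestageC, 0 elsewhere
contrast <- setNames(rep(0, ncol(design)), colnames(design))
contrast["DiseasestageB"] <-  1
contrast["DiseasestageC"] <- -1

# (B - A) - (C - A) = B - C
b_minus_c <- sum(contrast * b)

# In this balanced layout it matches the plain difference of stage means
mean_diff <- mean(y[Disease == "stageB"]) - mean(y[Disease == "stageC"])
all.equal(b_minus_c, mean_diff)
```

In limma the same contrast would typically be built with `makeContrasts()` and applied to the fit with `contrasts.fit()` before `eBayes()`; the hand-built vector above is only meant to show what that contrast computes.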

I have sent you, in a separate mail, some teaching material of mine that explains this in more detail. (I will put it on GitHub soon.)

Best wishes,
Bernd

ADD REPLY • link updated 9.1 years ago by Gordon Smyth 50k • written 9.9 years ago by Bernd Klaus ▴ 610

Thanks a lot for the explanations. I'm pleased to go into more detail by following the mail and studying!

Thank you so much,

Michela

ADD REPLY • link updated 9.1 years ago by Gordon Smyth 50k • written 9.9 years ago by Riba Michela ▴ 90