Question

Finding differentially expressed genes in a prospective cohort study

YC • 0

@974c4358

Last seen 3.3 years ago

Taiwan

Hi, I'm a beginner with microarray data. I want to use patients' baseline gene expression to predict their 3-month response to treatment. As I understand it, the first step is to find genes that are differentially expressed between responders and non-responders at baseline. I have found many studies that use the limma package to find differentially expressed genes.

Here is my code.

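library(limma)

# Combine response and month into a single group factor
Treat <- factor(paste(data$response, data$month, sep = "."))
Age <- data$age  # named Age rather than `factor` to avoid masking base::factor()
design <- model.matrix(~0 + Treat + Age)
# Estimate the within-patient correlation, then fit with patient as the block
corfit <- duplicateCorrelation(data, design, block = data$id)
fit <- lmFit(data, design, block = data$id, correlation = corfit$consensus.correlation)
# Baseline (month 0) comparison between the two response groups
cm <- makeContrasts(
  res1vsres0ForM0 = Treat1.0 - Treat0.0,
  levels = design)
fit2 <- contrasts.fit(fit, cm)
fit3 <- eBayes(fit2)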

After reading some tutorials, I think the formula of the linear model in my study will be Gene Expression at baseline = b0 + b1 Response + b2 Age (I want to adjust for age). However, it seems a little weird to predict baseline expression from the response 3 months later. Does this mean my study is not suitable for limma? If so, does it make sense to use logistic regression to find differentially expressed genes, i.e. Response = b0 + b1 Gene Expression at baseline + b2 Age?

I would appreciate it if you could give me some suggestions. Thank you.

limma MicroarrayData Microarray • 2.2k views

Posted 3.3 years ago by YC

score 0 · Answer 1 · 2022-06-22

Gordon Smyth 53k

@gordon-smyth

Last seen 4 hours ago

WEHI, Melbourne, Australia

From the very brief description you give, it appears that your study is a standard paired comparison with two groups. The study can be analysed by standard methods in limma without any problems.

It's hard to follow the linear model you have defined without knowing what the variables are, but the code doesn't match the formula you state. There is usually no need to adjust for age because the data is already paired. The baseline is not predicted by the 3-month response. Logistic regression is unnecessary and inappropriate. Why not do a standard paired analysis?

Answered 3.3 years ago by Gordon Smyth

Thank you for answering the questions and sorry for the confusion.

Here is my data frame. Response is the treatment response 3 months later. It seems to be a multi-level experiment.

No  Filename    ID  Response        Month   Age
1   F1.1.CEL    1   responder       0       32
2   F1.2.CEL    1   responder       3       32
3   F1.3.CEL    1   responder       6       32
5   F2.1.CEL    2   responder       0       19
6   F2.2.CEL    2   responder       3       19
7   F3.1.CEL    3   non-responder   0       28
8   F3.2.CEL    3   non-responder   3       28
9   F3.3.CEL    3   non-responder   6       28

What is the formula that matches the code? As LIMMA is a linear model, I thought the formula was like what I mentioned above. That's also why I thought the baseline expression is predicted by their 3-month data. Please correct me if I am wrong. Thank you!

Reply posted 3.3 years ago by YC

Do you have only 3 individuals?

What hypothesis are you trying to test? The standard analysis of this sort of experiment would be to test whether the responders have a different response (3mth vs 0 and 6mth vs 0) from the non-responders. Testing for baseline differences is not usually a focus.

Sure, it is a multilevel experiment but with only three individuals you don't have nearly enough data to estimate a random effect and adjust for age. If these are human patients, it would be very surprising to get any significant results at all from only three patients.
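
The comparison described above, whether responders change differently from non-responders between baseline and a later time point, can be written as a difference of differences. The following is a minimal sketch on simulated data, not the poster's actual analysis. It assumes the 1/0 response coding and the `Treat` naming from the original code; the `targets` table, the matrix `y`, and the contrast names `Diff3` and `Diff6` are hypothetical, and `duplicateCorrelation()` may need the statmod package installed.

```r
library(limma)

set.seed(1)
# Hypothetical targets table: 8 patients, each measured at months 0, 3 and 6.
# Response is coded 1 (responder) or 0 (non-responder).
targets <- data.frame(
  id       = rep(1:8, each = 3),
  response = rep(c(1, 0), each = 12),
  month    = rep(c(0, 3, 6), times = 8)
)

# Simulated log-expression matrix (100 genes x 24 arrays):
# a patient-specific effect plus independent noise, no true group differences.
patient_effect <- matrix(rnorm(100 * 8), nrow = 100)[, targets$id]
y <- patient_effect + matrix(rnorm(100 * nrow(targets)), nrow = 100)
rownames(y) <- paste0("gene", 1:100)

# One coefficient per response-by-month group
Treat  <- factor(paste(targets$response, targets$month, sep = "."))
design <- model.matrix(~0 + Treat)

# Patient is the blocking variable, as in the original code
corfit <- duplicateCorrelation(y, design, block = targets$id)
fit <- lmFit(y, design, block = targets$id,
             correlation = corfit$consensus.correlation)

# Change from baseline in responders minus change from baseline in non-responders
cm <- makeContrasts(
  Diff3 = (Treat1.3 - Treat1.0) - (Treat0.3 - Treat0.0),
  Diff6 = (Treat1.6 - Treat1.0) - (Treat0.6 - Treat0.0),
  levels = design
)
fit2 <- eBayes(contrasts.fit(fit, cm))
topTable(fit2, coef = "Diff3", number = 5)
```

Because the simulated data contain no real group effect, the top-ranked genes here are only noise; the point is the shape of the contrast matrix.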

Reply posted 3.3 years ago by Gordon Smyth

Thank you for your detailed explanation. I have 40 patients and yes, they are human patients. I want to know whether there are differentially expressed genes between responders and non-responders at baseline. My guess is that genetic differences at baseline may be responsible for their different treatment responses. May I ask whether there is a formula that matches the code?

Reply posted 3.3 years ago by YC

I am still somewhat confused because the contrast shown in your code doesn't match the values shown in your data.frame. Judging from the data.frame, your baseline contrast would be responder.0 - non-responder.0 rather than Treat1.0 - Treat0.0.

Have you simply renamed all the responses between the data.frame and the code?

Reply posted 3.3 years ago by Gordon Smyth

Oh! Yes! Sorry, it is my fault. I changed responder to 1 and non-responder to 0.
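
To make the recoding concrete, here is a small base R sketch of how the labels turn into design-matrix column names. The `Response` and `Month` vectors are copied from the data frame shown earlier; the object names `TreatLabel`, `label_cols` and `coded_cols` are hypothetical. As far as I know, `makeContrasts()` expects syntactically valid R names, which the hyphen in "non-responder" would violate, so recoding (or `make.names()`) is needed before writing the contrast.

```r
# Labels as they appear in the posted data frame
Response <- c("responder", "responder", "responder", "responder", "responder",
              "non-responder", "non-responder", "non-responder")
Month <- c(0, 3, 6, 0, 3, 0, 3, 6)

# Using the original labels: column names contain a hyphen
TreatLabel <- factor(paste(Response, Month, sep = "."))
label_cols <- colnames(model.matrix(~0 + TreatLabel))
print(label_cols)

# Recoding responder -> 1 and non-responder -> 0 gives the names used in the contrast
Treat <- factor(paste(ifelse(Response == "responder", 1, 0), Month, sep = "."))
coded_cols <- colnames(model.matrix(~0 + Treat))
print(coded_cols)

# Check which sets of names are syntactically valid in R
all(label_cols == make.names(label_cols))
all(coded_cols == make.names(coded_cols))
```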

Reply posted 3.3 years ago by YC

OK, now that I understand your data better and I see what the variables mean, your original analysis is fine. It is more or less a standard multilevel analysis. You wouldn't need to adjust for Age to estimate the treatment responses but it might help somewhat with the baseline comparison.

My other comments remain. Your interpretation of the linear model as a formula is not correct and the baseline is not being predicted by the 3-month values. It all seems fine, I don't see any problems.
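
One way to write down what the posted code appears to fit, as a sketch with notation introduced here for illustration rather than taken from the thread: for gene $g$ and array $j$, each response-by-month group $k$ gets its own mean $\beta_{gk}$, Age enters as a covariate, and arrays from the same patient share a common correlation $\rho$ estimated by `duplicateCorrelation`.

```latex
y_{gj} = \sum_{k} \beta_{gk}\,\mathbb{1}[\mathrm{Treat}_j = k] + \gamma_g\,\mathrm{Age}_j + \varepsilon_{gj},
\qquad
\operatorname{cor}(\varepsilon_{gj}, \varepsilon_{gj'}) = \rho
\ \text{ if arrays } j \text{ and } j' \text{ come from the same patient}

% Baseline contrast tested by res1vsres0ForM0:
\beta_{g,\,1.0} - \beta_{g,\,0.0}
```

In this form the response only labels which group an array belongs to. The contrast compares two group means at month 0, so nothing on the right-hand side uses the 3-month expression values to predict the baseline.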

Reply posted 3.3 years ago by Gordon Smyth

Thank you for your explanation and clarification. So, if I understand right, the purpose of building linear models in LIMMA is to find differentially expressed genes rather than use the linear model to predict something. However, the linear model can do both.

Reply posted 3.3 years ago by YC