create design matrix when 2 groups of gene-expression data
5
0
Entering edit mode
mheydarpour ▴ 10
@mheydarpour-9430
Last seen 5.9 years ago

I have 1000 candidate genes expression data (FPKM) for heart-failure before surgery (A) and after surgery (B). I would like to find differential gene expression after surgery (B) for 100 patients in two groups (Yes=Hypertension=30 ; No_Hypertension=70). In the model I would like to adjust for gender,age, and gene_expression_before_surgery=A. How to define parameters in the design matrix?

Note: The dimension of Matrix(A)=Matrix(B)=1000 rows & 100 columns, but the values are different.

limma • 4.6k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 3 hours ago
The city by the bay

This seems like a job for a paired-sample analysis, if the before/after samples are from the same individual. If that's the case, then you don't really need to worry about sex or age, because I'd expect that neither parameter would change considerably for each patient after (heart-related) surgery. This means that blocking on the patient will implicitly block on sex and age. You could then make a design matrix that looks a bit like this:

design <- model.matrix(~0 + patient + hypertension:time) 
# ... where time is either "Before" or "After".
design <- design[,-grep("Before", colnames(design))] 
# ... to get to full column rank.

This should give you a design matrix with 100 patient-specific blocking factors and two coefficients to describe the DE due to surgery in patients with and without hypertension. Dropping either coefficient will test for DE in each group of patients, while comparing them to each other will identify genes where the effect of surgery is different between groups. To use this design, you'll have to cbind the data matrices together for analysis in limma.

ADD COMMENT
0
Entering edit mode
mheydarpour ▴ 10
@mheydarpour-9430
Last seen 5.9 years ago

Thanks Aaron for your great explanation. However, I need to know more how to define two matrix of Gene_Expression (A=before & B=after). this is not a variable "after" and "before" in the file. these are two GE matrix contain 1000 genes (Normalized_FPKM).

I wrote the following script for this analysis. as I said I want to find gene_expression digfferential for individuals who have hypertension and not-hypertension adjusted for GE profile before surgery and other covariates. Could you please have a look to the following script and correct it for this analyses or let me know how to correct it. Thank you!

library(limma)

GE_A=read.delim("GeneExpress_Before_Surgery_Normalized.txt", header=T,row.names=1)

GE_B=read.delim("GeneExpress_After_Surgery_Normalized.txt", header=T,row.names=1)

pheno=read.delim("clinicalDB.txt", header=T)
Hypertension=factor(pheno$Hypertension, levels=c("Yes","No"))
design=model.matrix(~age+sex+GE_A+Hypertension)
pheno=pheno[which(pheno$SampleID %in% names(GE_B)),]
cbind(as.data.frame(names(GE_B)), pheno$SampleID)
fit <- lmFit(GE_B,design)
fit.de <- eBayes(fit, robust=TRUE)
topTablefit.de, coef=ncol(design), n=Inf, sort.by="P")

ADD COMMENT
0
Entering edit mode

You make the before/after variable yourself:

GE_all <- cbind(GE_B, GE_A)
time <- rep(c("Before", "After"), c(ncol(GE_B), ncol(GE_A)))

Your current design won't work because you're putting GE_A in design. limma requires that the same design matrix be used for all genes - this means that you can't use gene-specific expression as a covariate.

ADD REPLY
0
Entering edit mode
mheydarpour ▴ 10
@mheydarpour-9430
Last seen 5.9 years ago

Hi Aaron, I have 2 more questions:

Q1) I am interested in seeing the gene_expression differential after surgery (GE_B) not before surgery (GE_A). If I am not able use GE_A as a covariate, how to adjust GE level of GE_B in the model. please explain

Q2) you recommended the following design: 

design <- model.matrix(~0 + patient + hypertension:time) 

why I have to put patient, and interaction effect between time&hypertension in the design matrix?

ADD COMMENT
0
Entering edit mode

"Differential expression after surgery"; between what groups?

ADD REPLY
0
Entering edit mode

Basically, you want to "adjust" the expression of each AFTER sample based on the expression of the matching BEFORE sample, and then compare the adjusted expression between groups with and without hypertension. If this is correct, then the strategy I've suggested will do what you want. In particular, you'll need to set up contrasts.fit to compare the last two coefficients of design. Each coefficient represents the AFTER/BEFORE log-fold change within each group, which is effectively equivalent to the BEFORE-adjusted expression that you're trying to compute. Thus, testing for differences in those coefficients between the two groups will be the same as testing for differences in adjusted expression.

ADD REPLY
0
Entering edit mode

Yes Aaron, this is exactly which I want to do. Could you please advise who to set up the design and contrasts.fit based on structure of previouse script which I sent you. Thanks

ADD REPLY
0
Entering edit mode

I think there's enough information in my previous posts for you to try to do it yourself. I'm happy to have a look at your script, but I'm not going to set it up for you from scratch.

ADD REPLY
0
Entering edit mode

Hi Aaron, Could you please give me a hint.

If I used this design .....> design=model.matrix(~0+patient+hypertension:time) which time would be GE data before & after surgery. Then how to set up the contrast.fit to compare the two coeficients before & after. Thanks

ADD REPLY
0
Entering edit mode

Try something like:

con <- makeContrasts(hypertensionYes_timeAfter - hypertensionNo_timeAfter, 
                     levels=design)

... after replacing the colons with underscores in the column names of design.

ADD REPLY
0
Entering edit mode

Hi Aaron,

Regarding your comments, I wrote the following script for above analyses:

library(limma)
GE_A=read.delim("GeneExpress_Before_Normalized.txt", header=T,row.names=1)
GE_B=read.delim("GeneExpress_After_Normalized.txt", header=T,row.names=1)
GE_all <- cbind(GE_A, GE_B)
time <- rep(c("Before", "After"), c(ncol(GE_A), ncol(GE_B)))
pheno=read.delim("clinicalDB.txt", header=T)
Hypertension=factor(pheno$Hypertension, levels=c("Yes","No"))
patient=factor(pheno$SampleID)
Hypertension_time <- factor(paste(pheno$Hypertension,time))
pheno=pheno[which(pheno$SampleID %in% names(GE_all)),]
cbind(as.data.frame(names(GE_all)), pheno$SampleID)
design=model.matrix(~0+patient+Hypertension_time)
fit <- lmFit(GE_all,design)
con <- makeContrasts(HypertensionYes_timeAfter - HypertensionNo_timeAfter, levels=design)
fit2 <- contrasts.fit(fit,con)
fit2 <- eBayes(fit2)
topTable(fit2, coef=ncol(design), n=Inf, sort.by="P")

 

but after run the "design", it gave me the following error:

Error in model.frame.default(object, data, xlev = xlev) : 
  variable lengths differ (found for 'Hypertension_time

What is wrong in this script? How to fix this? Please advise. Thanks.

ADD REPLY
0
Entering edit mode

You need to double patient because you've combined the two matrices together:

patient <- factor(rep(pheno$SampleID, 2))
ADD REPLY
0
Entering edit mode

Hi Aaron, Thanks for correcting my script. I run it and I got the following error:

Coefficients not estimable: Hypertension_timeYes Before 
Warning message:
Partial NA coefficients for 18258 probe(s) 

ADD REPLY
0
Entering edit mode

Read my original post where I remove coefficients to obtain full column rank.

ADD REPLY
0
Entering edit mode

I used your comment for full rank matrix and modified my script as:

design=model.matrix(~0+patient+Hypertension_time)
design <- design[,-grep("Before", colnames(design))]

fit <- lmFit(GE_all,design)

con <- makeContrasts(Hypertension_timeYes.After - Hypertension_timeNo.After, levels=design)

But, this error came up: Error in eval(expr, envir, enclos) : 
  object 'Hypertension_timeNo.After' not found

 

ADD REPLY
0
Entering edit mode

You need to adjust the names you give to makeContrasts to match the column names of design. I suggest you read some more limma documentation, you'll make a lot more progress if you can work it out yourself.

ADD REPLY
0
Entering edit mode

OK, I will do that. However, I have "Hypertension_timeYes.After" in my column names of design but there is no "Hypertension_timeNo.After". I don't know what is wrong?

ADD REPLY
0
Entering edit mode

You should follow my original instructions, and construct design with an interaction term rather than with Hypertension_time. If you use the latter, you'll need to refactor it so that the first level is not an After term.

ADD REPLY
0
Entering edit mode

But a couple days ago, you mentioned that:

... after replacing the colons with underscores in the column names of design

 

ADD REPLY
0
Entering edit mode
colnames(design) <- sub(":", "_", colnames(design))
ADD REPLY
0
Entering edit mode

Thank you very much Aaron for your time and great comments!

ADD REPLY
0
Entering edit mode

Hi Aaron, I have 2 other variables [time_of_surgery(min), and Center_of_surgery] which I would like put in the model for adjusting. Time of surgery is a continuous variable and Center is bionomial variable. How to define these 2 variables in the design model. The last design model which I used was:

design <- model.matrix(~0 + patient + hypertension:time)
ADD REPLY
0
Entering edit mode

It seems like those two variables would be the same for both before/after samples in any given patient. If that's the case, blocking on those variables will have no additional effect beyond blocking on patient. The only way you could include them would be as part of the interaction term; this would effectively define a separate before/after log-fold change for each combination of surgery center + hypertension group + time of surgery. I'm loathe to recommend this approach as it is quite complicated to interpret and will require some pruning to maintain full column rank. It will also reduce the effective number of samples used to test for DE, which results in a decrease in power. I would only do this if I were very sure that the sequencing center and time of surgery had an effect on the treatment response within each patient (effects on the absolute expression are already handled by blocking on patient).

ADD REPLY
0
Entering edit mode

Yes, these 2 variables are different between patients and particularly different between 2 groups (e.g.time of surgery is much lass in the group of no-hypertension compare to hypertension) or center_of_surgery is different between these groups (e.g. patients who had hypertension had heart surgery in 2 different center). That's why I would like to adjust the effect of these 2 variables in the model. What design matrix are you recommend? Thanks

ADD REPLY
0
Entering edit mode

You misunderstand me. For any given patient, I would expect that each of the two variables would be the same between the before and after samples. For example, it doesn't make sense for a single patient to have a "before" sample at one surgery center and an "after" sample at another surgery center. 

Now, if the variables are the same within each patient, it doesn't matter that they're different between patients, because that gets modelled by blocking on patient anyway. So, including these variables as additive terms in the model would make no difference. Thus, your choices are:

  • Ignore those terms, as any additive effects from them will be included in patient and removed automatically in the analysis.
  • Include those terms as part of the interaction term in model.matrix. Not recommended unless you really know what you're doing.
ADD REPLY
0
Entering edit mode

Thank you very much Aaron for the explanation.

ADD REPLY
0
Entering edit mode

Hi Aaron,

I have a question regarding your above design model. How to see what would happen when GE_after surgery would adjust for

GE_before surgery for 2 groups of patients (Hypertension=Yes ; Hypertension=No). 

ADD REPLY
0
Entering edit mode
mheydarpour ▴ 10
@mheydarpour-9430
Last seen 5.9 years ago

Thanks Aaron for your time & great comments!

ADD COMMENT
0
Entering edit mode
mheydarpour ▴ 10
@mheydarpour-9430
Last seen 5.9 years ago

Hi Aaron,

I have a question regarding your below design model suggested. How to see what would happen when GE_after surgery would adjust for

GE_before surgery for 2 groups of patients (Hypertension=Yes ; Hypertension=No). How this model design works? please explain 

design <- model.matrix(~0 + patient + hypertension:time) 
ADD COMMENT
0
Entering edit mode

I don't think I can explain any more clearly than what I've already mentioned in the posts above. If you're still having problems, perhaps you should consult a local statistician or analyst.

ADD REPLY

Login before adding your answer.

Traffic: 556 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6