Question: Microarray expression analysis for on factor taking account confounders (covariables)
gravatar for michel655
3 months ago by
michel6550 wrote:

Hello everyone. I have a little question about how to design a matrix taking into account covariables.

Subject : I have got agilent microarray data from 12 samples (6 mild malaria sample noted "MM" and 6 cerebral malaria noted "CM"). Data relative to age, sexe, ethnie and leucocytes count are also proven for each sample. These later parameters (age, sex and leucocytes counts) are in categorical form using some thresholds that we have chosen to classify them.

Objective : I want get differentially expressed genes (DEGs) comparing "MM" and "CM" samples whils taking into account the effect of age, sexe and leucocytes count.

What i did

1. Creating a data frame (called "targets") containing all informations that i need. Columns SEXE, AGE and LEUC are factors with "0" and "1" for levels and column "PHENO" is factor with levels "MM" and "CM"

 NP15014.txt     CM    1   1    1
 NP15016.txt     CM    0   1    0
 TA15052.txt     CM    1   0    1
 HPIKNP01.txt    CM    1   0    0
 NP15013.txt     CM    1   1    1
 NP15015.txt     CM    0   1    0
 TA16037.txt     MM    1   0    1
 TA16036.txt     MM    1   0    1
 TA16035.txt     MM    0   0    1
 TA16031.txt     MM    0   1    1
 TA16006.txt     MM    1   0    0
 HPIKPS01.txt    MM    0   0    0

2. Creating desing by this commands

>  design <- model.matrix(~PHENO*AGE*SEXE*LEUC, data = targets)

>  Fit_design <- lmfit (my_ExpressionSet, design)     

## i got this errror message


As you can see, the fiting results are wrong for almost of the design matrix coefficients. So topTable results is also wrong.

My question

Using the "targets" table above, how i can create a design matrix with model.matrix which allow me to get DEGs between "CM" and "MM" (my  real goal) but taking into account age, sexe and leuc as covariables (confounders).

Any help will be really appreciable.

Dieureudieuf in advance.

ADD COMMENTlink modified 3 months ago by Aaron Lun19k • written 3 months ago by michel6550

Hello SamGG,

I have already perform a formula with + instead of * but it gives me only effect on each factors. It is not what i'm looking for. But thanks for your reply.

ADD REPLYlink written 3 months ago by michel6550
gravatar for SamGG
3 months ago by
SamGG150 wrote:

Hi ! Short answer: you are trying to estimate nearly as many parameters as samples. The * in the formula means that you are looking for interactions between the factors as well as well the factors independently. In order to estimate the factors without interactions, replace * by +. Long answer: I guess Aaron or Gordon will perform than me.

ADD COMMENTlink written 3 months ago by SamGG150
gravatar for Aaron Lun
3 months ago by
Aaron Lun19k
Cambridge, United Kingdom
Aaron Lun19k wrote:

You have two options. The first (and recommended) approach is to follow SamGG's suggestion, and replace the * with + in your model.matrix call. This assumes that the effects of the confounders on log-expression are additive. You can then test for the effect of PHENO by selecting the appropriate coefficient in topTable (probably coef=2). Your original post mentions that you want the MM vs CM comparison, so I don't understand what you mean in your second post when you say that it's not what you're looking for.

The second option is to concatenate some or all of the factors into a single grouping variable. In the most extreme case, let's say we concatenate all of the factors:

group <- with(targets, paste(PHENO, AGE, SEXE, LEUC, sep="."))
design <- model.matrix(~0 + group)

You can then perform comparisons between MM and CM at each combination of the confounding factors, using makeContrasts. This avoids the additivity assumption for all of the factors that were concatenated together. However, for this to work, you need many samples, and 12 probably isn't enough for the 4 factors that you want to model in your design.

P.S. Please respond to answers with the "Add comment" button, not "Add answer" unless you are answering your own question.

ADD COMMENTlink modified 3 months ago • written 3 months ago by Aaron Lun19k

Hello Aaron, thanks for your reply.

I was confused bay the fact that this formula "model.matrix( ~PHENO+SEXE+AGE+LEUC)" gives me a matrix with 4 coefficenients corresponding to the 4 variables a put in formula and nothing with interaction terms like PHENO:SEXE and so on. That's why i was tinking that the TopTable perform statististics taking account only to each coef of the matrix alone. But based on your explanation, i was wrong and the suggestion of SamGG was right.

So i did the first approach, model.matrix( ~PHENO+SEXE+AGE+LEUC),  and any of the 22620 genes i got i my expression set is significant when i perform topTable using "PHENO" as coefficient. This results is very different from the one i Obtain when i Perform ANOVA taking PHENO as facteur and SEXE, AGE and LEUC as confoundings.  Can you help me know i these two test give different results?

I trying the second approach you suggest and will come back.


ADD REPLYlink written 3 months ago by michel6550

Could you check the columns of the design by issuing colnames(design)? I am wondering whether "PHENO" is the correct way to name the column of interest.

ADD REPLYlink written 3 months ago by SamGG150

Hello SamGG, I changed the columns names, so "PHENOMM" is called just "PHENO" on my matrix (to recover the initial name on the targets table).

ADD REPLYlink written 3 months ago by michel6550

Are you saying that all of the genes are DE when you use an additive model? I'm surprised that topTable gave you any result at all, because there is no "PHENO" coefficient in this design matrix:

design <- model.matrix( ~PHENO+SEXE+AGE+LEUC, targets)
## [1] "(Intercept)" "PHENOMM" "SEXE" "AGE" "LEUC"

As SamGG suggests, you should drop "PHENOMM" instead, which should give you more sensible results. I can't speak for what you've done with the ANOVA, because you haven't shown any code for it.

ADD REPLYlink modified 3 months ago • written 3 months ago by Aaron Lun19k

Hello Aaron, as I said to SamGG, i changed my lomumn name and put "PHENO" instead of "PHENOMM" in my matrix.

I was trying to explain that any gene is upragulated with my topTable statistics up to 10% FDR. Anova test was perform not in R but with another tools called GeneAnova. The fact is either limma is not adequete to do what i'm trying to perform or all is perform is perfromed well and results are just negative.



ADD REPLYlink written 3 months ago by michel6550

Not sure what a "lomumn" is. I presume you meant "column"?

I also don't understand your second paragraph. Do you mean that all genes are upregulated? Or no gene is upregulated? "Any gene is upregulated" doesn't make sense in this context. 

In any case, I assure you that limma is perfectly adequate for analyzing this type of experiment. You'll have to provide a lot more evidence if you're suggesting that limma is doing the wrong thing here.

ADD REPLYlink written 3 months ago by Aaron Lun19k

Hello Aaron,

When i said "Any gene is upregulated" means "no gene is upregulated". Sorry. I perform the two approches you suggest me but it still seem that something wrong on my data and, by consequence, don't permit Limma to perform good statistics because my results are all negative in the first approach and all positive in the second.

Now, I'm a little blocked cause i don't have an idea to contunue.

May be, I can provide to you my Elisraw data (the processed dataset with which a fit (lmFit) my design matrix and compare. This can help me to see my error ? 

ADD REPLYlink written 3 months ago by michel6550

The full code of your analysis would be sufficient to determine what you have done.

Also, I don't understand what you mean by "all negative" and "all positive". Are you saying the log-fold changes are all negative or positive? The coefficients? Or do you mean "negative" and "positive" in the sense of results being (non-)significant?

ADD REPLYlink modified 3 months ago • written 3 months ago by Aaron Lun19k

Hello Aaron,

"All negatives" and "all positives" refer to non-significant and significant genes.

This is my code :

show (target)

 NP15014.txt    1    1   1    1
 NP15016.txt    1    0   1    0
 TA15052.txt    1    1   0    1
 HPIKNP01.txt   1    1   0    0
 NP15013.txt    1    1   1    1
 NP15015.txt    1    0   1    0
 TA16037.txt    0    1   0    1
 TA16036.txt    0    1   0    1
 TA16035.txt    0    0   0    1
 TA16031.txt    0    0   1    1
 TA16006.txt    0    1   0    0
 HPIKPS01.txt   0    0   0    0
Design_1 <- model.matrix (~PHENO + SEXE + AGE + LEUC, targets)
Fit_design_1 <- lmFit (RD3, Design_1)
eBayes_Design_1 <- eBayes (Fit_Design_1, trend = TRUE)
Results_Design_1 <- topTable (eBayes_Design_1, coef = "PHENO", number = 22620, adjust.method ="fdr")

Here is a summary of Results_Design_1 :

GeneName         t      P.Value   adj.P.Val
36353         DEFA8P 11.421350 1.389214e-07 0.003142402
61496          S100P  7.066590 1.730988e-05 0.163580684
53844           SLPI  6.668216 2.974813e-05 0.163580684
58513          PRRT4  6.149016 6.215623e-05 0.163580684
16092   LOC101927686  5.923916 8.652973e-05 0.163580684
51737 XLOC_l2_001496  5.797909 1.044543e-04 0.163580684
23107          INHBA  5.691113 1.227376e-04 0.163580684
58476            MPO  5.689887 1.229662e-04 0.163580684
5872           ELANE  5.663078 1.280801e-04 0.163580684
28122           CD24  5.624560 1.358263e-04 0.163580684




ADD REPLYlink written 3 months ago by michel6550

Looks fine to me (assuming you've normalized your data). You have one significantly DE gene at a FDR of 5%, i.e., DEFA8P. That's pretty clear to me.

Are you trying to compare the different methods to each other? I'm not going to comment on what GeneAnova does, but I will point out that limma is the standard tool for DE analyses of microarray data. The 2004 paper in SAGMB has over 9000 citations, according to Google Scholar. Your limma code looks correct, so I would conclude that you have only 1 DE gene.

ADD REPLYlink written 3 months ago by Aaron Lun19k

Hello Aaron,

I was waitng for more gene that's why i was thought my codes was wrong. So, i keep this result.Thanks for helping me.

I have another issue in my limma analysis. I want to know how to transfrom a dataframe to a Elist object of limma package in order to perform the same analysis the one above. In other words, i want to start my limma analysis, the lmFit step precisely, with a data frame already processed (already log transformed, normalysed and filtered)?  I try lot of things but it was not working. Do you have an Idea.


ADD REPLYlink written 3 months ago by michel6550

lmFit can accept data.frame objects directly:

design <- model.matrix(~c(0,1,2))
a <- data.frame(A=rnorm(10), B=rnorm(10), C=rnorm(10))
y <- lmFit(a, design)

... so there is no need to transform it beforehand. The only requirement is that all of your columns contain numeric fields, otherwise the as.matrix coercion will not produce a numeric matrix.

ADD REPLYlink modified 3 months ago • written 3 months ago by Aaron Lun19k

Hello, oh great. i was trying to coerce instead of use my dataframe directely.

So thanks for all your disponibility Aaron.

ADD REPLYlink written 3 months ago by michel6550

No doubt limma is perfect. GeneAnova does seem to compute a standard ANOVA and does not seem to manage false discovery rate. IMHO, this makes a big difference.

You can increase the FDR threshold up to 20%, but there is chance that the differences in your experiment are not related to the PHENO parameter, but to the others.

ADD REPLYlink written 3 months ago by SamGG150

Hello SamGG,

for gene anaova statistics, i'm sorry i forgot to mention that Pvalue obtain with the logiciel are use to calculate the limite Pvalue for different FDR. So, our DEGs in GeneAnova are filtered according to FDR 10%.

ADD REPLYlink written 3 months ago by michel6550
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 207 users visited in the last hour