Question

Help to understand my contrast.matrix and to interpret my topTable results

0

Entering edit mode

stinehedensted • 0

@stinehedensted-14017

Last seen 7.5 years ago

Denmark

Hi

I am trying to analyse a microarray dataset from GEO.

The data is already normalized and log2 transformed. I have 30 paired patient samples of benign prostate epithelia and stroma tissue. The data is grouped like this:

Patient	Treatment
1	B
1	sB
2	B
2	sB
3	B
3	sB

....

I have big trouble finding out if how to design my contrast matrix. I want to identify differentially expressed genes between epithelia and stroma. I have read the examples in the Limma User Guide, p 41, but my design matrix is more complicated than the expamles in the User Guide. Also I am confused about the formulation of substrating in the contrast matrix, when you want to find differentially expressed gene, like the example below:

contrast.matrix <- makeContrasts(Benign_epithelvsBenign_stroma = Benign_epithel - Benign_stroma, levels = design)

1) Is this the coding for devision of the two expression means, that gives the logFC?

1) Can you tell me if my contrast matrix is correct?

2) What is the point of using "~" in the design matrix? Or said in an other way, what is the point of not wanting an intercept?

3) I have seen examples of people adding "coef=1" to their topTable function. What does this mean?

4) what does the "intercept" column in the design matrix mean?

Kind regards, Stine

Here is all my coding:

> Array_benigntissue <- read.table("30sB_30B_samplesPAIRED.csv", header = T, sep = ";", stringsAsFactors = T, dec = ",", skip = 1)

> #Convert dataframe into matrix:
> Array_benigntissue2 <- as.matrix(sapply(Array_benigntissue,as.numeric))

> Array_benigntissue2 <- Array_benigntissue[,-1]

> rownames(Array_benigntissue2) <- Array_benigntissue[,1]

> #Construct design matrix
> pairing <- factor(rep(1:30, each = 2))

> tissue <- factor(rep(c("B", "sB"), 30))

> design <- model.matrix(~pairing+tissue)

> design
(Intercept) pairing2 pairing3 pairing4 pairing5 pairing6 pairing7 pairing8 pairing9
1 1 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0
3 1 1 0 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0 0
5 1 0 1 0 0 0 0 0 0
6 1 0 1 0 0 0 0 0 0
7 1 0 0 1 0 0 0 0 0
8 1 0 0 1 0 0 0 0 0
9 1 0 0 0 1 0 0 0 0

.....
pairing28 pairing29 pairing30 tissuesB
1 0 0 0 0
2 0 0 0 1
3 0 0 0 0
4 0 0 0 1
5 0 0 0 0
6 0 0 0 1
7 0 0 0 0
8 0 0 0 1
9 0 0 0 0
10 0 0 0 1
11 0 0 0 0
12 0 0 0 1
13 0 0 0 0
14 0 0 0 1
15 0 0 0 0
16 0 0 0 1
17 0 0 0 0
18 0 0 0 1
19 0 0 0 0
20 0 0 0 1
21 0 0 0 0
22 0 0 0 1
23 0 0 0 0
24 0 0 0 1
25 0 0 0 0
26 0 0 0 1
27 0 0 0 0
28 0 0 0 1
29 0 0 0 0
30 0 0 0 1
31 0 0 0 0
32 0 0 0 1

attr(,"assign")
[1] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$pairing
[1] "contr.treatment"

attr(,"contrasts")$tissue
[1] "contr.treatment"

> fit = lmFit(Array_benigntissue2[,4:63], design)

contrast matrix toptable makecontrasts differential gene expression • 2.7k views

ADD COMMENT • link 7.5 years ago stinehedensted • 0

0

Entering edit mode

It seems your question relates to the limma package, so you should include the limma tag on your post.

ADD REPLY • link 7.5 years ago Aaron Lun ★ 28k

score 2 · Answer 1 · 2017-09-28

This isn't really a "complicated" design, and most of the answers are available in the packages' help pages. As with your last question, everything is very dependent on getting the right mapping between the columns of csv file you're reading in, and the 'rows' of the factors you're putting into your design; the fact that the output when you look at the design consists of only 32 rows is a bit concerning, when you're putting a 60-column expression matrix into lmFit. But assuming that the orders do correspond, then your code looks ok. There are probably better ways of coding this (using a block design to capture the patient variability), but I'd hesitate to recommend them if you don't understand the idea of intercepts yet.

On point (2), you should read the "~" sign as 'depends upon', as in 'expression depends on patient and tissue' - it has nothing to do with the intercept. You may be confusing it with the appearance of '-1' (or equivalently '+0') on the right hand side of the model. It's roughly a choice of how to parametrise your model. If for example you had just tissue as the independent term, then you could estimate average B and average sB expression, or you could estimate average B expression, and a log-fold-change of sB compared to B; the absence or presence of an intercept term would make the difference. Consult a local statistician for a fuller explanation - different parametrisations will strongly influence how you interpret the coefficients that are returned by the fitting procedure.

On point (3), it says in the help("topTable") what the purpose of this is, and if you don't fully understand, then proceed with the analysis with utmost caution. It's telling limma the comparison for which you want to pick differential genes. In your case, neither coef=1 nor coef=NULL (the default) is the correct option, so if you're still struggling to understand after reading the help page, and re-reading the very good limma manual, then come back here.

The intercept column for your example is the parametrisation of the baseline condition (pairing1, tissueB in your case) - these things are implicitly chosen (by the 'factor' function) alphabetically, unless you explicitly state otherwise, by use of e.g. 'relevel'.

score 0 · Answer 2 · 2017-10-02

Thank you for your response!

Actually is have 60 rows in my design matrix, R just prints a truncated version.

> dim(design)

[1] 60 31

> dim(Array_benigntissue2[,4:63])
[1] 33297 60

I have tried to look into your suggestion of using a blocking design. I understand from the Limma User Guide, that blocking is only used when you want to correct for batch effect or if the experiment has been conducted in blocks. The data I am using have already been corrected for batch effect. I do not understand what the authers mean by conducted in block.

Am I correct, when I say, that I don't need to block anything?

After your explanation of "~" sign and the -1 and +0, I have chosen the following design of matrix, since this allows be to get a logFC expression later on:

pairing <- factor(rep(1:30, each = 2))
tissue <- factor(rep(c("B", "sB"), 30))

design <- model.matrix(~0+tissue+pairing)

fit = lmFit(Array_benigntissue2[,4:63], design)

contrast.matrix <- makeContrasts(BvssB = tissueB - tissuesB, levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

tablebenign <- topTable(fit2, adjust = "BH", number = 33297)

Sorry in advance, if this is all nonsense:

I still feel unsure, whether the samples beeing paried are considered in the analysis of differential expression. I think, I have constructed my design.matrix to take the parring into account, but I am not sure, if I have done the same for my contrast matrix... Put another way, I am not sure, if I am done with my analysis here, or if I still have to work on the coding?

My topTable-results look like this.

logFC AveExpr t P.Value adj.P.Val B
8030753 3.021267e+00 8.702636 1.898669e+01 2.991877e-19 9.962053e-15 33.5764336434
8030768 3.046219e+00 7.457127 1.785397e+01 1.889625e-18 3.145942e-14 31.8323290339
8082673 3.144625e+00 7.411396 1.678567e+01 1.175867e-17 1.086312e-13 30.0883060810
8062312 -3.311054e+00 8.745128 -1.666208e+01 1.461563e-17 1.086312e-13 29.8799630225
8137979 -1.198953e+00 8.726161 -1.654179e+01 1.808356e-17 1.086312e-13 29.6758474425
8176026 -3.325756e+00 7.325486 -1.649720e+01 1.957496e-17 1.086312e-13 29.5998311286
7944082 -3.683877e+00 6.850248 -1.627914e+01 2.891098e-17 1.368307e-13 29.2254345544

Best regards,

Stine