"Blocking" in design matrix prior to limma DE analysis
@bhaktidwivedi-8895

Hi,

I have the following design. Three cell types, A, B, and C obtained from three subjects.

   subject  condition
    1   A
    1   B
    1   C
    2   A
    2   B
    2   C
    3   A
    3   B
    3   C

I would like to compare the conditions pairwise, with A as the primary reference: B vs A, C vs A, and C vs B. I am not interested in differences between the subjects and would like to adjust for them, even though the data (PCA plots, etc.) clearly show separation by condition and similarity among subjects. The data are RNA-seq counts, filtered for genes and normalized (TMM, followed by voom).

I am thinking of "blocking" using the design matrix:

design <- model.matrix(~subject+condition)

This generates only five columns: three for the subjects and two for the conditions. Where is the third condition? And is the intercept the first subject? Is this correct, or am I doing something wrong?

How should I define contrasts to detect genes differentially expressed in condition B vs condition A, condition C vs condition A, condition C vs condition B, and in any of the three conditions? Do I specify them as below?

DGE = DGEList(counts=exprdatafltd, group=metadata)
y <- calcNormFactors(DGE,method =c("TMM"))
v <- voom(y, design, plot=TRUE)
fit <- lmFit(v, design)
fit <- contrasts.fit(fit, coefficient=?) 
fit <- eBayes(fit)

Appreciate any help or suggestions! Thank you.

limma • 615 views
@james-w-macdonald-5106

A simple way to figure out what the coefficients are is to look at the rows, one by one. So your design matrix looks like

> model.matrix(~subject + condition, df)
  (Intercept) subject2 subject3 condition2 condition3
1           1        0        0          0          0
2           1        0        0          1          0
3           1        0        0          0          1
4           1        1        0          0          0
5           1        1        0          1          0
6           1        1        0          0          1
7           1        0        1          0          0
8           1        0        1          1          0
9           1        0        1          0          1

Right? Each row corresponds to one sample. The first sample (row) has only one 1, in the intercept column, and that sample is subject 1, condition A. So that's what the intercept represents. The next row has an additional 1 in the condition2 column, and that sample is subject 1, condition B. So we can infer that the condition2 coefficient is the difference between condition B and condition A for subject 1. Or you can do it algebraically:

Subj1_condB = Subj1_condA + X
#solve for X
X = Subj1_condB - Subj1_condA

Following that logic, the fifth column (condition3) is Subj1_condC - Subj1_condA. So what is column 2? It's the difference between Subj2_condA and Subj1_condA (you can do the algebra). And column 3 is Subj3_condA - Subj1_condA.
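The row-by-row reading above can be checked numerically with base R. This is only a sketch: the expression values are made up and noise-free, and the conditions are labelled A/B/C here, so the columns come out as conditionB and conditionC instead of condition2 and condition3. Because the model is additive, the same condition difference is assumed for every subject, which is why a single coefficient can describe it.

```r
# Toy layout matching the question: 3 subjects x 3 conditions.
df <- data.frame(
  subject   = factor(rep(1:3, each = 3)),
  condition = factor(rep(c("A", "B", "C"), times = 3))
)
design <- model.matrix(~ subject + condition, df)

# Hypothetical, noise-free values for one gene (assumed for illustration):
# baseline 10 (subject 1, condition A), subject offsets 0/1/2,
# condition offsets 0/3/5.
subject_offset   <- c(0, 1, 2)[as.integer(df$subject)]
condition_offset <- unname(c(A = 0, B = 3, C = 5)[as.character(df$condition)])
y <- 10 + subject_offset + condition_offset

# Least-squares fit of y on the design matrix.
coefs <- lm.fit(design, y)$coefficients
print(round(coefs, 10))
```

The intercept comes back as the subject 1, condition A value, the subject columns as the subject offsets relative to subject 1, and the condition columns as the B minus A and C minus A differences.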

Heuristically, you can think of it this way: you are making comparisons between conditions for subject 1, and using the data from subjects 2 and 3 by shifting them to the same level as subject 1 (by subtracting out the difference between subjects). Does that make sense?
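Given that interpretation, one possible way to set up the three pairwise comparisons from the question is sketched below. It assumes conditions labelled A/B/C (so the columns are conditionB and conditionC) and uses a simulated matrix `expr` as a stand-in for the voom output `v`; the contrast matrix is written by hand so that its rows line up with the design columns.

```r
library(limma)

subject   <- factor(rep(1:3, each = 3))
condition <- factor(rep(c("A", "B", "C"), times = 3))
design    <- model.matrix(~ subject + condition)

# Hand-written contrast matrix; rows follow colnames(design):
# (Intercept), subject2, subject3, conditionB, conditionC.
contr <- cbind(
  BvsA = c(0, 0, 0,  1, 0),
  CvsA = c(0, 0, 0,  0, 1),
  CvsB = c(0, 0, 0, -1, 1)
)
rownames(contr) <- colnames(design)

# Simulated placeholder data (100 genes x 9 samples), NOT real expression.
# With real data, pass the voom object instead: lmFit(v, design).
set.seed(1)
expr <- matrix(rnorm(100 * 9), nrow = 100)

fit  <- lmFit(expr, design)
fit2 <- eBayes(contrasts.fit(fit, contrasts = contr))

topTable(fit2, coef = "BvsA", number = 5)             # B vs A
topTable(fit2, coef = "CvsB", number = 5)             # C vs B
topTable(fit2, coef = c("BvsA", "CvsA"), number = 5)  # F-test: any difference
```

Since B vs A and C vs A are already coefficients of this design, they could also be read directly from the uncontrasted fit; only C vs B needs a contrast. The F-test uses two contrasts because the third is their difference and adds no new information.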


Got it. Thank you so much for the detailed explanation!
