Question

Understanding and creating various comparisons with model.matrix in limma regarding microarrays

0

Entering edit mode

Konstantinos Yeles ▴ 80

@konstantinos-yeles-8961

Last seen 4 months ago

Italy

Dear Community,

I’m very new to the limma statistical methodology, and especially the construction and understanding of design matrix. In detail, I tried to interpret with some 'fake' data in R the example of multifactorial design in the example of 9.7 section page 48. Thus, I created a matrix below, and the 2 conditions like the example in the tutorial:

> set.seed(81)

> mat <- matrix(data=sample(1:100), nrow=10, ncol=10)

> head(mat)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

[1,]   16   40   20   84   13   90   72   82   52    14

[2,]   91    9   29   47   80   21   34   15   87    74

[3,]   68   18   67   46   17   94   63   93   92    70

[4,]    1   48   83   77   33    2   81   97    8    71

[5,]   78   36   86    6   58   73   49   85   61    38

[6,]   98   89   56   79   88   23   44   69   31    64

> pdat <- data.frame(Tissue=rep(c("A", "B"), times=5), Condition=rep(c("Diseased","Normal"),each=5))
> pdat
   Tissue Condition
1       A  Diseased
2       B  Diseased
3       A  Diseased
4       B  Diseased
5       A  Diseased
6       B    Normal
7       A    Normal
8       B    Normal
9       A    Normal
10      B    Normal
new.set <- new("ExpressionSet", exprs=mat)
> head(exprs(new.set))
   1  2  3  4  5  6  7  8  9 10
1 16 40 20 84 13 90 72 82 52 14
2 91  9 29 47 80 21 34 15 87 74
3 68 18 67 46 17 94 63 93 92 70
4  1 48 83 77 33  2 81 97  8 71
5 78 36 86  6 58 73 49 85 61 38
6 98 89 56 79 88 23 44 69 31 64
> pData(new.set) <-pdat
> head(pData(new.set))
  Tissue Condition
1      A  Diseased
2      B  Diseased
3      A  Diseased
4      B  Diseased
5      A  Diseased
6      B    Normal

> pairs <- factor(rep(1:5, each=2))
> Treat <- factor(paste(new.set$Condition, new.set$Tissue, sep="."))
> Treat
[1] Diseased.A Diseased.B Diseased.A Diseased.B Diseased.A Normal.B   Normal.A   Normal.B
[9] Normal.A   Normal.B
Levels: Diseased.A Diseased.B Normal.A Normal.B
> design <- model.matrix(~0 +Treat)
> colnames(design)
[1] "TreatDiseased.A" "TreatDiseased.B" "TreatNormal.A"   "TreatNormal.B"
> design
   TreatDiseased.A TreatDiseased.B TreatNormal.A TreatNormal.B
1                1               0             0             0
2                0               1             0             0
3                1               0             0             0
4                0               1             0             0
5                1               0             0             0
6                0               0             0             1
7                0               0             1             0
8                0               0             0             1
9                0               0             1             0
10               0               0             0             1
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$Treat
[1] "contr.treatment"

> colnames(design) <- levels(Treat)
> dupcor <- duplicateCorrelation(new.set, design, block=pairs)
> fit <- lmFit(new.set, design, block=pairs, correlation=dupcor$consensus)
> cm <- makeContrasts(Disease.A=Diseased.A-Normal.A , levels=design)
> fit2 <- contrasts.fit(fit, cm)
> fit3 <- eBayes(fit2, trend=TRUE)
> top2 <- topTable(fit3, coef=1, number=nrow(fit3), adjust.method="fdr", sort.by="none")
> head(top2)
       logFC AveExpr           t    P.Value adj.P.Val         B
1 -49.677974    48.3 -1.97576747 0.07300290 0.3650145 -4.562943
2   8.592188    48.7  0.30626035 0.76495236 0.8815297 -4.608752
3 -28.543152    62.8 -1.44092558 0.17662419 0.5887473 -4.581712
4   2.134082    50.1  0.06339996 0.95055105 0.9505510 -4.610196
5  17.663417    57.0  0.79132671 0.44497885 0.7111324 -4.600623
6  47.382520    64.1  2.68642944 0.02065864 0.2065864 -4.538251

So, my main questions are the following:

If my samples are paired(like the tutorial in the limma users guide), I would proceed as the example using duplicate correlation. Thus, what is the main difference, from including the pairs into the design matrix ??
If I would like to move by including the pairs into the design matrix(maybe not correct but to correctly understand the approach):

# alternative methodology

> f <- relevel(Treat, ref="Normal.A")
> design1 <- model.matrix(~0 + pairs + f)
> design1
   pairs1 pairs2 pairs3 pairs4 pairs5 fDiseased.A fDiseased.B fNormal.B
1       1      0      0      0      0           1           0         0
2       1      0      0      0      0           0           1         0
3       0      1      0      0      0           1           0         0
4       0      1      0      0      0           0           1         0
5       0      0      1      0      0           1           0         0
6       0      0      1      0      0           0           0         1
7       0      0      0      1      0           0           0         0
8       0      0      0      1      0           0           0         1
9       0      0      0      0      1           0           0         0
10      0      0      0      0      1           0           0         1
attr(,"assign")
[1] 1 1 1 1 1 2 2 2
attr(,"contrasts")
attr(,"contrasts")$pairs
[1] "contr.treatment"
attr(,"contrasts")$f
[1] "contr.treatment"
> fit <- lmFit(new.set, design1)
> fit2 <- eBayes(fit, trend=TRUE)

....

So, the first 5 coefficients named pairs are the blocking factors for each “object”—but what the coefficients fDiseasedA, fDiseasedB represent exactly?

I mean, for instance, fDiseasedA represents the average log2 fold change between groups Diseased A and Normal A ?

Moreover, why the level "Normal.B" is missing from the design1?

Or it depends on the first level of the f factor ? when I leave it as default, has:

> Treat
 [1] Diseased.A Diseased.B Diseased.A Diseased.B Diseased.A Normal.B   Normal.A   Normal.B 
 [9] Normal.A   Normal.B 
Levels: Diseased.A Diseased.B Normal.A Normal.B

Or I have to somehow relevel my Treat factor, in order to have the samed resulted log2 fold change for Diseased.A –Normal.A, as in the first defined difference in cm?

My final question is, if I want to have the same returned log2 fold change of Diseased.A-Normal.A with duplicate correlation & contrasts.fit from the first design matrix, how I should proceed with the second approach—that is when I include the pairs into the model.matrix(second design matrix)?

I can understand that the statistics would change, but the log2 fold change of the same compared groups would be the same, right?

Any explanations or feedback on these questions would be very helpful, as I'm a beginner in analyzing microarray data with limma, and I would like firstly to understand important aspects of formulating various comparisons.

Thank you,
Constantinos

limma microarray multifactorial design duplicatecorrelation model.matrix • 2.8k views

ADD COMMENT • link 8.4 years ago Konstantinos Yeles ▴ 80

0

Entering edit mode

Konstantinos Yeles ▴ 80

@konstantinos-yeles-8961

Last seen 4 months ago

Italy

Aaron,

thank you for your reply !! as i have some more questions about your answers, i will try to focus on the parts that confused me a bit in understanding them:

1) You mentioned about the cost of the function duplicateCorrelation--thus, the bigger the number of "pairs" the more robust the assumptions you talked about outliers ? Or this is irrelevant ??

2) Also, in your updated "corrected" design matrix, about the coefficients:

> rowSums(design[,7:8])
1  2  3  4  5  6  7  8  9 10 11 12
1  1  1  1  1  1  0  0  0  0  0  0
> rowSums(design[,1:3])
1  2  3  4  5  6  7  8  9 10 11 12
1  1  1  1  1  1  0  0  0  0  0  0

Why you mentioned that these coeffients(1:3 & 7:8) are unestimable ? In other words, this is explained only from the rowSums functions or i should expect it somehow ? And for instance, instead of droping the 7th coefficient, i could similarly drop the first or 3rd ?

But, why when i used your updated code for the annotation data frame (pdat) & i also changed the initial matrix( set.seed(81); mat <- matrix(data=sample(1:100), nrow=10, ncol=12) in order to have two more columns to match the number of rows of pdat---but then, when

i typed

> fit <- lmFit(new.set, design)
Coefficients not estimable: fDiseased.B 
Warning message:
Partial NA coefficients for 10 probe(s)

Why it returns only for the fDiseased.B, and not also for the other coefficients ? Or I'm missing something here ?

3) Moreover, in your final explanation, you said "The first 6 coefficients are blocking factors that represent the expression of the "A" sample in each pairing"---but they represent A, because it is the first level of the Tissue factor ? or for another reason ?

4) Finally, again from the last part of your explanation (and please correct me if I interpret something wrong): the last two coefficients, represent the log2 fold change between B vs A tissue either for Diseased subjects, or Normal ones, because the different Disease states describe in either case both tissues(for instance either Diseased A & B, or Normal A & B) ? and thus, we cant inspect the Disease effect, but instead "blocked" it ? And for instance, if instead of the Disease state, the Tissue was nested between the different pairs, then the reverse case would applied ? that is, test for the tissue state, right ??

Thanks,

Constantinos

ADD COMMENT • link 8.4 years ago Konstantinos Yeles ▴ 80

0

Entering edit mode

The more pairs, the better (i.e., more accurate) the estimate of the correlation will be.
You can diagnose linear dependencies as qr(design)$rank!=ncol(design). If this is true, then you need to drop some coefficients. Which ones you drop will affect the interpretation of the coefficients, so there is no one-size-fits-all solution. As for the warning, make sure you eliminate the linear dependencies in the design before you run lmFit.
Yes. But it's always a good idea to examine the design matrix to make sure your interpretation is correct. In this case, the only non-zero coefficient in the row of the design matrix corresponding to each A sample is that of the blocking factor; thus, the blocking factor represents the expression of the A sample in each pair (after you drop column 7, though).
If the pairs were nested in tissue (i.e., each pair consists of diseased/control samples for a single tissue) then you could test for the disease effect within each pair, but wouldn't be able to test for the tissue effect, i.e., A vs B between pairs.

ADD REPLY • link 8.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you again for your quick reply Aaron. One thing to insist for your answer-2, because from my opinion as a beginner seems the most “challenging” to interpret correctly. Why in my warning returns only for the coefficient fDiseased.B not estimable, and not also for the others, like the specific pairs ? Moreover, regarding which coefficient I should drop, is highly affected by which coefficients I’m interested in from my experimental design? Could for instance drop more of these inestimable coefficients, and keep for instance one of specific interest ? Or usually, I might firstly look of coefficients that are not “generally estimable” or don’t represent any interesting effect? Like the Normal.B ?

Last update: something from my initial question 2: why in the updated design matrix the reference level “Normal.A” is not appeared in any of the columns(even before dropping any coefficients) ??

ADD REPLY • link 8.4 years ago Konstantinos Yeles ▴ 80

0

Entering edit mode

Don't worry about the specifics of the warning. All you should take out of it is that you have inestimable coefficients. If you want to "automate" the dropping of coefficients, you could do something like:

QR <- qr(design)
design[,QR$pivot[seq_len(QR$rank)]]

... but that sometimes drops terms that makes the interpretation of the remaining coefficients a bit awkward. The best way is to look at the design matrix, figure out what each coefficient means, and decide which ones to drop in order to obtain full column rank (i.e., so that the number of columns is equal to QR$rank). I can't think of any general rule of thumb to make this easier, because it depends so much on the design involved, the DE comparisons you're interested in, etc. Simply put, there is no substitute for understanding exactly what you're doing with your design.

ADD REPLY • link 8.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you again for your explanation !! But regarding my updated question ??

(Last update: something from my initial question 2: why in the updated design matrix the reference level “Normal.A” is not appeared in any of the columns(even before dropping any coefficients) )

ADD REPLY • link 8.4 years ago Konstantinos Yeles ▴ 80

score 2 · Accepted Answer · 2015-12-18

For your first question; blocking on duplicateCorrelation allows you to do between-block comparisons. For example, if you have a paired design with patients before and after treatment, and you want to test for DE in the after samples between male and female patients, you can do that with the duplicateCorrelation approach. This is not possible when you block in the design matrix, as the blocking factors absorb any DE between samples in different blocks. However, the cost is that duplicateCorrelation makes some distributional assumptions about the block-specific effects. It works best when there are many levels of the blocking factor and the block effect is homogenously distributed across blocks (i.e., no outlier blocks).

For your second question; your pairings make no sense, because you have a pairing that contains both a normal sample (6) and a diseased sample (5). This isn't a minor complaint, because it affects the interpretation of all the coefficients in the design. You should have copied the corresponding example from the limma user's guide more accurately:

pdat <- data.frame(Tissue=rep(c("A", "B"), times=6), 
    Condition=rep(c("Diseased","Normal"),each=6))
Treat <- factor(paste(pdat$Condition, pdat$Tissue, sep="."))
pairs <- factor(rep(1:6, each=2))
f <- relevel(Treat, ref="Normal.A")
design <- model.matrix(~0 + pairs + f)

If you look at colnames(design), you should get:

[1] "pairs1"      "pairs2"      "pairs3"      "pairs4"      "pairs5"     
[6] "pairs6"      "fDiseased.A" "fDiseased.B" "fNormal.B"

If you look closely, you'll notice that rowSums(design[,7:8]) and rowSums(design[,1:3]) yield the same result. This is a linear dependency which means that coefficients 1, 2, 3, 7 and 8 are unestimable. To resolve this, you need to drop one of the coefficients - I'll drop the 7th coefficient here because it makes things easier later on:

design <- design[,-7]

This gives us the column names of:

[1] "pairs1"      "pairs2"      "pairs3"      "pairs4"      "pairs5"     
[6] "pairs6"      "fDiseased.B" "fNormal.B"

The first 6 coefficients are blocking factors that represent the expression of the "A" sample in each pairing. The last two coefficients represent the log-fold change of B over A for diseased or normal pairings. Dropping either or both of these guys will test for the tissue effect. You cannot test for the disease effect with this approach, because the pairings are nested within the disease state (so you've effectively blocked on the disease state directly). Rather, you should use duplicateCorrelation in the first approach; after all, the method was designed to handle this kind of DE comparison between different levels of the blocking factor.