Design matrix with intercept for multiple cluster comparison
1
0
Entering edit mode
kentfung • 0
@kentfung-17051
Last seen 2.3 years ago

I have posted a similar question before but it was poorly formulated so I'll try to make it better this time.

I have 258 samples and they are all settled in 26 different clusters, with some clusters with only 1 sample and some with over 40. The clusters are mutually exclusive. I will illustrate it with an example due to the large number of clusters and samples.

Let's say I have 12 samples in 4 blocks

> cluster <- as.factor(rep(c("A","B","C","D"), each = 3))
> design<-model.matrix(~cluster)


The design matrix will look like

> design
(Intercept) clusterB clusterC clusterD
1            1        0        0        0
2            1        0        0        0
3            1        0        0        0
4            1        1        0        0
5            1        1        0        0
6            1        1        0        0
7            1        0        1        0
8            1        0        1        0
9            1        0        1        0
10           1        0        0        1
11           1        0        0        1
12           1        0        0        1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$cluster [1] "contr.treatment"  If I want to compare A with B+C+D, do I just do this? > makeContrasts("Intercept-(clusterB+clusterC+clusterD)/3",levels = design) Contrasts Levels Intercept-(clusterB+clusterC+clusterD)/3 Intercept 1.0000000 clusterB -0.3333333 clusterC -0.3333333 clusterD -0.3333333  And for B vs A+C+D, do I just do this? > makeContrasts("clusterB-(Intercept+clusterC+clusterD)/3",levels = design) Contrasts Levels clusterB-(Intercept+clusterC+clusterD)/3 Intercept -0.3333333 clusterB 1.0000000 clusterC -0.3333333 clusterD -0.3333333  I need to use the intercept because mod0 argument in svaseq(), if not taking the intercept as the null model, won't work, but this make it kind of complicated for myself. limma sva • 449 views ADD COMMENT 3 Entering edit mode @james-w-macdonald-5106 Last seen 2 hours ago United States No, that's not how you do things. First, about your terminology. In general, 'blocks' sounds like some sort of batch-effect like thing that one would normally want to account for, but that is uninteresting. If you mean 'groups' or 'treatments' or 'phenotypes' or some interesting thing that one might care to know something about, then you should call them what they are. If these are blocks of samples that were processed separately, or for which you expect there to be technical differences, then it's rather uncommon for people to be testing for differences (you usually control for those differences, rather than test for them). Aaaanyway, If you have an intercept, and you are using the default treatment contrasts (which is what you are doing here), then the intercept is a 'baseline' group (in your case clusterA), and all the other coefficients are estimates of the difference between a given cluster and the baseline. So clusterB is actually clusterB - clusterA, and inherently is a contrast. So it makes no sense to do what you are doing. For example, your first contrast is actually clusterA - (clusterB - clusterA + clusterC - clusterA + clusterD - clusterA)/3, which is definitely something, but probably not what you intend. You are correct that svaseq needs a model with an intercept, but that has nothing to do with anything but running svaseq. The requirement for an intercept doesn't extend to the modeling step. So you can run svaseq with an intercept, and then add those surrogate variables to your design matrix, and then use a design matrix without an intercept to make any comparison you desire using makeContrasts. ADD COMMENT 0 Entering edit mode Thanks for correcting the terminology. I know very little about bioinformatics or statistics as you can tell. So it will be like, carry on from the example: > dat0 = as.matrix(raw_data$counts)  # this is the actual count matrix
> mod1 = design
> mod0 = cbind(mod1[,1])
> set.seed(12345)
> svseq = svaseq(dat0,mod1,mod0)
Number of significant surrogate variables is:  16
Iteration (out of 5 ):1  2  3  4  5

> new_design <- model.matrix(~0+cluster)
> new_design <- cbind(new_design,svseq$sv) > voom_exprs = voom(raw_data, new_design, normalize="none", plot = TRUE)  And the contrast matrix will be like > makeContrasts("clusterA-(clusterB+clusterC+clusterD)/3",levels = new_design)  Is this correct? ADD REPLY 0 Entering edit mode Assuming that you want to compare clusterA with the mean of clustersB-D, yes. ADD REPLY 0 Entering edit mode Hope this thread can still find you James. I have just realised that there's one more questions I have. The size of the cluster Is not even, which means it will be like: > cluster <- as.factor(rep(c("A","B","C","D"), c(3,4,5,3))) > cluster [1] A A A B B B B C C C C C D D D Levels: A B C D > design<-model.matrix(~0+cluster) > design clusterA clusterB clusterC clusterD 1 1 0 0 0 2 1 0 0 0 3 1 0 0 0 4 0 1 0 0 5 0 1 0 0 6 0 1 0 0 7 0 1 0 0 8 0 0 1 0 9 0 0 1 0 10 0 0 1 0 11 0 0 1 0 12 0 0 1 0 13 0 0 0 1 14 0 0 0 1 15 0 0 0 1 attr(,"assign") [1] 1 1 1 1 attr(,"contrasts") attr(,"contrasts")$cluster
[1] "contr.treatment"



If I am to do a B vs the entire population (the idea is what B is compare to the overall mean), should I use weighted mean in my contrast matrix instead? LIke:

> makeContrasts("clusterB-(3*clusterA+5*clusterC+3*clusterD)/(3+5+3)",levels = design)
Contrasts
Levels     clusterB-(3*clusterA+5*clusterC+3*clusterD)/(3+5+3)
clusterA                                          -0.2727273
clusterB                                           1.0000000
clusterC                                          -0.4545455
clusterD                                          -0.2727273

1
Entering edit mode

Probably not. While you could make the argument that the mean of the groups with smaller numbers of observations are less reliable, that's not a compelling argument. The mean is a pretty efficient statistic, so that is less of a concern than the variance, which is less efficient.

In the context of an ANOVA, where you are using a variance estimate based on (basically) the mean of the within-group variances, having widely different group sizes can be problematic because the smaller groups provide less reliable estimates, but the computation doesn't take that into account. So you would be 'fixing' the part that probably isn't in need of being fixed, and ignoring the part that is more likely to need help.

And if this were a thing that people did, you can rest assured that there would be something in the limma User's Guide detailing how it's done. But there are no such instructions, because it isn't a thing.