Search
Question: Limma: Defining the design matrix of a linear mixed model
0
17 months ago by
j.baldauf0 wrote:

Dear all, I hope to get some elp in setting up my factors and the model matrix to determine differentially expressed genes.

I have a RNA-seq data set with samples of 13 genotypes and 3 developmental stages and of each sample, I have 4 biological replicates (in total 180 RNA-seq samples or columns in a count table). The data was sequenced in 9 lanes, which were distributed on 2 flowcells. I am especially interested in the genes differentially expressed between the different genotypes within a developmental stage (e.g. G1.I_vs_G2.I), but also within a specific genotype between the different developmental stages (e.g. G1.I_vs_G1.II).

The model I was thinking of looks like following: geno + dev + geno*dev : Flowcell + Flowcell:lane

With terms written in front of the colon as fixed effects and terms written after the colon as random terms.

I know there are controversial discussion about modeling effects concerning the sequencing. But as my data was generated from two different flowcells, I think it would be maybe wise to include those factors. So the first question is it possible to include those two factors (Flowcell + Flowcell:lane) Or do you have any recommendations about that?

The second point, I am not really sure how to define design matrix of my "treatments". Until so far, I would define the factors for genotype and developmental stage like this:

geno <- factor(c(1,1,1,1, 1,1,1,1, 1,1,1,1, 2,2,2,2, 2,2,2,2, 2,2,2,2, 3,3,3,3, 3,3,3,3, 3,3,3,3, 4,4,4,4, 4,4,4,4, 4,4,4,4, 5,5,5,5, 5,5,5,5, 5,5,5,5, 6,6,6,6, 6,6,6,6, 6,6,6,6, 7,7,7,7, 7,7,7,7, 7,7,7,7, 8,8,8,8, 8,8,8,8, 8,8,8,8, 9,9,9,9, 9,9,9,9, 9,9,9,9, 10,10,10,10, 10,10,10,10, 10,10,10,10, 11,11,11,11, 11,11,11,11, 11,11,11,11, 12,12,12,12, 12,12,12,12, 12,12,12,12, 13,13,13,13, 13,13,13,13, 13,13,13,13))
dev     <- factor(c(1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3))
With the following model.matrix:

design <- model.matrix(~0 + geno + dev + geno:dev)

When I understood it correctly, the determine the differently expressed genes between between the different genotypes within a developmental stage (e.g. G1.I_vs_G2:I)​, I would be interested in the contrast

contrast.matrix <- makeContrasts(G1.I_vs_G2.I="geno1.time1-geno2.time1"), levels=design)

However including this contrast.matrix into contrast.fit results in the following error: Error in eval(expr, envir, enclos) : object 'genotype1.time1' not found

Which is true, due to the fact that genotype1.time1 is not included in the design.matrix.

As a solution, I was thinking of defining a single group factor for geno*dev (13 genotypes x 3 stages =39 group factors). However, here I am a little bit confused and clueless if this group factor still represent my original model (geno + dev + geno*dev) and if it is a correct way to go?

group    <- factor(c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4, 5,5,5,5, 6,6,6,6, 7,7,7,7, 8,8,8,8, 9,9,9,9, 10,10,10,10, 11,11,11,11, 12,12,12,12, 13,13,13,13, 14,14,14,14, 15,15,15,15, 16,16,16,16, 17,17,17,17, 18,18,18,18, 19,19,19,19, 20,20,20,20, 21,21,21,21, 22,22,22,22, 23,23,23,23, 24,24,24,24, 25,25,25,25, 26,26,26,26, 27,27,27,27, 28,28,28,28, 29,29,29,29, 30,30,30,30, 31,31,31,31, 32,32,32,32, 33,33,33,33, 34,34,34,34, 35,35,35,35, 36,36,36,36, 37,37,37,37, 38,38,38,38, 39,39,39,39))

design <- model.matrix(~0 + group)

contrast.matrix <- makeContrasts(G1.I_vs_G2.I="group1-group4"), levels=design)

I would be very happy about any recommendations about how to define the factors of my experiment and which is the right design matrix. Thank you very much, Jutta

modified 16 months ago • written 17 months ago by j.baldauf0
1
17 months ago by
United States
James W. MacDonald46k wrote:

The second design is probably the best way to go. It doesn't represent the original model, as you are estimating different coefficients, but it's easier to understand, and you can make all the same comparisons (and you will get the same results). In other words, in the second model formulation you are just computing the mean expression for each genotype:development stage, and you can then make any comparison you might care about using contrasts.

That said, it's probably easier to define your design matrix using the actual genotype:development stage. Trying to remember what 'group1' represents is much more difficult than say, WT_stage1. So doing something like

genotype <- rep(c("WT","Geno1","Geno2",<other genotypes>), each = 12)
dev <- rep(c("stage1","stage2", <other stages>), each = 4)
genostage <- factor(paste(genotype, dev, sep = "_"))
design <- model.matrix(~0+genostage)

is a reasonable thing to do.

1

James is right; interaction models look cool and allow you to sound smart when you describe them to collaborators, but the group-based approach is easier to use and is almost always preferable if you're not a masochist.

As for the sequencing part of the question; you can use duplicateCorrelation to estimate any correlations between samples processed on the same lane. This is analogous to a mixed model, though you can only specify one blocking factor. I would be surprised, though, if the correlation was much different from zero; sequencing at the same place and time and by the same person tends to be quite reproducible. Besides, most data sets I get are multiplexed, which avoids problems with lane/flowcell effects.

Further to Aaron's point; the sequencing center we normally use would have barcoded all the samples and pooled prior to application to the flow cell. That way you sequence all the samples in each lane of each flow cell, and you can just ignore the lane and flow cell effects. If that is what your sequencing center did (and IMO they should), then you don't have to worry.

Thanks James and Aaron for your help and recommendations :)

To test weather there are any correlations between my samples processed on the same lane, how would I correctly define the blocking factor feeding to duplicateCorrelation?

What I tried is the following:

I defined in addition to the geno and stage factor, factors for flowcell and lane:

flowcell <- factor(c(1,1,1,1, 1,1,2,1, 1,2,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,2,1,1, 1,1,2,1, 1,1,1,1, 1,2,1,1, 1,2,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,2,1, 1,1,1,2, 1,1,1,1, 2,1,1,1, 1,1,1,2, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,2, 2,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,2,1, 1,1,2,1, 1,1,1,1, 2,1,1,1, 1,2,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1))
lane     <- factor(c(4,2,3,8, 2,1,7,8, 3,7,1,4, 4,7,7,5, 1,2,7,1, 3,6,6,4, 4,3,2,8, 8,7,2,1, 4,3,7,1, 8,7,4,5, 7,7,6,8, 6,7,4,5, 4,8,2,5, 2,1,6,8, 5,4,6,1, 2,7,4,3, 2,6,7,7, 4,6,3,7, 7,5,3,8, 7,7,1,8, 1,3,5,7, 7,2,3,5, 1,7,6,2, 1,5,6,3, 4,5,8,7, 8,7,6,7, 7,4,5,6, 8,2,5,4, 8,1,2,6, 1,4,5,6, 7,2,3,4, 2,6,7,7, 6,4,7,3, 7,5,3,8, 7,8,1,7, 5,7,3,1, 5,7,2,3, 6,1,7,2, 6,5,3,1))
design <- model.matrix(~0 + group + flowcell)

corfit  <- duplicateCorrelation(voom ,design, block=flowcell:lane)

the resulting corfit\$consensus.correlation = 0.01464837

Is this the right way to test? If so, is the correlation "not much different from zero" to neglect the model effects for the sequencing part? Whether my data was multiplexed I am not sure, I guess so, but I will asked the sequencing facility to be sure.

The flowcell:lane term is almost definitely not doing what you expect. If you want to block on the flowcell-lane combination, you should merge them into a single factor (just like you did with your groups of interest) and supply that as block. If you do this and you get a similarly small consensus value for the correlation, it suggests that there is not a big sequencing effect. But it probably wouldn't hurt to continue to block on this factor during the linear modelling.

0
17 months ago by
j.baldauf0 wrote:

when I include the blocking factor, I obtain different numbers of differentially expressed genes in comparison to not including the blocking factor? I merged the flowcell and lane factors as you suggested into one blocking factor.

genostageG1_stageI-genostageG2_stageI with blocking
-1                                        3073
0                                        21990
1                                         3530

genostageG1_stageI-genostageG2_stageI (model.matrix(~0 + genostage) ​)
-1                                        3036
0                                        22106
1                                         3451

How can I explain this, and which result should I trust?

Relatively speaking, that's a negligible difference. There is basically no effect of sequencing on the DE results.

0
16 months ago by
j.baldauf0 wrote:

Hi again,

I was wondering, how I would define the contrast to determine changes in expression over time between two genotypes. According to examples in the limma manual, I would define the contrast like that:

(genostageG1_stageII-genostageG1_stageI)-(genostageG2_stageII-genostageG2_stageI)

But then, I was wondering how I would interpret the results of the following contrast, and whether it makes sense it all?

(genostageG1_stageII-genostageG2_stageII)-(genostageG1_stageI-genostageG2_stageI)

Could you please shortly explain me the difference between these two contrast statements?