Dear all, I hope to get some elp in setting up my factors and the model matrix to determine differentially expressed genes.
I have a RNA-seq data set with samples of 13 genotypes and 3 developmental stages and of each sample, I have 4 biological replicates (in total 180 RNA-seq samples or columns in a count table). The data was sequenced in 9 lanes, which were distributed on 2 flowcells. I am especially interested in the genes differentially expressed between the different genotypes within a developmental stage (e.g. G1.I_vs_G2.I), but also within a specific genotype between the different developmental stages (e.g. G1.I_vs_G1.II).
The model I was thinking of looks like following: geno + dev + geno*dev : Flowcell + Flowcell:lane
With terms written in front of the colon as fixed effects and terms written after the colon as random terms.
I know there are controversial discussion about modeling effects concerning the sequencing. But as my data was generated from two different flowcells, I think it would be maybe wise to include those factors. So the first question is it possible to include those two factors (Flowcell + Flowcell:lane) Or do you have any recommendations about that?
The second point, I am not really sure how to define design matrix of my "treatments". Until so far, I would define the factors for genotype and developmental stage like this:
geno <- factor(c(1,1,1,1, 1,1,1,1, 1,1,1,1, 2,2,2,2, 2,2,2,2, 2,2,2,2, 3,3,3,3, 3,3,3,3, 3,3,3,3, 4,4,4,4, 4,4,4,4, 4,4,4,4, 5,5,5,5, 5,5,5,5, 5,5,5,5, 6,6,6,6, 6,6,6,6, 6,6,6,6, 7,7,7,7, 7,7,7,7, 7,7,7,7, 8,8,8,8, 8,8,8,8, 8,8,8,8, 9,9,9,9, 9,9,9,9, 9,9,9,9, 10,10,10,10, 10,10,10,10, 10,10,10,10, 11,11,11,11, 11,11,11,11, 11,11,11,11, 12,12,12,12, 12,12,12,12, 12,12,12,12, 13,13,13,13, 13,13,13,13, 13,13,13,13))
dev <- factor(c(1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3, 1,1,1,1, 2,2,2,2, 3,3,3,3))
With the following model.matrix:
design <- model.matrix(~0 + geno + dev + geno:dev)
When I understood it correctly, the determine the differently expressed genes between between the different genotypes within a developmental stage (e.g. G1.I_vs_G2:I), I would be interested in the contrast
contrast.matrix <- makeContrasts(G1.I_vs_G2.I="geno1.time1-geno2.time1"), levels=design)
However including this contrast.matrix into contrast.fit results in the following error: Error in eval(expr, envir, enclos) : object 'genotype1.time1' not found
Which is true, due to the fact that genotype1.time1 is not included in the design.matrix.
As a solution, I was thinking of defining a single group factor for geno*dev (13 genotypes x 3 stages =39 group factors). However, here I am a little bit confused and clueless if this group factor still represent my original model (geno + dev + geno*dev) and if it is a correct way to go?
group <- factor(c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4, 5,5,5,5, 6,6,6,6, 7,7,7,7, 8,8,8,8, 9,9,9,9, 10,10,10,10, 11,11,11,11, 12,12,12,12, 13,13,13,13, 14,14,14,14, 15,15,15,15, 16,16,16,16, 17,17,17,17, 18,18,18,18, 19,19,19,19, 20,20,20,20, 21,21,21,21, 22,22,22,22, 23,23,23,23, 24,24,24,24, 25,25,25,25, 26,26,26,26, 27,27,27,27, 28,28,28,28, 29,29,29,29, 30,30,30,30, 31,31,31,31, 32,32,32,32, 33,33,33,33, 34,34,34,34, 35,35,35,35, 36,36,36,36, 37,37,37,37, 38,38,38,38, 39,39,39,39))
design <- model.matrix(~0 + group)
contrast.matrix <- makeContrasts(G1.I_vs_G2.I="group1-group4"), levels=design)
I would be very happy about any recommendations about how to define the factors of my experiment and which is the right design matrix. Thank you very much, Jutta
James is right; interaction models look cool and allow you to sound smart when you describe them to collaborators, but the group-based approach is easier to use and is almost always preferable if you're not a masochist.
As for the sequencing part of the question; you can use
duplicateCorrelation
to estimate any correlations between samples processed on the same lane. This is analogous to a mixed model, though you can only specify one blocking factor. I would be surprised, though, if the correlation was much different from zero; sequencing at the same place and time and by the same person tends to be quite reproducible. Besides, most data sets I get are multiplexed, which avoids problems with lane/flowcell effects.Further to Aaron's point; the sequencing center we normally use would have barcoded all the samples and pooled prior to application to the flow cell. That way you sequence all the samples in each lane of each flow cell, and you can just ignore the lane and flow cell effects. If that is what your sequencing center did (and IMO they should), then you don't have to worry.
Thanks James and Aaron for your help and recommendations :)
To test weather there are any correlations between my samples processed on the same lane, how would I correctly define the blocking factor feeding to duplicateCorrelation?
What I tried is the following:
I defined in addition to the geno and stage factor, factors for flowcell and lane:
flowcell <- factor(c(1,1,1,1, 1,1,2,1, 1,2,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,2,1,1, 1,1,2,1, 1,1,1,1, 1,2,1,1, 1,2,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,2,1, 1,1,1,2, 1,1,1,1, 2,1,1,1, 1,1,1,2, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,2, 2,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,2,1, 1,1,2,1, 1,1,1,1, 2,1,1,1, 1,2,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1))
lane <- factor(c(4,2,3,8, 2,1,7,8, 3,7,1,4, 4,7,7,5, 1,2,7,1, 3,6,6,4, 4,3,2,8, 8,7,2,1, 4,3,7,1, 8,7,4,5, 7,7,6,8, 6,7,4,5, 4,8,2,5, 2,1,6,8, 5,4,6,1, 2,7,4,3, 2,6,7,7, 4,6,3,7, 7,5,3,8, 7,7,1,8, 1,3,5,7, 7,2,3,5, 1,7,6,2, 1,5,6,3, 4,5,8,7, 8,7,6,7, 7,4,5,6, 8,2,5,4, 8,1,2,6, 1,4,5,6, 7,2,3,4, 2,6,7,7, 6,4,7,3, 7,5,3,8, 7,8,1,7, 5,7,3,1, 5,7,2,3, 6,1,7,2, 6,5,3,1))
design <- model.matrix(~0 + group + flowcell)
corfit <- duplicateCorrelation(voom ,design, block=flowcell:lane)
the resulting corfit$consensus.correlation = 0.01464837
Is this the right way to test? If so, is the correlation "not much different from zero" to neglect the model effects for the sequencing part? Whether my data was multiplexed I am not sure, I guess so, but I will asked the sequencing facility to be sure.
The
flowcell:lane
term is almost definitely not doing what you expect. If you want to block on the flowcell-lane combination, you should merge them into a single factor (just like you did with your groups of interest) and supply that asblock
. If you do this and you get a similarly small consensus value for the correlation, it suggests that there is not a big sequencing effect. But it probably wouldn't hurt to continue to block on this factor during the linear modelling.