Hi,
I have an experimental design where I took the left and right brain hemispheres from mice across several time points following treatment at time 0 to the left brain hemisphere only. Therefore the right hemisphere serves as a paired control for the left hemisphere at each of the time points. Here's the samples data.frame
:
library(dplyr)
df <- data.frame(animal_id = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5","id6","id6","id7","id7","id8","id8","id9","id9","id10","id10","id11","id11","id12","id12","id13","id13","id14","id14","id15","id15","id16","id16","id17","id17","id18","id18","id19","id19","id20","id20","id21","id21","id22","id22","id23","id23","id24","id24","id25","id25","id26","id26","id27","id27","id28","id28","id29","id29","id30","id30"),
time_point = c(0,0,0,0,2,2,2,2,2,2,3,3,3,3,3,3,6,6,6,6,6,6,9,9,9,9,14,14,14,14,14,14,14,14,26,26,26,26,26,26,26,26,50,50,50,50,50,50,50,50,74,74,170,170,170,170,170,170,170,170),
hemisphere = c("L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R","L","R"),
cov1 = c(99,99,98,98,42.2,42.2,47.6,47.6,38.7,38.7,73.5,73.5,40.6,40.6,37.4,37.4,29.9,29.9,35.1,35.1,38.9,38.9,33.5,33.5,37.9,37.9,38,38,37.3,37.3,45.2,45.2,40.4,40.4,40.3,40.3,39.6,39.6,38.3,38.3,38.9,38.9,37.7,37.7,41.1,41.1,42.8,42.8,37.5,37.5,41.1,41.1,40.8,40.8,42.8,42.8,39,39,40.9,40.9),
cov2 = c(28,28,27.3,27.3,28.2,28.2,28.1,28.1,25.6,25.6,30,30,30.1,30.1,30.3,30.3,30.2,30.2,28.3,28.3,31.6,31.6,28.9,28.9,31.5,31.5,26.4,26.4,26.1,26.1,27.5,27.5,26.4,26.4,23.6,23.6,26.5,26.5,25,25,22.1,22.1,27.8,27.8,23.2,23.2,23.2,23.2,21.1,21.1,25.9,25.9,25.7,25.7,25.4,25.4,26.7,26.7,19,19),
stringsAsFactors = F) %>%
dplyr::mutate(sample_id = paste0(animal_id,"_",hemisphere,"_",time_point))
As you can see I have 30 animals where for each I have both hemispheres at each time point, however the number of animals at each time point varies between 1 to 4.
What I'm interested in is how the difference in gene expression between the left and right hemispheres changes at each time point relative to the first time point, while controlling for the covariates (cov1
and cov2
). For this reason I convert time_point
to a factor, as well as animal_id
and hemisphere
(setting hemisphere R
as baseline):
df$time_point <- factor(df$time_point)
df$animal_id <- factor(df$animal_id)
df$hemisphere <- factor(df$hemisphere, levels = c("R","L"))
Now I simulate a read count matrix:
set.seed(1)
count.mat <- matrix(as.integer(runif(10000*60 ,10, 200000)), nrow = 10000, ncol = 60,dimnames = list(paste0("gene",1:10000), df$sample_id))
So as the DESeq2 vignette part on Model matrix not full rank describes, my data have variable(s) that can be formed by the combination of other factor levels. For this reason running:
DESeq2::DESeqDataSetFromMatrix(countData = count.mat, colData = df, design =~ cov1 + cov2 + animal_id + hemisphere*time_point)
I get the error:
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
Please read the vignette section 'Model matrix not full rank':
vignette('DESeq2')
If I follow the DESeq2 vignette part on Group-specific condition effects, individuals nested within groups for fixing this issue the suggestion is to create a pseudo animal_id
like this:
df$pseudo_animal_id <- factor(unlist(lapply(unique(df$time_point),function(t)as.integer(factor(dplyr::filter(df,time_point == t)$animal_id)))))
And to create this design matrix:
design.mat <- model.matrix(~ cov1 + cov2 + time_point + time_point:pseudo_animal_id + time_point:hemisphere,data=df)
And then to remove the all-zero columns (empty levels):
all.zero.idx <- unname(which(apply(design.mat, 2, function(x) all(x==0))))
design.mat <- design.mat[,-all.zero.idx]
And then to create the data object with a simplified design matrix:
dds <- DESeq2::DESeqDataSetFromMatrix(countData = count.mat, colData = df, design =~ animal_id + hemisphere)
And then to run DESeq2
with design.mat
:
dds <- DESeq2::DESeq(dds, full = design.mat, parallel=T)
The last step fails with the same error as above:
using supplied model matrix
Error in checkFullRank(full) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
Please read the vignette section 'Model matrix not full rank':
vignette('DESeq2')
But if I eliminate cov1
and cov2
from the design matrix it runs fine.
But it's not clear to me how ti fix this covariates issue if I still wanted to account for them.
My higher level question is whether modeling the response as the difference in number of reads between the left and right hemispheres (L
-R
) per each animal wouldn't simplify this problem.
In that case the design would only be: cov1 + cov2 + time_point
.
I guess it's probably not straight forward for the obvious reason that I'll have negative integers due to my subtraction, so I guess my question is what are my options?
And in addition, if I have to either give up cov1
and cov2
or 'animal_id`, what would conceptually be a better choice?