Hi,
I'm analysing RNA-seq data and have an experimental set up similar to this:
colData <- data.frame(sample=c(1,2,3,4,5,6,7,8),
group=c("A", "B", "C", "A", "B", "A", "C", "C"),
patient=c(1,1,1,2,2,3,3,4),
age=c(13.2, 15, 16.8, 13.4, 15.3, 12.8, 17, 16.5))
I'd like to do all pairwise group comparisons (AvB, AvC, BvC), whilst taking into account the ages of the patients and matched patient samples. However the issue is not all my patient samples are matched and in all groups.
I've decided to use limma, currently I understand duplicateCorrelation() can be used to account for missing matched patient samples?
I've generated the code below following the limma manual and other similar questions I've found online:
# reads is an expression matrix with columns as samples and rows as genes
dge <- DGEList(counts=reads)
# remove none expressed genes
cutoff <- 1
drop <- which(apply(cpm(dge), 1, max) < cutoff)
d <- dge[-drop,]
d <- calcNormFactors(d)
# fit model
design = model.matrix(~0+colData$group+colData$age)
colnames(design) <- c("A", "B", "C", "age")
# normalise reads using voom
v <- voom(d, design=design)
# here the patient is treated as a random effect due to performing between and within patient comparisons
corfit <- duplicateCorrelation(v, design, block=colData$patient)
# repeat voom and duplicate correlation (https://support.bioconductor.org/p/59700/)
v <- voom(d, design=design, block=colData$patient, correlation=corfit$consensus)
corfit <- duplicateCorrelation(v, design, block=colData$patient)
fit <- lmFit(v, design=design, block=colData$patient, correlation=corfit$consensus)
contr <- makeContrasts(A - B, levels=design)
fit2 <- contrasts.fit(fit, contr)
fit2 <- eBayes(fit2)
top.table <- topTable(fit2, sort.by = "p", n = Inf)
head(top.table, 20)
The code runs fine, but I'm doubting my model is correct as the lowest adjusted p-value is 0.97 for any of the comparisons so no genes are coming up as significantly differentially expressed.
This may just be the data, but want to double check there is no issue with my model/code as I'm new to using limma. Any help appreciated, thanks!
Ok thanks for your feedback! I'll try to rerun without including age in the model. As for the three ages, they are samples collected from the patient over a period of several years.
If these are human patients, then larger sample sizes would usually be required to get reliable results. It would be a common experience to find no statistically significant DE in an analysis with just three human subjects.
The colData I included was just an example so we have almost 20 patients. I removed the age variable and this still resulted in very high adjusted p-values. I happened to try without setting intercept as 0 in the model and the results change a lot.
top 3 genes ranked by adjusted p-val with
design = model.matrix(~0+colData$group)
top 3 genes ranked by adjusted p-val with
design = model.matrix(~colData$group)
I've read on a few posts that setting/not setting 0 intercept shouldn't make too much of a difference. Is there a way to determine whether to set it as 0 or not is more appropriate for my data? Sorry if this is a simplistic question, my background is not stats and I appreciate any help you can give!
Your second analysis is wrong. Please read the previous posts or the limma User's Guide to see why. If you change the design matrix you naturally have to make the corresponding adjustment to the contrasts.
Ahh I see, as my set up is group-means parametrization and without ~0 would require treatment-contrasts parametrization. Thanks for pointing me in the right direction!