Dear Bioconductor support team,
This question is a follow up of the same I asked before, and then I carried out some analysis, but need help:-
I have samples from two time points. The samples belong to two categories: Treatment and Placebo as follows: 005v1 Placebo 005v4 Placebo 008v1 Treatment 008v4 Treatment
The ID number 005 etc represent the same sample at two time points. At V1 time-point there is no treatment given, its just that the samples are assigned to be given a treatment (treatment) or not (placebo). At V4 time-point there would be difference in placebo and treatment sample.
Now, if the question is to identify the changes caused due to the treatment, how should I define the design? Should V1 time-point be taken into consideration or not, given that the treatment intervention was given only after the assignment of samples at V1.
It seems like a paired analysis, where the blocking factor is the IDs of the samples, so using the edgeR guide (3.4), I tried to block on the ID (called as Cid in the Rcode) in the following way:-
Read in and annotate
counts <- read.delim("all-samples.txt", row.names="Symbol", header = T) metrics <- read.delim("metric-data.txt", row.names = 1, header = T)
Defining Factors
Categ <-factor(metrics$Group, levels = c("placebo", "treatment")) Cid <-factor(metrics$ID) Visit <-factor(metrics$Visit)
Creating DGElist
Group <- factor(paste(Categ, Visit, sep=".")) raw <- DGEList(counts=counts, group=Group)
Filter
plot (table (rowSums(cpm(raw) >= 1) ) ) keep <- (rowSums(cpm(raw) >= 1) >= 17) raw <- raw[keep, , keep.lib.sizes=FALSE]
Normalization
norm <- calcNormFactors(raw)
DE analysis
design <- model.matrix(~0+Cid+Group) colnames(design) design <- design[,-19]
ed <- estimateDisp(norm, design) glmfit <- glmFit(ed, design) cont <- makeContrasts(Grouptreatment.V4-Groupplacebo.V4, levels=design) gltr <- glmTreat(glmfit, contrast=cont)#no fc cut-off
out <- topTags(gltr, n = Inf, adjust.method="BH") keep <- out$table$PValue <= 0.05
Mainly, I combined Category (Placebo/Treatment) and Visit (V1/V4) into one factor Group. I then created the design as follows: design <- model.matrix(~0+Cid+Group)
Q1: Is this sufficient to block the effect of IDs?
The design looks like this:
colnames(design)
[1] "Cid5" "Cid8" "Cid9" "Cid22" "Cid28" "Cid38" "Cid43"
[8] "Cid48" "Cid56" "Cid58" "Cid60" "Cid61" "Cid62" "Cid63"
[15] "Cid153" "Cid180" "Cid190" "Groupplacebo.V4" "Grouptreatment.V1" "Grouptreatment.V4"
When I tried to estimate dispersion on this design, I get a message:
"Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05, : Design matrix not of full rank. The following coefficients not estimable: Grouptreatment.V4"
So I removed Grouptreatment.V1 using design <- design[,-19] and then it runs.
Q2: What should I do if I also want to compare Grouptreatment.V1 vs Groupplacebo.V1 and perhaps other combinations as well?
Appreciate your help and support!
Thanks & Regards, Candida Vaz
Dear Aaron,
Thank you so very much for the detailed explanation and solutions to my questions! So grateful!
Regards, Candida
Dear Aaron,
I had used limma voom with duplicateCorrelation blocking on id. This is my code...does this seem ok? or could you give your suggestions.
Read in and annotate
counts <- read.delim("all-samples.txt", row.names="Symbol", header = T) metrics <- read.delim("checking-metrics-data.txt", row.names = 1, header = T)
head(metrics)
Defining Factors
Group <-factor(metrics$Group, levels = c("placebo", "treatment")) Cid <-factor(metrics$ID) Visit <-factor(metrics$Visit.number)
Creating DGElist
raw <- DGEList(counts=counts, group=Group) raw$samples <- cbind(raw$samples,Cid,Visit) data.frame(Sample=colnames(raw),Group,Cid,Visit)
Filter
plot (table (rowSums(cpm(raw) >= 1) ) ) keep <- (rowSums(cpm(raw) >= 1) >= 17) raw <- raw[keep, , keep.lib.sizes=FALSE]
Normalization
norm <- calcNormFactors(raw)
DGE
design <- model.matrix( ~0 + Group + Visit, data=norm$samples) v <- voomWithQualityWeights(norm, design, plot=T) dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid) v <- voomWithQualityWeights(norm, design, block=norm$samples$Cid, correlation=dupCor$consensus, plot=T) fit <- lmFit(v, design, block=norm$samples$Cid, correlation=dupCor$consensus) cm <- makeContrasts(treat = Grouptreatment-Groupplacebo, levels=design ) fit2 <-contrasts.fit(fit, cm) efit <-eBayes(fit2)
Thanks a ton! Appreciate you kind support
Group
andVisit
into a single variable and use a one-way layout. Your additive model assumes that the treatment and time are additive, but this won't be true for the most interesting genes.duplicateCorrelation
again after the secondvoomWithQualityWeights
call.Thanks a ton Aaron!
1. To fuse into a single variable can I use:
Group <- factor(paste(Categ, Visit, sep="."))
raw <- DGEList(counts=counts, group=Group)
raw$samples <- cbind(raw$samples,Cid,Visit)
(Where Categ =placebo/treatment & Visit=two time points)
2. For filtering (you are right, i'm now using 8 which is the number of samples in treated category)...is this ok?
keep <- (rowSums(cpm(raw) >= 1) >= 8)
raw <- raw[keep, , keep.lib.sizes=FALSE]
3. Design and DGE calculation
design <- model.matrix( ~0 + Group, data=norm$samples) # one way layout
v <- voomWithQualityWeights(norm, design, plot=T)
dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid)
v <- voomWithQualityWeights(norm, design, block=norm$samples$Cid, correlation=dupCor$consensus, plot=T)
dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid) # second duplicateCorrelation
fit <- lmFit(v, design, block=norm$samples$Cid, correlation=dupCor$consensus)
cm <- makeContrasts(treat = Grouptreatment.MV4-Groupplacebo.MV4, levels=design ) # contrast i'm interested in
fit2 <-contrasts.fit(fit, cm)
efit <-eBayes(fit2)
One important fact I would like to stress on, in the experimental design is that some samples remain placebo (9X2) at both time points and some (8X2) get treatment at 2nd time-point. Hope the model is sufficient to get DE genes (Treatment vs Placebo).
Appreciate your time and help on this, getting to learn a lot!
If you want to identify differences between treatment and placebo, the most relevant contrast seems to be:
This corresponds to the contrast for edgeR that I proposed in my original answer. It should be better than your
cm
, as a within-patient comparison avoids the penalty thatduplicateCorrelation
must impose (i.e., an increase in the standard error, and thus a decrease in power) when comparing across patients.Dear Aaron,
Thanks a ton for your suggestions!
Dear Aaron,
Just to follow up on the same analysis: was wondering if one could run the double voom and double duplicate correlation on a model that includes adjusting for confounding factors. For example if Age is a confounding factor then the design model will be:
design <- model.matrix( ~0 + Group + Age, data=norm$samples)
Can I now run voom and duplicate correlation on this adjusted design model?
v <- voomWithQualityWeights(norm, design, plot=T)
dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid)
v <- voomWithQualityWeights(norm, design, block=norm$samples$Cid, correlation=dupCor$consensus, plot=T)
dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid) # second duplicateCorrelation
Thanks for your help on this!
That's fine, providing
Age
andGroup
are not confounded.Age
will probably be confounded withCid
(unless you've recorded that the patient changes their age between visits), but that's not a problem when you block on patient ID induplicateCorrelation()
. You'll still probably have patient-specific effects that are independent of age, so blocking onCid
will still be necessary.Thank you for your advice Aaron.
Age of the patients don't change much except by a few months during the two visits. My concern is the effect of "Age" itself on the outcome. I am interested in removing the effect of Age or any such factor and seeing only the effect of Treatment on the outcome. So are the steps mentioned above good enough to do so?
Appreciate your time and help on this!
Blocking on
Age
will certainly mitigate the effect of age. Whether the effect is removed depends on whether the effect of age on log-expression is linear. In reality, it probably isn't, so if you want to protect against non-linearity, you'd need to use a spline with respect to age (see the user's guide for details).Another potential issue would be the additivity assumption, which you're also making when you do
+ Age
inmodel.matrix()
. In other words, you're assuming that the effect of age just "adds onto" other effects, which is also unlikely to be true in complex biological systems. You cannot avoid this assumption while still being able to perform your comparisons of interest.So, it's unlikely that you can fully remove the effect of age in real data. The linearity and additivity assumptions are strong and only the former is really fixable. Your current approach is probably good enough; a spline would avoid the linearity assumption (assuming you have enough samples to tolerate the loss of residual d.f.); but if you want to really get rid of the effect of age, make sure you only collect samples from patients of the same age!
Thanks a lot Aaron for the advice.
I never came across the spline concept in the limma/edgeR guide. Could you please let me know where can I find it?
I will be using the additivity assumption + Age in model.matrix() (understand that it may not be true in complex biological systems.
Can I just confirm with you if its appropriate to further use double voom and double duplicate correlation on this additive model.
Thanks!
Yes, double use of
voom
andduplicateCorrelation
is fine.