Paired analysis setup using edgeR - follow up question
1
1
Entering edit mode
candida.vaz ▴ 50
@candidavaz-6923
Last seen 5.8 years ago
Singapore

Dear Bioconductor support team,

This question is a follow up of the same I asked before, and then I carried out some analysis, but need help:-

I have samples from two time points. The samples belong to two categories: Treatment and Placebo as follows: 005v1 Placebo 005v4 Placebo 008v1 Treatment 008v4 Treatment

The ID number 005 etc represent the same sample at two time points. At V1 time-point there is no treatment given, its just that the samples are assigned to be given a treatment (treatment) or not (placebo). At V4 time-point there would be difference in placebo and treatment sample.

Now, if the question is to identify the changes caused due to the treatment, how should I define the design? Should V1 time-point be taken into consideration or not, given that the treatment intervention was given only after the assignment of samples at V1.

It seems like a paired analysis, where the blocking factor is the IDs of the samples, so using the edgeR guide (3.4), I tried to block on the ID (called as Cid in the Rcode) in the following way:-

Read in and annotate

counts <- read.delim("all-samples.txt", row.names="Symbol", header = T) metrics <- read.delim("metric-data.txt", row.names = 1, header = T)

Defining Factors

Categ <-factor(metrics$Group, levels = c("placebo", "treatment")) Cid <-factor(metrics$ID) Visit <-factor(metrics$Visit)

Creating DGElist

Group <- factor(paste(Categ, Visit, sep=".")) raw <- DGEList(counts=counts, group=Group)

Filter

plot (table (rowSums(cpm(raw) >= 1) ) ) keep <- (rowSums(cpm(raw) >= 1) >= 17) raw <- raw[keep, , keep.lib.sizes=FALSE]

Normalization

norm <- calcNormFactors(raw)

DE analysis

design <- model.matrix(~0+Cid+Group) colnames(design) design <- design[,-19]

ed <- estimateDisp(norm, design) glmfit <- glmFit(ed, design) cont <- makeContrasts(Grouptreatment.V4-Groupplacebo.V4, levels=design) gltr <- glmTreat(glmfit, contrast=cont)#no fc cut-off

out <- topTags(gltr, n = Inf, adjust.method="BH") keep <- out$table$PValue <= 0.05

Mainly, I combined Category (Placebo/Treatment) and Visit (V1/V4) into one factor Group. I then created the design as follows: design <- model.matrix(~0+Cid+Group)

Q1: Is this sufficient to block the effect of IDs?

The design looks like this:

     colnames(design)

[1] "Cid5" "Cid8" "Cid9" "Cid22" "Cid28" "Cid38" "Cid43"
[8] "Cid48" "Cid56" "Cid58" "Cid60" "Cid61" "Cid62" "Cid63"
[15] "Cid153" "Cid180" "Cid190" "Groupplacebo.V4" "Grouptreatment.V1" "Grouptreatment.V4"

When I tried to estimate dispersion on this design, I get a message:

"Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05, : Design matrix not of full rank. The following coefficients not estimable: Grouptreatment.V4"

So I removed Grouptreatment.V1 using design <- design[,-19] and then it runs.

Q2: What should I do if I also want to compare Grouptreatment.V1 vs Groupplacebo.V1 and perhaps other combinations as well?

Appreciate your help and support!

Thanks & Regards, Candida Vaz

edger • 3.0k views
ADD COMMENT
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 6 hours ago
The city by the bay

You would make life a lot easier for everyone if you formatted your experimental design more clearly, like so:

id <- rep(c("004", "008", "009", "022"), each=2)
time <- factor(rep(c(1, 4), 4))
treatment <- rep(rep(c("Placebo", "Treatment"), each=2), 2)
data.frame(id, time, treatment)

The logical choice of design would then be:

design <- model.matrix(~0 + id + time:treatment) # don't really need 0+.
design <- design[,!grepl("time1", colnames(design))] # getting to full rank.
colnames(design) <- make.names(colnames(design)) # cleaning up names.

The first four coefficients represent the individual blocking effects and are not of interest. The second-last coefficient represents the log-fold change of time 4 over time 1 in the placebo patients, while the last coefficient represents the equivalent log-fold change in the treated patients. The most interesting comparison is between the effects on the placebo and treated patients:

con <- makeContrasts(time4.treatmentTreatment - time4.treatmentPlacebo, levels=design)

This is the answer to your Question 1. It's probably worth reporting the individual log-fold changes as well, to make it easier to interpret the log-fold change from con (which is actually a difference of log-fold changes between the placebo and treatment conditions). A careful reading of your code suggests that it tries to do something similar but fails because you did not construct a design matrix of full rank. This means that there were linear dependencies between columns in your design (i.e., one set of columns could be expressed as a linear combination of another set of columns, rendering the system unsolvable), which I took care to remove when creating design.

As for your Question 2, if you want to compare directly between patients, you cannot do so with the full data set in edgeR. There is no way to account for correlations between samples from the same individual. Well, not without blocking on id, but if you do so, you won't be able to compare between patients because the blocking factors absorb the interesting differences. This leaves you with two choices:

  1. If you want to stick to using edgeR, subset the data so that each individual only contributes one sample (presumably from the same time point). Then use a simple one-way layout to fit a GLM to the data subset; there's no need for blocking factors.
  2. Use voom and limma, with duplicateCorrelation to block on id.

P.S. Use glmQLFit rather than glmFit, the former is more accurate and reduces the number of false positives. And don't do keep <- out$table$PValue <= 0.05, make sure you use the adjusted p-value for defining the set of significant genes.

ADD COMMENT
0
Entering edit mode

Dear Aaron,

Thank you so very much for the detailed explanation and solutions to my questions! So grateful!

Regards, Candida

ADD REPLY
0
Entering edit mode

Dear Aaron,

I had used limma voom with duplicateCorrelation blocking on id. This is my code...does this seem ok? or could you give your suggestions.

Read in and annotate

counts <- read.delim("all-samples.txt", row.names="Symbol", header = T) metrics <- read.delim("checking-metrics-data.txt", row.names = 1, header = T)

head(metrics)

Defining Factors

Group <-factor(metrics$Group, levels = c("placebo", "treatment")) Cid <-factor(metrics$ID) Visit <-factor(metrics$Visit.number)

Creating DGElist

raw <- DGEList(counts=counts, group=Group) raw$samples <- cbind(raw$samples,Cid,Visit) data.frame(Sample=colnames(raw),Group,Cid,Visit)

Filter

plot (table (rowSums(cpm(raw) >= 1) ) ) keep <- (rowSums(cpm(raw) >= 1) >= 17) raw <- raw[keep, , keep.lib.sizes=FALSE]

Normalization

norm <- calcNormFactors(raw)

DGE

design <- model.matrix( ~0 + Group + Visit, data=norm$samples) v <- voomWithQualityWeights(norm, design, plot=T) dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid) v <- voomWithQualityWeights(norm, design, block=norm$samples$Cid, correlation=dupCor$consensus, plot=T) fit <- lmFit(v, design, block=norm$samples$Cid, correlation=dupCor$consensus) cm <- makeContrasts(treat = Grouptreatment-Groupplacebo, levels=design ) fit2 <-contrasts.fit(fit, cm) efit <-eBayes(fit2)

Thanks a ton! Appreciate you kind support

ADD REPLY
1
Entering edit mode
  1. I would fuse Group and Visit into a single variable and use a one-way layout. Your additive model assumes that the treatment and time are additive, but this won't be true for the most interesting genes.
  2. It wouldn't hurt to run duplicateCorrelation again after the second voomWithQualityWeights call.
  3. Your filtering looks odd to me. Why keep 17? What happens to a gene that is not expressed in any group except the treated condition at time point 4 (which presumably would only have non-zero CPMs in a subset of the 17)?
ADD REPLY
0
Entering edit mode

Thanks a ton Aaron!

1. To fuse into a single variable can I use:

Group <- factor(paste(Categ, Visit, sep="."))

raw <- DGEList(counts=counts, group=Group)

raw$samples <- cbind(raw$samples,Cid,Visit)

(Where Categ =placebo/treatment & Visit=two time points)

2. For filtering (you are right, i'm now using 8 which is the number of samples in treated category)...is this ok?

keep <- (rowSums(cpm(raw) >= 1) >= 8)

raw <- raw[keep, , keep.lib.sizes=FALSE]

3. Design and DGE calculation

design <- model.matrix( ~0 + Group, data=norm$samples) # one way layout

v <- voomWithQualityWeights(norm, design, plot=T)

dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid)

v <- voomWithQualityWeights(norm, design, block=norm$samples$Cid, correlation=dupCor$consensus, plot=T)

dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid) # second duplicateCorrelation

fit <- lmFit(v, design, block=norm$samples$Cid, correlation=dupCor$consensus)

cm <- makeContrasts(treat = Grouptreatment.MV4-Groupplacebo.MV4, levels=design ) # contrast i'm interested in

fit2 <-contrasts.fit(fit, cm)

efit <-eBayes(fit2)

One important fact I would like to stress on, in the experimental design is that some samples remain placebo (9X2) at both time points and some (8X2) get treatment at 2nd time-point. Hope the model is sufficient to get DE genes (Treatment vs Placebo).

Appreciate your time and help on this, getting to learn a lot!

ADD REPLY
1
Entering edit mode

If you want to identify differences between treatment and placebo, the most relevant contrast seems to be:

makeContrasts((Grouptreatment.MV4 - Grouptreatment.MV1) 
    - (Groupplacebo.MV4 - Groupplacebo.MV1), levels=design)

This corresponds to the contrast for edgeR that I proposed in my original answer. It should be better than your cm, as a within-patient comparison avoids the penalty that duplicateCorrelation must impose (i.e., an increase in the standard error, and thus a decrease in power) when comparing across patients.

ADD REPLY
0
Entering edit mode

Dear Aaron,

Thanks a ton for your suggestions!

ADD REPLY
0
Entering edit mode

Dear Aaron,

Just to follow up on the same analysis: was wondering if one could run the double voom and double duplicate correlation on a model that includes adjusting for confounding factors. For example if Age is a confounding factor then the design model will be:

design <- model.matrix( ~0 + Group + Age, data=norm$samples)

Can I now run voom and duplicate correlation on this adjusted design model?

v <- voomWithQualityWeights(norm, design, plot=T)

dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid)

v <- voomWithQualityWeights(norm, design, block=norm$samples$Cid, correlation=dupCor$consensus, plot=T)

dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid) # second duplicateCorrelation

Thanks for your help on this!

ADD REPLY
1
Entering edit mode

That's fine, providing Age and Group are not confounded. Age will probably be confounded with Cid (unless you've recorded that the patient changes their age between visits), but that's not a problem when you block on patient ID in duplicateCorrelation(). You'll still probably have patient-specific effects that are independent of age, so blocking on Cid will still be necessary.

ADD REPLY
0
Entering edit mode

Thank you for your advice Aaron.

Age of the patients don't change much except by a few months during the two visits. My concern is the effect of "Age" itself on the outcome. I am interested in removing the effect of Age or any such factor and seeing only the effect of Treatment on the outcome. So are the steps mentioned above good enough to do so?

Appreciate your time and help on this!

ADD REPLY
0
Entering edit mode

Blocking on Age will certainly mitigate the effect of age. Whether the effect is removed depends on whether the effect of age on log-expression is linear. In reality, it probably isn't, so if you want to protect against non-linearity, you'd need to use a spline with respect to age (see the user's guide for details).

Another potential issue would be the additivity assumption, which you're also making when you do + Age in model.matrix(). In other words, you're assuming that the effect of age just "adds onto" other effects, which is also unlikely to be true in complex biological systems. You cannot avoid this assumption while still being able to perform your comparisons of interest.

So, it's unlikely that you can fully remove the effect of age in real data. The linearity and additivity assumptions are strong and only the former is really fixable. Your current approach is probably good enough; a spline would avoid the linearity assumption (assuming you have enough samples to tolerate the loss of residual d.f.); but if you want to really get rid of the effect of age, make sure you only collect samples from patients of the same age!

ADD REPLY
0
Entering edit mode

Thanks a lot Aaron for the advice.

I never came across the spline concept in the limma/edgeR guide. Could you please let me know where can I find it?

I will be using the additivity assumption + Age in model.matrix() (understand that it may not be true in complex biological systems.

Can I just confirm with you if its appropriate to further use double voom and double duplicate correlation on this additive model.

Thanks!

ADD REPLY
1
Entering edit mode

Yes, double use of voom and duplicateCorrelation is fine.

ADD REPLY

Login before adding your answer.

Traffic: 526 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6