Question: Paired analysis setup using edgeR - follow up question
0
gravatar for candida.vaz
10 days ago by
candida.vaz30
Singapore
candida.vaz30 wrote:

Dear Bioconductor support team,

This question is a follow up of the same I asked before, and then I carried out some analysis, but need help:-

I have samples from two time points. The samples belong to two categories: Treatment and Placebo as follows: 005v1 Placebo 005v4 Placebo 008v1 Treatment 008v4 Treatment

The ID number 005 etc represent the same sample at two time points. At V1 time-point there is no treatment given, its just that the samples are assigned to be given a treatment (treatment) or not (placebo). At V4 time-point there would be difference in placebo and treatment sample.

Now, if the question is to identify the changes caused due to the treatment, how should I define the design? Should V1 time-point be taken into consideration or not, given that the treatment intervention was given only after the assignment of samples at V1.

It seems like a paired analysis, where the blocking factor is the IDs of the samples, so using the edgeR guide (3.4), I tried to block on the ID (called as Cid in the Rcode) in the following way:-

Read in and annotate

counts <- read.delim("all-samples.txt", row.names="Symbol", header = T) metrics <- read.delim("metric-data.txt", row.names = 1, header = T)

Defining Factors

Categ <-factor(metrics$Group, levels = c("placebo", "treatment")) Cid <-factor(metrics$ID) Visit <-factor(metrics$Visit)

Creating DGElist

Group <- factor(paste(Categ, Visit, sep=".")) raw <- DGEList(counts=counts, group=Group)

Filter

plot (table (rowSums(cpm(raw) >= 1) ) ) keep <- (rowSums(cpm(raw) >= 1) >= 17) raw <- raw[keep, , keep.lib.sizes=FALSE]

Normalization

norm <- calcNormFactors(raw)

DE analysis

design <- model.matrix(~0+Cid+Group) colnames(design) design <- design[,-19]

ed <- estimateDisp(norm, design) glmfit <- glmFit(ed, design) cont <- makeContrasts(Grouptreatment.V4-Groupplacebo.V4, levels=design) gltr <- glmTreat(glmfit, contrast=cont)#no fc cut-off

out <- topTags(gltr, n = Inf, adjust.method="BH") keep <- out$table$PValue <= 0.05

Mainly, I combined Category (Placebo/Treatment) and Visit (V1/V4) into one factor Group. I then created the design as follows: design <- model.matrix(~0+Cid+Group)

Q1: Is this sufficient to block the effect of IDs?

The design looks like this:

     colnames(design)

[1] "Cid5" "Cid8" "Cid9" "Cid22" "Cid28" "Cid38" "Cid43"
[8] "Cid48" "Cid56" "Cid58" "Cid60" "Cid61" "Cid62" "Cid63"
[15] "Cid153" "Cid180" "Cid190" "Groupplacebo.V4" "Grouptreatment.V1" "Grouptreatment.V4"

When I tried to estimate dispersion on this design, I get a message:

"Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05, : Design matrix not of full rank. The following coefficients not estimable: Grouptreatment.V4"

So I removed Grouptreatment.V1 using design <- design[,-19] and then it runs.

Q2: What should I do if I also want to compare Grouptreatment.V1 vs Groupplacebo.V1 and perhaps other combinations as well?

Appreciate your help and support!

Thanks & Regards, Candida Vaz

edger • 95 views
ADD COMMENTlink modified 9 days ago • written 10 days ago by candida.vaz30
Answer: Paired analysis setup using edgeR - follow up question
1
gravatar for Aaron Lun
10 days ago by
Aaron Lun22k
Cambridge, United Kingdom
Aaron Lun22k wrote:

You would make life a lot easier for everyone if you formatted your experimental design more clearly, like so:

id <- rep(c("004", "008", "009", "022"), each=2)
time <- factor(rep(c(1, 4), 4))
treatment <- rep(rep(c("Placebo", "Treatment"), each=2), 2)
data.frame(id, time, treatment)

The logical choice of design would then be:

design <- model.matrix(~0 + id + time:treatment) # don't really need 0+.
design <- design[,!grepl("time1", colnames(design))] # getting to full rank.
colnames(design) <- make.names(colnames(design)) # cleaning up names.

The first four coefficients represent the individual blocking effects and are not of interest. The second-last coefficient represents the log-fold change of time 4 over time 1 in the placebo patients, while the last coefficient represents the equivalent log-fold change in the treated patients. The most interesting comparison is between the effects on the placebo and treated patients:

con <- makeContrasts(time4.treatmentTreatment - time4.treatmentPlacebo, levels=design)

This is the answer to your Question 1. It's probably worth reporting the individual log-fold changes as well, to make it easier to interpret the log-fold change from con (which is actually a difference of log-fold changes between the placebo and treatment conditions). A careful reading of your code suggests that it tries to do something similar but fails because you did not construct a design matrix of full rank. This means that there were linear dependencies between columns in your design (i.e., one set of columns could be expressed as a linear combination of another set of columns, rendering the system unsolvable), which I took care to remove when creating design.

As for your Question 2, if you want to compare directly between patients, you cannot do so with the full data set in edgeR. There is no way to account for correlations between samples from the same individual. Well, not without blocking on id, but if you do so, you won't be able to compare between patients because the blocking factors absorb the interesting differences. This leaves you with two choices:

  1. If you want to stick to using edgeR, subset the data so that each individual only contributes one sample (presumably from the same time point). Then use a simple one-way layout to fit a GLM to the data subset; there's no need for blocking factors.
  2. Use voom and limma, with duplicateCorrelation to block on id.

P.S. Use glmQLFit rather than glmFit, the former is more accurate and reduces the number of false positives. And don't do keep <- out$table$PValue <= 0.05, make sure you use the adjusted p-value for defining the set of significant genes.

ADD COMMENTlink modified 10 days ago • written 10 days ago by Aaron Lun22k

Dear Aaron,

Thank you so very much for the detailed explanation and solutions to my questions! So grateful!

Regards, Candida

ADD REPLYlink written 10 days ago by candida.vaz30

Dear Aaron,

I had used limma voom with duplicateCorrelation blocking on id. This is my code...does this seem ok? or could you give your suggestions.

Read in and annotate

counts <- read.delim("all-samples.txt", row.names="Symbol", header = T) metrics <- read.delim("checking-metrics-data.txt", row.names = 1, header = T)

head(metrics)

Defining Factors

Group <-factor(metrics$Group, levels = c("placebo", "treatment")) Cid <-factor(metrics$ID) Visit <-factor(metrics$Visit.number)

Creating DGElist

raw <- DGEList(counts=counts, group=Group) raw$samples <- cbind(raw$samples,Cid,Visit) data.frame(Sample=colnames(raw),Group,Cid,Visit)

Filter

plot (table (rowSums(cpm(raw) >= 1) ) ) keep <- (rowSums(cpm(raw) >= 1) >= 17) raw <- raw[keep, , keep.lib.sizes=FALSE]

Normalization

norm <- calcNormFactors(raw)

DGE

design <- model.matrix( ~0 + Group + Visit, data=norm$samples) v <- voomWithQualityWeights(norm, design, plot=T) dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid) v <- voomWithQualityWeights(norm, design, block=norm$samples$Cid, correlation=dupCor$consensus, plot=T) fit <- lmFit(v, design, block=norm$samples$Cid, correlation=dupCor$consensus) cm <- makeContrasts(treat = Grouptreatment-Groupplacebo, levels=design ) fit2 <-contrasts.fit(fit, cm) efit <-eBayes(fit2)

Thanks a ton! Appreciate you kind support

ADD REPLYlink modified 9 days ago • written 10 days ago by candida.vaz30
1
  1. I would fuse Group and Visit into a single variable and use a one-way layout. Your additive model assumes that the treatment and time are additive, but this won't be true for the most interesting genes.
  2. It wouldn't hurt to run duplicateCorrelation again after the second voomWithQualityWeights call.
  3. Your filtering looks odd to me. Why keep 17? What happens to a gene that is not expressed in any group except the treated condition at time point 4 (which presumably would only have non-zero CPMs in a subset of the 17)?
ADD REPLYlink modified 10 days ago • written 10 days ago by Aaron Lun22k

Thanks a ton Aaron!

1. To fuse into a single variable can I use:

Group <- factor(paste(Categ, Visit, sep="."))

raw <- DGEList(counts=counts, group=Group)

raw$samples <- cbind(raw$samples,Cid,Visit)

(Where Categ =placebo/treatment & Visit=two time points)

2. For filtering (you are right, i'm now using 8 which is the number of samples in treated category)...is this ok?

keep <- (rowSums(cpm(raw) >= 1) >= 8)

raw <- raw[keep, , keep.lib.sizes=FALSE]

3. Design and DGE calculation

design <- model.matrix( ~0 + Group, data=norm$samples) # one way layout

v <- voomWithQualityWeights(norm, design, plot=T)

dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid)

v <- voomWithQualityWeights(norm, design, block=norm$samples$Cid, correlation=dupCor$consensus, plot=T)

dupCor <- duplicateCorrelation(v, design, block=norm$samples$Cid) # second duplicateCorrelation

fit <- lmFit(v, design, block=norm$samples$Cid, correlation=dupCor$consensus)

cm <- makeContrasts(treat = Grouptreatment.MV4-Groupplacebo.MV4, levels=design ) # contrast i'm interested in

fit2 <-contrasts.fit(fit, cm)

efit <-eBayes(fit2)

One important fact I would like to stress on, in the experimental design is that some samples remain placebo (9X2) at both time points and some (8X2) get treatment at 2nd time-point. Hope the model is sufficient to get DE genes (Treatment vs Placebo).

Appreciate your time and help on this, getting to learn a lot!

ADD REPLYlink modified 9 days ago • written 9 days ago by candida.vaz30
1

If you want to identify differences between treatment and placebo, the most relevant contrast seems to be:

makeContrasts((Grouptreatment.MV4 - Grouptreatment.MV1) 
    - (Groupplacebo.MV4 - Groupplacebo.MV1), levels=design)

This corresponds to the contrast for edgeR that I proposed in my original answer. It should be better than your cm, as a within-patient comparison avoids the penalty that duplicateCorrelation must impose (i.e., an increase in the standard error, and thus a decrease in power) when comparing across patients.

ADD REPLYlink modified 9 days ago • written 9 days ago by Aaron Lun22k

Dear Aaron,

Thanks a ton for your suggestions!

ADD REPLYlink written 8 days ago by candida.vaz30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 263 users visited in the last hour