Question: edgeR: boxplot the expression of a gene in two conditions after glmFit
0
4 months ago by
United States
zoppoli pietro10 wrote:

I have a DGEGLM object from edgeR called fit2

fit2 <- glmFit(counts(Data), design, disp2$tagwise.dispersion,offset=-offst(dataOffset)) I want the logFC values so logFC <- fit2$coefficients[geneA,2] # first column intercept, second column is my comparison

fit2$fitted.values[geneA,] # is the vector with the fitted values TableWithFC <- aggregate(log2(fit2$fitted.values[geneA,]) ~ (condition),FUN= mean)

logFC2 <- TableWithFC[2,2]-TableWithFC [1,2]

logFC is NOT equal to logFC2

If logFC is the right one (is the same I get if I continue the pipeline and I do glmLRT and topTags), how I can obtain the expression values of geneA for each sample in order to obtain a boxplot like this:

boxplot(log2(fit2$fitted.values["geneA",]) ~ (fit2$design[,2]))

but with the values giving the logFC of the fit2$coefficients[geneA,2] ? In other words the boxplot with the fitted.values is not representative of the result with coefficients. I want a boxplot representative of the logFC obtained with coefficients. Thanks, P edger • 127 views ADD COMMENTlink modified 4 months ago by Aaron Lun23k • written 4 months ago by zoppoli pietro10 Answer: edgeR: boxplot the expression of a gene in two conditions after glmFit 1 4 months ago by Aaron Lun23k Cambridge, United Kingdom Aaron Lun23k wrote: There are many reasons why those two values are different: 1. The log-fold changes derived from coefficients are computed with a prior count added to the counts, which squeezes differences towards zero (see ?glmFit). If you want the unshrunk log-fold changes, you should use unshrunk.coefficients. 2. The mean of the log-values is not generally equal to the log-mean value. So computing the mean of the log-fitted values doesn't necessarily give you the log-fold change between the means. 3. In fact, the sample mean of each group will not be exactly proportional to the group-specific abundance reported by edgeR (unless your libraries are exactly the same size). This is because the latter is computed from the GLM, which effectively weights each observation based on its variance. 4. The coefficients are computed with a natural log, while you're using log2 for your log-transform. I think you're worrying too much about this. Just use cpm(y, log=TRUE, prior=5) to get log-CPMs and make your boxplot; any discrepancy should be too minor for people to notice or care, at least for the purposes of visualization. ADD COMMENTlink modified 4 months ago • written 4 months ago by Aaron Lun23k Hi Aaron, thanks a lot for your answer, you made me reflect on the many reasons the values can be different. I expected a minor difference in FC as you state in your answer BUT I find FCs with different sign comparing the results of my analysis and the cpm. Let me explain that i'm using both EDAseq and RUVSeq prior edgeR to obtain my result. According to EDASeq manual: dataOffset <- withinLaneNormalization(data,"gc",which="upper",offset=TRUE) dataOffset <- betweenLaneNormalization(dataOffset,which="upper",offset=TRUE) then according to RUVSeq using control genes taken from Eisenberg E, Levanon EY. Human housekeeping genes, revisited. Trends Genet. 2013 I do: set1 <- RUVg(x = dataOffset cIdx = HKgenes, k=1) RUVg does NOT fill the offset slot of the EDASeq object. To use the offsets I need the result from EDASeq but I still want to Remove Unwanted Variation so I modify the edgeR paragraph of the EDASeq manual: W_1 <- pData(set1)$ W_1

pData(dataOffset) <- cbind(pData(dataOffset),W_1)

design <- model.matrix(~conditions + W_1, data=pData(dataOffset))

then all by manual

disp2 <- estimateDisp(counts(dataOffset),design, offset=-offst(dataOffset))
fit2 <- glmFit(counts(dataOffset), design, disp2$tagwise.dispersion, offset=-offst(dataOffset)) lrt <- glmLRT(fit, coef=2) topTags(lrt) contrast_BvsA <- makeContrasts(TvsN=conditionsB-conditionA, levels=design) contrast_BvsA # Contrasts # Levels BvsA # conditionA -1 # conditionsB 1 # W_1 0 # W_1 is a vector to Remove Unwanted Variation lrt2 <- glmLRT(fit2,contrast = contrast_BvsA )  lrt2$table["geneB",]

#         logFC   logCPM       LR       PValue
# geneB -1.088669 31.60722 23.45083 1.281471e-06

If I do

y <- DGEList(counts = counts(dataOffset), lib.size = colSums(counts(dataOffset)),
samples = (dataOffset@phenoData@data),
group = design[,2], genes = dataOffset@featureData@data, remove.zeros = FALSE)
y <- calcNormFactors(y)

CPMs <- cpm(y, log=TRUE, prior=5)
CPMs[geneB,]
b <- aggregate(CPMs["geneB",]~ (condition),FUN= mean)
b[2,2]-b[1,2]

# 0.2840857

As you can see it's not a small difference.

Thanks,

P

In my answer, I was assuming that you were performing the simplest possible comparison between group means, i.e., a GLM fitted with a one-way layout. However, if you're using blocking factors that are not balanced between groups, all bets are off. To illustrate with a simple linear model:

set.seed(0)
groups <- gl(2, 10)
batch <- factor(rep(c(1,2,1,2), c(8,2,2,8)))
y <- rnorm(length(batch), mean=as.integer(batch)) # complete batch effect

# Difference in sample means:
mean(y[groups=="2"]) - mean(y[groups=="1"])
## [1] -0.1214053

# Linear model:
fit <- lm(y ~ groups + batch)
coef(fit)
## (Intercept)     groups2       batch
## -0.01288533 -0.90730992  1.30984108


So, you can see that the difference in sample means (-0.12) is quite different from the difference returned by the linear model (-0.91). This is because the latter accounts for the effect of the blocking variables on the apparent differences between groups. If you want to represent this in a boxplot, I'd suggest using removeBatchEffect on the log-CPMs with your groups in design= and all blocking factors in covariates=.

Dear Aaron,

I tried your solution but with no luck.

Maybe it's related to the used of the offset from EDAseq package together with the RUVg function from RUVSeq package.

As you can read in the previous post my design is

design <- model.matrix(~conditions + W_1, data=pData(dataOffset))

so the only block factor i see is W_1 (which is the the estimated factors of unwanted variation).

so:

CPMsNoBatch<- removeBatchEffect(CPMs, batch=NULL, batch2=NULL,

covariates=Data2@phenoData@data$W_1, design=as.matrix(design[,2])) bb <- aggregate(CPMsNoBatch["geneB",]~ design[,2],FUN= mean) b[2,2]-b[1,2] # 9.359853 A difference bigger than before. I remind you the lrt2$table["geneB",]

gives FC -1

Maybe I'm doing something wrong or there is an easiest way to integrate EDAseq normalization and RUVg results into edgeR pipeline.

Sorry for the many questions but I'd really like to use this pipeline (or a modified working one) in some papers we are preparing.

Best,

P

set.seed(0)
groups <- gl(2, 10)
batch <- factor(rep(c(1,2,1,2), c(8,2,2,8)))
design <- model.matrix(~groups + batch)

# Mocking up some data.
ngenes <- 1000
batch.means <- matrix(rnorm(2*ngenes), ncol=2)
sample.means <- 100*2^batch.means[,batch]
y <- matrix(rnbinom(length(sample.means), mu=sample.means, size=10), ncol=length(batch))

# Using glmFit/glmLRT for demonstration purposes: normally I would use QL methods.
library(edgeR)
fit <- glmFit(y, design, offset=0, dispersion=0.1)
res <- glmLRT(fit, coef=2)

# Using removeBatchEffect.
lcpms <- cpm(y, log=TRUE, prior.count=1)
corrected <- removeBatchEffect(lcpms, design=design[,1:2,drop=FALSE], covariates=design[,3])
alt.lfc <- rowMeans(corrected[,groups==2]) - rowMeans(corrected[,groups==1])

# Looks pretty consistent to me.
plot(alt.lfc, res$table$logFC)
abline(a=0, b=1, col="red", lwd=2, lty=2)

At low counts, this is less likely to match up due to the effect of the prior.count and the fact that the variance relative to the mean is higher (such that the mean of the logs becomes even more different to the log of the means). But otherwise it should be pretty consistent, as you can see from the plot above. If you're not getting this result, something probably went wrong - try working back from the code I've just provided above.