Question: edgeR removing batch effect before using expression data for clustering
0
6 months ago by
Biologist70
Biologist70 wrote:

I have gene expression data for 8 samples, 4 controls and 4 FOXCUT OE samples.

The raw counts data were in a dataframe counts. coldata looks like below:

Samples   Group              Experiment
Sample1 Control                 EXP1
Sample2 Control                 EXP1
Sample3 Control                 EXP2
Sample4 Control                 EXP2
Sample5 FOXCUT Overexpression   EXP2
Sample6 FOXCUT Overexpression   EXP2
Sample7 FOXCUT Overexpression   EXP1
Sample8 FOXCUT Overexpression   EXP1


I have used following code for filtering and normallization:

library(edgeR)
group <- factor(paste0(coldata$Group)) y <- DGEList(counts,group = group) ## Filtering keep <- rowSums(cpm(y) > 1) >= 2 table(keep) summary(keep) y <- y[keep, , keep.lib.sizes=FALSE] y <- calcNormFactors(y,method = "TMM") design2 <- model.matrix(~ 0 + group + coldata$Experiment)


So, after Normalization I did the batch correction using removeBatchEffect function like below:

y <- removeBatchEffect(y, design = design2)

## For Clustering Heatmap
logCPM <- cpm(y, prior.count=2, log=TRUE)
# Scale data to Z-scores
logCPM <- t(scale(t(logCPM)))

library(matrixStats)
## To select top 10% highly variable genes
vars <- apply(logCPM, 1, IQR)
clust <- logCPM[vars > quantile(vars, 0.9), ]


I wanted to use clust for a clustering heatmap.

Question: Is the above way of batch correction is right or should I do batch correction after calculating logCPM

modified 6 months ago by James W. MacDonald51k • written 6 months ago by Biologist70
Answer: edgeR removing batch effect before using expression data for clustering
0
6 months ago by
United States
James W. MacDonald51k wrote:

You should run removeBatchEffect after computing logCPM, but you should probably use voom rather than cpm for that step. The difference being that voom will also generate observation-level weights that will be passed to lmFit for the batch correction.

Thanks for the reply. So, after calculating logCPM, then I have to use removeBatchEffect function on that and then scale that for clustering analysis. Is it fine?

logCPM <- cpm(y, prior.count=2, log=TRUE)

logCPM <- removeBatchEffect(logCPM)
logCPM <- t(scale(t(logCPM)))


Is this right?

Yes, James. But in this Remove batch effect in small RNASeq study (SVA or others?) Gordon recommends to do in the above way and then use it for MDS, PCA or heatmaps. So, I did it in that way.

Is this wrong?

No, it's not wrong, particularly if you are doing MDS, which is what he is talking about in that post. There is never any mention of heatmaps, and only the OP is talking about PCA. And Gordon specifically mentions that he isn't enthused with using the scale argument to prcomp, and I would interpret that as being cautious about using methods that assume normality on non-normal data.

You are using scale directly, which centers and scales the data by dividing by the standard deviation (well, the root-mean-square, which is the same if you also center). Since that assumes the data are at least normal-ish, I recommended you use voom, which will incorporate observation-level weights to the lmFit step, and hopefully reduce some of the heteroscedasticity inherent to logCPM data.

That said, if somebody gives you a recommendation, and you then use somebody else's recommendation (that is different from the recommendation you already have), why would you ask the first person if it's OK? You are basically asking me to critique advice that Gordon already gave in another context, without letting me know that's what you are doing.

Sorry, I got you now. Before your first comment itself I saw the Gordon's comment in that post and thought of asking you whether it is right. Anyways now I got the point, I will use voom

But If you dont mind could you please help me with the all the syntax I have to use for that. I never did this. Thank you