edgeR removing batch effect before using expression data for clustering
Entering edit mode
Biologist ▴ 110
Last seen 4.1 years ago

I have gene expression data for 8 samples, 4 controls and 4 FOXCUT OE samples.

The raw counts data were in a dataframe counts. coldata looks like below:

Samples   Group              Experiment
Sample1 Control                 EXP1
Sample2 Control                 EXP1
Sample3 Control                 EXP2
Sample4 Control                 EXP2
Sample5 FOXCUT Overexpression   EXP2
Sample6 FOXCUT Overexpression   EXP2
Sample7 FOXCUT Overexpression   EXP1
Sample8 FOXCUT Overexpression   EXP1

I have used following code for filtering and normallization:

group <- factor(paste0(coldata$Group))
y <- DGEList(counts,group = group)

## Filtering 
keep <- rowSums(cpm(y) > 1) >= 2

y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y,method = "TMM") 
design2 <- model.matrix(~ 0 + group + coldata$Experiment)

So, after Normalization I did the batch correction using removeBatchEffect function like below:

y <- removeBatchEffect(y, design = design2)

## For Clustering Heatmap
logCPM <- cpm(y, prior.count=2, log=TRUE)
# Scale data to Z-scores
logCPM <- t(scale(t(logCPM)))

## To select top 10% highly variable genes
vars <- apply(logCPM, 1, IQR) 
clust <- logCPM[vars > quantile(vars, 0.9), ]

I wanted to use clust for a clustering heatmap.

Question: Is the above way of batch correction is right or should I do batch correction after calculating logCPM

edger r batcheffect clustering removeBatchEffect • 1.6k views
Entering edit mode
Last seen 10 hours ago
United States

You should run removeBatchEffect after computing logCPM, but you should probably use voom rather than cpm for that step. The difference being that voom will also generate observation-level weights that will be passed to lmFit for the batch correction.

Entering edit mode

Thanks for the reply. So, after calculating logCPM, then I have to use removeBatchEffect function on that and then scale that for clustering analysis. Is it fine?

logCPM <- cpm(y, prior.count=2, log=TRUE)

logCPM <- removeBatchEffect(logCPM)
logCPM <- t(scale(t(logCPM)))

Is this right?

Entering edit mode

Did you read my answer?

Entering edit mode

Yes, James. But in this Remove batch effect in small RNASeq study (SVA or others?) Gordon recommends to do in the above way and then use it for MDS, PCA or heatmaps. So, I did it in that way.

Is this wrong?

Entering edit mode

No, it's not wrong, particularly if you are doing MDS, which is what he is talking about in that post. There is never any mention of heatmaps, and only the OP is talking about PCA. And Gordon specifically mentions that he isn't enthused with using the scale argument to prcomp, and I would interpret that as being cautious about using methods that assume normality on non-normal data.

You are using scale directly, which centers and scales the data by dividing by the standard deviation (well, the root-mean-square, which is the same if you also center). Since that assumes the data are at least normal-ish, I recommended you use voom, which will incorporate observation-level weights to the lmFit step, and hopefully reduce some of the heteroscedasticity inherent to logCPM data.

That said, if somebody gives you a recommendation, and you then use somebody else's recommendation (that is different from the recommendation you already have), why would you ask the first person if it's OK? You are basically asking me to critique advice that Gordon already gave in another context, without letting me know that's what you are doing.

Entering edit mode

Sorry, I got you now. Before your first comment itself I saw the Gordon's comment in that post and thought of asking you whether it is right. Anyways now I got the point, I will use voom

But If you dont mind could you please help me with the all the syntax I have to use for that. I never did this. Thank you


Login before adding your answer.

Traffic: 561 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6