I have gene expression data for 8 samples, 4 controls and 4 FOXCUT OE samples.
The raw counts data were in a dataframe counts
. coldata
looks like below:
Samples Group Experiment
Sample1 Control EXP1
Sample2 Control EXP1
Sample3 Control EXP2
Sample4 Control EXP2
Sample5 FOXCUT Overexpression EXP2
Sample6 FOXCUT Overexpression EXP2
Sample7 FOXCUT Overexpression EXP1
Sample8 FOXCUT Overexpression EXP1
I have used following code for filtering and normallization:
library(edgeR)
group <- factor(paste0(coldata$Group))
y <- DGEList(counts,group = group)
## Filtering
keep <- rowSums(cpm(y) > 1) >= 2
table(keep)
summary(keep)
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y,method = "TMM")
design2 <- model.matrix(~ 0 + group + coldata$Experiment)
So, after Normalization I did the batch correction using removeBatchEffect
function like below:
y <- removeBatchEffect(y, design = design2)
## For Clustering Heatmap
logCPM <- cpm(y, prior.count=2, log=TRUE)
# Scale data to Z-scores
logCPM <- t(scale(t(logCPM)))
library(matrixStats)
## To select top 10% highly variable genes
vars <- apply(logCPM, 1, IQR)
clust <- logCPM[vars > quantile(vars, 0.9), ]
I wanted to use clust
for a clustering heatmap.
Question: Is the above way of batch correction is right or should I do batch correction after calculating logCPM
Thanks for the reply. So, after calculating logCPM, then I have to use removeBatchEffect function on that and then scale that for clustering analysis. Is it fine?
Is this right?
Did you read my answer?
Yes, James. But in this Remove batch effect in small RNASeq study (SVA or others?) Gordon recommends to do in the above way and then use it for MDS, PCA or heatmaps. So, I did it in that way.
Is this wrong?
No, it's not wrong, particularly if you are doing MDS, which is what he is talking about in that post. There is never any mention of heatmaps, and only the OP is talking about PCA. And Gordon specifically mentions that he isn't enthused with using the scale argument to
prcomp
, and I would interpret that as being cautious about using methods that assume normality on non-normal data.You are using
scale
directly, which centers and scales the data by dividing by the standard deviation (well, the root-mean-square, which is the same if you also center). Since that assumes the data are at least normal-ish, I recommended you usevoom
, which will incorporate observation-level weights to thelmFit
step, and hopefully reduce some of the heteroscedasticity inherent to logCPM data.That said, if somebody gives you a recommendation, and you then use somebody else's recommendation (that is different from the recommendation you already have), why would you ask the first person if it's OK? You are basically asking me to critique advice that Gordon already gave in another context, without letting me know that's what you are doing.
Sorry, I got you now. Before your first comment itself I saw the Gordon's comment in that post and thought of asking you whether it is right. Anyways now I got the point, I will use
voom
But If you dont mind could you please help me with the all the syntax I have to use for that. I never did this. Thank you