Dear Dr. Smyth,
We are analyzing some RNA-seq samples collected in different batches, where the batch is a known variable. To account for that, we reasoned that we could include the batch effect in a linear model and then remove it.
Since the voom+limma approach has been shown to work well for differential gene expression, we thought of estimating the weights for each observation with voom and then using them in the limma function removeBatchEffect(). In the end we get log2(cpm) values corrected for the batch (I guess?), and these give some biologically meaningful clustering.
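To make the idea concrete, here is a minimal single-gene sketch in plain Python of what we mean by fitting the batch term with observation weights and then subtracting it. It is only an illustration under our own assumptions: no biological covariates in the design, made-up values and weights, and hypothetical helper names (`weighted_mean`, `remove_batch_weighted`). It is not limma's implementation.

```python
def weighted_mean(values, weights):
    """Weighted average of values, using the per-observation weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def remove_batch_weighted(logcpm, weights, batch):
    """Shift each batch of one gene so that all weighted batch means coincide.

    Each batch is moved to the plain average of the weighted batch means,
    which is what a weighted fit of intercept + batch gives when the batch
    term is coded to sum to zero (an assumption of this sketch).
    """
    labels = sorted(set(batch))
    means = {}
    for b in labels:
        idx = [i for i, lab in enumerate(batch) if lab == b]
        means[b] = weighted_mean([logcpm[i] for i in idx],
                                 [weights[i] for i in idx])
    grand = sum(means.values()) / len(labels)
    # Subtract the estimated batch offset from every observation.
    return [x - (means[b] - grand) for x, b in zip(logcpm, batch)]


# Toy values for one gene (illustrative only, not real data).
logcpm = [5.0, 5.4, 6.1, 6.5]
weights = [1.0, 3.0, 2.0, 2.0]   # stand-ins for voom precision weights
batch = ["A", "A", "B", "B"]

corrected = remove_batch_weighted(logcpm, weights, batch)
```

After the correction, the weighted means of the two batches are equal, while the differences between samples within a batch are left unchanged.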
As a next step, we wanted to cluster genes based on the batch corrected expression values.
So my questions are:
1) Is this approach for batch effect correction valid in the first place?
2) Would it make sense to use the batch-corrected log2(cpm) values to cluster genes, without taking the gene length into account? Should we worry that longer genes could cluster together just because of their length? One solution would be to transform log2(cpm) to RPKM, but by doing this we would lose some benefits of the batch correction step.
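For reference, the conversion we have in mind is, as far as we understand it, a per-gene shift on the log scale: log2(RPKM) = log2(cpm) - log2(gene length in kb). The small Python sketch below only illustrates that arithmetic. It assumes one fixed length per gene, uses made-up numbers, and the helper names (`logcpm_to_logrpkm`, `center`) are hypothetical.

```python
import math


def logcpm_to_logrpkm(logcpm, length_bp):
    """Convert one gene's log2(cpm) values to log2(RPKM).

    Assumes a single fixed gene length (in bp) shared by all samples.
    """
    shift = math.log2(length_bp / 1000.0)  # log2 of the length in kb
    return [x - shift for x in logcpm]


def center(values):
    """Subtract the gene's mean across samples."""
    m = sum(values) / len(values)
    return [v - m for v in values]


# Toy values for one gene across three samples (illustrative only).
gene_logcpm = [4.0, 5.0, 6.0]
gene_logrpkm = logcpm_to_logrpkm(gene_logcpm, length_bp=4000)

centered_cpm = center(gene_logcpm)
centered_rpkm = center(gene_logrpkm)
```

Under that fixed-length assumption the shift is constant within a gene, so it changes the gene's overall level but not its profile across samples once each gene is centered.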
Many thanks in advance for your time and help.