As there are genes that could still be involved in cell cycle or are affected by cell cycle, I still want to regress out the effect of cell cycle.
Is this really something you want to do, or are you just doing it because you were told to do it?
I think that people are far, far, far too relaxed about using linear regression to remove uninteresting factors of variation in single-cell data. There is often a lack of appreciation of the underlying assumption - that the covariate(s) being regressed out are orthogonal to the underlying biology. For technical factors of variation, only the most well-designed experiments can reasonably claim to satisfy this assumption; for biological factors like cell cycle, stress response, etc., the default attitude should be that the assumption is violated unless shown otherwise.
Violations of this assumption lead to incorrect results in the - ahem - "corrected" expression values. No other way to say it. I have witnessed people blindly regressing away the batch effect when their batches were confounded with different biological states. This obviously wipes out the biological differences that they were interested in, leaving them wondering why their states disappeared. However, the opposite effect can also happen. It is entirely possible to introduce false positive differences via regression when the population composition differs across different levels of the covariate being regressed out. (This is, in fact, the original motivation for the use of other batch correction methods like MNN.)
I speculate that people like using linear regression for single-cell data because (i) it's simple and so they don't have to think, and (ii) it's what they did for bulk RNA-seq and so they don't have to think. However, the bulk analogy is poor because we usually know all of the other factors of interest in a bulk experiment. If we used
limma::removeBatchEffect, we would almost certainly supply
design= to ensure that our factors of interest are not altered upon removal of the batch effect specified in
batch=. This knowledge is not available in single-cell experiments and so there is little precedent to be drawn from our bulk procedures.
In your case, do you truly believe that cell cycle is independent of the other biological processes happening in your dataset? You're talking about development here, and I would be extremely surprised if there wasn't some correlation between cell cycle and other interesting things like differentiation. All it takes is for a few lineages to exhibit more or less proliferation than the others and you can kiss goodbye to the above assumption.
Your argument is that cycling might affect genes other than the cell cycle genes, in which case removal of the latter is not enough. However, if they're not cell cycle genes, and their expression is altered... is this really just a cell cycle effect anymore? Or is it another process that just happens to be correlated with the cell cycle? If you can answer this question with "yes, it is still a cell cycle effect" (followed by a "I also cannot be bothered to repeat the analysis after removing those extra sort-of-but-not-cell-cycle genes"), only then may you have a case for using regression.
You mention that you got similar results before and after removing cell cycle genes - this sounds like good news to me, in that your conclusions are robust to any differences in cycling activity. IMO regression should be a procedure of last resort that - for any reasonably heterogeneous dataset - raises as many questions as it answers.
1) Shall I process individual dataset for cell cycle effect and used the obtained "corrected" logcounts for integration? 2) Or can I use the output of multiBatchNorm() and perform correction on that and then use the corrected log counts for fastMNN?
If you must do this, I would suggest:
regressBatches() on each SCE with the covariates you want to regress out in
fastMNN() on the corrected matrices, probably with
It's difficult to predict what will crawl out the end of this procedure, but there you have it.
3) Also how can I validate that the cell cycle gene effect was removed in the integrated data, as I cannot plot a PCA for integrated data?
I don't know how you were demonstrating that the cell cycle effect was there in the first place, but it seems you could just do the same thing on the low-dimensional result after correction. If you need some PCs after step 2, you can just use
multiBatchPCA() or its simpler cousin