I am currently analysing scRNA-seq data. As far as I know, several normalization methods are available for differential gene expression analysis. In my case, however, I have a predefined set of genes (n=530), and I want to compare the expression of these genes between different kinds of cells (inter-sample comparison).
To my (very basic) knowledge, accounting for the total number of reads is important for between-sample comparisons, so CPM should do the trick. But I do not know whether I should normalize in other ways as well. Accounting for gene length does not seem necessary, since the genes are the same in every condition. Samples can be from different subjects, but all the data come from one dataset.
Thank you for your help.
Thank you so much for your answer! Forgive my ignorance, but I feel like these packages were designed for a fundamentally different question. I'm not doing a DE analysis; I already have a predefined set of genes I want to compare. So currently I am summing the CPM values of these genes and comparing this summed expression between cell types. But I'm wondering whether I'm overlooking some normalization I should perform on these counts.
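For concreteness, the calculation described here can be sketched as follows. This is only a minimal illustration in plain NumPy, assuming a cells x genes matrix of raw counts; the toy matrix, the cell-type labels, the gene indices and the function names `cpm` and `gene_set_score` are all made up for the example and do not come from any package.

```python
import numpy as np

def cpm(counts):
    """Counts per million for a cells x genes array of raw counts."""
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)  # total counts per cell
    # Cells with zero total counts stay at zero instead of dividing by zero
    return np.divide(counts * 1e6, lib_size,
                     out=np.zeros_like(counts), where=lib_size > 0)

def gene_set_score(counts, gene_idx):
    """Sum of CPM over the chosen gene columns, one value per cell."""
    return cpm(counts)[:, gene_idx].sum(axis=1)

# Toy data (hypothetical): 4 cells x 4 genes, two cell types
counts = np.array([[10, 0, 30, 60],
                   [5, 5, 0, 10],
                   [0, 20, 20, 60],
                   [1, 1, 1, 1]])
cell_types = np.array(["A", "A", "B", "B"])
gene_idx = [0, 2]  # columns of the predefined gene set (hypothetical)

scores = gene_set_score(counts, gene_idx)
mean_by_type = {t: scores[cell_types == t].mean()
                for t in np.unique(cell_types)}
print(mean_by_type)
```

Note that this only corrects for sequencing depth per cell. It does nothing about other sources of variation, such as differences between subjects.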
Sure, but allow me to put my answer another way: I see no problem in using a standardised workflow for your data, such as those provided by the scran or Seurat authors. With either, you can normalise your raw UMI counts (if that's what you have?), deal with batch effects, low-count genes, mitochondrial artefacts, etc., and, ultimately, transform the data so that they are approximately normally distributed and suitable for whichever parametric downstream statistical test you want to use.
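To give a rough idea of what the normalisation and transformation steps amount to, here is a minimal NumPy sketch of library-size scaling followed by a log transform, plus a simple low-count gene filter. It is not the scran or Seurat API, and it does not address batch effects or mitochondrial content. The function names, the scale factor of 10,000 and the `min_cells` threshold are assumptions chosen for illustration.

```python
import numpy as np

def log_normalize(counts, scale_factor=1e4):
    """Scale each cell to a common total, then apply log(1 + x).

    counts: cells x genes array of raw (e.g. UMI) counts.
    scale_factor: assumed common total per cell (illustrative choice).
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    scaled = np.divide(counts * scale_factor, lib_size,
                       out=np.zeros_like(counts), where=lib_size > 0)
    return np.log1p(scaled)

def keep_detected_genes(counts, min_cells=3):
    """Boolean mask of genes with non-zero counts in at least min_cells cells."""
    counts = np.asarray(counts)
    return (counts > 0).sum(axis=0) >= min_cells

# Toy data (hypothetical): 3 cells x 3 genes
counts = np.array([[1, 1, 0],
                   [0, 4, 0],
                   [2, 2, 4]])
mask = keep_detected_genes(counts, min_cells=2)  # drop rarely detected genes
logn = log_normalize(counts)[:, mask]
print(mask, logn.shape)
```

The log transform reduces the skew of count data, but it does not guarantee normality, so it is still worth checking the distribution before applying a parametric test.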
If you are content to just calculate the CPM values manually, then that is fine, but a seasoned reviewer will criticise you for not dealing with the known sources of bias in a scRNA-seq analysis.
How your scRNA-seq wet-lab protocol was conducted is important, as is the count method used.