I've been looking into normalization more and more, and I have a few questions that some of you might be able to answer or want to discuss.
So there is within-sample normalization (TPM or others), i.e. relative abundances, and between-sample normalization (TMM or others). Is it ever necessary to do both, i.e. is it ever necessary to normalize relative abundances across a cohort?
I don't think it would be, but another scenario that seems quite common when working with a large cohort is filtering out isoforms that have no expression in 90% (or some other threshold) of the samples. If you do this while working with TPM, the TPM values summed over the remaining isoforms will no longer be the same for every subject. Would it make sense to then use TMM after such a filtering step? I think it would.
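To make the filtering concern concrete, here is a minimal sketch with made-up counts and isoform lengths. The `tpm` helper, the toy matrix, and the "expressed in at least half of the samples" cutoff are all assumptions chosen for illustration, not anyone's published pipeline. It only shows that per-sample TPM sums stop being equal after filtering, and that simply rescaling the remaining values restores them; whether a TMM-style step is the better fix is exactly the open question.

```python
import numpy as np

def tpm(counts, lengths_kb):
    """Convert a (isoforms x samples) count matrix to TPM."""
    rate = counts / lengths_kb[:, None]       # reads per kilobase of isoform
    return rate / rate.sum(axis=0) * 1e6      # each sample column sums to 1e6

# Made-up toy data: 4 isoforms x 3 samples.
counts = np.array([
    [100.0, 200.0, 150.0],
    [ 50.0,   0.0,   0.0],
    [300.0, 250.0, 400.0],
    [ 10.0,   0.0,   0.0],
])
lengths_kb = np.array([1.0, 2.0, 1.5, 0.5])

tpm_all = tpm(counts, lengths_kb)

# Hypothetical filter: keep isoforms with TPM > 0 in at least half of the samples.
keep = (tpm_all > 0).mean(axis=1) >= 0.5
tpm_filtered = tpm_all[keep]

sums_before = tpm_all.sum(axis=0)        # 1e6 for every sample
sums_after = tpm_filtered.sum(axis=0)    # differs between samples after filtering

# Simplest possible repair: rescale the surviving isoforms back to 1e6 per sample.
tpm_rescaled = tpm_filtered / sums_after * 1e6

print("sums before filtering:", sums_before)
print("sums after filtering: ", sums_after)
print("sums after rescaling: ", tpm_rescaled.sum(axis=0))
```

In this toy case only the first sample had any expression in the dropped isoforms, so only its column sum falls below one million; the other two are unchanged.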
Do you think this kind of isoform filtering is flawed in some way?
My guess is that it is used because people are worried about the sensitivity of RNA-seq, and because most think that, biologically, a good percentage of genes are not expressed in any specific tissue type. So I think it makes some sense.
It seems like all between-sample normalization methods require raw counts as input, and leave it at that. I read Harold Pimentel's blog post about it (https://haroldpimentel.wordpress.com/2014/12/08/in-rna-seq-2-2-between-sample-normalization/, very informative), but I haven't seen a follow-up about this problem, if it is a problem.
I'm new to this stuff, so I was wondering what others' thoughts are on the issue.
Another rookie here, sorry if I'm misunderstanding something obvious. I am a little confused by what constitutes a 'sample'. I'm doing single-cell RNA-seq with 100 cells, with cDNA for each cell made separately, then pooled for library enrichment and sequenced together in a single lane. I get that between-sample normalization is necessary when comparing data across different lanes or sequencing runs. But what about the case where all samples (multiplexed, of course) were run in a single lane? My single cells have varying total read counts (0.8 to 1.2 million reads), so should I treat each single cell as a sample and do within-sample normalization (e.g., TPM) on it? Would between-sample normalization be necessary if I wanted to compare expression of gene A across the 100 cells I sequenced together?
All my advice above is for "ordinary" RNA-seq. I have no experience with single-cell sequencing, but I know that it requires a quite different statistical approach, so you probably shouldn't try to apply anything from this question to it. If you want to know about single-cell sequencing, ask a separate question about that.
In any case, though, I can tell you that normalizing for the technical batch effects inherent in the sequencing technology is only one of the reasons that between-sample normalization is required. You should read the TMM paper for a more detailed explanation of this topic (even though the TMM method itself may not be suitable for single-cell data): http://www.genomebiology.com/2010/11/3/R25
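As a toy illustration of one such reason, the composition effect discussed in that paper, here is a small sketch with made-up counts for ordinary bulk data. Two samples are sequenced to the same depth, and four of the five genes have identical true expression, but one gene is strongly induced in sample B. Note that the correction used here is a plain median of log ratios, a deliberately simplified stand-in and not TMM itself (TMM uses a weighted, trimmed mean of log ratios).

```python
import numpy as np

# Made-up counts for 5 genes; both samples have the same total depth (1000 reads).
# Genes 0-3 are truly unchanged; gene 4 is 6-fold higher in sample B, so it
# takes up a larger share of B's reads and squeezes the other genes' counts.
counts_a = np.array([200.0, 200.0, 200.0, 200.0, 200.0])
counts_b = np.array([100.0, 100.0, 100.0, 100.0, 600.0])

# Within-sample scaling only (counts per million).
cpm_a = counts_a / counts_a.sum() * 1e6
cpm_b = counts_b / counts_b.sum() * 1e6

# Naive log2 fold changes: the unchanged genes look 2-fold down.
log_fc = np.log2(cpm_b / cpm_a)

# Simplified between-sample correction (NOT TMM): assume most genes are
# unchanged and subtract the median log ratio.
shift = np.median(log_fc)
log_fc_adjusted = log_fc - shift

print("naive log2 fold changes:   ", log_fc)
print("adjusted log2 fold changes:", log_fc_adjusted)
```

Even with identical sequencing depth and no lane or batch differences, the naive comparison is off for every unchanged gene, which is why library-size scaling alone is not always enough.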