Question: RNA-seq normalization without filtering
Lin50 wrote:

Hi all,

I have a basic question regarding the normalization of RNAseq data - I understand why we have to normalize the raw counts, but I do not fully understand the biological details and I am confused about the differences between methods - so sorry if the answer is obvious.

Basically, I have ~58,000 transcripts, and I just want to normalize the raw counts and transform them so that I can make comparisons (I have 2 time points and 60 samples per time point). I would like to do it in R.

My question is: is it possible to just normalize and transform my data (I mean something like log-CPM) without filtering first? If so, which method or package (and function) would you suggest?

(I would like to filter and apply a mean-variance stabilization afterwards.)


Whenever you're feeling lost amidst all the options, following an example workflow like this one works wonders.

Briefly, you can and should filter lowly expressed genes before normalisation. This is especially important if you decide to use TMM normalisation (as implemented in edgeR's calcNormFactors), as the method is quite sensitive to gene filtering (correct me if I'm wrong).

Steve Lianoglou wrote:

In the edgeR world, assuming you've got your data in a DGEList y already, you should just be able to do the following without filtering:

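library(edgeR)
y <- calcNormFactors(y)            # TMM normalization factors (the default method)
cpms <- edgeR::cpm(y, log = TRUE)  # log2-CPM using the normalized (effective) library sizes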


Are you running into some problem doing that, or was this more of a theoretical question? Are you asking if it's OK to do this in principle?

Thanks for your answer! Indeed, this was more of a general question (but of course the code is also helpful, because I need to do this with my data). As Mikhael mentioned, I also read in the edgeR workflow that filtering comes before normalization, so I was wondering whether there is a pipeline in which it would not matter if I skip filtering before normalization. I know that with microarray data a lot of people normalize first and filter afterwards, but it seems to me that with RNA-seq it is always the other way around, right?

As Mikhael mentioned, all ratio-based normalization methods are somewhat sensitive to low counts. The most obvious problem is that ratios involving zeroes are either undefined or yield useless scaling factors of zero. Even if we were to put those aside, we now have the problem of highly discrete ratios generated from very low counts. Taking the median or trimming by quantile for such a discrete distribution performs poorly for getting an estimate of the true scaling factor. A related problem is the fact that low counts yield highly variable ratios, which results in further degradation in performance.
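To make the discreteness point concrete, here is a small base-R simulation. It is only a sketch under simplifying assumptions that are mine and not part of the discussion above: counts are Poisson, two hypothetical libraries differ by a known factor of 2, and the scaling factor is estimated by a plain median of per-gene ratios (a stand-in for ratio-based methods in general, not TMM itself). All names (`simulate_ratios`, `true_factor`, and so on) are made up for illustration.

```r
set.seed(42)          # arbitrary seed, only for reproducibility
n_genes <- 1000
true_factor <- 2      # assumed: library B is sequenced twice as deeply as library A

# Simulate per-gene count ratios (B / A) for genes with a given mean count in A.
simulate_ratios <- function(mean_a) {
  a <- rpois(n_genes, mean_a)
  b <- rpois(n_genes, mean_a * true_factor)
  ok <- a > 0 & b > 0  # ratios involving zeros are undefined or zero, so drop them
  b[ok] / a[ok]
}

ratio_low  <- simulate_ratios(1)     # low-abundance genes
ratio_high <- simulate_ratios(1000)  # high-abundance genes

# Compare how many genes are usable, how discrete the ratios are,
# and how variable they are on the log2 scale.
summary_table <- data.frame(
  abundance       = c("low", "high"),
  genes_usable    = c(length(ratio_low), length(ratio_high)),
  distinct_ratios = c(length(unique(ratio_low)), length(unique(ratio_high))),
  median_ratio    = c(median(ratio_low), median(ratio_high)),
  sd_log2_ratio   = c(sd(log2(ratio_low)), sd(log2(ratio_high)))
)
print(summary_table)
```

In the low-abundance group a large share of genes drops out because of zeros, the surviving ratios take only a handful of distinct values, and their spread is much wider, which is why a median or trimmed mean over them is a poor estimate of the true factor.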

If you absolutely must get normalized expression values for all genes for some reason, then one approach is to filter genes by abundance, compute normalization factors from the filtered subset, and then use the normalization factors on the entire set of genes ("transplanting" the y$samples$norm.factors back into the full set). This assumes that the same scaling biases apply across both low- and high-abundance genes, which is tolerable if you're mainly concerned about composition biases. If you do this, it is important that you do not set keep.lib.sizes=FALSE during subsetting (i.e., don't go out of your way to change the defaults). The normalization factors only make sense in the context of the library sizes with which they are calculated, so if you're doing the normalization manually, you need to make sure that the library sizes during calculation of the factors are the same as the library sizes during their application.
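The "transplanting" approach described above could look roughly like the following. This is a minimal sketch, not a definitive recipe: the count matrix is simulated, the object names (`y_full`, `y_filt`, `logcpm_all`) are hypothetical, and using `filterByExpr` as the abundance filter is my assumption (any abundance filter would do).

```r
library(edgeR)

set.seed(1)  # arbitrary seed; the counts below are simulated for illustration only
n_genes <- 2000
n_samples <- 6
mu <- rgamma(n_genes, shape = 0.5, scale = 50)   # mix of low- and high-abundance genes
counts <- matrix(rnbinom(n_genes * n_samples, mu = rep(mu, n_samples), size = 10),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_len(n_samples))))
group <- factor(rep(c("t1", "t2"), each = 3))

y_full <- DGEList(counts = counts, group = group)

# 1. Filter by abundance, keeping the ORIGINAL library sizes
#    (keep.lib.sizes = TRUE is the default; written out here for emphasis).
keep <- filterByExpr(y_full)
y_filt <- y_full[keep, , keep.lib.sizes = TRUE]

# 2. Compute normalization factors from the filtered subset only.
y_filt <- calcNormFactors(y_filt)

# 3. Transplant the factors back into the full, unfiltered object.
y_full$samples$norm.factors <- y_filt$samples$norm.factors

# 4. Normalized log2-CPM values for all genes.
logcpm_all <- cpm(y_full, log = TRUE)
```

Because the library sizes in `y_filt` and `y_full` are identical, the factors computed in step 2 mean the same thing when they are applied in step 4.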

"Basically, I have ~58,000 transcripts, and I just want to normalize the raw counts and transform them so that I can make comparisons (I have 2 time points and 60 samples per time point)."

If you want to do differential expression comparisons, this should be done on the filtered data anyway, because filtering (i) makes the analysis computationally faster, (ii) improves the accuracy of the mean-variance modelling and (iii) reduces the severity of the multiple testing correction.