Question

RNA-seq normalization without filtering

1

Lin ▴ 50

@lin-19103

Last seen 5.2 years ago

Hi all,

I have a basic question regarding the normalization of RNAseq data - I understand why we have to normalize the raw counts, but I do not fully understand the biological details and I am confused about the differences between methods - so sorry if the answer is obvious.

Basically, I have ~58,000 transcripts, and I just want to normalize the raw counts and transform them so that I can make comparisons (I have 2 time points and 60 samples per time point). I would like to do it in R.

My question is: is it possible to just normalize and transform my data (I mean something like (log)CPM) without filtering first? If so, can you suggest a method or package (and function) I could use?

(I would like to filter and apply a mean-variance stabilization afterwards.)

Thank you for any advice!

deseq2 edger normalization RNAseq • 3.1k views

ADD COMMENT • link updated 6.3 years ago by Steve Lianoglou ★ 13k • written 6.3 years ago by Lin ▴ 50

1

Whenever you feel lost amid all the options, following an example workflow like this one works wonders.

Briefly, you could and should filter lowly expressed genes before normalisation. This is especially important if you decide to use TMM normalisation (as implemented in edgeR's calcNormFactors), as the method is quite sensitive to gene filtering (correct me if I'm wrong).

ADD REPLY • link 6.3 years ago mikhael.manurung ▴ 280

score 3 · Answer 1 · 2019-08-09

3

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 12 days ago

United States

In the edgeR world, assuming you've got your data in a DGEList y already, you should just be able to do the following without filtering:

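library(edgeR)
y <- calcNormFactors(y)            # TMM normalization factors (the default method)
cpms <- edgeR::cpm(y, log = TRUE)  # log2-CPM using the effective library sizes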

Are you running into some problem doing that, or was this more of a theoretical question? Are you asking if it's OK to do this in principle?

ADD COMMENT • link 6.3 years ago Steve Lianoglou ★ 13k

0

Thanks for your answer! Indeed, this was more of a general question (but of course the code is also helpful because I need to do it with my data). As Mikhael mentioned, I also read in the edgeR workflow that filtering comes before normalization. So I was wondering whether there is a pipeline where it would not matter if I skip filtering before normalization. I know that with microarray data a lot of people normalize first and filter afterwards, but it seems to me that it is always the other way around with RNA-seq, right?

ADD REPLY • link 6.3 years ago Lin ▴ 50

2

As Mikhael mentioned, all ratio-based normalization methods are somewhat sensitive to low counts. The most obvious problem is that ratios involving zeroes are either undefined or yield useless scaling factors of zero. Even if we were to put those aside, we now have the problem of highly discrete ratios generated from very low counts. Taking the median, or trimming by quantile, of such a discrete distribution gives a poor estimate of the true scaling factor. A related problem is that low counts yield highly variable ratios, which degrades performance further.
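To make the point about discrete ratios concrete, here is a small base-R simulation. It is only a sketch under made-up assumptions: Poisson counts, a true scaling factor of 1.5 between two samples, and arbitrary means of 1 (low abundance) and 200 (high abundance). None of these numbers come from the thread.

```r
# Illustrative simulation only; the means, gene count and seed are arbitrary assumptions.
set.seed(1)
n_genes <- 5000
true_factor <- 1.5  # hypothetical: sample B is sequenced 1.5x deeper than sample A

# Per-gene count ratios between two samples at a given mean abundance
simulate_ratios <- function(mu) {
  a <- rpois(n_genes, mu)
  b <- rpois(n_genes, mu * true_factor)
  b / a  # 0/0 gives NaN, x/0 gives Inf, 0/x gives 0
}

low  <- simulate_ratios(mu = 1)
high <- simulate_ratios(mu = 200)

# Fraction of genes whose ratio is undefined, infinite or zero
frac_unusable_low  <- mean(!is.finite(low)  | low  == 0)
frac_unusable_high <- mean(!is.finite(high) | high == 0)

# Number of distinct usable ratio values (a measure of discreteness)
usable_low  <- low[is.finite(low) & low > 0]
usable_high <- high[is.finite(high) & high > 0]
n_distinct_low  <- length(unique(usable_low))
n_distinct_high <- length(unique(usable_high))

# Median of the usable ratios as a crude estimate of the scaling factor
median_low  <- median(usable_low)
median_high <- median(usable_high)

print(c(frac_unusable_low, frac_unusable_high))
print(c(n_distinct_low, n_distinct_high))
print(c(median_low, median_high))
```

Comparing the printed values for the low- and high-abundance genes shows how many ratios are unusable at low counts and how few distinct values the remaining ones take, which is why a median or trimmed mean over them is a shaky estimator.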

If you absolutely must get normalized expression values for all genes for some reason, then one approach is to filter genes by abundance, compute normalization factors from the filtered subset, and then use the normalization factors on the entire set of genes ("transplanting" the y$samples$norm.factors back into the full set). This assumes that the same scaling biases apply across both low- and high-abundance genes, which is tolerable if you're mainly concerned about composition biases. If you do this, it is important that you do not set keep.lib.sizes=FALSE during subsetting (i.e., don't go out of your way to change the defaults). The normalization factors only make sense in the context of the library sizes with which they are calculated, so if you're doing the normalization manually, you need to make sure that the library sizes during calculation of the factors are the same as the library sizes during their application.
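The "transplanting" approach described here could look roughly like the sketch below. The count matrix is simulated purely for illustration (2000 genes, 6 samples, two hypothetical time points), and the object names y_all, y_filt and logcpm_all are placeholders. filterByExpr is used as one possible abundance filter; any other abundance-based filter would fit the same pattern.

```r
library(edgeR)

# Simulated counts for illustration only (assumption: 1000 low- and 1000 high-abundance genes)
set.seed(42)
counts <- matrix(
  rnbinom(2000 * 6, mu = rep(c(2, 200), each = 1000), size = 5),
  nrow = 2000
)
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))
group <- factor(rep(c("t1", "t2"), each = 3))  # hypothetical time points

y_all <- DGEList(counts = counts, group = group)

# Filter by abundance. Subsetting leaves keep.lib.sizes at its default (TRUE),
# so the library sizes of the filtered object match those of the full object.
keep <- filterByExpr(y_all, group = group)
y_filt <- y_all[keep, ]

# Compute TMM normalization factors from the filtered subset only
y_filt <- calcNormFactors(y_filt)

# Transplant the factors back into the full, unfiltered object
y_all$samples$norm.factors <- y_filt$samples$norm.factors

# log2-CPM for all genes, using the factors estimated from the filtered genes
logcpm_all <- cpm(y_all, log = TRUE)
```

Because the library sizes are identical in y_all and y_filt, the factors are applied in the same context in which they were calculated, which is the condition the paragraph above insists on.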

> Basically, I have ~58,000 transcripts, and I just want to normalize the raw counts and transform them so that I can make comparisons (I have 2 time points and 60 samples per time point).

If you want to do differential expression comparisons, this should be done on the filtered data anyway, because filtering (i) makes the analysis computationally faster, (ii) improves the accuracy of the mean-variance modelling and (iii) reduces the severity of the multiple testing correction.

ADD REPLY • link 6.3 years ago Aaron Lun ★ 29k