Question: RNAseq analysis: what comes first, filtering or normalization
apfelbapfel0 wrote (12 weeks ago):

Hi there

Please excuse my very basic questions, but I was not able to find appropriate answers using search engines.

I am trying to analyze a small RNA-seq dataset of 3 vs. 3 samples to identify differentially expressed genes and do some multivariate statistics. Due to the low sample size I chose to use edgeR, but I am a bit confused. In the package user's guide (https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) all steps are nicely explained, but the order seems odd to me: they first describe filtering for low read counts, which in my samples removes quite a bit from the respective libraries, and only then describe TMM normalization to account for the RNA composition effect.

Is this really the right order to do it, or am I confusing things?

So first:

data_edgeR <- DGEList(counts=data_matrix[2:46079,3:10], group=group) #create DGEList for further analyses

data_edgeR$samples #looking at library sizes before filtering

keep <- rowSums(cpm(data_edgeR)>1) >= 3 #keep genes with CPM above 1 in at least 3 samples
data_edgeR_filtered <- data_edgeR[keep, , keep.lib.sizes=FALSE] #subset and recompute the library sizes

and then

data_TMM_normalized <- calcNormFactors(data_edgeR_filtered) #compute TMM normalization factors

Is this correct, or should it be the other way around?

Many thanks!

Tags: rnaseq edger R
Answer: RNAseq analysis: what comes first, filtering or normalization
Aaron Lun (Cambridge, United Kingdom) wrote (12 weeks ago):

You don't explain why the order seems odd to you; knowing that would make it easier to explain things.

Keep in mind that the filtering step with cpm (or with filterByExpr) uses counts-per-million that are effectively library-size-normalized. So it's not as if the filtering is being done on the raw counts. That would obviously be silly in the presence of samples with differing coverage, as the retention of a gene by the filter would depend greatly on its expression in the libraries with deeper coverage.
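To see this concretely, here is a toy sketch (made-up counts, not from your dataset) showing that cpm() on a fresh DGEList simply scales each column by its library size:

library(edgeR)

set.seed(1)
counts <- matrix(rpois(20, lambda=100), nrow=5,
    dimnames=list(paste0("gene", 1:5), paste0("sample", 1:4))) #toy 5-gene x 4-sample count matrix
y <- DGEList(counts=counts)

manual_cpm <- t(t(counts) / colSums(counts)) * 1e6 #counts-per-million computed by hand
all.equal(manual_cpm, cpm(y)) #TRUE: no normalization factors yet, so plain library size scaling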

TMM normalization in calcNormFactors removes composition biases that remain after library size normalization. You could argue that the filtering should be performed on the TMM-normalized counts; naively, this would require calcNormFactors to be run before filtering. However, this is not desirable, as low counts reduce the accuracy of TMM normalization (lots of discreteness and imprecision).
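Concretely, the factors from calcNormFactors enter cpm() through the effective library sizes. Another toy sketch (again made-up counts, not from your dataset):

library(edgeR)

set.seed(1)
counts <- matrix(rpois(400, lambda=50), nrow=100,
    dimnames=list(paste0("gene", 1:100), paste0("sample", 1:4))) #toy 100-gene x 4-sample count matrix
y <- calcNormFactors(DGEList(counts=counts))

eff_lib <- y$samples$lib.size * y$samples$norm.factors #effective library sizes
all.equal(cpm(y), t(t(counts) / eff_lib) * 1e6) #TRUE: cpm() divides by the effective library sizes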

If you must filter on the TMM-normalized counts (e.g., because some of your samples have extreme composition biases), the correct procedure would be to do something analogous to what we do in single-cell data analysis (see the code sketch after the list). That is:

  1. Filter prior to TMM normalization, using the same procedure as described above, but do not recompute the lib.size fields (i.e., use keep.lib.sizes=TRUE).
  2. Run calcNormFactors on the filtered object.
  3. Assign the normalization factors back to the original object. This only works if the lib.size fields were not changed by the filtering.
  4. Redo the filtering step to obtain the final filtered object; cpm will now be aware of the normalization factors.
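For concreteness, a sketch of what those four steps might look like in code. This reuses the data_edgeR object and the CPM > 1 in at least 3 samples threshold from your question; the intermediate variable tmp is my own:

# 1. Filter prior to TMM normalization, keeping the original library sizes.
keep <- rowSums(cpm(data_edgeR)>1) >= 3
tmp <- data_edgeR[keep, , keep.lib.sizes=TRUE]

# 2. Run calcNormFactors on the filtered object.
tmp <- calcNormFactors(tmp)

# 3. Assign the normalization factors back to the original object.
#    This is valid only because the library sizes were not recomputed in step 1.
data_edgeR$samples$norm.factors <- tmp$samples$norm.factors

# 4. Redo the filtering; cpm() is now aware of the normalization factors.
keep <- rowSums(cpm(data_edgeR)>1) >= 3
data_edgeR_filtered <- data_edgeR[keep, , keep.lib.sizes=FALSE]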

This is rather complicated for routine use, and it's difficult to show that it would have any benefit over the usual approach, especially given that most datasets will have normalization factors close to 1 for all samples.
