Question

Is it always preferable to recount library sizes of DGEList object in edgeR after filtering?

Lucy ▴ 60

Hi,

I was wondering whether it is always preferable to recalculate the library sizes of your samples after any filtering. The manual recommends this after filtering out lowly expressed genes, but would you also recommend it after filtering to retain only protein-coding genes? Could this not end up skewing the results if some of your samples had high expression of non-protein-coding genes?

The section of code that I am referring to is:

y <- y[keep, , keep.lib.sizes=FALSE]

Many thanks for the advice,

Lucy

edgeR RnaSeqSampleSizeData RNAseq

updated 6 months ago by Gordon Smyth • written 3.2 years ago by Lucy

score 2 · Accepted Answer · 2021-05-03

No, it doesn't skew the results. There is no assumption that the filtered-out genes are equally expressed across the different libraries.

We recommend keep.lib.sizes = FALSE after gene filtering and before calcNormFactors (called normLibSizes in recent edgeR releases), regardless of whether the filtering is by expression level or by annotation type. So yes, we would still recommend it even if you keep protein-coding genes only.
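
A minimal sketch of that order of operations, for illustration only. The counts are simulated and `is_coding` is a hypothetical annotation vector standing in for a real protein-coding flag. It assumes edgeR 4.0 or later for `normLibSizes`; on older versions `calcNormFactors` plays the same role.

```r
library(edgeR)

set.seed(1)
# Simulated counts: 1000 genes x 6 samples (illustrative data only)
counts <- matrix(rnbinom(6000, mu = 50, size = 5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))
group <- factor(rep(c("A", "B"), each = 3))

# Hypothetical annotation: TRUE marks a protein-coding gene
is_coding <- rep(c(TRUE, FALSE), length.out = 1000)

y <- DGEList(counts = counts, group = group)

# Filter by annotation type, recomputing library sizes from the kept genes
y <- y[is_coding, , keep.lib.sizes = FALSE]

# Filter by expression level, again recomputing library sizes
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

# Normalize only after all filtering is done
y <- normLibSizes(y)
y$samples
```

After the subsetting steps, `y$samples$lib.size` holds the column sums of the retained counts rather than the original totals.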

Having said that, setting keep.lib.sizes to TRUE or FALSE is not a crucial issue. The library size normalization done by normLibSizes will re-adjust the library sizes so you will end up with much the same effective library sizes either way. So leaving keep.lib.sizes = TRUE will give nearly the same DE results in the end.
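
One way to check this on your own data is to normalize both versions and compare the effective library sizes (lib.size multiplied by norm.factors). The sketch below uses simulated counts in which the removed genes are deliberately much more highly expressed in three of the samples, which is the scenario raised in the question. The data, the `keep` vector, and the variable names are all made up for illustration, and `normLibSizes` assumes edgeR 4.0 or later.

```r
library(edgeR)

set.seed(1)
n_genes <- 2000
# Hypothetical filter: retain every other gene
keep <- rep(c(TRUE, FALSE), length.out = n_genes)

# Removed genes are 5x more highly expressed in samples 4-6
mu <- matrix(50, nrow = n_genes, ncol = 6)
mu[!keep, 4:6] <- 250
counts <- matrix(rnbinom(n_genes * 6, mu = mu, size = 10), nrow = n_genes)

y <- DGEList(counts = counts)

# Same filter, with and without recomputing the library sizes
y_recount <- normLibSizes(y[keep, , keep.lib.sizes = FALSE])
y_keep    <- normLibSizes(y[keep, , keep.lib.sizes = TRUE])

# Effective library size = lib.size * norm.factors
eff_recount <- y_recount$samples$lib.size * y_recount$samples$norm.factors
eff_keep    <- y_keep$samples$lib.size * y_keep$samples$norm.factors

# Only relative differences between samples matter for DE, so compare
# the two sets up to a constant scaling factor
ratio <- eff_recount / eff_keep
round(ratio / mean(ratio), 3)
```

If the two settings agree, the printed values should all sit close to 1, meaning the two sets of effective library sizes differ only by a constant factor that does not affect comparisons between samples.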