Hi All,
When I am analyzing some RNA-seq data from noncoding regions, I bumped into this question.
I only focused on noncoding regions with ncRNA transcripts detected, which only constituted 5-10% of all the sequencing reads. So for examples, if there are 20million reads of a RNA-seq sample, the reads from noncoding region is perhaps 1million.
So now the question becomes, shall I manually input the total sequencing read number for a specific sample (e.g 20million) as the y$samples$lib.size when I am using edgeR, or shall I let edgeR do it automatically (as far as I understand, edgeR will automatically sum all reads in the DGElist of a sample, which in my case will be 1 million as only ncRNAs were calculated here).
The logic behind this is that - I assume the total RNA-seq reads of different samples are comparable, rather than the reads from noncoding region only. So I will consider that it is correct to manually input the specific reads number of a sample. e.g. (y$samples$lib.size <- c(12624935, 13276916, 20222212, 21997722, 16789741, 29073257))
This question will also affects the y$samples$norm.factors. (y <- calcNormFactors(y)). If I let the edgeR do it automatically, the norm.factors will be only take into consideration the variability of noncoding RNAs in different samples.
Thank you so much for and suggestions!
wenbo

What you should be doing is:
DGEListobject from your count data.calcNormFactorson the filtered object.There's no need to manually change
lib.sizeornorm.factors. In fact, you shouldn't have any meaningfulnorm.factorsbefore filtering, because you shouldn't be runningcalcNormFactorson the unfiltered data. As I said before,calcNormFactorswill adjust for the library sizes that are present in the suppliedDGEListobject, so changes in the total count after filtering will automatically be considered without requiring any further intervention.P.S. Add comments to answers via "add comment", unless you're answering your own question.