Hi All,
When I am analyzing some RNA-seq data from noncoding regions, I bumped into this question.
I only focused on noncoding regions with ncRNA transcripts detected, which only constituted 5-10% of all the sequencing reads. So for examples, if there are 20million reads of a RNA-seq sample, the reads from noncoding region is perhaps 1million.
So now the question becomes, shall I manually input the total sequencing read number for a specific sample (e.g 20million) as the y$samples$lib.size when I am using edgeR, or shall I let edgeR do it automatically (as far as I understand, edgeR will automatically sum all reads in the DGElist of a sample, which in my case will be 1 million as only ncRNAs were calculated here).
The logic behind this is that - I assume the total RNA-seq reads of different samples are comparable, rather than the reads from noncoding region only. So I will consider that it is correct to manually input the specific reads number of a sample. e.g. (y$samples$lib.size <- c(12624935, 13276916, 20222212, 21997722, 16789741, 29073257))
This question will also affects the y$samples$norm.factors. (y <- calcNormFactors(y)). If I let the edgeR do it automatically, the norm.factors will be only take into consideration the variability of noncoding RNAs in different samples.
Thank you so much for and suggestions!
wenbo
What you should be doing is:
DGEList
object from your count data.calcNormFactors
on the filtered object.There's no need to manually change
lib.size
ornorm.factors
. In fact, you shouldn't have any meaningfulnorm.factors
before filtering, because you shouldn't be runningcalcNormFactors
on the unfiltered data. As I said before,calcNormFactors
will adjust for the library sizes that are present in the suppliedDGEList
object, so changes in the total count after filtering will automatically be considered without requiring any further intervention.P.S. Add comments to answers via "add comment", unless you're answering your own question.