4 weeks ago by
The short answer is 'don't do that'.
One issue with RNA-Seq is differences in library size. If you took two aliquots of the same exact sample and sequenced twice, one time getting 20M reads, and the other getting 10M reads, the expectation for the counts/gene would be that you would get twice the number of counts for each gene in the first aliquot than the second (because you have twice the reads). You want to account for those differences, because in this silly example you already know that the genes were expressed at the same exact level in both samples, and the difference in counts is due only to the library size.
It's not that simple when you are using biological replicates. All things equal, if you have twice the reads in one sample than the other, you expect about twice the counts/gene. But all things aren't equal, and you may have some genes that are really highly expressed in one sample, which may contribute more than expected to the total library size (this is called compositional bias), in which case you would want to find a set of genes that aren't obviously changing expression bigly between the two samples. The TMM normalization that is the default in calcNormFactors is intended to do that - estimate normalization offsets for each library, based on a reasonable set of genes.
If you subset to some arbitrary set of genes, you may well lose all the genes that aren't really changing, and that would be useful for calculating differences in library size. In addition, all the other genes that you may not be interested in are quite useful for estimating things like the relationship between gene counts and variance (which you need for linear modeling). So in general you should keep all the genes that appear to be expressed at a reasonable level in your data set all the way through the model fit. After that, you can filter out the genes you care about.