Hi there,
for a number of reasons my RNA-seq mapping pipeline has been built to generate somewhat more "traditional" coverage data rather than read counts as the result of mapping. The net effect is that I am able to generate a number equivalent to "average transcript copy number per given amount of (clean, non-junk) sequenced RNA bases."
Would edgeR still be the appropriate package for finding differentially expressed genes, or should I turn to the expression packages that exist for chip technologies such as Affymetrix?
As a follow-up: if edgeR does not take read length into account, one can use it with any read length, right? So how about using, as read counts, the total number of bases of the parts of fragments that cover a gene? This would be equivalent to the number of 1-base reads that mapped to exactly that gene. Would there be hidden traps with that approach?
Well, you'll effectively be counting each fragment multiple times. This would be roughly equivalent to multiplying your original fragment-level counts by the length of each fragment (or at least, the length that overlaps with the gene). edgeR can work with these counts, but there's no advantage in doing so; your counts will increase, but so will your dispersion estimate, so there won't be any gain in power. In fact, there might even be some loss of performance, because any non-NB behaviour of the original counts would be amplified upon multiplication.
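To make that concrete, here is a minimal simulation sketch, assuming a hypothetical fragment length of 100 bases and simulated NB counts (not real data): inflating the counts 100-fold does not shrink the dispersion estimate, so nothing is gained.

```r
library(edgeR)

set.seed(42)
ngenes <- 1000
counts <- matrix(rnbinom(ngenes * 6, mu = 50, size = 10), nrow = ngenes)
group <- factor(rep(1:2, each = 3))
design <- model.matrix(~group)

# Fragment-level counts, the intended input for edgeR:
y1 <- estimateDisp(DGEList(counts, group = group), design)

# Base-level "counts": every fragment counted ~100 times over.
y2 <- estimateDisp(DGEList(counts * 100L, group = group), design)

# The inflated counts do not shrink the dispersion estimate.
y1$common.dispersion
y2$common.dispersion
```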
The idea being to transform an "average coverage" (i.e. average gene expression) value back into something I don't have at the moment (namely, read counts) but which edgeR needs. I'll give it a try.
Each read is one count and must be treated that way. You can't pretend that a single 100-base read sampled from a single transcript was actually 100 1-base reads independently sampled from 100 different transcripts, since they are certainly not independent of each other. Doing so would greatly overstate your confidence level and lead to many false positives. (Imagine you're doing a t-test with 3 samples in each group, but you instead pretend that each group has 300 samples.) You could obtain approximate counts by counting the total number of bases in a region and then dividing by the read length. This should at least put your numbers on the same scale as raw counts. Passing these approximate counts to edgeR is probably the best you'll be able to do if you can't recover the original reads.
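For illustration, a minimal sketch of that workaround might look like the following. It assumes a hypothetical genes-by-samples matrix `base_counts` holding the total number of sequenced bases overlapping each gene, a read length of 100, and a two-group design; treat it as an outline rather than a definitive pipeline.

```r
library(edgeR)

read_length <- 100  # assumed; substitute your platform's actual read length

# 'base_counts' is a hypothetical genes-by-samples matrix of total
# sequenced bases overlapping each gene in each sample.
approx_counts <- round(base_counts / read_length)

group <- factor(c("ctrl", "ctrl", "ctrl", "trt", "trt", "trt"))  # example design
y <- DGEList(counts = approx_counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = 2)
topTags(res)
```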
A closer analogy would be that, for a t-test with 3 samples in each group, you multiply all observations by some constant (representing your fragment length). This doesn't do any harm, as the scaling factor increases both the difference in means and the variance estimate, so it cancels out in the t-statistic; but by the same token, it doesn't help either. For the NB model, it's not so simple, as the variance changes naturally with the mean. Getting values close to the original counts would be preferable, if only for interpretability, so Ryan's advice is still applicable.
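You can verify the t-test part of the analogy in a couple of lines of R (simulated numbers; any positive constant will do):

```r
# Multiplying every observation by a constant leaves the t-statistic unchanged.
set.seed(1)
x <- rnorm(3, mean = 10)
y <- rnorm(3, mean = 12)

t.test(x, y)$statistic
t.test(100 * x, 100 * y)$statistic  # identical: the factor cancels
```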