Question: Using edgeR with "traditional" coverage data?
3.4 years ago by
bastien.chevreux wrote:

Hi there,

For a number of reasons, my RNA-Seq mapping pipeline has been built to generate somewhat more "traditional" coverage data rather than read counts as the result of the mapping. The net effect is that I can generate a number equivalent to "average transcript copy number per given amount of (clean, non-junk) sequenced RNA bases."

Would edgeR still be the appropriate package to find differentially expressed genes, or should I turn to expression packages built for microarray technologies such as Affymetrix?

rnaseq coverage edger
Answer: Using edgeR with "traditional" coverage data?
3.4 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson wrote:

No, as the package documentation makes clear, edgeR requires data on a raw count scale and makes important statistical assumptions based on this requirement. Converting to RPKM, FPKM, TPM, or any similar measure discards important information about both the sequencing depth and the counting precision of each measurement. If you want a package similar to edgeR that you can apply to your data, the best you'll be able to do is probably limma, setting trend=TRUE in the eBayes step. But be aware that any analysis you do on this data will be partially hampered by the inability to take sequencing depth and counting precision into account.
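The loss of "counting precision" under normalization can be illustrated with a toy calculation. This is a hedged Python sketch, not edgeR code; the counts, library sizes, and the Poisson approximation to counting noise are all illustrative assumptions:

```python
import math

def cpm(count, lib_size):
    """Counts per million: a normalized value that, like RPKM/TPM, hides depth."""
    return count / lib_size * 1e6

def poisson_se_of_cpm(count, lib_size):
    """Approximate standard error of CPM under Poisson counting noise.
    Illustrative only: edgeR's NB model adds biological dispersion on top."""
    return math.sqrt(count) / lib_size * 1e6

# Same gene at the same true abundance, sequenced at two different depths:
shallow = (10, 1_000_000)       # 10 reads in a 1M-read library
deep = (1000, 100_000_000)      # 1000 reads in a 100M-read library

print(cpm(*shallow), cpm(*deep))  # 10.0 10.0 -- identical after normalization
print(poisson_se_of_cpm(*shallow), poisson_se_of_cpm(*deep))  # ~3.16 vs ~0.32
```

The normalized values are indistinguishable, but the shallow measurement is roughly ten times noisier; raw counts carry that information, normalized values do not.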


As a follow-up: if edgeR does not take read length into account, one can use it with any read length, right? So how about using, as read counts, the total number of bases of the parts of fragments that cover a gene? This would be equivalent to the number of 1-base reads that mapped to exactly this gene. Would there be hidden traps with that approach?

Well, you'll effectively be counting each fragment multiple times. This would be roughly equivalent to multiplying your original fragment-level counts by the length of each fragment (or at least, the length that overlaps with the gene). edgeR can work with these counts, but there's no advantage in doing so; your counts will increase, but so will your dispersion estimate, so there won't be any gain in power. In fact, there might even be some loss of performance, because any non-NB behaviour of the original counts would be amplified upon multiplication.
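The dispersion inflation Aaron describes can be made concrete with edgeR's NB variance function, var = mu + phi * mu^2. A hedged Python sketch with made-up values for the mean, dispersion, and fragment length:

```python
def nb_variance(mu, phi):
    """Negative binomial variance as parameterized in edgeR: var = mu + phi * mu^2."""
    return mu + phi * mu**2

def implied_dispersion(mean, var):
    """Dispersion an NB fit would infer from a given mean/variance pair."""
    return (var - mean) / mean**2

mu, phi, L = 50.0, 0.1, 100  # hypothetical mean, dispersion, and fragment length

# Multiplying every count by L scales the mean by L and the variance by L^2 ...
scaled_mean = L * mu
scaled_var = L**2 * nb_variance(mu, phi)

# ... so the dispersion estimated from the scaled counts is larger:
print(implied_dispersion(mu, nb_variance(mu, phi)))  # ~0.1, the original phi
print(implied_dispersion(scaled_mean, scaled_var))   # ~0.1198: phi + (1 - 1/L)/mu
```

The inflated counts bring an inflated dispersion estimate with them, which is why there is no gain in power.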

written 3.4 years ago by Aaron Lun

The idea being to transform an "average coverage" (i.e., average gene expression) value back into something I do not have at the moment (namely, read counts) but which edgeR needs. I'll give it a try.

Each read is one count and must be treated that way. You can't pretend that a single 100-base read sampled from a single transcript was actually 100 1-base reads independently sampled from 100 different transcripts, since they are certainly not independent of each other. Doing so would greatly overstate your confidence level and lead to many false positives. (Imagine you're doing a t-test with 3 samples in each group, but you instead pretend that each group has 300 samples.) You could obtain approximate counts by counting the total number of bases in a region and then dividing by the read length. This should at least put your numbers on the same scale as raw counts. Passing these approximate counts to edgeR is probably the best you'll be able to do if you can't recover the original reads.
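The base-counting recovery Ryan describes is a one-liner; here is a hypothetical Python helper, where the fixed read length and full-length alignments are simplifying assumptions that real data will only approximate:

```python
def approximate_counts(total_covered_bases, read_length):
    """Recover approximate read counts from a per-gene total of aligned bases.
    Assumes a fixed read length and (roughly) full-length alignments."""
    return round(total_covered_bases / read_length)

# e.g. a gene covered by 12,500 aligned bases from 100 bp reads:
print(approximate_counts(12_500, 100))  # 125
```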

written 3.4 years ago by Ryan C. Thompson

A closer analogy would be that, for a t-test with 3 samples in each group, you multiply all observations by some constant (representing your fragment length). This does no harm, as the scaling factor increases both the difference in means and the variance estimate, so it cancels out in the t-statistic; but by the same token, it doesn't help either. For the NB model, it's not so simple, as the variance changes naturally with the mean. Getting values close to the original counts would be preferable - just for interpretation, if nothing else - so Ryan's advice still applies.
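The scale-invariance of the t-statistic is easy to verify numerically. A Python sketch with made-up observations, where the constant L stands in for the fragment length:

```python
import math
import statistics

def t_statistic(a, b):
    """Two-sample (Welch) t-statistic."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

group1 = [10.0, 12.0, 11.0]   # made-up expression values, 3 samples per group
group2 = [15.0, 14.0, 16.0]
L = 100                       # hypothetical fragment length as the scaling constant

t_raw = t_statistic(group1, group2)
t_scaled = t_statistic([L * x for x in group1], [L * x for x in group2])

# Scaling multiplies both the mean difference and the standard error by L,
# so it cancels in the ratio:
print(abs(t_raw - t_scaled) < 1e-9)  # True
```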

written 3.4 years ago by Aaron Lun