Question: edgeR for looking at differences in coverage?
5 months ago by
am30 wrote:

I have whole-exome sequencing data from individual insects from two different populations. I'm interested in seeing whether any genome regions have much better coverage in one population than in the other. (For example, I want to see whether the exome capture probes work much better for one population, or whether there are any large deletions.) Would there be any problem, conceptually, with using edgeR for this purpose? Both analyses deal with read abundances that vary between individuals and between groups of samples. In principle, is calculating differential coverage and its statistical significance any different from doing the same thing for differential gene expression?

Answer: edgeR for looking at differences in coverage?
5 months ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Sounds like my PhD in a nutshell. (postdoc work)

It's generally fine if you select the features correctly. In RNA-seq, this is not a problem because the features are defined for us. In genome-wide applications, we can't test every position in the genome, so we have to do some filtering - the choice of filtering method determines the validity of the results. See:

... and related publications for more details.
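As a rough illustration of filtering that does not look at the group labels, here is a minimal sketch using average abundance across all samples. The counts are simulated stand-ins, and the cutoff of 8 is a hypothetical value chosen for this toy data, not a recommendation:

```r
library(edgeR)

set.seed(42)
# Simulated stand-in for real data: 2000 regions x 6 individuals.
# Regions 1-1500 mimic off-target background, 1501-2000 mimic captured exons.
mu <- c(rep(1, 1500), rep(100, 500))
counts <- matrix(rnbinom(2000 * 6, mu = mu, size = 10), nrow = 2000)
group <- factor(rep(c("popA", "popB"), each = 3))

y <- DGEList(counts = counts, group = group)

# Average log2-CPM over ALL samples; the group labels are not used here,
# so the filter cannot favour regions that differ between populations.
ab <- aveLogCPM(y)
keep <- ab > 8  # hypothetical cutoff, picked by eye for this simulation

# Drop the filtered regions and recompute library sizes.
y_filt <- y[keep, , keep.lib.sizes = FALSE]
table(keep)
```

The point of the sketch is only that the filter statistic is computed without reference to which population each sample belongs to; a real cutoff would need to be chosen from the observed abundance distribution.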


(Sorry for my delayed response, I apparently didn't have notifications on.) Thank you for this; this will be very helpful! In my exome example, would the list of regions defined by the probe set used for exon capture be sufficient to define the features of interest? If not, could you explain further what you mean by "selecting features"?

written 5 months ago by am30

Yes, pre-defined regions from the probe set are fine. The real problems begin when you have to define the features from the data (e.g., peak calling in ChIP-seq data), which requires some care to avoid circularity and data dredging. You don't have to worry about this when your features are defined in advance (from a separate source of data), which makes the statistics nice and simple.

I would also imagine there to be a fairly clear demarcation between captured and non-captured regions, so filtering should be fairly straightforward. This is unlike ChIP-seq, where weakly "bound" regions dominate and you need to apply stringent filters to get to the interesting bits. The more stringent the filter, the more apparent any errors in the filtering procedure become - see here for some comments.
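For the pre-defined probe regions case, a minimal sketch of a standard edgeR quasi-likelihood analysis might look like the following. Everything about the input is an assumption for illustration: the counts are simulated, the names `probe1`, `ind1`, `popA` and `popB` are hypothetical, and one region is given an artificial coverage drop to mimic a deletion. In practice the matrix would come from counting reads over the probe intervals (e.g., with `featureCounts` from Rsubread or `regionCounts` from csaw).

```r
library(edgeR)

set.seed(1)
n <- 500
# Simulated stand-in for a regions-by-individuals count matrix.
counts <- matrix(rnbinom(n * 6, mu = 200, size = 20), nrow = n,
                 dimnames = list(paste0("probe", seq_len(n)),
                                 paste0("ind", 1:6)))
# Hypothetical deletion: probe1 has almost no coverage in population B.
counts[1, 4:6] <- rpois(3, lambda = 2)
group <- factor(rep(c("popA", "popB"), each = 3))

y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y)                 # drop regions with too few reads
y <- y[keep, , keep.lib.sizes = FALSE]
# TMM assumes most regions show no coverage difference; a global difference
# in capture efficiency between populations would be normalized away.
y <- calcNormFactors(y)

design <- model.matrix(~ group)         # coefficient 2 is popB versus popA
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust = TRUE)
res <- glmQLFTest(fit, coef = 2)

tab <- topTags(res, n = Inf)$table
head(tab)
```

A negative `logFC` here means lower coverage in `popB` relative to `popA`. The normalization caveat in the comments matters for the capture-efficiency question: region-level tests of this kind detect regions that differ relative to the bulk, not an overall shift in coverage.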

modified 5 months ago • written 5 months ago by Aaron Lun25k