edgeR for looking at differences in coverage?
am3 • 0
@am3-20682

I have whole-exome sequencing data from individual insects from two different populations. I'd like to find out whether any genomic regions have much better coverage in one population than in the other. (For example, I want to see whether the exome capture probes work much better for one population, or whether there are any large deletions.) Would there be any conceptual problem with using edgeR for this purpose? Both analyses deal with read counts that vary between individuals and between groups of samples. In principle, is testing for differential coverage, and assessing its statistical significance, any different from doing the same thing for differential gene expression?

edgeR exome • 983 views
Aaron Lun ★ 28k
@alun

Sounds like my PhD in a nutshell.

https://bioconductor.org/packages/csaw

https://bioconductor.org/packages/diffHic

https://bioconductor.org/packages/cydar (postdoc work)

It's generally fine if you select the features correctly. In RNA-seq, this is not a problem because the features are defined for us. In genome-wide applications, we can't test every position in the genome, so we have to do some filtering - the choice of filtering method determines the validity of the results. See:

https://bioconductor.org/packages/devel/workflows/html/csawUsersGuide.html

https://bioconductor.org/packages/devel/workflows/html/chipseqDB.html

... and related publications for more details.


(Sorry for my delayed response, I apparently didn't have notifications on.) Thank you for this; this will be very helpful! In my exome example, would the list of regions defined by the probe set used for exon capture be sufficient to define the features of interest? If not, could you explain further what you mean by "selecting features"?


Yes, pre-defined regions from the probe set are fine. The real problems begin when you have to define the features from the data (e.g., peak calling in ChIP-seq data), which requires some care to avoid circularity and data dredging. You don't have to worry about this when your features are defined in advance (from a separate source of data), which makes the statistics nice and simple.
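
To make "features defined in advance" concrete, here is a minimal sketch in plain Python of counting reads into a fixed list of probe regions. It is only an illustration under stated assumptions, not what edgeR or csaw do internally: the function name `count_reads_in_regions`, the half-open `(chrom, start, end)` interval convention, and the representation of each read as a single `(chrom, position)` pair are all hypothetical choices made for this example. The point is that the regions come from the probe design and never from the read data itself.

```python
from bisect import bisect_left


def count_reads_in_regions(regions, read_positions):
    """Count reads falling in each pre-defined region.

    regions: list of (chrom, start, end) tuples, half-open [start, end).
             Assumed to come from the probe design, not from the data.
    read_positions: list of (chrom, pos) tuples, one per read
                    (hypothetical simplification: one position per read).
    Returns a list of counts in the same order as `regions`.
    """
    # Group read positions by chromosome and sort them once.
    by_chrom = {}
    for chrom, pos in read_positions:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    counts = []
    for chrom, start, end in regions:
        positions = by_chrom.get(chrom, [])
        # Number of sorted positions p with start <= p < end.
        counts.append(bisect_left(positions, end) - bisect_left(positions, start))
    return counts


# Toy example: three probe regions, six reads from one sample.
probe_regions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 0, 50)]
sample_reads = [("chr1", 100), ("chr1", 150), ("chr1", 200),
                ("chr1", 550), ("chr2", 10), ("chr3", 5)]
print(count_reads_in_regions(probe_regions, sample_reads))  # [2, 1, 1]
```

Running this once per sample gives one column of a region-by-sample count matrix, which is the same shape of input as a gene-by-sample matrix in an RNA-seq analysis.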

I would also imagine there to be a fairly clear demarcation between captured and non-captured regions, so filtering should be fairly straightforward. Not like ChIP-seq, where weakly "bound" regions dominate and you need to apply stringent filters to get to the interesting bits. The more stringent the filter, the more apparent errors become in the filtering procedure - see here for some comments.
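
One way to picture a filter that avoids the circularity mentioned above is to rank regions by their average abundance across all samples, without ever looking at which population each sample belongs to. The sketch below is plain Python and makes several assumptions: the names `average_log_cpm` and `keep_by_abundance`, the prior count of 2, and the simple per-sample log2 counts-per-million average are illustrative choices, not the exact calculation used by edgeR or csaw.

```python
import math


def average_log_cpm(counts, prior_count=2.0):
    """Average log2 counts-per-million for each region across all samples.

    counts: list of rows, one row per region, one column per sample.
    prior_count: small constant added to avoid log(0) (assumed value).
    """
    n_samples = len(counts[0])
    # Library size = total count in each sample (column sums).
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]

    averages = []
    for row in counts:
        log_cpms = [
            math.log2((c + prior_count) / (lib_sizes[j] + 2 * prior_count) * 1e6)
            for j, c in enumerate(row)
        ]
        averages.append(sum(log_cpms) / n_samples)
    return averages


def keep_by_abundance(counts, min_log_cpm):
    """Return one boolean per region: True if its average abundance passes.

    Group labels are deliberately not an argument, so the filter cannot
    favour regions that differ between populations.
    """
    return [a >= min_log_cpm for a in average_log_cpm(counts)]


# Toy matrix: a well-captured region, a non-captured region, and a
# region that is covered in samples 1 and 3 but poorly in 2 and 4.
toy_counts = [
    [100, 120, 90, 110],
    [0, 1, 0, 0],
    [50, 5, 60, 4],
]
abundances = average_log_cpm(toy_counts)
threshold = (abundances[1] + abundances[2]) / 2
print(keep_by_abundance(toy_counts, threshold))
```

With a clear gap between captured and non-captured regions, as expected for exome capture, the exact threshold matters little; the third region is retained even though its coverage differs between samples, because only its overall abundance is considered.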
