Question

Question about repurposing DESeq2 for CLIP-seq

0

Entering edit mode

gpratt • 0

@gpratt-11806

Last seen 8.4 years ago

We wanted to ask for your input about adapting the deseq dispersion and fold-change shrinkage (MAP estimation) in DEseq for CLIP/ChIP-seq fold-enrichments. In CLIP-seq, we often have fold-enrichments based on fairly low read densities (for example, 20 reads in IP versus 1 or 0 reads in input), and so we're exploring ways to properly estimate the true fold-enrichment. However, as IP and input will obviously have extremely different coverage profiles, it is quite unlike the standard RNA-seq dataset (where the baseline assumption is ~5-10% of genes will change, but most should show 0-fold change between the conditions).

Effectively, what we'd like to do is train the MAP estimation model based on one pair of datasets (for example, one input versus a biological replicate input), and then be able to perform shrinkage of the CLIP versus Input comparison based on the input-vs-input model. Is this something you think would be doable (and can you recommend any ways we could modify the deseq pipeline to do this), or do you have any other suggestion of how to approach this?

Thanks!

deseq2 • 1.3k views

ADD COMMENT • link updated 8.4 years ago by eric.vannostrand • 0 • written 8.4 years ago by gpratt • 0

0

Entering edit mode

Hi Michael,

Thanks for your fast reply!

You're right - our primary concern with simply using DESeq2 (or equivalent) on the dataset is that the distribution is trivially non-zero centered (either we start with a list of 'peaks', in which they'll be shifted towards y > 0; or we use a larger set of genomic regions, in which case they will be shifted y < 0).

What we're thinking is that we could use two 'input' biological replicates to estimate these factors based on some randomly selected regions - this should work fine, as the variability will basically be random biological + technical error. What we'd then like to do is use those trained values to perform normalization on a separate pair of files - effectively training the model on one pair of datasets, and then perform MLE normalization using that model on another dataset

Thanks!

-Eric

ADD REPLY • link 8.4 years ago eric.vannostrand • 0

0

Entering edit mode

Can you rephrase the question. I'm missing it a bit. I don't understand "use those trained values to perform normalization on a separate pair of files" or "MLE normalization".

The MAP vs MLE values in DESeq2 are for the dispersion parameter alpha and the log fold changes beta. There is no MAP vs MLE distinction on the size factors, which are estimated using what we call the "median ratio" method, and then these estimates are later plugged into the model as fixed quantities.

ADD REPLY • link 8.4 years ago Michael Love 43k

score 1 · Answer 1 · 2016-11-08

hi,

If I were going to use DESeq2 for such a dataset, I would look closely at the MA plot to make sure that that "null" regions are properly falling on the y=0 line. You may need to find ways to estimate the size factors, such as removing the enriched rows from the set used for size factor estimation, so that the normalization works as planned, and puts the "null" regions on the y=0 line in the MA plot.

Regarding the fold change moderation, it's hard to say what the true LFC should be, when you have 20 vs 0. DESeq2 will moderate these, but I would also examine the unmoderated LFCs: just use addMLE=TRUE when you run results(). You could also compare to the LFCs calculated as the ratio of normalized counts with increasing pseudocounts. The moderated LFCs tended to work well in our testing on bulk RNA-seq, often giving less error than pseudocount-based estimators, but they may not work as well for datasets with other distributional characteristics.