We wanted to ask for your input about adapting the deseq dispersion and fold-change shrinkage (MAP estimation) in DEseq for CLIP/ChIP-seq fold-enrichments. In CLIP-seq, we often have fold-enrichments based on fairly low read densities (for example, 20 reads in IP versus 1 or 0 reads in input), and so we're exploring ways to properly estimate the true fold-enrichment. However, as IP and input will obviously have extremely different coverage profiles, it is quite unlike the standard RNA-seq dataset (where the baseline assumption is ~5-10% of genes will change, but most should show 0-fold change between the conditions).
Effectively, what we'd like to do is train the MAP estimation model based on one pair of datasets (for example, one input versus a biological replicate input), and then be able to perform shrinkage of the CLIP versus Input comparison based on the input-vs-input model. Is this something you think would be doable (and can you recommend any ways we could modify the deseq pipeline to do this), or do you have any other suggestion of how to approach this?
Thanks!
Hi Michael,
Thanks for your fast reply!
You're right - our primary concern with simply using DESeq2 (or equivalent) on the dataset is that the distribution is trivially non-zero centered (either we start with a list of 'peaks', in which they'll be shifted towards y > 0; or we use a larger set of genomic regions, in which case they will be shifted y < 0).
What we're thinking is that we could use two 'input' biological replicates to estimate these factors based on some randomly selected regions - this should work fine, as the variability will basically be random biological + technical error. What we'd then like to do is use those trained values to perform normalization on a separate pair of files - effectively training the model on one pair of datasets, and then perform MLE normalization using that model on another dataset
Thanks!
-Eric
Can you rephrase the question. I'm missing it a bit. I don't understand "use those trained values to perform normalization on a separate pair of files" or "MLE normalization".
The MAP vs MLE values in DESeq2 are for the dispersion parameter alpha and the log fold changes beta. There is no MAP vs MLE distinction on the size factors, which are estimated using what we call the "median ratio" method, and then these estimates are later plugged into the model as fixed quantities.