Fitting Gamma GLMs for examining differential expression
1
1
Entering edit mode
jspmccain ▴ 10
@jspmccain-13725
Last seen 6.0 years ago

It's my understanding that negative binomial GLMs are used in different ways in both EdgeR and DeSeq2. Both of these packages assume the input are un-normalized counts. I am working with proteomics data from a mass spectrometer. Specifically, I'm using TMT-Labeling to quantify changes in differential expression. However, the normalization procedure is a bit unintuitive and we followed methods from Plubell et al (2017; http://www.mcponline.org/content/16/5/873.short). Importantly, we do not get integers as our expression values. Also in this normalization procedure, we only compare peptides which were found in all treatments. 

So my questions are:

- Is there a way to implement a Gamma GLM in EdgeR or DeSeq2 to better fit these continuous data?

- If not, would fitting a gamma GLM with the regular `glm()` function in R, for each peptide, be a possible approach? (With a much more stringent significance cutoff)

Thanks!

deseq2 edger differential expression • 1.7k views
ADD COMMENT
4
Entering edit mode
@ryan-c-thompson-5618
Last seen 9 weeks ago
Icahn School of Medicine at Mount Sinai…

Both edgeR and DESeq2 are built around the negative binomial distribution, so there is no way to shoehorn a different distribution into them. However, if your raw data actually consists of counts, you can probably still use these methods with your custom normalization method. Instead of computing the normalized values directly, compute the appropriate normalization factors and feed them into edgeR or DESeq2 along with the raw counts. (The specifics of how to do this are different for both methods, but should be documented in their respective manuals.)

In addition, edgeR does not require counts to be integers, although it does require that the input values be on a raw count scale. (For example, RSEM estimated counts are acceptable inputs, but FPKM values are not.)

Anyway, you could certainly fit a gamma GLM to each peptide. As I'm sure you're aware, you will lose the primary benefit of edgeR or DESeq2, which is the empirical Bayes moderated estimation of biological variability. If you want to use a similar method to edgeR or DESeq2 that doesn't assume a negative binomial distribution, I would recommend using limma-trend, i.e. a standard limma analysis with trend=TRUE in the call to eBayes. (Don't use voom, since your inputs are already normalized, not raw counts.)

There is one other issue you should be aware of with your data, which is that your filtering criterion is not independent of your test statistic. By requiring a peptide to be present in all samples, you are effectively filtering on the minimum count (or minimum abundance), which is equivalent to using a lower abundance threshold for peptides with large fold changes and/or high variances, and a higher threshold for low-variance, small fold-change peptides. Since your filtering and test statistic are not independent, your FDR values may not be trustworthy. Lun & Smyth 2014 clearly demonstrates the failure of false discovery rate control when non-independent filtering is used (See Tables 1 & 2): https://academic.oup.com/nar/article/42/11/e95/1442937 (This paper demonstrates the problem in a ChIP-seq context, but the issue is not specific to ChIP-seq.)

ADD COMMENT
0
Entering edit mode

This is super helpful, thank you for the limma suggestion, and for your filtering criterion insight. I will be looking into the application of limma, and to be honest I'm still trying to understand the empirical Bayes estimation of biological variability. 

Regarding the filtering criterion my understanding is as follows: tandem mass tags (TMT) are used for peptide quantitation from mass spectrometry due to high variability across runs. I think you if you included peptides that are not found in all samples, your normalization method would not be able accurately quantify differentially expressed proteins (peptides in the MS). I'm still working through the literature on this, but this normalization method in the link above (http://www.mcponline.org/content/16/5/873.short) seems to be the most appropriate for our experiment. If you have any suggestions on ways forward, please let me know! Thanks so much for your response.

ADD REPLY
1
Entering edit mode

Unfortunately, I don't have much experience specifically with proteomics and mass spec data, so I can't really advise you on how to handle filtering and normalization. It sounds like your normalization method inherently imposes a suboptimal filtering strategy. One thing you can do to assess your filtering is to make an MA plot of the data. Ideally, your data should be bounded on the left by a vertical line, which corresponds to your filtering threshold. This indicates that your filtering threshold is independent of the log fold change, which is what you want. In your case, you might find that the threshold line looks more like a ">" or ")" symbol. If so, you can impose an additional average abundance filter at the x-intercept of the threshold line in order to make the filtering uniform across all fold changes. However, you might also find such a filter to be too stringent and discard too much of your data. You may simply have to forge ahead knowing that your computed FDR values might not be trustworthy.

ADD REPLY
0
Entering edit mode

This link is an extremely helpful description of the TMT normalization method, and the other related links go into the statistical consequences of it! https://github.com/pwilmart/IRS_normalization/blob/master/.ipynb_checkpoints/understanding_IRS-checkpoint.ipynb

ADD REPLY

Login before adding your answer.

Traffic: 967 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6