Hello,
I am new to edgeR and I am working on processing some proteomics data to identify DEPs between two samples (sample vs control, both in triplicate). I've been working through the edgeR documentation; I certainly don't understand it all yet, but so far inspection of my data looks good. Samples cluster nicely in MDS plots and the BCV is something like 6% (not sure if this is good).
I've been testing different approaches to identify DEPs and I am curious how I should compare results between the QLF and LRT tests. My samples have ~3000 identified proteins in them; QLF identifies about 130 DEPs, while LRT identifies around 250 (a sketch of the comparison I'm running is below). So, when comparing these results, is there a way to decide whether using LRT is acceptable? I would prefer to maximize the number of DEPs.
Does this have anything to do with the QLF test using the trended dispersion? If LRT doesn't use the trended dispersion, is it using the common dispersion? I'm just wondering because when I look at my BCV plot the trended and common dispersions are almost identical. Is this something to be concerned about?
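For reference, the comparison I'm describing is roughly the following (protein_matrix and the group labels are placeholders for my actual data, not my exact script):

library(edgeR)

group  <- factor(c("control", "control", "control",
                   "sample",  "sample",  "sample"))
y      <- DGEList(counts = protein_matrix, group = group)
y      <- calcNormFactors(y)
design <- model.matrix(~ group)
y      <- estimateDisp(y, design)   # common, trended and tagwise dispersions

# quasi-likelihood F-test pipeline
fit_ql <- glmQLFit(y, design)
qlf    <- glmQLFTest(fit_ql, coef = 2)
summary(decideTests(qlf))           # ~130 DEPs in my hands

# likelihood ratio test pipeline
fit_lr <- glmFit(y, design)
lrt    <- glmLRT(fit_lr, coef = 2)
summary(decideTests(lrt))           # ~250 DEPs in my hands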
Not at all. Please read the documentation or the relevant published papers.
Both the QLF and LRT pipelines use protein-specific dispersions with a trended prior. When used as recommended, with default argument choices, they both model protein-specific variability with the same resolution and complexity.
The LRT pipeline does have the ability to use the common dispersion alone if you tell it to, but we don't know whether you did that.
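If you did, it would have been something along these lines (a sketch only, and not a recommended choice):

y   <- estimateGLMCommonDisp(y, design)
fit <- glmFit(y, design, dispersion = y$common.dispersion)
lrt <- glmLRT(fit, coef = 2)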
Not at all. In general, RNA-seq is the only technology that typically shows a strong BCV trend.
Thanks Gordon.
I'm learning as I go; there's a lot to get through. I am curious why it's not recommended for proteomics data. Aren't these analyses implemented in proteomics programs like Perseus and R for proteomics?
Looking at the significant DEPs produced by edgeR from this dataset, A LOT of these make sense based on previously published data.
The statistical methods in edgeR are specifically designed for counts, but proteomics does not produce counts. edgeR does not allow NAs, but proteomics does produce NAs.
We didn't write Perseus. I am unclear how Perseus uses edgeR, what data they input to edgeR, or which edgeR settings are used.
Not as far as I know.
The edgeR analysis might well be better than some other approaches implemented in Perseus, like random imputation of missing values.
Regarding NAs, my data contains only ~5% missing values. Based on this, I imputed using a minimum-probability approach, which was the best option for my data given how it was acquired (data-independent acquisition). So, to be clear, I'm not using data with NAs when I run edgeR.
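For context, the imputation was along these lines (a rough sketch of minimum-probability imputation; intensity_matrix, the quantile and the spread are placeholders, not my actual settings):

set.seed(1)
log_int <- log2(intensity_matrix)          # placeholder matrix containing NAs
obs     <- log_int[!is.na(log_int)]
mu      <- quantile(obs, 0.01)             # low quantile of the observed values
sigma   <- 0.3 * sd(obs)                   # narrowed spread
na_idx  <- which(is.na(log_int))
log_int[na_idx] <- rnorm(length(na_idx), mean = mu, sd = sigma)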
Regarding counts, I understand that my proteomics data is intensity-based and that these are continuous values. But compared with something like RNA-seq data, where you can have millions of discrete counts, is it unrealistic to consider discrete and continuous values the same?
The reason why the Perseus people use edgeR rather than limma is presumably that edgeR can accept the exact zeros output by the maxLFQ algorithm without giving an error. What is a bit disturbing to me is that the weight that edgeR places on the zeros depends on the measurement units (scaling) of the intensities, which is entirely arbitrary.
If you are imputing positive intensities for the NAs, then you could use limma instead of edgeR, which is what we do. As I note above, edgeR and limma may give very similar results if almost all the intensities are large.
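A minimal limma pipeline for imputed log-intensities would look something like this (imputed_matrix and the design are sketched placeholders, not your actual data):

library(limma)

log_int <- log2(imputed_matrix)            # proteins x 6 samples, no NAs
group   <- factor(c("control", "control", "control",
                    "sample",  "sample",  "sample"))
design  <- model.matrix(~ group)
fit     <- lmFit(log_int, design)
fit     <- eBayes(fit, trend = TRUE)       # intensity-dependent variance trend
topTable(fit, coef = 2)                    # top-ranked proteins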
Regarding discrete and continuous values being the same, there is little logic to that. By that argument, we would never have needed to develop edgeR in the first place, because the RNA-seq counts could have been analysed by pre-existing continuous-data methods.
Thank you very much for the help!