Hello,
I am new to edgeR and I am working on processing some proteomics data to identify DEPs between two samples (sample vs control, both in triplicate). I've been working through the edgeR documentation; I certainly don't understand it all yet, but so far inspection of my data looks good. Samples cluster nicely in MDS plots and the BCV is something like 6% (not sure if this is good).
I've been testing different approaches to identify DEPs and I am curious how I should compare results between the QLF and LRT tests. My samples have ~3000 identified proteins in them; QLF identifies about 130 DEPs, while LRT identifies around 250 (a sketch of the comparison I'm running is below). So, when comparing these results, is there a way to decide whether using LRT is acceptable? I would prefer to maximize the number of DEPs.
Does this have anything to do with the QLF test using the trended dispersion? If LRT doesn't use the trended dispersion, is it using the common dispersion? I'm just wondering because when I look at my BCV plot the trended and common dispersions are almost identical. Is this something to be concerned about?
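For reference, the comparison I'm describing is roughly the following (protein_matrix and the group labels are placeholders for my actual data, not my exact script):

library(edgeR)

group  <- factor(c("control", "control", "control",
                   "sample",  "sample",  "sample"))
y      <- DGEList(counts = protein_matrix, group = group)
y      <- calcNormFactors(y)
design <- model.matrix(~ group)
y      <- estimateDisp(y, design)   # common, trended and tagwise dispersions

# quasi-likelihood F-test pipeline
fit_ql <- glmQLFit(y, design)
qlf    <- glmQLFTest(fit_ql, coef = 2)
summary(decideTests(qlf))           # ~130 DEPs in my hands

# likelihood ratio test pipeline
fit_lr <- glmFit(y, design)
lrt    <- glmLRT(fit_lr, coef = 2)
summary(decideTests(lrt))           # ~250 DEPs in my hands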
Not at all. Please read the documentation or the relevant published papers.
Both the QLF and LRT pipelines use protein-specific dispersions with a trended prior. When used as recommended, with default argument choices, they both model protein-specific variability with the same resolution and complexity.
The LRT pipeline does have the ability to use the common dispersion alone if you tell it to, but we don't know whether you did that.
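If you did, it would have been something along these lines (a sketch only, and not a recommended choice):

y   <- estimateGLMCommonDisp(y, design)
fit <- glmFit(y, design, dispersion = y$common.dispersion)
lrt <- glmLRT(fit, coef = 2)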
Not at all. In general, RNA-seq is the only technology that typically shows a strong BCV trend.
Thanks Gordon.
I'm learning as I go; there's a lot to get through. I am curious why it's not recommended for proteomics data. Aren't these analyses implemented in proteomics programs like Perseus and R for proteomics?
Looking at the significant DEPs produced by edgeR from this dataset, A LOT of these make sense based on previously published data.
The statistical methods in edgeR are specifically designed for counts, but proteomics does not produce counts. edgeR does not allow NAs, but proteomics does produce NAs.
We didn't write Perseus. I am unclear how Perseus uses edgeR, what data they input to edgeR, or which edgeR settings are used.
Not as far as I know.
The edgeR analysis might well be better than some other approaches implemented in Perseus, like random imputation of missing values.
Regarding NAs, my data contains only ~5% missing values. Based on this, I imputed using a minimum-probability approach, which was the best option for my data given how it was acquired (data-independent acquisition). So, to be clear, I'm not using data with NAs when I run edgeR.
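For context, the imputation was along these lines (a rough sketch of minimum-probability imputation; intensity_matrix, the quantile and the spread are placeholders, not my actual settings):

set.seed(1)
log_int <- log2(intensity_matrix)          # placeholder matrix containing NAs
obs     <- log_int[!is.na(log_int)]
mu      <- quantile(obs, 0.01)             # low quantile of the observed values
sigma   <- 0.3 * sd(obs)                   # narrowed spread
na_idx  <- which(is.na(log_int))
log_int[na_idx] <- rnorm(length(na_idx), mean = mu, sd = sigma)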
Regarding counts, I understand that my proteomics data is intensity-based and that these are continuous values. But compared with something like RNA-seq data, where you can have millions of discrete counts, is it unrealistic to consider discrete and continuous values the same?
The reason why the Perseus people use edgeR rather than limma is presumably that edgeR can accept the exact zeros output by the maxLFQ algorithm without giving an error. What is a bit disturbing to me is that the weight that edgeR places on the zeros depends on the measurement units (scaling) of the intensities, which is entirely arbitrary.
If you are imputing positive intensities for the NAs, then you could use limma instead of edgeR, which is what we do. As I note above, edgeR and limma may give very similar results if almost all the intensities are large.
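A minimal limma pipeline for imputed log-intensities would look something like this (imputed_matrix and the design are sketched placeholders, not your actual data):

library(limma)

log_int <- log2(imputed_matrix)            # proteins x 6 samples, no NAs
group   <- factor(c("control", "control", "control",
                    "sample",  "sample",  "sample"))
design  <- model.matrix(~ group)
fit     <- lmFit(log_int, design)
fit     <- eBayes(fit, trend = TRUE)       # intensity-dependent variance trend
topTable(fit, coef = 2)                    # top-ranked proteins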
Regarding discrete and continuous values being the same, there is little logic to that. By that argument, we would never have needed to develop edgeR in the first place, because the RNA-seq counts could have been analysed by pre-existing continuous-data methods.
Thank you very much for the help!