Question: voom for spectral counts
gravatar for Ben Neely
2.9 years ago by
Ben Neely40
United States
Ben Neely40 wrote:

I am re-analyzing a data set from many years ago and since then voom has come along and I would like to address its applicability to a typical proteomics label-free analysis.

The data is a proteomics data set which is spectral count information for ~200 proteins (I am using spectral counts not emPAI, NSAF, etc). This is across 11 samples, 8 disease 3 control. Also to note is that spectral count data in proteomics can be zero, and therefore the data is quite sparse at times. For instance, approximately 20% of the proteins have zero values for at least 6 of the samples. The counts themselves can run from 0 to ~150 (this can range depending on the mass spec and experimental workflow). Lastly, it is accepted in proteomics that zeros can be replaced with a half-minimum observed value prior to statistical analysis.

A conservative approach to this type of data is to log transform and analyze with a equal two sample t-test. Another is to not transform and use a rank sum approach. Other approaches are proteomics specific like PLGEM, or count specific like EdgeR, DESeq, baySeq, etc, which I somewhat described here.

I have used a voom-limma approach on this data. I used the data as is, data with replaced zeros (with 0.5), and data removing proteins with 6 or more zero measurements. Each of these produces slightly different results based on how voom converts the counts. This line in the manual concerns me with how to best prune my data before running voom limma: 

The limma-voom method assumes that rows with zero or very low counts have been removed.

I am not sure (1) how many zeros can be allowed, and (2) what "low" means. I have run across some other discussions which imply that voom-limma might be better at dealing with the low counts we see in proteomics as opposed to other count based methods like DESeq.

Any guidance would be appreciated, and if this isn't applicable, that is fine. Proteomics is often driving downstream experiments and this data has already had secondary confirmation by westerns and IHC. In other words, if voom-limma isn't applicable, a t-test works just fine at telling the story we have experimentally confirmed.

Thanks ahead of time for any help.

ADD COMMENTlink modified 2.9 years ago by Gordon Smyth32k • written 2.9 years ago by Ben Neely40

Here is a related question relating to count data of microbial communities converting extremely sparse count dataframe to continuous distributions for study in WGCNA

ADD REPLYlink written 16 months ago by chrisclarkson10030
gravatar for Gordon Smyth
2.9 years ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:

We have experimented with a few of different statistical approaches to proteomics data and spectral counts. We are still learning, but spectral counts seem to behave somewhat differently to RNA-seq counts, showing much more technical variability, so I don't think that I would recommend voom or classic edgeR or DESeq.

Rather than voom, a slightly simpler and very robust approach would be to tranform to y = log2(count+1), quantile normalize, then run a limma pipeline with eBayes(trend=TRUE, robust=TRUE). I feel confident that this would be better than using ordinary t-tests.

Alternatively, if there are a lot of zeros in your data, then a quasi-negative-binomial approach may well be preferable. Have a look at Section 4.7 "Profiles of Unrelated Nigerian Individuals" of the edgeR User's Guide for how a quasi-likelihood pipeline might go.

Whatever pipeline you use, you should filter out low count proteins, as Aaron and Bernd have already discussed. How much filtering to do? The mean-variance plots will guide you. For voom, have a look at the plot produced by voom(plot=TRUE). Ideally the trend line should be monotonically decreasing. If it goes sharply up before trending down, then you haven't filtered enough. For limma and log-counts, look at plotSA(fit) where fit is output from lmFit and eBayes. The trend line should be smooth. If it increases sharply at the low count end, then you haven't filtered enough. For edgeR quasi-likelihood, look at plotBCV and plotQLDisp. Again the trends should be smooth.

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by Gordon Smyth32k
gravatar for Bernd Fischer
2.9 years ago by
Bernd Fischer540
Germany / Heidelberg / DKFZ
Bernd Fischer540 wrote:

Counts in spectral count data are usually very low, therefore, I expect that the normal approximation from voom will not be the best solution, but it depends on your data. I would prefer one of edgeR, DESeq, DESeq2, or similar. 

Filtering proteins with a low number of counts or a low number of non-zero values is a good idea, for both voom / limma or edgeR, DESeq, because it is very unlikely that you will obtain significant hits for proteins with a very low spectral count. See "Independent filtering increases detection power for high-throughput experiments., Bourgon, et al., PNAS, 2010".




ADD COMMENTlink written 2.9 years ago by Bernd Fischer540

In addition, filtering in voom gets rid of discrete log-transformed values for which it is difficult to fit a mean-variance trend; check out C: Voom on TCGA data shifts count distributions towards negative values . Count-based methods like edgeR tend to be a bit more robust to low counts, as discreteness is explicitly considered by the statistical model.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Aaron Lun17k

Thanks for the response. I obviously got something mixed up with thinking voom was best for low numbers and the other count based were better at high numbers. Seems the other way around, and I appreciate your clarification and recommendation.

As for filtering out these low values, I am still uncertain as to what low values violate assumptions within the different count based methods. Bourgon demonstrates that using mean (or variance) as a independent filter improved the power of analysis after correction for multiple testing. Similarly I could evaluate my data to see what an appropriate theta would be, but it would be more interesting to know what "low" really means. I responded to Aaron Lun's comment saying that defining low as < 5 or 10 spectral counts could be appropriate since from my experience these proteins have a decent amount of variability due to technical effects.

Thanks for your response and advice.

ADD REPLYlink written 2.9 years ago by Ben Neely40
gravatar for Aaron Lun
2.9 years ago by
Aaron Lun17k
Cambridge, United Kingdom
Aaron Lun17k wrote:

I'll flip your questions around a bit, such that we keep proteins where there are at least N libraries with a count above X. The code might look a bit like this, for a matrix of counts:

keep <- rowSums(counts > X) >= N
counts <- counts[keep,]

Now, our problem involves choosing appropriate values for N and X. For N, the value can be determined from your experimental design. In your case, you've got 8 samples in one group and 3 samples in the other. This suggests that N=3 is most appropriate. Otherwise, you might filter out significant proteins that are present in the smaller group and near-absent in the larger group.

The value of X is more arbitrary. From a biological viewpoint, how low must expression be in order to be considered negligble? That's hard to answer, but from a statistical perspective, we want to get rid of those low, discrete counts that will mess up voom's approximations. A value of 5 - 10 might be appropriate for your data. Alternatively, you can filter on CPM values above 1 - 2, as is done in the case studies in the edgeR user's guide.

Of course, as Bernd has suggested, count-based models might be more appropriate if many of your counts are low.

P.S. voom already adds 0.5 to the counts during its internal log-transformation, so there's no need to do it manually. Also, if you remove proteins with 6 or more zero measurements, then you could be throwing out significant proteins that are totally absent in the larger group but present in the smaller group.

ADD COMMENTlink modified 2.9 years ago • written 2.9 years ago by Aaron Lun17k

Thank you for the excellent response. I am still not completely clear on what "low" could mean, but from my experience I can say that normally something under 5 spectral counts is extremely variable and typically won't yield significance, not because the protein itself is that variable (which it might be) but due to the technical nature of how data dependent mass spec runs are set up for shotgun proteomics analyses. Bottom line, filtering base on a cutoff of 5 sounds reasonable with N>=3. 

On a side note, if voom uses a lowess the way I think I am understanding Law et al., 2014, wouldn't that mean that there is probably some requirement for number of observations? I remember a normalization paper or manual which indicated that lowess works well with 1,000s of points, but not with 100s of points. This could be another reason voom may not be the best approach for this set specifically. Albeit, many proteomics experiments do have 1000s of variables at which point this could be viable.

Thank you again for your help and insight.

ADD REPLYlink written 2.9 years ago by Ben Neely40

On the side note: having 100s of points is not a particular problem for lowess or voom. We have used voom on datasets with a few hundred probes. If the number of features is very low, you could increase the span setting used by voom to ensure more stability.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Gordon Smyth32k

The lowess algorithm probably does have a minimum requirement for the number of data points, in order to obtain a stable fit (i.e., avoid overfitting). I'm not sure how large this needs to be, though. We usually have more than enough points in genome-scale data sets to obtain reproducible curves. My only comment would be that  you'd need more points if you wanted to reduce the span/bandwidth of the fit.

ADD REPLYlink written 2.9 years ago by Aaron Lun17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 335 users visited in the last hour