I am re-analyzing a data set from many years ago. Since then voom has come along, and I would like to ask about its applicability to a typical label-free proteomics analysis.

The data set consists of spectral counts for ~200 proteins (I am using spectral counts, not emPAI, NSAF, etc.) across 11 samples: 8 disease and 3 control. Note that spectral counts in proteomics can be zero, so the data is quite sparse at times. For instance, approximately 20% of the proteins have zero counts in at least 6 of the samples. The counts themselves run from 0 to ~150 (the range depends on the mass spectrometer and experimental workflow). Lastly, it is accepted in proteomics that zeros can be replaced with half the minimum observed value prior to statistical analysis.
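To make the sparsity and zero-replacement steps concrete, here is a minimal sketch in plain Python. The counts are made up, the function names are hypothetical, and I am assuming the "half-minimum" is taken over the whole table rather than per protein; with spectral counts the smallest nonzero value is usually 1, so this gives 0.5.

```python
# Toy spectral counts (made-up numbers): one row per protein, 11 samples each.
counts = {
    "protA": [0, 3, 5, 0, 8, 2, 0, 4, 1, 0, 6],
    "protB": [0, 0, 0, 0, 0, 0, 12, 0, 7, 0, 0],
    "protC": [40, 55, 61, 38, 150, 47, 52, 44, 9, 12, 7],
}

def sparse_proteins(table, min_zeros=6):
    """Names of proteins with at least `min_zeros` zero counts."""
    return [name for name, row in table.items()
            if sum(1 for c in row if c == 0) >= min_zeros]

def replace_zeros_half_min(table):
    """Replace zeros with half the smallest nonzero count in the whole table.

    Assumption: the minimum is global, not per protein.
    """
    nonzero = [c for row in table.values() for c in row if c > 0]
    half_min = min(nonzero) / 2
    return {name: [c if c > 0 else half_min for c in row]
            for name, row in table.items()}

sparse = sparse_proteins(counts)
replaced = replace_zeros_half_min(counts)
```

Dropping the names in `sparse` from the table gives the filtered version of the data; `replaced` is the zero-replaced version.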

A conservative approach to this type of data is to log-transform and analyze with an equal-variance two-sample t-test. Another is to skip the transformation and use a rank-sum approach. Other approaches are proteomics-specific, like PLGEM, or count-specific, like edgeR, DESeq, baySeq, etc., which I partly described here.
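As a rough sketch of the conservative approach for a single protein, the following computes the equal-variance (pooled) two-sample t statistic on log2 counts using only the Python standard library. The counts are invented, `pooled_t` is a hypothetical helper name, and zeros are set to 0.5 before taking logs as an assumption. Turning the statistic into a p-value needs a t-distribution, which I leave to a statistics package.

```python
import math
from statistics import mean, variance

def pooled_t(x, y):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / se, df

def log2_counts(row, zero_value=0.5):
    """log2 of the counts, with zeros replaced first (assumed value 0.5)."""
    return [math.log2(c if c > 0 else zero_value) for c in row]

# Made-up counts for one protein: 8 disease samples, 3 controls.
disease = [12, 15, 9, 20, 14, 11, 18, 16]
control = [4, 0, 6]

t_stat, df = pooled_t(log2_counts(disease), log2_counts(control))
```

With 8 and 3 samples the test has 9 degrees of freedom, so a single replaced zero in the control group can move the statistic a lot.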

I have used a voom-limma approach on this data with three inputs: the data as is, the data with zeros replaced by 0.5, and the data after removing proteins with 6 or more zero measurements. Each of these produces slightly different results, depending on how voom converts the counts. This line in the manual makes me unsure how best to prune my data before running voom-limma:

*The limma-voom method assumes that rows with zero or very low counts have been removed.*

I am not sure (1) how many zeros can be allowed, and (2) what "low" means. I have run across some other discussions which imply that voom-limma might be better at dealing with the low counts we see in proteomics as opposed to other count based methods like DESeq.
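To illustrate why the inputs differ, here is a sketch of the log-CPM transform as I understand it from the published description of voom: log2((count + 0.5) / (library size + 1) × 10^6). This is not the limma implementation; it ignores normalization factors and omits the mean-variance trend and precision weights that voom also estimates. The function name and the counts are hypothetical.

```python
import math

def voom_style_log_cpm(sample_counts):
    """log2 counts-per-million with voom's 0.5 / 1 offsets, for one sample.

    Assumption: library size is the plain column sum, with no
    normalization factors applied.
    """
    lib_size = sum(sample_counts)
    return [math.log2((c + 0.5) / (lib_size + 1.0) * 1e6)
            for c in sample_counts]

# Made-up counts for one sample across six proteins.
sample = [0, 3, 40, 0, 150, 7]

as_is = voom_style_log_cpm(sample)
zeros_replaced = voom_style_log_cpm([c if c > 0 else 0.5 for c in sample])
```

Because the transform already adds 0.5 to every count, replacing zeros with 0.5 beforehand effectively makes them 1.0 and also inflates the library size, so both the zero and nonzero entries shift. Removing sparse rows changes the library size too, which is presumably part of why the three inputs give different results.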

Any guidance would be appreciated, and if this isn't applicable, that is fine. Proteomics is often driving downstream experiments and this data has already had secondary confirmation by westerns and IHC. In other words, if voom-limma isn't applicable, a t-test works just fine at telling the story we have experimentally confirmed.

Thanks ahead of time for any help.

Here is a related question about count data from microbial communities: "converting extremely sparse count dataframe to continuous distributions for study in WGCNA".