Does anyone have any experience using emPAI values for differential testing of protein abundances? I was asked to analyze a fairly complex proteomics data set - 90 samples total, 15 males + 15 females, each measured at 3 time points, with 6 possible covariates, some of which are constant per subject and some of which are measured per sample. I've been working with emPAI values so far. Searching the archives turns up more posts on using spectral counts (which I might have buried somewhere), and one mention by Laurent Gatto of calculating emPAI from spectral counts, with the caveat that these values would then need a further log transformation before using limma (https://support.bioconductor.org/p/35932/#35938).
The funny thing is that emPAI itself is an exponential transformation of the Protein Abundance Index (emPAI = 10^PAI - 1). PAI is supposedly the ratio of observed peptides to observable (in-silico digested) tryptic peptides for a protein, although (as I recently learned) this ratio can be larger than 1 because incomplete digestion of the protein leads to observed "peptides" that contain some uncut sites. So I've tried back-transforming to PAI and then log2-transforming, which seems weird somehow, given that the exponential transformation was shown to lead to a linear correlation with absolute protein abundance (Ishihama et al. 2005; http://www.mcponline.org/content/4/9/1265.full).
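To make the back-transformation concrete, here is a minimal sketch of the relationship described above, in plain Python rather than R. The function names and the peptide counts are made up for illustration; only the formula emPAI = 10^PAI - 1 comes from the text.

```python
import math

def empai_from_pai(pai):
    """emPAI = 10^PAI - 1, as stated above."""
    return 10.0 ** pai - 1.0

def pai_from_empai(empai):
    """Invert the formula: PAI = log10(emPAI + 1)."""
    return math.log10(empai + 1.0)

# Hypothetical protein: 6 observed peptides out of 12 observable ones.
n_observed, n_observable = 6, 12
pai = n_observed / n_observable      # ratio of observed to observable peptides
empai = empai_from_pai(pai)          # exponentially modified value
recovered_pai = pai_from_empai(empai)  # back-transformed PAI
```

Note that an emPAI of 0 maps back to a PAI of 0, so the zeros in the matrix stay zeros after back-transforming and still need an offset before any log2 step.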
Ishihama et al. also give some further equations to calculate the molar fraction of a protein given all observed proteins in the sample, which is extremely similar to CPM for RNA-Seq data: emPAI / sum(emPAI). You can also weight by protein mass to get the weight fraction in the sample: (emPAI * mass) / sum(emPAI * mass). I've tried PCA clustering on all 3 transformations, log2(PAI + constant), log2(molar fraction + constant) and log2(weight fraction + constant), where the constant is half the minimum non-zero value. All 3 PCA clusterings are similar and show some patterning by sex, so I'm hopeful that I'm on the right track. I should also mention that this data is extremely sparse; the original 2796 proteins were reduced to 517 by requiring at least 3 samples with non-zero values, and the resulting matrix still has 79% zero values.
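For anyone wanting to check the arithmetic, here is a rough sketch of the steps described above (molar fraction, weight fraction, the sparsity filter, and the log2 offset) in plain Python. All function names and numbers are hypothetical, and I'm assuming the constant is taken over the whole filtered matrix rather than per sample.

```python
import math

def molar_fraction(empai):
    """emPAI / sum(emPAI) over all proteins in one sample."""
    total = sum(empai)
    return [v / total for v in empai]

def weight_fraction(empai, mass):
    """(emPAI * mass) / sum(emPAI * mass) over all proteins in one sample."""
    weighted = [v * m for v, m in zip(empai, mass)]
    total = sum(weighted)
    return [w / total for w in weighted]

def keep_detected(matrix, min_samples=3):
    """Keep proteins (rows) that are non-zero in at least min_samples samples."""
    return [row for row in matrix if sum(v > 0 for v in row) >= min_samples]

def log2_with_offset(matrix):
    """log2(value + constant), constant = half the minimum non-zero value.

    Assumption: the minimum is taken over the whole matrix.
    """
    offset = min(v for row in matrix for v in row if v > 0) / 2.0
    logged = [[math.log2(v + offset) for v in row] for row in matrix]
    return logged, offset

# Made-up emPAI values and masses for 4 proteins in a single sample.
sample = [1.0, 3.0, 0.0, 4.0]
mass = [10.0, 20.0, 30.0, 5.0]
molar = molar_fraction(sample)
weight = weight_fraction(sample, mass)

# Made-up matrix: rows are proteins, columns are samples.
values = [
    [0.125, 0.0, 0.25, 0.5],
    [0.0, 0.0, 0.0, 0.5],
    [0.5, 0.25, 0.125, 0.0],
]
kept = keep_detected(values, min_samples=3)
logged, offset = log2_with_offset(kept)
```

With 79% zeros, most entries end up at exactly log2(constant), so the choice of constant largely decides where that block of values sits relative to the detected ones.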
I'd appreciate any comments or suggestions on what data values and transformations I should use, or any other advice you'd like to give!