Question

Dealing with NA values from mass spec data using Limma, NA values.

0

Entering edit mode

reubenmcgregor88 • 0

@reubenmcgregor88-13722

Last seen 3.3 years ago

Hi I have downloaded a mass spec dataset from maxqb (http://maxqb.biochem.mpg.de/mxdb/).

I have then imported the data Into R specifying 0 values as NA (as I think this is correct for analysis using the Limma package):

 exprs <- read_excel("exprs.xlsx", na="0")

I then normalise using normalizeBetweenArrays and remove any columns with all NA values:

 y <- normalizeBetweenArrays(log2(exprs), method="cyclicloess", iterations=10)
 y <- exprs[rowSumsis.na(exprs)) != ncol(exprs), ]

Then I carry out DE analysis:

group <- factor(pData$MV_only) 
design <- model.matrix(~0+group)
contrast.matrix <- makeContrasts(
  mv_vs_other = MV-other,levels=design)
fit <- lmFit(y, design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2, trend=T)

and I get the following warning: Warning message:

Partial NA coefficients for 2885 probe(s)

But the analysis seems to work fine.

I have a couple of questions here.

1) Is this the correct way to deal with 0 values? It should be dealt with in normalizeBetweenArrays?

2) To then visualise any of the data in PCA or coolmap (heat map) I convert the NA's back to 0 in the normalised data:

y[is.na(y)] <- 0

Is this also correct?

Thank you

Reuben

limma mass spec na • 3.6k views

ADD COMMENT • link updated 5.1 years ago by Gordon Smyth 50k • written 5.1 years ago by reubenmcgregor88 • 0

score 2 · Accepted Answer · 2019-03-06

2

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 49 minutes ago

WEHI, Melbourne, Australia

I don't yet have a recommended limma pipeline for MaxQuant mass spec data, although this is something we are working on.

The method you've used seems to me to be as good as anything else at the moment. limma is treating the non-detection NAs as "missing at random". While this assumption is not completely correct, and the NAs do contain some weak information about expression levels, I don't yet have a good general way to recover this information.

Converting NAs back to zeros is however going to play havoc with the heatmap or PCA. I think you would need to use an imputation method for this purpose, for example MaxQuant LQF values.

On a more minor note, setting the cyclic loess iterations as high as 10 seems a bit unnecessary.

ADD COMMENT • link 5.1 years ago Gordon Smyth 50k

0

Entering edit mode

Dear Gordon, it's a good news to know you working on Mass Spec data. I think limma is great for such processing after some transformation of the data. Not Missing At Random data is the challenge.

From my experience, the data should be first divided by 1e6 or 5e5 prior to transformation with log2 or so. This level comes from that very few measurements are observed between this level and zero. When applying a log2(intensity+1) transformation (+1 is used here to handle 0 in log2), there is a large gap between log2(0+1) = 0 and log2(5e5+1) = 19. This gap is nearly equal to (sometimes larger than) the dynamical range of the measured intensities (excluding zero). Maximum values could be 1e12 or 1e15, log2(1e12) = 39 or log2(1e15) = 50. In data analysis, this large empty gap between transformed values of 0 and 19 renders a zero value as distant as from intensity 1e6 than an intensity of 1e12 from intensity 1e6. But in practice, there is a very small difference between an untransformed value of 0 and 1e6.

The same artifact occurs when NA are a posteriori replaced by zero (as in the procedure of reubenmcgregor88) for the PCA computation or heatmap display. The limma part of the procedure is unaffected, but zero values are nevertheless ignored. This could lead to missing deferentially expressed proteins in the case where for one condition all measurements are zero aka NA.

For all these reasons, I would replace the log2(exprs) by log2(exprs/1e6+1/2) at he beginning of the processing. You should adjust the threshold to what sounds reasonable to you.

In general, I am more in favor of asinh transform in place of log2, because it handles zeros and negative values.

May be I am wrong, may be there is already literature on that, but I will be happy to have feedback from others, especially eubenmcgregor88 if you try it on your data of interest.

Best.

ADD REPLY • link 5.1 years ago SamGG ▴ 350

0

Entering edit mode

Yes, I have done analyses similar to what you describe. Your analysis however is treating the NAs as entirely arising from low expression, which is only sometimes true. NAs frequently arise from other causes as well. Unfortunately, the simple imputation strategy you describe overstates the amount of information in the data and introduces outliers into proteins that are highly expressed but still have NAs.

Most of the problems can be overcome by setting very low observational weights for the imputed values -- I have used weights = 1/50 for the imputed NAs. This is the same strategy I have mentioned previously: https://support.bioconductor.org/p/114154/ . If the proportion of NAs is not too large, this works quite well.

However, even with weights, this analysis strategy will still work poorly if the proportion of NAs is very high, and the analysis is in any case sensitive to some arbitrary choices of how much to offset the zeros or how much to divide by. That's why I do not want to recommend such an analysis yet to inexperienced users.

Regarding asinh, the fact that it works for meaningless intensity values (negative or zero) is a mark against it IMO.

ADD REPLY • link 5.1 years ago Gordon Smyth 50k

0

Entering edit mode

Many thanks for sharing your views Gordon.

If I summarize correctly, your approach applies when one considers NA arising at random whereas my proposal applies when one considers NA arising because of low signals.

Weighting is an interesting approach, but playing Evil's advocate, I am wondering if a weight of 1/50 is really making the difference with a NA?

ADD REPLY • link 5.1 years ago SamGG ▴ 350

0

Entering edit mode

Thanks a lot for making your answer even clearer. Some more questions arise.

What is the floor intensity level you observed in the datasets you studied?

Considering that instruments are quite reproductive apart from some glitches leading to NA at random, do you think that imputation should be done per group of biological replicates in order to avoid merging high and low intensities when a protein is varying a lot?

ADD REPLY • link 5.1 years ago SamGG ▴ 350