limma on spectral data
1
0
Entering edit mode
wiscoyogi • 0
@wiscoyogi-21673
Last seen 2 days ago
United States

I’m working with untargeted metabolomics data from samples that we run on two different column types (C18 and HILIC), both in positive and negative mode, for a total of four tables of (m/z + retention time vs. samples), which I generated using XCMS online. My goal now is to identify differentially abundant peaks, as one does with gene expression for RNA.

I filtered features for a given table as one typically does for RNA, I excluded a peak if it was not present in at least k % of samples, where k is defined as the minimum proportion of a given group in the comparison (here, sick = 0.6 and healthy = 0.4, so I required that a given peak be detected in at least 0.4 % of samples to be included in the analysis). I have 15 samples total, where 6 are sick. There was a technical factor that contributed considerably to the variation so I blocked with it in the design matrix (‘sick_A’, ‘sick_B’, ‘healthy_A’, ‘healthy_B’). I treated another biological covariate (age) that is not of direct interest but could contribute to variability as a column in the design matrix.

In terms of identifying differentially abundant peaks, I followed Gordon’s advice here, where I’m transforming the abundance matrix log2 counts, performing quantile normalization, then running a limma pipeline with eBayes(trend=TRUE, robust=TRUE). Putting the code at the end of the post.

Questions:

1. Are there additional/newer guidances from this post from Gordon that I should follow?
2. Are there guidelines for what the density plot should look like following quantile normalization between arrays? This is what I’m getting:

The red and yellow samples also have outlier total ion chromatogram plots and I’m figuring out what to do.

1. Thoughts on using POMA vs limma? I’m most familiar with limma/voom and wanted to ask first.
2. Are there guidelines for correction between runs (e.g positive vs. negative mode on a given column?)? Any thoughts on how necessary this is? I did not just knit the two matrices together and treat duplicate patient ID as a random effect since there is no overlap in the features between the positive and negative run - they’re independent data collection events - so I'd get a matrix with two blocks, one . I know when comparing related contrasts I can do decideTests; but that’s within a given run.

sample code

log2_counts <- log2(counts + 1)
normalized <- normalizeBetweenArrays(log2_counts, method="quantile")
fit <- lmFit(normalized, design = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit3 <- eBayes(fit2, trend = TRUE, robust = TRUE)
topTable(fit3, sort.by = "P", number = Inf)

0
Entering edit mode
@gordon-smyth
Last seen 9 minutes ago
WEHI, Melbourne, Australia

I'm not very familiar with this sort of data but your analysis looks reasonable. I'll just make two technical comments:

here, sick = 0.6 and healthy = 0.4, so I required that a given peak be detected in at least 0.4 % of samples to be included in the analysis

That should be 40% rather than 0.4%.

Are there guidelines for what the density plot should look like following quantile normalization between arrays?

They should be identical. I actually find boxplots or histograms more useful than density plots because they don't try to force smoothness on the plot.