Sanity Checking Mixed Factorial Proteomics Analysis Workflow Using Limma
2
0
Entering edit mode
JMallory • 0
@jmallory-13488
Last seen 3.5 years ago

The present analysis is an exercise physiology study with a mixed factorial design. The design is as follows: Time (pre vs. post intervention; repeated measures) x Metformin status (on drug vs off drug; independent subjects) x dietary intervention (control diet vs MUFA enriched vs PUFA enriched; independent subjects). The data under investigation comes from a custom preoteomics aptamer array (104 individual subjects; two timepoints; 857 x 208 expression matrix). The primary contrast of interest is the first one (pre-post). I have set up the analysis using the attached code, but others on the research team find the initial results highly improbable.

I am unclear on how to interpret this and I want to give their concerns due diligence. Any advice would be appreciated. My primary concern is that some aspect of the following code is inappropriate for the stated application due to faulty implementation/violated assumptions.

library(dplyr)
library(impute)
library(vsn)
library(limma)

#filter combined proteomics data to exclude MNAR probes where abundances are NA for more than 33.3% of samples
combined<-inner_join(join,bind,by="Order_Key_USE")
na<-apply(combined,2,is.na)
na_index<-apply(na,2,sum)/dim(combined)[1]
combined<-combined[,na_index<.333]

add_data$Sex<-as.factor(add_data$Sex)
add_data$Smoker<-as.factor(add_data$Smoker)
write.csv(final_join,file="Intermediate_one.csv")

#set up objects for limma and normalize expression data with vsn
abunds<-justvsn(t(R[,-c(1:25)]))

# impute remaining NA assuming they are not MNAR
edata<-impute.knn(abunds,k = 10, rowmax = 1.0, colmax = 1.0, maxp = 1500, rng.seed=362436069)$data rownames(edata)<-colnames(R[,-c(1:25)]) #set up model matrix for limma and estimate correlation of individual subject samples across repeated measures pheno_table<-R[,c(2,3,14,18)] colnames(pheno_table)<-c("StudyID","Treatment","Study","Main") Treat<-factor(paste(pheno_table$Study,pheno_table$Main,sep=".")) Treat<-factor(paste(Treat,pheno_table$Treatment,sep="."))
levels(Treat)<-list("PUFApre.Metform"=c("PUFA.1.Metform"),
"PUFApost.Metform"=c("PUFA.2.Metform"),
"PUFApre.NoMetform"=c("PUFA.1.NoMetform"),
"PUFApost.NoMetform"=c("PUFAt.2.NoMetform"),
"NoDietpre.Metform"=c("NoDiet.1.Metform"),
"NoDietpost.Metform"=c("NoDiet.2.Metform"),
"NoDietpre.NoMetform"=c("NoDiet.1.NoMetform"),
"NoDietpost.NoMetform"=c("NoDiet.2.NoMetform"),
"MUFApre.Metform"=c("MUFA.1.Metform"),
"MUFApost.Metform"=c("MUFA.2.Metform"),
"MUFApre.NoMetform"=c("MUFA.1.NoMetform"),
"MUFApost.NoMetform"=c("MUFA.2.NoMetform"))
design<-model.matrix(~0+Treat)
colnames(design)<-levels(Treat)
corfit<-duplicateCorrelation(edata,design,block=pheno_table$StudyID) #Fit linear model for within subjects contrast fit<-lmFit(edata,design,block=pheno_table$StudyID,correlation=corfit\$consensus)
cm <- makeContrasts(
PUFApost.Metform-PUFApre.Metform = PUFApost.Metform-PUFApre.Metform,
NoDietpost.Metform-NoDietpre.Metform = NoDietpost.Metform-NoDietpre.Metform,
MUFApost.Metform-MUFApre.Metform = MUFApost.Metform-MUFApre.Metform,
PUFApost.NoMetform-PUFApre.NoMetform = PUFApost.NoMetform-PUFApre.NoMetform,
NoDietpost.NoMetform-NoDietpre.NoMetform = NoDietpost.NoMetform-NoDietpre.NoMetform,
MUFApost.NoMetform-MUFApre.NoMetform = MUFApost.NoMetform-MUFApre.NoMetform,
levels=design)

#clean up DE table
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2,robust=TRUE)
T<-topTable(fit2,number=100000)
T<-cbind(rownames(T),T)
colnames(T)[1]<-"FeatureID"

limma • 408 views
0
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 7 hours ago
The city by the bay

The limma part of this only starts at # set up model matrix for limma and .... Before that, there seem to be some proteomics-specific steps that I would not be qualified to answer. The data munging step is also a mystery to me as well, so I can only assume that you've done that part correctly.

Anyway, onto the limma part. It mostly looks fine, though there is surely a better way of setting levels(Treat). Note that calling topTable without coef will do an ANOVA-like analysis, testing all contrasts at once. This may not be what you want if you need DE results for each individual contrast.

You don't mention exactly why your colleagues find the results to be improbable. I would suggest (i) finding a handful of DE proteins that they're not liking, and (ii) looking back through the expression values to determine the step at which it became DE. You can do this by just plotting raw or transformed or imputed expression values - and if there is a clear difference even with the raw values, then there's really nothing more to be said. Well, unless you mixed up the protein names during pre-processing.

0
Entering edit mode

Concerning the part before limma.I don't know from which instrument your proteomics data came out. Usually, a log transform is needed. It's not clear if it's already applied. You should also check graphically the VSN and the impute steps by plotting the replicates (or the samples) before and after each processing.

0
Entering edit mode
@gordon-smyth
Last seen 3 hours ago
WEHI, Melbourne, Australia

Well, you've made a number of non-trivial and debatable pre-processing steps. We don't know anything about your custom proteomics array, so we can't tell you whether your pre-processing is good or bad. Actually it would be hard to give definitive advice even if we did know your platform: pre-processing proteomics data for DE analyses is in general an unsolved problem because of the complex pattern of NAs.

I will say though that inputing imputing data to limma is never correct because it overstates the amount of data that you actually have and will over-state the significance of any results you obtain. There is no need to impute data because limma already handles missing values automatically (assuming they are MAR).

Also, limma doesn't need the data to be variance-stabilized (although it should be on the log-scale). It would be simpler and more robust to model the mean-variance relationship using trend=TRUE in eBayes rather than using VSN. I can understand why would you might impute data or variance-stabilize if you were inputing data to a clustering algorithm, say, but limma doesn't need so much pre-processing.

As Aaron has already remarked, it seems strange that you've formed a topTable for all the contrasts at once. If the first contrast is the one of interest, why haven't you tested for that contrast by itself?