Design and contrast matrices for analyzing the effect of two factors of interest on protein expression, controlling for factors and covariates not of interest, using limma
1
0
Entering edit mode
@b5aee53d
Last seen 4 weeks ago
United States

I'm working on analyzing a proteomic dataset in which I am interested in evaluating the difference in protein expression between case and control and the effect and interaction of gender on expression differences by case status.

I also have a factor (site) and a covariate (age) that I need to adjust for, neither of which I am interested in knowing the effects of, and that is where I'm having trouble. In a typical regression, I would just add them to the model when estimating my coefficients of interest, but, following this extremely helpful guide, I think I can/should include site as a random effects variable (if it's possible to do that with age, a continuous covariate, I would not know how to do that, which is why I don't list it as well even though the effect of age is not of interest to me). One of the issues I'm having with that is that it results in site being included as a blocking variable, and from my experience with blocking in clinical trials, this would not be a blocking variable; it's not that strict, it's just the sites across which observations are distributed.

So far, I've set up how to look at: differential expression by case status; differential expression by gender; differential expression by case status within genders; differential expression by gender within case status; and interaction between case status and gender. I believe that's more or less been coded correctly, but I will still include it and please do let me know if there are any improvements that could be made. I've also set up those same comparisons, but with two different attempts adjusting for age and site. All of that is included below. I am asking if either of those methods for adjusting is appropriate in this case, and if not, how should I adjust for age and site?

Any assistance is greatly appreciated!


This is the rough structure of the data I'm working with, except rather than "expression" being a single column, it's more like 8000 columns.

# Create mock dataset
pheno_dat <- data.frame(matrix(NA, nrow = 10, ncol = 5))
pheno_dat[1,] <- c("case", "female", "A", 44, "case_female")
pheno_dat[2,] <- c("control", "female", "B", 36, "control_female")
pheno_dat[3,] <- c("case", "female", "A", 45, "case_female")
pheno_dat[4,] <- c("control", "female", "C", 38, "control_female")
pheno_dat[5,] <- c("case", "female", "B", 46, "case_female")
pheno_dat[6,] <- c("control", "female", "A", 38, "control_female")
pheno_dat[7,] <- c("case", "male",  "A", 48, "case_male")
pheno_dat[8,] <- c("control", "male", "B", 39, "control_male")
pheno_dat[9,] <- c("case", "male",  "C", 47, "case_male")
pheno_dat[10,] <- c("control",  "male", "C", 41, "control_male")
colnames(pheno_dat) <- c("status", "gender", "site", "age", "stat_gender")
pheno_dat$status <- as.factor(pheno_dat$status)
pheno_dat$status <- relevel(pheno_dat$status, ref="control")
pheno_dat$gender <- as.factor(pheno_dat$gender)
pheno_dat$gender <- relevel(pheno_dat$gender, ref="male")
pheno_dat$age <- as.numeric(pheno_dat$age)
pheno_dat$site <- as.factor(pheno_dat$site)
pheno_dat$stat_gender <- as.factor(pheno_dat$stat_gender)
pheno_dat$stat_gender <- relevel(pheno_dat$stat_gender, ref="control_male")

> pheno_dat
>          status gender site age    stat_gender
> 1         case female    A  44    case_female
> 2      control female    B  36 control_female
> 3         case female    A  45    case_female
> 4      control female    C  38 control_female
> 5         case female    B  46    case_female
> 6      control female    A  38 control_female
> 7         case   male    A  48      case_male
> 8      control   male    B  39   control_male
> 9         case   male    C  47      case_male
> 10     control   male    C  41   control_male

exprs_dat <- as.data.frame(c(9.7424779,
                             8.541871,
                             7.6822924,
                             9.6101021,
                             9.2693605,
                             8.8050989,
                             8.8638765,
                             7.9937876,
                             7.3689429,
                             8.9164766))
colnames(exprs_dat) <- "expression"

# Convert to expression set
exp_set <- as.ExpressionSet(exprs_dat)
exp_set@phenoData@data <- pheno_dat

The modeling strategy I've used is similar between the following two methods of covariate adjustment. I believe this part is done correctly (but again, please correct me if I'm wrong!).

This is my attempt at covariate adjustment by adding age to the design matrix and treating site as a random effects variable.

# Design matrix adjusting for age
means_stat_gender_age <- model.matrix(~ 0 + exp_set$stat_gender + exp_set$age)
colnames(means_stat_gender_age) <- c("control_male", "case_female", "case_male", "control_female", "age")

# Setting up site as a blocking (random effects) variable
cor <- duplicateCorrelation(exp_set, means_stat_gender_age, block=exp_set$site)

> cor$consensus.correlation
> [1] -0.3233333

# Model blocking on site 
fit_stat_gender_age_block <- lmFit(object=exp_set, design=means_stat_gender_age, block=exp_set$site, correlation=cor$consensus.correlation)
fit_stat_gender_age_block <- eBayes(fit_stat_gender_age_block)

> topTable(fit_stat_gender_age_block)
>   control_male case_female case_male control_female       age  AveExpr        F      P.Value    adj.P.Val
> 1     -3.97873   -4.984572 -6.532959      -2.566777 0.3094411 8.679429 3894.656 5.733608e-09 5.733608e-09

# Contrasts for comparisons of interest, adjusting for age and blocking by site
contrasts_comp_int_stat_gender_age_block <- makeContrasts(
  caseVScontrol=(case_male+case_female)/2-(control_male+control_female)/2, # Differential expression by case status
  femaleVSmale=(case_female+control_female)/2-(case_male+control_male)/2, # Differential expression by gender
  caseVScontrol_female=case_female-control_female, caseVScontrol_male=case_male-control_male, # Differential expression by case status within gender
  femaleVSmale_case=case_female-case_male, femaleVSmale_control=control_female-control_male, # Differential expression by gender within case status
  statusXgender=(case_female-control_male)-(control_female-control_male)-(case_male-control_male), # Interaction between case status and gender
  levels=colnames(means_stat_gender_age))
fit_contrasts_comp_int_stat_gender_age_block <- contrasts.fit(fit_stat_gender_age_block, contrasts_comp_int_stat_gender_age_block)
fit_contrasts_comp_int_stat_gender_age_block <- eBayes(fit_contrasts_comp_int_stat_gender_age_block)

> topTable(fit_contrasts_comp_int_stat_gender_age_block)
>   caseVScontrol femaleVSmale caseVScontrol_female caseVScontrol_male femaleVSmale_case femaleVSmale_control statusXgender
> 1     -2.486012      1.48017            -2.417794          -2.554229          1.548387             1.411952     0.1364348
>    AveExpr         F   P.Value adj.P.Val
> 1 8.679429 0.9186268 0.4954753 0.4954753

This is my attempt at covariate adjustment by adding age and site to the design matrix.

# Design matrix adjusting for age and site
means_stat_gender_cor <- model.matrix(~ 0 + exp_set$stat_gender + exp_set$age + exp_set$site)
colnames(means_stat_gender_cor) <- c("control_male", "case_female", "case_male", "control_female", "age", "B", "C")
fit_stat_gender_cor <- lmFit(exp_set, means_stat_gender_cor)
fit_stat_gender_cor <- eBayes(fit_stat_gender_cor)

> topTable(fit_stat_gender_cor)
>   control_male case_female case_male control_female       age          B           C  AveExpr        F     P.Value   adj.P.Val
> 1     -2.58555   -3.574291 -5.064257      -1.339882 0.2779798 -0.1102714 -0.04674809 8.679429 83.64332 0.001962154 0.001962154

# Contrasts for comparisons of interest, adjusting for age and blocking by site
contrasts_comp_int_stat_gender_cor <- makeContrasts(
  caseVScontrol=(case_male+case_female)/2-(control_male+control_female)/2, # Differential expression by case status
  femaleVSmale=(case_female+control_female)/2-(case_male+control_male)/2, # Differential expression by gender
  caseVScontrol_female=case_female-control_female, caseVScontrol_male=case_male-control_male, # Differential expression by case status within gender
  femaleVSmale_case=case_female-case_male, femaleVSmale_control=control_female-control_male, # Differential expression by gender within case status
  statusXgender=(case_female-control_male)-(control_female-control_male)-(case_male-control_male), # Interaction between case status and gender
  levels=colnames(means_stat_gender_cor))
fit_contrasts_comp_int_stat_gender_cor <- contrasts.fit(fit_stat_gender_cor, contrasts_comp_int_stat_gender_cor)
fit_contrasts_comp_int_stat_gender_cor <- eBayes(fit_contrasts_comp_int_stat_gender_cor)

> topTable(fit_contrasts_comp_int_stat_gender_cor)
>   caseVScontrol femaleVSmale caseVScontrol_female caseVScontrol_male femaleVSmale_case femaleVSmale_control statusXgender
> 1     -2.356557     1.367817            -2.234408          -2.478707          1.489967             1.245668     0.2442984
>    AveExpr         F   P.Value adj.P.Val
> 1 8.679429 0.3691084 0.7826167 0.7826167

If anything is missing or unclear, please let me know, and I'll address it as best as possible!

Cheers!



# please also include the results of running the following in an R session 

sessionInfo( )

R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.5

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] statmod_1.4.36       BiocManager_1.30.18  leukemiasEset_1.30.0 destiny_3.8.1        limma_3.50.3         ROCR_1.0-11         
 [7] glmnet_4.1-4         Matrix_1.4-1         gridExtra_2.3        janitor_2.1.0        UniprotR_2.2.0       magrittr_2.0.3      
[13] glue_1.6.2           Gmisc_3.0.0          htmlTable_2.4.1      Rcpp_1.0.9           robust_0.7-1         fit.models_0.64     
[19] sva_3.42.0           BiocParallel_1.28.3  genefilter_1.76.0    mgcv_1.8-40          nlme_3.1-158         report_0.5.1        
[25] broom_1.0.0          flextable_0.7.2      rempsyc_0.0.4.5      devtools_2.4.4       usethis_2.1.6        forcats_0.5.1       
[31] table1_1.4.2         psych_2.2.5          dplyr_1.0.9          tidyr_1.2.0          Hmisc_4.7-0          Formula_1.2-4       
[37] survival_3.3-1       lattice_0.20-45      ggplot2_3.3.6        lmSupport_2.9.13     readxl_1.4.0         Biobase_2.54.0      
[43] BiocGenerics_0.40.0 

loaded via a namespace (and not attached):
#Removed because of character limit
limma Proteomics ProteomicsWorkflow • 258 views
ADD COMMENT
0
Entering edit mode

I understand that your mock dataset has expression data for just one protein instead of 8000, but is the phenoData complete? Does your real dataset have 10 samples, same as the mock dataset?

ADD REPLY
0
Entering edit mode

No, the real data has 153 samples. My apologies, I thought I'd included that somewhere. The balance of status, gender, site, and age is approximately the same as I've put here though!

ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 12 hours ago
WEHI, Melbourne, Australia

In a typical regression, I would just add them to the model when estimating my coefficients of interest

Why not do the same with limma? limma allows you to fit completely general regression equations.

following this extremely helpful guide, I think I can/should include site as a random effects variable

Thanks for the kind words, but we didn't advise you to necessary treat factors or covariates as random. It's just an extra option, useful in the context of incomplete block designs or hierarchical models.

> cor$consensus.correlation
> [1] -0.3233333

Given that the intra-site correlation is negative, you definitely should not be fitting site as a random factor. The negative correlation shows that expression values from the same site are not similar to one another. The negative correlation makes me wonder whether site should be included in the model at all.

I notice that you haven't used arrayWeights or trended or robust eBayes. I would view all those things as routine for a large human study like yours.

ADD COMMENT

Login before adding your answer.

Traffic: 211 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6