Question

Interpreting a paired limma model

0

Entering edit mode

jdougherty • 0

@jdougherty-23255

Last seen 4.0 years ago

Hi all,

I'm a university student experienced with R but new to Bioconductor. For my research class, I'm reproducing this study on methylphenidate's (MPH) impact on gene expression for ADHD patients.

I was able to fully reproduce the processed data and model results, but I'm confused about the model interpretation. Page 44 of limma's user guide describes the exact model used in this study, a paired sample with a block size of two. The code for this is on the bottom for reference.

In my case, the four RNA pools "consisted of samples from seven individuals each, with the same individuals present in both treatment groups". So my assumption is that we're testing the impact of MPH on gene expression, but we're also testing how it specifically affects individuals with ADHD. For this model, the design matrix can be found here.

Fitting the model and running decideTests() results in this table

My question is, does this design only explicitly test for MPH effects on the unaffected (non-ADHD) sample? If I wanted to know which genes are significantly differentially expressed for ADHD individuals, would I have to change the factor levels and run another model? I've made a folder containing all the code and files needed to reproduce everything, so if you'd like to take a look it can be found here.

Thank you for your time!

# Code from page 45 of the limma user guide
SibShip <- factor(targets$SibShip)
Treat <- factor(targets$Treatment, levels=c("C","T")) 
design <- model.matrix(~SibShip+Treat)
fit <- lmFit(eset, design) 
fit <- eBayes(fit) 
topTable(fit, coef="TreatT")

limma microarray • 616 views

ADD COMMENT • link updated 4.0 years ago by James W. MacDonald 65k • written 4.0 years ago by jdougherty • 0

score 1 · Answer 1 · 2020-04-06

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 10 hours ago

United States

Your experiment is not at all the same as the one on page 44 of the limma User's Guide! You have pairing, but you also have treatments within each pair. And when you fit a model you are ignoring the pairing. Your experiment is instead identical to the model in section 3.5 of the edgeR User's Guide.

You can't get R to generate a full rank model matrix if you have both pairing and a treatment without some work on your part, and section 3.5 shows you how to do that.

ADD COMMENT • link 4.0 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks so much for your answer! I've been trying to fit this model, but I'm getting a "No residual df: setting dispersion to NA" while trying to estimate the dispersion, along with a "Design matrix not of full rank" error when I fit the model.

I can post all the details, but I'm almost certain this is because the expression matrix only has 4 columns. So the design matrix has more columns then the expression matrix. While this study looks at 13 samples, they're pooled so there are only two measurements per group here, one before and one after the treatment, so four total columns. Would there be any way around this? Or am I limited to a maximum of 3 coefficients? Thanks again for your time!

ADD REPLY • link 4.0 years ago jdougherty • 0

0

Entering edit mode

I pointed you to a specific example. Did that not work? I don't see how the pooling has any bearing on the situation, but maybe I am missing something?

Anyway, without the details nobody can help diagnose, so you should probably post code.

ADD REPLY • link 4.0 years ago James W. MacDonald 65k

0

Entering edit mode

Sure thing, all the code, and files are at the Google Drive link here: https://drive.google.com/open?id=1Cm7BuwSek5nRYMvOPKnSuCwzQ24EstlK

I've just added an "edgeR_code" file which is my attempt at trying section 3.5 of edgeR, but the relevant parts are posted below.

Patient <- factor(c(1,2,1,2)) # 1 2 1 2 
Disease <- factor(pData(expr_set)$affection) # ADHD, unaffected, ADHD, unaffected
Treatment<- factor(pData(expr_set)$treatment,levels=c("untreat","MPH")) # untreat untreat MPH MPH 

design <- model.matrix(~Disease+Disease:Patient+Disease:Treatment)

colnames(design) #"(Intercept)", "Diseaseunaffected","DiseaseADHD:Patient2", 
# "Diseaseunaffected:Patient2",  "DiseaseADHD:TreatmentMPH" "Diseaseunaffected:TreatmentMPH"

dispersion <- edgeR::estimateDisp(exp_mat, design) #error if design is included
fit <- glmQLFit(exp_mat, design) # error: Design matrix not full rank

I think the pooling may be relevant because, if I'm understanding correctly, it means I have fewer columns in the expression matrix (4) even though there were 13 samples. Thanks so much for all the help, I really appreciate it.

ADD REPLY • link 4.0 years ago jdougherty • 0

0

Entering edit mode

If you only have four samples (pooled or not), then you are correct. Remember that what you are trying to do is simultaneously solve for a set of unknown values. This is just simple algebra. If you have one unknown, you need one equation to solve for that value, like

3 = x + 2

is easily solvable. But if you have

y = x + 3

You can't solve that equation because there are an infinite number of solutions! When you fit a linear model you are doing that same exact thing. If you have two unknowns, you need two equations. If you have four equations, you cannot solve for more than four coefficients. So if you really only have four observations you can't analyze your data while blocking on subject, because that's six unknowns and four equations.

ADD REPLY • link 4.0 years ago James W. MacDonald 65k