Question

EdgeR design matrix question

0

Entering edit mode

jcrowdis • 0

@jcrowdis-23207

Last seen 4.6 years ago

Hi,

I'm having trouble setting up a edgeR design matrix for an RNA-seq dataset I've been working with. We have a number of subjects, with each subjects having 1 - 7 samples. Subjects can be further stratified in 'Responder' and 'Nonresponder'. Here's a simplified example:

Response      Subject         Sample
   0             1          Sample_1_A
   0             1          Sample_1_B
   0             1          Sample_1_C
   1             2          Sample_2_A
   1             2          Sample_2_B
   0             3          Sample_3_A
   0             3          Sample_3_B
   0             3          Sample_3_C
   0             3          Sample_3_D
   1             4          Sample_4_A

We're interested in determining genes that are DE in responders relative to nonresponders, leveraging the fact that many of the samples come from the same subject.

We've tried the following design matrix, but it fails because the rank is not full (since response is a linear combination of subject)

design <- model.matrix(~0 + Subject + Response)

What is the correct design matrix to find genes DE in responders vs. nonresponders? Additionally, while the samples from individual subjects are similar, they are not true biological replicates (they were sampled from different regions). Is there a design matrix setup that can explicitly model differences between samples within a subject, and then allow us to compare responders vs. nonresponders accounting for subject differences? This will presumably kill a lot of our power to compare responders vs. nonresponders, but I'm stilll curious.

I've read the user guide for edgeR (which is very helpful), but I still couldn't determine if any of those examples matched our question.

Thanks for any help!

edger • 772 views

ADD COMMENT • link updated 4.6 years ago by Gordon Smyth 51k • written 4.6 years ago by jcrowdis • 0

0

Entering edit mode

After combing again through edgeR's user manual, is the correct approach to just use design <- model.matrix(~0 + Subject), and then compare responder with nonresponder via the use of contrast? So if there are n responders and m nonresponders, I might use qlf <- glmQLFTest(fit, contrast = c(-1/m, -1/m, 1/n, 1/n....). This gives me results that seem believable (~1k genes up/down, 18k not significant).

If this is the correct approach, can anyone shed some light on what exactly that contrast is doing? What does it mean to compare the average of responders vs. nonresponders?

ADD REPLY • link 4.6 years ago jcrowdis • 0

score 1 · Answer 1 · 2020-03-29

1

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

I would use limma instead of edgeR with

design <- model.matrix(~Response)

and use duplicateCorrelation to estimate the correlation between repeat samples from the same subject.

The contrast approach you describe is not entirely correct because it estimate variability only within subject and ignores variation between subject.

ADD COMMENT • link 4.6 years ago Gordon Smyth 51k

0

Entering edit mode

Would you mind expanding on why it only estimates variability within subject? I thought the point of design <- model.matrix(~0 + Subject) was to account for differences between subjects, although I'm clearly missing something. I thought the contrast approach was just to compare responders vs. nonresponders, not account for differences between samples of the same subject (I thought my approach did nothing on that front).

ADD REPLY • link 4.6 years ago jcrowdis • 0

0

Entering edit mode

Yes, the design does account for diffferences between subjects, but it does so in the model instead of in the variance where it needs to be.

You are evaluating variance within subjects but forming a contrast between subjects. The p-values therefore only hold for the particular four subjects you have in this experiment, not for other responding or non-responding subjects you might deal with in the future.

It is wrong to remove between-subject variation when what you actually want to do is to compare groups of subjects.

ADD REPLY • link 4.6 years ago Gordon Smyth 51k

0

Entering edit mode

I think I'm starting to understand. So you're saying that the issue is that we can't remove between-subject variation because then our results aren't generalizable? My initial thought is that you would want remove differences between patients and get differences driven only by responder vs. nonresponder, but maybe that's not actually possible.

Thanks for pointing me to limma - I have a follow up question on duplicateCorrelation. I assume I pass logCPM as the object? And group_matrix$Subject as the blocking factor? I'm a bit confused on why this results in one correlation (corfit$consensus) - shouldn't it result in n correlations, where n is the number of subjects? I guess I don't understand statistically how limma is able to handle correlations within each of n patients by just using one correlation value.

ADD REPLY • link 4.6 years ago jcrowdis • 0

0

Entering edit mode

Basing off of other examples online, I've tried the following code to include Subject in duplicateCorrelation, and in the end I only get 2 genes differentially expressed in total. I just want to make sure I'm not doing anything wrong:

# in my dataset, response is nonresponder/responder, not 0/1
response = factor(group_matrix$response, levels = c("nonresponder", "responder"))
design = model.matrix(~response)
dge = DGEList(counts = count_matrix)
keep <- filterByExpr(dge, design)
dge <- dge[keep,,keep.lib.sizes=FALSE]
dge <- calcNormFactors(dge)
v <- voom(dge, design, plot=TRUE)
corfit = duplicateCorrelation(v, design, block=group_matrix$subject) #pass voom to duplicateCorr
v <- voom(dge, design, block=group_matrix$subject ,correlation=corfit$consensus)
fit <- lmFit(v, design, block=group_matrix$subject, correlation=corfit$consensus)
fit <- eBayes(fit)
summary(decideTests(fit))

Results in:

               (Intercept) responder
    Down           233          1
    NotSig        1746      16065
    Up           14088          1

Is there an error in my code, am I misinterpreting these results, or are there truly only 2 differentially expressed genes by response status?