Confounded design, batch effect correction (edgeR)
2
2
Entering edit mode
Laura ▴ 20
@laura-17261
Last seen 6.2 years ago

Hello,

I would like to find differentially expressed genes between different groups in RNA-seq data. I have 55 samples that were prepared either in batch 3, batch 4 or batch 6. These samples represent 4 groups (NR, NT, P, F). When I checked the MDS plot, all the samples were separated by group and by batch, so it seems that the batch effect should be corrected. The problem is that one group (group F) has only 4 samples, which were all prepared in batch 3. All these groups are distinct from each other, so none of the groups can be considered as a reference/control.

I would like do the following comparisons:
- NR vs. NT
- NR and NT vs. P
- NR, NT and P vs. F
- P vs. NR, NT and F
- NR vs. P, F and NT
- NT vs. P, F and NT

--------------------------------------------------------------------

I am a bit unsure about a couple of things:

1) How to build the design matrix. I don't have a clear reference/control or "baseline", so I guess I should not include the intercept to the design matrix?

2) How group F affects the de-analysis. For example, if I compare group NT to group F, is the model comparing samples which were prepared within the same batch (NT and F samples which were prepared in batch 3), and ignoring NT samples which were prepared in batch 4 or batch 6? Or should I just ignore group F samples?

3) How to make the correct contrasts for the glmQLFTest function. I think the first three comparisons are correct (the code below), but what about the last three comparisons? How can I make those?

--------------------------------------------------------------------

Here are the samples and the design matrix:

group <- factor(c(rep("NR", 11), rep("NT", 26), rep("P", 14), rep("F", 4)))
batch <- factor(c(3,3,3,4,4,6,3,3,3,6,6,6,6,6,4,6,3,3,3,3,3,4,4,4,4,6,6,6,6,3,3,6,6,6,6,6,6,3,3,3,3,4,4,4,6,3,3,3,3,6,6,3,3,3,3))
design <- model.matrix(~0 + batch + group)

Example how the columns in the design matrix look like:

batch3 batch4 batch5 groupNR groupNT groupP

 


Contrasts for the first three comparisons: 
- NR vs. NT --> c(0,0,0,1,-1,0)
- NR and NT vs. P --> c(0,0,0,0.5,0.5,-1)
- NR, NT and P vs. F --> c(0,0,0,0.33,0.33,0.33)

 

Thank you in advance for all the help.

 

rnaseq edger differential gene expression batch effect • 2.2k views
ADD COMMENT
6
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 2 hours ago
The city by the bay

It is good to see a question that gives me code to reproduce the design matrix. Pat yourself on the back.

For question 1: your design matrix is fine. It doesn't matter whether you have an intercept or not, the model is mathematically equivalent. It's just the interpretation of the coefficients that needs to change. For simplicity, I would do:

design <- model.matrix(~0 + group + batch)

... instead, which means that the first four coefficients represent the average log-expression of each group. The original design in your post defines batch terms that represent the log-expression of group F in each batch (yes, even though F only has samples in batch 3!) and the remaining terms represent the log-fold change of each group from F.

For question 2: you have an additive model so edgeR will also use information from the other batches to help estimation of the coefficients for the non-F groups. edgeR will also use information from the other batches to estimate the dispersion. So, there is no need to ignore samples or batches - using all of the data is fine.

For question 3: you'll find life a lot easier with the design I've described above. It's a simple case of doing:

con <- makeContrasts(groupNR - groupNT, levels=design)

... or:

con <- makeContrasts(groupF - (groupNR + groupNT + groupP)/3, levels=design)

... and so on.

ADD COMMENT
0
Entering edit mode

Thanks a lot Aaron!

ADD REPLY
0
Entering edit mode

I have a similar question. I have a set of RNA-Seq data from human tissue and want to compare the gene expression in cases and controls using EdgeR. If I want to adjust for the confounding effect of RIN, age and PMI; what is the correct way to do this? Here is my code:

library(edgeR) x <- read.delim("rawCountMatrix.txt", row.names = "Gene") group <- factor(c(1,2,2,2,2,2,1,1,1,1)) RIN <- c(5.6,7.6,6.7,5.6,7.6,6.4,6.8,7.7,5.7,8.3) Age <- c(106,89,76,82,87,87,73,77,78,88) PMI <- c(20.58,15.16,20.17,19.16,21.66,22,22,13.5,12.83,14.5)

y <- DGEList(counts=x,group=group) y <- calcNormFactors(y) design <- model.matrix(~group + RIN + Age + PMI) y <- estimateDisp(y,design)

fit <- glmFit(y,design) lrt <- glmLRT(fit)

Do I need to specify coefficient in the glmLRT() function? I am not sure what coefficient mean, if I change it from 2 to 3 or 2:4? lrt <- glmLRT(fit, coef = 2)

I appriciate any comments.

ADD REPLY
0
Entering edit mode

I have a similar question. I have a set of RNA-Seq data from human tissue and want to compare the gene expression in cases and controls using EdgeR. If I want to adjust for the confounding effect of RIN, age and PMI; what is the correct way to do this? Here is my code:

library(edgeR) x <- read.delim("rawCountMatrix.txt", row.names = "Gene") group <- factor(c(1,2,2,2,2,2,1,1,1,1)) RIN <- c(5.6,7.6,6.7,5.6,7.6,6.4,6.8,7.7,5.7,8.3) Age <- c(106,89,76,82,87,87,73,77,78,88) PMI <- c(20.58,15.16,20.17,19.16,21.66,22,22,13.5,12.83,14.5)

y <- DGEList(counts=x,group=group) y <- calcNormFactors(y) design <- model.matrix(~group + RIN + Age + PMI) y <- estimateDisp(y,design)

fit <- glmFit(y,design) lrt <- glmLRT(fit)

Do I need to specify coefficient in the glmLRT() function? I am not sure what coefficient mean, if I change it from 2 to 3 or 2:4? lrt <- glmLRT(fit, coef = 2)

I appriciate any comments.

ADD REPLY
0
Entering edit mode

I have a similar question. I have a set of RNA-Seq data from human tissue and want to compare the gene expression in cases and controls using EdgeR. If I want to adjust for the confounding effect of RIN, age and PMI; what is the correct way to do this? Here is my code:

library(edgeR) x <- read.delim("rawCountMatrix.txt", row.names = "Gene") group <- factor(c(1,2,2,2,2,2,1,1,1,1)) RIN <- c(5.6,7.6,6.7,5.6,7.6,6.4,6.8,7.7,5.7,8.3) Age <- c(106,89,76,82,87,87,73,77,78,88) PMI <- c(20.58,15.16,20.17,19.16,21.66,22,22,13.5,12.83,14.5)

y <- DGEList(counts=x,group=group) y <- calcNormFactors(y) design <- model.matrix(~group + RIN + Age + PMI) y <- estimateDisp(y,design)

fit <- glmFit(y,design) lrt <- glmLRT(fit)

Do I need to specify coefficient in the glmLRT() function? I am not sure what coefficient mean, if I change it from 2 to 3 or 2:4? lrt <- glmLRT(fit, coef = 2)

I appriciate any comments.

ADD REPLY
0
Entering edit mode
Victoria • 0
@victoria-21093
Last seen 5.1 years ago

I have a similar question. I have a set of RNA-Seq data from human tissue and want to compare the gene expression in cases and controls using EdgeR. If I want to adjust for the confounding effect of RIN, age and PMI; what is the correct way to do this? Here is my code:

library(edgeR) x <- read.delim("rawCountMatrix.txt", row.names = "Gene") group <- factor(c(1,2,2,2,2,2,1,1,1,1)) RIN <- c(5.6,7.6,6.7,5.6,7.6,6.4,6.8,7.7,5.7,8.3) Age <- c(106,89,76,82,87,87,73,77,78,88) PMI <- c(20.58,15.16,20.17,19.16,21.66,22,22,13.5,12.83,14.5)

y <- DGEList(counts=x,group=group) y <- calcNormFactors(y) design <- model.matrix(~group + RIN + Age + PMI) y <- estimateDisp(y,design)

fit <- glmFit(y,design) lrt <- glmLRT(fit)

Do I need to specify coefficient in the glmLRT() function? I am not sure what coefficient mean, if I change it from 2 to 3 or 2:4? lrt <- glmLRT(fit, coef = 2)

I appriciate any comments.

ADD COMMENT
0
Entering edit mode

Don't post the same thing three times. Also, don't just add to some 9-month old post. If you have a question, make a new one, perhaps referencing the existing one for clarity.

ADD REPLY

Login before adding your answer.

Traffic: 857 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6