Question: Confounded design, batch effect correction (edgeR)
2
12 months ago by
Laura20
Laura20 wrote:

Hello,

I would like to find differentially expressed genes between different groups in RNA-seq data. I have 55 samples that were prepared either in batch 3, batch 4 or batch 6. These samples represent 4 groups (NR, NT, P, F). When I checked the MDS plot, all the samples were separated by group and by batch, so it seems that the batch effect should be corrected. The problem is that one group (group F) has only 4 samples, which were all prepared in batch 3. All these groups are distinct from each other, so none of the groups can be considered as a reference/control.

I would like do the following comparisons:
- NR vs. NT
- NR and NT vs. P
- NR, NT and P vs. F
- P vs. NR, NT and F
- NR vs. P, F and NT
- NT vs. P, F and NT

--------------------------------------------------------------------

I am a bit unsure about a couple of things:

1) How to build the design matrix. I don't have a clear reference/control or "baseline", so I guess I should not include the intercept to the design matrix?

2) How group F affects the de-analysis. For example, if I compare group NT to group F, is the model comparing samples which were prepared within the same batch (NT and F samples which were prepared in batch 3), and ignoring NT samples which were prepared in batch 4 or batch 6? Or should I just ignore group F samples?

3) How to make the correct contrasts for the glmQLFTest function. I think the first three comparisons are correct (the code below), but what about the last three comparisons? How can I make those?

--------------------------------------------------------------------

Here are the samples and the design matrix:

group <- factor(c(rep("NR", 11), rep("NT", 26), rep("P", 14), rep("F", 4)))
batch <- factor(c(3,3,3,4,4,6,3,3,3,6,6,6,6,6,4,6,3,3,3,3,3,4,4,4,4,6,6,6,6,3,3,6,6,6,6,6,6,3,3,3,3,4,4,4,6,3,3,3,3,6,6,3,3,3,3))
design <- model.matrix(~0 + batch + group)

Example how the columns in the design matrix look like:

batch3 batch4 batch5 groupNR groupNT groupP

Contrasts for the first three comparisons:
- NR vs. NT --> c(0,0,0,1,-1,0)
- NR and NT vs. P --> c(0,0,0,0.5,0.5,-1)
- NR, NT and P vs. F --> c(0,0,0,0.33,0.33,0.33)

Thank you in advance for all the help.

modified 3 months ago by Victoria0 • written 12 months ago by Laura20
Answer: Confounded design, batch effect correction (edgeR)
6
12 months ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

It is good to see a question that gives me code to reproduce the design matrix. Pat yourself on the back.

For question 1: your design matrix is fine. It doesn't matter whether you have an intercept or not, the model is mathematically equivalent. It's just the interpretation of the coefficients that needs to change. For simplicity, I would do:

design <- model.matrix(~0 + group + batch)

... instead, which means that the first four coefficients represent the average log-expression of each group. The original design in your post defines batch terms that represent the log-expression of group F in each batch (yes, even though F only has samples in batch 3!) and the remaining terms represent the log-fold change of each group from F.

For question 2: you have an additive model so edgeR will also use information from the other batches to help estimation of the coefficients for the non-F groups. edgeR will also use information from the other batches to estimate the dispersion. So, there is no need to ignore samples or batches - using all of the data is fine.

For question 3: you'll find life a lot easier with the design I've described above. It's a simple case of doing:

con <- makeContrasts(groupNR - groupNT, levels=design)

... or:

con <- makeContrasts(groupF - (groupNR + groupNT + groupP)/3, levels=design)

... and so on.

Thanks a lot Aaron!

I have a similar question. I have a set of RNA-Seq data from human tissue and want to compare the gene expression in cases and controls using EdgeR. If I want to adjust for the confounding effect of RIN, age and PMI; what is the correct way to do this? Here is my code:

library(edgeR) x <- read.delim("rawCountMatrix.txt", row.names = "Gene") group <- factor(c(1,2,2,2,2,2,1,1,1,1)) RIN <- c(5.6,7.6,6.7,5.6,7.6,6.4,6.8,7.7,5.7,8.3) Age <- c(106,89,76,82,87,87,73,77,78,88) PMI <- c(20.58,15.16,20.17,19.16,21.66,22,22,13.5,12.83,14.5)

y <- DGEList(counts=x,group=group) y <- calcNormFactors(y) design <- model.matrix(~group + RIN + Age + PMI) y <- estimateDisp(y,design)

fit <- glmFit(y,design) lrt <- glmLRT(fit)

Do I need to specify coefficient in the glmLRT() function? I am not sure what coefficient mean, if I change it from 2 to 3 or 2:4? lrt <- glmLRT(fit, coef = 2)

I have a similar question. I have a set of RNA-Seq data from human tissue and want to compare the gene expression in cases and controls using EdgeR. If I want to adjust for the confounding effect of RIN, age and PMI; what is the correct way to do this? Here is my code:

library(edgeR) x <- read.delim("rawCountMatrix.txt", row.names = "Gene") group <- factor(c(1,2,2,2,2,2,1,1,1,1)) RIN <- c(5.6,7.6,6.7,5.6,7.6,6.4,6.8,7.7,5.7,8.3) Age <- c(106,89,76,82,87,87,73,77,78,88) PMI <- c(20.58,15.16,20.17,19.16,21.66,22,22,13.5,12.83,14.5)

y <- DGEList(counts=x,group=group) y <- calcNormFactors(y) design <- model.matrix(~group + RIN + Age + PMI) y <- estimateDisp(y,design)

fit <- glmFit(y,design) lrt <- glmLRT(fit)

Do I need to specify coefficient in the glmLRT() function? I am not sure what coefficient mean, if I change it from 2 to 3 or 2:4? lrt <- glmLRT(fit, coef = 2)

I have a similar question. I have a set of RNA-Seq data from human tissue and want to compare the gene expression in cases and controls using EdgeR. If I want to adjust for the confounding effect of RIN, age and PMI; what is the correct way to do this? Here is my code:

library(edgeR) x <- read.delim("rawCountMatrix.txt", row.names = "Gene") group <- factor(c(1,2,2,2,2,2,1,1,1,1)) RIN <- c(5.6,7.6,6.7,5.6,7.6,6.4,6.8,7.7,5.7,8.3) Age <- c(106,89,76,82,87,87,73,77,78,88) PMI <- c(20.58,15.16,20.17,19.16,21.66,22,22,13.5,12.83,14.5)

y <- DGEList(counts=x,group=group) y <- calcNormFactors(y) design <- model.matrix(~group + RIN + Age + PMI) y <- estimateDisp(y,design)

fit <- glmFit(y,design) lrt <- glmLRT(fit)

Do I need to specify coefficient in the glmLRT() function? I am not sure what coefficient mean, if I change it from 2 to 3 or 2:4? lrt <- glmLRT(fit, coef = 2)

Answer: Confounded design, batch effect correction (edgeR)
0
3 months ago by
Victoria0
Victoria0 wrote:

I have a similar question. I have a set of RNA-Seq data from human tissue and want to compare the gene expression in cases and controls using EdgeR. If I want to adjust for the confounding effect of RIN, age and PMI; what is the correct way to do this? Here is my code:

library(edgeR) x <- read.delim("rawCountMatrix.txt", row.names = "Gene") group <- factor(c(1,2,2,2,2,2,1,1,1,1)) RIN <- c(5.6,7.6,6.7,5.6,7.6,6.4,6.8,7.7,5.7,8.3) Age <- c(106,89,76,82,87,87,73,77,78,88) PMI <- c(20.58,15.16,20.17,19.16,21.66,22,22,13.5,12.83,14.5)

y <- DGEList(counts=x,group=group) y <- calcNormFactors(y) design <- model.matrix(~group + RIN + Age + PMI) y <- estimateDisp(y,design)

fit <- glmFit(y,design) lrt <- glmLRT(fit)

Do I need to specify coefficient in the glmLRT() function? I am not sure what coefficient mean, if I change it from 2 to 3 or 2:4? lrt <- glmLRT(fit, coef = 2)