edgeR: Analyzing multiple factors with batch effects
Entering edit mode
venura • 0
Last seen 7 weeks ago
United States


I am working on a dataset containing time series (2hr and 5hr) and treatment (Drug vs Mock). I was using the design based on section 3.3 of the manual as Gordan suggested earlier. When I look at the MDS plot I see that there is a batch effect. Each treatment has a somewhat distant sample. I was looking at section 4.2 of the manual and "a guide to creating design matrices for gene expression designs". I am having some difficulties figuring out how to correct batch effects in my dataset. I tried adding a separate column to the grouping file as follows; and tried to use the following code. It ended up giving an error about the design matrix.

# make labels/grouping

targets <- read.csv("groups.csv", header=TRUE, row.names=1 )

#grouping targets
group <- factor(paste(targets$Treatment, targets$Time, sep = "."))
Batch <- factor(paste(targets$Batch, sep = ","))
#Make DEG List

y <- DGEList(counts=rawdata,group=group)

# TMM Normalization
y <- calcNormFactors(y)

plotMDS(y, col=rep(1:12, each=3))

#checking batch effects
design <- model.matrix(~Batch+Batch:group)
logFC <- predFC(y,design,prior.count=1,dispersion=0.05)

I really appreciate it if someone can help me to get my design corrected and to get the batch effects removed. So I can check genes those were upregulated at 2hr and 5hr in treated samples. Thanks in advance.

Treatment   Time    Batch
    TM_2hr_1    TM  2h  1
    TM_2hr_2    TM  2h  2
    TM_2hr_3    TM  2h  3
    TM_MOCK_2hr_1   Mock    2h  4
    TM_MOCK_2hr_2   Mock    2h  5
    TM_MOCK_2hr_3   Mock    2h  6
    TM_5hr_1    TM  5h  7
    TM_5hr_2    TM  5h  8
    TM_5hr_3    TM  5h  9
    TM_MOCK_5hr_1   Mock    5h  10
    TM_MOCK_5hr_2   Mock    5h  11
    TM_MOCK_5hr_3   Mock    5h  12
edgeR • 136 views
Entering edit mode
Last seen 17 hours ago
United States

What you have there is a total of 12 batches, one per sample. That's not useful and doesn't reflect what you said (that one sample per time/treatment is in a different batch). You really need a factor with two levels, one for each batch. So as an example, if the _3 samples were all run at a different time, you could do

batch <- factor(rep(c(1,1,2), times = 4))

That's just an example and assumes that there is an actual batch effect rather than simply having one sample per group that seems off. If these samples weren't processed in different batches, then you might not want to assume a batch effect, and instead might want to assume that you have outliers that can be adjusted for by using a sample weight. Section 14.4 of the limma User's Guide covers when/why you should use array weights (same thing as a sample weight). I don't believe you can specify a sample weight using edgeR, but you could do so using voomLmFit.

Also, you don't want an interaction between batch and group. You would just do

design <- model.matrix(~ 0 + batch + group)
Entering edit mode

Thank you, James, for the clarification. As you pointed out it is the latter. It is caused by the biological sample. We do destructive sampling so each sample is a different plant. So there is a considerable amount of variation though we try to minimize it. I will check voomLmFit as you suggested. I thought about batch effects since it was the closest one I found for my issue.


Login before adding your answer.

Traffic: 441 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6