Question

edgeR: Analyzing multiple factors with batch effects

0

Entering edit mode

venura • 0

@venura-22066

Last seen 21 months ago

United States

Hi,

I am working on a dataset containing time series (2hr and 5hr) and treatment (Drug vs Mock). I was using the design based on section 3.3 of the manual as Gordan suggested earlier. When I look at the MDS plot I see that there is a batch effect. Each treatment has a somewhat distant sample. I was looking at section 4.2 of the manual and "a guide to creating design matrices for gene expression designs". I am having some difficulties figuring out how to correct batch effects in my dataset. I tried adding a separate column to the grouping file as follows; and tried to use the following code. It ended up giving an error about the design matrix.

# make labels/grouping

targets <- read.csv("groups.csv", header=TRUE, row.names=1 )

#grouping targets
group <- factor(paste(targets$Treatment, targets$Time, sep = "."))
cbind(targets,group=group)
Batch <- factor(paste(targets$Batch, sep = ","))
#Make DEG List

y <- DGEList(counts=rawdata,group=group)
nrow(y)

# TMM Normalization
y <- calcNormFactors(y)
y$samples

#plotsMDS
plotMDS(y, col=rep(1:12, each=3))

#checking batch effects
design <- model.matrix(~Batch+Batch:group)
logFC <- predFC(y,design,prior.count=1,dispersion=0.05)

I really appreciate it if someone can help me to get my design corrected and to get the batch effects removed. So I can check genes those were upregulated at 2hr and 5hr in treated samples. Thanks in advance.

Treatment   Time    Batch
    TM_2hr_1    TM  2h  1
    TM_2hr_2    TM  2h  2
    TM_2hr_3    TM  2h  3
    TM_MOCK_2hr_1   Mock    2h  4
    TM_MOCK_2hr_2   Mock    2h  5
    TM_MOCK_2hr_3   Mock    2h  6
    TM_5hr_1    TM  5h  7
    TM_5hr_2    TM  5h  8
    TM_5hr_3    TM  5h  9
    TM_MOCK_5hr_1   Mock    5h  10
    TM_MOCK_5hr_2   Mock    5h  11
    TM_MOCK_5hr_3   Mock    5h  12

edgeR • 1.3k views

ADD COMMENT • link 3.0 years ago venura • 0

score 0 · Answer 1 · 2022-04-26

What you have there is a total of 12 batches, one per sample. That's not useful and doesn't reflect what you said (that one sample per time/treatment is in a different batch). You really need a factor with two levels, one for each batch. So as an example, if the _3 samples were all run at a different time, you could do

batch <- factor(rep(c(1,1,2), times = 4))

That's just an example and assumes that there is an actual batch effect rather than simply having one sample per group that seems off. If these samples weren't processed in different batches, then you might not want to assume a batch effect, and instead might want to assume that you have outliers that can be adjusted for by using a sample weight. Section 14.4 of the limma User's Guide covers when/why you should use array weights (same thing as a sample weight). I don't believe you can specify a sample weight using edgeR, but you could do so using voomLmFit.

Also, you don't want an interaction between batch and group. You would just do

design <- model.matrix(~ 0 + batch + group)