Question

Removing unwanted variation (RUV) for paired analysis ?

0

Entering edit mode

g.atla ▴ 10

@gatla-9491

Last seen 8.9 years ago

Dear All, I have read (EdgeR: Accounting for batch effects in a pairwise analysis) that there is no need to remove the batch effects for a paired design as they are auto corrected.

I am dealing with data from primary cells. There is lot of heterogeneity in the cells. So I initially thought to use RUV-Seq which corrects for unwanted variation and it estimates factors for unwanted variation in the data set and returns it as W_1.

library(RUVSeq)
filtered <- read.delim("filt_counts.txt",header=T,row.names=1)

treat <-  as.factor(rep(c("treated","Untreated"),8))
subjects=factor(c(rep(1:8, each=2)))
design <- model.matrix(~subjects+treat)

set <- newSeqExpressionSet(as.matrix(filtered), phenoData = data.frame(treat, row.names=colnames(filtered)))
set <- betweenLaneNormalization(set, which="upper")

#create empirical data set
y <- DGEList(counts=filtered, group=treat)
y <- calcNormFactors(y,  method="upperquartile")
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
top <- topTags(lrt, n=nrow(y))$table
empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:5000]))]

#normalise using empirical data set and estimate W_1
set2 <- RUVg(set, empirical, k=1)

#DE analysis using the estimated W_1, final result
design <- model.matrix(~subjects+W_1+treat, data=pData(set2)) 
y <- DGEList(counts=counts(set2), group=treat)
y <- calcNormFactors(y)
y <- estimateGLMCommonDisp(y, design,verbose=TRUE)
y <- estimateGLMTagwiseDisp(y, design) 
fit <- glmFit(y, design)
lrt <- glmLRT(fit)

But now I learned that the paired analysis need not to be batch corrected and my design would introduce biases in the analysis as I might be doing it wrong.

I would like to know if its wrong to try to batch correct paired-analysis or if there is any way to remove hidden, unwanted variation to see true signal in data with heterogeneity.

ruvseq edger ruvg removebatcheffect • 3.5k views

ADD COMMENT • link updated 9.9 years ago by Aaron Lun ★ 29k • written 9.9 years ago by g.atla ▴ 10

1

Entering edit mode

Just to be clear, if by heterogeneity you mean the data is simply highly variable, with truly random independent variations in each gene's expression, there's nothing you can do about this variation. RUV and similar methods are designed for detecting and removing systematic patterns of heterogeneity that appear consistently across many genes and are not explained by your known factors (i.e. treatment and subject).

ADD REPLY • link 9.9 years ago Ryan C. Thompson ★ 7.9k

score 2 · Answer 1 · 2016-02-18

As your design includes subject, it already accounts for any confounding additive effects associated with the subjects, e.g., sex, age, location, sequencing batch. So, you don't need to worry about them. A more subtle problem involves heterogeneity in the treatment effects across subjects. For example, you might have a subset of subjects who don't respond to the treatment at all. This would inflate the dispersion estimate because the model won't fit well when it has to handle responders and non-responders with the same treat factor.

In such cases, RUV* may help - it would (hopefully) define a separate treatment coefficient for the non-responders in W_1, allowing the model to be fitted more accurately. I would probably suggest using RUVr to account for the fact that you've already blocked on subject in the design. Otherwise, as the counts across the empirical control genes include systematic differences between subjects, RUVg might end up returning something that is redundant with subject. This would not be helpful for explaining other sources of variability.

Note that the cost of an extra treatment coefficient for the (hypothetical) non-responders is that you effectively split your subjects into two groups, and then only use one group to identify significant treatment effects. This isn't so bad for genes when the other group is full of non-responders, as you don't lose much from ignoring them (and indeed, gain power by reducing the dispersion estimate). However, if you have a gene where all subjects are responding, then you're losing power by ignoring half your subjects.

The best way to figure out which is best is to run the analysis with and without W_1. If you get substantially lower dispersions and a greater number of DE genes with W_1 included in the design, then you should probably use it. If not, go with the simpler model.