Question: Removing unwanted variation (RUV) for paired analysis ?
0
3.6 years ago by
g.atla0
g.atla0 wrote:

Dear All, I have read (EdgeR: Accounting for batch effects in a pairwise analysis) that there is no need to remove the batch effects for a paired design as they are auto corrected.

I am dealing with data from primary cells. There is lot of heterogeneity in the cells. So I initially thought to use RUV-Seq which corrects for unwanted variation and it estimates factors for unwanted variation in the data set and returns it as W_1.

library(RUVSeq)

treat <-  as.factor(rep(c("treated","Untreated"),8))
subjects=factor(c(rep(1:8, each=2)))
design <- model.matrix(~subjects+treat)

set <- newSeqExpressionSet(as.matrix(filtered), phenoData = data.frame(treat, row.names=colnames(filtered)))
set <- betweenLaneNormalization(set, which="upper")

#create empirical data set
y <- DGEList(counts=filtered, group=treat)
y <- calcNormFactors(y,  method="upperquartile")
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
top <- topTags(lrt, n=nrow(y))\$table
empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:5000]))]

#normalise using empirical data set and estimate W_1
set2 <- RUVg(set, empirical, k=1)

#DE analysis using the estimated W_1, final result
design <- model.matrix(~subjects+W_1+treat, data=pData(set2))
y <- DGEList(counts=counts(set2), group=treat)
y <- calcNormFactors(y)
y <- estimateGLMCommonDisp(y, design,verbose=TRUE)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)

But now I learned that the paired analysis need not to be batch corrected and my design would introduce biases in the analysis as I might be doing it wrong.

I would like to know if its wrong to try to batch correct paired-analysis or if there is any way to remove hidden, unwanted variation to see true signal in data with heterogeneity.

modified 3.6 years ago by Aaron Lun24k • written 3.6 years ago by g.atla0
1

Just to be clear, if by heterogeneity you mean the data is simply highly variable, with truly random independent variations in each gene's expression, there's nothing you can do about this variation. RUV and similar methods are designed for detecting and removing systematic patterns of heterogeneity that appear consistently across many genes and are not explained by your known factors (i.e. treatment and subject).

Answer: Removing unwanted variation (RUV) for paired analysis ?
2
3.6 years ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

As your design includes subject, it already accounts for any confounding additive effects associated with the subjects, e.g., sex, age, location, sequencing batch. So, you don't need to worry about them. A more subtle problem involves heterogeneity in the treatment effects across subjects. For example, you might have a subset of subjects who don't respond to the treatment at all. This would inflate the dispersion estimate because the model won't fit well when it has to handle responders and non-responders with the same treat factor.

In such cases, RUV* may help - it would (hopefully) define a separate treatment coefficient for the non-responders in W_1, allowing the model to be fitted more accurately. I would probably suggest using RUVr to account for the fact that you've already blocked on subject in the design. Otherwise, as the counts across the empirical control genes include systematic differences between subjects, RUVg might end up returning something that is redundant with subject. This would not be helpful for explaining other sources of variability.

Note that the cost of an extra treatment coefficient for the (hypothetical) non-responders is that you effectively split your subjects into two groups, and then only use one group to identify significant treatment effects. This isn't so bad for genes when the other group is full of non-responders, as you don't lose much from ignoring them (and indeed, gain power by reducing the dispersion estimate). However, if you have a gene where all subjects are responding, then you're losing power by ignoring half your subjects.

The best way to figure out which is best is to run the analysis with and without W_1. If you get substantially lower dispersions and a greater number of DE genes with W_1 included in the design, then you should probably use it. If not, go with the simpler model.

1

Just to elaborate on Aaron reply, one very simple way of assessing whether RUV may help is to have a look at what W_1 is capturing, i.e., just plot W_1 and see if it is capturing any systematic effect distinct from subject (if it's capturing the subject effect, then you don't want to include both W_1 and subject to avoid collinearity).

I have plotted the W_1, it seems that its not capturing the treatment effects. The dispersions and BCV have reduced but did not make much change in the results. I guess, as Aaron said, the main problem is to identify/subset the responders and non-responders and analyse them separately.