Question: differential expression analysis with big dataset and >500 surrogate variables
11 months ago by
aec40
Dear all,

I have more than 500 RNA-seq samples and need to compare cases vs controls. I first ran SVA to remove unknown variation and found >500 surrogate variables. Is it good practice to perform an LRT with DESeq2, where the full model is ~case+SV1+SV2+...+SVn and the reduced model is ~case, to decide how many surrogate variables I should add in order to avoid overfitting? The idea would be to first add SV1 to the full model, then SV1+SV2, then SV1+SV2+SV3 and so on, and to stop once the number of differentially expressed genes decreases relative to the previous model.

I think something went wrong with your estimation of SVs. Can you post all your code and the output of sessionInfo()?

dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]

mod <- model.matrix(~case, colData(dds))
mod0 <- model.matrix(~1, colData(dds))
n.sv <- num.sv(dat,mod,method="leek")
n.sv

[1] 589

What do you get with the default method "be"?

n.sv <- num.sv(dat,mod)
n.sv
[1] 1

I'll wait to see Jeff's answer, but the discrepancy between the two methods (589 SVs with "leek" versus 1 with "be") seems to be an issue.

I typically use a small number of SVs. Even with hundreds of samples, I usually find that 1-10 SVs or RUV factors are sufficient to capture technical variance.
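
In case it helps, here is a minimal sketch of fixing the number of SVs by hand and adding them to the DESeq2 design, rather than relying on the num.sv() estimate. It assumes the same dds object and "case" column as in the code above; the helper name add_svs and the choice of n_sv = 2 are only illustrative and not something from this thread.

```r
library(DESeq2)
library(sva)

# Hypothetical helper (the name is an assumption): estimate a fixed, small
# number of surrogate variables with svaseq() and add them to the design.
add_svs <- function(dds, n_sv = 2, variable = "case") {
  dds <- estimateSizeFactors(dds)
  dat <- counts(dds, normalized = TRUE)
  dat <- dat[rowMeans(dat) > 1, ]          # same low-count filter as above

  mod  <- model.matrix(reformulate(variable), colData(dds))  # full model
  mod0 <- model.matrix(~ 1, colData(dds))                    # null model

  # n.sv is set explicitly here, so num.sv() is not used at all
  svseq <- svaseq(dat, mod, mod0, n.sv = n_sv)
  sv <- as.matrix(svseq$sv)                # keep a matrix even when n_sv = 1

  # store each SV as a column of colData: SV1, SV2, ...
  sv_names <- paste0("SV", seq_len(ncol(sv)))
  for (i in seq_along(sv_names)) {
    colData(dds)[[sv_names[i]]] <- sv[, i]
  }

  # SVs first, variable of interest last: ~ SV1 + SV2 + case
  design(dds) <- reformulate(c(sv_names, variable))
  dds
}
```

You would then run something like dds <- add_svs(dds, n_sv = 2), followed by DESeq(dds) and results(dds) as usual. Putting the variable of interest last in the design means the default results() call reports the case effect adjusted for the SVs.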