Question: differential expression analysis with big dataset and >500 surrogate variables
15 months ago, aec wrote:

Dear all,

I have more than 500 RNA-seq samples and have to compare cases vs controls. I first ran SVA to remove unknown variation and found >500 surrogate variables. Is it good practice to perform a likelihood ratio test (LRT) with DESeq2, where the full model is ~case+SV1+SV2+...+SVn and the reduced model is ~case, to decide how many surrogate variables I should add while avoiding overfitting? The idea would be to first add SV1 to the full model, then SV1+SV2, then SV1+SV2+SV3, and so on, and to stop once the number of differentially expressed genes decreases relative to the previous model.

 


I think something went wrong with your estimation of SVs. Can you post all your code and the output of sessionInfo()?

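Reply written 15 months ago by Michael Love

library(DESeq2)
library(sva)

# Normalize the counts and drop genes with a very low mean
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx, ]

# Full model (case effect) and null model (intercept only)
mod <- model.matrix(~ case, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

# Estimate the number of surrogate variables with the "leek" method
n.sv <- num.sv(dat, mod, method = "leek")
n.sv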

[1] 589

 

Reply written 15 months ago (modified 15 months ago) by aec

What do you get with the default method "be"?

Reply written 15 months ago by Michael Love

# Default method ("be")
n.sv <- num.sv(dat, mod)
n.sv
[1] 1

 

Reply written 15 months ago by aec

I'll wait to see Jeff's answer, but this seems to be an issue.

I typically use a small number of SVs. Even with hundreds of samples, I usually find that 1-10 SVs or RUV factors are sufficient to capture technical variance.

Reply written 15 months ago by Michael Love
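
Editorial sketch (not part of the original replies): one way the advice above, using a small, fixed number of SVs, could look in code. Everything specific here is an assumption for illustration: the data are simulated with makeExampleDESeqDataSet() because the thread's dds is not available, the case column is copied from the simulated condition column, and n.sv = 2 is an arbitrary small value rather than a recommendation.

```r
library(DESeq2)
library(sva)

# Simulated stand-in for the real data (assumption: the thread's `dds` is unavailable)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 2000, m = 20)
dds$case <- dds$condition  # name the grouping column as in the thread

# Same preprocessing as in the thread: normalize, then drop low-count genes
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]

# Full model (case effect) and null model (intercept only)
mod <- model.matrix(~ case, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

# Fix the number of surrogate variables to a small value (arbitrary choice here)
n.sv <- 2
svseq <- svaseq(dat, mod, mod0, n.sv = n.sv)

# Add the SVs as covariates, keeping the variable of interest last in the design
dds$SV1 <- svseq$sv[, 1]
dds$SV2 <- svseq$sv[, 2]
design(dds) <- ~ SV1 + SV2 + case

# Default Wald test for the case effect, adjusted for the two SVs
dds <- DESeq(dds)
res <- results(dds)
```

With real data, the simulated dds would be replaced by the actual DESeqDataSet, and the choice of n.sv would need to be checked, for example by plotting the SVs against known batch or technical variables.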

Thanks, Michael.

Reply written 15 months ago by aec