Question: differential expression analysis with big dataset and >500 surrogate variables
aec wrote (17 months ago):

Dear all,

I have more than 500 RNA-seq samples and need to compare cases vs. controls. I first ran SVA to remove unknown variation and found >500 surrogate variables. Is it good practice to run a likelihood ratio test (LRT) in DESeq2, with full model = ~case+SV1+SV2+...+SVn and reduced model = ~case, to decide how many surrogate variables to add without overfitting? The idea would be to first add SV1 to the full model, then SV1+SV2, then SV1+SV2+SV3, and so on, and stop when the number of differentially expressed genes drops relative to the previous model.

 


I think something went wrong with your estimation of SVs. Can you post all your code and the output of sessionInfo()?

written 17 months ago by Michael Love
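library(DESeq2)
library(sva)

# Normalize the counts and drop genes with very low mean expression
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]

# Full model (with the case variable) and null model (intercept only)
mod <- model.matrix(~case, colData(dds))
mod0 <- model.matrix(~1, colData(dds))

# Estimate the number of surrogate variables with the "leek" method
n.sv <- num.sv(dat,mod,method="leek")
n.sv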

[1] 589

 

modified 17 months ago • written 17 months ago by aec

What do you get with the default method "be"?

written 17 months ago by Michael Love
n.sv <- num.sv(dat,mod)
n.sv
[1] 1

 

written 17 months ago by aec

I'll wait to see Jeff's answer, but this seems to be an issue.

I typically use a small number of SVs. Even with hundreds of samples, I usually find that 1-10 SVs or RUV factors are sufficient to capture technical variance.

written 17 months ago by Michael Love
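
A minimal sketch of what using a small, fixed number of SVs might look like, in case it helps. The simulated dataset, the choice of n.sv = 2, and the column names SV1 and SV2 are assumptions for illustration only; they are not taken from the poster's data, and the right number of SVs depends on the experiment.

```r
library(DESeq2)
library(sva)

# Simulated stand-in for the real data (assumption): 1000 genes, 20 samples
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 20)
dds$case <- dds$condition  # mimic the `case` column used in the thread

# Same preprocessing as in the code posted above
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]

mod  <- model.matrix(~ case, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

# Fix the number of SVs to a small value instead of relying on num.sv()
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

# Store the SVs as sample-level covariates and put them in the design,
# keeping the variable of interest last
dds$SV1 <- svseq$sv[, 1]
dds$SV2 <- svseq$sv[, 2]
design(dds) <- ~ SV1 + SV2 + case

dds <- DESeq(dds)
res <- results(dds)  # by default, tests the last design variable (case)
```

With real data, dds would be the existing DESeqDataSet and the simulation lines would be dropped.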

Thanks, Michael.

written 17 months ago by aec