Question: PCA of svaseq "cleaned" counts and DESeq2 SV subtracted counts are different?
gravatar for Peter Alto
22 months ago by
Peter Alto10
Peter Alto10 wrote:

I am using svaseq and DESeq2 together to remove surrogate variables from an RNA-seq dataset and call DEG. I ran the raw counts through DESeq2 and rlog transformed the data for processing by svaseq.

dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ condition)
dds <- DESeq(dds)
rld <- rlog(dds, blind = FALSE)
counts_deseq <- assay(rld)
pca = prcomp(t( counts_deseq ), scale=FALSE)

Notice that samples A1 and B1 are separated in PC2. B1_0 and B1_3 are separated in PC1, as I hoped.

This is my svaseq R code. counts_seq contains the rlog transformed data produced above.

mod <- model.matrix(~ condition, colData)
mod0 <- model.matrix(~ 1, colData[, -1])
svseq <- svaseq(counts_deseq, mod, mod0,

I used the R function below to "clean" the original input and obtain corrected counts.

  cleaningY = function(y, mod, svaobj) {
         X = cbind(mod, svaobj$sv)
           Hat = solve(t(X)%*%X)%*%t(X)
           beta = (Hat%*%t(y))
           P = ncol(mod)
           cleany = y-t(as.matrix(X[,-c(1:P)])%*%beta[-c(1:P),])
counts_sva = cleaningY(counts_deseq, mod, svseq)
pca = prcomp(t( counts_sva ), scale=FALSE)

The cleaned counts were then used to generate a PCA plot of PC1 x PC2, which showed that the two A1 and B1 batches are now more similar than before sva.

Then I added SV1 and SV2 to my colData and generated a DESeq dds object following

dds.sva <- dds
dds.sva$SV1 <- svseq$sv[,1]
dds.sva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + condition
rld.sva <- rlog(dds.sva, blind = FALSE)
counts_deseq_sva <- assay(rld.sva)
pca = prcomp(t( counts_deseq_sva ), scale=FALSE)

My colData now looks like this:

sample  condition   SV1                     SV2                     sizeFactor
A1_0    control     -0.648430548730741      -0.0538911129250703     0.536015875141358
B1_0_1  control     0.284400077531668       0.828886823588163       0.800354347640333
B1_0_2  control     0.229646673150629       -0.394550325516213      0.615815976577249
B1_0_3  control     0.158381936289815       -0.338562050495541      0.731675847902015
A1_3    treatment   -0.537353868258029      0.144903580282808       1.43186332691591
B1_3_2  treatment   0.269242789169143       -0.118951595650863      1.70605551398322
B1_3_3  treatment   0.244112940847514       -0.0678353192832832     2.33518790975716

When I plotted a PCA with the new rlog data, I observed a slight change from the original rlog data. However, it differs from the svaseq corrected PCA and still shows separation between the batches.

I was expecting the PCA with the rlog transformed with SV1 and SV2 accounted for to look like the cleaned svaseq PCA.

Am I doing something wrong? Any help is much appreciated.

Thank you!

ADD COMMENTlink modified 22 months ago • written 22 months ago by Peter Alto10

There seem to be several copy-and-paste errors in your code, such as the last block where you compute counts_deseq_sva and then compute prcomp on counts_sva.

ADD REPLYlink written 22 months ago by Ryan C. Thompson6.5k

Thanks for catching that, I fixed it.

ADD REPLYlink written 22 months ago by Peter Alto10
gravatar for Ryan C. Thompson
22 months ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson6.5k wrote:

First of all, you should add coord_fixed() to all your PCA plots so that PC1 and PC2 will be on the same scale. Without this, it is very difficult to tell the relative size of the PCs. Seconrly, the rlog transformation does not perform any kind of batch removal, so I wouldn't expect a big difference. As far as I can tell, the only difference between dds and dds.sva is that you're using a different design formula. This will estimate a slightly different mean-variance trend, and will therefore slightly change the degree of shrinkage performed on low counts relative to higher counts, which in turn will slightly change the PCA plot and any other downstream analyses. But I wouldn't expect it to make a major difference.

ADD COMMENTlink written 22 months ago by Ryan C. Thompson6.5k

Yes, Ryan is correct that the rlog and VST in DESeq2 do not remove differences associated with the design. There have been a few posts about this recently. In the vignette, in the section of transformations, it's mentioned that the design is only used to get a sense of the amount of within-condition variability and the global trend of this variability on the mean. In the recent similar posts, I recommend using rlog or VST to stabilize variance and then limma's removeBatchEffect on the transformed data, if you want to remove differences on the transformed data associated with any variables in the design.

ADD REPLYlink written 22 months ago by Michael Love17k
gravatar for Peter Alto
22 months ago by
Peter Alto10
Peter Alto10 wrote:

Thank you very much for your replies.

I apologize for missing the information about rlog in the vignette.

I read the post about removeBatchEffect and I'll try to use the covariates option to remove SV's estimated by svaseq.

ADD COMMENTlink written 22 months ago by Peter Alto10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 401 users visited in the last hour