PCA of svaseq "cleaned" counts and DESeq2 SV subtracted counts are different?
Entering edit mode
Peter Alto ▴ 10
Last seen 7.0 years ago

I am using svaseq and DESeq2 together to remove surrogate variables from an RNA-seq dataset and call DEG. I ran the raw counts through DESeq2 and rlog transformed the data for processing by svaseq.

dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ condition)
dds <- DESeq(dds)
rld <- rlog(dds, blind = FALSE)
counts_deseq <- assay(rld)
pca = prcomp(t( counts_deseq ), scale=FALSE)

Notice that samples A1 and B1 are separated in PC2. B1_0 and B1_3 are separated in PC1, as I hoped.

This is my svaseq R code. counts_seq contains the rlog transformed data produced above.

mod <- model.matrix(~ condition, colData)
mod0 <- model.matrix(~ 1, colData[, -1])
svseq <- svaseq(counts_deseq, mod, mod0, n.sv=3)

I used the R function below to "clean" the original input and obtain corrected counts.

  cleaningY = function(y, mod, svaobj) {
         X = cbind(mod, svaobj$sv)
           Hat = solve(t(X)%*%X)%*%t(X)
           beta = (Hat%*%t(y))
           P = ncol(mod)
           cleany = y-t(as.matrix(X[,-c(1:P)])%*%beta[-c(1:P),])
counts_sva = cleaningY(counts_deseq, mod, svseq)
pca = prcomp(t( counts_sva ), scale=FALSE)

The cleaned counts were then used to generate a PCA plot of PC1 x PC2, which showed that the two A1 and B1 batches are now more similar than before sva.

Then I added SV1 and SV2 to my colData and generated a DESeq dds object following http://www.bioconductor.org/help/workflows/rnaseqGene/#batch

dds.sva <- dds
dds.sva$SV1 <- svseq$sv[,1]
dds.sva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + condition
rld.sva <- rlog(dds.sva, blind = FALSE)
counts_deseq_sva <- assay(rld.sva)
pca = prcomp(t( counts_deseq_sva ), scale=FALSE)

My colData now looks like this:

sample  condition   SV1                     SV2                     sizeFactor
A1_0    control     -0.648430548730741      -0.0538911129250703     0.536015875141358
B1_0_1  control     0.284400077531668       0.828886823588163       0.800354347640333
B1_0_2  control     0.229646673150629       -0.394550325516213      0.615815976577249
B1_0_3  control     0.158381936289815       -0.338562050495541      0.731675847902015
A1_3    treatment   -0.537353868258029      0.144903580282808       1.43186332691591
B1_3_2  treatment   0.269242789169143       -0.118951595650863      1.70605551398322
B1_3_3  treatment   0.244112940847514       -0.0678353192832832     2.33518790975716

When I plotted a PCA with the new rlog data, I observed a slight change from the original rlog data. However, it differs from the svaseq corrected PCA and still shows separation between the batches.

I was expecting the PCA with the rlog transformed with SV1 and SV2 accounted for to look like the cleaned svaseq PCA.

Am I doing something wrong? Any help is much appreciated.

Thank you!

deseq2 sva svaseq pca • 3.6k views
Entering edit mode

There seem to be several copy-and-paste errors in your code, such as the last block where you compute counts_deseq_sva and then compute prcomp on counts_sva.

Entering edit mode

Thanks for catching that, I fixed it.

Entering edit mode
Last seen 6 months ago
Scripps Research, La Jolla, CA

First of all, you should add coord_fixed() to all your PCA plots so that PC1 and PC2 will be on the same scale. Without this, it is very difficult to tell the relative size of the PCs. Seconrly, the rlog transformation does not perform any kind of batch removal, so I wouldn't expect a big difference. As far as I can tell, the only difference between dds and dds.sva is that you're using a different design formula. This will estimate a slightly different mean-variance trend, and will therefore slightly change the degree of shrinkage performed on low counts relative to higher counts, which in turn will slightly change the PCA plot and any other downstream analyses. But I wouldn't expect it to make a major difference.

Entering edit mode

Yes, Ryan is correct that the rlog and VST in DESeq2 do not remove differences associated with the design. There have been a few posts about this recently. In the vignette, in the section of transformations, it's mentioned that the design is only used to get a sense of the amount of within-condition variability and the global trend of this variability on the mean. In the recent similar posts, I recommend using rlog or VST to stabilize variance and then limma's removeBatchEffect on the transformed data, if you want to remove differences on the transformed data associated with any variables in the design.

Entering edit mode
Peter Alto ▴ 10
Last seen 7.0 years ago

Thank you very much for your replies.

I apologize for missing the information about rlog in the vignette.

I read the post about removeBatchEffect and I'll try to use the covariates option to remove SV's estimated by svaseq.


Login before adding your answer.

Traffic: 586 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6