Search
Question: PCA of svaseq "cleaned" counts and DESeq2 SV subtracted counts are different?
1
2.3 years ago by
Peter Alto10
US/Chicago
Peter Alto10 wrote:

I am using svaseq and DESeq2 together to remove surrogate variables from an RNA-seq dataset and call DEG. I ran the raw counts through DESeq2 and rlog transformed the data for processing by svaseq.

dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ condition)
dds <- DESeq(dds)
rld <- rlog(dds, blind = FALSE)
counts_deseq <- assay(rld)
pca = prcomp(t( counts_deseq ), scale=FALSE)

Notice that samples A1 and B1 are separated in PC2. B1_0 and B1_3 are separated in PC1, as I hoped.

This is my svaseq R code. counts_seq contains the rlog transformed data produced above.

mod <- model.matrix(~ condition, colData)
mod0 <- model.matrix(~ 1, colData[, -1])
svseq <- svaseq(counts_deseq, mod, mod0, n.sv=3)

I used the R function below to "clean" the original input and obtain corrected counts.

  cleaningY = function(y, mod, svaobj) {
X = cbind(mod, svaobj$sv) Hat = solve(t(X)%*%X)%*%t(X) beta = (Hat%*%t(y)) P = ncol(mod) cleany = y-t(as.matrix(X[,-c(1:P)])%*%beta[-c(1:P),]) return(cleany) } counts_sva = cleaningY(counts_deseq, mod, svseq) pca = prcomp(t( counts_sva ), scale=FALSE) The cleaned counts were then used to generate a PCA plot of PC1 x PC2, which showed that the two A1 and B1 batches are now more similar than before sva. Then I added SV1 and SV2 to my colData and generated a DESeq dds object following http://www.bioconductor.org/help/workflows/rnaseqGene/#batch dds.sva <- dds dds.sva$SV1 <- svseq$sv[,1] dds.sva$SV2 <- svseq\$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + condition
rld.sva <- rlog(dds.sva, blind = FALSE)
counts_deseq_sva <- assay(rld.sva)
pca = prcomp(t( counts_deseq_sva ), scale=FALSE)

My colData now looks like this:

sample  condition   SV1                     SV2                     sizeFactor A1_0    control     -0.648430548730741      -0.0538911129250703     0.536015875141358 B1_0_1  control     0.284400077531668       0.828886823588163       0.800354347640333 B1_0_2  control     0.229646673150629       -0.394550325516213      0.615815976577249 B1_0_3  control     0.158381936289815       -0.338562050495541      0.731675847902015 A1_3    treatment   -0.537353868258029      0.144903580282808       1.43186332691591 B1_3_2  treatment   0.269242789169143       -0.118951595650863      1.70605551398322 B1_3_3  treatment   0.244112940847514       -0.0678353192832832     2.33518790975716

When I plotted a PCA with the new rlog data, I observed a slight change from the original rlog data. However, it differs from the svaseq corrected PCA and still shows separation between the batches.

I was expecting the PCA with the rlog transformed with SV1 and SV2 accounted for to look like the cleaned svaseq PCA.

Am I doing something wrong? Any help is much appreciated.

Thank you!

modified 2.3 years ago • written 2.3 years ago by Peter Alto10

There seem to be several copy-and-paste errors in your code, such as the last block where you compute counts_deseq_sva and then compute prcomp on counts_sva.

Thanks for catching that, I fixed it.

0
2.3 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson6.9k wrote:

First of all, you should add coord_fixed() to all your PCA plots so that PC1 and PC2 will be on the same scale. Without this, it is very difficult to tell the relative size of the PCs. Seconrly, the rlog transformation does not perform any kind of batch removal, so I wouldn't expect a big difference. As far as I can tell, the only difference between dds and dds.sva is that you're using a different design formula. This will estimate a slightly different mean-variance trend, and will therefore slightly change the degree of shrinkage performed on low counts relative to higher counts, which in turn will slightly change the PCA plot and any other downstream analyses. But I wouldn't expect it to make a major difference.

Yes, Ryan is correct that the rlog and VST in DESeq2 do not remove differences associated with the design. There have been a few posts about this recently. In the vignette, in the section of transformations, it's mentioned that the design is only used to get a sense of the amount of within-condition variability and the global trend of this variability on the mean. In the recent similar posts, I recommend using rlog or VST to stabilize variance and then limma's removeBatchEffect on the transformed data, if you want to remove differences on the transformed data associated with any variables in the design.

0
2.3 years ago by
Peter Alto10
US/Chicago
Peter Alto10 wrote:

Thank you very much for your replies.