I am using svaseq and DESeq2 together to remove surrogate variables from an RNA-seq dataset and call DEG. I ran the raw counts through DESeq2 and rlog transformed the data for processing by svaseq.
dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ condition)
dds <- DESeq(dds)
rld <- rlog(dds, blind = FALSE)
counts_deseq <- assay(rld)
pca = prcomp(t( counts_deseq ), scale=FALSE)
Notice that samples A1 and B1 are separated in PC2. B1_0 and B1_3 are separated in PC1, as I hoped.
This is my svaseq R code. counts_seq contains the rlog transformed data produced above.
mod <- model.matrix(~ condition, colData)
mod0 <- model.matrix(~ 1, colData[, -1])
svseq <- svaseq(counts_deseq, mod, mod0, n.sv=3)
I used the R function below to "clean" the original input and obtain corrected counts.
cleaningY = function(y, mod, svaobj) { X = cbind(mod, svaobj$sv) Hat = solve(t(X)%*%X)%*%t(X) beta = (Hat%*%t(y)) P = ncol(mod) cleany = y-t(as.matrix(X[,-c(1:P)])%*%beta[-c(1:P),]) return(cleany) }
counts_sva = cleaningY(counts_deseq, mod, svseq)
pca = prcomp(t( counts_sva ), scale=FALSE)
The cleaned counts were then used to generate a PCA plot of PC1 x PC2, which showed that the two A1 and B1 batches are now more similar than before sva.
Then I added SV1 and SV2 to my colData and generated a DESeq dds object following http://www.bioconductor.org/help/workflows/rnaseqGene/#batch
dds.sva <- dds dds.sva$SV1 <- svseq$sv[,1]
dds.sva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + condition
rld.sva <- rlog(dds.sva, blind = FALSE) counts_deseq_sva <- assay(rld.sva) pca = prcomp(t( counts_deseq_sva ), scale=FALSE)
My colData now looks like this:
sample condition SV1 SV2 sizeFactor
A1_0 control -0.648430548730741 -0.0538911129250703 0.536015875141358
B1_0_1 control 0.284400077531668 0.828886823588163 0.800354347640333
B1_0_2 control 0.229646673150629 -0.394550325516213 0.615815976577249
B1_0_3 control 0.158381936289815 -0.338562050495541 0.731675847902015
A1_3 treatment -0.537353868258029 0.144903580282808 1.43186332691591
B1_3_2 treatment 0.269242789169143 -0.118951595650863 1.70605551398322
B1_3_3 treatment 0.244112940847514 -0.0678353192832832 2.33518790975716
When I plotted a PCA with the new rlog data, I observed a slight change from the original rlog data. However, it differs from the svaseq corrected PCA and still shows separation between the batches.
I was expecting the PCA with the rlog transformed with SV1 and SV2 accounted for to look like the cleaned svaseq PCA.
Am I doing something wrong? Any help is much appreciated.
Thank you!
There seem to be several copy-and-paste errors in your code, such as the last block where you compute counts_deseq_sva and then compute prcomp on counts_sva.
Thanks for catching that, I fixed it.