Question

PCA of svaseq "cleaned" counts and DESeq2 SV subtracted counts are different?

1

Entering edit mode

Peter Alto ▴ 20

@peter-alto-10891

Last seen 8.1 years ago

US/Chicago

I am using svaseq and DESeq2 together to remove surrogate variables from an RNA-seq dataset and call DEG. I ran the raw counts through DESeq2 and rlog transformed the data for processing by svaseq.

dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ condition)
dds <- DESeq(dds)
rld <- rlog(dds, blind = FALSE)
counts_deseq <- assay(rld)
pca = prcomp(t( counts_deseq ), scale=FALSE)

Notice that samples A1 and B1 are separated in PC2. B1_0 and B1_3 are separated in PC1, as I hoped.

This is my svaseq R code. counts_seq contains the rlog transformed data produced above.

mod <- model.matrix(~ condition, colData)
mod0 <- model.matrix(~ 1, colData[, -1])
svseq <- svaseq(counts_deseq, mod, mod0, n.sv=3)

I used the R function below to "clean" the original input and obtain corrected counts.

  cleaningY = function(y, mod, svaobj) {
         X = cbind(mod, svaobj$sv)
           Hat = solve(t(X)%*%X)%*%t(X)
           beta = (Hat%*%t(y))
           P = ncol(mod)
           cleany = y-t(as.matrix(X[,-c(1:P)])%*%beta[-c(1:P),])
           return(cleany)
  }

counts_sva = cleaningY(counts_deseq, mod, svseq)
pca = prcomp(t( counts_sva ), scale=FALSE)

The cleaned counts were then used to generate a PCA plot of PC1 x PC2, which showed that the two A1 and B1 batches are now more similar than before sva.

Then I added SV1 and SV2 to my colData and generated a DESeq dds object following http://www.bioconductor.org/help/workflows/rnaseqGene/#batch

dds.sva <- dds
dds.sva$SV1 <- svseq$sv[,1]
dds.sva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + condition
rld.sva <- rlog(dds.sva, blind = FALSE)
counts_deseq_sva <- assay(rld.sva)
pca = prcomp(t( counts_deseq_sva ), scale=FALSE)

My colData now looks like this:

sample condition SV1 SV2 sizeFactor A1_0 control -0.648430548730741 -0.0538911129250703 0.536015875141358 B1_0_1 control 0.284400077531668 0.828886823588163 0.800354347640333 B1_0_2 control 0.229646673150629 -0.394550325516213 0.615815976577249 B1_0_3 control 0.158381936289815 -0.338562050495541 0.731675847902015 A1_3 treatment -0.537353868258029 0.144903580282808 1.43186332691591 B1_3_2 treatment 0.269242789169143 -0.118951595650863 1.70605551398322 B1_3_3 treatment 0.244112940847514 -0.0678353192832832 2.33518790975716

When I plotted a PCA with the new rlog data, I observed a slight change from the original rlog data. However, it differs from the svaseq corrected PCA and still shows separation between the batches.

I was expecting the PCA with the rlog transformed with SV1 and SV2 accounted for to look like the cleaned svaseq PCA.

Am I doing something wrong? Any help is much appreciated.

Thank you!

deseq2 sva svaseq pca • 4.1k views

ADD COMMENT • link 8.8 years ago Peter Alto ▴ 20

0

Entering edit mode

There seem to be several copy-and-paste errors in your code, such as the last block where you compute counts_deseq_sva and then compute prcomp on counts_sva.

ADD REPLY • link 8.8 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Thanks for catching that, I fixed it.

ADD REPLY • link 8.8 years ago Peter Alto ▴ 20

score 0 · Answer 1 · 2016-06-13

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 5 months ago

Icahn School of Medicine at Mount Sinai…

First of all, you should add coord_fixed() to all your PCA plots so that PC1 and PC2 will be on the same scale. Without this, it is very difficult to tell the relative size of the PCs. Seconrly, the rlog transformation does not perform any kind of batch removal, so I wouldn't expect a big difference. As far as I can tell, the only difference between dds and dds.sva is that you're using a different design formula. This will estimate a slightly different mean-variance trend, and will therefore slightly change the degree of shrinkage performed on low counts relative to higher counts, which in turn will slightly change the PCA plot and any other downstream analyses. But I wouldn't expect it to make a major difference.

ADD COMMENT • link 8.8 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Yes, Ryan is correct that the rlog and VST in DESeq2 do not remove differences associated with the design. There have been a few posts about this recently. In the vignette, in the section of transformations, it's mentioned that the design is only used to get a sense of the amount of within-condition variability and the global trend of this variability on the mean. In the recent similar posts, I recommend using rlog or VST to stabilize variance and then limma's removeBatchEffect on the transformed data, if you want to remove differences on the transformed data associated with any variables in the design.

ADD REPLY • link 8.8 years ago Michael Love 43k

score 0 · Answer 2 · 2016-06-14

0

Entering edit mode

Peter Alto ▴ 20

@peter-alto-10891

Last seen 8.1 years ago

US/Chicago

Thank you very much for your replies.

I apologize for missing the information about rlog in the vignette.

I read the post about removeBatchEffect and I'll try to use the covariates option to remove SV's estimated by svaseq.

ADD COMMENT • link 8.8 years ago Peter Alto ▴ 20