#### The support.bioconductor.org editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: Differences in PCA plot - rld versus vsd normalized data (DESeq2)
0
19 months ago by
pkachroo10
pkachroo10 wrote:

Hi all,

I generated PCA plots for my RNASeq data (3 biological replicates per sample) using rld and vsd normalized count data (images below).

I then removed data for three samples that appeared like outliers on the PCA plots (marked in red circle) and also I know that those three samples have mutations in certain genes and other samples should be wild-type like. I regenerated the PCA plot with all samples except the three outliers and get very different PCA plots using rld and vsd normalized counts.

I am therefore wondering what could contribute to the differences? Also the blue dots and the red dots in the PCA represent samples processed as different experiments (batch effects). So, can I use  design = ~ batch + condition to account for the batch effect. How should I re-plot PCA for the normalized counts after removing the batch effect ?

Regards,

Priyanka



deseq2 batch effect pca • 1.8k views
modified 19 months ago • written 19 months ago by pkachroo10

Thanks for your help Micheal. Based on the comments on Bioconductor DESeq2 threads, I understood that I can remove batch effects by simply accounting for it in the design.

I tried the following command: ddsFull<-DESeqDataSetFromMatrix(countData=countsTable, colData=colData, design=~ batch + group).

I get the following error : Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified.One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed.

Here is my experimental design (link for the table below): I have 50 strains in triplicates (150 samples). The 50 strains are from three genetic groups (group 1,2 and 3). Majority of the strains in each group are wild-type, however some strains have mutations in specific genes (status column). From the PCA plots (shown in the above thread), I see clustering due to batch effect - strains done as different experiments  (batch - A and B) cluster together. So here are my two questions:

1. I want to compare strains based on genetic groups and account for the genetic status (wild type or not) and the batch effects. Is the following design appropriate:

design = ~batch+status+group

Finally perform wald tests for different contrasts: group 1 versus group 2; group 2 versus group 3 and group 3 versus group 1

2. How do I resolve the error : checkFullRank(modelMatrix)  ?

Here is the link for Coldata:

"I can remove batch effects by simply accounting for it in the design"

You can control for them in differential expression, but including them in the design will not remove their contribution to distances in the PCA plot. For that you need to use limma's removeBatchEffect() on the transformed data, providing the batches. See my comment on the other Answer below.

Batch and group are not confounded in that table you sent. Are you sure this was the design that gave the error?

Hi Micheal,

Sorry, I reran the design and didn't get the error anymore. Must be a mistake on my end. I used the following design:

ddsFull<-DESeqDataSetFromMatrix(countData=countsTable, colData=colData, design=~batch+status+group)

vsd_b <- varianceStabilizingTransformation(dds, blind=FALSE)
vstMat_b <- assay(vsd, blind=FALSE)

library(limma)

mat2<-removeBatchEffect(vstMat_b)

After removing batch effects, when I make the PCA plot using the following code, I get the 2 errors:

library(ggplot2)
data <- plotPCA(vsd, intgroup=c("status","group"), returnData=TRUE)
percentVar <- round(100 * attr(data, "percentVar"))
ggplot(data, aes(PC1, PC2, color=group, shape=status)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
geom_text(color="black",size=1.5, aes(label=strain), check_overlap = TRUE)+
coord_fixed()

Error in (function (classes, fdef, mtable)  :
unable to find an inherited method for function ‘plotPCA’ for signature ‘"matrix"’

Error in if (is.waive(data) || empty(data)) return(cbind(data, PANEL = integer(0))) :
missing value where TRUE/FALSE needed

Here is the link for the sessionInfo

hi,

Two things. First, to remove batch effects, you need to provide the batch variable(s) to the removeBatchEffect() function, as additional arguments. See the help file for that function. In the code above you just gave the function the data matrix, but you didn't provide the batch variable(s), so I think nothing will be done to the matrix.

Also my suggestion was:

"running it on the assay() of a transformed dataset and reassigning to assay()".

In R code, reassigment to the same slot looks like:

assay(vsd) <- ...

Thanks, once again. I realized that I had not added the batch variable, but I had already posted my response by then (Oops). Okay, so I used the following command and replotted PCA:

assay(vsd)<-removeBatchEffect(assay(vsd), vsd\$batch)

The PCA plots looks like this now:

I see less defined clustering by batch in the second PCA. What do you think?

Answer: Differences in PCA plot - rld versus vsd normalized data (DESeq2)
0
19 months ago by
drikaul0
La Jolla, CA
drikaul0 wrote:

You could try removing the batch effects, limma has a function you could use for that (http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/removeBatchEffect.html). But from the plots, it seems like there's a distinction in the data already.

Moreover, I would recommend using plotMDS instead of PCA for DGE data. Not sure why there is such a difference in the two normalization methods, but the vsd one seems more like it!

Yes you can use removeBatchEffect on the transformed counts by running it on the assay() of a transformed dataset and reassigning to assay(). Note that MDS vs PCA is not causing a difference. MDA with Euclidean distance produces the same plot as PCA.
ADD REPLYlink modified 19 months ago by Wolfgang Huber13k • written 19 months ago by Michael Love22k
Answer: Differences in PCA plot - rld versus vsd normalized data (DESeq2)
0
19 months ago by
Michael Love22k
United States
Michael Love22k wrote:
I'd recommend blind=FALSE (discussed in vignette) and the VST when there are many samples as here. There is a similar thread to this one that was posted a few days ago. You can find it by clicking the DESeq2 tag on this post.