Question

How to plot PCA for only one of two compared groups, or plot three groups at a time

Mnoon ▴ 20

Hi,

I have two questions about my RNA-Seq datasets, which I have analyzed using DESeq2. I have used PCA plots for exploratory purposes. Of the two groups compared (each with 3 biological replicates), the samples of one group are spread apart on the plot, and it's really hard to decide which sample should be removed as an outlier before proceeding to differential expression analysis.

1. What is an acceptable % of variance for making an outlier decision? I was thinking that if I could plot the samples of one group separately, the variance would be clearer and the outlier easier to identify. Is there any way I could do this?

2. For another dataset, I have 5 biological replicates for each group (and also control vs. treatment between two cell types). I see a similar clustering pattern for some samples within a group, as described above. Since a PCA plot shows two dimensions and is suited to showing differences between two groups/conditions, is there any way I could plot more than two groups at a time and see how/where samples cluster on the plot?

Thanks,

M

deseq2 bioconductor rna-seq pca • 2.7k views

updated 7.4 years ago by Michael Love 41k • written 7.4 years ago by Mnoon ▴ 20

score 1 · Answer 1 · 2016-11-08

First, the mechanics question: you can just subset the DESeqTransform object using standard operations:

# keep only the first three samples (columns) of the transformed object
rld.sub <- rld[, c(1, 2, 3)]
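As a minimal sketch of that subsetting step, the example below selects the samples of one group by their condition label instead of by column position, then plots the PCA for that group alone. It assumes DESeq2 is installed and uses simulated data from makeExampleDESeqDataSet(); the condition level "A" comes from that simulated dataset and is not from the original question.

```r
library(DESeq2)

# Simulated stand-in for a real dataset: 1000 genes, 6 samples,
# with a 'condition' factor (levels "A" and "B", 3 samples each).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

# Transform using all samples, as in the original analysis.
rld <- rlog(dds, blind = TRUE)

# Subset the DESeqTransform to the samples of one group only.
rld.sub <- rld[, rld$condition == "A"]

# plotPCA() recomputes the principal components on whatever object
# it is given, so this plot reflects only within-group spread.
plotPCA(rld.sub, intgroup = "condition")
```

Because the components are recomputed on the subset, the axes and the % variance shown will differ from those of the full plot.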

Furthermore, with only 3 samples per group I would be careful about removing outliers, because the spread might just reflect natural variation in that condition. By keeping the sample in, the DESeq2 methods will be able to estimate the proper variability in the counts and be appropriately conservative in performing inference. I would require additional evidence to remove a sample when there are only 3 per group, for example, that the sequencing depth was much lower, or that the quality scores of the reads were much lower. I recommend you run FastQC and then MultiQC to see if one of the samples is of poor quality, and if they all seem to be of the same quality, I would keep the sample in.
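One quick way to look at sequencing depth before turning to FastQC and MultiQC is to compare per-sample library sizes. The sketch below uses a simulated count matrix in base R; the sample names, the artificially shallow sample, and the 0.5 cutoff are all illustrative assumptions, not recommended values. With a real DESeq2 object the matrix would come from counts(dds).

```r
# Simulated count matrix (genes x samples); all values are made up.
set.seed(1)
cts <- matrix(rpois(6000, lambda = 100), nrow = 1000,
              dimnames = list(NULL, paste0("sample", 1:6)))

# Make one sample artificially shallow for illustration.
cts[, "sample3"] <- rpois(1000, lambda = 10)

# Library size = total counts per sample (column sums).
lib.size <- colSums(cts)

# Depth of each sample relative to the median sample.
rel.depth <- lib.size / median(lib.size)
print(round(rel.depth, 2))

# Flag samples far below the median; 0.5 is an arbitrary cutoff
# chosen only for this example.
shallow <- names(rel.depth)[rel.depth < 0.5]
print(shallow)
```

A much lower depth is only one piece of evidence; it does not by itself justify dropping a sample.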

I'm not sure about your second question. A PCA plot of the top two dimensions with all the samples should be sufficient to get a sense of the differences due to condition and batch, etc.

Note that you can show more than one variable at a time on the plot (using shapes and colors) with the ggplot2 code for PCA in the DESeq2 vignette:

vignette("DESeq2")
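For reference, here is a sketch along the lines of the vignette's ggplot2 approach, mapping one variable to color and another to shape so that more than two groups appear on the same plot. It assumes DESeq2 and ggplot2 are installed; the data are simulated, and the celltype column with levels "X" and "Y" is a hypothetical second variable added for illustration.

```r
library(DESeq2)
library(ggplot2)

# Simulated data: 1000 genes, 8 samples, 'condition' with levels A and B.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 8)

# Hypothetical second variable, crossed with condition.
dds$celltype <- factor(rep(c("X", "Y"), times = 4))

rld <- rlog(dds, blind = TRUE)

# returnData = TRUE gives the PC coordinates plus the grouping
# columns instead of drawing the default plot.
pcaData <- plotPCA(rld, intgroup = c("condition", "celltype"),
                   returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))

# Color by condition, shape by cell type: four groups on one plot.
p <- ggplot(pcaData, aes(PC1, PC2, color = condition, shape = celltype)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  coord_fixed()
print(p)
```

The same data frame also carries a name column, which can be passed to geom_text() if individual samples need to be labeled.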