Question: What can be done to separate the healthy and patient samples in PCA plot and Distance matrix
10 months ago by
Tezpur University, India

Hello, I am running an RNA-seq analysis with DESeq2 to find differentially expressed genes between 13 patients and 6 healthy donors. Before the DESeq2 analysis, I am visualizing my samples with a sample distance matrix and a PCA plot, using the following commands:

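library("DESeq2")
library("pheatmap")
library("RColorBrewer")

# rld is the rlog-transformed DESeqDataSet, e.g. rld <- rlog(dds).
# Assumption: sampleDists was computed as in the DESeq2 vignette.
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(rld$condition, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

plotPCA(rld, intgroup="condition")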


Here I am not able to see the two sample groups separate. Am I supposed to discard the samples that cannot be separated, and how can one identify exactly which samples these are? Is there any function or approach to separate the samples without discarding them? Kindly reply. Any help in this regard will be highly appreciated.

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=en_IN.UTF-8 LC_NUMERIC=C LC_TIME=en_IN.UTF-8 LC_COLLATE=en_IN.UTF-8
[5] LC_MONETARY=en_IN.UTF-8 LC_MESSAGES=en_IN.UTF-8 LC
LC_MEASUREMENT=en_IN.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] hexbin_1.27.2 vsn_3.50.0 pheatmap_1.0.12 RColorBrewer_1.1-2
[5] DESeq2_1.22.1 SummarizedExperiment_1.12.0 DelayedArray_0.8.0 BiocParallel_1.16.5
[9] matrixStats_0.54.0 Biobase_2.42.0 GenomicRanges_1.34.0 GenomeInfoDb_1.18.1
[13] IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0

loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.5.2 Formula_1.2-3 assertthat_0.2.0 affy_1.60.0
[6] BiocManager_1.30.4 latticeExtra_0.6-28 blob_1.1.1 GenomeInfoDbData_1.2.0 pillar_1.3.1
[11] RSQLite_2.1.1 backports_1.1.3 lattice_0.20-38 limma_3.38.3 glue_1.3.0
[16] digest_0.6.18 XVector_0.22.0 checkmate_1.9.0 colorspace_1.4-0 preprocessCore_1.44.0
[21] htmltools_0.3.6 Matrix_1.2-15 plyr_1.8.4 XML_3.98-1.16 pkgconfig_2.0.2
[26] genefilter_1.64.0 zlibbioc_1.28.0 purrr_0.2.5 xtable_1.8-3 scales_1.0.0
[31] affyio_1.52.0 htmlTable_1.13.1 tibble_2.0.1 annotate_1.60.0 ggplot2_3.1.0
[36] nnet_7.3-12 lazyeval_0.2.1 survival_2.43-3 magrittr_1.5 crayon_1.3.4
[41] memoise_1.1.0 foreign_0.8-70 tools_3.5.2 data.table_1.12.0 stringr_1.3.1
[46] locfit_1.5-9.1 munsell_0.5.0 cluster_2.0.7-1 AnnotationDbi_1.44.0 bindrcpp_0.2.2
[51] compiler_3.5.2 rlang_0.3.1 grid_3.5.2 RCurl_1.95-4.11 rstudioapi_0.9.0
[56] htmlwidgets_1.3 labeling_0.3 bitops_1.0-6 base64enc_0.1-3 gtable_0.2.0
[61] DBI_1.0.0 R6_2.3.0 gridExtra_2.3 knitr_1.21 dplyr_0.7.8
[66] bit_1.1-14 bindr_0.1.1 Hmisc_4.1-1 stringi_1.2.4 Rcpp_1.0.0
[71] geneplotter_1.60.0 rpart_4.1-13 acepack_1.4.1 tidyselect_0.2.5 xfun_0.4

visualization deseq2 pcaplot • 273 views
modified 10 months ago by Michael Love (26k) • written 10 months ago by Mithil Gaikwad (0)

Keeping in mind that all these data come from a single source, biological replicates of the same condition should cluster together. This is a way to check whether similar samples are reproducible enough.

Removing the samples that do not cluster will certainly improve the downstream analysis. But before that, instead of using all the genes to construct the sample distance matrix, I would suggest trying a restricted set, for example a number of housekeeping genes or only the protein-coding genes. This may give some clue.
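As a rough illustration of the gene-subset idea, here is a minimal sketch in base R. It assumes a genes-by-samples matrix of transformed values; `mat` below is simulated and only stands in for something like `assay(rld)`, and `subset_genes` is a hypothetical ID list that in practice would come from an annotation.

```r
# Toy stand-in for a transformed expression matrix such as assay(rld):
# rows are genes, columns are samples. All values here are simulated.
set.seed(1)
n_genes <- 200
samples <- c(paste0("patient", 1:4), paste0("healthy", 1:3))
mat <- matrix(rnorm(n_genes * length(samples)), nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)), samples))

# Hypothetical subset, e.g. protein-coding or housekeeping gene IDs taken
# from an annotation; here it is simply the first 50 toy genes.
subset_genes <- paste0("gene", 1:50)

# Keep only the subset genes that are actually present in the matrix.
keep <- intersect(subset_genes, rownames(mat))

# Euclidean distances between samples, computed on the subset only.
# dist() compares rows, so the matrix is transposed first.
subset_dists <- dist(t(mat[keep, , drop = FALSE]))
subset_dist_matrix <- as.matrix(subset_dists)
```

With real data, the same `keep` vector could presumably be used as `rld[keep, ]` before calling `plotPCA`, bearing in mind that `plotPCA` still restricts itself to the most variable genes through its `ntop` argument.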

Answer: What can be done to separate the healthy and patient samples in PCA plot and Dis
10 months ago by
Michael Love (26k)
United States

"Here I am not able to see the two sample groups separate. Am I supposed to discard the samples that cannot be separated, and how can one identify exactly which samples these are?"

No, it can be dangerous to remove samples which do not cluster, as you can bias your results toward your desired conclusion. If a sample is an outlier in the PCA plot, I then look for technical reasons why that sample may be of low quality. I use FastQC and MultiQC to look at various diagnostic plots.
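To see which sample sits where in the PCA plot (for follow-up QC, not for removal), one option is to label each point with its sample name. Below is a minimal base R sketch under stated assumptions: `mat` is simulated and only stands in for `assay(rld)`, and the sample names are made up. It selects the 500 most variable genes and runs `prcomp`, which is similar to what `plotPCA` does by default.

```r
# Simulated stand-in for assay(rld): genes in rows, samples in columns.
set.seed(42)
samples <- c(paste0("patient", 1:13), paste0("healthy", 1:6))
condition <- factor(sub("[0-9]+$", "", samples))
mat <- matrix(rnorm(1000 * length(samples)), nrow = 1000,
              dimnames = list(paste0("gene", 1:1000), samples))

# Use the 500 genes with the highest variance across samples.
gene_var <- apply(mat, 1, var)
top <- order(gene_var, decreasing = TRUE)[seq_len(min(500, nrow(mat)))]

# PCA with samples as observations (hence the transpose).
pca <- prcomp(t(mat[top, ]))
percent_var <- pca$sdev^2 / sum(pca$sdev^2)

pca_df <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2],
                     condition = condition, name = samples)

# Label every point so a sample lying away from its group can be named.
plot(pca_df$PC1, pca_df$PC2, col = as.integer(pca_df$condition), pch = 19,
     xlab = sprintf("PC1 (%.0f%% variance)", 100 * percent_var[1]),
     ylab = sprintf("PC2 (%.0f%% variance)", 100 * percent_var[2]))
text(pca_df$PC1, pca_df$PC2, labels = pca_df$name, pos = 3, cex = 0.7)
```

With a real DESeq2 object, `plotPCA(rld, intgroup="condition", returnData=TRUE)` returns a comparable data frame (with a `name` column) instead of a plot, which can then be labeled in the same way. Any sample that stands out would then be the one to check in the FastQC and MultiQC reports.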