Question

What can be done to separate the healthy and patient samples in PCA plot and Distance matrix

Mithil Gaikwad • 0

@mithil-gaikwad-16865

Tezpur University, India

Hello, I am doing an RNA-seq analysis with DESeq2 to find differentially expressed genes between 13 patients and 6 healthy donors. Before running the DESeq2 analysis, I am visualizing my samples with a sample distance matrix and a PCA plot, using the following commands:

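library("pheatmap")       # provides pheatmap()
library("RColorBrewer")   # provides brewer.pal()
# sampleDists is assumed to have been computed beforehand, e.g. dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
# label rows by condition; column labels are dropped to keep the plot readable
rownames(sampleDistMatrix) <- as.character(rld$condition)
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)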

[Figure: sample distance matrix heatmap]

plotPCA(rld, intgroup="condition")

[Figure: PCA plot]

In these plots the two sample groups do not separate. Am I supposed to discard the samples that cannot be separated, and how can I find out exactly which samples those are? Is there any function or approach that separates the groups without discarding samples? Any help in this regard will be highly appreciated.

sessionInfo()

R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_IN.UTF-8       LC_NUMERIC=C               LC_TIME=en_IN.UTF-8        LC_COLLATE=en_IN.UTF-8
 [5] LC_MONETARY=en_IN.UTF-8    LC_MESSAGES=en_IN.UTF-8    LC_PAPER=en_IN.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_IN.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] hexbin_1.27.2               vsn_3.50.0                  pheatmap_1.0.12             RColorBrewer_1.1-2
 [5] DESeq2_1.22.1               SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.5
 [9] matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.1
[13] IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0

loaded via a namespace (and not attached):
splines_3.5.2 affy_1.60.0 blob_1.1.1 RSQLite_2.1.1 limma_3.38.3
XVector_0.22.0 preprocessCore_1.44.0 plyr_1.8.4 genefilter_1.64.0 xtable_1.8-3
htmlTable_1.13.1 ggplot2_3.1.0 survival_2.43-3 memoise_1.1.0 data.table_1.12.0
munsell_0.5.0 bindrcpp_0.2.2 grid_3.5.2 htmlwidgets_1.3 base64enc_0.1-3
R6_2.3.0 dplyr_0.7.8 Hmisc_4.1-1 geneplotter_1.60.0 tidyselect_0.2.5
xfun_0.4 bit64_0.9-7 Formula_1.2-3 assertthat_0.2.0 BiocManager_1.30.4
latticeExtra_0.6-28 GenomeInfoDbData_1.2.0 pillar_1.3.1 backports_1.1.3 lattice_0.20-38
glue_1.3.0 digest_0.6.18 checkmate_1.9.0 colorspace_1.4-0 htmltools_0.3.6
Matrix_1.2-15 XML_3.98-1.16 pkgconfig_2.0.2 zlibbioc_1.28.0 purrr_0.2.5
scales_1.0.0 affyio_1.52.0 tibble_2.0.1 annotate_1.60.0 nnet_7.3-12
lazyeval_0.2.1 magrittr_1.5 crayon_1.3.4 foreign_0.8-70 tools_3.5.2
stringr_1.3.1 locfit_1.5-9.1 cluster_2.0.7-1 AnnotationDbi_1.44.0 compiler_3.5.2
rlang_0.3.1 RCurl_1.95-4.11 rstudioapi_0.9.0 labeling_0.3 bitops_1.0-6
gtable_0.2.0 DBI_1.0.0 gridExtra_2.3 knitr_1.21 bit_1.1-14
bindr_0.1.1 stringi_1.2.4 Rcpp_1.0.0 rpart_4.1-13 acepack_1.4.1

deseq2 PCAplot Visualization • 1.1k views
updated 5.3 years ago by Michael Love 41k • written 5.3 years ago by Mithil Gaikwad • 0

Keeping in mind that all these data come from a single source, biological replicates of the same condition should cluster together. This is a way to check whether samples of the same kind are reproducible.

Removing the non-clustering samples will certainly improve the downstream analysis. But before doing that, instead of using all genes to construct the sample distance matrix, I would suggest trying a restricted set, such as a number of housekeeping genes or only the protein-coding genes. This may give some clue.
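The gene-subset idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `mat` is a simulated stand-in for the transformed matrix (with real data it would be something like `assay(rld)`), and `gene_subset` is a hypothetical vector of gene IDs that you would have to supply yourself, for example from an annotation of protein-coding genes.

```r
# Simulated stand-in for assay(rld): 200 genes x 19 samples (assumption).
set.seed(42)
mat <- matrix(rnorm(200 * 19), nrow = 200,
              dimnames = list(paste0("gene", 1:200), paste0("S", 1:19)))

# Hypothetical gene list; replace with your own housekeeping or
# protein-coding gene IDs.
gene_subset <- paste0("gene", 1:50)

# Keep only IDs that are actually present in the matrix.
keep <- intersect(gene_subset, rownames(mat))

# Euclidean sample-to-sample distances on the restricted gene set.
sub_dists <- dist(t(mat[keep, , drop = FALSE]))
sub_mat   <- as.matrix(sub_dists)

# Hierarchical clustering of the samples on the restricted distances.
hc <- hclust(sub_dists)
plot(hc, main = "Samples clustered on gene subset")
```

The resulting `sub_dists` object could be passed to `pheatmap()` in place of `sampleDists` to compare the restricted heatmap with the all-gene one.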

5.3 years ago by Sangram Keshari Sahu ▴ 20

Michael Love 41k
@mikelove

United States

"Here I am not able to visualize both the sample group separately. Am I suppose to discard the samples, which can not be separated and how one can find this exact sample?"

No, it can be dangerous to remove samples that do not cluster, because you can bias your results toward your desired conclusion. If a sample is an outlier in the PCA plot, I then look for technical reasons why that sample may be of low quality. I use FastQC and MultiQC to look at various diagnostic plots.
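To see which sample a point in the PCA plot corresponds to, one option is to label the points by sample name. In DESeq2, `plotPCA(rld, intgroup="condition", returnData=TRUE)` returns the PC coordinates together with the sample names as a data frame instead of drawing the plot. The sketch below shows the same idea in base R on simulated data; the matrix `mat`, the sample names, and the artificially noisy sample `S5` are all assumptions made for illustration. A sample flagged this way is a candidate for a closer QC look, not a justification for dropping it.

```r
# Simulated stand-in for assay(rld): 1000 genes x 19 samples (assumption).
set.seed(1)
mat <- matrix(rnorm(1000 * 19), nrow = 1000,
              dimnames = list(paste0("gene", 1:1000), paste0("S", 1:19)))
condition <- rep(c("patient", "healthy"), times = c(13, 6))

# Make one sample artificially noisy so there is something to find.
mat[, "S5"] <- mat[, "S5"] + rnorm(1000, sd = 3)

# PCA on the 500 most variable genes (similar in spirit to plotPCA's
# default of ntop = 500; not a reimplementation of it).
top <- order(apply(mat, 1, var), decreasing = TRUE)[1:500]
pca <- prcomp(t(mat[top, ]))
pcs <- data.frame(pca$x[, 1:2], condition = condition, name = colnames(mat))

# Scatter plot with every point labelled by its sample name.
plot(pcs$PC1, pcs$PC2, pch = 19,
     col = ifelse(pcs$condition == "patient", "red", "blue"),
     xlab = "PC1", ylab = "PC2")
text(pcs$PC1, pcs$PC2, labels = pcs$name, pos = 3, cex = 0.7)

# Distance of each sample from the centroid in the PC1/PC2 plane;
# the largest value marks the sample to inspect first.
d <- sqrt((pcs$PC1 - mean(pcs$PC1))^2 + (pcs$PC2 - mean(pcs$PC2))^2)
names(d) <- pcs$name
candidate <- names(which.max(d))
print(sort(d, decreasing = TRUE)[1:3])
```

The name in `candidate` can then be matched to that sample's FastQC or MultiQC report.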

5.3 years ago by Michael Love 41k

score 2 · Answer 1 · 2019-01-15