Hello, I am doing RNA seq analysis to obtain Differential expression genes using DESeq2 for 13 patient and 6 Healthy donors. Before going for DESeq2 analysis, I am visualizing my samples by Distance matrix and PCA plot, using the following commands:
library("RColorBrewer")
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(rld$condition, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
plotPCA(rld, intgroup="condition")
Here I am not able to visualize both the sample group separately. Am I suppose to discard the samples, which can not be separated and how one can find this exact sample? Is there any function or solution to separate the samples without discarding it? Kindly reply. Any help in this regard will be highly appreciated
sessionInfo() R version 3.5.2 (2018-12-20) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.1 LTS
Matrix products: default BLAS: /usr/lib/x8664-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x8664-linux-gnu/lapack/liblapack.so.3.7.1
locale: [1] LCCTYPE=enIN.UTF-8 LCNUMERIC=C
LCTIME=enIN.UTF-8 LCCOLLATE=enIN.UTF-8 [5] LCMONETARY=enIN.UTF-8 LCMESSAGES=enIN.UTF-8
LCPAPER=enIN.UTF-8 LCNAME=C [9] LCADDRESS=C LCTELEPHONE=C
LCMEASUREMENT=enIN.UTF-8 LC_IDENTIFICATION=Cattached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages: [1] hexbin1.27.2 vsn3.50.0
pheatmap1.0.12 RColorBrewer1.1-2 [5] DESeq21.22.1 SummarizedExperiment1.12.0 DelayedArray0.8.0 BiocParallel1.16.5 [9] matrixStats0.54.0 Biobase2.42.0
GenomicRanges1.34.0 GenomeInfoDb1.18.1 [13] IRanges2.16.0 S4Vectors0.20.1
BiocGenerics_0.28.0loaded via a namespace (and not attached): [1] bit640.9-7
splines3.5.2 Formula1.2-3 assertthat0.2.0
affy1.60.0 [6] BiocManager1.30.4 latticeExtra0.6-28 blob1.1.1 GenomeInfoDbData1.2.0 pillar1.3.1
[11] RSQLite2.1.1 backports1.1.3 lattice0.20-38
limma3.38.3 glue1.3.0 [16] digest0.6.18
XVector0.22.0 checkmate1.9.0 colorspace1.4-0
preprocessCore1.44.0 [21] htmltools0.3.6 Matrix1.2-15
plyr1.8.4 XML3.98-1.16 pkgconfig2.0.2
[26] genefilter1.64.0 zlibbioc1.28.0 purrr0.2.5
xtable1.8-3 scales1.0.0 [31] affyio1.52.0
htmlTable1.13.1 tibble2.0.1 annotate1.60.0
ggplot23.1.0 [36] nnet7.3-12 lazyeval0.2.1
survival2.43-3 magrittr1.5 crayon1.3.4
[41] memoise1.1.0 foreign0.8-70 tools3.5.2
data.table1.12.0 stringr1.3.1 [46] locfit1.5-9.1
munsell0.5.0 cluster2.0.7-1 AnnotationDbi1.44.0
bindrcpp0.2.2 [51] compiler3.5.2 rlang0.3.1
grid3.5.2 RCurl1.95-4.11 rstudioapi0.9.0
[56] htmlwidgets1.3 labeling0.3 bitops1.0-6
base64enc0.1-3 gtable0.2.0 [61] DBI1.0.0
R62.3.0 gridExtra2.3 knitr1.21
dplyr0.7.8 [66] bit1.1-14 bindr0.1.1
Hmisc4.1-1 stringi1.2.4 Rcpp1.0.0
[71] geneplotter1.60.0 rpart4.1-13 acepack1.4.1
tidyselect0.2.5 xfun_0.4
With keeping in mind all these data coming from a single source, Biological replicates or conditions should cluster together. This is a concept to check if the similar kind of data is reproducible enough.
Removing non-clustered ones will certainly improve the downstream analysis. But before that, I would suggest instead of considering all the genes to construct a sample distance matrix, try to go with certain numbers of housekeeping or only protein-coding genes for that matter. This may show some clue.