I am performing RNA-Seq analysis on a bulk dataset comprising 4 different groups (a, b, c, d) with 5-6 biological replicates per group. When performing QC plotting, I find that PCA and clustering of vst-transformed values identifies some samples as quite different from the others:
# Read dds dds <- readRDS(dds.path) # Variance stabilizing transformation dds.transform <- varianceStabilizingTransformation(dds, blind=TRUE) # Plotting PCA p.pca <- plotPCA(dds.transform, intgroup=c("group"), ntop=500) nudge <- position_nudge(y=5, x=5) p.pca + geom_label(aes(label = name), position = nudge) # Plotting heatmap + clustering df <- data.frame(group = colData(dds.transform)[,"group"], row.names = colnames(dds.transform)) select.200 = order(rowMeans(counts(dds, normalized=TRUE)), decreasing=TRUE)[1:200] pheatmap(assay(dds.transform)[select.200,], cluster_rows=F, show_rownames=F, cluster_cols=T, annotation_col=df)
Interestingly the samples are the ones with the highest amount of reads (and largely belonging to the same batch). Looking at the heatmaps it also is apparent that they cluster differently because of generally higher reads:
I thought the size factor normalization should take care of this? Removing the batch effect using limma's removeBatchEffect() is also not really helping (makes it worse even):
# Remove batch effect assay(dds.transform) <- removeBatchEffect(assay(dds.transform), batch=dds.transform$batch)
Is this kind of variation to be expected or do I need to take care of this? Both intra- and intergroup differences theoretically should not be that big. What I am eventually interested in are mainly the differentially expressed genes.
> sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale:  de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8 attached base packages:  parallel stats4 stats graphics grDevices utils datasets methods base other attached packages:  limma_3.36.2 vsn_3.48.1 RColorBrewer_1.1-2 gridExtra_2.3  pheatmap_1.0.10 gplots_3.0.1 ggplot2_3.0.0 DESeq2_1.20.0  SummarizedExperiment_1.10.1 DelayedArray_0.6.5 BiocParallel_1.14.2 matrixStats_0.54.0  Biobase_2.40.0 GenomicRanges_1.32.6 GenomeInfoDb_1.16.0 IRanges_2.14.10  S4Vectors_0.18.3 BiocGenerics_0.26.0