Hi,
I am performing RNA-Seq analysis on a bulk dataset comprising 4 different groups (a, b, c, d) with 5-6 biological replicates per group. When performing QC plotting, I find that PCA and clustering of vst-transformed values identifies some samples as quite different from the others:
# Read dds dds <- readRDS(dds.path) # Variance stabilizing transformation dds.transform <- varianceStabilizingTransformation(dds, blind=TRUE) # Plotting PCA p.pca <- plotPCA(dds.transform, intgroup=c("group"), ntop=500) nudge <- position_nudge(y=5, x=5) p.pca + geom_label(aes(label = name), position = nudge) # Plotting heatmap + clustering df <- data.frame(group = colData(dds.transform)[,"group"], row.names = colnames(dds.transform)) select.200 = order(rowMeans(counts(dds, normalized=TRUE)), decreasing=TRUE)[1:200] pheatmap(assay(dds.transform)[select.200,], cluster_rows=F, show_rownames=F, cluster_cols=T, annotation_col=df)
Interestingly the samples are the ones with the highest amount of reads (and largely belonging to the same batch). Looking at the heatmaps it also is apparent that they cluster differently because of generally higher reads:
I thought the size factor normalization should take care of this? Removing the batch effect using limma's removeBatchEffect() is also not really helping (makes it worse even):
# Remove batch effect assay(dds.transform) <- removeBatchEffect(assay(dds.transform), batch=dds.transform$batch)
Is this kind of variation to be expected or do I need to take care of this? Both intra- and intergroup differences theoretically should not be that big. What I am eventually interested in are mainly the differentially expressed genes.
Thank you!
Best,
Joe
> sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale: [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] limma_3.36.2 vsn_3.48.1 RColorBrewer_1.1-2 gridExtra_2.3 [5] pheatmap_1.0.10 gplots_3.0.1 ggplot2_3.0.0 DESeq2_1.20.0 [9] SummarizedExperiment_1.10.1 DelayedArray_0.6.5 BiocParallel_1.14.2 matrixStats_0.54.0 [13] Biobase_2.40.0 GenomicRanges_1.32.6 GenomeInfoDb_1.16.0 IRanges_2.14.10 [17] S4Vectors_0.18.3 BiocGenerics_0.26.0
Hi Michael,
It's data from hematopoietic stem cells. The dataset is already some years old so the depth is really not that great. Still I need to reanalyse it as it has some unique treatment.
I wasn't really sure what you meant by plotting the raw counts so I attached some basic density plots of the count distribution (log2).
Sorry, I meant
plot(b1, b6, log="xy")
. Scaling normalization would help you when these plots of replicates mostly indicate a shift (a multiplicative factor on the raw count scale).Of course, here you go:
For me, c1 vs. c6 and d1 vs. d2 look somewhat linearly correlated. b1 vs. b6 however looks in fact quite weird.
All of these look highly variable for replicates. This isn't a sequencing depth issue I think, but some other quality control problem. I'd look into if something was special about these high depth samples.
What do you see with e.g. b5 vs b6?
You were right. The variability is much less when comparing between the libraries with lower depth. It seems that b1, c1 and d1 are highly different from everything else. b2 to a lesser extend also. a2, c2 and d5 maybe as well, but I believe their deviation should be still acceptable.
What could be the reason for this high variation?
a samples
b samples
c samples
d samples
I don’t know. Usually the answer lies not on the computer but in talking to the experimentalists.