Hello,
I'm a novice R user, but have been tasked with normalizing some human CyTOF data, acquired over multiple days. I have been working with the Cydar package in the hopes of normalizing between different runs. The question I have is whether I can normalize batch effects if we have only one sample/day on the cytometer (10 runs or days total). I've created a separate ncdfFlowSet for each FCS file (10 flowsets total). I have run normalizebatch, with the warp mode and range mode, but I get a worrisome warning with warp mode. range mode looks alright to my untrained eye, but if you had any suggestions/recommendations they would be greatly appreciated. I have attached my code below.
Best, Graham
Note: all the below analysis is performed on untransformed intensity data (eg arcsin or logicle)
library(flowCore)
library(cydar)
library(ncdfFlow)
library(ggplot2)
library(data.table)
library(flowStats)
#first I check expression of y89Di channel to see batch effects
fcs_raw <- read.flowSet(path = "/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled", transformation = FALSE, truncate_max_range = FALSE, alter.names = TRUE)
expr <- fsApply(fcs_raw, exprs)
md <- read.csv(file="/Users/grahamhogg/Desktop/MD.csv", header=TRUE, sep=",")
sample_ids <- rep(md$Sample_ID, fsApply(fcs_raw, nrow))
ggdf <- data.frame(sample_id = sample_ids, expr)
ggplot(ggdf, aes(x=ggdf$CD45, group = ggdf$sample_id)) + geom_density()
#Batch effects are visible
#Next I try the "warp" normalize batch function for y89Di channel
cydar1 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/180904 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE)
cydar2 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/181012 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE)
cydar3 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/181217 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE)
cydar4 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190220 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE)
cydar5 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190226 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE)
cydar6 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190415 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE)
cydar7 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190507 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE)
cydar8 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190517 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE)
cydar9 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190725 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE)
cydar10 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190905 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE)
Cydarbatch <- list(cydar1, cydar2, cydar3, cydar4, cydar5, cydar6, cydar7, cydar8, cydar9, cydar10)
CHN <- "Y89Di"
corrected1 <- normalizeBatch(Cydarbatch, batch.comp = NULL, mode = "warp", target = NULL, markers = CHN)
Warning message: In regularize.values(x, y, ties, missing(ties)) :
collapsing to unique 'x' values
trial <- lapply(corrected1, data.frame)
trial1 <- rbindlist(trial, use.names=TRUE, fill=TRUE, idcol="trial")
ggplot(trial1, aes(x=trial1$Y89Di, group = trial1$trial)) + geom_density()
#I think this is perhaps overnormalized and I'm hoping to avoid negative values
#next I try "range" normalize batch function for y89Di channel
corrected2 <- normalizeBatch(Cydarbatch, batch.comp = NULL, mode = "range", target = NULL, markers = CHN)
trial2 <- lapply(corrected2, data.frame)
trial3 <- rbindlist(trial2, use.names=TRUE, fill=TRUE, idcol="trial")
ggplot(trial3, aes(x=trial3$Y89Di, group = trial3$trial)) + geom_density()
#this normalization looks okay to me
sessionInfo()
R version 3.6.1 (2019-07-05) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Mojave 10.14
Matrix products: default BLAS:
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dyliblocale: [1] enUS.UTF-8/enUS.UTF-8/enUS.UTF-8/C/enUS.UTF-8/en_US.UTF-8
attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages: [1] data.table1.12.2
ggplot23.2.1 cydar1.8.0
SingleCellExperiment1.6.0 [5] SummarizedExperiment1.14.1 DelayedArray0.10.0 matrixStats0.55.0 Biobase2.44.0 [9] GenomicRanges1.36.1 GenomeInfoDb1.20.0
IRanges2.18.2 S4Vectors0.22.1 [13] BiocGenerics0.30.0 BiocParallel1.18.1
ncdfFlow2.30.1 BH1.69.0-1 [17] RcppArmadillo0.9.700.2.0 flowCore1.50.0loaded via a namespace (and not attached): [1] viridis0.5.1
splines3.6.1 viridisLite0.3.0 RcppParallel4.4.3
[5] shiny1.3.2 assertthat0.2.1 BiocManager1.30.4
latticeExtra0.6-28 [9] flowWorkspace3.32.0
GenomeInfoDbData1.2.1 robustbase0.93-5 pillar1.4.2
[13] lattice0.20-38 glue1.3.1 digest0.6.20
RColorBrewer1.1-2 [17] promises1.0.1 XVector0.24.0
colorspace1.4-1 htmltools0.3.6 [21] httpuv1.5.2
Matrix1.2-17 pcaPP1.9-73 pkgconfig2.0.2
[25] fda2.4.8 zlibbioc1.30.0 purrr0.3.2
xtable1.8-4 [29] corpcor1.6.9 mvtnorm1.0-11
scales1.0.0 later0.8.0 [33] flowStats3.42.0
tibble2.1.3 withr2.1.2 flowViz1.48.0
[37] hexbin1.27.3 lazyeval0.2.2 mclust5.4.5
magrittr1.5 [41] crayon1.3.4 IDPmisc1.1.19
mime0.7 ks1.11.5 [45] MASS7.3-51.4
graph1.62.0 tools3.6.1 stringr1.4.0
[49] munsell0.5.0 cluster2.1.0 compiler3.6.1
rlang0.4.0 [53] grid3.6.1 RCurl1.95-4.12
BiocNeighbors1.2.0 rstudioapi0.10 [57] labeling0.3
bitops1.0-6 gtable0.3.0 rrcov1.4-7
[61] R62.4.0 gridExtra2.3 dplyr0.8.3
Rgraphviz2.28.0 [65] KernSmooth2.23-15 stringi1.4.3
Rcpp1.0.2 DEoptimR1.0-8 [69] tidyselect_0.2.5
Thank you for your help!
Best, GH
Hi Aaron,
Thank you very much for your advice. I have decided to use a subset of samples for which we have a PBMC control run on the same day, rather than risking normalization with only one sample/batch. I have also arcsin normalized the data prior to batch normalization. I have been using the quantile function for batch normalization, and it has been giving me some results that appear strange to my untrained eye. I also get warning messages about "collapsing to unique 'x' values". I was hoping you could take a look at my workflow below to let me know if I have made an error, or if the problem is just that there is too much variability in the PBMC control (which is an identical sample in each batch).
Thank you so much! I sincerely appreciate your time.
Best, Graham
No obvious explanation comes to mind. If I had to guess, I would say that's because the quantile normalization uses the "average" distribution across batches as the reference to correct each batch towards (when
target=NULL
). If you average across batches that do not have exactly the same population composition, it's possible that you'll get lots of little lumps like this.The weird thing is that, regardless of the lumpiness of the distribution,
normalizeBatch()
should at least return the exact same distribution for all batches after normalization. The reason why your histograms are so different between batches is probably becausenormalizeBatches()
only uses the reference to compute the correction. Thus, only the references are guaranteed to have exactly the same distribution across batches after normalization. To normalize theRun*
samples, we extrapolate the correction applied to the references, which may or may not be sensible depending on whether the biases in the references are recapitulated in the other samples. For example, if a marker is barely expressed in the reference, this obviously does not bode well for using it to correct the other samples.You might be able to avoid the lumpiness by setting
target
, but that doesn't avoid the aforementioned assumption, so I wouldn't usemode="quantile"
at all. It's only intended for technical replicates where the population is known to be the same so you can use highly aggressive normalization strategies.Thank you so much for your response! Now that I think about it, a distribution normalization would definitely work best if you can assume a similar distribution between reference samples and trial samples, and I know the PBMC reference samples will have a significantly different intensity profile than my clinical samples, so I will follow your advice and not use the quantile function for normalization.
Best, GH