Question: Re: Cydar. Can I Normalize CyTOF data with just one FCS file/batch and if so, which normalization mode is recommended?
0
8 weeks ago by
ghogg0
ghogg0 wrote:

Hello,

I'm a novice R user, but have been tasked with normalizing some human CyTOF data, acquired over multiple days. I have been working with the Cydar package in the hopes of normalizing between different runs. The question I have is whether I can normalize batch effects if we have only one sample/day on the cytometer (10 runs or days total). I've created a separate ncdfFlowSet for each FCS file (10 flowsets total). I have run normalizebatch, with the warp mode and range mode, but I get a worrisome warning with warp mode. range mode looks alright to my untrained eye, but if you had any suggestions/recommendations they would be greatly appreciated. I have attached my code below.

Best, Graham

Note: all the below analysis is performed on untransformed intensity data (eg arcsin or logicle)

library(flowCore)
library(cydar)
library(ncdfFlow)
library(ggplot2)
library(data.table)
library(flowStats)

#first I check expression of y89Di channel to see batch effects

fcs_raw <- read.flowSet(path = "/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled", transformation = FALSE, truncate_max_range = FALSE, alter.names = TRUE)
expr <- fsApply(fcs_raw, exprs)
sample_ids <- rep(md$Sample_ID, fsApply(fcs_raw, nrow)) ggdf <- data.frame(sample_id = sample_ids, expr) ggplot(ggdf, aes(x=ggdf$CD45, group = ggdf$sample_id)) + geom_density() #Batch effects are visible  #Next I try the "warp" normalize batch function for y89Di channel cydar1 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/180904 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE) cydar2 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/181012 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE) cydar3 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/181217 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE) cydar4 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190220 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE) cydar5 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190226 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE) cydar6 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190415 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE) cydar7 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190507 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE) cydar8 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190517 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE) cydar9 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190725 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE) cydar10 <- read.ncdfFlowSet("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled/190905 CZ DeNardo T Cell.fcs", transformation = FALSE, truncate_max_range = FALSE) Cydarbatch <- list(cydar1, cydar2, cydar3, cydar4, cydar5, cydar6, cydar7, cydar8, cydar9, cydar10) CHN <- "Y89Di" corrected1 <- normalizeBatch(Cydarbatch, batch.comp = NULL, mode = "warp", target = NULL, markers = CHN)  Warning message: In regularize.values(x, y, ties, missing(ties)) : collapsing to unique 'x' values trial <- lapply(corrected1, data.frame) trial1 <- rbindlist(trial, use.names=TRUE, fill=TRUE, idcol="trial") ggplot(trial1, aes(x=trial1$Y89Di, group = trial1$trial)) + geom_density() #I think this is perhaps overnormalized and I'm hoping to avoid negative values  #next I try "range" normalize batch function for y89Di channel corrected2 <- normalizeBatch(Cydarbatch, batch.comp = NULL, mode = "range", target = NULL, markers = CHN) trial2 <- lapply(corrected2, data.frame) trial3 <- rbindlist(trial2, use.names=TRUE, fill=TRUE, idcol="trial") ggplot(trial3, aes(x=trial3$Y89Di, group = trial3$trial)) + geom_density() #this normalization looks okay to me  sessionInfo()  R version 3.6.1 (2019-07-05) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Mojave 10.14 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib locale: [1] enUS.UTF-8/enUS.UTF-8/enUS.UTF-8/C/enUS.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] data.table1.12.2 ggplot2 3.2.1 cydar1.8.0 SingleCellExperiment 1.6.0 [5] SummarizedExperiment1.14.1 DelayedArray0.10.0 matrixStats0.55.0 Biobase2.44.0 [9] GenomicRanges1.36.1 GenomeInfoDb1.20.0 IRanges2.18.2 S4Vectors0.22.1 [13] BiocGenerics0.30.0 BiocParallel1.18.1 ncdfFlow2.30.1 BH1.69.0-1 [17] RcppArmadillo0.9.700.2.0 flowCore1.50.0 loaded via a namespace (and not attached): [1] viridis0.5.1 splines 3.6.1 viridisLite0.3.0 RcppParallel4.4.3 [5] shiny1.3.2 assertthat0.2.1 BiocManager1.30.4 latticeExtra 0.6-28 [9] flowWorkspace3.32.0 GenomeInfoDbData 1.2.1 robustbase0.93-5 pillar1.4.2 [13] lattice0.20-38 glue1.3.1 digest0.6.20 RColorBrewer 1.1-2 [17] promises1.0.1 XVector0.24.0 colorspace1.4-1 htmltools0.3.6 [21] httpuv1.5.2 Matrix 1.2-17 pcaPP1.9-73 pkgconfig2.0.2 [25] fda2.4.8 zlibbioc1.30.0 purrr0.3.2 xtable 1.8-4 [29] corpcor1.6.9 mvtnorm1.0-11 scales1.0.0 later0.8.0 [33] flowStats3.42.0 tibble 2.1.3 withr2.1.2 flowViz1.48.0 [37] hexbin1.27.3 lazyeval0.2.2 mclust5.4.5 magrittr 1.5 [41] crayon1.3.4 IDPmisc1.1.19 mime0.7 ks1.11.5 [45] MASS7.3-51.4 graph 1.62.0 tools3.6.1 stringr1.4.0 [49] munsell0.5.0 cluster2.1.0 compiler3.6.1 rlang 0.4.0 [53] grid3.6.1 RCurl1.95-4.12 BiocNeighbors1.2.0 rstudioapi0.10 [57] labeling0.3 bitops 1.0-6 gtable0.3.0 rrcov1.4-7 [61] R62.4.0 gridExtra2.3 dplyr0.8.3 Rgraphviz 2.28.0 [65] KernSmooth2.23-15 stringi1.4.3 Rcpp1.0.2 DEoptimR1.0-8 [69] tidyselect_0.2.5 normalization cytof cydar • 117 views ADD COMMENTlink modified 8 weeks ago by Aaron Lun25k • written 8 weeks ago by ghogg0 Answer: Re: Cydar. Can I Normalize CyTOF data with just one FCS file/batch and if so, wh 1 8 weeks ago by Aaron Lun25k Cambridge, United Kingdom Aaron Lun25k wrote: The assumption made by all of the intensity normalization strategies is that each batch has the same true distribution of protein abundances. Different choices of mode vary in how robust they are to violations of this assumption, e.g., mode="quantile" expects identical distributions, mode="warp" expects identical locations of the peaks, and mode="range" only expects that the boundaries are the same. This assumption is fine - or at least, passable - when each batch has many samples, such that any biological variation in abundance between samples can be expected to average out in the per-batch distributions. However, when you only have one sample per batch (and I'm considering each day to be a separate batch here), the assumption is much more problematic; at its most extreme, you're basically assuming that there are no differences between samples, and even if that was true, well, why bother doing the experiment at all? So the first thing to do is to recognize that the design of this particular experiment is pretty poor. Once we make our peace with that, we can look at ways of salvaging something from the analysis. I would suggest range-based normalization as the least-worst option in a bad situation, as it makes the gentlest assumption about the equivalence of distributions. I also tend to do normalization after transformation to ensure that there are no differences in the space that I plan to do the rest of my calculations in. Otherwise, it would be unfortunate if the transformation introduces new differences between samples that have to be normalized again. If you do normalization after transformation, all these differences get caught and removed in one go. ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by Aaron Lun25k Thank you for your help! Best, GH ADD REPLYlink written 7 weeks ago by ghogg0 Hi Aaron, Thank you very much for your advice. I have decided to use a subset of samples for which we have a PBMC control run on the same day, rather than risking normalization with only one sample/batch. I have also arcsin normalized the data prior to batch normalization. I have been using the quantile function for batch normalization, and it has been giving me some results that appear strange to my untrained eye. I also get warning messages about "collapsing to unique 'x' values". I was hoping you could take a look at my workflow below to let me know if I have made an error, or if the problem is just that there is too much variability in the PBMC control (which is an identical sample in each batch). Thank you so much! I sincerely appreciate your time. Best, Graham ADD REPLYlink written 7 weeks ago by ghogg0  #first I load the non batch normalized files to check base level expression fcs_raw <- read.flowSet(path = "/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/relabeled_new", transformation = FALSE, truncate_max_range = FALSE, alter.names = TRUE) #Here I arcsinh normalize the data expr1 <- fsApply(fcs_raw, exprs) columns <- colnames(fcs_raw) aTrans <- function(x, cofactor = 5){ asinh(x/cofactor)} myTrans <- transformList(columns, aTrans) finalTrans <- transform(fcs_raw, myTrans) expr <- fsApply(finalTrans, exprs) md <- read.csv(file="/Users/grahamhogg/Desktop/MD.csv", header=TRUE, sep=",") sample_ids <- rep(md$Sample_ID, fsApply(finalTrans, nrow))
ggdf <- data.frame(sample_id = sample_ids, expr)

#I select only the files that will be used later in my workflow
ggdf_down <- dplyr::filter(ggdf, ggdf$sample_id == 2 | ggdf$sample_id == 5 | ggdf$sample_id == 7 | ggdf$sample_id == 10 |
ggdf$sample_id == 12) ggplot(ggdf_down, aes(x=ggdf_down$Yb174Di, group = ggdf_down$sample_id, color=factor(ggdf_down$sample_id))) + geom_density()


#I picked a marker here (PD1) that is relatively consistent but has some batch effects

#Next I load only the flowsets that contain both a sample and a reference

cydar2Qfiles <- list.files("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/Reference Quantile Normalized/Cydar2", full.names = TRUE)
cydar2Q <- read.ncdfFlowSet(files = cydar2Qfiles, transformation = FALSE, truncate_max_range = FALSE)

cydar5Qfiles <- list.files("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/Reference Quantile Normalized/Cydar5", full.names = TRUE)
cydar5Q <- read.ncdfFlowSet(files = cydar5Qfiles, transformation = FALSE, truncate_max_range = FALSE)

cydar7Qfiles <- list.files("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/Reference Quantile Normalized/Cydar7", full.names = TRUE)
cydar7Q <- read.ncdfFlowSet(files = cydar7Qfiles, transformation = FALSE, truncate_max_range = FALSE)

cydar10Qfiles <- list.files("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/Reference Quantile Normalized/Cydar10", full.names = TRUE)
cydar10Q <- read.ncdfFlowSet(files = cydar10Qfiles, transformation = FALSE, truncate_max_range = FALSE)

cydar12Qfiles <- list.files("/Users/grahamhogg/Desktop/CyTOF Human T Cell FCS/Reference Quantile Normalized/Cydar12", full.names = TRUE)
cydar12Q <- read.ncdfFlowSet(files = cydar12Qfiles, transformation = FALSE, truncate_max_range = FALSE)

CydarbatchQ <- list(cydar2Q, cydar5Q, cydar7Q, cydar10Q, cydar12Q)
names(CydarbatchQ)<- c("Run1", "Run2", "Run3", "Run4", "Run5")

#Here I set up my batch.comp

condition <- factor(c("reference", "sample"))
cydarbatchcomp <- list(condition)
cydarbatchcomp1 <- rep(cydarbatchcomp, 5)
names(cydarbatchcomp1)<- c("Run1", "Run2", "Run3", "Run4", "Run5")
cydarbatchcomp1[1]


$Run1 [1] reference sample Levels: reference sample names(cydarbatchcomp1) for (i in names(cydarbatchcomp1)) { current <- as.character(cydarbatchcomp1[[i]]) current[current!="reference"] <- i cydarbatchcomp1[[i]] <- current } cydarbatchcomp1[1] $Run1 [1] "reference" "Run1"

#Here I select markers I want to batch normalize
markers <- colnames(cydar2)
markers1 <- markers[c(3, 51)]
markers2 <- markers[9:16]
markers3 <- markers[18:43]
markersF <- c(markers1, markers2, markers3)

#here I run batch normalize with quantile function
corrected2 <- normalizeBatch(CydarbatchQ, batch.comp = cydarbatchcomp1, mode = "quantile", target = NULL, markers = markersF)


Warning messages: 1: In regularize.values(x, y, ties, missing(ties)) : collapsing to unique 'x' values

#to visualize batch correction:
trial4 <- unlist(corrected2, recursive = FALSE)
trial4 <- lapply(trial4, data.frame)
trial4 <- lapply(trial4, setNames, nm = markersF)
trial5 <- rbindlist(trial4, use.names=TRUE, fill=TRUE, idcol="trial")
trial5 <- as.data.frame(trial5)
trial5[1] <- lapply(trial5[1], gsub, pattern = "CZ DeNardo T Cell", replacement = "", fixed = TRUE)
trial6 <- trial5 %>% dplyr::filter(trial5$trial %in% c("Run1.180904 .fcs", "Run2.181217 .fcs", "Run3.190220 .fcs", "Run4.190415 .fcs", "Run5.190507 .fcs" )) ggplot(trial6, aes(x=trial6$Yb174Di, group = trial6$trial, color=factor(trial6$trial))) + geom_density() + xlim(-1, 10)


#This normalization looks "weird" to me, but I am by no means very well acquainted with bio-informatics or bio-stats


No obvious explanation comes to mind. If I had to guess, I would say that's because the quantile normalization uses the "average" distribution across batches as the reference to correct each batch towards (when target=NULL). If you average across batches that do not have exactly the same population composition, it's possible that you'll get lots of little lumps like this.

The weird thing is that, regardless of the lumpiness of the distribution, normalizeBatch() should at least return the exact same distribution for all batches after normalization. The reason why your histograms are so different between batches is probably because normalizeBatches() only uses the reference to compute the correction. Thus, only the references are guaranteed to have exactly the same distribution across batches after normalization. To normalize the Run* samples, we extrapolate the correction applied to the references, which may or may not be sensible depending on whether the biases in the references are recapitulated in the other samples. For example, if a marker is barely expressed in the reference, this obviously does not bode well for using it to correct the other samples.

You might be able to avoid the lumpiness by setting target, but that doesn't avoid the aforementioned assumption, so I wouldn't use mode="quantile" at all. It's only intended for technical replicates where the population is known to be the same so you can use highly aggressive normalization strategies.

Thank you so much for your response! Now that I think about it, a distribution normalization would definitely work best if you can assume a similar distribution between reference samples and trial samples, and I know the PBMC reference samples will have a significantly different intensity profile than my clinical samples, so I will follow your advice and not use the quantile function for normalization.

Best, GH