Question

CSAW Spike In Normalization: Error in .local(object, ...) : library sizes of 'se.out' and 'object' are not identical

0

Entering edit mode

94133 • 0

@94133-14305

Last seen 5.6 years ago

USA, Stanford

I am trying to normalize ChIP-seq data using an exogenous spike in control with csaw and get the following error:

>filtered.data <- normOffsets(filtered.data.spike, se.out=filtered.data)

Error in .local(object, ...) :
library sizes of 'se.out' and 'object' are not identical

library(csaw)
library(edgeR)
library(rtracklayer)
library(GenomicRanges)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)

param <- readParam(minq=20, restrict=paste0("chr", c(1:19, "X", "Y")))

#Load mouse bams (endogenous bamfiles)
dir <- "../bam/mm10"
samples <- read.table(file.path(dir, "bamfiles.txt"), header = TRUE)
colnames(design) <- c("intercept", "Treatment")
treatment = samples$Treatment
bam.files <- file.path(dir,samples$BAM)
data.frame(BAM=bam.files, treatment=treatment)

#Load human bams (exogenous bamfiles)
dir.spike <- "..bam/hg38"
samples.spike <- read.table(file.path(dir.spike, "spikebamfiles.txt"), header = TRUE)
bam.files.spike <- file.path(dir.spike,samples.spike$BAM)

#Counting mouse reads into windows
win.data <- windowCounts(bam.files, param=param, width=150, ext=200)
#filter by abundance | windows with abundances 5x higher than background.
bins <- windowCounts(bam.files, bin=TRUE, width=2000, param=param)
filter.stat <- filterWindows(win.data, bins, type="global")
min.fc <- 5
keep <- filter.stat$filter > log2(min.fc)
summary(keep)
#implement filter
filtered.data <- win.data[keep,]

#Counting spiked-in reads into windows
win.data.spike <- windowCounts(bam.files.spike, param=param, width=150, ext=200)
#filter by abundance | windows with abundances 5x higher than background.
bins.spike <- windowCounts(bam.files.spike, bin=TRUE, width=2000, param=param)
filter.stat.spike <- filterWindows(win.data.spike, bins.spike, type="global")
min.fc <- 5
keep.spike <- filter.stat.spike$filter > log2(min.fc)
#filter
filtered.data.spike <- win.data.spike[keep.spike,]

#Normalize by spike-in
filtered.data <- normOffsets(filtered.data.spike, se.out=filtered.data)
filtered.data$norm.factors

Error in .local(object, ...) :
library sizes of 'se.out' and 'object' are not identical

sessionInfo()

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin17.5.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0 GenomicFeatures_1.32.0
[3] AnnotationDbi_1.42.1 rtracklayer_1.40.2
[5] edgeR_3.22.1 limma_3.36.1
[7] csaw_1.14.0 SummarizedExperiment_1.10.1
[9] DelayedArray_0.6.0 BiocParallel_1.14.1
[11] matrixStats_0.53.1 Biobase_2.40.0
[13] GenomicRanges_1.32.3 GenomeInfoDb_1.16.0
[15] IRanges_2.14.10 S4Vectors_0.18.2
[17] BiocGenerics_0.26.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 compiler_3.5.0 XVector_0.20.0 prettyunits_1.0.2
[5] bitops_1.0-6 tools_3.5.0 zlibbioc_1.26.0 progress_1.1.2
[9] biomaRt_2.36.0 digest_0.6.15 bit_1.1-13 RSQLite_2.1.1
[13] memoise_1.1.0 lattice_0.20-35 pkgconfig_2.0.1 Matrix_1.2-14
[17] DBI_1.0.0 GenomeInfoDbData_1.1.0 httr_1.3.1 stringr_1.3.1
[21] Biostrings_2.48.0 locfit_1.5-9.1 bit64_0.9-7 grid_3.5.0
[25] R6_2.2.2 XML_3.98-1.11 magrittr_1.5 Rhtslib_1.12.0
[29] blob_1.1.1 GenomicAlignments_1.16.0 Rsamtools_1.32.0 assertthat_0.2.0
[33] stringi_1.2.2 RCurl_1.95-4.10

csaw software error R chip-seq bioconductor • 2.1k views

ADD COMMENT • link updated 7.7 years ago by Aaron Lun ★ 29k • written 7.7 years ago by 94133 • 0

score 0 · Answer 1 · 2018-05-25

0

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 9 hours ago

The city by the bay

Pretty much as the error message says - the $totals field is not the same between the two SummarizedExperiment objects you supplied to normOffsets. The final normalization is defined from the product of the normalization factors and the library sizes, and so it generally does not make sense to transfer normalization factors between objects with different library sizes.

I am surprised you have different totals in the first place. Usually, one would align all reads to a combined reference (endogenous + spike), and use windowCounts on the entire BAM file. If you subset the resulting SummarizedExperiment object according to the genome, you should now have two objects where the $totals are the same. I see that you've instead aligned your reads separately to your two references to obtain two BAM files; this seems risky due to reads mapping twice in homologous regions. Nonetheless, you can force csaw to be happy by just manually setting the totals in both objects to their sum across objects.

ADD COMMENT • link 7.7 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you for your help!

Wouldn't aligning to a combined human/mouse reference (presumably by changing chrom names, i.e. mchr1 v hchr1) suffer from the same putative double mapping issue? Alternatively, one could throw out reads that map to both genomes, but this seems like a very bad idea because these reads will likely map to regions of the genome we're interested in. Do you have a suggested workflow for this?

FWIW: I checked cross-species mapping with Bowtie2 (--very-sensitive) using simulated libraries consisting of all possible 75mers and got <1% human vs. mouse mapping (and vice versa).

For now I'll take your advice and manually set totals in both objects to the sum across objects.

ADD REPLY • link 7.7 years ago 94133 • 0

0

Entering edit mode

The expectation would be that the aligner would ignore reads that are not uniquely mapped. I believe this is true of most aligners nowadays, where non-uniquely mapped reads are either discarded automatically or thresholded based on MAPQ. This would protect you from reads from homologous regions of the mouse/human genome.

So yes, effectively I would throw out reads that map equally well to both genomes. This is conservative but hardly a "very bad idea", as the opposite situation would be disastrous. Imagine if you wrote up a manuscript where your major findings were driven by cross-mapping between genomes! Highly homologous regions in multi-genome experiments are equivalent to repeats, and we all know how problematic those are for short read sequencing.

If you have important things happening in homologous regions, the only reliable approach to retain information from these regions is to choose a spike-in genome that is less related. Drosophila and yeast seem to be popular choices, though I have not touched much ChIP-seq spike-in data - I've been told it's a bit underwhelming.

Regarding the cross-mapping results; simulated libraries are all well and good, but remember that the important parts of the genome are evolutionarily conserved and thus more likely to suffer from cross-mapping. If you're trying to assay something functional (and who isn't?), the problem is likely to be worse than you might predict from the simulations.

ADD REPLY • link 7.7 years ago Aaron Lun ★ 29k

0

Entering edit mode

Do you have a suggested workflow for aligning to combined genomes?

I agree with the points that you raised here, thanks for your input! The primary advantage of spike-in normalization is to draw empirical quantitative normalization between control and treated samples, especially in cases where there are global changes in abundance/occupancy. Additionally, methodologies with very low background like CAR-seq require spike-in for proper normalization. It's unclear to me why this method of normalization would be 'underwhelming', care to explain?

In such experiments where one is interested in comparing differences between two samples, it's hard to imagine a case where an entire manuscript would be "driven by cross-mapping between genomes", since the reference is the same in both. Furthermore, any careful researcher would make sure that enriched sites are called correctly using non-mixed samples. Of course using less related genomes would be optimal but many antibodies do not work in both Drosophila and mouse/human, so in most cases this is not possible.

ADD REPLY • link 7.7 years ago 94133 • 0

0

Entering edit mode

Regarding the suggested workflow: build a combined reference with all relevant genomes and align to that. I do that all the time for single-cell RNA-seq data, e.g., with ERCC spike-ins. You might have to change the names of some sequences in the FASTA file to get this to work properly. Some aligners will also refuse to handle genomes above a certain size - the solution would be to use a less restrictive aligner.

Regarding spike-in normalization: yes, I'm familiar with the how and why. Again, I have not dealt with much of it myself, but in at least some of those datasets I have handled, there has not been enough coverage of the spike-ins for them to be useful. You might say that this is not a problem with the overall spike-in strategy but rather is the fault of the experimentalists for not adding the right quantity of spike-in chromatin, and I would agree. Nonetheless, they weren't going to repeat the experiment for that reason alone, so I never used the spike-ins for any analysis of that data. I have also heard anecdotal stories regarding cases where the spike-in IP efficiencies differ from that of the endogenous genomes, though having not been shown solid evidence for this, I've taken it with a grain of salt.

Regarding DB analyses; well, who knows? At best, cross-mapping from your spike-in to the endogenous genome should would result in some spurious non-DB peaks that cause no harm. If you cross-map to an existing peak in the endogenous genome, you will shrink the log-fold change for that peak towards zero, which is not ideal but generally harmless (unless your major finding involves something about the lack of differences in binding between conditions). At worst, though, cross-mapping from your endogenous genome to your spike-ins can affect your normalization; and at that point, all bets are off regarding the final effects on your DB analysis.

And of course, it's easy to tell people to be careful. But repeating the entire sequencing experiment with non-mixed samples is pretty expensive, especially as you have to redo it right at the start of the protocol, i.e., before IP. We have to do the best we can with what we get.

ADD REPLY • link 7.7 years ago Aaron Lun ★ 29k