Question

DiffBind spike-in lib.sizes confusion

Weisheng • 0

United States

Hi,

I'm confused about how DiffBind uses spike-ins for normalization. My understanding from the manual is that DiffBind counts the spike-in reads in bins and uses those counts as the library sizes for normalization. When I set spikein=FALSE, $lib.sizes and $background$binned$totals are equal, as expected:

db_data_spikeinNorm2 <- dba.normalize(db_data, spikein = FALSE, background = TRUE, library = DBA_LIBSIZE_BACKGROUND, normalize = DBA_NORM_LIB)

db_data_spikeinNorm2$norm$DESeq2$lib.sizes
[1] 7424321 7030471 8640826 7006223

> db_data_spikeinNorm2$norm$background$binned$totals
[1] 7424321 7030471 8640826 7006223

However, when I set spikein=TRUE, they are not equal anymore:

db_data_spikeinNorm3 <- dba.normalize(db_data, spikein = TRUE, background = TRUE, library = DBA_LIBSIZE_BACKGROUND, normalize = DBA_NORM_LIB)

db_data_spikeinNorm3$norm$DESeq2$lib.sizes
[1] 7747122 7334460 9112179 7386926

db_data_spikeinNorm3$norm$background$binned$totals
[1] 1970 1923 2638 2262

The $lib.sizes are still large numbers, close to but not identical to the $lib.sizes from spikein=FALSE. Why do they no longer equal the background totals? The binned totals make sense, because the spike-in control BAMs contain only a small number of mapped reads. What do the $lib.sizes values represent when spikein=TRUE?
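
To put numbers on the gap, here is a small base-R sketch that uses only the values printed above. The variable names are made up for illustration, and the code does not call DiffBind; it just subtracts the two sets of $lib.sizes and compares the differences with the spike-in binned totals.

```r
# Values copied from the output above; the variable names are hypothetical.
lib_sizes_no_spikein <- c(7424321, 7030471, 8640826, 7006223)  # spikein = FALSE
lib_sizes_spikein    <- c(7747122, 7334460, 9112179, 7386926)  # spikein = TRUE
spikein_totals       <- c(1970, 1923, 2638, 2262)              # binned totals, spikein = TRUE

# Per-sample increase in the recorded library size when spikein = TRUE
lib_size_increase <- lib_sizes_spikein - lib_sizes_no_spikein
print(lib_size_increase)

# Does the increase match the spike-in binned totals sample by sample?
print(lib_size_increase == spikein_totals)
```

The increases are in the hundreds of thousands of reads per sample, far larger than the binned totals, so the two sets of $lib.sizes do not differ simply by the binned spike-in counts.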

Thanks.

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cowplot_1.1.1               pheatmap_1.0.12             ggrepel_0.9.3              
 [4] profileplyr_1.12.0          csaw_1.30.1                 DiffBind_3.6.5             
 [7] SummarizedExperiment_1.26.1 Biobase_2.56.0              MatrixGenerics_1.8.1       
[10] matrixStats_0.63.0          GenomicRanges_1.48.0        GenomeInfoDb_1.32.4        
[13] IRanges_2.30.1              S4Vectors_0.34.0            BiocGenerics_0.42.0        
[16] forcats_1.0.0               stringr_1.5.0               dplyr_1.1.0                
[19] purrr_1.0.1                 readr_2.1.4                 tidyr_1.3.0                
[22] tibble_3.1.8                ggplot2_3.4.0               tidyverse_1.3.2

SpikeIn DiffBind • 1.1k views

score 1 · Answer 1 · 2023-03-13

Rory Stark ★ 5.3k

Cambridge, UK

In the second case (spikein=TRUE), the $lib.sizes are recorded as the sum of the reads in the primary files plus those in the spike-ins. However, these values are not used to compute the normalization factors; the $binned$totals are.

To see this in action, try running:

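# spikein=FALSE: compare the normalization factors with the recorded library sizes
cor(db_data_spikeinNorm2$norm$DESeq2$lib.sizes,
    db_data_spikeinNorm2$norm$DESeq2$norm.facs)

# spikein=TRUE: compare the normalization factors with the spike-in binned totals
cor(db_data_spikeinNorm3$norm$background$binned$totals,
    db_data_spikeinNorm3$norm$DESeq2$norm.facs)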
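
If you want to see why the correlation is informative without rerunning DiffBind, here is a minimal base-R sketch using the numbers from the question. It rests on an assumption that is not confirmed in this thread: that library-size normalization (DBA_NORM_LIB) scales each sample by its total divided by the mean total. The name sketch_facs is hypothetical and stands in for $norm.facs.

```r
# Values copied from the question (spikein = TRUE run).
recorded_lib_sizes <- c(7747122, 7334460, 9112179, 7386926)  # $norm$DESeq2$lib.sizes
spikein_totals     <- c(1970, 1923, 2638, 2262)              # $norm$background$binned$totals

# ASSUMPTION: library-size factors are each sample's total relative to the mean.
# This is an illustrative stand-in, not DiffBind's actual code.
sketch_facs <- spikein_totals / mean(spikein_totals)

# Factors built from the binned totals are exactly proportional to them,
# so this correlation is 1 up to floating-point error.
cor_with_totals <- cor(spikein_totals, sketch_facs)

# The recorded library sizes are not proportional to the binned totals,
# so this correlation falls below 1.
cor_with_lib_sizes <- cor(recorded_lib_sizes, sketch_facs)

print(cor_with_totals)
print(cor_with_lib_sizes)
```

Under that assumption, whichever totals correlate perfectly with the factors are the ones the factors were computed from, which is what the two cor() calls above check on the real objects.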