Hi there, I have a few BAMS and narrowPeak files from MACS2.
For some reason, I keep finding chr21 in my data, despite everything being in mm10 (i.e. autosomes chr1-19 ).
Sanity Checks:
- Was the data falsely mapped?
- Did the peak caller use a bad annotation?
Was the data falsely mapped?
> ls *.bam | xargs -n 1 samtools head | sort | uniq -c
8 @HD VN:1.5 SO:coordinate
8 @PG ID:bowtie2 PN:bowtie2 VN:2.4.5 \
CL:"bowtie2-align-s --wrapper basic-0 -p -x \
/mnt/2/indexes/Mus_musculus/Ensembl/GRCm38/\
Sequence/Bowtie2Index/genome\
--very-fast -1 input_f.fastq. -2 input_r.fastq.gz"
8 @SQ SN:10 LN:130694993
8 @SQ SN:11 LN:122082543
8 @SQ SN:12 LN:120129022
8 @SQ SN:13 LN:120421639
8 @SQ SN:14 LN:124902244
8 @SQ SN:15 LN:104043685
8 @SQ SN:16 LN:98207768
8 @SQ SN:17 LN:94987271
8 @SQ SN:18 LN:90702639
8 @SQ SN:19 LN:61431566
8 @SQ SN:1 LN:195471971
8 @SQ SN:2 LN:182113224
8 @SQ SN:3 LN:160039680
8 @SQ SN:4 LN:156508116
8 @SQ SN:5 LN:151834684
8 @SQ SN:6 LN:149736546
8 @SQ SN:7 LN:145441459
8 @SQ SN:8 LN:129401213
8 @SQ SN:9 LN:124595110
8 @SQ SN:MT LN:16299
8 @SQ SN:X LN:171031299
8 @SQ SN:Y LN:91744698
Result: Nope, no chr21 in any of the 8 bam files
Did the peak caller use a bad annotation?
> grep -HR "^19" *.bed | wc -l
12428 ## chr 19 exists, good
> grep -HR "^21" *.bed | wc -l
0 ## chr 21 does not exist, also good.
Result: Nope, no chr21 in any of the 8 peak files
What does DiffBind see?
> sample = dba(sampleSheet=read.csv('sample_table.csv'), \
peakFormat='bed', scoreCol=5, bLowerScoreBetter=FALSE)
## Are the peak files read in correctly?
> sapply(1:length(sample$peaks), \
function(x){sum(sample$peaks[[x]]$seqnames == 19)})
1725 1510 1489 1490 1742 1662 1508 1302
## Yes, it clearly sees chromosome 19
> sapply(1:length(sample$peaks), \
function(x){sum(sample$peaks[[x]]$seqnames == 21)})
0 0 0 0 0 0 0 0
## Yes, it clearly sees no chromosome 21
# What chromosomes are detected?
> sample$chrmap
[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "3" "4" "5"
[16] "6" "7" "8" "9" "MT" "X" "Y"
## Good, no chr21
# What about the merged peak sets?
> as_tibble(sample$merged) %>% filter(CHR == 21)
# A tibble: 2,662 × 3
CHR START END
<dbl> <dbl> <dbl>
1 21 3234343 3234692
2 21 3470563 3470738
3 21 5329482 5329844
4 21 5452296 5452426
5 21 5467989 5468818
6 21 5866580 5866768
7 21 5918288 5918614
8 21 5949306 5950090
9 21 6047018 6048585
10 21 6083358 6083558
# … with 2,652 more rows
## wait, what?
# Okay, and lastly what about the binding
> as_tibble(sample$binding) %>% filter(CHR == 21)
# A tibble: 1,994 × 11
CHR START END WT_d2_r3 WT_d2_r4 WT_d2.5_r1 WT_d2.5_r2 WT_d3_r3 WT_d3_r4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 3.23e6 3.23e6 0 0 0 0 0 0.00271
2 21 5.33e6 5.33e6 0.00555 0.00472 0 0.00154 0 0
3 21 5.47e6 5.47e6 0.00587 0.00626 0.00315 0.00274 0.00142 0.00441
4 21 5.92e6 5.92e6 0.00236 0 0.00238 0.00274 0 0
5 21 5.95e6 5.95e6 0 0 0.00275 0.00274 0.00376 0.00409
6 21 6.05e6 6.05e6 0.00493 0.00367 0.00413 0.00321 0.00202 0.00614
7 21 6.09e6 6.09e6 0.0107 0.00780 0.00953 0.00972 0.00762 0.00590
8 21 6.17e6 6.17e6 0.00354 0.00287 0.00457 0.00364 0.00197 0.00365
9 21 6.38e6 6.38e6 0 0 0.00164 0 0.00171 0.00111
10 21 6.40e6 6.40e6 0.0114 0.00598 0.00543 0.00581 0.00365 0.00484
# … with 1,984 more rows, and 2 more variables: WT_d3.5_r1 <dbl>,
# WT_d3.5_r2 <dbl>
## this is insane... why do these samples
## have these values for these peaks?
Session Info
> sessionInfo( )
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux
Matrix products: default
BLAS: /usr/lib/libblas.so.3.10.1
LAPACK: /usr/lib/liblapack.so.3.10.1
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] BSgenome.Mmusculus.UCSC.mm10.masked_1.4.3
[2] BSgenome.Mmusculus.UCSC.mm10_1.4.3
[3] BSgenome_1.64.0
[4] rtracklayer_1.56.0
[5] Biostrings_2.64.0
[6] XVector_0.36.0
[7] DiffBind_3.6.0
[8] SummarizedExperiment_1.26.1
[9] Biobase_2.56.0
[10] MatrixGenerics_1.8.0
[11] matrixStats_0.62.0
[12] GenomicRanges_1.48.0
[13] GenomeInfoDb_1.32.1
[14] IRanges_2.30.0
[15] S4Vectors_0.34.0
[16] BiocGenerics_0.42.0
[17] forcats_0.5.1
[18] stringr_1.4.0
[19] dplyr_1.0.9
[20] purrr_0.3.4
[21] readr_2.1.2
[22] tidyr_1.2.0
[23] tibble_3.1.7
[24] ggplot2_3.3.6
[25] tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] amap_0.8-18 colorspace_2.0-3 rjson_0.2.21
[4] hwriter_1.3.2.1 ellipsis_0.3.2 fs_1.5.2
[7] rstudioapi_0.13 ggrepel_0.9.1 fansi_1.0.3
[10] mvtnorm_1.1-3 apeglm_1.18.0 lubridate_1.8.0
[13] xml2_1.3.3 jsonlite_1.8.0 Rsamtools_2.12.0
[16] broom_0.8.0 ashr_2.2-54 dbplyr_2.1.1
[19] png_0.1-7 GreyListChIP_1.28.0 compiler_4.2.0
[22] httr_1.4.3 backports_1.4.1 assertthat_0.2.1
[25] Matrix_1.4-1 fastmap_1.1.0 limma_3.52.0
[28] cli_3.3.0 htmltools_0.5.2 tools_4.2.0
[31] coda_0.19-4 gtable_0.3.0 glue_1.6.2
[34] GenomeInfoDbData_1.2.8 systemPipeR_2.2.2 ShortRead_1.54.0
[37] Rcpp_1.0.8.3 bbmle_1.0.25 cellranger_1.1.0
[40] vctrs_0.4.1 rvest_1.0.2 lifecycle_1.0.1
[43] irlba_2.3.5 restfulr_0.0.13 gtools_3.9.2
[46] XML_3.99-0.9 zlibbioc_1.42.0 MASS_7.3-57
[49] scales_1.2.0 hms_1.1.1 parallel_4.2.0
[52] RColorBrewer_1.1-3 yaml_2.3.5 emdbook_1.3.12
[55] bdsmatrix_1.3-4 latticeExtra_0.6-29 stringi_1.7.6
[58] SQUAREM_2021.1 BiocIO_1.6.0 caTools_1.18.2
[61] BiocParallel_1.30.0 truncnorm_1.0-8 rlang_1.0.2
[64] pkgconfig_2.0.3 bitops_1.0-7 lattice_0.20-45
[67] invgamma_1.1 GenomicAlignments_1.32.0 htmlwidgets_1.5.4
[70] tidyselect_1.1.2 plyr_1.8.7 magrittr_2.0.3
[73] R6_2.5.1 gplots_3.1.3 generics_0.1.2
[76] DelayedArray_0.22.0 DBI_1.1.2 pillar_1.7.0
[79] haven_2.5.0 withr_2.5.0 RCurl_1.98-1.6
[82] mixsqp_0.3-43 modelr_0.1.8 crayon_1.5.1
[85] KernSmooth_2.23-20 utf8_1.2.2 tzdb_0.3.0
[88] jpeg_0.1-9 locfit_1.5-9.5 grid_4.2.0
[91] readxl_1.4.0 reprex_2.0.1 digest_0.6.29
[94] numDeriv_2016.8-1.1 munsell_0.5.0
Am I super sure that the BAM files have not been mapped to Chr21?
Yes, definitely not in the BAM data.