Sanity Checks:

Question

Diffbind dba$binding slot finding Chr21 for an MM10 dataset

0

Entering edit mode

Mehmet • 0

@d8786237

Last seen 4 months ago

Germany

Hi there, I have a few BAMS and narrowPeak files from MACS2.

For some reason, I keep finding chr21 in my data, despite everything being in mm10 (i.e. autosomes chr1-19 ).

Sanity Checks:

Was the data falsely mapped?
Did the peak caller use a bad annotation?

Was the data falsely mapped?

> ls *.bam | xargs -n 1 samtools head | sort | uniq -c

      8 @HD     VN:1.5  SO:coordinate
      8 @PG     ID:bowtie2      PN:bowtie2      VN:2.4.5 \
            CL:"bowtie2-align-s --wrapper basic-0 -p -x \
              /mnt/2/indexes/Mus_musculus/Ensembl/GRCm38/\
               Sequence/Bowtie2Index/genome\
               --very-fast -1 input_f.fastq. -2 input_r.fastq.gz"
      8 @SQ     SN:10   LN:130694993
      8 @SQ     SN:11   LN:122082543
      8 @SQ     SN:12   LN:120129022
      8 @SQ     SN:13   LN:120421639
      8 @SQ     SN:14   LN:124902244
      8 @SQ     SN:15   LN:104043685
      8 @SQ     SN:16   LN:98207768
      8 @SQ     SN:17   LN:94987271
      8 @SQ     SN:18   LN:90702639
      8 @SQ     SN:19   LN:61431566
      8 @SQ     SN:1    LN:195471971
      8 @SQ     SN:2    LN:182113224
      8 @SQ     SN:3    LN:160039680
      8 @SQ     SN:4    LN:156508116
      8 @SQ     SN:5    LN:151834684
      8 @SQ     SN:6    LN:149736546
      8 @SQ     SN:7    LN:145441459
      8 @SQ     SN:8    LN:129401213
      8 @SQ     SN:9    LN:124595110
      8 @SQ     SN:MT   LN:16299
      8 @SQ     SN:X    LN:171031299
      8 @SQ     SN:Y    LN:91744698

Result: Nope, no chr21 in any of the 8 bam files

Did the peak caller use a bad annotation?

> grep -HR "^19" *.bed | wc -l
12428  ## chr 19 exists, good
> grep -HR "^21" *.bed | wc -l
0  ## chr 21 does not exist, also good.

Result: Nope, no chr21 in any of the 8 peak files

What does DiffBind see?

> sample = dba(sampleSheet=read.csv('sample_table.csv'), \
     peakFormat='bed', scoreCol=5, bLowerScoreBetter=FALSE)


## Are the peak files read in correctly?

> sapply(1:length(sample$peaks), \
    function(x){sum(sample$peaks[[x]]$seqnames == 19)})
1725 1510 1489 1490 1742 1662 1508 1302
## Yes, it clearly sees chromosome 19

> sapply(1:length(sample$peaks), \
    function(x){sum(sample$peaks[[x]]$seqnames == 21)})
0 0 0 0 0 0 0 0
## Yes, it clearly sees no chromosome 21


# What chromosomes are detected?

> sample$chrmap
 [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2"  "3"  "4"  "5" 
[16] "6"  "7"  "8"  "9"  "MT" "X"  "Y" 
## Good, no chr21

# What about the merged peak sets?

> as_tibble(sample$merged) %>% filter(CHR == 21)
# A tibble: 2,662 × 3
     CHR   START     END
   <dbl>   <dbl>   <dbl>
 1    21 3234343 3234692
 2    21 3470563 3470738
 3    21 5329482 5329844
 4    21 5452296 5452426
 5    21 5467989 5468818
 6    21 5866580 5866768
 7    21 5918288 5918614
 8    21 5949306 5950090
 9    21 6047018 6048585
10    21 6083358 6083558
# … with 2,652 more rows
## wait, what?

# Okay, and lastly what about the binding
> as_tibble(sample$binding) %>% filter(CHR == 21)
# A tibble: 1,994 × 11
     CHR  START    END WT_d2_r3 WT_d2_r4 WT_d2.5_r1 WT_d2.5_r2 WT_d3_r3 WT_d3_r4
   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>      <dbl>      <dbl>    <dbl>    <dbl>
 1    21 3.23e6 3.23e6  0        0          0          0        0        0.00271
 2    21 5.33e6 5.33e6  0.00555  0.00472    0          0.00154  0        0      
 3    21 5.47e6 5.47e6  0.00587  0.00626    0.00315    0.00274  0.00142  0.00441
 4    21 5.92e6 5.92e6  0.00236  0          0.00238    0.00274  0        0      
 5    21 5.95e6 5.95e6  0        0          0.00275    0.00274  0.00376  0.00409
 6    21 6.05e6 6.05e6  0.00493  0.00367    0.00413    0.00321  0.00202  0.00614
 7    21 6.09e6 6.09e6  0.0107   0.00780    0.00953    0.00972  0.00762  0.00590
 8    21 6.17e6 6.17e6  0.00354  0.00287    0.00457    0.00364  0.00197  0.00365
 9    21 6.38e6 6.38e6  0        0          0.00164    0        0.00171  0.00111
10    21 6.40e6 6.40e6  0.0114   0.00598    0.00543    0.00581  0.00365  0.00484
# … with 1,984 more rows, and 2 more variables: WT_d3.5_r1 <dbl>,
#   WT_d3.5_r2 <dbl>
##  this is insane... why do these samples
##  have these values for these peaks?

Session Info

> sessionInfo( )
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.10.1
LAPACK: /usr/lib/liblapack.so.3.10.1

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BSgenome.Mmusculus.UCSC.mm10.masked_1.4.3
 [2] BSgenome.Mmusculus.UCSC.mm10_1.4.3       
 [3] BSgenome_1.64.0                          
 [4] rtracklayer_1.56.0                       
 [5] Biostrings_2.64.0                        
 [6] XVector_0.36.0                           
 [7] DiffBind_3.6.0                           
 [8] SummarizedExperiment_1.26.1              
 [9] Biobase_2.56.0                           
[10] MatrixGenerics_1.8.0                     
[11] matrixStats_0.62.0                       
[12] GenomicRanges_1.48.0                     
[13] GenomeInfoDb_1.32.1                      
[14] IRanges_2.30.0                           
[15] S4Vectors_0.34.0                         
[16] BiocGenerics_0.42.0                      
[17] forcats_0.5.1                            
[18] stringr_1.4.0                            
[19] dplyr_1.0.9                              
[20] purrr_0.3.4                              
[21] readr_2.1.2                              
[22] tidyr_1.2.0                              
[23] tibble_3.1.7                             
[24] ggplot2_3.3.6                            
[25] tidyverse_1.3.1                          

loaded via a namespace (and not attached):
 [1] amap_0.8-18              colorspace_2.0-3         rjson_0.2.21            
 [4] hwriter_1.3.2.1          ellipsis_0.3.2           fs_1.5.2                
 [7] rstudioapi_0.13          ggrepel_0.9.1            fansi_1.0.3             
[10] mvtnorm_1.1-3            apeglm_1.18.0            lubridate_1.8.0         
[13] xml2_1.3.3               jsonlite_1.8.0           Rsamtools_2.12.0        
[16] broom_0.8.0              ashr_2.2-54              dbplyr_2.1.1            
[19] png_0.1-7                GreyListChIP_1.28.0      compiler_4.2.0          
[22] httr_1.4.3               backports_1.4.1          assertthat_0.2.1        
[25] Matrix_1.4-1             fastmap_1.1.0            limma_3.52.0            
[28] cli_3.3.0                htmltools_0.5.2          tools_4.2.0             
[31] coda_0.19-4              gtable_0.3.0             glue_1.6.2              
[34] GenomeInfoDbData_1.2.8   systemPipeR_2.2.2        ShortRead_1.54.0        
[37] Rcpp_1.0.8.3             bbmle_1.0.25             cellranger_1.1.0        
[40] vctrs_0.4.1              rvest_1.0.2              lifecycle_1.0.1         
[43] irlba_2.3.5              restfulr_0.0.13          gtools_3.9.2            
[46] XML_3.99-0.9             zlibbioc_1.42.0          MASS_7.3-57             
[49] scales_1.2.0             hms_1.1.1                parallel_4.2.0          
[52] RColorBrewer_1.1-3       yaml_2.3.5               emdbook_1.3.12          
[55] bdsmatrix_1.3-4          latticeExtra_0.6-29      stringi_1.7.6           
[58] SQUAREM_2021.1           BiocIO_1.6.0             caTools_1.18.2          
[61] BiocParallel_1.30.0      truncnorm_1.0-8          rlang_1.0.2             
[64] pkgconfig_2.0.3          bitops_1.0-7             lattice_0.20-45         
[67] invgamma_1.1             GenomicAlignments_1.32.0 htmlwidgets_1.5.4       
[70] tidyselect_1.1.2         plyr_1.8.7               magrittr_2.0.3          
[73] R6_2.5.1                 gplots_3.1.3             generics_0.1.2          
[76] DelayedArray_0.22.0      DBI_1.1.2                pillar_1.7.0            
[79] haven_2.5.0              withr_2.5.0              RCurl_1.98-1.6          
[82] mixsqp_0.3-43            modelr_0.1.8             crayon_1.5.1            
[85] KernSmooth_2.23-20       utf8_1.2.2               tzdb_0.3.0              
[88] jpeg_0.1-9               locfit_1.5-9.5           grid_4.2.0              
[91] readxl_1.4.0             reprex_2.0.1             digest_0.6.29           
[94] numDeriv_2016.8-1.1      munsell_0.5.0

DiffBind • 838 views

ADD COMMENT • link updated 23 months ago by sagaryevale121 • 0 • written 23 months ago by Mehmet • 0

0

Entering edit mode

Am I super sure that the BAM files have not been mapped to Chr21?

> ls *.bam | xargs -n 1 samtools view | awk '{print $3}' | sort | uniq -c 
26485688 *
10627096 1
6504281 10
8786263 11
5888992 12
5647827 13
5572417 14
5327053 15
4456125 16
6271790 17
4065676 18
3636229 19
10538289 2
7445814 3
8586943 4
8185808 5
8050500 6
8576673 7
7934833 8
7086330 9
7841323 MT
3614474 X
 966388 Y

Yes, definitely not in the BAM data.

ADD REPLY • link 23 months ago Mehmet • 0

score 1 · Answer 1 · 2022-05-12

1

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 14 days ago

Cambridge, UK

You are making assumptions about the representations used in sample$binding and sample$merged. The $binding and $merged "slots" are undocumented and not intended to be manipulated directly by users.

The supported method for retrieving the binding matrix is:

binding <- dba.peakset(sample, bRetrieve=TRUE)

Have a look at the chromosome names returned using this method and see if they correspond to your expectations.

Internally, the chromosome number (not name) in the internal matrices are indexes into the $chrmap field. So in your case, the regions with chromosome numbered 21 in sample$binding maps to regions on samples$chrmap[21], which corresponds to a chromosome named "X".

ADD COMMENT • link 23 months ago Rory Stark ★ 5.2k

0

Entering edit mode

(First let me apologise for what was a slightly overkill bug report --- yesterday was a long day. Secondly let me thank you for writing this fantastic tool in the first place, and thirdly thanks for the speedy resposne)

Oh I see... okay, that breathes a sigh of relief into my analysis! I was wondering why the CHR vector was immediately cast into a double in the binding slot.

On page 49 of the vignette, this $binding slot is referenced, so I took this to be the merged matrix (which it sort of is), but I took it too literally I guess

Edit: It works (slight typo "peasket" → "peakset")!

ADD REPLY • link 23 months ago Mehmet • 0

1

Entering edit mode

No worries!

Thanks also for catching the place in the vignette where $binding is accessed directly (there are actually two such places). I've updated the vignette to correctly retrieve the binding matrix using dba.peakset() before getting the number of rows.

ADD REPLY • link 23 months ago Rory Stark ★ 5.2k