I have run into trouble reading bam files into nucleR using readBAM. I aligned my fastq files using bwa, processed them with Picard Tools' "CleanSam", "SortSam", and "FixMateInformation", and then confirmed that there are no errors using "ValidateSamFile". When I run "readBAM", it "thinks" for a while but then crashes as follows:
WT_reads <- readBAM(files = WT_bams, type = 'paired')
The messages that come back are as follows: reading file XXXX.bam processing flags processing strand + processing strand - Error in .processStrand("-", bam) : ERROR: Mate selection for - strand is invalid
"WT_bams" is just a character vector with the names of bam files.
It's challenging for me to figure out even which read in the bam file is causing problems, which seems like a good first step toward fixing things. I would very much appreciate any suggestions about what I am doing wrong.
Thanks.
Eric
``` sessionInfo() R version 4.4.0 (2024-04-24) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 18.04.6 LTS
Matrix products: default BLAS/LAPACK: FlexiBLAS OPENBLAS; LAPACK version 3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles tzcode source: system (glibc)
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] nucleR_2.36.0 BiocManager_1.30.23
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.34.0 gtable_0.3.5 hwriter_1.3.2.1 ggplot2_3.5.1
[5] latticeExtra_0.6-30 Biobase_2.64.0 lattice_0.22-6 vctrs_0.6.5
[9] tools_4.4.0 bitops_1.0-7 generics_0.1.3 stats4_4.4.0
[13] parallel_4.4.0 tibble_3.2.1 fansi_1.0.6 pkgconfig_2.0.3
[17] Matrix_1.7-0 RColorBrewer_1.1-3 S4Vectors_0.42.0 lifecycle_1.0.4
[21] GenomeInfoDbData_1.2.12 compiler_4.4.0 deldir_2.0-4 Rsamtools_2.20.0
[25] Biostrings_2.72.0 munsell_0.5.1 codetools_0.2-20 GenomeInfoDb_1.40.0
[29] pillar_1.9.0 crayon_1.5.2 BiocParallel_1.38.0 DelayedArray_0.30.1
[33] ShortRead_1.62.0 abind_1.4-5 tidyselect_1.2.1 dplyr_1.1.4
[37] grid_4.4.0 colorspace_2.1-0 cli_3.6.2 SparseArray_1.4.3
[41] magrittr_2.0.3 S4Arrays_1.4.0 utf8_1.2.4 scales_1.3.0
[45] UCSC.utils_1.0.0 pwalign_1.0.0 XVector_0.44.0 httr_1.4.7
[49] matrixStats_1.3.0 jpeg_0.1-10 interp_1.1-6 png_0.1-8
[53] GenomicRanges_1.56.0 IRanges_2.38.0 rlang_1.1.3 Rcpp_1.0.12
[57] glue_1.7.0 BiocGenerics_0.50.0 pkgload_1.3.4 rstudioapi_0.16.0
[61] jsonlite_1.8.8 R6_2.5.1 MatrixGenerics_1.16.0 GenomicAlignments_1.40.0
[65] zlibbioc_1.50.0

I thought I might be able to track down a problematic line in the bam file by repeatedly splitting the bam file in and then running the command on both halves. I did this, splitting the file, seeing that one half caused a crash while the other half didn't, then splitting the half that caused the crash, etc. After 13 or so splits, neither half caused a crash. I then reassembled the last split file, saw that it caused the crash, split it again, and saw that neither half caused the crash. I had hoped that bwa used some sort of syntax that nucleR didn't like, and I would be able to track things down this way, but that doesn't seem to be the case. Perhaps my failure offers a clue about what is going wrong.
Eric