Hi everyone!
Whilst exploring the functions in Rsamtools
and getting familiar with it, I seem to have stumbled upon a potential Bug.
I was trying to using the ScanBam()
function to extract part of my bam file while selecting for a optional tag named "B1".
For example, given these example lines from the bam file :
NS500688:47:HG2VTBGXX:1:22304:7555:15536 16 1 14365 1 90M * 0 0 AGAGGCTTCCAGAGAAAACGGCACACCAATCAATAAAGAACTGAGAAGAAACCAACAGTGTGATTTTAATAAAGGATCTCTAGCTGTGCA //A6/E6/AAE/E/EEEE//A/E/E/<EAA/AA//AAEEA///E</E/AAEE<AEAEA/E/E//////6A6<66<A<AAA/E<<A<<AAA HI:i:1 AS:i:84nM:i:2 B1:Z:23 B2:Z:14 B3:Z:TCGTTCTAG NH:i:1 XF:Z:ENSG00000227232.5 NS500688:47:HG2VTBGXX:2:23301:18073:3197 16 1 14365 1 90M * 0 0 AGAGGCTTCCAGAGAAAACGGCACACCAATCAATAAAGAACTGAGCAGAAACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCA AAAAAEEEEEEEEAEEEEEE<EEE<EEAEAEEAEAEE/AAEEAAEEEEAEAEEEEE//E/E/EEEEE<A/66A</AEEAEEE/EEAE<EE HI:i:1 AS:i:88nM:i:0 B1:Z:23 B2:Z:14 B3:Z:TCGTTCTAG NH:i:1 XF:Z:ENSG00000227232.5 NS500688:47:HG2VTBGXX:2:23301:20724:12491 16 1 14365 1 90M * 0 0 AGAGGCTTCCAGAGAAAACGGCACACCAATCAATAAATAACTTAGCAGAAACAAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCA A/A/AEE//AA/AAAE/EAAAAEA<</<AA/AE//A//AA/A/A////<AA</<A/<//A<EA//<A66</6/<6A/A<A<A6E<6A66< HI:i:1 AS:i:82nM:i:3 B1:Z:23 B2:Z:14 B3:Z:TCGTTCTAG NH:i:1 XF:Z:ENSG00000227232.5 NS500688:47:HG2VTBGXX:1:11210:9566:5701 16 1 14366 1 66M *0 0 CACCAATCAATAAAGAACTGAGCAGAAACCAACATTGTTCTTTTAATAAATGATCTCTAGCTGTGC AAAAAA/AAA/EEAEEEE/EEEEAEAAAE<AA<<//EA/<</6//A6/E</EA/A/AAA/A<///A HI:i:1 AS:i:58 nM:i:3 B1:Z:13 B2:A:5 B3:Z:TCTTTGATC NH:i:1 XF:Z:ENSG00000227232.5
I wanted to filter for chromosome 1 and tag "B1" using :
param = ScanBamParam(what = scanBamWhat(), which = GRanges(1, IRanges(1,1e+07)), tag = c("B1")) chr1 = scanBam(testbamDestination, param = param)
But every time I try to run this, my R session crashes (using Rstudio 1.0.136). This does not happen when running the following lines :
param = ScanBamParam(what = scanBamWhat(), which = GRanges(1, IRanges(1,1e+07)), tag = c("XF")) chr1 = scanBam(testbamDestination, param = param)
This leads me to believe that at the given time, Rsamtools does not support tag names having numerical values in them. Would it be possible to confirm this? If so, can we expect potential support for this study case in a near future?
I am asking, because I was trying to filter the bam file using the custom tags, but unless I am missing something, I will have to find an alternative way.
Thank you very much in advance!
Session info :
R version 3.3.2 (2016-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.2 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicAlignments_1.10.0 SummarizedExperiment_1.4.0 GenomicFeatures_1.26.2 AnnotationDbi_1.36.1 [5] Biobase_2.34.0 Rsamtools_1.26.1 Biostrings_2.42.1 XVector_0.14.0 [9] GenomicRanges_1.26.4 GenomeInfoDb_1.10.2 IRanges_2.8.2 S4Vectors_0.12.2 [13] BiocGenerics_0.20.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.9 zlibbioc_1.20.0 BiocParallel_1.8.1 lattice_0.20-34 tools_3.3.2 grid_3.3.2 [7] DBI_0.5-1 digest_0.6.11 Matrix_1.2-8 rtracklayer_1.34.1 bitops_1.0-6 RCurl_1.95-4.8 [13] biomaRt_2.30.0 memoise_1.0.0 RSQLite_1.1-2 XML_3.98-1.5
Thank you very much for looking into it.
I extracted a small subset of the original bam file, on which trying to extract the "B1" and "B2" tags fails. Given your answer, I also tried it for the "B3" tag, which strangely doesn't fail. Thus I share your observation that it doesn't seem to be related to a tag-with-number problem.
I will contact you per email and thank you in advance for your time and input.
The problem was that a single tag mixed formatting 'Z' (character string) and 'A' (single character). The C code in Rsamtols did not anticipate this. It has been fixed in version 1.26.2 / 1.27.16. These versions should be available in the next several days.
The fact that a tag is always accompanied by a formatting character implies that the bam file authors anticipate that tags can have different types in different records. Rsamtools only supports this to the extent that the R representation of the data type is not changed, e.g., Rsamtools would object if a tag in one record were formatted as a character string, and the same tag in another record formatted as an integer.
Thank you very much for clarifying this. I appreciate your fast response time and your expertise in this matter.