Question: Does Rsamtools support tag names with number characters? i.e ."B1"
0
2.5 years ago by
joelrrv0
joelrrv0 wrote:

Hi everyone!

Whilst exploring the functions in Rsamtools and getting familiar with it, I seem to have stumbled upon a potential Bug.

I was trying to using the ScanBam() function to extract part of my bam file while selecting for a optional tag named "B1".

For example,  given these example lines from the bam file :

NS500688:47:HG2VTBGXX:1:22304:7555:15536    16    1    14365    1    90M    *    0    0    AGAGGCTTCCAGAGAAAACGGCACACCAATCAATAAAGAACTGAGAAGAAACCAACAGTGTGATTTTAATAAAGGATCTCTAGCTGTGCA    //A6/E6/AAE/E/EEEE//A/E/E/<EAA/AA//AAEEA///E</E/AAEE<AEAEA/E/E//////6A6<66<A<AAA/E<<A<<AAA    HI:i:1    AS:i:84nM:i:2    B1:Z:23    B2:Z:14    B3:Z:TCGTTCTAG    NH:i:1    XF:Z:ENSG00000227232.5
NS500688:47:HG2VTBGXX:2:23301:18073:3197    16    1    14365    1    90M    *    0    0    AGAGGCTTCCAGAGAAAACGGCACACCAATCAATAAAGAACTGAGCAGAAACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCA    AAAAAEEEEEEEEAEEEEEE<EEE<EEAEAEEAEAEE/AAEEAAEEEEAEAEEEEE//E/E/EEEEE<A/66A</AEEAEEE/EEAE<EE    HI:i:1    AS:i:88nM:i:0    B1:Z:23    B2:Z:14    B3:Z:TCGTTCTAG    NH:i:1    XF:Z:ENSG00000227232.5
NS500688:47:HG2VTBGXX:2:23301:20724:12491    16    1    14365    1    90M    *    0    0    AGAGGCTTCCAGAGAAAACGGCACACCAATCAATAAATAACTTAGCAGAAACAAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCA    A/A/AEE//AA/AAAE/EAAAAEA<</<AA/AE//A//AA/A/A////<AA</<A/<//A<EA//<A66</6/<6A/A<A<A6E<6A66<    HI:i:1    AS:i:82nM:i:3    B1:Z:23    B2:Z:14    B3:Z:TCGTTCTAG    NH:i:1    XF:Z:ENSG00000227232.5
NS500688:47:HG2VTBGXX:1:11210:9566:5701    16    1    14366    1    66M    *0    0    CACCAATCAATAAAGAACTGAGCAGAAACCAACATTGTTCTTTTAATAAATGATCTCTAGCTGTGC    AAAAAA/AAA/EEAEEEE/EEEEAEAAAE<AA<<//EA/<</6//A6/E</EA/A/AAA/A<///A    HI:i:1    AS:i:58    nM:i:3    B1:Z:13    B2:A:5    B3:Z:TCTTTGATC    NH:i:1    XF:Z:ENSG00000227232.5

I wanted to filter for chromosome 1 and tag "B1" using :

param = ScanBamParam(what = scanBamWhat(), which = GRanges(1, IRanges(1,1e+07)), tag = c("B1"))
chr1 = scanBam(testbamDestination, param = param)

But every time I try to run this, my R session crashes (using Rstudio 1.0.136). This does not happen when running the following lines :

param = ScanBamParam(what = scanBamWhat(), which = GRanges(1, IRanges(1,1e+07)), tag = c("XF"))
chr1 = scanBam(testbamDestination, param = param)

This leads me to believe that at the given time, Rsamtools does not support tag names having numerical values in them. Would it be possible to confirm this? If so, can we expect potential support for this study case in a near future?

I am asking, because I was trying to filter the bam file using the custom tags, but unless I am missing something, I will have to find an alternative way.

Thank you very much in advance!

Session info :

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicAlignments_1.10.0   SummarizedExperiment_1.4.0 GenomicFeatures_1.26.2     AnnotationDbi_1.36.1
[5] Biobase_2.34.0             Rsamtools_1.26.1           Biostrings_2.42.1          XVector_0.14.0
[9] GenomicRanges_1.26.4       GenomeInfoDb_1.10.2        IRanges_2.8.2              S4Vectors_0.12.2
[13] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.9        zlibbioc_1.20.0    BiocParallel_1.8.1 lattice_0.20-34    tools_3.3.2        grid_3.3.2
[7] DBI_0.5-1          digest_0.6.11      Matrix_1.2-8       rtracklayer_1.34.1 bitops_1.0-6       RCurl_1.95-4.8
[13] biomaRt_2.30.0     memoise_1.0.0      RSQLite_1.1-2      XML_3.98-1.5     
rsamtools bug • 525 views
modified 2.5 years ago by Martin Morgan ♦♦ 23k • written 2.5 years ago by joelrrv0
Answer: Does Rsamtools support tag names with number characters? i.e ."B1"
2
2.5 years ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

It doesn't seem to be a problem with tags-with-numbers per se

> ex1 = system.file(package="Rsamtools", "extdata", "ex1.bam")
> param = ScanBamParam(tag="H0")
> table(scanBam("ex1.bam", param=param)[[1]]$tag$H0)

0    1    2    3    4    5    6    7    9   11   12   14   22   29   30   42
403 2787   35   14    5    2    4    1    2    1    2    1    2    2    1    1
63   82   85
1    1    6
> sessionInfo()
R version 3.3.2 Patched (2017-02-14 r72175)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] Rsamtools_1.26.1     Biostrings_2.42.1    XVector_0.14.0
[4] GenomicRanges_1.26.3 GenomeInfoDb_1.10.3  IRanges_2.8.1
[7] S4Vectors_0.12.1     BiocGenerics_0.20.0  BiocInstaller_1.24.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.20.0    RCurl_1.95-4.8     BiocParallel_1.8.1 bitops_1.0-6
> 

If you make (a subset of your) bam file available somewhere I will investigate further; please feel free to contact me off-list at martin.morgan at roswellpark.org

Thank you very much for looking into it.

I extracted a small subset of the original bam file, on which trying to extract the "B1" and "B2" tags fails. Given your answer, I also tried it for the "B3" tag, which strangely doesn't fail. Thus I share your observation that it doesn't seem to be related to a tag-with-number problem.

I will contact you per email and thank you in advance for your time and input.

1

The problem was that a single tag mixed formatting 'Z' (character string) and 'A' (single character). The C code in Rsamtols did not anticipate this. It has been fixed in version 1.26.2 / 1.27.16. These versions should be available in the next several days.

The fact that a tag is always accompanied by a formatting character implies that the bam file authors anticipate that tags can have different types in different records. Rsamtools only supports this to the extent that the R representation  of the data type is not changed, e.g., Rsamtools would object if a tag in one record were formatted as a character string, and the same tag in another record formatted as an integer.