Does Rsamtools support tag names with number characters? i.e ."B1"
1
0
Entering edit mode
joelrrv • 0
@joelrrv-12776
Last seen 7.1 years ago

Hi everyone!

Whilst exploring the functions in Rsamtools and getting familiar with it, I seem to have stumbled upon a potential Bug.

I was trying to using the ScanBam() function to extract part of my bam file while selecting for a optional tag named "B1".

For example,  given these example lines from the bam file :

NS500688:47:HG2VTBGXX:1:22304:7555:15536    16    1    14365    1    90M    *    0    0    AGAGGCTTCCAGAGAAAACGGCACACCAATCAATAAAGAACTGAGAAGAAACCAACAGTGTGATTTTAATAAAGGATCTCTAGCTGTGCA    //A6/E6/AAE/E/EEEE//A/E/E/<EAA/AA//AAEEA///E</E/AAEE<AEAEA/E/E//////6A6<66<A<AAA/E<<A<<AAA    HI:i:1    AS:i:84nM:i:2    B1:Z:23    B2:Z:14    B3:Z:TCGTTCTAG    NH:i:1    XF:Z:ENSG00000227232.5
NS500688:47:HG2VTBGXX:2:23301:18073:3197    16    1    14365    1    90M    *    0    0    AGAGGCTTCCAGAGAAAACGGCACACCAATCAATAAAGAACTGAGCAGAAACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCA    AAAAAEEEEEEEEAEEEEEE<EEE<EEAEAEEAEAEE/AAEEAAEEEEAEAEEEEE//E/E/EEEEE<A/66A</AEEAEEE/EEAE<EE    HI:i:1    AS:i:88nM:i:0    B1:Z:23    B2:Z:14    B3:Z:TCGTTCTAG    NH:i:1    XF:Z:ENSG00000227232.5
NS500688:47:HG2VTBGXX:2:23301:20724:12491    16    1    14365    1    90M    *    0    0    AGAGGCTTCCAGAGAAAACGGCACACCAATCAATAAATAACTTAGCAGAAACAAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCA    A/A/AEE//AA/AAAE/EAAAAEA<</<AA/AE//A//AA/A/A////<AA</<A/<//A<EA//<A66</6/<6A/A<A<A6E<6A66<    HI:i:1    AS:i:82nM:i:3    B1:Z:23    B2:Z:14    B3:Z:TCGTTCTAG    NH:i:1    XF:Z:ENSG00000227232.5
NS500688:47:HG2VTBGXX:1:11210:9566:5701    16    1    14366    1    66M    *0    0    CACCAATCAATAAAGAACTGAGCAGAAACCAACATTGTTCTTTTAATAAATGATCTCTAGCTGTGC    AAAAAA/AAA/EEAEEEE/EEEEAEAAAE<AA<<//EA/<</6//A6/E</EA/A/AAA/A<///A    HI:i:1    AS:i:58    nM:i:3    B1:Z:13    B2:A:5    B3:Z:TCTTTGATC    NH:i:1    XF:Z:ENSG00000227232.5

I wanted to filter for chromosome 1 and tag "B1" using :

param = ScanBamParam(what = scanBamWhat(), which = GRanges(1, IRanges(1,1e+07)), tag = c("B1"))
chr1 = scanBam(testbamDestination, param = param)

But every time I try to run this, my R session crashes (using Rstudio 1.0.136). This does not happen when running the following lines :

param = ScanBamParam(what = scanBamWhat(), which = GRanges(1, IRanges(1,1e+07)), tag = c("XF"))
chr1 = scanBam(testbamDestination, param = param)

This leads me to believe that at the given time, Rsamtools does not support tag names having numerical values in them. Would it be possible to confirm this? If so, can we expect potential support for this study case in a near future?

I am asking, because I was trying to filter the bam file using the custom tags, but unless I am missing something, I will have to find an alternative way.

Thank you very much in advance!

Session info :

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicAlignments_1.10.0   SummarizedExperiment_1.4.0 GenomicFeatures_1.26.2     AnnotationDbi_1.36.1      
 [5] Biobase_2.34.0             Rsamtools_1.26.1           Biostrings_2.42.1          XVector_0.14.0            
 [9] GenomicRanges_1.26.4       GenomeInfoDb_1.10.2        IRanges_2.8.2              S4Vectors_0.12.2          
[13] BiocGenerics_0.20.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9        zlibbioc_1.20.0    BiocParallel_1.8.1 lattice_0.20-34    tools_3.3.2        grid_3.3.2        
 [7] DBI_0.5-1          digest_0.6.11      Matrix_1.2-8       rtracklayer_1.34.1 bitops_1.0-6       RCurl_1.95-4.8    
[13] biomaRt_2.30.0     memoise_1.0.0      RSQLite_1.1-2      XML_3.98-1.5     
rsamtools bug • 1.2k views
ADD COMMENT
2
Entering edit mode
@martin-morgan-1513
Last seen 4 days ago
United States

It doesn't seem to be a problem with tags-with-numbers per se

> ex1 = system.file(package="Rsamtools", "extdata", "ex1.bam")
> param = ScanBamParam(tag="H0")
> table(scanBam("ex1.bam", param=param)[[1]]$tag$H0)

   0    1    2    3    4    5    6    7    9   11   12   14   22   29   30   42 
 403 2787   35   14    5    2    4    1    2    1    2    1    2    2    1    1 
  63   82   85 
   1    1    6 
> sessionInfo()
R version 3.3.2 Patched (2017-02-14 r72175)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Rsamtools_1.26.1     Biostrings_2.42.1    XVector_0.14.0      
[4] GenomicRanges_1.26.3 GenomeInfoDb_1.10.3  IRanges_2.8.1       
[7] S4Vectors_0.12.1     BiocGenerics_0.20.0  BiocInstaller_1.24.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.20.0    RCurl_1.95-4.8     BiocParallel_1.8.1 bitops_1.0-6      
> 

If you make (a subset of your) bam file available somewhere I will investigate further; please feel free to contact me off-list at martin.morgan at roswellpark.org

ADD COMMENT
0
Entering edit mode

Thank you very much for looking into it.

I extracted a small subset of the original bam file, on which trying to extract the "B1" and "B2" tags fails. Given your answer, I also tried it for the "B3" tag, which strangely doesn't fail. Thus I share your observation that it doesn't seem to be related to a tag-with-number problem.

I will contact you per email and thank you in advance for your time and input.

ADD REPLY
1
Entering edit mode

The problem was that a single tag mixed formatting 'Z' (character string) and 'A' (single character). The C code in Rsamtols did not anticipate this. It has been fixed in version 1.26.2 / 1.27.16. These versions should be available in the next several days.

The fact that a tag is always accompanied by a formatting character implies that the bam file authors anticipate that tags can have different types in different records. Rsamtools only supports this to the extent that the R representation  of the data type is not changed, e.g., Rsamtools would object if a tag in one record were formatted as a character string, and the same tag in another record formatted as an integer.

ADD REPLY
0
Entering edit mode

Thank you very much for clarifying this. I appreciate your fast response time and your expertise in this matter.

ADD REPLY

Login before adding your answer.

Traffic: 946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6