Entering edit mode
Taylor, Sean D
▴
250
@taylor-sean-d-5800
Last seen 10.2 years ago
I have received some BAM files from a colleague that were produced
from an Illumina HiSeq instrument. When I try to read them as a
GAlignments object, I get the following error:
> bamfile<-'03-192B3_lung.sort.nodup.bam.realigned.bam'
> gal<-readGAlignments(bamfile)
Error in value[[3L]](cond) :
failed to open BamFile: SAM/BAM header missing or empty
file: '03-192B3_lung.sort.nodup.bam.realigned.bam'
In addition: Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
[bam_header_read] invalid BAM binary header (this is not a BAM
file).
I looked at the header data this way:
> strsplit(readLines(bamfile, 10), '\t')
[[1]]
[1] NA
[[2]]
[1] "@SQ" "SN:1" "LN:249250621"
[[3]]
[1] "@SQ" "SN:2" "LN:243199373"
[[4]]030
[1] "@SQ" "SN:3" "LN:198022430"
[[5]]
[1] "@SQ" "SN:4" "LN:191154276"
[[6]]
[1] "@SQ" "SN:5" "LN:180915260"
[[7]]
[1] "@SQ" "SN:6" "LN:171115067"
[[8]]
[1] "@SQ" "SN:7" "LN:159138663"
[[9]]
[1] "@SQ" "SN:8" "LN:146364022"
[[10]]
[1] "@SQ" "SN:9" "LN:141213431"
Warning message:
In strsplit(readLines(bamfile, 10), "\t") :
input string 1 is invalid in this locale
And then from the command line:
$ less 03-192B3_lung.sort.nodup.bam.realigned.bam
BAM^A<ca>^K^@^@@HD VN:1.4 GO:none SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430
@SQ SN:4 LN:191154276
@SQ SN:5 LN:180915260
@SQ SN:6 LN:171115067
@SQ SN:7 LN:159138663
@SQ SN:8 LN:146364022
@SQ SN:9 LN:141213431
@SQ SN:10 LN:135534747
I'm not sure what that first line is, but I think that is the problem.
The bam file itself is 18 GB, so it is difficult to edit directly. Is
there a way I can direct readGAlignments to skip the first line? Or is
there a different solution, eg filterBam? Ultimately I am going to be
sifting through many such BAM files so I'm hoping for a general
solution that doesn't involve manual editing if possible.
Thanks,
Sean
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods base
other attached packages:
[1] latticeExtra_0.6-26 RColorBrewer_1.0-5 ShortRead_1.20.0
lattice_0.20-24
[5] Rsamtools_1.14.1 Biostrings_2.30.0 GenomicRanges_1.14.3
XVector_0.2.0
[9] IRanges_1.20.5 BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] Biobase_2.21.7 bitops_1.0-6 grid_3.0.2 hwriter_1.3
stats4_3.0.2 tools_3.0.2
[7] zlibbioc_1.7.0
Sean Taylor
Post-doctoral Fellow
Fred Hutchinson Cancer Research Center
206-667-5544
[[alternative HTML version deleted]]