Problem with BAM header in GAlignments

1

Entering edit mode

Taylor, Sean D ▴ 250

@taylor-sean-d-5800

Last seen 11.1 years ago

I have received some BAM files from a colleague that were produced from an Illumina HiSeq instrument. When I try to read them as a GAlignments object, I get the following error: > bamfile<-'03-192B3_lung.sort.nodup.bam.realigned.bam' > gal<-readGAlignments(bamfile) Error in value[[3L]](cond) : failed to open BamFile: SAM/BAM header missing or empty file: '03-192B3_lung.sort.nodup.bam.realigned.bam' In addition: Warning message: In doTryCatch(return(expr), name, parentenv, handler) : [bam_header_read] invalid BAM binary header (this is not a BAM file). I looked at the header data this way: > strsplit(readLines(bamfile, 10), '\t') [[1]] [1] NA [[2]] [1] "@SQ" "SN:1" "LN:249250621" [[3]] [1] "@SQ" "SN:2" "LN:243199373" [[4]]030 [1] "@SQ" "SN:3" "LN:198022430" [[5]] [1] "@SQ" "SN:4" "LN:191154276" [[6]] [1] "@SQ" "SN:5" "LN:180915260" [[7]] [1] "@SQ" "SN:6" "LN:171115067" [[8]] [1] "@SQ" "SN:7" "LN:159138663" [[9]] [1] "@SQ" "SN:8" "LN:146364022" [[10]] [1] "@SQ" "SN:9" "LN:141213431" Warning message: In strsplit(readLines(bamfile, 10), "\t") : input string 1 is invalid in this locale And then from the command line: $ less 03-192B3_lung.sort.nodup.bam.realigned.bam BAM^A<ca>^K^@^@@HD VN:1.4 GO:none SO:coordinate @SQ SN:1 LN:249250621 @SQ SN:2 LN:243199373 @SQ SN:3 LN:198022430 @SQ SN:4 LN:191154276 @SQ SN:5 LN:180915260 @SQ SN:6 LN:171115067 @SQ SN:7 LN:159138663 @SQ SN:8 LN:146364022 @SQ SN:9 LN:141213431 @SQ SN:10 LN:135534747 I'm not sure what that first line is, but I think that is the problem. The bam file itself is 18 GB, so it is difficult to edit directly. Is there a way I can direct readGAlignments to skip the first line? Or is there a different solution, eg filterBam? Ultimately I am going to be sifting through many such BAM files so I'm hoping for a general solution that doesn't involve manual editing if possible. Thanks, Sean > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] latticeExtra_0.6-26 RColorBrewer_1.0-5 ShortRead_1.20.0 lattice_0.20-24 [5] Rsamtools_1.14.1 Biostrings_2.30.0 GenomicRanges_1.14.3 XVector_0.2.0 [9] IRanges_1.20.5 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] Biobase_2.21.7 bitops_1.0-6 grid_3.0.2 hwriter_1.3 stats4_3.0.2 tools_3.0.2 [7] zlibbioc_1.7.0 Sean Taylor Post-doctoral Fellow Fred Hutchinson Cancer Research Center 206-667-5544 [[alternative HTML version deleted]]

Cancer Cancer • 3.9k views

ADD COMMENT • link updated 11.8 years ago by Martin Morgan 25k • written 11.8 years ago by Taylor, Sean D ▴ 250

1

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 9 months ago

United States

On 01/07/2014 02:31 PM, Taylor, Sean D wrote: > I have received some BAM files from a colleague that were produced from an Illumina HiSeq instrument. When I try to read them as a GAlignments object, I get the following error: >> bamfile<-'03-192B3_lung.sort.nodup.bam.realigned.bam' >> gal<-readGAlignments(bamfile) > Error in value[[3L]](cond) : > failed to open BamFile: SAM/BAM header missing or empty > file: '03-192B3_lung.sort.nodup.bam.realigned.bam' > In addition: Warning message: > In doTryCatch(return(expr), name, parentenv, handler) : > [bam_header_read] invalid BAM binary header (this is not a BAM file). > > > I looked at the header data this way: >> strsplit(readLines(bamfile, 10), '\t') this (using readLines on bamfile) implies that it is actually a sam (text) file rather than a bam (binary) file, despite the .bam extension! Convert it to bam and optionally generate an index with asBam(bamfile, "some_file_name.bam", indexDestination=TRUE) or, better, get your upstream provider to generate bam files for you. The funky information in the first line looks like the file has been corrupted in some way, too; you'll need to get that fixed first, which probably means coordinating with whomever is providing you with files. Probably you want to make sure that the 'checksums' on the file are the same before and after any copy, e.g., running the linux command md5sum *bam or from within R using tools::md5sum(). Martin > [[1]] > [1] NA > > [[2]] > [1] "@SQ" "SN:1" "LN:249250621" > > [[3]] > [1] "@SQ" "SN:2" "LN:243199373" > > [[4]]030 > [1] "@SQ" "SN:3" "LN:198022430" > > [[5]] > [1] "@SQ" "SN:4" "LN:191154276" > > [[6]] > [1] "@SQ" "SN:5" "LN:180915260" > > [[7]] > [1] "@SQ" "SN:6" "LN:171115067" > > [[8]] > [1] "@SQ" "SN:7" "LN:159138663" > > [[9]] > [1] "@SQ" "SN:8" "LN:146364022" > > [[10]] > [1] "@SQ" "SN:9" "LN:141213431" > Warning message: > In strsplit(readLines(bamfile, 10), "\t") : > input string 1 is invalid in this locale > > And then from the command line: > $ less 03-192B3_lung.sort.nodup.bam.realigned.bam > BAM^A<ca>^K^@^@@HD VN:1.4 GO:none SO:coordinate > @SQ SN:1 LN:249250621 > @SQ SN:2 LN:243199373 > @SQ SN:3 LN:198022430 > @SQ SN:4 LN:191154276 > @SQ SN:5 LN:180915260 > @SQ SN:6 LN:171115067 > @SQ SN:7 LN:159138663 > @SQ SN:8 LN:146364022 > @SQ SN:9 LN:141213431 > @SQ SN:10 LN:135534747 > > I'm not sure what that first line is, but I think that is the problem. The bam file itself is 18 GB, so it is difficult to edit directly. Is there a way I can direct readGAlignments to skip the first line? Or is there a different solution, eg filterBam? Ultimately I am going to be sifting through many such BAM files so I'm hoping for a general solution that doesn't involve manual editing if possible. > > Thanks, > Sean > >> sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] latticeExtra_0.6-26 RColorBrewer_1.0-5 ShortRead_1.20.0 lattice_0.20-24 > [5] Rsamtools_1.14.1 Biostrings_2.30.0 GenomicRanges_1.14.3 XVector_0.2.0 > [9] IRanges_1.20.5 BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] Biobase_2.21.7 bitops_1.0-6 grid_3.0.2 hwriter_1.3 stats4_3.0.2 tools_3.0.2 > [7] zlibbioc_1.7.0 > > > Sean Taylor > Post-doctoral Fellow > Fred Hutchinson Cancer Research Center > 206-667-5544 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 11.8 years ago Martin Morgan 25k

0

Entering edit mode

The thought did occur to me that these might be SAM format, but its odd because the information beyond the header appears to be binary. The files also came with accompanying .bai index files as well. I have never had cause to carefully examine the header, so I didn't know what to expect. I take it thought that the header in a BAM file is also supposed to in binary? Sean > -----Original Message----- > From: Martin Morgan [mailto:mtmorgan at fhcrc.org] > Sent: Tuesday, January 07, 2014 2:45 PM > To: Taylor, Sean D; bioconductor at r-project.org > Subject: Re: [BioC] Problem with BAM header in GAlignments > > On 01/07/2014 02:31 PM, Taylor, Sean D wrote: > > I have received some BAM files from a colleague that were produced from > an Illumina HiSeq instrument. When I try to read them as a GAlignments > object, I get the following error: > >> bamfile<-'03-192B3_lung.sort.nodup.bam.realigned.bam' > >> gal<-readGAlignments(bamfile) > > Error in value[[3L]](cond) : > > failed to open BamFile: SAM/BAM header missing or empty > > file: '03-192B3_lung.sort.nodup.bam.realigned.bam' > > In addition: Warning message: > > In doTryCatch(return(expr), name, parentenv, handler) : > > [bam_header_read] invalid BAM binary header (this is not a BAM file). > > > > > > I looked at the header data this way: > >> strsplit(readLines(bamfile, 10), '\t') > > this (using readLines on bamfile) implies that it is actually a sam (text) file > rather than a bam (binary) file, despite the .bam extension! Convert it to > bam and optionally generate an index with > > asBam(bamfile, "some_file_name.bam", indexDestination=TRUE) > > or, better, get your upstream provider to generate bam files for you. > > The funky information in the first line looks like the file has been corrupted in > some way, too; you'll need to get that fixed first, which probably means > coordinating with whomever is providing you with files. Probably you want to > make sure that the 'checksums' on the file are the same before and after any > copy, e.g., running the linux command md5sum *bam or from within R using > tools::md5sum(). > > Martin > > > [[1]] > > [1] NA > > > > [[2]] > > [1] "@SQ" "SN:1" "LN:249250621" > > > > [[3]] > > [1] "@SQ" "SN:2" "LN:243199373" > > > > [[4]]030 > > [1] "@SQ" "SN:3" "LN:198022430" > > > > [[5]] > > [1] "@SQ" "SN:4" "LN:191154276" > > > > [[6]] > > [1] "@SQ" "SN:5" "LN:180915260" > > > > [[7]] > > [1] "@SQ" "SN:6" "LN:171115067" > > > > [[8]] > > [1] "@SQ" "SN:7" "LN:159138663" > > > > [[9]] > > [1] "@SQ" "SN:8" "LN:146364022" > > > > [[10]] > > [1] "@SQ" "SN:9" "LN:141213431" > > Warning message: > > In strsplit(readLines(bamfile, 10), "\t") : > > input string 1 is invalid in this locale > > > > And then from the command line: > > $ less 03-192B3_lung.sort.nodup.bam.realigned.bam > > BAM^A<ca>^K^@^@@HD VN:1.4 GO:none SO:coordinate > > @SQ SN:1 LN:249250621 > > @SQ SN:2 LN:243199373 > > @SQ SN:3 LN:198022430 > > @SQ SN:4 LN:191154276 > > @SQ SN:5 LN:180915260 > > @SQ SN:6 LN:171115067 > > @SQ SN:7 LN:159138663 > > @SQ SN:8 LN:146364022 > > @SQ SN:9 LN:141213431 > > @SQ SN:10 LN:135534747 > > > > I'm not sure what that first line is, but I think that is the problem. The bam > file itself is 18 GB, so it is difficult to edit directly. Is there a way I can direct > readGAlignments to skip the first line? Or is there a different solution, eg > filterBam? Ultimately I am going to be sifting through many such BAM files so > I'm hoping for a general solution that doesn't involve manual editing if > possible. > > > > Thanks, > > Sean > > > >> sessionInfo() > > R version 3.0.2 (2013-09-25) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > LC_TIME=en_US.UTF-8 > > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 > LC_MESSAGES=en_US.UTF-8 > > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C > > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > > > > attached base packages: > > [1] parallel stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] latticeExtra_0.6-26 RColorBrewer_1.0-5 ShortRead_1.20.0 > lattice_0.20-24 > > [5] Rsamtools_1.14.1 Biostrings_2.30.0 GenomicRanges_1.14.3 > XVector_0.2.0 > > [9] IRanges_1.20.5 BiocGenerics_0.8.0 > > > > loaded via a namespace (and not attached): > > [1] Biobase_2.21.7 bitops_1.0-6 grid_3.0.2 hwriter_1.3 stats4_3.0.2 > tools_3.0.2 > > [7] zlibbioc_1.7.0 > > > > > > Sean Taylor > > Post-doctoral Fellow > > Fred Hutchinson Cancer Research Center > > 206-667-5544 > > > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793

ADD REPLY • link 11.8 years ago Taylor, Sean D ▴ 250

0

Entering edit mode

On 01/07/2014 02:52 PM, Taylor, Sean D wrote: > The thought did occur to me that these might be SAM format, but its odd because the information beyond the header appears to be binary. The files also came with accompanying .bai index files as well. I have never had cause to carefully examine the header, so I didn't know what to expect. I take it thought that the header in a BAM file is also supposed to in binary? > Yes, the entire bam file, header and all, is supposed to be in a compressed binary representation. Martin > Sean >> -----Original Message----- >> From: Martin Morgan [mailto:mtmorgan at fhcrc.org] >> Sent: Tuesday, January 07, 2014 2:45 PM >> To: Taylor, Sean D; bioconductor at r-project.org >> Subject: Re: [BioC] Problem with BAM header in GAlignments >> >> On 01/07/2014 02:31 PM, Taylor, Sean D wrote: >>> I have received some BAM files from a colleague that were produced from >> an Illumina HiSeq instrument. When I try to read them as a GAlignments >> object, I get the following error: >>>> bamfile<-'03-192B3_lung.sort.nodup.bam.realigned.bam' >>>> gal<-readGAlignments(bamfile) >>> Error in value[[3L]](cond) : >>> failed to open BamFile: SAM/BAM header missing or empty >>> file: '03-192B3_lung.sort.nodup.bam.realigned.bam' >>> In addition: Warning message: >>> In doTryCatch(return(expr), name, parentenv, handler) : >>> [bam_header_read] invalid BAM binary header (this is not a BAM file). >>> >>> >>> I looked at the header data this way: >>>> strsplit(readLines(bamfile, 10), '\t') >> >> this (using readLines on bamfile) implies that it is actually a sam (text) file >> rather than a bam (binary) file, despite the .bam extension! Convert it to >> bam and optionally generate an index with >> >> asBam(bamfile, "some_file_name.bam", indexDestination=TRUE) >> >> or, better, get your upstream provider to generate bam files for you. >> >> The funky information in the first line looks like the file has been corrupted in >> some way, too; you'll need to get that fixed first, which probably means >> coordinating with whomever is providing you with files. Probably you want to >> make sure that the 'checksums' on the file are the same before and after any >> copy, e.g., running the linux command md5sum *bam or from within R using >> tools::md5sum(). >> >> Martin >> >>> [[1]] >>> [1] NA >>> >>> [[2]] >>> [1] "@SQ" "SN:1" "LN:249250621" >>> >>> [[3]] >>> [1] "@SQ" "SN:2" "LN:243199373" >>> >>> [[4]]030 >>> [1] "@SQ" "SN:3" "LN:198022430" >>> >>> [[5]] >>> [1] "@SQ" "SN:4" "LN:191154276" >>> >>> [[6]] >>> [1] "@SQ" "SN:5" "LN:180915260" >>> >>> [[7]] >>> [1] "@SQ" "SN:6" "LN:171115067" >>> >>> [[8]] >>> [1] "@SQ" "SN:7" "LN:159138663" >>> >>> [[9]] >>> [1] "@SQ" "SN:8" "LN:146364022" >>> >>> [[10]] >>> [1] "@SQ" "SN:9" "LN:141213431" >>> Warning message: >>> In strsplit(readLines(bamfile, 10), "\t") : >>> input string 1 is invalid in this locale >>> >>> And then from the command line: >>> $ less 03-192B3_lung.sort.nodup.bam.realigned.bam >>> BAM^A<ca>^K^@^@@HD VN:1.4 GO:none SO:coordinate >>> @SQ SN:1 LN:249250621 >>> @SQ SN:2 LN:243199373 >>> @SQ SN:3 LN:198022430 >>> @SQ SN:4 LN:191154276 >>> @SQ SN:5 LN:180915260 >>> @SQ SN:6 LN:171115067 >>> @SQ SN:7 LN:159138663 >>> @SQ SN:8 LN:146364022 >>> @SQ SN:9 LN:141213431 >>> @SQ SN:10 LN:135534747 >>> >>> I'm not sure what that first line is, but I think that is the problem. The bam >> file itself is 18 GB, so it is difficult to edit directly. Is there a way I can direct >> readGAlignments to skip the first line? Or is there a different solution, eg >> filterBam? Ultimately I am going to be sifting through many such BAM files so >> I'm hoping for a general solution that doesn't involve manual editing if >> possible. >>> >>> Thanks, >>> Sean >>> >>>> sessionInfo() >>> R version 3.0.2 (2013-09-25) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> LC_TIME=en_US.UTF-8 >>> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 >> LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 >> LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] latticeExtra_0.6-26 RColorBrewer_1.0-5 ShortRead_1.20.0 >> lattice_0.20-24 >>> [5] Rsamtools_1.14.1 Biostrings_2.30.0 GenomicRanges_1.14.3 >> XVector_0.2.0 >>> [9] IRanges_1.20.5 BiocGenerics_0.8.0 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.21.7 bitops_1.0-6 grid_3.0.2 hwriter_1.3 stats4_3.0.2 >> tools_3.0.2 >>> [7] zlibbioc_1.7.0 >>> >>> >>> Sean Taylor >>> Post-doctoral Fellow >>> Fred Hutchinson Cancer Research Center >>> 206-667-5544 >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> -- >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD REPLY • link 11.8 years ago Martin Morgan 25k

0

Entering edit mode

Thanks Martin. For what its worth, using asBam() did not work: > asBam(bamfile, 'test.bam', indexDestination=TRUE) Error in value[[3L]](cond) : 'asBam' SAM/BAM header missing or empty file: '03-192B3_lung.sort.nodup.bam.realigned.bam' SAM file: '03-192B3_lung.sort.nodup.bam.realigned.bam' In addition: Warning message: In doTryCatch(return(expr), name, parentenv, handler) : [samopen] no @SQ lines in the header. Here is what the file looks like beyond the header: @PG ID:GATK IndelRealigner VN:2.4-9-g532efad CL:knownAlleles=[(RodBinding name=knownAlleles source=/net/shendure/vol7/akash/references/dbsnp_135.b37.vcf)] targetI ntervals=/net/shendure/vol7/akash/ilsa/03-192B3_lung/03-192B3_lung.sor t.nodup.bam.final.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_SW entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=500 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null V^@^@^@^B^@^@^@1^@=C<db>^N^B^@^@^@2^@<8D><ed>~^N^B^@^@^@3^@^^<95><cd>^ K^B^@^@^@4^@d<c8>d^K^B^@^@^@5^@< <8C><c8> ^B^@^@^@6^@;^B3 ^B^@^@^@7^@gC| ^B^@^@^@8^@vV<b9>^H^B^@^@^@9^@<f7><be>^C^@^@^@10^@<9B> ^X^T^H^C^@^@^@11^@4 ^L^H^C ^@^@^@12^@<f7>j<fa>^G^C^@^@^@13^@VZ

^F^C^@^@^@14^@$^Ff^F^C^@^@^@15^ @@<81>^\^F^C^@^@^@16^@A<b4>b^E^C ^@^@^@17^@<ca><f0><d6>^D^C^@^@^@18^@@]<a7>^D^C^@^@^@19^@<97><<86>^C^C^ @^@^@20^@p<b1><c1>^C^C^@^@^@21^@gg<de>^B^C^@^@^@22^@v<d8>^N^C^B^@^@^@X ^@<a0>=A ^B^@^@^@Y^@<fe><f7><89>^C^C^@^@^@MT^@<b9>@^@^@^K^@^@^@ GL000207.1^@<a6>^P^@^@^K^@^@^@GL000226.1^@<a0>:^@^@^K^@^@^@GL000229.1^ @<c9>M^@^@^K^@^@^@GL000231.1 ^@<fa>j^@^@^K^@^@^@GL000210.1^@"l^@^@^K^@^@^@GL000239.1^@ <84>^@^@^K^@^@^@GL000235.1^@<aa><86>^@^@^K^@ ^@^@GL000201.1^@4<8D>^@^@^K^@^@^@GL000247.1^@F<8E>^@^@^K^@^@^@GL000245 .1^@+<8F>^@^@^K^@^@^@GL000197.1^@7<91>^@^@^K^@^@^@GL000203.1^@z<92>^@^ @^K^@^@^@GL000246.1^@ I have contacted my colleague to see if I can get a better sense of what this is supposed to be. > -----Original Message----- > From: Martin Morgan [mailto:mtmorgan at fhcrc.org] > Sent: Tuesday, January 07, 2014 3:09 PM > To: Taylor, Sean D; bioconductor at r-project.org > Subject: Re: [BioC] Problem with BAM header in GAlignments > > On 01/07/2014 02:52 PM, Taylor, Sean D wrote: > > The thought did occur to me that these might be SAM format, but its odd > because the information beyond the header appears to be binary. The files > also came with accompanying .bai index files as well. I have never had cause > to carefully examine the header, so I didn't know what to expect. I take it > thought that the header in a BAM file is also supposed to in binary? > > > > Yes, the entire bam file, header and all, is supposed to be in a compressed > binary representation. > > Martin > > > Sean > >> -----Original Message----- > >> From: Martin Morgan [mailto:mtmorgan at fhcrc.org] > >> Sent: Tuesday, January 07, 2014 2:45 PM > >> To: Taylor, Sean D; bioconductor at r-project.org > >> Subject: Re: [BioC] Problem with BAM header in GAlignments > >> > >> On 01/07/2014 02:31 PM, Taylor, Sean D wrote: > >>> I have received some BAM files from a colleague that were produced > >>> from > >> an Illumina HiSeq instrument. When I try to read them as a > >> GAlignments object, I get the following error: > >>>> bamfile<-'03-192B3_lung.sort.nodup.bam.realigned.bam' > >>>> gal<-readGAlignments(bamfile) > >>> Error in value[[3L]](cond) : > >>> failed to open BamFile: SAM/BAM header missing or empty > >>> file: '03-192B3_lung.sort.nodup.bam.realigned.bam' > >>> In addition: Warning message: > >>> In doTryCatch(return(expr), name, parentenv, handler) : > >>> [bam_header_read] invalid BAM binary header (this is not a BAM file). > >>> > >>> > >>> I looked at the header data this way: > >>>> strsplit(readLines(bamfile, 10), '\t') > >> > >> this (using readLines on bamfile) implies that it is actually a sam > >> (text) file rather than a bam (binary) file, despite the .bam > >> extension! Convert it to bam and optionally generate an index with > >> > >> asBam(bamfile, "some_file_name.bam", indexDestination=TRUE) > >> > >> or, better, get your upstream provider to generate bam files for you. > >> > >> The funky information in the first line looks like the file has been > >> corrupted in some way, too; you'll need to get that fixed first, > >> which probably means coordinating with whomever is providing you with > >> files. Probably you want to make sure that the 'checksums' on the > >> file are the same before and after any copy, e.g., running the linux > >> command md5sum *bam or from within R using tools::md5sum(). > >> > >> Martin > >> > >>> [[1]] > >>> [1] NA > >>> > >>> [[2]] > >>> [1] "@SQ" "SN:1" "LN:249250621" > >>> > >>> [[3]] > >>> [1] "@SQ" "SN:2" "LN:243199373" > >>> > >>> [[4]]030 > >>> [1] "@SQ" "SN:3" "LN:198022430" > >>> > >>> [[5]] > >>> [1] "@SQ" "SN:4" "LN:191154276" > >>> > >>> [[6]] > >>> [1] "@SQ" "SN:5" "LN:180915260" > >>> > >>> [[7]] > >>> [1] "@SQ" "SN:6" "LN:171115067" > >>> > >>> [[8]] > >>> [1] "@SQ" "SN:7" "LN:159138663" > >>> > >>> [[9]] > >>> [1] "@SQ" "SN:8" "LN:146364022" > >>> > >>> [[10]] > >>> [1] "@SQ" "SN:9" "LN:141213431" > >>> Warning message: > >>> In strsplit(readLines(bamfile, 10), "\t") : > >>> input string 1 is invalid in this locale > >>> > >>> And then from the command line: > >>> $ less 03-192B3_lung.sort.nodup.bam.realigned.bam > >>> BAM^A<ca>^K^@^@@HD VN:1.4 GO:none SO:coordinate > >>> @SQ SN:1 LN:249250621 > >>> @SQ SN:2 LN:243199373 > >>> @SQ SN:3 LN:198022430 > >>> @SQ SN:4 LN:191154276 > >>> @SQ SN:5 LN:180915260 > >>> @SQ SN:6 LN:171115067 > >>> @SQ SN:7 LN:159138663 > >>> @SQ SN:8 LN:146364022 > >>> @SQ SN:9 LN:141213431 > >>> @SQ SN:10 LN:135534747 > >>> > >>> I'm not sure what that first line is, but I think that is the > >>> problem. The bam > >> file itself is 18 GB, so it is difficult to edit directly. Is there a > >> way I can direct readGAlignments to skip the first line? Or is there > >> a different solution, eg filterBam? Ultimately I am going to be > >> sifting through many such BAM files so I'm hoping for a general > >> solution that doesn't involve manual editing if possible. > >>> > >>> Thanks, > >>> Sean > >>> > >>>> sessionInfo() > >>> R version 3.0.2 (2013-09-25) > >>> Platform: x86_64-unknown-linux-gnu (64-bit) > >>> > >>> locale: > >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >> LC_TIME=en_US.UTF-8 > >>> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 > >> LC_MESSAGES=en_US.UTF-8 > >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C > >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 > >> LC_IDENTIFICATION=C > >>> > >>> attached base packages: > >>> [1] parallel stats graphics grDevices utils datasets methods base > >>> > >>> other attached packages: > >>> [1] latticeExtra_0.6-26 RColorBrewer_1.0-5 ShortRead_1.20.0 > >> lattice_0.20-24 > >>> [5] Rsamtools_1.14.1 Biostrings_2.30.0 GenomicRanges_1.14.3 > >> XVector_0.2.0 > >>> [9] IRanges_1.20.5 BiocGenerics_0.8.0 > >>> > >>> loaded via a namespace (and not attached): > >>> [1] Biobase_2.21.7 bitops_1.0-6 grid_3.0.2 hwriter_1.3 stats4_3.0.2 > >> tools_3.0.2 > >>> [7] zlibbioc_1.7.0 > >>> > >>> > >>> Sean Taylor > >>> Post-doctoral Fellow > >>> Fred Hutchinson Cancer Research Center > >>> 206-667-5544 > >>> > >>> > >>> [[alternative HTML version deleted]] > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> > >> > >> > >> -- > >> Computational Biology / Fred Hutchinson Cancer Research Center > >> 1100 Fairview Ave. N. > >> PO Box 19024 Seattle, WA 98109 > >> > >> Location: Arnold Building M1 B861 > >> Phone: (206) 667-2793 > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793

ADD REPLY • link 11.8 years ago Taylor, Sean D ▴ 250

0

Entering edit mode

Taylor, Sean D ▴ 250

@taylor-sean-d-5800

Last seen 11.1 years ago

I apologize if this is a repeat, but I'm not sure this went through the first time I tried it. I have received some BAM files from a colleague that were produced from an Illumina HiSeq instrument. When I try to read them as a GAlignments object, I get the following error: > bamfile<-'03-192B3_lung.sort.nodup.bam.realigned.bam' > gal<-readGAlignments(bamfile) Error in value[[3L]](cond) : failed to open BamFile: SAM/BAM header missing or empty file: '03-192B3_lung.sort.nodup.bam.realigned.bam' In addition: Warning message: In doTryCatch(return(expr), name, parentenv, handler) : [bam_header_read] invalid BAM binary header (this is not a BAM file). I looked at the header data this way: > strsplit(readLines(bamfile, 10), '\t') [[1]] [1] NA [[2]] [1] "@SQ" "SN:1" "LN:249250621" [[3]] [1] "@SQ" "SN:2" "LN:243199373" [[4]]030 [1] "@SQ" "SN:3" "LN:198022430" [[5]] [1] "@SQ" "SN:4" "LN:191154276" [[6]] [1] "@SQ" "SN:5" "LN:180915260" [[7]] [1] "@SQ" "SN:6" "LN:171115067" [[8]] [1] "@SQ" "SN:7" "LN:159138663" [[9]] [1] "@SQ" "SN:8" "LN:146364022" [[10]] [1] "@SQ" "SN:9" "LN:141213431" Warning message: In strsplit(readLines(bamfile, 10), "\t") : input string 1 is invalid in this locale And then from the command line: $ less 03-192B3_lung.sort.nodup.bam.realigned.bam BAM^A<ca>^K^@^@@HD VN:1.4 GO:none SO:coordinate @SQ SN:1 LN:249250621 @SQ SN:2 LN:243199373 @SQ SN:3 LN:198022430 @SQ SN:4 LN:191154276 @SQ SN:5 LN:180915260 @SQ SN:6 LN:171115067 @SQ SN:7 LN:159138663 @SQ SN:8 LN:146364022 @SQ SN:9 LN:141213431 @SQ SN:10 LN:135534747 I'm not sure what that first line is, but I think that is the problem. The bam file itself is 18 GB, so it is difficult to edit directly. Is there a way I can direct readGAlignments to skip the first line? Or is there a different solution, eg filterBam? Ultimately I am going to be sifting through many such BAM files so I'm hoping for a general solution that doesn't involve manual editing if possible. Thanks, Sean > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] latticeExtra_0.6-26 RColorBrewer_1.0-5 ShortRead_1.20.0 lattice_0.20-24 [5] Rsamtools_1.14.1 Biostrings_2.30.0 GenomicRanges_1.14.3 XVector_0.2.0 [9] IRanges_1.20.5 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] Biobase_2.21.7 bitops_1.0-6 grid_3.0.2 hwriter_1.3 stats4_3.0.2 tools_3.0.2 [7] zlibbioc_1.7.0 Sean Taylor Post-doctoral Fellow Fred Hutchinson Cancer Research Center 206-667-5544 Sean Taylor Post-doctoral Fellow Fred Hutchinson Cancer Research Center 206-667-5544 [[alternative HTML version deleted]]

ADD COMMENT • link 11.8 years ago Taylor, Sean D ▴ 250

Login before adding your answer.