Question

Error while processing BAM file with Picard FixMateInformation

0

Entering edit mode

koustav.pal • 0

@koustavpal-8614

Last seen 6.3 years ago

Hi,

I was using diffHic and the alignment asks for processing by FixMate and MarkDuplicates from PicardTools, to make the reads mate aware. But, even when using VALIDATION STRINGENCY set to LENIENT or SILENT I hit the same error.

Exception in thread "main" htsjdk.samtools.SAMException: Found two records that are paired, not supplementary, and second of the pair
at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.advance(SamPairUtil.java:420)
at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:454)
at htsjdk.samtools.SamPairUtil$SetMateInfoIterator.next(SamPairUtil.java:360)
at picard.sam.FixMateInformation.doWork(FixMateInformation.java:195)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Any idea How I can solve this issue? Maybe you have encountered it on a rare occasion?

Also on an unrelated note, while using the provided presplit_map.py package for splitted alignments, my exit code returns non-zero because when the script tries to remove the filename provided in allnames using os.remove after pysam.merge it is unable to find the particular file. What is this BAM file? and why is it that during some alignment runs it is found and removed while during others it is never created?

OSError: [Errno 2] No such file or directory: './tmpwPxiCc/SRR1460691_1.0000.bam'

diffHic diffhic • 4.7k views

ADD COMMENT • link updated 7.3 years ago by geryk.jan • 0 • written 9.3 years ago by koustav.pal • 0

score 0 · Answer 1 · 2015-11-02

The Picard error suggests that there are two entries from the second read of the pair that are marked as primary alignments. FixMateInformation doesn't know how to handle this, as it assumes that only one entry for each read is labelled as the primary alignment (and thus defined as the mate). However, I can't see how this can be happening, as presplit_map.py should only generate one primary alignment per read (corresponding to the 5' chimeric segment for split reads). Personally, I've never seen this error; I can only suggest that you have a look at your BAM file, and see if you have any cases with multiple primary second read alignments. Depending on how big your file is, perhaps this may be useful for diagnosing the problem:

require(Rsamtools)
bf <- BamFile(input, yieldSize=1e6)
of.interest <- scanBam(bf, param=ScanBamParam(what="qname",
    flag=scanBamFlag(isSecondMateRead=TRUE, isSecondaryAlignment=FALSE)))
x <- table(of.interest[[1]]$qname)
names(x)[x>1] # should be empty, if everything's proper

The offending reads should show up in the final call, corresponding to the read name with multiple primary entries. Note that you'll have to repeat all steps including and after the scanBam call, until you go through the entire BAM file specified in input. This means that you're only loading in 1 million read pairs at a time (see yieldSize), to avoid excessive memory usage.

As for the other problem; I assume that tmpwPxiCc is the temporary directory that is created by presplit_map.py at the start of the analysis. Make sure you're not deleting this directory before presplit_map.py finishes running. Other than that, I can't explain why this particular error would be observed. In fact, the name of the BAM file (with the "0000") suggests that it's an internal temporary file created within SAMtools itself - I assume you haven't supplied presplit_map.py with such a complicated file name. If that's true, then it's an error within the core SAMtools functions of pysam, which is beyond my ability to fix.

The only similar experience I can share is from several years ago, when our server IO was very unstable and large BAM files from my Hi-C analyses kept on being corrupted during disk writes. We eventually got over this, but for a time I was getting bizarre errors that shouldn't have been possible. You may want to consider whether this is a possibility on your machine.

score 0 · Answer 2 · 2017-11-21

0

Entering edit mode

geryk.jan • 0

@gerykjan-14441

Last seen 7.3 years ago

Hi,

I was exposed with the same error as above (Exception in thread "main" htsjdk.samtools.SAMException: Found two records that are paired, not supplementary, and second of the pair).

I found that it is caused by two reads with same names (sequence identifier) in the fastq file - at least in my case. Both reads are aligned to the reference genome producing two primary alignments associated with same read name.

I do not know exactly how it can happen, but it can be resolved by removing reads with non-unique names from fastq files.

ADD COMMENT • link 7.3 years ago geryk.jan • 0

0

Entering edit mode

Having two reads with the same name in the FASTQ file would be very irregular. Illumina sequencers name their reads by the coordinates of the cluster on the flow cell, so having the same read name would imply that you have two different reads from the same location on the flow cell! This is clearly impossible. The only possibility I can think of is if the FASTQ file represents a concatenation of different files, where a read position may not be unique across files - though this is unlikely as the read name should include the run number, lane number, etc. You'll have to do some investigative work to figure out where this comes from.

ADD REPLY • link 7.3 years ago Aaron Lun ★ 28k