Question

diffHic read loss when loading BAM to HDF

koustav.pal • 0

@koustavpal-8614

Hi,

I'll describe my problem from the very beginning. I have a cluster at my disposal, so for alignment I split my FASTQ files into chunks of 1M reads and aligned all of them on the cluster with presplit_map.py from diffHic. Later, while merging the chunks for the MarkDuplicates step, I found that the mate information was not in sync. After a conversation with the author, I ran all the BAM chunks through Picard's FixMateInformation, followed by Picard's MergeSamFiles without sorting enabled.

Because the merged file was unsorted, I had to run SortSam as well, which produced the error "Mapped mate should have reference name", the very error that FixMateInformation was supposed to fix. So I went back and ran MergeSamFiles with sorting enabled, which worked fine. I then ran MarkDuplicates, whose metrics file reported approximately 120M mapped read pairs after accounting for the 30M read pairs it considered duplicates.

After running preparePairs(), I lost 99% of my reads as singletons. My reasoning was:

1. FixMateInformation was unable to fix the original problem, and MarkDuplicates did not validate anything beyond the flags, which told it that the read pairs had mapped properly. If I run ValidateSamFile on the duplicate-marked BAM file, I still get errors.

2. My approach of splitting the FASTQ files into chunks and aligning the chunks might not have worked properly. To check, I wrote an awk script that reorders one mate file to follow the order of the other, and then ran diff between the reordered mate file and the original. No differences were found, meaning the reads are in the same order in both mate files. So the alignment was fine.
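The order check described in point 2 can also be sketched in Python. This is only an illustration, not the awk script used above: the function names `read_names` and `first_mismatch` are hypothetical, and it assumes plain 4-line FASTQ records whose mate names are identical apart from an optional `/1` or `/2` suffix.

```python
from itertools import zip_longest


def read_names(fastq_lines):
    """Yield the read name from each 4-line FASTQ record (assumed layout)."""
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:
            continue  # only the first line of each record carries the name
        fields = line.strip()[1:].split()  # drop '@' and any trailing comment
        if not fields:
            continue  # tolerate a blank trailing line
        name = fields[0]
        if name.endswith(("/1", "/2")):
            name = name[:-2]  # strip the mate suffix, if present
        yield name


def first_mismatch(fastq1_lines, fastq2_lines):
    """Return the 0-based index of the first record whose names differ, else None."""
    pairs = zip_longest(read_names(fastq1_lines), read_names(fastq2_lines))
    for idx, (name1, name2) in enumerate(pairs):
        if name1 != name2:  # also catches one file ending before the other
            return idx
    return None
```

Both arguments can be any iterable of lines, for example two open file handles, so the files are streamed rather than loaded into memory.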

It is not plausible that so many reads lack a mapped mate: when I instead aligned my files with TADbit's alignment feature, around 800K reads mapped per chunk on average. So I do not know what is going wrong here. Does anyone have any ideas?

diffhic • 1.7k views

updated 9.6 years ago by Aaron Lun ★ 28k • written 9.6 years ago by koustav.pal • 0

score 0 · Answer 1 · 2015-08-13

The preparePairs function expects a name-sorted BAM file as input, such that all paired reads (or segments thereof) are grouped together. So, you need to name-sort the BAM file after running MarkDuplicates. This is also described in the pipeline provided with the package:

system.file("doc", "sra2bam.sh", package="diffHic")

You can base your pipeline around this, with a bit of care for the chunking bits. Without name-sorting of the final file, each read is isolated from its mate in the BAM file and is considered an effective singleton during the scanBam looping in preparePairs. There's no global name-matching, for efficiency.
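To make the effect of the sort order concrete, here is a toy Python model of pairing by adjacency only. It is a sketch, not diffHic's implementation: `count_adjacent_pairs` and the read names are hypothetical, and it ignores chimeric segments and secondary alignments.

```python
def count_adjacent_pairs(read_names):
    """Count pairs and singletons when only neighbouring records are compared."""
    pairs = singletons = 0
    i = 0
    while i < len(read_names):
        has_next = i + 1 < len(read_names)
        if has_next and read_names[i] == read_names[i + 1]:
            pairs += 1       # mate sits directly after the read
            i += 2
        else:
            singletons += 1  # no global lookup, so the read counts as unpaired
            i += 1
    return pairs, singletons


# Hypothetical coordinate-sorted order: mates of the same pair are far apart.
coordinate_sorted = ["r1", "r2", "r3", "r1", "r3", "r2"]
# Name sorting brings the two mates of each pair next to each other.
name_sorted = sorted(coordinate_sorted)

print(count_adjacent_pairs(coordinate_sorted))  # (0, 6)
print(count_adjacent_pairs(name_sorted))        # (3, 0)
```

The same six records go from all singletons to all pairs once they are ordered by name. On real BAM files the name sort itself would be done with a dedicated tool, for example `samtools sort -n` or Picard SortSam with `SORT_ORDER=queryname`.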

P.S. I suspect FixMateInformation may be having problems with secondary alignments, which is why trying to validate the final BAM file will raise errors. However, that shouldn't affect preparePairs (as no mate fields are actually used here) or MarkDuplicates (as only primary alignments are used).

P.P.S. As long as your chunking keeps pairs of reads together, you should be fine.
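One way to keep pairs together when chunking is to split both mate files in lockstep, as in the Python sketch below. It is an assumption-laden illustration rather than part of diffHic: `paired_chunks` is a hypothetical helper, and it assumes 4-line FASTQ records with the two mate files already in the same read order.

```python
from itertools import islice


def paired_chunks(fastq1_lines, fastq2_lines, reads_per_chunk):
    """Yield (chunk1, chunk2) lists of FASTQ lines covering the same reads."""
    it1, it2 = iter(fastq1_lines), iter(fastq2_lines)
    lines_per_chunk = 4 * reads_per_chunk  # 4 lines per FASTQ record (assumed)
    while True:
        chunk1 = list(islice(it1, lines_per_chunk))
        chunk2 = list(islice(it2, lines_per_chunk))
        if not chunk1 and not chunk2:
            return  # both files exhausted at the same point
        if len(chunk1) != len(chunk2):
            raise ValueError("mate files contain different numbers of reads")
        yield chunk1, chunk2
```

Because each chunk boundary falls at the same read in both files, chunk k of the first mate file always holds the mates of chunk k of the second.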