diffHic presplit_map.py --sig option for Hi-C libraries prepared with the restriction enzyme cocktail from Arima Genomics
shopnil99 • 0
@shopnil99-20966
Last seen 2.1 years ago

Hi,

I prepared some of my Hi-C libraries with the Arima Genomics Hi-C prep kit, which uses a restriction enzyme cocktail. If anyone works with similar libraries, do you know what option I should use for --sig when running presplit_map.py to produce BAM files?

thanks, - Iftekhar

diffhic Arima Genomics presplit_map.py • 349 views
Aaron Lun ★ 27k
@alun
Last seen 1 hour ago
The city by the bay

The mapping scripts inside diffHic don't support multiple restriction enzymes. This is mostly because I haven't added support for them, but also because the split-and-map strategy does not work well when there are multiple short ligation signatures. presplit_map.py identifies the ligation signature (i.e., the sequence formed by ligating two filled-in sticky ends) and splits the read at that signature to create fragments that are mapped separately. If you have several short signatures, reads will get split indiscriminately due to random matches, which unnecessarily shortens the fragments and reduces alignment accuracy.
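To make the splitting step concrete, here is a minimal sketch of cutting a read at a ligation signature. It is not the code in presplit_map.py; the function name `split_at_signature` is hypothetical, and placing the cut in the middle of the signature is an illustrative choice. The HindIII signature is used as the example: HindIII cuts A^AGCTT, so two filled-in ends ligate to AAGCTAGCTT.

```python
def split_at_signature(read, signature):
    """Split a read at the first occurrence of a ligation signature.

    Hypothetical helper for illustration only. Returns (five_prime, three_prime);
    three_prime is None when the signature is absent. The cut is placed in the
    middle of the signature (an assumption), so each segment keeps its own half.
    """
    pos = read.find(signature)
    if pos == -1:
        return read, None
    cut = pos + len(signature) // 2
    return read[:cut], read[cut:]


# HindIII: A^AGCTT, so two filled-in ends ligate to AAGCT + AGCTT.
HINDIII_SIG = "AAGCTAGCTT"

# Made-up chimeric read: 10 bases, the signature, then 10 more bases.
chimeric = "ACGTACGGTC" + HINDIII_SIG + "TTGACCATGA"
five_prime, three_prime = split_at_signature(chimeric, HINDIII_SIG)
print(five_prime)
print(three_prime)
```

With a single 10 bp signature, a match almost always marks a real ligation junction. With several short signatures, the same `find` call would trigger on many chance matches.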

Several people I know have reported success using Nicolas' HiC-Pro pipeline to get from FASTQs to BAM files for Arima data. From a brief inspection of the code, this uses a map-and-split approach where it first aligns the reads to the genome, keeps everything that mapped, and splits the unmapped reads at their ligation signatures for a second round of alignment. In theory, this should be more robust to the presence of multiple short signatures as reads are only split if there was a problem with their initial alignment - in which case, you don't have anything to lose by splitting them and trying again.
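The control flow of map-and-split, as described above, can be sketched with a toy example. This is not HiC-Pro's code: `toy_align` is a hypothetical stand-in for a real aligner that only accepts exact substring matches, and the reference and reads are made up. The point is only that a read is split when, and only when, its first-pass alignment fails.

```python
def toy_align(seq, reference, min_len=8):
    """Hypothetical stand-in for an aligner: exact substring match only.

    Returns the 0-based position in the reference, or None if 'unmapped'.
    """
    if len(seq) < min_len:
        return None
    pos = reference.find(seq)
    return pos if pos != -1 else None


def map_and_split(read, reference, signature):
    """Align the whole read first; split at the signature only if that fails."""
    hit = toy_align(read, reference)
    if hit is not None:
        return [("full", hit)]  # mapped on the first pass, never split

    pos = read.find(signature)
    if pos == -1:
        return []  # unmapped and nothing to split on

    cut = pos + len(signature) // 2  # cut mid-signature (illustrative choice)
    results = []
    for label, segment in (("5prime", read[:cut]), ("3prime", read[cut:])):
        seg_hit = toy_align(segment, reference)
        if seg_hit is not None:
            results.append((label, seg_hit))
    return results


# Made-up reference with two loci, each bordered by a HindIII site (AAGCTT).
REFERENCE = ("GGGGGGGG" + "ACGTACGGTCAAGCTT" + "CCCCCCCCCCCC"
             + "AAGCTTTTGACCATGA" + "GGGGGGGG")
SIGNATURE = "AAGCTAGCTT"

ordinary = "GGACGTACGGTC"                    # contiguous in the reference
chimeric = "ACGTACGGTCAAGCTAGCTTTTGACCATGA"  # two loci joined at the signature

print(map_and_split(ordinary, REFERENCE, SIGNATURE))
print(map_and_split(chimeric, REFERENCE, SIGNATURE))
```

The ordinary read is returned as a single full-length hit and is never inspected for signatures, so a chance signature match inside it costs nothing.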

(Now, the obvious question is "why didn't you do a map-and-split approach in the first place?" This was to avoid alignments being dominated by the longer 3' end of chimeric reads. The 5' fragment of the chimeric read is the informative part about interactions, but if the 3' fragment is long enough and the 5' fragment is relatively short, the read gets mapped according to the 3' fragment, even in end-to-end mode. This results in a loss of information, as you now have a dangling end rather than a valid read pair. The split-and-map approach avoided this problem by splitting first, so that the 5' and 3' ends were never in competition with each other. However, it assumed that the signature rarely occurred in the genome, which is no longer the case if you are cutting at multiple short restriction sites.)
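A rough back-of-the-envelope calculation shows why the "rarely occurs" assumption breaks down. The sketch below computes the expected number of chance signature matches per read, assuming independent, uniformly distributed bases (real genomes are not uniform, so treat the numbers as order-of-magnitude only). The function name is hypothetical, and the cocktail signature set is an assumption for illustration: it is what two enzymes cutting ^GATC and G^ANTC would produce, so check the kit documentation for the real sites.

```python
# Number of bases each symbol can match; N matches any of the four.
IUPAC_MATCHES = {"A": 1, "C": 1, "G": 1, "T": 1, "N": 4}


def expected_random_hits(signatures, read_len):
    """Expected chance matches per read, assuming uniform random bases.

    Hypothetical helper. By linearity of expectation, the result is the sum
    over signatures of (number of windows) * (per-window match probability).
    """
    total = 0.0
    for sig in signatures:
        p = 1.0
        for base in sig:
            p *= IUPAC_MATCHES[base] / 4
        windows = max(read_len - len(sig) + 1, 0)
        total += windows * p
    return total


single = ["AAGCTAGCTT"]  # HindIII, one 10 bp signature
# ASSUMPTION: junctions from a cocktail cutting ^GATC and G^ANTC.
cocktail = ["GATCGATC", "GANTGATC", "GANTANTC", "GATCANTC"]

print(expected_random_hits(single, 100))
print(expected_random_hits(cocktail, 100))
```

Under these assumptions, chance matches in a 100 bp read come out several hundred times more frequent for the cocktail than for the single 10 bp signature, which is the indiscriminate splitting that hurts split-and-map.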

Thank you for the explanation, Aaron. I went looking for the HiC-Pro pipeline you mentioned and also found a mapping pipeline from Arima at https://github.com/ArimaGenomics/mapping_pipeline. This creates a combined BAM file from paired-end reads and marks duplicates using Picard tools. Do you think this output BAM file is suitable to feed into diffHic?

It should be fine as long as the paired reads have the same name. Multiple alignments for segments of chimeric reads are also supported, as long as they have the same name, are hard-clipped, and the 3' segments are marked as secondary alignments. But that's only necessary for calculating some diagnostics; if you don't care about that, then you only need to guarantee that the alignments of the 5' segments are present in the BAM file somewhere.
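If you want to sanity-check a file against those requirements, something like the sketch below could work on SAM text (e.g., the output of `samtools view -h`). It is a hypothetical helper, not part of diffHic. It assumes that non-secondary records are the 5' segments, and it only relies on the standard SAM flag bits (0x40 first in pair, 0x80 second in pair, 0x100 secondary) and the `H` hard-clip operator in the CIGAR string.

```python
from collections import defaultdict

FLAG_READ1 = 0x40       # first read in the pair
FLAG_READ2 = 0x80       # second read in the pair
FLAG_SECONDARY = 0x100  # secondary alignment (here: a 3' segment)


def check_pairs(sam_lines):
    """Report read names that break the requirements described above.

    Hypothetical helper. Returns (missing, unclipped_secondary):
      missing             - names lacking a non-secondary record for read 1 or read 2
      unclipped_secondary - names of secondary records whose CIGAR has no hard clip
    """
    mates_seen = defaultdict(set)
    unclipped_secondary = []
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, cigar = fields[0], int(fields[1]), fields[5]
        mates = mates_seen[name]  # register the name even if only secondary
        if flag & FLAG_SECONDARY:
            if "H" not in cigar:
                unclipped_secondary.append(name)
            continue
        if flag & FLAG_READ1:
            mates.add(1)
        if flag & FLAG_READ2:
            mates.add(2)
    missing = sorted(n for n, m in mates_seen.items() if m != {1, 2})
    return missing, unclipped_secondary


# Made-up records: r1 is a complete pair with one 3' segment; r2 lacks its mate.
example = [
    "@HD\tVN:1.6\tSO:queryname",
    "r1\t65\tchr1\t100\t60\t15M15H\t=\t500\t0\t*\t*",
    "r1\t321\tchr2\t900\t60\t15H15M\t=\t500\t0\t*\t*",
    "r1\t129\tchr1\t500\t60\t30M\t=\t100\t0\t*\t*",
    "r2\t65\tchr1\t200\t60\t30M\t=\t700\t0\t*\t*",
]
print(check_pairs(example))
```

Both lists being empty would suggest that every read name has its two 5' alignments and that any 3' segments are hard-clipped.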