4 weeks ago by
Cambridge, United Kingdom
The mapping scripts inside diffHic don't support multiple restriction enzymes. Mostly because I haven't added it, but also because the split-and-map strategy is not very good when there are multiple short ligation signatures.
presplit_map.py will identify the ligation signature (i.e., the sequence formed by ligating two filled-in sticky ends) and split the read to create fragments that are separately mapped. If you have several short signatures, reads will get split indiscriminately due to random matches, which unnecessarily reduces alignment accuracy.
Several people I know have reported success using Nicolas' HiC-Pro pipeline to get from FASTQs to BAM files for Arima data. From a brief inspection of the code, this uses a map-and-split approach where it first aligns the reads to the genome, keeps everything that mapped, and splits the unmapped reads at their ligation signatures for a second round of alignment. In theory, this should be more robust to the presence of multiple short signatures as reads are only split if there was a problem with their initial alignment - in which case, you don't have anything to lose by splitting them and trying again.
(Now, the obvious question is "why didn't you do a map-and-split approach in the first place?" This was to avoid alignments being dominated by the longer 3' end of chimeric reads. The 5' fragment of the chimeric read is the informative part about interactions, but if the 3' fragment is long enough, the read gets mapped according to the 3' fragment - even in end-to-end-mode, if the 5' fragment is relatively short. This results in loss of information as you now have a dangling end rather than a valid read pair. The "split-and-map" avoided this problem by splitting first so that the 5' and 3' ends were never in competition with each other. However, it assumed that the signature rarely occurred in the genome, which is no longer the case if you are cutting at multiple short restriction sites.)