Question: diffHic presplit_map.py --sig option for HiC libraries prepared with restriction enzyme cocktail from Arima genomics
shopnil99 wrote, 4 weeks ago:

Hi,

I prepared some of my Hi-C libraries with the Arima Genomics Hi-C prep kit, which uses a restriction enzyme cocktail. If anyone works with similar libraries, do you know what option I should use for --sig when running presplit_map.py to produce BAM files?

Thanks, Iftekhar

Answer: diffHic presplit_map.py --sig option for HiC libraries prepared with restriction
Aaron Lun (Cambridge, United Kingdom) wrote, 4 weeks ago:

The mapping scripts inside diffHic don't support multiple restriction enzymes. Mostly because I haven't added it, but also because the split-and-map strategy is not very good when there are multiple short ligation signatures. presplit_map.py will identify the ligation signature (i.e., the sequence formed by ligating two filled-in sticky ends) and split the read to create fragments that are separately mapped. If you have several short signatures, reads will get split indiscriminately due to random matches, which unnecessarily reduces alignment accuracy.
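To make the random-match problem concrete, here is a minimal Python sketch of the split step and of how often a signature would turn up by chance. It is not the presplit_map.py code: the function names are hypothetical, cutting at the midpoint of the signature is an assumption, and the chance estimate assumes all four bases are equally likely. The 10-bp sequence AAGCTAGCTT (two filled-in HindIII ends ligated together) is used only as an illustration.

```python
def split_at_signature(read, signature):
    """Split a read at the midpoint of the first ligation signature found.

    Cutting at the midpoint is an assumption made for this sketch.
    Returns the read unchanged (as a one-element list) if there is no match.
    """
    pos = read.find(signature)
    if pos == -1:
        return [read]
    cut = pos + len(signature) // 2
    return [read[:cut], read[cut:]]


def expected_chance_matches(read_length, signature_length):
    """Expected number of purely random signature matches in one read.

    Assumes independent, equally likely bases (a simplification).
    """
    positions = max(read_length - signature_length + 1, 0)
    return positions * 0.25 ** signature_length


# A chimeric read containing the illustrative 10-bp signature.
fragments = split_at_signature("ACGTAAGCTAGCTTGGCA", "AAGCTAGCTT")

# Chance matches per 100-bp read for a 10-bp versus an 8-bp signature.
long_sig = expected_chance_matches(100, 10)
short_sig = expected_chance_matches(100, 8)
```

Under these assumptions, each 2 bp removed from the signature makes a random match about 16 times more likely per position, and every additional signature in a cocktail adds its own chance matches on top.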

Several people I know have reported success using Nicolas' HiC-Pro pipeline to get from FASTQs to BAM files for Arima data. From a brief inspection of the code, this uses a map-and-split approach where it first aligns the reads to the genome, keeps everything that mapped, and splits the unmapped reads at their ligation signatures for a second round of alignment. In theory, this should be more robust to the presence of multiple short signatures as reads are only split if there was a problem with their initial alignment - in which case, you don't have anything to lose by splitting them and trying again.
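The map-and-split idea described above can be sketched in a few lines of Python. This is not HiC-Pro's code: the `align` callable, the toy exact-match aligner, and the midpoint cut are all assumptions made for illustration, and a real aligner would tolerate mismatches and report much more than a position.

```python
def make_toy_aligner(genome):
    """Return a hypothetical aligner: exact substring lookup in a toy genome."""
    def align(sequence):
        pos = genome.find(sequence)
        return pos if pos != -1 else None
    return align


def map_and_split(read, align, signatures):
    """Align the whole read first; split and retry only if that fails.

    Returns a list of alignment positions (empty if nothing mapped).
    """
    hit = align(read)
    if hit is not None:
        return [hit]  # mapped as-is, so the read is never split
    for signature in signatures:
        pos = read.find(signature)
        if pos != -1:
            cut = pos + len(signature) // 2  # midpoint cut is an assumption
            pieces = [read[:cut], read[cut:]]
            hits = [align(piece) for piece in pieces]
            return [h for h in hits if h is not None]
    return []
```

Because the split only happens after the first alignment has failed, a chance occurrence of a short signature inside a read that maps cleanly has no effect.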

(Now, the obvious question is "why didn't you do a map-and-split approach in the first place?" This was to avoid alignments being dominated by the longer 3' end of chimeric reads. The 5' fragment of the chimeric read is the informative part about interactions, but if the 3' fragment is long enough and the 5' fragment is relatively short, the read gets mapped according to the 3' fragment, even in end-to-end mode. This results in a loss of information, as you now have a dangling end rather than a valid read pair. The split-and-map approach avoided this problem by splitting first, so that the 5' and 3' ends were never in competition with each other. However, it assumed that the signature rarely occurred in the genome, which is no longer the case if you are cutting at multiple short restriction sites.)


Thank you for the explanation, Aaron. I went looking for the HiC-Pro pipeline you mentioned and found a mapping pipeline from Arima at https://github.com/ArimaGenomics/mapping_pipeline. This creates a combined BAM file from paired-end reads and marks duplicates using Picard tools. Do you think this output BAM file is good to feed into diffHic?

Reply written 4 weeks ago by shopnil99

It should be fine as long as the paired reads have the same name. Multiple alignments for segments of chimeric reads are also supported, as long as they have the same name, are hard-clipped, and the 3' segments are marked as secondary alignments. But that's only necessary for calculating some diagnostics; if you don't care about that, then you only need to guarantee that the alignments of the 5' segments are present somewhere in the BAM file.

Reply written 4 weeks ago by Aaron Lun
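One way to sanity-check the naming requirement before handing a file to diffHic is to scan its SAM text (for example, the output of `samtools view`) and list read names that lack a primary alignment for either mate. The sketch below is not part of diffHic; the function name is hypothetical, and it relies only on the standard SAM flag bits (0x40 first in pair, 0x80 second in pair, 0x100 secondary, 0x800 supplementary).

```python
from collections import defaultdict

FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def unpaired_read_names(sam_lines):
    """Return sorted read names missing a primary alignment for either mate."""
    mates_seen = defaultdict(set)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        mates = mates_seen[name]  # register the name even if nothing is added
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # only primary records count towards pairing
        if flag & FIRST_IN_PAIR:
            mates.add(1)
        if flag & SECOND_IN_PAIR:
            mates.add(2)
    return sorted(name for name, mates in mates_seen.items() if mates != {1, 2})
```

An empty result means every read name has a primary record for both mates; it says nothing about whether the 3' segments of chimeric reads are flagged the way the diagnostics expect.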