diffHic preparepairs issue
3
0
Entering edit mode
fine418 • 0
@fine418-8882
Last seen 6.0 years ago

Hi,

I just recently tried using diffHic and everything seemed to be working until I tried loading the bamfile into the hdf5 format.  None of the reads were mapped. see below:

ML275 <- preparePairs("275s.bam", hs.param, file="275.h5", dedup=TRUE)

> ML275
$pairs total marked filtered mapped 81109522 63459873 28886930 0$same.id
dangling self.circle
0           0

$singles [1] 0 I saw a similar post about this and tried the following code thinking that it might be a chromosome name issue, but the chromosome names are the same in both the hs.frag object and the bam file. Thus, I am out of ideas as to why none of the reads have mapped. I'd appreciate any advice you could give. Thanks. > names(Rsamtools::scanBamHeader("275s.bam")[[1]]$targets)
[1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chrM"

> as.character(runValue(seqnames(hs.frag)))
[1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chrM"

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] BSgenome.yeast.mine.saccer3_1.0 BSgenome_1.36.3
[3] rtracklayer_1.28.10             Biostrings_2.36.4
[5] XVector_0.8.0                   diffHic_1.0.1
[7] GenomicRanges_1.20.8            GenomeInfoDb_1.4.3
[9] IRanges_2.2.7                   S4Vectors_0.6.6
[11] BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.30.1    edgeR_3.10.3            zlibbioc_1.14.0
[4] GenomicAlignments_1.4.1 BiocParallel_1.2.21     lattice_0.20-33
[7] tools_3.2.2             grid_3.2.2              Biobase_2.28.0
[10] rhdf5_2.12.0            csaw_1.2.1              DBI_0.3.1
[13] lambda.r_1.1.7          futile.logger_1.4.1     futile.options_1.0.0
[16] bitops_1.0-6            biomaRt_2.24.1          RCurl_1.95-4.7
[19] RSQLite_1.0.0           limma_3.24.15           GenomicFeatures_1.20.5
[22] Rsamtools_1.20.4        XML_3.98-1.3            locfit_1.5-9.1

diffhic • 950 views
0
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 1 hour ago
The city by the bay

There's a lot of duplicate marking and quality filtering happening. All your read pairs are being removed before you even get to the stage of assigning reads to restriction fragments. You haven't set the minq argument, which means that there shouldn't be any MAPQ filtering happening. Any read pairs that are removed as part of pairs\$filtered must have had the unmapped flag set in the BAM file, i.e., 0x4.

I suspect that the problem occurs upstream of preparePairs, during the read alignment and processing stage. What organism are you working on, and what genome are you aligning to? Can you check the number of marked reads and unmapped reads in the original BAM file? You could do something like this:

Rsamtools::countBam("275s.bam", ScanBamParam(flag = scanBamFlag(
isUnmappedQuery=FALSE, hasUnmappedMate=FALSE, isDuplicate=FALSE)))

If this also returns zero, then it's definitely an issue with the BAM file itself. If not, the only other thing I can think of is that all your alignments are hard clipped (see if there's a H in the CIGAR field of the BAM file). preparePairs will then think that they represent the 3' ends of the original reads - however, the function is primarily interested in 5' ends, and will ignore read pairs that don't have both 5' ends mapped.

There's a pipeline for obtaining BAM files from FASTQ files in the doc directory. This might be helpful for figuring out what's going wrong with your processing (assuming that you haven't been using that pipeline already).

0
Entering edit mode
fine418 • 0
@fine418-8882
Last seen 6.0 years ago

Aaron,

I tried using the BAM output of another Hic pipeline that I believe hardclips the reads while performing iterative mapping. It is from the Leonid Mirny Lab at: http://mirnylab.bitbucket.org/hiclib/index.html if you are interested.  I was trying to avoid switching the mapping strategy, but I'll try using the suggested pipeline in the docs.

0
Entering edit mode

They use a iterative approach which performs hard clipping, but I don't recall whether they report these hard clips in the CIGAR strings. You should be able to check this in your BAM file; however, even if it is present, it shouldn't have been a problem, as any hard clipping should occur from the 3' end and be ignored by preparePairs. The only remaining explanation is a lot of reads have the unmapped flag set. What does the countBam command above return?

Edit: Another (somewhat bizarre) possibility is that the BAM file is sorted such that several alignments for one read (i.e., at different truncation lengths in the iterative mapping process) are located together in the order of the BAM file, but are not located alongside the alignments for the mate. This would be considered as a pair with an unmapped read and ignored, but it would not increment the singles count due to the presence of multiple alignments.

0
Entering edit mode

You could be correct, but I am fairly certain that once a read is mapped, it is no longer included in the iterative mapping process.  I should note that the bam file I was using for the diffHic input was made using samtools merge of several smaller bam files containing fragments mapped at different truncation lengths, so one pair read may not be mapped at the same length as its partner read.  Whether this would affect anything, I don't know.  I am not sure what portion of the bam file may be excluded by their pipeline that your pipeline might be parsing, so its probably easiest if I just use your mapping software.

Here was the output of the countBam function:

> test
space start end width       file records nucleotides
1    NA    NA  NA    NA test.bam  865367      865367

0
Entering edit mode

To be precise, once a read is mapped uniquely, it will not be extended and remapped. Also, it doesn't matter to preparePairs if the two reads in the pair are mapped at different lengths.

It occurs to me that the hiclib pipeline does not automatically synchronize mate information, so the countBam function isn't actually returning the number of reads with a mapped mate (otherwise the number of records should be even). My guess is that there are very low numbers of mapped and non-duplicate reads, and that all their mates are unmapped. As such, they get thrown out by preparePairs, as they can't be used to identify interactions.

If that's the case, then switching to the diffHic mapping pipeline won't help much. You might as well give it a go, but keep in mind that the problem might be more fundamental, i.e., at the read mapping step.

0
Entering edit mode
fine418 • 0
@fine418-8882
Last seen 6.0 years ago

Aaron,

I tried using the BAM output of another Hic pipeline that I believe hardclips the reads while performing iterative mapping. It is from the Leonid Mirny Lab at: http://mirnylab.bitbucket.org/hiclib/index.html if you are interested.  I was trying to avoid switching the mapping strategy, but I'll try using the suggested pipeline in the docs.