Question

diffHic-package_PreparePairs-function-error:"missing chromosomes in cut site list"

0

Entering edit mode

rulicosentino • 0

@rulicosentino-8154

Last seen 8.4 years ago

Germany

Hi, I am trying to learn (and apply) the tools of the package "diffHic" (Differential analysisi of Hi-C data) to analyze Hi-C datasets and compare them. I am following the user's guide. I am working with a genome that is not available as Biostrings object.

To obtain a restriction-fragmented genome object I run:

genome <- "./genome.fasta"

genome.frag <- cutGenome(genome, "AAGCTT", 4)

genome.param <- pairParam(genome.frag)

Until here everything seems to be fine ( when I run "genome.param" i get as output "Genome contains 5450 restriction fragments across 32 chromosomes ..." which seems to be reasonable and the number of chromosomes is right), but when I run the function that matches the mapping location of each read to a restriction fragment in the reference genome(preparePairs), I got the following error:

diagnostics <- preparePairs("output.fixed.bam", genome.param, file="test.h5", dedup=TRUE, minq=10)

Error in preparePairs("output.sorted.bam", genome.param, file = "test.h5", : missing chromosomes in cut site list

the bam file was obtained in the following way:

bowtie2-build genome.fasta genome # "genome.fasta" is the file with the genome sequence in fasta format

python presplit_map.py -G genome -1 reads_R1.fastq -2 reads_R2.fastq -P 33 --sig "AAGCTAGCTT" -o output.bam # "presplit_map.py" is the python script that is in the diffHic package that does the mapping of Hi-C pair-end reads to the reference genome.

samtools fixmate -O bam output.bam output.fixed.bam

I have also tried to follow this pipeline with one genome that is available as Biostrings object ("BSgenome.Celegans.UCSC.ce10"), and a Hi-C dataset available online from this organism, and it worked.. so I think that the problem should be related to the input of my genome as fasta file, but CutGenome function and the storing of the fragments in a pairParam object seems to have worked fine.

Somebody has an idea of what the problem could be and how could I solve it?

Thanks,

Raúl

diffHic PreparePairs Hi-c cutGenome • 1.7k views

ADD COMMENT • link 8.9 years ago • updated 8.8 years ago rulicosentino • 0

0

Entering edit mode

Normally, it's good practice to post the sessionInfo(). In this case, if you're using the release version, there's only one version of diffHic so that doesn't really matter.

ADD REPLY • link 8.9 years ago Aaron Lun ★ 28k

score 1 · Answer 1 · 2015-06-12

That's a bit strange. What are the identities of your chromosomes? In particular, what happens when you run:

names(Rsamtools::scanBamHeader("output.fixed.bam")[[1]]$targets)

and then, compare this to:

as.character(runValue(seqnames(genome.frag)))

or:

as.character(runValue(seqnames(genome.param$fragments)))

Your code looks right, so unless cutGenome is dropping chromosomes or bowtie2-build or samtools are adding chromosomes (both of which are unlikely), there's no obvious cause for this.

score 0 · Answer 2 · 2015-06-26

0

Entering edit mode

rulicosentino • 0

@rulicosentino-8154

Last seen 8.4 years ago

Germany

Hi Aaron, thank you for your answer!

I ran what you told me, and the names of the chromosomes were different. The original names in the fasta file were something like :

> chr1 | organism=Homo sapiens | version=2015 | length= 3000000

> chr2 | organism=Homo sapiens | version=2015 | length= 3000000

and it looks like the python script "presplit_map.py" removed most of the sequence names, leaving just the first part until the first pipe("chr1" and "chr2" in the example), but CutGenome is leaving the names as they were. I Modified the names in the fasta file leaving them as they were named in the bam file, and used it as input for CutGenome and that solved the problem.

Thank you very much for help Aaron!

ADD COMMENT • link 8.8 years ago rulicosentino • 0

1

Entering edit mode

Actually, it's probably bowtie2-build that drops everything in the name past the first space when it constructs the index from the FASTQ file. Which makes sense; no point in storing all that extra stuff when you just want the sequence name. I'll add a note in the cutGenome documentation about this.

ADD REPLY • link 8.8 years ago Aaron Lun ★ 28k