Question: diffHic-package_PreparePairs-function-error:"missing chromosomes in cut site list"
0
gravatar for rulicosentino
4.5 years ago by
Germany
rulicosentino0 wrote:

Hi, I am trying to learn (and apply) the tools of the package "diffHic" (Differential analysisi of Hi-C data) to analyze  Hi-C datasets and compare them. I am following the user's guide. I am working with a genome that is not available as Biostrings object.

 

To obtain a restriction-fragmented genome object I run:

genome <- "./genome.fasta"

genome.frag <- cutGenome(genome, "AAGCTT", 4)

genome.param <- pairParam(genome.frag)

Until here everything seems to be fine ( when I run "genome.param" i get as output "Genome contains 5450 restriction fragments across 32 chromosomes ..." which seems to be reasonable and the number of chromosomes is right), but when I run the function that matches the mapping location of each read to a restriction fragment in the reference genome(preparePairs), I got the following error:

diagnostics <- preparePairs("output.fixed.bam", genome.param, file="test.h5", dedup=TRUE, minq=10)

Error in preparePairs("output.sorted.bam", genome.param, file = "test.h5",  : missing chromosomes in cut site list

the bam file was obtained in the following way:

bowtie2-build genome.fasta genome # "genome.fasta" is the file with the genome sequence in fasta format
python presplit_map.py -G genome -1 reads_R1.fastq -2 reads_R2.fastq -P 33 --sig "AAGCTAGCTT" -o output.bam # "presplit_map.py" is the python script that is in the diffHic package that does the mapping of Hi-C pair-end reads to the reference genome.

samtools fixmate -O bam output.bam output.fixed.bam

 

I have also tried to follow this pipeline with one genome that is available as Biostrings object ("BSgenome.Celegans.UCSC.ce10"), and a Hi-C dataset available online from this organism, and it worked.. so I think that the problem should be related to the input of my genome as fasta file, but CutGenome function and the storing of the fragments in a pairParam object seems to have worked fine. 

Somebody has an idea of what the problem could be and how could I solve it?

Thanks,

Raúl

 

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by rulicosentino0

Normally, it's good practice to post the sessionInfo(). In this case, if you're using the release version, there's only one version of diffHic so that doesn't really matter.

ADD REPLYlink written 4.5 years ago by Aaron Lun25k
Answer: diffHic-package_PreparePairs-function-error:"missing chromosomes in cut site lis
1
gravatar for Aaron Lun
4.5 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

That's a bit strange. What are the identities of your chromosomes? In particular, what happens when you run:

names(Rsamtools::scanBamHeader("output.fixed.bam")[[1]]$targets)

and then, compare this to:

as.character(runValue(seqnames(genome.frag)))

or:

as.character(runValue(seqnames(genome.param$fragments)))

Your code looks right, so unless cutGenome is dropping chromosomes or bowtie2-build or samtools are adding chromosomes (both of which are unlikely), there's no obvious cause for this.

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Aaron Lun25k
Answer: diffHic-package_PreparePairs-function-error:"missing chromosomes in cut site lis
0
gravatar for rulicosentino
4.5 years ago by
Germany
rulicosentino0 wrote:

Hi Aaron, thank you for your answer!

I ran what you told me, and the names of the chromosomes were different. The original names in the fasta file were something like : 

> chr1 | organism=Homo sapiens | version=2015 | length= 3000000

> chr2 | organism=Homo sapiens | version=2015 | length= 3000000

and it looks like the python script "presplit_map.py" removed most of the sequence names, leaving just the first part until the first pipe("chr1" and "chr2" in the example), but CutGenome is leaving the names as they were. I Modified the names in the fasta file leaving them as they were named in the bam file, and used it as input for CutGenome and that solved the problem. 

Thank you very much for help Aaron! 

ADD COMMENTlink written 4.5 years ago by rulicosentino0
1

Actually, it's probably bowtie2-build that drops everything in the name past the first space when it constructs the index from the FASTQ file. Which makes sense; no point in storing all that extra stuff when you just want the sequence name. I'll add a note in the cutGenome documentation about this.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Aaron Lun25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 222 users visited in the last hour