Question

ChIPQC failling when run ChIPQC() over 6 samples

0

Entering edit mode

atisou • 0

@atisou-7468

Last seen 8.6 years ago

Switzerland

Hi,

I am trying to run ChIPQC with my 6 samples, with no success so far.

The commands I ran was:

exp <- ChIPQC(metadata, annotation = "mm10")

It returns:

samp1 brain TF 1 bed samp2 brain TF 2 bed samp3 brain TF 1 bed samp4 brain TF 2 bed samp5 brain TF 1 bed samp6 brain TF 2 bed Checking chromosomes: [1] "chr1" Compiling annotation... Computing metrics for 6 samples... Error: 6 errors; first error: Error in sampleQC(bamFile = reads, bedFile = peaks, GeneAnnotation = annotation, : Contigs of interest are not all in Bam file!!

For more information, use bplasterror(). To resume calculation, re-call the function and set the argument 'BPRESUME' to TRUE or wrap the previous call in bpresume().

First traceback: 30: ChIPQC(metadata, annotation = "mm10") 29: bplapply(samplelist, doChIPQCsample, experiment, chromosomes, annotation, mapQCth, blacklist, profileWin, fragmentLength, shifts) 28: bplapply(samplelist, doChIPQCsample, experiment, chromosomes, annotation, mapQCth, blacklist, profileWin, fragmentLength, shifts) 27: bplapply(X, FUN, ..., BPRESUME = BPRESUME, BPPARAM = BPPARAM) 26: bplapply(X, FUN, ..., BPRESUME = BPRESUME, BPPARAM = BPPARAM) 25: mclapply(X = X, FUN = FUN, ..., mc.set.seed = BPPARAM$setSeed, mc.silent = !BPPARAM$verbose, mc.cores = bpworkers(BPPARAM), mc.cl

metadata is the summary description of the experimental design, as a data.frame:

SampleID Tissue Factor Replicate bamReads Peaks samp1 brain TF 1 /path2/samp1.bam /path2/samp1.bed samp2 brain TF 2 /path2/samp2.bam /path2/samp2.bed samp3 brain TF 1 /path2/samp3.bam /path2/samp3.bed samp4 brain TF 2 /path2/samp4.bam /path2/samp4.bed samp5 brain TF 1 /path2/samp5.bam /path2/samp5.bed samp6 brain TF 2 /path2/samp6.bam /path2/samp6.bed

The .bam files (mapping with bowtie2) are of course sorted with samtools sort.

They have been also indexed with samtools index.

The .bed files (peak calling with MACS) are correctly formatted (tab separated, 5-columns as chr, start, end, name, score) and look like this:

chr1 12000000 12000200 MACS_peak_1 115.92

My sessionInfo() is as following:

R version 3.1.3 (2015-03-09) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.2 LTS

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_CH.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_CH.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=de_CH.UTF-8 LC_NAME=de_CH.UTF-8 LC_ADDRESS=de_CH.UTF-8 LC_TELEPHONE=de_CH.UTF-8 LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=de_CH.UTF-8

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] TxDb.Mmusculus.UCSC.mm10.knownGene_3.0.0 GenomicFeatures_1.18.3 AnnotationDbi_1.28.1 Biobase_2.26.0 XLConnect_0.2-11 [6] XLConnectJars_0.2-9 ChIPQC_1.2.2 DiffBind_1.12.3 GenomicAlignments_1.2.2 Rsamtools_1.18.3 [11] Biostrings_2.34.1 XVector_0.6.0 limma_3.22.7 GenomicRanges_1.18.4 GenomeInfoDb_1.2.4 [16] IRanges_2.0.1 S4Vectors_0.4.0 BiocGenerics_0.12.1 ggplot2_1.0.0

loaded via a namespace (and not attached): [1] amap_0.8-14 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.9 BiocParallel_1.0.3 biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6 BSgenome_1.34.1 caTools_1.17.1 [11] checkmate_1.5.1 chipseq_1.16.0 codetools_0.2-10 colorspace_1.2-6 DBI_0.3.1 digest_0.6.8 edgeR_3.8.6 fail_1.2 foreach_1.4.2 gdata_2.13.3 [21] gplots_2.16.0 grid_3.1.3 gtable_0.1.2 gtools_3.4.1 hwriter_1.3.2 iterators_1.0.7 KernSmooth_2.23-14 lattice_0.20-30 latticeExtra_0.6-26 MASS_7.3-39 [31] munsell_0.4.2 Nozzle.R1_1.1-1 plyr_1.8.1 proto_0.3-10 RColorBrewer_1.1-2 Rcpp_0.11.5 RCurl_1.95-4.5 reshape2_1.4.1 rJava_0.9-6 RSQLite_1.0.0 [41] rtracklayer_1.26.2 scales_0.2.4 sendmailR_1.2-1 ShortRead_1.24.0 stringr_0.6.2 tools_3.1.3 XML_3.98-1.1 zlibbioc_1.12.0

Also, when I try to run ChIPQCsample with a single sample instead, it "seems" to work, in the way that there is no error message, the method runs till the end. But the result is quite strange:

samp <- ChIPQCsample(reads = samp1.bam, peaks = samp1.bed, annotation = "mm10")

> samp

ChIPQCsample Number of Mapped reads: 30767939 Number of Mapped reads passing MapQ filter: 24496740 Percentage Of Reads as Non-Duplicates (NRF): 100(0) Percentage Of Reads in Blacklisted Regions: NA SSD: 4.27506515515619 Fragment Length Cross-Coverage: 0.000228443139299127 Relative Cross-Coverage: 0.497333607597129 Percentage Of Reads in GenomicFeature: ProportionOfCounts Peaks 0 LongPromoter20000to2000 0 Promoters2000to500 0 Promoters500 0 All5utrs 0 Alltranscripts 0 Allcds 0 Allintrons 0 All3utrs 0 Percentage Of Reads in Peaks: 0 Number of Peaks: 0 GRanges object with 0 ranges and 2 metadata columns: seqnames ranges strand | Counts bedRangesSummitsTemp <Rle> <IRanges> <Rle> | <integer> <numeric> ------- seqinfo: 22 sequences from an unspecified genome; no seqlengths

Although there are no many peaks (this is normal/expected for that TF in those conditions, ~100-1000 peaks per sample/condition), there are some specific significant peaks. So how come ChIPQCsample returns null values for the different proportions counts, especially for Percentage Of Reads in Peaks (instead of something like 4e-04 etc)? Why does it give null for the number of Peaks (I know there are at least ~100 peaks) just by looking at the bed file returned by MACS?

Thanks a lot for your help in advance,

Kind regards,

H.

chipqc • 5.1k views

ADD COMMENT • link updated 8.4 years ago by Thomas Carroll ▴ 420 • written 10.9 years ago by atisou • 0

score 1 · Answer 1 · 2015-03-16

Hi H,

It looks like perhaps there is a disagreement between chromosome names in BAM and bed files.

From you example bed file

chr1 12000000 12000200 MACS_peak_1 115.92

It looks like you have chromosomes starting with "chr" but from your BAM files chromosome names are 1,2,3, rather than as in bed "chr1","chr2","chr3".

@HD   VN:1.0   SO:coordinate
@SQ   SN:10   LN:130694993
@SQ   SN:11   LN:122082543
@SQ   SN:12   LN:120129022

..

I guess the bed file must have generated from another bam file or the chromosome names corrected.

A quick fix for you would be to rename your bed file chromosome names to remove the "chr" prefix from chromosome names in bedfile and to provide annotation which also contain chromosome names without "chr" extension

You could use internal getAnnotation function in ChIPQC to do this.

> annomm10 <- ChIPQC:::getAnnotation("mm10",NULL)
> annomm10$All3utrs
GRanges object with 27430 ranges and 0 metadata columns:
                      seqnames             ranges strand
                         <Rle>          <IRanges>  <Rle>
      [1]                 chr1 [4841160, 4842827]      +
      [2]                 chr1 [4845017, 4846735]      +
      [3]                 chr1 [4896365, 4897909]      +
      [4]                 chr1 [5098209, 5099777]      +
      [5]                 chr1 [5162166, 5162549]      +
      ...                  ...                ...    ...
  [27426] chrX_GL456233_random   [ 10738,  12475]      -
  [27427] chrX_GL456233_random   [101352, 104589]      -
  [27428]       chrUn_JH584304   [ 41125,  41126]      +
  [27429]       chrUn_JH584304   [ 25369,  25417]      -
  [27430]       chrUn_JH584304   [ 55740,  56946]      -
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

> for(k in 2:length(annomm10)){seqlevelsStyle(annomm10[[k]]) <- "NCBI"}
> annomm10$All3utrs
GRanges object with 27430 ranges and 0 metadata columns:
                      seqnames             ranges strand
                         <Rle>          <IRanges>  <Rle>
      [1]                    1 [4841160, 4842827]      +
      [2]                    1 [4845017, 4846735]      +
      [3]                    1 [4896365, 4897909]      +
      [4]                    1 [5098209, 5099777]      +
      [5]                    1 [5162166, 5162549]      +
      ...                  ...                ...    ...
  [27426] chrX_GL456233_random   [ 10738,  12475]      -
  [27427] chrX_GL456233_random   [101352, 104589]      -
  [27428]       chrUn_JH584304   [ 41125,  41126]      +
  [27429]       chrUn_JH584304   [ 25369,  25417]      -
  [27430]       chrUn_JH584304   [ 55740,  56946]      -
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

Hope that helps,

tom

score 1 · Answer 2 · 2015-03-16

Hi,

Thanks for th outputs.

So I think the cause is the names disagree between contigs in BAM and those in your BED/ TranscriptDB annotation

Since it can't find any of the peak or annotation in chromosomes as defined by your BAM then the peaks and annotation counts will appear empty. i.e. all reads in BAM are on contigs 1 to 19,MT,X,Y but your peaks and mm10 annotation are on contigs chr1 to chr19, chrMT, chrY,chrX.

Thanks,

tom

score 0 · Answer 3 · 2015-03-16

Hi H,

Sorry for the difficulties.

It looks like ChIPQC is having some problems parsing your bam header and then that is leading to problems with counting in peak sets and annotation.

Could you run and show output from

samtools view -H samp1.bam

and also show the messages generated from running

samp <- ChIPQCsample(reads = samp1.bam, peaks = samp1.bed, annotation = "mm10")

?

Would you also be able to provide some example data so i can reproduce the error also? I have tried some mm10 data here, aligned with bwa, and dont see any problems like this.

> res <- ChIPQCsample("Sample_R1-0hDupMarked.bam","testbed.bed",annotation="mm10")

> res
					ChIPQCsample
Number of Mapped reads: 81644198
Number of Mapped reads passing MapQ filter: 72180392
Percentage Of Reads as Non-Duplicates (NRF): 72.28(0.28)
Percentage Of Reads in Blacklisted Regions: NA
SSD: 4.21397957171934
Fragment Length Cross-Coverage: 0.0143694513042692
Relative Cross-Coverage: 1.60978493222971
Percentage Of Reads in GenomicFeature: 
                        ProportionOfCounts
Peaks                         1.385418e-08
LongPromoter20000to2000       2.658479e-01
Promoters2000to500            2.979969e-02
Promoters500                  1.736688e-02
All5utrs                      1.182882e-02
Alltranscripts                5.567095e-01
Allcds                        3.521539e-02
Allintrons                    5.158710e-01
All3utrs                      2.211376e-02
Percentage Of Reads in Peaks: 0
Number of Peaks: 1
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand |    Counts bedRangesSummitsTemp
         <Rle>            <IRanges>  <Rle> | <integer>            <numeric>
  [1]     chr1 [12000000, 12000200]      * |         1             12000127

If you provide an example bam file and bed file of peaks which generates this error, i should be able to trouble shoot from there. I can provide you with an ftp or dropbox link to share the data with me.

Best,

Tom Carroll

thomas.carroll@imperial.ac.uk

score 0 · Answer 4 · 2015-03-16

Hi,

Thanks a lot for your quick answer.

samtools view -H samp1.bam

returns

@HD VN:1.0 SO:coordinate @SQ SN:10 LN:130694993 @SQ SN:11 LN:122082543 @SQ SN:12 LN:120129022 @SQ SN:13 LN:120421639 @SQ SN:14 LN:124902244 @SQ SN:15 LN:104043685 @SQ SN:16 LN:98207768 @SQ SN:17 LN:94987271 @SQ SN:18 LN:90702639 @SQ SN:19 LN:61431566 @SQ SN:1 LN:195471971 @SQ SN:2 LN:182113224 @SQ SN:3 LN:160039680 @SQ SN:4 LN:156508116 @SQ SN:5 LN:151834684 @SQ SN:6 LN:149736546 @SQ SN:7 LN:145441459 @SQ SN:8 LN:129401213 @SQ SN:9 LN:124595110 @SQ SN:MT LN:16299 @SQ SN:X LN:171031299 @SQ SN:Y LN:91744698 @PG ID:bowtie2 PN:bowtie2 VN:2.2.4 CL:"/software/UHTS/Aligner/bowtie2/2.2.4/bin/bowtie2-align-s --wrapper basic-0 --time -p 8 -x genome -S samp1.sam -U samp1.txt"

Regarding some sample data, I will try to make some.

Best,

H.

score 0 · Answer 5 · 2015-03-16

And here is the message output while running:

samp1 <- ChIPQCsample(reads = samp1.bam, peaks = samp1.bed, annotation = "mm10")

Bam file has 22 contigs Calculating coverage histogram for 10

Calculating SSD for 10

Calculating shift for 10

300 / 300 Counting reads in features for 10

Signal over peaks for 10

done Calculating coverage Calculating coverage histogram for 11

Calculating SSD for 11

Calculating shift for 11

300 / 300 Counting reads in features for 11

Signal over peaks for 11

done Calculating coverage Calculating coverage histogram for 12

Calculating SSD for 12

Calculating shift for 12

300 / 300 Counting reads in features for 12

Signal over peaks for 12

done Calculating coverage Calculating coverage histogram for 13

Calculating SSD for 13

Calculating shift for 13

300 / 300 Counting reads in features for 13

Signal over peaks for 13

done Calculating coverage Calculating coverage histogram for 14

Calculating SSD for 14

Calculating shift for 14

300 / 300 Counting reads in features for 14

Signal over peaks for 14

done Calculating coverage Calculating coverage histogram for 15

Calculating SSD for 15

Calculating shift for 15

300 / 300 Counting reads in features for 15

Signal over peaks for 15

done Calculating coverage Calculating coverage histogram for 16

Calculating SSD for 16

Calculating shift for 16

300 / 300 Counting reads in features for 16

Signal over peaks for 16

done Calculating coverage Calculating coverage histogram for 17

Calculating SSD for 17

Calculating shift for 17

300 / 300 Counting reads in features for 17

Signal over peaks for 17

done Calculating coverage Calculating coverage histogram for 18

Calculating SSD for 18

Calculating shift for 18

300 / 300 Counting reads in features for 18

Signal over peaks for 18

done Calculating coverage Calculating coverage histogram for 19

Calculating SSD for 19

Calculating shift for 19

300 / 300 Counting reads in features for 19

Signal over peaks for 19

done Calculating coverage Calculating coverage histogram for 1

Calculating SSD for 1

Calculating shift for 1

300 / 300 Counting reads in features for 1

Signal over peaks for 1

done Calculating coverage Calculating coverage histogram for 2

Calculating SSD for 2

Calculating shift for 2

300 / 300 Counting reads in features for 2

Signal over peaks for 2

done Calculating coverage Calculating coverage histogram for 3

Calculating SSD for 3

Calculating shift for 3

300 / 300 Counting reads in features for 3

Signal over peaks for 3

done Calculating coverage Calculating coverage histogram for 4

Calculating SSD for 4

Calculating shift for 4

300 / 300 Counting reads in features for 4

Signal over peaks for 4

done Calculating coverage Calculating coverage histogram for 5

Calculating SSD for 5

Calculating shift for 5

300 / 300 Counting reads in features for 5

Signal over peaks for 5

done Calculating coverage Calculating coverage histogram for 6

Calculating SSD for 6

Calculating shift for 6

300 / 300 Counting reads in features for 6

Signal over peaks for 6

done Calculating coverage Calculating coverage histogram for 7

Calculating SSD for 7

Calculating shift for 7

300 / 300 Counting reads in features for 7

Signal over peaks for 7

done Calculating coverage Calculating coverage histogram for 8

Calculating SSD for 8

Calculating shift for 8

300 / 300 Counting reads in features for 8

Signal over peaks for 8

done Calculating coverage Calculating coverage histogram for 9

Calculating SSD for 9

Calculating shift for 9

300 / 300 Counting reads in features for 9

Signal over peaks for 9

done Calculating coverage Calculating coverage histogram for MT

Calculating SSD for MT

Calculating shift for MT

300 / 30015 / 300 Counting reads in features for MT

Signal over peaks for MT

done Calculating coverage Calculating coverage histogram for X

Calculating SSD for X

Calculating shift for X

300 / 300 Counting reads in features for X

Signal over peaks for X

done Calculating coverage Calculating coverage histogram for Y

Calculating SSD for Y

Calculating shift for Y

300 / 300 Counting reads in features for Y

Signal over peaks for Y

done Calculating coverage [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1 [1] 1

So as I said, this seems to run smoothly, but the output is not really what expected.

Thanks,

H.

score 0 · Answer 6 · 2015-03-16

0

Entering edit mode

atisou • 0

@atisou-7468

Last seen 8.6 years ago

Switzerland

You are right!!! How did I miss that? Thanks for your eagle eyes dear Tom! I am correcting this straight away and will let you know about the outcome.

H.

ADD COMMENT • link 10.9 years ago atisou • 0

0

Entering edit mode

No problem,

Hopefully that will solve the issue.

Let me know if you have any issues running with ChIPQC() or ChIPQCsample() once the names are corrected

best,

tom

ADD REPLY • link 10.9 years ago Thomas Carroll ▴ 420

score 0 · Answer 7 · 2015-03-16

0

Entering edit mode

atisou • 0

@atisou-7468

Last seen 8.6 years ago

Switzerland

It all works perfectly now, thanks a lot for the rapid and efficient troubleshooting.

H.

ADD COMMENT • link 10.9 years ago atisou • 0

score 0 · Answer 8 · 2017-08-04

Hi Tom,

I also had a similar error but my BAM file labels DO match my Bed file labels. I still get the following error:

Bam file has 639 contigs

Error: BiocParallel errors

element index: 1, 2, 3, 4, 5, 6, ...

first error: Contigs of interest are not all in Bam file!!

I found that this is because my chr is labeled as chr 1991, 1992, 2991, 2992, and the program cannot handle this sort of skipping.

Current input

# This works

experiment_test= ChIPQC(test, chromosomes=paste0("chr",1991:1992), annotation = NULL, consensus=TRUE, bCount=TRUE, mapQCth=15, blacklist=NULL,minOverlap=1, bParallel=TRUE,summits=100)

# This doesn't work

experiment_test= ChIPQC(test, chromosomes=paste0("chr",1991:2992), annotation = NULL, consensus=TRUE, bCount=TRUE, mapQCth=15, blacklist=NULL,minOverlap=1, bParallel=TRUE,summits=100)

What do you recommend I do?

Thanks!

score 0 · Answer 9 · 2017-08-22

hi,

Sorry for delay, moved to new post/continent recently.

I think this maybe that ChIPQC is looking for all chromosome between chr1991 and chr2992, so from your description this will include chromosomes not in your BAM file.

> paste0("chr",1991:2992)
[1] "chr1991" "chr1992" "chr1993" "chr1994" "chr1995" "chr1996" "chr1997" "chr1998" "chr1999" "chr2000"
[11] "chr2001" ............

I think using paste0("chr",c(1991, 1992, 2991, 2992)) would include all the chromosomes you have in your samples.

so..

>experiment_test= ChIPQC(test, chromosomes=paste0("chr",c(1991, 1992, 2991, 2992)), annotation = NULL, consensus=TRUE, bCount=TRUE, mapQCth=15, blacklist=NULL,minOverlap=1, bParallel=TRUE,summits=100)

Let me know if that isn't the problem

best,

tom