ChIPQC failling when run ChIPQC() over 6 samples
9
0
Entering edit mode
atisou • 0
@atisou-7468
Last seen 7.5 years ago
Switzerland

Hi,

I am trying to run ChIPQC with my 6 samples, with no success so far.

The commands I ran was:

exp <- ChIPQC(metadata, annotation = "mm10")

It returns:

samp1 brain TF   1 bed
samp2 brain TF   2 bed
samp3 brain TF   1 bed
samp4 brain TF   2 bed
samp5 brain TF   1 bed
samp6 brain TF   2 bed
Checking chromosomes:
[1] "chr1"
Compiling annotation...
Computing metrics for 6 samples...
Error: 6 errors; first error:
  Error in sampleQC(bamFile = reads, bedFile = peaks, GeneAnnotation = annotation, : Contigs of interest are not all in Bam file!!

For more information, use bplasterror(). To resume calculation, re-call the function and set the argument 'BPRESUME' to TRUE or wrap the previous call in bpresume().

First traceback:
  30: ChIPQC(metadata, annotation = "mm10")
  29: bplapply(samplelist, doChIPQCsample, experiment, chromosomes,
          annotation, mapQCth, blacklist, profileWin, fragmentLength,
          shifts)
  28: bplapply(samplelist, doChIPQCsample, experiment, chromosomes,
          annotation, mapQCth, blacklist, profileWin, fragmentLength,
          shifts)
  27: bplapply(X, FUN, ..., BPRESUME = BPRESUME, BPPARAM = BPPARAM)
  26: bplapply(X, FUN, ..., BPRESUME = BPRESUME, BPPARAM = BPPARAM)
  25: mclapply(X = X, FUN = FUN, ..., mc.set.seed = BPPARAM$setSeed,
          mc.silent = !BPPARAM$verbose, mc.cores = bpworkers(BPPARAM),
          mc.cl

metadata is the summary description of the experimental design, as a data.frame:

SampleID    Tissue    Factor    Replicate    bamReads    Peaks
samp1    brain    TF    1    /path2/samp1.bam    /path2/samp1.bed
samp2    brain    TF    2    /path2/samp2.bam    /path2/samp2.bed
samp3    brain    TF    1    /path2/samp3.bam    /path2/samp3.bed
samp4    brain    TF    2    /path2/samp4.bam    /path2/samp4.bed
samp5    brain    TF    1    /path2/samp5.bam    /path2/samp5.bed
samp6    brain    TF    2    /path2/samp6.bam    /path2/samp6.bed

The .bam files (mapping with bowtie2) are of course sorted with samtools sort.

They have been also indexed with samtools index.

The .bed files (peak calling with MACS) are correctly formatted (tab separated, 5-columns as chr, start, end, name, score) and look like this:

chr1    12000000    12000200    MACS_peak_1    115.92

My sessionInfo() is as following:

R version 3.1.3 (2015-03-09)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=de_CH.UTF-8           LC_COLLATE=en_US.UTF-8        LC_MONETARY=de_CH.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=de_CH.UTF-8          LC_NAME=de_CH.UTF-8           LC_ADDRESS=de_CH.UTF-8        LC_TELEPHONE=de_CH.UTF-8      LC_MEASUREMENT=de_CH.UTF-8    LC_IDENTIFICATION=de_CH.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Mmusculus.UCSC.mm10.knownGene_3.0.0 GenomicFeatures_1.18.3                   AnnotationDbi_1.28.1                     Biobase_2.26.0                           XLConnect_0.2-11                        
 [6] XLConnectJars_0.2-9                      ChIPQC_1.2.2                             DiffBind_1.12.3                          GenomicAlignments_1.2.2                  Rsamtools_1.18.3                        
[11] Biostrings_2.34.1                        XVector_0.6.0                            limma_3.22.7                             GenomicRanges_1.18.4                     GenomeInfoDb_1.2.4                      
[16] IRanges_2.0.1                            S4Vectors_0.4.0                          BiocGenerics_0.12.1                      ggplot2_1.0.0                           

loaded via a namespace (and not attached):
 [1] amap_0.8-14         base64enc_0.1-2     BatchJobs_1.5       BBmisc_1.9          BiocParallel_1.0.3  biomaRt_2.22.0      bitops_1.0-6        brew_1.0-6          BSgenome_1.34.1     caTools_1.17.1     
[11] checkmate_1.5.1     chipseq_1.16.0      codetools_0.2-10    colorspace_1.2-6    DBI_0.3.1           digest_0.6.8        edgeR_3.8.6         fail_1.2            foreach_1.4.2       gdata_2.13.3       
[21] gplots_2.16.0       grid_3.1.3          gtable_0.1.2        gtools_3.4.1        hwriter_1.3.2       iterators_1.0.7     KernSmooth_2.23-14  lattice_0.20-30     latticeExtra_0.6-26 MASS_7.3-39        
[31] munsell_0.4.2       Nozzle.R1_1.1-1     plyr_1.8.1          proto_0.3-10        RColorBrewer_1.1-2  Rcpp_0.11.5         RCurl_1.95-4.5      reshape2_1.4.1      rJava_0.9-6         RSQLite_1.0.0      
[41] rtracklayer_1.26.2  scales_0.2.4        sendmailR_1.2-1     ShortRead_1.24.0    stringr_0.6.2       tools_3.1.3         XML_3.98-1.1        zlibbioc_1.12.0 

 

Also, when I try to run ChIPQCsample with a single sample instead, it "seems" to work, in the way that there is no error message, the method runs till the end. But the result is quite strange:

samp <- ChIPQCsample(reads = samp1.bam, peaks = samp1.bed, annotation = "mm10")

> samp

ChIPQCsample
Number of Mapped reads: 30767939
Number of Mapped reads passing MapQ filter: 24496740
Percentage Of Reads as Non-Duplicates (NRF): 100(0)
Percentage Of Reads in Blacklisted Regions: NA
SSD: 4.27506515515619
Fragment Length Cross-Coverage: 0.000228443139299127
Relative Cross-Coverage: 0.497333607597129
Percentage Of Reads in GenomicFeature:
                        ProportionOfCounts
Peaks                                    0
LongPromoter20000to2000                  0
Promoters2000to500                       0
Promoters500                             0
All5utrs                                 0
Alltranscripts                           0
Allcds                                   0
Allintrons                               0
All3utrs                                 0
Percentage Of Reads in Peaks: 0
Number of Peaks: 0
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |    Counts bedRangesSummitsTemp
      <Rle> <IRanges>  <Rle> | <integer>            <numeric>
  -------
  seqinfo: 22 sequences from an unspecified genome; no seqlengths

Although there are no many peaks (this is normal/expected for that TF in those conditions, ~100-1000 peaks per sample/condition), there are some specific significant peaks. So how come ChIPQCsample returns null values for the different proportions counts, especially for Percentage Of Reads in Peaks (instead of something like 4e-04 etc)? Why does it give null for the number of Peaks (I know there are at least ~100 peaks) just by looking at the bed file returned by MACS?

Thanks a lot for your help in advance,

 

Kind regards,

 

H.

 

 

 

 

 

 

 

 

 

 

 

 

 

chipqc • 4.2k views
ADD COMMENT
1
Entering edit mode
@thomas-carroll-7019
Last seen 2.0 years ago
United States/New York/The Rockefeller …

Hi H,

It looks like perhaps there is a disagreement between chromosome names in BAM and bed files.

From you example bed file

chr1    12000000    12000200    MACS_peak_1    115.92

It looks like you have chromosomes starting with "chr" but from your BAM files chromosome names are 1,2,3, rather than as in bed "chr1","chr2","chr3".

@HD    VN:1.0    SO:coordinate
@SQ    SN:10    LN:130694993
@SQ    SN:11    LN:122082543
@SQ    SN:12    LN:120129022

..

 

 

I guess the bed file must have generated from another bam file or the chromosome names corrected.

A quick fix for you would be to rename your bed file chromosome names to remove the "chr" prefix from chromosome names in bedfile and to provide annotation which also contain chromosome names without "chr" extension

You could use internal getAnnotation function in ChIPQC to do this.

> annomm10 <- ChIPQC:::getAnnotation("mm10",NULL)
> annomm10$All3utrs
GRanges object with 27430 ranges and 0 metadata columns:
                      seqnames             ranges strand
                         <Rle>          <IRanges>  <Rle>
      [1]                 chr1 [4841160, 4842827]      +
      [2]                 chr1 [4845017, 4846735]      +
      [3]                 chr1 [4896365, 4897909]      +
      [4]                 chr1 [5098209, 5099777]      +
      [5]                 chr1 [5162166, 5162549]      +
      ...                  ...                ...    ...
  [27426] chrX_GL456233_random   [ 10738,  12475]      -
  [27427] chrX_GL456233_random   [101352, 104589]      -
  [27428]       chrUn_JH584304   [ 41125,  41126]      +
  [27429]       chrUn_JH584304   [ 25369,  25417]      -
  [27430]       chrUn_JH584304   [ 55740,  56946]      -
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome
> for(k in 2:length(annomm10)){seqlevelsStyle(annomm10[[k]]) <- "NCBI"}
> annomm10$All3utrs
GRanges object with 27430 ranges and 0 metadata columns:
                      seqnames             ranges strand
                         <Rle>          <IRanges>  <Rle>
      [1]                    1 [4841160, 4842827]      +
      [2]                    1 [4845017, 4846735]      +
      [3]                    1 [4896365, 4897909]      +
      [4]                    1 [5098209, 5099777]      +
      [5]                    1 [5162166, 5162549]      +
      ...                  ...                ...    ...
  [27426] chrX_GL456233_random   [ 10738,  12475]      -
  [27427] chrX_GL456233_random   [101352, 104589]      -
  [27428]       chrUn_JH584304   [ 41125,  41126]      +
  [27429]       chrUn_JH584304   [ 25369,  25417]      -
  [27430]       chrUn_JH584304   [ 55740,  56946]      -
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome
 

Hope that helps,

 

tom

 

 

 

 

 

ADD COMMENT
1
Entering edit mode

For what it's worth a Bioconductor way to look at the names of the 'chromosomes' in BAM files is

seqlevels(Rsamtools::BamFile("my.bam"))

seqinfo() provides the seqlengths as well; these functions work on lots of GRanges-related objects and (e.g., rtracklayer) file types that have sequence information.

ADD REPLY
0
Entering edit mode

Ah, thanks Martin,

Always good to show the Bioconductor way

best,

tom

ADD REPLY
1
Entering edit mode
@thomas-carroll-7019
Last seen 2.0 years ago
United States/New York/The Rockefeller …

Hi,

Thanks for th outputs.

So I think the cause is  the names disagree between contigs in BAM and those in your BED/ TranscriptDB annotation

Since it can't find any of the peak or annotation in chromosomes as defined by your BAM then the peaks and annotation counts will appear empty. i.e. all reads in BAM are on contigs 1 to 19,MT,X,Y but your peaks and mm10 annotation are on contigs chr1 to chr19, chrMT, chrY,chrX.

 

Thanks,

tom

 

 

ADD COMMENT
0
Entering edit mode
@thomas-carroll-7019
Last seen 2.0 years ago
United States/New York/The Rockefeller …

Hi H,

 

Sorry for the difficulties.

It looks like ChIPQC is having some problems parsing your bam header and then that is leading to problems with counting in peak sets and annotation.

 

Could you run and show output from 

samtools view -H samp1.bam 

and also show the messages generated from running 

samp <- ChIPQCsample(reads = samp1.bam, peaks = samp1.bed, annotation = "mm10")

?

Would you also be able to provide some example data so i can reproduce the error also? I have tried some mm10 data here, aligned with bwa, and dont see any problems like this.

> res <- ChIPQCsample("Sample_R1-0hDupMarked.bam","testbed.bed",annotation="mm10")
> res
					ChIPQCsample
Number of Mapped reads: 81644198
Number of Mapped reads passing MapQ filter: 72180392
Percentage Of Reads as Non-Duplicates (NRF): 72.28(0.28)
Percentage Of Reads in Blacklisted Regions: NA
SSD: 4.21397957171934
Fragment Length Cross-Coverage: 0.0143694513042692
Relative Cross-Coverage: 1.60978493222971
Percentage Of Reads in GenomicFeature: 
                        ProportionOfCounts
Peaks                         1.385418e-08
LongPromoter20000to2000       2.658479e-01
Promoters2000to500            2.979969e-02
Promoters500                  1.736688e-02
All5utrs                      1.182882e-02
Alltranscripts                5.567095e-01
Allcds                        3.521539e-02
Allintrons                    5.158710e-01
All3utrs                      2.211376e-02
Percentage Of Reads in Peaks: 0
Number of Peaks: 1
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand |    Counts bedRangesSummitsTemp
         <Rle>            <IRanges>  <Rle> | <integer>            <numeric>
  [1]     chr1 [12000000, 12000200]      * |         1             12000127

If you provide an example bam file and bed file of peaks which generates this error, i should be able to trouble shoot from there. I can provide you with an ftp or dropbox link to share the data with me.

Best,

Tom Carroll

thomas.carroll@imperial.ac.uk

 

 

 

 

ADD COMMENT
0
Entering edit mode
atisou • 0
@atisou-7468
Last seen 7.5 years ago
Switzerland

Hi,

 

Thanks a lot for your quick answer.

samtools view -H samp1.bam

returns

@HD    VN:1.0    SO:coordinate
@SQ    SN:10    LN:130694993
@SQ    SN:11    LN:122082543
@SQ    SN:12    LN:120129022
@SQ    SN:13    LN:120421639
@SQ    SN:14    LN:124902244
@SQ    SN:15    LN:104043685
@SQ    SN:16    LN:98207768
@SQ    SN:17    LN:94987271
@SQ    SN:18    LN:90702639
@SQ    SN:19    LN:61431566
@SQ    SN:1    LN:195471971
@SQ    SN:2    LN:182113224
@SQ    SN:3    LN:160039680
@SQ    SN:4    LN:156508116
@SQ    SN:5    LN:151834684
@SQ    SN:6    LN:149736546
@SQ    SN:7    LN:145441459
@SQ    SN:8    LN:129401213
@SQ    SN:9    LN:124595110
@SQ    SN:MT    LN:16299
@SQ    SN:X    LN:171031299
@SQ    SN:Y    LN:91744698
@PG    ID:bowtie2    PN:bowtie2    VN:2.2.4    CL:"/software/UHTS/Aligner/bowtie2/2.2.4/bin/bowtie2-align-s --wrapper basic-0 --time -p 8 -x genome -S samp1.sam -U samp1.txt"

Regarding some sample data, I will try to make some.

 

Best,

 

H.

 

 

 

 

 

ADD COMMENT
0
Entering edit mode
atisou • 0
@atisou-7468
Last seen 7.5 years ago
Switzerland

And here is the message output while running:

samp1 <- ChIPQCsample(reads = samp1.bam, peaks = samp1.bed, annotation = "mm10")

Bam file has 22 contigs
Calculating coverage histogram for 10

Calculating SSD for 10

Calculating shift for 10

 300 / 300
Counting reads in features for 10

Signal over peaks for 10

done
Calculating coverage
Calculating coverage histogram for 11

Calculating SSD for 11

Calculating shift for 11

 300 / 300
Counting reads in features for 11

Signal over peaks for 11

done
Calculating coverage
Calculating coverage histogram for 12

Calculating SSD for 12

Calculating shift for 12

 300 / 300
Counting reads in features for 12

Signal over peaks for 12

done
Calculating coverage
Calculating coverage histogram for 13

Calculating SSD for 13

Calculating shift for 13

 300 / 300
Counting reads in features for 13

Signal over peaks for 13

done
Calculating coverage
Calculating coverage histogram for 14

Calculating SSD for 14

Calculating shift for 14

 300 / 300
Counting reads in features for 14

Signal over peaks for 14

done
Calculating coverage
Calculating coverage histogram for 15

Calculating SSD for 15

Calculating shift for 15

 300 / 300
Counting reads in features for 15

Signal over peaks for 15

done
Calculating coverage
Calculating coverage histogram for 16

Calculating SSD for 16

Calculating shift for 16

 300 / 300
Counting reads in features for 16

Signal over peaks for 16

done
Calculating coverage
Calculating coverage histogram for 17

Calculating SSD for 17

Calculating shift for 17

 300 / 300
Counting reads in features for 17

Signal over peaks for 17

done
Calculating coverage
Calculating coverage histogram for 18

Calculating SSD for 18

Calculating shift for 18

 300 / 300
Counting reads in features for 18

Signal over peaks for 18

done
Calculating coverage
Calculating coverage histogram for 19

Calculating SSD for 19

Calculating shift for 19

 300 / 300
Counting reads in features for 19

Signal over peaks for 19

done
Calculating coverage
Calculating coverage histogram for 1

Calculating SSD for 1

Calculating shift for 1

 300 / 300
Counting reads in features for 1

Signal over peaks for 1

done
Calculating coverage
Calculating coverage histogram for 2

Calculating SSD for 2

Calculating shift for 2

 300 / 300
Counting reads in features for 2

Signal over peaks for 2

done
Calculating coverage
Calculating coverage histogram for 3

Calculating SSD for 3

Calculating shift for 3

 300 / 300
Counting reads in features for 3

Signal over peaks for 3

done
Calculating coverage
Calculating coverage histogram for 4

Calculating SSD for 4

Calculating shift for 4

 300 / 300
Counting reads in features for 4

Signal over peaks for 4

done
Calculating coverage
Calculating coverage histogram for 5

Calculating SSD for 5

Calculating shift for 5

 300 / 300
Counting reads in features for 5

Signal over peaks for 5

done
Calculating coverage
Calculating coverage histogram for 6

Calculating SSD for 6

Calculating shift for 6

 300 / 300
Counting reads in features for 6

Signal over peaks for 6

done
Calculating coverage
Calculating coverage histogram for 7

Calculating SSD for 7

Calculating shift for 7

 300 / 300
Counting reads in features for 7

Signal over peaks for 7

done
Calculating coverage
Calculating coverage histogram for 8

Calculating SSD for 8

Calculating shift for 8

 300 / 300
Counting reads in features for 8

Signal over peaks for 8

done
Calculating coverage
Calculating coverage histogram for 9

Calculating SSD for 9

Calculating shift for 9

 300 / 300
Counting reads in features for 9

Signal over peaks for 9

done
Calculating coverage
Calculating coverage histogram for MT

Calculating SSD for MT

Calculating shift for MT

 300 / 30015 / 300
Counting reads in features for MT

Signal over peaks for MT

done
Calculating coverage
Calculating coverage histogram for X

Calculating SSD for X

Calculating shift for X

 300 / 300
Counting reads in features for X

Signal over peaks for X

done
Calculating coverage
Calculating coverage histogram for Y

Calculating SSD for Y

Calculating shift for Y

 300 / 300
Counting reads in features for Y

Signal over peaks for Y

done
Calculating coverage
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1
[1] 1

So as I said, this seems to run smoothly, but the output is not really what expected.

Thanks,

 

H.

 

 

 

 

ADD COMMENT
0
Entering edit mode
atisou • 0
@atisou-7468
Last seen 7.5 years ago
Switzerland

You are right!!! How did I miss that? Thanks for your eagle eyes dear Tom! I am correcting this straight away and will let you know about the outcome.

H.

 

 

ADD COMMENT
0
Entering edit mode

No problem,

Hopefully that will solve the issue.

Let me know if you have any issues running with ChIPQC() or ChIPQCsample() once the names are corrected

best,

 

tom

 

ADD REPLY
0
Entering edit mode
atisou • 0
@atisou-7468
Last seen 7.5 years ago
Switzerland

It all works perfectly now, thanks a lot for the rapid and efficient troubleshooting.

 

H.

 

ADD COMMENT
0
Entering edit mode
ysun • 0
@ysun-13669
Last seen 7.3 years ago

Hi Tom, 

I also had a similar error but my BAM file labels DO match my Bed file labels. I still get the following error: 

Bam file has 639 contigs

Error: BiocParallel errors

  element index: 1, 2, 3, 4, 5, 6, ...

  first error: Contigs of interest are not all in Bam file!!

I found that this is because my chr is labeled as chr 1991, 1992, 2991, 2992, and the program cannot handle this sort of skipping. 

Current input

# This works

experiment_test= ChIPQC(test, chromosomes=paste0("chr",1991:1992), annotation = NULL, consensus=TRUE, bCount=TRUE, mapQCth=15, blacklist=NULL,minOverlap=1, bParallel=TRUE,summits=100)

# This doesn't work

experiment_test= ChIPQC(test, chromosomes=paste0("chr",1991:2992), annotation = NULL, consensus=TRUE, bCount=TRUE, mapQCth=15, blacklist=NULL,minOverlap=1, bParallel=TRUE,summits=100)

 

What do you recommend I do? 

Thanks! 

 

ADD COMMENT
0
Entering edit mode
@thomas-carroll-7019
Last seen 2.0 years ago
United States/New York/The Rockefeller …

 

hi,

Sorry for delay, moved to new post/continent recently.

I think this maybe that ChIPQC is looking for all chromosome between chr1991 and chr2992, so from your description this will include chromosomes not in your BAM file.

> paste0("chr",1991:2992)
   [1] "chr1991" "chr1992" "chr1993" "chr1994" "chr1995" "chr1996" "chr1997" "chr1998" "chr1999" "chr2000"
  [11] "chr2001" ............

I think using paste0("chr",c(1991, 1992, 2991, 2992)) would include all the chromosomes you have in your samples.

so..

>experiment_test= ChIPQC(test, chromosomes=paste0("chr",c(1991, 1992, 2991, 2992)), annotation = NULL, consensus=TRUE, bCount=TRUE, mapQCth=15, blacklist=NULL,minOverlap=1, bParallel=TRUE,summits=100)

Let me know if that isn't the problem

best,

tom

ADD COMMENT

Login before adding your answer.

Traffic: 850 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6