DiffBind is losing peaks
2
0
Entering edit mode
Sebastian • 0
@sebastian-24090
Last seen 3.2 years ago

Hello!

I am new to the analysis of ChIP-seq experiments, so please be gentle if I did make a stupid mistake.

I am trying to identify consensus peaks in a series of 6 ChIPs with different antibodies for the same target (H3K27ac) in the same cell line. I used bbmap to map reads to hg38, and MACS2 to find narrow peaks and generate FE bedgraphs.

I inspected the pileups visually (IGV snapshot attached), I see 3 peaks in common to all 6 samples: a weak peak at chr1 around 15.000 bp, a medium strong peak at around 21.500 bp, and a strong peak at around 29.000 bp. All 3 peaks also show up in all 6 .narrowPeak files.

When I run the following code:

library(DiffBind)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

samples <- read.csv("meta/samplesheet.csv")
dbH3K27ac <- dba(sampleSheet = samples) ## Sample sheet with bam files, bam control files, narrowPeak files
dbH3K27ac <- dba.blacklist(dbH3K27ac, blacklist="BSgenome.Hsapiens.UCSC.hg38", greylist=FALSE)

dbH3K27ac <- dba.count(dbH3K27ac, summits=T, minOverlap = 0.66, bUseSummarizeOverlaps = T)
dbH3K27ac_consensus <- dba.peakset(dbH3K27ac, bRetrieve = T, writeFile = "H3K27ac_consensus")

the resulting file is missing the peak at 21.500 bp:

chr1    14652   15878
chr1    28036   30721
...

I have attached the IGV snapshot to illustrate. In the IGV panel, I have also loaded the BAM of the input and of one of the ChIPs, just to exclude that there is anything in the input file in the region of question. The IGV snapshot also shows the content of the blacklist file, which shows that the region around 21.500 bp is not blacklisted.

Why is the peak at 21.500 bp gone? I am worried that there are more peaks that disappear, and that my analysis will not show the complete picture in the end.

Thank you for any input!

Best regards Sebastian

IGV snaphot of the three peaks

    > sessionInfo()
    R version 4.0.3 (2020-10-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 10 x64 (build 17763)

    Matrix products: default

    locale:
    [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
    [4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

    attached base packages:
    [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

    other attached packages:
     [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0 GenomicFeatures_1.42.0                  
     [3] AnnotationDbi_1.52.0                     BiocParallel_1.24.0                     
     [5] DiffBind_3.0.1                           SummarizedExperiment_1.20.0             
     [7] Biobase_2.50.0                           MatrixGenerics_1.2.0                    
     [9] matrixStats_0.57.0                       rtracklayer_1.50.0                      
    [11] GenomicRanges_1.42.0                     GenomeInfoDb_1.26.0                     
    [13] IRanges_2.24.0                           S4Vectors_0.28.0                        
    [15] BiocGenerics_0.36.0                      forcats_0.5.0                           
    [17] stringr_1.4.0                            dplyr_1.0.2                             
    [19] purrr_0.3.4                              readr_1.4.0                             
    [21] tidyr_1.1.2                              tibble_3.0.4                            
    [23] ggplot2_3.3.2                            tidyverse_1.3.0
DiffBind • 1.7k views
ADD COMMENT
1
Entering edit mode
Rory Stark ★ 5.1k
@rory-stark-5741
Last seen 9 weeks ago
Cambridge, UK

If you can send me your DBA object dbH3K27ac, after calling dba() but before calling dba.blacklist(), I can have a look and see what is happening. You can send it as an attachment or a link to somewhere I can download it.

You could also confirm the peak is present in at least 4 of the 6 samples by looking before calling dba.count():

dbH3K27ac_all <- dba.peakset(dbH3K27ac, bRetrieve = T, writeFile = "H3K27ac_all")

There should be a non-zero score in all of the metadata columns for that peak.

FYI you can check the blacklist, and also check which peaks have been blacklisted:

dba.blacklist(dbH3K27ac, Retrieve=DBA_BLACKLIST)
dba.blacklist(dbH3K27ac, Retrieve=DBA_BLACKLISTED_PEAKS)
ADD COMMENT
0
Entering edit mode

Hello Rory,

thanks for your help. The dba object can be downloaded here https://drive.google.com/file/d/14ZBmX2PadnjrZpV7Fi-oBxip2uYwrVFl/view?usp=sharing.

The peak is present in all 6 samples before blacklisting:

chr1    11605   14041   0.00224906507832402 0.00309611744815214 0.00451977574248682 0.00133162222576916 0.00450277373966382 0.00265357996472606
chr1    14652   15878   0.00226898840965304 0.00592618588156941 0.0040077409338716  0.00192911308052596 0.00253298061533046 0.00140647044183567
chr1    20396   22815   0.0197448465234627  0.0328740211875948  0.0290455508339762  0.0131691027543587  0.0148474638639234  0.00964225622847418
chr1    28036   30721   0.309829116387708   0.514620733440064   0.6402576470228 0.213237601565909   0.269141761216816   0.163015915353839
...

As you suggested, I have double-checked both blacklist and the blacklisted peaks, and it's not in there:

> dbH3K27ac <- dba.blacklist(dbH3K27ac, blacklist="BSgenome.Hsapiens.UCSC.hg38", greylist=FALSE)
Genome detected: Hsapiens.UCSC.hg38
Applying blacklist...
Removed: 1419 of 572350 intervals.
Removed: 582 merged (of 135966) and 223 (of 90311) consensus.
> dba.blacklist(dbH3K27ac, Retrieve=DBA_BLACKLISTED_PEAKS)
GRangesList object of length 6:
$ChIP220620_01_H3K27ac_39133_120620_S1
GRanges object with 168 ranges and 1 metadata column:
        seqnames              ranges strand |       Score
           <Rle>           <IRanges>  <Rle> |   <numeric>
    [1]     chr1       629513-629827      * |  0.00122554
    [2]     chr1 121752630-121752881      * |  0.00543776
    [3]     chr1 121760918-121761175      * |  0.00162727
    [4]     chr1 122699686-122699925      * |  0.00233114
    [5]     chr1 122969686-122969916      * |  0.00208456
    ...      ...                 ...    ... .         ...
> dba.blacklist(dbH3K27ac, Retrieve=DBA_BLACKLIST)
GRanges object with 910 ranges and 0 metadata columns:
        seqnames            ranges strand
           <Rle>         <IRanges>  <Rle>
    [1]     chr1     628904-635104      *
    [2]     chr1   5850088-5850571      *
    [3]     chr1   8909611-8910014      *
    [4]     chr1   9574581-9574997      *
    [5]     chr1 32043824-32044203      *
    ...      ...               ...    ...
>

Thank you very much for troubleshooting!

Best

Sebastian

ADD REPLY
0
Entering edit mode

Could you run dba.count() with filter=0 and send me the result as well?

ADD REPLY
0
Entering edit mode

Sure. Please find the dba object here: dbH3K27ac_filter0.rda

The peak seems to be in there now.

Thank you!

ADD REPLY
0
Entering edit mode

Yes, the peak is there, but it has very few reads in either the ChIP or Control, even though we see pileups in the browser screenshot:

> dbH3K27ac <- dba.count(dbH3K27ac,peaks=NULL,score=DBA_SCORE_READS)
> dba.peakset(dbH3K27ac,bRetrieve=TRUE)[3,]

GRanges object with 1 range and 6 metadata columns:
    seqnames      ranges strand | ChIP220620_01_H3K27ac_39133_120620_S1
       <Rle>   <IRanges>  <Rle> |                             <numeric>
  3     chr1 20396-22815      * |                                     0
    ChIP220620_02_H3K27ac_ab4729_120620_S2 ChIP220620_05_H3K27ac_4729_050620_S5
                                 <numeric>                            <numeric>
  3                                      4                                    1
    ChIP220620_06_H3K27ac_39133_050620_S6 ChIP220620_10_H3K27ac_ab4729_260520_S10
                                <numeric>                               <numeric>
  3                                     1                                       0
    ChIP220620_13_H3K27ac_39133_260520_S13
                                 <numeric>
  3                                      0

> dbH3K27ac <- dba.count(dbH3K27ac,peaks=NULL,score=DBA_SCORE_CONTROL_READS)
> dba.peakset(dbH3K27ac,bRetrieve=TRUE)[3,]
GRanges object with 1 range and 6 metadata columns:
    seqnames      ranges strand | ChIP220620_01_H3K27ac_39133_120620_S1
       <Rle>   <IRanges>  <Rle> |                             <numeric>
  3     chr1 20396-22815      * |                                     0
    ChIP220620_02_H3K27ac_ab4729_120620_S2 ChIP220620_05_H3K27ac_4729_050620_S5
                                 <numeric>                            <numeric>
  3                                      0                                    0
    ChIP220620_06_H3K27ac_39133_050620_S6 ChIP220620_10_H3K27ac_ab4729_260520_S10
                                <numeric>                               <numeric>
  3                                     0                                       0
    ChIP220620_13_H3K27ac_39133_260520_S13
                                 <numeric>
  3                                      0

Without the bam files I'm somewhat limited in what I can try. You can make them available (you can send the link in email instead of posting it here), or try a couple of things yourself:

  1. Try running dba.count with mapQCth=0 (and filter=0), then see if the read counts change (as above).This will check if the reads are being filtered. The peak is very close to the start of the chromosome, and I'm wondering if the alignments have low quality scores. I think this region was in the blacklist once upon a time, and looking at your control track, it would likely be filtered out if you ran a greylist.
  1. Try running dba.count() with summits=FALSE (and filter=0), then see if the read counts change (as above). This will check if there is an issue with the read method (summarizeOverlaps is not actually used when summits=TRUE) -- if so we can work around that easily enough.
ADD REPLY
0
Entering edit mode

Thank you very much for your help. I'm very sorry for the late reply, I was busy with clinical duties...

I've tried the two things you suggested:

> dbH3K27ac_filter0_mapQCth0 <- dba.count(dbH3K27ac, summits=T, minOverlap = 0.66, bUseSummarizeOverlaps = T, filter=0, mapQCth=0)
> dba.peakset(dbH3K27ac_filter0_mapQCth0,bRetrieve=TRUE)[3,]
GRanges object with 1 range and 6 metadata columns:
    seqnames      ranges strand | ChIP220620_01_H3K27ac_39133_120620_S1 ChIP220620_02_H3K27ac_ab4729_120620_S2
       <Rle>   <IRanges>  <Rle> |                             <numeric>                              <numeric>
  3     chr1 20396-22815      * |                               1041.93                                1148.96
    ChIP220620_05_H3K27ac_4729_050620_S5 ChIP220620_06_H3K27ac_39133_050620_S6 ChIP220620_10_H3K27ac_ab4729_260520_S10
                               <numeric>                             <numeric>                               <numeric>
  3                              919.312                               764.451                                 1043.78
    ChIP220620_13_H3K27ac_39133_260520_S13
                                 <numeric>
  3                                1098.72
  -------
  seqinfo: 42 sequences from an unspecified genome; no seqlengths
> 
> dbH3K27ac_filter0_summitsF <- dba.count(dbH3K27ac, summits=F, minOverlap = 0.66, bUseSummarizeOverlaps = T, filter=0)
> dba.peakset(dbH3K27ac_filter0_summitsF,bRetrieve=TRUE)[3,]
GRanges object with 1 range and 6 metadata columns:
    seqnames      ranges strand | ChIP220620_01_H3K27ac_39133_120620_S1 ChIP220620_02_H3K27ac_ab4729_120620_S2
       <Rle>   <IRanges>  <Rle> |                             <numeric>                              <numeric>
  3     chr1 20396-22815      * |                                     0                                1.52973
    ChIP220620_05_H3K27ac_4729_050620_S5 ChIP220620_06_H3K27ac_39133_050620_S6 ChIP220620_10_H3K27ac_ab4729_260520_S10
                               <numeric>                             <numeric>                               <numeric>
  3                                    0                                     0                                       0
    ChIP220620_13_H3K27ac_39133_260520_S13
                                 <numeric>
  3                                      0
  -------
  seqinfo: 42 sequences from an unspecified genome; no seqlengths

Does this mean that the mapping quality is very low at this peak? I tried to assess mapping quality at this location in IGV, and when I click on individual reads, it shows a MAPQ of 3 for any of the reads that I've clicked (see attached image). MAPQ is also 3 for any of the reads in the peaks at 15000 and at 29000.

I'll try to cut the BAM file down and make them available for download later.

Thank you for your help!

IGV view of the BAM file, with pop-up window showing a MAPQ of 3

ADD REPLY
0
Entering edit mode

This makes sense. MAPQ of 3 indicates a multi-mapped read -- there is not a unique alignment. This may be a repeat region which is why is has shown up in blacklists before (difficult to get consistent reads unambiguously mapping to this region).

The default behavior in DiffBind is to ignore multi-mapped reads, however by setting mapQCth=0 you are including the "primary" mapping. In my experience, unless there is some reason to be paying particular attention to problematic repeat regions, it is cleaner to filter them out. (I have have had to deal with this case a few times, and put more effort into disambiguating the reads, but it adds a lot of work!).

Bottom line: unless you have some reason to be particularly interested in this near-telomeric region, you can ignore it (so long as you are getting positive read counts along most of the genome).

ADD REPLY
0
Entering edit mode

This absolutely makes sense. I don't really want to include these peaks in my analyses.

Thank you so much for your help!!

Best wishes

Sebastian

ADD REPLY
0
Entering edit mode
Sebastian • 0
@sebastian-24090
Last seen 3.2 years ago

......

ADD COMMENT

Login before adding your answer.

Traffic: 930 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6