Question

How to add chr to an dba object before differential analysis to enable removing blacklisted regions

0

Entering edit mode

fleur_p90 • 0

@fleur_p90-10888

Last seen 3.3 years ago

Netherlands

Hello all,

I am performing a differential accessibility analysis of ATACseq data with diffbind after using MACS2 for peak calling. However, somehow (I don't know where it goes wrong) the chromosomes are just numbered 1, 2, 3 etc in my peak files instead of chr1, chr2, chr3. Here's my code for differential analysis:

>OT1 <- dba(sampleSheet = samples)

>OT1 <- dba.count(OT1, summits = 250)

>OT1 <- dba.normalize(OT1, library = DBA_LIBSIZE_PEAKREADS)

>OT1 <- dba.contrast(OT1, categories = DBA_CONDITION)

And now I want to remove blacklisted regions

> OT1 <- dba.blacklist(OT1, blacklist = DBA_BLACKLIST_MM10, greylist = FALSE)

Genome detected: Mmusculus.UCSC.mm10
Applying blacklist...
Removed: 0 of 70147 intervals.
No intervals removed.
Warning message:
Blacklist does not overlap any peak chromosomes!

The warning is because in the blacklisted file the chromosomes are names chr1, chr2, chr3 and in my dba object 1, 2, 3 etc.

Is there an easy way to add chr to a dba object? I know how to do it with addchr() to a GRanges object but is there a similar approach for this case? That will allow me to continue my analysis without having to redo MACS2 or any steps before where the chr should've been included.

Many thanks!


> sessionInfo()

R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252   
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C                      
[5] LC_TIME=Dutch_Netherlands.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Mm.eg.db_3.12.0                       forcats_0.5.1                            
 [3] stringr_1.4.0                             dplyr_1.0.5                              
 [5] purrr_0.3.4                               readr_1.4.0                              
 [7] tidyr_1.1.3                               tibble_3.1.0                             
 [9] ggplot2_3.3.3                             tidyverse_1.3.0                          
[11] diffloop_1.18.0                           TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
[13] GenomicFeatures_1.42.2                    AnnotationDbi_1.52.0                     
[15] DiffBind_3.0.14                           SummarizedExperiment_1.20.0              
[17] Biobase_2.50.0                            MatrixGenerics_1.2.1                     
[19] matrixStats_0.58.0                        GenomicRanges_1.42.0                     
[21] GenomeInfoDb_1.26.4                       IRanges_2.24.1                           
[23] S4Vectors_0.28.1                          BiocGenerics_0.36.0                      
[25] ChIPseeker_1.26.2

DiffBind ATACSeq • 2.2k views

ADD COMMENT • link updated 3.9 years ago by James W. MacDonald 67k • written 3.9 years ago by fleur_p90 • 0

0

Entering edit mode

I'm just a newbie... But I think you can get the blacklist from here remove the "chr" manually, convert it into a Granges object and pass it as the parameter blacklist to dba.blacklist.

ADD REPLY • link 3.9 years ago Sam ▴ 10

0

Entering edit mode

Thank you!

ADD REPLY • link 3.9 years ago fleur_p90 • 0

score 1 · Answer 1 · 2021-03-19

If you used Ensembl genome data to do the alignment for your ATAC-Seq reads, then your chromosomes will not have a chr prepended to the name. But UCSC data are +/- based on NCBI genome data, which does use the prepended chr. So you have to 'fix' one or the other to match. Here's an example, using the example data from DiffBind

## I first ran 
> example("dba")
## to get the sset object
>  sset
class: RangedSummarizedExperiment 
dim: 2845 11 
metadata(0):
assays(6): scores RPKM ... cReads Called
rownames(2845): 1 2 ... 2844 2845
rowData names(0):
colnames(11): BT4741 BT4742 ... ZR751 ZR752
colData names(11): Tissue Factor ... bamControl Treatment
> rowRanges(sset)
GRanges object with 2845 ranges and 0 metadata columns:
       seqnames            ranges strand
          <Rle>         <IRanges>  <Rle>
     1    chr18       90841-91241      *
     2    chr18     111395-111795      *
     3    chr18     150222-150622      *
     4    chr18     150807-151207      *
     5    chr18     189192-189592      *
   ...      ...               ...    ...
  2841    chr18 77782651-77783051      *
  2842    chr18 77903180-77903580      *
  2843    chr18 77955380-77955780      *
  2844    chr18 77968226-77968626      *
  2845    chr18 77987616-77988016      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

## so you can see these are UCSC style chromosomes. Or you can check
> seqlevelsStyle(sset)
[1] "UCSC"

## and change

> seqlevelsStyle(sset) <- "Ensembl"
> rowRanges(sset)
GRanges object with 2845 ranges and 0 metadata columns:
       seqnames            ranges strand
          <Rle>         <IRanges>  <Rle>
     1       18       90841-91241      *
     2       18     111395-111795      *
     3       18     150222-150622      *
     4       18     150807-151207      *
     5       18     189192-189592      *
   ...      ...               ...    ...
  2841       18 77782651-77783051      *
  2842       18 77903180-77903580      *
  2843       18 77955380-77955780      *
  2844       18 77968226-77968626      *
  2845       18 77987616-77988016      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

You do get a message saying the data were detected as UCSC mm10, which would indicate that you have UCSC genome identifiers, which I don't really understand (I've not used DiffBind for any analyses so my familiarity is minimal). But anyway, that's how you would need to change things to match.

Also, please note that you need to wrap your code in triple back-ticks, not triple single quotes and a parenthesis. The back-tick is found at the top left on a QWERTY keyboard, along with the tilde. There will be a box below where you are typing your question that shows how it will be rendered - if your code isn't in a grey highlighted area, then you are doing something wrong.

score 0 · Answer 2 · 2021-03-23

I don't know how to convert back to a DBA object. Like I said, I have minimal experience. It's unfortunate that DiffBind uses its own list-based object rather than something interoperable like a SummarizedExperiment, or at the very least a way to change the chromosomes.

Anyway, one upside of using a list-based structure is that it allows you to monkey around with things that might otherwise be hard to affect. So you could try doing something like (untested! try at your own risk! caveats!)

> OT1 <- dba(sampleSheet = samples)
## and maybe do - you will have to check - you need the peaks list item to exist
> OT1 <- dba.count(OT1, summits = 250)
> OT1$peaks <- lapply(OT1$peaks, function(x) {x$Chr <- paste0("chr", x$Chr); return(x)})

## and then carry on from there.