Question

DiffBind how to retrieve normalized count matrix of ATAC-seq data

1

Entering edit mode

Shuwan ▴ 10

@potamot

Last seen 2.6 years ago

United States

Hello,

I am now using DiffBind (3.0.14) to analyze my ATAC-seq data. I have two groups of samples that have drastically different peaks numbers called by macs2. The cells before injury have around 16k peaks. The cells after injury have around 56k peaks. And the MA plot of non-normalized data generated by dba.plotMA(cls.dsq, method = DBA_DESEQ2, bNormalized = F, th = 0) shows that the injured samples have a much greater raw read density. Both the full library size normalization and background normalization methods did not alter this bias and the results are extremely unbalanced. The code I used for normalization is as below.

cls.cnt.normalized <- dba.normalize(cls.cnt, method = DBA_ALL_METHODS, normalize = DBA_NORM_LIB) #full library size normalization
#or
cls.cnt.normalized <- dba.normalize(cls.cnt, method = DBA_ALL_METHODS, normalize = DBA_NORM_NATIVE, background = TRUE) #background normalization

The analysis method I used is DBA_DESEQ2.

My first question is what is the best practice in DiffBind to deal with two data sets that have major changes in chromatin accessibility?

non-normalized MA plot full library size normalization

One way I have thought of to double-check whether the normalization is valid is to check the fragment counts (my data is paired-end) of promoter regions of highly expressed unchanged genes between uninjured and injured samples using their RNA-seq data. Now I have the promoter positions of the genes I want to look at but I am not sure how to retrieve the normalized count matrix of my samples in DiffBind. I only found RPKM of each sample at each consensus peak. And the RPKM value stays the same before and after dba.normalize call.

My second question is how is the RPKM calculated after the dba.count call? It is a value produced after library size and genomic interval length normalization of the raw fragment count, just like RNA-seq RPKM?

My final question is how can I retrieve the normalized count of all my samples with genomic locations (Chr Start End)? (I have tried

dsq.dds <- dba.analyze(cls.dsq, method = DBA_DESEQ2, bRetrieveAnalysis = T)
cnt.table <- counts(dsq.dds, normalized = T)

but the row names of the table are just numbers 1, 2, 3, 4, 5... lack of genomic location information.)

Many thanks in advance!

Best regards,

Shuwan

DiffBind • 2.9k views

ADD COMMENT • link updated 3.0 years ago by Rory Stark ★ 5.1k • written 3.1 years ago by Shuwan ▴ 10

0

Entering edit mode

This could be relevant to you as well, normalization seems to be very offset here due to the fact that the 2dpr condition has a lot of regions with increased counts: Comment: How sensitive is DiffBind to Deseq2 assumptions ?

Maybe try to only use a subset of the regions for normalization.

ADD REPLY • link 3.1 years ago ATpoint ★ 4.0k

0

Entering edit mode

See also Rory Stark's answer in this post : Normalization in DiffBind for Atac-Seq

ADD REPLY • link 3.0 years ago Sam ▴ 10

score 1 · Answer 1 · 2021-03-25

1

Entering edit mode

Rory Stark ★ 5.1k

@rory-stark-5741

Last seen 12 days ago

Cambridge, UK

My first question is what is the best practice in DiffBind to deal with two data sets that have major changes in chromatin accessibility

I would suggest calling dba.normalize()with background=TRUE and normalize=DBA_NORM_NATIVE.

If you have a set of promoter regions, you can pass these in to dba.count()using the peaks= parameter:

cls.dsq <- dba.count(cls.dsq, peaks=myPromoterList)

To retrieve count values with genomic intervals (as a GRanges object), use dba.peakset()with bRetrieve=TRUE. The returned values will depend on how score is set in dba.count(). The default should be normalized counts (score=DBA_SCORE_NORMALIZED), not RPKM (score=DBA_SCORE_RPKM). You can change the score without re-counting reads by setting peaks=NULL, eg.:

cls.dsq <- dba.count(cls.dsq, peaks=NULL, score=DBA_SCORE_NORMALIZED)
cnt.table <- dba.peakset(cls.dsq, bRetrieve=TRUE)

To get raw reads:

cls.dsq <- dba.count(cls.dsq, peaks=NULL, score=DBA_SCORE_READS)
cnt.table <- dba.peakset(cls.dsq, bRetrieve=TRUE)

If you use RPKM scores, they do not change when you normalize the data as RPKM is itself an alternate form of normalization (normalizing for library depth and peak width).

ADD COMMENT • link 3.1 years ago Rory Stark ★ 5.1k

0

Entering edit mode

My first question is what is the best practice in DiffBind to deal with two data sets that have major changes in chromatin accessibility

I would suggest calling dba.normalize()with background=TRUE and normalize=DBA_NORM_NATIVE.

Could you explain why this is the best practice when two data sets have major changes in binding (for example, due to changes in chromatin accessibility)?

ADD REPLY • link 3.1 years ago Sam ▴ 10

1

Entering edit mode

Optimally, normalization adjusts for technical issues and not biological signals. If there is a large shift in binding patterns, you run the risk of over-normalizing and essentially removing the differences you are most interested in. Normalizing to background bins instead of consensus peaks helps protect against this.

ADD REPLY • link 3.0 years ago Rory Stark ★ 5.1k

0

Entering edit mode

Why is normalization with background=TRUE and DBA_NORM_NATIVE (for DESeq) better than full library size normalization?

ADD REPLY • link 3.0 years ago Sam ▴ 10

1

Entering edit mode

Most of the time these yield quite similar results. The default is to use full library size as it introduces the fewest assumptions and is very inexpensive computationally. Background normalization requires an additional counting step, which can be time-consuming, but using RLE or TMM to normalize large background bins could do a better job in certain situations where the reads over large genomic segments are not evenly distributed across samples.

ADD REPLY • link 3.0 years ago Rory Stark ★ 5.1k