Question

DiffBind on static peak set for all samples

0

Entering edit mode

Drew • 0

@b1719b98

Last seen 23 months ago

Germany

Hi there,

I am currently using DiffBind 3.2.7 for an analysis on RNA Polymerase III (RNAPIII) occupancy in the human genome. Because of the biology of RNAPIII, in that it is the exclusive polymerase for tRNAs and only very few other sites in the genome, and because of the particular question we are trying to answer, I would like to use a bed file of all annotated human tRNAs as the peak sets for all samples. That is, all samples have the same input file for peaks in the sample sheet. I would also like to retain all of these annotated regions throughout the analysis so that even completely unoccupied tRNA genes form part of the analysis and DESeq results table.

I am having a difficult time right at the first step of loading in my sample data. The bed file of tRNAs I am using (again, the same file for all samples is specified in the sample sheet) contains 560 regions. However, when I load this in with rpc1.alltRNAs <- dba(sampleSheet="RPC1_samples_alltRNAs.csv") and display rpc1.alltRNAs, there are only 536 sites in the matrix. I have been unable to prevent this filtering with any of the parameters that I can supply to dba() such as minOverlap, peakCaller, peakFormat, filter, bRemoveM, bRemoveRandom.

I have even tried reading in the bam file as a GRanges object and supplying that the the dba.count() command in the next step to try force all 560 peaks, but that doesn't work either.

I can't find any clear reason why these particular regions are filtered - there are no duplicates and the score for all 560 is exactly the same.

Any help would be much appreciated!

Drew

EDIT:

I should add that I have so far been able to restrict filtering imposed by dba.count() by setting filter = 0 and summits = FALSE but the initial sample reading step has already filtered some peaks by this point. The second parameter of disabling summits did in fact influence the number of peaks retained after counting, I'm not really sure why since all of the peaks should have the same summit and same 200bp extensions?

ChIPSeq DiffBind • 996 views

ADD COMMENT • link updated 23 months ago by Rory Stark ★ 5.1k • written 23 months ago by Drew • 0

score 0 · Answer 1 · 2022-05-12

0

Entering edit mode

Rory Stark ★ 5.1k

@rory-stark-5741

Last seen 8 days ago

Cambridge, UK

Your static set of tRNAs likely includes overlapping regions that DiffBind merges into single, longer peaks.

One option is to find these overlapping regions and alter your overlapping tRNA peaks to only include the regions that do not overlap.

Alternatively, you may want to look at the DBA$config$mergeOverlap option. Setting this to a positive value will allow overlapping regions to be kept as distinct peaks. For example, setting DBA$config$mergeOverlap <- 100 will allow peaks that overlap by as many as 100bp to remain unmerged.(Negative values enable peaks to be merged if they are within a certain distance of each other, even if they don't actually overlap).

Counting overlapping reads (using summarizeOverlaps()) can get complicated in these conditions, as the default is for reads that overlap multiple peaks to be discarded. There are other posts on this forum discussing these issues.

ADD COMMENT • link 23 months ago Rory Stark ★ 5.1k

0

Entering edit mode

Thanks Rory! That seems to be the issue. I think perhaps trimming those overlapping regions so that they no longer overlap makes the most sense. These tRNA regions are already extended by 200bp on both ends to capture the up- and downstream ChIP signal around the genes themselves, so trimming would likely not interfere too much. As far as I know there are no overlapping tRNA genes. On a more fundamental note, does using DiffBind in this kind of way make sense to you? Are there any reasons you can think of that using DiffBind like this is not a great idea? We would like to know what the global occupancy of RNAPIII at all tRNA genes in different cell types is and where can we detect changes.

ADD REPLY • link 23 months ago Drew • 0

0

Entering edit mode

This should work fine. It is preferable to have shorter, "representative" regions for each tRNA than have longer regions that include all of the tRNA plus some extended region, as the extended regions will contain more background signal that adds noise to the model. So long as the same representative regions are compared across all samples, you should be able to identify which tRNAs have consistent differences in RNAPIII binding.

My co-author, Gord Brown, was actually working on RNAPIII binding to tRNAs when DiffBind was being developed in 2010/2011. See:

Kutter, C., Brown, G.D., Gonçalves, Â., Wilson, M.D., Watt, S., Brazma, A., White, R.J. and Odom, D.T., 2011. Pol III binding in six mammals shows conservation among amino acid isotypes despite divergence among tRNA genes. Nature Genetics, 43(10), pp.948-955.

I know he spent a fair amount of time on the multi-mapping issue and how to allocate reads to tRNAs. While the paper was submitted before DiffBind was available (and hence doesn't use it), he subsequently re-analyzed the data using DiffBind.

Sadly, Gord died last year so we cannot ask him more about what he learned about analyzing these sorts of data.

ADD REPLY • link 23 months ago Rory Stark ★ 5.1k