Hi there,
I am currently using DiffBind 3.2.7 for an analysis on RNA Polymerase III (RNAPIII) occupancy in the human genome. Because of the biology of RNAPIII, in that it is the exclusive polymerase for tRNAs and only very few other sites in the genome, and because of the particular question we are trying to answer, I would like to use a bed file of all annotated human tRNAs as the peak sets for all samples. That is, all samples have the same input file for peaks in the sample sheet. I would also like to retain all of these annotated regions throughout the analysis so that even completely unoccupied tRNA genes form part of the analysis and DESeq results table.
I am having a difficult time right at the first step of loading in my sample data. The bed file of tRNAs I am using (again, the same file for all samples is specified in the sample sheet) contains 560 regions. However, when I load this in with rpc1.alltRNAs <- dba(sampleSheet="RPC1_samples_alltRNAs.csv")
and display rpc1.alltRNAs
, there are only 536 sites in the matrix. I have been unable to prevent this filtering with any of the parameters that I can supply to dba()
such as minOverlap, peakCaller, peakFormat, filter, bRemoveM, bRemoveRandom
.
I have even tried reading in the bam file as a GRanges object and supplying that the the dba.count()
command in the next step to try force all 560 peaks, but that doesn't work either.
I can't find any clear reason why these particular regions are filtered - there are no duplicates and the score for all 560 is exactly the same.
Any help would be much appreciated!
Drew
EDIT:
I should add that I have so far been able to restrict filtering imposed by dba.count()
by setting filter = 0
and summits = FALSE
but the initial sample reading step has already filtered some peaks by this point. The second parameter of disabling summits did in fact influence the number of peaks retained after counting, I'm not really sure why since all of the peaks should have the same summit and same 200bp extensions?
Thanks Rory! That seems to be the issue. I think perhaps trimming those overlapping regions so that they no longer overlap makes the most sense. These tRNA regions are already extended by 200bp on both ends to capture the up- and downstream ChIP signal around the genes themselves, so trimming would likely not interfere too much. As far as I know there are no overlapping tRNA genes. On a more fundamental note, does using DiffBind in this kind of way make sense to you? Are there any reasons you can think of that using DiffBind like this is not a great idea? We would like to know what the global occupancy of RNAPIII at all tRNA genes in different cell types is and where can we detect changes.
This should work fine. It is preferable to have shorter, "representative" regions for each tRNA than have longer regions that include all of the tRNA plus some extended region, as the extended regions will contain more background signal that adds noise to the model. So long as the same representative regions are compared across all samples, you should be able to identify which tRNAs have consistent differences in RNAPIII binding.
My co-author, Gord Brown, was actually working on RNAPIII binding to tRNAs when
DiffBind
was being developed in 2010/2011. See:Kutter, C., Brown, G.D., Gonçalves, Â., Wilson, M.D., Watt, S., Brazma, A., White, R.J. and Odom, D.T., 2011. Pol III binding in six mammals shows conservation among amino acid isotypes despite divergence among tRNA genes. Nature Genetics, 43(10), pp.948-955.
I know he spent a fair amount of time on the multi-mapping issue and how to allocate reads to tRNAs. While the paper was submitted before
DiffBind
was available (and hence doesn't use it), he subsequently re-analyzed the data usingDiffBind
.Sadly, Gord died last year so we cannot ask him more about what he learned about analyzing these sorts of data.