Hello,
Thanks for your effort in producing such a good tool for analysis.
Recently, I have been using DiffBind to characterize differentially open chromatin regions from ATAC-seq data. I have more than 50 samples, each with a BAM file ranging from 10GB to 30GB. However, I found it really hard for DiffBind to handle such a large amount of data. Even with 'bParallel=TRUE' set for the dba.count function, it was still too slow to process.
Even when I selected only four samples for testing, the tool still failed with two different kinds of errors.
library("DiffBind")
samples=read.csv("res.info.csv")
tamoxifen <- dba(sampleSheet=samples)
#sample1 bulk B1 treat1 lib 1 bed
#sample2 bulk B1 treat2 lib 1 bed
#sample3 bulk B1 treat3 lib 1 bed
#sample4 bulk B1 treat4 lib 1 bed
tamoxifen
#4 Samples, 135778 sites in matrix (188638 total):
# ID Tissue Factor Condition Treatment Replicate Intervals
#1 sample1 bulk B1 treat1 lib 1 188126
#2 ...
tamoxifen <- dba.count(tamoxifen,bParallel=TRUE)
#Computing summits...
#Re-centering peaks...
#Reads will be counted as Paired-end.
#Reads will be counted as Paired-end.
#Reads will be counted as Paired-end.
#Reads will be counted as Paired-end.
Type 1 error:
#Error: Error processing one or more read files. Check warnings().
#In addition: Warning messages:
#1: In mclapply(arglist, fn, ..., mc.preschedule = TRUE, mc.allow.recursive = TRUE) :
# scheduled core 2 encountered error in user code, all values of the job will be affected
#2: error in evaluating the argument 'x' in selecting a method for function 'assay': wrong args for environment subassignment
Type 2 error:
#Re-centering peaks...
#Sample: ./input.bam125
#Reads will be counted as Paired-end.
#'BiocParallel' did not register default BiocParallelParam:
# missing value where TRUE/FALSE needed
#Error in h(simpleError(msg, call)) :
# error in evaluating the argument 'x' in selecting a method for function 'assay': error in evaluating the argument 'BPPARAM' in selecting a method for function 'bplapply': attempt to select less than one element in get1index
I am wondering if you have any suggestions for my case. Many thanks in advance!
Sincerely yours,
Xiaoyu
I encountered similar issues, but solved them by using:
I think the key in your case is using summits=0. I would avoid setting bUseSummarizeOverlaps=FALSE with paired-end data.
I have paired-end reads, but only uniquely mapped reads, with only single-end reads in the bam files. There was a problem for me when setting bUseSummarizeOverlaps=TRUE; there were no errors once I changed that setting to FALSE.
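For reference, a minimal sketch of the kind of dba.count() call being discussed here, assuming tamoxifen is the DBA object from the original post (the parameter values are illustrative, not a confirmed fix):
# Hedged sketch only: count with peak re-centering disabled (summits=0) and
# summarizeOverlaps-based counting kept on for paired-end data.
tamoxifen <- dba.count(tamoxifen,
                       summits = 0,                   # do not re-center peaks around summits
                       bUseSummarizeOverlaps = TRUE)  # avoid FALSE with paired-end reads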
Can you tell me what version you are using? There was a bug a while back that resulted in this error message but that has been addressed.
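If it helps, a quick way to report the relevant versions (nothing DiffBind-specific is assumed here beyond the package being installed):
# Print the installed DiffBind version and the full session details
packageVersion("DiffBind")
sessionInfo()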
For ATAC data, you may want to try a smaller value for summits, e.g. summits=50.
It may be worth trying this with bParallel=FALSE to see if there is an issue running in parallel. You may also want to set the number of cores being used by setting tamoxifen$config$cores before the call to dba.count(). If your machine has a lot of cores available but limited memory, you can run into trouble (for example if you end up using 50+ cores without enough memory for each core).
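A minimal sketch of those two suggestions, assuming tamoxifen is the DBA object from above (the core count and summit value are illustrative):
# Hedged sketch: cap the number of cores, then count serially to rule out
# parallel-processing problems; summits=50 gives narrower intervals for ATAC data.
tamoxifen$config$cores <- 4               # illustrative core count
tamoxifen <- dba.count(tamoxifen,
                       summits = 50,      # smaller fixed-width intervals
                       bParallel = FALSE) # run serially while debugging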
Dear Rory,
Thank you for providing this information. I'll make changes to the script based on your suggestions. Please take a look at the following information about my environment.
Yes, the fixes were introduced from DiffBind_3.6.3 in Bioconductor_3.15 (which uses R_4.2.0), so if you could update to a more recent version you may have a better experience.
Hi Rory, thanks for your suggestion. I have updated DiffBind to 3.6.5, which was built under R 4.2.1, and tried some steps. However, none of them succeeded. Please find the following code for more information.
Do you have any idea about this problem? Many thanks in advance!
Hi Rory,
I am wondering whether my data might be unsuitable for running the complete analysis because of its size (or something else). If so, would it be reasonable to generate a consensus peak set with DiffBind, count the reads in each peak with another tool such as featureCounts, and then either continue with DiffBind (if DiffBind accepts such a matrix as input) or go directly to DESeq2 or edgeR?
Have you tried setting summits=0 or summits=25?
I'm perplexed by the errors, which should have changed in the newer version, and I have no idea what that message was about. If you would be willing to make some of your data available to me (even two of the bam files would suffice, along with the consensus peakset), I could get to the bottom of it. The number of reads or peaks shouldn't make a difference, as DiffBind has been used many times on quite large datasets.
The alternatives you propose should work. You can export the consensus peakset using dba.peakset() with bRetrieve=TRUE and/or by specifying a value for the writeFile parameter.
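A minimal sketch of that export step, assuming tamoxifen is the DBA object and a consensus peak set is already in place (the output file name is illustrative):
# Hedged sketch: retrieve the consensus peak set as a GRanges object and
# write it out to a tab-delimited file for use with external counting tools.
consensus <- dba.peakset(tamoxifen,
                         bRetrieve = TRUE,
                         writeFile = "consensus_peaks.txt")  # illustrative file name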
You can read the counts back in to DiffBind either by writing the counts for each sample out to a separate file and including them in your samplesheet (in a column labelled Counts), or by looping through the columns of a count matrix and calling dba.peakset() with the counts= parameter. There is some code to do this in a recent post: DiffBind and peak counts as input
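And a sketch of the second route described above, looping over an externally derived count matrix; count_mat (a peaks-by-samples matrix, e.g. from featureCounts) and consensus (the consensus intervals) are hypothetical names used for illustration:
# Hedged sketch: build a DBA object from an external count matrix, adding one
# sample (one matrix column) per call to dba.peakset() via the counts= parameter.
res <- NULL
for (i in seq_len(ncol(count_mat))) {
  res <- dba.peakset(res,
                     peaks  = consensus,               # consensus intervals
                     sampID = colnames(count_mat)[i],  # sample identifier
                     counts = count_mat[, i])          # counts for this sample
}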
If you don't read the counts back into DiffBind, there are a number of issues you'll have to deal with (such as normalization), as well as losing the reporting and plotting features; however, you can of course do the rest of the processing yourself using the underlying tools (e.g. DESeq2) if you are comfortable with each of the processing steps and the associated statistics.
DiffBind is without a doubt my first choice for analyzing my data. I am impressed by the comprehensive DiffBind tutorial, which takes into account numerous factors that we may overlook when processing the data ourselves.
As for the summits setting, it works when I set summits=0! Could you please explain what difference the summits setting makes?
I greatly appreciate your assistance, thanks!