Question: Multithreading CopywriteR on SLURM cluster
3.8 years ago by
smcnulty0
smcnulty0 wrote:

Hello,

I'm trying to multithread CopywriteR on a SLURM cluster using Open MPI. My jobs seem to launch properly and run for about 10 minutes, but then they die prematurely.

The final message in my slurm output file looks like this:

Error in [<-.data.frame(*tmp*, , "total.properreads", value = list( :
replacement element 18 has 2 rows, need 17
Calls: CopywriteR -> [<- -> [<-.data.frame
Execution halted
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[40568,1],0]
Exit code:    1

Any ideas?

ETA:

For clarification, it seems to stop after all the *properreads.bam.bai files have been generated.

slurm copywriter openmpi
modified 3.7 years ago by t.kuilman140 • written 3.8 years ago by smcnulty0
3.8 years ago by
t.kuilman140
Netherlands
t.kuilman140 wrote:

Hi,

I have seen this error message before and it was solved by removing CopywriteR and reinstalling it using Bioconductor:

remove.packages("CopywriteR")
source("https://bioconductor.org/biocLite.R")
biocLite("CopywriteR")

Is it true that you installed CopywriteR from GitHub? If you do that and the rest of the dependencies are installed via Bioconductor, then some of the dependencies might be 'broken' (wrong version) and you will get this error message. If a complete reinstallation (and an update of its dependencies) does not help, can you get back to me? In that case, the output of CopywriteR.log and the exact code used to run CopywriteR would be helpful.

Thomas
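
To check for the version mismatch described above, it can help to list what is actually installed on each machine. The following is a minimal sketch using base R only; `report_versions` is a hypothetical helper name, and the package names are simply those mentioned in this thread, not a complete dependency list.

```r
# Hypothetical helper: report the installed version of each package,
# or NA if the package is not installed in the current library paths.
report_versions <- function(pkgs) {
  versions <- vapply(pkgs, function(p) {
    tryCatch(as.character(utils::packageVersion(p)),
             error = function(e) NA_character_)
  }, character(1))
  data.frame(package = pkgs, version = unname(versions),
             stringsAsFactors = FALSE)
}

# The R version matters as well, since Bioconductor releases are tied to it.
print(R.version.string)
print(report_versions(c("CopywriteR", "CGHcall", "DNAcopy")))
```

Running the same lines on the laptop and on the cluster gives two tables that can be compared side by side.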

3.7 years ago by
smcnulty0
smcnulty0 wrote:

I'm still working with the people who manage our cluster to determine which version of CopywriteR was loaded and to get the multithreading up and running, but I'll let you know what comes of it.

In the meantime, I'm able to run CopywriteR and CGHcall on individual BAM files, passing each BAM to a different node on the cluster. Yesterday, I processed 17 BAM files, resulting in 17 individual .igv files. Next, I used a simple R script to merge all 17 .igv files into a single .igv file and fed that to CGHcall as you'd instructed me previously (in case anyone is interested: https://support.bioconductor.org/p/77930/). I compared the results to those I'd obtained when I ran everything on my laptop (17 BAMs all at once, resulting in a single combined .igv file). I was surprised to see that the results were different. I'm pretty sure that all the settings were the same (for instance, the tumor cellularity provided within CGHcall). Do you have any comment? Is this expected? If so, which result should be considered more accurate?
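
The merge script itself is not shown in the thread. A merge of this kind could look roughly like the sketch below, which is not the poster's actual script. It assumes each .igv file is tab-delimited, may start with a "#" track line, and shares four key columns named Chromosome, Start, End and Feature followed by one column per sample; `merge_igv` and those column names are assumptions for illustration.

```r
# Hypothetical helper: join several per-sample .igv tables on their
# shared coordinate columns, giving one table with a column per sample.
merge_igv <- function(paths,
                      key_cols = c("Chromosome", "Start", "End", "Feature")) {
  tables <- lapply(paths, function(p) {
    # Lines starting with "#" (e.g. a track line) are skipped.
    read.delim(p, comment.char = "#", check.names = FALSE,
               stringsAsFactors = FALSE)
  })
  # merge() keeps only bins present in both inputs and sorts by the keys.
  Reduce(function(x, y) merge(x, y, by = key_cols), tables)
}
```

Because `merge()` performs an inner join by default, bins missing from any one file are dropped from the result, so it is worth comparing the row count of the merged table with that of the inputs.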

t.kuilman140 wrote:

OK, I am interested to see whether you manage to get the parallel analysis with CopywriteR working.

As to your question: it should not matter whether you run CopywriteR a number of times on single BAM files or once on the combined set, as long as the provided controls are the same in both analyses (internally, the samples are analyzed completely independently even if multiple samples are analyzed together). Did you check whether the inputs for CGHcall (the 2 .igv files) were identical? I am not quite sure where things went wrong; if you provide me with the exact code you ran, I might be able to help you out.

Thomas

3.7 years ago by
smcnulty0
smcnulty0 wrote:

Hi Thomas,

So, here's the breakdown of my work so far ...

As I said before, I'm trying to process 17 BAM files, either as 1 run on my laptop or as 17 separate runs on our cluster. I'm often getting this error at the end of the run:

Total calculation time of CopywriteR was:  31.96694

Warning message:
In plot.xy(xy.coords(x, y), type = type, ...) :
"subset" is not a graphical parameter

However, the process still generates (what looks like) a completed .igv file. The igv files generated on my laptop and on our cluster look very similar, but are not identical. I have some examples below:

laptop1: chr15    60000001    60050000    chr15:60000001-60050000    0.719653400974822

cluster1: chr15    60000001    60050000    chr15:60000001-60050000    0.719653400974823

laptop2: chr15    60700001    60750000    chr15:60700001-60750000    0.0973944022031557

cluster2: chr15    60700001    60750000    chr15:60700001-60750000    0.0973944022031558

The differences seem super, super tiny, but they seem to matter a great deal b/c I get different results out of CGHcall.

Last night I decided to make sure that all the input files were the same by comparing md5 checksums, to confirm that nothing had gotten corrupted or altered in the transfers. In doing this, I realized that there must be a slight difference in the hg19 files being pulled down by PreCopywriteR. I figured this was the source of the problem until I transferred the hg19 files from my laptop to the cluster. Running CopywriteR with the laptop hg19 files still gave me the same "cluster" results.

To be clear, I'm using the same versions of CopywriteR and CGHcall in both places, though my laptop is running R version 3.2.3 and our cluster is running R version 3.2.1.

I sent you an email with the code I used in each place since I'm not sure if it's possible to attach it here.

t.kuilman140 wrote:

OK, to me it seems that CopywriteR runs as it should and that the small differences in the .igv file stem from the internal settings of the system you are using. My first thought on the differences in the output from CGHcall is that they come from differences in the way you run the tool. Could it be that you process individual samples separately (using CGHcall) on the cluster, while in the other case you run CGHcall on the entire set of samples together? In that case one wouldn't expect the same results, as CGHcall imputes absent values, and this is (as far as I know) dependent on the values of other samples. There would therefore be a difference between running CGHcall on individual samples and then merging, versus running CGHcall on your merged data. Could it be that this is the problem?

Thomas
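
The example rows quoted earlier in the thread differ only in the last printed digit, which is the scale of double-precision rounding. One way to confirm that two .igv value columns agree apart from such rounding is to compare them with a tolerance instead of testing for exact equality. The sketch below uses the four values from the thread; the vector names are made up for illustration.

```r
# Values copied from the laptop and cluster rows quoted in the thread.
laptop  <- c(0.719653400974822, 0.0973944022031557)
cluster <- c(0.719653400974823, 0.0973944022031558)

# Exact comparison: the stored doubles are not bit-for-bit the same.
exact_match <- identical(laptop, cluster)

# Tolerant comparison: all.equal() uses a relative tolerance of about 1.5e-8.
close_match <- isTRUE(all.equal(laptop, cluster))

# Largest absolute difference between the two runs.
max_diff <- max(abs(laptop - cluster))

print(c(exact = exact_match, close = close_match))
print(max_diff)
```

If the tolerant comparison passes for every bin, the CopywriteR outputs can be treated as equivalent, and any difference downstream is more likely to come from how CGHcall was run.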

3.7 years ago by
smcnulty0
smcnulty0 wrote:

I think this may account for part of the problem. I was processing the samples together in CGHcall, but was missing a few on my laptop. I processed 17 BAMs on the cluster but only a subset of those on the laptop.

We also found this in the documentation for the segment function of DNAcopy, which CGHcall uses for segmentation (http://www.rdocumentation.org/packages/DNAcopy/html/segment.html):

"Since the segmentation procedure uses a permutation reference distribution, R commands for setting and saving seeds should be used if the user wishes to reproduce the results."

I was just wondering if this is also the case for CopywriteR. Should I be setting a seed manually there as well to make sure my results are 100% reproducible? This will be very important for me going forward.
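
As a general illustration of what the quoted sentence means, here is a minimal base R sketch of a permutation-based calculation made reproducible with `set.seed()`. It is a toy example, not DNAcopy's or CopywriteR's algorithm: `perm_pvalue` and the input values are invented for illustration, and the sketch does not settle whether CopywriteR itself needs a seed.

```r
# Toy permutation test: compare the observed difference in means against
# differences obtained after randomly shuffling the pooled values.
perm_pvalue <- function(x, y, n_perm = 1000) {
  observed <- abs(mean(x) - mean(y))
  pooled <- c(x, y)
  idx <- seq_along(x)
  perm_stats <- replicate(n_perm, {
    shuffled <- sample(pooled)  # draws from R's random number generator
    abs(mean(shuffled[idx]) - mean(shuffled[-idx]))
  })
  mean(perm_stats >= observed)
}

x <- c(0.1, 0.3, 0.2, 0.4)
y <- c(0.5, 0.7, 0.6, 0.9)

# Setting the same seed before each call gives the same permutations,
# and therefore the same result.
set.seed(42)
p1 <- perm_pvalue(x, y)
set.seed(42)
p2 <- perm_pvalue(x, y)
print(identical(p1, p2))
```

In practice this means calling `set.seed()` with a fixed value immediately before any step that relies on random permutations, and recording that value alongside the results.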