Question: BiocParallel 1.5.20: Another error following example 4.2.1: all "remote" workers on localhost
lee.tibbert20 wrote, 3.4 years ago:

Using BiocParallel 1.5.20, I am trying to follow example 4.2.1 in the "Introduction to BiocParallel" vignette.

Thanks to a fix in 1.5.20, I can now use hostnames/IP_addresses in the workers= argument to SnowParam. So far so good. The problem I seem to be encountering is that the workers are all created on the localhost, specified or not.

That is, when I set hosts <- c("notLocalhost", "notLocalhost"), I get

starting worker localhost:11943
starting worker localhost:11943

I get the same result if I use valid IP addresses: c("NNN.NNN.NNN.02", "NNN.NNN.NNN.03"), where 01 is localhost. All nodes are up & I can ssh to them without a password.

Am I misunderstanding something? Have I made a blatant mistake? sessionInfo() below.

Thank you for any help.

Lee

> sessionInfo()
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: i686-pc-linux-gnu (32-bit)
Running under: Ubuntu 15.10

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets
[6] methods   base

other attached packages:
[1] BiocParallel_1.5.20

loaded via a namespace (and not attached):
[1] snow_0.4-1     parallel_3.2.4 tools_3.2.4   

Answer: BiocParallel 1.5.20: Another error following example 4.2.1: all "remote" workers
Martin Morgan (United States) wrote, 3.4 years ago:

Thanks for the further report; this has been fixed in 1.5.21. One open issue is that a mis-specified host will not time out in a reasonable length of time, so starting the cluster will appear to hang (there may be a message when the 'client' fails, but this is asynchronous to the 'server' side starting).

> bpstart(SnowParam(c("localhost", "nohost")))
Bioconductor version 3.3 (BiocInstaller 1.21.3), ?biocLite for help
starting worker localhost:11117
^C

Please do post with other problems that you see.

Thank you for the rapid turnaround (and the heads up on typo'ed hostnames). When, in the fullness of time, 1.5.21 becomes available in 'devel', I will try this out again and report back (close.)

As always, thank you for BiocParallel, it is saving me a _lot_ of time (and is interesting & well documented to boot).

Lee

### NAK on 1.5.21, sorry...

Given XXX.local as the local host and YYY.local as the remote host I continue to be unable to execute the example on the remote host.

host <- c("YYY.local") gives execution on the local node XXX.

BTW: I discovered during my testing that host <- c("") also executes on the local node. I would have expected either an error, or to have fallen into the connection problem you mentioned above.

hosts <- c("localhost", "YYY.local") gives all execution on the local node XXX. I would have expected to see half on XXX & half on YYY. I tried with sizes 40, 100, etc., and all execute solely on local node XXX. Not even a connection message to YYY.

starting worker localhost:11552
starting worker XXX:11552
Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b") :
  cannot open the connection
Calls: local ... eval -> <Anonymous> -> <Anonymous> -> socketConnection
In addition: Warning message:
In socketConnection(master, port = port, blocking = TRUE, open = "a+b") :
  XXX:11552 cannot be opened
Execution halted

hosts <- c("localhost", "localhost"), hosts <- c("localhost", "XXX.local") & hosts <- c("XXX.local", "XXX.local") all work as expected (successful execution on local node).

I have been waiting for success with one remote node before trying two.

In the design, is it intended that a single remote node, no local, work? Similarly, should a pair of remote nodes, no local work? Am I barking up the wrong tree?

Thank you for any suggestions.

Can you be a bit more explicit about what you're doing? I have

> bpstart(SnowParam(c("localhost", "otherhost")))
Bioconductor version 3.3 (BiocInstaller 1.21.4), ?biocLite for help
starting worker localhost:11073
ssh: Could not resolve hostname otherhost: Name or service not known
^C
> bpstart(SnowParam("otherhost"))
ssh: Could not resolve hostname otherhost: Name or service not known
^C

> packageVersion("BiocParallel")
[1] '1.5.21'

Also the port(s) used need to be open; use SnowParam("localhost", port=12345) to specify an appropriate port.

My package version seems to match:

packageVersion("BiocParallel")

[1] ‘1.5.21’

What I am doing. I hope/believe  that I have copied the example correctly:

if (!("BiocParallel" %in% .packages()))
  library("BiocParallel") # implied & necessary, not in text

# Continues to not work, BiocParallel 1.5.21 2016-03-25
# hosts <- c("localhost", "YYY.local")

hosts <- c("YYY.local") # still fails

param <- SnowParam(workers = hosts, type = "SOCK")

FUN <- function(i) system("hostname", intern = TRUE)
result1 <- bplapply(1:4, FUN, BPPARAM = param)

I will try your bpstart() code to see if I get different results.


Using bpstart() I get essentially the same results:

XXX is local, YYY is remote node.  SSH from XXX to YYY works, R is installed & available to SSH'd user on YYY (I have believed the error message about XXX and have _not_ yet checked for  firewall/port_blocked issues on YYY. )

> library("BiocParallel")
> bpstart(SnowParam(c("localhost", "YYY.local")))
starting worker localhost:11685
starting worker XXX:11685

Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b") :
cannot open the connection
Calls: local ... eval -> <Anonymous> -> <Anonymous> -> socketConnection
In socketConnection(master, port = port, blocking = TRUE, open = "a+b") :
XXX:11685 cannot be opened
Execution halted

Can you edit your post to clarify what result you are referring to when you say 'essentially the same result' ?

So bplapply has an 'optimization' where if there is a single worker specified then evaluation occurs on the host. Probably this should be revisited. When you have two hosts, the puzzle is that YYY.local is apparently translated to XXX; is that something done by ssh? You can see the command being evaluated with something like

> trace(system, quote(print(command)))
Tracing function "system" in package "base"
[1] "system"
> bpstart(SnowParam("localhost"))
Tracing system(cmd, wait = FALSE) on entry
[1] "/home/mtmorgan/bin/R-3-3-branch/bin/Rscript /home/mtmorgan/R/x86_64-pc-linux-gnu-library/3.3/BiocParallel/snow/RSOCKnode.R MASTER=localhost PORT=11557 OUT=/dev/null SNOWLIB=/home/mtmorgan/R/x86_64-pc-linux-gnu-library/3.3/BiocParallel"
Bioconductor version 3.3 (BiocInstaller 1.21.4), ?biocLite for help
starting worker localhost:11557

Ah, the trace command gives me someplace to start. At least, it puts me down into SSH land. I will study the R scripts to see if they are doing port forwarding or fancy SSH stuff.

From a fresh invocation of R:

if (!("BiocParallel") %in% .packages())
+   library("BiocParallel") # implied & necessary, not in text
>
> trace(system, quote(print(command)))
Tracing function "system" in package "base"
[1] "system"
> bpstart(SnowParam(c("localhost", "YYY.local")
+ )
+ )
Tracing system(cmd, wait = FALSE) on entry
[1] "/usr/lib/R/bin/Rscript /home/lee/R/i686-pc-linux-gnu-library/3.2/BiocParallel/snow/RSOCKnode.R MASTER=localhost PORT=11120 OUT=/dev/null SNOWLIB=/home/lee/R/i686-pc-linux-gnu-library/3.2/BiocParallel"
starting worker localhost:11120
Tracing system(cmd, wait = FALSE) on entry
[1] "ssh -l lee YYY.local /usr/lib/R/bin/Rscript /home/lee/R/i686-pc-linux-gnu-library/3.2/BiocParallel/snow/RSOCKnode.R MASTER=XXX PORT=11120 OUT=/dev/null SNOWLIB=/home/lee/R/i686-pc-linux-gnu-library/3.2/BiocParallel"
starting worker XXX:11120
Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b") :
cannot open the connection
Calls: local ... eval -> <Anonymous> -> <Anonymous> -> socketConnection
In socketConnection(master, port = port, blocking = TRUE, open = "a+b") :
XXX:11120 cannot be opened
Execution halted

I just checked that a simple "ssh -l lee YYY.local" works. Now to add more parameters... I will probably change the OUT= to try to capture any error message.

Again, thank you for progress to date.



I may have a clue.  On XXX, Rscript is in /usr/bin/Rscript. The ssh command expects it in /usr/lib/R/bin/Rscript. I will check node YYY, but I expect it to have the same setup. I need to go off line for a while, but will report back. (Both nodes are Ubuntu 15.10, using Ubuntu .deb packages).

Following the docs for the ... argument to SnowParam(), you're led to ?parallel::makeCluster, where there are options such as port, master, and rscript.

[ 2016-03-25 21:36 -0400 Post was blocked because I exceeded my max allowed
posts per 6 hours.]

Progress....

1) The location of the Rscript turned out to be a non-issue. On Ubuntu 15.10 (and probably other versions), /usr/bin/Rscript is a full copy (not link) of /usr/lib/R/bin/Rscript.  An ssh command can execute either with a lightweight ~/echo.R file.

2) The line in the log above: starting worker XXX:11120

appears to be misdirection. That line of code is, I believe, really executing on the remote YYY, where it is attempting to connect back to the master on XXX:11120. I edited my local copy of ~/R/mumble/3.2/BiocParallel/snow/RSOCKnode.R to help me trace the flow:

# 2016-03-25 Lee T. debug code

thisNode <- system("hostname", intern = TRUE)
message("worker on node ", thisNode,
" connecting back to master ", master, ":", port)

That debug code is a bit expensive because of the system() call. There is
probably a faster way to get the local IP address to which inbound
socket connected. It was good enough for debug purposes. Production
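
For what it is worth, one cheaper way to label such a debug message is Sys.info(), which is base R and avoids spawning a process. This is only a sketch: the master and port values below are placeholders standing in for the variables that RSOCKnode.R already has, and it reports the node name, not the IP address of the socket.

```r
# Hypothetical placeholders; in RSOCKnode.R these would come from
# the MASTER= and PORT= arguments on the command line.
master <- "XXX"
port <- 11120L

# Sys.info() returns the node name without a system() call.
this_node <- Sys.info()[["nodename"]]

debug_line <- paste0("worker on node ", this_node,
                     " connecting back to master ", master, ":", port)
message(debug_line)
```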


OK, I think I have identified the problem. The key is in the
trace above:

[1] "ssh -l lee YYY.local /usr/lib/R/bin/Rscript /home/lee/R/i686-pc-linux-gnu-library/3.2/BiocParallel/snow/RSOCKnode.R MASTER=XXX PORT=11120 OUT=/dev/null SNOWLIB=/home/lee/R/i686-pc-linux-gnu-library/3.2/BiocParallel"

Note the MASTER=XXX argument. It says XXX, not XXX.local. One can argue that BiocParallel should be using a Fully Qualified Domain Name (FQDN) because there
is no guarantee that the MASTER and Workers will be in the same domain,
that is, no guarantee that XXX will mean the same thing to the MASTER and to each of the Workers.

A more robust approach would be to send the character representation of
the IPv4 (or IPv6) address of the interface on which the master will be
listening.  Agreed, this causes _ugly_ human printout. There are a
whole raft of name-to-address-and-back discussions in the network sphere.
As of the time that I left that field, nobody had come up with a good
solution. IP address was the least worst.

As described in ?parallel::makeCluster suggested above, even changing
master to send its IP address will only work if that IP is accessible
from the workers. That is, they may have a firewall between them or
be on different private subnets.

Sending the IP is probably a "reduction in strength" fix, reducing but
not eliminating the most common pain.  The private subnet, etc. problems
are probably a "sanity testing your configuration" doc fix paragraph.

When I changed my SnowParam() call to specify master="N.N.N.N" where
N.N.N.N is the IP for my master node, everything in the example
worked as expected: Out of the 4 expected replies, 2 came from XXX and 2 from
YYY.

As time becomes available, I will try a second remote node.

Thank you for providing the information which helped me track this down.
I hope that reporting my experience helps improve BiocParallel for others.
BiocParallel is pretty slick.

If it would be useful, I could probably, on a time-available basis,
take a look at where the MASTER= parameter is constructed and see
how hard it would be to change to an IPv4/IPv6 address call
(and add an option to enable it, changing code in extensive production
is _hard_.)

Alternatively, I could write up a rough "Configuration complexities"
section. Roughly: "If the simple case works for you, great. If not,
check:
1) Is BiocParallel (hence R) installed on the remote node?
2) Can you ssh without a password from the master to the remote node?
3) Can you connect (how? ssh, possibly with password)
from the remote node back to the master?
4) Does firewall configuration on master & remote node allow access?
5) TBD"
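
As a starting point for such a section, checks 1) and 2) could be scripted from the master with base R alone. This is a rough sketch, not part of BiocParallel: the helper names ssh_check_args() and can_reach_worker() are made up for illustration, and it assumes an OpenSSH client on the master. BatchMode=yes makes ssh fail rather than prompt for a password, and running Rscript --version on the far side confirms only that R is on the remote PATH, not that BiocParallel is installed there.

```r
# Build the ssh arguments for a non-interactive test from master to worker.
# (Hypothetical helper, for illustration only.)
ssh_check_args <- function(host, user = NULL, timeout = 5L) {
  target <- if (is.null(user)) host else paste0(user, "@", host)
  c("-o", "BatchMode=yes",                    # fail instead of prompting
    "-o", paste0("ConnectTimeout=", timeout), # do not hang on a bad host
    target, "Rscript", "--version")
}

# TRUE when ssh exits with status 0, i.e. the login worked and
# Rscript was found on the remote node.
can_reach_worker <- function(host, ...) {
  status <- suppressWarnings(
    system2("ssh", ssh_check_args(host, ...), stdout = FALSE, stderr = FALSE)
  )
  identical(status, 0L)
}
```

Something like can_reach_worker("YYY.local", user = "lee") would then cover the master-to-worker direction. The worker-to-master connection and the firewall questions in 3) and 4) are not covered by this sketch.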

Bottom line: BEWARE master nodes with names like *.local. Specify

param <- SnowParam(workers = hosts, type = "SOCK",
master = "10.0.0.1")

Lee

Thanks for your patience and feedback, Lee; certainly the discussion will lead to a small expansion to the section of the vignette you were working through; if you draft something that would be great (but not required!).

The rules for determining the value of MASTER seem a bit convoluted.

If the worker is 'localhost', then master is set to localhost, regardless of other settings.

If the worker is not (literally) "localhost", then the master can be set by adding master="<hostname or ip address>" to SnowParam(). Also, from parsing the code and trial-and-error, it seems that a value provided as master="any.thing" is inserted without change into the command used on the worker (although that isn't your experience?).

If the worker is not "localhost" and no "master" is provided as an argument, then master comes from Sys.info()["nodename"]. Sys.info() gets nodename in turn from man 2 uname on linux (in the R source, R-3-3-branch/src/unix/sys-unix.c:241).
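
To make the three rules concrete, here they are restated as a small function. This is a paraphrase of the behaviour described above, not the actual BiocParallel or snow code, and the name resolve_master() is made up for illustration.

```r
# Sketch of the MASTER rules as described above (not the library's code).
resolve_master <- function(worker, master = NULL) {
  # Rule 1: a literal "localhost" worker always gets master = "localhost".
  if (identical(worker, "localhost")) return("localhost")
  # Rule 2: otherwise an explicit master= is passed through unchanged.
  if (!is.null(master)) return(master)
  # Rule 3: otherwise fall back to the node name reported by the OS.
  Sys.info()[["nodename"]]
}

resolve_master("localhost", master = "10.0.0.1")  # rule 1 wins
resolve_master("YYY.local", master = "10.0.0.1")  # rule 2
resolve_master("YYY.local")                       # rule 3, e.g. "XXX"
```

Rule 3 is the case in the report above: the node name is the short name XXX rather than XXX.local, and the worker cannot resolve it.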

A small utility to translate convenient strings to fully qualified addresses might be helpful (though tricky in a cross-platform context).
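
A very rough sketch of what such a utility might look like on a Unix-alike follows. The name to_address() is made up, the IPv4 pattern is a loose check that does not validate octet ranges, and the lookup relies on utils::nsl(), which is only available on Unix-alikes, so this does not address the cross-platform problem.

```r
# Hypothetical helper: return an address string for a host name.
to_address <- function(host) {
  # Pass through anything that already looks like a dotted IPv4 address.
  ipv4 <- "^([0-9]{1,3}\\.){3}[0-9]{1,3}$"
  if (grepl(ipv4, host)) return(host)

  # utils::nsl() exists only on Unix-alikes; it returns NULL on failure.
  utils_ns <- asNamespace("utils")
  if (exists("nsl", envir = utils_ns, inherits = FALSE)) {
    addr <- get("nsl", envir = utils_ns)(host)
    if (!is.null(addr)) return(addr)
  }
  stop("cannot resolve host name: ", host)
}
```

The result could then be passed as master = to_address("XXX.local") to SnowParam(). One caveat: on some systems a machine's own name resolves to a loopback address through /etc/hosts, which would be useless to a remote worker, so the returned value is worth checking by eye.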

Dr. Morgan,

Is this a good place for a design discussion? Would another venue, such as an issue on GitHub,  be better? Please advise.

I have taken some time to parse through the issues & run a few benchmarks.  I considered  a trifecta of

1) an audience which is much more interested in biology than net issues

2) a desire to support multiple operating systems, including windows

3) an installed base

With these givens, I think a few focused doc changes & perhaps the addition of an option would cut through a bunch of nasty net issues (IPv4 address versus Fully Qualified Domain Names, OS net lookups).

I would like to describe my suggestions in more detail, for your consideration.

Thank you.

Lee