marcin.krotkiewski
2018-06-04 20:17:30 UTC
Huh. This code also runs, but it likewise only displays 4 connect /
disconnect messages. I should add that the test R script shows 4
connect, but 8 disconnect messages. That looks like a bug to me, but where? I
guess we will contact the R forums and ask there.
Bennet: I tried to use doMPI + startMPIcluster / closeCluster. In this
case I get a warning about fork being used:
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
 Local host:         [[36000,2],1] (PID 23617)
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
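For reference, the warning can be silenced via the MCA parameter it names; note this only hides the message, it does not address the hang. A hypothetical invocation, assuming the same mk.R script and component exclusions used in the original report:

```shell
# Silence only the fork() warning (mpi_warn_on_fork is named in the warning
# text above); the underlying hang is unaffected.
SLURM_NTASKS=5 mpirun -np 1 --mca mpi_warn_on_fork 0 \
    -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll \
    R --slave < mk.R
```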
And the process hangs as well - no change.
Marcin
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Just out of curiosity, but would using Rmpi and/or doMPI help in any way?
-- bennet
On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
Thanks, Ralph!
Your code finishes normally, so I guess the reason might lie in R.
Running the R code with -mca pmix_base_verbose 1, I see that each rank calls
ext2x:client disconnect twice (each PID prints the line twice):
[...]
3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
In your example it's only called once per process.
Do you have any suspicion where the second call comes from? Might this be
the reason for the hang?
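One quick way to confirm the "twice per PID" pattern is to count the disconnect lines per process in the verbose output. A small sketch (plain text counting over the log excerpt above; not Rmpi- or PMIx-specific):

```python
from collections import Counter
import re

# The verbose-output excerpt from above.
log = """\
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
"""

# Count "ext2x:client disconnect" lines per PID.
pids = Counter(
    m.group(1)
    for m in re.finditer(r"\[[\w.]+:(\d+)\] ext2x:client disconnect", log)
)
for pid, n in sorted(pids.items()):
    print(pid, n)  # every PID appears exactly twice
```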
Thanks!
Marcin
Try running the attached example dynamic code - if that works, then it
likely is something to do with how R operates.
On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
Hi,
I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
simple R script, which starts a few tasks, hangs at the end on disconnect.
library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)
SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R
--slave < mk.R
Notice -np 1 - this is apparently how you start Rmpi jobs: the ranks are spawned
dynamically.
1. With HPCX it seems that dynamic starting of ranks is not supported, hence
I had to turn off all of yalla/mxm/hcoll:
--------------------------------------------------------------------------
Your application has invoked an MPI function that is not supported in
this environment.
MPI function: MPI_Comm_spawn
Reason: the Yalla (MXM) PML does not support MPI dynamic process
functionality
--------------------------------------------------------------------------
2. When I do that, the program does create a 'cluster' and starts the ranks,
but then hangs at the end. Backtrace of the hung process (the topmost frames
are truncated):
/lib64/libpthread.so.0
client/pmix_client_connect.c:232
#2 0x00007f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at
ext2x_client.c:1432
#3 0x00007f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at
dpm/dpm.c:596
#4 0x00007f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at
pcomm_disconnect.c:67
#5 0x00007f66a16799e9 in mpi_comm_disconnect () from
/cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
#6 0x00007f66b2563de5 in do_dotcall () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#7 0x00007f66b25a207b in bcEval () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#8 0x00007f66b25b0fd0 in Rf_eval.localalias.34 () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#9 0x00007f66b25b2c62 in R_execClosure () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
Might this also be related to the dynamic rank creation in R?
Thanks!
Marcin