[OMPI users] OMPI 3.1.x, PMIx, SLURM, and mpiexec/mpirun
Bennet Fauber
2018-11-12 01:21:02 UTC
I have been having some difficulty getting the right combination of
SLURM, PMIx, and OMPI 3.1.x (specifically 3.1.2) to compile in such a way
that both the srun method of starting jobs and mpirun/mpiexec work.

If someone has a SLURM 18.08 or newer, PMIx, and OMPI 3.x combination that
works with both srun and mpirun, and wouldn't mind sending me the version
numbers and any tips for getting this to work, I would appreciate it.

Should mpirun still work? If that is just off the table and I missed the
memo, please let me know.

I'm asking for both because of programs like OpenFOAM and others where
mpirun is built into the application. I have OMPI 1.10.7 built with
similar flags, and it seems to work.

[***@beta-build mpi_example]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is: 0.000458

[***@beta-build mpi_example]$ mpirun ./test_mpi
The sum = 0.866386
Elapsed time is: 0.000295

The SLURM documentation doesn't seem to list a recommended PMIx version,
as far as I can find, and I can't find where the version of PMIx that is
bundled with OMPI is specified.

I have slurm 18.08.0, which is built against pmix-2.0.2. We settled on
that version with SLURM 17.something prior to SLURM supporting PMIx 2.1.
Is OMPI 3.1.2 balking at too old a PMIx?

Sorry to be so at sea.

I built OMPI with

./configure \
--prefix=${PREFIX} \
--mandir=${PREFIX}/share/man \
--with-pmix=/opt/pmix/2.0.2 \
--with-libevent=external \
--with-hwloc=external \
--with-slurm \
--with-verbs \
--disable-dlopen --enable-shared \
CC=gcc CXX=g++ FC=gfortran
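In case it is useful to anyone reproducing this: I believe the version of an installed PMIx can be read from the PMIX_VERSION_* macros in its pmix_version.h. A small sketch that parses a sample copy of those macros (on a real system, point HDR at the header under your install prefix, e.g. /opt/pmix/2.0.2/include/pmix_version.h -- that path is my install, yours will differ):

```shell
# Sketch: read the PMIx version out of pmix_version.h.
# Here we parse a sample copy of the macros written to a temp file;
# on a real system, set HDR to the installed header instead.
HDR="$(mktemp)"
cat > "$HDR" <<'EOF'
#define PMIX_VERSION_MAJOR 2L
#define PMIX_VERSION_MINOR 0L
#define PMIX_VERSION_RELEASE 2L
EOF

# Each macro value looks like "2L"; strip the trailing L.
major=$(awk '$2 == "PMIX_VERSION_MAJOR"   {sub("L","",$3); print $3}' "$HDR")
minor=$(awk '$2 == "PMIX_VERSION_MINOR"   {sub("L","",$3); print $3}' "$HDR")
rel=$(awk   '$2 == "PMIX_VERSION_RELEASE" {sub("L","",$3); print $3}' "$HDR")

echo "PMIx version: ${major}.${minor}.${rel}"
rm -f "$HDR"
```

On the Slurm side, `srun --mpi=list` shows which PMI plugins your build supports, which is a quick sanity check that the pmix plugin got built at all.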

I have a simple test program, and it runs with

[***@beta-build mpi_example]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is: 0.000573

but, on a login node, where I just want a few processors on the local node,
not to run on the compute nodes of the cluster, mpirun fails with

[***@beta-build mpi_example]$ mpirun -np 2 ./test_mpi
[beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG:
Not found in file base/ess_base_std_app.c at line 219
[beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG:
Not found in file base/ess_base_std_app.c at line 219
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

store DAEMON URI failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[beta-build.stage.arc-ts.umich.edu:102541] [[13610,1],0] ORTE_ERROR_LOG:
Not found in file ess_pmi_module.c at line 401
[beta-build.stage.arc-ts.umich.edu:102542] [[13610,1],1] ORTE_ERROR_LOG:
Not found in file ess_pmi_module.c at line 401
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_init failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[beta-build.stage.arc-ts.umich.edu:102541] Local abort before MPI_INIT
completed completed successfully, but am not able to aggregate error
messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[beta-build.stage.arc-ts.umich.edu:102542] Local abort before MPI_INIT
completed completed successfully, but am not able to aggregate error
messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

Process name: [[13610,1],0]
Exit code: 1
--------------------------------------------------------------------------
[beta-build.stage.arc-ts.umich.edu:102536] 3 more processes have sent help
message help-orte-runtime.txt / orte_init:startup:internal-failure
[beta-build.stage.arc-ts.umich.edu:102536] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages
[beta-build.stage.arc-ts.umich.edu:102536] 1 more process has sent help
message help-orte-runtime / orte_init:startup:internal-failure
[beta-build.stage.arc-ts.umich.edu:102536] 1 more process has sent help
message help-mpi-runtime.txt / mpi_init:startup:internal-failure
Ralph H Castain
2018-11-12 17:41:05 UTC
mpirun should definitely still work in parallel with srun - they aren’t mutually exclusive. OMPI 3.1.2 contains PMIx v2.1.3.

The problem here is that you built Slurm against PMIx v2.0.2, which is not cross-version capable. You can see the cross-version situation here: https://pmix.org/support/faq/how-does-pmix-work-with-containers/

Your options would be to build OMPI against the same PMIx 2.0.2 you used for Slurm, or update the PMIx version you used for Slurm to something that can support cross-version operations.
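If it helps, here is that rule as a throwaway shell helper (the function name is made up, and the cutoff reflects my understanding that PMIx 2.1.0 was the first release with the cross-version handshake):

```shell
# Hypothetical helper: print "yes" if the given PMIx version supports
# cross-version operation (my reading: 2.1.0 and later do; 2.0.x and
# earlier do not), else "no".
pmix_cross_version_ok() {
  major="${1%%.*}"          # text before the first dot
  rest="${1#*.}"            # text after the first dot
  minor="${rest%%.*}"       # text before the next dot
  if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 1 ]; }; then
    echo yes
  else
    echo no
  fi
}

pmix_cross_version_ok 2.0.2   # the PMIx Slurm was built against -> no
pmix_cross_version_ok 2.1.3   # the PMIx embedded in OMPI 3.1.2  -> yes
```

Note that both sides of the connection -- the Slurm plugin and the PMIx inside OMPI -- need a cross-version-capable release for the handshake to work.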

Ralph
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Bennet Fauber
2018-11-12 17:45:29 UTC
Thanks, Ralph,

I did try building OMPI against the same PMIx 2.0.2, using the configure
option --with-pmix=/opt/pmix/2.0.2, but it sounds like the better route
would be to upgrade to PMIx 2.1.
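For the archives, the plan I intend to try, sketched out (the prefixes and the exact 2.1.x release below are placeholders, not a tested recipe):

```shell
# Placeholder paths and versions -- a plan, not a tested recipe.
# 1) Build a cross-version-capable PMIx:
cd pmix-2.1.x && ./configure --prefix=/opt/pmix/2.1.x && make && make install

# 2) Rebuild Slurm against it so slurmd's PMIx plugin picks it up:
cd slurm-18.08.x && ./configure --with-pmix=/opt/pmix/2.1.x && make && make install

# 3) Rebuild OMPI 3.1.2 with the same --with-pmix, keeping the other
#    flags from my configure line above:
cd openmpi-3.1.2 && ./configure --prefix=${PREFIX} --with-pmix=/opt/pmix/2.1.x \
    --with-libevent=external --with-hwloc=external --with-slurm --with-verbs
```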

Thanks, and I'll give it a try!

-- bennet