Discussion:
[OMPI users] Problem with MPI_Comm_spawn using openmpi 2.0.x + sbatch
Jing Gong
2017-02-21 13:08:07 UTC
Hi,


This email follows up on the thread "Problem with MPI_Comm_spawn using openmpi 2.0.x + sbatch":

https://mail-archive.com/***@lists.open-mpi.org/msg30650.html

We have installed the latest version, v2.0.2, on the cluster that Anastasia Kruchinina was running on (https://mail-archive.com/***@lists.open-mpi.org/msg30654.html).

It seems to me that the issue is still not fixed in v2.0.2.


The job script and sample code can be found at


https://www.pdc.kth.se/~gongjing/files/test_spawn/


The messages we got were:


$ cat error_file.e



Currently Loaded Modulefiles:
[t03n06.pdc.kth.se:39767] OPAL ERROR: Timeout in file base/pmix_base_fns.c at line 193
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)


$ cat output_file.o

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_dpm_dyn_init() failed
--> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------



Please let me know if you need additional information.


Thanks a lot for your help.


Regards, Jing Gong
Jing Gong
2017-09-15 12:15:13 UTC
Hi,


We tried to run an OpenFOAM job on 4480 CPUs under the IBM LSF system, but got the following error messages:


...

[bs209:16251] [[25529,0],0] ORTE_ERROR_LOG: The specified application failed to start in file /software/OpenFOAM/ThirdParty-v1606+/openmpi-1.10.2/orte/mca/plm/lsf/plm_lsf_module.c at line 346
[bs209: 16251] lsb_launch failed: 0
...


OpenFOAM is built with the Open MPI 1.10.2 from its ThirdParty package, and it works fine with around 2000 CPUs on the same cluster.


Is the issue related to the LSF system? Are there any Open MPI flags available to diagnose the problem?


Thanks a lot.


Regards, Jing
Josh Hursey
2017-09-15 13:35:03 UTC
That line of code is here:

https://github.com/open-mpi/ompi/blob/v1.10.2/orte/mca/plm/lsf/plm_lsf_module.c#L346
(Unfortunately we didn't catch the rc from lsb_launch to see why it failed
- I'll fix that).

So it looks like LSF failed to launch our daemon on one or more remote
machines. This could be an LSF issue on one of the machines in your
allocation. One thing to try is a blaunch from the command line to launch
one process per node in your allocation (which is similar to what we are
trying to do in this function). I would expect that to fail, but it might show
you which machine is problematic.
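
The per-node check suggested above could be scripted roughly as follows. This is only a sketch, not something from the thread: the helper names unique_hosts and probe_hosts are made up for illustration, and it assumes that blaunch accepts the form "blaunch <host> <command>" and that it is run from inside the LSF job allocation. It launches one trivial command per node and collects the hosts where the launch fails.

```python
import subprocess


def unique_hosts(hostfile_text):
    """Return host names in first-seen order, one entry per node."""
    seen = []
    for name in hostfile_text.split():
        if name not in seen:
            seen.append(name)
    return seen


def probe_hosts(hosts, launcher=("blaunch",), command=("hostname",), timeout=60):
    """Run `launcher host command` once per host.

    Returns a dict mapping each failing host to a short error description.
    The default launcher assumes `blaunch <host> <command>` is valid on
    your LSF installation; pass a different tuple if it is not.
    """
    failures = {}
    for host in hosts:
        try:
            result = subprocess.run(
                [*launcher, host, *command],
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                universal_newlines=True,
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Launcher missing, not executable, or the remote launch hung.
            failures[host] = repr(exc)
            continue
        if result.returncode != 0:
            failures[host] = result.stderr.strip() or "exit code %d" % result.returncode
    return failures
```

Inside a job, the host list could be read from the file named by the LSB_DJOB_HOSTFILE environment variable (assuming your LSF version sets it), passed through unique_hosts, and then handed to probe_hosts; any host that appears in the returned dict is a candidate for the problematic machine.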
--
Josh Hursey
IBM Spectrum MPI Developer