Discussion:
[OMPI users] Failed to register memory (openmpi 2.0.2)
Mark Dixon
2017-10-18 16:00:39 UTC
Hi,

We're intermittently seeing messages (below) about failing to register
memory with Open MPI 2.0.2 on CentOS 7 / Mellanox FDR ConnectX-3 and the
vanilla IB stack as shipped by CentOS.

We're not using any mlx4_core module tweaks at the moment. On earlier
machines we used to set registered memory as per the FAQ, but neither
log_num_mtt nor num_mtt seems to exist these days (according to
/sys/module/mlx4_*/parameters/*), which makes it somewhat difficult to
follow the FAQ.
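
For anyone wanting to repeat the check above, here is a minimal sketch that reads whatever parameters a loaded module exposes under /sys/module. The helper name read_module_params is made up for illustration, and log_mtts_per_seg is included only because the FAQ mentions it alongside log_num_mtt; which names actually appear depends on the driver build.

```python
from pathlib import Path


def read_module_params(module, sys_root="/sys/module"):
    """Return {parameter name: value} for a loaded kernel module."""
    param_dir = Path(sys_root) / module / "parameters"
    params = {}
    if not param_dir.is_dir():
        # Module not loaded, or it exposes no parameters at all.
        return params
    for entry in sorted(param_dir.iterdir()):
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameters are not readable by ordinary users.
            params[entry.name] = None
    return params


if __name__ == "__main__":
    params = read_module_params("mlx4_core")
    for name in ("log_num_mtt", "num_mtt", "log_mtts_per_seg"):
        print(name, "=", params.get(name, "<not present>"))
```

A parameter reported as "<not present>" here is simply not exported by the driver in use, which matches what is described above for log_num_mtt and num_mtt.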

The output of 'ulimit -l' shows as unlimited for every rank.
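
One way to confirm the limit from inside each launched process, rather than from a login shell, is a small script like the sketch below. It uses Python's standard resource module; the script name check_memlock.py in the usage note is hypothetical. Note that getrlimit reports bytes, whereas 'ulimit -l' in bash reports kilobytes.

```python
import resource
import socket


def format_limit(value):
    """Render an rlimit value the way 'ulimit' would name it."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"{socket.gethostname()}: memlock soft={soft} hard={hard}")
```

Launching it the same way as the real job (for example "mpirun -np 4 python3 check_memlock.py") shows the limits that processes started by the launcher actually inherit, which can differ from an interactive shell.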

Does anyone have any advice, please?

Thanks,

Mark

-------------------------------------------------------------------------
Failed to register memory region (MR):

Hostname: dc1s0b1c
Address: ec5000
Length: 20480
Error: Cannot allocate memory
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.
r***@open-mpi.org
2017-10-18 18:41:50 UTC
Put “oob=tcp” in your default MCA param file
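
As a sketch of what that looks like in practice: MCA parameter files hold one "name = value" setting per line, so the suggestion amounts to a line reading "oob = tcp". The helper below (set_mca_param is a made-up name) adds or updates such a line. The per-user path $HOME/.openmpi/mca-params.conf shown in the comment is an assumption about which file is meant; a system-wide file under the Open MPI install prefix (etc/openmpi-mca-params.conf) is the other usual candidate.

```python
from pathlib import Path


def set_mca_param(conf_path, name, value):
    """Add or update a 'name = value' line in an MCA parameter file."""
    path = Path(conf_path)
    lines = path.read_text().splitlines() if path.exists() else []
    new_line = f"{name} = {value}"
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comments and anything that is not an assignment untouched.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        if stripped.split("=", 1)[0].strip() == name:
            lines[i] = new_line  # parameter already set: replace its value
            break
    else:
        lines.append(new_line)  # parameter not present yet: append it
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n")


# Example call (assumed per-user location, not confirmed in this thread):
# set_mca_param(Path.home() / ".openmpi" / "mca-params.conf", "oob", "tcp")
```

The same setting can also be tried for a single run with "mpirun --mca oob tcp ..." before making it the default.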
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
Mark Dixon
2017-10-19 09:36:24 UTC
Thanks Ralph, will do.

Cheers,

Mark
Mark Dixon
2017-11-13 15:42:01 UTC
Hi there,

We're intermittently seeing messages (below) about failing to register
memory with Open MPI 2.0.2 on CentOS 7 / Mellanox FDR ConnectX-3 / 24-core
126G RAM Broadwell nodes and the vanilla IB stack as shipped by CentOS.

(We previously saw similar messages for the "ud" oob component but, as
recommended in this thread, we stopped oob from using openib via an MCA
parameter.)

I've checked to see what the registered memory limit is (by setting
mlx4_core's debug_level, rebooting and examining kernel messages) and it's
double the system RAM - which I understand is the recommended setting.
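
For comparison with the figure read from the kernel messages, the FAQ's rule of thumb can be worked through by hand. As far as I recall, the FAQ gives max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE; the sketch below assumes that formula. The parameter values in the example are purely illustrative and are not taken from these nodes, where log_num_mtt is not even exposed.

```python
GIB = 2 ** 30


def max_registered_memory(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Registrable memory in bytes, per the formula assumed from the FAQ."""
    return (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size


def covers_twice_ram(reg_mem_bytes, ram_bytes):
    """True if the registrable memory is at least double the physical RAM."""
    return reg_mem_bytes >= 2 * ram_bytes


if __name__ == "__main__":
    # Illustrative values only (not measured on the nodes in this thread).
    reg = max_registered_memory(log_num_mtt=24, log_mtts_per_seg=1)
    print(reg // GIB, "GiB registrable")
    print("covers 2 x 126 GiB:", covers_twice_ram(reg, 126 * GIB))
```

With 4 KiB pages, those illustrative values give 2^37 bytes, i.e. 128 GiB, which would fall short of double the RAM on a 126G node; the limit reported above (double the system RAM) is already at the recommended level.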

Any ideas about what might be going on, please?

Thanks,

Mark


--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    dc1s0b1a
  OMPI source:   btl_openib.c:752
  Function:      opal_free_list_init()
  Device:        mlx4_0
  Memlock limit: unlimited

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
[dc1s0b1a][[59067,1],0][btl_openib.c:1035:mca_btl_openib_add_procs] could not prepare openib device for use
[dc1s0b1a][[59067,1],0][btl_openib.c:1186:mca_btl_openib_get_ep] could not prepare openib device for use
[dc1s0b1a][[59067,1],0][connect/btl_openib_connect_udcm.c:1522:udcm_find_endpoint] could not find endpoint with port: 1, lid: 69, msg_type: 100