Discussion:
[OMPI users] *** Error in `orted': double free or corruption (out): 0x00002aaab4001680 ***, in some node combos.
Balazs HAJGATO
2018-09-10 21:36:51 UTC
Dear list readers,

I have a problem with OpenMPI 3.1.1. With some node combinations, I get the errors (libibverbs: GRH is mandatory For RoCE address handle; *** Error in `/apps/brussel/CO7/ivybridge-ib/software/OpenMPI/3.1.1-GCC-7.3.0-2.30/bin/orted': double free or corruption (out): 0x00002aaab4001680 ***), see details in file 114_151.out.bz2, even with the simplest possible run, such as
mpirun -host nic114,nic151 hostname
In the file 114_151.out.bz2, you can see the output if I run the command from nic114. If I run the same command from nic151, it simply spits out the hostnames, without any errors.

I have also enclosed the ompi_info --all --parsable output from nic114 (see ompi.nic114.bz2; the output from nic151 is identical). I do not have the config.log file, although I still have the config output (see config.out.bz2). The nodes run identical operating systems (we use the same image), and OpenMPI is loaded from a central directory shared among the nodes. We have an InfiniBand network (with IP over IB) and an Ethernet network. Intel MPI works without a problem, and I confirmed that it uses the IB network.

It is not clear whether the orted error is a consequence of the libibverbs error, nor is it clear why OpenMPI wants to use RoCE at all. (The ibv_devinfo output is also attached. We do have a somewhat creative InfiniBand topology, based on a fat tree, but changing the topology did not solve the problem.) The /tmp directory is writable and not full. I get the same error with OpenMPI 2.0.2 and 2.1.1, but not with OpenMPI 1.10.2 and 1.10.3. Does anyone have any thoughts about this issue?
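
Since the question is why RoCE comes into play at all, one thing worth comparing between the two nodes is the link layer that each verbs port reports. The following is only a sketch: list_link_layers is a hypothetical helper name, and it assumes the usual ibv_devinfo text layout with hca_id:, port: and link_layer: fields.

```shell
# Print "<device> port <n> <link_layer>" for every port found in
# ibv_devinfo output read from standard input.
# (list_link_layers is a hypothetical helper, not part of any package.)
list_link_layers() {
    awk '
        $1 == "hca_id:"     { dev = $2 }
        $1 == "port:"       { port = $2 }
        $1 == "link_layer:" { print dev, "port", port, $2 }
    '
}

# Only query the hardware if ibv_devinfo is installed on this node.
if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devinfo | list_link_layers
fi
```

A port that reports Ethernet rather than InfiniBand is one that verbs would address as RoCE, so differences between nic114 and nic151 here may be worth a look.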

Regards,

Balazs Hajgato
Jeff Squyres (jsquyres) via users
2018-09-11 22:37:34 UTC
Thanks for reporting the issue.

First, you can work around the issue by using:

mpirun --mca oob tcp ...

This uses a different out-of-band plugin (TCP) instead of verbs unreliable datagrams.
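
For a production setup it may be more convenient to set the parameter once rather than on every command line. A minimal sketch, assuming the standard Open MPI mechanisms for MCA parameters (OMPI_MCA_<name> environment variables and a per-user mca-params.conf file):

```shell
# Equivalent to passing "--mca oob tcp" on every mpirun command line:
# Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables.
export OMPI_MCA_oob=tcp

# A per-user default can instead be stored in
# $HOME/.openmpi/mca-params.conf as the single line:
#   oob = tcp

# Show what subsequent mpirun invocations in this shell will pick up.
echo "OMPI_MCA_oob=${OMPI_MCA_oob}"
```

With the variable exported, a plain "mpirun -host nic114,nic151 hostname" in the same shell should select the TCP out-of-band component without any extra flags.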

Second, I just filed a fix for our current release branches (v2.1.x, v3.0.x, and v3.1.x):

https://github.com/open-mpi/ompi/issues/5672

Could you try it out and let me know if it works for you?

Thanks!
--
Jeff Squyres
***@cisco.com
Balázs Hajgató
2018-09-12 08:54:25 UTC
Dear Jeff,

Setting mca oob to tcp works. I will stick to this solution in our
production environment.

I am not sure whether it is relevant, but I also tried the patch on a
non-production OpenMPI 3.1.1. With the patch,
"mpirun -host nic114,nic151 hostname" works without any extra parameters,
but it still prints the libibverbs error
(libibverbs: GRH is mandatory For RoCE address handle).

However, if I force mca oob ud, it does not work. It hangs after
printing the error:
[nic151:23609] [[45140,0],2] ORTE_ERROR_LOG: Unreachable in file oob_ud_send.c at line 141

After a "ctrl-c" it prints:
[nic115:23707] [[45140,0],0] ORTE_ERROR_LOG: Unreachable in file oob_ud_send.c at line 141
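
When reproducing a hang like this one, it can help to put a time limit on the run instead of waiting for a manual ctrl-c. This is only a sketch: run_with_deadline is a hypothetical helper name, and it assumes GNU coreutils timeout is available on the node.

```shell
# Run a command with a time limit given in seconds. GNU timeout exits
# with status 124 when the limit is reached and the command is killed.
# (run_with_deadline is a hypothetical helper name.)
run_with_deadline() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Example for the case above (not executed here):
#   run_with_deadline 60 mpirun --mca oob ud -host nic114,nic151 hostname
```

An exit status of 124 then distinguishes "hung until the deadline" from an ordinary failure of the command itself.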

Thank you for your answer!

Regards,

Balazs
--
HPC consultant
HPC/VSC Support and System Administration
Computing Center
ULB/VUB
Avenue Adolphe Buyllaan 91 - CP 197
1050 Brussels
Belgium
Jeff Squyres (jsquyres) via users
2018-09-13 19:12:37 UTC
> Setting mca oob to tcp works. I will stick to this solution in our production environment.

Great!

> I am not sure that it is relevant, but I also tried the patch on a non-production OpenMPI 3.1.1, and "mpirun -host nic114,nic151 hostname" works without any parameters, but issuing the libibverbs error (libibverbs: GRH is mandatory For RoCE address handle)

Yeah, I didn't think the fix I did would get rid of that warning. I didn't dig any deeper in the "ud" oob plugin than looking for that double free.

> [nic151:23609] [[45140,0],2] ORTE_ERROR_LOG: Unreachable in file oob_ud_send.c at line 141

Somehow the ud oob plugin is failing to make a UD IB verbs handle to contact all the other possible interfaces for the peer. I'm not sure why that is happening -- perhaps IPoIB isn't set up? This is likely a question for your IB support people.
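
A quick way to look into the IPoIB question on a given node is to list the network interfaces whose link type is InfiniBand and check that each one has an address. The sketch below assumes Linux sysfs, where IPoIB interfaces report link type 32; list_ipoib_ifaces is a hypothetical helper name.

```shell
# List network interfaces whose link type is InfiniBand (ARPHRD type 32),
# which is how IPoIB interfaces such as ib0 appear in Linux sysfs.
# The optional argument overrides the sysfs directory (useful for testing).
# (list_ipoib_ifaces is a hypothetical helper name.)
list_ipoib_ifaces() {
    root=${1:-/sys/class/net}
    for dir in "$root"/*; do
        [ -r "$dir/type" ] || continue
        if [ "$(cat "$dir/type")" = "32" ]; then
            basename "$dir"
        fi
    done
}

# Show the IPv4 addresses of each IPoIB interface found on this node.
for ifc in $(list_ipoib_ifaces); do
    ip -4 -o addr show dev "$ifc"
done
```

If an interface is listed but no address line follows it, IPoIB is probably not configured on that port.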

It shouldn't be hanging, either, but that's unlikely to get fixed, unfortunately (i.e., because this is an uncommon error and because the "ud" oob component is EOLed / will be removed in Open MPI v4.0.0 -- it's on its last, dying breaths in the v3.1.x series...).
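
To see whether a given installation still ships the "ud" component, the oob entries in the ompi_info listing can be filtered. This is a sketch under the assumption that ompi_info prints one line per component in the form "MCA oob: <name> (...)"; list_oob_components is a hypothetical helper name.

```shell
# Print the names of the out-of-band (oob) components found in
# ompi_info output read from standard input.
# (list_oob_components is a hypothetical helper name.)
list_oob_components() {
    awk '$1 == "MCA" && $2 == "oob:" { print $3 }'
}

# Only query the installation if ompi_info is on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | list_oob_components
fi
```

If "ud" does not appear in the output, only the remaining components (such as tcp) can be selected with --mca oob.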

--
Jeff Squyres
***@cisco.com
