[OMPI users] cannot run openmpi 2.1
Kapetanakis Giannis
2018-08-11 12:54:17 UTC
Hi,

I'm struggling to get 2.1.x to work on our HPC cluster.

Versions 1.8.8 and 3.x work fine.

With 2.1.3 and 2.1.4 I get errors and segmentation faults. The builds
have InfiniBand and Slurm support.
Running mpirun locally works fine. Any help debugging this?

[node39:20090] [[50526,1],2] usock_peer_recv_connect_ack: received
unexpected process identifier [[50526,0],0] from [[50526,0],1]
[node39:20053] [[50526,0],0]-[[50526,1],2]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20088] [[50526,1],0] usock_peer_recv_connect_ack: received
unexpected process identifier [[50526,0],0] from [[50526,0],1]
[node39:20053] [[50526,0],0]-[[50526,1],2]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20096] [[50526,1],8] usock_peer_recv_connect_ack: received
unexpected process identifier [[50526,0],0] from [[50526,0],1]
[node39:20053] [[50526,0],0]-[[50526,1],2]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],6]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],2]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],6]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20094] [[50526,1],6] usock_peer_recv_connect_ack: received
unexpected process identifier [[50526,0],0] from [[50526,0],1]
[node39:20053] [[50526,0],0]-[[50526,1],2]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],0]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],8]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20053] [[50526,0],0]-[[50526,1],6]
mca_oob_usock_peer_recv_handler: invalid socket state(1)
[node39:20097] [[50526,1],9] usock_peer_recv_connect_ack: received
unexpected process identifier [[50526,0],0] from [[50526,0],1]
[node39:20092] [[50526,1],4] usock_peer_recv_connect_ack: received
unexpected process identifier [[50526,0],0] from [[50526,0],1]


Part of the debug output:

[node39:20515] mca:oob:select: Inserting component
[node39:20515] mca:oob:select: Found 3 active transports
[node39:20515] [[50428,1],9]: set_addr to uri
3304849408.1;usock;tcp://192.168.20.113,10.1.7.69:37147;ud://181895.60.1
[node39:20515] [[50428,1],9]:set_addr checking if peer [[50428,0],1] is
reachable via component usock
[node39:20515] [[50428,1],9]:[oob_usock_component.c:349] connect to
[[50428,0],1]
[node39:20515] [[50428,1],9]: peer [[50428,0],1] is reachable via
component usock
[node39:20515] [[50428,1],9]:set_addr checking if peer [[50428,0],1] is
reachable via component tcp
[node39:20515] [[50428,1],9] oob:tcp: ignoring address usock
[node39:20515] [[50428,1],9] oob:tcp: working peer [[50428,0],1] address
tcp://192.168.20.113,10.1.7.69:37147
[node39:20515] [[50428,1],9] PASSING ADDR 192.168.20.113 TO MODULE
[node39:20515] [[50428,1],9]:tcp set addr for peer [[50428,0],1]
[node39:20515] [[50428,1],9] PASSING ADDR 10.1.7.69 TO MODULE
[node39:20515] [[50428,1],9]:tcp set addr for peer [[50428,0],1]
[node39:20515] [[50428,1],9] oob:tcp: ignoring address ud://181895.60.1
[node39:20515] [[50428,1],9]: peer [[50428,0],1] is reachable via
component tcp
[node39:20515] [[50428,1],9]:set_addr checking if peer [[50428,0],1] is
reachable via component ud
[node39:20515] [[50428,1],9] oob:ud:set_addr: setting location for peer
[[50428,0],1] from ud://181895.60.1
[node39:20515] [[50428,1],9]: peer [[50428,0],1] is reachable via
component ud
[node39:20515] [[50428,1],9] orte_usock_peer_try_connect: attempting to
connect to proc [[50428,0],1]
[node39:20515] [[50428,1],9] orte_usock_peer_try_connect: attempting to
connect to proc [[50428,0],1] on socket 21
[node39:20515] [[50428,1],9] orte_usock_peer_try_connect: attempting to
connect to proc [[50428,0],1] - 0 retries
[node39:20515] [[50428,1],9] orte_usock_peer_try_connect: Connection
across to proc [[50428,0],1] succeeded
[node39:20515] [[50428,1],9] SEND CONNECT ACK
[node39:20515] [[50428,1],9] send blocking of 232 bytes to socket 21
[node39:20515] [[50428,1],9] blocking send complete to socket 21
[node39:20515] [[50428,1],9]:tcp:processing set_peer cmd
[node39:20515] [[50428,1],9] SET_PEER ADDING PEER [[50428,0],1]
[node39:20515] [[50428,1],9] set_peer: peer [[50428,0],1] is listening
on net 192.168.20.113 port 37147
[node39:20515] [[50428,1],9]:tcp:processing set_peer cmd
[node39:20515] [[50428,1],9] set_peer: peer [[50428,0],1] is listening
on net 10.1.7.69 port 37147
[node39:20515] [[50428,1],9] oob:ud:get_addr contact information:
ud://181905.60.1
[node39:20471] [[50428,0],0]:tcp:recv:handler called for peer [[50428,0],1]
[node39:20471] [[50428,0],0]:tcp:recv:handler CONNECTED
[node39:20471] [[50428,0],0]:tcp:recv:handler allocate new recv msg
[node39:20471] [[50428,0],0]:tcp:recv:handler read hdr
[node39:20471] [[50428,0],0]:tcp:recv:handler allocate data region of
size 3697
[node39:20471] [[50428,0],0] RECVD COMPLETE MESSAGE FROM [[50428,0],1]
(ORIGIN [[50428,0],1]) OF 3697 BYTES FOR DEST [[50428,0],0] TAG 2
[node39:20471] [[50428,0],0] DELIVERING TO RML
[node39:20512] [[50428,1],6]:usock:recv:handler called for peer
[[50428,0],1]
[node39:20512] [[50428,1],6] RECV CONNECT ACK FROM [[50428,0],1] ON
SOCKET 21
[node39:20512] [[50428,1],6] waiting for connect ack from [[50428,0],1]
[node39:20512] [[50428,1],6] connect ack received from [[50428,0],1]
[node39:20512] [[50428,1],6] connect-ack recvd from [[50428,0],1]
[node39:20512] [[50428,1],6] usock_peer_recv_connect_ack: received
unexpected process identifier [[50428,0],0] from [[50428,0],1]
[node39:20512] [[50428,1],6] usock_peer_close for [[50428,0],1] sd 21
state FAILED
[node39:20512] [[50428,1],6] UNABLE TO COMPLETE CONNECT ACK WITH
[[50428,0],1]
[node39:20512] [[50428,1],6] usock:lost connection called for peer
[[50428,0],1]
[node39:20471] [[50428,0],0]:tcp:recv:handler called for peer [[50428,0],1]
[node39:20471] [[50428,0],0]:tcp:recv:handler CONNECTED
[node39:20471] [[50428,0],0]:tcp:recv:handler allocate new recv msg
[node39:20471] [[50428,0],0]:tcp:recv:handler read hdr
[node39:20471] [[50428,0],0]:tcp:recv:handler allocate data region of
size 4118
[node39:20471] [[50428,0],0] RECVD COMPLETE MESSAGE FROM [[50428,0],1]
(ORIGIN [[50428,0],1]) OF 4118 BYTES FOR DEST [[50428,0],0] TAG 2
[node39:20471] [[50428,0],0] DELIVERING TO RML
[node39:20514] [[50428,1],8] oob:ud:port_recv_start posting 512 message
buffers


thanks,

G
Ralph H Castain
2018-08-11 13:39:11 UTC
Put "oob = ^usock" in your default MCA param file, or add OMPI_MCA_oob=^usock to your environment.
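
For readers who want to try this, the two options above could look roughly like the sketch below. In MCA parameter syntax a leading caret excludes the listed components, so "^usock" tells the oob framework to skip usock and use whatever else is available. The per-user file location ($HOME/.openmpi/mca-params.conf) is assumed to be the default for this installation, and MCA_PARAMS_FILE is a hypothetical helper variable used only in this sketch.

```shell
#!/bin/sh
# Option 1: set the parameter through the environment.
# The quotes keep the shell from giving '^' any special meaning.
export OMPI_MCA_oob='^usock'

# Option 2: add the parameter to the per-user MCA parameter file.
# Assumption: the default per-user location; adjust if your site differs.
# MCA_PARAMS_FILE is a hypothetical name introduced for this sketch.
MCA_PARAMS_FILE="${MCA_PARAMS_FILE:-$HOME/.openmpi/mca-params.conf}"
mkdir -p "$(dirname "$MCA_PARAMS_FILE")"

# Append the setting only if the file has no "oob = ..." line yet.
grep -q '^oob[[:space:]]*=' "$MCA_PARAMS_FILE" 2>/dev/null \
  || echo 'oob = ^usock' >> "$MCA_PARAMS_FILE"
```

For a one-off test, the same parameter can also be given on the command line, as in "mpirun --mca oob ^usock ...", which avoids touching any file.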
Kapetanakis Giannis
2018-08-11 17:00:41 UTC
Post by Ralph H Castain
Put "oob = ^usock" in your default MCA param file, or add OMPI_MCA_oob=^usock to your environment
Thank you very much, that did the trick.

Could you please explain what this setting does? I cannot find any documentation on it.

G
