Discussion:
[OMPI users] usNIC BTL unrecognized payload type 255 when running under SLURM srun but not mpiexec/mpirun
Forai,Petar
2017-11-09 23:51:43 UTC
Hi everyone!

We’re observing output such as the following when running non-trivial MPI software through SLURM’s srun:

[cn-11:52778] unrecognized payload type 255
[cn-11:52778] base = 0x9ce2c0, proto = 0x9ce2c0, hdr = 0x9ce300
[cn-11:52778] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-11:52778] 10: 00 00 00 00 00 00 06 02 ff 0c 1f c2 06 02 ff 0c
[cn-11:52778] 20: b9 8f 08 00 45 00 00 3c 00 00 40 00 08 11 5d 5d
[cn-11:52778] 30: 0a 95 00 16 0a 95 00 15 e5 05 e8 d9 00 28 7c 8c
[cn-11:52778] 40: 01 00 00 00 00 00 31 b6 00 00 8f e3 00 00 00 00
[cn-11:52778] 50: 00 00 00 00 00 00 06 02 ff 0c d3 25 06 02 ff 0c
[cn-11:52778] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-11:52778] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00


The output is independent of the application, BUT it does NOT appear when running with mpiexec/mpirun. When we switch to the TCP or vader BTL, the output is clean and the message is not observed. It is emitted by different ranks on various nodes, so it is not tied to a reproducible set of nodes.

The message seems to originate here [1].

Any idea how to get rid of this, or what the root cause might be? Hints on what to check would be greatly appreciated!

TIA!

Petar


Environment:
1.4.0-cisco-1.0.531.1-RHEL7U3
SLURM 17.02.7
OpenMPI 2.0.2 configured with libfabric, usnic, SLURM, SLURM’s PMI library:

./configure --prefix=/software/171020/software/openmpi/2.0.2-gcc-6.3.0-2.27 --enable-shared --enable-mpi-thread-multiple --with-libfabric=/opt/cisco/libfabric --without-memory-manager --enable-mpirun-prefix-by-default --enable-mpirun-prefix-by-default --with-hwloc=$EBROOTHWLOC --with-usnic --with-verbs-usnic --with-slurm --with-pmi=/cm/shared/apps/slurm/current --enable-dlopen LDFLAGS="-Wl,-rpath -Wl,/opt/cisco/libfabric/lib -Wl,--enable-new-dtags"

NIC UCSC-MLOM-C40Q-03 [VIC 1387]
VIC Firmware 4.1(3a)


[1] https://github.com/open-mpi/ompi/blob/9c3ae64297e034b30cb65298908014764216c616/opal/mca/btl/usnic/btl_usnic_recv.c#L354
Jeff Squyres (jsquyres)
2017-11-10 18:48:01 UTC
> We’re observing output such as the following when running non-trivial MPI software through SLURM’s srun
> [cn-11:52778] unrecognized payload type 255
> [cn-11:52778] base = 0x9ce2c0, proto = 0x9ce2c0, hdr = 0x9ce300
> [cn-11:52778] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [cn-11:52778] 10: 00 00 00 00 00 00 06 02 ff 0c 1f c2 06 02 ff 0c
> [cn-11:52778] 20: b9 8f 08 00 45 00 00 3c 00 00 40 00 08 11 5d 5d
> [cn-11:52778] 30: 0a 95 00 16 0a 95 00 15 e5 05 e8 d9 00 28 7c 8c
> [cn-11:52778] 40: 01 00 00 00 00 00 31 b6 00 00 8f e3 00 00 00 00
> [cn-11:52778] 50: 00 00 00 00 00 00 06 02 ff 0c d3 25 06 02 ff 0c
> [cn-11:52778] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [cn-11:52778] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> It is independent of the software BUT is NOT observable when running with mpiexec/mpirun.
That is extremely odd. I cannot think of how the choice of launcher would affect the usNIC BTL.
> When switching to the TCP or vader BTL we have clean output and the message is not observed. It is output by different ranks on various nodes, so not reproducibly the same nodes.
> The location of the message seems to be from here[1]
Let me take a step back and explain the usNIC BTL: it uses OS-bypass UDP for communication. This means that it is connectionless, and will accept datagrams from anywhere. When the usNIC BTL receives a message, it does a few things to verify that it is both an Open MPI frame and from a peer that it recognizes. If the message fails any of the verifications, the usNIC BTL simply drops it.

There are two usual reasons that the usNIC BTL ends up dropping a message:

1. It was a valid message from a peer, but it got corrupted in transit.

PSA: corrupted packets happen. Usually the network layer filters them out and user-level processes never see them, but on rare occasions one can squeak through and still be received in userspace.

If a valid message gets dropped, it will simply be re-transmitted by the sender a short time later.

2. It was a message from something else (i.e., a non-Open MPI sender).

In my internal Cisco testing, for example, I periodically get frames from Cisco IT malware scanners (i.e., they find my open usNIC UDP ports and try to send traffic to them). In these cases, the usNIC BTL dropping the frame is the Right Thing To Do.
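The two checks described above can be illustrated with a minimal sketch. This is not the usNIC BTL's actual code: the names `Datagram`, `should_deliver`, and `KNOWN_PAYLOAD_TYPES`, the payload type codes, and the peer table are all hypothetical and only show the general shape of "verify the sender and the frame type, otherwise drop".

```python
from dataclasses import dataclass

# Hypothetical payload type codes, chosen for illustration only; the real
# values are defined in the usNIC BTL sources.
KNOWN_PAYLOAD_TYPES = {1, 2, 3}


@dataclass(frozen=True)
class Datagram:
    sender: tuple       # (ip, udp_port) the datagram claims to come from
    payload_type: int   # type field taken from the received header
    body: bytes


def should_deliver(dgram, known_peers):
    """Return True if the datagram passes both checks, False if it is dropped."""
    if dgram.sender not in known_peers:
        return False    # not from a recognized peer, e.g. a port scanner
    if dgram.payload_type not in KNOWN_PAYLOAD_TYPES:
        return False    # corrupted or foreign frame; a real peer would retransmit
    return True


# Illustrative peer table; the addresses are made up for the example.
peers = {("10.149.0.21", 59609), ("10.149.0.22", 58629)}
good = Datagram(("10.149.0.22", 58629), 1, b"hello")
bad_type = Datagram(("10.149.0.22", 58629), 255, b"")
stranger = Datagram(("192.0.2.7", 4000), 1, b"scan")

results = [should_deliver(d, peers) for d in (good, bad_type, stranger)]
print(results)  # [True, False, False]
```

Because the transport is connectionless, a drop is silent from the sender's point of view; recovery relies on the retransmission mentioned in point 1.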
> Any idea how to get rid of this or what might be the root cause? Hints what to check for would be greatly appreciated!
The messages are actually harmless -- they're just the usNIC BTL indicating that it is dropping a message.

But I can see how that would be annoying -- I'll change the default so that they are off in future versions (and only shown if the user specifically requests them).

For an immediate fix, you can basically #if 0 out the block in btl_usnic_recv.c that prints out those messages. The attached patch does that and is against v2.0.2, but note that we literally just released v2.0.4 today (just additional bug fixes against the v2.0.x series). Finally, the latest released version of Open MPI is v3.0.0, if you feel like upgrading.

--
Jeff Squyres
***@cisco.com
Forai,Petar
2017-11-13 13:31:05 UTC
One more thing to add: this is 100% reproducible. The messages appear when running with srun, and there is no such output when running with mpirun:

mpiexec:
[***@login-01 ~]$ srun -N 2 -n 2 --pty bash
[***@cn-21 ~]$ mpiexec -np 2 IMB-MPI1 PingPong
libibverbs: Warning: no node_type attr under /sys/class/infiniband/usnic_0.
libibverbs: Warning: no node_type attr under /sys/class/infiniband/usnic_0.
libibverbs: Warning: no node_type attr under /sys/class/infiniband/usnic_0.
benchmarks to run PingPong
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 4.1, MPI-1 part
#------------------------------------------------------------
# Date : Mon Nov 13 14:28:57 2017
# Machine : x86_64
# System : Linux
# Release : 3.10.0-514.2.2.el7.x86_64
# Version : #1 SMP Tue Dec 6 23:06:41 UTC 2016
# MPI Version : 3.1
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time



# Calling sequence was:

# IMB-MPI1 PingPong

# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
  #bytes  #repetitions   t[usec]  Mbytes/sec
       0          1000     11.22        0.00
       1          1000     11.26        0.08
       2          1000     11.18        0.17
       4          1000     11.16        0.34
       8          1000     11.19        0.68
      16          1000     11.18        1.36
      32          1000     11.28        2.71
      64          1000     11.40        5.35
     128          1000     11.62       10.51
     256          1000     12.08       20.20
     512          1000     12.75       38.30
    1024          1000     14.44       67.61
    2048          1000     16.00      122.04
    4096          1000     19.19      203.54
    8192          1000     25.41      307.42
   16384          1000     30.88      506.04
   32768          1000     38.29      816.18
   65536           640     56.42     1107.79
  131072           320     87.01     1436.58
  262144           160    162.14     1541.92
  524288            80    257.73     1940.02
 1048576            40    450.37     2220.39
 2097152            20    806.20     2480.79
 4194304            10   1776.69     2251.38


# All processes entering MPI_Finalize

[***@cn-21 ~]$


srun:


[***@login-01 ~]$ srun -N 2 -n 2 IMB-MPI1 PingPong
libibverbs: Warning: no node_type attr under /sys/class/infiniband/usnic_0.
libibverbs: Warning: no node_type attr under /sys/class/infiniband/usnic_0.
benchmarks to run PingPong
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 4.1, MPI-1 part
#------------------------------------------------------------
# Date : Mon Nov 13 14:27:26 2017
# Machine : x86_64
# System : Linux
# Release : 3.10.0-514.2.2.el7.x86_64
# Version : #1 SMP Tue Dec 6 23:06:41 UTC 2016
# MPI Version : 3.1
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time



# Calling sequence was:

# /software/171020/software/imb/4.1-foss-2017a/bin/IMB-MPI1 PingPong

# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
  #bytes  #repetitions   t[usec]  Mbytes/sec
       0          1000     11.73        0.00
       1          1000     11.83        0.08
       2          1000     11.66        0.16
       4          1000     11.64        0.33
       8          1000     11.70        0.65
      16          1000     11.73        1.30
      32          1000     11.81        2.58
      64          1000     12.01        5.08
     128          1000     12.23        9.98
     256          1000     12.63       19.33
     512          1000     13.35       36.58
    1024          1000     14.98       65.18
    2048          1000     16.43      118.85
    4096          1000     19.69      198.38
    8192          1000     26.38      296.18
[cn-21:18790] unrecognized payload type 255
[cn-21:18790] base = 0xa0c840, proto = 0xa0c840, hdr = 0xa0c880
[cn-21:18790] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 10: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 20: da 89 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-21:18790] 30: 0a 95 00 20 0a 95 00 1f a3 1a e3 8c 00 28 9c 10
[cn-21:18790] 40: 01 00 00 00 00 00 8b d0 00 00 8d 8f 00 00 00 00
[cn-21:18790] 50: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] unrecognized payload type 255
[cn-21:18790] base = 0xa0c500, proto = 0xa0c500, hdr = 0xa0c540
[cn-21:18790] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 10: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 20: da 89 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-21:18790] 30: 0a 95 00 20 0a 95 00 1f a3 1a e3 8c 00 28 9b 10
[cn-21:18790] 40: 01 00 00 00 00 00 8b d0 00 00 8e 8f 00 00 00 00
[cn-21:18790] 50: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] unrecognized payload type 255
[cn-21:18790] base = 0xa0c1c0, proto = 0xa0c1c0, hdr = 0xa0c200
[cn-21:18790] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 10: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 20: da 89 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-21:18790] 30: 0a 95 00 20 0a 95 00 1f a3 1a e3 8c 00 28 9a 10
[cn-21:18790] 40: 01 00 00 00 00 00 8b d0 00 00 8f 8f 00 00 00 00
[cn-21:18790] 50: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] unrecognized payload type 255
[cn-21:18790] base = 0xa0be80, proto = 0xa0be80, hdr = 0xa0bec0
[cn-21:18790] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 10: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 20: da 89 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-21:18790] 30: 0a 95 00 20 0a 95 00 1f a3 1a e3 8c 00 28 99 10
[cn-21:18790] 40: 01 00 00 00 00 00 8b d0 00 00 90 8f 00 00 00 00
[cn-21:18790] 50: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] unrecognized payload type 255
[cn-21:18790] base = 0xa0bb40, proto = 0xa0bb40, hdr = 0xa0bb80
[cn-21:18790] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 10: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 20: da 89 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-21:18790] 30: 0a 95 00 20 0a 95 00 1f a3 1a e3 8c 00 28 98 10
[cn-21:18790] 40: 01 00 00 00 00 00 8b d0 00 00 91 8f 00 00 00 00
[cn-21:18790] 50: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] unrecognized payload type 255
[cn-21:18790] base = 0xa0b800, proto = 0xa0b800, hdr = 0xa0b840
[cn-21:18790] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 10: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 20: da 89 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-21:18790] 30: 0a 95 00 20 0a 95 00 1f a3 1a e3 8c 00 28 97 10
[cn-21:18790] 40: 01 00 00 00 00 00 8b d0 00 00 92 8f 00 00 00 00
[cn-21:18790] 50: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] unrecognized payload type 255
[cn-21:18790] base = 0xa0b4c0, proto = 0xa0b4c0, hdr = 0xa0b500
[cn-21:18790] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 10: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 20: da 89 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-21:18790] 30: 0a 95 00 20 0a 95 00 1f a3 1a e3 8c 00 28 96 10
[cn-21:18790] 40: 01 00 00 00 00 00 8b d0 00 00 93 8f 00 00 00 00
[cn-21:18790] 50: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] unrecognized payload type 255
[cn-21:18790] base = 0xa0b180, proto = 0xa0b180, hdr = 0xa0b1c0
[cn-21:18790] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 10: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 20: da 89 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-21:18790] 30: 0a 95 00 20 0a 95 00 1f a3 1a e3 8c 00 28 95 10
[cn-21:18790] 40: 01 00 00 00 00 00 8b d0 00 00 94 8f 00 00 00 00
[cn-21:18790] 50: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] unrecognized payload type 255
[cn-21:18790] base = 0xa0ae40, proto = 0xa0ae40, hdr = 0xa0ae80
[cn-21:18790] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 10: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 20: da 89 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-21:18790] 30: 0a 95 00 20 0a 95 00 1f a3 1a e3 8c 00 28 94 10
[cn-21:18790] 40: 01 00 00 00 00 00 8b d0 00 00 95 8f 00 00 00 00
[cn-21:18790] 50: 00 00 00 00 00 00 06 02 ff 0c a2 ff 06 02 ff 0c
[cn-21:18790] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-21:18790] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] unrecognized payload type 255
[cn-22:15990] base = 0xa11880, proto = 0xa11880, hdr = 0xa118c0
[cn-22:15990] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 10: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 20: a2 ff 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-22:15990] 30: 0a 95 00 1f 0a 95 00 20 e3 8c a3 1a 00 28 f8 01
[cn-22:15990] 40: 00 00 00 00 00 00 8b d0 00 00 fb 13 00 00 00 00
[cn-22:15990] 50: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] unrecognized payload type 255
[cn-22:15990] base = 0xa11540, proto = 0xa11540, hdr = 0xa11580
[cn-22:15990] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 10: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 20: a2 ff 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-22:15990] 30: 0a 95 00 1f 0a 95 00 20 e3 8c a3 1a 00 28 f6 01
[cn-22:15990] 40: 00 00 00 00 00 00 8b d0 00 00 fd 13 00 00 00 00
[cn-22:15990] 50: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] unrecognized payload type 255
[cn-22:15990] base = 0xa11200, proto = 0xa11200, hdr = 0xa11240
[cn-22:15990] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 10: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 20: a2 ff 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-22:15990] 30: 0a 95 00 1f 0a 95 00 20 e3 8c a3 1a 00 28 f4 01
[cn-22:15990] 40: 00 00 00 00 00 00 8b d0 00 00 ff 13 00 00 00 00
[cn-22:15990] 50: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] unrecognized payload type 255
[cn-22:15990] base = 0xa10ec0, proto = 0xa10ec0, hdr = 0xa10f00
[cn-22:15990] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 10: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 20: a2 ff 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-22:15990] 30: 0a 95 00 1f 0a 95 00 20 e3 8c a3 1a 00 28 f3 01
[cn-22:15990] 40: 00 00 00 00 00 00 8b d0 00 00 00 14 00 00 00 00
[cn-22:15990] 50: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] unrecognized payload type 255
[cn-22:15990] base = 0xa10b80, proto = 0xa10b80, hdr = 0xa10bc0
[cn-22:15990] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 10: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 20: a2 ff 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-22:15990] 30: 0a 95 00 1f 0a 95 00 20 e3 8c a3 1a 00 28 f2 01
[cn-22:15990] 40: 00 00 00 00 00 00 8b d0 00 00 01 14 00 00 00 00
[cn-22:15990] 50: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] unrecognized payload type 255
[cn-22:15990] base = 0xa10840, proto = 0xa10840, hdr = 0xa10880
[cn-22:15990] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 10: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 20: a2 ff 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-22:15990] 30: 0a 95 00 1f 0a 95 00 20 e3 8c a3 1a 00 28 f0 01
[cn-22:15990] 40: 00 00 00 00 00 00 8b d0 00 00 03 14 00 00 00 00
[cn-22:15990] 50: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] unrecognized payload type 255
[cn-22:15990] base = 0xa10500, proto = 0xa10500, hdr = 0xa10540
[cn-22:15990] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 10: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 20: a2 ff 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-22:15990] 30: 0a 95 00 1f 0a 95 00 20 e3 8c a3 1a 00 28 ee 01
[cn-22:15990] 40: 00 00 00 00 00 00 8b d0 00 00 05 14 00 00 00 00
[cn-22:15990] 50: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] unrecognized payload type 255
[cn-22:15990] base = 0xa101c0, proto = 0xa101c0, hdr = 0xa10200
[cn-22:15990] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 10: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 20: a2 ff 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-22:15990] 30: 0a 95 00 1f 0a 95 00 20 e3 8c a3 1a 00 28 ec 01
[cn-22:15990] 40: 00 00 00 00 00 00 8b d0 00 00 07 14 00 00 00 00
[cn-22:15990] 50: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] unrecognized payload type 255
[cn-22:15990] base = 0xa0fe80, proto = 0xa0fe80, hdr = 0xa0fec0
[cn-22:15990] 0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 10: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 20: a2 ff 08 00 45 00 00 3c 00 00 40 00 08 11 5d 49
[cn-22:15990] 30: 0a 95 00 1f 0a 95 00 20 e3 8c a3 1a 00 28 eb 01
[cn-22:15990] 40: 00 00 00 00 00 00 8b d0 00 00 08 14 00 00 00 00
[cn-22:15990] 50: 00 00 00 00 00 00 06 02 ff 0c da 89 06 02 ff 0c
[cn-22:15990] 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[cn-22:15990] 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   16384          1000     32.29      483.84
   32768          1000     41.42      754.43
   65536           640     57.21     1092.55
  131072           320     92.60     1349.90
  262144           160    169.52     1474.76
  524288            80    274.93     1818.64
 1048576            40    481.37     2077.42
 2097152            20    908.20     2202.16
 4194304            10   1785.11     2240.76


# All processes entering MPI_Finalize

[***@login-01 ~]$
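To put a number on how often frames are dropped in a run like the one above, the diagnostics can be tallied per rank. The following is a small sketch, not part of Open MPI; the function name `count_drops` is made up here, and the pattern only assumes the message format shown in this thread.

```python
import re
from collections import Counter

# Matches the first line of each diagnostic, e.g.
# "[cn-21:18790] unrecognized payload type 255"
DROP_RE = re.compile(r"^\[([^:\]]+):(\d+)\] unrecognized payload type (\d+)")


def count_drops(lines):
    """Count dropped-frame diagnostics per (host, pid, payload_type)."""
    counts = Counter()
    for line in lines:
        m = DROP_RE.match(line)
        if m:
            host, pid, ptype = m.group(1), int(m.group(2)), int(m.group(3))
            counts[(host, pid, ptype)] += 1
    return counts


# A few lines in the format seen in the srun output.
sample = [
    "8192 1000 26.38 296.18",
    "[cn-21:18790] unrecognized payload type 255",
    "[cn-21:18790] base = 0xa0c840, proto = 0xa0c840, hdr = 0xa0c880",
    "[cn-21:18790] unrecognized payload type 255",
    "[cn-22:15990] unrecognized payload type 255",
]
for (host, pid, ptype), n in sorted(count_drops(sample).items()):
    print(f"{host} pid {pid}: {n} frame(s) with payload type {ptype}")
```

Passing the captured job output to `count_drops` (for example `count_drops(open("job.out"))`, where `job.out` is a placeholder file name) gives the per-rank totals. For the srun run shown above, that is nine diagnostics each for cn-21:18790 and cn-22:15990.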




On 13/11/17 - KW46, 14:21, "users on behalf of Forai,Petar" <users-***@lists.open-mpi.org on behalf of ***@imp.ac.at> wrote:

Hi Jeff!

Thanks for your quick reply. This is a bunch of Cisco UCS C Series machines hooked up into an ACI leaf pair. Before the debugging session today we didn’t even have unicast routing turned on within the MPI bridge domain for this cluster so nothing could reach those IP interfaces before – we’ve observed the kernel interfaces that are “attached” to the usnic0 interface and did not see any UDP traffic appearing.

We suspect that corruption and dropping are going on. In the dumped frame we see the MAC addresses of the two MPI ranks (06 02 ff 0c 1f c2 and 06 02 ff 0c d3 25), but with runs of 00 bytes before and after them, which gives the impression that this dump is actually several frames merged into one, or something similar. I don’t know the usNIC on-wire protocol, but seeing the MAC addresses multiple times looks suspicious for a host-to-host protocol.
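For anyone who wants to decode such a dump by hand, here is a minimal sketch that parses the first dump from this thread as Ethernet + IPv4 + UDP. It is an interpretation rather than a statement about the usNIC wire format: the assumption that the Ethernet header starts at offset 0x16 is inferred only from the `08 00` EtherType at 0x22 and the `45` IPv4 version/IHL byte at 0x24, and all helper names are made up.

```python
import ipaddress
import struct

# Offsets 0x10-0x5f of the first dump in this thread, copied verbatim.
DUMP = """\
10: 00 00 00 00 00 00 06 02 ff 0c 1f c2 06 02 ff 0c
20: b9 8f 08 00 45 00 00 3c 00 00 40 00 08 11 5d 5d
30: 0a 95 00 16 0a 95 00 15 e5 05 e8 d9 00 28 7c 8c
40: 01 00 00 00 00 00 31 b6 00 00 8f e3 00 00 00 00
50: 00 00 00 00 00 00 06 02 ff 0c d3 25 06 02 ff 0c
"""

ETH_START = 0x16 - 0x10  # ASSUMED start of the Ethernet header within DUMP


def dump_to_bytes(text):
    """Concatenate the hex bytes of every 'offset: xx xx ...' line."""
    return b"".join(bytes.fromhex(line.split(":", 1)[1])
                    for line in text.strip().splitlines())


def mac(raw):
    return ":".join(f"{b:02x}" for b in raw)


def ipv4_checksum_ok(header):
    """The one's-complement sum over the whole header must fold to 0xffff."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF


def decode(frame):
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    ip = frame[14:34]  # 20-byte IPv4 header (IHL = 5 in this dump)
    (_vihl, _tos, total_len, _ident, _frag,
     ttl, proto, _csum, saddr, daddr) = struct.unpack("!BBHHHBBH4s4s", ip)
    sport, dport, udp_len, _udp_csum = struct.unpack("!HHHH", frame[34:42])
    return {
        "dst_mac": mac(dst), "src_mac": mac(src), "ethertype": ethertype,
        "ip_total_len": total_len, "ttl": ttl, "proto": proto,
        "src_ip": str(ipaddress.IPv4Address(saddr)),
        "dst_ip": str(ipaddress.IPv4Address(daddr)),
        "src_port": sport, "dst_port": dport, "udp_len": udp_len,
        "ip_checksum_ok": ipv4_checksum_ok(ip),
    }


info = decode(dump_to_bytes(DUMP)[ETH_START:])
for key, value in info.items():
    print(f"{key}: {value}")
```

Under this reading the bytes decode to a single UDP datagram from 10.149.0.22 to 10.149.0.21 with a valid IPv4 header checksum. The UDP length of 40 leaves a 32-byte payload spanning offsets 0x40 to 0x5f, so the second MAC-like sequence at 0x56 would sit inside the payload rather than in a second Ethernet header.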

I’ve attached a pcap file of an IMB PingPong run, captured to the APIC (GRE-encapsulated) of that fabric while the frame dump was being written to the output every now and then. The pcap looks solid, with nothing strange in it.

Thanks for the hint that this message is harmless. If we don’t find anything else, I guess we’ll go with a small patch that disables dumping the frames to the console.

Any other clues?

best,
P

What level of dropping should we actually expect here? Also, PFC is turned on there and set to class 3.
