Brendan Myers
2017-10-10 19:29:00 UTC
Hello All,
I have a RoCE interoperability event starting next week, and I was wondering
if anyone had any ideas to help me with a new vendor I am trying to get
ready.
I am using:
* Open MPI 2.1
* Intel MPI Benchmarks 2018
* OFED 3.18 (requirement from vendor)
* SLES 11 SP3 (requirement from vendor)
The problem seems to be that the device does not handle larger message sizes
well. I am sure the vendor will be working on this, but I am hoping there may
be a way to complete an IMB run with some Open MPI parameter tweaking.
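One thing I was planning to try, if the device simply cannot handle the
largest sizes yet, is capping the IMB message sizes below the problem
threshold with IMB's -msglog option. A sketch, with placeholder hostnames
for my setup:

    # Cap message sizes at 2^20 bytes (1 MiB), below where the device struggles
    mpirun -np 2 --host node1,node2 ./IMB-MPI1 Sendrecv -msglog 0:20

That only avoids the problem rather than fixing it, but it would at least let
a run finish.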
Sample of IMB output from a Sendrecv benchmark:
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
       262144          160       131.07       132.24       131.80     3964.56
       524288           80       277.42       284.57       281.57     3684.71
      1048576           40       461.16       474.83       470.02     4416.59
      2097152            3      1112.15   4294965.49   2147851.04        0.98
      4194304            2      2815.25   8589929.73   3222731.54        0.98
The problematic results are the last two rows (the 2097152- and 4194304-byte
messages), where t_max blows up and throughput collapses. This happens on many
of the benchmarks at larger message sizes and causes either a major slowdown
or an abort with the error:
The InfiniBand retry count between two MPI processes has been exceeded.
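The knobs I know of for that particular error are the openib BTL's retry
count and timeout MCA parameters. The retry count already defaults to 7, the
maximum the IB spec allows, so the timeout (4.096 usec * 2^value) is the one
with headroom. Something like this, again with placeholder hostnames:

    # Raise the QP timeout to its maximum so slow completions retry
    # instead of tripping the retry-count error
    mpirun -np 2 --host node1,node2 \
        --mca btl openib,self,vader \
        --mca btl_openib_ib_timeout 31 \
        --mca btl_openib_ib_retry_count 7 \
        ./IMB-MPI1 Sendrecv

I believe the openib BTL also has a btl_openib_max_send_size parameter;
lowering it might keep individual fragments small enough for the device if
the trouble is wire-level message size, though I have not verified that on
this setup.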
If anyone has thoughts on how I can complete the benchmarks without the job
aborting, I would appreciate it. If anyone has ideas as to why a RoCE device
might show this issue, I would welcome any information on offer. If more data
is required, please let me know what is relevant.
Thank you,
Brendan T. W. Myers