[OMPI users] Q: Binding to cores on AWS?

Discussion:

Brian Dobbins

2017-12-22 19:58:27 UTC

Hi all,

We're testing a model on AWS using C4/C5 nodes and some of our timers, in
a part of the code with no communication, show really poor performance
compared to native runs. We think this is because we're not binding to a
core properly and thus not caching, and a quick 'mpirun --bind-to core
hostname' does suggest issues with this on AWS:

*[***@head run]$ mpirun --bind-to core hostname*
*--------------------------------------------------------------------------*
*WARNING: a request was made to bind a process. While the system*
*supports binding the process itself, at least one node does NOT*
*support binding memory to the process location.*

* Node: head*

*Open MPI uses the "hwloc" library to perform process and memory*
*binding. This error message means that hwloc has indicated that*
*processor binding support is not available on this machine.*

(It also happens on compute nodes, and with real executables.)

Does anyone know how to enforce binding to cores on AWS instances? Any
insight would be great.

Thanks,
- Brian

r***@open-mpi.org

2017-12-22 20:53:37 UTC

Permalink

Actually, that message is telling you that binding to core is available, but that we cannot bind memory to be local to that core. You can verify the binding pattern by adding --report-bindings to your cmd line.

Post by Brian Dobbins
Hi all,
--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.
Node: head
Open MPI uses the "hwloc" library to perform process and memory
binding. This error message means that hwloc has indicated that
processor binding support is not available on this machine.
(It also happens on compute nodes, and with real executables.)
Does anyone know how to enforce binding to cores on AWS instances? Any insight would be great.
Thanks,
- Brian
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users

Brian Dobbins

2017-12-22 21:08:58 UTC

Permalink

Hi Ralph,

OK, that certainly makes sense - so the next question is, what prevents
binding memory to be local to particular cores? Is this possible in a
virtualized environment like AWS HVM instances?

And does this apply only to dynamic allocations within an instance, or
static as well? I'm pretty unfamiliar with how the hypervisor (KVM-based,
I believe) maps out 'real' hardware, including memory, to particular
instances. We've seen *some* parts of the code (bandwidth heavy) run ~10x
faster on bare-metal hardware, though, *presumably* from memory locality,
so it certainly has a big impact.

Thanks again, and merry Christmas!
- Brian

Post by r***@open-mpi.org
Actually, that message is telling you that binding to core is available,
but that we cannot bind memory to be local to that core. You can verify the
binding pattern by adding --report-bindings to your cmd line.
Hi all,
We're testing a model on AWS using C4/C5 nodes and some of our timers,
in a part of the code with no communication, show really poor performance
compared to native runs. We think this is because we're not binding to a
core properly and thus not caching, and a quick 'mpirun --bind-to core
*--------------------------------------------------------------------------*
*WARNING: a request was made to bind a process. While the system*
*supports binding the process itself, at least one node does NOT*
*support binding memory to the process location.*
* Node: head*
*Open MPI uses the "hwloc" library to perform process and memory*
*binding. This error message means that hwloc has indicated that*
*processor binding support is not available on this machine.*
(It also happens on compute nodes, and with real executables.)
Does anyone know how to enforce binding to cores on AWS instances? Any
insight would be great.
Thanks,
- Brian
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users

r***@open-mpi.org

2017-12-22 21:14:23 UTC

Permalink

I honestly donât know - will have to defer to Brian, who is likely out for at least the extended weekend. Iâll point this one to him when he returns.

Post by Brian Dobbins
Hi Ralph,
OK, that certainly makes sense - so the next question is, what prevents binding memory to be local to particular cores? Is this possible in a virtualized environment like AWS HVM instances?
And does this apply only to dynamic allocations within an instance, or static as well? I'm pretty unfamiliar with how the hypervisor (KVM-based, I believe) maps out 'real' hardware, including memory, to particular instances. We've seen some parts of the code (bandwidth heavy) run ~10x faster on bare-metal hardware, though, presumably from memory locality, so it certainly has a big impact.
Thanks again, and merry Christmas!
- Brian
Actually, that message is telling you that binding to core is available, but that we cannot bind memory to be local to that core. You can verify the binding pattern by adding --report-bindings to your cmd line.

Brian Dobbins

2017-12-22 22:27:17 UTC

Permalink

Hi Ralph,

Well, this gets chalked up to user error - the default AMI images come
without the NUMA-dev libraries, so OpenMPI didn't get built with it (and in
my haste, I hadn't checked). Oops. Things seem to be working correctly
now.

Thanks again for your help,
- Brian

Post by r***@open-mpi.org
I honestly donât know - will have to defer to Brian, who is likely out for
at least the extended weekend. Iâll point this one to him when he returns.
Hi Ralph,
OK, that certainly makes sense - so the next question is, what prevents
binding memory to be local to particular cores? Is this possible in a
virtualized environment like AWS HVM instances?
And does this apply only to dynamic allocations within an instance, or
static as well? I'm pretty unfamiliar with how the hypervisor (KVM-based,
I believe) maps out 'real' hardware, including memory, to particular
instances. We've seen *some* parts of the code (bandwidth heavy) run
~10x faster on bare-metal hardware, though, *presumably* from memory
locality, so it certainly has a big impact.
Thanks again, and merry Christmas!
- Brian

_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users

Gilles Gouaillardet

2017-12-23 02:55:28 UTC

Permalink

Brian,

i have no doubt this was enough to get rid of the warning messages.

out of curiosity, are you now able to experience performances close to
native runs ?
if i understand correctly, the linux kernel allocates memory on the
closest NUMA domain (e.g. socket if i oversimplify), and since
MPI tasks are bound by orted/mpirun before they are execv'ed, i have
some hard time understanding how not binding MPI tasks to
memory can have a significant impact on performances as long as they
are bound on cores.

Cheers,

Gilles

Post by Brian Dobbins
Hi Ralph,
Well, this gets chalked up to user error - the default AMI images come
without the NUMA-dev libraries, so OpenMPI didn't get built with it (and in
my haste, I hadn't checked). Oops. Things seem to be working correctly
now.
Thanks again for your help,
- Brian

I honestly don’t know - will have to defer to Brian, who is likely out for
at least the extended weekend. I’ll point this one to him when he returns.
Hi Ralph,
OK, that certainly makes sense - so the next question is, what prevents
binding memory to be local to particular cores? Is this possible in a
virtualized environment like AWS HVM instances?
And does this apply only to dynamic allocations within an instance, or
static as well? I'm pretty unfamiliar with how the hypervisor (KVM-based, I
believe) maps out 'real' hardware, including memory, to particular
instances. We've seen some parts of the code (bandwidth heavy) run ~10x
faster on bare-metal hardware, though, presumably from memory locality, so
it certainly has a big impact.
Thanks again, and merry Christmas!
- Brian

Post by r***@open-mpi.org
Actually, that message is telling you that binding to core is available,
but that we cannot bind memory to be local to that core. You can verify the
binding pattern by adding --report-bindings to your cmd line.
Hi all,
We're testing a model on AWS using C4/C5 nodes and some of our timers,
in a part of the code with no communication, show really poor performance
compared to native runs. We think this is because we're not binding to a
core properly and thus not caching, and a quick 'mpirun --bind-to core
--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.
Node: head
Open MPI uses the "hwloc" library to perform process and memory
binding. This error message means that hwloc has indicated that
processor binding support is not available on this machine.
(It also happens on compute nodes, and with real executables.)
Does anyone know how to enforce binding to cores on AWS instances? Any
insight would be great.
Thanks,
- Brian
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users
_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users

_______________________________________________
users mailing list
https://lists.open-mpi.org/mailman/listinfo/users

Brian Dobbins

2017-12-23 03:14:48 UTC

Permalink

Hi Gilles,

You're right, we no longer get warnings... and the performance disparity
still exists, though to be clear it's only in select parts of the code -
others run as we'd expect. This is probably why I initially guessed it was
a process/memory affinity issue - the one timer I looked at is in a
memory-intensive part of the code. Now I'm wondering if we're still
getting issues binding (I need to do a comparison with a local system), or
if it could be due to the cache size differences - the AWS C4 instances
have 25MB/socket, and we have 45MB/socket. If we fit in cache on our
system, and don't on theirs, that could account for things. Testing that
is next up on my list, actually.

Cheers,
- Brian

On Fri, Dec 22, 2017 at 7:55 PM, Gilles Gouaillardet <

Post by r***@open-mpi.org
Brian,
i have no doubt this was enough to get rid of the warning messages.
out of curiosity, are you now able to experience performances close to
native runs ?
if i understand correctly, the linux kernel allocates memory on the
closest NUMA domain (e.g. socket if i oversimplify), and since
MPI tasks are bound by orted/mpirun before they are execv'ed, i have
some hard time understanding how not binding MPI tasks to
memory can have a significant impact on performances as long as they
are bound on cores.
Cheers,
Gilles

Post by Brian Dobbins
Hi Ralph,
Well, this gets chalked up to user error - the default AMI images come
without the NUMA-dev libraries, so OpenMPI didn't get built with it (and

Post by Brian Dobbins
my haste, I hadn't checked). Oops. Things seem to be working correctly
now.
Thanks again for your help,
- Brian

Post by r***@open-mpi.org
I honestly donât know - will have to defer to Brian, who is likely out

for

Post by Brian Dobbins

Post by r***@open-mpi.org
at least the extended weekend. Iâll point this one to him when he

returns.

Post by Brian Dobbins

Post by r***@open-mpi.org
Hi Ralph,
OK, that certainly makes sense - so the next question is, what

prevents

Post by Brian Dobbins

Post by r***@open-mpi.org
binding memory to be local to particular cores? Is this possible in a
virtualized environment like AWS HVM instances?
And does this apply only to dynamic allocations within an instance, or
static as well? I'm pretty unfamiliar with how the hypervisor

(KVM-based, I

Post by Brian Dobbins

Post by r***@open-mpi.org
believe) maps out 'real' hardware, including memory, to particular
instances. We've seen some parts of the code (bandwidth heavy) run ~10x
faster on bare-metal hardware, though, presumably from memory locality,

Post by Brian Dobbins

Post by r***@open-mpi.org
it certainly has a big impact.
Thanks again, and merry Christmas!
- Brian

Post by r***@open-mpi.org
Actually, that message is telling you that binding to core is

available,