Discussion: [OMPI users] NUMA interaction with Open MPI
Adam Sylvester
2017-07-16 19:19:06 UTC
I'll start with my question upfront: Is there a way to tell mpirun to do the
equivalent of 'numactl --interleave=all' on the processes that it runs? Or,
if I want to control the memory placement of applications run through MPI,
will I need to use libnuma for this? I tried doing "mpirun <Open MPI
options> numactl --interleave=all <app name and options>". I don't know how
to explicitly verify whether this ran the numactl command on each host, but
based on the performance I'm seeing, it doesn't seem like it did (or
something else is causing my poor performance).

More details: For the particular image I'm benchmarking with, I have a
multi-threaded application which requires 60 GB of RAM to run if it's run
on one machine. It allocates one large ping/pong buffer upfront and uses
this to avoid copies when updating the image at each step. I'm running in
AWS and comparing performance on an r3.8xlarge (16 CPUs, 244 GB RAM, 10
Gbps) vs. an x1.32xlarge (64 CPUs, 2 TB RAM, 20 Gbps). Running on a single
X1, my application runs ~3x faster than on the R3; using numactl
--interleave=all has a significant positive effect on its performance, I
assume because the running threads access memory spread across the NUMA
nodes rather than most of them having slow, remote access to a single
node's memory. So far so good.

My application also supports distributing across machines via MPI. When
doing this, the memory requirement scales linearly with the number of
machines; there are three pinch points that involve large (GBs of data)
all-to-all communication. For the slowest of these three, I've pipelined
this step and use MPI_Ialltoallv() to hide as much of the latency as I
can. When run on R3 instances, overall runtime scales very well as
machines are added. Still so far so good.
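
(For context, the pipelining is along these lines; the chunk bookkeeping
and the compute call are placeholders for what my application actually
does.)

#include <mpi.h>

typedef struct {
    float *sendbuf, *recvbuf;
    int *sendcounts, *sdispls, *recvcounts, *rdispls;
} chunk_desc_t;                   /* placeholder per-chunk bookkeeping */

void compute_on(float *recvbuf);  /* placeholder application kernel */

void pipelined_alltoallv(int nchunks, chunk_desc_t *chunks, MPI_Comm comm)
{
    MPI_Request req;

    /* Post the exchange for the first chunk up front. */
    MPI_Ialltoallv(chunks[0].sendbuf, chunks[0].sendcounts, chunks[0].sdispls,
                   MPI_FLOAT,
                   chunks[0].recvbuf, chunks[0].recvcounts, chunks[0].rdispls,
                   MPI_FLOAT, comm, &req);

    for (int i = 0; i < nchunks; ++i) {
        /* Wait for chunk i's data to arrive... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        /* ...then immediately post chunk i+1 so the transfer overlaps
         * with the computation on chunk i below. */
        if (i + 1 < nchunks)
            MPI_Ialltoallv(chunks[i+1].sendbuf, chunks[i+1].sendcounts,
                           chunks[i+1].sdispls, MPI_FLOAT,
                           chunks[i+1].recvbuf, chunks[i+1].recvcounts,
                           chunks[i+1].rdispls, MPI_FLOAT, comm, &req);

        compute_on(chunks[i].recvbuf);
    }
}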

My problems start with the X1 instances. I do get scaling as I add more
machines, but it is significantly worse than with the R3s. This isn't just
a matter of there being more CPUs and the MPI communication time
dominating: the actual time spent in the MPI all-to-all communication is
significantly longer than on the R3s for the same number of machines,
despite the network bandwidth being twice as high. (In a post from a few
days ago some folks helped me with MPI settings to improve the network
communication speed, and from toy MPI benchmarks I know I'm getting faster
communication on the X1s than on the R3s.) So this feels likely to be an
issue with NUMA, though I'd be interested in any other thoughts.

I looked at https://www.open-mpi.org/doc/current/man1/mpirun.1.php but this
didn't seem to have what I was looking for. I want MPI to let my
application use all CPUs on the system (I'm the only one running on it)...
I just want to control the memory placement.

Thanks for the help.
-Adam
Gilles Gouaillardet
2017-07-17 03:42:40 UTC
Adam,

Keep in mind that by default, recent Open MPI versions bind MPI tasks
- to a core if -np 2
- to a NUMA domain otherwise (which is a socket in most cases, unless
you are running on a Xeon Phi)

So unless you specifically asked mpirun to do a binding consistent
with your needs, you might simply try asking for no binding at all:
mpirun --bind-to none ...

I am not sure whether you can directly ask Open MPI from the command
line to do the memory binding you expect. Anyway, as far as I am
concerned,
mpirun --bind-to none numactl --interleave=all ...
should do what you expect.
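
For example, something like this (the host file, process count and
application name below are only placeholders; --report-bindings makes
mpirun report where, if anywhere, it bound each rank):

mpirun -np 8 --hostfile hosts.txt --bind-to none --report-bindings \
    numactl --interleave=all ./your_app <app options>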

If you want to be sure, you can simply run
mpirun --bind-to none numactl --interleave=all grep Mems_allowed_list /proc/self/status
and that should give you a hint.

Cheers,

Gilles
Iliev, Hristo
2017-07-19 15:29:15 UTC
Gilles,

Mems_allowed_list has never worked for me:

$ uname -r
3.10.0-514.26.1.el7.x86_64

$ numactl -H | grep available
available: 2 nodes (0-1)

$ grep Mems_allowed_list /proc/self/status
Mems_allowed_list: 0-1

$ numactl -m 0 grep Mems_allowed_list /proc/self/status
Mems_allowed_list: 0-1

It seems that whatever structure Mems_allowed_list exposes is outdated. One should use "numactl -s" instead:

$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0 1

$ numactl -m 0 numactl -s
policy: bind
preferred node: 0
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0

$ numactl -i all numactl -s
policy: interleave
preferred node: 0 (interleave next)
interleavemask: 0 1
interleavenode: 0
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0 1

I wouldn't ask Open MPI not to bind the processes, since the memory policy set by numactl takes precedence over what orterun/the shepherd sets, at least with non-MPI programs:

$ orterun -n 2 --bind-to core --map-by socket numactl -i all numactl -s
policy: interleave
preferred node: 1 (interleave next)
interleavemask: 0 1
interleavenode: 1
physcpubind: 0 24
cpubind: 0
nodebind: 0
membind: 0 1
policy: interleave
preferred node: 1 (interleave next)
interleavemask: 0 1
interleavenode: 1
physcpubind: 12 36
cpubind: 1
nodebind: 1
membind: 0 1

Cheers,
Hristo

Brice Goglin
2017-07-20 07:52:18 UTC
Hello

Mems_allowed_list is what your current cgroup/cpuset allows. It is
different from what mbind/numactl/hwloc/... change.
The former is a root-only restriction that cannot be ignored by
processes placed in that cgroup.
The latter is a user-changeable binding that must be inside the former.
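For instance, you can see both on one machine (paths differ between
distributions and between cgroup v1/v2, so take this as a sketch):

# root-controlled cpuset restriction (what Mems_allowed_list mirrors)
cat /sys/fs/cgroup/cpuset/cpuset.mems
# user-level memory policy (what numactl/mbind actually change)
numactl --interleave=all numactl -s | grep ^policy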

Brice
Iliev, Hristo
2017-07-20 08:24:47 UTC
I see... Now it all makes sense. Since Cpus_allowed(_list) shows the effective CPU mask, I expected Mems_allowed(_list) to do the same for memory.

Thanks for the clarification.

Cheers,
Hristo
