Discussion:
[OMPI users] Performance degradation of OpenMPI 1.10.2 when oversubscribed?
Jordi Guitart
2017-03-24 10:45:05 UTC
Permalink
Hello,

I'm running experiments with BT NAS benchmark on OpenMPI. I've
identified a very weird performance degradation of OpenMPI v1.10.2 (and
later versions) when the system is oversubscribed. In particular, note
the performance difference between 1.10.2 and 1.10.1 when running 36 MPI
processes over 28 CPUs.
$HOME/openmpi-bin-1.10.1/bin/mpirun -np 36 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun -np 36 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 111.71

The performance when the system is undersubscribed (i.e. 16 MPI processes over 28 CPUs) is comparable for both versions:
$HOME/openmpi-bin-1.10.1/bin/mpirun -np 16 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.16 -> Time in seconds = 96.78
$HOME/openmpi-bin-1.10.2/bin/mpirun -np 16 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.16 -> Time in seconds = 99.35

Any idea of what is happening?

Thanks

PS. As the system has 28 cores with hyperthreading enabled, I use taskset
to ensure that only one thread per core is used.
PS2. I have also tested versions 1.10.6, 2.0.1 and 2.0.2, and the
degradation also occurs.

http://bsc.es/disclaimer
Jeff Squyres (jsquyres)
2017-03-24 19:39:12 UTC
Permalink
Performance goes out the window if you oversubscribe your machines (i.e., run more MPI processes than cores). The effect of oversubscription is non-deterministic.

(for the next few paragraphs, assume that HT is disabled in the BIOS -- i.e., that there's only 1 hardware thread on each core)

Open MPI uses spinning to check for progress, meaning that any one process will peg a core at 100%. When you run N MPI processes (where N <= num_cores), then each process can run at 100% and run as fast as the cores allow.

When you run M MPI processes (where M > num_cores), then, by definition, some processes will have to yield their position on a core to let another process run. This means that they will react to MPI/network traffic slower than if they had an entire core to themselves (a similar effect occurs with the computational part of the app).

Limiting MPI processes to hyperthreads *helps*, but current generation Intel hyperthreads are not as powerful as cores (they have roughly half the resources of a core), so -- depending on your application and your exact system setup -- you will almost certainly see performance degradation when running N MPI processes across N hyperthreads (i.e., N/2 cores) vs. across N cores. You can try it yourself by running the same size application over N cores on a single machine, and then running the same application over N hyperthreads (i.e., N/2 cores) on the same machine.

You can use mpirun's binding options to bind to hyperthreads or cores, too -- you don't have to use taskset (which can be fairly confusing with the differences between physical and logical numbering of Linux virtual processor IDs). And/or you might want to look at the hwloc project to get nice pictures of the topology of your machine, and look at hwloc-bind as a simpler-to-use alternative to taskset.
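
For example, something along these lines (a rough sketch; "./app" is just a placeholder for your binary):

# Let Open MPI do the placement and binding itself -- one rank per core,
# bound to that core -- and print the resulting bindings at launch
mpirun -np 28 --map-by core --bind-to core --report-bindings ./app

# Or keep the per-rank wrapper style, but use hwloc-bind's logical core
# numbering instead of taskset's OS processor IDs
mpirun -np 36 hwloc-bind core:0-27 -- ./app

# Draw the machine topology (sockets, cores, hardware threads, caches)
lstopo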

Also be aware that there's a (big) difference between enabling/disabling HT in the BIOS and enabling/disabling HT in the OS.

- Disabling HT in the BIOS means that the one hardware thread left in each core will get all the core's resources (buffers, queues, processor units, etc.).
- Enabling HT in the BIOS means that each of the 2 hardware threads will statically be allocated roughly half the core's resources (buffers, queues, processor units, etc.).

- When HT is enabled in the BIOS and you enable HT in the OS, then Linux assigns one virtual processor ID to each HT.
- When HT is enabled in the BIOS and you disable HT in the OS, then Linux simply does not schedule anything to run on half the virtual processor IDs (e.g., the 2nd hardware thread in each core). This is NOT the same thing as disabling HT in the BIOS -- those HTs are still enabled and have half the core's resources; Linux is just choosing not to use them.
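
For reference, a quick way to see this on Linux (a small sketch; the CPU numbers are just an example -- on the machine discussed later in this thread, PU 0 and PU 28 are the two hardware threads of core 0):

# HT enabled in the BIOS: each core shows up as two logical CPUs
lscpu | grep 'Thread(s) per core'

# Which logical CPUs are siblings on the same physical core
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

# "Disabling HT in the OS": Linux just stops scheduling on the sibling;
# the BIOS setting is untouched
echo 0 | sudo tee /sys/devices/system/cpu/cpu28/online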

Make sense?

Hence, if you're testing whether your applications will work well with HT or not, you need to enable/disable HT in the BIOS to get a proper test.

Spoiler alert: many people have looked at this. In *most* (but not all) cases, using HT is not a performance win for MPI/HPC codes that are designed to run processors at 100%.
Post by Jordi Guitart
[…]
--
Jeff Squyres
***@cisco.com
Reuti
2017-03-24 22:10:35 UTC
Permalink
Hi,
Post by Jeff Squyres (jsquyres)
Limiting MPI processes to hyperthreads *helps*, but current generation Intel hyperthreads are not as powerful as cores (they have roughly half the resources of a core), so -- depending on your application and your exact system setup -- you will almost certainly see performance degradation of running N MPI processes across N cores vs. across N hyper threads. You can try it yourself by running the same size application over N cores on a single machine, and then run the same application over N hyper threads (i.e., N/2 cores) on the same machine.
[…]
- Disabling HT in the BIOS means that the one hardware thread left in each core will get all the cores resources (buffers, queues, processor units, etc.).
- Enabling HT in the BIOS means that each of the 2 hardware threads will statically be allocated roughly half the core's resources (buffers, queues, processor units, etc.).
Do you have a reference for the two points above (sure, I will try it myself next week)? My understanding was that there is no dedicated HT core, and that using all logical CPUs will not give the result that the real cores get N x 100% plus the HT ones N x 50% (or the like); rather, the scheduler inside the CPU balances the resources between the two faces of a single core, and both are equal.
Post by Jeff Squyres (jsquyres)
[…]
Spoiler alert: many people have looked at this. In *most* (but not all) cases, using HT is not a performance win for MPI/HPC codes that are designed to run processors at 100%.
I think it was also on this mailing list that someone mentioned that the pipelines in the CPU are reorganized when you switch HT off, as only half of them would be needed, and these resources are then bound to the real cores too, extending their performance. Similar, but not exactly, to what Jeff mentions above.

Another aspect is that even if the threads do not really double the performance, one might get 150%. And if you pay per CPU hour, it can be worth having it switched on.

My personal experience is that it depends not only on the application, but also on how you oversubscribe. Using all cores for a single MPI application means that all processes are (often) doing the same work at the same time and fighting for the same parts of the CPU, which essentially becomes a bottleneck. But giving the two halves of each core to two (or more) different applications allows better interleaving of the demand for resources. To allow this in the best way: no taskset or binding to cores; let the Linux kernel and CPU do their best - YMMV.

-- Reuti
Tim Prince via users
2017-03-24 22:53:33 UTC
Permalink
Post by Reuti
[…]
HT implementations vary in some of the details to which you refer.
The most severe limitation in disabling HT on Intel CPUs of the last 5
years has been that half of the hardware ITLB entries remain
inaccessible. This was supposed not to be a serious limitation for many
HPC applications.
Applications where each thread needs all of L1 or fill (cache lines
pending update) buffers aren't so suitable for HT. Intel compilers have
some ability at -O3 to adjust automatic loop fission and fusion for
applications with high fill buffer demand, requiring that there be just
1 thread using those buffers.
In practice, HT actually reduces the rate at which FPU instructions can
be issued on Intel "big core" CPUs.
HT together with MPI usually requires effective HT-aware pinning. It
seems unusual for MPI ranks to share cores effectively simply under
control of kernel scheduling (although Linux is more capable than
Windows). Agreed that explicit use of taskset under MPI should have been
superseded by the binding options implemented by several MPIs, including Open MPI.
--
Tim Prince
Jeff Squyres (jsquyres)
2017-03-24 23:31:20 UTC
Permalink
Post by Reuti
Post by Jeff Squyres (jsquyres)
- Disabling HT in the BIOS means that the one hardware thread left in each core will get all the cores resources (buffers, queues, processor units, etc.).
- Enabling HT in the BIOS means that each of the 2 hardware threads will statically be allocated roughly half the core's resources (buffers, queues, processor units, etc.).
Do you have a reference for the two topics above (sure, I will try next week on my own)? My knowledge was, that there is no dedicated HT core, and using all cores will not give the result that the real cores get N x 100%, plus the HT ones N x 50% (or alike). But the scheduler inside the CPU will balance the resources between the double face of a single core and both are equal.
I'm not quite sure I can parse your above statements.

I should be clear: I was referring to Intel server core processors. It may be the same across all the Intel core processors, but I do not have direct knowledge of that.

I'm afraid I don't have specific citations; you'll just have to look in the Intel docs. In addition to what Tim said, I should note that a core is just a collection of resources. When you enable HT, a) there's 2 hardware threads active, and b) most of the resources in the core are effectively split in half and assigned to each hardware thread. When you disable HT, a) there's only 1 hardware thread, and b) the resources of the core are allocated to that one hardware thread.

I'm speaking in generalities; go read the Intel docs for more specifics.
Post by Reuti
My personal experience is, that it depends not only application, but also on the way how you oversubscribe.
+1
--
Jeff Squyres
***@cisco.com
Ben Menadue
2017-03-25 07:04:35 UTC
Permalink
Hi Jeff,
Post by Jeff Squyres (jsquyres)
When you enable HT, a) there's 2 hardware threads active, and b) most of the resources in the core are effectively split in half and assigned to each hardware thread. When you disable HT, a) there's only 1 hardware thread, and b) the resources of the core are allocated to that one hardware thread.
I’m not sure about this. It was my understanding that HyperThreading is implemented as a second set of e.g. registers that share execution units. There’s no division of the resources between the hardware threads; rather, the execution units switch between the two threads as they stall (e.g. cache miss, hazard/dependency, misprediction, …) — kind of like a context switch, but much cheaper. As long as there’s nothing being scheduled on the other hardware thread, there’s no impact on performance. Moreover, turning HT off in the BIOS doesn’t make more resources available to the now-single hardware thread.

This matches our observations on our cluster — there was no statistically-significant change in performance between having HT turned off in the BIOS and turning the second hardware thread of each core off in Linux. We run a mix of architectures — Sandy, Ivy, Haswell, and Broadwell (all dual-socket Xeon E5s), and KNL — and this appears to hold true across all of these.

Moreover, having the second hardware thread turned on in Linux but not used by batch jobs (by cgroup-ing them to just one hardware thread of each core) substantially reduced the performance impact and jitter from the OS — by ~10% in at least one synchronisation-heavy application. This is likely because the kernel began scheduling OS tasks (Lustre, IB, IPoIB, IRQs, Ganglia, PBS, …) on the second, unused hardware thread of each core, which were then run when the batch job’s processes stalled the CPU’s execution units. This is with both a CentOS 6.x kernel and a custom (tickless) 7.2 kernel.

Given these results, we now leave HT on in both the BIOS and OS, and cgroup batch jobs to either one or all hardware threads of the allocated cores based on a PBS resource request. Most jobs don’t request or benefit from the extra hardware threads, but some (e.g. very I/O-heavy) do.
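
As a concrete sketch of that setup (cgroup v1 cpuset controller assumed mounted at /sys/fs/cgroup/cpuset; the group name, PID variable, and CPU ranges are illustrative and follow the first-thread/second-thread numbering used elsewhere in this thread):

# Confine a batch job to the first hardware thread of each core (0-27),
# leaving the second threads (28-55) free for OS daemons and interrupts
mkdir /sys/fs/cgroup/cpuset/job123
echo 0-27 > /sys/fs/cgroup/cpuset/job123/cpuset.cpus
echo 0-1  > /sys/fs/cgroup/cpuset/job123/cpuset.mems    # both NUMA nodes
echo $JOB_PID > /sys/fs/cgroup/cpuset/job123/tasks
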
Post by Jeff Squyres (jsquyres)
Post by Reuti
My personal experience is, that it depends not only application, but also on the way how you oversubscribe.
+1
+2

As always, experiment to find the best for your hardware and jobs.

Cheers,
Ben
Jeff Squyres (jsquyres)
2017-03-25 14:13:56 UTC
Permalink
Post by Ben Menadue
I’m not sure about this. It was my understanding that HyperThreading is implemented as a second set of e.g. registers that share execution units. There’s no division of the resources between the hardware threads, but rather the execution units switch between the two threads as they stall (e.g. cache miss, hazard/dependency, misprediction, …) — kind of like a context switch, but much cheaper. As long as there’s nothing being scheduled on the other hardware thread, there’s no impact on the performance. Moreover, turning HT off in the BIOS doesn’t make more resources available to now-single hardware thread.
Here's an old post on this list where I cited a paper from the Intel Technology Journal. The paper is pretty old at this point (2002, I believe?), but I believe it was published near the beginning of the HT technology at Intel:

https://www.mail-archive.com/hwloc-***@lists.open-mpi.org/msg01135.html

The paper is attached on that post; see, in particular, the section "Single-task and multi-task modes".

All this being said, I'm a software wonk with a decent understanding of hardware. But I don't closely follow all the specific details of all hardware. So if Haswell / Broadwell / Skylake processors, for example, are substantially different than the HT architecture described in that paper, please feel free to correct me!
Post by Ben Menadue
This matches our observations on our cluster — there was no statistically-significant change in performance between having HT turned off in the BIOS and turning the second hardware thread of each core off in Linux. We run a mix of architectures — Sandy, Ivy, Haswell, and Broadwell (all dual-socket Xeon E5s), and KNL, and this appears to hold true across of these.
These are very complex architectures; the impacts of enabling/disabling HT are going to be highly specific to both the platform and application.
Post by Ben Menadue
Moreover, having the second hardware thread turned on in Linux but not used by batch jobs (by cgroup-ing them to just one hardware thread of each core) substantially reduced the performance impact and jitter from the OS — by ~10% in at least one synchronisation-heavy application. This is likely because the kernel began scheduling OS tasks (Lustre, IB, IPoIB, IRQs, Ganglia, PBS, …) on the second, unused hardware thread of each core, which were then run when the batch job’s processes stalled the CPU’s execution units. This is with both a CentOS 6.x kernel and a custom (tickless) 7.2 kernel.
Yes, that's a pretty clever use of HT in an HPC environment. But be aware that you are cutting on-core pipeline depths that can be used by applications to do this. In your setup, it sounds like this is still a net performance win (which is pretty sweet). But that may not be a universal effect.

This is probably a +3 on the existing trend from the prior emails in this thread: "As always, experiment to find the best for your hardware and jobs." ;-)

--
Jeff Squyres
***@cisco.com
Ben Menadue
2017-03-26 07:37:56 UTC
Permalink
Hi,
Post by Jeff Squyres (jsquyres)
Here's an old post on this list where I cited a paper from the Intel Technology Journal.
Thanks for that link! I need to go through it in detail, but this paragraph did jump out at me:
On a processor with Hyper-Threading Technology, executing HALT transitions the processor from MT- mode to ST0- or ST1-mode, depending on which logical processor executed the HALT. For example, if logical processor 0 executes HALT, only logical processor 1 would be active; the physical processor would be in ST1-mode and partitioned resources would be recombined giving logical processor 1 full use of all processor resources. If the remaining active logical processor also executes HALT, the physical processor would then be able to go to a lower-power mode.

Linux’s task scheduler will issue halt instructions when there are no runnable tasks (unless you’re running with idle=poll — a nice way to make your datacentre toasty warm). This suggests that as long as you don’t schedule tasks on the second hardware thread, the first will have access to all the resources of the CPU. Yes, you’ll halve your L1 etc. as soon as an interrupt wakes the sleeping hardware thread, but hopefully that doesn’t happen too often. Turning off one hardware thread (via /sys/devices/system/cpu/cpu?/online) should force it to issue halt instructions whenever that thread gets woken by an interrupt, since that thread will then never have anything scheduled on it.
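
A small sketch of that last step (the 28-55 range assumes those are the second hardware threads, as on the machine discussed in this thread; adjust for your own numbering):

# Check the kernel is not in polling idle, which would keep siblings busy
grep -o 'idle=[a-z]*' /proc/cmdline

# Park the second hardware thread of every core so it stays halted
for cpu in $(seq 28 55); do
    echo 0 > /sys/devices/system/cpu/cpu$cpu/online
done

# Bring them back online later
for cpu in $(seq 28 55); do
    echo 1 > /sys/devices/system/cpu/cpu$cpu/online
done
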
Post by Jordi Guitart
However, what is puzzling me is the performance difference between OpenMPI 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and later versions) in my experiments with oversubscription, i.e. 82 seconds vs. 111 seconds.
You’re oversubscribing while letting the OS migrate individual threads between cores. That taskset will bind each MPI process to the same set of 28 logical CPUs (i.e. hardware threads), so if you’re running 36 ranks there then you must have migration happening. Indeed, even when you only launch 28 MPI ranks, you’ll probably still see migration between the cores — but likely a lot less. But as soon as you oversubscribe and spin-wait rather than yield you’ll be very sensitive to small changes in behaviour — any minor changes in OpenMPI’s behaviour, while not visible under normal circumstances, will lead to small changes in how and when the kernel task scheduler runs the tasks, and this can then multiply dramatically when you have synchronisation between the tasks via e.g. MPI calls.

Just as a purely hypothetical example, the newer versions might spin-wait in a slightly tighter loop and this might make the Linux task scheduler less likely to switch between waiting threads. This delay in switching tasks could appear as increased latency in any synchronising MPI call. But this is very speculative — it would be very hard to draw any conclusion about what’s happening if there’s no clear causative change in the code.

Try adding "--mca mpi_yield_when_idle 1" to your mpirun command. This will make OpenMPI issue a sched_yield when waiting instead of spin-waiting constantly. While it’s a performance hit when exactly- or under-subscribing, I can see it helping a bit when there’s contention for the cores from over-subscribing. In particular, a call to sched_yield relinquishes the rest of that process's current time slice and allows the task scheduler to run another waiting task (i.e. another of your MPI ranks) in its place.

So in fact this has nothing to do with HyperThreading — assuming 0 through 27 correspond to a single hardware thread on 28 distinct cores. Just keep in mind that this might not always be the case — we have at least one platform where the logical processor numbering enumerates the hardware threads before the cores, so 0 to (n-1) are the n threads of the first core, n to (2n-1) are those of the second, and so on.
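
An easy way to check which convention a machine uses:

# One line per logical CPU; if CPU 0 and CPU 28 report the same CORE,
# then 0-27 really are the first threads of 28 distinct cores
lscpu --extended=CPU,NODE,SOCKET,CORE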

Cheers,
Ben
Jordi Guitart
2017-03-27 10:52:50 UTC
Permalink
Hi Ben,

Thanks for your feedback. As described here
(https://www.open-mpi.org/faq/?category=running#oversubscribing),
OpenMPI detects that I'm oversubscribing and runs in degraded mode
(yielding the processor). Anyway, I repeated the experiments explicitly
setting the yielding flag, and I obtained the same weird results:

$HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36
taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36
taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93

Given these results, it seems that spin-waiting is not causing the
issue. I also agree that this should not be caused by HyperThreading,
given that 0-27 correspond to single HW threads on distinct cores, as
shown in the following output returned by the lstopo command:

Machine (128GB total)
NUMANode L#0 (P#0 64GB)
Package L#0 + L3 L#0 (35MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#28)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#29)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#30)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#31)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#32)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#33)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#34)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#35)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#36)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#37)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#38)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#39)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#40)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#41)
HostBridge L#0
PCIBridge
PCI 8086:24f0
Net L#0 "ib0"
OpenFabrics L#1 "hfi1_0"
PCIBridge
PCI 14e4:1665
Net L#2 "eno1"
PCI 14e4:1665
Net L#3 "eno2"
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI 102b:0534
GPU L#4 "card0"
GPU L#5 "controlD64"
NUMANode L#1 (P#1 64GB) + Package L#1 + L3 L#1 (35MB)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#42)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#43)
L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
PU L#32 (P#16)
PU L#33 (P#44)
L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
PU L#34 (P#17)
PU L#35 (P#45)
L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
PU L#36 (P#18)
PU L#37 (P#46)
L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
PU L#38 (P#19)
PU L#39 (P#47)
L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
PU L#40 (P#20)
PU L#41 (P#48)
L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
PU L#42 (P#21)
PU L#43 (P#49)
L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
PU L#44 (P#22)
PU L#45 (P#50)
L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
PU L#46 (P#23)
PU L#47 (P#51)
L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
PU L#48 (P#24)
PU L#49 (P#52)
L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
PU L#50 (P#25)
PU L#51 (P#53)
L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
PU L#52 (P#26)
PU L#53 (P#54)
L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
PU L#54 (P#27)
PU L#55 (P#55)
Post by Ben Menadue
[…]
http://bsc.es/disclaimer
r***@open-mpi.org
2017-03-27 15:00:53 UTC
Permalink
I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” setting. So why would you expect different results?
Post by Jordi Guitart
Hi Ben,
$HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93
[…]
Jordi Guitart
2017-03-27 15:46:28 UTC
Permalink
I was not expecting different results. I just wanted to respond to Ben's
suggestion, and demonstrate that the problem (the performance difference
between v.1.10.1 and v.1.10.2) is not caused by spin-waiting.
Post by r***@open-mpi.org
I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” setting. So why would you expect different results?
Post by Jordi Guitart
[…]
http://bsc.es/disclaimer
Jeff Squyres (jsquyres)
2017-03-27 15:51:07 UTC
Permalink
I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” setting. So why would you expect different results?
A few additional points to Ralph's question:

1. Recall that sched_yield() has effectively become a no-op in newer Linux kernels. Hence, Open MPI's "yield when idle" may not do much to actually de-schedule a currently-running process.

2. As for why there is a difference between version 1.10.1 and 1.10.2 in oversubscription behavior, we likely do not know offhand (as all of these emails have shown!). Honestly, we don't really pay much attention to oversubscription performance -- our focus tends to be on under/exactly-subscribed performance, because that's the normal operating mode for MPI applications. With oversubscribed, we have typically just said "all bets are off" and leave it at that.

3. I don't recall if there was a default affinity policy change between 1.10.1 and 1.10.2. Do you know that your taskset command is -- for absolutely sure -- overriding what Open MPI is doing? Or is what Open MPI is doing in terms of affinity/binding getting merged with what your taskset call is doing somehow...? (seems unlikely, but I figured I'd ask anyway)
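
One way to sanity-check that (a rough sketch; the pgrep pattern is just illustrative):

# What Open MPI itself decides about binding (printed at launch time)
mpirun -np 36 --report-bindings taskset -c 0-27 ./bt.C.36

# What each rank actually ended up with while running
for pid in $(pgrep -f bt.C.36); do taskset -pc $pid; done
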
Post by Jordi Guitart
$HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93
Per text later in your mail, "taskset -c 0-27" corresponds to the first hardware thread on each core.

Hence, this is effectively binding each process to the set of all "first hardware threads" across all cores.
Post by Jordi Guitart
Given these results, it seems that spin-waiting is not causing the issue.
I'm guessing that this difference is going to end up being the symptom of a highly complex system, of which spin-waiting is playing a part. I.e., if Open MPI weren't spin waiting, this might not be happening.

--
Jeff Squyres
***@cisco.com
Jordi Guitart
2017-03-28 09:15:32 UTC
Permalink
Hi,
Post by Jeff Squyres (jsquyres)
1. Recall that sched_yield() has effectively become a no-op in newer Linux kernels. Hence, Open MPI's "yield when idle" may not do much to actually de-schedule a currently-running process.
Yes, I'm aware of this. However, this should impact both OpenMPI
versions in the same way.
Post by Jeff Squyres (jsquyres)
2. As for why there is a difference between version 1.10.1 and 1.10.2 in oversubscription behavior, we likely do not know offhand (as all of these emails have shown!). Honestly, we don't really pay much attention to oversubscription performance -- our focus tends to be on under/exactly-subscribed performance, because that's the normal operating mode for MPI applications. With oversubscribed, we have typically just said "all bets are off" and leave it at that.
I agree that oversubscription is not the typical usage scenario, and I
can understand that optimizing its performance is not a priority. But
maybe the problem that I'm facing is just a symptom that something is
not working properly, and this could also impact undersubscription
scenarios (of course, to a lesser extent).
Post by Jeff Squyres (jsquyres)
3. I don't recall if there was a default affinity policy change between 1.10.1 and 1.10.2. Do you know that your taskset command is -- for absolutely sure -- overriding what Open MPI is doing? Or is what Open MPI is doing in terms of affinity/binding getting merged with what your taskset call is doing somehow...? (seems unlikely, but I figured I'd ask anyway)
Regarding the changes between 1.10.1 and 1.10.2, I only found one that
seems related to oversubscription (i.e. "Correctly handle
oversubscription when not given directives to permit it"). I don't know
if this could be impacting somehow ...

Regarding the impact of OpenMPI affinity options with taskset, I'd say
that it is a combination. With taskset I'm just constraining the
affinity placement decided by OpenMPI to the set of processors from 0 to
27. In any case, the affinity configuration is the same for v1.10.1 and
v1.10.2, namely:

Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYSOCKET  Ranking policy: SLOT
Binding policy: NONE:IF-SUPPORTED  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
Num new daemons: 0  New daemon starting vpid INVALID
Num nodes: 1
Post by Jeff Squyres (jsquyres)
Per text later in your mail, "taskset -c 0-27" corresponds to the first hardware thread on each core.
Hence, this is effectively binding each process to the set of all "first hardware threads" across all cores.
Yes, that was the intention: to avoid running two MPI processes on the
same physical core.
Post by Jeff Squyres (jsquyres)
I'm guessing that this difference is going to end up being the symptom of a highly complex system, of which spin-waiting is playing a part. I.e., if Open MPI weren't spin waiting, this might not be happening.
I'm not sure about the impact of spin-waiting, taking into account that
OpenMPI is running in degraded mode.

Thanks

http://bsc.es/disclaimer

Ben Menadue
2017-03-28 01:36:06 UTC
Permalink
Hi,
Post by r***@open-mpi.org
I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” setting. So why would you expect different results?
Ahh — I didn’t realise it auto-detected this. I recall working on a system in the past where I needed to explicitly set this to get that behaviour, but that could have been due to some local site configuration.
Post by r***@open-mpi.org
1. Recall that sched_yield() has effectively become a no-op in newer Linux kernels. Hence, Open MPI's "yield when idle" may not do much to actually de-schedule a currently-running process.
In some kernel versions there’s a sysctl (/proc/sys/kernel/sched_compat_yield) that makes it mimic the older behaviour by putting the yielding task at the end of the tree instead of the start. It was introduced in 1799e35 (2.6.23-rc7) and removed in ac53db5 (2.6.39-rc1), so if you have a kernel in this range (e.g. the stock CentOS or RHEL 6 kernel), it might be available to you.
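
For anyone on such a kernel, a quick sketch:

# Check whether the compatibility knob exists, then turn it on (needs root)
cat /proc/sys/kernel/sched_compat_yield
sysctl -w kernel.sched_compat_yield=1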

As an aside, has anyone thought about changing the niceness of the thread before yielding (and then back when sched_yield returns)? I just tested that sequence and it seems to produce the desired behaviour.

On their own, and pinned to a single CPU, my test programs sat at 100% CPU utilisation with these timings:
spin: 2.1ns per iteration
yield: 450ns per iteration
nice+yield: 2330ns per iteration

When I ran spin and yield at the same time on the same CPU, both consumed 50% of the CPU, and produced times about double those above — as expected given the new sched_yield behaviour.

On the other hand, when I ran spin and nice+yield together, spin sat at 98.5% CPU utilisation, nice+yield at 1.5%, and the timing for spin was identical to when it was on its own.

The only problem I can see is that the timing for nice+yield increased dramatically — to 187,600 ns per iteration! Is this too long for a yield?

Cheers,
Ben
Jeff Hammond
2017-03-28 01:28:47 UTC
Permalink
Post by Jeff Squyres (jsquyres)
Here's an old post on this list where I cited a paper from the Intel Technology Journal. The paper is pretty old at this point (2002, I believe?), but I believe it was published near the beginning of the HT technology at Intel. […] All this being said, I'm a software wonk with a decent understanding of hardware. But I don't closely follow all the specific details of all hardware. So if Haswell / Broadwell / Skylake processors, for example, are substantially different than the HT architecture described in that paper, please feel free to correct me!
I don't know the details, but HPC centers like NERSC noticed a shift around
Ivy Bridge (Edison) that caused them to enable it.

https://www.nersc.gov/users/computational-systems/edison/performance-and-optimization/hyper-threading/

I know two of the authors of that 2002 paper on HT. Will ask them for
insight next time we cross paths.

Jeff
[…]
--
Jeff Hammond
***@gmail.com
http://jeffhammond.github.io/
Jordi Guitart
2017-03-25 15:22:32 UTC
Permalink
Hi,

Very interesting discussion about the impact of HT. I was not aware
of the potential difference between turning off HT in the BIOS vs. in
the OS. However, this was not the main issue in my original message. I
was expecting some performance degradation with oversubscription, and I
can also agree that the performance when using HT depends on the
application. However, what is puzzling me is the performance difference
between OpenMPI 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and
later versions) in my experiments with oversubscription, i.e. 82 seconds
vs. 111 seconds. Note that the two experiments have the same degree of
oversubscription (36 over 28) and the same HT configuration (the same
processors allowed in the cpuset mask). In addition, the performance
difference is consistent between executions. According to this,
non-determinism of oversubscription is not enough to explain this
performance difference, and there must be some implementation issue in
OpenMPI 1.10.2 that was not present in version 1.10.1.

Thanks

PS. About the use of taskset: I tried using the --cpu-set flag of mpirun
(which, as far as I understand, should provide the same effect), but it
was not working correctly on my system, as processes were scheduled on
processors not included in the cpuset list.
Post by Jeff Squyres (jsquyres)
[…]
http://bsc.es/disclaimer