Discussion:
[OMPI users] new core binding issues?
Noam Bernstein
2018-06-22 16:33:29 UTC
Permalink
Hi - for the last couple of weeks, more or less since we did some kernel updates, certain compute-intensive MPI jobs have been behaving oddly speed-wise: parts that should be quite fast sometimes (but not consistently) take a long time, and re-running sometimes fixes the issue, sometimes not. I’m starting to suspect core binding problems, which I worry will be difficult to debug, so I was hoping for some feedback on whether my observations really do suggest that something is wrong with the core binding.

I’m running the latest CentOS 6 kernel (2.6.32-696.30.1.el6.x86_64) and OpenMPI 3.1.0 on a dual-CPU node with two 8-core (+ HT) Intel Xeons. The code is compiled with ifort using “-mkl=sequential”, and just to be certain OMP_NUM_THREADS=1, so there should be no OpenMP parallelism.
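For reference, the threading-related environment is set along these lines before the run (MKL_NUM_THREADS is shown only as an extra guard; with -mkl=sequential it should be redundant anyway):

export OMP_NUM_THREADS=1   # no OpenMP threads in the application
export MKL_NUM_THREADS=1   # extra guard; -mkl=sequential should already force single-threaded MKL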

The main question: if I’m running 16 MPI tasks per node and look at the PSR field from ps, should I see some simple sequence of numbers?

Here’s the beginning of the binding report for the per-core binding I requested from mpirun (--bind-to core):
[compute-7-2:31036] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[compute-7-2:31036] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[compute-7-2:31036] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[compute-7-2:31036] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[compute-7-2:31036] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../..][../../../../../../../..]
[compute-7-2:31036] MCW rank 5 bound to socket 1[core 10[hwt 0-1]]: [../../../../../../../..][../../BB/../../../../..]
[compute-7-2:31036] MCW rank 6 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../..][../../../../../../../..]

This is the PSR info from ps
PID PSR TTY TIME CMD
31043 1 ? 00:00:34 vasp.para.intel
31045 2 ? 00:00:34 vasp.para.intel
31047 3 ? 00:00:34 vasp.para.intel
31049 4 ? 00:00:34 vasp.para.intel
31051 5 ? 00:00:34 vasp.para.intel
31055 7 ? 00:00:34 vasp.para.intel
31042 8 ? 00:00:34 vasp.para.intel
31046 10 ? 00:00:34 vasp.para.intel
31048 11 ? 00:00:34 vasp.para.intel
31052 13 ? 00:00:34 vasp.para.intel
31054 14 ? 00:00:34 vasp.para.intel
31053 22 ? 00:00:34 vasp.para.intel
31044 25 ? 00:00:34 vasp.para.intel
31050 28 ? 00:00:34 vasp.para.intel
31056 31 ? 00:00:34 vasp.para.intel

Does this output look reasonable? For any sensible way I can think of to enumerate the 32 virtual cores, those numbers don’t seem to correspond to one MPI task per core. If this output isn’t supposed to be meaningful given how Open MPI does its binding, is there another tool that can tell me what cores a running job is actually running on/bound to?
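The kind of check I have in mind, assuming taskset from util-linux and the hwloc command-line tools are available on the node, is something like:

taskset -cp 31043                           # affinity mask (allowed CPUs) for one of the ranks above
grep Cpus_allowed_list /proc/31043/status   # the same information straight from the kernel
hwloc-ps                                    # hwloc’s view of which processes are bound where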

An additional bit of confusion is that “ps -mo pid,tid,fname,user,psr -p PID” on one of those processes (which is supposed to be running without threaded parallelism) reports 3 separate TIDs (which I think correspond to threads), with 3 different PSR values that seem stable during the run but don’t have any obvious relation to one another (not P and P+1, or P and P+8, or P and P+16).
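As I understand it, PSR is just the processor each thread last ran on, not its affinity mask; a sketch of how the per-thread masks could be checked, looping over the thread IDs of one rank (31043 taken from the ps output above), is:

for tid in /proc/31043/task/*; do
    taskset -cp ${tid##*/}    # affinity mask of each thread, including any extra library threads
done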


thanks,
Noam
r***@open-mpi.org
2018-06-22 17:00:01 UTC
Permalink
I suspect it is okay. Keep in mind that OMPI itself starts multiple progress threads, so that is likely what you are seeing. The binding pattern in the mpirun output looks correct, as the default would be to map-by socket and you asked that we bind-to core.
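In other words, with no mapping policy given, what you ran should be roughly equivalent to spelling it out like this (executable name taken from your ps output):

mpirun -np 16 --map-by socket --bind-to core --report-bindings ./vasp.para.intel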
Noam Bernstein
2018-06-22 17:13:42 UTC
Permalink
Post by r***@open-mpi.org
I suspect it is okay. Keep in mind that OMPI itself starts multiple progress threads, so that is likely what you are seeing. The binding pattern in the mpirun output looks correct, as the default would be to map-by socket and you asked that we bind-to core.
I agree that the mpirun output makes sense - I’m more worried about the PSR field in the output of ps. Is it really consistent with what mpirun asked for?

Noam



____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628 F +1 202 404 7546
https://www.nrl.navy.mil <https://www.nrl.navy.mil/>
r***@open-mpi.org
2018-06-22 17:25:01 UTC
Permalink
Afraid I’m not familiar with that option, so I really don’t know :-(
Brice Goglin
2018-06-22 18:14:16 UTC
Permalink
If psr is the processor where the task is actually running, I guess we'd need your lstopo output to find out where those processors are in the machine.
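Something like this would show both the topology with the physical (P#) indices that ps reports and hwloc’s own view of which processes are bound where (assuming the hwloc command-line tools are installed on that node):

lstopo --no-io -p    # topology with OS/physical indices, I/O devices omitted
hwloc-ps             # processes currently bound to a subset of the machine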

Brice
Noam Bernstein
2018-06-22 19:18:44 UTC
Permalink
Post by Brice Goglin
If psr is the processor where the task is actually running, I guess we'd need your lstopo output to find out where those processors are in the machine.
Excellent, that’s exactly the sort of thing I was hoping someone on the list would know how to determine. From the lstopo output (at the end), 0 and 16 are one real core, 1 and 17 another, etc., which is roughly what I expected. It looks like my original message truncated the list, and of course that process isn’t running any more. Here’s a current list, with PSR as the second column:
5173 2 ? 00:10:35 vasp.para.intel
5182 5 ? 00:10:35 vasp.para.intel
5184 9 ? 00:10:36 vasp.para.intel
5177 16 ? 00:10:35 vasp.para.intel
5169 17 ? 00:10:35 vasp.para.intel
5187 19 ? 00:10:35 vasp.para.intel
5175 20 ? 00:10:36 vasp.para.intel
5179 22 ? 00:10:35 vasp.para.intel
5171 23 ? 00:10:35 vasp.para.intel
5178 24 ? 00:10:36 vasp.para.intel
5170 26 ? 00:10:36 vasp.para.intel
5181 27 ? 00:10:36 vasp.para.intel
5174 28 ? 00:10:36 vasp.para.intel
5172 29 ? 00:10:36 vasp.para.intel
5176 30 ? 00:10:36 vasp.para.intel
5189 31 ? 00:10:36 vasp.para.intel

That corresponds to physical core numbers
2,5,9,0,1,3,4,6,7,8,10,11,12,13,14,15
So it really does seem to be associating each process with a single unique core. I guess that’s good for Open MPI, although it does undercut my core-binding hypothesis.
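As a sanity check that doesn’t rely on reading the lstopo tree by hand, the kernel’s sysfs topology should tell the same story; e.g. for the rank last seen on PSR 2:

cat /sys/devices/system/cpu/cpu2/topology/physical_package_id   # which socket CPU 2 sits in
cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list  # should be 2,18 if P#2 and P#18 share a core, as lstopo says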

thanks,
Noam


compute-7-53 1001 : lstopo
Machine (64GB)
NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (20MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#17)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#18)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#19)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#20)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#21)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#22)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#23)
HostBridge L#0
PCIBridge
PCI 8086:1521
Net L#0 "eth0"
PCI 8086:1521
Net L#1 "eth1"
PCIBridge
PCI 15b3:1003
Net L#2 "ib0"
OpenFabrics L#3 "mlx4_0"
PCI 8086:8d62
PCIBridge
PCIBridge
PCI 1a03:2000
PCI 8086:8d02
Block L#4 "sda"
NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#24)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#25)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#26)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#27)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#28)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#29)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#30)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#31)
