Discussion:
[OMPI users] Program hangs in mpi_bcast
Tom Rosmond
2011-11-14 22:10:01 UTC
Hello:

A colleague and I have been running a large F90 application that does an
enormous number of mpi_bcast calls during execution. I deny any
responsibility for the design of the code or for why it needs these
calls; it is what we have inherited and have to work with.

Recently we ported the code to an 8-node, 6-processor-per-node NUMA
system (lstopo output attached) running Debian Linux 6.0.3 with Open MPI
1.5.3, and began having trouble with mysterious 'hangs' inside the
mpi_bcast calls. The hangs were always in the same calls, but not
necessarily at the same time during integration. We originally built
without NUMA support, so we reinstalled with libnuma support added, but
the problem persisted. Finally, just as a wild guess, we inserted
'mpi_barrier' calls just before the 'mpi_bcast' calls, and the program
now runs without problems.
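
Schematically, the pattern we ended up with is like the self-contained
sketch below; the buffer, its size, and the loop counts are purely
illustrative and not taken from the actual code:

      program bcast_workaround_sketch
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 10000, nsteps = 100000
      real :: buf(n)
      integer :: ierr, myrank, istep

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, myrank, ierr)
      if (myrank == 0) buf = 1.0

      do istep = 1, nsteps
         ! barrier inserted just before the broadcast, as described above;
         ! our guess is that it forces synchronization so that no rank
         ! falls far enough behind to exhaust some internal resource
         call mpi_barrier(mpi_comm_world, ierr)
         call mpi_bcast(buf, n, mpi_real, 0, mpi_comm_world, ierr)
      end do

      call mpi_finalize(ierr)
      end program bcast_workaround_sketch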

I believe conventional wisdom is that properly formulated MPI programs
should run correctly without barriers, so do you have any thoughts on
why we found it necessary to add them? The code has run correctly on
other architectures, e.g. the Cray XE6, so I don't think there is a bug
anywhere. My only explanation is that some internal resource gets
exhausted because of the large number of 'mpi_bcast' calls in rapid
succession, and the barrier calls force synchronization which allows the
resource to be restored. Does this make sense? I'd appreciate any
comments and advice you can provide.


I have attached compressed copies of config.log and ompi_info for the
system. The program is built with ifort 12.0 and typically runs with

mpirun -np 36 -bycore -bind-to-core program.exe

We have run both interactively and with PBS, but that doesn't seem to
make any difference in program behavior.

T. Rosmond
Ralph Castain
2011-11-14 23:17:46 UTC
Yes, this is well documented - it may be on the FAQ, but it has certainly come up on the user list multiple times.

The problem is that one process falls behind, which causes it to begin accumulating "unexpected messages" in its queue. This causes the matching logic to run a little slower, making the process fall further and further behind. Eventually things hang: everyone is sitting in bcast waiting for the slow proc to catch up, but its queue is saturated and it can't.

The solution is to do exactly what you describe - add some barriers to force the slow process to catch up. This happened often enough that we added support for it in OMPI itself, so you don't have to modify your code. Look at the following from "ompi_info --param coll sync":

MCA coll: parameter "coll_base_verbose" (current value: <0>, data source: default value)
Verbosity level for the coll framework (0 = no verbosity)
MCA coll: parameter "coll_sync_priority" (current value: <50>, data source: default value)
Priority of the sync coll component; only relevant if barrier_before or barrier_after is > 0
MCA coll: parameter "coll_sync_barrier_before" (current value: <1000>, data source: default value)
Do a synchronization before each Nth collective
MCA coll: parameter "coll_sync_barrier_after" (current value: <0>, data source: default value)
Do a synchronization after each Nth collective

Take your pick - inserting a barrier before or after doesn't seem to make a lot of difference, but most people use "before". Try different values until you get something that works for you.
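
For example, you can set it right on the mpirun command line you already
use (the value 100 below is just an arbitrary starting point - tune it
until the hangs go away):

mpirun -mca coll_sync_barrier_before 100 -np 36 -bycore -bind-to-core program.exe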
Tom Rosmond
2011-11-15 16:49:23 UTC
Ralph,

Thanks for the advice. I had to set 'coll_sync_barrier_before=5' to do
the job. This is a big change from the default value (1000), so our
application seems to be a pretty extreme case.
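
For anyone else who hits this: instead of passing it on every command
line, the setting can also go into Open MPI's per-user MCA parameter
file, for example (using the value that worked for us):

# $HOME/.openmpi/mca-params.conf
coll_sync_barrier_before = 5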

T. Rosmond
Jeff Squyres
2011-11-29 16:35:01 UTC
It's quite weird/surprising that you would need to set it down to *5* -- that's really low.

Can you share a simple reproducer code, perchance?
Ralph Castain
2011-11-30 18:50:33 UTC
FWIW: we already have a reproducer from prior work I did chasing this down a couple of years ago. See orte/test/mpi/bcast_loop.c
Ralph Castain
2011-11-30 18:51:52 UTC
Oh - and another one at orte/test/mpi/reduce-hang.c
Jeff Squyres
2011-11-30 20:29:50 UTC
Yes, but I'd like to see a reproducer that requires setting coll_sync_barrier_before=5. Your reproducers allowed much higher values, IIRC.

I'm curious to know what makes that code require such a low value (i.e., 5)...
Tom Rosmond
2011-11-30 20:39:10 UTC
Jeff,

I'm afraid trying to produce a reproducer of this problem wouldn't be
worth the effort. It is a legacy code that I wasn't involved in
developing and will soon be discarded, so I can't justify spending time
trying to understand its behavior better. The bottom line is that it
works correctly with the small 'sync' value, and because it isn't very
expensive to run, that is enough for us.

T. Rosmond
Jeff Squyres
2011-11-30 20:45:30 UTC
Fair enough. Thanks anyway!
Alex A. Granovsky
2011-12-02 13:50:35 UTC
Dear OpenMPI users,
Dear OpenMPI developers,

I would like to start a discussion on the implementation of collective
operations within OpenMPI. The reason for this is at least twofold.
Over the last months there has been a constantly growing number of
messages on this list from people facing problems with collectives, so I
believe these issues must be discussed and will hopefully finally get
the proper attention of the OpenMPI developers. The second reason is my
involvement in the development of the Firefly quantum chemistry package,
which, of course, uses collectives rather intensively.

Unlike many list members, I personally believe that the use of
collectives does not automatically mean bad and/or obsolete programming
style. Collectives are just a tool, and as with any tool they can be
used both properly and improperly. Collectives are part of the MPI
standard, and in some situations there are really good reasons to use
them repeatedly. Unfortunately, to the best of my knowledge, OpenMPI is
the only MPI implementation that hangs on repeated collective
operations. The workaround implemented by Ralph Castain effectively
calls MPI_Barrier() before or after some number of calls to collectives.
As with any workaround, it really only masks existing bugs or design
flaws, and it also decreases performance to some extent. Thus, I cannot
consider this workaround an acceptable solution, and I have always
wondered why this problem did not get more attention from the OpenMPI
developers.

Some of our users would like to use Firefly with OpenMPI. Usually we
simply answer them that OpenMPI is too buggy to be used. Perhaps some
list members answer exactly the same to the users of their software. I
hope this situation will change in the future. To be honest, a
workaround very similar to Ralph's has already been implemented in
Firefly for many years, so if one likes, one can activate the
"BuggyMPICollectives" option in the Firefly source code. Nevertheless,
the existence of this option in code that is intended to run at
performance very close to the hardware limits does not make us happy at
all.

Kind regards,
Alex Granovsky

