Discussion:
[OMPI users] Help - Client / server - app hangs in connect/accept by the second or next client that wants to connect to server
M. D.
2016-07-15 12:20:08 UTC
Hello,

I have a problem with a basic client-server application. I tried to run the C
program from this page:
https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
I have seen this program mentioned in many discussions on your website, so I
expected it to work properly, but after more testing I found that there is
probably an error somewhere in the connect/accept path. I have read many
discussions and threads on your website, but I have not found anyone facing a
similar problem. When I run this app with one server and several clients
(3, 4, 5, 6, ...), the app sometimes hangs. It hangs when the second or a later
client tries to connect to the server (sometimes the third client hangs,
sometimes the fourth, sometimes the second, and so on).
So the app hangs at the point where the server waits in accept and the client
waits in connect, and it cannot continue, because this client cannot connect to
the server. The strange thing is that I observe this behaviour only in some
cases: sometimes it works without any problems, sometimes it does not. The
behaviour is unpredictable and not stable.

I have Open MPI 1.10.2 installed on Fedora 19. I have the same problem with a
Java version of this application; it also hangs sometimes. I need this app in
Java, but first it has to work properly in the C implementation. Because of
this strange behaviour I assume there may be an error inside the Open MPI
implementation of the connect/accept methods. I also tried another version of
Open MPI, 1.8.1, but the problem did not disappear.

Could you help me figure out what could cause this problem? Maybe I have
misunderstood something about Open MPI (or connect/accept) and the problem is
on my side. I will appreciate any help, support, or interest in this topic.

Best regards,
Matus Dobrotka
Gilles Gouaillardet
2016-07-19 04:28:00 UTC
How do you run the test?

You should have the same number of clients in each mpirun instance; the simple
shell script below starts the test the way I think it is supposed to be run.

Note the test itself is arguable, since MPI_Comm_disconnect() is never invoked
(and you will observe some related dpm_base_disconnect_init errors).


#!/bin/sh

clients=3

screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1 2>&1 | tee /tmp/server.$clients"
for i in $(seq $clients); do
    sleep 1
    screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0 2>&1 | tee /tmp/client.$clients.$i"
done


Ralph,


this test fails with master.

When run as the "server" (second parameter is 1), MPI_Comm_accept() fails with
a timeout.

In ompi/dpm/dpm.c there is a hard-coded 60-second timeout:

OPAL_PMIX_EXCHANGE(rc, &info, &pdat, 60);

but this is not the timeout that is triggered ...

The eviction_cbfunc timeout function is invoked, and it was set when
opal_hotel_init() was invoked in orte/orted/pmix/pmix_server.c.


The default timeout is 2 seconds, but in this case (the user invokes
MPI_Comm_accept) I guess the timeout should be infinite, or the 60-second
hard-coded value described above.

Sadly, if I set a higher timeout value (mpirun --mca
orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return when the
client invokes MPI_Comm_connect().


Could you please have a look at this?


Cheers,


Gilles
M. D.
2016-07-19 07:37:30 UTC
Hi,

thank you for your interest in this topic.

This is how I normally run the test:
First, I start the "server" (second parameter is 1):
*mpirun -np 1 ./singleton_client_server number_of_clients 1*

Then I start the corresponding number of "clients" with the following command:
*mpirun -np 1 ./singleton_client_server number_of_clients 0*

So, for example, with 3 clients I run:
mpirun -np 1 ./singleton_client_server 3 1
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0

So you are right - there should be the same number of clients in each mpirun
instance.

The test does not involve MPI_Comm_disconnect(), but the problem occurs
earlier: some of the clients (in most cases the last one) sometimes cannot
connect to the server, and therefore the server and all clients hang, waiting
for the connection with the last client(s).

So the behaviour of the accept/connect methods is a bit confusing to me.
If I understand your post correctly, the problem is not in the timeout value,
is it?

Cheers,

Matus
Gilles Gouaillardet
2016-07-19 08:06:18 UTC
MPI_Comm_accept() must be called by all the tasks of the local communicator.

So if you run

1) mpirun -np 1 ./singleton_client_server 2 1

2) mpirun -np 1 ./singleton_client_server 2 0

3) mpirun -np 1 ./singleton_client_server 2 0

then 3) starts after 2) has exited, so on 1) the intracomm is made of 1) and an
exited task (2).

/*
strictly speaking, there is a race condition: if 2) has exited, then
MPI_Comm_accept() will crash when 1) informs 2) that 3) has joined;
if 2) has not yet exited, then the test will hang because 2) does not invoke
MPI_Comm_accept()
*/


There are different ways of seeing this:

1) this is an incorrect usage of the test; the number of clients should be the
same everywhere

2) task 2) should not exit (because it did not call MPI_Comm_disconnect()), and
the test should hang when task 3) starts, because task 2) does not call
MPI_Comm_accept()


I do not know how you want to spawn your tasks.

If 2) and 3) do not need to communicate with each other (they only communicate
with 1)), then you can simply call MPI_Comm_accept() over MPI_COMM_WORLD in 1).

If 2) and 3) need to communicate with each other, it would be much easier to
call MPI_Comm_spawn() or MPI_Comm_spawn_multiple() only once in 1), so there is
a single intercommunicator containing all the tasks.


The current test program grows the intercommunicator incrementally, which
requires extra synchronization steps.
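
For illustration only (this sketch is not part of the original test program): a
minimal server-side example of the spawn-once approach described above. The
executable name "./client", the client count, and the merge step are
assumptions.

/* Illustrative sketch (not from the test): spawn all clients once,
 * so a single intercommunicator contains every task. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm, game_comm;
    int num_clients = 3;              /* assumed client count */

    MPI_Init(&argc, &argv);

    /* one collective spawn instead of incremental connect/accept */
    MPI_Comm_spawn("./client", MPI_ARGV_NULL, num_clients, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    /* optionally merge into a single intracommunicator so every task
     * (server plus all clients) can talk to every other task */
    MPI_Intercomm_merge(intercomm, 0, &game_comm);

    /* ... play over game_comm ... */

    MPI_Comm_free(&game_comm);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}

On the client side, the spawned tasks would obtain the same intercommunicator
with MPI_Comm_get_parent() and call MPI_Intercomm_merge(parent, 1, &game_comm)
to join the merged communicator.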


Cheers,


Gilles
M. D.
2016-07-19 08:48:13 UTC
Post by Gilles Gouaillardet
MPI_Comm_accept must be called by all the tasks of the local communicator.
Yes, that is how I understand it. In the source code of the test, all the
tasks call MPI_Comm_accept() - the server and also the relevant clients.
Post by Gilles Gouaillardet
so if you
1) mpirun -np 1 ./singleton_client_server 2 1
2) mpirun -np 1 ./singleton_client_server 2 0
3) mpirun -np 1 ./singleton_client_server 2 0
then 3) starts after 2) has exited, so on 1), intracomm is made of 1) and
an exited task (2)
In my opinion this is not true, because of the above-mentioned fact that
MPI_Comm_accept() is called by all the tasks of the local communicator.
Post by Gilles Gouaillardet
/*
strictly speaking, there is a race condition, if 2) has exited, then
MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
if 2) has not yet exited, then the test will hang because 2) does not
invoke MPI_Comm_accept
*/
Task 2) does not exit, because of the blocking call to MPI_Comm_accept().
Post by Gilles Gouaillardet
1) this is an incorrect usage of the test, the number of clients should be
the same everywhere
2) task 2) should not exit (because it did not call MPI_Comm_disconnect())
and the test should hang when
starting task 3) because task 2) does not call MPI_Comm_accept()
Re 1): I am sorry, but maybe I do not understand what you mean - in my
previous post I wrote that the number of clients is the same in every mpirun
instance.
Re 2): the same as above.
Post by Gilles Gouaillardet
i do not know how you want to spawn your tasks.
if 2) and 3) do not need to communicate with each other (they only
communicate with 1)), then
you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)
if 2 and 3) need to communicate with each other, it would be much easier
to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),
so there is only one inter communicator with all the tasks.
My aim is for all the tasks to communicate with each other. I am implementing
a distributed application - a game with several players communicating with
each other via MPI. It should work as follows: the first player creates a game
and waits for the other players to connect to it. The other players, on
different computers in the same network, can then join the game. Once they are
connected, they should be able to play the game together.
I hope it is clear what my idea is. If not, please just ask.
Post by Gilles Gouaillardet
The current test program is growing incrementally the intercomm, which
does require extra steps for synchronization.
Cheers,
Gilles
Cheers,

Matus
Gilles Gouaillardet
2016-07-19 08:55:48 UTC
Here is what the client is doing:

printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size,
rank) ;

for (i = rank ; i < num_clients ; i++)
{
/* client performs a collective accept */
CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0,
intracomm, &intercomm)) ;

printf("CLIENT: connected to server on port\n") ;
[...]

}

2) has rank 1

/* and 3) has rank 2 */

so unless you run 2) with num_clients=2, MPI_Comm_accept() is never called,
hence my analysis of the crash/hang.


I understand what you are trying to achieve; keep in mind that
MPI_Comm_accept() is a collective call, so when a new player is willing to
join, the other players must invoke MPI_Comm_accept(), and it is up to you to
make sure that happens.

Cheers,


Gilles
M. D.
2016-07-19 09:24:43 UTC
Yes, I understand that, but I think this is exactly the situation you are
talking about. In my opinion the test does exactly what you said - when a new
player is willing to join, the other players must invoke MPI_Comm_accept().
All the *other* players must invoke MPI_Comm_accept(). Only the last client
(in this case the last player who wants to join) does not invoke
MPI_Comm_accept(), because this client invokes only MPI_Comm_connect(). It is
connecting to a communicator in which all the other players are already
involved, and therefore this last client does not have to invoke
MPI_Comm_accept().

Am I still missing something in this reasoning?
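
For reference, a sketch of the connect/merge step being discussed, paraphrased
from the client side of the test (simplified, not verbatim: error checking
omitted, names approximate):

/* Paraphrased client connect/merge (simplified sketch, not verbatim). Each
 * client connects to the server's port, merges the resulting intercomm into
 * its local intracomm, and only then enters the accept loop shown earlier
 * for the clients that join after it. */
char server_port_name[MPI_MAX_PORT_NAME];
MPI_Comm intercomm, intracomm;
int rank, size;

/* ... read server_port_name from the file written by the server ... */

MPI_Comm_connect(server_port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
MPI_Intercomm_merge(intercomm, 1, &intracomm);   /* high=1: ranked after the server side */
MPI_Comm_rank(intracomm, &rank);
MPI_Comm_size(intracomm, &size);

/* the i-th client ends up with rank i in the merged communicator, so
 * "for (i = rank; i < num_clients; i++) MPI_Comm_accept(...)" runs once
 * for every later client, and not at all for the last one */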

Matus
Gilles Gouaillardet
2016-07-19 13:23:18 UTC
My bad for the confusion -

I misread you and miswrote my reply.

I will investigate this again.

Strictly speaking, the clients can only start after the server has written the
port info to a file. If you start a client right after the server starts, it
might read incorrect/outdated info and cause the whole test to hang.

I will start reproducing the hang.
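
As an aside (an illustration only, not something from the test or from this
thread): one way to avoid the stale/missing port-file race described above is
to have the client wait until the server has actually written the port name
before calling MPI_Comm_connect(). The file name "server_port_name.txt" is an
assumption.

#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Poll until the port file exists and contains a complete line, then
 * return the port name with the trailing newline stripped. */
static void read_port_when_ready(char port_name[MPI_MAX_PORT_NAME])
{
    for (;;) {
        FILE *fp = fopen("server_port_name.txt", "r");   /* assumed file name */
        if (fp != NULL) {
            if (fgets(port_name, MPI_MAX_PORT_NAME, fp) != NULL &&
                strchr(port_name, '\n') != NULL) {
                port_name[strcspn(port_name, "\n")] = '\0';
                fclose(fp);
                return;
            }
            fclose(fp);
        }
        sleep(1);   /* server has not finished writing the port yet */
    }
}

The client would then pass the returned string to
MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm).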

Cheers,

Gilles
M. D.
2016-08-29 12:33:34 UTC
Hi,

I would like to ask - are there any new findings or solutions for this problem?

Cheers,

Matus Dobrotka

2016-07-19 15:23 GMT+02:00 Gilles Gouaillardet <
Post by Gilles Gouaillardet
my bad for the confusion,
I misread you and miswrote my reply.
I will investigate this again.
strictly speaking, the clients can only start after the server first write
the port info to a file.
if you start the client right after the server start, they might use
incorrect/outdated info and cause all the test hang.
I will start reproducing the hang
Cheers,
Gilles
Post by M. D.
Yes I understand it, but I think, this is exactly that situation you are
talking about. In my opinion, the test is doing exactly what you said -
when a new player is willing to join, other players must invoke MPI_Comm_accept().
All *other* players must invoke MPI_Comm_accept(). Only the last client
(in this case last player which wants to join) does not
invoke MPI_Comm_accept(), because this client invokes only
MPI_Comm_connect(). He is connecting to communicator, in which all other
players are already involved and therefore this last client doesn't have to
invoke MPI_Comm_accept().
Am I still missing something in this my reflection?
Matus
Post by Gilles Gouaillardet
here is what the client is doing
printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size,
rank) ;
for (i = rank ; i < num_clients ; i++)
{
/* client performs a collective accept */
CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0,
intracomm, &intercomm)) ;
printf("CLIENT: connected to server on port\n") ;
[...]
}
2) has rank 1
/* and 3) has rank 2) */
so unless you run 2) with num_clients=2, MPI_Comm_accept() is never
called, hence my analysis of the crash/hang
I understand what you are trying to achieve, keep in mind
MPI_Comm_accept() is a collective call, so when a new player
is willing to join, other players must invoke MPI_Comm_accept().
and it is up to you to make sure that happens
Cheers,
Gilles
Post by Gilles Gouaillardet
MPI_Comm_accept must be called by all the tasks of the local communicator.
Yes, that's how I understand it. In the source code of the test, all the
tasks call MPI_Comm_accept - server and also relevant clients.
Post by Gilles Gouaillardet
so if you
1) mpirun -np 1 ./singleton_client_server 2 1
2) mpirun -np 1 ./singleton_client_server 2 0
3) mpirun -np 1 ./singleton_client_server 2 0
then 3) starts after 2) has exited, so on 1), intracomm is made of 1)
and an exited task (2)
This is not true in my opinion - because of above mentioned fact that
MPI_Comm_accept is called by all the tasks of the local communicator.
Post by Gilles Gouaillardet
/*
strictly speaking, there is a race condition, if 2) has exited, then
MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
if 2) has not yet exited, then the test will hang because 2) does not
invoke MPI_Comm_accept
*/
Task 2) does not exit, because of blocking call of MPI_Comm_accept.
Post by Gilles Gouaillardet
1) this is an incorrect usage of the test, the number of clients should
be the same everywhere
2) task 2) should not exit (because it did not call
MPI_Comm_disconnect()) and the test should hang when
starting task 3) because task 2) does not call MPI_Comm_accept()
ad 1) I am sorry, but maybe I do not understand what you think - In my
previous post I wrote that the number of clients is the same in every
mpirun instance.
ad 2) it is the same as above
Post by Gilles Gouaillardet
i do not know how you want to spawn your tasks.
if 2) and 3) do not need to communicate with each other (they only
communicate with 1)), then
you can simply MPI_Comm_accept(MPI_COMM_WORLD) in 1)
if 2 and 3) need to communicate with each other, it would be much
easier to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1),
so there is only one inter communicator with all the tasks.
My aim is that all the tasks need to communicate with each other. I am
implementing a distributed application - a game with several players
communicating with each other via MPI. It should work as follows: the first
player creates a game and waits for other players to connect to it. The
other players, on different computers in the same network, can then join
this game. Once they are connected, they should be able to play the game
together.
I hope it is clear what my idea is. If it is not, just ask me, please.
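If it helps, the usual MPI way to express "the first player creates a game and waits" is MPI_Open_port() on the first player's side, optionally combined with MPI_Publish_name()/MPI_Lookup_name() so the other players can find the port. A minimal sketch (the service name "mpi-game" is only an example; with Open MPI, publish/lookup across independent mpirun instances typically needs a reachable ompi-server, otherwise the port string has to be shared another way, e.g. via a file):

#include <stdio.h>
#include <mpi.h>

/* Sketch only: the first player opens a port, publishes it, and would then
 * accept the other players (e.g. with a loop like the one sketched earlier). */
int main(int argc, char *argv[])
{
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port_name);
    MPI_Publish_name("mpi-game", MPI_INFO_NULL, port_name);
    printf("waiting for players on port %s\n", port_name);
    /* ... MPI_Comm_accept() once per joining player goes here ... */
    MPI_Unpublish_name("mpi-game", MPI_INFO_NULL, port_name);
    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}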
Post by Gilles Gouaillardet
The current test program grows the intercomm incrementally, which does
require extra synchronization steps.
Cheers,
Gilles
Cheers,
Matus
Post by Gilles Gouaillardet
Hi,
thank you for your interest in this topic.
*mpirun -np 1 ./singleton_client_server number_of_clients 1*
*mpirun -np 1 ./singleton_client_server number_of_clients 0*
mpirun -np 1 ./singleton_client_server 3 1
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0
It means you are right - there should be the same number of clients in
each mpirun instance.
The test does not involve MPI_Comm_disconnect(), but the problem in
the test occurs earlier, because some of the clients (in most cases
actually the last client) sometimes cannot connect to the server, and
therefore all the clients and the server hang (waiting for the connection
with the last client(s)).
So the behaviour of the accept/connect methods is a bit confusing to me.
If I understand your post correctly, the problem is not in the timeout
value, is it?
Cheers,
Matus
Gilles Gouaillardet
2016-10-22 02:45:58 UTC
Permalink
Matus,

This has very likely been fixed by
https://github.com/open-mpi/ompi/pull/2259
Can you download the patch at
https://github.com/open-mpi/ompi/pull/2259.patch and apply it manually on
v1.10?

Cheers,

Gilles
Post by M. D.
Hi,
I would like to ask - are there any new solutions or investigations into
this problem?
Cheers,
Matus Dobrotka