Discussion:
[OMPI users] General question about running single-node jobs.
Lee-Ping Wang
2014-09-30 00:12:08 UTC
Hi there,

My application uses MPI to run parallel jobs on a single node, so I have no need for inter-node communication support. However, when I use mpirun to launch my application, I see strange errors such as:

--------------------------------------------------------------------------
No network interfaces were found for out-of-band communications. We require
at least one available network for out-of-band messaging.
--------------------------------------------------------------------------

[nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket for out-of-band communications in file oob_tcp_listener.c at line 113
[nid23206:10697] [[33772,1],0] ORTE_ERROR_LOG: Unable to open a TCP socket for out-of-band communications in file oob_tcp_component.c at line 584
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_oob_base_select failed
--> Returned value (null) (-43) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

/home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(+0xfeaa9)[0x2b77e9de5aa9]
/home/leeping/opt/qchem-4.2/ext-libs/openmpi/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xd0)[0x2b77e9de13a0]

It seems that in each case, Open MPI is trying to use some networking feature and crashing as a result. My workaround has been to deduce which components are crashing and disable them via environment variables, like this:

export OMPI_MCA_btl=self,sm
export OMPI_MCA_oob=^tcp

Is there a better way to do this - i.e., explicitly prohibiting Open MPI from using any network-related features and running only on the local node?
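For what it's worth, I can get the same effect by passing the settings on the mpirun command line instead of exporting them; the second line is my attempt at confining the out-of-band channel to the loopback interface rather than disabling it outright (I'm assuming an interface named "lo" exists, and "./myprog" stands in for the real launch line):

mpirun --mca btl self,sm -np 8 ./myprog
mpirun --mca btl self,sm --mca oob_tcp_if_include lo -np 8 ./myprog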

Thanks,

- Lee-Ping
Lee-Ping Wang
2014-09-30 00:38:51 UTC
Sorry for my last email - I think I spoke too quickly. After reading some more documentation, I realized that Open MPI always uses TCP sockets for out-of-band communication, so it doesn't make sense for me to set OMPI_MCA_oob=^tcp. That said, I'm still running into a strange problem in my application when running on one specific machine (a Blue Waters compute node); I don't see this problem on any other nodes.

When I run the same job (~5 seconds) in rapid succession, I see the following error message on the second execution:

/tmp/leeping/opt/qchem-4.2/bin/parallel.csh, , qcopt_reactants.in, 8, 0, ./qchem24825/
MPIRUN in parallel.csh is /tmp/leeping/opt/qchem-4.2/ext-libs/openmpi/bin/mpirun
P4_RSHCOMMAND in parallel.csh is ssh
QCOUTFILE is stdout
Q-Chem machineFile is /tmp/leeping/opt/qchem-4.2/bin/mpi/machines
[nid15081:24859] Warning: could not find environment variable "QCLOCALSCR"
[nid15081:24859] Warning: could not find environment variable "QCREF"
initial socket setup ...start
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[46773,1],0]
Exit code: 255
--------------------------------------------------------------------------

And here's the source code where the program is exiting (it never prints "initial socket setup ...done"):

int GPICommSoc::init(MPI_Comm comm0) {

    /* set up basic MPI information */
    init_comm(comm0);

    MPI_Barrier(comm);

    /*-- start inisock and set the serveraddr[] array --*/
    if (me == 0) {
        fprintf(stdout, "initial socket setup ...start\n");
        fflush(stdout);
    }

    // create the initial socket (NULL host, port 0 - which, if this is a
    // normal bind(), presumably asks the OS for an ephemeral port)
    inisock = new_server_socket(NULL, 0);

    // fill and gather the serveraddr array
    int szsock = sizeof(SOCKADDR);
    memset(&serveraddr[0], 0, szsock * nproc);
    int iniport = get_sockport(inisock);
    set_sockaddr_byhname(NULL, iniport, &serveraddr[me]);
    // printsockaddr( serveraddr[me] );

    // every rank learns every other rank's server address
    SOCKADDR addrsend = serveraddr[me];
    MPI_Allgather(&addrsend, szsock, MPI_BYTE,
                  &serveraddr[0], szsock, MPI_BYTE, comm);

    if (me == 0) {
        fprintf(stdout, "initial socket setup ...done\n");
        fflush(stdout);
    }

I didn't write this part of the program and I'm really a novice at MPI - but it seems like the initial execution of the program isn't freeing up some system resource as it should. Is there something that needs to be corrected in the code?
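In case it helps with diagnosis, one thing I can try between the two runs is to check whether the first execution leaves sockets behind (my guess at the right incantation):

# run on the compute node between the first and second job
netstat -tan | grep -E 'LISTEN|TIME_WAIT'
# or, where ss is available:
ss -tan state time-wait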

Thanks,

- Lee-Ping
Lee-Ping Wang
2014-09-30 00:49:30 UTC
Here's another data point that might be useful: the error message is much rarer if I run my application on 4 cores instead of 8.

Thanks,

- Lee-Ping
Ralph Castain
2014-09-30 03:45:12 UTC
I don't know anything about your application or what the functions in your code are doing. I imagine it's possible that you are trying to open statically defined ports, which means that running the job again too soon could leave the OS thinking the socket is already busy. It takes a while for the OS to release a socket resource.
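If that's what is happening, the usual workarounds are to set SO_REUSEADDR on the socket before binding, or to bind to port 0 so the OS assigns a free ephemeral port. Just to sketch what I mean - this is generic IPv4 listener code, not your actual new_server_socket, which I haven't seen:

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Sketch: create a listening TCP socket that can rebind a port
 * still sitting in TIME_WAIT from a previous run. */
int make_listener(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    int one = 1;
    /* allow immediate reuse of a recently closed port */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family      = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    sa.sin_port        = htons(port);  /* port 0 = OS picks one */

    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) return -1;
    if (listen(fd, 16) < 0) return -1;
    return fd;
}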
Lee-Ping Wang
2014-09-30 17:49:15 UTC
Hi Ralph,

Thank you. I think your diagnosis is probably correct. Are these sockets the same kind of TCP/UDP ports (though with different numbers) that are used by web servers, email, etc.? If so, then I should be able to (1) locate where the port number is defined in the code, and (2) randomize the port number every time it's called to work around the issue. What do you think?

- Lee-Ping
Ralph Castain
2014-09-30 18:05:54 UTC
Post by Lee-Ping Wang
Hi Ralph,
Thank you. I think your diagnosis is probably correct. Are these sockets the same kind of TCP/UDP ports (though with different numbers) that are used by web servers, email, etc.?
Yes
Post by Lee-Ping Wang
If so, then I should be able to (1) locate where the port number is defined in the code, and (2) randomize the port number every time it's called to work around the issue. What do you think?
That might work, depending on the code. I'm not sure what it is trying to connect to, or whether that code knows how to handle arbitrary connections.

You might check on those warnings - it could be that QCLOCALSCR and QCREF need to be set for the code to work.
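An easy way to check is to print them right before the launch - printenv shows the value of each variable that is set and nothing for the ones that aren't:

printenv QCLOCALSCR QCREF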
Lee-Ping Wang
2014-09-30 18:19:15 UTC
Hi Ralph,
Post by Ralph Castain
Post by Lee-Ping Wang
If so, then I should be able to (1) locate where the port number is defined in the code, and (2) randomize the port number every time it's called to work around the issue. What do you think?
That might work, depending on the code. I'm not sure what it is trying to connect to, or whether that code knows how to handle arbitrary connections.
The main reason Q-Chem uses MPI is to execute parallel tasks on a single node, so I think it's just the MPI ranks attempting to connect to each other on the same machine. This could be off the mark since I'm still a novice with respect to MPI concepts - but I am sure it is all on one machine.
Post by Ralph Castain
You might check about those warnings - could be that QCLOCALSCR and QCREF need to be set for the code to work.
Thanks; I don't think these environment variables are the issue, but I will check again. The calculation runs without any problems on four different clusters (where I don't set these environment variables either); it's only broken on the Blue Waters compute node. Also, the calculation runs without any problems the first time it's executed on the BW compute node - it's only subsequent executions that give the error messages.

Thanks,

- Lee-Ping
Ralph Castain
2014-09-30 19:06:35 UTC
Post by Lee-Ping Wang
Hi Ralph,
Post by Ralph Castain
Post by Lee-Ping Wang
If so, then I should be able to (1) locate where the port number is defined in the code, and (2) randomize the port number every time it's called to work around the issue. What do you think?
That might work, depending on the code. I'm not sure what it is trying to connect to, or whether that code knows how to handle arbitrary connections.
The main reason Q-Chem uses MPI is to execute parallel tasks on a single node, so I think it's just the MPI ranks attempting to connect to each other on the same machine. This could be off the mark since I'm still a novice with respect to MPI concepts - but I am sure it is all on one machine.
Your statement doesn't match what you sent us - you showed that it was your connection code that was failing, not ours. You wouldn't have gotten that far if our connections had failed; you would have failed in MPI_Init. You are clearly much further along than that, as you had already passed an MPI_Barrier before reaching the code in question.
Lee-Ping Wang
2014-09-30 20:14:51 UTC
Hi Ralph,

Thanks. I'll add some print statements to the code and try to figure out precisely where the failure is happening.
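For instance, something like this around the socket call (I'm assuming new_server_socket signals failure with a negative return value - I still need to verify that, and I'd need <errno.h> and <string.h> for errno/strerror):

inisock = new_server_socket(NULL, 0);
if (inisock < 0) {  /* hypothetical failure check */
    fprintf(stderr, "rank %d: new_server_socket failed: %s\n",
            me, strerror(errno));
    fflush(stderr);
}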

- Lee-Ping
Lee-Ping Wang
2014-10-02 18:08:22 UTC
Hi Ralph,

I've been troubleshooting this issue and communicating with Blue Waters support. It turns out that Q-Chem and Open MPI are both trying to open sockets, and I get different error messages depending on which one fails.

As an aside, I don't know why Q-Chem needs sockets of its own to communicate between ranks; shouldn't Open MPI be taking care of all that? (I'm unfamiliar with this part of the Q-Chem code base; maybe it's trying to duplicate some functionality?)

Blue Waters support has indicated that there's a problem with their realm-specific IP addressing (RSIP) for the compute nodes, which they're working on fixing. I also tried running the same Q-Chem / Open MPI job on a management node, which I think has the same hardware (but not the RSIP), and the problem went away. So I think I'll shelve this problem for now, until Blue Waters support gets back to me with the fix. :)

Thanks,

- Lee-Ping
Gus Correa
2014-10-02 19:09:30 UTC
Permalink
Hi Lee-Ping

Computational Chemistry is Greek to me.

However, on p. 12 of the Q-Chem manual 3.2
(PDF online: http://www.q-chem.com/qchem-website/doc_for_web/qchem_manual_3.2.pdf)
there are explanations of the meaning of QCSCRATCH, QCLOCALSCR, etc.,
which, as Ralph pointed out, seem to be a sticking point and showed up
in the warning messages; I enclose the relevant excerpts below.

QCLOCALSCR specifies a local disk for I/O.
I wonder if the nodes are diskless; that might cause the problem.
Another possibility is that mpiexec may not be passing these
environment variables.
(Do you pass them on the mpiexec/mpirun command line?)
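
For example, something along these lines; mpirun's -x flag exports an
environment variable to all launched ranks (the paths and the launch
line here are placeholders, since the real job goes through Q-Chem's
parallel.csh):

export QCSCRATCH=/scratch/$USER
export QCLOCALSCR=/tmp/$USER
mpirun -np 8 -x QCSCRATCH -x QCLOCALSCR /path/to/qchem.exe input.in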


QCSCRATCH defines a directory for temporary files.
If this is a network-shared directory, could it be that some nodes
are not mounting it correctly?
Likewise, if your home directory or your job run directory is not
mounted, that could be a problem.
Or maybe you don't have write permission (this sometimes happens in
/tmp, especially if it is a ramdisk/tmpfs, which may also be small).

Your Blue Waters system administrator may be able to shed some light
on these things.

Also, the Q-Chem manual says it is a pre-compiled executable, which as
far as I know would require a matching version of OpenMPI.
(Ralph, please correct me if I am wrong.)

However, you seem to have the source code; at least you sent a snippet
of it. [With all those sockets being opened besides MPI ...]

Did you recompile with OpenMPI?
Did you add $OMPI/bin to PATH and $OMPI/lib to LD_LIBRARY_PATH, and
are these environment variables propagated to the job execution nodes
(especially those that are failing)?
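
A quick sanity check of which runtime you are actually picking up
(the Q-Chem executable path is a placeholder):

export PATH=$OMPI/bin:$PATH
export LD_LIBRARY_PATH=$OMPI/lib:$LD_LIBRARY_PATH
which mpirun
mpirun --version
ldd /path/to/qchem.exe | grep -i mpi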


Anyway, just a bunch of guesses ...
Gus Correa

*********************************************
QCSCRATCH   Defines the directory in which Q-Chem will store temporary
files. Q-Chem will usually remove these files on successful completion
of the job, but they can be saved, if so wished. Therefore, QCSCRATCH
should not reside in a directory that will be automatically removed at
the end of a job, if the files are to be kept for further calculations.
Note that many of these files can be very large, and it should be
ensured that the volume that contains this directory has sufficient
disk space available. The QCSCRATCH directory should be periodically
checked for scratch files remaining from abnormally terminated jobs.
QCSCRATCH defaults to the working directory if not explicitly set.
Please see section 2.6 for details on saving temporary files and
consult your systems administrator.

QCLOCALSCR  On certain platforms, such as Linux clusters, it is
sometimes preferable to write the temporary files to a disk local to
the node. QCLOCALSCR specifies this directory. The temporary files
will be copied to QCSCRATCH at the end of the job, unless the job is
terminated abnormally. In such cases Q-Chem will attempt to remove the
files in QCLOCALSCR, but may not be able to due to access restrictions.
Please specify this variable only if required.
*********************************************
Lee-Ping Wang
2014-10-02 23:26:58 UTC
Permalink
Hi Gus,

Thanks for the suggestions!

I know that QCSCRATCH and QCLOCALSCR are not the problem. When I set QCSCRATCH="." and leave QCLOCALSCR unset, all the scratch files are written to the current directory, which is the behavior I want. The environment variables are correctly passed on the mpirun command line.

Since my jobs do a fair bit of I/O, I make sure to change to the locally mounted /tmp folder before running the calculations. I do have permission to write there.

When I run jobs without OpenMPI, they are stable on the Blue Waters compute nodes, which suggests the issue is not due to any of the above.

I compiled Q-Chem from the source code, so I built OpenMPI 1.8.3 first and added $OMPI/bin to my PATH (and $OMPI/lib to LD_LIBRARY_PATH). I configured the Q-Chem build so it properly uses "mpicc", etc. The environment variables for OpenMPI are correctly set at runtime.
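
For reference, the Open MPI side of the build was the standard
sequence, roughly as follows ($OMPI is just shorthand for my install
prefix):

./configure --prefix=$OMPI
make && make install
export PATH=$OMPI/bin:$PATH
export LD_LIBRARY_PATH=$OMPI/lib:$LD_LIBRARY_PATH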

At this point, I think the main problem is a networking limitation on the compute nodes, which I believe Blue Waters support is currently working to fix. I'll send an update if anything happens.
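
In the meantime, one workaround I may try is pinning Open MPI's TCP
traffic to the loopback interface, in the same spirit as my earlier
MCA settings (I haven't verified that this helps on Blue Waters, so
treat it as a guess):

export OMPI_MCA_btl=self,sm
export OMPI_MCA_oob_tcp_if_include=lo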

- Lee-Ping