Discussion: [OMPI users] "Warning :: opal_list_remove_item" with openmpi-2.1.0rc4
Siegmar Gross
2017-03-21 07:38:17 UTC
Hi,

I have installed openmpi-2.1.0rc4 on my "SUSE Linux Enterprise Server
12.2 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Once again I sometimes
get a warning about a missing item for one of my small programs (it
doesn't matter whether I use my cc or my gcc version). My gcc version
also displays the message "NVIDIA: no NVIDIA devices found" on the
server without NVIDIA devices (I don't get the message with my cc
version). I used the following commands to build the package
(${SYSTEM_ENV} is Linux and ${MACHINE_ENV} is x86_64).


mkdir openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
cd openmpi-2.1.0rc4-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc

../openmpi-2.1.0rc4/configure \
--prefix=/usr/local/openmpi-2.1.0_64_cc \
--libdir=/usr/local/openmpi-2.1.0_64_cc/lib64 \
--with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
--with-jdk-headers=/usr/local/jdk1.8.0_66/include \
JAVA_HOME=/usr/local/jdk1.8.0_66 \
LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack -L/usr/local/lib64 -L/usr/local/cuda/
lib64" \
CC="cc" CXX="CC" FC="f95" \
CFLAGS="-m64 -mt -I/usr/local/include -I/usr/local/cuda/include" \
CXXFLAGS="-m64 -I/usr/local/include -I/usr/local/cuda/include" \
FCFLAGS="-m64" \
CPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
CXXCPP="cpp -I/usr/local/include -I/usr/local/cuda/include" \
--enable-mpi-cxx \
--enable-cxx-exceptions \
--enable-mpi-java \
--with-cuda=/usr/local/cuda \
--with-valgrind=/usr/local/valgrind \
--enable-mpi-thread-multiple \
--with-hwloc=internal \
--without-verbs \
--with-wrapper-cflags="-m64 -mt" \
--with-wrapper-cxxflags="-m64" \
--with-wrapper-fcflags="-m64" \
--with-wrapper-ldflags="-mt" \
--enable-debug \
|& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc

make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
rm -r /usr/local/openmpi-2.1.0_64_cc.old
mv /usr/local/openmpi-2.1.0_64_cc /usr/local/openmpi-2.1.0_64_cc.old
make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc


Sometimes everything works as expected.

loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on loki
MPI_COMM_WORLD ntasks: 1
COMM_CHILD_PROCESSES ntasks_local: 1
COMM_CHILD_PROCESSES ntasks_remote: 2
COMM_ALL_PROCESSES ntasks: 3
mytid in COMM_ALL_PROCESSES: 0

Child process 0 running on nfs1
MPI_COMM_WORLD ntasks: 2
COMM_ALL_PROCESSES ntasks: 3
mytid in COMM_ALL_PROCESSES: 1

Child process 1 running on nfs2
MPI_COMM_WORLD ntasks: 2
COMM_ALL_PROCESSES ntasks: 3
mytid in COMM_ALL_PROCESSES: 2



More often I get a warning.

loki spawn 144 mpiexec -np 1 --host loki,nfs1,nfs2 spawn_intra_comm
Parent process 0: I create 2 slave processes

Parent process 0 running on loki
MPI_COMM_WORLD ntasks: 1
COMM_CHILD_PROCESSES ntasks_local: 1
COMM_CHILD_PROCESSES ntasks_remote: 2
COMM_ALL_PROCESSES ntasks: 3
mytid in COMM_ALL_PROCESSES: 0

Child process 0 running on nfs1
MPI_COMM_WORLD ntasks: 2
COMM_ALL_PROCESSES ntasks: 3

Child process 1 running on nfs2
MPI_COMM_WORLD ntasks: 2
COMM_ALL_PROCESSES ntasks: 3
mytid in COMM_ALL_PROCESSES: 2
mytid in COMM_ALL_PROCESSES: 1
Warning :: opal_list_remove_item - the item 0x25a76f0 is not on the list 0x7f96db515998
loki spawn 144



I would be grateful if somebody could fix the problem. Do you need
anything else? Thank you very much in advance for any help.


Kind regards

Siegmar
Sylvain Jeaugey
2017-03-21 16:52:02 UTC
Hi Siegmar,

I think this "NVIDIA : ..." error message comes from the fact that you
add CUDA includes in the C*FLAGS. If you just use --with-cuda, Open MPI
will compile with CUDA support, but hwloc will not find CUDA and that
will be fine. However, setting CUDA in CFLAGS will make hwloc find CUDA,
compile CUDA support (which is not needed) and then NVML will show this
error message when not run on a machine with CUDA devices.

I guess gcc picks the environment variable, while cc does not hence the
different behavior. So again, there is no need to add all those CUDA
includes, --with-cuda is enough.
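
For example, keeping everything else in your command unchanged, it
should be enough (a sketch, untested) to trim the include paths down
to:

CFLAGS="-m64 -mt -I/usr/local/include" \
CXXFLAGS="-m64 -I/usr/local/include" \
CPP="cpp -I/usr/local/include" \
CXXCPP="cpp -I/usr/local/include" \
--with-cuda=/usr/local/cuda \

Open MPI itself will still find the CUDA headers through --with-cuda.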

About the opal_list_remove_item, we'll try to reproduce the issue and
see where it comes from.
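
For reference, in a build configured with --enable-debug,
opal_list_remove_item first walks the list to verify that the item
really is a member before unlinking it; the warning means that check
failed, and the item is then left untouched. Roughly, as a paraphrase
(not the exact Open MPI source):

/* paraphrased sketch of the debug-mode membership check */
bool found = false;
for (opal_list_item_t *ptr = opal_list_get_first(list);
     ptr != opal_list_get_end(list);
     ptr = opal_list_get_next(ptr)) {
    if (ptr == item) { found = true; break; }
}
if (!found) {
    fprintf(stderr, "Warning :: opal_list_remove_item - the item %p "
                    "is not on the list %p\n", (void *) item, (void *) list);
    return (opal_list_item_t *) NULL;  /* nothing is removed */
}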

Sylvain
Akshay Venkatesh
2017-03-21 17:57:25 UTC
Hi Siegmar,

Would it be possible for you to provide the source to reproduce the issue?

Thanks
--
-Akshay
Siegmar Gross
2017-03-22 06:39:27 UTC
Hi Akshay,
Post by Akshay Venkatesh
Would it be possible for you to provide the source to reproduce the issue?
Yes, I've appended the file.
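
The attachment itself is not preserved in this archive; a minimal
spawn test that produces output like the runs above (an assumed
reconstruction, not necessarily the original spawn_intra_comm.c) could
look like this:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* assumed reconstruction of the test program, not the original */
    MPI_Comm parent, inter, all;
    char host[MPI_MAX_PROCESSOR_NAME];
    int world, ntasks, mytid, nlocal, nremote, len, rank;

    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(host, &len);
    MPI_Comm_size(MPI_COMM_WORLD, &world);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* parent: spawn two copies of this binary */
        printf("Parent process %d: I create 2 slave processes\n", rank);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        MPI_Comm_size(inter, &nlocal);          /* parent group size  */
        MPI_Comm_remote_size(inter, &nremote);  /* spawned group size */
        printf("Parent process %d running on %s\n", rank, host);
        printf("MPI_COMM_WORLD ntasks: %d\n", world);
        printf("COMM_CHILD_PROCESSES ntasks_local: %d\n", nlocal);
        printf("COMM_CHILD_PROCESSES ntasks_remote: %d\n", nremote);
        MPI_Intercomm_merge(inter, 0, &all);    /* parent ranks first */
    } else {
        inter = parent;
        printf("Child process %d running on %s\n", rank, host);
        printf("MPI_COMM_WORLD ntasks: %d\n", world);
        MPI_Intercomm_merge(inter, 1, &all);    /* children rank last */
    }

    MPI_Comm_size(all, &ntasks);
    MPI_Comm_rank(all, &mytid);
    printf("COMM_ALL_PROCESSES ntasks: %d\n", ntasks);
    printf("mytid in COMM_ALL_PROCESSES: %d\n", mytid);

    MPI_Comm_free(&all);
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}

It is compiled with mpicc and started with a single process, exactly
as in the runs above (mpiexec -np 1 --host loki,nfs1,nfs2
spawn_intra_comm).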


Kind regards

Siegmar
Roland Fehrenbacher
2017-03-21 18:05:06 UTC
Hi Sylvain,

I get the "NVIDIA : ..." run-time error messages just by compiling
with "--with-cuda=/usr":

./configure --prefix=${prefix} \
--mandir=${prefix}/share/man \
--infodir=${prefix}/share/info \
--sysconfdir=/etc/openmpi/${VERSION} --with-devel-headers \
--disable-memchecker \
--disable-vt \
--with-tm --with-slurm --with-pmi --with-sge \
--with-cuda=/usr \
--with-io-romio-flags='--with-file-system=nfs+lustre' \
--with-cma --without-valgrind \
--enable-openib-connectx-xrc \
--enable-orterun-prefix-by-default \
--disable-java

Roland

Sylvain Jeaugey
2017-03-21 19:41:06 UTC
If you installed CUDA libraries and includes in /usr, then it's not
surprising hwloc finds them even without defining CFLAGS.

I'm just saying I think you won't get the error message if Open MPI
finds CUDA but hwloc does not.
Roland Fehrenbacher
2017-03-22 10:30:29 UTC
SJ> If you installed CUDA libraries and includes in /usr, then it's
SJ> not surprising hwloc finds them even without defining CFLAGS.

Well, that's the place where distribution packages install to :)
I don't think a build system should misbehave if libraries are
installed in default places.

SJ> I'm just saying I think you won't get the error message if Open
SJ> MPI finds CUDA but hwloc does not.

OK, so I think I need to ask the original question again: Is there a way
to suppress these warnings with a "normal" build? I guess the answer
must be yes, since 1.8.x didn't have this problem. The real question
then would be how ...

Thanks,

Roland
Gilles Gouaillardet
2017-03-22 14:47:26 UTC
Roland,

The easiest way is to use an external hwloc that is configured with
--disable-nvml, as sketched below.
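
For example (prefix and hwloc version are only illustrative):

# build an external hwloc without the NVML plugin
tar xf hwloc-1.11.6.tar.gz
cd hwloc-1.11.6
./configure --prefix=/opt/hwloc --disable-nvml
make && make install

# then configure Open MPI against it instead of --with-hwloc=internal
../openmpi-2.1.0rc4/configure --with-hwloc=/opt/hwloc ...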

Another option is to hack the embedded hwloc configure.m4 and pass
--disable-nvml to the embedded hwloc configure. Note this requires
running autogen.sh, and hence recent autotools.

I guess Open MPI 1.8 embeds an older hwloc that is not aware of NVML,
hence the lack of warning.

Cheers,

Gilles