Discussion:
[OMPI users] Suppressing Nvidia warnings
Roland Fehrenbacher
2017-03-16 11:23:59 UTC
Hi,

OpenMPI 2.0.2 built with CUDA support prints lots of warnings like

NVIDIA: no NVIDIA devices found

when running on hardware without NVIDIA devices. Is there a way to
suppress these warnings? It would be quite a hassle to maintain different
OpenMPI builds on clusters where only some machines have GPUs.
--
Thanks,

Roland

-------
http://www.q-leap.com / http://qlustar.com
--- HPC / Storage / Cloud Linux Cluster OS ---
Sylvain Jeaugey
2017-03-16 16:34:21 UTC
Hi Roland,

I can't find this message in the Open MPI source code. Could it be
hwloc? Or some other library you are using?

Sylvain
-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------
Roland Fehrenbacher
2017-03-16 20:14:31 UTC
Hi Sylvain,

SJ> Hi Roland, I can't find this message in the Open MPI source
SJ> code. Could it be hwloc ? Some other library you are using ?

The message comes from libnvidia-ml.so.x.y, which libmpi is linked against.

Thanks,

Roland
Roland Fehrenbacher
2017-03-24 20:56:15 UTC
Hi Sylvain,

SJ> Hi Roland, I can't find this message in the Open MPI source
SJ> code. Could it be hwloc ? Some other library you are using ?

After a long detour chasing the suspicion that it might have something
to do with the nvml support of hwloc, I have now found that a change in
libcudart between 7.5 and 8.0 is what makes the messages appear. Our
earlier 1.8 version was built against CUDA 7.5 and didn't show the
problem, but a 1.8 version built against CUDA 8 shows the same problem
as 2.0.2 built against CUDA 8. Do you think you could ask your team
members at Nvidia how this new behaviour in libcudart can be suppressed?

BTW: Disabling nvml support for the internal hwloc has the effect that
OpenMPI doesn't link in libnvidia-ml.so.x anymore, but has no effect on
the messages.

Thanks,

Roland
Sylvain Jeaugey
2017-03-24 21:08:37 UTC
I'm still working to get a clear confirmation of what is printing this
error message and since when.

However, running strings, I could only find this string in
/usr/lib/libnvidia-ml.so, which comes with the CUDA driver, so it should
not be related to the CUDA runtime version ... but again, until I find
the code responsible for that, I can't say for sure.

I'm sorry it's taking so long -- I'm on it though.
_______________________________________________
users mailing list
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Roland Fehrenbacher
2017-03-27 08:13:17 UTC
Hi Sylvain,

thanks for looking into this further.

SJ> I'm still working to get a clear confirmation of what is
SJ> printing this error message and since when.

SJ> However, running strings, I could only find this string in
SJ> /usr/lib/libnvidia-ml.so, which comes with the CUDA driver, so
SJ> it should not be related to the CUDA runtime version ... but
SJ> again, until I find the code responsible for that, I can't say
SJ> for sure.

libcuda (in my case libcuda.so.367.57) also contains the string, and I'm
pretty sure that's where it's coming from. libcudart (linked to orted
and libmpi.so.x) seems to dlopen libcuda.1 (at least "strings libcudart"
suggests that) ...

Best,

Roland
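
The check described in the messages above (running "strings" on the
driver libraries and looking for the warning text) can be sketched in
Python. This is a minimal sketch and not something posted in the thread:
the helper names file_contains and libraries_containing are hypothetical,
and the library paths are taken from the command line because their
locations and version suffixes differ from system to system.

```python
import os
import sys

# The warning text quoted in the thread.
NEEDLE = b"NVIDIA: no NVIDIA devices found"


def file_contains(path, needle, chunk_size=1 << 20):
    """Return True if the raw bytes `needle` occur anywhere in the file."""
    overlap = len(needle) - 1
    tail = b""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False
            # Prepend the end of the previous chunk so that a match
            # spanning a chunk boundary is not missed.
            window = tail + chunk
            if needle in window:
                return True
            tail = window[-overlap:] if overlap > 0 else b""


def libraries_containing(paths, needle):
    """Return the subset of `paths` that exist and contain `needle`."""
    return [p for p in paths if os.path.isfile(p) and file_contains(p, needle)]


if __name__ == "__main__":
    # Pass candidate libraries as arguments, for example the libcuda and
    # libnvidia-ml files installed on the node being examined.
    for hit in libraries_containing(sys.argv[1:], NEEDLE):
        print(hit)
```

A hit only shows that a library carries the string, not that it is the
one printing it at run time, which is the same caveat raised in the
thread.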

Ben Menadue
2017-05-05 05:23:45 UTC
Hi,

Sorry to reply to an old thread, but we’re seeing this message with 2.1.0 built against CUDA 8.0. We're using libcuda.so.375.39. Has anyone had any luck suppressing these messages?

Thanks,
Ben
Sylvain Jeaugey
2017-05-05 16:55:43 UTC
Sorry for not providing an update earlier. The bug has been fixed and
the messages should disappear in a future version of the driver
(hopefully the next one, if the fix got picked up in time).