Charles A Taylor
2018-10-04 21:39:01 UTC
We are seeing a gaping memory leak when running OpenMPI 3.1.x (or 2.1.2, for that matter) built with UCX support. The leak shows up
whether the “ucx” PML is explicitly specified for the run or not. The applications in question are arepo and gizmo, but I have no
reason to believe that other applications are unaffected.
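For anyone trying to reproduce, the PML selection can be checked with the usual MCA flags (“./arepo” below is a stand-in for the actual launch line):

# Print the PML component selection at startup
mpirun --mca pml_base_verbose 10 -np 2 ./arepo 2>&1 | grep -i pml

# Force the UCX PML explicitly; the leak shows up either way
mpirun --mca pml ucx -np 2 ./arepo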
Basically, the resident memory of the MPI processes grows without bound until SLURM kills the job or the host memory is exhausted.
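If anyone wants to watch the growth themselves, a rough sketch (job ID and process name are placeholders):

# Peak/average resident set size of the job's tasks, via SLURM
sstat -j <jobid> --format=JobID,MaxRSS,AveRSS

# Or sample the RSS of the ranks directly on a compute node
while sleep 60; do ps -C arepo -o pid=,rss=; done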
If I configure and build with “--without-ucx” the problem goes away.
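For completeness, UCX components can also be excluded at run time with the standard MCA “^” syntax; I can’t vouch for that as a workaround, only for the --without-ucx rebuild:

# Exclude the ucx PML at run time (untested as a fix here)
mpirun --mca pml ^ucx -np 2 ./arepo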
I didn’t see anything about this on the UCX GitHub site, so I thought I’d ask here. Is anyone else seeing the same or similar behavior?
What version of UCX is OpenMPI 3.1.x tested against?
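In case the UCX version matters, it can be read off the installed package or from the library itself:

# RH-native package version and the library's own report
rpm -q ucx
ucx_info -v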
Regards,
Charlie Taylor
UF Research Computing
Details:
—————————————
RHEL7.5
OpenMPI 3.1.2 (and every other version I’ve tried).
ucx 1.2.2-1.el7 (RH native)
RH native IB stack
Mellanox FDR/EDR IB fabric
Intel Parallel Studio 2018.1.163
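If it helps to compare environments, the devices and transports UCX detects on a node can be listed with:

ucx_info -d    # transports/devices visible to UCX
ibstat         # Mellanox HCA port state (from infiniband-diags)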
Configuration Options:
—————————————————
CFG_OPTS=""
CFG_OPTS="$CFG_OPTS C=icc CXX=icpc FC=ifort FFLAGS=\"-O2 -g -warn -m64\" LDFLAGS=\"\" "
CFG_OPTS="$CFG_OPTS --enable-static"
CFG_OPTS="$CFG_OPTS --enable-orterun-prefix-by-default"
CFG_OPTS="$CFG_OPTS --with-slurm=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-pmix=/opt/pmix/2.1.1"
CFG_OPTS="$CFG_OPTS --with-pmi=/opt/slurm"
CFG_OPTS="$CFG_OPTS --with-libevent=external"
CFG_OPTS="$CFG_OPTS --with-hwloc=external"
CFG_OPTS="$CFG_OPTS --with-verbs=/usr"
CFG_OPTS="$CFG_OPTS --with-libfabric=/usr"
CFG_OPTS="$CFG_OPTS --with-ucx=/usr"
CFG_OPTS="$CFG_OPTS --with-verbs-libdir=/usr/lib64"
CFG_OPTS="$CFG_OPTS --with-mxm=no"
CFG_OPTS="$CFG_OPTS --with-cuda=${HPC_CUDA_DIR}"
CFG_OPTS="$CFG_OPTS --enable-openib-udcm"
CFG_OPTS="$CFG_OPTS --enable-openib-rdmacm"
CFG_OPTS="$CFG_OPTS --disable-pmix-dstore"
rpmbuild --ba \
--define '_name openmpi' \
--define "_version $OMPI_VER" \
--define "_release ${RELEASE}" \
--define "_prefix $PREFIX" \
--define '_mandir %{_prefix}/share/man' \
--define '_defaultdocdir %{_prefix}' \
--define 'mflags -j 8' \
--define 'use_default_rpm_opt_flags 1' \
--define 'use_check_files 0' \
--define 'install_shell_scripts 1' \
--define 'shell_scripts_basename mpivars' \
--define "configure_options $CFG_OPTS " \
openmpi-${OMPI_VER}.spec 2>&1 | tee rpmbuild.log
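After installing the resulting RPM, whether UCX support actually made it into the build can be confirmed from the component list (the ucx PML should appear only in the UCX-enabled build):

$PREFIX/bin/ompi_info | grep -i ucx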