I see what you mean. Below is the output (filtered for a single host). Our setup is very generic.
Dell SOS6320 hosts (haswell)
Mellanox connectx-3 HCAs (mlx4 drivers - native RHEL, not mofed).
FDR/EDR switches (stand-alone opensm)
RHEL7.4
slurm 16.05.11
pmix (pmix-1.1.5-1.el7.x86_64)
openmpi (3.0.0, 3.1.0)
Apps include the well-known LAMMPS, VASP, GROMACS, amber, raxml, espresso, namd2 (i.e., the usual list of research university apps).
gadget/gizmo/arepo are really the only ones giving us trouble, but I know they run fine under both openmpi and impi/mpich/mvapich at other sites. I'm trying to figure out why we can't seem to run them reliably, but I'd also like to get up to date with our transport APIs. It seems we've fallen behind and are just doing the things we've always done (openib BTL).
I'll try running with a modified "provider_include" list and see what happens. The fi_info output shows the verbs, udp, and sockets providers.
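The rerun described above might be set up roughly as follows. This is a sketch, not a tested recipe: it assumes the MTL's provider_include parameter is exposed as OMPI_MCA_mtl_ofi_provider_include (following the usual OMPI_MCA_<framework>_<component>_<param> naming and the "mtl:ofi:provider_include" line in the log below). Whether the verbs provider is actually usable on this hardware is exactly what the run would show.

```shell
# Ask for the OFI MTL and let it consider the verbs provider, which the
# default include list ("psm,psm2,gni") filters out.
export OMPI_MCA_mtl=ofi
export OMPI_MCA_mtl_ofi_provider_include=verbs
# Keep the verbose selection output so the provider choice can be checked.
export OMPI_MCA_mtl_base_verbose=100
```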
Thanks,
Charlie
[***@login4 mufasa]$ grep 'c29a-s2.ufhpc' mz0.e
[c29a-s2.ufhpc:01463] mca: base: components_register: registering framework mtl components
[c29a-s2.ufhpc:01463] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01463] mca: base: components_register: component ofi register function successful
[c29a-s2.ufhpc:01463] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01463] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01463] mca: base: components_open: component ofi open function successful
[c29a-s2.ufhpc:01464] mca: base: components_register: registering framework mtl components
[c29a-s2.ufhpc:01464] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01464] mca: base: components_register: component ofi register function successful
[c29a-s2.ufhpc:01464] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01464] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01464] mca: base: components_open: component ofi open function successful
[c29a-s2.ufhpc:01465] mca: base: components_register: registering framework mtl components
[c29a-s2.ufhpc:01465] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01465] mca: base: components_register: component ofi register function successful
[c29a-s2.ufhpc:01465] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01465] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01465] mca: base: components_open: component ofi open function successful
[c29a-s2.ufhpc:01466] mca: base: components_register: registering framework mtl components
[c29a-s2.ufhpc:01466] mca: base: components_register: found loaded component ofi
[c29a-s2.ufhpc:01466] mca: base: components_register: component ofi register function successful
[c29a-s2.ufhpc:01466] mca: base: components_open: opening mtl components
[c29a-s2.ufhpc:01466] mca: base: components_open: found loaded component ofi
[c29a-s2.ufhpc:01466] mca: base: components_open: component ofi open function successful
[c29a-s2.ufhpc:01463] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01463] mca:base:select:( mtl) Querying component [ofi]
[c29a-s2.ufhpc:01463] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[c29a-s2.ufhpc:01463] mca:base:select:( mtl) Selected component [ofi]
[c29a-s2.ufhpc:01463] select: initializing mtl component ofi
[c29a-s2.ufhpc:01464] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01464] mca:base:select:( mtl) Querying component [ofi]
[c29a-s2.ufhpc:01464] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[c29a-s2.ufhpc:01464] mca:base:select:( mtl) Selected component [ofi]
[c29a-s2.ufhpc:01464] select: initializing mtl component ofi
[c29a-s2.ufhpc:01465] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01465] mca:base:select:( mtl) Querying component [ofi]
[c29a-s2.ufhpc:01465] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[c29a-s2.ufhpc:01465] mca:base:select:( mtl) Selected component [ofi]
[c29a-s2.ufhpc:01465] select: initializing mtl component ofi
[c29a-s2.ufhpc:01466] mca:base:select: Auto-selecting mtl components
[c29a-s2.ufhpc:01466] mca:base:select:( mtl) Querying component [ofi]
[c29a-s2.ufhpc:01466] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[c29a-s2.ufhpc:01466] mca:base:select:( mtl) Selected component [ofi]
[c29a-s2.ufhpc:01466] select: initializing mtl component ofi
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:269: mtl:ofi:provider_include = "psm,psm2,gni"
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = "(null)"
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01464] mtl_ofi_component.c:410: select_ofi_provider: no provider found
[c29a-s2.ufhpc:01464] select: init returned failure for component ofi
[c29a-s2.ufhpc:01464] select: no component selected
[c29a-s2.ufhpc:01464] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01464] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:269: mtl:ofi:provider_include = "psm,psm2,gni"
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = "(null)"
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01465] mtl_ofi_component.c:410: select_ofi_provider: no provider found
[c29a-s2.ufhpc:01465] select: init returned failure for component ofi
[c29a-s2.ufhpc:01465] select: no component selected
[c29a-s2.ufhpc:01465] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01465] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:269: mtl:ofi:provider_include = "psm,psm2,gni"
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = "(null)"
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:269: mtl:ofi:provider_include = "psm,psm2,gni"
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:272: mtl:ofi:provider_exclude = "(null)"
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "verbs" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:280: mtl:ofi: "sockets" not in include list
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01466] mtl_ofi_component.c:410: select_ofi_provider: no provider found
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:301: mtl:ofi:prov: none
[c29a-s2.ufhpc:01463] mtl_ofi_component.c:410: select_ofi_provider: no provider found
[c29a-s2.ufhpc:01463] select: init returned failure for component ofi
[c29a-s2.ufhpc:01463] select: no component selected
[c29a-s2.ufhpc:01466] select: init returned failure for component ofi
[c29a-s2.ufhpc:01466] select: no component selected
[c29a-s2.ufhpc:01466] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01466] mca: base: close: unloading component ofi
[c29a-s2.ufhpc:01463] mca: base: close: component ofi closed
[c29a-s2.ufhpc:01463] mca: base: close: unloading component ofi
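Output like the above gets long quickly with more ranks, so it can help to collapse it into a per-process count of rejected providers. The helper below is only a sketch: rejected_providers is a hypothetical name, and it assumes the "not in include list" message keeps the exact wording shown in this log.

```shell
#!/bin/sh
# rejected_providers: read mtl_base_verbose output on stdin and print
# "<count> <host:pid> <provider>" for every provider the OFI MTL skipped
# because it was not in the include list.
rejected_providers() {
    sed -n 's/^\[\([^]]*\)] .*mtl:ofi: "\([^"]*\)" not in include list$/\1 \2/p' |
        sort | uniq -c | awk '{print $1, $2, $3}'
}
```

Usage would be along the lines of `grep 'c29a-s2.ufhpc' mz0.e | rejected_providers`.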
Post by Howard Pritchard
Hello Charles,
You are heading in the right direction.
First you might want to run the libfabric fi_info command to see what capabilities you picked up from the libfabric RPMs.
Next you may well not actually be using the OFI mtl.
Could you run your app with
export OMPI_MCA_mtl_base_verbose=100
and post the output?
It would also help if you described the system you are using: OS, interconnect, CPU type, etc.
Howard
Because of the issues we are having with OpenMPI and the openib BTL (questions previously asked), I've been looking into what other transports are available. I was particularly interested in OFI/libfabric support but cannot find any information on it more recent than a reference to the usNIC BTL from 2015 (Jeff Squyres, Cisco). Unfortunately, the openmpi-org website FAQs covering OpenFabrics support don't mention anything beyond OpenMPI 1.8. Given that 3.1 is the current stable version, that seems odd.
That being the case, I thought I'd ask here. After laying down the libfabric-devel RPM and building (3.1.0) with --with-libfabric=/usr, I end up with an "ofi" MTL but nothing else. I can run with OMPI_MCA_mtl=ofi and OMPI_MCA_btl="self,vader,openib" but it eventually crashes in libopen-pal.so (mpi_waitall() higher up the stack).
/apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(+0x9391d)[0x2b4d4b68a91d]
/apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(opal_progress+0x24)[0x2b4d4b632754]
/apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(ompi_request_default_wait_all+0x11f)[0x2b4d47be2a6f]
/apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(PMPI_Waitall+0xbd)[0x2b4d47c2ce4d]
Questions: Am I using the OFI MTL as intended? Should there be an "ofi" BTL? Does anyone use this?
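One way to see which transports a given build actually contains is to filter the component list printed by ompi_info. The helper below is only a sketch: list_transports is a hypothetical name, and it assumes ompi_info prints one "MCA <framework>: <component> (...)" line per component.

```shell
#!/bin/sh
# list_transports: read `ompi_info` output on stdin and print
# "<framework> <component>" for the point-to-point layers (pml, mtl, btl).
list_transports() {
    awk '$1 == "MCA" && ($2 == "pml:" || $2 == "mtl:" || $2 == "btl:") {
        sub(/:$/, "", $2)   # drop the trailing colon from the framework name
        print $2, $3
    }'
}
```

Running `ompi_info | list_transports` on the build described above would be expected to show an "mtl ofi" line and no "btl ofi" line, if the description is accurate.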
Thanks,
Charlie Taylor
UF Research Computing
PS - If you could use some help updating the FAQs, I'd be willing to put in some time. I'd probably learn a lot.
_______________________________________________
users mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.open-2Dmpi.org_mailman_listinfo_users&d=DwIFaQ&c=pZJPUDQ3SB9JplYbifm4nt2lEVG5pWx2KikqINpWlZM&r=8sBODgXZKw_dNqkFqkTqbGD3_7nNlm_pat-D6AqiaC8&m=EGR5U297e0v1wN5gzlnqAsj7sHLpSN3I_tjwpfbJQAI&s=k64is7lySeSVrkP8ys8ZIVuVHRY6VJpxBEXU1dXczAY&e=