Discussion:
[OMPI users] deadlock with jemalloc
D'Alessandro, Luke K
2016-11-10 21:36:53 UTC
Hi users,

I want to report a deadlock I just ran into with Open MPI 2.0.1, mxm, and jemalloc (stack trace at the end).

The jemalloc allocator is a high-performance concurrent allocator, but it is not reentrant. During an application-level free (the dallocx() call at frame #40 below), jemalloc can cascade into an madvise() operation (frame #26). OPAL interposes this madvise() and forwards it to mxm_mem_unmap() (frame #23), which itself cascades into a call to free() (frame #13). That inner free() tries to acquire a jemalloc lock (frames #0 to #3) and the thread deadlocks against itself.
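
To make the cycle concrete, here is a toy model in C of the re-entry pattern. It is not jemalloc's or mxm's actual code: every name in it (`arena_lock`, `toy_free`, `hook_that_frees`, `simulate_reentrant_free`) is made up for illustration. It also assumes a POSIX error-checking mutex, so that the second lock attempt reports EDEADLK instead of hanging the way the real lock does in the trace.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for an allocator-internal lock. */
static pthread_mutex_t arena_lock;

/* Hypothetical stand-in for the interposed madvise() hook. */
static int (*purge_hook)(void);

/* Result of the nested toy_free() call made from inside the hook. */
static int inner_result;

/* Toy free(): takes the lock and, on the outer call, "purges" pages,
 * which runs the hook while the lock is still held. */
static int toy_free(int may_purge)
{
    int rc = pthread_mutex_lock(&arena_lock);
    if (rc != 0)
        return rc; /* error-checking mutex: owner re-locking gets EDEADLK */

    if (may_purge && purge_hook != NULL)
        inner_result = purge_hook();

    rc = pthread_mutex_unlock(&arena_lock);
    assert(rc == 0);
    (void)rc;
    return 0;
}

/* Toy hook: releases its own bookkeeping by calling back into toy_free(). */
static int hook_that_frees(void)
{
    return toy_free(0);
}

/* Runs one outer toy_free() and returns what the nested call observed,
 * or -1 if setup or the outer call failed. */
int simulate_reentrant_free(void)
{
    pthread_mutexattr_t attr;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK) != 0 ||
        pthread_mutex_init(&arena_lock, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    purge_hook = hook_that_frees;
    inner_result = 0;

    int outer = toy_free(1);
    pthread_mutex_destroy(&arena_lock);
    return outer != 0 ? -1 : inner_result; /* EDEADLK expected */
}
```

Called from a small `main` (built with `-pthread`), `simulate_reentrant_free()` should return `EDEADLK`: the nested call finds the lock already owned by the same thread. With a default, non-checking mutex the same call sequence would simply block forever, which matches what the backtrace shows.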

The easiest workaround is for me to prefix jemalloc’s symbols during the build and explicitly call the prefixed symbols from the application rather than malloc/free/etc. Then mxm would hit the system malloc (or whichever other malloc has been preloaded) and voilà, no deadlock. This is not ideal, as it is not transparent to the application.
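
As a rough sketch of that workaround in C: I am assuming here that jemalloc is configured with `--with-jemalloc-prefix=je_`, so that its public entry points are exported as `je_malloc()` and `je_free()` and the plain `malloc`/`free` symbols are left to the system allocator. The macro `USE_PREFIXED_JEMALLOC` and the wrappers `app_alloc`, `app_free`, and `app_roundtrip` are hypothetical names for illustration only. Without the macro the code falls back to the system allocator so it still builds where jemalloc is absent.

```c
#include <stddef.h>
#include <string.h>

#ifdef USE_PREFIXED_JEMALLOC
/* Assumes a jemalloc build configured with --with-jemalloc-prefix=je_ */
#include <jemalloc/jemalloc.h>
#define APP_MALLOC(n) je_malloc(n)
#define APP_FREE(p)   je_free(p)
#else
/* Fallback so the sketch compiles without jemalloc installed. */
#include <stdlib.h>
#define APP_MALLOC(n) malloc(n)
#define APP_FREE(p)   free(p)
#endif

/* Application-side allocation wrapper: the only place that names the
 * allocator, so library code calling free() never enters jemalloc. */
void *app_alloc(size_t n)
{
    return APP_MALLOC(n);
}

/* Matching release wrapper; must be paired with app_alloc(). */
void app_free(void *p)
{
    APP_FREE(p);
}

/* Allocates n bytes (n > 0), fills them, checks both ends, and frees.
 * Returns 1 on success and 0 on allocation failure or a bad readback. */
int app_roundtrip(size_t n)
{
    unsigned char *buf;
    int ok;

    if (n == 0)
        return 0;
    buf = app_alloc(n);
    if (buf == NULL)
        return 0;

    memset(buf, 0xAB, n);
    ok = (buf[0] == 0xAB && buf[n - 1] == 0xAB);
    app_free(buf);
    return ok;
}
```

The cost is the one mentioned above: every allocation site in the application has to go through the wrappers, and a pointer obtained from one allocator must never be released through the other.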

Does anyone have any other suggestions for this problem?

Thanks,
Luke


```
(gdb) bt
#0 0x00007f729597c334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f729597760e in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x00007f7295977576 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f72948151a6 in je_malloc_mutex_lock (tsdn=0x7f72435d8380, mutex=0x7f7291d5da10) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/include/jemalloc/internal/mutex.h:94
#4 0x00007f729482f235 in arena_dalloc_bin_run (tsdn=0x7f72435d8380, arena=0x7f7291d5da00, chunk=0x7f7241400000, run=0x7f7241403990, bin=0x7f7291d5eea0) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/arena.c:2853
#5 0x00007f729482f4cd in arena_dalloc_bin_locked_impl (tsdn=0x7f72435d8380, arena=0x7f7291d5da00, chunk=0x7f7241400000, ptr=0x7f724145fa00, bitselm=0x7f7241400290, junked=1 '\001')
at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/arena.c:2906
#6 0x00007f729482f5ab in je_arena_dalloc_bin_junked_locked (tsdn=0x7f72435d8380, arena=0x7f7291d5da00, chunk=0x7f7241400000, ptr=0x7f724145fa00, bitselm=0x7f7241400290)
at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/arena.c:2921
#7 0x00007f729487c61a in je_tcache_bin_flush_small (tsd=0x7f72435d8380, tcache=0x7f728393c000, tbin=0x7f728393c248, binind=17, rem=32) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/tcache.c:134
#8 0x00007f72948066bb in je_tcache_dalloc_small (tsd=0x7f72435d8380, tcache=0x7f728393c000, ptr=0x7f7240c44280, binind=17, slow_path=0 '\000')
at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/include/jemalloc/internal/tcache.h:416
#9 0x00007f7294807d4e in je_arena_dalloc (tsdn=0x7f72435d8380, ptr=0x7f7240c44280, tcache=0x7f728393c000, slow_path=0 '\000') at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/include/jemalloc/internal/arena.h:1428
#10 0x00007f729480905d in je_idalloctm (tsdn=0x7f72435d8380, ptr=0x7f7240c44280, tcache=0x7f728393c000, is_metadata=0 '\000', slow_path=0 '\000') at include/jemalloc/internal/jemalloc_internal.h:1070
#11 0x00007f72948090eb in je_iqalloc (tsd=0x7f72435d8380, ptr=0x7f7240c44280, tcache=0x7f728393c000, slow_path=0 '\000') at include/jemalloc/internal/jemalloc_internal.h:1087
#12 0x00007f729480b032 in ifree (tsd=0x7f72435d8380, ptr=0x7f7240c44280, tcache=0x7f728393c000, slow_path=0 '\000') at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/jemalloc.c:1811
#13 0x00007f7294811bb4 in free (ptr=0x7f7240c44280) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/jemalloc.c:1931
#14 0x00007f728a442b8e in mxm_mem_remove_page_recurs (context=0x7f7291d5da10, pte=0x80, dir=0xd, shift=4294967295, address=140130049710608, order=24335) at mxm/core/pgtable.c:219
#15 0x00007f728a442b8e in mxm_mem_remove_page_recurs (context=0x7f7291d5da10, pte=0x80, dir=0xd, shift=4294967295, address=140130049710608, order=24335) at mxm/core/pgtable.c:219
#16 0x00007f728a442b8e in mxm_mem_remove_page_recurs (context=0x7f7291d5da10, pte=0x80, dir=0xd, shift=4294967295, address=140130049710608, order=24335) at mxm/core/pgtable.c:219
#17 0x00007f728a442b8e in mxm_mem_remove_page_recurs (context=0x7f7291d5da10, pte=0x80, dir=0xd, shift=4294967295, address=140130049710608, order=24335) at mxm/core/pgtable.c:219
#18 0x00007f728a44294e in mxm_mem_remove_page_recurs (context=<optimized out>, pte=<optimized out>, dir=<optimized out>, shift=4294967295, address=<optimized out>, order=<optimized out>) at mxm/core/pgtable.c:219
#19 mxm_mem_remove_page (context=<optimized out>, address=<optimized out>, order=<optimized out>) at mxm/core/pgtable.c:244
#20 mxm_mem_region_pgtable_remove (context=0x7f7291d5da10, region=0x80) at mxm/core/pgtable.c:343
#21 0x00007f728a440d4b in mxm_mem_region_remove (context=0x7f7291d5da10, region=0x80) at mxm/core/mem.c:605
#22 0x00007f728a441498 in mxm_mem_unmap_internal (context=<optimized out>, address=<optimized out>, length=<optimized out>, unlock=<optimized out>) at mxm/core/mem.c:490
#23 mxm_mem_unmap (context=0x7f7291d5da10, address=0x80, length=13, flags=4294967295) at mxm/core/mem.c:764
#24 0x00007f72950f2c84 in opal_mem_hooks_release_hook (buf=0x7f7291d5da10, length=128, from_alloc=13 '\r') at memoryhooks/memory.c:131
#25 0x00007f729518aaa8 in intercept_madvise (start=0x7f7291d5da10, length=128, advice=13) at memory_patcher_component.c:234
#26 0x00007f729485b2f0 in je_pages_purge (addr=0x7f71fd400000, size=8388608) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/pages.c:183
#27 0x00007f729483a583 in chunk_purge_default (chunk=0x7f71fd400000, size=8388608, offset=0, length=8388608, arena_ind=4) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/chunk.c:696
#28 0x00007f729483a241 in je_chunk_dalloc_wrapper (tsdn=0x7f72435d8380, arena=0x7f7291d5da00, chunk_hooks=0x7f72435d6fe8, chunk=0x7f71fd400000, size=8388608, zeroed=0 '\000', committed=1 '\001')
at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/chunk.c:658
#29 0x00007f7294829d3b in arena_unstash_purged (tsdn=0x7f72435d8380, arena=0x7f7291d5da00, chunk_hooks=0x7f72435d6fe8, purge_runs_sentinel=0x7f72435d7028, purge_chunks_sentinel=0x7f72435d7038)
at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/arena.c:1752
#30 0x00007f729482a15c in arena_purge_to_limit (tsdn=0x7f72435d8380, arena=0x7f7291d5da00, ndirty_limit=1024) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/arena.c:1810
#31 0x00007f7294828896 in arena_maybe_purge_ratio (tsdn=0x7f72435d8380, arena=0x7f7291d5da00) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/arena.c:1459
#32 0x00007f7294828a61 in je_arena_maybe_purge (tsdn=0x7f72435d8380, arena=0x7f7291d5da00) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/arena.c:1507
#33 0x00007f7294839eb3 in je_chunk_dalloc_cache (tsdn=0x7f72435d8380, arena=0x7f7291d5da00, chunk_hooks=0x7f72435d71d0, chunk=0x7f71fd400000, size=8388608, committed=1 '\001')
at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/chunk.c:609
#34 0x00007f7294826e70 in je_arena_chunk_dalloc_huge (tsdn=0x7f72435d8380, arena=0x7f7291d5da00, chunk=0x7f71fd400000, usize=5242880) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/arena.c:991
#35 0x00007f7294859d88 in je_huge_dalloc (tsdn=0x7f72435d8380, ptr=0x7f71fd400000) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/huge.c:407
#36 0x00007f7294807ec6 in je_arena_dalloc (tsdn=0x7f72435d8380, ptr=0x7f71fd400000, tcache=0x7f728393c000, slow_path=0 '\000') at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/include/jemalloc/internal/arena.h:1453
#37 0x00007f729480905d in je_idalloctm (tsdn=0x7f72435d8380, ptr=0x7f71fd400000, tcache=0x7f728393c000, is_metadata=0 '\000', slow_path=0 '\000') at include/jemalloc/internal/jemalloc_internal.h:1070
#38 0x00007f72948090eb in je_iqalloc (tsd=0x7f72435d8380, ptr=0x7f71fd400000, tcache=0x7f728393c000, slow_path=0 '\000') at include/jemalloc/internal/jemalloc_internal.h:1087
#39 0x00007f729480b032 in ifree (tsd=0x7f72435d8380, ptr=0x7f71fd400000, tcache=0x7f728393c000, slow_path=0 '\000') at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/jemalloc.c:1811
#40 0x00007f72948131e4 in dallocx (ptr=0x7f71fd400000, flags=0) at /afs/crc.nd.edu/user/l/ldalessa/hpx/contrib/jemalloc/src/jemalloc.c:2514
#41 0x00000000004256aa in as_free (id=0, ptr=0x7f71fd400000) at /afs/crc.nd.edu/user/l/ldalessa/hpx/libhpx/memory/jemalloc.cpp:106
#42 0x0000000000495686 in parcel_delete (p=0x7f71fd400000) at /afs/crc.nd.edu/user/l/ldalessa/hpx/libhpx/network/parcel.cpp:300
#43 0x000000000048d179 in libhpx::network::isir::ISendBuffer::testRange (this=0x7f7291a47750, i=31, n=1, out=0x7f724141a008, ssync=0x7f72435d7800) at /afs/crc.nd.edu/user/l/ldalessa/hpx/libhpx/network/isir/ISendBuffer.cpp:181
#44 0x000000000048d45b in libhpx::network::isir::ISendBuffer::testAll (this=0x7f7291a47750, ssync=0x7f72435d7800) at /afs/crc.nd.edu/user/l/ldalessa/hpx/libhpx/network/isir/ISendBuffer.cpp:240
```
