daos_test failing with Infiniband
Peter
Hello,
I have had issues getting DAOS to work with InfiniBand, and I have been unable to diagnose the issue. I am running DAOS v1.1.1 and have tested both the RPMs and a build from source, on CentOS 7. I have installed the latest Mellanox drivers and successfully ran the InfiniBand tests. I can run ibping between my hosts.

The DAOS cluster appears to start without issue, as far as I can tell.

    [daos@swat7-01 ~]$ docker exec dc_ib_auto dmg -i system query --verbose
    Rank UUID                                 Control Address State  Reason
    ---- ----                                 --------------- -----  ------
    0    c7adb803-af21-497d-aaba-5da5b8cd121f 10.0.0.63:10001 Joined
    1    5333e417-47ef-4747-b4a5-241b88188092 10.0.0.64:10001 Joined
    2    768f4769-e21a-44a2-b3a0-647a9a6a5f2f 10.0.0.65:10001 Joined
    3    b3bb804b-e453-417b-885d-cf1bae9fa179 10.0.0.61:10001 Joined

However, when attempting to run daos_test, I receive the following error (I can get this test to succeed over Ethernet):

    [daos@swat7-01 ~]$ docker exec dc_ib_auto daos_test -i
    --------------------------------------------------------------------------
    WARNING: No preset parameters were found for the device that Open MPI
    detected:

      Local host:            swat7-01
      Device name:           mlx5_0
      Device vendor ID:      0x02c9
      Device vendor part ID: 4123

    Default device parameters will be used, which may result in lower
    performance. You can edit any of the files specified by the
    btl_openib_device_param_files MCA parameter to set values for your
    device.

    NOTE: You can turn off this warning by setting the MCA parameter
          btl_openib_warn_no_device_params_found to 0.
    --------------------------------------------------------------------------
    12/15-06:55:24.37 swat7-01 DAOS[574/574] fi INFO src/gurt/fault_inject.c:481 d_fault_inject_init() No config file, fault injection is OFF.
    12/15-06:55:24.37 swat7-01 DAOS[574/574] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=32
    12/15-06:55:24.37 swat7-01 DAOS[574/574] mgmt INFO src/mgmt/cli_mgmt.c:523 dc_mgmt_net_cfg() Using client provided OFI_INTERFACE: ib0
    12/15-06:55:24.37 swat7-01 DAOS[574/574] crt INFO src/cart/crt_init.c:269 crt_init_opt() libcart version 4.8.0 initializing
    12/15-06:55:24.37 swat7-01 DAOS[574/574] crt WARN src/cart/crt_init.c:161 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
    12/15-06:55:24.37 swat7-01 DAOS[574/574] crt WARN src/cart/crt_init.c:380 crt_init_opt() FI_OFI_RXM_USE_SRX not set, set=1
    12/15-06:55:24.40 swat7-01 DAOS[574/574] external ERR # NA -- Error -- /home/daos/daos/build/external/dev/mercury/src/na/na_ofi.c:2064 # na_ofi_basic_ep_open(): fi_enable() failed, rc: -12 (Cannot allocate memory)
    12/15-06:55:24.40 swat7-01 DAOS[574/574] external ERR # NA -- Error -- /home/daos/daos/build/external/dev/mercury/src/na/na_ofi.c:1981 # na_ofi_endpoint_open(): na_ofi_basic_ep_open() failed
    [swat7-01:574 :0:574] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc)
    ==== backtrace ====
    0 /lib64/libucs.so.0(+0x17970) [0x7f1f66279970]
    1 /lib64/libucs.so.0(+0x17b22) [0x7f1f66279b22]
    2 /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(fi_log_enabled+0x13) [0x7f1f7a3c49b3]
    3 /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(+0x7353e) [0x7f1f7a41e53e]
    4 /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(+0x7459c) [0x7f1f7a41f59c]
    5 /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(+0xc3ec) [0x7f1f7bdd63ec]
    6 /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(+0xd44d) [0x7f1f7bdd744d]
    7 /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(NA_Initialize_opt+0x3bf) [0x7f1f7bdce0cf]
    8 /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libmercury.so.2(HG_Core_init_opt+0xef) [0x7f1f7bff862f]
    9 /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libmercury.so.2(HG_Init_opt+0x6f) [0x7f1f7bfefdbf]
    10 /home/daos/daos/install/bin/../lib64/libcart.so.4(+0x4b211) [0x7f1f7e239211]
    11 /home/daos/daos/install/bin/../lib64/libcart.so.4(crt_hg_ctx_init+0x388) [0x7f1f7e23a548]
    12 /home/daos/daos/install/bin/../lib64/libcart.so.4(crt_context_create+0x3dd) [0x7f1f7e207d8d]
    13 /home/daos/daos/install/bin/../lib64/libdaos.so.0(daos_eq_lib_init+0x1fc) [0x7f1f7eb4776c]
    14 /home/daos/daos/install/bin/../lib64/libdaos.so.0(daos_init+0x184) [0x7f1f7eb4b3f4]
    15 daos_test() [0x407baf]
    16 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1f7d511555]
    17 daos_test() [0x409050]

Would anyone happen to know what is causing this error, and how I could fix it? Thank you, I appreciate any help.
Best, Peter
|
|
Lombardi, Johann
Hi Peter,
Could you please advise what provider you have specified in the DAOS yaml file? Libfabric seems to be loading libucs.so which is, AFAIK, a library of UCX that we don’t support.
Cheers, Johann
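For readers following along, one way to answer the provider question is to read it straight from the server configuration file. The helper below is a minimal sketch, not part of DAOS: the function name get_provider is hypothetical, and it assumes the provider is set on a single line of the form "provider: <value>".

```shell
# Print the fabric provider configured in a DAOS server YAML file.
# Assumption: the file contains a line of the form "provider: <value>";
# only the first such line is reported.
get_provider() {
    # $1: path to the server YAML file
    sed -n 's/^[[:space:]]*provider:[[:space:]]*//p' "$1" | head -n 1
}
```

It could be called as get_provider /etc/daos/daos_server.yml, where the path is an assumption about the install layout and may differ for a source build.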
|
|
Peter
I have specified ofi+verbs;ofi_rxm
What should I look into to get libfabric to load a supported library? Thank you for your reply.
|
|
Lombardi, Johann
I see, then maybe libucs is somehow used under the hood. Are you using the MOFED stack? Maybe you could try to reduce FI_UNIVERSE_SIZE to 512 (i.e. export FI_UNIVERSE_SIZE=512).
Cheers, Johann
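The suggestion above can be applied to a single command without changing the rest of the shell session. This is a minimal sketch; the wrapper name run_with_universe_size is hypothetical, and 512 is simply the value proposed in this thread.

```shell
# Run a command with FI_UNIVERSE_SIZE set only for that command.
run_with_universe_size() {
    size=$1
    shift
    # Accept digits only, so a typo does not silently reach libfabric.
    case "$size" in
        ''|*[!0-9]*)
            echo "FI_UNIVERSE_SIZE must be a whole number" >&2
            return 1
            ;;
    esac
    FI_UNIVERSE_SIZE="$size" "$@"
}
```

For example: run_with_universe_size 512 daos_test -i. When the test is started through docker exec, as in the original report, the host environment is not forwarded into the container, so the variable has to be passed explicitly: docker exec -e FI_UNIVERSE_SIZE=512 dc_ib_auto daos_test -i.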
|
|
Oganezov, Alexander A
Hi Peter,
I saw something similar a while ago when our MPI-based applications ended up compiling against a ‘bad’ version of MPI, or more specifically an MPI that links against a bad UCX (UCX provides libucs). There appears to be a bug in some UCX versions that causes this segfault (e.g. https://github.com/open-mpi/ompi/issues/6789).
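A quick way to check whether a binary pulls in UCX through MPI is to filter its ldd output. The filter below is a minimal sketch with a hypothetical name (find_ucx_mpi_libs); it reads ldd output on standard input and keeps only lines naming MPI or UCX libraries. Note that ldd does not show libraries opened later with dlopen, so an empty result is not conclusive.

```shell
# Keep only the lines of `ldd` output (read from stdin) that name
# an MPI library or one of the UCX libraries (libucs, libucp, libuct, libucm).
find_ucx_mpi_libs() {
    grep -E 'lib(ucs|ucp|uct|ucm|mpi)[^/]*\.so'
}
```

For example: ldd "$(command -v daos_test)" | find_ucx_mpi_libs. If libucs.so shows up next to an MPI library, that MPI build is a likely candidate for the one that brings in UCX.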
One thing to try is to check which MPIs you have installed and compile against a different one from the one you are using.
“module avail” will list the installed MPI packages. You can then use “module load <package>” and after that recompile DAOS via: scons -c ; scons -c install; scons MPI_PKG=any -j 12 install
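The rebuild steps above can be collected into one helper. This is a minimal sketch: the function names are hypothetical, the commands are exactly those quoted in this thread, and it assumes it is run from the DAOS source tree in a shell where the module command is available. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Run a command, or just print it when DRY_RUN=1.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Load the given MPI module, clean the tree, and rebuild DAOS.
rebuild_daos_with_mpi() {
    mpi_module=$1
    if [ -z "$mpi_module" ]; then
        echo "usage: rebuild_daos_with_mpi <module-name>" >&2
        return 1
    fi
    # Stop at the first failing step.
    run_step module load "$mpi_module" &&
    run_step scons -c &&
    run_step scons -c install &&
    run_step scons MPI_PKG=any -j 12 install
}
```

For example: DRY_RUN=1 rebuild_daos_with_mpi mpi/mpich-x86_64, where the module name is only an illustration; use one of the names reported by “module avail” on your system.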
Let me know if this helps any.
Thanks, ~~Alex.
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Lombardi, Johann
Sent: Tuesday, December 15, 2020 12:00 AM
To: daos@daos.groups.io
Subject: Re: [daos] daos_test failing with Infiniband
I see, then maybe libucs is somehow used under the hood. Are you using the MOFED stack? Maybe you could try to reduce FI_UNIVERSE_SIZE to 512 (i.e. export FI_UNIVERSE_SIZE=512).
Cheers, Johann
|
|