daos_test failing with InfiniBand


Peter
 

Hello,

I am having trouble getting DAOS to work over InfiniBand and have been unable to diagnose the problem. I am running DAOS v1.1.1 on CentOS 7 and have tried both the RPMs and a build from source.
I have installed the latest Mellanox drivers and successfully ran the InfiniBand tests, and ibping works between my hosts. As far as I can tell, the DAOS cluster starts without issue:

[daos@swat7-01 ~]$ docker exec dc_ib_auto dmg -i system query --verbose
Rank UUID                                 Control Address State  Reason
---- ----                                 --------------- -----  ------
0    c7adb803-af21-497d-aaba-5da5b8cd121f 10.0.0.63:10001 Joined
1    5333e417-47ef-4747-b4a5-241b88188092 10.0.0.64:10001 Joined
2    768f4769-e21a-44a2-b3a0-647a9a6a5f2f 10.0.0.65:10001 Joined
3    b3bb804b-e453-417b-885d-cf1bae9fa179 10.0.0.61:10001 Joined

However, when I run daos_test, I get the following error (the same test succeeds over Ethernet):

[daos@swat7-01 ~]$ docker exec dc_ib_auto daos_test -i

--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            swat7-01
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4123

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
12/15-06:55:24.37 swat7-01 DAOS[574/574] fi   INFO src/gurt/fault_inject.c:481 d_fault_inject_init() No config file, fault injection is OFF.
12/15-06:55:24.37 swat7-01 DAOS[574/574] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=32
12/15-06:55:24.37 swat7-01 DAOS[574/574] mgmt INFO src/mgmt/cli_mgmt.c:523 dc_mgmt_net_cfg() Using client provided OFI_INTERFACE: ib0
12/15-06:55:24.37 swat7-01 DAOS[574/574] crt  INFO src/cart/crt_init.c:269 crt_init_opt() libcart version 4.8.0 initializing
12/15-06:55:24.37 swat7-01 DAOS[574/574] crt  WARN src/cart/crt_init.c:161 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
12/15-06:55:24.37 swat7-01 DAOS[574/574] crt  WARN src/cart/crt_init.c:380 crt_init_opt() FI_OFI_RXM_USE_SRX not set, set=1
12/15-06:55:24.40 swat7-01 DAOS[574/574] external ERR  # NA -- Error -- /home/daos/daos/build/external/dev/mercury/src/na/na_ofi.c:2064
 # na_ofi_basic_ep_open(): fi_enable() failed, rc: -12 (Cannot allocate memory)
12/15-06:55:24.40 swat7-01 DAOS[574/574] external ERR  # NA -- Error -- /home/daos/daos/build/external/dev/mercury/src/na/na_ofi.c:1981
 # na_ofi_endpoint_open(): na_ofi_basic_ep_open() failed
[swat7-01:574  :0:574] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x17970) [0x7f1f66279970]
    1  /lib64/libucs.so.0(+0x17b22) [0x7f1f66279b22]
    2  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(fi_log_enabled+0x13) [0x7f1f7a3c49b3]
    3  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(+0x7353e) [0x7f1f7a41e53e]
    4  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(+0x7459c) [0x7f1f7a41f59c]
    5  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(+0xc3ec) [0x7f1f7bdd63ec]
    6  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(+0xd44d) [0x7f1f7bdd744d]
    7  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(NA_Initialize_opt+0x3bf) [0x7f1f7bdce0cf]
    8  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libmercury.so.2(HG_Core_init_opt+0xef) [0x7f1f7bff862f]
    9  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libmercury.so.2(HG_Init_opt+0x6f) [0x7f1f7bfefdbf]
   10  /home/daos/daos/install/bin/../lib64/libcart.so.4(+0x4b211) [0x7f1f7e239211]
   11  /home/daos/daos/install/bin/../lib64/libcart.so.4(crt_hg_ctx_init+0x388) [0x7f1f7e23a548]
   12  /home/daos/daos/install/bin/../lib64/libcart.so.4(crt_context_create+0x3dd) [0x7f1f7e207d8d]
   13  /home/daos/daos/install/bin/../lib64/libdaos.so.0(daos_eq_lib_init+0x1fc) [0x7f1f7eb4776c]
   14  /home/daos/daos/install/bin/../lib64/libdaos.so.0(daos_init+0x184) [0x7f1f7eb4b3f4]
   15  daos_test() [0x407baf]
   16  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1f7d511555]
   17  daos_test() [0x409050]
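
The first real failure in the log above is fi_enable() returning -12, which the log itself decodes as "Cannot allocate memory". One possible cause, stated here as an assumption and not something this log confirms, is a low locked-memory limit (RLIMIT_MEMLOCK) inside the Docker container, since RDMA providers generally need to pin memory. The following is a minimal Python sketch for checking that from inside the container; the helper names describe_rc, format_limit, and memlock_limits are hypothetical and are not part of DAOS, Mercury, or libfabric.

```python
import errno
import os
import resource


def describe_rc(rc):
    """Decode a negative errno-style return code such as the -12 in the log."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return "{}: {}".format(name, os.strerror(code))


def format_limit(value):
    """Render an rlimit value in KiB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return "{} KiB".format(value // 1024)


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    # Assumption: the rc printed by na_ofi_basic_ep_open() is a negated errno.
    print("rc -12 ->", describe_rc(-12))
    soft, hard = memlock_limits()
    print("RLIMIT_MEMLOCK soft:", soft, "hard:", hard)
```

If the reported limit turns out to be small (Docker containers often default to a low value), it may be worth retrying with the container started using Docker's --ulimit memlock=-1:-1 option. Whether that resolves this particular failure is untested here.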

Would anyone happen to know what is causing this error, and how I could fix it?

Thank you, I appreciate any help.

Best,
Peter
