Re: daos_test failing with Infiniband


Lombardi, Johann
 

Hi Peter,

 

Could you please tell us which provider you have specified in the DAOS YAML file? Libfabric seems to be loading libucs.so, which is, AFAIK, part of UCX, and UCX is something we don’t support.
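
For reference, here is a minimal sketch of how the libraries involved in a crash can be read off a backtrace like the one quoted below. It is only an illustration: the function name libraries_in_backtrace and the regular expression are my own, and the sample frames are copied from the reported trace rather than produced by any DAOS tool.

```python
import os
import re

# Matches frames such as "0  /lib64/libucs.so.0(+0x17970) [0x7f1f66279970]".
# The pattern is an assumption based on the frames in this report only.
FRAME_RE = re.compile(r"^\s*(\d+)\s+(\S+?)\((.*?)\)\s+\[0x[0-9a-fA-F]+\]\s*$")


def libraries_in_backtrace(text):
    """Return the basename of the binary or library of each frame, in order."""
    libs = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            libs.append(os.path.basename(match.group(2)))
    return libs


# A few frames copied from the reported backtrace.
sample = """\
    0  /lib64/libucs.so.0(+0x17970) [0x7f1f66279970]
    1  /lib64/libucs.so.0(+0x17b22) [0x7f1f66279b22]
    2  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(fi_log_enabled+0x13) [0x7f1f7a3c49b3]
   15  daos_test() [0x407baf]
"""

print(libraries_in_backtrace(sample))
```

The innermost frames are in libucs.so.0 and the next one up is in libfabric.so.1, which is why the question about the configured provider comes up.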

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of Peter <magpiesaresoawesome@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Tuesday 15 December 2020 at 08:10
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] daos_test failing with Infiniband

 

Hello,

I have had issues getting DAOS to work with InfiniBand, and I have been unable to diagnose the problem. I am running DAOS v1.1.1 on CentOS 7 and have tested both the RPMs and a build from source.
I have installed the latest Mellanox drivers and successfully ran the InfiniBand tests. I can run ibping between my hosts. The DAOS cluster appears to start without issue, as far as I can tell.

[daos@swat7-01 ~]$ docker exec dc_ib_auto dmg -i system query --verbose
Rank UUID                                 Control Address State  Reason
---- ----                                 --------------- -----  ------
0    c7adb803-af21-497d-aaba-5da5b8cd121f 10.0.0.63:10001 Joined
1    5333e417-47ef-4747-b4a5-241b88188092 10.0.0.64:10001 Joined
2    768f4769-e21a-44a2-b3a0-647a9a6a5f2f 10.0.0.65:10001 Joined
3    b3bb804b-e453-417b-885d-cf1bae9fa179 10.0.0.61:10001 Joined

However, when I attempt to run daos_test, I receive the following error (the same test succeeds over Ethernet):

[daos@swat7-01 ~]$ docker exec dc_ib_auto daos_test -i

--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            swat7-01
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4123

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
12/15-06:55:24.37 swat7-01 DAOS[574/574] fi   INFO src/gurt/fault_inject.c:481 d_fault_inject_init() No config file, fault injection is OFF.
12/15-06:55:24.37 swat7-01 DAOS[574/574] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=32
12/15-06:55:24.37 swat7-01 DAOS[574/574] mgmt INFO src/mgmt/cli_mgmt.c:523 dc_mgmt_net_cfg() Using client provided OFI_INTERFACE: ib0
12/15-06:55:24.37 swat7-01 DAOS[574/574] crt  INFO src/cart/crt_init.c:269 crt_init_opt() libcart version 4.8.0 initializing
12/15-06:55:24.37 swat7-01 DAOS[574/574] crt  WARN src/cart/crt_init.c:161 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
12/15-06:55:24.37 swat7-01 DAOS[574/574] crt  WARN src/cart/crt_init.c:380 crt_init_opt() FI_OFI_RXM_USE_SRX not set, set=1
12/15-06:55:24.40 swat7-01 DAOS[574/574] external ERR  # NA -- Error -- /home/daos/daos/build/external/dev/mercury/src/na/na_ofi.c:2064
 # na_ofi_basic_ep_open(): fi_enable() failed, rc: -12 (Cannot allocate memory)
12/15-06:55:24.40 swat7-01 DAOS[574/574] external ERR  # NA -- Error -- /home/daos/daos/build/external/dev/mercury/src/na/na_ofi.c:1981
 # na_ofi_endpoint_open(): na_ofi_basic_ep_open() failed
[swat7-01:574  :0:574] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x17970) [0x7f1f66279970]
    1  /lib64/libucs.so.0(+0x17b22) [0x7f1f66279b22]
    2  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(fi_log_enabled+0x13) [0x7f1f7a3c49b3]
    3  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(+0x7353e) [0x7f1f7a41e53e]
    4  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(+0x7459c) [0x7f1f7a41f59c]
    5  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(+0xc3ec) [0x7f1f7bdd63ec]
    6  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(+0xd44d) [0x7f1f7bdd744d]
    7  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(NA_Initialize_opt+0x3bf) [0x7f1f7bdce0cf]
    8  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libmercury.so.2(HG_Core_init_opt+0xef) [0x7f1f7bff862f]
    9  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libmercury.so.2(HG_Init_opt+0x6f) [0x7f1f7bfefdbf]
   10  /home/daos/daos/install/bin/../lib64/libcart.so.4(+0x4b211) [0x7f1f7e239211]
   11  /home/daos/daos/install/bin/../lib64/libcart.so.4(crt_hg_ctx_init+0x388) [0x7f1f7e23a548]
   12  /home/daos/daos/install/bin/../lib64/libcart.so.4(crt_context_create+0x3dd) [0x7f1f7e207d8d]
   13  /home/daos/daos/install/bin/../lib64/libdaos.so.0(daos_eq_lib_init+0x1fc) [0x7f1f7eb4776c]
   14  /home/daos/daos/install/bin/../lib64/libdaos.so.0(daos_init+0x184) [0x7f1f7eb4b3f4]
   15  daos_test() [0x407baf]
   16  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1f7d511555]
   17  daos_test() [0x409050]

Would anyone happen to know what is causing this error, and how I could fix it?

Thank you, I appreciate any help.

Best,
Peter
