daos_test failing with Infiniband


Peter
 

Hello,

I have had issues getting DAOS to work with InfiniBand, and I have been unable to diagnose the problem. I am running DAOS v1.1.1 on CentOS 7 and have tested both the RPMs and a build from source.
I have installed the latest Mellanox drivers and successfully ran the InfiniBand tests. I can run ibping between my hosts. The DAOS cluster appears to start without issue, as far as I can tell.

[daos@swat7-01 ~]$ docker exec dc_ib_auto dmg -i system query --verbose
Rank UUID                                 Control Address State  Reason
---- ----                                 --------------- -----  ------
0    c7adb803-af21-497d-aaba-5da5b8cd121f 10.0.0.63:10001 Joined
1    5333e417-47ef-4747-b4a5-241b88188092 10.0.0.64:10001 Joined
2    768f4769-e21a-44a2-b3a0-647a9a6a5f2f 10.0.0.65:10001 Joined
3    b3bb804b-e453-417b-885d-cf1bae9fa179 10.0.0.61:10001 Joined

However, when I attempt to run daos_test, I receive the following error (the same test succeeds over Ethernet):

[daos@swat7-01 ~]$ docker exec dc_ib_auto daos_test -i

--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            swat7-01
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4123

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
12/15-06:55:24.37 swat7-01 DAOS[574/574] fi   INFO src/gurt/fault_inject.c:481 d_fault_inject_init() No config file, fault injection is OFF.
12/15-06:55:24.37 swat7-01 DAOS[574/574] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=32
12/15-06:55:24.37 swat7-01 DAOS[574/574] mgmt INFO src/mgmt/cli_mgmt.c:523 dc_mgmt_net_cfg() Using client provided OFI_INTERFACE: ib0
12/15-06:55:24.37 swat7-01 DAOS[574/574] crt  INFO src/cart/crt_init.c:269 crt_init_opt() libcart version 4.8.0 initializing
12/15-06:55:24.37 swat7-01 DAOS[574/574] crt  WARN src/cart/crt_init.c:161 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
12/15-06:55:24.37 swat7-01 DAOS[574/574] crt  WARN src/cart/crt_init.c:380 crt_init_opt() FI_OFI_RXM_USE_SRX not set, set=1
12/15-06:55:24.40 swat7-01 DAOS[574/574] external ERR  # NA -- Error -- /home/daos/daos/build/external/dev/mercury/src/na/na_ofi.c:2064
 # na_ofi_basic_ep_open(): fi_enable() failed, rc: -12 (Cannot allocate memory)
12/15-06:55:24.40 swat7-01 DAOS[574/574] external ERR  # NA -- Error -- /home/daos/daos/build/external/dev/mercury/src/na/na_ofi.c:1981
 # na_ofi_endpoint_open(): na_ofi_basic_ep_open() failed
[swat7-01:574  :0:574] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x17970) [0x7f1f66279970]
    1  /lib64/libucs.so.0(+0x17b22) [0x7f1f66279b22]
    2  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(fi_log_enabled+0x13) [0x7f1f7a3c49b3]
    3  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(+0x7353e) [0x7f1f7a41e53e]
    4  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1(+0x7459c) [0x7f1f7a41f59c]
    5  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(+0xc3ec) [0x7f1f7bdd63ec]
    6  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(+0xd44d) [0x7f1f7bdd744d]
    7  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libna.so.2(NA_Initialize_opt+0x3bf) [0x7f1f7bdce0cf]
    8  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libmercury.so.2(HG_Core_init_opt+0xef) [0x7f1f7bff862f]
    9  /home/daos/daos/install/bin/../lib64/../prereq/dev/mercury/lib/libmercury.so.2(HG_Init_opt+0x6f) [0x7f1f7bfefdbf]
   10  /home/daos/daos/install/bin/../lib64/libcart.so.4(+0x4b211) [0x7f1f7e239211]
   11  /home/daos/daos/install/bin/../lib64/libcart.so.4(crt_hg_ctx_init+0x388) [0x7f1f7e23a548]
   12  /home/daos/daos/install/bin/../lib64/libcart.so.4(crt_context_create+0x3dd) [0x7f1f7e207d8d]
   13  /home/daos/daos/install/bin/../lib64/libdaos.so.0(daos_eq_lib_init+0x1fc) [0x7f1f7eb4776c]
   14  /home/daos/daos/install/bin/../lib64/libdaos.so.0(daos_init+0x184) [0x7f1f7eb4b3f4]
   15  daos_test() [0x407baf]
   16  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1f7d511555]
   17  daos_test() [0x409050]

Would anyone happen to know what is causing this error, and how I could fix it?

Thank you, I appreciate any help.

Best,
Peter


Lombardi, Johann
 

Hi Peter,

 

Could you please advise what provider you have specified in the DAOS yaml file? Libfabric seems to be loading libucs.so which is, AFAIK, a library of UCX that we don’t support.

 

Cheers,

Johann
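
Since the backtrace above ends in /lib64/libucs.so.0, one way to look into this is to list the UCX libraries that a binary or shared library links against. The following is only a minimal sketch: the helper name find_ucx_libs is hypothetical, it assumes the usual "name => path (address)" layout of ldd output, and it will not catch libraries that are loaded at run time through dlopen().

```shell
# Hypothetical helper: read ldd-style output on stdin and print the
# resolved path of every UCX library (libucs, libucp, libuct, libucm).
find_ucx_libs() {
    awk '$1 ~ /^libuc[spmt]\.so/ && $2 == "=>" { print $3 }'
}
```

It could be used as `ldd "$(command -v daos_test)" | find_ucx_libs`, and in the same way on the MPI library that daos_test was built against. No output means no UCX library is linked directly.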

 

From: <daos@daos.groups.io> on behalf of Peter <magpiesaresoawesome@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Tuesday 15 December 2020 at 08:10
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] daos_test failing with Infiniband

 


---------------------------------------------------------------------
Intel Corporation SAS (French simplified joint stock company)
Registered headquarters: "Les Montalets"- 2, rue de Paris,
92196 Meudon Cedex, France
Registration Number:  302 456 199 R.C.S. NANTERRE
Capital: 4,572,000 Euros

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Peter
 

I have specified ofi+verbs;ofi_rxm

What should I look into to get libfabric to load a supported library?

Thank you for your reply. 


Lombardi, Johann
 

I see, then maybe libucs is somehow used under the hood. Are you using the MOFED stack?

Maybe you could try to reduce FI_UNIVERSE_SIZE to 512 (i.e. export FI_UNIVERSE_SIZE=512).

 

Cheers,

Johann
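
The FI_UNIVERSE_SIZE suggestion above can be tried for a single command without changing the rest of the shell session. This is a minimal sketch; the wrapper name with_universe_size is hypothetical and not part of DAOS or libfabric.

```shell
# Hypothetical wrapper: run a command with FI_UNIVERSE_SIZE set only
# for that command, leaving the calling shell's environment untouched.
with_universe_size() {
    local size=$1
    shift
    FI_UNIVERSE_SIZE=$size "$@"
}
```

For example, `with_universe_size 512 daos_test -i`. When the test runs inside a container, as in the original report, a variable exported on the host is not visible to the process started by docker exec, so it would need to be passed explicitly, e.g. `docker exec -e FI_UNIVERSE_SIZE=512 dc_ib_auto daos_test -i`.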

 

From: <daos@daos.groups.io> on behalf of Peter <magpiesaresoawesome@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Tuesday 15 December 2020 at 08:36
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] daos_test failing with Infiniband

 



Oganezov, Alexander A
 

Hi Peter,

 

I saw something similar a while ago when our MPI-based applications ended up compiling against a ‘bad’ version of MPI, or more specifically an MPI that links against a bad UCX (UCX provides libucs). There appears to be a bug in some UCX versions that causes this segfault (e.g. https://github.com/open-mpi/ompi/issues/6789).

 

One thing to try is to see which MPIs you have installed and compile against a different one from the one you are using.

 

“module avail” will give you a list of installed MPI packages.

You can then use “module load <package>” and after that recompile DAOS via

scons -c ; scons -c install;  scons MPI_PKG=any -j 12 install

 

Let me know if this helps any.

 

Thanks,

~~Alex.
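
Before recompiling, it may help to check whether the currently loaded Open MPI was built with UCX components. The sketch below is an assumption-laden illustration: the helper name uses_ucx is hypothetical, and it assumes that ompi_info lists components in the form "MCA <framework>: <component> (...)".

```shell
# Hypothetical helper: read ompi_info output on stdin and succeed
# (exit status 0) if any MCA component named "ucx" is listed.
uses_ucx() {
    grep -Eq 'MCA +[a-z0-9_]+: +ucx( |$)'
}
```

A possible use is `ompi_info | uses_ucx && echo "this Open MPI build has UCX components"`. A match does not by itself prove that this MPI is the cause of the segfault. It only indicates that libucs may be pulled in through MPI.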

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Lombardi, Johann
Sent: Tuesday, December 15, 2020 12:00 AM
To: daos@daos.groups.io
Subject: Re: [daos] daos_test failing with Infiniband

 
