DAOS/OFI & MOFED Support


Farrell, Patrick Arthur
 

Good afternoon,

I am curious whether anyone has tried DAOS with MLNX_OFED_LINUX-5.1-0.6.6.0, which is currently the latest version of MOFED 5.1.

I did, and I'm getting Mercury errors related to completion queues (CQs).

So, before dumping the errors:
Should this work?  Is it supported to run DAOS with MOFED 5.1?

Thanks - error dump follows:

For example, on rank0 when trying to create a pool on ranks 0 and 1:
08/28-16:29:42.684123 delphi-002 DAOS[279252/279298] hg   ERR  # NA -- Error -- /delphi/common/daos/build/external/dev/mercury/src/na/na_ofi.c:2555
 # na_ofi_cq_read(): Operation ID was not canceled
08/28-16:29:42.684134 delphi-002 DAOS[279252/279298] hg   ERR  # NA -- Error -- /delphi/common/daos/build/external/dev/mercury/src/na/na_ofi.c:4585
 # na_ofi_progress(): Could not read events from context CQ
08/28-16:29:42.684154 delphi-002 DAOS[279252/279298] hg   ERR  # HG -- Error -- /delphi/common/daos/build/external/dev/mercury/src/mercury_core.c:2758
 # hg_core_progress_na(): Could not make progress on NA (NA_FAULT)
08/28-16:29:42.684161 delphi-002 DAOS[279252/279298] hg   ERR  # HG -- Error -- /delphi/common/daos/build/external/dev/mercury/src/mercury_core.c:2926
 # hg_core_progress(): hg_core_progress_na() failed
08/28-16:29:42.684168 delphi-002 DAOS[279252/279298] hg   ERR  # HG -- Error -- /delphi/common/daos/build/external/dev/mercury/src/mercury_core.c:4317
 # HG_Core_progress(): Could not make progress
08/28-16:29:42.684178 delphi-002 DAOS[279252/279298] hg   ERR  # HG -- Error -- /delphi/common/daos/build/external/dev/mercury/src/mercury.c:1996
 # HG_Progress(): Could not make progress on context (HG_FAULT)
08/28-16:29:42.684185 delphi-002 DAOS[279252/279298] hg   ERR  src/cart/crt_hg.c:1234 crt_hg_progress() HG_Progress failed, hg_ret: 7.
08/28-16:29:42.684194 delphi-002 DAOS[279252/279298] rpc  ERR  src/cart/crt_context.c:1316 crt_progress() crt_hg_progress failed, rc: -1020.
08/28-16:29:42.684201 delphi-002 DAOS[279252/279298] server ERR  src/iosrv/srv.c:565 dss_srv_handler() failed to progress CART context: -1020
08/28-16:30:42.684033 delphi-002 DAOS[279252/279298] rpc  ERR  src/cart/crt_context.c:790 crt_context_timeout_check(0x7fcd8d7a3870) [opc=0x1010007 rpcid=0x6608781e00000134 rank:tag=1:0] ctx_id 0, (status: 0x38) timed out, tgt rank 1, tag 0

And on rank 1:
08/28-16:07:42.807417 delphi-002 DAOS[279251/279299] rpc  ERR  src/cart/crt_context.c:790 crt_context_timeout_check(0x7fc6f820aba0) [opc=0xfe000000 rpcid=0x642008fe00000128 rank:tag=0:0] ctx_id 0, (status: 0x38) timed out, tgt rank 0, tag 0
08/28-16:07:42.807443 delphi-002 DAOS[279251/279299] rpc  ERR  src/cart/crt_context.c:748 crt_req_timeout_hdlr(0x7fc6f820aba0) [opc=0xfe000000 rpcid=0x642008fe00000128 rank:tag=0:0] aborting to group daos_server, rank 0, tgt_uri (null)
08/28-16:07:45.208410 delphi-002 DAOS[279251/279299] rpc  ERR  src/cart/crt_context.c:790 crt_context_timeout_check(0x7fc6f820b3f0) [opc=0xfe000000 rpcid=0x642008fe00000129 rank:tag=0:0] ctx_id 0, (status: 0x38) timed out, tgt rank 0, tag 0
08/28-16:07:45.208419 delphi-002 DAOS[279251/279299] rpc  ERR  src/cart/crt_context.c:748 crt_req_timeout_hdlr(0x7fc6f820b3f0) [opc=0xfe000000 rpcid=0x642008fe00000129 rank:tag=0:0] aborting to group daos_server, rank 0, tgt_uri (null)
08/28-16:07:47.609419 delphi-002 DAOS[279251/279299] rpc  ERR  src/cart/crt_context.c:790 crt_context_timeout_check(0x7fc6f820be90) [opc=0xfe000000 rpcid=0x642008fe0000012a rank:tag=0:0] ctx_id 0, (status: 0x38) timed out, tgt rank 0, tag 0
08/28-16:07:47.609428 delphi-002 DAOS[279251/279299] rpc  ERR  src/cart/crt_context.c:748 crt_req_timeout_hdlr(0x7fc6f820be90) [opc=0xfe000000 rpcid=0x642008fe0000012a rank:tag=0:0] aborting to group daos_server, rank 0, tgt_uri (null)
08/28-16:07:49.811412 delphi-002 DAOS[279251/279299] swim ERR  src/cart/swim/swim.c:802 swim_progress() SWIM shutdown
08/28-16:07:50.10411 delphi-002 DAOS[279251/279299] rpc  ERR  src/cart/crt_context.c:790 crt_context_timeout_check(0x7fc6f820c930) [opc=0xfe000000 rpcid=0x642008fe0000012b rank:tag=0:0] ctx_id 0, (status: 0x38) timed out, tgt rank 0, tag 0
08/28-16:07:50.10419 delphi-002 DAOS[279251/279299] rpc  ERR  src/cart/crt_context.c:748 crt_req_timeout_hdlr(0x7fc6f820c930) [opc=0xfe000000 rpcid=0x642008fe0000012b rank:tag=0:0] aborting to group daos_server, rank 0, tgt_uri (null)
08/28-16:08:14.96837 delphi-002 DAOS[279251/279299] hg   WARN # NA -- Warning -- /delphi/common/daos/build/external/dev/mercury/src/na/na_ofi.c:2575
 # na_ofi_cq_read(): fi_cq_readerr() got err: 5 (Input/output error), prov_errno: 12 (transport retry counter exceeded)
08/28-16:08:14.96853 delphi-002 DAOS[279251/279299] hg   ERR  src/cart/crt_hg.c:1031 crt_hg_req_send_cb(0x7fc6f820b3f0) [opc=0xfe000000 rpcid=0x642008fe00000129 rank:tag=0:0] RPC failed; rc: -1011
08/28-16:08:14.96867 delphi-002 DAOS[279251/279299] hg   ERR  src/cart/crt_hg.c:1031 crt_hg_req_send_cb(0x7fc6f820be90) [opc=0xfe000000 rpcid=0x642008fe0000012a rank:tag=0:0] RPC failed; rc: -1011
08/28-16:08:14.96874 delphi-002 DAOS[279251/279299] hg   ERR  src/cart/crt_hg.c:1031 crt_hg_req_send_cb(0x7fc6f820c930) [opc=0xfe000000 rpcid=0x642008fe0000012b rank:tag=0:0] RPC failed; rc: -1011

-Patrick
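
A dump like the one above can be easier to read once it is grouped by facility and by RPC. The following is a minimal sketch, assuming the line layout seen in the quoted logs (timestamp, host, DAOS[pid/tid], facility, level, message); the regular expressions and the `summarize` helper are illustrative and not part of DAOS or Mercury.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this thread; this is an
# assumption, not an official DAOS log grammar.
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d+) (?P<host>\S+) "
    r"DAOS\[(?P<pid>\d+)/(?P<tid>\d+)\] (?P<facility>\w+)\s+"
    r"(?P<level>ERR|WARN)\s+(?P<msg>.*)$"
)
RPCID_RE = re.compile(r"rpcid=(0x[0-9a-fA-F]+)")


def summarize(lines):
    """Count messages per (facility, level) and timeouts per rpcid."""
    by_facility = Counter()
    timeouts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            # Continuation lines such as " # na_ofi_cq_read(): ..." are skipped.
            continue
        by_facility[(m["facility"], m["level"])] += 1
        rpc = RPCID_RE.search(m["msg"])
        if rpc and "timed out" in m["msg"]:
            timeouts[rpc.group(1)] += 1
    return by_facility, timeouts


# A few lines copied from the rank 1 log above.
SAMPLE = [
    "08/28-16:07:42.807417 delphi-002 DAOS[279251/279299] rpc  ERR  "
    "src/cart/crt_context.c:790 crt_context_timeout_check(0x7fc6f820aba0) "
    "[opc=0xfe000000 rpcid=0x642008fe00000128 rank:tag=0:0] ctx_id 0, "
    "(status: 0x38) timed out, tgt rank 0, tag 0",
    "08/28-16:07:42.807443 delphi-002 DAOS[279251/279299] rpc  ERR  "
    "src/cart/crt_context.c:748 crt_req_timeout_hdlr(0x7fc6f820aba0) "
    "[opc=0xfe000000 rpcid=0x642008fe00000128 rank:tag=0:0] "
    "aborting to group daos_server, rank 0, tgt_uri (null)",
    "08/28-16:08:14.96837 delphi-002 DAOS[279251/279299] hg   WARN "
    "# NA -- Warning -- "
    "/delphi/common/daos/build/external/dev/mercury/src/na/na_ofi.c:2575",
    " # na_ofi_cq_read(): fi_cq_readerr() got err: 5 (Input/output error), "
    "prov_errno: 12 (transport retry counter exceeded)",
    "08/28-16:08:14.96853 delphi-002 DAOS[279251/279299] hg   ERR  "
    "src/cart/crt_hg.c:1031 crt_hg_req_send_cb(0x7fc6f820b3f0) "
    "[opc=0xfe000000 rpcid=0x642008fe00000129 rank:tag=0:0] "
    "RPC failed; rc: -1011",
]

if __name__ == "__main__":
    facilities, timed_out = summarize(SAMPLE)
    for key, count in sorted(facilities.items()):
        print(key, count)
    for rpcid, count in sorted(timed_out.items()):
        print(rpcid, "timed out", count, "time(s)")
```

Feeding each rank's full log through `summarize` should make it easier to see which RPCs timed out repeatedly before the send callback reported a failure.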


Lombardi, Johann
 

Hi Patrick,

 

We are using MOFED 5.0.2 on Frontera and I don’t think we have ever tested with 5.1. Were you able to figure it out?

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "Farrell, Patrick Arthur" <patrick.farrell@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday 28 August 2020 at 23:34
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] DAOS/OFI & MOFED Support

 


---------------------------------------------------------------------
Intel Corporation SAS (French simplified joint stock company)
Registered headquarters: "Les Montalets"- 2, rue de Paris,
92196 Meudon Cedex, France
Registration Number:  302 456 199 R.C.S. NANTERRE
Capital: 4,572,000 Euros

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Farrell, Patrick Arthur
 

We had no specific need for 5.1, so we rolled back for now, since 5.0.x is still supported by Mellanox. The little digging I did suggested the issue is in OFA rather than DAOS, so I expect the OpenFabrics people will fix the incompatibility.

-Patrick
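
For anyone who wants to guard against this kind of mismatch before starting servers, here is a minimal sketch of a version check. It assumes the version string has the form `MLNX_OFED_LINUX-<major>.<minor>-<build>`, as in the string quoted in this thread (`ofed_info -s` typically prints such a string). Treating the whole 5.0 series as tested is an assumption based only on the replies here, and the helper names are hypothetical.

```python
import re

# Series reported as working in this thread (MOFED 5.0.2 on Frontera, and the
# rollback to 5.0.x). Treating every 5.0 build as tested is an assumption.
TESTED_SERIES = (5, 0)


def parse_mofed_version(text):
    """Extract (major, minor) from e.g. 'MLNX_OFED_LINUX-5.1-0.6.6.0'."""
    m = re.search(r"MLNX_OFED_LINUX-(\d+)\.(\d+)-", text)
    if m is None:
        raise ValueError(f"unrecognized MOFED version string: {text!r}")
    return int(m.group(1)), int(m.group(2))


def is_tested(text):
    """Return True if the version string belongs to the tested series."""
    return parse_mofed_version(text) == TESTED_SERIES


if __name__ == "__main__":
    # The first string is the one from this thread; the second is an
    # illustrative 5.0-series string, not taken from the thread.
    for version in ("MLNX_OFED_LINUX-5.1-0.6.6.0", "MLNX_OFED_LINUX-5.0-2.1.8.0"):
        status = "tested" if is_tested(version) else "untested"
        print(version, "->", status)
```

A check like this only flags an untested combination; it says nothing about whether a given MOFED release actually works with DAOS.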


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Lombardi, Johann <johann.lombardi@...>
Sent: Monday, September 7, 2020 12:52 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] DAOS/OFI & MOFED Support
 
