DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected


Ari
 

Hi,

 

We’ve been testing DAOS functionality in our lab and have had success running IOR with 8 clients, with and without POSIX containers, using a single IO instance and a single HCA within one server, on both version 1.0.1 (RPMs) and 1.1.1 (built from source).  When we create two IO instances within one server, one per CPU/HCA, communication errors arise just trying to create a pool.  The same problem occurs on two different physical servers with OFED 4.6.2 on RHEL 7.6 and 7.8 kernels.  I saw a post around spring regarding communication issues with multiple IO instances, but this seems different.  In addition to the snippets below, the server configuration and log files are attached.  Still trying out a couple more things, but wanted to reach out in the meantime.

 

Any ideas on debugging this multirail problem, or any gotchas you’ve encountered?

 

 

The workflow is the same as in previous successful tests using this server with one IO instance. The same server yml file works fine if I comment out either of the two IO server stanzas.

We perform wipefs, prepare, and format for good measure before all of the tests above, and have applied the pre-deployment sysctl ARP settings for multirail with two HCAs in the same subnet.
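
For reference, the dual-instance layout looks roughly like the sketch below. This is a trimmed-down illustration in DAOS 1.1.x config style, not our exact file; the interface names, ports, PCI addresses, and pmem paths are placeholders:

```yaml
# Sketch of a two-instance daos_server.yml (DAOS 1.1.x style).
# Commenting out either entry under "servers:" gives the working
# single-instance setup described above.
name: daos_server
port: 10001
servers:
  - targets: 6
    pinned_numa_node: 0
    fabric_iface: ib0
    fabric_iface_port: 31416
    log_file: /tmp/daos_server-core0.log
    scm_class: dcpm
    scm_mount: /mnt/daos0
    scm_list: [/dev/pmem0]
    bdev_class: nvme
    bdev_list: ["0000:5e:00.0"]
  - targets: 6
    pinned_numa_node: 1
    fabric_iface: ib1
    fabric_iface_port: 31417
    log_file: /tmp/daos_server-core1.log
    scm_class: dcpm
    scm_mount: /mnt/daos1
    scm_list: [/dev/pmem1]
    bdev_class: nvme
    bdev_list: ["0000:d8:00.0"]
```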

 

dac007$ dmg -i system query

Rank  State

----  -----

[0-1] Joined

 

dac007$ dmg -i pool create --scm-size=600G --nvme-size=6TB

Creating DAOS pool with 600 GB SCM and 6.0 TB NVMe storage (10.00 % ratio)

Pool-create command FAILED: pool create failed: DAOS error (-1006): DER_UNREACH

ERROR: dmg: pool create failed: DAOS error (-1006): DER_UNREACH

 

 

Network:

eth0 10.140.0.7/16

ib0 10.10.0.7/16

ib1 10.10.1.7/16

 

net.ipv4.conf.ib0.rp_filter = 2

net.ipv4.conf.ib1.rp_filter = 2

net.ipv4.conf.all.rp_filter = 2

net.ipv4.conf.ib0.accept_local = 1

net.ipv4.conf.ib1.accept_local = 1

net.ipv4.conf.all.accept_local = 1

net.ipv4.conf.ib0.arp_ignore = 2

net.ipv4.conf.ib1.arp_ignore = 2

net.ipv4.conf.all.arp_ignore = 2
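
To make these settings persist across reboots, the same keys can live in a sysctl.d drop-in loaded at boot or via `sysctl --system`. A sketch (the filename is arbitrary; per-interface ib0/ib1 entries repeat the values listed above):

```
# /etc/sysctl.d/90-daos-multirail.conf  (hypothetical filename)
# Relaxed reverse-path filtering and ARP flags so two HCAs on the
# same logical subnet can both be used.
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.all.accept_local = 1
net.ipv4.conf.all.arp_ignore = 2
# ...plus the matching net.ipv4.conf.ib0.* and net.ipv4.conf.ib1.*
# entries shown above...
```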

 

 

LOGS:

dac007$ wc -l /tmp/daos_*.log

    71 /tmp/daos_control.log

  5970 /tmp/daos_server-core0.log

    35 /tmp/daos_server-core1.log

 

These errors repeat in one of the server logs:

10/22-09:11:10.59 dac007 DAOS[194223/194230] vos  DBUG src/vos/vos_io.c:1035 akey_fetch() akey [8] ???? fetch single epr 5-16b

10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 574 connection has been terminated

10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=574

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # NA -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/na/na_ofi.c:3881

# na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -110 (Connection timed out)

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:1828

# hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR)

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:4238

# HG_Core_forward(): Could not forward buffer

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury.c:1933

# HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR)

10/22-09:11:12.59 dac007 DAOS[194223/194230] hg   ERR  src/cart/crt_hg.c:1083 crt_hg_req_send(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] HG_Forward failed, hg_ret: 12

10/22-09:11:12.59 dac007 DAOS[194223/194230] rdb  WARN src/rdb/rdb_raft.c:1980 rdb_timerd() 64616f73[0]: not scheduled for 1.280968 second

10/22-09:11:12.59 dac007 DAOS[194223/194230] daos INFO src/iosrv/drpc_progress.c:295 drpc_handler_ult() dRPC handler ULT for module=2 method=209

10/22-09:11:12.59 dac007 DAOS[194223/194230] mgmt INFO src/mgmt/srv_drpc.c:1989 ds_mgmt_drpc_set_up() Received request to setup server

10/22-09:11:12.59 dac007 DAOS[194223/194230] server INFO src/iosrv/init.c:388 dss_init_state_set() setting server init state to 1

10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:572 server_init() Modules successfully set up

10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:575 server_init() Service fully up

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:798 crt_context_timeout_check(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] ctx_id 0, (status: 0x3f) timed out, tgt rank 1, tag 0

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:744 crt_req_timeout_hdlr(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] failed due to group daos_server, rank 1, tgt_uri ofi+verbs;ofi_rxm://10.10.0.7:31417 can't reach the target

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] RPC failed; rc: -1006

10/22-09:11:12.59 dac007 DAOS[194223/194230] corpc ERR  src/cart/crt_corpc.c:646 crt_corpc_reply_hdlr() RPC(opc: 0x101000b) error, rc: -1006.

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac10185b0) [opc=0x101000b rpcid=0x5a136b7800000000 rank:tag=0:0] RPC failed; rc: -1006

10/22-09:11:12.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 575 connection has been terminated
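
To see at a glance which errors dominate a long server log, a small script like this can tally error lines and the rc codes they carry. This is a sketch: the rc-to-name map covers only DER_UNREACH (-1006), the code visible in this excerpt, and is taken from the log text itself rather than any authoritative header.

```python
import re
from collections import Counter

# Only the code seen in this log excerpt; extend as needed.
DER_NAMES = {-1006: "DER_UNREACH"}

ERR_RE = re.compile(r"\b(ERR|Error)\b")   # matches CART "ERR" and Mercury "Error" lines
RC_RE = re.compile(r"rc: (-?\d+)")        # return codes as printed in the log

def summarize(log_lines):
    """Count error lines and tally the rc codes they report."""
    rcs = Counter()
    err_lines = 0
    for line in log_lines:
        if ERR_RE.search(line):
            err_lines += 1
        m = RC_RE.search(line)
        if m:
            rcs[int(m.group(1))] += 1
    # Replace known codes with their DER_* names for readability.
    return err_lines, {DER_NAMES.get(rc, str(rc)): n for rc, n in rcs.items()}

sample = [
    "10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  crt_rpc_complete() RPC failed; rc: -1006",
    "# na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -110 (Connection timed out)",
]
print(summarize(sample))
```

Pointing this at /tmp/daos_server-core0.log quickly confirms whether every failure is the same DER_UNREACH/-110 timeout pair or whether other codes are mixed in.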

 

control.log:

Management Service access point started (bootstrapped)

daos_io_server:0 DAOS I/O server (v1.1.1) process 194223 started on rank 0 with 6 target, 0 helper XS, firstcore 0, host dac007.

Using NUMA node: 0

DEBUG 09:11:13.492263 member.go:297: adding system member: 10.140.0.7:10001/1/Joined

DEBUG 09:11:13.492670 mgmt_client.go:160: join(dac007:10001, {Uuid:c611d86e-dcaf-44c1-86f0-5d275cf941a3 Rank:1 Uri:ofi+verbs;ofi_rxm://10.10.0.7:31417 Nctxs:7 Addr:0.0.0.0:10001 SrvFaultDomain:/dac007 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}) end

daos_io_server:1 DAOS I/O server (v1.1.1) process 194226 started on rank 1 with 6 target, 0 helper XS, firstcore 0, host dac007.

Using NUMA node: 1

DEBUG 09:13:55.157399 ctl_system.go:187: Received SystemQuery RPC

DEBUG 09:13:55.157863 system.go:645: DAOS system ping-ranks request: &{unaryRequest:{request:{HostList:[10.140.0.7:10001]} rpc:0xbb05e0} Ranks:0-1 Force:false}

DEBUG 09:13:55.158148 rpc.go:183: request hosts: [10.140.0.7:10001]

DEBUG 09:13:55.160340 mgmt_system.go:308: MgmtSvc.PingRanks dispatch, req:{Force:false Ranks:0-1 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

DEBUG 09:13:55.161685 mgmt_system.go:320: MgmtSvc.PingRanks dispatch, resp:{Results:[state:3  rank:1 state:3 ] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

DEBUG 09:13:55.163257 ctl_system.go:213: Responding to SystemQuery RPC: members:<addr:"10.140.0.7:10001" uuid:"226f2871-06d9-4616-9d08-d386992a059b" state:4 > members:<addr:"10.140.0.7:10001" uuid:"c611d86e-dcaf-44c1-86f0-5d275cf941a3" rank:1 state:4 >

DEBUG 09:14:05.417780 mgmt_pool.go:56: MgmtSvc.PoolCreate dispatch, req:{Scmbytes:600000000000 Nvmebytes:6000000000000 Ranks:[] Numsvcreps:1 User:root@ Usergroup:root@ Uuid:de34a760-9db6-4c2f-9709-7592ce880b49 Sys:daos_server Acl:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

DEBUG 09:14:08.282190 mgmt_pool.go:84: MgmtSvc.PoolCreate dispatch resp:{Status:-1006 Svcreps:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

 

 


Farrell, Patrick Arthur
 

Ari,

You say this looks different from the stuff in the spring, but unless you've noticed a difference in the specific error messages, this looks (at a high level) identical to me.  The issue in the spring was communication between two servers on the same node, where they were unable to route to each other.

Yours is a communication issue between the two servers, saying UNREACHABLE, which is presumably a routing failure.  You will have to reference those conversations for details - I have no expertise in the specific issues - but there appear to be IB routing issues here, just as then.

-Patrick




Ari
 

Thanks for the feedback, and agreed. The workaround I tried was to make sure the sysctl settings were applied for the same logical subnet; I tried them in a few combinations (output is in the previous email). The other errors I came across were memory-related, which I haven’t seen a match for.

 

 



Oganezov, Alexander A
 

Hi Ari,

 

Is it possible for you to install a more recent MOFED on your system? In the past we’ve had issues with MOFED releases older than 4.7; locally we’ve been using MOFED 5.0.2.

 

Thanks,

~~Alex.

 


DEBUG 09:13:55.163257 ctl_system.go:213: Responding to SystemQuery RPC: members:<addr:"10.140.0.7:10001" uuid:"226f2871-06d9-4616-9d08-d386992a059b" state:4 > members:<addr:"10.140.0.7:10001" uuid:"c611d86e-dcaf-44c1-86f0-5d275cf941a3" rank:1 state:4 >

DEBUG 09:14:05.417780 mgmt_pool.go:56: MgmtSvc.PoolCreate dispatch, req:{Scmbytes:600000000000 Nvmebytes:6000000000000 Ranks:[] Numsvcreps:1 User:root@ Usergroup:root@ Uuid:de34a760-9db6-4c2f-9709-7592ce880b49 Sys:daos_server Acl:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

DEBUG 09:14:08.282190 mgmt_pool.go:84: MgmtSvc.PoolCreate dispatch resp:{Status:-1006 Svcreps:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

 

 


Ari
 

Hi Alex,

 

That’s good info, I’ll give it a go.  We were hoping to avoid customizing the stack in this particular environment unless required, and this definitely qualifies.

 

Thanks

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Oganezov, Alexander A
Sent: Monday, November 2, 2020 10:58
To: daos@daos.groups.io
Subject: Re: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

 


Hi Ari,

 

Is it possible for you to install more recent MOFED on your system? In the past we’ve had issues with MOFEDs older than 4.7; locally we’ve been using MOFED 5.0.2.
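A quick way to check which stack a node is currently running (a sketch, not from this thread: `ofed_info` ships with MOFED, and the `modinfo` fallback reads the mlx5 driver version; both probes may be absent on a non-IB machine, hence the guards):

```shell
# Report the installed OFED/MOFED stack version, trying the common probes.
if command -v ofed_info >/dev/null 2>&1; then
    ofed_info -s                      # MOFED's own version summary
elif modinfo mlx5_core >/dev/null 2>&1; then
    modinfo -F version mlx5_core      # fall back to the mlx5 driver version
else
    echo "no MOFED installation detected"
fi
```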

 

Thanks,

~~Alex.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Ari
Sent: Thursday, October 22, 2020 8:34 AM
To: daos@daos.groups.io
Subject: Re: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

 

Thanks for the feedback, and agreed. The workaround I tried was to make sure the sysctl settings were applied for the same logical subnet, and I tried them in a few combinations (the output is in my previous email). The other errors I saw were memory-related, and I haven't found a match for them yet.
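For anyone reproducing this, one way to confirm the ARP-related settings actually took effect is to read them back from /proc/sys, which mirrors sysctl. A minimal sketch (interface names ib0/ib1 are from this setup; substitute your own, and any that don't exist are skipped):

```shell
# Spot-check the multirail ARP settings on each relevant interface.
for conf in all ib0 ib1; do
    d=/proc/sys/net/ipv4/conf/$conf
    [ -d "$d" ] || { echo "$conf: not present"; continue; }
    for key in rp_filter accept_local arp_ignore; do
        printf '%s.%s = %s\n' "$conf" "$key" "$(cat "$d/$key")"
    done
done
```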

 

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Farrell, Patrick Arthur
Sent: Thursday, October 22, 2020 10:23
To: daos@daos.groups.io
Subject: Re: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

 


Ari,

 

You say this looks different from the issue in the spring, but unless you've noticed a difference in the specific error messages, it looks (at a high level) identical to me.  The issue in the spring was communication between two servers on the same node that were unable to route to each other.

 

Yours is likewise a communication issue between the two servers, reported as UNREACHABLE, which is presumably a routing failure.  You'll have to refer to those conversations for the details, as I have no expertise in the specific issues, but there appear to be IB routing problems here, just as then.
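One way to sanity-check that routing, rather than DAOS itself, is the problem is a cross-rail ping matrix. A sketch using the addresses from this thread (ib0=10.10.0.7, ib1=10.10.1.7; substitute your own) — with two HCAs on one IPv4 subnet, every interface should be able to reach every rail address once the rp_filter/accept_local/arp_ignore settings apply:

```shell
# Cross-rail reachability matrix: ping each rail address from each HCA.
for src in ib0 ib1; do
    ip link show "$src" >/dev/null 2>&1 || { echo "$src: no such interface"; continue; }
    for dst in 10.10.0.7 10.10.1.7; do
        if ping -c1 -W2 -I "$src" "$dst" >/dev/null 2>&1; then
            echo "$src -> $dst ok"
        else
            echo "$src -> $dst FAILED"
        fi
    done
done
```

Any FAILED entry here would reproduce the DER_UNREACH symptom independently of DAOS.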

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Ari <ari.martinez@...>
Sent: Thursday, October 22, 2020 10:08 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

 


LOGS:

dac007$ wc -l /tmp/daos_*.log

    71 /tmp/daos_control.log

  5970 /tmp/daos_server-core0.log

    35 /tmp/daos_server-core1.log

 

These errors repeat in one of the server logs:

10/22-09:11:10.59 dac007 DAOS[194223/194230] vos  DBUG src/vos/vos_io.c:1035 akey_fetch() akey [8] ???? fetch single epr 5-16b

10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 574 connection has been terminated

10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=574

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # NA -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/na/na_ofi.c:3881

# na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -110 (Connection timed out)
