-----Original Message-----
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Steffen Christgau
Sent: Friday, March 26, 2021 7:50 AM
To: daos@daos.groups.io
Subject: [daos] Timeouts/DAOS rendered useless when running IOR with SX/default object class
Hi everybody,
During testing and performance assessment with IOR (latest GitHub
version from the main branch) we are facing problems with DAOS v1.1.3.
When running IOR from a single client node, there is no problem with
object classes S1 and S2 with up to NP = 48 processes (on the
dual-socket, 96-core client machine). When we use the SX class (which
is the default in IOR), the benchmark successfully completes some of
its iterations but then hangs. This happens with as "few" as NP = 16
processes on that single client.
mpiexec -n NP --map-by socket --bind-to core ior -F -r -w -t 1m -b 1g -i 3 \
  -o /ior_file -a DFS --dfs.pool=... --dfs.cont=... --dfs.destroy \
  --dfs.group=daos_server --dfs.oclass=OCLASS
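In case it helps to narrow down where the object class enters the
picture: as far as I understand the DFS API that IOR's DFS backend sits
on, the per-file I/O path boils down to roughly the sketch below. This
is not the actual IOR code; the pool/container handles, file name,
chunk size and error handling are placeholders of mine.

/* Rough sketch of the DFS calls behind "ior -a DFS" as I understand
 * them (not the actual IOR code); poh/coh are assumed to be valid
 * handles obtained earlier via daos_pool_connect()/daos_cont_open(). */
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <daos.h>
#include <daos_fs.h>

static int write_one_transfer(daos_handle_t poh, daos_handle_t coh)
{
    dfs_t       *dfs;
    dfs_obj_t   *obj;
    d_sg_list_t  sgl;
    d_iov_t      iov;
    char        *buf = malloc(1 << 20);  /* one 1 MiB transfer (-t 1m) */
    int          rc;

    if (buf == NULL)
        return -1;

    rc = dfs_mount(poh, coh, O_RDWR, &dfs);
    if (rc)
        goto out_free;

    /* OC_SX is the IOR default; with --dfs.oclass=S1 this would be OC_S1 */
    rc = dfs_open(dfs, NULL /* root dir */, "ior_file", S_IFREG | 0644,
                  O_RDWR | O_CREAT, OC_SX, 1 << 20 /* chunk size */,
                  NULL, &obj);
    if (rc)
        goto out_umount;

    d_iov_set(&iov, buf, 1 << 20);
    sgl.sg_nr     = 1;
    sgl.sg_nr_out = 0;
    sgl.sg_iovs   = &iov;

    rc = dfs_write(dfs, obj, &sgl, 0 /* file offset */, NULL /* blocking */);

    dfs_release(obj);
out_umount:
    dfs_umount(dfs);
out_free:
    free(buf);
    return rc;
}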
In the client log we find the following
03/25-12:17:01.53 bcn1031 DAOS[536878/536878] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x132e540) [opc=0x4020012 (DAOS) rpcid=0x5d481ae000000909 rank:tag=9:3] ctx_id 0, (status: 0x38) timed out (60 seconds), target (9:3)
03/25-12:17:01.53 bcn1031 DAOS[536875/536875] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x1333750) [opc=0x4020012 (DAOS) rpcid=0x5edd88cd00000909 rank:tag=3:6] ctx_id 0, (status: 0x38) timed out (60 seconds), target (3:6)
03/25-12:17:01.53 bcn1031 DAOS[536874/536874] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x13338a0) [opc=0x4020012 (DAOS) rpcid=0x454be3aa00000909 rank:tag=1:4] ctx_id 0, (status: 0x38) timed out (60 seconds), target (1:4)
03/25-12:17:01.53 bcn1031 DAOS[536874/536874] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x13338a0) [opc=0x4020012 (DAOS) rpcid=0x454be3aa00000909 rank:tag=1:4] aborting to group daos_server, rank 1, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536875/536875] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x1333750) [opc=0x4020012 (DAOS) rpcid=0x5edd88cd00000909 rank:tag=3:6] aborting to group daos_server, rank 3, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536878/536878] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x132e540) [opc=0x4020012 (DAOS) rpcid=0x5d481ae000000909 rank:tag=9:3] aborting to group daos_server, rank 9, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536873/536873] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x13340c0) [opc=0x4020012 (DAOS) rpcid=0xaffa39e00000909 rank:tag=14:2] ctx_id 0, (status: 0x38) timed out (60 seconds), target (14:2)
03/25-12:17:01.53 bcn1031 DAOS[536873/536873] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x13340c0) [opc=0x4020012 (DAOS) rpcid=0xaffa39e00000909 rank:tag=14:2] aborting to group daos_server, rank 14, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536875/536875] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x1333750) [opc=0x4020012 (DAOS) rpcid=0x5edd88cd00000909 rank:tag=3:6] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/25-12:17:01.53 bcn1031 DAOS[536878/536878] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x132e540) [opc=0x4020012 (DAOS) rpcid=0x5d481ae000000909 rank:tag=9:3] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/25-12:17:01.53 bcn1031 DAOS[536874/536874] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x13338a0) [opc=0x4020012 (DAOS) rpcid=0x454be3aa00000909 rank:tag=1:4] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/25-12:17:01.53 bcn1031 DAOS[536873/536873] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x13340c0) [opc=0x4020012 (DAOS) rpcid=0xaffa39e00000909 rank:tag=14:2] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
Sixty seconds before the timestamp at which the timeout error occurs on
the client, we find the following on rank 9 (which has hostname bdaos14):
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:2751
# hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR)
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:2674
# hg_core_forward(): Could not forward buffer
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:5017
# HG_Core_forward(): Could not forward handle
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury.c:1960
# HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR)
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] hg ERR src/cart/crt_hg.c:1090 crt_hg_req_send(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] HG_Forward failed, hg_ret: 12
> [...]
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] ctx_id 4, (status: 0x3f) timed out (60 seconds), target (16:5)
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] rpc ERR src/cart/crt_context.c:743 crt_req_timeout_hdlr(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] failed due to group daos_server, rank 16, tgt_uri ofi+sockets://10.246.101.23:20005 can't rea
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] rpc ERR src/cart/crt_context.c:292 crt_rpc_complete(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] failed, DER_UNREACH(-1006): 'Unreachable node'
03/25-12:16:01.57 bdaos14 DAOS[28486/28508] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:2751
This happens for other rank:tag combinations as well.
The log on rank 16 (which is bdaos3) is basically clean at this point
in time (12:16:01). At the time the timeout error manifests on the
client, we see the following in the log of bdaos3:
03/25-12:17:01.56 bdaos3 DAOS[27816/27835] object ERR src/object/srv_obj.c:3946 ds_obj_dtx_follower() Handled DTX add8eaf5.199f0f144b80000 on non-leader: DER_UNKNOWN(1): 'Unknown error code 1'
03/25-12:17:01.56 bdaos3 DAOS[27816/27836] object ERR src/object/srv_obj.c:3946 ds_obj_dtx_follower() Handled DTX add8eaf5.199f0f144b80000 on non-leader: DER_UNKNOWN(1): 'Unknown error code 1'
There are a lot more similar errors across all server nodes, which I
can send in a PM to whoever raises a hand ;-) Basic operations like
container creation and destruction are still working, but even 'daos
pool autotest' now fails, although it worked fine before we started the
deadly IOR run.
daos pool autotest --pool=...
Step  Operation                  Status  Time(sec)  Comment
   0  Initializing DAOS          OK          0.000
   1  Connecting to pool         OK          0.070
   2  Creating container         OK          0.000   uuid =
   3  Opening container          OK          0.060
  10  Generating 1M S1 layouts   OK          2.530
  11  Generating 10K SX layouts  OK          0.630
20 Inserting 1M 128B values rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x12a5540) [opc=0x4020000 (DAOS) rpcid=0x2c178c3e0000001a rank:tag=9:5] ctx_id 1, (status: 0x38) timed out (60 seconds), target (9:5)
rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x12a5540) [opc=0x4020000 (DAOS) rpcid=0x2c178c3e0000001a rank:tag=9:5] aborting to group daos_server, rank 9, tgt_uri ofi+sockets://10.246.101.34:20005
hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x12a5540) [opc=0x4020000 (DAOS) rpcid=0x2c178c3e0000001a rank:tag=9:5] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
object ERR src/object/cli_shard.c:631 dc_rw_cb() RPC 0 failed, DER_TIMEDOUT(-1011): 'Time out'
In the end, the DAOS system is in a state where it is hardly usable.
Only stopping the system and restarting the services brings it fully
back to life.
Maybe the object class has no impact at all, but with the S1/S2
classes the problem did not manifest, while with SX we can provoke the
issue quite quickly. While I would understand that striping over all
nodes (which is my understanding of SX) may decrease performance
compared to S1 or S2, I would not expect the system to transition into
an unusable state. Could the libfabric provider (sockets) be an issue
here?
Does anybody know what might be the reason for this issue and/or what
might be changed to solve it?
Regards
Steffen