Timeouts/DAOS rendered useless when running IOR with SX/default object class


Steffen Christgau
 

Hi everybody,

during testing and performance assessment with IOR (latest GitHub version from the main branch) we are facing problems with DAOS v1.1.3.

When running IOR from a single client node, there is no problem with object classes S1 and S2 with up to NP = 48 processes (on the dual-socket, 96-core client machine). When we use the SX class (which is the default in IOR), the benchmark successfully completes some of its iterations but then hangs. This happens with as "little" as NP = 16 processes on that single client.

mpiexec -n NP --map-by socket --bind-to core ior -F -r -w -t 1m -b 1g -i 3 -o /ior_file -a DFS --dfs.pool=... --dfs.cont=... --dfs.destroy --dfs.group=daos_server --dfs.oclass=OCLASS

In the client log we find the following:

03/25-12:17:01.53 bcn1031 DAOS[536878/536878] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x132e540) [opc=0x4020012 (DAOS) rpcid=0x5d481ae000000909 rank:tag=9:3] ctx_id 0, (status: 0x38) timed out (60 seconds), target (9:3)
03/25-12:17:01.53 bcn1031 DAOS[536875/536875] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x1333750) [opc=0x4020012 (DAOS) rpcid=0x5edd88cd00000909 rank:tag=3:6] ctx_id 0, (status: 0x38) timed out (60 seconds), target (3:6)
03/25-12:17:01.53 bcn1031 DAOS[536874/536874] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x13338a0) [opc=0x4020012 (DAOS) rpcid=0x454be3aa00000909 rank:tag=1:4] ctx_id 0, (status: 0x38) timed out (60 seconds), target (1:4)
03/25-12:17:01.53 bcn1031 DAOS[536874/536874] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x13338a0) [opc=0x4020012 (DAOS) rpcid=0x454be3aa00000909 rank:tag=1:4] aborting to group daos_server, rank 1, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536875/536875] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x1333750) [opc=0x4020012 (DAOS) rpcid=0x5edd88cd00000909 rank:tag=3:6] aborting to group daos_server, rank 3, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536878/536878] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x132e540) [opc=0x4020012 (DAOS) rpcid=0x5d481ae000000909 rank:tag=9:3] aborting to group daos_server, rank 9, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536873/536873] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x13340c0) [opc=0x4020012 (DAOS) rpcid=0xaffa39e00000909 rank:tag=14:2] ctx_id 0, (status: 0x38) timed out (60 seconds), target (14:2)
03/25-12:17:01.53 bcn1031 DAOS[536873/536873] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x13340c0) [opc=0x4020012 (DAOS) rpcid=0xaffa39e00000909 rank:tag=14:2] aborting to group daos_server, rank 14, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536875/536875] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x1333750) [opc=0x4020012 (DAOS) rpcid=0x5edd88cd00000909 rank:tag=3:6] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/25-12:17:01.53 bcn1031 DAOS[536878/536878] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x132e540) [opc=0x4020012 (DAOS) rpcid=0x5d481ae000000909 rank:tag=9:3] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/25-12:17:01.53 bcn1031 DAOS[536874/536874] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x13338a0) [opc=0x4020012 (DAOS) rpcid=0x454be3aa00000909 rank:tag=1:4] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/25-12:17:01.53 bcn1031 DAOS[536873/536873] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x13340c0) [opc=0x4020012 (DAOS) rpcid=0xaffa39e00000909 rank:tag=14:2] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
Sixty seconds before the timestamp at which the timeout error occurs on the client, we find the following on rank 9 (which has hostname bdaos14):

03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:2751
# hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR)
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:2674
# hg_core_forward(): Could not forward buffer
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:5017
# HG_Core_forward(): Could not forward handle
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury.c:1960
# HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR)
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] hg ERR src/cart/crt_hg.c:1090 crt_hg_req_send(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] HG_Forward failed, hg_ret: 12
[...]
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] ctx_id 4, (status: 0x3f) timed out (60 seconds), target (16:5)
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] rpc ERR src/cart/crt_context.c:743 crt_req_timeout_hdlr(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] failed due to group daos_server, rank 16, tgt_uri ofi+sockets://10.246.101.23:20005 can't rea
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] rpc ERR src/cart/crt_context.c:292 crt_rpc_complete(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] failed, DER_UNREACH(-1006): 'Unreachable node'
03/25-12:16:01.57 bdaos14 DAOS[28486/28508] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:2751
This happens for other rank:tag combinations as well.
The log on rank 16 (which is bdaos3) is basically clean at this point in time (12:16:01). At the time the timeout error manifests at the client, we see the following in the log of bdaos3:

03/25-12:17:01.56 bdaos3 DAOS[27816/27835] object ERR src/object/srv_obj.c:3946 ds_obj_dtx_follower() Handled DTX add8eaf5.199f0f144b80000 on non-leader: DER_UNKNOWN(1): 'Unknown error code 1'
03/25-12:17:01.56 bdaos3 DAOS[27816/27836] object ERR src/object/srv_obj.c:3946 ds_obj_dtx_follower() Handled DTX add8eaf5.199f0f144b80000 on non-leader: DER_UNKNOWN(1): 'Unknown error code 1'
There are a lot more similar errors across all server nodes, which I can send in a PM to whoever raises a hand ;-) Basic operations like container creation and destruction are still working, but even 'daos pool autotest' fails, although it worked fine before we started the deadly IOR run.

daos pool autotest --pool=...
Step Operation Status Time(sec) Comment
0 Initializing DAOS OK 0.000
1 Connecting to pool OK 0.070
2 Creating container OK 0.000 uuid =
3 Opening container OK 0.060
10 Generating 1M S1 layouts OK 2.530
11 Generating 10K SX layouts OK 0.630
20 Inserting 1M 128B values rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x12a5540) [opc=0x4020000 (DAOS) rpcid=0x2c178c3e0000001a rank:tag=9:5] ctx_id 1, (status: 0x38) timed out (60 seconds), target (9:5)
rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x12a5540) [opc=0x4020000 (DAOS) rpcid=0x2c178c3e0000001a rank:tag=9:5] aborting to group daos_server, rank 9, tgt_uri ofi+sockets://10.246.101.34:20005
hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x12a5540) [opc=0x4020000 (DAOS) rpcid=0x2c178c3e0000001a rank:tag=9:5] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
object ERR src/object/cli_shard.c:631 dc_rw_cb() RPC 0 failed, DER_TIMEDOUT(-1011): 'Time out'
In the end, the DAOS system is in a state where it is hardly usable. Only stopping the system and restarting the services brings it fully back to life.

Maybe the object class has no impact at all, but with the S1/S2 classes the problem did not manifest, while with SX we can provoke the issue quite quickly. While I would understand that striping over all nodes (which is my understanding of SX) may decrease performance compared to S1 or S2, I would not expect the system to transition into an unusable state. Could the libfabric provider (sockets) be an issue here?

Does anybody know what might be the reason for this issue and/or what might be changed to solve it?

Regards

Steffen


Oganezov, Alexander A
 

Hi Steffen,

Could you enable OFI-level logs by setting FI_LOG_LEVEL=warn on the client side and provide stdout/stderr output from runs that result in Mercury errors/timeouts?
Also, can you tell us what ulimit -a reports on client/server nodes?

We've seen issues before where, if the ulimit -n (open files) limit is set too low, some sockets connections can fail to be established. Getting OFI logs from the error would help narrow this down; a sketch of the checks follows.
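Something along these lines should do it. Treat the process name and the environment forwarding as assumptions, since both depend on your setup (on 1.x the engines may run as daos_io_server, and some MPI launchers need an explicit -x/-genv flag to forward variables to the ranks):

# client side: enable libfabric warning-level logs for the IOR run
$ export FI_LOG_LEVEL=warn

# open-files limit of the current shell, on both client and server nodes
$ ulimit -n

# server side: file-descriptor usage of the running daemon; if this
# number approaches the limit, new sockets connections will fail
# (may need root to inspect another user's process)
$ sudo ls /proc/$(pidof daos_server)/fd | wc -l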

Thanks,
~~Alex.


Steffen Christgau
 

Hi Alex,

On 3/26/21 4:49 PM, Oganezov, Alexander A wrote:
> Could you enable OFI-level logs by setting FI_LOG_LEVEL=warn on the client side and provide stdout/stderr output from runs that result in Mercury errors/timeouts?
Thanks for that input, we'll try to reproduce the issue with those settings and provide them ASAP.

> Also, can you tell us what ulimit -a reports on client/server nodes?
Sure.

client $ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1541126
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 370688000
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

On the server side, pending signals is lower: 761096.

Regards, Steffen


Steffen Christgau
 

Hi again,

On 3/26/21 5:14 PM, Steffen Christgau wrote:
> On 3/26/21 4:49 PM, Oganezov, Alexander A wrote:
>> Could you enable OFI-level logs by setting FI_LOG_LEVEL=warn on the client side and provide stdout/stderr output from runs that result in Mercury errors/timeouts?
> Thanks for that input, we'll try to reproduce the issue with those settings and provide them ASAP.
Here is the output of a failed attempt to run IOR. It now crashed with 48 processes on a single client. For smaller process counts, IOR succeeds with the same messages/warnings from libfabric:

$ export FI_LOG_LEVEL=warn
$ mpiexec -n 48 --map-by socket --bind-to core /home/bemschri/opt/local/ior/github/bin/ior -F -r -w -t 1m -b 1g -i 3 -o /ior_file -a DFS --dfs.pool=... --dfs.cont=... --dfs.destroy --dfs.group=daos_server --dfs.oclass=SX
libfabric:607767:core:core:fi_getinfo_():1019<warn> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:607767:core:core:fi_getinfo_():1019<warn> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:607767:core:core:fi_getinfo_():1019<warn> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:607767:ofi_mrail:fabric:mrail_get_core_info():289<warn> OFI_MRAIL_ADDR_STRC env variable not set!
[repeats for each MPI process]

libfabric:607767:core:core:ofi_ns_add_local_name():370<warn> Cannot add local name - name server uninitialized
[repeats again]
IOR-3.4.0+dev: MPI Coordinated Test of Parallel I/O
Began : Mon Mar 29 10:47:36 2021
Command line : /home/bemschri/opt/local/ior/github/bin/ior -F -r -w -t 1m -b 1g -i 3 -o /ior_file -a DFS --dfs.pool=... --dfs.cont=... --dfs.destroy --dfs.group=daos_server --dfs.oclass=SX
Machine : Linux bcn1031
TestID : 0
StartTime : Mon Mar 29 10:47:36 2021
Path : /ior_file.00000000
FS : 4607.9 TiB Used FS: 100.0% Inodes: 192512.0 Mi Used Inodes: 38.3%
Options: api : DFS
apiVersion : DAOS
test filename : /ior_file
access : file-per-process
type : independent
segments : 1
ordering in a file : sequential
ordering inter file : no tasks offsets
nodes : 1
tasks : 48
clients per node : 48
repetitions : 3
xfersize : 1 MiB
blocksize : 1 GiB
aggregate filesize : 48 GiB
Results: access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ----
^C
And in the DAOS client log we have the following:

03/29-10:47:36.48 bcn1031 DAOS[607790/607790] crt INFO src/cart/crt_init.c:151 data_init() Disabling MR CACHE (FI_MR_CACHE_COUNT=0)
03/29-10:47:36.63 bcn1031 DAOS[607790/607790] mem WARN src/gurt/hash.c:763 d_hash_table_create_inplace() The d_hash_table_ops_t->hop_rec_hash() callback is not provided! Therefore the whole hash table locking will be used for backward compatibility.
03/29-10:48:38.41 bcn1031 DAOS[607798/607798] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x1311b60) [opc=0x4020000 (DAOS) rpcid=0x7f90350400000033 rank:tag=14:7] ctx_id 0, (status: 0x38) timed out (60 seconds), target (14:7)
03/29-10:48:38.41 bcn1031 DAOS[607798/607798] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x1311b60) [opc=0x4020000 (DAOS) rpcid=0x7f90350400000033 rank:tag=14:7] aborting to group daos_server, rank 14, tgt_uri ofi+sockets://10.246.101.33:20007
03/29-10:48:38.41 bcn1031 DAOS[607798/607798] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x1311b60) [opc=0x4020000 (DAOS) rpcid=0x7f90350400000033 rank:tag=14:7] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/29-10:48:38.41 bcn1031 DAOS[607798/607798] object ERR src/object/cli_shard.c:631 dc_rw_cb() RPC 0 failed, DER_TIMEDOUT(-1011): 'Time out'
Regards, Steffen


Steffen Christgau
 

Hi again once more,

meanwhile we have checked the 'tcp' and 'verbs' providers.

For 'tcp' we also experience the timeouts and a subsequently unusable DAOS system.

For 'verbs' (on an Omni-Path network) we observe Mercury errors due to failed memory registrations:

03/29-12:36:21.95 bdaos15 DAOS[308011/308012] pool ERR src/pool/srv_pool.c:1899 transfer_map_buf() 4810a635: remote pool map buffer (4128) < required (5472)
03/29-12:36:50.65 bdaos15 DAOS[308011/308089] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_bulk.c:846
# hg_bulk_register(): NA_Mem_register() failed (NA_PROTOCOL_ERROR)
03/29-12:36:50.65 bdaos15 DAOS[308011/308089] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_bulk.c:762
# hg_bulk_create_na_mem_descs(): Could not register segment
03/29-12:36:50.65 bdaos15 DAOS[308011/308089] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_bulk.c:626
# hg_bulk_create(): Could not create NA mem descriptors
03/29-12:36:50.65 bdaos15 DAOS[308011/308089] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_bulk.c:2516
# HG_Bulk_create(): Could not create bulk handle
The version of all the employed providers is '111.10', both on the client and server side.
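(In case it helps reproduction: provider availability and versions can be checked per node with the fi_info utility that ships with libfabric, e.g.

# list all providers libfabric offers on this node, with versions
$ fi_info -l
# details for a single provider, e.g. verbs
$ fi_info -p verbs

where the provider name is whichever one is configured.)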

Maybe this helps a little for further investigation.

Regards, Steffen


Steffen Christgau
 

A final "Hi" on that topic,

we have discovered the reason for the issue: the open-files ulimit on the _server_ side was too low, and it differs between regular users and daemons like the DAOS server. For the latter it was set to soft 1024/hard 4096. We increased both to 50000 by modifying the service/unit file. With that, we did multiple IOR runs with up to 48 processes and the SX object class from a single client node without any errors.

We noted that the coredump and memlock limits are already "increased" in the server's unit file. Maybe it is a good idea to increase the open-files limit as well by default, although the required value may depend on the provider in use.
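Roughly, our change amounts to the following sketch (the unit name and the value 50000 are what applied to our installation; adapt as needed):

# create a drop-in for the server unit and raise the open-files limit
$ sudo systemctl edit daos_server.service
# in the editor that opens, add:
#   [Service]
#   LimitNOFILE=50000
$ sudo systemctl restart daos_server

# verify what the running daemon actually got
$ sudo cat /proc/$(pidof daos_server)/limits | grep -i 'open files'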

Regards, Steffen


Lombardi, Johann
 

Hi Steffen,

Good catch! It sounds like we need to add a “LimitNOFILE” entry to our daos_server’s systemd unit file.

@Rosenzweig, Joel B could you please take care of this? Thanks in advance.

Cheers,

Johann


Rosenzweig, Joel B
 

Sure thing. Unless you say otherwise, I'm planning to submit it against the 1.2 and 2.0 branches.

https://github.com/daos-stack/daos/pull/5246