Timeouts/DAOS rendered useless when running IOR with SX/default object class
Steffen Christgau
Hi everybody,
during testing and performance assessment with IOR (latest GitHub version from the main branch) we are facing problems with DAOS v1.1.3. When running IOR from a single client node there is no problem with object classes S1 and S2 with up to NP = 48 processes (on the dual-socket, 96-core client machine). When we use the SX class (which is the default in IOR), the benchmark successfully completes some of its iterations but then hangs. This happens with as "little" as NP = 16 processes on that single client.

$ mpiexec -n NP --map-by socket --bind-to core ior -F -r -w -t 1m -b 1g -i 3 -o /ior_file -a DFS --dfs.pool=... --dfs.cont=... --dfs.destroy --dfs.group=daos_server --dfs.oclass=OCLASS

In the client log we find the following:

03/25-12:17:01.53 bcn1031 DAOS[536878/536878] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x132e540) [opc=0x4020012 (DAOS) rpcid=0x5d481ae000000909 rank:tag=9:3] ctx_id 0, (status: 0x38) timed out (60 seconds), target (9:3)

Sixty seconds before the timestamp at which the timeout error occurs on the client, we find the following on rank 9 (hostname bdaos14):

03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:2751
[...]
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] ctx_id 4, (status: 0x3f) timed out (60 seconds), target (16:5)

This happens for other rank:tag combinations as well. The log on rank 16 (which is bdaos3) is basically clean at this point in time (12:16:01). At the time the timeout error manifests at the client, we see the following in the log of bdaos3:

03/25-12:17:01.56 bdaos3 DAOS[27816/27835] object ERR src/object/srv_obj.c:3946 ds_obj_dtx_follower() Handled DTX add8eaf5.199f0f144b80000 on non-leader: DER_UNKNOWN(1): 'Unknown error code 1'

There are a lot more similar errors across all server nodes, which I can send in a PM to whoever raises a hand ;-) Basic operations like container creation and destruction are still working, but even 'daos pool autotest' fails, although it worked fine before we started the deadly IOR run.

$ daos pool autotest --pool=...

In the end, the DAOS system is in a state where it is hardly usable. Only stopping the system and restarting the services brings it fully back to life. Maybe the object class has no impact at all, but with the S1/S2 classes the problem did not manifest, while with SX we can provoke the issue quite fast. While I would understand that striping over all nodes (which is my understanding of SX) may decrease performance compared to S1 or S2, I would not expect the system to transition into an unusable state. Could the libfabric provider (sockets) be an issue here? Does anybody know what might be the reason for this issue and/or what might be changed to solve it?

Regards, Steffen
Oganezov, Alexander A
Hi Steffen,
Could you enable OFI level logs by setting FI_LOG_LEVEL=warn on the client side and provide the stdout/stderr output from runs that result in mercury errors/timeouts? Also, can you tell us what ulimit -a reports on the client and server nodes? We've seen issues before where, if the limit for ulimit -n (open files) is set too low, some sockets connections could fail to be established. Getting the OFI logs from the error would help to narrow this down.

Thanks,
~Alex.
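PS: one way to capture the libfabric warnings together with the IOR output in a single place (the process count, IOR arguments and file names below are only placeholders taken from your earlier mail):

$ export FI_LOG_LEVEL=warn
$ ulimit -a > ulimit_client.txt
$ mpiexec -n 16 --map-by socket --bind-to core ior -F -r -w -t 1m -b 1g -i 3 -o /ior_file -a DFS --dfs.pool=... --dfs.cont=... --dfs.oclass=SX 2>&1 | tee ior_sx_ofi.log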
Steffen Christgau
Hi Alex,
On 3/26/21 4:49 PM, Oganezov, Alexander A wrote:
> Could you enable OFI level logs by setting FI_LOG_LEVEL=warn on the client side and provide stdout/stderr output from runs that result in mercury errors/timeouts?

Thanks for that input, we'll try to reproduce the issue with those settings and provide them ASAP.

> Also can you tell us what your ulimit -a reports on client/server nodes?

Sure.

client $ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1541126
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 370688000
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

On the server side, pending signals is lower: 761096.

Regards, Steffen
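PS: since daemons often run with different limits than a login shell, we will also check the limits of the running server processes directly, along these lines (the process names below are an assumption and may differ):

# print the effective open-files limit of each running DAOS server process
$ for pid in $(pidof daos_server daos_io_server); do grep -H 'Max open files' /proc/$pid/limits; done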
Steffen Christgau
Hi again,
On 3/26/21 5:14 PM, Steffen Christgau wrote:
> On 3/26/21 4:49 PM, Oganezov, Alexander A wrote:
>> Could you enable OFI level logs by setting FI_LOG_LEVEL=warn on the client side and provide stdout/stderr output from runs that result in mercury errors/timeouts?
> Thanks for that input, we'll try to reproduce the issue with those settings and provide them ASAP.

Here is the output of a failed attempt to run IOR. It now crashed for 48 processes on a single client. For smaller process counts IOR succeeds with the same messages/warnings from libfabric.

$ export FI_LOG_LEVEL=warn
$ mpiexec -n 48 --map-by socket --bind-to core /home/bemschri/opt/local/ior/github/bin/ior -F -r -w -t 1m -b 1g -i 3 -o /ior_file -a DFS --dfs.pool=... --dfs.cont=... --dfs.destroy --dfs.group=daos_server --dfs.oclass=SX

libfabric:607767:core:core:fi_getinfo_():1019<warn> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:607767:core:core:fi_getinfo_():1019<warn> fi_getinfo: provider ofi_rxm returned -61 (No data available)
[repeats for each MPI process]
libfabric:607767:core:core:ofi_ns_add_local_name():370<warn> Cannot add local name - name server uninitialized
[repeats again]

IOR-3.4.0+dev: MPI Coordinated Test of Parallel I/O
[...] -w -t 1m -b 1g -i 3 -o /ior_file -a DFS --dfs.pool=... --dfs.cont=... --dfs.destroy --dfs.group=daos_server --dfs.oclass=SX
Machine : Linux bcn1031

And in the DAOS client log we have the following:

03/29-10:47:36.48 bcn1031 DAOS[607790/607790] crt INFO src/cart/crt_init.c:151 data_init() Disabling MR CACHE (FI_MR_CACHE_COUNT=0)
[...] callback is not provided! Therefore the whole hash table locking will be used for backward compatibility.

Regards, Steffen
Steffen Christgau
Hi again once more,
meanwhile we checked the 'tcp' and the 'verbs' providers. For 'tcp' we also experience the timeouts and a subsequently unusable DAOS system. For 'verbs' (on an Omni-Path network) we observe Mercury errors on failed memory registrations:

03/29-12:36:21.95 bdaos15 DAOS[308011/308012] pool ERR src/pool/srv_pool.c:1899 transfer_map_buf() 4810a635: remote pool map buffer (4128) < required (5472)

The version of all the employed providers is '111.10', both on the client and the server side. Maybe this helps a little for the further investigation.

Regards, Steffen
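PS: in case the failed memory registrations with 'verbs' turn out to be limit-related as well, we will inspect the locked-memory limit of the running server processes in the same way as the open-files limit (process names again assumed):

# print the effective locked-memory limit of each running DAOS server process
$ for pid in $(pidof daos_server daos_io_server); do grep -H 'Max locked memory' /proc/$pid/limits; done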
Steffen Christgau
A final "Hi" on that topic,
we have discovered the reason for the issue: the open-files ulimit on the _server_ side was too low, and it differs between regular users and daemons like the DAOS server. For the latter it was set to soft 1024/hard 4096. We increased both limits to 50000 by modifying the service/unit file. With that, we did multiple IOR runs with up to 48 processes and the SX object class from a single client node without any errors.

We noted that the coredump and memlock limits are already "increased" in the server's unit file. Maybe it is a good idea to increase the file limit by default as well, although the required limit may depend on the provider in use.

Regards, Steffen
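PS: for reference, the change boils down to a systemd drop-in along these lines (the unit name and the exact path/value are just how we did it locally and may differ elsewhere):

# drop-in override, e.g. created via 'systemctl edit daos_server':
[Service]
LimitNOFILE=50000

# then reload systemd and restart the service:
$ systemctl daemon-reload
$ systemctl restart daos_server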
Lombardi, Johann
Hi Steffen,
Good catch! It sounds like we need to add a “LimitNOFILE” entry to our daos_server’s systemd unit file. @Rosenzweig, Joel B could you please take care of this? Thanks in advance.
Cheers, Johann
Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sure thing. Unless you say otherwise, I’m planning to submit it against 1.2 and 2.0 branches.
https://github.com/daos-stack/daos/pull/5246