Re: Timeouts/DAOS rendered useless when running IOR with SX/default object class

Lombardi, Johann
 

Hi Steffen,

 

Good catch! It sounds like we need to add a “LimitNOFILE” entry to our daos_server’s systemd unit file.

@Rosenzweig, Joel B could you please take care of this? Thanks in advance.

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of Steffen Christgau <christgau@...>
Reply-To: <daos@daos.groups.io>
Date: Tuesday 30 March 2021 at 17:04
To: <daos@daos.groups.io>
Subject: Re: [daos] Timeouts/DAOS rendered useless when running IOR with SX/default object class

 

A final "Hi" on that topic,

we have discovered the reason for the issue: the open-files ulimit on the _server_ side was too low, and it differs between regular users and daemons like the DAOS server. For the latter it was set to soft 1024/hard 4096. We increased both to 50000 by modifying the service/unit file. With that we did multiple IOR runs with up to 48 processes and SX object class from a single client node without any errors.

We noted that the coredump and memlock limits are already "increased" in the server's unit file. Maybe it is a good idea to increase the file limit as well by default, although the limit may depend on the provider in use.

 

Regards, Steffen
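
Because the limits applied to a daemon can differ from what `ulimit -a` shows in a login shell, it can help to read the limits of the running process itself. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/limits` is readable; the helper names are hypothetical, and how the PID of the server process is obtained is left to the reader.

```python
from pathlib import Path


def _to_int(field):
    """Convert a limits field to int; 'unlimited' becomes None."""
    return None if field == "unlimited" else int(field)


def parse_open_files_limit(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: Max open files <soft> <hard> files
            fields = line.split()
            return _to_int(fields[3]), _to_int(fields[4])
    raise ValueError("no 'Max open files' row found")


def open_files_limit_of(pid):
    """Read the effective open-files limits of a running process (Linux only)."""
    return parse_open_files_limit(Path(f"/proc/{pid}/limits").read_text())
```

For a daemon running with the soft 1024/hard 4096 limits described above, `parse_open_files_limit` would return `(1024, 4096)`.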

 

 

 

 

 

 



Re: Timeouts/DAOS rendered useless when running IOR with SX/default object class

Steffen Christgau
 

Hi again once more,

meanwhile we checked the 'tcp' and the 'verbs' providers.

For 'tcp' we also experience the timeouts and a subsequently unusable DAOS system.

For 'verbs' (on an OmniPath network) we observe Mercury errors about failed memory registrations:

03/29-12:36:21.95 bdaos15 DAOS[308011/308012] pool ERR src/pool/srv_pool.c:1899 transfer_map_buf() 4810a635: remote pool map buffer (4128) < required (5472)
03/29-12:36:50.65 bdaos15 DAOS[308011/308089] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_bulk.c:846
# hg_bulk_register(): NA_Mem_register() failed (NA_PROTOCOL_ERROR)
03/29-12:36:50.65 bdaos15 DAOS[308011/308089] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_bulk.c:762
# hg_bulk_create_na_mem_descs(): Could not register segment
03/29-12:36:50.65 bdaos15 DAOS[308011/308089] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_bulk.c:626
# hg_bulk_create(): Could not create NA mem descriptors
03/29-12:36:50.65 bdaos15 DAOS[308011/308089] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_bulk.c:2516
# HG_Bulk_create(): Could not create bulk handle

The version of all the employed providers is '111.10', both on the client and on the server side.

Maybe this helps a little for further investigation.

Regards, Steffen


Re: Timeouts/DAOS rendered useless when running IOR with SX/default object class

Steffen Christgau
 

Hi again,

On 3/26/21 5:14 PM, Steffen Christgau wrote:
> On 3/26/21 4:49 PM, Oganezov, Alexander A wrote:
>> Could you enable OFI level logs by setting FI_LOG_LEVEL=warn on the client side and provide stdout/stderr output from runs that result in mercury errors/timeouts?
> Thanks for that input, we'll try to reproduce the issue with those settings and provide them ASAP

Here is the output of a failed attempt to run IOR. It now crashed for 48 processes on a single client. For smaller process counts IOR succeeds with the same messages/warnings from libfabric.

$ export FI_LOG_LEVEL=warn
$ mpiexec -n 48 --map-by socket --bind-to core /home/bemschri/opt/local/ior/github/bin/ior -F -r -w -t 1m -b 1g -i 3 -o /ior_file -a DFS --dfs.pool=... --dfs.cont=... --dfs.destroy --dfs.group=daos_server --dfs.oclass=SX
libfabric:607767:core:core:fi_getinfo_():1019<warn> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:607767:core:core:fi_getinfo_():1019<warn> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:607767:core:core:fi_getinfo_():1019<warn> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:607767:ofi_mrail:fabric:mrail_get_core_info():289<warn> OFI_MRAIL_ADDR_STRC env variable not set!
[repeats for each MPI process]

libfabric:607767:core:core:ofi_ns_add_local_name():370<warn> Cannot add local name - name server uninitialized
[repeats again]
IOR-3.4.0+dev: MPI Coordinated Test of Parallel I/O
Began : Mon Mar 29 10:47:36 2021
Command line : /home/bemschri/opt/local/ior/github/bin/ior -F -r -w -t 1m -b 1g -i 3 -o /ior_file -a DFS --dfs.pool=... --dfs.cont=... --dfs.destroy --dfs.group=daos_server --dfs.oclass=SX
Machine : Linux bcn1031
TestID : 0
StartTime : Mon Mar 29 10:47:36 2021
Path : /ior_file.00000000
FS : 4607.9 TiB Used FS: 100.0% Inodes: 192512.0 Mi Used Inodes: 38.3%
Options:
api : DFS
apiVersion : DAOS
test filename : /ior_file
access : file-per-process
type : independent
segments : 1
ordering in a file : sequential
ordering inter file : no tasks offsets
nodes : 1
tasks : 48
clients per node : 48
repetitions : 3
xfersize : 1 MiB
blocksize : 1 GiB
aggregate filesize : 48 GiB
Results:

access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ----
^C

And in the DAOS client log we have the following:

03/29-10:47:36.48 bcn1031 DAOS[607790/607790] crt INFO src/cart/crt_init.c:151 data_init() Disabling MR CACHE (FI_MR_CACHE_COUNT=0)
03/29-10:47:36.63 bcn1031 DAOS[607790/607790] mem WARN src/gurt/hash.c:763 d_hash_table_create_inplace() The d_hash_table_ops_t->hop_rec_hash() callback is not provided!
Therefore the whole hash table locking will be used for backward compatibility.
03/29-10:48:38.41 bcn1031 DAOS[607798/607798] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x1311b60) [opc=0x4020000 (DAOS) rpcid=0x7f90350400000033 rank:tag=14:7] ctx_id 0, (status: 0x38) timed out (60 seconds), target (14:7)
03/29-10:48:38.41 bcn1031 DAOS[607798/607798] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x1311b60) [opc=0x4020000 (DAOS) rpcid=0x7f90350400000033 rank:tag=14:7] aborting to group daos_server, rank 14, tgt_uri ofi+sockets://10.246.101.33:20007
03/29-10:48:38.41 bcn1031 DAOS[607798/607798] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x1311b60) [opc=0x4020000 (DAOS) rpcid=0x7f90350400000033 rank:tag=14:7] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/29-10:48:38.41 bcn1031 DAOS[607798/607798] object ERR src/object/cli_shard.c:631 dc_rw_cb() RPC 0 failed, DER_TIMEDOUT(-1011): 'Time out'
Regards, Steffen


Re: Timeouts/DAOS rendered useless when running IOR with SX/default object class

Steffen Christgau
 

Hi Alex,

On 3/26/21 4:49 PM, Oganezov, Alexander A wrote:
> Could you enable OFI level logs by setting FI_LOG_LEVEL=warn on the client side and provide stdout/stderr output from runs that result in mercury errors/timeouts?

Thanks for that input, we'll try to reproduce the issue with those settings and provide them ASAP.

> Also can you tell us what your ulimit -a reports on client/server nodes?

Sure.

client $ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1541126
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 370688000
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

On the server side pending signals is lower: 761096.

Regards, Steffen


Re: Timeouts/DAOS rendered useless when running IOR with SX/default object class

Oganezov, Alexander A
 

Hi Steffen,

Could you enable OFI level logs by setting FI_LOG_LEVEL=warn on the client side and provide stdout/stderr output from runs that result in mercury errors/timeouts?
Also can you tell us what your ulimit -a reports on client/server nodes?

We've seen issues before where, if the limit reported by ulimit -n (open files) is set too low, some socket connections could fail to be established. Getting OFI logs from the error would help to narrow this down.

Thanks,
~~Alex.
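
The open-files limit mentioned above can also be checked from inside a client process. A minimal sketch using Python's standard `resource` module (Unix only); the helper `limit_allows` and the descriptor counts passed to it are illustrative assumptions, not values that DAOS or libfabric document.

```python
import resource


def open_files_limit():
    """Return the (soft, hard) RLIMIT_NOFILE values of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_allows(needed, soft):
    """True if a soft limit leaves room for `needed` open descriptors."""
    return soft == resource.RLIM_INFINITY or soft >= needed
```

Calling `limit_allows(50000, 1024)` returns `False`, which is the kind of mismatch the question about `ulimit -a` is meant to surface.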

-----Original Message-----
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Steffen Christgau
Sent: Friday, March 26, 2021 7:50 AM
To: daos@daos.groups.io
Subject: [daos] Timeouts/DAOS rendered useless when running IOR with SX/default object class

Hi everybody,

during testing and performance assessment with IOR (latest Github
version from main branch) we are facing problems with DAOS v1.1.3.

When running IOR from a single client node there is no problem with
object class S1 and S2 with up to NP = 48 processes (from the dual
socket 96 core client machine). When we use the SX class (which is the
default in IOR), the benchmark successfully completes some of its
iterations but then hangs. This happens with as "little" as NP = 16
processes on that single client.

mpiexec -n NP --map-by socket --bind-to core ior -F -r -w -t 1m -b 1g -i
3 -o /ior_file -a DFS --dfs.pool=... --dfs.cont=... --dfs.destroy
--dfs.group=daos_server --dfs.oclass=OCLASS

In the client log we find the following

03/25-12:17:01.53 bcn1031 DAOS[536878/536878] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x132e540) [opc=0x4020012 (DAOS) rpcid=0x5d481ae000000909 rank:tag=9:3] ctx_id 0, (status: 0x38) timed out (60 seconds), target (9:3)
03/25-12:17:01.53 bcn1031 DAOS[536875/536875] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x1333750) [opc=0x4020012 (DAOS) rpcid=0x5edd88cd00000909 rank:tag=3:6] ctx_id 0, (status: 0x38) timed out (60 seconds), target (3:6)
03/25-12:17:01.53 bcn1031 DAOS[536874/536874] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x13338a0) [opc=0x4020012 (DAOS) rpcid=0x454be3aa00000909 rank:tag=1:4] ctx_id 0, (status: 0x38) timed out (60 seconds), target (1:4)
03/25-12:17:01.53 bcn1031 DAOS[536874/536874] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x13338a0) [opc=0x4020012 (DAOS) rpcid=0x454be3aa00000909 rank:tag=1:4] aborting to group daos_server, rank 1, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536875/536875] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x1333750) [opc=0x4020012 (DAOS) rpcid=0x5edd88cd00000909 rank:tag=3:6] aborting to group daos_server, rank 3, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536878/536878] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x132e540) [opc=0x4020012 (DAOS) rpcid=0x5d481ae000000909 rank:tag=9:3] aborting to group daos_server, rank 9, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536873/536873] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x13340c0) [opc=0x4020012 (DAOS) rpcid=0xaffa39e00000909 rank:tag=14:2] ctx_id 0, (status: 0x38) timed out (60 seconds), target (14:2)
03/25-12:17:01.53 bcn1031 DAOS[536873/536873] rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x13340c0) [opc=0x4020012 (DAOS) rpcid=0xaffa39e00000909 rank:tag=14:2] aborting to group daos_server, rank 14, tgt_uri (null)
03/25-12:17:01.53 bcn1031 DAOS[536875/536875] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x1333750) [opc=0x4020012 (DAOS) rpcid=0x5edd88cd00000909 rank:tag=3:6] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/25-12:17:01.53 bcn1031 DAOS[536878/536878] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x132e540) [opc=0x4020012 (DAOS) rpcid=0x5d481ae000000909 rank:tag=9:3] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/25-12:17:01.53 bcn1031 DAOS[536874/536874] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x13338a0) [opc=0x4020012 (DAOS) rpcid=0x454be3aa00000909 rank:tag=1:4] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/25-12:17:01.53 bcn1031 DAOS[536873/536873] hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x13340c0) [opc=0x4020012 (DAOS) rpcid=0xaffa39e00000909 rank:tag=14:2] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'

Sixty seconds before the timestamp at which the timeout error occurs on
the client, we find the following on rank 9 (which has hostname bdaos14):

03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:2751
# hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR)
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:2674
# hg_core_forward(): Could not forward buffer
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:5017
# HG_Core_forward(): Could not forward handle
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury.c:1960
# HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR)
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] hg ERR src/cart/crt_hg.c:1090 crt_hg_req_send(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] HG_Forward failed, hg_ret: 12
[...]
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] ctx_id 4, (status: 0x3f) timed out (60 seconds), target (16:5)
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] rpc ERR src/cart/crt_context.c:743 crt_req_timeout_hdlr(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] failed due to group daos_server, rank 16, tgt_uri ofi+sockets://10.246.101.23:20005 can't rea
03/25-12:16:01.53 bdaos14 DAOS[28486/28507] rpc ERR src/cart/crt_context.c:292 crt_rpc_complete(0x7f255ad80f40) [opc=0x4020012 (DAOS) rpcid=0x4a70d1bc00001f4f rank:tag=16:5] failed, DER_UNREACH(-1006): 'Unreachable node'
03/25-12:16:01.57 bdaos14 DAOS[28486/28508] external ERR # HG -- error -- /builddir/build/BUILD/mercury-2.0.1rc1/src/mercury_core.c:2751

This happens for other rank:tag combinations as well.
The log on rank 16 (which is bdaos3) is basically clean at this point in
time (12:16:01). At the time the timeout error manifests at the client we
see the following in the log of bdaos3.

03/25-12:17:01.56 bdaos3 DAOS[27816/27835] object ERR src/object/srv_obj.c:3946 ds_obj_dtx_follower() Handled DTX add8eaf5.199f0f144b80000 on non-leader: DER_UNKNOWN(1): 'Unknown error code 1'
03/25-12:17:01.56 bdaos3 DAOS[27816/27836] object ERR src/object/srv_obj.c:3946 ds_obj_dtx_follower() Handled DTX add8eaf5.199f0f144b80000 on non-leader: DER_UNKNOWN(1): 'Unknown error code 1'
There are a lot more similar errors over all server nodes which I can
send in a PM to whoever raises a hand ;-) Basic operations like
container creations and destruction are still working but even 'daos
pool autotest' fails although it worked fine before we started the
deadly IOR run.

daos pool autotest --pool=...
Step Operation Status Time(sec) Comment
0 Initializing DAOS OK 0.000
1 Connecting to pool OK 0.070
2 Creating container OK 0.000 uuid =
3 Opening container OK 0.060
10 Generating 1M S1 layouts OK 2.530
11 Generating 10K SX layouts OK 0.630
20 Inserting 1M 128B values rpc ERR src/cart/crt_context.c:806 crt_context_timeout_check(0x12a5540) [opc=0x4020000 (DAOS) rpcid=0x2c178c3e0000001a rank:tag=9:5] ctx_id 1, (status: 0x38) timed out (60 seconds), target (9:5)
rpc ERR src/cart/crt_context.c:755 crt_req_timeout_hdlr(0x12a5540) [opc=0x4020000 (DAOS) rpcid=0x2c178c3e0000001a rank:tag=9:5] aborting to group daos_server, rank 9, tgt_uri ofi+sockets://10.246.101.34:20005
hg ERR src/cart/crt_hg.c:1050 crt_hg_req_send_cb(0x12a5540) [opc=0x4020000 (DAOS) rpcid=0x2c178c3e0000001a rank:tag=9:5] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
object ERR src/object/cli_shard.c:631 dc_rw_cb() RPC 0 failed, DER_TIMEDOUT(-1011): 'Time out'

In the end, the DAOS system is in a state where it is hardly usable. Only
stopping the system and restarting the services brings it fully back to
life.

Maybe the object class has no impact at all, but with the S1/S2 classes
the problem did not manifest. With SX we can provoke the issue quite
fast. While I would understand that striping over all nodes (which is my
understanding of SX) may decrease performance compared to S1 or S2, I would
not expect the system to transition into an unusable state. Could the
libfabric provider (sockets) be an issue here?

Does anybody know what might be the reason for this issue and/or what might
be changed to solve it?

Regards

Steffen
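
The report above lines up a client-side timeout with server-side errors logged 60 seconds earlier. A minimal sketch of that arithmetic, assuming the `MM/DD-HH:MM:SS.cc` timestamp prefix seen in the log excerpts (two fractional digits, no year); the function names are hypothetical, and the default year of 2021 is an assumption taken from the dates of this thread.

```python
from datetime import datetime, timedelta


def parse_log_ts(stamp, year=2021):
    """Parse a log timestamp such as '03/25-12:17:01.53'.

    The log lines carry no year, so one is supplied explicitly (assumption).
    """
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d-%H:%M:%S.%f")


def format_log_ts(moment):
    """Format a datetime back into the two-fractional-digit log style."""
    return moment.strftime("%m/%d-%H:%M:%S.") + f"{moment.microsecond // 10000:02d}"


def request_sent_at(timeout_stamp, timeout_seconds=60):
    """Estimate when a timed-out RPC was issued: timeout time minus the timeout."""
    sent = parse_log_ts(timeout_stamp) - timedelta(seconds=timeout_seconds)
    return format_log_ts(sent)
```

For the client timeout logged at `03/25-12:17:01.53`, `request_sent_at` yields `03/25-12:16:01.53`, the timestamp of the Mercury errors on bdaos14.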



Re: Errors while compiling DAOS on ARM64 platform

Rosenzweig, Joel B
 

Hi Huijun,

 

At one point in time, we added “// +build linux,amd64” to the netdetect.go file to enable it to build under ARM. Does your version of netdetect.go have the following at the end of the copyright header before “package netdetect”? If it does not, go ahead and patch your file accordingly and try again.

 

//
// +build linux,amd64
//

package netdetect

 

Regards,

Joel
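
For readers unfamiliar with the legacy `// +build` syntax: per the Go documentation, space-separated options on one line are ORed, comma-separated terms within an option are ANDed, and `!` negates a term. The following is a simplified sketch of that rule in Python, not the Go toolchain's implementation; it only looks at a set of tags supplied by the caller and ignores filename suffixes, cgo, and release tags. The function names are hypothetical.

```python
def term_ok(term, tags):
    """Evaluate one term, honoring a leading '!' as negation."""
    if term.startswith("!"):
        return term[1:] not in tags
    return term in tags


def build_line_matches(line, tags):
    """Evaluate one legacy '// +build' line against a set of active tags."""
    prefix = "// +build"
    body = line.strip()
    if not body.startswith(prefix):
        raise ValueError("not a +build line")
    # Options are separated by spaces (OR); terms inside an option by commas (AND).
    options = body[len(prefix):].split()
    return any(all(term_ok(t, tags) for t in opt.split(",")) for opt in options)
```

With the tag set `{"linux", "arm64"}`, the line `// +build linux,amd64` evaluates to false, so a file carrying only that constraint is skipped on that platform.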

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Wu Huijun
Sent: Friday, March 19, 2021 11:11 PM
To: daos@daos.groups.io
Subject: [daos] Errors while compiling DAOS on ARM64 platform

 

Hi all, 

I am trying to compile DAOS on an ARM64 platform (little endian). I am working with the branch 'tanabarr/control-no-ipmctl-May2020' to avoid the ipmctl dependency.

However, I got the errors below from go build github.com/daos-stack/daos/src/control/lib/netdetect.
Any clue about this? I checked the GOPATH, and it seems the Go compiler can indeed find the code but just could not compile it.

ar rc build/dev/gcc/src/control/lib/spdk/libnvme_control.a build/dev/gcc/src/control/lib/spdk/src/nvme_control.o build/dev/gcc/src/control/lib/spdk/src/nvme_control_common.o
gcc -c -Isrc/cart/src/include -Isrc/cart/src/cart -I/root/daos/install/include -I/root/daos/install/include/na -E -P src/cart/src/cart/crt_swim.c | cat > build/dev/gcc/src/cart/src/cart/crt_swim_pp.c
ranlib build/dev/gcc/src/control/lib/spdk/libnvme_control.a
cd /root/daos/src/control; /usr/lib/go-1.13/bin/go build -mod vendor -v -ldflags "-X github.com/daos-stack/daos/src/control/build.DaosVersion=1.1.0 -X github.com/daos-stack/daos/src/control/build.ConfigDir=/root/daos/install/etc -B 0x91d6cda8b03b8b86157723c893b049e89e83e1d6" -o /root/daos/build/dev/gcc/src/control/bin/daos_admin github.com/daos-stack/daos/src/control/cmd/daos_admin
github.com/daos-stack/daos/src/control/lib/netdetect
go build github.com/daos-stack/daos/src/control/lib/netdetect: build constraints exclude all Go files in /root/daos/src/control/lib/netdetect
gcc -c -Isrc/cart/src/include -Isrc/cart/src/cart -I/root/daos/install/include -I/root/daos/install/include/na -E -P src/cart/src/cart/crt_tree.c | cat > build/dev/gcc/src/cart/src/cart/crt_tree_pp.c
scons: *** [build/dev/gcc/src/control/bin/daos_agent] Error 1
gcc -c -Isrc/cart/src/include -Isrc/cart/src/cart -I/root/daos/install/include -I/root/daos/install/include/na -E -P src/cart/src/cart/crt_tree_flat.c | cat > build/dev/gcc/src/cart/src/cart/crt_tree_flat_pp.c
gcc -c -Isrc/cart/src/include -Isrc/cart/src/cart -I/root/daos/install/include -I/root/daos/install/include/na -E -P src/cart/src/cart/crt_tree_kary.c | cat > build/dev/gcc/src/cart/src/cart/crt_tree_kary_pp.c
gcc -c -Isrc/cart/src/include -Isrc/cart/src/cart -I/root/daos/install/include -I/root/daos/install/include/na -E -P src/cart/src/cart/crt_tree_knomial.c | cat > build/dev/gcc/src/cart/src/cart/crt_tree_knomial_pp.c
gcc -c -Isrc/cart/src/include -Isrc/cart/src/cart -I/root/daos/install/include -I/root/daos/install/include/na -E -P src/cart/src/cart/crt_hlc.c | cat > build/dev/gcc/src/cart/src/cart/crt_hlc_pp.c
scons: building terminated because of errors.

Cheers,
Huijun

 



Re: Questions about Daos consistency

段世博
 

T3 starts before T1, so T3 can obtain a timestamp lower than T1's. T1 has not yet started when T3 reads C1, so there is no uncertain read.
When T1 writes C1, it does not check read timestamps lower than its own, so T3 cannot see T1's write.


Re: Questions about Daos consistency

Olivier, Jeffrey V
 

I may be missing something here, but assuming T3 is at a later timestamp than T1, the read of C1 would update the read timestamp in the negative entry for C1 (based on a hash of the key). Before T1 creates C1, it would check this timestamp, find a conflict, and be forced to restart at a later timestamp.
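
The check described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the DAOS/VOS implementation: the class and method names (`ToyStore`, `Restart`, `read`, `write`) are hypothetical, timestamps are plain integers, and a dictionary keyed by the key itself stands in for the hash-based negative entry.

```python
class Restart(Exception):
    """The writer must retry at a later timestamp."""


class ToyStore:
    """Single-version toy store; NOT the DAOS/VOS implementation."""

    def __init__(self):
        self.values = {}   # key -> (write timestamp, value)
        self.read_ts = {}  # key -> highest read timestamp, kept even for absent keys

    def read(self, key, ts):
        # Record the read even if the key does not exist (the "negative entry").
        self.read_ts[key] = max(self.read_ts.get(key, 0), ts)
        entry = self.values.get(key)
        if entry is None or entry[0] > ts:
            return None  # not found as of timestamp ts
        return entry[1]

    def write(self, key, value, ts):
        # A read at an equal or later timestamp has already observed the old state.
        last_read = self.read_ts.get(key, 0)
        if last_read >= ts:
            raise Restart(f"{key}: read at {last_read} conflicts with write at {ts}")
        self.values[key] = (ts, value)


store = ToyStore()
print(store.read("C1", ts=20))        # T3 reads C1 at timestamp 20: not found
try:
    store.write("C1", "v1", ts=10)    # T1 tries to create C1 at timestamp 10
except Restart as exc:
    print("restart:", exc)
    store.write("C1", "v1", ts=21)    # T1 retries at a later timestamp
print(store.read("C1", ts=30))
```

In this model the conflict only arises when the reader's timestamp is at or above the writer's; a reader with a smaller timestamp leaves the write untouched.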

 

-Jeff

 

From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Li, Wei G <wei.g.li@...>
Date: Friday, March 19, 2021 at 3:03 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Questions about Daos consistency

You are right. This can also happen with DAOS. I will correct that document.

Thanks,
liwei

> On Mar 19, 2021, at 4:06 PM, 段世博 <duanshibo.d@...> wrote:
>
>   I found that the concurrency control of DAOS is similar to CockroachDB's, but according to the Jepsen analysis (https://jepsen.io/analyses/cockroachdb-beta-20160829) the following situation may occur in CockroachDB. C1 and C2 are two unrelated pieces of data. T2 starts after T1 is committed. However, T3 sees only the write of T2 and cannot see the write of T1. Obviously, this violates external consistency.
>
> T3: r(C1) (not found)
> T1: w(C1)
> T1: commit
> T2: w(C2)
> T2: commit
> T3: r(C2) (found)
> T3: commit
>  
>   Can this happen in DAOS? If not, how does DAOS avoid this situation?
>   Thanks.
>






Re: Questions about Daos consistency

Li, Wei G
 

You are right. This can also happen with DAOS. I will correct that document.

Thanks,
liwei

On Mar 19, 2021, at 4:06 PM, 段世博 <duanshibo.d@gmail.com> wrote:

I found that the concurrency control of DAOS is similar to CockroachDB's, but according to the Jepsen analysis (https://jepsen.io/analyses/cockroachdb-beta-20160829) the following situation may occur in CockroachDB. C1 and C2 are two unrelated pieces of data. T2 starts after T1 is committed. However, T3 sees only the write of T2 and cannot see the write of T1. Obviously, this violates external consistency.

T3: r(C1) (not found)
T1: w(C1)
T1: commit
T2: w(C2)
T2: commit
T3: r(C2) (found)
T3: commit

Can this happen in DAOS? If not, how does DAOS avoid this situation?
Thanks.


Re: Questions about Daos consistency

段世博
 

  I found that the concurrency control of DAOS is similar to CockroachDB's, but according to the Jepsen analysis (https://jepsen.io/analyses/cockroachdb-beta-20160829) the following situation may occur in CockroachDB. C1 and C2 are two unrelated pieces of data. T2 starts after T1 is committed. However, T3 sees only the write of T2 and cannot see the write of T1. Obviously, this violates external consistency.

T3: r(C1) (not found) 
T1: w(C1)
T1: commit
T2: w(C2)
T2: commit
T3: r(C2) (found)
T3: commit
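
To make the anomaly in the history above concrete, here is a minimal brute-force sketch. It assumes only what the history states (T1 writes C1, T2 writes C2, T3 reads C1 as not found and C2 as found, and T2 starts after T1 commits); the names `TXNS`, `explains` and `respects_real_time` are made up for this illustration. It enumerates every serial order and checks which ones reproduce T3's reads and which also respect the real-time order.

```python
from itertools import permutations

# Each operation is ("w", key) or ("r", key, expected_found).
TXNS = {
    "T1": [("w", "C1")],
    "T2": [("w", "C2")],
    "T3": [("r", "C1", False), ("r", "C2", True)],
}
# Real-time constraint from the post: T2 starts after T1 is committed.
REAL_TIME_ORDER = [("T1", "T2")]


def explains(order):
    """True if running the transactions serially in `order` reproduces every read."""
    written = set()
    for name in order:
        for op in TXNS[name]:
            if op[0] == "w":
                written.add(op[1])
            elif (op[1] in written) != op[2]:
                return False
    return True


def respects_real_time(order):
    """True if every (earlier, later) pair appears in that order."""
    return all(order.index(a) < order.index(b) for a, b in REAL_TIME_ORDER)


serializable = [o for o in permutations(TXNS) if explains(o)]
externally_consistent = [o for o in serializable if respects_real_time(o)]
print("serial orders that explain the reads:", serializable)
print("of those, orders respecting real time:", externally_consistent)
```

The only serial order that explains the reads places T2 before T3 before T1, which contradicts T1 committing before T2 starts; the history is therefore serializable but not externally consistent.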
 
  Can this happen in DAOS? If not, how does DAOS avoid this situation?
  Thanks.


DFS fio engine

Lombardi, Johann
 

Hi there,

 

I would just like to share that the DAOS File System (DFS) engine has been integrated into the upstream FIO repository (https://github.com/axboe/fio).

 

How to build it on CentOS 7:

$ sudo yum install centos-release-scl
$ sudo yum install -y git devtoolset-9-gcc libuuid-devel
$ scl enable devtoolset-9 bash
$ git clone http://git.kernel.dk/fio.git
$ cd fio
 

If DAOS is installed via RPMs:

$ ./configure

Otherwise:

$ CFLAGS="-I/path/to/daos/install/include" LDFLAGS="-L/path/to/daos/install/lib64" ./configure

Then, in either case:

$ make -j install

 

How to use it:

 

$ export POOL= # your pool UUID
$ export CONT= # your container UUID
$ fio ./examples/dfs.fio
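
Because the job file is run with POOL and CONT taken from the environment, an empty or mistyped value is an easy mistake. The following is a hedged pre-flight sketch, not part of fio or DAOS: the helper name `fio_dfs_command` is hypothetical, and it assumes only that both variables must hold UUIDs, as stated above.

```python
import os
import uuid


def fio_dfs_command(env=os.environ, job_file="./examples/dfs.fio"):
    """Return the fio argv after checking that POOL and CONT hold valid UUIDs."""
    for name in ("POOL", "CONT"):
        value = env.get(name, "")
        try:
            uuid.UUID(value)  # raises ValueError for empty or malformed strings
        except ValueError:
            raise ValueError(f"{name} must be set to a UUID, got {value!r}") from None
    return ["fio", job_file]
```

A caller could then pass the result to `subprocess.run(fio_dfs_command(), check=True)` so that fio is only started once both variables look sane.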

 

Those instructions will be integrated soon into our online documentation.

 

Cheers,

Johann



Re: Questions about Daos consistency

Li, Wei G
 

Yes. A DAOS client can only "see a state" via unversioned transactions (including I/O operations submitted without an explicit transaction) and explicitly created snapshots. If an application hacks the snapshot epoch, however, effectively specifying an arbitrary version that may not have been snapshotted, then no transactional consistency is promised.

liwei

On Mar 16, 2021, at 10:01 PM, 段世博 <duanshibo.d@gmail.com> wrote:

In the VOS document, the MVCC section mentions: "The MVCC rules ensure that transactions execute as if they are serialized in their epoch order while complying with external consistency, as long as the system clock offsets are always within the expected maximum system clock offset (epsilon)."
I want to know whether external consistency here has the same meaning as Spanner's external consistency, which is described as: "In addition if one transaction completes before another transaction starts to commit, the system guarantees that clients can never see a state that includes the effect of the second transaction but not the first."


Questions about Daos consistency

段世博
 

    In the VOS document, the MVCC section mentions: "The MVCC rules ensure that transactions execute as if they are serialized in their epoch order while complying with external consistency, as long as the system clock offsets are always within the expected maximum system clock offset (epsilon)."
    I want to know whether external consistency here has the same meaning as Google Spanner's external consistency, which is described as: "In addition if one transaction completes before another transaction starts to commit, the system guarantees that clients can never see a state that includes the effect of the second transaction but not the first."


Questions about Daos consistency

段世博
 

    In the VOS document, the MVCC section mentions: "The MVCC rules ensure that transactions execute as if they are serialized in their epoch order while complying with external consistency, as long as the system clock offsets are always within the expected maximum system clock offset (epsilon)."
    I want to know whether external consistency here has the same meaning as Spanner's external consistency, which is described as: "In addition if one transaction completes before another transaction starts to commit, the system guarantees that clients can never see a state that includes the effect of the second transaction but not the first."


Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

Nabarro, Tom
 

Hello Neale,

 

I’m happy to work with you directly on this to get you past any hurdles, if you would like; my e-mail is tom.nabarro@....

The TRANSIENT_FAILURE does indicate a local network-related issue and is unlikely to be fixed by a new release.
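
One way to narrow down such a network issue is to test plain TCP reachability of the control port from the node where dmg fails. The snippet below is only a sketch: `can_connect` is a hypothetical helper, port 10001 is the port used in this thread, and a successful connection says nothing about certificates or the transport configuration.

```python
import socket


def can_connect(host: str, port: int = 10001, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling `can_connect("<access point host>")` from both the access point and the test node shows whether the failure is specific to one side, for example because of a firewall rule or name resolution.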

 

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Neale Petrillo (Contractor) via groups.io
Sent: Thursday, March 4, 2021 7:57 PM
To: daos@daos.groups.io
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Tom, 

 

I tried your suggestion by uninstalling / reinstalling DAOS RPMs to get a blank config file then added things line by line. Unfortunately, I ended up getting "insufficient information in configuration" errors until I ended up with essentially the config file I had before. 

 

I think we're going to suspend our testing of DAOS until a new release comes out instead of tracking down these issues. 

 

Neale


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Nabarro, Tom <tom.nabarro@...>
Sent: Thursday, February 25, 2021 5:52 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

I think getting the most basic configuration working is probably the best way forward, given that dmg is not connecting. Try with an empty config file (discovery mode) on a single host, without any certificates installed on that host, and run without systemd to reduce this to a minimal viable configuration:

 

[tanabarr@wolf-71 daos_m]$ sudo mkdir /var/run/daos_server
[tanabarr@wolf-71 daos_m]$ sudo chmod 777 /var/run/daos_server
[tanabarr@wolf-71 daos_m]$ install/bin/daos_server start -i
DAOS Server config loaded from /home/tanabarr/projects/daos_m/install/etc/daos_server.yml
No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode
no control log file specified; logging to stdout
No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode
DAOS Control Server v1.3.0 (pid 218642) listening on 0.0.0.0:10001

 

Then on a separate terminal on the same host:

 

[tanabarr@wolf-71 daos-control-demo]$ dmg -i -l wolf-71 storage scan
Hosts   SCM Total             NVMe Total
-----   ---------             ----------
wolf-71 6.4 TB (2 namespaces) 3.1 TB (3 controllers)

 

See if you get the transient failure with the above.

 

To be clear, insecure mode is only suitable for development and testing purposes.

The v1.3.0 version string does not represent a release; it is just the version printed when running from master. The above should work with any 1.x version.

 

Regards,

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Neale Petrillo (Contractor) via groups.io
Sent: Thursday, February 25, 2021 5:40 PM
To: daos@daos.groups.io
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Kris and Tom,

 

I'm using Systemd for all the service control and a parallel shell for running the systemctl commands.

 

Disabling certs didn't help. I did find a permission problem with the socket directory, though, and fixing that allows me to run dmg on the access point successfully. I still get the TRANSIENT_FAILURE on my test node, however. Now when I run 'dmg storage format' I get:

 

Cannot format storage with running I/O server instance

 

I tried running 'dmg system stop' on the access point but got the TRANSIENT_FAILURE error. I'm also still getting the no-hugepages error on all the servers.

 

Are the RPMs available for the newer versions? I've also pasted the config file below: 

 

## DAOS server configuration file.

#

## Location of this configuration file is determined by first checking for the

## path specified through the -f option of the daos_server command line.

## Otherwise, /etc/daos_server.conf is used.

#

#

## Name associated with the DAOS system.

## Immutable after reformat.

#

name: daos

#

#

## Access points

#

## To operate, DAOS will need a quorum of access point nodes to be available.

## Must have the same value for all agents and servers in a system.

## Immutable after reformat.

## Hosts can be specified with or without port, default port below

## assumed if not specified.

#

## default: hostname of this node

access_points:

  - <host01>

#

## Default port

#

## Port number to bind daos_server to, this will also

## be used when connecting to access points unless a port is specified.

#

## default: 10001

port: 10001

#

## Transport Credentials Specifying certificates to secure communications

#

transport_config:

#  # In order to disable transport security, uncomment and set allow_insecure

#  # to true. Not recommended for production configurations.

  allow_insecure: false

 

  # Location where daos_server will look for Client certificates

  client_cert_dir: /etc/daos/daosCA/clients

  # Custom CA Root certificate for generated certs

  ca_cert: /etc/daos/daosCA/certs/daosCA.crt

  # Server certificate for use in TLS handshakes

  cert: /etc/daos/daosCA/certs/server.crt

  # Key portion of Server Certificate

  key: /etc/daos/daosCA/certs/server.key

 

#

## Fault domain path

#

## Immutable after reformat.

#

## default: /hostname for a local configuration w/o fault domain

#fault_path: /vcdu0/rack1/hostname

#

#

## Fault domain callback

#

## Path to executable which will return fault domain string.

## Immutable after reformat.

#

#fault_cb: ./.daos/fd_callback

#

#

## Use specific OFI interfaces

#

## Specify either a single fabric interface that will be used by all

## spawned servers or a comma-seperated list of fabric interfaces to be

## assigned individually.

## By default, the DAOS server will auto-detect and use all fabric

## interfaces if any and fall back to socket on the first eth card

## otherwise.

fabric_ifaces:

  - enp94s0

  - enp216s0

#

#

## Use specific OFI provider

#

## Force a specific provider to be used by all the servers.

## The default provider depends on the interfaces that will be auto-detected:

## ofi+psm2 for Omni-Path, ofi+verbs;ofi_rxm for Infiniband/RoCE and finally

## ofi+socket for non-RDMA-capable Ethernet.

#

provider: ofi+verbs;ofi_rxm

#

#

## Storage mount directory

#

## TODO: If no pre-configured mountpoints are specified, DAOS will auto-detect

## NVDIMMs, configure them in interleave mode, format with ext4 and

## mount with the DAX extension creating a subdirectory within scm_mount_path.

#

## This option allows to specify a preferred path where the mountpoints will

## be created. Either the specified directory or its parent must be a mount

## point.

#

## default: /mnt/daos

scm_mount_path: /mnt/daos

#

#

## NVMe SSD whitelist

#

## Only use NVMe controllers with specific PCI addresses.

## Immutable after reformat, colons replaced by dots in PCI identifiers.

## By default, DAOS will use all the NVMe-capable SSDs that don't have active

## mount points.

#

#bdev_include: ["0000:81:00.1","0000:81:00.2","0000:81:00.3"]

#

#

## NVMe SSD blacklist

#

## Only use NVMe controllers with specific PCI addresses. Overrides drives

## listed in nvme_include and forces auto-detection to skip those drives.

## Immutable after reformat, colons replaced by dots in PCI identifiers.

#

#bdev_exclude: ["0000:81:00.1"]

#

#

## Use Hyperthreads

#

## When Hyperthreading is enabled and supported on the system, this parameter

## defines whether the DAOS service thread should only be bound to different

## physical cores (value 0) or hyperthreads (value 1).

#

## default: false

hyperthreads: False

#

#

## Use the given directory for creating unix domain sockets

#

## DAOS Agent and DAOS Server both use unix domain sockets for communication

## with other system components. This setting is the base location to place

## the sockets in.

#

## default: /var/run/daos_server

socket_dir: /var/run/daos_server

#

#

## Number of hugepages to allocate for use by NVMe SSDs

#

## Specifies the number (not size) of hugepages to allocate for use by NVMe

## through SPDK. This indicates the total number to be used by any spawned

## servers. Default system hugepage size will be used and hugepages will be

## evenly distributed between CPU nodes.

#

## default: 1024

nr_hugepages: 4096

#

#

## Force specific debug mask for daos_server (control plane).

## By default, just use the default debug mask used by daos_server.

## Mask specifies minimum level of message significance to pass to logger.

## Currently supported values are DEBUG and ERROR.

#

## default: DEBUG

#control_log_mask: ERROR

#

#

## Force specific path for daos_server (control plane) logs.

#

## default: print to stderr

control_log_file: /var/log/daos/daos_control.log

#

#

## Enable daos_admin (privileged helper) logging.

#

## default: disabled (errors only to control plane log)

helper_log_file: /var/log/daos/daos_admin.log

#

#

# When per-server definitions exist, auto-allocation of resources is not

# performed. Without per-server definitions, node resources will

# automatically be assigned to servers based on NUMA ratings, there will

# be a one-to-one relationship between servers and sockets.

 

servers:

-

  # Rank to be assigned as identifier for server.

  # Immutable after reformat.

  # Optional parameter, will be auto generated if not supplied.

 

  rank: 0

 

  # Targets (VOS) represent the count of storage targets per data plane

  # server starting at core offset specified by first_core.

 

  # Immutable after reformat.

 

  targets: 24

 

  # Count of offload/helper xstreams per target. (allowed values: 0-2)

  # Immutable after reformat.

 

  # default: 2

  nr_xs_helpers: 0

 

  # Offset of the first core for service xstreams.

  # Immutable after reformat.

 

  # default: 0

  first_core: 0

 

  # Use specific OFI interfaces.

  # Specify the fabric network interface that will be used by this server.

  # Optionally specify the fabric network interface port that will be used

  # by this server but please only if you have a specific need, this will

  # normally be chosen automatically.

 

  fabric_iface: enp94s0

  fabric_iface_port: 20000

  pinned_numa_node: 0

 

#  # Force specific debug mask (D_LOG_MASK) at start up time.

#  # By default, just use the default debug mask used by DAOS.

#  # Mask specifies minimum level of message significance to pass to logger.

#

#  # default: ERR

#  log_mask: WARN

#

#  # Force specific path for DAOS debug logs (D_LOG_FILE).

#

#  # default: /tmp/daos.log

  log_file: /var/log/daos/daos_server1.log

#

#  # Pass specific environment variables to the DAOS server.

#  # Empty by default. Values should be supplied without encapsulating quotes.

#

#  env_vars:

#      - CRT_TIMEOUT=30

#

#  # Define a pre-configured mountpoint for storage class memory to be used

#  # by this server.

#  # Path should be unique to server instance (can use different subdirs).

#  # Either the specified directory or its parent must be a mount point.

#

  scm_mount: /mnt/daos/1

#

#  # Backend SCM device type. Either use DCPM (datacentre persistent memory)

#  # modules configured in interleaved mode (AppDirect regions) or emulated

#  # tmpfs in RAM.

#  # Options are:

#  # - "dcpm" for real SCM (preferred option), scm_size ignored

#  # - "ram" to emulate SCM with memory, scm_list ignored

#  # Immutable after reformat.

#

#  # default: dcpm

  scm_class: dcpm

 

  # When scm_class is set to dcpm, scm_list is the list of device paths for

  # AppDirect pmem namespaces (currently only one per server supported).

  scm_list: [/dev/pmem0]

 

#

#  # When scm_class is set to ram, tmpfs will be used to emulate SCM.

#  # The size of ram is specified by scm_size in GB units.

#  scm_size: 16

#

#  # Backend block device type. Force a SPDK driver to be used by this server

#  # instance.

#  # Options are:

#  # - "nvme" for NVMe SSDs (preferred option), bdev_{size,number} ignored

#  # - "malloc" to emulate a NVMe SSD with memory, bdev_list ignored

#  # - "file" to emulate a NVMe SSD with a regular file, bdev_number ignored

#  # - "kdev" to use a kernel block device, bdev_{size,number} ignored

#  # Immutable after reformat.

#

#  # default: nvme

  bdev_class: nvme

#

#  # Backend block device configuration to be used by this server instance.

#  # When bdev_class is set to nvme, bdev_list is the list of unique NVMe IDs

#  # that should be different across different server instance.

#  # Immutable after reformat.

  bdev_list: ["0000:1c:00.0","0000:20:00.0","0000:3f:00.0","0000:43:00.0"]  # generate regular nvme.conf

-

#  # Rank to be assigned as identifier for server.

#  # Immutable after reformat.

#  # Optional parameter, will be auto generated if not supplied.

#

  rank: 1

 

  # Targets (VOS) represent the number of logical CPUs to be used starting at

  # index specified by first_core.

 

  # Targets will be used to run XStreams can be thought of as service threads.

  # Immutable after reformat.

 

  targets: 24

 

  # Number of helper XStreams per VOS target. (allowed values: 0-2)

  # Immutable after reformat.

#

#  # default: 2

#  nr_xs_helpers: 1

#

#  # Index of first core for service thread.

#  # Immutable after reformat.

#

  # default: 0

  first_core: 24

 

  # Use specific OFI interfaces.

  # Specify the fabric network interface that will be used by this server.

  # Optionally specify the fabric network interface port that will be used

  # by this server but please only if you have a specific need, this will

  # normally be chosen automatically.

 

  fabric_iface: enp216s0

  fabric_iface_port: 20000

  pinned_numa_node: 1

 

#  # Force specific debug mask (D_LOG_MASK) at start up time.

#  # By default, just use the default debug mask used by DAOS.

#  # Mask specifies minimum level of message significance to pass to logger.

#

#  # default: ERR

#  log_mask: WARN

#

#  # Force specific path for DAOS debug logs.

#

#  # default: /tmp/daos.log

  log_file: /var/log/daos/daos_server2.log

#

#  # Pass specific environment variables to the DAOS server

#  # Empty by default. Values should be supplied without encapsulating quotes.

#

#  env_vars:

#      - CRT_TIMEOUT=100

#

#  # Define a pre-configured mountpoint for storage class memory to be used

#  # by this server.

#  # Path should be unique to server instance (can use different subdirs).

#

  scm_mount: /mnt/daos/2

#

#  # Backend SCM device type. Either use DCPM (datacentre persistent memory)

#  # modules configured in interleaved mode (AppDirect regions) or emulated

#  # tmpfs in RAM.

#  # Options are:

#  # - "dcpm" for real SCM (preferred option), scm_size is ignored

#  # - "ram" to emulate SCM with memory, scm_list is ignored

#  # Immutable after reformat.

#

  # default: dcpm

  scm_class: dcpm

 

  # When scm_class is set to dcpm, scm_list is the list of device paths for

  # AppDirect pmem namespaces (currently only one per server supported).

  scm_list: [/dev/pmem1]

#

#  # Backend block device type. Force a SPDK driver to be used by this server

#  # instance.

#  # Options are:

#  # - "nvme" for NVMe SSDs (preferred option), bdev_{size,number} ignored

#  # - "malloc" to emulate a NVMe SSD with memory, bdev_list ignored

#  # - "file" to emulate a NVMe SSD with a regular file, bdev_number ignored

#  # - "kdev" to use a kernel block device, bdev_{size,number} ignored

#  # Immutable after reformat.

#

#  # When bdev_class is set to malloc, bdev_number is the number of devices

#  # to allocate and bdev_size is the size in GB of each LUN/device.

#  bdev_class: malloc

#  bdev_number: 1

#  bdev_size: 4

#

#  # When bdev_class is set to file, bdev_list is the list of file paths that

#  # will be used to emulate NVMe SSDs. The size of each file is specified by

#  # bdev_size in GB unit.

#  bdev_class: file

#  bdev_list: [/tmp/daos-bdev1,/tmp/daos-bdev2]

#  bdev_size: 16

#

#  # When bdev_class is set to kdev, bdev_list is the list of unique kernel

#  # block devices that should be different across different server instance.

#  bdev_class: kdev

#  bdev_list: [/dev/sdc,/dev/sdd]

  bdev_class: nvme

#

#  # Backend block device configuration to be used by this server instance.

#  # When bdev_class is set to nvme, bdev_list is the list of unique NVMe IDs

#  # that should be different across different server instance.

#  # Immutable after reformat.

  bdev_list: ["0000:89:00.0","0000:8d:00.0","0000:b2:00.0","0000:b6:00.0"]  # generate regular nvme.conf

 

 

 


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Jacque, Kristin <kristin.jacque@...>
Sent: Wednesday, February 24, 2021 8:31 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Neale,

 

I suspect this may be a case of incompatible transport configurations. All components must be configured to either enable or disable certificates. If you prefer to run without certs, as with the dmg “-i” option, your server and agent must also be configured with “allow_insecure: true” in the yml file.

 

In your server config file I am seeing certs enabled:

 

transport_config:
#  # In order to disable transport security, uncomment and set allow_insecure
#  # to true. Not recommended for production configurations.
  allow_insecure: false

  # Location where daos_server will look for Client certificates
  client_cert_dir: /etc/daos/daosCA/clients
  # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/daosCA/certs/daosCA.crt
  # Server certificate for use in TLS handshakes
  cert: /etc/daos/daosCA/certs/server.crt
  # Key portion of Server Certificate
  key: /etc/daos/daosCA/certs/server.key
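
As a rough illustration of the rule that every component must agree on whether certificates are used, the sketch below extracts the active `allow_insecure` value from config text and compares it with the mode another component runs in. It is an assumption-laden toy: `allow_insecure` and `consistent` are hypothetical helpers, the parsing is a simple line match rather than a real YAML parser, and treating `dmg -i` as "insecure: true" follows the description in this message.

```python
import re


def allow_insecure(config_text):
    """Return the active allow_insecure value, or None if absent or commented out."""
    for line in config_text.splitlines():
        # Commented lines start with '#', so they never match this pattern.
        match = re.match(r"\s*allow_insecure:\s*(true|false)\b", line, re.IGNORECASE)
        if match:
            return match.group(1).lower() == "true"
    return None


def consistent(settings):
    """True if every component reports the same allow_insecure value."""
    return len(set(settings.values())) == 1


SERVER_YML = """\
transport_config:
#  # to true. Not recommended for production configurations.
  allow_insecure: false
"""

settings = {
    "daos_server": allow_insecure(SERVER_YML),
    "dmg (-i)": True,  # dmg was invoked with -i, i.e. without certificates
}
print(settings, "consistent:", consistent(settings))
```

With the server config from this thread and dmg run with `-i`, the two settings disagree, which matches the suspected cause of the connection failure.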

 

If that doesn’t resolve the connection failure, Tom’s suggestions will help you get to a good starting point to debug further.

 

Please let us know how it goes.

 

Thanks,

Kris

 

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Petrillo, Neale A. (Contractor) via groups.io
Sent: Wednesday, February 24, 2021 2:00 PM
To: daos@daos.groups.io
Subject: [daos] Startup Errors

 

Hello Group! 

 

I'm having some trouble getting my new DAOS cluster working. I've installed 6 servers all with the 1.0.1 RPMs. When I do a 'dmg storage format' from my test host, I get the following output:

 

[root@head ~]# dmg -i -l <host01>:10001 storage format
ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)
ERROR: dmg: no active connections
[root@head ~]# dmg -i -l <host01> system query
ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)
ERROR: dmg: no active connections

 

I'm also seeing these errors in the log files:

 

INFO 2021/02/18 10:40:15 DAOS I/O Server instance 0 storage not ready: context canceled
INFO 2021/02/18 10:40:19 SCM format required on instance 1
INFO 2021/02/18 10:40:19 DAOS I/O Server instance 1 storage not ready: context canceled
INFO 2021/02/18 10:40:19 DAOS Control Server (pid 9993) shutting down
ERROR 2021/02/18 10:40:54 /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
INFO 2021/02/18 10:41:00 DAOS Control Server (pid 11507) listening on 0.0.0.0:10001
INFO 2021/02/18 10:41:00 Waiting for DAOS I/O Server instance storage to be ready...
INFO 2021/02/18 10:41:04 SCM format required on instance 0

 

Configuration files are attached. Any help would be appreciated! 

Neale

 

---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.



Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

Neale Petrillo (Contractor)
 

Hi Tom, 

I tried your suggestion by uninstalling / reinstalling DAOS RPMs to get a blank config file then added things line by line. Unfortunately, I ended up getting "insufficient information in configuration" errors until I ended up with essentially the config file I had before. 

I think we're going to suspend our testing of DAOS until a new release comes out instead of tracking down these issues. 

Neale


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Nabarro, Tom <tom.nabarro@...>
Sent: Thursday, February 25, 2021 5:52 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors
 

I think getting the most basic configuration working is probably the best way forward, given that dmg is not connecting. Try with an empty config file (discovery mode) on a single host, without any certificates installed on that host, and run without systemd to reduce this to a minimal viable configuration:

 

[tanabarr@wolf-71 daos_m]$ sudo mkdir /var/run/daos_server
[tanabarr@wolf-71 daos_m]$ sudo chmod 777 /var/run/daos_server
[tanabarr@wolf-71 daos_m]$ install/bin/daos_server start -i
DAOS Server config loaded from /home/tanabarr/projects/daos_m/install/etc/daos_server.yml
No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode
no control log file specified; logging to stdout
No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode
DAOS Control Server v1.3.0 (pid 218642) listening on 0.0.0.0:10001

 

Then on a separate terminal on the same host:

 

[tanabarr@wolf-71 daos-control-demo]$ dmg -i -l wolf-71 storage scan
Hosts   SCM Total             NVMe Total
-----   ---------             ----------
wolf-71 6.4 TB (2 namespaces) 3.1 TB (3 controllers)

 

See if you get the transient failure with the above.

 

To be clear, insecure mode is only suitable for development and testing purposes.

The v1.3.0 version string does not represent a release; it is just the version printed when running from master. The above should work with any 1.x version.

 

Regards,

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Neale Petrillo (Contractor) via groups.io
Sent: Thursday, February 25, 2021 5:40 PM
To: daos@daos.groups.io
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Kris and Tom,

 

I'm using Systemd for all the service control and a parallel shell for running the systemctl commands.

 

Disabling certs didn't help. I did find a permission problem with the socket directory, though, and fixing that allows me to run dmg on the access point successfully. I still get the TRANSIENT_FAILURE on my test node, however. Now when I run 'dmg storage format' I get:

 

Cannot format storage with running I/O server instance

 

I tried running 'dmg system stop' on the access point but got the TRANSIENT_FAILURE error. I'm also still getting the no-hugepages error on all the servers.

 

Are the RPMs available for the newer versions? I've also pasted the config file below: 

 

## DAOS server configuration file.
#
## Location of this configuration file is determined by first checking for the
## path specified through the -f option of the daos_server command line.
## Otherwise, /etc/daos_server.conf is used.
#
#
## Name associated with the DAOS system.
## Immutable after reformat.
#
name: daos
#
#
## Access points
#
## To operate, DAOS will need a quorum of access point nodes to be available.
## Must have the same value for all agents and servers in a system.
## Immutable after reformat.
## Hosts can be specified with or without port, default port below
## assumed if not specified.
#
## default: hostname of this node
access_points:
  - <host01>
#
## Default port
#
## Port number to bind daos_server to, this will also
## be used when connecting to access points unless a port is specified.
#
## default: 10001
port: 10001
#
## Transport Credentials Specifying certificates to secure communications
#
transport_config:
#  # In order to disable transport security, uncomment and set allow_insecure
#  # to true. Not recommended for production configurations.
  allow_insecure: false

  # Location where daos_server will look for Client certificates
  client_cert_dir: /etc/daos/daosCA/clients
  # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/daosCA/certs/daosCA.crt
  # Server certificate for use in TLS handshakes
  cert: /etc/daos/daosCA/certs/server.crt
  # Key portion of Server Certificate
  key: /etc/daos/daosCA/certs/server.key

#
## Fault domain path
#
## Immutable after reformat.
#
## default: /hostname for a local configuration w/o fault domain
#fault_path: /vcdu0/rack1/hostname
#
#
## Fault domain callback
#
## Path to executable which will return fault domain string.
## Immutable after reformat.
#
#fault_cb: ./.daos/fd_callback
#
#
## Use specific OFI interfaces
#
## Specify either a single fabric interface that will be used by all
## spawned servers or a comma-separated list of fabric interfaces to be
## assigned individually.
## By default, the DAOS server will auto-detect and use all fabric
## interfaces if any and fall back to socket on the first eth card
## otherwise.
fabric_ifaces:
  - enp94s0
  - enp216s0
#
#
## Use specific OFI provider
#
## Force a specific provider to be used by all the servers.
## The default provider depends on the interfaces that will be auto-detected:
## ofi+psm2 for Omni-Path, ofi+verbs;ofi_rxm for Infiniband/RoCE and finally
## ofi+socket for non-RDMA-capable Ethernet.
#
provider: ofi+verbs;ofi_rxm
#
#
## Storage mount directory
#
## TODO: If no pre-configured mountpoints are specified, DAOS will auto-detect
## NVDIMMs, configure them in interleave mode, format with ext4 and
## mount with the DAX extension creating a subdirectory within scm_mount_path.
#
## This option allows to specify a preferred path where the mountpoints will
## be created. Either the specified directory or its parent must be a mount
## point.
#
## default: /mnt/daos
scm_mount_path: /mnt/daos
#
#
## NVMe SSD whitelist
#
## Only use NVMe controllers with specific PCI addresses.
## Immutable after reformat, colons replaced by dots in PCI identifiers.
## By default, DAOS will use all the NVMe-capable SSDs that don't have active
## mount points.
#
#bdev_include: ["0000:81:00.1","0000:81:00.2","0000:81:00.3"]
#
#
## NVMe SSD blacklist
#
## Only use NVMe controllers with specific PCI addresses. Overrides drives
## listed in nvme_include and forces auto-detection to skip those drives.
## Immutable after reformat, colons replaced by dots in PCI identifiers.
#
#bdev_exclude: ["0000:81:00.1"]
#
#
## Use Hyperthreads
#
## When Hyperthreading is enabled and supported on the system, this parameter
## defines whether the DAOS service thread should only be bound to different
## physical cores (value 0) or hyperthreads (value 1).
#
## default: false
hyperthreads: False
#
#
## Use the given directory for creating unix domain sockets
#
## DAOS Agent and DAOS Server both use unix domain sockets for communication
## with other system components. This setting is the base location to place
## the sockets in.
#
## default: /var/run/daos_server
socket_dir: /var/run/daos_server
#
#
## Number of hugepages to allocate for use by NVMe SSDs
#
## Specifies the number (not size) of hugepages to allocate for use by NVMe
## through SPDK. This indicates the total number to be used by any spawned
## servers. Default system hugepage size will be used and hugepages will be
## evenly distributed between CPU nodes.
#
## default: 1024
nr_hugepages: 4096
#
#
## Force specific debug mask for daos_server (control plane).
## By default, just use the default debug mask used by daos_server.
## Mask specifies minimum level of message significance to pass to logger.
## Currently supported values are DEBUG and ERROR.
#
## default: DEBUG
#control_log_mask: ERROR
#
#
## Force specific path for daos_server (control plane) logs.
#
## default: print to stderr
control_log_file: /var/log/daos/daos_control.log
#
#
## Enable daos_admin (privileged helper) logging.
#
## default: disabled (errors only to control plane log)
helper_log_file: /var/log/daos/daos_admin.log
#
#
# When per-server definitions exist, auto-allocation of resources is not
# performed. Without per-server definitions, node resources will
# automatically be assigned to servers based on NUMA ratings, there will
# be a one-to-one relationship between servers and sockets.

 

servers:
-
  # Rank to be assigned as identifier for server.
  # Immutable after reformat.
  # Optional parameter, will be auto generated if not supplied.

  rank: 0

  # Targets (VOS) represent the count of storage targets per data plane
  # server starting at core offset specified by first_core.

  # Immutable after reformat.

  targets: 24

  # Count of offload/helper xstreams per target. (allowed values: 0-2)
  # Immutable after reformat.

  # default: 2
  nr_xs_helpers: 0

  # Offset of the first core for service xstreams.
  # Immutable after reformat.

  # default: 0
  first_core: 0

  # Use specific OFI interfaces.
  # Specify the fabric network interface that will be used by this server.
  # Optionally specify the fabric network interface port that will be used
  # by this server but please only if you have a specific need, this will
  # normally be chosen automatically.

  fabric_iface: enp94s0
  fabric_iface_port: 20000
  pinned_numa_node: 0

#  # Force specific debug mask (D_LOG_MASK) at start up time.
#  # By default, just use the default debug mask used by DAOS.
#  # Mask specifies minimum level of message significance to pass to logger.
#
#  # default: ERR
#  log_mask: WARN
#
#  # Force specific path for DAOS debug logs (D_LOG_FILE).
#
#  # default: /tmp/daos.log
  log_file: /var/log/daos/daos_server1.log
#
#  # Pass specific environment variables to the DAOS server.
#  # Empty by default. Values should be supplied without encapsulating quotes.
#
#  env_vars:
#      - CRT_TIMEOUT=30
#
#  # Define a pre-configured mountpoint for storage class memory to be used
#  # by this server.
#  # Path should be unique to server instance (can use different subdirs).
#  # Either the specified directory or its parent must be a mount point.
#
  scm_mount: /mnt/daos/1
#
#  # Backend SCM device type. Either use DCPM (datacentre persistent memory)
#  # modules configured in interleaved mode (AppDirect regions) or emulated
#  # tmpfs in RAM.
#  # Options are:
#  # - "dcpm" for real SCM (preferred option), scm_size ignored
#  # - "ram" to emulate SCM with memory, scm_list ignored
#  # Immutable after reformat.
#
#  # default: dcpm
  scm_class: dcpm

  # When scm_class is set to dcpm, scm_list is the list of device paths for
  # AppDirect pmem namespaces (currently only one per server supported).
  scm_list: [/dev/pmem0]

#
#  # When scm_class is set to ram, tmpfs will be used to emulate SCM.
#  # The size of ram is specified by scm_size in GB units.
#  scm_size: 16
#
#  # Backend block device type. Force a SPDK driver to be used by this server
#  # instance.
#  # Options are:
#  # - "nvme" for NVMe SSDs (preferred option), bdev_{size,number} ignored
#  # - "malloc" to emulate a NVMe SSD with memory, bdev_list ignored
#  # - "file" to emulate a NVMe SSD with a regular file, bdev_number ignored
#  # - "kdev" to use a kernel block device, bdev_{size,number} ignored
#  # Immutable after reformat.
#
#  # default: nvme
  bdev_class: nvme
#
#  # Backend block device configuration to be used by this server instance.
#  # When bdev_class is set to nvme, bdev_list is the list of unique NVMe IDs
#  # that should be different across different server instance.
#  # Immutable after reformat.
  bdev_list: ["0000:1c:00.0","0000:20:00.0","0000:3f:00.0","0000:43:00.0"]  # generate regular nvme.conf
-
#  # Rank to be assigned as identifier for server.
#  # Immutable after reformat.
#  # Optional parameter, will be auto generated if not supplied.
#
  rank: 1

  # Targets (VOS) represent the number of logical CPUs to be used starting at
  # index specified by first_core.

  # Targets will be used to run XStreams, which can be thought of as service threads.
  # Immutable after reformat.

  targets: 24

  # Number of helper XStreams per VOS target. (allowed values: 0-2)
  # Immutable after reformat.
#
#  # default: 2
#  nr_xs_helpers: 1
#
#  # Index of first core for service thread.
#  # Immutable after reformat.
#
  # default: 0
  first_core: 24

  # Use specific OFI interfaces.
  # Specify the fabric network interface that will be used by this server.
  # Optionally specify the fabric network interface port that will be used
  # by this server but please only if you have a specific need, this will
  # normally be chosen automatically.

  fabric_iface: enp216s0
  fabric_iface_port: 20000
  pinned_numa_node: 1

#  # Force specific debug mask (D_LOG_MASK) at start up time.
#  # By default, just use the default debug mask used by DAOS.
#  # Mask specifies minimum level of message significance to pass to logger.
#
#  # default: ERR
#  log_mask: WARN
#
#  # Force specific path for DAOS debug logs.
#
#  # default: /tmp/daos.log
  log_file: /var/log/daos/daos_server2.log
#
#  # Pass specific environment variables to the DAOS server
#  # Empty by default. Values should be supplied without encapsulating quotes.
#
#  env_vars:
#      - CRT_TIMEOUT=100
#
#  # Define a pre-configured mountpoint for storage class memory to be used
#  # by this server.
#  # Path should be unique to server instance (can use different subdirs).
#
  scm_mount: /mnt/daos/2
#
#  # Backend SCM device type. Either use DCPM (datacentre persistent memory)
#  # modules configured in interleaved mode (AppDirect regions) or emulated
#  # tmpfs in RAM.
#  # Options are:
#  # - "dcpm" for real SCM (preferred option), scm_size is ignored
#  # - "ram" to emulate SCM with memory, scm_list is ignored
#  # Immutable after reformat.
#
  # default: dcpm
  scm_class: dcpm

  # When scm_class is set to dcpm, scm_list is the list of device paths for
  # AppDirect pmem namespaces (currently only one per server supported).
  scm_list: [/dev/pmem1]
#
#  # Backend block device type. Force a SPDK driver to be used by this server
#  # instance.
#  # Options are:
#  # - "nvme" for NVMe SSDs (preferred option), bdev_{size,number} ignored
#  # - "malloc" to emulate a NVMe SSD with memory, bdev_list ignored
#  # - "file" to emulate a NVMe SSD with a regular file, bdev_number ignored
#  # - "kdev" to use a kernel block device, bdev_{size,number} ignored
#  # Immutable after reformat.
#
#  # When bdev_class is set to malloc, bdev_number is the number of devices
#  # to allocate and bdev_size is the size in GB of each LUN/device.
#  bdev_class: malloc
#  bdev_number: 1
#  bdev_size: 4
#
#  # When bdev_class is set to file, bdev_list is the list of file paths that
#  # will be used to emulate NVMe SSDs. The size of each file is specified by
#  # bdev_size in GB unit.
#  bdev_class: file
#  bdev_list: [/tmp/daos-bdev1,/tmp/daos-bdev2]
#  bdev_size: 16
#
#  # When bdev_class is set to kdev, bdev_list is the list of unique kernel
#  # block devices that should be different across different server instance.
#  bdev_class: kdev
#  bdev_list: [/dev/sdc,/dev/sdd]
  bdev_class: nvme
#
#  # Backend block device configuration to be used by this server instance.
#  # When bdev_class is set to nvme, bdev_list is the list of unique NVMe IDs
#  # that should be different across different server instance.
#  # Immutable after reformat.
  bdev_list: ["0000:89:00.0","0000:8d:00.0","0000:b2:00.0","0000:b6:00.0"]  # generate regular nvme.conf

 

 

 


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Jacque, Kristin <kristin.jacque@...>
Sent: Wednesday, February 24, 2021 8:31 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Neale,

 

I suspect this may be a case of incompatible transport configurations. All components must be configured to either enable or disable certificates. If you prefer to run without certs, as with the dmg “-i” option, your server and agent must also be configured with “allow_insecure: true” in the yml file.

 

In your server config file I am seeing certs enabled:

 

transport_config:
#  # In order to disable transport security, uncomment and set allow_insecure
#  # to true. Not recommended for production configurations.
  allow_insecure: false

  # Location where daos_server will look for Client certificates
  client_cert_dir: /etc/daos/daosCA/clients
  # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/daosCA/certs/daosCA.crt
  # Server certificate for use in TLS handshakes
  cert: /etc/daos/daosCA/certs/server.crt
  # Key portion of Server Certificate
  key: /etc/daos/daosCA/certs/server.key

 

If that doesn’t resolve the connection failure, Tom’s suggestions will help you get to a good starting point to debug further.
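
A quick way to compare the setting across components is sketched below. It is only an illustration: the function names are hypothetical, the parsing is a plain line scan rather than a full YAML parser, and the sample strings stand in for whatever server, agent and dmg config files are actually in use.

```python
import re

# Matches an uncommented "allow_insecure: <value>" line at any indentation.
_ALLOW_INSECURE = re.compile(r"^\s*allow_insecure\s*:\s*(\S+)")


def read_allow_insecure(config_text):
    """Return True/False for the first uncommented allow_insecure key, else None."""
    for line in config_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out examples
        match = _ALLOW_INSECURE.match(line)
        if match:
            return match.group(1).strip("\"'").lower() == "true"
    return None


def transport_settings(configs):
    """Map each component name to its allow_insecure value."""
    return {name: read_allow_insecure(text) for name, text in configs.items()}


def is_consistent(settings):
    """True when every component reports the same allow_insecure value."""
    return len(set(settings.values())) <= 1


# Sample inputs (assumed, not taken from a real installation).
sample = {
    "server": "transport_config:\n  allow_insecure: false\n",
    "agent": "transport_config:\n  allow_insecure: true\n",
}
settings = transport_settings(sample)
print(settings, "consistent" if is_consistent(settings) else "MISMATCH")
```

In practice the dictionary values would be the contents of the real config files, read from wherever they are installed on each node.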

 

Please let us know how it goes.

 

Thanks,

Kris

 

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Petrillo, Neale A. (Contractor) via groups.io
Sent: Wednesday, February 24, 2021 2:00 PM
To: daos@daos.groups.io
Subject: [daos] Startup Errors

 

Hello Group! 

 

I'm having some trouble getting my new DAOS cluster working. I've installed 6 servers all with the 1.0.1 RPMs. When I do a 'dmg storage format' from my test host, I get the following output:

 

[root@head ~]# dmg -i -l <host01>:10001 storage format
ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)
ERROR: dmg: no active connections
[root@head ~]# dmg -i -l <host01> system query
ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)
ERROR: dmg: no active connections

 

I'm also seeing these errors in the log files:

 

INFO 2021/02/18 10:40:15 DAOS I/O Server instance 0 storage not ready: context canceled
INFO 2021/02/18 10:40:19 SCM format required on instance 1
INFO 2021/02/18 10:40:19 DAOS I/O Server instance 1 storage not ready: context canceled
INFO 2021/02/18 10:40:19 DAOS Control Server (pid 9993) shutting down
ERROR 2021/02/18 10:40:54 /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
INFO 2021/02/18 10:41:00 DAOS Control Server (pid 11507) listening on 0.0.0.0:10001
INFO 2021/02/18 10:41:00 Waiting for DAOS I/O Server instance storage to be ready...
INFO 2021/02/18 10:41:04 SCM format required on instance 0
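
The EAL line above refers to the 1048576 kB (1 GiB) hugepage pool. As a rough aid for checking what the kernel has actually reserved, the sketch below pulls the hugepage counters out of /proc/meminfo-style text. The function name is hypothetical. Note that /proc/meminfo only reports the default hugepage size; per-size counters live under /sys/kernel/mm/hugepages/.

```python
def parse_hugepages(meminfo_text):
    """Extract hugepage counters from /proc/meminfo-style text.

    Returns a dict with whichever of HugePages_Total, HugePages_Free and
    Hugepagesize (in kB) are present in the input.
    """
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # The first field after the colon is the number; a unit may follow.
            values[key] = int(rest.split()[0])
    return values


# Sample text (made up for illustration, not from the servers in this thread).
sample = (
    "MemTotal:       196608000 kB\n"
    "HugePages_Total:    4096\n"
    "HugePages_Free:     4096\n"
    "Hugepagesize:       2048 kB\n"
)
print(parse_hugepages(sample))
```

On a server node, pass it the contents of /proc/meminfo. Whether the EAL message matters for this setup is not established here; the counters only show what is reserved and free.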

 

Configuration files are attached. Any help would be appreciated! 

Neale

 

---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

Nabarro, Tom
 

Given that dmg is not connecting, I think getting the most basic configuration working is probably the best way forward. Try an empty config file (discovery mode) on a single host, with no certificates installed on that host, and run daos_server without systemd to reduce this to a minimal viable configuration:

 

[tanabarr@wolf-71 daos_m]$ sudo mkdir /var/run/daos_server
[tanabarr@wolf-71 daos_m]$ sudo chmod 777 /var/run/daos_server
[tanabarr@wolf-71 daos_m]$ install/bin/daos_server start -i
DAOS Server config loaded from /home/tanabarr/projects/daos_m/install/etc/daos_server.yml
No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode
no control log file specified; logging to stdout
No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode
DAOS Control Server v1.3.0 (pid 218642) listening on 0.0.0.0:10001
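
Since the socket directory has to exist and be usable by the account running daos_server, a small check like the one below can confirm that before starting the server. This is a sketch with a hypothetical helper name; the path used in the example is the socket_dir default shown in the config comments.

```python
import os
import stat


def describe_socket_dir(path):
    """Report whether path is a directory, its permission bits, and whether
    the current user can create entries in it."""
    if not os.path.isdir(path):
        return {"exists": False, "mode": None, "writable": False}
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return {
        "exists": True,
        "mode": oct(mode),
        # Creating a socket file needs write and search (execute) permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
    }


# Assumed path: the documented default for socket_dir.
print(describe_socket_dir("/var/run/daos_server"))
```

Run it as the same user that starts daos_server, since the writable flag reflects the calling user's permissions.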

 

Then on a separate terminal on the same host:

 

[tanabarr@wolf-71 daos-control-demo]$ dmg -i -l wolf-71 storage scan
Hosts   SCM Total             NVMe Total
-----   ---------             ----------
wolf-71 6.4 TB (2 namespaces) 3.1 TB (3 controllers)

 

See if you get the transient failure with the above.

 

Just to be clear, insecure mode is only suitable for development and testing.

v1.3.0 does not represent a release; it is just the version printed when running from master. The above should work with any 1.x version.
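
TRANSIENT_FAILURE is a gRPC channel connectivity state reported after a failure such as a TCP connect error or timeout, so it can be worth checking from the failing node whether the control port accepts TCP connections at all. A minimal sketch follows; the helper name is hypothetical, and 10001 is the port from the config in this thread.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

For example, port_is_reachable("host01", 10001) run on the test node would show whether a firewall or routing problem is in the way. A successful TCP connect does not prove that the certificate or allow_insecure settings match.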

 

Regards,

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Neale Petrillo (Contractor) via groups.io
Sent: Thursday, February 25, 2021 5:40 PM
To: daos@daos.groups.io
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Kris and Tom,

 

I'm using Systemd for all the service control and a parallel shell for running the systemctl commands.

 

