Hello Richard,
"ERROR: DAOS I/O Server exited with error: /usr/bin/daos_io_server (instance 0) exited: exit status 1"
indicates that there may be useful information in the io_server log for the first instance. The default location, as set by log_file in the server config file, is /tmp/server0.log. If there is nothing useful in there, try increasing the log_mask to DEBUG.
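As a rough sketch of that first check, something like the following could pull the most recent error-looking lines out of the log. show_log_errors is a hypothetical helper, not part of DAOS, and the match pattern is only a guess at what a failure line looks like. The default path is the one mentioned above; the config quoted below sets log_file to /tmp/daos_server.log, so that path would be passed instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DAOS): print the most recent lines in an
# I/O server log that look like errors.
# Usage: show_log_errors [logfile] [max_lines]
show_log_errors() {
    local logfile="${1:-/tmp/server0.log}"   # default path mentioned in this reply
    local max_lines="${2:-50}"

    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi

    # Case-insensitive match on common failure words (an assumed pattern);
    # keep only the last max_lines matches.
    grep -iE 'err|fail|fatal' "$logfile" | tail -n "$max_lines"
}
```

For the config in this thread that would be run as: show_log_errors /tmp/daos_server.log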
Regards,
Tom Nabarro – HPC
M: +44 (0)7786 260986
Skype: tom.nabarro
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of richard.dahringer@...
Sent: Thursday, July 30, 2020 3:27 PM
To: daos@daos.groups.io
Subject: [daos] issues with NVMe drives from RPM installation
Hi all -
I'm trying to set up a proof-of-concept DAOS cluster, and it is proving to be tricky. The systems have four 128G SCM DIMMs and four U.2 NVMe drives installed. I have installed all the RPMs from registrationcenter.intel.com and have been able to set up the SCM devices; the 'dmg -i' commands all seem to work. When I add NVMe drives to the configuration, though, daos_server does not start. It does start when the NVMe drives are left out of the configuration.
My daos_server.conf file:
name: daos_server
access_points: ['elfs13o01']
# port: 10001
provider: ofi+psm2
nr_hugepages: 4096
control_log_file: /tmp/daos_control.log
transport_config:
  allow_insecure: true
servers:
-
  targets: 1
  first_core: 0
  nr_xs_helpers: 0
  fabric_iface: hib0
  fabric_iface_port: 31416
  log_file: /tmp/daos_server.log
  env_vars:
  - DAOS_MD_CAP=1024
  - CRT_CTX_SHARE_ADDR=0
  - CRT_TIMEOUT=30
  - FI_SOCKETS_MAX_CONN_RETRY=1
  - FI_SOCKETS_CONN_TIMEOUT=2000
  # Storage definitions
  # When scm_class is set to ram, tmpfs will be used to emulate SCM.
  # The size of ram is specified by scm_size in GB units.
  scm_mount: /mnt/daos0 # map to -s /mnt/daos
  scm_class: dcpm
  scm_list: [/dev/pmem0]
  bdev_class: nvme
  bdev_list: ["0000:5e:00.0"]
The startup error:
[root@elfs13o01 ~]# daos_server -o daos_local.yml start
daos_server logging to file /tmp/daos_control.log
ERROR: /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
DAOS Control Server (pid 73257) listening on 0.0.0.0:10001
Waiting for DAOS I/O Server instance storage to be ready...
SCM @ /mnt/daos0: 262 GB Total/247 GB Avail
Starting I/O server instance 0: /usr/bin/daos_io_server
daos_io_server:0 Using legacy core allocation algorithm
daos_io_server:0 Starting SPDK v19.04.1 / DPDK 19.02.0 initialization...
[ DPDK EAL parameters: daos -c 0x1 --pci-whitelist=0000:5e:00.0 --log-level=lib.eal:6 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk73258 --proc-type=auto ]
ERROR: daos_io_server:0 EAL: No free hugepages reported in hugepages-1048576kB
ERROR: /var/run/daos_server/daos_server.sock: failed to accept connection: accept unixpacket /var/run/daos_server/daos_server.sock: use of closed network connection
ERROR: DAOS I/O Server exited with error: /usr/bin/daos_io_server (instance 0) exited: exit status 1
Can someone provide some pointers to what is going on?