Re: issues with NVMe drives from RPM installation

Farrell, Patrick Arthur
 

Richard,

There's nothing obviously wrong with your config, at least to me, and the output contains no useful errors.  Check the logs in /tmp/daos*.log (there will be multiple files); they should contain more information.  You could also turn on debug logging before you start the server to possibly get more info, as described in the troubleshooting section of the manual: https://daos-stack.github.io/admin/troubleshooting/
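
As a rough sketch of the log check above (the helper name show_daos_errors and the 'error|fail' pattern are illustrative assumptions, not part of DAOS; the real messages may be worded differently), something like this lists each matching log file and its most recent error-like lines:

```shell
# Hypothetical helper: print the last error-like lines from each DAOS log.
# The default directory matches the /tmp/daos*.log location mentioned above.
show_daos_errors() {
    # $1: directory holding the logs (defaults to /tmp)
    dir="${1:-/tmp}"
    for f in "$dir"/daos*.log; do
        [ -e "$f" ] || continue      # glob matched nothing, skip
        echo "== $f"
        # Case-insensitive search with line numbers; keep only the last 20 hits.
        grep -inE 'error|fail' "$f" | tail -n 20
    done
}

show_daos_errors /tmp
```

If nothing useful shows up, reading the tail of each file directly is probably the next step.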

Also, if you have not done so already, check that your drives are visible to DAOS and can be prepared as expected using the daos_server storage commands, scan and prepare, detailed here:
https://daos-stack.github.io/admin/deployment/

That page details how to run them for SCM; look at the command help for how to run them for NVMe devices.  (You'll want to select NVMe only, or it may ask you to reboot to set up your SCM goals, which you've obviously already done.)
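
A minimal sketch of that sequence, with some assumptions: the function names and the DAOS_SERVER variable are made up for illustration, and the --nvme-only switch is what I believe the option is called in this release, so please confirm it against the help output before relying on it.  The commands generally need to be run as root.

```shell
# DAOS_SERVER may be overridden; by default use the daos_server binary on PATH.
DAOS_SERVER="${DAOS_SERVER:-daos_server}"

# Print the built-in help to confirm which options this release accepts.
nvme_help() {
    "$DAOS_SERVER" storage scan --help
    "$DAOS_SERVER" storage prepare --help
}

# Prepare NVMe devices only, then scan to see what DAOS can detect.
# ASSUMPTION: --nvme-only is the NVMe-only switch; verify it with nvme_help.
nvme_prepare_scan() {
    "$DAOS_SERVER" storage prepare --nvme-only &&
    "$DAOS_SERVER" storage scan
}
```

Run nvme_help first, then nvme_prepare_scan, and check whether 0000:5e:00.0 appears in the scan output.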

Regards,
-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of richard.dahringer@... <richard.dahringer@...>
Sent: Thursday, July 30, 2020 9:27 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [daos] issues with NVMe drives from RPM installation
 

Hi all -
I'm trying to set up a proof-of-concept DAOS cluster, and it is proving to be tricky. The systems have four 128 GB SCM DIMMs and four U.2 NVMe drives installed. I have installed all the RPMs from registrationcenter.intel.com and have been able to set up the SCM devices; the 'dmg -i' commands all seem to work.  When I add NVMe drives to the configuration, though, daos_server does not start. It does start when the NVMe drives are not configured.

My daos_server.conf file:

name: daos_server
access_points: ['elfs13o01']
# port: 10001
provider: ofi+psm2
nr_hugepages: 4096
control_log_file: /tmp/daos_control.log
transport_config:
   allow_insecure: true

servers:
-
  targets: 1
  first_core: 0
  nr_xs_helpers: 0
  fabric_iface: hib0
  fabric_iface_port: 31416
  log_file: /tmp/daos_server.log

  env_vars:
  - DAOS_MD_CAP=1024
  - CRT_CTX_SHARE_ADDR=0
  - CRT_TIMEOUT=30
  - FI_SOCKETS_MAX_CONN_RETRY=1
  - FI_SOCKETS_CONN_TIMEOUT=2000

  # Storage definitions

  # When scm_class is set to ram, tmpfs will be used to emulate SCM.
  # The size of ram is specified by scm_size in GB units.

  scm_mount: /mnt/daos0  # map to -s /mnt/daos
  scm_class: dcpm
  scm_list: [/dev/pmem0]

  bdev_class: nvme
  bdev_list: ["0000:5e:00.0"]

The startup error:

[root@elfs13o01 ~]# daos_server -o daos_local.yml start
daos_server logging to file /tmp/daos_control.log
ERROR: /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
DAOS Control Server (pid 73257) listening on 0.0.0.0:10001
Waiting for DAOS I/O Server instance storage to be ready...
SCM @ /mnt/daos0: 262 GB Total/247 GB Avail
Starting I/O server instance 0: /usr/bin/daos_io_server
daos_io_server:0 Using legacy core allocation algorithm
daos_io_server:0 Starting SPDK v19.04.1 / DPDK 19.02.0 initialization...
[ DPDK EAL parameters: daos -c 0x1 --pci-whitelist=0000:5e:00.0 --log-level=lib.eal:6 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk73258 --proc-type=auto ]
ERROR: daos_io_server:0 EAL: No free hugepages reported in hugepages-1048576kB
ERROR: /var/run/daos_server/daos_server.sock: failed to accept connection: accept unixpacket /var/run/daos_server/daos_server.sock: use of closed network connection
ERROR: DAOS I/O Server exited with error: /usr/bin/daos_io_server (instance 0) exited: exit status 1

Can someone provide some pointers to what is going on? 
