[EXTERNAL SENDER] Re: [daos] Startup Errors


Neale Petrillo (Contractor)
 

Hi Kris and Tom,

I'm using Systemd for all the service control and a parallel shell for running the systemctl commands.

Disabling certs didn't help. I did find a permission problem with the socket directory, though, and fixing that allows me to run dmg on the access point successfully. I still get the TRANSIENT_FAILURE on my test node. Now when I run 'dmg storage format' I get:

Cannot format storage with running I/O server instance

I tried running 'dmg system stop' on the access point but got the TRANSIENT_FAILURE error. I'm also still getting the no-hugepages error on all the servers.
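
The no-hugepages message names the pool hugepages-1048576kB, which is the 1 GiB page size. On Linux the per-size counters are exposed under /sys/kernel/mm/hugepages. The following is a minimal Python sketch, assuming a Linux host, that lists each pool's total and free page counts. The function name `hugepage_pools` is hypothetical and not part of DAOS.

```python
import os

def hugepage_pools(base="/sys/kernel/mm/hugepages"):
    """Map each hugepage pool directory name to (total, free) page counts."""
    pools = {}
    if not os.path.isdir(base):
        return pools
    for name in sorted(os.listdir(base)):
        if not name.startswith("hugepages-"):
            continue
        counts = []
        # nr_hugepages is the pool size, free_hugepages the unused pages.
        for counter in ("nr_hugepages", "free_hugepages"):
            with open(os.path.join(base, name, counter)) as f:
                counts.append(int(f.read().strip()))
        pools[name] = tuple(counts)
    return pools

if __name__ == "__main__":
    for name, (total, free) in hugepage_pools().items():
        print(f"{name}: total={total} free={free}")
```

If the 2048kB pool shows free pages while the 1048576kB pool is empty, the message may only indicate that no 1 GiB pages are configured on the node; that reading is an assumption and should be checked against the rest of the log.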

Are the RPMs available for the newer versions? I've also pasted the config file below: 

## DAOS server configuration file.
#
## Location of this configuration file is determined by first checking for the
## path specified through the -f option of the daos_server command line.
## Otherwise, /etc/daos_server.conf is used.
#
#
## Name associated with the DAOS system.
## Immutable after reformat.
#
name: daos
#
#
## Access points
#
## To operate, DAOS will need a quorum of access point nodes to be available.
## Must have the same value for all agents and servers in a system.
## Immutable after reformat.
## Hosts can be specified with or without port, default port below
## assumed if not specified.
#
## default: hostname of this node
access_points:
  - <host01>
#
## Default port
#
## Port number to bind daos_server to, this will also
## be used when connecting to access points unless a port is specified.
#
## default: 10001
port: 10001
#
## Transport Credentials Specifying certificates to secure communications
#
transport_config:
#  # In order to disable transport security, uncomment and set allow_insecure
#  # to true. Not recommended for production configurations.
  allow_insecure: false

  # Location where daos_server will look for Client certificates
  client_cert_dir: /etc/daos/daosCA/clients
  # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/daosCA/certs/daosCA.crt
  # Server certificate for use in TLS handshakes
  cert: /etc/daos/daosCA/certs/server.crt
  # Key portion of Server Certificate
  key: /etc/daos/daosCA/certs/server.key

#
## Fault domain path
#
## Immutable after reformat.
#
## default: /hostname for a local configuration w/o fault domain
#fault_path: /vcdu0/rack1/hostname
#
#
## Fault domain callback
#
## Path to executable which will return fault domain string.
## Immutable after reformat.
#
#fault_cb: ./.daos/fd_callback
#
#
## Use specific OFI interfaces
#
## Specify either a single fabric interface that will be used by all
## spawned servers or a comma-separated list of fabric interfaces to be
## assigned individually.
## By default, the DAOS server will auto-detect and use all fabric
## interfaces if any and fall back to socket on the first eth card
## otherwise.
fabric_ifaces:
  - enp94s0
  - enp216s0
#
#
## Use specific OFI provider
#
## Force a specific provider to be used by all the servers.
## The default provider depends on the interfaces that will be auto-detected:
## ofi+psm2 for Omni-Path, ofi+verbs;ofi_rxm for Infiniband/RoCE and finally
## ofi+socket for non-RDMA-capable Ethernet.
#
provider: ofi+verbs;ofi_rxm
#
#
## Storage mount directory
#
## TODO: If no pre-configured mountpoints are specified, DAOS will auto-detect
## NVDIMMs, configure them in interleave mode, format with ext4 and
## mount with the DAX extension creating a subdirectory within scm_mount_path.
#
## This option allows to specify a preferred path where the mountpoints will
## be created. Either the specified directory or its parent must be a mount
## point.
#
## default: /mnt/daos
scm_mount_path: /mnt/daos
#
#
## NVMe SSD whitelist
#
## Only use NVMe controllers with specific PCI addresses.
## Immutable after reformat, colons replaced by dots in PCI identifiers.
## By default, DAOS will use all the NVMe-capable SSDs that don't have active
## mount points.
#
#bdev_include: ["0000:81:00.1","0000:81:00.2","0000:81:00.3"]
#
#
## NVMe SSD blacklist
#
## Only use NVMe controllers with specific PCI addresses. Overrides drives
## listed in nvme_include and forces auto-detection to skip those drives.
## Immutable after reformat, colons replaced by dots in PCI identifiers.
#
#bdev_exclude: ["0000:81:00.1"]
#
#
## Use Hyperthreads
#
## When Hyperthreading is enabled and supported on the system, this parameter
## defines whether the DAOS service thread should only be bound to different
## physical cores (value 0) or hyperthreads (value 1).
#
## default: false
hyperthreads: False
#
#
## Use the given directory for creating unix domain sockets
#
## DAOS Agent and DAOS Server both use unix domain sockets for communication
## with other system components. This setting is the base location to place
## the sockets in.
#
## default: /var/run/daos_server
socket_dir: /var/run/daos_server
#
#
## Number of hugepages to allocate for use by NVMe SSDs
#
## Specifies the number (not size) of hugepages to allocate for use by NVMe
## through SPDK. This indicates the total number to be used by any spawned
## servers. Default system hugepage size will be used and hugepages will be
## evenly distributed between CPU nodes.
#
## default: 1024
nr_hugepages: 4096
#
#
## Force specific debug mask for daos_server (control plane).
## By default, just use the default debug mask used by daos_server.
## Mask specifies minimum level of message significance to pass to logger.
## Currently supported values are DEBUG and ERROR.
#
## default: DEBUG
#control_log_mask: ERROR
#
#
## Force specific path for daos_server (control plane) logs.
#
## default: print to stderr
control_log_file: /var/log/daos/daos_control.log
#
#
## Enable daos_admin (privileged helper) logging.
#
## default: disabled (errors only to control plane log)
helper_log_file: /var/log/daos/daos_admin.log
#
#
# When per-server definitions exist, auto-allocation of resources is not
# performed. Without per-server definitions, node resources will
# automatically be assigned to servers based on NUMA ratings, there will
# be a one-to-one relationship between servers and sockets.

servers:
-
  # Rank to be assigned as identifier for server.
  # Immutable after reformat.
  # Optional parameter, will be auto generated if not supplied.

  rank: 0

  # Targets (VOS) represent the count of storage targets per data plane
  # server starting at core offset specified by first_core.

  # Immutable after reformat.

  targets: 24

  # Count of offload/helper xstreams per target. (allowed values: 0-2)
  # Immutable after reformat.

  # default: 2
  nr_xs_helpers: 0

  # Offset of the first core for service xstreams.
  # Immutable after reformat.

  # default: 0
  first_core: 0

  # Use specific OFI interfaces.
  # Specify the fabric network interface that will be used by this server.
  # Optionally specify the fabric network interface port that will be used
  # by this server but please only if you have a specific need, this will
  # normally be chosen automatically.

  fabric_iface: enp94s0
  fabric_iface_port: 20000
  pinned_numa_node: 0

#  # Force specific debug mask (D_LOG_MASK) at start up time.
#  # By default, just use the default debug mask used by DAOS.
#  # Mask specifies minimum level of message significance to pass to logger.
#
#  # default: ERR
#  log_mask: WARN
#
#  # Force specific path for DAOS debug logs (D_LOG_FILE).
#
#  # default: /tmp/daos.log
  log_file: /var/log/daos/daos_server1.log
#
#  # Pass specific environment variables to the DAOS server.
#  # Empty by default. Values should be supplied without encapsulating quotes.
#
#  env_vars:
#      - CRT_TIMEOUT=30
#
#  # Define a pre-configured mountpoint for storage class memory to be used
#  # by this server.
#  # Path should be unique to server instance (can use different subdirs).
#  # Either the specified directory or its parent must be a mount point.
#
  scm_mount: /mnt/daos/1
#
#  # Backend SCM device type. Either use DCPM (datacentre persistent memory)
#  # modules configured in interleaved mode (AppDirect regions) or emulated
#  # tmpfs in RAM.
#  # Options are:
#  # - "dcpm" for real SCM (preferred option), scm_size ignored
#  # - "ram" to emulate SCM with memory, scm_list ignored
#  # Immutable after reformat.
#
#  # default: dcpm
  scm_class: dcpm

  # When scm_class is set to dcpm, scm_list is the list of device paths for
  # AppDirect pmem namespaces (currently only one per server supported).
  scm_list: [/dev/pmem0]

#
#  # When scm_class is set to ram, tmpfs will be used to emulate SCM.
#  # The size of ram is specified by scm_size in GB units.
#  scm_size: 16
#
#  # Backend block device type. Force a SPDK driver to be used by this server
#  # instance.
#  # Options are:
#  # - "nvme" for NVMe SSDs (preferred option), bdev_{size,number} ignored
#  # - "malloc" to emulate a NVMe SSD with memory, bdev_list ignored
#  # - "file" to emulate a NVMe SSD with a regular file, bdev_number ignored
#  # - "kdev" to use a kernel block device, bdev_{size,number} ignored
#  # Immutable after reformat.
#
#  # default: nvme
  bdev_class: nvme
#
#  # Backend block device configuration to be used by this server instance.
#  # When bdev_class is set to nvme, bdev_list is the list of unique NVMe IDs
#  # that should be different across different server instance.
#  # Immutable after reformat.
  bdev_list: ["0000:1c:00.0","0000:20:00.0","0000:3f:00.0","0000:43:00.0"]  # generate regular nvme.conf
-
#  # Rank to be assigned as identifier for server.
#  # Immutable after reformat.
#  # Optional parameter, will be auto generated if not supplied.
#
  rank: 1

  # Targets (VOS) represent the number of logical CPUs to be used starting at
  # index specified by first_core.

  # Targets will be used to run XStreams can be thought of as service threads.
  # Immutable after reformat.

  targets: 24

  # Number of helper XStreams per VOS target. (allowed values: 0-2)
  # Immutable after reformat.
#
#  # default: 2
#  nr_xs_helpers: 1
#
#  # Index of first core for service thread.
#  # Immutable after reformat.
#
  # default: 0
  first_core: 24

  # Use specific OFI interfaces.
  # Specify the fabric network interface that will be used by this server.
  # Optionally specify the fabric network interface port that will be used
  # by this server but please only if you have a specific need, this will
  # normally be chosen automatically.

  fabric_iface: enp216s0
  fabric_iface_port: 20000
  pinned_numa_node: 1

#  # Force specific debug mask (D_LOG_MASK) at start up time.
#  # By default, just use the default debug mask used by DAOS.
#  # Mask specifies minimum level of message significance to pass to logger.
#
#  # default: ERR
#  log_mask: WARN
#
#  # Force specific path for DAOS debug logs.
#
#  # default: /tmp/daos.log
  log_file: /var/log/daos/daos_server2.log
#
#  # Pass specific environment variables to the DAOS server
#  # Empty by default. Values should be supplied without encapsulating quotes.
#
#  env_vars:
#      - CRT_TIMEOUT=100
#
#  # Define a pre-configured mountpoint for storage class memory to be used
#  # by this server.
#  # Path should be unique to server instance (can use different subdirs).
#
  scm_mount: /mnt/daos/2
#
#  # Backend SCM device type. Either use DCPM (datacentre persistent memory)
#  # modules configured in interleaved mode (AppDirect regions) or emulated
#  # tmpfs in RAM.
#  # Options are:
#  # - "dcpm" for real SCM (preferred option), scm_size is ignored
#  # - "ram" to emulate SCM with memory, scm_list is ignored
#  # Immutable after reformat.
#
  # default: dcpm
  scm_class: dcpm

  # When scm_class is set to dcpm, scm_list is the list of device paths for
  # AppDirect pmem namespaces (currently only one per server supported).
  scm_list: [/dev/pmem1]
#
#  # Backend block device type. Force a SPDK driver to be used by this server
#  # instance.
#  # Options are:
#  # - "nvme" for NVMe SSDs (preferred option), bdev_{size,number} ignored
#  # - "malloc" to emulate a NVMe SSD with memory, bdev_list ignored
#  # - "file" to emulate a NVMe SSD with a regular file, bdev_number ignored
#  # - "kdev" to use a kernel block device, bdev_{size,number} ignored
#  # Immutable after reformat.
#
#  # When bdev_class is set to malloc, bdev_number is the number of devices
#  # to allocate and bdev_size is the size in GB of each LUN/device.
#  bdev_class: malloc
#  bdev_number: 1
#  bdev_size: 4
#
#  # When bdev_class is set to file, bdev_list is the list of file paths that
#  # will be used to emulate NVMe SSDs. The size of each file is specified by
#  # bdev_size in GB unit.
#  bdev_class: file
#  bdev_list: [/tmp/daos-bdev1,/tmp/daos-bdev2]
#  bdev_size: 16
#
#  # When bdev_class is set to kdev, bdev_list is the list of unique kernel
#  # block devices that should be different across different server instance.
#  bdev_class: kdev
#  bdev_list: [/dev/sdc,/dev/sdd]
  bdev_class: nvme
#
#  # Backend block device configuration to be used by this server instance.
#  # When bdev_class is set to nvme, bdev_list is the list of unique NVMe IDs
#  # that should be different across different server instance.
#  # Immutable after reformat.
  bdev_list: ["0000:89:00.0","0000:8d:00.0","0000:b2:00.0","0000:b6:00.0"]  # generate regular nvme.conf
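
The config comments state that bdev_list entries should be different across server instances. A minimal Python sketch of that check follows, using the two lists from the config above. The helper name `shared_bdevs` is hypothetical, and the lists are typed in by hand rather than parsed from the YAML file.

```python
def shared_bdevs(bdev_lists):
    """Return PCI addresses that appear in more than one server's bdev_list."""
    seen = {}
    shared = set()
    for server, addresses in bdev_lists.items():
        for addr in addresses:
            # Remember the first server that claimed each address.
            owner = seen.setdefault(addr.lower(), server)
            if owner != server:
                shared.add(addr.lower())
    return sorted(shared)

# Values copied from the two server sections of the config above.
lists = {
    "rank 0": ["0000:1c:00.0", "0000:20:00.0", "0000:3f:00.0", "0000:43:00.0"],
    "rank 1": ["0000:89:00.0", "0000:8d:00.0", "0000:b2:00.0", "0000:b6:00.0"],
}
print(shared_bdevs(lists))  # prints [] because no address is shared
```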




From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Jacque, Kristin <kristin.jacque@...>
Sent: Wednesday, February 24, 2021 8:31 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [EXTERNAL SENDER] Re: [daos] Startup Errors
 

Hi Neale,

 

I suspect this may be a case of incompatible transport configurations. All components must be configured to either enable or disable certificates. If you prefer to run without certs, as with the dmg “-i” option, your server and agent must also be configured with “allow_insecure: true” in the yml file.

 

In your server config file I am seeing certs enabled:

 

transport_config:
#  # In order to disable transport security, uncomment and set allow_insecure
#  # to true. Not recommended for production configurations.
  allow_insecure: false

  # Location where daos_server will look for Client certificates
  client_cert_dir: /etc/daos/daosCA/clients
  # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/daosCA/certs/daosCA.crt
  # Server certificate for use in TLS handshakes
  cert: /etc/daos/daosCA/certs/server.crt
  # Key portion of Server Certificate
  key: /etc/daos/daosCA/certs/server.key
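
The requirement that every component use the same transport mode can be checked mechanically. The following is a minimal Python sketch, assuming each config is available as text. It does a line-based scan for the first uncommented allow_insecure setting and is not a YAML parser; the function names are hypothetical and not part of DAOS.

```python
import re

def allow_insecure(config_text):
    """Return True/False for the first uncommented allow_insecure line, or None."""
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip commented-out settings and explanatory comments
        match = re.match(r"allow_insecure:\s*(\S+)", stripped)
        if match:
            return match.group(1).lower() == "true"
    return None

def transport_modes_agree(configs):
    """configs maps a label to config text; True if all use the same mode."""
    modes = {allow_insecure(text) for text in configs.values()}
    return len(modes) == 1
```

With the server config shown above, `allow_insecure` returns False, which would not match a client invoked with the "-i" option.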

 

If that doesn’t resolve the connection failure, Tom’s suggestions will help you get to a good starting point to debug further.

 

Please let us know how it goes.

 

Thanks,

Kris

 

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Petrillo, Neale A. (Contractor) via groups.io
Sent: Wednesday, February 24, 2021 2:00 PM
To: daos@daos.groups.io
Subject: [daos] Startup Errors

 

Hello Group! 

 

I'm having some trouble getting my new DAOS cluster working. I've installed 6 servers all with the 1.0.1 RPMs. When I do a 'dmg storage format' from my test host, I get the following output:

 

[root@head ~]# dmg -i -l <host01>:10001 storage format
ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)
ERROR: dmg: no active connections

[root@head ~]# dmg -i -l <host01> system query
ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)
ERROR: dmg: no active connections
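
TRANSIENT_FAILURE is a gRPC channel state, so the errors above suggest that dmg could not establish a connection to the control port at all. A minimal Python sketch of a plain TCP reachability probe follows; the function name `can_connect` is hypothetical, and the default port is the 10001 used in this thread.

```python
import socket

def can_connect(host, port=10001, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A successful probe only shows that the port is reachable from the test host. It says nothing about whether the certificate settings on both sides match.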

 

I'm also seeing these errors in the log files:

 

INFO 2021/02/18 10:40:15 DAOS I/O Server instance 0 storage not ready: context canceled
INFO 2021/02/18 10:40:19 SCM format required on instance 1
INFO 2021/02/18 10:40:19 DAOS I/O Server instance 1 storage not ready: context canceled
INFO 2021/02/18 10:40:19 DAOS Control Server (pid 9993) shutting down
ERROR 2021/02/18 10:40:54 /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
INFO 2021/02/18 10:41:00 DAOS Control Server (pid 11507) listening on 0.0.0.0:10001
INFO 2021/02/18 10:41:00 Waiting for DAOS I/O Server instance storage to be ready...
INFO 2021/02/18 10:41:04 SCM format required on instance 0

 

Configuration files are attached. Any help would be appreciated! 

Neale

 


Nabarro, Tom
 

I think getting the most basic configuration working is probably the best way forward, given that dmg is not connecting. Try with an empty config file (discovery mode) on a single host, without any certificates installed on that host, and without systemd, just to reduce to a minimal viable configuration:

 

[tanabarr@wolf-71 daos_m]$ sudo mkdir /var/run/daos_server
[tanabarr@wolf-71 daos_m]$ sudo chmod 777 /var/run/daos_server
[tanabarr@wolf-71 daos_m]$ install/bin/daos_server start -i
DAOS Server config loaded from /home/tanabarr/projects/daos_m/install/etc/daos_server.yml
No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode
no control log file specified; logging to stdout
No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode
DAOS Control Server v1.3.0 (pid 218642) listening on 0.0.0.0:10001
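
The mkdir and chmod steps above exist so that the server can create its unix domain sockets. A minimal Python sketch follows that checks whether the socket directory exists and is usable by the current user. The function name `socket_dir_usable` is hypothetical, and the default path is the socket_dir value from the config in this thread.

```python
import os

def socket_dir_usable(path="/var/run/daos_server"):
    """Return True if path is a directory the current user can enter and write to."""
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)
```

The result reflects the permissions of whoever runs the check, so it should be run as the same user that starts daos_server.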

 

Then on a separate terminal on the same host:

 

[tanabarr@wolf-71 daos-control-demo]$ dmg -i -l wolf-71 storage scan
Hosts   SCM Total             NVMe Total
-----   ---------             ----------
wolf-71 6.4 TB (2 namespaces) 3.1 TB (3 controllers)

 

See if you get the transient failure with the above.

 

To be clear, insecure mode is only suitable for development and testing purposes.

v1.3.0 does not represent a release; it’s just the version printed when running from master. The above should work with any 1.x version.

 

Regards,

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Neale Petrillo (Contractor) via groups.io
Sent: Thursday, February 25, 2021 5:40 PM
To: daos@daos.groups.io
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Kris and tom, 

 

I'm using Systemd for all the service control and a parallel shell for running the systemctl commands.

 

Disabling certs didn't help. I did find a permission problem with the socket directory though, and fixing that allows me to run dmg on the access point successfully. I still get the TRANSIENT_FAILURE on my test node, though. Now when I run the 'dmg storage format' I get: 

 

Cannot format storage with running I/O server instance

 

I tried running 'dmg system stop' on the access point but got the TRANSIENT_FAILURE error. I'm also still getting the no-hugepages error on all the servers.

 

Are the RPMs available for the newer versions? I've also pasted the config file below: 

 

## DAOS server configuration file.

#

## Location of this configuration file is determined by first checking for the

## path specified through the -f option of the daos_server command line.

## Otherwise, /etc/daos_server.conf is used.

#

#

## Name associated with the DAOS system.

## Immutable after reformat.

#

name: daos

#

#

## Access points

#

## To operate, DAOS will need a quorum of access point nodes to be available.

## Must have the same value for all agents and servers in a system.

## Immutable after reformat.

## Hosts can be specified with or without port, default port below

## assumed if not specified.

#

## default: hostname of this node

access_points:

  - <host01>

#

## Default port

#

## Port number to bind daos_server to, this will also

## be used when connecting to access points unless a port is specified.

#

## default: 10001

port: 10001

#

## Transport Credentials Specifying certificates to secure communications

#

transport_config:

#  # In order to disable transport security, uncomment and set allow_insecure

#  # to true. Not recommended for production configurations.

  allow_insecure: false

 

  # Location where daos_server will look for Client certificates

  client_cert_dir: /etc/daos/daosCA/clients

  # Custom CA Root certificate for generated certs

  ca_cert: /etc/daos/daosCA/certs/daosCA.crt

  # Server certificate for use in TLS handshakes

  cert: /etc/daos/daosCA/certs/server.crt

  # Key portion of Server Certificate

  key: /etc/daos/daosCA/certs/server.key

 

#

## Fault domain path

#

## Immutable after reformat.

#

## default: /hostname for a local configuration w/o fault domain

#fault_path: /vcdu0/rack1/hostname

#

#

## Fault domain callback

#

## Path to executable which will return fault domain string.

## Immutable after reformat.

#

#fault_cb: ./.daos/fd_callback

#

#

## Use specific OFI interfaces

#

## Specify either a single fabric interface that will be used by all

## spawned servers or a comma-seperated list of fabric interfaces to be

## assigned individually.

## By default, the DAOS server will auto-detect and use all fabric

## interfaces if any and fall back to socket on the first eth card

## otherwise.

fabric_ifaces:

  - enp94s0

  - enp216s0

#

#

## Use specific OFI provider

#

## Force a specific provider to be used by all the servers.

## The default provider depends on the interfaces that will be auto-detected:

## ofi+psm2 for Omni-Path, ofi+verbs;ofi_rxm for Infiniband/RoCE and finally

## ofi+socket for non-RDMA-capable Ethernet.

#

provider: ofi+verbs;ofi_rxm

#

#

## Storage mount directory

#

## TODO: If no pre-configured mountpoints are specified, DAOS will auto-detect

## NVDIMMs, configure them in interleave mode, format with ext4 and

## mount with the DAX extension creating a subdirectory within scm_mount_path.

#

## This option allows to specify a preferred path where the mountpoints will

## be created. Either the specified directory or its parent must be a mount

## point.

#

## default: /mnt/daos

scm_mount_path: /mnt/daos

#

#

## NVMe SSD whitelist

#

## Only use NVMe controllers with specific PCI addresses.

## Immutable after reformat, colons replaced by dots in PCI identifiers.

## By default, DAOS will use all the NVMe-capable SSDs that don't have active

## mount points.

#

#bdev_include: ["0000:81:00.1","0000:81:00.2","0000:81:00.3"]

#

#

## NVMe SSD blacklist

#

## Only use NVMe controllers with specific PCI addresses. Overrides drives

## listed in nvme_include and forces auto-detection to skip those drives.

## Immutable after reformat, colons replaced by dots in PCI identifiers.

#

#bdev_exclude: ["0000:81:00.1"]

#

#

## Use Hyperthreads

#

## When Hyperthreading is enabled and supported on the system, this parameter

## defines whether the DAOS service thread should only be bound to different

## physical cores (value 0) or hyperthreads (value 1).

#

## default: false

hyperthreads: False

#

#

## Use the given directory for creating unix domain sockets

#

## DAOS Agent and DAOS Server both use unix domain sockets for communication

## with other system components. This setting is the base location to place

## the sockets in.

#

## default: /var/run/daos_server

socket_dir: /var/run/daos_server

#

#

## Number of hugepages to allocate for use by NVMe SSDs

#

## Specifies the number (not size) of hugepages to allocate for use by NVMe

## through SPDK. This indicates the total number to be used by any spawned

## servers. Default system hugepage size will be used and hugepages will be

## evenly distributed between CPU nodes.

#

## default: 1024

nr_hugepages: 4096

#

#

## Force specific debug mask for daos_server (control plane).

## By default, just use the default debug mask used by daos_server.

## Mask specifies minimum level of message significance to pass to logger.

## Currently supported values are DEBUG and ERROR.

#

## default: DEBUG

#control_log_mask: ERROR

#

#

## Force specific path for daos_server (control plane) logs.

#

## default: print to stderr

control_log_file: /var/log/daos/daos_control.log

#

#

## Enable daos_admin (privileged helper) logging.

#

## default: disabled (errors only to control plane log)

helper_log_file: /var/log/daos/daos_admin.log

#

#

# When per-server definitions exist, auto-allocation of resources is not

# performed. Without per-server definitions, node resources will

# automatically be assigned to servers based on NUMA ratings, there will

# be a one-to-one relationship between servers and sockets.

 

servers:

-

  # Rank to be assigned as identifier for server.

  # Immutable after reformat.

  # Optional parameter, will be auto generated if not supplied.

 

  rank: 0

 

  # Targets (VOS) represent the count of storage targets per data plane

  # server starting at core offset specified by first_core.

 

  # Immutable after reformat.

 

  targets: 24

 

  # Count of offload/helper xstreams per target. (allowed values: 0-2)

  # Immutable after reformat.

 

  # default: 2

  nr_xs_helpers: 0

 

  # Offset of the first core for service xstreams.

  # Immutable after reformat.

 

  # default: 0

  first_core: 0

 

  # Use specific OFI interfaces.

  # Specify the fabric network interface that will be used by this server.

  # Optionally specify the fabric network interface port that will be used

  # by this server but please only if you have a specific need, this will

  # normally be chosen automatically.

 

  fabric_iface: enp94s0

  fabric_iface_port: 20000

  pinned_numa_node: 0

 

#  # Force specific debug mask (D_LOG_MASK) at start up time.

#  # By default, just use the default debug mask used by DAOS.

#  # Mask specifies minimum level of message significance to pass to logger.

#

#  # default: ERR

#  log_mask: WARN

#

#  # Force specific path for DAOS debug logs (D_LOG_FILE).

#

#  # default: /tmp/daos.log

  log_file: /var/log/daos/daos_server1.log

#

#  # Pass specific environment variables to the DAOS server.

#  # Empty by default. Values should be supplied without encapsulating quotes.

#

#  env_vars:

#      - CRT_TIMEOUT=30

#

#  # Define a pre-configured mountpoint for storage class memory to be used

#  # by this server.

#  # Path should be unique to server instance (can use different subdirs).

#  # Either the specified directory or its parent must be a mount point.

#

  scm_mount: /mnt/daos/1

#

#  # Backend SCM device type. Either use DCPM (datacentre persistent memory)

#  # modules configured in interleaved mode (AppDirect regions) or emulated

#  # tmpfs in RAM.

#  # Options are:

#  # - "dcpm" for real SCM (preferred option), scm_size ignored

#  # - "ram" to emulate SCM with memory, scm_list ignored

#  # Immutable after reformat.

#

#  # default: dcpm

  scm_class: dcpm

 

  # When scm_class is set to dcpm, scm_list is the list of device paths for

  # AppDirect pmem namespaces (currently only one per server supported).

  scm_list: [/dev/pmem0]

 

#

#  # When scm_class is set to ram, tmpfs will be used to emulate SCM.

#  # The size of ram is specified by scm_size in GB units.

#  scm_size: 16

#

#  # Backend block device type. Force a SPDK driver to be used by this server

#  # instance.

#  # Options are:

#  # - "nvme" for NVMe SSDs (preferred option), bdev_{size,number} ignored

#  # - "malloc" to emulate a NVMe SSD with memory, bdev_list ignored

#  # - "file" to emulate a NVMe SSD with a regular file, bdev_number ignored

#  # - "kdev" to use a kernel block device, bdev_{size,number} ignored

#  # Immutable after reformat.

#

#  # default: nvme

  bdev_class: nvme

#

#  # Backend block device configuration to be used by this server instance.

#  # When bdev_class is set to nvme, bdev_list is the list of unique NVMe IDs

#  # that should be different across different server instance.

#  # Immutable after reformat.

  bdev_list: ["0000:1c:00.0","0000:20:00.0","0000:3f:00.0","0000:43:00.0"]  # generate regular nvme.conf

-

#  # Rank to be assigned as identifier for server.

#  # Immutable after reformat.

#  # Optional parameter, will be auto generated if not supplied.

#

  rank: 1

 

  # Targets (VOS) represent the number of logical CPUs to be used starting at

  # index specified by first_core.

 

  # Targets will be used to run XStreams can be thought of as service threads.

  # Immutable after reformat.

 

  targets: 24

 

  # Number of helper XStreams per VOS target. (allowed values: 0-2)

  # Immutable after reformat.

#

#  # default: 2

#  nr_xs_helpers: 1

#

#  # Index of first core for service thread.

#  # Immutable after reformat.

#

  # default: 0

  first_core: 24

 

  # Use specific OFI interfaces.

  # Specify the fabric network interface that will be used by this server.

  # Optionally specify the fabric network interface port that will be used

  # by this server but please only if you have a specific need, this will

  # normally be chosen automatically.

 

  fabric_iface: enp216s0

  fabric_iface_port: 20000

  pinned_numa_node: 1

 

#  # Force specific debug mask (D_LOG_MASK) at start up time.

#  # By default, just use the default debug mask used by DAOS.

#  # Mask specifies minimum level of message significance to pass to logger.

#

#  # default: ERR

#  log_mask: WARN

#

#  # Force specific path for DAOS debug logs.

#

#  # default: /tmp/daos.log

  log_file: /var/log/daos/daos_server2.log

#

#  # Pass specific environment variables to the DAOS server

#  # Empty by default. Values should be supplied without encapsulating quotes.

#

#  env_vars:

#      - CRT_TIMEOUT=100

#

#  # Define a pre-configured mountpoint for storage class memory to be used

#  # by this server.

#  # Path should be unique to server instance (can use different subdirs).

#

  scm_mount: /mnt/daos/2

#

#  # Backend SCM device type. Either use DCPM (datacentre persistent memory)

#  # modules configured in interleaved mode (AppDirect regions) or emulated

#  # tmpfs in RAM.

#  # Options are:

#  # - "dcpm" for real SCM (preferred option), scm_size is ignored

#  # - "ram" to emulate SCM with memory, scm_list is ignored

#  # Immutable after reformat.

#

  # default: dcpm

  scm_class: dcpm

 

  # When scm_class is set to dcpm, scm_list is the list of device paths for

  # AppDirect pmem namespaces (currently only one per server supported).

  scm_list: [/dev/pmem1]

#

#  # Backend block device type. Force a SPDK driver to be used by this server

#  # instance.

#  # Options are:

#  # - "nvme" for NVMe SSDs (preferred option), bdev_{size,number} ignored

#  # - "malloc" to emulate a NVMe SSD with memory, bdev_list ignored

#  # - "file" to emulate a NVMe SSD with a regular file, bdev_number ignored

#  # - "kdev" to use a kernel block device, bdev_{size,number} ignored

#  # Immutable after reformat.

#

#  # When bdev_class is set to malloc, bdev_number is the number of devices

#  # to allocate and bdev_size is the size in GB of each LUN/device.

#  bdev_class: malloc

#  bdev_number: 1

#  bdev_size: 4

#

#  # When bdev_class is set to file, bdev_list is the list of file paths that

#  # will be used to emulate NVMe SSDs. The size of each file is specified by

#  # bdev_size in GB unit.

#  bdev_class: file

#  bdev_list: [/tmp/daos-bdev1,/tmp/daos-bdev2]

#  bdev_size: 16

#

#  # When bdev_class is set to kdev, bdev_list is the list of unique kernel

#  # block devices that should be different across different server instance.

#  bdev_class: kdev

#  bdev_list: [/dev/sdc,/dev/sdd]

  bdev_class: nvme

#

#  # Backend block device configuration to be used by this server instance.

#  # When bdev_class is set to nvme, bdev_list is the list of unique NVMe IDs

#  # that should be different across different server instance.

#  # Immutable after reformat.

  bdev_list: ["0000:89:00.0","0000:8d:00.0","0000:b2:00.0","0000:b6:00.0"]  # generate regular nvme.conf

 

 

 


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Jacque, Kristin <kristin.jacque@...>
Sent: Wednesday, February 24, 2021 8:31 PM
To:
daos@daos.groups.io <daos@daos.groups.io>
Subject: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Neale,

 

I suspect this may be a case of incompatible transport configurations. All components must be configured to either enable or disable certificates. If you prefer to run without certs, as with the dmg “-i” option, your server and agent must also be configured with “allow_insecure: true” in the yml file.

 

In your server config file I am seeing certs enabled:

 

transport_config:
#  # In order to disable transport security, uncomment and set allow_insecure
#  # to true. Not recommended for production configurations.
  allow_insecure: false

  # Location where daos_server will look for Client certificates
  client_cert_dir: /etc/daos/daosCA/clients
  # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/daosCA/certs/daosCA.crt
  # Server certificate for use in TLS handshakes
  cert: /etc/daos/daosCA/certs/server.crt
  # Key portion of Server Certificate
  key: /etc/daos/daosCA/certs/server.key
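
The consistency rule described above (every component either enables or disables certificates) can be checked mechanically. The following is only a minimal sketch, assuming each component's config is available as text and that the setting appears as an uncommented `allow_insecure:` line, as it does in the server config quoted here. The labels "server" and "agent" and both helper names are hypothetical and not part of DAOS.

```python
import re

# Matches an uncommented "allow_insecure: <value>" line at any indentation.
# Commented-out lines (starting with '#') are deliberately not matched.
_ALLOW_INSECURE = re.compile(r"^[ \t]*allow_insecure:[ \t]*(\S+)", re.MULTILINE)


def read_allow_insecure(config_text):
    """Return True/False for the first uncommented allow_insecure line, or None."""
    match = _ALLOW_INSECURE.search(config_text)
    if match is None:
        return None
    return match.group(1).strip("\"'").lower() == "true"


def find_mismatches(configs):
    """Given {label: config text}, return labels whose setting differs from the first entry."""
    settings = {label: read_allow_insecure(text) for label, text in configs.items()}
    if not settings:
        return []
    reference = next(iter(settings.values()))
    return sorted(label for label, value in settings.items() if value != reference)
```

Passing the text of each component's config file under a label of your choice returns the labels that disagree with the first one. An empty list means the files agree with each other; it says nothing about whether the command-line "-i" option matches them.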

 

If that doesn’t resolve the connection failure, Tom’s suggestions will help you get to a good starting point to debug further.

 

Please let us know how it goes.

 

Thanks,

Kris

 

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Petrillo, Neale A. (Contractor) via groups.io
Sent: Wednesday, February 24, 2021 2:00 PM
To: daos@daos.groups.io
Subject: [daos] Startup Errors

 

Hello Group! 

 

I'm having some trouble getting my new DAOS cluster working. I've installed 6 servers all with the 1.0.1 RPMs. When I do a 'dmg storage format' from my test host, I get the following output:

 

[root@head ~]# dmg -i -l <host01>:10001 storage format
ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)
ERROR: dmg: no active connections
[root@head ~]# dmg -i -l <host01> system query
ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)
ERROR: dmg: no active connections

 

I'm also seeing these errors in the log files:

 

INFO 2021/02/18 10:40:15 DAOS I/O Server instance 0 storage not ready: context canceled
INFO 2021/02/18 10:40:19 SCM format required on instance 1
INFO 2021/02/18 10:40:19 DAOS I/O Server instance 1 storage not ready: context canceled
INFO 2021/02/18 10:40:19 DAOS Control Server (pid 9993) shutting down
ERROR 2021/02/18 10:40:54 /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
INFO 2021/02/18 10:41:00 DAOS Control Server (pid 11507) listening on 0.0.0.0:10001
INFO 2021/02/18 10:41:00 Waiting for DAOS I/O Server instance storage to be ready...
INFO 2021/02/18 10:41:04 SCM format required on instance 0
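
The hugepages line in the log above names the 1048576 kB (1 GiB) page size. Whether that matters depends on which hugepage size the host actually uses, so it may be worth looking at the kernel's own counters. The following is a minimal sketch, assuming a Linux host where `/proc/meminfo` exposes the `HugePages_Total`, `HugePages_Free` and `Hugepagesize` fields. The function name is hypothetical.

```python
def parse_hugepages(meminfo_text):
    """Extract hugepage counters from /proc/meminfo-style text.

    Returns a dict with whichever of HugePages_Total, HugePages_Free and
    Hugepagesize (in kB) are present in the input.
    """
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    result = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if key in wanted and fields:
            result[key] = int(fields[0])
    return result
```

On a server this could be called as `parse_hugepages(open("/proc/meminfo").read())` and the result compared against the `nr_hugepages` value in the server config.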

 

Configuration files are attached. Any help would be appreciated! 

Neale

 

---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Neale Petrillo (Contractor)
 

Hi Tom, 

I tried your suggestion by uninstalling / reinstalling DAOS RPMs to get a blank config file then added things line by line. Unfortunately, I ended up getting "insufficient information in configuration" errors until I ended up with essentially the config file I had before. 

I think we're going to suspend our testing of DAOS until a new release comes out instead of tracking down these issues. 

Neale


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Nabarro, Tom <tom.nabarro@...>
Sent: Thursday, February 25, 2021 5:52 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors
 

I think getting the most basic configuration working is probably the best way forward, given that dmg is not connecting. Try an empty config file (discovery mode) on a single host, with no certificates installed on that host, and run without systemd to reduce the setup to a minimal viable configuration:

 

[tanabarr@wolf-71 daos_m]$ sudo mkdir /var/run/daos_server
[tanabarr@wolf-71 daos_m]$ sudo chmod 777 /var/run/daos_server
[tanabarr@wolf-71 daos_m]$ install/bin/daos_server start -i
DAOS Server config loaded from /home/tanabarr/projects/daos_m/install/etc/daos_server.yml
No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode
no control log file specified; logging to stdout
No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode
DAOS Control Server v1.3.0 (pid 218642) listening on 0.0.0.0:10001
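
The first two commands above prepare the socket directory before the server starts. As a rough sanity check, the same condition can be tested from the account that will run the server. This is a minimal sketch, assuming a Unix host; the function name is hypothetical, and the path passed in should be whatever `socket_dir` is set to in the server config.

```python
import os


def check_socket_dir(path):
    """Return a list of problems found with the socket directory (empty if none)."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
    elif not os.access(path, os.W_OK | os.X_OK):
        # Creating a unix domain socket needs write and search permission.
        problems.append(f"{path} is not writable/searchable by uid {os.getuid()}")
    return problems
```

Note that `os.access` answers for the user running the script, so the check is only meaningful when run as the same user as the service.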

 

Then on a separate terminal on the same host:

 

[tanabarr@wolf-71 daos-control-demo]$ dmg -i -l wolf-71 storage scan
Hosts   SCM Total             NVMe Total
-----   ---------             ----------
wolf-71 6.4 TB (2 namespaces) 3.1 TB (3 controllers)

 

See if you get the transient failure with the above.

 

To be clear, insecure mode is only suitable for development and testing purposes.

v1.3.0 does not represent a release; it is just the version printed when running from master. The above should work with any 1.x version.

 

Regards,

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Neale Petrillo (Contractor) via groups.io
Sent: Thursday, February 25, 2021 5:40 PM
To: daos@daos.groups.io
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Kris and tom, 

 

I'm using Systemd for all the service control and a parallel shell for running the systemctl commands.

 

Disabling certs didn't help. I did find a permission problem with the socket directory though, and fixing that allows me to run dmg on the access point successfully. I still get the TRANSIENT_FAILURE on my test node, though. Now when I run the 'dmg storage format' I get: 

 

Cannot format storage with running I/O server instance

 

I tried running 'dmg system stop' on the access point but got the TRANSIENT_FAILURE error. I'm also still getting the no-hugepages error on all the servers.

 

Are the RPMs available for the newer versions? I've also pasted the config file below: 

 



Nabarro, Tom
 

Hello Neale,

 

I’m happy to work with you directly on this to get you past any hurdles, if you would like; my e-mail is tom.nabarro@....

The TRANSIENT_FAILURE does indicate some local network-related issue and is unlikely to be fixed by a new release.
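
TRANSIENT_FAILURE is a gRPC channel connectivity state, so one way to narrow down a network-related cause is to test plain TCP reachability of the control port from the host where dmg fails. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and the default port 10001 is taken from the config quoted in this thread.

```python
import socket


def can_connect(host, port=10001, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

A False result from the test node but True from the access point itself would point towards a firewall, routing or name-resolution problem between the two. A True result only shows that the port is reachable; it does not validate the transport (certificate) settings.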

 

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Neale Petrillo (Contractor) via groups.io
Sent: Thursday, March 4, 2021 7:57 PM
To: daos@daos.groups.io
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Tom, 

 

I tried your suggestion by uninstalling / reinstalling DAOS RPMs to get a blank config file then added things line by line. Unfortunately, I ended up getting "insufficient information in configuration" errors until I ended up with essentially the config file I had before. 

 

I think we're going to suspend our testing of DAOS until a new release comes out instead of tracking down these issues. 

 

Neale


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Nabarro, Tom <tom.nabarro@...>
Sent: Thursday, February 25, 2021 5:52 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

I think getting the most basic configuration working is probably the best way forward given that dmg is not connecting, try with an empty config file (discovery mode) on a single host and on that same host without any certificates installed (and try running without systemd just to reduce to a minimal viable configuration):

 

[tanabarr@wolf-71 daos_m]$ sudo mkdir /var/run/daos_server

[tanabarr@wolf-71 daos_m]$ sudo chmod 777 /var/run/daos_server

[tanabarr@wolf-71 daos_m]$ install/bin/daos_server start -i

DAOS Server config loaded from /home/tanabarr/projects/daos_m/install/etc/daos_server.yml

No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode

no control log file specified; logging to stdout

No DAOS I/O Engines in configuration, DAOS Control Server starting in discovery mode

DAOS Control Server v1.3.0 (pid 218642) listening on 0.0.0.0:10001

 

Then on a separate terminal on the same host:

 

[tanabarr@wolf-71 daos-control-demo]$ dmg -i -l wolf-71 storage scan

Hosts   SCM Total             NVMe Total

-----   ---------             ----------

wolf-71 6.4 TB (2 namespaces) 3.1 TB (3 controllers)

 

See if you get the transient failure with the above.

 

The insecure mode is only suitable for development and testing purposes, just to be clear.

v1.3.0 version does not represent a release, it’s just the version printed when running from master, the above should work with any 1.x version.

 

Regards,

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Neale Petrillo (Contractor) via groups.io
Sent: Thursday, February 25, 2021 5:40 PM
To: daos@daos.groups.io
Subject: Re: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Kris and tom, 

 

I'm using Systemd for all the service control and a parallel shell for running the systemctl commands.

 

Disabling certs didn't help. I did find a permission problem with the socket directory though, and fixing that allows me to run dmg on the access point successfully. I still get the TRANSIENT_FAILURE on my test node, though. Now when I run the 'dmg storage format' I get: 

 

Cannot format storage with running I/O server instance

 

I tried running 'dmg system stop' on the access point but got the TRANSIENT_FAILURE error. I'm also still getting the no-hugepages error on all the servers.

 

Are the RPMs available for the newer versions? I've also pasted the config file below: 

 

## DAOS server configuration file.

#

## Location of this configuration file is determined by first checking for the

## path specified through the -f option of the daos_server command line.

## Otherwise, /etc/daos_server.conf is used.

#

#

## Name associated with the DAOS system.

## Immutable after reformat.

#

name: daos

#

#

## Access points

#

## To operate, DAOS will need a quorum of access point nodes to be available.

## Must have the same value for all agents and servers in a system.

## Immutable after reformat.

## Hosts can be specified with or without port, default port below

## assumed if not specified.

#

## default: hostname of this node

access_points:

  - <host01>

#

## Default port

#

## Port number to bind daos_server to, this will also

## be used when connecting to access points unless a port is specified.

#

## default: 10001

port: 10001

#

## Transport Credentials Specifying certificates to secure communications

#

transport_config:

#  # In order to disable transport security, uncomment and set allow_insecure

#  # to true. Not recommended for production configurations.

  allow_insecure: false

 

  # Location where daos_server will look for Client certificates

  client_cert_dir: /etc/daos/daosCA/clients

  # Custom CA Root certificate for generated certs

  ca_cert: /etc/daos/daosCA/certs/daosCA.crt

  # Server certificate for use in TLS handshakes

  cert: /etc/daos/daosCA/certs/server.crt

  # Key portion of Server Certificate

  key: /etc/daos/daosCA/certs/server.key

 

#

## Fault domain path

#

## Immutable after reformat.

#

## default: /hostname for a local configuration w/o fault domain

#fault_path: /vcdu0/rack1/hostname

#

#

## Fault domain callback

#

## Path to executable which will return fault domain string.

## Immutable after reformat.

#

#fault_cb: ./.daos/fd_callback

#

#

## Use specific OFI interfaces

#

## Specify either a single fabric interface that will be used by all

## spawned servers or a comma-seperated list of fabric interfaces to be

## assigned individually.

## By default, the DAOS server will auto-detect and use all fabric

## interfaces if any and fall back to socket on the first eth card

## otherwise.

fabric_ifaces:

  - enp94s0

  - enp216s0

#

#

## Use specific OFI provider

#

## Force a specific provider to be used by all the servers.

## The default provider depends on the interfaces that will be auto-detected:

## ofi+psm2 for Omni-Path, ofi+verbs;ofi_rxm for Infiniband/RoCE and finally

## ofi+socket for non-RDMA-capable Ethernet.

#

provider: ofi+verbs;ofi_rxm

#

#

## Storage mount directory

#

## TODO: If no pre-configured mountpoints are specified, DAOS will auto-detect

## NVDIMMs, configure them in interleave mode, format with ext4 and

## mount with the DAX extension creating a subdirectory within scm_mount_path.

#

## This option allows to specify a preferred path where the mountpoints will

## be created. Either the specified directory or its parent must be a mount

## point.

#

## default: /mnt/daos

scm_mount_path: /mnt/daos

#

#

## NVMe SSD whitelist

#

## Only use NVMe controllers with specific PCI addresses.

## Immutable after reformat, colons replaced by dots in PCI identifiers.

## By default, DAOS will use all the NVMe-capable SSDs that don't have active

## mount points.

#

#bdev_include: ["0000:81:00.1","0000:81:00.2","0000:81:00.3"]

#

#

## NVMe SSD blacklist

#

## Only use NVMe controllers with specific PCI addresses. Overrides drives

## listed in nvme_include and forces auto-detection to skip those drives.

## Immutable after reformat, colons replaced by dots in PCI identifiers.

#

#bdev_exclude: ["0000:81:00.1"]

#

#

## Use Hyperthreads

#

## When Hyperthreading is enabled and supported on the system, this parameter

## defines whether the DAOS service thread should only be bound to different

## physical cores (value 0) or hyperthreads (value 1).

#

## default: false

hyperthreads: False

#

#

## Use the given directory for creating unix domain sockets

#

## DAOS Agent and DAOS Server both use unix domain sockets for communication

## with other system components. This setting is the base location to place

## the sockets in.

#

## default: /var/run/daos_server

socket_dir: /var/run/daos_server

#

#

## Number of hugepages to allocate for use by NVMe SSDs

#

## Specifies the number (not size) of hugepages to allocate for use by NVMe

## through SPDK. This indicates the total number to be used by any spawned

## servers. Default system hugepage size will be used and hugepages will be

## evenly distributed between CPU nodes.

#

## default: 1024

nr_hugepages: 4096

#

#

## Force specific debug mask for daos_server (control plane).

## By default, just use the default debug mask used by daos_server.

## Mask specifies minimum level of message significance to pass to logger.

## Currently supported values are DEBUG and ERROR.

#

## default: DEBUG

#control_log_mask: ERROR

#

#

## Force specific path for daos_server (control plane) logs.

#

## default: print to stderr

control_log_file: /var/log/daos/daos_control.log

#

#

## Enable daos_admin (privileged helper) logging.

#

## default: disabled (errors only to control plane log)

helper_log_file: /var/log/daos/daos_admin.log

#

#

# When per-server definitions exist, auto-allocation of resources is not

# performed. Without per-server definitions, node resources will

# automatically be assigned to servers based on NUMA ratings, there will

# be a one-to-one relationship between servers and sockets.

 

servers:

-
  # Rank to be assigned as identifier for server.
  # Immutable after reformat.
  # Optional parameter, will be auto generated if not supplied.

  rank: 0

  # Targets (VOS) represent the count of storage targets per data plane
  # server starting at core offset specified by first_core.

  # Immutable after reformat.

  targets: 24

  # Count of offload/helper xstreams per target. (allowed values: 0-2)
  # Immutable after reformat.

  # default: 2
  nr_xs_helpers: 0

  # Offset of the first core for service xstreams.
  # Immutable after reformat.

  # default: 0
  first_core: 0

  # Use specific OFI interfaces.
  # Specify the fabric network interface that will be used by this server.
  # Optionally specify the fabric network interface port that will be used
  # by this server but please only if you have a specific need, this will
  # normally be chosen automatically.

  fabric_iface: enp94s0
  fabric_iface_port: 20000
  pinned_numa_node: 0

#  # Force specific debug mask (D_LOG_MASK) at start up time.
#  # By default, just use the default debug mask used by DAOS.
#  # Mask specifies minimum level of message significance to pass to logger.
#
#  # default: ERR
#  log_mask: WARN
#
#  # Force specific path for DAOS debug logs (D_LOG_FILE).
#
#  # default: /tmp/daos.log
  log_file: /var/log/daos/daos_server1.log
#
#  # Pass specific environment variables to the DAOS server.
#  # Empty by default. Values should be supplied without encapsulating quotes.
#
#  env_vars:
#      - CRT_TIMEOUT=30
#
#  # Define a pre-configured mountpoint for storage class memory to be used
#  # by this server.
#  # Path should be unique to server instance (can use different subdirs).
#  # Either the specified directory or its parent must be a mount point.
#
  scm_mount: /mnt/daos/1
#
#  # Backend SCM device type. Either use DCPM (datacentre persistent memory)
#  # modules configured in interleaved mode (AppDirect regions) or emulated
#  # tmpfs in RAM.
#  # Options are:
#  # - "dcpm" for real SCM (preferred option), scm_size ignored
#  # - "ram" to emulate SCM with memory, scm_list ignored
#  # Immutable after reformat.
#
#  # default: dcpm
  scm_class: dcpm

  # When scm_class is set to dcpm, scm_list is the list of device paths for
  # AppDirect pmem namespaces (currently only one per server supported).
  scm_list: [/dev/pmem0]

#
#  # When scm_class is set to ram, tmpfs will be used to emulate SCM.
#  # The size of ram is specified by scm_size in GB units.
#  scm_size: 16
#
#  # Backend block device type. Force a SPDK driver to be used by this server
#  # instance.
#  # Options are:
#  # - "nvme" for NVMe SSDs (preferred option), bdev_{size,number} ignored
#  # - "malloc" to emulate a NVMe SSD with memory, bdev_list ignored
#  # - "file" to emulate a NVMe SSD with a regular file, bdev_number ignored
#  # - "kdev" to use a kernel block device, bdev_{size,number} ignored
#  # Immutable after reformat.
#
#  # default: nvme
  bdev_class: nvme
#
#  # Backend block device configuration to be used by this server instance.
#  # When bdev_class is set to nvme, bdev_list is the list of unique NVMe IDs
#  # that should be different across different server instance.
#  # Immutable after reformat.
  bdev_list: ["0000:1c:00.0","0000:20:00.0","0000:3f:00.0","0000:43:00.0"]  # generate regular nvme.conf

-
#  # Rank to be assigned as identifier for server.
#  # Immutable after reformat.
#  # Optional parameter, will be auto generated if not supplied.
#
  rank: 1

  # Targets (VOS) represent the number of logical CPUs to be used starting at
  # index specified by first_core.

  # Targets will be used to run XStreams can be thought of as service threads.
  # Immutable after reformat.

  targets: 24

  # Number of helper XStreams per VOS target. (allowed values: 0-2)
  # Immutable after reformat.
#
#  # default: 2
#  nr_xs_helpers: 1
#
#  # Index of first core for service thread.
#  # Immutable after reformat.
#
  # default: 0
  first_core: 24

  # Use specific OFI interfaces.
  # Specify the fabric network interface that will be used by this server.
  # Optionally specify the fabric network interface port that will be used
  # by this server but please only if you have a specific need, this will
  # normally be chosen automatically.

  fabric_iface: enp216s0
  fabric_iface_port: 20000
  pinned_numa_node: 1

#  # Force specific debug mask (D_LOG_MASK) at start up time.
#  # By default, just use the default debug mask used by DAOS.
#  # Mask specifies minimum level of message significance to pass to logger.
#
#  # default: ERR
#  log_mask: WARN
#
#  # Force specific path for DAOS debug logs.
#
#  # default: /tmp/daos.log
  log_file: /var/log/daos/daos_server2.log
#
#  # Pass specific environment variables to the DAOS server
#  # Empty by default. Values should be supplied without encapsulating quotes.
#
#  env_vars:
#      - CRT_TIMEOUT=100
#
#  # Define a pre-configured mountpoint for storage class memory to be used
#  # by this server.
#  # Path should be unique to server instance (can use different subdirs).
#
  scm_mount: /mnt/daos/2
#
#  # Backend SCM device type. Either use DCPM (datacentre persistent memory)
#  # modules configured in interleaved mode (AppDirect regions) or emulated
#  # tmpfs in RAM.
#  # Options are:
#  # - "dcpm" for real SCM (preferred option), scm_size is ignored
#  # - "ram" to emulate SCM with memory, scm_list is ignored
#  # Immutable after reformat.
#
  # default: dcpm
  scm_class: dcpm

  # When scm_class is set to dcpm, scm_list is the list of device paths for
  # AppDirect pmem namespaces (currently only one per server supported).
  scm_list: [/dev/pmem1]
#
#  # Backend block device type. Force a SPDK driver to be used by this server
#  # instance.
#  # Options are:
#  # - "nvme" for NVMe SSDs (preferred option), bdev_{size,number} ignored
#  # - "malloc" to emulate a NVMe SSD with memory, bdev_list ignored
#  # - "file" to emulate a NVMe SSD with a regular file, bdev_number ignored
#  # - "kdev" to use a kernel block device, bdev_{size,number} ignored
#  # Immutable after reformat.
#
#  # When bdev_class is set to malloc, bdev_number is the number of devices
#  # to allocate and bdev_size is the size in GB of each LUN/device.
#  bdev_class: malloc
#  bdev_number: 1
#  bdev_size: 4
#
#  # When bdev_class is set to file, bdev_list is the list of file paths that
#  # will be used to emulate NVMe SSDs. The size of each file is specified by
#  # bdev_size in GB unit.
#  bdev_class: file
#  bdev_list: [/tmp/daos-bdev1,/tmp/daos-bdev2]
#  bdev_size: 16
#
#  # When bdev_class is set to kdev, bdev_list is the list of unique kernel
#  # block devices that should be different across different server instance.
#  bdev_class: kdev
#  bdev_list: [/dev/sdc,/dev/sdd]
  bdev_class: nvme
#
#  # Backend block device configuration to be used by this server instance.
#  # When bdev_class is set to nvme, bdev_list is the list of unique NVMe IDs
#  # that should be different across different server instance.
#  # Immutable after reformat.
  bdev_list: ["0000:89:00.0","0000:8d:00.0","0000:b2:00.0","0000:b6:00.0"]  # generate regular nvme.conf

 

 

 


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Jacque, Kristin <kristin.jacque@...>
Sent: Wednesday, February 24, 2021 8:31 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [EXTERNAL SENDER] Re: [daos] Startup Errors

 

Hi Neale,

 

I suspect this may be a case of incompatible transport configurations. All components must be configured to either enable or disable certificates. If you prefer to run without certs, as with the dmg “-i” option, your server and agent must also be configured with “allow_insecure: true” in the yml file.

 

In your server config file I am seeing certs enabled:

 

transport_config:
#  # In order to disable transport security, uncomment and set allow_insecure
#  # to true. Not recommended for production configurations.
  allow_insecure: false

  # Location where daos_server will look for Client certificates
  client_cert_dir: /etc/daos/daosCA/clients
  # Custom CA Root certificate for generated certs
  ca_cert: /etc/daos/daosCA/certs/daosCA.crt
  # Server certificate for use in TLS handshakes
  cert: /etc/daos/daosCA/certs/server.crt
  # Key portion of Server Certificate
  key: /etc/daos/daosCA/certs/server.key
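
For comparison, a minimal sketch of the certificate-free variant described above. It shows only the one key that changes, and it assumes the agent yml file accepts the same transport_config block as the server yml file. The remaining certificate keys are left out here for brevity, not because they have to be removed. As the config comment itself notes, this is not recommended for production configurations.

```yaml
# Sketch only: apply the same setting in the server yml and the agent yml
# so that every component agrees on whether certificates are used.
transport_config:
  # Disables transport security; this matches running dmg with "-i".
  allow_insecure: true
```

The services would presumably need a restart to pick up the changed yml files.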

 

If that doesn’t resolve the connection failure, Tom’s suggestions will help you get to a good starting point to debug further.

 

Please let us know how it goes.

 

Thanks,

Kris

 

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Petrillo, Neale A. (Contractor) via groups.io
Sent: Wednesday, February 24, 2021 2:00 PM
To: daos@daos.groups.io
Subject: [daos] Startup Errors

 

Hello Group! 

 

I'm having some trouble getting my new DAOS cluster working. I've installed 6 servers all with the 1.0.1 RPMs. When I do a 'dmg storage format' from my test host, I get the following output:

 

[root@head ~]# dmg -i -l <host01>:10001 storage format
ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)
ERROR: dmg: no active connections
[root@head ~]# dmg -i -l <host01> system query
ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)
ERROR: dmg: no active connections

 

I'm also seeing these errors in the log files:

 

INFO 2021/02/18 10:40:15 DAOS I/O Server instance 0 storage not ready: context canceled
INFO 2021/02/18 10:40:19 SCM format required on instance 1
INFO 2021/02/18 10:40:19 DAOS I/O Server instance 1 storage not ready: context canceled
INFO 2021/02/18 10:40:19 DAOS Control Server (pid 9993) shutting down
ERROR 2021/02/18 10:40:54 /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
INFO 2021/02/18 10:41:00 DAOS Control Server (pid 11507) listening on 0.0.0.0:10001
INFO 2021/02/18 10:41:00 Waiting for DAOS I/O Server instance storage to be ready...
INFO 2021/02/18 10:41:04 SCM format required on instance 0

 

Configuration files are attached. Any help would be appreciated! 

Neale

 

---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
