Startup Errors


Petrillo, Neale A. (Contractor) <Neale.Petrillo@...>
 

Hello Group! 

I'm having some trouble getting my new DAOS cluster working. I've installed 6 servers all with the 1.0.1 RPMs. When I do a 'dmg storage format' from my test host, I get the following output:

 

[root@head ~]# dmg -i -l <host01>:10001 storage format

ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)

ERROR: dmg: no active connections

[root@head ~]# dmg -i -l <host01> system query

ERROR: <host01>:10001: socket connection is not active (TRANSIENT_FAILURE)

ERROR: dmg: no active connections

 

I'm also seeing these errors in the log files:

 

INFO 2021/02/18 10:40:15 DAOS I/O Server instance 0 storage not ready: context canceled

INFO 2021/02/18 10:40:19 SCM format required on instance 1

INFO 2021/02/18 10:40:19 DAOS I/O Server instance 1 storage not ready: context canceled

INFO 2021/02/18 10:40:19 DAOS Control Server (pid 9993) shutting down

ERROR 2021/02/18 10:40:54 /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB

INFO 2021/02/18 10:41:00 DAOS Control Server (pid 11507) listening on 0.0.0.0:10001

INFO 2021/02/18 10:41:00 Waiting for DAOS I/O Server instance storage to be ready...

INFO 2021/02/18 10:41:04 SCM format required on instance 0

 

Configuration files are attached. Any help would be appreciated! 

Neale



Nabarro, Tom
 

Hello Neale,

 

First of all, is there any chance you can try a more recent version, 1.1.3 for example?

 

How are you launching daos_server? Directly with the start command on the command line, through systemd, or some other way?

 

Is there any firewall blocking the traffic on port 10001?
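
One way to rule out a firewall or routing problem is to test whether the control port accepts TCP connections from the host where dmg runs. The following is a minimal sketch, not a DAOS tool: port_open is a hypothetical helper name, it assumes bash (for the /dev/tcp redirection) and the coreutils timeout command, and the 3-second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# port_open HOST PORT: hypothetical helper, not part of DAOS.
# Returns 0 if a TCP connection to HOST:PORT succeeds within 3 seconds,
# non-zero otherwise. Relies on bash's /dev/tcp redirection and on the
# coreutils `timeout` command.
port_open() {
    local host=$1 port=$2
    # Host and port are passed as positional arguments to the inner shell
    # so they are never re-parsed as shell code.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

For example, `port_open host01 10001 && echo reachable || echo blocked`, with host01 replaced by the real server name. A successful connection only shows that the port is reachable; it says nothing about whether the security settings on both sides match.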

 

Note that you can format storage across all your hosts in parallel by populating the "hostlist" parameter in the "daos_control.yml" dmg config file (https://daos-stack.github.io/admin/deployment/#daos-server-remote-access). Once you have the above issue sorted out, you can then simply run dmg -i storage format.

 

To start, maybe run the server and dmg commands on the same (single) host. Also, please paste your server config file.

 

Regards,

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Petrillo, Neale A. (Contractor) via groups.io
Sent: Wednesday, February 24, 2021 9:00 PM
To: daos@daos.groups.io
Subject: [daos] Startup Errors

 



Jacque, Kristin
 

Hi Neale,

 

I suspect this may be a case of incompatible transport configurations. All components must be configured the same way, with certificates either enabled everywhere or disabled everywhere. If you prefer to run without certs, as with the dmg “-i” option, your server and agent must also be configured with “allow_insecure: true” in their yml files.

 

In your server config file I am seeing certs enabled:

 

transport_config:

#  # In order to disable transport security, uncomment and set allow_insecure

#  # to true. Not recommended for production configurations.

  allow_insecure: false

 

  # Location where daos_server will look for Client certificates

  client_cert_dir: /etc/daos/daosCA/clients

  # Custom CA Root certificate for generated certs

  ca_cert: /etc/daos/daosCA/certs/daosCA.crt

  # Server certificate for use in TLS handshakes

  cert: /etc/daos/daosCA/certs/server.crt

  # Key portion of Server Certificate

  key: /etc/daos/daosCA/certs/server.key
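
To compare this setting across config files, something like the following could be used. It is only a sketch: allow_insecure_value is a hypothetical helper name, it does plain text matching rather than real YAML parsing, and the file path in the usage note is an assumption about where the config lives.

```shell
#!/usr/bin/env bash
# allow_insecure_value FILE: hypothetical helper, not part of DAOS.
# Prints the value of the first uncommented allow_insecure key in FILE,
# or "unset" if no such key is found. Lines starting with '#' are skipped
# because the pattern only allows whitespace before the key name.
allow_insecure_value() {
    awk '
        /^[[:space:]]*allow_insecure:/ { print $2; found = 1; exit }
        END { if (!found) print "unset" }
    ' "$1"
}
```

For example, `allow_insecure_value /etc/daos/daos_server.yml` on each server (path assumed; use the file you actually pass to daos_server). If it prints false while dmg is run with -i, the two sides disagree about transport security.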

 

If that doesn’t resolve the connection failure, Tom’s suggestions will help you get to a good starting point to debug further.

 

Please let us know how it goes.

 

Thanks,

Kris

 

 
