issues with NVMe drives from RPM installation


Dahringer, Richard
 

Hi all -
I'm trying to set up a proof-of-concept DAOS cluster, and it is proving to be tricky. The systems have 4 SCM 128G DIMMs and 4 U.2 NVMe drives installed. I have installed all the RPMs from registrationcenter.intel.com and have been able to set up the SCM devices; the 'dmg -i' commands all seem to work. When I add NVMe drives to the configuration, though, daos_server does not start. It does start when the NVMe drives are not in the configuration.

My daos_server.conf file:

name: daos_server
access_points: ['elfs13o01']
# port: 10001
provider: ofi+psm2
nr_hugepages: 4096
control_log_file: /tmp/daos_control.log
transport_config:
   allow_insecure: true

servers:
-
  targets: 1
  first_core: 0
  nr_xs_helpers: 0
  fabric_iface: hib0
  fabric_iface_port: 31416
  log_file: /tmp/daos_server.log

  env_vars:
  - DAOS_MD_CAP=1024
  - CRT_CTX_SHARE_ADDR=0
  - CRT_TIMEOUT=30
  - FI_SOCKETS_MAX_CONN_RETRY=1
  - FI_SOCKETS_CONN_TIMEOUT=2000

  # Storage definitions

  # When scm_class is set to ram, tmpfs will be used to emulate SCM.
  # The size of ram is specified by scm_size in GB units.

  scm_mount: /mnt/daos0  # map to -s /mnt/daos
  scm_class: dcpm
  scm_list: [/dev/pmem0]

  bdev_class: nvme
  bdev_list: ["0000:5e:00.0"]

The startup error:

[root@elfs13o01 ~]# daos_server -o daos_local.yml start
daos_server logging to file /tmp/daos_control.log
ERROR: /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
DAOS Control Server (pid 73257) listening on 0.0.0.0:10001
Waiting for DAOS I/O Server instance storage to be ready...
SCM @ /mnt/daos0: 262 GB Total/247 GB Avail
Starting I/O server instance 0: /usr/bin/daos_io_server
daos_io_server:0 Using legacy core allocation algorithm
daos_io_server:0 Starting SPDK v19.04.1 / DPDK 19.02.0 initialization...
[ DPDK EAL parameters: daos -c 0x1 --pci-whitelist=0000:5e:00.0 --log-level=lib.eal:6 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk73258 --proc-type=auto ]
ERROR: daos_io_server:0 EAL: No free hugepages reported in hugepages-1048576kB
ERROR: /var/run/daos_server/daos_server.sock: failed to accept connection: accept unixpacket /var/run/daos_server/daos_server.sock: use of closed network connection
ERROR: DAOS I/O Server exited with error: /usr/bin/daos_io_server (instance 0) exited: exit status 1

Can someone provide some pointers to what is going on? 
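
The "No free hugepages reported in hugepages-1048576kB" message refers to the 1 GiB hugepage pool. The same message also appears in the "daos_server storage scan" output posted later in this thread, where the scan still lists the devices, so it is probably not the fatal error. A minimal sketch for ruling hugepages in or out, assuming a Linux host: it parses the default-size hugepage counters from /proc/meminfo text. The sample values are made up for illustration and are not taken from the poster's system.

```python
import re

# Illustrative /proc/meminfo excerpt. These values are hypothetical,
# not taken from the system described in this thread.
SAMPLE_MEMINFO = """\
MemTotal:       196608000 kB
HugePages_Total:    4096
HugePages_Free:     4096
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""


def parse_hugepages(meminfo_text):
    """Extract the default-size hugepage counters from /proc/meminfo text."""
    fields = {}
    for key in ("HugePages_Total", "HugePages_Free", "Hugepagesize"):
        match = re.search(rf"^{key}:\s+(\d+)", meminfo_text, re.MULTILINE)
        if match:
            fields[key] = int(match.group(1))
    return fields


info = parse_hugepages(SAMPLE_MEMINFO)
print(info)
```

On a real server, pass the contents of /proc/meminfo instead of the sample string. If HugePages_Free is non-zero and Hugepagesize is 2048 kB, the 1 GiB message can most likely be ignored.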


Farrell, Patrick Arthur <patrick.farrell@...>
 

Richard,

There's nothing obviously wrong (to me, anyway) with your config, and no useful errors in the output. You can check the logs in /tmp/daos*.log (there will be multiple files); they should contain more information. You could also turn on debug logging before you start the server to possibly get more info, as described in the manual: https://daos-stack.github.io/admin/troubleshooting/

Also, if you have not already, you can check that your drives are visible to DAOS and can be prepared as expected with the daos_server storage commands, scan and prepare, detailed here:
https://daos-stack.github.io/admin/deployment/

That page details how to run them for SCM; look at the command help for how to run them for NVMe devices. (You'll want to select NVMe only, or it may ask you to reboot to set up your SCM goals, which you've obviously already done.)

Regards,
-Patrick
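
One way to sift the log files for the interesting lines, as a minimal sketch: it keeps only lines carrying the ERR level token. The token position and format are inferred solely from the log excerpt posted later in this thread, and the sample lines below are copied from that excerpt; other DAOS versions may log differently.

```python
import re

# Matches the standalone level token "ERR" (it does not match "ERROR").
# This pattern is an assumption based on the log lines shown in this thread.
LEVEL_RE = re.compile(r"\bERR\b")

SAMPLE_LOG = [
    "07/30-08:21:08.63 elfs13o01 DAOS[74504/74524] bio  INFO src/bio/bio_xstream.c:1049 bio_xsctxt_alloc() Initialize NVMe context, tgt_id:0, init_thread:(nil)\n",
    "07/30-08:21:10.77 elfs13o01 DAOS[74504/74524] bio  ERR  src/bio/bio_xstream.c:877 init_blobstore_ctxt() Device list & device mapping is inconsistent\n",
    "07/30-08:21:14.13 elfs13o01 DAOS[74504/74524] server ERR  src/iosrv/srv.c:452 dss_srv_handler() failed to init spdk context for xstream(2) rc:-1005\n",
]


def error_lines(lines):
    """Return the lines whose level token is ERR, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if LEVEL_RE.search(line)]


for line in error_lines(SAMPLE_LOG):
    print(line)
```

To run it over the real files, loop over glob.glob("/tmp/daos*.log") and pass each open file object to error_lines().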


Nabarro, Tom
 

Hello Richard,

"ERROR: DAOS I/O Server exited with error: /usr/bin/daos_io_server (instance 0) exited: exit status 1" indicates that there may be useful information in the io_server log for the first instance. Its location is set by log_file in the server config file (the default is /tmp/server0.log). If there is nothing useful in there, try increasing the log_mask to DEBUG.

Regards,

Tom Nabarro – HPC

M: +44 (0)7786 260986

Skype: tom.nabarro

 


---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Dahringer, Richard
 

Thanks Tom, that led me to this:


07/30-08:21:08.63 elfs13o01 DAOS[74504/74524] bio  INFO src/bio/bio_xstream.c:1049 bio_xsctxt_alloc() Initialize NVMe context, tgt_id:0, init_thread:(nil)
07/30-08:21:10.77 elfs13o01 DAOS[74504/74524] bio  ERR  src/bio/bio_xstream.c:877 init_blobstore_ctxt() Device list & device mapping is inconsistent
07/30-08:21:14.13 elfs13o01 DAOS[74504/74524] server ERR  src/iosrv/srv.c:452 dss_srv_handler() failed to init spdk context for xstream(2) rc:-1005

 

When I check for consistency, I see:

[root@elfs13o01 tmp]# daos_server storage scan
Scanning locally-attached storage...
ERROR: /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
NVMe controllers and namespaces:
    PCI:0000:5e:00.0 Model:INTEL SSDPE2KX040T8  FW:VDV10131 Socket:0 Capacity:4.0 TB
    PCI:0000:5f:00.0 Model:INTEL SSDPE2KX040T8  FW:VDV10131 Socket:0 Capacity:4.0 TB
    PCI:0000:d8:00.0 Model:INTEL SSDPE2KX040T8  FW:VDV10131 Socket:1 Capacity:4.0 TB
    PCI:0000:d9:00.0 Model:INTEL SSDPE2KX040T8  FW:VDV10131 Socket:1 Capacity:4.0 TB
SCM Namespaces:
    Device:pmem0 Socket:0 Capacity:266 GB
    Device:pmem1 Socket:1 Capacity:266 GB

The first NVMe controller listed is the drive I have in the configuration file:

  bdev_class: nvme
  bdev_list: ["0000:5e:00.0"]

Is there another file somewhere that I need to set up?  I saw some documentation of ‘daos_nvme.conf’, which is automatically generated.  I also added the second NVMe device on socket 0 to the configuration to see if that would change anything, but I got the same results.
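
The comparison between the scan output and bdev_list can also be done mechanically. The following is a minimal sketch, not part of DAOS: it parses the "PCI:... Socket:N" lines from the scan output above and reports any bdev_list entry that the scan did not find. The regular expression is an assumption based only on the output format shown in this message.

```python
import re

# Scan output copied from the message above (prompt and EAL line omitted).
SCAN_OUTPUT = """\
NVMe controllers and namespaces:
    PCI:0000:5e:00.0 Model:INTEL SSDPE2KX040T8  FW:VDV10131 Socket:0 Capacity:4.0 TB
    PCI:0000:5f:00.0 Model:INTEL SSDPE2KX040T8  FW:VDV10131 Socket:0 Capacity:4.0 TB
    PCI:0000:d8:00.0 Model:INTEL SSDPE2KX040T8  FW:VDV10131 Socket:1 Capacity:4.0 TB
    PCI:0000:d9:00.0 Model:INTEL SSDPE2KX040T8  FW:VDV10131 Socket:1 Capacity:4.0 TB
SCM Namespaces:
    Device:pmem0 Socket:0 Capacity:266 GB
    Device:pmem1 Socket:1 Capacity:266 GB
"""

# Captures the PCI address and the socket number from one controller line.
NVME_RE = re.compile(r"PCI:(\S+)\s.*?Socket:(\d+)")


def nvme_sockets(scan_text):
    """Map each scanned NVMe PCI address to its socket number."""
    return {addr: int(sock) for addr, sock in NVME_RE.findall(scan_text)}


def missing_devices(bdev_list, scan_text):
    """Return the bdev_list entries that the scan did not report."""
    found = nvme_sockets(scan_text)
    return [addr for addr in bdev_list if addr not in found]


print(nvme_sockets(SCAN_OUTPUT))
print(missing_devices(["0000:5e:00.0"], SCAN_OUTPUT))  # prints []
```

An empty result means the configured address is visible to the scan, which suggests the "inconsistent" error refers to something other than the bdev_list entry itself.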

 

 



Nabarro, Tom
 

It sounds like the metadata may be out of sync. Can you try removing /mnt/daos0/*, starting the server, and then (on a separate tty) reformatting with "dmg storage format --reformat"?
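
Removing the contents of the SCM mount is destructive, so it may be worth previewing what would be deleted first. A minimal sketch, assuming nothing about what DAOS stores there: it lists the entries that the shell pattern "<scm_mount>/*" would match (dotfiles excluded, as the shell does by default) without deleting anything. The demo uses a temporary directory with hypothetical file names rather than /mnt/daos0.

```python
import tempfile
from pathlib import Path


def entries_to_remove(scm_mount):
    """List the names directly under scm_mount that `<scm_mount>/*` would match.

    Nothing is deleted; hidden entries are skipped because a plain shell
    glob does not match them.
    """
    return sorted(
        p.name for p in Path(scm_mount).iterdir() if not p.name.startswith(".")
    )


# Demo on a throwaway directory; the file names are placeholders only.
with tempfile.TemporaryDirectory() as demo_mount:
    for name in ("example_file_a", "example_file_b"):
        (Path(demo_mount) / name).touch()
    print(entries_to_remove(demo_mount))
```

On the real system, call entries_to_remove("/mnt/daos0") before removing anything, then start the server and run the reformat command from a second terminal as described above.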

 



Dahringer, Richard
 

That worked!

 

Thanks Tom!

 
