SPDK NVMe Startup Issue


Farrell, Patrick Arthur <patrick.farrell@...>
 

Good morning,

I'm trying to use DAOS on a new system with several NVMe drives in it, and I'm running into some odd issues.

In particular, starting the DAOS server takes an extremely long time (on the order of 1-2 minutes), and it appears to be waiting on the storage prepare operation.  (I confirmed this by running the storage prepare operation separately and seeing the same results.)

Eventually, the NVMe devices fail to come up.  DAOS/SPDK spits out the following messages:
ERROR: /root/daos/install/bin/daos_admin nvme_pcie.c:1031:nvme_pcie_qpair_construct: *ERROR*: alloc qpair_cmd failed
ERROR: /root/daos/install/bin/daos_admin nvme.c: 408:nvme_ctrlr_probe: *ERROR*: Failed to construct NVMe controller for SSD: 0000:b2:00.0

It spits them out for almost every NVMe SSD.  (Which is a bit weird: the ones that are missing from the errors seem to be identical to those that are listed.  And they are *not* already partitioned, etc.; they were all unused before starting DAOS.)
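
To see exactly which devices fail and which do not, the controller addresses can be pulled out of the SPDK error lines and compared against the full list of NVMe controllers. This is only a minimal Python sketch, assuming the log format shown above. The function name failed_controllers and the all_nvme list are illustrative and are not part of DAOS or SPDK:

```python
import re

# Matches the PCI address at the end of SPDK's
# "Failed to construct NVMe controller" error line.
FAILED_RE = re.compile(
    r"Failed to construct NVMe controller for SSD: "
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)

def failed_controllers(log_text):
    """Return the sorted, de-duplicated PCI addresses SPDK failed to construct."""
    return sorted({m.group(1).lower() for m in FAILED_RE.finditer(log_text)})

sample_log = """\
ERROR: /root/daos/install/bin/daos_admin nvme_pcie.c:1031:nvme_pcie_qpair_construct: *ERROR*: alloc qpair_cmd failed
ERROR: /root/daos/install/bin/daos_admin nvme.c: 408:nvme_ctrlr_probe: *ERROR*: Failed to construct NVMe controller for SSD: 0000:b2:00.0
"""

# Hypothetical list of all NVMe controllers on the node, e.g. copied from lspci.
all_nvme = ["0000:1a:00.0", "0000:3b:00.0", "0000:b2:00.0"]

failed = failed_controllers(sample_log)
not_reported = [addr for addr in all_nvme if addr not in failed]
print("failed:", failed)
print("not reported as failed:", not_reported)
```

Running it over the complete server output, rather than the two sample lines, would give the full set of failing addresses to compare by model and slot.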

Our SSDs are a mix of Intel Optane SSDs and Samsung, specifically these models:
1a:00.0 Non-Volatile memory controller: Intel Corporation Optane SSD 900P Series
3b:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824

There are also a lot of odd messages in dmesg, starting around when I start daos_server (I am assuming this is storage prepare):
[249124.508491] nvme nvme1: pci function 0000:1a:00.0
[249124.509321] nvme 0000:1a:00.0: irq 58 for MSI/MSI-X
[249124.585958] nvme nvme2: pci function 0000:3b:00.0
[249124.586851] nvme 0000:3b:00.0: irq 59 for MSI/MSI-X
[249124.663408] nvme nvme3: pci function 0000:3c:00.0
[249124.664335] nvme 0000:3c:00.0: irq 60 for MSI/MSI-X
[249124.712222] nvme 0000:1a:00.0: irq 58 for MSI/MSI-X
[249124.712267] nvme 0000:1a:00.0: irq 61 for MSI/MSI-X
[249124.712309] nvme 0000:1a:00.0: irq 63 for MSI/MSI-X
[249124.712349] nvme 0000:1a:00.0: irq 64 for MSI/MSI-X
[249124.712387] nvme 0000:1a:00.0: irq 67 for MSI/MSI-X

A few of these sorts of messages appear:
[249693.267636] nvme nvme7: pci function 0000:87:00.0
[249693.269017] nvme 0000:87:00.0: irq 295 for MSI/MSI-X
[249693.324097] nvme nvme8: pci function 0000:af:00.0
[249693.325265] nvme 0000:af:00.0: irq 296 for MSI/MSI-X
[249693.338993] nvme nvme5: Shutdown timeout set to 10 seconds

And finally, I get:
[249697.898755] pcieport 0000:3a:00.0: bridge window [io  0x1000-0x0fff] to [bus 3b] add_size 1000
[249697.898771] pcieport 0000:3a:01.0: bridge window [io  0x1000-0x0fff] to [bus 3c] add_size 1000
[249697.898800] pcieport 0000:3a:00.0: res[13]=[io  0x1000-0x0fff] res_to_dev_res add_size 1000 min_align 1000
[249697.898806] pcieport 0000:3a:00.0: res[13]=[io  0x1000-0x1fff] res_to_dev_res add_size 1000 min_align 1000
[249697.898812] pcieport 0000:3a:01.0: res[13]=[io  0x1000-0x0fff] res_to_dev_res add_size 1000 min_align 1000
[249697.898818] pcieport 0000:3a:01.0: res[13]=[io  0x1000-0x1fff] res_to_dev_res add_size 1000 min_align 1000
[249697.898826] pcieport 0000:3a:00.0: BAR 13: no space for [io  size 0x1000]
[249697.898832] pcieport 0000:3a:00.0: BAR 13: failed to assign [io  size 0x1000]
[249697.898838] pcieport 0000:3a:01.0: BAR 13: no space for [io  size 0x1000]
[249697.898843] pcieport 0000:3a:01.0: BAR 13: failed to assign [io  size 0x1000]
[249697.898849] pcieport 0000:3a:01.0: BAR 13: no space for [io  size 0x1000]
[249697.898854] pcieport 0000:3a:01.0: BAR 13: failed to assign [io  size 0x1000]
[249697.898875] pcieport 0000:3a:00.0: BAR 13: no space for [io  size 0x1000]
[249697.898880] pcieport 0000:3a:00.0: BAR 13: failed to assign [io  size 0x1000]
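
Since the same failure repeats for each bridge, it may help to condense these lines into one count per bridge and resource type. This is a rough Python sketch, assuming the dmesg line format shown above. The name bar_failures is made up for illustration:

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   pcieport 0000:3a:00.0: BAR 13: failed to assign [io  size 0x1000]
BAR_FAIL_RE = re.compile(
    r"pcieport (\S+): BAR (\d+): failed to assign \[(\w+)\s+size (0x[0-9a-f]+)\]"
)

def bar_failures(dmesg_text):
    """Count 'failed to assign' messages per (bridge, BAR number, resource type)."""
    counts = Counter()
    for m in BAR_FAIL_RE.finditer(dmesg_text):
        bridge, bar, res_type, _size = m.groups()
        counts[(bridge, int(bar), res_type)] += 1
    return counts

sample = """\
[249697.898826] pcieport 0000:3a:00.0: BAR 13: no space for [io  size 0x1000]
[249697.898832] pcieport 0000:3a:00.0: BAR 13: failed to assign [io  size 0x1000]
[249697.898838] pcieport 0000:3a:01.0: BAR 13: no space for [io  size 0x1000]
[249697.898843] pcieport 0000:3a:01.0: BAR 13: failed to assign [io  size 0x1000]
"""

counts = bar_failures(sample)
for (bridge, bar, res_type), n in sorted(counts.items()):
    print(f"{bridge} BAR {bar} [{res_type}]: {n} failed assignment(s)")
```

In the excerpt above every failure is of type io, on the two bridges leading to bus 3b and bus 3c.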

I haven't dug in in much detail yet, but it seems that there is some sort of settings issue or incompatibility between my drives/controller and SPDK/DAOS.  I tried updating SPDK to a newer version, as there are some references to similar bugs being fixed in newer versions, but ran into build issues (understandably).

Just curious if anyone has any particular thoughts on the issue here, before I start digging in heavily?

Thanks in advance,
Patrick Farrell


Nabarro, Tom
 

Could you please try with the number of hugepages set to 8192 in the server config?
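
For a sense of scale, 8192 hugepages at the usual 2 MiB x86_64 hugepage size is 16 GiB of memory. The sketch below is only an illustration of checking what the node currently has reserved. It assumes the standard /proc/meminfo field names, and the sample text stands in for a real node's output:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            info[key.strip()] = int(rest.split()[0])
    return info

def hugepage_summary(info, wanted_pages):
    """Compare the requested hugepage count against what the kernel reports."""
    page_kib = info["Hugepagesize"]  # the kernel reports this value in kB
    return {
        "wanted_gib": wanted_pages * page_kib / (1024 * 1024),
        "total": info["HugePages_Total"],
        "free": info["HugePages_Free"],
        "enough_free": info["HugePages_Free"] >= wanted_pages,
    }

# Sample values for illustration only; on a real node, read /proc/meminfo.
sample = """\
HugePages_Total:    4096
HugePages_Free:     4096
Hugepagesize:       2048 kB
"""

summary = hugepage_summary(parse_meminfo(sample), wanted_pages=8192)
print(summary)
```

On the server itself, replace the sample with the contents of /proc/meminfo, for example open("/proc/meminfo").read().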

We are also in the process of updating to SPDK v20.01.1 for DAOS 1.0-1.2 (TBD).

 

Regards,

Tom Nabarro – DCG/ESAD

M: +44 (0)7786 260986

Skype: tom.nabarro

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Farrell, Patrick Arthur
Sent: Monday, March 2, 2020 3:39 PM
To: daos@daos.groups.io
Subject: [daos] SPDK NVMe Startup Issue

 


 

---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Farrell, Patrick Arthur <patrick.farrell@...>
 

Tom,

I tried the hugepages change.  No change on the SSD/NVMe side: same errors, etc.

Sounds like I'll have to dig in.  Is there a particular ticket I could track to keep an eye on the SPDK update process?  (It sounds like the timing on that is still TBD, so not imminent, but I'd like to follow along if possible.)

-Patrick

From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Nabarro, Tom <tom.nabarro@...>
Sent: Monday, March 2, 2020 9:52 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] SPDK NVMe Startup Issue
 
