dmg pool operation stuck
Allen
Hi All:
I installed DAOS from scratch on Ubuntu 20.04. When creating a pool after storage format, the create operation is stuck. The "dmg" terminal only printed this message:

$ dmg -i storage format
Format Summary:
Hosts SCM Devices NVMe Devices
----- ----------- ------------
sw2 1 1

$ dmg -i storage scan
Hosts SCM Total NVMe Total
----- --------- ----------
sw2 0 B (0 modules) 4.0 TB (1 controller)

$ dmg -i storage query usage
Hosts SCM-Total SCM-Free SCM-Used NVMe-Total NVMe-Free NVMe-Used
----- --------- -------- -------- ---------- --------- ---------
sw2 0 B 0 B N/A 0 B 0 B N/A

$ dmg -i storage query
ERROR: dmg: Please specify one command of: device-health, list-devices, list-pools, target-health or usage

$ dmg -i storage query list-devices
Errors:
Hosts Error
----- -----
sw2 DAOS I/O Engine instance not started or not responding on dRPC
ERROR: dmg: 1 host had errors

$ dmg -i storage query list-pools
Errors:
Hosts Error
----- -----
sw2 DAOS I/O Engine instance not started or not responding on dRPC
ERROR: dmg: 1 host had errors

$ dmg -i pool create -z 10GB
Creating DAOS pool with automatic storage allocation: 10 GB NVMe + 6.00% SCM
ERROR: dmg: context deadline exceeded

And the "server" and "engine" did not print any messages, even though I set their log masks to debug. I am new to DAOS. Please help me, thanks.
Hello,
In your server log file there are some error messages about requested huge pages. To start off with, I would recommend setting nr_hugepages: 4096 and targets: 8 in the server config file (daos_server.yml). You will need to reformat (umount /mnt/daos, then restart daos_server and run dmg storage format in another terminal). Instructions can be found in the admin guide: https://docs.daos.io/admin/deployment/
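For reference, a minimal daos_server.yml fragment with just those two settings might look like this (using the engines: section layout referred to later in this thread; every other key is omitted and will differ per system):

nr_hugepages: 4096
engines:
  - targets: 8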
Regards, Tom
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of allen.zhuo@...
Sent: Monday, November 29, 2021 11:23 AM
To: daos@daos.groups.io
Subject: [daos] dmg pool operation stuck
Hi All:
And the "server" and "engine" did not print any messages, even though I set their log masks to debug. |
|
Allen
Hi,
The total memory of my server is 512 GB, and the hugepage size is 1 GB.

$ free -h
total used free shared buff/cache available
Mem: 503Gi 130Gi 372Gi 132Mi 1.0Gi 370Gi
Swap: 8.0Gi 0B 8.0Gi

When I set nr_hugepages: 4096 and targets: 8, the following error is printed when daos_server starts:

$ cat /proc/meminfo | grep Huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 128
HugePages_Free: 124
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 134217728 kB

It looks like DAOS wants to allocate nr_hugepages * hugepagesize of memory (here 4096 * 1 GiB = 4 TiB, far more than the 503 GiB of RAM).

$ daos_server start -o ~/daos/build/etc/daos_server.yml
DAOS Server config loaded from /home/daos/daos/build/etc/daos_server.yml
daos_server logging to file /tmp/daos_server.log
DEBUG 01:52:49.731469 start.go:89: Switching control log level to DEBUG
DEBUG 01:52:49.831569 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node
DEBUG 01:52:49.831823 netdetect.go:284: initDeviceScan completed. Depth -6, numObj 27, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 nvme2n1 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]
DEBUG 01:52:49.831859 netdetect.go:913: Calling ValidateProviderConfig with ens5f0, ofi+verbs;ofi_rxm
DEBUG 01:52:49.831876 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm
DEBUG 01:52:49.832098 netdetect.go:995: There are 0 hfi1 devices in the system
DEBUG 01:52:49.832132 netdetect.go:572: There are 2 NUMA nodes.
DEBUG 01:52:49.832155 netdetect.go:928: Device ens5f0 supports provider: ofi+verbs;ofi_rxm
DEBUG 01:52:49.833024 server.go:401: Active config saved to /home/daos/daos/build/etc/.daos_server.active.yml (read-only)
DEBUG 01:52:49.833067 server.go:113: fault domain: /sw2
DEBUG 01:52:49.833306 server.go:163: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 DisableCleanHugePages:false PCIWhitelist:0000:98:00.0 PCIBlacklist: TargetUser:daos_server ResetOnly:false DisableVFIO:false DisableVMD:true}
DEBUG 01:53:05.988214 main.go:70: server: code = 610 description = "requested 4096 hugepages; got 494"
ERROR: server: code = 610 description = "requested 4096 hugepages; got 494"
ERROR: server: code = 610 resolution = "reboot the system or manually clear /dev/hugepages as appropriate"

$ numastat -mc | egrep "Node|Huge"
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Node 0 Node 1 Total
AnonHugePages 0 0 0
HugePages_Total 253952 254976 508928
HugePages_Free 250880 254976 505856
HugePages_Surp 0 0 0

But spdk setup.sh does not run into this problem.
Allen
Hi,
After reviewing daos_server.log, I think we can ignore the hugepage error, because the daos_server startup log appears twice in the file. The first time, nr_hugepages was set to 4096, which exceeded the total memory, so it failed. I reset it to 16 and it succeeded. So we should start from line 15 of the file:

DEBUG 10:39:51.086435 start.go:89: Switching control log level to DEBUG
Can you please try with a 2M hugepage size? I'm not sure we have much test coverage using 1G hugepages, and there may be some built-in assumptions that cause problems when using them.
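For example, something along these lines should work (a rough sketch only, assuming the 1 GiB pages were enabled via kernel boot parameters such as default_hugepagesz=1G hugepagesz=1G hugepages=128, which would need to be removed first):

# after removing the 1G hugepage boot parameters and rebooting:
sysctl -w vm.nr_hugepages=4096     # allocate 4096 x 2 MiB pages (8 GiB total)
cat /proc/meminfo | grep Huge      # Hugepagesize should now read 2048 kB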
Regards, Tom
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of allen.zhuo@...
Sent: Tuesday, November 30, 2021 2:09 AM
To: daos@daos.groups.io
Subject: Re: [daos] dmg pool operation stuck
Hi,
When I set nr_hugepages: 4096 and targets: 8, the following error is printed when daos_server starts:
It looks like DAOS wants to allocate nr_hugepages * hugepagesize of memory.
But spdk setup.sh does not run into this problem.
PATEYRON Sacha
On 30/11/2021 at 12:06, Nabarro, Tom wrote:
Hi, in case it helps:
Tested OK in Docker on Ubuntu 20.04.
sysctl -a | grep vm.nr_hugepages
free -h
Best regards.
--
Sacha Pateyron CEA/SSI/SISR/LISD
Allen
Hi Tom,
The same issue still exists after changing the hugepage size to 2 MB.
When dmg pool create runs, daos_server does not print any message, which I think is abnormal. Can we add some debugging code?
$ cat /proc/meminfo | grep Huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 4096
HugePages_Free: 3931
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 8388608 kB
Output of the dmg terminal:
daos_server@sw2:~/daos$ dmg -i storage scan
Hosts SCM Total NVMe Total
----- --------- ----------
sw2 0 B (0 modules) 4.0 TB (1 controller)
daos_server@sw2:~/daos$ dmg -i storage format
Format Summary:
Hosts SCM Devices NVMe Devices
----- ----------- ------------
sw2 1 1
daos_server@sw2:~/daos$ dmg -i pool create -z 100GB
Creating DAOS pool with automatic storage allocation: 100 GB NVMe + 6.00% SCM
ERROR: dmg: context deadline exceeded
The latest daos_server log:
daos_server@sw2:~/daos$ daos_server start -o ~/daos/build/etc/daos_server.yml
DAOS Server config loaded from /home/daos/daos/build/etc/daos_server.yml
daos_server logging to file /tmp/daos_server.log
DEBUG 01:58:17.438639 start.go:89: Switching control log level to DEBUG
DEBUG 01:58:17.537242 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node
DEBUG 01:58:17.537639 netdetect.go:284: initDeviceScan completed. Depth -6, numObj 27, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 nvme2n1 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]
DEBUG 01:58:17.537780 netdetect.go:913: Calling ValidateProviderConfig with ens5f0, ofi+verbs;ofi_rxm
DEBUG 01:58:17.537805 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm
DEBUG 01:58:17.538059 netdetect.go:995: There are 0 hfi1 devices in the system
DEBUG 01:58:17.538100 netdetect.go:572: There are 2 NUMA nodes.
DEBUG 01:58:17.538121 netdetect.go:928: Device ens5f0 supports provider: ofi+verbs;ofi_rxm
DEBUG 01:58:17.539248 server.go:401: Active config saved to /home/daos/daos/build/etc/.daos_server.active.yml (read-only)
DEBUG 01:58:17.539297 server.go:113: fault domain: /sw2
DEBUG 01:58:17.539619 server.go:163: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 DisableCleanHugePages:false PCIWhitelist:0000:98:00.0 PCIBlacklist: TargetUser:daos_server ResetOnly:false DisableVFIO:false DisableVMD:true}
DEBUG 01:58:32.790943 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node
DEBUG 01:58:32.791323 netdetect.go:284: initDeviceScan completed. Depth -6, numObj 26, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]
DEBUG 01:58:32.791406 netdetect.go:669: Searching for a device alias for: ens5f0
DEBUG 01:58:32.791447 netdetect.go:693: Device alias for ens5f0 is mlx5_0
DEBUG 01:58:32.791495 class.go:209: output bdev conf file set to /mnt/daos/daos_nvme.conf
DEBUG 01:58:33.319087 provider.go:217: bdev scan: update cache (1 devices)
DAOS Control Server v1.2 (pid 3681) listening on 0.0.0.0:10001
DEBUG 01:58:33.320410 instance_exec.go:35: instance 0: checking if storage is formatted
Checking DAOS I/O Engine instance 0 storage ...
DEBUG 01:58:33.320503 instance_storage.go:74: /mnt/daos: checking formatting
DEBUG 01:58:33.346603 instance_storage.go:90: /mnt/daos (ram) needs format: true
SCM format required on instance 0
DEBUG 01:59:04.391268 ctl_storage_rpc.go:368: received StorageScan RPC
DEBUG 01:59:04.391386 provider.go:217: bdev scan: reuse cache (1 devices)
DEBUG 01:59:04.420740 ctl_storage_rpc.go:387: responding to StorageScan RPC
DEBUG 01:59:08.933555 ctl_storage_rpc.go:407: received StorageFormat RPC ; proceeding to instance storage format
Formatting scm storage for DAOS I/O Engine instance 0 (reformat: false)
DEBUG 01:59:08.933794 instance_storage.go:74: /mnt/daos: checking formatting
DEBUG 01:59:08.961338 instance_storage.go:90: /mnt/daos (ram) needs format: true
Instance 0: starting format of SCM (ram:/mnt/daos)
Instance 0: finished format of SCM (ram:/mnt/daos)
Formatting nvme storage for DAOS I/O Engine instance 0
DEBUG 01:59:09.018278 instance_superblock.go:90: /mnt/daos: checking superblock
DEBUG 01:59:09.018801 instance_superblock.go:94: /mnt/daos: needs superblock (doesn't exist)
Instance 0: starting format of nvme block devices [0000:98:00.0]
Instance 0: finished format of nvme block devices [0000:98:00.0]
DAOS I/O Engine instance 0 storage ready
DEBUG 01:59:13.503527 instance_superblock.go:90: /mnt/daos: checking superblock
DEBUG 01:59:13.504009 instance_superblock.go:94: /mnt/daos: needs superblock (doesn't exist)
DEBUG 01:59:13.504107 instance_superblock.go:119: idx 0 createSuperblock()
DEBUG 01:59:13.504432 instance_superblock.go:149: creating /mnt/daos/superblock: (rank: NilRank, uuid: 8dd7c6e2-8b2e-43b7-b180-f68ed64e8960)
DEBUG 01:59:13.504745 instance_exec.go:62: instance start()
DEBUG 01:59:13.505003 class.go:241: create /mnt/daos/daos_nvme.conf with [0000:98:00.0] bdevs
SCM @ /mnt/daos: 137 GB Total/137 GB Avail
DEBUG 01:59:13.505327 instance_exec.go:79: instance 0: awaiting DAOS I/O Engine init
DEBUG 01:59:13.506206 exec.go:69: daos_engine:0 args: [-t 8 -x 6 -g daos_server -d /var/run/daos_server -s /mnt/daos -n /mnt/daos/daos_nvme.conf -I 0]
DEBUG 01:59:13.506300 exec.go:70: daos_engine:0 env: [CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm CRT_TIMEOUT=1200 D_LOG_MASK=DEBUG D_LOG_FILE=/tmp/daos_engine.0.log CRT_CTX_SHARE_ADDR=0 OFI_DOMAIN=mlx5_0 VOS_BDEV_CLASS=NVME OFI_INTERFACE=ens5f0 OFI_PORT=20000]
Starting I/O Engine instance 0: /home/daos/daos/build/bin/daos_engine
daos_engine:0 Using legacy core allocation algorithm
daos_engine:0 Starting SPDK v20.01.2 git sha1 b2808069e / DPDK 19.11.6 initialization...
[ DPDK EAL parameters: daos --no-shconf -c 0x1 --pci-whitelist=0000:98:00.0 --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --log-level=lib.eal:4 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid4969 ]
Allen
Hi Tom,
I added some debugging code. I found that dmg pool create failed because Database replicaAddr.get() returned <nil>, so Database CheckReplica returned an error.

// CheckReplica returns an error if the node is not configured as a
// replica or the service is not running.
func (db *Database) CheckReplica() error {
    if !db.IsReplica() {
        return &ErrNotReplica{db.cfg.stringReplicas(nil)}
    }
    return db.raft.withReadLock(func(_ raftService) error { return nil })
}
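A minimal sketch of the kind of debug print that surfaces the nil replica address, reusing the names visible above (db.log.Debugf is an assumption about the logger API, and this is not the upstream code):

// Hypothetical debug variant of CheckReplica: log the stored replica
// address before the IsReplica() test.
func (db *Database) CheckReplica() error {
    db.log.Debugf("CheckReplica: replicaAddr=%v isReplica=%v",
        db.replicaAddr.get(), db.IsReplica())
    if !db.IsReplica() {
        return &ErrNotReplica{db.cfg.stringReplicas(nil)}
    }
    return db.raft.withReadLock(func(_ raftService) error { return nil })
}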
Is this because I have too few daos_server nodes? I only have 1 daos_server node and only 1 NVMe SSD.
If so, can I set the DAOS replication to 1? Or does DAOS need at least 3 daos_server nodes? Or can there be only one daos_server node, but with at least 3 NVMe SSDs?
PATEYRON Sacha
On 01/12/2021 at 10:23, allen.zhuo@... wrote:
Hi Tom,
Hi Allen,
With my Docker integration, I only have 1 daos_server with 1 HDD.
ps aux|grep -i daos
--
Sacha Pateyron SSI/SISR/LISD
Hello,
The format is completing and the engine process is being spawned, so now we need to look at the engine log, which is specified in the server config file (consult the admin guide for more details: https://docs.daos.io/admin/deployment/). Could you try setting the engine- and server-specific log files, with the log masks set to DEBUG, and paste your server config file here, please?
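As a rough sketch, the relevant fragment would be something like the following (key names as used in the example configs; other keys omitted, so please double-check against the admin guide):

control_log_mask: DEBUG
engines:
  - log_mask: DEBUG
    log_file: /tmp/daos_engine.0.log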
Regards, Tom
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of allen.zhuo@...
Sent: Wednesday, December 1, 2021 2:30 AM
To: daos@daos.groups.io
Subject: Re: [daos] dmg pool operation stuck
Hi Tom,
The same issue still exists after changing the hugepage size to 2 MB. When dmg pool create runs, daos_server does not print any message.
Allen
Hi Tom,
Please see the attachment.
Can you try with nr_xs_helpers: 0 in the config, please? You will need to reformat.
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of allen.zhuo@...
Sent: Wednesday, December 1, 2021 10:40 AM
To: daos@daos.groups.io
Subject: Re: [daos] dmg pool operation stuck
Hi Tom,
Allen
Hi Tom,
The same issue still exists after setting nr_xs_helpers: 0.
$ ps aux|grep daos_engine | grep -v grep
daos_de+ 5301 394 0.1 135622300 771552 pts/0 RLl+ 11:45 5:43 /home/daos_debug/daos/build/bin/daos_engine -t 8 -x 0 -g daos_server -d /var/run/daos_server -s /mnt/daos -n /mnt/daos/daos_nvme.conf -I 0
Allen
Hi Tom,
I think I know why it timed out when creating the pool: I set 'access_points:' in daos_server.yml to my hostname 'sw2', when it should be set to 'localhost'. If it is set to the hostname, the address is not recognized as local by common.IsLocalAddr(repAddr) in NewDatabase, so setReplica is never called. I'm not sure if this is a bug, but I think setting it to a hostname should be supported.

// NewDatabase returns a configured and initialized Database instance.
func NewDatabase(log logging.Logger, cfg *DatabaseConfig) (*Database, error) {
    if cfg == nil {
        cfg = &DatabaseConfig{}
    }
    if cfg.SystemName == "" {
        cfg.SystemName = build.DefaultSystemName
    }
    db := &Database{
        log:                log,
        cfg:                cfg,
        replicaAddr:        &syncTCPAddr{},
        shutdownErrCh:      make(chan error),
        raftLeaderNotifyCh: make(chan bool),
        data: &dbData{
            log: log,
            Members: &MemberDatabase{
                Ranks:        make(MemberRankMap),
                Uuids:        make(MemberUuidMap),
                Addrs:        make(MemberAddrMap),
                FaultDomains: NewFaultDomainTree(),
            },
            Pools: &PoolDatabase{
                Ranks:  make(PoolRankMap),
                Uuids:  make(PoolUuidMap),
                Labels: make(PoolLabelMap),
            },
            SchemaVersion: CurrentSchemaVersion,
        },
    }
    for _, repAddr := range db.cfg.Replicas {
        if !common.IsLocalAddr(repAddr) {
            continue
        }
        db.setReplica(repAddr)
    }
    return db, nil
}
Allen
Hi Tom,
A new question.
$ dmg -i pool create -z 100GB
Creating DAOS pool with automatic storage allocation: 100 GB NVMe + 6.00% SCM
ERROR: dmg: pool create failed: rpc error: code = Unknown desc = pool request contains zero target ranks

Any ideas? Please see the attachment for the latest abc and log files.
This is unusual; normally the name resolution works.
After talking to a colleague (Mike), we suspect that the IsLocalAddr() test is failing to match the AP address with a local address, with the result that the MS never starts: https://github.com/daos-stack/daos/blob/release/1.2/src/control/common/net_utils.go#L67
In order to investigate this case, since you seem to be open to adding debugging code, could you add some debug output to dump the list of ifaceAddrs returned, so that we can see what the system's idea of the local address set is, please?
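Something along these lines would do, either pasted into src/control/common/net_utils.go as a debug print or run as a standalone program (a sketch only; it assumes IsLocalAddr() builds its candidate set from net.InterfaceAddrs(), as the ifaceAddrs name suggests):

// Hypothetical standalone helper, not part of the DAOS tree: print the
// address set reported by net.InterfaceAddrs() so it can be compared
// with the resolved access_points address.
package main

import (
    "fmt"
    "net"
)

func main() {
    addrs, err := net.InterfaceAddrs()
    if err != nil {
        fmt.Println("InterfaceAddrs failed:", err)
        return
    }
    for _, a := range addrs {
        fmt.Printf("local iface addr: %s (network %s)\n", a.String(), a.Network())
    }
}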
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of allen.zhuo@...
Sent: Thursday, December 2, 2021 7:49 AM
To: daos@daos.groups.io
Subject: Re: [daos] dmg pool operation stuck
Hi Tom,
The engine is actually not starting. In the server log you should see a message containing "…started on rank 0", and then if you run "dmg system query [--verbose]" it should report at least one "Joined" rank.
A couple of things to try:
- set engines->targets to 4 and engines->nr_xs_helpers to 0
- add engines->env_vars key/value DD_MASK=all (see the config fragment below)
and provide the engine log after rerunning.
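In config terms that would be something like the following sketch (only the keys mentioned above are shown; everything else stays as in your current daos_server.yml):

engines:
  - targets: 4
    nr_xs_helpers: 0
    env_vars:
      - DD_MASK=all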
Regards, Tom
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of allen.zhuo@...
Sent: Thursday, December 2, 2021 9:22 AM
To: daos@daos.groups.io
Subject: Re: [daos] dmg pool operation stuck
Hi Tom,
A new question.
ERROR: dmg: pool create failed: rpc error: code = Unknown desc = pool request contains zero target ranks
Allen
Hi Tom,
I'm very sorry, it was my mistake. I had an incorrect entry in /etc/hosts: the address should be "172.20.148.244", but I had written it as "127.20.148.244".
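For completeness, the corrected /etc/hosts entry would then be something like this (hostname and address as given above):

172.20.148.244   sw2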
Allen
Hi Tom,
Please see the attachment. The rerun terminal printed the following messages:
daos_debug@sw2:~$ dmg -i storage format
Format Summary:
Hosts SCM Devices NVMe Devices
----- ----------- ------------
localhost 1 1
daos_debug@sw2:~$ dmg -i system query --verbose
Query matches no members in system.
daos_debug@sw2:~$ dmg -i pool create -z 100GB
Creating DAOS pool with automatic storage allocation: 100 GB NVMe + 6.00% SCM
ERROR: dmg: pool create failed: rpc error: code = Unknown desc = pool request contains zero target ranks
Allen
Hi Tom,
I noticed an error in the engine log:
DAOS[11610/11614] bio DBUG src/bio/bio_xstream.c:662 load_blobstore() load blobstore failed -1025
Is it because of this? And what does "-1025" mean? Some of the parameters passed to spdk_bs_load are as follows:
bs_dev->blocklen = 512
bs_dev->blockcnt = 7814037168
bs_opts.max_md_ops = 32
bs_opts.max_channel_ops = 4096
bs_opts.cluster_sz = 1073741824
And the memory information of the server is as follows:
daos_debug@sw2:~$ free -h
total used free shared buff/cache available
Mem: 503Gi 11Gi 490Gi 131Mi 1.7Gi 489Gi
Swap: 8.0Gi 0B 8.0Gi
daos_debug@sw2:~$ numastat -mc | egrep "Node|Huge"
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Node 0 Node 1 Total
AnonHugePages 0 0 0
HugePages_Total 4096 4096 8192
HugePages_Free 3766 4096 7862
HugePages_Surp 0 0 0
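As a quick sanity check on the spdk_bs_load parameters quoted above: bs_dev->blocklen * bs_dev->blockcnt = 512 * 7814037168 = 4,000,787,030,016 bytes, i.e. roughly the 4.0 TB that dmg storage scan reported for the controller, so the failure does not look like an obvious device-geometry mismatch.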