Re: dmg pool operation stuck


Nabarro, Tom
 

Hello,

 

The format is completing and the engine process is being spawned, so the next step is to look at the engine log, whose location is specified in the server config file (see the admin guide for details: https://docs.daos.io/admin/deployment/). Could you set the engine-specific log file and log mask to DEBUG in the server config, and paste your server config file here, please?
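
If the engine log is the file named by D_LOG_FILE in the quoted output below (/tmp/daos_engine.0.log), something like the following sketch could pull out the lines worth pasting. The function name engine_log_errors is hypothetical and not part of DAOS, and the severity tags it matches (ERR, CRIT, EMRG) are an assumption about the engine log format, so adjust the pattern to whatever your log actually contains.

```shell
#!/bin/sh
# Hypothetical helper (not part of DAOS): show the most recent error-level
# lines from an engine log so they can be pasted into a reply.
# Assumption: the default path is the D_LOG_FILE value from the quoted output.
# Assumption: ERR/CRIT/EMRG are the severity tags worth matching.
engine_log_errors() {
    log="${1:-/tmp/daos_engine.0.log}"
    count="${2:-20}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # -w matches whole words only, so "ERR" does not match inside "ERROR".
    grep -w -E 'ERR|CRIT|EMRG' "$log" | tail -n "$count"
}
```

For example, after sourcing the function, `engine_log_errors /tmp/daos_engine.0.log 50` would print up to the last 50 matching lines.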

 

Regards,

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of allen.zhuo@...
Sent: Wednesday, December 1, 2021 2:30 AM
To: daos@daos.groups.io
Subject: Re: [daos] dmg pool operation stuck

 

Hi Tom,

The same issue still exists after changing the hugepagesize to 2MB.

When I run dmg pool create, daos_server does not print any message. I think this is abnormal. So, can we add some debugging code?

$ cat /proc/meminfo | grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    4096
HugePages_Free:     3931
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         8388608 kB


Output from the dmg terminal:

daos_server@sw2:~/daos$ dmg -i storage scan
Hosts SCM Total       NVMe Total
----- ---------       ----------
sw2   0 B (0 modules) 4.0 TB (1 controller)

daos_server@sw2:~/daos$ dmg -i storage format
Format Summary:
  Hosts SCM Devices NVMe Devices
  ----- ----------- ------------
  sw2   1           1

daos_server@sw2:~/daos$ dmg -i pool create -z 100GB
Creating DAOS pool with automatic storage allocation: 100 GB NVMe + 6.00% SCM
ERROR: dmg: context deadline exceeded

 

The latest daos_server log:

daos_server@sw2:~/daos$ daos_server start -o ~/daos/build/etc/daos_server.yml

DAOS Server config loaded from /home/daos/daos/build/etc/daos_server.yml

daos_server logging to file /tmp/daos_server.log

DEBUG 01:58:17.438639 start.go:89: Switching control log level to DEBUG

DEBUG 01:58:17.537242 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node

DEBUG 01:58:17.537639 netdetect.go:284: initDeviceScan completed.  Depth -6, numObj 27, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 nvme2n1 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]

DEBUG 01:58:17.537780 netdetect.go:913: Calling ValidateProviderConfig with ens5f0, ofi+verbs;ofi_rxm

DEBUG 01:58:17.537805 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm

DEBUG 01:58:17.538059 netdetect.go:995: There are 0 hfi1 devices in the system

DEBUG 01:58:17.538100 netdetect.go:572: There are 2 NUMA nodes.

DEBUG 01:58:17.538121 netdetect.go:928: Device ens5f0 supports provider: ofi+verbs;ofi_rxm

DEBUG 01:58:17.539248 server.go:401: Active config saved to /home/daos/daos/build/etc/.daos_server.active.yml (read-only)

DEBUG 01:58:17.539297 server.go:113: fault domain: /sw2

DEBUG 01:58:17.539619 server.go:163: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 DisableCleanHugePages:false PCIWhitelist:0000:98:00.0 PCIBlacklist: TargetUser:daos_server ResetOnly:false DisableVFIO:false DisableVMD:true}

DEBUG 01:58:32.790943 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node

DEBUG 01:58:32.791323 netdetect.go:284: initDeviceScan completed.  Depth -6, numObj 26, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]

DEBUG 01:58:32.791406 netdetect.go:669: Searching for a device alias for: ens5f0

DEBUG 01:58:32.791447 netdetect.go:693: Device alias for ens5f0 is mlx5_0

DEBUG 01:58:32.791495 class.go:209: output bdev conf file set to /mnt/daos/daos_nvme.conf

DEBUG 01:58:33.319087 provider.go:217: bdev scan: update cache (1 devices)

DAOS Control Server v1.2 (pid 3681) listening on 0.0.0.0:10001

DEBUG 01:58:33.320410 instance_exec.go:35: instance 0: checking if storage is formatted

Checking DAOS I/O Engine instance 0 storage ...

DEBUG 01:58:33.320503 instance_storage.go:74: /mnt/daos: checking formatting

DEBUG 01:58:33.346603 instance_storage.go:90: /mnt/daos (ram) needs format: true

SCM format required on instance 0

DEBUG 01:59:04.391268 ctl_storage_rpc.go:368: received StorageScan RPC

DEBUG 01:59:04.391386 provider.go:217: bdev scan: reuse cache (1 devices)

DEBUG 01:59:04.420740 ctl_storage_rpc.go:387: responding to StorageScan RPC

DEBUG 01:59:08.933555 ctl_storage_rpc.go:407: received StorageFormat RPC ; proceeding to instance storage format

Formatting scm storage for DAOS I/O Engine instance 0 (reformat: false)

DEBUG 01:59:08.933794 instance_storage.go:74: /mnt/daos: checking formatting

DEBUG 01:59:08.961338 instance_storage.go:90: /mnt/daos (ram) needs format: true

Instance 0: starting format of SCM (ram:/mnt/daos)

Instance 0: finished format of SCM (ram:/mnt/daos)

Formatting nvme storage for DAOS I/O Engine instance 0

DEBUG 01:59:09.018278 instance_superblock.go:90: /mnt/daos: checking superblock

DEBUG 01:59:09.018801 instance_superblock.go:94: /mnt/daos: needs superblock (doesn't exist)

Instance 0: starting format of nvme block devices [0000:98:00.0]

Instance 0: finished format of nvme block devices [0000:98:00.0]

DAOS I/O Engine instance 0 storage ready

DEBUG 01:59:13.503527 instance_superblock.go:90: /mnt/daos: checking superblock

DEBUG 01:59:13.504009 instance_superblock.go:94: /mnt/daos: needs superblock (doesn't exist)

DEBUG 01:59:13.504107 instance_superblock.go:119: idx 0 createSuperblock()

DEBUG 01:59:13.504432 instance_superblock.go:149: creating /mnt/daos/superblock: (rank: NilRank, uuid: 8dd7c6e2-8b2e-43b7-b180-f68ed64e8960)

DEBUG 01:59:13.504745 instance_exec.go:62: instance start()

DEBUG 01:59:13.505003 class.go:241: create /mnt/daos/daos_nvme.conf with [0000:98:00.0] bdevs

SCM @ /mnt/daos: 137 GB Total/137 GB Avail

DEBUG 01:59:13.505327 instance_exec.go:79: instance 0: awaiting DAOS I/O Engine init

DEBUG 01:59:13.506206 exec.go:69: daos_engine:0 args: [-t 8 -x 6 -g daos_server -d /var/run/daos_server -s /mnt/daos -n /mnt/daos/daos_nvme.conf -I 0]

DEBUG 01:59:13.506300 exec.go:70: daos_engine:0 env: [CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm CRT_TIMEOUT=1200 D_LOG_MASK=DEBUG D_LOG_FILE=/tmp/daos_engine.0.log CRT_CTX_SHARE_ADDR=0 OFI_DOMAIN=mlx5_0 VOS_BDEV_CLASS=NVME OFI_INTERFACE=ens5f0 OFI_PORT=20000]

Starting I/O Engine instance 0: /home/daos/daos/build/bin/daos_engine

daos_engine:0 Using legacy core allocation algorithm

daos_engine:0 Starting SPDK v20.01.2 git sha1 b2808069e / DPDK 19.11.6 initialization...

[ DPDK EAL parameters: daos --no-shconf -c 0x1 --pci-whitelist=0000:98:00.0 --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --log-level=lib.eal:4 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid4969 ]

 
