Re: Where should I look for release notes?

cheneydeng@...
 

Thanks Michael,

I checked the DUG slides and the roadmap; it seems the self-healing feature is delivered in release 1.2. I would like to confirm the status of self-healing, since the README.md in 'src/rebuild/' claims that:
Currently, since the raft leader can not exclude the target automatically, the sysadmin has to manually exclude the target from the pool, which then triggers the rebuild.

Do we still need this manual operation in release 1.2, given that we say DAOS supports self-healing?
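For context, the manual flow the README describes looks something like the sketch below; the pool UUID, rank, and target index are placeholders, and the exact flags should be checked against `dmg pool exclude --help` on your build:

```
# Manually exclude a failed target from the pool; this is what triggers rebuild.
# <pool_uuid>, the rank, and the target index are placeholders.
dmg pool exclude --pool=<pool_uuid> --rank=1 --target-idx=0
```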


Re: dmg pool operation stuck

Allen
 

Hi Tom,
The same issue still exists after setting nr_xs_helpers: 0.
$ ps aux|grep daos_engine | grep -v grep
daos_de+    5301  394  0.1 135622300 771552 pts/0 RLl+ 11:45   5:43 /home/daos_debug/daos/build/bin/daos_engine -t 8 -x 0 -g daos_server -d /var/run/daos_server -s /mnt/daos -n /mnt/daos/daos_nvme.conf -I 0
 


Re: Where should I look for release notes?

Hennecke, Michael
 

Hi,

 

For roadmap and feature updates, you can check out the slides and recordings from the DAOS User Group in November (http://dug.daos.io).

 

The release notes and other documentation will be refreshed as part of the upcoming DAOS 2.0 release.

 

Best regards,

Michael

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of cheneydeng via groups.io
Sent: Wednesday, 1 December 2021 10:58
To: daos@daos.groups.io
Subject: [daos] Where should I look for release notes?

 

Hi DAOS,

I found that the release notes on GitHub are very brief. I can't find a place to track the progress of the roadmap, or to figure out whether some features I need are implemented yet. Is there any place to find such information?

DJ.

Intel Deutschland GmbH
Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Sharon Heck, Tiffany Doon Silva  
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928


Re: dmg pool operation stuck

Nabarro, Tom
 

Can you try with nr_xs_helpers: 0 in the config, please? You will need to reformat.
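For reference, the setting Tom mentions lives in the per-engine section of daos_server.yml. A minimal sketch (the surrounding values are illustrative placeholders, not taken from Allen's actual config):

```yaml
# daos_server.yml (v1.2 style): only the nr_xs_helpers line is the change
# being suggested here; the other values are placeholders.
servers:
  - targets: 8
    nr_xs_helpers: 0   # disable the extra helper XS threads
    scm_mount: /mnt/daos
```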

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of allen.zhuo@...
Sent: Wednesday, December 1, 2021 10:40 AM
To: daos@daos.groups.io
Subject: Re: [daos] dmg pool operation stuck

 

Hi Tom,
please see Attachment.


Re: dmg pool operation stuck

Allen
 

Hi Tom,
please see Attachment.


Re: dmg pool operation stuck

Nabarro, Tom
 

Hello,

 

The format is completing and the engine process is being spawned; next we need to look at the engine log, which is specified in the server config file (consult the admin guide for more details: https://docs.daos.io/admin/deployment/). Could you set the engine-specific log file, set the log mask to DEBUG, and paste your server config file here, please?
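For anyone else following the thread, the per-engine log settings Tom refers to are keys in the server config; a hedged sketch (the path is an example):

```yaml
# Per-engine logging keys in daos_server.yml (illustrative values)
servers:
  - log_file: /tmp/daos_engine.0.log   # engine-specific log file
    log_mask: DEBUG                    # raise engine log verbosity
```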

 

Regards,

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of allen.zhuo@...
Sent: Wednesday, December 1, 2021 2:30 AM
To: daos@daos.groups.io
Subject: Re: [daos] dmg pool operation stuck

 

Hi Tom,

The same issue still exists after changing the hugepagesize to 2MB.
When running dmg pool create, daos_server did not print any message, which seems abnormal. Can we add some debugging code?

$ cat /proc/meminfo | grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    4096
HugePages_Free:     3931
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         8388608 kB

Print information of dmg terminal:
daos_server@sw2:~/daos$ dmg -i storage scan
Hosts SCM Total       NVMe Total
----- ---------       ----------
sw2   0 B (0 modules) 4.0 TB (1 controller)
daos_server@sw2:~/daos$ dmg -i storage format
Format Summary:
  Hosts SCM Devices NVMe Devices
  ----- ----------- ------------
  sw2   1           1
daos_server@sw2:~/daos$ dmg -i pool create -z 100GB
Creating DAOS pool with automatic storage allocation: 100 GB NVMe + 6.00% SCM
ERROR: dmg: context deadline exceeded

The latest daos_server log:
daos_server@sw2:~/daos$ daos_server start -o ~/daos/build/etc/daos_server.yml
DAOS Server config loaded from /home/daos/daos/build/etc/daos_server.yml
daos_server logging to file /tmp/daos_server.log
DEBUG 01:58:17.438639 start.go:89: Switching control log level to DEBUG
DEBUG 01:58:17.537242 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node
DEBUG 01:58:17.537639 netdetect.go:284: initDeviceScan completed.  Depth -6, numObj 27, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 nvme2n1 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]
DEBUG 01:58:17.537780 netdetect.go:913: Calling ValidateProviderConfig with ens5f0, ofi+verbs;ofi_rxm
DEBUG 01:58:17.537805 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm
DEBUG 01:58:17.538059 netdetect.go:995: There are 0 hfi1 devices in the system
DEBUG 01:58:17.538100 netdetect.go:572: There are 2 NUMA nodes.
DEBUG 01:58:17.538121 netdetect.go:928: Device ens5f0 supports provider: ofi+verbs;ofi_rxm
DEBUG 01:58:17.539248 server.go:401: Active config saved to /home/daos/daos/build/etc/.daos_server.active.yml (read-only)
DEBUG 01:58:17.539297 server.go:113: fault domain: /sw2
DEBUG 01:58:17.539619 server.go:163: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 DisableCleanHugePages:false PCIWhitelist:0000:98:00.0 PCIBlacklist: TargetUser:daos_server ResetOnly:false DisableVFIO:false DisableVMD:true}
DEBUG 01:58:32.790943 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node
DEBUG 01:58:32.791323 netdetect.go:284: initDeviceScan completed.  Depth -6, numObj 26, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]
DEBUG 01:58:32.791406 netdetect.go:669: Searching for a device alias for: ens5f0
DEBUG 01:58:32.791447 netdetect.go:693: Device alias for ens5f0 is mlx5_0
DEBUG 01:58:32.791495 class.go:209: output bdev conf file set to /mnt/daos/daos_nvme.conf
DEBUG 01:58:33.319087 provider.go:217: bdev scan: update cache (1 devices)
DAOS Control Server v1.2 (pid 3681) listening on 0.0.0.0:10001
DEBUG 01:58:33.320410 instance_exec.go:35: instance 0: checking if storage is formatted
Checking DAOS I/O Engine instance 0 storage ...
DEBUG 01:58:33.320503 instance_storage.go:74: /mnt/daos: checking formatting
DEBUG 01:58:33.346603 instance_storage.go:90: /mnt/daos (ram) needs format: true
SCM format required on instance 0
DEBUG 01:59:04.391268 ctl_storage_rpc.go:368: received StorageScan RPC
DEBUG 01:59:04.391386 provider.go:217: bdev scan: reuse cache (1 devices)
DEBUG 01:59:04.420740 ctl_storage_rpc.go:387: responding to StorageScan RPC
DEBUG 01:59:08.933555 ctl_storage_rpc.go:407: received StorageFormat RPC ; proceeding to instance storage format
Formatting scm storage for DAOS I/O Engine instance 0 (reformat: false)
DEBUG 01:59:08.933794 instance_storage.go:74: /mnt/daos: checking formatting
DEBUG 01:59:08.961338 instance_storage.go:90: /mnt/daos (ram) needs format: true
Instance 0: starting format of SCM (ram:/mnt/daos)
Instance 0: finished format of SCM (ram:/mnt/daos)
Formatting nvme storage for DAOS I/O Engine instance 0
DEBUG 01:59:09.018278 instance_superblock.go:90: /mnt/daos: checking superblock
DEBUG 01:59:09.018801 instance_superblock.go:94: /mnt/daos: needs superblock (doesn't exist)
Instance 0: starting format of nvme block devices [0000:98:00.0]
Instance 0: finished format of nvme block devices [0000:98:00.0]
DAOS I/O Engine instance 0 storage ready
DEBUG 01:59:13.503527 instance_superblock.go:90: /mnt/daos: checking superblock
DEBUG 01:59:13.504009 instance_superblock.go:94: /mnt/daos: needs superblock (doesn't exist)
DEBUG 01:59:13.504107 instance_superblock.go:119: idx 0 createSuperblock()
DEBUG 01:59:13.504432 instance_superblock.go:149: creating /mnt/daos/superblock: (rank: NilRank, uuid: 8dd7c6e2-8b2e-43b7-b180-f68ed64e8960)
DEBUG 01:59:13.504745 instance_exec.go:62: instance start()
DEBUG 01:59:13.505003 class.go:241: create /mnt/daos/daos_nvme.conf with [0000:98:00.0] bdevs
SCM @ /mnt/daos: 137 GB Total/137 GB Avail
DEBUG 01:59:13.505327 instance_exec.go:79: instance 0: awaiting DAOS I/O Engine init
DEBUG 01:59:13.506206 exec.go:69: daos_engine:0 args: [-t 8 -x 6 -g daos_server -d /var/run/daos_server -s /mnt/daos -n /mnt/daos/daos_nvme.conf -I 0]
DEBUG 01:59:13.506300 exec.go:70: daos_engine:0 env: [CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm CRT_TIMEOUT=1200 D_LOG_MASK=DEBUG D_LOG_FILE=/tmp/daos_engine.0.log CRT_CTX_SHARE_ADDR=0 OFI_DOMAIN=mlx5_0 VOS_BDEV_CLASS=NVME OFI_INTERFACE=ens5f0 OFI_PORT=20000]
Starting I/O Engine instance 0: /home/daos/daos/build/bin/daos_engine
daos_engine:0 Using legacy core allocation algorithm
daos_engine:0 Starting SPDK v20.01.2 git sha1 b2808069e / DPDK 19.11.6 initialization...
[ DPDK EAL parameters: daos --no-shconf -c 0x1 --pci-whitelist=0000:98:00.0 --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --log-level=lib.eal:4 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid4969 ]

 


Where should I look for release notes?

cheneydeng@...
 

Hi DAOS,

I found that the release notes on GitHub are very brief. I can't find a place to track the progress of the roadmap, or to figure out whether some features I need are implemented yet. Is there any place to find such information?

DJ.


Re: dmg pool operation stuck

PATEYRON Sacha
 

On 01/12/2021 at 10:23, allen.zhuo@... wrote:
Hi Tom,
I added some debugging code and found that dmg pool create failed because the Database's replicaAddr.get() returned <nil>, so Database CheckReplica returned an error.

// CheckReplica returns an error if the node is not configured as a
// replica or the service is not running.
func (db *Database) CheckReplica() error {
        if !db.IsReplica() {
                return &ErrNotReplica{db.cfg.stringReplicas(nil)}
        }
 
        return db.raft.withReadLock(func(_ raftService) error { return nil })
}
 
Is this because I have too few daos_server nodes? I only have 1 daos_server node and only 1 NVMe SSD.
If so, can I set the DAOS replication to 1? Or does DAOS need at least 3 daos_server nodes? Or can there be only one daos_server node with at least 3 NVMe SSDs?

Hi Allen,


With my Docker integration, I only have 1 daos_server with 1 HDD.


ps aux|grep -i daos
daos_se+       1  0.0  0.0   4248  3576 pts/2    Ss   Nov30   0:00 bash
root         400  0.0  0.1 1460828 36524 pts/2   Sl   Nov30   0:28 /opt/daos/bin/daos_server start -o /home/daos/daos/utils/config/examples/daos_server_local.yml
root         527  0.0  0.0 680280 31132 pts/2    Sl   Nov30   0:00 /opt/daos/bin/daos_agent start --insecure
root         577  100  2.9 68784396 954120 pts/2 Sl   Nov30 1434:56 /opt/daos/bin/daos_engine -t 1 -x 0 -g daos_server -d /var/run/daos_server -T 2 -n /mnt/daos/daos_nvme.conf -I 0 -r 4096 -H 2 -s /mnt/daos



-- 
Sacha Pateyron
SSI/SISR/LISD


Re: dmg pool operation stuck

Allen
 

Hi Tom,
I added some debugging code and found that dmg pool create failed because the Database's replicaAddr.get() returned <nil>, so Database CheckReplica returned an error.

// CheckReplica returns an error if the node is not configured as a
// replica or the service is not running.
func (db *Database) CheckReplica() error {
        if !db.IsReplica() {
                return &ErrNotReplica{db.cfg.stringReplicas(nil)}
        }
 
        return db.raft.withReadLock(func(_ raftService) error { return nil })
}
 
Is this because I have too few daos_server nodes? I only have 1 daos_server node and only 1 NVMe SSD.
If so, can I set the DAOS replication to 1? Or does DAOS need at least 3 daos_server nodes? Or can there be only one daos_server node with at least 3 NVMe SSDs?
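A note for others who hit this ErrNotReplica path: the management-service replicas come from the access_points list in daos_server.yml, and a single replica is valid for a one-node setup as long as that entry matches this server's control-plane address. A hedged sketch (hostname and port are placeholders):

```yaml
# daos_server.yml: one MS replica is fine for a single-node system.
# "sw2" must resolve to this daos_server's control-plane address.
access_points:
  - sw2:10001
```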


Re: dmg pool operation stuck

Allen
 

Hi Tom,
The same issue still exists after changing the hugepagesize to 2MB.
When running dmg pool create, daos_server did not print any message, which seems abnormal. Can we add some debugging code?

$ cat /proc/meminfo | grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:    4096
HugePages_Free:     3931
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         8388608 kB

Print information of dmg terminal:
daos_server@sw2:~/daos$ dmg -i storage scan
Hosts SCM Total       NVMe Total
----- ---------       ----------
sw2   0 B (0 modules) 4.0 TB (1 controller)
daos_server@sw2:~/daos$ dmg -i storage format
Format Summary:
  Hosts SCM Devices NVMe Devices
  ----- ----------- ------------
  sw2   1           1
daos_server@sw2:~/daos$ dmg -i pool create -z 100GB
Creating DAOS pool with automatic storage allocation: 100 GB NVMe + 6.00% SCM
ERROR: dmg: context deadline exceeded
 
The latest daos_server log:
daos_server@sw2:~/daos$ daos_server start -o ~/daos/build/etc/daos_server.yml
DAOS Server config loaded from /home/daos/daos/build/etc/daos_server.yml
daos_server logging to file /tmp/daos_server.log
DEBUG 01:58:17.438639 start.go:89: Switching control log level to DEBUG
DEBUG 01:58:17.537242 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node
DEBUG 01:58:17.537639 netdetect.go:284: initDeviceScan completed.  Depth -6, numObj 27, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 nvme2n1 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]
DEBUG 01:58:17.537780 netdetect.go:913: Calling ValidateProviderConfig with ens5f0, ofi+verbs;ofi_rxm
DEBUG 01:58:17.537805 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm
DEBUG 01:58:17.538059 netdetect.go:995: There are 0 hfi1 devices in the system
DEBUG 01:58:17.538100 netdetect.go:572: There are 2 NUMA nodes.
DEBUG 01:58:17.538121 netdetect.go:928: Device ens5f0 supports provider: ofi+verbs;ofi_rxm
DEBUG 01:58:17.539248 server.go:401: Active config saved to /home/daos/daos/build/etc/.daos_server.active.yml (read-only)
DEBUG 01:58:17.539297 server.go:113: fault domain: /sw2
DEBUG 01:58:17.539619 server.go:163: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 DisableCleanHugePages:false PCIWhitelist:0000:98:00.0 PCIBlacklist: TargetUser:daos_server ResetOnly:false DisableVFIO:false DisableVMD:true}
DEBUG 01:58:32.790943 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node
DEBUG 01:58:32.791323 netdetect.go:284: initDeviceScan completed.  Depth -6, numObj 26, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]
DEBUG 01:58:32.791406 netdetect.go:669: Searching for a device alias for: ens5f0
DEBUG 01:58:32.791447 netdetect.go:693: Device alias for ens5f0 is mlx5_0
DEBUG 01:58:32.791495 class.go:209: output bdev conf file set to /mnt/daos/daos_nvme.conf
DEBUG 01:58:33.319087 provider.go:217: bdev scan: update cache (1 devices)
DAOS Control Server v1.2 (pid 3681) listening on 0.0.0.0:10001
DEBUG 01:58:33.320410 instance_exec.go:35: instance 0: checking if storage is formatted
Checking DAOS I/O Engine instance 0 storage ...
DEBUG 01:58:33.320503 instance_storage.go:74: /mnt/daos: checking formatting
DEBUG 01:58:33.346603 instance_storage.go:90: /mnt/daos (ram) needs format: true
SCM format required on instance 0
DEBUG 01:59:04.391268 ctl_storage_rpc.go:368: received StorageScan RPC
DEBUG 01:59:04.391386 provider.go:217: bdev scan: reuse cache (1 devices)
DEBUG 01:59:04.420740 ctl_storage_rpc.go:387: responding to StorageScan RPC
DEBUG 01:59:08.933555 ctl_storage_rpc.go:407: received StorageFormat RPC ; proceeding to instance storage format
Formatting scm storage for DAOS I/O Engine instance 0 (reformat: false)
DEBUG 01:59:08.933794 instance_storage.go:74: /mnt/daos: checking formatting
DEBUG 01:59:08.961338 instance_storage.go:90: /mnt/daos (ram) needs format: true
Instance 0: starting format of SCM (ram:/mnt/daos)
Instance 0: finished format of SCM (ram:/mnt/daos)
Formatting nvme storage for DAOS I/O Engine instance 0
DEBUG 01:59:09.018278 instance_superblock.go:90: /mnt/daos: checking superblock
DEBUG 01:59:09.018801 instance_superblock.go:94: /mnt/daos: needs superblock (doesn't exist)
Instance 0: starting format of nvme block devices [0000:98:00.0]
Instance 0: finished format of nvme block devices [0000:98:00.0]
DAOS I/O Engine instance 0 storage ready
DEBUG 01:59:13.503527 instance_superblock.go:90: /mnt/daos: checking superblock
DEBUG 01:59:13.504009 instance_superblock.go:94: /mnt/daos: needs superblock (doesn't exist)
DEBUG 01:59:13.504107 instance_superblock.go:119: idx 0 createSuperblock()
DEBUG 01:59:13.504432 instance_superblock.go:149: creating /mnt/daos/superblock: (rank: NilRank, uuid: 8dd7c6e2-8b2e-43b7-b180-f68ed64e8960)
DEBUG 01:59:13.504745 instance_exec.go:62: instance start()
DEBUG 01:59:13.505003 class.go:241: create /mnt/daos/daos_nvme.conf with [0000:98:00.0] bdevs
SCM @ /mnt/daos: 137 GB Total/137 GB Avail
DEBUG 01:59:13.505327 instance_exec.go:79: instance 0: awaiting DAOS I/O Engine init
DEBUG 01:59:13.506206 exec.go:69: daos_engine:0 args: [-t 8 -x 6 -g daos_server -d /var/run/daos_server -s /mnt/daos -n /mnt/daos/daos_nvme.conf -I 0]
DEBUG 01:59:13.506300 exec.go:70: daos_engine:0 env: [CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm CRT_TIMEOUT=1200 D_LOG_MASK=DEBUG D_LOG_FILE=/tmp/daos_engine.0.log CRT_CTX_SHARE_ADDR=0 OFI_DOMAIN=mlx5_0 VOS_BDEV_CLASS=NVME OFI_INTERFACE=ens5f0 OFI_PORT=20000]
Starting I/O Engine instance 0: /home/daos/daos/build/bin/daos_engine
daos_engine:0 Using legacy core allocation algorithm
daos_engine:0 Starting SPDK v20.01.2 git sha1 b2808069e / DPDK 19.11.6 initialization...
[ DPDK EAL parameters: daos --no-shconf -c 0x1 --pci-whitelist=0000:98:00.0 --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --log-level=lib.eal:4 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid4969 ]
 


Re: dmg pool operation stuck

PATEYRON Sacha
 

On 30/11/2021 at 12:06, Nabarro, Tom wrote:

Can you please try with a 2M hugepagesize? I'm not sure we have much test coverage using 1G hugepages, and there may be built-in assumptions that cause problems when using them.

Regards,
Tom

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of allen.zhuo@...
Sent: Tuesday, November 30, 2021 2:09 AM
To: daos@daos.groups.io
Subject: Re: [daos] dmg pool operation stuck

Hi,

The total memory of my server is 512GB, and the hugepagesize is 1GB.
$ free -h
              total        used        free      shared  buff/cache   available
Mem:          503Gi       130Gi       372Gi       132Mi       1.0Gi       370Gi
Swap:         8.0Gi          0B       8.0Gi
$ cat /proc/meminfo | grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:     128
HugePages_Free:      124
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        134217728 kB

When I set nr_hugepages: 4096 and targets: 8, the following error is printed when daos_server starts:
$ daos_server start -o ~/daos/build/etc/daos_server.yml
DAOS Server config loaded from /home/daos/daos/build/etc/daos_server.yml
daos_server logging to file /tmp/daos_server.log
DEBUG 01:52:49.731469 start.go:89: Switching control log level to DEBUG
DEBUG 01:52:49.831569 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node
DEBUG 01:52:49.831823 netdetect.go:284: initDeviceScan completed.  Depth -6, numObj 27, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 nvme2n1 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]
DEBUG 01:52:49.831859 netdetect.go:913: Calling ValidateProviderConfig with ens5f0, ofi+verbs;ofi_rxm
DEBUG 01:52:49.831876 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm
DEBUG 01:52:49.832098 netdetect.go:995: There are 0 hfi1 devices in the system
DEBUG 01:52:49.832132 netdetect.go:572: There are 2 NUMA nodes.
DEBUG 01:52:49.832155 netdetect.go:928: Device ens5f0 supports provider: ofi+verbs;ofi_rxm
DEBUG 01:52:49.833024 server.go:401: Active config saved to /home/daos/daos/build/etc/.daos_server.active.yml (read-only)
DEBUG 01:52:49.833067 server.go:113: fault domain: /sw2
DEBUG 01:52:49.833306 server.go:163: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 DisableCleanHugePages:false PCIWhitelist:0000:98:00.0 PCIBlacklist: TargetUser:daos_server ResetOnly:false DisableVFIO:false DisableVMD:true}
DEBUG 01:53:05.988214 main.go:70: server: code = 610 description = "requested 4096 hugepages; got 494"
ERROR: server: code = 610 description = "requested 4096 hugepages; got 494"
ERROR: server: code = 610 resolution = "reboot the system or manually clear /dev/hugepages as appropriate"

It looks like DAOS wants to allocate nr_hugepages * hugepagesize of memory.
$ numastat -mc | egrep "Node|Huge"
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
                 Node 0 Node 1  Total
AnonHugePages         0      0      0
HugePages_Total  253952 254976 508928
HugePages_Free   250880 254976 505856
HugePages_Surp        0      0      0

But SPDK's setup.sh does not encounter this problem.

daos_server@sw2:~/daos/build/prereq/release/spdk/share/spdk/scripts$ sudo HUGEMEM=4096 ./setup.sh
0000:31:00.0 (8086 0a55): nvme -> vfio-pci
0000:4c:00.0 (8086 0a55): nvme -> vfio-pci
0000:98:00.0 (8086 0a55): nvme -> vfio-pci
0000:00:01.0 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.1 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.2 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.3 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.4 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.5 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.6 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.7 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.0 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.1 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.2 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.3 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.4 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.5 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.6 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.7 (8086 0b00): ioatdma -> vfio-pci
daos_server@sw2:~/daos/build/prereq/release/spdk/share/spdk/scripts$ numastat -mc | egrep "Node|Huge"
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
                 Node 0 Node 1  Total
AnonHugePages         0      0      0
HugePages_Total    2048   2048   4096
HugePages_Free     2048   2048   4096
HugePages_Surp        0      0      0

Hi,

In case this helps you:

The test passes in Docker on Ubuntu 20.04.


sysctl -a | grep vm.nr_hugepages
vm.nr_hugepages = 2048
vm.nr_hugepages_mempolicy = 2048


free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi        21Gi       7.7Gi       541Mi       2.3Gi       9.0Gi
Swap:            0B          0B          0B

Best regards.

-- 
Sacha Pateyron
CEA/SSI/SISR/LISD


Re: dmg pool operation stuck

Nabarro, Tom
 

Can you please try with a 2M hugepagesize? I'm not sure we have much test coverage using 1G hugepages, and there may be built-in assumptions that cause problems when using them.

Regards,
Tom

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of allen.zhuo@...
Sent: Tuesday, November 30, 2021 2:09 AM
To: daos@daos.groups.io
Subject: Re: [daos] dmg pool operation stuck

Hi,

The total memory of my server is 512GB, and the hugepagesize is 1GB.
$ free -h
              total        used        free      shared  buff/cache   available
Mem:          503Gi       130Gi       372Gi       132Mi       1.0Gi       370Gi
Swap:         8.0Gi          0B       8.0Gi
$ cat /proc/meminfo | grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:     128
HugePages_Free:      124
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        134217728 kB

When I set nr_hugepages: 4096 and targets: 8, the following error is printed when daos_server starts:
$ daos_server start -o ~/daos/build/etc/daos_server.yml
DAOS Server config loaded from /home/daos/daos/build/etc/daos_server.yml
daos_server logging to file /tmp/daos_server.log
DEBUG 01:52:49.731469 start.go:89: Switching control log level to DEBUG
DEBUG 01:52:49.831569 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node
DEBUG 01:52:49.831823 netdetect.go:284: initDeviceScan completed.  Depth -6, numObj 27, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 nvme2n1 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]
DEBUG 01:52:49.831859 netdetect.go:913: Calling ValidateProviderConfig with ens5f0, ofi+verbs;ofi_rxm
DEBUG 01:52:49.831876 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm
DEBUG 01:52:49.832098 netdetect.go:995: There are 0 hfi1 devices in the system
DEBUG 01:52:49.832132 netdetect.go:572: There are 2 NUMA nodes.
DEBUG 01:52:49.832155 netdetect.go:928: Device ens5f0 supports provider: ofi+verbs;ofi_rxm
DEBUG 01:52:49.833024 server.go:401: Active config saved to /home/daos/daos/build/etc/.daos_server.active.yml (read-only)
DEBUG 01:52:49.833067 server.go:113: fault domain: /sw2
DEBUG 01:52:49.833306 server.go:163: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 DisableCleanHugePages:false PCIWhitelist:0000:98:00.0 PCIBlacklist: TargetUser:daos_server ResetOnly:false DisableVFIO:false DisableVMD:true}
DEBUG 01:53:05.988214 main.go:70: server: code = 610 description = "requested 4096 hugepages; got 494"
ERROR: server: code = 610 description = "requested 4096 hugepages; got 494"
ERROR: server: code = 610 resolution = "reboot the system or manually clear /dev/hugepages as appropriate"

It looks like DAOS wants to allocate nr_hugepages * hugepagesize of memory.
$ numastat -mc | egrep "Node|Huge"
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
                 Node 0 Node 1  Total
AnonHugePages         0      0      0
HugePages_Total  253952 254976 508928
HugePages_Free   250880 254976 505856
HugePages_Surp        0      0      0

But SPDK's setup.sh does not encounter this problem.

daos_server@sw2:~/daos/build/prereq/release/spdk/share/spdk/scripts$ sudo HUGEMEM=4096 ./setup.sh
0000:31:00.0 (8086 0a55): nvme -> vfio-pci
0000:4c:00.0 (8086 0a55): nvme -> vfio-pci
0000:98:00.0 (8086 0a55): nvme -> vfio-pci
0000:00:01.0 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.1 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.2 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.3 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.4 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.5 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.6 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.7 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.0 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.1 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.2 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.3 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.4 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.5 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.6 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.7 (8086 0b00): ioatdma -> vfio-pci
daos_server@sw2:~/daos/build/prereq/release/spdk/share/spdk/scripts$ numastat -mc | egrep "Node|Huge"
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
                 Node 0 Node 1  Total
AnonHugePages         0      0      0
HugePages_Total    2048   2048   4096
HugePages_Free     2048   2048   4096
HugePages_Surp        0      0      0


Re: dmg pool operation stuck

Allen
 

Hi,
After reviewing the daos_server.log, I think we can ignore the hugepage error, because the daos_server startup log is printed twice in the file. The first run set nr_hugepages to 4096, which exceeded the total memory, so it failed. I reset it to 16 and it succeeded. So we should start reading from the 15th line of the file.

DEBUG 10:39:51.086435 start.go:89: Switching control log level to DEBUG


Re: dmg pool operation stuck

Allen
 

Hi, 

The total memory of my server is 512GB, and the hugepagesize is 1GB.
$ free -h
              total        used        free      shared  buff/cache   available
Mem:          503Gi       130Gi       372Gi       132Mi       1.0Gi       370Gi
Swap:         8.0Gi          0B       8.0Gi
$ cat /proc/meminfo | grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:     128
HugePages_Free:      124
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        134217728 kB
 
When I set nr_hugepages: 4096 and targets: 8, the following error is printed when daos_server starts:
$ daos_server start -o ~/daos/build/etc/daos_server.yml
DAOS Server config loaded from /home/daos/daos/build/etc/daos_server.yml
daos_server logging to file /tmp/daos_server.log
DEBUG 01:52:49.731469 start.go:89: Switching control log level to DEBUG
DEBUG 01:52:49.831569 netdetect.go:279: 2 NUMA nodes detected with 28 cores per node
DEBUG 01:52:49.831823 netdetect.go:284: initDeviceScan completed.  Depth -6, numObj 27, systemDeviceNames [lo ens4f0 ens5f0 ens5f1 ens4f1 enx0a148ab58408], hwlocDeviceNames [dma0chan0 dma1chan0 dma2chan0 dma3chan0 dma4chan0 dma5chan0 dma6chan0 dma7chan0 enx0a148ab58408 card0 sda nvme1n1 dma8chan0 dma9chan0 dma10chan0 dma11chan0 dma12chan0 dma13chan0 dma14chan0 dma15chan0 nvme2n1 ens4f0 ens4f1 ens5f0 mlx5_0 ens5f1 mlx5_1]
DEBUG 01:52:49.831859 netdetect.go:913: Calling ValidateProviderConfig with ens5f0, ofi+verbs;ofi_rxm
DEBUG 01:52:49.831876 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm
DEBUG 01:52:49.832098 netdetect.go:995: There are 0 hfi1 devices in the system
DEBUG 01:52:49.832132 netdetect.go:572: There are 2 NUMA nodes.
DEBUG 01:52:49.832155 netdetect.go:928: Device ens5f0 supports provider: ofi+verbs;ofi_rxm
DEBUG 01:52:49.833024 server.go:401: Active config saved to /home/daos/daos/build/etc/.daos_server.active.yml (read-only)
DEBUG 01:52:49.833067 server.go:113: fault domain: /sw2
DEBUG 01:52:49.833306 server.go:163: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 DisableCleanHugePages:false PCIWhitelist:0000:98:00.0 PCIBlacklist: TargetUser:daos_server ResetOnly:false DisableVFIO:false DisableVMD:true}
DEBUG 01:53:05.988214 main.go:70: server: code = 610 description = "requested 4096 hugepages; got 494"
ERROR: server: code = 610 description = "requested 4096 hugepages; got 494"
ERROR: server: code = 610 resolution = "reboot the system or manually clear /dev/hugepages as appropriate"
 
It looks like DAOS wants to allocate nr_hugepages * Hugepagesize of memory.
$ numastat -mc | egrep "Node|Huge"
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
                 Node 0 Node 1  Total
AnonHugePages         0      0      0
HugePages_Total  253952 254976 508928
HugePages_Free   250880 254976 505856
HugePages_Surp        0      0      0
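As a rough sanity check (a sketch using the Hugepagesize value from /proc/meminfo above): with 1 GiB pages, nr_hugepages: 4096 asks for 4 TiB, which would explain why only 494 pages could be allocated on a 512 GB node.

```shell
# Sanity check: required memory = nr_hugepages * Hugepagesize.
hugepagesize_kb=1048576   # 1 GiB huge pages, per /proc/meminfo above
nr_hugepages=4096         # value set in daos_server.yml
required_gib=$(( nr_hugepages * hugepagesize_kb / 1024 / 1024 ))
echo "requested: ${required_gib} GiB of huge pages"   # 4096 GiB (4 TiB)
```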
 


But SPDK's setup.sh does not run into this problem:

daos_server@sw2:~/daos/build/prereq/release/spdk/share/spdk/scripts$ sudo HUGEMEM=4096 ./setup.sh
0000:31:00.0 (8086 0a55): nvme -> vfio-pci
0000:4c:00.0 (8086 0a55): nvme -> vfio-pci
0000:98:00.0 (8086 0a55): nvme -> vfio-pci
0000:00:01.0 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.1 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.2 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.3 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.4 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.5 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.6 (8086 0b00): ioatdma -> vfio-pci
0000:00:01.7 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.0 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.1 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.2 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.3 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.4 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.5 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.6 (8086 0b00): ioatdma -> vfio-pci
0000:80:01.7 (8086 0b00): ioatdma -> vfio-pci
daos_server@sw2:~/daos/build/prereq/release/spdk/share/spdk/scripts$ numastat -mc | egrep "Node|Huge"
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
Token Node not in hash table.
                 Node 0 Node 1  Total
AnonHugePages         0      0      0
HugePages_Total    2048   2048   4096
HugePages_Free     2048   2048   4096
HugePages_Surp        0      0      0
 


Re: dmg pool operation stuck

Nabarro, Tom
 

Hello,

 

In your server log file there are some error messages about requested huge pages. To start off with, I would recommend setting nr_hugepages: 4096 and targets: 8 in the server config file (daos_server.yml). You will need to reformat (umount /mnt/daos, then restart daos_server and run dmg storage format in another terminal). Instructions can be found in the admin guide: https://docs.daos.io/admin/deployment/
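For reference, the two settings would go into daos_server.yml roughly like this (a minimal excerpt only; surrounding fields are omitted, and the exact section name depends on the DAOS version, so treat this as a sketch):

```yaml
nr_hugepages: 4096   # total huge pages requested at server startup
servers:             # per-engine section ("engines:" in newer configs)
  - targets: 8       # number of I/O service threads per engine
```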

 

Regards,

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of allen.zhuo@...
Sent: Monday, November 29, 2021 11:23 AM
To: daos@daos.groups.io
Subject: [daos] dmg pool operation stuck

 

Hi All:
I installed DAOS from scratch on Ubuntu 20.04. When creating a pool after storage format, the create operation gets stuck.
The "dmg" terminal only printed these messages:

$ dmg -i storage format
Format Summary:
  Hosts SCM Devices NVMe Devices
  ----- ----------- ------------
  sw2   1           1
$ dmg -i storage scan
Hosts SCM Total       NVMe Total
----- ---------       ----------
sw2   0 B (0 modules) 4.0 TB (1 controller)
$ dmg -i storage query usage
Hosts SCM-Total SCM-Free SCM-Used NVMe-Total NVMe-Free NVMe-Used
----- --------- -------- -------- ---------- --------- ---------
sw2   0 B       0 B      N/A      0 B        0 B       N/A
$ dmg -i storage query
ERROR: dmg: Please specify one command of: device-health, list-devices, list-pools, target-health or usage
$ dmg -i storage query list-devices
Errors:
  Hosts Error
  ----- -----
  sw2   DAOS I/O Engine instance not started or not responding on dRPC
ERROR: dmg: 1 host had errors
$ dmg -i storage query list-pools
Errors:
  Hosts Error
  ----- -----
  sw2   DAOS I/O Engine instance not started or not responding on dRPC
ERROR: dmg: 1 host had errors
$ dmg -i pool create -z 10GB
Creating DAOS pool with automatic storage allocation: 10 GB NVMe + 6.00% SCM
ERROR: dmg: context deadline exceeded

And the "server" and "engine" did not print any messages, even though I set their log masks to debug.
I am new to DAOS. Please help me, thanks.


dmg pool operation stuck

Allen
 

Hi All:
I installed DAOS from scratch on Ubuntu 20.04. When creating a pool after storage format, the create operation gets stuck.
The "dmg" terminal only printed these messages:
$ dmg -i storage format
Format Summary:
  Hosts SCM Devices NVMe Devices
  ----- ----------- ------------
  sw2   1           1
$ dmg -i storage scan
Hosts SCM Total       NVMe Total
----- ---------       ----------
sw2   0 B (0 modules) 4.0 TB (1 controller)
$ dmg -i storage query usage
Hosts SCM-Total SCM-Free SCM-Used NVMe-Total NVMe-Free NVMe-Used
----- --------- -------- -------- ---------- --------- ---------
sw2   0 B       0 B      N/A      0 B        0 B       N/A
$ dmg -i storage query
ERROR: dmg: Please specify one command of: device-health, list-devices, list-pools, target-health or usage
$ dmg -i storage query list-devices
Errors:
  Hosts Error
  ----- -----
  sw2   DAOS I/O Engine instance not started or not responding on dRPC
ERROR: dmg: 1 host had errors
$ dmg -i storage query list-pools
Errors:
  Hosts Error
  ----- -----
  sw2   DAOS I/O Engine instance not started or not responding on dRPC
ERROR: dmg: 1 host had errors
$ dmg -i pool create -z 10GB
Creating DAOS pool with automatic storage allocation: 10 GB NVMe + 6.00% SCM
ERROR: dmg: context deadline exceeded
And the "server" and "engine" did not print any messages, even though I set their log masks to debug.
I am new to DAOS. Please help me, thanks.


[DUG'21] Slides and Videos Available Online!

Lombardi, Johann
 

Hi there,

 

The recordings and slide decks of the DUG’21 presentations are now available at http://dug.daos.io. A playlist with all the videos has also been created (see https://bit.ly/3CGzstW). As a reminder, we would like to hear from you on how to improve the DUG next year. We would appreciate it if you could take a moment to fill in the DUG feedback survey at https://bit.ly/3CzG2mb.

 

I would like to thank again all presenters and I look forward to another exciting year in the DAOS community!

 

Take care.

Johann – on behalf of the DAOS team

---------------------------------------------------------------------
Intel Corporation SAS (French simplified joint stock company)
Registered headquarters: "Les Montalets"- 2, rue de Paris,
92196 Meudon Cedex, France
Registration Number:  302 456 199 R.C.S. NANTERRE
Capital: 4,572,000 Euros

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


[DUG'21] Reminder

Lombardi, Johann
 

As a reminder, the DUG’21 is on Friday (see http://dug.daos.io for more info). I look forward to seeing you there!

 

Best regards,

Johann

 

From: <daos@daos.groups.io> on behalf of "Lombardi, Johann" <johann.lombardi@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday 25 October 2021 at 22:26
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] [DUG'21] Zoom Invite and Agenda Available!

 

Hi there,

 

The agenda for the 5th annual DAOS User Group conference is now available online: http://dug.daos.io

I am again very excited by the diversity and number of presentations this year. I wish to extend a huge thank you to all the presenters.

 

As a reminder, the DUG is virtual this year:

·         On Nov 19

·         Starts at 6:45am Pacific / 7:45am Mountain / 8:45am Central / 3:45pm CET

·         4 hours 45 minutes of live presentations

·         Please see instructions on how to join the zoom meeting

  

Hope to see you there!

 

Best regards,

Johann

 



Re: Does Daos have a plan to support disk expansion without restart?

Lombardi, Johann
 

Hi Qiu,

 

We currently don’t support adding new SSDs to an existing storage node. Assuming that you have spare DIMM slots, SCM can be expanded if you take the engine offline and do a dump/restore of the SCM content. This cannot be done online/in-place.

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of 尹秋霞 <yinqiux@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday 4 November 2021 at 09:15
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Does Daos have a plan to support disk expansion without restart

 

 

Hi Johann

 

Thanks for your reply. Yes, I mean adding new SSDs to an existing storage node. Adding SSDs is more economical than adding a new storage node, so it may be more common.

By the way, is there any way to expand SCM on the existing storage node?

      

 

 

 

Regards,

Qiu

 

At 2021-11-04 06:00:02, "Lombardi, Johann" <johann.lombardi@...> wrote:

Hi Qiu,

 

Could you please elaborate on what you mean by disk expansion? We currently have support for quiescent pool expansion by adding new storage node(s). Are you considering adding new SSDs to an existing storage node?

 

Cheers,
Johann

 

From: <daos@daos.groups.io> on behalf of 尹秋霞 <yinqiux@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday 28 October 2021 at 04:34
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Does Daos have a plan to support disk expansion without restart

 

Hi DAOS,

I found that DAOS does not support disk expansion. Is there any plan for disk expansion without a restart?

 

Regards

Qiu





 





 



DAOS Community Update / Nov'21

Lombardi, Johann
 

Hi there,

 

Please find below the DAOS community newsletter for November 2021.

 

Past Events (October)

 

Upcoming Events

  • SC’21 Tutorial (Nov 15)
    Practical Persistent Memory Programming: PMDK and DAOS
    Adrian Jackson (University of Edinburgh)
    Mohamad Chaarawi (Intel)
    Johann Lombardi (Intel)

    https://sc21.supercomputing.org/presentation/?id=tut134&sess=sess210
  • SC’21 BoF (Nov 16)
    Object-stores for HPC - a Devonian Explosion or an Extinction Event?
    Philippe Deniel, CEA
    John Bent, Seagate
    Tiago Quinto, ECMWF
    Johann Lombardi, Intel

    https://sc21.supercomputing.org/presentation/?id=bof122&sess=sess368
  • DAOS User Group (DUG’21) on Nov 19 from 8:45am to 1:30pm (Central time).
    Agenda and Zoom invite available at http://dug.daos.io
  • Intel SC21 Booth
    Dev led talk:
    DAOS Unleashes the Power in HPC Applications: QCT(Quanta) DevCloud Experience Sharing
    A Compact, Scalable, Efficient Data Collection Solution for Edge/IoT to Cloud by Zettar
    Fireside chat:
    Cambridge Service for Data Driven Discovery Enables Real-time Hospital Decision Support Systems to Improve Outcomes

 

Release

  • A new 2.0 test build (v1.3.106-tb) was tagged a few weeks ago and we are now working towards a release candidate.
  • The 2.0 release stream has been branched under release/2.0.
  • Master is now the development branch for the future 2.2 release.
  • Major recent 2.0 changes:
    • Upgrade Libfabric to v1.13.2rc1 to grab some critical rxm/verbs fixes
    • Fix overflow in SPDK when issuing NVMe unmap (aka trim) operation to large SSDs
    • Fix a PMDK issue where a transaction can return the wrong error code if yielding in the TX_STAGE_NONE callbacks
    • Improve interface detection for verbs in the agent
    • Fix ULT leaks causing OOM issues when running for a long time
    • Fix races on GetAttachInfo operation in the DAOS agent
    • Show VMD backing addresses in storage scan
    • Improve Prometheus exporter performance
    • Add many new tests
    • Avoid truncation from causing incorrect reads in dfuse
    • Add interception for mkstemp in the interception library
    • Several EC fixes
    • Move SWIM ULT to a separate core to avoid interference
    • Allow object discard on multiple objects in VOS
  • Major recent master changes:
    • Add new engine metrics for the VEA module (extent allocator)
    • Migrate several CI tests from the sockets to the tcp provider
    • Initial support for the CXI provider
    • Add build support for AlmaLinux and Rocky Linux
  • What is coming:
    • Addressing the last few 2.0 blockers

 

R&D

  • Major features under development:
    • Checksum scrubbing
    • LDMS plugin to export DAOS metrics (targeted for 2.2)
    • API to collect libdaos metrics to be integrated with Darshan (targeted for 2.2)
    • Multi-user dfuse (targeted for 2.2)
    • More aggressive caching in dfuse for AI APPs (targeted for 2.2)
    • Design for catastrophic recovery / fsck
  • Pathfinding:
    • MariaDB DAOS engine with predicate pushdown to the DAOS storage nodes
      • Prototyped DAOS MariaDB engine available here: https://github.com/daos-stack/mariadb
      • PR for pipeline API (#6238)
      • Work in progress to support pipeline API in the engine
    • Leveraging the Intel Data Streaming Accelerator (DSA) to accelerate DAOS
      • Prototype leveraging DSA for VOS aggregation

