Re: CPU NUMA node bind error


Farrell, Patrick Arthur
 

Client NUMA binding is not controlled by DAOS; it is determined by where your client application process runs, since DAOS is just a library linked into that process.

You will have to control client NUMA binding using whatever technique you would normally use, independent of DAOS. mpirun implementations generally support NUMA binding; if you're not running an MPI app, you can use something like numactl to launch your app.
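
As a rough illustration of the numactl route, here is a minimal Python sketch that looks up the NUMA node the kernel reports for a network interface in sysfs and builds a numactl command line pinning CPU and memory to that node. The helper names (nic_numa_node, numactl_command, run_bound) and the interface name "ib0" are hypothetical and chosen only for this example. It assumes a Linux client with numactl installed, and it is not a DAOS-provided mechanism.

```python
import subprocess
from pathlib import Path


def nic_numa_node(iface, sysfs_root="/sys/class/net"):
    """Return the NUMA node the kernel reports for a network interface.

    Returns None if the sysfs file is missing or unreadable, or if the
    kernel reports -1 (no NUMA affinity known for the device).
    """
    path = Path(sysfs_root) / iface / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    return node if node >= 0 else None


def numactl_command(node, app_argv):
    """Build a numactl command that pins CPU and memory to one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *app_argv]


def run_bound(iface, app_argv):
    """Launch app_argv bound to the NUMA node of iface (hypothetical helper)."""
    node = nic_numa_node(iface)
    if node is None:
        raise RuntimeError(f"could not determine NUMA node for {iface}")
    return subprocess.call(numactl_command(node, app_argv))
```

For example, run_bound("ib0", ["./my_client_app"]) would start the application under numactl on whichever node "ib0" is attached to. With an MPI launcher you would normally rely on the launcher's own binding options instead.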

For the server, the NUMA node option you describe (pinned_numa_node) is the correct method. The error you shared is not obviously related to that setting. You may want to confirm whether it is actually caused by changing the NUMA node setting.

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Wu Huijun <huijunw91@...>
Sent: Friday, October 16, 2020 8:50 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [daos] CPU NUMA node bind error
 
I found that DAOS cannot saturate the bandwidth of the IB network in our setup. We received warnings on the client side saying "No network devices bound to client NUMA node 0", so I guess this caused the sub-optimal performance.

Using commands such as daos_server network scan and daos_agent net-scan, I noticed that the only IB card is on NUMA node 1, while the client is somehow bound to NUMA node 0. This is also the case for the server. I tried using the pinned_numa_node option in the server config.yml to force it to bind to NUMA node 0. However, I got the following errors. Is there a good way to control the NUMA bindings for both the clients and the servers? Thanks to anyone who can help.

ERROR: daos_io_server:0 10/17-09:24:09.69 len-cn3 DAOS[6970/7024] bio  EMRG src/bio/bio_monitor.c:196 get_spdk_identify_ctrlr_completion() Assertion 'dev_health->bdh_io_channel != NULL' failed

daos_io_server: src/bio/bio_monitor.c:196: get_spdk_identify_ctrlr_completion: Assertion `dev_health->bdh_io_channel != ((void *)0)' failed.

ERROR: daos_io_server:0 *** Process 6970 received signal 6 ***

Associated errno: Success (0)

/usr/lib64/libpthread.so.0(+0xf5f0)[0x7f8e5adb45f0]

/usr/lib64/libc.so.6(gsignal+0x37)[0x7f8e5a164337]

/usr/lib64/libc.so.6(abort+0x148)[0x7f8e5a165a28]

ERROR: daos_io_server:0 /usr/lib64/libc.so.6(+0x2f156)[0x7f8e5a15d156]

ERROR: daos_io_server:0 /usr/lib64/libc.so.6(+0x2f202)[0x7f8e5a15d202]

/usr/local/daos/lib64/daos_srv/libbio.so(+0x1322d)[0x7f8e5b3dd22d]

/usr/local/daos/lib64/daos_srv/../../prereq/dev/spdk/lib/libspdk_thread.so.2.0(spdk_thread_poll+0xc6)[0x7f8e58dd7036]

/usr/local/daos/lib64/daos_srv/libbio.so(bio_xsctxt_free+0x28d)[0x7f8e5b3e309d]

ERROR: daos_io_server:0 /usr/local/daos/bin/daos_io_server[0x41b439]

/usr/local/daos/bin/../prereq/dev/argobots/lib/libabt.so.0(+0x1313b)[0x7f8e5ab9613b]

ERROR: daos_io_server:0 /usr/local/daos/bin/../prereq/dev/argobots/lib/libabt.so.0(+0x13811)[0x7f8e5ab96811]

instance 0 exited: instance 0 exited prematurely: /usr/local/daos/bin/daos_io_server (instance 0) exited: signal: aborted (core dumped)

ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)
