CPU NUMA node bind error


Wu Huijun
 

I found that DAOS cannot saturate the bandwidth of the IB network in our setup. The client side reports the warning "No network devices bound to client NUMA node 0", so I suspect this is what causes the sub-optimal performance.

Using commands such as daos_server network scan and daos_agent net-scan, I noticed that the only IB card is attached to NUMA node 1, while the client is somehow bound to NUMA node 0. The same is true for the server. I tried the pinned_numa_node option in the server config.yml to force it to bind to NUMA node 0, but the server then aborted with the errors below. Is there a good way to control the NUMA bindings for both the clients and the servers? Any help would be appreciated.

ERROR: daos_io_server:0 10/17-09:24:09.69 len-cn3 DAOS[6970/7024] bio  EMRG src/bio/bio_monitor.c:196 get_spdk_identify_ctrlr_completion() Assertion 'dev_health->bdh_io_channel != NULL' failed

daos_io_server: src/bio/bio_monitor.c:196: get_spdk_identify_ctrlr_completion: Assertion `dev_health->bdh_io_channel != ((void *)0)' failed.

ERROR: daos_io_server:0 *** Process 6970 received signal 6 ***

Associated errno: Success (0)

/usr/lib64/libpthread.so.0(+0xf5f0)[0x7f8e5adb45f0]

/usr/lib64/libc.so.6(gsignal+0x37)[0x7f8e5a164337]

/usr/lib64/libc.so.6(abort+0x148)[0x7f8e5a165a28]

ERROR: daos_io_server:0 /usr/lib64/libc.so.6(+0x2f156)[0x7f8e5a15d156]

ERROR: daos_io_server:0 /usr/lib64/libc.so.6(+0x2f202)[0x7f8e5a15d202]

/usr/local/daos/lib64/daos_srv/libbio.so(+0x1322d)[0x7f8e5b3dd22d]

/usr/local/daos/lib64/daos_srv/../../prereq/dev/spdk/lib/libspdk_thread.so.2.0(spdk_thread_poll+0xc6)[0x7f8e58dd7036]

/usr/local/daos/lib64/daos_srv/libbio.so(bio_xsctxt_free+0x28d)[0x7f8e5b3e309d]

ERROR: daos_io_server:0 /usr/local/daos/bin/daos_io_server[0x41b439]

/usr/local/daos/bin/../prereq/dev/argobots/lib/libabt.so.0(+0x1313b)[0x7f8e5ab9613b]

ERROR: daos_io_server:0 /usr/local/daos/bin/../prereq/dev/argobots/lib/libabt.so.0(+0x13811)[0x7f8e5ab96811]

instance 0 exited: instance 0 exited prematurely: /usr/local/daos/bin/daos_io_server (instance 0) exited: signal: aborted (core dumped)

ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)
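
To cross-check the NUMA placement reported by the scan commands mentioned above, the kernel's own view can be read from sysfs. The following is only a minimal sketch, not part of DAOS: it assumes a Linux host where InfiniBand devices appear under /sys/class/infiniband and each exposes a device/numa_node file. The function name ib_numa_nodes and the sysfs_root parameter are hypothetical names chosen for illustration.

```python
from pathlib import Path


def ib_numa_nodes(sysfs_root="/sys/class/infiniband"):
    """Map each InfiniBand device name to the NUMA node reported by sysfs.

    Assumption: every device directory contains device/numa_node.
    A value of -1 means the kernel has no NUMA affinity for the device;
    None means the file could not be read or parsed.
    """
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes  # no InfiniBand devices visible on this host
    for dev in sorted(root.iterdir()):
        numa_file = dev / "device" / "numa_node"
        try:
            nodes[dev.name] = int(numa_file.read_text().strip())
        except (OSError, ValueError):
            nodes[dev.name] = None
    return nodes


if __name__ == "__main__":
    for name, node in ib_numa_nodes().items():
        print(f"{name}: NUMA node {node}")
```

If the node printed here differs from the one the client or server reports being bound to, that matches the mismatch described in the warning above.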
