Re: CPU NUMA node bind error


Niu, Yawei
 

The assertion “bdh_io_channel != NULL” fails because a bio poll is called after the context has been freed during error cleanup. Could you open a ticket for it? Thanks!

 

Thanks

-Niu

 

From: <daos@daos.groups.io> on behalf of Wu Huijun <huijunw91@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, October 16, 2020 at 10:47 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] CPU NUMA node bind error

 

Patrick, thanks for your reply. I see. But for the server, the only change I made that triggered this error was to add "pinned_numa_node: 1" in the server config yml file...

 

Cheers,

Huijun

 

On Fri, Oct 16, 2020 at 10:30 PM Farrell, Patrick Arthur <patrick.farrell@...> wrote:

Client NUMA binding is not controlled by DAOS, it is a function of where your client application process is running (since DAOS is just a library linked in to that process).

 

You will have to control client NUMA binding using whatever technique you would normally use independent of DAOS. mpirun implementations generally support NUMA binding, or, if you're not running an MPI application, you can use something like numactl to launch it.
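As a rough illustration of the numactl approach, the Python sketch below looks up the NUMA node that the kernel reports for a device in sysfs and builds a numactl command line that pins an application's CPUs and memory to that node. This is only a sketch under stated assumptions: the helper names device_numa_node and numactl_command are hypothetical, the sysfs root /sys/class/infiniband and the device/numa_node layout are assumed to apply to your adapter, and nothing here is part of DAOS.

```python
from pathlib import Path


def device_numa_node(device, sysfs_root="/sys/class/infiniband"):
    """Return the NUMA node reported for a device, or None if unknown.

    Assumption: the device exposes <sysfs_root>/<device>/device/numa_node.
    """
    node_file = Path(sysfs_root) / device / "device" / "numa_node"
    try:
        node = int(node_file.read_text().strip())
    except (OSError, ValueError):
        # Missing file or unparsable content: affinity cannot be determined.
        return None
    # The kernel reports -1 when it has no NUMA affinity for the device.
    return node if node >= 0 else None


def numactl_command(node, app_argv):
    """Build an argv that runs app_argv with CPUs and memory bound to one node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *app_argv]
```

For example, numactl_command(1, ["./my_app"]) returns ["numactl", "--cpunodebind=1", "--membind=1", "./my_app"], where ./my_app stands in for your client application. The resulting list can be passed to subprocess.run or joined into a shell command.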

 

For the server, the NUMA node option you describe (pinned_numa_node) is the correct method. The error you shared is not obviously related to that setting, so you may want to confirm whether or not it is caused by changing the NUMA node setting.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Wu Huijun <huijunw91@...>
Sent: Friday, October 16, 2020 8:50 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [daos] CPU NUMA node bind error

 

I found that DAOS cannot saturate the bandwidth of the IB network in our setup. We received warnings on the client side saying "No network devices bound to client NUMA node 0", so I guess this caused the sub-optimal performance.

Using commands such as daos_server network scan and daos_agent net-scan, I noticed that the only IB card is attached to NUMA node 1, while the client is somehow bound to NUMA node 0. This is also the case for the server. I tried to use the option pinned_numa_node in the server config.yml to force it to bind to NUMA node 0. However, I got the following errors. Are there any good ways to control the NUMA bindings for both the clients and the servers? Thanks in advance to anyone who can help.

ERROR: daos_io_server:0 10/17-09:24:09.69 len-cn3 DAOS[6970/7024] bio  EMRG src/bio/bio_monitor.c:196 get_spdk_identify_ctrlr_completion() Assertion 'dev_health->bdh_io_channel != NULL' failed

daos_io_server: src/bio/bio_monitor.c:196: get_spdk_identify_ctrlr_completion: Assertion `dev_health->bdh_io_channel != ((void *)0)' failed.

ERROR: daos_io_server:0 *** Process 6970 received signal 6 ***

Associated errno: Success (0)

/usr/lib64/libpthread.so.0(+0xf5f0)[0x7f8e5adb45f0]

/usr/lib64/libc.so.6(gsignal+0x37)[0x7f8e5a164337]

/usr/lib64/libc.so.6(abort+0x148)[0x7f8e5a165a28]

ERROR: daos_io_server:0 /usr/lib64/libc.so.6(+0x2f156)[0x7f8e5a15d156]

ERROR: daos_io_server:0 /usr/lib64/libc.so.6(+0x2f202)[0x7f8e5a15d202]

/usr/local/daos/lib64/daos_srv/libbio.so(+0x1322d)[0x7f8e5b3dd22d]

/usr/local/daos/lib64/daos_srv/../../prereq/dev/spdk/lib/libspdk_thread.so.2.0(spdk_thread_poll+0xc6)[0x7f8e58dd7036]

/usr/local/daos/lib64/daos_srv/libbio.so(bio_xsctxt_free+0x28d)[0x7f8e5b3e309d]

ERROR: daos_io_server:0 /usr/local/daos/bin/daos_io_server[0x41b439]

/usr/local/daos/bin/../prereq/dev/argobots/lib/libabt.so.0(+0x1313b)[0x7f8e5ab9613b]

ERROR: daos_io_server:0 /usr/local/daos/bin/../prereq/dev/argobots/lib/libabt.so.0(+0x13811)[0x7f8e5ab96811]

instance 0 exited: instance 0 exited prematurely: /usr/local/daos/bin/daos_io_server (instance 0) exited: signal: aborted (core dumped)

ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)
