Re: CPU NUMA node bind error


Rosenzweig, Joel B
 

Hi Huijun,

 

On the client side, the daos_agent examines the NUMA binding associated with the PID of the client application and automatically assigns the client an interface that matches that NUMA affinity. If the client is bound to a NUMA node that has no compatible network interface, or isn’t bound at all, the agent assigns an interface from the default NUMA node. For the best performance, bind your client application to a NUMA node that matches one of the network interfaces available to the daos_agent running on your client node.
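
As a rough illustration of the selection rule described above, here is a minimal Python sketch. It is not the daos_agent code: the function name pick_interface, the dictionary layout, and the interface name "ib0" are all assumptions made up for this example.

```python
from typing import Dict, List, Optional


def pick_interface(client_numa: Optional[int],
                   ifaces_by_numa: Dict[int, List[str]],
                   default_numa: int) -> Optional[str]:
    """Pick an interface for a client, preferring the client's own NUMA node.

    Hypothetical helper: client_numa is None when the client is not bound.
    """
    # Client is bound and its node has at least one interface: use that one.
    if client_numa is not None and ifaces_by_numa.get(client_numa):
        return ifaces_by_numa[client_numa][0]
    # Unbound client, or no interface on its node: fall back to the default node.
    fallback = ifaces_by_numa.get(default_numa) or []
    return fallback[0] if fallback else None


if __name__ == "__main__":
    # Hypothetical host with a single IB card attached to NUMA node 1.
    ifaces = {1: ["ib0"]}
    print(pick_interface(1, ifaces, default_numa=1))     # matching node
    print(pick_interface(0, ifaces, default_numa=1))     # falls back to default node
    print(pick_interface(None, ifaces, default_numa=1))  # unbound client
```

In the fallback cases the client still gets an interface, but it may sit on a different NUMA node than the client, which is the situation that costs performance.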

 

If the client is bound to a NUMA node without a compatible interface, performance will suffer. I wrote about this in more detail in the /doc/admin/performance_tuning.md file, and there is additional information about the mechanism in the “Get Attach Info” section of /src/control/cmd/daos_agent/README.md. That said, you can override the automatic selection and choose a specific interface for the client by setting OFI_INTERFACE=… in the client environment.
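
For example, a launcher script could set the override before starting the client. This is only a sketch: the helper names client_env and run_client are hypothetical, and "ib0" stands in for whatever interface name the scan reports on your node.

```python
import os
import subprocess
from typing import Dict, List, Optional


def client_env(interface: str,
               base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with OFI_INTERFACE set.

    Hypothetical helper; the caller's environment is not modified.
    """
    env = dict(os.environ if base is None else base)
    env["OFI_INTERFACE"] = interface
    return env


def run_client(argv: List[str], interface: str) -> int:
    """Run the client command with the interface override and return its exit code."""
    return subprocess.run(argv, env=client_env(interface)).returncode


if __name__ == "__main__":
    # "ib0" is an assumed interface name, used here for illustration only.
    print(client_env("ib0", base={})["OFI_INTERFACE"])
```

Setting the variable per process like this keeps the override local to one client rather than changing it for the whole login session.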

 

Using the pinned_numa_node setting on the daos_server is separate from settings that affect the client side. This setting only controls how the daos_io_server processes are bound. In the ideal case, a daos_server launches up to one daos_io_server process per NUMA node / matching network interface, and the ‘pinned_numa_node’ setting instructs the daos_io_server process to bind itself to cores matching that NUMA affinity.

 

Regards,

Joel

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Niu, Yawei
Sent: Friday, October 16, 2020 11:17 AM
To: daos@daos.groups.io
Subject: Re: [daos] CPU NUMA node bind error

 

The “bdh_io_channel != NULL” assertion fires because a bio poll is called after the context has been freed during error cleanup. Could you open a ticket for it? Thanks!

 

Thanks

-Niu

 

From: <daos@daos.groups.io> on behalf of Wu Huijun <huijunw91@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, October 16, 2020 at 10:47 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] CPU NUMA node bind error

 

Patrick, thanks for your reply. I see. But for the server, the only change I made that triggered this error was to add "pinned_numa_node: 1" in the server config yml file...

 

Cheers,

Huijun

 

On Fri, Oct 16, 2020 at 10:30 PM Farrell, Patrick Arthur <patrick.farrell@...> wrote:

Client NUMA binding is not controlled by DAOS; it is a function of where your client application process is running (since DAOS is just a library linked into that process).

 

You will have to control client NUMA binding using whatever technique you would normally use, independent of DAOS. mpirun implementations generally support NUMA binding, or, if you're not running an MPI app, you can use something like numactl to run your app.
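
A minimal sketch of the numactl route, assuming a Linux host where the NIC's NUMA node is exposed at /sys/class/net/<iface>/device/numa_node. The helper names, the interface name "ib0", and the application path "./my_app" are hypothetical; --cpunodebind and --membind are standard numactl options.

```python
from pathlib import Path
from typing import List


def iface_numa_node(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Read the NUMA node of a network interface from sysfs.

    Returns -1 if the file is missing or unreadable (the kernel itself
    also reports -1 when the node is unknown).
    """
    path = Path(sysfs_root) / iface / "device" / "numa_node"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return -1


def numactl_argv(node: int, app_argv: List[str]) -> List[str]:
    """Build a command line that binds both CPUs and memory to one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *app_argv]


if __name__ == "__main__":
    node = iface_numa_node("ib0")  # assumed interface name
    if node >= 0:
        print(" ".join(numactl_argv(node, ["./my_app"])))
    else:
        print("NUMA node for ib0 is unknown; not binding")
```

The printed command can then be run directly, or passed to subprocess.run, so that the client starts on the same NUMA node as the interface.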

 

For the server, the NUMA node option you describe (pinned_numa_node) is the correct method. The error you shared is not obviously related to that setting, so you may want to confirm whether or not it is caused by changing the NUMA node setting.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Wu Huijun <huijunw91@...>
Sent: Friday, October 16, 2020 8:50 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [daos] CPU NUMA node bind error

 

I found DAOS cannot saturate the bandwidth of the IB network in our settings. We received warnings from the client-side saying "No network devices bound to client NUMA node 0" so I guess this caused the sub-optimal performance.

Using commands such as "daos_server network scan" and "daos_agent net-scan", I noticed that the only IB card is on NUMA node 1, while the client is somehow bound to NUMA node 0. This is also the case for the server. I tried to use the pinned_numa_node option in the server config.yml to force it to bind to NUMA node 0. However, I got the following errors. Are there any good ways to control the NUMA bindings for both the clients and the servers? Thanks in advance for any help.

ERROR: daos_io_server:0 10/17-09:24:09.69 len-cn3 DAOS[6970/7024] bio  EMRG src/bio/bio_monitor.c:196 get_spdk_identify_ctrlr_completion() Assertion 'dev_health->bdh_io_channel != NULL' failed

daos_io_server: src/bio/bio_monitor.c:196: get_spdk_identify_ctrlr_completion: Assertion `dev_health->bdh_io_channel != ((void *)0)' failed.

ERROR: daos_io_server:0 *** Process 6970 received signal 6 ***

Associated errno: Success (0)

/usr/lib64/libpthread.so.0(+0xf5f0)[0x7f8e5adb45f0]

/usr/lib64/libc.so.6(gsignal+0x37)[0x7f8e5a164337]

/usr/lib64/libc.so.6(abort+0x148)[0x7f8e5a165a28]

ERROR: daos_io_server:0 /usr/lib64/libc.so.6(+0x2f156)[0x7f8e5a15d156]

ERROR: daos_io_server:0 /usr/lib64/libc.so.6(+0x2f202)[0x7f8e5a15d202]

/usr/local/daos/lib64/daos_srv/libbio.so(+0x1322d)[0x7f8e5b3dd22d]

/usr/local/daos/lib64/daos_srv/../../prereq/dev/spdk/lib/libspdk_thread.so.2.0(spdk_thread_poll+0xc6)[0x7f8e58dd7036]

/usr/local/daos/lib64/daos_srv/libbio.so(bio_xsctxt_free+0x28d)[0x7f8e5b3e309d]

ERROR: daos_io_server:0 /usr/local/daos/bin/daos_io_server[0x41b439]

/usr/local/daos/bin/../prereq/dev/argobots/lib/libabt.so.0(+0x1313b)[0x7f8e5ab9613b]

ERROR: daos_io_server:0 /usr/local/daos/bin/../prereq/dev/argobots/lib/libabt.so.0(+0x13811)[0x7f8e5ab96811]

instance 0 exited: instance 0 exited prematurely: /usr/local/daos/bin/daos_io_server (instance 0) exited: signal: aborted (core dumped)

ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)
