Re: When the network port recovers from a fault, the corresponding rank cannot receive the groupupdate message, resulting in the failure to join the group normally, rpc msg OOG.


Lombardi, Johann
 

Hi there,

 

IIUC, you are running into a bug where a DAOS engine is not able to rejoin the system / CART group if you “just” unplug & replug the network cable. You then noticed that you could work around this issue by reinitializing the cart contexts, but don’t know how to do that across the board for all network contexts used by the engine. Is that correct?

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "dagouxiong2015@..." <dagouxiong2015@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday 3 August 2022 at 04:15
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] When the network port recovers from a fault, the corresponding rank cannot receive the groupupdate message, resulting in the failure to join the group normally, rpc msg OOG.

 

We tried the Mercury demo and found that initializing HG resolves this OOG problem.
We then analyzed the CART code and found that the CART context is a global context
that is used for the RPC messages of all other ranks.

What can we do if we want to initialize the CART context for a single rank, instead of for all ranks?


struct dss_module_info {

    crt_context_t       dmi_ctx;
