Re: When the network port recovers from a fault, the corresponding rank cannot receive the groupupdate message, resulting in the failure to join the group normally, rpc msg OOG.


Lombardi, Johann
 

Thanks for the clarification. Since the engine is killed and restarted in step 4, I am not sure I understand why the network contexts would need to be reinitialized. Could you please create a Jira ticket with the logs?

In the meantime, you should be able to iterate over all the network contexts in CaRT (IIRC, we keep an array with all the contexts there).
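
As a rough illustration of that kind of iteration, here is a minimal, self-contained sketch. Every name in it (struct net_ctx, for_each_ctx, reinit_cb) is a hypothetical stand-in and not CaRT's actual types or functions; it only assumes that the contexts sit in an array whose unused slots are NULL, and that some per-context action is applied through a callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a network context; NOT the real CaRT type. */
struct net_ctx {
    int idx;          /* slot index of this context */
    int reinit_count; /* how many times the callback touched it */
};

/* Per-context callback: returns 0 on success, non-zero to stop the walk. */
typedef int (*ctx_cb_t)(struct net_ctx *ctx, void *arg);

/*
 * Walk an array of context pointers, skipping empty (NULL) slots.
 * Stops at the first callback error and returns that error code;
 * returns 0 if every callback succeeded.
 */
static int for_each_ctx(struct net_ctx **ctxs, size_t n, ctx_cb_t cb, void *arg)
{
    assert(cb != NULL);
    for (size_t i = 0; i < n; i++) {
        if (ctxs[i] == NULL)
            continue;
        int rc = cb(ctxs[i], arg);
        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Example callback (hypothetical): mark the context and count the visit. */
static int reinit_cb(struct net_ctx *ctx, void *arg)
{
    int *visited = arg;

    ctx->reinit_count++;
    (*visited)++;
    return 0;
}
```

A caller would pass the context array, its length, and a callback such as reinit_cb together with a pointer to an int counter. The real per-context action would replace the counter update.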

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "dagouxiong2015@..." <dagouxiong2015@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday 4 August 2022 at 03:55
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] When the network port recovers from a fault, the corresponding rank cannot receive the groupupdate message, resulting in the failure to join the group normally, rpc msg OOG.

 

Thank you for your reply; your understanding is correct.
At that time, our sequence of operations was:
 1) Unplug the network cable
 2) Kill the engine process associated with that network port
 3) Reinsert the network cable
 4) Restore the rank through "dmg system start --ranks"
 5) DER_OOG appears in the log of the recovered rank

We noticed that this problem occurs every time we follow these steps.
