When the network port recovers from a fault, the corresponding rank cannot receive the group update message and therefore fails to rejoin the group normally (RPC messages fail with OOG).


dagouxiong2015@...
 

We tried the Mercury demo and found that reinitializing hg resolves this OOG problem.
We then analyzed the CaRT code and found that the CaRT context is a global context that is used for RPC messages to all other ranks.

If we want to reinitialize the CaRT context for a single rank rather than for all ranks, what can we do?




struct dss_module_info {
    crt_context_t       dmi_ctx;


Lombardi, Johann
 

Hi there,

 

IIUC, you are running into a bug where a DAOS engine is not able to rejoin the system / CART group if you “just” unplug & replug the network cable. You then noticed that you could work around this issue by reinitializing the cart contexts, but don’t know how to do that across the board for all network contexts used by the engine. Is that correct?

 

Cheers,

Johann

 


---------------------------------------------------------------------
Intel Corporation SAS (French simplified joint stock company)
Registered headquarters: "Les Montalets"- 2, rue de Paris,
92196 Meudon Cedex, France
Registration Number:  302 456 199 R.C.S. NANTERRE
Capital: 5 208 026.16 Euros

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


dagouxiong2015@...
 

Thank you for your reply; your understanding is correct.
Our sequence of operations was:
1) Unplug the network cable.
2) Kill the process corresponding to that network port.
3) Reinsert the network cable.
4) Restore the rank with "dmg system start --ranks".
5) DER_OOG appears in the log of the restored rank.

We found that this problem reproduces every time.



Lombardi, Johann
 

Thanks for the clarification. Since the engine is killed and restarted in step 4, I am not sure I understand why network contexts would need to be reinitialized. Could you please create a Jira ticket with the logs?

In the meantime, you should be able to iterate over all the network contexts in cart (we keep an array with all the contexts there IIRC).
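As a rough illustration of the "iterate over all the contexts" idea, here is a minimal, self-contained sketch. It does not use the real CaRT API: `struct demo_ctx`, `demo_ctx_table`, `demo_ctx_register`, `demo_ctx_reinit` and `demo_reinit_all` are all hypothetical stand-ins, and the way CaRT actually stores and rebuilds its contexts may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a network context; NOT the real crt_context_t. */
struct demo_ctx {
    int idx;        /* slot index in the registry */
    int generation; /* incremented each time the context is rebuilt */
    int in_use;     /* non-zero when the slot holds a live context */
};

#define DEMO_CTX_MAX 8

/* Hypothetical registry standing in for the list of contexts kept by the
 * transport layer (assumption: a fixed-size array). */
static struct demo_ctx demo_ctx_table[DEMO_CTX_MAX];

/* Mark a slot as holding a live context. */
static void demo_ctx_register(int idx)
{
    assert(idx >= 0 && idx < DEMO_CTX_MAX);
    demo_ctx_table[idx].idx = idx;
    demo_ctx_table[idx].generation = 0;
    demo_ctx_table[idx].in_use = 1;
}

/* Hypothetical per-context reset. A real implementation would tear down and
 * recreate the underlying transport state; here we only bump a counter. */
static int demo_ctx_reinit(struct demo_ctx *ctx)
{
    if (ctx == NULL || !ctx->in_use)
        return -1;
    ctx->generation++;
    return 0;
}

/* Visit every live context and reinitialize it.
 * Returns the number of contexts that were reinitialized. */
static int demo_reinit_all(void)
{
    int count = 0;

    for (size_t i = 0; i < DEMO_CTX_MAX; i++) {
        if (!demo_ctx_table[i].in_use)
            continue; /* skip empty slots */
        if (demo_ctx_reinit(&demo_ctx_table[i]) == 0)
            count++;
    }
    return count;
}
```

In this sketch a caller would register its contexts once and call `demo_reinit_all()` after detecting that the link has come back; in a real engine each context is typically owned by its own thread, so the reset would presumably have to be scheduled on that thread rather than run from a single loop.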

 

Cheers,

Johann

 



dagouxiong2015@...
 

It is the CaRT context of the healthy processes that needs to be reinitialized, so that they can communicate normally with the process that was restarted after the failure.

Thank you for your reply!