infinite loop in daos_test


Kevan Rehm
 

I have run into an infinite-loop problem with daos_test. Is this already a known problem? If not, I’m willing to open a Jira on it, but I’d like some input first on how the code is intended to work. If there is a workaround, I’d be interested in that as well.

 

My config is verbs;ofi_rxm, 6 server nodes, 1 client node, fake SCM (ram), fake NVMe (file).  I run daos_test by hand on the client node.

 

The infinite loop is in run_daos_degraded_test() in daos_test. Just prior to this test, the previous test intentionally kills two of the six daos_io_servers. The test startup code for run_daos_degraded_test() then tries to create a pool. This fails because the Management Service Replica (MSR) is unable to communicate with the two dead servers; it reports “No route to host” in its log, which makes sense. The MSR then returns DER_UNREACH as the result of the failed RPC attempt to create a pool.

 

In the client, routine mgmt._rsvc_client_complete_rpc() calls rsvc_client_complete_rpc() which returns RSVC_CLIENT_PROCEED because it took the branch:

        } else if (hint == NULL || !(hint->sh_flags & RSVC_HINT_VALID)) {
                /* This may happen if the service wasn't found. */
                D_DEBUG(DB_MD, "\"leader\" reply without hint from rank %u: "
                        "rc_svc=%d\n", ep->ep_rank, rc_svc);
                return RSVC_CLIENT_PROCEED;

 

Because of the above, routine mgmt._rsvc_client_complete_rpc() then enters the following if statement:

        if (rc == RSVC_CLIENT_RECHOOSE ||
            (rc == RSVC_CLIENT_PROCEED && daos_rpc_retryable_rc(rc_svc))) {
                rc = tse_task_reinit(task);
                if (rc != 0)
                        return rc;
                return RSVC_CLIENT_RECHOOSE;
        }

because DER_UNREACH is considered a retryable return code by daos_rpc_retryable_rc(). The task gets rescheduled, dc_pool_create() gets called again, and this goes round and round forever.

 

Note that the DER_UNREACH is for one of the servers that the MSR is trying to contact, not the MSR itself. The RECHOOSE does not select a different server; it picks the same (only) MSR every time, which is surprising, to me at least.
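
To make the cycle concrete, here is a small stand-alone C model of the behavior described above. It is only a sketch, not DAOS code: every sim_* name and SIM_* value is a hypothetical stand-in, the MSR is reduced to a function that fails while any target server is dead, and the max_attempts cap exists only so the model terminates (the real client has no such cap in the path I traced).

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the real constants live in the DAOS headers. */
enum { SIM_PROCEED = 0, SIM_RECHOOSE = 1 };
enum { SIM_OK = 0, SIM_UNREACH = -1 };

/* Stand-in for daos_rpc_retryable_rc(): DER_UNREACH counts as retryable. */
static bool sim_retryable(int rc_svc)
{
        return rc_svc == SIM_UNREACH;
}

/* Stand-in for the single MSR: pool create fails while any target is dead. */
static int sim_msr_pool_create(int dead_servers)
{
        return dead_servers > 0 ? SIM_UNREACH : SIM_OK;
}

/*
 * Models the completion path: the reply carries no valid hint, so the
 * generic handler yields PROCEED, and PROCEED plus a retryable rc_svc
 * is turned into RECHOOSE.
 */
static int sim_complete_rpc(int rc_svc)
{
        int rc = SIM_PROCEED;

        if (rc == SIM_RECHOOSE ||
            (rc == SIM_PROCEED && sim_retryable(rc_svc)))
                return SIM_RECHOOSE;
        return SIM_PROCEED;
}

/* Returns how many create attempts were made, capped at max_attempts. */
int sim_pool_create_attempts(int dead_servers, int max_attempts)
{
        int attempts = 0;

        while (attempts < max_attempts) {
                int rc_svc = sim_msr_pool_create(dead_servers);

                attempts++;
                if (sim_complete_rpc(rc_svc) != SIM_RECHOOSE)
                        break;
                /* RECHOOSE: there is only one MSR, so it is picked again. */
        }
        return attempts;
}
```

In this model, sim_pool_create_attempts(2, 1000) uses up all 1000 attempts because nothing in the loop changes the outcome of the next try, while sim_pool_create_attempts(0, 1000) stops after a single attempt.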

 

So what is the bug exactly? Was the MSR supposed to go ahead and create the pool anyway with two missing servers? Is DER_UNREACH the wrong error code for the MSR to return? Does it make sense for RECHOOSE to pick the same server over and over?

 

Comments welcome,

 

Kevan

 
