Client application single value KV Put high latency using multiple threads (pthread)


Lombardi, Johann
 

Hi Ping,

 

The daos_agent is actually only needed on the client nodes (where the APP runs).

Please find attached some configuration files that I use on an Ethernet network with the tcp provider.
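For anyone following the thread without the attachments, here is an illustrative sketch of what the three files contain. This is not the attached files: key names follow the DAOS 1.x configuration templates, while hostnames, interface names, and paths are placeholders mapped onto the Test46/Test48/Test60 setup discussed in this thread.

```yaml
# daos_server.yml (on Test46 and Test48) -- sketch only
name: daos_server
access_points: ['Test46']          # the same management-service node listed on every server
port: 10001
provider: ofi+tcp;ofi_rxm
servers:
  - targets: 4
    fabric_iface: enp24s0f0        # placeholder; use your 100Gbps data-plane interface
    fabric_iface_port: 31416
    scm_mount: /mnt/daos

# daos_agent.yml (on every client node, e.g. Test60)
access_points: ['Test46']
port: 10001
runtime_dir: /var/run/daos_agent

# daos_control.yml (wherever you run dmg)
hostlist: ['Test46:10001', 'Test48:10001']
port: 10001
```

The point being that access_points is the same short list everywhere, while hostlist (in daos_control.yml only) enumerates all storage servers.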

I will collect some numbers later this week with “daos pool autotest” so that we can hopefully compare them with your results.

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "ping.wong via groups.io" <ping.wong@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday 4 February 2021 at 22:20
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Client application single value KV Put high latency using multiple threads (pthread)

 


Hi Johann,

I ran the servers (Test46, Test48) and the client (Test60) on different nodes.  They are all running in the foreground, including daos_server and daos_agent.  On each node, I press Ctrl-C to stop them and then restart them individually.  There is only one agent running on each node.  Is there any information cached somewhere that I need to remove before restarting the servers and agents?

My understanding is that each node should have an agent running.  In my case, I have 3 agents, one running on each node.

Please give me examples: what should I set in each of the daos_agent.yml, daos_server.yml and daos_control.yml files on each node in terms of access_points and hostlist?  I'd like to set up the servers (Test46 and Test48) in a replicated cluster.  The client is Test60.

I must have misconfigured my environment.  Please correct me.

Thanks
Ping

---------------------------------------------------------------------
Intel Corporation SAS (French simplified joint stock company)
Registered headquarters: "Les Montalets"- 2, rue de Paris,
92196 Meudon Cedex, France
Registration Number:  302 456 199 R.C.S. NANTERRE
Capital: 4,572,000 Euros

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.




Lombardi, Johann
 

Hi Ping,

 

Right, no endpoint configuration is needed on the client side; the agent fetches everything (except the access points and certificate) by connecting to the servers.

Are you sure that you don’t have (inadvertently) multiple agents running?

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "ping.wong via groups.io" <ping.wong@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday 3 February 2021 at 17:46
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Client application single value KV Put high latency using multiple threads (pthread)

 

Hi Johann,

Since I ran both servers as root, I did umount /mnt/root and rm -rf /mnt/root before restarting both servers and agents.  Then I did the storage format.  After the format, I restarted both servers again to make sure the configuration persisted.  Both servers rejoin the domain and seem to restart OK.

There are no configuration changes on the client side, correct?  The errors shown come from the client log.  The client did not detect the provider change and continues to use ofi+sockets, as you observed.

Ping



ping.wong@...
 

For testing purposes, I run the servers and agents in the foreground.  I press Ctrl-C to stop the servers and agents.  Then I start the servers one after the other, restart all agents on the servers, and start the client agent last.

Ping


ping.wong@...
 

Hi Mohamad,

On the client node, I stopped the old agent and restarted the agent.

Ping 


Chaarawi, Mohamad
 

Hi Ping,

 

Did you restart the agent on the client side or did you have an older agent running?

 

Thanks,

Mohamad

 

From: daos@daos.groups.io <daos@daos.groups.io> on behalf of ping.wong via groups.io <ping.wong@...>
Date: Wednesday, February 3, 2021 at 10:46 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Client application single value KV Put high latency using multiple threads (pthread)





Lombardi, Johann
 

Hey Ping,

 

I did look into your logs and notice messages like “Could not lookup ofi+sockets://11.11.200.48:31416” which mean that sockets URIs (instead of tcp) are still registered and storage nodes haven’t registered the new tcp-based URIs yet. Please make sure to stop the servers, umount /mnt/daos* (and wipefs -a /dev/pmem* if you use pmem) before restarting the servers.
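Spelled out as a command sequence, the clean-restart steps above look roughly like this (a sketch only: mount points and device names are taken from this thread and may differ on your system; run as root):

```shell
# On each storage node, after stopping daos_server (Ctrl-C if in the foreground):
umount /mnt/daos*          # discard the old SCM/tmpfs contents
wipefs -a /dev/pmem*       # pmem systems only: wipe the persistent-memory devices

# Restart daos_server on all storage nodes, then reformat from the admin node:
dmg storage format

# Finally, (re)start daos_agent on the client node(s).
```

After the format, the servers re-register their URIs with the new provider, so clients no longer see stale sockets:// addresses.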

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "ping.wong via groups.io" <ping.wong@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday 3 February 2021 at 17:05
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Client application single value KV Put high latency using multiple threads (pthread)

 

Hi Johann,

I did reformat and restarted all agents on the client node and the two servers.  Both servers start fine using the ofi+tcp;ofi_rxm provider; however, the client application failed.  Please refer to the errors in my previous email (marked with ****).   For now, I can only get the ofi+sockets provider to work reliably.  Are there any additional parameter settings in any of the yaml files (daos_server.yml, daos_control.yml, daos_agent.yml, etc.) that I need to change besides switching from ofi+sockets to ofi+tcp;ofi_rxm? Any other environment variables to set?

Ping





Lombardi, Johann
 

Hi Ping,

 

Sorry, I should have provided more details in my previous email. After switching to ofi+tcp;ofi_rxm in the config file, you will have to reformat and restart the agent since we don’t support live provider change yet. It would be great if you could provide me with the output of “daos pool autotest” with both ofi+sockets and ofi+tcp;ofi_rxm so that I can compare it with results that I have on my side with 40Gbps.

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "ping.wong via groups.io" <ping.wong@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday 3 February 2021 at 08:13
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Client application single value KV Put high latency using multiple threads (pthread)

 


Hi Johann,

I have the control interface on 10Gbps Ethernet and the data plane interface on 100Gbps Ethernet.

Per your recommendation, I tried ofi+tcp;ofi_rxm; however, the client application failed (errors marked with ******).

 

Server1 - connected ofi+tcp;ofi_rxm

DEBUG 01:02:25.452378 mgmt_system.go:183: processing 1 join requests
DEBUG 01:02:25.458189 mgmt_system.go:255: updated system member: rank 0, uri ofi+tcp;ofi_rxm://11.11.200.46:31416, Joined->Joined
daos_io_server:0 DAOS I/O server (v1.1.2.1) process 215563 started on rank 0 with 4 target, 2 helper XS, firstcore 0, host test46.autocache.com.

Server2 - connected ofi+tcp;ofi_rxm

DEBUG 01:02:09.275423 raft.go:204: no known peers, aborting election:
DEBUG 01:02:09.911677 instance_drpc.go:66: DAOS I/O Server instance 0 drpc ready: uri:"ofi+tcp;ofi_rxm://11.11.200.48:31416" nctxs:7 drpcListenerSock:"/tmp/daos_sockets/daos_io_server_28178.sock" ntgts:4
DEBUG 01:02:09.914435 system.go:155: DAOS system join request: sys:"daos_server" uuid:"e32fcef5-c6c4-491f-a25b-f21ae4d3a75f" rank:1 uri:"ofi+tcp;ofi_rxm://11.11.200.48:31416" nctxs:7 addr:"0.0.0.0:10001" srvFaultDomain:"/test48.sdmsl.net"
DEBUG 01:02:09.915330 rpc.go:213: request hosts: [test46:10001 test48:10001 test62:10001]
daos_io_server:0 DAOS I/O server (v1.1.2.1) process 28178 started on rank 1 with 4 target, 2 helper XS, firstcore 1, host test48.sdmsl.net.

 

Client Failed
=================
DAOS Flat KV test..
=================
[==========] Running 1 test(s).
setup: creating pool, SCM size=4 GB, NVMe size=16 GB
setup: created pool a9177073-f014-477b-9ad1-5fe36d334f07
setup: connecting to pool
daos_pool_connect failed, rc: -1020                                *******************************
[  FAILED  ] GROUP SETUP
[  ERROR   ] DAOS KV API tests
state not set, likely due to group-setup issue
[==========] 0 test(s) run.
[  PASSED  ] 0 test(s).
daos_fini() failed with -1001

 

This is part of the client log (with errors):

02/03-01:04:24.76 test62 DAOS[29842/29842] mgmt DBUG src/mgmt/cli_mgmt.c:192 fill_sys_info() GetAttachInfo Provider: ofi+tcp;ofi_rxm, Interface: enp24s0f0, Domain: enp24s0f0,CRT_CTX_SHARE_ADDR: 0, CRT_TIMEOUT: 0
                                                                 ...
02/03-01:04:32.78 test62 DAOS[29842/29842] external ERR  # NA -- Error -- /home/ssgroot/git/daos/build/external/dev/mercury/src/na/na_ofi.c:3431
 # na_ofi_addr_lookup(): Unrecognized provider type found from: sockets://11.11.200.48:31416
02/03-01:04:32.78 test62 DAOS[29842/29842] external ERR  # HG -- Error -- /home/ssgroot/git/daos/build/external/dev/mercury/src/mercury_core.c:1220
 # hg_core_addr_lookup(): Could not lookup address ofi+sockets://11.11.200.48:31416 (NA_INVALID_ARG)
02/03-01:04:32.78 test62 DAOS[29842/29842] external ERR  # HG -- Error -- /home/ssgroot/git/daos/build/external/dev/mercury/src/mercury_core.c:3850
 # HG_Core_addr_lookup2(): Could not lookup address
02/03-01:04:32.78 test62 DAOS[29842/29842] external ERR  # HG -- Error -- /home/ssgroot/git/daos/build/external/dev/mercury/src/mercury.c:1490
 # HG_Addr_lookup2(): Could not lookup ofi+sockets://11.11.200.48:31416 (HG_INVALID_ARG) ************************************************************************************************
02/03-01:04:32.78 test62 DAOS[29842/29842] rpc  ERR  src/cart/crt_rpc.c:1038 crt_req_hg_addr_lookup() HG_Addr_lookup2() failed. uri=ofi+sockets://11.11.200.48:31416, hg_ret=11 **********************************
02/03-01:04:32.78 test62 DAOS[29842/29842] rpc  ERR  src/cart/crt_rpc.c:1133 crt_req_send_internal() crt_req_hg_addr_lookup() failed, rc -1020, opc: 0x1010003.
02/03-01:04:32.78 test62 DAOS[29842/29842] rpc  ERR  src/cart/crt_rpc.c:1234 crt_req_send(0x1f7ea90) [opc=0x1010003 (DAOS) rpcid=0x636fb8e100000000 rank:tag=1:0] crt_req_send_internal() failed, DER_HG(-1020): 'Transport layer mercury error'
02/03-01:04:32.78 test62 DAOS[29842/29842] rpc  DBUG src/cart/crt_rpc.c:1580 timeout_bp_node_exit(0x1f7ea90) [opc=0x1010003 rpcid=0x636fb8e100000000 rank:tag=1:0] exiting the timeout binheap.
02/03-01:04:32.78 test62 DAOS[29842/29842] rpc  DBUG src/cart/crt_context.c:629 crt_req_timeout_untrack(0x1f7ea90) [opc=0x1010003 rpcid=0x636fb8e100000000 rank:tag=1:0] decref to 4.
02/03-01:04:32.78 test62 DAOS[29842/29842] rpc  DBUG src/cart/crt_context.c:1017 crt_context_req_untrack(0x1f7ea90) [opc=0x1010003 rpcid=0x636fb8e100000000 rank:tag=1:0] decref to 3.
02/03-01:04:32.78 test62 DAOS[29842/29842] rpc  ERR  src/cart/crt_context.c:309 crt_rpc_complete(0x1f7ea90) [opc=0x1010003 (DAOS) rpcid=0x636fb8e100000000 rank:tag=1:0] failed, DER_HG(-1020): 'Transport layer mercury error'
02/03-01:04:32.78 test62 DAOS[29842/29842] rpc  DBUG src/cart/crt_context.c:316 crt_rpc_complete(0x1f7ea90) [opc=0x1010003 rpcid=0x636fb8e100000000 rank:tag=1:0] Invoking RPC callback (rank 1 tag 0) rc: DER_HG(-1020): 'Transport layer mercury error'
02/03-01:04:32.78 test62 DAOS[29842/29842] rpc  DBUG src/cart/crt_context.c:321 crt_rpc_complete(0x1f7ea90) [opc=0x1010003 rpcid=0x636fb8e100000000 rank:tag=1:0] decref to 2.
02/03-01:04:32.78 test62 DAOS[29842/29842] rpc  DBUG src/cart/crt_rpc.c:1260 crt_req_send(0x1f7ea90) [opc=0x1010003 rpcid=0x636fb8e100000000 rank:tag=1:0] decref to 1.
02/03-01:04:32.78 test62 DAOS[29842/29842] mgmt DBUG src/mgmt/cli_mgmt.c:808 dc_mgmt_get_pool_svc_ranks() a9177073: daos_rpc_send_wait() failed, DER_HG(-1020): 'Transport layer mercury error'
02/03-01:04:32.78 test62 DAOS[29842/29842] rpc  DBUG src/cart/crt_rpc.c:537 crt_req_decref(0x1f7ea90) [opc=0x1010003 rpcid=0x636fb8e100000000 rank:tag=1:0] decref to 0.
02/03-01:04:32.78 test62 DAOS[29842/29842] hg   DBUG src/cart/crt_hg.c:971 crt_hg_req_destroy(0x1f7ea90) [opc=0x1010003 rpcid=0x636fb8e100000000 rank:tag=1:0] destroying

02/03-01:04:32.78 test62 DAOS[29842/29842] crt  ERR  src/cart/crt_init.c:537 crt_finalize() cannot finalize, current ctx_num(1).    ***********************************
02/03-01:04:32.78 test62 DAOS[29842/29842] crt  ERR  src/cart/crt_init.c:596 crt_finalize() crt_finalize failed, rc: -1001.
02/03-01:04:32.78 test62 DAOS[29842/29842] client ERR  src/client/api/event.c:147 daos_eq_lib_fini() failed to shutdown crt: DER_NO_PERM(-1001): 'Operation not permitted'
02/03-01:04:32.78 test62 DAOS[29842/29842] client ERR  src/client/api/init.c:267 daos_fini() failed to finalize eq: DER_NO_PERM(-1001): 'Operation not permitted' ******************


I cannot find any documentation in the Deployment Guide about ofi+tcp;ofi_rxm settings on the server side or the client side.   Perhaps I missed some settings in one of the .yml files.


Thanks
Ping

 




Lombardi, Johann
 

One roundtrip from the client to the leader, and then one from the leader to each other replica (i.e. one roundtrip for 2-way replication, since the leader is itself a replica). Please check Liang’s paper (i.e. https://link.springer.com/chapter/10.1007/978-3-030-63393-6_22) for more information.

I am still interested in the type of network that you use. 40Gbps Ethernet with ofi+sockets provider? If so, you may want to try with ofi+tcp;ofi_rxm too.

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "ping.wong via groups.io" <ping.wong@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Tuesday 2 February 2021 at 16:30
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Client application single value KV Put high latency using multiple threads (pthread)

 

Hi Johann,

Thanks for the tips.  Could you tell me how many roundtrip RPCs are involved from the client's perspective?  Also, how many RPCs are involved between the leader and the replica?

Thanks
Ping





Lombardi, Johann
 

Hi Ping,

 

Those latency numbers are indeed way higher than what we expect/see. Could you please advise what type of network and network provider you are using?
If you are on a recent master, could you please create a pool and give “daos pool autotest --pool $PUUID” a try?
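For completeness, the steps could look like this (a sketch only: the pool sizes are arbitrary, and exact dmg flags may vary by DAOS version):

```shell
# Create a small pool from the admin node (sizes are placeholders):
dmg pool create --scm-size=4G --nvme-size=16G

# Note the pool UUID printed by the command above, then run the autotest
# from a client node:
daos pool autotest --pool $PUUID
```

The autotest reports per-operation latencies, which makes results comparable across networks and providers.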

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "ping.wong via groups.io" <ping.wong@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday 31 January 2021 at 22:16
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Client application single value KV Put high latency using multiple threads (pthread)

 

Hi all,

To evaluate replication performance, I wrote a client application with multiple pthreads (scheduled to run on different cores where possible) that uses daos_kv_put with the async event API against a 2-server cluster.

One server has 44 cores and the other server has 88 cores.
The client runs on a different node with 48 cores.  The leader server replicates to the replica server.  I notice that the leader role switches between the two servers.

To find out why the client sees high latency, I added some timing counters to track the duration of KV puts in the I/O servers (please refer to the third column of the table below).

The client test application calls daos_kv_put(oh, DAOS_TX_NONE, 0, key, buf_size, buf, &ev) with an async event,
then calls daos_event_test(&ev, DAOS_EQ_WAIT, &ev_flag) to wait for I/O completion.

Each pthread writes 1000 values (4K each) to a different key of the same object; hence 1000 calls to daos_kv_put per thread.
Increasing the number of threads in the client application, I observe higher latency from the client's perspective (see the table below).
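The per-thread loop described above looks roughly like this (an illustrative sketch against the DAOS C API, not the actual test code: error handling is trimmed, NUM_VALUES/BUF_SIZE and the key format are made up, and each thread is assumed to create its own event queue; it needs daos_init() and an open KV object handle, so it only runs against a live DAOS pool):

```c
#include <stdio.h>
#include <stdbool.h>
#include <pthread.h>
#include <daos.h>

#define NUM_VALUES 1000   /* illustrative: 1000 puts per thread */
#define BUF_SIZE   4096   /* 4K value, as in the test above */

/* arg is a pointer to the open daos_kv object handle shared by all threads */
static void *put_worker(void *arg)
{
        daos_handle_t oh = *(daos_handle_t *)arg;
        daos_handle_t eqh;
        daos_event_t  ev;
        char          key[64];
        static char   buf[BUF_SIZE];
        bool          ev_flag;

        daos_eq_create(&eqh);                 /* private event queue per thread */
        for (int i = 0; i < NUM_VALUES; i++) {
                /* each thread writes its own key range on the same object */
                snprintf(key, sizeof(key), "key-%lu-%d",
                         (unsigned long)pthread_self(), i);
                daos_event_init(&ev, eqh, NULL);
                daos_kv_put(oh, DAOS_TX_NONE, 0, key, BUF_SIZE, buf, &ev);
                /* DAOS_EQ_WAIT blocks until this put completes */
                daos_event_test(&ev, DAOS_EQ_WAIT, &ev_flag);
                daos_event_fini(&ev);
        }
        daos_eq_destroy(eqh, 0);
        return NULL;
}
```

Note that because each thread waits for its put before issuing the next one, per-put latency translates directly into per-thread throughput.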

In the 100-thread test case, the I/O server shows higher latency as well.

Am I missing something critical?
Is the overhead caused by the daos library?
Is there a way to reduce the client-side latency overhead?

Thanks
Ping


Number client threads | Number of daos_kv_put | daos_io_server average put duration | client average put duration
----------------------|-----------------------|-------------------------------------|-----------------------------
                    5 |                 1,000 |                             0.28 ms |                     1.05 ms
                   10 |                 1,000 |                             0.27 ms |                     1.65 ms
                   15 |                 1,000 |                             0.41 ms |                     2.20 ms
                   20 |                 1,000 |                             0.48 ms |                     2.86 ms
                  100 |                 1,000 |                             7.02 ms |                    11.45 ms



 

 


