POSIX file system tests.
Colin Ngam
Greetings,
Just a quick question.
Which regression tests are executed to test DAOS POSIX functionality?
Thanks.
Colin
Re: Need advice debugging "No route to host" failures
Kevan Rehm
Sean,
If you open a libfabric ticket, please post it back here so that we can follow. Thanks, Kevan
On 4/17/20, 6:59 PM, daos@daos.groups.io on behalf of "Hefty, Sean" <sean.hefty@...> wrote:
> Interesting finds. I will let Sean comment on expected OFI behavior in case of running out of hugepages, but to me it sounds like it should have switched and used regular memory in future calls.
Yes, this sounds ideal. I'll look into this. - Sean
Re: How to figure out the all the daos_io_server are talking?
So when you use ctrl-c we are unceremoniously SIGKILLing the IO servers (graceful shutdown currently doesn’t work for various reasons and the control plane change to enable it was reverted). The system membership won’t get updated until a query is issued so when you restart and rejoin happens, the status is still “Started” and you get the below message. This is a cosmetic error and notifies us that the state wasn’t updated on stop. Correct states will be reported on a system query.
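For reference, the current member states can be confirmed with a query along these lines (the access point host name is just an example from this thread):

  dmg -l delphi-006 system query --verbose    # lists each rank, its address and its last recorded state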
I’m working on updates in this area to decouple flow-control of instances on the same host in order to lift associated limitations and enable independent start and reformat of instances regardless of state of other instances on the same host.
I will fix this error in the process.
Tom
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi Tom,
So, I can repeat, stop and restart the servers:
DEBUG 19:32:47.524108 mgmt_system.go:147: updated system member: rank 2, addr 172.30.222.230:10001, Started->Started ERROR: unexpected same state in rank 2 update (Started->Started) DEBUG 19:32:47.609783 mgmt_system.go:147: updated system member: rank 3, addr 172.30.222.230:10001, Started->Started
However, if I shut down the server and execute “dmg -l delphi-006 system query --verbose”, it knows that the servers have terminated (it does send out a ping): DEBUG 19:33:24.566669 ctl_system.go:173: Responding to SystemQuery RPC DEBUG 19:34:59.186204 mgmt_system.go:147: updated system member: rank 4, addr 172.30.222.236:10001, Stopped->Started DEBUG 19:34:59.271000 mgmt_system.go:147: updated system member: rank 5, addr 172.30.222.236:10001, Stopped->Started
Maybe some heartbeat would help, etc.
Thanks.
Colin
From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
It is not a 100% hit either ..
DEBUG 18:47:46.314720 ctl_system.go:173: Responding to SystemQuery RPC DEBUG 18:49:02.859993 mgmt_system.go:147: updated system member: rank 4, addr 172.30.222.236:10001, Stopped->Started DEBUG 18:49:02.945670 mgmt_system.go:147: updated system member: rank 5, addr 172.30.222.236:10001, Stopped->Started
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
That’s unexpected, I will try to reproduce over the weekend.
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi Tom,
I have a host that runs ranks 4 and 5. I killed the servers with Control-C and restarted the servers after maybe 3-5 minutes. The access host did not see any server going away. After restart I see:
DEBUG 14:56:53.800937 mgmt_system.go:147: updated system member: rank 4, addr 172.30.222.236:10001, Started->Started ERROR: unexpected same state in rank 4 update (Started->Started) DEBUG 14:56:53.888180 mgmt_system.go:147: updated system member: rank 5, addr 172.30.222.236:10001, Started->Started ERROR: unexpected same state in rank 5 update (Started->Started)
Is this good or bad?
Today’s master tree.
Thanks.
Colin
From: "Ngam, Colin" <colin.ngam@...>
Hi Tom,
Thanks Tom. I will see if I can replicate the issue.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
The superblock stores allocated ranks:
[tanabarr@wolf-71 daos_m]$ cat /mnt/daos0/superblock
version: 0
uuid: 8359d5d6-1d8f-430c-90db-205fc07113bf
system: daos_server
rank: 2
validrank: true
ms: false
createms: false
bootstrapms: false
On start-up, non-MS daos_server will supply the superblock rank in the join request to the system. The iosrv/mgmt maintains rank:uuid mappings and should allow the rank to rejoin the system given the expected uuid is supplied along with the rank in the request.
I can’t recreate the behavior you are describing at the moment; if I stop and then start a non-MS daos_server, it joins with the same rank it was previously assigned. (I assume you are not reformatting.)
If you can repeat that behavior, please send me the logs. We should be able to get some relevant information from the mgmt module in the server log, from the function ds_mgmt_join_handler().
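If it helps while gathering the logs, the rank persisted locally and the join handling in the server log can be checked with something like the following (the mount point and log path are illustrative, not fixed locations):

  grep -E 'uuid|rank' /mnt/daos0/superblock      # rank and uuid the server will present on (re)join
  grep ds_mgmt_join_handler /tmp/daos_server.log # join requests as seen by the mgmt module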
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi,
I also noticed the following:
My configuration file does not have any “rank” specified; there are 2 daos_io_server instances per host. They came up and I see them as ranks 0..3. All fine. I killed ranks 2 and 3 (same host) with Control-C and restarted daos_server; they came up registering as ranks 4 and 5. Is rank persistent? If it is, what is the association between rank number and daos_io_server? Host name, IB GUID, etc.?
I will try to replicate and send a trace when the opportunity arises here.
Thanks.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
Thanks for the positive feedback, much appreciated.
We are looking to add information such as swim activity and other metrics to the dmg system query capability.
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Excellent!
I have seen messages that show the correct daos_io_server instances joined, but one of the daos_io_server instances has no swim messages. I assume that if you set DEBUG mode they will all have swim messages flowing. I will check next time around.
The dmg command's feature set has grown nicely.
Thanks Tom.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
“dmg -l <access_point_host> system query --verbose” will give you details of ranks that have previously joined the system including current state. This is an area of active development.
https://daos-stack.github.io/admin/administration/ “system operations“ section
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi,
A daos_io_server may say it is up and yet not be communicating with the others. I set the DEBUG flag and watch for swim messages.
Is there a way to figure out whether all the daos_io_server instances are healthy and functional?
Thanks.
Colin
Re: Need advice debugging "No route to host" failures
In my daos_server.yml file the nr_hugepages parameter is now set to 8192. This resulted in 16384 pages being allocated, because each SPDK run (2 daos_io_servers) allocated 8192 pages. The comments in daos_server.yml are a little fuzzy and could be read either way. Did this behave correctly? Should the total number of pages allocated be double the value of nr_hugepages?
>> Yes, the value specified in the server config file will be allocated for each IO server process.
Originally I had nr_hugepages set to 4096 in the daos_server.yml file. Using sysctl I had set vm.nr_hugepages to 16384, and /proc/meminfo showed I had 16384 huge pages. But when I started the daos server, SPDK released half of my huge pages; I was left with only 8192, or 4096 per server. If I configure the machine for 16384, SPDK should not lower it, correct? It should raise the number of huge pages if the config file says so, but it should never lower the value set for the machine if it is greater than what is set in nr_hugepages.
>> It makes sense that if you set 4096 and had 2 IO servers configured then SPDK setup would allocate 8192 pages and reset what you had previously set with sysctl. We can look at altering the behavior if necessary. The control plane is currently a bit selfish and sometimes assumes it is the sole user of persistent memory; we need to work at making it more cooperative.
I had to do the math to figure out how many hugepages my clients needed, divide that by two, then add that to nr_hugepages in the yml file to ensure that SPDK would allocate enough pages both for the two servers and my clients. That seems complicated.
>> I think so far we have assumed the clients would be run on a separate host from the servers.
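To make the arithmetic concrete, a sketch of the sizing described above (the 8192/4096/16384 figures come from this thread; the client layout is illustrative):

  # nr_hugepages in daos_server.yml is applied per IO server process, so with
  # 2 IO servers and nr_hugepages: 8192 the host ends up with 2 * 8192 = 16384 pages.
  # If clients must share the host, fold their pages in too, e.g. clients needing
  # 4096 pages -> nr_hugepages: 8192 + 4096/2 = 10240 per server.
  sysctl vm.nr_hugepages            # machine-wide setting before the servers start
  grep HugePages /proc/meminfo      # confirm total/free pages after startup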
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Kevan Rehm
Alex,
Okay, I figured out what was happening here. I have a couple of questions; I wonder if you think this is mostly a configuration error or a bug, or whether there is at least a way to report the problem more clearly so that I don’t have to debug libfabric to figure it out…
First, the memory failure. I was running the 36 clients on one of the server nodes. The libfabric function ofi_bufpool_grow() is coded so that the first time it needs memory it tries to get a huge page. If that succeeds, it will always use hugepages in subsequent calls. If that fails, it will instead use regular memory in all future calls. Note that if libfabric later runs out of hugepages, it does NOT switch to regular memory in future calls.
In my case huge pages were defined, but the 36 clients used up all the available hugepages not taken by the servers, still needed more, and so libfabric died with “Cannot allocate memory”. Do you think that is acceptable, or should it have switched to using regular memory when the hugepages were gone?
In my daos_server.yml file the nr_hugepages parameter is now set to 8192. This resulted in 16384 pages being allocated, because each SPDK run (2 daos_io_servers) allocated 8192 pages. The comments in daos_server.yml are a little fuzzy and could be read either way. Did this behave correctly? Should the total number of pages allocated be double the value of nr_hugepages?
Originally I had nr_hugepages set to 4096 in the daos_server.yml file. Using sysctl I had set vm.nr_hugepages to 16384, and /proc/meminfo showed I had 16384 huge pages. But when I started the daos server, SPDK released half of my huge pages; I was left with only 8192, or 4096 per server. If I configure the machine for 16384, SPDK should not lower it, correct? It should raise the number of huge pages if the config file says so, but it should never lower the value set for the machine if it is greater than what is set in nr_hugepages.
I had to do the math to figure out how many hugepages my clients needed, divide that by two, then add that to nr_hugepages in the yml file to ensure that SPDK would allocate enough pages both for the two servers and my clients. That seems complicated.
Your thoughts would be appreciated,
Thanks, Kevan
From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Hi Kevan,
In our scenarios/cases so far we do not see any error before “No route to host”, so it sounds like you might be hitting a different issue there. Can you provide more information as to what your clients are and which params are being used?
Based on your description, yes, it does not look like FI_UNIVERSE_SIZE would make any difference there given the default size of 256 which you do not cross in your case.
Also, to verify: when you see those errors, you also have Xuezhao's patch from below, correct?
As to allocation failures, we have not had any reports of those yet using the latest DAOS (which includes Xuezhao's patch).
Thanks, ~~Alex.
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Kevan Rehm
Alex,
I see “No route to host” a lot, including right at this moment, with only 36 clients talking to 2 daos_io_servers. Your FI_UNIVERSE_SIZE variable probably wouldn’t have an effect in a config this small, correct?
The way this always seems to happen is that some other libfabric error occurs first, e.g. see below, which causes libfabric to close the connection with the server and remove the server’s address from the libfabric AV table. Mercury doesn’t give up when that happens; it keeps retrying even though the retry cannot possibly succeed, because the client no longer has the server’s address.
I have been meaning to track down this issue, but other issues keep getting in my way. I know there’s a spot in libfabric where it shuts down the connection but then returns FI_EAGAIN, and that is the beginning of at least one of the “No route to host” loops in Mercury. I am chasing another one at the moment that looks a little different; see below.
I think we need some code in Mercury to detect when the connection to the server has been shut down, and cancel the affected requests immediately, so that the logs don’t clutter with extraneous downstream errors.
This current failure is actually a memory allocation failure in libfabric. You can see from the bolded messages that if Mercury were to retry this request, or any other request, it would be guaranteed to get “No route to host”. Does this look familiar to you? If not, I will start inserting debug messages into libfabric.
Thanks for keeping us in the loop,
Kevan
… libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Supported: FI_EP_DGRAM libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Requested: FI_EP_MSG libfabric:140355:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 libfabric:140355:core:core:ofi_bufpool_grow():87<debug> Allocation failed: Cannot allocate memory libfabric:140355:verbs:domain:vrb_ep_close():383<info> EP 0x3ed0520 is being closed libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():159<debug> marking connection handle: 0x3e18c80 for deletion libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():162<debug> [CM] handle: 0x3e18c80 RXM_CMAP_IDLE -> RXM_CMAP_SHUTDOWN 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # NA -- Error -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:3877 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -12 (Cannot allocate memory) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:1985 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4626 # na_ofi_cancel(): Canceling operation ID 0x1d92ca0 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4667 # na_ofi_cancel(): fi_cancel() rc: 0 (Success) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:4393 # HG_Core_forward(): Could not forward buffer 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury.c:1919 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR src/cart/src/cart/crt_hg.c:1208 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] HG_Forward failed, hg_ret: 11 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG src/cart/src/cart/crt_hg.c:1230 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] decref to 5. 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:1034 crt_req_send_immediately(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_hg_req_send failed, rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:517 crt_req_hg_addr_lookup_cb(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_req_send_internal() failed, rc -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_context.c:302 crt_rpc_complete(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] RPC failed; rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] object ERR src/object/cli_shard.c:263 dc_rw_cb() RPC 1 failed: -1020
From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Hi Kevan & others,
As we’ve been debugging a few other “No route to host” failures here, one thing that it turns out also needs to be set on large-scale systems is the FI_UNIVERSE_SIZE environment variable, which in RXM-based providers (verbs included) dictates the internal size of the address vector table for peers.
With Xuezhao's patch below (already part of the latest 0.9 DAOS) and the FI_UNIVERSE_SIZE=2048 setting, we were able to avoid “No route to host” issues in all known failure cases so far.
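For anyone trying this, it is just an environment variable exported before launching the clients/servers; 2048 is the value we used here and should be sized to at least the expected number of peers:

  export FI_UNIVERSE_SIZE=2048    # address vector sized for up to 2048 peers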
This issue is also being tracked in libfabric as https://github.com/ofiwg/libfabric/issues/5843
Thanks, ~~Alex.
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Liu, Xuezhao
Hi Kevan,
Great, thanks for verifying and the helpful logs.
“On the server side, what should be the proper behavior if a client tries to connect twice using the same IPoIB port? Should the server reject the second connection attempt? Close the first connection and then open the second?” The server should not reject the 2nd connection attempt; the current handling (with the patch) is that it will check its local cache (a hash table) and reuse the connection if possible.
Xuezhao
From:
<daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Xuezhao,
Yes, it works, thank you very much! I have been trying to figure that one out for several days now! 😊
Kevan
From:
<daos@daos.groups.io> on behalf of "Liu, Xuezhao" <xuezhao.liu@...>
Hi Kevan,
The log you posted is very helpful. It helped to find a potential bug in the Mercury code for address management; please find attached a small patch. It needs to be applied to the Mercury source code, which then needs to be rebuilt. Commonly I do that by: cd daos_m_path/_build.external/mercury; git am 0001-NA-OFI-fix-a-potential-bug-in-na_ofi_addr_ht_lookup.patch; cd ../mercury.build; make clean; make install
Could you please run a test to see if it resolves the problem?
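For readability, the same sequence as separate commands (daos_m_path stands for your DAOS source tree, as in the message above):

  cd daos_m_path/_build.external/mercury
  git am 0001-NA-OFI-fix-a-potential-bug-in-na_ofi_addr_ht_lookup.patch
  cd ../mercury.build
  make clean
  make install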
Thanks, Xuezhao
From:
<daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Greetings,
We have been unsuccessful over the last few days debugging “No route to host” failures that we are constantly seeing when running “ior -a DAOS”. The errors only occur in the log of the daos_io_server on IB1, never on the IB0 server. Client processes hang when this occurs. The problem does not appear to be related to our recent verbs;ofi_rxm issues which we think are now resolved. Any advice you could provide would be greatly appreciated.
Our configuration is one dual-daos_io_server node and one client node. The libfabric transport is verbs;ofi_rxm, both nodes have two interfaces. The server node’s PMEM is wiped and reformatted before each test run. We use mpirun to start 16 ior client processes on the client node. The ior program has been slightly modified so that processes with even-numbered ranks use IB0 while odd-numbered ranks use IB1. Here is an example command:
mpirun --allow-run-as-root -np 16 $X_OPTS /root/ior/ior/src/ior -a DAOS --daos.pool $POOL --daos.cont $CONT --daos.svcl $SVCL -w -r -t 64M -b 1G
When we reduce the number of client processes, the problem occurs less frequently. With only two processes on the client node we haven’t seen the problem.
What happens is that some client processes which connect to the IB1 server actually connect twice using the same source IPoIB port number. Those that do are guaranteed to generate “No route to host” errors in the IB1 server log and then hang. For example, the following grep of the IB1 server log shows that the connecting processes that were assigned to libfabric fi_addrs 2 and 3 are going to fail. If you would repeat this same grep on the IB0 server log, you would never see duplicate connections:
# grep '<debug> av_insert ' *1.log libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.220:20000 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 0 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:43782 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 1 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:39600 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 4 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:56137 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 5
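One quick way to spot such duplicates without scanning the whole log is a sketch like this (it assumes the address is the last field on the av_insert lines, as in the output above):

  grep 'av_insert addr' *1.log | awk '{print $NF}' | sort | uniq -d   # prints any client address inserted more than once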
Here are the initial failure messages for fi_addr 2 in the IB1 server log. The messages with “krehm” in them were added by me for debugging purposes:
04/08-19:18:08.43 delphi-002 DAOS[18490/18530] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fecab64eba0, fi_addr=2, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 2 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 2 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fec1b64e820, fi_addr=1, uri=(null)) libfabric:18490:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
My suspicion is that the server’s cleanup of the first connection from fi_addr 2 throws away the libfabric address of that client, so when the server then tries to initiate an RMA Read request using the second client connection, it has no libfabric address for the client and fails.
The failure sequence for fi_addr 3 is identical:
04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7feab365f550, fi_addr=3, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 3 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 3 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:2707 # na_ofi_cq_process_recv_unexpected_event(): unexpected recv msg completion event with tag=140657172791930, len=4294967364 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:384 obj_bulk_transfer() crt_bulk_transfer 0 error (-1020). 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:1062 obj_local_rw() 1153784440846024704.5704104871060330004.13 data transfer failed, dma 1 rc DER_HG(-1020) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
Could someone suggest ways that we might narrow down the problem?
Thanks, Kevan
Re: How to figure out the all the daos_io_server are talking?
Colin Ngam
Hi Tom,
So, I can repeat, stop and restart the servers:
DEBUG 19:32:47.524108 mgmt_system.go:147: updated system member: rank 2, addr 172.30.222.230:10001, Started->Started ERROR: unexpected same state in rank 2 update (Started->Started) DEBUG 19:32:47.609783 mgmt_system.go:147: updated system member: rank 3, addr 172.30.222.230:10001, Started->Started
However, if I shut down the server and execute “dmg -l delphi-006 system query --verbose”, it knows that the servers have terminated (it does send out a ping): DEBUG 19:33:24.566669 ctl_system.go:173: Responding to SystemQuery RPC DEBUG 19:34:59.186204 mgmt_system.go:147: updated system member: rank 4, addr 172.30.222.236:10001, Stopped->Started DEBUG 19:34:59.271000 mgmt_system.go:147: updated system member: rank 5, addr 172.30.222.236:10001, Stopped->Started
Maybe some heartbeat would help, etc.
Thanks.
Colin
From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
It is not a 100% hit either ..
DEBUG 18:47:46.314720 ctl_system.go:173: Responding to SystemQuery RPC DEBUG 18:49:02.859993 mgmt_system.go:147: updated system member: rank 4, addr 172.30.222.236:10001, Stopped->Started DEBUG 18:49:02.945670 mgmt_system.go:147: updated system member: rank 5, addr 172.30.222.236:10001, Stopped->Started
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
That’s unexpected, I will try to reproduce over the weekend.
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi Tom,
I have a host that runs ranks 4 and 5. I killed the servers with Control-C and restarted the servers after maybe 3-5 minutes. The access host did not see any server going away. After restart I see:
DEBUG 14:56:53.800937 mgmt_system.go:147: updated system member: rank 4, addr 172.30.222.236:10001, Started->Started ERROR: unexpected same state in rank 4 update (Started->Started) DEBUG 14:56:53.888180 mgmt_system.go:147: updated system member: rank 5, addr 172.30.222.236:10001, Started->Started ERROR: unexpected same state in rank 5 update (Started->Started)
Is this good or bad?
Today’s master tree.
Thanks.
Colin
From: "Ngam, Colin" <colin.ngam@...>
Hi Tom,
Thanks Tom. I will see if I can replicate the issue.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
The superblock stores allocated ranks:
[tanabarr@wolf-71 daos_m]$ cat /mnt/daos0/superblock
version: 0
uuid: 8359d5d6-1d8f-430c-90db-205fc07113bf
system: daos_server
rank: 2
validrank: true
ms: false
createms: false
bootstrapms: false
On start-up, non-MS daos_server will supply the superblock rank in the join request to the system. The iosrv/mgmt maintains rank:uuid mappings and should allow the rank to rejoin the system given the expected uuid is supplied along with the rank in the request.
I can’t recreate the behavior you are describing at the moment; if I stop and then start a non-MS daos_server, it joins with the same rank it was previously assigned. (I assume you are not reformatting.)
If you can repeat that behavior, please send me the logs. We should be able to get some relevant information from the mgmt module in the server log, from the function ds_mgmt_join_handler().
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi,
I also noticed the following:
My configuration file does not have any “rank” specified; there are 2 daos_io_server instances per host. They came up and I see them as ranks 0..3. All fine. I killed ranks 2 and 3 (same host) with Control-C and restarted daos_server; they came up registering as ranks 4 and 5. Is rank persistent? If it is, what is the association between rank number and daos_io_server? Host name, IB GUID, etc.?
I will try to replicate and send a trace when the opportunity arises here.
Thanks.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
Thanks for the positive feedback, much appreciated.
We are looking to add information such as swim activity and other metrics to the dmg system query capability.
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Excellent!
I have seen messages that show the correct daos_io_server instances joined, but one of the daos_io_server instances has no swim messages. I assume that if you set DEBUG mode they will all have swim messages flowing. I will check next time around.
The dmg command's feature set has grown nicely.
Thanks Tom.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
“dmg -l <access_point_host> system query --verbose” will give you details of ranks that have previously joined the system including current state. This is an area of active development.
https://daos-stack.github.io/admin/administration/ “system operations“ section
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi,
A daos_io_server may say it is up and yet not be communicating with the others. I set the DEBUG flag and watch for swim messages.
Is there a way to figure out whether all the daos_io_server instances are healthy and functional?
Thanks.
Colin
Re: Need advice debugging "No route to host" failures
Hefty, Sean <sean.hefty@...>
> Interesting finds. I will let Sean comment on expected OFI behavior in case of running out of hugepages, but to me it sounds like it should have switched and used regular memory in future calls.
Yes, this sounds ideal. I'll look into this.
- Sean
Re: How to figure out the all the daos_io_server are talking?
Colin Ngam
It is not a 100% hit either ..
DEBUG 18:47:46.314720 ctl_system.go:173: Responding to SystemQuery RPC DEBUG 18:49:02.859993 mgmt_system.go:147: updated system member: rank 4, addr 172.30.222.236:10001, Stopped->Started DEBUG 18:49:02.945670 mgmt_system.go:147: updated system member: rank 5, addr 172.30.222.236:10001, Stopped->Started
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
That’s unexpected, I will try to reproduce over the weekend.
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi Tom,
I have a host that runs ranks 4 and 5. I killed the servers with Control-C and restarted the servers after maybe 3-5 minutes. The access host did not see any server going away. After restart I see:
DEBUG 14:56:53.800937 mgmt_system.go:147: updated system member: rank 4, addr 172.30.222.236:10001, Started->Started ERROR: unexpected same state in rank 4 update (Started->Started) DEBUG 14:56:53.888180 mgmt_system.go:147: updated system member: rank 5, addr 172.30.222.236:10001, Started->Started ERROR: unexpected same state in rank 5 update (Started->Started)
Is this good or bad?
Today’s master tree.
Thanks.
Colin
From: "Ngam, Colin" <colin.ngam@...>
Hi Tom,
Thanks Tom. I will see if I can replicate the issue.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
The superblock stores allocated ranks:
[tanabarr@wolf-71 daos_m]$ cat /mnt/daos0/superblock
version: 0
uuid: 8359d5d6-1d8f-430c-90db-205fc07113bf
system: daos_server
rank: 2
validrank: true
ms: false
createms: false
bootstrapms: false
On start-up, non-MS daos_server will supply the superblock rank in the join request to the system. The iosrv/mgmt maintains rank:uuid mappings and should allow the rank to rejoin the system given the expected uuid is supplied along with the rank in the request.
I can’t recreate the behavior you are describing at the moment; if I stop and then start a non-MS daos_server, it joins with the same rank it was previously assigned. (I assume you are not reformatting.)
If you can repeat that behavior, please send me the logs. We should be able to get some relevant information from the mgmt module in the server log, from the function ds_mgmt_join_handler().
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi,
I also noticed the following:
My configuration file does not have any “rank” specified; there are 2 daos_io_server instances per host. They came up and I see them as ranks 0..3. All fine. I killed ranks 2 and 3 (same host) with Control-C and restarted daos_server; they came up registering as ranks 4 and 5. Is rank persistent? If it is, what is the association between rank number and daos_io_server? Host name, IB GUID, etc.?
I will try to replicate and send a trace when the opportunity arises here.
Thanks.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
Thanks for the positive feedback, much appreciated.
We are looking to add information such as swim activity and other metrics to the dmg system query capability.
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Excellent!
I have seen messages that show the correct daos_io_server instances joined, but one of the daos_io_server instances has no swim messages. I assume that if you set DEBUG mode they will all have swim messages flowing. I will check next time around.
The dmg command's feature set has grown nicely.
Thanks Tom.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
“dmg -l <access_point_host> system query --verbose” will give you details of ranks that have previously joined the system including current state. This is an area of active development.
https://daos-stack.github.io/admin/administration/ “system operations“ section
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi,
A daos_io_server may say it is up and yet not be communicating with the others. I set the DEBUG flag and watch for swim messages.
Is there a way to figure out whether all the daos_io_server instances are healthy and functional?
Thanks.
Colin
Re: Need advice debugging "No route to host" failures
Oganezov, Alexander A
Hi Kevan,
Interesting finds. I will let Sean comment on expected OFI behavior in case of running out of hugepages, but to me it sounds like it should have switched and used regular memory in future calls.
As for the nr_hugepages param, I would need to defer to some other folks who are familiar with SPDK and DAOS's behavior in this case.
Thanks, ~~Alex.
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Kevan Rehm
Alex,
Okay, I figured out what was happening here. I have a couple of questions; I wonder if you think this is mostly a configuration error or a bug, or whether there is at least a way to report the problem more clearly so that I don’t have to debug libfabric to figure it out…
First, the memory failure. I was running the 36 clients on one of the server nodes. The libfabric function ofi_bufpool_grow() is coded so that the first time it needs memory it tries to get a huge page. If that succeeds, it will always use hugepages in subsequent calls. If that fails, it will instead use regular memory in all future calls. Note that if libfabric later runs out of hugepages, it does NOT switch to regular memory in future calls.
In my case huge pages were defined, but the 36 clients used up all the available hugepages not taken by the servers, still needed more, and so libfabric died with “Cannot allocate memory”. Do you think that is acceptable, or should it have switched to using regular memory when the hugepages were gone?
In my daos_server.yml file the nr_hugepages parameter is now set to 8192. This resulted in 16384 pages being allocated, because each SPDK run (2 daos_io_servers) allocated 8192 pages. The comments in daos_server.yml are a little fuzzy and could be read either way. Did this behave correctly? Should the total number of pages allocated be double the value of nr_hugepages?
Originally I had nr_hugepages set to 4096 in the daos_server.yml file. Using sysctl I had set vm.nr_hugepages to 16384, and /proc/meminfo showed I had 16384 huge pages. But when I started the daos server, SPDK released half of my huge pages; I was left with only 8192, or 4096 per server. If I configure the machine for 16384, SPDK should not lower it, correct? It should raise the number of huge pages if the config file says so, but it should never lower the value set for the machine if it is greater than what is set in nr_hugepages.
I had to do the math to figure out how many hugepages my clients needed, divide that by two, then add that to nr_hugepages in the yml file to ensure that SPDK would allocate enough pages both for the two servers and my clients. That seems complicated.
Your thoughts would be appreciated,
Thanks, Kevan
From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Hi Kevan,
In our scenarios/cases so far we do not see any error before “No route to host”, so it sounds like you might be hitting a different issue there. Can you provide more information as to what your clients are and which params are being used?
Based on your description, yes, it does not look like FI_UNIVERSE_SIZE would make any difference there given the default size of 256 which you do not cross in your case.
Also, to verify: when you see those errors, you also have Xuezhao's patch from below, correct?
As to allocation failures, we have not had any reports of those yet using the latest DAOS (which includes Xuezhao's patch).
Thanks, ~~Alex.
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Kevan Rehm
Alex,
I see “No route to host” a lot, including right at this moment, with only 36 clients talking to 2 daos_io_servers. Your FI_UNIVERSE_SIZE variable probably wouldn’t have an effect in a config this small, correct?
The way this always seems to happen is that some other libfabric error occurs first, e.g. see below, which causes libfabric to close the connection with the server and remove the server’s address from the libfabric AV table. Mercury doesn’t give up when that happens; it keeps retrying even though the retry cannot possibly succeed, because the client no longer has the server’s address.
I have been meaning to track down this issue, but other issues keep getting in my way. I know there’s a spot in libfabric where it shuts down the connection but then returns FI_EAGAIN, and that is the beginning of at least one of the “No route to host” loops in Mercury. I am chasing another one at the moment that looks a little different; see below.
I think we need some code in Mercury to detect when the connection to the server has been shut down, and cancel the affected requests immediately, so that the logs don’t clutter with extraneous downstream errors.
This current failure is actually a memory allocation failure in libfabric. You can see from the bolded messages that if Mercury were to retry this request, or any other request, it would be guaranteed to get “No route to host”. Does this look familiar to you? If not, I will start inserting debug messages into libfabric.
Thanks for keeping us in the loop,
Kevan
… libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Supported: FI_EP_DGRAM libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Requested: FI_EP_MSG libfabric:140355:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 libfabric:140355:core:core:ofi_bufpool_grow():87<debug> Allocation failed: Cannot allocate memory libfabric:140355:verbs:domain:vrb_ep_close():383<info> EP 0x3ed0520 is being closed libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():159<debug> marking connection handle: 0x3e18c80 for deletion libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():162<debug> [CM] handle: 0x3e18c80 RXM_CMAP_IDLE -> RXM_CMAP_SHUTDOWN 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # NA -- Error -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:3877 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -12 (Cannot allocate memory) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:1985 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4626 # na_ofi_cancel(): Canceling operation ID 0x1d92ca0 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4667 # na_ofi_cancel(): fi_cancel() rc: 0 (Success) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:4393 # HG_Core_forward(): Could not forward buffer 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury.c:1919 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR src/cart/src/cart/crt_hg.c:1208 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] HG_Forward failed, hg_ret: 11 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG src/cart/src/cart/crt_hg.c:1230 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] decref to 5. 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:1034 crt_req_send_immediately(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_hg_req_send failed, rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:517 crt_req_hg_addr_lookup_cb(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_req_send_internal() failed, rc -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_context.c:302 crt_rpc_complete(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] RPC failed; rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] object ERR src/object/cli_shard.c:263 dc_rw_cb() RPC 1 failed: -1020
From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Hi Kevan & others,
As we’ve been debugging a few other “No route to host” failures here, one thing that it turns out also needs to be set on large-scale systems is the FI_UNIVERSE_SIZE environment variable, which in RXM-based providers (verbs included) dictates the internal size of the address vector table for peers.
With Xuezhao's patch below (already part of the latest 0.9 DAOS) and the FI_UNIVERSE_SIZE=2048 setting, we were able to avoid “No route to host” issues in all known failure cases so far.
This issue is also being tracked in libfabric as https://github.com/ofiwg/libfabric/issues/5843
Thanks, ~~Alex.
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Liu, Xuezhao
Hi Kevan,
Great, thanks for verifying and the helpful logs.
“On the server side, what should be the proper behavior if a client tries to connect twice using the same IPoIB port? Should the server reject the second connection attempt? Close the first connection and then open the second?” The server should not reject the 2nd connection attempt; the current handling (with the patch) is that it will check its local cache (a hash table) and reuse the connection if possible.
Xuezhao
From:
<daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Xuezhao,
Yes, it works, thank you very much! I have been trying to figure that one out for several days now! 😊
Kevan
From:
<daos@daos.groups.io> on behalf of "Liu, Xuezhao" <xuezhao.liu@...>
Hi Kevan,
The log you posted is very helpful. It helped to find a potential bug in the Mercury code for address management; please find attached a small patch. It needs to be applied to the Mercury source code, which then needs to be rebuilt. Commonly I do that by: cd daos_m_path/_build.external/mercury; git am 0001-NA-OFI-fix-a-potential-bug-in-na_ofi_addr_ht_lookup.patch; cd ../mercury.build; make clean; make install
Could you please run a test to see if it resolves the problem?
Thanks, Xuezhao
From:
<daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Greetings,
We have been unsuccessful over the last few days debugging “No route to host” failures that we are constantly seeing when running “ior -a DAOS”. The errors only occur in the log of the daos_io_server on IB1, never on the IB0 server. Client processes hang when this occurs. The problem does not appear to be related to our recent verbs;ofi_rxm issues which we think are now resolved. Any advice you could provide would be greatly appreciated.
Our configuration is one dual-daos_io_server node and one client node. The libfabric transport is verbs;ofi_rxm, both nodes have two interfaces. The server node’s PMEM is wiped and reformatted before each test run. We use mpirun to start 16 ior client processes on the client node. The ior program has been slightly modified so that processes with even-numbered ranks use IB0 while odd-numbered ranks use IB1. Here is an example command:
mpirun --allow-run-as-root -np 16 $X_OPTS /root/ior/ior/src/ior -a DAOS --daos.pool $POOL --daos.cont $CONT --daos.svcl $SVCL -w -r -t 64M -b 1G
When we reduce the number of client processes, the problem occurs less frequently. With only two processes on the client node we haven’t seen the problem.
What happens is that some client processes which connect to the IB1 server actually connect twice using the same source IPoIB port number. Those that do are guaranteed to generate “No route to host” errors in the IB1 server log and then hang. For example, the following grep of the IB1 server log shows that the connecting processes that were assigned to libfabric fi_addrs 2 and 3 are going to fail. If you would repeat this same grep on the IB0 server log, you would never see duplicate connections:
# grep '<debug> av_insert ' *1.log libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.220:20000 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 0 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:43782 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 1 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:39600 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 4 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:56137 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 5
Here are the initial failure messages for fi_addr 2 in the IB1 server log. The messages with “krehm” in them were added by me for debugging purposes:
04/08-19:18:08.43 delphi-002 DAOS[18490/18530] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fecab64eba0, fi_addr=2, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 2 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 2 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fec1b64e820, fi_addr=1, uri=(null)) libfabric:18490:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
My suspicion is that the server’s cleanup of the first connection from fi_addr 2 throws away the libfabric address of that client, so when the server then tries to initiate an RMA Read request using the second client connection, it has no libfabric address for the client and fails.
The failure sequence for fi_addr 3 is identical:
04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7feab365f550, fi_addr=3, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 3 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 3 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:2707 # na_ofi_cq_process_recv_unexpected_event(): unexpected recv msg completion event with tag=140657172791930, len=4294967364 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:384 obj_bulk_transfer() crt_bulk_transfer 0 error (-1020). 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:1062 obj_local_rw() 1153784440846024704.5704104871060330004.13 data transfer failed, dma 1 rc DER_HG(-1020) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
Could someone suggest ways that we might narrow down the problem?
Thanks, Kevan
Re: How to figure out the all the daos_io_server are talking?
That’s unexpected, I will try to reproduce over the weekend.
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi Tom,
I have a host that runs ranks 4 and 5. I killed the servers with Control-C and restarted the servers after maybe 3-5 minutes. The access host did not see any server going away. After restart I see:
DEBUG 14:56:53.800937 mgmt_system.go:147: updated system member: rank 4, addr 172.30.222.236:10001, Started->Started ERROR: unexpected same state in rank 4 update (Started->Started) DEBUG 14:56:53.888180 mgmt_system.go:147: updated system member: rank 5, addr 172.30.222.236:10001, Started->Started ERROR: unexpected same state in rank 5 update (Started->Started)
Is this good or bad?
Today’s master tree.
Thanks.
Colin
From: "Ngam, Colin" <colin.ngam@...>
Hi Tom,
Thanks Tom. I will see if I can replicate the issue.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
The superblock stores allocated ranks:
[tanabarr@wolf-71 daos_m]$ cat /mnt/daos0/superblock
version: 0
uuid: 8359d5d6-1d8f-430c-90db-205fc07113bf
system: daos_server
rank: 2
validrank: true
ms: false
createms: false
bootstrapms: false
On start-up, non-MS daos_server will supply the superblock rank in the join request to the system. The iosrv/mgmt maintains rank:uuid mappings and should allow the rank to rejoin the system given the expected uuid is supplied along with the rank in the request.
I can’t recreate the behavior you are describing at the moment; if I stop and then start a non-MS daos_server, it joins with the same rank it was previously assigned. (I assume you are not reformatting.)
If you can repeat that behavior, please send me the logs. We should be able to get some relevant information from the mgmt module in the server log, from the function ds_mgmt_join_handler().
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi,
I also noticed the following:
My configuration file does not have any “rank” specified; there are 2 daos_io_server instances per host. They came up and I see them as ranks 0..3. All fine. I killed ranks 2 and 3 (same host) with Control-C and restarted daos_server; they came up registering as ranks 4 and 5. Is rank persistent? If it is, what is the association between rank number and daos_io_server? Host name, IB GUID, etc.?
I will try to replicate and send a trace when the opportunity arises here.
Thanks.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
Thanks for the positive feedback, much appreciated.
We are looking to add information such as swim activity and other metrics to the dmg system query capability.
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Excellent!
I have seen messages that show the correct daos_io_servers joined, but one of the daos_io_servers has no swim messages. I assume that if you set DEBUG mode they will all have swim messages flowing. I will check next time around.
The dmg command’s feature set has grown nicely.
Thanks Tom.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
“dmg -l <access_point_host> system query --verbose” will give you details of ranks that have previously joined the system including current state. This is an area of active development.
https://daos-stack.github.io/admin/administration/ “system operations“ section
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi,
A daos_io_server says it is up, but it may not be communicating with the others. I set the DEBUG flag and watch for swim messages.
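For reference, this is roughly what I do today; a sketch only, the log file names are just examples of what my yml points at:

grep -c -i swim /tmp/daos_io_server_0.log /tmp/daos_io_server_1.log    # rough per-engine swim message counts
tail -f /tmp/daos_io_server_1.log | grep -i swim                       # watch swim traffic live on one engine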
Is there a way to figure out that all the daos_io_servers are healthy and functional?
Thanks.
Colin
|
|
Re: Need advice debugging "No route to host" failures
Farrell, Patrick Arthur <patrick.farrell@...>
Note also that this problem can happen on a pure client, so I’d just add that changes to the server config file and handling are not enough. (I think that client was running a server previously, but the server was definitely not running when I saw this issue.)
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Sent: Friday, April 17, 2020 4:25 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Need advice debugging "No route to host" failures
Alex,
Okay, I figured out what was happening here. I have a couple of questions; I wonder if you think this is mostly a configuration error, or a bug. Or at least maybe there is a way to make the problem get reported more clearly so that I don’t have to debug libfabric to figure it out…
First, the memory failure. I was running the 36 clients on one of the server nodes. The libfabric function ofi_bufpool_grow() is coded so that the first time it needs memory it tries to get a huge page. If that succeeds, it will always use hugepages in subsequent calls; if that first attempt fails, it will use regular memory in all future calls. Note that if libfabric later runs out of hugepages, it does NOT switch to regular memory for future calls.
In my case there were huge pages defined, but between them the 36 clients used up all the available hugepages not reserved by the servers; they still needed more, and so libfabric died with “Cannot allocate memory”. Do you think that is acceptable, or should it have switched to using regular memory once the hugepages were gone?
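For anyone trying to reproduce this, watching the hugepage counters while the clients start up makes the exhaustion easy to spot; a sketch, assuming 2 MB pages on the node:

grep Huge /proc/meminfo                                          # HugePages_Total / HugePages_Free / HugePages_Rsvd at a glance
cat /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages     # the same free counter, per page size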
In my daos_server.yml file the nr_hugepages parameter is now set to 8192. This resulted in 16384 pages being allocated, because each SPDK run (2 daos_io_servers) allocated 8192 pages. The comments in daos_server.yml are a little fuzzy and could be read either way: did this behave correctly? Should the total number of pages allocated be double the value of nr_hugepages?
Originally I had nr_hugepages set to 4096 in the daos_server.yml file. Using sysctl I had set vm.nr_hugepages to 16384, and /proc/meminfo showed I had 16384 huge pages. But when I started the daos server, SPDK released half of my huge pages; I was left with only 8192, or 4096 per server. If I configure the machine for 16384, SPDK should not lower it, correct? It should raise the number of huge pages if the config file says so, but it should never lower the value set for the machine when that is greater than what is set in nr_hugepages.
I had to do the math to figure out how many hugepages my clients needed, divide that by two, and then add that to nr_hugepages in the yml file to ensure that SPDK would allocate enough pages both for the two servers and for my clients. That seems complicated.
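To make the arithmetic concrete, this is the back-of-the-envelope calculation I ended up doing (a sketch; N, the pages needed per client, is a placeholder you have to measure, not something taken from the config):

# pages reserved for the servers: each of the 2 daos_io_servers allocates nr_hugepages
#   2 * 8192 = 16384
# pages needed by the 36 clients on the same node: 36 * N
# value to set in daos_server.yml so both servers and clients fit:
#   nr_hugepages = 8192 + (36 * N) / 2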
Your thoughts would be appreciated,
Thanks, Kevan
From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Hi Kevan,
In our scenarios/cases so far we do not see any error before “No route to host”, so it sounds like you might be hitting a different issue there. Can you provide more information as to what your clients are and which params are being used?
Based on your description, yes, it does not look like FI_UNIVERSE_SIZE would make any difference there given the default size of 256 which you do not cross in your case.
Also, to verify: when you see those errors, you also have Xuezhao’s patch from below, correct?
As to allocation failures – we have not had any reports of that yet using the latest DAOS (which includes Xuezhao’s patch).
Thanks, ~~Alex.
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Kevan Rehm
Alex,
I see “No route to host” a lot, including right at this moment, with only 36 clients talking to 2 daos_io_servers. Your FI_UNIVERSE_SIZE variable probably wouldn’t have an effect in a config this small, correct?
The way this always seems to happen is that some other libfabric error occurs first (e.g. see below), which causes libfabric to close the connection with the server and remove the server’s address from the libfabric AV table. Mercury doesn’t give up when that happens; it keeps retrying even though the retry cannot possibly succeed, because the client no longer has the server’s address.
I have been meaning to track down this issue, but other issues keep getting in my way. I know there’s a spot in libfabric where it shuts down the connection but then returns FI_EAGAIN, and that is the beginning of at least one of the “No route to host” loops in Mercury. I am chasing another one at the moment that looks a little different; see below.
I think we need some code in Mercury to detect when the connection to the server has been shut down, and cancel the affected requests immediately, so that the logs don’t clutter with extraneous downstream errors.
This current failure is actually a memory allocation failure in libfabric. You can see from the bolded messages that if Mercury were to retry this request, or any other request, it is guaranteed to get “No route to host”. Does this look familiar to you? If not, I will start inserting debug messages into libfabric.
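In case it helps anyone else chasing these, the first error is much easier to catch if libfabric’s own logging is turned up before the client starts; a minimal sketch (FI_LOG_LEVEL and FI_LOG_PROV are standard libfabric settings, the provider name is the one we run here):

export FI_LOG_LEVEL=debug
export FI_LOG_PROV=ofi_rxm    # restrict the flood to the RXM layer
# then run the client as usual and look for the first vrb_ep_close / RXM_CMAP_SHUTDOWN line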
Thanks for keeping us in the loop,
Kevan
… libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Supported: FI_EP_DGRAM libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Requested: FI_EP_MSG libfabric:140355:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 libfabric:140355:core:core:ofi_bufpool_grow():87<debug> Allocation failed: Cannot allocate memory libfabric:140355:verbs:domain:vrb_ep_close():383<info> EP 0x3ed0520 is being closed libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():159<debug> marking connection handle: 0x3e18c80 for deletion libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():162<debug> [CM] handle: 0x3e18c80 RXM_CMAP_IDLE -> RXM_CMAP_SHUTDOWN 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # NA -- Error -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:3877 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -12 (Cannot allocate memory) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:1985 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4626 # na_ofi_cancel(): Canceling operation ID 0x1d92ca0 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4667 # na_ofi_cancel(): fi_cancel() rc: 0 (Success) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:4393 # HG_Core_forward(): Could not forward buffer 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury.c:1919 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR src/cart/src/cart/crt_hg.c:1208 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] HG_Forward failed, hg_ret: 11 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG src/cart/src/cart/crt_hg.c:1230 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] decref to 5. 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:1034 crt_req_send_immediately(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_hg_req_send failed, rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:517 crt_req_hg_addr_lookup_cb(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_req_send_internal() failed, rc -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_context.c:302 crt_rpc_complete(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] RPC failed; rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] object ERR src/object/cli_shard.c:263 dc_rw_cb() RPC 1 failed: -1020
From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Hi Kevan & others,
As we’ve been debugging a few other “no route to host” failures here, one thing that it turns out also needs to be set on large-scale systems is the FI_UNIVERSE_SIZE environment variable, which in RXM-based providers (verbs included) dictates the internal size of the address vector table for peers.
With Xuezhao’s patch below (already part of the latest 0.9 DAOS) and the FI_UNIVERSE_SIZE=2048 setting, we were able to avoid “No route to host” issues on all known failures so far.
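For anyone applying this, the variable just needs to be in the environment of every DAOS process before libfabric is initialized; a minimal client-side sketch (the -x export flag is the Open MPI spelling, adjust for your launcher):

export FI_UNIVERSE_SIZE=2048
mpirun -x FI_UNIVERSE_SIZE -np 16 <ior command as before>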
This issue is also being tracked in libfabric as https://github.com/ofiwg/libfabric/issues/5843
Thanks, ~~Alex.
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Liu, Xuezhao
Hi Kevan,
Great, thanks for verifying and the helpful logs.
“On the server side, what should be the proper behavior if a client tries to connect twice using the same IPoIB port? Should the server reject the second connection attempt? Close the first connection and then open the second?” The server should not reject the 2nd connection attempt; the current handling (with the patch) is that it checks its local cache (a hash table) and reuses the existing connection if possible.
Xuezhao
From:
<daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Xuezhao,
Yes, it works, thank you very much! I have been trying to figure that one out for several days now! 😊
Kevan
From:
<daos@daos.groups.io> on behalf of "Liu, Xuezhao" <xuezhao.liu@...>
Hi Kevan,
The log you posted is very helpful. It helped to find a potential bug in the mercury code for address management; please find attached a small patch. It needs to be applied to the mercury source code, and mercury rebuilt. Commonly I do that by: cd daos_m_path/_build.external/mercury; git am 0001-NA-OFI-fix-a-potential-bug-in-na_ofi_addr_ht_lookup.patch; cd ../mercury.build; make clean; make install
Could you please have a test to see if it can resolve the problem?
Thanks, Xuezhao
From:
<daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Greetings,
We have been unsuccessful over the last few days debugging “No route to host” failures that we are constantly seeing when running “ior -a DAOS”. The errors only occur in the log of the daos_io_server on IB1, never on the IB0 server. Client processes hang when this occurs. The problem does not appear to be related to our recent verbs;ofi_rxm issues which we think are now resolved. Any advice you could provide would be greatly appreciated.
Our configuration is one dual-daos_io_server node and one client node. The libfabric transport is verbs;ofi_rxm, both nodes have two interfaces. The server node’s PMEM is wiped and reformatted before each test run. We use mpirun to start 16 ior client processes on the client node. The ior program has been slightly modified so that processes with even-numbered ranks use IB0 while odd-numbered ranks use IB1. Here is an example command:
mpirun --allow-run-as-root -np 16 $X_OPTS /root/ior/ior/src/ior -a DAOS --daos.pool $POOL --daos.cont $CONT --daos.svcl $SVCL -w -r -t 64M -b 1G
When we reduce the number of client processes, the problem occurs less frequently. With only two processes on the client node we haven’t seen the problem.
What happens is that some client processes which connect to the IB1 server actually connect twice using the same source IPoIB port number. Those that do are guaranteed to generate “No route to host” errors in the IB1 server log and then hang. For example, the following grep of the IB1 server log shows that the connecting processes assigned to libfabric fi_addrs 2 and 3 are going to fail. If you repeat this same grep on the IB0 server log, you never see duplicate connections:
# grep '<debug> av_insert ' *1.log libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.220:20000 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 0 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:43782 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 1 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:39600 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 4 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:56137 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 5
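A quicker way to spot the duplicates than reading the whole list is to look for repeated addr lines (same log naming as the grep above):

grep '<debug> av_insert addr' *1.log | sort | uniq -d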
Here are the initial failure messages for fi_addr 2 in the IB1 server log. The messages with “krehm” in them were added by me for debugging purposes:
04/08-19:18:08.43 delphi-002 DAOS[18490/18530] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fecab64eba0, fi_addr=2, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 2 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 2 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fec1b64e820, fi_addr=1, uri=(null)) libfabric:18490:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
My suspicion is that the server’s cleanup of the first connection from fi_addr 2 throws away the libfabric address of that client, so when the server later tries to initiate an RMA Read request over the second client connection, it has no libfabric address for the client and fails.
The failure sequence for fi_addr 3 is identical:
04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7feab365f550, fi_addr=3, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 3 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 3 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:2707 # na_ofi_cq_process_recv_unexpected_event(): unexpected recv msg completion event with tag=140657172791930, len=4294967364 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:384 obj_bulk_transfer() crt_bulk_transfer 0 error (-1020). 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:1062 obj_local_rw() 1153784440846024704.5704104871060330004.13 data transfer failed, dma 1 rc DER_HG(-1020) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
Could someone suggest ways that we might narrow down the problem?
Thanks, Kevan |
|
Re: Need advice debugging "No route to host" failures
Kevan Rehm
Alex,
Okay, I figured out what was happening here. I have a couple of questions; I wonder if you think this is mostly a configuration error, or a bug. Or at least maybe there is a way to make the problem get reported more clearly so that I don’t have to debug libfabric to figure it out…
First, the memory failure. I was running the 36 clients on one of the server nodes. The libfabric function ofi_bufpool_grow() is coded so that the first time it needs memory it tries to get a huge page. If that succeeds, it will always use hugepages in subsequent calls; if that first attempt fails, it will use regular memory in all future calls. Note that if libfabric later runs out of hugepages, it does NOT switch to regular memory for future calls.
In my case there were huge pages defined, but between them the 36 clients used up all the available hugepages not reserved by the servers; they still needed more, and so libfabric died with “Cannot allocate memory”. Do you think that is acceptable, or should it have switched to using regular memory once the hugepages were gone?
In my daos_server.yml file the nr_hugepages parameter is now set to 8192. This resulted in 16384 pages being allocated, because each SPDK run (2 daos_io_servers) allocated 8192 pages. The comments in daos_server.yml are a little fuzzy and could be read either way: did this behave correctly? Should the total number of pages allocated be double the value of nr_hugepages?
Originally I had nr_hugepages set to 4096 in the daos_server.yml file. Using sysctl I had set vm.nr_hugepages to 16384, and /proc/meminfo showed I had 16384 huge pages. But when I started the daos server, SPDK released half of my huge pages; I was left with only 8192, or 4096 per server. If I configure the machine for 16384, SPDK should not lower it, correct? It should raise the number of huge pages if the config file says so, but it should never lower the value set for the machine when that is greater than what is set in nr_hugepages.
I had to do the math to figure out how many hugepages my clients needed, divide that by two, and then add that to nr_hugepages in the yml file to ensure that SPDK would allocate enough pages both for the two servers and for my clients. That seems complicated.
Your thoughts would be appreciated,
Thanks, Kevan
From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Hi Kevan,
In our scenarios/cases so far we do not see any error before “No route to host”, so it sounds like you might be hitting a different issue there. Can you provide more information as to what your clients are and which params are being used?
Based on your description, yes, it does not look like FI_UNIVERSE_SIZE would make any difference there given the default size of 256 which you do not cross in your case.
Also, to verify, when you see those errors you also have the Xuezhaos patch, from below, correct?
As to allocation failures – we have not had any reports of that yet using latest daos (which includes xuezhaos patch).
Thanks, ~~Alex.
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Kevan Rehm
Alex,
I see “No route to host” a lot, including right at this moment, with only 36 clients talking to 2 daos_io_servers. Your FI_UNIVERSE_SIZE variable probably wouldn’t have an effect in a config this small, correct?
The way this always seems to happen is that some other libfabric error occurs first, e.g. see below, which causes libfabric to close the connection with the server and remove the server’s address from the libfabric AV table. Mercury doesn’t give up when that happens, it keeps retrying when the retry cannot possibly succeed because the client no longer has the server’s address.
I have been meaning to track down this issue, but other issues keep getting in my way. I know there’s a spot in libfabric where it shuts down the connection, but then returns FI_EAGAIN, and that is the beginning of at least one of the “No route to host” loops in Mercury. I am chasing another one at the moment that looks a little different, see below
I think we need some code in Mercury to detect when the connection to the server has been shut down, and cancel the affected requests immediately, so that the logs don’t clutter with extraneous downstream errors.
This current failure is actually a memory allocation failure in libfabric. You can see from the bolded messages that if Mercury would retry this request or any other request, it is guaranteed to get “No route to host”. Does this look familiar to you? If not, I will start inserting debug messages into libfabric.
Thanks for keeping us in the loop,
Kevan
… libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Supported: FI_EP_DGRAM libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Requested: FI_EP_MSG libfabric:140355:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 libfabric:140355:core:core:ofi_bufpool_grow():87<debug> Allocation failed: Cannot allocate memory libfabric:140355:verbs:domain:vrb_ep_close():383<info> EP 0x3ed0520 is being closed libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():159<debug> marking connection handle: 0x3e18c80 for deletion libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():162<debug> [CM] handle: 0x3e18c80 RXM_CMAP_IDLE -> RXM_CMAP_SHUTDOWN 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # NA -- Error -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:3877 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -12 (Cannot allocate memory) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:1985 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4626 # na_ofi_cancel(): Canceling operation ID 0x1d92ca0 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4667 # na_ofi_cancel(): fi_cancel() rc: 0 (Success) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:4393 # HG_Core_forward(): Could not forward buffer 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury.c:1919 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR src/cart/src/cart/crt_hg.c:1208 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] HG_Forward failed, hg_ret: 11 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG src/cart/src/cart/crt_hg.c:1230 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] decref to 5. 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:1034 crt_req_send_immediately(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_hg_req_send failed, rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:517 crt_req_hg_addr_lookup_cb(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_req_send_internal() failed, rc -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_context.c:302 crt_rpc_complete(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] RPC failed; rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] object ERR src/object/cli_shard.c:263 dc_rw_cb() RPC 1 failed: -1020
From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Hi Kevan & others,
As we’ve been debugging few other “no route to host” failures here, one thing that also turns out needs to be set on large-scale systems is FI_UNIVERSE_SIZE envariable, which in RXM-based providers (verbs included) dictates internal size of address vector table for peers.
With Xuezhaos patch below (already part of latest 0.9 DAOS) and FI_UNIVERSE_SIZE=2048 setting we were able to avoid “No route to host” issues on all known failures so far.
This issue is also being tracked in libfabric as https://github.com/ofiwg/libfabric/issues/5843
Thanks, ~~Alex.
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Liu, Xuezhao
Hi Kevan,
Great, thanks for verifying and the helpful logs.
“On the server side, what should be the proper behavior if a client tries to connect twice using the same IPoIB port? Should the server reject the second connection attempt? Close the first connection and then open the second?” Server should not reject the 2nd connection attempt, the current handling (with the patch) is it will check its local cache (a hash-table) and reuse it if possible.
Xuezhao
From:
<daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Xuezhao,
Yes, it works, thank you very much! I have been trying to figure that one out for several days now! 😊
Kevan
From:
<daos@daos.groups.io> on behalf of "Liu, Xuezhao" <xuezhao.liu@...>
Hi Kevan,
The log you posted is very helpful. It helped to find a potential bug in mercury code for address management, please find attached small patch. It needs to patch mercury source code and rebuild it. Commonly I do that by: cd daos_m_path/_build.external/mercury; git am 0001-NA-OFI-fix-a-potential-bug-in-na_ofi_addr_ht_lookup.patch; cd ../ mercury.build; make clean; make install
Could you please have a test to see if it can resolve the problem?
Thanks, Xuezhao
From:
<daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Greetings,
We have been unsuccessful over the last few days debugging “No route to host” failures that we are constantly seeing when running “ior -a DAOS”. The errors only occur in the log of the daos_io_server on IB1, never on the IB0 server. Client processes hang when this occurs. The problem does not appear to be related to our recent verbs;ofi_rxm issues which we think are now resolved. Any advice you could provide would be greatly appreciated.
Our configuration is one dual-daos_io_server node and one client node. The libfabric transport is verbs;ofi_rxm, both nodes have two interfaces. The server node’s PMEM is wiped and reformatted before each test run. We use mpirun to start 16 ior client processes on the client node. The ior program has been slightly modified so that processes with even-numbered ranks use IB0 while odd-numbered ranks use IB1. Here is an example command:
mpirun --allow-run-as-root -np 16 $X_OPTS /root/ior/ior/src/ior -a DAOS --daos.pool $POOL --daos.cont $CONT --daos.svcl $SVCL -w -r -t 64M -b 1G
When we reduce the number of client processes, the problem occurs less frequently. With only two processes on the client node we haven’t seen the problem.
What happens is that some client processes which connect to the IB1 server actually connect twice using the same source IPoIB port number. Those that do are guaranteed to generate “No route to host” errors in the IB1 server log and then hang. For example, the following grep of the IB1 server log shows that the connecting processes that were assigned to libfabric fi_addrs 2 and 3 are going to fail. If you would repeat this same grep on the IB0 server log, you would never see duplicate connections:
# grep '<debug> av_insert ' *1.log libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.220:20000 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 0 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:43782 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 1 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:39600 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 4 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:56137 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 5
Here are the initial failure messages for fi_addr 2 in the IB1 server log. The messages with “krehm” in them were added by me for debugging purposes:
04/08-19:18:08.43 delphi-002 DAOS[18490/18530] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fecab64eba0, fi_addr=2, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 2 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 2 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fec1b64e820, fi_addr=1, uri=(null)) libfabric:18490:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
My suspicion is that the server’s cleanup of the first connection from fi_addr 2 throws away the libfabric address of that client, so then when the server tries to initiate a RMA Read request using the second client connection, it has no libfabric address for the client and fails.
The failure sequence for fi_addr 3 is identical:
04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7feab365f550, fi_addr=3, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 3 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 3 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:2707 # na_ofi_cq_process_recv_unexpected_event(): unexpected recv msg completion event with tag=140657172791930, len=4294967364 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:384 obj_bulk_transfer() crt_bulk_transfer 0 error (-1020). 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:1062 obj_local_rw() 1153784440846024704.5704104871060330004.13 data transfer failed, dma 1 rc DER_HG(-1020) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
Could someone suggest ways that we might narrow down the problem?
Thanks, Kevan |
|
Re: How to figure out the all the daos_io_server are talking?
Colin Ngam
Hi Tom,
I have a host that is rank 4 and 5. I killed the servers with Control-C and restarted them after maybe 3-5 minutes. The access host did not see any server going away. After restart I see:
DEBUG 14:56:53.800937 mgmt_system.go:147: updated system member: rank 4, addr 172.30.222.236:10001, Started->Started
ERROR: unexpected same state in rank 4 update (Started->Started)
DEBUG 14:56:53.888180 mgmt_system.go:147: updated system member: rank 5, addr 172.30.222.236:10001, Started->Started
ERROR: unexpected same state in rank 5 update (Started->Started)
Is this good or bad?
Today’s master tree.
Thanks.
Colin
From: "Ngam, Colin" <colin.ngam@...>
Hi Tom,
Thanks Tom. I will see if I can replicate the issue.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
The superblock stores allocated ranks:
[tanabarr@wolf-71 daos_m]$ cat /mnt/daos0/superblock
version: 0
uuid: 8359d5d6-1d8f-430c-90db-205fc07113bf
system: daos_server
rank: 2
validrank: true
ms: false
createms: false
bootstrapms: false
On start-up, non-MS daos_server will supply the superblock rank in the join request to the system. The iosrv/mgmt maintains rank:uuid mappings and should allow the rank to rejoin the system given the expected uuid is supplied along with the rank in the request.
I can’t recreate the behavior you are describing at the moment; if I stop and then start a non-MS daos_server, it joins with the same rank it was previously assigned. (I assume you are not reformatting.)
If you can repeat that behavior, please send me the logs. We should be able to get some relevant information from the mgmt module in the server log, from the function ds_mgmt_join_handler().
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi,
I also noticed the following:
My configuration file does not have any “rank” specified; there are 2 daos_io_servers per host. They came up and I see them as Ranks 0 .. 3. All fine. I killed Ranks 2 and 3 (same host) with Control-C and restarted daos_server; they came up registering as Ranks 4 and 5. Is rank persistent? If it is, what is the association between rank number and daos_io_server? Host name, IB GUID, etc.?
I will try to replicate and send a trace when the opportunity arises here.
Thanks.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
Thanks for the positive feedback, much appreciated.
We are looking to add information such as swim activity and other metrics to the dmg system query capability.
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Excellent!
I have seen messages that show the correct daos_io_servers joined, but one of the daos_io_servers has no swim messages. I assume that if you set DEBUG mode they will all have swim messages flowing. I will check next time around.
The dmg command’s feature set has grown nicely.
Thanks Tom.
Colin
From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
“dmg -l <access_point_host> system query --verbose” will give you details of ranks that have previously joined the system including current state. This is an area of active development.
https://daos-stack.github.io/admin/administration/ “system operations“ section
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Colin Ngam
Hi,
A daos_io_server says it is up, but it may not be communicating with the others. I set the DEBUG flag and watch for swim messages.
Is there a way to figure out that all the daos_io_servers are healthy and functional?
Thanks.
Colin |
|
Re: Need advice debugging "No route to host" failures
Oganezov, Alexander A
Hi Kevan,
In our scenarios/cases so far we do not see any error before “No route to host”, so it sounds like you might be hitting a different issue there. Can you provide more information as to what your clients are and which params are being used?
Based on your description, yes, it does not look like FI_UNIVERSE_SIZE would make any difference there given the default size of 256 which you do not cross in your case.
Also, to verify: when you see those errors, you also have Xuezhao’s patch from below, correct?
As to allocation failures – we have not had any reports of that yet using the latest DAOS (which includes Xuezhao’s patch).
Thanks, ~~Alex.
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Kevan Rehm
Alex,
I see “No route to host” a lot, including right at this moment, with only 36 clients talking to 2 daos_io_servers. Your FI_UNIVERSE_SIZE variable probably wouldn’t have an effect in a config this small, correct?
The way this always seems to happen is that some other libfabric error occurs first, e.g. see below, which causes libfabric to close the connection with the server and remove the server’s address from the libfabric AV table. Mercury doesn’t give up when that happens, it keeps retrying when the retry cannot possibly succeed because the client no longer has the server’s address.
I have been meaning to track down this issue, but other issues keep getting in my way. I know there’s a spot in libfabric where it shuts down the connection, but then returns FI_EAGAIN, and that is the beginning of at least one of the “No route to host” loops in Mercury. I am chasing another one at the moment that looks a little different, see below
I think we need some code in Mercury to detect when the connection to the server has been shut down, and cancel the affected requests immediately, so that the logs don’t clutter with extraneous downstream errors.
This current failure is actually a memory allocation failure in libfabric. You can see from the bolded messages that if Mercury would retry this request or any other request, it is guaranteed to get “No route to host”. Does this look familiar to you? If not, I will start inserting debug messages into libfabric.
Thanks for keeping us in the loop,
Kevan
… libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Supported: FI_EP_DGRAM libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Requested: FI_EP_MSG libfabric:140355:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 libfabric:140355:core:core:ofi_bufpool_grow():87<debug> Allocation failed: Cannot allocate memory libfabric:140355:verbs:domain:vrb_ep_close():383<info> EP 0x3ed0520 is being closed libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():159<debug> marking connection handle: 0x3e18c80 for deletion libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():162<debug> [CM] handle: 0x3e18c80 RXM_CMAP_IDLE -> RXM_CMAP_SHUTDOWN 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # NA -- Error -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:3877 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -12 (Cannot allocate memory) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:1985 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4626 # na_ofi_cancel(): Canceling operation ID 0x1d92ca0 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4667 # na_ofi_cancel(): fi_cancel() rc: 0 (Success) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:4393 # HG_Core_forward(): Could not forward buffer 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury.c:1919 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR src/cart/src/cart/crt_hg.c:1208 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] HG_Forward failed, hg_ret: 11 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG src/cart/src/cart/crt_hg.c:1230 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] decref to 5. 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:1034 crt_req_send_immediately(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_hg_req_send failed, rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:517 crt_req_hg_addr_lookup_cb(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_req_send_internal() failed, rc -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_context.c:302 crt_rpc_complete(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] RPC failed; rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] object ERR src/object/cli_shard.c:263 dc_rw_cb() RPC 1 failed: -1020
From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Hi Kevan & others,
As we’ve been debugging few other “no route to host” failures here, one thing that also turns out needs to be set on large-scale systems is FI_UNIVERSE_SIZE envariable, which in RXM-based providers (verbs included) dictates internal size of address vector table for peers.
With Xuezhaos patch below (already part of latest 0.9 DAOS) and FI_UNIVERSE_SIZE=2048 setting we were able to avoid “No route to host” issues on all known failures so far.
This issue is also being tracked in libfabric as https://github.com/ofiwg/libfabric/issues/5843
Thanks, ~~Alex.
From:
daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Liu, Xuezhao
Hi Kevan,
Great, thanks for verifying and the helpful logs.
“On the server side, what should be the proper behavior if a client tries to connect twice using the same IPoIB port? Should the server reject the second connection attempt? Close the first connection and then open the second?” Server should not reject the 2nd connection attempt, the current handling (with the patch) is it will check its local cache (a hash-table) and reuse it if possible.
Xuezhao
From:
<daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Xuezhao,
Yes, it works, thank you very much! I have been trying to figure that one out for several days now! 😊
Kevan
From:
<daos@daos.groups.io> on behalf of "Liu, Xuezhao" <xuezhao.liu@...>
Hi Kevan,
The log you posted is very helpful. It helped to find a potential bug in mercury code for address management, please find attached small patch. It needs to patch mercury source code and rebuild it. Commonly I do that by: cd daos_m_path/_build.external/mercury; git am 0001-NA-OFI-fix-a-potential-bug-in-na_ofi_addr_ht_lookup.patch; cd ../ mercury.build; make clean; make install
Could you please have a test to see if it can resolve the problem?
Thanks, Xuezhao
From:
<daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Greetings,
We have been unsuccessful over the last few days debugging “No route to host” failures that we are constantly seeing when running “ior -a DAOS”. The errors only occur in the log of the daos_io_server on IB1, never on the IB0 server. Client processes hang when this occurs. The problem does not appear to be related to our recent verbs;ofi_rxm issues which we think are now resolved. Any advice you could provide would be greatly appreciated.
Our configuration is one dual-daos_io_server node and one client node. The libfabric transport is verbs;ofi_rxm, both nodes have two interfaces. The server node’s PMEM is wiped and reformatted before each test run. We use mpirun to start 16 ior client processes on the client node. The ior program has been slightly modified so that processes with even-numbered ranks use IB0 while odd-numbered ranks use IB1. Here is an example command:
mpirun --allow-run-as-root -np 16 $X_OPTS /root/ior/ior/src/ior -a DAOS --daos.pool $POOL --daos.cont $CONT --daos.svcl $SVCL -w -r -t 64M -b 1G
When we reduce the number of client processes, the problem occurs less frequently. With only two processes on the client node we haven’t seen the problem.
What happens is that some client processes which connect to the IB1 server actually connect twice using the same source IPoIB port number. Those that do are guaranteed to generate “No route to host” errors in the IB1 server log and then hang. For example, the following grep of the IB1 server log shows that the connecting processes that were assigned to libfabric fi_addrs 2 and 3 are going to fail. If you would repeat this same grep on the IB0 server log, you would never see duplicate connections:
# grep '<debug> av_insert ' *1.log libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.220:20000 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 0 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:43782 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 1 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:39600 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 4 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:56137 libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 5
Here are the initial failure messages for fi_addr 2 in the IB1 server log. The messages with “krehm” in them were added by me for debugging purposes:
04/08-19:18:08.43 delphi-002 DAOS[18490/18530] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fecab64eba0, fi_addr=2, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 2 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 2 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fec1b64e820, fi_addr=1, uri=(null)) libfabric:18490:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
My suspicion is that the server’s cleanup of the first connection from fi_addr 2 throws away the libfabric address of that client, so then when the server tries to initiate a RMA Read request using the second client connection, it has no libfabric address for the client and fails.
The failure sequence for fi_addr 3 is identical:
04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7feab365f550, fi_addr=3, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 3 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 3 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:2707 # na_ofi_cq_process_recv_unexpected_event(): unexpected recv msg completion event with tag=140657172791930, len=4294967364 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:384 obj_bulk_transfer() crt_bulk_transfer 0 error (-1020). 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:1062 obj_local_rw() 1153784440846024704.5704104871060330004.13 data transfer failed, dma 1 rc DER_HG(-1020) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
Could someone suggest ways that we might narrow down the problem?
Thanks, Kevan |
|
Re: Need advice debugging "No route to host" failures
Farrell, Patrick Arthur <patrick.farrell@...>
Note, if it helps, that this ENOMEM is strongly connected to the number of processes and does not seem to be connected to, e.g., transfer size. More processes make it much more likely - we start seeing it at around 30-36 processes on a node.
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Sent: Friday, April 17, 2020 2:27 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Need advice debugging "No route to host" failures
Alex,
I see “No route to host” a lot, including right at this moment, with only 36 clients talking to 2 daos_io_servers. Your FI_UNIVERSE_SIZE variable probably wouldn’t have an effect in a config this small, correct?
The way this always seems to happen is that some other libfabric error occurs first, e.g. see below, which causes libfabric to close the connection with the server and remove the server’s address from the libfabric AV table. Mercury doesn’t give up when that happens, it keeps retrying when the retry cannot possibly succeed because the client no longer has the server’s address.
I have been meaning to track down this issue, but other issues keep getting in my way. I know there’s a spot in libfabric where it shuts down the connection, but then returns FI_EAGAIN, and that is the beginning of at least one of the “No route to host” loops in Mercury. I am chasing another one at the moment that looks a little different, see below
I think we need some code in Mercury to detect when the connection to the server has been shut down, and cancel the affected requests immediately, so that the logs don’t clutter with extraneous downstream errors.
This current failure is actually a memory allocation failure in libfabric. You can see from the bolded messages that if Mercury would retry this request or any other request, it is guaranteed to get “No route to host”. Does this look familiar to you? If not, I will start inserting debug messages into libfabric.
Thanks for keeping us in the loop,
Kevan
… libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Supported: FI_EP_DGRAM libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Requested: FI_EP_MSG libfabric:140355:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 libfabric:140355:core:core:ofi_bufpool_grow():87<debug> Allocation failed: Cannot allocate memory libfabric:140355:verbs:domain:vrb_ep_close():383<info> EP 0x3ed0520 is being closed libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():159<debug> marking connection handle: 0x3e18c80 for deletion libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():162<debug> [CM] handle: 0x3e18c80 RXM_CMAP_IDLE -> RXM_CMAP_SHUTDOWN 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # NA -- Error -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:3877 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -12 (Cannot allocate memory) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:1985 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4626 # na_ofi_cancel(): Canceling operation ID 0x1d92ca0 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4667 # na_ofi_cancel(): fi_cancel() rc: 0 (Success) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:4393 # HG_Core_forward(): Could not forward buffer 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury.c:1919 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR src/cart/src/cart/crt_hg.c:1208 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] HG_Forward failed, hg_ret: 11 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG src/cart/src/cart/crt_hg.c:1230 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] decref to 5. 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:1034 crt_req_send_immediately(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_hg_req_send failed, rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:517 crt_req_hg_addr_lookup_cb(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_req_send_internal() failed, rc -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_context.c:302 crt_rpc_complete(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] RPC failed; rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] object ERR src/object/cli_shard.c:263 dc_rw_cb() RPC 1 failed: -1020
From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Hi Kevan & others,
As we’ve been debugging a few other “no route to host” failures here, one thing that, it turns out, also needs to be set on large-scale systems is the FI_UNIVERSE_SIZE environment variable, which in RXM-based providers (verbs included) dictates the internal size of the address vector table for peers.
With Xuezhao’s patch below (already part of the latest 0.9 DAOS) and the FI_UNIVERSE_SIZE=2048 setting, we have been able to avoid “No route to host” issues in all known failure cases so far.
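For example, one minimal way to apply this setting, assuming your launch environment propagates provider variables to both the daos_io_server and the client processes, is to export it before starting them:
export FI_UNIVERSE_SIZE=2048    # size the RXM address-vector table for up to 2048 peers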
This issue is also being tracked in libfabric as https://github.com/ofiwg/libfabric/issues/5843
Thanks, ~~Alex.
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Liu, Xuezhao
Hi Kevan,
Great, thanks for verifying and the helpful logs.
“On the server side, what should be the proper behavior if a client tries to connect twice using the same IPoIB port? Should the server reject the second connection attempt? Close the first connection and then open the second?” The server should not reject the second connection attempt; the current handling (with the patch) is that it checks its local cache (a hash table) and reuses the existing connection if possible.
Xuezhao
From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Xuezhao,
Yes, it works, thank you very much! I have been trying to figure that one out for several days now! 😊
Kevan
From: <daos@daos.groups.io> on behalf of "Liu, Xuezhao" <xuezhao.liu@...>
Hi Kevan,
The log you posted is very helpful. It helped to find a potential bug in the mercury code for address management; please find attached a small patch. It needs to be applied to the mercury source code, which then has to be rebuilt. Commonly I do that by: cd daos_m_path/_build.external/mercury; git am 0001-NA-OFI-fix-a-potential-bug-in-na_ofi_addr_ht_lookup.patch; cd ../mercury.build; make clean; make install
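For readability, here is the same rebuild sequence one command per line (daos_m_path stands for the path to your DAOS checkout):
cd daos_m_path/_build.external/mercury
git am 0001-NA-OFI-fix-a-potential-bug-in-na_ofi_addr_ht_lookup.patch   # apply the attached patch
cd ../mercury.build
make clean
make install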
Could you please run a test to see whether it resolves the problem?
Thanks, Xuezhao
From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Greetings,
We have been unsuccessful over the last few days in debugging “No route to host” failures that we constantly see when running “ior -a DAOS”. The errors only occur in the log of the daos_io_server on IB1, never in the log of the IB0 server. Client processes hang when this occurs. The problem does not appear to be related to our recent verbs;ofi_rxm issues, which we think are now resolved. Any advice you could provide would be greatly appreciated.
Our configuration is one dual-daos_io_server node and one client node. The libfabric transport is verbs;ofi_rxm, and both nodes have two interfaces. The server node’s PMEM is wiped and reformatted before each test run. We use mpirun to start 16 ior client processes on the client node. The ior program has been slightly modified so that processes with even-numbered ranks use IB0 while odd-numbered ranks use IB1. Here is an example command:
mpirun --allow-run-as-root -np 16 $X_OPTS /root/ior/ior/src/ior -a DAOS --daos.pool $POOL --daos.cont $CONT --daos.svcl $SVCL -w -r -t 64M -b 1G
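The ior source change itself is not shown here; a rough shell-wrapper equivalent, assuming Open MPI exposes OMPI_COMM_WORLD_RANK and that the DAOS client honors the OFI_INTERFACE variable for interface selection, would be to launch ior through a hypothetical script like this in place of the bare binary:
#!/bin/bash
# pick_ib.sh - even-numbered ranks use ib0, odd-numbered ranks use ib1
if [ $((OMPI_COMM_WORLD_RANK % 2)) -eq 0 ]; then
    export OFI_INTERFACE=ib0
else
    export OFI_INTERFACE=ib1
fi
exec /root/ior/ior/src/ior "$@"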
When we reduce the number of client processes, the problem occurs less frequently. With only two processes on the client node we haven’t seen the problem.
What happens is that some client processes that connect to the IB1 server actually connect twice using the same source IPoIB port number. Those that do are guaranteed to generate “No route to host” errors in the IB1 server log and then hang. For example, the following grep of the IB1 server log shows that the connecting processes that were assigned libfabric fi_addrs 2 and 3 are going to fail. If you repeated this same grep on the IB0 server log, you would never see duplicate connections:
# grep '<debug> av_insert ' *1.log
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.220:20000
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 0
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:43782
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 1
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:37665
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 2
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.1.222:48826
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 3
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:39600
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 4
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():609<debug> av_insert addr: fi_sockaddr_in://10.0.0.222:56137
libfabric:18490:ofi_rxm:av:ip_av_insert_addr():612<debug> av_insert fi_addr: 5
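A quick way to surface just the duplicated client addresses in such a log, assuming the av_insert format shown above, is:
grep '<debug> av_insert addr' *1.log | awk '{print $NF}' | sort | uniq -d
which prints any fi_sockaddr_in address that was inserted more than once.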
Here are the initial failure messages for fi_addr 2 in the IB1 server log. The messages with “krehm” in them were added by me for debugging purposes:
04/08-19:18:08.43 delphi-002 DAOS[18490/18530] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fecab64eba0, fi_addr=2, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 2 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 2 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7fec1b64e820, fi_addr=1, uri=(null)) libfabric:18490:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
My suspicion is that the server’s cleanup of the first connection from fi_addr 2 throws away the libfabric address of that client, so when the server then tries to initiate an RMA Read request over the second client connection, it has no libfabric address for the client and the request fails.
The failure sequence for fi_addr 3 is identical:
04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4442 # na_ofi_get(): Posting RMA get (op id=0x7feab365f550, fi_addr=3, uri=(null)) libfabric:18490:ofi_rxm:ep_ctrl:rxm_ep_prepare_tx():851<warn> krehm: rxm_cmap_acquire_handle() returned NULL, fi_addr: 3 libfabric:18490:ofi_rxm:ep_data:rxm_ep_rma_common():80<warn> krehm: rxm_ep_prepare_tx() returned -113, fi_addr: 3 04/08-19:18:08.43 delphi-002 DAOS[18490/18554] hg DBUG # NA -- Debug -- /root/daos/_build.external/mercury/src/na/na_ofi.c:2707 # na_ofi_cq_process_recv_unexpected_event(): unexpected recv msg completion event with tag=140657172791930, len=4294967364 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:384 obj_bulk_transfer() crt_bulk_transfer 0 error (-1020). 04/08-19:18:08.43 delphi-002 DAOS[18490/18578] object ERR src/object/srv_obj.c:1062 obj_local_rw() 1153784440846024704.5704104871060330004.13 data transfer failed, dma 1 rc DER_HG(-1020) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4462 # na_ofi_get(): fi_readmsg() failed, rc: -113 (No route to host) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:793 # hg_bulk_transfer_pieces(): Could not transfer data (NA_PROTOCOL_ERROR) 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:952 # hg_bulk_transfer(): Could not transfer data pieces 04/08-19:18:08.43 delphi-002 DAOS[18490/18614] hg ERR # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1793 # HG_Bulk_transfer_id(): Could not start transfer of bulk data
Could someone suggest ways that we might narrow down the problem?
Thanks, Kevan |
|
Re: Need advice debugging "No route to host" failures
Kevan Rehm
Alex,
I see “No route to host” a lot, including right at this moment, with only 36 clients talking to 2 daos_io_servers. Your FI_UNIVERSE_SIZE variable probably wouldn’t have an effect in a config this small, correct?
The way this always seems to happen is that some other libfabric error occurs first, e.g. see below, which causes libfabric to close the connection with the server and remove the server’s address from the libfabric AV table. Mercury doesn’t give up when that happens; it keeps retrying even though the retry cannot possibly succeed, because the client no longer has the server’s address.
I have been meaning to track down this issue, but other issues keep getting in my way. I know there’s a spot in libfabric where it shuts down the connection but then returns FI_EAGAIN, and that is the beginning of at least one of the “No route to host” loops in Mercury. I am chasing another one at the moment that looks a little different; see below.
I think we need some code in Mercury to detect when the connection to the server has been shut down and cancel the affected requests immediately, so that the logs don’t get cluttered with extraneous downstream errors.
This current failure is actually a memory allocation failure in libfabric. You can see from the bolded messages that if Mercury were to retry this request, or any other request, it would be guaranteed to get “No route to host”. Does this look familiar to you? If not, I will start inserting debug messages into libfabric.
Thanks for keeping us in the loop,
Kevan
… libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Supported: FI_EP_DGRAM libfabric:140355:verbs:core:ofi_check_ep_type():624<info> Requested: FI_EP_MSG libfabric:140355:core:mr:ofi_intercept_mmap():368<debug> intercepted mmap start (nil) len 4194304 libfabric:140355:core:core:ofi_bufpool_grow():87<debug> Allocation failed: Cannot allocate memory libfabric:140355:verbs:domain:vrb_ep_close():383<info> EP 0x3ed0520 is being closed libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():159<debug> marking connection handle: 0x3e18c80 for deletion libfabric:140355:ofi_rxm:ep_ctrl:rxm_cmap_del_handle():162<debug> [CM] handle: 0x3e18c80 RXM_CMAP_IDLE -> RXM_CMAP_SHUTDOWN 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # NA -- Error -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:3877 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -12 (Cannot allocate memory) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:1985 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4626 # na_ofi_cancel(): Canceling operation ID 0x1d92ca0 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG # NA -- Debug -- /delphi/krehm/daos/_build.external/mercury/src/na/na_ofi.c:4667 # na_ofi_cancel(): fi_cancel() rc: 0 (Success) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury_core.c:4393 # HG_Core_forward(): Could not forward buffer 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR # HG -- Error -- /delphi/krehm/daos/_build.external/mercury/src/mercury.c:1919 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR) 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg ERR src/cart/src/cart/crt_hg.c:1208 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] HG_Forward failed, hg_ret: 11 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] hg DBUG src/cart/src/cart/crt_hg.c:1230 crt_hg_req_send(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] decref to 5. 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:1034 crt_req_send_immediately(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_hg_req_send failed, rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_rpc.c:517 crt_req_hg_addr_lookup_cb(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] crt_req_send_internal() failed, rc -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] rpc ERR src/cart/src/cart/crt_context.c:302 crt_rpc_complete(0x3ed0040) [opc=0x4010001 xid=0x5c rank:tag=0:6] RPC failed; rc: -1020 04/17-14:02:59.70 delphi-004 DAOS[140355/140355] object ERR src/object/cli_shard.c:263 dc_rw_cb() RPC 1 failed: -1020
|
|
Re: Need advice debugging "No route to host" failures
Oganezov, Alexander A
Hi Kevan & others,
As we’ve been debugging a few other “no route to host” failures here, one thing that, it turns out, also needs to be set on large-scale systems is the FI_UNIVERSE_SIZE environment variable, which in RXM-based providers (verbs included) dictates the internal size of the address vector table for peers.
With Xuezhao’s patch (already part of the latest 0.9 DAOS) and the FI_UNIVERSE_SIZE=2048 setting, we have been able to avoid “No route to host” issues in all known failure cases so far.
This issue is also being tracked in libfabric as https://github.com/ofiwg/libfabric/issues/5843
Thanks, ~~Alex.
|
|
potential removal of daos_server --recreate-superblocks option
Trying to gauge usage of the “--recreate-superblocks” option: does anyone rely on it? If anyone does, could you please provide your use case.
It is a developer shortcut to avoid the need to run “dmg storage format” when starting daos_server from scratch.
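For reference, the non-shortcut flow that this option bypasses is roughly the following (the host name and config path are illustrative):
daos_server start -o /etc/daos/daos_server.yml
dmg -l <access_point_host> storage format
The superblocks are then written as part of the normal format step.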
It's proving to be an inconvenient side door from the standpoint of control-plane server-side development, and I would like to remove it.
Tom
|
|
Re: Broken build - FUSE changes
Patrick Farrell <paf@...>
OK, thanks - Sorry to prod, it just wasn't clear what the plan was. Thanks for the workaround.
Regards,
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Olivier, Jeffrey V <jeffrey.v.olivier@...>
Sent: Wednesday, April 15, 2020 10:02 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Broken build - FUSE changes
Hi Patrick,
Yes, it’s a bug. Whether we revert and fix or just fix it, it is only a temporary issue.
Jeff
From: <daos@daos.groups.io> on behalf of "Farrell, Patrick Arthur" <patrick.farrell@...>
Brian,
OK, I'm happy to pass that option. Is there no concern over the build failure when that option is not passed?
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Murrell, Brian <brian.murrell@...>
On Tue, 2020-04-14 at 20:04 +0000, Farrell, Patrick Arthur wrote:
> Good afternoon,
Hi.
> The most recent commit (dab7da05cd06b80bd2751a783090ea8a2d4b96b3) has
> broken the build for me, with compile errors in fuse on CentOS
> 7.7. It refers to adding fuse3-devel to the build requirements, but
> configure doesn't require this (it looks like it was added to the RPM
> instructions, not sure how that interacts with scons).
>
> In any case, installing fuse3-devel didn't help,
You have to also tell scons that you want to use the installed fuse3-devel rather than still try to build it locally with:
$ scons USE_INSTALLED=fuse
By default scons tries to build everything with no regard to which components are installed on the system and using them instead.
Cheers,
b. |
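Putting Brian's suggestion together with the fuse3-devel install, a full local rebuild would look something like the following (the exact scons targets and extra options may differ per checkout):
yum install fuse3-devel            # on CentOS 7.7, as discussed above
scons USE_INSTALLED=fuse install   # plus whatever other options your normal build uses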
|
Re: Broken build - FUSE changes
Hi Patrick,
Yes, it’s a bug. Whether we revert and fix or just fix it, it is only a temporary issue.
Jeff
|
|
Re: Broken build - FUSE changes
Farrell, Patrick Arthur <patrick.farrell@...>
Brian,
OK, I'm happy to pass that option. Is there no concern over the build failure when that option is not passed?
-Patrick
|
|