|
Re: Need advice debugging "No route to host" failures
Sean,
If you open a libfabric ticket, please post it back here so that we can follow.
Thanks, Kevan
On 4/17/20, 6:59 PM, "daos@daos.groups.io on behalf of Hefty, Sean" <daos@daos.groups.io on
|
By
Kevan Rehm
·
#903
·
|
|
Re: How to figure out the all the daos_io_server are talking?
So when you use ctrl-c we are unceremoniously SIGKILLing the IO servers (graceful shutdown currently doesn’t work for various reasons and the control plane change to enable it was reverted). The
|
By
Nabarro, Tom
·
#902
·
|
|
Re: Need advice debugging "No route to host" failures
In my daos_server.yml file the nr_hugepages parameter is now set to 8192. This resulted in 16384 pages being allocated, because each SPDK run (2 daos_io_servers) allocated 8192 pages. The comments
|
By
Nabarro, Tom
·
#901
·
|
|
Re: How to figure out the all the daos_io_server are talking?
Hi Tom,
So, I can repeat, stop and restart the servers:
DEBUG 19:32:47.524108 mgmt_system.go:147: updated system member: rank 2, addr 172.30.222.230:10001, Started->Started
ERROR: unexpected
|
By
Colin Ngam
·
#900
·
|
|
Re: Need advice debugging "No route to host" failures
Yes, this sounds ideal. I'll look into this.
- Sean
|
By
Hefty, Sean <sean.hefty@...>
·
#899
·
|
|
Re: How to figure out the all the daos_io_server are talking?
It is not a 100% hit either ..
DEBUG 18:47:46.314720 ctl_system.go:173: Responding to SystemQuery RPC
DEBUG 18:49:02.859993 mgmt_system.go:147: updated system member: rank 4, addr
|
By
Colin Ngam
·
#898
·
|
|
Re: Need advice debugging "No route to host" failures
Hi Kevan,
Interesting finds. I will let Sean comment on expected OFI behavior in case of running out of hugepages, but to me it sounds like it should have switched and used regular memory in
|
By
Oganezov, Alexander A
·
#897
·
|
|
Re: How to figure out the all the daos_io_server are talking?
That’s unexpected, I will try to reproduce over the weekend.
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Colin Ngam
Sent: Friday, April 17, 2020 9:05 PM
To:
|
By
Nabarro, Tom
·
#896
·
|
|
Re: Need advice debugging "No route to host" failures
Note also that this problem can happen on a pure client - So I'd just add that changes to the server config file & handling are not enough. (I think that client was running a server previously, but
|
By
Farrell, Patrick Arthur <patrick.farrell@...>
·
#895
·
|
|
Re: Need advice debugging "No route to host" failures
Alex,
Okay, I figured out what was happening here. I have a couple of questions; I wonder if you think this is mostly a configuration error, or a bug. Or at least maybe there is a way to make
|
By
Kevan Rehm
·
#894
·
|
|
Re: How to figure out the all the daos_io_server are talking?
Hi Tom,
I have a host that is rank 4 and 5. I killed the servers with Control-C and restarted the servers after maybe 3-5 minutes. The access-host did not see any server going away. After restart I
|
By
Colin Ngam
·
#893
·
|
|
Re: Need advice debugging "No route to host" failures
Hi Kevan,
In our scenarios/cases so far we do not see any error before “No route to host”, so it sounds like you might be hitting a different issue there.
Can you provide more information as
|
By
Oganezov, Alexander A
·
#892
·
|
|
Re: Need advice debugging "No route to host" failures
Note, if it helps, that this ENOMEM is strongly connected to the number of processes and does not seem to be connected to, e.g., transfer size. More processes make this much more likely - we start
|
By
Farrell, Patrick Arthur <patrick.farrell@...>
·
#891
·
|
|
Re: Need advice debugging "No route to host" failures
Alex,
I see “No route to host” a lot, including right at this moment, with only 36 clients talking to 2 daos_io_servers. Your FI_UNIVERSE_SIZE variable probably wouldn’t have an effect in
|
By
Kevan Rehm
·
#890
·
|
|
Re: Need advice debugging "No route to host" failures
Hi Kevan & others,
As we’ve been debugging a few other “no route to host” failures here, one thing that also turns out to need setting on large-scale systems is the FI_UNIVERSE_SIZE environment variable,
|
By
Oganezov, Alexander A
·
#889
·
|
|
potential removal of daos_server --recreate-superblocks option
Trying to gauge usage of the “--recreate-superblocks” option: does anyone rely on it? If anyone does, could you please provide your use case?
It is a developer shortcut to avoid the need to run
|
By
Nabarro, Tom
·
#888
·
|
|
Re: Broken build - FUSE changes
OK, thanks - Sorry to prod, it just wasn't clear what the plan was. Thanks for the workaround.
Regards,
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Olivier, Jeffrey V
|
By
Patrick Farrell <paf@...>
·
#887
·
|
|
Re: Broken build - FUSE changes
Hi Patrick,
Yes, it’s a bug. Whether we revert and fix or just fix it, it is only a temporary issue.
Jeff
From: <daos@daos.groups.io> on behalf of "Farrell, Patrick Arthur"
|
By
Olivier, Jeffrey V
·
#886
·
|
|
Re: Broken build - FUSE changes
Brian,
OK, I'm happy to pass that option. Is there no concern over the build failure when that option is not passed?
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Murrell,
|
By
Farrell, Patrick Arthur <patrick.farrell@...>
·
#885
·
|
|
Re: Known problem creating containers?
Greetings,
You can set the following, “export LD_DEBUG=libs”, to see the search path it uses on the system.
Here is some additional info:
delphi-006.us.cray.com ERROR 2020/04/15
|
By
Colin Ngam
·
#884
·
|