Restarting all the daos_servers.


Colin Ngam
 

Hi,

 

I rebooted 2 hosts and restarted the access_host server before restarting the other hosts. Here’s the log:

 

daos_io_server:1  04/13-12:24:09.73 delphi-006 a8a27db9: rank 1 became pool service leader 0

daos_io_server:1  04/13-12:25:09.02 delphi-006 Target (rank 2 idx 0) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 1) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 2) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 3) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 4) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 5) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 6) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 7) is down.

daos_io_server:1  04/13-12:25:09.02 delphi-006 Rebuild [queued] (a8a27db9 ver=9) id 16

daos_io_server:1  04/13-12:25:09.02 delphi-006 Rebuild [started] (pool a8a27db9 ver=9)

daos_io_server:1  04/13-12:25:13.82 delphi-006 Target (rank 3 idx 0) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 1) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 2) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 3) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 4) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 5) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 6) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 7) is down.

daos_io_server:1  04/13-12:25:13.82 delphi-006 Rebuild [queued] (a8a27db9 ver=17) id 24

daos_io_server:1  04/13-12:26:12.73 delphi-006 Rebuild [scanning] (pool a8a27db9 ver=9, toberb_obj=0, rb_obj=0, rec=0, size=0 done 0 status 0/0 duration=63 secs)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [completed] (pool a8a27db9 ver=9, toberb_obj=0, rb_obj=0, rec=0, size=0 done 1 status 0/0 duration=71 secs)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 0) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 1) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 2) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 3) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 4) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 5) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 6) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 7) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [started] (pool a8a27db9 ver=17)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [scanning] (pool a8a27db9 ver=17, toberb_obj=0, rb_obj=0, rec=0, size=0 done 0 status 0/0 duration=0 secs)

daos_io_server:1  04/13-12:26:28.36 delphi-006 Rebuild [completed] (pool a8a27db9 ver=17, toberb_obj=0, rb_obj=0, rec=0, size=0 done 1 status 0/0 duration=7 secs)

daos_io_server:1  04/13-12:26:28.36 delphi-006 Target (rank 3 idx 0) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 1) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 2) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 3) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 4) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 5) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 6) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 7) is excluded.

 

Does the above mean that the existing pool has been reconfigured (rebuilt)? If so, what is the timing before this happens with respect to having all the hosts up?

 

[root@delphi-006 daos]# dmg pool list -l delphi-006

delphi-006:10001: connected

Pool UUID                            Svc Replicas

---------                            ------------

a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1 1

[root@delphi-006 daos]# dmg pool query --pool a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1 -l delphi-006

delphi-006:10001: connected

Pool a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1, ntarget=32, disabled=16

Pool space info:

- Target(VOS) count:16

- SCM:

  Total size: 1.5 TB

  Free: 1.5 TB, min:96 GB, max:96 GB, mean:96 GB

- NVMe:

  Total size: 20 TB

  Free: 20 TB, min:1.2 TB, max:1.2 TB, mean:1.2 TB

Rebuild done, 0 objs, 0 recs
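
For scripted checks, the summary line of the query output above can be parsed to see how many targets are disabled. This is only a minimal sketch: the helper name `parse_pool_query` is hypothetical, and it assumes the output keeps the exact `ntarget=..., disabled=...` layout shown in this message, which may differ between DAOS versions.

```python
import re

# Text copied from the `dmg pool query` output above.
SAMPLE = """\
Pool a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1, ntarget=32, disabled=16
Pool space info:
- Target(VOS) count:16
Rebuild done, 0 objs, 0 recs
"""

# Matches the pool summary line; the layout is an assumption based on this thread.
SUMMARY = re.compile(r"Pool ([0-9a-f-]+), ntarget=(\d+), disabled=(\d+)")


def parse_pool_query(text):
    """Hypothetical helper: return target counts from pool query text."""
    match = SUMMARY.search(text)
    if match is None:
        raise ValueError("no pool summary line found")
    ntarget = int(match.group(2))
    disabled = int(match.group(3))
    return {
        "uuid": match.group(1),
        "ntarget": ntarget,
        "disabled": disabled,
        "active": ntarget - disabled,
    }


summary = parse_pool_query(SAMPLE)
if summary["disabled"] > 0:
    print(f"pool {summary['uuid']}: {summary['disabled']} of "
          f"{summary['ntarget']} targets are disabled")
```

With the output above, this reports 16 of 32 targets disabled, which matches the two excluded ranks with 8 targets each.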

 

How do you prevent this from happening?

 

Thanks.

 

Colin


Wang, Di
 

Hello,

Replied inline.

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 10:40 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Restarting all the daos_servers.

Does the above mean that the existing pool has been reconfigured (rebuilt)? If so, what is the timing before this happens with respect to having all the hosts up?


Yes, this means these two servers have been excluded from the pool. Unfortunately, reintegration is not supported until 1.2; once it is supported, these two servers will be reintegrated into the pool when they are restarted.

Exclusion is triggered automatically as soon as the server disconnection is detected, which takes only a few seconds if the server is really dead.
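
To see the timing in the log from the original message, the interval between the first "is down" and the last "is excluded" message for each rank can be computed directly from the timestamps. The following is a rough sketch, not a DAOS tool: the function name `exclusion_delays` is hypothetical, the regular expression assumes the log layout shown in this thread, and the year 2020 is taken from the mail headers because the log timestamps do not carry one.

```python
import re
from datetime import datetime

# A few lines copied from the log in the original message.
LOG = """\
daos_io_server:1  04/13-12:25:09.02 delphi-006 Target (rank 2 idx 0) is down.
daos_io_server:1  04/13-12:25:13.82 delphi-006 Target (rank 3 idx 0) is down.
daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 0) is excluded.
daos_io_server:1  04/13-12:26:28.36 delphi-006 Target (rank 3 idx 0) is excluded.
"""

# Timestamp, host, rank, target index, and state; layout assumed from this thread.
EVENT = re.compile(
    r"(\d\d/\d\d-\d\d:\d\d:\d\d\.\d\d) \S+ "
    r"Target \(rank (\d+) idx (\d+)\) is (down|excluded)\."
)


def exclusion_delays(log_text):
    """Hypothetical helper: seconds from first 'down' to last 'excluded' per rank."""
    down, excluded = {}, {}
    for match in EVENT.finditer(log_text):
        # The log has no year; 2020 comes from the mail headers.
        stamp = datetime.strptime("2020/" + match.group(1), "%Y/%m/%d-%H:%M:%S.%f")
        rank = int(match.group(2))
        if match.group(4) == "down":
            down.setdefault(rank, stamp)   # keep the first 'down' seen
        else:
            excluded[rank] = stamp         # keep the last 'excluded' seen
    return {
        rank: (excluded[rank] - down[rank]).total_seconds()
        for rank in down
        if rank in excluded
    }


delays = exclusion_delays(LOG)
for rank, seconds in sorted(delays.items()):
    print(f"rank {rank}: {seconds:.2f} s between 'down' and 'excluded'")
```

For the lines above this gives roughly 71 seconds for rank 2 and 75 seconds for rank 3. In the log, the targets are reported down first and reported excluded only after the corresponding rebuild has completed.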

How do you prevent this from happening?


If you are using master, you cannot prevent this, although there is a ticket to add a pool property option that disables self-healing: https://jira.hpdd.intel.com/browse/DAOS-4229
If you are using 0.9, you should not see this, since auto-exclusion is disabled on 0.9.

Thanks
WangDi

 



Colin Ngam
 

Hello WangDi,

 

Thanks for your response.

 

Maybe the priority of that ticket should be higher than 4-LOW? Admittedly, we are still in a testing environment.

 

Thanks.

 

Colin

 

From: <daos@daos.groups.io> on behalf of "Wang, Di" <di.wang@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 1:20 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Restarting all the daos_servers.

 



Wang, Di
 

Hello, Colin

Sure, I will work on the patch sooner. 

Thanks
WangDi

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 11:28 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Restarting all the daos_servers.
