
Re: Restarting all the daos_servers.

Wang, Di
 

Hello, Colin

Sure, I will work on the patch soon.

Thanks
WangDi
From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 11:28 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Restarting all the daos_servers.

Hello WangDi,

 

Thanks for your response.

 

Maybe the priority should be higher than 4-LOW, although we are still in a testing environment?

 

Thanks.

 

Colin

 

From: <daos@daos.groups.io> on behalf of "Wang, Di" <di.wang@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 1:20 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Restarting all the daos_servers.

 

Hello,

 

Replied inline.

 

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 10:40 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Restarting all the daos_servers.

 

Hi,

 

I rebooted 2 hosts. Restarted the access_host server before restarting the other hosts. Here’s the log:

 

daos_io_server:1  04/13-12:24:09.73 delphi-006 a8a27db9: rank 1 became pool service leader 0

daos_io_server:1  04/13-12:25:09.02 delphi-006 Target (rank 2 idx 0) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 1) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 2) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 3) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 4) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 5) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 6) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 7) is down.

daos_io_server:1  04/13-12:25:09.02 delphi-006 Rebuild [queued] (a8a27db9 ver=9) id 16

daos_io_server:1  04/13-12:25:09.02 delphi-006 Rebuild [started] (pool a8a27db9 ver=9)

daos_io_server:1  04/13-12:25:13.82 delphi-006 Target (rank 3 idx 0) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 1) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 2) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 3) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 4) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 5) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 6) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 7) is down.

daos_io_server:1  04/13-12:25:13.82 delphi-006 Rebuild [queued] (a8a27db9 ver=17) id 24

daos_io_server:1  04/13-12:26:12.73 delphi-006 Rebuild [scanning] (pool a8a27db9 ver=9, toberb_obj=0, rb_obj=0, rec=0, size=0 done 0 status 0/0 duration=63 secs)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [completed] (pool a8a27db9 ver=9, toberb_obj=0, rb_obj=0, rec=0, size=0 done 1 status 0/0 duration=71 secs)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 0) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 1) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 2) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 3) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 4) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 5) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 6) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 7) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [started] (pool a8a27db9 ver=17)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [scanning] (pool a8a27db9 ver=17, toberb_obj=0, rb_obj=0, rec=0, size=0 done 0 status 0/0 duration=0 secs)

daos_io_server:1  04/13-12:26:28.36 delphi-006 Rebuild [completed] (pool a8a27db9 ver=17, toberb_obj=0, rb_obj=0, rec=0, size=0 done 1 status 0/0 duration=7 secs)

daos_io_server:1  04/13-12:26:28.36 delphi-006 Target (rank 3 idx 0) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 1) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 2) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 3) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 4) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 5) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 6) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 7) is excluded.

 

Does the above mean that the existing pool has been reconfigured (rebuilt)? If so, what is the timing before this happens with respect to having all the hosts up?

 

Yes, this means these two servers have been excluded from the pool. Unfortunately, reintegration is not supported until 1.2; once it is supported, these two servers will be reintegrated into the pool when they are restarted.

 

The exclusion is triggered automatically once the server disconnection is detected, which takes only a few seconds if the server is really dead.

 

[root@delphi-006 daos]# dmg pool list -l delphi-006

delphi-006:10001: connected

Pool UUID                            Svc Replicas

---------                            ------------

a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1 1

[root@delphi-006 daos]# dmg pool query --pool a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1 -l delphi-006

delphi-006:10001: connected

Pool a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1, ntarget=32, disabled=16

Pool space info:

- Target(VOS) count:16

- SCM:

  Total size: 1.5 TB

  Free: 1.5 TB, min:96 GB, max:96 GB, mean:96 GB

- NVMe:

  Total size: 20 TB

  Free: 20 TB, min:1.2 TB, max:1.2 TB, mean:1.2 TB

Rebuild done, 0 objs, 0 recs

 

How do you prevent this from happening?

 

If you are using master, then you cannot prevent this, though there is a ticket to add a pool property option to disable self-healing: https://jira.hpdd.intel.com/browse/DAOS-4229

If you are using 0.9, you should not see this, since auto-exclusion is disabled in 0.9.

 

Thanks

WangDi

 

Thanks.

 

Colin


Re: Restarting all the daos_servers.

Colin Ngam
 

Hello WangDi,

 

Thanks for your response.

 

Maybe the priority should be higher than 4-LOW, although we are still in a testing environment?

 

Thanks.

 

Colin

 

From: <daos@daos.groups.io> on behalf of "Wang, Di" <di.wang@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 1:20 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Restarting all the daos_servers.

 

Hello,

 

Replied inline.

 

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 10:40 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Restarting all the daos_servers.

 

Hi,

 

I rebooted 2 hosts. Restarted the access_host server before restarting the other hosts. Here’s the log:

 

daos_io_server:1  04/13-12:24:09.73 delphi-006 a8a27db9: rank 1 became pool service leader 0

daos_io_server:1  04/13-12:25:09.02 delphi-006 Target (rank 2 idx 0) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 1) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 2) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 3) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 4) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 5) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 6) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 7) is down.

daos_io_server:1  04/13-12:25:09.02 delphi-006 Rebuild [queued] (a8a27db9 ver=9) id 16

daos_io_server:1  04/13-12:25:09.02 delphi-006 Rebuild [started] (pool a8a27db9 ver=9)

daos_io_server:1  04/13-12:25:13.82 delphi-006 Target (rank 3 idx 0) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 1) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 2) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 3) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 4) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 5) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 6) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 7) is down.

daos_io_server:1  04/13-12:25:13.82 delphi-006 Rebuild [queued] (a8a27db9 ver=17) id 24

daos_io_server:1  04/13-12:26:12.73 delphi-006 Rebuild [scanning] (pool a8a27db9 ver=9, toberb_obj=0, rb_obj=0, rec=0, size=0 done 0 status 0/0 duration=63 secs)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [completed] (pool a8a27db9 ver=9, toberb_obj=0, rb_obj=0, rec=0, size=0 done 1 status 0/0 duration=71 secs)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 0) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 1) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 2) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 3) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 4) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 5) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 6) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 7) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [started] (pool a8a27db9 ver=17)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [scanning] (pool a8a27db9 ver=17, toberb_obj=0, rb_obj=0, rec=0, size=0 done 0 status 0/0 duration=0 secs)

daos_io_server:1  04/13-12:26:28.36 delphi-006 Rebuild [completed] (pool a8a27db9 ver=17, toberb_obj=0, rb_obj=0, rec=0, size=0 done 1 status 0/0 duration=7 secs)

daos_io_server:1  04/13-12:26:28.36 delphi-006 Target (rank 3 idx 0) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 1) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 2) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 3) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 4) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 5) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 6) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 7) is excluded.

 

Does the above mean that the existing pool has been reconfigured (rebuilt)? If so, what is the timing before this happens with respect to having all the hosts up?

 

Yes, this means these two servers have been excluded from the pool. Unfortunately, reintegration is not supported until 1.2; once it is supported, these two servers will be reintegrated into the pool when they are restarted.

 

The exclusion is triggered automatically once the server disconnection is detected, which takes only a few seconds if the server is really dead.

 

[root@delphi-006 daos]# dmg pool list -l delphi-006

delphi-006:10001: connected

Pool UUID                            Svc Replicas

---------                            ------------

a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1 1

[root@delphi-006 daos]# dmg pool query --pool a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1 -l delphi-006

delphi-006:10001: connected

Pool a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1, ntarget=32, disabled=16

Pool space info:

- Target(VOS) count:16

- SCM:

  Total size: 1.5 TB

  Free: 1.5 TB, min:96 GB, max:96 GB, mean:96 GB

- NVMe:

  Total size: 20 TB

  Free: 20 TB, min:1.2 TB, max:1.2 TB, mean:1.2 TB

Rebuild done, 0 objs, 0 recs

 

How do you prevent this from happening?

 

If you are using master, then you cannot prevent this, though there is a ticket to add a pool property option to disable self-healing: https://jira.hpdd.intel.com/browse/DAOS-4229

If you are using 0.9, you should not see this, since auto-exclusion is disabled in 0.9.

 

Thanks

WangDi

 

Thanks.

 

Colin


Re: Restarting all the daos_servers.

Wang, Di
 

Hello,

Replied inline.

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 10:40 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Restarting all the daos_servers.

Hi,

 

I rebooted 2 hosts. Restarted the access_host server before restarting the other hosts. Here’s the log:

 

daos_io_server:1  04/13-12:24:09.73 delphi-006 a8a27db9: rank 1 became pool service leader 0

daos_io_server:1  04/13-12:25:09.02 delphi-006 Target (rank 2 idx 0) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 1) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 2) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 3) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 4) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 5) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 6) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 7) is down.

daos_io_server:1  04/13-12:25:09.02 delphi-006 Rebuild [queued] (a8a27db9 ver=9) id 16

daos_io_server:1  04/13-12:25:09.02 delphi-006 Rebuild [started] (pool a8a27db9 ver=9)

daos_io_server:1  04/13-12:25:13.82 delphi-006 Target (rank 3 idx 0) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 1) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 2) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 3) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 4) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 5) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 6) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 7) is down.

daos_io_server:1  04/13-12:25:13.82 delphi-006 Rebuild [queued] (a8a27db9 ver=17) id 24

daos_io_server:1  04/13-12:26:12.73 delphi-006 Rebuild [scanning] (pool a8a27db9 ver=9, toberb_obj=0, rb_obj=0, rec=0, size=0 done 0 status 0/0 duration=63 secs)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [completed] (pool a8a27db9 ver=9, toberb_obj=0, rb_obj=0, rec=0, size=0 done 1 status 0/0 duration=71 secs)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 0) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 1) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 2) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 3) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 4) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 5) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 6) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 7) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [started] (pool a8a27db9 ver=17)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [scanning] (pool a8a27db9 ver=17, toberb_obj=0, rb_obj=0, rec=0, size=0 done 0 status 0/0 duration=0 secs)

daos_io_server:1  04/13-12:26:28.36 delphi-006 Rebuild [completed] (pool a8a27db9 ver=17, toberb_obj=0, rb_obj=0, rec=0, size=0 done 1 status 0/0 duration=7 secs)

daos_io_server:1  04/13-12:26:28.36 delphi-006 Target (rank 3 idx 0) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 1) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 2) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 3) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 4) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 5) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 6) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 7) is excluded.

 

Does the above mean that the existing pool has been reconfigured (rebuilt)? If so, what is the timing before this happens with respect to having all the hosts up?


Yes, this means these two servers have been excluded from the pool. Unfortunately, reintegration is not supported until 1.2; once it is supported, these two servers will be reintegrated into the pool when they are restarted.

The exclusion is triggered automatically once the server disconnection is detected, which takes only a few seconds if the server is really dead.
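
(Illustrative aside, not part of the original message: the same exclusion and rebuild state shown by dmg below can also be watched from a client application. This is a minimal sketch assuming an already-connected pool handle poh and the daos_pool_query()/daos_pool_info_t definitions from daos_pool.h of roughly this vintage; field names such as pi_ndisabled and pi_rebuild_st.rs_done may need adjusting to your checkout.)

#include <stdio.h>
#include <unistd.h>
#include <daos.h>

/* Poll the pool until the current rebuild completes. */
static void watch_rebuild(daos_handle_t poh)
{
	daos_pool_info_t info = {0};
	int rc;

	for (;;) {
		/* No per-target query, no properties, synchronous call. */
		rc = daos_pool_query(poh, NULL, &info, NULL, NULL);
		if (rc != 0) {
			fprintf(stderr, "pool query failed: %d\n", rc);
			return;
		}
		/* pi_ndisabled counts excluded targets (e.g. 16 in the dmg
		 * output below); rs_done flips when the rebuild finishes. */
		printf("targets=%u disabled=%u rebuild ver=%u done=%d err=%d\n",
		       info.pi_ntargets, info.pi_ndisabled,
		       info.pi_rebuild_st.rs_version,
		       info.pi_rebuild_st.rs_done,
		       info.pi_rebuild_st.rs_errno);
		if (info.pi_rebuild_st.rs_done)
			return;
		sleep(5);
	}
}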

[root@delphi-006 daos]# dmg pool list -l delphi-006

delphi-006:10001: connected

Pool UUID                            Svc Replicas

---------                            ------------

a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1 1

[root@delphi-006 daos]# dmg pool query --pool a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1 -l delphi-006

delphi-006:10001: connected

Pool a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1, ntarget=32, disabled=16

Pool space info:

- Target(VOS) count:16

- SCM:

  Total size: 1.5 TB

  Free: 1.5 TB, min:96 GB, max:96 GB, mean:96 GB

- NVMe:

  Total size: 20 TB

  Free: 20 TB, min:1.2 TB, max:1.2 TB, mean:1.2 TB

Rebuild done, 0 objs, 0 recs

 

How do you prevent this from happening?


If you are using master, then you cannot prevent this, though there is a ticket to add a pool property option to disable self-healing: https://jira.hpdd.intel.com/browse/DAOS-4229
If you are using 0.9, you should not see this, since auto-exclusion is disabled in 0.9.

Thanks
WangDi

 

Thanks.

 

Colin


Re: Known problem creating containers?

Kevan Rehm
 

Joel,

 

I came to the same conclusion; sorry for wasting your time.

 

There is currently an issue where the daos_io_server dies immediately because it can’t find its own librdb.so module, which got moved into lib64/daos_srv.   If I move librdb.so to lib then it complains about other modules.  What is the correct way to configure for this?  

 

04/13-04:20:00.47 delphi-004 DAOS[70088/70088] server ERR  src/iosrv/module.c:105 dss_module_load() cannot load librdb.so: librdb.so: cannot open shared object file: No such file or directory

04/13-04:20:00.47 delphi-004 DAOS[70088/70088] server ERR  src/iosrv/init.c:195 modules_load() Failed to load module rdb: -1003

 

To work around this, I set LD_LIBRARY_PATH in the environ section of daos_server.yml to include all library-related subdirectories within the install tree. To push the install_dir out to all the server nodes, I use rsync. By default rsync doesn't delete files at the destination if they are not in the source, so as libraries moved around in the install tree over time I eventually ended up with two copies of the same .so in different directories, and LD_LIBRARY_PATH caused the wrong one to be picked.

 

Sorry, Kevan

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 12:20 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Hi Kevan,

 

I ran your test locally in my environment on master and it encountered no issues.  Figures, right?   I spent some time looking at the auto-generated code to see what it is doing.  I don’t have any particular expertise in that.  But, it’s clear that if any of the parts are out of sync, it will not work.  I’m wondering if you have any stale protobuf files on your machine.  Can you diff the protobuf files on your machine against 1) daos master, and 2) your coworker’s machine to see if all three sets are equal?

 

There is a full list in src/proto/Makefile.  Of particular interest are these three:

 

src/mgmt/srv.pb-c.c

src/mgmt/srv.pb-c.h

src/control/common/proto/mgmt/srv.pb.go

 

Joel

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Kevan Rehm
Sent: Monday, April 13, 2020 12:30 PM
To: daos@daos.groups.io
Subject: Re: [daos] Known problem creating containers?

 

Joel,

 

Thanks for the explanation below, makes sense.   Can’t wait for that code to land.

 

Back to the problem at hand; now I am even more confused. I borrowed a machine from one of my compatriots and breakpointed his daos_io_server in routine ds_mgmt_drpc_get_attach_info; in his daemon the resp structure has all 7 fields, so he doesn't get a segfault. We are building from the same commit point. ?????

 

Do you have any ideas on what could be different on my machine? Same CentOS 7 release. I will keep debugging; it again appears to be related to my environment somehow.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 9:19 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Hi Kevan,

 

You are right that it won’t help a client to know the interface and domain names of the server.  In this case, we’re not actually sending the server’s interface and domain in the server’s response.  These fields are left empty until they are populated by the agent.  On the update I am working on now, the agent scans the client machine for network interfaces that support the server’s provider (based on the GetAttachInfo provider data) and populates the interface and domain fields in the response sent to the client.  In an update after that, the libdaos library then gets some rework to generate a GetAttachInfo prior to initializing CaRT so that it can use the interface and domain data that’s returned to it.  I’m working on getting this through review now.

 

Thanks for the additional debug log.  I appreciate your insight and help.  I will work on replicating the problem locally so I can fix it.

 

Joel

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Kevan Rehm
Sent: Monday, April 13, 2020 9:57 AM
To: daos@daos.groups.io
Subject: Re: [daos] Known problem creating containers?

 

Joel,

 

I’m curious; how does it help a client to know the interface and domain names of this server?   I can’t see how the client could possibly use them.

 

Anyway, back to the problem.   I am breakpointed in ds_mgmt_drpc_get_attach_info().   At the top of the routine is this:

 

        Mgmt__GetAttachInfoResp  resp = MGMT__GET_ATTACH_INFO_RESP__INIT;

 

If I look in the code at the definition of Mgmt__GetAttachInfoResp, it has the 7 data fields, including your new interface field, etc., and the value of MGMT__GET_ATTACH_INFO_RESP__INIT appears to initialize all 7 of those fields. But if I use gdb to look at that structure, you can see that the code doesn't actually know about any of the new fields; it is only aware of status and n_psrs/psrs:

 

(gdb) p resp

$7 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 0, psrs = 0x0}

(gdb) p resp.status

$8 = 0

(gdb) p resp.n_psrs

$9 = 0

(gdb) p resp.psrs

$10 = (Mgmt__GetAttachInfoResp__Psr **) 0x0

(gdb) p resp.provider

There is no member named provider.

(gdb) p resp.interface

There is no member named interface.

(gdb) p resp.domain

There is no member named domain.

 

(gdb) p sizeof(resp)

$13 = 48

 

If you do the math, you can see that the size of ‘resp’ is correct if the struct ends with the psrs field; there is no room in the struct for the new fields.
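
(Illustrative aside, not part of the original message: a rough sketch of the two layouts on x86_64, mine rather than the actual generated srv.pb-c.h, showing why the stale definition comes out at 48 bytes while the regenerated layout that get_packed_size() walks is larger. Exact generated field order and padding may differ.)

#include <stddef.h>
#include <stdint.h>

/* Stand-in for ProtobufCMessage: descriptor pointer plus unknown-field
 * bookkeeping, 24 bytes on x86_64. */
typedef struct {
	const void *descriptor;
	unsigned    n_unknown_fields;
	void       *unknown_fields;
} base_sketch_t;

/* What the stale header describes: 24 + 8 (status plus padding) + 8 + 8
 * = 48 bytes, matching sizeof(resp) above. */
typedef struct {
	base_sketch_t base;
	int32_t       status;
	size_t        n_psrs;
	void        **psrs;
} old_resp_sketch_t;

/* What the regenerated header (and the 7-field descriptor used by
 * get_packed_size()) expects: the three strings and two integers live
 * past the end of the 48-byte stack variable, so the packing code reads
 * whatever junk happens to follow it. */
typedef struct {
	base_sketch_t base;
	int32_t       status;
	size_t        n_psrs;
	void        **psrs;
	char         *provider;
	char         *interface;
	char         *domain;
	uint32_t      crtctxshareaddr;
	uint32_t      crttimeout;
} new_resp_sketch_t;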

 

If I then step forward and enter routine mgmt__get_attach_info_resp__get_packed_size(), that routine DOES know about 7 fields and tries to reference all of them, but of course the resp structure on the stack isn’t big enough to hold the 7 fields, so this routine is looking at other junk on the stack past the end of the structure:

 

239           len = mgmt__get_attach_info_resp__get_packed_size(&resp);

(gdb) s

mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab041f740) at src/mgmt/srv.pb-c.c:295

295      assert(message->base.descriptor == &mgmt__get_attach_info_resp__descriptor);

(gdb) n

296      return protobuf_c_message_get_packed_size ((const ProtobufCMessage*)(message));

(gdb) p message

$11 = (const Mgmt__GetAttachInfoResp *) 0x2aaab041f740

(gdb) p *message

$12 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 1, 

  psrs = 0x7f9396113e00, provider = 0x0, interface = 0xad26fa6a89442100 <Address 0xad26fa6a89442100 out of bounds>, domain = 0x7f96f4026a10 "\340\230VW\227\177", 

  crtctxshareaddr = 4093798800, crttimeout = 32662}

 

Happy hunting,

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 9:34 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

That’s good information.  I added the “interface” field and some others last week as we are expanding the capabilities of the GetAttachInfo message to help automatically configure the clients. It’s not clear why adding the fields would cause any of the unpacking code to fail, especially when it’s auto generated based on the protobuf definition.  However, I did build those files with newer versions of the protobuf compiler, and it’s possible that there’s a subtle incompatibility that I wasn’t aware of. 

 

I upgraded to the newer versions of the tools because it was less friction than getting and installing the older tools.  That meant that the related protobuf files were recompiled with the new tools and are now in the tree. 

 

I’ll look at this to understand what’s happening.  Aside from debugging the failure, I’ll see if I can get the old tools reinstalled so I can rebuild the protobufs and have you try them to see if it works when compiled with the older tools.  The answer to that would give some clues.  

 

Joel

 

 

On Apr 12, 2020, at 9:13 PM, Kevan Rehm <kevan.rehm@...> wrote:

Joel,

 

I am still chasing this. The problem is occurring in the server in routine ds_mgmt_drpc_get_attach_info. Routine ds_mgmt_get_attach_info_handler() fills in ‘resp’ with n_psrs and the psrs array. Then this routine fills in resp.status and calls mgmt__get_attach_info_resp__get_packed_size(). It is in that routine that the segfault occurs. The struct is _Mgmt__GetAttachInfoResp; there are other fields that are not being filled in, and the segfault occurs on one of these, ‘interface’. The MGMT__GET_ATTACH_INFO_RESP__INIT macro at the beginning of function ds_mgmt_drpc_get_attach_info appears to set all the string fields to “”, but by the time the code gets to the ‘interface’ field in mgmt__get_attach_info_resp__get_packed_size it contains some out-of-range value that causes the segfault.

 

I don’t really understand the packing code, just giving you these tidbits until I can dig further tomorrow.

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 8:05 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Actually, we are not; there was some confusion on that point. Kevan is running latest master; I accidentally wound up a week out of date.

 

So I assume if I updated, I would have the same issue.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Sunday, April 12, 2020 5:03 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Are you both running the same build?




On Apr 12, 2020, at 4:36 PM, Kevan Rehm <kevan.rehm@...> wrote:

Sigh. Please ignore this; one of my compatriots with the same hardware config was able to create this pool and container without error, so the problem is obviously in my setup.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 2:01 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Known problem creating containers?

 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At this point the client window hangs, and the daos_io_server segfaults. The backtrace collected via gdb is:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan


Restarting all the daos_servers.

Colin Ngam
 

Hi,

 

I rebooted 2 hosts. Restarted the access_host server before restarting the other hosts. Here’s the log:

 

daos_io_server:1  04/13-12:24:09.73 delphi-006 a8a27db9: rank 1 became pool service leader 0

daos_io_server:1  04/13-12:25:09.02 delphi-006 Target (rank 2 idx 0) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 1) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 2) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 3) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 4) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 5) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 6) is down.

04/13-12:25:09.02 delphi-006 Target (rank 2 idx 7) is down.

daos_io_server:1  04/13-12:25:09.02 delphi-006 Rebuild [queued] (a8a27db9 ver=9) id 16

daos_io_server:1  04/13-12:25:09.02 delphi-006 Rebuild [started] (pool a8a27db9 ver=9)

daos_io_server:1  04/13-12:25:13.82 delphi-006 Target (rank 3 idx 0) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 1) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 2) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 3) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 4) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 5) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 6) is down.

04/13-12:25:13.82 delphi-006 Target (rank 3 idx 7) is down.

daos_io_server:1  04/13-12:25:13.82 delphi-006 Rebuild [queued] (a8a27db9 ver=17) id 24

daos_io_server:1  04/13-12:26:12.73 delphi-006 Rebuild [scanning] (pool a8a27db9 ver=9, toberb_obj=0, rb_obj=0, rec=0, size=0 done 0 status 0/0 duration=63 secs)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [completed] (pool a8a27db9 ver=9, toberb_obj=0, rb_obj=0, rec=0, size=0 done 1 status 0/0 duration=71 secs)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 0) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 1) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 2) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 3) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 4) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 5) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Target (rank 2 idx 6) is excluded.

04/13-12:26:20.36 delphi-006 Target (rank 2 idx 7) is excluded.

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [started] (pool a8a27db9 ver=17)

daos_io_server:1  04/13-12:26:20.36 delphi-006 Rebuild [scanning] (pool a8a27db9 ver=17, toberb_obj=0, rb_obj=0, rec=0, size=0 done 0 status 0/0 duration=0 secs)

daos_io_server:1  04/13-12:26:28.36 delphi-006 Rebuild [completed] (pool a8a27db9 ver=17, toberb_obj=0, rb_obj=0, rec=0, size=0 done 1 status 0/0 duration=7 secs)

daos_io_server:1  04/13-12:26:28.36 delphi-006 Target (rank 3 idx 0) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 1) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 2) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 3) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 4) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 5) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 6) is excluded.

04/13-12:26:28.36 delphi-006 Target (rank 3 idx 7) is excluded.

 

Does the above mean that the existing pool has been reconfigured (rebuilt)? If so, what is the timing before this happens with respect to having all the hosts up?

 

[root@delphi-006 daos]# dmg pool list -l delphi-006

delphi-006:10001: connected

Pool UUID                            Svc Replicas

---------                            ------------

a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1 1

[root@delphi-006 daos]# dmg pool query --pool a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1 -l delphi-006

delphi-006:10001: connected

Pool a8a27db9-a69a-4cd3-ae7e-b90a2a2ef3a1, ntarget=32, disabled=16

Pool space info:

- Target(VOS) count:16

- SCM:

  Total size: 1.5 TB

  Free: 1.5 TB, min:96 GB, max:96 GB, mean:96 GB

- NVMe:

  Total size: 20 TB

  Free: 20 TB, min:1.2 TB, max:1.2 TB, mean:1.2 TB

Rebuild done, 0 objs, 0 recs

 

How do you prevent this from happening?

 

Thanks.

 

Colin


Re: Known problem creating containers?

Rosenzweig, Joel B <joel.b.rosenzweig@...>
 

Hi Kevan,

 

I ran your test locally in my environment on master and it encountered no issues.  Figures, right?   I spent some time looking at the auto-generated code to see what it is doing.  I don’t have any particular expertise in that.  But, it’s clear that if any of the parts are out of sync, it will not work.  I’m wondering if you have any stale protobuf files on your machine.  Can you diff the protobuf files on your machine against 1) daos master, and 2) your coworker’s machine to see if all three sets are equal?

 

There is a full list in src/proto/Makefile.  Of particular interest are these three:

 

src/mgmt/srv.pb-c.c

src/mgmt/srv.pb-c.h

src/control/common/proto/mgmt/srv.pb.go

 

Joel

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Kevan Rehm
Sent: Monday, April 13, 2020 12:30 PM
To: daos@daos.groups.io
Subject: Re: [daos] Known problem creating containers?

 

Joel,

 

Thanks for the explanation below, makes sense.   Can’t wait for that code to land.

 

Back to the problem at hand; now I am even more confused. I borrowed a machine from one of my compatriots and breakpointed his daos_io_server in routine ds_mgmt_drpc_get_attach_info; in his daemon the resp structure has all 7 fields, so he doesn't get a segfault. We are building from the same commit point. ?????

 

Do you have any ideas on what could be different on my machine? Same CentOS 7 release. I will keep debugging; it again appears to be related to my environment somehow.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 9:19 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Hi Kevan,

 

You are right that it won’t help a client to know the interface and domain names of the server.  In this case, we’re not actually sending the server’s interface and domain in the server’s response.  These fields are left empty until they are populated by the agent.  On the update I am working on now, the agent scans the client machine for network interfaces that support the server’s provider (based on the GetAttachInfo provider data) and populates the interface and domain fields in the response sent to the client.  In an update after that, the libdaos library then gets some rework to generate a GetAttachInfo prior to initializing CaRT so that it can use the interface and domain data that’s returned to it.  I’m working on getting this through review now.

 

Thanks for the additional debug log.  I appreciate your insight and help.  I will work on replicating the problem locally so I can fix it.

 

Joel

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Kevan Rehm
Sent: Monday, April 13, 2020 9:57 AM
To: daos@daos.groups.io
Subject: Re: [daos] Known problem creating containers?

 

Joel,

 

I’m curious; how does it help a client to know the interface and domain names of this server?   I can’t see how the client could possibly use them.

 

Anyway, back to the problem.   I am breakpointed in ds_mgmt_drpc_get_attach_info().   At the top of the routine is this:

 

        Mgmt__GetAttachInfoResp  resp = MGMT__GET_ATTACH_INFO_RESP__INIT;

 

If I look in the code at the definition of Mgmt__GetAttachInfoResp, it has the 7 data fields, including your new interface field, etc., and the value of MGMT__GET_ATTACH_INFO_RESP__INIT appears to initialize all 7 of those fields. But if I use gdb to look at that structure, you can see that the code doesn't actually know about any of the new fields; it is only aware of status and n_psrs/psrs:

 

(gdb) p resp

$7 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 0, psrs = 0x0}

(gdb) p resp.status

$8 = 0

(gdb) p resp.n_psrs

$9 = 0

(gdb) p resp.psrs

$10 = (Mgmt__GetAttachInfoResp__Psr **) 0x0

(gdb) p resp.provider

There is no member named provider.

(gdb) p resp.interface

There is no member named interface.

(gdb) p resp.domain

There is no member named domain.

 

(gdb) p sizeof(resp)

$13 = 48

 

If you do the math, you can see that the size of ‘resp’ is correct if the struct ends with the psrs field; there is no room in the struct for the new fields.

 

If I then step forward and enter routine mgmt__get_attach_info_resp__get_packed_size(), that routine DOES know about 7 fields and tries to reference all of them, but of course the resp structure on the stack isn’t big enough to hold the 7 fields, so this routine is looking at other junk on the stack past the end of the structure:

 

239           len = mgmt__get_attach_info_resp__get_packed_size(&resp);

(gdb) s

mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab041f740) at src/mgmt/srv.pb-c.c:295

295      assert(message->base.descriptor == &mgmt__get_attach_info_resp__descriptor);

(gdb) n

296      return protobuf_c_message_get_packed_size ((const ProtobufCMessage*)(message));

(gdb) p message

$11 = (const Mgmt__GetAttachInfoResp *) 0x2aaab041f740

(gdb) p *message

$12 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 1, 

  psrs = 0x7f9396113e00, provider = 0x0, interface = 0xad26fa6a89442100 <Address 0xad26fa6a89442100 out of bounds>, domain = 0x7f96f4026a10 "\340\230VW\227\177", 

  crtctxshareaddr = 4093798800, crttimeout = 32662}

 

Happy hunting,

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 9:34 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

That’s good information.  I added the “interface” field and some others last week as we are expanding the capabilities of the GetAttachInfo message to help automatically configure the clients. It’s not clear why adding the fields would cause any of the unpacking code to fail, especially when it’s auto generated based on the protobuf definition.  However, I did build those files with newer versions of the protobuf compiler, and it’s possible that there’s a subtle incompatibility that I wasn’t aware of. 

 

I upgraded to the newer versions of the tools because it was less friction than getting and installing the older tools.  That meant that the related protobuf files were recompiled with the new tools and are now in the tree. 

 

I’ll look at this to understand what’s happening.  Aside from debugging the failure, I’ll see if I can get the old tools reinstalled so I can rebuild the protobufs and have you try them to see if it works when compiled with the older tools.  The answer to that would give some clues.  

 

Joel

 

 

On Apr 12, 2020, at 9:13 PM, Kevan Rehm <kevan.rehm@...> wrote:

Joel,

 

I am still chasing this. The problem is occurring in the server in routine ds_mgmt_drpc_get_attach_info. Routine ds_mgmt_get_attach_info_handler() fills in ‘resp’ with n_psrs and the psrs array. Then this routine fills in resp.status and calls mgmt__get_attach_info_resp__get_packed_size(). It is in that routine that the segfault occurs. The struct is _Mgmt__GetAttachInfoResp; there are other fields that are not being filled in, and the segfault occurs on one of these, ‘interface’. The MGMT__GET_ATTACH_INFO_RESP__INIT macro at the beginning of function ds_mgmt_drpc_get_attach_info appears to set all the string fields to “”, but by the time the code gets to the ‘interface’ field in mgmt__get_attach_info_resp__get_packed_size it contains some out-of-range value that causes the segfault.

 

I don’t really understand the packing code, just giving you these tidbits until I can dig further tomorrow.

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 8:05 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Actually, we are not; there was some confusion on that point. Kevan is running latest master; I accidentally wound up a week out of date.

 

So I assume if I updated, I would have the same issue.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Sunday, April 12, 2020 5:03 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Are you both running the same build?



On Apr 12, 2020, at 4:36 PM, Kevan Rehm <kevan.rehm@...> wrote:

Sigh. Please ignore this; one of my compatriots with the same hardware config was able to create this pool and container without error, so the problem is obviously in my setup.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 2:01 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Known problem creating containers?

 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At this point the client window hangs, and the daos_io_server segfaults. The backtrace collected via gdb is:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan


Re: Dkeys and NULL Akey

Niu, Yawei
 

Hi, Colin

 

Unfortunately, daos_perf supports only the DAOS fetch/update APIs so far. There are IOR and FIO plugins that run over the DAOS array API, but I’m not aware of any benchmark using daos_kv_put().
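
(Illustrative aside, not part of the original message: if a rough number is needed before a proper benchmark exists, a hand-rolled timing loop is straightforward. This minimal sketch assumes an already-opened KV object handle oh; note that the daos_kv_put() prototype and its header have moved between releases (daos_addons.h in older trees, daos_kv.h with a flags argument later), so adjust the call to match your checkout.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <daos.h>
#include <daos_kv.h>

/* Time nkeys daos_kv_put() calls of val_size bytes each against handle oh. */
static void kv_put_bench(daos_handle_t oh, int nkeys, size_t val_size)
{
	char key[32];
	char *val = calloc(1, val_size);
	struct timespec t0, t1;
	int i, rc;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < nkeys; i++) {
		snprintf(key, sizeof(key), "key.%d", i);
		/* DAOS_TX_NONE: independent update; 0: no conditional flags. */
		rc = daos_kv_put(oh, DAOS_TX_NONE, 0, key, val_size, val, NULL);
		if (rc != 0) {
			fprintf(stderr, "daos_kv_put failed: %d\n", rc);
			break;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%d puts of %zu bytes in %.3f s (%.0f ops/s)\n",
	       i, val_size, secs, i / secs);
	free(val);
}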

 

Thanks

-Niu

 

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 9:12 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Dkeys and NULL Akey

 

Greetings Niu,

 

daos_perf does not support the DAOS simple KV interfaces, e.g. daos_kv_put? Is there another performance utility that does?

 

Thanks.

 

Colin

 


Re: Known problem creating containers?

Kevan Rehm
 

Joel,

 

Thanks for the explanation below, makes sense.   Can’t wait for that code to land.

 

Back to the problem at hand; now I am even more confused. I borrowed a machine from one of my compatriots and breakpointed his daos_io_server in routine ds_mgmt_drpc_get_attach_info; in his daemon the resp structure has all 7 fields, so he doesn't get a segfault. We are building from the same commit point. ?????

 

Do you have any ideas on what could be different on my machine? Same CentOS 7 release. I will keep debugging; it again appears to be related to my environment somehow.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 9:19 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Hi Kevan,

 

You are right that it won’t help a client to know the interface and domain names of the server.  In this case, we’re not actually sending the server’s interface and domain in the server’s response.  These fields are left empty until they are populated by the agent.  On the update I am working on now, the agent scans the client machine for network interfaces that support the server’s provider (based on the GetAttachInfo provider data) and populates the interface and domain fields in the response sent to the client.  In an update after that, the libdaos library then gets some rework to generate a GetAttachInfo prior to initializing CaRT so that it can use the interface and domain data that’s returned to it.  I’m working on getting this through review now.

 

Thanks for the additional debug log.  I appreciate your insight and help.  I will work on replicating the problem locally so I can fix it.

 

Joel

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Kevan Rehm
Sent: Monday, April 13, 2020 9:57 AM
To: daos@daos.groups.io
Subject: Re: [daos] Known problem creating containers?

 

Joel,

 

I’m curious; how does it help a client to know the interface and domain names of this server?   I can’t see how the client could possibly use them.

 

Anyway, back to the problem.   I am breakpointed in ds_mgmt_drpc_get_attach_info().   At the top of the routine is this:

 

        Mgmt__GetAttachInfoResp  resp = MGMT__GET_ATTACH_INFO_RESP__INIT;

 

If I look in the code at the definition of Mgmt__GetAttachInfoResp, it has the 7 data fields, including your new interface field, etc., and the value of MGMT__GET_ATTACH_INFO_RESP__INIT appears to initialize all 7 of those fields. But if I use gdb to look at that structure, you can see that the code doesn't actually know about any of the new fields; it is only aware of status and n_psrs/psrs:

 

(gdb) p resp

$7 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 0, psrs = 0x0}

(gdb) p resp.status

$8 = 0

(gdb) p resp.n_psrs

$9 = 0

(gdb) p resp.psrs

$10 = (Mgmt__GetAttachInfoResp__Psr **) 0x0

(gdb) p resp.provider

There is no member named provider.

(gdb) p resp.interface

There is no member named interface.

(gdb) p resp.domain

There is no member named domain.

 

(gdb) p sizeof(resp)

$13 = 48

 

If you do the math, you can see that the size of ‘resp’ is correct if the struct ends with the psrs field; there is no room in the struct for the new fields.

 

If I then step forward and enter routine mgmt__get_attach_info_resp__get_packed_size(), that routine DOES know about 7 fields and tries to reference all of them, but of course the resp structure on the stack isn’t big enough to hold the 7 fields, so this routine is looking at other junk on the stack past the end of the structure:

 

239           len = mgmt__get_attach_info_resp__get_packed_size(&resp);

(gdb) s

mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab041f740) at src/mgmt/srv.pb-c.c:295

295      assert(message->base.descriptor == &mgmt__get_attach_info_resp__descriptor);

(gdb) n

296      return protobuf_c_message_get_packed_size ((const ProtobufCMessage*)(message));

(gdb) p message

$11 = (const Mgmt__GetAttachInfoResp *) 0x2aaab041f740

(gdb) p *message

$12 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 1, 

  psrs = 0x7f9396113e00, provider = 0x0, interface = 0xad26fa6a89442100 <Address 0xad26fa6a89442100 out of bounds>, domain = 0x7f96f4026a10 "\340\230VW\227\177", 

  crtctxshareaddr = 4093798800, crttimeout = 32662}

 

Happy hunting,

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 9:34 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

That’s good information.  I added the “interface” field and some others last week as we are expanding the capabilities of the GetAttachInfo message to help automatically configure the clients. It’s not clear why adding the fields would cause any of the unpacking code to fail, especially when it’s auto generated based on the protobuf definition.  However, I did build those files with newer versions of the protobuf compiler, and it’s possible that there’s a subtle incompatibility that I wasn’t aware of. 

 

I upgraded to the newer versions of the tools because it was less friction than getting and installing the older tools.  That meant that the related protobuf files were recompiled with the new tools and are now in the tree. 

 

I’ll look at this to understand what’s happening.  Aside from debugging the failure, I’ll see if I can get the old tools reinstalled so I can rebuild the protobufs and have you try them to see if it works when compiled with the older tools.  The answer to that would give some clues.  

 

Joel

 

 

On Apr 12, 2020, at 9:13 PM, Kevan Rehm <kevan.rehm@...> wrote:

Joel,

 

I am still chasing this. The problem is occurring in the server in routine ds_mgmt_drpc_get_attach_info. Routine ds_mgmt_get_attach_info_handler() fills in ‘resp’ with n_psrs and the psrs array. Then this routine fills in resp.status and calls mgmt__get_attach_info_resp__get_packed_size(). It is in that routine that the segfault occurs. The struct is _Mgmt__GetAttachInfoResp; there are other fields that are not being filled in, and the segfault occurs on one of these, ‘interface’. The MGMT__GET_ATTACH_INFO_RESP__INIT macro at the beginning of function ds_mgmt_drpc_get_attach_info appears to set all the string fields to “”, but by the time the code gets to the ‘interface’ field in mgmt__get_attach_info_resp__get_packed_size it contains some out-of-range value that causes the segfault.

 

I don’t really understand the packing code, just giving you these tidbits until I can dig further tomorrow.

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 8:05 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Actually, we are not; there was some confusion on that point. Kevan is running latest master; I accidentally wound up a week out of date.

 

So I assume if I updated, I would have the same issue.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Sunday, April 12, 2020 5:03 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Are you both running the same build?




On Apr 12, 2020, at 4:36 PM, Kevan Rehm <kevan.rehm@...> wrote:

Sigh. Please ignore this; one of my compatriots with the same hardware config was able to create this pool and container without error, so the problem is obviously in my setup.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 2:01 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Known problem creating containers?

 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At that point the client window hangs, and the daos_io_server segfaults.  Backtrace collected via gdb:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan


Re: Known problem creating containers?

Rosenzweig, Joel B <joel.b.rosenzweig@...>
 

Hi Kevan,

 

You are right that it won’t help a client to know the interface and domain names of the server.  In this case, we’re not actually sending the server’s interface and domain in the server’s response.  These fields are left empty until they are populated by the agent.  On the update I am working on now, the agent scans the client machine for network interfaces that support the server’s provider (based on the GetAttachInfo provider data) and populates the interface and domain fields in the response sent to the client.  In an update after that, the libdaos library then gets some rework to generate a GetAttachInfo prior to initializing CaRT so that it can use the interface and domain data that’s returned to it.  I’m working on getting this through review now.

 

Thanks for the additional debug log.  I appreciate your insight and help.  I will work on replicating the problem locally so I can fix it.

 

Joel

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Kevan Rehm
Sent: Monday, April 13, 2020 9:57 AM
To: daos@daos.groups.io
Subject: Re: [daos] Known problem creating containers?

 

Joel,

 

I’m curious; how does it help a client to know the interface and domain names of this server?   I can’t see how the client could possibly use them.

 

Anyway, back to the problem.   I am breakpointed in ds_mgmt_drpc_get_attach_info().   At the top of the routine is this:

 

        Mgmt__GetAttachInfoResp  resp = MGMT__GET_ATTACH_INFO_RESP__INIT;

 

If I look in the code at the definition of Mgmt__GetAttachInfoResp, it has the 7 data fields, including your new interface field, etc.  And MGMT__GET_ATTACH_INFO_RESP__INIT appears to initialize all 7 of those fields.  But if I use gdb to look at that structure, you can see that this compilation unit doesn’t actually know about any of the new fields; it is only aware of status and n_psrs/psrs:

 

(gdb) p resp

$7 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 0, psrs = 0x0}

(gdb) p resp.status

$8 = 0

(gdb) p resp.n_psrs

$9 = 0

(gdb) p resp.psrs

$10 = (Mgmt__GetAttachInfoResp__Psr **) 0x0

(gdb) p resp.provider

There is no member named provider.

(gdb) p resp.interface

There is no member named interface.

(gdb) p resp.domain

There is no member named domain.

 

(gdb) p sizeof(resp)

$13 = 48

 

If you do the math, you can see that the size of ‘resp’ is what you would expect if the struct ended with the psrs field; there is no room in the struct for the new fields.

 

If I then step forward and enter routine mgmt__get_attach_info_resp__get_packed_size(), that routine DOES know about 7 fields and tries to reference all of them.  But of course the resp structure on the stack isn’t big enough to hold the 7 fields, so this routine is reading junk on the stack past the end of the structure:

 

239           len = mgmt__get_attach_info_resp__get_packed_size(&resp);

(gdb) s

mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab041f740) at src/mgmt/srv.pb-c.c:295

295      assert(message->base.descriptor == &mgmt__get_attach_info_resp__descriptor);

(gdb) n

296      return protobuf_c_message_get_packed_size ((const ProtobufCMessage*)(message));

(gdb) p message

$11 = (const Mgmt__GetAttachInfoResp *) 0x2aaab041f740

(gdb) p *message

$12 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 1, 

  psrs = 0x7f9396113e00, provider = 0x0, interface = 0xad26fa6a89442100 <Address 0xad26fa6a89442100 out of bounds>, domain = 0x7f96f4026a10 "\340\230VW\227\177", 

  crtctxshareaddr = 4093798800, crttimeout = 32662}

 

Happy hunting,

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 9:34 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

That’s good information.  I added the “interface” field and some others last week as we are expanding the capabilities of the GetAttachInfo message to help automatically configure the clients. It’s not clear why adding the fields would cause any of the unpacking code to fail, especially when it’s auto generated based on the protobuf definition.  However, I did build those files with newer versions of the protobuf compiler, and it’s possible that there’s a subtle incompatibility that I wasn’t aware of. 

 

I upgraded to the newer versions of the tools because it was less friction than getting and installing the older tools.  That meant that the related protobuf files were recompiled with the new tools and are now in the tree. 

 

I’ll look at this to understand what’s happening.  Aside from debugging the failure, I’ll see if I can get the old tools reinstalled so I can rebuild the protobufs and have you try them to see if it works when compiled with the older tools.  The answer to that would give some clues.  

 

Joel

 

 

On Apr 12, 2020, at 9:13 PM, Kevan Rehm <kevan.rehm@...> wrote:

Joel,

 

I am still chasing this.  The problem occurs in the server in routine ds_mgmt_drpc_get_attach_info.  Routine ds_mgmt_get_attach_info_handler() fills in ‘resp’ with n_psrs and the psrs array.  Then this routine fills in resp.status and calls mgmt__get_attach_info_resp__get_packed_size().  It is in that routine that the segfault occurs.  The struct is _Mgmt__GetAttachInfoResp; there are other fields that are not being filled in, and the segfault occurs on one of these, ‘interface’.  The MGMT__GET_ATTACH_INFO_RESP__INIT macro at the beginning of function ds_mgmt_drpc_get_attach_info appears to set all the string fields to “”, but by the time the code gets to the ‘interface’ field in mgmt__get_attach_info_resp__get_packed_size it contains some out-of-range value that causes the segfault.

 

I don’t really understand the packing code, just giving you these tidbits until I can dig further tomorrow.

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 8:05 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Actually, we are not - There was some confusion on that point.  Kevan is running latest master, I accidentally wound up a week out of date.

 

So I assume if I updated, I would have the same issue.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Sunday, April 12, 2020 5:03 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Are you both running the same build?



On Apr 12, 2020, at 4:36 PM, Kevan Rehm <kevan.rehm@...> wrote:

Sigh.   Please ignore this, one of my compatriots with the same hardware config was able to create this pool and container without error.   So the problem is obviously in my setup.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 2:01 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Known problem creating containers?

 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At that point the client window hangs, and the daos_io_server segfaults.  Backtrace collected via gdb:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan


Re: Known problem creating containers?

Kevan Rehm
 

Joel,

 

I’m curious; how does it help a client to know the interface and domain names of this server?   I can’t see how the client could possibly use them.

 

Anyway, back to the problem.   I am breakpointed in ds_mgmt_drpc_get_attach_info().   At the top of the routine is this:

 

        Mgmt__GetAttachInfoResp  resp = MGMT__GET_ATTACH_INFO_RESP__INIT;

 

If I look in the code at the definition of Mgmt__GetAttachInfoResp, it has the 7 data fields, including your new interface field, etc.  And MGMT__GET_ATTACH_INFO_RESP__INIT appears to initialize all 7 of those fields.  But if I use gdb to look at that structure, you can see that this compilation unit doesn’t actually know about any of the new fields; it is only aware of status and n_psrs/psrs:

 

(gdb) p resp

$7 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 0, psrs = 0x0}

(gdb) p resp.status

$8 = 0

(gdb) p resp.n_psrs

$9 = 0

(gdb) p resp.psrs

$10 = (Mgmt__GetAttachInfoResp__Psr **) 0x0

(gdb) p resp.provider

There is no member named provider.

(gdb) p resp.interface

There is no member named interface.

(gdb) p resp.domain

There is no member named domain.

 

(gdb) p sizeof(resp)

$13 = 48

 

If you do the math, you can see that the size of ‘resp’ is what you would expect if the struct ended with the psrs field; there is no room in the struct for the new fields.
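
To make the mismatch concrete, here is a minimal, hypothetical C sketch (not the actual DAOS or generated protobuf-c code) of two translation units disagreeing about the response layout: a stale caller allocates the short struct while regenerated packing code walks the longer one. Field names only mirror the ones discussed in this thread.

/* Hypothetical illustration only -- not srv.pb-c.h. */
#include <stdio.h>

/* Layout a stale caller (built against old headers) allocates. */
struct resp_old {
	void    *descriptor;   /* stand-in for the ProtobufCMessage base */
	int      status;
	size_t   n_psrs;
	void    *psrs;
};

/* Layout the regenerated packing code expects to walk. */
struct resp_new {
	void    *descriptor;
	int      status;
	size_t   n_psrs;
	void    *psrs;
	char    *provider;
	char    *interface;
	char    *domain;
	unsigned crtctxshareaddr;
	unsigned crttimeout;
};

int main(void)
{
	/* The packing code walks fields by offset from its descriptor table;
	 * if the caller allocated resp_old on the stack, any access beyond
	 * psrs reads past the end of the object -- the stack garbage that
	 * shows up as the bogus 'interface' pointer in the gdb output below. */
	printf("old layout: %zu bytes, new layout: %zu bytes\n",
	       sizeof(struct resp_old), sizeof(struct resp_new));
	return 0;
}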

 

If I then step forward and enter routine mgmt__get_attach_info_resp__get_packed_size(), that routine DOES know about 7 fields and tries to reference all of them.  But of course the resp structure on the stack isn’t big enough to hold the 7 fields, so this routine is reading junk on the stack past the end of the structure:

 

239           len = mgmt__get_attach_info_resp__get_packed_size(&resp);

(gdb) s

mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab041f740) at src/mgmt/srv.pb-c.c:295

295      assert(message->base.descriptor == &mgmt__get_attach_info_resp__descriptor);

(gdb) n

296      return protobuf_c_message_get_packed_size ((const ProtobufCMessage*)(message));

(gdb) p message

$11 = (const Mgmt__GetAttachInfoResp *) 0x2aaab041f740

(gdb) p *message

$12 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 1, 

  psrs = 0x7f9396113e00, provider = 0x0, interface = 0xad26fa6a89442100 <Address 0xad26fa6a89442100 out of bounds>, domain = 0x7f96f4026a10 "\340\230VW\227\177", 

  crtctxshareaddr = 4093798800, crttimeout = 32662}

 

Happy hunting,

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 9:34 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

That’s good information.  I added the “interface” field and some others last week as we are expanding the capabilities of the GetAttachInfo message to help automatically configure the clients. It’s not clear why adding the fields would cause any of the unpacking code to fail, especially when it’s auto generated based on the protobuf definition.  However, I did build those files with newer versions of the protobuf compiler, and it’s possible that there’s a subtle incompatibility that I wasn’t aware of. 

 

I upgraded to the newer versions of the tools because it was less friction than getting and installing the older tools.  That meant that the related protobuf files were recompiled with the new tools and are now in the tree. 

 

I’ll look at this to understand what’s happening.  Aside from debugging the failure, I’ll see if I can get the old tools reinstalled so I can rebuild the protobufs and have you try them to see if it works when compiled with the older tools.  The answer to that would give some clues.  

 

Joel

 



On Apr 12, 2020, at 9:13 PM, Kevan Rehm <kevan.rehm@...> wrote:

Joel,

 

I am still chasing this.  The problem occurs in the server in routine ds_mgmt_drpc_get_attach_info.  Routine ds_mgmt_get_attach_info_handler() fills in ‘resp’ with n_psrs and the psrs array.  Then this routine fills in resp.status and calls mgmt__get_attach_info_resp__get_packed_size().  It is in that routine that the segfault occurs.  The struct is _Mgmt__GetAttachInfoResp; there are other fields that are not being filled in, and the segfault occurs on one of these, ‘interface’.  The MGMT__GET_ATTACH_INFO_RESP__INIT macro at the beginning of function ds_mgmt_drpc_get_attach_info appears to set all the string fields to “”, but by the time the code gets to the ‘interface’ field in mgmt__get_attach_info_resp__get_packed_size it contains some out-of-range value that causes the segfault.

 

I don’t really understand the packing code, just giving you these tidbits until I can dig further tomorrow.

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 8:05 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Actually, we are not - There was some confusion on that point.  Kevan is running latest master, I accidentally wound up a week out of date.

 

So I assume if I updated, I would have the same issue.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Sunday, April 12, 2020 5:03 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Are you both running the same build?




On Apr 12, 2020, at 4:36 PM, Kevan Rehm <kevan.rehm@...> wrote:

Sigh.   Please ignore this, one of my compatriots with the same hardware config was able to create this pool and container without error.   So the problem is obviously in my setup.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 2:01 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Known problem creating containers?

 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At that point the client window hangs, and the daos_io_server segfaults.  Backtrace collected via gdb:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan


Dkeys and NULL Akey

Colin Ngam
 

Greetings Niu,

 

Daos_perf does not appear to support the DAOS simple KV interfaces, e.g. daos_kv_put. Is there another performance utility that does?

 

Thanks.

 

Colin

 


Re: daos_perf

Niu, Yawei
 

Yes, I agree with you. I’ve pasted your findings into DAOS-4521, and we’ll fix this along with the problem of verification failure in SV mode. Thanks a lot!

 

Thanks

-Niu

 

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Saturday, April 11, 2020 at 3:19 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] daos_perf

 

Hi Niu,

 

I did more testing just to make sure that the targets are good.

 

This is just to verify that I can create a Pool with SCM and NVMe:

 

[root@delphi-006 daos]# dmg pool create -l delphi-006:10001 --ranks=0,1,2,3 --scm-size=10G --nvme-size=1T

delphi-006:10001: connected

Pool-create command SUCCEEDED: UUID: ffbe50b2-a978-490a-9cc2-79c97e0ee403, Service replicas: 1

 

DEBUG 13:59:17.044202 mgmt_svc.go:339: MgmtSvc.PoolCreate dispatch, req:{Scmbytes:10000000000 Nvmebytes:1000000000000 Ranks:[0 1 2 3] Numsvcreps:1 User:root@ Usergroup:root@ Uuid:ffbe50b2-a978-490a-9cc2-79c97e0ee403 Sys:daos_server Acl:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

daos_io_server:1  04/10-13:59:17.55 delphi-006 ffbe50b2: rank 1 became pool service leader 0

DEBUG 13:59:17.551069 mgmt_svc.go:367: MgmtSvc.PoolCreate dispatch, resp:{Status:0 Svcreps:[1] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

 

[root@delphi-006 daos]# dmg pool query -l delphi-006:10001 --pool ffbe50b2-a978-490a-9cc2-79c97e0ee403

delphi-006:10001: connected

Pool ffbe50b2-a978-490a-9cc2-79c97e0ee403, ntarget=32, disabled=0

Pool space info:

- Target(VOS) count:32

- SCM:

  Total size: 40 GB

  Free: 40 GB, min:1.2 GB, max:1.2 GB, mean:1.2 GB

- NVMe:

  Total size: 4.0 TB

  Free: 4.0 TB, min:125 GB, max:125 GB, mean:125 GB

Rebuild idle, 0 objs, 0 recs

 

Now I run the daos_perf test:

 

[root@delphi-006 daos]# daos_perf -P 250G -N 4T -T daos -C 0 -c LARGE -o 1 -d 1 -s 750K -B

--------------------------------------------------------------------------

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

  Local host:           delphi-006

  Local device:         mlx5_0

  Local port:           1

  CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

Test :

        DAOS LARGE (full stack, non-replica)

Parameters :

        pool size     : SCM: 256000 MB, NVMe: 0 MB

        credits       : 0 (sync I/O for -ve)

        obj_per_cont  : 1 x 1 (procs)

        dkey_per_obj  : 1

        akey_per_dkey : 100

        recx_per_akey : 1000

        value type    : single

        value size    : 768000

        zero copy     : no

        overwrite     : no

        verify fetch  : no

        VOS file      : <NULL>

[delphi-006.us.cray.com:12161] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port

[delphi-006.us.cray.com:12161] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Started...

Update failed. rc=-1007, epoch=42655

Failed: DER_NOSPACE(-1007)

 

I noticed that the above reports pool size: SCM: 256000 MB, NVMe: 0 MB.

 

I looked at the code and noticed that:

 

static uint64_t
ts_val_factor(uint64_t val, char factor)
{
        /* ... body elided in the original message ... */
}

 

It does not handle the postfix “T”, and no error is given.

 

Terabytes are small nowadays. Can you include support for “T”, please?
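
For illustration, a minimal sketch of how a suffix-scaling helper could also accept ‘T’/‘t’. This is a hypothetical stand-in, not the actual daos_perf routine, and the set of suffixes the real code recognizes may differ:

#include <stdint.h>

/* Hypothetical helper: scale a value by a size suffix, including terabytes. */
static uint64_t
val_factor_sketch(uint64_t val, char factor)
{
	switch (factor) {
	case 'k': return val * 1000ULL;
	case 'm': return val * 1000ULL * 1000;
	case 'g': return val * 1000ULL * 1000 * 1000;
	case 't': return val * 1000ULL * 1000 * 1000 * 1000;
	case 'K': return val << 10;
	case 'M': return val << 20;
	case 'G': return val << 30;
	case 'T': return val << 40;
	default:  return val; /* ideally report unknown suffixes instead of silently ignoring them */
	}
}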

 

Thanks.

 

Colin

 

 

 

 

 

From: <daos@daos.groups.io> on behalf of "Niu, Yawei" <yawei.niu@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, April 10, 2020 at 9:52 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] daos_perf

 

Hi, Colin

 

The current implementation is that the NVMe size option is ignored when NVMe isn’t configured (if NVMe is configured but unavailable for some other reason, the pool creation will fail). The major reason we chose this approach back when developing the NVMe feature is that we didn’t want the automated tests to break on nodes without NVMe devices. (Otherwise, all existing tests would have to be revised to use different pool-creation parameters according to the server hardware configuration.)

 

I agree with you that failing the pool creation in such a case could be a better choice; I suppose we may switch to that once all the test systems are ready.

 

Thanks

-Niu

 

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, April 10, 2020 at 8:55 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] daos_perf

 

Hi Niu,

 

Just a comment: if NVMe is requested but not available, shouldn’t the pool creation fail?

 

Thanks.

 

Colin

 

From: <daos@daos.groups.io> on behalf of "Niu, Yawei" <yawei.niu@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, April 10, 2020 at 2:53 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] daos_perf

 

Hi, Colin

 

You could double-check whether your NVMe is configured properly on the server side; if no NVMe is configured, all data will land in SCM.

 

Thanks

-Niu

 

From: <daos@daos.groups.io> on behalf of Colin Ngam <cngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, April 10, 2020 at 5:22 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] daos_perf

 

Greetings,

 

I am testing KV Store:

 

[root@delphi-006 tmp]# daos_perf -P 10G -N 1T -T daos -C 0 -c LARGE -o 1 -d 1 -s 32K -B


Update failed. rc=-1007, epoch=25050


[root@delphi-006 tmp]# daos_perf -P 100G -N 1T -T daos -C 0 -c LARGE -o 1 -d 1 -s 32K -B


This works.

 

[root@delphi-006 tmp]# daos_perf -P 10G -N 1T -T daos -C 0 -c LARGE -o 1 -d 1 -s 1K -B


This passed.

 

Why is increasing the SCM size needed for the pool? I thought each 32K value would land in NVMe only?

 

Thanks.

 

Colin


Re: NVMe/SPDK disk IO traffic monitor.

Niu, Yawei
 

The env is only read once at server start (actually, you can put it in the server yaml file just like other env variables), so it can’t be set dynamically for now.

 

Thanks

-Niu

 

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, April 10, 2020 at 11:12 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] NVMe/SPDK disk IO traffic monitor.

 

Hi Niu,

 

Is it possible to toggle IO_STAT_PERIOD dynamically? For example, turn it on with IO_STAT_PERIOD=10 and turn it off by unsetting IO_STAT_PERIOD?

 

Thanks.

 

Colin

 

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, April 10, 2020 at 7:16 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] NVMe/SPDK disk IO traffic monitor.

 

Hi Niu,

 

Thanks. That is going to be very helpful.

 

Colin

 

From: <daos@daos.groups.io> on behalf of "Niu, Yawei" <yawei.niu@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, April 10, 2020 at 3:01 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] NVMe/SPDK disk IO traffic monitor.

 

Hi, Colin

 

To verify whether I/O goes properly to the NVMe SSDs, set the env “IO_STAT_PERIOD=10” on the server; SPDK I/O statistics will then be printed on the server console every 10 seconds. As far as I know, there isn’t any standard monitoring tool (like iostat) available yet.  
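
For example, a minimal sketch, assuming daos_server is launched from a shell you control (as noted elsewhere in this thread, the variable can also be set in the server yaml along with the other env variables; the exact launch command depends on your setup):

# Sketch: enable periodic SPDK I/O statistics before starting the server.
export IO_STAT_PERIOD=10   # print SPDK I/O stats every 10 seconds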

 

Thanks

-Niu

 

From: <daos@daos.groups.io> on behalf of Colin Ngam <cngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, April 10, 2020 at 5:22 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] NVMe/SPDK disk IO traffic monitor.

 

Greetings,

 

What kinds of tools are recommended for monitoring SCM and NVMe when profiling I/O traffic in DAOS?

 

Thanks for all your help.

 

Colin


Re: Known problem creating containers?

Rosenzweig, Joel B <joel.b.rosenzweig@...>
 

That’s good information.  I added the “interface” field and some others last week as we are expanding the capabilities of the GetAttachInfo message to help automatically configure the clients. It’s not clear why adding the fields would cause any of the unpacking code to fail, especially when it’s auto generated based on the protobuf definition.  However, I did build those files with newer versions of the protobuf compiler, and it’s possible that there’s a subtle incompatibility that I wasn’t aware of. 

I upgraded to the newer versions of the tools because it was less friction than getting and installing the older tools.  That meant that the related protobuf files were recompiled with the new tools and are now in the tree. 
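
As one way to compare environments, here is a hypothetical quick check of which protobuf tooling and runtime versions are in play on a given box; binary and package names may differ by distro, so treat this as a sketch rather than a prescribed procedure:

# Sketch only: confirm these binaries/packages exist on your system.
protoc --version                        # protobuf compiler used when regenerating the .pb-c sources
protoc-c --version                      # protobuf-c code generator
pkg-config --modversion libprotobuf-c   # protobuf-c runtime the server links against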

I’ll look at this to understand what’s happening.  Aside from debugging the failure, I’ll see if I can get the old tools reinstalled so I can rebuild the protobufs and have you try them to see if it works when compiled with the older tools.  The answer to that would give some clues.  

Joel


On Apr 12, 2020, at 9:13 PM, Kevan Rehm <kevan.rehm@...> wrote:



Joel,

 

I am still chasing this.  The problem occurs in the server in routine ds_mgmt_drpc_get_attach_info.  Routine ds_mgmt_get_attach_info_handler() fills in ‘resp’ with n_psrs and the psrs array.  Then this routine fills in resp.status and calls mgmt__get_attach_info_resp__get_packed_size().  It is in that routine that the segfault occurs.  The struct is _Mgmt__GetAttachInfoResp; there are other fields that are not being filled in, and the segfault occurs on one of these, ‘interface’.  The MGMT__GET_ATTACH_INFO_RESP__INIT macro at the beginning of function ds_mgmt_drpc_get_attach_info appears to set all the string fields to “”, but by the time the code gets to the ‘interface’ field in mgmt__get_attach_info_resp__get_packed_size it contains some out-of-range value that causes the segfault.

 

I don’t really understand the packing code, just giving you these tidbits until I can dig further tomorrow.

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 8:05 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Actually, we are not - There was some confusion on that point.  Kevan is running latest master, I accidentally wound up a week out of date.

 

So I assume if I updated, I would have the same issue.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Sunday, April 12, 2020 5:03 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Are you both running the same build?



On Apr 12, 2020, at 4:36 PM, Kevan Rehm <kevan.rehm@...> wrote:

Sigh.   Please ignore this, one of my compatriots with the same hardware config was able to create this pool and container without error.   So the problem is obviously in my setup.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 2:01 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Known problem creating containers?

 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At that point the client window hangs, and the daos_io_server segfaults.  Backtrace collected via gdb:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan


Re: Known problem creating containers?

Kevan Rehm
 

Joel,

 

I am still chasing this.  The problem occurs in the server in routine ds_mgmt_drpc_get_attach_info.  Routine ds_mgmt_get_attach_info_handler() fills in ‘resp’ with n_psrs and the psrs array.  Then this routine fills in resp.status and calls mgmt__get_attach_info_resp__get_packed_size().  It is in that routine that the segfault occurs.  The struct is _Mgmt__GetAttachInfoResp; there are other fields that are not being filled in, and the segfault occurs on one of these, ‘interface’.  The MGMT__GET_ATTACH_INFO_RESP__INIT macro at the beginning of function ds_mgmt_drpc_get_attach_info appears to set all the string fields to “”, but by the time the code gets to the ‘interface’ field in mgmt__get_attach_info_resp__get_packed_size it contains some out-of-range value that causes the segfault.

 

I don’t really understand the packing code, just giving you these tidbits until I can dig further tomorrow.

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 8:05 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Actually, we are not - There was some confusion on that point.  Kevan is running latest master, I accidentally wound up a week out of date.

 

So I assume if I updated, I would have the same issue.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Sunday, April 12, 2020 5:03 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Are you both running the same build?



On Apr 12, 2020, at 4:36 PM, Kevan Rehm <kevan.rehm@...> wrote:

Sigh.   Please ignore this, one of my compatriots with the same hardware config was able to create this pool and container without error.   So the problem is obviously in my setup.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 2:01 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Known problem creating containers?

 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At that point the client window hangs, and the daos_io_server segfaults.  Backtrace collected via gdb:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan


Re: Known problem creating containers?

Patrick Farrell <paf@...>
 

Actually, we are not - There was some confusion on that point.  Kevan is running latest master, I accidentally wound up a week out of date.

So I assume if I updated, I would have the same issue.

-Patrick

From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Sunday, April 12, 2020 5:03 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?
 
Are you both running the same build?


On Apr 12, 2020, at 4:36 PM, Kevan Rehm <kevan.rehm@...> wrote:



Sigh.   Please ignore this, one of my compatriots with the same hardware config was able to create this pool and container without error.   So the problem is obviously in my setup.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 2:01 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Known problem creating containers?

 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At that point the client window hangs, and the daos_io_server segfaults.  Backtrace collected via gdb:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan


Re: Known problem creating containers?

Rosenzweig, Joel B <joel.b.rosenzweig@...>
 

Are you both running the same build?


On Apr 12, 2020, at 4:36 PM, Kevan Rehm <kevan.rehm@...> wrote:



Sigh.   Please ignore this, one of my compatriots with the same hardware config was able to create this pool and container without error.   So the problem is obviously in my setup.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 2:01 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Known problem creating containers?

 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At that point the client window hangs, and the daos_io_server segfaults.  Backtrace collected via gdb:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan


Re: Known problem creating containers?

Kevan Rehm
 

Sigh.   Please ignore this, one of my compatriots with the same hardware config was able to create this pool and container without error.   So the problem is obviously in my setup.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 2:01 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Known problem creating containers?

 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At that point the client window hangs, and the daos_io_server segfaults.  Backtrace collected via gdb:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan


Known problem creating containers?

Kevan Rehm
 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At that point the client window hangs, and the daos_io_server segfaults.  Backtrace collected via gdb:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan


Object Array Performance tests

Colin Ngam
 

Greetings,

 

Is there an Object Array performance test available?

 

I do not see this support in daos_perf.

 

Thanks.

 

Colin