Re: Known problem creating containers?

Kevan Rehm



I’m curious; how does it help a client to know the interface and domain names of this server?   I can’t see how the client could possibly use them.


Anyway, back to the problem.   I am breakpointed in ds_mgmt_drpc_get_attach_info().   At the top of the routine is this:


        Mgmt__GetAttachInfoResp  resp = MGMT__GET_ATTACH_INFO_RESP__INIT;


If I look in the code at the definition of Mgmt__GetAttachInfoResp, it has 7 data fields, including your new interface field, and the value of MGMT__GET_ATTACH_INFO_RESP__INIT appears to initialize all 7 of them. But gdb shows that the compiled code doesn’t actually know about any of the new fields; it is only aware of status and n_psrs/psrs:


(gdb) p resp

$7 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 0, psrs = 0x0}

(gdb) p resp.status

$8 = 0

(gdb) p resp.n_psrs

$9 = 0

(gdb) p resp.psrs

$10 = (Mgmt__GetAttachInfoResp__Psr **) 0x0

(gdb) p resp.provider

There is no member named provider.

(gdb) p resp.interface

There is no member named interface.

(gdb) p resp.domain

There is no member named domain.


(gdb) p sizeof(resp)

$13 = 48


If you do the math, the size of ‘resp’ is exactly right for a struct that ends with the psrs field; there is no room in it for the new fields.
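
To make that arithmetic concrete, here is a minimal sketch using stand-in structs. It assumes an LP64 x86-64 build, a message header made of a pointer, an unsigned count, and a pointer, and new fields consisting of three strings followed by two 32-bit integers (types guessed from the gdb prints in this thread). The struct and function names are hypothetical, not the generated ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the protobuf-c message header (layout assumed from gdb). */
struct base_hdr {
    const void *descriptor;
    unsigned    n_unknown_fields;
    void       *unknown_fields;
};

/* Layout the caller appears to have been compiled against: ends at psrs. */
struct resp_old {
    struct base_hdr base;
    int32_t  status;
    size_t   n_psrs;
    void   **psrs;
};

/* Layout the generated packing code expects: 7 data fields. */
struct resp_new {
    struct base_hdr base;
    int32_t  status;
    size_t   n_psrs;
    void   **psrs;
    char    *provider;
    char    *interface;
    char    *domain;
    uint32_t crtctxshareaddr;
    uint32_t crttimeout;
};

/* Small accessors so the numbers can be printed or asserted on. */
size_t old_size(void)         { return sizeof(struct resp_old); }
size_t new_size(void)         { return sizeof(struct resp_new); }
size_t interface_offset(void) { return offsetof(struct resp_new, interface); }
```

Under those assumptions old_size() is 48, which matches gdb's sizeof(resp), and interface_offset() is 56, which matches the 0x38 distance between message (0x2aaab0423640) and member (0x2aaab0423678) in the original backtrace.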


If I then step forward and enter routine mgmt__get_attach_info_resp__get_packed_size(), that routine DOES know about 7 fields and tries to reference all of them, but of course the resp structure on the stack isn’t big enough to hold the 7 fields, so this routine is looking at other junk on the stack past the end of the structure:


239           len = mgmt__get_attach_info_resp__get_packed_size(&resp);

(gdb) s

mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab041f740) at src/mgmt/srv.pb-c.c:295

295      assert(message->base.descriptor == &mgmt__get_attach_info_resp__descriptor);

(gdb) n

296      return protobuf_c_message_get_packed_size ((const ProtobufCMessage*)(message));

(gdb) p message

$11 = (const Mgmt__GetAttachInfoResp *) 0x2aaab041f740

(gdb) p *message

$12 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 1, 

  psrs = 0x7f9396113e00, provider = 0x0, interface = 0xad26fa6a89442100 <Address 0xad26fa6a89442100 out of bounds>, domain = 0x7f96f4026a10 "\340\230VW\227\177", 

  crtctxshareaddr = 4093798800, crttimeout = 32662}
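
One way to catch this kind of mismatch early would be to compare the caller's sizeof against the size recorded in the message descriptor. protobuf-c descriptors carry a sizeof_message field; the sketch below uses a stand-in descriptor type instead of the real header so that it compiles on its own, and the helper name layout_matches is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in holding only the descriptor field this check needs. */
struct msg_descriptor {
    size_t sizeof_message;   /* struct size the generated code expects */
};

/* Returns 1 when the caller's view of the struct size matches the
 * descriptor, 0 otherwise (including a NULL descriptor). */
int layout_matches(const struct msg_descriptor *desc, size_t caller_sizeof)
{
    return desc != NULL && desc->sizeof_message == caller_sizeof;
}
```

In gdb the equivalent check would be something like printing resp.base.descriptor->sizeof_message next to sizeof(resp), assuming debug info for the descriptor type is loaded.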


Happy hunting,





From: "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Date: Sunday, April 12, 2020 at 9:34 PM
Subject: Re: [daos] Known problem creating containers?


That’s good information.  I added the “interface” field and some others last week as we are expanding the capabilities of the GetAttachInfo message to help automatically configure the clients. It’s not clear why adding the fields would cause any of the unpacking code to fail, especially when it’s auto generated based on the protobuf definition.  However, I did build those files with newer versions of the protobuf compiler, and it’s possible that there’s a subtle incompatibility that I wasn’t aware of. 


I upgraded to the newer versions of the tools because it was less friction than getting and installing the older tools.  That meant that the related protobuf files were recompiled with the new tools and are now in the tree. 


I’ll look at this to understand what’s happening.  Aside from debugging the failure, I’ll see if I can get the old tools reinstalled so I can rebuild the protobufs and have you try them to see if it works when compiled with the older tools.  The answer to that would give some clues.  




On Apr 12, 2020, at 9:13 PM, Kevan Rehm <kevan.rehm@...> wrote:



I am still chasing this. The problem occurs in the server, in routine ds_mgmt_drpc_get_attach_info(). Routine ds_mgmt_get_attach_info_handler() fills in ‘resp’ with n_psrs and the psrs array. Then this routine fills in resp.status and calls mgmt__get_attach_info_resp__get_packed_size(). It is in that routine that the segfault occurs. The struct is _Mgmt__GetAttachInfoResp; there are other fields that are not being filled in, and the segfault occurs on one of these, ‘interface’. The MGMT__GET_ATTACH_INFO_RESP__INIT macro at the beginning of function ds_mgmt_drpc_get_attach_info() appears to set all the string fields to “”, but by the time the code gets to the ‘interface’ field in mgmt__get_attach_info_resp__get_packed_size() it contains some out-of-range value that causes the segfault.


I don’t really understand the packing code, just giving you these tidbits until I can dig further tomorrow.





From: Patrick Farrell <paf@...>
Date: Sunday, April 12, 2020 at 8:05 PM
Subject: Re: [daos] Known problem creating containers?


Actually, we are not - There was some confusion on that point.  Kevan is running latest master, I accidentally wound up a week out of date.


So I assume if I updated, I would have the same issue.



From: Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Sunday, April 12, 2020 5:03 PM
Subject: Re: [daos] Known problem creating containers?


Are you both running the same build?

On Apr 12, 2020, at 4:36 PM, Kevan Rehm <kevan.rehm@...> wrote:

Sigh.   Please ignore this, one of my compatriots with the same hardware config was able to create this pool and container without error.   So the problem is obviously in my setup.




From: Kevan Rehm <kevan.rehm@...>
Date: Sunday, April 12, 2020 at 2:01 PM
Subject: [daos] Known problem creating containers?




Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.


Commands on the client:


[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0


At this point the client window hangs and the daos_io_server segfaults. The backtrace collected via gdb is:


Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>
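
For context on why this faults: the expression at protobuf-c.c:559 treats a string field as empty when its pointer is NULL. The second half of that expression is cut off in the listing above; the sketch below assumes it goes on to test the first character of the string. This is a stand-alone illustration with a hypothetical function name, not the library's code.

```c
#include <assert.h>
#include <stddef.h>

/* member points at the struct slot that holds the char * of a string field. */
int string_field_is_empty(const void *member)
{
    const char *s = *(const char *const *)member;

    /* A non-NULL garbage pointer passes the first test and is then
     * dereferenced by the second, which is where a segfault would occur. */
    return s == NULL || s[0] == '\0';
}
```

With a valid slot the check is harmless: a NULL pointer or an empty string yields 1, and any other string yields 0. Only a slot holding an arbitrary non-NULL value, such as the out-of-bounds address printed above, makes the dereference fault.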


Is this a known problem?


Thanks, Kevan
