Re: Known problem creating containers?
Kevan Rehm
Joel,
Thanks for the explanation below, makes sense. Can’t wait for that code to land.
Back to the problem at hand: now I am even more confused. I borrowed one of my compatriots’ machines and set a breakpoint in his daos_io_server in routine ds_mgmt_drpc_get_attach_info. In his daemon the resp structure has all 7 fields, so he doesn’t get a segfault. We are both building from the same commit.
Do you have any ideas on what could be different on my machine? We are on the same CentOS 7 release. I will keep debugging; it again appears to be related to my environment somehow.
Kevan
From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Hi Kevan,
You are right that it won’t help a client to know the interface and domain names of the server. In this case, we’re not actually sending the server’s interface and domain in the server’s response. These fields are left empty until they are populated by the agent. In the update I am working on now, the agent scans the client machine for network interfaces that support the server’s provider (based on the GetAttachInfo provider data) and populates the interface and domain fields in the response sent to the client. In an update after that, the libdaos library will be reworked to issue a GetAttachInfo request before initializing CaRT, so that it can use the interface and domain data that’s returned to it. I’m working on getting this through review now.
Thanks for the additional debug log. I appreciate your insight and help. I will work on replicating the problem locally so I can fix it.
Joel
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Kevan Rehm
Joel,
I’m curious; how does it help a client to know the interface and domain names of this server? I can’t see how the client could possibly use them.
Anyway, back to the problem. I am breakpointed in ds_mgmt_drpc_get_attach_info(). At the top of the routine is this:
Mgmt__GetAttachInfoResp resp = MGMT__GET_ATTACH_INFO_RESP__INIT;
If I look in the code at the definition of Mgmt__GetAttachInfoResp it has the 7 data fields including your new interface field, etc. And the value of MGMT__GET_ATTACH_INFO_RESP__INIT appears to initialize all 7 of those fields. But if I use gdb to look at that structure you can see that the code doesn’t actually know about any of the new fields, it is only aware of status and n_psrs/psrs:
(gdb) p resp
$7 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 0, psrs = 0x0}
(gdb) p resp.status
$8 = 0
(gdb) p resp.n_psrs
$9 = 0
(gdb) p resp.psrs
$10 = (Mgmt__GetAttachInfoResp__Psr **) 0x0
(gdb) p resp.provider
There is no member named provider.
(gdb) p resp.interface
There is no member named interface.
(gdb) p resp.domain
There is no member named domain.
(gdb) p sizeof(resp)
$13 = 48
If you do the math, you can see that the size of ‘resp’ is correct if the struct ends with field psrs; there is no room in the struct for the new fields.
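To spell out that arithmetic, here is a rough sketch in C. The struct definitions are stand-ins written for illustration, not the real protobuf-c or DAOS headers: the member names come from the gdb output in this message, while the member types (including the two 32-bit integers at the end) are assumptions inferred from how gdb prints them. The sizes in the comments assume a typical 64-bit Linux build with 8-byte pointers and 8-byte size_t.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the message header that gdb shows as "base"
 * (hypothetical mock, not the protobuf-c definition). */
struct mock_base {
    const void *descriptor;       /* 8 bytes */
    unsigned    n_unknown_fields; /* 4 bytes + 4 bytes padding */
    void       *unknown_fields;   /* 8 bytes */
};

/* The layout my daemon was compiled with: it ends at psrs. */
struct resp_short {
    struct mock_base base;   /* 24 bytes */
    int32_t          status; /* 4 bytes + 4 bytes padding */
    size_t           n_psrs; /* 8 bytes */
    void           **psrs;   /* 8 bytes, total 48 */
};

/* The layout the generated pack code expects: 5 more members.
 * Types are assumed from the gdb output, not taken from the header. */
struct resp_full {
    struct mock_base base;
    int32_t          status;
    size_t           n_psrs;
    void           **psrs;
    char            *provider;
    char            *interface;
    char            *domain;
    uint32_t         crtctxshareaddr;
    uint32_t         crttimeout;
};

/* Print both sizes and where the first new member would start. */
void print_layout(void)
{
    printf("short struct: %zu bytes\n", sizeof(struct resp_short));
    printf("full struct:  %zu bytes\n", sizeof(struct resp_full));
    printf("provider starts at offset %zu\n",
           offsetof(struct resp_full, provider));
}
```

Under those assumptions the short struct comes to 48 bytes, matching what gdb reports, and the first new member would start at offset 48, which is exactly the first byte past the end of the object on the stack.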
If I then step forward and enter routine mgmt__get_attach_info_resp__get_packed_size(), that routine DOES know about 7 fields and tries to reference all of them, but of course the resp structure on the stack isn’t big enough to hold the 7 fields, so this routine is looking at other junk on the stack past the end of the structure:
239             len = mgmt__get_attach_info_resp__get_packed_size(&resp);
(gdb) s
mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab041f740) at src/mgmt/srv.pb-c.c:295
295       assert(message->base.descriptor == &mgmt__get_attach_info_resp__descriptor);
(gdb) n
296       return protobuf_c_message_get_packed_size ((const ProtobufCMessage*)(message));
(gdb) p message
$11 = (const Mgmt__GetAttachInfoResp *) 0x2aaab041f740
(gdb) p *message
$12 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 1, psrs = 0x7f9396113e00, provider = 0x0, interface = 0xad26fa6a89442100 <Address 0xad26fa6a89442100 out of bounds>, domain = 0x7f96f4026a10 "\340\230VW\227\177", crtctxshareaddr = 4093798800, crttimeout = 32662}
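One possible way to catch this kind of mismatch before the pack code walks off the end of the struct is to compare the caller's sizeof against the size recorded in the message descriptor. As far as I know, protobuf-c's ProtobufCMessageDescriptor carries a sizeof_message member for this, so `p resp.base.descriptor->sizeof_message` in gdb should show what the generated code believes the size is. The sketch below only mocks that idea: every type and name in it is hypothetical, and it is not code from the DAOS tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Mock of the single descriptor member this check relies on.
 * The real descriptor has many more members. */
struct mock_descriptor {
    const char *name;
    size_t      sizeof_message; /* struct size as the generated .c file saw it */
};

/* Message as a stale build would see it (hypothetical). */
struct resp_stale {
    const struct mock_descriptor *descriptor;
    int     status;
    size_t  n_psrs;
    void  **psrs;
};

/* Message as the regenerated code sees it (hypothetical). */
struct resp_current {
    const struct mock_descriptor *descriptor;
    int      status;
    size_t   n_psrs;
    void   **psrs;
    char    *provider;
    char    *interface;
    char    *domain;
    unsigned crtctxshareaddr;
    unsigned crttimeout;
};

/* Lives in the "generated" translation unit, so it records the new size. */
const struct mock_descriptor resp_descriptor = {
    "mgmt.GetAttachInfoResp", sizeof(struct resp_current)
};

/* Returns 1 when the caller's view of the struct matches the descriptor's. */
int layout_matches(const struct mock_descriptor *d, size_t caller_size)
{
    return d->sizeof_message == caller_size;
}

/* Report a mismatch, then abort the same way the generated assert would. */
void require_layout(const struct mock_descriptor *d, size_t caller_size)
{
    if (!layout_matches(d, caller_size))
        fprintf(stderr, "%s: caller sees %zu bytes, generated code sees %zu\n",
                d->name, caller_size, d->sizeof_message);
    assert(layout_matches(d, caller_size));
}
```

A caller built against the stale layout would invoke `require_layout(&resp_descriptor, sizeof(struct resp_stale))` and stop with a readable message instead of packing whatever happens to sit on the stack after the struct.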
Happy hunting,
Kevan
From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
That’s good information. I added the “interface” field and some others last week as we are expanding the capabilities of the GetAttachInfo message to help automatically configure the clients. It’s not clear why adding the fields would cause any of the unpacking code to fail, especially when it’s auto generated based on the protobuf definition. However, I did build those files with newer versions of the protobuf compiler, and it’s possible that there’s a subtle incompatibility that I wasn’t aware of.
I upgraded to the newer versions of the tools because it was less friction than getting and installing the older tools. That meant that the related protobuf files were recompiled with the new tools and are now in the tree.
I’ll look at this to understand what’s happening. Aside from debugging the failure, I’ll see if I can get the old tools reinstalled so I can rebuild the protobufs and have you try them to see if it works when compiled with the older tools. The answer to that would give some clues.
Joel