Re: Known problem creating containers?


Olivier, Jeffrey V
 

Hi Kevan,

 

Are you building daos from scratch?

 

You should not need to set LD_LIBRARY_PATH to run the IO server in such a setup because daos_io_server is built using RPATH. Can you run ldd on daos_io_server? If you do use LD_LIBRARY_PATH, make sure lib64 comes before any occurrence of lib.
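For example (install prefix taken from the backtrace further down the thread, and assuming the usual bin/ layout; adjust to match your tree):

# confirm that RPATH alone resolves every library the io server needs
env -u LD_LIBRARY_PATH ldd /home/users/daos/daos/install/bin/daos_io_server | grep 'not found'

# if LD_LIBRARY_PATH is set anyway, make the search order easy to eyeball
echo "$LD_LIBRARY_PATH" | tr ':' '\n'

The first command should print nothing; any 'not found' line names a library that RPATH cannot reach.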

 

-Jeff

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 11:54 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Joel,

 

I came to the same conclusion; sorry for wasting your time.

 

There is currently an issue where the daos_io_server dies immediately because it can’t find its own librdb.so module, which got moved into lib64/daos_srv.   If I move librdb.so to lib then it complains about other modules.  What is the correct way to configure for this?  

 

04/13-04:20:00.47 delphi-004 DAOS[70088/70088] server ERR  src/iosrv/module.c:105 dss_module_load() cannot load librdb.so: librdb.so: cannot open shared object file: No such file or directory

04/13-04:20:00.47 delphi-004 DAOS[70088/70088] server ERR  src/iosrv/init.c:195 modules_load() Failed to load module rdb: -1003

 

To work around this, I set LD_LIBRARY_PATH in the environ section of daos_server.yml to include all library-related subdirectories within the install tree. To get the install_dir pushed out to all the server nodes I use rsync. By default rsync doesn’t delete files at the destination if they are not in the source, so as libraries have moved around in the install tree over time I eventually ended up with two copies of the same .so in different directories, and LD_LIBRARY_PATH caused the wrong one to be picked up.
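In case it is useful to anyone else, the obvious fix on my side is to make rsync mirror the install tree rather than merge into it, and then check for leftovers; the paths below are just my local layout:

# --delete removes anything on the server that no longer exists in the local
# install tree, so relocated libraries don't leave stale copies behind
rsync -a --delete /home/users/daos/daos/install/ delphi-004:/home/users/daos/daos/install/

# sanity check: any .so name that still appears in more than one directory
find /home/users/daos/daos/install -name '*.so*' -printf '%f\n' | sort | uniq -d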

 

Sorry, Kevan

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 12:20 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Hi Kevan,

 

I ran your test locally in my environment on master and it encountered no issues.  Figures, right?  I spent some time looking at the auto-generated code to see what it is doing.  I don’t have any particular expertise in that, but it’s clear that if any of the parts are out of sync, it will not work.  I’m wondering if you have any stale protobuf files on your machine.  Can you diff the protobuf files on your machine against 1) daos master and 2) your coworker’s machine to see if all three sets are equal?

 

There is a full list in src/proto/Makefile.  Of particular interest are these three:

 

src/mgmt/srv.pb-c.c

src/mgmt/srv.pb-c.h

src/control/common/proto/mgmt/srv.pb.go
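
A quick way to compare the three sets is to checksum the generated sources on each machine and diff the local copies against master; the hostname and remote checkout path below are placeholders:

# run `git fetch origin` first so origin/master is current; identical .proto
# inputs and identical tools should produce identical hashes everywhere
for f in src/mgmt/srv.pb-c.c src/mgmt/srv.pb-c.h \
         src/control/common/proto/mgmt/srv.pb.go; do
    git diff origin/master -- "$f"           # against daos master
    md5sum "$f"                              # local copy
    ssh other-host "cd daos && md5sum $f"    # your coworker's copy
done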

 

Joel

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Kevan Rehm
Sent: Monday, April 13, 2020 12:30 PM
To: daos@daos.groups.io
Subject: Re: [daos] Known problem creating containers?

 

Joel,

 

Thanks for the explanation below, makes sense.   Can’t wait for that code to land.

 

Back to the problem at hand; now I am even more confused...  I borrowed one of my compatriots’ machines and breakpointed his daos_io_server in routine ds_mgmt_drpc_get_attach_info; in his daemon the resp structure has all 7 fields, and so he doesn’t get a segfault.  We are building from the same commit point.  ?????

 

Do you have any ideas on what could be different on my machine?  Same CentOS 7 release.  I will keep debugging; it again appears to be related to my environment somehow.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday, April 13, 2020 at 9:19 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Hi Kevan,

 

You are right that it won’t help a client to know the interface and domain names of the server.  In this case, we’re not actually sending the server’s interface and domain in the server’s response.  These fields are left empty until they are populated by the agent.  On the update I am working on now, the agent scans the client machine for network interfaces that support the server’s provider (based on the GetAttachInfo provider data) and populates the interface and domain fields in the response sent to the client.  In an update after that, the libdaos library then gets some rework to generate a GetAttachInfo prior to initializing CaRT so that it can use the interface and domain data that’s returned to it.  I’m working on getting this through review now.

 

Thanks for the additional debug log.  I appreciate your insight and help.  I will work on replicating the problem locally so I can fix it.

 

Joel

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Kevan Rehm
Sent: Monday, April 13, 2020 9:57 AM
To: daos@daos.groups.io
Subject: Re: [daos] Known problem creating containers?

 

Joel,

 

I’m curious; how does it help a client to know the interface and domain names of this server?   I can’t see how the client could possibly use them.

 

Anyway, back to the problem.   I am breakpointed in ds_mgmt_drpc_get_attach_info().   At the top of the routine is this:

 

        Mgmt__GetAttachInfoResp  resp = MGMT__GET_ATTACH_INFO_RESP__INIT;

 

If I look in the code at the definition of Mgmt__GetAttachInfoResp, it has the 7 data fields including your new interface field, etc., and the value of MGMT__GET_ATTACH_INFO_RESP__INIT appears to initialize all 7 of those fields.  But if I use gdb to look at that structure, you can see that the compiled code doesn’t actually know about any of the new fields; it is only aware of status and n_psrs/psrs:

 

(gdb) p resp

$7 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 0, psrs = 0x0}

(gdb) p resp.status

$8 = 0

(gdb) p resp.n_psrs

$9 = 0

(gdb) p resp.psrs

$10 = (Mgmt__GetAttachInfoResp__Psr **) 0x0

(gdb) p resp.provider

There is no member named provider.

(gdb) p resp.interface

There is no member named interface.

(gdb) p resp.domain

There is no member named domain.

 

(gdb) p sizeof(resp)

$13 = 48

 

If you do the math, the size of ‘resp’ is exactly what you would expect for a struct that ends with the psrs field: on x86_64 the 24-byte ProtobufCMessage base plus status (4 bytes plus 4 of padding), n_psrs (8) and the psrs pointer (8) comes to 48.  There is no room in the struct for the new fields.
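
Since the type the daemon is using is clearly stale, I will also scan the install tree for every object that still defines this message, in case an old copy of the library is being loaded; a rough check (install prefix as above, and assuming the descriptor symbol is exported):

# list every .so in the install tree that defines the GetAttachInfo descriptor;
# more than one hit means a stale copy of the library is still lying around
find /home/users/daos/daos/install -name '*.so*' | while read -r so; do
    nm -D --defined-only "$so" 2>/dev/null | \
        grep -q mgmt__get_attach_info_resp__descriptor && echo "$so"
done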

 

If I then step forward and enter routine mgmt__get_attach_info_resp__get_packed_size(), that routine DOES know about 7 fields and tries to reference all of them, but of course the resp structure on the stack isn’t big enough to hold the 7 fields, so this routine is looking at other junk on the stack past the end of the structure:

 

239           len = mgmt__get_attach_info_resp__get_packed_size(&resp);

(gdb) s

mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab041f740) at src/mgmt/srv.pb-c.c:295

295      assert(message->base.descriptor == &mgmt__get_attach_info_resp__descriptor);

(gdb) n

296      return protobuf_c_message_get_packed_size ((const ProtobufCMessage*)(message));

(gdb) p message

$11 = (const Mgmt__GetAttachInfoResp *) 0x2aaab041f740

(gdb) p *message

$12 = {base = {descriptor = 0x7f9756b5dcc0 <mgmt__get_attach_info_resp__descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, status = 0, n_psrs = 1, 

  psrs = 0x7f9396113e00, provider = 0x0, interface = 0xad26fa6a89442100 <Address 0xad26fa6a89442100 out of bounds>, domain = 0x7f96f4026a10 "\340\230VW\227\177", 

  crtctxshareaddr = 4093798800, crttimeout = 32662}

 

Happy hunting,

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 9:34 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

That’s good information.  I added the “interface” field and some others last week as we are expanding the capabilities of the GetAttachInfo message to help automatically configure the clients.  It’s not clear why adding the fields would cause any of the unpacking code to fail, especially when it’s auto-generated based on the protobuf definition.  However, I did build those files with newer versions of the protobuf compiler, and it’s possible that there’s a subtle incompatibility that I wasn’t aware of.

 

I upgraded to the newer versions of the tools because it was less friction than getting and installing the older tools.  That meant that the related protobuf files were recompiled with the new tools and are now in the tree. 

 

I’ll look at this to understand what’s happening.  Aside from debugging the failure, I’ll see if I can get the old tools reinstalled so I can rebuild the protobufs and have you try them to see if it works when compiled with the older tools.  The answer to that would give some clues.  
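
If it helps narrow that down, here is how the generator versions on the two build machines can be compared, assuming protoc and the protobuf-c compiler are on the PATH (the Go plugin, protoc-gen-go, matters as well for srv.pb.go):

protoc --version      # protobuf compiler
protoc-c --version    # protobuf-c code generator used for the *.pb-c.* files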

 

Joel

 

 

On Apr 12, 2020, at 9:13 PM, Kevan Rehm <kevan.rehm@...> wrote:

Joel,

 

I am still chasing this.  The problem is occurring in the server in routine ds_mgmt_drpc_get_attach_info.  Routine ds_mgmt_get_attach_info_handler() fills in ‘resp’ with n_psrs and the psrs array.  Then this routine fills in resp.status and calls mgmt__get_attach_info_resp__get_packed_size().  It is in that routine that the segfault occurs.  The struct is _Mgmt__GetAttachInfoResp; there are other fields that are not being filled in, and the segfault occurs on one of these, ‘interface’.  The MGMT__GET_ATTACH_INFO_RESP__INIT macro at the beginning of function ds_mgmt_drpc_get_attach_info appears to set all the string fields to “”, but by the time the code gets to the ‘interface’ field in mgmt__get_attach_info_resp__get_packed_size() it contains some out-of-range value that causes the segfault.

 

I don’t really understand the packing code, just giving you these tidbits until I can dig further tomorrow.

 

Kevan

 

 

From: <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 8:05 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Actually, we are not; there was some confusion on that point.  Kevan is running latest master, and I accidentally wound up a week out of date.

 

So I assume if I updated, I would have the same issue.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Sunday, April 12, 2020 5:03 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Known problem creating containers?

 

Are you both running the same build?





On Apr 12, 2020, at 4:36 PM, Kevan Rehm <kevan.rehm@...> wrote:

Sigh.  Please ignore this; one of my compatriots with the same hardware config was able to create this pool and container without error.  So the problem is obviously in my setup.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of Kevan Rehm <kevan.rehm@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Sunday, April 12, 2020 at 2:01 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Known problem creating containers?

 

Greetings,

 

Recently I updated my daos repo to master top of tree, and now any attempt to create a container causes the access-point daos_io_server to segfault.   Before I dig deeply, is this a known issue?  My config is one client node plus one server node with dual daos_io_servers.  Before running this test the server storage was reformatted.

 

Commands on the client:

 

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

No pools in system

[root@delphi-005 tmp]# dmg -i -l delphi-004 pool create --scm-size=768G --nvme-size=10T

delphi-004:10001: connected

Pool-create command SUCCEEDED: UUID: 9acb0a19-2ecf-4d3f-8f7a-2afcec26128f, Service replicas: 0

[root@delphi-005 tmp]# dmg -i -l delphi-004 system list-pools

delphi-004:10001: connected

Pool UUID                            Svc Replicas 

---------                            ------------ 

9acb0a19-2ecf-4d3f-8f7a-2afcec26128f 0            

[root@delphi-005 tmp]# daos container create --pool=9acb0a19-2ecf-4d3f-8f7a-2afcec26128f --svc=0

 

At that point the client window hangs and the daos_io_server segfaults.  The backtrace collected via gdb is:

 

Program received signal SIGSEGV, Segmentation fault.

[Switching to Thread 0x7f37bcdfd700 (LWP 22203)]

0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

559                  ret = (NULL == *(const char * const *) member) ||

(gdb) bt

#0  0x00007f37cdd2a0a8 in field_is_zeroish (member=member@entry=0x2aaab0423678, field=<optimized out>) at protobuf-c/protobuf-c.c:559

#1  0x00007f37cdd2aa53 in unlabeled_field_get_packed_size (member=0x2aaab0423678, field=0x7f37cf1d4e18 <mgmt__get_attach_info_resp__field_descriptors+216>)

    at protobuf-c/protobuf-c.c:591

#2  protobuf_c_message_get_packed_size (message=message@entry=0x2aaab0423640) at protobuf-c/protobuf-c.c:739

#3  0x00007f37cef93d31 in mgmt__get_attach_info_resp__get_packed_size (message=message@entry=0x2aaab0423640) at src/mgmt/srv.pb-c.c:296

#4  0x00007f37c6998d4a in ds_mgmt_drpc_get_attach_info (drpc_req=<optimized out>, drpc_resp=0x7f3770026a10) at src/mgmt/srv_drpc.c:239

#5  0x000000000040beb5 in drpc_handler_ult (call_ctx=0x7f3770026990) at src/iosrv/drpc_progress.c:297

#6  0x00007f37ce3c317b in ABTD_thread_func_wrapper_thread () from /home/users/daos/daos/install/lib/libabt.so.0

#7  0x00007f37ce3c3851 in make_fcontext () from /home/users/daos/daos/install/lib/libabt.so.0

#8  0x0000000000000000 in ?? ()

(gdb) p member

$1 = (const void *) 0x2aaab0423678

(gdb) p *(const char * const *) member

$3 = 0xb801e74ea7845500 <Address 0xb801e74ea7845500 out of bounds>

 

Is this a known problem?

 

Thanks, Kevan
