
Re: Need More Info from DUNS Attribute

Pittman, Ashley M
 

 

Hi,

 

Server_group certainly seems a useful thing to add; service ranks I thought were scheduled for removal, but they would also make sense if they're still in use. I can file a ticket and work on this as part of the current dfuse work I'm doing. You might be interested in https://github.com/daos-stack/daos/pull/2557, which allows you to run "daos container create --path <path>" and then "dfuse --mountpoint <path>" and have dfuse load the pool/container from the xattr.
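The xattr-based lookup described above can be sketched with a small encoder/decoder. The attribute name "user.daos" is the real one used by DAOS UNS, but the string layout and the helper names below are purely illustrative, not the exact DAOS encoding:

```python
import os

DUNS_XATTR = "user.daos"  # the attribute name used by DAOS UNS

def duns_encode(fs_type, pool_uuid, cont_uuid):
    # Illustrative layout only: one string carrying the FS type plus the
    # pool and container UUIDs (the actual DAOS format differs).
    return f"DAOS.{fs_type}://{pool_uuid}/{cont_uuid}".encode()

def duns_decode(value):
    # Reverse of duns_encode(): recover (fs_type, pool_uuid, cont_uuid).
    scheme, rest = value.decode().split("://", 1)
    fs_type = scheme.split(".", 1)[1]
    pool_uuid, cont_uuid = rest.split("/", 1)
    return fs_type, pool_uuid, cont_uuid

def tag_path(path, fs_type, pool_uuid, cont_uuid):
    # Linux-only: attach the metadata to the path itself, so a consumer
    # such as dfuse could later resolve pool/container from the xattr alone.
    os.setxattr(path, DUNS_XATTR, duns_encode(fs_type, pool_uuid, cont_uuid))
```

With something like this in place, a mount step only needs the path, mirroring "dfuse --mountpoint <path>".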

 

Things like application info should be in container attributes, I suspect, and chunk/cell size should ideally be in the array objects themselves. The latter is currently hard-coded in both dfs and the interception library, so we need a mechanism to communicate those values at run-time, but it would be on a per-object basis.

 

Ashley,

 

 

From: <daos@daos.groups.io> on behalf of "Zhang, Jiafu" <jiafu.zhang@...>
Reply to: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, 15 May 2020 at 10:05
To: "daos@daos.groups.io" <daos@daos.groups.io>
Cc: "Wang, Carson" <carson.wang@...>, "Zhu, Minming" <minming.zhu@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: [daos] Need More Info from DUNS Attribute

 

Hi Guys,

 

Currently, only three pieces of information are set in the duns attribute of a UNS path: the FS type, the pool UUID, and the container UUID. That is enough for dfuse, since dfuse's server group and service ranks are used to connect to the pool. Do you know if Lustre will use the same way to connect to the pool?

 

For Hadoop DAOS, customers seem not to want to keep additional info outside of the UNS path. Can we expand the duns attribute ("user.daos") to hold more info, like the server group and pool service ranks? Or we could use another attribute name, for example "user.daos.hadoop", to hold even more application info, like read and write buffer sizes.
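A separate "user.daos.hadoop" attribute, as proposed, could carry its extra fields in a structured value. The attribute name comes from the mail; the JSON layout and the field names in this sketch are assumptions for illustration only:

```python
import json

def encode_hadoop_attr(server_group, svc_ranks, read_buf, write_buf):
    # Hypothetical value for a "user.daos.hadoop" xattr: JSON keeps the
    # extra Hadoop-side settings inside the UNS path instead of a separate
    # config file. The field names here are made up for this sketch.
    return json.dumps({
        "server_group": server_group,
        "svc_ranks": svc_ranks,
        "read_buf": read_buf,
        "write_buf": write_buf,
    }).encode()

def decode_hadoop_attr(value):
    # Recover the settings dict from the stored xattr bytes.
    return json.loads(value.decode())
```

Keeping the extra fields in their own attribute would also leave the existing "user.daos" consumers (dfuse, Lustre) untouched.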

 

Thanks.

---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Re: nr_xs_helpers

Liu, Xuezhao
 

Hi Colin,

 

If enough cores are available, the best performance is commonly obtained by configuring one helper XS for each VOS target; in your case, with 16 targets configured, set nr_xs_helpers to 16. Each VOS target then gets one private helper XS that can assist with IO RPC forwarding etc.

That configuration needs 16 + 16 + 1 cores (the final +1 is for the internal system XS) so that each XS has its own core.

 

If there are not enough cores, you can configure fewer helpers, or zero. In that case the helper XSs are shared by all the VOS IO XSs. The performance of such a configuration is hard to predict, so it is probably better to run some real perf tests.
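The core arithmetic above (one core per target, one per helper XS, plus one system XS) can be captured in a tiny helper; the function name is mine, the formula is from the mail:

```python
def daos_io_server_cores(targets, nr_xs_helpers):
    # Cores needed so every xstream gets a dedicated core: one per VOS
    # target, one per helper XS, plus one for the internal system XS.
    return targets + nr_xs_helpers + 1
```

For the 16-target configuration above, daos_io_server_cores(16, 16) gives 33; with nr_xs_helpers left at 0 it drops to 17.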

 

Is 4 targets per NVMe a reasonable configuration?

According to Niu Yawei, 2 to 4 targets per NVMe device is commonly a reasonable config.

 

Is there a tuning guide or best practice writeup available?

There are some related comments in src/iosrv/srv.c, and some info in src/iosrv/README.md. We can add some extra information to the README.md or elsewhere later.

 

Thanks,

Xuezhao

 

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, May 15, 2020 at 7:50 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] nr_xs_helpers

 

Greetings,

 

We currently set nr_xs_helpers to 0. We have defined 16 targets per daos_io_server. For best performance, what value should nr_xs_helpers be set to? Are there other considerations, such as the number of CPUs?

 

Is 4 targets per NVMe a reasonable configuration?

 

Is there a tuning guide or best practice writeup available?

 

Thanks.

 

Colin




Re: Message looks serious?

Wang, Di
 

Hello, Colin

I tried with this commit, and it does generate the failure message in my environment.

I assume you built the source yourself? And what is the output of "ldd install/bin/daos"?

Thanks
WangDi 
From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday, May 14, 2020 at 9:41 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Message looks serious?

Greetings,

 

commit 8200a7fb403e091b51b4b00c1aec57dafefb1ada

 

Thanks.

 

Colin

 

From: <daos@daos.groups.io> on behalf of "Wang, Di" <di.wang@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday, May 14, 2020 at 11:37 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Message looks serious?

 

Hello, 

 

Thanks. This does show the connection failed with --svc 40. I am not sure why it does not output any failure messages, but I have seen others also complain about getting no failure message. Which version are you using, 0.9 or master?

 

Btw: these server log messages may also be seen during pool creation. It is a known issue, and we will fix it. Thanks.

 

05/13-11:57:11.02 delphi-006 DAOS[26509/26552] pool WARN src/pool/srv_target.c:1020 ds_pool_tgt_map_update() Ignore update pool a68b3845 1 -> 1

 

Thanks

Wangdi

From: <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday, May 14, 2020 at 8:06 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Message looks serious?

 

Hi WangDi,

 

Is this what you need:

 

05/14-09:53:00.86 hl-d106 DAOS[20928/20928] fi   INFO src/cart/src/gurt/fault_inject.c:486 d_fault_inject_init() No config file, fault injection is OFF.

05/14-09:53:00.86 hl-d106 DAOS[20928/20928] crt  INFO src/cart/src/cart/crt_init.c:282 crt_init_opt() libcart version 4.7.0 initializing

05/14-09:53:00.86 hl-d106 DAOS[20928/20928] crt  WARN src/cart/src/cart/crt_init.c:174 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048

05/14-09:53:00.86 hl-d106 DAOS[20928/20928] crt  WARN src/cart/src/cart/crt_init.c:393 crt_init_opt() FI_OFI_RXM_USE_SRX not set, set=1

05/14-09:53:01.23 hl-d106 DAOS[20928/20928] client INFO src/utils/daos.c:142 cmd_args_print()   DAOS system name: daos_server

05/14-09:53:01.23 hl-d106 DAOS[20928/20928] client INFO src/utils/daos.c:143 cmd_args_print()   pool UUID: 6eb32fb0-49e9-49fd-96e8-bba14728a8c3

05/14-09:53:01.23 hl-d106 DAOS[20928/20928] client INFO src/utils/daos.c:144 cmd_args_print()   cont UUID: 00000000-0000-0000-0000-000000000000

05/14-09:53:01.23 hl-d106 DAOS[20928/20928] client INFO src/utils/daos.c:148 cmd_args_print()   pool svc: parsed 1 ranks from input 1

05/14-09:53:01.23 hl-d106 DAOS[20928/20928] client INFO src/utils/daos.c:152 cmd_args_print()   attr: name=NULL, value=NULL

05/14-09:53:01.23 hl-d106 DAOS[20928/20928] client INFO src/utils/daos.c:156 cmd_args_print()   path=NULL, type=unknown, oclass=UNKNOWN, chunk_size=0

05/14-09:53:01.23 hl-d106 DAOS[20928/20928] client INFO src/utils/daos.c:162 cmd_args_print()   snapshot: name=NULL, epoch=0, epoch range=NULL (0-0)

05/14-09:53:01.23 hl-d106 DAOS[20928/20928] client INFO src/utils/daos.c:163 cmd_args_print()   oid: 0.0

05/14-09:53:01.31 hl-d106 DAOS[20928/20928] daos INFO src/common/drpc.c:664 drpc_close() Closing dRPC socket fd=19

05/14-09:53:01.35 hl-d106 DAOS[20928/20928] daos INFO src/common/drpc.c:664 drpc_close() Closing dRPC socket fd=19

05/14-09:53:01.40 hl-d106 DAOS[20928/20928] common ERR  src/common/rsvc.c:142 rsvc_client_process_error() removed rank 1 from replica list due to DER_NOTREPLICA(-2020)

05/14-09:53:01.40 hl-d106 DAOS[20928/20928] common WARN src/common/rsvc.c:102 rsvc_client_choose() replica list empty

05/14-09:53:01.40 hl-d106 DAOS[20928/20928] pool ERR  src/pool/cli.c:539 dc_pool_connect() 6eb32fb0: cannot find pool service: DER_NOTREPLICA(-2020)

failed to connect to pool: -1005

 

This comes from the daos.log on the Client.

 

Thanks.

 

Colin

 

From: <daos@daos.groups.io> on behalf of "Wang, Di" <di.wang@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, May 13, 2020 at 4:07 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Message looks serious?

 

If 40 does not exist, it should not be able to connect to the pool at all, i.e. it should output something like "failed to connect to pool: …".

 

These logs seem to suggest the pool connection did happen. Would you please collect the client-side daos log (by setting "export D_LOG_FILE=xxx")? It might tell us what really happened. Thanks.

 

Thanks

WangDi

From: <daos@daos.groups.io> on behalf of Colin Ngam <cngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, May 13, 2020 at 10:08 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Message looks serious?

 

Greetings,

 

Executing the command:

daos pool list-cont --pool a68b3845-fe78-481e-aa84-164e851d5f52 --svc 40

 

Note that 40 does not exist.

 

We did not get an error from the daos command.

 

In the log:

 

05/13-11:57:11.02 delphi-006 DAOS[26509/26552] pool WARN src/pool/srv_target.c:1020 ds_pool_tgt_map_update() Ignore update pool a68b3845 1 -> 1

05/13-11:57:11.02 delphi-006 DAOS[26509/26552] pool WARN src/pool/srv_target.c:1020 ds_pool_tgt_map_update() Ignore update pool a68b3845 1 -> 1

05/13-11:57:11.02 delphi-006 DAOS[26509/26552] pool WARN src/pool/srv_target.c:1020 ds_pool_tgt_map_update() Ignore update pool a68b3845 1 -> 1

 

My guess is that ds_pool_tgt_map_update() should not even be called?

 

Cheers,

 

Colin

 



Re: DAOS command no longer giving output on failure

Faccini, Bruno
 

Hello Patrick,

Well, this is really strange indeed. BTW, I am unable to reproduce this when running with current master, which is only 45 commits ahead of yours, and none of those commits seems to be related…

If the problem is still there or reoccurs, and you don't want to debug it yourself, could you at least re-run one of these failing commands under the strace tool?

Thanks,

Bruno.

 

From: <daos@daos.groups.io> on behalf of "Farrell, Patrick Arthur" <patrick.farrell@...>
Reply to: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday 8 May 2020 at 16:07
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] DAOS command no longer giving output on failure

 

Good morning,

 

I'm running commit:

commit 53b0c5ff3d45c8addfee11cbfd6dd49e6f88dc3e

And the daos container create command is no longer giving any output on failure:

[daos@hl-d102 ~]$ daos container create --pool=d885cc1f-9363-4342-a859-ec32b3230c08 --svc=1 16G

[daos@hl-d102 ~]$ daos container create --pool=d885cc1f-9363-4342-a859-ec32b3230c08 --svc=1

[daos@hl-d102 ~]$ daos container create --pool=d885cc1f-9363-4342-a859-ec32b3230c08

[daos@hl-d102 ~]$ daos container create --pool=d885cc1f-9363-4342-a859-ec32b3230c0

 

All of those failed. I am not completely sure why, though I'm capable of using debug to determine that (it's probably some sort of config issue), and obviously the first command is wrong.

 

The key point is:
None of them gave *any* output.  This makes troubleshooting a little tricky.

 

I'm not sure at what commit this last worked, but it definitely did a few weeks ago.

 

Thanks,

Patrick

---------------------------------------------------------------------
Intel Corporation SAS (French simplified joint stock company)
Registered headquarters: "Les Montalets"- 2, rue de Paris,
92196 Meudon Cedex, France
Registration Number:  302 456 199 R.C.S. NANTERRE
Capital: 4,572,000 Euros

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Re: Pool creation fails with "instance is not an access point"

4felgenh@...
 

Hi,

Thanks for the tip! It seems that a regular "dmg storage format" doesn't do the trick, but appending "--reformat" resolves the issue. It runs fine now.

Kind regards
Ruben

Am 14.05.20 um 09:02 schrieb Lombardi, Johann:


Hi,

I assume that you have run dmg storage format after starting the server and before creating the pool, right? If you don’t want to emulate any SSD, you should also comment out the bdev_* options in the yaml file.

Cheers,

Johann

From: <daos@daos.groups.io> on behalf of "4felgenh@..." <4felgenh@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Saturday 9 May 2020 at 21:30
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Pool creation fails with "instance is not an access point"

Hello,

My setup and issues are very similar to what has been described in https://daos.groups.io/g/daos/message/317. I have tried every suggested fix from the corresponding thread and so far, none have worked.

I'd like to set up a very simple daos server for testing purposes on a server with 120 GB of regular RAM, no NVMe SSDs attached, and only one single server instance. However, I'm not using docker.

Hence, I installed daos as described in the admin guide and used the config file from daos/utils/config/examples/daos_server_local.yml which would use a ram disk to emulate scm.

After I have started the daos server with "daos_server --debug --config=$basepath/daos_server_local.yml start", I'd like to create a pool in a second terminal with "dmg -i -l localhost:10001 pool create -s 1G". This fails with:

localhost:10001: connected
Pool-create command FAILED: rpc error: code = Unknown desc = instance is not an access point
ERROR: dmg: rpc error: code = Unknown desc = instance is not an access point

Did I miss some step that I have to execute beforehand?





Re: after formatting scm , no dRPC client set problem

Lombardi, Johann
 

Hi there,

 

From your log:

snode3: Starting I/O server instance 0: /usr/bin/daos_io_server

snode2: daos_io_server:0  05/12-06:28:47.96 snode2 Using legacy core allocation algorithm

snode2: instance 0 exited: instance 0 exited prematurely: /usr/bin/daos_io_server (instance 0) exited: exit status 1

 

After format, the I/O engine failed to be started. Could you please look into the server logs under /tmp (i.e. /tmp/server.log)?

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "timehuang88@..." <timehuang88@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday 11 May 2020 at 11:40
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] after formatting scm , no dRPC client set problem

 

Hi there,
I am trying to run DAOS in a real physical environment, but I ran into a problem. Below is the output when I try to start a DAOS system with 3 storage nodes.
Can anybody help me? If you need any information, please let me know. Thanks.
By the way, another question: inside the daos_server.yml file, which files should I put into the client_cert_dir (/etc/daos/clients), or should I just leave it empty?

[root@client ~]# clush -w snode[1-3] daos_server start -o /etc/daos/daos_server.yml

snode2: daos_server logging to file /tmp/daos_control.log

snode2: DEBUG 06:28:33.811928 start.go:105: Switching control log level to DEBUG

snode2: DEBUG 06:28:33.812236 netdetect.go:829: Calling ValidateProviderConfig with eno3, ofi+verbs;ofi_rxm

snode2: DEBUG 06:28:33.812276 netdetect.go:880: Input provider string: ofi+verbs;ofi_rxm

snode1: daos_server logging to file /tmp/daos_control.log

snode1: DEBUG 17:27:30.107559 start.go:105: Switching control log level to DEBUG

snode1: DEBUG 17:27:30.107776 netdetect.go:829: Calling ValidateProviderConfig with eno3, ofi+verbs;ofi_rxm

snode1: DEBUG 17:27:30.107811 netdetect.go:880: Input provider string: ofi+verbs;ofi_rxm

snode3: daos_server logging to file /tmp/daos_control.log

snode3: DEBUG 17:28:01.164972 start.go:105: Switching control log level to DEBUG

snode3: DEBUG 17:28:01.165212 netdetect.go:829: Calling ValidateProviderConfig with eno3, ofi+verbs;ofi_rxm

snode3: DEBUG 17:28:01.165251 netdetect.go:880: Input provider string: ofi+verbs;ofi_rxm

snode2: DEBUG 06:28:34.026543 netdetect.go:912: There are 0 hfi1 devices in the system

snode2: DEBUG 06:28:34.026642 netdetect.go:844: Device eno3 supports provider: ofi+verbs;ofi_rxm

snode2: DEBUG 06:28:34.027690 config.go:391: Active config saved to /etc/daos/.daos_server.active.yml (read-only)

snode2: DEBUG 06:28:34.028020 server.go:137: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 PCIWhitelist: TargetUser:root ResetOnly:false}

snode1: DEBUG 17:27:30.326383 netdetect.go:912: There are 0 hfi1 devices in the system

snode1: DEBUG 17:27:30.326540 netdetect.go:844: Device eno3 supports provider: ofi+verbs;ofi_rxm

snode1: DEBUG 17:27:30.327572 config.go:391: Active config saved to /etc/daos/.daos_server.active.yml (read-only)

snode1: DEBUG 17:27:30.327889 server.go:137: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 PCIWhitelist: TargetUser:root ResetOnly:false}

snode3: DEBUG 17:28:01.386394 netdetect.go:912: There are 0 hfi1 devices in the system

snode3: DEBUG 17:28:01.386496 netdetect.go:844: Device eno3 supports provider: ofi+verbs;ofi_rxm

snode3: DEBUG 17:28:01.387691 config.go:391: Active config saved to /etc/daos/.daos_server.active.yml (read-only)

snode3: DEBUG 17:28:01.387992 server.go:137: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 PCIWhitelist: TargetUser:root ResetOnly:false}

snode2: DEBUG 06:28:42.730225 netdetect.go:591: Searching for a device alias for: eno3

snode2: DEBUG 06:28:42.961947 netdetect.go:334: There are 2 children of this parent node.

snode2: DEBUG 06:28:42.962026 netdetect.go:616: Device alias for eno3 is i40iw0

snode3: DEBUG 17:28:10.257330 netdetect.go:591: Searching for a device alias for: eno3

snode2: ERROR: /usr/bin/daos_admin EAL: No available hugepages reported in hugepages-1048576kB

snode3: DEBUG 17:28:10.492866 netdetect.go:334: There are 2 children of this parent node.

snode3: DEBUG 17:28:10.492945 netdetect.go:616: Device alias for eno3 is i40iw1

snode3: ERROR: /usr/bin/daos_admin EAL: No available hugepages reported in hugepages-1048576kB

snode1: DEBUG 17:27:40.062619 netdetect.go:591: Searching for a device alias for: eno3

snode1: DEBUG 17:27:40.294456 netdetect.go:334: There are 2 children of this parent node.

snode1: DEBUG 17:27:40.294538 netdetect.go:616: Device alias for eno3 is i40iw1

snode1: ERROR: /usr/bin/daos_admin EAL: No available hugepages reported in hugepages-1048576kB

snode2: DAOS Control Server (pid 10183) listening on 0.0.0.0:10001

snode2: DEBUG 06:28:45.672802 instance_exec.go:55: instance 0: checking if storage is formatted

snode2: Waiting for DAOS I/O Server instance 0 storage to be ready...

snode2: DEBUG 06:28:45.672850 instance_storage.go:88: /mnt/daos: checking formatting

snode3: DAOS Control Server (pid 11997) listening on 0.0.0.0:10001

snode3: DEBUG 17:28:12.955111 instance_exec.go:55: instance 0: checking if storage is formatted

snode3: Waiting for DAOS I/O Server instance 0 storage to be ready...

snode3: DEBUG 17:28:12.955181 instance_storage.go:88: /mnt/daos: checking formatting

snode1: DAOS Control Server (pid 14644) listening on 0.0.0.0:10001

snode1: DEBUG 17:27:42.844751 instance_exec.go:55: instance 0: checking if storage is formatted

snode1: Waiting for DAOS I/O Server instance 0 storage to be ready...

snode1: DEBUG 17:27:42.844827 instance_storage.go:88: /mnt/daos: checking formatting

snode2: DEBUG 06:28:47.811976 instance_storage.go:104: /mnt/daos (dcpm) needs format: false

snode2: DEBUG 06:28:47.812056 instance_storage.go:135: instance 0: no SCM format required; checking for superblock

snode2: DEBUG 06:28:47.812109 superblock.go:112: /mnt/daos: checking superblock

snode2: DEBUG 06:28:47.813721 instance_storage.go:141: instance 0: superblock not needed

snode2: SCM @ /mnt/daos: 532 GB Total/528 GB Avail

snode2: DEBUG 06:28:47.814325 instance_exec.go:93: instance 0: awaiting DAOS I/O Server init

snode2: DEBUG 06:28:47.814536 exec.go:115: daos_io_server:0 args: [-t 8 -x 0 -f 1 -g daos_server -d /tmp/daos_sockets -s /mnt/daos -n /mnt/daos/daos_nvme.conf -i 10184 -I 0]

snode2: DEBUG 06:28:47.814617 exec.go:116: daos_io_server:0 env: [OFI_INTERFACE=eno3 CRT_TIMEOUT=0 DAOS_MD_CAP=1024 CRT_CTX_SHARE_ADDR=0 CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm D_LOG_FILE=/tmp/server0.log OFI_PORT=31416 FI_SOCKETS_MAX_CONN_RETRY=1 FI_SOCKETS_CONN_TIMEOUT=2000 OFI_DOMAIN=i40iw0 D_LOG_MASK=ERR]

snode2: Starting I/O server instance 0: /usr/bin/daos_io_server

snode3: DEBUG 17:28:15.086695 instance_storage.go:104: /mnt/daos (dcpm) needs format: false

snode3: DEBUG 17:28:15.086764 instance_storage.go:135: instance 0: no SCM format required; checking for superblock

snode3: DEBUG 17:28:15.086822 superblock.go:112: /mnt/daos: checking superblock

snode3: DEBUG 17:28:15.088428 instance_storage.go:141: instance 0: superblock not needed

snode3: SCM @ /mnt/daos: 532 GB Total/528 GB Avail

snode3: DEBUG 17:28:15.089161 instance_exec.go:93: instance 0: awaiting DAOS I/O Server init

snode3: DEBUG 17:28:15.089521 exec.go:115: daos_io_server:0 args: [-t 8 -x 0 -f 1 -g daos_server -d /tmp/daos_sockets -s /mnt/daos -n /mnt/daos/daos_nvme.conf -i 11998 -I 0]

snode3: DEBUG 17:28:15.089610 exec.go:116: daos_io_server:0 env: [OFI_DOMAIN=i40iw1 DAOS_MD_CAP=1024 FI_SOCKETS_MAX_CONN_RETRY=1 D_LOG_MASK=ERR CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm OFI_INTERFACE=eno3 OFI_PORT=31416 CRT_CTX_SHARE_ADDR=0 CRT_TIMEOUT=0 FI_SOCKETS_CONN_TIMEOUT=2000 D_LOG_FILE=/tmp/server0.log]

snode3: Starting I/O server instance 0: /usr/bin/daos_io_server

snode2: daos_io_server:0  05/12-06:28:47.96 snode2 Using legacy core allocation algorithm

snode2: instance 0 exited: instance 0 exited prematurely: /usr/bin/daos_io_server (instance 0) exited: exit status 1

snode2: ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)

snode3: daos_io_server:0  05/11-17:28:15.24 snode3 Using legacy core allocation algorithm

snode3: instance 0 exited: instance 0 exited prematurely: /usr/bin/daos_io_server (instance 0) exited: exit status 1

snode3: ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)

snode1: DEBUG 17:27:44.741324 instance_storage.go:104: /mnt/daos (dcpm) needs format: false

snode1: DEBUG 17:27:44.741373 instance_storage.go:135: instance 0: no SCM format required; checking for superblock

snode1: DEBUG 17:27:44.741404 superblock.go:112: /mnt/daos: checking superblock

snode1: DEBUG 17:27:44.742695 instance_storage.go:141: instance 0: superblock not needed

snode1: SCM @ /mnt/daos: 532 GB Total/528 GB Avail

snode1: DEBUG 17:27:44.743511 instance.go:382: instance 0: bootstrapping system member: rank 0, addr 10.158.24.33:10001

snode1: DEBUG 17:27:44.743543 instance_exec.go:93: instance 0: awaiting DAOS I/O Server init

snode1: DEBUG 17:27:44.743998 exec.go:115: daos_io_server:0 args: [-t 8 -x 0 -f 1 -g daos_server -d /tmp/daos_sockets -s /mnt/daos -n /mnt/daos/daos_nvme.conf -i 14645 -I 0]

snode1: DEBUG 17:27:44.744066 exec.go:116: daos_io_server:0 env: [FI_SOCKETS_CONN_TIMEOUT=2000 OFI_DOMAIN=i40iw1 DAOS_MD_CAP=1024 D_LOG_MASK=ERR CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm OFI_PORT=31416 CRT_CTX_SHARE_ADDR=0 CRT_TIMEOUT=0 FI_SOCKETS_MAX_CONN_RETRY=1 D_LOG_FILE=/tmp/server0.log OFI_INTERFACE=eno3]

snode1: Starting I/O server instance 0: /usr/bin/daos_io_server

snode1: daos_io_server:0  05/11-17:27:44.89 snode1 Using legacy core allocation algorithm

snode1: instance 0 exited: instance 0 exited prematurely: /usr/bin/daos_io_server (instance 0) exited: exit status 1

snode1: ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)

 



Re: Message looks serious?

Wang, Di
 

If 40 does not exist, the client should not be able to connect to the pool at all, i.e. it should output something like "failed to connect to pool: …".

These logs seem to suggest that the pool connection did happen. Could you please collect the client-side daos log (by setting "export D_LOG_FILE=xxx")? It might tell us what really happened. Thanks.
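A minimal sketch of what that looks like in practice (the log path here is a placeholder; D_LOG_MASK is optional but raises verbosity, as seen in the server-side logs elsewhere in this thread):

```shell
# Enable client-side DAOS logging before re-running the failing command.
export D_LOG_FILE=/tmp/daos_client.log   # placeholder path; pick any writable file
export D_LOG_MASK=DEBUG                  # optional: raise verbosity from the default
# then re-run, e.g.:
#   daos pool list-cont --pool a68b3845-fe78-481e-aa84-164e851d5f52 --svc 40
```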

Thanks
WangDi
From: <daos@daos.groups.io> on behalf of Colin Ngam <cngam@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, May 13, 2020 at 10:08 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Message looks serious?

Greetings,

 

Executing the command:

daos pool list-cont --pool a68b3845-fe78-481e-aa84-164e851d5f52 --svc 40

 

Note that 40 does not exist.

 

We did not get an error from the daos command.

 

In the log:

 

05/13-11:57:11.02 delphi-006 DAOS[26509/26552] pool WARN src/pool/srv_target.c:1020 ds_pool_tgt_map_update() Ignore update pool a68b3845 1 -> 1

05/13-11:57:11.02 delphi-006 DAOS[26509/26552] pool WARN src/pool/srv_target.c:1020 ds_pool_tgt_map_update() Ignore update pool a68b3845 1 -> 1

05/13-11:57:11.02 delphi-006 DAOS[26509/26552] pool WARN src/pool/srv_target.c:1020 ds_pool_tgt_map_update() Ignore update pool a68b3845 1 -> 1

 

My guess is that ds_pool_tgt_map_update() should not even be called?

 

Cheers,

 

Colin

 


Message looks serious?

Colin Ngam <cngam@...>
 

Greetings,

 

Executing the command:

daos pool list-cont --pool a68b3845-fe78-481e-aa84-164e851d5f52 --svc 40

 

Note that 40 does not exist.

 

We did not get an error from the daos command.

 

In the log:

 

05/13-11:57:11.02 delphi-006 DAOS[26509/26552] pool WARN src/pool/srv_target.c:1020 ds_pool_tgt_map_update() Ignore update pool a68b3845 1 -> 1

05/13-11:57:11.02 delphi-006 DAOS[26509/26552] pool WARN src/pool/srv_target.c:1020 ds_pool_tgt_map_update() Ignore update pool a68b3845 1 -> 1

05/13-11:57:11.02 delphi-006 DAOS[26509/26552] pool WARN src/pool/srv_target.c:1020 ds_pool_tgt_map_update() Ignore update pool a68b3845 1 -> 1

 

My guess is that ds_pool_tgt_map_update() should not even be called?

 

Cheers,

 

Colin

 


after formatting scm , no dRPC client set problem

timehuang88@...
 

Hi there,
I am trying to run DAOS in a real physical environment, but I ran into a problem. Below is the output when I try to start a DAOS system with 3 storage nodes.
Can anybody help me? If you need any information, please let me know. Thanks.
By the way, another question: inside the daos_server.yml file, which files should I put into the client_cert_dir (/etc/daos/clients), or should I just leave it empty?

[root@client ~]# clush -w snode[1-3] daos_server start -o /etc/daos/daos_server.yml
snode2: daos_server logging to file /tmp/daos_control.log
snode2: DEBUG 06:28:33.811928 start.go:105: Switching control log level to DEBUG
snode2: DEBUG 06:28:33.812236 netdetect.go:829: Calling ValidateProviderConfig with eno3, ofi+verbs;ofi_rxm
snode2: DEBUG 06:28:33.812276 netdetect.go:880: Input provider string: ofi+verbs;ofi_rxm
snode1: daos_server logging to file /tmp/daos_control.log
snode1: DEBUG 17:27:30.107559 start.go:105: Switching control log level to DEBUG
snode1: DEBUG 17:27:30.107776 netdetect.go:829: Calling ValidateProviderConfig with eno3, ofi+verbs;ofi_rxm
snode1: DEBUG 17:27:30.107811 netdetect.go:880: Input provider string: ofi+verbs;ofi_rxm
snode3: daos_server logging to file /tmp/daos_control.log
snode3: DEBUG 17:28:01.164972 start.go:105: Switching control log level to DEBUG
snode3: DEBUG 17:28:01.165212 netdetect.go:829: Calling ValidateProviderConfig with eno3, ofi+verbs;ofi_rxm
snode3: DEBUG 17:28:01.165251 netdetect.go:880: Input provider string: ofi+verbs;ofi_rxm
snode2: DEBUG 06:28:34.026543 netdetect.go:912: There are 0 hfi1 devices in the system
snode2: DEBUG 06:28:34.026642 netdetect.go:844: Device eno3 supports provider: ofi+verbs;ofi_rxm
snode2: DEBUG 06:28:34.027690 config.go:391: Active config saved to /etc/daos/.daos_server.active.yml (read-only)
snode2: DEBUG 06:28:34.028020 server.go:137: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 PCIWhitelist: TargetUser:root ResetOnly:false}
snode1: DEBUG 17:27:30.326383 netdetect.go:912: There are 0 hfi1 devices in the system
snode1: DEBUG 17:27:30.326540 netdetect.go:844: Device eno3 supports provider: ofi+verbs;ofi_rxm
snode1: DEBUG 17:27:30.327572 config.go:391: Active config saved to /etc/daos/.daos_server.active.yml (read-only)
snode1: DEBUG 17:27:30.327889 server.go:137: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 PCIWhitelist: TargetUser:root ResetOnly:false}
snode3: DEBUG 17:28:01.386394 netdetect.go:912: There are 0 hfi1 devices in the system
snode3: DEBUG 17:28:01.386496 netdetect.go:844: Device eno3 supports provider: ofi+verbs;ofi_rxm
snode3: DEBUG 17:28:01.387691 config.go:391: Active config saved to /etc/daos/.daos_server.active.yml (read-only)
snode3: DEBUG 17:28:01.387992 server.go:137: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:4096 PCIWhitelist: TargetUser:root ResetOnly:false}
snode2: DEBUG 06:28:42.730225 netdetect.go:591: Searching for a device alias for: eno3
snode2: DEBUG 06:28:42.961947 netdetect.go:334: There are 2 children of this parent node.
snode2: DEBUG 06:28:42.962026 netdetect.go:616: Device alias for eno3 is i40iw0
snode3: DEBUG 17:28:10.257330 netdetect.go:591: Searching for a device alias for: eno3
snode2: ERROR: /usr/bin/daos_admin EAL: No available hugepages reported in hugepages-1048576kB
snode3: DEBUG 17:28:10.492866 netdetect.go:334: There are 2 children of this parent node.
snode3: DEBUG 17:28:10.492945 netdetect.go:616: Device alias for eno3 is i40iw1
snode3: ERROR: /usr/bin/daos_admin EAL: No available hugepages reported in hugepages-1048576kB
snode1: DEBUG 17:27:40.062619 netdetect.go:591: Searching for a device alias for: eno3
snode1: DEBUG 17:27:40.294456 netdetect.go:334: There are 2 children of this parent node.
snode1: DEBUG 17:27:40.294538 netdetect.go:616: Device alias for eno3 is i40iw1
snode1: ERROR: /usr/bin/daos_admin EAL: No available hugepages reported in hugepages-1048576kB
snode2: DAOS Control Server (pid 10183) listening on 0.0.0.0:10001
snode2: DEBUG 06:28:45.672802 instance_exec.go:55: instance 0: checking if storage is formatted
snode2: Waiting for DAOS I/O Server instance 0 storage to be ready...
snode2: DEBUG 06:28:45.672850 instance_storage.go:88: /mnt/daos: checking formatting
snode3: DAOS Control Server (pid 11997) listening on 0.0.0.0:10001
snode3: DEBUG 17:28:12.955111 instance_exec.go:55: instance 0: checking if storage is formatted
snode3: Waiting for DAOS I/O Server instance 0 storage to be ready...
snode3: DEBUG 17:28:12.955181 instance_storage.go:88: /mnt/daos: checking formatting
snode1: DAOS Control Server (pid 14644) listening on 0.0.0.0:10001
snode1: DEBUG 17:27:42.844751 instance_exec.go:55: instance 0: checking if storage is formatted
snode1: Waiting for DAOS I/O Server instance 0 storage to be ready...
snode1: DEBUG 17:27:42.844827 instance_storage.go:88: /mnt/daos: checking formatting
snode2: DEBUG 06:28:47.811976 instance_storage.go:104: /mnt/daos (dcpm) needs format: false
snode2: DEBUG 06:28:47.812056 instance_storage.go:135: instance 0: no SCM format required; checking for superblock
snode2: DEBUG 06:28:47.812109 superblock.go:112: /mnt/daos: checking superblock
snode2: DEBUG 06:28:47.813721 instance_storage.go:141: instance 0: superblock not needed
snode2: SCM @ /mnt/daos: 532 GB Total/528 GB Avail
snode2: DEBUG 06:28:47.814325 instance_exec.go:93: instance 0: awaiting DAOS I/O Server init
snode2: DEBUG 06:28:47.814536 exec.go:115: daos_io_server:0 args: [-t 8 -x 0 -f 1 -g daos_server -d /tmp/daos_sockets -s /mnt/daos -n /mnt/daos/daos_nvme.conf -i 10184 -I 0]
snode2: DEBUG 06:28:47.814617 exec.go:116: daos_io_server:0 env: [OFI_INTERFACE=eno3 CRT_TIMEOUT=0 DAOS_MD_CAP=1024 CRT_CTX_SHARE_ADDR=0 CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm D_LOG_FILE=/tmp/server0.log OFI_PORT=31416 FI_SOCKETS_MAX_CONN_RETRY=1 FI_SOCKETS_CONN_TIMEOUT=2000 OFI_DOMAIN=i40iw0 D_LOG_MASK=ERR]
snode2: Starting I/O server instance 0: /usr/bin/daos_io_server
snode3: DEBUG 17:28:15.086695 instance_storage.go:104: /mnt/daos (dcpm) needs format: false
snode3: DEBUG 17:28:15.086764 instance_storage.go:135: instance 0: no SCM format required; checking for superblock
snode3: DEBUG 17:28:15.086822 superblock.go:112: /mnt/daos: checking superblock
snode3: DEBUG 17:28:15.088428 instance_storage.go:141: instance 0: superblock not needed
snode3: SCM @ /mnt/daos: 532 GB Total/528 GB Avail
snode3: DEBUG 17:28:15.089161 instance_exec.go:93: instance 0: awaiting DAOS I/O Server init
snode3: DEBUG 17:28:15.089521 exec.go:115: daos_io_server:0 args: [-t 8 -x 0 -f 1 -g daos_server -d /tmp/daos_sockets -s /mnt/daos -n /mnt/daos/daos_nvme.conf -i 11998 -I 0]
snode3: DEBUG 17:28:15.089610 exec.go:116: daos_io_server:0 env: [OFI_DOMAIN=i40iw1 DAOS_MD_CAP=1024 FI_SOCKETS_MAX_CONN_RETRY=1 D_LOG_MASK=ERR CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm OFI_INTERFACE=eno3 OFI_PORT=31416 CRT_CTX_SHARE_ADDR=0 CRT_TIMEOUT=0 FI_SOCKETS_CONN_TIMEOUT=2000 D_LOG_FILE=/tmp/server0.log]
snode3: Starting I/O server instance 0: /usr/bin/daos_io_server
snode2: daos_io_server:0  05/12-06:28:47.96 snode2 Using legacy core allocation algorithm
snode2: instance 0 exited: instance 0 exited prematurely: /usr/bin/daos_io_server (instance 0) exited: exit status 1
snode2: ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)
snode3: daos_io_server:0  05/11-17:28:15.24 snode3 Using legacy core allocation algorithm
snode3: instance 0 exited: instance 0 exited prematurely: /usr/bin/daos_io_server (instance 0) exited: exit status 1
snode3: ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)
snode1: DEBUG 17:27:44.741324 instance_storage.go:104: /mnt/daos (dcpm) needs format: false
snode1: DEBUG 17:27:44.741373 instance_storage.go:135: instance 0: no SCM format required; checking for superblock
snode1: DEBUG 17:27:44.741404 superblock.go:112: /mnt/daos: checking superblock
snode1: DEBUG 17:27:44.742695 instance_storage.go:141: instance 0: superblock not needed
snode1: SCM @ /mnt/daos: 532 GB Total/528 GB Avail
snode1: DEBUG 17:27:44.743511 instance.go:382: instance 0: bootstrapping system member: rank 0, addr 10.158.24.33:10001
snode1: DEBUG 17:27:44.743543 instance_exec.go:93: instance 0: awaiting DAOS I/O Server init
snode1: DEBUG 17:27:44.743998 exec.go:115: daos_io_server:0 args: [-t 8 -x 0 -f 1 -g daos_server -d /tmp/daos_sockets -s /mnt/daos -n /mnt/daos/daos_nvme.conf -i 14645 -I 0]
snode1: DEBUG 17:27:44.744066 exec.go:116: daos_io_server:0 env: [FI_SOCKETS_CONN_TIMEOUT=2000 OFI_DOMAIN=i40iw1 DAOS_MD_CAP=1024 D_LOG_MASK=ERR CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm OFI_PORT=31416 CRT_CTX_SHARE_ADDR=0 CRT_TIMEOUT=0 FI_SOCKETS_MAX_CONN_RETRY=1 D_LOG_FILE=/tmp/server0.log OFI_INTERFACE=eno3]
snode1: Starting I/O server instance 0: /usr/bin/daos_io_server
snode1: daos_io_server:0  05/11-17:27:44.89 snode1 Using legacy core allocation algorithm
snode1: instance 0 exited: instance 0 exited prematurely: /usr/bin/daos_io_server (instance 0) exited: exit status 1
snode1: ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)
 


Pool creation fails with "instance is not an access point"

4felgenh@...
 

Hello,

My setup and issues are very similar to what has been described in https://daos.groups.io/g/daos/message/317. I have tried every suggested fix from the corresponding thread and so far, none have worked.

I'd like to set up a very simple daos server for testing purposes on a server with 120 GB of regular RAM, no NVMe SSDs attached, and only a single server instance. However, I'm not using docker.

Hence, I installed daos as described in the admin guide and used the config file from daos/utils/config/examples/daos_server_local.yml, which uses a ram disk to emulate scm.

After I have started the daos server with "daos_server --debug --config=$basepath/daos_server_local.yml start", I'd like to create a pool in a second terminal with "dmg -i -l localhost:10001 pool create -s 1G". This fails with:

localhost:10001: connected
Pool-create command FAILED: rpc error: code = Unknown desc = instance is not an access point
ERROR: dmg: rpc error: code = Unknown desc = instance is not an access point

Did I miss some step that I have to execute beforehand?


DAOS command no longer giving output on failure

Farrell, Patrick Arthur <patrick.farrell@...>
 

Good morning,

I'm running commit:
commit 53b0c5ff3d45c8addfee11cbfd6dd49e6f88dc3e

And the daos container create command is no longer giving any output on failure:
[daos@hl-d102 ~]$ daos container create --pool=d885cc1f-9363-4342-a859-ec32b3230c08 --svc=1 16G
[daos@hl-d102 ~]$ daos container create --pool=d885cc1f-9363-4342-a859-ec32b3230c08 --svc=1
[daos@hl-d102 ~]$ daos container create --pool=d885cc1f-9363-4342-a859-ec32b3230c08
[daos@hl-d102 ~]$ daos container create --pool=d885cc1f-9363-4342-a859-ec32b3230c0

All of those failed.  I am not completely sure why they failed, but I'm capable of using debug to determine why (it's probably some sort of config issue), and obviously the first command is wrong.

The key point is:
None of them gave *any* output.  This makes troubleshooting a little tricky.

I'm not sure at what commit this last worked, but it definitely did a few weeks ago.

Thanks,
Patrick


Re: Dead definition?

Li, Wei G
 

Hi Colin,

This part could use some improvements indeed. The “in” and “out” structs are generated by macros from src/container/rpc.h:

CRT_RPC_DECLARE(cont_snap_create, DAOS_ISEQ_CONT_EPOCH_OP,
DAOS_OSEQ_CONT_EPOCH_OP)
CRT_RPC_DECLARE(cont_snap_destroy, DAOS_ISEQ_CONT_EPOCH_OP,
DAOS_OSEQ_CONT_EPOCH_OP)

I think ds_cont_snap_{create,destroy} should instead use cont_snap_{create,destroy}_{in,out} structs that differ from cont_epoch_op_{in,out}. For "create", cei_epoch doesn't really apply (i.e., DAOS-4453); for "destroy", eco_epoch doesn't apply either. We should make this RPC format change early, if possible.

Thanks,
liwei

On May 8, 2020, at 8:12 AM, Colin Ngam <colin.ngam@...> wrote:

Greetings,

struct cont_snap_destroy_in {
struct cont_op_in cei_op;
daos_epoch_t cei_epoch;
};

struct cont_snap_destroy_out {
struct cont_op_out ceo_op;
daos_epoch_t ceo_epoch;
};

Looks dead as the routine ds_cont_snap_destroy() seems to use:
struct cont_epoch_op_in {
struct cont_op_in cei_op;
daos_epoch_t cei_epoch;
};

struct cont_epoch_op_out {
struct cont_op_out ceo_op;
daos_epoch_t ceo_epoch;
};

Thanks.

Colin




Re: missing protoc-gen-c

Zhang, Jiafu
 

Thank you, Nabarro.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Nabarro, Tom
Sent: Thursday, May 7, 2020 11:31 PM
To: daos@daos.groups.io
Subject: Re: [daos] missing protoc-gen-c

 

That’s the one. Thanks.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Colin Ngam
Sent: Thursday, May 7, 2020 4:27 PM
To: daos@daos.groups.io
Subject: Re: [daos] missing protoc-gen-c

 

This one?

doc/dev/development.md

 

From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday, May 7, 2020 at 10:18 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] missing protoc-gen-c

 

So apologies, but that link is dead. Does anyone know where the contents were moved to? I can’t find any references to the relevant content in the docs.

 

The protobuf-c package is installed during the build (I’m using master) and gets pulled in from <daos>/utils/sl/components/__init__.py, but we don’t build the protoc* binaries/full compiler, to avoid overhead for something that is only required occasionally for developer purposes (we supply the --disable-protoc configure option in the build).

 

For the moment you will need to build the compiler and plug-in yourself, but we plan on providing a build flag for developers that need it.

 

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Nabarro, Tom
Sent: Thursday, May 7, 2020 2:12 PM
To: daos@daos.groups.io
Subject: Re: [daos] missing protoc-gen-c

 

This is a development tool so maybe it’s not pulled in by the default build.

 

I installed as follows:

 

git clone https://github.com/protobuf-c/protobuf-c

cd protobuf-c/

./autogen.sh

./configure --prefix=/home/tanabarr/protobuf/install PKG_CONFIG_PATH=/home/tanabarr/protobuf/install/lib/pkgconfig

make && make install

 

It is not a plug-in that ships with the stock protobuf compiler package (which ships with the C++ plugin instead).

 

See <daos>/src/proto/Makefile for some details; it will point you to the following doc if the compiler is missing.

 

https://github.com/daos-stack/daos/blob/master/doc/development.md#protobuf-compiler

 

Regards,

Tom Nabarro – DCG/ESAD

M: +44 (0)7786 260986

Skype: tom.nabarro

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Zhang, Jiafu
Sent: Thursday, May 7, 2020 11:54 AM
To: daos@daos.groups.io
Subject: [daos] missing protoc-gen-c

 

Hi Guys,

 

In https://github.com/daos-stack/daos/tree/master/src/proto/README.md, it uses “protoc -I mgmt --c_out=../mgmt mgmt/srv.proto --plugin=/opt/potobuf/install/bin/protoc-gen-c” to generate C code. In recent DAOS, I cannot find “protoc-gen-c”.

 

Do you know why?

 

Thanks.

---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.



Dead definition?

Colin Ngam
 

Greetings,

 

struct cont_snap_destroy_in {

    struct cont_op_in cei_op;

    daos_epoch_t cei_epoch;

};

 

struct cont_snap_destroy_out {

    struct cont_op_out ceo_op;

    daos_epoch_t ceo_epoch;

};

 

Looks dead as the routine ds_cont_snap_destroy() seems to use:

struct cont_epoch_op_in {

    struct cont_op_in cei_op;

    daos_epoch_t cei_epoch;

};

 

struct cont_epoch_op_out {

    struct cont_op_out ceo_op;

    daos_epoch_t ceo_epoch;

};

 

Thanks.

 

Colin

 

 

 
