Re: Qustion about Questions about data placement

Lombardi, Johann
 

You can monitor the output of pool query, which reports the space usage on PMEM and SSD separately.

That being said, we don’t have a metric reporting the total amount of data migrated by aggregation for each pool. We should add one, since it would help differentiate the bandwidth used by regular I/O from that used by aggregation when analyzing performance issues.
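
One way to act on this is to sample the pool query output periodically and compare the per-tier numbers. The sketch below assumes the used-space figures are copied by hand from two pool query runs; `TierUsage` and `usage_delta` are hypothetical names, not part of DAOS. The deltas mix regular I/O with aggregation, so they give only a rough indication of how much data moved.

```python
from dataclasses import dataclass


@dataclass
class TierUsage:
    """Used bytes per tier, copied by hand from one pool query run (hypothetical)."""
    scm_used: int
    nvme_used: int


def usage_delta(before: TierUsage, after: TierUsage) -> TierUsage:
    """Change in used space per tier between two samples (may be negative)."""
    return TierUsage(
        scm_used=after.scm_used - before.scm_used,
        nvme_used=after.nvme_used - before.nvme_used,
    )


# Example values are made up for illustration only.
before = TierUsage(scm_used=10 * 2**30, nvme_used=100 * 2**30)
after = TierUsage(scm_used=9 * 2**30, nvme_used=102 * 2**30)
delta = usage_delta(before, after)
print(f"SCM change:  {delta.scm_used / 2**30:+.1f} GiB")
print(f"NVMe change: {delta.nvme_used / 2**30:+.1f} GiB")
```

A shrinking SCM figure together with a growing NVMe figure is consistent with aggregation moving data, but new writes and background deletion change the same counters.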

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of 段世博 <duanshibo.d@...>
Reply to: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday 8 March 2023 at 08:04
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Qustion about Questions about data placement

 

Is there a way to know how much data has been migrated from PMEM to SSD?

---------------------------------------------------------------------
Intel Corporation SAS (French simplified joint stock company)
Registered headquarters: "Les Montalets"- 2, rue de Paris,
92196 Meudon Cedex, France
Registration Number:  302 456 199 R.C.S. NANTERRE
Capital: 5 208 026.16 Euros

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Re: Qustion about Questions about data placement

Lombardi, Johann
 

Extents smaller than 4KiB that cannot be aggregated with other contiguous extents remain in PMEM and are not migrated to SSDs.

As for overwrites, extents that are eventually not readable any longer (i.e., completely overwritten or truncated and no snapshots) are deleted in the background. This is true for both extents on SSDs and PMEM.
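
The rule about small extents can be illustrated with a simplified model. This is a sketch only, not the DAOS aggregation code: it assumes extents are non-overlapping `(offset, length)` pairs, merges the ones that touch end to start, and treats a merged extent of at least 4KiB as eligible for migration. `merge_contiguous` and `classify` are hypothetical helper names.

```python
from typing import List, Tuple

THRESHOLD = 4096  # 4 KiB, the default mentioned in this thread

Extent = Tuple[int, int]  # (offset, length) in bytes


def merge_contiguous(extents: List[Extent]) -> List[Extent]:
    """Merge non-overlapping extents that touch end-to-start."""
    merged: List[Extent] = []
    for off, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] == off:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, prev_len + length)
        else:
            merged.append((off, length))
    return merged


def classify(extents: List[Extent], threshold: int = THRESHOLD):
    """Return (stay_in_pmem, migrate_to_ssd) for the merged extents."""
    stay, migrate = [], []
    for ext in merge_contiguous(extents):
        (migrate if ext[1] >= threshold else stay).append(ext)
    return stay, migrate


# Three small writes that line up, plus one isolated 512-byte write.
stay, migrate = classify([(0, 1024), (1024, 1024), (2048, 2048), (8192, 512)])
print("stay in PMEM:", stay)
print("migrate to SSD:", migrate)
```

In this model the three adjacent writes merge into one 4KiB extent that qualifies for migration, while the isolated 512-byte extent has no neighbour to merge with and stays in PMEM.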

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of 段世博 <duanshibo.d@...>
Reply to: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Tuesday 7 March 2023 at 16:24
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Qustion about Questions about data placement

 

Thank you very much for your answer!
I have another question: will all the data written to PMEM eventually be written to SSD? For example, under a Zipfian workload, if data still in PMEM is overwritten by new writes, will the old data still be written to SSD?


Re: Qustion about Questions about data placement

Lombardi, Johann
 

By default, any contiguous extents strictly smaller than 4KiB are written to SCM and the ones bigger than or equal to 4KiB are written to SSDs.

The 4KiB threshold is configurable at the pool level, starting with DAOS v2.0, via the “policy” property.

 

$ dmg pool get-prop test | grep placement

Tier placement policy (policy)                   type=io_size

$ dmg pool set-prop test policy:type=io_size/th1=16384

pool set-prop succeeded

$ dmg pool get-prop test | grep placement

Tier placement policy (policy)                   type=io_size/th1= 16384
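
The io_size rule described above can be written down as a one-line decision. The sketch below is an illustration of that rule only, not DAOS code; `select_tier` is a hypothetical name and `th1` mirrors the policy parameter shown in the commands above.

```python
DEFAULT_TH1 = 4096  # default threshold in bytes (4 KiB), per this thread


def select_tier(extent_size: int, th1: int = DEFAULT_TH1) -> str:
    """Extents strictly smaller than th1 go to SCM, the rest to SSD."""
    return "SCM" if extent_size < th1 else "SSD"


for size in (512, 4095, 4096, 8192):
    print(f"{size:>5} bytes: default={select_tier(size)}, "
          f"th1=16384 -> {select_tier(size, th1=16384)}")
```

With `th1=16384`, as set in the example above, an 8KiB extent that would go to SSD by default is kept in SCM instead.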

 

HTH

 

Cheers,

Johann


Macdonald, Mjmac
 

In this case, the core problem is that there is an empty storage tier in the configuration, and the config parser doesn’t handle this correctly. Created https://daosio.atlassian.net/browse/DAOS-12826 to address the defect. Once the empty storage tier is removed, the command should fail with a more sensible error when the system does not have any PMem modules installed.

 

mjmac

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Lombardi, Johann
Sent: Monday, 6 March, 2023 05:26
To: daos@daos.groups.io; Nabarro, Tom <tom.nabarro@...>
Subject: Re: [daos] DAOS 2.3.103 #chat #docker #installation #2.3.103 #rocky #ubuntu

 

Hi there,

 

scm prepare is only required when using Optane PMEM. Since you use DRAM, you don’t need to run scm prepare.

That being said, it would be great for scm prepare to fail nicely in this case, @Nabarro, Tom?

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "salma.salem@..." <salma.salem@...>
Reply to: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday 6 March 2023 at 11:16
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] DAOS 2.3.103 #chat #docker #installation #2.3.103 #rocky #ubuntu

 

I'm trying to set up DAOS 2.3.103 using the EL8 Dockerfile, but I ended up with this error when trying to run scm prepare:


This is what my configuration file looks like:




I also tried this with the Ubuntu file and stopped at the same point, but my focus now is on getting the Rocky version working.
Has anyone encountered this error before?


Qustion about Questions about data placement

段世博
 

How does DAOS decide whether to write data to SSD or PMEM?


Re: Fail to create pool

Oganezov, Alexander A
 

Tianzy,

 

Good to hear it is working now. In general, however, a reboot should not be needed after applying those sysctl settings, and the dual_iface_server test also worked before... I wonder if there was something else stale running, such as an old daos agent on the client nodes, or some other stale cache somewhere.

 

~~Alex.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of landen.tian@...
Sent: Saturday, March 4, 2023 8:13 AM
To: daos@daos.groups.io
Subject: Re: [daos] Fail to create pool

 

Alex, 

       For some reason, my cluster rebooted. After that, I tried what you suggested, and my issue is fixed!
       I guess the system should be rebooted after applying https://docs.daos.io/v2.0/admin/predeployment_check/#setup-for-multiple-network-links.

Thanks,
Tianzy

     


Re: Fail to create pool

Oganezov, Alexander A
 

Can you set log_mask: debug in the daos server yaml file and provide logs from all engines involved?

On the client side, please provide ofi logs by running export FI_LOG_LEVEL=warn before running self_test -u

 

It would be best if you file an issue on https://daosio.atlassian.net/ and attach all logs there.

 

Thanks,

~~Alex.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of landen.tian@...
Sent: Friday, March 3, 2023 8:44 PM
To: daos@daos.groups.io
Subject: Re: [daos] Fail to create pool

 

1. https://docs.daos.io/v2.0/admin/predeployment_check/#setup-for-multiple-network-links
Lombardi had pointed it out and I have done it.
2. The test:


Any further suggestion?


Re: Fail to create pool

Oganezov, Alexander A
 

Hi Tian,

 

I noticed that you are running 2 engines per node. Did you follow this guide in order to set sysctl settings properly?

https://docs.daos.io/v2.0/admin/predeployment_check/#setup-for-multiple-network-links

 

Also can you run the following test on your server nodes?

./install/lib/daos/TESTING/tests/dual_iface_server -p 'ofi+verbs;ofi_rxm' -i 'ib0,ib1' -d 'mlx5_0,mlx5_1'

 

This is a basic sanity check to ensure you can run a dual-interface setup for servers. If successful, you should see something like this in the output:

 

SRV [rank=1 pid=1678185]        Starting server rank=1

SRV [rank=0 pid=1678189]        Starting server rank=0

SRV [rank=1 pid=1678185]        my_rank=1 uri=ofi+verbs;ofi_rxm://192.168.101.21:32337

SRV [rank=0 pid=1678189]        my_rank=0 uri=ofi+verbs;ofi_rxm://192.168.100.21:31337

SRV [rank=1 pid=1678185]        Other servers uri is 'ofi+verbs;ofi_rxm://192.168.100.21:31337'

SRV [rank=1 pid=1678185]        Ping successful to rank=0 tag=0

SRV [rank=0 pid=1678189]        Other servers uri is 'ofi+verbs;ofi_rxm://192.168.101.21:32337'

SRV [rank=0 pid=1678189]        Ping successful to rank=1 tag=0

 

If this test does not work, then it's likely that something is still missing in the sysctl network settings, preventing libfabric from communicating between servers.
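
If you capture the test output to a file, the check for the "Ping successful" lines can be automated. The sketch below is a hypothetical helper, not part of DAOS; it assumes the output has the same line format as the sample above and that the two server ranks are 0 and 1.

```python
import re

# Matches lines such as:
# SRV [rank=1 pid=1678185]        Ping successful to rank=0 tag=0
PING_RE = re.compile(r"SRV \[rank=(\d+) pid=\d+\]\s+Ping successful to rank=(\d+)")


def successful_pings(output: str) -> set:
    """Return the set of (source_rank, destination_rank) pairs that pinged."""
    return {(int(m.group(1)), int(m.group(2))) for m in PING_RE.finditer(output)}


def dual_iface_ok(output: str, ranks=(0, 1)) -> bool:
    """True if every rank reported a successful ping to every other rank."""
    pings = successful_pings(output)
    return all((a, b) in pings for a in ranks for b in ranks if a != b)


sample = """\
SRV [rank=1 pid=1678185]        Ping successful to rank=0 tag=0
SRV [rank=0 pid=1678189]        Ping successful to rank=1 tag=0
"""
print("dual-interface check passed:", dual_iface_ok(sample))
```

A missing pair points at the direction in which communication between the two interfaces fails.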

 

Thanks,

~~Alex.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of landen.tian@...
Sent: Thursday, March 2, 2023 5:21 PM
To: daos@daos.groups.io
Subject: Re: [daos] Fail to create pool

 

[Edited Message Follows]

This makes further progress, but it still failed.

client01:


On storage01:

03/03-08:17:03.29 storage01 DAOS[13113/0/614] daos INFO src/engine/drpc_progress.c:278 drpc_handler_ult() dRPC handler ULT for module=2 method=207

03/03-08:17:03.29 storage01 DAOS[13113/0/614] mgmt INFO src/mgmt/srv_drpc.c:434 ds_mgmt_drpc_pool_create() Received request to create pool on 6 ranks.

03/03-08:17:04.25 storage01 DAOS[13113/0/615] telem INFO src/gurt/telemetry.c:211 new_shmem() creating new shared memory segment, key=0x4f11c3a9, size=172592

03/03-08:17:04.25 storage01 DAOS[13113/0/615] pool INFO src/pool/srv_metrics.c:152 ds_pool_metrics_start() f59eddb9: created metrics for pool

03/03-08:17:04.32 storage01 DAOS[13113/0/7] server ERR  src/engine/sched.c:1964 sched_watchdog_post() WATCHDOG: Thread 0x43f230 took 59 ms. symbol:/mnt/nfs/daos/install/bin/daos_engine() [0x43f230]

03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c4576a00) [opc=0x102000b (DAOS) rpcid=0x95f87550002353e rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)

03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c4576a00) [opc=0x102000b (DAOS) rpcid=0x95f87550002353e rank:tag=2:0] aborting to group daos_server, rank 2, tgt_uri (null)

03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c45dc900) [opc=0x102000b (DAOS) rpcid=0x95f875500023541 rank:tag=5:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (5:0)

03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c45dc900) [opc=0x102000b (DAOS) rpcid=0x95f875500023541 rank:tag=5:0] aborting to group daos_server, rank 5, tgt_uri (null)

03/03-08:17:05.97 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c4576a00) [opc=0x102000b (DAOS) rpcid=0x95f87550002353e rank:tag=2:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:17:05.97 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c4576a00) [opc=0x102000b (DAOS) rpcid=0x95f87550002353e rank:tag=2:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:17:05.97 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c45dc900) [opc=0x102000b (DAOS) rpcid=0x95f875500023541 rank:tag=5:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:17:05.97 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c45dc900) [opc=0x102000b (DAOS) rpcid=0x95f875500023541 rank:tag=5:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c459c040) [opc=0x102000b (DAOS) rpcid=0x95f87550002353c rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'

03/03-08:18:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c45dc6f0) [opc=0x1020007 (DAOS) rpcid=0x95f87550002357d rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)

03/03-08:18:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c45dc6f0) [opc=0x1020007 (DAOS) rpcid=0x95f87550002357d rank:tag=2:0] aborting to group daos_server, rank 2, tgt_uri (null)

03/03-08:18:03.29 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c45dc6f0) [opc=0x1020007 (DAOS) rpcid=0x95f87550002357d rank:tag=2:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:18:03.29 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c45dc6f0) [opc=0x1020007 (DAOS) rpcid=0x95f87550002357d rank:tag=2:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:18:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c45b2460) [opc=0x1020007 (DAOS) rpcid=0x95f87550002357b rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'

03/03-08:18:03.29 storage01 DAOS[13113/0/614] mgmt ERR  src/mgmt/srv_pool.c:97 ds_mgmt_tgt_pool_create_ranks() f59eddb9: dss_rpc_send MGMT_TGT_CREATE: rc=DER_TIMEDOUT(-1011): 'Time out'

03/03-08:18:03.29 storage01 DAOS[13113/0/616] server INFO src/engine/server_iv.c:876 ds_iv_ns_stop() f59eddb9 ns stopped

03/03-08:18:03.29 storage01 DAOS[13113/0/617] container INFO src/container/srv_target.c:2408 ds_cont_tgt_ec_eph_query_ult() f59eddb9 stop tgt ec aggregation

03/03-08:18:03.29 storage01 DAOS[13113/0/616] pool INFO src/pool/srv_target.c:650 ds_pool_tgt_ec_eph_query_abort() f59eddb9: EC query ULT stopped

03/03-08:18:03.30 storage01 DAOS[13113/0/616] pool INFO src/pool/srv_target.c:668 pool_fetch_hdls_ult_abort() f59eddb9: fetch hdls ULT aborted

03/03-08:18:03.30 storage01 DAOS[13113/0/616] rebuild INFO src/rebuild/srv.c:1605 ds_rebuild_abort() f59eddb9 rebuild aborted

03/03-08:18:03.30 storage01 DAOS[13113/0/616] server INFO src/object/srv_obj_migrate.c:2747 ds_migrate_stop() f59eddb9 migrate stopped

03/03-08:18:03.41 storage01 DAOS[13113/0/616] telem INFO src/gurt/telemetry.c:235 destroy_shmem() Destroying shared memory segment (shmid=98344)

03/03-08:18:03.41 storage01 DAOS[13113/0/616] telem INFO src/gurt/telemetry.c:2457 d_tm_del_ephemeral_dir() Removed ephemeral directory [pool/f59eddb9-d930-448c-bc98-c6316c0455ca]

03/03-08:18:03.41 storage01 DAOS[13113/0/616] pool INFO src/pool/srv_metrics.c:176 ds_pool_metrics_stop() f59eddb9: destroyed ds_pool metrics

03/03-08:18:03.41 storage01 DAOS[13113/0/616] pool INFO src/pool/srv_target.c:808 ds_pool_stop() f59eddb9: pool service is aborted

03/03-08:18:08.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500023587 rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)

03/03-08:18:08.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500023587 rank:tag=2:0] aborting to group daos_server, rank 2, tgt_uri (null)

03/03-08:18:08.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f87550002358a rank:tag=5:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (5:0)

03/03-08:18:08.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f87550002358a rank:tag=5:0] aborting to group daos_server, rank 5, tgt_uri (null)

03/03-08:18:08.97 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500023587 rank:tag=2:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:18:08.97 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500023587 rank:tag=2:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:18:08.97 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f87550002358a rank:tag=5:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:18:08.98 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f87550002358a rank:tag=5:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:18:08.98 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c459c040) [opc=0x102000b (DAOS) rpcid=0x95f875500023585 rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'

03/03-08:19:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f8755000235c4 rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)

03/03-08:19:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f8755000235c4 rank:tag=2:0] aborting to group daos_server, rank 2, tgt_uri (null)

03/03-08:19:03.29 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f8755000235c4 rank:tag=2:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:19:03.29 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f8755000235c4 rank:tag=2:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:19:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c45b2460) [opc=0x1020008 (DAOS) rpcid=0x95f8755000235c2 rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'

03/03-08:19:03.30 storage01 DAOS[13113/0/614] mgmt ERR  src/mgmt/srv_pool.c:122 ds_mgmt_tgt_pool_create_ranks() f59eddb9: failed to clean up failed pool: DER_TIMEDOUT(-1011): 'Time out'

03/03-08:19:03.30 storage01 DAOS[13113/0/614] mgmt ERR  src/mgmt/srv_pool.c:193 ds_mgmt_create_pool() creating pool f59eddb9 on ranks failed: rc DER_TIMEDOUT(-1011): 'Time out'

03/03-08:19:03.30 storage01 DAOS[13113/0/614] mgmt ERR  src/mgmt/srv_drpc.c:485 ds_mgmt_drpc_pool_create() failed to create pool: DER_TIMEDOUT(-1011): 'Time out'


It seems there are some network errors.

If I create a pool using ranks on the same server, it works:

[root@client01 ~]# dmg pool create -r 0,3 --scm-size=1T --nvme-size=15T --nsvc=1 Pool1

Creating DAOS pool with manual per-engine storage allocation: 1.0 TB SCM, 15 TB NVMe (6.67% ratio)

Pool created with 6.25%,93.75% storage tier ratio

-------------------------------------------------

  UUID                 : 31cf7760-fc1a-45b8-ab9b-0b539726bfe3

  Service Ranks        : 0

  Storage Ranks        : [0,3]

  Total Size           : 32 TB

  Storage tier 0 (SCM) : 2.0 TB (1.0 TB / rank)

  Storage tier 1 (NVMe): 30 TB (15 TB / rank)


The problem is probably in CART communication between servers.
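
In the log above, every RPC timeout targets rank 2 or rank 5, while the pool on ranks 0 and 3 is created without problems. A quick way to see this pattern in a long engine log is to count timeouts per target rank. The sketch below is a hypothetical helper that relies only on the line format shown above.

```python
import re
from collections import Counter

# Matches the tail of the crt_context_timeout_check() lines shown above.
TIMEOUT_RE = re.compile(r"timed out \((\d+) seconds\), target \((\d+):(\d+)\)")


def timeouts_by_rank(log_text: str) -> Counter:
    """Count RPC timeout lines per target rank."""
    counts: Counter = Counter()
    for match in TIMEOUT_RE.finditer(log_text):
        counts[int(match.group(2))] += 1
    return counts


sample = """\
03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c4576a00) [opc=0x102000b (DAOS) rpcid=0x95f87550002353e rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)
03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c45dc900) [opc=0x102000b (DAOS) rpcid=0x95f875500023541 rank:tag=5:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (5:0)
"""
for rank, count in sorted(timeouts_by_rank(sample).items()):
    print(f"rank {rank}: {count} timeout(s)")
```

Ranks that never appear as timeout targets are reachable from the logging engine, which narrows the search to the links toward the ranks that do appear.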


Re: Fail to create pool

landen.tian@...
 




CART didn't work


Re: Fail to create pool

landen.tian@...
 
Edited

It makes progress either, but it still failed.

client01:


On storage01:

03/03-08:17:03.29 storage01 DAOS[13113/0/614] daos INFO src/engine/drpc_progress.c:278 drpc_handler_ult() dRPC handler ULT for module=2 method=207
03/03-08:17:03.29 storage01 DAOS[13113/0/614] mgmt INFO src/mgmt/srv_drpc.c:434 ds_mgmt_drpc_pool_create() Received request to create pool on 6 ranks.
03/03-08:17:04.25 storage01 DAOS[13113/0/615] telem INFO src/gurt/telemetry.c:211 new_shmem() creating new shared memory segment, key=0x4f11c3a9, size=172592
03/03-08:17:04.25 storage01 DAOS[13113/0/615] pool INFO src/pool/srv_metrics.c:152 ds_pool_metrics_start() f59eddb9: created metrics for pool
03/03-08:17:04.32 storage01 DAOS[13113/0/7] server ERR  src/engine/sched.c:1964 sched_watchdog_post() WATCHDOG: Thread 0x43f230 took 59 ms. symbol:/mnt/nfs/daos/install/bin/daos_engine() [0x43f230]
03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c4576a00) [opc=0x102000b (DAOS) rpcid=0x95f87550002353e rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)
03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c4576a00) [opc=0x102000b (DAOS) rpcid=0x95f87550002353e rank:tag=2:0] aborting to group daos_server, rank 2, tgt_uri (null)
03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c45dc900) [opc=0x102000b (DAOS) rpcid=0x95f875500023541 rank:tag=5:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (5:0)
03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c45dc900) [opc=0x102000b (DAOS) rpcid=0x95f875500023541 rank:tag=5:0] aborting to group daos_server, rank 5, tgt_uri (null)
03/03-08:17:05.97 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c4576a00) [opc=0x102000b (DAOS) rpcid=0x95f87550002353e rank:tag=2:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:17:05.97 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c4576a00) [opc=0x102000b (DAOS) rpcid=0x95f87550002353e rank:tag=2:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:17:05.97 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c45dc900) [opc=0x102000b (DAOS) rpcid=0x95f875500023541 rank:tag=5:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:17:05.97 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c45dc900) [opc=0x102000b (DAOS) rpcid=0x95f875500023541 rank:tag=5:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:17:05.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c459c040) [opc=0x102000b (DAOS) rpcid=0x95f87550002353c rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'
03/03-08:18:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c45dc6f0) [opc=0x1020007 (DAOS) rpcid=0x95f87550002357d rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)
03/03-08:18:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c45dc6f0) [opc=0x1020007 (DAOS) rpcid=0x95f87550002357d rank:tag=2:0] aborting to group daos_server, rank 2, tgt_uri (null)
03/03-08:18:03.29 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c45dc6f0) [opc=0x1020007 (DAOS) rpcid=0x95f87550002357d rank:tag=2:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:18:03.29 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c45dc6f0) [opc=0x1020007 (DAOS) rpcid=0x95f87550002357d rank:tag=2:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:18:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c45b2460) [opc=0x1020007 (DAOS) rpcid=0x95f87550002357b rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'
03/03-08:18:03.29 storage01 DAOS[13113/0/614] mgmt ERR  src/mgmt/srv_pool.c:97 ds_mgmt_tgt_pool_create_ranks() f59eddb9: dss_rpc_send MGMT_TGT_CREATE: rc=DER_TIMEDOUT(-1011): 'Time out'
03/03-08:18:03.29 storage01 DAOS[13113/0/616] server INFO src/engine/server_iv.c:876 ds_iv_ns_stop() f59eddb9 ns stopped
03/03-08:18:03.29 storage01 DAOS[13113/0/617] container INFO src/container/srv_target.c:2408 ds_cont_tgt_ec_eph_query_ult() f59eddb9 stop tgt ec aggregation
03/03-08:18:03.29 storage01 DAOS[13113/0/616] pool INFO src/pool/srv_target.c:650 ds_pool_tgt_ec_eph_query_abort() f59eddb9: EC query ULT stopped
03/03-08:18:03.30 storage01 DAOS[13113/0/616] pool INFO src/pool/srv_target.c:668 pool_fetch_hdls_ult_abort() f59eddb9: fetch hdls ULT aborted
03/03-08:18:03.30 storage01 DAOS[13113/0/616] rebuild INFO src/rebuild/srv.c:1605 ds_rebuild_abort() f59eddb9 rebuild aborted
03/03-08:18:03.30 storage01 DAOS[13113/0/616] server INFO src/object/srv_obj_migrate.c:2747 ds_migrate_stop() f59eddb9 migrate stopped
03/03-08:18:03.41 storage01 DAOS[13113/0/616] telem INFO src/gurt/telemetry.c:235 destroy_shmem() Destroying shared memory segment (shmid=98344)
03/03-08:18:03.41 storage01 DAOS[13113/0/616] telem INFO src/gurt/telemetry.c:2457 d_tm_del_ephemeral_dir() Removed ephemeral directory [pool/f59eddb9-d930-448c-bc98-c6316c0455ca]
03/03-08:18:03.41 storage01 DAOS[13113/0/616] pool INFO src/pool/srv_metrics.c:176 ds_pool_metrics_stop() f59eddb9: destroyed ds_pool metrics
03/03-08:18:03.41 storage01 DAOS[13113/0/616] pool INFO src/pool/srv_target.c:808 ds_pool_stop() f59eddb9: pool service is aborted
03/03-08:18:08.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500023587 rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)
03/03-08:18:08.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500023587 rank:tag=2:0] aborting to group daos_server, rank 2, tgt_uri (null)
03/03-08:18:08.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f87550002358a rank:tag=5:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (5:0)
03/03-08:18:08.97 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f87550002358a rank:tag=5:0] aborting to group daos_server, rank 5, tgt_uri (null)
03/03-08:18:08.97 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500023587 rank:tag=2:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:18:08.97 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500023587 rank:tag=2:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:18:08.97 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f87550002358a rank:tag=5:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:18:08.98 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f87550002358a rank:tag=5:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:18:08.98 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c459c040) [opc=0x102000b (DAOS) rpcid=0x95f875500023585 rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'
03/03-08:19:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f8755000235c4 rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)
03/03-08:19:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f8755000235c4 rank:tag=2:0] aborting to group daos_server, rank 2, tgt_uri (null)
03/03-08:19:03.29 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f8755000235c4 rank:tag=2:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:19:03.29 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f8755000235c4 rank:tag=2:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:19:03.29 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c45b2460) [opc=0x1020008 (DAOS) rpcid=0x95f8755000235c2 rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'
03/03-08:19:03.30 storage01 DAOS[13113/0/614] mgmt ERR  src/mgmt/srv_pool.c:122 ds_mgmt_tgt_pool_create_ranks() f59eddb9: failed to clean up failed pool: DER_TIMEDOUT(-1011): 'Time out'
03/03-08:19:03.30 storage01 DAOS[13113/0/614] mgmt ERR  src/mgmt/srv_pool.c:193 ds_mgmt_create_pool() creating pool f59eddb9 on ranks failed: rc DER_TIMEDOUT(-1011): 'Time out'
03/03-08:19:03.30 storage01 DAOS[13113/0/614] mgmt ERR  src/mgmt/srv_drpc.c:485 ds_mgmt_drpc_pool_create() failed to create pool: DER_TIMEDOUT(-1011): 'Time out'

It seems it has some errors on network.

If I create a pool used racks in a same server, I works:
[root@client01 ~]# dmg pool create -r 0,3 --scm-size=1T --nvme-size=15T --nsvc=1 Pool1
Creating DAOS pool with manual per-engine storage allocation: 1.0 TB SCM, 15 TB NVMe (6.67% ratio)
Pool created with 6.25%,93.75% storage tier ratio
-------------------------------------------------
  UUID                 : 31cf7760-fc1a-45b8-ab9b-0b539726bfe3
  Service Ranks        : 0
  Storage Ranks        : [0,3]
  Total Size           : 32 TB
  Storage tier 0 (SCM) : 2.0 TB (1.0 TB / rank)
  Storage tier 1 (NVMe): 30 TB (15 TB / rank)
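
The two percentages in the output above appear to be computed differently: the first looks like SCM relative to NVMe, the second like each tier's share of the combined size. A minimal Python sketch of that arithmetic follows. The interpretation is inferred from the numbers shown, not taken from the dmg source, and the function names are illustrative only.

```python
def scm_to_nvme_ratio(scm_tb: float, nvme_tb: float) -> float:
    """SCM size as a percentage of NVMe size (assumed meaning of '% ratio')."""
    return 100.0 * scm_tb / nvme_tb


def tier_shares(scm_tb: float, nvme_tb: float) -> tuple[float, float]:
    """Each tier's percentage share of the combined per-rank size."""
    total = scm_tb + nvme_tb
    return 100.0 * scm_tb / total, 100.0 * nvme_tb / total


# Values from the command above: 1T SCM and 15T NVMe per rank, 2 ranks.
scm_tb, nvme_tb, ranks = 1.0, 15.0, 2

print(f"{scm_to_nvme_ratio(scm_tb, nvme_tb):.2f}% ratio")
scm_share, nvme_share = tier_shares(scm_tb, nvme_tb)
print(f"{scm_share:.2f}%,{nvme_share:.2f}% storage tier ratio")
print(f"Total Size: {ranks * (scm_tb + nvme_tb):.0f} TB")
```

Under the same assumption, 1T SCM with 20T NVMe gives 5.00%, which matches the ratio dmg printed for the failing four-rank command.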

The problem appears to be in CART communication between servers.


Re: Fail to create pool

Lombardi, Johann
 

That’s progress! IIUC, you have 5x 3.8TB = 19TB per engine, so 20TB is probably too much. Could you please try with a smaller nvme-size? Since you use 2.2, you could also run “dmg pool create --size 100% Pool1” to allocate all the SCM and NVMe space.
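
The capacity check behind this advice can be sketched in a few lines of Python. This is only a rough illustration: it compares the requested per-engine NVMe size with the raw device capacity and ignores any metadata or reserved space, so the real usable limit is likely somewhat lower. The function names are hypothetical.

```python
def raw_nvme_capacity_tb(devices_per_engine: int, device_size_tb: float) -> float:
    """Raw NVMe capacity behind one engine, before any overhead."""
    return devices_per_engine * device_size_tb


def request_fits(requested_tb: float, devices_per_engine: int,
                 device_size_tb: float) -> bool:
    """True if the per-engine request does not exceed raw capacity."""
    return requested_tb <= raw_nvme_capacity_tb(devices_per_engine, device_size_tb)


# Figures from this thread: 5 devices of 3.8 TB per engine.
capacity = raw_nvme_capacity_tb(5, 3.8)
print(f"raw capacity per engine: {capacity:.1f} TB")
print("--nvme-size=20T fits:", request_fits(20.0, 5, 3.8))
print("--nvme-size=15T fits:", request_fits(15.0, 5, 3.8))
```

A 20T request exceeds the roughly 19 TB available per engine, which is consistent with the DER_NOSPACE errors in the log, while a 15T request stays below it.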

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "landen.tian@..." <landen.tian@...>
Reply to: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday 2 March 2023 at 13:20
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Fail to create pool

 




Re: Fail to create pool

landen.tian@...
 
Edited

Lombardi,

Glad to see your response!

I followed https://docs.daos.io/v2.2/admin/predeployment_check/#multi-railnic-setup, and the error changed from DER_HG(-1020) to DER_NOSPACE(-1007).

Command:
[root@client01 ~]# dmg pool create -r 0,3,4,5 --scm-size=1T --nvme-size=20T --nsvc=1 Pool1
Creating DAOS pool with manual per-engine storage allocation: 1.0 TB SCM, 20 TB NVMe (5.00% ratio)
ERROR: dmg: client: code = 509 description = "the *control.PoolCreateReq request timed out after 10m0s"
ERROR: dmg: client: code = 509 resolution = "retry the request or check server logs for more information"

Error log on storage01:
03/02-19:51:25.68 storage01 DAOS[13113/0/561] mgmt INFO src/mgmt/srv_drpc.c:434 ds_mgmt_drpc_pool_create() Received request to create pool on 4 ranks.
03/02-19:51:26.48 storage01 DAOS[13113/3/562] bio  ERR  src/bio/bio_context.c:396 bio_blob_create() Create blob failed for xs:0x7f578844ea20 pool:d62847b7 rc:-1007
03/02-19:51:26.48 storage01 DAOS[13113/3/562] vos  ERR  src/vos/vos_pool.c:402 vos_pool_create() Error creating blob for xs:0x7f578844ea20 pool:d62847b7 DER_NOSPACE(-1007): 'No space on storage target'
03/02-19:51:26.53 storage01 DAOS[13113/5/563] bio  ERR  src/bio/bio_context.c:396 bio_blob_create() Create blob failed for xs:0x7f575c64ea20 pool:d62847b7 rc:-1007
03/02-19:51:26.53 storage01 DAOS[13113/5/563] vos  ERR  src/vos/vos_pool.c:402 vos_pool_create() Error creating blob for xs:0x7f575c64ea20 pool:d62847b7 DER_NOSPACE(-1007): 'No space on storage target'
03/02-19:51:26.58 storage01 DAOS[13113/4/564] bio  ERR  src/bio/bio_context.c:396 bio_blob_create() Create blob failed for xs:0x7f577464ea20 pool:d62847b7 rc:-1007
03/02-19:51:26.58 storage01 DAOS[13113/4/564] vos  ERR  src/vos/vos_pool.c:402 vos_pool_create() Error creating blob for xs:0x7f577464ea20 pool:d62847b7 DER_NOSPACE(-1007): 'No space on storage target'
03/02-19:51:26.59 storage01 DAOS[13113/3/562] mgmt ERR  src/mgmt/srv_target.c:480 tgt_vos_create_one() d62847b7: failed to init vos pool /mnt/daos0/NEWBORNS/d62847b7-10d6-4c42-a996-ea4f4dfce486/vos-0: -1007
03/02-19:51:26.59 storage01 DAOS[13113/5/563] mgmt ERR  src/mgmt/srv_target.c:480 tgt_vos_create_one() d62847b7: failed to init vos pool /mnt/daos0/NEWBORNS/d62847b7-10d6-4c42-a996-ea4f4dfce486/vos-2: -1007
03/02-19:51:26.59 storage01 DAOS[13113/4/564] mgmt ERR  src/mgmt/srv_target.c:480 tgt_vos_create_one() d62847b7: failed to init vos pool /mnt/daos0/NEWBORNS/d62847b7-10d6-4c42-a996-ea4f4dfce486/vos-1: -1007
03/02-19:51:26.60 storage01 DAOS[13113/4/565] bio  WARN src/bio/bio_context.c:279 bio_blob_delete() Blob for xs:0x7f577464ea20, pool:d62847b7 doesn't exist
03/02-19:51:26.60 storage01 DAOS[13113/5/566] bio  WARN src/bio/bio_context.c:279 bio_blob_delete() Blob for xs:0x7f575c64ea20, pool:d62847b7 doesn't exist
03/02-19:51:26.60 storage01 DAOS[13113/3/567] bio  WARN src/bio/bio_context.c:279 bio_blob_delete() Blob for xs:0x7f578844ea20, pool:d62847b7 doesn't exist
03/02-19:51:28.53 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f8755000175dd rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)
03/02-19:51:28.53 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f8755000175dd rank:tag=2:0] aborting to group daos_server, rank 2, tgt_uri (null)
03/02-19:51:28.53 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c459c040) [opc=0x102000b (DAOS) rpcid=0x95f8755000175e0 rank:tag=5:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (5:0)
03/02-19:51:28.53 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c459c040) [opc=0x102000b (DAOS) rpcid=0x95f8755000175e0 rank:tag=5:0] aborting to group daos_server, rank 5, tgt_uri (null)
03/02-19:51:28.53 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f8755000175dd rank:tag=2:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:51:28.53 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f8755000175dd rank:tag=2:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:51:28.56 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c459c040) [opc=0x102000b (DAOS) rpcid=0x95f8755000175e0 rank:tag=5:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:51:28.56 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c459c040) [opc=0x102000b (DAOS) rpcid=0x95f8755000175e0 rank:tag=5:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:51:28.56 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c459c500) [opc=0x102000b (DAOS) rpcid=0x95f8755000175db rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'
03/02-19:52:25.68 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c4576a00) [opc=0x1020007 (DAOS) rpcid=0x95f87550001761d rank:tag=5:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (5:0)
03/02-19:52:25.68 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c4576a00) [opc=0x1020007 (DAOS) rpcid=0x95f87550001761d rank:tag=5:0] aborting to group daos_server, rank 5, tgt_uri (null)
03/02-19:52:25.68 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c4576a00) [opc=0x1020007 (DAOS) rpcid=0x95f87550001761d rank:tag=5:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:52:25.68 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c4576a00) [opc=0x1020007 (DAOS) rpcid=0x95f87550001761d rank:tag=5:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:52:25.68 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c45b2460) [opc=0x1020007 (DAOS) rpcid=0x95f87550001761a rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'
03/02-19:52:25.68 storage01 DAOS[13113/0/561] mgmt ERR  src/mgmt/srv_pool.c:97 ds_mgmt_tgt_pool_create_ranks() d62847b7: dss_rpc_send MGMT_TGT_CREATE: rc=DER_TIMEDOUT(-1011): 'Time out'
03/02-19:52:31.58 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500017626 rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)
03/02-19:52:31.58 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500017626 rank:tag=2:0] aborting to group daos_server, rank 2, tgt_uri (null)
03/02-19:52:31.58 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f875500017629 rank:tag=5:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (5:0)
03/02-19:52:31.58 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f875500017629 rank:tag=5:0] aborting to group daos_server, rank 5, tgt_uri (null)
03/02-19:52:31.58 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500017626 rank:tag=2:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:52:31.58 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f875500017626 rank:tag=2:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:52:31.58 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f875500017629 rank:tag=5:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:52:31.58 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c46113d0) [opc=0x102000b (DAOS) rpcid=0x95f875500017629 rank:tag=5:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:52:31.58 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c459c040) [opc=0x102000b (DAOS) rpcid=0x95f875500017624 rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'
03/02-19:53:25.68 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f875500017663 rank:tag=5:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (5:0)
03/02-19:53:25.68 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:888 crt_req_timeout_hdlr(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f875500017663 rank:tag=5:0] aborting to group daos_server, rank 5, tgt_uri (null)
03/02-19:53:25.68 storage01 DAOS[13113/0/3] hg   ERR  src/cart/crt_hg.c:1246 crt_hg_req_send_cb(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f875500017663 rank:tag=5:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:53:25.68 storage01 DAOS[13113/0/3] corpc ERR  src/cart/crt_corpc.c:648 crt_corpc_reply_hdlr(0x7f57c4576a00) [opc=0x1020008 (DAOS) rpcid=0x95f875500017663 rank:tag=5:0] error, rc: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:53:25.68 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c45b2460) [opc=0x1020008 (DAOS) rpcid=0x95f875500017660 rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'
03/02-19:53:25.69 storage01 DAOS[13113/0/561] mgmt ERR  src/mgmt/srv_pool.c:122 ds_mgmt_tgt_pool_create_ranks() d62847b7: failed to clean up failed pool: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:53:25.69 storage01 DAOS[13113/0/561] mgmt ERR  src/mgmt/srv_pool.c:193 ds_mgmt_create_pool() creating pool d62847b7 on ranks failed: rc DER_TIMEDOUT(-1011): 'Time out'
03/02-19:53:25.69 storage01 DAOS[13113/0/561] mgmt ERR  src/mgmt/srv_drpc.c:485 ds_mgmt_drpc_pool_create() failed to create pool: DER_TIMEDOUT(-1011): 'Time out'
03/02-19:53:25.69 storage01 DAOS[13113/2/1] daos INFO src/engine/drpc_progress.c:392 process_session_activity() Session 1619 connection has been terminated
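
To see quickly which ranks the timeouts point at, the log can be summarized with a small script. The sketch below is an assumption-laden helper, not a DAOS tool: the regular expression is written against the line format shown in this log only, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the crt_context_timeout_check lines above and captures the rank
# from "rank:tag=<rank>:<tag>". Written against this log's format only.
TIMEOUT_RE = re.compile(
    r"crt_context_timeout_check\(.*?rank:tag=(\d+):\d+\].*?timed out"
)


def timed_out_ranks(log_lines):
    """Count timeout-check entries per target rank."""
    counts = Counter()
    for line in log_lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


sample = [
    "03/02-19:51:28.53 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c457b9c0) [opc=0x102000b (DAOS) rpcid=0x95f8755000175dd rank:tag=2:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (2:0)",
    "03/02-19:51:28.53 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c459c040) [opc=0x102000b (DAOS) rpcid=0x95f8755000175e0 rank:tag=5:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (5:0)",
    "03/02-19:52:25.68 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:939 crt_context_timeout_check(0x7f57c4576a00) [opc=0x1020007 (DAOS) rpcid=0x95f87550001761d rank:tag=5:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (5:0)",
    # Completion lines are not timeout checks and are ignored.
    "03/02-19:51:28.56 storage01 DAOS[13113/0/3] rpc  ERR  src/cart/crt_context.c:381 crt_rpc_complete(0x7f57c459c500) [opc=0x102000b (DAOS) rpcid=0x95f8755000175db rank:tag=0:0] failed, DER_TIMEDOUT(-1011): 'Time out'",
]

for rank, count in sorted(timed_out_ranks(sample).items()):
    print(f"rank {rank}: {count} timeout(s)")
```

In the log above, every timeout check names rank 2 or rank 5, so those are the engines whose connectivity from storage01 is worth checking first.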

Furthermore, each node has one 1G Ethernet NIC and two IB 200 NICs. The Ethernet interface is on subnet 10.166.15.*/24, while ib0 and ib1 are both on a subnet like 192.168.15.*/24. I applied these settings:
sysctl -w net.ipv4.conf.all.accept_local=1
sysctl -w net.ipv4.conf.all.arp_ignore=2
sysctl -w net.ipv4.conf.ib0.rp_filter=2
sysctl -w net.ipv4.conf.ib1.rp_filter=2
sysctl -w net.ipv4.conf.ens21f0.rp_filter=2
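
Settings applied with sysctl -w are lost on reboot. One way to keep them is a sysctl.d fragment. The Python sketch below only renders such a fragment from the values used above. The target file name mentioned afterwards is hypothetical, and the interface names are the ones from this message.

```python
def render_sysctl_conf(interfaces):
    """Render the multi-rail settings from this thread as sysctl.d lines."""
    lines = [
        "net.ipv4.conf.all.accept_local = 1",
        "net.ipv4.conf.all.arp_ignore = 2",
    ]
    # One rp_filter entry per interface, as in the commands above.
    lines += [f"net.ipv4.conf.{iface}.rp_filter = 2" for iface in interfaces]
    return "\n".join(lines) + "\n"


conf_text = render_sysctl_conf(["ib0", "ib1", "ens21f0"])
print(conf_text, end="")
```

The rendered text could be saved as, for example, /etc/sysctl.d/95-daos-multirail.conf (a made-up name) and loaded with sysctl --system.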
 


Re: Questions about daos evolution and design

Niu, Yawei
 

I see, so it looks to me like the same question as question 4 about "append write"? I hope my answer is helpful.

Thanks
-Niu


Re: Questions about daos evolution and design

hongxunlinpub@...
 

Q3: I mean ROW (redirect-on-write).


Thanks 
-Lin