
Re: Unable to run DAOS commands - Agent reports "no dRPC client set"

Macdonald, Mjmac
 

Hi Patrick.

A commit (18d31d) just landed on master this morning that will probably fix that issue. As part of the work you referenced, a new whitelist parameter is used to ensure that each I/O server only has access to the devices specified in the configuration. Unfortunately, this doesn't work with emulated devices, so the fix is to avoid using the whitelist except with real devices.

Sorry about that, hope this helps.

Best,
mjmac


Re: How to configure IB with multiple mlx4 devices per server

Kevan Rehm
 

An update on this: a colleague of mine showed me that the two problems are different. In the Lustre world, the ping problem (#2 below) is called the “arp flux” problem; he gave me a set of Linux commands to run on each cluster node that fix it.
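
For reference, the usual “arp flux” mitigation on multi-homed Linux hosts is a pair of sysctls like the ones below (a sketch of the standard fix, not necessarily the exact commands my colleague supplied):

     sysctl -w net.ipv4.conf.all.arp_ignore=1    # answer ARP only when the target IP is on the receiving interface
     sysctl -w net.ipv4.conf.all.arp_announce=2  # use the best-matching local address when sending ARP requests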

 

The first problem with fi_pingpong remains.  If I don’t hear anything here in a day or so, I will take it over to the libfabric email reflector.

 

Thanks, Kevan

 

From: "Rehm, Kevan Flint" <kevan.rehm@...>
Date: Sunday, February 16, 2020 at 11:28 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: How to configure IB with multiple mlx4 devices per server

 

Greetings,

 

Could someone please describe how to configure Linux and/or libfabric for InfiniBand such that a verbs;ofi_rxm configuration is able to communicate with all active IB devices in the cluster when each server has two mlx4 devices attached to the same subnet? Specifically, how do I set up routing so that every device port can reach every other device port in the cluster?

 

The first example here is with fi_pingpong, which has different symptoms than the second ‘ping’ example near the bottom…

 

Example #1: host hl-c500 has IB devices mlx4_0 and mlx4_1 configured with IPoIB addresses 10.0.0.36 and 10.0.0.136 respectively. Host hl-d100 has devices mlx4_0 and mlx4_1 with addresses 10.0.0.16 and 10.0.0.116 respectively. opensm is running. I previously used ibping to verify that I can ibping each IB device port from every other device port in the cluster.

 

If I run this on hl-c500:

                fi_pingpong -vvv -d mlx4_0 -p 'verbs;ofi_rxm' -e rdm

and run this on hl-d100:

                fi_pingpong -vvv -d mlx4_0 -p 'verbs;ofi_rxm' -e rdm hl-c500

everything works fine.

 

If I instead run the following on hl-d100:

                fi_pingpong -vvv -d mlx4_1 -p 'verbs;ofi_rxm' -e rdm hl-c500

it fails; I am unable to use the mlx4_1 device to talk to any other server on the subnet.

 

The routing table seems to imply that both mlx devices should work:

$ route

Kernel IP routing table

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface

default         cfgw-3222-vrrp. 0.0.0.0         UG    100    0        0 enp6s0

10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 mlx4_0

10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 mlx4_1

link-local      0.0.0.0         255.255.0.0     U     1004   0        0 mlx4_0

link-local      0.0.0.0         255.255.0.0     U     1005   0        0 mlx4_1

 

I can’t be the first person to want to use multiple IB devices on each node in the cluster. What are the configuration tricks to make it work?

 

Later… I debugged the above failure. It seems like a bug to me; am I missing something?

 

The fi_pingpong on the hl-d100 client machine looks for a connection-based verbs provider to use underneath ofi_rxm. It calls vrb_getinfo(), which calls vrb_get_match_infos(). Routine vrb_get_match_infos() starts by calling vrb_get_matching_info() with a ‘hints’ structure that includes the ‘mlx4_1’ domain hint from the “-d mlx4_1” parameter, and it fills in the ‘info’ parameter with a list containing only the ‘mlx4_1’ provider, just as one would expect.

 

Routine vrb_get_match_infos() then calls vrb_handle_sock_addr(), which calls vrb_get_rai_id(), which calls rdma_resolve_addr() to resolve the 10.0.0.36 address. The second parameter to rdma_resolve_addr() is a src_addr, but the value passed in this case, (*rai)->ai_src_addr, is NULL. So rdma_resolve_addr() has to pick an IB device to use; it happens to pick mlx4_0 and returns it in id->verbs.

 

Then vrb_handle_sock_addr() goes on to call ibv_get_device_name(), which uses id->verbs to set dev_name to “mlx4_0”. Next it calls vrb_del_info_not_belong_to(), which removes all non-mlx4_0 devices from the ‘info’ list. Since the only entry in the list was mlx4_1, the list is now empty, vrb_del_info_not_belong_to() returns -61, and you get these messages:

 

libfabric:27193:verbs:fabric:vrb_get_match_infos():1636<info> handling of the socket address fails - -61

libfabric:27193:verbs:core:vrb_get_match_infos():1656<info> Handling of the addresses fails, the getting infos is unsuccessful

 

and ultimately vrb_getinfo() returns -FI_ENODATA.

 

It seems to me that if vrb_get_rai_id() had used the mlx4_1 ‘hints’ structure to supply the mlx4_1 src_addr parameter in the rdma_resolve_addr() call, the kernel would probably have selected the mlx4_1 device instead, and the mlx4_1 device would have worked.
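
To illustrate the fix I have in mind, here is a sketch (not the actual libfabric code; the addresses are the ones from this example and error handling is omitted):

    #include <arpa/inet.h>
    #include <rdma/rdma_cma.h>

    static int resolve_via_mlx4_1(struct rdma_cm_id *id)
    {
            struct sockaddr_in src = { 0 }, dst = { 0 };

            src.sin_family = AF_INET;
            inet_pton(AF_INET, "10.0.0.116", &src.sin_addr);  /* hl-d100 mlx4_1 */
            dst.sin_family = AF_INET;
            inet_pton(AF_INET, "10.0.0.36", &dst.sin_addr);   /* hl-c500 */

            /* With a non-NULL src_addr, the kernel binds the cm_id to the
             * device that owns 10.0.0.116 (mlx4_1) instead of picking one. */
            return rdma_resolve_addr(id, (struct sockaddr *)&src,
                                     (struct sockaddr *)&dst, 2000 /* ms */);
    }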

 

This did seem like a bug to me, but then I found that ping has a similar issue, so now I am not so sure.

 

Example #2: these two commands issued on hl-d100 both successfully ping hl-c500:

     $ ping -I mlx4_0 -r 10.0.0.36

     $ ping -I mlx4_0 -r 10.0.0.136

But both of these commands fail:

     $ ping -I mlx4_1 -r 10.0.0.36

     $ ping -I mlx4_1 -r 10.0.0.136

So again, it looks like a routing problem, even though I specifically told ping which mlx4 device to use. Or maybe this is a separate problem?
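
(For what it’s worth, one general Linux technique for two interfaces on the same subnet, beyond the arp-flux sysctls, is source-based policy routing, e.g.:

     ip route add 10.0.0.0/24 dev mlx4_1 src 10.0.0.116 table 100
     ip rule add from 10.0.0.116 table 100

so that traffic sourced from the mlx4_1 address always leaves via mlx4_1. I have not verified this on these nodes.)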

 

Help please!

 

Regards, Kevan


Re: Unable to run DAOS commands - Agent reports "no dRPC client set"

Patrick Farrell <paf@...>
 

I finally gave up and bisected this.

This problem started with “DAOS-4034 control: enable vfio permissions for non-root (#1785)”, commit 14c7c2e06512659f4122a01c57e82ad58ee642b0.

Looking at it, the patch does a variety of things, and I'm not having any luck tracking down what's broken by this change.  I made sure to enable the vfio driver as mentioned in the patch notes, but I'm not seeing any change.

One note: I am running as root, because that has been the easiest setup so far.
Is running as root perhaps broken with this patch?

- Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Sent: Wednesday, February 12, 2020 11:18 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"
 
Good morning,

I've just moved up to the latest tip-of-tree DAOS (I'm not sure exactly which commit I was running before; it was a week or two out of date), and I can't get any tests to run.

I've pared back to a trivial config, and I appear to be able to start the server, etc., but the agent claims the data plane is not running, and I'm not having a lot of luck troubleshooting it.

Here's my server startup command & output:
/root/daos/install/bin/daos_server start -o /root/daos/utils/config/examples/daos_server_local.yml
/root/daos/install/bin/daos_server logging to file /tmp/daos_control.log
ERROR: /root/daos/install/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
DAOS Control Server (pid 22075) listening on 0.0.0.0:10001
Waiting for DAOS I/O Server instance storage to be ready...
SCM format required on instance 0
formatting storage for DAOS I/O Server instance 0 (reformat: false)
Starting format of SCM (ram:/mnt/daos)
Finished format of SCM (ram:/mnt/daos)
Starting format of kdev block devices (/dev/sdl1)
Finished format of kdev block devices (/dev/sdl1)
DAOS I/O Server instance 0 storage ready
SCM @ /mnt/daos: 16.00GB Total/16.00GB Avail
Starting I/O server instance 0: /root/daos/install/bin/daos_io_server
daos_io_server:0 Using legacy core allocation algorithm

As you can see, I format and the server appears to start normally.

Here's that format command output:
dmg -i storage format
localhost:10001: connected

localhost: storage format ok

I run the agent, and it appears OK:
daos_agent -i
Starting daos_agent:
Using logfile: /tmp/daos_agent.log
Listening on /var/run/daos_agent/agent.sock

But when I try to run daos_test, everything it attempts fails, and the agent prints this message over and over:
ERROR: HandleCall for 2:206 failed: GetAttachInfo hl-d102:10001 {daos_server {} [] 13}: rpc error: code = Unknown desc = no dRPC client set (data plane not started?)

I believe I've got the environment variables set up correctly everywhere, and I have not configured access_points, etc.; this is a trivial single-server config.

This is the entirety of my file based config changes:
--- a/utils/config/examples/daos_server_local.yml
+++ b/utils/config/examples/daos_server_local.yml
@@ -14,7 +14,7 @@ servers:
   targets: 1
   first_core: 0
   nr_xs_helpers: 0
-  fabric_iface: eth0
+  fabric_iface: enp6s0
   fabric_iface_port: 31416
   log_file: /tmp/daos_server.log

@@ -31,8 +31,8 @@ servers:
   # The size of ram is specified by scm_size in GB units.
   scm_mount: /mnt/daos # map to -s /mnt/daos
   scm_class: ram
-  scm_size: 4
+  scm_size: 16

-  bdev_class: file
-  bdev_size: 16
-  bdev_list: [/tmp/daos-bdev]
+  bdev_class: kdev
+  bdev_size: 64
+  bdev_list: [/dev/sdl1]
---------

Any clever ideas about what's wrong here? Is there a command or config change I missed?

Thanks,
-Patrick


How to configure IB with multiple mlx4 devices per server

Kevan Rehm
 

Greetings,

 

Could someone please describe how to configure Linux and/or libfabric for InfiniBand such that a verbs;ofi_rxm configuration is able to communicate with all active IB devices in the cluster when each server has two mlx4 devices attached to the same subnet? Specifically, how do I set up routing so that every device port can reach every other device port in the cluster?

 

The first example here is with fi_pingpong, which has different symptoms than the second ‘ping’ example near the bottom…

 

Example #1: host hl-c500 has IB devices mlx4_0 and mlx4_1 configured with IPoIB addresses 10.0.0.36 and 10.0.0.136 respectively. Host hl-d100 has devices mlx4_0 and mlx4_1 with addresses 10.0.0.16 and 10.0.0.116 respectively. opensm is running. I previously used ibping to verify that I can ibping each IB device port from every other device port in the cluster.

 

If I run this on hl-c500:

                fi_pingpong -vvv -d mlx4_0 -p 'verbs;ofi_rxm' -e rdm

and run this on hl-d100:

                fi_pingpong -vvv -d mlx4_0 -p 'verbs;ofi_rxm' -e rdm hl-c500

everything works fine.

 

If I instead run the following on hl-d100:

                fi_pingpong -vvv -d mlx4_1 -p 'verbs;ofi_rxm' -e rdm hl-c500

it fails; I am unable to use the mlx4_1 device to talk to any other server on the subnet.

 

The routing table seems to imply that both mlx devices should work:

$ route

Kernel IP routing table

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface

default         cfgw-3222-vrrp. 0.0.0.0         UG    100    0        0 enp6s0

10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 mlx4_0

10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 mlx4_1

link-local      0.0.0.0         255.255.0.0     U     1004   0        0 mlx4_0

link-local      0.0.0.0         255.255.0.0     U     1005   0        0 mlx4_1

 

I can’t be the first person to want to use multiple IB devices on each node in the cluster. What are the configuration tricks to make it work?

 

Later… I debugged the above failure. It seems like a bug to me; am I missing something?

 

The fi_pingpong on the hl-d100 client machine looks for a connection-based verbs provider to use underneath ofi_rxm. It calls vrb_getinfo(), which calls vrb_get_match_infos(). Routine vrb_get_match_infos() starts by calling vrb_get_matching_info() with a ‘hints’ structure that includes the ‘mlx4_1’ domain hint from the “-d mlx4_1” parameter, and it fills in the ‘info’ parameter with a list containing only the ‘mlx4_1’ provider, just as one would expect.

 

Routine vrb_get_match_infos() then calls vrb_handle_sock_addr(), which calls vrb_get_rai_id(), which calls rdma_resolve_addr() to resolve the 10.0.0.36 address. The second parameter to rdma_resolve_addr() is a src_addr, but the value passed in this case, (*rai)->ai_src_addr, is NULL. So rdma_resolve_addr() has to pick an IB device to use; it happens to pick mlx4_0 and returns it in id->verbs.

 

Then vrb_handle_sock_addr() goes on to call ibv_get_device_name(), which uses id->verbs to set dev_name to “mlx4_0”. Next it calls vrb_del_info_not_belong_to(), which removes all non-mlx4_0 devices from the ‘info’ list. Since the only entry in the list was mlx4_1, the list is now empty, vrb_del_info_not_belong_to() returns -61, and you get these messages:

 

libfabric:27193:verbs:fabric:vrb_get_match_infos():1636<info> handling of the socket address fails - -61

libfabric:27193:verbs:core:vrb_get_match_infos():1656<info> Handling of the addresses fails, the getting infos is unsuccessful

 

and ultimately vrb_getinfo() returns -FI_ENODATA.

 

It seems to me that if vrb_get_rai_id() had used the mlx4_1 ‘hints’ structure to supply the mlx4_1 src_addr parameter in the rdma_resolve_addr() call, the kernel would probably have selected the mlx4_1 device instead, and the mlx4_1 device would have worked.

 

This did seem like a bug to me, but then I found that ping has a similar issue, so now I am not so sure.

 

Example #2: these two commands issued on hl-d100 both successfully ping hl-c500:

     $ ping -I mlx4_0 -r 10.0.0.36

     $ ping -I mlx4_0 -r 10.0.0.136

But both of these commands fail:

     $ ping -I mlx4_1 -r 10.0.0.36

     $ ping -I mlx4_1 -r 10.0.0.136

So again, it looks like a routing problem, even though I specifically told ping which mlx4 device to use. Or maybe this is a separate problem?

 

Help please!

 

Regards, Kevan


Re: daos server crash with extensive IO test

Zhang, Jiafu
 

Hi Patrick,

 

I checked the daos_server and daos_io_server processes on the two DAOS servers. They were all present, so the DAOS server didn’t exit. But when I ran “daos pool query --pool 58255480-98bd-47d7-98bd-cefe3067e55f --svc=0”, it just hung, and all the concurrent writing processes from Hadoop hung too. Before hanging, they did make progress and wrote some data to DAOS successfully. I have attached daos_agent.log.

 

Then I shut the DAOS server down by killing all DAOS processes. After starting the server again, I can connect to the pool and container, but I failed to mount DFS with the error below. I checked the DAOS server log; it said targets were excluded. See daos_control.txt. There are no other server logs.

 

Mount error:

placement EMRG src/placement/pl_map_common.c:57 remap_add_one() same fseq 8!

java: src/placement/pl_map_common.c:57: remap_add_one: Assertion `f_new->fs_fseq != f_shard->fs_fseq' failed.

 

Thanks.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Patrick Farrell
Sent: Friday, February 14, 2020 10:53 PM
To: 'daos@daos.groups.io' <daos@daos.groups.io>
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: [daos] daos server crash with extensive IO test

 

You stated that the DAOS server *crashed*. Is that correct, or did it receive an error and stop unexpectedly (not a crash, just an exit on error), as it (tentatively) appears from the log? If it crashed, could you share the backtrace if you have it?

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Zhang, Jiafu <jiafu.zhang@...>
Sent: Friday, February 14, 2020 8:47 AM
To: 'daos@daos.groups.io' <
daos@daos.groups.io>
Cc: Zhu, Minming <
minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: [daos] daos server crash with extensive IO test

 

Hi Guys,

 

When testing Hadoop DFSIO on two-node DAOS servers, we got new errors; see daos.log and server.log attached. See also the attached server configs (*.yml) and env. The parallelism is at the same level as when we tested Spark, but we didn’t hit these types of errors during the Spark test. It’s about 200 concurrent accesses. Each access mounts DFS on the same pool and container as a DFS client, through which file reads and writes are performed.

 

Please help. Olivier, Jeffrey V may forward my email, but I am providing more info here for your inspection. We are using a DAOS server at commit 7cc334c6e8561be8a19fa29e0102030b14d0ef63 plus the DFS patch (for the DFS SB) from Mohamad.

 

Thanks.


Update to privileged helper

Macdonald, Mjmac
 

Hi all.

 

As of this morning, running as a non-root user with builds from the master branch (development branch for version 1.2) will require setup of the privileged helper (daos_admin). This change was made to complete the effort started with the introduction of the helper.

 

To summarize:

  • If you are running DAOS from RPM-based installs, then this setup has already been done for you and no further work is necessary
  • If you are running DAOS from source, and you always run as the root user, then the privileged helper will inherit those permissions and no further work is necessary
  • If you are running DAOS from source, and you want to run as a non-root user, then you will need to perform some manual setup steps on every server to ensure that the privileged helper has the correct permissions to perform privileged tasks

 

You’ll know that you need to perform these setup steps if you see an error like the following on daos_server startup:

ERROR: pbin: code = 2 description = "the privileged helper (/home/mjmac/daos/install/bin/daos_admin) does not have root permissions"

ERROR: pbin: code = 2 resolution = "check the DAOS admin guide for details on privileged helper setup"

 

Note: These setup steps do not necessarily need to be performed after every DAOS build. The privileged helper code is pretty stable at this point and doesn’t change very often.
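
For source builds, the manual setup essentially amounts to making daos_admin setuid root. As a sketch (using the install path from the error above; yours will differ, and the Admin Guide below is the authoritative reference):

    sudo chown root /home/mjmac/daos/install/bin/daos_admin   # helper must be owned by root
    sudo chmod 4755 /home/mjmac/daos/install/bin/daos_admin   # setuid root so it can perform privileged tasks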

 

Please refer to the DAOS Admin Guide for specifics on the setup steps: https://daos-stack.github.io/#admin/deployment/#elevated-privileges

 

Best,

mjmac


Re: daos server crash with extensive IO test

Patrick Farrell <paf@...>
 

You stated that the DAOS server *crashed*. Is that correct, or did it receive an error and stop unexpectedly (not a crash, just an exit on error), as it (tentatively) appears from the log? If it crashed, could you share the backtrace if you have it?

-Patrick

From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Zhang, Jiafu <jiafu.zhang@...>
Sent: Friday, February 14, 2020 8:47 AM
To: 'daos@daos.groups.io' <daos@daos.groups.io>
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: [daos] daos server crash with extensive IO test
 

Hi Guys,

 

When testing Hadoop DFSIO on two-node DAOS servers, we got new errors; see daos.log and server.log attached. See also the attached server configs (*.yml) and env. The parallelism is at the same level as when we tested Spark, but we didn’t hit these types of errors during the Spark test. It’s about 200 concurrent accesses. Each access mounts DFS on the same pool and container as a DFS client, through which file reads and writes are performed.

 

Please help. Olivier, Jeffrey V may forward my email, but I am providing more info here for your inspection. We are using a DAOS server at commit 7cc334c6e8561be8a19fa29e0102030b14d0ef63 plus the DFS patch (for the DFS SB) from Mohamad.

 

Thanks.


daos server crash with extensive IO test

Zhang, Jiafu
 

Hi Guys,

 

When testing Hadoop DFSIO on two-node DAOS servers, we got new errors; see daos.log and server.log attached. See also the attached server configs (*.yml) and env. The parallelism is at the same level as when we tested Spark, but we didn’t hit these types of errors during the Spark test. It’s about 200 concurrent accesses. Each access mounts DFS on the same pool and container as a DFS client, through which file reads and writes are performed.

 

Please help. Olivier, Jeffrey V may forward my email, but I am providing more info here for your inspection. We are using a DAOS server at commit 7cc334c6e8561be8a19fa29e0102030b14d0ef63 plus the DFS patch (for the DFS SB) from Mohamad.

 

Thanks.


Re: dfs_write error 28

Olivier, Jeffrey V
 

Hi Chenzhao,

 

Internally, DAOS object updates are versioned.  New updates add new data.  When an object is deleted, that is also logged as a new write that we call a “punch”.  Punched objects that are no longer visible are cleaned up in the background by a process we call aggregation.  This aggregation process should not lag 30 minutes behind, but it does operate on a delay of about 60 seconds; knowing that can help explain what you are seeing.  Again, I wouldn’t expect full aggregation to take 30 minutes, but you likely need to ensure your storage has some extra space to absorb this temporary excess.

 

As for the CPU time, this is something we plan to fix eventually, but DAOS server threads spin waiting for new work, so they can take a lot of CPU even when idle.

 

-Jeff

 

From: daos@daos.groups.io [mailto:daos@daos.groups.io] On Behalf Of Guo, Chenzhao
Sent: Thursday, February 13, 2020 4:17 AM
To: 'daos@daos.groups.io' <daos@daos.groups.io>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>
Subject: [daos] dfs_write error 28

 

Hi guys,

 

I have come across an issue. Could you please help me see it?

 

Error code 28, “No space left on device”, from calling dfs_write: I sequentially run the same workload 10 times; the workload writes some data to DAOS and deletes it all at the end. The 10th run fails with the above error (and you can see from “daos pool query --pool” that the free space is truly not enough). But if you wait long enough, say 30 minutes, daos pool query will report full space again, and you can write to DAOS. I am using one server running both daos_server and daos_client.

 

Another question: when no workload is running, I can still see daos_io_server occupying 300% CPU in `top`. Is this normal, and why?

 

Thanks,

Chenzhao

 


dfs_write error 28

Guo, Chenzhao <chenzhao.guo@...>
 

Hi guys,

 

I have come across an issue. Could you please help me see it?

 

Error code 28, “No space left on device”, from calling dfs_write: I sequentially run the same workload 10 times; the workload writes some data to DAOS and deletes it all at the end. The 10th run fails with the above error (and you can see from “daos pool query --pool” that the free space is truly not enough). But if you wait long enough, say 30 minutes, daos pool query will report full space again, and you can write to DAOS. I am using one server running both daos_server and daos_client.

 

Another question: when no workload is running, I can still see daos_io_server occupying 300% CPU in `top`. Is this normal, and why?

 

Thanks,

Chenzhao

 


Re: [External] Re: [daos] [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

Shengyu SY19 Zhang
 

Hello,

 

I’m interested in this fix as well. On my side, dfs_stat caused an infinite loop in cart (cart was just updated); the log file grows quickly.

 

The repeating log entries are:

02/13-01:08:12.64 afa1 DAOS[2697/2697] hg   ERR  src/cart/crt_hg.c:1197 crt_hg_req_send(0x55e4a5357010)  [opc=0x4010001 xid=0x133ccf rank:tag=0:1] HG_Forward failed, hg_ret: 11

02/13-01:08:12.64 afa1 DAOS[2697/2697] rpc  ERR  src/cart/crt_rpc.c:1034 crt_req_send_immediately(0x55e4a5357010)  [opc=0x4010001 xid=0x133ccf rank:tag=0:1] crt_hg_req_send failed, rc: -1020

02/13-01:08:12.64 afa1 DAOS[2697/2697] rpc  ERR  src/cart/crt_rpc.c:1174 crt_req_send() crt_req_send_internal() failed, rc -1020, opc: 0x4010001

02/13-01:08:12.64 afa1 DAOS[2697/2697] rpc  ERR  src/cart/crt_context.c:302 crt_rpc_complete(0x55e4a5357010)  [opc=0x4010001 xid=0x133ccf rank:tag=0:1] RPC failed; rc: -1020

02/13-01:08:12.64 afa1 DAOS[2697/2697] object ERR  src/object/cli_shard.c:201 dc_rw_cb() RPC 1 failed: -1020

02/13-01:08:12.64 afa1 DAOS[2697/2697] hg   ERR  # NA -- Error -- /root/daos/_build.external/mercury/src/na/na_ofi.c:4010

# na_ofi_mem_register(): fi_mr_reg() failed, rc: -13(Permission denied)

02/13-01:08:12.64 afa1 DAOS[2697/2697] hg   ERR  # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:530

# hg_bulk_create(): NA_Mem_register() failed (NA_PROTOCOL_ERROR)

02/13-01:08:12.64 afa1 DAOS[2697/2697] hg   ERR  # HG -- Error -- /root/daos/_build.external/mercury/src/mercury_bulk.c:1108

# HG_Bulk_create(): Could not create bulk handle

 

Here is the stack:

 

#0  0x00007fa06901a275 in _xstat () from /lib64/libc.so.6

#1  0x00007fa068fe1dad in __tzfile_read () from /lib64/libc.so.6

#2  0x00007fa068fe1104 in tzset_internal () from /lib64/libc.so.6

#3  0x00007fa068fe1ab3 in __tz_convert () from /lib64/libc.so.6

#4  0x00007fa052f430a7 in d_vlog (flags=67108901, fmt=0x7fa05229c818 "# %s -- %s -- %s:%d\n # %s(): %s\n", ap=ap@entry=0x7ffd2abd3648) at src/gurt/dlog.c:404

#5  0x00007fa0531aa237 in crt_hg_log (stream=<optimized out>, fmt=<optimized out>) at src/cart/crt_hg.c:481

#6  0x00007fa05229ad68 in hg_log_write (log_type=log_type@entry=4, module=module@entry=0x7fa0526cd65b "HG", file=file@entry=0x7fa0526cdb58 "/root/daos/_build.external/mercury/src/mercury.c", line=1888, func=0x7fa0526ce11a <__func__.5980> "HG_Forward",

    format=<optimized out>) at /root/daos/_build.external/mercury/src/util/mercury_log.c:98

#7  0x00007fa0526c2d02 in HG_Forward (handle=0x55e4a5263460, callback=callback@entry=0x7fa0531a89b0 <crt_hg_req_send_cb>, arg=arg@entry=0x55e4a5363010, in_struct=in_struct@entry=0x55e4a5363030) at /root/daos/_build.external/mercury/src/mercury.c:1903

#8  0x00007fa0531afe69 in crt_hg_req_send (rpc_priv=rpc_priv@entry=0x55e4a5363010) at src/cart/crt_hg.c:1192

#9  0x00007fa0531ec56c in crt_req_send_immediately (rpc_priv=0x55e4a5363010) at src/cart/crt_rpc.c:1030

#10 crt_req_send_internal (rpc_priv=rpc_priv@entry=0x55e4a5363010) at src/cart/crt_rpc.c:1063

#11 0x00007fa0531f1870 in crt_req_send (req=0x55e4a5363010, complete_cb=complete_cb@entry=0x7fa0538b26a0 <daos_rpc_cb>, arg=arg@entry=0x55e4a513a940) at src/cart/crt_rpc.c:1170

#12 0x00007fa0538b2733 in daos_rpc_send (rpc=<optimized out>, task=task@entry=0x55e4a513a940) at src/client/api/rpc.c:59

#13 0x00007fa0538f40d5 in dc_obj_shard_rw (shard=0x55e4a5303d78, opc=<optimized out>, shard_args=0x55e4a513acb0, fw_shard_tgts=0x0, fw_cnt=<optimized out>, task=0x55e4a513a940) at src/object/cli_shard.c:479

#14 0x00007fa0538e734c in shard_io (task=task@entry=0x55e4a513a940, shard_auxi=shard_auxi@entry=0x55e4a513acb0) at src/object/cli_obj.c:1877

#15 0x00007fa0538e7c9a in obj_req_fanout (obj=0x55e4a5303ed0, obj_auxi=0x55e4a513aa58, dkey_hash=dkey_hash@entry=10594513079604596330, map_ver=1, epoch=18446744073709551615, io_prep_cb=io_prep_cb@entry=0x7fa0538e3330 <shard_rw_prep>,

    io_cb=0x7fa0538f3960 <dc_obj_shard_rw>, obj_task=obj_task@entry=0x55e4a513a940) at src/object/cli_obj.c:1977

#16 0x00007fa0538eb5d3 in do_dc_obj_fetch (task=0x55e4a513a940, args=0x55e4a513a9c8, flags=<optimized out>, shard=0) at src/object/cli_obj.c:2593

#17 0x00007fa053db756c in tse_sched_process_init (dsp=0x7fa053b5ded0 <daos_sched_g+16>) at src/common/tse.c:543

#18 tse_sched_run (sched=0x7fa053b5dec0 <daos_sched_g>) at src/common/tse.c:693

#19 0x00007fa053db7b74 in tse_sched_progress (sched=<optimized out>) at src/common/tse.c:722

#20 0x00007fa0538aa041 in ev_progress_cb (arg=arg@entry=0x7ffd2abd3fe0) at src/client/api/event.c:506

#21 0x00007fa053174db1 in crt_progress (crt_ctx=0x55e4a5253120, timeout=timeout@entry=-1, cond_cb=cond_cb@entry=0x7fa0538aa020 <ev_progress_cb>, arg=arg@entry=0x7ffd2abd3fe0) at src/cart/crt_context.c:1236

#22 0x00007fa0538af346 in daos_event_priv_wait () at src/client/api/event.c:1205

#23 0x00007fa0538b2b16 in dc_task_schedule (task=0x55e4a513a940, instant=instant@entry=true) at src/client/api/task.c:139

#24 0x00007fa0538b13ac in daos_obj_fetch (oh=..., oh@entry=..., th=..., th@entry=..., flags=flags@entry=0, dkey=dkey@entry=0x7ffd2abd40e0, nr=nr@entry=1, iods=iods@entry=0x7ffd2abd4100, sgls=sgls@entry=0x7ffd2abd40c0, maps=maps@entry=0x0, ev=ev@entry=0x0)

    at src/client/api/object.c:170

#25 0x00007fa053b6310a in fetch_entry (oh=oh@entry=..., th=..., th@entry=..., name=0x55e4a5303c08 "/", fetch_sym=fetch_sym@entry=true, exists=exists@entry=0x7ffd2abd425f, entry=0x7ffd2abd4270) at src/client/dfs/dfs.c:329

#26 0x00007fa053b664cf in entry_stat (dfs=dfs@entry=0x55e4a5303b70, th=th@entry=..., oh=..., name=name@entry=0x55e4a5303c08 "/", stbuf=stbuf@entry=0x7ffd2abd4390) at src/client/dfs/dfs.c:490

#27 0x00007fa053b722e7 in dfs_stat (dfs=0x55e4a5303b70, parent=0x55e4a5303bd8, name=0x55e4a5303c08 "/", stbuf=0x7ffd2abd4390) at src/client/dfs/dfs.c:2876

 

Regards,

Shengyu.

 

From: <daos@daos.groups.io> on behalf of "Zhang, Jiafu" <jiafu.zhang@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday, February 13, 2020 at 9:39 AM
To: "'daos@daos.groups.io'" <daos@daos.groups.io>
Cc: "Zhu, Minming" <minming.zhu@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: [External] Re: [daos] [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

 

Hi Mohamad,

 

Your patch worked.

 

Thanks.

 

From: Zhang, Jiafu
Sent: Thursday, February 13, 2020 8:50 AM
To: daos@daos.groups.io
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: RE: [daos] [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

 

Thank you, Mohamad. Let me try.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Chaarawi, Mohamad
Sent: Wednesday, February 12, 2020 11:01 PM
To: daos@daos.groups.io
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: [daos] [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

 

Could you please try with this patch:

https://patch-diff.githubusercontent.com/raw/daos-stack/daos/pull/1864.patch

 

I think the previous patch that Johann sent you only changed the oclass for the SB object, but not the root object.

 

Thanks,

Mohamad

 

From: <daos@daos.groups.io> on behalf of "Zhang, Jiafu" <jiafu.zhang@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, February 12, 2020 at 5:28 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Cc: "Zhu, Minming" <minming.zhu@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: [daos] [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

 

Hi Guys,

 

During integration testing with Hadoop, we failed to list the root directory (“/”) on multiple DAOS servers (we tried two and five DAOS servers) with the command below.

              hadoop fs -ls /

 

We tracked the issue down to the dfs_ostat API, which fails on an opened root (“/”) fs object. See the statement below.

 int rc = dfs_ostat(dfs, file, &stat);

rc is 2, meaning “no such file or directory”, but we can open it successfully with the dfs_lookup API.

 

Stranger still, we can list the root directory (“/”) on a single DAOS server with the same code. Any ideas?

 

Thanks.


Re: [External] Re: [daos] Core dump while creating pool

Shengyu SY19 Zhang
 

Hello Alex,

 

This may be related to cart: I completely removed cart and rebuilt the project, and the crash did not happen today. The steps were very simple:

On the server side:

rm /mnt/daos/* -rf  (since there was an issue in IB, I have to do this every time)

daos_server start

 

On the client side:

dmg storage format --reformat

dmg pool create --scm-size=100G --nvme-size=1T

 

Regards,

Shengyu

 

From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, February 12, 2020 at 6:00 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [External] Re: [daos] Core dump while creating pool

 

Hi Shengyu,

 

We haven’t seen this before.

 

What are git hashes of daos, cart, mercury and ofi that you used?

What command did you run to get this segfault?

Can you provide daos server yaml that you used?

 

Thanks,

~~Alex.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Shengyu SY19 Zhang
Sent: Wednesday, February 12, 2020 1:39 AM
To: daos@daos.groups.io
Subject: [daos] Core dump while creating pool

 

Hello,

 

I don’t know if this is a known issue; just to let you know, I have been hitting it for several days. Here is the backtrace (daos and mercury were just updated):

 

#0  0x0000000001df4620 in ?? ()

#1  0x00007f749e55d922 in crt_hg_addr_lookup_cb (hg_cbinfo=<optimized out>) at src/cart/crt_hg.c:291

#2  0x00007f749c3a0675 in hg_core_addr_lookup_cb (callback_info=<optimized out>) at /root/daos/_build.external/mercury/src/mercury.c:447

#3  0x00007f749c3ac828 in hg_core_trigger_lookup_entry (hg_core_op_id=0x7f7455386340) at /root/daos/_build.external/mercury/src/mercury_core.c:3308

#4  hg_core_trigger (context=0x7f748802bac0, timeout=<optimized out>, timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x46d8d4c) at /root/daos/_build.external/mercury/src/mercury_core.c:3250

#5  0x00007f749c3ad7db in HG_Core_trigger (context=<optimized out>, timeout=timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x46d8d4c) at /root/daos/_build.external/mercury/src/mercury_core.c:4494

#6  0x00007f749c3a3f92 in HG_Trigger (context=context@entry=0x7f748802baa0, timeout=timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x46d8d4c) at /root/daos/_build.external/mercury/src/mercury.c:1983

#7  0x00007f749e5603ca in crt_hg_trigger (hg_ctx=hg_ctx@entry=0x7f7488026858) at src/cart/crt_hg.c:1328

#8  0x00007f749e569a3d in crt_hg_progress (hg_ctx=hg_ctx@entry=0x7f7488026858, timeout=timeout@entry=0) at src/cart/crt_hg.c:1361

#9  0x00007f749e52b43a in crt_progress (crt_ctx=0x7f7488026840, timeout=timeout@entry=0, cond_cb=cond_cb@entry=0x0, arg=arg@entry=0x0) at src/cart/crt_context.c:1253

#10 0x000000000041d537 in dss_srv_handler (arg=0x46c8090) at src/iosrv/srv.c:512

#11 0x00007f749d475a4f in ABTD_thread_func_wrapper_thread () from /root/daos/install/lib/libabt.so.0

#12 0x00007f749d4761b1 in make_fcontext () from /root/daos/install/lib/libabt.so.0

#13 0x0000000000000000 in ?? ()

 

Regards,

Shengyu


Re: [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

Zhang, Jiafu
 

Hi Mohamad,

 

Your patch worked.

 

Thanks.

 

From: Zhang, Jiafu
Sent: Thursday, February 13, 2020 8:50 AM
To: daos@daos.groups.io
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: RE: [daos] [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

 

Thank you, Mohamad. Let me try.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Chaarawi, Mohamad
Sent: Wednesday, February 12, 2020 11:01 PM
To: daos@daos.groups.io
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: [daos] [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

 

Could you please try with this patch:

https://patch-diff.githubusercontent.com/raw/daos-stack/daos/pull/1864.patch

 

I think the previous patch that Johann sent you only changed the oclass for the SB object, but not the root object.

 

Thanks,

Mohamad

 

From: <daos@daos.groups.io> on behalf of "Zhang, Jiafu" <jiafu.zhang@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, February 12, 2020 at 5:28 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Cc: "Zhu, Minming" <minming.zhu@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: [daos] [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

 

Hi Guys,

 

During integration testing with Hadoop, we failed to list the root directory (“/”) on multiple DAOS servers (we tried two and five DAOS servers) with the command below.

              hadoop fs -ls /

 

We tracked the issue down to the dfs_ostat API, which fails on an opened root (“/”) fs object. See the statement below.

 int rc = dfs_ostat(dfs, file, &stat);

rc is 2, meaning “no such file or directory”, but we can open it successfully with the dfs_lookup API.

 

Stranger still, we can list the root directory (“/”) on a single DAOS server with the same code. Any ideas?

 

Thanks.


Re: [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

Zhang, Jiafu
 

Thank you, Mohamad. Let me try.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Chaarawi, Mohamad
Sent: Wednesday, February 12, 2020 11:01 PM
To: daos@daos.groups.io
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: [daos] [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

 

Could you please try with this patch:

https://patch-diff.githubusercontent.com/raw/daos-stack/daos/pull/1864.patch

 

I think the previous patch that Johann sent you only changed the oclass for the SB object, but not the root object.

 

Thanks,

Mohamad

 

From: <daos@daos.groups.io> on behalf of "Zhang, Jiafu" <jiafu.zhang@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, February 12, 2020 at 5:28 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Cc: "Zhu, Minming" <minming.zhu@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: [daos] [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

 

Hi Guys,

 

During integration testing with Hadoop, we failed to list the root directory (“/”) on multiple DAOS servers (we tried two and five DAOS servers) with the command below.

              hadoop fs -ls /

 

We tracked the issue down to the dfs_ostat API, which fails on an opened root (“/”) fs object. See the statement below.

 int rc = dfs_ostat(dfs, file, &stat);

rc is 2, meaning “no such file or directory”, but we can open it successfully with the dfs_lookup API.

 

Stranger still, we can list the root directory (“/”) on a single DAOS server with the same code. Any ideas?

 

Thanks.


Unable to run DAOS commands - Agent reports "no dRPC client set"

Patrick Farrell <paf@...>
 

Good morning,

I've just moved up to the latest tip-of-tree DAOS (I'm not sure exactly which commit I was running before; it was a week or two out of date), and I can't get any tests to run.

I've pared back to a trivial config, and I appear to be able to start the server, etc., but the agent claims the data plane is not running, and I'm not having a lot of luck troubleshooting it.

Here's my server startup command & output:
/root/daos/install/bin/daos_server start -o /root/daos/utils/config/examples/daos_server_local.yml
/root/daos/install/bin/daos_server logging to file /tmp/daos_control.log
ERROR: /root/daos/install/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
DAOS Control Server (pid 22075) listening on 0.0.0.0:10001
Waiting for DAOS I/O Server instance storage to be ready...
SCM format required on instance 0
formatting storage for DAOS I/O Server instance 0 (reformat: false)
Starting format of SCM (ram:/mnt/daos)
Finished format of SCM (ram:/mnt/daos)
Starting format of kdev block devices (/dev/sdl1)
Finished format of kdev block devices (/dev/sdl1)
DAOS I/O Server instance 0 storage ready
SCM @ /mnt/daos: 16.00GB Total/16.00GB Avail
Starting I/O server instance 0: /root/daos/install/bin/daos_io_server
daos_io_server:0 Using legacy core allocation algorithm

As you can see, I format and the server appears to start normally.

Here's that format command output:
dmg -i storage format
localhost:10001: connected

localhost: storage format ok

I run the agent, and it appears OK:
daos_agent -i
Starting daos_agent:
Using logfile: /tmp/daos_agent.log
Listening on /var/run/daos_agent/agent.sock

But when I try to run daos_test, everything it attempts fails, and the agent prints this message over and over:
ERROR: HandleCall for 2:206 failed: GetAttachInfo hl-d102:10001 {daos_server {} [] 13}: rpc error: code = Unknown desc = no dRPC client set (data plane not started?)

I believe I've got the environment variables set up correctly everywhere, and I have not configured access_points, etc.; this is a trivial single-server config.

This is the entirety of my file based config changes:
--- a/utils/config/examples/daos_server_local.yml
+++ b/utils/config/examples/daos_server_local.yml
@@ -14,7 +14,7 @@ servers:
   targets: 1
   first_core: 0
   nr_xs_helpers: 0
-  fabric_iface: eth0
+  fabric_iface: enp6s0
   fabric_iface_port: 31416
   log_file: /tmp/daos_server.log

@@ -31,8 +31,8 @@ servers:
   # The size of ram is specified by scm_size in GB units.
   scm_mount: /mnt/daos # map to -s /mnt/daos
   scm_class: ram
-  scm_size: 4
+  scm_size: 16

-  bdev_class: file
-  bdev_size: 16
-  bdev_list: [/tmp/daos-bdev]
+  bdev_class: kdev
+  bdev_size: 64
+  bdev_list: [/dev/sdl1]
---------

Any clever ideas about what's wrong here? Is there a command or config change I missed?

Thanks,
-Patrick


Re: [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

Chaarawi, Mohamad
 

Could you please try with this patch:

https://patch-diff.githubusercontent.com/raw/daos-stack/daos/pull/1864.patch

 

I think the previous patch that Johann sent you only changed the oclass for the SB object, but not the root object.

 

Thanks,

Mohamad

 

From: <daos@daos.groups.io> on behalf of "Zhang, Jiafu" <jiafu.zhang@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, February 12, 2020 at 5:28 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Cc: "Zhu, Minming" <minming.zhu@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: [daos] [DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

 

Hi Guys,

 

During integration testing with Hadoop, we failed to list the root directory (“/”) on multiple DAOS servers (we tried two and five DAOS servers) with the command below.

              hadoop fs -ls /

 

We tracked the issue down to the dfs_ostat API, which fails on an opened root (“/”) fs object. See the statement below.

 int rc = dfs_ostat(dfs, file, &stat);

rc is 2, meaning “no such file or directory”, but we can open it successfully with the dfs_lookup API.

 

Stranger still, we can list the root directory (“/”) on a single DAOS server with the same code. Any ideas?

 

Thanks.


[DAOS DFS] dfs_ostat with root path ("/") failed on multiple DAOS servers

Zhang, Jiafu
 

Hi Guys,

 

During integration testing with Hadoop, we failed to list the root directory (“/”) on multiple DAOS servers (we tried two and five DAOS servers) with the command below.

              hadoop fs -ls /

 

We tracked the issue down to the dfs_ostat API, which fails on an opened root (“/”) fs object. See the statement below.

 int rc = dfs_ostat(dfs, file, &stat);

rc is 2, meaning “no such file or directory”, but we can open it successfully with the dfs_lookup API.

 

Stranger still, we can list the root directory (“/”) on a single DAOS server with the same code. Any ideas?

 

Thanks.


Re: Core dump while creating pool

Oganezov, Alexander A
 

Hi Shengyu,

 

We haven’t seen this before.

 

What are git hashes of daos, cart, mercury and ofi that you used?

What command did you run to get this segfault?

Can you provide daos server yaml that you used?

 

Thanks,

~~Alex.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Shengyu SY19 Zhang
Sent: Wednesday, February 12, 2020 1:39 AM
To: daos@daos.groups.io
Subject: [daos] Core dump while creating pool

 

Hello,

 

I don’t know if this is a known issue; just to let you know, I have been hitting it for several days. Here is the backtrace (daos and mercury were just updated):

 

#0  0x0000000001df4620 in ?? ()

#1  0x00007f749e55d922 in crt_hg_addr_lookup_cb (hg_cbinfo=<optimized out>) at src/cart/crt_hg.c:291

#2  0x00007f749c3a0675 in hg_core_addr_lookup_cb (callback_info=<optimized out>) at /root/daos/_build.external/mercury/src/mercury.c:447

#3  0x00007f749c3ac828 in hg_core_trigger_lookup_entry (hg_core_op_id=0x7f7455386340) at /root/daos/_build.external/mercury/src/mercury_core.c:3308

#4  hg_core_trigger (context=0x7f748802bac0, timeout=<optimized out>, timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x46d8d4c) at /root/daos/_build.external/mercury/src/mercury_core.c:3250

#5  0x00007f749c3ad7db in HG_Core_trigger (context=<optimized out>, timeout=timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x46d8d4c) at /root/daos/_build.external/mercury/src/mercury_core.c:4494

#6  0x00007f749c3a3f92 in HG_Trigger (context=context@entry=0x7f748802baa0, timeout=timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x46d8d4c) at /root/daos/_build.external/mercury/src/mercury.c:1983

#7  0x00007f749e5603ca in crt_hg_trigger (hg_ctx=hg_ctx@entry=0x7f7488026858) at src/cart/crt_hg.c:1328

#8  0x00007f749e569a3d in crt_hg_progress (hg_ctx=hg_ctx@entry=0x7f7488026858, timeout=timeout@entry=0) at src/cart/crt_hg.c:1361

#9  0x00007f749e52b43a in crt_progress (crt_ctx=0x7f7488026840, timeout=timeout@entry=0, cond_cb=cond_cb@entry=0x0, arg=arg@entry=0x0) at src/cart/crt_context.c:1253

#10 0x000000000041d537 in dss_srv_handler (arg=0x46c8090) at src/iosrv/srv.c:512

#11 0x00007f749d475a4f in ABTD_thread_func_wrapper_thread () from /root/daos/install/lib/libabt.so.0

#12 0x00007f749d4761b1 in make_fcontext () from /root/daos/install/lib/libabt.so.0

#13 0x0000000000000000 in ?? ()

 

Regards,

Shengyu


Core dump while creating pool

Shengyu SY19 Zhang
 

Hello,

 

I don’t know if this is a known issue; just to let you know, I have been hitting it for several days. Here is the backtrace (daos and mercury were just updated):

 

#0  0x0000000001df4620 in ?? ()

#1  0x00007f749e55d922 in crt_hg_addr_lookup_cb (hg_cbinfo=<optimized out>) at src/cart/crt_hg.c:291

#2  0x00007f749c3a0675 in hg_core_addr_lookup_cb (callback_info=<optimized out>) at /root/daos/_build.external/mercury/src/mercury.c:447

#3  0x00007f749c3ac828 in hg_core_trigger_lookup_entry (hg_core_op_id=0x7f7455386340) at /root/daos/_build.external/mercury/src/mercury_core.c:3308

#4  hg_core_trigger (context=0x7f748802bac0, timeout=<optimized out>, timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x46d8d4c) at /root/daos/_build.external/mercury/src/mercury_core.c:3250

#5  0x00007f749c3ad7db in HG_Core_trigger (context=<optimized out>, timeout=timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x46d8d4c) at /root/daos/_build.external/mercury/src/mercury_core.c:4494

#6  0x00007f749c3a3f92 in HG_Trigger (context=context@entry=0x7f748802baa0, timeout=timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x46d8d4c) at /root/daos/_build.external/mercury/src/mercury.c:1983

#7  0x00007f749e5603ca in crt_hg_trigger (hg_ctx=hg_ctx@entry=0x7f7488026858) at src/cart/crt_hg.c:1328

#8  0x00007f749e569a3d in crt_hg_progress (hg_ctx=hg_ctx@entry=0x7f7488026858, timeout=timeout@entry=0) at src/cart/crt_hg.c:1361

#9  0x00007f749e52b43a in crt_progress (crt_ctx=0x7f7488026840, timeout=timeout@entry=0, cond_cb=cond_cb@entry=0x0, arg=arg@entry=0x0) at src/cart/crt_context.c:1253

#10 0x000000000041d537 in dss_srv_handler (arg=0x46c8090) at src/iosrv/srv.c:512

#11 0x00007f749d475a4f in ABTD_thread_func_wrapper_thread () from /root/daos/install/lib/libabt.so.0

#12 0x00007f749d4761b1 in make_fcontext () from /root/daos/install/lib/libabt.so.0

#13 0x0000000000000000 in ?? ()

 

Regards,

Shengyu


daos configuration change - “nr_xs_helpers” changed to be total number of helpers per server

Liu, Xuezhao
 

Hi,

 

Just a reminder: now that PR1220 has landed, “nr_xs_helpers” means the total number of helpers for the whole DAOS I/O server (previously it meant the number of helpers per VOS target). This should only affect performance testing.

 

For example, a configuration with “targets: 8” and “nr_xs_helpers: 8” now equals “targets: 8” and “nr_xs_helpers: 1” from before PR1220 landed.

The purpose is to support usages like 8 VOS targets with 4 helpers, which could not be configured before.

 

A bit more info about internal core binding:

If nr_xs_helpers equals the number of targets, or two times the number of targets, then each VOS target has 1 or 2 private helpers, and the core affinity between the VOS IO XS and helper XS is exactly the same as before.

If nr_xs_helpers cannot be evenly divided among the targets, for example 8 VOS targets configured with 4 or 12 helpers, then all the helpers are created as a pool and shared by all the VOS targets.
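
As an illustration (values invented), a daos_server.yml fragment for the shared-helper case just described:

    servers:
      - targets: 8          # 8 VOS targets
        nr_xs_helpers: 4    # total helpers for the whole I/O server (new semantics);
                            # 4 does not divide evenly among 8 targets, so the
                            # helpers form a pool shared by all VOS targets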

 

Thanks,

Xuezhao

 

 

 
