Re: Hugepages setting
Patrick, thank you so much for the reply!
Re: [External] Re: [daos] dfs_stat and infinitely loop
Patrick Farrell <paf@...>
Are you using OPA? I believe there are some issues with network contexts and different users in OPA...?
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Shengyu SY19 Zhang <zhangsy19@...>
Sent: Wednesday, February 19, 2020 7:41:09 PM To: daos@daos.groups.io <daos@daos.groups.io> Subject: Re: [External] Re: [daos] dfs_stat and infinitely loop
Hi Mohamad,
Yes, the code works without setgid(0) (or similar functions related to the user context). The client log shows nothing related; it loops infinitely in the poll, as if a network packet were lost. This is the gdb stack:
#0  0x00007f1966699e63 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f19658dc728 in hg_poll_wait (poll_set=0x85d990, timeout=timeout@entry=1, progressed=progressed@entry=0x7ffd6af1f49f "") at /root/daos/_build.external/mercury/src/util/mercury_poll.c:434
#2  0x00007f1965d05763 in hg_core_progress_poll (context=0x895b70, timeout=1) at /root/daos/_build.external/mercury/src/mercury_core.c:3280
#3  0x00007f1965d0a94c in HG_Core_progress (context=<optimized out>, timeout=timeout@entry=1) at /root/daos/_build.external/mercury/src/mercury_core.c:4877
#4  0x00007f1965d0242d in HG_Progress (context=context@entry=0x77f250, timeout=timeout@entry=1) at /root/daos/_build.external/mercury/src/mercury.c:2243
#5  0x00007f1966dfb28b in crt_hg_progress (hg_ctx=hg_ctx@entry=0x8909b8, timeout=timeout@entry=1000) at src/cart/crt_hg.c:1366
#6  0x00007f1966dbcf2b in crt_progress (crt_ctx=0x8909a0, timeout=timeout@entry=-1, cond_cb=cond_cb@entry=0x7f196772d5a0 <ev_progress_cb>, arg=arg@entry=0x7ffd6af1f5d0) at src/cart/crt_context.c:1300
#7  0x00007f19677328c6 in daos_event_priv_wait () at src/client/api/event.c:1205
#8  0x00007f1967736096 in dc_task_schedule (task=0x8a3be0, instant=instant@entry=true) at src/client/api/task.c:139
#9  0x00007f196773492c in daos_obj_fetch (oh=..., oh@entry=..., th=..., th@entry=..., flags=flags@entry=0, dkey=dkey@entry=0x7ffd6af1f6d0, nr=nr@entry=1, iods=iods@entry=0x7ffd6af1f6f0, sgls=sgls@entry=0x7ffd6af1f6b0, maps=maps@entry=0x0, ev=ev@entry=0x0) at src/client/api/object.c:170
#10 0x00007f19674f810a in fetch_entry (oh=oh@entry=..., th=..., th@entry=..., name=0x941808 "/", fetch_sym=fetch_sym@entry=true, exists=exists@entry=0x7ffd6af1f84f, entry=0x7ffd6af1f860) at src/client/dfs/dfs.c:329
#11 0x00007f19674fb4cf in entry_stat (dfs=dfs@entry=0x941770, th=th@entry=..., oh=..., name=name@entry=0x941808 "/", stbuf=stbuf@entry=0x7ffd6af1f9c0) at src/client/dfs/dfs.c:490
#12 0x00007f19675072e7 in dfs_stat (dfs=0x941770, parent=0x9417d8, name=0x941808 "/", stbuf=0x7ffd6af1f9c0) at src/client/dfs/dfs.c:2876
#13 0x00000000004012c3 in main ()
Regards, Shengyu.
From: <daos@daos.groups.io> on behalf of "Chaarawi, Mohamad" <mohamad.chaarawi@...>
Hi Shengyu,
If you don't call setgid(0), it works? I'm not sure why that would cause the operation not to return. Could you please attach gdb and get a trace of where it hangs? Do you see anything suspicious in the DAOS client log?
Thanks, Mohamad
From: <daos@daos.groups.io> on behalf of Shengyu SY19 Zhang <zhangsy19@...>
Hello,
Recently I hit this issue: when I call dfs_stat in my code, it never returns. I have found the basic cause, but I don't have a solution yet. This is the sample code:
rc = dfs_mount(dfs_poh, coh, O_RDWR, &dfs1);
if (rc != -DER_SUCCESS) {
        printf("Failed to mount to container (%d)\n", rc);
        D_GOTO(out_dfs, 0);
}
setgid(0);
struct stat stbuf = {0};
rc = dfs_stat(dfs1, NULL, NULL, (struct stat *)&stbuf);
if (rc)
        printf("stat '' failed, rc: %d\n", rc);
else
        printf("stat '' succeeded, rc: %d\n", rc);
With that setgid(0) call present, the problem always happens, even though it does not change the current gid. I'm working on a DAOS Samba plugin, and it performs lots of similar user-context switches.
Regards, Shengyu.
Re: [External] Re: [daos] dfs_stat and infinitely loop
Shengyu SY19 Zhang
Hi Mohamad,
Yes, the code works without setgid(0) (or similar functions related to the user context). The client log shows nothing related; it loops infinitely in the poll, as if a network packet were lost. This is the gdb stack:
#0  0x00007f1966699e63 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f19658dc728 in hg_poll_wait (poll_set=0x85d990, timeout=timeout@entry=1, progressed=progressed@entry=0x7ffd6af1f49f "") at /root/daos/_build.external/mercury/src/util/mercury_poll.c:434
#2  0x00007f1965d05763 in hg_core_progress_poll (context=0x895b70, timeout=1) at /root/daos/_build.external/mercury/src/mercury_core.c:3280
#3  0x00007f1965d0a94c in HG_Core_progress (context=<optimized out>, timeout=timeout@entry=1) at /root/daos/_build.external/mercury/src/mercury_core.c:4877
#4  0x00007f1965d0242d in HG_Progress (context=context@entry=0x77f250, timeout=timeout@entry=1) at /root/daos/_build.external/mercury/src/mercury.c:2243
#5  0x00007f1966dfb28b in crt_hg_progress (hg_ctx=hg_ctx@entry=0x8909b8, timeout=timeout@entry=1000) at src/cart/crt_hg.c:1366
#6  0x00007f1966dbcf2b in crt_progress (crt_ctx=0x8909a0, timeout=timeout@entry=-1, cond_cb=cond_cb@entry=0x7f196772d5a0 <ev_progress_cb>, arg=arg@entry=0x7ffd6af1f5d0) at src/cart/crt_context.c:1300
#7  0x00007f19677328c6 in daos_event_priv_wait () at src/client/api/event.c:1205
#8  0x00007f1967736096 in dc_task_schedule (task=0x8a3be0, instant=instant@entry=true) at src/client/api/task.c:139
#9  0x00007f196773492c in daos_obj_fetch (oh=..., oh@entry=..., th=..., th@entry=..., flags=flags@entry=0, dkey=dkey@entry=0x7ffd6af1f6d0, nr=nr@entry=1, iods=iods@entry=0x7ffd6af1f6f0, sgls=sgls@entry=0x7ffd6af1f6b0, maps=maps@entry=0x0, ev=ev@entry=0x0) at src/client/api/object.c:170
#10 0x00007f19674f810a in fetch_entry (oh=oh@entry=..., th=..., th@entry=..., name=0x941808 "/", fetch_sym=fetch_sym@entry=true, exists=exists@entry=0x7ffd6af1f84f, entry=0x7ffd6af1f860) at src/client/dfs/dfs.c:329
#11 0x00007f19674fb4cf in entry_stat (dfs=dfs@entry=0x941770, th=th@entry=..., oh=..., name=name@entry=0x941808 "/", stbuf=stbuf@entry=0x7ffd6af1f9c0) at src/client/dfs/dfs.c:490
#12 0x00007f19675072e7 in dfs_stat (dfs=0x941770, parent=0x9417d8, name=0x941808 "/", stbuf=0x7ffd6af1f9c0) at src/client/dfs/dfs.c:2876
#13 0x00000000004012c3 in main ()
Regards, Shengyu.
From: <daos@daos.groups.io> on behalf of "Chaarawi, Mohamad" <mohamad.chaarawi@...>
Hi Shengyu,
If you don't call setgid(0), it works? I'm not sure why that would cause the operation not to return. Could you please attach gdb and get a trace of where it hangs? Do you see anything suspicious in the DAOS client log?
Thanks, Mohamad
From: <daos@daos.groups.io> on behalf of Shengyu SY19 Zhang <zhangsy19@...>
Hello,
Recently I hit this issue: when I call dfs_stat in my code, it never returns. I have found the basic cause, but I don't have a solution yet. This is the sample code:
rc = dfs_mount(dfs_poh, coh, O_RDWR, &dfs1);
if (rc != -DER_SUCCESS) {
        printf("Failed to mount to container (%d)\n", rc);
        D_GOTO(out_dfs, 0);
}
setgid(0);
struct stat stbuf = {0};
rc = dfs_stat(dfs1, NULL, NULL, (struct stat *)&stbuf);
if (rc)
        printf("stat '' failed, rc: %d\n", rc);
else
        printf("stat '' succeeded, rc: %d\n", rc);
With that setgid(0) call present, the problem always happens, even though it does not change the current gid. I'm working on a DAOS Samba plugin, and it performs lots of similar user-context switches.
Regards, Shengyu.
Re: Hugepages setting
Patrick Farrell <paf@...>
Anton,
This message is a little bit confusing – It just indicates there were no 1 GiB huge pages, which is fine. Other smaller huge pages were acquired successfully, and this doesn’t indicate any problem that will prevent you from running.
So you’re good to go - just need to format and the server should finish startup normally.
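For reference, the format step is a single dmg command (shown here with -i for insecure mode; the sample output below is copied from a similar single-node setup elsewhere in these threads, not from your node):
dmg -i storage format
localhost:10001: connected
localhost: storage format ok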
-Patrick
From: <daos@daos.groups.io> on behalf of "anton.brekhov@..." <anton.brekhov@...>
Hi everyone! I want to launch a local DAOS server on one node. I'm following the Docker installation guide (https://daos-stack.github.io/#admin/installation/) on CentOS 7. I've added the libfabric dependency to the Dockerfile and installed uio_pci_generic on the host. I also want to use DRAM as SCM, so I started the server with:
docker exec server daos_server start \
    -o /home/daos/daos/utils/config/examples/daos_server_local.yml
And I got this output:
daos_server logging to file /tmp/daos_control.log
ERROR: /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
DAOS Control Server (pid 560) listening on 0.0.0.0:10001
Waiting for DAOS I/O Server instance storage to be ready...
SCM format required on instance 0
Can I use DRAM without hugepages? If not, how do I need to configure it (a link to a guide would be enough)?
Thanks!
Re: Unable to run DAOS commands - Agent reports "no dRPC client set"
Patrick Farrell <paf@...>
mjmac,
Ah, that has it working again. Thanks much for the pointer.
Just out of curiosity, was any thought given to making this a reported failure? I see Niu's patch just corrects the misapplication.
It seems like an error in entering the whitelist (if I'm understanding correctly, perhaps the parameter is generated) is far from impossible, and the failure I experienced was silent on the server side.
I am not entirely clear on how the problem manifests itself - if the data plane truly doesn't start in this case, or if it fails when trying to access the device to actually do something when prompted by a client, or if there is some other issue - but this
seems like a condition that would be worth reporting in some way (unless it really is an internal sort of failure, which can only realistically occur due to applying the whitelist to the wrong kind of device rather than user config).
- Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Macdonald, Mjmac <mjmac.macdonald@...>
Sent: Tuesday, February 18, 2020 8:32 AM To: daos@daos.groups.io <daos@daos.groups.io> Subject: Re: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"
Hi Patrick.
A commit (18d31d) just landed to master this morning that will probably fix that issue. As part of the work you referenced, a new whitelist parameter is being used to ensure that each ioserver only has access to the devices specified in the configuration. Unfortunately, this doesn't work with emulated devices, so the fix is to avoid using the whitelist except with real devices. Sorry about that, hope this helps.
Best,
mjmac
Re: Tuning problem
Zhu, Minming
Hi Mohamad,
/home/spark/daos/_build.external/cart is the path to the cart build dir. The error message says that the libgurt.so file was not found, but it does exist in the local environment.
Local env:
This was the previous build on boro; ior could be executed there.
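One quick way to confirm where libgurt.so actually lives (a generic check; adjust the search root to your tree):
find /home/spark/daos -name 'libgurt.so*' 2>/dev/null
Whichever lib directory this prints is the one the configure test needs to see via -L, i.e. the --with-cart prefix should be its parent.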
Regards, Minmingz
From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:22 AM To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...> Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...> Subject: Re: Tuning problem
configure:5942: mpicc -std=gnu99 -o conftest -g -O2 -I/home/spark/daos/_build.external/cart/include/ -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt -lm >&5
/usr/bin/ld: cannot find -lgurt
collect2: error: ld returned 1 exit status
configure:5942: $? = 1
Are you sure /home/spark/daos/_build.external/cart is the path to your cart install dir? That seems like a path to the cart source dir.
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi Mohamad,
Yes, IOR was built with DAOS driver support. Command:
./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart
The attached file is config.log.
Regards, Minmingz
From: Chaarawi, Mohamad <mohamad.chaarawi@...>
That probably means that your IOR was not built with DAOS driver support. If you enabled that, I would check the config.log in your IOR build and see why.
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi Mohamad, thanks for your help.
IOR command:
/usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88
But I encountered a new problem.
Regards, Minmingz
From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Could you provide some info on the system you are running on? Do you have OPA there? You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
Output of fi_info and ifconfig would help.
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi guys, I have some questions about DAOS performance tuning.
IOR command:
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior
Normally this is the case :
2. About network performance: on the latest DAOS code (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows. Command:
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.
Please help solve the problem.
Regards, Minmingz
Re: Tuning problem
Chaarawi, Mohamad
configure:5942: mpicc -std=gnu99 -o conftest -g -O2 -I/home/spark/daos/_build.external/cart/include/ -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt -lm >&5
/usr/bin/ld: cannot find -lgurt
collect2: error: ld returned 1 exit status
configure:5942: $? = 1
Are you sure /home/spark/daos/_build.external/cart is the path to your cart install dir? That seems like a path to the cart source dir.
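As a sketch of what that would look like (the cart install prefix below is hypothetical; it should be whichever directory contains lib/libgurt.so and the cart headers):
./configure --prefix=/home/spark/cluster/ior_hpc \
            --with-daos=/home/spark/daos/install \
            --with-cart=/path/to/cart/install
In other words, --with-cart should point at an install prefix rather than at the cart source checkout.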
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi Mohamad,
Yes, IOR was built with DAOS driver support. Command:
./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart
The attached file is config.log.
Regards, Minmingz
From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:00 AM To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...> Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...> Subject: Re: Tuning problem
That probably means that your IOR was not built with DAOS driver support. If you enabled that, I would check the config.log in your IOR build and see why.
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi Mohamad, thanks for your help.
IOR command:
/usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88
But I encountered a new problem.
Regards, Minmingz
From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Could you provide some info on the system you are running on? Do you have OPA there? You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
Output of fi_info and ifconfig would help.
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi , Guys : I have some about daos performance tuning problems.
Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior
Normally this is the case :
2. About network performance: on the latest DAOS code (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows. Command:
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.
Please help solve the problem.
Regards, Minmingz
Re: Tuning problem
Zhu, Minming
Hi Mohamad,
Yes, IOR was built with DAOS driver support. Command:
./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart
The attached file is config.log.
Regards, Minmingz
From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:00 AM To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...> Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...> Subject: Re: Tuning problem
That probably means that your IOR was not built with DAOS driver support. If you enabled that, I would check the config.log in your IOR build and see why.
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi Mohamad, thanks for your help.
IOR command:
/usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88
But I encountered a new problem.
Regards, Minmingz
From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Could you provide some info on the system you are running on? Do you have OPA there? You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
Output of fi_info and ifconfig would help.
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi , Guys : I have some about daos performance tuning problems.
Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior
Normally this is the case :
2. About network performance: on the latest DAOS code (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows. Command:
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.
Please help solve the problem.
Regards, Minmingz
Re: Tuning problem
Chaarawi, Mohamad
That probably means that your IOR was not built with DAOS driver support. If you enabled that, I would check the config.log in your IOR build and see why.
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi Mohamad, thanks for your help.
IOR command:
/usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88
But I encountered a new problem.
Regards, Minmingz
From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...> Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...> Subject: Re: Tuning problem
Could you provide some info on the system you are running on? Do you have OPA there? You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
Output of fi_info and ifconfig would help.
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi , Guys : I have some about daos performance tuning problems.
Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior
Normally this is the case :
2. About network performance: on the latest DAOS code (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows. Command:
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.
Please help solve the problem.
Regards, Minmingz
Hugepages setting
anton.brekhov@...
Hi everyone! I want to launch a local DAOS server on one node. I'm following the Docker installation guide (https://daos-stack.github.io/#admin/installation/) on CentOS 7. I've added the libfabric dependency to the Dockerfile and installed uio_pci_generic on the host. I also want to use DRAM as SCM, so I started the server with:
docker exec server daos_server start \
    -o /home/daos/daos/utils/config/examples/daos_server_local.yml
And I got this output:
daos_server logging to file /tmp/daos_control.log
ERROR: /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
DAOS Control Server (pid 560) listening on 0.0.0.0:10001
Waiting for DAOS I/O Server instance storage to be ready...
SCM format required on instance 0
Can I use DRAM without hugepages? If not, how do I need to configure it (a link to a guide would be enough)?
Thanks!
Re: Tuning problem
Zhu, Minming
Hi Mohamad, thanks for your help.
IOR command:
/usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88
But I encountered a new problem.
Regards, Minmingz
From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...> Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...> Subject: Re: Tuning problem
Could you provide some info on the system you are running on? Do you have OPA there? You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
Output of fi_info and ifconfig would help.
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi , Guys : I have some about daos performance tuning problems.
Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior
Normally this is the case :
2. About network performance: on the latest DAOS code (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows. Command:
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.
Please help solve the problem.
Regards, Minmingz
Re: dfs_stat and infinitely loop
Chaarawi, Mohamad
Hi Shengyu,
If you don't call setgid(0), it works? I'm not sure why that would cause the operation not to return. Could you please attach gdb and get a trace of where it hangs? Do you see anything suspicious in the DAOS client log?
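For example, attaching to the hung client and dumping all thread stacks would be something like this (standard gdb usage; the pid is whatever your test process is):
gdb -p <pid-of-client>
(gdb) thread apply all bt
(gdb) detach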
Thanks, Mohamad
From: <daos@daos.groups.io> on behalf of Shengyu SY19 Zhang <zhangsy19@...>
Hello,
Recently I hit this issue: when I call dfs_stat in my code, it never returns. I have found the basic cause, but I don't have a solution yet. This is the sample code:
rc = dfs_mount(dfs_poh, coh, O_RDWR, &dfs1);
if (rc != -DER_SUCCESS) {
        printf("Failed to mount to container (%d)\n", rc);
        D_GOTO(out_dfs, 0);
}
setgid(0);
struct stat stbuf = {0};
rc = dfs_stat(dfs1, NULL, NULL, (struct stat *)&stbuf);
if (rc)
        printf("stat '' failed, rc: %d\n", rc);
else
        printf("stat '' succeeded, rc: %d\n", rc);
With that setgid(0) call present, the problem always happens, even though it does not change the current gid. I'm working on a DAOS Samba plugin, and it performs lots of similar user-context switches.
Regards, Shengyu.
Re: Tuning problem
Chaarawi, Mohamad
Could you provide some info on the system you are running on? Do you have OPA there? You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
Output of fi_info and ifconfig would help.
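For example, applied to your earlier orterun invocation it would look roughly like this (a sketch; same binary and environment as before, with only the MPI transport-selection flags added):
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 \
    --mca pml ob1 --mca btl tcp,self --mca oob tcp --mca mtl ^psm2,ofi \
    -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 \
    /home/spark/cluster/ior_hpc/bin/ior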
Thanks, Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Hi , Guys : I have some about daos performance tuning problems.
Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior
Normally this is the case :
2. About network performance: on the latest DAOS code (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows. Command:
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.
Please help solve the problem.
Regards, Minmingz
Re: Unable to run DAOS commands - Agent reports "no dRPC client set"
Farrell, Patrick Arthur <patrick.farrell@...>
Tom,
You've probably seen it, but if not, fyi that mjmac pointed me to commit 18d31d,
which landed yesterday and resolved the issue for me.
Thanks for taking a look!
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Nabarro, Tom <tom.nabarro@...>
Sent: Wednesday, February 19, 2020 7:17 AM To: daos@daos.groups.io <daos@daos.groups.io> Subject: Re: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"
Hello Patrick, I'm looking into this now.
On 17 Feb 2020 22:52, Patrick Farrell <paf@...> wrote:
I finally gave up and bisected this.
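For anyone repeating this, the bisect itself is the standard git workflow (a sketch; the good commit is a placeholder for whichever older commit worked for you):
git bisect start
git bisect bad HEAD
git bisect good <last-known-good-commit>
# rebuild, restart daos_server, re-run daos_test, then mark the result:
git bisect good    # or: git bisect bad
# repeat until git reports the first bad commit, then clean up:
git bisect reset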
This problem started with "DAOS-4034 control: enable vfio permissions for non-root (#1785)", commit 14c7c2e06512659f4122a01c57e82ad58ee642b0.
Looking at it, it does a variety of things, and I'm not having any luck tracking down what's broken by this change. I made sure to enable the vfio driver as mentioned in the patch notes, but I'm not seeing any change.
One note. I am running as root, because that has been the easiest set up so far.
Is running as root perhaps broken with this patch?
- Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Sent: Wednesday, February 12, 2020 11:18 AM To: daos@daos.groups.io <daos@daos.groups.io> Subject: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"
Good morning,
I've just moved up to latest tip of tree DAOS (I'm not sure exactly which commit I was running before, a week or two out of date), and I can't get any tests to run.
I've pared back to a trivial config, and I appear to be able to start the server, etc, but the agent claims the data plane is not running and I'm not having a lot of luck troubleshooting.
Here's my server startup command & output:
/root/daos/install/bin/daos_server start -o /root/daos/utils/config/examples/daos_server_local.yml
/root/daos/install/bin/daos_server logging to file /tmp/daos_control.log
ERROR: /root/daos/install/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
DAOS Control Server (pid 22075) listening on 0.0.0.0:10001
Waiting for DAOS I/O Server instance storage to be ready...
SCM format required on instance 0
formatting storage for DAOS I/O Server instance 0 (reformat: false)
Starting format of SCM (ram:/mnt/daos)
Finished format of SCM (ram:/mnt/daos)
Starting format of kdev block devices (/dev/sdl1)
Finished format of kdev block devices (/dev/sdl1)
DAOS I/O Server instance 0 storage ready
SCM @ /mnt/daos: 16.00GB Total/16.00GB Avail
Starting I/O server instance 0: /root/daos/install/bin/daos_io_server
daos_io_server:0 Using legacy core allocation algorithm
As you can see, I format and the server appears to start normally.
Here's that format command output:
dmg -i storage format
localhost:10001: connected
localhost: storage format ok
I run the agent, and it appears OK:
daos_agent -i
Starting daos_agent:
Using logfile: /tmp/daos_agent.log
Listening on /var/run/daos_agent/agent.sock
But when I try to run daos_test, everything it attempts fails, and the agent prints this message over and over:
ERROR: HandleCall for 2:206 failed: GetAttachInfo hl-d102:10001 {daos_server {} [] 13}: rpc error: code = Unknown desc = no dRPC client set (data plane not started?)
I believe I've got the environment variables set up correctly everywhere, and I have not configured access_points, etc - This is a trivial single server config.
This is the entirety of my file based config changes:
--- a/utils/config/examples/daos_server_local.yml
+++ b/utils/config/examples/daos_server_local.yml
@@ -14,7 +14,7 @@ servers:
targets: 1
first_core: 0
nr_xs_helpers: 0
- fabric_iface: eth0
+ fabric_iface: enp6s0
fabric_iface_port: 31416
log_file: /tmp/daos_server.log
@@ -31,8 +31,8 @@ servers:
# The size of ram is specified by scm_size in GB units.
scm_mount: /mnt/daos # map to -s /mnt/daos
scm_class: ram
- scm_size: 4
+ scm_size: 16
- bdev_class: file
- bdev_size: 16
- bdev_list: [/tmp/daos-bdev]
+ bdev_class: kdev
+ bdev_size: 64
+ bdev_list: [/dev/sdl1]
---------
Any clever ideas what's wrong here? Is there a command or config change I missed?
Thanks,
-Patrick
Re: Unable to run DAOS commands - Agent reports "no dRPC client set"
Hello Patrick, I'm looking into this now.
On 17 Feb 2020 22:52, Patrick Farrell <paf@...> wrote:
I finally gave up and bisected this.
This problem started with "DAOS-4034 control: enable vfio permissions for non-root (#1785)", commit 14c7c2e06512659f4122a01c57e82ad58ee642b0.
Looking at it, it does a variety of things, and I'm not having any luck tracking down what's broken by this change. I made sure to enable the vfio driver as mentioned in the patch notes, but I'm not seeing any change.
One note. I am running as root, because that has been the easiest set up so far.
Is running as root perhaps broken with this patch?
- Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Sent: Wednesday, February 12, 2020 11:18 AM To: daos@daos.groups.io <daos@daos.groups.io> Subject: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"
Good morning,
I've just moved up to latest tip of tree DAOS (I'm not sure exactly which commit I was running before, a week or two out of date), and I can't get any tests to run.
I've pared back to a trivial config, and I appear to be able to start the server, etc, but the agent claims the data plane is not running and I'm not having a lot of luck troubleshooting.
Here's my server startup command & output:
/root/daos/install/bin/daos_server start -o /root/daos/utils/config/examples/daos_server_local.yml
/root/daos/install/bin/daos_server logging to file /tmp/daos_control.log
ERROR: /root/daos/install/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
DAOS Control Server (pid 22075) listening on 0.0.0.0:10001
Waiting for DAOS I/O Server instance storage to be ready...
SCM format required on instance 0
formatting storage for DAOS I/O Server instance 0 (reformat: false)
Starting format of SCM (ram:/mnt/daos)
Finished format of SCM (ram:/mnt/daos)
Starting format of kdev block devices (/dev/sdl1)
Finished format of kdev block devices (/dev/sdl1)
DAOS I/O Server instance 0 storage ready
SCM @ /mnt/daos: 16.00GB Total/16.00GB Avail
Starting I/O server instance 0: /root/daos/install/bin/daos_io_server
daos_io_server:0 Using legacy core allocation algorithm
As you can see, I format and the server appears to start normally.
Here's that format command output:
dmg -i storage format
localhost:10001: connected
localhost: storage format ok
I run the agent, and it appears OK:
daos_agent -i
Starting daos_agent:
Using logfile: /tmp/daos_agent.log
Listening on /var/run/daos_agent/agent.sock
But when I try to run daos_test, everything it attempts fails, and the agent prints this message over and over:
ERROR: HandleCall for 2:206 failed: GetAttachInfo hl-d102:10001 {daos_server {} [] 13}: rpc error: code = Unknown desc = no dRPC client set (data plane not started?)
I believe I've got the environment variables set up correctly everywhere, and I have not configured access_points, etc - This is a trivial single server config.
This is the entirety of my file based config changes:
--- a/utils/config/examples/daos_server_local.yml
+++ b/utils/config/examples/daos_server_local.yml
@@ -14,7 +14,7 @@ servers:
targets: 1
first_core: 0
nr_xs_helpers: 0
- fabric_iface: eth0
+ fabric_iface: enp6s0
fabric_iface_port: 31416
log_file: /tmp/daos_server.log
@@ -31,8 +31,8 @@ servers:
# The size of ram is specified by scm_size in GB units.
scm_mount: /mnt/daos # map to -s /mnt/daos
scm_class: ram
- scm_size: 4
+ scm_size: 16
- bdev_class: file
- bdev_size: 16
- bdev_list: [/tmp/daos-bdev]
+ bdev_class: kdev
+ bdev_size: 64
+ bdev_list: [/dev/sdl1]
---------
Any clever ideas what's wrong here? Is there a command or config change I missed?
Thanks,
-Patrick
dfs_stat and infinitely loop
Shengyu SY19 Zhang
Hello,
Recently I hit this issue: when I call dfs_stat in my code, it never returns. I have found the basic cause, but I don't have a solution yet. This is the sample code:
rc = dfs_mount(dfs_poh, coh, O_RDWR, &dfs1);
if (rc != -DER_SUCCESS) {
        printf("Failed to mount to container (%d)\n", rc);
        D_GOTO(out_dfs, 0);
}
setgid(0);
struct stat stbuf = {0};
rc = dfs_stat(dfs1, NULL, NULL, (struct stat *)&stbuf);
if (rc)
        printf("stat '' failed, rc: %d\n", rc);
else
        printf("stat '' succeeded, rc: %d\n", rc);
With that setgid(0) call present, the problem always happens, even though it does not change the current gid. I'm working on a DAOS Samba plugin, and it performs lots of similar user-context switches.
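For context, the kind of per-request identity switching such a plugin does looks roughly like the sketch below (generic POSIX code, not the actual plugin; the helper, its callback, and the uid/gid values are all illustrative). The hang appears as soon as any such call is made after dfs_mount():
#include <unistd.h>
#include <sys/types.h>

/* Run op() under the given effective uid/gid, then restore the caller's IDs.
 * Sketch only: real Samba code also switches supplementary groups and
 * checks every return value. */
static int run_as_user(uid_t uid, gid_t gid, int (*op)(void *), void *arg)
{
        uid_t saved_euid = geteuid();
        gid_t saved_egid = getegid();
        int rc;

        if (setegid(gid) != 0 || seteuid(uid) != 0)
                return -1;

        rc = op(arg);              /* e.g. a wrapper around dfs_stat() */

        seteuid(saved_euid);       /* restore euid first to regain privilege */
        setegid(saved_egid);
        return rc;
}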
Regards, Shengyu.
Re: How to configure IB with multiple mlx4 devices per server
Latham, Robert J.
On Sun, 2020-02-16 at 16:28 +0000, Kevan Rehm wrote:
Hi Kevan: I don't have much in the way of solutions, but yes, you are not the first person to want to use multiple IB devices on each node. The Oak Ridge gang solved this in a different way, using PAMI directives: https://dl.acm.org/doi/10.1145/3295500.3356166
The libfabric equivalent would be "multi rail", I think, but I haven't been able to construct a correct FI_OFI_MRAIL_ADDR environment variable describing the IB ports. Maybe it's easier to describe the ports on your cluster than it was for me on Summit. https://ofiwg.github.io/libfabric/master/man/fi_mrail.7.html
==rob
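For reference, the man page linked above describes FI_OFI_MRAIL_ADDR as a comma-separated list of per-rail addresses or interface names, so the attempt has this general shape (placeholder interfaces/addresses, and as noted above not yet proven to work):
export FI_OFI_MRAIL_ADDR=ib0,ib1
# or with one explicit address per rail:
export FI_OFI_MRAIL_ADDR=10.0.0.1,10.0.1.1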
Tuning problem
Zhu, Minming
Hi guys, I have some questions about DAOS performance tuning.
IOR command:
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior
Normally this is the case :
2. About network performance: on the latest DAOS code (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows. Command:
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.
Please help solve the problem.
Regards, Minmingz
Re: Unable to run DAOS commands - Agent reports "no dRPC client set"
Macdonald, Mjmac
Hi Patrick.
In this case the BIO whitelists are generated internally from the per-ioserver bdev lists, independent of the bdev_include config parameter. You do raise a good point though – we should take a look to see what would/should happen if the whitelist supplied from the server config is wrong.
mjmac
From: Patrick Farrell <paf@...>
Sent: Tuesday, 18 February, 2020 09:49 To: Macdonald, Mjmac <mjmac.macdonald@...>; daos@daos.groups.io Subject: Re: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"
mjmac,
Ah, that has it working again. Thanks much for the pointer.
Just out of curiosity, was any thought given to making this a reported failure? I see Niu's patch just corrects the misapplication.
It seems like an error in entering the whitelist (if I'm understanding correctly, perhaps the parameter is generated) is far from impossible, and the failure I experienced was silent on the server side.
I am not entirely clear on how the problem manifests itself - if the data plane truly doesn't start in this case, or if it fails when trying to access the device to actually do something when prompted by a client, or if there is some other issue - but this seems like a condition that would be worth reporting in some way (unless it really is an internal sort of failure, which can only realistically occur due to applying the whitelist to the wrong kind of device rather than user config).
- Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Macdonald, Mjmac <mjmac.macdonald@...>
Hi Patrick.
Re: Unable to run DAOS commands - Agent reports "no dRPC client set"
Macdonald, Mjmac
Hi Patrick.
A commit (18d31d) just landed to master this morning that will probably fix that issue. As part of the work you referenced, a new whitelist parameter is being used to ensure that each ioserver only has access to the devices specified in the configuration. Unfortunately, this doesn't work with emulated devices, so the fix is to avoid using the whitelist except with real devices. Sorry about that, hope this helps.
Best,
mjmac