Error on simple test on POSIX container
Hi,
I created a POSIX container, mounted it at /mnt/dfuse on the client node, and ran the following commands:
```
# echo "foo" > /mnt/dfuse/bar
# cat /mnt/dfuse/bar
```
But it gives me the following error, repeated endlessly:
object ERR src/object/cli_shard.c:631 dc_rw_cb() rpc 0x7ffa3801d6e0 opc 1 to rank 0 tag 7 failed: DER_HG(-1020): 'Transport layer mercury error'
OS: Ubuntu 20.04
Network: InfiniBand with MOFED 5.0-2
DAOS version: c20c47 (commit from 2020-11-28)
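For context, the sequence below is roughly how such a test is set up; it is only an illustrative sketch, and the UUID variables, the --svc rank list, and the exact dfuse flags are assumptions that vary between DAOS versions rather than details taken from the report above.
```
# Illustrative reproduction sketch (placeholders, not from the report above):
dmg -i pool create --scm-size=8G                           # note the pool UUID and svc ranks it prints
daos cont create --pool=$POOL_UUID --svc=0 --type=POSIX    # note the container UUID
dfuse --pool=$POOL_UUID --svc=0 --container=$CONT_UUID -m /mnt/dfuse
echo "foo" > /mnt/dfuse/bar
cat /mnt/dfuse/bar
```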
|
|
DUG'20 slides are available!
Lombardi, Johann
Hi there,
I have posted all the DUG presentations on the wiki (see https://wiki.hpdd.intel.com/display/DC/DUG20). We need some more time for the video recordings, which will be published on our YouTube channel (https://www.youtube.com/channel/UCVP4e_UTnSJg15Cm80UtNwg) when ready.
Cheers, Johann
|
|
Re: DUG'20 Agenda Online
Carrier, John
Note that the time for DUG listed in the SC2020 schedule is not correct. Please use the webex info below.
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of
Lombardi, Johann
Sent: Tuesday, November 17, 2020 11:29 PM To: daos@daos.groups.io Subject: Re: [daos] DUG'20 Agenda Online
Just a reminder that the DUG’20 is tomorrow.
Hope to see you there!
Cheers, Johann
From:
<daos@daos.groups.io> on behalf of "Lombardi, Johann" <johann.lombardi@...>
Hi there,
Please note that the agenda for the 4th annual DAOS User Group meeting is now available online: https://wiki.hpdd.intel.com/display/DC/DUG20
I am very excited by the diversity and number of presentations this year. A big thank you to all the presenters.
As a reminder, the DUG is virtual this year:
- On Nov 19
- Starts at 7:30am Pacific / 8:30am Mountain / 9:30am Central / 4:30pm CET / 11:30pm China
- 3.5 hours of live presentations
- Please see the instructions on how to join in the webex invite
We also encourage everyone to join the #community slack channel for side discussions between attendees/presenters after the event.
Hope to see you there!
Best regards, Johann
|
|
Re: DUG'20 Agenda Online
Lombardi, Johann
Just a reminder that the DUG’20 is tomorrow.
Hope to see you there!
Cheers, Johann
From:
<daos@daos.groups.io> on behalf of "Lombardi, Johann" <johann.lombardi@...>
Hi there,
Please note that the agenda for the 4th annual DAOS User Group meeting is now available online: https://wiki.hpdd.intel.com/display/DC/DUG20
I am very excited by the diversity and number of presentations this year. A big thank you to all the presenters.
As a reminder, the DUG is virtual this year:
- On Nov 19
- Starts at 7:30am Pacific / 8:30am Mountain / 9:30am Central / 4:30pm CET / 11:30pm China
- 3.5 hours of live presentations
- Please see the instructions on how to join in the webex invite
We also encourage everyone to join the #community slack channel for side discussions between attendees/presenters after the event.
Hope to see you there!
Best regards, Johann
|
|
Re: Install problem
fhoa@...
Did you find a workaround for this problem? I am experiencing the same problem when trying to set up on an Ubuntu 20.04.1 OS. Commands I tried to run:
$ git clone https://github.com/daos-stack/daos
"/usr/bin/ld: warning: libna.so.2, needed by /usr/prereq/dev/mercury/lib/libmercury.so.2, not found (try using -rpath or -rpath-link)
/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Error_to_string'
/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Addr_free'
/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Mem_handle_create_segments'
/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Op_create'
/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Mem_handle_free'
[...]"
|
|
Re: Install problem
nicolau.manubens@...
Thanks for your help. I have tried the Ubuntu and Leap dockerfiles too. Leap worked fine. The Ubuntu one failed with a similar error when compiling acl_dump_test; I leave a snippet of the error below. Although I can continue with the Leap one for now, it would still be good to have the CentOS one working for tests, as our final DAOS system will be deployed on machines running CentOS. Nicolau
/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Addr_free'
/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Mem_handle_create_segments'
/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Op_create'
/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Mem_handle_free'
[...]
|
|
Re: Install problem
The logic in utils/sl for scons should automatically detect that the installed libfabric version is not suitable and build a suitable version. I'm trying it locally to see what is going on.
-Jeff
From: <daos@daos.groups.io> on behalf of "maureen.jean@..." <maureen.jean@...>
Yes, you need a later version of libfabric, preferably 1.11. At a minimum, you need a libfabric that supports ABI 1.3 (FABRIC 1.3).
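For reference, a quick way to check which libfabric version the build environment sees; this assumes the fi_info utility (shipped with libfabric) and/or its pkg-config file are installed:
```
# Print the installed libfabric version and its API level.
fi_info --version
# Alternative, if the libfabric pkg-config file is installed:
pkg-config --modversion libfabric
```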
|
|
Re: Install problem
maureen.jean@...
Yes, you need a later version of libfabric, preferably 1.11. At a minimum, you need a libfabric that supports ABI 1.3 (FABRIC 1.3).
|
|
Re: Install problem
The Dockerfile I am taking from master installs libfabric 1.7 in the image. Should I modify the scons script in order to replace the libfabric version?
|
|
Re: Install problem
maureen.jean@...
What version of libfabric are you using? Try using libfabric >= 1.11
/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_dupinfo@...'
/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_freeinfo@...'
/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_getinfo@...'
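If the distro does not provide libfabric >= 1.11, one option is to build it from source; a minimal sketch, assuming autotools are available and /opt/libfabric is an acceptable install prefix (the prefix and tag are examples, not taken from this thread):
```
# Build and install libfabric 1.11.x from source (prefix is an example).
git clone --branch v1.11.1 https://github.com/ofiwg/libfabric.git
cd libfabric
./autogen.sh
./configure --prefix=/opt/libfabric
make -j && sudo make install
# Then make sure the mercury/DAOS build and LD_LIBRARY_PATH pick up /opt/libfabric/lib.
```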
|
|
Re: Install problem
nicolau.manubens@...
Hello, I am finding a similar error also when trying to build the DAOS docker image:
wget https://raw.githubusercontent.com/daos-stack/daos/master/utils/docker/Dockerfile.centos.7
docker build --no-cache -t daos -f ./Dockerfile.centos.7 .
The Dockerfile is being pulled from the master branch, e.g. commit 4cbb16cf8edc9ddf5c7503b4448bf897c8331ea3. The output follows:
gcc -o build/dev/gcc/src/tests/security/acl_dump_test -Wl,-rpath-link=build/dev/gcc/src/gurt -Wl,-rpath-link=build/dev/gcc/src/cart -Wl,--enable-new-dtags -Wl,-rpath-link=/home/daos/daos/build/dev/gcc/src/gurt -Wl,-rpath-link=/usr/prereq/dev/pmdk/lib -Wl,-rpath-link=/usr/prereq/dev/isal/lib -Wl,-rpath-link=/usr/prereq/dev/isal_crypto/lib -Wl,-rpath-link=/usr/prereq/dev/argobots/lib -Wl,-rpath-link=/usr/prereq/dev/protobufc/lib -Wl,-rpath-link=/usr/lib64 -Wl,-rpath=/usr/lib -Wl,-rpath=\$ORIGIN/../../home/daos/daos/build/dev/gcc/src/gurt -Wl,-rpath=\$ORIGIN/../prereq/dev/pmdk/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/isal/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/isal_crypto/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/argobots/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/protobufc/lib -Wl,-rpath=\$ORIGIN/../lib64 build/dev/gcc/src/tests/security/acl_dump_test.o -Lbuild/dev/gcc/src/gurt -Lbuild/dev/gcc/src/cart/swim -Lbuild/dev/gcc/src/cart -Lbuild/dev/gcc/src/common -L/usr/prereq/dev/pmdk/lib -L/usr/prereq/dev/isal/lib -L/usr/prereq/dev/isal_crypto/lib -Lbuild/dev/gcc/src/bio -Lbuild/dev/gcc/src/bio/smd -Lbuild/dev/gcc/src/vea -Lbuild/dev/gcc/src/vos -Lbuild/dev/gcc/src/mgmt -Lbuild/dev/gcc/src/pool -Lbuild/dev/gcc/src/container -Lbuild/dev/gcc/src/placement -Lbuild/dev/gcc/src/dtx -Lbuild/dev/gcc/src/object -Lbuild/dev/gcc/src/rebuild -Lbuild/dev/gcc/src/security -Lbuild/dev/gcc/src/client/api -Lbuild/dev/gcc/src/control -L/usr/prereq/dev/argobots/lib -L/usr/prereq/dev/protobufc/lib -lpmemobj -lisal -lisal_crypto -labt -lprotobuf-c -lhwloc -ldaos -ldaos_common -lgurt
/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_dupinfo@...'
/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_freeinfo@...'
/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_getinfo@...'
collect2: error: ld returned 1 exit status
scons: *** [build/dev/gcc/src/tests/security/acl_dump_test] Error 1
scons: building terminated because of errors.
The command '/bin/sh -c if [ "x$NOBUILD" = "x" ] ; then scons --build-deps=yes install PREFIX=/usr; fi' returned a non-zero code: 2
I have also tried pulling the version right after the pull request was merged, and building, with no success:
cd daos
git checkout 5c887623f0013241d27b8daad1813a3444abf718
cd utils/docker
docker build --no-cache -t daos -f ./Dockerfile.centos.7 .
[...]
Step 28/34 : RUN if [ "x$NOBUILD" = "x" ] ; then scons --build-deps=yes install PREFIX=/usr; fi
 ---> Running in 6fa6d1725ecd
scons: Reading SConscript files ...
ImportError: No module named distro:
  File "/home/daos/daos/SConstruct", line 16:
    import daos_build
  File "/home/daos/daos/utils/daos_build.py", line 4:
    from env_modules import load_mpi
  File "/home/daos/daos/site_scons/env_modules.py", line 27:
    import distro
Please let me know if you have any further hints.
Regards, Nicolau
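For reference, that ImportError just means the Python "distro" module is missing from the build environment; a minimal sketch of the usual fix, assuming pip is available in the image or on the build host:
```
# Install the Python "distro" module required by the scons build scripts.
pip install distro    # or: pip3 install distro
```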
|
|
Tutorial videos
Lombardi, Johann
Hi there,
FYI, we have posted several tutorial videos on the newly created DAOS YouTube channel (see https://www.youtube.com/channel/UCVP4e_UTnSJg15Cm80UtNwg). This includes how to install RPMs, configure a DAOS system with the control plane or get started with the DAOS FileSystem layer. I would like to thank all the DAOS engineers who contributed to those videos!
We also had a lot of pre-existing YouTube videos that are now gathered under a single playlist, see https://www.youtube.com/playlist?list=PLkLsgO4eC8RKLeRi50e-HE88ezcp3hg48.
Hope this helps. Johann
ps: DUG’20 is approaching! Check https://wiki.hpdd.intel.com/display/DC/DUG20 for more information.
|
|
update ipmctl
We have recently learned of a serious issue affecting systems with Apache Pass DCPM modules. The ipmctl utility (versions 02.00.00.3809 through 02.00.00.3816) was released with a bug that can put the modules into a state that requires a physical procedure to recover them. To avoid this issue, please ensure that you are running version v02.00.00.3820 or later when running DAOS to prepare persistent memory modules.
More details: when running DAOS and executing the dmg storage prepare command to set persistent memory modules into AppDirect mode, ipmctl (which is forked at runtime) attempts to set a PMem configuration goal and fails if an ipmctl version in the 02.00.00.3809-3816 range is installed. The failure is due to a corruption of the PCD on the DIMMs and results in system POST memory checks failing with a fatal error; that situation then requires recovery of the persistent memory modules. ipmctl was fixed in a commit released in version v02.00.00.3820.
Prevention: after installing DAOS and before running "dmg|daos_server storage prepare", please upgrade ipmctl to the most recent distro-provided version with your preferred package manager, e.g. sudo yum update ipmctl.
Work is in progress to add the necessary version checks to DAOS.
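For reference, checking the installed version before preparing PMem only takes a moment; the yum line is the one suggested above, and other distros have their own equivalent:
```
# Check the installed ipmctl version; anything in the 02.00.00.3809-3816 range should be upgraded first.
ipmctl version
sudo yum update ipmctl
```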
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
|
|
Re: DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected
Ari
Hi Alex,
That’s good info, I’ll give it a go. We were hoping to avoid customizing the stack in this particular environment unless required, and this definitely qualifies.
Thanks
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of
Oganezov, Alexander A
Sent: Monday, November 2, 2020 10:58 To: daos@daos.groups.io Subject: Re: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected
[EXTERNAL EMAIL] Hi Ari,
Is it possible for you to install more recent MOFED on your system? In the past we’ve had issues with MOFEDs older than 4.7; locally we’ve been using MOFED 5.0.2.
Thanks, ~~Alex.
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Ari
Thanks for the feedback, and agreed. The workaround was to make sure the sysctl settings were applied for the same logical subnet; I tried them in a few combinations (output is in the previous email). The other errors I saw were related to memory, and I haven't found a match for those yet.
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Farrell, Patrick Arthur
[EXTERNAL EMAIL] Ari,
You say this looks different from the stuff in the spring, but unless you've noticed a difference in the specific error messages, this looks (at a high level) identical to me. The issue in the spring was communication between two servers on the same node, where they were unable to route to each other.
Yours is a communication issue between the two servers, saying UNREACHABLE, which is presumably a routing failure. You will have to reference those conversations for details - I have no expertise in the specific issues - but there appear to be IB routing issues here, just as then.
-Patrick From:
daos@daos.groups.io <daos@daos.groups.io> on behalf of Ari <ari.martinez@...>
Hi,
We’ve been testing DAOS functionality in our lab and have had success running IOR with 8 clients, with and without POSIX containers, using a single IO instance and a single HCA within one server, on both version 1.0.1 (RPMs) and 1.1.1 (built from source). When we create two IO instances within one server (one per CPU/HCA), communication errors arise just from trying to create a pool. The same problem occurs on two different physical servers with OFED 4.6.2 on RHEL 7.6 & 7.8 kernels. I saw a post around spring regarding communication issues with multiple IO instances, but this seems different. In addition to the snippets below, the server configuration and log files are attached. Still trying out a couple more things, but wanted to reach out in the meantime.
Any ideas as to debugging this problem with multirail or any gotchas encountered?
Workflow from previous successful tests, using the same server with one IO instance: the same server yml file works fine if I comment out either of the two IO server stanzas. I perform wipefs, prepare, and format for good measure before all of the tests above, and I have applied the pre-deployment sysctl ARP settings for multirail with two HCAs in the same subnet.
dac007$ dmg -i system query
Rank  State
----  -----
[0-1] Joined
dac007$ dmg -i pool create --scm-size=600G --nvme-size=6TB
Creating DAOS pool with 600 GB SCM and 6.0 TB NVMe storage (10.00 % ratio)
Pool-create command FAILED: pool create failed: DAOS error (-1006): DER_UNREACH
ERROR: dmg: pool create failed: DAOS error (-1006): DER_UNREACH
Network: eth0 10.140.0.7/16, ib0 10.10.0.7/16, ib1 10.10.1.7/16
net.ipv4.conf.ib0.rp_filter = 2
net.ipv4.conf.ib1.rp_filter = 2
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.ib0.accept_local = 1
net.ipv4.conf.ib1.accept_local = 1
net.ipv4.conf.all.accept_local = 1
net.ipv4.conf.ib0.arp_ignore = 2
net.ipv4.conf.ib1.arp_ignore = 2
net.ipv4.conf.all.arp_ignore = 2
LOGS:
dac007$ wc -l /tmp/daos_*.log
71 /tmp/daos_control.log
5970 /tmp/daos_server-core0.log
35 /tmp/daos_server-core1.log
These errors repeat in one of the server logs:
10/22-09:11:10.59 dac007 DAOS[194223/194230] vos DBUG src/vos/vos_io.c:1035 akey_fetch() akey [8] ???? fetch single epr 5-16b
10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 574 connection has been terminated
10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=574
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # NA -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/na/na_ofi.c:3881 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -110 (Connection timed out)
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:1828 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR)
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:4238 # HG_Core_forward(): Could not forward buffer
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury.c:1933 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR)
10/22-09:11:12.59 dac007 DAOS[194223/194230] hg ERR src/cart/crt_hg.c:1083 crt_hg_req_send(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] HG_Forward failed, hg_ret: 12
10/22-09:11:12.59 dac007 DAOS[194223/194230] rdb WARN src/rdb/rdb_raft.c:1980 rdb_timerd() 64616f73[0]: not scheduled for 1.280968 second
10/22-09:11:12.59 dac007 DAOS[194223/194230] daos INFO src/iosrv/drpc_progress.c:295 drpc_handler_ult() dRPC handler ULT for module=2 method=209
10/22-09:11:12.59 dac007 DAOS[194223/194230] mgmt INFO src/mgmt/srv_drpc.c:1989 ds_mgmt_drpc_set_up() Received request to setup server
10/22-09:11:12.59 dac007 DAOS[194223/194230] server INFO src/iosrv/init.c:388 dss_init_state_set() setting server init state to 1
10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:572 server_init() Modules successfully set up
10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:575 server_init() Service fully up
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:798 crt_context_timeout_check(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] ctx_id 0, (status: 0x3f) timed out, tgt rank 1, tag 0
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:744 crt_req_timeout_hdlr(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] failed due to group daos_server, rank 1, tgt_uri ofi+verbs;ofi_rxm://10.10.0.7:31417 can't reach the target
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] RPC failed; rc: -1006
10/22-09:11:12.59 dac007 DAOS[194223/194230] corpc ERR src/cart/crt_corpc.c:646 crt_corpc_reply_hdlr() RPC(opc: 0x101000b) error, rc: -1006.
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac10185b0) [opc=0x101000b rpcid=0x5a136b7800000000 rank:tag=0:0] RPC failed; rc: -1006
10/22-09:11:12.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 575 connection has been terminated
control.log:
Management Service access point started (bootstrapped)
daos_io_server:0 DAOS I/O server (v1.1.1) process 194223 started on rank 0 with 6 target, 0 helper XS, firstcore 0, host dac007. Using NUMA node: 0
DEBUG 09:11:13.492263 member.go:297: adding system member: 10.140.0.7:10001/1/Joined
DEBUG 09:11:13.492670 mgmt_client.go:160: join(dac007:10001, {Uuid:c611d86e-dcaf-44c1-86f0-5d275cf941a3 Rank:1 Uri:ofi+verbs;ofi_rxm://10.10.0.7:31417 Nctxs:7 Addr:0.0.0.0:10001 SrvFaultDomain:/dac007 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}) end
daos_io_server:1 DAOS I/O server (v1.1.1) process 194226 started on rank 1 with 6 target, 0 helper XS, firstcore 0, host dac007. Using NUMA node: 1
DEBUG 09:13:55.157399 ctl_system.go:187: Received SystemQuery RPC
DEBUG 09:13:55.157863 system.go:645: DAOS system ping-ranks request: &{unaryRequest:{request:{HostList:[10.140.0.7:10001]} rpc:0xbb05e0} Ranks:0-1 Force:false}
DEBUG 09:13:55.158148 rpc.go:183: request hosts: [10.140.0.7:10001]
DEBUG 09:13:55.160340 mgmt_system.go:308: MgmtSvc.PingRanks dispatch, req:{Force:false Ranks:0-1 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:13:55.161685 mgmt_system.go:320: MgmtSvc.PingRanks dispatch, resp:{Results:[state:3 rank:1 state:3 ] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:13:55.163257 ctl_system.go:213: Responding to SystemQuery RPC: members:<addr:"10.140.0.7:10001" uuid:"226f2871-06d9-4616-9d08-d386992a059b" state:4 > members:<addr:"10.140.0.7:10001" uuid:"c611d86e-dcaf-44c1-86f0-5d275cf941a3" rank:1 state:4 >
DEBUG 09:14:05.417780 mgmt_pool.go:56: MgmtSvc.PoolCreate dispatch, req:{Scmbytes:600000000000 Nvmebytes:6000000000000 Ranks:[] Numsvcreps:1 User:root@ Usergroup:root@ Uuid:de34a760-9db6-4c2f-9709-7592ce880b49 Sys:daos_server Acl:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:14:08.282190 mgmt_pool.go:84: MgmtSvc.PoolCreate dispatch resp:{Status:-1006 Svcreps:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
|
|
Re: DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected
Oganezov, Alexander A
Hi Ari,
Is it possible for you to install more recent MOFED on your system? In the past we’ve had issues with MOFEDs older than 4.7; locally we’ve been using MOFED 5.0.2.
Thanks, ~~Alex.
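For reference, the installed MOFED release can be checked with the ofed_info utility that ships with Mellanox OFED:
```
# Print the installed Mellanox OFED release string.
ofed_info -s
```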
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of
Ari
Sent: Thursday, October 22, 2020 8:34 AM To: daos@daos.groups.io Subject: Re: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected
Thanks for the feedback, and agreed. The workaround was to make sure the sysctl settings were applied for the same logical subnet; I tried them in a few combinations (output is in the previous email). The other errors I saw were related to memory, and I haven't found a match for those yet.
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Farrell, Patrick Arthur
[EXTERNAL EMAIL] Ari,
You say this looks different from the stuff in the spring, but unless you've noticed a difference in the specific error messages, this looks (at a high level) identical to me. The issue in the spring was communication between two servers on the same node, where they were unable to route to each other.
Yours is a communication issue between the two servers, saying UNREACHABLE, which is presumably a routing failure. You will have to reference those conversations for details - I have no expertise in the specific issues - but there appear to be IB routing issues here, just as then.
-Patrick From:
daos@daos.groups.io <daos@daos.groups.io> on behalf of Ari <ari.martinez@...>
Hi,
We’ve been testing DAOS functionality in our lab and have had success running IOR with 8 clients, with and without POSIX containers, using a single IO instance and a single HCA within one server, on both version 1.0.1 (RPMs) and 1.1.1 (built from source). When we create two IO instances within one server (one per CPU/HCA), communication errors arise just from trying to create a pool. The same problem occurs on two different physical servers with OFED 4.6.2 on RHEL 7.6 & 7.8 kernels. I saw a post around spring regarding communication issues with multiple IO instances, but this seems different. In addition to the snippets below, the server configuration and log files are attached. Still trying out a couple more things, but wanted to reach out in the meantime.
Any ideas as to debugging this problem with multirail or any gotchas encountered?
Workflow from previous successful tests, using the same server with one IO instance: the same server yml file works fine if I comment out either of the two IO server stanzas. I perform wipefs, prepare, and format for good measure before all of the tests above, and I have applied the pre-deployment sysctl ARP settings for multirail with two HCAs in the same subnet.
dac007$ dmg -i system query
Rank  State
----  -----
[0-1] Joined
dac007$ dmg -i pool create --scm-size=600G --nvme-size=6TB
Creating DAOS pool with 600 GB SCM and 6.0 TB NVMe storage (10.00 % ratio)
Pool-create command FAILED: pool create failed: DAOS error (-1006): DER_UNREACH
ERROR: dmg: pool create failed: DAOS error (-1006): DER_UNREACH
Network: eth0 10.140.0.7/16, ib0 10.10.0.7/16, ib1 10.10.1.7/16
net.ipv4.conf.ib0.rp_filter = 2
net.ipv4.conf.ib1.rp_filter = 2
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.ib0.accept_local = 1
net.ipv4.conf.ib1.accept_local = 1
net.ipv4.conf.all.accept_local = 1
net.ipv4.conf.ib0.arp_ignore = 2
net.ipv4.conf.ib1.arp_ignore = 2
net.ipv4.conf.all.arp_ignore = 2
LOGS:
dac007$ wc -l /tmp/daos_*.log
71 /tmp/daos_control.log
5970 /tmp/daos_server-core0.log
35 /tmp/daos_server-core1.log
These errors repeat in one of the server logs:
10/22-09:11:10.59 dac007 DAOS[194223/194230] vos DBUG src/vos/vos_io.c:1035 akey_fetch() akey [8] ???? fetch single epr 5-16b
10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 574 connection has been terminated
10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=574
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # NA -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/na/na_ofi.c:3881 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -110 (Connection timed out)
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:1828 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR)
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:4238 # HG_Core_forward(): Could not forward buffer
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury.c:1933 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR)
10/22-09:11:12.59 dac007 DAOS[194223/194230] hg ERR src/cart/crt_hg.c:1083 crt_hg_req_send(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] HG_Forward failed, hg_ret: 12
10/22-09:11:12.59 dac007 DAOS[194223/194230] rdb WARN src/rdb/rdb_raft.c:1980 rdb_timerd() 64616f73[0]: not scheduled for 1.280968 second
10/22-09:11:12.59 dac007 DAOS[194223/194230] daos INFO src/iosrv/drpc_progress.c:295 drpc_handler_ult() dRPC handler ULT for module=2 method=209
10/22-09:11:12.59 dac007 DAOS[194223/194230] mgmt INFO src/mgmt/srv_drpc.c:1989 ds_mgmt_drpc_set_up() Received request to setup server
10/22-09:11:12.59 dac007 DAOS[194223/194230] server INFO src/iosrv/init.c:388 dss_init_state_set() setting server init state to 1
10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:572 server_init() Modules successfully set up
10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:575 server_init() Service fully up
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:798 crt_context_timeout_check(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] ctx_id 0, (status: 0x3f) timed out, tgt rank 1, tag 0
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:744 crt_req_timeout_hdlr(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] failed due to group daos_server, rank 1, tgt_uri ofi+verbs;ofi_rxm://10.10.0.7:31417 can't reach the target
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] RPC failed; rc: -1006
10/22-09:11:12.59 dac007 DAOS[194223/194230] corpc ERR src/cart/crt_corpc.c:646 crt_corpc_reply_hdlr() RPC(opc: 0x101000b) error, rc: -1006.
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac10185b0) [opc=0x101000b rpcid=0x5a136b7800000000 rank:tag=0:0] RPC failed; rc: -1006
10/22-09:11:12.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 575 connection has been terminated
control.log:
Management Service access point started (bootstrapped)
daos_io_server:0 DAOS I/O server (v1.1.1) process 194223 started on rank 0 with 6 target, 0 helper XS, firstcore 0, host dac007. Using NUMA node: 0
DEBUG 09:11:13.492263 member.go:297: adding system member: 10.140.0.7:10001/1/Joined
DEBUG 09:11:13.492670 mgmt_client.go:160: join(dac007:10001, {Uuid:c611d86e-dcaf-44c1-86f0-5d275cf941a3 Rank:1 Uri:ofi+verbs;ofi_rxm://10.10.0.7:31417 Nctxs:7 Addr:0.0.0.0:10001 SrvFaultDomain:/dac007 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}) end
daos_io_server:1 DAOS I/O server (v1.1.1) process 194226 started on rank 1 with 6 target, 0 helper XS, firstcore 0, host dac007. Using NUMA node: 1
DEBUG 09:13:55.157399 ctl_system.go:187: Received SystemQuery RPC
DEBUG 09:13:55.157863 system.go:645: DAOS system ping-ranks request: &{unaryRequest:{request:{HostList:[10.140.0.7:10001]} rpc:0xbb05e0} Ranks:0-1 Force:false}
DEBUG 09:13:55.158148 rpc.go:183: request hosts: [10.140.0.7:10001]
DEBUG 09:13:55.160340 mgmt_system.go:308: MgmtSvc.PingRanks dispatch, req:{Force:false Ranks:0-1 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:13:55.161685 mgmt_system.go:320: MgmtSvc.PingRanks dispatch, resp:{Results:[state:3 rank:1 state:3 ] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:13:55.163257 ctl_system.go:213: Responding to SystemQuery RPC: members:<addr:"10.140.0.7:10001" uuid:"226f2871-06d9-4616-9d08-d386992a059b" state:4 > members:<addr:"10.140.0.7:10001" uuid:"c611d86e-dcaf-44c1-86f0-5d275cf941a3" rank:1 state:4 >
DEBUG 09:14:05.417780 mgmt_pool.go:56: MgmtSvc.PoolCreate dispatch, req:{Scmbytes:600000000000 Nvmebytes:6000000000000 Ranks:[] Numsvcreps:1 User:root@ Usergroup:root@ Uuid:de34a760-9db6-4c2f-9709-7592ce880b49 Sys:daos_server Acl:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:14:08.282190 mgmt_pool.go:84: MgmtSvc.PoolCreate dispatch resp:{Status:-1006 Svcreps:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
|
|
Re: DAOS with NVMe-over-Fabrics
anton.brekhov@...
Tom, thanks! Unfortunately I couldn't reproduce this behaviour on v1.0.1 with bdev_exclude. Also, we've checked the NVMeF initiator on pure SPDK and it connects to the remote target. So daos_nvme.conf really can't be used to connect NVMeF.
|
|
Re: DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected
Ari
Thanks for the feedback, and agreed. The workaround was to make sure the sysctl settings were applied for the same logical subnet; I tried them in a few combinations (output is in the previous email). The other errors I saw were related to memory, and I haven't found a match for those yet.
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of
Farrell, Patrick Arthur
Sent: Thursday, October 22, 2020 10:23 To: daos@daos.groups.io Subject: Re: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected
[EXTERNAL EMAIL] Ari,
You say this looks different from the stuff in the spring, but unless you've noticed a difference in the specific error messages, this looks (at a high level) identical to me. The issue in the spring was communication between two servers on the same node, where they were unable to route to each other.
Yours is a communication issue between the two servers, saying UNREACHABLE, which is presumably a routing failure. You will have to reference those conversations for details - I have no expertise in the specific issues - but there appear to be IB routing issues here, just as then.
-Patrick From:
daos@daos.groups.io <daos@daos.groups.io> on behalf of Ari <ari.martinez@...>
Hi,
We’ve been testing DAOS functionality in our lab and have had success running IOR with 8 clients, with and without POSIX containers, using a single IO instance and a single HCA within one server, on both version 1.0.1 (RPMs) and 1.1.1 (built from source). When we create two IO instances within one server (one per CPU/HCA), communication errors arise just from trying to create a pool. The same problem occurs on two different physical servers with OFED 4.6.2 on RHEL 7.6 & 7.8 kernels. I saw a post around spring regarding communication issues with multiple IO instances, but this seems different. In addition to the snippets below, the server configuration and log files are attached. Still trying out a couple more things, but wanted to reach out in the meantime.
Any ideas as to debugging this problem with multirail or any gotchas encountered?
Workflow from previous successful tests, using the same server with one IO instance: the same server yml file works fine if I comment out either of the two IO server stanzas. I perform wipefs, prepare, and format for good measure before all of the tests above, and I have applied the pre-deployment sysctl ARP settings for multirail with two HCAs in the same subnet.
dac007$ dmg -i system query
Rank  State
----  -----
[0-1] Joined
dac007$ dmg -i pool create --scm-size=600G --nvme-size=6TB
Creating DAOS pool with 600 GB SCM and 6.0 TB NVMe storage (10.00 % ratio)
Pool-create command FAILED: pool create failed: DAOS error (-1006): DER_UNREACH
ERROR: dmg: pool create failed: DAOS error (-1006): DER_UNREACH
Network: eth0 10.140.0.7/16, ib0 10.10.0.7/16, ib1 10.10.1.7/16
net.ipv4.conf.ib0.rp_filter = 2
net.ipv4.conf.ib1.rp_filter = 2
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.ib0.accept_local = 1
net.ipv4.conf.ib1.accept_local = 1
net.ipv4.conf.all.accept_local = 1
net.ipv4.conf.ib0.arp_ignore = 2
net.ipv4.conf.ib1.arp_ignore = 2
net.ipv4.conf.all.arp_ignore = 2
LOGS:
dac007$ wc -l /tmp/daos_*.log
71 /tmp/daos_control.log
5970 /tmp/daos_server-core0.log
35 /tmp/daos_server-core1.log
These errors repeat in one of the server logs:
10/22-09:11:10.59 dac007 DAOS[194223/194230] vos DBUG src/vos/vos_io.c:1035 akey_fetch() akey [8] ???? fetch single epr 5-16b
10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 574 connection has been terminated
10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=574
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # NA -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/na/na_ofi.c:3881 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -110 (Connection timed out)
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:1828 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR)
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:4238 # HG_Core_forward(): Could not forward buffer
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury.c:1933 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR)
10/22-09:11:12.59 dac007 DAOS[194223/194230] hg ERR src/cart/crt_hg.c:1083 crt_hg_req_send(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] HG_Forward failed, hg_ret: 12
10/22-09:11:12.59 dac007 DAOS[194223/194230] rdb WARN src/rdb/rdb_raft.c:1980 rdb_timerd() 64616f73[0]: not scheduled for 1.280968 second
10/22-09:11:12.59 dac007 DAOS[194223/194230] daos INFO src/iosrv/drpc_progress.c:295 drpc_handler_ult() dRPC handler ULT for module=2 method=209
10/22-09:11:12.59 dac007 DAOS[194223/194230] mgmt INFO src/mgmt/srv_drpc.c:1989 ds_mgmt_drpc_set_up() Received request to setup server
10/22-09:11:12.59 dac007 DAOS[194223/194230] server INFO src/iosrv/init.c:388 dss_init_state_set() setting server init state to 1
10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:572 server_init() Modules successfully set up
10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:575 server_init() Service fully up
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:798 crt_context_timeout_check(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] ctx_id 0, (status: 0x3f) timed out, tgt rank 1, tag 0
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:744 crt_req_timeout_hdlr(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] failed due to group daos_server, rank 1, tgt_uri ofi+verbs;ofi_rxm://10.10.0.7:31417 can't reach the target
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] RPC failed; rc: -1006
10/22-09:11:12.59 dac007 DAOS[194223/194230] corpc ERR src/cart/crt_corpc.c:646 crt_corpc_reply_hdlr() RPC(opc: 0x101000b) error, rc: -1006.
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac10185b0) [opc=0x101000b rpcid=0x5a136b7800000000 rank:tag=0:0] RPC failed; rc: -1006
10/22-09:11:12.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 575 connection has been terminated
control.log:
Management Service access point started (bootstrapped)
daos_io_server:0 DAOS I/O server (v1.1.1) process 194223 started on rank 0 with 6 target, 0 helper XS, firstcore 0, host dac007. Using NUMA node: 0
DEBUG 09:11:13.492263 member.go:297: adding system member: 10.140.0.7:10001/1/Joined
DEBUG 09:11:13.492670 mgmt_client.go:160: join(dac007:10001, {Uuid:c611d86e-dcaf-44c1-86f0-5d275cf941a3 Rank:1 Uri:ofi+verbs;ofi_rxm://10.10.0.7:31417 Nctxs:7 Addr:0.0.0.0:10001 SrvFaultDomain:/dac007 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}) end
daos_io_server:1 DAOS I/O server (v1.1.1) process 194226 started on rank 1 with 6 target, 0 helper XS, firstcore 0, host dac007. Using NUMA node: 1
DEBUG 09:13:55.157399 ctl_system.go:187: Received SystemQuery RPC
DEBUG 09:13:55.157863 system.go:645: DAOS system ping-ranks request: &{unaryRequest:{request:{HostList:[10.140.0.7:10001]} rpc:0xbb05e0} Ranks:0-1 Force:false}
DEBUG 09:13:55.158148 rpc.go:183: request hosts: [10.140.0.7:10001]
DEBUG 09:13:55.160340 mgmt_system.go:308: MgmtSvc.PingRanks dispatch, req:{Force:false Ranks:0-1 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:13:55.161685 mgmt_system.go:320: MgmtSvc.PingRanks dispatch, resp:{Results:[state:3 rank:1 state:3 ] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:13:55.163257 ctl_system.go:213: Responding to SystemQuery RPC: members:<addr:"10.140.0.7:10001" uuid:"226f2871-06d9-4616-9d08-d386992a059b" state:4 > members:<addr:"10.140.0.7:10001" uuid:"c611d86e-dcaf-44c1-86f0-5d275cf941a3" rank:1 state:4 >
DEBUG 09:14:05.417780 mgmt_pool.go:56: MgmtSvc.PoolCreate dispatch, req:{Scmbytes:600000000000 Nvmebytes:6000000000000 Ranks:[] Numsvcreps:1 User:root@ Usergroup:root@ Uuid:de34a760-9db6-4c2f-9709-7592ce880b49 Sys:daos_server Acl:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:14:08.282190 mgmt_pool.go:84: MgmtSvc.PoolCreate dispatch resp:{Status:-1006 Svcreps:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
|
|
Re: DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected
Farrell, Patrick Arthur
Ari,
You say this looks different from the stuff in the spring, but unless you've noticed a difference in the specific error messages, this looks (at a high level) identical to me. The issue in the spring was communication between two servers on the same node,
where they were unable to route to each other.
Yours is a communication issue between the two servers, saying UNREACHABLE, which is presumably a routing failure. You will have to reference those conversations for details - I have no expertise in the specific issues - but there appear to be IB routing issues
here, just as then.
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Ari <ari.martinez@...>
Sent: Thursday, October 22, 2020 10:08 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected
Hi,
We’ve been testing DAOS functionality in our lab and have had success running IOR with 8 clients, with and without POSIX containers, using a single IO instance and a single HCA within one server, on both version 1.0.1 (RPMs) and 1.1.1 (built from source). When we create two IO instances within one server (one per CPU/HCA), communication errors arise just from trying to create a pool. The same problem occurs on two different physical servers with OFED 4.6.2 on RHEL 7.6 & 7.8 kernels. I saw a post around spring regarding communication issues with multiple IO instances, but this seems different. In addition to the snippets below, the server configuration and log files are attached. Still trying out a couple more things, but wanted to reach out in the meantime.
Any ideas as to debugging this problem with multirail or any gotchas encountered?
Workflow from previous successful tests, using the same server with one IO instance: the same server yml file works fine if I comment out either of the two IO server stanzas. I perform wipefs, prepare, and format for good measure before all of the tests above, and I have applied the pre-deployment sysctl ARP settings for multirail with two HCAs in the same subnet.
dac007$ dmg -i system query
Rank  State
----  -----
[0-1] Joined
dac007$ dmg -i pool create --scm-size=600G --nvme-size=6TB
Creating DAOS pool with 600 GB SCM and 6.0 TB NVMe storage (10.00 % ratio)
Pool-create command FAILED: pool create failed: DAOS error (-1006): DER_UNREACH
ERROR: dmg: pool create failed: DAOS error (-1006): DER_UNREACH
Network: eth0 10.140.0.7/16, ib0 10.10.0.7/16, ib1 10.10.1.7/16
net.ipv4.conf.ib0.rp_filter = 2
net.ipv4.conf.ib1.rp_filter = 2
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.ib0.accept_local = 1
net.ipv4.conf.ib1.accept_local = 1
net.ipv4.conf.all.accept_local = 1
net.ipv4.conf.ib0.arp_ignore = 2
net.ipv4.conf.ib1.arp_ignore = 2
net.ipv4.conf.all.arp_ignore = 2
LOGS:
dac007$ wc -l /tmp/daos_*.log
71 /tmp/daos_control.log
5970 /tmp/daos_server-core0.log
35 /tmp/daos_server-core1.log
These errors repeat in one of the server logs:
10/22-09:11:10.59 dac007 DAOS[194223/194230] vos DBUG src/vos/vos_io.c:1035 akey_fetch() akey [8] ???? fetch single epr 5-16b
10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 574 connection has been terminated
10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=574
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # NA -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/na/na_ofi.c:3881 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -110 (Connection timed out)
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:1828 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR)
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:4238 # HG_Core_forward(): Could not forward buffer
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury.c:1933 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR)
10/22-09:11:12.59 dac007 DAOS[194223/194230] hg ERR src/cart/crt_hg.c:1083 crt_hg_req_send(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] HG_Forward failed, hg_ret: 12
10/22-09:11:12.59 dac007 DAOS[194223/194230] rdb WARN src/rdb/rdb_raft.c:1980 rdb_timerd() 64616f73[0]: not scheduled for 1.280968 second
10/22-09:11:12.59 dac007 DAOS[194223/194230] daos INFO src/iosrv/drpc_progress.c:295 drpc_handler_ult() dRPC handler ULT for module=2 method=209
10/22-09:11:12.59 dac007 DAOS[194223/194230] mgmt INFO src/mgmt/srv_drpc.c:1989 ds_mgmt_drpc_set_up() Received request to setup server
10/22-09:11:12.59 dac007 DAOS[194223/194230] server INFO src/iosrv/init.c:388 dss_init_state_set() setting server init state to 1
10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:572 server_init() Modules successfully set up
10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:575 server_init() Service fully up
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:798 crt_context_timeout_check(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] ctx_id 0, (status: 0x3f) timed out, tgt rank 1, tag 0
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:744 crt_req_timeout_hdlr(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] failed due to group daos_server, rank 1, tgt_uri ofi+verbs;ofi_rxm://10.10.0.7:31417 can't reach the target
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] RPC failed; rc: -1006
10/22-09:11:12.59 dac007 DAOS[194223/194230] corpc ERR src/cart/crt_corpc.c:646 crt_corpc_reply_hdlr() RPC(opc: 0x101000b) error, rc: -1006.
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac10185b0) [opc=0x101000b rpcid=0x5a136b7800000000 rank:tag=0:0] RPC failed; rc: -1006
10/22-09:11:12.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 575 connection has been terminated
control.log:
Management Service access point started (bootstrapped)
daos_io_server:0 DAOS I/O server (v1.1.1) process 194223 started on rank 0 with 6 target, 0 helper XS, firstcore 0, host dac007. Using NUMA node: 0
DEBUG 09:11:13.492263 member.go:297: adding system member: 10.140.0.7:10001/1/Joined
DEBUG 09:11:13.492670 mgmt_client.go:160: join(dac007:10001, {Uuid:c611d86e-dcaf-44c1-86f0-5d275cf941a3 Rank:1 Uri:ofi+verbs;ofi_rxm://10.10.0.7:31417 Nctxs:7 Addr:0.0.0.0:10001 SrvFaultDomain:/dac007 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}) end
daos_io_server:1 DAOS I/O server (v1.1.1) process 194226 started on rank 1 with 6 target, 0 helper XS, firstcore 0, host dac007. Using NUMA node: 1
DEBUG 09:13:55.157399 ctl_system.go:187: Received SystemQuery RPC
DEBUG 09:13:55.157863 system.go:645: DAOS system ping-ranks request: &{unaryRequest:{request:{HostList:[10.140.0.7:10001]} rpc:0xbb05e0} Ranks:0-1 Force:false}
DEBUG 09:13:55.158148 rpc.go:183: request hosts: [10.140.0.7:10001]
DEBUG 09:13:55.160340 mgmt_system.go:308: MgmtSvc.PingRanks dispatch, req:{Force:false Ranks:0-1 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:13:55.161685 mgmt_system.go:320: MgmtSvc.PingRanks dispatch, resp:{Results:[state:3 rank:1 state:3 ] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:13:55.163257 ctl_system.go:213: Responding to SystemQuery RPC: members:<addr:"10.140.0.7:10001" uuid:"226f2871-06d9-4616-9d08-d386992a059b" state:4 > members:<addr:"10.140.0.7:10001" uuid:"c611d86e-dcaf-44c1-86f0-5d275cf941a3" rank:1 state:4 >
DEBUG 09:14:05.417780 mgmt_pool.go:56: MgmtSvc.PoolCreate dispatch, req:{Scmbytes:600000000000 Nvmebytes:6000000000000 Ranks:[] Numsvcreps:1 User:root@ Usergroup:root@ Uuid:de34a760-9db6-4c2f-9709-7592ce880b49 Sys:daos_server Acl:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:14:08.282190 mgmt_pool.go:84: MgmtSvc.PoolCreate dispatch resp:{Status:-1006 Svcreps:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
|
|
DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected
Ari
Hi,
We’ve been testing DAOS functionality in our lab and have had success running IOR with 8 clients, with and without POSIX containers, using a single IO instance and a single HCA within one server, on both version 1.0.1 (RPMs) and 1.1.1 (built from source). When we create two IO instances within one server (one per CPU/HCA), communication errors arise just from trying to create a pool. The same problem occurs on two different physical servers with OFED 4.6.2 on RHEL 7.6 & 7.8 kernels. I saw a post around spring regarding communication issues with multiple IO instances, but this seems different. In addition to the snippets below, the server configuration and log files are attached. Still trying out a couple more things, but wanted to reach out in the meantime.
Any ideas as to debugging this problem with multirail or any gotchas encountered?
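One generic first step when debugging this kind of multirail setup is to confirm which fabric interfaces and providers the control plane actually detects on each server; a hedged sketch, noting that the exact subcommand and flag spelling differ between DAOS versions:
```
# List the fabric interfaces/providers detected on this server (flag names vary by DAOS version).
daos_server network scan -p "ofi+verbs;ofi_rxm"
```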
Workflow from previous successful tests, using the same server with one IO instance: the same server yml file works fine if I comment out either of the two IO server stanzas. I perform wipefs, prepare, and format for good measure before all of the tests above, and I have applied the pre-deployment sysctl ARP settings for multirail with two HCAs in the same subnet.
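For context, that wipe/prepare/format cycle looks roughly like the sketch below; the pmem device paths are placeholders and the prepare/format flags are assumptions that differ slightly across DAOS releases:
```
# Illustrative reset cycle between tests (device paths are examples, not from this report):
sudo wipefs -a /dev/pmem0 /dev/pmem1
sudo daos_server storage prepare --reset    # release NVMe/SCM back to the OS
sudo daos_server storage prepare            # re-bind NVMe to SPDK and recreate PMem namespaces
dmg -i storage format                       # format once the servers are waiting for format
```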
dac007$ dmg -i system query
Rank  State
----  -----
[0-1] Joined
dac007$ dmg -i pool create --scm-size=600G --nvme-size=6TB
Creating DAOS pool with 600 GB SCM and 6.0 TB NVMe storage (10.00 % ratio)
Pool-create command FAILED: pool create failed: DAOS error (-1006): DER_UNREACH
ERROR: dmg: pool create failed: DAOS error (-1006): DER_UNREACH
Network: eth0 10.140.0.7/16, ib0 10.10.0.7/16, ib1 10.10.1.7/16
net.ipv4.conf.ib0.rp_filter = 2
net.ipv4.conf.ib1.rp_filter = 2
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.ib0.accept_local = 1
net.ipv4.conf.ib1.accept_local = 1
net.ipv4.conf.all.accept_local = 1
net.ipv4.conf.ib0.arp_ignore = 2
net.ipv4.conf.ib1.arp_ignore = 2
net.ipv4.conf.all.arp_ignore = 2
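For reference, these settings are applied with sysctl; a minimal sketch of setting one immediately and persisting the whole set, assuming a conventional /etc/sysctl.d drop-in file name:
```
# Apply a setting immediately, then persist it (repeat for each of the settings above).
sudo sysctl -w net.ipv4.conf.all.rp_filter=2
echo "net.ipv4.conf.all.rp_filter = 2" | sudo tee -a /etc/sysctl.d/99-daos-multirail.conf
sudo sysctl --system    # reload all sysctl configuration files
```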
LOGS:
dac007$ wc -l /tmp/daos_*.log
71 /tmp/daos_control.log
5970 /tmp/daos_server-core0.log
35 /tmp/daos_server-core1.log
These errors repeat in one of the server logs:
10/22-09:11:10.59 dac007 DAOS[194223/194230] vos DBUG src/vos/vos_io.c:1035 akey_fetch() akey [8] ???? fetch single epr 5-16b
10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 574 connection has been terminated
10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=574
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # NA -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/na/na_ofi.c:3881 # na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -110 (Connection timed out)
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:1828 # hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR)
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:4238 # HG_Core_forward(): Could not forward buffer
10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury.c:1933 # HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR)
10/22-09:11:12.59 dac007 DAOS[194223/194230] hg ERR src/cart/crt_hg.c:1083 crt_hg_req_send(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] HG_Forward failed, hg_ret: 12
10/22-09:11:12.59 dac007 DAOS[194223/194230] rdb WARN src/rdb/rdb_raft.c:1980 rdb_timerd() 64616f73[0]: not scheduled for 1.280968 second
10/22-09:11:12.59 dac007 DAOS[194223/194230] daos INFO src/iosrv/drpc_progress.c:295 drpc_handler_ult() dRPC handler ULT for module=2 method=209
10/22-09:11:12.59 dac007 DAOS[194223/194230] mgmt INFO src/mgmt/srv_drpc.c:1989 ds_mgmt_drpc_set_up() Received request to setup server
10/22-09:11:12.59 dac007 DAOS[194223/194230] server INFO src/iosrv/init.c:388 dss_init_state_set() setting server init state to 1
10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:572 server_init() Modules successfully set up
10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:575 server_init() Service fully up
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:798 crt_context_timeout_check(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] ctx_id 0, (status: 0x3f) timed out, tgt rank 1, tag 0
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:744 crt_req_timeout_hdlr(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] failed due to group daos_server, rank 1, tgt_uri ofi+verbs;ofi_rxm://10.10.0.7:31417 can't reach the target
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] RPC failed; rc: -1006
10/22-09:11:12.59 dac007 DAOS[194223/194230] corpc ERR src/cart/crt_corpc.c:646 crt_corpc_reply_hdlr() RPC(opc: 0x101000b) error, rc: -1006.
10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc ERR src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac10185b0) [opc=0x101000b rpcid=0x5a136b7800000000 rank:tag=0:0] RPC failed; rc: -1006
10/22-09:11:12.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 575 connection has been terminated
control.log:
Management Service access point started (bootstrapped)
daos_io_server:0 DAOS I/O server (v1.1.1) process 194223 started on rank 0 with 6 target, 0 helper XS, firstcore 0, host dac007. Using NUMA node: 0
DEBUG 09:11:13.492263 member.go:297: adding system member: 10.140.0.7:10001/1/Joined
DEBUG 09:11:13.492670 mgmt_client.go:160: join(dac007:10001, {Uuid:c611d86e-dcaf-44c1-86f0-5d275cf941a3 Rank:1 Uri:ofi+verbs;ofi_rxm://10.10.0.7:31417 Nctxs:7 Addr:0.0.0.0:10001 SrvFaultDomain:/dac007 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}) end
daos_io_server:1 DAOS I/O server (v1.1.1) process 194226 started on rank 1 with 6 target, 0 helper XS, firstcore 0, host dac007. Using NUMA node: 1
DEBUG 09:13:55.157399 ctl_system.go:187: Received SystemQuery RPC
DEBUG 09:13:55.157863 system.go:645: DAOS system ping-ranks request: &{unaryRequest:{request:{HostList:[10.140.0.7:10001]} rpc:0xbb05e0} Ranks:0-1 Force:false}
DEBUG 09:13:55.158148 rpc.go:183: request hosts: [10.140.0.7:10001]
DEBUG 09:13:55.160340 mgmt_system.go:308: MgmtSvc.PingRanks dispatch, req:{Force:false Ranks:0-1 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:13:55.161685 mgmt_system.go:320: MgmtSvc.PingRanks dispatch, resp:{Results:[state:3 rank:1 state:3 ] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:13:55.163257 ctl_system.go:213: Responding to SystemQuery RPC: members:<addr:"10.140.0.7:10001" uuid:"226f2871-06d9-4616-9d08-d386992a059b" state:4 > members:<addr:"10.140.0.7:10001" uuid:"c611d86e-dcaf-44c1-86f0-5d275cf941a3" rank:1 state:4 >
DEBUG 09:14:05.417780 mgmt_pool.go:56: MgmtSvc.PoolCreate dispatch, req:{Scmbytes:600000000000 Nvmebytes:6000000000000 Ranks:[] Numsvcreps:1 User:root@ Usergroup:root@ Uuid:de34a760-9db6-4c2f-9709-7592ce880b49 Sys:daos_server Acl:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
DEBUG 09:14:08.282190 mgmt_pool.go:84: MgmtSvc.PoolCreate dispatch resp:{Status:-1006 Svcreps:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
|
|
Re: DAOS with NVMe-over-Fabrics
I realise that the problem is that the bdev_exclude devices are already bound to UIO at the time DAOS starts, and are not actively unbound by the SPDK setup script:
0000:00:04.1 (8086 2021): Already using the uio_pci_generic driver
A simple workaround is to run daos_server storage prepare -n --reset before starting, and then rerun; this releases all the devices back to the system, so bdev_exclude will be honoured the next time daos_server start ... is run.
A longer-term solution may be to run prepare reset when initially starting the server, but I will have to discuss with the team whether this is the right approach.
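For anyone hitting the same thing, the workaround sequence is just the reset described above followed by a normal start; the config path here is the example file used later in this message:
```
# Release all NVMe devices back to the kernel, then restart so bdev_exclude is honoured.
sudo daos_server storage prepare -n --reset
sudo daos_server start -o utils/config/examples/daos_server_sockets.yml -i
```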
I verified the behavior of the bdev_exclude server config file parameter (binaries built from current master, commit 6c50dbbc45431eea8c6eecf5dae74e5b88713f65).
Initial NVMe device listing:
tanabarr@wolf-151:~/projects/daos_m> ls -lah /dev/nv*
crw------- 1 root root 237, 0 Oct 20 14:58 /dev/nvme0
brw-rw---- 1 root disk 259, 105 Oct 20 14:58 /dev/nvme0n1
crw------- 1 root root 237, 1 Oct 20 14:58 /dev/nvme1
brw-rw---- 1 root disk 259, 113 Oct 20 14:58 /dev/nvme1n1
crw------- 1 root root 237, 2 Oct 20 14:58 /dev/nvme2
brw-rw---- 1 root disk 259, 121 Oct 20 14:58 /dev/nvme2n1
crw------- 1 root root 237, 3 Oct 20 14:58 /dev/nvme3
brw-rw---- 1 root disk 259, 123 Oct 20 14:58 /dev/nvme3n1
crw------- 1 root root 237, 4 Oct 19 13:16 /dev/nvme4
brw-rw---- 1 root disk 259, 115 Oct 19 13:16 /dev/nvme4n1
crw------- 1 root root 237, 5 Oct 20 14:58 /dev/nvme5
brw-rw---- 1 root disk 259, 125 Oct 20 14:58 /dev/nvme5n1
crw------- 1 root root 237, 6 Oct 19 13:16 /dev/nvme6
brw-rw---- 1 root disk 259, 119 Oct 19 13:16 /dev/nvme6n1
crw------- 1 root root 237, 7 Oct 20 14:58 /dev/nvme7
brw-rw---- 1 root disk 259, 127 Oct 20 14:58 /dev/nvme7n1
crw------- 1 root root 10, 144 Oct 7 22:48 /dev/nvram
Relevant server configuration file changes:
--- a/utils/config/examples/daos_server_sockets.yml
+++ b/utils/config/examples/daos_server_sockets.yml
@@ -8,6 +8,8 @@ socket_dir: /tmp/daos_sockets
 nr_hugepages: 4096
 control_log_mask: DEBUG
 control_log_file: /tmp/daos_control.log
+helper_log_file: /tmp/daos_admin.log
+bdev_exclude: ["0000:e3:00.0", "0000:e7:00.0"]
Server invocation:
sudo install/bin/daos_server start -o utils/config/examples/daos_server_sockets.yml -i
Debug output in the helper log file (/tmp/daos_admin.log):
DEBUG 13:16:11.027148 runner.go:150: spdk setup env: [PATH=/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/usr/sbin _NRHUGE=128 _TARGET_USER=root _PCI_BLACKLIST=0000:e3:00.0 0000:e7:00.0]
DEBUG 13:16:11.027302 runner.go:80: running script: /usr/share/daos/control/setup_spdk.sh
DEBUG 13:16:31.264301 runner.go:152: spdk setup stdout:
start of script: /usr/share/daos/control/setup_spdk.sh
calling into script: /usr/share/daos/control/../../spdk/scripts/setup.sh
0000:65:00.0 (144d a824): nvme -> uio_pci_generic
0000:67:00.0 (144d a824): nvme -> uio_pci_generic
0000:69:00.0 (144d a824): nvme -> uio_pci_generic
0000:6b:00.0 (144d a824): nvme -> uio_pci_generic
0000:e3:00.0 (144d a824): Skipping un-whitelisted NVMe controller at 0000:e3:00.0
0000:e5:00.0 (144d a824): nvme -> uio_pci_generic
0000:e7:00.0 (144d a824): Skipping un-whitelisted NVMe controller at 0000:e7:00.0
0000:e9:00.0 (144d a824): nvme -> uio_pci_generic
RUN: ls -d /dev/hugepages | xargs -r chown -R root
RUN: ls -d /dev/uio* | xargs -r chown -R root
RUN: ls -d /sys/class/uio/uio*/device/config | xargs -r chown -R root
RUN: ls -d /sys/class/uio/uio*/device/resource* | xargs -r chown -R root
Setting VFIO file permissions for unprivileged access
RUN: chmod /dev/vfio OK
RUN: chmod /dev/vfio/* OK
DEBUG 13:16:31.527190 spdk.go:122: spdk init go opts: {MemSize:0 PciWhiteList:[] DisableVMD:true}
DEBUG 13:16:31.577836 spdk.go:136: spdk init c opts: &{name:0x7f7eacaf13f5 core_mask:0x7f7eacaf1569 shm_id:-1 mem_channel:-1 master_core:-1 mem_size:-1 no_pci:false hugepage_single_segments:false unlink_hugepage:false num_pci_addr:0 hugedir:<nil> pci_blacklist:<nil> pci_whitelist:<nil> env_context:0x33c3790}
DEBUG 13:16:31.665129 nvme.go:143: discovered nvme ssds: [0000:65:00.0 0000:67:00.0 0000:6b:00.0 0000:e5:00.0 0000:e9:00.0 0000:69:00.0]
Resultant NVMe device listing:
tanabarr@wolf-151:~/projects/daos_m> ls -lah /dev/nv*
crw------- 1 root root 237,   4 Oct 19 13:16 /dev/nvme4
brw-rw---- 1 root disk 259, 115 Oct 19 13:16 /dev/nvme4n1
crw------- 1 root root 237,   6 Oct 19 13:16 /dev/nvme6
brw-rw---- 1 root disk 259, 119 Oct 19 13:16 /dev/nvme6n1
crw------- 1 root root  10, 144 Oct  7 22:48 /dev/nvram
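To double-check which controllers stayed on the kernel nvme driver and which were claimed by uio_pci_generic, a generic sysfs query along these lines can be used (not DAOS tooling; 0x010802 is the standard PCI class code for NVMe controllers):

```
# Print the driver currently bound to every NVMe-class PCI device
for dev in /sys/bus/pci/devices/*; do
  [ "$(cat "$dev/class")" = "0x010802" ] || continue
  if [ -e "$dev/driver" ]; then
    echo "$(basename "$dev") -> $(basename "$(readlink -f "$dev/driver")")"
  else
    echo "$(basename "$dev") -> no driver"
  fi
done
```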
Regards, Tom Nabarro – DCG/ESAD M: +44 (0)7786 260986 Skype: tom.nabarro
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of
anton.brekhov@...
Sent: Sunday, October 18, 2020 1:28 PM To: daos@daos.groups.io Subject: Re: [daos] DAOS with NVMe-over-Fabrics
Tom, thanks for the comments! I've edited my previous message; I'll copy it below.

Tom, here is the active config file content:

name: daos_server
access_points: ['apache512']
#access_points: ['localhost']
port: 10001
#provider: ofi+sockets
provider: ofi+verbs;ofi_rxm
nr_hugepages: 4096
control_log_file: /tmp/daos_control.log
helper_log_file: /tmp/daos_admin.log
transport_config:
  allow_insecure: true
servers:
- targets: 4
  first_core: 0
  nr_xs_helpers: 0
  fabric_iface: ib0
  fabric_iface_port: 31416
  log_mask: ERR
  log_file: /tmp/daos_server.log
  env_vars:
  - DAOS_MD_CAP=1024
  - CRT_CTX_SHARE_ADDR=0
  - CRT_TIMEOUT=30
  - FI_SOCKETS_MAX_CONN_RETRY=1
  - FI_SOCKETS_CONN_TIMEOUT=2000
  #- OFI_INTERFACE=ib0
  #- OFI_DOMAIN=mlx5_0
  #- CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm
# Storage definitions
  # When scm_class is set to ram, tmpfs will be used to emulate SCM.
  # The size of ram is specified by scm_size in GB units.
  scm_mount: /mnt/daos # map to -s /mnt/daos
  #scm_class: ram
  #scm_size: 8
  scm_class: dcpm
  scm_list: [/dev/pmem0]

  bdev_class: nvme
  #bdev_list: ["0000:b1:00.0","0000:b2:00.0"]
Before starting server:

[root@apache512 ~]# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     BTLJ81460E1M1P0I     INTEL SSDPELKX010T8                      1          1,00 TB / 1,00 TB          512 B + 0 B      VCV10300
/dev/nvme1n1     BTLJ81460E031P0I     INTEL SSDPELKX010T8                      1          1,00 TB / 1,00 TB          512 B + 0 B      VCV10300
/dev/nvme2n1     BTLJ81460E1J1P0I     INTEL SSDPELKX010T8                      1          1,00 TB / 1,00 TB          512 B + 0 B      VCV10300
/dev/nvme3n1     BTLJ81460E341P0I     INTEL SSDPELKX010T8                      1          1,00 TB / 1,00 TB          512 B + 0 B      VCV10300

After starting server:

[root@apache512 ~]#

The helper_log_file:

0000:b1:00.0 (8086 0a54): nvme -> uio_pci_generic
0000:b2:00.0 (8086 0a54): nvme -> uio_pci_generic
0000:b3:00.0 (8086 0a54): nvme -> uio_pci_generic
0000:b4:00.0 (8086 0a54): nvme -> uio_pci_generic
0000:00:04.0 (8086 2021): no driver -> uio_pci_generic
0000:00:04.1 (8086 2021): Already using the uio_pci_generic driver
0000:00:04.2 (8086 2021): Already using the uio_pci_generic driver
0000:00:04.3 (8086 2021): Already using the uio_pci_generic driver
0000:00:04.4 (8086 2021): Already using the uio_pci_generic driver
0000:00:04.5 (8086 2021): Already using the uio_pci_generic driver
0000:00:04.6 (8086 2021): Already using the uio_pci_generic driver
0000:00:04.7 (8086 2021): Already using the uio_pci_generic driver
0000:80:04.0 (8086 2021): Already using the uio_pci_generic driver
0000:80:04.1 (8086 2021): Already using the uio_pci_generic driver
0000:80:04.2 (8086 2021): Already using the uio_pci_generic driver
0000:80:04.3 (8086 2021): Already using the uio_pci_generic driver
0000:80:04.4 (8086 2021): Already using the uio_pci_generic driver
0000:80:04.5 (8086 2021): Already using the uio_pci_generic driver
0000:80:04.6 (8086 2021): Already using the uio_pci_generic driver
0000:80:04.7 (8086 2021): Already using the uio_pci_generic driver
RUN: ls -d /dev/hugepages | xargs -r chown -R root
RUN: ls -d /dev/uio* | xargs -r chown -R root
RUN: ls -d /sys/class/uio/uio*/device/config | xargs -r chown -R root
RUN: ls -d /sys/class/uio/uio*/device/resource* | xargs -r chown -R root
Setting VFIO file permissions for unprivileged access
RUN: chmod /dev/vfio OK
RUN: chmod /dev/vfio/* OK
DEBUG 17:41:13.340092 nvme.go:176: discovered nvme ssds: [0000:b4:00.0 0000:b3:00.0 0000:b1:00.0 0000:b2:00.0]
DEBUG 17:41:13.340794 nvme.go:133: removed lockfiles: [/tmp/spdk_pci_lock_0000:b4:00.0 /tmp/spdk_pci_lock_0000:b3:00.0 /tmp/spdk_pci_lock_0000:b1:00.0 /tmp/spdk_pci_lock_0000:b2:00.0]
DEBUG 17:41:13.502291 ipmctl.go:104: discovered 4 DCPM modules
DEBUG 17:41:13.517978 ipmctl.go:356: discovered 2 DCPM namespaces
DEBUG 17:41:13.775184 ipmctl.go:133: show region output:
---ISetID=0xfe0ceeb819432444---
   PersistentMemoryType=AppDirect
   FreeCapacity=0.000 GiB
---ISetID=0xe1f4eeb8c7432444---
   PersistentMemoryType=AppDirect
   FreeCapacity=0.000 GiB
DEBUG 17:41:13.973259 ipmctl.go:104: discovered 4 DCPM modules
DEBUG 17:41:13.988400 ipmctl.go:356: discovered 2 DCPM namespaces
DEBUG 17:41:14.234782 ipmctl.go:133: show region output:
---ISetID=0xfe0ceeb819432444---
   PersistentMemoryType=AppDirect
   FreeCapacity=0.000 GiB
---ISetID=0xe1f4eeb8c7432444---
   PersistentMemoryType=AppDirect
   FreeCapacity=0.000 GiB

---------------------------------------------------------------------
This e-mail and any attachments may contain confidential material for
|
|