Hi there,
Any errors in the engine logs?
Cheers,
Johann
From:
<daos@daos.groups.io> on behalf of Peter <magpiesaresoawesome@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday 31 May 2021 at 08:42
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Trying to solve "DAOS rank exited unexpectedly"
[Edited Message Follows]
Hello all,
I've been experiencing problems trying to run DAOS 1.2. For the moment, I'm just trying to run a single server, and DAOS runs correctly when using Ethernet. However, when switching to an Infiniband configuration, I am met with the error attached below ("/tmp/daos_engine.0.log"
remains empty).
Any help you can provide would be much appreciated, and please let me know if there is any more info I can provide. This is on Ubuntu 20.04, using the latest MLNX drivers (MLNX_OFED_LINUX-5.3-1.0.0.1).
------------------------- OUTPUT -------------------------
copoka@swat7-02:~$ daos_server start -i -o daos_server_ib.yml
DAOS Server config loaded from /home/copoka/daos_server_ib.yml
daos_server logging to file /tmp/daos_server.log
DEBUG 15:25:33.517796 start.go:89: Switching control log level to DEBUG
DEBUG 15:25:33.579954 netdetect.go:279: 2 NUMA nodes detected with 20 cores per node
DEBUG 15:25:33.580176 netdetect.go:284: initDeviceScan completed. Depth -6, numObj 8, systemDeviceNames [lo eno1 eno2 ib0], hwlocDeviceNames [sda card0 ib0 mlx5_0 eno1 eno2 pmem0 pmem1]
DEBUG 15:25:33.580206 netdetect.go:913: Calling ValidateProviderConfig with ib0, ofi+verbs;ofi_rxm
DEBUG 15:25:33.580223 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm
DEBUG 15:25:33.580363 netdetect.go:995: There are 0 hfi1 devices in the system
DEBUG 15:25:33.580392 netdetect.go:572: There are 2 NUMA nodes.
DEBUG 15:25:33.580413 netdetect.go:928: Device ib0 supports provider: ofi+verbs;ofi_rxm
DEBUG 15:25:33.580440 netdetect.go:1073: Validate network config for numaNode: 0
DEBUG 15:25:33.580460 netdetect.go:572: There are 2 NUMA nodes.
DEBUG 15:25:33.580479 netdetect.go:1102: The NUMA node for device ib0 matches the provided value 0. Network configuration is valid.
DEBUG 15:25:33.581049 server.go:401: Active config saved to /home/copoka/.daos_server.active.yml (read-only)
DEBUG 15:25:33.581078 server.go:113: fault domain: /swat7-02
DEBUG 15:25:33.581279 server.go:163: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:128 DisableCleanHugePages:false PCIWhitelist: PCIBlacklist: TargetUser:copoka ResetOnly:false DisableVFIO:false
DisableVMD:true}
DEBUG 15:25:40.008657 database.go:277: set db replica addr: 172.23.7.2:10001
DEBUG 15:25:40.115490 netdetect.go:279: 2 NUMA nodes detected with 20 cores per node
DEBUG 15:25:40.115717 netdetect.go:284: initDeviceScan completed. Depth -6, numObj 8, systemDeviceNames [lo eno1 eno2 ib0], hwlocDeviceNames [sda card0 ib0 mlx5_0 eno1 eno2 pmem0 pmem1]
DEBUG 15:25:40.115786 netdetect.go:669: Searching for a device alias for: ib0
DEBUG 15:25:40.115824 netdetect.go:693: Device alias for ib0 is mlx5_0
DEBUG 15:25:40.115874 class.go:196: spdk nvme: bdev_list empty in config, no nvme.conf generated for server
DEBUG 15:25:40.647130 provider.go:217: bdev scan: update cache (4 devices)
DAOS Control Server v1.2 (pid 186967) listening on 0.0.0.0:10001
DEBUG 15:25:40.647841 instance_exec.go:35: instance 0: checking if storage is formatted
Checking DAOS I/O Engine instance 0 storage ...
DEBUG 15:25:40.647958 instance_storage.go:74: /mnt/daos0: checking formatting
DEBUG 15:25:41.090347 instance_storage.go:90: /mnt/daos0 (dcpm) needs format: false
DEBUG 15:25:41.090374 instance_storage.go:121: instance 0: no SCM format required; checking for superblock
DEBUG 15:25:41.090392 instance_superblock.go:90: /mnt/daos0: checking superblock
DEBUG 15:25:41.090576 instance_storage.go:127: instance 0: superblock not needed
DEBUG 15:25:41.090601 database.go:372: system db start: isReplica: true, isBootstrap: true
DEBUG 15:25:41.090932 api.go:556: initial configuration: index=1 servers=[%+v [{Suffrage:Voter ID:75.12.36.64:10001 Address:75.12.36.64:10001}]]
DEBUG 15:25:41.090961 raft.go:173: isBootstrap: true, newDB: false
DEBUG 15:25:41.090977 instance_exec.go:62: instance start()
DEBUG 15:25:41.090990 class.go:223: skip bdev conf file generation as no path set
SCM @ /mnt/daos0: 265 GB Total/260 GB Avail
DEBUG 15:25:41.091098 instance_exec.go:79: instance 0: awaiting DAOS I/O Engine init
DEBUG 15:25:41.091102 raft.go:152: entering follower state: follower=Node at 172.23.7.2:10001 [Follower] leader=
DEBUG 15:25:41.091142 exec.go:69: daos_engine:0 args: [-t 16 -x 3 -g daos_server -d /var/run/daos_server -s /mnt/daos0 -p 0 -I 0]
DEBUG 15:25:41.091174 exec.go:70: daos_engine:0 env: [OFI_PORT=31416 CRT_CTX_SHARE_ADDR=0 CRT_TIMEOUT=0 OFI_DOMAIN=mlx5_0 D_LOG_FILE=/tmp/daos_engine.0.log CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm OFI_INTERFACE=ib0]
Starting I/O Engine instance 0: /home/copoka/daos/install/bin/daos_engine
daos_engine:0 Using NUMA core allocation algorithm
DEBUG 15:25:43.855176 instance_drpc.go:49: DAOS I/O Engine instance 0 drpc ready: uri:"ofi+verbs;ofi_rxm://172.23.7.2:31416" nctxs:20 drpcListenerSock:"/var/run/daos_server/daos_engine_188279.sock" ntgts:16
DEBUG 15:25:43.856148 system.go:145: DAOS system join request: sys:"daos_server" uuid:"c30f4940-f257-4a76-b1d0-cb74c3a29077" uri:"ofi+verbs;ofi_rxm://172.23.7.2:31416" nctxs:20 addr:"0.0.0.0:10001" srvFaultDomain:"/swat7-02"
DEBUG 15:25:43.856324 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:43.859492 rpc.go:392: MS request error: not the DAOS Management Service leader (try or one of ); retrying after 0s
DEBUG 15:25:43.859708 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:43.861207 rpc.go:392: MS request error: not the DAOS Management Service leader (try or one of ); retrying after 250ms
DEBUG 15:25:44.086829 raft.go:214: heartbeat timeout reached, starting election: last-leader=
DEBUG 15:25:44.086907 raft.go:250: entering candidate state: node=Node at 172.23.7.2:10001 [Candidate] term=27
DEBUG 15:25:44.087048 raft.go:268: votes: needed=1
DEBUG 15:25:44.087144 raft.go:220: RequestVote(75.12.36.64:10001, 75.12.36.64:10001) req: &{RPCHeader:{ProtocolVersion:3} Term:27 Candidate:[49 55 50 46 50 51 46 55 46 50 58 49 48 48 48 49] LastLogIndex:20 LastLogTerm:25 LeadershipTransfer:false}
DEBUG 15:25:44.090843 raft.go:287: vote granted: from=75.12.36.64:10001 term=27 tally=1
DEBUG 15:25:44.090915 raft.go:292: election won: tally=1
DEBUG 15:25:44.090975 raft.go:363: entering leader state: leader=Node at 172.23.7.2:10001 [Leader]
DEBUG 15:25:44.091044 raft.go:474: added peer, starting replication: peer=75.12.36.64:10001
DEBUG 15:25:44.091154 database.go:458: node 172.23.7.2:10001 gained MS leader state
MS leader running on swat7-02
DEBUG 15:25:44.091234 mgmt_system.go:160: starting joinLoop
DEBUG 15:25:44.091827 raft.go:152: entering follower state: follower=Node at 172.23.7.2:10001 [Follower] leader=
DEBUG 15:25:44.091899 database.go:449: node 172.23.7.2:10001 lost MS leader state
MS leader no longer running on swat7-02
DEBUG 15:25:44.091986 mgmt_system.go:170: stopped joinLoop
DEBUG 15:25:44.111729 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:44.113210 rpc.go:392: MS request error: not the DAOS Management Service leader (try or one of ); retrying after 2.75s
DEBUG 15:25:46.863692 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:46.865213 rpc.go:392: MS request error: not the DAOS Management Service leader (try or one of ); retrying after 1s
DEBUG 15:25:47.027134 raft.go:214: heartbeat timeout reached, starting election: last-leader=
DEBUG 15:25:47.027205 raft.go:250: entering candidate state: node=Node at 172.23.7.2:10001 [Candidate] term=28
DEBUG 15:25:47.027329 raft.go:268: votes: needed=1
DEBUG 15:25:47.027411 raft.go:220: RequestVote(75.12.36.64:10001, 75.12.36.64:10001) req: &{RPCHeader:{ProtocolVersion:3} Term:28 Candidate:[49 55 50 46 50 51 46 55 46 50 58 49 48 48 48 49] LastLogIndex:21 LastLogTerm:27 LeadershipTransfer:false}
DEBUG 15:25:47.028143 raft.go:287: vote granted: from=75.12.36.64:10001 term=28 tally=1
DEBUG 15:25:47.028209 raft.go:292: election won: tally=1
DEBUG 15:25:47.028254 raft.go:363: entering leader state: leader=Node at 172.23.7.2:10001 [Leader]
DEBUG 15:25:47.865660 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:47.866773 mgmt_system.go:353: MgmtSvc.Join dispatch, req:&mgmt.JoinReq{Sys:"daos_server", Uuid:"c30f4940-f257-4a76-b1d0-cb74c3a29077", Rank:0x0, Uri:"ofi+verbs;ofi_rxm://172.23.7.2:31416", Nctxs:0x14, Addr:"0.0.0.0:10001",
SrvFaultDomain:"/swat7-02", Idx:0x0, XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}
instance 0 exited: context deadline exceeded
DEBUG 15:25:57.865829 rpc.go:223: parent context canceled -- tearing down client invoker
&&& RAS EVENT id: [engine_status_down] ts: [2021-05-31T15:25:57.865841+0900] host: [swat7-02] type: [STATE_CHANGE] sev: [ERROR] msg: [DAOS rank exited unexpectedly] pid: [186967] rank: [0]
DEBUG 15:25:57.865912 system.go:246: forwarding engine_status_down event to MS access points [172.23.7.2:10001] (seq: 1)
DEBUG 15:25:57.866901 system.go:208: DAOS cluster event request: sequence:1 event:<id:2 msg:"DAOS rank exited unexpectedly" timestamp:"2021-05-31T15:25:57.865841+0900" type:1 severity:3 hostname:"swat7-02" proc_id:186967 rank_state_info:<errored:true
error:"context deadline exceeded" > >
DEBUG 15:25:57.867166 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:57.868219 system.go:237: forwarding disabled for engine_status_down event