Re: Trying to solve "DAOS rank exited unexpectedly"


Lombardi, Johann
 

Hi there,

 

Any errors in the engine logs?

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of Peter <magpiesaresoawesome@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday 31 May 2021 at 08:42
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Trying to solve "DAOS rank exited unexpectedly"

 

[Edited Message Follows]

Hello all,

I've been experiencing problems trying to run DAOS 1.2. For the moment, I'm just trying to run a single server, and DAOS runs correctly when using Ethernet. However, when switching to an Infiniband configuration, I am met with the error attached below ("/tmp/daos_engine.0.log" remains empty).

Any help you can provide would be much appreciated, and please let me know if there is any more info I can provide. This is on Ubuntu 20.04, using the latest MLNX drivers (MLNX_OFED_LINUX-5.3-1.0.0.1).

------------------------- OUTPUT -------------------------
copoka@swat7-02:~$ daos_server start -i -o daos_server_ib.yml

DAOS Server config loaded from /home/copoka/daos_server_ib.yml
daos_server logging to file /tmp/daos_server.log
DEBUG 15:25:33.517796 start.go:89: Switching control log level to DEBUG
DEBUG 15:25:33.579954 netdetect.go:279: 2 NUMA nodes detected with 20 cores per node
DEBUG 15:25:33.580176 netdetect.go:284: initDeviceScan completed.  Depth -6, numObj 8, systemDeviceNames [lo eno1 eno2 ib0], hwlocDeviceNames [sda card0 ib0 mlx5_0 eno1 eno2 pmem0 pmem1]
DEBUG 15:25:33.580206 netdetect.go:913: Calling ValidateProviderConfig with ib0, ofi+verbs;ofi_rxm
DEBUG 15:25:33.580223 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm
DEBUG 15:25:33.580363 netdetect.go:995: There are 0 hfi1 devices in the system
DEBUG 15:25:33.580392 netdetect.go:572: There are 2 NUMA nodes.
DEBUG 15:25:33.580413 netdetect.go:928: Device ib0 supports provider: ofi+verbs;ofi_rxm
DEBUG 15:25:33.580440 netdetect.go:1073: Validate network config for numaNode: 0
DEBUG 15:25:33.580460 netdetect.go:572: There are 2 NUMA nodes.
DEBUG 15:25:33.580479 netdetect.go:1102: The NUMA node for device ib0 matches the provided value 0.  Network configuration is valid.
DEBUG 15:25:33.581049 server.go:401: Active config saved to /home/copoka/.daos_server.active.yml (read-only)
DEBUG 15:25:33.581078 server.go:113: fault domain: /swat7-02
DEBUG 15:25:33.581279 server.go:163: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:128 DisableCleanHugePages:false PCIWhitelist: PCIBlacklist: TargetUser:copoka ResetOnly:false DisableVFIO:false DisableVMD:true}
DEBUG 15:25:40.008657 database.go:277: set db replica addr: 172.23.7.2:10001
DEBUG 15:25:40.115490 netdetect.go:279: 2 NUMA nodes detected with 20 cores per node
DEBUG 15:25:40.115717 netdetect.go:284: initDeviceScan completed.  Depth -6, numObj 8, systemDeviceNames [lo eno1 eno2 ib0], hwlocDeviceNames [sda card0 ib0 mlx5_0 eno1 eno2 pmem0 pmem1]
DEBUG 15:25:40.115786 netdetect.go:669: Searching for a device alias for: ib0
DEBUG 15:25:40.115824 netdetect.go:693: Device alias for ib0 is mlx5_0
DEBUG 15:25:40.115874 class.go:196: spdk nvme: bdev_list empty in config, no nvme.conf generated for server
DEBUG 15:25:40.647130 provider.go:217: bdev scan: update cache (4 devices)
DAOS Control Server v1.2 (pid 186967) listening on 0.0.0.0:10001
DEBUG 15:25:40.647841 instance_exec.go:35: instance 0: checking if storage is formatted
Checking DAOS I/O Engine instance 0 storage ...
DEBUG 15:25:40.647958 instance_storage.go:74: /mnt/daos0: checking formatting
DEBUG 15:25:41.090347 instance_storage.go:90: /mnt/daos0 (dcpm) needs format: false
DEBUG 15:25:41.090374 instance_storage.go:121: instance 0: no SCM format required; checking for superblock
DEBUG 15:25:41.090392 instance_superblock.go:90: /mnt/daos0: checking superblock
DEBUG 15:25:41.090576 instance_storage.go:127: instance 0: superblock not needed
DEBUG 15:25:41.090601 database.go:372: system db start: isReplica: true, isBootstrap: true
DEBUG 15:25:41.090932 api.go:556: initial configuration: index=1 servers=[%+v [{Suffrage:Voter ID:75.12.36.64:10001 Address:75.12.36.64:10001}]]
DEBUG 15:25:41.090961 raft.go:173: isBootstrap: true, newDB: false
DEBUG 15:25:41.090977 instance_exec.go:62: instance start()
DEBUG 15:25:41.090990 class.go:223: skip bdev conf file generation as no path set
SCM @ /mnt/daos0: 265 GB Total/260 GB Avail
DEBUG 15:25:41.091098 instance_exec.go:79: instance 0: awaiting DAOS I/O Engine init
DEBUG 15:25:41.091102 raft.go:152: entering follower state: follower=Node at 172.23.7.2:10001 [Follower] leader=
DEBUG 15:25:41.091142 exec.go:69: daos_engine:0 args: [-t 16 -x 3 -g daos_server -d /var/run/daos_server -s /mnt/daos0 -p 0 -I 0]
DEBUG 15:25:41.091174 exec.go:70: daos_engine:0 env: [OFI_PORT=31416 CRT_CTX_SHARE_ADDR=0 CRT_TIMEOUT=0 OFI_DOMAIN=mlx5_0 D_LOG_FILE=/tmp/daos_engine.0.log CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm OFI_INTERFACE=ib0]
Starting I/O Engine instance 0: /home/copoka/daos/install/bin/daos_engine
daos_engine:0 Using NUMA core allocation algorithm
DEBUG 15:25:43.855176 instance_drpc.go:49: DAOS I/O Engine instance 0 drpc ready: uri:"ofi+verbs;ofi_rxm://172.23.7.2:31416" nctxs:20 drpcListenerSock:"/var/run/daos_server/daos_engine_188279.sock" ntgts:16
DEBUG 15:25:43.856148 system.go:145: DAOS system join request: sys:"daos_server" uuid:"c30f4940-f257-4a76-b1d0-cb74c3a29077" uri:"ofi+verbs;ofi_rxm://172.23.7.2:31416" nctxs:20 addr:"0.0.0.0:10001" srvFaultDomain:"/swat7-02"
DEBUG 15:25:43.856324 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:43.859492 rpc.go:392: MS request error: not the DAOS Management Service leader (try  or one of ); retrying after 0s
DEBUG 15:25:43.859708 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:43.861207 rpc.go:392: MS request error: not the DAOS Management Service leader (try  or one of ); retrying after 250ms
DEBUG 15:25:44.086829 raft.go:214: heartbeat timeout reached, starting election: last-leader=
DEBUG 15:25:44.086907 raft.go:250: entering candidate state: node=Node at 172.23.7.2:10001 [Candidate] term=27
DEBUG 15:25:44.087048 raft.go:268: votes: needed=1
DEBUG 15:25:44.087144 raft.go:220: RequestVote(75.12.36.64:10001, 75.12.36.64:10001) req: &{RPCHeader:{ProtocolVersion:3} Term:27 Candidate:[49 55 50 46 50 51 46 55 46 50 58 49 48 48 48 49] LastLogIndex:20 LastLogTerm:25 LeadershipTransfer:false}
DEBUG 15:25:44.090843 raft.go:287: vote granted: from=75.12.36.64:10001 term=27 tally=1
DEBUG 15:25:44.090915 raft.go:292: election won: tally=1
DEBUG 15:25:44.090975 raft.go:363: entering leader state: leader=Node at 172.23.7.2:10001 [Leader]
DEBUG 15:25:44.091044 raft.go:474: added peer, starting replication: peer=75.12.36.64:10001
DEBUG 15:25:44.091154 database.go:458: node 172.23.7.2:10001 gained MS leader state
MS leader running on swat7-02
DEBUG 15:25:44.091234 mgmt_system.go:160: starting joinLoop
DEBUG 15:25:44.091827 raft.go:152: entering follower state: follower=Node at 172.23.7.2:10001 [Follower] leader=
DEBUG 15:25:44.091899 database.go:449: node 172.23.7.2:10001 lost MS leader state
MS leader no longer running on swat7-02
DEBUG 15:25:44.091986 mgmt_system.go:170: stopped joinLoop
DEBUG 15:25:44.111729 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:44.113210 rpc.go:392: MS request error: not the DAOS Management Service leader (try  or one of ); retrying after 2.75s
DEBUG 15:25:46.863692 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:46.865213 rpc.go:392: MS request error: not the DAOS Management Service leader (try  or one of ); retrying after 1s
DEBUG 15:25:47.027134 raft.go:214: heartbeat timeout reached, starting election: last-leader=
DEBUG 15:25:47.027205 raft.go:250: entering candidate state: node=Node at 172.23.7.2:10001 [Candidate] term=28
DEBUG 15:25:47.027329 raft.go:268: votes: needed=1
DEBUG 15:25:47.027411 raft.go:220: RequestVote(75.12.36.64:10001, 75.12.36.64:10001) req: &{RPCHeader:{ProtocolVersion:3} Term:28 Candidate:[49 55 50 46 50 51 46 55 46 50 58 49 48 48 48 49] LastLogIndex:21 LastLogTerm:27 LeadershipTransfer:false}
DEBUG 15:25:47.028143 raft.go:287: vote granted: from=75.12.36.64:10001 term=28 tally=1
DEBUG 15:25:47.028209 raft.go:292: election won: tally=1
DEBUG 15:25:47.028254 raft.go:363: entering leader state: leader=Node at 172.23.7.2:10001 [Leader]
DEBUG 15:25:47.865660 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:47.866773 mgmt_system.go:353: MgmtSvc.Join dispatch, req:&mgmt.JoinReq{Sys:"daos_server", Uuid:"c30f4940-f257-4a76-b1d0-cb74c3a29077", Rank:0x0, Uri:"ofi+verbs;ofi_rxm://172.23.7.2:31416", Nctxs:0x14, Addr:"0.0.0.0:10001", SrvFaultDomain:"/swat7-02", Idx:0x0, XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}
instance 0 exited: context deadline exceeded
DEBUG 15:25:57.865829 rpc.go:223: parent context canceled -- tearing down client invoker
&&& RAS EVENT id: [engine_status_down] ts: [2021-05-31T15:25:57.865841+0900] host: [swat7-02] type: [STATE_CHANGE] sev: [ERROR] msg: [DAOS rank exited unexpectedly] pid: [186967] rank: [0]
DEBUG 15:25:57.865912 system.go:246: forwarding engine_status_down event to MS access points [172.23.7.2:10001] (seq: 1)
DEBUG 15:25:57.866901 system.go:208: DAOS cluster event request: sequence:1 event:<id:2 msg:"DAOS rank exited unexpectedly" timestamp:"2021-05-31T15:25:57.865841+0900" type:1 severity:3 hostname:"swat7-02" proc_id:186967 rank_state_info:<errored:true error:"context deadline exceeded" > >
DEBUG 15:25:57.867166 rpc.go:196: request hosts: [172.23.7.2:10001]
DEBUG 15:25:57.868219 system.go:237: forwarding disabled for engine_status_down event

---------------------------------------------------------------------
Intel Corporation SAS (French simplified joint stock company)
Registered headquarters: "Les Montalets"- 2, rue de Paris,
92196 Meudon Cedex, France
Registration Number:  302 456 199 R.C.S. NANTERRE
Capital: 4,572,000 Euros

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Join daos@daos.groups.io to automatically receive all group messages.