Re: Local server setup error


Rosenzweig, Joel B
 

Hello Erika,

 

I saw a similar issue, with the same failure signature, on the v1.1.3 build using verbs on Mellanox.  Please try upgrading your OFI to a more recent version.  DAOS 1.1.3 ships with OFI v1.11.1, which is about a year old; we are using OFI v1.12.0 now.  To switch, edit /utils/build.config and set the OFI string to:  OFI = v1.12.0

 

After you modify build.config, wipe the daos/build directory completely, and then do one of the following:

 

Remove at least daos/install/prereq/release/ofi  -- or -- wipe everything from daos/install/ except your etc/ directory, which holds your YAML configuration files.

 

Then perform a full rebuild and retest. 
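
If it helps, the edit-and-clean steps above can be scripted. The following is only a rough Python sketch, not part of the DAOS tooling: the function names set_ofi_version and clean_for_rebuild are made up for illustration, and it assumes build.config holds a plain "OFI = <version>" line and that build/ and install/ sit directly under your daos checkout.

```python
import re
import shutil
from pathlib import Path


def set_ofi_version(config_text: str, version: str) -> str:
    """Return config_text with the value on the 'OFI = ...' line replaced.

    Assumption: build.config uses simple 'KEY = value' lines.
    """
    pattern = re.compile(r"^([ \t]*OFI[ \t]*=[ \t]*).*$", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + version, config_text)
    if count == 0:
        raise ValueError("no 'OFI = ...' line found in config text")
    return new_text


def clean_for_rebuild(daos_root: Path, keep: tuple = ("etc",)) -> list:
    """Delete build/ and every entry of install/ whose name is not in keep.

    Returns the list of removed paths. This is destructive, so point it
    at the right checkout (hypothetical layout: <daos_root>/build and
    <daos_root>/install).
    """
    removed = []
    build_dir = daos_root / "build"
    if build_dir.exists():
        shutil.rmtree(build_dir)
        removed.append(build_dir)
    install_dir = daos_root / "install"
    if install_dir.is_dir():
        for entry in sorted(install_dir.iterdir()):
            if entry.name in keep:
                continue  # preserve etc/ with the yaml configuration files
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            removed.append(entry)
    return removed
```

You would read build.config, pass its text through set_ofi_version(text, "v1.12.0"), write it back, and then call clean_for_rebuild(Path("daos")) before running your usual full rebuild. The rebuild command itself is deliberately left out of the sketch.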

 

If that does not resolve your problem, then we can try some other troubleshooting measures to debug further.  There’s a good chance that this is all you need though.

 

Regards,

Joel

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of hayashi-erika@...
Sent: Tuesday, April 20, 2021 3:51 AM
To: daos@daos.groups.io
Subject: [daos] Local server setup error

 

Hello DAOS Community,

 

I'm having trouble running DAOS v1.1.3 on CentOS 7.9.

I'm trying to run a server and a client on a single node.

After formatting the DCPM, when I try to start the server, I get the following output:

 

$  daos_server start -o daos/utils/config/examples/daos_server_local.yml

DAOS Server config loaded from /home/USER/daos/utils/config/examples/daos_server_local.yml

daos_server logging to file /tmp/daos_server.log

DEBUG 18:18:04.617736 start.go:89: Switching control log level to DEBUG

DEBUG 18:18:04.742295 netdetect.go:279: 2 NUMA nodes detected with 18 cores per node

DEBUG 18:18:04.743438 netdetect.go:284: initDeviceScan completed.  Depth -5, numObj 11, systemDeviceNames [lo enp94s0f0 enp94s0f1 eno1 eno2 ib0 ib1 virbr0 virbr0-nic], hwlocDeviceNames [eno1 eno2 card0 controlD64 ib0 mlx5_0 enp94s0f0 enp94s0f1 sda ib1 mlx5_1]

DEBUG 18:18:04.743534 netdetect.go:913: Calling ValidateProviderConfig with ib0, ofi+verbs;ofi_rxm

DEBUG 18:18:04.743598 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm

DEBUG 18:18:04.744605 netdetect.go:995: There are 0 hfi1 devices in the system

DEBUG 18:18:04.744674 netdetect.go:928: Device ib0 supports provider: ofi+verbs;ofi_rxm

DEBUG 18:18:04.744740 netdetect.go:913: Calling ValidateProviderConfig with ib1, ofi+verbs;ofi_rxm

DEBUG 18:18:04.744775 netdetect.go:964: Input provider string: ofi+verbs;ofi_rxm

DEBUG 18:18:04.745406 netdetect.go:995: There are 0 hfi1 devices in the system

DEBUG 18:18:04.745486 netdetect.go:928: Device ib1 supports provider: ofi+verbs;ofi_rxm

DEBUG 18:18:04.746310 server.go:401: Active config saved to /home/USER/daos/utils/config/examples/.daos_server.active.yml (read-only)

DEBUG 18:18:04.746397 server.go:113: fault domain: /10.0.0.0_104810

DEBUG 18:18:04.746757 server.go:163: automatic NVMe prepare req: {ForwardableRequest:{Forwarded:false} HugePageCount:128 PCIWhitelist: PCIBlacklist: TargetUser:USER ResetOnly:false DisableVFIO:true DisableVMD:true}

DEBUG 18:18:12.057940 database.go:246: set db replica addr: 127.0.0.1:10001

DEBUG 18:18:12.190386 netdetect.go:279: 2 NUMA nodes detected with 18 cores per node

DEBUG 18:18:12.191763 netdetect.go:284: initDeviceScan completed.  Depth -5, numObj 11, systemDeviceNames [lo enp94s0f0 enp94s0f1 eno1 eno2 ib0 ib1 virbr0 virbr0-nic], hwlocDeviceNames [eno1 eno2 card0 controlD64 ib0 mlx5_0 enp94s0f0 enp94s0f1 sda ib1 mlx5_1]

DEBUG 18:18:12.191921 netdetect.go:669: Searching for a device alias for: ib0

DEBUG 18:18:12.192029 netdetect.go:693: Device alias for ib0 is mlx5_0

DEBUG 18:18:12.192225 class.go:196: spdk : bdev_list empty in config, no nvme.conf generated for server

DEBUG 18:18:12.192444 netdetect.go:669: Searching for a device alias for: ib1

DEBUG 18:18:12.192561 netdetect.go:693: Device alias for ib1 is mlx5_1

DEBUG 18:18:12.192653 class.go:196: spdk : bdev_list empty in config, no nvme.conf generated for server

DAOS Control Server v1.1.3 (pid 59341) listening on 0.0.0.0:10001

DEBUG 18:18:15.403381 instance_exec.go:35: instance 0: checking if storage is formatted

Checking DAOS I/O Engine instance 0 storage ...

DEBUG 18:18:15.403441 instance_exec.go:35: instance 1: checking if storage is formatted

DEBUG 18:18:15.403477 instance_storage.go:74: /mnt/daos: checking formatting

Checking DAOS I/O Engine instance 1 storage ...

DEBUG 18:18:15.403535 instance_storage.go:74: /mnt/daos1: checking formatting

DEBUG 18:18:19.835749 instance_storage.go:90: /mnt/daos1 (dcpm) needs format: false

DEBUG 18:18:19.835871 instance_storage.go:121: instance 1: no SCM format required; checking for superblock

DEBUG 18:18:19.835961 instance_superblock.go:90: /mnt/daos1: checking superblock

DEBUG 18:18:19.837041 instance_storage.go:127: instance 1: superblock not needed

DEBUG 18:18:19.837116 instance_exec.go:62: instance start()

DEBUG 18:18:19.837154 class.go:223: skip bdev conf file generation as no path set

SCM @ /mnt/daos1: 799 GB Total/783 GB Avail

DEBUG 18:18:19.837460 instance_exec.go:79: instance 1: awaiting DAOS I/O Engine init

DEBUG 18:18:19.837696 exec.go:72: daos_engine:1 args: [-t 1 -x 0 -f 17 -g daos_server -d /var/run/daos_server -s /mnt/daos1 -I 1]

DEBUG 18:18:19.837800 exec.go:73: daos_engine:1 env: [CRT_CTX_SHARE_ADDR=0 CRT_TIMEOUT=0 D_LOG_MASK=DEBUG D_LOG_FILE=/tmp/daos_engine.1.log CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm OFI_INTERFACE=ib1 OFI_PORT=31417 OFI_DOMAIN=mlx5_1]

Starting I/O server instance 1: /home/USER/daos/install/bin/daos_engine

DEBUG 18:18:19.846734 instance_storage.go:90: /mnt/daos (dcpm) needs format: false

DEBUG 18:18:19.846816 instance_storage.go:121: instance 0: no SCM format required; checking for superblock

DEBUG 18:18:19.846873 instance_superblock.go:90: /mnt/daos: checking superblock

DEBUG 18:18:19.847415 instance_storage.go:127: instance 0: superblock not needed

DEBUG 18:18:19.847502 database.go:334: system db start: isReplica: true, isBootstrap: true

DEBUG 18:18:19.848912 api.go:556: initial configuration: index=1 servers=[%+v [{Suffrage:Voter ID:127.0.0.1:10001 Address:127.0.0.1:10001}]]

DEBUG 18:18:19.849019 raft.go:154: isBootstrap: true, newDB: false

DEBUG 18:18:19.849079 instance_exec.go:62: instance start()

DEBUG 18:18:19.849118 class.go:223: skip bdev conf file generation as no path set

SCM @ /mnt/daos: 799 GB Total/783 GB Avail

DEBUG 18:18:19.849239 raft.go:152: entering follower state: follower=Node at 127.0.0.1:10001 [Follower] leader=

DEBUG 18:18:19.849341 instance_exec.go:79: instance 0: awaiting DAOS I/O Engine init

DEBUG 18:18:19.849527 exec.go:72: daos_engine:0 args: [-t 1 -x 0 -g daos_server -d /var/run/daos_server -s /mnt/daos -I 0]

DEBUG 18:18:19.849575 exec.go:73: daos_engine:0 env: [OFI_DOMAIN=mlx5_0 D_LOG_MASK=DEBUG D_LOG_FILE=/tmp/daos_engine.0.log CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm OFI_INTERFACE=ib0 OFI_PORT=31416 CRT_CTX_SHARE_ADDR=0 CRT_TIMEOUT=0]

Starting I/O server instance 0: /home/USER/daos/install/bin/daos_engine

daos_engine:1 Using legacy core allocation algorithm

daos_engine:0 Using legacy core allocation algorithm

ERROR: daos_engine:1 *** Process 60471 received signal 11 ***

Associated errno: Success (0)

Failing for address: 0x7fb154a65000

ERROR: daos_engine:1 /lib64/libpthread.so.0(+0xf630)[0x7fb1555f4630]

ERROR: daos_engine:1 /lib64/libc.so.6(+0x156918)[0x7fb154ac2918]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x3842f)[0x7fb14e59642f]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x3872e)[0x7fb14e59672e]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x37f06)[0x7fb14e595f06]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x39b7a)[0x7fb14e597b7a]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x5f10e)[0x7fb14e5bd10e]

ERROR: daos_engine:1 /home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x7088b)[0x7fb14e5ce88b]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/libna.so.2(+0xdab9)[0x7fb1537bdab9]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/libna.so.2(NA_Initialize_opt+0x3af)[0x7fb1537b422f]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/libmercury.so.2(+0xd03f)[0x7fb1539df03f]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/libmercury.so.2(HG_Core_init_opt+0xa)[0x7fb1539e4fda]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/libmercury.so.2(HG_Init_opt+0x7b)[0x7fb1539d79bb]

ERROR: daos_engine:1 /home/USER/daos/install/lib64/libcart.so.4(+0x4c92a)[0x7fb15646c92a]

/home/USER/daos/install/lib64/libcart.so.4(crt_hg_ctx_init+0x388)[0x7fb15646dce8]

/home/USER/daos/install/lib64/libcart.so.4(crt_context_create+0x40a)[0x7fb15643a6ca]

/home/USER/daos/install/bin/daos_engine[0x420b58]

ERROR: daos_engine:1 /home/USER/daos/install/bin/../prereq/release/argobots/lib/libabt.so.0(+0x1317b)[0x7fb1553d617b]

/home/USER/daos/install/bin/../prereq/release/argobots/lib/libabt.so.0(+0x13851)[0x7fb1553d6851]

ERROR: daos_engine:0 *** Process 60472 received signal 11 ***

Associated errno: Success (0)

Failing for address: 0x7f2093174000

ERROR: daos_engine:0 /lib64/libpthread.so.0(+0xf630)[0x7f2093d03630]

ERROR: daos_engine:0 /lib64/libc.so.6(+0x156918)[0x7f20931d1918]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x3842f)[0x7f208cca542f]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x3872e)[0x7f208cca572e]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x37f06)[0x7f208cca4f06]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x39b7a)[0x7f208cca6b7a]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x5f10e)[0x7f208cccc10e]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1(+0x7088b)[0x7f208ccdd88b]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/libna.so.2(+0xdab9)[0x7f2091eccab9]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/libna.so.2(NA_Initialize_opt+0x3af)[0x7f2091ec322f]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/libmercury.so.2(+0xd03f)[0x7f20920ee03f]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/libmercury.so.2(HG_Core_init_opt+0xa)[0x7f20920f3fda]

/home/USER/daos/install/lib64/../prereq/release/mercury/lib/libmercury.so.2(HG_Init_opt+0x7b)[0x7f20920e69bb]

ERROR: daos_engine:0 /home/USER/daos/install/lib64/libcart.so.4(+0x4c92a)[0x7f2094b7b92a]

/home/USER/daos/install/lib64/libcart.so.4(crt_hg_ctx_init+0x388)[0x7f2094b7cce8]

/home/USER/daos/install/lib64/libcart.so.4(crt_context_create+0x40a)[0x7f2094b496ca]

/home/USER/daos/install/bin/daos_engine[0x420b58]

/home/USER/daos/install/bin/../prereq/release/argobots/lib/libabt.so.0(+0x1317b)[0x7f2093ae517b]

/home/USER/daos/install/bin/../prereq/release/argobots/lib/libabt.so.0(+0x13851)[0x7f2093ae5851]

instance 0 exited: instance 0 exited prematurely: /home/USER/daos/install/bin/daos_engine (instance 0) exited: signal: segmentation fault (core dumped)

ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)

DEBUG 18:18:20.292988 system.go:237: forwarding engine_status_down event to MS access points [localhost:10001] (seq: 1)

&&& RAS EVENT id: [engine_status_down] ts: [2021-04-19T18:18:20.292879+0900] host: [10.0.0.0_104810] type: [STATE_CHANGE] sev: [ERROR] msg: [DAOS rank exited unexpectedly] pid: [59341] rank: [0]

DEBUG 18:18:20.294097 system.go:202: DAOS cluster event request: sequence:1 event:<id:2 msg:"DAOS rank exited unexpectedly" timestamp:"2021-04-19T18:18:20.292879+0900" type:1 severity:3 hostname:"10.0.0.0_104810" proc_id:59341 rank_state_info:<errored:true error:"instance 0 exited prematurely: /home/USER/daos/install/bin/daos_engine (instance 0) exited: signal: segmentation fault (core dumped)" > >

DEBUG 18:18:20.294315 rpc.go:196: request hosts: [localhost:10001]

DEBUG 18:18:20.299667 rpc.go:380: MS request error: not the DAOS Management Service leader (try  or one of ); retrying after 0s

DEBUG 18:18:20.299845 rpc.go:196: request hosts: [localhost:10001]

DEBUG 18:18:20.301829 rpc.go:380: MS request error: not the DAOS Management Service leader (try  or one of ); retrying after 1.25s

instance 1 exited: instance 1 exited prematurely: /home/USER/daos/install/bin/daos_engine (instance 1) exited: signal: segmentation fault (core dumped)

ERROR: removing socket file: removing instance 1 socket file: no dRPC client set (data plane not started?)

DEBUG 18:18:21.174777 system.go:237: forwarding engine_status_down event to MS access points [localhost:10001] (seq: 2)

&&& RAS EVENT id: [engine_status_down] ts: [2021-04-19T18:18:21.174668+0900] host: [10.0.0.0_104810] type: [STATE_CHANGE] sev: [ERROR] msg: [DAOS rank exited unexpectedly] pid: [59341] rank: [1]

DEBUG 18:18:21.175125 system.go:202: DAOS cluster event request: sequence:2 event:<id:2 msg:"DAOS rank exited unexpectedly" timestamp:"2021-04-19T18:18:21.174668+0900" type:1 severity:3 hostname:"10.0.0.0_104810" rank:1 proc_id:59341 rank_state_info:<instance:1 errored:true error:"instance 1 exited prematurely: /home/USER/daos/install/bin/daos_engine (instance 1) exited: signal: segmentation fault (core dumped)" > >

DEBUG 18:18:21.175325 rpc.go:196: request hosts: [localhost:10001]

DEBUG 18:18:21.177949 rpc.go:380: MS request error: not the DAOS Management Service leader (try  or one of ); retrying after 0s

DEBUG 18:18:21.178188 rpc.go:196: request hosts: [localhost:10001]

DEBUG 18:18:21.182118 rpc.go:380: MS request error: not the DAOS Management Service leader (try  or one of ); retrying after 1.75s

DEBUG 18:18:21.552634 rpc.go:196: request hosts: [localhost:10001]

DEBUG 18:18:21.555054 rpc.go:380: MS request error: not the DAOS Management Service leader (try  or one of ); retrying after 1.75s

DEBUG 18:18:22.933217 rpc.go:196: request hosts: [localhost:10001]

DEBUG 18:18:22.935642 rpc.go:380: MS request error: not the DAOS Management Service leader (try  or one of ); retrying after 2.75s

DEBUG 18:18:23.157591 raft.go:214: heartbeat timeout reached, starting election: last-leader=

DEBUG 18:18:23.157720 raft.go:250: entering candidate state: node=Node at 127.0.0.1:10001 [Candidate] term=4

DEBUG 18:18:23.158132 raft.go:268: votes: needed=1

DEBUG 18:18:23.158193 raft.go:287: vote granted: from=127.0.0.1:10001 term=4 tally=1

DEBUG 18:18:23.158236 raft.go:292: election won: tally=1

DEBUG 18:18:23.158289 raft.go:363: entering leader state: leader=Node at 127.0.0.1:10001 [Leader]

DEBUG 18:18:23.158506 database.go:414: node 127.0.0.1:10001 gained MS leader state

MS leader running on 10.0.0.0_104810

DEBUG 18:18:23.158614 mgmt_system.go:148: starting joinLoop

DEBUG 18:18:23.305370 rpc.go:196: request hosts: [localhost:10001]

DEBUG 18:18:23.307429 membership.go:451: processing RAS event "DAOS rank exited unexpectedly" from rank 0 on host "10.0.0.0_104810"

ERROR: updating member states: unable to find member with rank 0

DEBUG 18:18:25.686152 rpc.go:196: request hosts: [localhost:10001]

DEBUG 18:18:25.689836 membership.go:451: processing RAS event "DAOS rank exited unexpectedly" from rank 1 on host "10.0.0.0_104810"

ERROR: updating member states: unable to find member with rank 1

 

I have attached the server configuration file. Any help would be much appreciated.

Sincerely,
Erika Hayashi
