DAOS_test failed
anton.brekhov@...
Hi everyone! I'm trying to set up a simple cluster of two nodes connected over an IB network. One node has PMEM and the other is the client (for the DAOS agent). I installed the v1.0.1 RPMs for CentOS 7. Here is the DAOS server config:
name: daos_server
access_points: ['apache512']
#access_points: ['localhost']
port: 10001
#provider: ofi+sockets
provider: ofi+verbs;ofi_rxm
nr_hugepages: 4096
control_log_file: /tmp/daos_control.log
transport_config:
  allow_insecure: true
servers:
-
  targets: 4
  first_core: 0
  nr_xs_helpers: 0
  fabric_iface: ib0
  fabric_iface_port: 31416
  log_mask: DEBUG
  log_file: /tmp/daos_server.log
  env_vars:
  - DAOS_MD_CAP=1024
  - CRT_CTX_SHARE_ADDR=0
  - CRT_TIMEOUT=30
  - FI_SOCKETS_MAX_CONN_RETRY=1
  - FI_SOCKETS_CONN_TIMEOUT=2000
  #- OFI_INTERFACE=ib0
  #- OFI_DOMAIN=mlx5_0
  #- CRT_PHY_ADDR_STR=ofi+verbs
  # Storage definitions
  # When scm_class is set to ram, tmpfs will be used to emulate SCM.
  # The size of ram is specified by scm_size in GB units.
  scm_mount: /mnt/daos # map to -s /mnt/daos
  #scm_class: ram
  #scm_size: 8
  scm_class: dcpm
  scm_list: [/dev/pmem0]
  bdev_class: nvme
  bdev_list: ["0000:b1:00.0","0000:b2:00.0","0000:b3:00.0","0000:b4:00.0"]

name: daos_server
access_points: ['apache512']
port: 10001
runtime_dir: /var/run/daos_agent

export CRT_TIMEOUT=5
export OFI_INTERFACE=ib0
export OFI_DOMAIN=mlx_5

[root@sky08 ~]# mpirun -np 1 --allow-run-as-root /tmp/daos_test
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI detected:
  Local host: sky08
  Device name: mlx5_0
  Device vendor ID: 0x02c9
  Device vendor part ID: 4123
Default device parameters will be used, which may result in lower performance. You can edit any of the files specified by the btl_openib_device_param_files MCA parameter to set values for your device.
NOTE: You can turn off this warning by setting the MCA parameter btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port.
  Local host: sky08
  Local device: i40iw0
  Local port: 1
  CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
daos_init() failed with -1003
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
  Process name: [[27909,1],0]
  Exit code: 255
--------------------------------------------------------------------------
[sky08:23537] 2 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[sky08:23537] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

As I understand from here (https://github.com/daos-stack/cart/blob/master/src/include/gurt/errno.h), error 1003 means an invalid parameter. What could be wrong? The fi_pingpong test is OK...
Thanks!
Anton Brekhov
Lombardi, Johann
Hi Anton,
Since you use 1.0.1, you also need to set CRT_PHY_ADDR_STR (not required any longer in 1.2/master) as follows:
export CRT_PHY_ADDR_STR="ofi+verbs;ofi_rxm"
Hope this helps.
Cheers, Johann
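A quick way to confirm that the client environment carries the variables discussed in this thread before launching daos_test. This is only a sketch: the function name check_daos_client_env is made up for illustration and is not part of DAOS, and the list of three variables is taken from the posts here, not from official documentation.

```shell
# Hypothetical helper (not part of DAOS): report which of the client-side
# variables mentioned in this thread are unset or empty.
# Prints one variable name per line and returns non-zero if any is missing.
check_daos_client_env() {
    missing=0
    for var in CRT_PHY_ADDR_STR OFI_INTERFACE OFI_DOMAIN; do
        # Indirect lookup; the names come only from the fixed list above.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_daos_client_env in the same shell that runs mpirun prints nothing when all three are set. Note that it only checks presence, so a misspelled value would still pass.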
anton.brekhov@...
Johann, thanks! I've set export CRT_PHY_ADDR_STR="ofi+verbs;ofi_rxm" on both the storage server and the client, and now I get error -1020:

export OFI_INTERFACE=ib0
export CRT_PHY_ADDR_STR="ofi+verbs;ofi_rxm"
export OFI_DOMAIN=mlx_5

mpirun -np 1 --allow-run-as-root /tmp/daos_test
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI detected:
  Local host: sky08
  Device name: mlx5_0
  Device vendor ID: 0x02c9
  Device vendor part ID: 4123
Default device parameters will be used, which may result in lower performance. You can edit any of the files specified by the btl_openib_device_param_files MCA parameter to set values for your device.
NOTE: You can turn off this warning by setting the MCA parameter btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
[sky08:29616] [[17732,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 501
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port.
  Local host: sky08
  Local device: i40iw0
  Local port: 1
  CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
daos_init() failed with -1020
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
  Process name: [[17732,1],0]
  Exit code: 255
--------------------------------------------------------------------------
[sky08:29616] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[sky08:29616] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Lombardi, Johann
Could you please try with export OFI_DOMAIN="mlx5_0"? If it does not work, please collect the DAOS logs (i.e. /tmp/daos.log).
Cheers, Johann
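The difference between mlx_5 and mlx5_0 is easy to miss by eye. A minimal sketch for catching it, assuming that the verbs domain name matches an RDMA device name as listed under /sys/class/infiniband on Linux (this matches the "Device name: mlx5_0" line in the Open MPI output above). The function names are made up for illustration; they are not DAOS or libfabric tools.

```shell
# Hypothetical helpers: list RDMA device names from sysfs and check a
# candidate OFI_DOMAIN value against them. The optional directory argument
# exists only so the functions can be exercised against a scratch directory.
list_rdma_devices() {
    dir="${1:-/sys/class/infiniband}"
    [ -d "$dir" ] || return 1
    ls -1 "$dir"
}

# Succeeds only if the domain name exactly matches one listed device.
check_ofi_domain() {
    domain="$1"
    dir="${2:-/sys/class/infiniband}"
    list_rdma_devices "$dir" | grep -Fxq -- "$domain"
}
```

For example, check_ofi_domain "$OFI_DOMAIN" || echo "no such device: $OFI_DOMAIN" would flag the typo before daos_init() is reached. Running fi_info -p verbs is another way to see the domain names libfabric itself reports.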
mhennecke@...
Hi,
we run 1.0.1 with these settings:
# DAOS 1.0.1 client-side settings for Mellanox InfiniBand:
export OFI_INTERFACE="ib0"
export CRT_PHY_ADDR_STR="ofi+verbs;ofi_rxm"
export OFI_DOMAIN="mlx5_0"
What does “daos_server network scan” say?
Mit freundlichen Grüssen / Best regards,
Michael Hennecke Chief Technologist - HPC Storage & Networking -- Lenovo Global Technology (Germany) GmbH * Am Zehnthof 77 * D-45307 Essen * Germany Geschäftsführung: Colm Gleeson, Christophe Laurent * Sitz der Gesellschaft: Stuttgart * HRB-Nr.: 758298, AG Stuttgart
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of anton.brekhov@...
UPDATE: I fixed OFI_DOMAIN to mlx5_0. Here is the new error:

[root@sky08 ~]# mpirun -np 1 --allow-run-as-root /tmp/daos_test
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI detected:
  Local host: sky08
  Device name: mlx5_0
  Device vendor ID: 0x02c9
  Device vendor part ID: 4123
Default device parameters will be used, which may result in lower performance. You can edit any of the files specified by the btl_openib_device_param_files MCA parameter to set values for your device.
NOTE: You can turn off this warning by setting the MCA parameter btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port.
  Local host: sky08
  Local device: i40iw0
  Local port: 1
  CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
================= DAOS management tests.. =====================
[==========] Running 5 test(s).
[ RUN ] MGMT1: create/destroy pool on all tgts
/tmp/daos_test: symbol lookup error: /tmp/daos_test: undefined symbol: dmg_pool_create
creating pool synchronously ...
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
  Process name: [[48263,1],0]
  Exit code: 127
--------------------------------------------------------------------------
[sky08:35443] 2 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[sky08:35443] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

09/11-13:11:00.14 sky08 DAOS[28284/28284] hg ERR src/cart/crt_hg.c:534 crt_hg_init() Could not initialize NA class.
09/11-13:11:00.14 sky08 DAOS[28284/28284] crt ERR src/cart/crt_init.c:409 crt_init_opt() crt_hg_init failed rc: -1020.
09/11-13:11:00.14 sky08 DAOS[28284/28284] crt ERR src/cart/crt_init.c:477 crt_init_opt() crt_init failed, rc: -1020.
09/11-13:11:00.14 sky08 DAOS[28284/28284] fi WARN src/gurt/fault_inject.c:685 d_fault_inject_fini() Fault Injection not finalized feature not included in build
09/11-13:11:00.14 sky08 DAOS[28284/28284] client ERR src/client/api/event.c:93 daos_eq_lib_init() failed to initialize crt: DER_HG(-1020)
09/11-13:11:00.14 sky08 DAOS[28284/28284] client ERR src/client/api/init.c:160 daos_init() failed to initialize eq_lib: DER_HG(-1020)
09/11-13:11:00.14 sky08 DAOS[28284/28284] fi WARN src/gurt/fault_inject.c:685 d_fault_inject_fini() Fault Injection not finalized feature not included in build
09/11-13:23:18.51 sky08 DAOS[29621/29621] fi WARN src/gurt/fault_inject.c:679 d_fault_inject_init() Fault Injection not initialized feature not included in build
09/11-13:23:18.51 sky08 DAOS[29621/29621] fi WARN src/gurt/fault_inject.c:716 d_fault_attr_set() Fault Injection attr not set feature not included in build
09/11-13:23:18.51 sky08 DAOS[29621/29621] crt INFO src/cart/crt_init.c:278 crt_init_opt() libcart version 4.8.0 initializing
09/11-13:23:18.51 sky08 DAOS[29621/29621] fi WARN src/gurt/fault_inject.c:679 d_fault_inject_init() Fault Injection not initialized feature not included in build
09/11-13:23:18.51 sky08 DAOS[29621/29621] crt WARN src/cart/crt_init.c:170 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
09/11-13:23:18.51 sky08 DAOS[29621/29621] crt WARN src/cart/crt_init.c:389 crt_init_opt() FI_OFI_RXM_USE_SRX not set; set=1
09/11-13:23:18.51 sky08 DAOS[29621/29621] hg ERR # NA -- Error -- /builddir/build/BUILD/mercury-2.0.0a1/src/na/na_ofi.c:1685
 # na_ofi_domain_open(): No provider found for "verbs;ofi_rxm" provider on domain "mlx_5"
09/11-13:23:18.51 sky08 DAOS[29621/29621] hg ERR # NA -- Error -- /builddir/build/BUILD/mercury-2.0.0a1/src/na/na_ofi.c:3150
 # na_ofi_initialize(): Could not open domain for verbs;ofi_rxm, mlx_5
09/11-13:23:18.51 sky08 DAOS[29621/29621] hg ERR # NA -- Error -- /builddir/build/BUILD/mercury-2.0.0a1/src/na/na.c:312
 # NA_Initialize_opt(): Could not initialize plugin
09/11-13:23:18.51 sky08 DAOS[29621/29621] hg ERR src/cart/crt_hg.c:534 crt_hg_init() Could not initialize NA class.
09/11-13:23:18.51 sky08 DAOS[29621/29621] crt ERR src/cart/crt_init.c:409 crt_init_opt() crt_hg_init failed rc: -1020.
09/11-13:23:18.51 sky08 DAOS[29621/29621] crt ERR src/cart/crt_init.c:477 crt_init_opt() crt_init failed, rc: -1020.
09/11-13:23:18.51 sky08 DAOS[29621/29621] fi WARN src/gurt/fault_inject.c:685 d_fault_inject_fini() Fault Injection not finalized feature not included in build
09/11-13:23:18.51 sky08 DAOS[29621/29621] client ERR src/client/api/event.c:93 daos_eq_lib_init() failed to initialize crt: DER_HG(-1020)
09/11-13:23:18.51 sky08 DAOS[29621/29621] client ERR src/client/api/init.c:160 daos_init() failed to initialize eq_lib: DER_HG(-1020)
09/11-13:23:18.51 sky08 DAOS[29621/29621] fi WARN src/gurt/fault_inject.c:685 d_fault_inject_fini() Fault Injection not finalized feature not included in build
09/11-13:26:13.98 sky08 DAOS[29976/29976] fi WARN src/gurt/fault_inject.c:679 d_fault_inject_init() Fault Injection not initialized feature not included in build
09/11-13:26:13.98 sky08 DAOS[29976/29976] fi WARN src/gurt/fault_inject.c:716 d_fault_attr_set() Fault Injection attr not set feature not included in build
09/11-13:26:13.98 sky08 DAOS[29976/29976] crt INFO src/cart/crt_init.c:278 crt_init_opt() libcart version 4.8.0 initializing
09/11-13:26:13.98 sky08 DAOS[29976/29976] fi WARN src/gurt/fault_inject.c:679 d_fault_inject_init() Fault Injection not initialized feature not included in build
09/11-13:26:13.98 sky08 DAOS[29976/29976] crt WARN src/cart/crt_init.c:170 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
09/11-13:26:13.98 sky08 DAOS[29976/29976] crt WARN src/cart/crt_init.c:389 crt_init_opt() FI_OFI_RXM_USE_SRX not set; set=1
09/11-13:26:13.98 sky08 DAOS[29976/29976] hg ERR # NA -- Error -- /builddir/build/BUILD/mercury-2.0.0a1/src/na/na_ofi.c:1685
 # na_ofi_domain_open(): No provider found for "verbs;ofi_rxm" provider on domain "mlx_5"
09/11-13:26:13.98 sky08 DAOS[29976/29976] hg ERR # NA -- Error -- /builddir/build/BUILD/mercury-2.0.0a1/src/na/na_ofi.c:3150
 # na_ofi_initialize(): Could not open domain for verbs;ofi_rxm, mlx_5
09/11-13:26:13.98 sky08 DAOS[29976/29976] hg ERR # NA -- Error -- /builddir/build/BUILD/mercury-2.0.0a1/src/na/na.c:312
 # NA_Initialize_opt(): Could not initialize plugin
09/11-13:26:13.98 sky08 DAOS[29976/29976] hg ERR src/cart/crt_hg.c:534 crt_hg_init() Could not initialize NA class.
09/11-13:26:13.98 sky08 DAOS[29976/29976] crt ERR src/cart/crt_init.c:409 crt_init_opt() crt_hg_init failed rc: -1020.
09/11-13:26:13.98 sky08 DAOS[29976/29976] crt ERR src/cart/crt_init.c:477 crt_init_opt() crt_init failed, rc: -1020.
09/11-13:26:13.98 sky08 DAOS[29976/29976] fi WARN src/gurt/fault_inject.c:685 d_fault_inject_fini() Fault Injection not finalized feature not included in build
09/11-13:26:13.98 sky08 DAOS[29976/29976] client ERR src/client/api/event.c:93 daos_eq_lib_init() failed to initialize crt: DER_HG(-1020)
09/11-13:26:13.98 sky08 DAOS[29976/29976] client ERR src/client/api/init.c:160 daos_init() failed to initialize eq_lib: DER_HG(-1020)
09/11-13:26:13.98 sky08 DAOS[29976/29976] fi WARN src/gurt/fault_inject.c:685 d_fault_inject_fini() Fault Injection not finalized feature not included in build
09/11-14:17:47.58 sky08 DAOS[35448/35448] fi WARN src/gurt/fault_inject.c:679 d_fault_inject_init() Fault Injection not initialized feature not included in build
09/11-14:17:47.58 sky08 DAOS[35448/35448] fi WARN src/gurt/fault_inject.c:716 d_fault_attr_set() Fault Injection attr not set feature not included in build
09/11-14:17:47.59 sky08 DAOS[35448/35448] crt INFO src/cart/crt_init.c:278 crt_init_opt() libcart version 4.8.0 initializing
09/11-14:17:47.59 sky08 DAOS[35448/35448] fi WARN src/gurt/fault_inject.c:679 d_fault_inject_init() Fault Injection not initialized feature not included in build
09/11-14:17:47.59 sky08 DAOS[35448/35448] crt WARN src/cart/crt_init.c:170 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
09/11-14:17:47.59 sky08 DAOS[35448/35448] crt WARN src/cart/crt_init.c:389 crt_init_opt() FI_OFI_RXM_USE_SRX not set; set=1
Lombardi, Johann
Looks better. The network stack seems to be properly initialized now. The failure you see now (“undefined symbol: dmg_pool_create”) is related to the C wrapper of the dmg command which is used to create a pool. I suspect that you are missing something in the LD_LIBRARY_PATH. I will let @Macdonald, Mjmac comment further on that.
Cheers, Johann
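One way to narrow down an "undefined symbol" failure like the one above is to look for a shared library that actually defines the symbol and then compare its location with what the dynamic loader resolves. The sketch below relies only on the standard binutils nm tool; the function name find_symbol_provider is made up for illustration, and which DAOS library exports dmg_pool_create is deliberately left open here.

```shell
# Hypothetical helper: print every shared library in the given directories
# whose dynamic symbol table defines the named symbol.
# Returns non-zero when no library defines it.
find_symbol_provider() {
    symbol="$1"
    shift
    found=1
    for dir in "$@"; do
        for lib in "$dir"/*.so*; do
            [ -f "$lib" ] || continue
            # nm -D reads the dynamic symbol table; --defined-only skips
            # symbols that the library merely imports from elsewhere.
            if nm -D --defined-only "$lib" 2>/dev/null |
                awk -v s="$symbol" '$NF == s { hit = 1 } END { exit !hit }'; then
                echo "$lib"
                found=0
            fi
        done
    done
    return "$found"
}
```

For example, find_symbol_provider dmg_pool_create /usr/lib64 (the directory is an assumption about where the RPMs install libraries) lists candidate libraries, and ldd /tmp/daos_test shows which libraries the binary resolves at load time. If the defining library lives in a directory that ldd does not pick up, adding that directory to LD_LIBRARY_PATH is the usual remedy.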
Johann, thanks! You're right; I had also been running a daos_test compiled from source. Now I've run the one from the daos-tests package (RPM of v1.0.1) and it works, but one test is stuck:
DAOS_IOD_SINGLE:NVMe size: 5120
[ OK ] IO2: simple update/fetch/verify (async)
[ RUN ] IO3: i/o with variable rec size
Record size: 1 val: 'X' dkey: 707937202
And daos.log is filled with this error:
09/11-14:37:23.48 sky08 DAOS[36236/36236] object ERR src/object/cli_shard.c:216 dc_rw_cb() rpc 0x404a030 RPC 1 failed: DER_HG(-1020)
Does anyone know what could cause this?
anton.brekhov@...
These are the errors from daos_server.log:
09/11-20:41:22.58 apache512 DAOS[10757/10786] hg ERR # NA -- Error -- /builddir/build/BUILD/mercury-2.0.0a1/src/na/na_ofi.c:4196
# na_ofi_mem_register(): fi_mr_reg() failed, rc: -95 (Operation not supported)
09/11-20:41:26.20 apache512 DAOS[10757/10786] hg ERR # HG -- Error -- /builddir/build/BUILD/mercury-2.0.0a1/src/mercury_bulk.c:494
# hg_bulk_create(): NA_Mem_register() failed (NA_PROTOCOL_ERROR)
09/11-20:41:26.20 apache512 DAOS[10757/10786] hg ERR # HG -- Error -- /builddir/build/BUILD/mercury-2.0.0a1/src/mercury_bulk.c:1072
# HG_Bulk_create(): Could not create bulk handle
09/11-20:41:26.20 apache512 DAOS[10757/10786] hg ERR src/cart/crt_hg.c:1445 crt_hg_bulk_create() HG_Bulk_create failed, hg_ret: 11.
09/11-20:41:26.20 apache512 DAOS[10757/10786] bulk ERR src/cart/crt_bulk.c:137 crt_bulk_create() crt_hg_bulk_create failed, rc: -1020.
09/11-20:41:26.20 apache512 DAOS[10757/10786] object ERR src/object/srv_obj.c:359 obj_bulk_transfer() crt_bulk_create 0 error (-1020).
09/11-20:41:26.20 apache512 DAOS[10757/10786] object ERR src/object/srv_obj.c:992 obj_local_rw() 1155473290706288642.0.1 data transfer failed, dma 1 rc DER_HG(-1020)
09/11-20:41:26.20 apache512 DAOS[10757/10786] object ERR src/object/srv_obj.c:96 obj_rw_complete() 1155473290706288642.0.1Fetch end failed: -1020
09/11-20:41:26.20 apache512 DAOS[10757/10777] rpc DBUG src/cart/crt_register.c:215 crt_opc_lookup() looking up opcode: 0x2010003
09/11-20:41:26.20 apache512 DAOS[10757/10777] bio DBUG src/bio/bio_buffer.c:824 copy_one() bio copy 0x7f8c98212380 size 24
09/11-20:41:26.20 apache512 DAOS[10757/10777] bio DBUG src/bio/bio_buffer.c:824 copy_one() bio copy 0x7f8c981e3680 size 320
09/11-20:41:26.20 apache512 DAOS[10757/10777] vos DBUG src/vos/vos_io.c:607 akey_fetch() akey [16] fetch single epr 9-6359751
09/11-20:41:26.20 apache512 DAOS[10757/10777] bio DBUG src/bio/bio_buffer.c:824 copy_one() bio copy 0x7f8c98207700 size 8
09/11-20:41:26.20 apache512 DAOS[10757/10777] vos DBUG src/vos/vos_io.c:607 akey_fetch() akey [12] fetch single epr 5-6359751
09/11-20:41:26.20 apache512 DAOS[10757/10777] bio DBUG src/bio/bio_buffer.c:824 copy_one() bio copy 0x7f8c98241400 size 4
09/11-20:41:26.20 apache512 DAOS[10757/10777] vos DBUG src/vos/vos_io.c:607 akey_fetch() akey [11] fetch single epr 5-6359751
09/11-20:41:26.20 apache512 DAOS[10757/10786] rpc DBUG src/cart/crt_register.c:215 crt_opc_lookup() looking up opcode: 0x4010001
09/11-20:41:26.20 apache512 DAOS[10757/10786] object DBUG src/object/srv_obj.c:1337 ds_obj_rw_handler() overwrite epoch 1599846087457767431
09/11-20:41:26.20 apache512 DAOS[10757/10786] vos DBUG src/vos/vos_io.c:607 akey_fetch() akey [1] fetch array epr 1599844210152833028-1599846087457767431
|
|
Lombardi, Johann
It sounds like the DAOS server is not able to register memory to initiate an RDMA transfer. Could you please tell me more about the network and storage you use on the server? Is it Optane PMEM or DRAM? Also, what version of OFED do you use?
Cheers, Johann
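For readers triaging a similar log, the failing call can be pulled out mechanically. The following is a minimal sketch, assuming the Mercury/NA error lines look exactly like the ones pasted in this thread ("# caller(): callee() failed, rc: N (message)"). The function name `find_failed_calls` and the regular expression are illustrative and are not part of DAOS or Mercury.

```python
import re

# Matches lines of the form seen in the daos_server.log excerpt above, e.g.
#   # na_ofi_mem_register(): fi_mr_reg() failed, rc: -95 (Operation not supported)
# Lines without an "rc: N (message)" part are deliberately not matched.
FAILED_CALL = re.compile(
    r"#\s*(?P<caller>\w+)\(\):\s*(?P<callee>\w+)\(\) failed, "
    r"rc: (?P<rc>-?\d+) \((?P<msg>[^)]*)\)"
)


def find_failed_calls(log_text):
    """Return (caller, callee, rc, message) for every matching error line."""
    return [
        (m["caller"], m["callee"], int(m["rc"]), m["msg"])
        for m in FAILED_CALL.finditer(log_text)
    ]


if __name__ == "__main__":
    excerpt = (
        "# na_ofi_mem_register(): fi_mr_reg() failed, rc: -95 (Operation not supported)\n"
        "# hg_bulk_create(): NA_Mem_register() failed (NA_PROTOCOL_ERROR)\n"
    )
    for call in find_failed_calls(excerpt):
        print(call)
```

In this excerpt only the first line carries a numeric return code, so it is the only one reported; the higher-level HG and crt errors that follow it in the log are consequences of that memory registration failure.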
|
|
anton.brekhov@...
I'm using Optane PMEM (there are 4 modules of 512 GB each, two PMEM modules per socket). I created the /dev/pmem0 and /dev/pmem1 devices using the `ipmctl create -goal PersistentMemoryType=AppDirect` command. On the DAOS server I have two Mellanox IB interfaces, but only one is in use (mlx5_0). Here is the ibstat output:
[root@apache512 tmp]# ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.28.1002
Hardware version: 0
Node GUID: 0xb8599f0300e4f800
System image GUID: 0xb8599f0300e4f800
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 4
LMC: 0
SM lid: 4
Capability mask: 0x2659e84a
Port GUID: 0xb8599f0300e4f800
Link layer: InfiniBand
It seems I haven't installed the OFED drivers, because ofed_info is not found. Which version should I use?
|
|
Farrell, Patrick Arthur <patrick.farrell@...>
MOFED 5.0 works well; 5.1 currently seems to have an incompatibility with libfabric. (Note that this is not a supported list or anything; it is just what we've used successfully.)
Regards,
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of anton.brekhov@... <anton.brekhov@...>
Sent: Monday, September 14, 2020 2:45 AM To: daos@daos.groups.io <daos@daos.groups.io> Subject: Re: [daos] DAOS_test failed
|
|
Oganezov, Alexander A
Hi Anton,
The last version that we've tried and that worked for us was MOFED 5.0.2, which is what we currently use for our test clusters that use the ofi+verbs;ofi_rxm provider.
Thanks, ~~Alex.
|
|
anton.brekhov@...
I've installed MOFED 5.0.2 and Open MPI on both hosts, and it gets a little further! Some tests passed and some failed, but the run ended with another error:
=================
DAOS rebuild tests..
=================
[ PASSED ] 3 test(s).
setup: creating pool, SCM size=4 GB, NVMe size=8 GB
setup: created pool 88a9a5f1-8a21-4e45-a32d-ef87791c5f80
setup: connecting to pool
connected to pool, ntarget=4
setup: creating container 28aa2f9d-8559-41f3-907f-ec1b893ca90c
setup: opening container
REBUILD0: drop rebuild scan reply
No enough targets, skipping (4/0)
teardown: destroyed pool 88a9a5f1-8a21-4e45-a32d-ef87791c5f80
REBUILD1: retry rebuild for not ready
setup: creating pool, SCM size=0 GB, NVMe size=0 GB
daos_pool_create failed, rc: -1003
[sky08:21007:0:21007] Caught signal 11 (Segmentation fault: tkill(2) or tgkill(2) at address 0x520f)
==== backtrace (tid: 21007) ====
0 0x000000000004cb95 ucs_debug_print_backtrace() ???:0
1 0x000000000045471e ???() /usr/bin/daos_test:0
2 0x000000000044ca21 ???() /usr/bin/daos_test:0
3 0x0000000000406db3 ???() /usr/bin/daos_test:0
4 0x0000000000022505 __libc_start_main() ???:0
5 0x00000000004079e2 ???() /usr/bin/daos_test:0
=================================
[sky08:21007] *** Process received signal ***
[sky08:21007] Signal: Segmentation fault (11)
[sky08:21007] Signal code: (-6)
[sky08:21007] Failing at address: 0x520f
[sky08:21007] [ 0] /lib64/libpthread.so.0(+0xf5f0)[0x7fdac52295f0]
[sky08:21007] [ 1] daos_test[0x45471e]
|
|
anton.brekhov@...
I've set these env vars:
export POOL_NVME_SIZE=4
REBUILD12: rebuild send objects failed
setup: creating pool, SCM size=2 GB, NVMe size=4 GB
setup: created pool 8abfd4aa-0fb4-4122-aef7-f58b3fe6d81f
setup: connecting to pool
connected to pool, ntarget=4
setup: creating container ac978bdb-cb4c-4729-a0e4-cf3f3973696d
setup: opening container
No enough targets, skipping (4/0)
teardown: destroyed pool 8abfd4aa-0fb4-4122-aef7-f58b3fe6d81f
REBUILD13: rebuild empty pool offline
setup: creating pool, SCM size=2 GB, NVMe size=4 GB
setup: created pool 803c465f-6547-4c8f-a473-2dea0d457081
setup: connecting to pool
connected to pool, ntarget=4
setup: creating container a7490729-0d9a-4935-900e-ede0f656871d
setup: opening container
No enough targets, skipping (4/0)
teardown: destroyed pool 803c465f-6547-4c8f-a473-2dea0d457081
REBUILD14: rebuild no space failure
setup: creating pool, SCM size=2 GB, NVMe size=4 GB
setup: created pool cf718311-93cb-4b9f-a159-419585af99e8
setup: connecting to pool
connected to pool, ntarget=4
setup: creating container f5a357b4-3af8-4670-9032-f5e9fc2944af
setup: opening container
No enough targets, skipping (4/0)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[25434,1],0]
Exit code: 255
--------------------------------------------------------------------------
|
|
Farrell, Patrick Arthur <patrick.farrell@...>
You'll want to turn on debug (see the troubleshooting section in the user guide) to get more information on why this failed.
Also, the pool size (both SCM and NVME) will not be large enough to complete the tests. I think you need something like at least 16 GB NVMe and 8 GB SCM? I'm not saying that is your issue here (though it might be), but it will stop you later.
-Patrick
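The size advice above can be turned into a quick pre-flight check. This is only a sketch: the thresholds are the approximate figures quoted in this thread, not documented limits, and `undersized_pool_vars` is a hypothetical helper name. It assumes POOL_SCM_SIZE and POOL_NVME_SIZE hold whole numbers of GB, as the "SCM size=2 GB, NVMe size=4 GB" output suggests.

```python
import os

# Rule-of-thumb minimums quoted in this thread ("something like at least
# 16 GB NVMe and 8 GB SCM"); these are assumptions, not official values.
SUGGESTED_MIN_GB = {"POOL_SCM_SIZE": 8, "POOL_NVME_SIZE": 16}


def undersized_pool_vars(env=None):
    """Return {name: (value, suggested_minimum)} for variables set too low.

    Unset variables are not reported; they are left to daos_test.
    """
    env = os.environ if env is None else env
    problems = {}
    for name, minimum in SUGGESTED_MIN_GB.items():
        raw = env.get(name)
        if raw is None:
            continue
        try:
            value = int(raw)
        except ValueError:
            # Not a whole number of GB: report the raw string as-is.
            problems[name] = (raw, minimum)
            continue
        if value < minimum:
            problems[name] = (value, minimum)
    return problems


if __name__ == "__main__":
    for name, (value, minimum) in undersized_pool_vars().items():
        print(f"{name}={value} is below the suggested {minimum} GB")
```

Passing an explicit dictionary instead of relying on `os.environ` makes it easy to check a planned configuration before exporting anything.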
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of anton.brekhov@... <anton.brekhov@...>
Sent: Tuesday, September 15, 2020 4:07 PM To: daos@daos.groups.io <daos@daos.groups.io> Subject: Re: [daos] DAOS_test failed
|
|
Wang, Di
Hello,
This basically means there are not enough servers to run rebuild tests, so they were being skipped.
The failure here is probably due to the incorrect usage of cmocka, which is used by some DAOS tests. Anyway, it is not a real “failure”.
If you are interested in running rebuild tests, you need at least 6 DAOS servers.
Thanks WangDi
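To see which rebuild tests were skipped rather than failed, the daos_test output can be scanned for the skip message. A rough sketch, assuming the message text is exactly "No enough targets, skipping" as printed in this thread; `rebuild_tests_skipped` is an illustrative name. It attributes each message to the nearest preceding REBUILDn label, which is a heuristic because setup lines for a test can appear before its label.

```python
import re

SKIP_MARKER = "No enough targets, skipping"  # spelled as daos_test prints it


def rebuild_tests_skipped(output):
    """Return the REBUILDn labels whose following text contains the skip marker."""
    # re.split with a capture group yields:
    #   [prefix, label1, body1, label2, body2, ...]
    parts = re.split(r"\b(REBUILD\d+):", output)
    labels, bodies = parts[1::2], parts[2::2]
    return [label for label, body in zip(labels, bodies) if SKIP_MARKER in body]


if __name__ == "__main__":
    excerpt = (
        "REBUILD12: rebuild send objects failed\n"
        "setup: opening container\n"
        "No enough targets, skipping (4/0)\n"
        "REBUILD13: rebuild empty pool offline\n"
        "setup: opening container\n"
        "No enough targets, skipping (4/0)\n"
    )
    print(rebuild_tests_skipped(excerpt))
```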
On 9/15/20, 2:07 PM, "daos@daos.groups.io on behalf of anton.brekhov@..." <daos@daos.groups.io on behalf of anton.brekhov@...> wrote:
|
|
I've set new sizes and launched only the management tests:
[root@sky08 ~]# export POOL_NVME_SIZE=16
[root@sky08 ~]# export POOL_SCM_SIZE=8
[root@sky08 ~]# mpirun --allow-run-as-root -np 1 daos_test -m
=================
DAOS management tests..
=====================
[==========] Running 5 test(s).
[ RUN ] MGMT1: create/destroy pool on all tgts
creating pool synchronously ... success uuid = 95650fb9-3eec-4b48-9c5f-ffaff9597df8
destroying pool synchronously ... success
[ OK ] MGMT1: create/destroy pool on all tgts
[ RUN ] MGMT2: create/destroy pool on all tgts (async)
creating pool asynchronously ... success uuid = 865c9dcc-c28c-4759-88fe-46e009c52362
destroying pool asynchronously ... success
[ OK ] MGMT2: create/destroy pool on all tgts (async)
[ RUN ] MGMT3: list-pools with no pools in sys
[ ERROR ] --- 0x2 != 0
[ LINE ] --- src/tests/suite/daos_mgmt.c:262: error: Failure!
[ FAILED ] MGMT3: list-pools with no pools in sys
[ RUN ] MGMT4: list-pools with multiple pools in sys
setup: creating pool, SCM size=8 GB, NVMe size=16 GB
setup: created pool 2817ca85-971a-4a43-b01b-fefcec01ec1d
setup: creating pool, SCM size=8 GB, NVMe size=16 GB
setup: created pool f4aaa2fe-5e00-47d8-9d04-a9559df709d5
setup: creating pool, SCM size=8 GB, NVMe size=16 GB
setup: created pool 2e81be40-1c5a-401f-be5a-c9646603a9f8
setup: creating pool, SCM size=8 GB, NVMe size=16 GB
setup: created pool 929dad5e-f797-4c4e-bb33-44fde4c958cf
teardown: destroyed pool 2817ca85-971a-4a43-b01b-fefcec01ec1d
teardown: destroyed pool f4aaa2fe-5e00-47d8-9d04-a9559df709d5
teardown: destroyed pool 2e81be40-1c5a-401f-be5a-c9646603a9f8
teardown: destroyed pool 929dad5e-f797-4c4e-bb33-44fde4c958cf
[ FAILED ] MGMT4: list-pools with multiple pools in sys
[ RUN ] MGMT5: retry MGMT_POOL_{CREATE,DESETROY} upon errors
Fault injection required for test, skipping...
[ ERROR ] --- 0x6 != 0x4
[ LINE ] --- src/tests/suite/daos_mgmt.c:262: error: Failure!
[ SKIPPED ] MGMT5: retry MGMT_POOL_{CREATE,DESETROY} upon errors
[==========] 5 test(s) run.
[ PASSED ] 2 test(s).
[ SKIPPED ] 1 test(s), listed below:
[ SKIPPED ] MGMT5: retry MGMT_POOL_{CREATE,DESETROY} upon errors
1 SKIPPED TEST(S)
[ FAILED ] 2 test(s), listed below:
[ FAILED ] MGMT3: list-pools with no pools in sys
[ FAILED ] MGMT4: list-pools with multiple pools in sys
2 FAILED TEST(S)
============ Summary src/tests/suite/daos_test.c
ERROR, 2 TEST(S) FAILED
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[13612,1],0]
Exit code: 2
--------------------------------------------------------------------------
Is it OK that a few tests have failed?
|
|
Lombardi, Johann
Hi Anton,
Yes, I think it is fine. You probably leaked pools from prior runs that show up in pool listing while the test is assuming that no other pools are present. I think you are good to go.
Cheers, Johann
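When deciding whether a run is "good to go", it helps to reduce the test output to one status per test. A minimal sketch, assuming one status line per output line in the "[ STATUS ] ID: title" shape shown in the management test output above; `summarize` and `failed_tests` are illustrative names and are not part of cmocka or DAOS.

```python
import re

# Matches per-test status lines such as
#   [ OK ] MGMT1: create/destroy pool on all tgts
#   [ FAILED ] MGMT3: list-pools with no pools in sys
# Count lines like "[ PASSED ] 2 test(s)." are intentionally not matched.
STATUS_LINE = re.compile(r"^\[\s*(OK|FAILED|SKIPPED)\s*\]\s+(\w+):\s+(.*)$")


def summarize(output):
    """Map each test ID to its last reported status."""
    results = {}
    for line in output.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            status, test_id, _title = match.groups()
            # Failed and skipped tests are listed again in the summary
            # section; the repeated line simply overwrites the same value.
            results[test_id] = status
    return results


def failed_tests(output):
    """Return the IDs of tests reported as FAILED, in order of appearance."""
    return [t for t, status in summarize(output).items() if status == "FAILED"]


if __name__ == "__main__":
    excerpt = (
        "[ OK ] MGMT1: create/destroy pool on all tgts\n"
        "[ FAILED ] MGMT3: list-pools with no pools in sys\n"
        "[ PASSED ] 2 test(s).\n"
    )
    print(summarize(excerpt))
```

If the only failures are the list-pools tests, that is consistent with the explanation above that pools left over from earlier runs appear in the listing.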
|
|
anton.brekhov@...
daos_server.log is full of this debug output:
09/16-15:04:12.07 apache512 DAOS[13572/13608] vos DBUG src/vos/vos_iterator.c:279 vos_iter_probe() probing iterator
09/16-15:04:12.07 apache512 DAOS[13572/13608] vos DBUG src/vos/vos_iterator.c:288 vos_iter_probe() done probing iterator rc = DER_NONEXIST(-1005)
|
|
anton.brekhov@...
Thanks everyone for the help! I've created a container and a pool, mounted it with dfuse, stored a file, unmounted, mounted again, and everything is fine! Thanks!
|
|