Hi guys,
I have some questions about DAOS performance tuning.
1. Benchmarking DAOS: an exception occurs when running the latest ior_hpc pulled on the daos branch. The exception information is as follows.
IOR command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: vsr139
Local device: mlx5_1
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[vsr139:164808:0:164808] ud_iface.c:307 Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed
==== backtrace ====
0 /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]
1 /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]
2 /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]
3 /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]
4 /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]
5 /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]
6 /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]
7 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]
8 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]
9 /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]
10 /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]
11 /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]
12 /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]
14 /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]
Normally the run looks like this:
IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began : Tue Feb 18 10:08:58 2020
Command line : ior
Machine : Linux boro-9.boro.hpdd.intel.com
TestID : 0
StartTime : Tue Feb 18 10:08:58 2020
Path : /home/minmingz
FS : 3.8 TiB Used FS: 43.3% Inodes: 250.0 Mi Used Inodes: 6.3%
Options:
api : POSIX
apiVersion :
test filename : testFile
access : single-shared-file
type : independent
segments : 1
ordering in a file : sequential
ordering inter file : no tasks offsets
tasks : 1
clients per node : 1
repetitions : 1
xfersize : 262144 bytes
blocksize : 1 MiB
aggregate filesize : 1 MiB
Results:
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 89.17 1024.00 256.00 0.000321 0.000916 0.009976 0.011214 0
read 1351.38 1024.00 256.00 0.000278 0.000269 0.000193 0.000740 0
remove - - - - - - 0.000643 0
Max Write: 89.17 MiB/sec (93.50 MB/sec)
Max Read: 1351.38 MiB/sec (1417.02 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum
write 89.17 89.17 89.17 0.00 356.68 356.68 356.68 0.00 0.01121 0 1 1 1 0 0 1 0 0 1 1048576 262144 1.0 POSIX 0
read 1351.38 1351.38 1351.38 0.00 5405.52 5405.52 5405.52 0.00 0.00074 0 1 1 1 0 0 1 0 0 1 1048576 262144 1.0 POSIX 0
Finished : Tue Feb 18 10:08:58 2020
2. Network performance: an exception occurred while running self_test on the latest DAOS code (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41). The exception information is as follows.
Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1
-x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.
[vsr139:176276:0:176276] ud_iface.c:307 Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed
==== backtrace ====
0 /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]
1 /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]
2 /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]
3 /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]
4 /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]
5 /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]
6 /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]
7 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]
8 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]
9 /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]
10 /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]
11 /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]
12 crt_launch() [0x40130c]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]
14 crt_launch() [0x401dcf]
Please help us solve these problems.
Regards,
Minmingz
Could you provide some info on the system you are running on? Do you have OPA there?
You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
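For example, something like this (just reusing the host and paths from your earlier command, so adjust as needed):
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 --mca pml ob1 --mca btl tcp,self --mca oob tcp /home/spark/cluster/ior_hpc/bin/ior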
Output of fi_info and ifconfig would help.
Thanks,
Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem
Hi , Guys :
I have some about daos performance tuning problems.
- About benchmarking daos : The latest ior_hpc pulled on the daos branch, an exception occurs when running . The exception information is as follows.
Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: vsr139
Local device: mlx5_1
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[vsr139:164808:0:164808] ud_iface.c:307 Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed
==== backtrace ====
0 /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]
1 /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]
2 /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]
3 /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]
4 /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]
5 /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]
6 /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]
7 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]
8 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]
9 /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]
10 /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]
11 /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]
12 /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]
14 /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]
|
Normally this is the case :
IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began : Tue Feb 18 10:08:58 2020
Command line : ior
Machine : Linux boro-9.boro.hpdd.intel.com
TestID : 0
StartTime : Tue Feb 18 10:08:58 2020
Path : /home/minmingz
FS : 3.8 TiB Used FS: 43.3% Inodes: 250.0 Mi Used Inodes: 6.3%
Options:
api : POSIX
apiVersion :
test filename : testFile
access : single-shared-file
type : independent
segments : 1
ordering in a file : sequential
ordering inter file : no tasks offsets
tasks : 1
clients per node : 1
repetitions : 1
xfersize : 262144 bytes
blocksize : 1 MiB
aggregate filesize : 1 MiB
Results:
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 89.17 1024.00 256.00 0.000321 0.000916 0.009976 0.011214 0
read 1351.38 1024.00 256.00 0.000278 0.000269 0.000193 0.000740 0
remove - - - - - - 0.000643 0
Max Write: 89.17 MiB/sec (93.50 MB/sec)
Max Read: 1351.38 MiB/sec (1417.02 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum
write 89.17 89.17 89.17 0.00 356.68 356.68 356.68 0.00 0.01121 0 1 1 1 0 0 1 0 0 1 1048576 262144 1.0 POSIX 0
read 1351.38 1351.38 1351.38 0.00 5405.52 5405.52 5405.52 0.00 0.00074 0 1 1 1 0 0 1 0 0 1 1048576 262144 1.0 POSIX 0
Finished : Tue Feb 18 10:08:58 2020
|
2.About network performance
:On the latest daos(commit : 22ea193249741d40d24bc41bffef9dbcdedf3d41) code pulled, an exception occurred while executing self_test. . The exception information is as follows .
Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1
-x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.
vsr139:176276:0:176276] ud_iface.c:307 Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed
==== backtrace ====
0 /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]
1 /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]
2 /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]
3 /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]
4 /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]
5 /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]
6 /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]
7 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]
8 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]
9 /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]
10 /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]
11 /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]
12 crt_launch() [0x40130c]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]
14 crt_launch() [0x401dcf]
|
Please help solve the problem.
Regards,
Minmingz
|
|
Hi Mohamad,
Thanks for your help.
- What does OPA mean?
- I tried what you suggested: with --mca pml ob1 added, ior works; however, with --mca btl tcp,self --mca oob tcp added, it still fails.
- I have worked around the original problem by adding -x UCX_NET_DEVICES=mlx5_1:1.
IOR command : /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior
-a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88
But I encountered a new problem:
Error invalid argument: --dfs.pool
Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139
Error invalid argument: --dfs.svcl
Error invalid argument: 0
Error invalid argument: --dfs.cont
Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88
Invalid options
Synopsis /home/spark/cluster/ior_hpc/bin/ior
Flags
-c collective -- collective I/O
……
Regards,
Minmingz
That probably means that your IOR was not built with DAOS driver support.
If you enabled that, I would check the config.log in your IOR build and see why.
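For example, grepping config.log for the DAOS check should show whether it passed (the exact wording may differ between IOR versions):
grep -i -A 5 daos config.log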
Thanks,
Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 9:43 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem
Hi , Mohamad :
Thanks for you help .
- What does OPA mean ?
- Tried as you said . Adding --mca pml ob1 ior work successful , however adding --mca btl tcp,self --mca oob tcp ior work fail .
- I has solved this problem by adding -x UCX_NET_DEVICES=mlx5_1:1 .
IOR command : /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior
-a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88 .
But encountered a new problem .
Error invalid argument: --dfs.pool
Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139
Error invalid argument: --dfs.svcl
Error invalid argument: 0
Error invalid argument: --dfs.cont
Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88
Invalid options
Synopsis /home/spark/cluster/ior_hpc/bin/ior
Flags
-c collective -- collective I/O
……
|
Regards,
Minmingz
From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem
Could you provide some info on the system you are running on? Do you have OPA there?
You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
Output of fi_info and ifconfig would help.
Thanks,
Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem
Hi , Guys :
I have some about daos performance tuning problems.
- About benchmarking daos : The latest ior_hpc pulled on the daos branch, an exception occurs when running . The exception information is as follows.
Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: vsr139
Local device: mlx5_1
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[vsr139:164808:0:164808] ud_iface.c:307 Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed
==== backtrace ====
0 /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]
1 /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]
2 /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]
3 /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]
4 /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]
5 /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]
6 /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]
7 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]
8 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]
9 /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]
10 /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]
11 /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]
12 /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]
14 /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]
|
Normally this is the case :
IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began : Tue Feb 18 10:08:58 2020
Command line : ior
Machine : Linux boro-9.boro.hpdd.intel.com
TestID : 0
StartTime : Tue Feb 18 10:08:58 2020
Path : /home/minmingz
FS : 3.8 TiB Used FS: 43.3% Inodes: 250.0 Mi Used Inodes: 6.3%
Options:
api : POSIX
apiVersion :
test filename : testFile
access : single-shared-file
type : independent
segments : 1
ordering in a file : sequential
ordering inter file : no tasks offsets
tasks : 1
clients per node : 1
repetitions : 1
xfersize : 262144 bytes
blocksize : 1 MiB
aggregate filesize : 1 MiB
Results:
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 89.17 1024.00 256.00 0.000321 0.000916 0.009976 0.011214 0
read 1351.38 1024.00 256.00 0.000278 0.000269 0.000193 0.000740 0
remove - - - - - - 0.000643 0
Max Write: 89.17 MiB/sec (93.50 MB/sec)
Max Read: 1351.38 MiB/sec (1417.02 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum
write 89.17 89.17 89.17 0.00 356.68 356.68 356.68 0.00 0.01121 0 1 1 1 0 0 1 0 0 1 1048576 262144 1.0 POSIX 0
read 1351.38 1351.38 1351.38 0.00 5405.52 5405.52 5405.52 0.00 0.00074 0 1 1 1 0 0 1 0 0 1 1048576 262144 1.0 POSIX 0
Finished : Tue Feb 18 10:08:58 2020
|
2.About network performance
:On the latest daos(commit : 22ea193249741d40d24bc41bffef9dbcdedf3d41) code pulled, an exception occurred while executing self_test. . The exception information is as follows .
Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1
-x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.
vsr139:176276:0:176276] ud_iface.c:307 Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed
==== backtrace ====
0 /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]
1 /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]
2 /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]
3 /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]
4 /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]
5 /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]
6 /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]
7 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]
8 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]
9 /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]
10 /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]
11 /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]
12 crt_launch() [0x40130c]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]
14 crt_launch() [0x401dcf]
|
Please help solve the problem.
Regards,
Minmingz
|
|
Hi Mohamad,
Yes, ior was built with DAOS driver support.
Command : ./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart
The attached file is config.log.
Regards,
Minmingz
configure:5942: mpicc -std=gnu99 -o conftest -g -O2 -I/home/spark/daos/_build.external/cart/include/ -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt -lm >&5
/usr/bin/ld: cannot find -lgurt
collect2: error: ld returned 1 exit status
configure:5942: $? = 1
Are you sure /home/spark/daos/_build.external/cart is the path to your cart install dir?
That seems like a path to the cart source dir.
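One quick way to check, for example, is to look for where libgurt actually lives (paths taken from your configure line, so adjust if needed):
find /home/spark/daos -name 'libgurt.so*' 2>/dev/null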
Thanks,
Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 10:19 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem
Hi , Mohamad :
Yes , ior was build with DAOS driver support .
Command : ./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart
Attach file is config.log .
Regards,
Minmingz
From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:00 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem
That probably means that your IOR was not built with DAOS driver support.
If you enabled that, I would check the config.log in your IOR build and see why.
Thanks,
Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 9:43 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem
Hi , Mohamad :
Thanks for you help .
- What does OPA mean ?
- Tried as you said . Adding --mca pml ob1 ior work successful , however adding --mca btl tcp,self --mca oob tcp ior work fail .
- I has solved this problem by adding -x UCX_NET_DEVICES=mlx5_1:1 .
IOR command : /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior
-a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88 .
But encountered a new problem .
Error invalid argument: --dfs.pool
Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139
Error invalid argument: --dfs.svcl
Error invalid argument: 0
Error invalid argument: --dfs.cont
Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88
Invalid options
Synopsis /home/spark/cluster/ior_hpc/bin/ior
Flags
-c collective -- collective I/O
……
|
Regards,
Minmingz
Could you provide some info on the system you are running on? Do you have OPA there?
You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
Output of fi_info and ifconfig would help.
Thanks,
Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem
Hi , Guys :
I have some about daos performance tuning problems.
- About benchmarking daos : The latest ior_hpc pulled on the daos branch, an exception occurs when running . The exception information is as follows.
Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: vsr139
Local device: mlx5_1
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
[vsr139:164808:0:164808] ud_iface.c:307 Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed
==== backtrace ====
0 /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]
1 /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]
2 /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]
3 /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]
4 /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]
5 /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]
6 /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]
7 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]
8 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]
9 /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]
10 /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]
11 /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]
12 /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]
14 /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]
|
Normally this is the case :
IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O
Began : Tue Feb 18 10:08:58 2020
Command line : ior
Machine : Linux boro-9.boro.hpdd.intel.com
TestID : 0
StartTime : Tue Feb 18 10:08:58 2020
Path : /home/minmingz
FS : 3.8 TiB Used FS: 43.3% Inodes: 250.0 Mi Used Inodes: 6.3%
Options:
api : POSIX
apiVersion :
test filename : testFile
access : single-shared-file
type : independent
segments : 1
ordering in a file : sequential
ordering inter file : no tasks offsets
tasks : 1
clients per node : 1
repetitions : 1
xfersize : 262144 bytes
blocksize : 1 MiB
aggregate filesize : 1 MiB
Results:
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 89.17 1024.00 256.00 0.000321 0.000916 0.009976 0.011214 0
read 1351.38 1024.00 256.00 0.000278 0.000269 0.000193 0.000740 0
remove - - - - - - 0.000643 0
Max Write: 89.17 MiB/sec (93.50 MB/sec)
Max Read: 1351.38 MiB/sec (1417.02 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum
write 89.17 89.17 89.17 0.00 356.68 356.68 356.68 0.00 0.01121 0 1 1 1 0 0 1 0 0 1 1048576 262144 1.0 POSIX 0
read 1351.38 1351.38 1351.38 0.00 5405.52 5405.52 5405.52 0.00 0.00074 0 1 1 1 0 0 1 0 0 1 1048576 262144 1.0 POSIX 0
Finished : Tue Feb 18 10:08:58 2020
|
2.About network performance
:On the latest daos(commit : 22ea193249741d40d24bc41bffef9dbcdedf3d41) code pulled, an exception occurred while executing self_test. . The exception information is as follows .
Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1
-x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.
vsr139:176276:0:176276] ud_iface.c:307 Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed
==== backtrace ====
0 /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]
1 /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]
2 /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]
3 /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]
4 /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]
5 /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]
6 /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]
7 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]
8 /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]
9 /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]
10 /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]
11 /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]
12 crt_launch() [0x40130c]
13 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]
14 crt_launch() [0x401dcf]
|
Please help solve the problem.
Regards,
Minmingz
|
|
Hi Mohamad,
/home/spark/daos/_build.external/cart is the path to the cart build dir. The error message says that libgurt.so cannot be found, but it does exist in the local environment.
configure:5942: mpicc -std=gnu99 -o conftest -g -O2 -I/home/spark/daos/_build.external/cart/include/ -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt -lm >&5
/usr/bin/ld: cannot find -lgurt
Local env : [screenshots]
This was the previous build on boro, where ior could be executed.
Regards,
Minmingz
Olivier, Jeffrey V
You need to set --with-cart to /home/spark/daos/install, not _build.external/cart.
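For example, reusing the prefix and DAOS path from your earlier configure line, that would be something like:
./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/install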
-Jeff
Yes, I had a chat offline with Minming; I forgot to report back here.
But that was it.
Mohamad
From: "Olivier, Jeffrey V" <jeffrey.v.olivier@...>
Date: Thursday, February 20, 2020 at 9:34 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>, "Olivier, Jeffrey V" <jeffrey.v.olivier@...>
Subject: RE: [daos] Tuning problem
You need to set --with-cart to /home/spark/daos/install, not _build.external/cart.
-Jeff
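Applied to the ./configure invocation quoted further down in this thread, the fix would presumably be just the --with-cart path (everything else unchanged):
    ./configure --prefix=/home/spark/cluster/ior_hpc \
                --with-daos=/home/spark/daos/install \
                --with-cart=/home/spark/daos/install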
From: daos@daos.groups.io [mailto:daos@daos.groups.io]
On Behalf Of Zhu, Minming
Sent: Wednesday, February 19, 2020 10:31 AM
To: Chaarawi, Mohamad <mohamad.chaarawi@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: [daos] Tuning problem
Hi, Mohamad:
/home/spark/daos/_build.external/cart is the path to the cart build dir. The error message says that libgurt.so was not found, but it does exist in the local environment.
configure:5942: mpicc -std=gnu99 -o conftest -g -O2 -I/home/spark/daos/_build.external/cart/include/ -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt -lm >&5
/usr/bin/ld: cannot find -lgurt
Local env : [inline images omitted]
This was the previous build on boro, where ior can be executed.
Regards,
Minmingz
configure:5942: mpicc -std=gnu99 -o conftest -g -O2 -I/home/spark/daos/_build.external/cart/include/ -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt -lm >&5
/usr/bin/ld: cannot find -lgurt
collect2: error: ld returned 1 exit status
configure:5942: $? = 1
Are you sure /home/spark/daos/_build.external/cart is the path to your cart install dir?
That seems like a path to the cart source dir.
Thanks,
Mohamad
From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 10:19 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem
Hi, Mohamad:
Yes, ior was built with DAOS driver support.
Command : ./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart
The attached file is config.log.
Regards,
Minmingz
That probably means that your IOR was not built with DAOS driver support.
If you enabled that, I would check the config.log in your IOR build and see why.
Thanks,
Mohamad
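One quick way to see why in that config.log is to search for the failed link test and the paths configure actually used; the exact line numbers will differ, so this is only a sketch:
    # locate the failing gurt link check and the --with-cart/--with-daos paths passed to configure
    grep -n -E 'cannot find -lgurt|with-cart|with-daos' config.log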
Olivier, Jeffrey V
To be clear, the build checks out cart and builds it in _build.external/cart, but it gets installed by default in the same location as daos.
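So, assuming the install prefix used above, libgurt should already be sitting under the daos install tree rather than under _build.external; a quick way to confirm (the lib vs. lib64 subdirectory may vary by platform) is something like:
    # confirm where libgurt.so was actually installed
    find /home/spark/daos/install -name 'libgurt.so*'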