Tuning problem


Zhu, Minming
 

Hi, guys:

      I have some DAOS performance tuning problems.

  1. About benchmarking DAOS: I pulled the latest ior_hpc on the daos branch, and an exception occurs when running it. The exception information is as follows.

IOR command: /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally, the output looks like this:

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

  2. About network performance: On the latest DAOS code (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows.

      Command: /usr/lib64/openmpi3/bin/orterun --allow-run-as-root --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2 -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

[vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Could you please help with these problems?

 

Regards,

Minmingz


Chaarawi, Mohamad
 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init(), so a simple MPI program wouldn't even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
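For example, applied to the IOR run from the first mail, that would look something like this (a sketch; host and paths unchanged from the original command):

/usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 --mca pml ob1 --mca btl tcp,self --mca oob tcp -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior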

 

Output of fi_info and ifconfig would help.
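For instance, run on the affected node (assuming the libfabric utilities are installed; -p filters fi_info output by provider):

fi_info -p psm2
ifconfig ib0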

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi , Guys :

      I have some about daos performance tuning problems.

  1. About benchmarking daos : The latest ior_hpc pulled on the daos branch, an exception occurs when running . The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

2About network performance On the latest daos(commit : 22ea193249741d40d24bc41bffef9dbcdedf3d41) code pulled, an exception occurred while executing self_test. . The exception information is as follows .

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Zhu, Minming
 

Hi, Mohamad:

        Thanks for your help.

  1. What does OPA mean?
  2. I tried what you suggested. With --mca pml ob1 added, ior runs successfully; however, with --mca btl tcp,self --mca oob tcp added, ior fails.
  3. I have solved this problem by adding -x UCX_NET_DEVICES=mlx5_1:1.

       IOR command: /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88

        But I encountered a new problem.

Error invalid argument: --dfs.pool

Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139

Error invalid argument: --dfs.svcl

Error invalid argument: 0

Error invalid argument: --dfs.cont

Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88

Invalid options

Synopsis /home/spark/cluster/ior_hpc/bin/ior

 

Flags

  -c                            collective -- collective I/O

……

 

Regards,

Minmingz

      

 

 



Chaarawi, Mohamad
 

That probably means that your IOR was not built with DAOS driver support.

If you enabled that, I would check the config.log in your IOR build and see why.
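For instance, a quick scan of config.log in the IOR build directory might surface the failing check (hypothetical grep patterns; the exact configure messages can differ):

grep -n -i -e daos -e gurt -e 'cannot find' config.log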

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 9:43 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi ,  Mohamad :

        Thanks for you help .

  1. What does OPA mean ?
  2. Tried as you said . Adding --mca pml ob1 ior work successful , however adding --mca btl tcp,self --mca oob tcp ior work fail .  
  3. I has solved this problem by adding  -x UCX_NET_DEVICES=mlx5_1:1 .

       IOR command :  /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88 .

        But encountered a new problem .

       

Error invalid argument: --dfs.pool

Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139

Error invalid argument: --dfs.svcl

Error invalid argument: 0

Error invalid argument: --dfs.cont

Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88

Invalid options

Synopsis /home/spark/cluster/ior_hpc/bin/ior

 

Flags

  -c                            collective -- collective I/O

……

 

Regards,

Minmingz

      

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?

 

Output of fi_info and ifconfig would help.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi , Guys :

      I have some about daos performance tuning problems.

  1. About benchmarking daos : The latest ior_hpc pulled on the daos branch, an exception occurs when running . The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

2About network performance On the latest daos(commit : 22ea193249741d40d24bc41bffef9dbcdedf3d41) code pulled, an exception occurred while executing self_test. . The exception information is as follows .

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Zhu, Minming
 

Hi, Mohamad:

 

      Yes, ior was built with DAOS driver support.

       Command: ./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart

       The attached file is config.log.

 

Regards,

Minmingz

 

 

 



Chaarawi, Mohamad
 

configure:5942: mpicc -std=gnu99 -o conftest -g -O2  -I/home/spark/daos/_build.external/cart/include/  -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt  -lm  >&5

/usr/bin/ld: cannot find -lgurt

collect2: error: ld returned 1 exit status

configure:5942: $? = 1

 

Are you sure /home/spark/daos/_build.external/cart is the path to your cart install dir?

That seems like a path to the cart source dir.
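A quick way to check would be to see which directory actually contains the library, e.g. (paths taken from this thread; /home/spark/daos/install is the dir passed to --with-daos):

ls /home/spark/daos/_build.external/cart/lib/libgurt.so
ls /home/spark/daos/install/lib/libgurt.so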

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 10:19 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi , Mohamad :

 

      Yes , ior was build with DAOS driver support .

       Command : ./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart

       Attach file is config.log .

 

Regards,

Minmingz

 

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:00 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

That probably means that your IOR was not built with DAOS driver support.

If you enabled that, I would check the config.log in your IOR build and see why.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 9:43 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi ,  Mohamad :

        Thanks for you help .

  1. What does OPA mean ?
  2. Tried as you said . Adding --mca pml ob1 ior work successful , however adding --mca btl tcp,self --mca oob tcp ior work fail .  
  3. I has solved this problem by adding  -x UCX_NET_DEVICES=mlx5_1:1 .

       IOR command :  /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88 .

        But encountered a new problem .

       

Error invalid argument: --dfs.pool

Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139

Error invalid argument: --dfs.svcl

Error invalid argument: 0

Error invalid argument: --dfs.cont

Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88

Invalid options

Synopsis /home/spark/cluster/ior_hpc/bin/ior

 

Flags

  -c                            collective -- collective I/O

……

 

Regards,

Minmingz

      

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?

 

Output of fi_info and ifconfig would help.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi , Guys :

      I have some about daos performance tuning problems.

  1. About benchmarking daos : The latest ior_hpc pulled on the daos branch, an exception occurs when running . The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

2About network performance On the latest daos(commit : 22ea193249741d40d24bc41bffef9dbcdedf3d41) code pulled, an exception occurred while executing self_test. . The exception information is as follows .

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Zhu, Minming
 

Hi, Mohamad:

 

        /home/spark/daos/_build.external/cart is the path to the cart build dir. The error message says that libgurt.so was not found, but it does exist in the local environment.

configure:5942: mpicc -std=gnu99 -o conftest -g -O2  -I/home/spark/daos/_build.external/cart/include/  -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt  -lm  >&5

/usr/bin/ld: cannot find -lgurt

 

Local env: [screenshot omitted]
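For reference, a command like this locates it:

find /home/spark/daos -name 'libgurt.so*'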

 

      

       

This is how it was previously built on boro, where ior can be executed.

 

Regards,

Minmingz

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:22 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

configure:5942: mpicc -std=gnu99 -o conftest -g -O2  -I/home/spark/daos/_build.external/cart/include/  -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt  -lm  >&5

/usr/bin/ld: cannot find -lgurt

collect2: error: ld returned 1 exit status

configure:5942: $? = 1

 

are you sure /home/spark/daos/_build.external/cart is the path to your cart install dir?

That seems like a path to the cart source dir.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 10:19 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi , Mohamad :

 

      Yes , ior was build with DAOS driver support .

       Command : ./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart

       Attach file is config.log .

 

Regards,

Minmingz

 

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:00 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

That probably means that your IOR was not built with DAOS driver support.

If you enabled that, I would check the config.log in your IOR build and see why.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 9:43 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi ,  Mohamad :

        Thanks for you help .

  1. What does OPA mean ?
  2. Tried as you said . Adding --mca pml ob1 ior work successful , however adding --mca btl tcp,self --mca oob tcp ior work fail .  
  3. I has solved this problem by adding  -x UCX_NET_DEVICES=mlx5_1:1 .

       IOR command :  /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88 .

        But encountered a new problem .

       

Error invalid argument: --dfs.pool

Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139

Error invalid argument: --dfs.svcl

Error invalid argument: 0

Error invalid argument: --dfs.cont

Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88

Invalid options

Synopsis /home/spark/cluster/ior_hpc/bin/ior

 

Flags

  -c                            collective -- collective I/O

……

 

Regards,

Minmingz

      

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?

 

Output of fi_info and ifconfig would help.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi , Guys :

      I have some about daos performance tuning problems.

  1. About benchmarking daos : The latest ior_hpc pulled on the daos branch, an exception occurs when running . The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

2. About network performance : On the latest daos code (commit : 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows.

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

[vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Olivier, Jeffrey V
 

You need to set --with-cart to /home/spark/daos/install, not _build.external/cart
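
Concretely, the IOR configure line quoted below would then become (a sketch; only --with-cart changes, the other paths are the ones already used in this thread):

./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/install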


-Jeff

 

From: daos@daos.groups.io [mailto:daos@daos.groups.io] On Behalf Of Zhu, Minming
Sent: Wednesday, February 19, 2020 10:31 AM
To: Chaarawi, Mohamad <mohamad.chaarawi@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: [daos] Tuning problem

 

Hi, Mohamad:

 

        /home/spark/daos/_build.external/cart is the path to the cart build dir. The error message says that libgurt.so was not found, but the file exists in the local environment.

       

configure:5942: mpicc -std=gnu99 -o conftest -g -O2  -I/home/spark/daos/_build.external/cart/include/  -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt  -lm  >&5

/usr/bin/ld: cannot find -lgurt

 

Local env : (inline screenshot omitted)

 This was the previous build on boro, where ior could be executed.

 

Regards,

Minmingz

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:22 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

configure:5942: mpicc -std=gnu99 -o conftest -g -O2  -I/home/spark/daos/_build.external/cart/include/  -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt  -lm  >&5

/usr/bin/ld: cannot find -lgurt

collect2: error: ld returned 1 exit status

configure:5942: $? = 1

 

Are you sure /home/spark/daos/_build.external/cart is the path to your cart install dir?

That seems like a path to the cart source dir.
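
A quick way to tell the two apart (a sketch; both paths are the ones from the configure line in this thread) is to check where libgurt.so actually is:

ls /home/spark/daos/install/lib/libgurt.so*                 # install tree: the library is expected here (or under lib64/)
ls /home/spark/daos/_build.external/cart/lib/libgurt.so*    # build tree: this is the -L path that failed to link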

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 10:19 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi, Mohamad:

 

      Yes, ior was built with DAOS driver support.

       Command : ./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart

       The attached file is config.log.

 

Regards,

Minmingz

 

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:00 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

That probably means that your IOR was not built with DAOS driver support.

If you enabled that, I would check the config.log in your IOR build and see why.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 9:43 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi, Mohamad:

        Thanks for your help.

1. What does OPA mean?

2. I tried what you suggested. Adding --mca pml ob1 makes ior work successfully; however, adding --mca btl tcp,self --mca oob tcp makes ior fail.

3. I solved this problem by adding -x UCX_NET_DEVICES=mlx5_1:1 (see the device-listing sketch after the output below).

       IOR command :  /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88

        But I encountered a new problem.

       

Error invalid argument: --dfs.pool

Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139

Error invalid argument: --dfs.svcl

Error invalid argument: 0

Error invalid argument: --dfs.cont

Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88

Invalid options

Synopsis /home/spark/cluster/ior_hpc/bin/ior

 

Flags

  -c                            collective -- collective I/O

……
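
On the UCX_NET_DEVICES fix in point 3: before pinning a device, the visible RDMA devices and UCX transports can be listed first (a sketch, assuming the standard libibverbs and UCX utilities are installed):

ibv_devices                                          # should list mlx5_1 among the verbs devices
ucx_info -d | grep -i -e 'device' -e 'transport'     # devices/transports UCX can actually open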

 

Regards,

Minmingz

      

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
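
As a concrete sanity check (a sketch reusing the flags already seen in this thread), one could first launch a trivial non-MPI binary under orterun, then retry ior with the suggested MCA parameters:

/usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 hostname    # does the launcher itself work end-to-end?
/usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 --mca pml ob1 --mca btl tcp,self --mca oob tcp /home/spark/cluster/ior_hpc/bin/ior    # bypasses the UCX PML that crashes in MPI_Init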

 

Output of fi_info and ifconfig would help.
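
For reference, those diagnostics could be gathered like this (assuming the libfabric utilities and net-tools are installed):

fi_info -p psm2    # is the psm2 provider that CRT_PHY_ADDR_STR=ofi+psm2 needs available?
fi_info            # full list of available providers
ifconfig ib0       # does the OFI_INTERFACE=ib0 interface exist and carry an address?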

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi , Guys :

      I have some about daos performance tuning problems.

1.      About benchmarking daos : The latest ior_hpc pulled on the daos branch, an exception occurs when running . The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

2About network performance On the latest daos(commit : 22ea193249741d40d24bc41bffef9dbcdedf3d41) code pulled, an exception occurred while executing self_test. . The exception information is as follows .

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Chaarawi, Mohamad
 

Yes, I had a chat offline with Minming and forgot to report back here.

But that was it.

 

Mohamad

 

From: "Olivier, Jeffrey V" <jeffrey.v.olivier@...>
Date: Thursday, February 20, 2020 at 9:34 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>, "Olivier, Jeffrey V" <jeffrey.v.olivier@...>
Subject: RE: [daos] Tuning problem

 

You need to set –with-cart to /home/spark/daos/install not _build.external/cart


-Jeff

 

From: daos@daos.groups.io [mailto:daos@daos.groups.io] On Behalf Of Zhu, Minming
Sent: Wednesday, February 19, 2020 10:31 AM
To: Chaarawi, Mohamad <mohamad.chaarawi@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: [daos] Tuning problem

 

Hi , Mohamad :

 

        /home/spark/daos/_build.external/cart is the path to the cart build dir. The error message is that the libgurt.so file was not found, but it exists in the local environment.

       

configure:5942: mpicc -std=gnu99 -o conftest -g -O2  -I/home/spark/daos/_build.external/cart/include/  -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt  -lm  >&5

/usr/bin/ld: cannot find -lgurt

 

Local env :

 

      

       

 This was the previous build on boro, ior can be executed.

 

Regards,

Minmingz

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:22 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

configure:5942: mpicc -std=gnu99 -o conftest -g -O2  -I/home/spark/daos/_build.external/cart/include/  -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt  -lm  >&5

/usr/bin/ld: cannot find -lgurt

collect2: error: ld returned 1 exit status

configure:5942: $? = 1

 

are you sure /home/spark/daos/_build.external/cart is the path to your cart install dir?

That seems like a path to the cart source dir.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 10:19 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi , Mohamad :

 

      Yes , ior was build with DAOS driver support .

       Command : ./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart

       Attach file is config.log .

 

Regards,

Minmingz

 

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:00 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

That probably means that your IOR was not built with DAOS driver support.

If you enabled that, I would check the config.log in your IOR build and see why.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 9:43 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi ,  Mohamad :

        Thanks for you help .

  1. What does OPA mean ?
  2. Tried as you said . Adding --mca pml ob1 ior work successful , however adding --mca btl tcp,self --mca oob tcp ior work fail .  
  3. I has solved this problem by adding  -x UCX_NET_DEVICES=mlx5_1:1 .

       IOR command :  /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88 .

        But encountered a new problem .

       

Error invalid argument: --dfs.pool

Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139

Error invalid argument: --dfs.svcl

Error invalid argument: 0

Error invalid argument: --dfs.cont

Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88

Invalid options

Synopsis /home/spark/cluster/ior_hpc/bin/ior

 

Flags

  -c                            collective -- collective I/O

……

 

Regards,

Minmingz

      

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?

 

Output of fi_info and ifconfig would help.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi , Guys :

      I have some about daos performance tuning problems.

  1. About benchmarking daos : The latest ior_hpc pulled on the daos branch, an exception occurs when running . The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

2About network performance On the latest daos(commit : 22ea193249741d40d24bc41bffef9dbcdedf3d41) code pulled, an exception occurred while executing self_test. . The exception information is as follows .

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Olivier, Jeffrey V
 

To be clear, the build checks out cart and builds it in _build.external/cart, but cart gets installed by default in the same location as daos.
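
Under the default layout, that means (a sketch; the prefixes are the ones used in this thread):

# checkout + build tree only (not the --with-cart target):
#   /home/spark/daos/_build.external/cart
# default install location, shared by daos and cart (use this for --with-cart):
#   /home/spark/daos/install
find /home/spark/daos/install -name 'libgurt.so*'    # confirm where libgurt landed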

 
