
Re: Hugepages setting

anton.brekhov@...
 

Patrick, thank you so much for the reply!

 


Re: [External] Re: [daos] dfs_stat and infinitely loop

Patrick Farrell <paf@...>
 

Are you using OPA?  I believe there are some issues with network contexts and different users in OPA...?
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Shengyu SY19 Zhang <zhangsy19@...>
Sent: Wednesday, February 19, 2020 7:41:09 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [External] Re: [daos] dfs_stat and infinitely loop
 

Hi Mohamad,

 

Yes, the code works without setgid(0) (or similar user-context functions). The client log shows nothing related; it loops infinitely in the poll, just as if a network packet had been lost. This is the debug stack:

 

#0  0x00007f1966699e63 in epoll_wait () from /lib64/libc.so.6

#1  0x00007f19658dc728 in hg_poll_wait (poll_set=0x85d990, timeout=timeout@entry=1, progressed=progressed@entry=0x7ffd6af1f49f "") at /root/daos/_build.external/mercury/src/util/mercury_poll.c:434

#2  0x00007f1965d05763 in hg_core_progress_poll (context=0x895b70, timeout=1) at /root/daos/_build.external/mercury/src/mercury_core.c:3280

#3  0x00007f1965d0a94c in HG_Core_progress (context=<optimized out>, timeout=timeout@entry=1) at /root/daos/_build.external/mercury/src/mercury_core.c:4877

#4  0x00007f1965d0242d in HG_Progress (context=context@entry=0x77f250, timeout=timeout@entry=1) at /root/daos/_build.external/mercury/src/mercury.c:2243

#5  0x00007f1966dfb28b in crt_hg_progress (hg_ctx=hg_ctx@entry=0x8909b8, timeout=timeout@entry=1000) at src/cart/crt_hg.c:1366

#6  0x00007f1966dbcf2b in crt_progress (crt_ctx=0x8909a0, timeout=timeout@entry=-1, cond_cb=cond_cb@entry=0x7f196772d5a0 <ev_progress_cb>, arg=arg@entry=0x7ffd6af1f5d0) at src/cart/crt_context.c:1300

#7  0x00007f19677328c6 in daos_event_priv_wait () at src/client/api/event.c:1205

#8  0x00007f1967736096 in dc_task_schedule (task=0x8a3be0, instant=instant@entry=true) at src/client/api/task.c:139

#9  0x00007f196773492c in daos_obj_fetch (oh=..., oh@entry=..., th=..., th@entry=..., flags=flags@entry=0, dkey=dkey@entry=0x7ffd6af1f6d0, nr=nr@entry=1, iods=iods@entry=0x7ffd6af1f6f0, sgls=sgls@entry=0x7ffd6af1f6b0, maps=maps@entry=0x0, ev=ev@entry=0x0)

    at src/client/api/object.c:170

#10 0x00007f19674f810a in fetch_entry (oh=oh@entry=..., th=..., th@entry=..., name=0x941808 "/", fetch_sym=fetch_sym@entry=true, exists=exists@entry=0x7ffd6af1f84f, entry=0x7ffd6af1f860) at src/client/dfs/dfs.c:329

#11 0x00007f19674fb4cf in entry_stat (dfs=dfs@entry=0x941770, th=th@entry=..., oh=..., name=name@entry=0x941808 "/", stbuf=stbuf@entry=0x7ffd6af1f9c0) at src/client/dfs/dfs.c:490

#12 0x00007f19675072e7 in dfs_stat (dfs=0x941770, parent=0x9417d8, name=0x941808 "/", stbuf=0x7ffd6af1f9c0) at src/client/dfs/dfs.c:2876

#13 0x00000000004012c3 in main ()

 

Regards,

Shengyu.

 

From: <daos@daos.groups.io> on behalf of "Chaarawi, Mohamad" <mohamad.chaarawi@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, February 19, 2020 at 11:19 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [External] Re: [daos] dfs_stat and infinitely loop

 

Hi Shengyu,

 

If you don't setgid(0), it works? I'm not sure why that would cause the operation not to return.

Could you please attach gdb and return a trace of where it hangs? Do you see anything suspicious in the DAOS client log?

 

Thanks,

Mohamad

 

From: <daos@daos.groups.io> on behalf of Shengyu SY19 Zhang <zhangsy19@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Tuesday, February 18, 2020 at 9:35 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] dfs_stat and infinitely loop

 

Hello,

 

Recently I hit this issue: when I call dfs_stat in my code, it never returns. I have now found the basic reason, but I don't have a solution yet. This is sample code:

rc = dfs_mount(dfs_poh, coh, O_RDWR, &dfs1);
if (rc != -DER_SUCCESS) {
        printf("Failed to mount to container (%d)\n", rc);
        D_GOTO(out_dfs, 0);
}

setgid(0);

struct stat stbuf = {0};

rc = dfs_stat(dfs1, NULL, NULL, (struct stat *)&stbuf);
if (rc)
        printf("stat '' failed, rc: %d\n", rc);
else
        printf("stat '' succeeded, rc: %d\n", rc);

 

Once setgid(0) is called, even though it does not change the current gid, the problem always happens. I'm working on a DAOS Samba plugin, and there are lots of similar user-context-switch operations.
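For illustration, a user-context switch of this kind is usually done with the effective IDs so it can be reverted. The sketch below is hypothetical (the helper name stat_as_group is invented, and the daos_fs.h header/dfs_stat signature are assumed from the sample above); it only shows the save/switch/restore pattern, not a fix for the hang:

#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <daos_fs.h>    /* dfs_t, dfs_stat(); header name assumed */

/* Hypothetical sketch of the save/switch/restore pattern used for
 * per-user operations; it only illustrates the context switching
 * described above and is not a fix for the hang. */
static int stat_as_group(dfs_t *dfs, const char *name, gid_t gid,
                         struct stat *stbuf)
{
        gid_t   saved = getegid();
        int     rc;

        /* Switch only the effective GID so it can be restored later,
         * unlike the irreversible setgid(0) in the sample. */
        if (setegid(gid) != 0)
                return errno;

        rc = dfs_stat(dfs, NULL, name, stbuf);

        if (setegid(saved) != 0)        /* always restore the saved GID */
                perror("setegid(restore)");
        return rc;
}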

 

Regards,

Shengyu.


Re: [External] Re: [daos] dfs_stat and infinitely loop

Shengyu SY19 Zhang
 

Hi Mohamad,

 

Yes, the code works without setgid(0) (or similar user-context functions). The client log shows nothing related; it loops infinitely in the poll, just as if a network packet had been lost. This is the debug stack:

 

#0  0x00007f1966699e63 in epoll_wait () from /lib64/libc.so.6

#1  0x00007f19658dc728 in hg_poll_wait (poll_set=0x85d990, timeout=timeout@entry=1, progressed=progressed@entry=0x7ffd6af1f49f "") at /root/daos/_build.external/mercury/src/util/mercury_poll.c:434

#2  0x00007f1965d05763 in hg_core_progress_poll (context=0x895b70, timeout=1) at /root/daos/_build.external/mercury/src/mercury_core.c:3280

#3  0x00007f1965d0a94c in HG_Core_progress (context=<optimized out>, timeout=timeout@entry=1) at /root/daos/_build.external/mercury/src/mercury_core.c:4877

#4  0x00007f1965d0242d in HG_Progress (context=context@entry=0x77f250, timeout=timeout@entry=1) at /root/daos/_build.external/mercury/src/mercury.c:2243

#5  0x00007f1966dfb28b in crt_hg_progress (hg_ctx=hg_ctx@entry=0x8909b8, timeout=timeout@entry=1000) at src/cart/crt_hg.c:1366

#6  0x00007f1966dbcf2b in crt_progress (crt_ctx=0x8909a0, timeout=timeout@entry=-1, cond_cb=cond_cb@entry=0x7f196772d5a0 <ev_progress_cb>, arg=arg@entry=0x7ffd6af1f5d0) at src/cart/crt_context.c:1300

#7  0x00007f19677328c6 in daos_event_priv_wait () at src/client/api/event.c:1205

#8  0x00007f1967736096 in dc_task_schedule (task=0x8a3be0, instant=instant@entry=true) at src/client/api/task.c:139

#9  0x00007f196773492c in daos_obj_fetch (oh=..., oh@entry=..., th=..., th@entry=..., flags=flags@entry=0, dkey=dkey@entry=0x7ffd6af1f6d0, nr=nr@entry=1, iods=iods@entry=0x7ffd6af1f6f0, sgls=sgls@entry=0x7ffd6af1f6b0, maps=maps@entry=0x0, ev=ev@entry=0x0)

    at src/client/api/object.c:170

#10 0x00007f19674f810a in fetch_entry (oh=oh@entry=..., th=..., th@entry=..., name=0x941808 "/", fetch_sym=fetch_sym@entry=true, exists=exists@entry=0x7ffd6af1f84f, entry=0x7ffd6af1f860) at src/client/dfs/dfs.c:329

#11 0x00007f19674fb4cf in entry_stat (dfs=dfs@entry=0x941770, th=th@entry=..., oh=..., name=name@entry=0x941808 "/", stbuf=stbuf@entry=0x7ffd6af1f9c0) at src/client/dfs/dfs.c:490

#12 0x00007f19675072e7 in dfs_stat (dfs=0x941770, parent=0x9417d8, name=0x941808 "/", stbuf=0x7ffd6af1f9c0) at src/client/dfs/dfs.c:2876

#13 0x00000000004012c3 in main ()

 

Regards,

Shengyu.

 

From: <daos@daos.groups.io> on behalf of "Chaarawi, Mohamad" <mohamad.chaarawi@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, February 19, 2020 at 11:19 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [External] Re: [daos] dfs_stat and infinitely loop

 

Hi Shengyu,

 

If you don't setgid(0), it works? I'm not sure why that would cause the operation not to return.

Could you please attach gdb and return a trace of where it hangs? Do you see anything suspicious in the DAOS client log?

 

Thanks,

Mohamad

 

From: <daos@daos.groups.io> on behalf of Shengyu SY19 Zhang <zhangsy19@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Tuesday, February 18, 2020 at 9:35 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] dfs_stat and infinitely loop

 

Hello,

 

Recently I hit this issue: when I call dfs_stat in my code, it never returns. I have now found the basic reason, but I don't have a solution yet. This is sample code:

rc = dfs_mount(dfs_poh, coh, O_RDWR, &dfs1);
if (rc != -DER_SUCCESS) {
        printf("Failed to mount to container (%d)\n", rc);
        D_GOTO(out_dfs, 0);
}

setgid(0);

struct stat stbuf = {0};

rc = dfs_stat(dfs1, NULL, NULL, (struct stat *)&stbuf);
if (rc)
        printf("stat '' failed, rc: %d\n", rc);
else
        printf("stat '' succeeded, rc: %d\n", rc);

 

Once setgid(0) is called, even though it does not change the current gid, the problem always happens. I'm working on a DAOS Samba plugin, and there are lots of similar user-context-switch operations.

 

Regards,

Shengyu.


Re: Hugepages setting

Patrick Farrell <paf@...>
 

Anton,

 

This message is a little bit confusing – it just indicates there were no 1 GiB huge pages, which is fine. Other smaller huge pages were acquired successfully, and this doesn't indicate any problem that will prevent you from running.
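If it helps to double-check, the hugepage counters the kernel exposes can be read directly from /proc/meminfo (standard Linux, nothing DAOS-specific; per-size counts also live under /sys/kernel/mm/hugepages/). A minimal sketch:

#include <stdio.h>
#include <string.h>

/* Print the hugepage-related lines from /proc/meminfo, e.g.
 * HugePages_Total/HugePages_Free and Hugepagesize, to confirm which
 * hugepage sizes the host actually provides. */
int main(void)
{
        FILE *fp = fopen("/proc/meminfo", "r");
        char  line[256];

        if (fp == NULL) {
                perror("fopen /proc/meminfo");
                return 1;
        }
        while (fgets(line, sizeof(line), fp) != NULL) {
                if (strstr(line, "Huge") != NULL)
                        fputs(line, stdout);
        }
        fclose(fp);
        return 0;
}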

 

So you're good to go - you just need to format, and the server should finish startup normally.

 

-Patrick

 

From: <daos@daos.groups.io> on behalf of "anton.brekhov@..." <anton.brekhov@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, February 19, 2020 at 9:58 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] Hugepages setting

 

Hi everyone!

I want to launch a local DAOS server on one node. I'm using the Docker-based installation guide at https://daos-stack.github.io/#admin/installation/ (CentOS 7).

I've added the libfabric dependency to the Dockerfile and installed uio_pci_generic on the host. I also want to use DRAM as SCM.

So I started it with the command

docker exec server daos_server start -o /home/daos/daos/utils/config/examples/daos_server_local.yml

And I got this error:

daos_server logging to file /tmp/daos_control.log ERROR: /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB DAOS Control Server (pid 560) listening on 0.0.0.0:10001 Waiting for DAOS I/O Server instance storage to be ready... SCM format required on instance 0

Can I use DRAM without hugepages? If not, how do I need to configure it (a link to a guide will be enough)?

 

Thanks! 

 

 


Re: Unable to run DAOS commands - Agent reports "no dRPC client set"

Patrick Farrell <paf@...>
 

mjmac,

Ah, that has it working again.  Thanks much for the pointer.

Just out of curiosity, was any thought given to making this a reported failure?  I see Niu's patch just corrects the misapplication.

It seems like an error in entering the whitelist (if I'm understanding correctly, perhaps the parameter is generated) is far from impossible, and the failure I experienced was silent on the server side.

I am not entirely clear on how the problem manifests itself - if the data plane truly doesn't start in this case, or if it fails when trying to access the device to actually do something when prompted by a client, or if there is some other issue - but this seems like a condition that would be worth reporting in some way (unless it really is an internal sort of failure, which can only realistically occur due to applying the whitelist to the wrong kind of device rather than user config).

- Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Macdonald, Mjmac <mjmac.macdonald@...>
Sent: Tuesday, February 18, 2020 8:32 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"
 
Hi Patrick.

A commit (18d31d) just landed to master this morning that will probably fix that issue. As part of the work you referenced, a new whitelist parameter is being used to ensure that each ioserver only has access to the devices specified in the configuration. Unfortunately, this doesn't work with emulated devices, so the fix is to avoid using the whitelist except with real devices.

Sorry about that, hope this helps.

Best,
mjmac


Re: Tuning problem

Zhu, Minming
 

Hi , Mohamad :

 

        /home/spark/daos/_build.external/cart is the path to the cart build dir. The error message is that the libgurt.so file was not found, but it exists in the local environment.

       

configure:5942: mpicc -std=gnu99 -o conftest -g -O2  -I/home/spark/daos/_build.external/cart/include/  -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt  -lm  >&5

/usr/bin/ld: cannot find -lgurt

 

Local env :

This was the previous build on boro; ior can be executed there.

 

Regards,

Minmingz

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:22 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

configure:5942: mpicc -std=gnu99 -o conftest -g -O2  -I/home/spark/daos/_build.external/cart/include/  -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt  -lm  >&5

/usr/bin/ld: cannot find -lgurt

collect2: error: ld returned 1 exit status

configure:5942: $? = 1

 

are you sure /home/spark/daos/_build.external/cart is the path to your cart install dir?

That seems like a path to the cart source dir.
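For reference, the check that fails is just a trivial link test; something like the stand-in below (placeholder paths, not taken from your setup) can be compiled by hand to confirm which -L directory actually resolves -lgurt:

/* conftest.c - a stand-in for the trivial program configure:5942 links.
 * Build it by hand, pointing -L at the directory that really contains
 * libgurt.so (placeholder paths below):
 *
 *   mpicc -std=gnu99 -o conftest \
 *         -I<cart-install-prefix>/include \
 *         -L<cart-install-prefix>/lib \
 *         conftest.c -lgurt -lm
 *
 * If this links cleanly, configure's -lgurt probe will pass with the
 * same -L path. */
int main(void)
{
        return 0;
}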

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 10:19 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi , Mohamad :

 

      Yes, ior was built with DAOS driver support.

       Command : ./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart

       The attached file is config.log.

 

Regards,

Minmingz

 

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:00 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

That probably means that your IOR was not built with DAOS driver support.

If you enabled that, I would check the config.log in your IOR build and see why.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 9:43 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi ,  Mohamad :

        Thanks for your help.

  1. What does OPA mean?
  2. I tried what you suggested. Adding --mca pml ob1 makes ior work successfully; however, adding --mca btl tcp,self --mca oob tcp makes ior fail.
  3. I have solved this problem by adding -x UCX_NET_DEVICES=mlx5_1:1.

       IOR command :  /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88 .

        But I encountered a new problem.

       

Error invalid argument: --dfs.pool

Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139

Error invalid argument: --dfs.svcl

Error invalid argument: 0

Error invalid argument: --dfs.cont

Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88

Invalid options

Synopsis /home/spark/cluster/ior_hpc/bin/ior

 

Flags

  -c                            collective -- collective I/O

……

 

Regards,

Minmingz

      

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
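As a quick way to separate the MPI/UCX problem from IOR and DAOS, a bare MPI_Init test like the sketch below (the file name mpi_check.c is just an example) can be launched with the same orterun options:

#include <stdio.h>
#include <mpi.h>

/* Minimal MPI sanity check: if MPI_Init() hits the same UCX assertion
 * here, the failure is in the MPI/UCX setup rather than in IOR or DAOS.
 * Build and run, for example:
 *   mpicc mpi_check.c -o mpi_check
 *   /usr/lib64/openmpi3/bin/orterun -np 1 --host vsr139 ./mpi_check */
int main(int argc, char **argv)
{
        int rank = -1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("MPI_Init succeeded, rank %d\n", rank);
        MPI_Finalize();
        return 0;
}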

 

Output of fi_info and ifconfig would help.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi , Guys :

      I have some questions about DAOS performance tuning.

  1. About benchmarking DAOS: with the latest ior_hpc pulled on the daos branch, an exception occurs when running. The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

  2. About network performance: on the latest DAOS code pulled (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows.

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Re: Tuning problem

Chaarawi, Mohamad
 

configure:5942: mpicc -std=gnu99 -o conftest -g -O2  -I/home/spark/daos/_build.external/cart/include/  -L/home/spark/daos/_build.external/cart/lib conftest.c -lgurt  -lm  >&5

/usr/bin/ld: cannot find -lgurt

collect2: error: ld returned 1 exit status

configure:5942: $? = 1

 

are you sure /home/spark/daos/_build.external/cart is the path to your cart install dir?

That seems like a path to the cart source dir.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 10:19 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi , Mohamad :

 

      Yes, ior was built with DAOS driver support.

       Command : ./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart

       The attached file is config.log.

 

Regards,

Minmingz

 

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:00 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

That probably means that your IOR was not built with DAOS driver support.

If you enabled that, I would check the config.log in your IOR build and see why.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 9:43 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi ,  Mohamad :

        Thanks for your help.

  1. What does OPA mean?
  2. I tried what you suggested. Adding --mca pml ob1 makes ior work successfully; however, adding --mca btl tcp,self --mca oob tcp makes ior fail.
  3. I have solved this problem by adding -x UCX_NET_DEVICES=mlx5_1:1.

       IOR command :  /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88 .

        But I encountered a new problem.

       

Error invalid argument: --dfs.pool

Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139

Error invalid argument: --dfs.svcl

Error invalid argument: 0

Error invalid argument: --dfs.cont

Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88

Invalid options

Synopsis /home/spark/cluster/ior_hpc/bin/ior

 

Flags

  -c                            collective -- collective I/O

……

 

Regards,

Minmingz

      

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?

 

Output of fi_info and ifconfig would help.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi , Guys :

      I have some questions about DAOS performance tuning.

  1. About benchmarking DAOS: with the latest ior_hpc pulled on the daos branch, an exception occurs when running. The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

  2. About network performance: on the latest DAOS code pulled (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows.

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Re: Tuning problem

Zhu, Minming
 

Hi , Mohamad :

 

      Yes, ior was built with DAOS driver support.

       Command : ./configure --prefix=/home/spark/cluster/ior_hpc --with-daos=/home/spark/daos/install --with-cart=/home/spark/daos/_build.external/cart

       The attached file is config.log.

 

Regards,

Minmingz

 

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Thursday, February 20, 2020 12:00 AM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

That probably means that your IOR was not built with DAOS driver support.

If you enabled that, I would check the config.log in your IOR build and see why.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 9:43 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi ,  Mohamad :

        Thanks for your help.

  1. What does OPA mean?
  2. I tried what you suggested. Adding --mca pml ob1 makes ior work successfully; however, adding --mca btl tcp,self --mca oob tcp makes ior fail.
  3. I have solved this problem by adding -x UCX_NET_DEVICES=mlx5_1:1.

       IOR command :  /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88 .

        But I encountered a new problem.

       

Error invalid argument: --dfs.pool

Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139

Error invalid argument: --dfs.svcl

Error invalid argument: 0

Error invalid argument: --dfs.cont

Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88

Invalid options

Synopsis /home/spark/cluster/ior_hpc/bin/ior

 

Flags

  -c                            collective -- collective I/O

……

 

Regards,

Minmingz

      

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?

 

Output of fi_info and ifconfig would help.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi , Guys :

      I have some questions about DAOS performance tuning.

  1. About benchmarking DAOS: with the latest ior_hpc pulled on the daos branch, an exception occurs when running. The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

  2. About network performance: on the latest DAOS code pulled (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows.

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Re: Tuning problem

Chaarawi, Mohamad
 

That probably means that your IOR was not built with DAOS driver support.

If you enabled that, I would check the config.log in your IOR build and see why.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Wednesday, February 19, 2020 at 9:43 AM
To: "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "daos@daos.groups.io" <daos@daos.groups.io>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: RE: Tuning problem

 

Hi ,  Mohamad :

        Thanks for your help.

  1. What does OPA mean?
  2. I tried what you suggested. Adding --mca pml ob1 makes ior work successfully; however, adding --mca btl tcp,self --mca oob tcp makes ior fail.
  3. I have solved this problem by adding -x UCX_NET_DEVICES=mlx5_1:1.

       IOR command :  /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88 .

        But I encountered a new problem.

       

Error invalid argument: --dfs.pool

Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139

Error invalid argument: --dfs.svcl

Error invalid argument: 0

Error invalid argument: --dfs.cont

Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88

Invalid options

Synopsis /home/spark/cluster/ior_hpc/bin/ior

 

Flags

  -c                            collective -- collective I/O

……

 

Regards,

Minmingz

      

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?

 

Output of fi_info and ifconfig would help.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi , Guys :

      I have some questions about DAOS performance tuning.

  1. About benchmarking DAOS: with the latest ior_hpc pulled on the daos branch, an exception occurs when running. The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

  2. About network performance: on the latest DAOS code pulled (commit 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows.

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Hugepages setting

anton.brekhov@...
 

Hi everyone!

I want to launch a local DAOS server on one node. I'm using the Docker-based installation guide at https://daos-stack.github.io/#admin/installation/ (CentOS 7).

I've added the libfabric dependency to the Dockerfile and installed uio_pci_generic on the host. I also want to use DRAM as SCM.

So I started it with the command

docker exec server daos_server start -o /home/daos/daos/utils/config/examples/daos_server_local.yml

And I got this error:

daos_server logging to file /tmp/daos_control.log ERROR: /usr/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB DAOS Control Server (pid 560) listening on 0.0.0.0:10001 Waiting for DAOS I/O Server instance storage to be ready... SCM format required on instance 0

Can I use DRAM without hugepages? If not, how do I need to configure it (a link to a guide will be enough)?
 
Thanks! 

 

 


Re: Tuning problem

Zhu, Minming
 

Hi ,  Mohamad :

        Thanks for your help.

  1. What does OPA mean?
  2. I tried what you suggested. Adding --mca pml ob1 makes ior work successfully; however, adding --mca btl tcp,self --mca oob tcp makes ior fail.
  3. I have solved this problem by adding -x UCX_NET_DEVICES=mlx5_1:1.

       IOR command :  /usr/lib64/openmpi3/bin/orterun -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi -x UCX_NET_DEVICES=mlx5_1:1 --host vsr135 --allow-run-as-root /home/spark/cluster/ior_hpc/bin/ior -a dfs -r -w -t 1m -b 50g -d /test --dfs.pool 85a86066-eb7e-4e66-b3a4-6b668c53c139 --dfs.svcl 0 --dfs.cont 4c45229b-b8be-443e-af72-8dc5aaeccc88 .

        But I encountered a new problem.

       

Error invalid argument: --dfs.pool

Error invalid argument: 85a86066-eb7e-4e66-b3a4-6b668c53c139

Error invalid argument: --dfs.svcl

Error invalid argument: 0

Error invalid argument: --dfs.cont

Error invalid argument: 4c45229b-b8be-443e-af72-8dc5aaeccc88

Invalid options

Synopsis /home/spark/cluster/ior_hpc/bin/ior

 

Flags

  -c                            collective -- collective I/O

……

 

Regards,

Minmingz

      

 

 

From: Chaarawi, Mohamad <mohamad.chaarawi@...>
Sent: Wednesday, February 19, 2020 11:16 PM
To: Zhu, Minming <minming.zhu@...>; daos@daos.groups.io; Lombardi, Johann <johann.lombardi@...>
Cc: Zhang, Jiafu <jiafu.zhang@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: Tuning problem

 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?

 

Output of fi_info and ifconfig would help.

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi , Guys :

      I have some questions about DAOS performance tuning.

  1. About benchmarking DAOS: with the latest ior_hpc pulled on the daos branch, an exception occurs when running. The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

2. About network performance: on the latest DAOS code pulled (commit: 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows.

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Re: dfs_stat and infinitely loop

Chaarawi, Mohamad
 

Hi Shengyu,

 

If you don’t call setgid(0), does it work? I’m not sure why that would cause the operation not to return.

Could you please attach gdb and return a trace of where it hangs? Do you see anything suspicious in the DAOS client log?
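For example, something like the following generic gdb session would capture it (assuming you can find the pid of the hung client process):

gdb -p <pid>
(gdb) thread apply all bt
(gdb) detach
(gdb) quit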

 

Thanks,

Mohamad

 

From: <daos@daos.groups.io> on behalf of Shengyu SY19 Zhang <zhangsy19@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Tuesday, February 18, 2020 at 9:35 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] dfs_stat and infinitely loop

 

Hello,

 

Recently I ran into this issue: when I call dfs_stat in my code, it never returns. I have now found the basic cause, but I haven’t got a solution yet. This is the sample code:

rc = dfs_mount(dfs_poh, coh, O_RDWR, &dfs1);

        if (rc != -DER_SUCCESS) {

                printf("Failed to mount to container (%d)\n", rc);

                D_GOTO(out_dfs, 0);

        }

 

        setgid(0);

       

        struct stat stbuf = {0};

       

rc = dfs_stat(dfs1, NULL, NULL, (struct stat *) &stbuf);

        if(rc)

                printf("stat '' failed, rc: %d\n", rc);

        else

                printf("stat '' succeeded, rc: %d\n", rc);

 

Note the setgid(0): even though it makes no change to the current gid, the problem always happens once it is called. I’m working on the DAOS Samba plugin, which performs lots of similar user-context switches.

 

Regards,

Shengyu.


Re: Tuning problem

Chaarawi, Mohamad
 

Could you provide some info on the system you are running on? Do you have OPA there?

You are failing in MPI_Init() so a simple MPI program wouldn’t even work for you. Could you add --mca pml ob1 --mca btl tcp,self --mca oob tcp and check?
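For reference, applied to the ior command line from your mail, that would look roughly like this (same host and paths, purely illustrative):

/usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 --mca pml ob1 --mca btl tcp,self --mca oob tcp -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi /home/spark/cluster/ior_hpc/bin/ior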

 

Output of fi_info and ifconfig would help.
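For example (run on vsr139; plain invocations are fine): fi_info (or fi_info -v for more detail) to list the libfabric providers that are actually available, and ifconfig ib0 for the interface you are passing as OFI_INTERFACE.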

 

Thanks,

Mohamad

 

From: "Zhu, Minming" <minming.zhu@...>
Date: Tuesday, February 18, 2020 at 4:51 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>, "Chaarawi, Mohamad" <mohamad.chaarawi@...>, "Lombardi, Johann" <johann.lombardi@...>
Cc: "Zhang, Jiafu" <jiafu.zhang@...>, "Wang, Carson" <carson.wang@...>, "Guo, Chenzhao" <chenzhao.guo@...>
Subject: Tuning problem

 

Hi guys,

      I have some questions about DAOS performance tuning.

  1. About benchmarking DAOS: with the latest ior_hpc pulled on the daos branch, an exception occurs when running it. The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

2. About network performance: on the latest DAOS code pulled (commit: 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows.

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Re: Unable to run DAOS commands - Agent reports "no dRPC client set"

Farrell, Patrick Arthur <patrick.farrell@...>
 

Tom,

You've probably seen it, but if not, FYI: mjmac pointed me to commit 18d31d, which landed yesterday and resolved the issue for me.

Thanks for taking a look!


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Nabarro, Tom <tom.nabarro@...>
Sent: Wednesday, February 19, 2020 7:17 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"
 
Hello Patrick, I'm looking into this now.

On 17 Feb 2020 22:52, Patrick Farrell <paf@...> wrote:
I finally gave up and bisected this.

This problem started with DAOS-4034 control: enable vfio permissions for non-root (#1785)/14c7c2e06512659f4122a01c57e82ad58ee642b0

Looking at it, it does a variety of things, and I'm not having any luck tracking down what's broken by this change.  I made sure to enable the vfio driver as mentioned in the patch notes, but I'm not seeing any change.

One note: I am running as root, because that has been the easiest setup so far.
Is running as root perhaps broken by this patch?

- Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Sent: Wednesday, February 12, 2020 11:18 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"
 
Good morning,

I've just moved up to the latest tip-of-tree DAOS (I'm not sure exactly which commit I was running before, but it was a week or two out of date), and I can't get any tests to run.

I've pared back to a trivial config, and I appear to be able to start the server, etc, but the agent claims the data plane is not running and I'm not having a lot of luck troubleshooting.

Here's my server startup command & output:
/root/daos/install/bin/daos_server start -o /root/daos/utils/config/examples/daos_server_local.yml
/root/daos/install/bin/daos_server logging to file /tmp/daos_control.log
ERROR: /root/daos/install/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
DAOS Control Server (pid 22075) listening on 0.0.0.0:10001
Waiting for DAOS I/O Server instance storage to be ready...
SCM format required on instance 0
formatting storage for DAOS I/O Server instance 0 (reformat: false)
Starting format of SCM (ram:/mnt/daos)
Finished format of SCM (ram:/mnt/daos)
Starting format of kdev block devices (/dev/sdl1)
Finished format of kdev block devices (/dev/sdl1)
DAOS I/O Server instance 0 storage ready
SCM @ /mnt/daos: 16.00GB Total/16.00GB Avail
Starting I/O server instance 0: /root/daos/install/bin/daos_io_server
daos_io_server:0 Using legacy core allocation algorithm

As you can see, I format and the server appears to start normally.

Here's that format command output:
dmg -i storage format
localhost:10001: connected

localhost: storage format ok

I run the agent, and it appears OK:
daos_agent -i
Starting daos_agent:
Using logfile: /tmp/daos_agent.log
Listening on /var/run/daos_agent/agent.sock

But when I try to run daos_test, everything it attempts fails, and the agent prints this message over and over:
ERROR: HandleCall for 2:206 failed: GetAttachInfo hl-d102:10001 {daos_server {} [] 13}: rpc error: code = Unknown desc = no dRPC client set (data plane not started?)

I believe I've got the environment variables set up correctly everywhere, and I have not configured access_points, etc - This is a trivial single server config.

This is the entirety of my file based config changes:
--- a/utils/config/examples/daos_server_local.yml
+++ b/utils/config/examples/daos_server_local.yml
@@ -14,7 +14,7 @@ servers:
   targets: 1
   first_core: 0
   nr_xs_helpers: 0
-  fabric_iface: eth0
+  fabric_iface: enp6s0
   fabric_iface_port: 31416
   log_file: /tmp/daos_server.log

@@ -31,8 +31,8 @@ servers:
   # The size of ram is specified by scm_size in GB units.
   scm_mount: /mnt/daos # map to -s /mnt/daos
   scm_class: ram
-  scm_size: 4
+  scm_size: 16

-  bdev_class: file
-  bdev_size: 16
-  bdev_list: [/tmp/daos-bdev]
+  bdev_class: kdev
+  bdev_size: 64
+  bdev_list: [/dev/sdl1]
---------

Any clever ideas what's wrong here?  Is there a command or config change I missed?

Thanks,
-Patrick



Re: Unable to run DAOS commands - Agent reports "no dRPC client set"

Nabarro, Tom
 

Hello Patrick, I'm looking into this now.

On 17 Feb 2020 22:52, Patrick Farrell <paf@...> wrote:
I finally gave up and bisected this.

This problem started with DAOS-4034 control: enable vfio permissions for non-root (#1785)/14c7c2e06512659f4122a01c57e82ad58ee642b0

Looking at it, it does a variety of things, and I'm not having any luck tracking down what's broken by this change.  I made sure to enable the vfio driver as mentioned in the patch notes, but I'm not seeing any change.

One note: I am running as root, because that has been the easiest setup so far.
Is running as root perhaps broken by this patch?

- Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Patrick Farrell <paf@...>
Sent: Wednesday, February 12, 2020 11:18 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"
 
Good morning,

I've just moved up to the latest tip-of-tree DAOS (I'm not sure exactly which commit I was running before, but it was a week or two out of date), and I can't get any tests to run.

I've pared back to a trivial config, and I appear to be able to start the server, etc, but the agent claims the data plane is not running and I'm not having a lot of luck troubleshooting.

Here's my server startup command & output:
/root/daos/install/bin/daos_server start -o /root/daos/utils/config/examples/daos_server_local.yml
/root/daos/install/bin/daos_server logging to file /tmp/daos_control.log
ERROR: /root/daos/install/bin/daos_admin EAL: No free hugepages reported in hugepages-1048576kB
DAOS Control Server (pid 22075) listening on 0.0.0.0:10001
Waiting for DAOS I/O Server instance storage to be ready...
SCM format required on instance 0
formatting storage for DAOS I/O Server instance 0 (reformat: false)
Starting format of SCM (ram:/mnt/daos)
Finished format of SCM (ram:/mnt/daos)
Starting format of kdev block devices (/dev/sdl1)
Finished format of kdev block devices (/dev/sdl1)
DAOS I/O Server instance 0 storage ready
SCM @ /mnt/daos: 16.00GB Total/16.00GB Avail
Starting I/O server instance 0: /root/daos/install/bin/daos_io_server
daos_io_server:0 Using legacy core allocation algorithm

As you can see, I format and the server appears to start normally.

Here's that format command output:
dmg -i storage format
localhost:10001: connected

localhost: storage format ok

I run the agent, and it appears OK:
daos_agent -i
Starting daos_agent:
Using logfile: /tmp/daos_agent.log
Listening on /var/run/daos_agent/agent.sock

But when I try to run daos_test, everything it attempts fails, and the agent prints this message over and over:
ERROR: HandleCall for 2:206 failed: GetAttachInfo hl-d102:10001 {daos_server {} [] 13}: rpc error: code = Unknown desc = no dRPC client set (data plane not started?)

I believe I've got the environment variables set up correctly everywhere, and I have not configured access_points, etc - This is a trivial single server config.

This is the entirety of my file based config changes:
--- a/utils/config/examples/daos_server_local.yml
+++ b/utils/config/examples/daos_server_local.yml
@@ -14,7 +14,7 @@ servers:
   targets: 1
   first_core: 0
   nr_xs_helpers: 0
-  fabric_iface: eth0
+  fabric_iface: enp6s0
   fabric_iface_port: 31416
   log_file: /tmp/daos_server.log

@@ -31,8 +31,8 @@ servers:
   # The size of ram is specified by scm_size in GB units.
   scm_mount: /mnt/daos # map to -s /mnt/daos
   scm_class: ram
-  scm_size: 4
+  scm_size: 16

-  bdev_class: file
-  bdev_size: 16
-  bdev_list: [/tmp/daos-bdev]
+  bdev_class: kdev
+  bdev_size: 64
+  bdev_list: [/dev/sdl1]
---------

Any clever ideas what's wrong here?  Is there a command or config change I missed?

Thanks,
-Patrick



dfs_stat and infinitely loop

Shengyu SY19 Zhang
 

Hello,

 

Recently I ran into this issue: when I call dfs_stat in my code, it never returns. I have now found the basic cause, but I haven’t got a solution yet. This is the sample code:

rc = dfs_mount(dfs_poh, coh, O_RDWR, &dfs1);

        if (rc != -DER_SUCCESS) {

                printf("Failed to mount to container (%d)\n", rc);

                D_GOTO(out_dfs, 0);

        }

 

        setgid(0);

       

        struct stat stbuf = {0};

       

rc = dfs_stat(dfs1, NULL, NULL, (struct stat *) &stbuf);

        if(rc)

                printf("stat '' failed, rc: %d\n", rc);

        else

                printf("stat '' succeeded, rc: %d\n", rc);

 

Note the setgid(0): even though it makes no change to the current gid, the problem always happens once it is called. I’m working on the DAOS Samba plugin, which performs lots of similar user-context switches.
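To illustrate (this snippet is only a sketch, not part of the reproducer above; it reuses the same rc, dfs1 and stbuf variables and needs <unistd.h> and <assert.h> in addition to what the sample already includes):

        gid_t cur = getegid();
        if (setgid(cur) != 0)                      /* re-set the same gid, so nothing actually changes */
                perror("setgid");
        assert(getegid() == cur);                  /* effective gid is unchanged */
        rc = dfs_stat(dfs1, NULL, NULL, &stbuf);   /* still never returns once setgid() has been called */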

 

Regards,

Shengyu.


Re: How to configure IB with multiple mlx4 devices per server

Latham, Robert J.
 

On Sun, 2020-02-16 at 16:28 +0000, Kevan Rehm wrote:

I can’t be the first person to want to use multiple IB devices on
each node in the cluster; what are the configuration tricks to make
it work?

Hi Kevan: I don't have much in the way of solutions, but yes, you are
not the first person to want to use multiple IB devices on each node.

The Oak Ridge gang solved this in a different way, using PAMI
directives:
https://dl.acm.org/doi/10.1145/3295500.3356166

The libfabric equivalent would be "multi rail", I think, but I haven't
been able to construct a correct FI_OFI_MRAIL_ADDR environment variable
describing the IB ports. Maybe it's easier to describe the ports on
your cluster than it was for me on Summit.

https://ofiwg.github.io/libfabric/master/man/fi_mrail.7.html
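For what it's worth, the kind of value I have been experimenting with looks like the following (the interface names are made up, and I have not gotten this form to actually work):

export FI_OFI_MRAIL_ADDR=ib0,ib1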

==rob


Tuning problem

Zhu, Minming
 

Hi guys,

      I have some questions about DAOS performance tuning.

  1. About benchmarking DAOS: with the latest ior_hpc pulled on the daos branch, an exception occurs when running it. The exception information is as follows.

Ior command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root -np 1 --host vsr139 -x CRT_PHY_ADDR_STR=ofi+psm2 -x FI_PSM2_DISCONNECT=1 -x OFI_INTERFACE=ib0 --mca mtl ^psm2,ofi  /home/spark/cluster/ior_hpc/bin/ior

No OpenFabrics connection schemes reported that they were able to be

used on a specific port.  As such, the openib BTL (OpenFabrics

support) will be disabled for this port.

 

Local host:           vsr139

Local device:         mlx5_1

Local port:           1

CPCs attempted:       rdmacm, udcm

--------------------------------------------------------------------------

[vsr139:164808:0:164808]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f1e3d824907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f1e3df7cf40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f1e3df80a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f1e3df80d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f1e3df6e41d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f1e3e1b35ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f1e3e1b4182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f1e3e3e4ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f1e3e3e68b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f1e52b93118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f1e52b28fb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f1e52b5346b]

12  /home/spark/cluster/ior_hpc/bin/ior() [0x40d39b]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f1e52512505]

14  /home/spark/cluster/ior_hpc/bin/ior() [0x40313e]

 

Normally this is the case :

 

IOR-3.3.0+dev: MPI Coordinated Test of Parallel I/O

Began               : Tue Feb 18 10:08:58 2020

Command line        : ior

Machine             : Linux boro-9.boro.hpdd.intel.com

TestID              : 0

StartTime           : Tue Feb 18 10:08:58 2020

Path                : /home/minmingz

FS                  : 3.8 TiB   Used FS: 43.3%   Inodes: 250.0 Mi   Used Inodes: 6.3%

 

Options:

api                 : POSIX

apiVersion          :

test filename       : testFile

access              : single-shared-file

type                : independent

segments            : 1

ordering in a file  : sequential

ordering inter file : no tasks offsets

tasks               : 1

clients per node    : 1

repetitions         : 1

xfersize            : 262144 bytes

blocksize           : 1 MiB

aggregate filesize  : 1 MiB

 

Results:

 

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter

------    ---------  ---------- ---------  --------   --------   --------   --------   ----

write     89.17      1024.00    256.00     0.000321   0.000916   0.009976   0.011214   0

read      1351.38    1024.00    256.00     0.000278   0.000269   0.000193   0.000740   0

remove    -          -          -          -          -          -          0.000643   0

Max Write: 89.17 MiB/sec (93.50 MB/sec)

Max Read:  1351.38 MiB/sec (1417.02 MB/sec)

 

Summary of all tests:

Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum

write          89.17      89.17      89.17       0.00     356.68     356.68     356.68       0.00    0.01121     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

read         1351.38    1351.38    1351.38       0.00    5405.52    5405.52    5405.52       0.00    0.00074     0      1   1    1   0     0        1         0    0      1  1048576   262144       1.0 POSIX      0

Finished            : Tue Feb 18 10:08:58 2020

 

2. About network performance: on the latest DAOS code pulled (commit: 22ea193249741d40d24bc41bffef9dbcdedf3d41), an exception occurred while executing self_test. The exception information is as follows.

      Command : /usr/lib64/openmpi3/bin/orterun --allow-run-as-root  --mca btl self,tcp -N 1 --host vsr139 --output-filename testLogs/ -x D_LOG_FILE=testLogs/test_group_srv.log -x D_LOG_FILE_APPEND_PID=1 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR=ofi+psm2  -x OFI_INTERFACE=ib0 -x CRT_CTX_SHARE_ADDR=0 -x CRT_CTX_NUM=16 crt_launch -e tests/test_group_np_srv --name self_test_srv_grp --cfg_path=.

vsr139:176276:0:176276]    ud_iface.c:307  Assertion `qp_init_attr.cap.max_inline_data >= UCT_UD_MIN_INLINE' failed

==== backtrace ====

0  /lib64/libucs.so.0(ucs_fatal_error+0xf7) [0x7f576a8f1907]

1  /lib64/libuct.so.0(uct_ud_iface_cep_cleanup+0) [0x7f576ad3ef40]

2  /lib64/libuct.so.0(+0x28a05) [0x7f576ad42a05]

3  /lib64/libuct.so.0(+0x28d6a) [0x7f576ad42d6a]

4  /lib64/libuct.so.0(uct_iface_open+0xdd) [0x7f576ad3041d]

5  /lib64/libucp.so.0(ucp_worker_iface_init+0x22e) [0x7f576af755ee]

6  /lib64/libucp.so.0(ucp_worker_create+0x3f2) [0x7f576af76182]

7  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0x95) [0x7f576b1a6ac5]

8  /usr/lib64/openmpi3/lib/openmpi/mca_pml_ucx.so(+0x78b9) [0x7f576b1a88b9]

9  /usr/lib64/openmpi3/lib/libmpi.so.40(mca_pml_base_select+0x1d8) [0x7f5780f59118]

10  /usr/lib64/openmpi3/lib/libmpi.so.40(ompi_mpi_init+0x6f9) [0x7f5780eeefb9]

11  /usr/lib64/openmpi3/lib/libmpi.so.40(MPI_Init+0xbb) [0x7f5780f1946b]

12  crt_launch() [0x40130c]

13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f577fa99505]

14  crt_launch() [0x401dcf]

 

 

Please help solve the problem.

 

Regards,

Minmingz


Re: Unable to run DAOS commands - Agent reports "no dRPC client set"

Macdonald, Mjmac
 

Hi Patrick.

 

In this case the BIO whitelists are generated internally from the per-ioserver bdev lists, independent of the bdev_include config parameter. You do raise a good point, though – we should take a look to see what would/should happen if the whitelist supplied from the server config is wrong.

 

mjmac

 

From: Patrick Farrell <paf@...>
Sent: Tuesday, 18 February, 2020 09:49
To: Macdonald, Mjmac <mjmac.macdonald@...>; daos@daos.groups.io
Subject: Re: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"

 

mjmac,

 

Ah, that has it working again.  Thanks much for the pointer.

 

Just out of curiosity, was any thought given to making this a reported failure?  I see Niu's patch just corrects the misapplication.

 

It seems like an error in entering the whitelist (if I'm understanding correctly, perhaps the parameter is generated) is far from impossible, and the failure I experienced was silent on the server side.

 

I am not entirely clear on how the problem manifests itself - if the data plane truly doesn't start in this case, or if it fails when trying to access the device to actually do something when prompted by a client, or if there is some other issue - but this seems like a condition that would be worth reporting in some way (unless it really is an internal sort of failure, which can only realistically occur due to applying the whitelist to the wrong kind of device rather than user config).

 

- Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Macdonald, Mjmac <mjmac.macdonald@...>
Sent: Tuesday, February 18, 2020 8:32 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Unable to run DAOS commands - Agent reports "no dRPC client set"

 

Hi Patrick.

A commit (18d31d) just landed to master this morning that will probably fix that issue. As part of the work you referenced, a new whitelist parameter is being used to ensure that each ioserver only has access to the devices specified in the configuration. Unfortunately, this doesn't work with emulated devices, so the fix is to avoid using the whitelist except with real devices.

Sorry about that, hope this helps.

Best,
mjmac


Re: Unable to run DAOS commands - Agent reports "no dRPC client set"

Macdonald, Mjmac
 

Hi Patrick.

A commit (18d31d) just landed to master this morning that will probably fix that issue. As part of the work you referenced, a new whitelist parameter is being used to ensure that each ioserver only has access to the devices specified in the configuration. Unfortunately, this doesn't work with emulated devices, so the fix is to avoid using the whitelist except with real devices.
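Roughly, in server-config terms (the values below are illustrative only, not taken from your config):

  # emulated block storage, e.g. what you are using; the whitelist is now skipped for these classes
  bdev_class: kdev
  bdev_list: [/dev/sdl1]

  # real NVMe devices; the whitelist is generated from the PCI addresses listed
  bdev_class: nvme
  bdev_list: ["0000:81:00.0"]    # hypothetical PCI address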

Sorry about that, hope this helps.

Best,
mjmac