Error on simple test on POSIX container

Yunjae Lee
 

Hi,

I created a POSIX container and mounted it at /mnt/dfuse on the client node,
and ran the following command:
```
# echo "foo" > /mnt/dfuse/bar
# cat /mnt/dfuse/bar
```

But it gives me the following error, repeated endlessly:
object ERR src/object/cli_shard.c:631 dc_rw_cb() rpc 0x7ffa3801d6e0 opc 1 to rank 0 tag 7 failed: DER_HG(-1020): 'Transport layer mercury error'
OS: Ubuntu 20.04
Network: InfiniBand with MOFED 5.0-2
DAOS version: c20c47 (commit at 2020-11-28)


DUG'20 slides are available!

Lombardi, Johann
 

Hi there,

 

I have posted all the DUG presentations on the wiki (see https://wiki.hpdd.intel.com/display/DC/DUG20).

We need some more time for the video recordings, which will be published on our YouTube channel (https://www.youtube.com/channel/UCVP4e_UTnSJg15Cm80UtNwg) when ready.

 

Cheers,

Johann



Re: DUG'20 Agenda Online

Carrier, John
 

Note that the time for the DUG listed in the SC2020 schedule is not correct. Please use the WebEx info below.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Lombardi, Johann
Sent: Tuesday, November 17, 2020 11:29 PM
To: daos@daos.groups.io
Subject: Re: [daos] DUG'20 Agenda Online

 



Re: DUG'20 Agenda Online

Lombardi, Johann
 

Just a reminder that the DUG’20 is tomorrow.

 

Hope to see you there!

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "Lombardi, Johann" <johann.lombardi@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday 15 October 2020 at 09:51
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] DUG'20 Agenda Online

 

Hi there,

 

Please note that the agenda for the 4th annual DAOS User Group meeting is now available online:

https://wiki.hpdd.intel.com/display/DC/DUG20

 

I am very excited by the diversity and number of presentations this year. A big thank you to all the presenters.

 

As a reminder, the DUG is virtual this year:

- On Nov 19

- Starts at 7:30am Pacific / 8:30am Mountain / 9:30am Central / 4:30pm CET / 11:30pm China

- 3.5 hours of live presentations

- Please see the instructions on how to join in the WebEx invite

 

We also encourage everyone to join the #community slack channel for side discussions between attendees/presenters after the event.

 

Hope to see you there!

 

Best regards,

Johann

 



Re: Install problem

fhoa@...
 

Did you find a workaround for this problem? I am experiencing the same problem when trying to set up on an Ubuntu 20.04.1 OS.

Commands I tried to run:

$ git clone https://github.com/daos-stack/daos
$ docker build --no-cache -t daos -f utils/docker/Dockerfile.ubuntu.20.04 --build-arg NOBUILD=1 .
$ docker run -it -d --privileged --name server -v ${daospath}:/home/daos/daos:Z -v /dev/hugepages:/dev/hugepages daos
$ docker exec server scons --build-deps=yes install PREFIX=/usr

This last command fails with a similar error to the one above, namely:

"
gcc -o build/dev/gcc/src/tests/security/acl_dump_test -Wl,-rpath-link=build/dev/gcc/src/gurt -Wl,-rpath-link=build/dev/gcc/src/cart -Wl,--enable-new-dtags -Wl,-rpath-link=/home/daos/daos/build/dev/gcc/src/gurt -Wl,-rpath-link=/usr/prereq/dev/pmdk/lib -Wl,-rpath-link=/usr/prereq/dev/isal/lib -Wl,-rpath-link=/usr/prereq/dev/isal_crypto/lib -Wl,-rpath-link=/usr/prereq/dev/argobots/lib -Wl,-rpath-link=/usr/prereq/dev/protobufc/lib -Wl,-rpath-link=/usr/lib64 -Wl,-rpath=/usr/lib -Wl,-rpath=\$ORIGIN/../../home/daos/daos/build/dev/gcc/src/gurt -Wl,-rpath=\$ORIGIN/../prereq/dev/pmdk/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/isal/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/isal_crypto/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/argobots/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/protobufc/lib -Wl,-rpath=\$ORIGIN/../lib64 build/dev/gcc/src/tests/security/acl_dump_test.o -Lbuild/dev/gcc/src/gurt -Lbuild/dev/gcc/src/cart/swim -Lbuild/dev/gcc/src/cart -Lbuild/dev/gcc/src/common -L/usr/prereq/dev/pmdk/lib -L/usr/prereq/dev/isal/lib -L/usr/prereq/dev/isal_crypto/lib -Lbuild/dev/gcc/src/bio -Lbuild/dev/gcc/src/bio/smd -Lbuild/dev/gcc/src/vea -Lbuild/dev/gcc/src/vos -Lbuild/dev/gcc/src/mgmt -Lbuild/dev/gcc/src/pool -Lbuild/dev/gcc/src/container -Lbuild/dev/gcc/src/placement -Lbuild/dev/gcc/src/dtx -Lbuild/dev/gcc/src/object -Lbuild/dev/gcc/src/rebuild -Lbuild/dev/gcc/src/security -Lbuild/dev/gcc/src/client/api -Lbuild/dev/gcc/src/control -L/usr/prereq/dev/argobots/lib -L/usr/prereq/dev/protobufc/lib -lpmemobj -lisal -lisal_crypto -labt -lprotobuf-c -lhwloc -ldaos -ldaos_common -lgurt

/usr/bin/ld: warning: libna.so.2, needed by /usr/prereq/dev/mercury/lib/libmercury.so.2, not found (try using -rpath or -rpath-link)

/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Error_to_string'

/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Addr_free'

/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Mem_handle_create_segments'

/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Op_create'

/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Mem_handle_free'

[...]

"



Re: Install problem

nicolau.manubens@...
 

Thanks for your help.

I have tried the Ubuntu and Leap dockerfiles too. Leap worked fine. The Ubuntu one failed with a similar error when compiling acl_dump_test. I leave a snippet of the error below.

Although I can continue with the Leap one for now, it would still be good to have the CentOS one working for tests, as our final DAOS system will be deployed on machines running CentOS.

Nicolau


/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Error_to_string'

/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Addr_free'

/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Mem_handle_create_segments'

/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Op_create'

/usr/bin/ld: /usr/prereq/dev/mercury/lib/libmercury.so.2: undefined reference to `NA_Mem_handle_free'

[...]


Re: Install problem

Olivier, Jeffrey V
 

The logic in utils/sl for scons should automatically detect that the installed libfabric version is not suitable and build a suitable version. I’m trying it locally to see what is going on.

 

-Jeff

 

From: <daos@daos.groups.io> on behalf of "maureen.jean@..." <maureen.jean@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday, November 11, 2020 at 8:14 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Install problem

 

Yes, you need a later version of libfabric, preferably 1.11. But at a minimum you need a libfabric that supports ABI 1.3 (FABRIC 1.3).


Re: Install problem

maureen.jean@...
 

Yes, you need a later version of libfabric, preferably 1.11. But at a minimum you need a libfabric that supports ABI 1.3 (FABRIC 1.3).


Re: Install problem

nicolau.manubens@...
 

The Dockerfile I am taking from master installs libfabric 1.7 in the image. Should I modify the scons script in order to replace the libfabric version?


Re: Install problem

maureen.jean@...
 

What version of libfabric are you using? Try using libfabric >= 1.11.

/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_dupinfo@...'

/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_freeinfo@...'

/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_getinfo@...'
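
A quick way to confirm which libfabric version is actually installed (a sketch; it assumes the fi_info utility and/or pkg-config are available on the build host or inside the image):

```
# Check the installed libfabric version (sketch; tools assumed to be present)
fi_info --version                    # reports the libfabric library version
pkg-config --modversion libfabric    # alternative check, if the .pc file is installed
```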


Re: Install problem

nicolau.manubens@...
 

Hello,

I am also finding a similar error when trying to build the DAOS Docker image.

wget https://raw.githubusercontent.com/daos-stack/daos/master/utils/docker/Dockerfile.centos.7

docker build --no-cache -t daos -f ./Dockerfile.centos.7 .

The Dockerfile is being pulled from the master branch, i.e. commit 4cbb16cf8edc9ddf5c7503b4448bf897c8331ea3.

The output follows:

gcc -o build/dev/gcc/src/tests/security/acl_dump_test -Wl,-rpath-link=build/dev/gcc/src/gurt -Wl,-rpath-link=build/dev/gcc/src/cart -Wl,--enable-new-dtags -Wl,-rpath-link=/home/daos/daos/build/dev/gcc/src/gurt -Wl,-rpath-link=/usr/prereq/dev/pmdk/lib -Wl,-rpath-link=/usr/prereq/dev/isal/lib -Wl,-rpath-link=/usr/prereq/dev/isal_crypto/lib -Wl,-rpath-link=/usr/prereq/dev/argobots/lib -Wl,-rpath-link=/usr/prereq/dev/protobufc/lib -Wl,-rpath-link=/usr/lib64 -Wl,-rpath=/usr/lib -Wl,-rpath=\$ORIGIN/../../home/daos/daos/build/dev/gcc/src/gurt -Wl,-rpath=\$ORIGIN/../prereq/dev/pmdk/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/isal/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/isal_crypto/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/argobots/lib -Wl,-rpath=\$ORIGIN/../prereq/dev/protobufc/lib -Wl,-rpath=\$ORIGIN/../lib64 build/dev/gcc/src/tests/security/acl_dump_test.o -Lbuild/dev/gcc/src/gurt -Lbuild/dev/gcc/src/cart/swim -Lbuild/dev/gcc/src/cart -Lbuild/dev/gcc/src/common -L/usr/prereq/dev/pmdk/lib -L/usr/prereq/dev/isal/lib -L/usr/prereq/dev/isal_crypto/lib -Lbuild/dev/gcc/src/bio -Lbuild/dev/gcc/src/bio/smd -Lbuild/dev/gcc/src/vea -Lbuild/dev/gcc/src/vos -Lbuild/dev/gcc/src/mgmt -Lbuild/dev/gcc/src/pool -Lbuild/dev/gcc/src/container -Lbuild/dev/gcc/src/placement -Lbuild/dev/gcc/src/dtx -Lbuild/dev/gcc/src/object -Lbuild/dev/gcc/src/rebuild -Lbuild/dev/gcc/src/security -Lbuild/dev/gcc/src/client/api -Lbuild/dev/gcc/src/control -L/usr/prereq/dev/argobots/lib -L/usr/prereq/dev/protobufc/lib -lpmemobj -lisal -lisal_crypto -labt -lprotobuf-c -lhwloc -ldaos -ldaos_common -lgurt

/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_dupinfo@...'

/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_freeinfo@...'

/usr/prereq/dev/mercury/lib/libna.so.2: undefined reference to `fi_getinfo@...'

collect2: error: ld returned 1 exit status

scons: *** [build/dev/gcc/src/tests/security/acl_dump_test] Error 1

scons: building terminated because of errors.

The command '/bin/sh -c if [ "x$NOBUILD" = "x" ] ; then scons --build-deps=yes install PREFIX=/usr; fi' returned a non-zero code: 2

 

I have also tried pulling the version right after the pull request was merged, and building, with no success:


git clone https://github.com/daos-stack/daos/

cd daos

git checkout 5c887623f0013241d27b8daad1813a3444abf718

cd utils/docker

docker build --no-cache -t daos -f ./Dockerfile.centos.7 .

[...]

Step 28/34 : RUN if [ "x$NOBUILD" = "x" ] ; then scons --build-deps=yes install PREFIX=/usr; fi

 ---> Running in 6fa6d1725ecd

scons: Reading SConscript files ...

ImportError: No module named distro:

  File "/home/daos/daos/SConstruct", line 16:

    import daos_build

  File "/home/daos/daos/utils/daos_build.py", line 4:

    from env_modules import load_mpi

  File "/home/daos/daos/site_scons/env_modules.py", line 27:

    import distro
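
For what it is worth, that last ImportError just means the Python 'distro' package is missing from the image; a minimal fix, assuming pip is available for the interpreter that runs scons, would be:

```
# Install the missing 'distro' module for the Python that runs scons (sketch)
pip install distro        # or: python3 -m pip install distro
```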



Please let me know if you have any further hints.

 

Regards,

Nicolau


Tutorial videos

Lombardi, Johann
 

Hi there,

 

FYI, we have posted several tutorial videos on the newly created DAOS YouTube channel (see https://www.youtube.com/channel/UCVP4e_UTnSJg15Cm80UtNwg). This includes how to install RPMs, configure a DAOS system with the control plane, and get started with the DAOS File System layer. I would like to thank all the DAOS engineers who contributed to those videos!

 

We also had a lot of pre-existing YouTube videos that are now gathered under a single playlist, see https://www.youtube.com/playlist?list=PLkLsgO4eC8RKLeRi50e-HE88ezcp3hg48.

 

Hope this helps.

Johann

 

ps: DUG’20 is approaching! Check https://wiki.hpdd.intel.com/display/DC/DUG20 for more information.

 



update ipmctl

Nabarro, Tom
 

We have recently learned of a serious issue affecting systems with Apache Pass DCPM modules. The ipmctl utility (versions 02.00.00.3809 through 02.00.00.3816) was released with a bug that can cause the modules to be put into a state that requires a physical procedure to recover the modules. In order to avoid this issue, please ensure that you are running version v02.00.00.3820 or later (when running DAOS to prepare persistent memory modules).

More details…

When running DAOS and executing the dmg storage prepare command to set persistent memory modules into AppDirect mode for use with DAOS, ipmctl, which is forked at runtime, attempts to set a PMem configuration goal and fails if one of the 02.00.00.3809-3816 versions of ipmctl is installed. The failure is due to a corruption of the PCD on the DIMMs and results in the system POST memory checks failing with a fatal error.

ipmctl was fixed in a commit released in version v02.00.00.3820.

That situation then requires recovery of persistent memory modules.

Prevention…

After installing DAOS and before running "dmg|daos_server storage prepare", please upgrade ipmctl to the most recent distro-provided version with your preferred package manager, e.g. sudo yum update ipmctl.
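
A minimal check-and-update sequence, assuming an RPM-based distribution where ipmctl is packaged, would look like:

```
# Confirm the installed ipmctl version before preparing persistent memory
ipmctl version
# Upgrade to a fixed release (v02.00.00.3820 or later) if necessary
sudo yum update ipmctl
```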

 

Work is in progress to add the necessary version checks to DAOS.

 

Regards,

Tom Nabarro – DCG/ESAD

M: +44 (0)7786 260986

Skype: tom.nabarro

 



Re: DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

Ari
 

Hi Alex,

 

That’s good info, I’ll give it a go.  We were hoping to avoid customizing the stack in this particular environment unless required, and this definitely qualifies.

 

Thanks

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Oganezov, Alexander A
Sent: Monday, November 2, 2020 10:58
To: daos@daos.groups.io
Subject: Re: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

 



Re: DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

Oganezov, Alexander A
 

Hi Ari,

 

Is it possible for you to install a more recent MOFED on your system? In the past we’ve had issues with MOFEDs older than 4.7; locally we’ve been using MOFED 5.0.2.

 

Thanks,

~~Alex.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Ari
Sent: Thursday, October 22, 2020 8:34 AM
To: daos@daos.groups.io
Subject: Re: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

 



Re: DAOS with NVMe-over-Fabrics

anton.brekhov@...
 

Thanks, Tom!

Unfortunately I couldn't reproduce this behaviour on v1.0.1 with bdev_exclude.

Also, we've checked the NVMe-oF initiator on pure SPDK and it connects to the remote target. So daos_nvme.conf really can't be used to connect NVMe-oF.


Re: DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

Ari
 

Thanks for the feedback, and agreed. The workaround was to make sure the sysctl settings were applied for the same logical subnet; I tried them in a few combinations (output is in the previous email). The other errors I saw were memory related, and I haven’t found a match for them.

 

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Farrell, Patrick Arthur
Sent: Thursday, October 22, 2020 10:23
To: daos@daos.groups.io
Subject: Re: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

 



Re: DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

Farrell, Patrick Arthur
 

Ari,

You say this looks different from the stuff in the spring, but unless you've noticed a difference in the specific error messages, this looks (at a high level) identical to me.  The issue in the spring was communication between two servers on the same node, where they were unable to route to each other.

Yours is a communication issue between the two servers, saying UNREACHABLE, which is presumably a routing failure.  You will have to reference those conversations for details - I have no expertise in the specific issues - but there appear to be IB routing issues here, just as then.

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Ari <ari.martinez@...>
Sent: Thursday, October 22, 2020 10:08 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: [daos] DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected
 

Hi,

 

We’ve been testing DAOS functionality in our lab and have had success running IOR with 8 clients, with and without POSIX containers, using a single IO instance and a single HCA within one server, on both versions 1.0.1 (RPMs) and 1.1.1 (built from source). When we create two IO instances within one server, one per CPU/HCA, communication errors arise just from trying to create a pool. The same problem occurs on two different physical servers with OFED 4.6.2 on RHEL 7.6 & 7.8 kernels. I saw a post around spring regarding communication issues with multiple IO instances, but this seems different. In addition to the snippets below, the server configuration and log files are attached. I am still trying out a couple more things, but wanted to reach out in the meantime.

 

Any ideas as to debugging this problem with multirail or any gotchas encountered?

 

 

The workflow is the same as in previous successful tests using this server with one IO instance; the same server yml file works fine if I comment out either of the two IO server stanzas.

I performed wipefs, prepare, and format for good measure after all of the tests above, and have applied the pre-deployment sysctl ARP settings for multirail with two HCAs in the same subnet.

 

dac007$ dmg -i system query

Rank  State

----  -----

[0-1] Joined

 

dac007$ dmg -i pool create --scm-size=600G --nvme-size=6TB

Creating DAOS pool with 600 GB SCM and 6.0 TB NVMe storage (10.00 % ratio)

Pool-create command FAILED: pool create failed: DAOS error (-1006): DER_UNREACH

ERROR: dmg: pool create failed: DAOS error (-1006): DER_UNREACH

 

 

Network:

eth0 10.140.0.7/16

ib0 10.10.0.7/16

ib1 10.10.1.7/16

 

net.ipv4.conf.ib0.rp_filter = 2

net.ipv4.conf.ib1.rp_filter = 2

net.ipv4.conf.all.rp_filter = 2

net.ipv4.conf.ib0.accept_local = 1

net.ipv4.conf.ib1.accept_local = 1

net.ipv4.conf.all.accept_local = 1

net.ipv4.conf.ib0.arp_ignore = 2

net.ipv4.conf.ib1.arp_ignore = 2

net.ipv4.conf.all.arp_ignore = 2

 

 

LOGS:

dac007$ wc -l /tmp/daos_*.log

    71 /tmp/daos_control.log

  5970 /tmp/daos_server-core0.log

    35 /tmp/daos_server-core1.log

 

These errors repeat in one of the server.log

10/22-09:11:10.59 dac007 DAOS[194223/194230] vos  DBUG src/vos/vos_io.c:1035 akey_fetch() akey [8] ???? fetch single epr 5-16b

10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 574 connection has been terminated

10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=574

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # NA -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/na/na_ofi.c:3881

# na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -110 (Connection timed out)

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:1828

# hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR)

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:4238

# HG_Core_forward(): Could not forward buffer

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury.c:1933

# HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR)

10/22-09:11:12.59 dac007 DAOS[194223/194230] hg   ERR  src/cart/crt_hg.c:1083 crt_hg_req_send(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] HG_Forward failed, hg_ret: 12

10/22-09:11:12.59 dac007 DAOS[194223/194230] rdb  WARN src/rdb/rdb_raft.c:1980 rdb_timerd() 64616f73[0]: not scheduled for 1.280968 second

10/22-09:11:12.59 dac007 DAOS[194223/194230] daos INFO src/iosrv/drpc_progress.c:295 drpc_handler_ult() dRPC handler ULT for module=2 method=209

10/22-09:11:12.59 dac007 DAOS[194223/194230] mgmt INFO src/mgmt/srv_drpc.c:1989 ds_mgmt_drpc_set_up() Received request to setup server

10/22-09:11:12.59 dac007 DAOS[194223/194230] server INFO src/iosrv/init.c:388 dss_init_state_set() setting server init state to 1

10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:572 server_init() Modules successfully set up

10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:575 server_init() Service fully up

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:798 crt_context_timeout_check(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] ctx_id 0, (status: 0x3f) timed out, tgt rank 1, tag 0

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:744 crt_req_timeout_hdlr(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] failed due to group daos_server, rank 1, tgt_uri ofi+verbs;ofi_rxm://10.10.0.7:31417 can't reach the target

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] RPC failed; rc: -1006

10/22-09:11:12.59 dac007 DAOS[194223/194230] corpc ERR  src/cart/crt_corpc.c:646 crt_corpc_reply_hdlr() RPC(opc: 0x101000b) error, rc: -1006.

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac10185b0) [opc=0x101000b rpcid=0x5a136b7800000000 rank:tag=0:0] RPC failed; rc: -1006

10/22-09:11:12.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 575 connection has been terminated

 

control.log

Management Service access point started (bootstrapped)

daos_io_server:0 DAOS I/O server (v1.1.1) process 194223 started on rank 0 with 6 target, 0 helper XS, firstcore 0, host dac007.

Using NUMA node: 0

DEBUG 09:11:13.492263 member.go:297: adding system member: 10.140.0.7:10001/1/Joined

DEBUG 09:11:13.492670 mgmt_client.go:160: join(dac007:10001, {Uuid:c611d86e-dcaf-44c1-86f0-5d275cf941a3 Rank:1 Uri:ofi+verbs;ofi_rxm://10.10.0.7:31417 Nctxs:7 Addr:0.0.0.0:10001 SrvFaultDomain:/dac007 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}) end

daos_io_server:1 DAOS I/O server (v1.1.1) process 194226 started on rank 1 with 6 target, 0 helper XS, firstcore 0, host dac007.

Using NUMA node: 1

DEBUG 09:13:55.157399 ctl_system.go:187: Received SystemQuery RPC

DEBUG 09:13:55.157863 system.go:645: DAOS system ping-ranks request: &{unaryRequest:{request:{HostList:[10.140.0.7:10001]} rpc:0xbb05e0} Ranks:0-1 Force:false}

DEBUG 09:13:55.158148 rpc.go:183: request hosts: [10.140.0.7:10001]

DEBUG 09:13:55.160340 mgmt_system.go:308: MgmtSvc.PingRanks dispatch, req:{Force:false Ranks:0-1 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

DEBUG 09:13:55.161685 mgmt_system.go:320: MgmtSvc.PingRanks dispatch, resp:{Results:[state:3  rank:1 state:3 ] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

DEBUG 09:13:55.163257 ctl_system.go:213: Responding to SystemQuery RPC: members:<addr:"10.140.0.7:10001" uuid:"226f2871-06d9-4616-9d08-d386992a059b" state:4 > members:<addr:"10.140.0.7:10001" uuid:"c611d86e-dcaf-44c1-86f0-5d275cf941a3" rank:1 state:4 >

DEBUG 09:14:05.417780 mgmt_pool.go:56: MgmtSvc.PoolCreate dispatch, req:{Scmbytes:600000000000 Nvmebytes:6000000000000 Ranks:[] Numsvcreps:1 User:root@ Usergroup:root@ Uuid:de34a760-9db6-4c2f-9709-7592ce880b49 Sys:daos_server Acl:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

DEBUG 09:14:08.282190 mgmt_pool.go:84: MgmtSvc.PoolCreate dispatch resp:{Status:-1006 Svcreps:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

 

 


DAOS 1.1.1 & Multirail - na_ofi_msg_send_unexpected

Ari
 

Hi,

 

We’ve been testing DAOS functionality in our lab and have had success running IOR with 8 clients, with and without POSIX containers, using a single IO instance and a single HCA within one server, on both versions 1.0.1 (RPMs) and 1.1.1 (built from source). When we create two IO instances within one server, one per CPU/HCA, communication errors arise just from trying to create a pool. The same problem occurs on two different physical servers with OFED 4.6.2 on RHEL 7.6 & 7.8 kernels. I saw a post around spring regarding communication issues with multiple IO instances, but this seems different. In addition to the snippets below, the server configuration and log files are attached. I am still trying out a couple more things, but wanted to reach out in the meantime.

 

Any ideas as to debugging this problem with multirail or any gotchas encountered?

 

 

The workflow is the same as in previous successful tests using this server with one IO instance; the same server yml file works fine if I comment out either of the two IO server stanzas.

I performed wipefs, prepare, and format for good measure after all of the tests above, and have applied the pre-deployment sysctl ARP settings for multirail with two HCAs in the same subnet.

 

dac007$ dmg -i system query

Rank  State

----  -----

[0-1] Joined

 

dac007$ dmg -i pool create --scm-size=600G --nvme-size=6TB

Creating DAOS pool with 600 GB SCM and 6.0 TB NVMe storage (10.00 % ratio)

Pool-create command FAILED: pool create failed: DAOS error (-1006): DER_UNREACH

ERROR: dmg: pool create failed: DAOS error (-1006): DER_UNREACH

 

 

Network:

eth0 10.140.0.7/16

ib0 10.10.0.7/16

ib1 10.10.1.7/16

 

net.ipv4.conf.ib0.rp_filter = 2

net.ipv4.conf.ib1.rp_filter = 2

net.ipv4.conf.all.rp_filter = 2

net.ipv4.conf.ib0.accept_local = 1

net.ipv4.conf.ib1.accept_local = 1

net.ipv4.conf.all.accept_local = 1

net.ipv4.conf.ib0.arp_ignore = 2

net.ipv4.conf.ib1.arp_ignore = 2

net.ipv4.conf.all.arp_ignore = 2
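
For reference, one way to make these settings persistent across reboots is a sysctl drop-in file (a sketch only; the file name is arbitrary, and the per-interface ib0/ib1 variants above would be listed the same way):

```
# /etc/sysctl.d/90-daos-multirail.conf (sketch; not from the original post)
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.all.accept_local = 1
net.ipv4.conf.all.arp_ignore = 2

# Reload all sysctl configuration without a reboot:
#   sysctl --system
```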

 

 

LOGS:

dac007$ wc -l /tmp/daos_*.log

    71 /tmp/daos_control.log

  5970 /tmp/daos_server-core0.log

    35 /tmp/daos_server-core1.log

 

These errors repeat in one of the server.log

10/22-09:11:10.59 dac007 DAOS[194223/194230] vos  DBUG src/vos/vos_io.c:1035 akey_fetch() akey [8] ???? fetch single epr 5-16b

10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 574 connection has been terminated

10/22-09:11:10.59 dac007 DAOS[194223/194272] daos INFO src/common/drpc.c:717 drpc_close() Closing dRPC socket fd=574

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # NA -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/na/na_ofi.c:3881

# na_ofi_msg_send_unexpected(): fi_tsend() unexpected failed, rc: -110 (Connection timed out)

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:1828

# hg_core_forward_na(): Could not post send for input buffer (NA_PROTOCOL_ERROR)

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury_core.c:4238

# HG_Core_forward(): Could not forward buffer

10/22-09:11:12.59 dac007 DAOS[194223/194230] external ERR  # HG -- Error -- /home/ari/DAOS/git/daosv111/build/external/dev/mercury/src/mercury.c:1933

# HG_Forward(): Could not forward call (HG_PROTOCOL_ERROR)

10/22-09:11:12.59 dac007 DAOS[194223/194230] hg   ERR  src/cart/crt_hg.c:1083 crt_hg_req_send(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] HG_Forward failed, hg_ret: 12

10/22-09:11:12.59 dac007 DAOS[194223/194230] rdb  WARN src/rdb/rdb_raft.c:1980 rdb_timerd() 64616f73[0]: not scheduled for 1.280968 second

10/22-09:11:12.59 dac007 DAOS[194223/194230] daos INFO src/iosrv/drpc_progress.c:295 drpc_handler_ult() dRPC handler ULT for module=2 method=209

10/22-09:11:12.59 dac007 DAOS[194223/194230] mgmt INFO src/mgmt/srv_drpc.c:1989 ds_mgmt_drpc_set_up() Received request to setup server

10/22-09:11:12.59 dac007 DAOS[194223/194230] server INFO src/iosrv/init.c:388 dss_init_state_set() setting server init state to 1

10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:572 server_init() Modules successfully set up

10/22-09:11:12.59 dac007 DAOS[194223/194223] server INFO src/iosrv/init.c:575 server_init() Service fully up

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:798 crt_context_timeout_check(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] ctx_id 0, (status: 0x3f) timed out, tgt rank 1, tag 0

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:744 crt_req_timeout_hdlr(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] failed due to group daos_server, rank 1, tgt_uri ofi+verbs;ofi_rxm://10.10.0.7:31417 can't reach the target

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac1021180) [opc=0x101000b rpcid=0x5a136b7800000001 rank:tag=1:0] RPC failed; rc: -1006

10/22-09:11:12.59 dac007 DAOS[194223/194230] corpc ERR  src/cart/crt_corpc.c:646 crt_corpc_reply_hdlr() RPC(opc: 0x101000b) error, rc: -1006.

10/22-09:11:12.59 dac007 DAOS[194223/194230] rpc  ERR  src/cart/crt_context.c:297 crt_rpc_complete(0x2aaac10185b0) [opc=0x101000b rpcid=0x5a136b7800000000 rank:tag=0:0] RPC failed; rc: -1006

10/22-09:11:12.59 dac007 DAOS[194223/194272] daos INFO src/iosrv/drpc_progress.c:409 process_session_activity() Session 575 connection has been terminated

 

control.log

Management Service access point started (bootstrapped)

daos_io_server:0 DAOS I/O server (v1.1.1) process 194223 started on rank 0 with 6 target, 0 helper XS, firstcore 0, host dac007.

Using NUMA node: 0

DEBUG 09:11:13.492263 member.go:297: adding system member: 10.140.0.7:10001/1/Joined

DEBUG 09:11:13.492670 mgmt_client.go:160: join(dac007:10001, {Uuid:c611d86e-dcaf-44c1-86f0-5d275cf941a3 Rank:1 Uri:ofi+verbs;ofi_rxm://10.10.0.7:31417 Nctxs:7 Addr:0.0.0.0:10001 SrvFaultDomain:/dac007 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}) end

daos_io_server:1 DAOS I/O server (v1.1.1) process 194226 started on rank 1 with 6 target, 0 helper XS, firstcore 0, host dac007.

Using NUMA node: 1

DEBUG 09:13:55.157399 ctl_system.go:187: Received SystemQuery RPC

DEBUG 09:13:55.157863 system.go:645: DAOS system ping-ranks request: &{unaryRequest:{request:{HostList:[10.140.0.7:10001]} rpc:0xbb05e0} Ranks:0-1 Force:false}

DEBUG 09:13:55.158148 rpc.go:183: request hosts: [10.140.0.7:10001]

DEBUG 09:13:55.160340 mgmt_system.go:308: MgmtSvc.PingRanks dispatch, req:{Force:false Ranks:0-1 XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

DEBUG 09:13:55.161685 mgmt_system.go:320: MgmtSvc.PingRanks dispatch, resp:{Results:[state:3  rank:1 state:3 ] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

DEBUG 09:13:55.163257 ctl_system.go:213: Responding to SystemQuery RPC: members:<addr:"10.140.0.7:10001" uuid:"226f2871-06d9-4616-9d08-d386992a059b" state:4 > members:<addr:"10.140.0.7:10001" uuid:"c611d86e-dcaf-44c1-86f0-5d275cf941a3" rank:1 state:4 >

DEBUG 09:14:05.417780 mgmt_pool.go:56: MgmtSvc.PoolCreate dispatch, req:{Scmbytes:600000000000 Nvmebytes:6000000000000 Ranks:[] Numsvcreps:1 User:root@ Usergroup:root@ Uuid:de34a760-9db6-4c2f-9709-7592ce880b49 Sys:daos_server Acl:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

DEBUG 09:14:08.282190 mgmt_pool.go:84: MgmtSvc.PoolCreate dispatch resp:{Status:-1006 Svcreps:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

 

 


Re: DAOS with NVMe-over-Fabrics

Nabarro, Tom
 

I realise that the problem is that the bdev_exclude devices are already bound to UIO at the time DAOS starts, and they are not actively unbound by the SPDK setup script.

0000:00:04.1 (8086 2021): Already using the uio_pci_generic driver

A simple workaround is to run daos_server storage prepare -n --reset before starting and then rerun; this releases all the devices back to the system, so bdev_exclude will be honoured the next time daos_server start ... is run.
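
For example (using the same paths as the invocation further down; adjust for a packaged install):

```
# Release any NVMe devices already bound to UIO/VFIO back to the kernel nvme driver
sudo install/bin/daos_server storage prepare -n --reset

# Start the server again; bdev_exclude is honoured when the devices are re-bound
sudo install/bin/daos_server start -o utils/config/examples/daos_server_sockets.yml -i
```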

 

A longer-term solution may be to run a prepare reset when initially starting the server, but I will have to discuss with the team whether this is the right approach.

 

I verified the behavior of the bdev_exclude server config file parameter (binaries built from current master, commit 6c50dbbc45431eea8c6eecf5dae74e5b88713f65).

 

Initial NVMe device listing:

 

tanabarr@wolf-151:~/projects/daos_m> ls -lah /dev/nv*

crw------- 1 root root 237,   0 Oct 20 14:58 /dev/nvme0

brw-rw---- 1 root disk 259, 105 Oct 20 14:58 /dev/nvme0n1

crw------- 1 root root 237,   1 Oct 20 14:58 /dev/nvme1

brw-rw---- 1 root disk 259, 113 Oct 20 14:58 /dev/nvme1n1

crw------- 1 root root 237,   2 Oct 20 14:58 /dev/nvme2

brw-rw---- 1 root disk 259, 121 Oct 20 14:58 /dev/nvme2n1

crw------- 1 root root 237,   3 Oct 20 14:58 /dev/nvme3

brw-rw---- 1 root disk 259, 123 Oct 20 14:58 /dev/nvme3n1

crw------- 1 root root 237,   4 Oct 19 13:16 /dev/nvme4

brw-rw---- 1 root disk 259, 115 Oct 19 13:16 /dev/nvme4n1

crw------- 1 root root 237,   5 Oct 20 14:58 /dev/nvme5

brw-rw---- 1 root disk 259, 125 Oct 20 14:58 /dev/nvme5n1

crw------- 1 root root 237,   6 Oct 19 13:16 /dev/nvme6

brw-rw---- 1 root disk 259, 119 Oct 19 13:16 /dev/nvme6n1

crw------- 1 root root 237,   7 Oct 20 14:58 /dev/nvme7

brw-rw---- 1 root disk 259, 127 Oct 20 14:58 /dev/nvme7n1

crw------- 1 root root  10, 144 Oct  7 22:48 /dev/nvram

 

Relevant server configuration file changes:

 

--- a/utils/config/examples/daos_server_sockets.yml

+++ b/utils/config/examples/daos_server_sockets.yml

@@ -8,6 +8,8 @@ socket_dir: /tmp/daos_sockets

nr_hugepages: 4096

control_log_mask: DEBUG

control_log_file: /tmp/daos_control.log

+helper_log_file: /tmp/daos_admin.log

+bdev_exclude: ["0000:e3:00.0", "0000:e7:00.0"]

 

Server invocation:

 

sudo install/bin/daos_server start -o utils/config/examples/daos_server_sockets.yml -i

 

Debug output in the helper log file (/tmp/daos_admin.log):

 

DEBUG 13:16:11.027148 runner.go:150: spdk setup env: [PATH=/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/usr/sbin _NRHUGE=128 _TARGET_USER=root _PCI_BLACKLIST=0000:e3:00.0 0000:e7:00.0]

DEBUG 13:16:11.027302 runner.go:80: running script: /usr/share/daos/control/setup_spdk.sh

DEBUG 13:16:31.264301 runner.go:152: spdk setup stdout:

start of script: /usr/share/daos/control/setup_spdk.sh

calling into script: /usr/share/daos/control/../../spdk/scripts/setup.sh

0000:65:00.0 (144d a824): nvme -> uio_pci_generic

0000:67:00.0 (144d a824): nvme -> uio_pci_generic

0000:69:00.0 (144d a824): nvme -> uio_pci_generic

0000:6b:00.0 (144d a824): nvme -> uio_pci_generic

0000:e3:00.0 (144d a824): Skipping un-whitelisted NVMe controller at 0000:e3:00.0

0000:e5:00.0 (144d a824): nvme -> uio_pci_generic

0000:e7:00.0 (144d a824): Skipping un-whitelisted NVMe controller at 0000:e7:00.0

0000:e9:00.0 (144d a824): nvme -> uio_pci_generic

RUN: ls -d /dev/hugepages | xargs -r chown -R root

RUN: ls -d /dev/uio* | xargs -r chown -R root

RUN: ls -d /sys/class/uio/uio*/device/config | xargs -r chown -R root

RUN: ls -d /sys/class/uio/uio*/device/resource* | xargs -r chown -R root

Setting VFIO file permissions for unprivileged access

RUN: chmod /dev/vfio

OK

RUN: chmod /dev/vfio/*

OK

 

DEBUG 13:16:31.527190 spdk.go:122: spdk init go opts: {MemSize:0 PciWhiteList:[] DisableVMD:true}

DEBUG 13:16:31.577836 spdk.go:136: spdk init c opts: &{name:0x7f7eacaf13f5 core_mask:0x7f7eacaf1569 shm_id:-1 mem_channel:-1 master_core:-1 mem_size:-1 no_pci:false hugepage_single_segments:false unlink_hugepage:false num_pci_addr:0 hugedir:<nil> pci_blacklist:<nil> pci_whitelist:<nil> env_context:0x33c3790}

DEBUG 13:16:31.665129 nvme.go:143: discovered nvme ssds: [0000:65:00.0 0000:67:00.0 0000:6b:00.0 0000:e5:00.0 0000:e9:00.0 0000:69:00.0]

 

Resultant NVMe device listing:

 

tanabarr@wolf-151:~/projects/daos_m> ls -lah /dev/nv*

crw------- 1 root root 237,   4 Oct 19 13:16 /dev/nvme4

brw-rw---- 1 root disk 259, 115 Oct 19 13:16 /dev/nvme4n1

crw------- 1 root root 237,   6 Oct 19 13:16 /dev/nvme6

brw-rw---- 1 root disk 259, 119 Oct 19 13:16 /dev/nvme6n1

crw------- 1 root root  10, 144 Oct  7 22:48 /dev/nvram

 

Regards,

Tom Nabarro – DCG/ESAD

M: +44 (0)7786 260986

Skype: tom.nabarro

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of anton.brekhov@...
Sent: Sunday, October 18, 2020 1:28 PM
To: daos@daos.groups.io
Subject: Re: [daos] DAOS with NVMe-over-Fabrics

 

Tom, thanks for the comments!

I've edited my previous message. I'll copy it below:

Tom, here is the active config file content:

name: daos_server

access_points: ['apache512']

#access_points: ['localhost']

port: 10001

#provider: ofi+sockets

provider: ofi+verbs;ofi_rxm

nr_hugepages: 4096

control_log_file: /tmp/daos_control.log

helper_log_file: /tmp/daos_admin.log
bdev_exclude: ["0000:b1:00.0","0000:b2:00.0","0000:b3:00.0","0000:b4:00.0"]

transport_config:

   allow_insecure: true

 

servers:

-

  targets: 4

  first_core: 0

  nr_xs_helpers: 0

  fabric_iface: ib0

  fabric_iface_port: 31416

  log_mask: ERR

  log_file: /tmp/daos_server.log

 

  env_vars:

  - DAOS_MD_CAP=1024

  - CRT_CTX_SHARE_ADDR=0

  - CRT_TIMEOUT=30

  - FI_SOCKETS_MAX_CONN_RETRY=1

  - FI_SOCKETS_CONN_TIMEOUT=2000

  #- OFI_INTERFACE=ib0

  #- OFI_DOMAIN=mlx5_0

  #- CRT_PHY_ADDR_STR=ofi+verbs;ofi_rxm

 

  # Storage definitions

 

  # When scm_class is set to ram, tmpfs will be used to emulate SCM.

  # The size of ram is specified by scm_size in GB units.

  scm_mount: /mnt/daos  # map to -s /mnt/daos

  #scm_class: ram

  #scm_size: 8

  scm_class: dcpm

  scm_list: [/dev/pmem0]

 

  bdev_class: nvme

  #bdev_list: ["0000:b1:00.0","0000:b2:00.0"]

 

Before starting server:

[root@apache512 ~]# nvme list

Node             SN                   Model                                    Namespace Usage                      Format           FW Rev

---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------

/dev/nvme0n1     BTLJ81460E1M1P0I     INTEL SSDPELKX010T8                      1           1,00  TB /   1,00  TB    512   B +  0 B   VCV10300

/dev/nvme1n1     BTLJ81460E031P0I     INTEL SSDPELKX010T8                      1           1,00  TB /   1,00  TB    512   B +  0 B   VCV10300

/dev/nvme2n1     BTLJ81460E1J1P0I     INTEL SSDPELKX010T8                      1           1,00  TB /   1,00  TB    512   B +  0 B   VCV10300

/dev/nvme3n1     BTLJ81460E341P0I     INTEL SSDPELKX010T8                      1           1,00  TB /   1,00  TB    512   B +  0 B   VCV10300

After starting server:

[root@apache512 ~]# nvme list

[root@apache512 ~]#

The helper_log_file :

calling into script: /usr/share/daos/control/../../spdk/scripts/setup.sh

0000:b1:00.0 (8086 0a54): nvme -> uio_pci_generic

0000:b2:00.0 (8086 0a54): nvme -> uio_pci_generic

0000:b3:00.0 (8086 0a54): nvme -> uio_pci_generic

0000:b4:00.0 (8086 0a54): nvme -> uio_pci_generic

0000:00:04.0 (8086 2021): no driver -> uio_pci_generic

0000:00:04.1 (8086 2021): Already using the uio_pci_generic driver

0000:00:04.2 (8086 2021): Already using the uio_pci_generic driver

0000:00:04.3 (8086 2021): Already using the uio_pci_generic driver

0000:00:04.4 (8086 2021): Already using the uio_pci_generic driver

0000:00:04.5 (8086 2021): Already using the uio_pci_generic driver

0000:00:04.6 (8086 2021): Already using the uio_pci_generic driver

0000:00:04.7 (8086 2021): Already using the uio_pci_generic driver

0000:80:04.0 (8086 2021): Already using the uio_pci_generic driver

0000:80:04.1 (8086 2021): Already using the uio_pci_generic driver

0000:80:04.2 (8086 2021): Already using the uio_pci_generic driver

0000:80:04.3 (8086 2021): Already using the uio_pci_generic driver

0000:80:04.4 (8086 2021): Already using the uio_pci_generic driver

0000:80:04.5 (8086 2021): Already using the uio_pci_generic driver

0000:80:04.6 (8086 2021): Already using the uio_pci_generic driver

0000:80:04.7 (8086 2021): Already using the uio_pci_generic driver

RUN: ls -d /dev/hugepages | xargs -r chown -R root

RUN: ls -d /dev/uio* | xargs -r chown -R root

RUN: ls -d /sys/class/uio/uio*/device/config | xargs -r chown -R root

RUN: ls -d /sys/class/uio/uio*/device/resource* | xargs -r chown -R root

Setting VFIO file permissions for unprivileged access

RUN: chmod /dev/vfio

OK

RUN: chmod /dev/vfio/*

OK

 

DEBUG 17:41:13.340092 nvme.go:176: discovered nvme ssds: [0000:b4:00.0 0000:b3:00.0 0000:b1:00.0 0000:b2:00.0]

DEBUG 17:41:13.340794 nvme.go:133: removed lockfiles: [/tmp/spdk_pci_lock_0000:b4:00.0 /tmp/spdk_pci_lock_0000:b3:00.0 /tmp/spdk_pci_lock_0000:b1:00.0 /tmp/spdk_pci_lock_0000:b2:00.0]

DEBUG 17:41:13.502291 ipmctl.go:104: discovered 4 DCPM modules

DEBUG 17:41:13.517978 ipmctl.go:356: discovered 2 DCPM namespaces

DEBUG 17:41:13.775184 ipmctl.go:133: show region output: ---ISetID=0xfe0ceeb819432444---

   PersistentMemoryType=AppDirect

   FreeCapacity=0.000 GiB

---ISetID=0xe1f4eeb8c7432444---

   PersistentMemoryType=AppDirect

   FreeCapacity=0.000 GiB

 

DEBUG 17:41:13.973259 ipmctl.go:104: discovered 4 DCPM modules

DEBUG 17:41:13.988400 ipmctl.go:356: discovered 2 DCPM namespaces

DEBUG 17:41:14.234782 ipmctl.go:133: show region output: ---ISetID=0xfe0ceeb819432444---

   PersistentMemoryType=AppDirect

   FreeCapacity=0.000 GiB

---ISetID=0xe1f4eeb8c7432444---

   PersistentMemoryType=AppDirect

   FreeCapacity=0.000 GiB

