
Re: How to install DAOS on ARM64 platform

Faccini, Bruno
 

Can you check if /root/huzj/daos/install/lib64/daos_srv/librdb.so exists ?

And if not, is there any log for this build available ?

Also, what are the environment variables for the session you are using to start the server/engine ?

And last, can you attach your server/engine config file ?

Thanks in advance for your help,

Bruno.
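
The first and third checks requested above could be gathered in one pass with a short script. This is only a sketch: the install path is the one that appears in the engine log in this thread, and the list of environment-variable prefixes is an assumption about what might be relevant, not an official DAOS list.

```python
import os
from pathlib import Path

# Assumed prefixes of variables worth reporting; not an official DAOS list.
ENV_PREFIXES = ("DAOS_", "D_", "CRT_", "OFI_", "FI_", "LD_LIBRARY_PATH", "PATH")


def module_present(install_dir, module="librdb.so"):
    """Return True if <install_dir>/lib64/daos_srv/<module> is a regular file."""
    return (Path(install_dir) / "lib64" / "daos_srv" / module).is_file()


def relevant_env(environ=None):
    """Return the variables whose names start with one of ENV_PREFIXES."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in sorted(environ.items()) if k.startswith(ENV_PREFIXES)}


if __name__ == "__main__":
    install = "/root/huzj/daos/install"  # path taken from the log in this thread
    print("librdb.so present:", module_present(install))
    for name, value in relevant_env().items():
        print(f"{name}={value}")
```

Running it in the same shell session that starts daos_server matters, because the environment is what is being captured.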

 

From: <daos@daos.groups.io> on behalf of Groot <kukougu@...>
Reply to: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday 30 September 2022 at 10:48
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] How to install DAOS on ARM64 platform

 

I built DAOS from source on an ARM64 platform and used daos_server start to start the DAOS server. It failed with the error below; after I created the /var/run/daos_server directory, daos_server started successfully.
$ daos_server start 
ERROR: dRPC server setup: missing socket directory /var/run/daos_server: stat /var/run/daos_server: no such file or directory

But daos_engine.0 (/tmp/daos_engine.0.log) reports errors after the storage is formatted.

09/30-16:15:24.84 slave1 DAOS[1213401/-1/0] server ERR  src/engine/module.c:90 dss_module_load() cannot load librdb.so: /root/huzj/daos/install/bin/../lib64/daos_srv/librdb.so: undefined symbol: ds_obj_enum_pack

09/30-16:15:24.84 slave1 DAOS[1213401/-1/0] server ERR  src/engine/init.c:231 modules_load() Failed to load module rdb: -1003

Thanks a lot.
Groot

---------------------------------------------------------------------
Intel Corporation SAS (French simplified joint stock company)
Registered headquarters: "Les Montalets"- 2, rue de Paris,
92196 Meudon Cedex, France
Registration Number:  302 456 199 R.C.S. NANTERRE
Capital: 5 208 026.16 Euros

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Re: How to install DAOS on ARM64 platform

Groot
 

I built DAOS from source on an ARM64 platform and used daos_server start to start the DAOS server. It failed with the error below; after I created the /var/run/daos_server directory, daos_server started successfully.
$ daos_server start 
ERROR: dRPC server setup: missing socket directory /var/run/daos_server: stat /var/run/daos_server: no such file or directory

But daos_engine.0 (/tmp/daos_engine.0.log) reports errors after the storage is formatted.
09/30-16:15:24.84 slave1 DAOS[1213401/-1/0] server ERR  src/engine/module.c:90 dss_module_load() cannot load librdb.so: /root/huzj/daos/install/bin/../lib64/daos_srv/librdb.so: undefined symbol: ds_obj_enum_pack
09/30-16:15:24.84 slave1 DAOS[1213401/-1/0] server ERR  src/engine/init.c:231 modules_load() Failed to load module rdb: -1003
Thanks a lot.
Groot
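
When an engine log is long, the module-load failures can be pulled out mechanically. The sketch below is a hypothetical helper, not a DAOS tool; its pattern is modelled only on the dss_module_load() line quoted above, so other engine messages may not match.

```python
import re

# Pattern modelled on the dss_module_load() line quoted above; other
# engine log messages may be formatted differently.
LOAD_ERR = re.compile(
    r"dss_module_load\(\) cannot load (?P<module>\S+): "
    r"(?P<path>\S+): undefined symbol: (?P<symbol>\S+)"
)


def find_undefined_symbols(lines):
    """Yield (module, path, symbol) for each module-load failure found."""
    for line in lines:
        match = LOAD_ERR.search(line)
        if match:
            yield match.group("module"), match.group("path"), match.group("symbol")


sample = (
    "09/30-16:15:24.84 slave1 DAOS[1213401/-1/0] server ERR  "
    "src/engine/module.c:90 dss_module_load() cannot load librdb.so: "
    "/root/huzj/daos/install/bin/../lib64/daos_srv/librdb.so: "
    "undefined symbol: ds_obj_enum_pack"
)
for module, path, symbol in find_undefined_symbols([sample]):
    print(f"{module} needs {symbol} (loaded from {path})")
```

In general, an undefined symbol at load time can mean the installed libraries come from different builds. That is a general observation, not a confirmed diagnosis of this report.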


Re: How to install DAOS on ARM64 platform

samir.raval@...
 

Hello Groot,

Can you check if servers are ready using "dmg system query"? Looks like daos engine is not up.

Please also provide /tmp/daos_server.log and  /tmp/daos_engine.0.log.

Thank You
SAMIR


Re: How to install DAOS on ARM64 platform

Groot
 

Thanks a lot. We compiled successfully using the master branch.
But we hit another problem when using RAM and tmpfs to emulate SCM. We based the server config file on https://github.com/daos-stack/daos/blob/master/utils/config/examples/daos_server_local.yml
and get this error when creating a pool:
$ dmg pool create --size 1G Pool1
Creating DAOS pool with automatic storage allocation: 1.0 GB total, 6,94 tier ratio
ERROR: dmg: pool create failed: rpc error: code = Unknown desc = pool request contains zero target ranks
We also get the same error when we use RAM to emulate SCM on an x86 platform.
Any ideas?

Thanks.
Groot


Re: How to install DAOS on ARM64 platform

Nabarro, Tom
 

Hello Groot,

 

What version of DAOS source are you compiling from?

Please try with current master branch if you are not already.

 

Regards,

Tom

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Groot
Sent: Sunday, September 25, 2022 2:02 PM
To: daos@daos.groups.io
Subject: Re: [daos] How to install DAOS on ARM64 platform

 

Thanks a lot.
But I can't install ipmctl on the ARM64 platform. I tried to compile it from source but got an error about a missing cpuid.h file.
And I get an error about a missing nvm_management.h file while compiling DAOS, as below:

github.com/daos-stack/daos/src/control/lib/ipmctl

lib/ipmctl/nvm.go:17:10: fatal error: nvm_management.h: No such file or directory

So how do I compile DAOS on the ARM64 platform? Could you give some details?
Thanks.
Groot


Re: How to install DAOS on ARM64 platform

Groot
 

Thanks a lot.
But I can't install ipmctl on the ARM64 platform. I tried to compile it from source but got an error about a missing cpuid.h file.
And I get an error about a missing nvm_management.h file while compiling DAOS, as below:
github.com/daos-stack/daos/src/control/lib/ipmctl
lib/ipmctl/nvm.go:17:10: fatal error: nvm_management.h: No such file or directory

So how do I compile DAOS on the ARM64 platform? Could you give some details?
Thanks.
Groot
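
Before rebuilding, it can help to confirm whether the header the Go build asks for is installed at all. A minimal sketch, assuming the usual system include directories (a source build may install ipmctl headers under another prefix):

```python
from pathlib import Path

# Assumed default include directories; a source build may use other prefixes.
DEFAULT_INCLUDE_DIRS = ("/usr/include", "/usr/local/include")


def find_header(name, include_dirs=DEFAULT_INCLUDE_DIRS):
    """Return the paths under include_dirs where header `name` exists."""
    return [str(Path(d) / name) for d in include_dirs if (Path(d) / name).is_file()]


if __name__ == "__main__":
    hits = find_header("nvm_management.h")
    print("nvm_management.h ->", hits if hits else "not found")
```

cpuid.h is left out of the search on purpose: it is an x86-specific compiler header, which is consistent with ipmctl failing to build on ARM64.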


Re: How to install DAOS on ARM64 platform

Lombardi, Johann
 

Hi there,

 

Yes, the process is the same except that we don’t provide RPMs. Please use the master branch which is regularly built and (basically) tested on ARM64.

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of Groot <kukougu@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday 23 September 2022 at 11:22
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] How to install DAOS on ARM64 platform

 

How do I install DAOS on an ARM64 platform? Is the process the same as on x86?

Thanks a lot.
Groot



How to install DAOS on ARM64 platform

Groot
 

How do I install DAOS on an ARM64 platform? Is the process the same as on x86?

Thanks a lot.
Groot


Re: system fault testing

Lombardi, Johann
 

Hi Chuck,

 

We are slowly moving all our (internal) design documentations and test plans to the public wiki. Let me share with you the one related to system fault testing.

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "Tuffli, Chuck" <chuck.tuffli@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Saturday 3 September 2022 at 00:24
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] system fault testing

 

The DAOS documentation has a good overview of its fault model. We are starting to experiment with various types of failures (pull a drive, pull a network cable, pull a power cord) and are curious what testing has been done in this area. Is there a test plan someone could share?

 

--chuck



DAOS Community Update / Sep'22

Lombardi, Johann
 

Hi there,

Please find below the DAOS community newsletter for September 2022. A copy of this newsletter is also available on the wiki.

Past Events

  • Flash Memory Summit’22: 3rd Workshop on Extreme-Scale Storage and Analysis (August 2nd-4th)
    Requirements and Challenges Associated with the World's Fastest Storage Platform
    https://www.flashmemorysummit.com
    Jeff Olivier (Intel)

Upcoming Events

  • IXPUG Annual Conference 2022 (Sep 29)
    The Evolution of Storage and Memory and the DAOS Role in It
    Kevin Harms (ANL)
    Andrey Kudryavtsev (Intel)
  • SuperCheck-SC'22 (Nov 14)
    DAOS: Nextgen Storage Stack for HPC and AI
    Johann Lombardi (Intel)
  • SC'22 BoF (Nov 15-17)
    DAOS Storage Community BoF
    Kevin Harms (ANL)
    Michael Hennecke (Intel)
    Dean Hildebrand (Google)
    Panagiotis Adamidis (DKRZ)
  • SC'22 BoF (Nov 15-17)
    The Storage Tower of Babel? ... Not! Actually, maybe?
    Philippe Deniel (CEA)
    John Bent (Seagate)
    Tiago Quintino (ECMWF)
    Johann Lombardi (Intel)
  • SC'22 Tutorial (Nov 13-14)
    Emerging Storage Interfaces: DAOS and PMDK
    Adrian Jackson (EPCC)
    Mohamad Chaarawi (Intel)
    Johann Lombardi (Intel) 
  • 6th annual DAOS User Group (Nov/Dec'22)

Release

  • Current stable release is 2.0.3. See https://docs.daos.io/v2.0/ and https://packages.daos.io/v2.0/ for more information.
    2.0.3 includes several fixes for ARM64 support, erasure code and pool operations. Please see the release notes for more details.
  • Branches:
    • release/2.0 is the release branch for the stable 2.0 release. Latest bug fix release is 2.0.3 (v2.0.3 tag).
    • release/2.2 is the development branch for the future 2.2 release. The first release candidate has been created (v2.2.0-rc1 tag).
    • Master is the development branch for the future 2.4 release. Latest test build is 2.3.100 (v2.3.100-tb tag). New build including EC parity rotation feature imminent.
  • Major recent changes on release/2.0 (bugfix release):
    • Several Coverity fixes
    • Fix incorrect assertion failure hit when running soak testing with LAMMPS application
    • Bump hadoop-common version to 3.3.3
    • Several documentation fixes
    • Several test fixes.
  • Major recent changes on release/2.2 (future 2.2 release):
    • All patches listed in the 2.0 section above.
    • Update mercury to 2.2.0
    • Update pmdk to 1.12.1
    • Trigger DTX reindex before DTX resync
    • Fix issue with srx_disabled config field
    • Fix mtime set to not rely on DAOS HLC
    • Improve DAOS build preprocessing steps
    • Fix java jar build instructions
    • Reduce lock contention on hash lock in libdaos to increase multi-thread performance
    • Set UCX_IB_FORK_INIT env var in the engine
    • Add new metrics to track EC full stripe and partial updates
    • Improve dfs_setattr to re-sample mtime on file size changes
    • Add UCX documentation
    • Do not use stable epoch for reclaim
    • Fix dfs_open for directories without O_EXCL
    • Add support for 2.0/2.2 agent interoperability
  • Major recent changes on master (future 2.4 release):
    • All patches listed in the 2.2 section above.
    • Add prefix to notice logging in the control plane
    • Add githook install script
    • Move NLT and unit tests to el8
    • Fix a race in dc_tx_get_epoch
    • Fix name match in daos_oclass_name2id()
    • Add ability for engine to manage its own ABT stack via mmap() to pro-actively detect stack overrun
    • Limit number of outstanding I/Os to NVMe device
    • Remove indirect link for ISA-L
    • Store scan objects target ID during rebuild to avoid excessive iteration when sending object list
    • Create a single bulk handle per DMA chunk and share the same handle for all bulk transfer against the same DMA chunk.
    • Retry map_refresh on more errors
    • Refactor daos_server standalone command surface
    • Reject read/write hole in bio
    • Run NLT on ARM64 self-hosted runners
    • Fix gap in EC rotation patch in tx classify
    • Replace SWIM D_CIRCLEQ with a hash table.
    • Fix VMD domain parsing
    • Accept positional args in dfuse command to support mtab entries
    • Set EC cell alignment to 32 bytes
    • Disallow IP address with negative port in the control plane
  • What is coming:
    • 2.2.0 GA
    • 2.4.0 feature freeze

R&D

  • Major features under development:
    • VOS on SPDK blob
    • Multi-user dfuse
    • More aggressive caching in dfuse for AI APPs
      • FUSE version updated on EL8 for readdir caching support; not needed on Leap, which already has a recent enough FUSE version.
      • FUSE kernel readdir is now enabled; dfuse readdir is still being worked on.
      • PR: https://github.com/daos-stack/daos/pull/6776
      • Target release: 2.4
    • Catastrophic recovery
      • Aka distributed fsck or checker
      • Tests for ddb (low level debugger utility similar to debugfs for ext4) under review
      • Testing for the dmg checker under development
      • Pass 4 for container recovery completed.
      • Branch: feature/cat_recovery
      • Target release: 2.6
    • Multi-homed network support
      • Aka multi-provider support
      • This feature aims at supporting multiple network providers in the engine
      • Branch is feature complete now and testing is underway
      • Branch: feature/multiprovider
      • Target release: 2.6
    • Client-side metrics
    • Performance domain
      • Extend placement algorithm to be aware of fabric topology
      • Fix to avoid putting shards on the same domain landed
      • Branch: feature/perf_dom
      • Target release: 2.8 
  • Pathfinding:
    • DAOS Pipeline API for active storage
    • Leveraging the Intel Data Streaming Accelerator (DSA) to accelerate DAOS
      • Prototype leveraging DSA for VOS aggregation delivered
      • Initial results shared at IXPUG conference.
    • OPX provider support in collaboration with Cornelis Networks
      • OPX provider merged upstream in libfabric
      • Provider supported in latest mercury version
      • Changes to DAOS to enable OPX as part of the build in progress
    • GPU data path optimizations
  • I/O Middleware / Framework Support

News

  • In addition to building on ARM on Ubuntu 22.04, AlmaLinux 8 and Leap 15, some basic tests (called NLT, which stands for Node Local Tests) are now run on every PR landing. See this link for more information. Thanks again to Linaro and Croit for their support. Next step is to run unit tests.
  • Congrats to Croit and DenisB for merging the SPDK DAOS bdev upstream!
  • The DAOS community BoF for SC'22 has been accepted!

 



system fault testing

Tuffli, Chuck
 

The DAOS documentation has a good overview of its fault model. We are starting to experiment with various types of failures (pull a drive, pull a network cable, pull a power cord) and are curious what testing has been done in this area. Is there a test plan someone could share?

--chuck


DAOS Community Update / August'22

Lombardi, Johann
 

Hi there,

Please find below the DAOS community newsletter for August 2022. A copy of this newsletter is also available on the wiki.

Past Events

Upcoming Events

  • 6th annual DAOS User Group (Nov'22)

Release

  • Current stable release is 2.0.3. See https://docs.daos.io/v2.0/ and https://packages.daos.io/v2.0/ for more information.
    2.0.3 includes several fixes for ARM64 support, erasure code and pool operations. Please see the release notes for more details.
  • Branches:
    • release/2.0 is the release branch for the stable 2.0 release. Latest bug fix release is 2.0.3 (v2.0.3 tag).
    • release/2.2 is the development branch for the future 2.2 release. Latest test build is 2.1.104 (v2.1.104-tb tag).
    • Master is the development branch for the future 2.4 release. Latest test build is 2.3.100 (v2.3.100-tb tag).
  • Major recent changes on release/2.0 (bugfix release) since v2.0.3:
    • Several Coverity fixes
    • Several packaging changes to prevent 2.0 from picking up UCX packages and new SPDK version.
    • Several test fixes
    • Remove unused test code
  • Major recent changes on release/2.2 (future 2.2 release):
    • All patches listed in the 2.0 section above.
    • Remove check for dpdk/rte_eal.
    • Handle the rare case in the control plane where a node processing a SWIM dead event loses leadership before the membership can be updated
    • The hwloc library provides quite a bit of information about block devices; make this available through the control plane. Also adds support for non-PCI devices
    • Add backward compatibility code for the enable_vmd config parameter
    • Several dtx internal fixes.
    • Fix size compatibility issue with ds_cont_prop_cont_global_version in rdb
    • Refine code cleaning up huge pages left behind by previous instances
    • Update mercury to 2.2.0-rc6 to grab several fixes in ucx, tcp and cxi.
    • Improve QoS on request processing in the engine when running out of DMA buffers (FIFO order is now guaranteed to avoid starvation)
    • Improve support of compound RPCs with co-located shards
    • Update SPDK to v22.01.1
    • Add support for ucx/tcp transport to cart.
    • Fix assertion failure in pool_map_get_version()
    • Fix mtime accounting for user set mtime
    • Add VPIC and LAMMPS applications to soak testing framework
    • Report pool global version on dmg pool list/query
    • Reject pool connection from old clients after pool upgrade
  • Major recent changes on master (future 2.4 release):
    • All patches listed in the 2.2 section above.
    • Add code to report Jira status into GitHub 
    • Use shared event queue in pydaos
    • Fix an issue in the I/O scheduler related to I/O throttling
    • Add partial support for readdir caching to dfuse
    • Land checksum scrubbing feature
    • Remove openpa package dependency
    • Add support for streaming I/O functions to the interception library
    • Update vendor dependency in the control plane
    • Fix daos cont list-obj JSON output
    • Fix possible race condition in map_refresh
    • Fix a race in dc_tx_get_epoch
    • Add metrics to track EC full stripes vs partial updates
    • Fix name match in daos_oclass_name2id()
    • Add new STACK_MMAP build option to enable DAOS-managed ABT stacks in the engine
    • Fix OID leak in the OIT
    • Fix a bug in size query introduced by the EC shard rotation feature
    • Move fault injection testing from CentOS7 to Rocky Linux 8 to prepare for the CentOS7 removal for 2.4.
    • Add ARM64 self-hosted runners to GitHub Action.
  • What is coming:
    • 2.2.0 code freeze
    • 2.4.0 feature freeze

R&D

  • Major features under development:
    • VOS on SPDK blob
      • New umem backend and WAL to maintain an up-to-date copy of a VOS (i.e. DAOS metadata) file on a SPDK blob (i.e. SSD).
      • Patch to use umem DAOS interface in BIO and VOS landed.
      • Branch: feature/vos-on-blob
      • Target release: TBD
    • Checksum scrubber
      • Feature landed to master for 2.4
      • Branch: feature/csum-scrubbing
      • Target release: 2.4
      • This entry will be removed from this report next time
    • Multi-user dfuse
    • More aggressive caching in dfuse for AI APPs
    • I/O streaming function interception
      • Add the ability to intercept fopen/fclose/fread/fwrite and other streaming functions in the interception library
      • Feature landed to master for 2.4
      • PR: https://github.com/daos-stack/daos/pull/6939
      • Target release: 2.4
      • This entry will be removed from this report next time
    • Catastrophic recovery
      • Aka distributed fsck or checker
      • Tests for ddb (low level debugger utility similar to debugfs for ext4) under review
      • Testing for the dmg checker under development
      • Improvements to the checker start/stop flow and dmg in progress
      • Pass 4 for container recovery is in progress
      • Branch: feature/cat_recovery
      • Target release: 2.6
    • Multi-homed network support
      • Aka multi-provider support
      • This feature aims at supporting multiple network providers in the engine
      • Branch is feature complete now and testing is underway
      • Branch: feature/multiprovider
      • Target release: 2.6
    • Client-side metrics
    • Performance domain
      • Extend placement algorithm to be aware of fabric topology
      • Fix to avoid putting shards on the same domain landed
      • Branch: feature/perf_dom
      • Target release: 2.8 
    • LDMS plugin to export DAOS metrics
  • Pathfinding:
    • DAOS Pipeline API for active storage
    • Leveraging the Intel Data Streaming Accelerator (DSA) to accelerate DAOS
      • Prototype leveraging DSA for VOS aggregation delivered
      • Initial results shared at IXPUG conference.
    • OPX provider support in collaboration with Cornelis Networks
      • OPX provider merged upstream in libfabric
      • Provider supported in latest mercury version
      • Changes to DAOS to enable OPX as part of the build in progress
    • GPU data path optimizations
  • I/O Middleware / Framework Support:

News

  • Thanks a lot to Linaro and Croit for providing the DAOS community with access to ARM64 nodes with different configurations (#cores, OS, ...). Github actions is now enabled to build the DAOS master branch regularly on ARM64. Next step is to run unit tests.
  • A proposal for a DAOS community BoF at SC’22 has been submitted. We should know on Aug 12.
  • IO500 instructions on the wiki have been updated to use the new DAOS-aware pfind.

 



Re: When the network port recovers from a fault, the corresponding rank cannot receive the groupupdate message, resulting in the failure to join the group normally,rpc msg OOG.

dagouxiong2015@...
 

The cart context of the healthy process needs to be initialized so that it can communicate normally with the process recovered from the failure.

Thank you for your reply!


Re: When the network port recovers from a fault, the corresponding rank cannot receive the groupupdate message, resulting in the failure to join the group normally,rpc msg OOG.

Lombardi, Johann
 

Thanks for the clarification. Since the engine is killed and restarted in step 4, I am not sure I understand why network contexts would need to be reinitialized. Could you please create a Jira ticket with the logs?

In the meantime, you should be able to iterate over all the network contexts in cart (we keep an array with all the contexts there IIRC).

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "dagouxiong2015@..." <dagouxiong2015@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Thursday 4 August 2022 at 03:55
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] When the network port recovers from a fault, the corresponding rank cannot receive the groupupdate message, resulting in the failure to join the group normally
rpc msg OOG.

 

Thank you for your reply; your understanding is correct.
At that time our sequence of operations was:
 1) Unplug the network cable
 2) Kill the process corresponding to the network cable
 3) Reinsert the network cable
 4) Restore the rank through "dmg system start --ranks"
 5) DER_OOG appears in the log of the recovered rank

We noticed that this problem occurs every time.



Re: When the network port recovers from a fault, the corresponding rank cannot receive the groupupdate message, resulting in the failure to join the group normally,rpc msg OOG.

dagouxiong2015@...
 

Thank you for your reply; your understanding is correct.
At that time our sequence of operations was:
 1) Unplug the network cable
 2) Kill the process corresponding to the network cable
 3) Reinsert the network cable
 4) Restore the rank through "dmg system start --ranks"
 5) DER_OOG appears in the log of the recovered rank

We noticed that this problem occurs every time.
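
To confirm the last step on each attempt, the recovered rank's log can be scanned for the error name. A minimal sketch: the helper names are hypothetical, and it assumes the log contains the literal token DER_OOG as described above.

```python
def lines_with_error(lines, token="DER_OOG"):
    """Return (line_number, text) pairs for lines that mention `token`."""
    return [
        (number, text.rstrip("\n"))
        for number, text in enumerate(lines, start=1)
        if token in text
    ]


def scan_log(path, token="DER_OOG"):
    """Scan one engine log file; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return lines_with_error(handle, token)
```

For example, `scan_log("/tmp/daos_engine.0.log")` would list the matching lines, assuming the engine log is at that location on the affected node.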



Re: When the network port recovers from a fault, the corresponding rank cannot receive the groupupdate message, resulting in the failure to join the group normally,rpc msg OOG.

Lombardi, Johann
 

Hi there,

 

IIUC, you are running into a bug where a DAOS engine is not able to rejoin the system / CART group if you “just” unplug & replug the network cable. You then noticed that you could work around this issue by reinitializing the cart contexts, but don’t know how to do that across the board for all network contexts used by the engine. Is that correct?

 

Cheers,

Johann

 

From: <daos@daos.groups.io> on behalf of "dagouxiong2015@..." <dagouxiong2015@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Wednesday 3 August 2022 at 04:15
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: [daos] When the network port recovers from a fault, the corresponding rank cannot receive the groupupdate message, resulting in the failure to join the group normally
rpc msg OOG.

 

We tried the mercury demo and found that initializing hg can solve this OOG problem.
We then analyzed the cart code
and found that the cart context is a global context that is used for the RPC messages of all other ranks.

When we want to initialize the cart context for one rank, instead of all ranks, what can we do?


struct dss_module_info {

    crt_context_t       dmi_ctx;



When the network port recovers from a fault, the corresponding rank cannot receive the groupupdate message, resulting in the failure to join the group normally,rpc msg OOG.

dagouxiong2015@...
 

We tried the mercury demo and found that initializing hg can solve this OOG problem.
We then analyzed the cart code
and found that the cart context is a global context that is used for the RPC messages of all other ranks.

When we want to initialize the cart context for one rank, instead of all ranks, what can we do?




struct dss_module_info {
    crt_context_t       dmi_ctx;


Re: Question about 3D Xpoint DIMM

Lombardi, Johann
 

Hi there,

 

The DAOS architecture won’t fundamentally change and the plan is to become more flexible in the configurations we support. We will continue to store metadata and data on different devices and use direct load/store for the metadata. The DAOS metadata will be stored on either persistent (e.g. Apache/Barlow/Crow Pass, battery-backed DRAM or future SSD products supporting CXL.mem) or volatile (e.g. DRAM or CXL.mem) devices. The persistent option is what DAOS supports today. As for the volatile one, there will be an extra step on write operations to keep a copy of the metadata in sync on CXL.io/NVMe SSDs. This work was already underway with community partners and is going to be accelerated. We will share more on this soon.

 

Once done, this change should allow DAOS to run on a wider range of hardware while maintaining our performance leadership.

 

Cheers,

Johann
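
The "extra step on write operations" for volatile metadata can be pictured with a toy write-ahead log. This is a generic illustration under stated assumptions, not the DAOS design: the class name, the JSON record format and the use of a plain file in place of an SSD are all invented for the example.

```python
import json
import os


class LoggedStore:
    """Toy key-value store: state lives in memory, every write is logged first."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log:
                for line in log:  # replay the log to rebuild the memory state
                    key, value = json.loads(line)
                    self.data[key] = value
        self._log = open(log_path, "a", encoding="utf-8")

    def put(self, key, value):
        # The extra step on writes: persist the record before updating memory.
        self._log.write(json.dumps([key, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.data[key] = value

    def get(self, key):
        return self.data[key]  # reads are served from memory only

    def close(self):
        self._log.close()
```

Reads never touch the log, so load/store access to the in-memory copy is unchanged. Only writes pay for the extra persistence step, and a restart rebuilds memory by replaying the log. The sketch does not handle a record torn by a crash in the middle of a write.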

 

From: <daos@daos.groups.io> on behalf of "Nabarro, Tom" <tom.nabarro@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Monday 1 August 2022 at 11:51
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Question about 3D Xpoint DIMM

 

DAOS continues to be a strategic part of the Intel software portfolio and we remain committed to supporting our customers and the DAOS community. In parallel, we are accelerating efforts that have already been under way for DAOS to utilize other storage technologies to store metadata on SSDs through NVMe and CXL interfaces.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of ???
Sent: Sunday, July 31, 2022 7:09 AM
To: daos@daos.groups.io
Subject: [daos] Question about 3D Xpoint DIMM

 

Intel recently announced that it will no longer provide 3D Xpoint DIMMs, so how will this affect DAOS?



Re: Question about 3D Xpoint DIMM

Nabarro, Tom
 

DAOS continues to be a strategic part of the Intel software portfolio and we remain committed to supporting our customers and the DAOS community. In parallel, we are accelerating efforts that have already been under way for DAOS to utilize other storage technologies to store metadata on SSDs through NVMe and CXL interfaces.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of ???
Sent: Sunday, July 31, 2022 7:09 AM
To: daos@daos.groups.io
Subject: [daos] Question about 3D Xpoint DIMM

 

Intel recently announced that it will no longer provide 3D Xpoint DIMMs, so how will this affect DAOS?


Question about 3D Xpoint DIMM

段世博
 

Intel recently announced that it will no longer provide 3D Xpoint DIMMs, so how will this affect DAOS?
