DAOS Community Update / Sep'22


Lombardi, Johann
 

Hi there,

Please find below the DAOS community newsletter for September 2022. A copy of this newsletter is also available on the wiki.

Past Events

  • Flash Memory Summit’22: 3rd Workshop on Extreme-Scale Storage and Analysis (August 2nd-4th)
    Requirements and Challenges Associated with the World's Fastest Storage Platform
    https://www.flashmemorysummit.com
    Jeff Olivier (Intel)

Upcoming Events

  • IXPUG Annual Conference 2022 (Sep 29)
    The Evolution of Storage and Memory and the DAOS Role in It
    Kevin Harms (ANL)
    Andrey Kudryavtsev (Intel)
  • SuperCheck-SC'22 (Nov 14)
    DAOS: Nextgen Storage Stack for HPC and AI
    Johann Lombardi (Intel)
  • SC'22 BoF (Nov 15-17)
    DAOS Storage Community BoF
    Kevin Harms (ANL)
    Michael Hennecke (Intel)
    Dean Hildebrand (Google)
    Panagiotis Adamidis (DKRZ)
  • SC'22 BoF (Nov 15-17)
    The Storage Tower of Babel? ... Not! Actually, maybe?
    Philippe Deniel (CEA)
    John Bent (Seagate)
    Tiago Quintino (ECMWF)
    Johann Lombardi (Intel)
  • SC'22 Tutorial (Nov 13-14)
    Emerging Storage Interfaces: DAOS and PMDK
    Adrian Jackson (EPCC)
    Mohamad Chaarawi (Intel)
    Johann Lombardi (Intel) 
  • 6th annual DAOS User Group (Nov/Dec'22)

Release

  • Current stable release is 2.0.3. See https://docs.daos.io/v2.0/ and https://packages.daos.io/v2.0/ for more information.
    2.0.3 includes several fixes for ARM64 support, erasure code and pool operations. Please see the release notes for more details.
  • Branches:
    • release/2.0 is the release branch for the stable 2.0 release. Latest bug fix release is 2.0.3 (v2.0.3 tag).
    • release/2.2 is the development branch for the future 2.2 release. The first release candidate has been created (v2.2.0-rc1 tag).
    • master is the development branch for the future 2.4 release. Latest test build is 2.3.100 (v2.3.100-tb tag). A new build including the EC parity rotation feature is imminent.
  • Major recent changes on release/2.0 (bugfix release):
    • Several Coverity fixes
    • Fix incorrect assertion failure hit when running soak testing with the LAMMPS application
    • Bump hadoop-common version to 3.3.3
    • Several documentation fixes
    • Several test fixes
  • Major recent changes on release/2.2 (future 2.2 release):
    • All patches listed in the 2.0 section above.
    • Update mercury to 2.2.0
    • Update pmdk to 1.12.1
    • Trigger DTX reindex before DTX resync
    • Fix issue with srx_disabled config field
    • Fix mtime setting so that it does not rely on the DAOS HLC
    • Improve DAOS build preprocessing steps
    • Fix Java JAR build instructions
    • Reduce lock contention on hash lock in libdaos to increase multi-thread performance
    • Set UCX_IB_FORK_INIT env var in the engine
    • Add new metrics to track EC full stripe and partial updates
    • Improve dfs_setattr to re-sample mtime on file size changes
    • Add UCX documentation
    • Do not use stable epoch for reclaim
    • Fix dfs_open for directories without O_EXCL
    • Add support for 2.0/2.2 agent interoperability
  • Major recent changes on master (future 2.4 release):
    • All patches listed in the 2.2 section above.
    • Add prefix to notice logging in the control plane
    • Add githook install script
    • Move NLT and unit tests to el8
    • Fix a race in dc_tx_get_epoch
    • Fix name match in daos_oclass_name2id()
    • Add ability for the engine to manage its own ABT stack via mmap() to proactively detect stack overrun
    • Limit number of outstanding I/Os to NVMe device
    • Remove indirect link for ISA-L
    • Store scan objects target ID during rebuild to avoid excessive iteration when sending object list
    • Create a single bulk handle per DMA chunk and share the same handle for all bulk transfers against the same DMA chunk
    • Retry map_fresh on more errors
    • Refactor daos_server standalone command surface
    • Reject read/write hole in bio
    • Run NLT on ARM64 self-hosted runners
    • Fix gap in EC rotation patch in tx classify
    • Replace SWIM D_CIRCLEQ with a hash table
    • Fix VMD domain parsing
    • Accept positional args in dfuse command to support mtab entries
    • Set EC cell alignment to 32 bytes
    • Disallow IP address with negative port in the control plane
  • What is coming:
    • 2.2.0 GA
    • 2.4.0 feature freeze

R&D

  • Major features under development:
    • VOS on SPDK blob
    • Multi-user dfuse
    • More aggressive caching in dfuse for AI apps
      • FUSE version updated on EL8 for readdir caching support; not needed on Leap, which already ships a recent enough FUSE version
      • FUSE kernel readdir is now enabled; dfuse readdir is still being worked on
      • PR: https://github.com/daos-stack/daos/pull/6776
      • Target release: 2.4
    • Catastrophic recovery
      • Aka distributed fsck or checker
      • Tests for ddb (low-level debugger utility similar to debugfs for ext4) under review
      • Testing for the dmg checker under development
      • Pass 4 for container recovery completed
      • Branch: feature/cat_recovery
      • Target release: 2.6
    • Multi-homed network support
      • Aka multi-provider support
      • This feature aims at supporting multiple network providers in the engine
      • Branch is feature complete now and testing is underway
      • Branch: feature/multiprovider
      • Target release: 2.6
    • Client-side metrics
    • Performance domain
      • Extend placement algorithm to be aware of fabric topology
      • A fix to avoid placing shards in the same domain has landed
      • Branch: feature/perf_dom
      • Target release: 2.8 
  • Pathfinding:
    • DAOS Pipeline API for active storage
    • Leveraging the Intel Data Streaming Accelerator (DSA) to accelerate DAOS
      • Prototype leveraging DSA for VOS aggregation delivered
      • Initial results shared at the IXPUG conference
    • OPX provider support in collaboration with Cornelis Networks
      • OPX provider merged upstream in libfabric
      • Provider supported in latest mercury version
      • Changes to DAOS to enable OPX as part of the build in progress
    • GPU data path optimizations
  • I/O Middleware / Framework Support

News

  • In addition to building on ARM platforms on Ubuntu 22.04, AlmaLinux 8 and Leap 15, some basic tests (called NLT, which stands for Node Local Tests) are now run on every PR landing. See this link for more information. Thanks again to Linaro and Croit for their support. The next step is to run unit tests.
  • Congrats to Croit and DenisB for merging the SPDK DAOS bdev upstream!
  • The DAOS community BoF for SC'22 has been accepted!

 
