FIO Results & Running IO500


Peter
 

Hello all!

I have a cluster of 4 DAOS nodes. The nodes use CentOS 7.9, Optane SCM (no SSDs), and are connected over EDR InfiniBand.
These nodes are able to run FIO as shown here: https://daos-stack.github.io/admin/performance_tuning/#fio
The scores I am able to achieve running /examples/dfs.fio are:

Seq Read      12.4 GB/s    283 us / 21408 us  (latency min/average)
Seq Write      4.0 GB/s    673 us / 66585 us  (latency min/average)
Random Read   187 KIOPS     83 us /  1335 us  (latency min/average)
Random Write  180 KIOPS     93 us /  1409 us  (latency min/average)


Are these numbers reasonable? The random-I/O scores seem low. I'm not 100% sure about my recorded latency numbers, but they also seem slow for Optane; perhaps that is due to dfuse or other overheads.
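
For reference, the job file I'm running is along these lines (trimmed and from memory; the pool/container labels and the sizes below are placeholders, not my actual settings):

    ; minimal dfs-engine job, loosely based on fio's examples/dfs.fio
    [global]
    ioengine=dfs
    ; placeholders -- substitute the real pool/container labels or UUIDs
    pool=testpool
    cont=testcont
    filename_format=fio-test.$jobnum
    time_based=1
    runtime=30
    ; total outstanding I/O is roughly iodepth x numjobs; these values are
    ; usually what determine the random-IOPS numbers
    iodepth=16
    numjobs=16

    [rand-read-4k]
    rw=randread
    bs=4k
    size=10G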


I have since attempted to run IO-500, configured according to: https://wiki.hpdd.intel.com/display/DC/IO-500+ISC21
IO500 runs, with the following output: (I'm not concerned about the stonewall time errors for the moment)

IO500 version io500-isc21 (standard)
ERROR INVALID (src/phase_ior.c:24) Write phase needed 103.465106s instead of stonewall 300s. Stonewall was hit at 103.5s
ERROR INVALID (src/main.c:396) Runtime of phase (104.060211) is below stonewall time. This shouldn't happen!
ERROR INVALID (src/main.c:402) Runtime is smaller than expected minimum runtime
[RESULT]       ior-easy-write        3.092830 GiB/s : time 104.060 seconds [INVALID]
ERROR INVALID (src/main.c:396) Runtime of phase (2.191084) is below stonewall time. This shouldn't happen!
ERROR INVALID (src/main.c:402) Runtime is smaller than expected minimum runtime
[RESULT]    mdtest-easy-write      178.031067 kIOPS : time 2.191 seconds [INVALID]
[      ]            timestamp        0.000000 kIOPS : time 0.003 seconds
ERROR INVALID (src/phase_ior.c:24) Write phase needed 6.626582s instead of stonewall 300s. Stonewall was hit at 6.3s
ERROR INVALID (src/main.c:396) Runtime of phase (6.666027) is below stonewall time. This shouldn't happen!
ERROR INVALID (src/main.c:402) Runtime is smaller than expected minimum runtime
[RESULT]       ior-hard-write        2.114133 GiB/s : time 6.666 seconds [INVALID]
ERROR INVALID (src/main.c:396) Runtime of phase (5.672756) is below stonewall time. This shouldn't happen!
ERROR INVALID (src/main.c:402) Runtime is smaller than expected minimum runtime
[RESULT]    mdtest-hard-write       59.615140 kIOPS : time 5.673 seconds [INVALID]

[swat7-02:06130] *** An error occurred in MPI_Comm_split_type
[swat7-02:06130] *** reported by process [1960837121,9]
[swat7-02:06130] *** on communicator MPI_COMM_WORLD
[swat7-02:06130] *** MPI_ERR_ARG: invalid argument of some other kind
[swat7-02:06130] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[swat7-02:06130] ***    and potentially your MPI job)
....(repeated)
And IO500 terminates. This is with Open MPI 4; with Open MPI 3.1.6, IO500 simply hangs at the same spot.

Would anyone have insight into what is going on here, and how I can fix it?

Thank you for your help.


Harms, Kevin
 

I'm not sure about what to expect from your nodes, but for IO-500:

The first set of complaints is about the runtime being too short: you need to adjust the parameters (see the sketch below) so that each write phase keeps going for the full stonewall time.
The second part, MPI_Comm_split_type failing, is complaining about the arguments... Can you try with an MPICH derivative? Maybe OpenMPI and MPICH differ in which split_type values they accept.
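
For the first point, the knobs live in the io500 config ini; roughly something like the following, assuming the standard ISC21 config layout (the values are only illustrative and need to be sized so each write phase stays busy for the full 300 s stonewall on your system):

    [ior-easy]
    # data written per rank; increase until the easy write phase no longer
    # drains before the stonewall
    blockSize = 150g

    [ior-hard]
    segmentCount = 2000000

    [mdtest-easy]
    # files created per rank
    n = 1000000

    [mdtest-hard]
    n = 500000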

kevin



Lombardi, Johann
 

Hi there,

 

The fio numbers do indeed look pretty low. Could you please tell us more about the configuration? It sounds like you have Optane pmem on all the nodes, right? How many DIMMs per node? How many engines do you run in total? Are you running fio from a node that is also a DAOS server? Could you please also share your yaml config file?
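
If it is easier, the DIMM population and region layout can be dumped on one of the servers with something like the commands below (a sketch, assuming the usual ipmctl/ndctl tools are installed):

    # DCPMM population per socket
    ipmctl show -topology
    # AppDirect regions (shows whether the DIMMs on each socket are interleaved)
    ipmctl show -region
    # pmem namespaces backing /dev/pmem0, /dev/pmem1, ...
    ndctl list -N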

 

Cheers,

Johann

 



Peter
 

Johann, my configuration is as follows:


4 nodes, 4 engines in total (1 per node)
2 x 128 GB Optane pmem DIMMs per socket, 2 sockets per node (DAOS is currently only using 1 socket per node)
(We also have 1 x 1.5 TB NVMe drive per node that we plan to eventually configure DAOS to use)
Nodes are connected with Mellanox EDR InfiniBand
CentOS 7.9; I have tried various MPI distributions.

The yaml file (generated via the auto config):

***********
port: 10001
transport_config:
  allow_insecure: true
  server_name: server
  client_cert_dir: /etc/daos/certs/clients
  ca_cert: /etc/daos/certs/daosCA.crt
  cert: /etc/daos/certs/server.crt
  key: /etc/daos/certs/server.key
servers: []
engines:
- targets: 16
  nr_xs_helpers: 3
  first_core: 0
  name: daos_server
  socket_dir: /var/run/daos_server
  log_file: /tmp/daos_engine.0.log
  scm_mount: /mnt/daos0
  scm_class: dcpm
  scm_list:
  - /dev/pmem0
  bdev_class: nvme
  provider: ofi+verbs;ofi_rxm
  fabric_iface: ib0
  fabric_iface_port: 31416
  pinned_numa_node: 0
disable_vfio: false
disable_vmd: true
nr_hugepages: 0
set_hugepages: false
control_log_mask: INFO
control_log_file: /tmp/daos_server.log
helper_log_file: ""
firmware_helper_log_file: ""
recreate_superblocks: false
fault_path: ""
name: daos_server
socket_dir: /var/run/daos_server
provider: ofi+verbs;ofi_rxm
modules: ""
access_points:
- 172.23.7.3:10001
fault_cb: ""
hyperthreads: false
path: ../etc/daos_server.yml
*******

And yes, I am running FIO from one of the nodes.

Is there anything you see that I should modify or investigate? Thank you very much for the help!


JACKSON Adrian
 

It would be sensible to increase the number of engines per node. For our
system, where we have 48 cores per node, we're running 12 engines per
socket, 24 per node. This might be too many, but I think 1 engine per
node is too few.

cheers

adrianj



JACKSON Adrian
 

Actually, it's been pointed out to me I was confusing engines and
targets. So ignore me. :)



Peter
 

Hello again,

I've tried some more things to improve these results: different DAOS versions (including packages from the YUM repo), different MPI versions, different DAOS configurations, etc.

I'm still unable to diagnose the issue; IOPS performance remains low for both FIO and IO500.

Does anyone have any input on how I can try to debug or resolve this issue?

Thanks for your help.


JACKSON Adrian
 

Hi,

Have you tried benchmarking the hardware directly, rather than through
DAOS? I.e. running some benchmarks against an ext4 filesystem mounted on
the Optane pmem on a single node, just to check that it gives you the
expected performance.
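
For example, something roughly like this against the ext4 mount point (the path, sizes and engine settings are placeholders; a DAX mount may need different direct/engine options):

    fio --name=pmem-randread --directory=/mnt/pmem0 \
        --rw=randread --bs=4k --size=4G --direct=1 \
        --ioengine=libaio --iodepth=16 --numjobs=8 \
        --group_reporting --time_based --runtime=30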

cheers

adrianj


Peter
 

Thank you for the reply,

Yes, I have mounted the Optane modules as an ext4 filesystem; a quick FIO test is able to achieve > 1 MIOPS.

My thought was that it might be network related; however, I've run some MPI benchmarks and the results line up with EDR InfiniBand.
Also, the 12.4 GB/s we get in the FIO sequential-read test shows we can do better than local-only performance.

The documentation mentions daos_test and daos_perf; are these still supported?


Lombardi, Johann
 

Hi Peter,

 

A few things to try/explore:

  • I don't think that we have ever tested with 2x pmem DIMMs per socket. Maybe you could try with DRAM instead of pmem to see whether the performance increases.
  • 16 targets might be too many for 2x pmem DIMMs. You could try reducing it to 8 targets and setting "nr_xs_helpers" to 0 (see the sketch below).
  • It sounds like you run the benchmark (fio, IO500) and the DAOS engine on the same node. There might be interference between the two. You could try changing the affinity of the benchmark so that it runs on CPU cores not used by the DAOS engine (e.g. with taskset(1) or mpirun args; also sketched below).
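
Concretely, something along these lines (a sketch only; the core list is a placeholder that depends on which cores your engine actually ends up using):

    # daos_server.yml engine section -- only the changed lines shown
    engines:
    - targets: 8
      nr_xs_helpers: 0

and then pin the benchmark to the other socket, e.g.:

    # core range is a placeholder -- use the cores of the socket the engine
    # is not pinned to
    taskset -c 24-47 fio ./examples/dfs.fio

    # or, for an MPI launch with Open MPI, something like:
    mpirun -np 16 --bind-to core --cpu-set 24-47 ./io500 config.ini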

 

Cheers,

Johann

 
