CPU and Log Activity after Server Startup


Steffen Christgau
 

Hi everybody,

I think, I managed to successfully install DAOS on a small test-bed
consisting of two nodes. Both machines run an up-to-date installation of
CentOS 7. In addition, libfabric 1.8.0, isal 2.27, pmix 3.1.3, Open MPI
4.0.1, hwloc 1.11.2 and protobuf 3.9.0 with protobuf-c 1.3.2 have been
manually compiled and installed on these two machines. DAOS v0.6 has
been compiled from the sources using the prebuilt packages from above.
DAOS runs under a dedicated user and group id (daos:daos), which have
access to the pmem and hugetable mount points. The machines which run
DAOS have NVDIMMs installed but no NVMe storage yet (still in delivery
:-(). However, the server is configured to use a file as NVMe emulation.

Using orterun as documented in the admin guide the server starts up on
both machines:

----------------------------------------------------------------
$ orterun -n 2 --map-by node --report-uri /tmp/daos_urifile
--enable-recovery daos_server -t 1 -i -o $DAOS_ROOT/etc/daos_server.yml


2019/08/14 15:44:10 config.go:109: debug: DAOS config read from
/opt/local/daos/0.6/etc/daos_server.yml
2019/08/14 15:44:10 config.go:150: debug: Active config saved to
/opt/local/daos/0.6/etc/.daos_server.active.yml (read-only)
2019/08/14 15:44:10 config.go:361: debug: Switching control log level to
DEBUG
2019/08/14 15:44:10 config.go:376: debug: daos_server logging to file
/var/log/daos/control.log
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: spdk -c 0x1 --file-prefix=spdk1077140161
--base-virtaddr=0x200000000000 --proc-type=auto ]
2019/08/14 15:44:10 config.go:109: debug: DAOS config read from
/opt/local/daos/0.6/etc/daos_server.yml
2019/08/14 15:44:10 config.go:150: debug: Active config saved to
/opt/local/daos/0.6/etc/.daos_server.active.yml (read-only)
2019/08/14 15:44:10 config.go:361: debug: Switching control log level to
DEBUG
2019/08/14 15:44:10 config.go:376: debug: daos_server logging to file
/var/log/daos/control.log
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: spdk -c 0x1 --file-prefix=spdk990821539
--base-virtaddr=0x200000000000 --proc-type=auto ]
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: PRIMARY
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: PRIMARY
EAL: Multi-process socket /opt/local/daos/home/.spdk990821539_unix
EAL: Probing VFIO support...
EAL: Multi-process socket /opt/local/daos/home/.spdk1077140161_unix
EAL: Probing VFIO support...
no NVMe controllers found
no NVMe controllers found
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: daos -c 0x1 --file-prefix=spdk990821539
--base-virtaddr=0x200000000000 --proc-type=auto ]
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: SECONDARY
EAL: Multi-process socket
/opt/local/daos/home/.spdk990821539_unix_23769_bc38cb16a9404
EAL: Probing VFIO support...
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: daos -c 0x1 --file-prefix=spdk1077140161
--base-virtaddr=0x200000000000 --proc-type=auto ]
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: SECONDARY
EAL: Multi-process socket
/opt/local/daos/home/.spdk1077140161_unix_38799_10f8cc2f18d7bc
EAL: Probing VFIO support...
DAOS I/O server (v0.6.0) process 38799 started on rank 0 (out of 2) with
1 target xstream set(s), 2 helper XS per target, firstcore 0.
DAOS I/O server (v0.6.0) process 23769 started on rank 1 (out of 2) with
1 target xstream set(s), 2 helper XS per target, firstcore 0.
----------------------------------------------------------------

What makes me wonder is that four CPU cores of every node are 100% busy
which is caused by threads of daos_io_server process. Attaching gdb to
the process/one of these threads show that the activity appears to come
from the Argobots framework or at least it hangs in
ABTI_xstream_schedule_thread (among others) when I interrupt.

Is this an intended behavior or do I have take any actions to remove
this behavior?

In addition, but maybe unrelated to that issue, the following lines are
added log server file approx. every 2 seconds:

----------------------------------------------------------------
08/14-15:47:29.15 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:635
swim_progress() 0: dping 0 => 1
08/14-15:47:29.15 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 => 1: {1 A 0} {0 A 0}
08/14-15:47:29.15 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 <= 1: {1 A 0}
08/14-15:47:29.22 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 <= 1: {0 A 0} {1 A 0}
08/14-15:47:29.22 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:842
swim_parse_message() 0: dresp 0 => 1
08/14-15:47:29.22 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 => 1: {0 A 0}
----------------------------------------------------------------
(looks like a heartbeat ping)

Since those lines are only INFOs is it safe to ignore them?

Thanks in advance for your responses.

Regards, Steffen


Li, Wei G
 

Hi Steffen,

The CPU activity is expected. daos_io_server uses Argobots to execute multiple tasks on each pthread. Our scheduler does not yet have a complete knowledge of all expected events in order to decide when it is safe to block each pthread without risking an undesirable latency for the forthcoming event. The pthreads therefore keep polling all event sources in loops.

The swim INFO messages are harmless, though I would agree that they should be less verbose.

Regards,
Li Wei

-----Original Message-----
From: daos@daos.groups.io [mailto:daos@daos.groups.io] On Behalf Of Steffen
Christgau
Sent: Wednesday, August 14, 2019 11:03 PM
To: daos@daos.groups.io
Subject: [daos] CPU and Log Activity after Server Startup

Hi everybody,

I think, I managed to successfully install DAOS on a small test-bed
consisting of two nodes. Both machines run an up-to-date installation of
CentOS 7. In addition, libfabric 1.8.0, isal 2.27, pmix 3.1.3, Open MPI
4.0.1, hwloc 1.11.2 and protobuf 3.9.0 with protobuf-c 1.3.2 have been
manually compiled and installed on these two machines. DAOS v0.6 has
been compiled from the sources using the prebuilt packages from above.
DAOS runs under a dedicated user and group id (daos:daos), which have
access to the pmem and hugetable mount points. The machines which run
DAOS have NVDIMMs installed but no NVMe storage yet (still in delivery
:-(). However, the server is configured to use a file as NVMe emulation.

Using orterun as documented in the admin guide the server starts up on
both machines:

----------------------------------------------------------------
$ orterun -n 2 --map-by node --report-uri /tmp/daos_urifile
--enable-recovery daos_server -t 1 -i -o $DAOS_ROOT/etc/daos_server.yml


2019/08/14 15:44:10 config.go:109: debug: DAOS config read from
/opt/local/daos/0.6/etc/daos_server.yml
2019/08/14 15:44:10 config.go:150: debug: Active config saved to
/opt/local/daos/0.6/etc/.daos_server.active.yml (read-only)
2019/08/14 15:44:10 config.go:361: debug: Switching control log level to
DEBUG
2019/08/14 15:44:10 config.go:376: debug: daos_server logging to file
/var/log/daos/control.log
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: spdk -c 0x1 --file-prefix=spdk1077140161
--base-virtaddr=0x200000000000 --proc-type=auto ]
2019/08/14 15:44:10 config.go:109: debug: DAOS config read from
/opt/local/daos/0.6/etc/daos_server.yml
2019/08/14 15:44:10 config.go:150: debug: Active config saved to
/opt/local/daos/0.6/etc/.daos_server.active.yml (read-only)
2019/08/14 15:44:10 config.go:361: debug: Switching control log level to
DEBUG
2019/08/14 15:44:10 config.go:376: debug: daos_server logging to file
/var/log/daos/control.log
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: spdk -c 0x1 --file-prefix=spdk990821539
--base-virtaddr=0x200000000000 --proc-type=auto ]
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: PRIMARY
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: PRIMARY
EAL: Multi-process socket /opt/local/daos/home/.spdk990821539_unix
EAL: Probing VFIO support...
EAL: Multi-process socket /opt/local/daos/home/.spdk1077140161_unix
EAL: Probing VFIO support...
no NVMe controllers found
no NVMe controllers found
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: daos -c 0x1 --file-prefix=spdk990821539
--base-virtaddr=0x200000000000 --proc-type=auto ]
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: SECONDARY
EAL: Multi-process socket
/opt/local/daos/home/.spdk990821539_unix_23769_bc38cb16a9404
EAL: Probing VFIO support...
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: daos -c 0x1 --file-prefix=spdk1077140161
--base-virtaddr=0x200000000000 --proc-type=auto ]
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: SECONDARY
EAL: Multi-process socket
/opt/local/daos/home/.spdk1077140161_unix_38799_10f8cc2f18d7bc
EAL: Probing VFIO support...
DAOS I/O server (v0.6.0) process 38799 started on rank 0 (out of 2) with
1 target xstream set(s), 2 helper XS per target, firstcore 0.
DAOS I/O server (v0.6.0) process 23769 started on rank 1 (out of 2) with
1 target xstream set(s), 2 helper XS per target, firstcore 0.
----------------------------------------------------------------

What makes me wonder is that four CPU cores of every node are 100% busy
which is caused by threads of daos_io_server process. Attaching gdb to
the process/one of these threads show that the activity appears to come
from the Argobots framework or at least it hangs in
ABTI_xstream_schedule_thread (among others) when I interrupt.

Is this an intended behavior or do I have take any actions to remove
this behavior?

In addition, but maybe unrelated to that issue, the following lines are
added log server file approx. every 2 seconds:

----------------------------------------------------------------
08/14-15:47:29.15 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:635
swim_progress() 0: dping 0 => 1
08/14-15:47:29.15 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 => 1: {1 A 0} {0 A 0}
08/14-15:47:29.15 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 <= 1: {1 A 0}
08/14-15:47:29.22 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 <= 1: {0 A 0} {1 A 0}
08/14-15:47:29.22 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:842
swim_parse_message() 0: dresp 0 => 1
08/14-15:47:29.22 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 => 1: {0 A 0}
----------------------------------------------------------------
(looks like a heartbeat ping)

Since those lines are only INFOs is it safe to ignore them?

Thanks in advance for your responses.

Regards, Steffen


Steffen Christgau
 

Thank you for the clarification, Li.

Regards, Steffen

On 15/08/2019 05.33, Li, Wei G wrote:
Hi Steffen,

The CPU activity is expected. daos_io_server uses Argobots to execute multiple tasks on each pthread. Our scheduler does not yet have a complete knowledge of all expected events in order to decide when it is safe to block each pthread without risking an undesirable latency for the forthcoming event. The pthreads therefore keep polling all event sources in loops.

The swim INFO messages are harmless, though I would agree that they should be less verbose.

Regards,
Li Wei

-----Original Message-----
From: daos@daos.groups.io [mailto:daos@daos.groups.io] On Behalf Of Steffen
Christgau
Sent: Wednesday, August 14, 2019 11:03 PM
To: daos@daos.groups.io
Subject: [daos] CPU and Log Activity after Server Startup

Hi everybody,

I think, I managed to successfully install DAOS on a small test-bed
consisting of two nodes. Both machines run an up-to-date installation of
CentOS 7. In addition, libfabric 1.8.0, isal 2.27, pmix 3.1.3, Open MPI
4.0.1, hwloc 1.11.2 and protobuf 3.9.0 with protobuf-c 1.3.2 have been
manually compiled and installed on these two machines. DAOS v0.6 has
been compiled from the sources using the prebuilt packages from above.
DAOS runs under a dedicated user and group id (daos:daos), which have
access to the pmem and hugetable mount points. The machines which run
DAOS have NVDIMMs installed but no NVMe storage yet (still in delivery
:-(). However, the server is configured to use a file as NVMe emulation.

Using orterun as documented in the admin guide the server starts up on
both machines:

----------------------------------------------------------------
$ orterun -n 2 --map-by node --report-uri /tmp/daos_urifile
--enable-recovery daos_server -t 1 -i -o $DAOS_ROOT/etc/daos_server.yml


2019/08/14 15:44:10 config.go:109: debug: DAOS config read from
/opt/local/daos/0.6/etc/daos_server.yml
2019/08/14 15:44:10 config.go:150: debug: Active config saved to
/opt/local/daos/0.6/etc/.daos_server.active.yml (read-only)
2019/08/14 15:44:10 config.go:361: debug: Switching control log level to
DEBUG
2019/08/14 15:44:10 config.go:376: debug: daos_server logging to file
/var/log/daos/control.log
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: spdk -c 0x1 --file-prefix=spdk1077140161
--base-virtaddr=0x200000000000 --proc-type=auto ]
2019/08/14 15:44:10 config.go:109: debug: DAOS config read from
/opt/local/daos/0.6/etc/daos_server.yml
2019/08/14 15:44:10 config.go:150: debug: Active config saved to
/opt/local/daos/0.6/etc/.daos_server.active.yml (read-only)
2019/08/14 15:44:10 config.go:361: debug: Switching control log level to
DEBUG
2019/08/14 15:44:10 config.go:376: debug: daos_server logging to file
/var/log/daos/control.log
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: spdk -c 0x1 --file-prefix=spdk990821539
--base-virtaddr=0x200000000000 --proc-type=auto ]
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: PRIMARY
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: PRIMARY
EAL: Multi-process socket /opt/local/daos/home/.spdk990821539_unix
EAL: Probing VFIO support...
EAL: Multi-process socket /opt/local/daos/home/.spdk1077140161_unix
EAL: Probing VFIO support...
no NVMe controllers found
no NVMe controllers found
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: daos -c 0x1 --file-prefix=spdk990821539
--base-virtaddr=0x200000000000 --proc-type=auto ]
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: SECONDARY
EAL: Multi-process socket
/opt/local/daos/home/.spdk990821539_unix_23769_bc38cb16a9404
EAL: Probing VFIO support...
Starting SPDK v18.07-pre / DPDK 18.02.0 initialization...
[ DPDK EAL parameters: daos -c 0x1 --file-prefix=spdk1077140161
--base-virtaddr=0x200000000000 --proc-type=auto ]
EAL: Detected 96 lcore(s)
EAL: Auto-detected process type: SECONDARY
EAL: Multi-process socket
/opt/local/daos/home/.spdk1077140161_unix_38799_10f8cc2f18d7bc
EAL: Probing VFIO support...
DAOS I/O server (v0.6.0) process 38799 started on rank 0 (out of 2) with
1 target xstream set(s), 2 helper XS per target, firstcore 0.
DAOS I/O server (v0.6.0) process 23769 started on rank 1 (out of 2) with
1 target xstream set(s), 2 helper XS per target, firstcore 0.
----------------------------------------------------------------

What makes me wonder is that four CPU cores of every node are 100% busy
which is caused by threads of daos_io_server process. Attaching gdb to
the process/one of these threads show that the activity appears to come
from the Argobots framework or at least it hangs in
ABTI_xstream_schedule_thread (among others) when I interrupt.

Is this an intended behavior or do I have take any actions to remove
this behavior?

In addition, but maybe unrelated to that issue, the following lines are
added log server file approx. every 2 seconds:

----------------------------------------------------------------
08/14-15:47:29.15 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:635
swim_progress() 0: dping 0 => 1
08/14-15:47:29.15 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 => 1: {1 A 0} {0 A 0}
08/14-15:47:29.15 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 <= 1: {1 A 0}
08/14-15:47:29.22 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 <= 1: {0 A 0} {1 A 0}
08/14-15:47:29.22 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:842
swim_parse_message() 0: dresp 0 => 1
08/14-15:47:29.22 apass2 DAOS[38799/38804] swim INFO src/swim/swim.c:73
swim_dump_updates() 0 => 1: {0 A 0}
----------------------------------------------------------------
(looks like a heartbeat ping)

Since those lines are only INFOs is it safe to ignore them?

Thanks in advance for your responses.

Regards, Steffen