PSM2 can't open hfi unit: 0 (err=23)


Zhang, Jiafu
 

Hi Guys,

 

I get this error when more than 17 concurrent JVM processes call daos_init(). I searched the Omni-Path guide and tried two PSM2 environment variables, HFI_UNIT (unset or 0) and HFI_NO_CPUAFFINITY (unset or YES), but neither helped. Have you seen the same issue?
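
For anyone trying to reproduce the process-count threshold, a small launcher that starts N processes at once with the PSM2 variables set can help. This is only a sketch: the child command here is a stand-in that prints the variable it sees, not a DAOS client, and the helper name `launch_concurrently` is hypothetical.

```python
import os
import subprocess
import sys


def launch_concurrently(n, cmd, extra_env):
    """Start n copies of cmd at the same time, each with extra_env added."""
    env = dict(os.environ, **extra_env)
    # Start every child before waiting on any of them, so they overlap.
    procs = [
        subprocess.Popen(cmd, env=env, stdout=subprocess.PIPE, text=True)
        for _ in range(n)
    ]
    results = []
    for proc in procs:
        out, _ = proc.communicate()
        results.append((proc.returncode, out.strip()))
    return results


# Stand-in child process (assumption): replace with the real client command.
CHILD = [sys.executable, "-c", "import os; print(os.environ.get('HFI_UNIT'))"]

results = launch_concurrently(
    18, CHILD, {"HFI_UNIT": "0", "HFI_NO_CPUAFFINITY": "YES"}
)
failures = [r for r in results if r[0] != 0]
print(len(results), "started,", len(failures), "failed")
```

Varying the first argument around 17 and 18 would show whether the failure tracks the number of concurrent processes rather than the variable settings.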

 

Sometimes I also got the backtrace below when running 18 processes.

 

java:96006 terminated with signal 11 at PC=7f126c2c8cb8 SP=7f13c8d81a90.  Backtrace:

/opt/jdk1.8.0_191/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x11c)[0x7f13c78b497c]

/opt/jdk1.8.0_191/jre/lib/amd64/server/libjvm.so(+0x907cd8)[0x7f13c78a7cd8]

/usr/lib64/libpthread.so.0(+0xf5f0)[0x7f13c877d5f0]

/home/spark/daos/install/lib/libna.so.2(+0x7cb8)[0x7f126c2c8cb8]

/home/spark/daos/install/lib/libna.so.2(+0xaeed)[0x7f126c2cbeed]

/home/spark/daos/install/lib/libna.so.2(NA_Initialize_opt+0x3af)[0x7f126c2c4b9f]

/home/spark/daos/install/lib/libcart.so.4(crt_hg_init+0x1d6)[0x7f126cfce826]

/home/spark/daos/install/lib/libcart.so.4(crt_init_opt+0x773)[0x7f126cfdff83]

/home/spark/daos/install/lib64/libdaos.so.0(daos_eq_lib_init+0x160)[0x7f126db43ee0]

/home/spark/daos/install/lib64/libdaos.so.0(daos_init+0x158)[0x7f126db47a08]

/tmp/daos6804109670564808304/libdaos-jni.so(JNI_OnLoad+0x261)[0x7f126ddf721d]

/opt/jdk1.8.0_191/jre/lib/amd64/libjava.so(Java_java_lang_ClassLoader_00024NativeLibrary_load+0x282)[0x7f13c66679d2]

[0x7f13b10186c7]
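
The faulting frame can be picked out by matching the reported PC against the frame addresses. A minimal Python sketch, assuming the `path(symbol+0xoffset)[0xaddress]` frame layout shown above; the helper name `faulting_frame` is hypothetical and only a few frames are copied in:

```python
import re

# PC value from the "terminated with signal 11 at PC=..." line above.
PC = 0x7F126C2C8CB8

# A few frames copied from the backtrace above.
FRAMES = """\
/usr/lib64/libpthread.so.0(+0xf5f0)[0x7f13c877d5f0]
/home/spark/daos/install/lib/libna.so.2(+0x7cb8)[0x7f126c2c8cb8]
/home/spark/daos/install/lib/libna.so.2(+0xaeed)[0x7f126c2cbeed]
/home/spark/daos/install/lib/libna.so.2(NA_Initialize_opt+0x3af)[0x7f126c2c4b9f]
"""

FRAME_RE = re.compile(
    r"^(?P<lib>[^(]+)\((?P<sym>[^+)]*)\+(?P<off>0x[0-9a-f]+)\)"
    r"\[(?P<addr>0x[0-9a-f]+)\]$"
)


def faulting_frame(frames_text, pc):
    """Return (library, symbol or None, offset) for the frame at address pc."""
    for line in frames_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m and int(m.group("addr"), 16) == pc:
            return m.group("lib"), m.group("sym") or None, m.group("off")
    return None


frame = faulting_frame(FRAMES, PC)
print(frame)
```

In this backtrace the match is an unnamed function at offset 0x7cb8 in libna.so.2, reached from NA_Initialize_opt, so the segfault happens inside the NA layer during initialization rather than in the JVM itself.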

 

Thanks.


Oganezov, Alexander A
 

Hi Jiafu,

 

Can you provide daos logs from when this failure happens?

 

Thanks,

~~Alex.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Zhang, Jiafu
Sent: Wednesday, February 26, 2020 7:54 PM
To: daos@daos.groups.io
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: [daos] PSM2 can't open hfi unit: 0 (err=23)

 



Zhang, Jiafu
 

Hi Oganezov,

 

Here is the DAOS log. I am running as the root user, and no resource limits are set.

 

02/28-09:54:44.14 sr135 DAOS[288071/288104] crt  INFO src/cart/crt_init.c:235 crt_init_opt() libcart version 4.3.1 initializing

02/28-09:54:44.50 sr135 DAOS[288083/288144] hg   ERR  # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:2083

# na_ofi_sep_open(): fi_scalable_ep() failed, rc: -12(Cannot allocate memory)

02/28-09:54:44.50 sr135 DAOS[288083/288144] hg   ERR  # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:1981

# na_ofi_endpoint_open(): na_ofi_sep_open() failed

02/28-09:54:44.50 sr135 DAOS[288083/288144] hg   ERR  # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:3057

# na_ofi_initialize(): Could not create endpoint for 10.100.0.35

02/28-09:54:44.50 sr135 DAOS[288083/288144] hg   ERR  # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na.c:309

# NA_Initialize_opt(): Could not initialize plugin

02/28-09:54:44.50 sr135 DAOS[288083/288144] hg   ERR  src/cart/crt_hg.c:527 crt_hg_init() Could not initialize NA class.

02/28-09:54:44.50 sr135 DAOS[288083/288144] crt  ERR  src/cart/crt_init.c:345 crt_init_opt() crt_hg_init failed rc: -1020.

02/28-09:54:44.50 sr135 DAOS[288083/288144] crt  ERR  src/cart/crt_init.c:413 crt_init_opt() crt_init failed, rc: -1020.

02/28-09:54:44.50 sr135 DAOS[288083/288144] client ERR  src/client/api/event.c:93 daos_eq_lib_init() failed to initialize crt: DER_HG(-1020)

02/28-09:54:44.50 sr135 DAOS[288083/288144] client ERR  src/client/api/init.c:159 daos_init() failed to initialize eq_lib: DER_HG(-1020)

02/28-09:54:44.50 sr135 DAOS[288051/288074] hg   ERR  # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:2083

# na_ofi_sep_open(): fi_scalable_ep() failed, rc: -12(Cannot allocate memory)

02/28-09:54:44.50 sr135 DAOS[288051/288074] hg   ERR  # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:1981

# na_ofi_endpoint_open(): na_ofi_sep_open() failed

02/28-09:54:44.50 sr135 DAOS[288051/288074] hg   ERR  # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:3057

# na_ofi_initialize(): Could not create endpoint for 10.100.0.35

02/28-09:54:44.50 sr135 DAOS[288051/288074] hg   ERR  # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na.c:309

# NA_Initialize_opt(): Could not initialize plugin

02/28-09:54:44.50 sr135 DAOS[288051/288074] hg   ERR  src/cart/crt_hg.c:527 crt_hg_init() Could not initialize NA class.

02/28-09:54:44.50 sr135 DAOS[288051/288074] crt  ERR  src/cart/crt_init.c:345 crt_init_opt() crt_hg_init failed rc: -1020.

02/28-09:54:44.50 sr135 DAOS[288051/288074] crt  ERR  src/cart/crt_init.c:413 crt_init_opt() crt_init failed, rc: -1020.

02/28-09:54:44.50 sr135 DAOS[288051/288074] client ERR  src/client/api/event.c:93 daos_eq_lib_init() failed to initialize crt: DER_HG(-1020)

02/28-09:54:44.50 sr135 DAOS[288051/288074] client ERR  src/client/api/init.c:159 daos_init() failed to initialize eq_lib: DER_HG(-1020)

02/28-09:54:44.50 sr135 DAOS[288071/288104] hg   ERR  # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:2083

# na_ofi_sep_open(): fi_scalable_ep() failed, rc: -12(Cannot allocate memory)
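
With many clients writing to the same log, it can help to group the failures by process ID. A minimal Python sketch, assuming the `DAOS[pid/tid] facility LEVEL message` prefix seen in the lines above; the helper name `failed_pids` is hypothetical and only three sample lines are copied in:

```python
import re

# Sample lines copied from the log above.
LOG = """\
02/28-09:54:44.14 sr135 DAOS[288071/288104] crt  INFO src/cart/crt_init.c:235 crt_init_opt() libcart version 4.3.1 initializing
02/28-09:54:44.50 sr135 DAOS[288083/288144] client ERR  src/client/api/init.c:159 daos_init() failed to initialize eq_lib: DER_HG(-1020)
02/28-09:54:44.50 sr135 DAOS[288051/288074] client ERR  src/client/api/init.c:159 daos_init() failed to initialize eq_lib: DER_HG(-1020)
"""

LINE_RE = re.compile(
    r"DAOS\[(?P<pid>\d+)/(?P<tid>\d+)\]\s+(?P<facility>\w+)\s+"
    r"(?P<level>\w+)\s+(?P<msg>.*)"
)


def failed_pids(log_text):
    """Return the sorted PIDs whose daos_init() call logged a failure."""
    pids = set()
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m and m.group("level") == "ERR" and "daos_init() failed" in m.group("msg"):
            pids.add(int(m.group("pid")))
    return sorted(pids)


pids = failed_pids(LOG)
print(len(pids), "processes failed daos_init():", pids)
```

Run against the full log, the count of failing PIDs can be compared with the number of processes launched.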

 

 

 

 

 

Here is more of the Java client log.

 

sr135.288083hfi_userinit: assign_context command failed: Device or resource busy

sr135.288083hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)

sr135.288083hfi_userinit: assign_context command failed: Device or resource busy

sr135.288083hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)

sr135.288083hfi_userinit: assign_context command failed: Device or resource busy

sr135.288083hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)

sr135.288083hfi_userinit: assign_context command failed: Device or resource busy

daos_init() failed with rc = -1020

error msg: DER_HG
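
The two system error strings in these logs, "Cannot allocate memory" (rc -12 from fi_scalable_ep()) and "Device or resource busy" (from hfi_userinit), read like standard Linux errno texts. A small Python sketch to look them up, assuming a Linux host; the helper name `describe` is hypothetical. Note that DER_HG(-1020) is a DAOS return code, not an errno, so it is not a valid input here.

```python
import errno
import os


def describe(rc):
    """Map a negative return code to (errno name, system error text)."""
    code = abs(rc)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# rc reported by fi_scalable_ep() in the DAOS log.
print(describe(-12))

# errno constant whose message matches the hfi_userinit failure text on Linux.
print(errno.EBUSY, describe(-errno.EBUSY))
```

If the names come back as ENOMEM and EBUSY, that would suggest the "memory" error is reported after the device refused to assign another context, not after host RAM ran out; that reading is an inference from the log order, not something the logs state.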

 

 

thanks.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Oganezov, Alexander A
Sent: Thursday, February 27, 2020 12:41 PM
To: daos@daos.groups.io
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: [daos] PSM2 can't open hfi unit: 0 (err=23)

 



Zhang, Jiafu
 

Hi Oganezov,

 

I tried more PSM parameters, but none of them helped. I searched the Mercury issue list and found a similar one, https://github.com/mercury-hpc/mercury/issues/343, which it seems you reported. In that issue, memory allocation failed with the verbs provider when there were 16 clients per node.

 

Thanks.

 

From: Zhang, Jiafu
Sent: Friday, February 28, 2020 10:04 AM
To: daos@daos.groups.io
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: RE: [daos] PSM2 can't open hfi unit: 0 (err=23)

 



Oganezov, Alexander A
 

Hi Jiafu,

 

I noticed in your log that you are using cart 4.3.1, which is a bit old. Can you try with the latest DAOS and see if you still hit this problem?

 

Thanks,

~~Alex.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Zhang, Jiafu
Sent: Thursday, February 27, 2020 9:42 PM
To: 'daos@daos.groups.io' <daos@daos.groups.io>
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: [daos] PSM2 can't open hfi unit: 0 (err=23)

 



Zhang, Jiafu
 

Hi Oganezov,

 

I just tried the latest DAOS with mohamod’s DFS patch and got exactly the same error.

 

Thanks.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Oganezov, Alexander A
Sent: Saturday, February 29, 2020 1:05 AM
To: daos@daos.groups.io
Cc: Zhu, Minming <minming.zhu@...>; Wang, Carson <carson.wang@...>; Guo, Chenzhao <chenzhao.guo@...>
Subject: Re: [daos] PSM2 can't open hfi unit: 0 (err=23)

 
