PSM2 can't open hfi unit: 0 (err=23)
Zhang, Jiafu
Hi Guys,
I get this error when more than 17 concurrent JVM processes call daos_init(). I searched the Omni-Path guide and tried two PSM2 environment variables, HFI_UNIT (unset or 0) and HFI_NO_CPUAFFINITY (unset or YES), but neither helped. Has anyone experienced the same issue?
Sometimes I get the stack trace below when running 18 processes.
java:96006 terminated with signal 11 at PC=7f126c2c8cb8 SP=7f13c8d81a90. Backtrace:
/opt/jdk1.8.0_191/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0x11c)[0x7f13c78b497c]
/opt/jdk1.8.0_191/jre/lib/amd64/server/libjvm.so(+0x907cd8)[0x7f13c78a7cd8]
/usr/lib64/libpthread.so.0(+0xf5f0)[0x7f13c877d5f0]
/home/spark/daos/install/lib/libna.so.2(+0x7cb8)[0x7f126c2c8cb8]
/home/spark/daos/install/lib/libna.so.2(+0xaeed)[0x7f126c2cbeed]
/home/spark/daos/install/lib/libna.so.2(NA_Initialize_opt+0x3af)[0x7f126c2c4b9f]
/home/spark/daos/install/lib/libcart.so.4(crt_hg_init+0x1d6)[0x7f126cfce826]
/home/spark/daos/install/lib/libcart.so.4(crt_init_opt+0x773)[0x7f126cfdff83]
/home/spark/daos/install/lib64/libdaos.so.0(daos_eq_lib_init+0x160)[0x7f126db43ee0]
/home/spark/daos/install/lib64/libdaos.so.0(daos_init+0x158)[0x7f126db47a08]
/tmp/daos6804109670564808304/libdaos-jni.so(JNI_OnLoad+0x261)[0x7f126ddf721d]
/opt/jdk1.8.0_191/jre/lib/amd64/libjava.so(Java_java_lang_ClassLoader_00024NativeLibrary_load+0x282)[0x7f13c66679d2]
[0x7f13b10186c7]
Thanks.
Oganezov, Alexander A
Hi Jiafu,
Can you provide daos logs from when this failure happens?
Thanks, ~~Alex.
Zhang, Jiafu
Hi Oganezov,
Here is the DAOS log. I am running as root, and no resource limits are set.
02/28-09:54:44.14 sr135 DAOS[288071/288104] crt INFO src/cart/crt_init.c:235 crt_init_opt() libcart version 4.3.1 initializing
02/28-09:54:44.50 sr135 DAOS[288083/288144] hg ERR # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:2083 # na_ofi_sep_open(): fi_scalable_ep() failed, rc: -12(Cannot allocate memory)
02/28-09:54:44.50 sr135 DAOS[288083/288144] hg ERR # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:1981 # na_ofi_endpoint_open(): na_ofi_sep_open() failed
02/28-09:54:44.50 sr135 DAOS[288083/288144] hg ERR # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:3057 # na_ofi_initialize(): Could not create endpoint for 10.100.0.35
02/28-09:54:44.50 sr135 DAOS[288083/288144] hg ERR # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na.c:309 # NA_Initialize_opt(): Could not initialize plugin
02/28-09:54:44.50 sr135 DAOS[288083/288144] hg ERR src/cart/crt_hg.c:527 crt_hg_init() Could not initialize NA class.
02/28-09:54:44.50 sr135 DAOS[288083/288144] crt ERR src/cart/crt_init.c:345 crt_init_opt() crt_hg_init failed rc: -1020.
02/28-09:54:44.50 sr135 DAOS[288083/288144] crt ERR src/cart/crt_init.c:413 crt_init_opt() crt_init failed, rc: -1020.
02/28-09:54:44.50 sr135 DAOS[288083/288144] client ERR src/client/api/event.c:93 daos_eq_lib_init() failed to initialize crt: DER_HG(-1020)
02/28-09:54:44.50 sr135 DAOS[288083/288144] client ERR src/client/api/init.c:159 daos_init() failed to initialize eq_lib: DER_HG(-1020)
02/28-09:54:44.50 sr135 DAOS[288051/288074] hg ERR # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:2083 # na_ofi_sep_open(): fi_scalable_ep() failed, rc: -12(Cannot allocate memory)
02/28-09:54:44.50 sr135 DAOS[288051/288074] hg ERR # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:1981 # na_ofi_endpoint_open(): na_ofi_sep_open() failed
02/28-09:54:44.50 sr135 DAOS[288051/288074] hg ERR # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:3057 # na_ofi_initialize(): Could not create endpoint for 10.100.0.35
02/28-09:54:44.50 sr135 DAOS[288051/288074] hg ERR # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na.c:309 # NA_Initialize_opt(): Could not initialize plugin
02/28-09:54:44.50 sr135 DAOS[288051/288074] hg ERR src/cart/crt_hg.c:527 crt_hg_init() Could not initialize NA class.
02/28-09:54:44.50 sr135 DAOS[288051/288074] crt ERR src/cart/crt_init.c:345 crt_init_opt() crt_hg_init failed rc: -1020.
02/28-09:54:44.50 sr135 DAOS[288051/288074] crt ERR src/cart/crt_init.c:413 crt_init_opt() crt_init failed, rc: -1020.
02/28-09:54:44.50 sr135 DAOS[288051/288074] client ERR src/client/api/event.c:93 daos_eq_lib_init() failed to initialize crt: DER_HG(-1020)
02/28-09:54:44.50 sr135 DAOS[288051/288074] client ERR src/client/api/init.c:159 daos_init() failed to initialize eq_lib: DER_HG(-1020)
02/28-09:54:44.50 sr135 DAOS[288071/288104] hg ERR # NA -- Error -- /home/spark/daos/_build.external/mercury/src/na/na_ofi.c:2083 # na_ofi_sep_open(): fi_scalable_ep() failed, rc: -12(Cannot allocate memory)
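(Editorial sketch, not part of the original message.) With several processes logging at once, the DER_HG(-1020) lines are only the propagated result; the first ERR record of each process is the more telling one. A minimal Python sketch for pulling that record out per PID might look like the following. The record layout is inferred from the excerpt above, and the names `first_error_per_pid`, `TIMESTAMP`, `RECORD` and `SAMPLE` are hypothetical helpers, not part of DAOS or Mercury.

```python
import re

# Assumed record layout, inferred from the log excerpt (not an official format):
#   MM/DD-HH:MM:SS.cc host DAOS[pid/tid] facility LEVEL message
TIMESTAMP = re.compile(r"\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{2}\s")
RECORD = re.compile(
    r"(?P<ts>\S+)\s+(?P<host>\S+)\s+DAOS\[(?P<pid>\d+)/(?P<tid>\d+)\]\s+"
    r"(?P<facility>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)",
    re.S,
)


def first_error_per_pid(text):
    """Return {pid: first ERR message}, in order of first appearance.

    Records are located by their timestamp, so the function also works
    when several records were pasted onto a single line.
    """
    starts = [m.start() for m in TIMESTAMP.finditer(text)]
    roots = {}
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        m = RECORD.match(text[begin:end].strip())
        if m and m.group("level") == "ERR":
            # setdefault keeps only the earliest ERR message per process
            roots.setdefault(int(m.group("pid")), m.group("msg"))
    return roots


# A few records copied from the log above, deliberately joined on one line.
SAMPLE = (
    "02/28-09:54:44.14 sr135 DAOS[288071/288104] crt INFO src/cart/crt_init.c:235 "
    "crt_init_opt() libcart version 4.3.1 initializing "
    "02/28-09:54:44.50 sr135 DAOS[288083/288144] hg ERR # NA -- Error -- "
    "/home/spark/daos/_build.external/mercury/src/na/na_ofi.c:2083 # "
    "na_ofi_sep_open(): fi_scalable_ep() failed, rc: -12(Cannot allocate memory) "
    "02/28-09:54:44.50 sr135 DAOS[288083/288144] client ERR src/client/api/init.c:159 "
    "daos_init() failed to initialize eq_lib: DER_HG(-1020) "
    "02/28-09:54:44.50 sr135 DAOS[288051/288074] hg ERR # NA -- Error -- "
    "/home/spark/daos/_build.external/mercury/src/na/na_ofi.c:2083 # "
    "na_ofi_sep_open(): fi_scalable_ep() failed, rc: -12(Cannot allocate memory)"
)

if __name__ == "__main__":
    for pid, msg in first_error_per_pid(SAMPLE).items():
        print(pid, msg)
```

On the sample records, both failing processes report the fi_scalable_ep() failure as their first error, which matches what the full log shows.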
Here is more of the Java client log.
sr135.288083hfi_userinit: assign_context command failed: Device or resource busy
sr135.288083hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3)
sr135.288083hfi_userinit: assign_context command failed: Device or resource busy
sr135.288083hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3)
sr135.288083hfi_userinit: assign_context command failed: Device or resource busy
sr135.288083hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3)
sr135.288083hfi_userinit: assign_context command failed: Device or resource busy
daos_init() failed with rc = -1020 error msg: DER_HG
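(Editorial sketch, not part of the original message.) To see which processes exhausted their retries when many clients write to the same console, the PSM2 lines can be tallied per PID. The sketch below assumes the `host.pid` prefix is glued directly to the function name, as in the output above; `tally_psm2`, `PREFIX` and `CLIENT_LOG` are hypothetical names introduced only for illustration.

```python
import re
from collections import Counter

# Assumed prefix shape, inferred from the client output: <host>.<pid><function>:
PREFIX = re.compile(r"(?P<host>[\w-]+)\.(?P<pid>\d+)(?P<func>[A-Za-z_]\w*): ")


def tally_psm2(text):
    """Count assign_context failures and retry notices per process ID."""
    matches = list(PREFIX.finditer(text))
    failures, retries = Counter(), Counter()
    for cur, nxt in zip(matches, matches[1:] + [None]):
        # The message runs from the end of this prefix to the next prefix
        end = nxt.start() if nxt else len(text)
        msg = text[cur.end():end]
        pid = int(cur.group("pid"))
        if "assign_context command failed" in msg:
            failures[pid] += 1
        if "trying again" in msg:
            retries[pid] += 1
    return failures, retries


# The client output quoted above, joined on one line as it was first posted.
CLIENT_LOG = (
    "sr135.288083hfi_userinit: assign_context command failed: Device or resource busy "
    "sr135.288083hfp_gen1_context_open: hfi_userinit: failed, trying again (1/3) "
    "sr135.288083hfi_userinit: assign_context command failed: Device or resource busy "
    "sr135.288083hfp_gen1_context_open: hfi_userinit: failed, trying again (2/3) "
    "sr135.288083hfi_userinit: assign_context command failed: Device or resource busy "
    "sr135.288083hfp_gen1_context_open: hfi_userinit: failed, trying again (3/3) "
    "sr135.288083hfi_userinit: assign_context command failed: Device or resource busy "
    "daos_init() failed with rc = -1020 error msg: DER_HG"
)

if __name__ == "__main__":
    failures, retries = tally_psm2(CLIENT_LOG)
    for pid in failures:
        print(pid, "failures:", failures[pid], "retries:", retries[pid])
```

For process 288083 this counts four assign_context failures and three retries, i.e. the initial attempt plus every retry failed before daos_init() returned DER_HG.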
thanks.
Zhang, Jiafu
Hi Oganezov,
I tried more PSM parameters, but it still didn’t help. I searched the Mercury issue list and found a similar one, https://github.com/mercury-hpc/mercury/issues/343, which it seems you reported. In that issue, memory allocation failed with the verbs provider when there were 16 clients per node.
Thanks.
Oganezov, Alexander A
Hi Jiafu,
I noticed in your log that you are using cart 4.3.1, which is a bit old; can you try with the latest DAOS and see if you still hit this problem?
Thanks, ~~Alex.
Zhang, Jiafu
Hi Oganezov,
I just tried the latest DAOS with mohamod’s DFS patch. I got the exact same error.
Thanks.