Mercury debug (and IB question)
Farrell, Patrick Arthur <patrick.farrell@...>
Good afternoon,
I am running latest DAOS (as of two hours ago; so including the Mercury update), and I rebuilt from scratch (deleted everything) to guarantee I had the latest everything. I'm also using Mellanox OFED.
I am interested in the output of this debug message in Mercury:
NA_LOG_DEBUG("Entering na_ofi_initialize() class_name %s, protocol_name %s,"
    " host_name %s", na_info->class_name, na_info->protocol_name, na_info->host_name);
(from na_ofi_initialize, of course)
(I am specifically interested because my daos_test instance seems to be ignoring my OFI_DOMAIN environment variable in favor of finding its own - I am getting this error on the server when daos_test tries to connect, despite having OFI_DOMAIN set to mlx5_2 in both client and server environments:
ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0
libfabric:148660:ofi_rxm:ep_ctrl:rxm_msg_ep_open():801<warn> unable to create msg_ep: -22
libfabric:148660:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1095<info> notify event 1
)
I'm trying to figure out how domain is getting set there, and that mercury debug message looks useful. But I can't get it to print out.
I've tried setting both HG_LOG_LEVEL and NA_HG_LOG_LEVEL to 'debug', but I'm not seeing anything either on my console or in the relevant log files.
I did notice Mercury debug is behind an #ifdef, but it looks to be enabled...?
So, hoping for help with the Mercury debug, and if anyone has an idea on the domain issue, that would be very interesting as well.
Thanks,
-Patrick
|
|
Oganezov, Alexander A
Adding Jerome from Mercury to answer the question about enabling debug at the Mercury level.
|
|
Rosenzweig, Joel B <joel.b.rosenzweig@...>
Hi Patrick,
Does the error “ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0” literally say “mlx5_2 and mlx5_0” or do you get two error messages, one with “mlx5_2” and the other with “mlx5_0”? I just want to make sure I understand the error correctly.
In src/control/server/server.go’s Start(), you will find this:
// Provide special handling for the ofi+verbs provider.
// Mercury uses the interface name such as ib0, while OFI uses the device name such as hfi1_0
// CaRT and Mercury will now support the new OFI_DOMAIN environment variable so that we can
// specify the correct device for each.
if strings.HasPrefix(srvCfg.Fabric.Provider, "ofi+verbs") && !srvCfg.HasEnvVar("OFI_DOMAIN") {
    deviceAlias, err := netdetect.GetDeviceAlias(srvCfg.Fabric.Interface)
    if err != nil {
        return errors.Wrapf(err, "failed to resolve alias for %s", srvCfg.Fabric.Interface)
    }
    envVar := "OFI_DOMAIN=" + deviceAlias
    srvCfg.WithEnvVars(envVar)
}
If the OFI_DOMAIN variable is set, daos_server should not override any setting you have for OFI_DOMAIN. Does your log show output from netdetect showing that it searched for and found a device alias? If GetDeviceAlias() executes, your debug log will show output from getDeviceAliasWithSystemList from these two debug messages:
// at function entry
log.Debugf("Searching for a device alias for: %s", device)
// at function exit if there wasn’t an error up to this point
log.Debugf("Device alias for %s is %s", device, C.GoString(node.name))
If there is no debug output like that, then the OFI_DOMAIN came from somewhere other than daos_server providing one. If daos_server is doing it, then it seems to think that OFI_DOMAIN was not defined for that particular IO server instance.
If you find that daos_server is providing the OFI_DOMAIN despite your environment variable setting, then go ahead and comment out that code so you can work around it. And, send me your daos_server.yml for starters so I can try to reproduce the erroneous response from HasEnvVar if that is actually a problem.
Regards, Joel
|
|
Colin Ngam
Hi,
Where should OFI_DOMAIN be set? Just exporting it as an environment variable before starting daos_server (“/root/daos/install/bin/daos_server start -o ./daos_server_local.yml”) is enough, right?
Thanks.
Colin
|
|
patrick.farrell@...
The OFI_DOMAIN env doesn't seem to work - I set the OFI_DOMAIN variable in my environment *but not in the yaml file*; specifically, I set it to "george". The server started just fine with no mention of 'george'.
When OFI_DOMAIN was set in the server yml file, again to 'george', the server failed to start with the sort of message you'd expect - No provider found on domain "george".
This issue is interesting and may be relevant to my problem. I'll reference this note in my reply to Joel's message.
-Patrick
|
|
Farrell, Patrick Arthur <patrick.farrell@...>
Literally precisely that message - It's just a copy/paste. Looking in the code, it is specifically complaining because those two strings are not the same and that's why the error is printing.
It looks like the first one is the domain on the server, and the second is the domain in the message (my "client" is a separate shell session on the server).
So, I do have OFI_DOMAIN set in the environment in my client session - it's set to mlx5_2.
Thinking now of my reply to Colin on this chain, though, I tried setting it to "george" there, and... no change. Exactly the same behavior, including the referenced error.
So it seems that the client is ignoring the OFI_DOMAIN variable and choosing mlx5_0 by itself, which is incorrect.
-Patrick
|
|
Farrell, Patrick Arthur <patrick.farrell@...>
Sorry, a correction there, it looks like I messed up setting the OFI_DOMAIN env on my client - It is being used, and when I set it to a nonsense value, I do get a failure because the domain doesn't exist.
So, this still leaves open the question of how/why the domain is coming up wrong in that message.
I think it would be very helpful if I could turn on CaRT/Mercury debug - Is anyone able to shed light on what's required to do that? As I mentioned in the email that prompted this chain, it seems to be compiled out by default, and I can't figure out how to turn it on.
-Patrick
|
|
Rosenzweig, Joel B <joel.b.rosenzweig@...>
Hi Patrick,
What domains does a fi_info -v scan show you? Does it list them separately, or is there some entry that actually says “mlx5_2 and mlx5_0”? The error message that shows “ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0” comes from libfabric, so perhaps the fi_info scan will show something interesting (unexpected) there.
I know that Alex inquired about the Mercury debug message help. I don’t know if there’s a status update on that.
Regards, Joel
|
|
Farrell, Patrick Arthur <patrick.farrell@...>
Joel,
In that message, the two domains are different parts of the print statement; they're different strings.
Here's the code:
if (strncmp(dom->verbs->device->name,
            info->domain_attr->name,
            strlen(dom->verbs->device->name))) {
    VERBS_INFO(FI_LOG_DOMAIN,
               "Invalid info->domain_attr->name: %s and %s\n",
               dom->verbs->device->name, info->domain_attr->name);
    return -FI_EINVAL;
}
Note the two %s and the different sources. That's in verbs_ep.c in ofi.
So, there's no reason to expect a single domain with that complex name.
However, the output of fi_info is *very* interesting. I see there's a verbs;ofi_rxm provider listed for mlx5_0, which is interesting because while that's a Mellanox card, it's an *ethernet* card and it's in ethernet mode. I don't think verbs would work there.
Here's the output of fi_info - fi_info -v is over 6 thousand lines of output, so I can attach that if you want, but I figured I'd start with this.
-----
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_0 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_0 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_0-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_0-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_0-dgram version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_IB_UD
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_1-dgram version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_IB_UD
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_2 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_2 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_2-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_2-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_2-dgram version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_IB_UD
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_3 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_3 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_3-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_3-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_3-dgram version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_IB_UD
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_2 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_2 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_3 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_3 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: 172.30.222.0/24 domain: eno1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: fe80::/64 domain: eno1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: 10.0.0.0/24 domain: ib0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: 10.0.1.0/24 domain: ib1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: fe80::/64 domain: ib0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: fe80::/64 domain: ib1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: 127.0.0.1/32 domain: lo version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: ::1/128 domain: lo version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxd fabric: IB-0xfe80000000000000 domain: mlx5_0-dgram version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd fabric: IB-0xfe80000000000000 domain: mlx5_1-dgram version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd fabric: IB-0xfe80000000000000 domain: mlx5_2-dgram version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd fabric: IB-0xfe80000000000000 domain: mlx5_3-dgram version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: 172.30.222.0/24 domain: eno1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: fe80::/64 domain: eno1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: 10.0.0.0/24 domain: ib0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: 10.0.1.0/24 domain: ib1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: fe80::/64 domain: ib0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: fe80::/64 domain: ib1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: 127.0.0.1/32 domain: lo version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: ::1/128 domain: lo version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: shm fabric: shm domain: shm version: 1.1 type: FI_EP_RDM protocol: FI_PROTO_SHM
provider: UDP fabric: 172.30.222.0/24 domain: eno1 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: fe80::/64 domain: eno1 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: 10.0.0.0/24 domain: ib0 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: 10.0.1.0/24 domain: ib1 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: fe80::/64 domain: ib0 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: fe80::/64 domain: ib1 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: 127.0.0.1/32 domain: lo version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: ::1/128 domain: lo version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: tcp fabric: 172.30.222.0/24 domain: eno1 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: tcp fabric: fe80::/64 domain: eno1 version: 1.0 type: FI_EP_MSG protocol: 
FI_PROTO_SOCK_TCP provider: tcp fabric: 10.0.0.0/24 domain: ib0 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: tcp fabric: 10.0.1.0/24 domain: ib1 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: tcp fabric: fe80::/64 domain: ib0 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: tcp fabric: fe80::/64 domain: ib1 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: tcp fabric: 127.0.0.1/32 domain: lo version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: tcp fabric: ::1/128 domain: lo version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 172.30.222.0/24 domain: eno1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: fe80::/64 domain: eno1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 10.0.0.0/24 domain: ib0 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 10.0.1.0/24 domain: ib1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: fe80::/64 domain: ib0 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: fe80::/64 domain: ib1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 127.0.0.1/32 domain: lo version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: ::1/128 domain: lo version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 172.30.222.0/24 domain: eno1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: fe80::/64 domain: eno1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 10.0.0.0/24 domain: ib0 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 10.0.1.0/24 domain: ib1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: fe80::/64 domain: ib0 version: 2.0 type: 
FI_EP_RDM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: fe80::/64 domain: ib1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 127.0.0.1/32 domain: lo version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: ::1/128 domain: lo version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 172.30.222.0/24 domain: eno1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: fe80::/64 domain: eno1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 10.0.0.0/24 domain: ib0 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 10.0.1.0/24 domain: ib1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: fe80::/64 domain: ib0 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: fe80::/64 domain: ib1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: 127.0.0.1/32 domain: lo version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP provider: sockets fabric: ::1/128 domain: lo version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Friday, March 20, 2020 12:18 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)
Hi Patrick,
What domains does a fi_info -v scan show you? Does it list them separately, or is there some entry that actually says “mlx5_2 and mlx5_0”? The error message that shows “ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0” comes from libfabric, so perhaps the fi_info scan will show something interesting (unexpected) there.
I know that Alex inquired about the Mercury debug message help. I don’t know if there’s a status update on that.
Regards, Joel
From: daos@daos.groups.io <daos@daos.groups.io>
On Behalf Of Farrell, Patrick Arthur
Sorry, a correction there, it looks like I messed up setting the OFI_DOMAIN env on my client - It is being used, and when I set it to a nonsense value, I do get a failure because the domain doesn't exist.
So, this still leaves open the question of how/why the domain is coming up wrong in that message.
I think it would be very helpful if I could turn on CaRT/Mercury debug - Is anyone able to shed light on what's required to do that? Like I mentioned in the email that prompted this chain, it seems to be compiled out by default, and I can't figure out how to turn it on.
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Farrell, Patrick Arthur <patrick.farrell@...>
Literally precisely that message - It's just a copy/paste. Looking in the code, it is specifically complaining because those two strings are not the same and that's why the error is printing.
It looks like the first one is the domain on the server, and the second is the domain in the message (my "client" is a separate shell session on the server).
So, I do have OFI_DOMAIN set in the environment in my client session - it's set to mlx5_2.
Thinking now of my reply to Colin on this chain, though, I tried setting it to "george" there, and... no change. Exactly the same behavior, including the referenced error.
So it seems that the client is ignoring the OFI_DOMAIN variable and choosing mlx5_0 by itself, which is incorrect.
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Hi Patrick,
Does the error “ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0” literally say “mlx5_2 and mlx5_0” or do you get two error messages, one with “mlx5_2” and the other with “mlx5_0”? I just want to make sure I understand the error correctly.
In src/control/server/server.go’s Start(), you will find this:
// Provide special handling for the ofi+verbs provider.
// Mercury uses the interface name such as ib0, while OFI uses the device name such as hfi1_0
// CaRT and Mercury will now support the new OFI_DOMAIN environment variable so that we can
// specify the correct device for each.
if strings.HasPrefix(srvCfg.Fabric.Provider, "ofi+verbs") && !srvCfg.HasEnvVar("OFI_DOMAIN") {
    deviceAlias, err := netdetect.GetDeviceAlias(srvCfg.Fabric.Interface)
    if err != nil {
        return errors.Wrapf(err, "failed to resolve alias for %s", srvCfg.Fabric.Interface)
    }
    envVar := "OFI_DOMAIN=" + deviceAlias
    srvCfg.WithEnvVars(envVar)
}
If the OFI_DOMAIN variable is set, daos_server should not override any setting you have for OFI_DOMAIN. Does your log show output from netdetect showing that it searched for and found a device alias? If GetDeviceAlias() executes, your debug log will show output from getDeviceAliasWithSystemList from these two debug messages:
// at function entry
log.Debugf("Searching for a device alias for: %s", device)
// at function exit if there wasn’t an error up to this point
log.Debugf("Device alias for %s is %s", device, C.GoString(node.name))
If there is no debug output like that, then the OFI_DOMAIN came from somewhere other than daos_server providing one. If daos_server is doing it, then it seems to think that OFI_DOMAIN was not defined for that particular IO server instance.
If you find that daos_server is providing the OFI_DOMAIN despite your environment variable setting, then go ahead and comment out that code so you can work around it. And, send me your daos_server.yml for starters so I can try to reproduce the erroneous response from HasEnvVar if that is actually a problem.
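For readers following along, the override rule described above can be sketched in isolation. This is only an illustration, not the daos_server implementation: hasEnvVar and withDefaultDomain are hypothetical helpers standing in for srvCfg.HasEnvVar and the surrounding logic, and the environment is modelled as a plain list of NAME=value strings.

```go
package main

import (
	"fmt"
	"strings"
)

// hasEnvVar reports whether env (a list of "NAME=value" entries) already
// defines name. Hypothetical stand-in for srvCfg.HasEnvVar.
func hasEnvVar(env []string, name string) bool {
	prefix := name + "="
	for _, kv := range env {
		if strings.HasPrefix(kv, prefix) {
			return true
		}
	}
	return false
}

// withDefaultDomain appends OFI_DOMAIN=<alias> only for ofi+verbs providers
// and only when OFI_DOMAIN has not been set already. Hypothetical helper.
func withDefaultDomain(provider string, env []string, alias string) []string {
	if strings.HasPrefix(provider, "ofi+verbs") && !hasEnvVar(env, "OFI_DOMAIN") {
		return append(env, "OFI_DOMAIN="+alias)
	}
	return env
}

func main() {
	// A user-supplied OFI_DOMAIN is left alone.
	fmt.Println(withDefaultDomain("ofi+verbs;ofi_rxm", []string{"OFI_DOMAIN=mlx5_2"}, "mlx5_0"))
	// Without one, the detected alias is used.
	fmt.Println(withDefaultDomain("ofi+verbs;ofi_rxm", nil, "mlx5_0"))
	// Other providers are not touched.
	fmt.Println(withDefaultDomain("ofi+sockets", nil, "mlx5_0"))
}
```

If the real code behaves like this sketch, a mismatch on the server side would point to OFI_DOMAIN not being visible in that IO server's configuration rather than to the override itself.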
Regards, Joel
|
|
Farrell, Patrick Arthur <patrick.farrell@...>
Ah, scratch that confusion about having a verbs provider for ethernet - I see OFI has support for verbs over ethernet.
Anyway, just to see what would happen, I disabled my ethernet adapter, so mlx5_0 is no longer up.
After doing that, I was indeed able to get further. The only ofi_rxm; verbs provider is associated with the mlx5_2 domain, and everything worked - I did not see this issue with domain mismatch.
Of course, fi_mr_reg failed with -14, which is EFAULT.
So, two issues currently:
Interesting.
Johann,
I switched back to RAM for this test for simplicity and speed (as I am restarting the server a lot); you alluded to an error that occurs with RAM but not with PMEM or the other way around... Is this possibly it?
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Farrell, Patrick Arthur <patrick.farrell@...>
Sent: Friday, March 20, 2020 1:43 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)
Joel,
In that message, the two domains are different parts of the print statement, they're different strings.
Here's the code:
if (strncmp(dom->verbs->device->name, info->domain_attr->name,
            strlen(dom->verbs->device->name))) {
    VERBS_INFO(FI_LOG_DOMAIN,
               "Invalid info->domain_attr->name: %s and %s\n",
               dom->verbs->device->name, info->domain_attr->name);
    return -FI_EINVAL;
}
Note the two %s and the different sources. That's in verbs_ep.c in ofi.
So, there's no reason to expect a single domain with that complex name.
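To make the comparison concrete: the strncmp call is bounded by the length of the device name, so it accepts any domain name that begins with the device name. A minimal Go sketch of the same rule follows; domainMatchesDevice is a hypothetical name used only for illustration, and the domain names are the ones that appear in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// domainMatchesDevice mirrors strncmp(device, domain, strlen(device)) == 0:
// the domain name is accepted when it begins with the device name.
// Hypothetical helper, not part of libfabric.
func domainMatchesDevice(device, domain string) bool {
	return strings.HasPrefix(domain, device)
}

func main() {
	// The endpoint's device is mlx5_2; try a few requested domain names.
	for _, domain := range []string{"mlx5_2", "mlx5_2-xrc", "mlx5_0"} {
		fmt.Printf("device=mlx5_2 domain=%s match=%v\n",
			domain, domainMatchesDevice("mlx5_2", domain))
	}
}
```

Under this rule mlx5_2 and mlx5_2-xrc are accepted for device mlx5_2, while mlx5_0 is rejected, which is the case that produces the "Invalid info->domain_attr->name" message.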
However, the output of fi_info is *very* interesting. I see there's an ofi_rxm; verbs provider listed for mlx5_0, which is interesting because while that's a Mellanox card, it's an *ethernet* card and it's in ethernet mode. I don't think verbs would work there.
Here's the output of fi_info - fi_info -v is over 6 thousand lines of output, so I can attach that if you want, but I figured I'd start with this.
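Since the listing is long, a small filter can help pick out which domains a given provider advertises. The sketch below is an assumption-laden illustration: domainsFor and the sample constant are hypothetical, the sample holds only a handful of entries copied from the output in this thread, and the pattern assumes the "provider: ... fabric: ... domain: ..." field order seen there.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// entryRe captures the provider and domain of one fi_info entry, whether the
// fields are separated by spaces or by newlines.
var entryRe = regexp.MustCompile(`provider:\s+(\S+)\s+fabric:\s+\S+\s+domain:\s+(\S+)`)

// domainsFor returns the sorted, de-duplicated domains listed for provider.
func domainsFor(output, provider string) []string {
	seen := map[string]bool{}
	var domains []string
	for _, m := range entryRe.FindAllStringSubmatch(output, -1) {
		if m[1] == provider && !seen[m[2]] {
			seen[m[2]] = true
			domains = append(domains, m[2])
		}
	}
	sort.Strings(domains)
	return domains
}

// sample holds a few entries taken from the fi_info output in this thread.
const sample = `provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_2 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_3 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: 172.30.222.0/24 domain: eno1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxd fabric: IB-0xfe80000000000000 domain: mlx5_0-dgram version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD`

func main() {
	fmt.Println(domainsFor(sample, "verbs;ofi_rxm"))
}
```

For the sample entries, the verbs;ofi_rxm provider is listed for mlx5_0, mlx5_2 and mlx5_3, so mlx5_0 is a candidate even though that card is in ethernet mode.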
|
|
Farrell, Patrick Arthur <patrick.farrell@...>
Never mind - this memory registration failure is not related to using or not using dcpm.
I'll get more details and report that separately; it's only occurring in some cases.
So, that just leaves the problem of the client library ignoring OFI_DOMAIN.
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Farrell, Patrick Arthur <patrick.farrell@...>
Sent: Friday, March 20, 2020 2:15 PM To: daos@daos.groups.io <daos@daos.groups.io> Subject: Re: [daos] Mercury debug (and IB question)
Ah, scratch that confusion about having a verbs provider for ethernet - I see OFI has support for verbs over ethernet.
Anyway, just to see what would happen, I disabled my ethernet adapter, so mlx5_0 is no longer up.
After doing that, I was indeed able to get further. The only ofi_rxm; verbs provider is associated with the mlx5_2 domain, and everything worked - I did not see this issue with domain mismatch.
Of course, fi_mr_reg failed with -14, which is EFAULT.
So, two issues currently:
Interesting.
Johann,
I switched back to RAM for this test for simplicity and speed (as I am restarting the server a lot); you alluded to an error that occurs with RAM but not with PMEM or the other way around... Is this possibly it?
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Farrell, Patrick Arthur <patrick.farrell@...>
Sent: Friday, March 20, 2020 1:43 PM To: daos@daos.groups.io <daos@daos.groups.io> Subject: Re: [daos] Mercury debug (and IB question)
Joel,
In that message, the two domains are different parts of the print statement, they're different strings.
Here's the code:
if (strncmp(dom->verbs->device->name, info->domain_attr->name,
strlen(dom->verbs->device->name))) { VERBS_INFO(FI_LOG_DOMAIN, "Invalid info->domain_attr->name: %s and %s\n", dom->verbs->device->name, info->domain_attr->name); return -FI_EINVAL; } Note the two %s and the different sources. That's in vrbs_ep.c in ofi.
So, there's no reason to expect a single domain with that complex name.
However, the output of fi_info is *very* interesting. I see there's a verbs;ofi_rxm provider listed for mlx5_0,
which is interesting because while that's a Mellanox card, it's an *ethernet* card and it's in ethernet mode. I don't think verbs would work there.
Here's the output of fi_info - fi_info -v is over 6 thousand lines of output, so I can attach that if you want,
but I figured I'd start with this.
-----
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_0 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_0 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_0-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_0-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_0-dgram version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_IB_UD
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_1-dgram version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_IB_UD
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_2 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_2 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_2-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_2-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_2-dgram version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_IB_UD
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_3 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_3 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_3-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_3-xrc version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs fabric: IB-0xfe80000000000000 domain: mlx5_3-dgram version: 1.0 type: FI_EP_DGRAM protocol: FI_PROTO_IB_UD
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_2 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_2 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_3 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm fabric: IB-0xfe80000000000000 domain: mlx5_3 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: 172.30.222.0/24 domain: eno1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: fe80::/64 domain: eno1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: 10.0.0.0/24 domain: ib0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: 10.0.1.0/24 domain: ib1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: fe80::/64 domain: ib0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: fe80::/64 domain: ib1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: 127.0.0.1/32 domain: lo version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm fabric: ::1/128 domain: lo version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXM
provider: verbs;ofi_rxd fabric: IB-0xfe80000000000000 domain: mlx5_0-dgram version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd fabric: IB-0xfe80000000000000 domain: mlx5_1-dgram version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd fabric: IB-0xfe80000000000000 domain: mlx5_2-dgram version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd fabric: IB-0xfe80000000000000 domain: mlx5_3-dgram version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: 172.30.222.0/24 domain: eno1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: fe80::/64 domain: eno1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: 10.0.0.0/24 domain: ib0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: 10.0.1.0/24 domain: ib1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: fe80::/64 domain: ib0 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: fe80::/64 domain: ib1 version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: 127.0.0.1/32 domain: lo version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd fabric: ::1/128 domain: lo version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
provider: shm fabric: shm domain: shm version: 1.1 type: FI_EP_RDM protocol: FI_PROTO_SHM
provider: UDP fabric: 172.30.222.0/24 domain: eno1 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: fe80::/64 domain: eno1 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: 10.0.0.0/24 domain: ib0 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: 10.0.1.0/24 domain: ib1 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: fe80::/64 domain: ib0 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: fe80::/64 domain: ib1 version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: 127.0.0.1/32 domain: lo version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: UDP fabric: ::1/128 domain: lo version: 1.1 type: FI_EP_DGRAM protocol: FI_PROTO_UDP
provider: tcp fabric: 172.30.222.0/24 domain: eno1 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: tcp fabric: fe80::/64 domain: eno1 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: tcp fabric: 10.0.0.0/24 domain: ib0 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: tcp fabric: 10.0.1.0/24 domain: ib1 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: tcp fabric: fe80::/64 domain: ib0 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: tcp fabric: fe80::/64 domain: ib1 version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: tcp fabric: 127.0.0.1/32 domain: lo version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: tcp fabric: ::1/128 domain: lo version: 1.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.30.222.0/24 domain: eno1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: fe80::/64 domain: eno1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 10.0.0.0/24 domain: ib0 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 10.0.1.0/24 domain: ib1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: fe80::/64 domain: ib0 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: fe80::/64 domain: ib1 version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 127.0.0.1/32 domain: lo version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: ::1/128 domain: lo version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.30.222.0/24 domain: eno1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: fe80::/64 domain: eno1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 10.0.0.0/24 domain: ib0 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 10.0.1.0/24 domain: ib1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: fe80::/64 domain: ib0 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: fe80::/64 domain: ib1 version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 127.0.0.1/32 domain: lo version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: ::1/128 domain: lo version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 172.30.222.0/24 domain: eno1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: fe80::/64 domain: eno1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 10.0.0.0/24 domain: ib0 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 10.0.1.0/24 domain: ib1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: fe80::/64 domain: ib0 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: fe80::/64 domain: ib1 version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: 127.0.0.1/32 domain: lo version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
provider: sockets fabric: ::1/128 domain: lo version: 2.0 type: FI_EP_MSG protocol: FI_PROTO_SOCK_TCP
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Friday, March 20, 2020 12:18 PM To: daos@daos.groups.io <daos@daos.groups.io> Subject: Re: [daos] Mercury debug (and IB question)
Hi Patrick,
What domains does a fi_info -v scan show you? Does it list them separately, or is there some entry that actually says “mlx5_2 and mlx5_0”? The error message that shows “ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0” comes from libfabric, so perhaps the fi_info scan will show something interesting (unexpected) there.
I know that Alex inquired about the Mercury debug message help. I don’t know if there’s a status update on that.
Regards, Joel
From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Farrell, Patrick Arthur
Sorry, a correction there, it looks like I messed up setting the OFI_DOMAIN env on my client - It is being used, and when I set it to a nonsense value, I do get a failure because the domain doesn't exist.
So, this still leaves open the question of how/why the domain is coming up wrong in that message.
I think it would be very helpful if I could turn on CaRT/Mercury debug - Is anyone able to shed light on what's required to do that? Like I mentioned in the email that prompted this chain, it seems to be compiled out by default, and I can't figure out how to turn it on.
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Farrell, Patrick Arthur <patrick.farrell@...>
Literally precisely that message - It's just a copy/paste. Looking in the code, it is specifically complaining because those two strings are not the same and that's why the error is printing.
It looks like the first one is the domain on the server, and the second is the domain in the message (my "client" is a separate shell session on the server).
So, I do have OFI_DOMAIN set in the environment in my client session - it's set to mlx5_2.
Thinking now of my reply to Colin on this chain, though, I tried setting it to "george" there, and... no change. Exactly the same behavior, including the referenced error.
So it seems that the client is ignoring the OFI_DOMAIN variable and choosing mlx5_0 by itself, which is incorrect.
-Patrick
From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Hi Patrick,
Does the error “ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0” literally say “mlx5_2 and mlx5_0” or do you get two error messages, one with “mlx5_2” and the other with “mlx5_0”? I just want to make sure I understand the error correctly.
In src/control/server/server.go’s Start(), you will find this:
// Provide special handling for the ofi+verbs provider.
// Mercury uses the interface name such as ib0, while OFI uses the device name such as hfi1_0
// CaRT and Mercury will now support the new OFI_DOMAIN environment variable so that we can
// specify the correct device for each.
if strings.HasPrefix(srvCfg.Fabric.Provider, "ofi+verbs") && !srvCfg.HasEnvVar("OFI_DOMAIN") {
	deviceAlias, err := netdetect.GetDeviceAlias(srvCfg.Fabric.Interface)
	if err != nil {
		return errors.Wrapf(err, "failed to resolve alias for %s", srvCfg.Fabric.Interface)
	}
	envVar := "OFI_DOMAIN=" + deviceAlias
	srvCfg.WithEnvVars(envVar)
}
If the OFI_DOMAIN variable is set, daos_server should not override any setting you have for OFI_DOMAIN. Does your log show output from netdetect showing that it searched for and found a device alias? If GetDeviceAlias() executes, your debug log will show output from getDeviceAliasWithSystemList from these two debug messages:
// at function entry
log.Debugf("Searching for a device alias for: %s", device)
// at function exit if there wasn’t an error up to this point
log.Debugf("Device alias for %s is %s", device, C.GoString(node.name))
If there is no debug output like that, then the OFI_DOMAIN came from somewhere other than daos_server providing one. If daos_server is doing it, then it seems to think that OFI_DOMAIN was not defined for that particular IO server instance.
If you find that daos_server is providing the OFI_DOMAIN despite your environment variable setting, then go ahead and comment out that code so you can work around it. And, send me your daos_server.yml for starters so I can try to reproduce the erroneous response from HasEnvVar if that is actually a problem.
Regards, Joel
|
|
Kevan Rehm
Just an update: we have figured out how to enable Mercury debug messages.
Kevan
|
|
Kevan Rehm
All,
I mostly understand what is happening here, and I’m wondering whether a fix is already underway for this.
We have two ethernet interfaces and two infiniband interfaces in this node. All four of those interfaces support verbs;ofi_rxm. When Mercury calls fi_getinfo() it gets back a list of four matching interfaces, but it apparently does not use the OFI_DOMAIN that the user specified to narrow the list down to the correct IB interface. It so happens that the ethernet interfaces appear in the list before the infiniband interfaces, which is why mlx5_0 gets selected when the infiniband mlx5_2 is what the user wants. There is code to narrow the selection by provider, but here we also need to narrow the selection by OFI_DOMAIN.
Has this been reported before by any chance? Perhaps a PR exists already?
Thanks, Kevan
|
|
Oganezov, Alexander A
Hi Kevan,
This is not something that we’ve encountered before, as our systems don’t have such a setup of interfaces. Adding Jerome from Mercury as well to see whether this is a known issue.
As a note, CaRT currently initializes Mercury using the init string below, so the domain info (via the OFI_DOMAIN environment variable) should be provided to Mercury for it to make a proper decision:
D_ASPRINTF(*string, "%s://%s/%s", plugin_str,
           crt_na_ofi_conf.noc_domain,
           crt_na_ofi_conf.noc_ip_str);
~~Alex.
|
|
Kevan Rehm
Well, I see that there is code in Mercury to deal with such situations, which means I need to do a better job of debugging why it’s not working in this case.
Kevan
|
|