Mercury debug (and IB question)


Farrell, Patrick Arthur <patrick.farrell@...>
 

Good afternoon,

I am running the latest DAOS (as of two hours ago, so including the Mercury update), and I rebuilt from scratch (deleted everything) to guarantee I had the latest of everything.  I'm also using Mellanox OFED.

I am interested in the output of this debug message in Mercury:
NA_LOG_DEBUG("Entering na_ofi_initialize() class_name %s, protocol_name %s,"
" host_name %s", na_info->class_name, na_info->protocol_name,
na_info->host_name);
(from na_ofi_initialize, of course)

(I am specifically interested because my daos_test instance seems to be ignoring my OFI_DOMAIN environment variable in favor of finding its own - I am getting this error on the server when daos_test tries to connect, despite having OFI_DOMAIN set to mlx5_2 in both client and server environments:

ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0
libfabric:148660:ofi_rxm:ep_ctrl:rxm_msg_ep_open():801<warn> unable to create msg_ep: -22
libfabric:148660:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1095<info> notify event 1
)

I'm trying to figure out how domain is getting set there, and that mercury debug message looks useful.  But I can't get it to print out.

I've tried setting both HG_LOG_LEVEL and NA_HG_LOG_LEVEL to 'debug', but I'm not seeing anything either on my console or in the relevant log files.

I did notice Mercury debug is behind an #ifdef, but it looks to be enabled...?

So, hoping for help with the Mercury debug, and if anyone has an idea on the domain issue, that would be very interesting as well.

Thanks,
-Patrick


Oganezov, Alexander A
 

Adding Jerome from Mercury to answer the question about enabling debug output at the Mercury level.

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Farrell, Patrick Arthur
Sent: Thursday, March 19, 2020 1:42 PM
To: daos@daos.groups.io
Subject: [daos] Mercury debug (and IB question)

 



Rosenzweig, Joel B <joel.b.rosenzweig@...>
 

Hi Patrick,

 

Does the error “ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0” literally say “mlx5_2 and mlx5_0” or do you get two error messages, one with “mlx5_2” and the other with “mlx5_0”?  I just want to make sure I understand the error correctly.

 

In src/control/server/server.go’s Start(), you will find this:

 

    // Provide special handling for the ofi+verbs provider.
    // Mercury uses the interface name such as ib0, while OFI uses the device name such as hfi1_0
    // CaRT and Mercury will now support the new OFI_DOMAIN environment variable so that we can
    // specify the correct device for each.
    if strings.HasPrefix(srvCfg.Fabric.Provider, "ofi+verbs") && !srvCfg.HasEnvVar("OFI_DOMAIN") {
        deviceAlias, err := netdetect.GetDeviceAlias(srvCfg.Fabric.Interface)
        if err != nil {
            return errors.Wrapf(err, "failed to resolve alias for %s", srvCfg.Fabric.Interface)
        }
        envVar := "OFI_DOMAIN=" + deviceAlias
        srvCfg.WithEnvVars(envVar)
    }

 

If the OFI_DOMAIN variable is set, daos_server should not override any setting you have for OFI_DOMAIN.  Does your log show output from netdetect showing that it searched for and found a device alias?  If GetDeviceAlias() executes, your debug log will show output from getDeviceAliasWithSystemList from these two debug messages:

 

    // at function entry
    log.Debugf("Searching for a device alias for: %s", device)

    // at function exit if there wasn’t an error up to this point
    log.Debugf("Device alias for %s is %s", device, C.GoString(node.name))

 

If there is no debug output like that, then the OFI_DOMAIN came from somewhere other than daos_server providing one.  If daos_server is doing it, then it seems to think that OFI_DOMAIN was not defined for that particular IO server instance.
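To make the check in Start() concrete, here is a minimal sketch. It assumes that HasEnvVar only looks at the environment entries held in the server config (for example, those listed in the yaml file) and does not consult the shell environment of daos_server. The serverConfig type and the resolveDomain helper are hypothetical stand-ins for illustration, not the DAOS code.

```go
package main

import (
	"fmt"
	"strings"
)

// serverConfig is a hypothetical stand-in for the per-instance config.
// EnvVars holds "KEY=value" strings, as they would come from the yaml file.
type serverConfig struct {
	Provider string
	EnvVars  []string
}

// HasEnvVar reports whether name is defined in the config's own list.
// Assumption: the process environment (os.Getenv) is not consulted.
func (c *serverConfig) HasEnvVar(name string) bool {
	for _, kv := range c.EnvVars {
		if strings.HasPrefix(kv, name+"=") {
			return true
		}
	}
	return false
}

// resolveDomain mirrors the quoted branch: OFI_DOMAIN is filled in with the
// detected alias only for ofi+verbs and only if the config does not set it.
func resolveDomain(c *serverConfig, alias string) {
	if strings.HasPrefix(c.Provider, "ofi+verbs") && !c.HasEnvVar("OFI_DOMAIN") {
		c.EnvVars = append(c.EnvVars, "OFI_DOMAIN="+alias)
	}
}

func main() {
	// OFI_DOMAIN present in the config: the configured value is kept.
	fromYaml := &serverConfig{
		Provider: "ofi+verbs;ofi_rxm",
		EnvVars:  []string{"OFI_DOMAIN=mlx5_2"},
	}
	resolveDomain(fromYaml, "mlx5_0")
	fmt.Println(fromYaml.EnvVars)

	// OFI_DOMAIN absent from the config: the detected alias is used,
	// regardless of what was exported in the launching shell.
	shellOnly := &serverConfig{Provider: "ofi+verbs;ofi_rxm"}
	resolveDomain(shellOnly, "mlx5_0")
	fmt.Println(shellOnly.EnvVars)
}
```

Under that assumption, a value exported only in the shell would never reach the check, so the detected alias would be applied.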

 

If you find that daos_server is providing the OFI_DOMAIN despite your environment variable setting, then go ahead and comment out that code so you can work around it.  And, send me your daos_server.yml for starters so I can try to reproduce the erroneous response from HasEnvVar if that is actually a problem.

 

Regards,

Joel

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Farrell, Patrick Arthur
Sent: Thursday, March 19, 2020 4:42 PM
To: daos@daos.groups.io
Subject: [daos] Mercury debug (and IB question)

 



Colin Ngam
 

Hi,

 

Where should OFI_DOMAIN be set? Just exporting it as an environment variable before starting daos_server (“/root/daos/install/bin/daos_server start -o ./daos_server_local.yml”) is enough, right?

 

Thanks.

 

Colin

 

From: <daos@daos.groups.io> on behalf of "Rosenzweig, Joel B" <joel.b.rosenzweig@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, March 20, 2020 at 11:05 AM
To: "daos@daos.groups.io" <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)

 



patrick.farrell@...
 

The OFI_DOMAIN env doesn't seem to work - I set the OFI_DOMAIN variable in my environment *but not in the yaml file*; specifically, I set it to "george".  The server started just fine with no mention of 'george'.

When OFI_DOMAIN was set in the server yml file, again to 'george', the server failed to start with the sort of message you'd expect - No provider found on domain "george".
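One way to confirm which value a running process actually received is to read its environment from /proc. The sketch below assumes a Linux host, where /proc/<pid>/environ holds the process's initial environment as NUL-separated KEY=value entries; the lookupEnv helper is a name made up for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"strconv"
)

// lookupEnv scans a NUL-separated environment block (the layout of
// /proc/<pid>/environ on Linux) and returns the value of key if present.
func lookupEnv(block []byte, key string) (string, bool) {
	prefix := []byte(key + "=")
	for _, kv := range bytes.Split(block, []byte{0}) {
		if bytes.HasPrefix(kv, prefix) {
			return string(kv[len(prefix):]), true
		}
	}
	return "", false
}

func main() {
	pid := os.Getpid() // default: inspect this process
	if len(os.Args) > 1 {
		n, err := strconv.Atoi(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, "pid must be a number:", err)
			return
		}
		pid = n
	}

	block, err := os.ReadFile(fmt.Sprintf("/proc/%d/environ", pid))
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read environment:", err)
		return
	}

	if v, ok := lookupEnv(block, "OFI_DOMAIN"); ok {
		fmt.Printf("pid %d: OFI_DOMAIN=%s\n", pid, v)
	} else {
		fmt.Printf("pid %d: OFI_DOMAIN is not set\n", pid)
	}
}
```

Run against the pid of the daos_io_server process (reading another user's process typically needs matching ownership or root), this shows whether the shell value or the yaml value was passed down.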

This issue is interesting and may be relevant to my problem.  I'll reference this note in my reply to Joel's message.

-Patrick

From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Colin Ngam <colin.ngam@...>
Sent: Friday, March 20, 2020 11:29 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)
 



Farrell, Patrick Arthur <patrick.farrell@...>
 

Literally precisely that message - it's just a copy/paste.  Looking at the code, the error is printed specifically because those two strings are not the same.

It looks like the first one is the domain on the server, and the second is the domain in the message (my "client" is a separate shell session on the server).

So, I do have OFI_DOMAIN set in the environment in my client session - it's set to mlx5_2.

Thinking now of my reply to Colin on this chain, though, I tried setting it to "george" there, and... no change.  Exactly the same behavior, including the referenced error.

So it seems that the client is ignoring the OFI_DOMAIN variable and choosing mlx5_0 by itself, which is incorrect.

-Patrick

From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Friday, March 20, 2020 10:51 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)
 



Farrell, Patrick Arthur <patrick.farrell@...>
 

Sorry, a correction there, it looks like I messed up setting the OFI_DOMAIN env on my client - It is being used, and when I set it to a nonsense value, I do get a failure because the domain doesn't exist.

So, this still leaves open the question of how/why the domain is coming up wrong in that message.

I think it would be very helpful if I could turn on CaRT/Mercury debug - Is anyone able to shed light on what's required to do that?  Like I mentioned in the email that prompted this chain, it seems to be compiled out by default, and I can't figure out how to turn it on.

-Patrick

From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Farrell, Patrick Arthur <patrick.farrell@...>
Sent: Friday, March 20, 2020 11:47 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)
 


Rosenzweig, Joel B <joel.b.rosenzweig@...>
 

Hi Patrick,

 

What domains does a fi_info -v scan show you?  Does it list them separately, or is there some entry that actually says “mlx5_2 and mlx5_0”?  The error message that shows “ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0” comes from libfabric, so perhaps the fi_info scan will show something interesting (unexpected) there.
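For a quick list of the domain names a scan reports, a small filter like the one below can help. It is a sketch that assumes the output contains lines of the form "domain: <name>", as in the short (non-verbose) fi_info listing; the verbose output is laid out differently and would need a different match. The sample text is hand-written for the example and is not output captured from this system.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// domainNames returns the distinct values of "domain:" lines in
// first-seen order.
func domainNames(output string) []string {
	seen := make(map[string]bool)
	var names []string
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "domain:") {
			continue
		}
		name := strings.TrimSpace(strings.TrimPrefix(line, "domain:"))
		if name != "" && !seen[name] {
			seen[name] = true
			names = append(names, name)
		}
	}
	return names
}

// sample is an illustrative stand-in for fi_info output.
const sample = `provider: verbs;ofi_rxm
    domain: mlx5_0
provider: verbs;ofi_rxm
    domain: mlx5_2
provider: verbs
    domain: mlx5_0
`

func main() {
	for _, d := range domainNames(sample) {
		fmt.Println(d)
	}
}
```

If each device shows up as its own entry, the "mlx5_2 and mlx5_0" text in the error is the message listing two different names rather than a single domain with that name.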

 

I know that Alex inquired about the Mercury debug message help.  I don’t know if there’s a status update on that.

 

Regards,

Joel

 

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Farrell, Patrick Arthur
Sent: Friday, March 20, 2020 1:05 PM
To: daos@daos.groups.io
Subject: Re: [daos] Mercury debug (and IB question)

 

Sorry, a correction there, it looks like I messed up setting the OFI_DOMAIN env on my client - It is being used, and when I set it to a nonsense value, I do get a failure because the domain doesn't exist.

 

So, this still leaves open the question of how/why the domain is coming up wrong in that message.

 

I think it would be very helpful if I could turn on CaRT/Mercury debug - Is anyone able to shed light on what's required to do that?  Like I mentioned in the email that prompted this chain, it seems to be compiled out by default, and I can't figure out how to turn it on.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Farrell, Patrick Arthur <patrick.farrell@...>
Sent: Friday, March 20, 2020 11:47 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)

 

Literally precisely that message - It's just a copy/paste.  Looking in the code, it is specifically complaining because those two strings are not the same and that's why the error is printing.

 

It looks like the first one is the domain on the server, and the second is the domain in the message (my "client" is a separate shell session on the server).

 

So, I do have OFI_DOMAIN set in the environment in my client session - it's set to mlx5_2.

 

Thinking now of my reply to Colin on this chain, though, I tried setting it to "george" there, and... no change.  Exactly the same behavior, including the referenced error.

 

So it seems that the client is ignoring the OFI_DOMAIN variable and choosing mlx5_0 by itself, which is incorrect.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Friday, March 20, 2020 10:51 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)

 

Hi Patrick,

 

Does the error “ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0” literally say “mlx5_2 and mlx5_0” or do you get two error messages, one with “mlx5_2” and the other with “mlx5_0”?  I just want to make sure I understand the error correctly.

 

In src/control/server/server.go’s Start(), you will find this:

            // Provide special handling for the ofi+verbs provider.
            // Mercury uses the interface name such as ib0, while OFI uses the device name such as hfi1_0
            // CaRT and Mercury will now support the new OFI_DOMAIN environment variable so that we can
            // specify the correct device for each.
            if strings.HasPrefix(srvCfg.Fabric.Provider, "ofi+verbs") && !srvCfg.HasEnvVar("OFI_DOMAIN") {
                  deviceAlias, err := netdetect.GetDeviceAlias(srvCfg.Fabric.Interface)
                  if err != nil {
                        return errors.Wrapf(err, "failed to resolve alias for %s", srvCfg.Fabric.Interface)
                  }
                  envVar := "OFI_DOMAIN=" + deviceAlias
                  srvCfg.WithEnvVars(envVar)
            }

 

If the OFI_DOMAIN variable is set, daos_server should not override any setting you have for OFI_DOMAIN.  Does your log show output from netdetect showing that it searched for and found a device alias?  If GetDeviceAlias() executes, your debug log will show output from getDeviceAliasWithSystemList from these two debug messages:

 

// at function entry
log.Debugf("Searching for a device alias for: %s", device)

// at function exit if there wasn’t an error up to this point
log.Debugf("Device alias for %s is %s", device, C.GoString(node.name))

 

If there is no debug output like that, then the OFI_DOMAIN came from somewhere other than daos_server providing one.  If daos_server is doing it, then it seems to think that OFI_DOMAIN was not defined for that particular IO server instance.

 

If you find that daos_server is providing the OFI_DOMAIN despite your environment variable setting, then go ahead and comment out that code so you can work around it.  And, send me your daos_server.yml for starters so I can try to reproduce the erroneous response from HasEnvVar if that is actually a problem.
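As a rough illustration of the decision described above (this is not the DAOS code, which is the Go snippet quoted earlier), the "only supply OFI_DOMAIN when the user has not" logic can be sketched in C. The names has_env_var and default_ofi_domain are hypothetical, and the environment is modelled as a plain array of "KEY=VALUE" strings, which is an assumption made for the example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if any "KEY=VALUE" entry in env_vars sets key. */
int has_env_var(const char *const *env_vars, size_t count, const char *key)
{
    size_t key_len = strlen(key);

    for (size_t i = 0; i < count; i++) {
        /* Require "KEY=" so that OFI_DOMAIN does not match OFI_DOMAIN_FOO. */
        if (strncmp(env_vars[i], key, key_len) == 0 &&
            env_vars[i][key_len] == '=')
            return 1;
    }
    return 0;
}

/*
 * Hypothetical sketch of the defaulting rule: for an "ofi+verbs" provider
 * with no user-supplied OFI_DOMAIN, write "OFI_DOMAIN=<alias>" into out and
 * return 1. Otherwise leave out untouched and return 0.
 */
int default_ofi_domain(const char *provider, const char *const *env_vars,
                       size_t count, const char *device_alias,
                       char *out, size_t out_len)
{
    if (strncmp(provider, "ofi+verbs", strlen("ofi+verbs")) != 0)
        return 0;
    if (has_env_var(env_vars, count, "OFI_DOMAIN"))
        return 0;
    snprintf(out, out_len, "OFI_DOMAIN=%s", device_alias);
    return 1;
}
```

With an environment that already contains OFI_DOMAIN=mlx5_2, the sketch adds nothing, which matches the expectation that a user setting is never overridden.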

 

Regards,

Joel

 



Farrell, Patrick Arthur <patrick.farrell@...>
 

Joel,

In that message, the two domains are different parts of the print statement; they're different strings.

Here's the code:

if (strncmp(dom->verbs->device->name, info->domain_attr->name,
            strlen(dom->verbs->device->name))) {
        VERBS_INFO(FI_LOG_DOMAIN,
                   "Invalid info->domain_attr->name: %s and %s\n",
                   dom->verbs->device->name, info->domain_attr->name);
        return -FI_EINVAL;
}

Note the two %s and the different sources.  That's in verbs_ep.c in ofi.
So, there's no reason to expect a single domain with that complex name.
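To make the comparison in that snippet concrete, here is a minimal standalone sketch of the same strncmp prefix test. The function name domain_matches_device is hypothetical; only the strncmp call shape is taken from the quoted code.

```c
#include <string.h>

/*
 * Mirrors the quoted check: only the first strlen(device_name) characters
 * of the requested domain name are compared against the device name.
 * Returns 1 when the check passes, 0 when it would log "Invalid ..." and fail.
 */
int domain_matches_device(const char *device_name, const char *requested_domain)
{
    return strncmp(device_name, requested_domain, strlen(device_name)) == 0;
}
```

Because this is a prefix comparison, a device named mlx5_2 accepts both mlx5_2 and mlx5_2-xrc but rejects mlx5_0, which is the combination reported in the error.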

However, the output of fi_info is *very* interesting.  I see there's a verbs;ofi_rxm provider listed for mlx5_0, which is interesting because while that's a Mellanox card, it's an *Ethernet* card and it's in Ethernet mode.  I don't think verbs would work there.

Here's the output of fi_info - fi_info -v is over 6 thousand lines of output, so I can attach that if you want, but I figured I'd start with this.

-----
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0-dgram
version: 1.0
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_1-dgram
version: 1.0
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_2
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_2
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_2-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_2-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_2-dgram
version: 1.0
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_3
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_3
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_3-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_3-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_3-dgram
version: 1.0
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_2
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_2
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_3
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_3
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 172.30.222.0/24
domain: eno1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: fe80::/64
domain: eno1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 10.0.0.0/24
domain: ib0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 10.0.1.0/24
domain: ib1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: fe80::/64
domain: ib0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: fe80::/64
domain: ib1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 127.0.0.1/32
domain: lo
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: ::1/128
domain: lo
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxd
fabric: IB-0xfe80000000000000
domain: mlx5_0-dgram
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd
fabric: IB-0xfe80000000000000
domain: mlx5_1-dgram
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd
fabric: IB-0xfe80000000000000
domain: mlx5_2-dgram
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd
fabric: IB-0xfe80000000000000
domain: mlx5_3-dgram
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: 172.30.222.0/24
domain: eno1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: fe80::/64
domain: eno1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: 10.0.0.0/24
domain: ib0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: 10.0.1.0/24
domain: ib1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: fe80::/64
domain: ib0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: fe80::/64
domain: ib1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: 127.0.0.1/32
domain: lo
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: ::1/128
domain: lo
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: shm
fabric: shm
domain: shm
version: 1.1
type: FI_EP_RDM
protocol: FI_PROTO_SHM
provider: UDP
fabric: 172.30.222.0/24
domain: eno1
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: fe80::/64
domain: eno1
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: 10.0.0.0/24
domain: ib0
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: 10.0.1.0/24
domain: ib1
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: fe80::/64
domain: ib0
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: fe80::/64
domain: ib1
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: 127.0.0.1/32
domain: lo
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: ::1/128
domain: lo
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: tcp
fabric: 172.30.222.0/24
domain: eno1
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: fe80::/64
domain: eno1
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: 10.0.0.0/24
domain: ib0
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: 10.0.1.0/24
domain: ib1
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: fe80::/64
domain: ib0
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: fe80::/64
domain: ib1
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: 127.0.0.1/32
domain: lo
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: ::1/128
domain: lo
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 172.30.222.0/24
domain: eno1
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: eno1
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.0.0/24
domain: ib0
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.1.0/24
domain: ib1
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib0
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib1
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 127.0.0.1/32
domain: lo
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: ::1/128
domain: lo
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 172.30.222.0/24
domain: eno1
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: eno1
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.0.0/24
domain: ib0
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.1.0/24
domain: ib1
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib0
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib1
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 127.0.0.1/32
domain: lo
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: ::1/128
domain: lo
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 172.30.222.0/24
domain: eno1
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: eno1
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.0.0/24
domain: ib0
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.1.0/24
domain: ib1
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib0
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib1
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 127.0.0.1/32
domain: lo
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: ::1/128
domain: lo
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP

From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Friday, March 20, 2020 12:18 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)
 

Hi Patrick,

 

What domains does a fi_info -v scan show you?  Does it list them separately, or is there some entry that actually says “mlx5_2 and mlx5_0”?  The error message that shows “ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0” comes from libfabric, so perhaps the fi_info scan will show something interesting (unexpected) there.

 

I know that Alex inquired about the Mercury debug message help.  I don’t know if there’s a status update on that.

 

Regards,

Joel

 

 



Farrell, Patrick Arthur <patrick.farrell@...>
 

Ah, scratch that confusion about having a verbs provider for Ethernet: I see OFI has support for verbs over Ethernet.

Anyway, just to see what would happen, I disabled my Ethernet adapter, so mlx5_0 is no longer up.

After doing that, I was indeed able to get further.  The only verbs;ofi_rxm provider is associated with the mlx5_2 domain, and everything worked; I did not see this issue with domain mismatch.

Of course, fi_mr_reg failed with -14, which is EFAULT.
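For reading return codes like -14 (and the -22 in the earlier "unable to create msg_ep" line), a small sketch may help. It assumes the code is a negated standard errno value and that the numeric values are the Linux ones; describe_rc is a hypothetical helper name. libfabric also provides fi_strerror() for the same purpose, which is not used here so the example stays free of dependencies.

```c
#include <errno.h>
#include <string.h>

/*
 * Returns a human-readable description for an errno-style return code.
 * Accepts either the negated form (-14) or the positive form (14).
 */
const char *describe_rc(int rc)
{
    return strerror(rc < 0 ? -rc : rc);
}
```

On Linux, EFAULT is 14 and EINVAL is 22, so -14 decodes to the "bad address" error and -22 to "invalid argument", the latter matching the -FI_EINVAL returned by the domain name check.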

So, two issues currently:
  1. OFI_DOMAIN is being partly ignored, and something - I think the client library - is using the first domain it finds with the right provider
  2. With MOFED 5.0, we're getting -EFAULT on a memory registration
Interesting.
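The first issue is only a suspicion at this point, but the difference between "first entry with the right provider" and "entry with the right provider and the requested domain" can be sketched as follows. Everything here is hypothetical: struct fabric_entry is a simplified stand-in for an fi_info entry, and pick_entry is not a Mercury, CaRT or libfabric function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one fi_info entry. */
struct fabric_entry {
    const char *provider;
    const char *domain;
};

/*
 * Returns the first entry whose provider matches. If wanted_domain is
 * non-NULL, the entry's domain must match as well. Returns NULL when
 * nothing qualifies.
 */
const struct fabric_entry *pick_entry(const struct fabric_entry *entries,
                                      size_t count, const char *provider,
                                      const char *wanted_domain)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(entries[i].provider, provider) != 0)
            continue;
        if (wanted_domain != NULL &&
            strcmp(entries[i].domain, wanted_domain) != 0)
            continue;
        return &entries[i];
    }
    return NULL;
}
```

With the verbs;ofi_rxm entries in the order fi_info printed them (mlx5_0, mlx5_2, mlx5_3), passing NULL for the domain yields mlx5_0, while passing mlx5_2 yields mlx5_2. A code path that skips the domain filter would therefore behave like the mismatch seen here, and would stop doing so once mlx5_0 is disabled.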

Johann,

I switched back to RAM for this test for simplicity and speed (as I am restarting the server a lot); you alluded to an error that occurs with RAM but not with PMEM or the other way around...  Is this possibly it?

-Patrick

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Farrell, Patrick Arthur
Sent: Friday, March 20, 2020 1:05 PM
To: daos@daos.groups.io
Subject: Re: [daos] Mercury debug (and IB question)

 

Sorry, a correction there, it looks like I messed up setting the OFI_DOMAIN env on my client - It is being used, and when I set it to a nonsense value, I do get a failure because the domain doesn't exist.

 

So, this still leaves open the question of how/why the domain is coming up wrong in that message.

 

I think it would be very helpful if I could turn on CaRT/Mercury debug - Is anyone able to shed light on what's required to do that?  Like I mentioned in the email that prompted this chain, it seems to be compiled out by default, and I can't figure out how to turn it on.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Farrell, Patrick Arthur <patrick.farrell@...>
Sent: Friday, March 20, 2020 11:47 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)

 

Literally precisely that message - It's just a copy/paste.  Looking in the code, it is specifically complaining because those two strings are not the same and that's why the error is printing.

 

It looks like the first one is the domain on the server, and the second is the domain in the message (my "client" is a separate shell session on the server).

 

So, I do have OFI_DOMAIN set in the environment in my client session - it's set to mlx5_2.

 

Thinking now of my reply to Colin on this chain, though, I tried setting it to "george" there, and... no change.  Exactly the same behavior, including the referenced error.

 

So it seems that the client is ignoring the OFI_DOMAIN variable and choosing mlx5_0 by itself, which is incorrect.

 

-Patrick


From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Rosenzweig, Joel B <joel.b.rosenzweig@...>
Sent: Friday, March 20, 2020 10:51 AM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)

 

Hi Patrick,

 

Does the error “ERROR: daos_io_server:0 libfabric:148660:verbs:domain:vrb_open_ep():917<info> Invalid info->domain_attr->name: mlx5_2 and mlx5_0” literally say “mlx5_2 and mlx5_0” or do you get two error messages, one with “mlx5_2” and the other with “mlx5_0”?  I just want to make sure I understand the error correctly.

 

In src/control/server/server.go’s Start(), you will find this:

 

            // Provide special handling for the ofi+verbs provider.
            // Mercury uses the interface name such as ib0, while OFI uses the device name such as hfi1_0
            // CaRT and Mercury will now support the new OFI_DOMAIN environment variable so that we can
            // specify the correct device for each.
            if strings.HasPrefix(srvCfg.Fabric.Provider, "ofi+verbs") && !srvCfg.HasEnvVar("OFI_DOMAIN") {
                  deviceAlias, err := netdetect.GetDeviceAlias(srvCfg.Fabric.Interface)
                  if err != nil {
                        return errors.Wrapf(err, "failed to resolve alias for %s", srvCfg.Fabric.Interface)
                  }
                  envVar := "OFI_DOMAIN=" + deviceAlias
                  srvCfg.WithEnvVars(envVar)
            }

 

If the OFI_DOMAIN variable is set, daos_server should not override any setting you have for OFI_DOMAIN.  Does your log show output from netdetect showing that it searched for and found a device alias?  If GetDeviceAlias() executes, your debug log will show output from getDeviceAliasWithSystemList from these two debug messages:

 

// at function entry
log.Debugf("Searching for a device alias for: %s", device)

// at function exit if there wasn’t an error up to this point
log.Debugf("Device alias for %s is %s", device, C.GoString(node.name))

 

If there is no debug output like that, then the OFI_DOMAIN came from somewhere other than daos_server providing one.  If daos_server is doing it, then it seems to think that OFI_DOMAIN was not defined for that particular IO server instance.

 

If you find that daos_server is providing the OFI_DOMAIN despite your environment variable setting, then go ahead and comment out that code so you can work around it.  And, send me your daos_server.yml for starters so I can try to reproduce the erroneous response from HasEnvVar if that is actually a problem.

 

Regards,

Joel

 


Farrell, Patrick Arthur <patrick.farrell@...>
 

Never mind: this memory registration failure is not related to whether or not dcpm is in use.

I'll get more details and report that separately; it's only occurring in some cases.

So that just leaves the problem of the client library ignoring OFI_DOMAIN.

-Patrick

From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Farrell, Patrick Arthur <patrick.farrell@...>
Sent: Friday, March 20, 2020 2:15 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)
 
Ah, scratch that confusion about having a verbs provider for ethernet - I see OFI has support for verbs over ethernet.

Anyway, just to see what would happen, I disabled my ethernet adapter, so mlx5_0 is no longer up.

After doing that, I was indeed able to get further.  The only verbs;ofi_rxm provider is associated with the mlx5_2 domain, and everything worked - I did not see this issue with domain mismatch.

Of course, fi_mr_reg failed with -14, which is EFAULT.

So, two issues currently:
  1. OFI_DOMAIN is being partly ignored, and something - I think the client library - is using the first domain it finds with the right provider
  2. With MOFED 5.0, we're getting -EFAULT on a memory registration
Interesting.

Johann,

I switched back to RAM for this test for simplicity and speed (as I am restarting the server a lot); you alluded to an error that occurs with RAM but not with PMEM or the other way around...  Is this possibly it?

-Patrick

From: daos@daos.groups.io <daos@daos.groups.io> on behalf of Farrell, Patrick Arthur <patrick.farrell@...>
Sent: Friday, March 20, 2020 1:43 PM
To: daos@daos.groups.io <daos@daos.groups.io>
Subject: Re: [daos] Mercury debug (and IB question)
 
Joel,

In that message, the two domains are different parts of the print statement, they're different strings.

Here's the code:
if (strncmp(dom->verbs->device->name, info->domain_attr->name,
            strlen(dom->verbs->device->name))) {
        VERBS_INFO(FI_LOG_DOMAIN,
                   "Invalid info->domain_attr->name: %s and %s\n",
                   dom->verbs->device->name, info->domain_attr->name);
        return -FI_EINVAL;
}

Note the two %s and the different sources.  That's in verbs_ep.c in libfabric (OFI).
So, there's no reason to expect a single domain with that complex name.
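
To make the behaviour of that check concrete, here is a minimal standalone sketch of the same comparison. The helper name domain_matches_device is hypothetical (it is not a libfabric function); only the strncmp/strlen pattern is taken from the snippet above. Because the comparison length is the length of the device name, it is a prefix match: a requested domain such as "mlx5_2-xrc" is accepted for device "mlx5_2", while "mlx5_0" is rejected.

```c
#include <string.h>

/*
 * Hypothetical helper illustrating the comparison quoted above.
 * Returns 1 when the first strlen(device_name) characters of
 * requested_domain equal device_name, and 0 otherwise.
 */
int domain_matches_device(const char *device_name,
                          const char *requested_domain)
{
    return strncmp(device_name, requested_domain,
                   strlen(device_name)) == 0;
}
```

With device "mlx5_2" and requested domain "mlx5_0" the helper returns 0, which corresponds to the branch that logs "Invalid info->domain_attr->name: mlx5_2 and mlx5_0" and returns -FI_EINVAL.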

However, the output of fi_info is *very* interesting.  I see there's a verbs;ofi_rxm provider listed for mlx5_0, which is interesting because while that's a Mellanox card, it's an *ethernet* card and it's in ethernet mode.  I don't think verbs would work there.

Here's the output of fi_info - fi_info -v is over 6 thousand lines of output, so I can attach that if you want, but I figured I'd start with this.

-----
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_0-dgram
version: 1.0
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_1-dgram
version: 1.0
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_2
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_2
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_2-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_2-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_2-dgram
version: 1.0
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_3
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_3
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_3-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_3-xrc
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_RDMA_CM_IB_XRC
provider: verbs
fabric: IB-0xfe80000000000000
domain: mlx5_3-dgram
version: 1.0
type: FI_EP_DGRAM
protocol: FI_PROTO_IB_UD
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_2
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_2
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_3
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxm
fabric: IB-0xfe80000000000000
domain: mlx5_3
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 172.30.222.0/24
domain: eno1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: fe80::/64
domain: eno1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 10.0.0.0/24
domain: ib0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 10.0.1.0/24
domain: ib1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: fe80::/64
domain: ib0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: fe80::/64
domain: ib1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: 127.0.0.1/32
domain: lo
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
fabric: ::1/128
domain: lo
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
provider: verbs;ofi_rxd
fabric: IB-0xfe80000000000000
domain: mlx5_0-dgram
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd
fabric: IB-0xfe80000000000000
domain: mlx5_1-dgram
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd
fabric: IB-0xfe80000000000000
domain: mlx5_2-dgram
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: verbs;ofi_rxd
fabric: IB-0xfe80000000000000
domain: mlx5_3-dgram
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: 172.30.222.0/24
domain: eno1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: fe80::/64
domain: eno1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: 10.0.0.0/24
domain: ib0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: 10.0.1.0/24
domain: ib1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: fe80::/64
domain: ib0
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: fe80::/64
domain: ib1
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: 127.0.0.1/32
domain: lo
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: UDP;ofi_rxd
fabric: ::1/128
domain: lo
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
provider: shm
fabric: shm
domain: shm
version: 1.1
type: FI_EP_RDM
protocol: FI_PROTO_SHM
provider: UDP
fabric: 172.30.222.0/24
domain: eno1
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: fe80::/64
domain: eno1
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: 10.0.0.0/24
domain: ib0
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: 10.0.1.0/24
domain: ib1
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: fe80::/64
domain: ib0
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: fe80::/64
domain: ib1
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: 127.0.0.1/32
domain: lo
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: UDP
fabric: ::1/128
domain: lo
version: 1.1
type: FI_EP_DGRAM
protocol: FI_PROTO_UDP
provider: tcp
fabric: 172.30.222.0/24
domain: eno1
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: fe80::/64
domain: eno1
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: 10.0.0.0/24
domain: ib0
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: 10.0.1.0/24
domain: ib1
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: fe80::/64
domain: ib0
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: fe80::/64
domain: ib1
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: 127.0.0.1/32
domain: lo
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: tcp
fabric: ::1/128
domain: lo
version: 1.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 172.30.222.0/24
domain: eno1
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: eno1
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.0.0/24
domain: ib0
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.1.0/24
domain: ib1
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib0
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib1
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 127.0.0.1/32
domain: lo
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: ::1/128
domain: lo
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 172.30.222.0/24
domain: eno1
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: eno1
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.0.0/24
domain: ib0
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.1.0/24
domain: ib1
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib0
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib1
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 127.0.0.1/32
domain: lo
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: ::1/128
domain: lo
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 172.30.222.0/24
domain: eno1
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: eno1
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.0.0/24
domain: ib0
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 10.0.1.0/24
domain: ib1
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib0
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: fe80::/64
domain: ib1
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: 127.0.0.1/32
domain: lo
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP
provider: sockets
fabric: ::1/128
domain: lo
version: 2.0
type: FI_EP_MSG
protocol: FI_PROTO_SOCK_TCP


Kevan Rehm
 

Just an update: we have figured out how to enable Mercury debug messages.

 

Kevan

 


Kevan Rehm
 

All,

 

I mostly understand what is happening here, and I’m wondering whether a fix is already underway for this.

 

We have two Ethernet interfaces and two InfiniBand interfaces in this node, and all four of them support verbs;ofi_rxm.  When Mercury calls fi_getinfo() it gets back a list of four matching interfaces, and it apparently does not use the OFI_DOMAIN that the user specified to narrow the list down to the correct IB interface.  The Ethernet interfaces happen to appear in the list before the InfiniBand interfaces, which is why mlx5_0 gets selected when the InfiniBand mlx5_2 is what the user wants.  There is code to narrow the selection by provider, but here we also need to narrow it by OFI_DOMAIN.
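
To make the selection problem concrete, here is a small sketch of "first match by provider" versus "match by provider and domain". It does not call libfabric or Mercury: fabricInfo and selectInfo are hypothetical, and the list contents are assumed for illustration.

```go
package main

// fabricInfo is a simplified, hypothetical stand-in for one entry in the
// list returned by fi_getinfo(); real entries carry many more fields.
type fabricInfo struct {
	Provider string // e.g. "verbs;ofi_rxm"
	Domain   string // e.g. "mlx5_0"
}

// selectInfo returns the first entry that matches provider and, when domain
// is non-empty, also matches the domain name. The bool result is false when
// no entry matches.
func selectInfo(list []fabricInfo, provider, domain string) (fabricInfo, bool) {
	for _, fi := range list {
		if fi.Provider != provider {
			continue
		}
		if domain != "" && fi.Domain != domain {
			continue
		}
		return fi, true
	}
	return fabricInfo{}, false
}
```

Given a list in which mlx5_0 comes before mlx5_2, filtering by provider alone returns mlx5_0, while passing "mlx5_2" as the domain returns the intended entry.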

 

Has this been reported before by any chance?   Perhaps a PR exists already?

 

Thanks, Kevan

 



Oganezov, Alexander A
 

Hi Kevan,

 

This is not something that we’ve encountered before, as our systems don’t have such a setup of interfaces. Adding Jerome from Mercury as well to see whether this is a known issue.

 

As a note, CaRT currently initializes Mercury using the init string below, so the domain info (via the OFI_DOMAIN environment variable) should be provided to Mercury to make a proper decision.

D_ASPRINTF(*string, "%s://%s/%s", plugin_str,
           crt_na_ofi_conf.noc_domain,
           crt_na_ofi_conf.noc_ip_str);
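
For reference, the layout produced by that format string can be sketched as follows. buildInitString is a hypothetical helper that only mirrors the "%s://%s/%s" pattern (plugin, then domain, then address); the argument values used with it are assumptions, not output captured from CaRT.

```go
package main

import "fmt"

// buildInitString mirrors the "%s://%s/%s" layout of the CaRT snippet:
// plugin string, then domain, then address. Hypothetical helper.
func buildInitString(plugin, domain, addr string) string {
	return fmt.Sprintf("%s://%s/%s", plugin, domain, addr)
}
```

For example, a plugin of "ofi+verbs;ofi_rxm", a domain of "mlx5_2" and an address of "10.0.0.1" would give "ofi+verbs;ofi_rxm://mlx5_2/10.0.0.1", so the domain is carried in the string handed to Mercury.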

 

 

~~Alex.

 

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of Kevan Rehm
Sent: Friday, March 20, 2020 1:53 PM
To: daos@daos.groups.io
Subject: Re: [daos] Mercury debug (and IB question)

 



Kevan Rehm
 

Well, I see that there is code in Mercury to deal with such situations, which means I need to do a better job of debugging why it’s not working in this case.

 

Kevan

 

From: <daos@daos.groups.io> on behalf of "Oganezov, Alexander A" <alexander.a.oganezov@...>
Reply-To: "daos@daos.groups.io" <daos@daos.groups.io>
Date: Friday, March 20, 2020 at 4:08 PM
To: "daos@daos.groups.io" <daos@daos.groups.io>, Jerome Soumagne <jsoumagne@...>
Subject: Re: [daos] Mercury debug (and IB question)

 
