All of lore.kernel.org
 help / color / mirror / Atom feed
* possible kernel/libxl race with xl network-attach
@ 2022-01-13 11:19 James Dingwall
  2022-01-13 16:11 ` Roger Pau Monné
  0 siblings, 1 reply; 7+ messages in thread
From: James Dingwall @ 2022-01-13 11:19 UTC (permalink / raw)
  To: xen-devel

Hi,

I have been trying to debug a problem where a vif with the backend in a 
driver domain is added to dom0.  Intermittently the hotplug script is 
not invoked by libxl (running as xl devd) in the driver domain.  By 
enabling some debug for the driver domain kernel and libxl I have these 
messages:

driver domain kernel (Ubuntu 5.4.0-92-generic):

[Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0: Successfully created xenvif
[Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: /local/domain/0/device/vif/0 -> Initialising
[Thu Jan 13 01:39:31 2022] [26] 470: xen_netback:backend_switch_state: backend/vif/0/0 -> InitWait
[Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: /local/domain/0/device/vif/0 -> Connected
[Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready
[Thu Jan 13 01:39:31 2022] [26] 470: xen_netback:backend_switch_state: backend/vif/0/0 -> Connected

xl devd (Xen 4.14.3):

2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 wpath=/local/domain/2/backend token=3/0: event epath=/local/domain/2/backend/vif/0/0/state
2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700: nested ao, parent 0x5633ac567f90
2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event epath=/local/domain/2/backend/vif/0/0/state
2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:1055:devstate_callback: backend /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting state 4
2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 wpath=/local/domain/2/backend token=3/0: event epath=/local/domain/2/backend/vif/0/0/state
2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220: nested ao, parent 0x5633ac567f90
2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event epath=/local/domain/2/backend/vif/0/0/state
2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:1055:devstate_callback: backend /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting state 4
2022-01-13 01:39:51 UTC libxl: debug: libxl_aoutils.c:88:xswait_timeout_callback: backend /local/domain/2/backend/vif/0/0/state (hoping for state change to 2): xswait timeout (path=/local/domain/2/backend/vif/0/0/state)
2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:850:libxl__ev_xswatch_deregister: watch w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state token=2/1: deregister slotnum=2
2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:1039:devstate_callback: backend /local/domain/2/backend/vif/0/0/state wanted state 2  timed out
2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:864:libxl__ev_xswatch_deregister: watch w=0x5633ac569180: deregister unregistered
2022-01-13 01:39:51 UTC libxl: debug: libxl_device.c:1092:device_backend_callback: calling device_backend_cleanup
2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:864:libxl__ev_xswatch_deregister: watch w=0x5633ac569180: deregister unregistered
2022-01-13 01:39:51 UTC libxl: error: libxl_device.c:1105:device_backend_callback: unable to add device with path /local/domain/2/backend/vif/0/0
2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:864:libxl__ev_xswatch_deregister: watch w=0x5633ac569280: deregister unregistered
2022-01-13 01:39:51 UTC libxl: debug: libxl_device.c:1470:device_complete: device /local/domain/2/backend/vif/0/0 add failed
2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:2035:libxl__ao__destroy: ao 0x5633ac568f30: destroy

the xenstore content for the backend:

# xenstore-ls /local/domain/2/backend/vif/0
0 = ""
 frontend = "/local/domain/0/device/vif/0"
 frontend-id = "0"
 online = "1"
 state = "4"
 script = "/etc/xen/scripts/vif-zynstra"
 vifname = "dom0.0"
 mac = "00:16:3e:6c:de:82"
 bridge = "cluster"
 handle = "0"
 type = "vif"
 feature-sg = "1"
 feature-gso-tcpv4 = "1"
 feature-gso-tcpv6 = "1"
 feature-ipv6-csum-offload = "1"
 feature-rx-copy = "1"
 feature-rx-flip = "0"
 feature-multicast-control = "1"
 feature-dynamic-multicast-control = "1"
 feature-split-event-channels = "1"
 multi-queue-max-queues = "2"
 feature-ctrl-ring = "1"
 hotplug-status = "connected"

My guess is that the libxl callback is started waiting for the backend 
state key to be set to XenbusStateInitWait (2) but the frontend in dom0 
has already triggered the backend to transition to XenbusStateConnected 
(4) and therefore it does not successfully complete.

Does this seem a reasonable explanation for the problem and what would 
the best approach to try and solve it?

Thanks,
James


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: possible kernel/libxl race with xl network-attach
  2022-01-13 11:19 possible kernel/libxl race with xl network-attach James Dingwall
@ 2022-01-13 16:11 ` Roger Pau Monné
  2022-01-21 13:34   ` James Dingwall
  0 siblings, 1 reply; 7+ messages in thread
From: Roger Pau Monné @ 2022-01-13 16:11 UTC (permalink / raw)
  To: James Dingwall; +Cc: xen-devel

On Thu, Jan 13, 2022 at 11:19:46AM +0000, James Dingwall wrote:
> Hi,
> 
> I have been trying to debug a problem where a vif with the backend in a 
> driver domain is added to dom0.  Intermittently the hotplug script is 
> not invoked by libxl (running as xl devd) in the driver domain.  By 
> enabling some debug for the driver domain kernel and libxl I have these 
> messages:
> 
> driver domain kernel (Ubuntu 5.4.0-92-generic):
> 
> [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0: Successfully created xenvif
> [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: /local/domain/0/device/vif/0 -> Initialising
> [Thu Jan 13 01:39:31 2022] [26] 470: xen_netback:backend_switch_state: backend/vif/0/0 -> InitWait
> [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: /local/domain/0/device/vif/0 -> Connected
> [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready
> [Thu Jan 13 01:39:31 2022] [26] 470: xen_netback:backend_switch_state: backend/vif/0/0 -> Connected
> 
> xl devd (Xen 4.14.3):
> 
> 2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 wpath=/local/domain/2/backend token=3/0: event epath=/local/domain/2/backend/vif/0/0/state
> 2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700: nested ao, parent 0x5633ac567f90
> 2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event epath=/local/domain/2/backend/vif/0/0/state
> 2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:1055:devstate_callback: backend /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting state 4
> 2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 wpath=/local/domain/2/backend token=3/0: event epath=/local/domain/2/backend/vif/0/0/state
> 2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220: nested ao, parent 0x5633ac567f90
> 2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event epath=/local/domain/2/backend/vif/0/0/state
> 2022-01-13 01:39:31 UTC libxl: debug: libxl_event.c:1055:devstate_callback: backend /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting state 4
> 2022-01-13 01:39:51 UTC libxl: debug: libxl_aoutils.c:88:xswait_timeout_callback: backend /local/domain/2/backend/vif/0/0/state (hoping for state change to 2): xswait timeout (path=/local/domain/2/backend/vif/0/0/state)
> 2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:850:libxl__ev_xswatch_deregister: watch w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state token=2/1: deregister slotnum=2
> 2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:1039:devstate_callback: backend /local/domain/2/backend/vif/0/0/state wanted state 2  timed out
> 2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:864:libxl__ev_xswatch_deregister: watch w=0x5633ac569180: deregister unregistered
> 2022-01-13 01:39:51 UTC libxl: debug: libxl_device.c:1092:device_backend_callback: calling device_backend_cleanup
> 2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:864:libxl__ev_xswatch_deregister: watch w=0x5633ac569180: deregister unregistered
> 2022-01-13 01:39:51 UTC libxl: error: libxl_device.c:1105:device_backend_callback: unable to add device with path /local/domain/2/backend/vif/0/0
> 2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:864:libxl__ev_xswatch_deregister: watch w=0x5633ac569280: deregister unregistered
> 2022-01-13 01:39:51 UTC libxl: debug: libxl_device.c:1470:device_complete: device /local/domain/2/backend/vif/0/0 add failed
> 2022-01-13 01:39:51 UTC libxl: debug: libxl_event.c:2035:libxl__ao__destroy: ao 0x5633ac568f30: destroy
> 
> the xenstore content for the backend:
> 
> # xenstore-ls /local/domain/2/backend/vif/0
> 0 = ""
>  frontend = "/local/domain/0/device/vif/0"
>  frontend-id = "0"
>  online = "1"
>  state = "4"
>  script = "/etc/xen/scripts/vif-zynstra"
>  vifname = "dom0.0"
>  mac = "00:16:3e:6c:de:82"
>  bridge = "cluster"
>  handle = "0"
>  type = "vif"
>  feature-sg = "1"
>  feature-gso-tcpv4 = "1"
>  feature-gso-tcpv6 = "1"
>  feature-ipv6-csum-offload = "1"
>  feature-rx-copy = "1"
>  feature-rx-flip = "0"
>  feature-multicast-control = "1"
>  feature-dynamic-multicast-control = "1"
>  feature-split-event-channels = "1"
>  multi-queue-max-queues = "2"
>  feature-ctrl-ring = "1"
>  hotplug-status = "connected"
> 
> My guess is that the libxl callback is started waiting for the backend 
> state key to be set to XenbusStateInitWait (2) but the frontend in dom0 
> has already triggered the backend to transition to XenbusStateConnected 
> (4) and therefore it does not successfully complete.

I think your analysis is correct. The question is who sets
hotplug-status to "connected" in xenstore if the hotplug script in the
driver domain hasn't been launched by `xl devd`.

It would seem to me some other entity is launching the hotplug
scripts, do you have some udev rules in the driver domain that do that
and race with `xl devd` maybe?

Regards, Roger.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: possible kernel/libxl race with xl network-attach
  2022-01-13 16:11 ` Roger Pau Monné
@ 2022-01-21 13:34   ` James Dingwall
  2022-01-21 14:00     ` Roger Pau Monné
  0 siblings, 1 reply; 7+ messages in thread
From: James Dingwall @ 2022-01-21 13:34 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel

On 2022-01-13 16:11, Roger Pau Monné wrote:
> On Thu, Jan 13, 2022 at 11:19:46AM +0000, James Dingwall wrote:
>> 
>> I have been trying to debug a problem where a vif with the backend in 
>> a
>> driver domain is added to dom0.  Intermittently the hotplug script is
>> not invoked by libxl (running as xl devd) in the driver domain.  By
>> enabling some debug for the driver domain kernel and libxl I have 
>> these
>> messages:
>> 
>> driver domain kernel (Ubuntu 5.4.0-92-generic):
>> 
>> [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0: 
>> Successfully created xenvif
>> [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: 
>> /local/domain/0/device/vif/0 -> Initialising
>> [Thu Jan 13 01:39:31 2022] [26] 470: xen_netback:backend_switch_state: 
>> backend/vif/0/0 -> InitWait
>> [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: 
>> /local/domain/0/device/vif/0 -> Connected
>> [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready
>> [Thu Jan 13 01:39:31 2022] [26] 470: xen_netback:backend_switch_state: 
>> backend/vif/0/0 -> Connected
>> 
>> xl devd (Xen 4.14.3):
>> 
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 
>> wpath=/local/domain/2/backend token=3/0: event 
>> epath=/local/domain/2/backend/vif/0/0/state
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700: nested 
>> ao, parent 0x5633ac567f90
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 
>> wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event 
>> epath=/local/domain/2/backend/vif/0/0/state
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:1055:devstate_callback: backend 
>> /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting 
>> state 4
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 
>> wpath=/local/domain/2/backend token=3/0: event 
>> epath=/local/domain/2/backend/vif/0/0/state
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220: nested 
>> ao, parent 0x5633ac567f90
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 
>> wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event 
>> epath=/local/domain/2/backend/vif/0/0/state
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:1055:devstate_callback: backend 
>> /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting 
>> state 4
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_aoutils.c:88:xswait_timeout_callback: backend 
>> /local/domain/2/backend/vif/0/0/state (hoping for state change to 2): 
>> xswait timeout (path=/local/domain/2/backend/vif/0/0/state)
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:850:libxl__ev_xswatch_deregister: watch w=0x5633ac569180 
>> wpath=/local/domain/2/backend/vif/0/0/state token=2/1: deregister 
>> slotnum=2
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:1039:devstate_callback: backend 
>> /local/domain/2/backend/vif/0/0/state wanted state 2  timed out
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:864:libxl__ev_xswatch_deregister: watch 
>> w=0x5633ac569180: deregister unregistered
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_device.c:1092:device_backend_callback: calling 
>> device_backend_cleanup
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:864:libxl__ev_xswatch_deregister: watch 
>> w=0x5633ac569180: deregister unregistered
>> 2022-01-13 01:39:51 UTC libxl: error: 
>> libxl_device.c:1105:device_backend_callback: unable to add device with 
>> path /local/domain/2/backend/vif/0/0
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:864:libxl__ev_xswatch_deregister: watch 
>> w=0x5633ac569280: deregister unregistered
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_device.c:1470:device_complete: device 
>> /local/domain/2/backend/vif/0/0 add failed
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:2035:libxl__ao__destroy: ao 0x5633ac568f30: destroy
>> 
>> the xenstore content for the backend:
>> 
>> # xenstore-ls /local/domain/2/backend/vif/0
>> 0 = ""
>>  frontend = "/local/domain/0/device/vif/0"
>>  frontend-id = "0"
>>  online = "1"
>>  state = "4"
>>  script = "/etc/xen/scripts/vif-zynstra"
>>  vifname = "dom0.0"
>>  mac = "00:16:3e:6c:de:82"
>>  bridge = "cluster"
>>  handle = "0"
>>  type = "vif"
>>  feature-sg = "1"
>>  feature-gso-tcpv4 = "1"
>>  feature-gso-tcpv6 = "1"
>>  feature-ipv6-csum-offload = "1"
>>  feature-rx-copy = "1"
>>  feature-rx-flip = "0"
>>  feature-multicast-control = "1"
>>  feature-dynamic-multicast-control = "1"
>>  feature-split-event-channels = "1"
>>  multi-queue-max-queues = "2"
>>  feature-ctrl-ring = "1"
>>  hotplug-status = "connected"
>> 
>> My guess is that the libxl callback is started waiting for the backend
>> state key to be set to XenbusStateInitWait (2) but the frontend in 
>> dom0
>> has already triggered the backend to transition to 
>> XenbusStateConnected
>> (4) and therefore it does not successfully complete.
> 
> I think your analysis is correct. The question is who sets
> hotplug-status to "connected" in xenstore if the hotplug script in the
> driver domain hasn't been launched by `xl devd`.
> 
> It would seem to me some other entity is launching the hotplug
> scripts, do you have some udev rules in the driver domain that do that
> and race with `xl devd` maybe?

No, there are no udev rules for xen network devices.  What it looks like 
is that the in linux/drivers/net/xen-netback/xenbus.c that the backend 
state is changed to XenbusStateConnected in response to 
frontend_changed() being called and the frontend state is 
XenbusStateConnected.  The xen-frontend.c driver has:

         case XenbusStateInitWait:
                 if (dev->state != XenbusStateInitialising)
                         break;
                 if (xennet_connect(netdev) != 0)
                         break;
                 xenbus_switch_state(dev, XenbusStateConnected);
                 break;

so it seems that if this completes in the domain with the frontend then 
the backend can process this state change before `xl devd` is launched 
and it completely misses the window where the backend state was 
XenbusStateInitWait.

This is the patch I have come up with as a workaround:

diff --git a/tools/libxl/libxl_event.c b/tools/libxl/libxl_event.c
index 7c5387e94f..408afa6ee6 100644
--- a/tools/libxl/libxl_event.c
+++ b/tools/libxl/libxl_event.c
@@ -1052,6 +1052,12 @@ static void devstate_callback(libxl__egc *egc, 
libxl__xswait_state *xsw,
          LOG(DEBUG, "backend %s wanted state %d ok", ds->w.path, 
ds->wanted);
          rc = 0;
      } else {
+        if (ds->wanted == XenbusStateInitWait && got == 
XenbusStateConnected) {
+            LOG(DEBUG, "backend %s wanted state %d interpreting"
+                " state %d as success", ds->w.path, ds->wanted, got);
+            rc = 0;
+            goto out;
+        }
          LOG(DEBUG, "backend %s wanted state %d"" still waiting state 
%d",
              ds->w.path, ds->wanted, got);
          return;

Thanks,
James


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: possible kernel/libxl race with xl network-attach
  2022-01-21 13:34   ` James Dingwall
@ 2022-01-21 14:00     ` Roger Pau Monné
  2022-01-21 15:05       ` James Dingwall
  0 siblings, 1 reply; 7+ messages in thread
From: Roger Pau Monné @ 2022-01-21 14:00 UTC (permalink / raw)
  To: James Dingwall; +Cc: xen-devel

On Fri, Jan 21, 2022 at 01:34:54PM +0000, James Dingwall wrote:
> On 2022-01-13 16:11, Roger Pau Monné wrote:
> > On Thu, Jan 13, 2022 at 11:19:46AM +0000, James Dingwall wrote:
> > > 
> > > I have been trying to debug a problem where a vif with the backend
> > > in a
> > > driver domain is added to dom0.  Intermittently the hotplug script is
> > > not invoked by libxl (running as xl devd) in the driver domain.  By
> > > enabling some debug for the driver domain kernel and libxl I have
> > > these
> > > messages:
> > > 
> > > driver domain kernel (Ubuntu 5.4.0-92-generic):
> > > 
> > > [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0:
> > > Successfully created xenvif
> > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > /local/domain/0/device/vif/0 -> Initialising
> > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > xen_netback:backend_switch_state: backend/vif/0/0 -> InitWait
> > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > /local/domain/0/device/vif/0 -> Connected
> > > [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready
> > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > xen_netback:backend_switch_state: backend/vif/0/0 -> Connected
> > > 
> > > xl devd (Xen 4.14.3):
> > > 
> > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > wpath=/local/domain/2/backend token=3/0: event
> > > epath=/local/domain/2/backend/vif/0/0/state
> > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700:
> > > nested ao, parent 0x5633ac567f90
> > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > epath=/local/domain/2/backend/vif/0/0/state
> > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > libxl_event.c:1055:devstate_callback: backend
> > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > state 4
> > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > wpath=/local/domain/2/backend token=3/0: event
> > > epath=/local/domain/2/backend/vif/0/0/state
> > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220:
> > > nested ao, parent 0x5633ac567f90
> > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > epath=/local/domain/2/backend/vif/0/0/state
> > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > libxl_event.c:1055:devstate_callback: backend
> > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > state 4
> > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > libxl_aoutils.c:88:xswait_timeout_callback: backend
> > > /local/domain/2/backend/vif/0/0/state (hoping for state change to
> > > 2): xswait timeout (path=/local/domain/2/backend/vif/0/0/state)
> > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > libxl_event.c:850:libxl__ev_xswatch_deregister: watch
> > > w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state
> > > token=2/1: deregister slotnum=2
> > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > libxl_event.c:1039:devstate_callback: backend
> > > /local/domain/2/backend/vif/0/0/state wanted state 2  timed out
> > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > w=0x5633ac569180: deregister unregistered
> > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > libxl_device.c:1092:device_backend_callback: calling
> > > device_backend_cleanup
> > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > w=0x5633ac569180: deregister unregistered
> > > 2022-01-13 01:39:51 UTC libxl: error:
> > > libxl_device.c:1105:device_backend_callback: unable to add device
> > > with path /local/domain/2/backend/vif/0/0
> > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > w=0x5633ac569280: deregister unregistered
> > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > libxl_device.c:1470:device_complete: device
> > > /local/domain/2/backend/vif/0/0 add failed
> > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > libxl_event.c:2035:libxl__ao__destroy: ao 0x5633ac568f30: destroy
> > > 
> > > the xenstore content for the backend:
> > > 
> > > # xenstore-ls /local/domain/2/backend/vif/0
> > > 0 = ""
> > >  frontend = "/local/domain/0/device/vif/0"
> > >  frontend-id = "0"
> > >  online = "1"
> > >  state = "4"
> > >  script = "/etc/xen/scripts/vif-zynstra"
> > >  vifname = "dom0.0"
> > >  mac = "00:16:3e:6c:de:82"
> > >  bridge = "cluster"
> > >  handle = "0"
> > >  type = "vif"
> > >  feature-sg = "1"
> > >  feature-gso-tcpv4 = "1"
> > >  feature-gso-tcpv6 = "1"
> > >  feature-ipv6-csum-offload = "1"
> > >  feature-rx-copy = "1"
> > >  feature-rx-flip = "0"
> > >  feature-multicast-control = "1"
> > >  feature-dynamic-multicast-control = "1"
> > >  feature-split-event-channels = "1"
> > >  multi-queue-max-queues = "2"
> > >  feature-ctrl-ring = "1"
> > >  hotplug-status = "connected"

The relevant point here is that `hotplug-status = "connected"` in the
backend xenstore nodes, and that's set by the hotplug script.

Having hotplug-status == "connected" will allow the backend to proceed
to the connected state, so there's some component in your system that
sets this xenstore node before `xl devd` get a change to run the
hotplug script, hence me asking for any udev rules in the previous
email.

If it's not an udev rule then I'm lost. Do you have any modifications
to the Xen tools that could set this xenstore node?

Regards, Roger.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: possible kernel/libxl race with xl network-attach
  2022-01-21 14:00     ` Roger Pau Monné
@ 2022-01-21 15:05       ` James Dingwall
  2022-01-24  9:07         ` Roger Pau Monné
  0 siblings, 1 reply; 7+ messages in thread
From: James Dingwall @ 2022-01-21 15:05 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel

On Fri, Jan 21, 2022 at 03:00:29PM +0100, Roger Pau Monné wrote:
> On Fri, Jan 21, 2022 at 01:34:54PM +0000, James Dingwall wrote:
> > On 2022-01-13 16:11, Roger Pau Monné wrote:
> > > On Thu, Jan 13, 2022 at 11:19:46AM +0000, James Dingwall wrote:
> > > > 
> > > > I have been trying to debug a problem where a vif with the backend
> > > > in a
> > > > driver domain is added to dom0.  Intermittently the hotplug script is
> > > > not invoked by libxl (running as xl devd) in the driver domain.  By
> > > > enabling some debug for the driver domain kernel and libxl I have
> > > > these
> > > > messages:
> > > > 
> > > > driver domain kernel (Ubuntu 5.4.0-92-generic):
> > > > 
> > > > [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0:
> > > > Successfully created xenvif
> > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > /local/domain/0/device/vif/0 -> Initialising
> > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > xen_netback:backend_switch_state: backend/vif/0/0 -> InitWait
> > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > /local/domain/0/device/vif/0 -> Connected
> > > > [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready
> > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > xen_netback:backend_switch_state: backend/vif/0/0 -> Connected
> > > > 
> > > > xl devd (Xen 4.14.3):
> > > > 
> > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > wpath=/local/domain/2/backend token=3/0: event
> > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700:
> > > > nested ao, parent 0x5633ac567f90
> > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > libxl_event.c:1055:devstate_callback: backend
> > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > state 4
> > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > wpath=/local/domain/2/backend token=3/0: event
> > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220:
> > > > nested ao, parent 0x5633ac567f90
> > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > libxl_event.c:1055:devstate_callback: backend
> > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > state 4
> > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > libxl_aoutils.c:88:xswait_timeout_callback: backend
> > > > /local/domain/2/backend/vif/0/0/state (hoping for state change to
> > > > 2): xswait timeout (path=/local/domain/2/backend/vif/0/0/state)
> > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > libxl_event.c:850:libxl__ev_xswatch_deregister: watch
> > > > w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state
> > > > token=2/1: deregister slotnum=2
> > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > libxl_event.c:1039:devstate_callback: backend
> > > > /local/domain/2/backend/vif/0/0/state wanted state 2  timed out
> > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > w=0x5633ac569180: deregister unregistered
> > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > libxl_device.c:1092:device_backend_callback: calling
> > > > device_backend_cleanup
> > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > w=0x5633ac569180: deregister unregistered
> > > > 2022-01-13 01:39:51 UTC libxl: error:
> > > > libxl_device.c:1105:device_backend_callback: unable to add device
> > > > with path /local/domain/2/backend/vif/0/0
> > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > w=0x5633ac569280: deregister unregistered
> > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > libxl_device.c:1470:device_complete: device
> > > > /local/domain/2/backend/vif/0/0 add failed
> > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > libxl_event.c:2035:libxl__ao__destroy: ao 0x5633ac568f30: destroy
> > > > 
> > > > the xenstore content for the backend:
> > > > 
> > > > # xenstore-ls /local/domain/2/backend/vif/0
> > > > 0 = ""
> > > >  frontend = "/local/domain/0/device/vif/0"
> > > >  frontend-id = "0"
> > > >  online = "1"
> > > >  state = "4"
> > > >  script = "/etc/xen/scripts/vif-zynstra"
> > > >  vifname = "dom0.0"
> > > >  mac = "00:16:3e:6c:de:82"
> > > >  bridge = "cluster"
> > > >  handle = "0"
> > > >  type = "vif"
> > > >  feature-sg = "1"
> > > >  feature-gso-tcpv4 = "1"
> > > >  feature-gso-tcpv6 = "1"
> > > >  feature-ipv6-csum-offload = "1"
> > > >  feature-rx-copy = "1"
> > > >  feature-rx-flip = "0"
> > > >  feature-multicast-control = "1"
> > > >  feature-dynamic-multicast-control = "1"
> > > >  feature-split-event-channels = "1"
> > > >  multi-queue-max-queues = "2"
> > > >  feature-ctrl-ring = "1"
> > > >  hotplug-status = "connected"
> 
> The relevant point here is that `hotplug-status = "connected"` in the
> backend xenstore nodes, and that's set by the hotplug script.
> 
> Having hotplug-status == "connected" will allow the backend to proceed
> to the connected state, so there's some component in your system that
> sets this xenstore node before `xl devd` get a change to run the
> hotplug script, hence me asking for any udev rules in the previous
> email.
> 
> If it's not an udev rule then I'm lost. Do you have any modifications
> to the Xen tools that could set this xenstore node?

I am wondering if that xenstore content was captured after the environment
had been manually fixed.  I have been able to reproduce it by hotplugging
an interface where libxl isn't patched:


dom0# xl network-attach 0 backend=netdd script=vif-zynstra vifname=dom0.2 bridge=abridge
netdd# xenstore-ls /local/domain/2/backend/vif/0/2
frontend = "/local/domain/0/device/vif/2"
frontend-id = "0"
online = "1"
state = "4"
script = "/etc/xen/scripts/vif-zynstra"
vifname = "dom0.2"
mac = "00:16:3e:5f:fc:51"
bridge = "abridge"
handle = "2"
type = "vif"
feature-sg = "1"
feature-gso-tcpv4 = "1"
feature-gso-tcpv6 = "1"
feature-ipv6-csum-offload = "1"
feature-rx-copy = "1"
feature-rx-flip = "0"
feature-multicast-control = "1"
feature-dynamic-multicast-control = "1"
feature-split-event-channels = "1"
multi-queue-max-queues = "2"
feature-ctrl-ring = "1"

If I have understood the backend kernel code it only waits for the
hotplug-status == "connected" if they key exists which it doesn't
appear to by default.

Thanks,
James


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: possible kernel/libxl race with xl network-attach
  2022-01-21 15:05       ` James Dingwall
@ 2022-01-24  9:07         ` Roger Pau Monné
  2022-01-24 13:38           ` James Dingwall
  0 siblings, 1 reply; 7+ messages in thread
From: Roger Pau Monné @ 2022-01-24  9:07 UTC (permalink / raw)
  To: James Dingwall; +Cc: xen-devel, Wei Liu, Paul Durrant

On Fri, Jan 21, 2022 at 03:05:07PM +0000, James Dingwall wrote:
> On Fri, Jan 21, 2022 at 03:00:29PM +0100, Roger Pau Monné wrote:
> > On Fri, Jan 21, 2022 at 01:34:54PM +0000, James Dingwall wrote:
> > > On 2022-01-13 16:11, Roger Pau Monné wrote:
> > > > On Thu, Jan 13, 2022 at 11:19:46AM +0000, James Dingwall wrote:
> > > > > 
> > > > > I have been trying to debug a problem where a vif with the backend
> > > > > in a
> > > > > driver domain is added to dom0.  Intermittently the hotplug script is
> > > > > not invoked by libxl (running as xl devd) in the driver domain.  By
> > > > > enabling some debug for the driver domain kernel and libxl I have
> > > > > these
> > > > > messages:
> > > > > 
> > > > > driver domain kernel (Ubuntu 5.4.0-92-generic):
> > > > > 
> > > > > [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0:
> > > > > Successfully created xenvif
> > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > > /local/domain/0/device/vif/0 -> Initialising
> > > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> InitWait
> > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > > /local/domain/0/device/vif/0 -> Connected
> > > > > [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready
> > > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> Connected
> > > > > 
> > > > > xl devd (Xen 4.14.3):
> > > > > 
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > > wpath=/local/domain/2/backend token=3/0: event
> > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700:
> > > > > nested ao, parent 0x5633ac567f90
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:1055:devstate_callback: backend
> > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > > state 4
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > > wpath=/local/domain/2/backend token=3/0: event
> > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220:
> > > > > nested ao, parent 0x5633ac567f90
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > libxl_event.c:1055:devstate_callback: backend
> > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > > state 4
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_aoutils.c:88:xswait_timeout_callback: backend
> > > > > /local/domain/2/backend/vif/0/0/state (hoping for state change to
> > > > > 2): xswait timeout (path=/local/domain/2/backend/vif/0/0/state)
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:850:libxl__ev_xswatch_deregister: watch
> > > > > w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state
> > > > > token=2/1: deregister slotnum=2
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:1039:devstate_callback: backend
> > > > > /local/domain/2/backend/vif/0/0/state wanted state 2  timed out
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > w=0x5633ac569180: deregister unregistered
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_device.c:1092:device_backend_callback: calling
> > > > > device_backend_cleanup
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > w=0x5633ac569180: deregister unregistered
> > > > > 2022-01-13 01:39:51 UTC libxl: error:
> > > > > libxl_device.c:1105:device_backend_callback: unable to add device
> > > > > with path /local/domain/2/backend/vif/0/0
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > w=0x5633ac569280: deregister unregistered
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_device.c:1470:device_complete: device
> > > > > /local/domain/2/backend/vif/0/0 add failed
> > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > libxl_event.c:2035:libxl__ao__destroy: ao 0x5633ac568f30: destroy
> > > > > 
> > > > > the xenstore content for the backend:
> > > > > 
> > > > > # xenstore-ls /local/domain/2/backend/vif/0
> > > > > 0 = ""
> > > > >  frontend = "/local/domain/0/device/vif/0"
> > > > >  frontend-id = "0"
> > > > >  online = "1"
> > > > >  state = "4"
> > > > >  script = "/etc/xen/scripts/vif-zynstra"
> > > > >  vifname = "dom0.0"
> > > > >  mac = "00:16:3e:6c:de:82"
> > > > >  bridge = "cluster"
> > > > >  handle = "0"
> > > > >  type = "vif"
> > > > >  feature-sg = "1"
> > > > >  feature-gso-tcpv4 = "1"
> > > > >  feature-gso-tcpv6 = "1"
> > > > >  feature-ipv6-csum-offload = "1"
> > > > >  feature-rx-copy = "1"
> > > > >  feature-rx-flip = "0"
> > > > >  feature-multicast-control = "1"
> > > > >  feature-dynamic-multicast-control = "1"
> > > > >  feature-split-event-channels = "1"
> > > > >  multi-queue-max-queues = "2"
> > > > >  feature-ctrl-ring = "1"
> > > > >  hotplug-status = "connected"
> > 
> > The relevant point here is that `hotplug-status = "connected"` in the
> > backend xenstore nodes, and that's set by the hotplug script.
> > 
> > Having hotplug-status == "connected" will allow the backend to proceed
> > to the connected state, so there's some component in your system that
> > sets this xenstore node before `xl devd` get a change to run the
> > hotplug script, hence me asking for any udev rules in the previous
> > email.
> > 
> > If it's not an udev rule then I'm lost. Do you have any modifications
> > to the Xen tools that could set this xenstore node?
> 
> I am wondering if that xenstore content was captured after the environment
> had been manually fixed.  I have been able to reproduce it by hotplugging
> an interface where libxl isn't patched:
> 
> 
> dom0# xl network-attach 0 backend=netdd script=vif-zynstra vifname=dom0.2 bridge=abridge
> netdd# xenstore-ls /local/domain/2/backend/vif/0/2
> frontend = "/local/domain/0/device/vif/2"
> frontend-id = "0"
> online = "1"
> state = "4"
> script = "/etc/xen/scripts/vif-zynstra"
> vifname = "dom0.2"
> mac = "00:16:3e:5f:fc:51"
> bridge = "abridge"
> handle = "2"
> type = "vif"
> feature-sg = "1"
> feature-gso-tcpv4 = "1"
> feature-gso-tcpv6 = "1"
> feature-ipv6-csum-offload = "1"
> feature-rx-copy = "1"
> feature-rx-flip = "0"
> feature-multicast-control = "1"
> feature-dynamic-multicast-control = "1"
> feature-split-event-channels = "1"
> multi-queue-max-queues = "2"
> feature-ctrl-ring = "1"
> 
> If I have understood the backend kernel code it only waits for the
> hotplug-status == "connected" if they key exists which it doesn't
> appear to by default.

Indeed. I have to admit this is quite weird. I have the following
completely untested patch, could you give it a try?

Adding netback maintainers for feedback on whether it's fine for libxl
to force netback to wait for hotplug script execution. Not sure the
reason why netback doesn't do that by default, but it seems it's been
like that since the module was imported into Linux in 2011.

Thanks, Roger.
---
diff --git a/tools/libs/light/libxl_nic.c b/tools/libs/light/libxl_nic.c
index 0b45469dca..0b9e70c9d1 100644
--- a/tools/libs/light/libxl_nic.c
+++ b/tools/libs/light/libxl_nic.c
@@ -248,6 +248,13 @@ static int libxl__set_xenstore_nic(libxl__gc *gc, uint32_t domid,
     flexarray_append(ro_front, "mtu");
     flexarray_append(ro_front, GCSPRINTF("%u", nic->mtu));
 
+    /*
+     * Force backend to wait for hotplug script execution before switching to
+     * connected state.
+     */
+    flexarray_append(back, "hotplug-status");
+    flexarray_append(back, "");
+
     return 0;
 }
 


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: possible kernel/libxl race with xl network-attach
  2022-01-24  9:07         ` Roger Pau Monné
@ 2022-01-24 13:38           ` James Dingwall
  0 siblings, 0 replies; 7+ messages in thread
From: James Dingwall @ 2022-01-24 13:38 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Wei Liu, Paul Durrant

On Mon, Jan 24, 2022 at 10:07:54AM +0100, Roger Pau Monné wrote:
> On Fri, Jan 21, 2022 at 03:05:07PM +0000, James Dingwall wrote:
> > On Fri, Jan 21, 2022 at 03:00:29PM +0100, Roger Pau Monné wrote:
> > > On Fri, Jan 21, 2022 at 01:34:54PM +0000, James Dingwall wrote:
> > > > On 2022-01-13 16:11, Roger Pau Monné wrote:
> > > > > On Thu, Jan 13, 2022 at 11:19:46AM +0000, James Dingwall wrote:
> > > > > > 
> > > > > > I have been trying to debug a problem where a vif with the backend
> > > > > > in a
> > > > > > driver domain is added to dom0.  Intermittently the hotplug script is
> > > > > > not invoked by libxl (running as xl devd) in the driver domain.  By
> > > > > > enabling some debug for the driver domain kernel and libxl I have
> > > > > > these
> > > > > > messages:
> > > > > > 
> > > > > > driver domain kernel (Ubuntu 5.4.0-92-generic):
> > > > > > 
> > > > > > [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0:
> > > > > > Successfully created xenvif
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > > > /local/domain/0/device/vif/0 -> Initialising
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> InitWait
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed:
> > > > > > /local/domain/0/device/vif/0 -> Connected
> > > > > > [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready
> > > > > > [Thu Jan 13 01:39:31 2022] [26] 470:
> > > > > > xen_netback:backend_switch_state: backend/vif/0/0 -> Connected
> > > > > > 
> > > > > > xl devd (Xen 4.14.3):
> > > > > > 
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > > > wpath=/local/domain/2/backend token=3/0: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700:
> > > > > > nested ao, parent 0x5633ac567f90
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:1055:devstate_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > > > state 4
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528
> > > > > > wpath=/local/domain/2/backend token=3/0: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220:
> > > > > > nested ao, parent 0x5633ac567f90
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180
> > > > > > wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event
> > > > > > epath=/local/domain/2/backend/vif/0/0/state
> > > > > > 2022-01-13 01:39:31 UTC libxl: debug:
> > > > > > libxl_event.c:1055:devstate_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting
> > > > > > state 4
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_aoutils.c:88:xswait_timeout_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state (hoping for state change to
> > > > > > 2): xswait timeout (path=/local/domain/2/backend/vif/0/0/state)
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:850:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569180 wpath=/local/domain/2/backend/vif/0/0/state
> > > > > > token=2/1: deregister slotnum=2
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:1039:devstate_callback: backend
> > > > > > /local/domain/2/backend/vif/0/0/state wanted state 2  timed out
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569180: deregister unregistered
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_device.c:1092:device_backend_callback: calling
> > > > > > device_backend_cleanup
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569180: deregister unregistered
> > > > > > 2022-01-13 01:39:51 UTC libxl: error:
> > > > > > libxl_device.c:1105:device_backend_callback: unable to add device
> > > > > > with path /local/domain/2/backend/vif/0/0
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:864:libxl__ev_xswatch_deregister: watch
> > > > > > w=0x5633ac569280: deregister unregistered
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_device.c:1470:device_complete: device
> > > > > > /local/domain/2/backend/vif/0/0 add failed
> > > > > > 2022-01-13 01:39:51 UTC libxl: debug:
> > > > > > libxl_event.c:2035:libxl__ao__destroy: ao 0x5633ac568f30: destroy
> > > > > > 
> > > > > > the xenstore content for the backend:
> > > > > > 
> > > > > > # xenstore-ls /local/domain/2/backend/vif/0
> > > > > > 0 = ""
> > > > > >  frontend = "/local/domain/0/device/vif/0"
> > > > > >  frontend-id = "0"
> > > > > >  online = "1"
> > > > > >  state = "4"
> > > > > >  script = "/etc/xen/scripts/vif-zynstra"
> > > > > >  vifname = "dom0.0"
> > > > > >  mac = "00:16:3e:6c:de:82"
> > > > > >  bridge = "cluster"
> > > > > >  handle = "0"
> > > > > >  type = "vif"
> > > > > >  feature-sg = "1"
> > > > > >  feature-gso-tcpv4 = "1"
> > > > > >  feature-gso-tcpv6 = "1"
> > > > > >  feature-ipv6-csum-offload = "1"
> > > > > >  feature-rx-copy = "1"
> > > > > >  feature-rx-flip = "0"
> > > > > >  feature-multicast-control = "1"
> > > > > >  feature-dynamic-multicast-control = "1"
> > > > > >  feature-split-event-channels = "1"
> > > > > >  multi-queue-max-queues = "2"
> > > > > >  feature-ctrl-ring = "1"
> > > > > >  hotplug-status = "connected"
> > > 
> > > The relevant point here is that `hotplug-status = "connected"` in the
> > > backend xenstore nodes, and that's set by the hotplug script.
> > > 
> > > Having hotplug-status == "connected" will allow the backend to proceed
> > > to the connected state, so there's some component in your system that
> > > sets this xenstore node before `xl devd` get a change to run the
> > > hotplug script, hence me asking for any udev rules in the previous
> > > email.
> > > 
> > > If it's not an udev rule then I'm lost. Do you have any modifications
> > > to the Xen tools that could set this xenstore node?
> > 
> > I am wondering if that xenstore content was captured after the environment
> > had been manually fixed.  I have been able to reproduce it by hotplugging
> > an interface where libxl isn't patched:
> > 
> > 
> > dom0# xl network-attach 0 backend=netdd script=vif-zynstra vifname=dom0.2 bridge=abridge
> > netdd# xenstore-ls /local/domain/2/backend/vif/0/2
> > frontend = "/local/domain/0/device/vif/2"
> > frontend-id = "0"
> > online = "1"
> > state = "4"
> > script = "/etc/xen/scripts/vif-zynstra"
> > vifname = "dom0.2"
> > mac = "00:16:3e:5f:fc:51"
> > bridge = "abridge"
> > handle = "2"
> > type = "vif"
> > feature-sg = "1"
> > feature-gso-tcpv4 = "1"
> > feature-gso-tcpv6 = "1"
> > feature-ipv6-csum-offload = "1"
> > feature-rx-copy = "1"
> > feature-rx-flip = "0"
> > feature-multicast-control = "1"
> > feature-dynamic-multicast-control = "1"
> > feature-split-event-channels = "1"
> > multi-queue-max-queues = "2"
> > feature-ctrl-ring = "1"
> > 
> > If I have understood the backend kernel code it only waits for the
> > hotplug-status == "connected" if they key exists which it doesn't
> > appear to by default.
> 
> Indeed. I have to admit this is quite weird. I have the following
> completely untested patch, could you give it a try?
> 
> Adding netback maintainers for feedback on whether it's fine for libxl
> to force netback to wait for hotplug script execution. Not sure the
> reason why netback doesn't do that by default, but it seems it's been
> like that since the module was imported into Linux in 2011.
> 
> Thanks, Roger.
> ---
> diff --git a/tools/libs/light/libxl_nic.c b/tools/libs/light/libxl_nic.c
> index 0b45469dca..0b9e70c9d1 100644
> --- a/tools/libs/light/libxl_nic.c
> +++ b/tools/libs/light/libxl_nic.c
> @@ -248,6 +248,13 @@ static int libxl__set_xenstore_nic(libxl__gc *gc, uint32_t domid,
>      flexarray_append(ro_front, "mtu");
>      flexarray_append(ro_front, GCSPRINTF("%u", nic->mtu));
>  
> +    /*
> +     * Force backend to wait for hotplug script execution before switching to
> +     * connected state.
> +     */
> +    flexarray_append(back, "hotplug-status");
> +    flexarray_append(back, "");
> +
>      return 0;
>  }
>  

I rebuilt libxl with this patch and installed it (without reboot) on the
dom0 of the system I was testing with last week.  With a 'for' loop to
add 15 interfaces I didn't see the same problem so initialising xenstore
with the empty hotplug-status key fixes this problem.  I'll add it to
some more of our systems to give it some wider testing.

Thanks,
James


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-01-24 13:39 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-13 11:19 possible kernel/libxl race with xl network-attach James Dingwall
2022-01-13 16:11 ` Roger Pau Monné
2022-01-21 13:34   ` James Dingwall
2022-01-21 14:00     ` Roger Pau Monné
2022-01-21 15:05       ` James Dingwall
2022-01-24  9:07         ` Roger Pau Monné
2022-01-24 13:38           ` James Dingwall

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.