xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
From: James Dingwall <james-xen@dingwall.me.uk>
To: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: xen-devel@lists.xenproject.org
Subject: Re: possible kernel/libxl race with xl network-attach
Date: Fri, 21 Jan 2022 13:34:54 +0000	[thread overview]
Message-ID: <1671ad485c35fb86d758951889cfa05d@dingwall.me.uk> (raw)
In-Reply-To: <YeBPG+8D00xG4h0j@Air-de-Roger>

On 2022-01-13 16:11, Roger Pau Monné wrote:
> On Thu, Jan 13, 2022 at 11:19:46AM +0000, James Dingwall wrote:
>> 
>> I have been trying to debug a problem where a vif with the backend in 
>> a
>> driver domain is added to dom0.  Intermittently the hotplug script is
>> not invoked by libxl (running as xl devd) in the driver domain.  By
>> enabling some debug for the driver domain kernel and libxl I have 
>> these
>> messages:
>> 
>> driver domain kernel (Ubuntu 5.4.0-92-generic):
>> 
>> [Thu Jan 13 01:39:31 2022] [1408] 564: vif vif-0-0 vif0.0: 
>> Successfully created xenvif
>> [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: 
>> /local/domain/0/device/vif/0 -> Initialising
>> [Thu Jan 13 01:39:31 2022] [26] 470: xen_netback:backend_switch_state: 
>> backend/vif/0/0 -> InitWait
>> [Thu Jan 13 01:39:31 2022] [26] 583: xen_netback:frontend_changed: 
>> /local/domain/0/device/vif/0 -> Connected
>> [Thu Jan 13 01:39:31 2022] vif vif-0-0 vif0.0: Guest Rx ready
>> [Thu Jan 13 01:39:31 2022] [26] 470: xen_netback:backend_switch_state: 
>> backend/vif/0/0 -> Connected
>> 
>> xl devd (Xen 4.14.3):
>> 
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 
>> wpath=/local/domain/2/backend token=3/0: event 
>> epath=/local/domain/2/backend/vif/0/0/state
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac569700: nested 
>> ao, parent 0x5633ac567f90
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 
>> wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event 
>> epath=/local/domain/2/backend/vif/0/0/state
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:1055:devstate_callback: backend 
>> /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting 
>> state 4
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:750:watchfd_callback: watch w=0x7ffd416b0528 
>> wpath=/local/domain/2/backend token=3/0: event 
>> epath=/local/domain/2/backend/vif/0/0/state
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:2445:libxl__nested_ao_create: ao 0x5633ac56a220: nested 
>> ao, parent 0x5633ac567f90
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:750:watchfd_callback: watch w=0x5633ac569180 
>> wpath=/local/domain/2/backend/vif/0/0/state token=2/1: event 
>> epath=/local/domain/2/backend/vif/0/0/state
>> 2022-01-13 01:39:31 UTC libxl: debug: 
>> libxl_event.c:1055:devstate_callback: backend 
>> /local/domain/2/backend/vif/0/0/state wanted state 2 still waiting 
>> state 4
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_aoutils.c:88:xswait_timeout_callback: backend 
>> /local/domain/2/backend/vif/0/0/state (hoping for state change to 2): 
>> xswait timeout (path=/local/domain/2/backend/vif/0/0/state)
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:850:libxl__ev_xswatch_deregister: watch w=0x5633ac569180 
>> wpath=/local/domain/2/backend/vif/0/0/state token=2/1: deregister 
>> slotnum=2
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:1039:devstate_callback: backend 
>> /local/domain/2/backend/vif/0/0/state wanted state 2  timed out
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:864:libxl__ev_xswatch_deregister: watch 
>> w=0x5633ac569180: deregister unregistered
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_device.c:1092:device_backend_callback: calling 
>> device_backend_cleanup
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:864:libxl__ev_xswatch_deregister: watch 
>> w=0x5633ac569180: deregister unregistered
>> 2022-01-13 01:39:51 UTC libxl: error: 
>> libxl_device.c:1105:device_backend_callback: unable to add device with 
>> path /local/domain/2/backend/vif/0/0
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:864:libxl__ev_xswatch_deregister: watch 
>> w=0x5633ac569280: deregister unregistered
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_device.c:1470:device_complete: device 
>> /local/domain/2/backend/vif/0/0 add failed
>> 2022-01-13 01:39:51 UTC libxl: debug: 
>> libxl_event.c:2035:libxl__ao__destroy: ao 0x5633ac568f30: destroy
>> 
>> the xenstore content for the backend:
>> 
>> # xenstore-ls /local/domain/2/backend/vif/0
>> 0 = ""
>>  frontend = "/local/domain/0/device/vif/0"
>>  frontend-id = "0"
>>  online = "1"
>>  state = "4"
>>  script = "/etc/xen/scripts/vif-zynstra"
>>  vifname = "dom0.0"
>>  mac = "00:16:3e:6c:de:82"
>>  bridge = "cluster"
>>  handle = "0"
>>  type = "vif"
>>  feature-sg = "1"
>>  feature-gso-tcpv4 = "1"
>>  feature-gso-tcpv6 = "1"
>>  feature-ipv6-csum-offload = "1"
>>  feature-rx-copy = "1"
>>  feature-rx-flip = "0"
>>  feature-multicast-control = "1"
>>  feature-dynamic-multicast-control = "1"
>>  feature-split-event-channels = "1"
>>  multi-queue-max-queues = "2"
>>  feature-ctrl-ring = "1"
>>  hotplug-status = "connected"
>> 
>> My guess is that the libxl callback is started waiting for the backend
>> state key to be set to XenbusStateInitWait (2) but the frontend in 
>> dom0
>> has already triggered the backend to transition to 
>> XenbusStateConnected
>> (4) and therefore it does not successfully complete.
> 
> I think your analysis is correct. The question is who sets
> hotplug-status to "connected" in xenstore if the hotplug script in the
> driver domain hasn't been launched by `xl devd`.
> 
> It would seem to me some other entity is launching the hotplug
> scripts, do you have some udev rules in the driver domain that do that
> and race with `xl devd` maybe?

No, there are no udev rules for xen network devices.  What it looks like 
is that the in linux/drivers/net/xen-netback/xenbus.c that the backend 
state is changed to XenbusStateConnected in response to 
frontend_changed() being called and the frontend state is 
XenbusStateConnected.  The xen-frontend.c driver has:

         case XenbusStateInitWait:
                 if (dev->state != XenbusStateInitialising)
                         break;
                 if (xennet_connect(netdev) != 0)
                         break;
                 xenbus_switch_state(dev, XenbusStateConnected);
                 break;

so it seems that if this completes in the domain with the frontend then 
the backend can process this state change before `xl devd` is launched 
and it completely misses the window where the backend state was 
XenbusStateInitWait.

This is the patch I have come up with as a workaround:

diff --git a/tools/libxl/libxl_event.c b/tools/libxl/libxl_event.c
index 7c5387e94f..408afa6ee6 100644
--- a/tools/libxl/libxl_event.c
+++ b/tools/libxl/libxl_event.c
@@ -1052,6 +1052,12 @@ static void devstate_callback(libxl__egc *egc, 
libxl__xswait_state *xsw,
          LOG(DEBUG, "backend %s wanted state %d ok", ds->w.path, 
ds->wanted);
          rc = 0;
      } else {
+        if (ds->wanted == XenbusStateInitWait && got == 
XenbusStateConnected) {
+            LOG(DEBUG, "backend %s wanted state %d interpreting"
+                " state %d as success", ds->w.path, ds->wanted, got);
+            rc = 0;
+            goto out;
+        }
          LOG(DEBUG, "backend %s wanted state %d"" still waiting state 
%d",
              ds->w.path, ds->wanted, got);
          return;

Thanks,
James


  reply	other threads:[~2022-01-21 13:35 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-13 11:19 possible kernel/libxl race with xl network-attach James Dingwall
2022-01-13 16:11 ` Roger Pau Monné
2022-01-21 13:34   ` James Dingwall [this message]
2022-01-21 14:00     ` Roger Pau Monné
2022-01-21 15:05       ` James Dingwall
2022-01-24  9:07         ` Roger Pau Monné
2022-01-24 13:38           ` James Dingwall

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1671ad485c35fb86d758951889cfa05d@dingwall.me.uk \
    --to=james-xen@dingwall.me.uk \
    --cc=roger.pau@citrix.com \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).