On Tue, May 11, 2021 at 12:46:38PM +0000, Durrant, Paul wrote: > > -----Original Message----- > > From: Marek Marczykowski-Górecki > > Sent: 11 May 2021 11:45 > > To: Durrant, Paul > > Cc: Michael Brown ; paul@xen.org; xen-devel@lists.xenproject.org; > > netdev@vger.kernel.org; wei.liu@kernel.org > > Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching > > > > On Tue, May 11, 2021 at 12:40:54PM +0200, Marek Marczykowski-Górecki wrote: > > > On Tue, May 11, 2021 at 07:06:55AM +0000, Durrant, Paul wrote: > > > > > -----Original Message----- > > > > > From: Marek Marczykowski-Górecki > > > > > Sent: 10 May 2021 20:43 > > > > > To: Michael Brown ; paul@xen.org > > > > > Cc: paul@xen.org; xen-devel@lists.xenproject.org; netdev@vger.kernel.org; wei.liu@kernel.org; > > Durrant, > > > > > Paul > > > > > Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching > > > > > > > > > > On Mon, May 10, 2021 at 08:06:55PM +0100, Michael Brown wrote: > > > > > > If you have a suggested patch, I'm happy to test that it doesn't reintroduce > > > > > > the regression bug that was fixed by this commit. > > > > > > > > > > Actually, I've just tested with a simple reloading xen-netfront module. It > > > > > seems in this case, the hotplug script is not re-executed. In fact, I > > > > > think it should not be re-executed at all, since the vif interface > > > > > remains in place (it just gets NO-CARRIER flag). > > > > > > > > > > This brings a question, why removing hotplug-status in the first place? > > > > > The interface remains correctly configured by the hotplug script after > > > > > all. From the commit message: > > > > > > > > > > xen-netback: remove 'hotplug-status' once it has served its purpose > > > > > > > > > > Removing the 'hotplug-status' node in netback_remove() is wrong; the script > > > > > may not have completed. Only remove the node once the watch has fired and > > > > > has been unregistered. > > > > > > > > > > I think the intention was to remove 'hotplug-status' node _later_ in > > > > > case of quickly adding and removing the interface. Is that right, Paul? > > > > > > > > The removal was done to allow unbind/bind to function correctly. IIRC before the original patch > > doing a bind would stall forever waiting for the hotplug status to change, which would never happen. > > > > > > Hmm, in that case maybe don't remove it at all in the backend, and let > > > it be cleaned up by the toolstack, when it removes other backend-related > > > nodes? > > > > No, unbind/bind _does_ require hotplug script to be called again. > > > > Yes, sorry I was misremembering. My memory is hazy but there was definitely a problem at the time with leaving the node in place. > > > When exactly vif interface appears in the system (starts to be available > > for the hotplug script)? Maybe remove 'hotplug-status' just before that > > point? > > > > I really can't remember any detail. Perhaps try reverting both patches then and check that the unbind/rmmod/modprobe/bind sequence still works (and the backend actually makes it into connected state). Ok, I've tried this. I've reverted both commits, then used your test script from the 9476654bd5e8ad42abe8ee9f9e90069ff8e60c17: This has been tested by running iperf as a server in the test VM and then running a client against it in a continuous loop, whilst also running: while true; do echo vif-$DOMID-$VIF >unbind; echo down; rmmod xen-netback; echo unloaded; modprobe xen-netback; cd $(pwd); brctl addif xenbr0 vif$DOMID.$VIF; ip link set vif$DOMID.$VIF up; echo up; sleep 5; done in dom0 from /sys/bus/xen-backend/drivers/vif to continuously unbind, unload, re-load, re-bind and re-plumb the backend. In fact, the need to call `brctl` and `ip link` manually is exactly because the hotplug script isn't executed. When I execute it manually, the backend properly gets back to working. So, removing 'hotplug-status' was in the correct place (netback_remove). The missing part is the toolstack calling the hotplug script on xen-netback re-bind. In this case, I'm not sure what is the proper way. If I restart xendriverdomain service (I do run the backend in domU), it properly executes hotplug script and the backend interface gets properly configured. But it doesn't do it on its own. It seems to be related to device "state" in xenstore. The specific part of the libxl is backend_watch_callback(): https://github.com/xen-project/xen/blob/master/tools/libs/light/libxl_device.c#L1664 ddev = search_for_device(dguest, dev); if (ddev == NULL && state == XenbusStateClosed) { /* * Spurious state change, device has already been disconnected * or never attached. */ goto skip; } else if (ddev == NULL) { rc = add_device(egc, nested_ao, dguest, dev); if (rc > 0) free_ao = true; } else if (state == XenbusStateClosed && online == 0) { rc = remove_device(egc, nested_ao, dguest, ddev); if (rc > 0) free_ao = true; check_and_maybe_remove_guest(gc, ddomain, dguest); } In short: if device gets XenbusStateInitWait for the first time (ddev == NULL case), it goes to add_device() which executes the hotplug script and stores the device. Then, if device goes to XenbusStateClosed + online==0 state, then it executes hotplug script again (with "offline" parameter) and forgets the device. If you unbind the driver, the device stays in XenbusStateConnected state (in xenstore), and after you bind it again, it goes to XenbusStateInitWait. It don't think it goes through XenbusStateClosed, and online stays at 1 too, so libxl doesn't execute the hotplug script again. Some solution could be to add an extra case at the end, like "ddev != NULL && state == XenbusStateInitWait && hotplug-status != connected". And make sure xl devd won't call the same hotplug script multiple times for the same device _at the same time_ (I'm not sure about the async machinery here). But even if xl devd (aka xendriverdomain service) gets "fixes" to execute hotplug script in that case, I don't think it would work in backend in dom0 case - there, I think nothing watches already configured vif interfaces (there is no xl devd daemon in dom0, and xl background process watches only domain death and cdrom eject events). I'm adding toolstack maintainers, maybe they'll have some idea... In any case, the issue is not calling the hotplug script, responsible for configuring newly created vif interface. Not kernel waiting for it. So, I think both commits should still be reverted. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab