xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* xen-netback hotplug-status regression bug
@ 2021-04-10 18:25 Michael Brown
  2021-04-13  7:12 ` Paul Durrant
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Brown @ 2021-04-10 18:25 UTC (permalink / raw)
  To: Wei Liu, Paul Durrant, xen-devel, netdev, Paul Durrant

Commit https://github.com/torvalds/linux/commit/1f25657 ("xen-netback: 
remove 'hotplug-status' once it has served its purpose") seems to have 
introduced a regression that prevents a vif frontend from transitioning 
more than once into Connected state.

As far as I can tell:

- The defined vif script (e.g. /etc/xen/scripts/vif-bridge) executes 
only once, at domU startup, and sets 
backend/vif/<domU>/0/hotplug-status="connected"

- When the frontend first enters Connected state, 
drivers/net/xen-netback/xenbus.c's connect() sets up a watch on 
"hotplug-status" with the callback function hotplug_status_changed()

- When hotplug_status_changed() is triggered by the watch, it 
transitions the backend to Connected state and calls xenbus_rm() to 
delete the "hotplug-status" attribute.

If the frontend subsequently disconnects and reconnects (e.g. 
transitions through Closed->Initialising->Connected) then:

- Nothing recreates "hotplug-status"

- When the frontend re-enters Connected state, connect() sets up a watch 
on "hotplug-status" again

- The callback hotplug_status_changed() is never triggered, and so the 
backend device never transitions to Connected state.


Reverting the commit would fix this bug, but would obviously also 
reintroduce the race condition that the commit was designed to avoid.

I'm happy to put together a patch, if one of the maintainers could 
suggest a sensible design approach.

I'm not a list member, so please CC me directly on replies.

Thanks,

Michael


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: xen-netback hotplug-status regression bug
  2021-04-10 18:25 xen-netback hotplug-status regression bug Michael Brown
@ 2021-04-13  7:12 ` Paul Durrant
  2021-04-13 10:48   ` Michael Brown
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Durrant @ 2021-04-13  7:12 UTC (permalink / raw)
  To: Michael Brown, Wei Liu, xen-devel, netdev, Paul Durrant

On 10/04/2021 19:25, Michael Brown wrote:
> Commit https://github.com/torvalds/linux/commit/1f25657 ("xen-netback: 
> remove 'hotplug-status' once it has served its purpose") seems to have 
> introduced a regression that prevents a vif frontend from transitioning 
> more than once into Connected state.
> 
> As far as I can tell:
> 
> - The defined vif script (e.g. /etc/xen/scripts/vif-bridge) executes 
> only once, at domU startup, and sets 
> backend/vif/<domU>/0/hotplug-status="connected"
> 
> - When the frontend first enters Connected state, 
> drivers/net/xen-netback/xenbus.c's connect() sets up a watch on 
> "hotplug-status" with the callback function hotplug_status_changed()
> 
> - When hotplug_status_changed() is triggered by the watch, it 
> transitions the backend to Connected state and calls xenbus_rm() to 
> delete the "hotplug-status" attribute.
> 
> If the frontend subsequently disconnects and reconnects (e.g. 
> transitions through Closed->Initialising->Connected) then:
> 
> - Nothing recreates "hotplug-status"
> 
> - When the frontend re-enters Connected state, connect() sets up a watch 
> on "hotplug-status" again
> 
> - The callback hotplug_status_changed() is never triggered, and so the 
> backend device never transitions to Connected state.
> 

That's not how I read it. Given that "hotplug-status" is removed by the 
call to hotplug_status_changed() then the next call to connect() should 
fail to register the watch and 'have_hotplug_status_watch' should be 0. 
Thus backend_switch_state() should not defer the transition to 
XenbusStateConnected in any subsequent interaction with the frontend.


> 
> Reverting the commit would fix this bug, but would obviously also 
> reintroduce the race condition that the commit was designed to avoid.
> 
> I'm happy to put together a patch, if one of the maintainers could 
> suggest a sensible design approach.
>

Are you seeing the watch successfully re-registered even though the node 
does not exist? Perhaps there has been a change in xenstore behaviour?

   Paul

> I'm not a list member, so please CC me directly on replies.
> 
> Thanks,
> 
> Michael



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: xen-netback hotplug-status regression bug
  2021-04-13  7:12 ` Paul Durrant
@ 2021-04-13 10:48   ` Michael Brown
  2021-04-13 10:55     ` Paul Durrant
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Brown @ 2021-04-13 10:48 UTC (permalink / raw)
  To: paul, Wei Liu, xen-devel, netdev, Paul Durrant

On 13/04/2021 08:12, Paul Durrant wrote:
>> If the frontend subsequently disconnects and reconnects (e.g. 
>> transitions through Closed->Initialising->Connected) then:
>>
>> - Nothing recreates "hotplug-status"
>>
>> - When the frontend re-enters Connected state, connect() sets up a 
>> watch on "hotplug-status" again
>>
>> - The callback hotplug_status_changed() is never triggered, and so the 
>> backend device never transitions to Connected state.
> 
> That's not how I read it. Given that "hotplug-status" is removed by the 
> call to hotplug_status_changed() then the next call to connect() should 
> fail to register the watch and 'have_hotplug_status_watch' should be 0. 
> Thus backend_switch_state() should not defer the transition to 
> XenbusStateConnected in any subsequent interaction with the frontend.

Thank you for the reply.  I've tested and confirmed my initial 
hypothesis: the call to xenbus_watch_pathfmt() succeeds even if the node 
does not exist.

I confirmed this with ftrace using:

   cd /sys/kernel/debug/tracing
   echo function_graph > current_tracer
   echo set_backend_state > set_ftrace_filter
   echo xenbus_watch_pathfmt >> set_ftrace_filter
   echo register_xenbus_watch >> set_ftrace_filter
   echo xenbus_dev_fatal >> set_ftrace_filter

On the second time that the frontend transitions to Connected, this 
produced the trace:

   set_backend_state [xen_netback]() {
     register_xenbus_watch();
     register_xenbus_watch();
     xenbus_watch_pathfmt() {
       register_xenbus_watch();
     }
   }

which seems to confirm that the error path in xenbus_watch_path() is 
*not* taken, i.e. that the call to register_xenbus_watch() succeeded 
even though the node did not exist.


Other observations also seem to confirm this behaviour:

- Running "xenstore ls" in dom0 confirms that on the second frontend 
transition to Connected, the frontend state is indeed Connected (4) but 
the backend state remains in InitWait (2)

- Running "xenstore watch 
/local/domain/0/backend/vif/<domU>/0/hotplug-status" *before* starting 
the domU confirms that it is possible to create a watch on a node that 
does not (yet) exist, and that the watch *is* notified when the node is 
later created.

> Are you seeing the watch successfully re-registered even though the node 
> does not exist? Perhaps there has been a change in xenstore behaviour?

So, the TL;DR is that yes, the watch does successfully register even 
though the node does not exist.

 From a quick look through the xenstored source, it looks as though the 
only check on the node name is the call to is_valid_nodename(), which 
seems to perform a syntactic validity check only.  I can't immediately 
find any commit that would have changed this behaviour.

Thanks,

Michael


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: xen-netback hotplug-status regression bug
  2021-04-13 10:48   ` Michael Brown
@ 2021-04-13 10:55     ` Paul Durrant
  2021-04-13 15:14       ` Michael Brown
  2021-04-13 15:25       ` [PATCH] xen-netback: Check for hotplug-status existence before watching Michael Brown
  0 siblings, 2 replies; 23+ messages in thread
From: Paul Durrant @ 2021-04-13 10:55 UTC (permalink / raw)
  To: Michael Brown, Wei Liu, xen-devel, netdev, Paul Durrant

On 13/04/2021 11:48, Michael Brown wrote:
> On 13/04/2021 08:12, Paul Durrant wrote:
>>> If the frontend subsequently disconnects and reconnects (e.g. 
>>> transitions through Closed->Initialising->Connected) then:
>>>
>>> - Nothing recreates "hotplug-status"
>>>
>>> - When the frontend re-enters Connected state, connect() sets up a 
>>> watch on "hotplug-status" again
>>>
>>> - The callback hotplug_status_changed() is never triggered, and so 
>>> the backend device never transitions to Connected state.
>>
>> That's not how I read it. Given that "hotplug-status" is removed by 
>> the call to hotplug_status_changed() then the next call to connect() 
>> should fail to register the watch and 'have_hotplug_status_watch' 
>> should be 0. Thus backend_switch_state() should not defer the 
>> transition to XenbusStateConnected in any subsequent interaction with 
>> the frontend.
> 
> Thank you for the reply.  I've tested and confirmed my initial 
> hypothesis: the call to xenbus_watch_pathfmt() succeeds even if the node 
> does not exist.
> 
> I confirmed this with ftrace using:
> 
>   cd /sys/kernel/debug/tracing
>   echo function_graph > current_tracer
>   echo set_backend_state > set_ftrace_filter
>   echo xenbus_watch_pathfmt >> set_ftrace_filter
>   echo register_xenbus_watch >> set_ftrace_filter
>   echo xenbus_dev_fatal >> set_ftrace_filter
> 
> On the second time that the frontend transitions to Connected, this 
> produced the trace:
> 
>   set_backend_state [xen_netback]() {
>     register_xenbus_watch();
>     register_xenbus_watch();
>     xenbus_watch_pathfmt() {
>       register_xenbus_watch();
>     }
>   }
> 
> which seems to confirm that the error path in xenbus_watch_path() is 
> *not* taken, i.e. that the call to register_xenbus_watch() succeeded 
> even though the node did not exist.
> 
> 
> Other observations also seem to confirm this behaviour:
> 
> - Running "xenstore ls" in dom0 confirms that on the second frontend 
> transition to Connected, the frontend state is indeed Connected (4) but 
> the backend state remains in InitWait (2)
> 
> - Running "xenstore watch 
> /local/domain/0/backend/vif/<domU>/0/hotplug-status" *before* starting 
> the domU confirms that it is possible to create a watch on a node that 
> does not (yet) exist, and that the watch *is* notified when the node is 
> later created.
> 
>> Are you seeing the watch successfully re-registered even though the 
>> node does not exist? Perhaps there has been a change in xenstore 
>> behaviour?
> 
> So, the TL;DR is that yes, the watch does successfully register even 
> though the node does not exist.
> 
>  From a quick look through the xenstored source, it looks as though the 
> only check on the node name is the call to is_valid_nodename(), which 
> seems to perform a syntactic validity check only.  I can't immediately 
> find any commit that would have changed this behaviour.
> 

Ok, so it sound like this was probably my misunderstanding of xenstore 
semantics in the first place (although I'm sure I remember watch 
registration failing for non-existent nodes at some point in the past... 
that may have been with a non-upstream version of oxenstored though).

Anyway... a reasonable fix would therefore be to read the node first and 
only register the watch if it does exist.

   Paul

> Thanks,
> 
> Michael



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: xen-netback hotplug-status regression bug
  2021-04-13 10:55     ` Paul Durrant
@ 2021-04-13 15:14       ` Michael Brown
  2021-04-13 15:25       ` [PATCH] xen-netback: Check for hotplug-status existence before watching Michael Brown
  1 sibling, 0 replies; 23+ messages in thread
From: Michael Brown @ 2021-04-13 15:14 UTC (permalink / raw)
  To: paul, Wei Liu, xen-devel, netdev, Paul Durrant

On 13/04/2021 11:55, Paul Durrant wrote:
> Ok, so it sound like this was probably my misunderstanding of xenstore 
> semantics in the first place (although I'm sure I remember watch 
> registration failing for non-existent nodes at some point in the past... 
> that may have been with a non-upstream version of oxenstored though).
> 
> Anyway... a reasonable fix would therefore be to read the node first and 
> only register the watch if it does exist.

Thanks.  Patch coming up shortly!

Michael



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-04-13 10:55     ` Paul Durrant
  2021-04-13 15:14       ` Michael Brown
@ 2021-04-13 15:25       ` Michael Brown
  2021-04-13 19:12         ` Paul Durrant
                           ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Michael Brown @ 2021-04-13 15:25 UTC (permalink / raw)
  To: paul, xen-devel, netdev, wei.liu, pdurrant; +Cc: Michael Brown

The logic in connect() is currently written with the assumption that
xenbus_watch_pathfmt() will return an error for a node that does not
exist.  This assumption is incorrect: xenstore does allow a watch to
be registered for a nonexistent node (and will send notifications
should the node be subsequently created).

As of commit 1f2565780 ("xen-netback: remove 'hotplug-status' once it
has served its purpose"), this leads to a failure when a domU
transitions into XenbusStateConnected more than once.  On the first
domU transition into Connected state, the "hotplug-status" node will
be deleted by the hotplug_status_changed() callback in dom0.  On the
second or subsequent domU transition into Connected state, the
hotplug_status_changed() callback will therefore never be invoked, and
so the backend will remain stuck in InitWait.

This failure prevents scenarios such as reloading the xen-netfront
module within a domU, or booting a domU via iPXE.  There is
unfortunately no way for the domU to work around this dom0 bug.

Fix by explicitly checking for existence of the "hotplug-status" node,
thereby creating the behaviour that was previously assumed to exist.

Signed-off-by: Michael Brown <mbrown@fensystems.co.uk>
---
 drivers/net/xen-netback/xenbus.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
index a5439c130130..d24b7a7993aa 100644
--- a/drivers/net/xen-netback/xenbus.c
+++ b/drivers/net/xen-netback/xenbus.c
@@ -824,11 +824,15 @@ static void connect(struct backend_info *be)
 	xenvif_carrier_on(be->vif);
 
 	unregister_hotplug_status_watch(be);
-	err = xenbus_watch_pathfmt(dev, &be->hotplug_status_watch, NULL,
-				   hotplug_status_changed,
-				   "%s/%s", dev->nodename, "hotplug-status");
-	if (!err)
+	if (xenbus_exists(XBT_NIL, dev->nodename, "hotplug-status")) {
+		err = xenbus_watch_pathfmt(dev, &be->hotplug_status_watch,
+					   NULL, hotplug_status_changed,
+					   "%s/%s", dev->nodename,
+					   "hotplug-status");
+		if (err)
+			goto err;
 		be->have_hotplug_status_watch = 1;
+	}
 
 	netif_tx_wake_all_queues(be->vif->dev);
 
-- 
2.29.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-04-13 15:25       ` [PATCH] xen-netback: Check for hotplug-status existence before watching Michael Brown
@ 2021-04-13 19:12         ` Paul Durrant
  2021-04-13 22:30         ` patchwork-bot+netdevbpf
  2021-05-10 18:32         ` Marek Marczykowski-Górecki
  2 siblings, 0 replies; 23+ messages in thread
From: Paul Durrant @ 2021-04-13 19:12 UTC (permalink / raw)
  To: Michael Brown, xen-devel, netdev, wei.liu, pdurrant

On 13/04/2021 16:25, Michael Brown wrote:
> The logic in connect() is currently written with the assumption that
> xenbus_watch_pathfmt() will return an error for a node that does not
> exist.  This assumption is incorrect: xenstore does allow a watch to
> be registered for a nonexistent node (and will send notifications
> should the node be subsequently created).
> 
> As of commit 1f2565780 ("xen-netback: remove 'hotplug-status' once it
> has served its purpose"), this leads to a failure when a domU
> transitions into XenbusStateConnected more than once.  On the first
> domU transition into Connected state, the "hotplug-status" node will
> be deleted by the hotplug_status_changed() callback in dom0.  On the
> second or subsequent domU transition into Connected state, the
> hotplug_status_changed() callback will therefore never be invoked, and
> so the backend will remain stuck in InitWait.
> 
> This failure prevents scenarios such as reloading the xen-netfront
> module within a domU, or booting a domU via iPXE.  There is
> unfortunately no way for the domU to work around this dom0 bug.
> 
> Fix by explicitly checking for existence of the "hotplug-status" node,
> thereby creating the behaviour that was previously assumed to exist.
> 
> Signed-off-by: Michael Brown <mbrown@fensystems.co.uk>

Reviewed-by: Paul Durrant <paul@xen.org>

> ---
>   drivers/net/xen-netback/xenbus.c | 12 ++++++++----
>   1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
> index a5439c130130..d24b7a7993aa 100644
> --- a/drivers/net/xen-netback/xenbus.c
> +++ b/drivers/net/xen-netback/xenbus.c
> @@ -824,11 +824,15 @@ static void connect(struct backend_info *be)
>   	xenvif_carrier_on(be->vif);
>   
>   	unregister_hotplug_status_watch(be);
> -	err = xenbus_watch_pathfmt(dev, &be->hotplug_status_watch, NULL,
> -				   hotplug_status_changed,
> -				   "%s/%s", dev->nodename, "hotplug-status");
> -	if (!err)
> +	if (xenbus_exists(XBT_NIL, dev->nodename, "hotplug-status")) {
> +		err = xenbus_watch_pathfmt(dev, &be->hotplug_status_watch,
> +					   NULL, hotplug_status_changed,
> +					   "%s/%s", dev->nodename,
> +					   "hotplug-status");
> +		if (err)
> +			goto err;
>   		be->have_hotplug_status_watch = 1;
> +	}
>   
>   	netif_tx_wake_all_queues(be->vif->dev);
>   
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-04-13 15:25       ` [PATCH] xen-netback: Check for hotplug-status existence before watching Michael Brown
  2021-04-13 19:12         ` Paul Durrant
@ 2021-04-13 22:30         ` patchwork-bot+netdevbpf
  2021-05-10 18:32         ` Marek Marczykowski-Górecki
  2 siblings, 0 replies; 23+ messages in thread
From: patchwork-bot+netdevbpf @ 2021-04-13 22:30 UTC (permalink / raw)
  To: Michael Brown; +Cc: paul, xen-devel, netdev, wei.liu, pdurrant

Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Tue, 13 Apr 2021 16:25:12 +0100 you wrote:
> The logic in connect() is currently written with the assumption that
> xenbus_watch_pathfmt() will return an error for a node that does not
> exist.  This assumption is incorrect: xenstore does allow a watch to
> be registered for a nonexistent node (and will send notifications
> should the node be subsequently created).
> 
> As of commit 1f2565780 ("xen-netback: remove 'hotplug-status' once it
> has served its purpose"), this leads to a failure when a domU
> transitions into XenbusStateConnected more than once.  On the first
> domU transition into Connected state, the "hotplug-status" node will
> be deleted by the hotplug_status_changed() callback in dom0.  On the
> second or subsequent domU transition into Connected state, the
> hotplug_status_changed() callback will therefore never be invoked, and
> so the backend will remain stuck in InitWait.
> 
> [...]

Here is the summary with links:
  - xen-netback: Check for hotplug-status existence before watching
    https://git.kernel.org/netdev/net/c/2afeec08ab5c

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-04-13 15:25       ` [PATCH] xen-netback: Check for hotplug-status existence before watching Michael Brown
  2021-04-13 19:12         ` Paul Durrant
  2021-04-13 22:30         ` patchwork-bot+netdevbpf
@ 2021-05-10 18:32         ` Marek Marczykowski-Górecki
  2021-05-10 18:47           ` Michael Brown
  2 siblings, 1 reply; 23+ messages in thread
From: Marek Marczykowski-Górecki @ 2021-05-10 18:32 UTC (permalink / raw)
  To: Michael Brown; +Cc: paul, xen-devel, netdev, wei.liu, pdurrant

[-- Attachment #1: Type: text/plain, Size: 3231 bytes --]

On Tue, Apr 13, 2021 at 04:25:12PM +0100, Michael Brown wrote:
> The logic in connect() is currently written with the assumption that
> xenbus_watch_pathfmt() will return an error for a node that does not
> exist.  This assumption is incorrect: xenstore does allow a watch to
> be registered for a nonexistent node (and will send notifications
> should the node be subsequently created).
> 
> As of commit 1f2565780 ("xen-netback: remove 'hotplug-status' once it
> has served its purpose"), this leads to a failure when a domU
> transitions into XenbusStateConnected more than once.  On the first
> domU transition into Connected state, the "hotplug-status" node will
> be deleted by the hotplug_status_changed() callback in dom0.  On the
> second or subsequent domU transition into Connected state, the
> hotplug_status_changed() callback will therefore never be invoked, and
> so the backend will remain stuck in InitWait.
> 
> This failure prevents scenarios such as reloading the xen-netfront
> module within a domU, or booting a domU via iPXE.  There is
> unfortunately no way for the domU to work around this dom0 bug.
> 
> Fix by explicitly checking for existence of the "hotplug-status" node,
> thereby creating the behaviour that was previously assumed to exist.

This change is wrong. The 'hotplug-status' node is created _only_ by a
hotplug script and done so when it's executed. When kernel waits for
hotplug script to be executed it waits for the node to _appear_, not
_change_. So, this change basically made the kernel not waiting for the
hotplug script at all.

Furthermore, there is an additional side effect: in case of a driver
domain, xl devd may be started after the backend node is created (this
may happen if you start the frontend domain in parallel with the
backend). In this case, 'xl devd' will see the vif backend in
XenbusStateConnected state already and will not execute hotplug script
at all.

I think the proper fix is to re-register the watch when necessary,
instead of not registering it at all.

> Signed-off-by: Michael Brown <mbrown@fensystems.co.uk>
> ---
>  drivers/net/xen-netback/xenbus.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/xen-netback/xenbus.c b/drivers/net/xen-netback/xenbus.c
> index a5439c130130..d24b7a7993aa 100644
> --- a/drivers/net/xen-netback/xenbus.c
> +++ b/drivers/net/xen-netback/xenbus.c
> @@ -824,11 +824,15 @@ static void connect(struct backend_info *be)
>  	xenvif_carrier_on(be->vif);
>  
>  	unregister_hotplug_status_watch(be);
> -	err = xenbus_watch_pathfmt(dev, &be->hotplug_status_watch, NULL,
> -				   hotplug_status_changed,
> -				   "%s/%s", dev->nodename, "hotplug-status");
> -	if (!err)
> +	if (xenbus_exists(XBT_NIL, dev->nodename, "hotplug-status")) {
> +		err = xenbus_watch_pathfmt(dev, &be->hotplug_status_watch,
> +					   NULL, hotplug_status_changed,
> +					   "%s/%s", dev->nodename,
> +					   "hotplug-status");
> +		if (err)
> +			goto err;
>  		be->have_hotplug_status_watch = 1;
> +	}
>  
>  	netif_tx_wake_all_queues(be->vif->dev);
>  

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-10 18:32         ` Marek Marczykowski-Górecki
@ 2021-05-10 18:47           ` Michael Brown
  2021-05-10 18:53             ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Brown @ 2021-05-10 18:47 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: paul, xen-devel, netdev, wei.liu, pdurrant

On 10/05/2021 19:32, Marek Marczykowski-Górecki wrote:
> On Tue, Apr 13, 2021 at 04:25:12PM +0100, Michael Brown wrote:
>> The logic in connect() is currently written with the assumption that
>> xenbus_watch_pathfmt() will return an error for a node that does not
>> exist.  This assumption is incorrect: xenstore does allow a watch to
>> be registered for a nonexistent node (and will send notifications
>> should the node be subsequently created).
>>
>> As of commit 1f2565780 ("xen-netback: remove 'hotplug-status' once it
>> has served its purpose"), this leads to a failure when a domU
>> transitions into XenbusStateConnected more than once.  On the first
>> domU transition into Connected state, the "hotplug-status" node will
>> be deleted by the hotplug_status_changed() callback in dom0.  On the
>> second or subsequent domU transition into Connected state, the
>> hotplug_status_changed() callback will therefore never be invoked, and
>> so the backend will remain stuck in InitWait.
>>
>> This failure prevents scenarios such as reloading the xen-netfront
>> module within a domU, or booting a domU via iPXE.  There is
>> unfortunately no way for the domU to work around this dom0 bug.
>>
>> Fix by explicitly checking for existence of the "hotplug-status" node,
>> thereby creating the behaviour that was previously assumed to exist.
> 
> This change is wrong. The 'hotplug-status' node is created _only_ by a
> hotplug script and done so when it's executed. When kernel waits for
> hotplug script to be executed it waits for the node to _appear_, not
> _change_. So, this change basically made the kernel not waiting for the
> hotplug script at all.

That doesn't sound plausible to me.  In the setup as you describe, how 
is the kernel expected to differentiate between "hotplug script has not 
yet created the node" and "hotplug script does not exist and will 
therefore never create any node"?

Michael


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-10 18:47           ` Michael Brown
@ 2021-05-10 18:53             ` Marek Marczykowski-Górecki
  2021-05-10 19:06               ` Michael Brown
  0 siblings, 1 reply; 23+ messages in thread
From: Marek Marczykowski-Górecki @ 2021-05-10 18:53 UTC (permalink / raw)
  To: Michael Brown; +Cc: paul, xen-devel, netdev, wei.liu, pdurrant

[-- Attachment #1: Type: text/plain, Size: 2328 bytes --]

On Mon, May 10, 2021 at 07:47:01PM +0100, Michael Brown wrote:
> On 10/05/2021 19:32, Marek Marczykowski-Górecki wrote:
> > On Tue, Apr 13, 2021 at 04:25:12PM +0100, Michael Brown wrote:
> > > The logic in connect() is currently written with the assumption that
> > > xenbus_watch_pathfmt() will return an error for a node that does not
> > > exist.  This assumption is incorrect: xenstore does allow a watch to
> > > be registered for a nonexistent node (and will send notifications
> > > should the node be subsequently created).
> > > 
> > > As of commit 1f2565780 ("xen-netback: remove 'hotplug-status' once it
> > > has served its purpose"), this leads to a failure when a domU
> > > transitions into XenbusStateConnected more than once.  On the first
> > > domU transition into Connected state, the "hotplug-status" node will
> > > be deleted by the hotplug_status_changed() callback in dom0.  On the
> > > second or subsequent domU transition into Connected state, the
> > > hotplug_status_changed() callback will therefore never be invoked, and
> > > so the backend will remain stuck in InitWait.
> > > 
> > > This failure prevents scenarios such as reloading the xen-netfront
> > > module within a domU, or booting a domU via iPXE.  There is
> > > unfortunately no way for the domU to work around this dom0 bug.
> > > 
> > > Fix by explicitly checking for existence of the "hotplug-status" node,
> > > thereby creating the behaviour that was previously assumed to exist.
> > 
> > This change is wrong. The 'hotplug-status' node is created _only_ by a
> > hotplug script and done so when it's executed. When kernel waits for
> > hotplug script to be executed it waits for the node to _appear_, not
> > _change_. So, this change basically made the kernel not waiting for the
> > hotplug script at all.
> 
> That doesn't sound plausible to me.  In the setup as you describe, how is
> the kernel expected to differentiate between "hotplug script has not yet
> created the node" and "hotplug script does not exist and will therefore
> never create any node"?

Is the later valid at all? From what I can see in the toolstack, it
always sets some hotplug script (if not specified explicitly - then
"vif-bridge"),

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-10 18:53             ` Marek Marczykowski-Górecki
@ 2021-05-10 19:06               ` Michael Brown
  2021-05-10 19:42                 ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Brown @ 2021-05-10 19:06 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: paul, xen-devel, netdev, wei.liu, pdurrant

On 10/05/2021 19:53, Marek Marczykowski-Górecki wrote:
> On Mon, May 10, 2021 at 07:47:01PM +0100, Michael Brown wrote:
>> That doesn't sound plausible to me.  In the setup as you describe, how is
>> the kernel expected to differentiate between "hotplug script has not yet
>> created the node" and "hotplug script does not exist and will therefore
>> never create any node"?
> 
> Is the later valid at all? From what I can see in the toolstack, it
> always sets some hotplug script (if not specified explicitly - then
> "vif-bridge"),

I don't see any definitive documentation around that so I can't answer 
for sure.  It's probably best to let one of the Xen guys answer that.

If you have a suggested patch, I'm happy to test that it doesn't 
reintroduce the regression bug that was fixed by this commit.

Michael


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-10 19:06               ` Michael Brown
@ 2021-05-10 19:42                 ` Marek Marczykowski-Górecki
  2021-05-11  7:06                   ` Durrant, Paul
  0 siblings, 1 reply; 23+ messages in thread
From: Marek Marczykowski-Górecki @ 2021-05-10 19:42 UTC (permalink / raw)
  To: Michael Brown, paul; +Cc: paul, xen-devel, netdev, wei.liu, pdurrant

[-- Attachment #1: Type: text/plain, Size: 2270 bytes --]

On Mon, May 10, 2021 at 08:06:55PM +0100, Michael Brown wrote:
> If you have a suggested patch, I'm happy to test that it doesn't reintroduce
> the regression bug that was fixed by this commit.

Actually, I've just tested with a simple reloading xen-netfront module. It
seems in this case, the hotplug script is not re-executed. In fact, I
think it should not be re-executed at all, since the vif interface
remains in place (it just gets NO-CARRIER flag).

This brings a question, why removing hotplug-status in the first place?
The interface remains correctly configured by the hotplug script after
all. From the commit message:

    xen-netback: remove 'hotplug-status' once it has served its purpose

    Removing the 'hotplug-status' node in netback_remove() is wrong; the script
    may not have completed. Only remove the node once the watch has fired and
    has been unregistered.

I think the intention was to remove 'hotplug-status' node _later_ in
case of quickly adding and removing the interface. Is that right, Paul?
In that case, letting hotplug_status_changed() remove the entry wont
work, because the watch was unregistered few lines earlier in
netback_remove(). And keeping the watch is not an option, because the
whole backend_info struct is going to be free-ed already.

If my guess about the original reason for the change is right, I think
it should be fixed at the hotplug script level - it should check if the
device is still there before writing 'hotplug-status' node. I'm not sure
if doing it race-free is possible from a shell script (I think it
requires doing xenstore read _and_ write in a single transaction). But
in the worst case, the aftermath of loosing the race is leaving stray
'hotplug-status' xenstore node - not ideal, but also less harmful than
failing to bring up an interface. At this point, the toolstack could cleanup
it later, perhaps while setting up that interface again (if it gets
re-connected)?

Anyway, perhaps the best thing to do now, is to revert both commits, and
think of an alternative solution for the original issue? That of course
assumes I guessed correctly why it was done in the first place...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-10 19:42                 ` Marek Marczykowski-Górecki
@ 2021-05-11  7:06                   ` Durrant, Paul
  2021-05-11 10:40                     ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 23+ messages in thread
From: Durrant, Paul @ 2021-05-11  7:06 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Michael Brown, paul
  Cc: paul, xen-devel, netdev, wei.liu

> -----Original Message-----
> From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Sent: 10 May 2021 20:43
> To: Michael Brown <mbrown@fensystems.co.uk>; paul@xen.org
> Cc: paul@xen.org; xen-devel@lists.xenproject.org; netdev@vger.kernel.org; wei.liu@kernel.org; Durrant,
> Paul <pdurrant@amazon.co.uk>
> Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching
> 
> On Mon, May 10, 2021 at 08:06:55PM +0100, Michael Brown wrote:
> > If you have a suggested patch, I'm happy to test that it doesn't reintroduce
> > the regression bug that was fixed by this commit.
> 
> Actually, I've just tested with a simple reloading xen-netfront module. It
> seems in this case, the hotplug script is not re-executed. In fact, I
> think it should not be re-executed at all, since the vif interface
> remains in place (it just gets NO-CARRIER flag).
> 
> This brings a question, why removing hotplug-status in the first place?
> The interface remains correctly configured by the hotplug script after
> all. From the commit message:
> 
>     xen-netback: remove 'hotplug-status' once it has served its purpose
> 
>     Removing the 'hotplug-status' node in netback_remove() is wrong; the script
>     may not have completed. Only remove the node once the watch has fired and
>     has been unregistered.
> 
> I think the intention was to remove 'hotplug-status' node _later_ in
> case of quickly adding and removing the interface. Is that right, Paul?

The removal was done to allow unbind/bind to function correctly. IIRC before the original patch doing a bind would stall forever waiting for the hotplug status to change, which would never happen.

> In that case, letting hotplug_status_changed() remove the entry wont
> work, because the watch was unregistered few lines earlier in
> netback_remove(). And keeping the watch is not an option, because the
> whole backend_info struct is going to be free-ed already.
> 
> If my guess about the original reason for the change is right, I think
> it should be fixed at the hotplug script level - it should check if the
> device is still there before writing 'hotplug-status' node.
> I'm not sure if doing it race-free is possible from a shell script (I think it
> requires doing xenstore read _and_ write in a single transaction). But
> in the worst case, the aftermath of loosing the race is leaving stray
> 'hotplug-status' xenstore node - not ideal, but also less harmful than
> failing to bring up an interface. At this point, the toolstack could cleanup
> it later, perhaps while setting up that interface again (if it gets
> re-connected)?
> 
> Anyway, perhaps the best thing to do now, is to revert both commits, and
> think of an alternative solution for the original issue? That of course
> assumes I guessed correctly why it was done in the first place...
> 

Simply reverting everything would likely break the ability to do unbind and bind (which is useful e.g to allow update the netback module whilst guests are still running) so I don't think that's an option.

  Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-11  7:06                   ` Durrant, Paul
@ 2021-05-11 10:40                     ` Marek Marczykowski-Górecki
  2021-05-11 10:45                       ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 23+ messages in thread
From: Marek Marczykowski-Górecki @ 2021-05-11 10:40 UTC (permalink / raw)
  To: Durrant, Paul; +Cc: Michael Brown, paul, xen-devel, netdev, wei.liu

[-- Attachment #1: Type: text/plain, Size: 2079 bytes --]

On Tue, May 11, 2021 at 07:06:55AM +0000, Durrant, Paul wrote:
> > -----Original Message-----
> > From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > Sent: 10 May 2021 20:43
> > To: Michael Brown <mbrown@fensystems.co.uk>; paul@xen.org
> > Cc: paul@xen.org; xen-devel@lists.xenproject.org; netdev@vger.kernel.org; wei.liu@kernel.org; Durrant,
> > Paul <pdurrant@amazon.co.uk>
> > Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching
> > 
> > On Mon, May 10, 2021 at 08:06:55PM +0100, Michael Brown wrote:
> > > If you have a suggested patch, I'm happy to test that it doesn't reintroduce
> > > the regression bug that was fixed by this commit.
> > 
> > Actually, I've just tested with a simple reloading xen-netfront module. It
> > seems in this case, the hotplug script is not re-executed. In fact, I
> > think it should not be re-executed at all, since the vif interface
> > remains in place (it just gets NO-CARRIER flag).
> > 
> > This brings a question, why removing hotplug-status in the first place?
> > The interface remains correctly configured by the hotplug script after
> > all. From the commit message:
> > 
> >     xen-netback: remove 'hotplug-status' once it has served its purpose
> > 
> >     Removing the 'hotplug-status' node in netback_remove() is wrong; the script
> >     may not have completed. Only remove the node once the watch has fired and
> >     has been unregistered.
> > 
> > I think the intention was to remove 'hotplug-status' node _later_ in
> > case of quickly adding and removing the interface. Is that right, Paul?
> 
> The removal was done to allow unbind/bind to function correctly. IIRC before the original patch doing a bind would stall forever waiting for the hotplug status to change, which would never happen.

Hmm, in that case maybe don't remove it at all in the backend, and let
it be cleaned up by the toolstack, when it removes other backend-related
nodes?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-11 10:40                     ` Marek Marczykowski-Górecki
@ 2021-05-11 10:45                       ` Marek Marczykowski-Górecki
  2021-05-11 12:46                         ` Durrant, Paul
  0 siblings, 1 reply; 23+ messages in thread
From: Marek Marczykowski-Górecki @ 2021-05-11 10:45 UTC (permalink / raw)
  To: Durrant, Paul; +Cc: Michael Brown, paul, xen-devel, netdev, wei.liu

[-- Attachment #1: Type: text/plain, Size: 2457 bytes --]

On Tue, May 11, 2021 at 12:40:54PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, May 11, 2021 at 07:06:55AM +0000, Durrant, Paul wrote:
> > > -----Original Message-----
> > > From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > Sent: 10 May 2021 20:43
> > > To: Michael Brown <mbrown@fensystems.co.uk>; paul@xen.org
> > > Cc: paul@xen.org; xen-devel@lists.xenproject.org; netdev@vger.kernel.org; wei.liu@kernel.org; Durrant,
> > > Paul <pdurrant@amazon.co.uk>
> > > Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching
> > > 
> > > On Mon, May 10, 2021 at 08:06:55PM +0100, Michael Brown wrote:
> > > > If you have a suggested patch, I'm happy to test that it doesn't reintroduce
> > > > the regression bug that was fixed by this commit.
> > > 
> > > Actually, I've just tested with a simple reloading xen-netfront module. It
> > > seems in this case, the hotplug script is not re-executed. In fact, I
> > > think it should not be re-executed at all, since the vif interface
> > > remains in place (it just gets NO-CARRIER flag).
> > > 
> > > This brings a question, why removing hotplug-status in the first place?
> > > The interface remains correctly configured by the hotplug script after
> > > all. From the commit message:
> > > 
> > >     xen-netback: remove 'hotplug-status' once it has served its purpose
> > > 
> > >     Removing the 'hotplug-status' node in netback_remove() is wrong; the script
> > >     may not have completed. Only remove the node once the watch has fired and
> > >     has been unregistered.
> > > 
> > > I think the intention was to remove 'hotplug-status' node _later_ in
> > > case of quickly adding and removing the interface. Is that right, Paul?
> > 
> > The removal was done to allow unbind/bind to function correctly. IIRC before the original patch doing a bind would stall forever waiting for the hotplug status to change, which would never happen.
> 
> Hmm, in that case maybe don't remove it at all in the backend, and let
> it be cleaned up by the toolstack, when it removes other backend-related
> nodes?

No, unbind/bind _does_ require hotplug script to be called again.

When exactly vif interface appears in the system (starts to be available
for the hotplug script)? Maybe remove 'hotplug-status' just before that
point?


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-11 10:45                       ` Marek Marczykowski-Górecki
@ 2021-05-11 12:46                         ` Durrant, Paul
  2021-05-17 21:43                           ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 23+ messages in thread
From: Durrant, Paul @ 2021-05-11 12:46 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Michael Brown, paul, xen-devel, netdev, wei.liu

> -----Original Message-----
> From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Sent: 11 May 2021 11:45
> To: Durrant, Paul <pdurrant@amazon.co.uk>
> Cc: Michael Brown <mbrown@fensystems.co.uk>; paul@xen.org; xen-devel@lists.xenproject.org;
> netdev@vger.kernel.org; wei.liu@kernel.org
> Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching
> 
> On Tue, May 11, 2021 at 12:40:54PM +0200, Marek Marczykowski-Górecki wrote:
> > On Tue, May 11, 2021 at 07:06:55AM +0000, Durrant, Paul wrote:
> > > > -----Original Message-----
> > > > From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > > Sent: 10 May 2021 20:43
> > > > To: Michael Brown <mbrown@fensystems.co.uk>; paul@xen.org
> > > > Cc: paul@xen.org; xen-devel@lists.xenproject.org; netdev@vger.kernel.org; wei.liu@kernel.org;
> Durrant,
> > > > Paul <pdurrant@amazon.co.uk>
> > > > Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching
> > > >
> > > > On Mon, May 10, 2021 at 08:06:55PM +0100, Michael Brown wrote:
> > > > > If you have a suggested patch, I'm happy to test that it doesn't reintroduce
> > > > > the regression bug that was fixed by this commit.
> > > >
> > > > Actually, I've just tested with a simple reloading xen-netfront module. It
> > > > seems in this case, the hotplug script is not re-executed. In fact, I
> > > > think it should not be re-executed at all, since the vif interface
> > > > remains in place (it just gets NO-CARRIER flag).
> > > >
> > > > This brings a question, why removing hotplug-status in the first place?
> > > > The interface remains correctly configured by the hotplug script after
> > > > all. From the commit message:
> > > >
> > > >     xen-netback: remove 'hotplug-status' once it has served its purpose
> > > >
> > > >     Removing the 'hotplug-status' node in netback_remove() is wrong; the script
> > > >     may not have completed. Only remove the node once the watch has fired and
> > > >     has been unregistered.
> > > >
> > > > I think the intention was to remove 'hotplug-status' node _later_ in
> > > > case of quickly adding and removing the interface. Is that right, Paul?
> > >
> > > The removal was done to allow unbind/bind to function correctly. IIRC before the original patch
> doing a bind would stall forever waiting for the hotplug status to change, which would never happen.
> >
> > Hmm, in that case maybe don't remove it at all in the backend, and let
> > it be cleaned up by the toolstack, when it removes other backend-related
> > nodes?
> 
> No, unbind/bind _does_ require hotplug script to be called again.
> 

Yes, sorry I was misremembering. My memory is hazy but there was definitely a problem at the time with leaving the node in place.

> When exactly vif interface appears in the system (starts to be available
> for the hotplug script)? Maybe remove 'hotplug-status' just before that
> point?
> 

I really can't remember any detail. Perhaps try reverting both patches then and check that the unbind/rmmod/modprobe/bind sequence still works (and the backend actually makes it into connected state).

  Paul


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-11 12:46                         ` Durrant, Paul
@ 2021-05-17 21:43                           ` Marek Marczykowski-Górecki
  2021-05-17 21:51                             ` Michael Brown
  2021-05-18  6:57                             ` Paul Durrant
  0 siblings, 2 replies; 23+ messages in thread
From: Marek Marczykowski-Górecki @ 2021-05-17 21:43 UTC (permalink / raw)
  To: Durrant, Paul
  Cc: Michael Brown, paul, xen-devel, netdev, wei.liu, Ian Jackson,
	Wei Liu, Anthony PERARD

[-- Attachment #1: Type: text/plain, Size: 7158 bytes --]

On Tue, May 11, 2021 at 12:46:38PM +0000, Durrant, Paul wrote:
> > -----Original Message-----
> > From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > Sent: 11 May 2021 11:45
> > To: Durrant, Paul <pdurrant@amazon.co.uk>
> > Cc: Michael Brown <mbrown@fensystems.co.uk>; paul@xen.org; xen-devel@lists.xenproject.org;
> > netdev@vger.kernel.org; wei.liu@kernel.org
> > Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching
> > 
> > On Tue, May 11, 2021 at 12:40:54PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Tue, May 11, 2021 at 07:06:55AM +0000, Durrant, Paul wrote:
> > > > > -----Original Message-----
> > > > > From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > > > Sent: 10 May 2021 20:43
> > > > > To: Michael Brown <mbrown@fensystems.co.uk>; paul@xen.org
> > > > > Cc: paul@xen.org; xen-devel@lists.xenproject.org; netdev@vger.kernel.org; wei.liu@kernel.org;
> > Durrant,
> > > > > Paul <pdurrant@amazon.co.uk>
> > > > > Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching
> > > > >
> > > > > On Mon, May 10, 2021 at 08:06:55PM +0100, Michael Brown wrote:
> > > > > > If you have a suggested patch, I'm happy to test that it doesn't reintroduce
> > > > > > the regression bug that was fixed by this commit.
> > > > >
> > > > > Actually, I've just tested with a simple reloading xen-netfront module. It
> > > > > seems in this case, the hotplug script is not re-executed. In fact, I
> > > > > think it should not be re-executed at all, since the vif interface
> > > > > remains in place (it just gets NO-CARRIER flag).
> > > > >
> > > > > This brings a question, why removing hotplug-status in the first place?
> > > > > The interface remains correctly configured by the hotplug script after
> > > > > all. From the commit message:
> > > > >
> > > > >     xen-netback: remove 'hotplug-status' once it has served its purpose
> > > > >
> > > > >     Removing the 'hotplug-status' node in netback_remove() is wrong; the script
> > > > >     may not have completed. Only remove the node once the watch has fired and
> > > > >     has been unregistered.
> > > > >
> > > > > I think the intention was to remove 'hotplug-status' node _later_ in
> > > > > case of quickly adding and removing the interface. Is that right, Paul?
> > > >
> > > > The removal was done to allow unbind/bind to function correctly. IIRC before the original patch
> > doing a bind would stall forever waiting for the hotplug status to change, which would never happen.
> > >
> > > Hmm, in that case maybe don't remove it at all in the backend, and let
> > > it be cleaned up by the toolstack, when it removes other backend-related
> > > nodes?
> > 
> > No, unbind/bind _does_ require hotplug script to be called again.
> > 
> 
> Yes, sorry I was misremembering. My memory is hazy but there was definitely a problem at the time with leaving the node in place.
> 
> > When exactly vif interface appears in the system (starts to be available
> > for the hotplug script)? Maybe remove 'hotplug-status' just before that
> > point?
> > 
> 
> I really can't remember any detail. Perhaps try reverting both patches then and check that the unbind/rmmod/modprobe/bind sequence still works (and the backend actually makes it into connected state).

Ok, I've tried this. I've reverted both commits, then used your test
script from the 9476654bd5e8ad42abe8ee9f9e90069ff8e60c17:
    
    This has been tested by running iperf as a server in the test VM and
    then running a client against it in a continuous loop, whilst also
    running:
    
    while true;
      do echo vif-$DOMID-$VIF >unbind;
      echo down;
      rmmod xen-netback;
      echo unloaded;
      modprobe xen-netback;
      cd $(pwd);
      brctl addif xenbr0 vif$DOMID.$VIF;
      ip link set vif$DOMID.$VIF up;
      echo up;
      sleep 5;
      done
    
    in dom0 from /sys/bus/xen-backend/drivers/vif to continuously unbind,
    unload, re-load, re-bind and re-plumb the backend.
    
In fact, the need to call `brctl` and `ip link` manually is exactly
because the hotplug script isn't executed. When I execute it manually,
the backend properly gets back to working. So, removing 'hotplug-status'
was in the correct place (netback_remove). The missing part is the toolstack
calling the hotplug script on xen-netback re-bind.

In this case, I'm not sure what is the proper way. If I restart
xendriverdomain service (I do run the backend in domU), it properly
executes hotplug script and the backend interface gets properly
configured. But it doesn't do it on its own. It seems to be related to
device "state" in xenstore. The specific part of the libxl is
backend_watch_callback():
https://github.com/xen-project/xen/blob/master/tools/libs/light/libxl_device.c#L1664

    ddev = search_for_device(dguest, dev);
    if (ddev == NULL && state == XenbusStateClosed) {
        /*
         * Spurious state change, device has already been disconnected
         * or never attached.
         */
        goto skip;
    } else if (ddev == NULL) {
        rc = add_device(egc, nested_ao, dguest, dev);
        if (rc > 0)
            free_ao = true;
    } else if (state == XenbusStateClosed && online == 0) {
        rc = remove_device(egc, nested_ao, dguest, ddev);
        if (rc > 0)
            free_ao = true;
        check_and_maybe_remove_guest(gc, ddomain, dguest);
    }

In short: if device gets XenbusStateInitWait for the first time (ddev ==
NULL case), it goes to add_device() which executes the hotplug script
and stores the device.
Then, if device goes to XenbusStateClosed + online==0 state, then it
executes hotplug script again (with "offline" parameter) and forgets the
device. If you unbind the driver, the device stays in
XenbusStateConnected state (in xenstore), and after you bind it again,
it goes to XenbusStateInitWait. It don't think it goes through
XenbusStateClosed, and online stays at 1 too, so libxl doesn't execute
the hotplug script again.

Some solution could be to add an extra case at the end, like "ddev !=
NULL && state == XenbusStateInitWait && hotplug-status != connected".
And make sure xl devd won't call the same hotplug script multiple times
for the same device _at the same time_ (I'm not sure about the async
machinery here).

But even if xl devd (aka xendriverdomain service) gets "fixes" to
execute hotplug script in that case, I don't think it would work in
backend in dom0 case - there, I think nothing watches already configured
vif interfaces (there is no xl devd daemon in dom0, and xl background
process watches only domain death and cdrom eject events). 

I'm adding toolstack maintainers, maybe they'll have some idea...

In any case, the issue is not calling the hotplug script, responsible
for configuring newly created vif interface. Not kernel waiting for it.
So, I think both commits should still be reverted.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-17 21:43                           ` Marek Marczykowski-Górecki
@ 2021-05-17 21:51                             ` Michael Brown
  2021-05-17 21:58                               ` Marek Marczykowski-Górecki
  2021-05-18  6:57                             ` Paul Durrant
  1 sibling, 1 reply; 23+ messages in thread
From: Michael Brown @ 2021-05-17 21:51 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Durrant, Paul
  Cc: paul, xen-devel, netdev, wei.liu, Ian Jackson, Wei Liu, Anthony PERARD

On 17/05/2021 22:43, Marek Marczykowski-Górecki wrote:
> In any case, the issue is not calling the hotplug script, responsible
> for configuring newly created vif interface. Not kernel waiting for it.
> So, I think both commits should still be reverted.

Did you also test the ability for a domU to have the netfront driver 
reloaded?  (That should be roughly equivalent to my original test 
scenario comprising the iPXE netfront driver used to boot a kernel that 
then loads the Linux netfront driver.)

Michael




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-17 21:51                             ` Michael Brown
@ 2021-05-17 21:58                               ` Marek Marczykowski-Górecki
  0 siblings, 0 replies; 23+ messages in thread
From: Marek Marczykowski-Górecki @ 2021-05-17 21:58 UTC (permalink / raw)
  To: Michael Brown
  Cc: Durrant, Paul, paul, xen-devel, netdev, wei.liu, Ian Jackson,
	Wei Liu, Anthony PERARD

[-- Attachment #1: Type: text/plain, Size: 771 bytes --]

On Mon, May 17, 2021 at 10:51:38PM +0100, Michael Brown wrote:
> On 17/05/2021 22:43, Marek Marczykowski-Górecki wrote:
> > In any case, the issue is not calling the hotplug script, responsible
> > for configuring newly created vif interface. Not kernel waiting for it.
> > So, I think both commits should still be reverted.
> 
> Did you also test the ability for a domU to have the netfront driver
> reloaded?  (That should be roughly equivalent to my original test scenario
> comprising the iPXE netfront driver used to boot a kernel that then loads
> the Linux netfront driver.)

Yes, with both commits reverted, it just works. Without the need to do
anything at the backend side.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-17 21:43                           ` Marek Marczykowski-Górecki
  2021-05-17 21:51                             ` Michael Brown
@ 2021-05-18  6:57                             ` Paul Durrant
  2021-05-18  9:18                               ` Marek Marczykowski-Górecki
  1 sibling, 1 reply; 23+ messages in thread
From: Paul Durrant @ 2021-05-18  6:57 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Durrant, Paul
  Cc: Michael Brown, xen-devel, netdev, wei.liu, Ian Jackson, Wei Liu,
	Anthony PERARD

On 17/05/2021 22:43, Marek Marczykowski-Górecki wrote:
> On Tue, May 11, 2021 at 12:46:38PM +0000, Durrant, Paul wrote:
>>> -----Original Message-----
>>> From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>>> Sent: 11 May 2021 11:45
>>> To: Durrant, Paul <pdurrant@amazon.co.uk>
>>> Cc: Michael Brown <mbrown@fensystems.co.uk>; paul@xen.org; xen-devel@lists.xenproject.org;
>>> netdev@vger.kernel.org; wei.liu@kernel.org
>>> Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching
>>>
>>> On Tue, May 11, 2021 at 12:40:54PM +0200, Marek Marczykowski-Górecki wrote:
>>>> On Tue, May 11, 2021 at 07:06:55AM +0000, Durrant, Paul wrote:
>>>>>> -----Original Message-----
>>>>>> From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>>>>>> Sent: 10 May 2021 20:43
>>>>>> To: Michael Brown <mbrown@fensystems.co.uk>; paul@xen.org
>>>>>> Cc: paul@xen.org; xen-devel@lists.xenproject.org; netdev@vger.kernel.org; wei.liu@kernel.org;
>>> Durrant,
>>>>>> Paul <pdurrant@amazon.co.uk>
>>>>>> Subject: RE: [EXTERNAL] [PATCH] xen-netback: Check for hotplug-status existence before watching
>>>>>>
>>>>>> On Mon, May 10, 2021 at 08:06:55PM +0100, Michael Brown wrote:
>>>>>>> If you have a suggested patch, I'm happy to test that it doesn't reintroduce
>>>>>>> the regression bug that was fixed by this commit.
>>>>>>
>>>>>> Actually, I've just tested with a simple reloading xen-netfront module. It
>>>>>> seems in this case, the hotplug script is not re-executed. In fact, I
>>>>>> think it should not be re-executed at all, since the vif interface
>>>>>> remains in place (it just gets NO-CARRIER flag).
>>>>>>
>>>>>> This brings a question, why removing hotplug-status in the first place?
>>>>>> The interface remains correctly configured by the hotplug script after
>>>>>> all. From the commit message:
>>>>>>
>>>>>>      xen-netback: remove 'hotplug-status' once it has served its purpose
>>>>>>
>>>>>>      Removing the 'hotplug-status' node in netback_remove() is wrong; the script
>>>>>>      may not have completed. Only remove the node once the watch has fired and
>>>>>>      has been unregistered.
>>>>>>
>>>>>> I think the intention was to remove 'hotplug-status' node _later_ in
>>>>>> case of quickly adding and removing the interface. Is that right, Paul?
>>>>>
>>>>> The removal was done to allow unbind/bind to function correctly. IIRC before the original patch
>>> doing a bind would stall forever waiting for the hotplug status to change, which would never happen.
>>>>
>>>> Hmm, in that case maybe don't remove it at all in the backend, and let
>>>> it be cleaned up by the toolstack, when it removes other backend-related
>>>> nodes?
>>>
>>> No, unbind/bind _does_ require hotplug script to be called again.
>>>
>>
>> Yes, sorry I was misremembering. My memory is hazy but there was definitely a problem at the time with leaving the node in place.
>>
>>> When exactly vif interface appears in the system (starts to be available
>>> for the hotplug script)? Maybe remove 'hotplug-status' just before that
>>> point?
>>>
>>
>> I really can't remember any detail. Perhaps try reverting both patches then and check that the unbind/rmmod/modprobe/bind sequence still works (and the backend actually makes it into connected state).
> 
> Ok, I've tried this. I've reverted both commits, then used your test
> script from the 9476654bd5e8ad42abe8ee9f9e90069ff8e60c17:
>      
>      This has been tested by running iperf as a server in the test VM and
>      then running a client against it in a continuous loop, whilst also
>      running:
>      
>      while true;
>        do echo vif-$DOMID-$VIF >unbind;
>        echo down;
>        rmmod xen-netback;
>        echo unloaded;
>        modprobe xen-netback;
>        cd $(pwd);
>        brctl addif xenbr0 vif$DOMID.$VIF;
>        ip link set vif$DOMID.$VIF up;
>        echo up;
>        sleep 5;
>        done
>      
>      in dom0 from /sys/bus/xen-backend/drivers/vif to continuously unbind,
>      unload, re-load, re-bind and re-plumb the backend.
>      
> In fact, the need to call `brctl` and `ip link` manually is exactly
> because the hotplug script isn't executed. When I execute it manually,
> the backend properly gets back to working. So, removing 'hotplug-status'
> was in the correct place (netback_remove). The missing part is the toolstack
> calling the hotplug script on xen-netback re-bind.
> 

Why is that missing? We're going behind the back of the toolstack to do 
the unbind and bind so why should we expect it to re-execute a hotplug 
script?

> In this case, I'm not sure what is the proper way. If I restart
> xendriverdomain service (I do run the backend in domU), it properly
> executes hotplug script and the backend interface gets properly
> configured. But it doesn't do it on its own. It seems to be related to
> device "state" in xenstore. The specific part of the libxl is
> backend_watch_callback():
> https://github.com/xen-project/xen/blob/master/tools/libs/light/libxl_device.c#L1664
> 
>      ddev = search_for_device(dguest, dev);
>      if (ddev == NULL && state == XenbusStateClosed) {
>          /*
>           * Spurious state change, device has already been disconnected
>           * or never attached.
>           */
>          goto skip;
>      } else if (ddev == NULL) {
>          rc = add_device(egc, nested_ao, dguest, dev);
>          if (rc > 0)
>              free_ao = true;
>      } else if (state == XenbusStateClosed && online == 0) {
>          rc = remove_device(egc, nested_ao, dguest, ddev);
>          if (rc > 0)
>              free_ao = true;
>          check_and_maybe_remove_guest(gc, ddomain, dguest);
>      }
> 
> In short: if device gets XenbusStateInitWait for the first time (ddev ==
> NULL case), it goes to add_device() which executes the hotplug script
> and stores the device.
> Then, if device goes to XenbusStateClosed + online==0 state, then it
> executes hotplug script again (with "offline" parameter) and forgets the
> device. If you unbind the driver, the device stays in
> XenbusStateConnected state (in xenstore), and after you bind it again,
> it goes to XenbusStateInitWait. It don't think it goes through
> XenbusStateClosed, and online stays at 1 too, so libxl doesn't execute
> the hotplug script again.

This is pretty key. The frontend should not notice an unbind/bind i.e. 
there should be no evidence of it happening by examining states in 
xenstore (from the guest side).

   Paul

> 
> Some solution could be to add an extra case at the end, like "ddev !=
> NULL && state == XenbusStateInitWait && hotplug-status != connected".
> And make sure xl devd won't call the same hotplug script multiple times
> for the same device _at the same time_ (I'm not sure about the async
> machinery here).
> 
> But even if xl devd (aka xendriverdomain service) gets "fixes" to
> execute hotplug script in that case, I don't think it would work in
> backend in dom0 case - there, I think nothing watches already configured
> vif interfaces (there is no xl devd daemon in dom0, and xl background
> process watches only domain death and cdrom eject events).
> 
> I'm adding toolstack maintainers, maybe they'll have some idea...
> 
> In any case, the issue is not calling the hotplug script, responsible
> for configuring newly created vif interface. Not kernel waiting for it.
> So, I think both commits should still be reverted.
> 



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
  2021-05-18  6:57                             ` Paul Durrant
@ 2021-05-18  9:18                               ` Marek Marczykowski-Górecki
       [not found]                                 ` <887f9533f5c54bfabfbff7231eb99b08@EX13D32EUC003.ant.amazon.com>
  0 siblings, 1 reply; 23+ messages in thread
From: Marek Marczykowski-Górecki @ 2021-05-18  9:18 UTC (permalink / raw)
  To: paul
  Cc: Durrant, Paul, Michael Brown, xen-devel, netdev, wei.liu,
	Ian Jackson, Wei Liu, Anthony PERARD

[-- Attachment #1: Type: text/plain, Size: 3263 bytes --]

On Tue, May 18, 2021 at 07:57:16AM +0100, Paul Durrant wrote:
> On 17/05/2021 22:43, Marek Marczykowski-Górecki wrote:
> > On Tue, May 11, 2021 at 12:46:38PM +0000, Durrant, Paul wrote:
> > > I really can't remember any detail. Perhaps try reverting both patches then and check that the unbind/rmmod/modprobe/bind sequence still works (and the backend actually makes it into connected state).
> > 
> > Ok, I've tried this. I've reverted both commits, then used your test
> > script from the 9476654bd5e8ad42abe8ee9f9e90069ff8e60c17:
> >      This has been tested by running iperf as a server in the test VM and
> >      then running a client against it in a continuous loop, whilst also
> >      running:
> >      while true;
> >        do echo vif-$DOMID-$VIF >unbind;
> >        echo down;
> >        rmmod xen-netback;
> >        echo unloaded;
> >        modprobe xen-netback;
> >        cd $(pwd);
> >        brctl addif xenbr0 vif$DOMID.$VIF;
> >        ip link set vif$DOMID.$VIF up;
> >        echo up;
> >        sleep 5;
> >        done
> >      in dom0 from /sys/bus/xen-backend/drivers/vif to continuously unbind,
> >      unload, re-load, re-bind and re-plumb the backend.
> > In fact, the need to call `brctl` and `ip link` manually is exactly
> > because the hotplug script isn't executed. When I execute it manually,
> > the backend properly gets back to working. So, removing 'hotplug-status'
> > was in the correct place (netback_remove). The missing part is the toolstack
> > calling the hotplug script on xen-netback re-bind.
> > 
> 
> Why is that missing? We're going behind the back of the toolstack to do the
> unbind and bind so why should we expect it to re-execute a hotplug script?

Ok, then simply execute the whole hotplug script (instead of its subset)
after re-loading the backend module and everything will be fine.

For example like this:
    XENBUS_PATH=backend/vif/$DOMID/$VIF \
    XENBUS_TYPE=vif \
    XENBUS_BASE_PATH=backend \
    script=/etc/xen/scripts/vif-bridge \
    vif=vif.$DOMID.$VIF \
    /etc/xen/scripts/vif-bridge online

(...)

> > In short: if device gets XenbusStateInitWait for the first time (ddev ==
> > NULL case), it goes to add_device() which executes the hotplug script
> > and stores the device.
> > Then, if device goes to XenbusStateClosed + online==0 state, then it
> > executes hotplug script again (with "offline" parameter) and forgets the
> > device. If you unbind the driver, the device stays in
> > XenbusStateConnected state (in xenstore), and after you bind it again,
> > it goes to XenbusStateInitWait. It don't think it goes through
> > XenbusStateClosed, and online stays at 1 too, so libxl doesn't execute
> > the hotplug script again.
> 
> This is pretty key. The frontend should not notice an unbind/bind i.e. there
> should be no evidence of it happening by examining states in xenstore (from
> the guest side).

If you update the backend module, I think the frontend needs at least to
re-evaluate feature-* nodes. In case of applying just a bug fix, they
should not change (in theory), but technically that would be the correct
thing to do.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen-netback: Check for hotplug-status existence before watching
       [not found]                                     ` <2c23e102b6254e42877eb1e8fe68a4f7@EX13D32EUC003.ant.amazon.com>
@ 2021-05-18 10:42                                       ` Marek Marczykowski-Górecki
  0 siblings, 0 replies; 23+ messages in thread
From: Marek Marczykowski-Górecki @ 2021-05-18 10:42 UTC (permalink / raw)
  To: Durrant, Paul
  Cc: Michael Brown, xen-devel, netdev, wei.liu, Ian Jackson, Wei Liu,
	Anthony PERARD

[-- Attachment #1: Type: text/plain, Size: 1946 bytes --]

On Tue, May 18, 2021 at 09:48:25AM +0000, Durrant, Paul wrote:
> > -----Original Message-----
> > From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > 
> > On Tue, May 18, 2021 at 09:34:45AM +0000, Durrant, Paul wrote:
> > > > -----Original Message-----
> > > > From: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > >
> > > > On Tue, May 18, 2021 at 07:57:16AM +0100, Paul Durrant wrote:
> > > > > Why is that missing? We're going behind the back of the toolstack to do the
> > > > > unbind and bind so why should we expect it to re-execute a hotplug script?
> > > >
> > > > Ok, then simply execute the whole hotplug script (instead of its subset)
> > > > after re-loading the backend module and everything will be fine.
> > > >
> > > > For example like this:
> > > >     XENBUS_PATH=backend/vif/$DOMID/$VIF \
> > > >     XENBUS_TYPE=vif \
> > > >     XENBUS_BASE_PATH=backend \
> > > >     script=/etc/xen/scripts/vif-bridge \
> > > >     vif=vif.$DOMID.$VIF \
> > > >     /etc/xen/scripts/vif-bridge online
> > > >
> > >
> > > ... as long as there's no xenstore fall-out that the guest can observe.
> > 
> > Backend will set state to XenbusStateInitWait on load anyway...
> > 
> 
> Oh, that sounds like a bug then... It ought to go straight to connected if the frontend is already there.

To me this sounds very suspicious. But if that's really what should
backend do, then it would "solve" also hotplug-status node issue.
See the end of netback_probe() function.
But I think if you start processing traffic before hotplug script
configures the interface (so - without switching to XenbusStateInitWait
and waiting for hotplug-status node), you'll post some packets into not
enabled interface, which I think will drop them (not queue). TCP will be
fine with that, but many other protocols not.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-05-18 10:42 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-10 18:25 xen-netback hotplug-status regression bug Michael Brown
2021-04-13  7:12 ` Paul Durrant
2021-04-13 10:48   ` Michael Brown
2021-04-13 10:55     ` Paul Durrant
2021-04-13 15:14       ` Michael Brown
2021-04-13 15:25       ` [PATCH] xen-netback: Check for hotplug-status existence before watching Michael Brown
2021-04-13 19:12         ` Paul Durrant
2021-04-13 22:30         ` patchwork-bot+netdevbpf
2021-05-10 18:32         ` Marek Marczykowski-Górecki
2021-05-10 18:47           ` Michael Brown
2021-05-10 18:53             ` Marek Marczykowski-Górecki
2021-05-10 19:06               ` Michael Brown
2021-05-10 19:42                 ` Marek Marczykowski-Górecki
2021-05-11  7:06                   ` Durrant, Paul
2021-05-11 10:40                     ` Marek Marczykowski-Górecki
2021-05-11 10:45                       ` Marek Marczykowski-Górecki
2021-05-11 12:46                         ` Durrant, Paul
2021-05-17 21:43                           ` Marek Marczykowski-Górecki
2021-05-17 21:51                             ` Michael Brown
2021-05-17 21:58                               ` Marek Marczykowski-Górecki
2021-05-18  6:57                             ` Paul Durrant
2021-05-18  9:18                               ` Marek Marczykowski-Górecki
     [not found]                                 ` <887f9533f5c54bfabfbff7231eb99b08@EX13D32EUC003.ant.amazon.com>
     [not found]                                   ` <YKOMpXwcnr9QiXy8@mail-itl>
     [not found]                                     ` <2c23e102b6254e42877eb1e8fe68a4f7@EX13D32EUC003.ant.amazon.com>
2021-05-18 10:42                                       ` Marek Marczykowski-Górecki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).