* [PATCH net] net/tls: Act on going down event
@ 2020-02-13 16:54 Or Gerlitz
  2020-02-13 18:32 ` Jakub Kicinski
  0 siblings, 1 reply; 4+ messages in thread
From: Or Gerlitz @ 2020-02-13 16:54 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Tariq Toukan, netdev, Or Gerlitz

By the time of the down event, the netdevice stop ndo has
already been called and the NIC driver has likely destroyed the
HW objects/constructs used for the tls_dev_resync op.

Instead, act on the going-down event, which is triggered before
the stop ndo.

Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---

compile tested only.

# vim net/core/dev.c +1555
 *	This function moves an active device into down state. A
 *	%NETDEV_GOING_DOWN is sent to the netdev notifier chain. The device
 *	is then deactivated and finally a %NETDEV_DOWN is sent to the notifier chain.
[..]
void dev_close(struct net_device *dev)

 net/tls/tls_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 1ba5a92832bb..457c4b8352d8 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -1246,7 +1246,7 @@ static int tls_dev_event(struct notifier_block *this, unsigned long event,
 			return NOTIFY_DONE;
 		else
 			return NOTIFY_BAD;
-	case NETDEV_DOWN:
+	case NETDEV_GOING_DOWN:
 		return tls_device_down(dev);
 	}
 	return NOTIFY_DONE;
-- 
2.20.1



* Re: [PATCH net] net/tls: Act on going down event
  2020-02-13 16:54 [PATCH net] net/tls: Act on going down event Or Gerlitz
@ 2020-02-13 18:32 ` Jakub Kicinski
  2020-02-13 20:07   ` Or Gerlitz
  0 siblings, 1 reply; 4+ messages in thread
From: Jakub Kicinski @ 2020-02-13 18:32 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Tariq Toukan, netdev

On Thu, 13 Feb 2020 16:54:07 +0000 Or Gerlitz wrote:
> By the time of the down event, the netdevice stop ndo has
> already been called and the NIC driver has likely destroyed the
> HW objects/constructs used for the tls_dev_resync op.
> 
> Instead, act on the going-down event, which is triggered before
> the stop ndo.
>
> Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure")
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
> 
> compile tested only.

For a fix that you have hardware to test, that is a little
disappointing :(

> # vim net/core/dev.c +1555
>  *	This function moves an active device into down state. A
>  *	%NETDEV_GOING_DOWN is sent to the netdev notifier chain. The device
>  *	is then deactivated and finally a %NETDEV_DOWN is sent to the notifier chain.
> [..]
> void dev_close(struct net_device *dev)

As is quoting a comment rather than justifying changes based on the
code.

> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> index 1ba5a92832bb..457c4b8352d8 100644
> --- a/net/tls/tls_device.c
> +++ b/net/tls/tls_device.c
> @@ -1246,7 +1246,7 @@ static int tls_dev_event(struct notifier_block *this, unsigned long event,
>  			return NOTIFY_DONE;
>  		else
>  			return NOTIFY_BAD;
> -	case NETDEV_DOWN:
> +	case NETDEV_GOING_DOWN:

Now we'll have two race conditions: 1. Traffic is still running while
we remove the connection state. 2. We clean out the state and then
close the device. Between the two, new connections may be installed.

I think it's inherently racy to only be able to perform cleanup
while the device is still up.

>  		return tls_device_down(dev);
>  	}
>  	return NOTIFY_DONE;


* Re: [PATCH net] net/tls: Act on going down event
  2020-02-13 18:32 ` Jakub Kicinski
@ 2020-02-13 20:07   ` Or Gerlitz
  2020-02-14  0:32     ` Jakub Kicinski
  0 siblings, 1 reply; 4+ messages in thread
From: Or Gerlitz @ 2020-02-13 20:07 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Or Gerlitz, Tariq Toukan, Linux Netdev List

On Thu, Feb 13, 2020 at 8:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 13 Feb 2020 16:54:07 +0000 Or Gerlitz wrote:
> > By the time of the down event, the netdevice stop ndo has
> > already been called and the NIC driver has likely destroyed the
> > HW objects/constructs used for the tls_dev_resync op.

>> @@ -1246,7 +1246,7 @@ static int tls_dev_event(struct notifier_block *this, unsigned long event,
> > -     case NETDEV_DOWN:
> > +     case NETDEV_GOING_DOWN:

> Now we'll have two race conditions: 1. Traffic is still running while
> we remove the connection state. 2. We clean out the state and then
> close the device. Between the two, new connections may be installed.
>
> I think it's inherently racy to only be able to perform cleanup
> while the device is still up.

Good points -- I have to think about both of them and re/sync (..)
with the actual design details here. I came across this while working
on something else which is still more of a research effort, so just
throwing a patch out here was moving too fast.

Repeating myself -- by the time of the down event, the netdevice
stop ndo has already been called and the NIC driver has likely
destroyed the HW objects/constructs used for the tls_dev_resync op.

For example, suppose a NIC driver uses some QP/ring to post resync
requests to the HW, these rings are part of the driver's channels,
and the channels are destroyed when the stop ndo is called. The tls
code here uses RW synchronization between invoking the resync driver
op and the down event handling, but the down event comes only after
these resources were destroyed -- too late. If we safely stop the
offload at the going-down stage, we can avoid the much more complex
per-NIC-driver locking which I guess would be needed to protect
against that race.

thoughts?


- so here's
the point -- the driver resync op may use some HW


>
> >               return tls_device_down(dev);
> >       }
> >       return NOTIFY_DONE;


* Re: [PATCH net] net/tls: Act on going down event
  2020-02-13 20:07   ` Or Gerlitz
@ 2020-02-14  0:32     ` Jakub Kicinski
  0 siblings, 0 replies; 4+ messages in thread
From: Jakub Kicinski @ 2020-02-14  0:32 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Or Gerlitz, Tariq Toukan, Linux Netdev List

On Thu, 13 Feb 2020 22:07:19 +0200 Or Gerlitz wrote:
> On Thu, Feb 13, 2020 at 8:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Thu, 13 Feb 2020 16:54:07 +0000 Or Gerlitz wrote:  
> > > By the time of the down event, the netdevice stop ndo has
> > > already been called and the NIC driver has likely destroyed the
> > > HW objects/constructs used for the tls_dev_resync op.
> 
> >> @@ -1246,7 +1246,7 @@ static int tls_dev_event(struct notifier_block *this, unsigned long event,
> > > -     case NETDEV_DOWN:
> > > +     case NETDEV_GOING_DOWN:  
> 
> > Now we'll have two race conditions: 1. Traffic is still running while
> > we remove the connection state. 2. We clean out the state and then
> > close the device. Between the two, new connections may be installed.
> >
> > I think it's inherently racy to only be able to perform cleanup
> > while the device is still up.
> 
> Good points -- I have to think about both of them and re/sync (..)
> with the actual design details here. I came across this while working
> on something else which is still more of a research effort, so just
> throwing a patch out here was moving too fast.
> 
> Repeating myself -- by the time of the down event, the netdevice
> stop ndo has already been called and the NIC driver has likely
> destroyed the HW objects/constructs used for the tls_dev_resync op.
> 
> For example, suppose a NIC driver uses some QP/ring to post resync
> requests to the HW, these rings are part of the driver's channels,
> and the channels are destroyed when the stop ndo is called. The tls
> code here uses RW synchronization between invoking the resync driver
> op and the down event handling, but the down event comes only after
> these resources were destroyed -- too late. If we safely stop the
> offload at the going-down stage, we can avoid the much more complex
> per-NIC-driver locking which I guess would be needed to protect
> against that race.
> 
> thoughts?

I'm not sure what happens on the device side, so take this with
a pinch of salt.

The device flushing some state automatically on down sounds like
a pretty error-prone design; I don't think you'd do that :)

So I think you're just using datapath components associated with a
vNIC/function to speed up communication with the device? That does
get painful quite quickly if the device does not have an idea of
control QPs/functions/vNICs with independent state :S

We could slice the downing in the stack into one more stage where
the device is considered down but its resources (mem, irq, qps) are
not yet freed, but I think that's not a complete fix: one day you
may want to do BPF offloads or use QPs for devlink, and those are
independent of vNIC/function state anyway..

> - so here's
> the point -- the driver resync op may use some HW

