All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next] net: dsa: Some cleanups in remove code
@ 2021-11-09 11:39 Uwe Kleine-König
  2021-11-09 11:54 ` Vladimir Oltean
  0 siblings, 1 reply; 10+ messages in thread
From: Uwe Kleine-König @ 2021-11-09 11:39 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean
  Cc: David S. Miller, Jakub Kicinski, netdev, kernel

vsc73xx_remove() returns zero unconditionally and no caller checks the
returned value. So convert the function to return no value.

For both the platform and the spi variant ..._get_drvdata() will never
return NULL in .remove() because the remove callback is only called after
the probe callback returned successfully and in this case driver data was
set to a non-NULL value.

Also setting driver data to NULL is not necessary, this is already done
in the driver core in __device_release_driver(), so drop this from the
remove callback, too.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
---
 drivers/net/dsa/vitesse-vsc73xx-core.c     | 4 +---
 drivers/net/dsa/vitesse-vsc73xx-platform.c | 5 -----
 drivers/net/dsa/vitesse-vsc73xx-spi.c      | 5 -----
 drivers/net/dsa/vitesse-vsc73xx.h          | 2 +-
 4 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/drivers/net/dsa/vitesse-vsc73xx-core.c b/drivers/net/dsa/vitesse-vsc73xx-core.c
index a4b1447ff055..4c18f619ec02 100644
--- a/drivers/net/dsa/vitesse-vsc73xx-core.c
+++ b/drivers/net/dsa/vitesse-vsc73xx-core.c
@@ -1216,12 +1216,10 @@ int vsc73xx_probe(struct vsc73xx *vsc)
 }
 EXPORT_SYMBOL(vsc73xx_probe);
 
-int vsc73xx_remove(struct vsc73xx *vsc)
+void vsc73xx_remove(struct vsc73xx *vsc)
 {
 	dsa_unregister_switch(vsc->ds);
 	gpiod_set_value(vsc->reset, 1);
-
-	return 0;
 }
 EXPORT_SYMBOL(vsc73xx_remove);
 
diff --git a/drivers/net/dsa/vitesse-vsc73xx-platform.c b/drivers/net/dsa/vitesse-vsc73xx-platform.c
index fe4b154a0a57..f2715bee2173 100644
--- a/drivers/net/dsa/vitesse-vsc73xx-platform.c
+++ b/drivers/net/dsa/vitesse-vsc73xx-platform.c
@@ -116,13 +116,8 @@ static int vsc73xx_platform_remove(struct platform_device *pdev)
 {
 	struct vsc73xx_platform *vsc_platform = platform_get_drvdata(pdev);
 
-	if (!vsc_platform)
-		return 0;
-
 	vsc73xx_remove(&vsc_platform->vsc);
 
-	platform_set_drvdata(pdev, NULL);
-
 	return 0;
 }
 
diff --git a/drivers/net/dsa/vitesse-vsc73xx-spi.c b/drivers/net/dsa/vitesse-vsc73xx-spi.c
index 645398901e05..6b33f754982b 100644
--- a/drivers/net/dsa/vitesse-vsc73xx-spi.c
+++ b/drivers/net/dsa/vitesse-vsc73xx-spi.c
@@ -163,13 +163,8 @@ static int vsc73xx_spi_remove(struct spi_device *spi)
 {
 	struct vsc73xx_spi *vsc_spi = spi_get_drvdata(spi);
 
-	if (!vsc_spi)
-		return 0;
-
 	vsc73xx_remove(&vsc_spi->vsc);
 
-	spi_set_drvdata(spi, NULL);
-
 	return 0;
 }
 
diff --git a/drivers/net/dsa/vitesse-vsc73xx.h b/drivers/net/dsa/vitesse-vsc73xx.h
index 30b951504e65..30b1f0a36566 100644
--- a/drivers/net/dsa/vitesse-vsc73xx.h
+++ b/drivers/net/dsa/vitesse-vsc73xx.h
@@ -26,5 +26,5 @@ struct vsc73xx_ops {
 
 int vsc73xx_is_addr_valid(u8 block, u8 subblock);
 int vsc73xx_probe(struct vsc73xx *vsc);
-int vsc73xx_remove(struct vsc73xx *vsc);
+void vsc73xx_remove(struct vsc73xx *vsc);
 void vsc73xx_shutdown(struct vsc73xx *vsc);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next] net: dsa: Some cleanups in remove code
  2021-11-09 11:39 [PATCH net-next] net: dsa: Some cleanups in remove code Uwe Kleine-König
@ 2021-11-09 11:54 ` Vladimir Oltean
  2021-11-09 17:50   ` Uwe Kleine-König
  0 siblings, 1 reply; 10+ messages in thread
From: Vladimir Oltean @ 2021-11-09 11:54 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Andrew Lunn, Vivien Didelot, Florian Fainelli, David S. Miller,
	Jakub Kicinski, netdev, kernel

Your commit prefix does not reflect the fact that you are touching the
vsc73xx driver. Try "net: dsa: vsc73xx: ".

On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote:
> vsc73xx_remove() returns zero unconditionally and no caller checks the
> returned value. So convert the function to return no value.

This I agree with.

> For both the platform and the spi variant ..._get_drvdata() will never
> return NULL in .remove() because the remove callback is only called after
> the probe callback returned successfully and in this case driver data was
> set to a non-NULL value.

Have you read the commit message of 0650bf52b31f ("net: dsa: be
compatible with masters which unregister on shutdown")?

To remove the check for dev_get_drvdata == NULL in ->remove, you need to
prove that ->remove will never be called after ->shutdown. For platform
devices this is pretty easy to prove, for SPI devices not so much.
I intentionally kept the code structure the same because code gets
copied around a lot, it is easy to copy from the wrong place.

> Also setting driver data to NULL is not necessary, this is already done
> in the driver core in __device_release_driver(), so drop this from the
> remove callback, too.

And this was also intentional, for visibility more or less. I would like
you to ack that you understand the problems surrounding ->remove/->shutdown
ordering for devices on buses, prior to making seemingly trivial cleanups.

> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> ---

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next] net: dsa: Some cleanups in remove code
  2021-11-09 11:54 ` Vladimir Oltean
@ 2021-11-09 17:50   ` Uwe Kleine-König
  2021-11-10 13:15     ` Vladimir Oltean
  0 siblings, 1 reply; 10+ messages in thread
From: Uwe Kleine-König @ 2021-11-09 17:50 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Andrew Lunn, Florian Fainelli, netdev, David S. Miller, kernel,
	Jakub Kicinski, Vivien Didelot, Greg Kroah-Hartman

[-- Attachment #1: Type: text/plain, Size: 2467 bytes --]

Hello,

Cc += gregkh, maybe he has something to say on this matter

On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote:
> Your commit prefix does not reflect the fact that you are touching the
> vsc73xx driver. Try "net: dsa: vsc73xx: ".

Oh, I missed that indeed.

> On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote:
> > vsc73xx_remove() returns zero unconditionally and no caller checks the
> > returned value. So convert the function to return no value.
> 
> This I agree with.
> 
> > For both the platform and the spi variant ..._get_drvdata() will never
> > return NULL in .remove() because the remove callback is only called after
> > the probe callback returned successfully and in this case driver data was
> > set to a non-NULL value.
> 
> Have you read the commit message of 0650bf52b31f ("net: dsa: be
> compatible with masters which unregister on shutdown")?

No. But I did now. I consider it very surprising that .shutdown() calls
the .remove() callback and would recommend to not do this. The commit
log seems to prove this being difficult.

> To remove the check for dev_get_drvdata == NULL in ->remove, you need to
> prove that ->remove will never be called after ->shutdown. For platform
> devices this is pretty easy to prove, for SPI devices not so much.
> I intentionally kept the code structure the same because code gets
> copied around a lot, it is easy to copy from the wrong place.

Alternatively remove spi_set_drvdata(spi, NULL); from
vsc73xx_spi_shutdown()? Also I'm not aware how platform devices are
different to spi devices that the ordering of .remove and shutdown() is
more or less obvious than on the other bus?!

> > Also setting driver data to NULL is not necessary, this is already done
> > in the driver core in __device_release_driver(), so drop this from the
> > remove callback, too.
> 
> And this was also intentional, for visibility more or less. I would like
> you to ack that you understand the problems surrounding ->remove/->shutdown
> ordering for devices on buses, prior to making seemingly trivial cleanups.

I see that the change is not so obviously correct as I thought. I'll
have to think about this and will respin if and when I find a sane way
forward.

Best regards
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | https://www.pengutronix.de/ |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next] net: dsa: Some cleanups in remove code
  2021-11-09 17:50   ` Uwe Kleine-König
@ 2021-11-10 13:15     ` Vladimir Oltean
  2021-11-10 21:03       ` Uwe Kleine-König
  0 siblings, 1 reply; 10+ messages in thread
From: Vladimir Oltean @ 2021-11-10 13:15 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Andrew Lunn, Florian Fainelli, netdev, David S. Miller, kernel,
	Jakub Kicinski, Vivien Didelot, Greg Kroah-Hartman

On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote:
> Hello,
> 
> Cc += gregkh, maybe he has something to say on this matter
> 
> On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote:
> > Your commit prefix does not reflect the fact that you are touching the
> > vsc73xx driver. Try "net: dsa: vsc73xx: ".
> 
> Oh, I missed that indeed.
> 
> > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote:
> > > vsc73xx_remove() returns zero unconditionally and no caller checks the
> > > returned value. So convert the function to return no value.
> > 
> > This I agree with.
> > 
> > > For both the platform and the spi variant ..._get_drvdata() will never
> > > return NULL in .remove() because the remove callback is only called after
> > > the probe callback returned successfully and in this case driver data was
> > > set to a non-NULL value.
> > 
> > Have you read the commit message of 0650bf52b31f ("net: dsa: be
> > compatible with masters which unregister on shutdown")?
> 
> No. But I did now. I consider it very surprising that .shutdown() calls
> the .remove() callback and would recommend to not do this. The commit
> log seems to prove this being difficult.

Why do you consider it surprising?

Many drivers implement ->shutdown by calling ->remove for the simple
reason that ->remove provides for a well-tested code path already, and
leaves the hardware in a known state, workable for kexec and others.

Many drivers have buses beneath them. Those buses go away when these
drivers unregister, and so do their children.

==============================================

=> some drivers do both => children of these buses should expect to be
potentially unregistered after they've been shut down.

> > To remove the check for dev_get_drvdata == NULL in ->remove, you need to
> > prove that ->remove will never be called after ->shutdown. For platform
> > devices this is pretty easy to prove, for SPI devices not so much.
> > I intentionally kept the code structure the same because code gets
> > copied around a lot, it is easy to copy from the wrong place.
> 
> Alternatively remove spi_set_drvdata(spi, NULL); from
> vsc73xx_spi_shutdown()?

What is the end goal exactly?

> Also I'm not aware how platform devices are
> different to spi devices that the ordering of .remove and shutdown() is
> more or less obvious than on the other bus?!

Not sure what you mean. See the explanation above. For the "platform"
bus, there simply isn't any code path that unregisters children on the
->shutdown callback. For other buses, there is.

> > > Also setting driver data to NULL is not necessary, this is already done
> > > in the driver core in __device_release_driver(), so drop this from the
> > > remove callback, too.
> > 
> > And this was also intentional, for visibility more or less. I would like
> > you to ack that you understand the problems surrounding ->remove/->shutdown
> > ordering for devices on buses, prior to making seemingly trivial cleanups.
> 
> I see that the change is not so obviously correct as I thought. I'll
> have to think about this and will respin if and when I find a sane way
> forward.

A way forward towards what? This is literally a cosmetic patch that
would happen to break some stuff, were it to be applied.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next] net: dsa: Some cleanups in remove code
  2021-11-10 13:15     ` Vladimir Oltean
@ 2021-11-10 21:03       ` Uwe Kleine-König
  2021-11-10 22:56         ` Vladimir Oltean
  0 siblings, 1 reply; 10+ messages in thread
From: Uwe Kleine-König @ 2021-11-10 21:03 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Andrew Lunn, Florian Fainelli, netdev, Vivien Didelot, kernel,
	Greg Kroah-Hartman, Jakub Kicinski, David S. Miller

[-- Attachment #1: Type: text/plain, Size: 4169 bytes --]

Hello Vladimir,

On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote:
> On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote:
> > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote:
> > > Your commit prefix does not reflect the fact that you are touching the
> > > vsc73xx driver. Try "net: dsa: vsc73xx: ".
> > 
> > Oh, I missed that indeed.
> > 
> > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote:
> > > > vsc73xx_remove() returns zero unconditionally and no caller checks the
> > > > returned value. So convert the function to return no value.
> > > 
> > > This I agree with.
> > > 
> > > > For both the platform and the spi variant ..._get_drvdata() will never
> > > > return NULL in .remove() because the remove callback is only called after
> > > > the probe callback returned successfully and in this case driver data was
> > > > set to a non-NULL value.
> > > 
> > > Have you read the commit message of 0650bf52b31f ("net: dsa: be
> > > compatible with masters which unregister on shutdown")?
> > 
> > No. But I did now. I consider it very surprising that .shutdown() calls
> > the .remove() callback and would recommend to not do this. The commit
> > log seems to prove this being difficult.
> 
> Why do you consider it surprising?

In my book .shutdown should be minimal and just silence the device, such
that it e.g. doesn't do any DMA any more.

> Many drivers implement ->shutdown by calling ->remove for the simple
> reason that ->remove provides for a well-tested code path already, and
> leaves the hardware in a known state, workable for kexec and others.
> 
> Many drivers have buses beneath them. Those buses go away when these
> drivers unregister, and so do their children.
> 
> ==============================================
> 
> => some drivers do both => children of these buses should expect to be
> potentially unregistered after they've been shut down.

Do you know this happens, or do you "only" fear it might happen?

> > > To remove the check for dev_get_drvdata == NULL in ->remove, you need to
> > > prove that ->remove will never be called after ->shutdown. For platform
> > > devices this is pretty easy to prove, for SPI devices not so much.
> > > I intentionally kept the code structure the same because code gets
> > > copied around a lot, it is easy to copy from the wrong place.
> > 
> > Alternatively remove spi_set_drvdata(spi, NULL); from
> > vsc73xx_spi_shutdown()?
> 
> What is the end goal exactly?

My end goal is:

diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
index eb7ac8a1e03c..183cf15fbdd2 100644
--- a/include/linux/spi/spi.h
+++ b/include/linux/spi/spi.h
@@ -280,7 +280,7 @@ struct spi_message;
 struct spi_driver {
        const struct spi_device_id *id_table;
        int                     (*probe)(struct spi_device *spi);
-       int                     (*remove)(struct spi_device *spi);
+       void                    (*remove)(struct spi_device *spi);
        void                    (*shutdown)(struct spi_device *spi);
        struct device_driver    driver;
 };

As (nearly) all spi drivers must be touched in the same commit, the
preparing goal is to have these remove callbacks simple, such that I
only have to replace their "return 0;" by "return;" (or nothing if it's
at the end of the function). Looking at vsc73xx's remove function I
didn't stop at this minimal goal and simplified the stuff that I thought
to be superflous.

> > Also I'm not aware how platform devices are
> > different to spi devices that the ordering of .remove and shutdown() is
> > more or less obvious than on the other bus?!
> 
> Not sure what you mean. See the explanation above. For the "platform"
> bus, there simply isn't any code path that unregisters children on the
> ->shutdown callback. For other buses, there is.

OK, with your last mail I understood that now, thanks.

Best regards
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | https://www.pengutronix.de/ |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next] net: dsa: Some cleanups in remove code
  2021-11-10 21:03       ` Uwe Kleine-König
@ 2021-11-10 22:56         ` Vladimir Oltean
  2021-11-11  7:57           ` Uwe Kleine-König
  0 siblings, 1 reply; 10+ messages in thread
From: Vladimir Oltean @ 2021-11-10 22:56 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Andrew Lunn, Florian Fainelli, netdev, Vivien Didelot, kernel,
	Greg Kroah-Hartman, Jakub Kicinski, David S. Miller

On Wed, Nov 10, 2021 at 10:03:46PM +0100, Uwe Kleine-König wrote:
> Hello Vladimir,
> 
> On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote:
> > On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote:
> > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote:
> > > > Your commit prefix does not reflect the fact that you are touching the
> > > > vsc73xx driver. Try "net: dsa: vsc73xx: ".
> > > 
> > > Oh, I missed that indeed.
> > > 
> > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote:
> > > > > vsc73xx_remove() returns zero unconditionally and no caller checks the
> > > > > returned value. So convert the function to return no value.
> > > > 
> > > > This I agree with.
> > > > 
> > > > > For both the platform and the spi variant ..._get_drvdata() will never
> > > > > return NULL in .remove() because the remove callback is only called after
> > > > > the probe callback returned successfully and in this case driver data was
> > > > > set to a non-NULL value.
> > > > 
> > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be
> > > > compatible with masters which unregister on shutdown")?
> > > 
> > > No. But I did now. I consider it very surprising that .shutdown() calls
> > > the .remove() callback and would recommend to not do this. The commit
> > > log seems to prove this being difficult.
> > 
> > Why do you consider it surprising?
> 
> In my book .shutdown should be minimal and just silence the device, such
> that it e.g. doesn't do any DMA any more.

To me, the more important thing to consider is that many drivers lack
any ->shutdown hook at all, and making their ->shutdown simply call
->remove is often times the least-effort path of doing something
reasonable towards quiescing the hardware. Not to mention the lesser
evil compared to not having a ->shutdown at all.

That's not to say I am not in favor of a minimal shutdown procedure if
possible. Notice how DSA has dsa_switch_shutdown() vs dsa_unregister_switch().
But judging what should go into dsa_switch_shutdown() was definitely not
simple and there might still be corner cases that I missed - although it
works for now, knock on wood.

The reality is that you'll have a very hard time convincing people to
write a dedicated code path for shutdown, if you can convince them to
write one at all. They wouldn't even know if it does all the right
things - it's not like you kexec every day (unless you're using Linux as
a bootloader - but then again, if you do that you're kind of asking for
trouble - the reason why this is the case is exactly because not having
a ->shutdown hook implemented by drivers is an option, and the driver
core doesn't e.g. fall back to calling the ->remove method, even with
all the insanity that might ensue).

> > Many drivers implement ->shutdown by calling ->remove for the simple
> > reason that ->remove provides for a well-tested code path already, and
> > leaves the hardware in a known state, workable for kexec and others.
> > 
> > Many drivers have buses beneath them. Those buses go away when these
> > drivers unregister, and so do their children.
> > 
> > ==============================================
> > 
> > => some drivers do both => children of these buses should expect to be
> > potentially unregistered after they've been shut down.
> 
> Do you know this happens, or do you "only" fear it might happen?

Are you asking whether there are SPI controllers that implement
->shutdown as ->remove? Just search for "\.shutdown" in drivers/spi.
3 out of 3 implementations call ->remove.

If you really have time to waste, here, have fun: Lino Sanfilippo had
not one, but two (!!!) reboot problems with his ksz9897 Ethernet switch
connected to a Raspberry Pi, both of which were due to other drivers
implementing their ->shutdown as ->remove. First driver was the DSA
master/host port (bcmgenet), the other was the bcm2835_spi controller.
https://patchwork.kernel.org/project/netdevbpf/cover/20210909095324.12978-1-LinoSanfilippo@gmx.de/
https://patchwork.kernel.org/project/netdevbpf/cover/20210912120932.993440-1-vladimir.oltean@nxp.com/
https://patchwork.kernel.org/project/netdevbpf/cover/20210917133436.553995-1-vladimir.oltean@nxp.com/
As soon as we implemented ->shutdown in DSA drivers (which we had mostly
not done up until that thread) we ran into the surprise that ->remove
will get called too. Yay. It wasn't trivial to sort out, but we did it
eventually in a more systematic way. Not sure whether there's anything
to change at the drivers/base/ level.

Since any SPI-controlled DSA switch can fundamentally be connected to
mostly any SPI controller, then yes, I have no doubt at all that the
same can happen even with the vsc73xx driver you're patching here.

> > > > To remove the check for dev_get_drvdata == NULL in ->remove, you need to
> > > > prove that ->remove will never be called after ->shutdown. For platform
> > > > devices this is pretty easy to prove, for SPI devices not so much.
> > > > I intentionally kept the code structure the same because code gets
> > > > copied around a lot, it is easy to copy from the wrong place.
> > > 
> > > Alternatively remove spi_set_drvdata(spi, NULL); from
> > > vsc73xx_spi_shutdown()?
> > 
> > What is the end goal exactly?
> 
> My end goal is:
> 
> diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
> index eb7ac8a1e03c..183cf15fbdd2 100644
> --- a/include/linux/spi/spi.h
> +++ b/include/linux/spi/spi.h
> @@ -280,7 +280,7 @@ struct spi_message;
>  struct spi_driver {
>         const struct spi_device_id *id_table;
>         int                     (*probe)(struct spi_device *spi);
> -       int                     (*remove)(struct spi_device *spi);
> +       void                    (*remove)(struct spi_device *spi);
>         void                    (*shutdown)(struct spi_device *spi);
>         struct device_driver    driver;
>  };
> 
> As (nearly) all spi drivers must be touched in the same commit, the
> preparing goal is to have these remove callbacks simple, such that I
> only have to replace their "return 0;" by "return;" (or nothing if it's
> at the end of the function). Looking at vsc73xx's remove function I
> didn't stop at this minimal goal and simplified the stuff that I thought
> to be superflous.

Yeah, well I guess you can stop at the minimal goal.

> > > Also I'm not aware how platform devices are
> > > different to spi devices that the ordering of .remove and shutdown() is
> > > more or less obvious than on the other bus?!
> > 
> > Not sure what you mean. See the explanation above. For the "platform"
> > bus, there simply isn't any code path that unregisters children on the
> > ->shutdown callback. For other buses, there is.
> 
> OK, with your last mail I understood that now, thanks.
> 
> Best regards
> Uwe
> 
> -- 
> Pengutronix e.K.                           | Uwe Kleine-König            |
> Industrial Linux Solutions                 | https://www.pengutronix.de/ |

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next] net: dsa: Some cleanups in remove code
  2021-11-10 22:56         ` Vladimir Oltean
@ 2021-11-11  7:57           ` Uwe Kleine-König
  2021-11-11 10:47             ` Vladimir Oltean
  0 siblings, 1 reply; 10+ messages in thread
From: Uwe Kleine-König @ 2021-11-11  7:57 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Andrew Lunn, Florian Fainelli, Greg Kroah-Hartman,
	David S. Miller, kernel, netdev, Jakub Kicinski, Vivien Didelot

[-- Attachment #1: Type: text/plain, Size: 6201 bytes --]

Hello Vladimir,

On Thu, Nov 11, 2021 at 12:56:11AM +0200, Vladimir Oltean wrote:
> On Wed, Nov 10, 2021 at 10:03:46PM +0100, Uwe Kleine-König wrote:
> > Hello Vladimir,
> > 
> > On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote:
> > > On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote:
> > > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote:
> > > > > Your commit prefix does not reflect the fact that you are touching the
> > > > > vsc73xx driver. Try "net: dsa: vsc73xx: ".
> > > > 
> > > > Oh, I missed that indeed.
> > > > 
> > > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote:
> > > > > > vsc73xx_remove() returns zero unconditionally and no caller checks the
> > > > > > returned value. So convert the function to return no value.
> > > > > 
> > > > > This I agree with.
> > > > > 
> > > > > > For both the platform and the spi variant ..._get_drvdata() will never
> > > > > > return NULL in .remove() because the remove callback is only called after
> > > > > > the probe callback returned successfully and in this case driver data was
> > > > > > set to a non-NULL value.
> > > > > 
> > > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be
> > > > > compatible with masters which unregister on shutdown")?
> > > > 
> > > > No. But I did now. I consider it very surprising that .shutdown() calls
> > > > the .remove() callback and would recommend to not do this. The commit
> > > > log seems to prove this being difficult.
> > > 
> > > Why do you consider it surprising?
> > 
> > In my book .shutdown should be minimal and just silence the device, such
> > that it e.g. doesn't do any DMA any more.
> 
> To me, the more important thing to consider is that many drivers lack
> any ->shutdown hook at all, and making their ->shutdown simply call
> ->remove is often times the least-effort path of doing something
> reasonable towards quiescing the hardware. Not to mention the lesser
> evil compared to not having a ->shutdown at all.
> 
> That's not to say I am not in favor of a minimal shutdown procedure if
> possible. Notice how DSA has dsa_switch_shutdown() vs dsa_unregister_switch().
> But judging what should go into dsa_switch_shutdown() was definitely not
> simple and there might still be corner cases that I missed - although it
> works for now, knock on wood.
> 
> The reality is that you'll have a very hard time convincing people to
> write a dedicated code path for shutdown, if you can convince them to
> write one at all. They wouldn't even know if it does all the right
> things - it's not like you kexec every day (unless you're using Linux as
> a bootloader - but then again, if you do that you're kind of asking for
> trouble - the reason why this is the case is exactly because not having
> a ->shutdown hook implemented by drivers is an option, and the driver
> core doesn't e.g. fall back to calling the ->remove method, even with
> all the insanity that might ensue).

Maybe I'm missing an important point here, but I thought it to be fine
for most drivers not to have a .shutdown hook.

> > > Many drivers implement ->shutdown by calling ->remove for the simple
> > > reason that ->remove provides for a well-tested code path already, and
> > > leaves the hardware in a known state, workable for kexec and others.
> > > 
> > > Many drivers have buses beneath them. Those buses go away when these
> > > drivers unregister, and so do their children.
> > > 
> > > ==============================================
> > > 
> > > => some drivers do both => children of these buses should expect to be
> > > potentially unregistered after they've been shut down.
> > 
> > Do you know this happens, or do you "only" fear it might happen?
> 
> Are you asking whether there are SPI controllers that implement
> ->shutdown as ->remove?

No I ask if it happens a lot / sometimes / ever that a driver's remove
callback is run for a device that was already shut down.

> Just search for "\.shutdown" in drivers/spi.
> 3 out of 3 implementations call ->remove.
> 
> If you really have time to waste, here, have fun: Lino Sanfilippo had
> not one, but two (!!!) reboot problems with his ksz9897 Ethernet switch
> connected to a Raspberry Pi, both of which were due to other drivers
> implementing their ->shutdown as ->remove. First driver was the DSA
> master/host port (bcmgenet), the other was the bcm2835_spi controller.
> https://patchwork.kernel.org/project/netdevbpf/cover/20210909095324.12978-1-LinoSanfilippo@gmx.de/
> https://patchwork.kernel.org/project/netdevbpf/cover/20210912120932.993440-1-vladimir.oltean@nxp.com/
> https://patchwork.kernel.org/project/netdevbpf/cover/20210917133436.553995-1-vladimir.oltean@nxp.com/
> As soon as we implemented ->shutdown in DSA drivers (which we had mostly
> not done up until that thread) we ran into the surprise that ->remove
> will get called too. Yay. It wasn't trivial to sort out, but we did it
> eventually in a more systematic way. Not sure whether there's anything
> to change at the drivers/base/ level.

What I wonder is: There are drivers that call .remove from .shutdown. Is
the right action to make other parts of the kernel robust with this
behaviour, or should the drivers changed to not call .remove from
.shutdown?

IMHO this is a question of promises of/expectations against the core
device layer. It must be known if for a shut down device there is (and
should be) a possibility that .remove is called. Depending on that
device drivers must be ready for this to happen, or can rely on it not
to happen.

From a global maintenance POV it would be good if it could not happen,
because then the complexity is concentrated to a small place (i.e. the
driver core, or maybe generic code in all subsystems) instead of making
each and every driver robust to this possible event that a considerable
part of the driver writers isn't aware of.

Best regards
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | https://www.pengutronix.de/ |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next] net: dsa: Some cleanups in remove code
  2021-11-11  7:57           ` Uwe Kleine-König
@ 2021-11-11 10:47             ` Vladimir Oltean
  2021-11-11 11:44               ` Greg Kroah-Hartman
  0 siblings, 1 reply; 10+ messages in thread
From: Vladimir Oltean @ 2021-11-11 10:47 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: Andrew Lunn, Florian Fainelli, Greg Kroah-Hartman,
	David S. Miller, kernel, netdev, Jakub Kicinski, Vivien Didelot

On Thu, Nov 11, 2021 at 08:57:54AM +0100, Uwe Kleine-König wrote:
> Hello Vladimir,
> 
> On Thu, Nov 11, 2021 at 12:56:11AM +0200, Vladimir Oltean wrote:
> > On Wed, Nov 10, 2021 at 10:03:46PM +0100, Uwe Kleine-König wrote:
> > > Hello Vladimir,
> > > 
> > > On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote:
> > > > On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote:
> > > > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote:
> > > > > > Your commit prefix does not reflect the fact that you are touching the
> > > > > > vsc73xx driver. Try "net: dsa: vsc73xx: ".
> > > > > 
> > > > > Oh, I missed that indeed.
> > > > > 
> > > > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote:
> > > > > > > vsc73xx_remove() returns zero unconditionally and no caller checks the
> > > > > > > returned value. So convert the function to return no value.
> > > > > > 
> > > > > > This I agree with.
> > > > > > 
> > > > > > > For both the platform and the spi variant ..._get_drvdata() will never
> > > > > > > return NULL in .remove() because the remove callback is only called after
> > > > > > > the probe callback returned successfully and in this case driver data was
> > > > > > > set to a non-NULL value.
> > > > > > 
> > > > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be
> > > > > > compatible with masters which unregister on shutdown")?
> > > > > 
> > > > > No. But I did now. I consider it very surprising that .shutdown() calls
> > > > > the .remove() callback and would recommend to not do this. The commit
> > > > > log seems to prove this being difficult.
> > > > 
> > > > Why do you consider it surprising?
> > > 
> > > In my book .shutdown should be minimal and just silence the device, such
> > > that it e.g. doesn't do any DMA any more.
> > 
> > To me, the more important thing to consider is that many drivers lack
> > any ->shutdown hook at all, and making their ->shutdown simply call
> > ->remove is often times the least-effort path of doing something
> > reasonable towards quiescing the hardware. Not to mention the lesser
> > evil compared to not having a ->shutdown at all.
> > 
> > That's not to say I am not in favor of a minimal shutdown procedure if
> > possible. Notice how DSA has dsa_switch_shutdown() vs dsa_unregister_switch().
> > But judging what should go into dsa_switch_shutdown() was definitely not
> > simple and there might still be corner cases that I missed - although it
> > works for now, knock on wood.
> > 
> > The reality is that you'll have a very hard time convincing people to
> > write a dedicated code path for shutdown, if you can convince them to
> > write one at all. They wouldn't even know if it does all the right
> > things - it's not like you kexec every day (unless you're using Linux as
> > a bootloader - but then again, if you do that you're kind of asking for
> > trouble - the reason why this is the case is exactly because not having
> > a ->shutdown hook implemented by drivers is an option, and the driver
> > core doesn't e.g. fall back to calling the ->remove method, even with
> > all the insanity that might ensue).
> 
> Maybe I'm missing an important point here, but I thought it to be fine
> for most drivers not to have a .shutdown hook.

Depends on what you mean by "most drivers". One other case of definitely
problematic things that ->shutdown must take care of are shared interrupts.
I don't have a metric at hand, but there are definitely not few drivers
which support IRQF_SHARED. Some of those don't implement ->shutdown.
What I'm saying is that it would definitely go a long way for the
problems caused by these to be solved in one fell swoop by having some
logic to fall back to the ->remove path.

> > > > Many drivers implement ->shutdown by calling ->remove for the simple
> > > > reason that ->remove provides for a well-tested code path already, and
> > > > leaves the hardware in a known state, workable for kexec and others.
> > > > 
> > > > Many drivers have buses beneath them. Those buses go away when these
> > > > drivers unregister, and so do their children.
> > > > 
> > > > ==============================================
> > > > 
> > > > => some drivers do both => children of these buses should expect to be
> > > > potentially unregistered after they've been shut down.
> > > 
> > > Do you know this happens, or do you "only" fear it might happen?
> > 
> > Are you asking whether there are SPI controllers that implement
> > ->shutdown as ->remove?
> 
> No I ask if it happens a lot / sometimes / ever that a driver's remove
> callback is run for a device that was already shut down.

So if a SPI device is connected to one of the 3 SPI controllers
mentioned by me below, it happens with 100% reproduction rate. Otherwise
it happens with 0% reproduction rate. But you don't write a SPI device
driver to work with just one SPI controller, ideally you write it to
work with all.

> > Just search for "\.shutdown" in drivers/spi.
> > 3 out of 3 implementations call ->remove.
> > 
> > If you really have time to waste, here, have fun: Lino Sanfilippo had
> > not one, but two (!!!) reboot problems with his ksz9897 Ethernet switch
> > connected to a Raspberry Pi, both of which were due to other drivers
> > implementing their ->shutdown as ->remove. First driver was the DSA
> > master/host port (bcmgenet), the other was the bcm2835_spi controller.
> > https://patchwork.kernel.org/project/netdevbpf/cover/20210909095324.12978-1-LinoSanfilippo@gmx.de/
> > https://patchwork.kernel.org/project/netdevbpf/cover/20210912120932.993440-1-vladimir.oltean@nxp.com/
> > https://patchwork.kernel.org/project/netdevbpf/cover/20210917133436.553995-1-vladimir.oltean@nxp.com/
> > As soon as we implemented ->shutdown in DSA drivers (which we had mostly
> > not done up until that thread) we ran into the surprise that ->remove
> > will get called too. Yay. It wasn't trivial to sort out, but we did it
> > eventually in a more systematic way. Not sure whether there's anything
> > to change at the drivers/base/ level.
> 
> What I wonder is: There are drivers that call .remove from .shutdown. Is
> the right action to make other parts of the kernel robust with this
> behaviour, or should the drivers changed to not call .remove from
> .shutdown?
> 
> IMHO this is a question of promises of/expectations against the core
> device layer. It must be known if for a shut down device there is (and
> should be) a possibility that .remove is called. Depending on that
> device drivers must be ready for this to happen, or can rely on it not
> to happen.
> 
> From a global maintenance POV it would be good if it could not happen,
> because then the complexity is concentrated to a small place (i.e. the
> driver core, or maybe generic code in all subsystems) instead of making
> each and every driver robust to this possible event that a considerable
> part of the driver writers isn't aware of.

IMO, if you can not offer a solid promise but merely a fragile one, then
it is always better to be robust (which DSA now is). How would you
propose that this particular promise could be fulfilled? Simply patch
the known offending drivers today and hope more drivers won't do this in
the future? Patching the device core to keep track of which devices
were shut down, so as to not call into their ->remove method?
Mind you, this issue was reported as a bug and had to be dealt with
locally, for stable kernels, so changing the driver core was not an
option.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next] net: dsa: Some cleanups in remove code
  2021-11-11 10:47             ` Vladimir Oltean
@ 2021-11-11 11:44               ` Greg Kroah-Hartman
  2021-11-11 11:51                 ` Vladimir Oltean
  0 siblings, 1 reply; 10+ messages in thread
From: Greg Kroah-Hartman @ 2021-11-11 11:44 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Uwe Kleine-König, Andrew Lunn, Florian Fainelli,
	David S. Miller, kernel, netdev, Jakub Kicinski, Vivien Didelot

On Thu, Nov 11, 2021 at 12:47:01PM +0200, Vladimir Oltean wrote:
> On Thu, Nov 11, 2021 at 08:57:54AM +0100, Uwe Kleine-König wrote:
> > Hello Vladimir,
> > 
> > On Thu, Nov 11, 2021 at 12:56:11AM +0200, Vladimir Oltean wrote:
> > > On Wed, Nov 10, 2021 at 10:03:46PM +0100, Uwe Kleine-König wrote:
> > > > Hello Vladimir,
> > > > 
> > > > On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote:
> > > > > On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote:
> > > > > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote:
> > > > > > > Your commit prefix does not reflect the fact that you are touching the
> > > > > > > vsc73xx driver. Try "net: dsa: vsc73xx: ".
> > > > > > 
> > > > > > Oh, I missed that indeed.
> > > > > > 
> > > > > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote:
> > > > > > > > vsc73xx_remove() returns zero unconditionally and no caller checks the
> > > > > > > > returned value. So convert the function to return no value.
> > > > > > > 
> > > > > > > This I agree with.
> > > > > > > 
> > > > > > > > For both the platform and the spi variant ..._get_drvdata() will never
> > > > > > > > return NULL in .remove() because the remove callback is only called after
> > > > > > > > the probe callback returned successfully and in this case driver data was
> > > > > > > > set to a non-NULL value.
> > > > > > > 
> > > > > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be
> > > > > > > compatible with masters which unregister on shutdown")?
> > > > > > 
> > > > > > No. But I did now. I consider it very surprising that .shutdown() calls
> > > > > > the .remove() callback and would recommend to not do this. The commit
> > > > > > log seems to prove this being difficult.
> > > > > 
> > > > > Why do you consider it surprising?
> > > > 
> > > > In my book .shutdown should be minimal and just silence the device, such
> > > > that it e.g. doesn't do any DMA any more.
> > > 
> > > To me, the more important thing to consider is that many drivers lack
> > > any ->shutdown hook at all, and making their ->shutdown simply call
> > > ->remove is often times the least-effort path of doing something
> > > reasonable towards quiescing the hardware. Not to mention the lesser
> > > evil compared to not having a ->shutdown at all.
> > > 
> > > That's not to say I am not in favor of a minimal shutdown procedure if
> > > possible. Notice how DSA has dsa_switch_shutdown() vs dsa_unregister_switch().
> > > But judging what should go into dsa_switch_shutdown() was definitely not
> > > simple and there might still be corner cases that I missed - although it
> > > works for now, knock on wood.
> > > 
> > > The reality is that you'll have a very hard time convincing people to
> > > write a dedicated code path for shutdown, if you can convince them to
> > > write one at all. They wouldn't even know if it does all the right
> > > things - it's not like you kexec every day (unless you're using Linux as
> > > a bootloader - but then again, if you do that you're kind of asking for
> > > trouble - the reason why this is the case is exactly because not having
> > > a ->shutdown hook implemented by drivers is an option, and the driver
> > > core doesn't e.g. fall back to calling the ->remove method, even with
> > > all the insanity that might ensue).
> > 
> > Maybe I'm missing an important point here, but I thought it to be fine
> > for most drivers not to have a .shutdown hook.
> 
> Depends on what you mean by "most drivers". One other case of definitely
> problematic things that ->shutdown must take care of are shared interrupts.
> I don't have a metric at hand, but there are definitely not few drivers
> which support IRQF_SHARED. Some of those don't implement ->shutdown.
> What I'm saying is that it would definitely go a long way for the
> problems caused by these to be solved in one fell swoop by having some
> logic to fall back to the ->remove path.
> 
> > > > > Many drivers implement ->shutdown by calling ->remove for the simple
> > > > > reason that ->remove provides for a well-tested code path already, and
> > > > > leaves the hardware in a known state, workable for kexec and others.
> > > > > 
> > > > > Many drivers have buses beneath them. Those buses go away when these
> > > > > drivers unregister, and so do their children.
> > > > > 
> > > > > ==============================================
> > > > > 
> > > > > => some drivers do both => children of these buses should expect to be
> > > > > potentially unregistered after they've been shut down.
> > > > 
> > > > Do you know this happens, or do you "only" fear it might happen?
> > > 
> > > Are you asking whether there are SPI controllers that implement
> > > ->shutdown as ->remove?
> > 
> > No I ask if it happens a lot / sometimes / ever that a driver's remove
> > callback is run for a device that was already shut down.
> 
> So if a SPI device is connected to one of the 3 SPI controllers
> mentioned by me below, it happens with 100% reproduction rate. Otherwise
> it happens with 0% reproduction rate. But you don't write a SPI device
> driver to work with just one SPI controller, ideally you write it to
> work with all.
> 
> > > Just search for "\.shutdown" in drivers/spi.
> > > 3 out of 3 implementations call ->remove.
> > > 
> > > If you really have time to waste, here, have fun: Lino Sanfilippo had
> > > not one, but two (!!!) reboot problems with his ksz9897 Ethernet switch
> > > connected to a Raspberry Pi, both of which were due to other drivers
> > > implementing their ->shutdown as ->remove. First driver was the DSA
> > > master/host port (bcmgenet), the other was the bcm2835_spi controller.
> > > https://patchwork.kernel.org/project/netdevbpf/cover/20210909095324.12978-1-LinoSanfilippo@gmx.de/
> > > https://patchwork.kernel.org/project/netdevbpf/cover/20210912120932.993440-1-vladimir.oltean@nxp.com/
> > > https://patchwork.kernel.org/project/netdevbpf/cover/20210917133436.553995-1-vladimir.oltean@nxp.com/
> > > As soon as we implemented ->shutdown in DSA drivers (which we had mostly
> > > not done up until that thread) we ran into the surprise that ->remove
> > > will get called too. Yay. It wasn't trivial to sort out, but we did it
> > > eventually in a more systematic way. Not sure whether there's anything
> > > to change at the drivers/base/ level.
> > 
> > What I wonder is: There are drivers that call .remove from .shutdown. Is
> > the right action to make other parts of the kernel robust with this
> > behaviour, or should the drivers changed to not call .remove from
> > .shutdown?
> > 
> > IMHO this is a question of promises of/expectations against the core
> > device layer. It must be known if for a shut down device there is (and
> > should be) a possibility that .remove is called. Depending on that
> > device drivers must be ready for this to happen, or can rely on it not
> > to happen.
> > 
> > From a global maintenance POV it would be good if it could not happen,
> > because then the complexity is concentrated to a small place (i.e. the
> > driver core, or maybe generic code in all subsystems) instead of making
> > each and every driver robust to this possible event that a considerable
> > part of the driver writers isn't aware of.
> 
> IMO, if you can not offer a solid promise but merely a fragile one, then
> it is always better to be robust (which DSA now is). How would you
> propose that this particular promise could be fulfilled? Simply patch
> the known offending drivers today and hope more drivers won't do this in
> the future? Patching the device core to keep track of which devices
> were shut down, so as to not call into their ->remove method?
> Mind you, this issue was reported as a bug and had to be dealt with
> locally, for stable kernels, so changing the driver core was not an
> option.

Fix things properly first, in Linus's tree, and then worry about stable
kernels.  Never the other way around please.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next] net: dsa: Some cleanups in remove code
  2021-11-11 11:44               ` Greg Kroah-Hartman
@ 2021-11-11 11:51                 ` Vladimir Oltean
  0 siblings, 0 replies; 10+ messages in thread
From: Vladimir Oltean @ 2021-11-11 11:51 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Uwe Kleine-König, Andrew Lunn, Florian Fainelli,
	David S. Miller, kernel, netdev, Jakub Kicinski, Vivien Didelot

On Thu, Nov 11, 2021 at 12:44:00PM +0100, Greg Kroah-Hartman wrote:
> On Thu, Nov 11, 2021 at 12:47:01PM +0200, Vladimir Oltean wrote:
> > On Thu, Nov 11, 2021 at 08:57:54AM +0100, Uwe Kleine-König wrote:
> > > Hello Vladimir,
> > > 
> > > On Thu, Nov 11, 2021 at 12:56:11AM +0200, Vladimir Oltean wrote:
> > > > On Wed, Nov 10, 2021 at 10:03:46PM +0100, Uwe Kleine-König wrote:
> > > > > Hello Vladimir,
> > > > > 
> > > > > On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote:
> > > > > > On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote:
> > > > > > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote:
> > > > > > > > Your commit prefix does not reflect the fact that you are touching the
> > > > > > > > vsc73xx driver. Try "net: dsa: vsc73xx: ".
> > > > > > > 
> > > > > > > Oh, I missed that indeed.
> > > > > > > 
> > > > > > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote:
> > > > > > > > > vsc73xx_remove() returns zero unconditionally and no caller checks the
> > > > > > > > > returned value. So convert the function to return no value.
> > > > > > > > 
> > > > > > > > This I agree with.
> > > > > > > > 
> > > > > > > > > For both the platform and the spi variant ..._get_drvdata() will never
> > > > > > > > > return NULL in .remove() because the remove callback is only called after
> > > > > > > > > the probe callback returned successfully and in this case driver data was
> > > > > > > > > set to a non-NULL value.
> > > > > > > > 
> > > > > > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be
> > > > > > > > compatible with masters which unregister on shutdown")?
> > > > > > > 
> > > > > > > No. But I did now. I consider it very surprising that .shutdown() calls
> > > > > > > the .remove() callback and would recommend to not do this. The commit
> > > > > > > log seems to prove this being difficult.
> > > > > > 
> > > > > > Why do you consider it surprising?
> > > > > 
> > > > > In my book .shutdown should be minimal and just silence the device, such
> > > > > that it e.g. doesn't do any DMA any more.
> > > > 
> > > > To me, the more important thing to consider is that many drivers lack
> > > > any ->shutdown hook at all, and making their ->shutdown simply call
> > > > ->remove is often times the least-effort path of doing something
> > > > reasonable towards quiescing the hardware. Not to mention the lesser
> > > > evil compared to not having a ->shutdown at all.
> > > > 
> > > > That's not to say I am not in favor of a minimal shutdown procedure if
> > > > possible. Notice how DSA has dsa_switch_shutdown() vs dsa_unregister_switch().
> > > > But judging what should go into dsa_switch_shutdown() was definitely not
> > > > simple and there might still be corner cases that I missed - although it
> > > > works for now, knock on wood.
> > > > 
> > > > The reality is that you'll have a very hard time convincing people to
> > > > write a dedicated code path for shutdown, if you can convince them to
> > > > write one at all. They wouldn't even know if it does all the right
> > > > things - it's not like you kexec every day (unless you're using Linux as
> > > > a bootloader - but then again, if you do that you're kind of asking for
> > > > trouble - the reason why this is the case is exactly because not having
> > > > a ->shutdown hook implemented by drivers is an option, and the driver
> > > > core doesn't e.g. fall back to calling the ->remove method, even with
> > > > all the insanity that might ensue).
> > > 
> > > Maybe I'm missing an important point here, but I thought it to be fine
> > > for most drivers not to have a .shutdown hook.
> > 
> > Depends on what you mean by "most drivers". One other case of definitely
> > problematic things that ->shutdown must take care of are shared interrupts.
> > I don't have a metric at hand, but there are definitely not few drivers
> > which support IRQF_SHARED. Some of those don't implement ->shutdown.
> > What I'm saying is that it would definitely go a long way for the
> > problems caused by these to be solved in one fell swoop by having some
> > logic to fall back to the ->remove path.
> > 
> > > > > > Many drivers implement ->shutdown by calling ->remove for the simple
> > > > > > reason that ->remove provides for a well-tested code path already, and
> > > > > > leaves the hardware in a known state, workable for kexec and others.
> > > > > > 
> > > > > > Many drivers have buses beneath them. Those buses go away when these
> > > > > > drivers unregister, and so do their children.
> > > > > > 
> > > > > > ==============================================
> > > > > > 
> > > > > > => some drivers do both => children of these buses should expect to be
> > > > > > potentially unregistered after they've been shut down.
> > > > > 
> > > > > Do you know this happens, or do you "only" fear it might happen?
> > > > 
> > > > Are you asking whether there are SPI controllers that implement
> > > > ->shutdown as ->remove?
> > > 
> > > No I ask if it happens a lot / sometimes / ever that a driver's remove
> > > callback is run for a device that was already shut down.
> > 
> > So if a SPI device is connected to one of the 3 SPI controllers
> > mentioned by me below, it happens with 100% reproduction rate. Otherwise
> > it happens with 0% reproduction rate. But you don't write a SPI device
> > driver to work with just one SPI controller, ideally you write it to
> > work with all.
> > 
> > > > Just search for "\.shutdown" in drivers/spi.
> > > > 3 out of 3 implementations call ->remove.
> > > > 
> > > > If you really have time to waste, here, have fun: Lino Sanfilippo had
> > > > not one, but two (!!!) reboot problems with his ksz9897 Ethernet switch
> > > > connected to a Raspberry Pi, both of which were due to other drivers
> > > > implementing their ->shutdown as ->remove. First driver was the DSA
> > > > master/host port (bcmgenet), the other was the bcm2835_spi controller.
> > > > https://patchwork.kernel.org/project/netdevbpf/cover/20210909095324.12978-1-LinoSanfilippo@gmx.de/
> > > > https://patchwork.kernel.org/project/netdevbpf/cover/20210912120932.993440-1-vladimir.oltean@nxp.com/
> > > > https://patchwork.kernel.org/project/netdevbpf/cover/20210917133436.553995-1-vladimir.oltean@nxp.com/
> > > > As soon as we implemented ->shutdown in DSA drivers (which we had mostly
> > > > not done up until that thread) we ran into the surprise that ->remove
> > > > will get called too. Yay. It wasn't trivial to sort out, but we did it
> > > > eventually in a more systematic way. Not sure whether there's anything
> > > > to change at the drivers/base/ level.
> > > 
> > > What I wonder is: There are drivers that call .remove from .shutdown. Is
> > > the right action to make other parts of the kernel robust with this
> > > behaviour, or should the drivers changed to not call .remove from
> > > .shutdown?
> > > 
> > > IMHO this is a question of promises of/expectations against the core
> > > device layer. It must be known if for a shut down device there is (and
> > > should be) a possibility that .remove is called. Depending on that
> > > device drivers must be ready for this to happen, or can rely on it not
> > > to happen.
> > > 
> > > From a global maintenance POV it would be good if it could not happen,
> > > because then the complexity is concentrated to a small place (i.e. the
> > > driver core, or maybe generic code in all subsystems) instead of making
> > > each and every driver robust to this possible event that a considerable
> > > part of the driver writers isn't aware of.
> > 
> > IMO, if you can not offer a solid promise but merely a fragile one, then
> > it is always better to be robust (which DSA now is). How would you
> > propose that this particular promise could be fulfilled? Simply patch
> > the known offending drivers today and hope more drivers won't do this in
> > the future? Patching the device core to keep track of which devices
> > were shut down, so as to not call into their ->remove method?
> > Mind you, this issue was reported as a bug and had to be dealt with
> > locally, for stable kernels, so changing the driver core was not an
> > option.
> 
> Fix things properly first, in Linus's tree, and then worry about stable
> kernels.  Never the other way around please.

Thanks, all clear now. I don't know why I never thought about that.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-11-11 11:51 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-09 11:39 [PATCH net-next] net: dsa: Some cleanups in remove code Uwe Kleine-König
2021-11-09 11:54 ` Vladimir Oltean
2021-11-09 17:50   ` Uwe Kleine-König
2021-11-10 13:15     ` Vladimir Oltean
2021-11-10 21:03       ` Uwe Kleine-König
2021-11-10 22:56         ` Vladimir Oltean
2021-11-11  7:57           ` Uwe Kleine-König
2021-11-11 10:47             ` Vladimir Oltean
2021-11-11 11:44               ` Greg Kroah-Hartman
2021-11-11 11:51                 ` Vladimir Oltean

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.