All of lore.kernel.org
 help / color / mirror / Atom feed
* Regressions in stable releases
@ 2021-08-05 16:11 Guenter Roeck
  2021-08-05 16:19 ` Greg Kroah-Hartman
  2021-08-05 16:42 ` Willy Tarreau
  0 siblings, 2 replies; 13+ messages in thread
From: Guenter Roeck @ 2021-08-05 16:11 UTC (permalink / raw)
  To: stable; +Cc: Greg Kroah-Hartman, Sasha Levin

Hi folks,

we have (at least) two severe regressions in stable releases right now.

[SHAs are from linux-5.10.y]

2435dcfd16ac spi: mediatek: fix fifo rx mode
	Breaks SPI access on all Mediatek devices for small transactions
	(including all Mediatek based Chromebooks since they use small SPI
	 transactions for EC communication)

60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
	Breaks Bluetooth on various devices (Mediatek and possibly others)
	Discussion: https://lkml.org/lkml/2021/7/28/569

Unfortunately, it appears that all our testing doesn't cover SPI and Bluetooth.

I understand that upstream is just as broken until fixes are applied there.
Still, it shows that our test coverage is far from where it needs to be,
and/or that we may be too aggressive with backporting patches to stable
releases.

If you have an idea how to improve the situation, please let me know.

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-05 16:11 Regressions in stable releases Guenter Roeck
@ 2021-08-05 16:19 ` Greg Kroah-Hartman
  2021-08-05 17:39   ` Guenter Roeck
  2021-08-05 16:42 ` Willy Tarreau
  1 sibling, 1 reply; 13+ messages in thread
From: Greg Kroah-Hartman @ 2021-08-05 16:19 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: stable, Sasha Levin

On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
> Hi folks,
> 
> we have (at least) two severe regressions in stable releases right now.
> 
> [SHAs are from linux-5.10.y]
> 
> 2435dcfd16ac spi: mediatek: fix fifo rx mode
> 	Breaks SPI access on all Mediatek devices for small transactions
> 	(including all Mediatek based Chromebooks since they use small SPI
> 	 transactions for EC communication)
> 
> 60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
> 	Breaks Bluetooth on various devices (Mediatek and possibly others)
> 	Discussion: https://lkml.org/lkml/2021/7/28/569

Are either of these being tracked on the regressions list?  I have not
noticed them being reported there, or on the stable list :(

> Unfortunately, it appears that all our testing doesn't cover SPI and Bluetooth.
> 
> I understand that upstream is just as broken until fixes are applied there.
> Still, it shows that our test coverage is far from where it needs to be,
> and/or that we may be too aggressive with backporting patches to stable
> releases.
> 
> If you have an idea how to improve the situation, please let me know.

We need to get tests running in kernelci on real hardware, that's going
to be much more helpful here.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-05 16:11 Regressions in stable releases Guenter Roeck
  2021-08-05 16:19 ` Greg Kroah-Hartman
@ 2021-08-05 16:42 ` Willy Tarreau
  2021-08-05 17:29   ` Guenter Roeck
  1 sibling, 1 reply; 13+ messages in thread
From: Willy Tarreau @ 2021-08-05 16:42 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: stable, Greg Kroah-Hartman, Sasha Levin

Hi Guenter,

On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
> Hi folks,
> 
> we have (at least) two severe regressions in stable releases right now.
> 
> [SHAs are from linux-5.10.y]
> 
> 2435dcfd16ac spi: mediatek: fix fifo rx mode
> 	Breaks SPI access on all Mediatek devices for small transactions
> 	(including all Mediatek based Chromebooks since they use small SPI
> 	 transactions for EC communication)
> 
> 60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
> 	Breaks Bluetooth on various devices (Mediatek and possibly others)
> 	Discussion: https://lkml.org/lkml/2021/7/28/569
> 
> Unfortunately, it appears that all our testing doesn't cover SPI and Bluetooth.
> 
> I understand that upstream is just as broken until fixes are applied there.
> Still, it shows that our test coverage is far from where it needs to be,
> and/or that we may be too aggressive with backporting patches to stable
> releases.
> 
> If you have an idea how to improve the situation, please let me know.

The first one is really interesting. The author did all the job right
by documenting what commit this patch fixed, this commit was indeed
present in the stable branches, and given that the change is probably
only understood by the driver's maintainer, it's very likely that he
did that in good faith after some testing on real hardware. So there's
little chance that any extra form of automated testing will catch this
if it worked at least in one place.

It looks like a typical "works for me" regression. The best thing that
could possibly be done to limit such occurrences would be to wait "long
enough" before backporting them, in hope to catch breakage reports before
the backport, but here there were already 3 weeks between the patch was
submitted and it was backported.

One solution might be to further increase the delay between the patch
and its integration, but do we think it could increase the likelyhood
that the bug is detected and reported in *some* environment ? And if
so, is the overall situation any better with *some* users experiencing
a possible rare regression compared to leaving 100% of the users exposed
to a known bug in stable branches ? That's always difficult. As a user
of the stable branches I personally prefer to risk a rare regression and
report it than not getting fixes, because if the risk of regression in a
patch is 1%, I'd rather get 99 useful fixes and 1 regression than no fix
for a bug that bothers me.

So very likely the most robust solution is to further encourage users
to report regressions as soon as they are met so that the faulty commits
are spotted, reverted, and their author is aware that a corner case was
identified. Greg is always very fast to respond to requests for reverts.

Also, for a developer, being aware of a deployment exhibiting an issue
is extremely valuable, and the chances of spotting an issue and getting
it fixed are much higher if the delay between integration and deployment
is shorter. Otherwise it can sometimes take months to years before driver
code lands into users' hands, especially with embedded systems where the
rule remains "if it's not broken, don't touch it".

So in the end, the more often users upgrade, the better both for them and
to spot issues. I know it doesn't please everyone, but while nobody likes
bugs, someone has to face them at some point in order to report them :-/

In an ideal world we could imagine that postponing sensitive backports to
older branches would improve their stability and reduce users' exposure.
We're doing this to a certain extent in haproxy and it sort-of works. But
the cost of keeping fixes in queue and postponing them is high and the
risk of failing a backport is much higher this way, because either you
prepare all the backports at once and you risk that the context changed
between the initial backport and the merge, or you have to them at the
last moment, without remembering any of the analysis that had to be done
for the first branches.

Maybe in the end a sweet spot could be to just release older branches less
often and with more patches each time, offering more chances to expose the
faulty backports to more recent branches and affecting super-stable users
even less ?

Just my two cents on this never-ending debate :-/
Willy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-05 16:42 ` Willy Tarreau
@ 2021-08-05 17:29   ` Guenter Roeck
  2021-08-05 18:30     ` Willy Tarreau
  0 siblings, 1 reply; 13+ messages in thread
From: Guenter Roeck @ 2021-08-05 17:29 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: stable, Greg Kroah-Hartman, Sasha Levin

On Thu, Aug 05, 2021 at 06:42:54PM +0200, Willy Tarreau wrote:
> Hi Guenter,
> 
> On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
> > Hi folks,
> > 
> > we have (at least) two severe regressions in stable releases right now.
> > 
> > [SHAs are from linux-5.10.y]
> > 
> > 2435dcfd16ac spi: mediatek: fix fifo rx mode
> > 	Breaks SPI access on all Mediatek devices for small transactions
> > 	(including all Mediatek based Chromebooks since they use small SPI
> > 	 transactions for EC communication)
> > 
> > 60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
> > 	Breaks Bluetooth on various devices (Mediatek and possibly others)
> > 	Discussion: https://lkml.org/lkml/2021/7/28/569
> > 
> > Unfortunately, it appears that all our testing doesn't cover SPI and Bluetooth.
> > 
> > I understand that upstream is just as broken until fixes are applied there.
> > Still, it shows that our test coverage is far from where it needs to be,
> > and/or that we may be too aggressive with backporting patches to stable
> > releases.
> > 
> > If you have an idea how to improve the situation, please let me know.
> 
> The first one is really interesting. The author did all the job right
> by documenting what commit this patch fixed, this commit was indeed
> present in the stable branches, and given that the change is probably
> only understood by the driver's maintainer, it's very likely that he
> did that in good faith after some testing on real hardware. So there's
> little chance that any extra form of automated testing will catch this
> if it worked at least in one place.
> 
> It looks like a typical "works for me" regression. The best thing that
> could possibly be done to limit such occurrences would be to wait "long
> enough" before backporting them, in hope to catch breakage reports before
> the backport, but here there were already 3 weeks between the patch was
> submitted and it was backported.

No. The patch is wrong. It just _looks_ correct at first glance. It claims
to fix something that wasn't broken. FIFO rx mode was working just fine,
handled in the receive interrupt as one would expect. The patch copies
data from the rx fifo before the transfer is even started. I do not
think it was tested on real hardware, or at least fifo receive transfer
was not tested.

The patch _does_ fix a problem on the transmit side, but the patch subject
doesn't mention that.

Guenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-05 16:19 ` Greg Kroah-Hartman
@ 2021-08-05 17:39   ` Guenter Roeck
  2021-08-05 17:43     ` Greg Kroah-Hartman
  0 siblings, 1 reply; 13+ messages in thread
From: Guenter Roeck @ 2021-08-05 17:39 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: stable, Sasha Levin

On Thu, Aug 05, 2021 at 06:19:15PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
> > Hi folks,
> > 
> > we have (at least) two severe regressions in stable releases right now.
> > 
> > [SHAs are from linux-5.10.y]
> > 
> > 2435dcfd16ac spi: mediatek: fix fifo rx mode
> > 	Breaks SPI access on all Mediatek devices for small transactions
> > 	(including all Mediatek based Chromebooks since they use small SPI
> > 	 transactions for EC communication)
> > 
> > 60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
> > 	Breaks Bluetooth on various devices (Mediatek and possibly others)
> > 	Discussion: https://lkml.org/lkml/2021/7/28/569
> 
> Are either of these being tracked on the regressions list?  I have not
> noticed them being reported there, or on the stable list :(
> 

I wasn't aware of regressions@lists.linux.dev. Clueless me. And this is the
report on the stable list, or at least that was the idea. Should I send separate
emails to regressions@ with details ?

> > Unfortunately, it appears that all our testing doesn't cover SPI and Bluetooth.
> > 
> > I understand that upstream is just as broken until fixes are applied there.
> > Still, it shows that our test coverage is far from where it needs to be,
> > and/or that we may be too aggressive with backporting patches to stable
> > releases.
> > 
> > If you have an idea how to improve the situation, please let me know.
> 
> We need to get tests running in kernelci on real hardware, that's going
> to be much more helpful here.
> 

Yes, I know. Of course it didn't help that our internal testing didn't catch
the problem until after the fact either.

Guenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-05 17:39   ` Guenter Roeck
@ 2021-08-05 17:43     ` Greg Kroah-Hartman
  2021-08-05 19:44       ` Guenter Roeck
  0 siblings, 1 reply; 13+ messages in thread
From: Greg Kroah-Hartman @ 2021-08-05 17:43 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: stable, Sasha Levin

On Thu, Aug 05, 2021 at 10:39:22AM -0700, Guenter Roeck wrote:
> On Thu, Aug 05, 2021 at 06:19:15PM +0200, Greg Kroah-Hartman wrote:
> > On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
> > > Hi folks,
> > > 
> > > we have (at least) two severe regressions in stable releases right now.
> > > 
> > > [SHAs are from linux-5.10.y]
> > > 
> > > 2435dcfd16ac spi: mediatek: fix fifo rx mode
> > > 	Breaks SPI access on all Mediatek devices for small transactions
> > > 	(including all Mediatek based Chromebooks since they use small SPI
> > > 	 transactions for EC communication)
> > > 
> > > 60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
> > > 	Breaks Bluetooth on various devices (Mediatek and possibly others)
> > > 	Discussion: https://lkml.org/lkml/2021/7/28/569
> > 
> > Are either of these being tracked on the regressions list?  I have not
> > noticed them being reported there, or on the stable list :(
> > 
> 
> I wasn't aware of regressions@lists.linux.dev. Clueless me. And this is the
> report on the stable list, or at least that was the idea. Should I send separate
> emails to regressions@ with details ?

For regressions in Linus's tree, yes please do.  I have seen many stable
regressions also sent there as they mirror regressions in Linus's tree
(right now there is at least one ACPI regression that hopefully will
show up in Linus's tree soon that has hit stable as well..)

> > > Unfortunately, it appears that all our testing doesn't cover SPI and Bluetooth.
> > > 
> > > I understand that upstream is just as broken until fixes are applied there.
> > > Still, it shows that our test coverage is far from where it needs to be,
> > > and/or that we may be too aggressive with backporting patches to stable
> > > releases.
> > > 
> > > If you have an idea how to improve the situation, please let me know.
> > 
> > We need to get tests running in kernelci on real hardware, that's going
> > to be much more helpful here.
> > 
> 
> Yes, I know. Of course it didn't help that our internal testing didn't catch
> the problem until after the fact either.

There will always be issues that can only be caught on real hardware, we
are all just human.  The goal is to handle them quickly when they are
caught.

I'll go revert the above commits now, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-05 17:29   ` Guenter Roeck
@ 2021-08-05 18:30     ` Willy Tarreau
  2021-08-06 14:16       ` Guenter Roeck
  2021-08-06 14:37       ` Sasha Levin
  0 siblings, 2 replies; 13+ messages in thread
From: Willy Tarreau @ 2021-08-05 18:30 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: stable, Greg Kroah-Hartman, Sasha Levin

On Thu, Aug 05, 2021 at 10:29:49AM -0700, Guenter Roeck wrote:
> > It looks like a typical "works for me" regression. The best thing that
> > could possibly be done to limit such occurrences would be to wait "long
> > enough" before backporting them, in hope to catch breakage reports before
> > the backport, but here there were already 3 weeks between the patch was
> > submitted and it was backported.
> 
> No. The patch is wrong. It just _looks_ correct at first glance.

So that's the core of the problem. Stable maintainers cannot be tasked
to try to analyse each patch in its finest details to figure whether a
maintainer that's expected to be more knowledgeable than them on their
driver did something wrong.

Then in my opinion we should encourage *not* to use "Fixes:" on untested
patches (untested patches will always happen due to hardware availability
or lack of a reliable reproducer).

What about this to try to improve the situation in this specific case ?

Willy


From ef646bae2139ba005de78940064c464126c430e6 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 5 Aug 2021 20:24:30 +0200
Subject: docs: process: submitting-patches.rst: recommend against 'Fixes:' if
 untested

'Fixes:' usually is taken as authority and often results in a backport. If
a patch couldn't be tested although it looks perfectly valid, better not
add this tag to leave a final chance for backporters to ask about the
relevance of the backport or to check for future tests.

Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 Documentation/process/submitting-patches.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index 0852bcf73630..54782b0e2f4c 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -140,6 +140,15 @@ An example call::
 	$ git log -1 --pretty=fixes 54a4f0239f2e
 	Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed")
 
+Please note that a 'Fixes:' tag will most often result in your patch being
+automatically backported to stable branches. If for any reason you could not
+test that it really fixes the problem (for example, because the bug is not
+reproducible, or because you did not have access to the required hardware
+at the time of writing the patch to verify it does not cause regressions),
+even if you are absolutely certain of your patch's validity, do not include
+a 'Fixes:' tag and instead explain the situation in the commit message in
+plain English.
+
 .. _split_changes:
 
 Separate your changes
-- 
2.17.5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-05 17:43     ` Greg Kroah-Hartman
@ 2021-08-05 19:44       ` Guenter Roeck
  2021-08-05 19:51         ` Greg Kroah-Hartman
  0 siblings, 1 reply; 13+ messages in thread
From: Guenter Roeck @ 2021-08-05 19:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: stable, Sasha Levin

On Thu, Aug 05, 2021 at 07:43:20PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Aug 05, 2021 at 10:39:22AM -0700, Guenter Roeck wrote:
> > On Thu, Aug 05, 2021 at 06:19:15PM +0200, Greg Kroah-Hartman wrote:
> > > On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
> > > > Hi folks,
> > > > 
> > > > we have (at least) two severe regressions in stable releases right now.
> > > > 
> > > > [SHAs are from linux-5.10.y]
> > > > 
> > > > 2435dcfd16ac spi: mediatek: fix fifo rx mode
> > > > 	Breaks SPI access on all Mediatek devices for small transactions
> > > > 	(including all Mediatek based Chromebooks since they use small SPI
> > > > 	 transactions for EC communication)
> > > > 
> > > > 60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
> > > > 	Breaks Bluetooth on various devices (Mediatek and possibly others)
> > > > 	Discussion: https://lkml.org/lkml/2021/7/28/569
> > > 
> > > Are either of these being tracked on the regressions list?  I have not
> > > noticed them being reported there, or on the stable list :(
> > > 
> > 
> > I wasn't aware of regressions@lists.linux.dev. Clueless me. And this is the
> > report on the stable list, or at least that was the idea. Should I send separate
> > emails to regressions@ with details ?
> 
> For regressions in Linus's tree, yes please do.  I have seen many stable
> regressions also sent there as they mirror regressions in Linus's tree
> (right now there is at least one ACPI regression that hopefully will
> show up in Linus's tree soon that has hit stable as well..)
> 
> > > > Unfortunately, it appears that all our testing doesn't cover SPI and Bluetooth.
> > > > 
> > > > I understand that upstream is just as broken until fixes are applied there.
> > > > Still, it shows that our test coverage is far from where it needs to be,
> > > > and/or that we may be too aggressive with backporting patches to stable
> > > > releases.
> > > > 
> > > > If you have an idea how to improve the situation, please let me know.
> > > 
> > > We need to get tests running in kernelci on real hardware, that's going
> > > to be much more helpful here.
> > > 
> > 
> > Yes, I know. Of course it didn't help that our internal testing didn't catch
> > the problem until after the fact either.
> 
> There will always be issues that can only be caught on real hardware, we
> are all just human.  The goal is to handle them quickly when they are
> caught.
> 
> I'll go revert the above commits now, thanks.

If you do that, it needs to be done all the way to v4.19.y.

Guenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-05 19:44       ` Guenter Roeck
@ 2021-08-05 19:51         ` Greg Kroah-Hartman
  0 siblings, 0 replies; 13+ messages in thread
From: Greg Kroah-Hartman @ 2021-08-05 19:51 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: stable, Sasha Levin

On Thu, Aug 05, 2021 at 12:44:54PM -0700, Guenter Roeck wrote:
> On Thu, Aug 05, 2021 at 07:43:20PM +0200, Greg Kroah-Hartman wrote:
> > On Thu, Aug 05, 2021 at 10:39:22AM -0700, Guenter Roeck wrote:
> > > On Thu, Aug 05, 2021 at 06:19:15PM +0200, Greg Kroah-Hartman wrote:
> > > > On Thu, Aug 05, 2021 at 09:11:02AM -0700, Guenter Roeck wrote:
> > > > > Hi folks,
> > > > > 
> > > > > we have (at least) two severe regressions in stable releases right now.
> > > > > 
> > > > > [SHAs are from linux-5.10.y]
> > > > > 
> > > > > 2435dcfd16ac spi: mediatek: fix fifo rx mode
> > > > > 	Breaks SPI access on all Mediatek devices for small transactions
> > > > > 	(including all Mediatek based Chromebooks since they use small SPI
> > > > > 	 transactions for EC communication)
> > > > > 
> > > > > 60789afc02f5 Bluetooth: Shutdown controller after workqueues are flushed or cancelled
> > > > > 	Breaks Bluetooth on various devices (Mediatek and possibly others)
> > > > > 	Discussion: https://lkml.org/lkml/2021/7/28/569
> > > > 
> > > > Are either of these being tracked on the regressions list?  I have not
> > > > noticed them being reported there, or on the stable list :(
> > > > 
> > > 
> > > I wasn't aware of regressions@lists.linux.dev. Clueless me. And this is the
> > > report on the stable list, or at least that was the idea. Should I send separate
> > > emails to regressions@ with details ?
> > 
> > For regressions in Linus's tree, yes please do.  I have seen many stable
> > regressions also sent there as they mirror regressions in Linus's tree
> > (right now there is at least one ACPI regression that hopefully will
> > show up in Linus's tree soon that has hit stable as well..)
> > 
> > > > > Unfortunately, it appears that all our testing doesn't cover SPI and Bluetooth.
> > > > > 
> > > > > I understand that upstream is just as broken until fixes are applied there.
> > > > > Still, it shows that our test coverage is far from where it needs to be,
> > > > > and/or that we may be too aggressive with backporting patches to stable
> > > > > releases.
> > > > > 
> > > > > If you have an idea how to improve the situation, please let me know.
> > > > 
> > > > We need to get tests running in kernelci on real hardware, that's going
> > > > to be much more helpful here.
> > > > 
> > > 
> > > Yes, I know. Of course it didn't help that our internal testing didn't catch
> > > the problem until after the fact either.
> > 
> > There will always be issues that can only be caught on real hardware, we
> > are all just human.  The goal is to handle them quickly when they are
> > caught.
> > 
> > I'll go revert the above commits now, thanks.
> 
> If you do that, it needs to be done all the way to v4.19.y.

4.9.y for the first one, 4.4.y for the second.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-05 18:30     ` Willy Tarreau
@ 2021-08-06 14:16       ` Guenter Roeck
  2021-08-06 14:42         ` Willy Tarreau
  2021-08-06 14:37       ` Sasha Levin
  1 sibling, 1 reply; 13+ messages in thread
From: Guenter Roeck @ 2021-08-06 14:16 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: stable, Greg Kroah-Hartman, Sasha Levin

On 8/5/21 11:30 AM, Willy Tarreau wrote:
> On Thu, Aug 05, 2021 at 10:29:49AM -0700, Guenter Roeck wrote:
>>> It looks like a typical "works for me" regression. The best thing that
>>> could possibly be done to limit such occurrences would be to wait "long
>>> enough" before backporting them, in hope to catch breakage reports before
>>> the backport, but here there were already 3 weeks between the patch was
>>> submitted and it was backported.
>>
>> No. The patch is wrong. It just _looks_ correct at first glance.
> 
> So that's the core of the problem. Stable maintainers cannot be tasked
> to try to analyse each patch in its finest details to figure whether a
> maintainer that's expected to be more knowledgeable than them on their
> driver did something wrong.
> 
> Then in my opinion we should encourage *not* to use "Fixes:" on untested
> patches (untested patches will always happen due to hardware availability
> or lack of a reliable reproducer).
> 
> What about this to try to improve the situation in this specific case ?
> 
> Willy
> 
> 
>>From ef646bae2139ba005de78940064c464126c430e6 Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 5 Aug 2021 20:24:30 +0200
> Subject: docs: process: submitting-patches.rst: recommend against 'Fixes:' if
>   untested
> 
> 'Fixes:' usually is taken as authority and often results in a backport. If
> a patch couldn't be tested although it looks perfectly valid, better not
> add this tag to leave a final chance for backporters to ask about the
> relevance of the backport or to check for future tests.
> 
> Cc: Guenter Roeck <linux@roeck-us.net>
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>   Documentation/process/submitting-patches.rst | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
> index 0852bcf73630..54782b0e2f4c 100644
> --- a/Documentation/process/submitting-patches.rst
> +++ b/Documentation/process/submitting-patches.rst
> @@ -140,6 +140,15 @@ An example call::
>   	$ git log -1 --pretty=fixes 54a4f0239f2e
>   	Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed")
>   
> +Please note that a 'Fixes:' tag will most often result in your patch being
> +automatically backported to stable branches. If for any reason you could not
> +test that it really fixes the problem (for example, because the bug is not
> +reproducible, or because you did not have access to the required hardware
> +at the time of writing the patch to verify it does not cause regressions),
> +even if you are absolutely certain of your patch's validity, do not include
> +a 'Fixes:' tag and instead explain the situation in the commit message in
> +plain English.
> +

I don't think that would be a good idea, First, it would discourage people
from using Fixes: tags, and second it would not really solve the problem either.

While I am sure that the patch in question wasn't really tested (after all,
it broke EC communication on all Mediatek Chromebooks using SPI to communicate
with the EC), we don't know what the author tested. It may well be (and is quite
likely) that the author _did_ test the patch, only not the affected code path.
That means the author may still have added a Fixes: tag in the wrong belief
to have tested the fix, even with the above restrictions in place.

The same is true for the Bluetooth patch: This patch was for sure tested
and works on many Bluetooth devices, except for those with a specific
Mediatek controller and driver. Again, the Fixes: tag would be there
(and it would be completely unreasonable to expect that infrastructure
patches are tested on all affected hardware).

No, I think all we can do is to improve test coverage.

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-05 18:30     ` Willy Tarreau
  2021-08-06 14:16       ` Guenter Roeck
@ 2021-08-06 14:37       ` Sasha Levin
  2021-08-06 14:52         ` Willy Tarreau
  1 sibling, 1 reply; 13+ messages in thread
From: Sasha Levin @ 2021-08-06 14:37 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Guenter Roeck, stable, Greg Kroah-Hartman

On Thu, Aug 05, 2021 at 08:30:55PM +0200, Willy Tarreau wrote:
>On Thu, Aug 05, 2021 at 10:29:49AM -0700, Guenter Roeck wrote:
>> > It looks like a typical "works for me" regression. The best thing that
>> > could possibly be done to limit such occurrences would be to wait "long
>> > enough" before backporting them, in hope to catch breakage reports before
>> > the backport, but here there were already 3 weeks between the patch was
>> > submitted and it was backported.
>>
>> No. The patch is wrong. It just _looks_ correct at first glance.
>
>So that's the core of the problem. Stable maintainers cannot be tasked
>to try to analyse each patch in its finest details to figure whether a
>maintainer that's expected to be more knowledgeable than them on their
>driver did something wrong.
>
>Then in my opinion we should encourage *not* to use "Fixes:" on untested
>patches (untested patches will always happen due to hardware availability
>or lack of a reliable reproducer).
>
>What about this to try to improve the situation in this specific case ?

No, please let's not. If there is no testing story behind a buggy patch
then it'll explode either when we go to the next version, or when we
pull it into -stable.

If we avoid taking groups of patches into -stable it'll just mean that
we end up with a huge amount of issues waiting for us during a version
upgrade.

Yes, we may be taking bugs in now, but the regression rate (according to
LWN) is pretty low, and the somewhat linear distribution of those bugs
throughout our releases makes them managable (to review when they're
sent out, to find during testing, to bisect if we hit the bug).

As Guenter points out, the path forward should be to improve our testing
story.

-- 
Thanks,
Sasha

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-06 14:16       ` Guenter Roeck
@ 2021-08-06 14:42         ` Willy Tarreau
  0 siblings, 0 replies; 13+ messages in thread
From: Willy Tarreau @ 2021-08-06 14:42 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: stable, Greg Kroah-Hartman, Sasha Levin

On Fri, Aug 06, 2021 at 07:16:32AM -0700, Guenter Roeck wrote:
> On 8/5/21 11:30 AM, Willy Tarreau wrote:
> > On Thu, Aug 05, 2021 at 10:29:49AM -0700, Guenter Roeck wrote:
> > > > It looks like a typical "works for me" regression. The best thing that
> > > > could possibly be done to limit such occurrences would be to wait "long
> > > > enough" before backporting them, in hope to catch breakage reports before
> > > > the backport, but here there were already 3 weeks between the patch was
> > > > submitted and it was backported.
> > > 
> > > No. The patch is wrong. It just _looks_ correct at first glance.
> > 
> > So that's the core of the problem. Stable maintainers cannot be tasked
> > to try to analyse each patch in its finest details to figure whether a
> > maintainer that's expected to be more knowledgeable than them on their
> > driver did something wrong.
> > 
> > Then in my opinion we should encourage *not* to use "Fixes:" on untested
> > patches (untested patches will always happen due to hardware availability
> > or lack of a reliable reproducer).
> > 
> > What about this to try to improve the situation in this specific case ?
> > 
> > Willy
> > 
> > 
> > > From ef646bae2139ba005de78940064c464126c430e6 Mon Sep 17 00:00:00 2001
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Thu, 5 Aug 2021 20:24:30 +0200
> > Subject: docs: process: submitting-patches.rst: recommend against 'Fixes:' if
> >   untested
> > 
> > 'Fixes:' usually is taken as authority and often results in a backport. If
> > a patch couldn't be tested although it looks perfectly valid, better not
> > add this tag to leave a final chance for backporters to ask about the
> > relevance of the backport or to check for future tests.
> > 
> > Cc: Guenter Roeck <linux@roeck-us.net>
> > Signed-off-by: Willy Tarreau <w@1wt.eu>
> > ---
> >   Documentation/process/submitting-patches.rst | 9 +++++++++
> >   1 file changed, 9 insertions(+)
> > 
> > diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
> > index 0852bcf73630..54782b0e2f4c 100644
> > --- a/Documentation/process/submitting-patches.rst
> > +++ b/Documentation/process/submitting-patches.rst
> > @@ -140,6 +140,15 @@ An example call::
> >   	$ git log -1 --pretty=fixes 54a4f0239f2e
> >   	Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed")
> > +Please note that a 'Fixes:' tag will most often result in your patch being
> > +automatically backported to stable branches. If for any reason you could not
> > +test that it really fixes the problem (for example, because the bug is not
> > +reproducible, or because you did not have access to the required hardware
> > +at the time of writing the patch to verify it does not cause regressions),
> > +even if you are absolutely certain of your patch's validity, do not include
> > +a 'Fixes:' tag and instead explain the situation in the commit message in
> > +plain English.
> > +
> 
> I don't think that would be a good idea, First, it would discourage people
> from using Fixes: tags, and second it would not really solve the problem either.
> 
> While I am sure that the patch in question wasn't really tested (after all,
> it broke EC communication on all Mediatek Chromebooks using SPI to communicate
> with the EC), we don't know what the author tested. It may well be (and is quite
> likely) that the author _did_ test the patch, only not the affected code path.
> That means the author may still have added a Fixes: tag in the wrong belief
> to have tested the fix, even with the above restrictions in place.

But this is exactly the principle of a regression, and the first rule that
evreyone learns in programming is that 100% coverage never exists. That's
particularly true in large projects where the skills are distributed over
many people. Everyone faithfully tests the faulty case (i.e. I tested that
it does solve the problem I could reproduce), without a knowledge of plenty
of other cases. I do cause regressions myself when fixing bugs, despite
running plenty of regtests, and I hate this. The only thing I can do in
this case is to communicate about it as soon as detected and produce another
fix as fast as possible.

> No, I think all we can do is to improve test coverage.

Improving test coverage is a way to *reduce* the frequency of regressions,
it *never* eliminates them. And I'd really like that people start to accept
that they are part of a normal development process, as careful as it can
be, and that it's precisely why deploying updates from least sensitive to
most sensitive environments is extremely important, and why bug reports
are critical to the overall stability. There's never a zero-risk software
nor their is any zero-risk upgrade. It's probably convenient to some users
to complain that "the kernel maintainers" broke their system during an
update between two stable releases, but this is just looking at the single
failure in the middle of thousands of fixes and the fact that they maybe
didn't even prepare the necessary rollback plan that must come with *any*
software upgrade. And given that the alternative is to keep all their bugs
unfixed, I find this a bit exagerated sometimes.

I noticed that among people who I've seen complaining about regressions in
-stable that extremely rare are those who faced more than one in a major
branch. Sure some are more painful than others, but that's still an extremely
high success rate.

Now maybe there are *classes* of bugs that could be better detected through
automated testing (new platforms, new workloads etc). But I strongly doubt
it will make whiners complain less given that their use cases are probably
mostly widely covered already and still slip through the rare cracks. And
they are likely not the ones reporting problems here in the first place.

Cheers,
Willy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Regressions in stable releases
  2021-08-06 14:37       ` Sasha Levin
@ 2021-08-06 14:52         ` Willy Tarreau
  0 siblings, 0 replies; 13+ messages in thread
From: Willy Tarreau @ 2021-08-06 14:52 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Guenter Roeck, stable, Greg Kroah-Hartman

Hi Sasha,

On Fri, Aug 06, 2021 at 10:37:47AM -0400, Sasha Levin wrote:
> > Then in my opinion we should encourage *not* to use "Fixes:" on untested
> > patches (untested patches will always happen due to hardware availability
> > or lack of a reliable reproducer).
> > 
> > What about this to try to improve the situation in this specific case ?
> 
> No, please let's not. If there is no testing story behind a buggy patch
> then it'll explode either when we go to the next version, or when we
> pull it into -stable.
> 
> If we avoid taking groups of patches into -stable it'll just mean that
> we end up with a huge amount of issues waiting for us during a version
> upgrade.

I agree with this and that was the point I was explaining as well: someone
has to detect those bugs, and unfortunately if they're so hard to see that
they can't be detected before the users, it has to hit a user.

> Yes, we may be taking bugs in now, but the regression rate (according to
> LWN) is pretty low, and the somewhat linear distribution of those bugs
> throughout our releases makes them managable (to review when they're
> sent out, to find during testing, to bisect if we hit the bug).

I totally agree that they're extremely low. I only faced one in production
in 4 or 5 years now, and even then it was not a true one, it was caused by
a context change that made one local patch silently apply at the wrong place.

> As Guenter points out, the path forward should be to improve our testing
> story.

Yes but we also have to make people understand that it only improves the
situation a little bit and doesn't magically result in zero regression.
Most whiners seem to say "I've met a regression once, that's unacceptable".
This will not change their experience, it will just reduce the number of
whiners who complain about their first bug ever. The amount of effort to
invest in testing to reduce regressions by just 1% can be huge, and at
some point one has to wonder where resources are better assigned.

Again, I tend to think that releasing older releases less often (and with
more patches each time) could reassure some users. It used to happen in
the past when Paul, Ben and I were in charge of older & slower branches,
and it sometimes allowed us to drop one patch and its subsequent revert
from a series. That's still one regression saved whenever it happens.
And this maintains the principle of "older==extremely stable with slower
fixes, newer==very stable with faster fixes".

Cheers,
Willy

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-08-06 14:53 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-05 16:11 Regressions in stable releases Guenter Roeck
2021-08-05 16:19 ` Greg Kroah-Hartman
2021-08-05 17:39   ` Guenter Roeck
2021-08-05 17:43     ` Greg Kroah-Hartman
2021-08-05 19:44       ` Guenter Roeck
2021-08-05 19:51         ` Greg Kroah-Hartman
2021-08-05 16:42 ` Willy Tarreau
2021-08-05 17:29   ` Guenter Roeck
2021-08-05 18:30     ` Willy Tarreau
2021-08-06 14:16       ` Guenter Roeck
2021-08-06 14:42         ` Willy Tarreau
2021-08-06 14:37       ` Sasha Levin
2021-08-06 14:52         ` Willy Tarreau

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.