linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Kernel panic on Google Pixel devices due to regulator patch
       [not found] <CAJRo92+eD9F6Q60yVY2PfwaPWO_8Dts8QwH7mhpJaem7SpLihg@mail.gmail.com>
@ 2019-12-18 11:34 ` Mark Brown
  2019-12-18 12:21   ` Greg KH
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Brown @ 2019-12-18 11:34 UTC (permalink / raw)
  To: Siddharth Kapoor; +Cc: lee.jones, linux-kernel, stable, gregkh

[-- Attachment #1: Type: text/plain, Size: 1934 bytes --]

On Tue, Dec 17, 2019 at 11:51:55PM +0800, Siddharth Kapoor wrote:

> I would like to share a concern with the regulator patch which is part of
> 4.9.196 LTS kernel.

That's an *extremely* old kernel.

> https://lore.kernel.org/lkml/20190904124250.25844-1-broonie@kernel.org/

That's the patch "[PATCH] regulator: Defer init completion for a while
after late_initcall" which defers disabling of idle regulators for a
while.

Please include human readable descriptions of things like commits and
issues being discussed in e-mail in your mails, this makes them much
easier for humans to read especially when they have no internet access.
I do frequently catch up on my mail on flights or while otherwise
travelling so this is even more pressing for me than just being about
making things a bit easier to read.

> We have reverted the patch in Pixel kernels and would like you to look into
> this and consider reverting it upstream as well.

I've got nothing to do with the stable kernels so there's nothing I can
do here, sorry.  However if this is triggering anything it's almost
certainly some kind of timing issue (this code isn't new, it's just
being run a bit later) and is only currently working through luck so I
do strongly recommend trying to figure out the actual problem since it's
liable to come back and bite you later - we did find one buggy driver in
mainline as a result of this change, it's possible you've got another
one.  

Possibly your GPU supplies need to be flagged as always on, possibly
your GPU driver is forgetting to enable some supplies it needs, or
possibly there's a missing always-on constraint on one of the regulators
depending on how the driver expects this to work (if it's a proprietary
driver it shouldn't be using the regulator API itself).  I'm quite
surprised you've not seen any issue before given that the supplies would
still be being disabled earlier.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Kernel panic on Google Pixel devices due to regulator patch
  2019-12-18 11:34 ` Kernel panic on Google Pixel devices due to regulator patch Mark Brown
@ 2019-12-18 12:21   ` Greg KH
  2019-12-18 13:11     ` Mark Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Greg KH @ 2019-12-18 12:21 UTC (permalink / raw)
  To: Mark Brown; +Cc: Siddharth Kapoor, lee.jones, linux-kernel, stable

On Wed, Dec 18, 2019 at 11:34:58AM +0000, Mark Brown wrote:
> On Tue, Dec 17, 2019 at 11:51:55PM +0800, Siddharth Kapoor wrote:
> 
> > I would like to share a concern with the regulator patch which is part of
> > 4.9.196 LTS kernel.
> 
> That's an *extremely* old kernel.

It is, but it's the latest stable kernel (well close to), and your patch
was tagged by you to be backported to here, so if there's a problem with
a stable branch, I want to know about it as I don't want to see
regressions happen in it.

> > https://lore.kernel.org/lkml/20190904124250.25844-1-broonie@kernel.org/
> 
> That's the patch "[PATCH] regulator: Defer init completion for a while
> after late_initcall" which defers disabling of idle regulators for a
> while.
> 
> Please include human readable descriptions of things like commits and
> issues being discussed in e-mail in your mails, this makes them much
> easier for humans to read especially when they have no internet access.
> I do frequently catch up on my mail on flights or while otherwise
> travelling so this is even more pressing for me than just being about
> making things a bit easier to read.
> 
> > We have reverted the patch in Pixel kernels and would like you to look into
> > this and consider reverting it upstream as well.
> 
> I've got nothing to do with the stable kernels so there's nothing I can
> do here, sorry.

Should I revert it everywhere?  This patch reads as it should be fixing
problems, not causing them :)

> However if this is triggering anything it's almost
> certainly some kind of timing issue (this code isn't new, it's just
> being run a bit later) and is only currently working through luck so I
> do strongly recommend trying to figure out the actual problem since it's
> liable to come back and bite you later - we did find one buggy driver in
> mainline as a result of this change, it's possible you've got another
> one.  
> 
> Possibly your GPU supplies need to be flagged as always on, possibly
> your GPU driver is forgetting to enable some supplies it needs, or
> possibly there's a missing always-on constraint on one of the regulators
> depending on how the driver expects this to work (if it's a proprietary
> driver it shouldn't be using the regulator API itself).  I'm quite
> surprised you've not seen any issue before given that the supplies would
> still be being disabled earlier.

Timing "luck" is probably something we shouldn't be messing with in
stable kernels.  How about I revert this for the 4.14 and older releases
and let new devices deal with the timing issues when they are brought up
on new hardware?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Kernel panic on Google Pixel devices due to regulator patch
  2019-12-18 12:21   ` Greg KH
@ 2019-12-18 13:11     ` Mark Brown
  2019-12-18 14:22       ` Greg KH
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Brown @ 2019-12-18 13:11 UTC (permalink / raw)
  To: Greg KH; +Cc: Siddharth Kapoor, lee.jones, linux-kernel, stable

[-- Attachment #1: Type: text/plain, Size: 2786 bytes --]

On Wed, Dec 18, 2019 at 01:21:57PM +0100, Greg KH wrote:
> On Wed, Dec 18, 2019 at 11:34:58AM +0000, Mark Brown wrote:
> > On Tue, Dec 17, 2019 at 11:51:55PM +0800, Siddharth Kapoor wrote:

> > > I would like to share a concern with the regulator patch which is part of
> > > 4.9.196 LTS kernel.

> > That's an *extremely* old kernel.

> It is, but it's the latest stable kernel (well close to), and your patch
> was tagged by you to be backported to here, so if there's a problem with
> a stable branch, I want to know about it as I don't want to see
> regressions happen in it.

I don't track what's in older stable kernels, it wanted to go back at
least one kernel revision but the issue has been around since forever.

> > I've got nothing to do with the stable kernels so there's nothing I can
> > do here, sorry.

> Should I revert it everywhere?  This patch reads as it should be fixing
> problems, not causing them :)

The main targets were whatever Debian and Ubuntu are shipping (and to a
lesser extent SuSE or RHEL but they don't use stable directly), it's
less relevant to anything that only gets used on embedded stuff.  It's
right on the knife edge of what I'd backport but since that's way less
enthusiastic than stable is in general these days.

> > Possibly your GPU supplies need to be flagged as always on, possibly
> > your GPU driver is forgetting to enable some supplies it needs, or
> > possibly there's a missing always-on constraint on one of the regulators
> > depending on how the driver expects this to work (if it's a proprietary
> > driver it shouldn't be using the regulator API itself).  I'm quite
> > surprised you've not seen any issue before given that the supplies would
> > still be being disabled earlier.

> Timing "luck" is probably something we shouldn't be messing with in
> stable kernels.  How about I revert this for the 4.14 and older releases
> and let new devices deal with the timing issues when they are brought up
> on new hardware?

To be clear this is more a straight up bug in their stuff than the sort
of thing you'd normally think of as a race condition, we're talking
about moving the timing by 30 seconds here.  The case that we saw
already was just a clear and obvious bug that was made more visible (the
driver was using the wrong name for a supply so lookups were always
failing but some sequence of events meant it didn't produce big runtime
failures).

If you don't want to be messing with timing luck then you probably want
to be having a look at what Sasha's bot is doing, it's picking up a lot
of things that are *well* into this sort of territory (and the bad
interactions with out of tree code territory).  I personally would not
be using stable these days if I wasn't prepared to be digging into
something like this.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Kernel panic on Google Pixel devices due to regulator patch
  2019-12-18 13:11     ` Mark Brown
@ 2019-12-18 14:22       ` Greg KH
  2019-12-18 16:18         ` Mark Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Greg KH @ 2019-12-18 14:22 UTC (permalink / raw)
  To: Mark Brown; +Cc: Siddharth Kapoor, lee.jones, linux-kernel, stable

On Wed, Dec 18, 2019 at 01:11:14PM +0000, Mark Brown wrote:
> On Wed, Dec 18, 2019 at 01:21:57PM +0100, Greg KH wrote:
> > On Wed, Dec 18, 2019 at 11:34:58AM +0000, Mark Brown wrote:
> > > On Tue, Dec 17, 2019 at 11:51:55PM +0800, Siddharth Kapoor wrote:
> 
> > > > I would like to share a concern with the regulator patch which is part of
> > > > 4.9.196 LTS kernel.
> 
> > > That's an *extremely* old kernel.
> 
> > It is, but it's the latest stable kernel (well close to), and your patch
> > was tagged by you to be backported to here, so if there's a problem with
> > a stable branch, I want to know about it as I don't want to see
> > regressions happen in it.
> 
> I don't track what's in older stable kernels, it wanted to go back at
> least one kernel revision but the issue has been around since forever.

Ok, you can always mark patches that way if you want to :)

> > > I've got nothing to do with the stable kernels so there's nothing I can
> > > do here, sorry.
> 
> > Should I revert it everywhere?  This patch reads as it should be fixing
> > problems, not causing them :)
> 
> The main targets were whatever Debian and Ubuntu are shipping (and to a
> lesser extent SuSE or RHEL but they don't use stable directly), it's
> less relevant to anything that only gets used on embedded stuff.  It's
> right on the knife edge of what I'd backport but since that's way less
> enthusiastic than stable is in general these days.

I've reverted it now from 4.14.y and 4.9.y.

> > > Possibly your GPU supplies need to be flagged as always on, possibly
> > > your GPU driver is forgetting to enable some supplies it needs, or
> > > possibly there's a missing always-on constraint on one of the regulators
> > > depending on how the driver expects this to work (if it's a proprietary
> > > driver it shouldn't be using the regulator API itself).  I'm quite
> > > surprised you've not seen any issue before given that the supplies would
> > > still be being disabled earlier.
> 
> > Timing "luck" is probably something we shouldn't be messing with in
> > stable kernels.  How about I revert this for the 4.14 and older releases
> > and let new devices deal with the timing issues when they are brought up
> > on new hardware?
> 
> To be clear this is more a straight up bug in their stuff than the sort
> of thing you'd normally think of as a race condition, we're talking
> about moving the timing by 30 seconds here.  The case that we saw
> already was just a clear and obvious bug that was made more visible (the
> driver was using the wrong name for a supply so lookups were always
> failing but some sequence of events meant it didn't produce big runtime
> failures).
> 
> If you don't want to be messing with timing luck then you probably want
> to be having a look at what Sasha's bot is doing, it's picking up a lot
> of things that are *well* into this sort of territory (and the bad
> interactions with out of tree code territory).  I personally would not
> be using stable these days if I wasn't prepared to be digging into
> something like this.

I watch what his bot is doing, and we have tons of testing happening as
well, which is reflected by the fact that THIS WAS CAUGHT HERE.  This is
a sign that things are working, it's just that some SoC trees are slower
than mainline by a few months, and that's fine.  It's worlds better than
the SoC trees that are no where close to mainline, and as such, totally
insecure :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Kernel panic on Google Pixel devices due to regulator patch
  2019-12-18 14:22       ` Greg KH
@ 2019-12-18 16:18         ` Mark Brown
  2019-12-18 16:24           ` Greg KH
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Brown @ 2019-12-18 16:18 UTC (permalink / raw)
  To: Greg KH; +Cc: Siddharth Kapoor, lee.jones, linux-kernel, stable

[-- Attachment #1: Type: text/plain, Size: 2213 bytes --]

On Wed, Dec 18, 2019 at 03:22:19PM +0100, Greg KH wrote:
> On Wed, Dec 18, 2019 at 01:11:14PM +0000, Mark Brown wrote:
> > On Wed, Dec 18, 2019 at 01:21:57PM +0100, Greg KH wrote:

> > > It is, but it's the latest stable kernel (well close to), and your patch
> > > was tagged by you to be backported to here, so if there's a problem with
> > > a stable branch, I want to know about it as I don't want to see
> > > regressions happen in it.

> > I don't track what's in older stable kernels, it wanted to go back at
> > least one kernel revision but the issue has been around since forever.

> Ok, you can always mark patches that way if you want to :)

That's what a tag to stable with no particular revision attached to it
is isn't it?

> > If you don't want to be messing with timing luck then you probably want
> > to be having a look at what Sasha's bot is doing, it's picking up a lot
> > of things that are *well* into this sort of territory (and the bad
> > interactions with out of tree code territory).  I personally would not
> > be using stable these days if I wasn't prepared to be digging into
> > something like this.

> I watch what his bot is doing, and we have tons of testing happening as
> well, which is reflected by the fact that THIS WAS CAUGHT HERE.  This is

You don't have anywhere near the level of testing that you'd need to
cover what the bot is trying to pull in, the subsystem and driver
coverage is extremely thin relative to the enthusiasm with which things
are being picked up.  All the pushback I see in review is on me for 
being conservative about what gets pulled into stable and worrying about
interactions with out of tree code.

> a sign that things are working, it's just that some SoC trees are slower
> than mainline by a few months, and that's fine.  It's worlds better than
> the SoC trees that are no where close to mainline, and as such, totally
> insecure :)

What you appear to have caught here is an interaction with some
unreviewed vendor code - how much of that is going on in the vendor
trees you're not testing?  If we want to encourage people to pull in
stable we should be paying attention to that sort of stuff.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Kernel panic on Google Pixel devices due to regulator patch
  2019-12-18 16:18         ` Mark Brown
@ 2019-12-18 16:24           ` Greg KH
  2019-12-18 17:03             ` Mark Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Greg KH @ 2019-12-18 16:24 UTC (permalink / raw)
  To: Mark Brown; +Cc: Siddharth Kapoor, lee.jones, linux-kernel, stable

On Wed, Dec 18, 2019 at 04:18:06PM +0000, Mark Brown wrote:
> On Wed, Dec 18, 2019 at 03:22:19PM +0100, Greg KH wrote:
> > On Wed, Dec 18, 2019 at 01:11:14PM +0000, Mark Brown wrote:
> > > On Wed, Dec 18, 2019 at 01:21:57PM +0100, Greg KH wrote:
> 
> > > > It is, but it's the latest stable kernel (well close to), and your patch
> > > > was tagged by you to be backported to here, so if there's a problem with
> > > > a stable branch, I want to know about it as I don't want to see
> > > > regressions happen in it.
> 
> > > I don't track what's in older stable kernels, it wanted to go back at
> > > least one kernel revision but the issue has been around since forever.
> 
> > Ok, you can always mark patches that way if you want to :)
> 
> That's what a tag to stable with no particular revision attached to it
> is isn't it?

No, that means "drag it as far back as Greg can easily do it" :)

> > > If you don't want to be messing with timing luck then you probably want
> > > to be having a look at what Sasha's bot is doing, it's picking up a lot
> > > of things that are *well* into this sort of territory (and the bad
> > > interactions with out of tree code territory).  I personally would not
> > > be using stable these days if I wasn't prepared to be digging into
> > > something like this.
> 
> > I watch what his bot is doing, and we have tons of testing happening as
> > well, which is reflected by the fact that THIS WAS CAUGHT HERE.  This is
> 
> You don't have anywhere near the level of testing that you'd need to
> cover what the bot is trying to pull in, the subsystem and driver
> coverage is extremely thin relative to the enthusiasm with which things
> are being picked up.  All the pushback I see in review is on me for 
> being conservative about what gets pulled into stable and worrying about
> interactions with out of tree code.
> 
> > a sign that things are working, it's just that some SoC trees are slower
> > than mainline by a few months, and that's fine.  It's worlds better than
> > the SoC trees that are no where close to mainline, and as such, totally
> > insecure :)
> 
> What you appear to have caught here is an interaction with some
> unreviewed vendor code - how much of that is going on in the vendor
> trees you're not testing?  If we want to encourage people to pull in
> stable we should be paying attention to that sort of stuff.

I get weekly merge reports from all of the major SoC vendors when they
pull these releases into their tree and run through their full suite of
tests.  So I am paying attention to this type of thing.

What I need to figure out here is what is going wrong and why the SoC's
testing did not catch this.  That's going to take a bit longer...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Kernel panic on Google Pixel devices due to regulator patch
  2019-12-18 16:24           ` Greg KH
@ 2019-12-18 17:03             ` Mark Brown
  0 siblings, 0 replies; 7+ messages in thread
From: Mark Brown @ 2019-12-18 17:03 UTC (permalink / raw)
  To: Greg KH; +Cc: Siddharth Kapoor, lee.jones, linux-kernel, stable

[-- Attachment #1: Type: text/plain, Size: 1119 bytes --]

On Wed, Dec 18, 2019 at 05:24:24PM +0100, Greg KH wrote:
> On Wed, Dec 18, 2019 at 04:18:06PM +0000, Mark Brown wrote:

> > What you appear to have caught here is an interaction with some
> > unreviewed vendor code - how much of that is going on in the vendor
> > trees you're not testing?  If we want to encourage people to pull in
> > stable we should be paying attention to that sort of stuff.

> I get weekly merge reports from all of the major SoC vendors when they
> pull these releases into their tree and run through their full suite of
> tests.  So I am paying attention to this type of thing.

Are you sure you're not just definining major SoC vendors as being
people who send you reports here?  :P  In any case, that's only going to
cover a limited subset of potential drivers and subsystems, devices that
don't appear on reference designs aren't going to get any coverage at
all that way for example.

> What I need to figure out here is what is going wrong and why the SoC's
> testing did not catch this.  That's going to take a bit longer...

There's a reasonable chance this is something board specific.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-12-18 17:03 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAJRo92+eD9F6Q60yVY2PfwaPWO_8Dts8QwH7mhpJaem7SpLihg@mail.gmail.com>
2019-12-18 11:34 ` Kernel panic on Google Pixel devices due to regulator patch Mark Brown
2019-12-18 12:21   ` Greg KH
2019-12-18 13:11     ` Mark Brown
2019-12-18 14:22       ` Greg KH
2019-12-18 16:18         ` Mark Brown
2019-12-18 16:24           ` Greg KH
2019-12-18 17:03             ` Mark Brown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).