linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* short-circuit and over-current IRQs
@ 2021-01-27 12:01 Vaittinen, Matti
  2021-01-27 12:27 ` Mark Brown
  0 siblings, 1 reply; 12+ messages in thread
From: Vaittinen, Matti @ 2021-01-27 12:01 UTC (permalink / raw)
  Cc: lgirdwood, angelogioacchino.delregno, broonie, linux-kernel

Hi dee Ho peeps,

I just noticed the
390af53e04114f790d60b63802a4de9d815ade03 ("regulator: qcom-labibb:
Implement short-circuit and over-current IRQs")

in regulator tree. No worries - I haven't hit any problem with it :]
I've been working with ROHM BD9576MUF - which implements warning IRQs
for over/under-voltage and high temperature. (Short-circuit detection
does also generate IRQ but it also automatically shuts down the
regulators by HW). BD9576MUF does also keep the IRQ activated for as
long as the 'troubling condition' stays active in HW. This is why
BD9576MUF driver does also need to implement some logic to disable IRQ
for some time period just to keep CPU out of the IRQ loop.

https://lore.kernel.org/lkml/4d725ea6e9261f22d4c808b39013baf479f252dc.1611324968.git.matti.vaittinen@fi.rohmeurope.com/

My implementation lacks of the retry counter and regulator shut-down
(BD9576 has also OCP - "Over Current Protection/OVP - "Over Voltage
Protection"/TSD - "Thermal Shutdown" features so that HW will shut down
outputs if things get worse so those are not so crucial).

I've also had some hard time when trying to test this - I guess others
have faced similar problems too...

Anyways - I was wondering if this is common thing amongst many PMICs?
If yes - then, perhaps some generally useful regulator helper could be
added to help implementing the IRQ disabling + scheduling worker to
check status and re-enable IRQs? I think it *might* save some time in
the future - and help making same mistakes many times :]

I *might* be able to find some time to draft out some code for this
(not promising, I am currently having few patch series pending but I
could at least try) - but I am not able to do thorough testing with
BD9576... Any possibility to co-operate? Maybe we could try this also
onqcom-labibb? Or do you think this is overall just a dead idea?

Best Regards
	Matti Vaittinen


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: short-circuit and over-current IRQs
  2021-01-27 12:01 short-circuit and over-current IRQs Vaittinen, Matti
@ 2021-01-27 12:27 ` Mark Brown
  2021-01-27 12:56   ` Matti Vaittinen
  0 siblings, 1 reply; 12+ messages in thread
From: Mark Brown @ 2021-01-27 12:27 UTC (permalink / raw)
  To: Vaittinen, Matti; +Cc: lgirdwood, angelogioacchino.delregno, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 693 bytes --]

On Wed, Jan 27, 2021 at 12:01:55PM +0000, Vaittinen, Matti wrote:

> Anyways - I was wondering if this is common thing amongst many PMICs?
> If yes - then, perhaps some generally useful regulator helper could be
> added to help implementing the IRQ disabling + scheduling worker to
> check status and re-enable IRQs? I think it *might* save some time in
> the future - and help making same mistakes many times :]

If we've got two that's enough for a helper.  TBH I'm a bit surprised
that people are implementing hardware that leaves the outputs enabled
when it detects this sort of error, it's something that's usually an
emergency that needs shutting off quickly to prevent hardware damage.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 484 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: short-circuit and over-current IRQs
  2021-01-27 12:27 ` Mark Brown
@ 2021-01-27 12:56   ` Matti Vaittinen
  2021-01-27 14:34     ` AngeloGioacchino Del Regno
  0 siblings, 1 reply; 12+ messages in thread
From: Matti Vaittinen @ 2021-01-27 12:56 UTC (permalink / raw)
  To: Mark Brown; +Cc: lgirdwood, angelogioacchino.delregno, linux-kernel

Hello Mark,

Nice to hear from you. :)

On Wed, 2021-01-27 at 12:27 +0000, Mark Brown wrote:
> On Wed, Jan 27, 2021 at 12:01:55PM +0000, Vaittinen, Matti wrote:
> 
> > Anyways - I was wondering if this is common thing amongst many
> > PMICs?
> > If yes - then, perhaps some generally useful regulator helper could
> > be
> > added to help implementing the IRQ disabling + scheduling worker to
> > check status and re-enable IRQs? I think it *might* save some time
> > in
> > the future - and help making same mistakes many times :]
> 
> If we've got two that's enough for a helper.  TBH I'm a bit surprised
> that people are implementing hardware that leaves the outputs enabled
> when it detects this sort of error, it's something that's usually an
> emergency that needs shutting off quickly to prevent hardware damage.

I can only speak for BD9576MUF - which has two limits for monitored
entity (temperature/voltage). One limit being 'warning' limit (or
'detection' as data-sheet says), the other being 'protection' limit.

I don't know guys who designed HW - I am located to a remote spot on
the other side of the world and on top of that I am the odd "SW guy" so
it's better to keep me out of the HW R&D decisions and especially the
customers ;) - but I *guess* the idea has been that consumer driver(s)
could do some 'recovery action' at 'warning' limit (which might make
sense for example when temperature is increased to 'high' but not yet
'damaging' - I guess there is something that can be done with
over/under voltages too...) and only kill the power if that doesn't
help and situation (with temperature/voltage) gets worse.

What I don't like is the fact that HW keeps IRQ asserted instead of
having a state machine which would only generate IRQ when condition
changes + status register to read current condition.

I will see if I can cook-up something decent - but as I said, I would
appreciate any/all testing if I get patch crafted :)

Best Regards
	Matti


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: short-circuit and over-current IRQs
  2021-01-27 12:56   ` Matti Vaittinen
@ 2021-01-27 14:34     ` AngeloGioacchino Del Regno
  2021-01-27 14:42       ` Vaittinen, Matti
  2021-01-27 16:32       ` Mark Brown
  0 siblings, 2 replies; 12+ messages in thread
From: AngeloGioacchino Del Regno @ 2021-01-27 14:34 UTC (permalink / raw)
  To: matti.vaittinen, Mark Brown; +Cc: lgirdwood, linux-kernel

Il 27/01/21 13:56, Matti Vaittinen ha scritto:
> Hello Mark,
> 

Hey Matti, hey Mark!

> Nice to hear from you. :)
> 
> On Wed, 2021-01-27 at 12:27 +0000, Mark Brown wrote:
>> On Wed, Jan 27, 2021 at 12:01:55PM +0000, Vaittinen, Matti wrote:
>>
>>> Anyways - I was wondering if this is common thing amongst many
>>> PMICs?
>>> If yes - then, perhaps some generally useful regulator helper could
>>> be
>>> added to help implementing the IRQ disabling + scheduling worker to
>>> check status and re-enable IRQs? I think it *might* save some time
>>> in
>>> the future - and help making same mistakes many times :]
>>

It's probably worth it if more drivers need that: sometimes there is
HW supporting this feature and it doesn't get done because of the usual
lack of time.

Providing a helper would probably help.

>> If we've got two that's enough for a helper.  TBH I'm a bit surprised
>> that people are implementing hardware that leaves the outputs enabled
>> when it detects this sort of error, it's something that's usually an
>> emergency that needs shutting off quickly to prevent hardware damage.
> 
> I can only speak for BD9576MUF - which has two limits for monitored
> entity (temperature/voltage). One limit being 'warning' limit (or
> 'detection' as data-sheet says), the other being 'protection' limit.
> 
> I don't know guys who designed HW - I am located to a remote spot on
> the other side of the world and on top of that I am the odd "SW guy" so
> it's better to keep me out of the HW R&D decisions and especially the
> customers ;) - but I *guess* the idea has been that consumer driver(s)
> could do some 'recovery action' at 'warning' limit (which might make
> sense for example when temperature is increased to 'high' but not yet
> 'damaging' - I guess there is something that can be done with
> over/under voltages too...) and only kill the power if that doesn't
> help and situation (with temperature/voltage) gets worse.

I would tend to agree with you here, Matti. Also from what I understand,
the wanted outcome is software handling a possibly temporary issue with
you charging caps, external IC initialization using (expectedly) much
more power than needed before stabilizing, and eventually handling other
"real" issues for which there is a solution that may not even include
disabling the regulator itself, but some other handling on the connected
device driver.

Though, Mark: for example, on qcom labibb, there's "PBS" that is killing
the regulators on short-circuit condition and as you see, handling that
is a little trickier compared to the over-current one, where there is no
such auto-magic-thing...
.... I wouldn't know if it'd be a good idea to have a system like qcom's
PBS everywhere.
For the sake of protecting HW "paranoidly" though.. maybe :))

> 
> What I don't like is the fact that HW keeps IRQ asserted instead of
> having a state machine which would only generate IRQ when condition
> changes + status register to read current condition.
> 

Unless I've misunderstood this, you're describing a *very* common
behavior across regulators and other kinds of devices, but that's
not a problem. IMO, it's a solution (to quirky MCUs/SoCs/CPUs/blah).

Of course reading a register means that you waste more time before
deciding to "press the red button", but even on a slow bus like I2C,
it's anyway not reaching the point where that wasted time is relevant.
At least, in many cases.

> I will see if I can cook-up something decent - but as I said, I would
> appreciate any/all testing if I get patch crafted :)

I develop this stuff in my spare time: I can't make big promises, but
I can tell you that I will try to test your proposal on qcom-labibb as
soon as I will be able to.

Yours,
--Angelo

> 
> Best Regards
> 	Matti
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: short-circuit and over-current IRQs
  2021-01-27 14:34     ` AngeloGioacchino Del Regno
@ 2021-01-27 14:42       ` Vaittinen, Matti
  2021-01-27 16:32       ` Mark Brown
  1 sibling, 0 replies; 12+ messages in thread
From: Vaittinen, Matti @ 2021-01-27 14:42 UTC (permalink / raw)
  To: angelogioacchino.delregno, broonie; +Cc: lgirdwood, linux-kernel

On Wed, 2021-01-27 at 15:34 +0100, AngeloGioacchino Del Regno wrote:
> Il 27/01/21 13:56, Matti Vaittinen ha scritto:
> > Hello Mark,
> > 
> 
> Hey Matti, hey Mark!
> 
> > Nice to hear from you. :)
> > 
> > On Wed, 2021-01-27 at 12:27 +0000, Mark Brown wrote:
> > > On Wed, Jan 27, 2021 at 12:01:55PM +0000, Vaittinen, Matti wrote:
> 
> > I will see if I can cook-up something decent - but as I said, I
> > would
> > appreciate any/all testing if I get patch crafted :)
> 
> I develop this stuff in my spare time: I can't make big promises, but
> I can tell you that I will try to test your proposal on qcom-labibb
> as
> soon as I will be able to.

Thanks a Lot! This is more than fair :) I'll CC you when (if) I have
this drafted!

Best Regards
--Matti

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: short-circuit and over-current IRQs
  2021-01-27 14:34     ` AngeloGioacchino Del Regno
  2021-01-27 14:42       ` Vaittinen, Matti
@ 2021-01-27 16:32       ` Mark Brown
  2021-01-28  9:23         ` Vaittinen, Matti
  1 sibling, 1 reply; 12+ messages in thread
From: Mark Brown @ 2021-01-27 16:32 UTC (permalink / raw)
  To: AngeloGioacchino Del Regno; +Cc: matti.vaittinen, lgirdwood, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1898 bytes --]

On Wed, Jan 27, 2021 at 03:34:46PM +0100, AngeloGioacchino Del Regno wrote:
> Il 27/01/21 13:56, Matti Vaittinen ha scritto:

> > I can only speak for BD9576MUF - which has two limits for monitored
> > entity (temperature/voltage). One limit being 'warning' limit (or
> > 'detection' as data-sheet says), the other being 'protection' limit.

> I would tend to agree with you here, Matti. Also from what I understand,
> the wanted outcome is software handling a possibly temporary issue with
> you charging caps, external IC initialization using (expectedly) much
> more power than needed before stabilizing, and eventually handling other
> "real" issues for which there is a solution that may not even include
> disabling the regulator itself, but some other handling on the connected
> device driver.

Note that the events the API currently has are expected to be for the
actual error conditions, not for the warning ones - indicating that the
voltage is out of regulation for example.  If you're supporting warning
notifications as well you'll want to add more events (or possibly
another flag to munge in with the existing events to indicate that it's
a warning rather than an error).

> Though, Mark: for example, on qcom labibb, there's "PBS" that is killing
> the regulators on short-circuit condition and as you see, handling that
> is a little trickier compared to the over-current one, where there is no
> such auto-magic-thing...
> .... I wouldn't know if it'd be a good idea to have a system like qcom's
> PBS everywhere.
> For the sake of protecting HW "paranoidly" though.. maybe :))

Well, if these things are kicking in the hardware is in serious trouble
anyway so it's unclear what the system would be likely to do in
software, and also unclear how safe it is to rely on software to be able
to take that action given that it let things get into such a bad state
in the first place.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: short-circuit and over-current IRQs
  2021-01-27 16:32       ` Mark Brown
@ 2021-01-28  9:23         ` Vaittinen, Matti
  2021-01-28 12:10           ` Mark Brown
  0 siblings, 1 reply; 12+ messages in thread
From: Vaittinen, Matti @ 2021-01-28  9:23 UTC (permalink / raw)
  To: angelogioacchino.delregno, broonie; +Cc: lgirdwood, linux-kernel


On Wed, 2021-01-27 at 16:32 +0000, Mark Brown wrote:
> On Wed, Jan 27, 2021 at 03:34:46PM +0100, AngeloGioacchino Del Regno
> wrote:
> > Il 27/01/21 13:56, Matti Vaittinen ha scritto:
> > > I can only speak for BD9576MUF - which has two limits for
> > > monitored
> > > entity (temperature/voltage). One limit being 'warning' limit (or
> > > 'detection' as data-sheet says), the other being 'protection'
> > > limit.
> > I would tend to agree with you here, Matti. Also from what I
> > understand,
> > the wanted outcome is software handling a possibly temporary issue
> > with
> > you charging caps, external IC initialization using (expectedly)
> > much
> > more power than needed before stabilizing, and eventually handling
> > other
> > "real" issues for which there is a solution that may not even
> > include
> > disabling the regulator itself, but some other handling on the
> > connected
> > device driver.
> 
> Note that the events the API currently has are expected to be for the
> actual error conditions, not for the warning ones - indicating that
> the
> voltage is out of regulation for example.

I am unsure how to interpret this. What is the criteria of issue being
an error/warning. When I was talking about warning I meant that the
issue which is detected is unexpected and abnormal (error?) - but might
still be recoverable (warning?). I understand the regulator framework
must not signal same events for different purposes - but I don't really
know what the current events are used for - I am grateful for any
guidance!

So, I would like to have an event which the driver can signal to
consumers so that the consumer drivers can try to recover problem W/O
shutting down the SoC.

I try to explain what the BD9576 does and how I see it - I hope you can
then tell me if I am misusing existing events/errors. Sorry for longish
explanation...

The patch I sent did just use existing flags as follows:

REGULATOR_EVENT_UNDER_VOLTAGE:

Sent when PMIC informs regulator output being under the target. Limit
for this is configurable for each output - from roughly at 0.9*(set
Vout) up to 0.99*(set Vout). The detection can also be disabled.

I think that if this is enabled for system, then this should be
regarded as "error condition" - but HW does not shut down. Furthermore,
BD9576 does NOT have "protection" limit for output dropping. It does
have a protection limit for regulator input dropping though - and PMIC
will forcibly shut down if voltage input of a regulator drops below
certain value.

What action the system can take when output voltage drops depends on
system and is not really up to the regulator driver to decide. But is
this notification correct? (I thought so just judging the name). What I
thought is that it would be nice to inform consumers about the detected
under-voltage (as HW supports this). In addition to this notification I
added a 'state flag' in driver so that if someone calls the
regulator_get_error we do return REGULATOR_ERROR_UNDER_VOLTAGE. Are
these correct flags?

REGULATOR_EVENT_REGULATION_OUT:

Current patch send this when PMIC informs regulator output being OVER
the set voltage. PMIC supports similar limit configuration / feature
disabling as with UVD. The over voltage limit can be set roughly from
1.01*(set Vout) to 1.1*(set Vout).

PMIC has also protection limits for regulator outputs so that if output
is roughly 1.2*(set Vout) the PMIC will shut down.

Again, I think this is an error condition but what action system
can/should take if output voltage is exceeding the limit is not known
by me.

I used same regulator_get_error logic and
REGULATOR_ERROR_REGULATION_OUT is returned by regulator_get_error when
condition is not cleared.

Again - I would appreciate knowledge about if this is correct flag.

REGULATOR_EVENT_OVER_TEMP:

Sent when PMIC gives us "thermal warning" IRQ. The PMIC has thermal
shut-down point 175 C when it will immediately cut the power from
system to protect it from damaging. The thermal warning IRQ is firing
at 30 C lower temperature and HW performs no other action here. AFAIR
the temperature limit can not be adjusted.

I think this is again an error condition - kind of last resort for the
system to do something before PMIC shuts it down. I don't think this is
intended to be any 'normal temperature regulation' feature. What system
can do here (turn off some HW? Reduce clocks? maxx fans? Shut down
graphichs block and send warning to display using TCON image overlay?
Forcibly close the panorama windows to cut sunlight?) depends on
system.

Here I also added REGULATOR_ERROR_OVER_TEMP to be returned by
regulator_get_error.

Finally, for one of the outputs we have "over current warning"
interrupt and over current protection. The limit for OCW is
configurable (data-sheet speaks of milliVOLTS here so maybe shunt is
used?) and it can also be disabled. Protection is again not
configurable and will shut down the outputs immediately.

I think it is again up to system designer to decide whether the OCW
interrupt should be used to inform error condition. I used
REGULATOR_EVENT_OVER_CURRENT and REGULATOR_ERROR_OVER_CURRENT
respectively. 


>   If you're supporting warning
> notifications as well you'll want to add more events (or possibly
> another flag to munge in with the existing events to indicate that
> it's
> a warning rather than an error).

As I said, I need some help here to decide if these events are what
existing notification flags are intended to be used for or if I should
add new notification(s)/flags? I think that from user perspective it
boils down to question if there is some different recovery actions to
be taken? In the case of BD9576MUF I don't see any other meaningfull
error events because, as I said, when the "protection" level is reached
the IRQs can't really be handled (because power from all outputs is
shut. Unless another processor which is not being powered by the BD9576
is controlling it - and I don't think this would be the case here).

> 
> > Though, Mark: for example, on qcom labibb, there's "PBS" that is
> > killing
> > the regulators on short-circuit condition and as you see, handling
> > that
> > is a little trickier compared to the over-current one, where there
> > is no
> > such auto-magic-thing...
> > .... I wouldn't know if it'd be a good idea to have a system like
> > qcom's
> > PBS everywhere.
> > For the sake of protecting HW "paranoidly" though.. maybe :))
> 
> Well, if these things are kicking in the hardware is in serious
> trouble
> anyway so it's unclear what the system would be likely to do in
> software, and also unclear how safe it is to rely on software to be
> able
> to take that action given that it let things get into such a bad
> state
> in the first place.

Actually, bear with me but I am unsure why we have these notifications
if we don't expect SW to be able to do anything? Wouldn't the panic
print be all that is needed then? I think that setups which have dual
limits (one for initiating potential SW recovery - other for HW to
forcing protection) actually make sense. So does implementing notifiers
/ error statuses for events where SW recovery is potentially helpful.
But whether the existing event notifications / error flags are correct
for these is something I can't decide :) Here I ask guidance for Mark &
others who know what is the idea behind existing error-flags/events.

Best Regards
	Matti Vaittinen




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: short-circuit and over-current IRQs
  2021-01-28  9:23         ` Vaittinen, Matti
@ 2021-01-28 12:10           ` Mark Brown
  2021-01-28 12:49             ` Vaittinen, Matti
       [not found]             ` <a89bf6f0e6c1e4b9afe980908b7e36b70b304a96.camel@fi.rohmeurope.com>
  0 siblings, 2 replies; 12+ messages in thread
From: Mark Brown @ 2021-01-28 12:10 UTC (permalink / raw)
  To: Vaittinen, Matti; +Cc: angelogioacchino.delregno, lgirdwood, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3163 bytes --]

On Thu, Jan 28, 2021 at 09:23:08AM +0000, Vaittinen, Matti wrote:
> On Wed, 2021-01-27 at 16:32 +0000, Mark Brown wrote:

> > Note that the events the API currently has are expected to be for the
> > actual error conditions, not for the warning ones - indicating that
> > the
> > voltage is out of regulation for example.

> I am unsure how to interpret this. What is the criteria of issue being
> an error/warning. When I was talking about warning I meant that the
> issue which is detected is unexpected and abnormal (error?) - but might
> still be recoverable (warning?). I understand the regulator framework
> must not signal same events for different purposes - but I don't really
> know what the current events are used for - I am grateful for any
> guidance!

What the majority of hardware interrupts on is situations where things
have already gone out of spec and there are actual problems with the
output - for example with current limiting there's often an actual
limiter in there so the regulator simply won't supply any more current
than is configured.  With a warning everything is still working fine but
getting close to not doing so.

> > Well, if these things are kicking in the hardware is in serious
> > trouble
> > anyway so it's unclear what the system would be likely to do in
> > software, and also unclear how safe it is to rely on software to be
> > able
> > to take that action given that it let things get into such a bad
> > state
> > in the first place.

> Actually, bear with me but I am unsure why we have these notifications
> if we don't expect SW to be able to do anything? Wouldn't the panic
> print be all that is needed then? I think that setups which have dual

You'll notice that there aren't any actual users of this stuff in tree
at the minute - people don't generally put much effort into software
recovery as they're not expecting to be anywhere near limiting in normal
operation.  What I'd expect people to do where they do implement
handling is something like shutting down all other supplies on the
device, possibly also trying to shut down the system as a whole.  Things
more about preventing physical damage rather than being part of the
normal operation of the system.

For thermal issues systems generally try to apply software limits well
before an individual component starts flagging things up with an
interrupt, the limits that devices have are generally super high and
often there'll be issues at a system level (eg, a case getting unusably
hot) earlier and it can take a while for responses to have an impact.

> limits (one for initiating potential SW recovery - other for HW to
> forcing protection) actually make sense. So does implementing notifiers
> / error statuses for events where SW recovery is potentially helpful.
> But whether the existing event notifications / error flags are correct
> for these is something I can't decide :) Here I ask guidance for Mark &
> others who know what is the idea behind existing error-flags/events.

It's not that we shouldn't implement support for warnings, it's that
they're not the common case for hardware and so won't line up with
behaviour for other users.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: short-circuit and over-current IRQs
  2021-01-28 12:10           ` Mark Brown
@ 2021-01-28 12:49             ` Vaittinen, Matti
       [not found]             ` <a89bf6f0e6c1e4b9afe980908b7e36b70b304a96.camel@fi.rohmeurope.com>
  1 sibling, 0 replies; 12+ messages in thread
From: Vaittinen, Matti @ 2021-01-28 12:49 UTC (permalink / raw)
  To: broonie; +Cc: lgirdwood, angelogioacchino.delregno, linux-kernel


On Thu, 2021-01-28 at 12:10 +0000, Mark Brown wrote:
> On Thu, Jan 28, 2021 at 09:23:08AM +0000, Vaittinen, Matti wrote:
> > On Wed, 2021-01-27 at 16:32 +0000, Mark Brown wrote:
> > > Note that the events the API currently has are expected to be for
> > > the
> > > actual error conditions, not for the warning ones - indicating
> > > that
> > > the
> > > voltage is out of regulation for example.
> > I am unsure how to interpret this. What is the criteria of issue
> > being
> > an error/warning. When I was talking about warning I meant that the
> > issue which is detected is unexpected and abnormal (error?) - but
> > might
> > still be recoverable (warning?). I understand the regulator
> > framework
> > must not signal same events for different purposes - but I don't
> > really
> > know what the current events are used for - I am grateful for any
> > guidance!
> 
> What the majority of hardware interrupts on is situations where
> things
> have already gone out of spec and there are actual problems with the
> output - for example with current limiting there's often an actual
> limiter in there so the regulator simply won't supply any more
> current
> than is configured.  With a warning everything is still working fine
> but
> getting close to not doing so.

Sounds reasonable. Warning while things are still working - but are
getting to the boundary. Error when things are already pretty wrong.
Thanks.

> > > Well, if these things are kicking in the hardware is in serious
> > > trouble
> > > anyway so it's unclear what the system would be likely to do in
> > > software, and also unclear how safe it is to rely on software to
> > > be
> > > able
> > > to take that action given that it let things get into such a bad
> > > state
> > > in the first place.
> > Actually, bear with me but I am unsure why we have these
> > notifications
> > if we don't expect SW to be able to do anything? Wouldn't the panic
> > print be all that is needed then? I think that setups which have
> > dual
> 
> You'll notice that there aren't any actual users of this stuff in
> tree
> at the minute - people don't generally put much effort into software
> recovery as they're not expecting to be anywhere near limiting in
> normal
> operation.  What I'd expect people to do where they do implement
> handling is something like shutting down all other supplies on the
> device, possibly also trying to shut down the system as a
> whole.  Things
> more about preventing physical damage rather than being part of the
> normal operation of the system.

Again this makes sense. I will try to ask form HW colleagues what they
thought to be the action SW take (I hope they have some scenario on
mind - let's see). If they tell me that they expect SW to shut down
system gracefully - then I keep errors, if they tell me they think SW
will temporarily disable some HW blocks or do other "tricks" and later
resume normal operation - then I will see if I can add some new
'warning' indicators.

> For thermal issues systems generally try to apply software limits
> well
> before an individual component starts flagging things up with an
> interrupt, the limits that devices have are generally super high and
> often there'll be issues at a system level (eg, a case getting
> unusably
> hot) earlier and it can take a while for responses to have an impact.

I think this is also case with the BD9576 - 140 C sounds pretty hot to
me - and I expect this is really where things are already badly wrong.
So I guess I can keep the 'error' here.

> 
> > limits (one for initiating potential SW recovery - other for HW to
> > forcing protection) actually make sense. So does implementing
> > notifiers
> > / error statuses for events where SW recovery is potentially
> > helpful.
> > But whether the existing event notifications / error flags are
> > correct
> > for these is something I can't decide :) Here I ask guidance for
> > Mark &
> > others who know what is the idea behind existing error-
> > flags/events.
> 
> It's not that we shouldn't implement support for warnings, it's that
> they're not the common case for hardware and so won't line up with
> behaviour for other users.


Agreed. As I said, I understand we shouldn't send same events to
different situations. If current errors are used to indicate things are
really "wrong" to the point where safest thing is to shut down system -
then we'd better add these "warnings" to indicate that there would
potentially still be time to change something - before things are shut
off.

Thanks again!

Best regards
	Matti Vaittinen

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: short-circuit and over-current IRQs
       [not found]             ` <a89bf6f0e6c1e4b9afe980908b7e36b70b304a96.camel@fi.rohmeurope.com>
@ 2021-01-30 15:43               ` AngeloGioacchino Del Regno
  2021-02-01  7:14                 ` Matti Vaittinen
  2021-02-01 13:17                 ` Mark Brown
  0 siblings, 2 replies; 12+ messages in thread
From: AngeloGioacchino Del Regno @ 2021-01-30 15:43 UTC (permalink / raw)
  To: matti.vaittinen, Mark Brown; +Cc: lgirdwood, linux-kernel

Il 29/01/21 10:14, Matti Vaittinen ha scritto:
> 
> On Thu, 2021-01-28 at 12:10 +0000, Mark Brown wrote:
>> On Thu, Jan 28, 2021 at 09:23:08AM +0000, Vaittinen, Matti wrote:
>>> On Wed, 2021-01-27 at 16:32 +0000, Mark Brown wrote:
>>>> Note that the events the API currently has are expected to be for
>>>> the
>>>> actual error conditions, not for the warning ones - indicating
>>>> that
>>>> the
>>>> voltage is out of regulation for example.
>>> I am unsure how to interpret this. What is the criteria of issue
>>> being
>>> an error/warning. When I was talking about warning I meant that the
>>> issue which is detected is unexpected and abnormal (error?) - but
>>> might
>>> still be recoverable (warning?). I understand the regulator
>>> framework
>>> must not signal same events for different purposes - but I don't
>>> really
>>> know what the current events are used for - I am grateful for any
>>> guidance!
>>
>> What the majority of hardware interrupts on is situations where
>> things
>> have already gone out of spec and there are actual problems with the
>> output - for example with current limiting there's often an actual
>> limiter in there so the regulator simply won't supply any more
>> current
>> than is configured.  With a warning everything is still working fine
>> but
>> getting close to not doing so.
>>
>>>> Well, if these things are kicking in the hardware is in serious
>>>> trouble
>>>> anyway so it's unclear what the system would be likely to do in
>>>> software, and also unclear how safe it is to rely on software to
>>>> be
>>>> able
>>>> to take that action given that it let things get into such a bad
>>>> state
>>>> in the first place.
>>> Actually, bear with me but I am unsure why we have these
>>> notifications
>>> if we don't expect SW to be able to do anything? Wouldn't the panic
>>> print be all that is needed then? I think that setups which have
>>> dual
>>
>> You'll notice that there aren't any actual users of this stuff in
>> tree
>> at the minute - people don't generally put much effort into software
>> recovery as they're not expecting to be anywhere near limiting in
>> normal
>> operation.  What I'd expect people to do where they do implement
>> handling is something like shutting down all other supplies on the
>> device, possibly also trying to shut down the system as a
>> whole.  Things
>> more about preventing physical damage rather than being part of the
>> normal operation of the system.
>>
>> For thermal issues systems generally try to apply software limits
>> well
>> before an individual component starts flagging things up with an
>> interrupt, the limits that devices have are generally super high and
>> often there'll be issues at a system level (eg, a case getting
>> unusably
>> hot) earlier and it can take a while for responses to have an impact.
>>
>>> limits (one for initiating potential SW recovery - other for HW to
>>> forcing protection) actually make sense. So does implementing
>>> notifiers
>>> / error statuses for events where SW recovery is potentially
>>> helpful.
>>> But whether the existing event notifications / error flags are
>>> correct
>>> for these is something I can't decide :) Here I ask guidance for
>>> Mark &
>>> others who know what is the idea behind existing error-
>>> flags/events.
>>
>> It's not that we shouldn't implement support for warnings, it's that
>> they're not the common case for hardware and so won't line up with
>> behaviour for other users.

....but anyway, I have no idea what would you do when a warning is
triggered: make a good example, please...

I mean, take for example the "usual display": you are delivering power
to a DriverIC (which is most of the times undocumented), your display
starts drawing more than expected, so... uh.. so what?
You shut it down? You reboot the DrIC?
I can't imagine a DrIC reboot to restore the power draw...

> 
> I have been pondering this - and I guess evaluating the severity of
> these issues belong to board designer's table. (As when to expect
> things be already so wrong that something is broken "ERROR" - or when
> things are approaching the point where things are wrong - but might
> still be recoverable by <insert recovery action here> "WARNING").
> 
> So, what comes to my mind is pushing this decision to a board designer.
> And to me this sounds like device-tree configuration.
> 

That's 100% DT (or ACPI, whatever), since we're in 2020 :P

> For regulator framework that would mean supporting the warning level
> events (and regulator_get_error_flags warning-flags?) too.
> 
> For (my) driver this would mean getting the information from device-
> tree. I didn't yet check the existing DT properties (if any for these
> levels) - but I think we should have something like:
> 
> regulator-over-voltage-protection-microvolt = <>;
> regulator-over-voltage-error-microvolt = <>;
> regulator-over-voltage-warning-microvolt = <>;
> 
> *-protection-* would be point where HW (or driver) shuts down the
> outputs w/o asking questions. (if HW based protection limit can not be
> set but protection can be enabled/disabled then value '1' would mean
> enable, zero would mean disable).
> 

regulator-over-voltage-protection;
^^^ that would look better, since we already have a
"regulator-over-current-protection" property to enable OCP.

Then, my suggestion would be:
regulator-over-voltage-max-microvolt = <uint32>;

And to signal that the hardware will auto-shutdown the rail:
regulator-over-voltage-auto-shutdown;

> *-error-* would be point where driver punts the existing error events
> via notifications allowing consumers to implement what ever is needed
> in order to handle error (which now means that something is likely to
> be broken / out of spec already.
> 

This one would be replaced by the property that I've proposed above

> *-warning-* would be point where driver punts new to-be-invented events
> / error flags and consumers can implement recovery actions assuming the
> HW is still operable but getting to the point where things are going to
> be shut down.
> 

regulator-over-voltage-warn-microvolt = <uint32>;


Less is more, I think. "warn" is shorter, looks nicer to me, but then,
it's not a big deal, that was just a nit.

> The BD9576 would then allow giving either *-error-* or *-warning-*
> properties (not both at least for now) and then selects the flag /
> event based on given property.
> 
> Do you think this makes sense? I hope that having these properties and
> flags/events would help actually utilizing these IRQs from PMICs which
> support them - and also help inventing recovery actions in consumer
> drivers, eventually saving the world for sure :]
> 
> If no one disagrees with this plan, then I know what I am going to do
> next week :]
> 

But then, as I just said, there's already "something" for over-current,
so... if you implement only the OVP stuff, then the implementation would
look "incomplete".
At this point it would be nicer to add another 10 lines of code to
replicate the same on OCP.

Keep in mind, though, that at least the qcom-labibb regulator does *not*
support what you propose: that one has auto-shutdown (OVP) enabled by
default (and *cannot* be disabled), no auto-shutdown on OCP (and no way
to enable it).

So, for labibb then "regulator-over-voltage-auto-shutdown" would be
sort of implicit, so every driver implementing similar hardware would
have to error out in case one property is set (ovp max voltage)but not
the other (auto-shutdown).

In a way, this looks messy.. but it may also make a lot of sense.

> Best Regards
> 	Matti Vaittinen
> 

Yours,
--Angelo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: short-circuit and over-current IRQs
  2021-01-30 15:43               ` AngeloGioacchino Del Regno
@ 2021-02-01  7:14                 ` Matti Vaittinen
  2021-02-01 13:17                 ` Mark Brown
  1 sibling, 0 replies; 12+ messages in thread
From: Matti Vaittinen @ 2021-02-01  7:14 UTC (permalink / raw)
  To: AngeloGioacchino Del Regno, Mark Brown; +Cc: lgirdwood, linux-kernel

On Sat, 2021-01-30 at 16:43 +0100, AngeloGioacchino Del Regno wrote:
> Il 29/01/21 10:14, Matti Vaittinen ha scritto:
> > On Thu, 2021-01-28 at 12:10 +0000, Mark Brown wrote:
> > > On Thu, Jan 28, 2021 at 09:23:08AM +0000, Vaittinen, Matti wrote:
> > > > On Wed, 2021-01-27 at 16:32 +0000, Mark Brown wrote:
> > > > > limits (one for initiating potential SW recovery - other for
> > > > > HW
> > > > to
> > > > forcing protection) actually make sense. So does implementing
> > > > notifiers
> > > > / error statuses for events where SW recovery is potentially
> > > > helpful.
> > > > But whether the existing event notifications / error flags are
> > > > correct
> > > > for these is something I can't decide :) Here I ask guidance
> > > > for
> > > > Mark &
> > > > others who know what is the idea behind existing error-
> > > > flags/events.
> > > 
> > > It's not that we shouldn't implement support for warnings, it's
> > > that
> > > they're not the common case for hardware and so won't line up
> > > with
> > > behaviour for other users.
> 
> ....but anyway, I have no idea what would you do when a warning is
> triggered: make a good example, please...

Unfortunately, I typically lost the visibility to final solutions :( I
guess that's the prize I must pay for working for a component vendor
nowadays. I've heard "word of mouth" that some setup had issues with
graphics accelerator heating - which was solved by cutting the power
from this IC (via regulator control) when things went wrong. Sure the
video playing was 'stopped' - but rest of the device remained usable
and no data was lost. But as I said, this is what I heard of - not what
I've done.

OTOH. I never stop being surprized by what people invent when they have
a problem to work around. I've seen a phone-like device which had
start-up problems at times - and which did 'secret reboot' by keeping
display running while rebooting - and due to fast reboot this resulted
actually a good user experience. My point? I am positive that by giving
the tools to invent work-around for problem - there will be a work-
around using that tool :)

> > For regulator framework that would mean supporting the warning
> > level
> > events (and regulator_get_error_flags warning-flags?) too.
> > 
> > For (my) driver this would mean getting the information from
> > device-
> > tree. I didn't yet check the existing DT properties (if any for
> > these
> > levels) - but I think we should have something like:
> > 
> > regulator-over-voltage-protection-microvolt = <>;
> > regulator-over-voltage-error-microvolt = <>;
> > regulator-over-voltage-warning-microvolt = <>;
> > 
> > *-protection-* would be point where HW (or driver) shuts down the
> > outputs w/o asking questions. (if HW based protection limit can not
> > be
> > set but protection can be enabled/disabled then value '1' would
> > mean
> > enable, zero would mean disable).
> > 
> 
> regulator-over-voltage-protection;
> ^^^ that would look better, since we already have a
> "regulator-over-current-protection" property to enable OCP.
> 
> Then, my suggestion would be:
> regulator-over-voltage-max-microvolt = <uint32>;
> 
> And to signal that the hardware will auto-shutdown the rail:
> regulator-over-voltage-auto-shutdown;

This would prevent having limits for both the protection and error at
same time, right? I would like to have separate property for HW-
specific shut-down (or shutdown by driver W/O any consumer driver
actions - which seems like HW originated shut-down). This is why I
wanted to have *-protection-* and *-error-*. Most of the ROHM ICs
actually provide the 'protection' - although the limit is rarely
configurable. (I am unsure if it is configurable on any ROHM PMIC I've
worked with) I still think allowing limit would be Ok in order to keep
all of the properties "looking" the same. Few ICs (like the BD9576MUF)
additionally support the 'notification IRQ' with configurable limit -
and this is where we can allow board designer to set this as error or
warning. I am thinking that for BD9576 we should allow setting:

'protection' 	(disable/enable) "HW originated shutdown"
and additionally either
'error'		(limit/disable/enable) "existing err notification"
or 
'warning' (limit/disable/enable) "new warning notification"

So - we would need separate properties for all 3 levels. As I said, I
would like to add the 'limit' also for protection to keep it identical
to error and warning levels.

The thing that bugs me with these limit properties and actually with
many other 'safety related' properties (like charger current/voltage
limits/thresholds at battery nodes) is that when the parsing is left
for IC drivers - then we see no warning when property is given but
driver does not care about it. This is why I think parsing these
properties should be left for the regulator core - and IC drivers
should only fill the enable/disable/set-limit callbacks in ops. The
core could spill a warning if DT defines these limits but the IC driver
does not provide callbacks. (just a thought).

> > *-error-* would be point where driver punts the existing error
> > events
> > via notifications allowing consumers to implement what ever is
> > needed
> > in order to handle error (which now means that something is likely
> > to
> > be broken / out of spec already.
> > 
> 
> This one would be replaced by the property that I've proposed above
> 
> > *-warning-* would be point where driver punts new to-be-invented
> > events
> > / error flags and consumers can implement recovery actions assuming
> > the
> > HW is still operable but getting to the point where things are
> > going to
> > be shut down.
> > 
> 
> regulator-over-voltage-warn-microvolt = <uint32>;
> 
> 
> Less is more, I think. "warn" is shorter, looks nicer to me, but
> then,
> it's not a big deal, that was just a nit.

Thanks for suggestion. I think warn is indeed better. :) I will try to
get the patch done - I am sure it will probably invoke some comments
from Rob as well :)

> 
> > The BD9576 would then allow giving either *-error-* or *-warning-*
> > properties (not both at least for now) and then selects the flag /
> > event based on given property.
> > 
> > Do you think this makes sense? I hope that having these properties
> > and
> > flags/events would help actually utilizing these IRQs from PMICs
> > which
> > support them - and also help inventing recovery actions in consumer
> > drivers, eventually saving the world for sure :]
> > 
> > If no one disagrees with this plan, then I know what I am going to
> > do
> > next week :]
> > 
> 
> But then, as I just said, there's already "something" for over-
> current,
> so... if you implement only the OVP stuff, then the implementation
> would
> look "incomplete".
> At this point it would be nicer to add another 10 lines of code to
> replicate the same on OCP.

Yep. BD9576MUF has configurable OCP for one of the power rails.

> Keep in mind, though, that at least the qcom-labibb regulator does
> *not*
> support what you propose: that one has auto-shutdown (OVP) enabled by
> default (and *cannot* be disabled), no auto-shutdown on OCP (and no
> way
> to enable it).

This is why I am thinking we should put the DT parsing to core and
allow ICs just to provide operations for configuring them. If qcom-
labibb does not allow configurations - then it won't provide
configuration callbacks and if DT limits/enable/disable is attempted
the core can emit a warning(?)

> So, for labibb then "regulator-over-voltage-auto-shutdown" would be
> sort of implicit, so every driver implementing similar hardware would
> have to error out in case one property is set (ovp max voltage)but
> not
> the other (auto-shutdown).

This is why it should be in core. And warning only because we don't
have information about 'implicit features'. We could add a flag for
'implicit protection' to allow drivers to prevent core from emitting
unnecessary warnings - (something like a bit-mask: 
#define REGULATOR_PROT_OVP BIT(0)
#define REGULATOR_PROT_OCP BIT(1)
#define REGULATOR_PROT_UVP BIT(2)
#define REGULATOR_PROT_TEMP BIT(3)
...
int hw_fixed_prot; 

in regulator desc) - but I am not sure what to thin about it now. Let's
discuss further when I get something actually done and we see what it
looks like :)

> 
> In a way, this looks messy.. but it may also make a lot of sense.

I do agree :) Thanks for investing the time to this! Much appreciated!
It is really nice to see someone cares for the work I do XD


Best Regards
	Matti Vaittinen


--
Matti Vaittinen, Linux device drivers
ROHM Semiconductors, Finland
SWDC
Kiviharjunlenkki 1E
90220 OULU
FINLAND

~~~ "I don't think so," said Rene Descartes. Just then he vanished ~~~

Simon says - in Latin please.
"non cogito me" dixit Rene Descarte, deinde evanescavit

(Thanks for the translation Simon)



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: short-circuit and over-current IRQs
  2021-01-30 15:43               ` AngeloGioacchino Del Regno
  2021-02-01  7:14                 ` Matti Vaittinen
@ 2021-02-01 13:17                 ` Mark Brown
  1 sibling, 0 replies; 12+ messages in thread
From: Mark Brown @ 2021-02-01 13:17 UTC (permalink / raw)
  To: AngeloGioacchino Del Regno; +Cc: matti.vaittinen, lgirdwood, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2602 bytes --]

On Sat, Jan 30, 2021 at 04:43:46PM +0100, AngeloGioacchino Del Regno wrote:
> Il 29/01/21 10:14, Matti Vaittinen ha scritto:

Please delete unneeded context from mails when replying.  Doing this
makes it much easier to find your reply in the message, helping ensure
it won't be missed by people scrolling through the irrelevant quoted
material.

> > > It's not that we shouldn't implement support for warnings, it's that
> > > they're not the common case for hardware and so won't line up with
> > > behaviour for other users.

> ....but anyway, I have no idea what would you do when a warning is
> triggered: make a good example, please...

> I mean, take for example the "usual display": you are delivering power
> to a DriverIC (which is most of the times undocumented), your display
> starts drawing more than expected, so... uh.. so what?
> You shut it down? You reboot the DrIC?
> I can't imagine a DrIC reboot to restore the power draw...

As I've said several times now if you're getting anywhere near the point
where individual chips are starting to generate this sort of hardwired
alarms rather than specialist hardware monitoring stuff then something
will usually be very badly wrong.  If there's an actual fault developed
then very likely there is nothing that will fix it and it's merely a
case of preventing further or escallating physical damage to the system.
If it's something like a thermal warning then possibly waiting for
things to cool down might help.  The fact that it's hard to do anything
constructive about these issues on an individual device level is why
there's not code doing so upstream, if things have got to this point
then the ship has sailed.

Typical reactions would include things like going down to the lowest
available OPP, turning off some or all functionality of the system or
just plain shutting down the entire system.  The main distinction with a
warning rather than an error is that you know the regulator is still
operating.  If the regulator has encountered an actual error then you
don't know what state the devices it is supplying will be in, this means
that for example the driver getting the notification won't be able to
assume that the device will respond to register I/O or retain any state
it may have had.

> Keep in mind, though, that at least the qcom-labibb regulator does *not*
> support what you propose: that one has auto-shutdown (OVP) enabled by
> default (and *cannot* be disabled), no auto-shutdown on OCP (and no way
> to enable it).

This is very standard, the stuff that's baked into the devices tends to
be minimally configurable.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-02-01 13:20 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-27 12:01 short-circuit and over-current IRQs Vaittinen, Matti
2021-01-27 12:27 ` Mark Brown
2021-01-27 12:56   ` Matti Vaittinen
2021-01-27 14:34     ` AngeloGioacchino Del Regno
2021-01-27 14:42       ` Vaittinen, Matti
2021-01-27 16:32       ` Mark Brown
2021-01-28  9:23         ` Vaittinen, Matti
2021-01-28 12:10           ` Mark Brown
2021-01-28 12:49             ` Vaittinen, Matti
     [not found]             ` <a89bf6f0e6c1e4b9afe980908b7e36b70b304a96.camel@fi.rohmeurope.com>
2021-01-30 15:43               ` AngeloGioacchino Del Regno
2021-02-01  7:14                 ` Matti Vaittinen
2021-02-01 13:17                 ` Mark Brown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).