On Sat, Nov 25, 2023 at 02:35:41PM +0000, Greg Kroah-Hartman wrote:
> On Sat, Nov 25, 2023 at 10:30:42AM +0000, Mark Brown wrote:
> > On Sat, Nov 25, 2023 at 09:09:01AM +0000, Greg Kroah-Hartman wrote:

> > > So hardware is attempting to rely on software in order to prevent the
> > > destruction of that same hardware?  Surely hardware designers aren't
> > > that crazy, right?  (rhetorical question, I know...)

> > Surely software people aren't going to make no effort to integrate with
> > the notification features that the hardware engineers have so helpfully
> > provided us with?

> That would be great, but I don't see that here, do you?  All I see is
> the shutdown sequence changing because someone wants it to go "faster"
> with the threat of hardware breaking if we don't meet that "faster"
> number, yet no knowledge or guarantee that this number can ever be known
> or happen.

The idea was to have somewhere to send notifications when the hardware
starts reporting things like power supplies starting to fail.  We do
have those from hardware, we just don't do anything terribly useful
with them yet.

TBH it does seem reasonable that there will be systems that can usefully
detect these issues but don't have a detailed characterisation of
exactly how long you've got before things expire.  It's also likely that
the actual bound is going to be highly variable depending on what the
system is up to at the point of detection.  It's quite likely that we'd
only get a worst case bound, so it's also likely that we'd have more time
in practice than in spec.  I'd expect any characterisation that does
happen to be very system specific at this point, so I don't think we can
rely on getting that information.  I'd certainly expect that we have
vastly more systems that can usefully detect issues than systems where we
have firm numbers.

> > > > The same problem was seen not only in automotive devices, but also in
> > > > industrial and agricultural ones. In other words, it is important enough
> > > > to bring some kind of solution into mainline.

> > > But you are not providing a real solution here, only a "I am going to
> > > attempt to shut down a specific type of device before the others, there
> > > are no time or ordering guarantees here, so good luck!" solution.

> > I'm not sure there are great solutions here, the system integrators are
> > constrained by what the application appropriate silicon that's on
> > the market is capable of, the silicon is constrained by the area costs of
> > dealing with corner cases for system robustness and how much of the
> > market cares about fixing these issues, and software is constrained by
> > what hardware ends up being built.  Everyone's just got to try their
> > best with the reality they're confronted with, hopefully what's possible
> > will improve with time.

> Agreed, but I don't think this patch is going to actually work properly
> over time as there are no time values involved :)

This seems to be more in the area of mitigation than a firm solution, I
suspect users will be pleased if they can make a noticeable dent in the
number of failures they're seeing.

> > > And again, how are you going to prevent the in-fighting of all device
> > > types to be "first" in the list?

> > It doesn't seem like the most complex integration challenge we've ever
> > had to deal with TBH.

> True, but we all know how this grows and thinking about how to handle it
> now is key for this to be acceptable.

It feels like if we're concerned about mitigating physical damage during
the process of power failure, that's a very limited set of devices - the
storage case where we're in the middle of writing to flash or whatever
is the most obvious one.
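
To make the "very limited set of devices" point a bit more concrete, here
is a rough userspace model of the ordering idea.  It is only a sketch under
assumptions: struct model_dev, enum emerg_class and model_emergency_order()
are made-up names for illustration, not kernel APIs and not what the patch
under discussion implements.  Devices flagged as critical (storage that may
be mid-write, say) go first, everything else keeps a later-registered-first
order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical classification, for illustration only. */
enum emerg_class {
	EMERG_CLASS_CRITICAL = 0,	/* e.g. storage possibly mid-write */
	EMERG_CLASS_NORMAL   = 1,	/* everything else */
};

/* Hypothetical stand-in for a device; not struct device. */
struct model_dev {
	const char *name;
	enum emerg_class cls;
	int reg_order;			/* order in which it was registered */
};

/*
 * Critical class sorts first.  Within a class, later-registered devices
 * sort first.  The tie-break on reg_order keeps the result deterministic
 * even though qsort() itself is not guaranteed to be stable.
 */
static int cmp_emergency(const void *a, const void *b)
{
	const struct model_dev *da = a;
	const struct model_dev *db = b;

	if (da->cls != db->cls)
		return da->cls < db->cls ? -1 : 1;

	return (db->reg_order > da->reg_order) -
	       (db->reg_order < da->reg_order);
}

/* Reorder devs[] in place into the emergency shutdown order. */
void model_emergency_order(struct model_dev *devs, size_t n)
{
	assert(devs != NULL || n == 0);

	if (n > 1)
		qsort(devs, n, sizeof(*devs), cmp_emergency);
}
```

With only two classes there isn't much of a list to fight over being
"first" in, which is roughly why I'd expect the set of critical devices to
stay small.  Note this says nothing about timing, it only models ordering.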