All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [RFC] PM: suspend: Upstreaming wakeup reason capture support
@ 2022-01-30 18:15 Zichar Zhang
  0 siblings, 0 replies; 10+ messages in thread
From: Zichar Zhang @ 2022-01-30 18:15 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Greg Kroah-Hartman, Kelly Rossmoyer, Lee Jones, Len Brown,
	linux-kernel, linux-pm, Vijay Nayak, Pavel Machek

Hi Rafael, Kelly

hello Rafael, it is a little bit late for me to reply to you. I was
finding the way to
reply to you, cause I'm not in the "cc list". So, thanks Kelly in that way. :)
I'm totally agree with you that we should split the work into smaller
pieces and do it step by step.

Hi Kelly,
I got the strong signal from you that you insist on your requirement.
It's reasonable, and I want that if I am the user too. :)

But it's could be some problem for me to do all of that. And I am calling help
here. Yes! I need some help!

I want to seperate this task into 4 part:
1. user interface: like sysfs file /sys/power/last_wakeup_reason.
2. report interface: call by "wakeup sources" to report "wakeup reason".
3. report operation in kernel: like "interrupt subsystem". (a common interface)
4. report operation in device: like WDT driver, GIC driver or other
device driver.

I think we should do the first 3 parts, but not the last one, cause it is device
specific things. Device and BSP should do that, I insist that.
Part 1 and 2 are easily to do, and I can rework again and agian until it is all
right for everyone.
So it is clear that we have problem with the third part. And yes! it is very
hard.

Kernel desn't know how the "machine" wakeup, kernel just offer the interface
that user can mark the "wakeup source", like IRQD_WAKEUP_STATE flag
and "ws"(wakeup source) interface (acturally they are fake wakeup sources).
These works well and we can easily to report these which Android patch and
mine already do that.
But the left things is hard.Cause kernel or "subsystem" in kernel desn't has
any mechanism to do that. Then we are facing these three things:

1. "misconfigured" and "unmapped" IRQs reporting.
Android patch just add a "wakeup report" interface here once it was occurred,
even it's not in a "suspend" state, and even one of them was in GIC driver.
if I was the maintainer I won't take this, but the question is what should I do
for that?
(Maybe I shoud give a task to "interrupt subsystem people" and ask them to
do that? :) )

2. errors in suspend/resume process.
That means if there is a error occurs in suspend/resume process I need to
report it as "wakeup reason".Which is just "abort wakeup reason" as  Kelly
said. But it is lots of errors may occurs here, and which one I should report,
and is that enough?
And as  Kelly said the code is "messy", that does hit the point.

3. threaded inerrupt
Sorry, I don't find the properly place in kernel to report there "wakeup
reason". Maybe that's my lack of knowalige. :)
It's seem like some interrupt chip driver should do that? I don't know. Maybe
I should offer a interface and just let "user" to use it?

So, that all the things I got.
And again kelly, I got your mind, and I will try to think this over again to see
if I can find a way to do that.

besides, thanks Jone and John, I will rework the patch after this discusion.

And any advice could help! :)
Or you have a better idea, I can help you to do yours!

Best
Zichar

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] PM: suspend: Upstreaming wakeup reason capture support
  2022-01-30 14:46       ` Rafael J. Wysocki
@ 2022-02-02  7:59         ` Kelly Rossmoyer
  0 siblings, 0 replies; 10+ messages in thread
From: Kelly Rossmoyer @ 2022-02-02  7:59 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Len Brown, Greg Kroah-Hartman, Lee Jones,
	Vijay Nayak, Linux PM, Linux Kernel Mailing List, Zichar Zhang

+Zichar, to try to pull the threads together.

On Sun, Jan 30, 2022 at 6:46 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Sat, Jan 29, 2022 at 9:27 AM Kelly Rossmoyer <krossmo@google.com> wrote:
> >
> > On Thu, Jan 27, 2022 at 12:10 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > That said, the general idea behind wakeup_source objects is that every
> > > system wakeup event should be recorded in one of them which then can
> > > be used for later analysis.
> > >
> > > If there are reasons why this cannot work in general, what are they?
> >
> > I won't presume to say that it "cannot work in general."  Nearly everyone on
> > this thread has more expertise here than I do, and I'm keenly aware of how
> > much I don't know.  :-)
> >
> > What I will say is that - across the chips and architectures I've worked upon
> > over the last few years - that concept has not appeared to match observed
> > reality.  From what I've seen (which is a very narrow slice of the Linux
> > universe, but I suspect is at least pretty representative of Android):
> > * resumes from successful suspends are typically accompanied by a flurry of
> >   wakesource activity from which it is not possible to determine what actually
> >   caused the resume (despite last changed timestamps)
>
> So I wonder how you are going to determine what actually caused the
> resume reliably.

I don't have the cross-platform/cross-arch experience to know if this is
broadly applicable, but right now what we're doing is taking:
* the first suspend abort reason that was captured if there was one, or if
  not...
* all pending wakeup-armed IRQs, if there are any, or if not...
* the first non-IRQ, non-abort reason logged, which mostly tends to emerge
  from platform-specific code (e.g. the Foo logic block decided it was
  time to power on for reasons that aren't reflected in any IRQs)

> > * resumes that aren't accompanied by wakeup-armed IRQs can be even
> >   less well-reflected by wakesource activity
>
> Do you have examples of these other than the aborts mentioned below?

Nothing super detailed, as my area of power work has only gotten down to
relatively high-level portions of the kernel as opposed to lower-level SOC
architecture.  But two examples that come to mind include:
* a secure watchdog interrupt fires, which wakes up a hypervisor that
  handles and clears the interrupt, leaving no IRQ pending by the time
  kernel execution resumes
* power control logic outside of the CPUs decides to turn CPU clusters
  back on to support a use case currently driven by other logic on the
  chip that doesn't involve any wakeup-armed IRQs (this could plausibly be
  something like a low-power logic block that handles audio playback
  causing the CPUs to be woken up so it can receive the next several
  seconds of audio data to buffer up before the kernel suspends again)

> > * I believe inferring wakeup reasons from wakesource stats would require
> >   having a snapshot from the last moment prior to suspend, which seems
> >   unsolvable from userspace
>
> That can be addressed by extending the wakeup sources in principle.
>
> > * suspend aborts (which can be even more harmful for battery life than
> >   "true" wakeups) are often caused by things that aren't reflected by specific
> >   wakesources (e.g. a driver failing to suspend)
>
> Which again can be addressed by using special wakeup sources for
> registering these "wakeups"  or similar.

Do you have the outline of a concept in mind, or is this more about the
general principle of extending what's there vs. adding something new?
(Apologies if what you're alluding to should be clear to me... I'm afraid
I don't have the relevant experience to envision what this could look like.)

> > And as I mentioned in my reply to Zichar, this isn't solely about
> > troubleshooting.  There's a lot of room to improve on user-focused power
> > attribution, and I'm hoping to build change in that direction upon the same
> > foundation.  Having the best possible data about "why we're awake" serves both
> > goals.
>
> Generally speaking, there is one wakeup-related framework in the
> kernel (wakeup sources) and you want to add another one sort of on top
> of it and it is still quite unclear to me what can be done with the
> new framework that cannot be achieved with the old one (possibly with
> some extensions),

I guess my (neophytic) impression has been that there are actually already
three frameworks (not just one) for wakeup-related data (where "wakeup" in
this sense is "the reason(s) we aren't suspended", including aborts):
* wakeup_source stats
* pm_wakeup_irq
* suspend_stats

And if those were extended, such that:
* the name of the last active wakeup_source was exposed to userspace
  to decode suspend aborts due to wakeup_count changes
* pm_wakeup_irq captured potentially multiple pending wakeup_armed
  IRQs during resume instead of just one
* some sense of causation was added to wakeup_sources to enable their
  use for non-IRQ resume causes (e.g. PM core outside of the CPUs
  turned CPUs back on for reasons that only platform logic can decode
  and report)
* the userspace interface to wakeup_source stats wasn't so flawed

Then very determined userspace code could combine those things with their
pre-suspend-attempt states and the fourth key piece of data (the return
value from the write to /sys/power/state, assuming kernel autosuspend
isn't used) to put together something that would be close to what the
Android wakeup_reason code is doing.  And then it would break because
there's nothing in the kernel tying those different pieces together in a
cohesive way that's guaranteed to work with some degree of stability.

> Let's first talk about the specific problems to address and then we'll
> decide whether or not we need yet another piece of infrastructure to
> address them.

I'm probably being overly optimistic, but my intention is more to tie
together and shore up the gaps in existing pieces that are currently
disjoint and incomplete, as opposed to just throwing yet another
framework into the mix.

--

Kelly Rossmoyer | Software Engineer | krossmo@google.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] PM: suspend: Upstreaming wakeup reason capture support
  2022-01-29  8:26     ` Kelly Rossmoyer
@ 2022-01-30 14:46       ` Rafael J. Wysocki
  2022-02-02  7:59         ` Kelly Rossmoyer
  0 siblings, 1 reply; 10+ messages in thread
From: Rafael J. Wysocki @ 2022-01-30 14:46 UTC (permalink / raw)
  To: Kelly Rossmoyer
  Cc: Rafael J. Wysocki, Pavel Machek, Len Brown, Greg Kroah-Hartman,
	Lee Jones, Vijay Nayak, Linux PM, Linux Kernel Mailing List

On Sat, Jan 29, 2022 at 9:27 AM Kelly Rossmoyer <krossmo@google.com> wrote:
>
> On Thu, Jan 27, 2022 at 12:10 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Thu, Jan 27, 2022 at 8:54 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > On Mon, Jan 10, 2022 at 7:49 PM Kelly Rossmoyer <krossmo@google.com> wrote:
> > > >
> > > So as Zichar said, this is quite heavy-weight.
> > >
> > > I'm not fundamentally against adding more infrastructure to help
> > > identify issues related to system suspend, but there needs to be a
> > > clear benefit associated with any change in this direction.
> >
> > That said, the general idea behind wakeup_source objects is that every
> > system wakeup event should be recorded in one of them which then can
> > be used for later analysis.
> >
> > If there are reasons why this cannot work in general, what are they?
>
> I won't presume to say that it "cannot work in general."  Nearly everyone on
> this thread has more expertise here than I do, and I'm keenly aware of how
> much I don't know.  :-)
>
> What I will say is that - across the chips and architectures I've worked upon
> over the last few years - that concept has not appeared to match observed
> reality.  From what I've seen (which is a very narrow slice of the Linux
> universe, but I suspect is at least pretty representative of Android):
> * resumes from successful suspends are typically accompanied by a flurry of
>   wakesource activity from which it is not possible to determine what actually
>   caused the resume (despite last changed timestamps)

So I wonder how you are going to determine what actually caused the
resume reliably.

> * resumes that aren't accompanied by wakeup-armed IRQs can be even
>   less well-reflected by wakesource activity

Do you have examples of these other than the aborts mentioned below?

> * I believe inferring wakeup reasons from wakesource stats would require
>   having a snapshot from the last moment prior to suspend, which seems
>   unsolvable from userspace

That can be addressed by extending the wakeup sources in principle.

> * suspend aborts (which can be even more harmful for battery life than
>   "true" wakeups) are often caused by things that aren't reflected by specific
>   wakesources (e.g. a driver failing to suspend)

Which again can be addressed by using special wakeup sources for
registering these "wakeups"  or similar.

> And as I mentioned in my reply to Zichar, this isn't solely about
> troubleshooting.  There's a lot of room to improve on user-focused power
> attribution, and I'm hoping to build change in that direction upon the same
> foundation.  Having the best possible data about "why we're awake" serves both
> goals.

Generally speaking, there is one wakeup-related framework in the
kernel (wakeup sources) and you want to add another one sort of on top
of it and it is still quite unclear to me what can be done with the
new framework that cannot be achieved with the old one (possibly with
some extensions),

Let's first talk about the specific problems to address and then we'll
decide whether or not we need yet another piece of infrastructure to
address them.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] PM: suspend: Upstreaming wakeup reason capture support
  2022-01-27 20:10   ` Rafael J. Wysocki
@ 2022-01-29  8:26     ` Kelly Rossmoyer
  2022-01-30 14:46       ` Rafael J. Wysocki
  0 siblings, 1 reply; 10+ messages in thread
From: Kelly Rossmoyer @ 2022-01-29  8:26 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Len Brown, Greg Kroah-Hartman, Lee Jones,
	Vijay Nayak, Linux PM, Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 12:10 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Jan 27, 2022 at 8:54 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Mon, Jan 10, 2022 at 7:49 PM Kelly Rossmoyer <krossmo@google.com> wrote:
> > >
> > So as Zichar said, this is quite heavy-weight.
> >
> > I'm not fundamentally against adding more infrastructure to help
> > identify issues related to system suspend, but there needs to be a
> > clear benefit associated with any change in this direction.
>
> That said, the general idea behind wakeup_source objects is that every
> system wakeup event should be recorded in one of them which then can
> be used for later analysis.
>
> If there are reasons why this cannot work in general, what are they?

I won't presume to say that it "cannot work in general."  Nearly everyone on
this thread has more expertise here than I do, and I'm keenly aware of how
much I don't know.  :-)

What I will say is that - across the chips and architectures I've worked upon
over the last few years - that concept has not appeared to match observed
reality.  From what I've seen (which is a very narrow slice of the Linux
universe, but I suspect is at least pretty representative of Android):
* resumes from successful suspends are typically accompanied by a flurry of
  wakesource activity from which it is not possible to determine what actually
  caused the resume (despite last changed timestamps)
* resumes that aren't accompanied by wakeup-armed IRQs can be even
  less well-reflected by wakesource activity
* I believe inferring wakeup reasons from wakesource stats would require
  having a snapshot from the last moment prior to suspend, which seems
  unsolvable from userspace
* suspend aborts (which can be even more harmful for battery life than
  "true" wakeups) are often caused by things that aren't reflected by specific
  wakesources (e.g. a driver failing to suspend)

And as I mentioned in my reply to Zichar, this isn't solely about
troubleshooting.  There's a lot of room to improve on user-focused power
attribution, and I'm hoping to build change in that direction upon the same
foundation.  Having the best possible data about "why we're awake" serves both
goals.

Tangentially, the new(ish) wakesource stats interface has also proved to be
quite difficult to utilize robustly from userspace (at least on Android, maybe
not elsewhere?).  But that's a different fish for a different fryer, that I'm
hoping to tackle later this year.


--

Kelly Rossmoyer | Software Engineer | krossmo@google.com | 858-239-4111

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] PM: suspend: Upstreaming wakeup reason capture support
  2022-01-26  5:09   ` Zichar Zhang
@ 2022-01-29  7:52     ` Kelly Rossmoyer
  0 siblings, 0 replies; 10+ messages in thread
From: Kelly Rossmoyer @ 2022-01-29  7:52 UTC (permalink / raw)
  To: Zichar Zhang
  Cc: John Stultz, Rafael J. Wysocki, Pavel Machek, Len Brown,
	Greg Kroah-Hartman, Lee Jones, Vijay Nayak, linux-pm,
	linux-kernel, Sumit Semwal

On Tue, Jan 25, 2022 at 9:09 PM Zichar Zhang <zichar.zhang@linaro.org> wrote:
>
> hello John,  Kelly!

Hello!  Happy to see what the future may bring in this space, and looking
forward to what we can all do to move this area forward.

>     -- If these IRQs do hanpen, the code in this commit will catch
>     them as "unknown wakeup reason" and suggest user to check the
>     kernel log for more information.

I would argue that - at least in Android context, as that's what I know -
the kernel log is not a substitute for wakeup reason capture.
1) It is common to find yourself troubleshooting battery life issues for
   which kernel log data is not unavailable.
2) Troubleshooting is not the only consideration.  Since "why aren't we in
   suspend right now?" is such a key question for mobile device battery
   life, this is also about power attribution.  And I think trying to build
   that upon live kernel log parsing would be both inefficient and brittle.
   (But my lack of knowledge is vast, so maybe that's a solvable problem?)

> >> * abort reasons, including:
> >>    * wakeup_source activity
> >>    * failure to freeze userspace
> >>    * failure to suspend devices
> >>    * failed syscore_suspend callback
>
>     -- As mentioned before, if these "abort action" happened, you
>     can catch string "unknown wakeup reason", and check the kernel
>     log then.

I don't think the kernel log is a solution here.  Suspend aborts can be a
significant fraction of how suspend attempts end, making them a key
contributor to battery drain.  If the eventual set of patches doesn't solve
for at least the most common kinds of suspend aborts, that leaves a lot of
power observability off the table.  And again, at least on Android, kernel
log content is often not available for the interval when a series of
suspend aborts contributed to power drain.

> The patch is not complete, these is the next steps:
> 1. add interface to show time spend in suspend/resume work.
> (after this, I think it could be works and replace the android patch)

I believe Android would experience a significant regression in capability
with only those two patches implemented, so that would require a lot of
careful consideration.

> 2. we need solve the "unmapped HW IRQs", "misconfigured IRQs" and
> "threaded IRQs" problem.
> (this is the hardest one)

FWIW, I'm already expecting unmapped HW IRQs and misconfigured IRQs not to
make the cut.  Those have helped solve real and recurring issues, but
they're admittedly niche solutions to infrequent problems at the expense of
messy code coupling, so... maybe not broadly beneficial for Linux.

Threaded wakeup IRQs are a different matter.  As one example, there are
architectures in which the RTC wakeup IRQ is threaded and knowing "this
string of wakeups was due to the RTC" is a lot more useful than "this
string of wakeups was due to an irqchip with tens of child IRQs".

> ... So it just report a "wakeup reason" from
> interrupt subsystem. It just a coincidence that most hardware "wakeup
> reason" is also the interupt signal.  Even the "interrupt" and "wake up"
> signal are separated from each other in GIC700.
> So give them a chance to report the their "wakeup reason".)

Agreed.  But I do hope we can find a way to bring them together so a single
story is presented to userspace.  Or, barring that, at least make it easy
for userspace to figure that out after the fact.

Thanks so much for the work you are doing on this and taking the time to
walk me through your thoughts.  I'm happy to contribute in whatever manner
adds value.

--

Kelly Rossmoyer | Software Engineer | krossmo@google.com | 858-239-4111

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] PM: suspend: Upstreaming wakeup reason capture support
  2022-01-27 19:54 ` Rafael J. Wysocki
@ 2022-01-27 20:10   ` Rafael J. Wysocki
  2022-01-29  8:26     ` Kelly Rossmoyer
  0 siblings, 1 reply; 10+ messages in thread
From: Rafael J. Wysocki @ 2022-01-27 20:10 UTC (permalink / raw)
  To: Kelly Rossmoyer
  Cc: Rafael J. Wysocki, Pavel Machek, Len Brown, Greg Kroah-Hartman,
	Lee Jones, Vijay Nayak, Linux PM, Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 8:54 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Mon, Jan 10, 2022 at 7:49 PM Kelly Rossmoyer <krossmo@google.com> wrote:
> >
> > # Introduction
> >
> > To aid optimization, troubleshooting, and attribution of battery life, the
> > Android kernel currently includes a set of patches which provide enhanced
> > visibility into kernel suspend/resume/abort behaviors.  The capabilities
> > and implementation of this feature have evolved significantly since an
> > unsuccessful attempt to upstream the original code
> > (https://lkml.org/lkml/2014/3/10/716), and we would like to (re)start a
> > conversation about upstreaming, starting with the central question: is
> > there support for upstreaming this set of features?
> >
> > # Motivation
> >
> > Of the many factors influencing battery life on Linux-powered mobile
> > devices, kernel suspend tends to be amongst the most impactful.  Maximizing
> > time spent in suspend and minimizing the frequency of net-negative suspend
> > cycles are both important contributors to battery life optimization.  But
> > enabling that optimization - and troubleshooting when things go wrong -
> > requires more observability of suspend/resume/abort behavior than Linux
> > currently provides.  While mechanisms like `/sys/power/pm_wakeup_irq` and
> > wakeup_source stats are useful, they are incomplete and scattered.  The
> > Android kernel wakeup reason patches implement significant improvements in
> > that area.
> >
> > # Features
> >
> > As of today, the active set of patches surface the following
> > suspend-related data:
> >
> > * wakeup IRQs, including:
> >    * multiple IRQs if more than one is pending during resume flow
> >    * unmapped HW IRQs (wakeup-capable in HW) that should not be
> >      occurring
> >    * misconfigured IRQs (e.g. both enable_irq_wake() and
> >      IRQF_NO_SUSPEND)
> >    * threaded IRQs (not just the parent chip's IRQ)
> >
> > * non-IRQ wakeups, including:
> >    * wakeups caused by an IRQ that was consumed by lower-level SW
> >    * wakeups from SOC architecture that don't manifest as IRQs
> >
> > * abort reasons, including:
> >    * wakeup_source activity
> >    * failure to freeze userspace
> >    * failure to suspend devices
> >    * failed syscore_suspend callback
> >
> > * durations from the most recent cycle, including:
> >    * time spent doing suspend/resume work
> >    * time spent in suspend
> >
> > In addition to battery life optimization and troubleshooting, some of these
> > capabilities also lay the groundwork for efforts around improving
> > attribution of wakeups/aborts (e.g. to specific processes, device features,
> > external devices, etc).
> >
> > # Shortcomings
> >
> > While the core implementation (see below) is relatively straightforward and
> > localized, calls into that core are somewhat widely spread in order to
> > capture the breadth of events of interest.  The pervasiveness of those
> > hooks is clearly an area where improvement would be beneficial, especially
> > if a cleaner solution preserved equivalent capabilities.
> >
> > # Existing Code
> >
> > As a reference for how Android currently implements the core code for these
> > features (which would need a bit of work before submission even if all
> > features were included), see the following link:
> >
> > https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/kernel/power/wakeup_reason.c
>
> So as Zichar said, this is quite heavy-weight.
>
> I'm not fundamentally against adding more infrastructure to help
> identify issues related to system suspend, but there needs to be a
> clear benefit associated with any change in this direction.

That said, the general idea behind wakeup_source objects is that every
system wakeup event should be recorded in one of them which then can
be used for later analysis.

If there are reasons why this cannot work in general, what are they?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] PM: suspend: Upstreaming wakeup reason capture support
  2022-01-10 18:49 Kelly Rossmoyer
  2022-01-24 17:37 ` John Stultz
@ 2022-01-27 19:54 ` Rafael J. Wysocki
  2022-01-27 20:10   ` Rafael J. Wysocki
  1 sibling, 1 reply; 10+ messages in thread
From: Rafael J. Wysocki @ 2022-01-27 19:54 UTC (permalink / raw)
  To: Kelly Rossmoyer
  Cc: Rafael J. Wysocki, Pavel Machek, Len Brown, Greg Kroah-Hartman,
	Lee Jones, Vijay Nayak, Linux PM, Linux Kernel Mailing List

On Mon, Jan 10, 2022 at 7:49 PM Kelly Rossmoyer <krossmo@google.com> wrote:
>
> # Introduction
>
> To aid optimization, troubleshooting, and attribution of battery life, the
> Android kernel currently includes a set of patches which provide enhanced
> visibility into kernel suspend/resume/abort behaviors.  The capabilities
> and implementation of this feature have evolved significantly since an
> unsuccessful attempt to upstream the original code
> (https://lkml.org/lkml/2014/3/10/716), and we would like to (re)start a
> conversation about upstreaming, starting with the central question: is
> there support for upstreaming this set of features?
>
> # Motivation
>
> Of the many factors influencing battery life on Linux-powered mobile
> devices, kernel suspend tends to be amongst the most impactful.  Maximizing
> time spent in suspend and minimizing the frequency of net-negative suspend
> cycles are both important contributors to battery life optimization.  But
> enabling that optimization - and troubleshooting when things go wrong -
> requires more observability of suspend/resume/abort behavior than Linux
> currently provides.  While mechanisms like `/sys/power/pm_wakeup_irq` and
> wakeup_source stats are useful, they are incomplete and scattered.  The
> Android kernel wakeup reason patches implement significant improvements in
> that area.
>
> # Features
>
> As of today, the active set of patches surface the following
> suspend-related data:
>
> * wakeup IRQs, including:
>    * multiple IRQs if more than one is pending during resume flow
>    * unmapped HW IRQs (wakeup-capable in HW) that should not be
>      occurring
>    * misconfigured IRQs (e.g. both enable_irq_wake() and
>      IRQF_NO_SUSPEND)
>    * threaded IRQs (not just the parent chip's IRQ)
>
> * non-IRQ wakeups, including:
>    * wakeups caused by an IRQ that was consumed by lower-level SW
>    * wakeups from SOC architecture that don't manifest as IRQs
>
> * abort reasons, including:
>    * wakeup_source activity
>    * failure to freeze userspace
>    * failure to suspend devices
>    * failed syscore_suspend callback
>
> * durations from the most recent cycle, including:
>    * time spent doing suspend/resume work
>    * time spent in suspend
>
> In addition to battery life optimization and troubleshooting, some of these
> capabilities also lay the groundwork for efforts around improving
> attribution of wakeups/aborts (e.g. to specific processes, device features,
> external devices, etc).
>
> # Shortcomings
>
> While the core implementation (see below) is relatively straightforward and
> localized, calls into that core are somewhat widely spread in order to
> capture the breadth of events of interest.  The pervasiveness of those
> hooks is clearly an area where improvement would be beneficial, especially
> if a cleaner solution preserved equivalent capabilities.
>
> # Existing Code
>
> As a reference for how Android currently implements the core code for these
> features (which would need a bit of work before submission even if all
> features were included), see the following link:
>
> https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/kernel/power/wakeup_reason.c

So as Zichar said, this is quite heavy-weight.

I'm not fundamentally against adding more infrastructure to help
identify issues related to system suspend, but there needs to be a
clear benefit associated with any change in this direction.  Also
adding significant overhead just for this purpose alone is rather out
of the question.

I would advise you to follow the suggestion to split the work into
smaller pieces and submit them one at a time, possibly starting with
the ones bringing the most significant benefits to the table.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] PM: suspend: Upstreaming wakeup reason capture support
  2022-01-24 17:37 ` John Stultz
@ 2022-01-26  5:09   ` Zichar Zhang
  2022-01-29  7:52     ` Kelly Rossmoyer
  0 siblings, 1 reply; 10+ messages in thread
From: Zichar Zhang @ 2022-01-26  5:09 UTC (permalink / raw)
  To: John Stultz
  Cc: Kelly Rossmoyer, Rafael J. Wysocki, Pavel Machek, Len Brown,
	Greg Kroah-Hartman, Lee Jones, Vijay Nayak, linux-pm,
	linux-kernel, Sumit Semwal

hello John,  Kelly!

I'm working on upstream "wakeup_reason" patch.
But I found It's a little bit hard to do that, cause it is too heavy.

The android patch put "wakeup_reason",  suspend/resume process and
interrupt subsystem together make it hard to upstream all these at once. I
think maybe we can cooperate with each other, and do it step by step.
And I make a simplified "wakeup_reason" patch as the start.

It just report the  normal IRQs which cause system wake up from suspend
and non-device "wakeups" which use kernel interface "wakeup source". It
can cover most of the situations.

Here I placed the patch, so that people can review it:
https://android-review.googlesource.com/c/kernel/common/+/1958188

Here is the differences from the requirement:

>> As of today, the active set of patches surface the following
>> suspend-related data:
>>
>> * wakeup IRQs, including:
>>    * multiple IRQs if more than one is pending during resume flow
>>    * unmapped HW IRQs (wakeup-capable in HW) that should not be
>>      occurring
>>    * misconfigured IRQs (e.g. both enable_irq_wake() and
>>      IRQF_NO_SUSPEND)
>>    * threaded IRQs (not just the parent chip's IRQ)

    -- If we do these things, the additional codes should be added in
    interrupt subsystem or some interrupt controller drivers. These
    codes are no relationship with the interrupt subsystem or the
    interrupt controllers. And we can't distinguish these IRQs from
    "non-suspend" environment, even the "Android patch" can't do that.
    So it will make the things complicated.
    -- If these IRQs do hanpen, the code in this commit will catch
    them as "unknown wakeup reason" and suggest user to check the
    kernel log for more information.
    -- I think We should cooperate with interrupt subsystem to log
    these "abnormal" IRQs. And it could be several additional
    commits to accomplish this work together then.

>> * abort reasons, including:
>>    * wakeup_source activity
>>    * failure to freeze userspace
>>    * failure to suspend devices
>>    * failed syscore_suspend callback

    -- These codes are "intrusive" to kernel (suspend/resume).
    -- These "errors" are already printed in kernel log, if we add
    these log to "wakeup_reason" either, it will cause double
    "log string" in section ".data", which just waste of the memory.
    -- As mentioned before, if these "abort action" happened, you
    can catch string "unknown wakeup reason", and check the kernel
    log then.

>> * durations from the most recent cycle, including:
>>    * time spent doing suspend/resume work
>>    * time spent in suspend

    -- Just separate these from "wakeup reason".
    It should be done in another commit.

The patch is not complete, these is the next steps:
1. add interface to show time spend in suspend/resume work.
(after this, I think it could be works and replace the android patch)
2. we need solve the "unmapped HW IRQs", "misconfigured IRQs" and
"threaded IRQs" problem.
(this is the hardest one)
3. add a string report interface for the "wakeup reason" not belongs to
any thing above.
(kernel actually didn't know the hardware wake up reason. The IRQs just
one of the reason for wake up. So it just report a "wakeup reason" from
interrupt subsystem. It just a coincidence that most hardware "wakeup
reason" is also the interupt signal.  Even the "interrupt" and "wake up"
signal are separated from each other in GIC700.
So give them a chance to report the their "wakeup reason".)

This patch can be tested in Android using the following change to AOSP:
 https://android-review.googlesource.com/c/platform/system/hardware/interfaces/+/1958666

And there is a stress test code for the interfaces in kernel:
 https://android-review.googlesource.com/c/kernel/common/+/1958189

Best
zichar


On Tue, 25 Jan 2022 at 01:38, John Stultz <john.stultz@linaro.org> wrote:
>
> On Tue, Jan 11, 2022 at 5:06 AM Kelly Rossmoyer <krossmo@google.com> wrote:
> >
> > # Introduction
> >
> > To aid optimization, troubleshooting, and attribution of battery life, the
> > Android kernel currently includes a set of patches which provide enhanced
> > visibility into kernel suspend/resume/abort behaviors.  The capabilities
> > and implementation of this feature have evolved significantly since an
> > unsuccessful attempt to upstream the original code
> > (https://lkml.org/lkml/2014/3/10/716), and we would like to (re)start a
> > conversation about upstreaming, starting with the central question: is
> > there support for upstreaming this set of features?
> >
> > # Motivation
> >
> > Of the many factors influencing battery life on Linux-powered mobile
> > devices, kernel suspend tends to be amongst the most impactful.  Maximizing
> > time spent in suspend and minimizing the frequency of net-negative suspend
> > cycles are both important contributors to battery life optimization.  But
> > enabling that optimization - and troubleshooting when things go wrong -
> > requires more observability of suspend/resume/abort behavior than Linux
> > currently provides.  While mechanisms like `/sys/power/pm_wakeup_irq` and
> > wakeup_source stats are useful, they are incomplete and scattered.  The
> > Android kernel wakeup reason patches implement significant improvements in
> > that area.
> >
> > # Features
> >
> > As of today, the active set of patches surface the following
> > suspend-related data:
> >
> > * wakeup IRQs, including:
> >    * multiple IRQs if more than one is pending during resume flow
> >    * unmapped HW IRQs (wakeup-capable in HW) that should not be
> >      occurring
> >    * misconfigured IRQs (e.g. both enable_irq_wake() and
> >      IRQF_NO_SUSPEND)
> >    * threaded IRQs (not just the parent chip's IRQ)
> >
> > * non-IRQ wakeups, including:
> >    * wakeups caused by an IRQ that was consumed by lower-level SW
> >    * wakeups from SOC architecture that don't manifest as IRQs
> >
> > * abort reasons, including:
> >    * wakeup_source activity
> >    * failure to freeze userspace
> >    * failure to suspend devices
> >    * failed syscore_suspend callback
> >
> > * durations from the most recent cycle, including:
> >    * time spent doing suspend/resume work
> >    * time spent in suspend
> >
> > In addition to battery life optimization and troubleshooting, some of these
> > capabilities also lay the groundwork for efforts around improving
> > attribution of wakeups/aborts (e.g. to specific processes, device features,
> > external devices, etc).
> >
> > # Shortcomings
> >
> > While the core implementation (see below) is relatively straightforward and
> > localized, calls into that core are somewhat widely spread in order to
> > capture the breadth of events of interest.  The pervasiveness of those
> > hooks is clearly an area where improvement would be beneficial, especially
> > if a cleaner solution preserved equivalent capabilities.
> >
> > # Existing Code
> >
> > As a reference for how Android currently implements the core code for these
> > features (which would need a bit of work before submission even if all
> > features were included), see the following link:
> >
> > https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/kernel/power/wakeup_reason.c
> >
>
> Hey Kelly!
>   So Zichar (added to the thread here) has been working for a little
> while on his own approach to upstream a simplified version of the
> wakeup_reason functionality. He's just gotten it to a place where it
> can be shared, so I wanted to pull him in so he could reply with his
> proposal.
>
> thanks
> -john

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC] PM: suspend: Upstreaming wakeup reason capture support
  2022-01-10 18:49 Kelly Rossmoyer
@ 2022-01-24 17:37 ` John Stultz
  2022-01-26  5:09   ` Zichar Zhang
  2022-01-27 19:54 ` Rafael J. Wysocki
  1 sibling, 1 reply; 10+ messages in thread
From: John Stultz @ 2022-01-24 17:37 UTC (permalink / raw)
  To: Kelly Rossmoyer, Zichar Zhang
  Cc: Rafael J. Wysocki, Pavel Machek, Len Brown, Greg Kroah-Hartman,
	Lee Jones, Vijay Nayak, linux-pm, linux-kernel, Sumit Semwal

On Tue, Jan 11, 2022 at 5:06 AM Kelly Rossmoyer <krossmo@google.com> wrote:
>
> # Introduction
>
> To aid optimization, troubleshooting, and attribution of battery life, the
> Android kernel currently includes a set of patches which provide enhanced
> visibility into kernel suspend/resume/abort behaviors.  The capabilities
> and implementation of this feature have evolved significantly since an
> unsuccessful attempt to upstream the original code
> (https://lkml.org/lkml/2014/3/10/716), and we would like to (re)start a
> conversation about upstreaming, starting with the central question: is
> there support for upstreaming this set of features?
>
> # Motivation
>
> Of the many factors influencing battery life on Linux-powered mobile
> devices, kernel suspend tends to be amongst the most impactful.  Maximizing
> time spent in suspend and minimizing the frequency of net-negative suspend
> cycles are both important contributors to battery life optimization.  But
> enabling that optimization - and troubleshooting when things go wrong -
> requires more observability of suspend/resume/abort behavior than Linux
> currently provides.  While mechanisms like `/sys/power/pm_wakeup_irq` and
> wakeup_source stats are useful, they are incomplete and scattered.  The
> Android kernel wakeup reason patches implement significant improvements in
> that area.
>
> # Features
>
> As of today, the active set of patches surface the following
> suspend-related data:
>
> * wakeup IRQs, including:
>    * multiple IRQs if more than one is pending during resume flow
>    * unmapped HW IRQs (wakeup-capable in HW) that should not be
>      occurring
>    * misconfigured IRQs (e.g. both enable_irq_wake() and
>      IRQF_NO_SUSPEND)
>    * threaded IRQs (not just the parent chip's IRQ)
>
> * non-IRQ wakeups, including:
>    * wakeups caused by an IRQ that was consumed by lower-level SW
>    * wakeups from SOC architecture that don't manifest as IRQs
>
> * abort reasons, including:
>    * wakeup_source activity
>    * failure to freeze userspace
>    * failure to suspend devices
>    * failed syscore_suspend callback
>
> * durations from the most recent cycle, including:
>    * time spent doing suspend/resume work
>    * time spent in suspend
>
> In addition to battery life optimization and troubleshooting, some of these
> capabilities also lay the groundwork for efforts around improving
> attribution of wakeups/aborts (e.g. to specific processes, device features,
> external devices, etc).
>
> # Shortcomings
>
> While the core implementation (see below) is relatively straightforward and
> localized, calls into that core are somewhat widely spread in order to
> capture the breadth of events of interest.  The pervasiveness of those
> hooks is clearly an area where improvement would be beneficial, especially
> if a cleaner solution preserved equivalent capabilities.
>
> # Existing Code
>
> As a reference for how Android currently implements the core code for these
> features (which would need a bit of work before submission even if all
> features were included), see the following link:
>
> https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/kernel/power/wakeup_reason.c
>

Hey Kelly!
  So Zichar (added to the thread here) has been working for a little
while on his own approach to upstream a simplified version of the
wakeup_reason functionality. He's just gotten it to a place where it
can be shared, so I wanted to pull him in so he could reply with his
proposal.

thanks
-john

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC] PM: suspend: Upstreaming wakeup reason capture support
@ 2022-01-10 18:49 Kelly Rossmoyer
  2022-01-24 17:37 ` John Stultz
  2022-01-27 19:54 ` Rafael J. Wysocki
  0 siblings, 2 replies; 10+ messages in thread
From: Kelly Rossmoyer @ 2022-01-10 18:49 UTC (permalink / raw)
  To: Rafael J. Wysocki, Pavel Machek, Len Brown, Greg Kroah-Hartman,
	Lee Jones, Vijay Nayak
  Cc: linux-pm, linux-kernel

# Introduction

To aid optimization, troubleshooting, and attribution of battery life, the
Android kernel currently includes a set of patches which provide enhanced
visibility into kernel suspend/resume/abort behaviors.  The capabilities
and implementation of this feature have evolved significantly since an
unsuccessful attempt to upstream the original code
(https://lkml.org/lkml/2014/3/10/716), and we would like to (re)start a
conversation about upstreaming, starting with the central question: is
there support for upstreaming this set of features?

# Motivation

Of the many factors influencing battery life on Linux-powered mobile
devices, kernel suspend tends to be amongst the most impactful.  Maximizing
time spent in suspend and minimizing the frequency of net-negative suspend
cycles are both important contributors to battery life optimization.  But
enabling that optimization - and troubleshooting when things go wrong -
requires more observability of suspend/resume/abort behavior than Linux
currently provides.  While mechanisms like `/sys/power/pm_wakeup_irq` and
wakeup_source stats are useful, they are incomplete and scattered.  The
Android kernel wakeup reason patches implement significant improvements in
that area.

# Features

As of today, the active set of patches surface the following
suspend-related data:

* wakeup IRQs, including:
   * multiple IRQs if more than one is pending during resume flow
   * unmapped HW IRQs (wakeup-capable in HW) that should not be
     occurring
   * misconfigured IRQs (e.g. both enable_irq_wake() and
     IRQF_NO_SUSPEND)
   * threaded IRQs (not just the parent chip's IRQ)

* non-IRQ wakeups, including:
   * wakeups caused by an IRQ that was consumed by lower-level SW
   * wakeups from SOC architecture that don't manifest as IRQs

* abort reasons, including:
   * wakeup_source activity
   * failure to freeze userspace
   * failure to suspend devices
   * failed syscore_suspend callback

* durations from the most recent cycle, including:
   * time spent doing suspend/resume work
   * time spent in suspend

In addition to battery life optimization and troubleshooting, some of these
capabilities also lay the groundwork for efforts around improving
attribution of wakeups/aborts (e.g. to specific processes, device features,
external devices, etc).

# Shortcomings

While the core implementation (see below) is relatively straightforward and
localized, calls into that core are somewhat widely spread in order to
capture the breadth of events of interest.  The pervasiveness of those
hooks is clearly an area where improvement would be beneficial, especially
if a cleaner solution preserved equivalent capabilities.

# Existing Code

As a reference for how Android currently implements the core code for these
features (which would need a bit of work before submission even if all
features were included), see the following link:

https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/kernel/power/wakeup_reason.c


--

Kelly Rossmoyer | Software Engineer | krossmo@google.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-02-02  8:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-30 18:15 [RFC] PM: suspend: Upstreaming wakeup reason capture support Zichar Zhang
  -- strict thread matches above, loose matches on Subject: below --
2022-01-10 18:49 Kelly Rossmoyer
2022-01-24 17:37 ` John Stultz
2022-01-26  5:09   ` Zichar Zhang
2022-01-29  7:52     ` Kelly Rossmoyer
2022-01-27 19:54 ` Rafael J. Wysocki
2022-01-27 20:10   ` Rafael J. Wysocki
2022-01-29  8:26     ` Kelly Rossmoyer
2022-01-30 14:46       ` Rafael J. Wysocki
2022-02-02  7:59         ` Kelly Rossmoyer

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.