linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
@ 2022-10-27 15:19 Sven van Ashbrook
  2022-10-27 15:40 ` Hans de Goede
  0 siblings, 1 reply; 15+ messages in thread
From: Sven van Ashbrook @ 2022-10-27 15:19 UTC (permalink / raw)
  To: LKML
  Cc: platform-driver-x86, Rajneesh Bhardwaj, Rafael J Wysocki,
	Rajat Jain, Sven van Ashbrook, David E Box, Hans de Goede,
	Mark Gross, Rajneesh Bhardwaj

The "failure to enter S0ix" warning is critically important for monitoring
and debugging power regressions, both in the field and in the test lab.

Promote from lower-case warn() to upper-case WARN() so that it becomes
more prominent, and gets picked up as part of existing monitoring
infrastructure, which typically focuses on WARN() and ignores warn()
type log messages.

Signed-off-by: Sven van Ashbrook <svenva@chromium.org>
---
Against v6.1-rc2

 drivers/platform/x86/intel/pmc/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/platform/x86/intel/pmc/core.c b/drivers/platform/x86/intel/pmc/core.c
index a1fe1e0dcf4a5..834f0352c0edf 100644
--- a/drivers/platform/x86/intel/pmc/core.c
+++ b/drivers/platform/x86/intel/pmc/core.c
@@ -2125,7 +2125,7 @@ static __maybe_unused int pmc_core_resume(struct device *dev)
 	}
 
 	/* The real interesting case - S0ix failed - lets ask PMC why. */
-	dev_warn(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n",
+	dev_WARN(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n",
 		 pmcdev->s0ix_counter);
 	if (pmcdev->map->slps0_dbg_maps)
 		pmc_core_slps0_display(pmcdev, dev, NULL);
-- 
2.38.0.135.g90850a2211-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-10-27 15:19 [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN() Sven van Ashbrook
@ 2022-10-27 15:40 ` Hans de Goede
  2022-10-27 15:47   ` Limonciello, Mario
  2022-10-27 16:37   ` David E. Box
  0 siblings, 2 replies; 15+ messages in thread
From: Hans de Goede @ 2022-10-27 15:40 UTC (permalink / raw)
  To: Sven van Ashbrook, LKML
  Cc: platform-driver-x86, Rajneesh Bhardwaj, Rafael J Wysocki,
	Rajat Jain, David E Box, Mark Gross, Rajneesh Bhardwaj

Hi,

On 10/27/22 17:19, Sven van Ashbrook wrote:
> The "failure to enter S0ix" warning is critically important for monitoring
> and debugging power regressions, both in the field and in the test lab.
> 
> Promote from lower-case warn() to upper-case WARN() so that it becomes
> more prominent, and gets picked up as part of existing monitoring
> infrastructure, which typically focuses on WARN() and ignores warn()
> type log messages.
> 
> Signed-off-by: Sven van Ashbrook <svenva@chromium.org>

WARN() is really only intended for internal kernel bugs and not for
hw misbehaving, so I'm not a fan of the change you are suggesting here.

Intel folks, do you have an opinion on this ?

Regards,

Hans


> ---
> Against v6.1-rc2
> 
>  drivers/platform/x86/intel/pmc/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/platform/x86/intel/pmc/core.c b/drivers/platform/x86/intel/pmc/core.c
> index a1fe1e0dcf4a5..834f0352c0edf 100644
> --- a/drivers/platform/x86/intel/pmc/core.c
> +++ b/drivers/platform/x86/intel/pmc/core.c
> @@ -2125,7 +2125,7 @@ static __maybe_unused int pmc_core_resume(struct device *dev)
>  	}
>  
>  	/* The real interesting case - S0ix failed - lets ask PMC why. */
> -	dev_warn(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n",
> +	dev_WARN(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n",
>  		 pmcdev->s0ix_counter);
>  	if (pmcdev->map->slps0_dbg_maps)
>  		pmc_core_slps0_display(pmcdev, dev, NULL);


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-10-27 15:40 ` Hans de Goede
@ 2022-10-27 15:47   ` Limonciello, Mario
  2022-10-27 15:51     ` Sven van Ashbrook
  2022-10-27 16:37   ` David E. Box
  1 sibling, 1 reply; 15+ messages in thread
From: Limonciello, Mario @ 2022-10-27 15:47 UTC (permalink / raw)
  To: Hans de Goede, Sven van Ashbrook, LKML, S-k, Shyam-sundar, rrangel
  Cc: platform-driver-x86, Rajneesh Bhardwaj, Rafael J Wysocki,
	Rajat Jain, David E Box, Mark Gross, Rajneesh Bhardwaj

[Public]

+Shyam & Raul

> -----Original Message-----
> From: Hans de Goede <hdegoede@redhat.com>
> Sent: Thursday, October 27, 2022 10:41
> To: Sven van Ashbrook <svenva@chromium.org>; LKML <linux-
> kernel@vger.kernel.org>
> Cc: platform-driver-x86@vger.kernel.org; Rajneesh Bhardwaj
> <rajneesh.bhardwaj@intel.com>; Rafael J Wysocki <rjw@rjwysocki.net>;
> Rajat Jain <rajatja@google.com>; David E Box <david.e.box@intel.com>;
> Mark Gross <markgross@kernel.org>; Rajneesh Bhardwaj
> <irenic.rajneesh@gmail.com>
> Subject: Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure
> warn() to WARN()
> 
> Hi,
> 
> On 10/27/22 17:19, Sven van Ashbrook wrote:
> > The "failure to enter S0ix" warning is critically important for monitoring
> > and debugging power regressions, both in the field and in the test lab.
> >
> > Promote from lower-case warn() to upper-case WARN() so that it becomes
> > more prominent, and gets picked up as part of existing monitoring
> > infrastructure, which typically focuses on WARN() and ignores warn()
> > type log messages.
> >
> > Signed-off-by: Sven van Ashbrook <svenva@chromium.org>
> 
> WARN() is really only intended for internal kernel bugs and not for
> hw misbehaving, so I'm not a fan of the change you are suggesting here.
> 
> Intel folks, do you have an opinion on this ?

For a reference point; on AMD's implementation of a similar driver (platform/x86/amd/pmc.c)
this "type" of message is also "dev_warn":
https://github.com/torvalds/linux/blob/master/drivers/platform/x86/amd/pmc.c#L365

If we do make changes to this message level so that other infrastructure picks up I suggest
we do it for both drivers.

Are we maybe at the point now it should be dev_err instead?

> 
> Regards,
> 
> Hans
> 
> 
> > ---
> > Against v6.1-rc2
> >
> >  drivers/platform/x86/intel/pmc/core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/platform/x86/intel/pmc/core.c
> b/drivers/platform/x86/intel/pmc/core.c
> > index a1fe1e0dcf4a5..834f0352c0edf 100644
> > --- a/drivers/platform/x86/intel/pmc/core.c
> > +++ b/drivers/platform/x86/intel/pmc/core.c
> > @@ -2125,7 +2125,7 @@ static __maybe_unused int
> pmc_core_resume(struct device *dev)
> >  	}
> >
> >  	/* The real interesting case - S0ix failed - lets ask PMC why. */
> > -	dev_warn(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n",
> > +	dev_WARN(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n",
> >  		 pmcdev->s0ix_counter);
> >  	if (pmcdev->map->slps0_dbg_maps)
> >  		pmc_core_slps0_display(pmcdev, dev, NULL);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-10-27 15:47   ` Limonciello, Mario
@ 2022-10-27 15:51     ` Sven van Ashbrook
       [not found]       ` <CAE2upjS6qRGRcuVYuAB5DMf66A7VcfCKKYEkpsr1My7RnKDFtQ@mail.gmail.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Sven van Ashbrook @ 2022-10-27 15:51 UTC (permalink / raw)
  To: Limonciello, Mario, Hans de Goede
  Cc: LKML, S-k, Shyam-sundar, rrangel, platform-driver-x86,
	Rajneesh Bhardwaj, Rafael J Wysocki, Rajat Jain, David E Box,
	Mark Gross, Rajneesh Bhardwaj

On Thu, Oct 27, 2022 at 11:47 AM Limonciello, Mario
<Mario.Limonciello@amd.com> wrote:
>
> If we do make changes to this message level so that other infrastructure picks up I suggest
> we do it for both drivers.

Sounds good, patch feedback willing.

> WARN() is really only intended for internal kernel bugs and not for
> hw misbehaving, so I'm not a fan of the change you are suggesting here.

Thanks, I was not aware of this distinction. But it does look like upper-case
WARN is already used in many places to indicate hw misbehaving?

Here for example:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/platform/chrome/cros_ec.c?h=v6.1-rc2#n142

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-10-27 15:40 ` Hans de Goede
  2022-10-27 15:47   ` Limonciello, Mario
@ 2022-10-27 16:37   ` David E. Box
  1 sibling, 0 replies; 15+ messages in thread
From: David E. Box @ 2022-10-27 16:37 UTC (permalink / raw)
  To: Hans de Goede, Sven van Ashbrook, LKML
  Cc: platform-driver-x86, Rajneesh Bhardwaj, Rafael J Wysocki,
	Rajat Jain, David E Box, Mark Gross, Rajneesh Bhardwaj

On Thu, 2022-10-27 at 17:40 +0200, Hans de Goede wrote:
> Hi,
> 
> On 10/27/22 17:19, Sven van Ashbrook wrote:
> > The "failure to enter S0ix" warning is critically important for monitoring
> > and debugging power regressions, both in the field and in the test lab.
> > 
> > Promote from lower-case warn() to upper-case WARN() so that it becomes
> > more prominent, and gets picked up as part of existing monitoring
> > infrastructure, which typically focuses on WARN() and ignores warn()
> > type log messages.
> > 
> > Signed-off-by: Sven van Ashbrook <svenva@chromium.org>
> 
> WARN() is really only intended for internal kernel bugs and not for
> hw misbehaving, so I'm not a fan of the change you are suggesting here.
> 
> Intel folks, do you have an opinion on this ?

I agree that not entering s0ix is a critical failure, but this is a hardware
suspend failure. How we treat this should be akin to how we treat failure to
enter S3 or deeper. S0ix support is indicated by the S0 Low Power Idle bit in
the ACPI FADT table. It's better IMO to create some framework in the suspend or
ACPI core that allows platforms to report whether they have achieved the
hardware state indicated by having this bit set. Rafael, your thoughts?

David

> 
> Regards,
> 
> Hans
> 
> 
> > ---
> > Against v6.1-rc2
> > 
> >  drivers/platform/x86/intel/pmc/core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/platform/x86/intel/pmc/core.c
> > b/drivers/platform/x86/intel/pmc/core.c
> > index a1fe1e0dcf4a5..834f0352c0edf 100644
> > --- a/drivers/platform/x86/intel/pmc/core.c
> > +++ b/drivers/platform/x86/intel/pmc/core.c
> > @@ -2125,7 +2125,7 @@ static __maybe_unused int pmc_core_resume(struct
> > device *dev)
> >         }
> >  
> >         /* The real interesting case - S0ix failed - lets ask PMC why. */
> > -       dev_warn(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n",
> > +       dev_WARN(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n",
> >                  pmcdev->s0ix_counter);
> >         if (pmcdev->map->slps0_dbg_maps)
> >                 pmc_core_slps0_display(pmcdev, dev, NULL);
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
       [not found]       ` <CAE2upjS6qRGRcuVYuAB5DMf66A7VcfCKKYEkpsr1My7RnKDFtQ@mail.gmail.com>
@ 2022-10-29  4:11         ` Sven van Ashbrook
  2022-10-31 19:39           ` Limonciello, Mario
  0 siblings, 1 reply; 15+ messages in thread
From: Sven van Ashbrook @ 2022-10-29  4:11 UTC (permalink / raw)
  To: Rajneesh Bhardwaj, Hans de Goede
  Cc: Limonciello, Mario, LKML, S-k, Shyam-sundar, rrangel,
	platform-driver-x86, Rajneesh Bhardwaj, Rafael J Wysocki,
	Rajat Jain, David E Box, Mark Gross

On Thu, Oct 27, 2022 at 12:02 PM Rajneesh Bhardwaj
<irenic.rajneesh@gmail.com> wrote:
> I'd advise against this promotion based on my experience with S0ix entry failures.

On Thu, Oct 27, 2022 at 11:40 AM Hans de Goede <hdegoede@redhat.com> wrote:
> I'm not a fan of the change you are suggesting here.

Thanks everyone for the feedback. Looks like there is consensus that it's
not advisable to promote the warning. We will move forward with changes to
our monitoring infrastructure instead.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-10-29  4:11         ` Sven van Ashbrook
@ 2022-10-31 19:39           ` Limonciello, Mario
  2022-10-31 20:15             ` Hans de Goede
                               ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Limonciello, Mario @ 2022-10-31 19:39 UTC (permalink / raw)
  To: Sven van Ashbrook, Rajneesh Bhardwaj, Hans de Goede
  Cc: LKML, S-k, Shyam-sundar, rrangel, platform-driver-x86,
	Rajneesh Bhardwaj, Rafael J Wysocki, Rajat Jain, David E Box,
	Mark Gross

[Public]

> -----Original Message-----
> From: Sven van Ashbrook <svenva@chromium.org>
> Sent: Friday, October 28, 2022 23:12
> To: Rajneesh Bhardwaj <irenic.rajneesh@gmail.com>; Hans de Goede
> <hdegoede@redhat.com>
> Cc: Limonciello, Mario <Mario.Limonciello@amd.com>; LKML <linux-
> kernel@vger.kernel.org>; S-k, Shyam-sundar <Shyam-sundar.S-
> k@amd.com>; rrangel@chromium.org; platform-driver-
> x86@vger.kernel.org; Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>;
> Rafael J Wysocki <rjw@rjwysocki.net>; Rajat Jain <rajatja@google.com>;
> David E Box <david.e.box@intel.com>; Mark Gross <markgross@kernel.org>
> Subject: Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure
> warn() to WARN()
> 
> On Thu, Oct 27, 2022 at 12:02 PM Rajneesh Bhardwaj
> <irenic.rajneesh@gmail.com> wrote:
> > I'd advise against this promotion based on my experience with S0ix entry
> failures.
> 
> On Thu, Oct 27, 2022 at 11:40 AM Hans de Goede <hdegoede@redhat.com>
> wrote:
> > I'm not a fan of the change you are suggesting here.
> 
> Thanks everyone for the feedback. Looks like there is consensus that it's
> not advisable to promote the warning. We will move forward with changes to
> our monitoring infrastructure instead.

Did you see the idea proposed by David Box to introduce some infrastructure in
the kernel for this?

Just thinking about it a little bit more, it could be a lot nicer to have something like:

/sys/power/suspend_stats/last_hw_deepest_state

During the resume process drivers such as amd_pmc and intel_pmc_core could
read the appropriate values for the hardware and call a function that would
populate it with either a "0" or "1" or maybe even the amount of time spent in
that state.

We could then retire the debugging messages from both drivers and instead of
your infrastructure relying upon scanning logs, userspace could read that sysfs
file after every suspend and raise the alarms when it's 0 (or if it's populated with
time smaller than X% of the total suspend time).

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-10-31 19:39           ` Limonciello, Mario
@ 2022-10-31 20:15             ` Hans de Goede
  2022-10-31 20:20               ` Sven van Ashbrook
       [not found]             ` <CACK8Z6E7=xt118d47FTpmgKHgUBgH48FQzTi5iL90C3MjHb-3Q@mail.gmail.com>
  2022-11-01  1:38             ` Sven van Ashbrook
  2 siblings, 1 reply; 15+ messages in thread
From: Hans de Goede @ 2022-10-31 20:15 UTC (permalink / raw)
  To: Limonciello, Mario, Sven van Ashbrook, Rajneesh Bhardwaj
  Cc: LKML, S-k, Shyam-sundar, rrangel, platform-driver-x86,
	Rajneesh Bhardwaj, Rafael J Wysocki, Rajat Jain, David E Box,
	Mark Gross

Hi,

On 10/31/22 20:39, Limonciello, Mario wrote:
> [Public]
> 
>> -----Original Message-----
>> From: Sven van Ashbrook <svenva@chromium.org>
>> Sent: Friday, October 28, 2022 23:12
>> To: Rajneesh Bhardwaj <irenic.rajneesh@gmail.com>; Hans de Goede
>> <hdegoede@redhat.com>
>> Cc: Limonciello, Mario <Mario.Limonciello@amd.com>; LKML <linux-
>> kernel@vger.kernel.org>; S-k, Shyam-sundar <Shyam-sundar.S-
>> k@amd.com>; rrangel@chromium.org; platform-driver-
>> x86@vger.kernel.org; Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>;
>> Rafael J Wysocki <rjw@rjwysocki.net>; Rajat Jain <rajatja@google.com>;
>> David E Box <david.e.box@intel.com>; Mark Gross <markgross@kernel.org>
>> Subject: Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure
>> warn() to WARN()
>>
>> On Thu, Oct 27, 2022 at 12:02 PM Rajneesh Bhardwaj
>> <irenic.rajneesh@gmail.com> wrote:
>>> I'd advise against this promotion based on my experience with S0ix entry
>> failures.
>>
>> On Thu, Oct 27, 2022 at 11:40 AM Hans de Goede <hdegoede@redhat.com>
>> wrote:
>>> I'm not a fan of the change you are suggesting here.
>>
>> Thanks everyone for the feedback. Looks like there is consensus that it's
>> not advisable to promote the warning. We will move forward with changes to
>> our monitoring infrastructure instead.
> 
> Did you see the idea proposed by David Box to introduce some infrastructure in
> the kernel for this?
> 
> Just thinking about it a little bit more, it could be a lot nicer to have something like:
> 
> /sys/power/suspend_stats/last_hw_deepest_state
> 
> During the resume process drivers such as amd_pmc and intel_pmc_core could
> read the appropriate values for the hardware and call a function that would
> populate it with either a "0" or "1" or maybe even the amount of time spent in
> that state.
> 
> We could then retire the debugging messages from both drivers and instead of
> your infrastructure relying upon scanning logs, userspace could read that sysfs
> file after every suspend and raise the alarms when it's 0 (or if it's populated with
> time smaller than X% of the total suspend time).

Something like this does indeed sound like a nice solution for this.

Regards,

Hans


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-10-31 20:15             ` Hans de Goede
@ 2022-10-31 20:20               ` Sven van Ashbrook
  0 siblings, 0 replies; 15+ messages in thread
From: Sven van Ashbrook @ 2022-10-31 20:20 UTC (permalink / raw)
  To: Hans de Goede
  Cc: Limonciello, Mario, Rajneesh Bhardwaj, LKML, S-k, Shyam-sundar,
	rrangel, platform-driver-x86, Rajneesh Bhardwaj,
	Rafael J Wysocki, Rajat Jain, David E Box, Mark Gross

On Mon, Oct 31, 2022 at 4:15 PM Hans de Goede <hdegoede@redhat.com> wrote:
>
> Something like this does indeed sound like a nice solution for this.

Agreed, much nicer than letting the monitoring infrastructure hunt for
certain dmesg strings, for sure. Thank you David for the suggestion,
and Mario for the opening ideas.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
       [not found]             ` <CACK8Z6E7=xt118d47FTpmgKHgUBgH48FQzTi5iL90C3MjHb-3Q@mail.gmail.com>
@ 2022-10-31 20:55               ` Limonciello, Mario
  2022-11-01 17:24                 ` Sven van Ashbrook
  0 siblings, 1 reply; 15+ messages in thread
From: Limonciello, Mario @ 2022-10-31 20:55 UTC (permalink / raw)
  To: Rajat Jain
  Cc: Sven van Ashbrook, Rajneesh Bhardwaj, Hans de Goede, LKML, S-k,
	Shyam-sundar, rrangel, platform-driver-x86, Rajneesh Bhardwaj,
	Rafael J Wysocki, David E Box, Mark Gross

On 10/31/2022 15:47, Rajat Jain wrote:
> Hello,
> 
> On Mon, Oct 31, 2022 at 12:39 PM Limonciello, Mario
> <Mario.Limonciello@amd.com <mailto:Mario.Limonciello@amd.com>> wrote:
>  >
>  > [Public]
>  >
>  > > -----Original Message-----
>  > > From: Sven van Ashbrook <svenva@chromium.org 
> <mailto:svenva@chromium.org>>
>  > > Sent: Friday, October 28, 2022 23:12
>  > > To: Rajneesh Bhardwaj <irenic.rajneesh@gmail.com 
> <mailto:irenic.rajneesh@gmail.com>>; Hans de Goede
>  > > <hdegoede@redhat.com <mailto:hdegoede@redhat.com>>
>  > > Cc: Limonciello, Mario <Mario.Limonciello@amd.com 
> <mailto:Mario.Limonciello@amd.com>>; LKML <linux-
>  > > kernel@vger.kernel.org <mailto:kernel@vger.kernel.org>>; S-k, 
> Shyam-sundar <Shyam-sundar.S-
>  > > k@amd.com <mailto:k@amd.com>>; rrangel@chromium.org 
> <mailto:rrangel@chromium.org>; platform-driver-
>  > > x86@vger.kernel.org <mailto:x86@vger.kernel.org>; Rajneesh Bhardwaj 
> <rajneesh.bhardwaj@intel.com <mailto:rajneesh.bhardwaj@intel.com>>;
>  > > Rafael J Wysocki <rjw@rjwysocki.net <mailto:rjw@rjwysocki.net>>; 
> Rajat Jain <rajatja@google.com <mailto:rajatja@google.com>>;
>  > > David E Box <david.e.box@intel.com <mailto:david.e.box@intel.com>>; 
> Mark Gross <markgross@kernel.org <mailto:markgross@kernel.org>>
>  > > Subject: Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix 
> failure
>  > > warn() to WARN()
>  > >
>  > > On Thu, Oct 27, 2022 at 12:02 PM Rajneesh Bhardwaj
>  > > <irenic.rajneesh@gmail.com <mailto:irenic.rajneesh@gmail.com>> wrote:
>  > > > I'd advise against this promotion based on my experience with 
> S0ix entry
>  > > failures.
>  > >
>  > > On Thu, Oct 27, 2022 at 11:40 AM Hans de Goede <hdegoede@redhat.com 
> <mailto:hdegoede@redhat.com>>
>  > > wrote:
>  > > > I'm not a fan of the change you are suggesting here.
>  > >
>  > > Thanks everyone for the feedback. Looks like there is consensus 
> that it's
>  > > not advisable to promote the warning. We will move forward with 
> changes to
>  > > our monitoring infrastructure instead.
>  >
>  > Did you see the idea proposed by David Box to introduce some 
> infrastructure in
>  > the kernel for this?
>  >
>  > Just thinking about it a little bit more, it could be a lot nicer to 
> have something like:
>  >
>  > /sys/power/suspend_stats/last_hw_deepest_state
>  >
>  > During the resume process drivers such as amd_pmc and intel_pmc_core 
> could
>  > read the appropriate values for the hardware and call a function that 
> would
>  > populate it with either a "0" or "1" or maybe even the amount of time 
> spent in
>  > that state.
>  >
>  > We could then retire the debugging messages from both drivers
> 
> I do not think we should retire the debug messages. The sysfs
> attribute could help us *trigger* a failure detection, but we would
> still need these debug logs to actually determine why exactly we did
> not go into the S0ix / deepest power state (And the debug messages
> print out the debug register bit fields that let us know that).
> 
> Thanks,
> 

I just spun together an RFC series for this idea and while doing it I 
had the same realization.  So I left the warning messages in place for 
both drivers.

You can take a look at the series here:

https://lore.kernel.org/platform-driver-x86/20221031204320.22464-1-mario.limonciello@amd.com/T/#m6c7db55c98b8a3ce8c48d451fc01c1d9b0df37fb

Thanks,




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-10-31 19:39           ` Limonciello, Mario
  2022-10-31 20:15             ` Hans de Goede
       [not found]             ` <CACK8Z6E7=xt118d47FTpmgKHgUBgH48FQzTi5iL90C3MjHb-3Q@mail.gmail.com>
@ 2022-11-01  1:38             ` Sven van Ashbrook
  2022-11-01  1:58               ` Mario Limonciello
  2 siblings, 1 reply; 15+ messages in thread
From: Sven van Ashbrook @ 2022-11-01  1:38 UTC (permalink / raw)
  To: Limonciello, Mario
  Cc: Rajneesh Bhardwaj, Hans de Goede, LKML, S-k, Shyam-sundar,
	rrangel, platform-driver-x86, Rajneesh Bhardwaj,
	Rafael J Wysocki, Rajat Jain, David E Box, Mark Gross

On Mon, Oct 31, 2022 at 3:39 PM Limonciello, Mario
<Mario.Limonciello@amd.com> wrote:
>
> Just thinking about it a little bit more, it could be a lot nicer to have something like:
>
> /sys/power/suspend_stats/last_hw_deepest_state

While I agree that reporting through a framework is generally better
than getting infrastructure to grep for specific strings, I believe
that a simple sysfs file is probably too simplistic.

1. We need more sophisticated reporting than just last_hw_deepest_state:

- sometimes the system enters the deep state we want, yet after a
while moves back up and gets "stuck" in an intermediate state (below
S0). Or, the system enters the deep state we want, but moves back to
S0 after a time without apparent reason. These platform-dependent
failures are not so easily describable in a generic framework.

- ChromeOS in particular has multiple independent S0ix / S3 / s2idle
failure report sources. We have the kernel warning above; also our
Embedded Controller monitors suspend failure cases which the simple
kernel warning cannot catch, reported through a separate WARN_ONCE().

2. A simple sysfs file will need to be polled by the infrastructure
after every suspend; it would be preferable to have some signal or
callback which the infrastructure could register itself with.

The generic infrastructure to support this sounds like quite a bit of
work, and for what gain? Compared to simply matching a log string and
sending the whole dmesg if there's a match.

Is the light worth the candle?

Sven

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-11-01  1:38             ` Sven van Ashbrook
@ 2022-11-01  1:58               ` Mario Limonciello
  2022-11-01 13:50                 ` Sven van Ashbrook
  0 siblings, 1 reply; 15+ messages in thread
From: Mario Limonciello @ 2022-11-01  1:58 UTC (permalink / raw)
  To: Sven van Ashbrook
  Cc: Rajneesh Bhardwaj, Hans de Goede, LKML, S-k, Shyam-sundar,
	rrangel, platform-driver-x86, Rajneesh Bhardwaj,
	Rafael J Wysocki, Rajat Jain, David E Box, Mark Gross

On 10/31/22 20:38, Sven van Ashbrook wrote:
> On Mon, Oct 31, 2022 at 3:39 PM Limonciello, Mario
> <Mario.Limonciello@amd.com> wrote:
>>
>> Just thinking about it a little bit more, it could be a lot nicer to have something like:
>>
>> /sys/power/suspend_stats/last_hw_deepest_state
> 
> While I agree that reporting through a framework is generally better
> than getting infrastructure to grep for specific strings, I believe
> that a simple sysfs file is probably too simplistic.
> 
> 1. We need more sophisticated reporting than just last_hw_deepest_state:
> 
> - sometimes the system enters the deep state we want, yet after a
> while moves back up and gets "stuck" in an intermediate state (below
> S0). Or, the system enters the deep state we want, but moves back to
> S0 after a time without apparent reason. These platform-dependent
> failures are not so easily describable in a generic framework.

I actually thought that by putting the duration of time put in 
last_hw_deepest_state you'll be able to catch this by comparing the 
duration of the suspend to the duration of last_hw_deepest_state.

If you're below some threshold of percent for suspends that are at least 
some other threshold long you can trigger the failure.

This then lets you tune your framework to find the right place for those
thresholds too without needing to change the kernel.

> 
> - ChromeOS in particular has multiple independent S0ix / S3 / s2idle
> failure report sources. We have the kernel warning above; also our
> Embedded Controller monitors suspend failure cases which the simple
> kernel warning cannot catch, reported through a separate WARN_ONCE().
>  > 2. A simple sysfs file will need to be polled by the infrastructure
> after every suspend; it would be preferable to have some signal or
> callback which the infrastructure could register itself with.

The interface to trigger a suspend is writing a value into 
/sys/power/state.  You'll get a return code from this, but this return 
code does not represent whether you got to the deepest state, just 
whether the suspend succeeded or not.

So what would an ideal interface that sends a signal that the last 
"successful" suspend didn't get to the deepest state look like to you?

> 
> The generic infrastructure to support this sounds like quite a bit of
> work, and for what gain? Compared to simply matching a log string and
> sending the whole dmesg if there's a match.

I would like to think it's cheaper to read the sysfs file, do a local 
comparison on HW deepest time to the suspend time and then only send the 
the dmesg up for further analysis.

> 
> Is the light worth the candle?

I wrote an RFC that I sent out for it with my ideas at least.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-11-01  1:58               ` Mario Limonciello
@ 2022-11-01 13:50                 ` Sven van Ashbrook
  0 siblings, 0 replies; 15+ messages in thread
From: Sven van Ashbrook @ 2022-11-01 13:50 UTC (permalink / raw)
  To: Mario Limonciello
  Cc: Rajneesh Bhardwaj, Hans de Goede, LKML, S-k, Shyam-sundar,
	rrangel, platform-driver-x86, Rajneesh Bhardwaj,
	Rafael J Wysocki, Rajat Jain, David E Box, Mark Gross

On Mon, Oct 31, 2022 at 9:58 PM Mario Limonciello
<mario.limonciello@amd.com> wrote:
>
> I actually thought that by putting the duration of time put in
> last_hw_deepest_state you'll be able to catch this by comparing the
> duration of the suspend to the duration of last_hw_deepest_state.

I'm not sure if this method would catch all or even most suspend
failures. That's why the EC monitoring of S0ix was devised. I will
circulate this internally, see what comes back.

> >
> > Is the light worth the candle?
>
> I wrote an RFC that I sent out for it with my ideas at least.
>

That is much appreciated ! Yet even for good ideas, it's often
necessary to weigh the benefits and downsides of the intervention.
Perhaps we can get some pros/cons feedback from other stakeholders ?

Sven

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-10-31 20:55               ` Limonciello, Mario
@ 2022-11-01 17:24                 ` Sven van Ashbrook
  2022-11-01 20:30                   ` David E. Box
  0 siblings, 1 reply; 15+ messages in thread
From: Sven van Ashbrook @ 2022-11-01 17:24 UTC (permalink / raw)
  To: Limonciello, Mario
  Cc: Rajat Jain, Rajneesh Bhardwaj, Hans de Goede, LKML, S-k,
	Shyam-sundar, rrangel, platform-driver-x86, Rafael J Wysocki,
	David E Box, Mark Gross, swboyd

On Mon, Oct 31, 2022 at 4:55 PM Limonciello, Mario
<mario.limonciello@amd.com> wrote:
>
> I just spun together an RFC series for this idea and while doing it I
> had the same realization.  So I left the warning messages in place for
> both drivers.
>
> You can take a look at the series here:
>
> https://lore.kernel.org/platform-driver-x86/20221031204320.22464-1-mario.limonciello@amd.com/T/#m6c7db55c98b8a3ce8c48d451fc01c1d9b0df37fb
>

We've had some internal discussions within ChromeOS intel big core,
and we believe this is a worthwhile effort, and we are supportive, as
long as our current S0ix fail detection will not break for the
foreseeable future, i.e. as long as the warning message and register
dump stays in place. Which is the case for your RFC.

+swboyd@chromium.org who expressed interest in doing something similar for ARM.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
  2022-11-01 17:24                 ` Sven van Ashbrook
@ 2022-11-01 20:30                   ` David E. Box
  0 siblings, 0 replies; 15+ messages in thread
From: David E. Box @ 2022-11-01 20:30 UTC (permalink / raw)
  To: Sven van Ashbrook, Limonciello, Mario
  Cc: Rajat Jain, Rajneesh Bhardwaj, Hans de Goede, LKML, S-k,
	Shyam-sundar, rrangel, platform-driver-x86, Rafael J Wysocki,
	David E Box, Mark Gross, swboyd

On Tue, 2022-11-01 at 13:24 -0400, Sven van Ashbrook wrote:
> On Mon, Oct 31, 2022 at 4:55 PM Limonciello, Mario
> <mario.limonciello@amd.com> wrote:
> > I just spun together an RFC series for this idea and while doing it I
> > had the same realization.  So I left the warning messages in place for
> > both drivers.
> > 
> > You can take a look at the series here:
> > 
> > https://lore.kernel.org/platform-driver-x86/20221031204320.22464-1-mario.limonciello@amd.com/T/#m6c7db55c98b8a3ce8c48d451fc01c1d9b0df37fb
> > 
> 
> We've had some internal discussions within ChromeOS intel big core,
> and we believe this is a worthwhile effort, and we are supportive, as
> long as our current S0ix fail detection will not break for the
> foreseeable future, i.e. as long as the warning message and register
> dump stays in place. Which is the case for your RFC.

Yeah, I did not see this as a replacement for anything in the pmc drivers. Given
the prevalence of S0ix, and that hardware based low power idle support is
indicated in the FADT (so part of the standard) it makes sense to have it
tracked by the suspend core, particularly when it's being used as a replacement
for S3. We don't need to collect any implementation or debug details there. Only
detect when it's available, being used for suspend, and being achieved. Maybe
residency information as well if available but that's it. Other information is
separate and should be contained to the individual drivers which have the
detailed platform knowledge.

David

> 
> +swboyd@chromium.org who expressed interest in doing something similar for
> ARM.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-11-01 20:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-27 15:19 [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN() Sven van Ashbrook
2022-10-27 15:40 ` Hans de Goede
2022-10-27 15:47   ` Limonciello, Mario
2022-10-27 15:51     ` Sven van Ashbrook
     [not found]       ` <CAE2upjS6qRGRcuVYuAB5DMf66A7VcfCKKYEkpsr1My7RnKDFtQ@mail.gmail.com>
2022-10-29  4:11         ` Sven van Ashbrook
2022-10-31 19:39           ` Limonciello, Mario
2022-10-31 20:15             ` Hans de Goede
2022-10-31 20:20               ` Sven van Ashbrook
     [not found]             ` <CACK8Z6E7=xt118d47FTpmgKHgUBgH48FQzTi5iL90C3MjHb-3Q@mail.gmail.com>
2022-10-31 20:55               ` Limonciello, Mario
2022-11-01 17:24                 ` Sven van Ashbrook
2022-11-01 20:30                   ` David E. Box
2022-11-01  1:38             ` Sven van Ashbrook
2022-11-01  1:58               ` Mario Limonciello
2022-11-01 13:50                 ` Sven van Ashbrook
2022-10-27 16:37   ` David E. Box

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).