From: "Luck, Tony" <tony.luck@intel.com>
To: Borislav Petkov <bp@alien8.de>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"bberg@redhat.com" <bberg@redhat.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"hdegoede@redhat.com" <hdegoede@redhat.com>,
	"ckellner@redhat.com" <ckellner@redhat.com>
Subject: Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages
Date: Fri, 18 Oct 2019 11:02:57 -0700
Message-ID: <20191018180257.GA23835@agluck-desk2.amr.corp.intel.com>
In-Reply-To: <20191018132309.GD17053@zn.tnic>

On Fri, Oct 18, 2019 at 03:23:09PM +0200, Borislav Petkov wrote:
> On Fri, Oct 18, 2019 at 05:26:36AM -0700, Srinivas Pandruvada wrote:
> > Server/desktops generally rely on the embedded controller for FAN
> > control, which  kernel have no control. For them this warning helps to
> > either bring in additional cooling or fix existing cooling.
> 
> How exactly does this warning help? A detailed example please.
> 
> > If something needs to force throttle from kernel, then we should use
> > some offset from the max temperature (aka TJMax), instead of this
> > warning threshold. Then we can use idle injection or change duty cycle
> > of CPU clocks.
> 
> Yes, as I said, all this needs to be properly defined first. That is,
> *if* there's even need for reacting to thermal interrupts in the kernel.

Recap:

We are starting from a place where the kernel prints a message.

A patch is already in flight to reduce the severity of the message
(since users are seeing it, and find it annoying/unhelpful that
it has such a high severity).

Srinivas has asserted that in many cases we can eliminate the
message, but he wants to keep it if it seems that there
is something really wrong.

---

So what should we do next?  I don't think there is much by way
of actions that the kernel should take.  While we could stop
scheduling processes, the h/w and f/w have better tools to
reduce frequency, inject idle cycles, speed up fans, etc.
If you do have ideas ... then please share.

So this thread is now about properly defining what actions
Linux should take.

The proposal on the table is the algorithm embodied in Srinivas'
patch (which originated from Alan Cox).

I.e.
1) ignore short excursions above this threshold.
2) Print a message for persistent problems.
3) Keep a record of total time spent above threshold.
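
To make the three steps concrete, here is a rough userspace sketch in C.
It is not the code from Srinivas' patch: the struct, the function name
throttle_sample(), and the 2000 ms value of SHORT_EXCURSION_MS are all
made up for illustration, and a real implementation would be driven by
the thermal interrupt rather than by explicit samples.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cutoff: excursions shorter than this are ignored. */
#define SHORT_EXCURSION_MS 2000ULL

/* Illustrative per-CPU state; names are assumptions, not kernel API. */
struct throttle_state {
	bool above;              /* currently above the threshold? */
	bool warned;             /* message already printed for this excursion? */
	uint64_t start_ms;       /* when the current excursion began */
	uint64_t total_above_ms; /* 3) total time spent above threshold */
	unsigned int warnings;   /* number of messages printed so far */
};

/*
 * Feed one observation ("hot" means above the threshold at time now_ms).
 * Returns true only when a message was printed for this observation.
 */
static bool throttle_sample(struct throttle_state *s, uint64_t now_ms, bool hot)
{
	if (hot && !s->above) {
		/* An excursion starts; stay quiet for now. */
		s->above = true;
		s->warned = false;
		s->start_ms = now_ms;
		return false;
	}

	if (!hot && s->above) {
		/* Excursion ends; account for its whole duration. */
		s->total_above_ms += now_ms - s->start_ms;
		s->above = false;
		return false;
	}

	/* 1) Short excursions never reach this cutoff, so they are ignored. */
	if (hot && !s->warned && now_ms - s->start_ms >= SHORT_EXCURSION_MS) {
		/* 2) Persistent problem: print once per excursion. */
		s->warned = true;
		s->warnings++;
		printf("CPU above thermal threshold for %llu ms\n",
		       (unsigned long long)(now_ms - s->start_ms));
		return true;
	}

	return false;
}
```

The only tunable here is SHORT_EXCURSION_MS, which is exactly the
number that still needs a definition (and might need to be per
platform).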

If that's a reasonable approach, then we just need to come
up with a way to define "short excursion" (which might be
platform dependent). If someone has a brilliant idea on
how to do that, we can use it. If not, we #define a number.

If it isn't reasonable ... then propose something better.

-Tony
