From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
To: Borislav Petkov <bp@alien8.de>
Cc: tony.luck@intel.com, tglx@linutronix.de, mingo@redhat.com,
	hpa@zytor.com, bberg@redhat.com, x86@kernel.org,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	hdegoede@redhat.com, ckellner@redhat.com
Subject: Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages
Date: Mon, 14 Oct 2019 15:41:38 -0700
Message-ID: <3055e340ebaba9f8fb587a11ce3f25cf33919ab3.camel@linux.intel.com>
In-Reply-To: <20191014213618.GK4715@zn.tnic>

On Mon, 2019-10-14 at 23:36 +0200, Borislav Petkov wrote:
> On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote:
> > Some modern systems have very tight thermal tolerances. Because of this
> > they may cross thermal thresholds when running normal workloads (even
> > during boot). The CPU hardware will react by limiting power/frequency
> > and using duty cycles to bring the temperature back into normal range.
> > 
> > Thus users may see a "critical" message about the "temperature above
> > threshold" which is soon followed by "temperature/speed normal". These
> > messages are rate limited, but still may repeat every few minutes.
> > 
> > The solution here is to set a timeout when the temperature first exceeds
> > the threshold. If the CPU returns to normal before the timeout fires,
> > we skip printing any messages. If we reach the timeout, then there may be
> > a real thermal issue (e.g. inoperative or blocked fan) and we print the
> > message (together with a count of how many thermal events have occurred).
> > A rate control method is used to avoid printing repeatedly on these
> > broken systems.
> > 
> > Some experimentation with fans enabled showed that temperature returned
> > to normal on a laptop in ~4 seconds. With fans disabled it took over 10
> > seconds. Default timeout is thus set to 8 seconds, but may be changed
> > with kernel boot parameter: "x86_therm_warn_delay". This default interval
> > is twice the typical sampling interval for cooling using running average
> > power limit from user space thermal control software.
> > 
> > In addition, a new sysfs attribute is added to show the maximum amount
> > of time in milliseconds the system was in throttled state. This will
> > allow changing x86_therm_warn_delay, if required.
> 
> This description is already *begging* for this delay value to be
> automatically set by the kernel. Putting yet another knob in front of
> the user who doesn't have a clue most of the time shows one more time
> that we haven't done our job properly by asking her to know what we
> already do.
I experimented on systems released since the Sandy Bridge era. But for
someone running a 10-year-old system, this is a fallback mechanism.
I don't expect users to have to tune the default, but it is difficult to
say that with certainty. The source of this PROCHOT signal can be
anything on the board.
So users who had issues on their systems can try this patch.
We can get rid of this until it becomes a real issue.
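
For reference, the timeout scheme from the quoted commit message can be
modelled roughly as below. This is only a user-space sketch, not the code
from the patch: the struct, the function and the THERM_WARN_DELAY_MS macro
are hypothetical names, and only the 8 second default and the idea of
tracking the maximum throttled time come from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; 8 seconds is the default from the commit message. */
#define THERM_WARN_DELAY_MS 8000

/* Illustrative per-CPU state, not the structure used by the patch. */
struct throttle_state {
	bool throttled;      /* currently above the threshold */
	bool warned;         /* warning already issued for this episode */
	uint64_t start_ms;   /* time the current episode began */
	uint64_t max_ms;     /* longest completed episode seen so far */
	unsigned long count; /* number of threshold crossings */
};

/*
 * Feed one sample (hot = above threshold) taken at now_ms.
 * Returns true exactly once per episode, when the episode has lasted
 * at least THERM_WARN_DELAY_MS and a warning should be printed.
 */
static bool throttle_update(struct throttle_state *s, bool hot, uint64_t now_ms)
{
	if (hot && !s->throttled) {
		/* Threshold first crossed: start the timeout, print nothing. */
		s->throttled = true;
		s->warned = false;
		s->start_ms = now_ms;
		s->count++;
		return false;
	}

	if (!hot && s->throttled) {
		/* Back to normal: remember how long the episode lasted. */
		uint64_t dur = now_ms - s->start_ms;

		if (dur > s->max_ms)
			s->max_ms = dur;
		s->throttled = false;
		return false;
	}

	if (hot && !s->warned && now_ms - s->start_ms >= THERM_WARN_DELAY_MS) {
		/* Still hot after the delay: likely a real thermal issue. */
		s->warned = true;
		return true;
	}

	return false;
}
```

In this model a 4 second excursion never produces a message, while an
episode that is still hot after 8 seconds produces one. The max_ms field
plays the role of the value the new sysfs attribute would expose.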

> 
> IOW, a simple history feedback mechanism which sets the timeout based on
> the last couple of values is much smarter. The thing would have a max
> value, of course, which, when exceeded should mean an anomaly, etc, but
> almost anything else is better than merely asking the user to make an
> educated guess.
The temperature is a function of load, time and the heat dissipation
capacity of the system. I have to think more about this to come up with
some heuristics that still warn users about real thermal issues.
Since the value is not persistent, the next boot will again start from
the default.
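
To make the discussion concrete, one possible shape of such a history
feedback is sketched below. Nothing here is from the patch or an agreed
design: the history length, the "twice the longest recent episode" rule,
the clamp limits and all names are assumptions for illustration only.

```c
#include <stdint.h>

/* All values below are illustrative assumptions, not from the patch. */
#define HIST_LEN         4      /* number of recent episodes remembered */
#define DELAY_MIN_MS     1000   /* never warn sooner than this */
#define DELAY_MAX_MS     30000  /* anything longer is treated as an anomaly */
#define DELAY_DEFAULT_MS 8000   /* used until some history exists */

struct delay_hist {
	uint32_t dur_ms[HIST_LEN]; /* ring buffer of recent episode lengths */
	unsigned int n;            /* total number of episodes recorded */
};

/* Record the length of a completed throttle episode. */
static void hist_record(struct delay_hist *h, uint32_t dur_ms)
{
	h->dur_ms[h->n % HIST_LEN] = dur_ms;
	h->n++;
}

/* Derive the warning delay from the recorded history. */
static uint32_t hist_delay_ms(const struct delay_hist *h)
{
	unsigned int used = h->n < HIST_LEN ? h->n : HIST_LEN;
	uint32_t longest = 0;
	unsigned int i;

	/* No history yet (e.g. right after boot): fall back to the default. */
	if (!used)
		return DELAY_DEFAULT_MS;

	for (i = 0; i < used; i++)
		if (h->dur_ms[i] > longest)
			longest = h->dur_ms[i];

	/* Twice the longest recent episode, clamped to [MIN, MAX]. */
	if (longest > DELAY_MAX_MS / 2)
		return DELAY_MAX_MS;
	if (2 * longest < DELAY_MIN_MS)
		return DELAY_MIN_MS;
	return 2 * longest;
}
```

Because the history lives only in memory, hist_delay_ms() returns the
default again after a reboot, which is the limitation mentioned above.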

> 
> > Suggested-by: Alan Cox <alan@linux.intel.com>
> > Commit-comment-by: Tony Luck <tony.luck@intel.com>
> 
>   ^^^^^^^^^^^^^^^^^^
> 
> What's that?
Tony suggested this to indicate that he rewrote the commit description,
as he didn't like mine. checkpatch definitely doesn't like it.

Thanks,
Srinivas

