From: Borislav Petkov <bp@suse.de>
To: "Raj, Ashok" <ashok.raj@intel.com>
Cc: Alexander Alemayhu <alexander@alemayhu.com>,
	Daniel J Blueman <daniel@quora.org>,
	Paul Menzel <pmenzel@molgen.mpg.de>,
	tony.luck@intel.com, linux@leemhuis.info, len.brown@intel.com,
	Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: Dell XPS13: MCE (Hardware Error) reported
Date: Fri, 6 Jan 2017 00:56:11 +0100
Message-ID: <20170105235611.yj4ayqpg2ysibeqy@pd.tnic>
In-Reply-To: <20170105232800.GA82321@otc-brkl-03>

On Thu, Jan 05, 2017 at 03:28:00PM -0800, Raj, Ashok wrote:
> After looking at the code, seems like these events are logged as MCE's
> but are really picked from real lvt thermal event interrupts.  via a fake
> bank 128 for MCE_THERMAL. These are not really HW MCE's, but fake ones created 
> and logged as mcelog entries. (arch/x86/kernel/cpu/mcheck/therm_throt.c)

Right, we've done that since forever but I do think that it confuses
people. This thread is a case in point. I mean, we already scream:

	pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",

to dmesg, why do we have to log a fake MCE too?!
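
For readers who have not looked at therm_throt.c, the mechanism described
in the quoted text can be pictured roughly as below. This is only a
userspace sketch under stated assumptions: the record type and every
sketch_* name are hypothetical, and the bank number 128 is taken from the
quoted mail rather than from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 128 is the fake bank number quoted above for thermal events. */
#define SKETCH_THERMAL_BANK 128

/* Hypothetical, cut-down stand-in for an MCE log record. */
struct sketch_mce {
	uint8_t  bank;   /* bank the record claims to come from */
	uint64_t status; /* raw status value copied into the record */
};

/*
 * Build the "fake" record for a thermal interrupt: nothing is read from a
 * machine check bank, the thermal status is just stored under bank 128.
 */
static struct sketch_mce sketch_log_thermal(uint64_t therm_status)
{
	struct sketch_mce m = {
		.bank = SKETCH_THERMAL_BANK,
		.status = therm_status,
	};

	return m;
}

/* A log consumer could tell such records apart by the bank number alone. */
static bool sketch_is_fake_thermal(const struct sketch_mce *m)
{
	return m->bank == SKETCH_THERMAL_BANK;
}
```

The point of the sketch is that the record carries no hardware error
information beyond what the thermal interrupt already reported.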

Hell, we even log an MCE when things go back to normal:

        if (old_event) {
                if (event == THERMAL_THROTTLING_EVENT)
                        pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
                                level == CORE_LEVEL ? "Core" : "Package");
                return 1;

And Alexander's log shows exactly that:

[ 6338.170924] CPU1: Core temperature above threshold, cpu clock throttled (total events = 21068)
[ 6338.170925] CPU5: Core temperature above threshold, cpu clock throttled (total events = 21068)
[ 6338.170928] CPU7: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170931] CPU4: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170932] CPU0: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170933] CPU6: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170935] CPU2: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170936] CPU3: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170937] CPU5: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170945] CPU1: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170947] mce_notify_irq: 1 callbacks suppressed
[ 6338.170948] mce: [Hardware Error]: Machine check events logged				<--- new event
[ 6338.171917] CPU1: Core temperature/speed normal
[ 6338.171918] CPU5: Core temperature/speed normal
[ 6338.171920] CPU4: Package temperature/speed normal
[ 6338.171920] CPU0: Package temperature/speed normal
[ 6338.171922] CPU2: Package temperature/speed normal
[ 6338.171923] CPU6: Package temperature/speed normal
[ 6338.171924] CPU3: Package temperature/speed normal
[ 6338.171925] CPU7: Package temperature/speed normal
[ 6338.171927] CPU5: Package temperature/speed normal
[ 6338.171929] CPU1: Package temperature/speed normal
[ 6338.171930] mce: [Hardware Error]: Machine check events logged				<--- old event
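
The pairing of a "new event" and an "old event" record in the log above
follows from the handler reporting both edges of a throttling episode. A
simplified model of that behaviour, assuming hypothetical sketch_* names
and leaving out the rate limiting and per-CPU state of the real code,
might look like this:

```c
#include <stdbool.h>

/* Hypothetical per-CPU state; the real code keeps more than this. */
struct sketch_throttle_state {
	bool throttled;      /* what the previous interrupt reported */
	unsigned long count; /* total throttling events seen so far */
};

/*
 * Returns 1 when the caller would go on to log a record, 0 otherwise.
 * Both entering throttling and returning to normal yield 1, which is
 * why a single hot/cool cycle produces two logged events.
 */
static int sketch_throttle_process(struct sketch_throttle_state *s,
				   bool new_event)
{
	bool old_event = s->throttled;

	s->throttled = new_event;

	if (new_event) {
		s->count++;
		return 1; /* "temperature above threshold" */
	}

	if (old_event)
		return 1; /* "temperature/speed normal" */

	return 0;         /* still cool, nothing to report */
}
```

Feeding it true, false, false returns 1, 1, 0: one record on the way up,
one on the way back down, and nothing after that.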

Oh, and it's not like the user can do anything - there's a thermald
which is supposed to deal with all that. Which is not really
trouble-free either, TBH. What happens if that thing dies? Fried CPU?

So I say we should rip out that mce_log_therm_throt_event() and never
ever handle thermal events with MCEs. It is a bad idea.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

