From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Luck, Tony"
To: Borislav Petkov, Naoya Horiguchi
CC: Ingo Molnar, Prarit Bhargava, Vivek Goyal, "linux-kernel@vger.kernel.org", Junichi Nomura, Kiyoshi Ueda
Subject: RE: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
Date: Thu, 9 Apr 2015 18:22:02 +0000
Message-ID: <3908561D78D1C84285E8C5FCA982C28F32A5D502@ORSMSX114.amr.corp.intel.com>
References: <20150306093212.GB14982@hori1.linux.bs1.fc.nec.co.jp> <20150306102216.GA22787@hori1.linux.bs1.fc.nec.co.jp> <20150406071803.GA22950@hori1.linux.bs1.fc.nec.co.jp> <20150406115923.GD4078@pd.tnic> <20150407080017.GB27856@hori1.linux.bs1.fc.nec.co.jp> <20150407080218.GC27856@hori1.linux.bs1.fc.nec.co.jp> <20150409061346.GA25434@pd.tnic> <20150409080030.GA4713@gmail.com> <20150409082125.GE25434@pd.tnic> <20150409085944.GA27042@hori1.linux.bs1.fc.nec.co.jp> <20150409095308.GG25434@pd.tnic>
In-Reply-To: <20150409095308.GG25434@pd.tnic>
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
X-Mailing-List: linux-kernel@vger.kernel.org

> Why?
> Those CPUs are offlined and num_online_cpus() in mce_start() should
> account for that, no?
>
> And if those are offlined, they're very very unlikely to trigger an MCE
> as they're idle and not executing code.

Let's step back a few feet and look at the big picture.

There are three main classes of machine check that we might see while
trying to run kdump - and remember that all machine checks are currently
broadcast, so all cpus, whether online or offline, will see them.

1) Fatal
   We have to crash - lose the dump. Having a new machine check handler
   will make it a bit easier to see what happened, because we won't have
   any "synchronization failed" messages from the offline cpus.

2) Execution path recoverable (SRAR in SDM parlance)
   Also going to be fatal (kdump is all running in ring 0, and we can't
   recover from errors in ring 0). Cleaner messages as above. Potentially
   in the future we might be able to make the kdump machine check handler
   actually recover by just skipping a page - if the location of the error
   was in the old kernel image.

3) Non-execution path recoverable (SRAO in SDM parlance)
   We ought to be able to keep kdump running if this happens - the "AO"
   stands for "action optional", so we are going to choose to not take an
   action. Wherever the error was, it won't affect correctness of
   execution of the current context.

-Tony
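
For illustration only, the three classes above could be mapped to kdump-time
actions roughly as follows. This is a minimal sketch, not the patch under
review: the kdump_mce_* names are hypothetical, the status bit positions are
the MCi_STATUS bits as I understand them from the SDM, and treating anything
outside the three classes as "continue" is an assumption. The real severity
grading in the kernel checks considerably more state than this.

```c
#include <stdint.h>

/* MCi_STATUS bits (per the Intel SDM); defined locally for this sketch. */
#define MCI_STATUS_UC	(1ULL << 61)	/* uncorrected error */
#define MCI_STATUS_PCC	(1ULL << 57)	/* processor context corrupt */
#define MCI_STATUS_S	(1ULL << 56)	/* signaled as a machine check */
#define MCI_STATUS_AR	(1ULL << 55)	/* action required */

/* Hypothetical names, not taken from the kernel or the patch. */
enum kdump_mce_class {
	KDUMP_MCE_OTHER,	/* corrected or not signaled: outside the 3 classes */
	KDUMP_MCE_FATAL,	/* class 1 */
	KDUMP_MCE_SRAR,		/* class 2: execution path recoverable */
	KDUMP_MCE_SRAO,		/* class 3: non-execution path recoverable */
};

enum kdump_mce_action {
	KDUMP_MCE_CONTINUE,
	KDUMP_MCE_PANIC,
};

/* Simplified classification based on the status bits alone. */
static enum kdump_mce_class kdump_mce_classify(uint64_t status)
{
	if (!(status & MCI_STATUS_UC))
		return KDUMP_MCE_OTHER;
	if (status & MCI_STATUS_PCC)
		return KDUMP_MCE_FATAL;
	if (!(status & MCI_STATUS_S))
		return KDUMP_MCE_OTHER;
	return (status & MCI_STATUS_AR) ? KDUMP_MCE_SRAR : KDUMP_MCE_SRAO;
}

/* Policy described in the mail: only SRAO lets kdump keep running. */
static enum kdump_mce_action kdump_mce_decide(enum kdump_mce_class cls)
{
	switch (cls) {
	case KDUMP_MCE_FATAL:
		return KDUMP_MCE_PANIC;		/* have to crash, lose the dump */
	case KDUMP_MCE_SRAR:
		return KDUMP_MCE_PANIC;		/* kdump runs in ring 0, no recovery */
	case KDUMP_MCE_SRAO:
		return KDUMP_MCE_CONTINUE;	/* action optional: take none */
	default:
		return KDUMP_MCE_CONTINUE;	/* assumption: nothing to do */
	}
}
```

If the "skip a page in the old kernel image" idea for SRAR ever materializes,
it would slot into the KDUMP_MCE_SRAR case; until then both of the first two
classes end in a panic, just with cleaner messages.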