From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S932704Ab1ESGoZ (ORCPT ); Thu, 19 May 2011 02:44:25 -0400
Received: from mga03.intel.com ([143.182.124.21]:22634 "EHLO mga03.intel.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S932304Ab1ESGoY (ORCPT ); Thu, 19 May 2011 02:44:24 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.65,236,1304319600"; d="scan'208";a="438859717"
Message-ID: <4DD4BC45.8050301@intel.com>
Date: Thu, 19 May 2011 14:44:21 +0800
From: Huang Ying
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.15) Gecko/20110402 Iceowl/1.0b2 Icedove/3.1.9
MIME-Version: 1.0
To: Ingo Molnar
CC: Don Zickus, huang ying, "linux-kernel@vger.kernel.org", Andi Kleen,
        Robert Richter, Andi Kleen, Borislav Petkov
Subject: Re: [RFC] x86, NMI, Treat unknown NMI as hardware error
References: <1305275018-20596-1-git-send-email-ying.huang@intel.com>
        <20110513124523.GM13984@redhat.com> <20110513130011.GA6474@elte.hu>
        <20110513152033.GB3854@elte.hu> <20110513160029.GD31888@redhat.com>
        <20110516112934.GE19837@elte.hu> <4DD22692.7050209@intel.com>
        <20110517085327.GG22093@elte.hu>
In-Reply-To: <20110517085327.GG22093@elte.hu>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/17/2011 04:53 PM, Ingo Molnar wrote:
>
> * Huang Ying wrote:
>
>> On 05/16/2011 07:29 PM, Ingo Molnar wrote:
>>>
>>> * Don Zickus wrote:
>>>
>>>> On Fri, May 13, 2011 at 05:20:33PM +0200, Ingo Molnar wrote:
>>>>>
>>>>> * huang ying wrote:
>>>>>
>>>>>>> What should be done instead is to add an event for unknown NMIs, which can
>>>>>>> then be processed by the RAS daemon to implement policy.
>>>>>>>
>>>>>>> By using 'active' event filters it could even be set on a system to panic
>>>>>>> the box by default.
>>>>>>
>>>>>> If there is real fatal hardware error, maybe we have no luxury to go from NMI
>>>>>> handler to user space RAS daemon to determine what to do. System may explode,
>>>>>> bad data may go to disk before that.
>>>>>
>>>>> That is why i suggested:
>>>>>
>>>>> > > By using 'active' event filters it could even be set on a system to panic
>>>>> > > the box by default.
>>>>>
>>>>> event filters are evaluated in the kernel, so the panic could be instantaneous,
>>>>> without the event having to reach user-space.
>>>>
>>>> Interesting. Question though, what do you mean by 'event filtering'. Is
>>>> that different than setting 'unknown_nmi_panic' panic on the commandline or
>>>> procfs?
>>>>
>>>> Or are you suggesting something like registering another callback on the
>>>> die_chain that looks for DIE_NMIUNKNOWN as the event, swallows them and
>>>> implements the policy? That way only on HEST related platforms would
>>>> register them while others would keep the default of 'Dazed and confused'
>>>> messages?
>>>
>>> The idea is that "event filters", which are an existing upstream feature and
>>> which can be used in rather flexible ways:
>>>
>>> http://lkml.org/lkml/2011/4/27/660
>>>
>>> Could be used to trigger non-standard policy action as well - such as to panic
>>> the box.
>>>
>>> This would replace various very limited /debugfs and /sys event filtering hacks
>>> (and hardcoded policies) such as arch/x86/kernel/cpu/mcheck/mce-severity.c, and
>>> it would allow nonstandard behavior like 'panic the box on unknown NMIs' as
>>> well.
>>>
>>> This could be set by the RAS daemon, and it could be propagated to the kernel
>>> boot line as well, where event filter syntax would look like this:
>>>
>>> events=nmi::unknown"if (reason == 0) panic();"
>>>
>>> (Where the 'reason' field of the NMI event is the current legacy 'reason' value
>>> there.)
>>>
>>> The filter code would have to be modified to be able to recognize the panic()
>>> bit, but that's desirable anyway and it is a one-time effort.
>>>
>>> This:
>>>
>>> events=nmi::unknown:"if (reason == 0) ignore();"
>>>
>>> would be a possible outcome as well, on certain boxes - to skip certain events.
>>
>> We can determine whether NMI is unknown in kernel now. If you want to push
>> all unknown NMI logic into user space (although I don't think that is the
>> best solution), is it not sufficient that just check system in user space
>> (via PCI ID or DMI ID, etc) and set existing "unknown_nmi_panic" accordingly?
>
> yeah - no need to push the 'reason' if it's not needed.
>
> We want the kernel defaults to be sane - i.e. this is not to 'push' anything to
> user-space in a forced way, this is to make *optional*, different policy action
> possible to configure.

OK. Then, what is the proper default behavior? We think the Linux kernel
should treat an unknown NMI as a hardware error report, at least on some
modern machines (via a white list). Do you agree?

Best Regards,
Huang Ying
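
For illustration only, the user-space variant quoted above (check the system via its DMI ID and set the existing "unknown_nmi_panic" accordingly) could be sketched roughly as below. This is a minimal sketch under stated assumptions, not a proposed patch: the white-list entries are placeholders, and the helper names (vendor_on_whitelist, read_line, write_flag, apply_unknown_nmi_policy) are made up for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical white list: placeholder vendor strings, not real policy. */
static const char *const nmi_whitelist[] = {
    "Example Vendor A",
    "Example Vendor B",
};

/* Return 1 if the DMI vendor string exactly matches a white-list entry. */
static int vendor_on_whitelist(const char *vendor)
{
    size_t i;

    for (i = 0; i < sizeof(nmi_whitelist) / sizeof(nmi_whitelist[0]); i++)
        if (strcmp(vendor, nmi_whitelist[i]) == 0)
            return 1;
    return 0;
}

/* Read the first line of 'path' into 'buf', dropping the trailing newline. */
static int read_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Write "1" or "0" to 'path' (assumed here to be the sysctl file). */
static int write_flag(const char *path, int on)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(on ? "1\n" : "0\n", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Enable panic-on-unknown-NMI only when the vendor is on the white list,
 * otherwise write 0. Returns 0 on success, -1 on any I/O failure.
 */
int apply_unknown_nmi_policy(const char *dmi_path, const char *sysctl_path)
{
    char vendor[128];

    if (read_line(dmi_path, vendor, sizeof(vendor)) != 0)
        return -1;
    return write_flag(sysctl_path, vendor_on_whitelist(vendor));
}
```

A caller would be expected to pass something like /sys/class/dmi/id/sys_vendor and /proc/sys/kernel/unknown_nmi_panic (as root). Note this only covers the user-space option; it says nothing about what the in-kernel default should be, which is the open question here.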