From: Havard Skinnemoen
To: Tony Luck
Cc: Borislav Petkov, Linux Kernel, Ewout van Bekkum, linux-edac
Date: Thu, 10 Jul 2014 15:45:22 -0700
Subject: Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
References: <1404925766-32253-1-git-send-email-hskinnemoen@google.com>
 <1404925766-32253-2-git-send-email-hskinnemoen@google.com>
 <20140709191747.GB5249@pd.tnic>
 <20140710114222.GE2970@pd.tnic>

On Thu, Jul 10, 2014 at 11:55 AM, Tony Luck wrote:
> On Thu, Jul 10, 2014 at 10:51 AM, Havard Skinnemoen wrote:
>> What's the typical interrupt rate during a storm? We should make it
>> significantly less frequent than that, otherwise there's no point
>> switching to polling.
>>
>> IIRC we've seen at least several hundred CMCIs per second, so perhaps
>> 100 ms would be a reasonable minimum? Or perhaps 10 ms, which is the
>> current minimum polling interval enforced by mce_timer_fn.
>
> I don't think we have a solid point to really declare "storm!". The
> CMCI rates between normal and abnormal rates are vast:

Right, I'm talking about "typical" abnormal rates, if you understand
what I mean. I probably shouldn't have used the word "typical".

To determine a minimum value, I think we need to consider machines
which are really bad, but not so bad that they cause non-correctable
errors. We use pushbutton DIMMs to simulate this in the lab.

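As a rough sketch of the arithmetic behind the rates discussed above
(the 300/s storm rate is only an assumed stand-in for "several
hundred", and the helper names are made up for illustration, not
anything in the kernel):

```python
# Back-of-envelope comparison of CPU wakeups in interrupt mode versus
# polling mode. All numbers here are illustrative assumptions.

def wakeups_per_second(poll_interval_ms: float) -> float:
    """Timer wakeups per second for a given polling interval."""
    return 1000.0 / poll_interval_ms


def polling_saves_work(cmci_per_second: float, poll_interval_ms: float) -> bool:
    """True if polling wakes the CPU less often than the CMCIs would."""
    return wakeups_per_second(poll_interval_ms) < cmci_per_second


if __name__ == "__main__":
    storm_rate = 300.0  # assumed: "several hundred CMCIs per second"
    for interval_ms in (1, 10, 100, 1000):
        polls = wakeups_per_second(interval_ms)
        print(f"{interval_ms:5d} ms: {polls:7.1f} polls/s, "
              f"{polls / storm_rate:.2f}x the assumed interrupt rate")
```

The closer the polls/s figure gets to the interrupt rate, the less
there is to gain from switching to polling in the first place.
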
So assuming the worst machines produce a few hundred CMCIs per second,
you're probably not going to see any performance improvement from the
CMCI storm handling if you set the polling interval to less than 10
ms. So that's what the minimum should be, I think. Or perhaps a second
if dealing with sub-second intervals makes the userspace interface
ugly.

I'm not arguing that's a _sensible_ value, just that there's no point
in setting it to anything lower than that.

> Normal rates are a few CMCI per year (or maybe per month ... if
> you have a multi-terabyte machine perhaps even "per day" is normal).
>
> So if you see two CMCI inside the same minute, you could declare
> a storm. Realistically we want the threshold a bit higher.
>
> It then becomes a balance between seeing all the errors (so our PFA
> mechanisms get enough data to spot bad pages and take action) and
> processing so many interrupts that we begin to take a performance
> hit.
>
> Once we do decide there is a storm - we know we have given up on
> seeing all the errors ... the polling rate will only decide how fast we
> can determine that the storm has ended. I don't see a lot of value
> in detecting the end at milli-second granularity. But we probably don't
> want to give up minutes worth of PFA data if the storm does end.

Right, and since we're talking about a balance, it may be best to give
the user as much room as possible to configure the rate according to
their system. I think the current defaults are sensible, but they're
not optimal for all machines.

Havard