From: Havard Skinnemoen <hskinnemoen@google.com>
To: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@gmail.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Ewout van Bekkum <ewout@google.com>,
	linux-edac <linux-edac@vger.kernel.org>
Subject: Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
Date: Fri, 11 Jul 2014 13:39:19 -0700
Message-ID: <CAFQmdRbpVfNjvyOPHo7RZ+W=2FTA6fEw2hWBebKNBX8LMPYfeQ@mail.gmail.com>
In-Reply-To: <20140711201012.GB18246@pd.tnic>

On Fri, Jul 11, 2014 at 1:10 PM, Borislav Petkov <bp@alien8.de> wrote:
> I'm going to reply with multiple mails so that we can keep the things
> separate and not let replies grow out of proportion.
>
> On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
>> So a short burst of CMCIs would send us instantly into polling mode,
>> which would probably be suboptimal if things are quiet after that.
>> Counting is a lot more robust against this.
>
> Yes, but CMCI_STORM_THRESHOLD is arbitrary too. How is getting 15 CMCIs
> per second an interrupt storm? Apparently boxes can handle couple of
> hundred CMCIs per second just fine...

Sorry, I was unclear. I was actually arguing the opposite: getting 15
CMCIs per second is fine and shouldn't cause any switch to polling
mode, especially if the polling then happens 100 times per second. But
your proposal would switch to polling whenever we see 2 CMCIs within a
period, which seems way too trigger-happy, even if the period is
short.

I do agree there are already a lot of arbitrary numbers in the code.

>> If we see two errors every 2 seconds (for example due to a bug causing
>> us to see duplicate MCEs), we'd ping-pong back and forth between CMCI
>> and polling mode on every error, polling 51 times per second on
>> average. This seems a lot more expensive than just staying in CMCI
>> mode. And we risk losing information if there are instead, say, 4
>> errors every 2 seconds.
>>
>> > After a second where we haven't seen any errors, we switch back to CMCI.
>> > check_interval relaxes back to 5 min and all gets to its normal boring
>> > existence. Otherwise, we enter storm mode quickly again.
>>
>> Since the storm detection is now independent of check_interval, we
>> don't need to place any restrictions on it right?
>
> Ok, so my initial storm detection was dumb, ok. Counting the way we do
> it now is purely sucked out of thin air too.
>
> Instead, the criteria should probably be something like: what is the
> number of CMCIs per second which we can process while leaving system
> operation relatively unaffected? Anything above that number constitutes
> a CMCI storm.

That sounds good to me. But now you're talking about CMCIs per second,
which seems to imply some form of counting, right?

> Now, how we'll come up with an answer to that question is a whole
> another story...

Right. If we can come up with an answer, that's great, but if we
can't, I think we're better off exporting a nice knob and letting
users tune their systems according to their needs.

Just to throw another number out, how about doing CMCI storm polling
at a fixed interval of 100 ms? Since check_interval is an integer
number of seconds, it can never drop below 1 second, i.e. 10x the
storm polling interval, so we won't need to restrict it any further.

If we see more than X CMCIs in a second, we switch to polling. If fewer
than Y out of 10 polls see an error, we switch back to CMCI.

Now, that still leaves 3 magic numbers to be figured out... but I think
their range is somewhat more limited.
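
As a rough illustration of the scheme described above, here is a
user-space sketch of the mode switching. It is not kernel code and not
taken from the patch series. The struct, the function names, and the
values chosen for X and Y are all hypothetical placeholders, and it
assumes the caller supplies the one-second tick and the 100 ms poll
results.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical placeholder values; the thread leaves X and Y undecided. */
#define STORM_ENTER_CMCIS   15	/* "X": more CMCIs than this in 1 s -> poll */
#define STORM_EXIT_HITS      2	/* "Y": fewer erroring polls than this -> CMCI */
#define STORM_POLLS_PER_SEC 10	/* 100 ms poll interval -> 10 polls per second */

enum cmci_mode { MODE_CMCI, MODE_POLL };

struct storm_state {
	enum cmci_mode mode;
	unsigned int cmcis_this_sec;	/* CMCIs counted in the current second */
	unsigned int polls_done;	/* polls taken in the current window */
	unsigned int polls_with_err;	/* polls in the window that saw an error */
};

static void storm_init(struct storm_state *s)
{
	assert(s);
	s->mode = MODE_CMCI;
	s->cmcis_this_sec = 0;
	s->polls_done = 0;
	s->polls_with_err = 0;
}

/* Call once per second in CMCI mode to start a new counting window. */
static void storm_on_second_tick(struct storm_state *s)
{
	s->cmcis_this_sec = 0;
}

/* Call for each CMCI. Returns true if this CMCI switched us to polling. */
static bool storm_on_cmci(struct storm_state *s)
{
	if (s->mode != MODE_CMCI)
		return false;
	if (++s->cmcis_this_sec <= STORM_ENTER_CMCIS)
		return false;

	/* More than X CMCIs within one second: treat it as a storm. */
	s->mode = MODE_POLL;
	s->polls_done = 0;
	s->polls_with_err = 0;
	return true;
}

/* Call after each 100 ms poll. Returns true if we switched back to CMCI. */
static bool storm_on_poll(struct storm_state *s, bool saw_error)
{
	bool quiet;

	if (s->mode != MODE_POLL)
		return false;

	s->polls_done++;
	if (saw_error)
		s->polls_with_err++;
	if (s->polls_done < STORM_POLLS_PER_SEC)
		return false;

	/* A full window of 10 polls has elapsed; decide, then start over. */
	quiet = s->polls_with_err < STORM_EXIT_HITS;
	s->polls_done = 0;
	s->polls_with_err = 0;
	if (quiet) {
		s->mode = MODE_CMCI;
		s->cmcis_this_sec = 0;
	}
	return quiet;
}
```

With these placeholder values, the 16th CMCI within one second enters
polling mode, and a window of 10 polls in which at most one poll saw an
error returns to CMCI mode. Because the exit decision looks at a whole
window rather than a single quiet poll, an isolated gap in the middle of
a storm does not cause a premature switch back.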

Havard
