From: Borislav Petkov <bp@alien8.de>
To: Havard Skinnemoen <hskinnemoen@google.com>
Cc: Tony Luck <tony.luck@gmail.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Ewout van Bekkum <ewout@google.com>,
	linux-edac <linux-edac@vger.kernel.org>
Subject: Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
Date: Fri, 11 Jul 2014 22:36:07 +0200
Message-ID: <20140711203607.GD18246@pd.tnic>
In-Reply-To: <CAFQmdRajEjtGB4xXVzCmaUPA=qEjrzQTskJtpmD0cqKhKsEYsg@mail.gmail.com>

On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
> > Basically the scheme becomes the following:
> >
> > * We switch to polling if we detect a second CMCI under an interval X
> > * We poll Y times, each polling with a duration Z.
> > * If during those Y*Z msec of polling, we've encountered errors, we
> > enlarge the polling interval to additional Y*Z msec.
> >
> >
> > check_interval will be capped on the low end to something bigger than
> > the polling duration Y*Z and only the storm detection code will be
> > allowed to go to lower intervals and switch to polling.
> >
> > At least something like that. In general, I'd like to make it more
> > robust for every system without the need for user interaction, i.e.
> > adjusting check_interval and where it just works.
> 
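
A rough userspace C sketch of the scheme quoted above, not kernel code.
All names here (storm_params, storm_state, storm_on_cmci, storm_on_poll)
are hypothetical, X/Y/Z stay as parameters because no values have been
agreed on, and the third bullet is read as "keep polling for another
Y*Z msec", which is an assumption:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tunables standing in for X, Y and Z from the scheme. */
struct storm_params {
	uint64_t x_ms;	/* X: two CMCIs closer together than this start a storm */
	unsigned int y;	/* Y: polls per polling window, assumed >= 1 */
	uint64_t z_ms;	/* Z: poll period; the caller arms its timer with this */
};

struct storm_state {
	bool polling;		/* true while in storm (polling) mode */
	bool have_last;		/* at least one CMCI has been seen */
	uint64_t last_cmci_ms;	/* timestamp of the previous CMCI */
	unsigned int polls_left;/* polls remaining in the current window */
	bool saw_error;		/* an error was found during this window */
};

/* Call on every CMCI. Returns true if we are (now) in polling mode. */
static bool storm_on_cmci(struct storm_state *s, const struct storm_params *p,
			  uint64_t now_ms)
{
	bool storm = s->have_last && (now_ms - s->last_cmci_ms) < p->x_ms;

	s->have_last = true;
	s->last_cmci_ms = now_ms;

	if (storm && !s->polling) {
		/* Second CMCI within X: switch to polling for Y polls. */
		s->polling = true;
		s->polls_left = p->y;
		s->saw_error = false;
	}
	return s->polling;
}

/* Call once per poll, i.e. every Z msec. Returns true to keep polling. */
static bool storm_on_poll(struct storm_state *s, const struct storm_params *p,
			  bool error_found)
{
	if (!s->polling)
		return false;

	s->saw_error |= error_found;
	if (--s->polls_left > 0)
		return true;

	if (s->saw_error) {
		/* Errors during the last Y*Z msec: poll another Y*Z msec. */
		s->polls_left = p->y;
		s->saw_error = false;
		return true;
	}

	/* A quiet window: leave storm mode and forget the old timestamp. */
	s->polling = false;
	s->have_last = false;
	return false;
}
```

The caller would re-arm a timer every z_ms for as long as
storm_on_poll() returns true and go back to interrupt mode once it
returns false.
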
> But at the same time, this scheme introduces even more variables that
> need careful tuning, e.g. storm polling interval and storm duration,
> while not really doing anything to make check_interval superfluous. Do

Oh, we can't make check_interval superfluous - it has been an API to
userspace for a long time now.

> you really think we can tune these variables correctly for every
> system out there?

Right, I was trying to figure out a scheme first where polling intervals
and thresholds would actually make sense and not be arbitrary.

We probably won't be able to have the exact values for each system but a
smart approximation could do the job nicely enough.

> Or if we want to be generous: How about we just hardcode
> check_interval to 5 seconds. Would that be fine with everyone?

We could, but again, it is an API to userspace exported through sysfs.

Besides, on a healthy system, you see errors so seldom that polling
every 5 seconds is a pure waste of energy.

> > I don't know whether any of the above makes sense - I hope that the
> > gist of it at least shows what I think we should be doing: instead
> > of letting users configure the check_interval and influence the CMCI
> > polling interval, we should rely purely on machine characteristics to
> > set minimum values under which we poll and above which, we do the normal
> > duration enlarging dance.
> 
> I think the scheme may work, although I'm worried about the burstiness
> mentioned above.
>
> But I don't really buy that pulling a handful of numbers out of thin
> air and saying it should work for everyone is going to work.

No no, absolutely not. This is exactly what I think should be fixed, as
the current numbers are likely pulled out of thin air, simply because
figuring out the optimal ones is a very hard task, as we are coming to
realize.

> Either we need solid data to back up those numbers, or we need to make
> them configurable so people can experiment and find what works best
> for them.

... or we could measure them on each system and, over the course of its
runtime, approximate values close to optimal for that particular system.
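
One way the runtime measurement could look, purely as a sketch: keep an
exponentially weighted moving average of the gap between corrected
errors and derive a poll interval from it. The estimator, the 1/8
weight, the "half the average gap" rule and the names (gap_estimator,
gap_update, gap_suggest_interval) are all assumptions for illustration,
not something this thread has settled on:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical running estimate of the time between corrected errors. */
struct gap_estimator {
	bool seeded;		/* false until the first sample arrives */
	uint64_t avg_ms;	/* moving average of the gap, in msec */
};

/* Fold one observed gap into the average with weight 1/8 (integer only). */
static void gap_update(struct gap_estimator *e, uint64_t gap_ms)
{
	if (!e->seeded) {
		e->avg_ms = gap_ms;
		e->seeded = true;
		return;
	}
	if (gap_ms >= e->avg_ms)
		e->avg_ms += (gap_ms - e->avg_ms) / 8;
	else
		e->avg_ms -= (e->avg_ms - gap_ms) / 8;
}

/*
 * Suggest a poll interval: half the average gap, clamped to [lo_ms, hi_ms].
 * With no samples yet, assume a healthy system and poll as rarely as allowed.
 */
static uint64_t gap_suggest_interval(const struct gap_estimator *e,
				     uint64_t lo_ms, uint64_t hi_ms)
{
	uint64_t want;

	if (!e->seeded)
		return hi_ms;

	want = e->avg_ms / 2;
	if (want < lo_ms)
		return lo_ms;
	if (want > hi_ms)
		return hi_ms;
	return want;
}
```

Here lo_ms would play the role of the low-end cap discussed earlier and
hi_ms the role of whatever check_interval userspace has configured.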

Thanks for taking the time and humouring me with that crazy
brainstorming!

:-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.