From: Sage Weil <sage@newdream.net>
To: Dan van der Ster <dan@vanderster.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
	Herve Rousseau <herve.rousseau@cern.ch>
Subject: Re: scrub randomization and load threshold
Date: Thu, 12 Nov 2015 07:10:32 -0800 (PST)
Message-ID: <alpine.DEB.2.00.1511120705240.614@cobra.newdream.net>
In-Reply-To: <CABZ+qqnj81gQDe6mApNdGa-HK2LbeKJwHH-7Lj9ovq2+ci1MgA@mail.gmail.com>

On Thu, 12 Nov 2015, Dan van der Ster wrote:
> On Thu, Nov 12, 2015 at 2:29 PM, Sage Weil <sage@newdream.net> wrote:
> > On Thu, 12 Nov 2015, Dan van der Ster wrote:
> >> Hi,
> >>
> >> Firstly, we just had a look at the new
> >> osd_scrub_interval_randomize_ratio option and found that it doesn't
> >> really solve the deep scrubbing problem. Given the default options,
> >>
> >> osd_scrub_min_interval = 60*60*24
> >> osd_scrub_max_interval = 7*60*60*24
> >> osd_scrub_interval_randomize_ratio = 0.5
> >> osd_deep_scrub_interval = 60*60*24*7
> >>
> >> we understand that the new option changes the min interval to the
> >> range 1-1.5 days. However, this doesn't do anything for the thundering
> >> herd of deep scrubs which will happen every 7 days. We've found a
> >> configuration that should randomize deep scrubbing across two weeks,
> >> e.g.:
> >>
> >> osd_scrub_min_interval = 60*60*24*7
> >> osd_scrub_max_interval = 100*60*60*24 // effectively disabling this option
> >> osd_scrub_load_threshold = 10 // effectively disabling this option
> >> osd_scrub_interval_randomize_ratio = 2.0
> >> osd_deep_scrub_interval = 60*60*24*7
> >>
> >> but that (a) doesn't allow shallow scrubs to run daily and (b) is so
> >> far off the defaults that it's basically an abuse of the intended
> >> behaviour.
> >>
> >> So we'd like to simplify how deep scrubbing can be randomized. Our PR
> >> (http://github.com/ceph/ceph/pull/6550) adds a new option
> >> osd_deep_scrub_randomize_ratio which controls a coin flip to randomly
> >> turn scrubs into deep scrubs. The default is tuned so roughly 1 in 7
> >> scrubs will be run deeply.
> >
> > The coin flip seems reasonable to me.  But wouldn't it also/instead make
> > sense to apply the randomize ratio to the deep_scrub_interval?  E.g., just
> > adding in the random factor here:
> >
> > https://github.com/ceph/ceph/pull/6550/files#diff-dfb9ddca0a3ee32b266623e8fa489626R3247
> >
> > That is what I would have expected to happen, and if the coin flip is also
> > there then you have two knobs controlling the same thing, which'll cause
> > confusion...
> >
> 
> That was our first idea. But that has a couple downsides:
> 
>   1.  If we use the random range for the deep scrub intervals, e.g.
> deep every 1-1.5 weeks, we still get quite bursty scrubbing until it
> randomizes over a period of many weeks/months. And I fear it might
> even lead to lower-frequency harmonics of many concurrent deep scrubs.
> Using a coin flip guarantees uniformity starting immediately from time
> zero.
>
>   2. In our PR osd_deep_scrub_interval is still used as an upper limit
> on how long a PG can go without being deeply scrubbed. This way
> there's no confusion such as PGs going without a deep scrub for longer
> than expected. (In general, I think this random range is unintuitive
> and difficult to tune; e.g., see my 2-week deep scrubbing config above.)

Fair enough.
 
> For me, the most intuitive configuration (maintaining randomness) would be:
> 
>   a. drop the osd_scrub_interval_randomize_ratio because there is no
> shallow scrub thundering herd problem (AFAIK), and it just complicates
> the configuration. (But this is in a stable release now so I don't
> know if you want to back it out).

I'm inclined to leave it, even if it complicates config: just because we 
haven't noticed the shallow scrub thundering herd doesn't mean it doesn't 
exist, and I fully expect that it is there.  Also, if the shallow scrubs 
are lumpy and we're promoting some of them to deep scrubs, then the deep 
scrubs will be lumpy too.
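
(For the archives, a minimal sketch of what I understand the randomize
ratio to do; the helper name and RNG choice are mine, not the actual
OSD code:)

  #include <random>

  // Sketch: stretch a PG's effective min scrub interval by a random
  // factor in [1, 1 + osd_scrub_interval_randomize_ratio].  With
  // min_interval = 1 day and ratio = 0.5 this yields 1-1.5 days,
  // spreading shallow scrubs out instead of herding them.
  double effective_min_interval(double min_interval, double randomize_ratio) {
    static std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<double> jitter(0.0, randomize_ratio);
    return min_interval * (1.0 + jitter(gen));
  }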

>   b. perform a (usually shallow) scrub every
> osd_scrub_interval_(min/max) depending on a self-tuning load
> threshold.

Yep, although as you note we have some work to do to get there.  :)

>   c. do a coin flip each (b) to occasionally turn it into deep scrub.

Works for me.
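
Roughly like this, I imagine (a hypothetical sketch of (c), not the
actual code from the PR):

  #include <random>

  // Sketch: at each scheduled scrub, flip a coin to promote it to a
  // deep scrub.  With p_deep ~= 1/7, about one in seven scrubs runs
  // deep, uniformly from time zero, while osd_deep_scrub_interval
  // still acts as a hard upper bound for unlucky PGs.
  bool run_as_deep_scrub(double p_deep, double since_last_deep,
                         double deep_interval) {
    if (since_last_deep >= deep_interval)
      return true;  // overdue: force the deep scrub
    static std::mt19937 gen{std::random_device{}()};
    std::bernoulli_distribution coin(p_deep);
    return coin(gen);
  }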

>   optionally: d. remove osd_deep_scrub_randomize_ratio and replace it
> with the ratio osd_scrub_interval_min/osd_deep_scrub_interval.

There is no osd_deep_scrub_randomize_ratio.  Do you mean replace 
osd_deep_scrub_interval with osd_deep_scrub_{min,max}_interval?

> >> Secondly, we'd also like to discuss the osd_scrub_load_threshold
> >> option, where we see two problems:
> >>    - the default is so low that it disables all the shallow scrub
> >> randomization on all but completely idle clusters.
> >>    - finding the correct osd_scrub_load_threshold for a cluster is
> >> surely unclear/difficult and probably a moving target for most prod
> >> clusters.
> >>
> >> Given those observations, IMHO the smart Ceph admin should set
> >> osd_scrub_load_threshold = 10 or higher, to effectively disable that
> >> functionality. In the spirit of having good defaults, I therefore
> >> propose that we increase the default osd_scrub_load_threshold (to at
> >> least 5.0) and consider removing the load threshold logic completely.
> >
> > This sounds reasonable to me.  It would be great if we could use a 24-hour
> > average as the baseline or something so that it was self-tuning (e.g., set
> > threshold to 0.8 of the daily average), but that's a bit trickier.  I'm
> > generally all for self-tuning, though... too many knobs...
> 
> Yes, but we probably would need to make your 0.8 a function of the
> stddev of the loadavg over a day, to handle clusters with flat
> loadavgs as well as varying ones.
> 
> In order to randomly spread the deep scrubs across the week, it's
> essential to give each PG many opportunities to scrub throughout the
> week. If PGs are only shallow scrubbed once a week (at interval_max),
> then every scrub would become a deep scrub and we again have the
> thundering herd problem.
> 
> I'll push 5.0 for now.

Sounds good.

I would still love to see someone tackle the auto-tuning approach, 
though! :)
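
Something along these lines, perhaps (purely illustrative; the sampling
cadence and the stddev factor are invented, not a worked-out design):

  #include <cmath>
  #include <deque>
  #include <numeric>

  // Sketch of a self-tuning scrub load gate: keep ~24h of loadavg
  // samples and allow scrubbing when the current load sits at or
  // below the cluster's typical level.  Folding in the stddev (per
  // Dan's point) lets flat-loadavg clusters keep scrubbing while
  // bursty clusters wait for their quiet periods.
  struct ScrubLoadGate {
    std::deque<double> samples;            // one loadavg sample per minute
    static constexpr size_t max_samples = 24 * 60;

    void record(double loadavg) {
      samples.push_back(loadavg);
      if (samples.size() > max_samples)
        samples.pop_front();
    }

    bool scrub_allowed(double current_load) const {
      if (samples.empty())
        return true;  // no history yet: don't block scrubbing
      double mean = std::accumulate(samples.begin(), samples.end(), 0.0)
                    / samples.size();
      double var = 0.0;
      for (double s : samples)
        var += (s - mean) * (s - mean);
      double stddev = std::sqrt(var / samples.size());
      // Allow a small margin above the mean so a perfectly flat
      // loadavg (stddev ~= 0) still permits scrubbing; a large
      // stddev pushes the threshold down toward the quiet periods.
      return current_load <= mean * 1.1 - 0.5 * stddev;
    }
  };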

sage
