scrub randomization and load threshold

* scrub randomization and load threshold
@ 2015-11-12  9:24 Dan van der Ster
  2015-11-12 13:29 ` Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Dan van der Ster @ 2015-11-12  9:24 UTC (permalink / raw)
  To: ceph-devel; +Cc: Herve Rousseau

Hi,

Firstly, we just had a look at the new
osd_scrub_interval_randomize_ratio option and found that it doesn't
really solve the deep scrubbing problem. Given the default options,

osd_scrub_min_interval = 60*60*24
osd_scrub_max_interval = 7*60*60*24
osd_scrub_interval_randomize_ratio = 0.5
osd_deep_scrub_interval = 60*60*24*7

we understand that the new option changes the min interval to the
range 1-1.5 days. However, this doesn't do anything for the thundering
herd of deep scrubs which will happen every 7 days. We've found a
configuration that should randomize deep scrubbing across two weeks,
e.g.:

osd_scrub_min_interval = 60*60*24*7
osd_scrub_max_interval = 100*60*60*24 // effectively disabling this option
osd_scrub_load_threshold = 10 // effectively disabling this option
osd_scrub_interval_randomize_ratio = 2.0
osd_deep_scrub_interval = 60*60*24*7

but that (a) doesn't allow shallow scrubs to run daily and (b) is so
far off the defaults that its basically an abuse of the intended
behaviour.

So we'd like to simplify how deep scrubbing can be randomized. Our PR
(http://github.com/ceph/ceph/pull/6550) adds a new option
osd_deep_scrub_randomize_ratio which  controls a coin flip to randomly
turn scrubs into deep scrubs. The default is tuned so roughly 1 in 7
scrubs will be run deeply.

Secondly, we'd also like to discuss the osd_scrub_load_threshold
option, where we see two problems:
   - the default is so low that it disables all the shallow scrub
randomization on all but completely idle clusters.
   - finding the correct osd_scrub_load_threshold for a cluster is
surely unclear/difficult and probably a moving target for most prod
clusters.

Given those observations, IMHO the smart Ceph admin should set
osd_scrub_load_threshold = 10 or higher, to effectively disable that
functionality. In the spirit of having good defaults, I therefore
propose that we increase the default osd_scrub_load_threshold (to at
least 5.0) and consider removing the load threshold logic completely.

Cheers,

Dan

^ permalink raw reply	[flat|nested] 12+ messages in thread