From: Dan van der Ster
Subject: Re: scrub randomization and load threshold
Date: Mon, 16 Nov 2015 16:32:11 +0100
Sender: ceph-devel-owner@vger.kernel.org
To: Sage Weil
Cc: "ceph-devel@vger.kernel.org", Herve Rousseau

On Mon, Nov 16, 2015 at 4:20 PM, Sage Weil wrote:
> On Mon, 16 Nov 2015, Dan van der Ster wrote:
>> Instead of keeping a 24hr loadavg, how about we allow scrubs whenever
>> the loadavg is decreasing (or below the threshold)? As long as the
>> 1min loadavg is less than the 15min loadavg, we should be ok to allow
>> new scrubs. If you agree I'll add the patch below to my PR.
>
> I like the simplicity of that, I'm afraid it's going to just trigger a
> feedback loop and oscillations on the host. I.e., as soon as we see *any*
> decrease, all osds on the host will start to scrub, which will push the
> load up. Once that round of PGs finish, the load will start to drop
> again, triggering another round. This'll happen regardless of whether
> we're in the peak hours or not, and the high-level goal (IMO at least) is
> to do scrubbing in non-peak hours.

We checked our OSDs' 24hr loadavg plots today and found that the
original idea of 0.8 * 24hr loadavg wouldn't leave many chances for
scrubs to run. So maybe if we used 0.9 or 1.0 it would be doable.

BTW, I realized there was a silly error in that earlier patch, and in
any case we need an upper bound, say the number of CPUs. So until your
response came I was working with this idea:

https://stikked.web.cern.ch/stikked/view/raw/5586a912

-- dan
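
A rough sketch of the check discussed in this thread, written in Python for brevity. This is not the patch at the link above. The function names, the parameter names, and the exact way the CPU-count cap is applied are assumptions made for illustration only: scrubs are allowed when the 1min loadavg is below the threshold, or when it is below the 15min loadavg (load is decreasing) and also below the number of CPUs.

```python
import os


def scrub_load_ok(load1, load15, threshold, ncpus):
    """Decide whether a new scrub may start (illustrative sketch only).

    load1, load15: 1-minute and 15-minute load averages.
    threshold:     configured load threshold (hypothetical parameter).
    ncpus:         assumed upper bound on load, per "say # cpus".
    """
    # Below the configured threshold: always allow.
    if load1 < threshold:
        return True
    # Assumed upper bound: never start scrubs on a saturated host,
    # even if the load happens to be falling.
    if load1 >= ncpus:
        return False
    # Otherwise allow only while the load is decreasing.
    return load1 < load15


def scrub_load_ok_now(threshold):
    """Same check against the live host (Unix only)."""
    load1, _load5, load15 = os.getloadavg()
    ncpus = os.cpu_count() or 1
    return scrub_load_ok(load1, load15, threshold, ncpus)
```

Note that this sketch does nothing about the oscillation concern quoted above: every OSD on a host sees the same loadavg, so they would all pass the "decreasing" test at the same moment.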