LKML Archive on lore.kernel.org
From: Jan Kara <jack@suse.cz>
To: Jens Axboe <axboe@fb.com>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, jack@suse.cz, dchinner@redhat.com,
	sedat.dilek@gmail.com
Subject: Re: [PATCH 7/8] wbt: add general throttling mechanism
Date: Thu, 28 Apr 2016 13:05:59 +0200
Message-ID: <20160428110559.GC17362@quack2.suse.cz> (raw)
In-Reply-To: <1461686131-22999-8-git-send-email-axboe@fb.com>

On Tue 26-04-16 09:55:30, Jens Axboe wrote:
> We can hook this up to the block layer, to help throttle buffered
> writes. Or NFS can tap into it, to accomplish the same.
> 
> wbt registers a few trace points that can be used to track what is
> happening in the system:
> 
> wbt_lat: 259:0: latency 2446318
> wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
>                wmean=518866, wmin=15522, wmax=5330353, wsamples=57
> wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32
> 
> This shows a sync issue event (wbt_lat) that exceeded its time. wbt_stat
> dumps the current read/write stats for that window, and wbt_step shows a
> step down event where we now scale back writes. Each trace includes the
> device, 259:0 in this case.

I have some comments below...

> +struct rq_wb {
> +	/*
> +	 * Settings that govern how we throttle
> +	 */
> +	unsigned int wb_background;		/* background writeback */
> +	unsigned int wb_normal;			/* normal writeback */
> +	unsigned int wb_max;			/* max throughput writeback */
> +	unsigned int scale_step;
> +
> +	u64 win_nsec;				/* default window size */
> +	u64 cur_win_nsec;			/* current window size */
> +
> +	unsigned int unknown_cnt;

It would be useful to have a comment here explaining that 'unknown_cnt' is
the number of consecutive periods in which we didn't have enough data to
decide about queue scaling (at least that is what I understood from the
code).

> +
> +	struct timer_list window_timer;
> +
> +	s64 sync_issue;
> +	void *sync_cookie;

So I'm somewhat wondering: what is protecting the consistency of this
structure? The limits, scale_step, cur_win_nsec, and unknown_cnt are updated
only from the timer, so those should be safe. However, sync_issue &
sync_cookie are accessed from the IO submission and completion paths, and
there we need some protection to keep those two in sync. It seems
q->queue_lock should mostly achieve this, except for the blk-mq submission
path calling wbt_wait(), which doesn't hold queue_lock.

It seems you were aware of the possible races and the code handles them
mostly fine (although I wouldn't bet too much that there isn't some weird
corner case). Still, it would be good to comment on this somewhere and
explain what the rules for these two fields are.

> +
> +	unsigned int wc;
> +	unsigned int queue_depth;
> +
> +	unsigned long last_issue;		/* last non-throttled issue */
> +	unsigned long last_comp;		/* last non-throttled comp */
> +	unsigned long min_lat_nsec;
> +	struct backing_dev_info *bdi;
> +	struct request_queue *q;
> +	wait_queue_head_t wait;
> +	atomic_t inflight;
> +
> +	struct wb_stat_ops *stat_ops;
> +	void *ops_data;
> +};
...
> diff --git a/lib/wbt.c b/lib/wbt.c
> new file mode 100644
> index 000000000000..650da911f24f
> --- /dev/null
> +++ b/lib/wbt.c
> @@ -0,0 +1,524 @@
> +/*
> + * buffered writeback throttling. losely based on CoDel. We can't drop
> + * packets for IO scheduling, so the logic is something like this:
> + *
> + * - Monitor latencies in a defined window of time.
> + * - If the minimum latency in the above window exceeds some target, increment
> + *   scaling step and scale down queue depth by a factor of 2x. The monitoring
> + *   window is then shrunk to 100 / sqrt(scaling step + 1).
> + * - For any window where we don't have solid data on what the latencies
> + *   look like, retain status quo.
> + * - If latencies look good, decrement scaling step.

I'm wondering about two things:

1) There is logic somewhat in this direction in blk_queue_start_tag().
   Probably it should be removed after your patches land?

2) As far as I can see in patch 8/8, you have plugged the throttling above
   the IO scheduler. When there are e.g. multiple cgroups with different IO
   limits operating, this throttling can lead to strange results (like a
   cgroup with a low limit using up all available background "slots" and
   thus effectively stopping background writeback for other cgroups). So
   wouldn't it make more sense to plug this below the IO scheduler? I
   understand there may be other problems with that, but I think we should
   put more thought into this and provide some justification in the
   changelogs.

> +static void calc_wb_limits(struct rq_wb *rwb)
> +{
> +	unsigned int depth;
> +
> +	if (!rwb->min_lat_nsec) {
> +		rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
> +		return;
> +	}
> +
> +	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> +
> +	/*
> +	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
> +	 */

The comment looks a bit out of place here since we don't reduce the max
depth here. We just use whatever is set in scale_step...

> +	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> +	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> +	rwb->wb_background = (rwb->wb_max + 3) / 4;
> +}
> +
> +static bool inline stat_sample_valid(struct blk_rq_stat *stat)
> +{
> +	/*
> +	 * We need at least one read sample, and a minimum of
> +	 * RWB_MIN_WRITE_SAMPLES. We require some write samples to know
> +	 * that it's writes impacting us, and not just some sole read on
> +	 * a device that is in a lower power state.
> +	 */
> +	return stat[0].nr_samples >= 1 &&
> +		stat[1].nr_samples >= RWB_MIN_WRITE_SAMPLES;
> +}
> +
> +static u64 rwb_sync_issue_lat(struct rq_wb *rwb)
> +{
> +	u64 now, issue = ACCESS_ONCE(rwb->sync_issue);
> +
> +	if (!issue || !rwb->sync_cookie)
> +		return 0;
> +
> +	now = ktime_to_ns(ktime_get());
> +	return now - issue;
> +}
> +
> +enum {
> +	LAT_OK,
> +	LAT_UNKNOWN,
> +	LAT_EXCEEDED,
> +};
> +
> +static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
> +{
> +	u64 thislat;
> +
> +	/*
> +	 * If our stored sync issue exceeds the window size, or it
> +	 * exceeds our min target AND we haven't logged any entries,
> +	 * flag the latency as exceeded.
> +	 */
> +	thislat = rwb_sync_issue_lat(rwb);
> +	if (thislat > rwb->cur_win_nsec ||
> +	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
> +		trace_wbt_lat(rwb->bdi, thislat);
> +		return LAT_EXCEEDED;
> +	}

So I'm trying to wrap my head around this. If I read the code right,
rwb_sync_issue_lat() returns the time that has passed since issuing a sync
request that is still running. We basically randomly pick which sync
request we track: we start tracking a sync request whenever one is issued
and we are not tracking any at that moment. Is this to detect the case
where the latency of sync IO is very large compared to the measurement
window, so we would not get enough samples to make it valid?

Probably the comment could explain more of the "why we do this" than the
pure "what we do".

> +
> +	if (!stat_sample_valid(stat))
> +		return LAT_UNKNOWN;
> +
> +	/*
> +	 * If the 'min' latency exceeds our target, step down.
> +	 */
> +	if (stat[0].min > rwb->min_lat_nsec) {
> +		trace_wbt_lat(rwb->bdi, stat[0].min);
> +		trace_wbt_stat(rwb->bdi, stat);
> +		return LAT_EXCEEDED;
> +	}
> +
> +	if (rwb->scale_step)
> +		trace_wbt_stat(rwb->bdi, stat);
> +
> +	return LAT_OK;
> +}
> +

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Thread overview: 45+ messages
2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
2016-04-26 15:55 ` [PATCH 1/8] block: add WRITE_BG Jens Axboe
2016-04-26 15:55 ` [PATCH 2/8] writeback: add wbc_to_write_cmd() Jens Axboe
2016-04-26 15:55 ` [PATCH 3/8] writeback: use WRITE_BG for kupdate and background writeback Jens Axboe
2016-04-26 15:55 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
2016-04-26 15:55 ` [PATCH 5/8] block: add code to track actual device queue depth Jens Axboe
2016-04-26 15:55 ` [PATCH 6/8] block: add scalable completion tracking of requests Jens Axboe
2016-05-05  7:52   ` Ming Lei
2016-04-26 15:55 ` [PATCH 7/8] wbt: add general throttling mechanism Jens Axboe
2016-04-27 12:06   ` xiakaixu
2016-04-27 15:21     ` Jens Axboe
2016-04-28  3:29       ` xiakaixu
2016-04-28 11:05   ` Jan Kara [this message]
2016-04-28 18:53     ` Jens Axboe
2016-04-28 19:03       ` Jens Axboe
2016-05-03  9:34       ` Jan Kara
2016-05-03 14:23         ` Jens Axboe
2016-05-03 15:22           ` Jan Kara
2016-05-03 15:32             ` Jens Axboe
2016-05-03 15:40         ` Jan Kara
2016-05-03 15:48           ` Jan Kara
2016-05-03 16:59             ` Jens Axboe
2016-05-03 18:14               ` Jens Axboe
2016-05-03 19:07                 ` Jens Axboe
2016-04-26 15:55 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe
2016-04-27 18:01 ` [PATCHSET v5] Make background writeback great again for the first time Jan Kara
2016-04-27 18:17   ` Jens Axboe
2016-04-27 20:37     ` Jens Axboe
2016-04-27 20:59       ` Jens Axboe
2016-04-28  4:06         ` xiakaixu
2016-04-28 18:36           ` Jens Axboe
2016-04-28 11:54         ` Jan Kara
2016-04-28 18:46           ` Jens Axboe
2016-05-03 12:17             ` Jan Kara
2016-05-03 12:40               ` Chris Mason
2016-05-03 13:06                 ` Jan Kara
2016-05-03 13:42                   ` Chris Mason
2016-05-03 13:57                     ` Jan Kara
2016-05-11 16:36               ` Jan Kara
2016-05-13 18:29                 ` Jens Axboe
2016-05-16  7:47                   ` Jan Kara
2016-08-31 17:05 [PATCHSET v6] Throttled background buffered writeback Jens Axboe
2016-08-31 17:05 ` [PATCH 7/8] wbt: add general throttling mechanism Jens Axboe
2016-09-01 18:05   ` Omar Sandoval
2016-09-01 18:51     ` Jens Axboe
2016-09-07 14:46 [PATCH 0/8] Throttled background buffered writeback v7 Jens Axboe
2016-09-07 14:46 ` [PATCH 7/8] wbt: add general throttling mechanism Jens Axboe
