* [PATCH 0/3 v2] bfq: Limit number of allocated scheduler tags per cgroup
@ 2021-07-15 13:30 Jan Kara
From: Jan Kara @ 2021-07-15 13:30 UTC
  To: Jens Axboe; +Cc: linux-block, Paolo Valente, Michal Koutný, Jan Kara

Hello!

Here is the second revision of my patches to fix how bfq weights apply to
cgroup throughput. This version has only one change, fixing how we compute the
number of tags that should be available to a cgroup. The previous version
didn't combine weights at several levels correctly for deeper hierarchies. It
is somewhat unfortunate that for really deep cgroup hierarchies we would now do
memory allocation inside bfq_limit_depth(). I have an idea how we could avoid
that if the rest of the approach proves OK, so please don't concentrate too
much on that detail.

Changes since v1:
* Fixed computation of appropriate proportion of scheduler tags for a cgroup
  to work with deeper cgroup hierarchies.

Original cover letter:

I was looking into why cgroup weights do not have any measurable impact on
writeback throughput from different cgroups. This is actually a regression from
CFQ, where things work more or less OK and weights have roughly the impact they
should. The problem can be reproduced e.g. by running the following simple fio
job in two cgroups with different weights:

[writer]
directory=/mnt/repro/
numjobs=1
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync
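
For illustration, a minimal way to run the job in two differently weighted
cgroups could look as follows (the cgroup names, the writer.fio file name and
sdb are just placeholders; this assumes cgroup v2 with the io controller
enabled and bfq selected on the device backing /mnt/repro):

# one-time setup
echo bfq > /sys/block/sdb/queue/scheduler
echo +io > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/light /sys/fs/cgroup/heavy
echo 100 > /sys/fs/cgroup/light/io.bfq.weight
echo 300 > /sys/fs/cgroup/heavy/io.bfq.weight

# one copy of the job per cgroup (point each copy at its own file in the
# job file so the two writers do not share one file)
bash -c 'echo $$ > /sys/fs/cgroup/light/cgroup.procs; exec fio writer.fio' &
bash -c 'echo $$ > /sys/fs/cgroup/heavy/cgroup.procs; exec fio writer.fio' &
wait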

I can observe there's no significant difference in the amount of data written
from different cgroups despite their weights being in, say, a 1:3 ratio.

After some debugging I've understood the dynamics of the system. There are two
issues:

1) The number of scheduler tags needs to be significantly larger than the
number of device tags. Otherwise there are not enough requests waiting in BFQ
to be dispatched to the device and thus there's nothing to schedule (a way to
check and bump these depths from sysfs is sketched after point 2).

2) Even with enough scheduler tags, writers from two cgroups eventually start
contending on scheduler tag allocation. These are served on a
first-come-first-served basis, so writers from both cgroups feed requests into
bfq at approximately the same rate. Since bfq prefers IO from the heavier
cgroup, that IO is submitted and completed faster, and eventually we end up in
a situation where there's no IO from the heavier cgroup in bfq and all
scheduler tags are consumed by requests from the lighter cgroup. At that point
bfq just dispatches lots of IO from the lighter cgroup since there's no
contender for disk throughput. As a result the observed throughput for both
cgroups is the same.
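
For reference, the scheduler tag depth is the queue's nr_requests, so the two
depths from point 1) can be compared and the scheduler side raised from sysfs
(sdb is just an example device; the queue_depth attribute is what SCSI disks
expose for the device tags):

cat /sys/block/sdb/queue/nr_requests      # scheduler (bfq) tag depth
cat /sys/block/sdb/device/queue_depth     # device (hardware) tag depth
echo 512 > /sys/block/sdb/queue/nr_requests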

This series fixes the problem by accounting how many scheduler tags are
allocated to each cgroup, and if a cgroup has more tags allocated than its
fair share (based on weights) in its service tree, we heavily limit the
scheduler tag bitmap depth for it so that it is not able to starve other
cgroups of scheduler tags.
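
To give a concrete (made up) example: with 256 scheduler tags and two active
cgroups with weights in a 1:3 ratio, the fair shares come out to roughly 64
and 192 tags, so once the lighter cgroup has more than about 64 requests
allocated, its allocation depth gets heavily limited and the heavier cgroup
can always get its own requests into bfq.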

What do people think about this?

								Honza

Previous versions:
Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@suse.cz # v1


Thread overview: 10+ messages
2021-07-15 13:30 [PATCH 0/3 v2] bfq: Limit number of allocated scheduler tags per cgroup Jan Kara
2021-07-15 13:30 ` [PATCH 1/3] block: Provide icq in request allocation data Jan Kara
2021-07-15 13:30 ` [PATCH 2/3] bfq: Track number of allocated requests in bfq_entity Jan Kara
2021-07-15 13:30 ` [PATCH 3/3] bfq: Limit number of requests consumed by each cgroup Jan Kara
2021-08-27 10:07 ` [PATCH 0/3 v2] bfq: Limit number of allocated scheduler tags per cgroup Paolo Valente
2021-08-31  9:59   ` Michal Koutný
2021-09-15 13:15     ` Jan Kara
2021-09-18 10:58       ` Paolo Valente
2021-09-20  9:28         ` Jan Kara
2021-09-22 14:33           ` Jan Kara
