linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v2 00/11] bfq: introduce bfq.ioprio for cgroup
@ 2021-03-12 11:08 brookxu
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
  2021-03-21 11:04 ` [RFC PATCH v2 00/11] bfq: introduce bfq.ioprio for cgroup Paolo Valente
  0 siblings, 2 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

Tasks in the production environment can be roughly divided into
three categories: emergency tasks, ordinary tasks and offline
tasks. Emergency tasks need to be scheduled in real time, such
as system agents. Offline tasks do not need to guarantee QoS,
but can improve system resource utilization during system idle
periods, such as background tasks. The above requirements need
to achieve IO preemption. At present, we can use weights to
simulate IO preemption, but since weights are more of a shared
concept, they cannot be simulated well. For example, the weights
of emergency tasks and ordinary tasks cannot be determined well,
offline tasks (with the same weight) actually occupy different
resources on disks with different performance, and the tail
latency caused by offline tasks cannot be well controlled. Using
ioprio's concept of preemption, we can solve the above problems
very well. Since ioprio will eventually be converted to weight,
using ioprio alone can also achieve weight isolation within the
same class. But we can still use bfq.weight to control resource,
achieving better IO Qos control.

However, currently the class of bfq_group is always be class, and
the ioprio class of the task can only be reflected in a single
cgroup. We cannot guarantee that real-time tasks in a cgroup are
scheduled in time. Therefore, we introduce bfq.ioprio, which
allows us to configure ioprio class for cgroup. In this way, we
can ensure that the real-time tasks of a cgroup can be scheduled
in time. Similarly, the processing of offline task groups can
also be simpler.

The bfq.ioprio interface now is available for cgroup v1 and cgroup
v2. Users can configure the ioprio for cgroup through this interface,
as shown below:

echo "1 2"> blkio.bfq.ioprio

The above two values respectively represent the values of ioprio
class and ioprio for cgroup. The ioprio of tasks within the cgroup
is uniformly equal to the ioprio of the cgroup. If the ioprio of
the cgroup is disabled, the ioprio of the task remains the same,
usually from io_context.

When testing, using fio and fio_generate_plots we can clearly see
that the IO delay of the task satisfies RT> BE> IDLE. When RT is
running, BE and IDLE are guaranteed minimum bandwidth. When used
with bfq.weight, we can also isolate the resource within the same
class.

The test process is as follows:
# prepare data disk
mount /dev/sdb /data1

# create cgroup v1 hierarchy
cd /sys/fs/cgroup/blkio
mkdir rt be idle
echo "1 0" > rt/blkio.bfq.ioprio
echo "2 0" > be/blkio.bfq.ioprio
echo "3 0" > idle/blkio.bfq.ioprio

# run fio test
fio fio.ini

# generate svg graph
fio_generate_plots res

The contents of fio.ini are as follows:
[global]
ioengine=libaio
group_reporting=1
log_avg_msec=500
direct=1
time_based=1
iodepth=16
size=100M
rw=write
bs=1M
[rt]
name=rt
write_bw_log=rt
write_lat_log=rt
write_iops_log=rt
filename=/data1/rt.bin
cgroup=rt
runtime=30s
nice=-10
[be]
name=be
new_group
write_bw_log=be
write_lat_log=be
write_iops_log=be
filename=/data1/be.bin
cgroup=be
runtime=60s
[idle]
name=idle
new_group
write_bw_log=idle
write_lat_log=idle
write_iops_log=idle
filename=/data1/idle.bin
cgroup=idle
runtime=90s

V2:
1. Optmise bfq_select_next_class().
2. Introduce bfq_group [] to track the number of groups for each CLASS.
3. Optimse IO injection, EMQ and Idle mechanism for CLASS_RT.

Chunguang Xu (11):
  bfq: introduce bfq_entity_to_bfqg helper method
  bfq: limit the IO depth of idle_class to 1
  bfq: keep the minimun bandwidth for be_class
  bfq: expire other class if CLASS_RT is waiting
  bfq: optimse IO injection for CLASS_RT
  bfq: disallow idle if CLASS_RT waiting for service
  bfq: disallow merge CLASS_RT with other class
  bfq: introduce bfq.ioprio for cgroup
  bfq: convert the type of bfq_group.bfqd to bfq_data*
  bfq: remove unnecessary initialization logic
  bfq: optimize the calculation of bfq_weight_to_ioprio()

 block/bfq-cgroup.c  |  99 +++++++++++++++++++++++++++++++----
 block/bfq-iosched.c |  47 ++++++++++++++---
 block/bfq-iosched.h |  28 ++++++++--
 block/bfq-wf2q.c    | 124 +++++++++++++++++++++++++++++++++-----------
 4 files changed, 244 insertions(+), 54 deletions(-)

-- 
2.30.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 01/11] bfq: introduce bfq_entity_to_bfqg helper method
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
@ 2021-03-12 11:08   ` brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 02/11] bfq: convert the type of bfq_group.bfqd to bfq_data* brookxu
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

Introduce bfq_entity_to_bfqg() to make it easier to obtain the
bfq_group corresponding to the entity.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 block/bfq-cgroup.c  |  6 ++----
 block/bfq-iosched.h |  1 +
 block/bfq-wf2q.c    | 16 ++++++++++++----
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index b791e2041e49..a5f544acaa61 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -309,8 +309,7 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
 {
 	struct bfq_entity *group_entity = bfqq->entity.parent;
 
-	return group_entity ? container_of(group_entity, struct bfq_group,
-					   entity) :
+	return group_entity ? bfq_entity_to_bfqg(group_entity) :
 			      bfqq->bfqd->root_group;
 }
 
@@ -610,8 +609,7 @@ struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
 	 */
 	entity = &bfqg->entity;
 	for_each_entity(entity) {
-		struct bfq_group *curr_bfqg = container_of(entity,
-						struct bfq_group, entity);
+		struct bfq_group *curr_bfqg = bfq_entity_to_bfqg(entity);
 		if (curr_bfqg != bfqd->root_group) {
 			parent = bfqg_parent(curr_bfqg);
 			if (!parent)
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index b8e793c34ff1..a6f98e9e14b5 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -941,6 +941,7 @@ struct bfq_group {
 #endif
 
 struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
+struct bfq_group *bfq_entity_to_bfqg(struct bfq_entity *entity);
 
 /* --------------- main algorithm interface ----------------- */
 
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 070e34a7feb1..5ff0028920a2 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -149,7 +149,7 @@ struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq)
 	if (!group_entity)
 		group_entity = &bfqq->bfqd->root_group->entity;
 
-	return container_of(group_entity, struct bfq_group, entity);
+	return bfq_entity_to_bfqg(group_entity);
 }
 
 /*
@@ -208,7 +208,7 @@ static bool bfq_no_longer_next_in_service(struct bfq_entity *entity)
 	if (bfq_entity_to_bfqq(entity))
 		return true;
 
-	bfqg = container_of(entity, struct bfq_group, entity);
+	bfqg = bfq_entity_to_bfqg(entity);
 
 	/*
 	 * The field active_entities does not always contain the
@@ -266,6 +266,15 @@ struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
 	return bfqq;
 }
 
+struct bfq_group *bfq_entity_to_bfqg(struct bfq_entity *entity)
+{
+	struct bfq_group *bfqg = NULL;
+
+	if (entity->my_sched_data)
+		bfqg = container_of(entity, struct bfq_group, entity);
+
+	return bfqg;
+}
 
 /**
  * bfq_delta - map service into the virtual time domain.
@@ -1001,8 +1010,7 @@ static void __bfq_activate_entity(struct bfq_entity *entity,
 
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 	if (!bfq_entity_to_bfqq(entity)) { /* bfq_group */
-		struct bfq_group *bfqg =
-			container_of(entity, struct bfq_group, entity);
+		struct bfq_group *bfqg = bfq_entity_to_bfqg(entity);
 		struct bfq_data *bfqd = bfqg->bfqd;
 
 		if (!entity->in_groups_with_pending_reqs) {
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 02/11] bfq: convert the type of bfq_group.bfqd to bfq_data*
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
  2021-03-12 11:08   ` [RFC PATCH v2 01/11] bfq: introduce bfq_entity_to_bfqg helper method brookxu
@ 2021-03-12 11:08   ` brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 03/11] bfq: introduce bfq.ioprio for cgroup brookxu
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

Setting bfq_group.bfqd to void* type does not seem to make much sense.
This will cause unnecessary type conversion. Perhaps it would be better
to change it to bfq_data* type.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 block/bfq-cgroup.c  | 2 +-
 block/bfq-iosched.h | 2 +-
 block/bfq-wf2q.c    | 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index a5f544acaa61..50d06c760194 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -224,7 +224,7 @@ void bfqg_stats_update_io_add(struct bfq_group *bfqg, struct bfq_queue *bfqq,
 {
 	blkg_rwstat_add(&bfqg->stats.queued, op, 1);
 	bfqg_stats_end_empty_time(&bfqg->stats);
-	if (!(bfqq == ((struct bfq_data *)bfqg->bfqd)->in_service_queue))
+	if (!(bfqq == bfqg->bfqd->in_service_queue))
 		bfqg_stats_set_start_group_wait_time(bfqg, bfqq_group(bfqq));
 }
 
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index a6f98e9e14b5..28d85903cf66 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -914,7 +914,7 @@ struct bfq_group {
 	struct bfq_entity entity;
 	struct bfq_sched_data sched_data;
 
-	void *bfqd;
+	struct bfq_data *bfqd;
 
 	struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
 	struct bfq_queue *async_idle_bfqq;
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 5ff0028920a2..276f225f9c6e 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -498,7 +498,7 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 	sd = entity->sched_data;
 	bfqg = container_of(sd, struct bfq_group, sched_data);
-	bfqd = (struct bfq_data *)bfqg->bfqd;
+	bfqd = bfqg->bfqd;
 #endif
 	if (bfqq)
 		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
@@ -597,7 +597,7 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 	sd = entity->sched_data;
 	bfqg = container_of(sd, struct bfq_group, sched_data);
-	bfqd = (struct bfq_data *)bfqg->bfqd;
+	bfqd = bfqg->bfqd;
 #endif
 	if (bfqq)
 		list_del(&bfqq->bfqq_list);
@@ -743,7 +743,7 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		else {
 			sd = entity->my_sched_data;
 			bfqg = container_of(sd, struct bfq_group, sched_data);
-			bfqd = (struct bfq_data *)bfqg->bfqd;
+			bfqd = bfqg->bfqd;
 		}
 #endif
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 03/11] bfq: introduce bfq.ioprio for cgroup
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
  2021-03-12 11:08   ` [RFC PATCH v2 01/11] bfq: introduce bfq_entity_to_bfqg helper method brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 02/11] bfq: convert the type of bfq_group.bfqd to bfq_data* brookxu
@ 2021-03-12 11:08   ` brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 04/11] bfq: limit the IO depth of idle_class to 1 brookxu
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

Tasks in the production environment can be roughly divided into
three categories: emergency tasks, ordinary tasks and offline
tasks. Emergency tasks need to be scheduled in real time, such
as system agents. Offline tasks do not need to guarantee QoS,
but can improve system resource utilization during system idle
periods, such as background tasks. The above requirements need
to achieve IO preemption. At present, we can use weights to
simulate IO preemption, but since weights are more of a shared
concept, they cannot be simulated well. For example, the weights
of emergency tasks and ordinary tasks cannot be determined well,
offline tasks (with the same weight) actually occupy different
resources on disks with different performance, and the tail
latency caused by offline tasks cannot be well controlled. Using
ioprio's concept of preemption, we can solve the above problems
very well. Since ioprio will eventually be converted to weight,
using ioprio alone can also achieve weight isolation within the
same class. But we can still use bfq.weight to control resource,
achieving better IO Qos control.

However, currently the class of bfq_group is always be class, and
the ioprio class of the task can only be reflected in a single
cgroup. We cannot guarantee that real-time tasks in a cgroup are
scheduled in time. Therefore, we introduce bfq.ioprio, which
allows us to configure ioprio class for cgroup. In this way, we
can ensure that the real-time tasks of a cgroup can be scheduled
in time. Similarly, the processing of offline task groups can
also be simpler.

The bfq.ioprio interface now is available for cgroup v1 and cgroup
v2. Users can configure the ioprio for cgroup through this interface,
as shown below:

echo "1 2"> blkio.bfq.ioprio

The above two values respectively represent the values of ioprio
class and ioprio for cgroup. The ioprio of tasks within the cgroup
is uniformly equal to the ioprio of the cgroup. If the ioprio of
the cgroup is disabled, the ioprio of the task remains the same,
usually from io_context.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 block/bfq-cgroup.c  | 87 ++++++++++++++++++++++++++++++++++++++++++++-
 block/bfq-iosched.c |  5 +++
 block/bfq-iosched.h |  8 +++++
 block/bfq-wf2q.c    | 30 +++++++++++++---
 4 files changed, 124 insertions(+), 6 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 50d06c760194..ab4bc410e635 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -489,7 +489,7 @@ static struct bfq_group_data *cpd_to_bfqgd(struct blkcg_policy_data *cpd)
 	return cpd ? container_of(cpd, struct bfq_group_data, pd) : NULL;
 }
 
-static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg)
+struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg)
 {
 	return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq));
 }
@@ -553,6 +553,16 @@ static void bfq_pd_init(struct blkg_policy_data *pd)
 	bfqg->bfqd = bfqd;
 	bfqg->active_entities = 0;
 	bfqg->rq_pos_tree = RB_ROOT;
+
+	bfqg->new_ioprio_class = IOPRIO_PRIO_CLASS(d->ioprio);
+	bfqg->new_ioprio = IOPRIO_PRIO_DATA(d->ioprio);
+	bfqg->ioprio_class = bfqg->new_ioprio_class;
+	bfqg->ioprio = bfqg->new_ioprio;
+
+	if (d->ioprio) {
+		entity->new_weight = bfq_ioprio_to_weight(bfqg->ioprio);
+		entity->weight = entity->new_weight;
+	}
 }
 
 static void bfq_pd_free(struct blkg_policy_data *pd)
@@ -981,6 +991,20 @@ static int bfq_io_show_weight(struct seq_file *sf, void *v)
 	return 0;
 }
 
+static int bfq_io_show_ioprio(struct seq_file *sf, void *v)
+{
+	struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
+	struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
+	unsigned int val = 0;
+
+	if (bfqgd)
+		val = bfqgd->ioprio;
+
+	seq_printf(sf, "%u %lu\n", IOPRIO_PRIO_CLASS(val), IOPRIO_PRIO_DATA(val));
+
+	return 0;
+}
+
 static void bfq_group_set_weight(struct bfq_group *bfqg, u64 weight, u64 dev_weight)
 {
 	weight = dev_weight ?: weight;
@@ -1098,6 +1122,55 @@ static ssize_t bfq_io_set_weight(struct kernfs_open_file *of,
 	return bfq_io_set_device_weight(of, buf, nbytes, off);
 }
 
+static ssize_t bfq_io_set_ioprio(struct kernfs_open_file *of, char *buf,
+				 size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct blkcg *blkcg = css_to_blkcg(css);
+	struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
+	struct blkcg_gq *blkg;
+	unsigned int class, data;
+	char *endp;
+
+	buf = strstrip(buf);
+
+	class = simple_strtoul(buf, &endp, 10);
+	if (*endp != ' ')
+		return -EINVAL;
+	buf = endp + 1;
+
+	data = simple_strtoul(buf, &endp, 10);
+	if ((*endp != ' ') && (*endp != '\0'))
+		return -EINVAL;
+
+	if (class > IOPRIO_CLASS_IDLE || data >= IOPRIO_BE_NR)
+		return -EINVAL;
+
+	spin_lock_irq(&blkcg->lock);
+	bfqgd->ioprio = IOPRIO_PRIO_VALUE(class, data);
+	hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
+		struct bfq_group *bfqg = blkg_to_bfqg(blkg);
+
+		if (bfqg) {
+			if ((bfqg->ioprio_class != class) ||
+			    (bfqg->ioprio != data)) {
+				unsigned short weight;
+
+				weight = class ? bfq_ioprio_to_weight(data) :
+					BFQ_WEIGHT_LEGACY_DFL;
+
+				bfqg->new_ioprio_class = class;
+				bfqg->new_ioprio = data;
+				bfqg->entity.new_weight = weight;
+				bfqg->entity.prio_changed = 1;
+			}
+		}
+	}
+	spin_unlock_irq(&blkcg->lock);
+
+	return nbytes;
+}
+
 static int bfqg_print_rwstat(struct seq_file *sf, void *v)
 {
 	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_rwstat,
@@ -1264,6 +1337,12 @@ struct cftype bfq_blkcg_legacy_files[] = {
 		.seq_show = bfq_io_show_weight,
 		.write = bfq_io_set_weight,
 	},
+	{
+		.name = "bfq.ioprio",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = bfq_io_show_ioprio,
+		.write = bfq_io_set_ioprio,
+	},
 
 	/* statistics, covers only the tasks in the bfqg */
 	{
@@ -1384,6 +1463,12 @@ struct cftype bfq_blkg_files[] = {
 		.seq_show = bfq_io_show_weight,
 		.write = bfq_io_set_weight,
 	},
+	{
+		.name = "bfq.ioprio",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = bfq_io_show_ioprio,
+		.write = bfq_io_set_ioprio,
+	},
 	{} /* terminate */
 };
 
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ec482e6641ff..f0f53d6f1f6e 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5105,7 +5105,12 @@ static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio)
 {
 	struct bfq_data *bfqd = bic_to_bfqd(bic);
 	struct bfq_queue *bfqq;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	struct bfq_group_data *bfqgd = blkcg_to_bfqgd(__bio_blkcg(bio));
+	int ioprio = bfqgd->ioprio ?: bic->icq.ioc->ioprio;
+#else
 	int ioprio = bic->icq.ioc->ioprio;
+#endif
 
 	/*
 	 * This condition may trigger on a newly created bic, be sure to
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 28d85903cf66..3416a75f47da 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -867,6 +867,7 @@ struct bfq_group_data {
 	struct blkcg_policy_data pd;
 
 	unsigned int weight;
+	unsigned short ioprio;
 };
 
 /**
@@ -923,6 +924,11 @@ struct bfq_group {
 
 	int active_entities;
 
+	/* current ioprio and ioprio class */
+	unsigned short ioprio, ioprio_class;
+	/* next ioprio and ioprio class if a change is in progress */
+	unsigned short new_ioprio, new_ioprio_class;
+
 	struct rb_root rq_pos_tree;
 
 	struct bfqg_stats stats;
@@ -991,6 +997,7 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 void bfq_init_entity(struct bfq_entity *entity, struct bfq_group *bfqg);
 void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio);
 void bfq_end_wr_async(struct bfq_data *bfqd);
+struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg);
 struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
 				     struct blkcg *blkcg);
 struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg);
@@ -1037,6 +1044,7 @@ extern struct blkcg_policy blkcg_policy_bfq;
 
 struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq);
 struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
+unsigned int bfq_class_idx(struct bfq_entity *entity);
 unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd);
 struct bfq_service_tree *bfq_entity_service_tree(struct bfq_entity *entity);
 struct bfq_entity *bfq_entity_of(struct rb_node *node);
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 276f225f9c6e..7405be960a92 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -27,12 +27,21 @@ static struct bfq_entity *bfq_root_active_entity(struct rb_root *tree)
 	return rb_entry(node, struct bfq_entity, rb_node);
 }
 
-static unsigned int bfq_class_idx(struct bfq_entity *entity)
+unsigned int bfq_class_idx(struct bfq_entity *entity)
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	unsigned short class = BFQ_DEFAULT_GRP_CLASS;
 
-	return bfqq ? bfqq->ioprio_class - 1 :
-		BFQ_DEFAULT_GRP_CLASS - 1;
+	if (bfqq)
+		class = bfqq->ioprio_class;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	else {
+		struct bfq_group *bfqg = bfq_entity_to_bfqg(entity);
+
+		class = bfqg->ioprio_class ?: BFQ_DEFAULT_GRP_CLASS;
+	}
+#endif
+	return class - 1;
 }
 
 unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd)
@@ -767,14 +776,25 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 				  bfq_weight_to_ioprio(entity->orig_weight);
 		}
 
-		if (bfqq && update_class_too)
-			bfqq->ioprio_class = bfqq->new_ioprio_class;
+		if (update_class_too) {
+			if (bfqq)
+				bfqq->ioprio_class = bfqq->new_ioprio_class;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+			else
+				bfqg->ioprio_class = bfqg->new_ioprio_class;
+#endif
+		}
 
 		/*
 		 * Reset prio_changed only if the ioprio_class change
 		 * is not pending any longer.
 		 */
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+		if ((bfqq && bfqq->ioprio_class == bfqq->new_ioprio_class) ||
+		    (!bfqq && bfqg->ioprio_class == bfqg->new_ioprio_class))
+#else
 		if (!bfqq || bfqq->ioprio_class == bfqq->new_ioprio_class)
+#endif
 			entity->prio_changed = 0;
 
 		/*
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 04/11] bfq: limit the IO depth of idle_class to 1
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
                     ` (2 preceding siblings ...)
  2021-03-12 11:08   ` [RFC PATCH v2 03/11] bfq: introduce bfq.ioprio for cgroup brookxu
@ 2021-03-12 11:08   ` brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 05/11] bfq: keep the minimun bandwidth for be_class brookxu
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

The IO depth of queues belong to CLASS_IDLE  is limited to 1,
so that it can avoid introducing a larger tail latency under
a device with a larger IO depth. Although limiting the IO
depth may reduce the performance of idle_class, it is
generally not a big problem, because idle_class usually does
not have strict performance requirements.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 block/bfq-iosched.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index f0f53d6f1f6e..91e903f1e550 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4808,6 +4808,17 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	if (!bfqq)
 		goto exit;
 
+	/*
+	 * Here, the IO depth of queues belong to CLASS_IDLE is limited
+	 * to 1, so that it can avoid introducing a larger tail latency
+	 * under a device with a larger IO depth. Although limiting the
+	 * IO depth may reduce the performance of idle_class, it is
+	 * generally not a big problem, because idle_class usually
+	 * does not have strict performance requirements.
+	 */
+	if (bfq_class_idle(bfqq) && bfqq->dispatched)
+		goto exit;
+
 	rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq);
 
 	if (rq) {
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 05/11] bfq: keep the minimun bandwidth for be_class
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
                     ` (3 preceding siblings ...)
  2021-03-12 11:08   ` [RFC PATCH v2 04/11] bfq: limit the IO depth of idle_class to 1 brookxu
@ 2021-03-12 11:08   ` brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 06/11] bfq: expire other class if CLASS_RT is waiting brookxu
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

rt_class will preempt other classes, which may cause other
classes to starve to death. At present, idle_class has
alleviated the starvation problem through the minimum
bandwidth mechanism. Similarly, we should do the same for
be_class.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 block/bfq-iosched.c |  6 +++--
 block/bfq-iosched.h | 11 ++++++---
 block/bfq-wf2q.c    | 59 ++++++++++++++++++++++++++++++++-------------
 3 files changed, 53 insertions(+), 23 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 91e903f1e550..ab00b664348c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6542,9 +6542,11 @@ static void bfq_init_root_group(struct bfq_group *root_group,
 	root_group->bfqd = bfqd;
 #endif
 	root_group->rq_pos_tree = RB_ROOT;
-	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
 		root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
-	root_group->sched_data.bfq_class_idle_last_service = jiffies;
+		root_group->sched_data.bfq_class_last_service[i] = jiffies;
+	}
+	root_group->sched_data.class_timeout_last_check = jiffies;
 }
 
 static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 3416a75f47da..de7301664ad3 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -13,7 +13,7 @@
 #include "blk-cgroup-rwstat.h"
 
 #define BFQ_IOPRIO_CLASSES	3
-#define BFQ_CL_IDLE_TIMEOUT	(HZ/5)
+#define BFQ_CLASS_TIMEOUT	(HZ/5)
 
 #define BFQ_MIN_WEIGHT			1
 #define BFQ_MAX_WEIGHT			1000
@@ -97,9 +97,12 @@ struct bfq_sched_data {
 	struct bfq_entity *next_in_service;
 	/* array of service trees, one per ioprio_class */
 	struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
-	/* last time CLASS_IDLE was served */
-	unsigned long bfq_class_idle_last_service;
-
+	/* last time the class was served */
+	unsigned long bfq_class_last_service[BFQ_IOPRIO_CLASSES];
+	/* last time class timeout was checked */
+	unsigned long class_timeout_last_check;
+	/* next index to check class timeout */
+	unsigned int next_class_index;
 };
 
 /**
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 7405be960a92..0ac35fd4f2ab 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1188,6 +1188,7 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, bool ins_into_idle_tree)
 {
 	struct bfq_sched_data *sd = entity->sched_data;
 	struct bfq_service_tree *st;
+	int idx = bfq_class_idx(entity);
 	bool is_in_service;
 
 	if (!entity->on_st_or_in_serv) /*
@@ -1227,6 +1228,7 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, bool ins_into_idle_tree)
 	else
 		bfq_idle_insert(st, entity);
 
+	sd->bfq_class_last_service[idx] = jiffies;
 	return true;
 }
 
@@ -1455,6 +1457,45 @@ __bfq_lookup_next_entity(struct bfq_service_tree *st, bool in_service)
 	return entity;
 }
 
+static int bfq_select_next_class(struct bfq_sched_data *sd)
+{
+	struct bfq_service_tree *st = sd->service_tree;
+	unsigned long last_check, last_serve;
+	int i, class_idx, next_class = 0;
+	bool found = false;
+
+	/*
+	 * we needed to guarantee a minimum bandwidth for each class (if
+	 * there is some active entity in this class). This should also
+	 * mitigate priority-inversion problems in case a low priority
+	 * task is holding file system resources.
+	 */
+	last_check = sd->class_timeout_last_check;
+	if (time_is_after_jiffies(last_check + BFQ_CLASS_TIMEOUT))
+		return next_class;
+
+	sd->class_timeout_last_check = jiffies;
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+		class_idx = (sd->next_class_index + i) % BFQ_IOPRIO_CLASSES;
+		last_serve = sd->bfq_class_last_service[class_idx];
+
+		if (time_is_after_jiffies(last_serve + BFQ_CLASS_TIMEOUT))
+			continue;
+
+		if (!RB_EMPTY_ROOT(&(st + class_idx)->active)) {
+			if (found)
+				continue;
+
+			next_class = class_idx++;
+			class_idx %= BFQ_IOPRIO_CLASSES;
+			sd->next_class_index = class_idx;
+			found = true;
+		}
+		sd->bfq_class_last_service[class_idx] = jiffies;
+	}
+	return next_class;
+}
+
 /**
  * bfq_lookup_next_entity - return the first eligible entity in @sd.
  * @sd: the sched_data.
@@ -1468,24 +1509,8 @@ static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
 						 bool expiration)
 {
 	struct bfq_service_tree *st = sd->service_tree;
-	struct bfq_service_tree *idle_class_st = st + (BFQ_IOPRIO_CLASSES - 1);
 	struct bfq_entity *entity = NULL;
-	int class_idx = 0;
-
-	/*
-	 * Choose from idle class, if needed to guarantee a minimum
-	 * bandwidth to this class (and if there is some active entity
-	 * in idle class). This should also mitigate
-	 * priority-inversion problems in case a low priority task is
-	 * holding file system resources.
-	 */
-	if (time_is_before_jiffies(sd->bfq_class_idle_last_service +
-				   BFQ_CL_IDLE_TIMEOUT)) {
-		if (!RB_EMPTY_ROOT(&idle_class_st->active))
-			class_idx = BFQ_IOPRIO_CLASSES - 1;
-		/* About to be served if backlogged, or not yet backlogged */
-		sd->bfq_class_idle_last_service = jiffies;
-	}
+	int class_idx = bfq_select_next_class(sd);
 
 	/*
 	 * Find the next entity to serve for the highest-priority
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 06/11] bfq: expire other class if CLASS_RT is waiting
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
                     ` (4 preceding siblings ...)
  2021-03-12 11:08   ` [RFC PATCH v2 05/11] bfq: keep the minimun bandwidth for be_class brookxu
@ 2021-03-12 11:08   ` brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 07/11] bfq: optimse IO injection for CLASS_RT brookxu
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

Expire bfqq not belong to CLASS_RT and CLASS_RT is waiting for
service, we can further guarantee the latency for CLASS_RT.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 block/bfq-iosched.c | 15 ++++++++++-----
 block/bfq-iosched.h |  8 ++++++++
 block/bfq-wf2q.c    | 12 ++++++++++++
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ab00b664348c..8af73473c45c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -428,6 +428,7 @@ void bfq_schedule_dispatch(struct bfq_data *bfqd)
 	}
 }
 
+#define bfq_class_rt(bfqq)	((bfqq)->ioprio_class == IOPRIO_CLASS_RT)
 #define bfq_class_idle(bfqq)	((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
 
 #define bfq_sample_valid(samples)	((samples) > 80)
@@ -4709,12 +4710,16 @@ static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd,
 	/*
 	 * Expire bfqq, pretending that its budget expired, if bfqq
 	 * belongs to CLASS_IDLE and other queues are waiting for
-	 * service.
+	 * service, or if bfqq not belongs to CLASS_RT and CLASS_RT
+	 * is waiting for service.
 	 */
-	if (!(bfq_tot_busy_queues(bfqd) > 1 && bfq_class_idle(bfqq)))
-		goto return_rq;
-
-	bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED);
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	if ((bfq_tot_busy_queues(bfqd) > 1 && bfq_class_idle(bfqq)) ||
+	    (!bfq_class_rt(bfqq) && bfqd->busy_groups[0]))
+#else
+	if (bfq_tot_busy_queues(bfqd) > 1 && bfq_class_idle(bfqq))
+#endif
+		bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED);
 
 return_rq:
 	return rq;
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index de7301664ad3..6b87ba53db94 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -531,6 +531,14 @@ struct bfq_data {
 	 */
 	unsigned int num_groups_with_pending_reqs;
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	/*
+	 * Per-class (RT, BE, IDLE) number of bfq_groups waiting for
+	 * service.
+	 */
+	unsigned int busy_groups[3];
+#endif
+
 	/*
 	 * Per-class (RT, BE, IDLE) number of bfq_queues containing
 	 * requests (including the queue in service, even if it is
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 0ac35fd4f2ab..f6c67a80e454 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -1032,11 +1032,14 @@ static void __bfq_activate_entity(struct bfq_entity *entity,
 	if (!bfq_entity_to_bfqq(entity)) { /* bfq_group */
 		struct bfq_group *bfqg = bfq_entity_to_bfqg(entity);
 		struct bfq_data *bfqd = bfqg->bfqd;
+		int idx = bfq_class_idx(entity);
 
 		if (!entity->in_groups_with_pending_reqs) {
 			entity->in_groups_with_pending_reqs = true;
 			bfqd->num_groups_with_pending_reqs++;
 		}
+
+		bfqd->busy_groups[idx]++;
 	}
 #endif
 
@@ -1188,6 +1191,9 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, bool ins_into_idle_tree)
 {
 	struct bfq_sched_data *sd = entity->sched_data;
 	struct bfq_service_tree *st;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	struct bfq_group *bfqg;
+#endif
 	int idx = bfq_class_idx(entity);
 	bool is_in_service;
 
@@ -1229,6 +1235,12 @@ bool __bfq_deactivate_entity(struct bfq_entity *entity, bool ins_into_idle_tree)
 		bfq_idle_insert(st, entity);
 
 	sd->bfq_class_last_service[idx] = jiffies;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	bfqg = bfq_entity_to_bfqg(entity);
+	if (bfqg)
+		bfqg->bfqd->busy_groups[idx]--;
+#endif
+
 	return true;
 }
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 07/11] bfq: optimse IO injection for CLASS_RT
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
                     ` (5 preceding siblings ...)
  2021-03-12 11:08   ` [RFC PATCH v2 06/11] bfq: expire other class if CLASS_RT is waiting brookxu
@ 2021-03-12 11:08   ` brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 08/11] bfq: disallow idle if CLASS_RT waiting for service brookxu
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

CLASS_RT is more sensitive to latency, and IO injection
will increase the CLASS_RT latency. For this reason,
consider prohibiting the injection of async queue for
CLASS_RT, and only the waker queue and other active
queues belonging to CLASS_RT are allowed to inject. In
this way, for CLASS_RT, both the advantages of inject
and IO latency can be maintained.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 block/bfq-iosched.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 8af73473c45c..a5f13589df79 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1965,6 +1965,9 @@ static void bfq_check_waker(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	    bfqd->last_completed_rq_bfqq == bfqq->waker_bfqq)
 		return;
 
+	if (bfq_class_rt(bfqq) && !bfq_class_rt(bfqd->last_completed_rq_bfqq))
+		return;
+
 	if (bfqd->last_completed_rq_bfqq !=
 	    bfqq->tentative_waker_bfqq) {
 		/*
@@ -4391,6 +4394,9 @@ bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
 			else
 				limit = in_serv_bfqq->inject_limit;
 
+			if (bfq_class_rt(in_serv_bfqq) && !bfq_class_rt(bfqq))
+				continue;
+
 			if (bfqd->rq_in_driver < limit) {
 				bfqd->rqs_injected = true;
 				return bfqq;
@@ -4565,7 +4571,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
 		 * may not be minimized, because the waker queue may
 		 * happen to be served only after other queues.
 		 */
-		if (async_bfqq &&
+		if (async_bfqq && !bfq_class_rt(bfqq) &&
 		    icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic &&
 		    bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <=
 		    bfq_bfqq_budget_left(async_bfqq))
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 08/11] bfq: disallow idle if CLASS_RT waiting for service
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
                     ` (6 preceding siblings ...)
  2021-03-12 11:08   ` [RFC PATCH v2 07/11] bfq: optimse IO injection for CLASS_RT brookxu
@ 2021-03-12 11:08   ` brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 09/11] bfq: disallow merge CLASS_RT with other class brookxu
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

if CLASS_RT is waiting for service,queues belong
to other class disallow idle, so that a schedule
can be invoked in time.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 block/bfq-iosched.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a5f13589df79..9330043cdd53 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4263,6 +4263,11 @@ static bool bfq_better_to_idle(struct bfq_queue *bfqq)
 	if (unlikely(bfqd->strict_guarantees))
 		return true;
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	if (!bfq_class_rt(bfqq) && bfqd->busy_groups[0])
+		return false;
+#endif
+
 	/*
 	 * Idling is performed only if slice_idle > 0. In addition, we
 	 * do not idle if
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 09/11] bfq: disallow merge CLASS_RT with other class
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
                     ` (7 preceding siblings ...)
  2021-03-12 11:08   ` [RFC PATCH v2 08/11] bfq: disallow idle if CLASS_RT waiting for service brookxu
@ 2021-03-12 11:08   ` brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 10/11] bfq: remove unnecessary initialization logic brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 11/11] bfq: optimize the calculation of bfq_weight_to_ioprio() brookxu
  10 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

In EMQ, perhaps we should not merge the CLASS_RT queue
with other class queues. Otherwise, the delay of
CLASS_RT IO will increase.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 block/bfq-iosched.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 9330043cdd53..bb6cc8c9ddf5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2623,6 +2623,9 @@ static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
 	if (!bfq_bfqq_sync(bfqq) || !bfq_bfqq_sync(new_bfqq))
 		return false;
 
+	if (bfq_class_rt(bfqq) && !bfq_class_rt(new_bfqq))
+		return false;
+
 	return true;
 }
 
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 10/11] bfq: remove unnecessary initialization logic
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
                     ` (8 preceding siblings ...)
  2021-03-12 11:08   ` [RFC PATCH v2 09/11] bfq: disallow merge CLASS_RT with other class brookxu
@ 2021-03-12 11:08   ` brookxu
  2021-03-12 11:08   ` [RFC PATCH v2 11/11] bfq: optimize the calculation of bfq_weight_to_ioprio() brookxu
  10 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

Since we will initialize sched_data.service_tree[] in
bfq_init_root_group(), bfq_create_group_hierarchy() can
ignore this part of the initialization, which can avoid
repeated initialization.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 block/bfq-cgroup.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index ab4bc410e635..05054e1b5d97 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -1514,15 +1514,11 @@ void bfqg_and_blkg_put(struct bfq_group *bfqg) {}
 struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node)
 {
 	struct bfq_group *bfqg;
-	int i;
 
 	bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
 	if (!bfqg)
 		return NULL;
 
-	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
-		bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
-
 	return bfqg;
 }
 #endif	/* CONFIG_BFQ_GROUP_IOSCHED */
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 11/11] bfq: optimize the calculation of bfq_weight_to_ioprio()
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
                     ` (9 preceding siblings ...)
  2021-03-12 11:08   ` [RFC PATCH v2 10/11] bfq: remove unnecessary initialization logic brookxu
@ 2021-03-12 11:08   ` brookxu
  10 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-12 11:08 UTC (permalink / raw)
  To: paolo.valente, axboe, tj; +Cc: linux-block, cgroups, linux-kernel

From: Chunguang Xu <brookxu@tencent.com>

The value range of ioprio is [0, 7], but the result of
bfq_weight_to_ioprio() may exceed this range, so simple
optimization is required.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
---
 block/bfq-wf2q.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index f6c67a80e454..86558a8fede1 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -536,8 +536,9 @@ unsigned short bfq_ioprio_to_weight(int ioprio)
  */
 static unsigned short bfq_weight_to_ioprio(int weight)
 {
-	return max_t(int, 0,
-		     IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight);
+	int ioprio = IOPRIO_BE_NR  - weight / BFQ_WEIGHT_CONVERSION_COEFF;
+
+	return ioprio < 0 ? 0 : min_t(int, ioprio, IOPRIO_BE_NR - 1);
 }
 
 static void bfq_get_entity(struct bfq_entity *entity)
-- 
2.30.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 00/11] bfq: introduce bfq.ioprio for cgroup
  2021-03-12 11:08 [RFC PATCH v2 00/11] bfq: introduce bfq.ioprio for cgroup brookxu
       [not found] ` <cover.1615527324.git.brookxu@tencent.com>
@ 2021-03-21 11:04 ` Paolo Valente
  2021-03-22  5:44   ` brookxu
  1 sibling, 1 reply; 14+ messages in thread
From: Paolo Valente @ 2021-03-21 11:04 UTC (permalink / raw)
  To: brookxu; +Cc: axboe, tj, linux-block, cgroups, linux-kernel



> Il giorno 12 mar 2021, alle ore 12:08, brookxu <brookxu.cn@gmail.com> ha scritto:
> 
> From: Chunguang Xu <brookxu@tencent.com>
> 

Hi Chunguang,

> Tasks in the production environment can be roughly divided into
> three categories: emergency tasks, ordinary tasks and offline
> tasks. Emergency tasks need to be scheduled in real time, such
> as system agents. Offline tasks do not need to guarantee QoS,
> but can improve system resource utilization during system idle
> periods, such as background tasks. The above requirements need
> to achieve IO preemption. At present, we can use weights to
> simulate IO preemption, but since weights are more of a shared
> concept, they cannot be simulated well. For example, the weights
> of emergency tasks and ordinary tasks cannot be determined well,
> offline tasks (with the same weight) actually occupy different
> resources on disks with different performance, and the tail
> latency caused by offline tasks cannot be well controlled. Using
> ioprio's concept of preemption, we can solve the above problems
> very well. Since ioprio will eventually be converted to weight,
> using ioprio alone can also achieve weight isolation within the
> same class. But we can still use bfq.weight to control resource,
> achieving better IO Qos control.
> 
> However, currently the class of bfq_group is always be class, and
> the ioprio class of the task can only be reflected in a single
> cgroup. We cannot guarantee that real-time tasks in a cgroup are
> scheduled in time. Therefore, we introduce bfq.ioprio, which
> allows us to configure ioprio class for cgroup. In this way, we
> can ensure that the real-time tasks of a cgroup can be scheduled
> in time. Similarly, the processing of offline task groups can
> also be simpler.
> 

I find this contribution very interesting.  Anyway, given the
relevance of such a contribution, I'd like to hear from relevant
people (Jens, Tejun, ...?), before revising individual patches.

Yet I already have a general question.  How does this mechanism comply
with per-process ioprios and ioprio classes?  For example, what
happens if a process belongs to BE-class group according to your
mechanism, but to a RT class according to its ioprio?  Does the
pre-group class dominate the per-process class?  Is all clean and
predictable?

> The bfq.ioprio interface now is available for cgroup v1 and cgroup
> v2. Users can configure the ioprio for cgroup through this interface,
> as shown below:
> 
> echo "1 2"> blkio.bfq.ioprio

Wouldn't it be nicer to have acronyms for classes (RT, BE, IDLE),
instead of numbers?

Thank you very much for this improvement proposal,
Paolo

> 
> The above two values respectively represent the values of ioprio
> class and ioprio for cgroup. The ioprio of tasks within the cgroup
> is uniformly equal to the ioprio of the cgroup. If the ioprio of
> the cgroup is disabled, the ioprio of the task remains the same,
> usually from io_context.
> 
> When testing, using fio and fio_generate_plots we can clearly see
> that the IO delay of the task satisfies RT> BE> IDLE. When RT is
> running, BE and IDLE are guaranteed minimum bandwidth. When used
> with bfq.weight, we can also isolate the resource within the same
> class.
> 
> The test process is as follows:
> # prepare data disk
> mount /dev/sdb /data1
> 
> # create cgroup v1 hierarchy
> cd /sys/fs/cgroup/blkio
> mkdir rt be idle
> echo "1 0" > rt/blkio.bfq.ioprio
> echo "2 0" > be/blkio.bfq.ioprio
> echo "3 0" > idle/blkio.bfq.ioprio
> 
> # run fio test
> fio fio.ini
> 
> # generate svg graph
> fio_generate_plots res
> 
> The contents of fio.ini are as follows:
> [global]
> ioengine=libaio
> group_reporting=1
> log_avg_msec=500
> direct=1
> time_based=1
> iodepth=16
> size=100M
> rw=write
> bs=1M
> [rt]
> name=rt
> write_bw_log=rt
> write_lat_log=rt
> write_iops_log=rt
> filename=/data1/rt.bin
> cgroup=rt
> runtime=30s
> nice=-10
> [be]
> name=be
> new_group
> write_bw_log=be
> write_lat_log=be
> write_iops_log=be
> filename=/data1/be.bin
> cgroup=be
> runtime=60s
> [idle]
> name=idle
> new_group
> write_bw_log=idle
> write_lat_log=idle
> write_iops_log=idle
> filename=/data1/idle.bin
> cgroup=idle
> runtime=90s
> 
> V2:
> 1. Optmise bfq_select_next_class().
> 2. Introduce bfq_group [] to track the number of groups for each CLASS.
> 3. Optimse IO injection, EMQ and Idle mechanism for CLASS_RT.
> 
> Chunguang Xu (11):
>  bfq: introduce bfq_entity_to_bfqg helper method
>  bfq: limit the IO depth of idle_class to 1
>  bfq: keep the minimun bandwidth for be_class
>  bfq: expire other class if CLASS_RT is waiting
>  bfq: optimse IO injection for CLASS_RT
>  bfq: disallow idle if CLASS_RT waiting for service
>  bfq: disallow merge CLASS_RT with other class
>  bfq: introduce bfq.ioprio for cgroup
>  bfq: convert the type of bfq_group.bfqd to bfq_data*
>  bfq: remove unnecessary initialization logic
>  bfq: optimize the calculation of bfq_weight_to_ioprio()
> 
> block/bfq-cgroup.c  |  99 +++++++++++++++++++++++++++++++----
> block/bfq-iosched.c |  47 ++++++++++++++---
> block/bfq-iosched.h |  28 ++++++++--
> block/bfq-wf2q.c    | 124 +++++++++++++++++++++++++++++++++-----------
> 4 files changed, 244 insertions(+), 54 deletions(-)
> 
> -- 
> 2.30.0
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 00/11] bfq: introduce bfq.ioprio for cgroup
  2021-03-21 11:04 ` [RFC PATCH v2 00/11] bfq: introduce bfq.ioprio for cgroup Paolo Valente
@ 2021-03-22  5:44   ` brookxu
  0 siblings, 0 replies; 14+ messages in thread
From: brookxu @ 2021-03-22  5:44 UTC (permalink / raw)
  To: Paolo Valente; +Cc: axboe, tj, linux-block, cgroups, linux-kernel



Paolo Valente wrote on 2021/3/21 19:04:
> 
> 
>> Il giorno 12 mar 2021, alle ore 12:08, brookxu <brookxu.cn@gmail.com> ha scritto:
>>
>> From: Chunguang Xu <brookxu@tencent.com>
>>
> 
> Hi Chunguang,
> 
>> Tasks in the production environment can be roughly divided into
>> three categories: emergency tasks, ordinary tasks and offline
>> tasks. Emergency tasks need to be scheduled in real time, such
>> as system agents. Offline tasks do not need to guarantee QoS,
>> but can improve system resource utilization during system idle
>> periods, such as background tasks. The above requirements need
>> to achieve IO preemption. At present, we can use weights to
>> simulate IO preemption, but since weights are more of a shared
>> concept, they cannot be simulated well. For example, the weights
>> of emergency tasks and ordinary tasks cannot be determined well,
>> offline tasks (with the same weight) actually occupy different
>> resources on disks with different performance, and the tail
>> latency caused by offline tasks cannot be well controlled. Using
>> ioprio's concept of preemption, we can solve the above problems
>> very well. Since ioprio will eventually be converted to weight,
>> using ioprio alone can also achieve weight isolation within the
>> same class. But we can still use bfq.weight to control resource,
>> achieving better IO Qos control.
>>
>> However, currently the class of bfq_group is always be class, and
>> the ioprio class of the task can only be reflected in a single
>> cgroup. We cannot guarantee that real-time tasks in a cgroup are
>> scheduled in time. Therefore, we introduce bfq.ioprio, which
>> allows us to configure ioprio class for cgroup. In this way, we
>> can ensure that the real-time tasks of a cgroup can be scheduled
>> in time. Similarly, the processing of offline task groups can
>> also be simpler.
>>
> 
> I find this contribution very interesting.  Anyway, given the
> relevance of such a contribution, I'd like to hear from relevant
> people (Jens, Tejun, ...?), before revising individual patches.
> 
> Yet I already have a general question.  How does this mechanism comply
> with per-process ioprios and ioprio classes?  For example, what
> happens if a process belongs to BE-class group according to your
> mechanism, but to a RT class according to its ioprio?  Does the
> pre-group class dominate the per-process class?  Is all clean and
> predictable?
Hi Paolo, thanks for your precious time. This is a good question. Now
the pre-group class dominate the per-process class. But thinking about
it in depth now, there seems to be a problem in the container scene,
because the tasks inside the container may have different ioprio class
and ioprio. Maybe Bfq.ioprio should only affects the scheduling of the
group? which can be better compatible with the actual production
environment.

>> The bfq.ioprio interface now is available for cgroup v1 and cgroup
>> v2. Users can configure the ioprio for cgroup through this interface,
>> as shown below:
>>
>> echo "1 2"> blkio.bfq.ioprio
> 
> Wouldn't it be nicer to have acronyms for classes (RT, BE, IDLE),
> instead of numbers?

As ioprio is a number, so the ioprio class also uses a number form.
But your suggestion is good. If necessary, I will modify it later.

> 
> Thank you very much for this improvement proposal,

More discussions are welcome, Thanks.

> Paolo
> 
>>
>> The above two values respectively represent the values of ioprio
>> class and ioprio for cgroup. The ioprio of tasks within the cgroup
>> is uniformly equal to the ioprio of the cgroup. If the ioprio of
>> the cgroup is disabled, the ioprio of the task remains the same,
>> usually from io_context.
>>
>> When testing, using fio and fio_generate_plots we can clearly see
>> that the IO delay of the task satisfies RT> BE> IDLE. When RT is
>> running, BE and IDLE are guaranteed minimum bandwidth. When used
>> with bfq.weight, we can also isolate the resource within the same
>> class.
>>
>> The test process is as follows:
>> # prepare data disk
>> mount /dev/sdb /data1
>>
>> # create cgroup v1 hierarchy
>> cd /sys/fs/cgroup/blkio
>> mkdir rt be idle
>> echo "1 0" > rt/blkio.bfq.ioprio
>> echo "2 0" > be/blkio.bfq.ioprio
>> echo "3 0" > idle/blkio.bfq.ioprio
>>
>> # run fio test
>> fio fio.ini
>>
>> # generate svg graph
>> fio_generate_plots res
>>
>> The contents of fio.ini are as follows:
>> [global]
>> ioengine=libaio
>> group_reporting=1
>> log_avg_msec=500
>> direct=1
>> time_based=1
>> iodepth=16
>> size=100M
>> rw=write
>> bs=1M
>> [rt]
>> name=rt
>> write_bw_log=rt
>> write_lat_log=rt
>> write_iops_log=rt
>> filename=/data1/rt.bin
>> cgroup=rt
>> runtime=30s
>> nice=-10
>> [be]
>> name=be
>> new_group
>> write_bw_log=be
>> write_lat_log=be
>> write_iops_log=be
>> filename=/data1/be.bin
>> cgroup=be
>> runtime=60s
>> [idle]
>> name=idle
>> new_group
>> write_bw_log=idle
>> write_lat_log=idle
>> write_iops_log=idle
>> filename=/data1/idle.bin
>> cgroup=idle
>> runtime=90s
>>
>> V2:
>> 1. Optmise bfq_select_next_class().
>> 2. Introduce bfq_group [] to track the number of groups for each CLASS.
>> 3. Optimse IO injection, EMQ and Idle mechanism for CLASS_RT.
>>
>> Chunguang Xu (11):
>>  bfq: introduce bfq_entity_to_bfqg helper method
>>  bfq: limit the IO depth of idle_class to 1
>>  bfq: keep the minimun bandwidth for be_class
>>  bfq: expire other class if CLASS_RT is waiting
>>  bfq: optimse IO injection for CLASS_RT
>>  bfq: disallow idle if CLASS_RT waiting for service
>>  bfq: disallow merge CLASS_RT with other class
>>  bfq: introduce bfq.ioprio for cgroup
>>  bfq: convert the type of bfq_group.bfqd to bfq_data*
>>  bfq: remove unnecessary initialization logic
>>  bfq: optimize the calculation of bfq_weight_to_ioprio()
>>
>> block/bfq-cgroup.c  |  99 +++++++++++++++++++++++++++++++----
>> block/bfq-iosched.c |  47 ++++++++++++++---
>> block/bfq-iosched.h |  28 ++++++++--
>> block/bfq-wf2q.c    | 124 +++++++++++++++++++++++++++++++++-----------
>> 4 files changed, 244 insertions(+), 54 deletions(-)
>>
>> -- 
>> 2.30.0
>>
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-03-22  5:45 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-12 11:08 [RFC PATCH v2 00/11] bfq: introduce bfq.ioprio for cgroup brookxu
     [not found] ` <cover.1615527324.git.brookxu@tencent.com>
2021-03-12 11:08   ` [RFC PATCH v2 01/11] bfq: introduce bfq_entity_to_bfqg helper method brookxu
2021-03-12 11:08   ` [RFC PATCH v2 02/11] bfq: convert the type of bfq_group.bfqd to bfq_data* brookxu
2021-03-12 11:08   ` [RFC PATCH v2 03/11] bfq: introduce bfq.ioprio for cgroup brookxu
2021-03-12 11:08   ` [RFC PATCH v2 04/11] bfq: limit the IO depth of idle_class to 1 brookxu
2021-03-12 11:08   ` [RFC PATCH v2 05/11] bfq: keep the minimun bandwidth for be_class brookxu
2021-03-12 11:08   ` [RFC PATCH v2 06/11] bfq: expire other class if CLASS_RT is waiting brookxu
2021-03-12 11:08   ` [RFC PATCH v2 07/11] bfq: optimse IO injection for CLASS_RT brookxu
2021-03-12 11:08   ` [RFC PATCH v2 08/11] bfq: disallow idle if CLASS_RT waiting for service brookxu
2021-03-12 11:08   ` [RFC PATCH v2 09/11] bfq: disallow merge CLASS_RT with other class brookxu
2021-03-12 11:08   ` [RFC PATCH v2 10/11] bfq: remove unnecessary initialization logic brookxu
2021-03-12 11:08   ` [RFC PATCH v2 11/11] bfq: optimize the calculation of bfq_weight_to_ioprio() brookxu
2021-03-21 11:04 ` [RFC PATCH v2 00/11] bfq: introduce bfq.ioprio for cgroup Paolo Valente
2021-03-22  5:44   ` brookxu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).