* [PATCH -next v3 0/2] bugfix for blk-throttle
@ 2022-05-19  8:58 ` Yu Kuai
  0 siblings, 0 replies; 24+ messages in thread
From: Yu Kuai @ 2022-05-19  8:58 UTC (permalink / raw)
  To: tj, axboe, ming.lei, geert
  Cc: cgroups, linux-block, linux-kernel, yukuai3, yi.zhang

Changes in v3:
 - fix a check in patch 1
 - fix link err in patch 2 on 32-bit platform
 - handle overflow in patch 2
Changes in v2:
 - use a new solution suggested by Ming
 - change the title of patch 1
 - add patch 2

Patch 1 fixes that blk-throttle can't work if multiple bios are throttled;
Patch 2 fixes io hang due to configuration updates.

Previous version:
v1 : https://lore.kernel.org/all/20220517134909.2910251-1-yukuai3@huawei.com/
v2 : https://lore.kernel.org/all/20220518072751.1188163-1-yukuai3@huawei.com/
Yu Kuai (2):
  blk-throttle: fix that io throttle can only work for single bio
  blk-throttle: fix io hung due to configuration updates

 block/blk-throttle.c | 104 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 85 insertions(+), 19 deletions(-)

-- 
2.31.1


* [PATCH -next v3 1/2] blk-throttle: fix that io throttle can only work for single bio
  2022-05-19  8:58 ` Yu Kuai
@ 2022-05-19  8:58   ` Yu Kuai
  -1 siblings, 0 replies; 24+ messages in thread
From: Yu Kuai @ 2022-05-19  8:58 UTC (permalink / raw)
  To: tj, axboe, ming.lei, geert
  Cc: cgroups, linux-block, linux-kernel, yukuai3, yi.zhang

commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
introduced a new problem, for example:

[root@localhost ~]# echo "8:0 1024" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
[root@localhost ~]# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs
[root@localhost ~]# dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
[1] 620
[root@localhost ~]# dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
[2] 626
[root@localhost ~]# 1+0 records in
1+0 records out
10240 bytes (10 kB, 10 KiB) copied, 10.0038 s, 1.0 kB/s1+0 records in
1+0 records out

10240 bytes (10 kB, 10 KiB) copied, 9.23076 s, 1.1 kB/s
-> the second bio is issued after 10s instead of 20s.

This is because if some bios are already queued, the current bio is
queued directly and the flag 'BIO_THROTTLED' is set. Later, when the
former bios are dispatched, this bio will be dispatched without waiting
at all, because tg_with_in_bps_limit() returns 0 (no wait) for it.

In order to fix the problem, don't skip flagged bios in
tg_with_in_bps_limit(); and since a split bio can then be double
accounted, compensate the over-accounting in __blk_throtl_bio().
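
For example (illustrative numbers only): a 10k split bio that re-enters the
throttle path would now be charged twice, 10k + 10k = 20k, against the
budget; the compensation in __blk_throtl_bio() subtracts the 10k that was
already charged, unless a new slice has started in the meantime.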

Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-throttle.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 447e1b8722f7..0c37be08ff28 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -811,7 +811,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 	unsigned int bio_size = throtl_bio_data_size(bio);
 
 	/* no need to throttle if this bio's bytes have been accounted */
-	if (bps_limit == U64_MAX || bio_flagged(bio, BIO_THROTTLED)) {
+	if (bps_limit == U64_MAX) {
 		if (wait)
 			*wait = 0;
 		return true;
@@ -921,11 +921,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	unsigned int bio_size = throtl_bio_data_size(bio);
 
 	/* Charge the bio to the group */
-	if (!bio_flagged(bio, BIO_THROTTLED)) {
-		tg->bytes_disp[rw] += bio_size;
-		tg->last_bytes_disp[rw] += bio_size;
-	}
-
+	tg->bytes_disp[rw] += bio_size;
+	tg->last_bytes_disp[rw] += bio_size;
 	tg->io_disp[rw]++;
 	tg->last_io_disp[rw]++;
 
@@ -2121,6 +2118,21 @@ bool __blk_throtl_bio(struct bio *bio)
 			tg->last_low_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
 		throtl_upgrade_check(tg);
+
+		/*
+		 * re-entered bio has accounted bytes already, so try to
+		 * compensate previous over-accounting. However, if new
+		 * slice is started, just forget it.
+		 */
+		if (bio_flagged(bio, BIO_THROTTLED)) {
+			unsigned int bio_size = throtl_bio_data_size(bio);
+
+			if (tg->bytes_disp[rw] >= bio_size)
+				tg->bytes_disp[rw] -= bio_size;
+			if (tg->last_bytes_disp[rw] >= bio_size)
+				tg->last_bytes_disp[rw] -= bio_size;
+		}
+
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
-- 
2.31.1


* [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
@ 2022-05-19  8:58   ` Yu Kuai
  0 siblings, 0 replies; 24+ messages in thread
From: Yu Kuai @ 2022-05-19  8:58 UTC (permalink / raw)
  To: tj, axboe, ming.lei, geert
  Cc: cgroups, linux-block, linux-kernel, yukuai3, yi.zhang

If a new configuration is submitted while a bio is throttled, then a new
waiting time is recalculated regardless of the fact that the bio might
already have waited for some time:

tg_conf_updated
 throtl_start_new_slice
  tg_update_disptime
  throtl_schedule_next_dispatch

Then an io hang can be triggered by repeatedly submitting a new
configuration before the throttled bio is dispatched.

Fix the problem by respecting the time that the throttled bio has already
waited. In order to do that, instead of starting a new slice in
tg_conf_updated(), just update 'bytes_disp' and 'io_disp' based on the new
configuration.
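
For example (illustrative numbers only): if 4k has been dispatched under an
old limit of 8k/s and the limit is lowered to 2k/s, 'bytes_disp' is scaled
to 4k * 2k / 8k = 1k, so the slice and the time already waited are
preserved instead of being reset.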

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-throttle.c | 80 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 67 insertions(+), 13 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0c37be08ff28..aca63148bb83 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1271,7 +1271,58 @@ static int tg_print_conf_uint(struct seq_file *sf, void *v)
 	return 0;
 }
 
-static void tg_conf_updated(struct throtl_grp *tg, bool global)
+static u64 throtl_update_bytes_disp(u64 dispatched, u64 new_limit,
+				    u64 old_limit)
+{
+	if (new_limit == old_limit)
+		return dispatched;
+
+	if (!dispatched)
+		return 0;
+
+	/*
+	 * In the case that multiply will overflow, just return 0. It will only
+	 * let bios to be dispatched earlier.
+	 */
+	if (div64_u64(U64_MAX, dispatched) < new_limit)
+		return 0;
+
+	dispatched *= new_limit;
+	return div64_u64(dispatched, old_limit);
+}
+
+static u32 throtl_update_io_disp(u32 dispatched, u32 new_limit, u32 old_limit)
+{
+	if (new_limit == old_limit)
+		return dispatched;
+
+	if (!dispatched)
+		return 0;
+
+	/*
+	 * In the case that multiply will overflow, just return 0. It will only
+	 * let bios to be dispatched earlier.
+	 */
+	if (UINT_MAX / dispatched < new_limit)
+		return 0;
+
+	dispatched *= new_limit;
+	return dispatched / old_limit;
+}
+
+static void throtl_update_slice(struct throtl_grp *tg, u64 *old_limits)
+{
+	tg->bytes_disp[READ] = throtl_update_bytes_disp(tg->bytes_disp[READ],
+			tg_bps_limit(tg, READ), old_limits[0]);
+	tg->bytes_disp[WRITE] = throtl_update_bytes_disp(tg->bytes_disp[WRITE],
+			tg_bps_limit(tg, WRITE), old_limits[1]);
+	tg->io_disp[READ] = throtl_update_io_disp(tg->io_disp[READ],
+			tg_iops_limit(tg, READ), (u32)old_limits[2]);
+	tg->io_disp[WRITE] = throtl_update_io_disp(tg->io_disp[WRITE],
+			tg_iops_limit(tg, WRITE), (u32)old_limits[3]);
+}
+
+static void tg_conf_updated(struct throtl_grp *tg, u64 *old_limits, bool global)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	struct cgroup_subsys_state *pos_css;
@@ -1310,16 +1361,7 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
 				parent_tg->latency_target);
 	}
 
-	/*
-	 * We're already holding queue_lock and know @tg is valid.  Let's
-	 * apply the new config directly.
-	 *
-	 * Restart the slices for both READ and WRITES. It might happen
-	 * that a group's limit are dropped suddenly and we don't want to
-	 * account recently dispatched IO with new low rate.
-	 */
-	throtl_start_new_slice(tg, READ);
-	throtl_start_new_slice(tg, WRITE);
+	throtl_update_slice(tg, old_limits);
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(tg);
@@ -1327,6 +1369,14 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
 	}
 }
 
+static void tg_get_limits(struct throtl_grp *tg, u64 *limits)
+{
+	limits[0] = tg_bps_limit(tg, READ);
+	limits[1] = tg_bps_limit(tg, WRITE);
+	limits[2] = tg_iops_limit(tg, READ);
+	limits[3] = tg_iops_limit(tg, WRITE);
+}
+
 static ssize_t tg_set_conf(struct kernfs_open_file *of,
 			   char *buf, size_t nbytes, loff_t off, bool is_u64)
 {
@@ -1335,6 +1385,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
 	struct throtl_grp *tg;
 	int ret;
 	u64 v;
+	u64 old_limits[4];
 
 	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
 	if (ret)
@@ -1347,13 +1398,14 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
 		v = U64_MAX;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_get_limits(tg, old_limits);
 
 	if (is_u64)
 		*(u64 *)((void *)tg + of_cft(of)->private) = v;
 	else
 		*(unsigned int *)((void *)tg + of_cft(of)->private) = v;
 
-	tg_conf_updated(tg, false);
+	tg_conf_updated(tg, old_limits, false);
 	ret = 0;
 out_finish:
 	blkg_conf_finish(&ctx);
@@ -1523,6 +1575,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	struct blkg_conf_ctx ctx;
 	struct throtl_grp *tg;
 	u64 v[4];
+	u64 old_limits[4];
 	unsigned long idle_time;
 	unsigned long latency_time;
 	int ret;
@@ -1533,6 +1586,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		return ret;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_get_limits(tg, old_limits);
 
 	v[0] = tg->bps_conf[READ][index];
 	v[1] = tg->bps_conf[WRITE][index];
@@ -1624,7 +1678,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			tg->td->limit_index = LIMIT_LOW;
 	} else
 		tg->td->limit_index = LIMIT_MAX;
-	tg_conf_updated(tg, index == LIMIT_LOW &&
+	tg_conf_updated(tg, old_limits, index == LIMIT_LOW &&
 		tg->td->limit_valid[LIMIT_LOW]);
 	ret = 0;
 out_finish:
-- 
2.31.1


* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
  2022-05-19  8:58   ` Yu Kuai
@ 2022-05-19  9:58   ` Michal Koutný
  2022-05-19 12:14       ` yukuai (C)
  -1 siblings, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2022-05-19  9:58 UTC (permalink / raw)
  To: Yu Kuai
  Cc: tj, axboe, ming.lei, geert, cgroups, linux-block, linux-kernel, yi.zhang

Hello Kuai.

On Thu, May 19, 2022 at 04:58:11PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
> If new configuration is submitted while a bio is throttled, then new
> waiting time is recaculated regardless that the bio might aready wait
> for some time:
> 
> tg_conf_updated
>  throtl_start_new_slice
>   tg_update_disptime
>   throtl_schedule_next_dispatch
> 
> Then io hung can be triggered by always submmiting new configuration
> before the throttled bio is dispatched.

O.K.

> -	/*
> -	 * We're already holding queue_lock and know @tg is valid.  Let's
> -	 * apply the new config directly.
> -	 *
> -	 * Restart the slices for both READ and WRITES. It might happen
> -	 * that a group's limit are dropped suddenly and we don't want to
> -	 * account recently dispatched IO with new low rate.
> -	 */
> -	throtl_start_new_slice(tg, READ);
> -	throtl_start_new_slice(tg, WRITE);
> +	throtl_update_slice(tg, old_limits);

throtl_start_new_slice zeroes the *_disp fields.
If, for instance, a new config allowed only 0.5 of the throughput, the *_disp
fields would be scaled by 0.5.
How does that change help the previously throttled bio to be dispatched any better?

(Is it because you omit the update of slice_{start,end}?)

Thanks,
Michal



* Re: [PATCH -next v3 1/2] blk-throttle: fix that io throttle can only work for single bio
  2022-05-19  8:58   ` Yu Kuai
@ 2022-05-19 10:42   ` Ming Lei
  -1 siblings, 0 replies; 24+ messages in thread
From: Ming Lei @ 2022-05-19 10:42 UTC (permalink / raw)
  To: Yu Kuai; +Cc: tj, axboe, geert, cgroups, linux-block, linux-kernel, yi.zhang

On Thu, May 19, 2022 at 04:58:10PM +0800, Yu Kuai wrote:
> commit 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
> introduce a new problem, for example:
> 
> [root@localhost ~]# echo "8:0 1024" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
> [root@localhost ~]# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs
> [root@localhost ~]# dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
> [1] 620
> [root@localhost ~]# dd if=/dev/zero of=/dev/sda bs=10k count=1 oflag=direct &
> [2] 626
> [root@localhost ~]# 1+0 records in
> 1+0 records out
> 10240 bytes (10 kB, 10 KiB) copied, 10.0038 s, 1.0 kB/s1+0 records in
> 1+0 records out
> 
> 10240 bytes (10 kB, 10 KiB) copied, 9.23076 s, 1.1 kB/s
> -> the second bio is issued after 10s instead of 20s.
> 
> This is because if some bios are already queued, current bio is queued
> directly and the flag 'BIO_THROTTLED' is set. And later, when former
> bios are dispatched, this bio will be dispatched without waiting at all,
> this is due to tg_with_in_bps_limit() return 0 for this bio.
> 
> In order to fix the problem, don't skip flaged bio in
> tg_with_in_bps_limit(), and for the problem that split bio can be
> double accounted, compensate the over-accounting in __blk_throtl_bio().
> 
> Fixes: 9f5ede3c01f9 ("block: throttle split bio in case of iops limit")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming



* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
  2022-05-19  9:58   ` Michal Koutný
@ 2022-05-19 12:14       ` yukuai (C)
  0 siblings, 0 replies; 24+ messages in thread
From: yukuai (C) @ 2022-05-19 12:14 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj, axboe, ming.lei, geert, cgroups, linux-block, linux-kernel, yi.zhang

On 2022/05/19 17:58, Michal Koutný wrote:
> Hello Kuayi.
> 
> On Thu, May 19, 2022 at 04:58:11PM +0800, Yu Kuai <yukuai3@huawei.com> wrote:
>> If new configuration is submitted while a bio is throttled, then new
>> waiting time is recaculated regardless that the bio might aready wait
>> for some time:
>>
>> tg_conf_updated
>>   throtl_start_new_slice
>>    tg_update_disptime
>>    throtl_schedule_next_dispatch
>>
>> Then io hung can be triggered by always submmiting new configuration
>> before the throttled bio is dispatched.
> 
> O.K.
> 
>> -	/*
>> -	 * We're already holding queue_lock and know @tg is valid.  Let's
>> -	 * apply the new config directly.
>> -	 *
>> -	 * Restart the slices for both READ and WRITES. It might happen
>> -	 * that a group's limit are dropped suddenly and we don't want to
>> -	 * account recently dispatched IO with new low rate.
>> -	 */
>> -	throtl_start_new_slice(tg, READ);
>> -	throtl_start_new_slice(tg, WRITE);
>> +	throtl_update_slice(tg, old_limits);
> 
> throtl_start_new_slice zeroes *_disp fields.
Hi,

The problem is not just that the *_disp fields are zeroed; in fact, the
real problem is that 'slice_start' is reset to jiffies.

> If for instance, new config allowed only 0.5 throughput, the *_disp
> fields would be scaled to 0.5.
> How that change helps (better) the previously throttled bio to be dispatched?
> 
tg_with_in_bps_limit() calculates 'wait' based on 'slice_start' and
'bytes_disp':

tg_with_in_bps_limit:
  jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
  tmp = bps_limit * jiffy_elapsed_rnd;
  do_div(tmp, HZ);
  bytes_allowed = tmp; -> how many bytes are allowed in this slice,
		         including dispatched.
  if (tg->bytes_disp[rw] + bio_size <= bytes_allowed)
   *wait = 0 -> no need to wait if this bio is within limit

  extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
  -> extra_bytes is based on 'bytes_disp'

For example:

1) bps_limit is 2k, we issue two io, (1k and 9k)
2) the first io(1k) will be dispatched, bytes_disp = 1k, slice_start = 0
    the second io(9k) is waiting for (9 - (2 - 1)) / 2 = 4 s
3) after 3s, we update bps_limit to 1k, then the new waiting time is calculated:

without this patch: bytes_disp = 0, slice_start = 3:
bytes_allowed = 1k
extra_bytes = 9k - 1k = 8k
wait = 8s

with this patch: bytes_disp = 0.5k, slice_start = 0,
bytes_allowed = 1k * 3 + 1k = 4k
extra_bytes =  0.5k + 9k - 4k = 5.5k
wait = 5.5s
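
A minimal userspace sketch of the same arithmetic (an illustration only, in
kB units, assuming throtl_slice = 1s so the elapsed time passed in is
already rounded up; this is not the kernel code):

#include <stdio.h>

/*
 * Simplified model of tg_with_in_bps_limit():
 *   bytes_allowed = bps_limit * elapsed_rounded
 *   extra_bytes   = bytes_disp + bio_size - bytes_allowed
 *   wait          = extra_bytes / bps_limit
 */
static double wait_secs(double bytes_disp, double bio_size,
			double bps_limit, double elapsed_rounded)
{
	double bytes_allowed = bps_limit * elapsed_rounded;

	if (bytes_disp + bio_size <= bytes_allowed)
		return 0;	/* within limit, no wait needed */

	return (bytes_disp + bio_size - bytes_allowed) / bps_limit;
}

int main(void)
{
	/* without the patch: slice restarted at 3s, nothing accounted */
	printf("without: %.1fs\n", wait_secs(0, 9, 1, 1));
	/* with the patch: slice kept from 0s, 0.5k already accounted */
	printf("with:    %.1fs\n", wait_secs(0.5, 9, 1, 4));
	/* prints "without: 8.0s" and "with: 5.5s", matching the numbers above */
	return 0;
}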

I hope I can explain it clearly...

Thanks,
Kuai
> (Is it because you omit update of slice_{start,end}?)
> 
> Thanks,
> Michal
> 
> .
> 

* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
@ 2022-05-19 16:10         ` Michal Koutný
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Koutný @ 2022-05-19 16:10 UTC (permalink / raw)
  To: yukuai (C)
  Cc: tj, axboe, ming.lei, geert, cgroups, linux-block, linux-kernel, yi.zhang

On Thu, May 19, 2022 at 08:14:28PM +0800, "yukuai (C)" <yukuai3@huawei.com> wrote:
> tg_with_in_bps_limit:
>  jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
>  tmp = bps_limit * jiffy_elapsed_rnd;
>  do_div(tmp, HZ);
>  bytes_allowed = tmp; -> how many bytes are allowed in this slice,
> 		         incluing dispatched.
>  if (tg->bytes_disp[rw] + bio_size <= bytes_allowed)
>   *wait = 0 -> no need to wait if this bio is within limit
> 
>  extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
>  -> extra_bytes is based on 'bytes_disp'
> 
> For example:
> 
> 1) bps_limit is 2k, we issue two io, (1k and 9k)
> 2) the first io(1k) will be dispatched, bytes_disp = 1k, slice_start = 0
>    the second io(9k) is waiting for (9 - (2 - 1)) / 2 = 4 s

The 2nd io arrived at 1s, the wait time is 4s, i.e. it can be dispatched
at 5s (i.e. 10k / 2kB/s = 5s).

> 3) after 3 s, we update bps_limit to 1k, then new waiting is caculated:
> 
> without this patch:  bytes_disp = 0, slict_start =3:
> bytes_allowed = 1k	                            <--- why 1k and not 0?
> extra_bytes = 9k - 1k = 8k
> wait = 8s

This looks like it was calculated at time 4s (1s after new config was
set).

> 
> whth this patch: bytes_disp = 0.5k, slice_start =  0,
> bytes_allowed = 1k * 3 + 1k = 4k
> extra_bytes =  0.5k + 9k - 4k = 5.5k
> wait = 5.5s

This looks like calculated at 4s, so the IO would be waiting till
4s+5.5s = 9.5s.

As I don't know why time 4s is used, I'll shift this calculation to
time 3s (when the config changes):

bytes_disp = 0.5k, slice_start =  0,
bytes_allowed = 1k * 3  = 3k
extra_bytes =  0.5k + 9k - 3k = 7.5k
wait = 7.5s

In absolute time, the IO would wait till 3s+7.5s = 10.5s

OK, either your 9.5s or my 10.5s looks weird (although earlier than
original 4s+8s=12s).
However, the IO should ideally only wait till

    3s + (9k -   (6k    -    1k)     ) / 1k/s =
         bio - (allowed - dispatched)  / new_limit

   =3s + 4k / 1k/s = 7s

   ('allowed' is based on old limit)

Or in another example, what if you change the config from 2k/s to ∞k/s
(unlimited; let's neglect the arithmetic overflow that you handle
explicitly, and imagine a big number, but not so big as to be greater than
the division result).

In such a case, the wait time should be zero, i.e. IO should be
dispatched right at the time of config change.
(With your patch that still calculates a >0 wait time, and the original
behavior gives a >0 wait too.)

> I hope I can expliain it clearly...

Yes, thanks for pointing me to relevant parts.
I hope I grasped them correctly.

IOW, your patch and formula make the wait time shorter but still IO can
be delayed indefinitely if you pass a sequence of new configs. (AFAIU)

Regards,
Michal

* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
@ 2022-05-20  1:22           ` yukuai (C)
  0 siblings, 0 replies; 24+ messages in thread
From: yukuai (C) @ 2022-05-20  1:22 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj, axboe, ming.lei, geert, cgroups, linux-block, linux-kernel, yi.zhang

On 2022/05/20 0:10, Michal Koutný wrote:
> On Thu, May 19, 2022 at 08:14:28PM +0800, "yukuai (C)" <yukuai3@huawei.com> wrote:
>> tg_with_in_bps_limit:
>>   jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
>>   tmp = bps_limit * jiffy_elapsed_rnd;
>>   do_div(tmp, HZ);
>>   bytes_allowed = tmp; -> how many bytes are allowed in this slice,
>> 		         incluing dispatched.
>>   if (tg->bytes_disp[rw] + bio_size <= bytes_allowed)
>>    *wait = 0 -> no need to wait if this bio is within limit
>>
>>   extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
>>   -> extra_bytes is based on 'bytes_disp'
>>
>> For example:
>>
>> 1) bps_limit is 2k, we issue two io, (1k and 9k)
>> 2) the first io(1k) will be dispatched, bytes_disp = 1k, slice_start = 0
>>     the second io(9k) is waiting for (9 - (2 - 1)) / 2 = 4 s
> 
> The 2nd io arrived at 1s, the wait time is 4s, i.e. it can be dispatched
> at 5s (i.e. 10k/*2kB/s = 5s).
No, in the example the second io arrives together with the first io.
> 
>> 3) after 3 s, we update bps_limit to 1k, then new waiting is caculated:
>>
>> without this patch:  bytes_disp = 0, slict_start =3:
>> bytes_allowed = 1k	                            <--- why 1k and not 0?
Because slice_start == jiffies, bytes_allowed is equal to bps_limit
>> extra_bytes = 9k - 1k = 8k
>> wait = 8s
> 
> This looks like it was calculated at time 4s (1s after new config was
> set).
No... it was calculated at time 3s:

jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);

jiffies should be greater than 3s here, thus jiffy_elapsed_rnd is
3s + throtl_slice (I'm using throtl_slice = 1s here; it should not
affect the result)
> 
>>
>> whth this patch: bytes_disp = 0.5k, slice_start =  0,
>> bytes_allowed = 1k * 3 + 1k = 4k
>> extra_bytes =  0.5k + 9k - 4k = 5.5k
>> wait = 5.5s
> 
> This looks like calculated at 4s, so the IO would be waiting till
> 4s+5.5s = 9.5s.
The wait time is based on extra_bytes; it is really 5.5s, and adding 4s
is wrong here.

bytes_allowed = ((jiffies - slice_start) / Hz + 1) * bps_limit
extra_bytes = bio_size + bytes_disp - bytes_allowed
wait = extra_bytes / bps_limit
> 
> As I don't know why using time 4s, I'll shift this calculation to the
> time 3s (when the config changes):
> 
> bytes_disp = 0.5k, slice_start =  0,
> bytes_allowed = 1k * 3  = 3k
> extra_bytes =  0.5k + 9k - 3k = 7.5k
6.5k
> wait = 7.5s
> 
> In absolute time, the IO would wait till 3s+7.5s = 10.5s
Like I said above, wait time should not add (jiffies - slice_start)
> 
> OK, either your 9.5s or my 10.5s looks weird (although earlier than
> original 4s+8s=12s).
> However, the IO should ideally only wait till
> 
>      3s + (9k -   (6k    -    1k)     ) / 1k/s =
>           bio - (allowed - dispatched)  / new_limit
> 
>     =3s + 4k / 1k/s = 7s
> 
>     ('allowed' is based on old limit)
> 
> Or in another example, what if you change the config from 2k/s to ∞k/s
> (unlimited, let's neglect the arithmetic overflow that you handle
> explicitly, imagine a big number but not so big to be greater than
> division result).
> 
> In such a case, the wait time should be zero, i.e. IO should be
> dispatched right at the time of config change.

I thought about it; however, IMO this is not a good idea. If the user
updates the config quite frequently, io throttling becomes ineffective.

Thanks,
Kuai
> (With your patch that still calculates >0 wait time (and the original
> behavior gives >0 wait too.)
> 
>> I hope I can expliain it clearly...
> 
> Yes, thanks for pointing me to relevant parts.
> I hope I grasped them correctly.
> 
> IOW, your patch and formula make the wait time shorter but still IO can
> be delayed indefinitely if you pass a sequence of new configs. (AFAIU)
> 
> Regards,
> Michal
> .
> 

* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
@ 2022-05-20  1:36             ` yukuai (C)
  0 siblings, 0 replies; 24+ messages in thread
From: yukuai (C) @ 2022-05-20  1:36 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj, axboe, ming.lei, geert, cgroups, linux-block, linux-kernel, yi.zhang

On 2022/05/20 9:22, yukuai (C) wrote:
> On 2022/05/20 0:10, Michal Koutný wrote:
>> On Thu, May 19, 2022 at 08:14:28PM +0800, "yukuai (C)" 
>> <yukuai3@huawei.com> wrote:
>>> tg_with_in_bps_limit:
>>>   jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
>>>   tmp = bps_limit * jiffy_elapsed_rnd;
>>>   do_div(tmp, HZ);
>>>   bytes_allowed = tmp; -> how many bytes are allowed in this slice,
>>>                  incluing dispatched.
>>>   if (tg->bytes_disp[rw] + bio_size <= bytes_allowed)
>>>    *wait = 0 -> no need to wait if this bio is within limit
>>>
>>>   extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
>>>   -> extra_bytes is based on 'bytes_disp'
>>>
>>> For example:
>>>
>>> 1) bps_limit is 2k, we issue two io, (1k and 9k)
>>> 2) the first io(1k) will be dispatched, bytes_disp = 1k, slice_start = 0
>>>     the second io(9k) is waiting for (9 - (2 - 1)) / 2 = 4 s
>>
>> The 2nd io arrived at 1s, the wait time is 4s, i.e. it can be dispatched
>> at 5s (i.e. 10k/*2kB/s = 5s).
> No, the example is that the second io arrived together with first io.
>>
>>> 3) after 3 s, we update bps_limit to 1k, then new waiting is caculated:
>>>
>>> without this patch:  bytes_disp = 0, slict_start =3:
>>> bytes_allowed = 1k                                <--- why 1k and not 0?
> Because slice_start == jiffies, bytes_allowed is equal to bps_limit
>>> extra_bytes = 9k - 1k = 8k
>>> wait = 8s
>>
>> This looks like it was calculated at time 4s (1s after new config was
>> set).
> No... it was caculated at time 3s:
> 
> jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
> 
> jiffies should be greater than 3s here, thus jiffy_elapsed_rnd is
> 3s + throtl_slice (I'm using throtl_slice = 1s here, it should not
> affect result)
Hi,

Just to simplify the explanation (assume that throtl_slice is greater than
0.5s):

Without this patch:
the wait time is calculated based on issuing 9k from now (3s) without any
bytes already dispatched.

With this patch:
the wait time is calculated based on issuing 9k from 0s with 0.5k bytes
already dispatched.
>>
>>>
>>> whth this patch: bytes_disp = 0.5k, slice_start =  0,
>>> bytes_allowed = 1k * 3 + 1k = 4k
>>> extra_bytes =  0.5k + 9k - 4k = 5.5k
>>> wait = 5.5s
>>
>> This looks like calculated at 4s, so the IO would be waiting till
>> 4s+5.5s = 9.5s.
> wait time is based on extra_bytes, this is really 5.5s, add 4s is
> wrong here.
> 
> bytes_allowed = ((jiffies - slice_start) / Hz + 1) * bps_limit
> extra_bytes = bio_size + bytes_disp - bytes_allowed
> wait = extra_bytes / bps_limit
>>
>> As I don't know why using time 4s, I'll shift this calculation to the
>> time 3s (when the config changes):
>>
>> bytes_disp = 0.5k, slice_start =  0,
>> bytes_allowed = 1k * 3  = 3k
>> extra_bytes =  0.5k + 9k - 3k = 7.5k
> 6.5k
>> wait = 7.5s
>>
>> In absolute time, the IO would wait till 3s+7.5s = 10.5s
> Like I said above, wait time should not add (jiffies - slice_start)
>>
>> OK, either your 9.5s or my 10.5s looks weird (although earlier than
>> original 4s+8s=12s).
>> However, the IO should ideally only wait till
>>
>>      3s + (9k -   (6k    -    1k)     ) / 1k/s =
>>           bio - (allowed - dispatched)  / new_limit
>>
>>     =3s + 4k / 1k/s = 7s
>>
>>     ('allowed' is based on old limit)
>>
>> Or in another example, what if you change the config from 2k/s to ∞k/s
>> (unlimited, let's neglect the arithmetic overflow that you handle
>> explicitly, imagine a big number but not so big to be greater than
>> division result).
>>
>> In such a case, the wait time should be zero, i.e. IO should be
>> dispatched right at the time of config change.
> 
> I thought about it, however, IMO, this is not a good idea. If user
> updated config quite frequently, io throttle will be invalid.
> 
> Thanks,
> Kuai
>> (With your patch that still calculates >0 wait time (and the original
>> behavior gives >0 wait too.)
>>
>>> I hope I can expliain it clearly...
>>
>> Yes, thanks for pointing me to relevant parts.
>> I hope I grasped them correctly.
>>
>> IOW, your patch and formula make the wait time shorter but still IO can
>> be delayed indefinitely if you pass a sequence of new configs. (AFAIU)
>>
>> Regards,
>> Michal
>> .
>>

* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
  2022-05-20  1:36             ` yukuai (C)
@ 2022-05-20 16:03             ` Michal Koutný
  2022-05-20 16:20               ` Tejun Heo
  2022-05-21  3:01                 ` yukuai (C)
  -1 siblings, 2 replies; 24+ messages in thread
From: Michal Koutný @ 2022-05-20 16:03 UTC (permalink / raw)
  To: yukuai (C)
  Cc: tj, axboe, ming.lei, geert, cgroups, linux-block, linux-kernel, yi.zhang

On Fri, May 20, 2022 at 09:36:11AM +0800, "yukuai (C)" <yukuai3@huawei.com> wrote:
> Just to simplify explanation (assum that throtl_slice is greater than
> 0.5s):
> Without this patch:
> wait time is caculated based on issuing 9k from now(3s) without any
> bytes aready dispatched.

I acknowledge that the pre-patch state is incorrect because it erases
the already-elapsed wait time of the previous slice.

> With this patch:
> wait time is caculated based on issuing 9k from 0s with 0.5 bytes
> aready dispatched.

Thanks for your further hint. Hopefully, I'm getting closer to a real
understanding. Now, I calculate the wait times as durations between the
current moment and the time point when a bio can be dispatched.

IIUC, after config change the ideal wait time of a bio is

    wait_ideal := (disp + bio - Δt*l_old) / l_new

where Δt is the elapsed time of the current slice.
You maintain the slice but scale disp, so you get

    wait_kuai := ((l_new/l_old)*disp + bio - Δt*l_new) / l_new
               = disp / l_old + bio / l_new - Δt

Please confirm we're on the same page here.

Then I look at

    error := wait_kuai - wait_ideal
          ...
	  = (Δt * l_old - disp) * (1/l_new - 1/l_old)
	  = (Δt * l_old - disp) * (1 - α) / (α * l_old)
where
    α = l_new / l_old

The leftmost term is the unconsumed IO of the slice. Say it's positive
while the bigger bio is throttled at the moment of a config change.
If the config change increases throttling (α < 1), the error grows very
high (i.e. over-throttling, similar to the existing behavior).
If the config change relieves throttling (α > 1), the wait time is
slightly shorter (under-throttling) than the ideal.
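
As a sanity check with the example numbers used earlier in the thread
(assuming l_old = 2k/s, l_new = 1k/s, Δt = 3s, disp = 1k, bio = 9k):

    wait_ideal = (1k + 9k - 3s*2k/s) / 1k/s                = 4s
    wait_kuai  = 1k / 2k/s + 9k / 1k/s - 3s                = 6.5s
    error      = (3s*2k/s - 1k) * (1 - 0.5) / (0.5 * 2k/s) = 2.5s

which is consistent with the 7s vs. 9.5s absolute dispatch times discussed
earlier.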

If I were to propose a correction, it'd be like the patch at the bottom,
derived from yours but not finished (the XXX part). It's for potential
further discussion.


I got myself carried away with the formulas. If I go back to the
beginning:

> Then io hung can be triggered by always submmiting new configuration
> before the throttled bio is dispatched.

How big a problem is this actually? Is it only shooting oneself in the foot,
or can there be a user who is privileged enough to modify the throttling
configuration yet not privileged enough to justify the hang's
consequences (like some global FS locks)?


Thanks,
Michal

--- 8< ---
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 469c483719be..3fd458d16f31 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1274,7 +1274,62 @@ static int tg_print_conf_uint(struct seq_file *sf, void *v)
 	return 0;
 }
 
-static void tg_conf_updated(struct throtl_grp *tg, bool global)
+static u64 throtl_update_slice_scale(unsigned int slice_start, u64 new_limit,
+				     u64 old_limit)
+{
+	if (new_limit == old_limit)
+		return slice_start;
+
+	/* This shouldn't really matter but semantically we want to extend the
+	 * slice from the earliest possible point of time. */
+	if (WARN_ON(new_limit == 0))
+		return 0;
+
+	return jiffies - div64_u64((jiffies - slice_start) * old_limit, new_limit);
+}
+
+static void throtl_update_slice(struct throtl_grp *tg, u64 *old_limits)
+{
+	/*
+	 * How does this work? We're going to calculate new wait time in
+	 * tg_with_in_bps_limit(). Ideal wait time after config change is
+	 *
+	 *   wait_ideal := (disp + bio - Δt*l_old) / l_new
+	 *
+	 * where Δt = jiffies - tg->slice_start (elapsed time of slice).
+	 * In reality, the function has no idea about l_old so it calculates
+	 *
+	 *   wait_skewed := (disp + bio - Δt*l_new) / l_new
+	 *
+	 * So we modify slice_start to get correct number
+	 *
+	 *   wait_fixed := (disp + bio - Δt'*l_new) / l_new == wait_ideal
+	 *
+	 * from that
+	 *   Δt' = Δt * l_old / l_new
+	 * or
+	 *   jiffies - slice_start' = (jiffies - slice_start) * l_old / l_new
+	 * .
+	 */
+	tg->slice_start[READ]  = throtl_update_slice_scale(tg->slice_start[READ],
+							   tg_bps_limit(tg, READ),
+							   old_limits[0]);
+	tg->slice_start[WRITE] = throtl_update_slice_scale(tg->slice_start[WRITE],
+							   tg_bps_limit(tg, WRITE),
+							   old_limits[1]);
+
+	// XXX This looks like OK since we should not change BPS and IOPS limit
+	// at the same time but it is not actually OK because scaling
+	// slice_start for one limit breaks the other anyway.
+	tg->slice_start[READ]  = throtl_update_slice_scale(tg->slice_start[READ],
+							   tg_iops_limit(tg, READ),
+							   old_limits[2]);
+	tg->slice_start[WRITE] = throtl_update_slice_scale(tg->slice_start[WRITE],
+							   tg_iops_limit(tg, WRITE),
+							   old_limits[3]);
+}
+
+static void tg_conf_updated(struct throtl_grp *tg, u64 *old_limits, bool global)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	struct cgroup_subsys_state *pos_css;
@@ -1313,16 +1368,7 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
 				parent_tg->latency_target);
 	}
 
-	/*
-	 * We're already holding queue_lock and know @tg is valid.  Let's
-	 * apply the new config directly.
-	 *
-	 * Restart the slices for both READ and WRITES. It might happen
-	 * that a group's limit are dropped suddenly and we don't want to
-	 * account recently dispatched IO with new low rate.
-	 */
-	throtl_start_new_slice(tg, READ);
-	throtl_start_new_slice(tg, WRITE);
+	throtl_update_slice(tg, old_limits);
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(tg);
@@ -1330,6 +1376,14 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
 	}
 }
 
+static void tg_get_limits(struct throtl_grp *tg, u64 *limits)
+{
+	limits[0] = tg_bps_limit(tg, READ);
+	limits[1] = tg_bps_limit(tg, WRITE);
+	limits[2] = tg_iops_limit(tg, READ);
+	limits[3] = tg_iops_limit(tg, WRITE);
+}
+
 static ssize_t tg_set_conf(struct kernfs_open_file *of,
 			   char *buf, size_t nbytes, loff_t off, bool is_u64)
 {
@@ -1338,6 +1392,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
 	struct throtl_grp *tg;
 	int ret;
 	u64 v;
+	u64 old_limits[4];
 
 	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
 	if (ret)
@@ -1350,13 +1405,14 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
 		v = U64_MAX;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_get_limits(tg, old_limits);
 
 	if (is_u64)
 		*(u64 *)((void *)tg + of_cft(of)->private) = v;
 	else
 		*(unsigned int *)((void *)tg + of_cft(of)->private) = v;
 
-	tg_conf_updated(tg, false);
+	tg_conf_updated(tg, old_limits, false);
 	ret = 0;
 out_finish:
 	blkg_conf_finish(&ctx);
@@ -1526,6 +1582,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	struct blkg_conf_ctx ctx;
 	struct throtl_grp *tg;
 	u64 v[4];
+	u64 old_limits[4];
 	unsigned long idle_time;
 	unsigned long latency_time;
 	int ret;
@@ -1536,6 +1593,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		return ret;
 
 	tg = blkg_to_tg(ctx.blkg);
+	tg_get_limits(tg, old_limits);
 
 	v[0] = tg->bps_conf[READ][index];
 	v[1] = tg->bps_conf[WRITE][index];
@@ -1627,7 +1685,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			tg->td->limit_index = LIMIT_LOW;
 	} else
 		tg->td->limit_index = LIMIT_MAX;
-	tg_conf_updated(tg, index == LIMIT_LOW &&
+	tg_conf_updated(tg, old_limits, index == LIMIT_LOW &&
 		tg->td->limit_valid[LIMIT_LOW]);
 	ret = 0;
 out_finish:
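
For completeness, a minimal userspace sketch of the slice_start rescaling
above (illustrative only: it mimics jiffies with plain integers and made-up
limits, and mirrors the simplified wait formula from the comment rather
than the real tg_with_in_bps_limit() path):

#include <stdio.h>

#define HZ_SKETCH 1000ULL	/* pretend jiffies tick at 1000 Hz */

int main(void)
{
	unsigned long long jiffies     = 100000;	/* "now" */
	unsigned long long slice_start = jiffies - 500;	/* slice began 0.5 s ago */
	unsigned long long l_old = 1024, l_new = 512;	/* bytes per second */
	unsigned long long disp  = 256, bio = 9216;	/* bytes */

	/* slice_start' = jiffies - (jiffies - slice_start) * l_old / l_new */
	unsigned long long scaled_start =
		jiffies - (jiffies - slice_start) * l_old / l_new;

	/* Wait using the rescaled Δt' and the new limit, as derived above. */
	double dt_scaled  = (double)(jiffies - scaled_start) / HZ_SKETCH;
	double wait_fixed = (disp + bio - dt_scaled * l_new) / l_new;

	/* Ideal wait using the real elapsed time and the old limit. */
	double dt         = (double)(jiffies - slice_start) / HZ_SKETCH;
	double wait_ideal = (disp + bio - dt * l_old) / l_new;

	/* Both print 17.500 s, i.e. wait_fixed == wait_ideal. */
	printf("wait_fixed = %.3f s, wait_ideal = %.3f s\n",
	       wait_fixed, wait_ideal);
	return 0;
}

It only exercises the bps case; the iops side has the same shape but, as
the XXX notes, rescaling slice_start for one limit perturbs the other.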

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
  2022-05-20 16:03             ` Michal Koutný
@ 2022-05-20 16:20               ` Tejun Heo
  2022-05-21  3:51                   ` yukuai (C)
  2022-05-21  3:01                 ` yukuai (C)
  1 sibling, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2022-05-20 16:20 UTC (permalink / raw)
  To: Michal Koutný
  Cc: yukuai (C),
	axboe, ming.lei, geert, cgroups, linux-block, linux-kernel,
	yi.zhang

Hello,

On Fri, May 20, 2022 at 06:03:05PM +0200, Michal Koutný wrote:
> > Then io hung can be triggered by always submitting a new configuration
> > before the throttled bio is dispatched.
> 
> How big a problem is this actually? Is it only shooting oneself in the foot,
> or can there be a user who is privileged enough to modify the throttling
> configuration yet not privileged enough to justify the hang's consequences
> (like some global FS locks)?

So, the problem in itself is of the self-inflicted type and I'd prefer to
ignore it. Unfortunately, the kernel doesn't have the kind of isolation
where stalling out some arbitrary tasks is generally safe, especially not
blk-throtl as it doesn't handle bio_issue_as_root() and thus can have
pretty severe priority inversions where IOs which can block system-wide
operations (e.g. memory reclaim) get trapped in a random cgroup.

Even ignoring that, the kernel in general assumes some forward progress from
everybody, and when one part stalls, the stall spreads relatively easily to
the rest of the system, sometimes gradually, sometimes suddenly - e.g. if the
stalled IO was being performed while holding the mmap_sem, which isn't rare,
then anything which tries to read its proc cmdline will hang behind it.

So, to prevent local DoS attacks, we want to avoid a situation where a
non-privileged user can cause indefinite UNINTERRUPTIBLE sleeps. I mean,
preventing local attacks is almost never foolproof, but we don't want to
make it too easy at least.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
@ 2022-05-21  3:01                 ` yukuai (C)
  0 siblings, 0 replies; 24+ messages in thread
From: yukuai (C) @ 2022-05-21  3:01 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj, axboe, ming.lei, geert, cgroups, linux-block, linux-kernel, yi.zhang

在 2022/05/21 0:03, Michal Koutný 写道:
> On Fri, May 20, 2022 at 09:36:11AM +0800, "yukuai (C)" <yukuai3@huawei.com> wrote:
>> Just to simplify the explanation (assuming that throtl_slice is greater
>> than 0.5s):
>> Without this patch:
>> wait time is calculated based on issuing 9k from now (3s) without any
>> bytes already dispatched.
> 
> I acknowledge that the pre-patch state is incorrect because it erases the
> already passed wait time from the previous slice.
> 
>> With this patch:
>> wait time is calculated based on issuing 9k from 0s with 0.5 bytes
>> already dispatched.
> 
> Thanks for your further hint. Hopefully, I'm getting closer to a real
> understanding. Now I calculate the wait times as the durations between the
> current moment and the timepoint when a bio can be dispatched.
> 
> IIUC, after config change the ideal wait time of a bio is
> 
>      wait_ideal := (disp + bio - Δt*l_old) / l_new
> 
> where Δt is the elapsed time of the current slice.
> You maintain the slice but scale disp, so you get
> 
>      wait_kuai := ((l_new/l_old)*disp + bio - Δt*l_new) / l_new
>                 = disp / l_old + bio / l_new - Δt
> 
> Please confirm we're on the same page here.
Hi, Michal

Yes we're on the same page here.
> 
> Then I look at
> 
>      error := wait_kuai - wait_ideal
>            ...
> 	  = (Δt * l_old - disp) * (1/l_new - 1/l_old)
> 	  = (Δt * l_old - disp) * (1 - α) / (α * l_old)
> where
>      α = l_new / l_old
> 
> The leftmost term is the unconsumed IO budget of the slice. Say it's
> positive while the bigger bio is throttled at the moment of a config change.
> If the config change increases throttling (α < 1), the error grows very
> high (i.e. over-throttling, similar to the existing behavior).
> If the config change relieves throttling (α > 1), the wait time is
> slightly shorter (under-throttling) wrt the ideal.
Yes, you are right.
> 
> If I were to propose a correction, it'd be like the patch at the bottom,
> derived from yours but not finished (the XXX part). It's here for potential
> further discussion.
I thought about it; however, I was thinking that for such a corner case,
fixing the io hung is probably enough. Now, with the formula that you
sorted out, I agree this is better.

Thanks,
Kuai
> 
> 
> I got myself carried away with the formulas. If I go back to the
> beginning:
> 
>> Then io hung can be triggered by always submitting a new configuration
>> before the throttled bio is dispatched.
> 
> How big a problem is this actually? Is it only shooting oneself in the foot,
> or can there be a user who is privileged enough to modify the throttling
> configuration yet not privileged enough to justify the hang's consequences
> (like some global FS locks)?
> 
> 
> Thanks,
> Michal
> 
> --- 8< ---
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 469c483719be..3fd458d16f31 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1274,7 +1274,62 @@ static int tg_print_conf_uint(struct seq_file *sf, void *v)
>   	return 0;
>   }
>   
> -static void tg_conf_updated(struct throtl_grp *tg, bool global)
> +static u64 throtl_update_slice_scale(unsigned int slice_start, u64 new_limit,
> +				     u64 old_limit)
> +{
> +	if (new_limit == old_limit)
> +		return slice_start;
> +
> +	/* This shouldn't really matter but semantically we want to extend the
> +	 * slice from the earliest possible point of time. */
> +	if (WARN_ON(new_limit == 0))
> +		return 0;
> +
> +	return jiffies - div64_u64((jiffies - slice_start) * old_limit, new_limit);
> +}
> +
> +static void throtl_update_slice(struct throtl_grp *tg, u64 *old_limits)
> +{
> +	/*
> +	 * How does this work? We're going to calculate new wait time in
> +	 * tg_with_in_bps_limit(). Ideal wait time after config change is
> +	 *
> +	 *   wait_ideal := (disp + bio - Δt*l_old) / l_new
> +	 *
> +	 * where Δt = jiffies - tg->slice_start (elapsed time of slice).
> +	 * In reality, the function has no idea about l_old so it calculates
> +	 *
> +	 *   wait_skewed := (disp + bio - Δt*l_new) / l_new
> +	 *
> +	 * So we modify slice_start to get correct number
> +	 *
> +	 *   wait_fixed := (disp + bio - Δt'*l_new) / l_new == wait_ideal
> +	 *
> +	 * from that
> +	 *   Δt' = Δt * l_old / l_new
> +	 * or
> +	 *   jiffies - slice_start' = (jiffies - slice_start) * l_old / l_new
> +	 * .
> +	 */
> +	tg->slice_start[READ]  = throtl_update_slice_scale(tg->slice_start[READ],
> +							   tg_bps_limit(tg, READ),
> +							   old_limits[0]);
> +	tg->slice_start[WRITE] = throtl_update_slice_scale(tg->slice_start[WRITE],
> +							   tg_bps_limit(tg, WRITE),
> +							   old_limits[1]);
> +
> +	// XXX This looks like OK since we should not change BPS and IOPS limit
> +	// at the same time but it is not actually OK because scaling
> +	// slice_start for one limit breaks the other anyway.
> +	tg->slice_start[READ]  = throtl_update_slice_scale(tg->slice_start[READ],
> +							   tg_iops_limit(tg, READ),
> +							   old_limits[2]);
> +	tg->slice_start[WRITE] = throtl_update_slice_scale(tg->slice_start[WRITE],
> +							   tg_iops_limit(tg, WRITE),
> +							   old_limits[3]);
> +}
> +
> +static void tg_conf_updated(struct throtl_grp *tg, u64 *old_limits, bool global)
>   {
>   	struct throtl_service_queue *sq = &tg->service_queue;
>   	struct cgroup_subsys_state *pos_css;
> @@ -1313,16 +1368,7 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
>   				parent_tg->latency_target);
>   	}
>   
> -	/*
> -	 * We're already holding queue_lock and know @tg is valid.  Let's
> -	 * apply the new config directly.
> -	 *
> -	 * Restart the slices for both READ and WRITES. It might happen
> -	 * that a group's limit are dropped suddenly and we don't want to
> -	 * account recently dispatched IO with new low rate.
> -	 */
> -	throtl_start_new_slice(tg, READ);
> -	throtl_start_new_slice(tg, WRITE);
> +	throtl_update_slice(tg, old_limits);
>   
>   	if (tg->flags & THROTL_TG_PENDING) {
>   		tg_update_disptime(tg);
> @@ -1330,6 +1376,14 @@ static void tg_conf_updated(struct throtl_grp *tg, bool global)
>   	}
>   }
>   
> +static void tg_get_limits(struct throtl_grp *tg, u64 *limits)
> +{
> +	limits[0] = tg_bps_limit(tg, READ);
> +	limits[1] = tg_bps_limit(tg, WRITE);
> +	limits[2] = tg_iops_limit(tg, READ);
> +	limits[3] = tg_iops_limit(tg, WRITE);
> +}
> +
>   static ssize_t tg_set_conf(struct kernfs_open_file *of,
>   			   char *buf, size_t nbytes, loff_t off, bool is_u64)
>   {
> @@ -1338,6 +1392,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
>   	struct throtl_grp *tg;
>   	int ret;
>   	u64 v;
> +	u64 old_limits[4];
>   
>   	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
>   	if (ret)
> @@ -1350,13 +1405,14 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
>   		v = U64_MAX;
>   
>   	tg = blkg_to_tg(ctx.blkg);
> +	tg_get_limits(tg, old_limits);
>   
>   	if (is_u64)
>   		*(u64 *)((void *)tg + of_cft(of)->private) = v;
>   	else
>   		*(unsigned int *)((void *)tg + of_cft(of)->private) = v;
>   
> -	tg_conf_updated(tg, false);
> +	tg_conf_updated(tg, old_limits, false);
>   	ret = 0;
>   out_finish:
>   	blkg_conf_finish(&ctx);
> @@ -1526,6 +1582,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
>   	struct blkg_conf_ctx ctx;
>   	struct throtl_grp *tg;
>   	u64 v[4];
> +	u64 old_limits[4];
>   	unsigned long idle_time;
>   	unsigned long latency_time;
>   	int ret;
> @@ -1536,6 +1593,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
>   		return ret;
>   
>   	tg = blkg_to_tg(ctx.blkg);
> +	tg_get_limits(tg, old_limits);
>   
>   	v[0] = tg->bps_conf[READ][index];
>   	v[1] = tg->bps_conf[WRITE][index];
> @@ -1627,7 +1685,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
>   			tg->td->limit_index = LIMIT_LOW;
>   	} else
>   		tg->td->limit_index = LIMIT_MAX;
> -	tg_conf_updated(tg, index == LIMIT_LOW &&
> +	tg_conf_updated(tg, old_limits, index == LIMIT_LOW &&
>   		tg->td->limit_valid[LIMIT_LOW]);
>   	ret = 0;
>   out_finish:
> .
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
  2022-05-20 16:20               ` Tejun Heo
@ 2022-05-21  3:51                   ` yukuai (C)
  0 siblings, 0 replies; 24+ messages in thread
From: yukuai (C) @ 2022-05-21  3:51 UTC (permalink / raw)
  To: Tejun Heo, Michal Koutný
  Cc: axboe, ming.lei, geert, cgroups, linux-block, linux-kernel, yi.zhang

在 2022/05/21 0:20, Tejun Heo 写道:
> Hello,
> 
> On Fri, May 20, 2022 at 06:03:05PM +0200, Michal Koutný wrote:
>>> Then io hung can be triggered by always submitting a new configuration
>>> before the throttled bio is dispatched.
>>
>> How big a problem is this actually? Is it only shooting oneself in the foot,
>> or can there be a user who is privileged enough to modify the throttling
>> configuration yet not privileged enough to justify the hang's consequences
>> (like some global FS locks)?
> 
> So, the problem in itself is of the self-inflicted type and I'd prefer to
> ignore it. Unfortunately, the kernel doesn't have the kind of isolation
> where stalling out some arbitrary tasks is generally safe, especially not
> blk-throtl as it doesn't handle bio_issue_as_root() and thus can have
> pretty severe priority inversions where IOs which can block system-wide
> operations (e.g. memory reclaim) get trapped in a random cgroup.
Hi, Tejun

It's right that the problem is self-inflicted. However, I do think that,
with Michal's suggestion, the way throttled bios are handled while a new
config is submitted really makes sense from a functional point of view.

Do you think the solution is OK?

Thanks,
Kuai
> 
> Even ignoring that, the kernel in general assumes some forward progress from
> everybody, and when one part stalls, the stall spreads relatively easily to
> the rest of the system, sometimes gradually, sometimes suddenly - e.g. if the
> stalled IO was being performed while holding the mmap_sem, which isn't rare,
> then anything which tries to read its proc cmdline will hang behind it.
> 
> So, to prevent local DoS attacks, we want to avoid a situation where a
> non-privileged user can cause indefinite UNINTERRUPTIBLE sleeps. I mean,
> preventing local attacks is almost never foolproof, but we don't want to
> make it too easy at least.
> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates
@ 2022-05-21  5:00                     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2022-05-21  5:00 UTC (permalink / raw)
  To: yukuai (C)
  Cc: Michal Koutný,
	axboe, ming.lei, geert, cgroups, linux-block, linux-kernel,
	yi.zhang

On Sat, May 21, 2022 at 11:51:11AM +0800, yukuai (C) wrote:
> It's right that the problem is self-inflicted. However, I do think that,
> with Michal's suggestion, the way throttled bios are handled while a new
> config is submitted really makes sense from a functional point of view.
> 
> Do you think the solution is OK?

I haven't followed the details, but anything which isn't overly complex and
doesn't produce extra budget or eat into the existing one on a config change
is fine by me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-05-21  5:00 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-19  8:58 [PATCH -next v3 0/2] bugfix for blk-throttle Yu Kuai
2022-05-19  8:58 ` Yu Kuai
2022-05-19  8:58 ` [PATCH -next v3 1/2] blk-throttle: fix that io throttle can only work for single bio Yu Kuai
2022-05-19  8:58   ` Yu Kuai
2022-05-19 10:42   ` Ming Lei
2022-05-19  8:58 ` [PATCH -next v3 2/2] blk-throttle: fix io hung due to configuration updates Yu Kuai
2022-05-19  8:58   ` Yu Kuai
2022-05-19  9:58   ` Michal Koutný
2022-05-19 12:14     ` yukuai (C)
2022-05-19 12:14       ` yukuai (C)
2022-05-19 16:10       ` Michal Koutný
2022-05-19 16:10         ` Michal Koutný
2022-05-20  1:22         ` yukuai (C)
2022-05-20  1:22           ` yukuai (C)
2022-05-20  1:36           ` yukuai (C)
2022-05-20  1:36             ` yukuai (C)
2022-05-20 16:03             ` Michal Koutný
2022-05-20 16:20               ` Tejun Heo
2022-05-21  3:51                 ` yukuai (C)
2022-05-21  3:51                   ` yukuai (C)
2022-05-21  5:00                   ` Tejun Heo
2022-05-21  5:00                     ` Tejun Heo
2022-05-21  3:01               ` yukuai (C)
2022-05-21  3:01                 ` yukuai (C)
