* [PATCH v6 0/3] blk-cgroup: Optimize blkcg_rstat_flush()
@ 2022-06-02 19:20 ` Waiman Long
  0 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2022-06-02 19:20 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

 v6:
  - Add a missing free_percpu() into blkcg_css_free() in patch 3.
  - Integrate the documentation patch 4 back into patch 3.

 v5:
  - Add a new patch 2 to eliminate the use of the intermediate "ret"
    variable in blkcg_css_alloc() to fix a compilation warning reported
    by the kernel test robot.

 v4:
  - Update comment and eliminate "inline" keywords as suggested by TJ.

 v3:
  - Update comments in patch 2.
  - Put rcu_read_lock/unlock() in blkcg_rstat_flush().
  - Use READ_ONCE/WRITE_ONCE() to access lnode->next to reduce data
    races.
  - Get a blkg reference when putting it onto the lockless list and put
    the reference back when it is removed.
 
This patch series improves blkcg_rstat_flush() performance by eliminating
unnecessary blkg enumeration and flush operations for those blkg's and
blkg_iostat_set's that haven't been updated since the last flush.

Waiman Long (3):
  blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit
  blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
  blk-cgroup: Optimize blkcg_rstat_flush()

 block/blk-cgroup.c | 119 +++++++++++++++++++++++++++++++++++++++------
 block/blk-cgroup.h |   9 ++++
 2 files changed, 112 insertions(+), 16 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v6 1/3] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit
@ 2022-06-02 19:20   ` Waiman Long
  0 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2022-06-02 19:20 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

Commit f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup
rstat") changed block cgroup IO stats to use the rstat APIs. It added
a new percpu iostat_cpu field to blkg. blkg_alloc() was modified to
allocate the new percpu iostat_cpu but didn't free it when an error
happened. Fix this by freeing the percpu iostat_cpu on the error exit
path.

Fixes: f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 40161a3f68d0..acd9b0aa8dc8 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -219,11 +219,11 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 		return NULL;
 
 	if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask))
-		goto err_free;
+		goto err_free_blkg;
 
 	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
 	if (!blkg->iostat_cpu)
-		goto err_free;
+		goto err_free_blkg;
 
 	if (!blk_get_queue(q))
 		goto err_free;
@@ -259,6 +259,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	return blkg;
 
 err_free:
+	free_percpu(blkg->iostat_cpu);
+
+err_free_blkg:
 	blkg_free(blkg);
 	return NULL;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v6 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
@ 2022-06-02 19:20   ` Waiman Long
  0 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2022-06-02 19:20 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
Simplify error handling code by returning this error directly instead
of setting an intermediate "ret" variable.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 block/blk-cgroup.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index acd9b0aa8dc8..9021f75fc752 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1177,7 +1177,6 @@ static struct cgroup_subsys_state *
 blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct blkcg *blkcg;
-	struct cgroup_subsys_state *ret;
 	int i;
 
 	mutex_lock(&blkcg_pol_mutex);
@@ -1186,10 +1185,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		blkcg = &blkcg_root;
 	} else {
 		blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
-		if (!blkcg) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!blkcg)
 			goto unlock;
-		}
 	}
 
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
@@ -1206,10 +1203,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			continue;
 
 		cpd = pol->cpd_alloc_fn(GFP_KERNEL);
-		if (!cpd) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!cpd)
 			goto free_pd_blkcg;
-		}
+
 		blkcg->cpd[i] = cpd;
 		cpd->blkcg = blkcg;
 		cpd->plid = i;
@@ -1238,7 +1234,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		kfree(blkcg);
 unlock:
 	mutex_unlock(&blkcg_pol_mutex);
-	return ret;
+	return ERR_PTR(-ENOMEM);
 }
 
 static int blkcg_css_online(struct cgroup_subsys_state *css)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-02 19:20 ` Waiman Long
                   ` (2 preceding siblings ...)
@ 2022-06-02 19:20 ` Waiman Long
  2022-06-04  3:58   ` Ming Lei
                     ` (2 more replies)
  -1 siblings, 3 replies; 31+ messages in thread
From: Waiman Long @ 2022-06-02 19:20 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
can be especially problematic as interrupts are disabled during the
flush. It was reported that it might take seconds to complete in some
extreme cases, leading to hard lockup messages.

As it is likely that not all the percpu blkg_iostat_set's have been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case. This patch optimizes blkcg_rstat_flush()
by keeping a lockless list of recently updated blkg_iostat_set's in a
newly added percpu blkcg->lhead pointer.

The blkg_iostat_set is added to the lockless list on the update side
in blk_cgroup_bio_start(). It is removed from the lockless list when
flushed in blkcg_rstat_flush(). Due to racing, it is possible that
blkg_iostat_set's in the lockless list may have no new IO stats to be
flushed. To protect against destruction of the blkg, a percpu reference
is taken when the entry is put onto the lockless list and released when
it is removed.

A blkg_iostat_set can determine whether it is on a lockless list by
checking its lnode.next pointer, which will be non-NULL while it is on
a list. This requires a special llist_last sentinel node to be put at
the end of the lockless list.
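
As a rough illustration of the sentinel scheme (a simplified sketch
only; bisc_on_llist() is an illustrative helper name, the actual check
is open-coded in the patch):

	/* sentinel shared by all the percpu lockless lists */
	static struct llist_node llist_last;

	/* a percpu lhead is empty when it points at the sentinel */
	static bool blkcg_llist_empty(struct llist_head *lhead)
	{
		return lhead->first == &llist_last;
	}

	/*
	 * Even the last real entry points at the sentinel rather than
	 * NULL, so a non-NULL lnode.next means "already on a list".
	 */
	static bool bisc_on_llist(struct blkg_iostat_set *bisc)
	{
		return READ_ONCE(bisc->lnode.next) != NULL;
	}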

When booting up an instrumented test kernel with this patch on a
2-socket 96-thread system with cgroup v2, 1788 of the 2051 calls to
cgroup_rstat_flush() after bootup exited immediately because of an
empty lockless list. After an all-CPU kernel build, the ratio became
6295424/6340513, i.e. more than 99%.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 100 ++++++++++++++++++++++++++++++++++++++++++---
 block/blk-cgroup.h |   9 ++++
 2 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9021f75fc752..963a779c4cab 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -59,6 +59,71 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
 
 #define BLKG_DESTROY_BATCH_SIZE  64
 
+/*
+ * Lockless lists for tracking IO stats update
+ *
+ * New IO stats are stored in the percpu iostat_cpu within blkcg_gq (blkg).
+ * There are multiple blkg's (one for each block device) attached to each
+ * blkcg. The rstat code keeps track of which cpu has IO stats updated,
+ * but it doesn't know which blkg has the updated stats. If there are many
+ * block devices in a system, the cost of iterating all the blkg's to flush
+ * out the IO stats can be high. To reduce such overhead, a set of percpu
+ * lockless lists (lhead) per blkcg are used to track the set of recently
+ * updated iostat_cpu's since the last flush. An iostat_cpu will be put
+ * onto the lockless list on the update side [blk_cgroup_bio_start()] if
+ * not there yet and then removed when being flushed [blkcg_rstat_flush()].
+ * References to blkg are gotten and then put back in the process to
+ * protect against blkg removal.
+ *
+ * lnode.next of the last entry in a lockless list is NULL. To enable us to
+ * use lnode.next as a boolean flag to indicate its presence in a lockless
+ * list, we have to make it non-NULL for all. This is done by using a
+ * sentinel node at the end of the lockless list. All the percpu lhead's
+ * are initialized to point to that sentinel node as being empty.
+ */
+static struct llist_node llist_last;
+
+static bool blkcg_llist_empty(struct llist_head *lhead)
+{
+	return lhead->first == &llist_last;
+}
+
+static void init_blkcg_llists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
+}
+
+static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
+{
+	return xchg(&lhead->first, &llist_last);
+}
+
+static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
+{
+	struct llist_node *next = READ_ONCE(lnode->next);
+	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
+					    lnode)->blkg;
+
+	WRITE_ONCE(lnode->next, NULL);
+	percpu_ref_put(&blkg->refcnt);
+	return next;
+}
+
+/*
+ * The retrieved blkg_iostat_set is immediately marked as not in the
+ * lockless list by clearing its node->next pointer. It could be put
+ * back into the list by a parallel update before the iostat's are
+ * finally flushed including probably the new update.
+ */
+#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
+	for (; (node != &llist_last) &&					\
+	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
+		nxt = fetch_delete_lnode_next(node), true);		\
+		node = nxt)
+
 /**
  * blkcg_css - find the current css
  *
@@ -236,8 +301,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;
 
 	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
@@ -852,17 +919,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode, *lnext;
+	struct blkg_iostat_set *bisc;
 
 	/* Root-level stats are sourced from system-wide IO stats */
 	if (!cgroup_parent(css->cgroup))
 		return;
 
+	if (blkcg_llist_empty(lhead))
+		return;
+
 	rcu_read_lock();
 
-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = fetch_delete_blkcg_llist(lhead);
+	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
+		struct blkcg_gq *blkg = bisc->blkg;
 		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
 		struct blkg_iostat cur, delta;
 		unsigned long flags;
 		unsigned int seq;
@@ -1170,6 +1243,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
 
 	mutex_unlock(&blkcg_pol_mutex);
 
+	free_percpu(blkcg->lhead);
 	kfree(blkcg);
 }
 
@@ -1189,6 +1263,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			goto unlock;
 	}
 
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		goto free_blkcg;
+	init_blkcg_llists(blkcg);
+
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkcg_policy_data *cpd;
@@ -1229,7 +1308,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	for (i--; i >= 0; i--)
 		if (blkcg->cpd[i])
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
 	if (blkcg != &blkcg_root)
 		kfree(blkcg);
 unlock:
@@ -1993,6 +2073,7 @@ static int blk_cgroup_io_type(struct bio *bio)
 
 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
 	int rwd = blk_cgroup_io_type(bio), cpu;
 	struct blkg_iostat_set *bis;
 	unsigned long flags;
@@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 	bis->cur.ios[rwd]++;
 
+	if (!READ_ONCE(bis->lnode.next)) {
+		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+
+		llist_add(&bis->lnode, lhead);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
 	put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d4de0a35e066..2c36362a332e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include <linux/cgroup.h>
 #include <linux/kthread.h>
 #include <linux/blk-mq.h>
+#include <linux/llist.h>
 
 struct blkcg_gq;
 struct blkg_policy_data;
@@ -43,6 +44,8 @@ struct blkg_iostat {
 
 struct blkg_iostat_set {
 	struct u64_stats_sync		sync;
+	struct llist_node		lnode;
+	struct blkcg_gq		       *blkg;
 	struct blkg_iostat		cur;
 	struct blkg_iostat		last;
 };
@@ -97,6 +100,12 @@ struct blkcg {
 	struct blkcg_policy_data	*cpd[BLKCG_MAX_POLS];
 
 	struct list_head		all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu	*lhead;
+
 #ifdef CONFIG_BLK_CGROUP_FC_APPID
 	char                            fc_app_id[FC_APPID_LEN];
 #endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
  2022-06-02 19:20   ` Waiman Long
@ 2022-06-02 20:39   ` Tejun Heo
  -1 siblings, 0 replies; 31+ messages in thread
From: Tejun Heo @ 2022-06-02 20:39 UTC (permalink / raw)
  To: Waiman Long; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On Thu, Jun 02, 2022 at 03:20:19PM -0400, Waiman Long wrote:
> For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
> Simplify error handling code by returning this error directly instead
> of setting an intermediate "ret" variable.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 1/3] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit
@ 2022-06-04  2:08     ` Ming Lei
  0 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2022-06-04  2:08 UTC (permalink / raw)
  To: Waiman Long; +Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel

On Thu, Jun 02, 2022 at 03:20:18PM -0400, Waiman Long wrote:
> Commit f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup
> rstat") changes block cgroup IO stats to use the rstat APIs. It added
> a new percpu iostat_cpu field into blkg. The blkg_alloc() was modified
> to allocate the new percpu iostat_cpu but didn't free it when an error
> happened. Fix this by freeing the percpu iostat_cpu on error exit.
> 
> Fixes: f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Tejun Heo <tj@kernel.org>
> ---
>  block/blk-cgroup.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 40161a3f68d0..acd9b0aa8dc8 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -219,11 +219,11 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>  		return NULL;
>  
>  	if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask))
> -		goto err_free;
> +		goto err_free_blkg;
>  
>  	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
>  	if (!blkg->iostat_cpu)
> -		goto err_free;
> +		goto err_free_blkg;
>  
>  	if (!blk_get_queue(q))
>  		goto err_free;
> @@ -259,6 +259,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>  	return blkg;
>  
>  err_free:
> +	free_percpu(blkg->iostat_cpu);
> +
> +err_free_blkg:
>  	blkg_free(blkg);

Hi Waiman,

But blkg_free() frees blkg->iostat_cpu via blkg_free_workfn(), so I am
confused: where is the leak in the failure path?


Thanks
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
@ 2022-06-04  2:16     ` Ming Lei
  0 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2022-06-04  2:16 UTC (permalink / raw)
  To: Waiman Long; +Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel

On Thu, Jun 02, 2022 at 03:20:19PM -0400, Waiman Long wrote:
> For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
> Simplify error handling code by returning this error directly instead
> of setting an intermediate "ret" variable.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Looks fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 1/3] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit
  2022-06-04  2:08     ` Ming Lei
@ 2022-06-04  2:47     ` Waiman Long
  -1 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2022-06-04  2:47 UTC (permalink / raw)
  To: Ming Lei; +Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel

On 6/3/22 22:08, Ming Lei wrote:
> On Thu, Jun 02, 2022 at 03:20:18PM -0400, Waiman Long wrote:
>> Commit f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup
>> rstat") changes block cgroup IO stats to use the rstat APIs. It added
>> a new percpu iostat_cpu field into blkg. The blkg_alloc() was modified
>> to allocate the new percpu iostat_cpu but didn't free it when an error
>> happened. Fix this by freeing the percpu iostat_cpu on error exit.
>>
>> Fixes: f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> Acked-by: Tejun Heo <tj@kernel.org>
>> ---
>>   block/blk-cgroup.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index 40161a3f68d0..acd9b0aa8dc8 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -219,11 +219,11 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>>   		return NULL;
>>   
>>   	if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask))
>> -		goto err_free;
>> +		goto err_free_blkg;
>>   
>>   	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
>>   	if (!blkg->iostat_cpu)
>> -		goto err_free;
>> +		goto err_free_blkg;
>>   
>>   	if (!blk_get_queue(q))
>>   		goto err_free;
>> @@ -259,6 +259,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>>   	return blkg;
>>   
>>   err_free:
>> +	free_percpu(blkg->iostat_cpu);
>> +
>> +err_free_blkg:
>>   	blkg_free(blkg);
> Hi Waiman,
>
> But blkg_free() frees blkg->iostat_cpu via blkg_free_workfn(), so I am
> confused where the leak is in failure path?

Yes, you are right. I had overlooked that, so this patch isn't really
necessary. Thanks for correcting me.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-02 19:20 ` [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
@ 2022-06-04  3:58   ` Ming Lei
  2022-06-05 23:15     ` Waiman Long
  2022-06-06  3:16     ` Ming Lei
  2022-06-08 16:57     ` Michal Koutný
  2 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2022-06-04  3:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel, ming.lei

Hi Waiman,

On Thu, Jun 02, 2022 at 03:20:20PM -0400, Waiman Long wrote:
> For a system with many CPUs and block devices, the time to do
> blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
> can be especially problematic as interrupt is disabled during the flush.
> It was reported that it might take seconds to complete in some extreme
> cases leading to hard lockup messages.
> 
> As it is likely that not all the percpu blkg_iostat_set's has been
> updated since the last flush, those stale blkg_iostat_set's don't need
> to be flushed in this case. This patch optimizes blkcg_rstat_flush()
> by keeping a lockless list of recently updated blkg_iostat_set's in a
> newly added percpu blkcg->lhead pointer.
> 
> The blkg_iostat_set is added to the lockless list on the update side
> in blk_cgroup_bio_start(). It is removed from the lockless list when
> flushed in blkcg_rstat_flush(). Due to racing, it is possible that
> blk_iostat_set's in the lockless list may have no new IO stats to be
> flushed. To protect against destruction of blkg, a percpu reference is
> gotten when putting into the lockless list and put back when removed.
> 
> A blkg_iostat_set can determine if it is in a lockless list by checking
> the content of its lnode.next pointer which will be non-NULL when in
> a lockless list. This requires the presence of a special llist_last
> sentinel node to be put at the end of the lockless list.
> 
> When booting up an instrumented test kernel with this patch on a
> 2-socket 96-thread system with cgroup v2, out of the 2051 calls to
> cgroup_rstat_flush() after bootup, 1788 of the calls were exited
> immediately because of empty lockless list. After an all-cpu kernel
> build, the ratio became 6295424/6340513. That was more than 99%.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Tejun Heo <tj@kernel.org>
> ---
>  block/blk-cgroup.c | 100 ++++++++++++++++++++++++++++++++++++++++++---
>  block/blk-cgroup.h |   9 ++++
>  2 files changed, 103 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 9021f75fc752..963a779c4cab 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -59,6 +59,71 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
>  
>  #define BLKG_DESTROY_BATCH_SIZE  64
>  
> +/*
> + * Lockless lists for tracking IO stats update
> + *
> + * New IO stats are stored in the percpu iostat_cpu within blkcg_gq (blkg).
> + * There are multiple blkg's (one for each block device) attached to each
> + * blkcg. The rstat code keeps track of which cpu has IO stats updated,
> + * but it doesn't know which blkg has the updated stats. If there are many
> + * block devices in a system, the cost of iterating all the blkg's to flush
> + * out the IO stats can be high. To reduce such overhead, a set of percpu
> + * lockless lists (lhead) per blkcg are used to track the set of recently
> + * updated iostat_cpu's since the last flush. An iostat_cpu will be put
> + * onto the lockless list on the update side [blk_cgroup_bio_start()] if
> + * not there yet and then removed when being flushed [blkcg_rstat_flush()].
> + * References to blkg are gotten and then put back in the process to
> + * protect against blkg removal.
> + *
> + * lnode.next of the last entry in a lockless list is NULL. To enable us to
> + * use lnode.next as a boolean flag to indicate its presence in a lockless
> + * list, we have to make it non-NULL for all. This is done by using a
> + * sentinel node at the end of the lockless list. All the percpu lhead's
> + * are initialized to point to that sentinel node as being empty.
> + */
> +static struct llist_node llist_last;
> +
> +static bool blkcg_llist_empty(struct llist_head *lhead)
> +{
> +	return lhead->first == &llist_last;
> +}
> +
> +static void init_blkcg_llists(struct blkcg *blkcg)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
> +}
> +
> +static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
> +{
> +	return xchg(&lhead->first, &llist_last);
> +}
> +
> +static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
> +{
> +	struct llist_node *next = READ_ONCE(lnode->next);
> +	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
> +					    lnode)->blkg;
> +
> +	WRITE_ONCE(lnode->next, NULL);
> +	percpu_ref_put(&blkg->refcnt);
> +	return next;
> +}
> +
> +/*
> + * The retrieved blkg_iostat_set is immediately marked as not in the
> + * lockless list by clearing its node->next pointer. It could be put
> + * back into the list by a parallel update before the iostat's are
> + * finally flushed including probably the new update.
> + */
> +#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
> +	for (; (node != &llist_last) &&					\
> +	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
> +		nxt = fetch_delete_lnode_next(node), true);		\
> +		node = nxt)
> +
>  /**
>   * blkcg_css - find the current css
>   *
> @@ -236,8 +301,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>  	blkg->blkcg = blkcg;
>  
>  	u64_stats_init(&blkg->iostat.sync);
> -	for_each_possible_cpu(cpu)
> +	for_each_possible_cpu(cpu) {
>  		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
> +		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
> +	}
>  
>  	for (i = 0; i < BLKCG_MAX_POLS; i++) {
>  		struct blkcg_policy *pol = blkcg_policy[i];
> @@ -852,17 +919,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
>  static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>  {
>  	struct blkcg *blkcg = css_to_blkcg(css);
> -	struct blkcg_gq *blkg;
> +	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> +	struct llist_node *lnode, *lnext;
> +	struct blkg_iostat_set *bisc;
>  
>  	/* Root-level stats are sourced from system-wide IO stats */
>  	if (!cgroup_parent(css->cgroup))
>  		return;
>  
> +	if (blkcg_llist_empty(lhead))
> +		return;
> +
>  	rcu_read_lock();
>  
> -	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> +	lnode = fetch_delete_blkcg_llist(lhead);
> +	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
> +		struct blkcg_gq *blkg = bisc->blkg;
>  		struct blkcg_gq *parent = blkg->parent;
> -		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
>  		struct blkg_iostat cur, delta;
>  		unsigned long flags;
>  		unsigned int seq;
> @@ -1170,6 +1243,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
>  
>  	mutex_unlock(&blkcg_pol_mutex);
>  
> +	free_percpu(blkcg->lhead);
>  	kfree(blkcg);
>  }
>  
> @@ -1189,6 +1263,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
>  			goto unlock;
>  	}
>  
> +	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
> +	if (!blkcg->lhead)
> +		goto free_blkcg;
> +	init_blkcg_llists(blkcg);
> +
>  	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
>  		struct blkcg_policy *pol = blkcg_policy[i];
>  		struct blkcg_policy_data *cpd;
> @@ -1229,7 +1308,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
>  	for (i--; i >= 0; i--)
>  		if (blkcg->cpd[i])
>  			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
> -
> +	free_percpu(blkcg->lhead);
> +free_blkcg:
>  	if (blkcg != &blkcg_root)
>  		kfree(blkcg);
>  unlock:
> @@ -1993,6 +2073,7 @@ static int blk_cgroup_io_type(struct bio *bio)
>  
>  void blk_cgroup_bio_start(struct bio *bio)
>  {
> +	struct blkcg *blkcg = bio->bi_blkg->blkcg;
>  	int rwd = blk_cgroup_io_type(bio), cpu;
>  	struct blkg_iostat_set *bis;
>  	unsigned long flags;
> @@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>  	}
>  	bis->cur.ios[rwd]++;
>  
> +	if (!READ_ONCE(bis->lnode.next)) {
> +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> +
> +		llist_add(&bis->lnode, lhead);
> +		percpu_ref_get(&bis->blkg->refcnt);
> +	}

The above still adds cost to the fast IO path.

> +
>  	u64_stats_update_end_irqrestore(&bis->sync, flags);
>  	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
> -		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
> +		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
>  	put_cpu();
>  }

IMO, this seems to be a generic cgroup issue. More importantly, the
percpu cgroup_rstat_cpu_lock is held in both cgroup_rstat_updated()
and cgroup_rstat_flush_locked(), which can provide enough
synchronization with zero extra cost, and meanwhile other cgroup
controllers can benefit from this kind of much simpler improvement.

So what do you think of the following approach?

BTW, the cpumask could be replaced with one plain percpu variable to
avoid cache-line conflicts.

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 23ec30f50cca..f8287fced726 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -858,6 +858,11 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 	if (!cgroup_parent(css->cgroup))
 		return;
 
+	if (!cpumask_test_cpu(cpu, blkcg->iostat_cpumask))
+		return;
+
+	cpumask_clear_cpu(cpu, blkcg->iostat_cpumask);
+
 	rcu_read_lock();
 
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
@@ -1170,6 +1175,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
 
 	mutex_unlock(&blkcg_pol_mutex);
 
+	free_cpumask_var(blkcg->iostat_cpumask);
 	kfree(blkcg);
 }
 
@@ -1213,6 +1219,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			pol->cpd_init_fn(cpd);
 	}
 
+	if (!zalloc_cpumask_var(&blkcg->iostat_cpumask, GFP_KERNEL))
+		goto free_pd_blkcg;
+
 	spin_lock_init(&blkcg->lock);
 	refcount_set(&blkcg->online_pin, 1);
 	INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT | __GFP_NOWARN);
@@ -2009,7 +2018,8 @@ void blk_cgroup_bio_start(struct bio *bio)
 
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu,
+				bio->bi_blkg->blkcg->iostat_cpumask);
 	put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d4de0a35e066..458b40ca045a 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -103,6 +103,7 @@ struct blkcg {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct list_head		cgwb_list;
 #endif
+	cpumask_var_t			iostat_cpumask;
 };
 
 static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0d1ada8968d7..4fa5dde3a62c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -763,7 +763,7 @@ static inline struct cgroup *cgroup_get_from_id(u64 id)
 /*
  * cgroup scalable recursive statistics.
  */
-void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
+void cgroup_rstat_updated(struct cgroup *cgrp, int cpu, cpumask_var_t cpumask);
 void cgroup_rstat_flush(struct cgroup *cgrp);
 void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 24b5c2ab5598..f4eb63b86e56 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -22,7 +22,7 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
  * rstat_cpu->updated_children list.  See the comment on top of
  * cgroup_rstat_cpu definition for details.
  */
-void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+void cgroup_rstat_updated(struct cgroup *cgrp, int cpu, cpumask_var_t cpumask)
 {
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	unsigned long flags;
@@ -40,6 +40,9 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 
 	raw_spin_lock_irqsave(cpu_lock, flags);
 
+	if (cpumask)
+		cpumask_set_cpu(cpu, cpumask);
+
 	/* put @cgrp and all ancestors on the corresponding updated lists */
 	while (true) {
 		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
@@ -366,7 +369,7 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
 						 unsigned long flags)
 {
 	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
-	cgroup_rstat_updated(cgrp, smp_processor_id());
+	cgroup_rstat_updated(cgrp, smp_processor_id(), NULL);
 	put_cpu_ptr(rstatc);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index abec50f31fe6..8c4f204dbf5b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -622,7 +622,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
 	unsigned int x;
 
-	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id(), NULL);
 
 	x = __this_cpu_add_return(stats_updates, abs(val));
 	if (x > MEMCG_CHARGE_BATCH) {
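
The plain percpu variable variant mentioned above could look roughly
like this (an untested sketch; the iostat_updated field name is made
up, and it would be allocated with alloc_percpu(int) in
blkcg_css_alloc()):

	/* in struct blkcg, instead of the cpumask */
	int __percpu			*iostat_updated;

	/* update side, e.g. in blk_cgroup_bio_start() */
	*per_cpu_ptr(blkcg->iostat_updated, cpu) = 1;

	/* flush side, at the top of blkcg_rstat_flush() */
	if (!*per_cpu_ptr(blkcg->iostat_updated, cpu))
		return;
	*per_cpu_ptr(blkcg->iostat_updated, cpu) = 0;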

Thanks, 
Ming


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-04  3:58   ` Ming Lei
@ 2022-06-05 23:15     ` Waiman Long
  2022-06-06  1:39       ` Ming Lei
  0 siblings, 1 reply; 31+ messages in thread
From: Waiman Long @ 2022-06-05 23:15 UTC (permalink / raw)
  To: Ming Lei; +Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel

On 6/3/22 23:58, Ming Lei wrote:
> Hi Waiman,
>
> On Thu, Jun 02, 2022 at 03:20:20PM -0400, Waiman Long wrote:
>> For a system with many CPUs and block devices, the time to do
>> blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
>> can be especially problematic as interrupt is disabled during the flush.
>> It was reported that it might take seconds to complete in some extreme
>> cases leading to hard lockup messages.
>>
>> As it is likely that not all the percpu blkg_iostat_set's has been
>> updated since the last flush, those stale blkg_iostat_set's don't need
>> to be flushed in this case. This patch optimizes blkcg_rstat_flush()
>> by keeping a lockless list of recently updated blkg_iostat_set's in a
>> newly added percpu blkcg->lhead pointer.
>>
>> The blkg_iostat_set is added to the lockless list on the update side
>> in blk_cgroup_bio_start(). It is removed from the lockless list when
>> flushed in blkcg_rstat_flush(). Due to racing, it is possible that
>> blk_iostat_set's in the lockless list may have no new IO stats to be
>> flushed. To protect against destruction of blkg, a percpu reference is
>> gotten when putting into the lockless list and put back when removed.
>>
>> A blkg_iostat_set can determine if it is in a lockless list by checking
>> the content of its lnode.next pointer which will be non-NULL when in
>> a lockless list. This requires the presence of a special llist_last
>> sentinel node to be put at the end of the lockless list.
>>
>> When booting up an instrumented test kernel with this patch on a
>> 2-socket 96-thread system with cgroup v2, out of the 2051 calls to
>> cgroup_rstat_flush() after bootup, 1788 of the calls were exited
>> immediately because of empty lockless list. After an all-cpu kernel
>> build, the ratio became 6295424/6340513. That was more than 99%.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> Acked-by: Tejun Heo <tj@kernel.org>
>> ---
>>   block/blk-cgroup.c | 100 ++++++++++++++++++++++++++++++++++++++++++---
>>   block/blk-cgroup.h |   9 ++++
>>   2 files changed, 103 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index 9021f75fc752..963a779c4cab 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -59,6 +59,71 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
>>   
>>   #define BLKG_DESTROY_BATCH_SIZE  64
>>   
>> +/*
>> + * Lockless lists for tracking IO stats update
>> + *
>> + * New IO stats are stored in the percpu iostat_cpu within blkcg_gq (blkg).
>> + * There are multiple blkg's (one for each block device) attached to each
>> + * blkcg. The rstat code keeps track of which cpu has IO stats updated,
>> + * but it doesn't know which blkg has the updated stats. If there are many
>> + * block devices in a system, the cost of iterating all the blkg's to flush
>> + * out the IO stats can be high. To reduce such overhead, a set of percpu
>> + * lockless lists (lhead) per blkcg are used to track the set of recently
>> + * updated iostat_cpu's since the last flush. An iostat_cpu will be put
>> + * onto the lockless list on the update side [blk_cgroup_bio_start()] if
>> + * not there yet and then removed when being flushed [blkcg_rstat_flush()].
>> + * References to blkg are gotten and then put back in the process to
>> + * protect against blkg removal.
>> + *
>> + * lnode.next of the last entry in a lockless list is NULL. To enable us to
>> + * use lnode.next as a boolean flag to indicate its presence in a lockless
>> + * list, we have to make it non-NULL for all. This is done by using a
>> + * sentinel node at the end of the lockless list. All the percpu lhead's
>> + * are initialized to point to that sentinel node as being empty.
>> + */
>> +static struct llist_node llist_last;
>> +
>> +static bool blkcg_llist_empty(struct llist_head *lhead)
>> +{
>> +	return lhead->first == &llist_last;
>> +}
>> +
>> +static void init_blkcg_llists(struct blkcg *blkcg)
>> +{
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu)
>> +		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
>> +}
>> +
>> +static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
>> +{
>> +	return xchg(&lhead->first, &llist_last);
>> +}
>> +
>> +static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
>> +{
>> +	struct llist_node *next = READ_ONCE(lnode->next);
>> +	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
>> +					    lnode)->blkg;
>> +
>> +	WRITE_ONCE(lnode->next, NULL);
>> +	percpu_ref_put(&blkg->refcnt);
>> +	return next;
>> +}
>> +
>> +/*
>> + * The retrieved blkg_iostat_set is immediately marked as not in the
>> + * lockless list by clearing its node->next pointer. It could be put
>> + * back into the list by a parallel update before the iostat's are
>> + * finally flushed including probably the new update.
>> + */
>> +#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
>> +	for (; (node != &llist_last) &&					\
>> +	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
>> +		nxt = fetch_delete_lnode_next(node), true);		\
>> +		node = nxt)
>> +
>>   /**
>>    * blkcg_css - find the current css
>>    *
>> @@ -236,8 +301,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>>   	blkg->blkcg = blkcg;
>>   
>>   	u64_stats_init(&blkg->iostat.sync);
>> -	for_each_possible_cpu(cpu)
>> +	for_each_possible_cpu(cpu) {
>>   		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
>> +		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
>> +	}
>>   
>>   	for (i = 0; i < BLKCG_MAX_POLS; i++) {
>>   		struct blkcg_policy *pol = blkcg_policy[i];
>> @@ -852,17 +919,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
>>   static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>>   {
>>   	struct blkcg *blkcg = css_to_blkcg(css);
>> -	struct blkcg_gq *blkg;
>> +	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
>> +	struct llist_node *lnode, *lnext;
>> +	struct blkg_iostat_set *bisc;
>>   
>>   	/* Root-level stats are sourced from system-wide IO stats */
>>   	if (!cgroup_parent(css->cgroup))
>>   		return;
>>   
>> +	if (blkcg_llist_empty(lhead))
>> +		return;
>> +
>>   	rcu_read_lock();
>>   
>> -	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
>> +	lnode = fetch_delete_blkcg_llist(lhead);
>> +	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
>> +		struct blkcg_gq *blkg = bisc->blkg;
>>   		struct blkcg_gq *parent = blkg->parent;
>> -		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
>>   		struct blkg_iostat cur, delta;
>>   		unsigned long flags;
>>   		unsigned int seq;
>> @@ -1170,6 +1243,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
>>   
>>   	mutex_unlock(&blkcg_pol_mutex);
>>   
>> +	free_percpu(blkcg->lhead);
>>   	kfree(blkcg);
>>   }
>>   
>> @@ -1189,6 +1263,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
>>   			goto unlock;
>>   	}
>>   
>> +	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
>> +	if (!blkcg->lhead)
>> +		goto free_blkcg;
>> +	init_blkcg_llists(blkcg);
>> +
>>   	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
>>   		struct blkcg_policy *pol = blkcg_policy[i];
>>   		struct blkcg_policy_data *cpd;
>> @@ -1229,7 +1308,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
>>   	for (i--; i >= 0; i--)
>>   		if (blkcg->cpd[i])
>>   			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
>> -
>> +	free_percpu(blkcg->lhead);
>> +free_blkcg:
>>   	if (blkcg != &blkcg_root)
>>   		kfree(blkcg);
>>   unlock:
>> @@ -1993,6 +2073,7 @@ static int blk_cgroup_io_type(struct bio *bio)
>>   
>>   void blk_cgroup_bio_start(struct bio *bio)
>>   {
>> +	struct blkcg *blkcg = bio->bi_blkg->blkcg;
>>   	int rwd = blk_cgroup_io_type(bio), cpu;
>>   	struct blkg_iostat_set *bis;
>>   	unsigned long flags;
>> @@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>>   	}
>>   	bis->cur.ios[rwd]++;
>>   
>> +	if (!READ_ONCE(bis->lnode.next)) {
>> +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
>> +
>> +		llist_add(&bis->lnode, lhead);
>> +		percpu_ref_get(&bis->blkg->refcnt);
>> +	}
> The above still adds cost in fast io path.

That is true, but it depends on how often cgroup_rstat_flush*() is
called. There is a one-time setup cost after a flush; subsequent IO ops
on the same device and CPU have negligible cost.


>
>> +
>>   	u64_stats_update_end_irqrestore(&bis->sync, flags);
>>   	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
>> -		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
>> +		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
>>   	put_cpu();
>>   }
> IMO, it seems one cgroup generic issue. More importantly, the percpu
> lock of cgroup_rstat_cpu_lock is held in both cgroup_rstat_updated()
> and cgroup_rstat_flush_locked(), which can provide enough sync with
> zero extra cost, meantime other cgroups can benefit from this kind of
> much simpler improvement.
>
> So what do you think of the following approach?
>
> BTW, the cpumask can be replaced with one plain percpu variable for avoiding
> cache conflict.
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 23ec30f50cca..f8287fced726 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -858,6 +858,11 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>   	if (!cgroup_parent(css->cgroup))
>   		return;
>   
> +	if (!cpumask_test_cpu(cpu, blkcg->iostat_cpumask))
> +		return;
> +
> +	cpumask_clear_cpu(cpu, blkcg->iostat_cpumask);
> +
>   	rcu_read_lock();
>   
>   	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> @@ -1170,6 +1175,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
>   
>   	mutex_unlock(&blkcg_pol_mutex);
>   
> +	free_cpumask_var(blkcg->iostat_cpumask);
>   	kfree(blkcg);
>   }
>   
> @@ -1213,6 +1219,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
>   			pol->cpd_init_fn(cpd);
>   	}
>   
> +	if (!zalloc_cpumask_var(&blkcg->iostat_cpumask, GFP_KERNEL))
> +		goto free_pd_blkcg;
> +
>   	spin_lock_init(&blkcg->lock);
>   	refcount_set(&blkcg->online_pin, 1);
>   	INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT | __GFP_NOWARN);
> @@ -2009,7 +2018,8 @@ void blk_cgroup_bio_start(struct bio *bio)
>   
>   	u64_stats_update_end_irqrestore(&bis->sync, flags);
>   	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
> -		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
> +		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu,
> +				bio->bi_blkg->blkcg->iostat_cpumask);
>   	put_cpu();
>   }
>   
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index d4de0a35e066..458b40ca045a 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -103,6 +103,7 @@ struct blkcg {
>   #ifdef CONFIG_CGROUP_WRITEBACK
>   	struct list_head		cgwb_list;
>   #endif
> +	cpumask_var_t			iostat_cpumask;
>   };
>   
>   static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 0d1ada8968d7..4fa5dde3a62c 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -763,7 +763,7 @@ static inline struct cgroup *cgroup_get_from_id(u64 id)
>   /*
>    * cgroup scalable recursive statistics.
>    */
> -void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
> +void cgroup_rstat_updated(struct cgroup *cgrp, int cpu, cpumask_var_t cpumask);
>   void cgroup_rstat_flush(struct cgroup *cgrp);
>   void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
>   void cgroup_rstat_flush_hold(struct cgroup *cgrp);
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 24b5c2ab5598..f4eb63b86e56 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -22,7 +22,7 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
>    * rstat_cpu->updated_children list.  See the comment on top of
>    * cgroup_rstat_cpu definition for details.
>    */
> -void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
> +void cgroup_rstat_updated(struct cgroup *cgrp, int cpu, cpumask_var_t cpumask)
>   {
>   	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
>   	unsigned long flags;
> @@ -40,6 +40,9 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>   
>   	raw_spin_lock_irqsave(cpu_lock, flags);
>   
> +	if (cpumask)
> +		cpumask_set_cpu(cpu, cpumask);
> +
>   	/* put @cgrp and all ancestors on the corresponding updated lists */
>   	while (true) {
>   		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
> @@ -366,7 +369,7 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
>   						 unsigned long flags)
>   {
>   	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
> -	cgroup_rstat_updated(cgrp, smp_processor_id());
> +	cgroup_rstat_updated(cgrp, smp_processor_id(), NULL);
>   	put_cpu_ptr(rstatc);
>   }
>   
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index abec50f31fe6..8c4f204dbf5b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -622,7 +622,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>   {
>   	unsigned int x;
>   
> -	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
> +	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id(), NULL);
>   
>   	x = __this_cpu_add_return(stats_updates, abs(val));
>   	if (x > MEMCG_CHARGE_BATCH) {

I think the rstat set of functions is doing that already, so the flush
will only visit CPUs that have called cgroup_rstat_updated() before.
However, one deficiency I am aware of is that there is no bitmap of
which controllers have updates. The problem I saw in cgroup v2 is that,
in a cgroup with both the memory and block controllers enabled, a later
cgroup_rstat_updated() call from the memory cgroup causes the rstat
code to call into the block cgroup flush method even though there is
no update in the block controller. This is an area that needs
improvement.

Your code does allow the block controller to be aware of that and avoid
further action, but I think it has to be done in the rstat code to be
applicable to all controllers instead of being specific to the block
controller.

There is another problem with this approach. Suppose the system has 20
block devices and only one of them has an IO operation; the flush
method still needs to iterate over all 20 blkg's to do an update. The
block controller is special in that the number of per-cgroup IO stats
depends on the number of block devices present, while other controllers
just have one set of stats per cgroup.

Thanks,
Longman



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-05 23:15     ` Waiman Long
@ 2022-06-06  1:39       ` Ming Lei
  2022-06-06  1:59           ` Waiman Long
  0 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2022-06-06  1:39 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel, ming.lei

On Sun, Jun 05, 2022 at 07:15:27PM -0400, Waiman Long wrote:
> On 6/3/22 23:58, Ming Lei wrote:
> > Hi Waiman,
> > 
> > On Thu, Jun 02, 2022 at 03:20:20PM -0400, Waiman Long wrote:
> > > For a system with many CPUs and block devices, the time to do
> > > blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
> > > can be especially problematic as interrupt is disabled during the flush.
> > > It was reported that it might take seconds to complete in some extreme
> > > cases leading to hard lockup messages.
> > > 
> > > As it is likely that not all the percpu blkg_iostat_set's has been
> > > updated since the last flush, those stale blkg_iostat_set's don't need
> > > to be flushed in this case. This patch optimizes blkcg_rstat_flush()
> > > by keeping a lockless list of recently updated blkg_iostat_set's in a
> > > newly added percpu blkcg->lhead pointer.
> > > 
> > > The blkg_iostat_set is added to the lockless list on the update side
> > > in blk_cgroup_bio_start(). It is removed from the lockless list when
> > > flushed in blkcg_rstat_flush(). Due to racing, it is possible that
> > > blk_iostat_set's in the lockless list may have no new IO stats to be
> > > flushed. To protect against destruction of blkg, a percpu reference is
> > > gotten when putting into the lockless list and put back when removed.
> > > 
> > > A blkg_iostat_set can determine if it is in a lockless list by checking
> > > the content of its lnode.next pointer which will be non-NULL when in
> > > a lockless list. This requires the presence of a special llist_last
> > > sentinel node to be put at the end of the lockless list.
> > > 
> > > When booting up an instrumented test kernel with this patch on a
> > > 2-socket 96-thread system with cgroup v2, out of the 2051 calls to
> > > cgroup_rstat_flush() after bootup, 1788 of the calls were exited
> > > immediately because of empty lockless list. After an all-cpu kernel
> > > build, the ratio became 6295424/6340513. That was more than 99%.
> > > 
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > Acked-by: Tejun Heo <tj@kernel.org>
> > > ---
> > >   block/blk-cgroup.c | 100 ++++++++++++++++++++++++++++++++++++++++++---
> > >   block/blk-cgroup.h |   9 ++++
> > >   2 files changed, 103 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > > index 9021f75fc752..963a779c4cab 100644
> > > --- a/block/blk-cgroup.c
> > > +++ b/block/blk-cgroup.c
> > > @@ -59,6 +59,71 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
> > >   #define BLKG_DESTROY_BATCH_SIZE  64
> > > +/*
> > > + * Lockless lists for tracking IO stats update
> > > + *
> > > + * New IO stats are stored in the percpu iostat_cpu within blkcg_gq (blkg).
> > > + * There are multiple blkg's (one for each block device) attached to each
> > > + * blkcg. The rstat code keeps track of which cpu has IO stats updated,
> > > + * but it doesn't know which blkg has the updated stats. If there are many
> > > + * block devices in a system, the cost of iterating all the blkg's to flush
> > > + * out the IO stats can be high. To reduce such overhead, a set of percpu
> > > + * lockless lists (lhead) per blkcg are used to track the set of recently
> > > + * updated iostat_cpu's since the last flush. An iostat_cpu will be put
> > > + * onto the lockless list on the update side [blk_cgroup_bio_start()] if
> > > + * not there yet and then removed when being flushed [blkcg_rstat_flush()].
> > > + * References to blkg are gotten and then put back in the process to
> > > + * protect against blkg removal.
> > > + *
> > > + * lnode.next of the last entry in a lockless list is NULL. To enable us to
> > > + * use lnode.next as a boolean flag to indicate its presence in a lockless
> > > + * list, we have to make it non-NULL for all. This is done by using a
> > > + * sentinel node at the end of the lockless list. All the percpu lhead's
> > > + * are initialized to point to that sentinel node as being empty.
> > > + */
> > > +static struct llist_node llist_last;
> > > +
> > > +static bool blkcg_llist_empty(struct llist_head *lhead)
> > > +{
> > > +	return lhead->first == &llist_last;
> > > +}
> > > +
> > > +static void init_blkcg_llists(struct blkcg *blkcg)
> > > +{
> > > +	int cpu;
> > > +
> > > +	for_each_possible_cpu(cpu)
> > > +		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
> > > +}
> > > +
> > > +static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
> > > +{
> > > +	return xchg(&lhead->first, &llist_last);
> > > +}
> > > +
> > > +static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
> > > +{
> > > +	struct llist_node *next = READ_ONCE(lnode->next);
> > > +	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
> > > +					    lnode)->blkg;
> > > +
> > > +	WRITE_ONCE(lnode->next, NULL);
> > > +	percpu_ref_put(&blkg->refcnt);
> > > +	return next;
> > > +}
> > > +
> > > +/*
> > > + * The retrieved blkg_iostat_set is immediately marked as not in the
> > > + * lockless list by clearing its node->next pointer. It could be put
> > > + * back into the list by a parallel update before the iostat's are
> > > + * finally flushed including probably the new update.
> > > + */
> > > +#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
> > > +	for (; (node != &llist_last) &&					\
> > > +	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
> > > +		nxt = fetch_delete_lnode_next(node), true);		\
> > > +		node = nxt)
> > > +
> > >   /**
> > >    * blkcg_css - find the current css
> > >    *
> > > @@ -236,8 +301,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
> > >   	blkg->blkcg = blkcg;
> > >   	u64_stats_init(&blkg->iostat.sync);
> > > -	for_each_possible_cpu(cpu)
> > > +	for_each_possible_cpu(cpu) {
> > >   		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
> > > +		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
> > > +	}
> > >   	for (i = 0; i < BLKCG_MAX_POLS; i++) {
> > >   		struct blkcg_policy *pol = blkcg_policy[i];
> > > @@ -852,17 +919,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
> > >   static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > >   {
> > >   	struct blkcg *blkcg = css_to_blkcg(css);
> > > -	struct blkcg_gq *blkg;
> > > +	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> > > +	struct llist_node *lnode, *lnext;
> > > +	struct blkg_iostat_set *bisc;
> > >   	/* Root-level stats are sourced from system-wide IO stats */
> > >   	if (!cgroup_parent(css->cgroup))
> > >   		return;
> > > +	if (blkcg_llist_empty(lhead))
> > > +		return;
> > > +
> > >   	rcu_read_lock();
> > > -	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> > > +	lnode = fetch_delete_blkcg_llist(lhead);
> > > +	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
> > > +		struct blkcg_gq *blkg = bisc->blkg;
> > >   		struct blkcg_gq *parent = blkg->parent;
> > > -		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
> > >   		struct blkg_iostat cur, delta;
> > >   		unsigned long flags;
> > >   		unsigned int seq;
> > > @@ -1170,6 +1243,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
> > >   	mutex_unlock(&blkcg_pol_mutex);
> > > +	free_percpu(blkcg->lhead);
> > >   	kfree(blkcg);
> > >   }
> > > @@ -1189,6 +1263,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
> > >   			goto unlock;
> > >   	}
> > > +	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
> > > +	if (!blkcg->lhead)
> > > +		goto free_blkcg;
> > > +	init_blkcg_llists(blkcg);
> > > +
> > >   	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
> > >   		struct blkcg_policy *pol = blkcg_policy[i];
> > >   		struct blkcg_policy_data *cpd;
> > > @@ -1229,7 +1308,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
> > >   	for (i--; i >= 0; i--)
> > >   		if (blkcg->cpd[i])
> > >   			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
> > > -
> > > +	free_percpu(blkcg->lhead);
> > > +free_blkcg:
> > >   	if (blkcg != &blkcg_root)
> > >   		kfree(blkcg);
> > >   unlock:
> > > @@ -1993,6 +2073,7 @@ static int blk_cgroup_io_type(struct bio *bio)
> > >   void blk_cgroup_bio_start(struct bio *bio)
> > >   {
> > > +	struct blkcg *blkcg = bio->bi_blkg->blkcg;
> > >   	int rwd = blk_cgroup_io_type(bio), cpu;
> > >   	struct blkg_iostat_set *bis;
> > >   	unsigned long flags;
> > > @@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
> > >   	}
> > >   	bis->cur.ios[rwd]++;
> > > +	if (!READ_ONCE(bis->lnode.next)) {
> > > +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> > > +
> > > +		llist_add(&bis->lnode, lhead);
> > > +		percpu_ref_get(&bis->blkg->refcnt);
> > > +	}
> > The above still adds cost in fast io path.
> 
> That is true, but it depends on how often is cgroup_rstat_flush*() is
> called. There is a one time setup cost after a flush. Subsequent IO ops on
> the same device and cpu will have negligible cost.

OK.

> 
> 
> > 
> > > +
> > >   	u64_stats_update_end_irqrestore(&bis->sync, flags);
> > >   	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
> > > -		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
> > > +		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
> > >   	put_cpu();
> > >   }
> > IMO, it seems one cgroup generic issue. More importantly, the percpu
> > lock of cgroup_rstat_cpu_lock is held in both cgroup_rstat_updated()
> > and cgroup_rstat_flush_locked(), which can provide enough sync with
> > zero extra cost, meantime other cgroups can benefit from this kind of
> > much simpler improvement.
> > 
> > So what do you think of the following approach?
> > 
> > BTW, the cpumask can be replaced with one plain percpu variable for avoiding
> > cache conflict.
> > 
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 23ec30f50cca..f8287fced726 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -858,6 +858,11 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >   	if (!cgroup_parent(css->cgroup))
> >   		return;
> > +	if (!cpumask_test_cpu(cpu, blkcg->iostat_cpumask))
> > +		return;
> > +
> > +	cpumask_clear_cpu(cpu, blkcg->iostat_cpumask);
> > +
> >   	rcu_read_lock();
> >   	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> > @@ -1170,6 +1175,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
> >   	mutex_unlock(&blkcg_pol_mutex);
> > +	free_cpumask_var(blkcg->iostat_cpumask);
> >   	kfree(blkcg);
> >   }
> > @@ -1213,6 +1219,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
> >   			pol->cpd_init_fn(cpd);
> >   	}
> > +	if (!zalloc_cpumask_var(&blkcg->iostat_cpumask, GFP_KERNEL))
> > +		goto free_pd_blkcg;
> > +
> >   	spin_lock_init(&blkcg->lock);
> >   	refcount_set(&blkcg->online_pin, 1);
> >   	INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT | __GFP_NOWARN);
> > @@ -2009,7 +2018,8 @@ void blk_cgroup_bio_start(struct bio *bio)
> >   	u64_stats_update_end_irqrestore(&bis->sync, flags);
> >   	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
> > -		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
> > +		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu,
> > +				bio->bi_blkg->blkcg->iostat_cpumask);
> >   	put_cpu();
> >   }
> > diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> > index d4de0a35e066..458b40ca045a 100644
> > --- a/block/blk-cgroup.h
> > +++ b/block/blk-cgroup.h
> > @@ -103,6 +103,7 @@ struct blkcg {
> >   #ifdef CONFIG_CGROUP_WRITEBACK
> >   	struct list_head		cgwb_list;
> >   #endif
> > +	cpumask_var_t			iostat_cpumask;
> >   };
> >   static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
> > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> > index 0d1ada8968d7..4fa5dde3a62c 100644
> > --- a/include/linux/cgroup.h
> > +++ b/include/linux/cgroup.h
> > @@ -763,7 +763,7 @@ static inline struct cgroup *cgroup_get_from_id(u64 id)
> >   /*
> >    * cgroup scalable recursive statistics.
> >    */
> > -void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
> > +void cgroup_rstat_updated(struct cgroup *cgrp, int cpu, cpumask_var_t cpumask);
> >   void cgroup_rstat_flush(struct cgroup *cgrp);
> >   void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
> >   void cgroup_rstat_flush_hold(struct cgroup *cgrp);
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index 24b5c2ab5598..f4eb63b86e56 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -22,7 +22,7 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
> >    * rstat_cpu->updated_children list.  See the comment on top of
> >    * cgroup_rstat_cpu definition for details.
> >    */
> > -void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
> > +void cgroup_rstat_updated(struct cgroup *cgrp, int cpu, cpumask_var_t cpumask)
> >   {
> >   	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> >   	unsigned long flags;
> > @@ -40,6 +40,9 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
> >   	raw_spin_lock_irqsave(cpu_lock, flags);
> > +	if (cpumask)
> > +		cpumask_set_cpu(cpu, cpumask);
> > +
> >   	/* put @cgrp and all ancestors on the corresponding updated lists */
> >   	while (true) {
> >   		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
> > @@ -366,7 +369,7 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
> >   						 unsigned long flags)
> >   {
> >   	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
> > -	cgroup_rstat_updated(cgrp, smp_processor_id());
> > +	cgroup_rstat_updated(cgrp, smp_processor_id(), NULL);
> >   	put_cpu_ptr(rstatc);
> >   }
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index abec50f31fe6..8c4f204dbf5b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -622,7 +622,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >   {
> >   	unsigned int x;
> > -	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
> > +	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id(), NULL);
> >   	x = __this_cpu_add_return(stats_updates, abs(val));
> >   	if (x > MEMCG_CHARGE_BATCH) {
> 
> I think the rstat set of functions are doing that already. So flush will
> only call CPUs that have called cgroup_rstat_updated() before. However, one

Yeah, I guess the detail is in cgroup_rstat_cpu_pop_updated(), but the
percpu lock (raw_spin_lock_irqsave) is still required, and
cgroup_rstat_cpu_pop_updated() is still called even though there isn't
any update on this CPU.

> deficiency that I am aware of is that there is no bitmap of which controller
> have update. The problem that I saw in cgroup v2 is that in a cgroup with
> both memory controller and block controller enabled, a
> cgroup_rstat_updated() call from memory cgroup later causes the rstat
> function to call into block cgroup flush method even though there is no
> update in the block controller. This is an area that needs improvement.
> 
> Your code does allow the block controller to be aware of that and avoid
> further action, but I think it has to be done in the rstat code to be
> applicable to all controllers instead of just specific to block controller.

I guess it can be done by adding one percpu variable to 'struct cgroup'.

> 
> There is another problem that this approach. Suppose the system have 20
> block devices and one of them has an IO operation. Now the flush method
> still needs to iterate all the 20 blkg's to do an update. The block
> controller is kind of special that the number of per-cgroup IO stats depends
> on the number of block devices present. Other controllers just have one set
> of stats per cgroup.

Yeah, and this one is really a blkio-specific issue, and your patch
does cover it. Maybe you can add a callback to cgroup_rstat_updated(),
so the "blkg_iostat_set" instance is added into a percpu list under the
percpu cgroup_rstat_cpu_lock; then the lockless list isn't needed.
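
Something like the following is what I have in mind (totally untested;
the extra callback parameter is hypothetical, the per-cpu lhead would
become a plain 'struct list_head' since it is then protected by the
percpu cgroup_rstat_cpu_lock, and blkg reference handling is omitted):

  /* hypothetical signature change */
  void cgroup_rstat_updated(struct cgroup *cgrp, int cpu,
			    void (*updated_fn)(void *data, int cpu),
			    void *data);

  /* block/blk-cgroup.c */
  static void blkcg_track_iostat(void *data, int cpu)
  {
	struct blkg_iostat_set *bis = data;
	struct blkcg *blkcg = bis->blkg->blkcg;

	/* runs under the percpu cgroup_rstat_cpu_lock */
	if (list_empty(&bis->lnode))
		list_add(&bis->lnode, per_cpu_ptr(blkcg->lhead, cpu));
  }

	/* in blk_cgroup_bio_start() */
	cgroup_rstat_updated(blkcg->css.cgroup, cpu, blkcg_track_iostat, bis);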


Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-06  1:39       ` Ming Lei
@ 2022-06-06  1:59           ` Waiman Long
  0 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2022-06-06  1:59 UTC (permalink / raw)
  To: Ming Lei; +Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel

On 6/5/22 21:39, Ming Lei wrote:
> On Sun, Jun 05, 2022 at 07:15:27PM -0400, Waiman Long wrote:
>> On 6/3/22 23:58, Ming Lei wrote:
>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index abec50f31fe6..8c4f204dbf5b 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -622,7 +622,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>>>    {
>>>    	unsigned int x;
>>> -	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
>>> +	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id(), NULL);
>>>    	x = __this_cpu_add_return(stats_updates, abs(val));
>>>    	if (x > MEMCG_CHARGE_BATCH) {
>> I think the rstat set of functions are doing that already. So flush will
>> only call CPUs that have called cgroup_rstat_updated() before. However, one
> Yeah, I guess the detail is in cgroup_rstat_cpu_pop_updated(), but the
> percpu lock(raw_spin_lock_irqsave) is still required, and cgroup_rstat_cpu_pop_updated()
> is still called even through there isn't any update on this CPU.
Yes, I think we may need to add a bitmask of which controllers have
updates to the cgroup_rstat_cpu structure.
>
>> deficiency that I am aware of is that there is no bitmap of which controller
>> have update. The problem that I saw in cgroup v2 is that in a cgroup with
>> both memory controller and block controller enabled, a
>> cgroup_rstat_updated() call from memory cgroup later causes the rstat
>> function to call into block cgroup flush method even though there is no
>> update in the block controller. This is an area that needs improvement.
>>
>> Your code does allow the block controller to be aware of that and avoid
>> further action, but I think it has to be done in the rstat code to be
>> applicable to all controllers instead of just specific to block controller.
> I guess it can be done by adding one percpu variable to 'struct cgroup'.
>
>> There is another problem that this approach. Suppose the system have 20
>> block devices and one of them has an IO operation. Now the flush method
>> still needs to iterate all the 20 blkg's to do an update. The block
>> controller is kind of special that the number of per-cgroup IO stats depends
>> on the number of block devices present. Other controllers just have one set
>> of stats per cgroup.
> Yeah, and this one is really blkio specific issue, and your patch does
> cover this one. Maybe you can add one callback to
> cgroup_rstat_updated(), so the "blkg_iostat_set" instance is added into
> percpu list under percpu lock of cgroup_rstat_cpu_lock, then the lockless
> list isn't needed.

The rstat API is generic. It may not be a good idea to put
controller-specific information into it. Yes, cgroup_rstat_cpu_lock is
taken on the read side (flush), but it may not be taken on the write
side (update). So it may not be easy to rely on this lock for
synchronization between the read and write sides.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-06  1:59           ` Waiman Long
  (?)
@ 2022-06-06  2:23           ` Ming Lei
  2022-06-06  2:58               ` Waiman Long
  -1 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2022-06-06  2:23 UTC (permalink / raw)
  To: Waiman Long; +Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel

On Sun, Jun 05, 2022 at 09:59:50PM -0400, Waiman Long wrote:
> On 6/5/22 21:39, Ming Lei wrote:
> > On Sun, Jun 05, 2022 at 07:15:27PM -0400, Waiman Long wrote:
> > > On 6/3/22 23:58, Ming Lei wrote:
> > > 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index abec50f31fe6..8c4f204dbf5b 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -622,7 +622,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> > > >    {
> > > >    	unsigned int x;
> > > > -	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
> > > > +	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id(), NULL);
> > > >    	x = __this_cpu_add_return(stats_updates, abs(val));
> > > >    	if (x > MEMCG_CHARGE_BATCH) {
> > > I think the rstat set of functions are doing that already. So flush will
> > > only call CPUs that have called cgroup_rstat_updated() before. However, one
> > Yeah, I guess the detail is in cgroup_rstat_cpu_pop_updated(), but the
> > percpu lock(raw_spin_lock_irqsave) is still required, and cgroup_rstat_cpu_pop_updated()
> > is still called even through there isn't any update on this CPU.
> Yes, I think we may need to add a bitmask of what controllers have updates
> in cgroup_rstat_cpu structure.
> > 
> > > deficiency that I am aware of is that there is no bitmap of which controller
> > > have update. The problem that I saw in cgroup v2 is that in a cgroup with
> > > both memory controller and block controller enabled, a
> > > cgroup_rstat_updated() call from memory cgroup later causes the rstat
> > > function to call into block cgroup flush method even though there is no
> > > update in the block controller. This is an area that needs improvement.
> > > 
> > > Your code does allow the block controller to be aware of that and avoid
> > > further action, but I think it has to be done in the rstat code to be
> > > applicable to all controllers instead of just specific to block controller.
> > I guess it can be done by adding one percpu variable to 'struct cgroup'.
> > 
> > > There is another problem that this approach. Suppose the system have 20
> > > block devices and one of them has an IO operation. Now the flush method
> > > still needs to iterate all the 20 blkg's to do an update. The block
> > > controller is kind of special that the number of per-cgroup IO stats depends
> > > on the number of block devices present. Other controllers just have one set
> > > of stats per cgroup.
> > Yeah, and this one is really blkio specific issue, and your patch does
> > cover this one. Maybe you can add one callback to
> > cgroup_rstat_updated(), so the "blkg_iostat_set" instance is added into
> > percpu list under percpu lock of cgroup_rstat_cpu_lock, then the lockless
> > list isn't needed.
> 
> The rstat API is generic. It may not be a good idea to put controller
> specific information into it. Yes, cgroup_rstat_cpu_lock is taken at the
> read side (flush). It may not taken on the write side (update). So it may

Both cgroup_rstat_flush_locked() and cgroup_rstat_updated() take the
percpu cgroup_rstat_cpu_lock, so the newly invented lockless list can
be replaced with a plain list.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-06  2:23           ` Ming Lei
@ 2022-06-06  2:58               ` Waiman Long
  0 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2022-06-06  2:58 UTC (permalink / raw)
  To: Ming Lei; +Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel

On 6/5/22 22:23, Ming Lei wrote:
> On Sun, Jun 05, 2022 at 09:59:50PM -0400, Waiman Long wrote:
>> On 6/5/22 21:39, Ming Lei wrote:
>>> On Sun, Jun 05, 2022 at 07:15:27PM -0400, Waiman Long wrote:
>>>> On 6/3/22 23:58, Ming Lei wrote:
>>>>
>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>> index abec50f31fe6..8c4f204dbf5b 100644
>>>>> --- a/mm/memcontrol.c
>>>>> +++ b/mm/memcontrol.c
>>>>> @@ -622,7 +622,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>>>>>     {
>>>>>     	unsigned int x;
>>>>> -	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
>>>>> +	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id(), NULL);
>>>>>     	x = __this_cpu_add_return(stats_updates, abs(val));
>>>>>     	if (x > MEMCG_CHARGE_BATCH) {
>>>> I think the rstat set of functions are doing that already. So flush will
>>>> only call CPUs that have called cgroup_rstat_updated() before. However, one
>>> Yeah, I guess the detail is in cgroup_rstat_cpu_pop_updated(), but the
>>> percpu lock(raw_spin_lock_irqsave) is still required, and cgroup_rstat_cpu_pop_updated()
>>> is still called even through there isn't any update on this CPU.
>> Yes, I think we may need to add a bitmask of what controllers have updates
>> in cgroup_rstat_cpu structure.
>>>> deficiency that I am aware of is that there is no bitmap of which controller
>>>> have update. The problem that I saw in cgroup v2 is that in a cgroup with
>>>> both memory controller and block controller enabled, a
>>>> cgroup_rstat_updated() call from memory cgroup later causes the rstat
>>>> function to call into block cgroup flush method even though there is no
>>>> update in the block controller. This is an area that needs improvement.
>>>>
>>>> Your code does allow the block controller to be aware of that and avoid
>>>> further action, but I think it has to be done in the rstat code to be
>>>> applicable to all controllers instead of just specific to block controller.
>>> I guess it can be done by adding one percpu variable to 'struct cgroup'.
>>>
>>>> There is another problem that this approach. Suppose the system have 20
>>>> block devices and one of them has an IO operation. Now the flush method
>>>> still needs to iterate all the 20 blkg's to do an update. The block
>>>> controller is kind of special that the number of per-cgroup IO stats depends
>>>> on the number of block devices present. Other controllers just have one set
>>>> of stats per cgroup.
>>> Yeah, and this one is really blkio specific issue, and your patch does
>>> cover this one. Maybe you can add one callback to
>>> cgroup_rstat_updated(), so the "blkg_iostat_set" instance is added into
>>> percpu list under percpu lock of cgroup_rstat_cpu_lock, then the lockless
>>> list isn't needed.
>> The rstat API is generic. It may not be a good idea to put controller
>> specific information into it. Yes, cgroup_rstat_cpu_lock is taken at the
>> read side (flush). It may not taken on the write side (update). So it may
> Both cgroup_rstat_flush_locked()/cgroup_rstat_updated() take the percpu
> cgroup_rstat_cpu_lock, so the new invented lockless list can be
> replaced with plain list.

cgroup_rstat_updated() should only take the percpu cgroup_rstat_cpu_lock
the first time it transitions from "!updated" to "updated". After that,
it returns immediately without taking the lock. With a regular list, you
would have to take the lock every time a new block device has an update,
so there isn't much saving on the update side. In general, the
lock/unlock sequence has a bit more overhead than the lockless
insertion. On the flush side, there may be a bit of saving, but that is
not the fast path.
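
For reference, the early exit in cgroup_rstat_updated() looks roughly
like this (paraphrased from kernel/cgroup/rstat.c; details may differ
between kernel versions):

  void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
  {
	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
	unsigned long flags;

	/*
	 * Speculative already-on-list test; if this cgroup is already on
	 * the per-cpu updated list, return without taking the lock.
	 */
	if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
		return;

	raw_spin_lock_irqsave(cpu_lock, flags);
	/* ... link @cgrp and its ancestors onto the updated lists ... */
	raw_spin_unlock_irqrestore(cpu_lock, flags);
  }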

Cheers,
Longman


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-06  2:58               ` Waiman Long
  (?)
@ 2022-06-06  3:15               ` Ming Lei
  -1 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2022-06-06  3:15 UTC (permalink / raw)
  To: Waiman Long; +Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel

On Sun, Jun 05, 2022 at 10:58:15PM -0400, Waiman Long wrote:
> On 6/5/22 22:23, Ming Lei wrote:
> > On Sun, Jun 05, 2022 at 09:59:50PM -0400, Waiman Long wrote:
> > > On 6/5/22 21:39, Ming Lei wrote:
> > > > On Sun, Jun 05, 2022 at 07:15:27PM -0400, Waiman Long wrote:
> > > > > On 6/3/22 23:58, Ming Lei wrote:
> > > > > 
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index abec50f31fe6..8c4f204dbf5b 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -622,7 +622,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> > > > > >     {
> > > > > >     	unsigned int x;
> > > > > > -	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
> > > > > > +	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id(), NULL);
> > > > > >     	x = __this_cpu_add_return(stats_updates, abs(val));
> > > > > >     	if (x > MEMCG_CHARGE_BATCH) {
> > > > > I think the rstat set of functions are doing that already. So flush will
> > > > > only call CPUs that have called cgroup_rstat_updated() before. However, one
> > > > Yeah, I guess the detail is in cgroup_rstat_cpu_pop_updated(), but the
> > > > percpu lock(raw_spin_lock_irqsave) is still required, and cgroup_rstat_cpu_pop_updated()
> > > > is still called even through there isn't any update on this CPU.
> > > Yes, I think we may need to add a bitmask of what controllers have updates
> > > in cgroup_rstat_cpu structure.
> > > > > deficiency that I am aware of is that there is no bitmap of which controller
> > > > > have update. The problem that I saw in cgroup v2 is that in a cgroup with
> > > > > both memory controller and block controller enabled, a
> > > > > cgroup_rstat_updated() call from memory cgroup later causes the rstat
> > > > > function to call into block cgroup flush method even though there is no
> > > > > update in the block controller. This is an area that needs improvement.
> > > > > 
> > > > > Your code does allow the block controller to be aware of that and avoid
> > > > > further action, but I think it has to be done in the rstat code to be
> > > > > applicable to all controllers instead of just specific to block controller.
> > > > I guess it can be done by adding one percpu variable to 'struct cgroup'.
> > > > 
> > > > > There is another problem that this approach. Suppose the system have 20
> > > > > block devices and one of them has an IO operation. Now the flush method
> > > > > still needs to iterate all the 20 blkg's to do an update. The block
> > > > > controller is kind of special that the number of per-cgroup IO stats depends
> > > > > on the number of block devices present. Other controllers just have one set
> > > > > of stats per cgroup.
> > > > Yeah, and this one is really blkio specific issue, and your patch does
> > > > cover this one. Maybe you can add one callback to
> > > > cgroup_rstat_updated(), so the "blkg_iostat_set" instance is added into
> > > > percpu list under percpu lock of cgroup_rstat_cpu_lock, then the lockless
> > > > list isn't needed.
> > > The rstat API is generic. It may not be a good idea to put controller
> > > specific information into it. Yes, cgroup_rstat_cpu_lock is taken at the
> > > read side (flush). It may not taken on the write side (update). So it may
> > Both cgroup_rstat_flush_locked()/cgroup_rstat_updated() take the percpu
> > cgroup_rstat_cpu_lock, so the new invented lockless list can be
> > replaced with plain list.
> 
> cgroup_rstat_updated() should only take the percpu cgroup_rstat_cpu_lock the
> first time it transition from "!updated" to "updated". After that, it
> returns immediately without the the lock. With a regular list, you will have
> to take the lock every time a new block device has an update. So there isn't
> much saving on the update side. In general, the lock/unlock sequence has a
> bit more overhead than the lockless insertion. On the flush side, there may
> be a bit of saving, but it is not the fast path.

OK, got it. It looks like I misunderstood cgroup_rstat_updated(), and
the point about N queues vs. 1 cgroup still stands, so your patch looks
good.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
@ 2022-06-06  3:16     ` Ming Lei
  0 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2022-06-06  3:16 UTC (permalink / raw)
  To: Waiman Long; +Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel

On Thu, Jun 02, 2022 at 03:20:20PM -0400, Waiman Long wrote:
> For a system with many CPUs and block devices, the time to do
> blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
> can be especially problematic as interrupt is disabled during the flush.
> It was reported that it might take seconds to complete in some extreme
> cases leading to hard lockup messages.
> 
> As it is likely that not all the percpu blkg_iostat_set's has been
> updated since the last flush, those stale blkg_iostat_set's don't need
> to be flushed in this case. This patch optimizes blkcg_rstat_flush()
> by keeping a lockless list of recently updated blkg_iostat_set's in a
> newly added percpu blkcg->lhead pointer.
> 
> The blkg_iostat_set is added to the lockless list on the update side
> in blk_cgroup_bio_start(). It is removed from the lockless list when
> flushed in blkcg_rstat_flush(). Due to racing, it is possible that
> blk_iostat_set's in the lockless list may have no new IO stats to be
> flushed. To protect against destruction of blkg, a percpu reference is
> gotten when putting into the lockless list and put back when removed.
> 
> A blkg_iostat_set can determine if it is in a lockless list by checking
> the content of its lnode.next pointer which will be non-NULL when in
> a lockless list. This requires the presence of a special llist_last
> sentinel node to be put at the end of the lockless list.
> 
> When booting up an instrumented test kernel with this patch on a
> 2-socket 96-thread system with cgroup v2, out of the 2051 calls to
> cgroup_rstat_flush() after bootup, 1788 of the calls were exited
> immediately because of empty lockless list. After an all-cpu kernel
> build, the ratio became 6295424/6340513. That was more than 99%.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
@ 2022-06-08 16:57     ` Michal Koutný
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Koutný @ 2022-06-08 16:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

Hello.

On Thu, Jun 02, 2022 at 03:20:20PM -0400, Waiman Long <longman@redhat.com> wrote:
> As it is likely that not all the percpu blkg_iostat_set's has been
> updated since the last flush, those stale blkg_iostat_set's don't need
> to be flushed in this case.

Yes, there's no point in flushing stats for idle devices when there can
be many of them. Good idea.

> +static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
> +{
> +	return xchg(&lhead->first, &llist_last);
> +}
> +
> +static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
> +{
> +	struct llist_node *next = READ_ONCE(lnode->next);
> +	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
> +					    lnode)->blkg;
> +
> +	WRITE_ONCE(lnode->next, NULL);
> +	percpu_ref_put(&blkg->refcnt);
> +	return next;
> +}

Idea/just asking: would it make sense to generalize this into llist.c
(this is basically llist_del_first() + llist_del_all() with a sentinel)?
For the sake of reusability.
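
E.g. something like (names invented, untested):

  /* hypothetical additions to include/linux/llist.h */

  /* Detach the whole list, leaving @sentinel behind as the empty marker. */
  static inline struct llist_node *
  llist_del_all_sentinel(struct llist_head *head, struct llist_node *sentinel)
  {
	return xchg(&head->first, sentinel);
  }

  /*
   * On a sentinel-terminated list, an entry is on the list iff its
   * next pointer is non-NULL.
   */
  static inline bool llist_on_list_sentinel(struct llist_node *node)
  {
	return READ_ONCE(node->next) != NULL;
  }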

> +#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
> +	for (; (node != &llist_last) &&					\
> +	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
> +		nxt = fetch_delete_lnode_next(node), true);		\
> +		node = nxt)
> +

It's good hygiene to parenthesize the args.

> @@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>  	}
>  	bis->cur.ios[rwd]++;
>  
> +	if (!READ_ONCE(bis->lnode.next)) {
> +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> +
> +		llist_add(&bis->lnode, lhead);
> +		percpu_ref_get(&bis->blkg->refcnt);
> +	}
> +

When a blkg's cgroup is rmdir'd, what happens with the lhead list?
We have cgroup_rstat_exit() in css_free_rwork_fn() that ultimately flushes rstats.
init_and_link_css, however, adds a reference from blkcg->css to cgroup->css.
The blkcg->css would be (transitively) pinned by the lhead list and
hence would prevent the final flush (when refs drop to zero). Seems like
a cyclic dependency.

Luckily, there's also per-subsys flushing in css_release which could be
moved to after rmdir (offlining) but before the last ref is gone:

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..d830e6a8fb3b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5165,11 +5165,6 @@ static void css_release_work_fn(struct work_struct *work)

        if (ss) {
                /* css release path */
-               if (!list_empty(&css->rstat_css_node)) {
-                       cgroup_rstat_flush(cgrp);
-                       list_del_rcu(&css->rstat_css_node);
-               }
-
                cgroup_idr_replace(&ss->css_idr, NULL, css->id);
                if (ss->css_released)
                        ss->css_released(css);
@@ -5279,6 +5274,11 @@ static void offline_css(struct cgroup_subsys_state *css)
        css->flags &= ~CSS_ONLINE;
        RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);

+       if (!list_empty(&css->rstat_css_node)) {
+               cgroup_rstat_flush(css->cgrp);
+               list_del_rcu(&css->rstat_css_node);
+       }
+
        wake_up_all(&css->cgroup->offline_waitq);
 }

(not tested)


>  	u64_stats_update_end_irqrestore(&bis->sync, flags);
>  	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
> -		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
> +		cgroup_rstat_updated(blkcg->css.cgroup, cpu);

Maybe bundle the lhead list maintenance with cgroup_rstat_updated() under
cgroup_subsys_on_dfl()? The stats can be read on v1 anyway.
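
I.e. roughly (untested):

	/* in blk_cgroup_bio_start(): only track updates on the v2 hierarchy */
	if (cgroup_subsys_on_dfl(io_cgrp_subsys)) {
		if (!READ_ONCE(bis->lnode.next)) {
			llist_add(&bis->lnode, per_cpu_ptr(blkcg->lhead, cpu));
			percpu_ref_get(&bis->blkg->refcnt);
		}
		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
	}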


Thanks,
Michal

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-08 16:57     ` Michal Koutný
  (?)
@ 2022-06-08 18:16     ` Waiman Long
  2022-06-08 21:12       ` Michal Koutný
  -1 siblings, 1 reply; 31+ messages in thread
From: Waiman Long @ 2022-06-08 18:16 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On 6/8/22 12:57, Michal Koutný wrote:
> Hello.
>
> On Thu, Jun 02, 2022 at 03:20:20PM -0400, Waiman Long <longman@redhat.com> wrote:
>> As it is likely that not all the percpu blkg_iostat_set's has been
>> updated since the last flush, those stale blkg_iostat_set's don't need
>> to be flushed in this case.
> Yes, there's no point to flush stats for idle devices if there can be
> many of them. Good idea.
>
>> +static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
>> +{
>> +	return xchg(&lhead->first, &llist_last);
>> +}
>> +
>> +static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
>> +{
>> +	struct llist_node *next = READ_ONCE(lnode->next);
>> +	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
>> +					    lnode)->blkg;
>> +
>> +	WRITE_ONCE(lnode->next, NULL);
>> +	percpu_ref_put(&blkg->refcnt);
>> +	return next;
>> +}
> Idea/just asking: would it make sense to generalize this into llist.c
> (this is basically llist_del_first() + llist_del_all() with a sentinel)?
> For the sake of reusability.

I have thought about that. It can be done as a follow-up patch to add a 
sentinel version into llist and use that instead. Of course, I can also 
update this patchset to include that.


>
>> +#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
>> +	for (; (node != &llist_last) &&					\
>> +	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
>> +		nxt = fetch_delete_lnode_next(node), true);		\
>> +		node = nxt)
>> +
> It's good hygiene to parenthesize the args.
I am aware of that. I will certainly add that if it is a generic macro 
that can have many users.

>
>> @@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>>   	}
>>   	bis->cur.ios[rwd]++;
>>   
>> +	if (!READ_ONCE(bis->lnode.next)) {
>> +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
>> +
>> +		llist_add(&bis->lnode, lhead);
>> +		percpu_ref_get(&bis->blkg->refcnt);
>> +	}
>> +
> When a blkg's cgroup is rmdir'd, what happens with the lhead list?
> We have cgroup_rstat_exit() in css_free_rwork_fn() that ultimately flushes rstats.
> init_and_link_css however adds reference form blkcg->css to cgroup->css.
> The blkcg->css would be (transitively) pinned by the lhead list and
> hence would prevent the final flush (when refs drop to zero). Seems like
> a cyclic dependency.
>
> Luckily, there's also per-subsys flushing in css_release which could be
> moved after rmdir (offlining) but before last ref is gone:
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index adb820e98f24..d830e6a8fb3b 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5165,11 +5165,6 @@ static void css_release_work_fn(struct work_struct *work)
>
>          if (ss) {
>                  /* css release path */
> -               if (!list_empty(&css->rstat_css_node)) {
> -                       cgroup_rstat_flush(cgrp);
> -                       list_del_rcu(&css->rstat_css_node);
> -               }
> -
>                  cgroup_idr_replace(&ss->css_idr, NULL, css->id);
>                  if (ss->css_released)
>                          ss->css_released(css);
> @@ -5279,6 +5274,11 @@ static void offline_css(struct cgroup_subsys_state *css)
>          css->flags &= ~CSS_ONLINE;
>          RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);
>
> +       if (!list_empty(&css->rstat_css_node)) {
> +               cgroup_rstat_flush(css->cgrp);
> +               list_del_rcu(&css->rstat_css_node);
> +       }
> +
>          wake_up_all(&css->cgroup->offline_waitq);
>   }
>
> (not tested)

Good point.

Your change may not be enough since there could be an update after the
flush which will pin the blkg and hence the blkcg. I guess one possible
solution may be to abandon the llist and revert to list iteration when
the blkcg is offline. I need to think a bit more about that.
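
Something along the lines of the following, perhaps (just the
direction, not thought through):

	/* sketch: in blkcg_rstat_flush(), full walk once the blkcg is offline */
	if (!(blkcg->css.flags & CSS_ONLINE)) {
		hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
			/* flush this cpu's blkg->iostat_cpu as before */
		}
		return;
	}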

>
>
>>   	u64_stats_update_end_irqrestore(&bis->sync, flags);
>>   	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
>> -		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
>> +		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
> Maybe bundle the lhead list maintenace with cgroup_rstat_updated() under
> cgroup_subsys_on_dfl()? The stats can be read on v1 anyway.

I don't quite understand here. The change is not specific to v1 or v2. 
What do you mean by the stats being readable on v1?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-08 18:16     ` Waiman Long
@ 2022-06-08 21:12       ` Michal Koutný
  2022-06-08 22:14         ` Michal Koutný
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Koutný @ 2022-06-08 21:12 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On Wed, Jun 08, 2022 at 02:16:45PM -0400, Waiman Long <longman@redhat.com> wrote:
> I have thought about that. It can be done as a follow-up patch to add a
> sentinel version into llist and use that instead. Of course, I can also
> update this patchset to include that.

Nothing against the current form, really just an idea for a followup or
prequel.

> Your change may not be enough since there could be an update after the flush
> which will pin the blkg and hence the blkcg.

Wouldn't that mean submitting a bio from an offlined blkcg?
blkg_tryget_closest() should prevent that.

> I guess one possible solution may be to abandon the llist and revert
> back to list iteration when offline. I need to think a bit more about
> that.


> > Maybe bundle the lhead list maintenace with cgroup_rstat_updated() under
> > cgroup_subsys_on_dfl()? The stats can be read on v1 anyway.
> 
> I don't quite understand here. The change is not specific to v1 or v2. What
> do you mean by the stats being readable on v1?

Apologies, the critical "not" fell out: ...can not be read on v1... IOW,
the rstat data are only kept when attached to the v2 hierarchy, so the
list of active devices needn't be maintained on v1.
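
IOW, something like this in blk_cgroup_bio_start() (sketch on top of
your hunk, not tested):

	u64_stats_update_end_irqrestore(&bis->sync, flags);

	/* rstat is only flushed on the default (v2) hierarchy */
	if (cgroup_subsys_on_dfl(io_cgrp_subsys)) {
		if (!READ_ONCE(bis->lnode.next)) {
			struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);

			llist_add(&bis->lnode, lhead);
			percpu_ref_get(&bis->blkg->refcnt);
		}
		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
	}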


Michal


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-08 21:12       ` Michal Koutný
@ 2022-06-08 22:14         ` Michal Koutný
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Koutný @ 2022-06-08 22:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On Wed, Jun 08, 2022 at 11:12:55PM +0200, Michal Koutný <mkoutny@suse.com> wrote:
> Wouldn't that mean submitting a bio from an offlined blkcg?
> blkg_tryget_closest() should prevent that.

Self-correction -- no, I forgot that blkg_tryget_closest() takes any
non-zero reference, not just a live one (percpu_ref_tryget_live()).
Furthermore, I can see that an offlined blkcg may still issue writeback
bios, for instance.
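
(FTR, blkg_tryget() in block/blk-cgroup.h is IIUC just

	static inline bool blkg_tryget(struct blkcg_gq *blkg)
	{
		return blkg && percpu_ref_tryget(&blkg->refcnt);
	}

i.e. a plain percpu_ref_tryget(), so it succeeds on an offlined blkg as
long as any reference is still around.)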

> > I guess one possible solution may be to abandon the llist and revert
> > back to list iteration when offline. I need to think a bit more about
> > that.

Since blkcg stats are only used for io.stat of an online blkcg, the
update may be skipped on an offlined blkcg. (Which of course breaks when
something starts to depend on the stats of an offlined blkcg.)
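
E.g. very roughly (not tested, and the unlocked flags check is only
meant to illustrate the idea):

	if (!READ_ONCE(bis->lnode.next) &&
	    (blkcg->css.flags & CSS_ONLINE)) {
		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);

		/* offlined blkcg's io.stat won't be read, don't queue it */
		llist_add(&bis->lnode, lhead);
		percpu_ref_get(&bis->blkg->refcnt);
	}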

Michal

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-08 16:57     ` Michal Koutný
@ 2022-09-30 18:34       ` Waiman Long
  -1 siblings, 0 replies; 31+ messages in thread
From: Waiman Long @ 2022-09-30 18:34 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On 6/8/22 12:57, Michal Koutný wrote:
> @@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>>   	}
>>   	bis->cur.ios[rwd]++;
>>   
>> +	if (!READ_ONCE(bis->lnode.next)) {
>> +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
>> +
>> +		llist_add(&bis->lnode, lhead);
>> +		percpu_ref_get(&bis->blkg->refcnt);
>> +	}
>> +
> When a blkg's cgroup is rmdir'd, what happens with the lhead list?
> We have cgroup_rstat_exit() in css_free_rwork_fn() that ultimately flushes rstats.
> init_and_link_css however adds a reference from blkcg->css to cgroup->css.
> The blkcg->css would be (transitively) pinned by the lhead list and
> hence would prevent the final flush (when refs drop to zero). Seems like
> a cyclic dependency.

That is not true. The percpu lhead list is embedded in the blkcg but it 
does not pin the blkcg. What the code does is pin the blkg so that it 
cannot be freed while it is on the lockless list. I do need to move the 
percpu_ref_put() in blkcg_rstat_flush() to a later point to avoid a 
use-after-free, though.
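
IOW, blkcg_rstat_flush() would drop the reference in the loop body only
after the entry has been consumed, roughly like below (sketch; the
iteration helper would then no longer do the put itself):

	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
		struct blkcg_gq *blkg = bisc->blkg;

		/* ... propagate bisc->cur into blkg->iostat and its parent ... */

		/*
		 * Put the blkg reference only after its stats have been
		 * consumed so that the blkg cannot be freed under us.
		 */
		percpu_ref_put(&blkg->refcnt);
	}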


>
> Luckily, there's also per-subsys flushing in css_release which could be
> moved after rmdir (offlining) but before last ref is gone:
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index adb820e98f24..d830e6a8fb3b 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5165,11 +5165,6 @@ static void css_release_work_fn(struct work_struct *work)
>
>          if (ss) {
>                  /* css release path */
> -               if (!list_empty(&css->rstat_css_node)) {
> -                       cgroup_rstat_flush(cgrp);
> -                       list_del_rcu(&css->rstat_css_node);
> -               }
> -
>                  cgroup_idr_replace(&ss->css_idr, NULL, css->id);
>                  if (ss->css_released)
>                          ss->css_released(css);
> @@ -5279,6 +5274,11 @@ static void offline_css(struct cgroup_subsys_state *css)
>          css->flags &= ~CSS_ONLINE;
>          RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);
>
> +       if (!list_empty(&css->rstat_css_node)) {
> +               cgroup_rstat_flush(css->cgrp);
> +               list_del_rcu(&css->rstat_css_node);
> +       }
> +
>          wake_up_all(&css->cgroup->offline_waitq);
>   }
>
> (not tested)

I don't think that code is necessary. Anyway, I am planning to make a 
parallel set of lockless list helpers with a sentinel variant, as 
suggested.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 31+ messages in thread


end of thread, other threads:[~2022-09-30 18:34 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
2022-06-02 19:20 [PATCH v6 0/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
2022-06-02 19:20 ` Waiman Long
2022-06-02 19:20 ` [PATCH v6 1/3] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit Waiman Long
2022-06-02 19:20   ` Waiman Long
2022-06-04  2:08   ` Ming Lei
2022-06-04  2:08     ` Ming Lei
2022-06-04  2:47     ` Waiman Long
2022-06-02 19:20 ` [PATCH v6 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path Waiman Long
2022-06-02 19:20   ` Waiman Long
2022-06-02 20:39   ` Tejun Heo
2022-06-04  2:16   ` Ming Lei
2022-06-04  2:16     ` Ming Lei
2022-06-02 19:20 ` [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
2022-06-04  3:58   ` Ming Lei
2022-06-05 23:15     ` Waiman Long
2022-06-06  1:39       ` Ming Lei
2022-06-06  1:59         ` Waiman Long
2022-06-06  1:59           ` Waiman Long
2022-06-06  2:23           ` Ming Lei
2022-06-06  2:58             ` Waiman Long
2022-06-06  2:58               ` Waiman Long
2022-06-06  3:15               ` Ming Lei
2022-06-06  3:16   ` Ming Lei
2022-06-06  3:16     ` Ming Lei
2022-06-08 16:57   ` Michal Koutný
2022-06-08 16:57     ` Michal Koutný
2022-06-08 18:16     ` Waiman Long
2022-06-08 21:12       ` Michal Koutný
2022-06-08 22:14         ` Michal Koutný
2022-09-30 18:34     ` Waiman Long
2022-09-30 18:34       ` Waiman Long
