[PATCH v3 0/2] blk-cgroup: Optimize blkcg_rstat

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/2] blk-cgroup: Optimize blkcg_rstat_flush()
@ 2022-06-01 21:18 Waiman Long
  2022-06-01 21:18 ` [PATCH v3 1/2] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit Waiman Long
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Waiman Long @ 2022-06-01 21:18 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

This is a follow-up of [1]. There is no change in patch 1. Patch 2 has
the following changes:
 - Update comments in patch 2.
 - Put rcu_read_lock/unlock() in blkcg_rstat_flush().
 - Use READ_ONCE/WRITE_ONCE() to access lnode->next to reduce data
   races.
 - Get a blkg reference when putting into the lockless list and put it
   back when removed.

[1] https://lore.kernel.org/lkml/20220601165324.60892-1-longman@redhat.com/

Waiman Long (2):
  blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit
  blk-cgroup: Optimize blkcg_rstat_flush()

 block/blk-cgroup.c | 92 ++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h |  9 +++++
 2 files changed, 93 insertions(+), 8 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3 1/2] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit
  2022-06-01 21:18 [PATCH v3 0/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
@ 2022-06-01 21:18 ` Waiman Long
  2022-06-01 21:18 ` [PATCH v3 2/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2022-06-01 21:18 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

Commit f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup
rstat") changes block cgroup IO stats to use the rstat APIs. It added
a new percpu iostat_cpu field into blkg. The blkg_alloc() was modified
to allocate the new percpu iostat_cpu but didn't free it when an error
happened. Fix this by freeing the percpu iostat_cpu on error exit.

Fixes: f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 40161a3f68d0..acd9b0aa8dc8 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -219,11 +219,11 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 		return NULL;
 
 	if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask))
-		goto err_free;
+		goto err_free_blkg;
 
 	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
 	if (!blkg->iostat_cpu)
-		goto err_free;
+		goto err_free_blkg;
 
 	if (!blk_get_queue(q))
 		goto err_free;
@@ -259,6 +259,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	return blkg;
 
 err_free:
+	free_percpu(blkg->iostat_cpu);
+
+err_free_blkg:
 	blkg_free(blkg);
 	return NULL;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 2/2] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-01 21:18 [PATCH v3 0/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
  2022-06-01 21:18 ` [PATCH v3 1/2] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit Waiman Long
@ 2022-06-01 21:18 ` Waiman Long
  2022-06-01 21:26   ` Tejun Heo
  2022-06-02  6:32   ` kernel test robot
  2022-06-02  1:54 ` [PATCH v4 " Waiman Long
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 20+ messages in thread
From: Waiman Long @ 2022-06-01 21:18 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
can be especially problematic as interrupt is disabled during the flush.
It was reported that it might take seconds to complete in some extreme
cases leading to hard lockup messages.

As it is likely that not all the percpu blkg_iostat_set's has been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case. This patch optimizes blkcg_rstat_flush()
by keeping a lockless list of recently updated blkg_iostat_set's in a
newly added percpu blkcg->lhead pointer.

The blkg_iostat_set is added to the lockless list on the update side
in blk_cgroup_bio_start(). It is removed from the lockless list when
flushed in blkcg_rstat_flush(). Due to racing, it is possible that
blk_iostat_set's in the lockless list may have no new IO stats to be
flushed. To protect against destruction of blkg, a percpu reference is
taken when putting into the lockless list and released when removed.

A blkg_iostat_set can determine if it is in a lockless list by checking
the content of its lnode.next pointer which will be non-NULL when in
a lockless list. This requires the presence of a special llist_last
sentinel node to be put at the end of the lockless list.

When booting up an instrumented test kernel with this patch on a
2-socket 96-thread system with cgroup v2, out of the 2051 calls to
cgroup_rstat_flush() after bootup, 1788 of the calls were exited
immediately because of empty lockless list. After an all-cpu kernel
build, the ratio became 6295424/6340513. That was more than 99%.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 block/blk-cgroup.c | 85 ++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h |  9 +++++
 2 files changed, 88 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index acd9b0aa8dc8..0143dda589bd 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -59,6 +59,57 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
 
 #define BLKG_DESTROY_BATCH_SIZE  64
 
+/*
+ * lnode.next of the last entry in a lockless list is NULL. To make it
+ * always non-NULL for lnode's, a sentinel node has to be put at the
+ * end of the lockless list. So all the percpu lhead's are initialized
+ * to point to that sentinel node.
+ */
+static struct llist_node llist_last;
+
+static inline bool blkcg_llist_empty(struct llist_head *lhead)
+{
+	return lhead->first == &llist_last;
+}
+
+static inline void init_blkcg_llists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
+}
+
+static inline struct llist_node *
+fetch_delete_blkcg_llist(struct llist_head *lhead)
+{
+	return xchg(&lhead->first, &llist_last);
+}
+
+static inline struct llist_node *
+fetch_delete_lnode_next(struct llist_node *lnode)
+{
+	struct llist_node *next = READ_ONCE(lnode->next);
+	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
+					    lnode)->blkg;
+
+	WRITE_ONCE(lnode->next, NULL);
+	percpu_ref_put(&blkg->refcnt);
+	return next;
+}
+
+/*
+ * The retrieved blkg_iostat_set is immediately marked as not in the
+ * lockless list by clearing its node->next pointer. It could be put
+ * back into the list by a parallel update before the iostat's are
+ * finally flushed including probably the new update.
+ */
+#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
+	for (; (node != &llist_last) &&					\
+	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
+		nxt = fetch_delete_lnode_next(node), true);		\
+		node = nxt)
+
 /**
  * blkcg_css - find the current css
  *
@@ -236,8 +287,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;
 
 	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
@@ -852,17 +905,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode, *lnext;
+	struct blkg_iostat_set *bisc;
 
 	/* Root-level stats are sourced from system-wide IO stats */
 	if (!cgroup_parent(css->cgroup))
 		return;
 
+	if (blkcg_llist_empty(lhead))
+		return;
+
 	rcu_read_lock();
 
-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = fetch_delete_blkcg_llist(lhead);
+	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
+		struct blkcg_gq *blkg = bisc->blkg;
 		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
 		struct blkg_iostat cur, delta;
 		unsigned long flags;
 		unsigned int seq;
@@ -1192,6 +1251,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		}
 	}
 
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		goto free_blkcg;
+	init_blkcg_llists(blkcg);
+
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkcg_policy_data *cpd;
@@ -1233,7 +1297,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	for (i--; i >= 0; i--)
 		if (blkcg->cpd[i])
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
 	if (blkcg != &blkcg_root)
 		kfree(blkcg);
 unlock:
@@ -1997,6 +2062,7 @@ static int blk_cgroup_io_type(struct bio *bio)
 
 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
 	int rwd = blk_cgroup_io_type(bio), cpu;
 	struct blkg_iostat_set *bis;
 	unsigned long flags;
@@ -2015,9 +2081,16 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 	bis->cur.ios[rwd]++;
 
+	if (!READ_ONCE(bis->lnode.next)) {
+		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+
+		llist_add(&bis->lnode, lhead);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
 	put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d4de0a35e066..2c36362a332e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include <linux/cgroup.h>
 #include <linux/kthread.h>
 #include <linux/blk-mq.h>
+#include <linux/llist.h>
 
 struct blkcg_gq;
 struct blkg_policy_data;
@@ -43,6 +44,8 @@ struct blkg_iostat {
 
 struct blkg_iostat_set {
 	struct u64_stats_sync		sync;
+	struct llist_node		lnode;
+	struct blkcg_gq		       *blkg;
 	struct blkg_iostat		cur;
 	struct blkg_iostat		last;
 };
@@ -97,6 +100,12 @@ struct blkcg {
 	struct blkcg_policy_data	*cpd[BLKCG_MAX_POLS];
 
 	struct list_head		all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu	*lhead;
+
 #ifdef CONFIG_BLK_CGROUP_FC_APPID
 	char                            fc_app_id[FC_APPID_LEN];
 #endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/2] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-01 21:18 ` [PATCH v3 2/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
@ 2022-06-01 21:26   ` Tejun Heo
  2022-06-01 21:30     ` Waiman Long
  2022-06-02  6:32   ` kernel test robot
  1 sibling, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2022-06-01 21:26 UTC (permalink / raw)
  To: Waiman Long; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On Wed, Jun 01, 2022 at 05:18:24PM -0400, Waiman Long wrote:
> @@ -59,6 +59,57 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
>  
>  #define BLKG_DESTROY_BATCH_SIZE  64
>  
> +/*
> + * lnode.next of the last entry in a lockless list is NULL. To make it
> + * always non-NULL for lnode's, a sentinel node has to be put at the
> + * end of the lockless list. So all the percpu lhead's are initialized
> + * to point to that sentinel node.
> + */

Can you please add why we want all entries to have non-NULL next?

> +static inline bool blkcg_llist_empty(struct llist_head *lhead)
> +{
> +	return lhead->first == &llist_last;
> +}
> +
> +static inline void init_blkcg_llists(struct blkcg *blkcg)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
> +}
> +
> +static inline struct llist_node *
> +fetch_delete_blkcg_llist(struct llist_head *lhead)
> +{
> +	return xchg(&lhead->first, &llist_last);
> +}
> +
> +static inline struct llist_node *
> +fetch_delete_lnode_next(struct llist_node *lnode)
> +{
> +	struct llist_node *next = READ_ONCE(lnode->next);
> +	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
> +					    lnode)->blkg;
> +
> +	WRITE_ONCE(lnode->next, NULL);
> +	percpu_ref_put(&blkg->refcnt);
> +	return next;
> +}

It's not a strong opinion but I'm not too fond of using inlines to mark
trivial functions. The compiler should be able to make these decisions,
right?

Other than the above two bikesheddings,

 Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/2] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-01 21:26   ` Tejun Heo
@ 2022-06-01 21:30     ` Waiman Long
  0 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2022-06-01 21:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei


On 6/1/22 17:26, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 05:18:24PM -0400, Waiman Long wrote:
>> @@ -59,6 +59,57 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
>>   
>>   #define BLKG_DESTROY_BATCH_SIZE  64
>>   
>> +/*
>> + * lnode.next of the last entry in a lockless list is NULL. To make it
>> + * always non-NULL for lnode's, a sentinel node has to be put at the
>> + * end of the lockless list. So all the percpu lhead's are initialized
>> + * to point to that sentinel node.
>> + */
> Can you please add why we want all entries to have non-NULL next?

As said elsewhere, lnode->next is used as a flag to indicate its 
presence in a lockless list. Sorry for not being explicit here.


>
>> +static inline bool blkcg_llist_empty(struct llist_head *lhead)
>> +{
>> +	return lhead->first == &llist_last;
>> +}
>> +
>> +static inline void init_blkcg_llists(struct blkcg *blkcg)
>> +{
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu)
>> +		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
>> +}
>> +
>> +static inline struct llist_node *
>> +fetch_delete_blkcg_llist(struct llist_head *lhead)
>> +{
>> +	return xchg(&lhead->first, &llist_last);
>> +}
>> +
>> +static inline struct llist_node *
>> +fetch_delete_lnode_next(struct llist_node *lnode)
>> +{
>> +	struct llist_node *next = READ_ONCE(lnode->next);
>> +	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
>> +					    lnode)->blkg;
>> +
>> +	WRITE_ONCE(lnode->next, NULL);
>> +	percpu_ref_put(&blkg->refcnt);
>> +	return next;
>> +}
> It's not a strong opinion but I'm not too fond of using inlines to mark
> trivial functions. The compiler should be able to make these decisions,
> right?
>
> Other than the above two bikesheddings,

Sure, I can remove the inline keywords. I think I do it out of habit:-)

Regards,
Longman

>
>   Acked-by: Tejun Heo <tj@kernel.org>
>
> Thanks.
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 2/2] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-01 21:18 [PATCH v3 0/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
  2022-06-01 21:18 ` [PATCH v3 1/2] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit Waiman Long
  2022-06-01 21:18 ` [PATCH v3 2/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
@ 2022-06-02  1:54 ` Waiman Long
  2022-06-02 13:35 ` [PATCH v5 0/3] " Waiman Long
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2022-06-02  1:54 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
can be especially problematic as interrupt is disabled during the flush.
It was reported that it might take seconds to complete in some extreme
cases leading to hard lockup messages.

As it is likely that not all the percpu blkg_iostat_set's has been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case. This patch optimizes blkcg_rstat_flush()
by keeping a lockless list of recently updated blkg_iostat_set's in a
newly added percpu blkcg->lhead pointer.

The blkg_iostat_set is added to the lockless list on the update side
in blk_cgroup_bio_start(). It is removed from the lockless list when
flushed in blkcg_rstat_flush(). Due to racing, it is possible that
blk_iostat_set's in the lockless list may have no new IO stats to be
flushed. To protect against destruction of blkg, a percpu reference is
gotten when putting into the lockless list and put back when removed.

A blkg_iostat_set can determine if it is in a lockless list by checking
the content of its lnode.next pointer which will be non-NULL when in
a lockless list. This requires the presence of a special llist_last
sentinel node to be put at the end of the lockless list.

When booting up an instrumented test kernel with this patch on a
2-socket 96-thread system with cgroup v2, out of the 2051 calls to
cgroup_rstat_flush() after bootup, 1788 of the calls were exited
immediately because of empty lockless list. After an all-cpu kernel
build, the ratio became 6295424/6340513. That was more than 99%.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 84 ++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h |  9 +++++
 2 files changed, 87 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index acd9b0aa8dc8..1b74df5f2710 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -59,6 +59,56 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
 
 #define BLKG_DESTROY_BATCH_SIZE  64
 
+/*
+ * lnode.next of the last entry in a lockless list is NULL. To enable us to
+ * use lnode.next as a boolean flag to indicate its presence in a lockless
+ * list, we have to make it non-NULL for all. This is done by using a
+ * sentinel node at the end of the lockless list. All the percpu lhead's
+ * are initialized to point to that sentinel node as being empty.
+ */
+static struct llist_node llist_last;
+
+static bool blkcg_llist_empty(struct llist_head *lhead)
+{
+	return lhead->first == &llist_last;
+}
+
+static void init_blkcg_llists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
+}
+
+static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
+{
+	return xchg(&lhead->first, &llist_last);
+}
+
+static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
+{
+	struct llist_node *next = READ_ONCE(lnode->next);
+	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
+					    lnode)->blkg;
+
+	WRITE_ONCE(lnode->next, NULL);
+	percpu_ref_put(&blkg->refcnt);
+	return next;
+}
+
+/*
+ * The retrieved blkg_iostat_set is immediately marked as not in the
+ * lockless list by clearing its node->next pointer. It could be put
+ * back into the list by a parallel update before the iostat's are
+ * finally flushed including probably the new update.
+ */
+#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
+	for (; (node != &llist_last) &&					\
+	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
+		nxt = fetch_delete_lnode_next(node), true);		\
+		node = nxt)
+
 /**
  * blkcg_css - find the current css
  *
@@ -236,8 +286,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;
 
 	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
@@ -852,17 +904,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode, *lnext;
+	struct blkg_iostat_set *bisc;
 
 	/* Root-level stats are sourced from system-wide IO stats */
 	if (!cgroup_parent(css->cgroup))
 		return;
 
+	if (blkcg_llist_empty(lhead))
+		return;
+
 	rcu_read_lock();
 
-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = fetch_delete_blkcg_llist(lhead);
+	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
+		struct blkcg_gq *blkg = bisc->blkg;
 		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
 		struct blkg_iostat cur, delta;
 		unsigned long flags;
 		unsigned int seq;
@@ -1192,6 +1250,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		}
 	}
 
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		goto free_blkcg;
+	init_blkcg_llists(blkcg);
+
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkcg_policy_data *cpd;
@@ -1233,7 +1296,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	for (i--; i >= 0; i--)
 		if (blkcg->cpd[i])
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
 	if (blkcg != &blkcg_root)
 		kfree(blkcg);
 unlock:
@@ -1997,6 +2061,7 @@ static int blk_cgroup_io_type(struct bio *bio)
 
 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
 	int rwd = blk_cgroup_io_type(bio), cpu;
 	struct blkg_iostat_set *bis;
 	unsigned long flags;
@@ -2015,9 +2080,16 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 	bis->cur.ios[rwd]++;
 
+	if (!READ_ONCE(bis->lnode.next)) {
+		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+
+		llist_add(&bis->lnode, lhead);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
 	put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d4de0a35e066..2c36362a332e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include <linux/cgroup.h>
 #include <linux/kthread.h>
 #include <linux/blk-mq.h>
+#include <linux/llist.h>
 
 struct blkcg_gq;
 struct blkg_policy_data;
@@ -43,6 +44,8 @@ struct blkg_iostat {
 
 struct blkg_iostat_set {
 	struct u64_stats_sync		sync;
+	struct llist_node		lnode;
+	struct blkcg_gq		       *blkg;
 	struct blkg_iostat		cur;
 	struct blkg_iostat		last;
 };
@@ -97,6 +100,12 @@ struct blkcg {
 	struct blkcg_policy_data	*cpd[BLKCG_MAX_POLS];
 
 	struct list_head		all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu	*lhead;
+
 #ifdef CONFIG_BLK_CGROUP_FC_APPID
 	char                            fc_app_id[FC_APPID_LEN];
 #endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/2] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-01 21:18 ` [PATCH v3 2/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
  2022-06-01 21:26   ` Tejun Heo
@ 2022-06-02  6:32   ` kernel test robot
  1 sibling, 0 replies; 20+ messages in thread
From: kernel test robot @ 2022-06-02  6:32 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Jens Axboe
  Cc: llvm, kbuild-all, cgroups, linux-block, linux-kernel, Ming Lei,
	Waiman Long

Hi Waiman,

I love your patch! Perhaps something to improve:

[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linus/master next-20220601]
[cannot apply to v5.18]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Waiman-Long/blk-cgroup-Optimize-blkcg_rstat_flush/20220602-052441
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: x86_64-randconfig-a012 (https://download.01.org/0day-ci/archive/20220602/202206021418.wpJNbe3g-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project c825abd6b0198fb088d9752f556a70705bc99dfd)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/3f979cef411e5d5512b725753034b02f3b7baf44
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Waiman-Long/blk-cgroup-Optimize-blkcg_rstat_flush/20220602-052441
        git checkout 3f979cef411e5d5512b725753034b02f3b7baf44
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> block/blk-cgroup.c:1255:6: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
           if (!blkcg->lhead)
               ^~~~~~~~~~~~~
   block/blk-cgroup.c:1306:9: note: uninitialized use occurs here
           return ret;
                  ^~~
   block/blk-cgroup.c:1255:2: note: remove the 'if' if its condition is always false
           if (!blkcg->lhead)
           ^~~~~~~~~~~~~~~~~~
   block/blk-cgroup.c:1239:33: note: initialize the variable 'ret' to silence this warning
           struct cgroup_subsys_state *ret;
                                          ^
                                           = NULL
   1 warning generated.


vim +1255 block/blk-cgroup.c

  1234	
  1235	static struct cgroup_subsys_state *
  1236	blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
  1237	{
  1238		struct blkcg *blkcg;
  1239		struct cgroup_subsys_state *ret;
  1240		int i;
  1241	
  1242		mutex_lock(&blkcg_pol_mutex);
  1243	
  1244		if (!parent_css) {
  1245			blkcg = &blkcg_root;
  1246		} else {
  1247			blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
  1248			if (!blkcg) {
  1249				ret = ERR_PTR(-ENOMEM);
  1250				goto unlock;
  1251			}
  1252		}
  1253	
  1254		blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
> 1255		if (!blkcg->lhead)
  1256			goto free_blkcg;
  1257		init_blkcg_llists(blkcg);
  1258	
  1259		for (i = 0; i < BLKCG_MAX_POLS ; i++) {
  1260			struct blkcg_policy *pol = blkcg_policy[i];
  1261			struct blkcg_policy_data *cpd;
  1262	
  1263			/*
  1264			 * If the policy hasn't been attached yet, wait for it
  1265			 * to be attached before doing anything else. Otherwise,
  1266			 * check if the policy requires any specific per-cgroup
  1267			 * data: if it does, allocate and initialize it.
  1268			 */
  1269			if (!pol || !pol->cpd_alloc_fn)
  1270				continue;
  1271	
  1272			cpd = pol->cpd_alloc_fn(GFP_KERNEL);
  1273			if (!cpd) {
  1274				ret = ERR_PTR(-ENOMEM);
  1275				goto free_pd_blkcg;
  1276			}
  1277			blkcg->cpd[i] = cpd;
  1278			cpd->blkcg = blkcg;
  1279			cpd->plid = i;
  1280			if (pol->cpd_init_fn)
  1281				pol->cpd_init_fn(cpd);
  1282		}
  1283	
  1284		spin_lock_init(&blkcg->lock);
  1285		refcount_set(&blkcg->online_pin, 1);
  1286		INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT | __GFP_NOWARN);
  1287		INIT_HLIST_HEAD(&blkcg->blkg_list);
  1288	#ifdef CONFIG_CGROUP_WRITEBACK
  1289		INIT_LIST_HEAD(&blkcg->cgwb_list);
  1290	#endif
  1291		list_add_tail(&blkcg->all_blkcgs_node, &all_blkcgs);
  1292	
  1293		mutex_unlock(&blkcg_pol_mutex);
  1294		return &blkcg->css;
  1295	
  1296	free_pd_blkcg:
  1297		for (i--; i >= 0; i--)
  1298			if (blkcg->cpd[i])
  1299				blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
  1300		free_percpu(blkcg->lhead);
  1301	free_blkcg:
  1302		if (blkcg != &blkcg_root)
  1303			kfree(blkcg);
  1304	unlock:
  1305		mutex_unlock(&blkcg_pol_mutex);
  1306		return ret;
  1307	}
  1308	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 0/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-01 21:18 [PATCH v3 0/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
                   ` (2 preceding siblings ...)
  2022-06-02  1:54 ` [PATCH v4 " Waiman Long
@ 2022-06-02 13:35 ` Waiman Long
  2022-06-02 13:35 ` [PATCH v5 1/3] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit Waiman Long
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2022-06-02 13:35 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

 v5:
  - Add a new patch 2 to eliminate the use of intermediate "ret"
    variable in blkcg_css_alloc() to fix compilation warning reported
    by kernel test robot.

 v4:
  - Update comment and eliminate "inline" keywords as suggested by TJ.

 v3:
  - Update comments in patch 2.
  - Put rcu_read_lock/unlock() in blkcg_rstat_flush().
  - Use READ_ONCE/WRITE_ONCE() to access lnode->next to reduce data
    races.
  - Get a blkg reference when putting into the lockless list and put it
    back when removed.
 
This patch series improves blkcg_rstat_flush() performance by eliminating
unnecessary blkg enumeration and flush operations for those blkg's and
blkg_iostat_set's that haven't been updated since the last flush.

Waiman Long (3):
  blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit
  blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
  blk-cgroup: Optimize blkcg_rstat_flush()

 block/blk-cgroup.c | 103 ++++++++++++++++++++++++++++++++++++++-------
 block/blk-cgroup.h |   9 ++++
 2 files changed, 96 insertions(+), 16 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 1/3] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit
  2022-06-01 21:18 [PATCH v3 0/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
                   ` (3 preceding siblings ...)
  2022-06-02 13:35 ` [PATCH v5 0/3] " Waiman Long
@ 2022-06-02 13:35 ` Waiman Long
  2022-06-02 18:54   ` [PATCH v5 4/4] blk-cgroup: Document the design of new lockless iostat_cpu list Waiman Long
  2022-06-02 13:35 ` [PATCH v5 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path Waiman Long
  2022-06-02 13:35 ` [PATCH v5 3/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
  6 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2022-06-02 13:35 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

Commit f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup
rstat") changes block cgroup IO stats to use the rstat APIs. It added
a new percpu iostat_cpu field into blkg. The blkg_alloc() was modified
to allocate the new percpu iostat_cpu but didn't free it when an error
happened. Fix this by freeing the percpu iostat_cpu on error exit.

Fixes: f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 40161a3f68d0..acd9b0aa8dc8 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -219,11 +219,11 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 		return NULL;
 
 	if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask))
-		goto err_free;
+		goto err_free_blkg;
 
 	blkg->iostat_cpu = alloc_percpu_gfp(struct blkg_iostat_set, gfp_mask);
 	if (!blkg->iostat_cpu)
-		goto err_free;
+		goto err_free_blkg;
 
 	if (!blk_get_queue(q))
 		goto err_free;
@@ -259,6 +259,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	return blkg;
 
 err_free:
+	free_percpu(blkg->iostat_cpu);
+
+err_free_blkg:
 	blkg_free(blkg);
 	return NULL;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v5 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
  2022-06-01 21:18 [PATCH v3 0/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
                   ` (4 preceding siblings ...)
  2022-06-02 13:35 ` [PATCH v5 1/3] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit Waiman Long
@ 2022-06-02 13:35 ` Waiman Long
  2022-06-02 16:16   ` Tejun Heo
  2022-06-02 13:35 ` [PATCH v5 3/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
  6 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2022-06-02 13:35 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
Simplify error handling code by returning this error directly instead
of setting an intermediate "ret" variable.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 block/blk-cgroup.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index acd9b0aa8dc8..9021f75fc752 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1177,7 +1177,6 @@ static struct cgroup_subsys_state *
 blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct blkcg *blkcg;
-	struct cgroup_subsys_state *ret;
 	int i;
 
 	mutex_lock(&blkcg_pol_mutex);
@@ -1186,10 +1185,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		blkcg = &blkcg_root;
 	} else {
 		blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
-		if (!blkcg) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!blkcg)
 			goto unlock;
-		}
 	}
 
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
@@ -1206,10 +1203,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			continue;
 
 		cpd = pol->cpd_alloc_fn(GFP_KERNEL);
-		if (!cpd) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!cpd)
 			goto free_pd_blkcg;
-		}
+
 		blkcg->cpd[i] = cpd;
 		cpd->blkcg = blkcg;
 		cpd->plid = i;
@@ -1238,7 +1234,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		kfree(blkcg);
 unlock:
 	mutex_unlock(&blkcg_pol_mutex);
-	return ret;
+	return ERR_PTR(-ENOMEM);
 }
 
 static int blkcg_css_online(struct cgroup_subsys_state *css)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v5 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-01 21:18 [PATCH v3 0/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
                   ` (5 preceding siblings ...)
  2022-06-02 13:35 ` [PATCH v5 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path Waiman Long
@ 2022-06-02 13:35 ` Waiman Long
  2022-06-02 16:58   ` Tejun Heo
  6 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2022-06-02 13:35 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
can be especially problematic as interrupt is disabled during the flush.
It was reported that it might take seconds to complete in some extreme
cases leading to hard lockup messages.

As it is likely that not all the percpu blkg_iostat_set's has been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case. This patch optimizes blkcg_rstat_flush()
by keeping a lockless list of recently updated blkg_iostat_set's in a
newly added percpu blkcg->lhead pointer.

The blkg_iostat_set is added to the lockless list on the update side
in blk_cgroup_bio_start(). It is removed from the lockless list when
flushed in blkcg_rstat_flush(). Due to racing, it is possible that
blk_iostat_set's in the lockless list may have no new IO stats to be
flushed. To protect against destruction of blkg, a percpu reference is
gotten when putting into the lockless list and put back when removed.

A blkg_iostat_set can determine if it is in a lockless list by checking
the content of its lnode.next pointer which will be non-NULL when in
a lockless list. This requires the presence of a special llist_last
sentinel node to be put at the end of the lockless list.

When booting up an instrumented test kernel with this patch on a
2-socket 96-thread system with cgroup v2, out of the 2051 calls to
cgroup_rstat_flush() after bootup, 1788 of the calls were exited
immediately because of empty lockless list. After an all-cpu kernel
build, the ratio became 6295424/6340513. That was more than 99%.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 84 ++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h |  9 +++++
 2 files changed, 87 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9021f75fc752..8af97f3b2fc9 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -59,6 +59,56 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
 
 #define BLKG_DESTROY_BATCH_SIZE  64
 
+/*
+ * lnode.next of the last entry in a lockless list is NULL. To enable us to
+ * use lnode.next as a boolean flag to indicate its presence in a lockless
+ * list, we have to make it non-NULL for all. This is done by using a
+ * sentinel node at the end of the lockless list. All the percpu lhead's
+ * are initialized to point to that sentinel node as being empty.
+ */
+static struct llist_node llist_last;
+
+static bool blkcg_llist_empty(struct llist_head *lhead)
+{
+	return lhead->first == &llist_last;
+}
+
+static void init_blkcg_llists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
+}
+
+static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
+{
+	return xchg(&lhead->first, &llist_last);
+}
+
+static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
+{
+	struct llist_node *next = READ_ONCE(lnode->next);
+	struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
+					    lnode)->blkg;
+
+	WRITE_ONCE(lnode->next, NULL);
+	percpu_ref_put(&blkg->refcnt);
+	return next;
+}
+
+/*
+ * The retrieved blkg_iostat_set is immediately marked as not in the
+ * lockless list by clearing its node->next pointer. It could be put
+ * back into the list by a parallel update before the iostat's are
+ * finally flushed including probably the new update.
+ */
+#define blkcg_llist_for_each_entry_safe(pos, node, nxt)			\
+	for (; (node != &llist_last) &&					\
+	       (pos = llist_entry(node, struct blkg_iostat_set, lnode),	\
+		nxt = fetch_delete_lnode_next(node), true);		\
+		node = nxt)
+
 /**
  * blkcg_css - find the current css
  *
@@ -236,8 +286,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;
 
 	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
@@ -852,17 +904,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode, *lnext;
+	struct blkg_iostat_set *bisc;
 
 	/* Root-level stats are sourced from system-wide IO stats */
 	if (!cgroup_parent(css->cgroup))
 		return;
 
+	if (blkcg_llist_empty(lhead))
+		return;
+
 	rcu_read_lock();
 
-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = fetch_delete_blkcg_llist(lhead);
+	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
+		struct blkcg_gq *blkg = bisc->blkg;
 		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
 		struct blkg_iostat cur, delta;
 		unsigned long flags;
 		unsigned int seq;
@@ -1189,6 +1247,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			goto unlock;
 	}
 
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		goto free_blkcg;
+	init_blkcg_llists(blkcg);
+
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkcg_policy_data *cpd;
@@ -1229,7 +1292,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	for (i--; i >= 0; i--)
 		if (blkcg->cpd[i])
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
 	if (blkcg != &blkcg_root)
 		kfree(blkcg);
 unlock:
@@ -1993,6 +2057,7 @@ static int blk_cgroup_io_type(struct bio *bio)
 
 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
 	int rwd = blk_cgroup_io_type(bio), cpu;
 	struct blkg_iostat_set *bis;
 	unsigned long flags;
@@ -2011,9 +2076,16 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 	bis->cur.ios[rwd]++;
 
+	if (!READ_ONCE(bis->lnode.next)) {
+		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+
+		llist_add(&bis->lnode, lhead);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
 	put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d4de0a35e066..2c36362a332e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include <linux/cgroup.h>
 #include <linux/kthread.h>
 #include <linux/blk-mq.h>
+#include <linux/llist.h>
 
 struct blkcg_gq;
 struct blkg_policy_data;
@@ -43,6 +44,8 @@ struct blkg_iostat {
 
 struct blkg_iostat_set {
 	struct u64_stats_sync		sync;
+	struct llist_node		lnode;
+	struct blkcg_gq		       *blkg;
 	struct blkg_iostat		cur;
 	struct blkg_iostat		last;
 };
@@ -97,6 +100,12 @@ struct blkcg {
 	struct blkcg_policy_data	*cpd[BLKCG_MAX_POLS];
 
 	struct list_head		all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu	*lhead;
+
 #ifdef CONFIG_BLK_CGROUP_FC_APPID
 	char                            fc_app_id[FC_APPID_LEN];
 #endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
  2022-06-02 13:35 ` [PATCH v5 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path Waiman Long
@ 2022-06-02 16:16   ` Tejun Heo
  2022-06-02 17:17     ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2022-06-02 16:16 UTC (permalink / raw)
  To: Waiman Long; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On Thu, Jun 02, 2022 at 09:35:42AM -0400, Waiman Long wrote:
> For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
> Simplify error handling code by returning this error directly instead
> of setting an intermediate "ret" variable.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Acked-by: Tejun Heo <tj@kernel.org>

But I don't understand why this would trigger warning. Can you please
elaborate why this is needed.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-02 13:35 ` [PATCH v5 3/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
@ 2022-06-02 16:58   ` Tejun Heo
  2022-06-02 17:26     ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2022-06-02 16:58 UTC (permalink / raw)
  To: Waiman Long; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

Hello,

On Thu, Jun 02, 2022 at 09:35:43AM -0400, Waiman Long wrote:
> @@ -2011,9 +2076,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>  	}
>  	bis->cur.ios[rwd]++;
>  
> +	if (!READ_ONCE(bis->lnode.next)) {
> +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> +
> +		llist_add(&bis->lnode, lhead);
> +		percpu_ref_get(&bis->blkg->refcnt);

Hmm... what guarantees that more than one threads race here? llist assumes
that there's a single writer for a given llist_node and the ref count would
be off too, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
  2022-06-02 16:16   ` Tejun Heo
@ 2022-06-02 17:17     ` Waiman Long
  0 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2022-06-02 17:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On 6/2/22 12:16, Tejun Heo wrote:
> On Thu, Jun 02, 2022 at 09:35:42AM -0400, Waiman Long wrote:
>> For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
>> Simplify error handling code by returning this error directly instead
>> of setting an intermediate "ret" variable.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Tejun Heo <tj@kernel.org>
>
> But I don't understand why this would trigger warning. Can you please
> elaborate why this is needed.
>
> Thanks.

I forgot to set "ret" in my original patch 2 in case of allocation 
error. I didn't got a warning in my own build, maybe I didn't explicitly 
enable more warning. I could have modified the patch to set "ret" on 
error, but the "ret" setting looks duplicative to me and so I added this 
patch to get rid of it.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-02 16:58   ` Tejun Heo
@ 2022-06-02 17:26     ` Waiman Long
  2022-06-02 17:46       ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2022-06-02 17:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On 6/2/22 12:58, Tejun Heo wrote:
> Hello,
>
> On Thu, Jun 02, 2022 at 09:35:43AM -0400, Waiman Long wrote:
>> @@ -2011,9 +2076,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>>   	}
>>   	bis->cur.ios[rwd]++;
>>   
>> +	if (!READ_ONCE(bis->lnode.next)) {
>> +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
>> +
>> +		llist_add(&bis->lnode, lhead);
>> +		percpu_ref_get(&bis->blkg->refcnt);
> Hmm... what guarantees that more than one threads race here? llist assumes
> that there's a single writer for a given llist_node and the ref count would
> be off too, right?

The llist_add() function is atomic. It calls into llist_add_batch() in 
lib/llist.c which uses cmpxchg() to make the change. There is a 
non-atomic version __llist_add() which may be problematic in this case. 
Note that irq is disabled in the u64_stats_update* critical section, 
there shouldn't be a racing thread running in the same cpu. Other cpus 
will modify their own version of lhead. Perhaps the non-atomic version 
can be used here as well.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-02 17:26     ` Waiman Long
@ 2022-06-02 17:46       ` Tejun Heo
  2022-06-02 18:18         ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2022-06-02 17:46 UTC (permalink / raw)
  To: Waiman Long; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On Thu, Jun 02, 2022 at 01:26:10PM -0400, Waiman Long wrote:
> 
> On 6/2/22 12:58, Tejun Heo wrote:
> > Hello,
> > 
> > On Thu, Jun 02, 2022 at 09:35:43AM -0400, Waiman Long wrote:
> > > @@ -2011,9 +2076,16 @@ void blk_cgroup_bio_start(struct bio *bio)
> > >   	}
> > >   	bis->cur.ios[rwd]++;
> > > +	if (!READ_ONCE(bis->lnode.next)) {
> > > +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> > > +
> > > +		llist_add(&bis->lnode, lhead);
> > > +		percpu_ref_get(&bis->blkg->refcnt);
> > Hmm... what guarantees that more than one threads race here? llist assumes
> > that there's a single writer for a given llist_node and the ref count would
> > be off too, right?
> 
> The llist_add() function is atomic. It calls into llist_add_batch() in
> lib/llist.c which uses cmpxchg() to make the change. There is a non-atomic
> version __llist_add() which may be problematic in this case. Note that irq
> is disabled in the u64_stats_update* critical section, there shouldn't be a
> racing thread running in the same cpu. Other cpus will modify their own
> version of lhead. Perhaps the non-atomic version can be used here as well.

Ah, right, this is per-cpu, so there can be no second writer trying to add
the same node at the same time. Can you add a comment explaining the overall
design / behavior? Other than that, please feel free to add

 Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-06-02 17:46       ` Tejun Heo
@ 2022-06-02 18:18         ` Waiman Long
  0 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2022-06-02 18:18 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On 6/2/22 13:46, Tejun Heo wrote:
> On Thu, Jun 02, 2022 at 01:26:10PM -0400, Waiman Long wrote:
>> On 6/2/22 12:58, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Thu, Jun 02, 2022 at 09:35:43AM -0400, Waiman Long wrote:
>>>> @@ -2011,9 +2076,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>>>>    	}
>>>>    	bis->cur.ios[rwd]++;
>>>> +	if (!READ_ONCE(bis->lnode.next)) {
>>>> +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
>>>> +
>>>> +		llist_add(&bis->lnode, lhead);
>>>> +		percpu_ref_get(&bis->blkg->refcnt);
>>> Hmm... what guarantees that more than one threads race here? llist assumes
>>> that there's a single writer for a given llist_node and the ref count would
>>> be off too, right?
>> The llist_add() function is atomic. It calls into llist_add_batch() in
>> lib/llist.c which uses cmpxchg() to make the change. There is a non-atomic
>> version __llist_add() which may be problematic in this case. Note that irq
>> is disabled in the u64_stats_update* critical section, there shouldn't be a
>> racing thread running in the same cpu. Other cpus will modify their own
>> version of lhead. Perhaps the non-atomic version can be used here as well.
> Ah, right, this is per-cpu, so there can be no second writer trying to add
> the same node at the same time. Can you add a comment explaining the overall
> design / behavior? Other than that, please feel free to add
>
>   Acked-by: Tejun Heo <tj@kernel.org>
>
> Thanks.

OK, I will send another patch to document the design in 
block/blk-cgroup.c. I don't want to touch this patch unless I need to 
correct some code here.

Thanks,
Longman

>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 4/4] blk-cgroup: Document the design of new lockless iostat_cpu list
  2022-06-02 13:35 ` [PATCH v5 1/3] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit Waiman Long
@ 2022-06-02 18:54   ` Waiman Long
  2022-06-02 19:05     ` Tejun Heo
  0 siblings, 1 reply; 20+ messages in thread
From: Waiman Long @ 2022-06-02 18:54 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

A set of percpu lockless lists per block cgroup (blkcg) is added to
track the set of recently updated iostat_cpu structures. Add comment
in the code to document the design of this new set of lockless lists.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 block/blk-cgroup.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 8af97f3b2fc9..f8f27551c16a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -60,6 +60,21 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
 #define BLKG_DESTROY_BATCH_SIZE  64
 
 /*
+ * Lockless lists for tracking IO stats update
+ *
+ * New IO stats are stored in the percpu iostat_cpu within blkcg_gq (blkg).
+ * There are multiple blkg's (one for each block device) attached to each
+ * blkcg. The rstat code keeps track of which cpu has IO stats updated,
+ * but it doesn't know which blkg has the updated stats. If there are many
+ * block devices in a system, the cost of iterating all the blkg's to flush
+ * out the IO stats can be high. To reduce such overhead, a set of percpu
+ * lockless lists (lhead) per blkcg are used to track the set of recently
+ * updated iostat_cpu's since the last flush. An iostat_cpu will be put
+ * onto the lockless list on the update side [blk_cgroup_bio_start()] if
+ * not there yet and then removed when being flushed [blkcg_rstat_flush()].
+ * References to blkg are gotten and then put back in the process to
+ * protect against blkg removal.
+ *
  * lnode.next of the last entry in a lockless list is NULL. To enable us to
  * use lnode.next as a boolean flag to indicate its presence in a lockless
  * list, we have to make it non-NULL for all. This is done by using a
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 4/4] blk-cgroup: Document the design of new lockless iostat_cpu list
  2022-06-02 18:54   ` [PATCH v5 4/4] blk-cgroup: Document the design of new lockless iostat_cpu list Waiman Long
@ 2022-06-02 19:05     ` Tejun Heo
  2022-06-02 19:12       ` Waiman Long
  0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2022-06-02 19:05 UTC (permalink / raw)
  To: Waiman Long; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On Thu, Jun 02, 2022 at 02:54:01PM -0400, Waiman Long wrote:
> A set of percpu lockless lists per block cgroup (blkcg) is added to
> track the set of recently updated iostat_cpu structures. Add comment
> in the code to document the design of this new set of lockless lists.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 4/4] blk-cgroup: Document the design of new lockless iostat_cpu list
  2022-06-02 19:05     ` Tejun Heo
@ 2022-06-02 19:12       ` Waiman Long
  0 siblings, 0 replies; 20+ messages in thread
From: Waiman Long @ 2022-06-02 19:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On 6/2/22 15:05, Tejun Heo wrote:
> On Thu, Jun 02, 2022 at 02:54:01PM -0400, Waiman Long wrote:
>> A set of percpu lockless lists per block cgroup (blkcg) is added to
>> track the set of recently updated iostat_cpu structures. Add comment
>> in the code to document the design of this new set of lockless lists.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Tejun Heo <tj@kernel.org>
>
> Thanks.

I have just realized that I forgot to free the percpu blkcg->lhead in 
blkcg_css_free(). I will send out v6 with this change as well as 
integrating this documentation patch back. Sorry for the omission.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-06-02 19:13 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-01 21:18 [PATCH v3 0/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
2022-06-01 21:18 ` [PATCH v3 1/2] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit Waiman Long
2022-06-01 21:18 ` [PATCH v3 2/2] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
2022-06-01 21:26   ` Tejun Heo
2022-06-01 21:30     ` Waiman Long
2022-06-02  6:32   ` kernel test robot
2022-06-02  1:54 ` [PATCH v4 " Waiman Long
2022-06-02 13:35 ` [PATCH v5 0/3] " Waiman Long
2022-06-02 13:35 ` [PATCH v5 1/3] blk-cgroup: Correctly free percpu iostat_cpu in blkg on error exit Waiman Long
2022-06-02 18:54   ` [PATCH v5 4/4] blk-cgroup: Document the design of new lockless iostat_cpu list Waiman Long
2022-06-02 19:05     ` Tejun Heo
2022-06-02 19:12       ` Waiman Long
2022-06-02 13:35 ` [PATCH v5 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path Waiman Long
2022-06-02 16:16   ` Tejun Heo
2022-06-02 17:17     ` Waiman Long
2022-06-02 13:35 ` [PATCH v5 3/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
2022-06-02 16:58   ` Tejun Heo
2022-06-02 17:26     ` Waiman Long
2022-06-02 17:46       ` Tejun Heo
2022-06-02 18:18         ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).