linux-kernel.vger.kernel.org archive mirror
* [PATCH v8 0/3] blk-cgroup: Optimize blkcg_rstat_flush()
@ 2022-10-04 15:17 Waiman Long
  2022-10-04 15:17 ` [PATCH v8 1/3] llist: Allow optional sentinel node terminated lockless list Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Waiman Long @ 2022-10-04 15:17 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Andy Shevchenko,
	Andrew Morton, Michal Koutný,
	Waiman Long

 v8:
  - Update the llist patch to make the existing llist functions and
    macros work for both NULL and sentinel terminated lockless lists as
    much as possible, leaving only the initialization and removal
    functions with separate sentinel terminated llist variants.

 v7:
  - Drop patch 1 ("blk-cgroup: Correctly free percpu iostat_cpu in blkg
    on error exit") as it is found to be unnecessary.
  - Add a new llist patch to provide a lockless list variant terminated
    by a sentinel node.
  - Modify patch 3 to use the new sllist API and move percpu_ref_put()
    later in the blkcg_rstat_flush() loop to prevent a potential
    use-after-free problem.

 v6:
  - Add a missing free_percpu() into blkcg_css_free() in patch 3.
  - Integrate the documentation patch 4 back into patch 3.

 v5:
  - Add a new patch 2 to eliminate the use of the intermediate "ret"
    variable in blkcg_css_alloc() to fix a compilation warning reported
    by the kernel test robot.

This patch series improves blkcg_rstat_flush() performance by eliminating
unnecessary blkg enumeration and flush operations for those blkg's and
blkg_iostat_set's that haven't been updated since the last flush. It
also enhances the llist API to support a sentinel terminated variant
of the lockless list.

Waiman Long (3):
  llist: Allow optional sentinel node terminated lockless list
  blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
  blk-cgroup: Optimize blkcg_rstat_flush()

 block/blk-cgroup.c    |  87 ++++++++++++++++++++++++++++++------
 block/blk-cgroup.h    |   9 ++++
 include/linux/llist.h | 100 +++++++++++++++++++++++++++++++++---------
 lib/llist.c           |  20 ++++++---
 4 files changed, 177 insertions(+), 39 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v8 1/3] llist: Allow optional sentinel node terminated lockless list
  2022-10-04 15:17 [PATCH v8 0/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
@ 2022-10-04 15:17 ` Waiman Long
  2022-10-04 15:17 ` [PATCH v8 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Waiman Long @ 2022-10-04 15:17 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Andy Shevchenko,
	Andrew Morton, Michal Koutný,
	Waiman Long

The lockless list API is useful for dealing with lists in a lockless
manner. However, one of the drawbacks of the existing API is that there
is no easy way to determine if an entry has already been put into a
lockless list. This has to be tracked externally and the tracking will
not be atomic unless some external synchronization logic is in place.

This patch changes the internals of the lockless list code to allow it
to support a lockless list terminated by an internal sentinel value
(LLIST_END) instead of NULL. The advantage of this scheme is that
we can atomically determine if an entry has been put into a lockless
list by doing a NULL check of the next pointer of the llist_node. The
drawback is that a bit more code may be needed to handle both NULL and
the sentinel value. The real world performance impact of this change,
however, should be negligible.

To use a sentinel terminated lockless list, the following new APIs must
be used for the initialization and deletion of a lockless list.

 - SLLIST_HEAD_INIT() and init_sllist_head() for initialization
 - sllist_del_all() and __sllist_del_all() for deletion

Other llist APIs are modified to handle both NULL and sentinel
terminated lockless lists.

Of course, the callers should clear the next pointer when an entry is
removed from a sentinel terminated lockless list. Note that the internal
LLIST_END sentinel value will never be returned to callers. For backward
compatibility, NULL is always returned when the lockless list is empty.
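
For illustration, a minimal usage sketch is shown below. The item
structure, the "pending" list and the process() helper are hypothetical
names used only for this example, and the check-then-add is assumed to
be serialized by the caller (e.g. a percpu list updated with preemption
disabled), as in the blk-cgroup user later in this series:

#include <linux/llist.h>

struct item {
	struct llist_node	lnode;
	int			data;
};

extern void process(struct item *it);	/* hypothetical consumer */

static SLLIST_HEAD(pending);		/* sentinel terminated list */

static void queue_item(struct item *it)
{
	/* lnode.next is NULL only when the item is not on the list */
	if (!READ_ONCE(it->lnode.next))
		llist_add(&it->lnode, &pending);
}

static void flush_items(void)
{
	struct llist_node *first = sllist_del_all(&pending);
	struct item *it, *next;

	llist_for_each_entry_safe(it, next, first, lnode) {
		/* clear the membership marker before handing it off */
		WRITE_ONCE(it->lnode.next, NULL);
		process(it);
	}
}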

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/llist.h | 100 +++++++++++++++++++++++++++++++++---------
 lib/llist.c           |  20 ++++++---
 2 files changed, 95 insertions(+), 25 deletions(-)

diff --git a/include/linux/llist.h b/include/linux/llist.h
index 85bda2d02d65..c7380e9b98e2 100644
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -2,7 +2,8 @@
 #ifndef LLIST_H
 #define LLIST_H
 /*
- * Lock-less NULL terminated single linked list
+ * Lock-less NULL or sentinel terminated singly linked list
+ * --------------------------------------------------------
  *
  * Cases where locking is not needed:
  * If there are multiple producers and multiple consumers, llist_add can be
@@ -44,6 +45,15 @@
  * list can NOT be used in NMI handlers.  So code that uses the list in
  * an NMI handler should depend on CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.
  *
+ * A sentinel node terminated lock-less list allows lock-less list membership
+ * determination to be done atomically by doing a NULL check of the next
+ * pointer of the llist_node as it will never be NULL if it is in a lock-less
+ * list. The following APIs must be used for the initialization and deletion
+ * of a sentinel terminated lock-less list.
+ *
+ * - SLLIST_HEAD_INIT() and init_sllist_head() for initialization
+ * - sllist_del_all() and __sllist_del_all() for deletion
+ *
  * Copyright 2010,2011 Intel Corp.
  *   Author: Huang Ying <ying.huang@intel.com>
  */
@@ -64,6 +74,16 @@ struct llist_node {
 #define LLIST_HEAD_INIT(name)	{ NULL }
 #define LLIST_HEAD(name)	struct llist_head name = LLIST_HEAD_INIT(name)
 
+/*
+ * Sentinel terminated llist_head initializer
+ *
+ * LLIST_END is chosen to be 1 so that a check for both NULL and LLIST_END
+ * can be optimized by the compiler to a single unsigned integer comparison.
+ */
+#define LLIST_END		((struct llist_node *)1UL)
+#define SLLIST_HEAD_INIT(name)	{ LLIST_END }
+#define SLLIST_HEAD(name)	struct llist_head name = SLLIST_HEAD_INIT(name)
+
 /**
  * init_llist_head - initialize lock-less list head
  * @head:	the head for your lock-less list
@@ -73,6 +93,15 @@ static inline void init_llist_head(struct llist_head *list)
 	list->first = NULL;
 }
 
+/**
+ * init_sllist_head - initialize sentinel terminated lock-less list head
+ * @head:	the head for your lock-less list
+ */
+static inline void init_sllist_head(struct llist_head *list)
+{
+	list->first = LLIST_END;
+}
+
 /**
  * llist_entry - get the struct of this entry
  * @ptr:	the &struct llist_node pointer.
@@ -83,21 +112,22 @@ static inline void init_llist_head(struct llist_head *list)
 	container_of(ptr, type, member)
 
 /**
- * member_address_is_nonnull - check whether the member address is not NULL
+ * member_address_is_valid - check whether member addr is not NULL or sentinel
  * @ptr:	the object pointer (struct type * that contains the llist_node)
  * @member:	the name of the llist_node within the struct.
  *
  * This macro is conceptually the same as
- *	&ptr->member != NULL
+ *	(&ptr->member != NULL) && (&ptr->member != LLIST_END)
  * but it works around the fact that compilers can decide that taking a member
- * address is never a NULL pointer.
+ * address is never a NULL or the sentinel pointer.
  *
- * Real objects that start at a high address and have a member at NULL are
- * unlikely to exist, but such pointers may be returned e.g. by the
- * container_of() macro.
+ * Real objects that start at a high address and have a member at NULL or
+ * LLIST_END are unlikely to exist, but such pointers may be returned e.g.
+ * by the container_of() macro.
  */
-#define member_address_is_nonnull(ptr, member)	\
-	((uintptr_t)(ptr) + offsetof(typeof(*(ptr)), member) != 0)
+#define member_address_is_valid(ptr, member)				       \
+	({ uintptr_t __n = (uintptr_t)(ptr) + offsetof(typeof(*(ptr)), member);\
+	   __n && (__n != (uintptr_t)LLIST_END); })
 
 /**
  * llist_for_each - iterate over some deleted entries of a lock-less list
@@ -114,7 +144,7 @@ static inline void init_llist_head(struct llist_head *list)
  * reverse the order by yourself before traversing.
  */
 #define llist_for_each(pos, node)			\
-	for ((pos) = (node); pos; (pos) = (pos)->next)
+	for ((pos) = (node); (pos) && (pos) != LLIST_END; (pos) = (pos)->next)
 
 /**
  * llist_for_each_safe - iterate over some deleted entries of a lock-less list
@@ -133,7 +163,8 @@ static inline void init_llist_head(struct llist_head *list)
  * reverse the order by yourself before traversing.
  */
 #define llist_for_each_safe(pos, n, node)			\
-	for ((pos) = (node); (pos) && ((n) = (pos)->next, true); (pos) = (n))
+	for ((pos) = (node); (pos) && ((pos) != LLIST_END) &&	\
+	     ((n) = (pos)->next, true); (pos) = (n))
 
 /**
  * llist_for_each_entry - iterate over some deleted entries of lock-less list of given type
@@ -152,7 +183,7 @@ static inline void init_llist_head(struct llist_head *list)
  */
 #define llist_for_each_entry(pos, node, member)				\
 	for ((pos) = llist_entry((node), typeof(*(pos)), member);	\
-	     member_address_is_nonnull(pos, member);			\
+	     member_address_is_valid(pos, member);			\
 	     (pos) = llist_entry((pos)->member.next, typeof(*(pos)), member))
 
 /**
@@ -172,11 +203,11 @@ static inline void init_llist_head(struct llist_head *list)
  * you want to traverse from the oldest to the newest, you must
  * reverse the order by yourself before traversing.
  */
-#define llist_for_each_entry_safe(pos, n, node, member)			       \
-	for (pos = llist_entry((node), typeof(*pos), member);		       \
-	     member_address_is_nonnull(pos, member) &&			       \
-	        (n = llist_entry(pos->member.next, typeof(*n), member), true); \
-	     pos = n)
+#define llist_for_each_entry_safe(pos, n, node, member)			   \
+	for (pos = llist_entry((node), typeof(*(pos)), member);		   \
+	     member_address_is_valid(pos, member) &&			   \
+		(n = llist_entry((pos)->member.next, typeof(*(n)), member),\
+		 true); pos = n)
 
 /**
  * llist_empty - tests whether a lock-less list is empty
@@ -188,12 +219,16 @@ static inline void init_llist_head(struct llist_head *list)
  */
 static inline bool llist_empty(const struct llist_head *head)
 {
-	return READ_ONCE(head->first) == NULL;
+	struct llist_node *first = READ_ONCE(head->first);
+
+	return !first || (first == LLIST_END);
 }
 
 static inline struct llist_node *llist_next(struct llist_node *node)
 {
-	return node->next;
+	struct llist_node *next = node->next;
+
+	return (next == LLIST_END) ? NULL : next;
 }
 
 extern bool llist_add_batch(struct llist_node *new_first,
@@ -204,9 +239,11 @@ static inline bool __llist_add_batch(struct llist_node *new_first,
 				     struct llist_node *new_last,
 				     struct llist_head *head)
 {
+	bool empty = llist_empty(head);
+
 	new_last->next = head->first;
 	head->first = new_first;
-	return new_last->next == NULL;
+	return empty;
 }
 
 /**
@@ -247,6 +284,29 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
 	return first;
 }
 
+/**
+ * sllist_del_all - delete all entries from sentinel terminated lock-less list
+ * @head:	the head of lock-less list to delete all entries
+ *
+ * If list is empty, return NULL, otherwise, delete all entries and
+ * return the pointer to the first entry.  The order of entries
+ * deleted is from the newest to the oldest added one.
+ */
+static inline struct llist_node *sllist_del_all(struct llist_head *head)
+{
+	struct llist_node *first = xchg(&head->first, LLIST_END);
+
+	return (first == LLIST_END) ? NULL : first;
+}
+
+static inline struct llist_node *__sllist_del_all(struct llist_head *head)
+{
+	struct llist_node *first = head->first;
+
+	head->first = LLIST_END;
+	return (first == LLIST_END) ? NULL : first;
+}
+
 extern struct llist_node *llist_del_first(struct llist_head *head);
 
 struct llist_node *llist_reverse_order(struct llist_node *head);
diff --git a/lib/llist.c b/lib/llist.c
index 611ce4881a87..1e782c9cafa8 100644
--- a/lib/llist.c
+++ b/lib/llist.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * Lock-less NULL terminated single linked list
+ * Lock-less NULL or sentinel terminated singly linked lists
  *
  * The basic atomic operation of this list is cmpxchg on long.  On
  * architectures that don't have NMI-safe cmpxchg implementation, the
@@ -14,7 +14,6 @@
 #include <linux/export.h>
 #include <linux/llist.h>
 
-
 /**
  * llist_add_batch - add several linked entries in batch
  * @new_first:	first entry in batch to be added
@@ -32,7 +31,7 @@ bool llist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
 		new_last->next = first = READ_ONCE(head->first);
 	} while (cmpxchg(&head->first, first, new_first) != first);
 
-	return !first;
+	return !first || (first == LLIST_END);
 }
 EXPORT_SYMBOL_GPL(llist_add_batch);
 
@@ -56,7 +55,7 @@ struct llist_node *llist_del_first(struct llist_head *head)
 
 	entry = smp_load_acquire(&head->first);
 	for (;;) {
-		if (entry == NULL)
+		if (!entry || (entry == LLIST_END))
 			return NULL;
 		old_entry = entry;
 		next = READ_ONCE(entry->next);
@@ -79,14 +78,25 @@ EXPORT_SYMBOL_GPL(llist_del_first);
 struct llist_node *llist_reverse_order(struct llist_node *head)
 {
 	struct llist_node *new_head = NULL;
+	struct llist_node *new_tail = head;
+
+	if (!head || (head == LLIST_END))
+		return NULL;
 
-	while (head) {
+	while (head && (head != LLIST_END)) {
 		struct llist_node *tmp = head;
+
 		head = head->next;
 		tmp->next = new_head;
 		new_head = tmp;
 	}
 
+	/*
+	 * Terminate list with the same NULL or sentinel terminator
+	 */
+	if (head)
+		new_tail->next = LLIST_END;
+
 	return new_head;
 }
 EXPORT_SYMBOL_GPL(llist_reverse_order);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v8 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
  2022-10-04 15:17 [PATCH v8 0/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
  2022-10-04 15:17 ` [PATCH v8 1/3] llist: Allow optional sentinel node terminated lockless list Waiman Long
@ 2022-10-04 15:17 ` Waiman Long
  2022-10-04 15:17 ` [PATCH v8 3/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
       [not found] ` <20221006101141.1832-1-hdanton@sina.com>
  3 siblings, 0 replies; 7+ messages in thread
From: Waiman Long @ 2022-10-04 15:17 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Andy Shevchenko,
	Andrew Morton, Michal Koutný,
	Waiman Long

For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
Simplify error handling code by returning this error directly instead
of setting an intermediate "ret" variable.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 869af9d72bcf..946592249795 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1177,7 +1177,6 @@ static struct cgroup_subsys_state *
 blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct blkcg *blkcg;
-	struct cgroup_subsys_state *ret;
 	int i;
 
 	mutex_lock(&blkcg_pol_mutex);
@@ -1186,10 +1185,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		blkcg = &blkcg_root;
 	} else {
 		blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
-		if (!blkcg) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!blkcg)
 			goto unlock;
-		}
 	}
 
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
@@ -1206,10 +1203,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			continue;
 
 		cpd = pol->cpd_alloc_fn(GFP_KERNEL);
-		if (!cpd) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!cpd)
 			goto free_pd_blkcg;
-		}
+
 		blkcg->cpd[i] = cpd;
 		cpd->blkcg = blkcg;
 		cpd->plid = i;
@@ -1238,7 +1234,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		kfree(blkcg);
 unlock:
 	mutex_unlock(&blkcg_pol_mutex);
-	return ret;
+	return ERR_PTR(-ENOMEM);
 }
 
 static int blkcg_css_online(struct cgroup_subsys_state *css)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v8 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-10-04 15:17 [PATCH v8 0/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
  2022-10-04 15:17 ` [PATCH v8 1/3] llist: Allow optional sentinel node terminated lockless list Waiman Long
  2022-10-04 15:17 ` [PATCH v8 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path Waiman Long
@ 2022-10-04 15:17 ` Waiman Long
  2022-10-04 18:49   ` Michal Koutný
       [not found] ` <20221006101141.1832-1-hdanton@sina.com>
  3 siblings, 1 reply; 7+ messages in thread
From: Waiman Long @ 2022-10-04 15:17 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Andy Shevchenko,
	Andrew Morton, Michal Koutný,
	Waiman Long

For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
can be especially problematic as interrupts are disabled during the flush.
It was reported that it might take seconds to complete in some extreme
cases leading to hard lockup messages.

As it is likely that not all the percpu blkg_iostat_set's have been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case. This patch optimizes blkcg_rstat_flush()
by keeping a lockless list of recently updated blkg_iostat_set's in a
newly added percpu blkcg->lhead pointer.

The blkg_iostat_set is added to a sentinel lockless list on the update
side in blk_cgroup_bio_start(). It is removed from the sentinel lockless
list when flushed in blkcg_rstat_flush(). Due to racing, it is possible
that blkg_iostat_set's in the lockless list may have no new IO stats to
be flushed, but that is OK.

To protect against destruction of blkg, a percpu reference is gotten
when putting into the lockless list and put back when removed.

Whether a blkg_iostat_set is currently on a lockless list can be
determined by checking its lnode.next pointer, which is non-NULL
whenever the entry is on a sentinel terminated lockless list.
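
Condensed from the diff below (locking, RCU and the u64_stats sequence
are omitted), the update/flush pairing is roughly:

	/* update side, in blk_cgroup_bio_start(), serialized per cpu */
	if (!READ_ONCE(bis->lnode.next)) {
		llist_add(&bis->lnode, this_cpu_ptr(blkcg->lhead));
		percpu_ref_get(&bis->blkg->refcnt);
	}

	/* flush side, in blkcg_rstat_flush(), for the given cpu */
	lnode = sllist_del_all(per_cpu_ptr(blkcg->lhead, cpu));
	llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
		WRITE_ONCE(bisc->lnode.next, NULL);	/* may be queued again */
		/* ... fold bisc->cur into blkg->iostat and the parent ... */
		percpu_ref_put(&bisc->blkg->refcnt);
	}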

When booting up an instrumented test kernel with this patch on a
2-socket 96-thread system with cgroup v2, out of the 2051 calls to
cgroup_rstat_flush() after bootup, 1788 of the calls returned
immediately because the lockless lists were empty. After an all-cpu
kernel build, the ratio became 6295424/6340513, i.e. more than 99%.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 75 ++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h |  9 ++++++
 2 files changed, 78 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 946592249795..63569b05db0d 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -59,6 +59,37 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
 
 #define BLKG_DESTROY_BATCH_SIZE  64
 
+/*
+ * Lockless lists for tracking IO stats update
+ *
+ * New IO stats are stored in the percpu iostat_cpu within blkcg_gq (blkg).
+ * There are multiple blkg's (one for each block device) attached to each
+ * blkcg. The rstat code keeps track of which cpu has IO stats updated,
+ * but it doesn't know which blkg has the updated stats. If there are many
+ * block devices in a system, the cost of iterating all the blkg's to flush
+ * out the IO stats can be high. To reduce such overhead, a set of percpu
+ * lockless lists (lhead) per blkcg are used to track the set of recently
+ * updated iostat_cpu's since the last flush. An iostat_cpu will be put
+ * onto the lockless list on the update side [blk_cgroup_bio_start()] if
+ * not there yet and then removed when being flushed [blkcg_rstat_flush()].
+ * References to blkg are gotten and then put back in the process to
+ * protect against blkg removal.
+ *
+ * Return: 0 if successful or -ENOMEM if allocation fails.
+ */
+static int init_blkcg_llists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu)
+		init_sllist_head(per_cpu_ptr(blkcg->lhead, cpu));
+	return 0;
+}
+
 /**
  * blkcg_css - find the current css
  *
@@ -236,8 +267,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;
 
 	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
@@ -864,7 +897,9 @@ static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode;
+	struct blkg_iostat_set *bisc, *next_bisc;
 
 	/* Root-level stats are sourced from system-wide IO stats */
 	if (!cgroup_parent(css->cgroup))
@@ -872,12 +907,21 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 
 	rcu_read_lock();
 
-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = sllist_del_all(lhead);
+	if (!lnode)
+		goto out;
+
+	/*
+	 * Iterate only the iostat_cpu's queued in the lockless list.
+	 */
+	llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
+		struct blkcg_gq *blkg = bisc->blkg;
 		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
 		struct blkg_iostat cur;
 		unsigned int seq;
 
+		WRITE_ONCE(bisc->lnode.next, NULL);
+
 		/* fetch the current per-cpu values */
 		do {
 			seq = u64_stats_fetch_begin(&bisc->sync);
@@ -890,8 +934,10 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		if (parent && parent->parent)
 			blkcg_iostat_update(parent, &blkg->iostat.cur,
 					    &blkg->iostat.last);
+		percpu_ref_put(&blkg->refcnt);
 	}
 
+out:
 	rcu_read_unlock();
 }
 
@@ -1170,6 +1216,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
 
 	mutex_unlock(&blkcg_pol_mutex);
 
+	free_percpu(blkcg->lhead);
 	kfree(blkcg);
 }
 
@@ -1189,6 +1236,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			goto unlock;
 	}
 
+	if (init_blkcg_llists(blkcg))
+		goto free_blkcg;
+
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkcg_policy_data *cpd;
@@ -1229,7 +1279,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	for (i--; i >= 0; i--)
 		if (blkcg->cpd[i])
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
 	if (blkcg != &blkcg_root)
 		kfree(blkcg);
 unlock:
@@ -1990,6 +2041,7 @@ static int blk_cgroup_io_type(struct bio *bio)
 
 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
 	int rwd = blk_cgroup_io_type(bio), cpu;
 	struct blkg_iostat_set *bis;
 	unsigned long flags;
@@ -2008,9 +2060,20 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 	bis->cur.ios[rwd]++;
 
+	/*
+	 * If the iostat_cpu isn't in a lockless list, put it into the
+	 * list to indicate that a stat update is pending.
+	 */
+	if (!READ_ONCE(bis->lnode.next)) {
+		struct llist_head *lhead = this_cpu_ptr(blkcg->lhead);
+
+		llist_add(&bis->lnode, lhead);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
 	put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d2724d1dd7c9..0968b6c8ea12 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include <linux/cgroup.h>
 #include <linux/kthread.h>
 #include <linux/blk-mq.h>
+#include <linux/llist.h>
 
 struct blkcg_gq;
 struct blkg_policy_data;
@@ -43,6 +44,8 @@ struct blkg_iostat {
 
 struct blkg_iostat_set {
 	struct u64_stats_sync		sync;
+	struct llist_node		lnode;
+	struct blkcg_gq		       *blkg;
 	struct blkg_iostat		cur;
 	struct blkg_iostat		last;
 };
@@ -97,6 +100,12 @@ struct blkcg {
 	struct blkcg_policy_data	*cpd[BLKCG_MAX_POLS];
 
 	struct list_head		all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu	*lhead;
+
 #ifdef CONFIG_BLK_CGROUP_FC_APPID
 	char                            fc_app_id[FC_APPID_LEN];
 #endif
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v8 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-10-04 15:17 ` [PATCH v8 3/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
@ 2022-10-04 18:49   ` Michal Koutný
  2022-10-04 22:53     ` Waiman Long
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Koutný @ 2022-10-04 18:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel,
	Ming Lei, Andy Shevchenko, Andrew Morton

Hello.

On Tue, Oct 04, 2022 at 11:17:48AM -0400, Waiman Long <longman@redhat.com> wrote:
> To protect against destruction of blkg, a percpu reference is gotten
> when putting into the lockless list and put back when removed.

Just to conclude my previous remark about the loop, let me try
explaining it more precisely:

blkcg->lhead via blkg_iostat_set holds reference to blkcg_gq 
   (taken in blk_cgroup_bio_start)

blkcg_gq holds reference to its blkcg_gq->blkcg 
   (taken in blkg_create)

The cycle has two edges, the latter is broken in __blkg_release but
that's a release callback of the involved blkcg_gq->refcnt, so it won't
be called.

The first edge is broken in blkcg_rstat_flush and that's more promising.
The current code does the final flushes -- in css_release_work_fn.
The problem is that it's the release callback of blkcg->css, i.e. it's
also referenced on the cycle, therefore this final flush won't happen
before the cycle is broken.

Fortunately, any other caller of cgroup_rstat_flush comes to the rescue
-- the blkcg_rstat_flush on the stuck blkcg would decompose the lhead list
and the reference cycle is broken.

In summary, I think this adds a reference cycle but its survival time
is limited to the soonest cgroup_rstat_flush call, which should not
cause practical troubles.

HTH,
Michal


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v8 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-10-04 18:49   ` Michal Koutný
@ 2022-10-04 22:53     ` Waiman Long
  0 siblings, 0 replies; 7+ messages in thread
From: Waiman Long @ 2022-10-04 22:53 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel,
	Ming Lei, Andy Shevchenko, Andrew Morton

On 10/4/22 14:49, Michal Koutný wrote:
> Hello.
>
> On Tue, Oct 04, 2022 at 11:17:48AM -0400, Waiman Long <longman@redhat.com> wrote:
>> To protect against destruction of blkg, a percpu reference is gotten
>> when putting into the lockless list and put back when removed.
> Just to conclude my previous remark about the loop, let me try
> explaining it more precisely:
>
> blkcg->lhead via blkg_iostat_set holds reference to blkcg_gq
>     (taken in blk_cgroup_bio_start)
>
> blkcg_gq holds reference to its blkcg_gq->blkcg
>     (taken in blkg_create)
>
> The cycle has two edges, the latter is broken in __blkg_release but
> that's a release callback of the involved blkcg_gq->refcnt, so it won't
> be called.
>
> The first edge is broken in blkcg_rstat_flush and that's more promising.
> The current code does the final flushes -- in css_release_work_fn.
> The problem is that it's the release callback of blkcg->css, i.e. it's
> also referenced on the cycle, therefore this final flush won't happen
> before the cycle is broken.
>
> Fortunately, any other caller of cgroup_rstat_flush comes to the rescue
> -- the blkcg_rstat_flush on the stuck blkcg would decompose the lhead list
> and the reference cycle is broken.
>
> In summary, I think this adds a reference cycle but its survival time
> is limited to the soonest cgroup_rstat_flush call, which should not
> cause practical troubles.

Thanks for the explanation. I now get what you are referring to. Yes, 
this delayed blkcg removal problem is annoying. I think the following 
patch should eliminate this issue. What do you think?

Cheers,
Longman

----------------8<-------------[ cut here ]------------------

  block/blk-cgroup.c     | 15 ++++++++++++++-
  include/linux/cgroup.h |  1 +
  kernel/cgroup/rstat.c  | 20 ++++++++++++++++++++
  3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 63569b05db0d..f896caef9947 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1122,10 +1122,12 @@ struct list_head *blkcg_get_cgwb_list(struct cgroup_subsys_state *css)
   */
  static void blkcg_destroy_blkgs(struct blkcg *blkcg)
  {
+    int cpu;
+
      might_sleep();

+    css_get(&blkcg->css);
      spin_lock_irq(&blkcg->lock);
-
      while (!hlist_empty(&blkcg->blkg_list)) {
          struct blkcg_gq *blkg = hlist_entry(blkcg->blkg_list.first,
                          struct blkcg_gq, blkcg_node);
@@ -1148,6 +1150,17 @@ static void blkcg_destroy_blkgs(struct blkcg *blkcg)
      }

      spin_unlock_irq(&blkcg->lock);
+
+    /*
+     * Flush all the non-empty percpu lockless lists.
+     */
+    for_each_possible_cpu(cpu) {
+        struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+
+        if (!llist_empty(lhead))
+            cgroup_rstat_css_flush(&blkcg->css, cpu);
+    }
+    css_put(&blkcg->css);
  }

  /**
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ac5d0515680e..33e226a34073 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -763,6 +763,7 @@ void cgroup_rstat_flush(struct cgroup *cgrp);
  void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
  void cgroup_rstat_flush_hold(struct cgroup *cgrp);
  void cgroup_rstat_flush_release(void);
+void cgroup_rstat_css_flush(struct cgroup_subsys_state *css, int cpu);

  /*
   * Basic resource stats.
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index feb59380c896..a4e18d627b54 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -251,6 +251,26 @@ void cgroup_rstat_flush_release(void)
      spin_unlock_irq(&cgroup_rstat_lock);
  }

+/**
+ * cgroup_rstat_css_flush - flush stats for the given css and cpu
+ * @css: target css to be flushed
+ * @cpu: the cpu that holds the stats to be flushed
+ *
+ * A lightweight rstat flush operation for a given css and cpu.
+ * Only the cpu_lock is being held for mutual exclusion, the cgroup_rstat_lock
+ * isn't used.
+ */
+void cgroup_rstat_css_flush(struct cgroup_subsys_state *css, int cpu)
+{
+    raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+
+    raw_spin_lock_irq(cpu_lock);
+    rcu_read_lock();
+    css->ss->css_rstat_flush(css, cpu);
+    rcu_read_unlock();
+    raw_spin_unlock_irq(cpu_lock);
+}
+
  int cgroup_rstat_init(struct cgroup *cgrp)
  {
      int cpu;
-- 



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v8 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
       [not found] ` <20221006101141.1832-1-hdanton@sina.com>
@ 2022-10-06 21:34   ` Waiman Long
  0 siblings, 0 replies; 7+ messages in thread
From: Waiman Long @ 2022-10-06 21:34 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel,
	Ming Lei, linux-mm, Andrew Morton

On 10/6/22 06:11, Hillf Danton wrote:
> On 4 Oct 2022 11:17:48 -0400 Waiman Long <longman@redhat.com>
>> For a system with many CPUs and block devices, the time to do
>> blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
>> can be especially problematic as interrupts are disabled during the flush.
>> It was reported that it might take seconds to complete in some extreme
>> cases leading to hard lockup messages.
>>
>> As it is likely that not all the percpu blkg_iostat_set's have been
>> updated since the last flush, those stale blkg_iostat_set's don't need
>> to be flushed in this case. This patch optimizes blkcg_rstat_flush()
>> by keeping a lockless list of recently updated blkg_iostat_set's in a
>> newly added percpu blkcg->lhead pointer.
>>
>> The blkg_iostat_set is added to a sentinel lockless list on the update
>> side in blk_cgroup_bio_start(). It is removed from the sentinel lockless
>> list when flushed in blkcg_rstat_flush(). Due to racing, it is possible
>> that blkg_iostat_set's in the lockless list may have no new IO stats to
>> be flushed, but that is OK.
> So it is likely that another flag, updated when bis is added to/deleted
> from llist, can cut 1/3 off without raising the risk of getting your patch
> over complicated.
>
>>   
>>   struct blkg_iostat_set {
>>   	struct u64_stats_sync		sync;
>> +	struct llist_node		lnode;
>> +	struct blkcg_gq		       *blkg;
> +	atomic_t			queued;
>
>>   	struct blkg_iostat		cur;
>>   	struct blkg_iostat		last;
>>   };

Yes, by introducing a flag to record the lockless list state, it is 
possible to just use the current llist implementation. Maybe I can 
rework it for now without the sentinel variant and post a separate llist 
patch for that later on.
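
Roughly, such a rework might look like the sketch below. It keeps the
existing NULL terminated llist and replaces the lnode.next check with an
explicit flag (the "lqueued" name and other details are illustrative
only, not a tested patch):

struct blkg_iostat_set {
	struct u64_stats_sync	sync;
	struct llist_node	lnode;
	struct blkcg_gq		*blkg;
	int			lqueued;	/* set while on the llist */
	struct blkg_iostat	cur;
	struct blkg_iostat	last;
};

/* update side, already inside u64_stats_update_begin_irqsave() */
if (!READ_ONCE(bis->lqueued)) {
	WRITE_ONCE(bis->lqueued, 1);
	llist_add(&bis->lnode, this_cpu_ptr(blkcg->lhead));
	percpu_ref_get(&bis->blkg->refcnt);
}

/* flush side, lhead = per_cpu_ptr(blkcg->lhead, cpu) */
llist_for_each_entry_safe(bisc, next_bisc, llist_del_all(lhead), lnode) {
	WRITE_ONCE(bisc->lqueued, 0);
	/* ... flush bisc->cur as before ... */
	percpu_ref_put(&bisc->blkg->refcnt);
}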

Cheers,
Longman


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-10-06 21:34 UTC | newest]

Thread overview: 7+ messages
2022-10-04 15:17 [PATCH v8 0/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
2022-10-04 15:17 ` [PATCH v8 1/3] llist: Allow optional sentinel node terminated lockless list Waiman Long
2022-10-04 15:17 ` [PATCH v8 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path Waiman Long
2022-10-04 15:17 ` [PATCH v8 3/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
2022-10-04 18:49   ` Michal Koutný
2022-10-04 22:53     ` Waiman Long
     [not found] ` <20221006101141.1832-1-hdanton@sina.com>
2022-10-06 21:34   ` Waiman Long
