linux-kernel.vger.kernel.org archive mirror
* [PATCH v7 0/3] blk-cgroup: Optimize blkcg_rstat_flush()
@ 2022-10-03 15:44 Waiman Long
  2022-10-03 15:44 ` [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Waiman Long @ 2022-10-03 15:44 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Andy Shevchenko,
	Andrew Morton, Huang Ying, Michal Koutný,
	Waiman Long

 v7:
  - Drop patch 1 ("blk-cgroup: Correctly free percpu iostat_cpu in blkg
    on error exit") as it is found to be unnecessary.
  - Add a new llist patch to provide a lockless list variant terminated
    by a sentinel node.
  - Modify patch 3 to use the new sllist API and move percpu_ref_put()
    later in the blkcg_rstat_flush() loop to prevent a potential
    use-after-free problem.

 v6:
  - Add a missing free_percpu() into blkcg_css_free() in patch 3.
  - Integrate the documentation patch 4 back into patch 3.

 v5:
  - Add a new patch 2 to eliminate the use of an intermediate "ret"
    variable in blkcg_css_alloc() to fix a compilation warning reported
    by the kernel test robot.

This patch series improves blkcg_rstat_flush() performance by eliminating
unnecessary blkg enumeration and flush operations for those blkg's and
blkg_iostat_set's that haven't been updated since the last flush.
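
In rough terms, the per-cpu flush changes from walking every blkg
attached to the blkcg to draining a per-cpu lockless list of only the
recently updated entries. A simplified sketch (flush_one() is just a
stand-in for the stat merge done by the real code in patch 3):

	/* Before: every blkg is visited on every flush, updated or not. */
	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node)
		flush_one(blkg);

	/* After: only the iostat_cpu's queued since the last flush are visited. */
	lnode = sllist_del_all(per_cpu_ptr(blkcg->lhead, cpu));
	if (lnode)
		sllist_for_each_entry_safe(bisc, next_bisc, lnode, lnode)
			flush_one(bisc->blkg);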

Waiman Long (3):
  llist: Add a lock-less list variant terminated by a sentinel node
  blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
  blk-cgroup: Optimize blkcg_rstat_flush()

 block/blk-cgroup.c    |  85 ++++++++++++++++++++++-----
 block/blk-cgroup.h    |   9 +++
 include/linux/llist.h | 132 +++++++++++++++++++++++++++++++++++++++++-
 lib/llist.c           |  79 ++++++++++++++++++++++++-
 4 files changed, 289 insertions(+), 16 deletions(-)

-- 
2.31.1



* [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node
  2022-10-03 15:44 [PATCH v7 0/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
@ 2022-10-03 15:44 ` Waiman Long
  2022-10-03 16:40   ` Tejun Heo
  2022-10-08  2:15   ` Huang, Ying
  2022-10-03 15:44 ` [PATCH v7 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path Waiman Long
  2022-10-03 15:44 ` [PATCH v7 3/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
  2 siblings, 2 replies; 14+ messages in thread
From: Waiman Long @ 2022-10-03 15:44 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Andy Shevchenko,
	Andrew Morton, Huang Ying, Michal Koutný,
	Waiman Long

The lock-less list API is useful for dealing with lists in a lock-less
manner. However, one drawback of the current API is that there is no
easy way to determine if an entry has already been put into a lock-less
list. This has to be tracked externally, and the tracking will not be
atomic unless some external synchronization logic is in place.

This patch introduces a new variant of the lock-less list that is
terminated by a sentinel node instead of by NULL. The function names
start with "sllist" instead of "llist". The advantage of this scheme is
that we can atomically determine if an entry has been put into a
lock-less list by looking at the next pointer of its llist_node. Of
course, the callers must clear the next pointer when an entry is
removed from the lockless list. This is done automatically when the
sllist_for_each_safe or sllist_for_each_entry_safe iterators are used.
Non-safe versions of the sllist iterators are not provided.
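
As a quick illustration, here is a minimal usage sketch of the new API
(struct foo, foo_queue(), foo_flush() and process_foo() are made-up
names for this example only):

#include <linux/llist.h>

struct foo {
	struct llist_node lnode;	/* lnode.next == NULL => not on any sllist */
	int data;
};

static SLLIST_HEAD(foo_list);

/* Placeholder for whatever per-entry work the user wants to do. */
static void process_foo(struct foo *f)
{
}

/*
 * Queue an entry only if it is not already on the list.  The caller must
 * ensure that this check/add pair cannot race with itself for the same
 * entry (e.g. it runs in per-cpu or otherwise serialized context).
 */
static void foo_queue(struct foo *f)
{
	if (!READ_ONCE(f->lnode.next))
		sllist_add(&f->lnode, &foo_list);
}

/*
 * Drain and process everything queued so far.  The safe iterator clears
 * lnode.next of each entry as it goes, so entries can be queued again
 * afterwards.
 */
static void foo_flush(void)
{
	struct llist_node *node = sllist_del_all(&foo_list);
	struct foo *f, *next;

	if (!node)
		return;

	sllist_for_each_entry_safe(f, next, node, lnode)
		process_foo(f);
}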

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/llist.h | 132 +++++++++++++++++++++++++++++++++++++++++-
 lib/llist.c           |  79 ++++++++++++++++++++++++-
 2 files changed, 209 insertions(+), 2 deletions(-)

diff --git a/include/linux/llist.h b/include/linux/llist.h
index 85bda2d02d65..d9a921656adb 100644
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -2,7 +2,8 @@
 #ifndef LLIST_H
 #define LLIST_H
 /*
- * Lock-less NULL terminated single linked list
+ * Lock-less NULL terminated singly linked list
+ * --------------------------------------------
  *
  * Cases where locking is not needed:
  * If there are multiple producers and multiple consumers, llist_add can be
@@ -46,6 +47,16 @@
  *
  * Copyright 2010,2011 Intel Corp.
  *   Author: Huang Ying <ying.huang@intel.com>
+ *
+ * Lock-less sentinel node terminated singly linked list
+ * -----------------------------------------------------
+ *
+ * This is a variant of the generic lock-less list where the end of the list
+ * is terminated by a sentinel node instead of NULL. The advantage of this
+ * scheme is that we can use the next pointer of the llist_node to determine
+ * if it has been put into a lock-less list. However, the next pointer of
+ * the llist_node should be cleared ASAP after it has been removed from a
+ * lock-less list.
  */
 
 #include <linux/atomic.h>
@@ -64,6 +75,13 @@ struct llist_node {
 #define LLIST_HEAD_INIT(name)	{ NULL }
 #define LLIST_HEAD(name)	struct llist_head name = LLIST_HEAD_INIT(name)
 
+/*
+ * Sentinel lock-less list
+ */
+extern struct llist_node	__llist_end;
+#define SLLIST_HEAD_INIT(name)	{ &__llist_end }
+#define SLLIST_HEAD(name)	struct llist_head name = SLLIST_HEAD_INIT(name)
+
 /**
  * init_llist_head - initialize lock-less list head
  * @head:	the head for your lock-less list
@@ -73,6 +91,15 @@ static inline void init_llist_head(struct llist_head *list)
 	list->first = NULL;
 }
 
+/**
+ * init_sllist_head - initialize sentinel lock-less list head
+ * @head:	the head for your sentinel lock-less list
+ */
+static inline void init_sllist_head(struct llist_head *list)
+{
+	list->first = &__llist_end;
+}
+
 /**
  * llist_entry - get the struct of this entry
  * @ptr:	the &struct llist_node pointer.
@@ -99,6 +126,15 @@ static inline void init_llist_head(struct llist_head *list)
 #define member_address_is_nonnull(ptr, member)	\
 	((uintptr_t)(ptr) + offsetof(typeof(*(ptr)), member) != 0)
 
+/**
+ * member_address_is_notsentinel - check whether member address is not sentinel
+ * @ptr:	the object pointer (struct type * that contains the llist_node)
+ * @member:	the name of the llist_node within the struct.
+ */
+#define member_address_is_notsentinel(ptr, member)	\
+	((uintptr_t)(ptr) + offsetof(typeof(*(ptr)), member)	\
+		!= (uintptr_t)&__llist_end)
+
 /**
  * llist_for_each - iterate over some deleted entries of a lock-less list
  * @pos:	the &struct llist_node to use as a loop cursor
@@ -135,6 +171,18 @@ static inline void init_llist_head(struct llist_head *list)
 #define llist_for_each_safe(pos, n, node)			\
 	for ((pos) = (node); (pos) && ((n) = (pos)->next, true); (pos) = (n))
 
+/**
+ * sllist_for_each_safe - iterate over entries of a sentinel lock-less list
+ *			  safe against removal of list entry
+ * @pos:	the &struct llist_node to use as a loop cursor
+ * @n:		another &struct llist_node to use as temporary storage
+ * @node:	the first entry of deleted list entries
+ */
+#define sllist_for_each_safe(pos, n, node)			\
+	for ((pos) = (node); ((pos) !=  &__llist_end) &&	\
+	     ((n) = (pos)->next,				\
+	     ({ WRITE_ONCE((pos)->next, NULL); }), true); (pos) = (n))
+
 /**
  * llist_for_each_entry - iterate over some deleted entries of lock-less list of given type
  * @pos:	the type * to use as a loop cursor.
@@ -178,6 +226,21 @@ static inline void init_llist_head(struct llist_head *list)
 	        (n = llist_entry(pos->member.next, typeof(*n), member), true); \
 	     pos = n)
 
+/**
+ * sllist_for_each_entry_safe - iterate over entries of a sentinel lock-less
+ *				list of given type safe against removal of list entry
+ * @pos:	the type * to use as a loop cursor.
+ * @n:		another type * to use as temporary storage
+ * @node:	the first entry of deleted list entries.
+ * @member:	the name of the llist_node within the struct.
+ */
+#define sllist_for_each_entry_safe(pos, n, node, member)		       \
+	for (pos = llist_entry((node), typeof(*(pos)), member);		       \
+	     member_address_is_notsentinel(pos, member) &&		       \
+		(n = llist_entry((pos)->member.next, typeof(*(n)), member),    \
+		({ WRITE_ONCE((pos)->member.next, NULL); }), true);	       \
+	     pos = n)
+
 /**
  * llist_empty - tests whether a lock-less list is empty
  * @head:	the list to test
@@ -191,15 +254,35 @@ static inline bool llist_empty(const struct llist_head *head)
 	return READ_ONCE(head->first) == NULL;
 }
 
+/**
+ * sllist_empty - tests whether a lock-less list is empty
+ * @head:	the list to test
+ */
+static inline bool sllist_empty(const struct llist_head *head)
+{
+	return READ_ONCE(head->first) == &__llist_end;
+}
+
 static inline struct llist_node *llist_next(struct llist_node *node)
 {
 	return node->next;
 }
 
+static inline struct llist_node *sllist_next(struct llist_node *node)
+{
+	struct llist_node *next = node->next;
+
+	return (next == &__llist_end) ? NULL : next;
+}
+
 extern bool llist_add_batch(struct llist_node *new_first,
 			    struct llist_node *new_last,
 			    struct llist_head *head);
 
+extern bool sllist_add_batch(struct llist_node *new_first,
+			     struct llist_node *new_last,
+			     struct llist_head *head);
+
 static inline bool __llist_add_batch(struct llist_node *new_first,
 				     struct llist_node *new_last,
 				     struct llist_head *head)
@@ -209,6 +292,15 @@ static inline bool __llist_add_batch(struct llist_node *new_first,
 	return new_last->next == NULL;
 }
 
+static inline bool __sllist_add_batch(struct llist_node *new_first,
+				      struct llist_node *new_last,
+				      struct llist_head *head)
+{
+	new_last->next = head->first;
+	head->first = new_first;
+	return new_last->next == &__llist_end;
+}
+
 /**
  * llist_add - add a new entry
  * @new:	new entry to be added
@@ -221,11 +313,28 @@ static inline bool llist_add(struct llist_node *new, struct llist_head *head)
 	return llist_add_batch(new, new, head);
 }
 
+/**
+ * sllist_add - add a new entry
+ * @new:	new entry to be added
+ * @head:	the head for your lock-less list
+ *
+ * Returns true if the list was empty prior to adding this entry.
+ */
+static inline bool sllist_add(struct llist_node *new, struct llist_head *head)
+{
+	return sllist_add_batch(new, new, head);
+}
+
 static inline bool __llist_add(struct llist_node *new, struct llist_head *head)
 {
 	return __llist_add_batch(new, new, head);
 }
 
+static inline bool __sllist_add(struct llist_node *new, struct llist_head *head)
+{
+	return __sllist_add_batch(new, new, head);
+}
+
 /**
  * llist_del_all - delete all entries from lock-less list
  * @head:	the head of lock-less list to delete all entries
@@ -239,6 +348,17 @@ static inline struct llist_node *llist_del_all(struct llist_head *head)
 	return xchg(&head->first, NULL);
 }
 
+/**
+ * sllist_del_all - delete all entries from sentinel lock-less list
+ * @head:	the head of lock-less list to delete all entries
+ */
+static inline struct llist_node *sllist_del_all(struct llist_head *head)
+{
+	struct llist_node *first = xchg(&head->first, &__llist_end);
+
+	return (first == &__llist_end) ? NULL : first;
+}
+
 static inline struct llist_node *__llist_del_all(struct llist_head *head)
 {
 	struct llist_node *first = head->first;
@@ -247,8 +367,18 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
 	return first;
 }
 
+static inline struct llist_node *__sllist_del_all(struct llist_head *head)
+{
+	struct llist_node *first = head->first;
+
+	head->first = &__llist_end;
+	return (first == &__llist_end) ? NULL : first;
+}
+
 extern struct llist_node *llist_del_first(struct llist_head *head);
+extern struct llist_node *sllist_del_first(struct llist_head *head);
 
 struct llist_node *llist_reverse_order(struct llist_node *head);
+struct llist_node *sllist_reverse_order(struct llist_node *head);
 
 #endif /* LLIST_H */
diff --git a/lib/llist.c b/lib/llist.c
index 611ce4881a87..418327394735 100644
--- a/lib/llist.c
+++ b/lib/llist.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * Lock-less NULL terminated single linked list
+ * Lock-less NULL and sentinel node terminated singly linked lists
  *
  * The basic atomic operation of this list is cmpxchg on long.  On
  * architectures that don't have NMI-safe cmpxchg implementation, the
@@ -12,8 +12,11 @@
  */
 #include <linux/kernel.h>
 #include <linux/export.h>
+#include <linux/cache.h>
 #include <linux/llist.h>
 
+struct llist_node __llist_end __ro_after_init;
+EXPORT_SYMBOL_GPL(__llist_end);
 
 /**
  * llist_add_batch - add several linked entries in batch
@@ -36,6 +39,27 @@ bool llist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
 }
 EXPORT_SYMBOL_GPL(llist_add_batch);
 
+/**
+ * sllist_add_batch - add several linked entries in batch
+ * @new_first:	first entry in batch to be added
+ * @new_last:	last entry in batch to be added
+ * @head:	the head for your lock-less list
+ *
+ * Return whether list is empty before adding.
+ */
+bool sllist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
+		      struct llist_head *head)
+{
+	struct llist_node *first;
+
+	do {
+		new_last->next = first = READ_ONCE(head->first);
+	} while (cmpxchg(&head->first, first, new_first) != first);
+
+	return first == &__llist_end;
+}
+EXPORT_SYMBOL_GPL(sllist_add_batch);
+
 /**
  * llist_del_first - delete the first entry of lock-less list
  * @head:	the head for your lock-less list
@@ -69,6 +93,33 @@ struct llist_node *llist_del_first(struct llist_head *head)
 }
 EXPORT_SYMBOL_GPL(llist_del_first);
 
+/**
+ * sllist_del_first - delete the first entry of sentinel lock-less list
+ * @head:	the head for your lock-less list
+ *
+ * If list is empty, return NULL, otherwise, return the first entry
+ * deleted, this is the newest added one.
+ */
+struct llist_node *sllist_del_first(struct llist_head *head)
+{
+	struct llist_node *entry, *old_entry, *next;
+
+	entry = smp_load_acquire(&head->first);
+	for (;;) {
+		if (entry == &__llist_end)
+			return NULL;
+		old_entry = entry;
+		next = READ_ONCE(entry->next);
+		entry = cmpxchg(&head->first, old_entry, next);
+		if (entry == old_entry)
+			break;
+	}
+	WRITE_ONCE(entry->next, NULL);
+
+	return entry;
+}
+EXPORT_SYMBOL_GPL(sllist_del_first);
+
 /**
  * llist_reverse_order - reverse order of a llist chain
  * @head:	first item of the list to be reversed
@@ -90,3 +141,29 @@ struct llist_node *llist_reverse_order(struct llist_node *head)
 	return new_head;
 }
 EXPORT_SYMBOL_GPL(llist_reverse_order);
+
+/**
+ * sllist_reverse_order - reverse order of a llist chain
+ * @head:	first item of the list to be reversed
+ *
+ * Reverse the order of a chain of llist entries and return the
+ * new first entry.
+ */
+struct llist_node *sllist_reverse_order(struct llist_node *head)
+{
+	struct llist_node *new_head = &__llist_end;
+
+	if (!head || (head == &__llist_end))
+		return NULL;
+
+	while (head != &__llist_end) {
+		struct llist_node *tmp = head;
+
+		head = head->next;
+		tmp->next = new_head;
+		new_head = tmp;
+	}
+
+	return new_head;
+}
+EXPORT_SYMBOL_GPL(sllist_reverse_order);
-- 
2.31.1



* [PATCH v7 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
  2022-10-03 15:44 [PATCH v7 0/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
  2022-10-03 15:44 ` [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node Waiman Long
@ 2022-10-03 15:44 ` Waiman Long
  2022-10-03 16:40   ` Tejun Heo
  2022-10-03 15:44 ` [PATCH v7 3/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
  2 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2022-10-03 15:44 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Andy Shevchenko,
	Andrew Morton, Huang Ying, Michal Koutný,
	Waiman Long

For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
Simplify error handling code by returning this error directly instead
of setting an intermediate "ret" variable.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-cgroup.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 869af9d72bcf..946592249795 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1177,7 +1177,6 @@ static struct cgroup_subsys_state *
 blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct blkcg *blkcg;
-	struct cgroup_subsys_state *ret;
 	int i;
 
 	mutex_lock(&blkcg_pol_mutex);
@@ -1186,10 +1185,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		blkcg = &blkcg_root;
 	} else {
 		blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
-		if (!blkcg) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!blkcg)
 			goto unlock;
-		}
 	}
 
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
@@ -1206,10 +1203,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			continue;
 
 		cpd = pol->cpd_alloc_fn(GFP_KERNEL);
-		if (!cpd) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!cpd)
 			goto free_pd_blkcg;
-		}
+
 		blkcg->cpd[i] = cpd;
 		cpd->blkcg = blkcg;
 		cpd->plid = i;
@@ -1238,7 +1234,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		kfree(blkcg);
 unlock:
 	mutex_unlock(&blkcg_pol_mutex);
-	return ret;
+	return ERR_PTR(-ENOMEM);
 }
 
 static int blkcg_css_online(struct cgroup_subsys_state *css)
-- 
2.31.1



* [PATCH v7 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-10-03 15:44 [PATCH v7 0/3] blk-cgroup: Optimize blkcg_rstat_flush() Waiman Long
  2022-10-03 15:44 ` [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node Waiman Long
  2022-10-03 15:44 ` [PATCH v7 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path Waiman Long
@ 2022-10-03 15:44 ` Waiman Long
  2 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2022-10-03 15:44 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Andy Shevchenko,
	Andrew Morton, Huang Ying, Michal Koutný,
	Waiman Long

For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
can be especially problematic as interrupts are disabled during the
flush. It was reported that it might take seconds to complete in some
extreme cases, leading to hard lockup messages.

As it is likely that not all the percpu blkg_iostat_set's have been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case. This patch optimizes blkcg_rstat_flush()
by keeping a lockless list of recently updated blkg_iostat_set's in a
newly added percpu blkcg->lhead pointer.

The blkg_iostat_set is added to a sentinel lockless list on the update
side in blk_cgroup_bio_start(). It is removed from the sentinel lockless
list when flushed in blkcg_rstat_flush(). Due to racing, it is possible
that blkg_iostat_set's in the lockless list may have no new IO stats to
be flushed, but that is OK.

To protect against destruction of the blkg, a percpu reference to it is
taken when the blkg_iostat_set is put onto the lockless list and
released when it is removed.

A blkg_iostat_set can determine if it is on a lockless list by checking
its lnode.next pointer, which will be non-NULL while it is on a sentinel
lockless list.
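
In other words, the update side only needs to do something like the
following (simplified from the blk_cgroup_bio_start() hunk below):

	if (!READ_ONCE(bis->lnode.next)) {	/* not on a lockless list yet */
		sllist_add(&bis->lnode, this_cpu_ptr(blkcg->lhead));
		percpu_ref_get(&bis->blkg->refcnt);	/* pin the blkg until flushed */
	}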

When booting up an instrumented test kernel with this patch on a
2-socket 96-thread system with cgroup v2, 1788 of the 2051 calls to
cgroup_rstat_flush() made after bootup exited immediately because of an
empty lockless list. After an all-CPU kernel build, the ratio became
6295424/6340513, i.e. more than 99%.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 73 ++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h |  9 ++++++
 2 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 946592249795..f1f580c46190 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -59,6 +59,37 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
 
 #define BLKG_DESTROY_BATCH_SIZE  64
 
+/*
+ * Lockless lists for tracking IO stats update
+ *
+ * New IO stats are stored in the percpu iostat_cpu within blkcg_gq (blkg).
+ * There are multiple blkg's (one for each block device) attached to each
+ * blkcg. The rstat code keeps track of which cpu has IO stats updated,
+ * but it doesn't know which blkg has the updated stats. If there are many
+ * block devices in a system, the cost of iterating all the blkg's to flush
+ * out the IO stats can be high. To reduce such overhead, a set of percpu
+ * lockless lists (lhead) per blkcg are used to track the set of recently
+ * updated iostat_cpu's since the last flush. An iostat_cpu will be put
+ * onto the lockless list on the update side [blk_cgroup_bio_start()] if
+ * not there yet and then removed when being flushed [blkcg_rstat_flush()].
+ * References to blkg are gotten and then put back in the process to
+ * protect against blkg removal.
+ *
+ * Return: 0 if successful or -ENOMEM if allocation fails.
+ */
+static int init_blkcg_sllists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu)
+		init_sllist_head(per_cpu_ptr(blkcg->lhead, cpu));
+	return 0;
+}
+
 /**
  * blkcg_css - find the current css
  *
@@ -236,8 +267,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;
 
 	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
@@ -864,7 +897,9 @@ static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode;
+	struct blkg_iostat_set *bisc, *next_bisc;
 
 	/* Root-level stats are sourced from system-wide IO stats */
 	if (!cgroup_parent(css->cgroup))
@@ -872,9 +907,16 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 
 	rcu_read_lock();
 
-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = sllist_del_all(lhead);
+	if (!lnode)
+		goto out;
+
+	/*
+	 * Iterate only the iostat_cpu's queued in the lockless list.
+	 */
+	sllist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
+		struct blkcg_gq *blkg = bisc->blkg;
 		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
 		struct blkg_iostat cur;
 		unsigned int seq;
 
@@ -890,8 +932,10 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		if (parent && parent->parent)
 			blkcg_iostat_update(parent, &blkg->iostat.cur,
 					    &blkg->iostat.last);
+		percpu_ref_put(&blkg->refcnt);
 	}
 
+out:
 	rcu_read_unlock();
 }
 
@@ -1170,6 +1214,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
 
 	mutex_unlock(&blkcg_pol_mutex);
 
+	free_percpu(blkcg->lhead);
 	kfree(blkcg);
 }
 
@@ -1189,6 +1234,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			goto unlock;
 	}
 
+	if (init_blkcg_sllists(blkcg))
+		goto free_blkcg;
+
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkcg_policy_data *cpd;
@@ -1229,7 +1277,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	for (i--; i >= 0; i--)
 		if (blkcg->cpd[i])
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
 	if (blkcg != &blkcg_root)
 		kfree(blkcg);
 unlock:
@@ -1990,6 +2039,7 @@ static int blk_cgroup_io_type(struct bio *bio)
 
 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
 	int rwd = blk_cgroup_io_type(bio), cpu;
 	struct blkg_iostat_set *bis;
 	unsigned long flags;
@@ -2008,9 +2058,20 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 	bis->cur.ios[rwd]++;
 
+	/*
+	 * If the iostat_cpu isn't in a lockless list, put it into the
+	 * list to indicate that a stat update is pending.
+	 */
+	if (!READ_ONCE(bis->lnode.next)) {
+		struct llist_head *lhead = this_cpu_ptr(blkcg->lhead);
+
+		sllist_add(&bis->lnode, lhead);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
 	put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d2724d1dd7c9..0968b6c8ea12 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include <linux/cgroup.h>
 #include <linux/kthread.h>
 #include <linux/blk-mq.h>
+#include <linux/llist.h>
 
 struct blkcg_gq;
 struct blkg_policy_data;
@@ -43,6 +44,8 @@ struct blkg_iostat {
 
 struct blkg_iostat_set {
 	struct u64_stats_sync		sync;
+	struct llist_node		lnode;
+	struct blkcg_gq		       *blkg;
 	struct blkg_iostat		cur;
 	struct blkg_iostat		last;
 };
@@ -97,6 +100,12 @@ struct blkcg {
 	struct blkcg_policy_data	*cpd[BLKCG_MAX_POLS];
 
 	struct list_head		all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu	*lhead;
+
 #ifdef CONFIG_BLK_CGROUP_FC_APPID
 	char                            fc_app_id[FC_APPID_LEN];
 #endif
-- 
2.31.1



* Re: [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node
  2022-10-03 15:44 ` [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node Waiman Long
@ 2022-10-03 16:40   ` Tejun Heo
  2022-10-03 16:55     ` Waiman Long
  2022-10-08  2:15   ` Huang, Ying
  1 sibling, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2022-10-03 16:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei,
	Andy Shevchenko, Andrew Morton, Huang Ying, Michal Koutný

Hello, Waiman.

On Mon, Oct 03, 2022 at 11:44:57AM -0400, Waiman Long wrote:
> The lock-less list API is useful for dealing with list in a lock-less
> manner. However, one of the drawback of the current API is that there
> is not an easy way to determine if an entry has already been put into a
> lock-less list. This has to be tracked externally and the tracking will
> not be atomic unless some external synchronization logic is in place.
> 
> This patch introduces a new variant of the lock-less list terminated
> by a sentinel node instead of by NULL. The function names start with
> "sllist" instead of "llist". The advantage of this scheme is that we
> can atomically determine if an entry has been put into a lock-less
> list by looking at the next pointer of the llist_node. Of course, the
> callers must clear the next pointer when an entry is removed from the
> lockless list. This is done automatically when the sllist_for_each_safe
> or sllist_for_each_entry_safe iteraters are used. The non-safe versions
> of sllist iterator are not provided.

Any chance we can add sentinel to the existing llist instead of creating a
new variant? There's no real downside to always using sentinel, right?

Thanks.

-- 
tejun


* Re: [PATCH v7 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
  2022-10-03 15:44 ` [PATCH v7 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path Waiman Long
@ 2022-10-03 16:40   ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2022-10-03 16:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei,
	Andy Shevchenko, Andrew Morton, Huang Ying, Michal Koutný

On Mon, Oct 03, 2022 at 11:44:58AM -0400, Waiman Long wrote:
> For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
> Simplify error handling code by returning this error directly instead
> of setting an intermediate "ret" variable.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


* Re: [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node
  2022-10-03 16:40   ` Tejun Heo
@ 2022-10-03 16:55     ` Waiman Long
  2022-10-03 16:57       ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2022-10-03 16:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei,
	Andy Shevchenko, Andrew Morton, Michal Koutný


On 10/3/22 12:40, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, Oct 03, 2022 at 11:44:57AM -0400, Waiman Long wrote:
>> The lock-less list API is useful for dealing with list in a lock-less
>> manner. However, one of the drawback of the current API is that there
>> is not an easy way to determine if an entry has already been put into a
>> lock-less list. This has to be tracked externally and the tracking will
>> not be atomic unless some external synchronization logic is in place.
>>
>> This patch introduces a new variant of the lock-less list terminated
>> by a sentinel node instead of by NULL. The function names start with
>> "sllist" instead of "llist". The advantage of this scheme is that we
>> can atomically determine if an entry has been put into a lock-less
>> list by looking at the next pointer of the llist_node. Of course, the
>> callers must clear the next pointer when an entry is removed from the
>> lockless list. This is done automatically when the sllist_for_each_safe
>> or sllist_for_each_entry_safe iteraters are used. The non-safe versions
>> of sllist iterator are not provided.
> Any chance we can add sentinel to the existing llist instead of creating a
> new variant? There's no real downside to always using sentinel, right?

That was my original plan. However, after looking at some existing users
of the lockless list, I found that they have coded in a dependency on the
fact that a lockless list is empty if its head is NULL. I guess I can
make this true also for the new sentinel-terminated lockless list at the
expense of a bit more overhead in the entry insertion and deletion paths.
I will take a further look at that.

Cheers,
Longman



* Re: [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node
  2022-10-03 16:55     ` Waiman Long
@ 2022-10-03 16:57       ` Tejun Heo
  2022-10-03 17:32         ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2022-10-03 16:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei,
	Andy Shevchenko, Andrew Morton, Michal Koutný

On Mon, Oct 03, 2022 at 12:55:24PM -0400, Waiman Long wrote:
> That was my original plan. However, after looking at some existing users of
> lockless list, they have coded in the dependency on the fact that a lockless
> list is empty if it is NULL. I guess I can make this true also for the new
> lockless list with sentinel at the expense of a bit more overhead in the
> entry insertion path and deletion path. I will take a further look at that.

There aren't that many users of llist. Maybe it'd be easier / cleaner to
introduce a macro to test whether a llist is empty and replace the NULL
tests?
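
Something like the below, say (name purely illustrative), so that
flipping the terminator later only needs to touch one place:

	/* end-of-list test to replace the open-coded NULL checks in llist users */
	#define llist_is_end(node)	((node) == NULL)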

Thanks.

-- 
tejun


* Re: [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node
  2022-10-03 16:57       ` Tejun Heo
@ 2022-10-03 17:32         ` Waiman Long
  2022-10-03 17:36           ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2022-10-03 17:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei,
	Andy Shevchenko, Andrew Morton, Michal Koutný

On 10/3/22 12:57, Tejun Heo wrote:
> On Mon, Oct 03, 2022 at 12:55:24PM -0400, Waiman Long wrote:
>> That was my original plan. However, after looking at some existing users of
>> lockless list, they have coded in the dependency on the fact that a lockless
>> list is empty if it is NULL. I guess I can make this true also for the new
>> lockless list with sentinel at the expense of a bit more overhead in the
>> entry insertion path and deletion path. I will take a further look at that.
> There aren't that many users of llist. Maybe it'd be easier / cleaner to
> introduce a macro to test whether a llist is empty and replace the NULL
> tests?

My current thinking is to make llist work with both NULL-terminated and
sentinel-terminated lockless lists. Users who wish to use the
sentinel-terminated version will have to use the special sentinel
versions of the LLIST_HEAD() macro and the llist_del_all() and
__llist_del_all() functions. In this way, I don't need to touch the
existing users of llist while minimizing code redundancy. What do you
think?
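
Roughly, the shared code would then treat either terminator as the end
of a list, e.g. something like this untested sketch:

	static inline bool llist_node_is_end(const struct llist_node *node)
	{
		return node == NULL || node == &__llist_end;
	}

while only the head initializer and the *_del_all() variants would
differ between the two flavors.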

Cheers,
Longman



* Re: [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node
  2022-10-03 17:32         ` Waiman Long
@ 2022-10-03 17:36           ` Tejun Heo
  2022-10-03 17:40             ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2022-10-03 17:36 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei,
	Andy Shevchenko, Andrew Morton, Michal Koutný

Hello,

On Mon, Oct 03, 2022 at 01:32:49PM -0400, Waiman Long wrote:
> What my current thinking is to make llist works with both NULL and sentinel
> terminated lockless list. Users who wish to use the sentinel terminated
> version will have to use special sentinel version of LLIST_HEAD() macro and
> llist_del_all() and __llist_del_all() functions. In this way, I don't need
> to touch an existing users of llist while minimizing code redundancy. What
> do you think?

Wouldn't that be more error-prone in the long term? I'd just bite the bullet
and convert the empty tests. It is a hassle to find them but given that it's
just the head node testing, it hopefully wouldn't be too bad.

Thanks.

-- 
tejun


* Re: [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node
  2022-10-03 17:36           ` Tejun Heo
@ 2022-10-03 17:40             ` Waiman Long
  2022-10-03 19:39               ` Waiman Long
  0 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2022-10-03 17:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei,
	Andy Shevchenko, Andrew Morton, Michal Koutný


On 10/3/22 13:36, Tejun Heo wrote:
> Hello,
>
> On Mon, Oct 03, 2022 at 01:32:49PM -0400, Waiman Long wrote:
>> What my current thinking is to make llist works with both NULL and sentinel
>> terminated lockless list. Users who wish to use the sentinel terminated
>> version will have to use special sentinel version of LLIST_HEAD() macro and
>> llist_del_all() and __llist_del_all() functions. In this way, I don't need
>> to touch an existing users of llist while minimizing code redundancy. What
>> do you think?
> Wouldn't that be more error-prone in the long term? I'd just bite the bullet
> and convert the empty tests. It is a hassle to find them but given that it's
> just the head node testing, it hopefully wouldn't be too bad.

OK, I will take a further look at what changes will be needed by the 
existing llist users.

Thanks,
Longman



* Re: [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node
  2022-10-03 17:40             ` Waiman Long
@ 2022-10-03 19:39               ` Waiman Long
  2022-10-03 20:15                 ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Waiman Long @ 2022-10-03 19:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei,
	Andy Shevchenko, Andrew Morton, Michal Koutný


On 10/3/22 13:40, Waiman Long wrote:
>
> On 10/3/22 13:36, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Oct 03, 2022 at 01:32:49PM -0400, Waiman Long wrote:
>>> What my current thinking is to make llist works with both NULL and 
>>> sentinel
>>> terminated lockless list. Users who wish to use the sentinel terminated
>>> version will have to use special sentinel version of LLIST_HEAD() 
>>> macro and
>>> llist_del_all() and __llist_del_all() functions. In this way, I 
>>> don't need
>>> to touch an existing users of llist while minimizing code 
>>> redundancy. What
>>> do you think?
>> Wouldn't that be more error-prone in the long term? I'd just bite the 
>> bullet
>> and convert the empty tests. It is a hassle to find them but given 
>> that it's
>> just the head node testing, it hopefully wouldn't be too bad.
>
> OK, I will take a further look at what changes will be needed by the 
> existing llist users.

After a further look, I think the task of making the sentinel llist the
default will be more time-consuming than I initially thought. For example:

1) arch/powerpc/include/asm/kvm_book3s_64.h:
    It has its own llist iterator for_each_nest_rmap_safe.

2) kprobes use llist but not the full set of APIs, and the
    various arch code puts NULL in their llist_node to communicate
    with kprobes.

3) drivers/vhost/scsi.c uses llist but it doesn't use LLIST_HEAD
    nor init_llist_head to initialize the llist_head. I suspect that
    it may rely on NULL being the initial value.

There are 123 instances where llist_head is referenced in arch, driver,
filesystem and kernel code. Going through all of these to make sure that
everything still works will be a major effort. I think it will be safer
to allow both NULL and the sentinel node as initializers and gradually
convert the users to the proper llist APIs over time to complete the
conversion. I am sorry that I can't spend that much time upfront on
this conversion effort.

Regards,
Longman



* Re: [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node
  2022-10-03 19:39               ` Waiman Long
@ 2022-10-03 20:15                 ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2022-10-03 20:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei,
	Andy Shevchenko, Andrew Morton, Michal Koutný

Hello,

On Mon, Oct 03, 2022 at 03:39:02PM -0400, Waiman Long wrote:
> There are 123 instances where llist_head is referenced in arch, driver,
> filesystem and kernel code. Going through all these to make sure that it
> will all work will be a major effort. I think it will be safer to allow both
> NULL and the sentinel node as the initializers and gradually convert them to
> use the proper llist APIs over time to complete the conversion. I am sorry
> that I can't spend that much time upfront for this conversion effort.

I see. Oh well, thanks for taking a look.

-- 
tejun


* Re: [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node
  2022-10-03 15:44 ` [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node Waiman Long
  2022-10-03 16:40   ` Tejun Heo
@ 2022-10-08  2:15   ` Huang, Ying
  1 sibling, 0 replies; 14+ messages in thread
From: Huang, Ying @ 2022-10-08  2:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Tejun Heo, Jens Axboe, cgroups, linux-block, linux-kernel,
	Ming Lei, Andy Shevchenko, Andrew Morton, Michal Koutný

Waiman Long <longman@redhat.com> writes:

> The lock-less list API is useful for dealing with list in a lock-less
> manner. However, one of the drawback of the current API is that there
> is not an easy way to determine if an entry has already been put into a
> lock-less list. This has to be tracked externally and the tracking will
> not be atomic unless some external synchronization logic is in place.
>
> This patch introduces a new variant of the lock-less list terminated
> by a sentinel node instead of by NULL. The function names start with
> "sllist" instead of "llist". The advantage of this scheme is that we
> can atomically determine if an entry has been put into a lock-less
> list by looking at the next pointer of the llist_node.

I guess that in the previous solution we use
test_and_set_bit()/clear_bit() on another member of the type containing
llist_node to track whether the node is in the llist?  After your patch,
we can use cmpxchg()/WRITE_ONCE(, NULL) on llist_node->next for that?

> Of course, the
> callers must clear the next pointer when an entry is removed from the
> lockless list. This is done automatically when the sllist_for_each_safe
> or sllist_for_each_entry_safe

Per my understanding, the other xxx_for_each_safe() variants do not
modify the list themselves. So how about renaming the functions?

> iteraters are used. The non-safe versions
> of sllist iterator are not provided.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/llist.h | 132 +++++++++++++++++++++++++++++++++++++++++-
>  lib/llist.c           |  79 ++++++++++++++++++++++++-
>  2 files changed, 209 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/llist.h b/include/linux/llist.h
> index 85bda2d02d65..d9a921656adb 100644
> --- a/include/linux/llist.h
> +++ b/include/linux/llist.h
> @@ -2,7 +2,8 @@
>  #ifndef LLIST_H
>  #define LLIST_H
>  /*
> - * Lock-less NULL terminated single linked list
> + * Lock-less NULL terminated singly linked list
> + * --------------------------------------------
>   *
>   * Cases where locking is not needed:
>   * If there are multiple producers and multiple consumers, llist_add can be
> @@ -46,6 +47,16 @@
>   *
>   * Copyright 2010,2011 Intel Corp.
>   *   Author: Huang Ying <ying.huang@intel.com>
> + *
> + * Lock-less sentinel node terminated singly linked list
> + * -----------------------------------------------------
> + *
> + * This is a variant of the generic lock-less list where the end of the list
> + * is terminated by a sentinel node instead of NULL. The advantage of this
> + * scheme is that we can use the next pointer of the llist_node to determine
> + * if it has been put into a lock-less list. However, the next pointer of
> + * the llist_node should be cleared ASAP after it has been removed from a
> + * lock-less list.
>   */
>  
>  #include <linux/atomic.h>
> @@ -64,6 +75,13 @@ struct llist_node {
>  #define LLIST_HEAD_INIT(name)	{ NULL }
>  #define LLIST_HEAD(name)	struct llist_head name = LLIST_HEAD_INIT(name)
>  
> +/*
> + * Sentinel lock-less list
> + */
> +extern struct llist_node	__llist_end;
> +#define SLLIST_HEAD_INIT(name)	{ &__llist_end }
> +#define SLLIST_HEAD(name)	struct llist_head name = SLLIST_HEAD_INIT(name)
> +
>  /**
>   * init_llist_head - initialize lock-less list head
>   * @head:	the head for your lock-less list
> @@ -73,6 +91,15 @@ static inline void init_llist_head(struct llist_head *list)
>  	list->first = NULL;
>  }
>  
> +/**
> + * init_sllist_head - initialize sentinel lock-less list head
> + * @head:	the head for your sentinel lock-less list
> + */
> +static inline void init_sllist_head(struct llist_head *list)
> +{
> +	list->first = &__llist_end;
> +}
> +
>  /**
>   * llist_entry - get the struct of this entry
>   * @ptr:	the &struct llist_node pointer.
> @@ -99,6 +126,15 @@ static inline void init_llist_head(struct llist_head *list)
>  #define member_address_is_nonnull(ptr, member)	\
>  	((uintptr_t)(ptr) + offsetof(typeof(*(ptr)), member) != 0)
>  
> +/**
> + * member_address_is_notsentinel - check whether member address is not sentinel
> + * @ptr:	the object pointer (struct type * that contains the llist_node)
> + * @member:	the name of the llist_node within the struct.
> + */
> +#define member_address_is_notsentinel(ptr, member)	\
> +	((uintptr_t)(ptr) + offsetof(typeof(*(ptr)), member)	\
> +		!= (uintptr_t)&__llist_end)
> +
>  /**
>   * llist_for_each - iterate over some deleted entries of a lock-less list
>   * @pos:	the &struct llist_node to use as a loop cursor
> @@ -135,6 +171,18 @@ static inline void init_llist_head(struct llist_head *list)
>  #define llist_for_each_safe(pos, n, node)			\
>  	for ((pos) = (node); (pos) && ((n) = (pos)->next, true); (pos) = (n))
>  
> +/**
> + * sllist_for_each_safe - iterate over entries of a sentinel lock-less list
> + *			  safe against removal of list entry
> + * @pos:	the &struct llist_node to use as a loop cursor
> + * @n:		another &struct llist_node to use as temporary storage
> + * @node:	the first entry of deleted list entries
> + */
> +#define sllist_for_each_safe(pos, n, node)			\
> +	for ((pos) = (node); ((pos) !=  &__llist_end) &&	\
> +	     ((n) = (pos)->next,				\
> +	     ({ WRITE_ONCE((pos)->next, NULL); }), true); (pos) = (n))
> +
>  /**
>   * llist_for_each_entry - iterate over some deleted entries of lock-less list of given type
>   * @pos:	the type * to use as a loop cursor.
> @@ -178,6 +226,21 @@ static inline void init_llist_head(struct llist_head *list)
>  	        (n = llist_entry(pos->member.next, typeof(*n), member), true); \
>  	     pos = n)
>  
> +/**
> + * sllist_for_each_entry_safe - iterate over sentinel entries of lock-less list
> + *				of given type safe against removal of list entry
> + * @pos:	the type * to use as a loop cursor.
> + * @n:		another type * to use as temporary storage
> + * @node:	the first entry of deleted list entries.
> + * @member:	the name of the llist_node with the struct.
> + */
> +#define sllist_for_each_entry_safe(pos, n, node, member)		       \
> +	for (pos = llist_entry((node), typeof(*(pos)), member);		       \
> +	     member_address_is_notsentinel(pos, member) &&		       \
> +		(n = llist_entry((pos)->member.next, typeof(*(n)), member),    \
> +		({ WRITE_ONCE((pos)->member.next, NULL); }), true);	       \
> +	     pos = n)
> +
>  /**
>   * llist_empty - tests whether a lock-less list is empty
>   * @head:	the list to test
> @@ -191,15 +254,35 @@ static inline bool llist_empty(const struct llist_head *head)
>  	return READ_ONCE(head->first) == NULL;
>  }
>  
> +/**
> + * sllist_empty - tests whether a lock-less list is empty
> + * @head:	the list to test
> + */
> +static inline bool sllist_empty(const struct llist_head *head)
> +{
> +	return READ_ONCE(head->first) == &__llist_end;
> +}
> +
>  static inline struct llist_node *llist_next(struct llist_node *node)
>  {
>  	return node->next;
>  }
>  
> +static inline struct llist_node *sllist_next(struct llist_node *node)
> +{
> +	struct llist_node *next = node->next;
> +
> +	return (next == &__llist_end) ? NULL : next;
> +}
> +
>  extern bool llist_add_batch(struct llist_node *new_first,
>  			    struct llist_node *new_last,
>  			    struct llist_head *head);
>  
> +extern bool sllist_add_batch(struct llist_node *new_first,
> +			     struct llist_node *new_last,
> +			     struct llist_head *head);
> +
>  static inline bool __llist_add_batch(struct llist_node *new_first,
>  				     struct llist_node *new_last,
>  				     struct llist_head *head)
> @@ -209,6 +292,15 @@ static inline bool __llist_add_batch(struct llist_node *new_first,
>  	return new_last->next == NULL;
>  }
>  
> +static inline bool __sllist_add_batch(struct llist_node *new_first,
> +				      struct llist_node *new_last,
> +				      struct llist_head *head)
> +{
> +	new_last->next = head->first;
> +	head->first = new_first;
> +	return new_last->next == &__llist_end;
> +}
> +
>  /**
>   * llist_add - add a new entry
>   * @new:	new entry to be added
> @@ -221,11 +313,28 @@ static inline bool llist_add(struct llist_node *new, struct llist_head *head)
>  	return llist_add_batch(new, new, head);
>  }
>  
> +/**
> + * sllist_add - add a new entry
> + * @new:	new entry to be added
> + * @head:	the head for your lock-less list
> + *
> + * Returns true if the list was empty prior to adding this entry.
> + */
> +static inline bool sllist_add(struct llist_node *new, struct llist_head *head)
> +{
> +	return sllist_add_batch(new, new, head);
> +}
> +
>  static inline bool __llist_add(struct llist_node *new, struct llist_head *head)
>  {
>  	return __llist_add_batch(new, new, head);
>  }
>  
> +static inline bool __sllist_add(struct llist_node *new, struct llist_head *head)
> +{
> +	return __sllist_add_batch(new, new, head);
> +}
> +
>  /**
>   * llist_del_all - delete all entries from lock-less list
>   * @head:	the head of lock-less list to delete all entries
> @@ -239,6 +348,17 @@ static inline struct llist_node *llist_del_all(struct llist_head *head)
>  	return xchg(&head->first, NULL);
>  }
>  
> +/**
> + * sllist_del_all - delete all entries from sentinel lock-less list
> + * @head:	the head of lock-less list to delete all entries
> + */
> +static inline struct llist_node *sllist_del_all(struct llist_head *head)
> +{
> +	struct llist_node *first = xchg(&head->first, &__llist_end);
> +
> +	return (first == &__llist_end) ? NULL : first;
> +}
> +
>  static inline struct llist_node *__llist_del_all(struct llist_head *head)
>  {
>  	struct llist_node *first = head->first;
> @@ -247,8 +367,18 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
>  	return first;
>  }
>  
> +static inline struct llist_node *__sllist_del_all(struct llist_head *head)
> +{
> +	struct llist_node *first = head->first;
> +
> +	head->first = &__llist_end;
> +	return (first == &__llist_end) ? NULL : first;
> +}
> +
>  extern struct llist_node *llist_del_first(struct llist_head *head);
> +extern struct llist_node *sllist_del_first(struct llist_head *head);
>  
>  struct llist_node *llist_reverse_order(struct llist_node *head);
> +struct llist_node *sllist_reverse_order(struct llist_node *head);
>  
>  #endif /* LLIST_H */
> diff --git a/lib/llist.c b/lib/llist.c
> index 611ce4881a87..418327394735 100644
> --- a/lib/llist.c
> +++ b/lib/llist.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /*
> - * Lock-less NULL terminated single linked list
> + * Lock-less NULL and sentinel node terminated singly linked lists
>   *
>   * The basic atomic operation of this list is cmpxchg on long.  On
>   * architectures that don't have NMI-safe cmpxchg implementation, the
> @@ -12,8 +12,11 @@
>   */
>  #include <linux/kernel.h>
>  #include <linux/export.h>
> +#include <linux/cache.h>
>  #include <linux/llist.h>
>  
> +struct llist_node __llist_end __ro_after_init;
> +EXPORT_SYMBOL_GPL(__llist_end);
>  
>  /**
>   * llist_add_batch - add several linked entries in batch
> @@ -36,6 +39,27 @@ bool llist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
>  }
>  EXPORT_SYMBOL_GPL(llist_add_batch);
>  
> +/**
> + * sllist_add_batch - add several linked entries in batch
> + * @new_first:	first entry in batch to be added
> + * @new_last:	last entry in batch to be added
> + * @head:	the head for your lock-less list
> + *
> + * Return whether list is empty before adding.
> + */
> +bool sllist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
> +		      struct llist_head *head)
> +{
> +	struct llist_node *first;
> +
> +	do {
> +		new_last->next = first = READ_ONCE(head->first);

Here new_last->next may be changed from NULL to non-NULL?  Do we need
to do this atomically, perhaps even with cmpxchg(), to guarantee it is
NULL before the node is put into the list?

> +	} while (cmpxchg(&head->first, first, new_first) != first);
> +
> +	return first == &__llist_end;
> +}
> +EXPORT_SYMBOL_GPL(sllist_add_batch);
> +
>  /**
>   * llist_del_first - delete the first entry of lock-less list
>   * @head:	the head for your lock-less list
> @@ -69,6 +93,33 @@ struct llist_node *llist_del_first(struct llist_head *head)
>  }
>  EXPORT_SYMBOL_GPL(llist_del_first);
>  
> +/**
> + * sllist_del_first - delete the first entry of sentinel lock-less list
> + * @head:	the head for your lock-less list
> + *
> + * If list is empty, return NULL, otherwise, return the first entry
> + * deleted, this is the newest added one.
> + */
> +struct llist_node *sllist_del_first(struct llist_head *head)
> +{
> +	struct llist_node *entry, *old_entry, *next;
> +
> +	entry = smp_load_acquire(&head->first);
> +	for (;;) {
> +		if (entry == &__llist_end)
> +			return NULL;
> +		old_entry = entry;
> +		next = READ_ONCE(entry->next);
> +		entry = cmpxchg(&head->first, old_entry, next);
> +		if (entry == old_entry)
> +			break;
> +	}
> +	WRITE_ONCE(entry->next, NULL);
> +
> +	return entry;
> +}
> +EXPORT_SYMBOL_GPL(sllist_del_first);
> +
>  /**
>   * llist_reverse_order - reverse order of a llist chain
>   * @head:	first item of the list to be reversed
> @@ -90,3 +141,29 @@ struct llist_node *llist_reverse_order(struct llist_node *head)
>  	return new_head;
>  }
>  EXPORT_SYMBOL_GPL(llist_reverse_order);
> +
> +/**
> + * sllist_reverse_order - reverse order of a llist chain
> + * @head:	first item of the list to be reversed
> + *
> + * Reverse the order of a chain of llist entries and return the
> + * new first entry.
> + */
> +struct llist_node *sllist_reverse_order(struct llist_node *head)
> +{
> +	struct llist_node *new_head = &__llist_end;
> +
> +	if (!head || (head == &__llist_end))
> +		return NULL;
> +
> +	while (head != &__llist_end) {
> +		struct llist_node *tmp = head;
> +
> +		head = head->next;
> +		tmp->next = new_head;
> +		new_head = tmp;
> +	}
> +
> +	return new_head;
> +}
> +EXPORT_SYMBOL_GPL(sllist_reverse_order);

Best Regards,
Huang, Ying

