* [PATCH v4 0/3] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
@ 2023-11-06 21:05 Waiman Long
  2023-11-06 21:05 ` [PATCH v4 1/3] " Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Waiman Long @ 2023-11-06 21:05 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner
  Cc: cgroups, linux-kernel, Joe Mario, Sebastian Jug, Yosry Ahmed,
	Waiman Long

 v4:
  - Update patch 2 to fix a minor bug and update some of the comments.

 v3:
  - Minor comment tweaks as suggested by Yosry.
  - Add patches 2 and 3 to further reduce the lock hold time.

The purpose of this patch series is to reduce the cpu_lock hold time
in cgroup_rstat_flush_locked() and so reduce the latency impact on
cgroup_rstat_updated() callers, which may contend with the flush on
the cpu_lock.

A parallel kernel build on a 2-socket x86-64 server is used as the
benchmark workload for measuring the lock hold time. Below are the lock
hold time frequency distributions before and after applying different
numbers of patches:

  Hold time   Before patch   Patch 1   Patches 1-2  Patches 1-3
  ---------   ------------   -------   -----------  -----------
    0-01 us      804,139   13,738,708   14,594,545   15,484,707
   01-05 us    9,772,767    1,177,194      439,926      207,382
   05-10 us    4,595,028        4,984        5,960        3,174
   10-15 us      303,481        3,562        3,543        3,006
   15-20 us       78,971        1,314        1,397        1,066
   20-25 us       24,583           18           25           15
   25-30 us        6,908           12           12           10
   30-40 us        8,015
   40-50 us        2,192
   50-60 us          316
   60-70 us           43
   70-80 us            7
   80-90 us            2
     >90 us            3

Waiman Long (3):
  cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
  cgroup/rstat: Optimize cgroup_rstat_updated_list()
  cgroup: Avoid false cacheline sharing of read mostly rstat_cpu

 include/linux/cgroup-defs.h |  14 ++++
 kernel/cgroup/rstat.c       | 131 +++++++++++++++++++++---------------
 2 files changed, 91 insertions(+), 54 deletions(-)

-- 
2.39.3



* [PATCH v4 1/3] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
  2023-11-06 21:05 [PATCH v4 0/3] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked() Waiman Long
@ 2023-11-06 21:05 ` Waiman Long
  2023-11-06 21:05 ` [PATCH v4 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list() Waiman Long
  2023-11-06 21:05 ` [PATCH v4 3/3] cgroup: Avoid false cacheline sharing of read mostly rstat_cpu Waiman Long
  2 siblings, 0 replies; 7+ messages in thread
From: Waiman Long @ 2023-11-06 21:05 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner
  Cc: cgroups, linux-kernel, Joe Mario, Sebastian Jug, Yosry Ahmed,
	Waiman Long

When cgroup_rstat_updated() isn't being called concurrently with
cgroup_rstat_flush_locked(), its run time is pretty short. When
both are called concurrently, the cgroup_rstat_updated() run time
can spike to a pretty high value due to high cpu_lock hold time in
cgroup_rstat_flush_locked(). This can be problematic if the task calling
cgroup_rstat_updated() is a realtime task running on an isolated CPU
with a strict latency requirement. The cgroup_rstat_updated() call can
happen on a page fault even though the task spends most of its time
running in user space.

The percpu cpu_lock is used to protect the update tree -
updated_next and updated_children. This protection is only needed when
cgroup_rstat_cpu_pop_updated() is being called. The subsequent flushing
operation, which can take a much longer time, does not need that
protection as it is already protected by cgroup_rstat_lock.

To reduce the cpu_lock hold time, we need to perform all the
cgroup_rstat_cpu_pop_updated() calls up front and release the lock
before doing any flushing. This patch adds a new
cgroup_rstat_updated_list() function that returns a singly linked list
of cgroups to be flushed.

Some instrumentation code is added to measure the cpu_lock hold time
from right after lock acquisition to right after releasing the lock. A
parallel kernel build on a 2-socket x86-64 server is used as the
benchmark workload.
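
The instrumentation itself is not included in this patch. A minimal
sketch of this kind of measurement, assuming a local_clock() timestamp
pair and a hypothetical lock_hold_hist_update() helper that bins the
delta into the buckets shown below, could look like:

static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
{
	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
	struct cgroup *head, *tail, *next;
	unsigned long flags;
	u64 start, delta;

	raw_spin_lock_irqsave(cpu_lock, flags);
	start = local_clock();		/* timestamp right after acquisition */
	head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
	while (tail) {
		next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
		tail->rstat_flush_next = next;
		tail = next;
	}
	raw_spin_unlock_irqrestore(cpu_lock, flags);
	delta = local_clock() - start;	/* includes the unlock itself */
	lock_hold_hist_update(delta);	/* hypothetical histogram helper */
	return head;
}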

The maximum cpu_lock hold times before and after the patch are 100us
and 29us respectively, so the worst-case time is reduced to about 30%
of the original. However, OS or hardware noise such as NMIs or SMIs on
the test system can inflate the worst-case value. Such noise is usually
tuned out in a real production environment to get a better result.

OTOH, the lock hold time frequency distribution gives a better idea of
the performance benefit of the patch.  Below are the frequency
distributions before and after the patch:

     Hold time        Before patch       After patch
     ---------        ------------       -----------
       0-01 us           804,139         13,738,708
      01-05 us         9,772,767          1,177,194
      05-10 us         4,595,028              4,984
      10-15 us           303,481              3,562
      15-20 us            78,971              1,314
      20-25 us            24,583                 18
      25-30 us             6,908                 12
      30-40 us             8,015
      40-50 us             2,192
      50-60 us               316
      60-70 us                43
      70-80 us                 7
      80-90 us                 2
        >90 us                 3

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/cgroup-defs.h |  7 ++++++
 kernel/cgroup/rstat.c       | 43 ++++++++++++++++++++++++-------------
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 265da00a1a8b..ff4b4c590f32 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -491,6 +491,13 @@ struct cgroup {
 	struct cgroup_rstat_cpu __percpu *rstat_cpu;
 	struct list_head rstat_css_list;
 
+	/*
+	 * A singly-linked list of cgroup structures to be rstat flushed.
+	 * This is a scratch field to be used exclusively by
+	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 */
+	struct cgroup	*rstat_flush_next;
+
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
 	struct cgroup_base_stat bstat;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d80d7a608141..1f300bf4dc40 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -145,6 +145,32 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	return pos;
 }
 
+/* Return a list of updated cgroups to be flushed */
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+{
+	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+	struct cgroup *head, *tail, *next;
+	unsigned long flags;
+
+	/*
+	 * The _irqsave() is needed because cgroup_rstat_lock is
+	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+	 * this lock with the _irq() suffix only disables interrupts on
+	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+	 * interrupts on both configurations. The _irqsave() ensures
+	 * that interrupts are always disabled and later restored.
+	 */
+	raw_spin_lock_irqsave(cpu_lock, flags);
+	head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
+	while (tail) {
+		next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
+		tail->rstat_flush_next = next;
+		tail = next;
+	}
+	raw_spin_unlock_irqrestore(cpu_lock, flags);
+	return head;
+}
+
 /*
  * A hook for bpf stat collectors to attach to and flush their stats.
  * Together with providing bpf kfuncs for cgroup_rstat_updated() and
@@ -179,21 +205,9 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	lockdep_assert_held(&cgroup_rstat_lock);
 
 	for_each_possible_cpu(cpu) {
-		raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
-						       cpu);
-		struct cgroup *pos = NULL;
-		unsigned long flags;
+		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
 
-		/*
-		 * The _irqsave() is needed because cgroup_rstat_lock is
-		 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-		 * this lock with the _irq() suffix only disables interrupts on
-		 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-		 * interrupts on both configurations. The _irqsave() ensures
-		 * that interrupts are always disabled and later restored.
-		 */
-		raw_spin_lock_irqsave(cpu_lock, flags);
-		while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
+		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css;
 
 			cgroup_base_stat_flush(pos, cpu);
@@ -205,7 +219,6 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 				css->ss->css_rstat_flush(css, cpu);
 			rcu_read_unlock();
 		}
-		raw_spin_unlock_irqrestore(cpu_lock, flags);
 
 		/* play nice and yield if necessary */
 		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-- 
2.39.3



* [PATCH v4 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
  2023-11-06 21:05 [PATCH v4 0/3] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked() Waiman Long
  2023-11-06 21:05 ` [PATCH v4 1/3] " Waiman Long
@ 2023-11-06 21:05 ` Waiman Long
  2023-11-28  4:01   ` Waiman Long
  2023-11-06 21:05 ` [PATCH v4 3/3] cgroup: Avoid false cacheline sharing of read mostly rstat_cpu Waiman Long
  2 siblings, 1 reply; 7+ messages in thread
From: Waiman Long @ 2023-11-06 21:05 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner
  Cc: cgroups, linux-kernel, Joe Mario, Sebastian Jug, Yosry Ahmed,
	Waiman Long

The current design of cgroup_rstat_cpu_pop_updated() traverses the
updated tree in a way that pops out the leaf nodes before their
parents. This can require visiting multiple nodes before a leaf node
can be found and popped out. IOW, a given node in the tree may be
visited several times before the whole operation is done. So it is not
very efficient and the code can be hard to read.

With the introduction of cgroup_rstat_updated_list() to build a list
of cgroups to be flushed before any flushing is done, we can optimize
the way the updated tree nodes are popped by pushing the parents to
the tail end of the list before their children. In this way, most
updated tree nodes will be visited only once, with the exception of
the subtree root, as we still need to go back to its parent and pop it
out of its updated_children list. This also makes the code easier to
read.
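
As a hypothetical illustration (the hierarchy is made up): suppose the
updated subtree rooted at R has children A and B on its
updated_children list, in that order, and A in turn has an updated
child A1. The list is then built as

    B -> A1 -> A -> R

so the subsequent flush visits B, A1, A and finally R, and every child
is still flushed before its parent.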

A parallel kernel build on a 2-socket x86-64 server is used as the
benchmark workload for measuring the lock hold time. Below are the lock
hold time frequency distributions before and after the patch:

     Hold time        Before patch       After patch
     ---------        ------------       -----------
       0-01 us        13,738,708         14,594,545
      01-05 us         1,177,194            439,926
      05-10 us             4,984              5,960
      10-15 us             3,562              3,543
      15-20 us             1,314              1,397
      20-25 us                18                 25
      25-30 us                12                 12

It can be seen that the patch pushes the lock hold time towards the
lower end.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/rstat.c | 134 +++++++++++++++++++++++-------------------
 1 file changed, 72 insertions(+), 62 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 1f300bf4dc40..701388fa215f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -74,64 +74,92 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 }
 
 /**
- * cgroup_rstat_cpu_pop_updated - iterate and dismantle rstat_cpu updated tree
- * @pos: current position
- * @root: root of the tree to traversal
+ * cgroup_rstat_push_children - push children cgroups into the given list
+ * @head: current head of the list (= parent cgroup)
+ * @prstatc: cgroup_rstat_cpu of the parent cgroup
  * @cpu: target cpu
+ * Return: A new singly linked list of cgroups to be flushed
  *
- * Walks the updated rstat_cpu tree on @cpu from @root.  %NULL @pos starts
- * the traversal and %NULL return indicates the end.  During traversal,
- * each returned cgroup is unlinked from the tree.  Must be called with the
- * matching cgroup_rstat_cpu_lock held.
+ * Recursively traverse down the cgroup_rstat_cpu updated tree and push
+ * parent first before its children into a singly linked list built from
+ * the tail backward like "pushing" cgroups into a stack. The parent is
+ * pushed by the caller. The recursion depth is the depth of the current
+ * updated subtree.
+ */
+static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
+				struct cgroup_rstat_cpu *prstatc, int cpu)
+{
+	struct cgroup *child, *parent;
+	struct cgroup_rstat_cpu *crstatc;
+
+	parent = head;
+	child = prstatc->updated_children;
+	prstatc->updated_children = parent;
+
+	/* updated_next is parent cgroup terminated */
+	while (child != parent) {
+		child->rstat_flush_next = head;
+		head = child;
+		crstatc = cgroup_rstat_cpu(child, cpu);
+		if (crstatc->updated_children != child)
+			head = cgroup_rstat_push_children(head, crstatc, cpu);
+		child = crstatc->updated_next;
+		crstatc->updated_next = NULL;
+	}
+	return head;
+}
+
+/**
+ * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
+ * @root: root of the cgroup subtree to traverse
+ * @cpu: target cpu
+ * Return: A singly linked list of cgroups to be flushed
+ *
+ * Walks the updated rstat_cpu tree on @cpu from @root.  During traversal,
+ * each returned cgroup is unlinked from the updated tree.
  *
  * The only ordering guarantee is that, for a parent and a child pair
- * covered by a given traversal, if a child is visited, its parent is
- * guaranteed to be visited afterwards.
+ * covered by a given traversal, the child is before its parent in
+ * the list.
+ *
+ * Note that updated_children is self terminated and points to a list of
+ * child cgroups if not empty. Whereas updated_next is like a sibling link
+ * within the children list and terminated by the parent cgroup. An exception
+ * here is the cgroup root whose updated_next can be self terminated.
  */
-static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
-						   struct cgroup *root, int cpu)
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 {
-	struct cgroup_rstat_cpu *rstatc;
-	struct cgroup *parent;
-
-	if (pos == root)
-		return NULL;
+	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
+	struct cgroup *head = NULL, *parent;
+	unsigned long flags;
 
 	/*
-	 * We're gonna walk down to the first leaf and visit/remove it.  We
-	 * can pick whatever unvisited node as the starting point.
+	 * The _irqsave() is needed because cgroup_rstat_lock is
+	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+	 * this lock with the _irq() suffix only disables interrupts on
+	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+	 * interrupts on both configurations. The _irqsave() ensures
+	 * that interrupts are always disabled and later restored.
 	 */
-	if (!pos) {
-		pos = root;
-		/* return NULL if this subtree is not on-list */
-		if (!cgroup_rstat_cpu(pos, cpu)->updated_next)
-			return NULL;
-	} else {
-		pos = cgroup_parent(pos);
-	}
+	raw_spin_lock_irqsave(cpu_lock, flags);
 
-	/* walk down to the first leaf */
-	while (true) {
-		rstatc = cgroup_rstat_cpu(pos, cpu);
-		if (rstatc->updated_children == pos)
-			break;
-		pos = rstatc->updated_children;
-	}
+	/* Return NULL if this subtree is not on-list */
+	if (!rstatc->updated_next)
+		goto unlock_ret;
 
 	/*
-	 * Unlink @pos from the tree.  As the updated_children list is
+	 * Unlink @root from its parent. As the updated_children list is
 	 * singly linked, we have to walk it to find the removal point.
-	 * However, due to the way we traverse, @pos will be the first
-	 * child in most cases. The only exception is @root.
 	 */
-	parent = cgroup_parent(pos);
+	parent = cgroup_parent(root);
 	if (parent) {
 		struct cgroup_rstat_cpu *prstatc;
 		struct cgroup **nextp;
 
 		prstatc = cgroup_rstat_cpu(parent, cpu);
 		nextp = &prstatc->updated_children;
-		while (*nextp != pos) {
+		while (*nextp != root) {
 			struct cgroup_rstat_cpu *nrstatc;
 
 			nrstatc = cgroup_rstat_cpu(*nextp, cpu);
@@ -142,31 +170,13 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	}
 
 	rstatc->updated_next = NULL;
-	return pos;
-}
-
-/* Return a list of updated cgroups to be flushed */
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
-{
-	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
-	struct cgroup *head, *tail, *next;
-	unsigned long flags;
 
-	/*
-	 * The _irqsave() is needed because cgroup_rstat_lock is
-	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-	 * this lock with the _irq() suffix only disables interrupts on
-	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-	 * interrupts on both configurations. The _irqsave() ensures
-	 * that interrupts are always disabled and later restored.
-	 */
-	raw_spin_lock_irqsave(cpu_lock, flags);
-	head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
-	while (tail) {
-		next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
-		tail->rstat_flush_next = next;
-		tail = next;
-	}
+	/* Push @root to the list first before pushing the children */
+	head = root;
+	root->rstat_flush_next = NULL;
+	if (rstatc->updated_children != root)
+		head = cgroup_rstat_push_children(head, rstatc, cpu);
+unlock_ret:
 	raw_spin_unlock_irqrestore(cpu_lock, flags);
 	return head;
 }
-- 
2.39.3



* [PATCH v4 3/3] cgroup: Avoid false cacheline sharing of read mostly rstat_cpu
  2023-11-06 21:05 [PATCH v4 0/3] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked() Waiman Long
  2023-11-06 21:05 ` [PATCH v4 1/3] " Waiman Long
  2023-11-06 21:05 ` [PATCH v4 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list() Waiman Long
@ 2023-11-06 21:05 ` Waiman Long
  2 siblings, 0 replies; 7+ messages in thread
From: Waiman Long @ 2023-11-06 21:05 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner
  Cc: cgroups, linux-kernel, Joe Mario, Sebastian Jug, Yosry Ahmed,
	Waiman Long

The rstat_cpu and rstat_css_list fields of the cgroup structure are
read-mostly variables. However, they may share a cacheline with the
subsequent rstat_flush_next and *bstat fields, which can be updated
frequently. That slows down the cgroup_rstat_cpu() call, which is used
pretty frequently in the rstat code. Add a CACHELINE_PADDING() line in
between them to avoid false cacheline sharing.
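
As a side note, this kind of false sharing can usually be confirmed
with perf c2c, which reports cachelines with cross-CPU HITM accesses.
An illustrative workflow (not part of this patch) while the benchmark
is running would be something like:

  perf c2c record -a -- sleep 10
  perf c2c report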

A parallel kernel build on a 2-socket x86-64 server is used as the
benchmark workload for measuring the lock hold time. Below are the lock
hold time frequency distributions before and after the patch:

     Hold time        Before patch       After patch
     ---------        ------------       -----------
       0-01 us        14,594,545         15,484,707
      01-05 us           439,926            207,382
      05-10 us             5,960              3,174
      10-15 us             3,543              3,006
      15-20 us             1,397              1,066
      20-25 us                25                 15
      25-30 us                12                 10

It can be seen that the patch further pushes the lock hold time towards
the lower end.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/cgroup-defs.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ff4b4c590f32..a4adc0580135 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -491,6 +491,13 @@ struct cgroup {
 	struct cgroup_rstat_cpu __percpu *rstat_cpu;
 	struct list_head rstat_css_list;
 
+	/*
+	 * Add padding to separate the read mostly rstat_cpu and
+	 * rstat_css_list into a different cacheline from the following
+	 * rstat_flush_next and *bstat fields which can have frequent updates.
+	 */
+	CACHELINE_PADDING(_pad_);
+
 	/*
 	 * A singly-linked list of cgroup structures to be rstat flushed.
 	 * This is a scratch field to be used exclusively by
-- 
2.39.3



* Re: [PATCH v4 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
  2023-11-06 21:05 ` [PATCH v4 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list() Waiman Long
@ 2023-11-28  4:01   ` Waiman Long
  2023-11-28 16:43     ` Tejun Heo
  0 siblings, 1 reply; 7+ messages in thread
From: Waiman Long @ 2023-11-28  4:01 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner
  Cc: cgroups, linux-kernel, Joe Mario, Sebastian Jug, Yosry Ahmed

On 11/6/23 16:05, Waiman Long wrote:
> The current design of cgroup_rstat_cpu_pop_updated() traverses the
> updated tree in a way that pops out the leaf nodes before their
> parents. This can require visiting multiple nodes before a leaf node
> can be found and popped out. IOW, a given node in the tree may be
> visited several times before the whole operation is done. So it is not
> very efficient and the code can be hard to read.
>
> [...]

Any further comment or suggestion on this patch?

Thanks,
Longman



* Re: [PATCH v4 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
  2023-11-28  4:01   ` Waiman Long
@ 2023-11-28 16:43     ` Tejun Heo
  2023-11-28 16:46       ` Waiman Long
  0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2023-11-28 16:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Zefan Li, Johannes Weiner, cgroups, linux-kernel, Joe Mario,
	Sebastian Jug, Yosry Ahmed

On Mon, Nov 27, 2023 at 11:01:22PM -0500, Waiman Long wrote:
...
> > + * Recursively traverse down the cgroup_rstat_cpu updated tree and push
> > + * parent first before its children into a singly linked list built from
> > + * the tail backward like "pushing" cgroups into a stack. The parent is
> > + * pushed by the caller. The recursion depth is the depth of the current
> > + * updated subtree.
> > + */
> > +static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
> > +				struct cgroup_rstat_cpu *prstatc, int cpu)
> > +{
> > +	struct cgroup *child, *parent;
> > +	struct cgroup_rstat_cpu *crstatc;
> > +
> > +	parent = head;
> > +	child = prstatc->updated_children;
> > +	prstatc->updated_children = parent;
> > +
> > +	/* updated_next is parent cgroup terminated */
> > +	while (child != parent) {
> > +		child->rstat_flush_next = head;
> > +		head = child;
> > +		crstatc = cgroup_rstat_cpu(child, cpu);
> > +		if (crstatc->updated_children != child)
> > +			head = cgroup_rstat_push_children(head, crstatc, cpu);
> > +		child = crstatc->updated_next;
> > +		crstatc->updated_next = NULL;
> > +	}
> > +	return head;

The recursion bothers me. We don't really have a hard limit on nesting
depth. We might need to add another pointer field but can make this
iterative, right?

Thanks.

-- 
tejun


* Re: [PATCH v4 2/3] cgroup/rstat: Optimize cgroup_rstat_updated_list()
  2023-11-28 16:43     ` Tejun Heo
@ 2023-11-28 16:46       ` Waiman Long
  0 siblings, 0 replies; 7+ messages in thread
From: Waiman Long @ 2023-11-28 16:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Zefan Li, Johannes Weiner, cgroups, linux-kernel, Joe Mario,
	Sebastian Jug, Yosry Ahmed

On 11/28/23 11:43, Tejun Heo wrote:
> On Mon, Nov 27, 2023 at 11:01:22PM -0500, Waiman Long wrote:
> ...
>>> + * Recursively traverse down the cgroup_rstat_cpu updated tree and push
>>> + * parent first before its children into a singly linked list built from
>>> + * the tail backward like "pushing" cgroups into a stack. The parent is
>>> + * pushed by the caller. The recursion depth is the depth of the current
>>> + * updated subtree.
>>> + */
>>> +static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
>>> +				struct cgroup_rstat_cpu *prstatc, int cpu)
>>> +{
>>> +	struct cgroup *child, *parent;
>>> +	struct cgroup_rstat_cpu *crstatc;
>>> +
>>> +	parent = head;
>>> +	child = prstatc->updated_children;
>>> +	prstatc->updated_children = parent;
>>> +
>>> +	/* updated_next is parent cgroup terminated */
>>> +	while (child != parent) {
>>> +		child->rstat_flush_next = head;
>>> +		head = child;
>>> +		crstatc = cgroup_rstat_cpu(child, cpu);
>>> +		if (crstatc->updated_children != child)
>>> +			head = cgroup_rstat_push_children(head, crstatc, cpu);
>>> +		child = crstatc->updated_next;
>>> +		crstatc->updated_next = NULL;
>>> +	}
>>> +	return head;
> The recursion bothers me. We don't really have a hard limit on nesting
> depth. We might need to add another pointer field but can make this
> iterative, right?

I see. Yes, I think it is possible to make it iterative. Using recursion 
is just an easier way to do it. Will look into that.
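
For illustration only, a rough sketch of one possible iterative
version (untested, reusing only fields already introduced in this
series and relying on the caller still holding the per-cpu lock) would
process the updated tree level by level, threading the pending next
level through rstat_flush_next as well:

static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
				struct cgroup_rstat_cpu *prstatc, int cpu)
{
	struct cgroup *child, *parent, *grandchild;
	struct cgroup *chead, *ghead = NULL;	/* current/next level heads */
	struct cgroup_rstat_cpu *crstatc;

	/* Detach the first level of children; @head is their parent */
	chead = prstatc->updated_children;
	prstatc->updated_children = head;	/* mark it self-terminated */
	chead->rstat_flush_next = NULL;

next_level:
	while (chead) {
		child = chead;
		chead = child->rstat_flush_next;
		parent = cgroup_parent(child);

		/* updated_next is parent cgroup terminated */
		while (child != parent) {
			child->rstat_flush_next = head;
			head = child;
			crstatc = cgroup_rstat_cpu(child, cpu);
			grandchild = crstatc->updated_children;
			if (grandchild != child) {
				/* defer the grandchildren to the next level */
				crstatc->updated_children = child;
				grandchild->rstat_flush_next = ghead;
				ghead = grandchild;
			}
			child = crstatc->updated_next;
			crstatc->updated_next = NULL;
		}
	}

	/* Process the next level down, if any */
	if (ghead) {
		chead = ghead;
		ghead = NULL;
		goto next_level;
	}
	return head;
}

The list order differs from the recursive version, but every child
still ends up ahead of its parent, which is the only ordering the
flush side relies on.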

Thanks,
Longman

>
> Thanks.
>


