* [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
@ 2022-11-02  2:02 ` Leonardo Bras
  0 siblings, 0 replies; 28+ messages in thread
From: Leonardo Bras @ 2022-11-02  2:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Frederic Weisbecker, Leonardo Bras, Phil Auld,
	Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm

Patch #1 expands housekeeping_any_cpu() so we can find housekeeping CPUs
that are closer (NUMA-wise) to any desired CPU, instead of only to the
current CPU.

### Performance argument that motivated the change:
One could argue why that would be needed, since the current CPU is probably
already accessing the cacheline, so a CPU closer to the current one is
always the best choice because the cache invalidation will take less time.
OTOH, there are cases like this one, which uses perCPU variables, where up
to 3 different CPUs can touch the cacheline:

C1 - Isolated CPU: the perCPU data 'belongs' to this one
C2 - Scheduling CPU: schedules some work to be done elsewhere; the current CPU
C3 - Housekeeping CPU: the one that will do the work

Most of the time the cacheline is touched, it should be by C1. Sometimes
C2 will schedule work to run on C3, since C1 is isolated.

If C1 and C2 are in different NUMA nodes, we could have C3 either in
C2's NUMA node (housekeeping_any_cpu()) or in C1's NUMA node
(housekeeping_any_cpu_from(C1)).

If C3 is in C2's NUMA node, there will be a faster invalidation when C3
tries to get cacheline exclusivity, and then a slower invalidation when
this happens in C1, which is working on its own data.

If C3 is in C1's NUMA node, there will be a slower invalidation when C3
tries to get cacheline exclusivity, and then a faster invalidation when
this happens in C1.

The point is: it should be better to wait less when doing kernel work
on an isolated CPU, even at the cost of some housekeeping CPU waiting
a few more cycles.
###
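
As a rough illustration of the argument above (not part of the patches, just
a minimal sketch: delegate_drain() and the choice of HK_TYPE_WQ are
hypothetical, while housekeeping_any_cpu_from() and schedule_work_on() are
the real helpers used by the series), the scheduling CPU (C2) could pick the
worker CPU (C3) close to the isolated CPU (C1) like this:

    /* Hypothetical caller running on C2. */
    static void delegate_drain(int isolated_cpu, struct work_struct *work)
    {
            /* Pick C3 in C1's NUMA node instead of C2's. */
            int target = housekeeping_any_cpu_from(HK_TYPE_WQ, isolated_cpu);

            schedule_work_on(target, work);
    }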

Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
a local_lock to a spinlock, so it can later be used to do remote per-CPU
cache draining in patch #3. Most performance concerns are addressed in the
commit log.

Patch #3 implements the remote per-CPU cache drain, making use of both
patches #1 and #2. Performance-wise, in non-isolated scenarios, it should
only introduce an extra function call and a single test to check whether
the CPU is isolated.

In scenarios with isolation enabled at boot, it will also introduce an
extra test to check in the cpumask whether the CPU is isolated. If it is,
there will also be an extra read of the cpumask to look for a
housekeeping CPU.
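
In code terms, the behavior described above comes down to the per-CPU
decision below (a condensed excerpt of what patch #3 adds to
drain_all_stock(); the names match the patch):

    if (cpu == curcpu) {
            drain_local_stock(&stock->work);
    } else if (housekeeping_cpu(cpu, HK_TYPE_WQ)) {
            /* Non-isolated CPU: keep draining it locally, as before. */
            schedule_work_on(cpu, &stock->work);
    } else {
            /* Isolated CPU: let a nearby housekeeping CPU drain it remotely. */
            int hkcpu = housekeeping_any_cpu_from(HK_TYPE_WQ, cpu);

            atomic_long_set(&stock->work.data, cpu);
            schedule_work_on(hkcpu, &stock->work);
    }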

Please provide any feedback on this!
Thanks a lot for reading!

Leonardo Bras (3):
  sched/isolation: Add housekeeping_any_cpu_from()
  mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t
  mm/memcontrol: Add drain_remote_stock(), avoid drain_stock on isolated
    cpus

 include/linux/sched/isolation.h | 11 +++--
 kernel/sched/isolation.c        |  8 ++--
 mm/memcontrol.c                 | 83 ++++++++++++++++++++++-----------
 3 files changed, 69 insertions(+), 33 deletions(-)

-- 
2.38.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v1 1/3] sched/isolation: Add housekeeping_any_cpu_from()
@ 2022-11-02  2:02   ` Leonardo Bras
  0 siblings, 0 replies; 28+ messages in thread
From: Leonardo Bras @ 2022-11-02  2:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Frederic Weisbecker, Leonardo Bras, Phil Auld,
	Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm

As of today, there is a function called housekeeping_any_cpu() that
returns a housekeeping cpu near the current one. This function is very
useful for delegating tasks to other cpus when the current one is
isolated.

It also comes with the benefit of looking for cpus in the same NUMA node
as the current cpu, so any memory activity can be faster on NUMA systems.

On the other hand, there is no such function to find housekeeping cpus
in the same NUMA node as another CPU.

Change housekeeping_any_cpu() into housekeeping_any_cpu_from(), so it
accepts a cpu_start parameter and can find cpus in the same NUMA node as
any given CPU.

Also, reimplement housekeeping_any_cpu() as an inline function that calls
housekeeping_any_cpu_from() with cpu_start = current cpu.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/linux/sched/isolation.h | 11 ++++++++---
 kernel/sched/isolation.c        |  8 ++++----
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 8c15abd67aed9..95b65be44f19f 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -20,7 +20,7 @@ enum hk_type {
 
 #ifdef CONFIG_CPU_ISOLATION
 DECLARE_STATIC_KEY_FALSE(housekeeping_overridden);
-extern int housekeeping_any_cpu(enum hk_type type);
+extern int housekeeping_any_cpu_from(enum hk_type type, int cpu_start);
 extern const struct cpumask *housekeeping_cpumask(enum hk_type type);
 extern bool housekeeping_enabled(enum hk_type type);
 extern void housekeeping_affine(struct task_struct *t, enum hk_type type);
@@ -29,9 +29,9 @@ extern void __init housekeeping_init(void);
 
 #else
 
-static inline int housekeeping_any_cpu(enum hk_type type)
+static inline int housekeeping_any_cpu_from(enum hk_type type, int cpu_start)
 {
-	return smp_processor_id();
+	return cpu_start;
 }
 
 static inline const struct cpumask *housekeeping_cpumask(enum hk_type type)
@@ -58,4 +58,9 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
 	return true;
 }
 
+static inline int housekeeping_any_cpu(enum hk_type type)
+{
+	return housekeeping_any_cpu_from(type, smp_processor_id());
+}
+
 #endif /* _LINUX_SCHED_ISOLATION_H */
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 373d42c707bc5..6ebeac11bb350 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -36,22 +36,22 @@ bool housekeeping_enabled(enum hk_type type)
 }
 EXPORT_SYMBOL_GPL(housekeeping_enabled);
 
-int housekeeping_any_cpu(enum hk_type type)
+int housekeeping_any_cpu_from(enum hk_type type, int cpu_start)
 {
 	int cpu;
 
 	if (static_branch_unlikely(&housekeeping_overridden)) {
 		if (housekeeping.flags & BIT(type)) {
-			cpu = sched_numa_find_closest(housekeeping.cpumasks[type], smp_processor_id());
+			cpu = sched_numa_find_closest(housekeeping.cpumasks[type], cpu_start);
 			if (cpu < nr_cpu_ids)
 				return cpu;
 
 			return cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
 		}
 	}
-	return smp_processor_id();
+	return cpu_start;
 }
-EXPORT_SYMBOL_GPL(housekeeping_any_cpu);
+EXPORT_SYMBOL_GPL(housekeeping_any_cpu_from);
 
 const struct cpumask *housekeeping_cpumask(enum hk_type type)
 {
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v1 2/3] mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t
@ 2022-11-02  2:02   ` Leonardo Bras
  0 siblings, 0 replies; 28+ messages in thread
From: Leonardo Bras @ 2022-11-02  2:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Frederic Weisbecker, Leonardo Bras, Phil Auld,
	Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm

In this context, since per-cpu variables are used, changing from a
local_lock to a spinlock should not have much impact on performance, and
it allows operations such as stock draining to happen from remote cpus.

Why performance would probably not get impacted:
1 - Since the lock is in the same cache line as the information that is
    used next, there is no extra memory access for caching the lock.
2 - Since it's a percpu struct, no other cpu should be sharing this
    cacheline, so there is no need for cacheline invalidation, and writing
    to the lock should be as fast as writing to the next struct members.
3 - Even the write in (2) can be pipelined and batched with the following
    writes to the cacheline (such as the nr_pages member), further
    decreasing the impact of this change.

Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 mm/memcontrol.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b300..add46da2e6df1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2167,7 +2167,7 @@ void unlock_page_memcg(struct page *page)
 }
 
 struct memcg_stock_pcp {
-	local_lock_t stock_lock;
+	spinlock_t stock_lock; /* Protects the percpu struct */
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
 
@@ -2184,7 +2184,7 @@ struct memcg_stock_pcp {
 #define FLUSHING_CACHED_CHARGE	0
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
-	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
+	.stock_lock = __SPIN_LOCK_UNLOCKED(stock_lock),
 };
 static DEFINE_MUTEX(percpu_charge_mutex);
 
@@ -2229,15 +2229,15 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (nr_pages > MEMCG_CHARGE_BATCH)
 		return ret;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
-
 	stock = this_cpu_ptr(&memcg_stock);
+	spin_lock_irqsave(&stock->stock_lock, flags);
+
 	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
 		stock->nr_pages -= nr_pages;
 		ret = true;
 	}
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	spin_unlock_irqrestore(&stock->stock_lock, flags);
 
 	return ret;
 }
@@ -2274,14 +2274,14 @@ static void drain_local_stock(struct work_struct *dummy)
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
-
 	stock = this_cpu_ptr(&memcg_stock);
+	spin_lock_irqsave(&stock->stock_lock, flags);
+
 	old = drain_obj_stock(stock);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	spin_unlock_irqrestore(&stock->stock_lock, flags);
 	if (old)
 		obj_cgroup_put(old);
 }
@@ -2309,10 +2309,12 @@ static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	unsigned long flags;
+	struct memcg_stock_pcp *stock;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	stock = this_cpu_ptr(&memcg_stock);
+	spin_lock_irqsave(&stock->stock_lock, flags);
 	__refill_stock(memcg, nr_pages);
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	spin_unlock_irqrestore(&stock->stock_lock, flags);
 }
 
 /*
@@ -3157,8 +3159,8 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	unsigned long flags;
 	int *bytes;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 	stock = this_cpu_ptr(&memcg_stock);
+	spin_lock_irqsave(&stock->stock_lock, flags);
 
 	/*
 	 * Save vmstat data in stock and skip vmstat array update unless
@@ -3210,7 +3212,7 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	if (nr)
 		mod_objcg_mlstate(objcg, pgdat, idx, nr);
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	spin_unlock_irqrestore(&stock->stock_lock, flags);
 	if (old)
 		obj_cgroup_put(old);
 }
@@ -3221,15 +3223,15 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	unsigned long flags;
 	bool ret = false;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
-
 	stock = this_cpu_ptr(&memcg_stock);
+	spin_lock_irqsave(&stock->stock_lock, flags);
+
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
 	}
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	spin_unlock_irqrestore(&stock->stock_lock, flags);
 
 	return ret;
 }
@@ -3319,9 +3321,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 	unsigned long flags;
 	unsigned int nr_pages = 0;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
-
 	stock = this_cpu_ptr(&memcg_stock);
+	spin_lock_irqsave(&stock->stock_lock, flags);
+
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		old = drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3337,7 +3339,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	spin_unlock_irqrestore(&stock->stock_lock, flags);
 	if (old)
 		obj_cgroup_put(old);
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v1 3/3] mm/memcontrol: Add drain_remote_stock(), avoid drain_stock on isolated cpus
@ 2022-11-02  2:02   ` Leonardo Bras
  0 siblings, 0 replies; 28+ messages in thread
From: Leonardo Bras @ 2022-11-02  2:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Frederic Weisbecker, Leonardo Bras, Phil Auld,
	Marcelo Tosatti
  Cc: linux-kernel, cgroups, linux-mm

When drain_all_stock() is called, some CPUs will be required to have their
per-CPU caches drained. This currently happens by scheduling a call to
drain_local_stock() to run on each affected CPU.

As a consequence, this may end up scheduling work on CPUs that are
isolated and therefore should have as little interruption as possible.

In order to avoid this, make drain_all_stock() able to detect isolated CPUs
and schedule the draining of the per-CPU stock to happen on another,
non-isolated CPU.

But since the current implementation only allows the drain to happen on the
local CPU, implement a function to drain the stock of a remote CPU:
drain_remote_stock().

Given that both drain_local_stock() and drain_remote_stock() do almost the
same work, implement an inline drain_stock_helper() that is called by both.

Also, since drain_stock() will be able to run on a remote CPU, protect
memcg_hotplug_cpu_dead() with stock_lock.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 mm/memcontrol.c | 47 ++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 38 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index add46da2e6df1..7ad6e4f4b79ef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -30,6 +30,7 @@
 #include <linux/cgroup.h>
 #include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
+#include <linux/sched/isolation.h>
 #include <linux/shmem_fs.h>
 #include <linux/hugetlb.h>
 #include <linux/pagemap.h>
@@ -2263,7 +2264,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 	stock->cached = NULL;
 }
 
-static void drain_local_stock(struct work_struct *dummy)
+static inline void drain_stock_helper(int cpu)
 {
 	struct memcg_stock_pcp *stock;
 	struct obj_cgroup *old = NULL;
@@ -2271,10 +2272,9 @@ static void drain_local_stock(struct work_struct *dummy)
 
 	/*
 	 * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
-	 * drain_stock races is that we always operate on local CPU stock
-	 * here with IRQ disabled
+	 * drain_stock races is stock_lock, a percpu spinlock.
 	 */
-	stock = this_cpu_ptr(&memcg_stock);
+	stock = per_cpu_ptr(&memcg_stock, cpu);
 	spin_lock_irqsave(&stock->stock_lock, flags);
 
 	old = drain_obj_stock(stock);
@@ -2286,6 +2286,16 @@ static void drain_local_stock(struct work_struct *dummy)
 		obj_cgroup_put(old);
 }
 
+static void drain_remote_stock(struct work_struct *work)
+{
+	drain_stock_helper(atomic_long_read(&work->data));
+}
+
+static void drain_local_stock(struct work_struct *dummy)
+{
+	drain_stock_helper(smp_processor_id());
+}
+
 /*
  * Cache charges(val) to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
@@ -2352,10 +2362,16 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 
 		if (flush &&
 		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
-			if (cpu == curcpu)
+			if (cpu == curcpu) {
 				drain_local_stock(&stock->work);
-			else
+			} else if (housekeeping_cpu(cpu, HK_TYPE_WQ)) {
 				schedule_work_on(cpu, &stock->work);
+			} else {
+				int hkcpu = housekeeping_any_cpu_from(HK_TYPE_WQ, cpu);
+
+				atomic_long_set(&stock->work.data, cpu);
+				schedule_work_on(hkcpu, &stock->work);
+			}
 		}
 	}
 	migrate_enable();
@@ -2367,7 +2383,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 	struct memcg_stock_pcp *stock;
 
 	stock = &per_cpu(memcg_stock, cpu);
+	spin_lock(&stock->stock_lock);
 	drain_stock(stock);
+	spin_unlock(&stock->stock_lock);
 
 	return 0;
 }
@@ -7272,9 +7290,20 @@ static int __init mem_cgroup_init(void)
 	cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
 				  memcg_hotplug_cpu_dead);
 
-	for_each_possible_cpu(cpu)
-		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
-			  drain_local_stock);
+	/*
+	 * CPUs that are isolated should not spend cpu time for stock draining,
+	 * so allow them to export this task to the nearest housekeeping enabled
+	 * cpu available.
+	 */
+	for_each_possible_cpu(cpu) {
+		if (housekeeping_cpu(cpu, HK_TYPE_WQ)) {
+			INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
+				  drain_local_stock);
+		} else {
+			INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
+				  drain_remote_stock);
+		}
+	}
 
 	for_each_node(node) {
 		struct mem_cgroup_tree_per_node *rtpn;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
@ 2022-11-02  8:53   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2022-11-02  8:53 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Frederic Weisbecker, Phil Auld, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> Patch #1 expands housekeepíng_any_cpu() so we can find housekeeping cpus
> closer (NUMA) to any desired CPU, instead of only the current CPU.
> 
> ### Performance argument that motivated the change:
> There could be an argument of why would that be needed, since the current
> CPU is probably acessing the current cacheline, and so having a CPU closer
> to the current one is always the best choice since the cache invalidation
> will take less time. OTOH, there could be cases like this which uses
> perCPU variables, and we can have up to 3 different CPUs touching the
> cacheline:
> 
> C1 - Isolated CPU: The perCPU data 'belongs' to this one
> C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
> C3 - Housekeeping CPU: This one will do the work
> 
> Most of the times the cacheline is touched, it should be by C1. Some times
> a C2 will schedule work to run on C3, since C1 is isolated.
> 
> If C1 and C2 are in different NUMA nodes, we could have C3 either in
> C2 NUMA node (housekeeping_any_cpu()) or in C1 NUMA node 
> (housekeeping_any_cpu_from(C1). 
> 
> If C3 is in C2 NUMA node, there will be a faster invalidation when C3
> tries to get cacheline exclusivity, and then a slower invalidation when
> this happens in C1, when it's working in its data.
> 
> If C3 is in C1 NUMA node, there will be a slower invalidation when C3
> tries to get cacheline exclusivity, and then a faster invalidation when
> this happens in C1.
> 
> The thing is: it should be better to wait less when doing kernel work
> on an isolated CPU, even at the cost of some housekeeping CPU waiting
> a few more cycles.
> ###
> 
> Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> local_lock to spinlocks, so it can be later used to do remote percpu
> cache draining on patch #3. Most performance concerns should be pointed
> in the commit log.
> 
> Patch #3 implements the remote per-CPU cache drain, making use of both 
> patches #2 and #3. Performance-wise, in non-isolated scenarios, it should
> introduce an extra function call and a single test to check if the CPU is
> isolated. 
> 
> On scenarios with isolation enabled on boot, it will also introduce an
> extra test to check in the cpumask if the CPU is isolated. If it is,
> there will also be an extra read of the cpumask to look for a
> housekeeping CPU.

This is a rather deep dive into the cache line usage, but the most
important thing is really missing: why do we want this change? From the
context it seems that this is an actual fix for an isolcpus= setup where
remote (aka non-isolated) activity interferes with isolated cpus by
scheduling pcp charge caches on those cpus.

Is this understanding correct?

If yes, how big of a problem is that? If you want remote draining then
you need some sort of locking (currently we rely on the local lock). How
come this locking is not going to cause a different form of disturbance?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
  2022-11-02  8:53   ` Michal Hocko
@ 2022-11-03 14:59     ` Leonardo Brás
  -1 siblings, 0 replies; 28+ messages in thread
From: Leonardo Brás @ 2022-11-03 14:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Frederic Weisbecker, Phil Auld, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > Patch #1 expands housekeepíng_any_cpu() so we can find housekeeping cpus
> > closer (NUMA) to any desired CPU, instead of only the current CPU.
> > 
> > ### Performance argument that motivated the change:
> > There could be an argument of why would that be needed, since the current
> > CPU is probably acessing the current cacheline, and so having a CPU closer
> > to the current one is always the best choice since the cache invalidation
> > will take less time. OTOH, there could be cases like this which uses
> > perCPU variables, and we can have up to 3 different CPUs touching the
> > cacheline:
> > 
> > C1 - Isolated CPU: The perCPU data 'belongs' to this one
> > C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
> > C3 - Housekeeping CPU: This one will do the work
> > 
> > Most of the times the cacheline is touched, it should be by C1. Some times
> > a C2 will schedule work to run on C3, since C1 is isolated.
> > 
> > If C1 and C2 are in different NUMA nodes, we could have C3 either in
> > C2 NUMA node (housekeeping_any_cpu()) or in C1 NUMA node 
> > (housekeeping_any_cpu_from(C1). 
> > 
> > If C3 is in C2 NUMA node, there will be a faster invalidation when C3
> > tries to get cacheline exclusivity, and then a slower invalidation when
> > this happens in C1, when it's working in its data.
> > 
> > If C3 is in C1 NUMA node, there will be a slower invalidation when C3
> > tries to get cacheline exclusivity, and then a faster invalidation when
> > this happens in C1.
> > 
> > The thing is: it should be better to wait less when doing kernel work
> > on an isolated CPU, even at the cost of some housekeeping CPU waiting
> > a few more cycles.
> > ###
> > 
> > Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> > local_lock to spinlocks, so it can be later used to do remote percpu
> > cache draining on patch #3. Most performance concerns should be pointed
> > in the commit log.
> > 
> > Patch #3 implements the remote per-CPU cache drain, making use of both 
> > patches #2 and #3. Performance-wise, in non-isolated scenarios, it should
> > introduce an extra function call and a single test to check if the CPU is
> > isolated. 
> > 
> > On scenarios with isolation enabled on boot, it will also introduce an
> > extra test to check in the cpumask if the CPU is isolated. If it is,
> > there will also be an extra read of the cpumask to look for a
> > housekeeping CPU.
> 

Hello Michal, thanks for reviewing!

> This is a rather deep dive in the cache line usage but the most
> important thing is really missing. Why do we want this change? From the
> context it seems that this is an actual fix for isolcpu= setup when
> remote (aka non isolated activity) interferes with isolated cpus by
> scheduling pcp charge caches on those cpus.
> 
> Is this understanding correct?

That's correct! The idea is to avoid scheduling work to isolated CPUs.

> If yes, how big of a problem that is?

The use case I have been following requires both isolcpus= and PREEMPT_RT, since
the isolated CPUs will be running a real-time workload. In this scenario,
doing any other work instead of the real-time workload may cause the system to
miss a deadline, which can be bad.

>  If you want a remote draining then
> you need some sort of locking (currently we rely on local lock). How
> come this locking is not going to cause a different form of disturbance?

If I did everything right, most of the extra work should be done either on
non-isolated (housekeeping) CPUs, or during a syscall. I mean, the draining of
the pcp charge caches will happen on a housekeeping CPU, and the locking cost
should be paid there, as we want to avoid paying it on the isolated CPUs.

I understand there will be a locking cost paid on the isolated CPUs when:
a) the isolated CPU is requesting the stock drain, and
b) the isolated CPU does a syscall and ends up using the protected structure
   for the first time after a remote drain.

Both (a) and (b) should happen during a syscall, and IIUC an RT workload
should not expect syscalls to have a predictable time, so it should be
fine.

Thanks for helping me explain the case!
Best regards,
Leo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
@ 2022-11-03 15:31       ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2022-11-03 15:31 UTC (permalink / raw)
  To: Leonardo Brás
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Frederic Weisbecker, Phil Auld, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> > On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > > Patch #1 expands housekeepíng_any_cpu() so we can find housekeeping cpus
> > > closer (NUMA) to any desired CPU, instead of only the current CPU.
> > > 
> > > ### Performance argument that motivated the change:
> > > There could be an argument of why would that be needed, since the current
> > > CPU is probably acessing the current cacheline, and so having a CPU closer
> > > to the current one is always the best choice since the cache invalidation
> > > will take less time. OTOH, there could be cases like this which uses
> > > perCPU variables, and we can have up to 3 different CPUs touching the
> > > cacheline:
> > > 
> > > C1 - Isolated CPU: The perCPU data 'belongs' to this one
> > > C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
> > > C3 - Housekeeping CPU: This one will do the work
> > > 
> > > Most of the times the cacheline is touched, it should be by C1. Some times
> > > a C2 will schedule work to run on C3, since C1 is isolated.
> > > 
> > > If C1 and C2 are in different NUMA nodes, we could have C3 either in
> > > C2 NUMA node (housekeeping_any_cpu()) or in C1 NUMA node 
> > > (housekeeping_any_cpu_from(C1). 
> > > 
> > > If C3 is in C2 NUMA node, there will be a faster invalidation when C3
> > > tries to get cacheline exclusivity, and then a slower invalidation when
> > > this happens in C1, when it's working in its data.
> > > 
> > > If C3 is in C1 NUMA node, there will be a slower invalidation when C3
> > > tries to get cacheline exclusivity, and then a faster invalidation when
> > > this happens in C1.
> > > 
> > > The thing is: it should be better to wait less when doing kernel work
> > > on an isolated CPU, even at the cost of some housekeeping CPU waiting
> > > a few more cycles.
> > > ###
> > > 
> > > Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> > > local_lock to spinlocks, so it can be later used to do remote percpu
> > > cache draining on patch #3. Most performance concerns should be pointed
> > > in the commit log.
> > > 
> > > Patch #3 implements the remote per-CPU cache drain, making use of both 
> > > patches #2 and #3. Performance-wise, in non-isolated scenarios, it should
> > > introduce an extra function call and a single test to check if the CPU is
> > > isolated. 
> > > 
> > > On scenarios with isolation enabled on boot, it will also introduce an
> > > extra test to check in the cpumask if the CPU is isolated. If it is,
> > > there will also be an extra read of the cpumask to look for a
> > > housekeeping CPU.
> > 
> 
> Hello Michael, thanks for reviewing!
> 
> > This is a rather deep dive in the cache line usage but the most
> > important thing is really missing. Why do we want this change? From the
> > context it seems that this is an actual fix for isolcpu= setup when
> > remote (aka non isolated activity) interferes with isolated cpus by
> > scheduling pcp charge caches on those cpus.
> > 
> > Is this understanding correct?
> 
> That's correct! The idea is to avoid scheduling work to isolated CPUs.
> 
> > If yes, how big of a problem that is?
> 
> The use case I have been following requires both isolcpus= and PREEMPT_RT, since
> the isolated CPUs will be running a real-time workload. In this scenario,
> getting any work done instead of the real-time workload may cause the system to
> miss a deadline, which can be bad. 

OK, I see. But is memcg charging actually an RT-friendly operation in the
first place? Please note that this path can trigger memory reclaim, and
that is when any RT expectations simply go down the drain.
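
Just to illustrate the point, the charge path is roughly shaped like this
(a heavily simplified sketch, not the actual mm/memcontrol.c code):

/*
 * Simplified sketch of try_charge_memcg(): the fast path uses the per-cpu
 * stock, but once that fails we hit the page counter and, when over the
 * limit, direct reclaim - nothing with a bounded latency.
 */
static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
                            unsigned int nr_pages)
{
        struct page_counter *counter;

        if (consume_stock(memcg, nr_pages))
                return 0;       /* fast path: per-cpu stock */

        if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
                return 0;       /* still below the limit */

        /*
         * Over the limit: direct reclaim, retries, throttling, possibly
         * OOM - none of it RT friendly.
         */
        return -ENOMEM;
}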

> >  If you want a remote draining then
> > you need some sort of locking (currently we rely on local lock). How
> > come this locking is not going to cause a different form of disturbance?
> 
> If I did everything right, most of the extra work should be done either in non-
> isolated (housekeeping) CPUs, or during a syscall. I mean, the pcp charge caches
> will be happening on a housekeeping CPU, and the locking cost should be paid
> there as we want to avoid doing that in the isolated CPUs. 
> 
> I understand there will be a locking cost being paid in the isolated CPUs when:
> a) The isolated CPU is requesting the stock drain,
> b) When the isolated CPUs do a syscall and end up using the protected structure
> the first time after a remote drain.

And any time the charging path (consume_stock resp. refill_stock)
contends with the remote draining, which is out of the control of the RT
task. It is true that the RT kernel will turn that spin lock into a
sleeping RT lock, and that could help with potential priority inversions,
but it would still be quite a costly thing, I would expect.

> Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> should not expect the syscalls to be have a predictable time, so it should be
> fine.

Now I am not sure I understand. If you do not consider the charging path to
be RT sensitive, then why is this needed in the first place? What else
would be populating the pcp cache on the isolated cpu? IRQs?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
@ 2022-11-03 16:53         ` Leonardo Brás
  0 siblings, 0 replies; 28+ messages in thread
From: Leonardo Brás @ 2022-11-03 16:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Frederic Weisbecker, Phil Auld, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> > On Wed, 2022-11-02 at 09:53 +0100, Michal Hocko wrote:
> > > On Tue 01-11-22 23:02:40, Leonardo Bras wrote:
> > > > Patch #1 expands housekeepíng_any_cpu() so we can find housekeeping cpus
> > > > closer (NUMA) to any desired CPU, instead of only the current CPU.
> > > > 
> > > > ### Performance argument that motivated the change:
> > > > There could be an argument of why would that be needed, since the current
> > > > CPU is probably acessing the current cacheline, and so having a CPU closer
> > > > to the current one is always the best choice since the cache invalidation
> > > > will take less time. OTOH, there could be cases like this which uses
> > > > perCPU variables, and we can have up to 3 different CPUs touching the
> > > > cacheline:
> > > > 
> > > > C1 - Isolated CPU: The perCPU data 'belongs' to this one
> > > > C2 - Scheduling CPU: Schedule some work to be done elsewhere, current cpu
> > > > C3 - Housekeeping CPU: This one will do the work
> > > > 
> > > > Most of the times the cacheline is touched, it should be by C1. Some times
> > > > a C2 will schedule work to run on C3, since C1 is isolated.
> > > > 
> > > > If C1 and C2 are in different NUMA nodes, we could have C3 either in
> > > > C2 NUMA node (housekeeping_any_cpu()) or in C1 NUMA node 
> > > > (housekeeping_any_cpu_from(C1). 
> > > > 
> > > > If C3 is in C2 NUMA node, there will be a faster invalidation when C3
> > > > tries to get cacheline exclusivity, and then a slower invalidation when
> > > > this happens in C1, when it's working in its data.
> > > > 
> > > > If C3 is in C1 NUMA node, there will be a slower invalidation when C3
> > > > tries to get cacheline exclusivity, and then a faster invalidation when
> > > > this happens in C1.
> > > > 
> > > > The thing is: it should be better to wait less when doing kernel work
> > > > on an isolated CPU, even at the cost of some housekeeping CPU waiting
> > > > a few more cycles.
> > > > ###
> > > > 
> > > > Patch #2 changes the locking strategy of memcg_stock_pcp->stock_lock from
> > > > local_lock to spinlocks, so it can be later used to do remote percpu
> > > > cache draining on patch #3. Most performance concerns should be pointed
> > > > in the commit log.
> > > > 
> > > > Patch #3 implements the remote per-CPU cache drain, making use of both 
> > > > patches #2 and #3. Performance-wise, in non-isolated scenarios, it should
> > > > introduce an extra function call and a single test to check if the CPU is
> > > > isolated. 
> > > > 
> > > > On scenarios with isolation enabled on boot, it will also introduce an
> > > > extra test to check in the cpumask if the CPU is isolated. If it is,
> > > > there will also be an extra read of the cpumask to look for a
> > > > housekeeping CPU.
> > > 
> > 
> > Hello Michael, thanks for reviewing!
> > 
> > > This is a rather deep dive in the cache line usage but the most
> > > important thing is really missing. Why do we want this change? From the
> > > context it seems that this is an actual fix for isolcpu= setup when
> > > remote (aka non isolated activity) interferes with isolated cpus by
> > > scheduling pcp charge caches on those cpus.
> > > 
> > > Is this understanding correct?
> > 
> > That's correct! The idea is to avoid scheduling work to isolated CPUs.
> > 
> > > If yes, how big of a problem that is?
> > 
> > The use case I have been following requires both isolcpus= and PREEMPT_RT, since
> > the isolated CPUs will be running a real-time workload. In this scenario,
> > getting any work done instead of the real-time workload may cause the system to
> > miss a deadline, which can be bad. 
> 
> OK, I see. But is memcg charging actually a RT friendly operation in the
> first place? Please note that this path can trigger memory reclaim and
> that is when any RT expectations are simply going down the drain.

I understand the time spent charging is unpredictable, as you said, since a
lot of slow stuff may or may not happen.

> 
> > >  If you want a remote draining then
> > > you need some sort of locking (currently we rely on local lock). How
> > > come this locking is not going to cause a different form of disturbance?
> > 
> > If I did everything right, most of the extra work should be done either in non-
> > isolated (housekeeping) CPUs, or during a syscall. I mean, the pcp charge caches
> > will be happening on a housekeeping CPU, and the locking cost should be paid
> > there as we want to avoid doing that in the isolated CPUs. 

Sorry, I think this caused a misunderstanding: I meant "the pcp charge cache
drain will be happening on a housekeeping CPU, ..."

> > 
> > I understand there will be a locking cost being paid in the isolated CPUs when:
> > a) The isolated CPU is requesting the stock drain,
> > b) When the isolated CPUs do a syscall and end up using the protected structure
> > the first time after a remote drain.
> 
> And anytime the charging path (consume_stock resp. refill_stock)
> contends with the remote draining which is out of control of the RT
> task. It is true that the RT kernel will turn that spin lock into a
> sleeping RT lock and that could help with potential priority inversions
> but still quite costly thing I would expect.
> 
> > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> > should not expect the syscalls to be have a predictable time, so it should be
> > fine.
> 
> Now I am not sure I understand. If you do not consider charging path to
> be RT sensitive then why is this needed in the first place? What else
> would be populating the pcp cache on the isolated cpu? IRQs?

I am mostly trying to deal with drain_all_stock() calling schedule_work_on() on
isolated CPUs. Since the scheduled drain_local_stock() will be competing for CPU
time with the RT workload, we can get preemption of the RT workload, which is a
problem for meeting the deadlines.
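
For context, drain_all_stock() today does roughly this (simplified sketch,
not the exact mm/memcontrol.c code):

static void drain_all_stock(struct mem_cgroup *root_memcg)
{
        int cpu, curcpu = smp_processor_id();

        for_each_online_cpu(cpu) {
                struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

                /* (checks whether this cpu's stock caches root_memcg pages) */

                if (cpu == curcpu)
                        drain_local_stock(&stock->work);
                else
                        /* this is what hurts: 'cpu' may be an isolated cpu */
                        schedule_work_on(cpu, &stock->work);
        }
}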

One way I thought of solving that was introducing a remote drain, which would
require a different locking strategy, since not all accesses to the pcp
caches would happen on the local CPU.
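
Roughly what I have in mind (just a sketch of the idea behind patches #2 and
#3, not the actual code; it assumes stock_lock was already converted from a
local_lock_t to a spinlock_t):

/* called from drain_all_stock() on a housekeeping cpu, instead of doing
 * schedule_work_on(cpu, ...) when 'cpu' is isolated */
static void drain_remote_stock(int cpu)
{
        struct memcg_stock_pcp *stock = per_cpu_ptr(&memcg_stock, cpu);
        unsigned long flags;

        spin_lock_irqsave(&stock->stock_lock, flags);
        drain_stock(stock);
        spin_unlock_irqrestore(&stock->stock_lock, flags);
}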

Then I tried to weigh the costs of this, so the solution would introduce as
little overhead as possible in no-isolation scenarios. Also, for isolation
scenarios, I tried to put most of the overhead onto the housekeeping CPUs, and
the remainder into syscalls, which are also expected to be non-predictable.

Not sure if I could answer your question, though. Please let me know in case I
missed anything.

Thanks for helping me make it more clear!
Best regards,
Leo





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
@ 2022-11-04  8:41           ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2022-11-04  8:41 UTC (permalink / raw)
  To: Leonardo Brás
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Frederic Weisbecker, Phil Auld, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
[...]
> > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > a) The isolated CPU is requesting the stock drain,
> > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > the first time after a remote drain.
> > 
> > And anytime the charging path (consume_stock resp. refill_stock)
> > contends with the remote draining which is out of control of the RT
> > task. It is true that the RT kernel will turn that spin lock into a
> > sleeping RT lock and that could help with potential priority inversions
> > but still quite costly thing I would expect.
> > 
> > > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> > > should not expect the syscalls to be have a predictable time, so it should be
> > > fine.
> > 
> > Now I am not sure I understand. If you do not consider charging path to
> > be RT sensitive then why is this needed in the first place? What else
> > would be populating the pcp cache on the isolated cpu? IRQs?
> 
> I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at
> isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu
> time with the RT workload, we can have preemption of the RT workload, which is a
> problem for meeting the deadlines.

Yes, this is understood. But it is not really clear to me why any
draining would be necessary for such an isolated CPU if no workload other than
the RT one (which presumably doesn't charge any memory?) is running on that
CPU. Is it the RT task during the initialization phase that leaves
that cache behind, or something else? Sorry for being so focused on this,
but I would like to understand whether this is avoidable by a
different startup scheme or whether it really needs to be addressed in some way.

> One way I thought to solve that was introducing a remote drain, which would
> require a different strategy for locking, since not all accesses to the pcp
> caches would happen on a local CPU. 

Yeah, I am not super happy about an additional spin lock TBH. One
potential way to go would be to completely avoid the pcp cache for isolated
CPUs. That would have some performance impact of course, but on the other
hand it would give more predictable behavior for those CPUs, which
sounds like a reasonable compromise to me. What do you think?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
  2022-11-04  8:41           ` Michal Hocko
@ 2022-11-05  1:45             ` Leonardo Brás
  -1 siblings, 0 replies; 28+ messages in thread
From: Leonardo Brás @ 2022-11-05  1:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Frederic Weisbecker, Phil Auld, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote:
> On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> [...]
> > > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > > a) The isolated CPU is requesting the stock drain,
> > > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > > the first time after a remote drain.
> > > 
> > > And anytime the charging path (consume_stock resp. refill_stock)
> > > contends with the remote draining which is out of control of the RT
> > > task. It is true that the RT kernel will turn that spin lock into a
> > > sleeping RT lock and that could help with potential priority inversions
> > > but still quite costly thing I would expect.
> > > 
> > > > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> > > > should not expect the syscalls to be have a predictable time, so it should be
> > > > fine.
> > > 
> > > Now I am not sure I understand. If you do not consider charging path to
> > > be RT sensitive then why is this needed in the first place? What else
> > > would be populating the pcp cache on the isolated cpu? IRQs?
> > 
> > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at
> > isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu
> > time with the RT workload, we can have preemption of the RT workload, which is a
> > problem for meeting the deadlines.
> 
> Yes, this is understood. But it is not really clear to me why would any
> draining be necessary for such an isolated CPU if no workload other than
> the RT (which pressumably doesn't charge any memory?) is running on that
> CPU? Is that the RT task during the initialization phase that leaves
> that cache behind or something else?

(I am new to this part of the code, so please correct me when I miss something.)

IIUC, if a process belongs to a control group with memory control, the 'charge'
will happen when a memory page starts getting used by it.

So, if we assume an RT load on an isolated CPU will not charge any memory, we are
assuming it will never be part of a memory-controlled cgroup.

I mean, can we just assume this? 

If I got that right, wouldn't that be considered a limitation? Like:
"If you don't want your workload to be interrupted by perCPU cache draining,
don't put it in a cgroup with memory control."

> Sorry for being so focused on this
> but I would like to understand on whether this is avoidable by a
> different startup scheme or it really needs to be addressed in some way.

No worries, I am in fact happy you are giving it this much attention :)

I also understand this is a considerable change in the locking strategy, and
avoiding that is the first thing that should be tried.

> 
> > One way I thought to solve that was introducing a remote drain, which would
> > require a different strategy for locking, since not all accesses to the pcp
> > caches would happen on a local CPU. 
> 
> Yeah, I am not supper happy about additional spin lock TBH. One
> potential way to go would be to completely avoid pcp cache for isolated
> CPUs. That would have some performance impact of course but on the other
> hand it would give a more predictable behavior for those CPUs which
> sounds like a reasonable compromise to me. What do you think?

You mean not having a perCPU stock, then?
So consume_stock() for isolated CPUs would always return false, causing
try_charge_memcg() to always walk the slow path?
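
Something like this, I suppose (just a sketch of my understanding of your
proposal; housekeeping_test_cpu() is used here only for illustration, I am
not sure yet what the final check should look like):

static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
        bool ret = false;

        /* isolated cpus never use the per-cpu stock, so charges made there
         * always fall through to the try_charge_memcg() slow path */
        if (!housekeeping_test_cpu(smp_processor_id(), HK_TYPE_DOMAIN))
                return false;

        /* ... the existing local_lock_irqsave() + cached-memcg check stays
         * as it is today and sets 'ret' ... */

        return ret;
}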

IIUC, both my proposal and yours would degrade performance only when we use
isolated CPUs + memcg. Is that correct?

If so, it looks like the impact would be even bigger without the perCPU stock,
compared to introducing a spinlock.

Unless we are counting the case where a remote CPU is draining an isolated
CPU, the isolated CPU faults a page, and has to wait for the spinlock to be
released by the remote CPU. Well, this seems possible, but I would
have to analyze how often it would happen, and how much it would impact the
deadlines. I *guess* most of the RT workload's memory pages are pre-faulted
before it starts, so it can avoid the faulting latency, but I need to confirm
that.
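
(For reference, by pre-faulting I mean the usual RT userspace setup, something
along these lines; the names here are made up just for the example:)

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Pin and touch all memory before the deadline-sensitive loop starts, so it
 * should not page-fault (and therefore not charge) while running. */
static void *prefault_pool(size_t size)
{
        void *pool = malloc(size);

        if (!pool)
                return NULL;

        mlockall(MCL_CURRENT | MCL_FUTURE);     /* keep it all resident */
        memset(pool, 0, size);                  /* fault every page in now */
        return pool;
}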

On the other hand, compared to how it works now, this should be a more
controllable way of introducing latency than a scheduled cache drain.

Your suggestion of no stocks/caches on isolated CPUs would be great for
predictability, but I am almost sure the cost in overall performance would not
be acceptable.

With the possibility of prefaulting pages, do you see any scenario that would
introduce some undesirable latency in the workload?

Thanks a lot for the discussion!
Leo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
@ 2022-11-07  8:10               ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2022-11-07  8:10 UTC (permalink / raw)
  To: Leonardo Brás
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Frederic Weisbecker, Phil Auld, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Fri 04-11-22 22:45:58, Leonardo Brás wrote:
> On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote:
> > On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> > > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> > [...]
> > > > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > > > a) The isolated CPU is requesting the stock drain,
> > > > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > > > the first time after a remote drain.
> > > > 
> > > > And anytime the charging path (consume_stock resp. refill_stock)
> > > > contends with the remote draining which is out of control of the RT
> > > > task. It is true that the RT kernel will turn that spin lock into a
> > > > sleeping RT lock and that could help with potential priority inversions
> > > > but still quite costly thing I would expect.
> > > > 
> > > > > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> > > > > should not expect the syscalls to be have a predictable time, so it should be
> > > > > fine.
> > > > 
> > > > Now I am not sure I understand. If you do not consider charging path to
> > > > be RT sensitive then why is this needed in the first place? What else
> > > > would be populating the pcp cache on the isolated cpu? IRQs?
> > > 
> > > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at
> > > isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu
> > > time with the RT workload, we can have preemption of the RT workload, which is a
> > > problem for meeting the deadlines.
> > 
> > Yes, this is understood. But it is not really clear to me why would any
> > draining be necessary for such an isolated CPU if no workload other than
> > the RT (which pressumably doesn't charge any memory?) is running on that
> > CPU? Is that the RT task during the initialization phase that leaves
> > that cache behind or something else?
> 
> (I am new to this part of the code, so please correct me when I miss something.)
> 
> IIUC, if a process belongs to a control group with memory control, the 'charge'
> will happen when a memory page starts getting used by it.

Yes, very broadly speaking.

> So, if we assume a RT load in a isolated CPU will not charge any memory, we are
> assuming it will never be part of a memory-controlled cgroup.

If the memory cgroup controller is enabled then each user space process
is part of some memcg. If there is no specific memcg assigned then it
will be the root cgroup, which is skipped during most charges except for
kmem.

> I mean, can we just assume this? 
> 
> If I got that right, would not that be considered a limitation? like
> "If you don't want your workload to be interrupted by perCPU cache draining,
> don't put it in a cgroup with memory control".

We definitely do not want userspace to make any assumptions about internal
implementation details like caches.

> > Sorry for being so focused on this
> > but I would like to understand on whether this is avoidable by a
> > different startup scheme or it really needs to be addressed in some way.
> 
> No worries, I am in fact happy you are giving it this much attention :)
> 
> I also understand this is a considerable change in the locking strategy, and
> avoiding that is the first thing that should be tried.
> 
> > 
> > > One way I thought to solve that was introducing a remote drain, which would
> > > require a different strategy for locking, since not all accesses to the pcp
> > > caches would happen on a local CPU. 
> > 
> > Yeah, I am not supper happy about additional spin lock TBH. One
> > potential way to go would be to completely avoid pcp cache for isolated
> > CPUs. That would have some performance impact of course but on the other
> > hand it would give a more predictable behavior for those CPUs which
> > sounds like a reasonable compromise to me. What do you think?
> 
> You mean not having a perCPU stock, then? 
> So consume_stock() for isolated CPUs would always return false, causing
> try_charge_memcg() always walking the slow path?

Exactly.

> IIUC, both my proposal and yours would degrade performance only when we use
> isolated CPUs + memcg. Is that correct?

Yes, with the notable difference that with your spin lock option there is
still a chance that the remote draining could influence the isolated CPU
workload through said spinlock. If no pcp cache is being used for that
cpu then there is no potential interaction at all.

> If so, it looks like the impact would be even bigger without perCPU stock ,
> compared to introducing a spinlock.
> 
> Unless, we are counting to this case where a remote CPU is draining an isolated
> CPU, and the isolated CPU faults a page, and has to wait for the spinlock to be
> released in the remote CPU. Well, this seems possible to happen, but I would
> have to analyze how often would it happen, and how much would it impact the
> deadlines. I *guess* most of the RT workload's memory pages are pre-faulted
> before its starts, so it can avoid the faulting latency, but I need to confirm
> that.

Yes, that is a general practice and the reason why I was asking how real
a problem that is in practice. It is true that apart from user
space memory, which can be under full control of the userspace, there are
kernel allocations which can be done on behalf of the process, and those
could be charged to memcg as well. So I can imagine the pcp cache could
be populated even if the process is not faulting anything in during the RT
sensitive phase.
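
Concretely, any kernel allocation done on the task's behalf during a syscall
can do it; just as an illustration (the helper name is made up):

static void *syscall_helper_alloc(size_t size)
{
        /* charged to the calling task's memcg because of __GFP_ACCOUNT; the
         * charge may go through the per-cpu memcg stock of the current cpu */
        return kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
}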

> On the other hand, compared to how it works now now, this should be a more
> controllable way of introducing latency than a scheduled cache drain.
> 
> Your suggestion on no-stocks/caches in isolated CPUs would be great for
> predictability, but I am almost sure the cost in overall performance would not
> be fine.

It is hard to estimate the overhead without measuring it. Do you think
you can give it a try? If the performance is not really acceptable
(by which I would be really surprised) then we can think of a more complex
solution.

> With the possibility of prefaulting pages, do you see any scenario that would
> introduce some undesirable latency in the workload?

My primary concern would be spin lock contention, which is hard to
predict with something like remote draining.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
  2022-11-07  8:10               ` Michal Hocko
@ 2022-11-08 23:09                 ` Leonardo Brás
  -1 siblings, 0 replies; 28+ messages in thread
From: Leonardo Brás @ 2022-11-08 23:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Frederic Weisbecker, Phil Auld, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Mon, 2022-11-07 at 09:10 +0100, Michal Hocko wrote:
> On Fri 04-11-22 22:45:58, Leonardo Brás wrote:
> > On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote:
> > > On Thu 03-11-22 13:53:41, Leonardo Brás wrote:
> > > > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > > > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote:
> > > [...]
> > > > > > I understand there will be a locking cost being paid in the isolated CPUs when:
> > > > > > a) The isolated CPU is requesting the stock drain,
> > > > > > b) When the isolated CPUs do a syscall and end up using the protected structure
> > > > > > the first time after a remote drain.
> > > > > 
> > > > > And anytime the charging path (consume_stock resp. refill_stock)
> > > > > contends with the remote draining which is out of control of the RT
> > > > > task. It is true that the RT kernel will turn that spin lock into a
> > > > > sleeping RT lock and that could help with potential priority inversions
> > > > > but still quite costly thing I would expect.
> > > > > 
> > > > > > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload
> > > > > > should not expect the syscalls to be have a predictable time, so it should be
> > > > > > fine.
> > > > > 
> > > > > Now I am not sure I understand. If you do not consider charging path to
> > > > > be RT sensitive then why is this needed in the first place? What else
> > > > > would be populating the pcp cache on the isolated cpu? IRQs?
> > > > 
> > > > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at
> > > > isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu
> > > > time with the RT workload, we can have preemption of the RT workload, which is a
> > > > problem for meeting the deadlines.
> > > 
> > > Yes, this is understood. But it is not really clear to me why would any
> > > draining be necessary for such an isolated CPU if no workload other than
> > > the RT (which pressumably doesn't charge any memory?) is running on that
> > > CPU? Is that the RT task during the initialization phase that leaves
> > > that cache behind or something else?
> > 
> > (I am new to this part of the code, so please correct me when I miss something.)
> > 
> > IIUC, if a process belongs to a control group with memory control, the 'charge'
> > will happen when a memory page starts getting used by it.
> 
> Yes, very broadly speaking.
> 
> > So, if we assume a RT load in a isolated CPU will not charge any memory, we are
> > assuming it will never be part of a memory-controlled cgroup.
> 
> If the memory cgroup controler is enabled then each user space process
> is a part of some memcg. If there is no specific memcg assigned then it
> will be a root cgroup and that is skipped during most charges except for
> kmem.

Oh, it makes sense. 
Thanks for helping me understand that! 

> 
> > I mean, can we just assume this? 
> > 
> > If I got that right, would not that be considered a limitation? like
> > "If you don't want your workload to be interrupted by perCPU cache draining,
> > don't put it in a cgroup with memory control".
> 
> We definitely do not want userspace make any assumptions on internal
> implementation details like caches.

Perfect, that was my expectation. 

> 
> > > Sorry for being so focused on this
> > > but I would like to understand on whether this is avoidable by a
> > > different startup scheme or it really needs to be addressed in some way.
> > 
> > No worries, I am in fact happy you are giving it this much attention :)
> > 
> > I also understand this is a considerable change in the locking strategy, and
> > avoiding that is the first thing that should be tried.
> > 
> > > 
> > > > One way I thought to solve that was introducing a remote drain, which would
> > > > require a different strategy for locking, since not all accesses to the pcp
> > > > caches would happen on a local CPU. 
> > > 
> > > Yeah, I am not supper happy about additional spin lock TBH. One
> > > potential way to go would be to completely avoid pcp cache for isolated
> > > CPUs. That would have some performance impact of course but on the other
> > > hand it would give a more predictable behavior for those CPUs which
> > > sounds like a reasonable compromise to me. What do you think?
> > 
> > You mean not having a perCPU stock, then? 
> > So consume_stock() for isolated CPUs would always return false, causing
> > try_charge_memcg() always walking the slow path?
> 
> Exactly.
> 
> > IIUC, both my proposal and yours would degrade performance only when we use
> > isolated CPUs + memcg. Is that correct?
> 
> Yes, with a notable difference that with your spin lock option there is
> still a chance that the remote draining could influence the isolated CPU
> workload throug that said spinlock. If there is no pcp cache for that
> cpu being used then there is no potential interaction at all.

I see. 
But the slow path is slow for some reason, right?
Does it not make use of any locks as well? So in normal operation there could
be a potentially larger impact than a spinlock, even though there would be no
scheduled draining.

> 
> > If so, it looks like the impact would be even bigger without perCPU stock ,
> > compared to introducing a spinlock.
> > 
> > Unless, we are counting to this case where a remote CPU is draining an isolated
> > CPU, and the isolated CPU faults a page, and has to wait for the spinlock to be
> > released in the remote CPU. Well, this seems possible to happen, but I would
> > have to analyze how often would it happen, and how much would it impact the
> > deadlines. I *guess* most of the RT workload's memory pages are pre-faulted
> > before its starts, so it can avoid the faulting latency, but I need to confirm
> > that.
> 
> Yes, that is a general practice and the reason why I was asking how real
> of a problem that is in practice. 

I remember this was one common factor in deadlines being missed in the workload
analyzed. I need to redo the test to be sure.

> It is true true that appart from user
> space memory which can be under full control of the userspace there are
> kernel allocations which can be done on behalf of the process and those
> could be charged to memcg as well. So I can imagine the pcp cache could
> be populated even if the process is not faulting anything in during RT
> sensitive phase.

Hmm, I think I will apply the change and do comparative testing against
upstream. This should give good comparison results.

> 
> > On the other hand, compared to how it works now now, this should be a more
> > controllable way of introducing latency than a scheduled cache drain.
> > 
> > Your suggestion on no-stocks/caches in isolated CPUs would be great for
> > predictability, but I am almost sure the cost in overall performance would not
> > be fine.
> 
> It is hard to estimate the overhead without measuring that. Do you think
> you can give it a try? If the performance is not really acceptable
> (which I would be really surprised) then we can think of a more complex
> solution.

Sure, I can try that.
Do you suggest any specific workload that happens to stress the percpu cache
usage, with the usual drains and so on? Maybe I will also try with synthetic
workloads.

> 
> > With the possibility of prefaulting pages, do you see any scenario that would
> > introduce some undesirable latency in the workload?
> 
> My primary concern would be spin lock contention which is hard to
> predict with something like remote draining.

It makes sense. I will do some testing and come back with results for that.

Thanks for reviewing!
Leo



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
@ 2022-11-09  8:05                   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2022-11-09  8:05 UTC (permalink / raw)
  To: Leonardo Brás
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Frederic Weisbecker, Phil Auld, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Tue 08-11-22 20:09:25, Leonardo Brás wrote:
[...]
> > Yes, with a notable difference that with your spin lock option there is
> > still a chance that the remote draining could influence the isolated CPU
> > workload throug that said spinlock. If there is no pcp cache for that
> > cpu being used then there is no potential interaction at all.
> 
> I see. 
> But the slow path is slow for some reason, right?
> Does not it make use of any locks also? So on normal operation there could be a
> potentially larger impact than a spinlock, even though there would be no
> scheduled draining.

Well, for the regular (try_charge) path that is essentially page_counter_try_charge,
which boils down to an atomic_long_add_return on the memcg counter + all
parents up the hierarchy, plus the high memory limit evaluation (essentially 2
atomic reads for the memcg + all parents up the hierarchy). That is not
a whole lot - especially when the memcg hierarchy is not very deep.
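
To make that concrete, the core of page_counter_try_charge() is roughly
the following loop (simplified sketch, leaving out the protection and
watermark bookkeeping):

	for (c = counter; c; c = c->parent) {
		long new = atomic_long_add_return(nr_pages, &c->usage);

		if (new > c->max) {
			atomic_long_sub(nr_pages, &c->usage);
			*fail = c;
			goto failed;	/* unwinds the charges done so far */
		}
		/* plus a couple of reads per level for the watermark/high checks */
	}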

The per-CPU batch amortizes those per-hierarchy updates as well as the atomic
operations + cache line bouncing on updates.
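
In essence the fast path this buys us is (simplified from consume_stock()
in mm/memcontrol.c, field names approximately as in the current code):

	local_lock_irqsave(&memcg_stock.stock_lock, flags);

	stock = this_cpu_ptr(&memcg_stock);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;	/* no atomics, no hierarchy walk */
		ret = true;
	}

	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);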

On the other hand a spinlock would do the unconditional atomic updates as
well, and even more so on CONFIG_RT. A plus is that the update will be
mostly local, so cache line bouncing shouldn't be terrible. Unless
somebody heavily triggers pcp cache draining, but this shouldn't be all
that common (e.g. when a memcg triggers its limit).

All that being said, I am still not convinced that the pcp cache bypass
for isolated CPUs would make a dramatic difference, especially in the
context of workloads that tend to run on isolated CPUs and rarely enter
the kernel.
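
Conceptually the bypass itself would be a very small change - something
along these lines at the top of consume_stock() (illustrative sketch only;
whether HK_TYPE_DOMAIN is the right housekeeping type here is an open
question, and this is not actual code from the series):

	/* never use the per-CPU stock on isolated CPUs, take the slow path */
	if (!housekeeping_cpu(smp_processor_id(), HK_TYPE_DOMAIN))
		return false;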
 
> > It is true true that appart from user
> > space memory which can be under full control of the userspace there are
> > kernel allocations which can be done on behalf of the process and those
> > could be charged to memcg as well. So I can imagine the pcp cache could
> > be populated even if the process is not faulting anything in during RT
> > sensitive phase.
> 
> Humm, I think I will apply the change and do a comparative testing with
> upstream. This should bring good comparison results.

That would be certainly appreciated!
 
> > > On the other hand, compared to how it works now now, this should be a more
> > > controllable way of introducing latency than a scheduled cache drain.
> > > 
> > > Your suggestion on no-stocks/caches in isolated CPUs would be great for
> > > predictability, but I am almost sure the cost in overall performance would not
> > > be fine.
> > 
> > It is hard to estimate the overhead without measuring that. Do you think
> > you can give it a try? If the performance is not really acceptable
> > (which I would be really surprised) then we can think of a more complex
> > solution.
> 
> Sure, I can try that.
> Do you suggest any specific workload that happens to stress the percpu cache
> usage, with usual drains and so? Maybe I will also try with synthetic worloads
> also.

I really think you want to test it on the isolcpu-aware workload.
Artificial benchmarks are not all that useful in this context.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
@ 2023-01-25  7:44                     ` Leonardo Brás
  0 siblings, 0 replies; 28+ messages in thread
From: Leonardo Brás @ 2023-01-25  7:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Frederic Weisbecker, Phil Auld, Marcelo Tosatti, linux-kernel,
	cgroups, linux-mm

On Wed, 2022-11-09 at 09:05 +0100, Michal Hocko wrote:
> On Tue 08-11-22 20:09:25, Leonardo Brás wrote:
> [...]
> > > Yes, with a notable difference that with your spin lock option there is
> > > still a chance that the remote draining could influence the isolated CPU
> > > workload throug that said spinlock. If there is no pcp cache for that
> > > cpu being used then there is no potential interaction at all.
> > 
> > I see. 
> > But the slow path is slow for some reason, right?
> > Does not it make use of any locks also? So on normal operation there could be a
> > potentially larger impact than a spinlock, even though there would be no
> > scheduled draining.
> 
> Well, for the regular (try_charge) path that is essentially page_counter_try_charge
> which boils down to atomic_long_add_return of the memcg counter + all
> parents up the hierarchy and high memory limit evaluation (essentially 2
> atomic_reads for the memcg + all parents up the hierchy). That is not
> whole of a lot - especially when the memcg hierarchy is not very deep.
> 
> Per cpu batch amortizes those per hierarchy updates as well as atomic
> operations + cache lines bouncing on updates.
> 
> On the other hand spinlock would do the unconditional atomic updates as
> well and even much more on CONFIG_RT. A plus is that the update will be
> mostly local so cache line bouncing shouldn't be terrible. Unless
> somebody heavily triggers pcp cache draining but this shouldn't be all
> that common (e.g. when a memcg triggers its limit.
> 
> All that being said, I am still not convinced that the pcp cache bypass
> for isolated CPUs would make a dramatic difference. Especially in the
> context of workloads that tend to run on isolated CPUs and rarely enter
> kernel.
>  
> > > It is true true that appart from user
> > > space memory which can be under full control of the userspace there are
> > > kernel allocations which can be done on behalf of the process and those
> > > could be charged to memcg as well. So I can imagine the pcp cache could
> > > be populated even if the process is not faulting anything in during RT
> > > sensitive phase.
> > 
> > Humm, I think I will apply the change and do a comparative testing with
> > upstream. This should bring good comparison results.
> 
> That would be certainly appreciated!
>  
> > > > On the other hand, compared to how it works now now, this should be a more
> > > > controllable way of introducing latency than a scheduled cache drain.
> > > > 
> > > > Your suggestion on no-stocks/caches in isolated CPUs would be great for
> > > > predictability, but I am almost sure the cost in overall performance would not
> > > > be fine.
> > > 
> > > It is hard to estimate the overhead without measuring that. Do you think
> > > you can give it a try? If the performance is not really acceptable
> > > (which I would be really surprised) then we can think of a more complex
> > > solution.
> > 
> > Sure, I can try that.
> > Do you suggest any specific workload that happens to stress the percpu cache
> > usage, with usual drains and so? Maybe I will also try with synthetic worloads
> > also.
> 
> I really think you want to test it on the isolcpu aware workload.
> Artificial benchmark are not all that useful in this context.

Hello Michal,
I just sent a v2 for this patchset with a lot of changes.
https://lore.kernel.org/lkml/20230125073502.743446-1-leobras@redhat.com/

I have tried to gather some data on the performance numbers as suggested, but I
got carried away and the cover letter ended up too big. I hope it's not too much
trouble.

Best regards,
Leo




^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-01-25  7:45 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-02  2:02 [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus Leonardo Bras
2022-11-02  2:02 ` Leonardo Bras
2022-11-02  2:02 ` [PATCH v1 1/3] sched/isolation: Add housekeepíng_any_cpu_from() Leonardo Bras
2022-11-02  2:02   ` Leonardo Bras
2022-11-02  2:02 ` [PATCH v1 2/3] mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t Leonardo Bras
2022-11-02  2:02   ` Leonardo Bras
2022-11-02  2:02 ` [PATCH v1 3/3] mm/memcontrol: Add drain_remote_stock(), avoid drain_stock on isolated cpus Leonardo Bras
2022-11-02  2:02   ` Leonardo Bras
2022-11-02  8:53 ` [PATCH v1 0/3] Avoid scheduling cache draining to " Michal Hocko
2022-11-02  8:53   ` Michal Hocko
2022-11-03 14:59   ` Leonardo Brás
2022-11-03 14:59     ` Leonardo Brás
2022-11-03 15:31     ` Michal Hocko
2022-11-03 15:31       ` Michal Hocko
2022-11-03 16:53       ` Leonardo Brás
2022-11-03 16:53         ` Leonardo Brás
2022-11-04  8:41         ` Michal Hocko
2022-11-04  8:41           ` Michal Hocko
2022-11-05  1:45           ` Leonardo Brás
2022-11-05  1:45             ` Leonardo Brás
2022-11-07  8:10             ` Michal Hocko
2022-11-07  8:10               ` Michal Hocko
2022-11-08 23:09               ` Leonardo Brás
2022-11-08 23:09                 ` Leonardo Brás
2022-11-09  8:05                 ` Michal Hocko
2022-11-09  8:05                   ` Michal Hocko
2023-01-25  7:44                   ` Leonardo Brás
2023-01-25  7:44                     ` Leonardo Brás
