* [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it.
@ 2021-12-22 11:41 Sebastian Andrzej Siewior
2021-12-22 11:41 ` [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT Sebastian Andrzej Siewior
` (3 more replies)
0 siblings, 4 replies; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-22 11:41 UTC (permalink / raw)
To: cgroups, linux-mm
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
Thomas Gleixner, Waiman Long, Peter Zijlstra
Hi,
this is a follow up to
https://lkml.kernel.org/r/20211207155208.eyre5svucpg7krxe@linutronix.de
where it was suggested that I should try to address the problems in memcg
itself instead of simply disabling it.
Patch #1 deals with the counters. It has been suggested to simply
disable preemption on RT (like in vmstats) and I followed that advice as
closely as possible. The local_irq_save() could be removed from
mod_memcg_state() and the other wrapper on RT but I leave it since it
does not hurt and it might look nicer ;)
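In essence, the update sections on RT then look like this (just the
result of the __mod_memcg_state() hunk from patch #1, repeated here to
illustrate the pattern):

	void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
	{
		if (mem_cgroup_disabled())
			return;

		/* On RT the per-CPU update is serialised by disabling preemption. */
		if (IS_ENABLED(CONFIG_PREEMPT_RT))
			preempt_disable();

		__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
		memcg_rstat_updated(memcg);

		if (IS_ENABLED(CONFIG_PREEMPT_RT))
			preempt_enable();
	}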
Patch #2 is a follow up to
https://lkml.kernel.org/r/20211214144412.447035-1-longman@redhat.com
Patch #3 restricts the task_obj usage to !PREEMPTION kernels. Based on
my understanding, the use of preempt_disable() minimizes (or even avoids?)
the benefit of the optimisation.
I tested them on CONFIG_PREEMPT_NONE + CONFIG_PREEMPT_RT with the
tools/testing/selftests/cgroup/* tests. It looked good except for the
following (which was also there before the patches):
- test_kmem sometimes complained about:
not ok 2 test_kmem_memcg_deletion
- test_memcontrol always complained about
not ok 3 test_memcg_min
not ok 4 test_memcg_low
and did not finish.
- lockdep complaints were triggered by test_core and test_freezer (both
had to run):
======================================================
WARNING: possible circular locking dependency detected
5.16.0-rc5 #259 Not tainted
------------------------------------------------------
test_core/5996 is trying to acquire lock:
ffffffff829a1258 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x2d/0xb0
but task is already holding lock:
ffff888103034618 (&sighand->siglock){....}-{2:2}, at: get_signal+0x8d/0xdb0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&sighand->siglock){....}-{2:2}:
_raw_spin_lock+0x27/0x40
cgroup_post_fork+0x1f5/0x290
copy_process+0x191b/0x1f80
kernel_clone+0x5a/0x410
__do_sys_clone3+0xb3/0x110
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #0 (css_set_lock){..-.}-{2:2}:
__lock_acquire+0x1253/0x2280
lock_acquire+0xd4/0x2e0
_raw_spin_lock_irqsave+0x36/0x50
obj_cgroup_release+0x2d/0xb0
drain_obj_stock+0x1a9/0x1b0
refill_obj_stock+0x4f/0x220
memcg_slab_free_hook.part.0+0x108/0x290
kmem_cache_free+0xf5/0x3c0
dequeue_signal+0xaf/0x1e0
get_signal+0x232/0xdb0
arch_do_signal_or_restart+0xf8/0x740
exit_to_user_mode_prepare+0x17d/0x270
syscall_exit_to_user_mode+0x19/0x70
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&sighand->siglock);
lock(css_set_lock);
lock(&sighand->siglock);
lock(css_set_lock);
*** DEADLOCK ***
2 locks held by test_core/5996:
#0: ffff888103034618 (&sighand->siglock){....}-{2:2}, at: get_signal+0x8d/0xdb0
#1: ffffffff82905e40 (rcu_read_lock){....}-{1:2}, at: drain_obj_stock+0x71/0x1b0
stack backtrace:
CPU: 2 PID: 5996 Comm: test_core Not tainted 5.16.0-rc5 #259
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x45/0x59
check_noncircular+0xfe/0x110
__lock_acquire+0x1253/0x2280
lock_acquire+0xd4/0x2e0
_raw_spin_lock_irqsave+0x36/0x50
obj_cgroup_release+0x2d/0xb0
drain_obj_stock+0x1a9/0x1b0
refill_obj_stock+0x4f/0x220
memcg_slab_free_hook.part.0+0x108/0x290
kmem_cache_free+0xf5/0x3c0
dequeue_signal+0xaf/0x1e0
get_signal+0x232/0xdb0
arch_do_signal_or_restart+0xf8/0x740
exit_to_user_mode_prepare+0x17d/0x270
syscall_exit_to_user_mode+0x19/0x70
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
Sebastian
* [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
2021-12-22 11:41 [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Sebastian Andrzej Siewior
@ 2021-12-22 11:41 ` Sebastian Andrzej Siewior
2021-12-23 2:31 ` Waiman Long
2022-01-05 14:16 ` Michal Koutný
2021-12-22 11:41 ` [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object Sebastian Andrzej Siewior
` (2 subsequent siblings)
3 siblings, 2 replies; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-22 11:41 UTC (permalink / raw)
To: cgroups, linux-mm
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
Thomas Gleixner, Waiman Long, Peter Zijlstra,
Sebastian Andrzej Siewior
The per-CPU counters are modified with the non-atomic modifier. The
consistency is ensured by disabling interrupts for the update.
This breaks on PREEMPT_RT because some sections additionally
acquire a spinlock_t lock (which becomes a sleeping lock and must not be
acquired with disabled interrupts). Another problem is that
mem_cgroup_swapout() expects to be invoked with disabled interrupts
because the caller has to acquire a spinlock_t which is acquired with
disabled interrupts. Since spinlock_t never disables interrupts on
PREEMPT_RT the interrupts are never disabled at this point.
The code is never called from in_irq() context on PREEMPT_RT therefore
disabling preemption during the update is sufficient on PREEMPT_RT. The
sections with disabled preemption must exclude memcg_check_events() so
that spinlock_t locks can still be acquired (for instance in
eventfd_signal()).
Don't disable interrupts during updates of the per-CPU variables;
instead, use shorter sections which disable preemption.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
mm/memcontrol.c | 74 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 60 insertions(+), 14 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ed5f2a0879d3..d2687d5ed544b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -671,8 +671,14 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
if (mem_cgroup_disabled())
return;
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_disable();
+
__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
memcg_rstat_updated(memcg);
+
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_enable();
}
/* idx can be of type enum memcg_stat_item or node_stat_item. */
@@ -699,6 +705,9 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
memcg = pn->memcg;
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_disable();
+
/* Update memcg */
__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
@@ -706,6 +715,9 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
memcg_rstat_updated(memcg);
+
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_enable();
}
/**
@@ -788,8 +800,13 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
if (mem_cgroup_disabled())
return;
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_disable();
+
__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
memcg_rstat_updated(memcg);
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_enable();
}
static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
@@ -810,6 +827,9 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
int nr_pages)
{
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_disable();
+
/* pagein of a big page is an event. So, ignore page size */
if (nr_pages > 0)
__count_memcg_events(memcg, PGPGIN, 1);
@@ -819,12 +839,19 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
}
__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
+
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_enable();
}
static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
enum mem_cgroup_events_target target)
{
unsigned long val, next;
+ bool ret = false;
+
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_disable();
val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
@@ -841,9 +868,11 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
break;
}
__this_cpu_write(memcg->vmstats_percpu->targets[target], next);
- return true;
+ ret = true;
}
- return false;
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_enable();
+ return ret;
}
/*
@@ -5645,12 +5674,14 @@ static int mem_cgroup_move_account(struct page *page,
ret = 0;
nid = folio_nid(folio);
- local_irq_disable();
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ local_irq_disable();
mem_cgroup_charge_statistics(to, nr_pages);
memcg_check_events(to, nid);
mem_cgroup_charge_statistics(from, -nr_pages);
memcg_check_events(from, nid);
- local_irq_enable();
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ local_irq_enable();
out_unlock:
folio_unlock(folio);
out:
@@ -6670,10 +6701,12 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
css_get(&memcg->css);
commit_charge(folio, memcg);
- local_irq_disable();
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ local_irq_disable();
mem_cgroup_charge_statistics(memcg, nr_pages);
memcg_check_events(memcg, folio_nid(folio));
- local_irq_enable();
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ local_irq_enable();
out:
return ret;
}
@@ -6785,11 +6818,20 @@ static void uncharge_batch(const struct uncharge_gather *ug)
memcg_oom_recover(ug->memcg);
}
- local_irq_save(flags);
- __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
- __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
- memcg_check_events(ug->memcg, ug->nid);
- local_irq_restore(flags);
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
+ local_irq_save(flags);
+ __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
+ __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+ memcg_check_events(ug->memcg, ug->nid);
+ local_irq_restore(flags);
+ } else {
+ preempt_disable();
+ __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
+ __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+ preempt_enable();
+
+ memcg_check_events(ug->memcg, ug->nid);
+ }
/* drop reference from uncharge_folio */
css_put(&ug->memcg->css);
@@ -6930,10 +6972,12 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
css_get(&memcg->css);
commit_charge(new, memcg);
- local_irq_save(flags);
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ local_irq_save(flags);
mem_cgroup_charge_statistics(memcg, nr_pages);
memcg_check_events(memcg, folio_nid(new));
- local_irq_restore(flags);
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ local_irq_restore(flags);
}
DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
@@ -7157,8 +7201,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
* i_pages lock which is taken with interrupts-off. It is
* important here to have the interrupts disabled because it is the
* only synchronisation we have for updating the per-CPU variables.
+ * On PREEMPT_RT interrupts are never disabled and the updates to per-CPU
+ * variables are synchronised by keeping preemption disabled.
*/
- VM_BUG_ON(!irqs_disabled());
+ VM_BUG_ON(!IS_ENABLED(CONFIG_PREEMPT_RT) && !irqs_disabled());
mem_cgroup_charge_statistics(memcg, -nr_entries);
memcg_check_events(memcg, page_to_nid(page));
--
2.34.1
* [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object.
2021-12-22 11:41 [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Sebastian Andrzej Siewior
2021-12-22 11:41 ` [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT Sebastian Andrzej Siewior
@ 2021-12-22 11:41 ` Sebastian Andrzej Siewior
2021-12-23 21:38 ` Waiman Long
2021-12-22 11:41 ` [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels Sebastian Andrzej Siewior
2022-01-05 14:59 ` [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Michal Koutný
3 siblings, 1 reply; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-22 11:41 UTC (permalink / raw)
To: cgroups, linux-mm
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
Thomas Gleixner, Waiman Long, Peter Zijlstra,
Sebastian Andrzej Siewior
The members of the per-CPU structure memcg_stock_pcp are protected
either by disabling interrupts or by disabling preemption if the
invocation occurred in process context.
Disabling interrupts protects most of the structure excluding task_obj
while disabling preemption protects only task_obj.
This scheme is incompatible with PREEMPT_RT because it creates atomic
context in which actions are performed that require a preemptible
context. One example is obj_cgroup_release().
The IRQ-disable and preempt-disable sections can be replaced with
local_lock_t which preserves the explicit disabling of interrupts while
keeping the code preemptible on PREEMPT_RT.
The task_obj has been added for performance reasons on non-preemptible
kernels where preempt_disable() is a NOP. On the PREEMPT_RT preemption
model preempt_disable() is always implemented. Also there are no memory
allocations from in_irq() context and softirqs are processed in (preemptible)
process context. Therefore it makes sense to avoid using task_obj.
Don't use task_obj on PREEMPT_RT and replace manual disabling of
interrupts with a local_lock_t. This change requires some factoring:
- drain_obj_stock() drops a reference on obj_cgroup which leads to an
invocation of obj_cgroup_release() if it is the last object. This in
turn leads to recursive locking of the local_lock_t. To avoid this,
obj_cgroup_release() is invoked outside of the locked section.
- drain_obj_stock() gets a memcg_stock_pcp passed if the stock_lock has been
acquired (instead of the task_obj_lock) to avoid recursive locking later
in refill_stock().
- drain_all_stock() disables preemption via get_cpu() and then invokes
drain_local_stock() if it is the local CPU to avoid scheduling a worker
(which invokes the same function). Disabling preemption here is
problematic due to the sleeping locks in drain_local_stock().
This can be avoided by always scheduling a worker, even for the local
CPU. Using cpus_read_lock() stabilizes cpu_online_mask which ensures
that no worker is scheduled for an offline CPU. Since there is no
flush_work(), it is still possible that a worker is invoked on the wrong
CPU but it is okay since it operates always on the local-CPU data.
- drain_local_stock() is always invoked as a worker so it can be optimized
by removing in_task() (it is always true) and avoiding the "irq_save"
variant because interrupts are always enabled here. Operating on
task_obj first allows acquiring the local_lock_t without lockdep
complaints.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
mm/memcontrol.c | 171 +++++++++++++++++++++++++++++++-----------------
1 file changed, 112 insertions(+), 59 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d2687d5ed544b..1e76f26be2c15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -261,8 +261,10 @@ bool mem_cgroup_kmem_disabled(void)
return cgroup_memory_nokmem;
}
+struct memcg_stock_pcp;
static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
- unsigned int nr_pages);
+ unsigned int nr_pages,
+ struct memcg_stock_pcp *stock_pcp);
static void obj_cgroup_release(struct percpu_ref *ref)
{
@@ -296,7 +298,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
nr_pages = nr_bytes >> PAGE_SHIFT;
if (nr_pages)
- obj_cgroup_uncharge_pages(objcg, nr_pages);
+ obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
spin_lock_irqsave(&css_set_lock, flags);
list_del(&objcg->list);
@@ -2120,26 +2122,40 @@ struct obj_stock {
};
struct memcg_stock_pcp {
+ /* Protects memcg_stock_pcp */
+ local_lock_t stock_lock;
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
+#ifndef CONFIG_PREEMPT_RT
+ /* Protects only task_obj */
+ local_lock_t task_obj_lock;
struct obj_stock task_obj;
+#endif
struct obj_stock irq_obj;
struct work_struct work;
unsigned long flags;
#define FLUSHING_CACHED_CHARGE 0
};
-static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
+static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
+ .stock_lock = INIT_LOCAL_LOCK(stock_lock),
+#ifndef CONFIG_PREEMPT_RT
+ .task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
+#endif
+};
static DEFINE_MUTEX(percpu_charge_mutex);
#ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct obj_stock *stock);
+static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+ struct memcg_stock_pcp *stock_pcp);
static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
struct mem_cgroup *root_memcg);
#else
-static inline void drain_obj_stock(struct obj_stock *stock)
+static inline struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+ struct memcg_stock_pcp *stock_pcp)
{
+ return NULL;
}
static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
struct mem_cgroup *root_memcg)
@@ -2168,7 +2184,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
if (nr_pages > MEMCG_CHARGE_BATCH)
return ret;
- local_irq_save(flags);
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
stock = this_cpu_ptr(&memcg_stock);
if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
@@ -2176,7 +2192,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
ret = true;
}
- local_irq_restore(flags);
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
return ret;
}
@@ -2204,38 +2220,43 @@ static void drain_stock(struct memcg_stock_pcp *stock)
static void drain_local_stock(struct work_struct *dummy)
{
- struct memcg_stock_pcp *stock;
- unsigned long flags;
+ struct memcg_stock_pcp *stock_pcp;
+ struct obj_cgroup *old;
/*
* The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
* drain_stock races is that we always operate on local CPU stock
* here with IRQ disabled
*/
- local_irq_save(flags);
+#ifndef CONFIG_PREEMPT_RT
+ local_lock(&memcg_stock.task_obj_lock);
+ old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
+ local_unlock(&memcg_stock.task_obj_lock);
+ if (old)
+ obj_cgroup_put(old);
+#endif
- stock = this_cpu_ptr(&memcg_stock);
- drain_obj_stock(&stock->irq_obj);
- if (in_task())
- drain_obj_stock(&stock->task_obj);
- drain_stock(stock);
- clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
+ local_lock_irq(&memcg_stock.stock_lock);
+ stock_pcp = this_cpu_ptr(&memcg_stock);
+ old = drain_obj_stock(&stock_pcp->irq_obj, stock_pcp);
- local_irq_restore(flags);
+ drain_stock(stock_pcp);
+ clear_bit(FLUSHING_CACHED_CHARGE, &stock_pcp->flags);
+
+ local_unlock_irq(&memcg_stock.stock_lock);
+ if (old)
+ obj_cgroup_put(old);
}
/*
* Cache charges(val) to local per_cpu area.
* This will be consumed by consume_stock() function, later.
*/
-static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+ struct memcg_stock_pcp *stock)
{
- struct memcg_stock_pcp *stock;
- unsigned long flags;
+ lockdep_assert_held(&stock->stock_lock);
- local_irq_save(flags);
-
- stock = this_cpu_ptr(&memcg_stock);
if (stock->cached != memcg) { /* reset if necessary */
drain_stock(stock);
css_get(&memcg->css);
@@ -2245,8 +2266,20 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
if (stock->nr_pages > MEMCG_CHARGE_BATCH)
drain_stock(stock);
+}
- local_irq_restore(flags);
+static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+ struct memcg_stock_pcp *stock_pcp)
+{
+ unsigned long flags;
+
+ if (stock_pcp) {
+ __refill_stock(memcg, nr_pages, stock_pcp);
+ return;
+ }
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ __refill_stock(memcg, nr_pages, this_cpu_ptr(&memcg_stock));
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}
/*
@@ -2255,7 +2288,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
*/
static void drain_all_stock(struct mem_cgroup *root_memcg)
{
- int cpu, curcpu;
+ int cpu;
/* If someone's already draining, avoid adding running more workers. */
if (!mutex_trylock(&percpu_charge_mutex))
@@ -2266,7 +2299,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
* as well as workers from this path always operate on the local
* per-cpu data. CPU up doesn't touch memcg_stock at all.
*/
- curcpu = get_cpu();
+ cpus_read_lock();
for_each_online_cpu(cpu) {
struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
struct mem_cgroup *memcg;
@@ -2282,14 +2315,10 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
rcu_read_unlock();
if (flush &&
- !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
- if (cpu == curcpu)
- drain_local_stock(&stock->work);
- else
- schedule_work_on(cpu, &stock->work);
- }
+ !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+ schedule_work_on(cpu, &stock->work);
}
- put_cpu();
+ cpus_read_unlock();
mutex_unlock(&percpu_charge_mutex);
}
@@ -2690,7 +2719,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
done_restock:
if (batch > nr_pages)
- refill_stock(memcg, batch - nr_pages);
+ refill_stock(memcg, batch - nr_pages, NULL);
/*
* If the hierarchy is above the normal consumption range, schedule
@@ -2803,28 +2832,35 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
* can only be accessed after disabling interrupt. User context code can
* access interrupt object stock, but not vice versa.
*/
-static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
+static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
+ struct memcg_stock_pcp **stock_pcp)
{
struct memcg_stock_pcp *stock;
+#ifndef CONFIG_PREEMPT_RT
if (likely(in_task())) {
*pflags = 0UL;
- preempt_disable();
+ *stock_pcp = NULL;
+ local_lock(&memcg_stock.task_obj_lock);
stock = this_cpu_ptr(&memcg_stock);
return &stock->task_obj;
}
-
- local_irq_save(*pflags);
+#endif
+ local_lock_irqsave(&memcg_stock.stock_lock, *pflags);
stock = this_cpu_ptr(&memcg_stock);
+ *stock_pcp = stock;
return &stock->irq_obj;
}
-static inline void put_obj_stock(unsigned long flags)
+static inline void put_obj_stock(unsigned long flags,
+ struct memcg_stock_pcp *stock_pcp)
{
- if (likely(in_task()))
- preempt_enable();
+#ifndef CONFIG_PREEMPT_RT
+ if (likely(!stock_pcp))
+ local_unlock(&memcg_stock.task_obj_lock);
else
- local_irq_restore(flags);
+#endif
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}
/*
@@ -3002,7 +3038,8 @@ static void memcg_free_cache_id(int id)
* @nr_pages: number of pages to uncharge
*/
static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
- unsigned int nr_pages)
+ unsigned int nr_pages,
+ struct memcg_stock_pcp *stock_pcp)
{
struct mem_cgroup *memcg;
@@ -3010,7 +3047,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
page_counter_uncharge(&memcg->kmem, nr_pages);
- refill_stock(memcg, nr_pages);
+ refill_stock(memcg, nr_pages, stock_pcp);
css_put(&memcg->css);
}
@@ -3084,7 +3121,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
return;
objcg = __folio_objcg(folio);
- obj_cgroup_uncharge_pages(objcg, nr_pages);
+ obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
folio->memcg_data = 0;
obj_cgroup_put(objcg);
}
@@ -3092,17 +3129,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
enum node_stat_item idx, int nr)
{
+ struct memcg_stock_pcp *stock_pcp;
unsigned long flags;
- struct obj_stock *stock = get_obj_stock(&flags);
+ struct obj_cgroup *old = NULL;
+ struct obj_stock *stock;
int *bytes;
+ stock = get_obj_stock(&flags, &stock_pcp);
/*
* Save vmstat data in stock and skip vmstat array update unless
* accumulating over a page of vmstat data or when pgdat or idx
* changes.
*/
if (stock->cached_objcg != objcg) {
- drain_obj_stock(stock);
+ old = drain_obj_stock(stock, stock_pcp);
+
obj_cgroup_get(objcg);
stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
@@ -3146,38 +3187,43 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
if (nr)
mod_objcg_mlstate(objcg, pgdat, idx, nr);
- put_obj_stock(flags);
+ put_obj_stock(flags, stock_pcp);
+ if (old)
+ obj_cgroup_put(old);
}
static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
{
+ struct memcg_stock_pcp *stock_pcp;
unsigned long flags;
- struct obj_stock *stock = get_obj_stock(&flags);
+ struct obj_stock *stock;
bool ret = false;
+ stock = get_obj_stock(&flags, &stock_pcp);
if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
stock->nr_bytes -= nr_bytes;
ret = true;
}
- put_obj_stock(flags);
+ put_obj_stock(flags, stock_pcp);
return ret;
}
-static void drain_obj_stock(struct obj_stock *stock)
+static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+ struct memcg_stock_pcp *stock_pcp)
{
struct obj_cgroup *old = stock->cached_objcg;
if (!old)
- return;
+ return NULL;
if (stock->nr_bytes) {
unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
if (nr_pages)
- obj_cgroup_uncharge_pages(old, nr_pages);
+ obj_cgroup_uncharge_pages(old, nr_pages, stock_pcp);
/*
* The leftover is flushed to the centralized per-memcg value.
@@ -3212,8 +3258,8 @@ static void drain_obj_stock(struct obj_stock *stock)
stock->cached_pgdat = NULL;
}
- obj_cgroup_put(old);
stock->cached_objcg = NULL;
+ return old;
}
static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -3221,11 +3267,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
{
struct mem_cgroup *memcg;
+#ifndef CONFIG_PREEMPT_RT
if (in_task() && stock->task_obj.cached_objcg) {
memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
return true;
}
+#endif
if (stock->irq_obj.cached_objcg) {
memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
@@ -3238,12 +3286,15 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
bool allow_uncharge)
{
+ struct memcg_stock_pcp *stock_pcp;
unsigned long flags;
- struct obj_stock *stock = get_obj_stock(&flags);
+ struct obj_stock *stock;
unsigned int nr_pages = 0;
+ struct obj_cgroup *old = NULL;
+ stock = get_obj_stock(&flags, &stock_pcp);
if (stock->cached_objcg != objcg) { /* reset if necessary */
- drain_obj_stock(stock);
+ old = drain_obj_stock(stock, stock_pcp);
obj_cgroup_get(objcg);
stock->cached_objcg = objcg;
stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
@@ -3257,10 +3308,12 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
stock->nr_bytes &= (PAGE_SIZE - 1);
}
- put_obj_stock(flags);
+ put_obj_stock(flags, stock_pcp);
+ if (old)
+ obj_cgroup_put(old);
if (nr_pages)
- obj_cgroup_uncharge_pages(objcg, nr_pages);
+ obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
}
int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -7061,7 +7114,7 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
- refill_stock(memcg, nr_pages);
+ refill_stock(memcg, nr_pages, NULL);
}
static int __init cgroup_memory(char *s)
--
2.34.1
* [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
2021-12-22 11:41 [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Sebastian Andrzej Siewior
2021-12-22 11:41 ` [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT Sebastian Andrzej Siewior
2021-12-22 11:41 ` [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object Sebastian Andrzej Siewior
@ 2021-12-22 11:41 ` Sebastian Andrzej Siewior
2021-12-23 21:48 ` Waiman Long
2022-01-05 14:59 ` [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Michal Koutný
3 siblings, 1 reply; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-22 11:41 UTC (permalink / raw)
To: cgroups, linux-mm
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
Thomas Gleixner, Waiman Long, Peter Zijlstra,
Sebastian Andrzej Siewior
Based on my understanding the optimisation with task_obj for in_task()
makes sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
is optimized away. This could then be restricted to !CONFIG_PREEMPTION
kernels instead of only PREEMPT_RT.
With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
configured but these kernels always have preempt_disable()/enable()
present, so the optimisation probably makes no sense here.
Restrict the optimisation to !CONFIG_PREEMPTION kernels.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
mm/memcontrol.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1e76f26be2c15..92180f1aa9edc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2126,7 +2126,7 @@ struct memcg_stock_pcp {
local_lock_t stock_lock;
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
/* Protects only task_obj */
local_lock_t task_obj_lock;
struct obj_stock task_obj;
@@ -2139,7 +2139,7 @@ struct memcg_stock_pcp {
};
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
.stock_lock = INIT_LOCAL_LOCK(stock_lock),
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
.task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
#endif
};
@@ -2228,7 +2228,7 @@ static void drain_local_stock(struct work_struct *dummy)
* drain_stock races is that we always operate on local CPU stock
* here with IRQ disabled
*/
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
local_lock(&memcg_stock.task_obj_lock);
old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
local_unlock(&memcg_stock.task_obj_lock);
@@ -2837,7 +2837,7 @@ static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
{
struct memcg_stock_pcp *stock;
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
if (likely(in_task())) {
*pflags = 0UL;
*stock_pcp = NULL;
@@ -2855,7 +2855,7 @@ static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
static inline void put_obj_stock(unsigned long flags,
struct memcg_stock_pcp *stock_pcp)
{
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
if (likely(!stock_pcp))
local_unlock(&memcg_stock.task_obj_lock);
else
@@ -3267,7 +3267,7 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
{
struct mem_cgroup *memcg;
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
if (in_task() && stock->task_obj.cached_objcg) {
memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
--
2.34.1
* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
2021-12-22 11:41 ` [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT Sebastian Andrzej Siewior
@ 2021-12-23 2:31 ` Waiman Long
2021-12-23 7:34 ` Sebastian Andrzej Siewior
2022-01-05 14:16 ` Michal Koutný
1 sibling, 1 reply; 24+ messages in thread
From: Waiman Long @ 2021-12-23 2:31 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, cgroups, linux-mm
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
Thomas Gleixner, Peter Zijlstra
On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> The per-CPU counters are modified with the non-atomic modifier. The
> consistency is ensured by disabling interrupts for the update.
> This breaks on PREEMPT_RT because some sections additionally
> acquire a spinlock_t lock (which becomes a sleeping lock and must not be
> acquired with disabled interrupts). Another problem is that
> mem_cgroup_swapout() expects to be invoked with disabled interrupts
> because the caller has to acquire a spinlock_t which is acquired with
> disabled interrupts. Since spinlock_t never disables interrupts on
> PREEMPT_RT the interrupts are never disabled at this point.
>
> The code is never called from in_irq() context on PREEMPT_RT therefore
How do you guarantee that these percpu update functions won't be called
in in_irq() context for PREEMPT_RT? Do you think we should add a
WARN_ON_ONCE(in_irq()) just to be sure?
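E.g. something like this (untested, just to illustrate what I mean) in
__mod_memcg_state() and the other helpers:

	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
		/* Catch unexpected hard-IRQ callers on RT. */
		WARN_ON_ONCE(in_irq());
		preempt_disable();
	}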
Cheers,
Longman
* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
2021-12-23 2:31 ` Waiman Long
@ 2021-12-23 7:34 ` Sebastian Andrzej Siewior
2021-12-23 16:01 ` Waiman Long
0 siblings, 1 reply; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-23 7:34 UTC (permalink / raw)
To: Waiman Long
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra
On 2021-12-22 21:31:36 [-0500], Waiman Long wrote:
> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> > The per-CPU counters are modified with the non-atomic modifier. The
> > consistency is ensured by disabling interrupts for the update.
> > This breaks on PREEMPT_RT because some sections additionally
> > acquire a spinlock_t lock (which becomes a sleeping lock and must not be
> > acquired with disabled interrupts). Another problem is that
> > mem_cgroup_swapout() expects to be invoked with disabled interrupts
> > because the caller has to acquire a spinlock_t which is acquired with
> > disabled interrupts. Since spinlock_t never disables interrupts on
> > PREEMPT_RT the interrupts are never disabled at this point.
> >
> > The code is never called from in_irq() context on PREEMPT_RT therefore
>
> How do you guarantee that these percpu update functions won't be called in
> in_irq() context for PREEMPT_RT? Do you think we should add a
> WARN_ON_ONCE(in_irq()) just to be sure?
There are no invocations of the memory allocator (neither malloc() nor
free()) from in_irq() context on RT and the memory allocator itself (both
SLUB and the page allocator) has sleeping locks. That means invocations
from in_atomic() context are bad. All interrupt handlers are force-threaded.
Those which are not (like timer or per-CPU interrupts, or those which
explicitly asked not to be force-threaded) are limited in what they can do
as they can't invoke anything that has a sleeping lock. Lockdep or
CONFIG_DEBUG_ATOMIC_SLEEP will yell here.
The other counters are protected the same way, see
c68ed7945701a ("mm/vmstat: protect per cpu variables with preempt disable on RT")
> Cheers,
> Longman
Sebastian
* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
2021-12-23 7:34 ` Sebastian Andrzej Siewior
@ 2021-12-23 16:01 ` Waiman Long
0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2021-12-23 16:01 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra
On 12/23/21 02:34, Sebastian Andrzej Siewior wrote:
> On 2021-12-22 21:31:36 [-0500], Waiman Long wrote:
>> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
>>> The per-CPU counters are modified with the non-atomic modifier. The
>>> consistency is ensured by disabling interrupts for the update.
>>> This breaks on PREEMPT_RT because some sections additionally
>>> acquire a spinlock_t lock (which becomes a sleeping lock and must not be
>>> acquired with disabled interrupts). Another problem is that
>>> mem_cgroup_swapout() expects to be invoked with disabled interrupts
>>> because the caller has to acquire a spinlock_t which is acquired with
>>> disabled interrupts. Since spinlock_t never disables interrupts on
>>> PREEMPT_RT the interrupts are never disabled at this point.
>>>
>>> The code is never called from in_irq() context on PREEMPT_RT therefore
>> How do you guarantee that these percpu update functions won't be called in
>> in_irq() context for PREEMPT_RT? Do you think we should add a
>> WARN_ON_ONCE(in_irq()) just to be sure?
> There are no invocations of the memory allocator (neither malloc() nor
> free()) from in_irq() context on RT and the memory allocator itself (both
> SLUB and the page allocator) has sleeping locks. That means invocations
> from in_atomic() context are bad. All interrupt handlers are force-threaded.
> Those which are not (like timer or per-CPU interrupts, or those which
> explicitly asked not to be force-threaded) are limited in what they can do
> as they can't invoke anything that has a sleeping lock. Lockdep or
> CONFIG_DEBUG_ATOMIC_SLEEP will yell here.
> The other counters are protected the same way, see
> c68ed7945701a ("mm/vmstat: protect per cpu variables with preempt disable on RT")
Thanks for the explanation as I am less familiar with other PREEMPT_RT
specific changes.
Cheers,
Longman
* Re: [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object.
2021-12-22 11:41 ` [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object Sebastian Andrzej Siewior
@ 2021-12-23 21:38 ` Waiman Long
2022-01-03 16:34 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2021-12-23 21:38 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, cgroups, linux-mm
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
Thomas Gleixner, Peter Zijlstra
On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> The members of the per-CPU structure memcg_stock_pcp are protected
> either by disabling interrupts or by disabling preemption if the
> invocation occurred in process context.
> Disabling interrupts protects most of the structure excluding task_obj
> while disabling preemption protects only task_obj.
> This scheme is incompatible with PREEMPT_RT because it creates atomic
> context in which actions are performed that require a preemptible
> context. One example is obj_cgroup_release().
>
> The IRQ-disable and preempt-disable sections can be replaced with
> local_lock_t which preserves the explicit disabling of interrupts while
> keeping the code preemptible on PREEMPT_RT.
>
> The task_obj has been added for performance reasons on non-preemptible
> kernels where preempt_disable() is a NOP. On the PREEMPT_RT preemption
> model preempt_disable() is always implemented. Also there are no memory
> allocations from in_irq() context and softirqs are processed in (preemptible)
> process context. Therefore it makes sense to avoid using task_obj.
>
> Don't use task_obj on PREEMPT_RT and replace manual disabling of
> interrupts with a local_lock_t. This change requires some factoring:
>
> - drain_obj_stock() drops a reference on obj_cgroup which leads to an
> invocation of obj_cgroup_release() if it is the last object. This in
> turn leads to recursive locking of the local_lock_t. To avoid this,
> obj_cgroup_release() is invoked outside of the locked section.
>
> - drain_obj_stock() gets a memcg_stock_pcp passed if the stock_lock has been
> acquired (instead of the task_obj_lock) to avoid recursive locking later
> in refill_stock().
>
> - drain_all_stock() disables preemption via get_cpu() and then invokes
> drain_local_stock() if it is the local CPU to avoid scheduling a worker
> (which invokes the same function). Disabling preemption here is
> problematic due to the sleeping locks in drain_local_stock().
> This can be avoided by always scheduling a worker, even for the local
> CPU. Using cpus_read_lock() stabilizes cpu_online_mask which ensures
> that no worker is scheduled for an offline CPU. Since there is no
> flush_work(), it is still possible that a worker is invoked on the wrong
> CPU but it is okay since it operates always on the local-CPU data.
>
> - drain_local_stock() is always invoked as a worker so it can be optimized
> by removing in_task() (it is always true) and avoiding the "irq_save"
> variant because interrupts are always enabled here. Operating on
> task_obj first allows acquiring the local_lock_t without lockdep
> complaints.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> mm/memcontrol.c | 171 +++++++++++++++++++++++++++++++-----------------
> 1 file changed, 112 insertions(+), 59 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d2687d5ed544b..1e76f26be2c15 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -261,8 +261,10 @@ bool mem_cgroup_kmem_disabled(void)
> return cgroup_memory_nokmem;
> }
>
> +struct memcg_stock_pcp;
> static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
> - unsigned int nr_pages);
> + unsigned int nr_pages,
> + struct memcg_stock_pcp *stock_pcp);
AFAICS, stock_pcp is set to indicate that the stock_lock has been
acquired. Since stock_pcp, if set, should be the same as
this_cpu_ptr(&memcg_stock), it is a bit confusing to pass it to a
function that also does a percpu access to memcg_stock. Why don't you
just pass a boolean, say, stock_locked to indicate this instead. It will
be more clear and less confusing.
>
> static void obj_cgroup_release(struct percpu_ref *ref)
> {
> @@ -296,7 +298,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
> nr_pages = nr_bytes >> PAGE_SHIFT;
>
> if (nr_pages)
> - obj_cgroup_uncharge_pages(objcg, nr_pages);
> + obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
>
> spin_lock_irqsave(&css_set_lock, flags);
> list_del(&objcg->list);
> @@ -2120,26 +2122,40 @@ struct obj_stock {
> };
>
> struct memcg_stock_pcp {
> + /* Protects memcg_stock_pcp */
> + local_lock_t stock_lock;
> struct mem_cgroup *cached; /* this never be root cgroup */
> unsigned int nr_pages;
> +#ifndef CONFIG_PREEMPT_RT
> + /* Protects only task_obj */
> + local_lock_t task_obj_lock;
> struct obj_stock task_obj;
> +#endif
> struct obj_stock irq_obj;
>
> struct work_struct work;
> unsigned long flags;
> #define FLUSHING_CACHED_CHARGE 0
> };
> -static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
> + .stock_lock = INIT_LOCAL_LOCK(stock_lock),
> +#ifndef CONFIG_PREEMPT_RT
> + .task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
> +#endif
> +};
> static DEFINE_MUTEX(percpu_charge_mutex);
>
> #ifdef CONFIG_MEMCG_KMEM
> -static void drain_obj_stock(struct obj_stock *stock);
> +static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
> + struct memcg_stock_pcp *stock_pcp);
> static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> struct mem_cgroup *root_memcg);
>
> #else
> -static inline void drain_obj_stock(struct obj_stock *stock)
> +static inline struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
> + struct memcg_stock_pcp *stock_pcp)
> {
> + return NULL;
> }
> static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> struct mem_cgroup *root_memcg)
> @@ -2168,7 +2184,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> if (nr_pages > MEMCG_CHARGE_BATCH)
> return ret;
>
> - local_irq_save(flags);
> + local_lock_irqsave(&memcg_stock.stock_lock, flags);
>
> stock = this_cpu_ptr(&memcg_stock);
> if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
> @@ -2176,7 +2192,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> ret = true;
> }
>
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
>
> return ret;
> }
> @@ -2204,38 +2220,43 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>
> static void drain_local_stock(struct work_struct *dummy)
> {
> - struct memcg_stock_pcp *stock;
> - unsigned long flags;
> + struct memcg_stock_pcp *stock_pcp;
> + struct obj_cgroup *old;
>
> /*
> * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
> * drain_stock races is that we always operate on local CPU stock
> * here with IRQ disabled
> */
> - local_irq_save(flags);
> +#ifndef CONFIG_PREEMPT_RT
> + local_lock(&memcg_stock.task_obj_lock);
> + old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
> + local_unlock(&memcg_stock.task_obj_lock);
> + if (old)
> + obj_cgroup_put(old);
> +#endif
>
> - stock = this_cpu_ptr(&memcg_stock);
> - drain_obj_stock(&stock->irq_obj);
> - if (in_task())
> - drain_obj_stock(&stock->task_obj);
> - drain_stock(stock);
> - clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> + local_lock_irq(&memcg_stock.stock_lock);
> + stock_pcp = this_cpu_ptr(&memcg_stock);
> + old = drain_obj_stock(&stock_pcp->irq_obj, stock_pcp);
>
> - local_irq_restore(flags);
> + drain_stock(stock_pcp);
> + clear_bit(FLUSHING_CACHED_CHARGE, &stock_pcp->flags);
> +
> + local_unlock_irq(&memcg_stock.stock_lock);
> + if (old)
> + obj_cgroup_put(old);
> }
>
> /*
> * Cache charges(val) to local per_cpu area.
> * This will be consumed by consume_stock() function, later.
> */
> -static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> +static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> + struct memcg_stock_pcp *stock)
> {
> - struct memcg_stock_pcp *stock;
> - unsigned long flags;
> + lockdep_assert_held(&stock->stock_lock);
>
> - local_irq_save(flags);
> -
> - stock = this_cpu_ptr(&memcg_stock);
> if (stock->cached != memcg) { /* reset if necessary */
> drain_stock(stock);
> css_get(&memcg->css);
> @@ -2245,8 +2266,20 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>
> if (stock->nr_pages > MEMCG_CHARGE_BATCH)
> drain_stock(stock);
> +}
>
> - local_irq_restore(flags);
> +static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> + struct memcg_stock_pcp *stock_pcp)
> +{
> + unsigned long flags;
> +
> + if (stock_pcp) {
> + __refill_stock(memcg, nr_pages, stock_pcp);
> + return;
> + }
> + local_lock_irqsave(&memcg_stock.stock_lock, flags);
> + __refill_stock(memcg, nr_pages, this_cpu_ptr(&memcg_stock));
> + local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
> }
>
> /*
> @@ -2255,7 +2288,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> */
> static void drain_all_stock(struct mem_cgroup *root_memcg)
> {
> - int cpu, curcpu;
> + int cpu;
>
> /* If someone's already draining, avoid adding running more workers. */
> if (!mutex_trylock(&percpu_charge_mutex))
> @@ -2266,7 +2299,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> * as well as workers from this path always operate on the local
> * per-cpu data. CPU up doesn't touch memcg_stock at all.
> */
> - curcpu = get_cpu();
> + cpus_read_lock();
> for_each_online_cpu(cpu) {
> struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> struct mem_cgroup *memcg;
> @@ -2282,14 +2315,10 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> rcu_read_unlock();
>
> if (flush &&
> - !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> - if (cpu == curcpu)
> - drain_local_stock(&stock->work);
> - else
> - schedule_work_on(cpu, &stock->work);
> - }
> + !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> + schedule_work_on(cpu, &stock->work);
> }
> - put_cpu();
> + cpus_read_unlock();
> mutex_unlock(&percpu_charge_mutex);
> }
>
> @@ -2690,7 +2719,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
> done_restock:
> if (batch > nr_pages)
> - refill_stock(memcg, batch - nr_pages);
> + refill_stock(memcg, batch - nr_pages, NULL);
>
> /*
> * If the hierarchy is above the normal consumption range, schedule
> @@ -2803,28 +2832,35 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
> * can only be accessed after disabling interrupt. User context code can
> * access interrupt object stock, but not vice versa.
> */
> -static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
> +static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
> + struct memcg_stock_pcp **stock_pcp)
> {
> struct memcg_stock_pcp *stock;
>
> +#ifndef CONFIG_PREEMPT_RT
> if (likely(in_task())) {
> *pflags = 0UL;
> - preempt_disable();
> + *stock_pcp = NULL;
> + local_lock(&memcg_stock.task_obj_lock);
> stock = this_cpu_ptr(&memcg_stock);
> return &stock->task_obj;
> }
> -
> - local_irq_save(*pflags);
> +#endif
> + local_lock_irqsave(&memcg_stock.stock_lock, *pflags);
> stock = this_cpu_ptr(&memcg_stock);
> + *stock_pcp = stock;
> return &stock->irq_obj;
> }
>
> -static inline void put_obj_stock(unsigned long flags)
> +static inline void put_obj_stock(unsigned long flags,
> + struct memcg_stock_pcp *stock_pcp)
> {
> - if (likely(in_task()))
> - preempt_enable();
> +#ifndef CONFIG_PREEMPT_RT
> + if (likely(!stock_pcp))
> + local_unlock(&memcg_stock.task_obj_lock);
> else
> - local_irq_restore(flags);
If you skip the "else" and add a "return", there will be a more natural
indentation for the following local_unlock_irqrestore(). Also we may not
really need the additional stock_pcp argument here.
> +#endif
> + local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
> }
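For the put_obj_stock() hunk above, I mean something like this (untested
sketch, leaving aside for a moment whether the stock_pcp argument is
needed at all):

	static inline void put_obj_stock(unsigned long flags,
					 struct memcg_stock_pcp *stock_pcp)
	{
	#ifndef CONFIG_PREEMPT_RT
		if (likely(!stock_pcp)) {
			local_unlock(&memcg_stock.task_obj_lock);
			return;
		}
	#endif
		local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
	}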
Cheers,
Longman
* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
2021-12-22 11:41 ` [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels Sebastian Andrzej Siewior
@ 2021-12-23 21:48 ` Waiman Long
2022-01-03 14:44 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2021-12-23 21:48 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, cgroups, linux-mm
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
Thomas Gleixner, Peter Zijlstra
On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> Based on my understanding the optimisation with task_obj for in_task()
> makes sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
> is optimized away. This could then be restricted to !CONFIG_PREEMPTION
> kernels instead of only PREEMPT_RT.
> With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
> configured but these kernels always have preempt_disable()/enable()
> present, so the optimisation probably makes no sense here.
>
> Restrict the optimisation to !CONFIG_PREEMPTION kernels.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My
understanding is that some distros are going to use PREEMPT_DYNAMIC, but
default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to
disable the optimization based on PREEMPTION alone.
Regards,
Longman
* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
2021-12-23 21:48 ` Waiman Long
@ 2022-01-03 14:44 ` Sebastian Andrzej Siewior
2022-01-03 15:04 ` Waiman Long
0 siblings, 1 reply; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-03 14:44 UTC (permalink / raw)
To: Waiman Long
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra
On 2021-12-23 16:48:41 [-0500], Waiman Long wrote:
> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> > Based on my understanding the optimisation with task_obj for in_task()
> > makes sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
> > is optimized away. This could then be restricted to !CONFIG_PREEMPTION
> > kernels instead of only PREEMPT_RT.
> > With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
> > configured but these kernels always have preempt_disable()/enable()
> > present, so the optimisation probably makes no sense here.
> >
> > Restrict the optimisation to !CONFIG_PREEMPTION kernels.
> >
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My
> understanding is that some distros are going to use PREEMPT_DYNAMIC, but
> default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to
> disable the optimization based on PREEMPTION alone.
So there is a benefit to this even if preempt_disable() is not optimized
away? My understanding was that this depends on preempt_disable() being
optimized away.
Is there something you recommend as a benchmark where I could get some
numbers?
> Regards,
> Longman
Sebastian
* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
2022-01-03 14:44 ` Sebastian Andrzej Siewior
@ 2022-01-03 15:04 ` Waiman Long
2022-01-05 20:22 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2022-01-03 15:04 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra
On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
> On 2021-12-23 16:48:41 [-0500], Waiman Long wrote:
>> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
>>> Based on my understanding the optimisation with task_obj for in_task()
>>> makes sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
>>> is optimized away. This could then be restricted to !CONFIG_PREEMPTION
>>> kernels instead of only PREEMPT_RT.
>>> With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
>>> configured but these kernels always have preempt_disable()/enable()
>>> present, so the optimisation probably makes no sense here.
>>>
>>> Restrict the optimisation to !CONFIG_PREEMPTION kernels.
>>>
>>> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My
>> understanding is that some distros are going to use PREEMPT_DYNAMIC, but
>> default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to
>> disable the optimization based on PREEMPTION alone.
> So there is a benefit to this even if preempt_disable() is not optimized
> away? My understanding was that this depends on preempt_disable() being
> optimized away.
> Is there something you recommend as a benchmark where I could get some
> numbers?
In the case of PREEMPT_DYNAMIC, it depends on the default setting which
is used by most users. I will support disabling the optimization if
defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
CONFIG_PREEMPTION alone.
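I.e. for the guards in patch 3 I would rather see something like this
(illustration only):

	-#ifndef CONFIG_PREEMPTION
	+#if !defined(CONFIG_PREEMPT_RT) && !defined(CONFIG_PREEMPT)
	 	/* Protects only task_obj */
	 	local_lock_t task_obj_lock;
	 	struct obj_stock task_obj;
	 #endif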
As for a microbenchmark, something that makes a lot of calls to malloc()
or related allocations can be used.
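For instance, a dumb loop like the one below, run inside a memory cgroup,
should do. It uses open()/close() rather than plain malloc() since each
open() allocates a 'struct file' from a SLAB_ACCOUNT slab and therefore
goes through the objcg charging path on every iteration (just a sketch,
not a proper benchmark):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		long i;

		for (i = 0; i < 10000000; i++) {
			int fd = open("/dev/null", O_RDONLY);

			if (fd < 0)
				return 1;
			close(fd);
		}
		return 0;
	}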
Cheers,
Longman
* Re: [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object.
2021-12-23 21:38 ` Waiman Long
@ 2022-01-03 16:34 ` Sebastian Andrzej Siewior
2022-01-03 17:09 ` Waiman Long
0 siblings, 1 reply; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-03 16:34 UTC (permalink / raw)
To: Waiman Long
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra
On 2021-12-23 16:38:35 [-0500], Waiman Long wrote:
> AFAICS, stock_pcp is set to indicate that the stock_lock has been acquired.
> Since stock_pcp, if set, should be the same as this_cpu_ptr(&memcg_stock),
> it is a bit confusing to pass it to a function that also does a percpu
> access to memcg_stock. Why don't you just pass a boolean, say, stock_locked
> to indicate this instead. It will be more clear and less confusing.
Something like this then?
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d2687d5ed544b..05fdc4801da6b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -261,8 +261,10 @@ bool mem_cgroup_kmem_disabled(void)
return cgroup_memory_nokmem;
}
+struct memcg_stock_pcp;
static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
- unsigned int nr_pages);
+ unsigned int nr_pages,
+ bool stock_lock_acquired);
static void obj_cgroup_release(struct percpu_ref *ref)
{
@@ -296,7 +298,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
nr_pages = nr_bytes >> PAGE_SHIFT;
if (nr_pages)
- obj_cgroup_uncharge_pages(objcg, nr_pages);
+ obj_cgroup_uncharge_pages(objcg, nr_pages, false);
spin_lock_irqsave(&css_set_lock, flags);
list_del(&objcg->list);
@@ -2120,26 +2122,40 @@ struct obj_stock {
};
struct memcg_stock_pcp {
+ /* Protects memcg_stock_pcp */
+ local_lock_t stock_lock;
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
+#ifndef CONFIG_PREEMPT_RT
+ /* Protects only task_obj */
+ local_lock_t task_obj_lock;
struct obj_stock task_obj;
+#endif
struct obj_stock irq_obj;
struct work_struct work;
unsigned long flags;
#define FLUSHING_CACHED_CHARGE 0
};
-static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
+static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
+ .stock_lock = INIT_LOCAL_LOCK(stock_lock),
+#ifndef CONFIG_PREEMPT_RT
+ .task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
+#endif
+};
static DEFINE_MUTEX(percpu_charge_mutex);
#ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct obj_stock *stock);
+static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+ bool stock_lock_acquried);
static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
struct mem_cgroup *root_memcg);
#else
-static inline void drain_obj_stock(struct obj_stock *stock)
+static inline struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+ bool stock_lock_acquried)
{
+ return NULL;
}
static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
struct mem_cgroup *root_memcg)
@@ -2168,7 +2184,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
if (nr_pages > MEMCG_CHARGE_BATCH)
return ret;
- local_irq_save(flags);
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
stock = this_cpu_ptr(&memcg_stock);
if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
@@ -2176,7 +2192,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
ret = true;
}
- local_irq_restore(flags);
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
return ret;
}
@@ -2204,38 +2220,43 @@ static void drain_stock(struct memcg_stock_pcp *stock)
static void drain_local_stock(struct work_struct *dummy)
{
- struct memcg_stock_pcp *stock;
- unsigned long flags;
+ struct memcg_stock_pcp *stock_pcp;
+ struct obj_cgroup *old;
/*
* The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
* drain_stock races is that we always operate on local CPU stock
* here with IRQ disabled
*/
- local_irq_save(flags);
+#ifndef CONFIG_PREEMPT_RT
+ local_lock(&memcg_stock.task_obj_lock);
+ old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, false);
+ local_unlock(&memcg_stock.task_obj_lock);
+ if (old)
+ obj_cgroup_put(old);
+#endif
- stock = this_cpu_ptr(&memcg_stock);
- drain_obj_stock(&stock->irq_obj);
- if (in_task())
- drain_obj_stock(&stock->task_obj);
- drain_stock(stock);
- clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
+ local_lock_irq(&memcg_stock.stock_lock);
+ stock_pcp = this_cpu_ptr(&memcg_stock);
+ old = drain_obj_stock(&stock_pcp->irq_obj, true);
- local_irq_restore(flags);
+ drain_stock(stock_pcp);
+ clear_bit(FLUSHING_CACHED_CHARGE, &stock_pcp->flags);
+
+ local_unlock_irq(&memcg_stock.stock_lock);
+ if (old)
+ obj_cgroup_put(old);
}
/*
* Cache charges(val) to local per_cpu area.
* This will be consumed by consume_stock() function, later.
*/
-static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
- struct memcg_stock_pcp *stock;
- unsigned long flags;
+ struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
- local_irq_save(flags);
-
- stock = this_cpu_ptr(&memcg_stock);
+ lockdep_assert_held(&stock->stock_lock);
if (stock->cached != memcg) { /* reset if necessary */
drain_stock(stock);
css_get(&memcg->css);
@@ -2245,8 +2266,20 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
if (stock->nr_pages > MEMCG_CHARGE_BATCH)
drain_stock(stock);
+}
- local_irq_restore(flags);
+static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+ bool stock_lock_acquried)
+{
+ unsigned long flags;
+
+ if (stock_lock_acquried) {
+ __refill_stock(memcg, nr_pages);
+ return;
+ }
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ __refill_stock(memcg, nr_pages);
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}
/*
@@ -2255,7 +2288,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
*/
static void drain_all_stock(struct mem_cgroup *root_memcg)
{
- int cpu, curcpu;
+ int cpu;
/* If someone's already draining, avoid adding running more workers. */
if (!mutex_trylock(&percpu_charge_mutex))
@@ -2266,7 +2299,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
* as well as workers from this path always operate on the local
* per-cpu data. CPU up doesn't touch memcg_stock at all.
*/
- curcpu = get_cpu();
+ cpus_read_lock();
for_each_online_cpu(cpu) {
struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
struct mem_cgroup *memcg;
@@ -2282,14 +2315,10 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
rcu_read_unlock();
if (flush &&
- !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
- if (cpu == curcpu)
- drain_local_stock(&stock->work);
- else
- schedule_work_on(cpu, &stock->work);
- }
+ !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+ schedule_work_on(cpu, &stock->work);
}
- put_cpu();
+ cpus_read_unlock();
mutex_unlock(&percpu_charge_mutex);
}
@@ -2690,7 +2719,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
done_restock:
if (batch > nr_pages)
- refill_stock(memcg, batch - nr_pages);
+ refill_stock(memcg, batch - nr_pages, false);
/*
* If the hierarchy is above the normal consumption range, schedule
@@ -2803,28 +2832,36 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
* can only be accessed after disabling interrupt. User context code can
* access interrupt object stock, but not vice versa.
*/
-static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
+static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
+ bool *stock_lock_acquried)
{
struct memcg_stock_pcp *stock;
+#ifndef CONFIG_PREEMPT_RT
if (likely(in_task())) {
*pflags = 0UL;
- preempt_disable();
+ *stock_lock_acquried = false;
+ local_lock(&memcg_stock.task_obj_lock);
stock = this_cpu_ptr(&memcg_stock);
return &stock->task_obj;
}
-
- local_irq_save(*pflags);
+#endif
+ *stock_lock_acquried = true;
+ local_lock_irqsave(&memcg_stock.stock_lock, *pflags);
stock = this_cpu_ptr(&memcg_stock);
return &stock->irq_obj;
}
-static inline void put_obj_stock(unsigned long flags)
+static inline void put_obj_stock(unsigned long flags,
+ bool stock_lock_acquried)
{
- if (likely(in_task()))
- preempt_enable();
- else
- local_irq_restore(flags);
+#ifndef CONFIG_PREEMPT_RT
+ if (likely(!stock_lock_acquried)) {
+ local_unlock(&memcg_stock.task_obj_lock);
+ return;
+ }
+#endif
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}
/*
@@ -3002,7 +3039,8 @@ static void memcg_free_cache_id(int id)
* @nr_pages: number of pages to uncharge
*/
static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
- unsigned int nr_pages)
+ unsigned int nr_pages,
+ bool stock_lock_acquried)
{
struct mem_cgroup *memcg;
@@ -3010,7 +3048,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
page_counter_uncharge(&memcg->kmem, nr_pages);
- refill_stock(memcg, nr_pages);
+ refill_stock(memcg, nr_pages, stock_lock_acquried);
css_put(&memcg->css);
}
@@ -3084,7 +3122,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
return;
objcg = __folio_objcg(folio);
- obj_cgroup_uncharge_pages(objcg, nr_pages);
+ obj_cgroup_uncharge_pages(objcg, nr_pages, false);
folio->memcg_data = 0;
obj_cgroup_put(objcg);
}
@@ -3092,17 +3130,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
enum node_stat_item idx, int nr)
{
+ bool stock_lock_acquried;
unsigned long flags;
- struct obj_stock *stock = get_obj_stock(&flags);
+ struct obj_cgroup *old = NULL;
+ struct obj_stock *stock;
int *bytes;
+ stock = get_obj_stock(&flags, &stock_lock_acquried);
/*
* Save vmstat data in stock and skip vmstat array update unless
* accumulating over a page of vmstat data or when pgdat or idx
* changes.
*/
if (stock->cached_objcg != objcg) {
- drain_obj_stock(stock);
+ old = drain_obj_stock(stock, stock_lock_acquried);
+
obj_cgroup_get(objcg);
stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
@@ -3146,38 +3188,43 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
if (nr)
mod_objcg_mlstate(objcg, pgdat, idx, nr);
- put_obj_stock(flags);
+ put_obj_stock(flags, stock_lock_acquried);
+ if (old)
+ obj_cgroup_put(old);
}
static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
{
+ bool stock_lock_acquried;
unsigned long flags;
- struct obj_stock *stock = get_obj_stock(&flags);
+ struct obj_stock *stock;
bool ret = false;
+ stock = get_obj_stock(&flags, &stock_lock_acquried);
if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
stock->nr_bytes -= nr_bytes;
ret = true;
}
- put_obj_stock(flags);
+ put_obj_stock(flags, stock_lock_acquried);
return ret;
}
-static void drain_obj_stock(struct obj_stock *stock)
+static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+ bool stock_lock_acquried)
{
struct obj_cgroup *old = stock->cached_objcg;
if (!old)
- return;
+ return NULL;
if (stock->nr_bytes) {
unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
if (nr_pages)
- obj_cgroup_uncharge_pages(old, nr_pages);
+ obj_cgroup_uncharge_pages(old, nr_pages, stock_lock_acquried);
/*
* The leftover is flushed to the centralized per-memcg value.
@@ -3212,8 +3259,8 @@ static void drain_obj_stock(struct obj_stock *stock)
stock->cached_pgdat = NULL;
}
- obj_cgroup_put(old);
stock->cached_objcg = NULL;
+ return old;
}
static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -3221,11 +3268,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
{
struct mem_cgroup *memcg;
+#ifndef CONFIG_PREEMPT_RT
if (in_task() && stock->task_obj.cached_objcg) {
memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
return true;
}
+#endif
if (stock->irq_obj.cached_objcg) {
memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
@@ -3238,12 +3287,15 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
bool allow_uncharge)
{
+ bool stock_lock_acquried;
unsigned long flags;
- struct obj_stock *stock = get_obj_stock(&flags);
+ struct obj_stock *stock;
unsigned int nr_pages = 0;
+ struct obj_cgroup *old = NULL;
+ stock = get_obj_stock(&flags, &stock_lock_acquried);
if (stock->cached_objcg != objcg) { /* reset if necessary */
- drain_obj_stock(stock);
+ old = drain_obj_stock(stock, stock_lock_acquried);
obj_cgroup_get(objcg);
stock->cached_objcg = objcg;
stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
@@ -3257,10 +3309,12 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
stock->nr_bytes &= (PAGE_SIZE - 1);
}
- put_obj_stock(flags);
+ put_obj_stock(flags, stock_lock_acquried);
+ if (old)
+ obj_cgroup_put(old);
if (nr_pages)
- obj_cgroup_uncharge_pages(objcg, nr_pages);
+ obj_cgroup_uncharge_pages(objcg, nr_pages, false);
}
int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -7061,7 +7115,7 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
- refill_stock(memcg, nr_pages);
+ refill_stock(memcg, nr_pages, false);
}
static int __init cgroup_memory(char *s)
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object.
2022-01-03 16:34 ` Sebastian Andrzej Siewior
@ 2022-01-03 17:09 ` Waiman Long
0 siblings, 0 replies; 24+ messages in thread
From: Waiman Long @ 2022-01-03 17:09 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra
On 1/3/22 11:34, Sebastian Andrzej Siewior wrote:
> On 2021-12-23 16:38:35 [-0500], Waiman Long wrote:
>> AFAICS, stock_pcp is set to indicate that the stock_lock has been acquired.
>> Since stock_pcp, if set, should be the same as this_cpu_ptr(&memcg_stock),
>> it is a bit confusing to pass it to a function that also does a percpu
>> access to memcg_stock. Why don't you just pass a boolean, say, stock_locked
>> to indicate this instead. It will be more clear and less confusing.
> Something like this then?
Yes, that looks good to me.
Thanks,
Longman
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
2021-12-22 11:41 ` [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT Sebastian Andrzej Siewior
2021-12-23 2:31 ` Waiman Long
@ 2022-01-05 14:16 ` Michal Koutný
2022-01-13 13:08 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2022-01-05 14:16 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
Peter Zijlstra
On Wed, Dec 22, 2021 at 12:41:09PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> The sections with disabled preemption must exclude
> memcg_check_events() so that spinlock_t locks can still be acquired
> (for instance in eventfd_signal()).
The resulting construct in uncharge_batch() raises eyebrows. If you can decouple
per-cpu updates from memcg_check_events() on PREEMPT_RT, why not tackle
it the same way on !PREEMPT_RT too (and have just one variant of the
block)?
(Actually, it doesn't seem to me that memcg_check_events() can be
extracted like this from the preempt disabled block since
mem_cgroup_event_ratelimit() relies on similar RMW pattern.
Things would be simpler if PREEMPT_RT didn't allow the threshold event
handlers (akin to Michal Hocko's suggestion of rejecting soft limit).)
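For reference, the pattern in question is roughly the following (condensed
from mainline mm/memcontrol.c of that time; details may differ slightly):

  static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
  					 enum mem_cgroup_events_target target)
  {
  	unsigned long val, next;

  	/* per-CPU read-modify-write: needs preemption (or IRQs) disabled */
  	val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
  	next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
  	if ((long)(next - val) < 0) {
  		switch (target) {
  		case MEM_CGROUP_TARGET_THRESH:
  			next = val + THRESHOLDS_EVENTS_TARGET;
  			break;
  		case MEM_CGROUP_TARGET_SOFTLIMIT:
  			next = val + SOFTLIMIT_EVENTS_TARGET;
  			break;
  		default:
  			break;
  		}
  		__this_cpu_write(memcg->vmstats_percpu->targets[target], next);
  		return true;
  	}
  	return false;
  }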
Thanks,
Michal
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it.
2021-12-22 11:41 [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Sebastian Andrzej Siewior
` (2 preceding siblings ...)
2021-12-22 11:41 ` [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels Sebastian Andrzej Siewior
@ 2022-01-05 14:59 ` Michal Koutný
2022-01-05 15:06 ` Sebastian Andrzej Siewior
3 siblings, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2022-01-05 14:59 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
Peter Zijlstra
On Wed, Dec 22, 2021 at 12:41:08PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> - lockdep complains were triggered by test_core and test_freezer (both
> had to run):
This doesn't happen on the patched kernel, correct?
Thanks,
Michal
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it.
2022-01-05 14:59 ` [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Michal Koutný
@ 2022-01-05 15:06 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-05 15:06 UTC (permalink / raw)
To: Michal Koutný
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
Peter Zijlstra
On 2022-01-05 15:59:56 [+0100], Michal Koutný wrote:
> On Wed, Dec 22, 2021 at 12:41:08PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > - lockdep complains were triggered by test_core and test_freezer (both
> > had to run):
>
> This doesn't happen on the patched kernel, correct?
I saw it first on the patched kernel (with this series) then went back
to the -rc and saw it there, too.
> Thanks,
> Michal
Sebastian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
2022-01-03 15:04 ` Waiman Long
@ 2022-01-05 20:22 ` Sebastian Andrzej Siewior
2022-01-06 3:28 ` Waiman Long
0 siblings, 1 reply; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-05 20:22 UTC (permalink / raw)
To: Waiman Long
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra
On 2022-01-03 10:04:29 [-0500], Waiman Long wrote:
> On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
> > Is there something you recommend as a benchmark where I could get some
> > numbers?
>
> In the case of PREEMPT_DYNAMIC, it depends on the default setting which is
> used by most users. I will support disabling the optimization if
> defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
> CONFIG_PREEMPTION alone.
>
> As for microbenchmark, something that makes a lot of calls to malloc() or
> related allocations can be used.
Numbers I made:
         Sandy Bridge   Haswell        Skylake         AMD-A8 7100    Zen2           ARM64
PREEMPT  5,123,896,822  5,215,055,226   5,077,611,590  6,012,287,874  6,234,674,489  20,000,000,100
IRQ      7,494,119,638  6,810,367,629  10,620,130,377  4,178,546,086  4,898,076,012  13,538,461,925
For micro benchmarking I did 1.000.000.000 iterations of
preempt_disable()/enable() [PREEMPT] and local_irq_save()/restore()
[IRQ].
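A rough sketch of such a measurement loop (ktime_get_ns() as the clock and
the function name are assumptions here; the actual harness is not shown in
this mail):

  /* sketch: time 1,000,000,000 preempt off/on pairs with IRQs disabled */
  static u64 measure_preempt_loop(void)
  {
  	unsigned long flags;
  	u64 t0, t1;
  	long i;

  	local_irq_save(flags);		/* whole loop runs with IRQs off */
  	t0 = ktime_get_ns();
  	for (i = 0; i < 1000000000L; i++) {
  		preempt_disable();
  		preempt_enable();
  	}
  	t1 = ktime_get_ns();
  	local_irq_restore(flags);
  	return t1 - t0;
  }

  /* the IRQ variant does local_irq_save()/restore() in the loop body instead */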
On a Sandy Bridge the PREEMPT loop took 5,123,896,822ns while the IRQ
loop took 7,494,119,638ns. The absolute numbers are not important, it is
worth noting that preemption off/on is less expensive than IRQ off/on.
Except for AMD and ARM64 where IRQ off/on was less expensive. The whole
loop was performed with disabled interrupts so I don't expect much
interference - but then I don't know how much the µArch optimized away on
local_irq_restore() given that the interrupts were already disabled.
I don't have any recent Intel HW (I think) so I don't know if this is an
Intel only thing (IRQ off/on cheaper than preemption off/on) but I guess
that the recent uArch would behave similar to AMD.
Moving on: For the test I run 100,000,000 iterations of
kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));
The BH suffix means that in_task() reported false during the allocation,
otherwise it reported true.
SD is the standard deviation.
SERVER means PREEMPT_NONE while PREEMPT means CONFIG_PREEMPT.
OPT means the optimisation (in_task() + task_obj) is active, NO-OPT
means no optimisation (irq_obj is always used).
The numbers are the time in ns needed for 100,000,000 iteration (alloc +
free). I run 5 tests and used the median value here. If the standard
deviation exceeded 10^9 then I repeated the test. The values remained in
the same range since usually only one value was off and the other
remained in the same range.
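A sketch of the timed allocation loop (only a sketch; the timing details are
assumptions and error handling is omitted):

  /* 100,000,000 accounted alloc/free pairs, as described above */
  static u64 memcg_alloc_loop(void)
  {
  	u64 t0, t1;
  	long i;

  	t0 = ktime_get_ns();
  	for (i = 0; i < 100000000L; i++)
  		kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));
  	t1 = ktime_get_ns();
  	return t1 - t0;
  }

  /* for the BH numbers the same loop runs from a context where in_task()
   * reports false, e.g. softirq */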
Sandy Bridge
               SERVER OPT     SERVER NO-OPT  PREEMPT OPT     PREEMPT NO-OPT
ALLOC/FREE     8,519,295,176  9,051,200,652  10,627,431,395  11,198,189,843
SD             5,309,768      29,253,976     129,102,317     40,681,909
ALLOC/FREE BH  9,996,704,330  8,927,026,031  11,680,149,900  11,139,356,465
SD             38,237,534     72,913,120     23,626,932      116,413,331
The optimisation is visible in the SERVER-OPT case (~1.5s difference in
the runtime (or 14.7ns per iteration)). There is hardly any difference
between BH and !BH in the SERVER-NO-OPT case.
For the SERVER case, the optimisation improves ~0.5s in runtime for the
!BH case.
For the PREEMPT case it also looks like a ~0.5s improvement in the !BH case
while in the BH case it looks the other way around.
               DYN-SRV-OPT     DYN-SRV-NO-OPT  DYN-FULL-OPT    DYN-FULL-NO-OPT
ALLOC/FREE     11,069,180,584  10,773,407,543  10,963,581,285  10,826,207,969
SD             23,195,912      112,763,104     13,145,589      33,543,625
ALLOC/FREE BH  11,443,342,069  10,720,094,700  11,064,914,727  10,955,883,521
SD             81,150,074      171,299,554     58,603,778      84,131,143
DYN is CONFIG_PREEMPT_DYNAMIC enabled and CONFIG_PREEMPT_NONE is
default. I don't see any difference vs CONFIG_PREEMPT except the
default preemption state (so I didn't test that). The preemption counter
is always forced-in so preempt_enable()/disable() is not optimized away.
SRV is the default value (PREEMPT_NONE) and FULL is the overridden
(preempt=full) state.
Based on that, I don't see any added value by the optimisation once
PREEMPT_DYNAMIC is enabled.
----
Zen2:
               SERVER OPT     SERVER NO-OPT  PREEMPT OPT     PREEMPT NO-OPT
ALLOC/FREE     8,126,735,313  8,751,307,383  9,822,927,142   10,045,105,425
SD             100,806,471    87,234,047     55,170,179      25,832,386
ALLOC/FREE BH  9,197,455,885  8,394,337,053  10,671,227,095  9,904,954,934
SD             155,223,919    57,800,997     47,529,496      105,260,566
On Zen2, the IRQ off/on was less expensive than preempt-off/on. So it
looks like I mixed up the numbers for PREEMPT OPT and NO-OPT but I
re-ran it twice and nothing significant changed… However the difference
on PREEMPT for the !BH case is not as significant as on Sandy Bridge
(~200ms here vs ~500ms there).
               DYN-SRV-OPT     DYN-SRV-NO-OPT  DYN-FULL-OPT    DYN-FULL-NO-OPT
ALLOC/FREE     9,680,498,929   10,180,973,847  9,644,453,405   10,224,416,854
SD             73,944,156      61,850,527      13,277,203      107,145,212
ALLOC/FREE BH  10,680,074,634  9,956,695,323   10,704,442,515  9,942,155,910
SD             75,535,172      34,524,493      54,625,678      87,163,920
For double testing and checking, the full git tree is available at [0]
and the script to parse the results is at [1].
[0] git://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging memcg
[1] https://breakpoint.cc/parse-memcg.py
> Cheers,
> Longman
Sebastian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
2022-01-05 20:22 ` Sebastian Andrzej Siewior
@ 2022-01-06 3:28 ` Waiman Long
2022-01-13 15:26 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 24+ messages in thread
From: Waiman Long @ 2022-01-06 3:28 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra
On 1/5/22 15:22, Sebastian Andrzej Siewior wrote:
> On 2022-01-03 10:04:29 [-0500], Waiman Long wrote:
>> On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
>>> Is there something you recommend as a benchmark where I could get some
>>> numbers?
>> In the case of PREEMPT_DYNAMIC, it depends on the default setting which is
>> used by most users. I will support disabling the optimization if
>> defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
>> CONFIG_PREEMPTION alone.
>>
>> As for microbenchmark, something that makes a lot of calls to malloc() or
>> related allocations can be used.
> Numbers I made:
>
> Sandy Bridge Haswell Skylake AMD-A8 7100 Zen2 ARM64
> PREEMPT 5,123,896,822 5,215,055,226 5,077,611,590 6,012,287,874 6,234,674,489 20,000,000,100
> IRQ 7,494,119,638 6,810,367,629 10,620,130,377 4,178,546,086 4,898,076,012 13,538,461,925
Thanks for the extensive testing. I usually perform my performance tests
on Intel hardware. I didn't realize that Zen2 and arm64 perform better
with irq on/off.
>
> For micro benchmarking I did 1.000.000.000 iterations of
> preempt_disable()/enable() [PREEMPT] and local_irq_save()/restore()
> [IRQ].
> On a Sandy Bridge the PREEMPT loop took 5,123,896,822ns while the IRQ
> loop took 7,494,119,638ns. The absolute numbers are not important, it is
> worth noting that preemption off/on is less expensive than IRQ off/on.
> Except for AMD and ARM64 where IRQ off/on was less expensive. The whole
> loop was performed with disabled interrupts so I don't expect much
> interference - but then I don't know how much the µArch optimized away on
> local_irq_restore() given that the interrupts were already disabled.
> I don't have any recent Intel HW (I think) so I don't know if this is an
> Intel only thing (IRQ off/on cheaper than preemption off/on) but I guess
> that the recent uArch would behave similar to AMD.
>
> Moving on: For the test I run 100,000,000 iterations of
> kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));
>
> The BH suffix means that in_task() reported false during the allocation,
> otherwise it reported true.
> SD is the standard deviation.
> SERVER means PREEMPT_NONE while PREEMPT means CONFIG_PREEMPT.
> OPT means the optimisation (in_task() + task_obj) is active, NO-OPT
> means no optimisation (irq_obj is always used).
> The numbers are the time in ns needed for 100,000,000 iteration (alloc +
> free). I run 5 tests and used the median value here. If the standard
> deviation exceeded 10^9 then I repeated the test. The values remained in
> the same range since usually only one value was off and the other
> remained in the same range.
>
> Sandy Bridge
> SERVER OPT SERVER NO-OPT PREEMPT OPT PREEMPT NO-OPT
> ALLOC/FREE 8,519,295,176 9,051,200,652 10,627,431,395 11,198,189,843
> SD 5,309,768 29,253,976 129,102,317 40,681,909
> ALLOC/FREE BH 9,996,704,330 8,927,026,031 11,680,149,900 11,139,356,465
> SD 38,237,534 72,913,120 23,626,932 116,413,331
My own testing when tracking the number of times in_task() is true or
false indicated that most of the kmalloc() calls are done by tasks. Only a few
percent of the time is in_task() false. That is the reason why I
optimized the case where in_task() is true.
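(Such tracking could be as simple as the following, purely hypothetical
instrumentation -- not something from the posted patches:)

  /* hypothetical per-CPU debug counters */
  static DEFINE_PER_CPU(unsigned long, objcg_ctx_task);
  static DEFINE_PER_CPU(unsigned long, objcg_ctx_other);

  	/* ... in the object stock fast path ... */
  	if (in_task())
  		this_cpu_inc(objcg_ctx_task);
  	else
  		this_cpu_inc(objcg_ctx_other);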
>
> The optimisation is visible in the SERVER-OPT case (~1.5s difference in
> the runtime (or 14.7ns per iteration)). There is hardly any difference
> between BH and !BH in the SERVER-NO-OPT case.
> For the SERVER case, the optimisation improves ~0.5s in runtime for the
> !BH case.
> For the PREEMPT case it also looks like a ~0.5s improvement in the !BH case
> while in the BH case it looks the other way around.
>
> DYN-SRV-OPT DYN-SRV-NO-OPT DYN-FULL-OPT DYN-FULL-NO-OPT
> ALLOC/FREE 11,069,180,584 10,773,407,543 10,963,581,285 10,826,207,969
> SD 23,195,912 112,763,104 13,145,589 33,543,625
> ALLOC/FREE BH 11,443,342,069 10,720,094,700 11,064,914,727 10,955,883,521
> SD 81,150,074 171,299,554 58,603,778 84,131,143
>
> DYN is CONFIG_PREEMPT_DYNAMIC enabled and CONFIG_PREEMPT_NONE is
> default. I don't see any difference vs CONFIG_PREEMPT except the
> default preemption state (so I didn't test that). The preemption counter
> is always forced-in so preempt_enable()/disable() is not optimized away.
> SRV is the default value (PREEMPT_NONE) and FULL is the overridden
> (preempt=full) state.
>
> Based on that, I don't see any added value by the optimisation once
> PREEMPT_DYNAMIC is enabled.
The PREEMPT_DYNAMIC result is a bit surprising to me. Given the data
points, I am not going to object to this patch then. I will try to look
further into why this is the case when I have time.
Cheers,
Longman
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
2022-01-05 14:16 ` Michal Koutný
@ 2022-01-13 13:08 ` Sebastian Andrzej Siewior
2022-01-13 14:48 ` Michal Koutný
0 siblings, 1 reply; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-13 13:08 UTC (permalink / raw)
To: Michal Koutný
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
Peter Zijlstra
On 2022-01-05 15:16:53 [+0100], Michal Koutný wrote:
> On Wed, Dec 22, 2021 at 12:41:09PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > The sections with disabled preemption must exclude
> > memcg_check_events() so that spinlock_t locks can still be acquired
> > (for instance in eventfd_signal()).
>
> The resulting construct in uncharge_batch() raises eybrows. If you can decouple
> per-cpu updates from memcg_check_events() on PREEMPT_RT, why not tackle
> it the same way on !PREEMPT_RT too (and have just one variant of the
> block)?
That would mean that mem_cgroup_event_ratelimit() needs a
local_irq_save(). If that is okay then sure I can move it that way.
> (Actually, it doesn't seem to me that memcg_check_events() can be
> extracted like this from the preempt disabled block since
> mem_cgroup_event_ratelimit() relies on similar RMW pattern.
> Things would be simpler if PREEMPT_RT didn't allow the threshold event
> handlers (akin to Michal Hocko's suggestion of rejecting soft limit).)
I added a preempt-disable() section restricted to RT to
mem_cgroup_event_ratelimit(). I had to exclude memcg_check_events() from
the block because of the spinlock_t locks involved down the road
(eventfd_signal() for instance).
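For illustration, such an RT-only section could look roughly like this (a
sketch of the idea only, not necessarily the code in the updated patch):

  	/* protect the per-CPU RMW on RT, where the caller no longer
  	 * disables interrupts; on !RT this compiles away */
  	if (IS_ENABLED(CONFIG_PREEMPT_RT))
  		preempt_disable();

  	val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
  	next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
  	/* ... compare and __this_cpu_write() the new target ... */

  	if (IS_ENABLED(CONFIG_PREEMPT_RT))
  		preempt_enable();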
I remember Michal (Hocko) suggested excluding/ rejecting soft limit but
I didn't know where exactly and its implications. In this block here I
just followed the replacement of irq-off with preempt-off for RT.
> Thanks,
> Michal
Sebastian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
2022-01-13 13:08 ` Sebastian Andrzej Siewior
@ 2022-01-13 14:48 ` Michal Koutný
2022-01-14 9:09 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2022-01-13 14:48 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
Peter Zijlstra
On Thu, Jan 13, 2022 at 02:08:10PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> I added a preempt-disable() section restricted to RT to
> mem_cgroup_event_ratelimit().
Oh, I missed that one.
(Then the decoupling of such mem_cgroup_event_ratelimit() also makes
some more sense.)
> That would mean that mem_cgroup_event_ratelimit() needs a
> local_irq_save(). If that is okay then sure I can move it that way.
Whatever avoids the twisted code :-)
---
> I remember Michal (Hocko) suggested excluding/ rejecting soft limit but
> I didn't know where exactly and its implications. In this block here I
> just followed the replacement of irq-off with preempt-off for RT.
Both soft limit and (these) event notifications are v1 features. Soft
limit itself is rather considered a misfeature. I guess the
implications would not be many since PREEMPT_RT+memcg users would be
new(?) so should rather start with v2 anyway.
One way to disable it would be to reject writes into
memory.soft_limit_in_bytes or cgroup.event_control + documentation of
that.
Michal
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
2022-01-06 3:28 ` Waiman Long
@ 2022-01-13 15:26 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-13 15:26 UTC (permalink / raw)
To: Waiman Long
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra
On 2022-01-05 22:28:10 [-0500], Waiman Long wrote:
> Thanks for the extensive testing. I usually perform my performance tests on
> Intel hardware. I didn't realize that Zen2 and arm64 perform better with irq
> on/off.
Maybe you have access to more recent µArch from Intel which could behave
different.
> My own testing when tracking the number of times in_task() is true or false
> indicated that most of the kmalloc() calls are done by tasks. Only a few percent
> of the time is in_task() false. That is the reason why I optimized the case
> where in_task() is true.
Right. This relies on the fact that changing preemption is cheaper which
is not always true. The ultimate benefit is of course when the
preemption changes can be removed/ optimized away.
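For reference, "optimized away" here refers to what preempt_disable() expands
to; condensed from include/linux/preempt.h (exact details vary by version and
configuration):

  #ifdef CONFIG_PREEMPT_COUNT
  #define preempt_disable() \
  do { \
  	preempt_count_inc(); \
  	barrier(); \
  } while (0)
  /* preempt_enable() additionally checks for a pending reschedule on
   * preemptible kernels */
  #else
  /* no preemption counter at all: both sides are compiler barriers only */
  #define preempt_disable()	barrier()
  #define preempt_enable()	barrier()
  #endif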
> > Based on that, I don't see any added value by the optimisation once
> > PREEMPT_DYNAMIC is enabled.
>
> The PREEMPT_DYNAMIC result is a bit surprising to me. Given the data points,
> I am not going to object to this patch then. I will try to look further into
> why this is the case when I have time.
Okay, thank you.
In the SERVER case we keep the preemption counter so this has obviously
an impact. I am a little surprised that the DYN-FULL and DYN-NONE
differ a little since the code runs with disabled interrupts. But then
this might be the extra jump to preempt_schedule() which is patched-out
in the SERVER case.
> Cheers,
> Longman
Sebastian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
2022-01-13 14:48 ` Michal Koutný
@ 2022-01-14 9:09 ` Sebastian Andrzej Siewior
2022-01-18 18:26 ` [PATCH] mm/memcg: Do not check v1 event counter when not needed Michal Koutný
0 siblings, 1 reply; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-14 9:09 UTC (permalink / raw)
To: Michal Koutný
Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
Peter Zijlstra
On 2022-01-13 15:48:03 [+0100], Michal Koutný wrote:
> On Thu, Jan 13, 2022 at 02:08:10PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > I added a preempt-disable() section restricted to RT to
> > mem_cgroup_event_ratelimit().
>
> Oh, I missed that one.
>
> (Then the decoupling of such mem_cgroup_event_ratelimit() also makes
> some more sense.)
> > That would mean that mem_cgroup_event_ratelimit() needs a
> > local_irq_save(). If that is okay then sure I can move it that way.
>
> Whatever avoids the twisted code :-)
Okay. So let me do that.
> ---
>
> > I remember Michal (Hocko) suggested excluding/ rejecting soft limit but
> > I didn't know where exactly and its implications. In this block here I
> > just followed the replacement of irq-off with preempt-off for RT.
>
> Both soft limit and (these) event notifications are v1 features. Soft
> limit itself is rather considered even misfeature. I guess the
> implications would not be many since PREEMPT_RT+memcg users would be
> new(?) so should rather start with v2 anyway.
People often migrate to RT so they take whatever they have. In general I
would like to keep RT & !RT in sync unless there are reasons to do it
differently.
> One way to disable it would be to reject writes into
> memory.soft_limit_in_bytes or cgroup.event_control + documentation of
> that.
So avoiding these two also avoids memcg_check_events()?
Right now it does not look like a big obstacle. It is the same pattern
I'm following for the per-CPU RMW. If it is, I could avoid the writes
and if0 the other function for RT.
If I remove memcg_check_events() from the equation then we could keep
the explicit irq-off regions (plus add a few). The only thing that would look
odd then is that we disable interrupts for the RMW operation and
preemption in other places (__count_memcg_events() invocations in swap.c
and vmscan.c).
Are there plans to remove v1 or is this part of "we must not break
userland"?
> Michal
Sebastian
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH] mm/memcg: Do not check v1 event counter when not needed
2022-01-14 9:09 ` Sebastian Andrzej Siewior
@ 2022-01-18 18:26 ` Michal Koutný
2022-01-18 19:57 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 24+ messages in thread
From: Michal Koutný @ 2022-01-18 18:26 UTC (permalink / raw)
To: bigeasy
Cc: akpm, cgroups, hannes, linux-mm, longman, mhocko, mkoutny,
peterz, tglx, vdavydov.dev
The function memcg_check_events() is called to trigger possible event
notifications or soft limit updates when the page event "clock" moves
sufficiently. This tracking is not needed when neither soft limit nor (v1)
event notifications are configured. The tracking can catch up
with the clock at any time once thresholds are configured.
Guard this functionality behind an unlikely static branch (soft limit
and events are presumably unused more often than used).
This has a slight, insignificant performance gain in a page-fault-specific
benchmark, but overall no performance impact is expected. The goal is to
partition the charging code per provided user functionality.
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
mm/memcontrol.c | 8 ++++++++
1 file changed, 8 insertions(+)
On Fri, Jan 14, 2022 at 10:09:35AM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> So avoiding these two also avoids memcg_check_events()?
I've made the matter explicit with the surrounding patch.
[
The performance "gain" is negligible (differences of pft [1] are dominated by
non-root memcg classification):
                         nocg, nopatch          cg, nopatch             nocg, patch             cg, patch
Hmean    faults/sec-2    273366.6312 (  0.00%)  243573.3767 * -10.90%*  273901.9709 *   0.20%*  247702.4104 *  -9.39%*
CoeffVar faults/sec-2         3.8771 (  0.00%)       3.8396 (   0.97%)       3.1400 (  19.01%)       4.1188 (  -6.24%)

                         cg, nopatch            cg, patch
Hmean    faults/sec-2    243573.3767 (  0.00%)  247702.4104 *   1.70%*
CoeffVar faults/sec-2         3.8396 (  0.00%)       4.1188 (  -7.27%)
On less targeted benchmarks it's well below noise.
]
I think it would make sense to insert the patch into your series and
subsequently reject enabling this on PREEMPT_RT -- provided this patch makes sense
to others too -- the justification is rather the functionality splitting for
this PREEMPT_RT effort.
> Are there plans to remove v1 or is this part of "we must not break
> userland"?
It's part of that mantra, so v1 can't simply be removed. OTOH, my sense is
that this change also fits under not extending v1 (to avoid doubling the effort on
everything).
Michal
[1] https://github.com/gormanm/mmtests/blob/6bcb8b301a48386e0cc63a21f7642048a3ceaed5/configs/config-pagealloc-performance#L6
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4a7b3ebf8e48..7f64ce33d137 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -106,6 +106,8 @@ static bool do_memsw_account(void)
#define THRESHOLDS_EVENTS_TARGET 128
#define SOFTLIMIT_EVENTS_TARGET 1024
+DEFINE_STATIC_KEY_FALSE(memcg_v1_events_enabled_key);
+
/*
* Cgroups above their limits are maintained in a RB-Tree, independent of
* their hierarchy representation
@@ -852,6 +854,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
*/
static void memcg_check_events(struct mem_cgroup *memcg, int nid)
{
+ if (!static_branch_unlikely(&memcg_v1_events_enabled_key))
+ return;
+
/* threshold event is triggered in finer grain than soft limit */
if (unlikely(mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_THRESH))) {
@@ -3757,6 +3762,7 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
break;
case RES_SOFT_LIMIT:
memcg->soft_limit = nr_pages;
+ static_branch_enable(&memcg_v1_events_enabled_key);
ret = 0;
break;
}
@@ -4831,6 +4837,8 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
list_add(&event->list, &memcg->event_list);
spin_unlock_irq(&memcg->event_list_lock);
+ static_branch_enable(&memcg_v1_events_enabled_key);
+
fdput(cfile);
fdput(efile);
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] mm/memcg: Do not check v1 event counter when not needed
2022-01-18 18:26 ` [PATCH] mm/memcg: Do not check v1 event counter when not needed Michal Koutný
@ 2022-01-18 19:57 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 24+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-18 19:57 UTC (permalink / raw)
To: Michal Koutný
Cc: akpm, cgroups, hannes, linux-mm, longman, mhocko, peterz, tglx,
vdavydov.dev
On 2022-01-18 19:26:00 [+0100], Michal Koutný wrote:
> I think it would make sense inserting the patch into your series and
> subsequently reject enabling on PREEMPT_RT -- provided this patch makes sense
> to others too -- the justification is rather functionality splitting for
> this PREEMPT_RT effort.
Interesting. So while looking at this today I came up with the patch at
the bottom. The other variants I tried looked way uglier, and since
nobody will probably use it anyway…
Let me know how you want it to be integrated.
------>8------
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Tue, 18 Jan 2022 17:28:07 +0100
Subject: [PATCH] mm/memcg: Disable threshold event handlers on PREEMPT_RT
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
During the integration of PREEMPT_RT support, the code flow around
memcg_check_events() resulted in `twisted code'. Moving the code around
to avoid that would then lead to an additional local-irq-save
section within memcg_check_events(). While looking better, it adds a
local-irq-save section to a code flow which usually already runs within a
local-irq-save block.
The threshold event handler is a deprecated memcg v1 feature. Instead of
trying to get it to work under PREEMPT_RT, just disable it. There should
not have been any users on PREEMPT_RT. From that perspective it makes
even less sense to get it to work under PREEMPT_RT while having zero
users.
Make memory.soft_limit_in_bytes and cgroup.event_control return
-EOPNOTSUPP on PREEMPT_RT. Make memcg_check_events() empty on PREEMPT_RT
since it won't do anything. Document that the two knobs are disabled on
PREEMPT_RT.
Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
Documentation/admin-guide/cgroup-v1/memory.rst | 2 ++
mm/memcontrol.c | 12 ++++++++++++
2 files changed, 14 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index faac50149a222..2cc502a75ef64 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -64,6 +64,7 @@ Brief summary of control files.
threads
cgroup.procs show list of processes
cgroup.event_control an interface for event_fd()
+ This knob is not available on CONFIG_PREEMPT_RT systems.
memory.usage_in_bytes show current usage for memory
(See 5.5 for details)
memory.memsw.usage_in_bytes show current usage for memory+Swap
@@ -75,6 +76,7 @@ Brief summary of control files.
memory.max_usage_in_bytes show max memory usage recorded
memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded
memory.soft_limit_in_bytes set/show soft limit of memory usage
+ This knob is not available on CONFIG_PREEMPT_RT systems.
memory.stat show various statistics
memory.use_hierarchy set/show hierarchical account enabled
This knob is deprecated and shouldn't be
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ed5f2a0879d3..3c4f7a0fd0039 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -821,6 +821,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
}
+#ifndef CONFIG_PREEMPT_RT
static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
enum mem_cgroup_events_target target)
{
@@ -864,6 +865,9 @@ static void memcg_check_events(struct mem_cgroup *memcg, int nid)
mem_cgroup_update_tree(memcg, nid);
}
}
+#else
+static void memcg_check_events(struct mem_cgroup *memcg, int nid) { }
+#endif
struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
{
@@ -3751,8 +3755,12 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
}
break;
case RES_SOFT_LIMIT:
+#ifndef CONFIG_PREEMPT_RT
memcg->soft_limit = nr_pages;
ret = 0;
+#else
+ ret = -EOPNOTSUPP;
+#endif
break;
}
return ret ?: nbytes;
@@ -4717,6 +4725,7 @@ static void memcg_event_ptable_queue_proc(struct file *file,
static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
+#ifndef CONFIG_PREEMPT_RT
struct cgroup_subsys_state *css = of_css(of);
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
struct mem_cgroup_event *event;
@@ -4843,6 +4852,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
kfree(event);
return ret;
+#else
+ return -EOPNOTSUPP;
+#endif
}
static struct cftype mem_cgroup_legacy_files[] = {
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
end of thread, other threads:[~2022-01-18 19:57 UTC | newest]
Thread overview: 24+ messages
2021-12-22 11:41 [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Sebastian Andrzej Siewior
2021-12-22 11:41 ` [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT Sebastian Andrzej Siewior
2021-12-23 2:31 ` Waiman Long
2021-12-23 7:34 ` Sebastian Andrzej Siewior
2021-12-23 16:01 ` Waiman Long
2022-01-05 14:16 ` Michal Koutný
2022-01-13 13:08 ` Sebastian Andrzej Siewior
2022-01-13 14:48 ` Michal Koutný
2022-01-14 9:09 ` Sebastian Andrzej Siewior
2022-01-18 18:26 ` [PATCH] mm/memcg: Do not check v1 event counter when not needed Michal Koutný
2022-01-18 19:57 ` Sebastian Andrzej Siewior
2021-12-22 11:41 ` [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object Sebastian Andrzej Siewior
2021-12-23 21:38 ` Waiman Long
2022-01-03 16:34 ` Sebastian Andrzej Siewior
2022-01-03 17:09 ` Waiman Long
2021-12-22 11:41 ` [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels Sebastian Andrzej Siewior
2021-12-23 21:48 ` Waiman Long
2022-01-03 14:44 ` Sebastian Andrzej Siewior
2022-01-03 15:04 ` Waiman Long
2022-01-05 20:22 ` Sebastian Andrzej Siewior
2022-01-06 3:28 ` Waiman Long
2022-01-13 15:26 ` Sebastian Andrzej Siewior
2022-01-05 14:59 ` [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Michal Koutný
2022-01-05 15:06 ` Sebastian Andrzej Siewior