* [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it.
@ 2021-12-22 11:41 ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-22 11:41 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Thomas Gleixner, Waiman Long, Peter Zijlstra

Hi,

This is a follow-up to
   https://lkml.kernel.org/r/20211207155208.eyre5svucpg7krxe@linutronix.de

where it was suggested that I try to address the memcg problems on
PREEMPT_RT instead of simply disabling memcg there.

Patch #1 deals with the counters. It was suggested to simply disable
preemption on RT (as is done for vmstat) and I followed that advice as
closely as possible. The local_irq_save() could additionally be removed
from mod_memcg_state() and the other wrappers on RT, but I left it in
place since it does not hurt and the code arguably looks nicer this way ;)
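
For reference, this is roughly what __mod_memcg_state() looks like with
patch #1 applied (reconstructed from the diff in patch #1; on !PREEMPT_RT
the IS_ENABLED() branches compile away and the callers keep disabling
interrupts as before):

void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
{
	if (mem_cgroup_disabled())
		return;

	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		preempt_disable();

	/* non-atomic per-CPU update, serialised by the section above */
	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
	memcg_rstat_updated(memcg);

	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		preempt_enable();
}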

Patch #2 is a follow-up to
   https://lkml.kernel.org/r/20211214144412.447035-1-longman@redhat.com

Patch #3 restricts the task_obj usage to !CONFIG_PREEMPTION kernels. Based
on my understanding, the use of preempt_disable() minimizes (or even
negates?) the benefit of that optimisation.

I tested the series on CONFIG_PREEMPT_NONE and CONFIG_PREEMPT_RT with the
tools/testing/selftests/cgroup/* tests. Everything looked good except for
the following (which was also present before the patches):
- test_kmem sometimes complained about:
 not ok 2 test_kmem_memcg_deletion

- test_memcontrol always complained about
 not ok 3 test_memcg_min
 not ok 4 test_memcg_low
 and did not finish.

- lockdep complaints were triggered by test_core and test_freezer (both
  had to be run):
 ======================================================
 WARNING: possible circular locking dependency detected
 5.16.0-rc5 #259 Not tainted
 ------------------------------------------------------
 test_core/5996 is trying to acquire lock:
 ffffffff829a1258 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x2d/0xb0
 
 but task is already holding lock:
 ffff888103034618 (&sighand->siglock){....}-{2:2}, at: get_signal+0x8d/0xdb0
 
 which lock already depends on the new lock.

 
 the existing dependency chain (in reverse order) is:
 
 -> #1 (&sighand->siglock){....}-{2:2}:
        _raw_spin_lock+0x27/0x40
        cgroup_post_fork+0x1f5/0x290
        copy_process+0x191b/0x1f80
        kernel_clone+0x5a/0x410
        __do_sys_clone3+0xb3/0x110
        do_syscall_64+0x43/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
 
 -> #0 (css_set_lock){..-.}-{2:2}:
        __lock_acquire+0x1253/0x2280
        lock_acquire+0xd4/0x2e0
        _raw_spin_lock_irqsave+0x36/0x50
        obj_cgroup_release+0x2d/0xb0
        drain_obj_stock+0x1a9/0x1b0
        refill_obj_stock+0x4f/0x220
        memcg_slab_free_hook.part.0+0x108/0x290
        kmem_cache_free+0xf5/0x3c0
        dequeue_signal+0xaf/0x1e0
        get_signal+0x232/0xdb0
        arch_do_signal_or_restart+0xf8/0x740
        exit_to_user_mode_prepare+0x17d/0x270
        syscall_exit_to_user_mode+0x19/0x70
        do_syscall_64+0x50/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
 
 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&sighand->siglock);
                                lock(css_set_lock);
                                lock(&sighand->siglock);
   lock(css_set_lock);
 
  *** DEADLOCK ***

 2 locks held by test_core/5996:
  #0: ffff888103034618 (&sighand->siglock){....}-{2:2}, at: get_signal+0x8d/0xdb0
  #1: ffffffff82905e40 (rcu_read_lock){....}-{1:2}, at: drain_obj_stock+0x71/0x1b0
 
 stack backtrace:
 CPU: 2 PID: 5996 Comm: test_core Not tainted 5.16.0-rc5 #259
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x45/0x59
  check_noncircular+0xfe/0x110
  __lock_acquire+0x1253/0x2280
  lock_acquire+0xd4/0x2e0
  _raw_spin_lock_irqsave+0x36/0x50
  obj_cgroup_release+0x2d/0xb0
  drain_obj_stock+0x1a9/0x1b0
  refill_obj_stock+0x4f/0x220
  memcg_slab_free_hook.part.0+0x108/0x290
  kmem_cache_free+0xf5/0x3c0
  dequeue_signal+0xaf/0x1e0
  get_signal+0x232/0xdb0
  arch_do_signal_or_restart+0xf8/0x740
  exit_to_user_mode_prepare+0x17d/0x270
  syscall_exit_to_user_mode+0x19/0x70
  do_syscall_64+0x50/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae
  </TASK>

Sebastian





* [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
@ 2021-12-22 11:41   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-22 11:41 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Thomas Gleixner, Waiman Long, Peter Zijlstra,
	Sebastian Andrzej Siewior

The per-CPU counters are modified with non-atomic operations and their
consistency is ensured by disabling interrupts for the update.
This breaks on PREEMPT_RT because some sections additionally
acquire a spinlock_t lock (which becomes a sleeping lock on PREEMPT_RT
and must not be acquired with disabled interrupts). Another problem is
that mem_cgroup_swapout() expects to be invoked with disabled interrupts
because its caller has to acquire a spinlock_t which is normally acquired
with disabled interrupts. Since spinlock_t never disables interrupts on
PREEMPT_RT, the interrupts are never disabled at this point.

The code is never called from in_irq() context on PREEMPT_RT, therefore
disabling preemption during the update is sufficient there. The sections
with disabled preemption must exclude memcg_check_events() so that
spinlock_t locks can still be acquired from it (for instance in
eventfd_signal()).

Don't disable interrupts during updates of the per-CPU variables; instead,
use shorter sections which disable preemption.
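
As an illustration, this is roughly how the PREEMPT_RT branch of
uncharge_batch() ends up looking with this patch applied (condensed from
the diff below): only the per-CPU updates run with preemption disabled,
while memcg_check_events(), which may end up taking spinlock_t locks
(e.g. via eventfd_signal()), runs preemptible:

	/* PREEMPT_RT path, condensed from the uncharge_batch() hunk */
	preempt_disable();
	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
	preempt_enable();

	memcg_check_events(ug->memcg, ug->nid);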

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/memcontrol.c | 74 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 60 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ed5f2a0879d3..d2687d5ed544b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -671,8 +671,14 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 	if (mem_cgroup_disabled())
 		return;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 	memcg_rstat_updated(memcg);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
@@ -699,6 +705,9 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	/* Update memcg */
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 
@@ -706,6 +715,9 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
 
 	memcg_rstat_updated(memcg);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 /**
@@ -788,8 +800,13 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 	if (mem_cgroup_disabled())
 		return;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
 	memcg_rstat_updated(memcg);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
@@ -810,6 +827,9 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 int nr_pages)
 {
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	/* pagein of a big page is an event. So, ignore page size */
 	if (nr_pages > 0)
 		__count_memcg_events(memcg, PGPGIN, 1);
@@ -819,12 +839,19 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 	}
 
 	__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 				       enum mem_cgroup_events_target target)
 {
 	unsigned long val, next;
+	bool ret = false;
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
 
 	val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
 	next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
@@ -841,9 +868,11 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 			break;
 		}
 		__this_cpu_write(memcg->vmstats_percpu->targets[target], next);
-		return true;
+		ret = true;
 	}
-	return false;
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
+	return ret;
 }
 
 /*
@@ -5645,12 +5674,14 @@ static int mem_cgroup_move_account(struct page *page,
 	ret = 0;
 	nid = folio_nid(folio);
 
-	local_irq_disable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_disable();
 	mem_cgroup_charge_statistics(to, nr_pages);
 	memcg_check_events(to, nid);
 	mem_cgroup_charge_statistics(from, -nr_pages);
 	memcg_check_events(from, nid);
-	local_irq_enable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_enable();
 out_unlock:
 	folio_unlock(folio);
 out:
@@ -6670,10 +6701,12 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 	css_get(&memcg->css);
 	commit_charge(folio, memcg);
 
-	local_irq_disable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_disable();
 	mem_cgroup_charge_statistics(memcg, nr_pages);
 	memcg_check_events(memcg, folio_nid(folio));
-	local_irq_enable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_enable();
 out:
 	return ret;
 }
@@ -6785,11 +6818,20 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 		memcg_oom_recover(ug->memcg);
 	}
 
-	local_irq_save(flags);
-	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
-	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
-	memcg_check_events(ug->memcg, ug->nid);
-	local_irq_restore(flags);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		local_irq_save(flags);
+		__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
+		__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+		memcg_check_events(ug->memcg, ug->nid);
+		local_irq_restore(flags);
+	} else {
+		preempt_disable();
+		__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
+		__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+		preempt_enable();
+
+		memcg_check_events(ug->memcg, ug->nid);
+	}
 
 	/* drop reference from uncharge_folio */
 	css_put(&ug->memcg->css);
@@ -6930,10 +6972,12 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 	css_get(&memcg->css);
 	commit_charge(new, memcg);
 
-	local_irq_save(flags);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_save(flags);
 	mem_cgroup_charge_statistics(memcg, nr_pages);
 	memcg_check_events(memcg, folio_nid(new));
-	local_irq_restore(flags);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_restore(flags);
 }
 
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
@@ -7157,8 +7201,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * i_pages lock which is taken with interrupts-off. It is
 	 * important here to have the interrupts disabled because it is the
 	 * only synchronisation we have for updating the per-CPU variables.
+	 * On PREEMPT_RT interrupts are never disabled and the updates to per-CPU
+	 * variables are synchronised by keeping preemption disabled.
 	 */
-	VM_BUG_ON(!irqs_disabled());
+	VM_BUG_ON(!IS_ENABLED(CONFIG_PREEMPT_RT) && !irqs_disabled());
 	mem_cgroup_charge_statistics(memcg, -nr_entries);
 	memcg_check_events(memcg, page_to_nid(page));
 
-- 
2.34.1




* [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object.
@ 2021-12-22 11:41   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-22 11:41 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Thomas Gleixner, Waiman Long, Peter Zijlstra,
	Sebastian Andrzej Siewior

The members of the per-CPU structure memcg_stock_pcp are protected
either by disabling interrupts or by disabling preemption if the
invocation occurred in process context.
Disabling interrupts protects most of the structure excluding task_obj,
while disabling preemption protects only task_obj.
This scheme is incompatible with PREEMPT_RT because it creates atomic
context in which actions are performed that require preemptible
context. One example is obj_cgroup_release().

The IRQ-disable and preempt-disable sections can be replaced with a
local_lock_t, which preserves the explicit disabling of interrupts while
keeping the code preemptible on PREEMPT_RT.

The task_obj has been added for performance reasons on non-preemptible
kernels where preempt_disable() is a NOP. On the PREEMPT_RT preemption
model preempt_disable() is always fully implemented. Also, there are no
memory allocations in in_irq() context and softirqs are processed in
(preemptible) process context. Therefore it makes sense to avoid using
task_obj there.

Don't use task_obj on PREEMPT_RT and replace the manual disabling of
interrupts with a local_lock_t. This change requires some refactoring:

- drain_obj_stock() drops a reference on obj_cgroup which leads to an
  invocation of obj_cgroup_release() if it is the last reference. This in
  turn leads to recursive locking of the local_lock_t. To avoid this,
  obj_cgroup_release() is invoked outside of the locked section (see the
  sketch after this list).

- drain_obj_stock() gets a memcg_stock_pcp passed if the stock_lock has been
  acquired (instead of the task_obj_lock) to avoid recursive locking later
  in refill_stock().

- drain_all_stock() disables preemption via get_cpu() and then invokes
  drain_local_stock() directly if it is the local CPU, to avoid scheduling
  a worker (which would invoke the same function). Disabling preemption
  here is problematic due to the sleeping locks in drain_local_stock().
  This can be avoided by always scheduling a worker, even for the local
  CPU. Using cpus_read_lock() stabilizes cpu_online_mask, which ensures
  that no worker is scheduled for an offline CPU. Since there is no
  flush_work(), it is still possible that a worker runs on the wrong CPU,
  but that is okay since it always operates on its local-CPU data.

- drain_local_stock() is always invoked as a worker, so it can be optimized
  by removing the in_task() check (it is always true) and avoiding the
  "irq_save" variant because interrupts are always enabled here. Operating
  on task_obj first allows acquiring the local_lock_t without lockdep
  complaints.
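
The sketch referenced above, condensed from the drain_local_stock() hunk
in the diff below (the !PREEMPT_RT task_obj handling is omitted):
drain_obj_stock() now returns the old objcg instead of dropping the
reference itself, so obj_cgroup_put() (and thus a possible
obj_cgroup_release()) runs only after the local_lock_t has been released,
which avoids the recursive locking:

	local_lock_irq(&memcg_stock.stock_lock);
	stock_pcp = this_cpu_ptr(&memcg_stock);
	old = drain_obj_stock(&stock_pcp->irq_obj, stock_pcp);
	drain_stock(stock_pcp);
	clear_bit(FLUSHING_CACHED_CHARGE, &stock_pcp->flags);
	local_unlock_irq(&memcg_stock.stock_lock);

	if (old)
		obj_cgroup_put(old);	/* may end up in obj_cgroup_release() */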

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/memcontrol.c | 171 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 112 insertions(+), 59 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d2687d5ed544b..1e76f26be2c15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -261,8 +261,10 @@ bool mem_cgroup_kmem_disabled(void)
 	return cgroup_memory_nokmem;
 }
 
+struct memcg_stock_pcp;
 static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
-				      unsigned int nr_pages);
+				      unsigned int nr_pages,
+				      struct memcg_stock_pcp *stock_pcp);
 
 static void obj_cgroup_release(struct percpu_ref *ref)
 {
@@ -296,7 +298,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
 	nr_pages = nr_bytes >> PAGE_SHIFT;
 
 	if (nr_pages)
-		obj_cgroup_uncharge_pages(objcg, nr_pages);
+		obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
 
 	spin_lock_irqsave(&css_set_lock, flags);
 	list_del(&objcg->list);
@@ -2120,26 +2122,40 @@ struct obj_stock {
 };
 
 struct memcg_stock_pcp {
+	/* Protects memcg_stock_pcp */
+	local_lock_t stock_lock;
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
+#ifndef CONFIG_PREEMPT_RT
+	/* Protects only task_obj */
+	local_lock_t task_obj_lock;
 	struct obj_stock task_obj;
+#endif
 	struct obj_stock irq_obj;
 
 	struct work_struct work;
 	unsigned long flags;
 #define FLUSHING_CACHED_CHARGE	0
 };
-static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
+static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
+	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
+#ifndef CONFIG_PREEMPT_RT
+	.task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
+#endif
+};
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct obj_stock *stock);
+static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+					  struct memcg_stock_pcp *stock_pcp);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct obj_stock *stock)
+static inline struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+						 struct memcg_stock_pcp *stock_pcp)
 {
+	return NULL;
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg)
@@ -2168,7 +2184,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (nr_pages > MEMCG_CHARGE_BATCH)
 		return ret;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
@@ -2176,7 +2192,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 
 	return ret;
 }
@@ -2204,38 +2220,43 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 
 static void drain_local_stock(struct work_struct *dummy)
 {
-	struct memcg_stock_pcp *stock;
-	unsigned long flags;
+	struct memcg_stock_pcp *stock_pcp;
+	struct obj_cgroup *old;
 
 	/*
 	 * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-	local_irq_save(flags);
+#ifndef CONFIG_PREEMPT_RT
+	local_lock(&memcg_stock.task_obj_lock);
+	old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
+	local_unlock(&memcg_stock.task_obj_lock);
+	if (old)
+		obj_cgroup_put(old);
+#endif
 
-	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->irq_obj);
-	if (in_task())
-		drain_obj_stock(&stock->task_obj);
-	drain_stock(stock);
-	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
+	local_lock_irq(&memcg_stock.stock_lock);
+	stock_pcp = this_cpu_ptr(&memcg_stock);
+	old = drain_obj_stock(&stock_pcp->irq_obj, stock_pcp);
 
-	local_irq_restore(flags);
+	drain_stock(stock_pcp);
+	clear_bit(FLUSHING_CACHED_CHARGE, &stock_pcp->flags);
+
+	local_unlock_irq(&memcg_stock.stock_lock);
+	if (old)
+		obj_cgroup_put(old);
 }
 
 /*
  * Cache charges(val) to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
  */
-static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+			   struct memcg_stock_pcp *stock)
 {
-	struct memcg_stock_pcp *stock;
-	unsigned long flags;
+	lockdep_assert_held(&stock->stock_lock);
 
-	local_irq_save(flags);
-
-	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached != memcg) { /* reset if necessary */
 		drain_stock(stock);
 		css_get(&memcg->css);
@@ -2245,8 +2266,20 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 
 	if (stock->nr_pages > MEMCG_CHARGE_BATCH)
 		drain_stock(stock);
+}
 
-	local_irq_restore(flags);
+static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+			 struct memcg_stock_pcp *stock_pcp)
+{
+	unsigned long flags;
+
+	if (stock_pcp) {
+		__refill_stock(memcg, nr_pages, stock_pcp);
+		return;
+	}
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	__refill_stock(memcg, nr_pages, this_cpu_ptr(&memcg_stock));
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
 
 /*
@@ -2255,7 +2288,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 static void drain_all_stock(struct mem_cgroup *root_memcg)
 {
-	int cpu, curcpu;
+	int cpu;
 
 	/* If someone's already draining, avoid adding running more workers. */
 	if (!mutex_trylock(&percpu_charge_mutex))
@@ -2266,7 +2299,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 	 * as well as workers from this path always operate on the local
 	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
 	 */
-	curcpu = get_cpu();
+	cpus_read_lock();
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
 		struct mem_cgroup *memcg;
@@ -2282,14 +2315,10 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		rcu_read_unlock();
 
 		if (flush &&
-		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
-			if (cpu == curcpu)
-				drain_local_stock(&stock->work);
-			else
-				schedule_work_on(cpu, &stock->work);
-		}
+		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+			schedule_work_on(cpu, &stock->work);
 	}
-	put_cpu();
+	cpus_read_unlock();
 	mutex_unlock(&percpu_charge_mutex);
 }
 
@@ -2690,7 +2719,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 done_restock:
 	if (batch > nr_pages)
-		refill_stock(memcg, batch - nr_pages);
+		refill_stock(memcg, batch - nr_pages, NULL);
 
 	/*
 	 * If the hierarchy is above the normal consumption range, schedule
@@ -2803,28 +2832,35 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
  * can only be accessed after disabling interrupt. User context code can
  * access interrupt object stock, but not vice versa.
  */
-static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
+static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
+					      struct memcg_stock_pcp **stock_pcp)
 {
 	struct memcg_stock_pcp *stock;
 
+#ifndef CONFIG_PREEMPT_RT
 	if (likely(in_task())) {
 		*pflags = 0UL;
-		preempt_disable();
+		*stock_pcp = NULL;
+		local_lock(&memcg_stock.task_obj_lock);
 		stock = this_cpu_ptr(&memcg_stock);
 		return &stock->task_obj;
 	}
-
-	local_irq_save(*pflags);
+#endif
+	local_lock_irqsave(&memcg_stock.stock_lock, *pflags);
 	stock = this_cpu_ptr(&memcg_stock);
+	*stock_pcp = stock;
 	return &stock->irq_obj;
 }
 
-static inline void put_obj_stock(unsigned long flags)
+static inline void put_obj_stock(unsigned long flags,
+				 struct memcg_stock_pcp *stock_pcp)
 {
-	if (likely(in_task()))
-		preempt_enable();
+#ifndef CONFIG_PREEMPT_RT
+	if (likely(!stock_pcp))
+		local_unlock(&memcg_stock.task_obj_lock);
 	else
-		local_irq_restore(flags);
+#endif
+		local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
 
 /*
@@ -3002,7 +3038,8 @@ static void memcg_free_cache_id(int id)
  * @nr_pages: number of pages to uncharge
  */
 static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
-				      unsigned int nr_pages)
+				      unsigned int nr_pages,
+				      struct memcg_stock_pcp *stock_pcp)
 {
 	struct mem_cgroup *memcg;
 
@@ -3010,7 +3047,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
 
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		page_counter_uncharge(&memcg->kmem, nr_pages);
-	refill_stock(memcg, nr_pages);
+	refill_stock(memcg, nr_pages, stock_pcp);
 
 	css_put(&memcg->css);
 }
@@ -3084,7 +3121,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 		return;
 
 	objcg = __folio_objcg(folio);
-	obj_cgroup_uncharge_pages(objcg, nr_pages);
+	obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
 	folio->memcg_data = 0;
 	obj_cgroup_put(objcg);
 }
@@ -3092,17 +3129,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		     enum node_stat_item idx, int nr)
 {
+	struct memcg_stock_pcp *stock_pcp;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
+	struct obj_cgroup *old = NULL;
+	struct obj_stock *stock;
 	int *bytes;
 
+	stock = get_obj_stock(&flags, &stock_pcp);
 	/*
 	 * Save vmstat data in stock and skip vmstat array update unless
 	 * accumulating over a page of vmstat data or when pgdat or idx
 	 * changes.
 	 */
 	if (stock->cached_objcg != objcg) {
-		drain_obj_stock(stock);
+		old = drain_obj_stock(stock, stock_pcp);
+
 		obj_cgroup_get(objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
 				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
@@ -3146,38 +3187,43 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	if (nr)
 		mod_objcg_mlstate(objcg, pgdat, idx, nr);
 
-	put_obj_stock(flags);
+	put_obj_stock(flags, stock_pcp);
+	if (old)
+		obj_cgroup_put(old);
 }
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
+	struct memcg_stock_pcp *stock_pcp;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
+	struct obj_stock *stock;
 	bool ret = false;
 
+	stock = get_obj_stock(&flags, &stock_pcp);
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
 	}
 
-	put_obj_stock(flags);
+	put_obj_stock(flags, stock_pcp);
 
 	return ret;
 }
 
-static void drain_obj_stock(struct obj_stock *stock)
+static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+					  struct memcg_stock_pcp *stock_pcp)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
 	if (!old)
-		return;
+		return NULL;
 
 	if (stock->nr_bytes) {
 		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
 		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
 
 		if (nr_pages)
-			obj_cgroup_uncharge_pages(old, nr_pages);
+			obj_cgroup_uncharge_pages(old, nr_pages, stock_pcp);
 
 		/*
 		 * The leftover is flushed to the centralized per-memcg value.
@@ -3212,8 +3258,8 @@ static void drain_obj_stock(struct obj_stock *stock)
 		stock->cached_pgdat = NULL;
 	}
 
-	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
+	return old;
 }
 
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -3221,11 +3267,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
+#ifndef CONFIG_PREEMPT_RT
 	if (in_task() && stock->task_obj.cached_objcg) {
 		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
+#endif
 	if (stock->irq_obj.cached_objcg) {
 		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
@@ -3238,12 +3286,15 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			     bool allow_uncharge)
 {
+	struct memcg_stock_pcp *stock_pcp;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
+	struct obj_stock *stock;
 	unsigned int nr_pages = 0;
+	struct obj_cgroup *old = NULL;
 
+	stock = get_obj_stock(&flags, &stock_pcp);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
-		drain_obj_stock(stock);
+		old = drain_obj_stock(stock, stock_pcp);
 		obj_cgroup_get(objcg);
 		stock->cached_objcg = objcg;
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
@@ -3257,10 +3308,12 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	put_obj_stock(flags);
+	put_obj_stock(flags, stock_pcp);
+	if (old)
+		obj_cgroup_put(old);
 
 	if (nr_pages)
-		obj_cgroup_uncharge_pages(objcg, nr_pages);
+		obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -7061,7 +7114,7 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 
 	mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
 
-	refill_stock(memcg, nr_pages);
+	refill_stock(memcg, nr_pages, NULL);
 }
 
 static int __init cgroup_memory(char *s)
-- 
2.34.1




* [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
@ 2021-12-22 11:41   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-22 11:41 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Thomas Gleixner, Waiman Long, Peter Zijlstra,
	Sebastian Andrzej Siewior

Based on my understanding, the task_obj optimisation for in_task() makes
sense on non-PREEMPTIBLE kernels because preempt_disable()/enable() is
optimized away there. The restriction could therefore be extended to
!CONFIG_PREEMPTION kernels instead of covering only PREEMPT_RT.
With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
configured, but such kernels always have preempt_disable()/enable()
compiled in, so the optimisation probably buys nothing there.

Restrict the optimisation to !CONFIG_PREEMPTION kernels.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/memcontrol.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1e76f26be2c15..92180f1aa9edc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2126,7 +2126,7 @@ struct memcg_stock_pcp {
 	local_lock_t stock_lock;
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	/* Protects only task_obj */
 	local_lock_t task_obj_lock;
 	struct obj_stock task_obj;
@@ -2139,7 +2139,7 @@ struct memcg_stock_pcp {
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
 	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	.task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
 #endif
 };
@@ -2228,7 +2228,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	local_lock(&memcg_stock.task_obj_lock);
 	old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
 	local_unlock(&memcg_stock.task_obj_lock);
@@ -2837,7 +2837,7 @@ static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
 {
 	struct memcg_stock_pcp *stock;
 
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (likely(in_task())) {
 		*pflags = 0UL;
 		*stock_pcp = NULL;
@@ -2855,7 +2855,7 @@ static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
 static inline void put_obj_stock(unsigned long flags,
 				 struct memcg_stock_pcp *stock_pcp)
 {
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (likely(!stock_pcp))
 		local_unlock(&memcg_stock.task_obj_lock);
 	else
@@ -3267,7 +3267,7 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (in_task() && stock->task_obj.cached_objcg) {
 		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
@ 2021-12-23  2:31     ` Waiman Long
  0 siblings, 0 replies; 48+ messages in thread
From: Waiman Long @ 2021-12-23  2:31 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra

On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> The per-CPU counter are modified with the non-atomic modifier. The
> consistency is ensure by disabling interrupts for the update.
> This breaks on PREEMPT_RT because some sections additionally
> acquire a spinlock_t lock (which becomes sleeping and must not be
> acquired with disabled interrupts). Another problem is that
> mem_cgroup_swapout() expects to be invoked with disabled interrupts
> because the caller has to acquire a spinlock_t which is acquired with
> disabled interrupts. Since spinlock_t never disables interrupts on
> PREEMPT_RT the interrupts are never disabled at this point.
>
> The code is never called from in_irq() context on PREEMPT_RT therefore

How do you guarantee that these percpu update functions won't be called 
in in_irq() context for PREEMPT_RT? Do you think we should add a 
WARN_ON_ONCE(in_irq()) just to be sure?

Cheers,
Longman



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
@ 2021-12-23  7:34       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-23  7:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra

On 2021-12-22 21:31:36 [-0500], Waiman Long wrote:
> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> > The per-CPU counter are modified with the non-atomic modifier. The
> > consistency is ensure by disabling interrupts for the update.
> > This breaks on PREEMPT_RT because some sections additionally
> > acquire a spinlock_t lock (which becomes sleeping and must not be
> > acquired with disabled interrupts). Another problem is that
> > mem_cgroup_swapout() expects to be invoked with disabled interrupts
> > because the caller has to acquire a spinlock_t which is acquired with
> > disabled interrupts. Since spinlock_t never disables interrupts on
> > PREEMPT_RT the interrupts are never disabled at this point.
> > 
> > The code is never called from in_irq() context on PREEMPT_RT therefore
> 
> How do you guarantee that these percpu update functions won't be called in
> in_irq() context for PREEMPT_RT? Do you think we should add a
> WARN_ON_ONCE(in_irq()) just to be sure?

There are no invocations of the memory allocator (neither malloc() nor
free()) from hard interrupt context on RT, and the memory allocator
itself (both SLUB and the page allocator) uses sleeping locks. That
means invocations from in_atomic() context are bad. All interrupt
handlers are force-threaded. Those which are not (like the timer and
per-CPU interrupts, or those which explicitly asked not to be force
threaded) are limited in what they may do because they cannot acquire
anything that is a sleeping lock. Lockdep or CONFIG_DEBUG_ATOMIC_SLEEP
will yell here.
The other counters are protected the same way, see
  c68ed7945701a ("mm/vmstat: protect per cpu variables with preempt disable on RT")
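
A minimal sketch of what such a guard could look like around the per-CPU
counter updates (the helper names and the WARN_ON_ONCE() placement are
made up for this illustration; they are not part of the posted series):

/*
 * Illustrative only: protect a per-CPU memcg counter update the same way
 * the vmstat counters are handled on RT.
 */
static inline void memcg_stats_lock(void)
{
#ifdef CONFIG_PREEMPT_RT
	/* Counter consistency only needs "stay on this CPU" on RT. */
	preempt_disable();
	/* The check discussed above: no updates from hard-IRQ context. */
	WARN_ON_ONCE(in_irq());
#else
	/* On !RT the callers disable interrupts around the update. */
	VM_WARN_ON_ONCE(!irqs_disabled());
#endif
}

static inline void memcg_stats_unlock(void)
{
#ifdef CONFIG_PREEMPT_RT
	preempt_enable();
#endif
}

On !RT this only asserts the existing contract; on RT it pins the update
to the local CPU and would make a hard-IRQ caller visible immediately.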

> Cheers,
> Longman

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
@ 2021-12-23 16:01         ` Waiman Long
  0 siblings, 0 replies; 48+ messages in thread
From: Waiman Long @ 2021-12-23 16:01 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra

On 12/23/21 02:34, Sebastian Andrzej Siewior wrote:
> On 2021-12-22 21:31:36 [-0500], Waiman Long wrote:
>> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
>>> The per-CPU counter are modified with the non-atomic modifier. The
>>> consistency is ensure by disabling interrupts for the update.
>>> This breaks on PREEMPT_RT because some sections additionally
>>> acquire a spinlock_t lock (which becomes sleeping and must not be
>>> acquired with disabled interrupts). Another problem is that
>>> mem_cgroup_swapout() expects to be invoked with disabled interrupts
>>> because the caller has to acquire a spinlock_t which is acquired with
>>> disabled interrupts. Since spinlock_t never disables interrupts on
>>> PREEMPT_RT the interrupts are never disabled at this point.
>>>
>>> The code is never called from in_irq() context on PREEMPT_RT therefore
>> How do you guarantee that these percpu update functions won't be called in
>> in_irq() context for PREEMPT_RT? Do you think we should add a
>> WARN_ON_ONCE(in_irq()) just to be sure?
> There are no invocations to the memory allocator (neither malloc() nor
> free()) on RT and the memory allocator itself (SLUB and the
> page-allocator so both) has sleeping locks. That means invocations
> in_atomic() are bad. All interrupt handler are force-threaded. Those
> which are not (like timer, per-CPU interrupts or those which explicitly
> asked not to be force threaded) are limited in their doing as they can't
> invoke anything that has a sleeping lock. Lockdep or
> CONFIG_DEBUG_ATOMIC_SLEEP will yell here.
> The other counter are protected the same way, see
>    c68ed7945701a ("mm/vmstat: protect per cpu variables with preempt disable on RT")

Thanks for the explanation, as I am less familiar with the other
PREEMPT_RT-specific changes.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object.
@ 2021-12-23 21:38     ` Waiman Long
  0 siblings, 0 replies; 48+ messages in thread
From: Waiman Long @ 2021-12-23 21:38 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra

On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> The members of the per-CPU structure memcg_stock_pcp are protected
> either by disabling interrupts or by disabling preemption if the
> invocation occurred in process context.
> Disabling interrupts protects most of the structure excluding task_obj
> while disabling preemption protects only task_obj.
> This schema is incompatible with PREEMPT_RT because it creates atomic
> context in which actions are performed which require preemptible
> context. One example is obj_cgroup_release().
>
> The IRQ-disable and preempt-disable sections can be replaced with
> local_lock_t which preserves the explicit disabling of interrupts while
> keeps the code preemptible on PREEMPT_RT.
>
> The task_obj has been added for performance reason on non-preemptible
> kernels where preempt_disable() is a NOP. On the PREEMPT_RT preemption
> model preempt_disable() is always implemented. Also there are no memory
> allocations in_irq() context and softirqs are processed in (preemptible)
> process context. Therefore it makes sense to avoid using task_obj.
>
> Don't use task_obj on PREEMPT_RT and replace manual disabling of
> interrupts with a local_lock_t. This change requires some factoring:
>
> - drain_obj_stock() drops a reference on obj_cgroup which leads to an
>    invocation of obj_cgroup_release() if it is the last object. This in
>    turn leads to recursive locking of the local_lock_t. To avoid this,
>    obj_cgroup_release() is invoked outside of the locked section.
>
> - drain_obj_stock() gets a memcg_stock_pcp passed if the stock_lock has been
>    acquired (instead of the task_obj_lock) to avoid recursive locking later
>    in refill_stock().
>
> - drain_all_stock() disables preemption via get_cpu() and then invokes
>    drain_local_stock() if it is the local CPU to avoid scheduling a worker
>    (which invokes the same function). Disabling preemption here is
>    problematic due to the sleeping locks in drain_local_stock().
>    This can be avoided by always scheduling a worker, even for the local
>    CPU. Using cpus_read_lock() stabilizes cpu_online_mask which ensures
>    that no worker is scheduled for an offline CPU. Since there is no
>    flush_work(), it is still possible that a worker is invoked on the wrong
>    CPU but it is okay since it operates always on the local-CPU data.
>
> - drain_local_stock() is always invoked as a worker so it can be optimized
>    by removing in_task() (it is always true) and avoiding the "irq_save"
>    variant because interrupts are always enabled here. Operating on
>    task_obj first allows to acquire the lock_lock_t without lockdep
>    complains.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>   mm/memcontrol.c | 171 +++++++++++++++++++++++++++++++-----------------
>   1 file changed, 112 insertions(+), 59 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d2687d5ed544b..1e76f26be2c15 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -261,8 +261,10 @@ bool mem_cgroup_kmem_disabled(void)
>   	return cgroup_memory_nokmem;
>   }
>   
> +struct memcg_stock_pcp;
>   static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
> -				      unsigned int nr_pages);
> +				      unsigned int nr_pages,
> +				      struct memcg_stock_pcp *stock_pcp);

AFAICS, stock_pcp is set to indicate that the stock_lock has been
acquired. Since stock_pcp, if set, should be the same as
this_cpu_ptr(&memcg_stock), it is a bit confusing to pass it to a
function that also does a percpu access to memcg_stock. Why don't you
just pass a boolean, say stock_locked, to indicate this instead? That
would be clearer and less confusing.


>   
>   static void obj_cgroup_release(struct percpu_ref *ref)
>   {
> @@ -296,7 +298,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>   	nr_pages = nr_bytes >> PAGE_SHIFT;
>   
>   	if (nr_pages)
> -		obj_cgroup_uncharge_pages(objcg, nr_pages);
> +		obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
>   
>   	spin_lock_irqsave(&css_set_lock, flags);
>   	list_del(&objcg->list);
> @@ -2120,26 +2122,40 @@ struct obj_stock {
>   };
>   
>   struct memcg_stock_pcp {
> +	/* Protects memcg_stock_pcp */
> +	local_lock_t stock_lock;
>   	struct mem_cgroup *cached; /* this never be root cgroup */
>   	unsigned int nr_pages;
> +#ifndef CONFIG_PREEMPT_RT
> +	/* Protects only task_obj */
> +	local_lock_t task_obj_lock;
>   	struct obj_stock task_obj;
> +#endif
>   	struct obj_stock irq_obj;
>   
>   	struct work_struct work;
>   	unsigned long flags;
>   #define FLUSHING_CACHED_CHARGE	0
>   };
> -static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
> +	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
> +#ifndef CONFIG_PREEMPT_RT
> +	.task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
> +#endif
> +};
>   static DEFINE_MUTEX(percpu_charge_mutex);
>   
>   #ifdef CONFIG_MEMCG_KMEM
> -static void drain_obj_stock(struct obj_stock *stock);
> +static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
> +					  struct memcg_stock_pcp *stock_pcp);
>   static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>   				     struct mem_cgroup *root_memcg);
>   
>   #else
> -static inline void drain_obj_stock(struct obj_stock *stock)
> +static inline struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
> +						 struct memcg_stock_pcp *stock_pcp)
>   {
> +	return NULL;
>   }
>   static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>   				     struct mem_cgroup *root_memcg)
> @@ -2168,7 +2184,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   	if (nr_pages > MEMCG_CHARGE_BATCH)
>   		return ret;
>   
> -	local_irq_save(flags);
> +	local_lock_irqsave(&memcg_stock.stock_lock, flags);
>   
>   	stock = this_cpu_ptr(&memcg_stock);
>   	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
> @@ -2176,7 +2192,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   		ret = true;
>   	}
>   
> -	local_irq_restore(flags);
> +	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
>   
>   	return ret;
>   }
> @@ -2204,38 +2220,43 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>   
>   static void drain_local_stock(struct work_struct *dummy)
>   {
> -	struct memcg_stock_pcp *stock;
> -	unsigned long flags;
> +	struct memcg_stock_pcp *stock_pcp;
> +	struct obj_cgroup *old;
>   
>   	/*
>   	 * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
>   	 * drain_stock races is that we always operate on local CPU stock
>   	 * here with IRQ disabled
>   	 */
> -	local_irq_save(flags);
> +#ifndef CONFIG_PREEMPT_RT
> +	local_lock(&memcg_stock.task_obj_lock);
> +	old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
> +	local_unlock(&memcg_stock.task_obj_lock);
> +	if (old)
> +		obj_cgroup_put(old);
> +#endif
>   
> -	stock = this_cpu_ptr(&memcg_stock);
> -	drain_obj_stock(&stock->irq_obj);
> -	if (in_task())
> -		drain_obj_stock(&stock->task_obj);
> -	drain_stock(stock);
> -	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> +	local_lock_irq(&memcg_stock.stock_lock);
> +	stock_pcp = this_cpu_ptr(&memcg_stock);
> +	old = drain_obj_stock(&stock_pcp->irq_obj, stock_pcp);
>   
> -	local_irq_restore(flags);
> +	drain_stock(stock_pcp);
> +	clear_bit(FLUSHING_CACHED_CHARGE, &stock_pcp->flags);
> +
> +	local_unlock_irq(&memcg_stock.stock_lock);
> +	if (old)
> +		obj_cgroup_put(old);
>   }
>   
>   /*
>    * Cache charges(val) to local per_cpu area.
>    * This will be consumed by consume_stock() function, later.
>    */
> -static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> +static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> +			   struct memcg_stock_pcp *stock)
>   {
> -	struct memcg_stock_pcp *stock;
> -	unsigned long flags;
> +	lockdep_assert_held(&stock->stock_lock);
>   
> -	local_irq_save(flags);
> -
> -	stock = this_cpu_ptr(&memcg_stock);
>   	if (stock->cached != memcg) { /* reset if necessary */
>   		drain_stock(stock);
>   		css_get(&memcg->css);
> @@ -2245,8 +2266,20 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   
>   	if (stock->nr_pages > MEMCG_CHARGE_BATCH)
>   		drain_stock(stock);
> +}
>   
> -	local_irq_restore(flags);
> +static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> +			 struct memcg_stock_pcp *stock_pcp)
> +{
> +	unsigned long flags;
> +
> +	if (stock_pcp) {
> +		__refill_stock(memcg, nr_pages, stock_pcp);
> +		return;
> +	}
> +	local_lock_irqsave(&memcg_stock.stock_lock, flags);
> +	__refill_stock(memcg, nr_pages, this_cpu_ptr(&memcg_stock));
> +	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
>   }
>   
>   /*
> @@ -2255,7 +2288,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>    */
>   static void drain_all_stock(struct mem_cgroup *root_memcg)
>   {
> -	int cpu, curcpu;
> +	int cpu;
>   
>   	/* If someone's already draining, avoid adding running more workers. */
>   	if (!mutex_trylock(&percpu_charge_mutex))
> @@ -2266,7 +2299,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>   	 * as well as workers from this path always operate on the local
>   	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
>   	 */
> -	curcpu = get_cpu();
> +	cpus_read_lock();
>   	for_each_online_cpu(cpu) {
>   		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
>   		struct mem_cgroup *memcg;
> @@ -2282,14 +2315,10 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>   		rcu_read_unlock();
>   
>   		if (flush &&
> -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> -			if (cpu == curcpu)
> -				drain_local_stock(&stock->work);
> -			else
> -				schedule_work_on(cpu, &stock->work);
> -		}
> +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> +			schedule_work_on(cpu, &stock->work);
>   	}
> -	put_cpu();
> +	cpus_read_unlock();
>   	mutex_unlock(&percpu_charge_mutex);
>   }
>   
> @@ -2690,7 +2719,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>   
>   done_restock:
>   	if (batch > nr_pages)
> -		refill_stock(memcg, batch - nr_pages);
> +		refill_stock(memcg, batch - nr_pages, NULL);
>   
>   	/*
>   	 * If the hierarchy is above the normal consumption range, schedule
> @@ -2803,28 +2832,35 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
>    * can only be accessed after disabling interrupt. User context code can
>    * access interrupt object stock, but not vice versa.
>    */
> -static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
> +static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
> +					      struct memcg_stock_pcp **stock_pcp)
>   {
>   	struct memcg_stock_pcp *stock;
>   
> +#ifndef CONFIG_PREEMPT_RT
>   	if (likely(in_task())) {
>   		*pflags = 0UL;
> -		preempt_disable();
> +		*stock_pcp = NULL;
> +		local_lock(&memcg_stock.task_obj_lock);
>   		stock = this_cpu_ptr(&memcg_stock);
>   		return &stock->task_obj;
>   	}
> -
> -	local_irq_save(*pflags);
> +#endif
> +	local_lock_irqsave(&memcg_stock.stock_lock, *pflags);
>   	stock = this_cpu_ptr(&memcg_stock);
> +	*stock_pcp = stock;
>   	return &stock->irq_obj;
>   }
>   
> -static inline void put_obj_stock(unsigned long flags)
> +static inline void put_obj_stock(unsigned long flags,
> +				 struct memcg_stock_pcp *stock_pcp)
>   {
> -	if (likely(in_task()))
> -		preempt_enable();
> +#ifndef CONFIG_PREEMPT_RT
> +	if (likely(!stock_pcp))
> +		local_unlock(&memcg_stock.task_obj_lock);
>   	else
> -		local_irq_restore(flags);
If you skip the "else" and add a "return", the following
local_unlock_irqrestore() gets a more natural indentation (roughly as
sketched after the quoted hunk below). Also, we may not really need the
additional stock_pcp argument here.
> +#endif
> +		local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
>   }
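
A rough sketch of that shape (illustrative only; it reuses the pre-patch
in_task() test instead of the stock_pcp argument):

static inline void put_obj_stock(unsigned long flags)
{
#ifndef CONFIG_PREEMPT_RT
	/* Matches the get_obj_stock() fast path: task context, no IRQ flags. */
	if (likely(in_task())) {
		local_unlock(&memcg_stock.task_obj_lock);
		return;
	}
#endif
	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}
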
Cheers,
Longman



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
@ 2021-12-23 21:48     ` Waiman Long
  0 siblings, 0 replies; 48+ messages in thread
From: Waiman Long @ 2021-12-23 21:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, cgroups, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Thomas Gleixner, Peter Zijlstra

On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> Based on my understanding the optimisation with task_obj for in_task()
> mask sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
> is optimized away. This could be then restricted to !CONFIG_PREEMPTION kernel
> instead to only PREEMPT_RT.
> With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
> configured but these kernels always have preempt_disable()/enable()
> present so it probably makes no sense here for the optimisation.
>
> Restrict the optimisation to !CONFIG_PREEMPTION kernels.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My 
understanding is that some distros are going to use PREEMPT_DYNAMIC, but 
default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to 
disable the optimization based on PREEMPTION alone.

Regards,
Longman



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
@ 2022-01-03 14:44       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-03 14:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra

On 2021-12-23 16:48:41 [-0500], Waiman Long wrote:
> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> > Based on my understanding the optimisation with task_obj for in_task()
> > mask sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
> > is optimized away. This could be then restricted to !CONFIG_PREEMPTION kernel
> > instead to only PREEMPT_RT.
> > With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
> > configured but these kernels always have preempt_disable()/enable()
> > present so it probably makes no sense here for the optimisation.
> > 
> > Restrict the optimisation to !CONFIG_PREEMPTION kernels.
> > 
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My
> understanding is that some distros are going to use PREEMPT_DYNAMIC, but
> default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to
> disable the optimization based on PREEMPTION alone.

So there is a benefit to this even if preempt_disable() is not optimized
away? My understanding was that this depends on preempt_disable() being
optimized away.
Is there something you recommend as a benchmark where I could get some
numbers?

> Regards,
> Longman

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
@ 2022-01-03 15:04         ` Waiman Long
  0 siblings, 0 replies; 48+ messages in thread
From: Waiman Long @ 2022-01-03 15:04 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra


On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
> On 2021-12-23 16:48:41 [-0500], Waiman Long wrote:
>> On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
>>> Based on my understanding the optimisation with task_obj for in_task()
>>> mask sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
>>> is optimized away. This could be then restricted to !CONFIG_PREEMPTION kernel
>>> instead to only PREEMPT_RT.
>>> With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be
>>> configured but these kernels always have preempt_disable()/enable()
>>> present so it probably makes no sense here for the optimisation.
>>>
>>> Restrict the optimisation to !CONFIG_PREEMPTION kernels.
>>>
>>> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> If PREEMPT_DYNAMIC is selected, PREEMPTION will also be set. My
>> understanding is that some distros are going to use PREEMPT_DYNAMIC, but
>> default to PREEMPT_VOLUNTARY. So I don't believe it is a good idea to
>> disable the optimization based on PREEMPTION alone.
> So there is a benefit to this even if preempt_disable() is not optimized
> away? My understanding was that this depends on preempt_disable() being
> optimized away.
> Is there something you recommend as a benchmark where I could get some
> numbers?

In the case of PREEMPT_DYNAMIC, it depends on the default setting, which
is what most users will run. I will support disabling the optimization
if defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
CONFIG_PREEMPTION alone.
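
Something like this on the memcg_stock_pcp struct from patch 2/3, just
as a sketch of that condition (not code posted in this thread):

struct memcg_stock_pcp {
	/* Protects memcg_stock_pcp */
	local_lock_t stock_lock;
	struct mem_cgroup *cached; /* this never be root cgroup */
	unsigned int nr_pages;
#if !defined(CONFIG_PREEMPT_RT) && !defined(CONFIG_PREEMPT)
	/* Protects only task_obj */
	local_lock_t task_obj_lock;
	struct obj_stock task_obj;
#endif
	struct obj_stock irq_obj;

	struct work_struct work;
	unsigned long flags;
#define FLUSHING_CACHED_CHARGE	0
};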

As for a microbenchmark, something that makes a lot of calls to malloc()
or related allocations can be used.
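
For instance (purely illustrative, not something posted or measured in
this thread), a user-space loop that forces kmem-accounted kernel
allocations, e.g. open()/close() of /dev/null, which allocates and frees
a struct file on every round, run inside a memory cgroup:

/*
 * Illustrative micro-benchmark sketch: hammer a kmem-accounted allocation
 * path (struct file) so the objcg charge/uncharge fast path is exercised.
 * The iteration count and the file used are arbitrary.
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const long iters = 5 * 1000 * 1000;
	struct timespec a, b;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < iters; i++) {
		int fd = open("/dev/null", O_RDONLY);

		if (fd < 0)
			return 1;
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &b);
	printf("%.3f s for %ld open/close pairs\n",
	       (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9, iters);
	return 0;
}
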

Cheers,
Longman



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object.
@ 2022-01-03 16:34       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-03 16:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra

On 2021-12-23 16:38:35 [-0500], Waiman Long wrote:
> AFAICS, stock_pcp is set to indicate that the stock_lock has been acquired.
> Since stock_pcp, if set, should be the same as this_cpu_ptr(&memcg_stock),
> it is a bit confusing to pass it to a function that also does a percpu
> access to memcg_stock. Why don't you just pass a boolean, say, stock_locked
> to indicate this instead. It will be more clear and less confusing.

Something like this then?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d2687d5ed544b..05fdc4801da6b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -261,8 +261,10 @@ bool mem_cgroup_kmem_disabled(void)
 	return cgroup_memory_nokmem;
 }
 
+struct memcg_stock_pcp;
 static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
-				      unsigned int nr_pages);
+				      unsigned int nr_pages,
+				      bool stock_lock_acquired);
 
 static void obj_cgroup_release(struct percpu_ref *ref)
 {
@@ -296,7 +298,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
 	nr_pages = nr_bytes >> PAGE_SHIFT;
 
 	if (nr_pages)
-		obj_cgroup_uncharge_pages(objcg, nr_pages);
+		obj_cgroup_uncharge_pages(objcg, nr_pages, false);
 
 	spin_lock_irqsave(&css_set_lock, flags);
 	list_del(&objcg->list);
@@ -2120,26 +2122,40 @@ struct obj_stock {
 };
 
 struct memcg_stock_pcp {
+	/* Protects memcg_stock_pcp */
+	local_lock_t stock_lock;
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
+#ifndef CONFIG_PREEMPT_RT
+	/* Protects only task_obj */
+	local_lock_t task_obj_lock;
 	struct obj_stock task_obj;
+#endif
 	struct obj_stock irq_obj;
 
 	struct work_struct work;
 	unsigned long flags;
 #define FLUSHING_CACHED_CHARGE	0
 };
-static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
+static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
+	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
+#ifndef CONFIG_PREEMPT_RT
+	.task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
+#endif
+};
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct obj_stock *stock);
+static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+					  bool stock_lock_acquired);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct obj_stock *stock)
+static inline struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+						 bool stock_lock_acquired)
 {
+	return NULL;
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg)
@@ -2168,7 +2184,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (nr_pages > MEMCG_CHARGE_BATCH)
 		return ret;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
@@ -2176,7 +2192,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 
 	return ret;
 }
@@ -2204,38 +2220,43 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 
 static void drain_local_stock(struct work_struct *dummy)
 {
-	struct memcg_stock_pcp *stock;
-	unsigned long flags;
+	struct memcg_stock_pcp *stock_pcp;
+	struct obj_cgroup *old;
 
 	/*
 	 * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-	local_irq_save(flags);
+#ifndef CONFIG_PREEMPT_RT
+	local_lock(&memcg_stock.task_obj_lock);
+	old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
+	local_unlock(&memcg_stock.task_obj_lock);
+	if (old)
+		obj_cgroup_put(old);
+#endif
 
-	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->irq_obj);
-	if (in_task())
-		drain_obj_stock(&stock->task_obj);
-	drain_stock(stock);
-	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
+	local_lock_irq(&memcg_stock.stock_lock);
+	stock_pcp = this_cpu_ptr(&memcg_stock);
+	old = drain_obj_stock(&stock_pcp->irq_obj, stock_pcp);
 
-	local_irq_restore(flags);
+	drain_stock(stock_pcp);
+	clear_bit(FLUSHING_CACHED_CHARGE, &stock_pcp->flags);
+
+	local_unlock_irq(&memcg_stock.stock_lock);
+	if (old)
+		obj_cgroup_put(old);
 }
 
 /*
  * Cache charges(val) to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
  */
-static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	struct memcg_stock_pcp *stock;
-	unsigned long flags;
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
 
-	local_irq_save(flags);
-
-	stock = this_cpu_ptr(&memcg_stock);
+	lockdep_assert_held(&stock->stock_lock);
 	if (stock->cached != memcg) { /* reset if necessary */
 		drain_stock(stock);
 		css_get(&memcg->css);
@@ -2245,8 +2266,20 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 
 	if (stock->nr_pages > MEMCG_CHARGE_BATCH)
 		drain_stock(stock);
+}
 
-	local_irq_restore(flags);
+static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+			 bool stock_lock_acquried)
+{
+	unsigned long flags;
+
+	if (stock_lock_acquried) {
+		__refill_stock(memcg, nr_pages);
+		return;
+	}
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	__refill_stock(memcg, nr_pages);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
 
 /*
@@ -2255,7 +2288,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 static void drain_all_stock(struct mem_cgroup *root_memcg)
 {
-	int cpu, curcpu;
+	int cpu;
 
 	/* If someone's already draining, avoid adding running more workers. */
 	if (!mutex_trylock(&percpu_charge_mutex))
@@ -2266,7 +2299,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 	 * as well as workers from this path always operate on the local
 	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
 	 */
-	curcpu = get_cpu();
+	cpus_read_lock();
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
 		struct mem_cgroup *memcg;
@@ -2282,14 +2315,10 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		rcu_read_unlock();
 
 		if (flush &&
-		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
-			if (cpu == curcpu)
-				drain_local_stock(&stock->work);
-			else
-				schedule_work_on(cpu, &stock->work);
-		}
+		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+			schedule_work_on(cpu, &stock->work);
 	}
-	put_cpu();
+	cpus_read_unlock();
 	mutex_unlock(&percpu_charge_mutex);
 }
 
@@ -2690,7 +2719,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 done_restock:
 	if (batch > nr_pages)
-		refill_stock(memcg, batch - nr_pages);
+		refill_stock(memcg, batch - nr_pages, false);
 
 	/*
 	 * If the hierarchy is above the normal consumption range, schedule
@@ -2803,28 +2832,36 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
  * can only be accessed after disabling interrupt. User context code can
  * access interrupt object stock, but not vice versa.
  */
-static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
+static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
+					      bool *stock_lock_acquried)
 {
 	struct memcg_stock_pcp *stock;
 
+#ifndef CONFIG_PREEMPT_RT
 	if (likely(in_task())) {
 		*pflags = 0UL;
-		preempt_disable();
+		*stock_lock_acquried = false;
+		local_lock(&memcg_stock.task_obj_lock);
 		stock = this_cpu_ptr(&memcg_stock);
 		return &stock->task_obj;
 	}
-
-	local_irq_save(*pflags);
+#endif
+	*stock_lock_acquried = true;
+	local_lock_irqsave(&memcg_stock.stock_lock, *pflags);
 	stock = this_cpu_ptr(&memcg_stock);
 	return &stock->irq_obj;
 }
 
-static inline void put_obj_stock(unsigned long flags)
+static inline void put_obj_stock(unsigned long flags,
+				 bool stock_lock_acquried)
 {
-	if (likely(in_task()))
-		preempt_enable();
-	else
-		local_irq_restore(flags);
+#ifndef CONFIG_PREEMPT_RT
+	if (likely(!stock_lock_acquried)) {
+		local_unlock(&memcg_stock.task_obj_lock);
+		return;
+	}
+#endif
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
 
 /*
@@ -3002,7 +3039,8 @@ static void memcg_free_cache_id(int id)
  * @nr_pages: number of pages to uncharge
  */
 static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
-				      unsigned int nr_pages)
+				      unsigned int nr_pages,
+				      bool stock_lock_acquried)
 {
 	struct mem_cgroup *memcg;
 
@@ -3010,7 +3048,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
 
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		page_counter_uncharge(&memcg->kmem, nr_pages);
-	refill_stock(memcg, nr_pages);
+	refill_stock(memcg, nr_pages, stock_lock_acquried);
 
 	css_put(&memcg->css);
 }
@@ -3084,7 +3122,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 		return;
 
 	objcg = __folio_objcg(folio);
-	obj_cgroup_uncharge_pages(objcg, nr_pages);
+	obj_cgroup_uncharge_pages(objcg, nr_pages, false);
 	folio->memcg_data = 0;
 	obj_cgroup_put(objcg);
 }
@@ -3092,17 +3130,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		     enum node_stat_item idx, int nr)
 {
+	bool stock_lock_acquried;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
+	struct obj_cgroup *old = NULL;
+	struct obj_stock *stock;
 	int *bytes;
 
+	stock = get_obj_stock(&flags, &stock_lock_acquried);
 	/*
 	 * Save vmstat data in stock and skip vmstat array update unless
 	 * accumulating over a page of vmstat data or when pgdat or idx
 	 * changes.
 	 */
 	if (stock->cached_objcg != objcg) {
-		drain_obj_stock(stock);
+		old = drain_obj_stock(stock, stock_lock_acquried);
+
 		obj_cgroup_get(objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
 				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
@@ -3146,38 +3188,43 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	if (nr)
 		mod_objcg_mlstate(objcg, pgdat, idx, nr);
 
-	put_obj_stock(flags);
+	put_obj_stock(flags, stock_lock_acquried);
+	if (old)
+		obj_cgroup_put(old);
 }
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
+	bool stock_lock_acquried;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
+	struct obj_stock *stock;
 	bool ret = false;
 
+	stock = get_obj_stock(&flags, &stock_lock_acquried);
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
 	}
 
-	put_obj_stock(flags);
+	put_obj_stock(flags, stock_lock_acquried);
 
 	return ret;
 }
 
-static void drain_obj_stock(struct obj_stock *stock)
+static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+					  bool stock_lock_acquried)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
 	if (!old)
-		return;
+		return NULL;
 
 	if (stock->nr_bytes) {
 		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
 		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
 
 		if (nr_pages)
-			obj_cgroup_uncharge_pages(old, nr_pages);
+			obj_cgroup_uncharge_pages(old, nr_pages, stock_lock_acquried);
 
 		/*
 		 * The leftover is flushed to the centralized per-memcg value.
@@ -3212,8 +3259,8 @@ static void drain_obj_stock(struct obj_stock *stock)
 		stock->cached_pgdat = NULL;
 	}
 
-	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
+	return old;
 }
 
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -3221,11 +3268,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
+#ifndef CONFIG_PREEMPT_RT
 	if (in_task() && stock->task_obj.cached_objcg) {
 		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
+#endif
 	if (stock->irq_obj.cached_objcg) {
 		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
@@ -3238,12 +3287,15 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			     bool allow_uncharge)
 {
+	bool stock_lock_acquried;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
+	struct obj_stock *stock;
 	unsigned int nr_pages = 0;
+	struct obj_cgroup *old = NULL;
 
+	stock = get_obj_stock(&flags, &stock_lock_acquried);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
-		drain_obj_stock(stock);
+		old = drain_obj_stock(stock, stock_lock_acquried);
 		obj_cgroup_get(objcg);
 		stock->cached_objcg = objcg;
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
@@ -3257,10 +3309,12 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	put_obj_stock(flags);
+	put_obj_stock(flags, stock_lock_acquried);
+	if (old)
+		obj_cgroup_put(old);
 
 	if (nr_pages)
-		obj_cgroup_uncharge_pages(objcg, nr_pages);
+		obj_cgroup_uncharge_pages(objcg, nr_pages, false);
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -7061,7 +7115,7 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 
 	mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
 
-	refill_stock(memcg, nr_pages);
+	refill_stock(memcg, nr_pages, false);
 }
 
 static int __init cgroup_memory(char *s)
-- 
2.34.1
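
For readers less familiar with local_lock_t, a stripped-down sketch of the
pattern the diff relies on -- a per-CPU cache guarded by a local_lock_t,
plus an inner helper for callers that already hold the lock -- might look
as follows (all identifiers are invented; they are not the mm/memcontrol.c
names):

#include <linux/local_lock.h>
#include <linux/lockdep.h>
#include <linux/percpu.h>
#include <linux/types.h>

struct example_pcp_cache {
	local_lock_t lock;	/* IRQ-off section on !RT, per-CPU lock on RT */
	unsigned int cached;
};

static DEFINE_PER_CPU(struct example_pcp_cache, example_cache) = {
	.lock	= INIT_LOCAL_LOCK(lock),
};

/* Callers must already hold example_cache.lock on this CPU. */
static void __example_refill(unsigned int nr)
{
	struct example_pcp_cache *c = this_cpu_ptr(&example_cache);

	lockdep_assert_held(&c->lock);
	c->cached += nr;
}

/*
 * Mirrors refill_stock() above: acquire the lock unless the caller states
 * that it is already held, so drain paths do not re-acquire it.
 */
static void example_refill(unsigned int nr, bool lock_held)
{
	unsigned long flags;

	if (lock_held) {
		__example_refill(nr);
		return;
	}
	local_lock_irqsave(&example_cache.lock, flags);
	__example_refill(nr);
	local_unlock_irqrestore(&example_cache.lock, flags);
}

On !PREEMPT_RT configurations local_lock_irqsave() still compiles down to
disabling interrupts, so the fast path is unchanged; on PREEMPT_RT it takes
a per-CPU sleeping lock instead, which keeps the section preemptible.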



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object.
@ 2022-01-03 17:09         ` Waiman Long
  0 siblings, 0 replies; 48+ messages in thread
From: Waiman Long @ 2022-01-03 17:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra

On 1/3/22 11:34, Sebastian Andrzej Siewior wrote:
> On 2021-12-23 16:38:35 [-0500], Waiman Long wrote:
>> AFAICS, stock_pcp is set to indicate that the stock_lock has been acquired.
>> Since stock_pcp, if set, should be the same as this_cpu_ptr(&memcg_stock),
>> it is a bit confusing to pass it to a function that also does a percpu
>> access to memcg_stock. Why don't you just pass a boolean, say, stock_locked,
>> to indicate this instead? It would be clearer and less confusing.
> Something like this then?

Yes, that looks good to me.

Thanks,
Longman

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d2687d5ed544b..05fdc4801da6b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -261,8 +261,10 @@ bool mem_cgroup_kmem_disabled(void)
>   	return cgroup_memory_nokmem;
>   }
>   
> +struct memcg_stock_pcp;
>   static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
> -				      unsigned int nr_pages);
> +				      unsigned int nr_pages,
> +				      bool stock_lock_acquried);
>   
>   static void obj_cgroup_release(struct percpu_ref *ref)
>   {
> @@ -296,7 +298,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>   	nr_pages = nr_bytes >> PAGE_SHIFT;
>   
>   	if (nr_pages)
> -		obj_cgroup_uncharge_pages(objcg, nr_pages);
> +		obj_cgroup_uncharge_pages(objcg, nr_pages, false);
>   
>   	spin_lock_irqsave(&css_set_lock, flags);
>   	list_del(&objcg->list);
> @@ -2120,26 +2122,40 @@ struct obj_stock {
>   };
>   
>   struct memcg_stock_pcp {
> +	/* Protects memcg_stock_pcp */
> +	local_lock_t stock_lock;
>   	struct mem_cgroup *cached; /* this never be root cgroup */
>   	unsigned int nr_pages;
> +#ifndef CONFIG_PREEMPT_RT
> +	/* Protects only task_obj */
> +	local_lock_t task_obj_lock;
>   	struct obj_stock task_obj;
> +#endif
>   	struct obj_stock irq_obj;
>   
>   	struct work_struct work;
>   	unsigned long flags;
>   #define FLUSHING_CACHED_CHARGE	0
>   };
> -static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
> +	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
> +#ifndef CONFIG_PREEMPT_RT
> +	.task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
> +#endif
> +};
>   static DEFINE_MUTEX(percpu_charge_mutex);
>   
>   #ifdef CONFIG_MEMCG_KMEM
> -static void drain_obj_stock(struct obj_stock *stock);
> +static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
> +					  bool stock_lock_acquried);
>   static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>   				     struct mem_cgroup *root_memcg);
>   
>   #else
> -static inline void drain_obj_stock(struct obj_stock *stock)
> +static inline struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
> +						 bool stock_lock_acquried)
>   {
> +	return NULL;
>   }
>   static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>   				     struct mem_cgroup *root_memcg)
> @@ -2168,7 +2184,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   	if (nr_pages > MEMCG_CHARGE_BATCH)
>   		return ret;
>   
> -	local_irq_save(flags);
> +	local_lock_irqsave(&memcg_stock.stock_lock, flags);
>   
>   	stock = this_cpu_ptr(&memcg_stock);
>   	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
> @@ -2176,7 +2192,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   		ret = true;
>   	}
>   
> -	local_irq_restore(flags);
> +	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
>   
>   	return ret;
>   }
> @@ -2204,38 +2220,43 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>   
>   static void drain_local_stock(struct work_struct *dummy)
>   {
> -	struct memcg_stock_pcp *stock;
> -	unsigned long flags;
> +	struct memcg_stock_pcp *stock_pcp;
> +	struct obj_cgroup *old;
>   
>   	/*
>   	 * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
>   	 * drain_stock races is that we always operate on local CPU stock
>   	 * here with IRQ disabled
>   	 */
> -	local_irq_save(flags);
> +#ifndef CONFIG_PREEMPT_RT
> +	local_lock(&memcg_stock.task_obj_lock);
> +	old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
> +	local_unlock(&memcg_stock.task_obj_lock);
> +	if (old)
> +		obj_cgroup_put(old);
> +#endif
>   
> -	stock = this_cpu_ptr(&memcg_stock);
> -	drain_obj_stock(&stock->irq_obj);
> -	if (in_task())
> -		drain_obj_stock(&stock->task_obj);
> -	drain_stock(stock);
> -	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> +	local_lock_irq(&memcg_stock.stock_lock);
> +	stock_pcp = this_cpu_ptr(&memcg_stock);
> +	old = drain_obj_stock(&stock_pcp->irq_obj, stock_pcp);
>   
> -	local_irq_restore(flags);
> +	drain_stock(stock_pcp);
> +	clear_bit(FLUSHING_CACHED_CHARGE, &stock_pcp->flags);
> +
> +	local_unlock_irq(&memcg_stock.stock_lock);
> +	if (old)
> +		obj_cgroup_put(old);
>   }
>   
>   /*
>    * Cache charges(val) to local per_cpu area.
>    * This will be consumed by consume_stock() function, later.
>    */
> -static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> +static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   {
> -	struct memcg_stock_pcp *stock;
> -	unsigned long flags;
> +	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
>   
> -	local_irq_save(flags);
> -
> -	stock = this_cpu_ptr(&memcg_stock);
> +	lockdep_assert_held(&stock->stock_lock);
>   	if (stock->cached != memcg) { /* reset if necessary */
>   		drain_stock(stock);
>   		css_get(&memcg->css);
> @@ -2245,8 +2266,20 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   
>   	if (stock->nr_pages > MEMCG_CHARGE_BATCH)
>   		drain_stock(stock);
> +}
>   
> -	local_irq_restore(flags);
> +static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> +			 bool stock_lock_acquried)
> +{
> +	unsigned long flags;
> +
> +	if (stock_lock_acquried) {
> +		__refill_stock(memcg, nr_pages);
> +		return;
> +	}
> +	local_lock_irqsave(&memcg_stock.stock_lock, flags);
> +	__refill_stock(memcg, nr_pages);
> +	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
>   }
>   
>   /*
> @@ -2255,7 +2288,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>    */
>   static void drain_all_stock(struct mem_cgroup *root_memcg)
>   {
> -	int cpu, curcpu;
> +	int cpu;
>   
>   	/* If someone's already draining, avoid adding running more workers. */
>   	if (!mutex_trylock(&percpu_charge_mutex))
> @@ -2266,7 +2299,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>   	 * as well as workers from this path always operate on the local
>   	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
>   	 */
> -	curcpu = get_cpu();
> +	cpus_read_lock();
>   	for_each_online_cpu(cpu) {
>   		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
>   		struct mem_cgroup *memcg;
> @@ -2282,14 +2315,10 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>   		rcu_read_unlock();
>   
>   		if (flush &&
> -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> -			if (cpu == curcpu)
> -				drain_local_stock(&stock->work);
> -			else
> -				schedule_work_on(cpu, &stock->work);
> -		}
> +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> +			schedule_work_on(cpu, &stock->work);
>   	}
> -	put_cpu();
> +	cpus_read_unlock();
>   	mutex_unlock(&percpu_charge_mutex);
>   }
>   
> @@ -2690,7 +2719,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>   
>   done_restock:
>   	if (batch > nr_pages)
> -		refill_stock(memcg, batch - nr_pages);
> +		refill_stock(memcg, batch - nr_pages, false);
>   
>   	/*
>   	 * If the hierarchy is above the normal consumption range, schedule
> @@ -2803,28 +2832,36 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
>    * can only be accessed after disabling interrupt. User context code can
>    * access interrupt object stock, but not vice versa.
>    */
> -static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
> +static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
> +					      bool *stock_lock_acquried)
>   {
>   	struct memcg_stock_pcp *stock;
>   
> +#ifndef CONFIG_PREEMPT_RT
>   	if (likely(in_task())) {
>   		*pflags = 0UL;
> -		preempt_disable();
> +		*stock_lock_acquried = false;
> +		local_lock(&memcg_stock.task_obj_lock);
>   		stock = this_cpu_ptr(&memcg_stock);
>   		return &stock->task_obj;
>   	}
> -
> -	local_irq_save(*pflags);
> +#endif
> +	*stock_lock_acquried = true;
> +	local_lock_irqsave(&memcg_stock.stock_lock, *pflags);
>   	stock = this_cpu_ptr(&memcg_stock);
>   	return &stock->irq_obj;
>   }
>   
> -static inline void put_obj_stock(unsigned long flags)
> +static inline void put_obj_stock(unsigned long flags,
> +				 bool stock_lock_acquried)
>   {
> -	if (likely(in_task()))
> -		preempt_enable();
> -	else
> -		local_irq_restore(flags);
> +#ifndef CONFIG_PREEMPT_RT
> +	if (likely(!stock_lock_acquried)) {
> +		local_unlock(&memcg_stock.task_obj_lock);
> +		return;
> +	}
> +#endif
> +	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
>   }
>   
>   /*
> @@ -3002,7 +3039,8 @@ static void memcg_free_cache_id(int id)
>    * @nr_pages: number of pages to uncharge
>    */
>   static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
> -				      unsigned int nr_pages)
> +				      unsigned int nr_pages,
> +				      bool stock_lock_acquried)
>   {
>   	struct mem_cgroup *memcg;
>   
> @@ -3010,7 +3048,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
>   
>   	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>   		page_counter_uncharge(&memcg->kmem, nr_pages);
> -	refill_stock(memcg, nr_pages);
> +	refill_stock(memcg, nr_pages, stock_lock_acquried);
>   
>   	css_put(&memcg->css);
>   }
> @@ -3084,7 +3122,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>   		return;
>   
>   	objcg = __folio_objcg(folio);
> -	obj_cgroup_uncharge_pages(objcg, nr_pages);
> +	obj_cgroup_uncharge_pages(objcg, nr_pages, false);
>   	folio->memcg_data = 0;
>   	obj_cgroup_put(objcg);
>   }
> @@ -3092,17 +3130,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>   void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>   		     enum node_stat_item idx, int nr)
>   {
> +	bool stock_lock_acquried;
>   	unsigned long flags;
> -	struct obj_stock *stock = get_obj_stock(&flags);
> +	struct obj_cgroup *old = NULL;
> +	struct obj_stock *stock;
>   	int *bytes;
>   
> +	stock = get_obj_stock(&flags, &stock_lock_acquried);
>   	/*
>   	 * Save vmstat data in stock and skip vmstat array update unless
>   	 * accumulating over a page of vmstat data or when pgdat or idx
>   	 * changes.
>   	 */
>   	if (stock->cached_objcg != objcg) {
> -		drain_obj_stock(stock);
> +		old = drain_obj_stock(stock, stock_lock_acquried);
> +
>   		obj_cgroup_get(objcg);
>   		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
>   				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
> @@ -3146,38 +3188,43 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>   	if (nr)
>   		mod_objcg_mlstate(objcg, pgdat, idx, nr);
>   
> -	put_obj_stock(flags);
> +	put_obj_stock(flags, stock_lock_acquried);
> +	if (old)
> +		obj_cgroup_put(old);
>   }
>   
>   static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>   {
> +	bool stock_lock_acquried;
>   	unsigned long flags;
> -	struct obj_stock *stock = get_obj_stock(&flags);
> +	struct obj_stock *stock;
>   	bool ret = false;
>   
> +	stock = get_obj_stock(&flags, &stock_lock_acquried);
>   	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
>   		stock->nr_bytes -= nr_bytes;
>   		ret = true;
>   	}
>   
> -	put_obj_stock(flags);
> +	put_obj_stock(flags, stock_lock_acquried);
>   
>   	return ret;
>   }
>   
> -static void drain_obj_stock(struct obj_stock *stock)
> +static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
> +					  bool stock_lock_acquried)
>   {
>   	struct obj_cgroup *old = stock->cached_objcg;
>   
>   	if (!old)
> -		return;
> +		return NULL;
>   
>   	if (stock->nr_bytes) {
>   		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
>   		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
>   
>   		if (nr_pages)
> -			obj_cgroup_uncharge_pages(old, nr_pages);
> +			obj_cgroup_uncharge_pages(old, nr_pages, stock_lock_acquried);
>   
>   		/*
>   		 * The leftover is flushed to the centralized per-memcg value.
> @@ -3212,8 +3259,8 @@ static void drain_obj_stock(struct obj_stock *stock)
>   		stock->cached_pgdat = NULL;
>   	}
>   
> -	obj_cgroup_put(old);
>   	stock->cached_objcg = NULL;
> +	return old;
>   }
>   
>   static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> @@ -3221,11 +3268,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>   {
>   	struct mem_cgroup *memcg;
>   
> +#ifndef CONFIG_PREEMPT_RT
>   	if (in_task() && stock->task_obj.cached_objcg) {
>   		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
>   		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
>   			return true;
>   	}
> +#endif
>   	if (stock->irq_obj.cached_objcg) {
>   		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
>   		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
> @@ -3238,12 +3287,15 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>   static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>   			     bool allow_uncharge)
>   {
> +	bool stock_lock_acquried;
>   	unsigned long flags;
> -	struct obj_stock *stock = get_obj_stock(&flags);
> +	struct obj_stock *stock;
>   	unsigned int nr_pages = 0;
> +	struct obj_cgroup *old = NULL;
>   
> +	stock = get_obj_stock(&flags, &stock_lock_acquried);
>   	if (stock->cached_objcg != objcg) { /* reset if necessary */
> -		drain_obj_stock(stock);
> +		old = drain_obj_stock(stock, stock_lock_acquried);
>   		obj_cgroup_get(objcg);
>   		stock->cached_objcg = objcg;
>   		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
> @@ -3257,10 +3309,12 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>   		stock->nr_bytes &= (PAGE_SIZE - 1);
>   	}
>   
> -	put_obj_stock(flags);
> +	put_obj_stock(flags, stock_lock_acquried);
> +	if (old)
> +		obj_cgroup_put(old);
>   
>   	if (nr_pages)
> -		obj_cgroup_uncharge_pages(objcg, nr_pages);
> +		obj_cgroup_uncharge_pages(objcg, nr_pages, false);
>   }
>   
>   int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> @@ -7061,7 +7115,7 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
>   
>   	mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
>   
> -	refill_stock(memcg, nr_pages);
> +	refill_stock(memcg, nr_pages, false);
>   }
>   
>   static int __init cgroup_memory(char *s)



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
@ 2022-01-05 14:16     ` Michal Koutný
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Koutný @ 2022-01-05 14:16 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
	Peter Zijlstra

On Wed, Dec 22, 2021 at 12:41:09PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> The sections with disabled preemption must exclude
> memcg_check_events() so that spinlock_t locks can still be acquired
> (for instance in eventfd_signal()).

The resulting construct in uncharge_batch() raises eyebrows. If you can decouple
per-cpu updates from memcg_check_events() on PREEMPT_RT, why not tackle
it the same way on !PREEMPT_RT too (and have just one variant of the
block)?

(Actually, it doesn't seem to me that memcg_check_events() can be
extracted like this from the preempt disabled block since
mem_cgroup_event_ratelimit() relies on similar RMW pattern.
Things would be simpler if PREEMPT_RT didn't allow the threshold event
handlers (akin to Michal Hocko's suggestion of rejecting soft limit).)

Thanks,
Michal


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it.
@ 2022-01-05 14:59   ` Michal Koutný
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Koutný @ 2022-01-05 14:59 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
	Peter Zijlstra

On Wed, Dec 22, 2021 at 12:41:08PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> - lockdep complains were triggered by test_core and test_freezer (both
>   had to run):

This doesn't happen on the patched kernel, correct?

Thanks,
Michal


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it.
@ 2022-01-05 15:06     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-05 15:06 UTC (permalink / raw)
  To: Michal Koutný
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
	Peter Zijlstra

On 2022-01-05 15:59:56 [+0100], Michal Koutný wrote:
> On Wed, Dec 22, 2021 at 12:41:08PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > - lockdep complains were triggered by test_core and test_freezer (both
> >   had to run):
> 
> This doesn't happen on the patched kernel, correct?

I saw it first on the patched kernel (with this series), then went back
to the -rc and saw it there, too.

> Thanks,
> Michal

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
@ 2022-01-05 20:22           ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-05 20:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra

On 2022-01-03 10:04:29 [-0500], Waiman Long wrote:
> On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
> > Is there something you recommend as a benchmark where I could get some
> > numbers?
> 
> In the case of PREEMPT_DYNAMIC, it depends on the default setting which is
> used by most users. I will support disabling the optimization if
> defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
> CONFIG_)PREEMPTION alone.
> 
> As for microbenchmark, something that makes a lot of calls to malloc() or
> related allocations can be used.

Numbers I made:

        Sandy Bridge   Haswell        Skylake         AMD-A8 7100    Zen2           ARM64
PREEMPT 5,123,896,822  5,215,055,226   5,077,611,590  6,012,287,874  6,234,674,489  20,000,000,100
IRQ     7,494,119,638  6,810,367,629  10,620,130,377  4,178,546,086  4,898,076,012  13,538,461,925

For micro benchmarking I did 1,000,000,000 iterations of
preempt_disable()/enable() [PREEMPT] and local_irq_save()/restore()
[IRQ].
On a Sandy Bridge the PREEMPT loop took 5,123,896,822ns while the IRQ
loop took 7,494,119,638ns. The absolute numbers are not important; what is
worth noting is that preemption off/on is less expensive than IRQ off/on,
except for AMD and ARM64 where IRQ off/on was less expensive. The whole
loop was performed with disabled interrupts so I don't expect much
interference - but then I don't know how much the µArch optimized away on
local_irq_restore() given that the interrupts were already disabled.
I don't have any recent Intel HW (I think) so I don't know if this is an
Intel-only thing (preemption off/on being cheaper than IRQ off/on) but I guess
that the recent µArch would behave similarly to AMD.
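
The toggle loop was of roughly this shape (just a sketch assuming a throwaway
test module, not the exact harness used for the numbers above):

   #include <linux/module.h>
   #include <linux/ktime.h>
   #include <linux/preempt.h>
   #include <linux/irqflags.h>

   /* Sketch: toggle preemption resp. interrupts 10^9 times and print the delta. */
   static int __init toggle_bench_init(void)
   {
           unsigned long flags;
           ktime_t t0, t1;
           long i;

           local_irq_disable();            /* whole measurement runs with IRQs off */

           t0 = ktime_get();
           for (i = 0; i < 1000000000L; i++) {
                   preempt_disable();
                   preempt_enable();
           }
           t1 = ktime_get();
           pr_info("PREEMPT: %lld ns\n", ktime_to_ns(ktime_sub(t1, t0)));

           t0 = ktime_get();
           for (i = 0; i < 1000000000L; i++) {
                   local_irq_save(flags);
                   local_irq_restore(flags);
           }
           t1 = ktime_get();
           pr_info("IRQ: %lld ns\n", ktime_to_ns(ktime_sub(t1, t0)));

           local_irq_enable();
           return -ENODEV;                 /* don't keep the module loaded */
   }
   module_init(toggle_bench_init);
   MODULE_LICENSE("GPL");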

Moving on: For the test I ran 100,000,000 iterations of
     kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));

The BH suffix means that in_task() reported false during the allocation,
otherwise it reported true.
SD is the standard deviation.
SERVER means PREEMPT_NONE while PREEMPT means CONFIG_PREEMPT.
OPT means the optimisation (in_task() + task_obj) is active, NO-OPT
means no optimisation (irq_obj is always used).
The numbers are the time in ns needed for 100,000,000 iterations (alloc +
free). I ran 5 tests and used the median value here. If the standard
deviation exceeded 10^9 then I repeated the test; usually only one value
was off while the others remained in the same range.
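
The alloc/free loop itself, again only as a sketch of the shape (for the BH
runs the same loop would be kicked off from a context where in_task() is
false, e.g. a tasklet):

   #include <linux/slab.h>
   #include <linux/ktime.h>

   /* Sketch: time 100,000,000 accounted alloc/free pairs, result in ns. */
   static s64 memcg_alloc_bench(void)
   {
           ktime_t t0, t1;
           long i;

           t0 = ktime_get();
           for (i = 0; i < 100000000L; i++)
                   kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));
           t1 = ktime_get();

           return ktime_to_ns(ktime_sub(t1, t0));
   }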

Sandy Bridge
                 SERVER OPT   SERVER NO-OPT    PREEMPT OPT     PREEMPT NO-OPT
ALLOC/FREE    8,519,295,176   9,051,200,652    10,627,431,395  11,198,189,843
SD                5,309,768      29,253,976       129,102,317      40,681,909
ALLOC/FREE BH 9,996,704,330   8,927,026,031    11,680,149,900  11,139,356,465
SD               38,237,534      72,913,120        23,626,932     116,413,331

The optimisation is visible in the SERVER-OPT case (~1.5s difference in
the runtime, or 14.7ns per iteration). There is hardly any difference
between BH and !BH in the SERVER-NO-OPT case.
For the SERVER case, the optimisation improves the runtime by ~0.5s for
the !BH case.
For the PREEMPT case it also looks like a ~0.5s improvement in the !BH case
while in the BH case it looks the other way around.

                 DYN-SRV-OPT   DYN-SRV-NO-OPT    DYN-FULL-OPT   DYN-FULL-NO-OPT
ALLOC/FREE     11,069,180,584  10,773,407,543  10,963,581,285    10,826,207,969
SD                 23,195,912     112,763,104      13,145,589        33,543,625
ALLOC/FREE BH  11,443,342,069  10,720,094,700  11,064,914,727    10,955,883,521
SD                 81,150,074     171,299,554      58,603,778        84,131,143

DYN is CONFIG_PREEMPT_DYNAMIC enabled and CONFIG_PREEMPT_NONE is
default.  I don't see any difference vs CONFIG_PREEMPT except the
default preemption state (so I didn't test that). The preemption counter
is always forced in, so preempt_enable()/disable() is not optimized away.
SRV is the default value (PREEMPT_NONE) and FULL is the overridden
(preempt=full) state.

Based on that, I don't see any added value by the optimisation once
PREEMPT_DYNAMIC is enabled.

----
Zen2:
                 SERVER OPT   SERVER NO-OPT   PREEMPT OPT      PREEMPT NO-OPT
ALLOC/FREE    8,126,735,313   8,751,307,383    9,822,927,142   10,045,105,425
SD              100,806,471      87,234,047       55,170,179       25,832,386
ALLOC/FREE BH 9,197,455,885   8,394,337,053   10,671,227,095    9,904,954,934
SD              155,223,919      57,800,997       47,529,496      105,260,566

On Zen2, the IRQ off/on was less expensive than preempt off/on. So it
looks like I mixed up the numbers for PREEMPT OPT and NO-OPT, but I
re-ran it twice and nothing significant changed… However the difference
on PREEMPT for the !BH case is not as significant as on Sandy Bridge
(~200ms here vs ~500ms there).

                 DYN-SRV-OPT   DYN-SRV-NO-OPT    DYN-FULL-OPT  DYN-FULL-NO-OPT
ALLOC/FREE      9,680,498,929  10,180,973,847   9,644,453,405  10,224,416,854
SD                 73,944,156      61,850,527      13,277,203     107,145,212
ALLOC/FREE BH  10,680,074,634   9,956,695,323  10,704,442,515   9,942,155,910
SD                 75,535,172      34,524,493      54,625,678      87,163,920

For double testing and checking, the full git tree is available at [0]
and the script to parse the results is at [1].

[0] git://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging memcg
[1] https://breakpoint.cc/parse-memcg.py

> Cheers,
> Longman

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
@ 2022-01-06  3:28             ` Waiman Long
  0 siblings, 0 replies; 48+ messages in thread
From: Waiman Long @ 2022-01-06  3:28 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra

On 1/5/22 15:22, Sebastian Andrzej Siewior wrote:
> On 2022-01-03 10:04:29 [-0500], Waiman Long wrote:
>> On 1/3/22 09:44, Sebastian Andrzej Siewior wrote:
>>> Is there something you recommend as a benchmark where I could get some
>>> numbers?
>> In the case of PREEMPT_DYNAMIC, it depends on the default setting which is
>> used by most users. I will support disabling the optimization if
>> defined(CONFIG_PREEMPT_RT) || defined(CONFIG_PREEMPT), just not by
>> CONFIG_)PREEMPTION alone.
>>
>> As for microbenchmark, something that makes a lot of calls to malloc() or
>> related allocations can be used.
> Numbers I made:
>
>          Sandy Bridge   Haswell        Skylake         AMD-A8 7100    Zen2           ARM64
> PREEMPT 5,123,896,822  5,215,055,226   5,077,611,590  6,012,287,874  6,234,674,489  20,000,000,100
> IRQ     7,494,119,638  6,810,367,629  10,620,130,377  4,178,546,086  4,898,076,012  13,538,461,925

Thanks for the extensive testing. I usually perform my performance tests
on Intel hardware. I didn't realize that Zen2 and arm64 perform better
with irq on/off.


>
> For micro benchmarking I did 1.000.000.000 iterations of
> preempt_disable()/enable() [PREEMPT] and local_irq_save()/restore()
> [IRQ].
> On a Sandy Bridge the PREEMPT loop took 5,123,896,822ns while the IRQ
> loop took 7,494,119,638ns. The absolute numbers are not important, it is
> worth noting that preemption off/on is less expensive than IRQ off/on.
> Except for AMD and ARM64 where IRQ off/on was less expensive. The whole
> loop was performed with disabled interrupts so I don't expect much
> interference - but then I don't know much the µArch optimized away on
> local_irq_restore() given that the interrupts were already disabled.
> I don't have any recent Intel HW (I think) so I don't know if this is an
> Intel only thing (IRQ off/on cheaper than preemption off/on) but I guess
> that the recent uArch would behave similar to AMD.
>
> Moving on: For the test I run 100,000,000 iterations of
>       kfree(kmalloc(128, GFP_ATOMIC | __GFP_ACCOUNT));
>
> The BH suffix means that in_task() reported false during the allocation,
> otherwise it reported true.
> SD is the standard deviation.
> SERVER means PREEMPT_NONE while PREEMPT means CONFIG_PREEMPT.
> OPT means the optimisation (in_task() + task_obj) is active, NO-OPT
> means no optimisation (irq_obj is always used).
> The numbers are the time in ns needed for 100,000,000 iteration (alloc +
> free). I run 5 tests and used the median value here. If the standard
> deviation exceeded 10^9 then I repeated the test. The values remained in
> the same range since usually only one value was off and the other
> remained in the same range.
>
> Sandy Bridge
>                   SERVER OPT   SERVER NO-OPT    PREEMPT OPT     PREEMPT NO-OPT
> ALLOC/FREE    8,519,295,176   9,051,200,652    10,627,431,395  11,198,189,843
> SD                5,309,768      29,253,976       129,102,317      40,681,909
> ALLOC/FREE BH 9,996,704,330   8,927,026,031    11,680,149,900  11,139,356,465
> SD               38,237,534      72,913,120        23,626,932     116,413,331

My own testing, tracking the number of times in_task() is true or
false, indicated that most of the kmalloc() calls are done by tasks. Only a few
percent of the time is in_task() false. That is the reason why I
optimized the case where in_task() is true.
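
The tracking needs nothing more than a hook of roughly this shape (a sketch
with made-up counter names, not the exact instrumentation):

   #include <linux/percpu.h>
   #include <linux/preempt.h>

   /* Sketch: per-CPU tallies of the allocation context, dumped later via printk. */
   static DEFINE_PER_CPU(unsigned long, acct_in_task);
   static DEFINE_PER_CPU(unsigned long, acct_in_other);

   static inline void count_alloc_context(void)
   {
           if (in_task())
                   this_cpu_inc(acct_in_task);
           else
                   this_cpu_inc(acct_in_other);
   }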


>
> The optimisation is visible in the SERVER-OPT case (~1.5s difference in
> the runtime (or 14.7ns per iteration)). There is hardly any difference
> between BH and !BH in the SERVER-NO-OPT case.
> For the SERVER case, the optimisation improves ~0.5s in runtime for the
> !BH case.
> For the PREEMPT case it also looks like ~0.5s improvement in the BH case
> while in the BH case it looks the other way around.
>
>                   DYN-SRV-OPT   DYN-SRV-NO-OPT    DYN-FULL-OPT   DYN-FULL-NO-OPT
> ALLOC/FREE     11,069,180,584  10,773,407,543  10,963,581,285    10,826,207,969
> SD                 23,195,912     112,763,104      13,145,589        33,543,625
> ALLOC/FREE BH  11,443,342,069  10,720,094,700  11,064,914,727    10,955,883,521
> SD                 81,150,074     171,299,554      58,603,778        84,131,143
>
> DYN is CONFIG_PREEMPT_DYNAMIC enabled and CONFIG_PREEMPT_NONE is
> default.  I don't see any difference vs CONFIG_PREEMPT except the
> default preemption state (so I didn't test that). The preemption counter
> is always forced-in so preempt_enable()/disable() is not optimized away.
> SRV is the default value (PREEMPT_NONE) and FULL is the overriden
> (preempt=full) state.
>
> Based on that, I don't see any added value by the optimisation once
> PREEMPT_DYNAMIC is enabled.

The PREEMPT_DYNAMIC result is a bit surprising to me. Given the data 
points, I am not going to object to this patch then. I will try to look 
further into why this is the case when I have time.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
@ 2022-01-13 13:08       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-13 13:08 UTC (permalink / raw)
  To: Michal Koutný
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
	Peter Zijlstra

On 2022-01-05 15:16:53 [+0100], Michal Koutný wrote:
> On Wed, Dec 22, 2021 at 12:41:09PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > The sections with disabled preemption must exclude
> > memcg_check_events() so that spinlock_t locks can still be acquired
> > (for instance in eventfd_signal()).
> 
> The resulting construct in uncharge_batch() raises eybrows. If you can decouple
> per-cpu updates from memcg_check_events() on PREEMPT_RT, why not tackle
> it the same way on !PREEMPT_RT too (and have just one variant of the
> block)?

That would mean that mem_cgroup_event_ratelimit() needs a
local_irq_save(). If that is okay then sure I can move it that way.

> (Actually, it doesn't seem to me that memcg_check_events() can be
> extracted like this from the preempt disabled block since
> mem_cgroup_event_ratelimit() relies on similar RMW pattern.
> Things would be simpler if PREEMPT_RT didn't allow the threshold event
> handlers (akin to Michal Hocko's suggestion of rejecting soft limit).)

I added a preempt-disable() section restricted to RT to
mem_cgroup_event_ratelimit(). I had to exclude memcg_check_events() from
the block because of the spinlock_t locks involved down the road
(eventfd_signal() for instance).

I remember Michal (Hocko) suggested excluding/rejecting the soft limit, but
I didn't know where exactly or what the implications would be. In this block I
just followed the replacement of irq-off with preempt-off for RT.

> Thanks,
> Michal

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
@ 2022-01-13 14:48         ` Michal Koutný
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Koutný @ 2022-01-13 14:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
	Peter Zijlstra

On Thu, Jan 13, 2022 at 02:08:10PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> I added a preempt-disable() section restricted to RT to
> mem_cgroup_event_ratelimit().

Oh, I missed that one.

(Then the decoupling of mem_cgroup_event_ratelimit() also makes
some more sense.)
> That would mean that mem_cgroup_event_ratelimit() needs a
> local_irq_save(). If that is okay then sure I can move it that way.

Whatever avoids the twisted code :-)

---

> I remember Michal (Hocko) suggested excluding/ rejecting soft limit but
> I didn't know where exactly and its implications. In this block here I
> just followed the replacement of irq-off with preempt-off for RT.

Both soft limit and (these) event notifications are v1 features. The soft
limit itself is even considered rather a misfeature. I guess the
implications would not be many since PREEMPT_RT+memcg users would be
new(?) and so should rather start with v2 anyway.

One way to disable it would be to reject writes into
memory.soft_limit_in_bytes or cgroup.event_control + documentation of
that.

Michal


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
@ 2022-01-13 15:26               ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-13 15:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Peter Zijlstra

On 2022-01-05 22:28:10 [-0500], Waiman Long wrote:
> Thanks for the extensive testing. I usually perform my performance test on
> Intel hardware. I don't realize that Zen2 and arm64 perform better with irq
> on/off.

Maybe you have access to a more recent µArch from Intel which could behave
differently.

> My own testing when tracking the number of times in_task() is true or false
> indicated most of the kmalloc() call is done by tasks. Only a few percents
> of the time is in_task() false. That is the reason why I optimize the case
> that in_task() is true.

Right. This relies on the fact that changing preemption is cheaper, which
is not always true. The ultimate benefit is of course when the
preemption changes can be removed/optimized away.

> > Based on that, I don't see any added value by the optimisation once
> > PREEMPT_DYNAMIC is enabled.
> 
> The PREEMPT_DYNAMIC result is a bit surprising to me. Given the data points,
> I am not going to object to this patch then. I will try to look further into
> why this is the case when I have time.

Okay, thank you.
In the SERVER case we keep the preemption counter so this obviously has
an impact. I am a little surprised that DYN-FULL and DYN-NONE
differ a little since the code runs with disabled interrupts. But then
this might be the extra jump to preempt_schedule() which is patched out
in the SERVER case.

> Cheers,
> Longman

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
@ 2022-01-14  9:09           ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-14  9:09 UTC (permalink / raw)
  To: Michal Koutný
  Cc: cgroups, linux-mm, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Thomas Gleixner, Waiman Long,
	Peter Zijlstra

On 2022-01-13 15:48:03 [+0100], Michal Koutný wrote:
> On Thu, Jan 13, 2022 at 02:08:10PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > I added a preempt-disable() section restricted to RT to
> > mem_cgroup_event_ratelimit().
> 
> Oh, I missed that one.
> 
> (Than the decoupling of such mem_cgroup_event_ratelimit() also makes
> some more sense.)
> > That would mean that mem_cgroup_event_ratelimit() needs a
> > local_irq_save(). If that is okay then sure I can move it that way.
> 
> Whatever avoids the twisted code :-)

Okay. So let me do that.

> ---
> 
> > I remember Michal (Hocko) suggested excluding/ rejecting soft limit but
> > I didn't know where exactly and its implications. In this block here I
> > just followed the replacement of irq-off with preempt-off for RT.
> 
> Both soft limit and (these) event notifications are v1 features. Soft
> limit itself is rather considered even misfeature. I guess the
> implications would not be many since PREEMPT_RT+memcg users would be
> new(?) so should rather start with v2 anyway.

People often migrate to RT so they take whatever they have. In general I
would like to keep RT & !RT in sync unless there are reasons to do it
differently.

> One way to disable it would be to reject writes into
> memory.soft_limit_in_bytes or cgroup.event_control + documentation of
> that.

So avoiding these two also avoids memcg_check_events()?
Right now it does not look like a big obstacle. It is the same pattern
I'm following for the per-CPU RMW. If it is, I could avoid the writes
and #if 0 out the other function for RT.
If I remove memcg_check_events() from the equation then we could keep
the explicit irq-off regions (plus add a few). The only that would look
odd then is that we disable interrupts for the RMW operation and
preemption in other places (__count_memcg_events() invocations in swap.c
and vmscan.c). 

Are there plans to remove v1 or is this part of "we must not break
userland"?

> Michal

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH] mm/memcg: Do not check v1 event counter when not needed
@ 2022-01-18 18:26             ` Michal Koutný
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Koutný @ 2022-01-18 18:26 UTC (permalink / raw)
  To: bigeasy
  Cc: akpm, cgroups, hannes, linux-mm, longman, mhocko, mkoutny,
	peterz, tglx, vdavydov.dev

The function memcg_check_events() is called to trigger possible event
notifications or soft limit updates when the page event "clock" moves
sufficiently. This tracking is not needed when neither soft limit nor (v1)
event notifications are configured. The tracking can catch up
with the clock at any time upon threshold configuration.

Guard this functionality behind an unlikely static branch (soft limit
and events are presumably rather unused than used).

This has a slight, insignificant performance gain in a page-fault-specific
benchmark but overall no performance impact is expected. The goal is to
partition the charging code per provided user functionality.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 mm/memcontrol.c | 8 ++++++++
 1 file changed, 8 insertions(+)


On Fri, Jan 14, 2022 at 10:09:35AM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> So avoiding these two also avoids memcg_check_events()?

I've made the matter explicit with the surrounding patch.

[
The performance "gain" is negligible (differences of pft [1] are dominated by
non-root memcg classification):

                         nocg, nopatch            cg, nopatch              nocg, patch              cg, patch
Hmean     faults/sec-2   273366.6312 (   0.00%)   243573.3767 * -10.90%*   273901.9709 *   0.20%*   247702.4104 *  -9.39%*
CoeffVar  faults/sec-2        3.8771 (   0.00%)        3.8396 (   0.97%)        3.1400 (  19.01%)        4.1188 (  -6.24%)

                                                  cg, nopatch                                       cg, patch
Hmean     faults/sec-2                            243573.3767 (   0.00%)                            247702.4104 *   1.70%*
CoeffVar  faults/sec-2                                 3.8396 (   0.00%)                                 4.1188 (  -7.27%)

On less targeted benchmarks it's well below noise.
]

I think it would make sense to insert the patch into your series and
subsequently reject enabling it on PREEMPT_RT -- provided this patch makes sense
to others too -- the justification is rather functionality splitting for
this PREEMPT_RT effort.

> Are there plans to remove v1 or is this part of "we must not break
> userland"?

It's part of that mantra, so v1 can't be simply removed. OTOH, my sense is
that this change also fits under not extending v1 (to avoid doubling effort on
everything).

Michal

[1] https://github.com/gormanm/mmtests/blob/6bcb8b301a48386e0cc63a21f7642048a3ceaed5/configs/config-pagealloc-performance#L6

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4a7b3ebf8e48..7f64ce33d137 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -106,6 +106,8 @@ static bool do_memsw_account(void)
 #define THRESHOLDS_EVENTS_TARGET 128
 #define SOFTLIMIT_EVENTS_TARGET 1024
 
+DEFINE_STATIC_KEY_FALSE(memcg_v1_events_enabled_key);
+
 /*
  * Cgroups above their limits are maintained in a RB-Tree, independent of
  * their hierarchy representation
@@ -852,6 +854,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
  */
 static void memcg_check_events(struct mem_cgroup *memcg, int nid)
 {
+	if (!static_branch_unlikely(&memcg_v1_events_enabled_key))
+		return;
+
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
@@ -3757,6 +3762,7 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 		break;
 	case RES_SOFT_LIMIT:
 		memcg->soft_limit = nr_pages;
+		static_branch_enable(&memcg_v1_events_enabled_key);
 		ret = 0;
 		break;
 	}
@@ -4831,6 +4837,8 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	list_add(&event->list, &memcg->event_list);
 	spin_unlock_irq(&memcg->event_list_lock);
 
+	static_branch_enable(&memcg_v1_events_enabled_key);
+
 	fdput(cfile);
 	fdput(efile);
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH] mm/memcg: Do not check v1 event counter when not needed
@ 2022-01-18 19:57               ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-01-18 19:57 UTC (permalink / raw)
  To: Michal Koutný
  Cc: akpm, cgroups, hannes, linux-mm, longman, mhocko, peterz, tglx,
	vdavydov.dev

On 2022-01-18 19:26:00 [+0100], Michal Koutný wrote:
> I think it would make sense inserting the patch into your series and
> subsequently reject enabling on PREEMPT_RT -- provided this patch makes sense
> to others too -- the justification is rather functionality splitting for
> this PREEMPT_RT effort.

Interesting. So while looking at this today I came up with the patch at
the bottom. The other variants I had looked way uglier, and since
probably nobody will use it anyway…
Let me know how you want it to be integrated.

------>8------

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Tue, 18 Jan 2022 17:28:07 +0100
Subject: [PATCH] mm/memcg: Disable threshold event handlers on PREEMPT_RT
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During the integration of PREEMPT_RT support, the code flow around
memcg_check_events() resulted in `twisted code'. Moving the code around
to avoid that would then lead to an additional local-irq-save
section within memcg_check_events(). While looking better, it adds a
local-irq-save section to a code flow which is usually already within a
local-irq-save block.

The threshold event handler is a deprecated memcg v1 feature. Instead of
trying to get it to work under PREEMPT_RT, just disable it. There should
not have been any users on PREEMPT_RT. From that perspective it makes
even less sense to get it to work under PREEMPT_RT while having zero
users.

Make memory.soft_limit_in_bytes and cgroup.event_control return
-EOPNOTSUPP on PREEMPT_RT. Make memcg_check_events() empty on PREEMPT_RT
since it won't do anything. Document that the two knobs are disabled on
PREEMPT_RT.

Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 Documentation/admin-guide/cgroup-v1/memory.rst |  2 ++
 mm/memcontrol.c                                | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index faac50149a222..2cc502a75ef64 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -64,6 +64,7 @@ Brief summary of control files.
 				     threads
  cgroup.procs			     show list of processes
  cgroup.event_control		     an interface for event_fd()
+				     This knob is not available on CONFIG_PREEMPT_RT systems.
  memory.usage_in_bytes		     show current usage for memory
 				     (See 5.5 for details)
  memory.memsw.usage_in_bytes	     show current usage for memory+Swap
@@ -75,6 +76,7 @@ Brief summary of control files.
  memory.max_usage_in_bytes	     show max memory usage recorded
  memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
  memory.soft_limit_in_bytes	     set/show soft limit of memory usage
+				     This knob is not available on CONFIG_PREEMPT_RT systems.
  memory.stat			     show various statistics
  memory.use_hierarchy		     set/show hierarchical account enabled
                                      This knob is deprecated and shouldn't be
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ed5f2a0879d3..3c4f7a0fd0039 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -821,6 +821,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 	__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
 }
 
+#ifndef CONFIG_PREEMPT_RT
 static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 				       enum mem_cgroup_events_target target)
 {
@@ -864,6 +865,9 @@ static void memcg_check_events(struct mem_cgroup *memcg, int nid)
 			mem_cgroup_update_tree(memcg, nid);
 	}
 }
+#else
+static void memcg_check_events(struct mem_cgroup *memcg, int nid) { }
+#endif
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 {
@@ -3751,8 +3755,12 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 		}
 		break;
 	case RES_SOFT_LIMIT:
+#ifndef CONFIG_PREEMPT_RT
 		memcg->soft_limit = nr_pages;
 		ret = 0;
+#else
+		ret = -EOPNOTSUPP;
+#endif
 		break;
 	}
 	return ret ?: nbytes;
@@ -4717,6 +4725,7 @@ static void memcg_event_ptable_queue_proc(struct file *file,
 static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 					 char *buf, size_t nbytes, loff_t off)
 {
+#ifndef CONFIG_PREEMPT_RT
 	struct cgroup_subsys_state *css = of_css(of);
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup_event *event;
@@ -4843,6 +4852,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	kfree(event);
 
 	return ret;
+#else
+	return -EOPNOTSUPP;
+#endif
 }
 
 static struct cftype mem_cgroup_legacy_files[] = {
-- 
2.34.1
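
Not part of the patch, just for illustration: a minimal userspace sketch
of how the new behaviour could be observed. It assumes the v1 memory
controller is mounted at /sys/fs/cgroup/memory and that a child group
named "test" exists there; both paths are only examples, not something
the patch itself provides. With the patch applied, the write is expected
to fail with EOPNOTSUPP on a PREEMPT_RT kernel and to succeed otherwise.

/* check_soft_limit.c - illustrative sketch, not from the patch */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical path; adjust to the actual v1 memcg mount point. */
	const char *knob =
		"/sys/fs/cgroup/memory/test/memory.soft_limit_in_bytes";
	const char *val = "104857600\n";	/* 100 MiB */
	int fd = open(knob, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (write(fd, val, strlen(val)) < 0) {
		if (errno == EOPNOTSUPP)
			puts("soft_limit_in_bytes disabled (PREEMPT_RT)");
		else
			perror("write");
	} else {
		puts("soft limit set (non-PREEMPT_RT kernel)");
	}

	close(fd);
	return 0;
}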



^ permalink raw reply related	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2022-01-18 19:57 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-22 11:41 [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Sebastian Andrzej Siewior
2021-12-22 11:41 ` Sebastian Andrzej Siewior
2021-12-22 11:41 ` [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT Sebastian Andrzej Siewior
2021-12-22 11:41   ` Sebastian Andrzej Siewior
2021-12-23  2:31   ` Waiman Long
2021-12-23  2:31     ` Waiman Long
2021-12-23  7:34     ` Sebastian Andrzej Siewior
2021-12-23  7:34       ` Sebastian Andrzej Siewior
2021-12-23 16:01       ` Waiman Long
2021-12-23 16:01         ` Waiman Long
2022-01-05 14:16   ` Michal Koutný
2022-01-05 14:16     ` Michal Koutný
2022-01-13 13:08     ` Sebastian Andrzej Siewior
2022-01-13 13:08       ` Sebastian Andrzej Siewior
2022-01-13 14:48       ` Michal Koutný
2022-01-13 14:48         ` Michal Koutný
2022-01-14  9:09         ` Sebastian Andrzej Siewior
2022-01-14  9:09           ` Sebastian Andrzej Siewior
2022-01-18 18:26           ` [PATCH] mm/memcg: Do not check v1 event counter when not needed Michal Koutný
2022-01-18 18:26             ` Michal Koutný
2022-01-18 19:57             ` Sebastian Andrzej Siewior
2022-01-18 19:57               ` Sebastian Andrzej Siewior
2021-12-22 11:41 ` [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object Sebastian Andrzej Siewior
2021-12-22 11:41   ` Sebastian Andrzej Siewior
2021-12-23 21:38   ` Waiman Long
2021-12-23 21:38     ` Waiman Long
2022-01-03 16:34     ` Sebastian Andrzej Siewior
2022-01-03 16:34       ` Sebastian Andrzej Siewior
2022-01-03 17:09       ` Waiman Long
2022-01-03 17:09         ` Waiman Long
2021-12-22 11:41 ` [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels Sebastian Andrzej Siewior
2021-12-22 11:41   ` Sebastian Andrzej Siewior
2021-12-23 21:48   ` Waiman Long
2021-12-23 21:48     ` Waiman Long
2022-01-03 14:44     ` Sebastian Andrzej Siewior
2022-01-03 14:44       ` Sebastian Andrzej Siewior
2022-01-03 15:04       ` Waiman Long
2022-01-03 15:04         ` Waiman Long
2022-01-05 20:22         ` Sebastian Andrzej Siewior
2022-01-05 20:22           ` Sebastian Andrzej Siewior
2022-01-06  3:28           ` Waiman Long
2022-01-06  3:28             ` Waiman Long
2022-01-13 15:26             ` Sebastian Andrzej Siewior
2022-01-13 15:26               ` Sebastian Andrzej Siewior
2022-01-05 14:59 ` [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Michal Koutný
2022-01-05 14:59   ` Michal Koutný
2022-01-05 15:06   ` Sebastian Andrzej Siewior
2022-01-05 15:06     ` Sebastian Andrzej Siewior
