* [PATCH v3 0/5] mm/memcg: Address PREEMPT_RT problems instead of disabling it.
From: Sebastian Andrzej Siewior @ 2022-02-17  9:47 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long

Hi,

this series aims to address the memcg-related problems on PREEMPT_RT.

I tested the series on CONFIG_PREEMPT and CONFIG_PREEMPT_RT kernels with
the tools/testing/selftests/cgroup/* tests and I haven't observed any
regressions (other than the lockdep report that is already there).

Changes since v2:
- rebased on top of v5.17-rc4-mmots-2022-02-15-20-39.

- Added memcg_stats_lock() in 3/5 so it is a little more obvious and
  hopefully easier to maintain.

- Opencoded obj_cgroup_uncharge_pages() in drain_obj_stock(). The
  __locked suffix was confusing.

v2: https://lore.kernel.org/all/20220211223537.2175879-1-bigeasy@linutronix.de/

Changes since v1:
- Made a full patch from Michal Hocko's diff to disable the from-IRQ vs
  from-task optimisation.

- Disabling the threshold event handlers now uses
  IS_ENABLED(CONFIG_PREEMPT_RT) instead of an #ifdef. The outcome is the
  same but there is no need to shuffle the code around.

v1: https://lore.kernel.org/all/20220125164337.2071854-1-bigeasy@linutronix.de/

Changes since the RFC:
- cgroup.event_control / memory.soft_limit_in_bytes are disabled on
  PREEMPT_RT. They are deprecated v1 features. Fixing the signal path is
  not worth it.

- The updates to per-CPU counters are usually synchronised by disabling
  interrupts. There are a few spots where the assumption about disabled
  interrupts does not hold on PREEMPT_RT and therefore preemption is
  disabled instead. This is okay since the counters are never written
  from in_irq() context, as sketched below.
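
For illustration only (not part of the series): a minimal sketch of that
scheme with a hypothetical per-CPU counter (example_counter); the real
accessors are introduced in patch 3/5.

  #include <linux/percpu.h>
  #include <linux/preempt.h>

  static DEFINE_PER_CPU(unsigned long, example_counter);

  static void example_counter_add(unsigned long val)
  {
  	/*
  	 * On PREEMPT_RT a preempt-disabled section is enough to make the
  	 * non-atomic RMW safe because the counter is never written from
  	 * in_irq() context.
  	 */
  	preempt_disable();
  	__this_cpu_add(example_counter, val);
  	preempt_enable();
  }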

RFC: https://lore.kernel.org/all/20211222114111.2206248-1-bigeasy@linutronix.de/

Sebastian




* [PATCH v3 1/5] mm/memcg: Revert ("mm/memcg: optimize user context object stock access")
From: Sebastian Andrzej Siewior @ 2022-02-17  9:47 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Michal Hocko, Sebastian Andrzej Siewior, Roman Gushchin

From: Michal Hocko <mhocko@suse.com>

The optimisation is based on a micro benchmark where local_irq_save() is
more expensive than a preempt_disable(). There is no evidence that it is
visible in a real-world workload and there are CPUs where the opposite is
true (local_irq_save() is cheaper than preempt_disable()).

Based on micro benchmarks, the optimisation makes sense on PREEMPT_NONE
where preempt_disable() is optimized away. There is no improvement with
PREEMPT_DYNAMIC since the preemption counter is always available.

The optimisation also makes the PREEMPT_RT integration more complicated
since most of its assumptions are not true on PREEMPT_RT.
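
For illustration only: a sketch of the two critical-section forms being
compared; the function names here are hypothetical, the real helpers
removed by this patch are get_obj_stock()/put_obj_stock() in the diff
below.

  #include <linux/irqflags.h>
  #include <linux/preempt.h>

  /* Fast path the optimisation used for task context only: */
  static void stock_access_task_context(void)
  {
  	preempt_disable();
  	/* ... access the per-CPU object stock ... */
  	preempt_enable();
  }

  /* Form the revert always uses, valid in task and interrupt context: */
  static void stock_access_any_context(void)
  {
  	unsigned long flags;

  	local_irq_save(flags);
  	/* ... access the per-CPU object stock ... */
  	local_irq_restore(flags);
  }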

Revert the optimisation since it complicates the PREEMPT_RT integration
and the improvement is hardly visible.

[ bigeasy: Patch body around Michal's diff ]

Link: https://lore.kernel.org/all/YgOGkXXCrD%2F1k+p4@dhcp22.suse.cz
Link: https://lkml.kernel.org/r/YdX+INO9gQje6d0S@linutronix.de
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 94 ++++++++++++++-----------------------------------
 1 file changed, 27 insertions(+), 67 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3c4816147273a..8ab2dc75e70ec 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2078,23 +2078,17 @@ void unlock_page_memcg(struct page *page)
 	folio_memcg_unlock(page_folio(page));
 }
 
-struct obj_stock {
+struct memcg_stock_pcp {
+	struct mem_cgroup *cached; /* this never be root cgroup */
+	unsigned int nr_pages;
+
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
 	int nr_slab_reclaimable_b;
 	int nr_slab_unreclaimable_b;
-#else
-	int dummy[0];
 #endif
-};
-
-struct memcg_stock_pcp {
-	struct mem_cgroup *cached; /* this never be root cgroup */
-	unsigned int nr_pages;
-	struct obj_stock task_obj;
-	struct obj_stock irq_obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2104,13 +2098,13 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct obj_stock *stock);
+static void drain_obj_stock(struct memcg_stock_pcp *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages);
 
 #else
-static inline void drain_obj_stock(struct obj_stock *stock)
+static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2190,9 +2184,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->irq_obj);
-	if (in_task())
-		drain_obj_stock(&stock->task_obj);
+	drain_obj_stock(stock);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -2767,41 +2759,6 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
  */
 #define OBJCGS_CLEAR_MASK	(__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
 
-/*
- * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
- * sequence used in this case to access content from object stock is slow.
- * To optimize for user context access, there are now two object stocks for
- * task context and interrupt context access respectively.
- *
- * The task context object stock can be accessed by disabling preemption only
- * which is cheap in non-preempt kernel. The interrupt context object stock
- * can only be accessed after disabling interrupt. User context code can
- * access interrupt object stock, but not vice versa.
- */
-static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
-{
-	struct memcg_stock_pcp *stock;
-
-	if (likely(in_task())) {
-		*pflags = 0UL;
-		preempt_disable();
-		stock = this_cpu_ptr(&memcg_stock);
-		return &stock->task_obj;
-	}
-
-	local_irq_save(*pflags);
-	stock = this_cpu_ptr(&memcg_stock);
-	return &stock->irq_obj;
-}
-
-static inline void put_obj_stock(unsigned long flags)
-{
-	if (likely(in_task()))
-		preempt_enable();
-	else
-		local_irq_restore(flags);
-}
-
 /*
  * mod_objcg_mlstate() may be called with irq enabled, so
  * mod_memcg_lruvec_state() should be used.
@@ -3082,10 +3039,13 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		     enum node_stat_item idx, int nr)
 {
+	struct memcg_stock_pcp *stock;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
 	int *bytes;
 
+	local_irq_save(flags);
+	stock = this_cpu_ptr(&memcg_stock);
+
 	/*
 	 * Save vmstat data in stock and skip vmstat array update unless
 	 * accumulating over a page of vmstat data or when pgdat or idx
@@ -3136,26 +3096,29 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	if (nr)
 		mod_objcg_mlstate(objcg, pgdat, idx, nr);
 
-	put_obj_stock(flags);
+	local_irq_restore(flags);
 }
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
+	struct memcg_stock_pcp *stock;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
 	bool ret = false;
 
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
 	}
 
-	put_obj_stock(flags);
+	local_irq_restore(flags);
 
 	return ret;
 }
 
-static void drain_obj_stock(struct obj_stock *stock)
+static void drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
@@ -3211,13 +3174,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (in_task() && stock->task_obj.cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
-		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
-			return true;
-	}
-	if (stock->irq_obj.cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
+	if (stock->cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3228,10 +3186,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			     bool allow_uncharge)
 {
+	struct memcg_stock_pcp *stock;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
 	unsigned int nr_pages = 0;
 
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3247,7 +3208,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	put_obj_stock(flags);
+	local_irq_restore(flags);
 
 	if (nr_pages)
 		obj_cgroup_uncharge_pages(objcg, nr_pages);
@@ -6812,7 +6773,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	long nr_pages;
 	struct mem_cgroup *memcg;
 	struct obj_cgroup *objcg;
-	bool use_objcg = folio_memcg_kmem(folio);
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 
@@ -6821,7 +6781,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	 * folio memcg or objcg at this point, we have fully
 	 * exclusive access to the folio.
 	 */
-	if (use_objcg) {
+	if (folio_memcg_kmem(folio)) {
 		objcg = __folio_objcg(folio);
 		/*
 		 * This get matches the put at the end of the function and
@@ -6849,7 +6809,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 
 	nr_pages = folio_nr_pages(folio);
 
-	if (use_objcg) {
+	if (folio_memcg_kmem(folio)) {
 		ug->nr_memory += nr_pages;
 		ug->nr_kmem += nr_pages;
 
-- 
2.34.1



* [PATCH v3 2/5] mm/memcg: Disable threshold event handlers on PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2022-02-17  9:47 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Sebastian Andrzej Siewior, Roman Gushchin

During the integration of PREEMPT_RT support, the code flow around
memcg_check_events() resulted in `twisted code'. Moving the code around
to avoid that would then require an additional local-irq-save section
within memcg_check_events(). While that looks better, it adds a
local-irq-save section to a code flow which usually already runs within
a local-irq-off block on non-PREEMPT_RT configurations.

The threshold event handler is a deprecated memcg v1 feature. Instead of
trying to get it to work under PREEMPT_RT, just disable it. There should
be no users on PREEMPT_RT, which makes it even less worthwhile to get it
to work there.

Make memory.soft_limit_in_bytes and cgroup.event_control return
-EOPNOTSUPP on PREEMPT_RT. Turn memcg_check_events() into an empty
function and have memcg_write_event_control() return -EOPNOTSUPP on
PREEMPT_RT. Document that the two knobs are disabled on PREEMPT_RT.
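
For illustration only, a small user-space probe of the new behaviour;
the cgroup v1 mount point and group name below are assumptions, not part
of the patch.

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
  	/* Assumed v1 mount point and group name; adjust to the local setup. */
  	const char *knob =
  		"/sys/fs/cgroup/memory/test/memory.soft_limit_in_bytes";
  	int fd = open(knob, O_WRONLY);

  	if (fd < 0)
  		return 1;
  	/* With this patch a PREEMPT_RT kernel fails the write with EOPNOTSUPP. */
  	if (write(fd, "104857600", 9) < 0)
  		printf("write: %s\n", strerror(errno));
  	close(fd);
  	return 0;
  }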

Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/cgroup-v1/memory.rst |  2 ++
 mm/memcontrol.c                                | 14 ++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index faac50149a222..2cc502a75ef64 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -64,6 +64,7 @@ Brief summary of control files.
 				     threads
  cgroup.procs			     show list of processes
  cgroup.event_control		     an interface for event_fd()
+				     This knob is not available on CONFIG_PREEMPT_RT systems.
  memory.usage_in_bytes		     show current usage for memory
 				     (See 5.5 for details)
  memory.memsw.usage_in_bytes	     show current usage for memory+Swap
@@ -75,6 +76,7 @@ Brief summary of control files.
  memory.max_usage_in_bytes	     show max memory usage recorded
  memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
  memory.soft_limit_in_bytes	     set/show soft limit of memory usage
+				     This knob is not available on CONFIG_PREEMPT_RT systems.
  memory.stat			     show various statistics
  memory.use_hierarchy		     set/show hierarchical account enabled
                                      This knob is deprecated and shouldn't be
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8ab2dc75e70ec..0b5117ed2ae08 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -859,6 +859,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
  */
 static void memcg_check_events(struct mem_cgroup *memcg, int nid)
 {
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		return;
+
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
@@ -3731,8 +3734,12 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 		}
 		break;
 	case RES_SOFT_LIMIT:
-		memcg->soft_limit = nr_pages;
-		ret = 0;
+		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+			ret = -EOPNOTSUPP;
+		} else {
+			memcg->soft_limit = nr_pages;
+			ret = 0;
+		}
 		break;
 	}
 	return ret ?: nbytes;
@@ -4708,6 +4715,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	char *endp;
 	int ret;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		return -EOPNOTSUPP;
+
 	buf = strstrip(buf);
 
 	efd = simple_strtoul(buf, &endp, 10);
-- 
2.34.1



* [PATCH v3 3/5] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT where needed.
From: Sebastian Andrzej Siewior @ 2022-02-17  9:48 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Sebastian Andrzej Siewior, Roman Gushchin

The per-CPU counters are modified with non-atomic operations.
Consistency is ensured by disabling interrupts for the update.
On non-PREEMPT_RT configurations this works because acquiring a
spinlock_t typed lock with the _irq() suffix disables interrupts. On
PREEMPT_RT configurations the RMW operation can be interrupted.
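
For illustration only, a condensed sketch of the broken assumption with
hypothetical lock and counter names (example_lock, example_stat); the
affected call sites are in the diff below.

  #include <linux/percpu.h>
  #include <linux/spinlock.h>

  static DEFINE_SPINLOCK(example_lock);
  static DEFINE_PER_CPU(unsigned long, example_stat);

  static void example_update(unsigned long val)
  {
  	spin_lock_irq(&example_lock);
  	/*
  	 * !PREEMPT_RT: interrupts are off here, the non-atomic RMW is safe.
  	 * PREEMPT_RT: spinlock_t is a sleeping lock and interrupts stay
  	 * enabled, so the update below needs the memcg_stats_lock() added
  	 * in this patch.
  	 */
  	__this_cpu_add(example_stat, val);
  	spin_unlock_irq(&example_lock);
  }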

Another problem is that mem_cgroup_swapout() expects to be invoked with
disabled interrupts because the caller has to acquire a spinlock_t which
is acquired with disabled interrupts. Since spinlock_t never disables
interrupts on PREEMPT_RT the interrupts are never disabled at this
point.

The code is never called from in_irq() context on PREEMPT_RT therefore
disabling preemption during the update is sufficient on PREEMPT_RT.
The sections which explicitly disable interrupts can remain on
PREEMPT_RT because the sections remain short and they don't involve
sleeping locks (memcg_check_events() is doing nothing on PREEMPT_RT).

Disable preemption during update of the per-CPU variables which do not
explicitly disable interrupts.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0b5117ed2ae08..36ab3660f2c6d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -630,6 +630,28 @@ static DEFINE_SPINLOCK(stats_flush_lock);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 
+/*
+ * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
+ * not rely on this as part of an acquired spinlock_t lock. These functions are
+ * never used in hardirq context on PREEMPT_RT and therefore disabling preemption
+ * is sufficient.
+ */
+static void memcg_stats_lock(void)
+{
+#ifdef CONFIG_PREEMPT_RT
+      preempt_disable();
+#else
+      VM_BUG_ON(!irqs_disabled());
+#endif
+}
+
+static void memcg_stats_unlock(void)
+{
+#ifdef CONFIG_PREEMPT_RT
+      preempt_enable();
+#endif
+}
+
 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
 	unsigned int x;
@@ -706,6 +728,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
+	memcg_stats_lock();
 	/* Update memcg */
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 
@@ -713,6 +736,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
 
 	memcg_rstat_updated(memcg, val);
+	memcg_stats_unlock();
 }
 
 /**
@@ -795,8 +819,10 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 	if (mem_cgroup_disabled())
 		return;
 
+	memcg_stats_lock();
 	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
 	memcg_rstat_updated(memcg, count);
+	memcg_stats_unlock();
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
@@ -7140,8 +7166,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * important here to have the interrupts disabled because it is the
 	 * only synchronisation we have for updating the per-CPU variables.
 	 */
-	VM_BUG_ON(!irqs_disabled());
+	memcg_stats_lock();
 	mem_cgroup_charge_statistics(memcg, -nr_entries);
+	memcg_stats_unlock();
 	memcg_check_events(memcg, page_to_nid(page));
 
 	css_put(&memcg->css);
-- 
2.34.1



* [PATCH v3 4/5] mm/memcg: Opencode the inner part of obj_cgroup_uncharge_pages() in drain_obj_stock()
From: Sebastian Andrzej Siewior @ 2022-02-17  9:48 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Sebastian Andrzej Siewior

From: Johannes Weiner <hannes@cmpxchg.org>

Provide the inner part of refill_stock() as __refill_stock() without
disabling interrupts. This eases the integration of local_lock_t where
recursive locking must be avoided.
Open code obj_cgroup_uncharge_pages() in drain_obj_stock() and use
__refill_stock(). The caller of drain_obj_stock() already disables
interrupts.
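
For illustration only, the shape of the split in condensed form; the
complete version is in the diff below.

  /* Inner part: expects the caller to have interrupts disabled already. */
  static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  {
  	/* ... operate on this_cpu_ptr(&memcg_stock), no IRQ handling here ... */
  }

  /* Outer wrapper keeps the old calling convention. */
  static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  {
  	unsigned long flags;

  	local_irq_save(flags);
  	__refill_stock(memcg, nr_pages);
  	local_irq_restore(flags);
  }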

[bigeasy: Patch body around Johannes' diff ]

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/memcontrol.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36ab3660f2c6d..a3225501cce36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2224,12 +2224,9 @@ static void drain_local_stock(struct work_struct *dummy)
  * Cache charges(val) to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
  */
-static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	struct memcg_stock_pcp *stock;
-	unsigned long flags;
-
-	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached != memcg) { /* reset if necessary */
@@ -2241,7 +2238,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 
 	if (stock->nr_pages > MEMCG_CHARGE_BATCH)
 		drain_stock(stock);
+}
 
+static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__refill_stock(memcg, nr_pages);
 	local_irq_restore(flags);
 }
 
@@ -3158,8 +3162,16 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
 		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
 
-		if (nr_pages)
-			obj_cgroup_uncharge_pages(old, nr_pages);
+		if (nr_pages) {
+			struct mem_cgroup *memcg;
+
+			memcg = get_mem_cgroup_from_objcg(old);
+
+			memcg_account_kmem(memcg, -nr_pages);
+			__refill_stock(memcg, nr_pages);
+
+			css_put(&memcg->css);
+		}
 
 		/*
 		 * The leftover is flushed to the centralized per-memcg value.
-- 
2.34.1



* [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
From: Sebastian Andrzej Siewior @ 2022-02-17  9:48 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Sebastian Andrzej Siewior, kernel test robot

The members of the per-CPU structure memcg_stock_pcp are protected by
disabling interrupts. This does not work on PREEMPT_RT because it
creates an atomic context in which actions that require a preemptible
context are performed. One example is obj_cgroup_release().

The IRQ-disable sections can be replaced with local_lock_t which
preserves the explicit disabling of interrupts while keeping the code
preemptible on PREEMPT_RT.
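
For illustration only, the local_lock_t pattern in condensed form; the
complete conversion is in the diff below, the helper function name here
is hypothetical.

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  struct memcg_stock_pcp {
  	local_lock_t stock_lock;
  	/* ... cached memcg, nr_pages, cached objcg ... */
  };

  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
  	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
  };

  static void stock_lock_example(void)
  {
  	unsigned long flags;

  	/*
  	 * !PREEMPT_RT: behaves like the local_irq_save()/restore() it replaces.
  	 * PREEMPT_RT: acquires a per-CPU lock, the section stays preemptible.
  	 */
  	local_lock_irqsave(&memcg_stock.stock_lock, flags);
  	/* ... operate on this_cpu_ptr(&memcg_stock) ... */
  	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
  }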

drain_all_stock() disables preemption via get_cpu() and then invokes
drain_local_stock() for the local CPU to avoid scheduling a worker (which
would invoke the same function). Disabling preemption here is problematic
due to the sleeping locks in drain_local_stock().
This can be avoided by always scheduling a worker, even for the local
CPU. Using cpus_read_lock() to stabilize the cpu_online_mask is not
needed since the worker always operates on the CPU-local data structure.
Should a CPU go offline then two workers would perform the work and no
harm is done. Using cpus_read_lock() leads to a possible deadlock.

drain_obj_stock() drops a reference on obj_cgroup which leads to an invocation
of obj_cgroup_release() if it is the last object. This in turn leads to
recursive locking of the local_lock_t. To avoid this, obj_cgroup_release() is
invoked outside of the locked section.
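
For illustration only, a condensed sketch of that pattern as used by
drain_local_stock() in the diff below; the function name here is
hypothetical.

  static void drain_local_stock_example(struct work_struct *dummy)
  {
  	struct memcg_stock_pcp *stock;
  	struct obj_cgroup *old;
  	unsigned long flags;

  	local_lock_irqsave(&memcg_stock.stock_lock, flags);
  	stock = this_cpu_ptr(&memcg_stock);
  	old = drain_obj_stock(stock);	/* now returns the old objcg */
  	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);

  	/* Dropping the reference may end up in obj_cgroup_release(). */
  	if (old)
  		obj_cgroup_put(old);
  }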

obj_cgroup_uncharge_pages() can be invoked with the local_lock_t acquired and
without it. This will lead later to a recursion in refill_stock(). To
avoid the locking recursion provide obj_cgroup_uncharge_pages_locked()
which uses the locked version of refill_stock().

- Replace disabling interrupts for memcg_stock with a local_lock_t.

- Schedule a worker even for the local CPU instead of invoking it
  directly (in drain_all_stock()).

- Let drain_obj_stock() return the old struct obj_cgroup which is passed
  to obj_cgroup_put() outside of the locked section.

- Provide obj_cgroup_uncharge_pages_locked() which uses the locked
  version of refill_stock() to avoid recursive locking in
  drain_obj_stock().

Link: https://lkml.kernel.org/r/20220209014709.GA26885@xsang-OptiPlex-9020
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 mm/memcontrol.c | 67 +++++++++++++++++++++++++++----------------------
 1 file changed, 37 insertions(+), 30 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a3225501cce36..97a88b63ee983 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2108,6 +2108,7 @@ void unlock_page_memcg(struct page *page)
 }
 
 struct memcg_stock_pcp {
+	local_lock_t stock_lock;
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
 
@@ -2123,18 +2124,21 @@ struct memcg_stock_pcp {
 	unsigned long flags;
 #define FLUSHING_CACHED_CHARGE	0
 };
-static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
+static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
+	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
+};
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages);
 
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 {
+	return NULL;
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg)
@@ -2166,7 +2170,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (nr_pages > MEMCG_CHARGE_BATCH)
 		return ret;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
@@ -2174,7 +2178,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 
 	return ret;
 }
@@ -2203,6 +2207,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 static void drain_local_stock(struct work_struct *dummy)
 {
 	struct memcg_stock_pcp *stock;
+	struct obj_cgroup *old = NULL;
 	unsigned long flags;
 
 	/*
@@ -2210,14 +2215,16 @@ static void drain_local_stock(struct work_struct *dummy)
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(stock);
+	old = drain_obj_stock(stock);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	if (old)
+		obj_cgroup_put(old);
 }
 
 /*
@@ -2244,9 +2251,9 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 	__refill_stock(memcg, nr_pages);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
 
 /*
@@ -2255,7 +2262,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 static void drain_all_stock(struct mem_cgroup *root_memcg)
 {
-	int cpu, curcpu;
+	int cpu;
 
 	/* If someone's already draining, avoid adding running more workers. */
 	if (!mutex_trylock(&percpu_charge_mutex))
@@ -2266,7 +2273,6 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 	 * as well as workers from this path always operate on the local
 	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
 	 */
-	curcpu = get_cpu();
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
 		struct mem_cgroup *memcg;
@@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		rcu_read_unlock();
 
 		if (flush &&
-		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
-			if (cpu == curcpu)
-				drain_local_stock(&stock->work);
-			else
-				schedule_work_on(cpu, &stock->work);
-		}
+		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+			schedule_work_on(cpu, &stock->work);
 	}
-	put_cpu();
 	mutex_unlock(&percpu_charge_mutex);
 }
 
@@ -3073,10 +3074,11 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		     enum node_stat_item idx, int nr)
 {
 	struct memcg_stock_pcp *stock;
+	struct obj_cgroup *old = NULL;
 	unsigned long flags;
 	int *bytes;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 	stock = this_cpu_ptr(&memcg_stock);
 
 	/*
@@ -3085,7 +3087,7 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	 * changes.
 	 */
 	if (stock->cached_objcg != objcg) {
-		drain_obj_stock(stock);
+		old = drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
 				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
@@ -3129,7 +3131,9 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	if (nr)
 		mod_objcg_mlstate(objcg, pgdat, idx, nr);
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	if (old)
+		obj_cgroup_put(old);
 }
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
@@ -3138,7 +3142,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	unsigned long flags;
 	bool ret = false;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
@@ -3146,17 +3150,17 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 
 	return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
 	if (!old)
-		return;
+		return NULL;
 
 	if (stock->nr_bytes) {
 		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
@@ -3206,8 +3210,8 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 		stock->cached_pgdat = NULL;
 	}
 
-	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
+	return old;
 }
 
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -3228,14 +3232,15 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			     bool allow_uncharge)
 {
 	struct memcg_stock_pcp *stock;
+	struct obj_cgroup *old = NULL;
 	unsigned long flags;
 	unsigned int nr_pages = 0;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
-		drain_obj_stock(stock);
+		old = drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
 		stock->cached_objcg = objcg;
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
@@ -3249,7 +3254,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	if (old)
+		obj_cgroup_put(old);
 
 	if (nr_pages)
 		obj_cgroup_uncharge_pages(objcg, nr_pages);
-- 
2.34.1



 	 */
-	curcpu = get_cpu();
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
 		struct mem_cgroup *memcg;
@@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		rcu_read_unlock();
 
 		if (flush &&
-		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
-			if (cpu == curcpu)
-				drain_local_stock(&stock->work);
-			else
-				schedule_work_on(cpu, &stock->work);
-		}
+		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+			schedule_work_on(cpu, &stock->work);
 	}
-	put_cpu();
 	mutex_unlock(&percpu_charge_mutex);
 }
 
@@ -3073,10 +3074,11 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		     enum node_stat_item idx, int nr)
 {
 	struct memcg_stock_pcp *stock;
+	struct obj_cgroup *old = NULL;
 	unsigned long flags;
 	int *bytes;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 	stock = this_cpu_ptr(&memcg_stock);
 
 	/*
@@ -3085,7 +3087,7 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	 * changes.
 	 */
 	if (stock->cached_objcg != objcg) {
-		drain_obj_stock(stock);
+		old = drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
 				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
@@ -3129,7 +3131,9 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	if (nr)
 		mod_objcg_mlstate(objcg, pgdat, idx, nr);
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	if (old)
+		obj_cgroup_put(old);
 }
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
@@ -3138,7 +3142,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	unsigned long flags;
 	bool ret = false;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
@@ -3146,17 +3150,17 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 
 	return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
 	if (!old)
-		return;
+		return NULL;
 
 	if (stock->nr_bytes) {
 		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
@@ -3206,8 +3210,8 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 		stock->cached_pgdat = NULL;
 	}
 
-	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
+	return old;
 }
 
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -3228,14 +3232,15 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			     bool allow_uncharge)
 {
 	struct memcg_stock_pcp *stock;
+	struct obj_cgroup *old = NULL;
 	unsigned long flags;
 	unsigned int nr_pages = 0;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
-		drain_obj_stock(stock);
+		old = drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
 		stock->cached_objcg = objcg;
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
@@ -3249,7 +3254,9 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	if (old)
+		obj_cgroup_put(old);
 
 	if (nr_pages)
 		obj_cgroup_uncharge_pages(objcg, nr_pages);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/5] mm/memcg: Revert ("mm/memcg: optimize user context object stock access")
@ 2022-02-18 16:09     ` Shakeel Butt
  0 siblings, 0 replies; 48+ messages in thread
From: Shakeel Butt @ 2022-02-18 16:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Cgroups, Linux MM, Andrew Morton, Johannes Weiner, Michal Hocko,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Michal Hocko, Roman Gushchin

On Thu, Feb 17, 2022 at 1:48 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> From: Michal Hocko <mhocko@suse.com>
>
> The optimisation is based on a micro benchmark where local_irq_save() is
> more expensive than a preempt_disable(). There is no evidence that it is
> visible in a real-world workload and there are CPUs where the opposite is
> true (local_irq_save() is cheaper than preempt_disable()).
>
> Based on micro benchmarks, the optimisation makes sense on PREEMPT_NONE
> where preempt_disable() is optimized away. There is no improvement with
> PREEMPT_DYNAMIC since the preemption counter is always available.
>
> The optimization also makes the PREEMPT_RT integration more complicated
> since most of the assumptions are not true on PREEMPT_RT.
>
> Revert the optimisation since it complicates the PREEMPT_RT integration
> and the improvement is hardly visible.
>
> [ bigeasy: Patch body around Michal's diff ]
>
> Link: https://lore.kernel.org/all/YgOGkXXCrD%2F1k+p4@dhcp22.suse.cz
> Link: https://lkml.kernel.org/r/YdX+INO9gQje6d0S@linutronix.de
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Acked-by: Roman Gushchin <guro@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/5] mm/memcg: Disable threshold event handlers on PREEMPT_RT
@ 2022-02-18 16:39     ` Shakeel Butt
  0 siblings, 0 replies; 48+ messages in thread
From: Shakeel Butt @ 2022-02-18 16:39 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Cgroups, Linux MM, Andrew Morton, Johannes Weiner, Michal Hocko,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Roman Gushchin

On Thu, Feb 17, 2022 at 1:48 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> During the integration of PREEMPT_RT support, the code flow around
> memcg_check_events() resulted in `twisted code'. Moving the code around
> and avoiding that would then lead to an additional local-irq-save
> section within memcg_check_events(). While looking better, it adds a
> local-irq-save section to a code flow which is usually within a
> local-irq-off block on non-PREEMPT_RT configurations.
>
> The threshold event handler is a deprecated memcg v1 feature. Instead of
> trying to get it to work under PREEMPT_RT just disable it. There should
> be no users on PREEMPT_RT. From that perspective it makes even less
> sense to get it to work under PREEMPT_RT while having zero users.
>
> Make memory.soft_limit_in_bytes and cgroup.event_control return
> -EOPNOTSUPP on PREEMPT_RT. Make an empty memcg_check_events() and
> memcg_write_event_control() which return only -EOPNOTSUPP on PREEMPT_RT.
> Document that the two knobs are disabled on PREEMPT_RT.
>
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Suggested-by: Michal Koutný <mkoutny@suse.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Acked-by: Roman Gushchin <guro@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/5] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT where needed.
@ 2022-02-18 17:25     ` Shakeel Butt
  0 siblings, 0 replies; 48+ messages in thread
From: Shakeel Butt @ 2022-02-18 17:25 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Cgroups, Linux MM, Andrew Morton, Johannes Weiner, Michal Hocko,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Roman Gushchin

On Thu, Feb 17, 2022 at 1:48 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> The per-CPU counters are modified with the non-atomic modifier. The
> consistency is ensured by disabling interrupts for the update.
> On non-PREEMPT_RT configurations this works because acquiring a
> spinlock_t typed lock with the _irq() suffix disables interrupts. On
> PREEMPT_RT configurations the RMW operation can be interrupted.
>
> Another problem is that mem_cgroup_swapout() expects to be invoked with
> disabled interrupts because the caller has to acquire a spinlock_t which
> is acquired with disabled interrupts. Since spinlock_t never disables
> interrupts on PREEMPT_RT the interrupts are never disabled at this
> point.
>
> The code is never called from in_irq() context on PREEMPT_RT therefore
> disabling preemption during the update is sufficient on PREEMPT_RT.
> The sections which explicitly disable interrupts can remain on
> PREEMPT_RT because the sections remain short and they don't involve
> sleeping locks (memcg_check_events() is doing nothing on PREEMPT_RT).
>
> Disable preemption during update of the per-CPU variables which do not
> explicitly disable interrupts.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Acked-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/memcontrol.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0b5117ed2ae08..36ab3660f2c6d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -630,6 +630,28 @@ static DEFINE_SPINLOCK(stats_flush_lock);
>  static DEFINE_PER_CPU(unsigned int, stats_updates);
>  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
>
> +/*
> + * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
> + * not rely on this as part of an acquired spinlock_t lock. These functions are
> + * never used in hardirq context on PREEMPT_RT and therefore disabling preemption
> + * is sufficient.
> + */
> +static void memcg_stats_lock(void)
> +{
> +#ifdef CONFIG_PREEMPT_RT
> +      preempt_disable();
> +#else
> +      VM_BUG_ON(!irqs_disabled());
> +#endif
> +}
> +
> +static void memcg_stats_unlock(void)
> +{
> +#ifdef CONFIG_PREEMPT_RT
> +      preempt_enable();
> +#endif
> +}
> +
>  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  {
>         unsigned int x;
> @@ -706,6 +728,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>         pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
>         memcg = pn->memcg;)
>
> +       memcg_stats_lock();

The call chains from rmap.c have not really disabled irqs. Actually
there is a comment in do_page_add_anon_rmap() "We use the irq-unsafe
__{inc|mod}_zone_page_stat because these counters are not modified in
interrupt context, and pte lock(a spinlock) is held, which implies
preemption disabled".

VM_BUG_ON(!irqs_disabled()) within memcg_stats_lock() would be giving
false error reports for CONFIG_PREEMPT_NONE kernels.
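
Roughly, the path in question is (a sketch of the call chain, not an
exact trace):

	/* rmap side: pte lock held, interrupts stay enabled */
	do_page_add_anon_rmap()
	  __mod_lruvec_page_state()
	    __mod_lruvec_state()
	      __mod_memcg_lruvec_state()
	        memcg_stats_lock()	/* VM_BUG_ON(!irqs_disabled()) trips */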


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 4/5] mm/memcg: Opencode the inner part of obj_cgroup_uncharge_pages() in drain_obj_stock()
@ 2022-02-18 18:40     ` Shakeel Butt
  0 siblings, 0 replies; 48+ messages in thread
From: Shakeel Butt @ 2022-02-18 18:40 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Cgroups, Linux MM, Andrew Morton, Johannes Weiner, Michal Hocko,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long

On Thu, Feb 17, 2022 at 1:48 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> From: Johannes Weiner <hannes@cmpxchg.org>
>
> Provide the inner part of refill_stock() as __refill_stock() without
> disabling interrupts. This eases the integration of local_lock_t where
> recursive locking must be avoided.
> Open code obj_cgroup_uncharge_pages() in drain_obj_stock() and use
> __refill_stock(). The caller of drain_obj_stock() already disables
> interrupts.
>
> [bigeasy: Patch body around Johannes' diff ]
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 4/5] mm/memcg: Opencode the inner part of obj_cgroup_uncharge_pages() in drain_obj_stock()
@ 2022-02-18 19:07     ` Roman Gushchin
  0 siblings, 0 replies; 48+ messages in thread
From: Roman Gushchin @ 2022-02-18 19:07 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Andrew Morton, Johannes Weiner, Michal Hocko,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long

On Thu, Feb 17, 2022 at 10:48:01AM +0100, Sebastian Andrzej Siewior wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> Provide the inner part of refill_stock() as __refill_stock() without
> disabling interrupts. This eases the integration of local_lock_t where
> recursive locking must be avoided.
> Open code obj_cgroup_uncharge_pages() in drain_obj_stock() and use
> __refill_stock(). The caller of drain_obj_stock() already disables
> interrupts.
> 
> [bigeasy: Patch body around Johannes' diff ]
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Reviewed-by: Roman Gushchin <guro@fb.com>



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/5] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT where needed.
@ 2022-02-21 11:31       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-21 11:31 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Cgroups, Linux MM, Andrew Morton, Johannes Weiner, Michal Hocko,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Roman Gushchin

On 2022-02-18 09:25:29 [-0800], Shakeel Butt wrote:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 0b5117ed2ae08..36ab3660f2c6d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -630,6 +630,28 @@ static DEFINE_SPINLOCK(stats_flush_lock);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >
> > +/*
> > + * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
> > + * not rely on this as part of an acquired spinlock_t lock. These functions are
> > + * never used in hardirq context on PREEMPT_RT and therefore disabling preemption
> > + * is sufficient.
> > + */
> > +static void memcg_stats_lock(void)
> > +{
> > +#ifdef CONFIG_PREEMPT_RT
> > +      preempt_disable();
> > +#else
> > +      VM_BUG_ON(!irqs_disabled());
> > +#endif
> > +}
> > +
> > +static void memcg_stats_unlock(void)
> > +{
> > +#ifdef CONFIG_PREEMPT_RT
> > +      preempt_enable();
> > +#endif
> > +}
> > +
> >  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >  {
> >         unsigned int x;
> > @@ -706,6 +728,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> >         pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> >         memcg = pn->memcg;)
> >
> > +       memcg_stats_lock();
> 
> The call chains from rmap.c have not really disabled irqs. Actually
> there is a comment in do_page_add_anon_rmap() "We use the irq-unsafe
> __{inc|mod}_zone_page_stat because these counters are not modified in
> interrupt context, and pte lock(a spinlock) is held, which implies
> preemption disabled".
> 
> VM_BUG_ON(!irqs_disabled()) within memcg_stats_lock() would be giving
> false error reports for CONFIG_PREEMPT_NONE kernels.

So three callers, including do_page_add_anon_rmap():
   __mod_lruvec_page_state() -> __mod_lruvec_state() -> __mod_memcg_lruvec_state()

are affected. Here we would get false warnings because interrupts may
intentionally not be disabled. Hmmm.
The odd part is that this only affects certain idx values so any kind of
additional debugging would need to take that into account.
What about memcg_rstat_updated()? It does:

|         x = __this_cpu_add_return(stats_updates, abs(val));
|         if (x > MEMCG_CHARGE_BATCH) {
|                 atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
|                 __this_cpu_write(stats_updates, 0);
|         }

The writes to stats_updates can happen from IRQ-context and with
disabled preemption only. So this is not good, right?
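
For illustration, a hypothetical interleaving on one CPU with
__this_cpu_add_return() decomposed into its non-atomic read-modify-write
steps (the task-context caller only has preemption disabled, so an
update from interrupt context can land in the middle):

	/* task */  x = __this_cpu_read(stats_updates);        /* reads 10 */
	/* irq  */  __this_cpu_add(stats_updates, 32);         /* now 42 */
	/* task */  __this_cpu_write(stats_updates, x + val);  /* the irq's 32 is lost */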

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/5] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT where needed.
@ 2022-02-21 12:12         ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-21 12:12 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Cgroups, Linux MM, Andrew Morton, Johannes Weiner, Michal Hocko,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Roman Gushchin

On 2022-02-21 12:31:18 [+0100], To Shakeel Butt wrote:
> > The call chains from rmap.c have not really disabled irqs. Actually
> > there is a comment in do_page_add_anon_rmap() "We use the irq-unsafe
> > __{inc|mod}_zone_page_stat because these counters are not modified in
> > interrupt context, and pte lock(a spinlock) is held, which implies
> > preemption disabled".
> > 
> > VM_BUG_ON(!irqs_disabled()) within memcg_stats_lock() would be giving
> > false error reports for CONFIG_PREEMPT_NONE kernels.
> 
> So three callers, including do_page_add_anon_rmap():
>    __mod_lruvec_page_state() -> __mod_lruvec_state() -> __mod_memcg_lruvec_state()
> 
> are affected. Here we would get false warnings because interrupts may
> intentionally not be disabled. Hmmm.
> The odd part is that this only affects certain idx values so any kind of
> additional debugging would need to take that into account.
> What about memcg_rstat_updated()? It does:
> 
> |         x = __this_cpu_add_return(stats_updates, abs(val));
> |         if (x > MEMCG_CHARGE_BATCH) {
> |                 atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
> |                 __this_cpu_write(stats_updates, 0);
> |         }
> 
> The writes to stats_updates can happen from IRQ-context and with
> disabled preemption only. So this is not good, right?

So I made the following change to avoid the wrong assert. Still not sure
how bad the hunk above is.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 97a88b63ee983..1bac4798b72ba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -645,6 +645,13 @@ static void memcg_stats_lock(void)
 #endif
 }
 
+static void __memcg_stats_lock(void)
+{
+#ifdef CONFIG_PREEMPT_RT
+      preempt_disable();
+#endif
+}
+
 static void memcg_stats_unlock(void)
 {
 #ifdef CONFIG_PREEMPT_RT
@@ -728,7 +735,20 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
-	memcg_stats_lock();
+	/*
+	 * The callers from rmap rely on disabled preemption because they never
+	 * update their counters from in-interrupt context. For these two
+	 * counters we check that the update is never performed from an
+	 * interrupt context while other callers need to have interrupts disabled.
+	 */
+	__memcg_stats_lock();
+	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+		if (idx == NR_ANON_MAPPED || idx == NR_FILE_MAPPED)
+			WARN_ON_ONCE(!in_task());
+		else
+			WARN_ON_ONCE(!irqs_disabled());
+	}
+
 	/* Update memcg */
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 

Sebastian


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/5] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT where needed.
@ 2022-02-21 13:18         ` Michal Koutný
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Koutný @ 2022-02-21 13:18 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Shakeel Butt, Cgroups, Linux MM, Andrew Morton, Johannes Weiner,
	Michal Hocko, Peter Zijlstra, Thomas Gleixner, Vladimir Davydov,
	Waiman Long, Roman Gushchin

On Mon, Feb 21, 2022 at 12:31:17PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> What about memcg_rstat_updated()? It does:
> 
> |         x = __this_cpu_add_return(stats_updates, abs(val));
> |         if (x > MEMCG_CHARGE_BATCH) {
> |                 atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
> |                 __this_cpu_write(stats_updates, 0);
> |         }
> 
> The writes to stats_updates can happen from IRQ-context and with
> disabled preemption only. So this is not good, right?

These counters serve as a hint for aggregating per-cpu per-cgroup stats.
If they were systematically mis-updated, it could manifest as a missing
"refresh signal" from the given CPU. OTOH, this lagging is also meant to
be limited by elapsed time thanks to periodic flushing.

This could affect the freshness of the stats, not their accuracy, though.


HTH,
Michal



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 3/5] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT where needed.
@ 2022-02-21 13:58           ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-21 13:58 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Shakeel Butt, Cgroups, Linux MM, Andrew Morton, Johannes Weiner,
	Michal Hocko, Peter Zijlstra, Thomas Gleixner, Vladimir Davydov,
	Waiman Long, Roman Gushchin

On 2022-02-21 14:18:25 [+0100], Michal Koutný wrote:
> On Mon, Feb 21, 2022 at 12:31:17PM +0100, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > What about memcg_rstat_updated()? It does:
> > 
> > |         x = __this_cpu_add_return(stats_updates, abs(val));
> > |         if (x > MEMCG_CHARGE_BATCH) {
> > |                 atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
> > |                 __this_cpu_write(stats_updates, 0);
> > |         }
> > 
> > The writes to stats_updates can happen from IRQ-context and with
> > disabled preemption only. So this is not good, right?
> 
> These counters serve as a hint for aggregating per-cpu per-cgroup stats.
> If they were systematically mis-updated, it could manifest as a missing
> "refresh signal" from the given CPU. OTOH, this lagging is also meant to
> be limited by elapsed time thanks to periodic flushing.
> 
> This could affect the freshness of the stats, not their accuracy, though.

Oki. Then let me update the code as suggested and ignore this case since
it is nothing to worry about.

> HTH,
> Michal

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/5] mm/memcg: Revert ("mm/memcg: optimize user context object stock access")
@ 2022-02-21 14:26     ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 14:26 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Andrew Morton, Johannes Weiner,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Roman Gushchin

On Thu 17-02-22 10:47:58, Sebastian Andrzej Siewior wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> The optimisation is based on a micro benchmark where local_irq_save() is
> more expensive than a preempt_disable(). There is no evidence that it is
> visible in a real-world workload and there are CPUs where the opposite is
> true (local_irq_save() is cheaper than preempt_disable()).
> 
> Based on micro benchmarks, the optimisation makes sense on PREEMPT_NONE
> where preempt_disable() is optimized away. There is no improvement with
> PREEMPT_DYNAMIC since the preemption counter is always available.
> 
> The optimization also makes the PREEMPT_RT integration more complicated
> since most of the assumptions are not true on PREEMPT_RT.
> 
> Revert the optimisation since it complicates the PREEMPT_RT integration
> and the improvement is hardly visible.
> 
> [ bigeasy: Patch body around Michal's diff ]

Thanks for preparing the changelog for this. I was planning to post mine
but I was waiting for feedback from Waiman. Anyway, this looks good to
me.

> 
> Link: https://lore.kernel.org/all/YgOGkXXCrD%2F1k+p4@dhcp22.suse.cz
> Link: https://lkml.kernel.org/r/YdX+INO9gQje6d0S@linutronix.de
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Acked-by: Roman Gushchin <guro@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 94 ++++++++++++++-----------------------------------
>  1 file changed, 27 insertions(+), 67 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3c4816147273a..8ab2dc75e70ec 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2078,23 +2078,17 @@ void unlock_page_memcg(struct page *page)
>  	folio_memcg_unlock(page_folio(page));
>  }
>  
> -struct obj_stock {
> +struct memcg_stock_pcp {
> +	struct mem_cgroup *cached; /* this never be root cgroup */
> +	unsigned int nr_pages;
> +
>  #ifdef CONFIG_MEMCG_KMEM
>  	struct obj_cgroup *cached_objcg;
>  	struct pglist_data *cached_pgdat;
>  	unsigned int nr_bytes;
>  	int nr_slab_reclaimable_b;
>  	int nr_slab_unreclaimable_b;
> -#else
> -	int dummy[0];
>  #endif
> -};
> -
> -struct memcg_stock_pcp {
> -	struct mem_cgroup *cached; /* this never be root cgroup */
> -	unsigned int nr_pages;
> -	struct obj_stock task_obj;
> -	struct obj_stock irq_obj;
>  
>  	struct work_struct work;
>  	unsigned long flags;
> @@ -2104,13 +2098,13 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
>  static DEFINE_MUTEX(percpu_charge_mutex);
>  
>  #ifdef CONFIG_MEMCG_KMEM
> -static void drain_obj_stock(struct obj_stock *stock);
> +static void drain_obj_stock(struct memcg_stock_pcp *stock);
>  static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  				     struct mem_cgroup *root_memcg);
>  static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages);
>  
>  #else
> -static inline void drain_obj_stock(struct obj_stock *stock)
> +static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
>  {
>  }
>  static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> @@ -2190,9 +2184,7 @@ static void drain_local_stock(struct work_struct *dummy)
>  	local_irq_save(flags);
>  
>  	stock = this_cpu_ptr(&memcg_stock);
> -	drain_obj_stock(&stock->irq_obj);
> -	if (in_task())
> -		drain_obj_stock(&stock->task_obj);
> +	drain_obj_stock(stock);
>  	drain_stock(stock);
>  	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>  
> @@ -2767,41 +2759,6 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
>   */
>  #define OBJCGS_CLEAR_MASK	(__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
>  
> -/*
> - * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
> - * sequence used in this case to access content from object stock is slow.
> - * To optimize for user context access, there are now two object stocks for
> - * task context and interrupt context access respectively.
> - *
> - * The task context object stock can be accessed by disabling preemption only
> - * which is cheap in non-preempt kernel. The interrupt context object stock
> - * can only be accessed after disabling interrupt. User context code can
> - * access interrupt object stock, but not vice versa.
> - */
> -static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
> -{
> -	struct memcg_stock_pcp *stock;
> -
> -	if (likely(in_task())) {
> -		*pflags = 0UL;
> -		preempt_disable();
> -		stock = this_cpu_ptr(&memcg_stock);
> -		return &stock->task_obj;
> -	}
> -
> -	local_irq_save(*pflags);
> -	stock = this_cpu_ptr(&memcg_stock);
> -	return &stock->irq_obj;
> -}
> -
> -static inline void put_obj_stock(unsigned long flags)
> -{
> -	if (likely(in_task()))
> -		preempt_enable();
> -	else
> -		local_irq_restore(flags);
> -}
> -
>  /*
>   * mod_objcg_mlstate() may be called with irq enabled, so
>   * mod_memcg_lruvec_state() should be used.
> @@ -3082,10 +3039,13 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>  void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>  		     enum node_stat_item idx, int nr)
>  {
> +	struct memcg_stock_pcp *stock;
>  	unsigned long flags;
> -	struct obj_stock *stock = get_obj_stock(&flags);
>  	int *bytes;
>  
> +	local_irq_save(flags);
> +	stock = this_cpu_ptr(&memcg_stock);
> +
>  	/*
>  	 * Save vmstat data in stock and skip vmstat array update unless
>  	 * accumulating over a page of vmstat data or when pgdat or idx
> @@ -3136,26 +3096,29 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>  	if (nr)
>  		mod_objcg_mlstate(objcg, pgdat, idx, nr);
>  
> -	put_obj_stock(flags);
> +	local_irq_restore(flags);
>  }
>  
>  static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  {
> +	struct memcg_stock_pcp *stock;
>  	unsigned long flags;
> -	struct obj_stock *stock = get_obj_stock(&flags);
>  	bool ret = false;
>  
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
>  	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
>  		stock->nr_bytes -= nr_bytes;
>  		ret = true;
>  	}
>  
> -	put_obj_stock(flags);
> +	local_irq_restore(flags);
>  
>  	return ret;
>  }
>  
> -static void drain_obj_stock(struct obj_stock *stock)
> +static void drain_obj_stock(struct memcg_stock_pcp *stock)
>  {
>  	struct obj_cgroup *old = stock->cached_objcg;
>  
> @@ -3211,13 +3174,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  {
>  	struct mem_cgroup *memcg;
>  
> -	if (in_task() && stock->task_obj.cached_objcg) {
> -		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
> -		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
> -			return true;
> -	}
> -	if (stock->irq_obj.cached_objcg) {
> -		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
> +	if (stock->cached_objcg) {
> +		memcg = obj_cgroup_memcg(stock->cached_objcg);
>  		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
>  			return true;
>  	}
> @@ -3228,10 +3186,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>  			     bool allow_uncharge)
>  {
> +	struct memcg_stock_pcp *stock;
>  	unsigned long flags;
> -	struct obj_stock *stock = get_obj_stock(&flags);
>  	unsigned int nr_pages = 0;
>  
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
>  	if (stock->cached_objcg != objcg) { /* reset if necessary */
>  		drain_obj_stock(stock);
>  		obj_cgroup_get(objcg);
> @@ -3247,7 +3208,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>  		stock->nr_bytes &= (PAGE_SIZE - 1);
>  	}
>  
> -	put_obj_stock(flags);
> +	local_irq_restore(flags);
>  
>  	if (nr_pages)
>  		obj_cgroup_uncharge_pages(objcg, nr_pages);
> @@ -6812,7 +6773,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>  	long nr_pages;
>  	struct mem_cgroup *memcg;
>  	struct obj_cgroup *objcg;
> -	bool use_objcg = folio_memcg_kmem(folio);
>  
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>  
> @@ -6821,7 +6781,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>  	 * folio memcg or objcg at this point, we have fully
>  	 * exclusive access to the folio.
>  	 */
> -	if (use_objcg) {
> +	if (folio_memcg_kmem(folio)) {
>  		objcg = __folio_objcg(folio);
>  		/*
>  		 * This get matches the put at the end of the function and
> @@ -6849,7 +6809,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>  
>  	nr_pages = folio_nr_pages(folio);
>  
> -	if (use_objcg) {
> +	if (folio_memcg_kmem(folio)) {
>  		ug->nr_memory += nr_pages;
>  		ug->nr_kmem += nr_pages;
>  
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/5] mm/memcg: Revert ("mm/memcg: optimize user context object stock access")
@ 2022-02-21 14:26     ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 14:26 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Andrew Morton, Johannes Weiner, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Roman Gushchin

On Thu 17-02-22 10:47:58, Sebastian Andrzej Siewior wrote:
> From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> 
> The optimisation is based on a micro benchmark where local_irq_save() is
> more expensive than a preempt_disable(). There is no evidence that it is
> visible in a real-world workload and there are CPUs where the opposite is
> true (local_irq_save() is cheaper than preempt_disable()).
> 
> Based on micro benchmarks, the optimisation makes sense on PREEMPT_NONE
> where preempt_disable() is optimized away. There is no improvement with
> PREEMPT_DYNAMIC since the preemption counter is always available.
> 
> The optimization also makes the PREEMPT_RT integration more complicated
> since most of the assumptions are not true on PREEMPT_RT.
> 
> Revert the optimisation since it complicates the PREEMPT_RT integration
> and the improvement is hardly visible.
> 
> [ bigeasy: Patch body around Michal's diff ]

Thanks for preparing the changelog for this. I was planning to post mine
but I was waiting for feedback from Waiman. Anyway, this looks good to
me.

> 
> Link: https://lore.kernel.org/all/YgOGkXXCrD%2F1k+p4-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org
> Link: https://lkml.kernel.org/r/YdX+INO9gQje6d0S-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org
> Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> ---
>  mm/memcontrol.c | 94 ++++++++++++++-----------------------------------
>  1 file changed, 27 insertions(+), 67 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3c4816147273a..8ab2dc75e70ec 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2078,23 +2078,17 @@ void unlock_page_memcg(struct page *page)
>  	folio_memcg_unlock(page_folio(page));
>  }
>  
> -struct obj_stock {
> +struct memcg_stock_pcp {
> +	struct mem_cgroup *cached; /* this never be root cgroup */
> +	unsigned int nr_pages;
> +
>  #ifdef CONFIG_MEMCG_KMEM
>  	struct obj_cgroup *cached_objcg;
>  	struct pglist_data *cached_pgdat;
>  	unsigned int nr_bytes;
>  	int nr_slab_reclaimable_b;
>  	int nr_slab_unreclaimable_b;
> -#else
> -	int dummy[0];
>  #endif
> -};
> -
> -struct memcg_stock_pcp {
> -	struct mem_cgroup *cached; /* this never be root cgroup */
> -	unsigned int nr_pages;
> -	struct obj_stock task_obj;
> -	struct obj_stock irq_obj;
>  
>  	struct work_struct work;
>  	unsigned long flags;
> @@ -2104,13 +2098,13 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
>  static DEFINE_MUTEX(percpu_charge_mutex);
>  
>  #ifdef CONFIG_MEMCG_KMEM
> -static void drain_obj_stock(struct obj_stock *stock);
> +static void drain_obj_stock(struct memcg_stock_pcp *stock);
>  static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  				     struct mem_cgroup *root_memcg);
>  static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages);
>  
>  #else
> -static inline void drain_obj_stock(struct obj_stock *stock)
> +static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
>  {
>  }
>  static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> @@ -2190,9 +2184,7 @@ static void drain_local_stock(struct work_struct *dummy)
>  	local_irq_save(flags);
>  
>  	stock = this_cpu_ptr(&memcg_stock);
> -	drain_obj_stock(&stock->irq_obj);
> -	if (in_task())
> -		drain_obj_stock(&stock->task_obj);
> +	drain_obj_stock(stock);
>  	drain_stock(stock);
>  	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>  
> @@ -2767,41 +2759,6 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
>   */
>  #define OBJCGS_CLEAR_MASK	(__GFP_DMA | __GFP_RECLAIMABLE | __GFP_ACCOUNT)
>  
> -/*
> - * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
> - * sequence used in this case to access content from object stock is slow.
> - * To optimize for user context access, there are now two object stocks for
> - * task context and interrupt context access respectively.
> - *
> - * The task context object stock can be accessed by disabling preemption only
> - * which is cheap in non-preempt kernel. The interrupt context object stock
> - * can only be accessed after disabling interrupt. User context code can
> - * access interrupt object stock, but not vice versa.
> - */
> -static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
> -{
> -	struct memcg_stock_pcp *stock;
> -
> -	if (likely(in_task())) {
> -		*pflags = 0UL;
> -		preempt_disable();
> -		stock = this_cpu_ptr(&memcg_stock);
> -		return &stock->task_obj;
> -	}
> -
> -	local_irq_save(*pflags);
> -	stock = this_cpu_ptr(&memcg_stock);
> -	return &stock->irq_obj;
> -}
> -
> -static inline void put_obj_stock(unsigned long flags)
> -{
> -	if (likely(in_task()))
> -		preempt_enable();
> -	else
> -		local_irq_restore(flags);
> -}
> -
>  /*
>   * mod_objcg_mlstate() may be called with irq enabled, so
>   * mod_memcg_lruvec_state() should be used.
> @@ -3082,10 +3039,13 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>  void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>  		     enum node_stat_item idx, int nr)
>  {
> +	struct memcg_stock_pcp *stock;
>  	unsigned long flags;
> -	struct obj_stock *stock = get_obj_stock(&flags);
>  	int *bytes;
>  
> +	local_irq_save(flags);
> +	stock = this_cpu_ptr(&memcg_stock);
> +
>  	/*
>  	 * Save vmstat data in stock and skip vmstat array update unless
>  	 * accumulating over a page of vmstat data or when pgdat or idx
> @@ -3136,26 +3096,29 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>  	if (nr)
>  		mod_objcg_mlstate(objcg, pgdat, idx, nr);
>  
> -	put_obj_stock(flags);
> +	local_irq_restore(flags);
>  }
>  
>  static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  {
> +	struct memcg_stock_pcp *stock;
>  	unsigned long flags;
> -	struct obj_stock *stock = get_obj_stock(&flags);
>  	bool ret = false;
>  
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
>  	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
>  		stock->nr_bytes -= nr_bytes;
>  		ret = true;
>  	}
>  
> -	put_obj_stock(flags);
> +	local_irq_restore(flags);
>  
>  	return ret;
>  }
>  
> -static void drain_obj_stock(struct obj_stock *stock)
> +static void drain_obj_stock(struct memcg_stock_pcp *stock)
>  {
>  	struct obj_cgroup *old = stock->cached_objcg;
>  
> @@ -3211,13 +3174,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  {
>  	struct mem_cgroup *memcg;
>  
> -	if (in_task() && stock->task_obj.cached_objcg) {
> -		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
> -		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
> -			return true;
> -	}
> -	if (stock->irq_obj.cached_objcg) {
> -		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
> +	if (stock->cached_objcg) {
> +		memcg = obj_cgroup_memcg(stock->cached_objcg);
>  		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
>  			return true;
>  	}
> @@ -3228,10 +3186,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>  static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>  			     bool allow_uncharge)
>  {
> +	struct memcg_stock_pcp *stock;
>  	unsigned long flags;
> -	struct obj_stock *stock = get_obj_stock(&flags);
>  	unsigned int nr_pages = 0;
>  
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
>  	if (stock->cached_objcg != objcg) { /* reset if necessary */
>  		drain_obj_stock(stock);
>  		obj_cgroup_get(objcg);
> @@ -3247,7 +3208,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>  		stock->nr_bytes &= (PAGE_SIZE - 1);
>  	}
>  
> -	put_obj_stock(flags);
> +	local_irq_restore(flags);
>  
>  	if (nr_pages)
>  		obj_cgroup_uncharge_pages(objcg, nr_pages);
> @@ -6812,7 +6773,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>  	long nr_pages;
>  	struct mem_cgroup *memcg;
>  	struct obj_cgroup *objcg;
> -	bool use_objcg = folio_memcg_kmem(folio);
>  
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>  
> @@ -6821,7 +6781,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>  	 * folio memcg or objcg at this point, we have fully
>  	 * exclusive access to the folio.
>  	 */
> -	if (use_objcg) {
> +	if (folio_memcg_kmem(folio)) {
>  		objcg = __folio_objcg(folio);
>  		/*
>  		 * This get matches the put at the end of the function and
> @@ -6849,7 +6809,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
>  
>  	nr_pages = folio_nr_pages(folio);
>  
> -	if (use_objcg) {
> +	if (folio_memcg_kmem(folio)) {
>  		ug->nr_memory += nr_pages;
>  		ug->nr_kmem += nr_pages;
>  
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/5] mm/memcg: Disable threshold event handlers on PREEMPT_RT
@ 2022-02-21 14:27     ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 14:27 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Andrew Morton, Johannes Weiner,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Roman Gushchin

On Thu 17-02-22 10:47:59, Sebastian Andrzej Siewior wrote:
> During the integration of PREEMPT_RT support, the code flow around
> memcg_check_events() resulted in `twisted code'. Moving the code around
> to avoid this would then lead to an additional local-irq-save
> section within memcg_check_events(). While looking better, it adds a
> local-irq-save section to a code flow which is usually within a
> local-irq-off block on non-PREEMPT_RT configurations.
> 
> The threshold event handler is a deprecated memcg v1 feature. Instead of
> trying to get it to work under PREEMPT_RT just disable it. There should
> be no users on PREEMPT_RT. From that perspective it makes even less
> sense to get it to work under PREEMPT_RT while having zero users.
> 
> Make memory.soft_limit_in_bytes and cgroup.event_control return
> -EOPNOTSUPP on PREEMPT_RT. Turn memcg_check_events() into a no-op and
> have memcg_write_event_control() return -EOPNOTSUPP on PREEMPT_RT.
> Document that the two knobs are disabled on PREEMPT_RT.
> 
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Suggested-by: Michal Koutný <mkoutny@suse.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Acked-by: Roman Gushchin <guro@fb.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!

> ---
>  Documentation/admin-guide/cgroup-v1/memory.rst |  2 ++
>  mm/memcontrol.c                                | 14 ++++++++++++--
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
> index faac50149a222..2cc502a75ef64 100644
> --- a/Documentation/admin-guide/cgroup-v1/memory.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memory.rst
> @@ -64,6 +64,7 @@ Brief summary of control files.
>  				     threads
>   cgroup.procs			     show list of processes
>   cgroup.event_control		     an interface for event_fd()
> +				     This knob is not available on CONFIG_PREEMPT_RT systems.
>   memory.usage_in_bytes		     show current usage for memory
>  				     (See 5.5 for details)
>   memory.memsw.usage_in_bytes	     show current usage for memory+Swap
> @@ -75,6 +76,7 @@ Brief summary of control files.
>   memory.max_usage_in_bytes	     show max memory usage recorded
>   memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
>   memory.soft_limit_in_bytes	     set/show soft limit of memory usage
> +				     This knob is not available on CONFIG_PREEMPT_RT systems.
>   memory.stat			     show various statistics
>   memory.use_hierarchy		     set/show hierarchical account enabled
>                                       This knob is deprecated and shouldn't be
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8ab2dc75e70ec..0b5117ed2ae08 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -859,6 +859,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>   */
>  static void memcg_check_events(struct mem_cgroup *memcg, int nid)
>  {
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		return;
> +
>  	/* threshold event is triggered in finer grain than soft limit */
>  	if (unlikely(mem_cgroup_event_ratelimit(memcg,
>  						MEM_CGROUP_TARGET_THRESH))) {
> @@ -3731,8 +3734,12 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
>  		}
>  		break;
>  	case RES_SOFT_LIMIT:
> -		memcg->soft_limit = nr_pages;
> -		ret = 0;
> +		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> +			ret = -EOPNOTSUPP;
> +		} else {
> +			memcg->soft_limit = nr_pages;
> +			ret = 0;
> +		}
>  		break;
>  	}
>  	return ret ?: nbytes;
> @@ -4708,6 +4715,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
>  	char *endp;
>  	int ret;
>  
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		return -EOPNOTSUPP;
> +
>  	buf = strstrip(buf);
>  
>  	efd = simple_strtoul(buf, &endp, 10);
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/5] mm/memcg: Disable threshold event handlers on PREEMPT_RT
@ 2022-02-21 14:27     ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 14:27 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Andrew Morton, Johannes Weiner, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	Roman Gushchin

On Thu 17-02-22 10:47:59, Sebastian Andrzej Siewior wrote:
> During the integration of PREEMPT_RT support, the code flow around
> memcg_check_events() resulted in `twisted code'. Moving the code around
> to avoid this would then lead to an additional local-irq-save
> section within memcg_check_events(). While looking better, it adds a
> local-irq-save section to a code flow which is usually within a
> local-irq-off block on non-PREEMPT_RT configurations.
> 
> The threshold event handler is a deprecated memcg v1 feature. Instead of
> trying to get it to work under PREEMPT_RT just disable it. There should
> be no users on PREEMPT_RT. From that perspective it makes even less
> sense to get it to work under PREEMPT_RT while having zero users.
> 
> Make memory.soft_limit_in_bytes and cgroup.event_control return
> -EOPNOTSUPP on PREEMPT_RT. Turn memcg_check_events() into a no-op and
> have memcg_write_event_control() return -EOPNOTSUPP on PREEMPT_RT.
> Document that the two knobs are disabled on PREEMPT_RT.
> 
> Suggested-by: Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Suggested-by: Michal Koutný <mkoutny-IBi9RG/b67k@public.gmane.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> Acked-by: Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>
> Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>

Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
Thanks!

> ---
>  Documentation/admin-guide/cgroup-v1/memory.rst |  2 ++
>  mm/memcontrol.c                                | 14 ++++++++++++--
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
> index faac50149a222..2cc502a75ef64 100644
> --- a/Documentation/admin-guide/cgroup-v1/memory.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memory.rst
> @@ -64,6 +64,7 @@ Brief summary of control files.
>  				     threads
>   cgroup.procs			     show list of processes
>   cgroup.event_control		     an interface for event_fd()
> +				     This knob is not available on CONFIG_PREEMPT_RT systems.
>   memory.usage_in_bytes		     show current usage for memory
>  				     (See 5.5 for details)
>   memory.memsw.usage_in_bytes	     show current usage for memory+Swap
> @@ -75,6 +76,7 @@ Brief summary of control files.
>   memory.max_usage_in_bytes	     show max memory usage recorded
>   memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
>   memory.soft_limit_in_bytes	     set/show soft limit of memory usage
> +				     This knob is not available on CONFIG_PREEMPT_RT systems.
>   memory.stat			     show various statistics
>   memory.use_hierarchy		     set/show hierarchical account enabled
>                                       This knob is deprecated and shouldn't be
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8ab2dc75e70ec..0b5117ed2ae08 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -859,6 +859,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>   */
>  static void memcg_check_events(struct mem_cgroup *memcg, int nid)
>  {
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		return;
> +
>  	/* threshold event is triggered in finer grain than soft limit */
>  	if (unlikely(mem_cgroup_event_ratelimit(memcg,
>  						MEM_CGROUP_TARGET_THRESH))) {
> @@ -3731,8 +3734,12 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
>  		}
>  		break;
>  	case RES_SOFT_LIMIT:
> -		memcg->soft_limit = nr_pages;
> -		ret = 0;
> +		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> +			ret = -EOPNOTSUPP;
> +		} else {
> +			memcg->soft_limit = nr_pages;
> +			ret = 0;
> +		}
>  		break;
>  	}
>  	return ret ?: nbytes;
> @@ -4708,6 +4715,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
>  	char *endp;
>  	int ret;
>  
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		return -EOPNOTSUPP;
> +
>  	buf = strstrip(buf);
>  
>  	efd = simple_strtoul(buf, &endp, 10);
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 4/5] mm/memcg: Opencode the inner part of obj_cgroup_uncharge_pages() in drain_obj_stock()
@ 2022-02-21 14:30     ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 14:30 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Andrew Morton, Johannes Weiner,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long

On Thu 17-02-22 10:48:01, Sebastian Andrzej Siewior wrote:
> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> Provide the inner part of refill_stock() as __refill_stock() without
> disabling interrupts. This eases the integration of local_lock_t where
> recursive locking must be avoided.
> Open code obj_cgroup_uncharge_pages() in drain_obj_stock() and use
> __refill_stock(). The caller of drain_obj_stock() already disables
> interrupts.
> 
> [bigeasy: Patch body around Johannes' diff ]
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/memcontrol.c | 24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 36ab3660f2c6d..a3225501cce36 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2224,12 +2224,9 @@ static void drain_local_stock(struct work_struct *dummy)
>   * Cache charges(val) to local per_cpu area.
>   * This will be consumed by consume_stock() function, later.
>   */
> -static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> +static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
>  	struct memcg_stock_pcp *stock;
> -	unsigned long flags;
> -
> -	local_irq_save(flags);
>  
>  	stock = this_cpu_ptr(&memcg_stock);
>  	if (stock->cached != memcg) { /* reset if necessary */
> @@ -2241,7 +2238,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>  
>  	if (stock->nr_pages > MEMCG_CHARGE_BATCH)
>  		drain_stock(stock);
> +}
>  
> +static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__refill_stock(memcg, nr_pages);
>  	local_irq_restore(flags);
>  }
>  
> @@ -3158,8 +3162,16 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
>  		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
>  		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
>  
> -		if (nr_pages)
> -			obj_cgroup_uncharge_pages(old, nr_pages);
> +		if (nr_pages) {
> +			struct mem_cgroup *memcg;
> +
> +			memcg = get_mem_cgroup_from_objcg(old);
> +
> +			memcg_account_kmem(memcg, -nr_pages);
> +			__refill_stock(memcg, nr_pages);
> +
> +			css_put(&memcg->css);
> +		}
>  
>  		/*
>  		 * The leftover is flushed to the centralized per-memcg value.
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 4/5] mm/memcg: Opencode the inner part of obj_cgroup_uncharge_pages() in drain_obj_stock()
@ 2022-02-21 14:30     ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 14:30 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Andrew Morton, Johannes Weiner, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long

On Thu 17-02-22 10:48:01, Sebastian Andrzej Siewior wrote:
> From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> 
> Provide the inner part of refill_stock() as __refill_stock() without
> disabling interrupts. This eases the integration of local_lock_t where
> recursive locking must be avoided.
> Open code obj_cgroup_uncharge_pages() in drain_obj_stock() and use
> __refill_stock(). The caller of drain_obj_stock() already disables
> interrupts.
> 
> [bigeasy: Patch body around Johannes' diff ]
> 
> Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>

Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

Thanks!

> ---
>  mm/memcontrol.c | 24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 36ab3660f2c6d..a3225501cce36 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2224,12 +2224,9 @@ static void drain_local_stock(struct work_struct *dummy)
>   * Cache charges(val) to local per_cpu area.
>   * This will be consumed by consume_stock() function, later.
>   */
> -static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> +static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
>  	struct memcg_stock_pcp *stock;
> -	unsigned long flags;
> -
> -	local_irq_save(flags);
>  
>  	stock = this_cpu_ptr(&memcg_stock);
>  	if (stock->cached != memcg) { /* reset if necessary */
> @@ -2241,7 +2238,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>  
>  	if (stock->nr_pages > MEMCG_CHARGE_BATCH)
>  		drain_stock(stock);
> +}
>  
> +static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	__refill_stock(memcg, nr_pages);
>  	local_irq_restore(flags);
>  }
>  
> @@ -3158,8 +3162,16 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
>  		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
>  		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
>  
> -		if (nr_pages)
> -			obj_cgroup_uncharge_pages(old, nr_pages);
> +		if (nr_pages) {
> +			struct mem_cgroup *memcg;
> +
> +			memcg = get_mem_cgroup_from_objcg(old);
> +
> +			memcg_account_kmem(memcg, -nr_pages);
> +			__refill_stock(memcg, nr_pages);
> +
> +			css_put(&memcg->css);
> +		}
>  
>  		/*
>  		 * The leftover is flushed to the centralized per-memcg value.
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
@ 2022-02-21 14:46     ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 14:46 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Andrew Morton, Johannes Weiner,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On Thu 17-02-22 10:48:02, Sebastian Andrzej Siewior wrote:
[...]
> @@ -2266,7 +2273,6 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  	 * as well as workers from this path always operate on the local
>  	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
>  	 */
> -	curcpu = get_cpu();

Could you make this a separate patch?

>  	for_each_online_cpu(cpu) {
>  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
>  		struct mem_cgroup *memcg;
> @@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  		rcu_read_unlock();
>  
>  		if (flush &&
> -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> -			if (cpu == curcpu)
> -				drain_local_stock(&stock->work);
> -			else
> -				schedule_work_on(cpu, &stock->work);
> -		}
> +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> +			schedule_work_on(cpu, &stock->work);

Maybe I am missing but on !PREEMPT kernels there is nothing really
guaranteeing that the worker runs so there should be cond_resched after
the mutex is unlocked. I do not think we want to rely on callers to be
aware of this subtlety.

An alternative would be to split out __drain_local_stock which doesn't
do local_lock.
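
To make that concrete, a minimal sketch of such a split, assuming the
local_lock_t added by this patch is a stock_lock member of struct
memcg_stock_pcp (the member name and the lock placement are assumptions
here, they are not visible in the quoted hunks):

static void __drain_local_stock(struct memcg_stock_pcp *stock)
{
        /* Caller holds the per-CPU lock (or otherwise pins this CPU). */
        drain_obj_stock(stock);
        drain_stock(stock);
        clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
}

static void drain_local_stock(struct work_struct *dummy)
{
        struct memcg_stock_pcp *stock;
        unsigned long flags;

        /* Keep the current locked behaviour for the work item. */
        local_lock_irqsave(&memcg_stock.stock_lock, flags);
        stock = this_cpu_ptr(&memcg_stock);
        __drain_local_stock(stock);
        local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}

A caller that already holds the lock could then drain the local stock
directly via __drain_local_stock() instead of bouncing through the work
item.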

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
@ 2022-02-21 14:46     ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 14:46 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Andrew Morton, Johannes Weiner, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On Thu 17-02-22 10:48:02, Sebastian Andrzej Siewior wrote:
[...]
> @@ -2266,7 +2273,6 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  	 * as well as workers from this path always operate on the local
>  	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
>  	 */
> -	curcpu = get_cpu();

Could you make this a separate patch?

>  	for_each_online_cpu(cpu) {
>  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
>  		struct mem_cgroup *memcg;
> @@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  		rcu_read_unlock();
>  
>  		if (flush &&
> -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> -			if (cpu == curcpu)
> -				drain_local_stock(&stock->work);
> -			else
> -				schedule_work_on(cpu, &stock->work);
> -		}
> +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> +			schedule_work_on(cpu, &stock->work);

Maybe I am missing but on !PREEMPT kernels there is nothing really
guaranteeing that the worker runs so there should be cond_resched after
the mutex is unlocked. I do not think we want to rely on callers to be
aware of this subtlety.

An alternative would be to split out __drain_local_stock which doesn't
do local_lock.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
@ 2022-02-21 15:19       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-21 15:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups, linux-mm, Andrew Morton, Johannes Weiner,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On 2022-02-21 15:46:30 [+0100], Michal Hocko wrote:
> On Thu 17-02-22 10:48:02, Sebastian Andrzej Siewior wrote:
> [...]
> > @@ -2266,7 +2273,6 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> >  	 * as well as workers from this path always operate on the local
> >  	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
> >  	 */
> > -	curcpu = get_cpu();
> 
> Could you make this a separate patch?

Sure.

> >  	for_each_online_cpu(cpu) {
> >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> >  		struct mem_cgroup *memcg;
> > @@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> >  		rcu_read_unlock();
> >  
> >  		if (flush &&
> > -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > -			if (cpu == curcpu)
> > -				drain_local_stock(&stock->work);
> > -			else
> > -				schedule_work_on(cpu, &stock->work);
> > -		}
> > +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > +			schedule_work_on(cpu, &stock->work);
> 
> Maybe I am missing but on !PREEMPT kernels there is nothing really
> guaranteeing that the worker runs so there should be cond_resched after
> the mutex is unlocked. I do not think we want to rely on callers to be
> aware of this subtlety.

There is no guarantee on PREEMPT kernels either. The worker will be made
runnable and will be put on the CPU when the scheduler sees fit, and
there could be other workers which take precedence (queued earlier).
But I was not aware that the worker _needs_ to run before we return. We
might get migrated after put_cpu() so I wasn't aware that this is
important. Should we attempt best effort and wait for the worker on the
current CPU?

> An alternative would be to split out __drain_local_stock which doesn't
> do local_lock.

but isn't the section in drain_local_stock() unprotected then?

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
@ 2022-02-21 15:19       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-21 15:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Andrew Morton, Johannes Weiner, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On 2022-02-21 15:46:30 [+0100], Michal Hocko wrote:
> On Thu 17-02-22 10:48:02, Sebastian Andrzej Siewior wrote:
> [...]
> > @@ -2266,7 +2273,6 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> >  	 * as well as workers from this path always operate on the local
> >  	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
> >  	 */
> > -	curcpu = get_cpu();
> 
> Could you make this a separate patch?

Sure.

> >  	for_each_online_cpu(cpu) {
> >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> >  		struct mem_cgroup *memcg;
> > @@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> >  		rcu_read_unlock();
> >  
> >  		if (flush &&
> > -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > -			if (cpu == curcpu)
> > -				drain_local_stock(&stock->work);
> > -			else
> > -				schedule_work_on(cpu, &stock->work);
> > -		}
> > +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > +			schedule_work_on(cpu, &stock->work);
> 
> Maybe I am missing but on !PREEMPT kernels there is nothing really
> guaranteeing that the worker runs so there should be cond_resched after
> the mutex is unlocked. I do not think we want to rely on callers to be
> aware of this subtlety.

There is no guarantee on PREEMPT kernels either. The worker will be made
runnable and will be put on the CPU when the scheduler sees fit, and
there could be other workers which take precedence (queued earlier).
But I was not aware that the worker _needs_ to run before we return. We
might get migrated after put_cpu() so I wasn't aware that this is
important. Should we attempt best effort and wait for the worker on the
current CPU?

> An alternative would be to split out __drain_local_stock which doesn't
> do local_lock.

but isn't the section in drain_local_stock() unprotected then?

Sebastian

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
@ 2022-02-21 16:24         ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 16:24 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Andrew Morton, Johannes Weiner,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On Mon 21-02-22 16:19:52, Sebastian Andrzej Siewior wrote:
> On 2022-02-21 15:46:30 [+0100], Michal Hocko wrote:
> > On Thu 17-02-22 10:48:02, Sebastian Andrzej Siewior wrote:
> > [...]
> > > @@ -2266,7 +2273,6 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > >  	 * as well as workers from this path always operate on the local
> > >  	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
> > >  	 */
> > > -	curcpu = get_cpu();
> > 
> > Could you make this a separate patch?
> 
> Sure.
> 
> > >  	for_each_online_cpu(cpu) {
> > >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > >  		struct mem_cgroup *memcg;
> > > @@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > >  		rcu_read_unlock();
> > >  
> > >  		if (flush &&
> > > -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > > -			if (cpu == curcpu)
> > > -				drain_local_stock(&stock->work);
> > > -			else
> > > -				schedule_work_on(cpu, &stock->work);
> > > -		}
> > > +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > > +			schedule_work_on(cpu, &stock->work);
> > 
> > Maybe I am missing but on !PREEMPT kernels there is nothing really
> > guaranteeing that the worker runs so there should be cond_resched after
> > the mutex is unlocked. I do not think we want to rely on callers to be
> > aware of this subtlety.
> 
> There is no guarantee on PREEMPT kernels either. The worker will be made
> runnable and will be put on the CPU when the scheduler sees fit, and
> there could be other workers which take precedence (queued earlier).
> But I was not aware that the worker _needs_ to run before we return.

A lack of draining will not be a correctness problem (sorry I should
have made that clear). It is more about subtlety than anything. E.g. the
charging path could be forced to memory reclaim because of the cached
charges which are still waiting for their draining. Not really something
to lose sleep over from the runtime perspective. I was just wondering
whether this makes things more complex than necessary.

> We
> might get migrated after put_cpu() so I wasn't aware that this is
> important. Should we attempt best effort and wait for the worker on the
> current CPU?


> > An alternative would be to split out __drain_local_stock which doesn't
> > do local_lock.
> 
> but isn't the section in drain_local_stock() unprotected then?

local_lock instead of {get,put}_cpu would handle that right?
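
Spelled out, the variant floated here (and argued against in the next
reply) would look roughly like the sketch below; __drain_local_stock()
and the stock_lock member are assumed names and this is not a posted
patch:

static void drain_all_stock(struct mem_cgroup *root_memcg)
{
        int cpu, curcpu;
        unsigned long flags;

        if (!mutex_trylock(&percpu_charge_mutex))
                return;

        /* The local lock pins the task to this CPU instead of get_cpu(). */
        local_lock_irqsave(&memcg_stock.stock_lock, flags);
        curcpu = smp_processor_id();
        for_each_online_cpu(cpu) {
                struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
                bool flush = false;

                /* flush decision (memcg matching under RCU) as in the existing loop */

                if (flush &&
                    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
                        if (cpu == curcpu)
                                __drain_local_stock(stock);
                        else
                                schedule_work_on(cpu, &stock->work);
                }
        }
        local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
        mutex_unlock(&percpu_charge_mutex);
}

The price, as pointed out below, is that the whole loop then runs with
the lock held and interrupts disabled, although only the local drain
needs that protection.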
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
@ 2022-02-21 16:24         ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 16:24 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Andrew Morton, Johannes Weiner, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On Mon 21-02-22 16:19:52, Sebastian Andrzej Siewior wrote:
> On 2022-02-21 15:46:30 [+0100], Michal Hocko wrote:
> > On Thu 17-02-22 10:48:02, Sebastian Andrzej Siewior wrote:
> > [...]
> > > @@ -2266,7 +2273,6 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > >  	 * as well as workers from this path always operate on the local
> > >  	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
> > >  	 */
> > > -	curcpu = get_cpu();
> > 
> > Could you make this a separate patch?
> 
> Sure.
> 
> > >  	for_each_online_cpu(cpu) {
> > >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > >  		struct mem_cgroup *memcg;
> > > @@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > >  		rcu_read_unlock();
> > >  
> > >  		if (flush &&
> > > -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > > -			if (cpu == curcpu)
> > > -				drain_local_stock(&stock->work);
> > > -			else
> > > -				schedule_work_on(cpu, &stock->work);
> > > -		}
> > > +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > > +			schedule_work_on(cpu, &stock->work);
> > 
> > Maybe I am missing but on !PREEMPT kernels there is nothing really
> > guaranteeing that the worker runs so there should be cond_resched after
> > the mutex is unlocked. I do not think we want to rely on callers to be
> > aware of this subtlety.
> 
> There is no guarantee on PREEMPT kernels either. The worker will be made
> runnable and will be put on the CPU when the scheduler sees fit, and
> there could be other workers which take precedence (queued earlier).
> But I was not aware that the worker _needs_ to run before we return.

A lack of draining will not be a correctness problem (sorry I should
have made that clear). It is more about subtlety than anything. E.g. the
charging path could be forced to memory reclaim because of the cached
charges which are still waiting for their draining. Not really something
to lose sleep over from the runtime perspective. I was just wondering
whether this makes things more complex than necessary.

> We
> might get migrated after put_cpu() so I wasn't aware that this is
> important. Should we attempt best effort and wait for the worker on the
> current CPU?


> > An alternative would be to split out __drain_local_stock which doesn't
> > do local_lock.
> 
> but isn't the section in drain_local_stock() unprotected then?

local_lock instead of {get,put}_cpu would handle that right?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
  2022-02-21 16:24         ` Michal Hocko
@ 2022-02-21 16:44           ` Sebastian Andrzej Siewior
  -1 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-21 16:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups, linux-mm, Andrew Morton, Johannes Weiner,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On 2022-02-21 17:24:41 [+0100], Michal Hocko wrote:
> > > > @@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > > >  		rcu_read_unlock();
> > > >  
> > > >  		if (flush &&
> > > > -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > > > -			if (cpu == curcpu)
> > > > -				drain_local_stock(&stock->work);
> > > > -			else
> > > > -				schedule_work_on(cpu, &stock->work);
> > > > -		}
> > > > +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > > > +			schedule_work_on(cpu, &stock->work);
> > > 
> > > Maybe I am missing but on !PREEMPT kernels there is nothing really
> > > guaranteeing that the worker runs so there should be cond_resched after
> > > the mutex is unlocked. I do not think we want to rely on callers to be
> > > aware of this subtlety.
> > 
> > There is no guarantee on PREEMPT kernels either. The worker will be made
> > runnable and will be put on the CPU when the scheduler sees fit, and
> > there could be other workers which take precedence (queued earlier).
> > But I was not aware that the worker _needs_ to run before we return.
> 
> A lack of draining will not be a correctness problem (sorry I should
> have made that clear). It is more about subtlety than anything. E.g. the
> charging path could be forced to memory reclaim because of the cached
> charges which are still waiting for their draining. Not really something
> to lose sleep over from the runtime perspective. I was just wondering
> whether this makes things more complex than necessary.

So it is not strictly wrong, but it would be better if we could do
drain_local_stock() on the local CPU.

> > We
> > might get migrated after put_cpu() so I wasn't aware that this is
> > important. Should we attempt best effort and wait for the worker on the
> > current CPU?
> 
> 
> > > An alternative would be to split out __drain_local_stock which doesn't
> > > do local_lock.
> > 
> > but isn't the section in drain_local_stock() unprotected then?
> 
> local_lock instead of {get,put}_cpu would handle that right?

It took a while, but it clicked :)
If we acquire the local_lock_t, which we would otherwise acquire in
drain_local_stock(), before the for_each_cpu loop (as you say, instead of
{get,put}_cpu) then we would indeed need __drain_local_stock() and things
would work. But it looks like an abuse of the lock to avoid CPU
migration since there is no need to have it acquired at this point. Also
the whole section would run with disabled interrupts and there is no
need for it.

What about replacing get_cpu() with migrate_disable()?
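
In code, the proposal would amount to something like the following
inside drain_all_stock() (a sketch of the idea only, not the patch that
was eventually posted; the flush decision is kept as in the current
code):

        /*
         * migrate_disable() keeps the task on this CPU, so the
         * cpu == curcpu fast path stays meaningful, but unlike
         * get_cpu() it does not disable preemption, so
         * drain_local_stock() may still take its local_lock_t
         * (a sleeping lock on PREEMPT_RT).
         */
        migrate_disable();
        curcpu = smp_processor_id();
        for_each_online_cpu(cpu) {
                struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
                struct mem_cgroup *memcg;
                bool flush = false;

                rcu_read_lock();
                memcg = stock->cached;
                if (memcg && stock->nr_pages &&
                    mem_cgroup_is_descendant(memcg, root_memcg))
                        flush = true;
                else if (obj_stock_flush_required(stock, root_memcg))
                        flush = true;
                rcu_read_unlock();

                if (flush &&
                    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
                        if (cpu == curcpu)
                                drain_local_stock(&stock->work);
                        else
                                schedule_work_on(cpu, &stock->work);
                }
        }
        migrate_enable();

Only the pinning changes; the rest of the loop stays as it is today.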

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
@ 2022-02-21 16:44           ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-21 16:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Andrew Morton, Johannes Weiner, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On 2022-02-21 17:24:41 [+0100], Michal Hocko wrote:
> > > > @@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > > >  		rcu_read_unlock();
> > > >  
> > > >  		if (flush &&
> > > > -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > > > -			if (cpu == curcpu)
> > > > -				drain_local_stock(&stock->work);
> > > > -			else
> > > > -				schedule_work_on(cpu, &stock->work);
> > > > -		}
> > > > +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > > > +			schedule_work_on(cpu, &stock->work);
> > > 
> > > Maybe I am missing but on !PREEMPT kernels there is nothing really
> > > guaranteeing that the worker runs so there should be cond_resched after
> > > the mutex is unlocked. I do not think we want to rely on callers to be
> > > aware of this subtlety.
> > 
> > There is no guarantee on PREEMPT kernels either. The worker will be made
> > runnable and will be put on the CPU when the scheduler sees fit, and
> > there could be other workers which take precedence (queued earlier).
> > But I was not aware that the worker _needs_ to run before we return.
> 
> A lack of draining will not be a correctness problem (sorry I should
> have made that clear). It is more about subtlety than anything. E.g. the
> charging path could be forced to memory reclaim because of the cached
> charges which are still waiting for their draining. Not really something
> to lose sleep over from the runtime perspective. I was just wondering
> whether this makes things more complex than necessary.

So it is not strictly wrong, but it would be better if we could do
drain_local_stock() on the local CPU.

> > We
> > might get migrated after put_cpu() so I wasn't aware that this is
> > important. Should we attempt best effort and wait for the worker on the
> > current CPU?
> 
> 
> > > An alternative would be to split out __drain_local_stock which doesn't
> > > do local_lock.
> > 
> > but isn't the section in drain_local_stock() unprotected then?
> 
> local_lock instead of {get,put}_cpu would handle that right?

It took a while, but it clicked :)
If we acquire the local_lock_t, which we would otherwise acquire in
drain_local_stock(), before the for_each_cpu loop (as you say, instead of
{get,put}_cpu) then we would indeed need __drain_local_stock() and things
would work. But it looks like an abuse of the lock to avoid CPU
migration since there is no need to have it acquired at this point. Also
the whole section would run with disabled interrupts and there is no
need for it.

What about replacing get_cpu() with migrate_disable()?

Sebastian

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
@ 2022-02-21 17:17             ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 17:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups, linux-mm, Andrew Morton, Johannes Weiner,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On Mon 21-02-22 17:44:13, Sebastian Andrzej Siewior wrote:
> On 2022-02-21 17:24:41 [+0100], Michal Hocko wrote:
> > > > > @@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > > > >  		rcu_read_unlock();
> > > > >  
> > > > >  		if (flush &&
> > > > > -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > > > > -			if (cpu == curcpu)
> > > > > -				drain_local_stock(&stock->work);
> > > > > -			else
> > > > > -				schedule_work_on(cpu, &stock->work);
> > > > > -		}
> > > > > +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > > > > +			schedule_work_on(cpu, &stock->work);
> > > > 
> > > > Maybe I am missing but on !PREEMPT kernels there is nothing really
> > > > guaranteeing that the worker runs so there should be cond_resched after
> > > > the mutex is unlocked. I do not think we want to rely on callers to be
> > > > aware of this subtlety.
> > > 
> > > There is no guarantee on PREEMPT kernels either. The worker will be made
> > > runnable and will be put on the CPU when the scheduler sees fit, and
> > > there could be other workers which take precedence (queued earlier).
> > > But I was not aware that the worker _needs_ to run before we return.
> > 
> > A lack of draining will not be a correctness problem (sorry I should
> > have made that clear). It is more about subtlety than anything. E.g. the
> > charging path could be forced to memory reclaim because of the cached
> > charges which are still waiting for their draining. Not really something
> > to lose sleep over from the runtime perspective. I was just wondering
> > whether this makes things more complex than necessary.
> 
> So it is not strictly wrong, but it would be better if we could do
> drain_local_stock() on the local CPU.
> 
> > > We
> > > might get migrated after put_cpu() so I wasn't aware that this is
> > > important. Should we attempt best effort and wait for the worker on the
> > > current CPU?
> > 
> > 
> > > > An alternative would be to split out __drain_local_stock which doesn't
> > > > do local_lock.
> > > 
> > > but isn't the section in drain_local_stock() unprotected then?
> > 
> > local_lock instead of {get,put}_cpu would handle that right?
> 
> It took a while, but it clicked :)
> If we acquire the local_lock_t, which we would otherwise acquire in
> drain_local_stock(), before the for_each_cpu loop (as you say, instead of
> {get,put}_cpu) then we would indeed need __drain_local_stock() and things
> would work. But it looks like an abuse of the lock to avoid CPU
> migration since there is no need to have it acquired at this point. Also
> the whole section would run with disabled interrupts and there is no
> need for it.
> 
> What about replacing get_cpu() with migrate_disable()?

Yeah, that would be a better option. I am just not used to thinking in
RT terms, so migrate_disable() didn't really come to mind.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
@ 2022-02-21 17:17             ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2022-02-21 17:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Andrew Morton, Johannes Weiner, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On Mon 21-02-22 17:44:13, Sebastian Andrzej Siewior wrote:
> On 2022-02-21 17:24:41 [+0100], Michal Hocko wrote:
> > > > > @@ -2282,14 +2288,9 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> > > > >  		rcu_read_unlock();
> > > > >  
> > > > >  		if (flush &&
> > > > > -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > > > > -			if (cpu == curcpu)
> > > > > -				drain_local_stock(&stock->work);
> > > > > -			else
> > > > > -				schedule_work_on(cpu, &stock->work);
> > > > > -		}
> > > > > +		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > > > > +			schedule_work_on(cpu, &stock->work);
> > > > 
> > > > Maybe I am missing but on !PREEMPT kernels there is nothing really
> > > > guaranteeing that the worker runs so there should be cond_resched after
> > > > the mutex is unlocked. I do not think we want to rely on callers to be
> > > > aware of this subtlety.
> > > 
> > > There is no guarantee on PREEMPT kernels either. The worker will be made
> > > runnable and will be put on the CPU when the scheduler sees fit, and
> > > there could be other workers which take precedence (queued earlier).
> > > But I was not aware that the worker _needs_ to run before we return.
> > 
> > A lack of draining will not be a correctness problem (sorry I should
> > have made that clear). It is more about subtlety than anything. E.g. the
> > charging path could be forced to memory reclaim because of the cached
> > charges which are still waiting for their draining. Not really something
> > to lose sleep over from the runtime perspective. I was just wondering
> > whether this makes things more complex than necessary.
> 
> So it is not strictly wrong, but it would be better if we could do
> drain_local_stock() on the local CPU.
> 
> > > We
> > > might get migrated after put_cpu() so I wasn't aware that this is
> > > important. Should we attempt best effort and wait for the worker on the
> > > current CPU?
> > 
> > 
> > > > An alternative would be to split out __drain_local_stock which doesn't
> > > > do local_lock.
> > > 
> > > but isn't the section in drain_local_stock() unprotected then?
> > 
> > local_lock instead of {get,put}_cpu would handle that right?
> 
> It took a while, but it clicked :)
> If we acquire the local_lock_t, which we would otherwise acquire in
> drain_local_stock(), before the for_each_cpu loop (as you say, instead of
> {get,put}_cpu) then we would indeed need __drain_local_stock() and things
> would work. But it looks like an abuse of the lock to avoid CPU
> migration since there is no need to have it acquired at this point. Also
> the whole section would run with disabled interrupts and there is no
> need for it.
> 
> What about replacing get_cpu() with migrate_disable()?

Yeah, that would be a better option. I am just not used to thinking in
RT terms, so migrate_disable() didn't really come to mind.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
@ 2022-02-21 17:25               ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-21 17:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups, linux-mm, Andrew Morton, Johannes Weiner,
	Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On 2022-02-21 18:17:39 [+0100], Michal Hocko wrote:
> > What about replacing get_cpu() with migrate_disable()?
> 
> Yeah, that would be a better option. I am just not used to thinking in
> RT terms, so migrate_disable() didn't really come to mind.

Good.
Let me rebase accordingly, new version is coming soon.

Side note: migrate_disable() is by now implemented in !RT and used in
highmem for instance (kmap_atomic()).

Sebastian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t
@ 2022-02-21 17:25               ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 48+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-21 17:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Andrew Morton, Johannes Weiner, Michal Koutný,
	Peter Zijlstra, Thomas Gleixner, Vladimir Davydov, Waiman Long,
	kernel test robot

On 2022-02-21 18:17:39 [+0100], Michal Hocko wrote:
> > What about replacing get_cpu() with migrate_disable()?
> 
> Yeah, that would be a better option. I am just not used to thinking in
> RT terms, so migrate_disable() didn't really come to mind.

Good.
Let me rebase accordingly, new version is coming soon.

Side note: migrate_disable() is by now implemented in !RT and used in
highmem for instance (kmap_atomic()).

Sebastian

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2022-02-21 17:25 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-17  9:47 [PATCH v3 0/5] mm/memcg: Address PREEMPT_RT problems instead of disabling it Sebastian Andrzej Siewior
2022-02-17  9:47 ` Sebastian Andrzej Siewior
2022-02-17  9:47 ` [PATCH v3 1/5] mm/memcg: Revert ("mm/memcg: optimize user context object stock access") Sebastian Andrzej Siewior
2022-02-17  9:47   ` Sebastian Andrzej Siewior
2022-02-18 16:09   ` Shakeel Butt
2022-02-18 16:09     ` Shakeel Butt
2022-02-21 14:26   ` Michal Hocko
2022-02-21 14:26     ` Michal Hocko
2022-02-17  9:47 ` [PATCH v3 2/5] mm/memcg: Disable threshold event handlers on PREEMPT_RT Sebastian Andrzej Siewior
2022-02-17  9:47   ` Sebastian Andrzej Siewior
2022-02-18 16:39   ` Shakeel Butt
2022-02-18 16:39     ` Shakeel Butt
2022-02-21 14:27   ` Michal Hocko
2022-02-21 14:27     ` Michal Hocko
2022-02-17  9:48 ` [PATCH v3 3/5] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT where needed Sebastian Andrzej Siewior
2022-02-17  9:48   ` Sebastian Andrzej Siewior
2022-02-18 17:25   ` Shakeel Butt
2022-02-18 17:25     ` Shakeel Butt
2022-02-21 11:31     ` Sebastian Andrzej Siewior
2022-02-21 11:31       ` Sebastian Andrzej Siewior
2022-02-21 12:12       ` Sebastian Andrzej Siewior
2022-02-21 12:12         ` Sebastian Andrzej Siewior
2022-02-21 13:18       ` Michal Koutný
2022-02-21 13:18         ` Michal Koutný
2022-02-21 13:58         ` Sebastian Andrzej Siewior
2022-02-21 13:58           ` Sebastian Andrzej Siewior
2022-02-17  9:48 ` [PATCH v3 4/5] mm/memcg: Opencode the inner part of obj_cgroup_uncharge_pages() in drain_obj_stock() Sebastian Andrzej Siewior
2022-02-17  9:48   ` Sebastian Andrzej Siewior
2022-02-18 18:40   ` Shakeel Butt
2022-02-18 18:40     ` Shakeel Butt
2022-02-18 19:07   ` Roman Gushchin
2022-02-18 19:07     ` Roman Gushchin
2022-02-21 14:30   ` Michal Hocko
2022-02-21 14:30     ` Michal Hocko
2022-02-17  9:48 ` [PATCH v3 5/5] mm/memcg: Protect memcg_stock with a local_lock_t Sebastian Andrzej Siewior
2022-02-17  9:48   ` Sebastian Andrzej Siewior
2022-02-21 14:46   ` Michal Hocko
2022-02-21 14:46     ` Michal Hocko
2022-02-21 15:19     ` Sebastian Andrzej Siewior
2022-02-21 15:19       ` Sebastian Andrzej Siewior
2022-02-21 16:24       ` Michal Hocko
2022-02-21 16:24         ` Michal Hocko
2022-02-21 16:44         ` Sebastian Andrzej Siewior
2022-02-21 16:44           ` Sebastian Andrzej Siewior
2022-02-21 17:17           ` Michal Hocko
2022-02-21 17:17             ` Michal Hocko
2022-02-21 17:25             ` Sebastian Andrzej Siewior
2022-02-21 17:25               ` Sebastian Andrzej Siewior
