linux-mm.kvack.org archive mirror
* [PATCH 0/5] Cleanups and fixup for memcontrol
@ 2021-07-29 12:57 Miaohe Lin
  2021-07-29 12:57 ` [PATCH 1/5] mm, memcg: remove unused functions Miaohe Lin
                   ` (4 more replies)
  0 siblings, 5 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-07-29 12:57 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm
  Cc: shakeelb, guro, willy, alexs, richard.weiyang, songmuchun,
	linux-mm, linux-kernel, cgroups, linmiaohe

Hi all,
This series contains cleanups that remove unused functions, narrow the
scope of a mutex and so on. It also fixes a possible NULL pointer
dereference and a possibly wrong percpu operation. More details can
be found in the respective changelogs. Thanks!

Miaohe Lin (5):
  mm, memcg: remove unused functions
  mm, memcg: narrow the scope of percpu_charge_mutex
  mm, memcg: save some atomic ops when flush is already true
  mm, memcg: avoid possible NULL pointer dereferencing in
    mem_cgroup_init()
  mm, memcg: always call __mod_node_page_state() with preempt disabled

 include/linux/memcontrol.h | 12 ------------
 mm/memcontrol.c            |  8 +++++---
 2 files changed, 5 insertions(+), 15 deletions(-)

-- 
2.23.0



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/5] mm, memcg: remove unused functions
  2021-07-29 12:57 [PATCH 0/5] Cleanups and fixup for memcontrol Miaohe Lin
@ 2021-07-29 12:57 ` Miaohe Lin
  2021-07-29 14:07   ` Shakeel Butt
                     ` (3 more replies)
  2021-07-29 12:57 ` [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex Miaohe Lin
                   ` (3 subsequent siblings)
  4 siblings, 4 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-07-29 12:57 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm
  Cc: shakeelb, guro, willy, alexs, richard.weiyang, songmuchun,
	linux-mm, linux-kernel, cgroups, linmiaohe

Since commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat"), the last
user of memcg_stat_item_in_bytes() is gone. And since commit fa40d1ee9f15
("mm: vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only
the declaration of mem_cgroup_select_victim_node() remains here. Remove
both.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 include/linux/memcontrol.h | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7028d8e4a3d7..04437504444f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -606,13 +606,6 @@ static inline bool PageMemcgKmem(struct page *page)
 	return folio_memcg_kmem(page_folio(page));
 }
 
-static __always_inline bool memcg_stat_item_in_bytes(int idx)
-{
-	if (idx == MEMCG_PERCPU_B)
-		return true;
-	return vmstat_item_in_bytes(idx);
-}
-
 static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 {
 	return (memcg == root_mem_cgroup);
@@ -916,11 +909,6 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
 	return !!(memcg->css.flags & CSS_ONLINE);
 }
 
-/*
- * For memory reclaim.
- */
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
-
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 		int zid, int nr_pages);
 
-- 
2.23.0




* [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-07-29 12:57 [PATCH 0/5] Cleanups and fixup for memcontrol Miaohe Lin
  2021-07-29 12:57 ` [PATCH 1/5] mm, memcg: remove unused functions Miaohe Lin
@ 2021-07-29 12:57 ` Miaohe Lin
  2021-07-30  2:42   ` Muchun Song
                     ` (2 more replies)
  2021-07-29 12:57 ` [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true Miaohe Lin
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-07-29 12:57 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm
  Cc: shakeelb, guro, willy, alexs, richard.weiyang, songmuchun,
	linux-mm, linux-kernel, cgroups, linmiaohe

Since percpu_charge_mutex is only used inside drain_all_stock(), we can
narrow the scope of percpu_charge_mutex by moving it into that function.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6580c2381a3e..a03e24e57cd9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
 #define FLUSHING_CACHED_CHARGE	0
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
-static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
 static void drain_obj_stock(struct obj_stock *stock);
@@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 static void drain_all_stock(struct mem_cgroup *root_memcg)
 {
+	static DEFINE_MUTEX(percpu_charge_mutex);
 	int cpu, curcpu;
 
 	/* If someone's already draining, avoid adding running more workers. */
-- 
2.23.0
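
To illustrate the scope change in this patch, here is a minimal userspace sketch, not the kernel code. It assumes POSIX threads in place of the kernel mutex API; `drain_all()`, `drain_lock` and `drain_runs` are hypothetical names. A function-local `static` lock still has static storage duration, so every call shares the same lock; only its visibility shrinks to the one function that uses it. The trylock mirrors the "if someone's already draining" comment in the hunk above.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU draining work. */
static int drain_runs;

/*
 * Returns true if this call did the draining, false if another
 * caller already held the lock and this call backed off.
 */
static bool drain_all(void)
{
	/* Function-local name, static storage: one lock shared by all calls. */
	static pthread_mutex_t drain_lock = PTHREAD_MUTEX_INITIALIZER;

	if (pthread_mutex_trylock(&drain_lock) != 0)
		return false;	/* someone is already draining */

	drain_runs++;		/* the real work would go here */

	pthread_mutex_unlock(&drain_lock);
	return true;
}
```

Without contention every call takes the lock and returns true. A caller that arrives while another thread holds the lock returns false immediately instead of sleeping.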




* [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true
  2021-07-29 12:57 [PATCH 0/5] Cleanups and fixup for memcontrol Miaohe Lin
  2021-07-29 12:57 ` [PATCH 1/5] mm, memcg: remove unused functions Miaohe Lin
  2021-07-29 12:57 ` [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex Miaohe Lin
@ 2021-07-29 12:57 ` Miaohe Lin
  2021-07-29 14:40   ` Shakeel Butt
                     ` (3 more replies)
  2021-07-29 12:57 ` [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init() Miaohe Lin
  2021-07-29 12:57 ` [PATCH 5/5] mm, memcg: always call __mod_node_page_state() with preempt disabled Miaohe Lin
  4 siblings, 4 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-07-29 12:57 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm
  Cc: shakeelb, guro, willy, alexs, richard.weiyang, songmuchun,
	linux-mm, linux-kernel, cgroups, linmiaohe

Add 'else' to save some atomic ops in obj_stock_flush_required() when
flush is already true. No functional change intended here.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a03e24e57cd9..5b4592d1e0f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2231,7 +2231,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		if (memcg && stock->nr_pages &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
-		if (obj_stock_flush_required(stock, root_memcg))
+		else if (obj_stock_flush_required(stock, root_memcg))
 			flush = true;
 		rcu_read_unlock();
 
-- 
2.23.0
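
The effect of the added `else` can be sketched in plain C. This is only an illustration under stated assumptions: `obj_flush_required()` is a hypothetical stand-in for obj_stock_flush_required(), and `obj_checks` is a counter added here to make the skipped call visible. Both variants compute the same `flush` value; the second one simply skips the second check when the first condition already holds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter: how many times the second check ran. */
static int obj_checks;

/* Stand-in for obj_stock_flush_required(); the real one is not shown here. */
static bool obj_flush_required(bool obj_pending)
{
	obj_checks++;
	return obj_pending;
}

/* Before the patch: the second check always runs. */
static bool need_flush_before(bool pages_pending, bool obj_pending)
{
	bool flush = false;

	if (pages_pending)
		flush = true;
	if (obj_flush_required(obj_pending))
		flush = true;
	return flush;
}

/* After the patch: the second check runs only if the first one failed. */
static bool need_flush_after(bool pages_pending, bool obj_pending)
{
	bool flush = false;

	if (pages_pending)
		flush = true;
	else if (obj_flush_required(obj_pending))
		flush = true;
	return flush;
}

/* Both variants agree on all four input combinations. */
static void check_equivalent(void)
{
	for (int p = 0; p <= 1; p++)
		for (int o = 0; o <= 1; o++)
			assert(need_flush_before(p, o) == need_flush_after(p, o));
}
```

The result is identical in every case, which matches the "no functional change intended" note in the changelog; only the number of calls to the second check differs.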




* [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init()
  2021-07-29 12:57 [PATCH 0/5] Cleanups and fixup for memcontrol Miaohe Lin
                   ` (2 preceding siblings ...)
  2021-07-29 12:57 ` [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true Miaohe Lin
@ 2021-07-29 12:57 ` Miaohe Lin
  2021-07-29 13:52   ` Matthew Wilcox
  2021-07-30  3:12   ` Roman Gushchin
  2021-07-29 12:57 ` [PATCH 5/5] mm, memcg: always call __mod_node_page_state() with preempt disabled Miaohe Lin
  4 siblings, 2 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-07-29 12:57 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm
  Cc: shakeelb, guro, willy, alexs, richard.weiyang, songmuchun,
	linux-mm, linux-kernel, cgroups, linmiaohe

rtpn might be NULL in very rare cases. We had better check it before
dereferencing it. Since memcg can live with a NULL rb_tree_per_node in
soft_limit_tree, warn about this case and continue.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memcontrol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5b4592d1e0f2..70a32174e7c4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7109,6 +7109,8 @@ static int __init mem_cgroup_init(void)
 		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
 				    node_online(node) ? node : NUMA_NO_NODE);
 
+		if (WARN_ON_ONCE(!rtpn))
+			continue;
 		rtpn->rb_root = RB_ROOT;
 		rtpn->rb_rightmost = NULL;
 		spin_lock_init(&rtpn->lock);
-- 
2.23.0




* [PATCH 5/5] mm, memcg: always call __mod_node_page_state() with preempt disabled
  2021-07-29 12:57 [PATCH 0/5] Cleanups and fixup for memcontrol Miaohe Lin
                   ` (3 preceding siblings ...)
  2021-07-29 12:57 ` [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init() Miaohe Lin
@ 2021-07-29 12:57 ` Miaohe Lin
  2021-07-29 14:39   ` Shakeel Butt
  4 siblings, 1 reply; 45+ messages in thread
From: Miaohe Lin @ 2021-07-29 12:57 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm
  Cc: shakeelb, guro, willy, alexs, richard.weiyang, songmuchun,
	linux-mm, linux-kernel, cgroups, linmiaohe

We should always ensure __mod_node_page_state() is called with preemption
disabled; otherwise percpu ops may operate on the wrong CPU if preemption
happens.

Fixes: b4e0b68fbd9d ("mm: memcontrol: use obj_cgroup APIs to charge kmem pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 70a32174e7c4..616d1a72ece3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -697,8 +697,8 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
 	memcg = page_memcg(head);
 	/* Untracked pages have no memcg, no lruvec. Update only the node */
 	if (!memcg) {
-		rcu_read_unlock();
 		__mod_node_page_state(pgdat, idx, val);
+		rcu_read_unlock();
 		return;
 	}
 
-- 
2.23.0




* Re: [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init()
  2021-07-29 12:57 ` [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init() Miaohe Lin
@ 2021-07-29 13:52   ` Matthew Wilcox
  2021-07-30  1:50     ` Miaohe Lin
  2021-07-30  3:12   ` Roman Gushchin
  1 sibling, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2021-07-29 13:52 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: hannes, mhocko, vdavydov.dev, akpm, shakeelb, guro, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Thu, Jul 29, 2021 at 08:57:54PM +0800, Miaohe Lin wrote:
> rtpn might be NULL in very rare case. We have better to check it before
> dereferencing it. Since memcg can live with NULL rb_tree_per_node in
> soft_limit_tree, warn this case and continue.

Why would we need to warn?  The GFP flags don't contain __GFP_NOWARN, so
we already know an allocation failed.



* Re: [PATCH 1/5] mm, memcg: remove unused functions
  2021-07-29 12:57 ` [PATCH 1/5] mm, memcg: remove unused functions Miaohe Lin
@ 2021-07-29 14:07   ` Shakeel Butt
  2021-07-30  2:39   ` Muchun Song
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Shakeel Butt @ 2021-07-29 14:07 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Matthew Wilcox, alexs, Wei Yang, Muchun Song,
	Linux MM, LKML, Cgroups

On Thu, Jul 29, 2021 at 5:57 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> Since commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat"), last user
> of memcg_stat_item_in_bytes() is gone. And since commit fa40d1ee9f15 ("mm:
> vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only the
> declaration of mem_cgroup_select_victim_node() is remained here. Remove
> them.
>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>



* Re: [PATCH 5/5] mm, memcg: always call __mod_node_page_state() with preempt disabled
  2021-07-29 12:57 ` [PATCH 5/5] mm, memcg: always call __mod_node_page_state() with preempt disabled Miaohe Lin
@ 2021-07-29 14:39   ` Shakeel Butt
  2021-07-30  1:52     ` Miaohe Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Shakeel Butt @ 2021-07-29 14:39 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Matthew Wilcox, alexs, Wei Yang, Muchun Song,
	Linux MM, LKML, Cgroups

On Thu, Jul 29, 2021 at 5:58 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> We should always ensure __mod_node_page_state() is called with preempt
> disabled or percpu ops may manipulate the wrong cpu when preempt happened.
>
> Fixes: b4e0b68fbd9d ("mm: memcontrol: use obj_cgroup APIs to charge kmem pages")
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 70a32174e7c4..616d1a72ece3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -697,8 +697,8 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
>         memcg = page_memcg(head);
>         /* Untracked pages have no memcg, no lruvec. Update only the node */
>         if (!memcg) {
> -               rcu_read_unlock();
>                 __mod_node_page_state(pgdat, idx, val);
> +               rcu_read_unlock();

This RCU read lock is for page_memcg(). Preemption and interrupts are
already disabled across __mod_lruvec_page_state().

>                 return;
>         }
>
> --
> 2.23.0
>



* Re: [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true
  2021-07-29 12:57 ` [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true Miaohe Lin
@ 2021-07-29 14:40   ` Shakeel Butt
  2021-07-30  2:37   ` Muchun Song
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Shakeel Butt @ 2021-07-29 14:40 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Matthew Wilcox, alexs, Wei Yang, Muchun Song,
	Linux MM, LKML, Cgroups

On Thu, Jul 29, 2021 at 5:58 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> Add 'else' to save some atomic ops in obj_stock_flush_required() when
> flush is already true. No functional change intended here.
>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>



* Re: [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init()
  2021-07-29 13:52   ` Matthew Wilcox
@ 2021-07-30  1:50     ` Miaohe Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-07-30  1:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: hannes, mhocko, vdavydov.dev, akpm, shakeelb, guro, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On 2021/7/29 21:52, Matthew Wilcox wrote:
> On Thu, Jul 29, 2021 at 08:57:54PM +0800, Miaohe Lin wrote:
>> rtpn might be NULL in very rare case. We have better to check it before
>> dereferencing it. Since memcg can live with NULL rb_tree_per_node in
>> soft_limit_tree, warn this case and continue.
> 
> Why would we need to warn?  the GFP flags don't contain NOWARN, so
> we already know an allocation failed.

I see. Will remove it. Many thanks!

> .
> 




* Re: [PATCH 5/5] mm, memcg: always call __mod_node_page_state() with preempt disabled
  2021-07-29 14:39   ` Shakeel Butt
@ 2021-07-30  1:52     ` Miaohe Lin
  2021-07-30  2:33       ` [External] " Muchun Song
  0 siblings, 1 reply; 45+ messages in thread
From: Miaohe Lin @ 2021-07-30  1:52 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Matthew Wilcox, alexs, Wei Yang, Muchun Song,
	Linux MM, LKML, Cgroups

On 2021/7/29 22:39, Shakeel Butt wrote:
> On Thu, Jul 29, 2021 at 5:58 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> We should always ensure __mod_node_page_state() is called with preempt
>> disabled or percpu ops may manipulate the wrong cpu when preempt happened.
>>
>> Fixes: b4e0b68fbd9d ("mm: memcontrol: use obj_cgroup APIs to charge kmem pages")
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/memcontrol.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 70a32174e7c4..616d1a72ece3 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -697,8 +697,8 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
>>         memcg = page_memcg(head);
>>         /* Untracked pages have no memcg, no lruvec. Update only the node */
>>         if (!memcg) {
>> -               rcu_read_unlock();
>>                 __mod_node_page_state(pgdat, idx, val);
>> +               rcu_read_unlock();
> 
> This rcu is for page_memcg. The preemption and interrupts are disabled
> across __mod_lruvec_page_state().
> 

I thought it was used to protect __mod_node_page_state(). It looked somewhat confusing to me.
Many thanks for pointing this out!

>>                 return;
>>         }
>>
>> --
>> 2.23.0
>>
> .
> 




* Re: [External] Re: [PATCH 5/5] mm, memcg: always call __mod_node_page_state() with preempt disabled
  2021-07-30  1:52     ` Miaohe Lin
@ 2021-07-30  2:33       ` Muchun Song
  2021-07-30  2:46         ` Miaohe Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Muchun Song @ 2021-07-30  2:33 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Shakeel Butt, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Roman Gushchin, Matthew Wilcox, Alex Shi,
	Wei Yang, Linux MM, LKML, Cgroups

On Fri, Jul 30, 2021 at 9:52 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2021/7/29 22:39, Shakeel Butt wrote:
> > On Thu, Jul 29, 2021 at 5:58 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> >>
> >> We should always ensure __mod_node_page_state() is called with preempt
> >> disabled or percpu ops may manipulate the wrong cpu when preempt happened.
> >>
> >> Fixes: b4e0b68fbd9d ("mm: memcontrol: use obj_cgroup APIs to charge kmem pages")
> >> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >> ---
> >>  mm/memcontrol.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 70a32174e7c4..616d1a72ece3 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -697,8 +697,8 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
> >>         memcg = page_memcg(head);
> >>         /* Untracked pages have no memcg, no lruvec. Update only the node */
> >>         if (!memcg) {
> >> -               rcu_read_unlock();
> >>                 __mod_node_page_state(pgdat, idx, val);
> >> +               rcu_read_unlock();
> >
> > This rcu is for page_memcg. The preemption and interrupts are disabled
> > across __mod_lruvec_page_state().
> >
>
> I thought it's used to protect __mod_node_page_state(). Looks somewhat confusing for me.
> Many thanks for pointing this out!

Hi Miaohe,

git show b4e0b68fbd9d can help you find out why we added
the RCU read lock around it.

Thanks.

>
> >>                 return;
> >>         }
> >>
> >> --
> >> 2.23.0
> >>
> > .
> >
>



* Re: [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true
  2021-07-29 12:57 ` [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true Miaohe Lin
  2021-07-29 14:40   ` Shakeel Butt
@ 2021-07-30  2:37   ` Muchun Song
  2021-07-30  3:07   ` Roman Gushchin
  2021-07-30  6:51   ` Michal Hocko
  3 siblings, 0 replies; 45+ messages in thread
From: Muchun Song @ 2021-07-30  2:37 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Shakeel Butt, Roman Gushchin, Matthew Wilcox, Alex Shi, Wei Yang,
	Linux Memory Management List, LKML, Cgroups

On Thu, Jul 29, 2021 at 8:57 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> Add 'else' to save some atomic ops in obj_stock_flush_required() when
> flush is already true. No functional change intended here.
>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>



* Re: [PATCH 1/5] mm, memcg: remove unused functions
  2021-07-29 12:57 ` [PATCH 1/5] mm, memcg: remove unused functions Miaohe Lin
  2021-07-29 14:07   ` Shakeel Butt
@ 2021-07-30  2:39   ` Muchun Song
  2021-07-30  2:57   ` Roman Gushchin
  2021-07-30  6:45   ` Michal Hocko
  3 siblings, 0 replies; 45+ messages in thread
From: Muchun Song @ 2021-07-30  2:39 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Shakeel Butt, Roman Gushchin, Matthew Wilcox, Alex Shi, Wei Yang,
	Linux Memory Management List, LKML, Cgroups

On Thu, Jul 29, 2021 at 8:57 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> Since commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat"), last user
> of memcg_stat_item_in_bytes() is gone. And since commit fa40d1ee9f15 ("mm:
> vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only the
> declaration of mem_cgroup_select_victim_node() is remained here. Remove
> them.
>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>



* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-07-29 12:57 ` [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex Miaohe Lin
@ 2021-07-30  2:42   ` Muchun Song
  2021-07-30  3:06   ` Roman Gushchin
  2021-07-30  6:46   ` Michal Hocko
  2 siblings, 0 replies; 45+ messages in thread
From: Muchun Song @ 2021-07-30  2:42 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Shakeel Butt, Roman Gushchin, Matthew Wilcox, Alex Shi, Wei Yang,
	Linux Memory Management List, LKML, Cgroups

On Thu, Jul 29, 2021 at 8:58 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> Since percpu_charge_mutex is only used inside drain_all_stock(), we can
> narrow the scope of percpu_charge_mutex by moving it here.
>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

LGTM.

Reviewed-by: Muchun Song <songmuchun@bytedance.com>



* Re: [External] Re: [PATCH 5/5] mm, memcg: always call __mod_node_page_state() with preempt disabled
  2021-07-30  2:33       ` [External] " Muchun Song
@ 2021-07-30  2:46         ` Miaohe Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-07-30  2:46 UTC (permalink / raw)
  To: Muchun Song
  Cc: Shakeel Butt, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Roman Gushchin, Matthew Wilcox, Alex Shi,
	Wei Yang, Linux MM, LKML, Cgroups

On 2021/7/30 10:33, Muchun Song wrote:
> On Fri, Jul 30, 2021 at 9:52 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2021/7/29 22:39, Shakeel Butt wrote:
>>> On Thu, Jul 29, 2021 at 5:58 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>
>>>> We should always ensure __mod_node_page_state() is called with preempt
>>>> disabled or percpu ops may manipulate the wrong cpu when preempt happened.
>>>>
>>>> Fixes: b4e0b68fbd9d ("mm: memcontrol: use obj_cgroup APIs to charge kmem pages")
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>> ---
>>>>  mm/memcontrol.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index 70a32174e7c4..616d1a72ece3 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -697,8 +697,8 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
>>>>         memcg = page_memcg(head);
>>>>         /* Untracked pages have no memcg, no lruvec. Update only the node */
>>>>         if (!memcg) {
>>>> -               rcu_read_unlock();
>>>>                 __mod_node_page_state(pgdat, idx, val);
>>>> +               rcu_read_unlock();
>>>
>>> This rcu is for page_memcg. The preemption and interrupts are disabled
>>> across __mod_lruvec_page_state().
>>>
>>
>> I thought it's used to protect __mod_node_page_state(). Looks somewhat confusing for me.
>> Many thanks for pointing this out!
> 
> Hi Miaohe,
> 
> git show b4e0b68fbd9d can help you find out why we add
> the rcu read lock around it.

Thanks for the tip. I overlooked that when I checked this commit; I should have looked at it
more closely. :(

> 
> Thanks.
> 
>>
>>>>                 return;
>>>>         }
>>>>
>>>> --
>>>> 2.23.0
>>>>
>>> .
>>>
>>
> .
> 




* Re: [PATCH 1/5] mm, memcg: remove unused functions
  2021-07-29 12:57 ` [PATCH 1/5] mm, memcg: remove unused functions Miaohe Lin
  2021-07-29 14:07   ` Shakeel Butt
  2021-07-30  2:39   ` Muchun Song
@ 2021-07-30  2:57   ` Roman Gushchin
  2021-07-30  6:45   ` Michal Hocko
  3 siblings, 0 replies; 45+ messages in thread
From: Roman Gushchin @ 2021-07-30  2:57 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: hannes, mhocko, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Thu, Jul 29, 2021 at 08:57:51PM +0800, Miaohe Lin wrote:
> Since commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat"), last user
> of memcg_stat_item_in_bytes() is gone. And since commit fa40d1ee9f15 ("mm:
> vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only the
> declaration of mem_cgroup_select_victim_node() is remained here. Remove
> them.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!



* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-07-29 12:57 ` [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex Miaohe Lin
  2021-07-30  2:42   ` Muchun Song
@ 2021-07-30  3:06   ` Roman Gushchin
  2021-07-30  6:50     ` Michal Hocko
  2021-07-30  6:46   ` Michal Hocko
  2 siblings, 1 reply; 45+ messages in thread
From: Roman Gushchin @ 2021-07-30  3:06 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: hannes, mhocko, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Thu, Jul 29, 2021 at 08:57:52PM +0800, Miaohe Lin wrote:
> Since percpu_charge_mutex is only used inside drain_all_stock(), we can
> narrow the scope of percpu_charge_mutex by moving it here.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6580c2381a3e..a03e24e57cd9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
>  #define FLUSHING_CACHED_CHARGE	0
>  };
>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> -static DEFINE_MUTEX(percpu_charge_mutex);
>  
>  #ifdef CONFIG_MEMCG_KMEM
>  static void drain_obj_stock(struct obj_stock *stock);
> @@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   */
>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
> +	static DEFINE_MUTEX(percpu_charge_mutex);
>  	int cpu, curcpu;

It's considered a good practice to protect data instead of code paths. After
the proposed change it becomes obvious that the opposite is done here: the mutex
is used to prevent a simultaneous execution of the code of the drain_all_stock()
function.

Actually we don't need a mutex here: nobody ever sleeps on it. So I'd replace
it with a simple atomic variable or even a single bitfield. Then the change will
be better justified, IMO.
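
A rough userspace sketch of that suggestion, assuming C11 atomics rather than the kernel's own atomic or bitop helpers: `draining`, `try_begin_drain()`, `end_drain()` and `drain_all_flag()` are hypothetical names, not anything from mm/memcontrol.c. Since no caller ever waits for the lock, a single test-and-set flag is enough to keep a second drainer out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: set while one caller is draining. */
static atomic_flag draining = ATOMIC_FLAG_INIT;

/* Hypothetical stand-in for the per-CPU draining work. */
static int drain_runs;

/* Returns true if the caller won the right to drain. */
static bool try_begin_drain(void)
{
	/* test_and_set returns the previous value: true means already set. */
	return !atomic_flag_test_and_set_explicit(&draining,
						  memory_order_acquire);
}

/* Lets the next caller in. */
static void end_drain(void)
{
	atomic_flag_clear_explicit(&draining, memory_order_release);
}

/* Returns true if this call drained, false if it backed off. */
static bool drain_all_flag(void)
{
	if (!try_begin_drain())
		return false;	/* someone is already draining */

	drain_runs++;		/* the real work would go here */

	end_drain();
	return true;
}
```

The flag guards the code path just as the mutex did, so this does not address the "protect data, not code" point; it only makes the non-sleeping intent explicit.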

Thanks!



* Re: [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true
  2021-07-29 12:57 ` [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true Miaohe Lin
  2021-07-29 14:40   ` Shakeel Butt
  2021-07-30  2:37   ` Muchun Song
@ 2021-07-30  3:07   ` Roman Gushchin
  2021-07-30  6:51   ` Michal Hocko
  3 siblings, 0 replies; 45+ messages in thread
From: Roman Gushchin @ 2021-07-30  3:07 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: hannes, mhocko, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Thu, Jul 29, 2021 at 08:57:53PM +0800, Miaohe Lin wrote:
> Add 'else' to save some atomic ops in obj_stock_flush_required() when
> flush is already true. No functional change intended here.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Acked-by: Roman Gushchin <guro@fb.com>



* Re: [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init()
  2021-07-29 12:57 ` [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init() Miaohe Lin
  2021-07-29 13:52   ` Matthew Wilcox
@ 2021-07-30  3:12   ` Roman Gushchin
  2021-07-30  6:29     ` Miaohe Lin
  2021-07-30  6:44     ` Michal Hocko
  1 sibling, 2 replies; 45+ messages in thread
From: Roman Gushchin @ 2021-07-30  3:12 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: hannes, mhocko, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Thu, Jul 29, 2021 at 08:57:54PM +0800, Miaohe Lin wrote:
> rtpn might be NULL in very rare case. We have better to check it before
> dereferencing it. Since memcg can live with NULL rb_tree_per_node in
> soft_limit_tree, warn this case and continue.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/memcontrol.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5b4592d1e0f2..70a32174e7c4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7109,6 +7109,8 @@ static int __init mem_cgroup_init(void)
>  		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
>  				    node_online(node) ? node : NUMA_NO_NODE);
>  
> +		if (WARN_ON_ONCE(!rtpn))
> +			continue;

I also really doubt that it makes any sense to continue in this case.
If this allocation fails (at the very beginning of the system's life, it's an __init function),
something is terribly wrong and panicking on a NULL-pointer dereference sounds like
a perfect choice.

Is this a real-world problem? Am I missing something?



* Re: [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init()
  2021-07-30  3:12   ` Roman Gushchin
@ 2021-07-30  6:29     ` Miaohe Lin
  2021-07-30  6:44     ` Michal Hocko
  1 sibling, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-07-30  6:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: hannes, mhocko, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On 2021/7/30 11:12, Roman Gushchin wrote:
> On Thu, Jul 29, 2021 at 08:57:54PM +0800, Miaohe Lin wrote:
>> rtpn might be NULL in very rare case. We have better to check it before
>> dereferencing it. Since memcg can live with NULL rb_tree_per_node in
>> soft_limit_tree, warn this case and continue.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/memcontrol.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 5b4592d1e0f2..70a32174e7c4 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -7109,6 +7109,8 @@ static int __init mem_cgroup_init(void)
>>  		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
>>  				    node_online(node) ? node : NUMA_NO_NODE);
>>  
>> +		if (WARN_ON_ONCE(!rtpn))
>> +			continue;
> 
> I also really doubt that it makes any sense to continue in this case.
> If this allocations fails (at the very beginning of the system's life, it's an __init function),
> something is terribly wrong and panic'ing on a NULL-pointer dereference sounds like
> a perfect choice.
> 
> Is this a real world problem? Do I miss something?

No, this is a theoretical bug: a very rare case, but not impossible IMO.
Since we can live with a NULL rb_tree_per_node in soft_limit_tree, I think
simply continuing or breaking here without a panic is also acceptable. Or is
it more proper to panic here?

Thanks.

> .
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init()
  2021-07-30  3:12   ` Roman Gushchin
  2021-07-30  6:29     ` Miaohe Lin
@ 2021-07-30  6:44     ` Michal Hocko
  2021-07-31  2:05       ` Miaohe Lin
  1 sibling, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2021-07-30  6:44 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Miaohe Lin, hannes, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Thu 29-07-21 20:12:43, Roman Gushchin wrote:
> On Thu, Jul 29, 2021 at 08:57:54PM +0800, Miaohe Lin wrote:
> > rtpn might be NULL in very rare case. We have better to check it before
> > dereferencing it. Since memcg can live with NULL rb_tree_per_node in
> > soft_limit_tree, warn this case and continue.
> > 
> > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> > ---
> >  mm/memcontrol.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 5b4592d1e0f2..70a32174e7c4 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -7109,6 +7109,8 @@ static int __init mem_cgroup_init(void)
> >  		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
> >  				    node_online(node) ? node : NUMA_NO_NODE);
> >  
> > +		if (WARN_ON_ONCE(!rtpn))
> > +			continue;
> 
> I also really doubt that it makes any sense to continue in this case.
> If this allocations fails (at the very beginning of the system's life, it's an __init function),
> something is terribly wrong and panic'ing on a NULL-pointer dereference sounds like
> a perfect choice.

Moreover, this is a 24B allocation during early boot. The kernel will OOM and
panic if it cannot find any victim. I do not think we need to do any special
handling here.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/5] mm, memcg: remove unused functions
  2021-07-29 12:57 ` [PATCH 1/5] mm, memcg: remove unused functions Miaohe Lin
                     ` (2 preceding siblings ...)
  2021-07-30  2:57   ` Roman Gushchin
@ 2021-07-30  6:45   ` Michal Hocko
  3 siblings, 0 replies; 45+ messages in thread
From: Michal Hocko @ 2021-07-30  6:45 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: hannes, vdavydov.dev, akpm, shakeelb, guro, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Thu 29-07-21 20:57:51, Miaohe Lin wrote:
> Since commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat"), last user
> of memcg_stat_item_in_bytes() is gone. And since commit fa40d1ee9f15 ("mm:
> vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only the
> declaration of mem_cgroup_select_victim_node() is remained here. Remove
> them.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h | 12 ------------
>  1 file changed, 12 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7028d8e4a3d7..04437504444f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -606,13 +606,6 @@ static inline bool PageMemcgKmem(struct page *page)
>  	return folio_memcg_kmem(page_folio(page));
>  }
>  
> -static __always_inline bool memcg_stat_item_in_bytes(int idx)
> -{
> -	if (idx == MEMCG_PERCPU_B)
> -		return true;
> -	return vmstat_item_in_bytes(idx);
> -}
> -
>  static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>  {
>  	return (memcg == root_mem_cgroup);
> @@ -916,11 +909,6 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
>  	return !!(memcg->css.flags & CSS_ONLINE);
>  }
>  
> -/*
> - * For memory reclaim.
> - */
> -int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
> -
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
>  		int zid, int nr_pages);
>  
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-07-29 12:57 ` [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex Miaohe Lin
  2021-07-30  2:42   ` Muchun Song
  2021-07-30  3:06   ` Roman Gushchin
@ 2021-07-30  6:46   ` Michal Hocko
  2 siblings, 0 replies; 45+ messages in thread
From: Michal Hocko @ 2021-07-30  6:46 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: hannes, vdavydov.dev, akpm, shakeelb, guro, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Thu 29-07-21 20:57:52, Miaohe Lin wrote:
> Since percpu_charge_mutex is only used inside drain_all_stock(), we can
> narrow the scope of percpu_charge_mutex by moving it here.

Makes sense, and this is usually my preference as well. We used to have
another caller back then, so I couldn't.

> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6580c2381a3e..a03e24e57cd9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
>  #define FLUSHING_CACHED_CHARGE	0
>  };
>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> -static DEFINE_MUTEX(percpu_charge_mutex);
>  
>  #ifdef CONFIG_MEMCG_KMEM
>  static void drain_obj_stock(struct obj_stock *stock);
> @@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   */
>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
> +	static DEFINE_MUTEX(percpu_charge_mutex);
>  	int cpu, curcpu;
>  
>  	/* If someone's already draining, avoid adding running more workers. */
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-07-30  3:06   ` Roman Gushchin
@ 2021-07-30  6:50     ` Michal Hocko
  2021-07-31  2:29       ` Miaohe Lin
  2021-08-03 14:15       ` Johannes Weiner
  0 siblings, 2 replies; 45+ messages in thread
From: Michal Hocko @ 2021-07-30  6:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Miaohe Lin, hannes, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Thu 29-07-21 20:06:45, Roman Gushchin wrote:
> On Thu, Jul 29, 2021 at 08:57:52PM +0800, Miaohe Lin wrote:
> > Since percpu_charge_mutex is only used inside drain_all_stock(), we can
> > narrow the scope of percpu_charge_mutex by moving it here.
> > 
> > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> > ---
> >  mm/memcontrol.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 6580c2381a3e..a03e24e57cd9 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
> >  #define FLUSHING_CACHED_CHARGE	0
> >  };
> >  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> > -static DEFINE_MUTEX(percpu_charge_mutex);
> >  
> >  #ifdef CONFIG_MEMCG_KMEM
> >  static void drain_obj_stock(struct obj_stock *stock);
> > @@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> >   */
> >  static void drain_all_stock(struct mem_cgroup *root_memcg)
> >  {
> > +	static DEFINE_MUTEX(percpu_charge_mutex);
> >  	int cpu, curcpu;
> 
> It's considered a good practice to protect data instead of code paths. After
> the proposed change it becomes obvious that the opposite is done here: the mutex
> is used to prevent a simultaneous execution of the code of the drain_all_stock()
> function.

The purpose of the lock was indeed to orchestrate callers more than any
data structure consistency.
 
> Actually we don't need a mutex here: nobody ever sleeps on it. So I'd replace
> it with a simple atomic variable or even a single bitfield. Then the change will
> be better justified, IMO.

Yes, the mutex can be replaced by an atomic in a follow-up patch.
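
To make the "single bit" idea concrete, here is a minimal userspace sketch
using C11 atomic_flag. It is not the kernel code: drain_in_progress,
drain_gate_try_enter(), drain_gate_leave() and drain_all_stock_sketch() are
hypothetical names, and the kernel would presumably use its own atomic or
bitops helpers instead. The point is only that a caller who loses the race
returns immediately rather than sleeping, which is all mutex_trylock() was
being used for here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: set while one caller is draining. */
static atomic_flag drain_in_progress = ATOMIC_FLAG_INIT;

/* Number of drains actually performed, kept only for demonstration. */
static int drains_done;

/* Returns true if this caller won the gate and must later release it. */
static bool drain_gate_try_enter(void)
{
	/* test_and_set returns the previous value: false means we won. */
	return !atomic_flag_test_and_set_explicit(&drain_in_progress,
						   memory_order_acquire);
}

static void drain_gate_leave(void)
{
	atomic_flag_clear_explicit(&drain_in_progress, memory_order_release);
}

/* Stand-in for drain_all_stock(): late callers return instead of waiting. */
static bool drain_all_stock_sketch(void)
{
	if (!drain_gate_try_enter())
		return false;

	drains_done++;		/* the per-cpu drain work would go here */

	drain_gate_leave();
	return true;
}
```

Nobody ever blocks on the flag, so no sleeping lock is needed; the flag only
keeps a second caller from queueing redundant drain work.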
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true
  2021-07-29 12:57 ` [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true Miaohe Lin
                     ` (2 preceding siblings ...)
  2021-07-30  3:07   ` Roman Gushchin
@ 2021-07-30  6:51   ` Michal Hocko
  3 siblings, 0 replies; 45+ messages in thread
From: Michal Hocko @ 2021-07-30  6:51 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: hannes, vdavydov.dev, akpm, shakeelb, guro, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Thu 29-07-21 20:57:53, Miaohe Lin wrote:
> Add 'else' to save some atomic ops in obj_stock_flush_required() when
> flush is already true. No functional change intended here.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a03e24e57cd9..5b4592d1e0f2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2231,7 +2231,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  		if (memcg && stock->nr_pages &&
>  		    mem_cgroup_is_descendant(memcg, root_memcg))
>  			flush = true;
> -		if (obj_stock_flush_required(stock, root_memcg))
> +		else if (obj_stock_flush_required(stock, root_memcg))
>  			flush = true;
>  		rcu_read_unlock();
>  
> -- 
> 2.23.0
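
As a rough userspace illustration of why the 'else' changes cost but not
behavior: in the sketch below, obj_check() is a hypothetical stand-in for
obj_stock_flush_required(), and obj_checks merely counts how often it runs.
Both variants compute the same flush value, but the second one skips the
extra check whenever the page stock already requires a flush.

```c
#include <stdbool.h>

/* Counts calls to the second check, for demonstration only. */
static int obj_checks;

/* Hypothetical stand-in for obj_stock_flush_required(). */
static bool obj_check(bool obj_needs_flush)
{
	obj_checks++;
	return obj_needs_flush;
}

/* Before the patch: the second check always runs. */
static bool flush_needed_before(bool page_stock_hit, bool obj_needs_flush)
{
	bool flush = false;

	if (page_stock_hit)
		flush = true;
	if (obj_check(obj_needs_flush))
		flush = true;
	return flush;
}

/* After the patch: the second check runs only if flush is still false. */
static bool flush_needed_after(bool page_stock_hit, bool obj_needs_flush)
{
	bool flush = false;

	if (page_stock_hit)
		flush = true;
	else if (obj_check(obj_needs_flush))
		flush = true;
	return flush;
}
```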

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init()
  2021-07-30  6:44     ` Michal Hocko
@ 2021-07-31  2:05       ` Miaohe Lin
  2021-08-02  6:43         ` Michal Hocko
  0 siblings, 1 reply; 45+ messages in thread
From: Miaohe Lin @ 2021-07-31  2:05 UTC (permalink / raw)
  To: Michal Hocko, Roman Gushchin
  Cc: hannes, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On 2021/7/30 14:44, Michal Hocko wrote:
> On Thu 29-07-21 20:12:43, Roman Gushchin wrote:
>> On Thu, Jul 29, 2021 at 08:57:54PM +0800, Miaohe Lin wrote:
>>> rtpn might be NULL in very rare case. We have better to check it before
>>> dereferencing it. Since memcg can live with NULL rb_tree_per_node in
>>> soft_limit_tree, warn this case and continue.
>>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>> ---
>>>  mm/memcontrol.c | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 5b4592d1e0f2..70a32174e7c4 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -7109,6 +7109,8 @@ static int __init mem_cgroup_init(void)
>>>  		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
>>>  				    node_online(node) ? node : NUMA_NO_NODE);
>>>  
>>> +		if (WARN_ON_ONCE(!rtpn))
>>> +			continue;
>>
>> I also really doubt that it makes any sense to continue in this case.
>> If this allocations fails (at the very beginning of the system's life, it's an __init function),
>> something is terribly wrong and panic'ing on a NULL-pointer dereference sounds like
>> a perfect choice.
> 
> Moreover this is 24B allocation during early boot. Kernel will OOM and
> panic when not being able to find any victim. I do not think we need to

Agreed. But IMO it may not be a good idea to leave rtpn without a NULL check.
We should defend against it even though it could hardly happen. That said, I do
not insist on this check; I will drop this patch if you prefer.

Thanks both of you.

> do any special handling here.
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-07-30  6:50     ` Michal Hocko
@ 2021-07-31  2:29       ` Miaohe Lin
  2021-08-02  6:49         ` Michal Hocko
  2021-08-03  3:40         ` Roman Gushchin
  2021-08-03 14:15       ` Johannes Weiner
  1 sibling, 2 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-07-31  2:29 UTC (permalink / raw)
  To: Michal Hocko, Roman Gushchin
  Cc: hannes, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On 2021/7/30 14:50, Michal Hocko wrote:
> On Thu 29-07-21 20:06:45, Roman Gushchin wrote:
>> On Thu, Jul 29, 2021 at 08:57:52PM +0800, Miaohe Lin wrote:
>>> Since percpu_charge_mutex is only used inside drain_all_stock(), we can
>>> narrow the scope of percpu_charge_mutex by moving it here.
>>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>> ---
>>>  mm/memcontrol.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 6580c2381a3e..a03e24e57cd9 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
>>>  #define FLUSHING_CACHED_CHARGE	0
>>>  };
>>>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
>>> -static DEFINE_MUTEX(percpu_charge_mutex);
>>>  
>>>  #ifdef CONFIG_MEMCG_KMEM
>>>  static void drain_obj_stock(struct obj_stock *stock);
>>> @@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>>   */
>>>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>>>  {
>>> +	static DEFINE_MUTEX(percpu_charge_mutex);
>>>  	int cpu, curcpu;
>>
>> It's considered a good practice to protect data instead of code paths. After
>> the proposed change it becomes obvious that the opposite is done here: the mutex
>> is used to prevent a simultaneous execution of the code of the drain_all_stock()
>> function.
> 
> The purpose of the lock was indeed to orchestrate callers more than any
> data structure consistency.
>  
>> Actually we don't need a mutex here: nobody ever sleeps on it. So I'd replace
>> it with a simple atomic variable or even a single bitfield. Then the change will
>> be better justified, IMO.
> 
> Yes, mutex can be replaced by an atomic in a follow up patch.
> 

Thanks to both of you. It's a really good suggestion. Do you mean something like the below?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 616d1a72ece3..508a96e80980 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 static void drain_all_stock(struct mem_cgroup *root_memcg)
 {
-       static DEFINE_MUTEX(percpu_charge_mutex);
        int cpu, curcpu;
+       static atomic_t drain_all_stocks = ATOMIC_INIT(-1);

        /* If someone's already draining, avoid adding running more workers. */
-       if (!mutex_trylock(&percpu_charge_mutex))
+       if (!atomic_inc_not_zero(&drain_all_stocks))
                return;
        /*
         * Notify other cpus that system-wide "drain" is running
@@ -2244,7 +2244,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
                }
        }
        put_cpu();
-       mutex_unlock(&percpu_charge_mutex);
+       atomic_dec(&drain_all_stocks);
 }

 static int memcg_hotplug_cpu_dead(unsigned int cpu)
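
The -1 initial value in the diff above can look odd at first, so here is a
small userspace sketch of the same gate in C11. It is only an illustration
under assumptions: inc_not_zero() is a hand-written stand-in for the kernel's
atomic_inc_not_zero(), and drain_try_enter() / drain_leave() are hypothetical
helper names. The encoding is -1 for "idle" and 0 for "a drain is in flight".

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace stand-in for atomic_inc_not_zero(): increment the value
 * unless it is zero, and report whether the increment happened.
 */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	do {
		if (old == 0)
			return false;
	} while (!atomic_compare_exchange_weak(v, &old, old + 1));

	return true;
}

/* -1 means idle, 0 means one caller is currently draining. */
static atomic_int drain_all_stocks = -1;

/* The first caller moves -1 to 0 and proceeds; later callers see 0 and fail. */
static bool drain_try_enter(void)
{
	return inc_not_zero(&drain_all_stocks);
}

/* Moves 0 back to -1 so that the next caller can enter. */
static void drain_leave(void)
{
	atomic_fetch_sub(&drain_all_stocks, 1);
}
```

With this encoding, at most one caller is ever between drain_try_enter() and
drain_leave(), which matches what mutex_trylock() / mutex_unlock() provided.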


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init()
  2021-07-31  2:05       ` Miaohe Lin
@ 2021-08-02  6:43         ` Michal Hocko
  2021-08-02 10:00           ` Miaohe Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2021-08-02  6:43 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Roman Gushchin, hannes, vdavydov.dev, akpm, shakeelb, willy,
	alexs, richard.weiyang, songmuchun, linux-mm, linux-kernel,
	cgroups

On Sat 31-07-21 10:05:51, Miaohe Lin wrote:
> On 2021/7/30 14:44, Michal Hocko wrote:
> > On Thu 29-07-21 20:12:43, Roman Gushchin wrote:
> >> On Thu, Jul 29, 2021 at 08:57:54PM +0800, Miaohe Lin wrote:
> >>> rtpn might be NULL in very rare case. We have better to check it before
> >>> dereferencing it. Since memcg can live with NULL rb_tree_per_node in
> >>> soft_limit_tree, warn this case and continue.
> >>>
> >>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >>> ---
> >>>  mm/memcontrol.c | 2 ++
> >>>  1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index 5b4592d1e0f2..70a32174e7c4 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -7109,6 +7109,8 @@ static int __init mem_cgroup_init(void)
> >>>  		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
> >>>  				    node_online(node) ? node : NUMA_NO_NODE);
> >>>  
> >>> +		if (WARN_ON_ONCE(!rtpn))
> >>> +			continue;
> >>
> >> I also really doubt that it makes any sense to continue in this case.
> >> If this allocations fails (at the very beginning of the system's life, it's an __init function),
> >> something is terribly wrong and panic'ing on a NULL-pointer dereference sounds like
> >> a perfect choice.
> > 
> > Moreover this is 24B allocation during early boot. Kernel will OOM and
> > panic when not being able to find any victim. I do not think we need to
> 
> Agree with you. But IMO it may not be a good idea to leave the rtpn without NULL check. We should defend
> it though it could hardly happen. But I'm not insist on this check. I will drop this patch if you insist.

It is not that I would insist. I just do not see any point in the code
churn. This check is not going to ever trigger and there is nothing you
can do to recover anyway so crashing the kernel is likely the only
choice left.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-07-31  2:29       ` Miaohe Lin
@ 2021-08-02  6:49         ` Michal Hocko
  2021-08-02  9:54           ` Miaohe Lin
  2021-08-03  3:40         ` Roman Gushchin
  1 sibling, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2021-08-02  6:49 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Roman Gushchin, hannes, vdavydov.dev, akpm, shakeelb, willy,
	alexs, richard.weiyang, songmuchun, linux-mm, linux-kernel,
	cgroups

On Sat 31-07-21 10:29:52, Miaohe Lin wrote:
> On 2021/7/30 14:50, Michal Hocko wrote:
> > On Thu 29-07-21 20:06:45, Roman Gushchin wrote:
> >> On Thu, Jul 29, 2021 at 08:57:52PM +0800, Miaohe Lin wrote:
> >>> Since percpu_charge_mutex is only used inside drain_all_stock(), we can
> >>> narrow the scope of percpu_charge_mutex by moving it here.
> >>>
> >>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >>> ---
> >>>  mm/memcontrol.c | 2 +-
> >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index 6580c2381a3e..a03e24e57cd9 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
> >>>  #define FLUSHING_CACHED_CHARGE	0
> >>>  };
> >>>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> >>> -static DEFINE_MUTEX(percpu_charge_mutex);
> >>>  
> >>>  #ifdef CONFIG_MEMCG_KMEM
> >>>  static void drain_obj_stock(struct obj_stock *stock);
> >>> @@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> >>>   */
> >>>  static void drain_all_stock(struct mem_cgroup *root_memcg)
> >>>  {
> >>> +	static DEFINE_MUTEX(percpu_charge_mutex);
> >>>  	int cpu, curcpu;
> >>
> >> It's considered a good practice to protect data instead of code paths. After
> >> the proposed change it becomes obvious that the opposite is done here: the mutex
> >> is used to prevent a simultaneous execution of the code of the drain_all_stock()
> >> function.
> > 
> > The purpose of the lock was indeed to orchestrate callers more than any
> > data structure consistency.
> >  
> >> Actually we don't need a mutex here: nobody ever sleeps on it. So I'd replace
> >> it with a simple atomic variable or even a single bitfield. Then the change will
> >> be better justified, IMO.
> > 
> > Yes, mutex can be replaced by an atomic in a follow up patch.
> > 
> 
> Thanks for both of you. It's a really good suggestion. What do you mean is something like below?
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 616d1a72ece3..508a96e80980 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   */
>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
> -       static DEFINE_MUTEX(percpu_charge_mutex);
>         int cpu, curcpu;
> +       static atomic_t drain_all_stocks = ATOMIC_INIT(-1);
>         /* If someone's already draining, avoid adding running more workers. */
> -       if (!mutex_trylock(&percpu_charge_mutex))
> +       if (!atomic_inc_not_zero(&drain_all_stocks))
>                 return;
>         /*
>          * Notify other cpus that system-wide "drain" is running
> @@ -2244,7 +2244,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>                 }
>         }
>         put_cpu();
> -       mutex_unlock(&percpu_charge_mutex);
> +       atomic_dec(&drain_all_stocks);

Yes this would work. I would just s@drain_all_stocks@drainers@ or
something similar to better express the intention.

>  }
> 
>  static int memcg_hotplug_cpu_dead(unsigned int cpu)

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-08-02  6:49         ` Michal Hocko
@ 2021-08-02  9:54           ` Miaohe Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-08-02  9:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, vdavydov.dev, akpm, shakeelb, willy,
	alexs, richard.weiyang, songmuchun, linux-mm, linux-kernel,
	cgroups

On 2021/8/2 14:49, Michal Hocko wrote:
> On Sat 31-07-21 10:29:52, Miaohe Lin wrote:
>> On 2021/7/30 14:50, Michal Hocko wrote:
>>> On Thu 29-07-21 20:06:45, Roman Gushchin wrote:
>>>> On Thu, Jul 29, 2021 at 08:57:52PM +0800, Miaohe Lin wrote:
>>>>> Since percpu_charge_mutex is only used inside drain_all_stock(), we can
>>>>> narrow the scope of percpu_charge_mutex by moving it here.
>>>>>
>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>> ---
>>>>>  mm/memcontrol.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>> index 6580c2381a3e..a03e24e57cd9 100644
>>>>> --- a/mm/memcontrol.c
>>>>> +++ b/mm/memcontrol.c
>>>>> @@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
>>>>>  #define FLUSHING_CACHED_CHARGE	0
>>>>>  };
>>>>>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
>>>>> -static DEFINE_MUTEX(percpu_charge_mutex);
>>>>>  
>>>>>  #ifdef CONFIG_MEMCG_KMEM
>>>>>  static void drain_obj_stock(struct obj_stock *stock);
>>>>> @@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>>>>   */
>>>>>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>>>>>  {
>>>>> +	static DEFINE_MUTEX(percpu_charge_mutex);
>>>>>  	int cpu, curcpu;
>>>>
>>>> It's considered a good practice to protect data instead of code paths. After
>>>> the proposed change it becomes obvious that the opposite is done here: the mutex
>>>> is used to prevent a simultaneous execution of the code of the drain_all_stock()
>>>> function.
>>>
>>> The purpose of the lock was indeed to orchestrate callers more than any
>>> data structure consistency.
>>>  
>>>> Actually we don't need a mutex here: nobody ever sleeps on it. So I'd replace
>>>> it with a simple atomic variable or even a single bitfield. Then the change will
>>>> be better justified, IMO.
>>>
>>> Yes, mutex can be replaced by an atomic in a follow up patch.
>>>
>>
>> Thanks for both of you. It's a really good suggestion. What do you mean is something like below?
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 616d1a72ece3..508a96e80980 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>   */
>>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>>  {
>> -       static DEFINE_MUTEX(percpu_charge_mutex);
>>         int cpu, curcpu;
>> +       static atomic_t drain_all_stocks = ATOMIC_INIT(-1);
>>         /* If someone's already draining, avoid adding running more workers. */
>> -       if (!mutex_trylock(&percpu_charge_mutex))
>> +       if (!atomic_inc_not_zero(&drain_all_stocks))
>>                 return;
>>         /*
>>          * Notify other cpus that system-wide "drain" is running
>> @@ -2244,7 +2244,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>>                 }
>>         }
>>         put_cpu();
>> -       mutex_unlock(&percpu_charge_mutex);
>> +       atomic_dec(&drain_all_stocks);
> 
> Yes this would work. I would just s@drain_all_stocks@drainers@ or
> something similar to better express the intention.
> 

Sounds good. Will do it in v2. Many thanks.

>>  }
>>
>>  static int memcg_hotplug_cpu_dead(unsigned int cpu)
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init()
  2021-08-02  6:43         ` Michal Hocko
@ 2021-08-02 10:00           ` Miaohe Lin
  2021-08-02 10:42             ` Michal Hocko
  0 siblings, 1 reply; 45+ messages in thread
From: Miaohe Lin @ 2021-08-02 10:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, vdavydov.dev, akpm, shakeelb, willy,
	alexs, richard.weiyang, songmuchun, linux-mm, linux-kernel,
	cgroups

On 2021/8/2 14:43, Michal Hocko wrote:
> On Sat 31-07-21 10:05:51, Miaohe Lin wrote:
>> On 2021/7/30 14:44, Michal Hocko wrote:
>>> On Thu 29-07-21 20:12:43, Roman Gushchin wrote:
>>>> On Thu, Jul 29, 2021 at 08:57:54PM +0800, Miaohe Lin wrote:
>>>>> rtpn might be NULL in very rare case. We have better to check it before
>>>>> dereferencing it. Since memcg can live with NULL rb_tree_per_node in
>>>>> soft_limit_tree, warn this case and continue.
>>>>>
>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>> ---
>>>>>  mm/memcontrol.c | 2 ++
>>>>>  1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>> index 5b4592d1e0f2..70a32174e7c4 100644
>>>>> --- a/mm/memcontrol.c
>>>>> +++ b/mm/memcontrol.c
>>>>> @@ -7109,6 +7109,8 @@ static int __init mem_cgroup_init(void)
>>>>>  		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
>>>>>  				    node_online(node) ? node : NUMA_NO_NODE);
>>>>>  
>>>>> +		if (WARN_ON_ONCE(!rtpn))
>>>>> +			continue;
>>>>
>>>> I also really doubt that it makes any sense to continue in this case.
>>>> If this allocations fails (at the very beginning of the system's life, it's an __init function),
>>>> something is terribly wrong and panic'ing on a NULL-pointer dereference sounds like
>>>> a perfect choice.
>>>
>>> Moreover this is 24B allocation during early boot. Kernel will OOM and
>>> panic when not being able to find any victim. I do not think we need to
>>
>> Agree with you. But IMO it may not be a good idea to leave the rtpn without NULL check. We should defend
>> it though it could hardly happen. But I'm not insist on this check. I will drop this patch if you insist.
> 
> It is not that I would insist. I just do not see any point in the code
> churn. This check is not going to ever trigger and there is nothing you
> can do to recover anyway so crashing the kernel is likely the only
> choice left.
> 

I hope I get the point now. You mean that there is nothing we can do to recover, so panicking
on a NULL-pointer dereference is a perfect choice? Should we document that we leave rtpn
without a NULL check on purpose, like below?

Many thanks.

@@ -7109,8 +7109,12 @@ static int __init mem_cgroup_init(void)
                rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
                                    node_online(node) ? node : NUMA_NO_NODE);

-               if (WARN_ON_ONCE(!rtpn))
-                       continue;
+               /*
+                * If this allocation fails (at the very beginning of the
+                * system's life, it's an __init function), something is
+                * terribly wrong and panic'ing on a NULL-pointer
+                * dereference sounds like a perfect choice.
+                */
                rtpn->rb_root = RB_ROOT;
                rtpn->rb_rightmost = NULL;
                spin_lock_init(&rtpn->lock);



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init()
  2021-08-02 10:00           ` Miaohe Lin
@ 2021-08-02 10:42             ` Michal Hocko
  2021-08-02 11:18               ` Miaohe Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2021-08-02 10:42 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Roman Gushchin, hannes, vdavydov.dev, akpm, shakeelb, willy,
	alexs, richard.weiyang, songmuchun, linux-mm, linux-kernel,
	cgroups

On Mon 02-08-21 18:00:10, Miaohe Lin wrote:
> On 2021/8/2 14:43, Michal Hocko wrote:
> > On Sat 31-07-21 10:05:51, Miaohe Lin wrote:
> >> On 2021/7/30 14:44, Michal Hocko wrote:
> >>> On Thu 29-07-21 20:12:43, Roman Gushchin wrote:
> >>>> On Thu, Jul 29, 2021 at 08:57:54PM +0800, Miaohe Lin wrote:
> >>>>> rtpn might be NULL in very rare case. We have better to check it before
> >>>>> dereferencing it. Since memcg can live with NULL rb_tree_per_node in
> >>>>> soft_limit_tree, warn this case and continue.
> >>>>>
> >>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >>>>> ---
> >>>>>  mm/memcontrol.c | 2 ++
> >>>>>  1 file changed, 2 insertions(+)
> >>>>>
> >>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>>> index 5b4592d1e0f2..70a32174e7c4 100644
> >>>>> --- a/mm/memcontrol.c
> >>>>> +++ b/mm/memcontrol.c
> >>>>> @@ -7109,6 +7109,8 @@ static int __init mem_cgroup_init(void)
> >>>>>  		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
> >>>>>  				    node_online(node) ? node : NUMA_NO_NODE);
> >>>>>  
> >>>>> +		if (WARN_ON_ONCE(!rtpn))
> >>>>> +			continue;
> >>>>
> >>>> I also really doubt that it makes any sense to continue in this case.
> >>>> If this allocations fails (at the very beginning of the system's life, it's an __init function),
> >>>> something is terribly wrong and panic'ing on a NULL-pointer dereference sounds like
> >>>> a perfect choice.
> >>>
> >>> Moreover this is 24B allocation during early boot. Kernel will OOM and
> >>> panic when not being able to find any victim. I do not think we need to
> >>
> >> Agree with you. But IMO it may not be a good idea to leave the rtpn without NULL check. We should defend
> >> it though it could hardly happen. But I'm not insist on this check. I will drop this patch if you insist.
> > 
> > It is not that I would insist. I just do not see any point in the code
> > churn. This check is not going to ever trigger and there is nothing you
> > can do to recover anyway so crashing the kernel is likely the only
> > choice left.
> > 
> 
> I hope I get the point now. What you mean is nothing we can do to recover and panic'ing on a
> NULL-pointer dereference is a perfect choice ? Should we declare that we leave the rtpn without
> NULL check on purpose like below ?
> 
> Many thanks.
> 
> @@ -7109,8 +7109,12 @@ static int __init mem_cgroup_init(void)
>                 rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
>                                     node_online(node) ? node : NUMA_NO_NODE);
> 
> -               if (WARN_ON_ONCE(!rtpn))
> -                       continue;
> +               /*
> +                * If this allocation fails (at the very beginning of the
> +                * system's life, it's an __init function), something is
> +                * terribly wrong and panic'ing on a NULL-pointer
> +                * dereference sounds like a perfect choice.
> +                */

I am not sure this is really worth it. We do not want to have similar
comments all over the early init code, do we?

>                 rtpn->rb_root = RB_ROOT;
>                 rtpn->rb_rightmost = NULL;
>                 spin_lock_init(&rtpn->lock);

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init()
  2021-08-02 10:42             ` Michal Hocko
@ 2021-08-02 11:18               ` Miaohe Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-08-02 11:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, vdavydov.dev, akpm, shakeelb, willy,
	alexs, richard.weiyang, songmuchun, linux-mm, linux-kernel,
	cgroups

On 2021/8/2 18:42, Michal Hocko wrote:
> On Mon 02-08-21 18:00:10, Miaohe Lin wrote:
>> On 2021/8/2 14:43, Michal Hocko wrote:
>>> On Sat 31-07-21 10:05:51, Miaohe Lin wrote:
>>>> On 2021/7/30 14:44, Michal Hocko wrote:
>>>>> On Thu 29-07-21 20:12:43, Roman Gushchin wrote:
>>>>>> On Thu, Jul 29, 2021 at 08:57:54PM +0800, Miaohe Lin wrote:
>>>>>>> rtpn might be NULL in very rare case. We have better to check it before
>>>>>>> dereferencing it. Since memcg can live with NULL rb_tree_per_node in
>>>>>>> soft_limit_tree, warn this case and continue.
>>>>>>>
>>>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>>>> ---
>>>>>>>  mm/memcontrol.c | 2 ++
>>>>>>>  1 file changed, 2 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>>>> index 5b4592d1e0f2..70a32174e7c4 100644
>>>>>>> --- a/mm/memcontrol.c
>>>>>>> +++ b/mm/memcontrol.c
>>>>>>> @@ -7109,6 +7109,8 @@ static int __init mem_cgroup_init(void)
>>>>>>>  		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
>>>>>>>  				    node_online(node) ? node : NUMA_NO_NODE);
>>>>>>>  
>>>>>>> +		if (WARN_ON_ONCE(!rtpn))
>>>>>>> +			continue;
>>>>>>
>>>>>> I also really doubt that it makes any sense to continue in this case.
>>>>>> If this allocations fails (at the very beginning of the system's life, it's an __init function),
>>>>>> something is terribly wrong and panic'ing on a NULL-pointer dereference sounds like
>>>>>> a perfect choice.
>>>>>
>>>>> Moreover this is 24B allocation during early boot. Kernel will OOM and
>>>>> panic when not being able to find any victim. I do not think we need to
>>>>
>>>> Agree with you. But IMO it may not be a good idea to leave the rtpn without NULL check. We should defend
>>>> it though it could hardly happen. But I'm not insist on this check. I will drop this patch if you insist.
>>>
>>> It is not that I would insist. I just do not see any point in the code
>>> churn. This check is not going to ever trigger and there is nothing you
>>> can do to recover anyway so crashing the kernel is likely the only
>>> choice left.
>>>
>>
>> I hope I get the point now. What you mean is nothing we can do to recover and panic'ing on a
>> NULL-pointer dereference is a perfect choice ? Should we declare that we leave the rtpn without
>> NULL check on purpose like below ?
>>
>> Many thanks.
>>
>> @@ -7109,8 +7109,12 @@ static int __init mem_cgroup_init(void)
>>                 rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL,
>>                                     node_online(node) ? node : NUMA_NO_NODE);
>>
>> -               if (WARN_ON_ONCE(!rtpn))
>> -                       continue;
>> +               /*
>> +                * If this allocation fails (at the very beginning of the
>> +                * system's life, it's an __init function), something is
>> +                * terribly wrong and panic'ing on a NULL-pointer
>> +                * dereference sounds like a perfect choice.
>> +                */
> 
> I am not really sure this is really worth it. Really we do not really
> want to have similar comments all over the early init code, do we?

Maybe not. Will drop this patch.

Thanks.

> 
>>                 rtpn->rb_root = RB_ROOT;
>>                 rtpn->rb_rightmost = NULL;
>>                 spin_lock_init(&rtpn->lock);
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-07-31  2:29       ` Miaohe Lin
  2021-08-02  6:49         ` Michal Hocko
@ 2021-08-03  3:40         ` Roman Gushchin
  2021-08-03  6:29           ` Miaohe Lin
  1 sibling, 1 reply; 45+ messages in thread
From: Roman Gushchin @ 2021-08-03  3:40 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Michal Hocko, hannes, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Sat, Jul 31, 2021 at 10:29:52AM +0800, Miaohe Lin wrote:
> On 2021/7/30 14:50, Michal Hocko wrote:
> > On Thu 29-07-21 20:06:45, Roman Gushchin wrote:
> >> On Thu, Jul 29, 2021 at 08:57:52PM +0800, Miaohe Lin wrote:
> >>> Since percpu_charge_mutex is only used inside drain_all_stock(), we can
> >>> narrow the scope of percpu_charge_mutex by moving it here.
> >>>
> >>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >>> ---
> >>>  mm/memcontrol.c | 2 +-
> >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index 6580c2381a3e..a03e24e57cd9 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
> >>>  #define FLUSHING_CACHED_CHARGE	0
> >>>  };
> >>>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> >>> -static DEFINE_MUTEX(percpu_charge_mutex);
> >>>  
> >>>  #ifdef CONFIG_MEMCG_KMEM
> >>>  static void drain_obj_stock(struct obj_stock *stock);
> >>> @@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> >>>   */
> >>>  static void drain_all_stock(struct mem_cgroup *root_memcg)
> >>>  {
> >>> +	static DEFINE_MUTEX(percpu_charge_mutex);
> >>>  	int cpu, curcpu;
> >>
> >> It's considered a good practice to protect data instead of code paths. After
> >> the proposed change it becomes obvious that the opposite is done here: the mutex
> >> is used to prevent a simultaneous execution of the code of the drain_all_stock()
> >> function.
> > 
> > The purpose of the lock was indeed to orchestrate callers more than any
> > data structure consistency.
> >  
> >> Actually we don't need a mutex here: nobody ever sleeps on it. So I'd replace
> >> it with a simple atomic variable or even a single bitfield. Then the change will
> >> be better justified, IMO.
> > 
> > Yes, mutex can be replaced by an atomic in a follow up patch.
> > 
> 
> Thanks for both of you. It's a really good suggestion. What do you mean is something like below?
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 616d1a72ece3..508a96e80980 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   */
>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
> -       static DEFINE_MUTEX(percpu_charge_mutex);
>         int cpu, curcpu;
> +       static atomic_t drain_all_stocks = ATOMIC_INIT(-1);
> 
>         /* If someone's already draining, avoid adding running more workers. */
> -       if (!mutex_trylock(&percpu_charge_mutex))
> +       if (!atomic_inc_not_zero(&drain_all_stocks))
>                 return;

It should work, but why not a simple atomic_cmpxchg(&drain_all_stocks, 0, 1) and
initialize it to 0? Maybe it's just my preference, but IMO (0, 1) is easier
to understand than (-1, 0) here. Not a strong opinion though, up to you.
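
To make the comparison concrete, here is a rough userspace sketch of the
two encodings using C11 atomics. It is only a model, not kernel code:
inc_not_zero() is a hand-written stand-in for atomic_inc_not_zero(), all
the helper names are made up, and the end_inc() release side is an
assumption, since the quoted hunk does not show how the (-1, 0) variant
would drop the flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of atomic_inc_not_zero(): add 1 unless the value is 0. */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* (-1, 0) encoding: -1 means idle, 0 means a drain is running. */
static bool try_begin_inc(atomic_int *v)
{
	return inc_not_zero(v);
}

/* Assumed release side for the (-1, 0) encoding: go back from 0 to -1. */
static void end_inc(atomic_int *v)
{
	int prev = atomic_fetch_sub(v, 1);

	assert(prev == 0);
}

/* (0, 1) encoding: 0 means idle, 1 means a drain is running. */
static bool try_begin_cmpxchg(atomic_int *v)
{
	int idle = 0;

	return atomic_compare_exchange_strong(v, &idle, 1);
}
```

Both variants let exactly one caller through; the (0, 1) form just reads
as "idle/busy" without having to remember that the idle value is -1.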

Thanks!


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-08-03  3:40         ` Roman Gushchin
@ 2021-08-03  6:29           ` Miaohe Lin
  2021-08-03  7:11             ` Michal Hocko
  2021-08-03  9:33             ` Muchun Song
  0 siblings, 2 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-08-03  6:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Hocko, hannes, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On 2021/8/3 11:40, Roman Gushchin wrote:
> On Sat, Jul 31, 2021 at 10:29:52AM +0800, Miaohe Lin wrote:
>> On 2021/7/30 14:50, Michal Hocko wrote:
>>> On Thu 29-07-21 20:06:45, Roman Gushchin wrote:
>>>> On Thu, Jul 29, 2021 at 08:57:52PM +0800, Miaohe Lin wrote:
>>>>> Since percpu_charge_mutex is only used inside drain_all_stock(), we can
>>>>> narrow the scope of percpu_charge_mutex by moving it here.
>>>>>
>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>> ---
>>>>>  mm/memcontrol.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>> index 6580c2381a3e..a03e24e57cd9 100644
>>>>> --- a/mm/memcontrol.c
>>>>> +++ b/mm/memcontrol.c
>>>>> @@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
>>>>>  #define FLUSHING_CACHED_CHARGE	0
>>>>>  };
>>>>>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
>>>>> -static DEFINE_MUTEX(percpu_charge_mutex);
>>>>>  
>>>>>  #ifdef CONFIG_MEMCG_KMEM
>>>>>  static void drain_obj_stock(struct obj_stock *stock);
>>>>> @@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>>>>   */
>>>>>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>>>>>  {
>>>>> +	static DEFINE_MUTEX(percpu_charge_mutex);
>>>>>  	int cpu, curcpu;
>>>>
>>>> It's considered a good practice to protect data instead of code paths. After
>>>> the proposed change it becomes obvious that the opposite is done here: the mutex
>>>> is used to prevent a simultaneous execution of the code of the drain_all_stock()
>>>> function.
>>>
>>> The purpose of the lock was indeed to orchestrate callers more than any
>>> data structure consistency.
>>>  
>>>> Actually we don't need a mutex here: nobody ever sleeps on it. So I'd replace
>>>> it with a simple atomic variable or even a single bitfield. Then the change will
>>>> be better justified, IMO.
>>>
>>> Yes, mutex can be replaced by an atomic in a follow up patch.
>>>
>>
>> Thanks for both of you. It's a really good suggestion. What do you mean is something like below?
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 616d1a72ece3..508a96e80980 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>   */
>>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>>  {
>> -       static DEFINE_MUTEX(percpu_charge_mutex);
>>         int cpu, curcpu;
>> +       static atomic_t drain_all_stocks = ATOMIC_INIT(-1);
>>
>>         /* If someone's already draining, avoid adding running more workers. */
>> -       if (!mutex_trylock(&percpu_charge_mutex))
>> +       if (!atomic_inc_not_zero(&drain_all_stocks))
>>                 return;
> 
> It should work, but why not a simple atomic_cmpxchg(&drain_all_stocks, 0, 1) and
> initialize it to 0? Maybe it's just my preference, but IMO (0, 1) is easier
> to understand than (-1, 0) here. Not a strong opinion though, up to you.
> 

I think this would improve readability. Do you mean something like the below?

Many thanks.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 616d1a72ece3..6210b1124929 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 static void drain_all_stock(struct mem_cgroup *root_memcg)
 {
-       static DEFINE_MUTEX(percpu_charge_mutex);
        int cpu, curcpu;
+       static atomic_t drainer = ATOMIC_INIT(0);

        /* If someone's already draining, avoid adding running more workers. */
-       if (!mutex_trylock(&percpu_charge_mutex))
+       if (atomic_cmpxchg(&drainer, 0, 1) != 0)
                return;
        /*
         * Notify other cpus that system-wide "drain" is running
@@ -2244,7 +2244,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
                }
        }
        put_cpu();
-       mutex_unlock(&percpu_charge_mutex);
+       atomic_set(&drainer, 0);
 }

> Thanks!
> .
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-08-03  6:29           ` Miaohe Lin
@ 2021-08-03  7:11             ` Michal Hocko
  2021-08-03  7:13               ` Roman Gushchin
  2021-08-03  9:33             ` Muchun Song
  1 sibling, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2021-08-03  7:11 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Roman Gushchin, hannes, vdavydov.dev, akpm, shakeelb, willy,
	alexs, richard.weiyang, songmuchun, linux-mm, linux-kernel,
	cgroups

On Tue 03-08-21 14:29:13, Miaohe Lin wrote:
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 616d1a72ece3..6210b1124929 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   */
>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
> -       static DEFINE_MUTEX(percpu_charge_mutex);
>         int cpu, curcpu;
> +       static atomic_t drainer = ATOMIC_INIT(0);
> 
>         /* If someone's already draining, avoid adding running more workers. */
> -       if (!mutex_trylock(&percpu_charge_mutex))
> +       if (atomic_cmpxchg(&drainer, 0, 1) != 0)
>                 return;
>         /*
>          * Notify other cpus that system-wide "drain" is running
> @@ -2244,7 +2244,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>                 }
>         }
>         put_cpu();
> -       mutex_unlock(&percpu_charge_mutex);
> +       atomic_set(&drainer, 0);

atomic_set() doesn't imply a memory barrier IIRC. Is this safe?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-08-03  7:11             ` Michal Hocko
@ 2021-08-03  7:13               ` Roman Gushchin
  2021-08-03  7:27                 ` Michal Hocko
  0 siblings, 1 reply; 45+ messages in thread
From: Roman Gushchin @ 2021-08-03  7:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Miaohe Lin, hannes, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

I’d go with atomic_dec().

> On Aug 3, 2021, at 00:11, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Tue 03-08-21 14:29:13, Miaohe Lin wrote:
> [...]
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 616d1a72ece3..6210b1124929 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>  */
>> static void drain_all_stock(struct mem_cgroup *root_memcg)
>> {
>> -       static DEFINE_MUTEX(percpu_charge_mutex);
>>        int cpu, curcpu;
>> +       static atomic_t drainer = ATOMIC_INIT(0);
>> 
>>        /* If someone's already draining, avoid adding running more workers. */
>> -       if (!mutex_trylock(&percpu_charge_mutex))
>> +       if (atomic_cmpxchg(&drainer, 0, 1) != 0)
>>                return;
>>        /*
>>         * Notify other cpus that system-wide "drain" is running
>> @@ -2244,7 +2244,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>>                }
>>        }
>>        put_cpu();
>> -       mutex_unlock(&percpu_charge_mutex);
>> +       atomic_set(&drainer, 0);
> 
> atomic_set doesn't imply memory barrier IIRC. Is this safe?
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-08-03  7:13               ` Roman Gushchin
@ 2021-08-03  7:27                 ` Michal Hocko
  0 siblings, 0 replies; 45+ messages in thread
From: Michal Hocko @ 2021-08-03  7:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Miaohe Lin, hannes, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On Tue 03-08-21 07:13:35, Roman Gushchin wrote:
> I’d go with atomic_dec().

which does not imply a memory barrier either. You would need
atomic_dec_return() or some other explicit barrier IIRC.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-08-03  6:29           ` Miaohe Lin
  2021-08-03  7:11             ` Michal Hocko
@ 2021-08-03  9:33             ` Muchun Song
  2021-08-03 10:50               ` Miaohe Lin
  1 sibling, 1 reply; 45+ messages in thread
From: Muchun Song @ 2021-08-03  9:33 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Roman Gushchin, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Andrew Morton, Shakeel Butt, Matthew Wilcox, Alex Shi, Wei Yang,
	Linux Memory Management List, LKML, Cgroups

On Tue, Aug 3, 2021 at 2:29 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2021/8/3 11:40, Roman Gushchin wrote:
> > On Sat, Jul 31, 2021 at 10:29:52AM +0800, Miaohe Lin wrote:
> >> On 2021/7/30 14:50, Michal Hocko wrote:
> >>> On Thu 29-07-21 20:06:45, Roman Gushchin wrote:
> >>>> On Thu, Jul 29, 2021 at 08:57:52PM +0800, Miaohe Lin wrote:
> >>>>> Since percpu_charge_mutex is only used inside drain_all_stock(), we can
> >>>>> narrow the scope of percpu_charge_mutex by moving it here.
> >>>>>
> >>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >>>>> ---
> >>>>>  mm/memcontrol.c | 2 +-
> >>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>>> index 6580c2381a3e..a03e24e57cd9 100644
> >>>>> --- a/mm/memcontrol.c
> >>>>> +++ b/mm/memcontrol.c
> >>>>> @@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
> >>>>>  #define FLUSHING_CACHED_CHARGE   0
> >>>>>  };
> >>>>>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> >>>>> -static DEFINE_MUTEX(percpu_charge_mutex);
> >>>>>
> >>>>>  #ifdef CONFIG_MEMCG_KMEM
> >>>>>  static void drain_obj_stock(struct obj_stock *stock);
> >>>>> @@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> >>>>>   */
> >>>>>  static void drain_all_stock(struct mem_cgroup *root_memcg)
> >>>>>  {
> >>>>> + static DEFINE_MUTEX(percpu_charge_mutex);
> >>>>>   int cpu, curcpu;
> >>>>
> >>>> It's considered a good practice to protect data instead of code paths. After
> >>>> the proposed change it becomes obvious that the opposite is done here: the mutex
> >>>> is used to prevent a simultaneous execution of the code of the drain_all_stock()
> >>>> function.
> >>>
> >>> The purpose of the lock was indeed to orchestrate callers more than any
> >>> data structure consistency.
> >>>
> >>>> Actually we don't need a mutex here: nobody ever sleeps on it. So I'd replace
> >>>> it with a simple atomic variable or even a single bitfield. Then the change will
> >>>> be better justified, IMO.
> >>>
> >>> Yes, mutex can be replaced by an atomic in a follow up patch.
> >>>
> >>
> >> Thanks for both of you. It's a really good suggestion. What do you mean is something like below?
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 616d1a72ece3..508a96e80980 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> >>   */
> >>  static void drain_all_stock(struct mem_cgroup *root_memcg)
> >>  {
> >> -       static DEFINE_MUTEX(percpu_charge_mutex);
> >>         int cpu, curcpu;
> >> +       static atomic_t drain_all_stocks = ATOMIC_INIT(-1);
> >>
> >>         /* If someone's already draining, avoid adding running more workers. */
> >> -       if (!mutex_trylock(&percpu_charge_mutex))
> >> +       if (!atomic_inc_not_zero(&drain_all_stocks))
> >>                 return;
> >
> > It should work, but why not a simple atomic_cmpxchg(&drain_all_stocks, 0, 1) and
> > initialize it to 0? Maybe it's just my preference, but IMO (0, 1) is easier
> > to understand than (-1, 0) here. Not a strong opinion though, up to you.
> >
>
> I think this would improve the readability. What you mean is something like below ?
>
> Many thanks.
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 616d1a72ece3..6210b1124929 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   */
>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
> -       static DEFINE_MUTEX(percpu_charge_mutex);
>         int cpu, curcpu;
> +       static atomic_t drainer = ATOMIC_INIT(0);
>
>         /* If someone's already draining, avoid adding running more workers. */
> -       if (!mutex_trylock(&percpu_charge_mutex))
> +       if (atomic_cmpxchg(&drainer, 0, 1) != 0)

I'd like to use atomic_cmpxchg_acquire() here.

>                 return;
>         /*
>          * Notify other cpus that system-wide "drain" is running
> @@ -2244,7 +2244,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>                 }
>         }
>         put_cpu();
> -       mutex_unlock(&percpu_charge_mutex);
> +       atomic_set(&drainer, 0);

So use atomic_set_release() here to pair with
atomic_cmpxchg_acquire().
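
For illustration, the acquire/release pairing could look roughly like
this in a userspace model built on C11 atomics. This is only a sketch:
drain_try_begin() and drain_end() are hypothetical names, and the C11
calls stand in for atomic_cmpxchg_acquire() and atomic_set_release()
rather than being the kernel primitives themselves.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the "drainer" flag in the diff. */
static atomic_int drainer = 0;

/* Try to become the only drainer; returns false if one is already running. */
static bool drain_try_begin(void)
{
	int expected = 0;

	/*
	 * Acquire on success: accesses after this point cannot be
	 * reordered before the flag is taken. Relaxed on failure,
	 * because the caller simply bails out.
	 */
	return atomic_compare_exchange_strong_explicit(&drainer, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Drop the flag; the release store pairs with the acquire above. */
static void drain_end(void)
{
	/* Only the current drainer may call this. */
	assert(atomic_load_explicit(&drainer, memory_order_relaxed) == 1);
	atomic_store_explicit(&drainer, 0, memory_order_release);
}
```

The release store makes everything the drainer did visible to whoever
wins the flag next, which a plain relaxed store would not guarantee.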

Thanks.

>  }
>
> > Thanks!
> > .
> >
>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-08-03  9:33             ` Muchun Song
@ 2021-08-03 10:50               ` Miaohe Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-08-03 10:50 UTC (permalink / raw)
  To: Muchun Song, Michal Hocko, Roman Gushchin
  Cc: Roman Gushchin, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Andrew Morton, Shakeel Butt, Matthew Wilcox, Alex Shi, Wei Yang,
	Linux Memory Management List, LKML, Cgroups

On 2021/8/3 17:33, Muchun Song wrote:
> On Tue, Aug 3, 2021 at 2:29 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2021/8/3 11:40, Roman Gushchin wrote:
>>> On Sat, Jul 31, 2021 at 10:29:52AM +0800, Miaohe Lin wrote:
>>>> On 2021/7/30 14:50, Michal Hocko wrote:
>>>>> On Thu 29-07-21 20:06:45, Roman Gushchin wrote:
>>>>>> On Thu, Jul 29, 2021 at 08:57:52PM +0800, Miaohe Lin wrote:
>>>>>>> Since percpu_charge_mutex is only used inside drain_all_stock(), we can
>>>>>>> narrow the scope of percpu_charge_mutex by moving it here.
>>>>>>>
>>>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>>>> ---
>>>>>>>  mm/memcontrol.c | 2 +-
>>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>>>> index 6580c2381a3e..a03e24e57cd9 100644
>>>>>>> --- a/mm/memcontrol.c
>>>>>>> +++ b/mm/memcontrol.c
>>>>>>> @@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
>>>>>>>  #define FLUSHING_CACHED_CHARGE   0
>>>>>>>  };
>>>>>>>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
>>>>>>> -static DEFINE_MUTEX(percpu_charge_mutex);
>>>>>>>
>>>>>>>  #ifdef CONFIG_MEMCG_KMEM
>>>>>>>  static void drain_obj_stock(struct obj_stock *stock);
>>>>>>> @@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>>>>>>   */
>>>>>>>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>>>>>>>  {
>>>>>>> + static DEFINE_MUTEX(percpu_charge_mutex);
>>>>>>>   int cpu, curcpu;
>>>>>>
>>>>>> It's considered a good practice to protect data instead of code paths. After
>>>>>> the proposed change it becomes obvious that the opposite is done here: the mutex
>>>>>> is used to prevent a simultaneous execution of the code of the drain_all_stock()
>>>>>> function.
>>>>>
>>>>> The purpose of the lock was indeed to orchestrate callers more than any
>>>>> data structure consistency.
>>>>>
>>>>>> Actually we don't need a mutex here: nobody ever sleeps on it. So I'd replace
>>>>>> it with a simple atomic variable or even a single bitfield. Then the change will
>>>>>> be better justified, IMO.
>>>>>
>>>>> Yes, mutex can be replaced by an atomic in a follow up patch.
>>>>>
>>>>
>>>> Thanks for both of you. It's a really good suggestion. What do you mean is something like below?
>>>>
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index 616d1a72ece3..508a96e80980 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>>>   */
>>>>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>>>>  {
>>>> -       static DEFINE_MUTEX(percpu_charge_mutex);
>>>>         int cpu, curcpu;
>>>> +       static atomic_t drain_all_stocks = ATOMIC_INIT(-1);
>>>>
>>>>         /* If someone's already draining, avoid adding running more workers. */
>>>> -       if (!mutex_trylock(&percpu_charge_mutex))
>>>> +       if (!atomic_inc_not_zero(&drain_all_stocks))
>>>>                 return;
>>>
>>> It should work, but why not a simple atomic_cmpxchg(&drain_all_stocks, 0, 1) and
>>> initialize it to 0? Maybe it's just my preference, but IMO (0, 1) is easier
>>> to understand than (-1, 0) here. Not a strong opinion though, up to you.
>>>
>>
>> I think this would improve the readability. What you mean is something like below ?
>>
>> Many thanks.
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 616d1a72ece3..6210b1124929 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2208,11 +2208,11 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>   */
>>  static void drain_all_stock(struct mem_cgroup *root_memcg)
>>  {
>> -       static DEFINE_MUTEX(percpu_charge_mutex);
>>         int cpu, curcpu;
>> +       static atomic_t drainer = ATOMIC_INIT(0);
>>
>>         /* If someone's already draining, avoid adding running more workers. */
>> -       if (!mutex_trylock(&percpu_charge_mutex))
>> +       if (atomic_cmpxchg(&drainer, 0, 1) != 0)
> 
> I'd like to use atomic_cmpxchg_acquire() here.
> 
>>                 return;
>>         /*
>>          * Notify other cpus that system-wide "drain" is running
>> @@ -2244,7 +2244,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>>                 }
>>         }
>>         put_cpu();
>> -       mutex_unlock(&percpu_charge_mutex);
>> +       atomic_set(&drainer, 0);
> 
> So use atomic_set_release() here to cooperate with
> atomic_cmpxchg_acquire().

I think this will work well. Many thanks!

> 
> Thanks.
> 
>>  }
>>
>>> Thanks!
>>> .
>>>
>>
> .
> 



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-07-30  6:50     ` Michal Hocko
  2021-07-31  2:29       ` Miaohe Lin
@ 2021-08-03 14:15       ` Johannes Weiner
  2021-08-04  8:20         ` Michal Hocko
  1 sibling, 1 reply; 45+ messages in thread
From: Johannes Weiner @ 2021-08-03 14:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Miaohe Lin, vdavydov.dev, akpm, shakeelb, willy,
	alexs, richard.weiyang, songmuchun, linux-mm, linux-kernel,
	cgroups

On Fri, Jul 30, 2021 at 08:50:02AM +0200, Michal Hocko wrote:
> On Thu 29-07-21 20:06:45, Roman Gushchin wrote:
> > On Thu, Jul 29, 2021 at 08:57:52PM +0800, Miaohe Lin wrote:
> > > Since percpu_charge_mutex is only used inside drain_all_stock(), we can
> > > narrow the scope of percpu_charge_mutex by moving it here.
> > > 
> > > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> > > ---
> > >  mm/memcontrol.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 6580c2381a3e..a03e24e57cd9 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
> > >  #define FLUSHING_CACHED_CHARGE	0
> > >  };
> > >  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> > > -static DEFINE_MUTEX(percpu_charge_mutex);
> > >  
> > >  #ifdef CONFIG_MEMCG_KMEM
> > >  static void drain_obj_stock(struct obj_stock *stock);
> > > @@ -2209,6 +2208,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> > >   */
> > >  static void drain_all_stock(struct mem_cgroup *root_memcg)
> > >  {
> > > +	static DEFINE_MUTEX(percpu_charge_mutex);
> > >  	int cpu, curcpu;
> > 
> > It's considered a good practice to protect data instead of code paths. After
> > the proposed change it becomes obvious that the opposite is done here: the mutex
> > is used to prevent a simultaneous execution of the code of the drain_all_stock()
> > function.
> 
> The purpose of the lock was indeed to orchestrate callers more than any
> data structure consistency.

It doesn't seem like we need the lock at all.

The comment says it's so we don't spawn more workers when flushing is
already underway. But a work cannot be queued more than once - if it
were just about that, we'd needlessly duplicate the
test_and_set_bit(WORK_STRUCT_PENDING_BIT) in queue_work_on().
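
As a rough userspace model of that dedup behaviour (not the real
workqueue code; struct work_model and both helpers are invented for
illustration, and clearing the flag right before the callback runs is my
recollection of the workqueue core rather than something shown in this
thread):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of a work item with a "pending" bit. */
struct work_model {
	atomic_bool pending;	/* set while the item sits in the queue */
	int queued;		/* how many queue attempts succeeded */
	int runs;		/* how many times the callback ran */
};

/* Model of queueing: succeeds only if the item is not already pending. */
static bool queue_work_model(struct work_model *w)
{
	/* atomic_exchange() returns the old value, like test_and_set_bit(). */
	if (atomic_exchange(&w->pending, true))
		return false;
	w->queued++;
	return true;
}

/* Model of the worker: clear pending, then run the callback. */
static void run_work_model(struct work_model *w)
{
	bool was_pending = atomic_exchange(&w->pending, false);

	assert(was_pending);
	w->runs++;
}
```

A second queue attempt before the worker picks the item up is a no-op,
so an extra lock around the queueing loop adds nothing for that purpose.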

git history shows we tried to remove it once:

commit 8521fc50d433507a7cdc96bec280f9e5888a54cc
Author: Michal Hocko <mhocko@suse.cz>
Date:   Tue Jul 26 16:08:29 2011 -0700

    memcg: get rid of percpu_charge_mutex lock

but it turned out that the lock did in fact protect a data structure:
the stock itself. Specifically stock->cached:

commit 9f50fad65b87a8776ae989ca059ad6c17925dfc3
Author: Michal Hocko <mhocko@suse.cz>
Date:   Tue Aug 9 11:56:26 2011 +0200

    Revert "memcg: get rid of percpu_charge_mutex lock"

    This reverts commit 8521fc50d433507a7cdc96bec280f9e5888a54cc.

    The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
    bit operations is sufficient but that is not true.  Johannes Weiner has
    reported a crash during parallel memory cgroup removal:

      BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
      Oops: 0000 [#1] PREEMPT SMP
      Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
      RIP: 0010:[<ffffffff81083b70>]  css_is_ancestor+0x20/0x70
      RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
      Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
      Call Trace:
       [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
       [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
       [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
       [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
       [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
       [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
       [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
       [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
       [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b

    We are crashing because we try to dereference cached memcg when we are
    checking whether we should wait for draining on the cache.  The cache is
    already cleaned up, though.

    There is also a theoretical chance that the cached memcg gets freed
    between we test for the FLUSHING_CACHED_CHARGE and dereference it in
    mem_cgroup_same_or_subtree:

            CPU0                    CPU1                         CPU2
      mem=stock->cached
      stock->cached=NULL
                                  clear_bit
                                                            test_and_set_bit
      test_bit()                    ...
      <preempted>             mem_cgroup_destroy
      use after free

    The percpu_charge_mutex protected from this race because sync draining
    is exclusive.

    It is safer to revert now and come up with a more parallel
    implementation later.

I didn't remember this one at all!

However, when you look at the codebase from back then, there was no
rcu-protection for memcg lifetime, and drain_stock() didn't double
check stock->cached inside the work. Hence the crash during a race.

The drain code is different now: drain_local_stock() disables IRQs
which holds up rcu, and then calls drain_stock() and drain_obj_stock()
which both check stock->cached one more time before the deref.

With workqueue managing concurrency, and rcu ensuring memcg lifetime
during the drain, this lock indeed seems unnecessary now.

Unless I'm missing something, it should just be removed instead.
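
A minimal userspace sketch of the re-check described above, assuming a
drain path that reads stock->cached again and returns early when the
stock is already empty. The names memcg_model, stock_model and
drain_stock_model are hypothetical stand-ins, not the kernel's
definitions. The sketch is single-threaded, so it only models the
"check before deref" step, not IRQ disabling or rcu.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup; not the kernel struct. */
struct memcg_model {
	long charged_pages;		/* pages accounted to this group */
};

/* Hypothetical stand-in for the per-cpu charge stock. */
struct stock_model {
	struct memcg_model *cached;	/* owner of the cached charge, or NULL */
	unsigned int nr_pages;		/* pre-charged pages held in the stock */
};

/*
 * Models the shape of the drain step: re-read the cached pointer and
 * bail out if an earlier drain already emptied the stock, so the
 * pointer is never dereferenced once it has been cleared.
 * Returns the number of pages handed back to the group.
 */
unsigned int drain_stock_model(struct stock_model *stock)
{
	struct memcg_model *old = stock->cached;
	unsigned int drained;

	if (!old)
		return 0;	/* already drained: nothing to dereference */

	drained = stock->nr_pages;
	assert(old->charged_pages >= (long)drained);
	old->charged_pages -= drained;

	stock->nr_pages = 0;
	stock->cached = NULL;
	return drained;
}
```

In this model, calling drain_stock_model() twice on the same stock
drains once and then returns 0, which is what makes a second drain of
an already-empty stock harmless.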



* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-08-03 14:15       ` Johannes Weiner
@ 2021-08-04  8:20         ` Michal Hocko
  2021-08-05  1:44           ` Miaohe Lin
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Hocko @ 2021-08-04  8:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Miaohe Lin, vdavydov.dev, akpm, shakeelb, willy,
	alexs, richard.weiyang, songmuchun, linux-mm, linux-kernel,
	cgroups

On Tue 03-08-21 10:15:36, Johannes Weiner wrote:
[...]
> git history shows we tried to remove it once:
> 
> commit 8521fc50d433507a7cdc96bec280f9e5888a54cc
> Author: Michal Hocko <mhocko@suse.cz>
> Date:   Tue Jul 26 16:08:29 2011 -0700
> 
>     memcg: get rid of percpu_charge_mutex lock
> 
> but it turned out that the lock did in fact protect a data structure:
> the stock itself. Specifically stock->cached:
> 
> commit 9f50fad65b87a8776ae989ca059ad6c17925dfc3
> Author: Michal Hocko <mhocko@suse.cz>
> Date:   Tue Aug 9 11:56:26 2011 +0200
> 
>     Revert "memcg: get rid of percpu_charge_mutex lock"
> 
>     This reverts commit 8521fc50d433507a7cdc96bec280f9e5888a54cc.
> 
>     The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
>     bit operations is sufficient but that is not true.  Johannes Weiner has
>     reported a crash during parallel memory cgroup removal:
> 
>       BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>       IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
>       Oops: 0000 [#1] PREEMPT SMP
>       Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
>       RIP: 0010:[<ffffffff81083b70>]  css_is_ancestor+0x20/0x70
>       RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
>       Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
>       Call Trace:
>        [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
>        [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
>        [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
>        [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
>        [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
>        [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
>        [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
>        [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
>        [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b
> 
>     We are crashing because we try to dereference cached memcg when we are
>     checking whether we should wait for draining on the cache.  The cache is
>     already cleaned up, though.
> 
>     There is also a theoretical chance that the cached memcg gets freed
>     between we test for the FLUSHING_CACHED_CHARGE and dereference it in
>     mem_cgroup_same_or_subtree:
> 
>             CPU0                    CPU1                         CPU2
>       mem=stock->cached
>       stock->cached=NULL
>                                   clear_bit
>                                                             test_and_set_bit
>       test_bit()                    ...
>       <preempted>             mem_cgroup_destroy
>       use after free
> 
>     The percpu_charge_mutex protected from this race because sync draining
>     is exclusive.
> 
>     It is safer to revert now and come up with a more parallel
>     implementation later.
> 
> I didn't remember this one at all!

Me neither. Thanks for looking that up!

> However, when you look at the codebase from back then, there was no
> rcu-protection for memcg lifetime, and drain_stock() didn't double
> check stock->cached inside the work. Hence the crash during a race.
> 
> The drain code is different now: drain_local_stock() disables IRQs
> which holds up rcu, and then calls drain_stock() and drain_obj_stock()
> which both check stock->cached one more time before the deref.
> 
> With workqueue managing concurrency, and rcu ensuring memcg lifetime
> during the drain, this lock indeed seems unnecessary now.
> 
> Unless I'm missing something, it should just be removed instead.

I do not think you are missing anything. We can drop the lock and
simplify the code. The above information would be great to have in the
changelog.

Thanks!
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex
  2021-08-04  8:20         ` Michal Hocko
@ 2021-08-05  1:44           ` Miaohe Lin
  0 siblings, 0 replies; 45+ messages in thread
From: Miaohe Lin @ 2021-08-05  1:44 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: Roman Gushchin, vdavydov.dev, akpm, shakeelb, willy, alexs,
	richard.weiyang, songmuchun, linux-mm, linux-kernel, cgroups

On 2021/8/4 16:20, Michal Hocko wrote:
> On Tue 03-08-21 10:15:36, Johannes Weiner wrote:
> [...]
>> git history shows we tried to remove it once:
>>
>> commit 8521fc50d433507a7cdc96bec280f9e5888a54cc
>> Author: Michal Hocko <mhocko@suse.cz>
>> Date:   Tue Jul 26 16:08:29 2011 -0700
>>
>>     memcg: get rid of percpu_charge_mutex lock
>>
>> but it turned out that the lock did in fact protect a data structure:
>> the stock itself. Specifically stock->cached:
>>
>> commit 9f50fad65b87a8776ae989ca059ad6c17925dfc3
>> Author: Michal Hocko <mhocko@suse.cz>
>> Date:   Tue Aug 9 11:56:26 2011 +0200
>>
>>     Revert "memcg: get rid of percpu_charge_mutex lock"
>>
>>     This reverts commit 8521fc50d433507a7cdc96bec280f9e5888a54cc.
>>
>>     The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
>>     bit operations is sufficient but that is not true.  Johannes Weiner has
>>     reported a crash during parallel memory cgroup removal:
>>
>>       BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>>       IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
>>       Oops: 0000 [#1] PREEMPT SMP
>>       Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
>>       RIP: 0010:[<ffffffff81083b70>]  css_is_ancestor+0x20/0x70
>>       RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
>>       Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
>>       Call Trace:
>>        [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
>>        [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
>>        [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
>>        [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
>>        [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
>>        [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
>>        [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
>>        [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
>>        [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b
>>
>>     We are crashing because we try to dereference cached memcg when we are
>>     checking whether we should wait for draining on the cache.  The cache is
>>     already cleaned up, though.
>>
>>     There is also a theoretical chance that the cached memcg gets freed
>>     between we test for the FLUSHING_CACHED_CHARGE and dereference it in
>>     mem_cgroup_same_or_subtree:
>>
>>             CPU0                    CPU1                         CPU2
>>       mem=stock->cached
>>       stock->cached=NULL
>>                                   clear_bit
>>                                                             test_and_set_bit
>>       test_bit()                    ...
>>       <preempted>             mem_cgroup_destroy
>>       use after free
>>
>>     The percpu_charge_mutex protected from this race because sync draining
>>     is exclusive.
>>
>>     It is safer to revert now and come up with a more parallel
>>     implementation later.
>>
>> I didn't remember this one at all!
> 
> Me neither. Thanks for looking that up!
> 
>> However, when you look at the codebase from back then, there was no
>> rcu-protection for memcg lifetime, and drain_stock() didn't double
>> check stock->cached inside the work. Hence the crash during a race.
>>
>> The drain code is different now: drain_local_stock() disables IRQs
>> which holds up rcu, and then calls drain_stock() and drain_obj_stock()
>> which both check stock->cached one more time before the deref.
>>
>> With workqueue managing concurrency, and rcu ensuring memcg lifetime
>> during the drain, this lock indeed seems unnecessary now.
>>
>> Unless I'm missing something, it should just be removed instead.
> 
> I do not think you are missing anything. We can drop the lock and
> simplify the code. The above information would be great to have in the
> changelog.
> 

Am I supposed to revert this with the above information in the changelog and add
Suggested-by for both of you?

Many thanks.

> Thanks!
> 




Thread overview: 45+ messages
2021-07-29 12:57 [PATCH 0/5] Cleanups and fixup for memcontrol Miaohe Lin
2021-07-29 12:57 ` [PATCH 1/5] mm, memcg: remove unused functions Miaohe Lin
2021-07-29 14:07   ` Shakeel Butt
2021-07-30  2:39   ` Muchun Song
2021-07-30  2:57   ` Roman Gushchin
2021-07-30  6:45   ` Michal Hocko
2021-07-29 12:57 ` [PATCH 2/5] mm, memcg: narrow the scope of percpu_charge_mutex Miaohe Lin
2021-07-30  2:42   ` Muchun Song
2021-07-30  3:06   ` Roman Gushchin
2021-07-30  6:50     ` Michal Hocko
2021-07-31  2:29       ` Miaohe Lin
2021-08-02  6:49         ` Michal Hocko
2021-08-02  9:54           ` Miaohe Lin
2021-08-03  3:40         ` Roman Gushchin
2021-08-03  6:29           ` Miaohe Lin
2021-08-03  7:11             ` Michal Hocko
2021-08-03  7:13               ` Roman Gushchin
2021-08-03  7:27                 ` Michal Hocko
2021-08-03  9:33             ` Muchun Song
2021-08-03 10:50               ` Miaohe Lin
2021-08-03 14:15       ` Johannes Weiner
2021-08-04  8:20         ` Michal Hocko
2021-08-05  1:44           ` Miaohe Lin
2021-07-30  6:46   ` Michal Hocko
2021-07-29 12:57 ` [PATCH 3/5] mm, memcg: save some atomic ops when flush is already true Miaohe Lin
2021-07-29 14:40   ` Shakeel Butt
2021-07-30  2:37   ` Muchun Song
2021-07-30  3:07   ` Roman Gushchin
2021-07-30  6:51   ` Michal Hocko
2021-07-29 12:57 ` [PATCH 4/5] mm, memcg: avoid possible NULL pointer dereferencing in mem_cgroup_init() Miaohe Lin
2021-07-29 13:52   ` Matthew Wilcox
2021-07-30  1:50     ` Miaohe Lin
2021-07-30  3:12   ` Roman Gushchin
2021-07-30  6:29     ` Miaohe Lin
2021-07-30  6:44     ` Michal Hocko
2021-07-31  2:05       ` Miaohe Lin
2021-08-02  6:43         ` Michal Hocko
2021-08-02 10:00           ` Miaohe Lin
2021-08-02 10:42             ` Michal Hocko
2021-08-02 11:18               ` Miaohe Lin
2021-07-29 12:57 ` [PATCH 5/5] mm, memcg: always call __mod_node_page_state() with preempt disabled Miaohe Lin
2021-07-29 14:39   ` Shakeel Butt
2021-07-30  1:52     ` Miaohe Lin
2021-07-30  2:33       ` [External] " Muchun Song
2021-07-30  2:46         ` Miaohe Lin
