linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/2] set memcg when split page
@ 2021-03-04  7:40 Zhou Guanghui
  2021-03-04  7:40 ` [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg Zhou Guanghui
  2021-03-04  7:40 ` [PATCH v2 2/2] mm/memcg: set memcg when split page Zhou Guanghui
  0 siblings, 2 replies; 33+ messages in thread
From: Zhou Guanghui @ 2021-03-04  7:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: akpm, mhocko, hannes, hughd, kirill.shutemov, npiggin, ziy,
	wangkefeng.wang, guohanjun, dingtianhong, chenweilong, rui.xiang

v2:
1. rename mem_cgroup_split_huge_fixup
2. use split_page_memcg when split page

*** BLURB HERE ***

Zhou Guanghui (2):
  mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
  mm/memcg: set memcg when split pages

 include/linux/memcontrol.h |  6 ++----
 mm/huge_memory.c           |  2 +-
 mm/memcontrol.c            | 15 ++++++---------
 mm/page_alloc.c            |  1 +
 4 files changed, 10 insertions(+), 14 deletions(-)

-- 
2.25.0


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
  2021-03-04  7:40 [PATCH v2 0/2] set memcg when split page Zhou Guanghui
@ 2021-03-04  7:40 ` Zhou Guanghui
  2021-03-04 15:50   ` Johannes Weiner
                     ` (4 more replies)
  2021-03-04  7:40 ` [PATCH v2 2/2] mm/memcg: set memcg when split page Zhou Guanghui
  1 sibling, 5 replies; 33+ messages in thread
From: Zhou Guanghui @ 2021-03-04  7:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: akpm, mhocko, hannes, hughd, kirill.shutemov, npiggin, ziy,
	wangkefeng.wang, guohanjun, dingtianhong, chenweilong, rui.xiang

Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
pass in page number argument.

In this way, the interface name is more common and can be used by
potential users. In addition, the complete info(memcg and flag) of
the memcg needs to be set to the tail pages.

Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
---
 include/linux/memcontrol.h |  6 ++----
 mm/huge_memory.c           |  2 +-
 mm/memcontrol.c            | 15 ++++++---------
 3 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e6dc793d587d..0c04d39a7967 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1061,9 +1061,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 	rcu_read_unlock();
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void mem_cgroup_split_huge_fixup(struct page *head);
-#endif
+void split_page_memcg(struct page *head, unsigned int nr);
 
 #else /* CONFIG_MEMCG */
 
@@ -1400,7 +1398,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 	return 0;
 }
 
-static inline void mem_cgroup_split_huge_fixup(struct page *head)
+static inline void split_page_memcg(struct page *head, unsigned int nr)
 {
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 395c75111d33..e7f29308ebc8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2471,7 +2471,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	int i;
 
 	/* complete memcg works before add pages to LRU */
-	mem_cgroup_split_huge_fixup(head);
+	split_page_memcg(head, nr);
 
 	if (PageAnon(head) && PageSwapCache(head)) {
 		swp_entry_t entry = { .val = page_private(head) };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 845eec01ef9d..e064ac0d850a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3287,24 +3287,21 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 
 #endif /* CONFIG_MEMCG_KMEM */
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
- * Because page_memcg(head) is not set on compound tails, set it now.
+ * Because page_memcg(head) is not set on tails, set it now.
  */
-void mem_cgroup_split_huge_fixup(struct page *head)
+void split_page_memcg(struct page *head, unsigned int nr)
 {
 	struct mem_cgroup *memcg = page_memcg(head);
 	int i;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() || !memcg)
 		return;
 
-	for (i = 1; i < HPAGE_PMD_NR; i++) {
-		css_get(&memcg->css);
-		head[i].memcg_data = (unsigned long)memcg;
-	}
+	for (i = 1; i < nr; i++)
+		head[i].memcg_data = head->memcg_data;
+	css_get_many(&memcg->css, nr - 1);
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_MEMCG_SWAP
 /**
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-04  7:40 [PATCH v2 0/2] set memcg when split page Zhou Guanghui
  2021-03-04  7:40 ` [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg Zhou Guanghui
@ 2021-03-04  7:40 ` Zhou Guanghui
  2021-03-04 15:52   ` Johannes Weiner
                     ` (4 more replies)
  1 sibling, 5 replies; 33+ messages in thread
From: Zhou Guanghui @ 2021-03-04  7:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: akpm, mhocko, hannes, hughd, kirill.shutemov, npiggin, ziy,
	wangkefeng.wang, guohanjun, dingtianhong, chenweilong, rui.xiang

As described in the split_page function comment, for the non-compound
high order page, the sub-pages must be freed individually. If the
memcg of the fisrt page is valid, the tail pages cannot be uncharged
when be freed.

For example, when alloc_pages_exact is used to allocate 1MB continuous
physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
set). When make_alloc_exact free the unused 1MB and free_pages_exact
free the applied 1MB, actually, only 4KB(one page) is uncharged.

Therefore, the memcg of the tail page needs to be set when split page.

Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
---
 mm/page_alloc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..3ed783e25c3c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3310,6 +3310,7 @@ void split_page(struct page *page, unsigned int order)
 	for (i = 1; i < (1 << order); i++)
 		set_page_refcounted(page + i);
 	split_page_owner(page, 1 << order);
+	split_page_memcg(page, 1 << order);
 }
 EXPORT_SYMBOL_GPL(split_page);
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
  2021-03-04  7:40 ` [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg Zhou Guanghui
@ 2021-03-04 15:50   ` Johannes Weiner
  2021-03-04 16:20   ` Zi Yan
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2021-03-04 15:50 UTC (permalink / raw)
  To: Zhou Guanghui
  Cc: linux-kernel, linux-mm, akpm, mhocko, hughd, kirill.shutemov,
	npiggin, ziy, wangkefeng.wang, guohanjun, dingtianhong,
	chenweilong, rui.xiang

On Thu, Mar 04, 2021 at 07:40:52AM +0000, Zhou Guanghui wrote:
> Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
> pass in page number argument.
> 
> In this way, the interface name is more common and can be used by
> potential users. In addition, the complete info(memcg and flag) of
> the memcg needs to be set to the tail pages.
> 
> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-04  7:40 ` [PATCH v2 2/2] mm/memcg: set memcg when split page Zhou Guanghui
@ 2021-03-04 15:52   ` Johannes Weiner
  2021-03-04 16:22   ` Zi Yan
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2021-03-04 15:52 UTC (permalink / raw)
  To: Zhou Guanghui
  Cc: linux-kernel, linux-mm, akpm, mhocko, hughd, kirill.shutemov,
	npiggin, ziy, wangkefeng.wang, guohanjun, dingtianhong,
	chenweilong, rui.xiang

On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
> 
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
> 
> Therefore, the memcg of the tail page needs to be set when split page.
> 
> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
  2021-03-04  7:40 ` [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg Zhou Guanghui
  2021-03-04 15:50   ` Johannes Weiner
@ 2021-03-04 16:20   ` Zi Yan
  2021-03-04 18:54   ` Shakeel Butt
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Zi Yan @ 2021-03-04 16:20 UTC (permalink / raw)
  To: Zhou Guanghui
  Cc: linux-kernel, linux-mm, akpm, mhocko, hannes, hughd,
	kirill.shutemov, npiggin, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

[-- Attachment #1: Type: text/plain, Size: 689 bytes --]

On 4 Mar 2021, at 2:40, Zhou Guanghui wrote:

> Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
> pass in page number argument.
>
> In this way, the interface name is more common and can be used by
> potential users. In addition, the complete info(memcg and flag) of
> the memcg needs to be set to the tail pages.
>
> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
> ---
>  include/linux/memcontrol.h |  6 ++----
>  mm/huge_memory.c           |  2 +-
>  mm/memcontrol.c            | 15 ++++++---------
>  3 files changed, 9 insertions(+), 14 deletions(-)
>
LGTM. Thanks.

Reviewed-by: Zi Yan <ziy@nvidia.com>

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-04  7:40 ` [PATCH v2 2/2] mm/memcg: set memcg when split page Zhou Guanghui
  2021-03-04 15:52   ` Johannes Weiner
@ 2021-03-04 16:22   ` Zi Yan
  2021-03-04 18:55   ` Shakeel Butt
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Zi Yan @ 2021-03-04 16:22 UTC (permalink / raw)
  To: Zhou Guanghui
  Cc: linux-kernel, linux-mm, akpm, mhocko, hannes, hughd,
	kirill.shutemov, npiggin, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

[-- Attachment #1: Type: text/plain, Size: 1333 bytes --]

On 4 Mar 2021, at 2:40, Zhou Guanghui wrote:

> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged

s/fisrt/first/

> when be freed.
>
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
>
> Therefore, the memcg of the tail page needs to be set when split page.
>
> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
> ---
>  mm/page_alloc.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..3ed783e25c3c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3310,6 +3310,7 @@ void split_page(struct page *page, unsigned int order)
>  	for (i = 1; i < (1 << order); i++)
>  		set_page_refcounted(page + i);
>  	split_page_owner(page, 1 << order);
> +	split_page_memcg(page, 1 << order);
>  }
>  EXPORT_SYMBOL_GPL(split_page);
>
> -- 
> 2.25.0

LGTM. Thanks.

Reviewed-by: Zi Yan <ziy@nvidia.com>


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
  2021-03-04  7:40 ` [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg Zhou Guanghui
  2021-03-04 15:50   ` Johannes Weiner
  2021-03-04 16:20   ` Zi Yan
@ 2021-03-04 18:54   ` Shakeel Butt
  2021-03-05 11:48   ` Michal Hocko
  2021-03-08 22:37   ` Singh, Balbir
  4 siblings, 0 replies; 33+ messages in thread
From: Shakeel Butt @ 2021-03-04 18:54 UTC (permalink / raw)
  To: Zhou Guanghui
  Cc: LKML, Linux MM, Andrew Morton, Michal Hocko, Johannes Weiner,
	Hugh Dickins, Kirill A. Shutemov, Nicholas Piggin, ziy,
	wangkefeng.wang, guohanjun, dingtianhong, chenweilong, rui.xiang

On Wed, Mar 3, 2021 at 11:55 PM Zhou Guanghui <zhouguanghui1@huawei.com> wrote:
>
> Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
> pass in page number argument.
>
> In this way, the interface name is more common and can be used by
> potential users. In addition, the complete info(memcg and flag) of
> the memcg needs to be set to the tail pages.
>
> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-04  7:40 ` [PATCH v2 2/2] mm/memcg: set memcg when split page Zhou Guanghui
  2021-03-04 15:52   ` Johannes Weiner
  2021-03-04 16:22   ` Zi Yan
@ 2021-03-04 18:55   ` Shakeel Butt
  2021-03-05 11:52   ` Michal Hocko
  2021-03-08 21:02   ` Matthew Wilcox
  4 siblings, 0 replies; 33+ messages in thread
From: Shakeel Butt @ 2021-03-04 18:55 UTC (permalink / raw)
  To: Zhou Guanghui
  Cc: LKML, Linux MM, Andrew Morton, Michal Hocko, Johannes Weiner,
	Hugh Dickins, Kirill A. Shutemov, Nicholas Piggin, ziy,
	wangkefeng.wang, guohanjun, dingtianhong, chenweilong, rui.xiang

On Wed, Mar 3, 2021 at 11:57 PM Zhou Guanghui <zhouguanghui1@huawei.com> wrote:
>
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
>
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
>
> Therefore, the memcg of the tail page needs to be set when split page.
>
> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
  2021-03-04  7:40 ` [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg Zhou Guanghui
                     ` (2 preceding siblings ...)
  2021-03-04 18:54   ` Shakeel Butt
@ 2021-03-05 11:48   ` Michal Hocko
  2021-03-08 22:37   ` Singh, Balbir
  4 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2021-03-05 11:48 UTC (permalink / raw)
  To: Zhou Guanghui
  Cc: linux-kernel, linux-mm, akpm, hannes, hughd, kirill.shutemov,
	npiggin, ziy, wangkefeng.wang, guohanjun, dingtianhong,
	chenweilong, rui.xiang

On Thu 04-03-21 07:40:52, Zhou Guanghui wrote:
> Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
> pass in page number argument.
> 
> In this way, the interface name is more common and can be used by
> potential users. In addition, the complete info(memcg and flag) of
> the memcg needs to be set to the tail pages.

I think it is good to call out the css_get -> css_get_many as a
microptimization explicitly. Even though I do not expect this to be
easily visible it makes sense to save rcu section for each of the tail
page in general.

> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h |  6 ++----
>  mm/huge_memory.c           |  2 +-
>  mm/memcontrol.c            | 15 ++++++---------
>  3 files changed, 9 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..0c04d39a7967 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1061,9 +1061,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  	rcu_read_unlock();
>  }
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -void mem_cgroup_split_huge_fixup(struct page *head);
> -#endif
> +void split_page_memcg(struct page *head, unsigned int nr);
>  
>  #else /* CONFIG_MEMCG */
>  
> @@ -1400,7 +1398,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  	return 0;
>  }
>  
> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
> +static inline void split_page_memcg(struct page *head, unsigned int nr)
>  {
>  }
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 395c75111d33..e7f29308ebc8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2471,7 +2471,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  	int i;
>  
>  	/* complete memcg works before add pages to LRU */
> -	mem_cgroup_split_huge_fixup(head);
> +	split_page_memcg(head, nr);
>  
>  	if (PageAnon(head) && PageSwapCache(head)) {
>  		swp_entry_t entry = { .val = page_private(head) };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 845eec01ef9d..e064ac0d850a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3287,24 +3287,21 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>  
>  #endif /* CONFIG_MEMCG_KMEM */
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  /*
> - * Because page_memcg(head) is not set on compound tails, set it now.
> + * Because page_memcg(head) is not set on tails, set it now.
>   */
> -void mem_cgroup_split_huge_fixup(struct page *head)
> +void split_page_memcg(struct page *head, unsigned int nr)
>  {
>  	struct mem_cgroup *memcg = page_memcg(head);
>  	int i;
>  
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_disabled() || !memcg)
>  		return;
>  
> -	for (i = 1; i < HPAGE_PMD_NR; i++) {
> -		css_get(&memcg->css);
> -		head[i].memcg_data = (unsigned long)memcg;
> -	}
> +	for (i = 1; i < nr; i++)
> +		head[i].memcg_data = head->memcg_data;
> +	css_get_many(&memcg->css, nr - 1);
>  }
> -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  #ifdef CONFIG_MEMCG_SWAP
>  /**
> -- 
> 2.25.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-04  7:40 ` [PATCH v2 2/2] mm/memcg: set memcg when split page Zhou Guanghui
                     ` (2 preceding siblings ...)
  2021-03-04 18:55   ` Shakeel Butt
@ 2021-03-05 11:52   ` Michal Hocko
  2021-03-05 23:58     ` Andrew Morton
  2021-03-08 21:02   ` Matthew Wilcox
  4 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-05 11:52 UTC (permalink / raw)
  To: Zhou Guanghui
  Cc: linux-kernel, linux-mm, akpm, hannes, hughd, kirill.shutemov,
	npiggin, ziy, wangkefeng.wang, guohanjun, dingtianhong,
	chenweilong, rui.xiang

On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
> 
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
> 
> Therefore, the memcg of the tail page needs to be set when split page.
> 

As already mentioned there are at least two explicit users of
__GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
mention that explicitly and maybe even mention 7efe8ef274024 resp.
c419621873713 so that it is clear this is not just a theoretical issue.

> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>

Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..3ed783e25c3c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3310,6 +3310,7 @@ void split_page(struct page *page, unsigned int order)
>  	for (i = 1; i < (1 << order); i++)
>  		set_page_refcounted(page + i);
>  	split_page_owner(page, 1 << order);
> +	split_page_memcg(page, 1 << order);
>  }
>  EXPORT_SYMBOL_GPL(split_page);
>  
> -- 
> 2.25.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-05 11:52   ` Michal Hocko
@ 2021-03-05 23:58     ` Andrew Morton
  2021-03-08  8:41       ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2021-03-05 23:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zhou Guanghui, linux-kernel, linux-mm, hannes, hughd,
	kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > As described in the split_page function comment, for the non-compound
> > high order page, the sub-pages must be freed individually. If the
> > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > when be freed.
> > 
> > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > 
> > Therefore, the memcg of the tail page needs to be set when split page.
> > 
> 
> As already mentioned there are at least two explicit users of
> __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> mention that explicitly and maybe even mention 7efe8ef274024 resp.
> c419621873713 so that it is clear this is not just a theoretical issue.

I added

: Michel:
: 
: There are at least two explicit users of __GFP_ACCOUNT with
: alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
: Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
: ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
: just a theoretical issue.

And should we cc:stable on this one?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-05 23:58     ` Andrew Morton
@ 2021-03-08  8:41       ` Michal Hocko
  2021-03-08 20:42         ` Andrew Morton
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-08  8:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Zhou Guanghui, linux-kernel, linux-mm, hannes, hughd,
	kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > As described in the split_page function comment, for the non-compound
> > > high order page, the sub-pages must be freed individually. If the
> > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > when be freed.
> > > 
> > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > > 
> > > Therefore, the memcg of the tail page needs to be set when split page.
> > > 
> > 
> > As already mentioned there are at least two explicit users of
> > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > c419621873713 so that it is clear this is not just a theoretical issue.
> 
> I added
> 
> : Michel:
> : 
> : There are at least two explicit users of __GFP_ACCOUNT with
> : alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
> : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> : just a theoretical issue.
> 
> And should we cc:stable on this one?

Somebody more familiar with iommu dma allocation layer should have a
look as well (__iommu_dma_alloc_pages) so that we know whether there are
kernels outside of the above two ones mentioned above that need a fix.
But in general this sounds like a good fit for the stable tree.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-08  8:41       ` Michal Hocko
@ 2021-03-08 20:42         ` Andrew Morton
  2021-03-08 20:47           ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2021-03-08 20:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zhou Guanghui, linux-kernel, linux-mm, hannes, hughd,
	kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Mon, 8 Mar 2021 09:41:38 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> > On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <mhocko@suse.com> wrote:
> > 
> > > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > > As described in the split_page function comment, for the non-compound
> > > > high order page, the sub-pages must be freed individually. If the
> > > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > > when be freed.
> > > > 
> > > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > > > 
> > > > Therefore, the memcg of the tail page needs to be set when split page.
> > > > 
> > > 
> > > As already mentioned there are at least two explicit users of
> > > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > > c419621873713 so that it is clear this is not just a theoretical issue.
> > 
> > I added
> > 
> > : Michel:
> > : 
> > : There are at least two explicit users of __GFP_ACCOUNT with
> > : alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
> > : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> > : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> > : just a theoretical issue.
> > 
> > And should we cc:stable on this one?
> 
> Somebody more familiar with iommu dma allocation layer should have a
> look as well (__iommu_dma_alloc_pages) so that we know whether there are
> kernels outside of the above two ones mentioned above that need a fix.
> But in general this sounds like a good fit for the stable tree.

OK.  I reversed the order of these two patches so we don't need to
burden -stable with a cosmetic rename.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-08 20:42         ` Andrew Morton
@ 2021-03-08 20:47           ` Matthew Wilcox
  2021-03-09  0:10             ` Andrew Morton
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2021-03-08 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Zhou Guanghui, linux-kernel, linux-mm, hannes,
	hughd, kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Mon, Mar 08, 2021 at 12:42:27PM -0800, Andrew Morton wrote:
> On Mon, 8 Mar 2021 09:41:38 +0100 Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> > > On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <mhocko@suse.com> wrote:
> > > 
> > > > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > > > As described in the split_page function comment, for the non-compound
> > > > > high order page, the sub-pages must be freed individually. If the
> > > > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > > > when be freed.
> > > > > 
> > > > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > > > > 
> > > > > Therefore, the memcg of the tail page needs to be set when split page.
> > > > > 
> > > > 
> > > > As already mentioned there are at least two explicit users of
> > > > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > > > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > > > c419621873713 so that it is clear this is not just a theoretical issue.
> > > 
> > > I added
> > > 
> > > : Michel:
> > > : 
> > > : There are at least two explicit users of __GFP_ACCOUNT with
> > > : alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
> > > : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> > > : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> > > : just a theoretical issue.
> > > 
> > > And should we cc:stable on this one?
> > 
> > Somebody more familiar with iommu dma allocation layer should have a
> > look as well (__iommu_dma_alloc_pages) so that we know whether there are
> > kernels outside of the above two ones mentioned above that need a fix.
> > But in general this sounds like a good fit for the stable tree.
> 
> OK.  I reversed the order of these two patches so we don't need to
> burden -stable with a cosmetic rename.

Eek, no.

The alloc_pages_exact() is done to pages that _aren't_ compound.
So you have to pass the number of pages to the memcg split function,
because a non-compound page doesn't know the size of its allocation.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-04  7:40 ` [PATCH v2 2/2] mm/memcg: set memcg when split page Zhou Guanghui
                     ` (3 preceding siblings ...)
  2021-03-05 11:52   ` Michal Hocko
@ 2021-03-08 21:02   ` Matthew Wilcox
  2021-03-09  9:02     ` Michal Hocko
  4 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2021-03-08 21:02 UTC (permalink / raw)
  To: Zhou Guanghui
  Cc: linux-kernel, linux-mm, akpm, mhocko, hannes, hughd,
	kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
> 
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
> 
> Therefore, the memcg of the tail page needs to be set when split page.

There's another place we need to do this to ...

+++ b/mm/page_alloc.c
@@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int order)
 {
        if (put_page_testzero(page))
                free_the_page(page, order);
-       else if (!PageHead(page))
-               while (order-- > 0)
-                       free_the_page(page + (1 << order), order);
+       else if (!PageHead(page)) {
+               while (order-- > 0) {
+                       struct page *tail = page + (1 << order);
+#ifdef CONFIG_MEMCG
+                       tail->memcg_data = page->memcg_data;
+#endif
+                       free_the_page(tail, order);
+               }
+       }
 }
 EXPORT_SYMBOL(__free_pages);
 

I wonder if we shouldn't initialise memcg_data on all subsequent pages
of non-compound allocations instead?  Because I'm not sure this is the
only place that needs to be fixed.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
  2021-03-04  7:40 ` [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg Zhou Guanghui
                     ` (3 preceding siblings ...)
  2021-03-05 11:48   ` Michal Hocko
@ 2021-03-08 22:37   ` Singh, Balbir
  2021-03-09  8:28     ` Michal Hocko
  4 siblings, 1 reply; 33+ messages in thread
From: Singh, Balbir @ 2021-03-08 22:37 UTC (permalink / raw)
  To: Zhou Guanghui, linux-kernel, linux-mm
  Cc: akpm, mhocko, hannes, hughd, kirill.shutemov, npiggin, ziy,
	wangkefeng.wang, guohanjun, dingtianhong, chenweilong, rui.xiang

On 4/3/21 6:40 pm, Zhou Guanghui wrote:
> Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
> pass in page number argument.
> 
> In this way, the interface name is more common and can be used by
> potential users. In addition, the complete info(memcg and flag) of
> the memcg needs to be set to the tail pages.
> 
> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
> ---
>  include/linux/memcontrol.h |  6 ++----
>  mm/huge_memory.c           |  2 +-
>  mm/memcontrol.c            | 15 ++++++---------
>  3 files changed, 9 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..0c04d39a7967 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1061,9 +1061,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  	rcu_read_unlock();
>  }
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -void mem_cgroup_split_huge_fixup(struct page *head);
> -#endif
> +void split_page_memcg(struct page *head, unsigned int nr);
>  
>  #else /* CONFIG_MEMCG */
>  
> @@ -1400,7 +1398,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  	return 0;
>  }
>  
> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
> +static inline void split_page_memcg(struct page *head, unsigned int nr)
>  {
>  }
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 395c75111d33..e7f29308ebc8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2471,7 +2471,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  	int i;
>  
>  	/* complete memcg works before add pages to LRU */
> -	mem_cgroup_split_huge_fixup(head);
> +	split_page_memcg(head, nr);
>  
>  	if (PageAnon(head) && PageSwapCache(head)) {
>  		swp_entry_t entry = { .val = page_private(head) };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 845eec01ef9d..e064ac0d850a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3287,24 +3287,21 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>  
>  #endif /* CONFIG_MEMCG_KMEM */
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  /*
> - * Because page_memcg(head) is not set on compound tails, set it now.
> + * Because page_memcg(head) is not set on tails, set it now.
>   */
> -void mem_cgroup_split_huge_fixup(struct page *head)
> +void split_page_memcg(struct page *head, unsigned int nr)
>  {

Do we need input validation on nr? Can nr be aribtrary or can we enforce

VM_BUG_ON(!is_power_of_2(nr));


Balbir Singh


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-08 20:47           ` Matthew Wilcox
@ 2021-03-09  0:10             ` Andrew Morton
  0 siblings, 0 replies; 33+ messages in thread
From: Andrew Morton @ 2021-03-09  0:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, Zhou Guanghui, linux-kernel, linux-mm, hannes,
	hughd, kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Mon, 8 Mar 2021 20:47:31 +0000 Matthew Wilcox <willy@infradead.org> wrote:

> On Mon, Mar 08, 2021 at 12:42:27PM -0800, Andrew Morton wrote:
> > On Mon, 8 Mar 2021 09:41:38 +0100 Michal Hocko <mhocko@suse.com> wrote:
> > 
> > > On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> > > > On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <mhocko@suse.com> wrote:
> > > > 
> > > > > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > > > > As described in the split_page function comment, for the non-compound
> > > > > > high order page, the sub-pages must be freed individually. If the
> > > > > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > > > > when be freed.
> > > > > > 
> > > > > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > > > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > > > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > > > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > > > > > 
> > > > > > Therefore, the memcg of the tail page needs to be set when split page.
> > > > > > 
> > > > > 
> > > > > As already mentioned there are at least two explicit users of
> > > > > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > > > > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > > > > c419621873713 so that it is clear this is not just a theoretical issue.
> > > > 
> > > > I added
> > > > 
> > > > : Michel:
> > > > : 
> > > > : There are at least two explicit users of __GFP_ACCOUNT with
> > > > : alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
> > > > : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> > > > : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> > > > : just a theoretical issue.
> > > > 
> > > > And should we cc:stable on this one?
> > > 
> > > Somebody more familiar with iommu dma allocation layer should have a
> > > look as well (__iommu_dma_alloc_pages) so that we know whether there are
> > > kernels outside of the above two ones mentioned above that need a fix.
> > > But in general this sounds like a good fit for the stable tree.
> > 
> > OK.  I reversed the order of these two patches so we don't need to
> > burden -stable with a cosmetic rename.
> 
> Eek, no.
> 
> The alloc_pages_exact() is done to pages that _aren't_ compound.
> So you have to pass the number of pages to the memcg split function,
> because a non-compound page doesn't know the size of its allocation.

Ah, OK, the patch title fooled me.

It should have been a three-patch series, really

1: add nr_pages arg to mem_cgroup_split_huge_fixup()
2: call mem_cgroup_split_huge_fixup() when splitting
3: rename mem_cgroup_split_huge_fixup() to split_page_memcg()

That way, the third cosmetic patch could be deferred so we don't feed
the cosmetic renaming into -stable.

But whatever, the rename isn't a big deal so I'll go with the 2-patch
series as sent for -stable.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
  2021-03-08 22:37   ` Singh, Balbir
@ 2021-03-09  8:28     ` Michal Hocko
  2021-03-10 21:44       ` Singh, Balbir
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-09  8:28 UTC (permalink / raw)
  To: Singh, Balbir
  Cc: Zhou Guanghui, linux-kernel, linux-mm, akpm, hannes, hughd,
	kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Tue 09-03-21 09:37:29, Balbir Singh wrote:
> On 4/3/21 6:40 pm, Zhou Guanghui wrote:
[...]
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  /*
> > - * Because page_memcg(head) is not set on compound tails, set it now.
> > + * Because page_memcg(head) is not set on tails, set it now.
> >   */
> > -void mem_cgroup_split_huge_fixup(struct page *head)
> > +void split_page_memcg(struct page *head, unsigned int nr)
> >  {
> 
> Do we need input validation on nr? Can nr be aribtrary or can we enforce
> 
> VM_BUG_ON(!is_power_of_2(nr));

In practice this will be power of 2 but why should we bother to sanitze
that? 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-08 21:02   ` Matthew Wilcox
@ 2021-03-09  9:02     ` Michal Hocko
  2021-03-09 12:32       ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-09  9:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zhou Guanghui, linux-kernel, linux-mm, akpm, hannes, hughd,
	kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Mon 08-03-21 21:02:25, Matthew Wilcox wrote:
> On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> > As described in the split_page function comment, for the non-compound
> > high order page, the sub-pages must be freed individually. If the
> > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > when be freed.
> > 
> > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > 
> > Therefore, the memcg of the tail page needs to be set when split page.
> 
> There's another place we need to do this to ...
> 
> +++ b/mm/page_alloc.c
> @@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int order)
>  {
>         if (put_page_testzero(page))
>                 free_the_page(page, order);
> -       else if (!PageHead(page))
> -               while (order-- > 0)
> -                       free_the_page(page + (1 << order), order);
> +       else if (!PageHead(page)) {
> +               while (order-- > 0) {
> +                       struct page *tail = page + (1 << order);
> +#ifdef CONFIG_MEMCG
> +                       tail->memcg_data = page->memcg_data;
> +#endif
> +                       free_the_page(tail, order);
> +               }
> +       }
>  }
>  EXPORT_SYMBOL(__free_pages);

Hmm, I was not aware of this code. This is really a tricky code.

> I wonder if we shouldn't initialise memcg_data on all subsequent pages
> of non-compound allocations instead?  Because I'm not sure this is the
> only place that needs to be fixed.

That would be safer for sure. Do you mean this as a replacement to the
original patch?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 913c2b9e5c72..d44dea2b8d22 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 	if (memcg && !mem_cgroup_is_root(memcg)) {
 		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
 		if (!ret) {
+			int nr_pages = 1 << order;
 			page->memcg_data = (unsigned long)memcg |
 				MEMCG_DATA_KMEM;
+			
+			/*
+			 * Compound pages are normally split or freed
+			 * via their head pages so memcg_data in in the
+			 * head page should be sufficient but there
+			 * are exceptions to the rule (see __free_pages).
+			 * Non compound pages would need to copy memcg anyway.
+			 */
+			for (i = 1; i < nr_pages; i++) {
+				struct page * p = page + i;
+				p->memcg_data = page->memcg_data
+			}
 			return 0;
 		}
 		css_put(&memcg->css);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-09  9:02     ` Michal Hocko
@ 2021-03-09 12:32       ` Matthew Wilcox
  2021-03-09 13:03         ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2021-03-09 12:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zhou Guanghui, linux-kernel, linux-mm, akpm, hannes, hughd,
	kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Tue, Mar 09, 2021 at 10:02:00AM +0100, Michal Hocko wrote:
> On Mon 08-03-21 21:02:25, Matthew Wilcox wrote:
> > On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > free the applied 1MB, actually, only 4KB(one page) is uncharged.

> > @@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int order)
> >  {
> >         if (put_page_testzero(page))
> >                 free_the_page(page, order);
> > -       else if (!PageHead(page))
> > -               while (order-- > 0)
> > -                       free_the_page(page + (1 << order), order);
> > +       else if (!PageHead(page)) {
> > +               while (order-- > 0) {
> > +                       struct page *tail = page + (1 << order);
> > +#ifdef CONFIG_MEMCG
> > +                       tail->memcg_data = page->memcg_data;
> > +#endif
> > +                       free_the_page(tail, order);
> > +               }
> > +       }
> >  }
> >  EXPORT_SYMBOL(__free_pages);
> 
> Hmm, I was not aware of this code. This is really a tricky code.

Yes.  I only added it recently.  I don't see a better way to solve this
problem.  We could turn the non-compound page into a compound page at
this point, but I'm not sure that's really less tricky.

> > I wonder if we shouldn't initialise memcg_data on all subsequent pages
> > of non-compound allocations instead?  Because I'm not sure this is the
> > only place that needs to be fixed.
> 
> That would be safer for sure. Do you mean this as a replacement to the
> original patch?
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 913c2b9e5c72..d44dea2b8d22 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>  	if (memcg && !mem_cgroup_is_root(memcg)) {
>  		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
>  		if (!ret) {
> +			int nr_pages = 1 << order;
>  			page->memcg_data = (unsigned long)memcg |
>  				MEMCG_DATA_KMEM;
> +			
> +			/*
> +			 * Compound pages are normally split or freed
> +			 * via their head pages so memcg_data in in the
> +			 * head page should be sufficient but there
> +			 * are exceptions to the rule (see __free_pages).
> +			 * Non compound pages would need to copy memcg anyway.
> +			 */
> +			for (i = 1; i < nr_pages; i++) {
> +				struct page * p = page + i;
> +				p->memcg_data = page->memcg_data
> +			}
>  			return 0;

I would condition this loop on if (!(gfp & __GFP_COMP)), but yes, something
along these lines.  I might phrase the comment a little differently ...

			/*
			 * Compound pages are treated as a single unit,
			 * but non-compound pages can be freed individually
			 * so each page needs to have its memcg set to get
			 * the accounting right.
			 */

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-09 12:32       ` Matthew Wilcox
@ 2021-03-09 13:03         ` Michal Hocko
  2021-03-11  8:37           ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-09 13:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zhou Guanghui, linux-kernel, linux-mm, akpm, hannes, hughd,
	kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Tue 09-03-21 12:32:55, Matthew Wilcox wrote:
> On Tue, Mar 09, 2021 at 10:02:00AM +0100, Michal Hocko wrote:
[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 913c2b9e5c72..d44dea2b8d22 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> >  	if (memcg && !mem_cgroup_is_root(memcg)) {
> >  		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> >  		if (!ret) {
> > +			int nr_pages = 1 << order;
> >  			page->memcg_data = (unsigned long)memcg |
> >  				MEMCG_DATA_KMEM;
> > +			
> > +			/*
> > +			 * Compound pages are normally split or freed
> > +			 * via their head pages so memcg_data in in the
> > +			 * head page should be sufficient but there
> > +			 * are exceptions to the rule (see __free_pages).
> > +			 * Non compound pages would need to copy memcg anyway.
> > +			 */
> > +			for (i = 1; i < nr_pages; i++) {
> > +				struct page * p = page + i;
> > +				p->memcg_data = page->memcg_data
> > +			}
> >  			return 0;
> 
> I would condition this loop on if (!(gfp & __GFP_COMP)), but yes, something
> along these lines.  I might phrase the comment a little differently ...
> 
> 			/*
> 			 * Compound pages are treated as a single unit,
> 			 * but non-compound pages can be freed individually
> 			 * so each page needs to have its memcg set to get
> 			 * the accounting right.
> 			 */

OK, I must have misunderstood your __free_pages fix then. I thought this
was about compound pages. Btw. again I forgot about css ref counting so
here is an updated version.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 913c2b9e5c72..ec2c705f38fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3133,10 +3133,22 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 
 	memcg = get_mem_cgroup_from_current();
 	if (memcg && !mem_cgroup_is_root(memcg)) {
-		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
+		int nr_pages = 1 << order;
+		ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
 		if (!ret) {
 			page->memcg_data = (unsigned long)memcg |
 				MEMCG_DATA_KMEM;
+			if (nr_pages > 1) {
+				/*
+				 * comment goes here
+				 */
+				for (i = 1; i < nr_pages; i++) {
+					struct page * p = page + i;
+					p->memcg_data = page->memcg_data
+				}
+				/* Head page reference from get_mem_cgroup_from_current */
+				css_get_many(&memcg->css, nr_pages - 1);
+			}
 			return 0;
 		}
 		css_put(&memcg->css);

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
  2021-03-09  8:28     ` Michal Hocko
@ 2021-03-10 21:44       ` Singh, Balbir
  2021-03-10 22:00         ` Hugh Dickins
  0 siblings, 1 reply; 33+ messages in thread
From: Singh, Balbir @ 2021-03-10 21:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zhou Guanghui, linux-kernel, linux-mm, akpm, hannes, hughd,
	kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On 9/3/21 7:28 pm, Michal Hocko wrote:
> On Tue 09-03-21 09:37:29, Balbir Singh wrote:
>> On 4/3/21 6:40 pm, Zhou Guanghui wrote:
> [...]
>>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>  /*
>>> - * Because page_memcg(head) is not set on compound tails, set it now.
>>> + * Because page_memcg(head) is not set on tails, set it now.
>>>   */
>>> -void mem_cgroup_split_huge_fixup(struct page *head)
>>> +void split_page_memcg(struct page *head, unsigned int nr)
>>>  {
>>
>> Do we need input validation on nr? Can nr be aribtrary or can we enforce
>>
>> VM_BUG_ON(!is_power_of_2(nr));
> 
> In practice this will be power of 2 but why should we bother to sanitze
> that? 
> 

Just when DEBUG_VM is enabled to ensure the contract is valid, given that
nr is now variable, we could end up with subtle bugs unless we can audit
all callers. Even the power of 2 check does not catch the fact that nr
is indeed what we expect, but it still checks a large range of invalid
inputs.

Balbir Singh.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
  2021-03-10 21:44       ` Singh, Balbir
@ 2021-03-10 22:00         ` Hugh Dickins
  2021-03-10 23:50           ` Singh, Balbir
  0 siblings, 1 reply; 33+ messages in thread
From: Hugh Dickins @ 2021-03-10 22:00 UTC (permalink / raw)
  To: Singh, Balbir
  Cc: Michal Hocko, Zhou Guanghui, linux-kernel, linux-mm, akpm,
	hannes, hughd, kirill.shutemov, npiggin, ziy, wangkefeng.wang,
	guohanjun, dingtianhong, chenweilong, rui.xiang

On Thu, 11 Mar 2021, Singh, Balbir wrote:
> On 9/3/21 7:28 pm, Michal Hocko wrote:
> > On Tue 09-03-21 09:37:29, Balbir Singh wrote:
> >> On 4/3/21 6:40 pm, Zhou Guanghui wrote:
> > [...]
> >>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>  /*
> >>> - * Because page_memcg(head) is not set on compound tails, set it now.
> >>> + * Because page_memcg(head) is not set on tails, set it now.
> >>>   */
> >>> -void mem_cgroup_split_huge_fixup(struct page *head)
> >>> +void split_page_memcg(struct page *head, unsigned int nr)
> >>>  {
> >>
> >> Do we need input validation on nr? Can nr be aribtrary or can we enforce
> >>
> >> VM_BUG_ON(!is_power_of_2(nr));
> > 
> > In practice this will be power of 2 but why should we bother to sanitze
> > that? 
> > 
> 
> Just when DEBUG_VM is enabled to ensure the contract is valid, given that
> nr is now variable, we could end up with subtle bugs unless we can audit
> all callers. Even the power of 2 check does not catch the fact that nr
> is indeed what we expect, but it still checks a large range of invalid
> inputs.

I think you imagine this is something it's not.

"all callers" are __split_huge_page() and split_page() (maybe Matthew
will have a third caller, maybe not).  It is not something drivers will
be calling directly themselves, and it won't ever get EXPORTed to them.

Hugh

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg
  2021-03-10 22:00         ` Hugh Dickins
@ 2021-03-10 23:50           ` Singh, Balbir
  0 siblings, 0 replies; 33+ messages in thread
From: Singh, Balbir @ 2021-03-10 23:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Zhou Guanghui, linux-kernel, linux-mm, akpm,
	hannes, kirill.shutemov, npiggin, ziy, wangkefeng.wang,
	guohanjun, dingtianhong, chenweilong, rui.xiang

On 11/3/21 9:00 am, Hugh Dickins wrote:
> On Thu, 11 Mar 2021, Singh, Balbir wrote:
>> On 9/3/21 7:28 pm, Michal Hocko wrote:
>>> On Tue 09-03-21 09:37:29, Balbir Singh wrote:
>>>> On 4/3/21 6:40 pm, Zhou Guanghui wrote:
>>> [...]
>>>>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>  /*
>>>>> - * Because page_memcg(head) is not set on compound tails, set it now.
>>>>> + * Because page_memcg(head) is not set on tails, set it now.
>>>>>   */
>>>>> -void mem_cgroup_split_huge_fixup(struct page *head)
>>>>> +void split_page_memcg(struct page *head, unsigned int nr)
>>>>>  {
>>>>
>>>> Do we need input validation on nr? Can nr be aribtrary or can we enforce
>>>>
>>>> VM_BUG_ON(!is_power_of_2(nr));
>>>
>>> In practice this will be power of 2 but why should we bother to sanitze
>>> that? 
>>>
>>
>> Just when DEBUG_VM is enabled to ensure the contract is valid, given that
>> nr is now variable, we could end up with subtle bugs unless we can audit
>> all callers. Even the power of 2 check does not catch the fact that nr
>> is indeed what we expect, but it still checks a large range of invalid
>> inputs.
> 
> I think you imagine this is something it's not.
> 
> "all callers" are __split_huge_page() and split_page() (maybe Matthew
> will have a third caller, maybe not).  It is not something drivers will
> be calling directly themselves, and it won't ever get EXPORTed to them.
> 

Don't feel strongly about it if that is the case.

Thanks,
Balbir Singh


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-09 13:03         ` Michal Hocko
@ 2021-03-11  8:37           ` Michal Hocko
  2021-03-11 15:21             ` Johannes Weiner
  0 siblings, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-11  8:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zhou Guanghui, linux-kernel, linux-mm, akpm, hannes, hughd,
	kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

Johannes, Hugh,

what do you think about this approach? If we want to stick with
split_page approach then we need to update the missing place Matthew has
pointed out.

On Tue 09-03-21 14:03:36, Michal Hocko wrote:
> On Tue 09-03-21 12:32:55, Matthew Wilcox wrote:
> > On Tue, Mar 09, 2021 at 10:02:00AM +0100, Michal Hocko wrote:
> [...]
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 913c2b9e5c72..d44dea2b8d22 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> > >  	if (memcg && !mem_cgroup_is_root(memcg)) {
> > >  		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> > >  		if (!ret) {
> > > +			int nr_pages = 1 << order;
> > >  			page->memcg_data = (unsigned long)memcg |
> > >  				MEMCG_DATA_KMEM;
> > > +			
> > > +			/*
> > > +			 * Compound pages are normally split or freed
> > > +			 * via their head pages so memcg_data in in the
> > > +			 * head page should be sufficient but there
> > > +			 * are exceptions to the rule (see __free_pages).
> > > +			 * Non compound pages would need to copy memcg anyway.
> > > +			 */
> > > +			for (i = 1; i < nr_pages; i++) {
> > > +				struct page * p = page + i;
> > > +				p->memcg_data = page->memcg_data
> > > +			}
> > >  			return 0;
> > 
> > I would condition this loop on if (!(gfp & __GFP_COMP)), but yes, something
> > along these lines.  I might phrase the comment a little differently ...
> > 
> > 			/*
> > 			 * Compound pages are treated as a single unit,
> > 			 * but non-compound pages can be freed individually
> > 			 * so each page needs to have its memcg set to get
> > 			 * the accounting right.
> > 			 */
> 
> OK, I must have misunderstood your __free_pages fix then. I thought this
> was about compound pages. Btw. again I forgot about css ref counting so
> here is an updated version.
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 913c2b9e5c72..ec2c705f38fa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3133,10 +3133,22 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>  
>  	memcg = get_mem_cgroup_from_current();
>  	if (memcg && !mem_cgroup_is_root(memcg)) {
> -		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> +		int nr_pages = 1 << order;
> +		ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
>  		if (!ret) {
>  			page->memcg_data = (unsigned long)memcg |
>  				MEMCG_DATA_KMEM;
> +			if (nr_pages > 1) {
> +				/*
> +				 * comment goes here
> +				 */
> +				for (i = 1; i < nr_pages; i++) {
> +					struct page * p = page + i;
> +					p->memcg_data = page->memcg_data
> +				}
> +				/* Head page reference from get_mem_cgroup_from_current */
> +				css_get_many(&memcg->css, nr_pages - 1);
> +			}
>  			return 0;
>  		}
>  		css_put(&memcg->css);

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-11  8:37           ` Michal Hocko
@ 2021-03-11 15:21             ` Johannes Weiner
  2021-03-11 16:23               ` Matthew Wilcox
  2021-03-11 16:26               ` Michal Hocko
  0 siblings, 2 replies; 33+ messages in thread
From: Johannes Weiner @ 2021-03-11 15:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Zhou Guanghui, linux-kernel, linux-mm, akpm,
	hughd, kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> Johannes, Hugh,
> 
> what do you think about this approach? If we want to stick with
> split_page approach then we need to update the missing place Matthew has
> pointed out.

I find the __free_pages() code quite tricky as well. But for that
reason I would actually prefer to initiate the splitting in there,
since that's the place where we actually split the page, rather than
spread the handling of this situation further out.

The race condition shouldn't be hot, so I don't think we need to be as
efficient about setting page->memcg_data only on the higher-order
buddies as in Willy's scratch patch. We can call split_page_memcg(),
which IMO should actually help document what's happening to the page.

I think that function could also benefit a bit more from step-by-step
documentation about what's going on. The kerneldoc is helpful, but I
don't think it does justice to how tricky this race condition is.

Something like this?

void __free_pages(struct page *page, unsigned int order)
{
	/*
	 * Drop the base reference from __alloc_pages and free. In
	 * case there is an outstanding speculative reference, from
	 * e.g. the page cache, it will put and free the page later.
	 */
	if (likely(put_page_testzero(page))) {
		free_the_page(page, order);
		return;
	}

	/*
	 * The speculative reference will put and free the page.
	 *
	 * However, if the speculation was into a higher-order page
	 * that isn't marked compound, the other side will know
	 * nothing about our buddy pages and only free the order-0
	 * page at the start of our chunk! We must split off and free
	 * the buddy pages here.
	 *
	 * The buddy pages aren't individually refcounted, so they
	 * can't have any pending speculative references themselves.
	 */
	if (!PageHead(page) && order > 0) {
		split_page_memcg(page, 1 << order);
		while (order-- > 0)
			free_the_page(page + (1 << order), order);
	}
}

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-11 15:21             ` Johannes Weiner
@ 2021-03-11 16:23               ` Matthew Wilcox
  2021-03-11 16:26               ` Michal Hocko
  1 sibling, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2021-03-11 16:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Zhou Guanghui, linux-kernel, linux-mm, akpm, hughd,
	kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Thu, Mar 11, 2021 at 10:21:39AM -0500, Johannes Weiner wrote:
> On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > Johannes, Hugh,
> > 
> > what do you think about this approach? If we want to stick with
> > split_page approach then we need to update the missing place Matthew has
> > pointed out.
> 
> I find the __free_pages() code quite tricky as well. But for that
> reason I would actually prefer to initiate the splitting in there,
> since that's the place where we actually split the page, rather than
> spread the handling of this situation further out.

Mmm.  The thing is, we don't actually split the page because it was
never compound.  I don't know whether anybody actually does this,
but it's legitimate to write:

	struct page *p = alloc_pages(GFP_KERNEL, 2);

	free_unref_page(p + 1);
	free_unref_page(p + 3);
	free_unref_page(p + 2);
	__free_page(p);

The good news is that I recently made free_unref_page() local to
mm/internal.h, so we don't need to worry about device drivers doing this.
As far as I can tell, we don't have any exposure to this kind of thing
today through functions exported from mm, but I might have missed
something.

I'd really like to get rid of non-compound high-order pages.  Slab,
filesystems and anonymous memory all use compound pages.  I think
it's just crusty old device drivers that don't.  And alloc_pages_exact(),
of course, but that's kind of internal.

> The race condition shouldn't be hot, so I don't think we need to be as
> efficient about setting page->memcg_data only on the higher-order
> buddies as in Willy's scratch patch. We can call split_page_memcg(),
> which IMO should actually help document what's happening to the page.

I'm cool with that.  I agree, this is not a performance case!

> I think that function could also benefit a bit more from step-by-step
> documentation about what's going on. The kerneldoc is helpful, but I
> don't think it does justice to how tricky this race condition is.

Always good to have other people read over your explanation ...
the kernel-doc could probably be simplified as a result.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-11 15:21             ` Johannes Weiner
  2021-03-11 16:23               ` Matthew Wilcox
@ 2021-03-11 16:26               ` Michal Hocko
  2021-03-11 20:37                 ` Hugh Dickins
  1 sibling, 1 reply; 33+ messages in thread
From: Michal Hocko @ 2021-03-11 16:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Matthew Wilcox, Zhou Guanghui, linux-kernel, linux-mm, akpm,
	hughd, kirill.shutemov, npiggin, ziy, wangkefeng.wang, guohanjun,
	dingtianhong, chenweilong, rui.xiang

On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > Johannes, Hugh,
> > 
> > what do you think about this approach? If we want to stick with
> > split_page approach then we need to update the missing place Matthew has
> > pointed out.
> 
> I find the __free_pages() code quite tricky as well. But for that
> reason I would actually prefer to initiate the splitting in there,
> since that's the place where we actually split the page, rather than
> spread the handling of this situation further out.
> 
> The race condition shouldn't be hot, so I don't think we need to be as
> efficient about setting page->memcg_data only on the higher-order
> buddies as in Willy's scratch patch. We can call split_page_memcg(),
> which IMO should actually help document what's happening to the page.
> 
> I think that function could also benefit a bit more from step-by-step
> documentation about what's going on. The kerneldoc is helpful, but I
> don't think it does justice to how tricky this race condition is.
> 
> Something like this?
> 
> void __free_pages(struct page *page, unsigned int order)
> {
> 	/*
> 	 * Drop the base reference from __alloc_pages and free. In
> 	 * case there is an outstanding speculative reference, from
> 	 * e.g. the page cache, it will put and free the page later.
> 	 */
> 	if (likely(put_page_testzero(page))) {
> 		free_the_page(page, order);
> 		return;
> 	}
> 
> 	/*
> 	 * The speculative reference will put and free the page.
> 	 *
> 	 * However, if the speculation was into a higher-order page
> 	 * that isn't marked compound, the other side will know
> 	 * nothing about our buddy pages and only free the order-0
> 	 * page at the start of our chunk! We must split off and free
> 	 * the buddy pages here.
> 	 *
> 	 * The buddy pages aren't individually refcounted, so they
> 	 * can't have any pending speculative references themselves.
> 	 */
> 	if (!PageHead(page) && order > 0) {
> 		split_page_memcg(page, 1 << order);
> 		while (order-- > 0)
> 			free_the_page(page + (1 << order), order);
> 	}
> }

Fine with me. Mathew was concerned about more places that do something
similar but I would say that if we find out more places we might
reconsider and currently stay with a reasonably clear model that it is
only head patch that carries the memcg information and split_page_memcg
is necessary to break such page into smaller pieces.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-11 16:26               ` Michal Hocko
@ 2021-03-11 20:37                 ` Hugh Dickins
  2021-03-18 14:05                   ` Michal Hocko
  0 siblings, 1 reply; 33+ messages in thread
From: Hugh Dickins @ 2021-03-11 20:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Matthew Wilcox, Zhou Guanghui, linux-kernel,
	linux-mm, akpm, hughd, kirill.shutemov, npiggin, ziy,
	wangkefeng.wang, guohanjun, dingtianhong, chenweilong, rui.xiang

On Thu, 11 Mar 2021, Michal Hocko wrote:
> On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > Johannes, Hugh,
> > > 
> > > what do you think about this approach? If we want to stick with
> > > split_page approach then we need to update the missing place Matthew has
> > > pointed out.
> > 
> > I find the __free_pages() code quite tricky as well. But for that
> > reason I would actually prefer to initiate the splitting in there,
> > since that's the place where we actually split the page, rather than
> > spread the handling of this situation further out.
> > 
> > The race condition shouldn't be hot, so I don't think we need to be as
> > efficient about setting page->memcg_data only on the higher-order
> > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > which IMO should actually help document what's happening to the page.
> > 
> > I think that function could also benefit a bit more from step-by-step
> > documentation about what's going on. The kerneldoc is helpful, but I
> > don't think it does justice to how tricky this race condition is.
> > 
> > Something like this?
> > 
> > void __free_pages(struct page *page, unsigned int order)
> > {
> > 	/*
> > 	 * Drop the base reference from __alloc_pages and free. In
> > 	 * case there is an outstanding speculative reference, from
> > 	 * e.g. the page cache, it will put and free the page later.
> > 	 */
> > 	if (likely(put_page_testzero(page))) {
> > 		free_the_page(page, order);
> > 		return;
> > 	}
> > 
> > 	/*
> > 	 * The speculative reference will put and free the page.
> > 	 *
> > 	 * However, if the speculation was into a higher-order page
> > 	 * that isn't marked compound, the other side will know
> > 	 * nothing about our buddy pages and only free the order-0
> > 	 * page at the start of our chunk! We must split off and free
> > 	 * the buddy pages here.
> > 	 *
> > 	 * The buddy pages aren't individually refcounted, so they
> > 	 * can't have any pending speculative references themselves.
> > 	 */
> > 	if (!PageHead(page) && order > 0) {
> > 		split_page_memcg(page, 1 << order);
> > 		while (order-- > 0)
> > 			free_the_page(page + (1 << order), order);
> > 	}
> > }
> 
> Fine with me. Mathew was concerned about more places that do something
> similar but I would say that if we find out more places we might
> reconsider and currently stay with a reasonably clear model that it is
> only head patch that carries the memcg information and split_page_memcg
> is necessary to break such page into smaller pieces.

I agree: I do like Johannes' suggestion best, now that we already
have split_page_memcg().  Not too worried about contrived use of
free_unref_page() here; and whether non-compound high-order pages
should be perpetuated is a different discussion.

Hugh

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-11 20:37                 ` Hugh Dickins
@ 2021-03-18 14:05                   ` Michal Hocko
  2021-03-18 15:02                     ` Matthew Wilcox
  2021-03-18 15:07                     ` Johannes Weiner
  0 siblings, 2 replies; 33+ messages in thread
From: Michal Hocko @ 2021-03-18 14:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Johannes Weiner, Hugh Dickins, Zhou Guanghui, linux-kernel,
	linux-mm, akpm, kirill.shutemov, npiggin, ziy, wangkefeng.wang,
	guohanjun, dingtianhong, chenweilong, rui.xiang

On Thu 11-03-21 12:37:20, Hugh Dickins wrote:
> On Thu, 11 Mar 2021, Michal Hocko wrote:
> > On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > > Johannes, Hugh,
> > > > 
> > > > what do you think about this approach? If we want to stick with
> > > > split_page approach then we need to update the missing place Matthew has
> > > > pointed out.
> > > 
> > > I find the __free_pages() code quite tricky as well. But for that
> > > reason I would actually prefer to initiate the splitting in there,
> > > since that's the place where we actually split the page, rather than
> > > spread the handling of this situation further out.
> > > 
> > > The race condition shouldn't be hot, so I don't think we need to be as
> > > efficient about setting page->memcg_data only on the higher-order
> > > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > > which IMO should actually help document what's happening to the page.
> > > 
> > > I think that function could also benefit a bit more from step-by-step
> > > documentation about what's going on. The kerneldoc is helpful, but I
> > > don't think it does justice to how tricky this race condition is.
> > > 
> > > Something like this?
> > > 
> > > void __free_pages(struct page *page, unsigned int order)
> > > {
> > > 	/*
> > > 	 * Drop the base reference from __alloc_pages and free. In
> > > 	 * case there is an outstanding speculative reference, from
> > > 	 * e.g. the page cache, it will put and free the page later.
> > > 	 */
> > > 	if (likely(put_page_testzero(page))) {
> > > 		free_the_page(page, order);
> > > 		return;
> > > 	}
> > > 
> > > 	/*
> > > 	 * The speculative reference will put and free the page.
> > > 	 *
> > > 	 * However, if the speculation was into a higher-order page
> > > 	 * that isn't marked compound, the other side will know
> > > 	 * nothing about our buddy pages and only free the order-0
> > > 	 * page at the start of our chunk! We must split off and free
> > > 	 * the buddy pages here.
> > > 	 *
> > > 	 * The buddy pages aren't individually refcounted, so they
> > > 	 * can't have any pending speculative references themselves.
> > > 	 */
> > > 	if (!PageHead(page) && order > 0) {
> > > 		split_page_memcg(page, 1 << order);
> > > 		while (order-- > 0)
> > > 			free_the_page(page + (1 << order), order);
> > > 	}
> > > }
> > 
> > Fine with me. Mathew was concerned about more places that do something
> > similar but I would say that if we find out more places we might
> > reconsider and currently stay with a reasonably clear model that it is
> > only head patch that carries the memcg information and split_page_memcg
> > is necessary to break such page into smaller pieces.
> 
> I agree: I do like Johannes' suggestion best, now that we already
> have split_page_memcg().  Not too worried about contrived use of
> free_unref_page() here; and whether non-compound high-order pages
> should be perpetuated is a different discussion.

Matthew, are you planning to post a patch with suggested changes or
should I do it?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-18 14:05                   ` Michal Hocko
@ 2021-03-18 15:02                     ` Matthew Wilcox
  2021-03-18 15:07                     ` Johannes Weiner
  1 sibling, 0 replies; 33+ messages in thread
From: Matthew Wilcox @ 2021-03-18 15:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Hugh Dickins, Zhou Guanghui, linux-kernel,
	linux-mm, akpm, kirill.shutemov, npiggin, ziy, wangkefeng.wang,
	guohanjun, dingtianhong, chenweilong, rui.xiang

On Thu, Mar 18, 2021 at 03:05:00PM +0100, Michal Hocko wrote:
> On Thu 11-03-21 12:37:20, Hugh Dickins wrote:
> > On Thu, 11 Mar 2021, Michal Hocko wrote:
> > > On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > > > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > > > Johannes, Hugh,
> > > > > 
> > > > > what do you think about this approach? If we want to stick with
> > > > > split_page approach then we need to update the missing place Matthew has
> > > > > pointed out.
> > > > 
> > > > I find the __free_pages() code quite tricky as well. But for that
> > > > reason I would actually prefer to initiate the splitting in there,
> > > > since that's the place where we actually split the page, rather than
> > > > spread the handling of this situation further out.
> > > > 
> > > > The race condition shouldn't be hot, so I don't think we need to be as
> > > > efficient about setting page->memcg_data only on the higher-order
> > > > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > > > which IMO should actually help document what's happening to the page.
> > > > 
> > > > I think that function could also benefit a bit more from step-by-step
> > > > documentation about what's going on. The kerneldoc is helpful, but I
> > > > don't think it does justice to how tricky this race condition is.
> > > > 
> > > > Something like this?
> > > > 
> > > > void __free_pages(struct page *page, unsigned int order)
> > > > {
> > > > 	/*
> > > > 	 * Drop the base reference from __alloc_pages and free. In
> > > > 	 * case there is an outstanding speculative reference, from
> > > > 	 * e.g. the page cache, it will put and free the page later.
> > > > 	 */
> > > > 	if (likely(put_page_testzero(page))) {
> > > > 		free_the_page(page, order);
> > > > 		return;
> > > > 	}
> > > > 
> > > > 	/*
> > > > 	 * The speculative reference will put and free the page.
> > > > 	 *
> > > > 	 * However, if the speculation was into a higher-order page
> > > > 	 * that isn't marked compound, the other side will know
> > > > 	 * nothing about our buddy pages and only free the order-0
> > > > 	 * page at the start of our chunk! We must split off and free
> > > > 	 * the buddy pages here.
> > > > 	 *
> > > > 	 * The buddy pages aren't individually refcounted, so they
> > > > 	 * can't have any pending speculative references themselves.
> > > > 	 */
> > > > 	if (!PageHead(page) && order > 0) {
> > > > 		split_page_memcg(page, 1 << order);
> > > > 		while (order-- > 0)
> > > > 			free_the_page(page + (1 << order), order);
> > > > 	}
> > > > }
> > > 
> > > Fine with me. Mathew was concerned about more places that do something
> > > similar but I would say that if we find out more places we might
> > > reconsider and currently stay with a reasonably clear model that it is
> > > only head patch that carries the memcg information and split_page_memcg
> > > is necessary to break such page into smaller pieces.
> > 
> > I agree: I do like Johannes' suggestion best, now that we already
> > have split_page_memcg().  Not too worried about contrived use of
> > free_unref_page() here; and whether non-compound high-order pages
> > should be perpetuated is a different discussion.
> 
> Matthew, are you planning to post a patch with suggested changes or
> should I do it?

I'm busy with the folio work; could you do it please?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/2] mm/memcg: set memcg when split page
  2021-03-18 14:05                   ` Michal Hocko
  2021-03-18 15:02                     ` Matthew Wilcox
@ 2021-03-18 15:07                     ` Johannes Weiner
  1 sibling, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2021-03-18 15:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Hugh Dickins, Zhou Guanghui, linux-kernel,
	linux-mm, akpm, kirill.shutemov, npiggin, ziy, wangkefeng.wang,
	guohanjun, dingtianhong, chenweilong, rui.xiang

On Thu, Mar 18, 2021 at 03:05:00PM +0100, Michal Hocko wrote:
> On Thu 11-03-21 12:37:20, Hugh Dickins wrote:
> > On Thu, 11 Mar 2021, Michal Hocko wrote:
> > > On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > > > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > > > Johannes, Hugh,
> > > > > 
> > > > > what do you think about this approach? If we want to stick with
> > > > > split_page approach then we need to update the missing place Matthew has
> > > > > pointed out.
> > > > 
> > > > I find the __free_pages() code quite tricky as well. But for that
> > > > reason I would actually prefer to initiate the splitting in there,
> > > > since that's the place where we actually split the page, rather than
> > > > spread the handling of this situation further out.
> > > > 
> > > > The race condition shouldn't be hot, so I don't think we need to be as
> > > > efficient about setting page->memcg_data only on the higher-order
> > > > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > > > which IMO should actually help document what's happening to the page.
> > > > 
> > > > I think that function could also benefit a bit more from step-by-step
> > > > documentation about what's going on. The kerneldoc is helpful, but I
> > > > don't think it does justice to how tricky this race condition is.
> > > > 
> > > > Something like this?
> > > > 
> > > > void __free_pages(struct page *page, unsigned int order)
> > > > {
> > > > 	/*
> > > > 	 * Drop the base reference from __alloc_pages and free. In
> > > > 	 * case there is an outstanding speculative reference, from
> > > > 	 * e.g. the page cache, it will put and free the page later.
> > > > 	 */
> > > > 	if (likely(put_page_testzero(page))) {
> > > > 		free_the_page(page, order);
> > > > 		return;
> > > > 	}
> > > > 
> > > > 	/*
> > > > 	 * The speculative reference will put and free the page.
> > > > 	 *
> > > > 	 * However, if the speculation was into a higher-order page
> > > > 	 * that isn't marked compound, the other side will know
> > > > 	 * nothing about our buddy pages and only free the order-0
> > > > 	 * page at the start of our chunk! We must split off and free
> > > > 	 * the buddy pages here.
> > > > 	 *
> > > > 	 * The buddy pages aren't individually refcounted, so they
> > > > 	 * can't have any pending speculative references themselves.
> > > > 	 */
> > > > 	if (!PageHead(page) && order > 0) {
> > > > 		split_page_memcg(page, 1 << order);
> > > > 		while (order-- > 0)
> > > > 			free_the_page(page + (1 << order), order);
> > > > 	}
> > > > }
> > > 
> > > Fine with me. Mathew was concerned about more places that do something
> > > similar but I would say that if we find out more places we might
> > > reconsider and currently stay with a reasonably clear model that it is
> > > only head patch that carries the memcg information and split_page_memcg
> > > is necessary to break such page into smaller pieces.
> > 
> > I agree: I do like Johannes' suggestion best, now that we already
> > have split_page_memcg().  Not too worried about contrived use of
> > free_unref_page() here; and whether non-compound high-order pages
> > should be perpetuated is a different discussion.
> 
> Matthew, are you planning to post a patch with suggested changes or
> should I do it?

I'll post a proper patch.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2021-03-18 15:08 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-04  7:40 [PATCH v2 0/2] set memcg when split page Zhou Guanghui
2021-03-04  7:40 ` [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg Zhou Guanghui
2021-03-04 15:50   ` Johannes Weiner
2021-03-04 16:20   ` Zi Yan
2021-03-04 18:54   ` Shakeel Butt
2021-03-05 11:48   ` Michal Hocko
2021-03-08 22:37   ` Singh, Balbir
2021-03-09  8:28     ` Michal Hocko
2021-03-10 21:44       ` Singh, Balbir
2021-03-10 22:00         ` Hugh Dickins
2021-03-10 23:50           ` Singh, Balbir
2021-03-04  7:40 ` [PATCH v2 2/2] mm/memcg: set memcg when split page Zhou Guanghui
2021-03-04 15:52   ` Johannes Weiner
2021-03-04 16:22   ` Zi Yan
2021-03-04 18:55   ` Shakeel Butt
2021-03-05 11:52   ` Michal Hocko
2021-03-05 23:58     ` Andrew Morton
2021-03-08  8:41       ` Michal Hocko
2021-03-08 20:42         ` Andrew Morton
2021-03-08 20:47           ` Matthew Wilcox
2021-03-09  0:10             ` Andrew Morton
2021-03-08 21:02   ` Matthew Wilcox
2021-03-09  9:02     ` Michal Hocko
2021-03-09 12:32       ` Matthew Wilcox
2021-03-09 13:03         ` Michal Hocko
2021-03-11  8:37           ` Michal Hocko
2021-03-11 15:21             ` Johannes Weiner
2021-03-11 16:23               ` Matthew Wilcox
2021-03-11 16:26               ` Michal Hocko
2021-03-11 20:37                 ` Hugh Dickins
2021-03-18 14:05                   ` Michal Hocko
2021-03-18 15:02                     ` Matthew Wilcox
2021-03-18 15:07                     ` Johannes Weiner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).