[v2,2/2] mm/memcg: set memcg when split page
diff mbox series

Message ID 20210304074053.65527-3-zhouguanghui1@huawei.com
State Accepted
Commit e1baddf8475b06cc56f4bafecf9a32a124343d9f
Headers show
Series
  • set memcg when split page
Related show

Commit Message

Zhou Guanghui March 4, 2021, 7:40 a.m. UTC
As described in the split_page function comment, for the non-compound
high order page, the sub-pages must be freed individually. If the
memcg of the fisrt page is valid, the tail pages cannot be uncharged
when be freed.

For example, when alloc_pages_exact is used to allocate 1MB continuous
physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
set). When make_alloc_exact free the unused 1MB and free_pages_exact
free the applied 1MB, actually, only 4KB(one page) is uncharged.

Therefore, the memcg of the tail page needs to be set when split page.

Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
---
 mm/page_alloc.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Johannes Weiner March 4, 2021, 3:52 p.m. UTC | #1
On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
> 
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
> 
> Therefore, the memcg of the tail page needs to be set when split page.
> 
> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Zi Yan March 4, 2021, 4:22 p.m. UTC | #2
On 4 Mar 2021, at 2:40, Zhou Guanghui wrote:

> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged

s/fisrt/first/

> when be freed.
>
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
>
> Therefore, the memcg of the tail page needs to be set when split page.
>
> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
> ---
>  mm/page_alloc.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..3ed783e25c3c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3310,6 +3310,7 @@ void split_page(struct page *page, unsigned int order)
>  	for (i = 1; i < (1 << order); i++)
>  		set_page_refcounted(page + i);
>  	split_page_owner(page, 1 << order);
> +	split_page_memcg(page, 1 << order);
>  }
>  EXPORT_SYMBOL_GPL(split_page);
>
> -- 
> 2.25.0

LGTM. Thanks.

Reviewed-by: Zi Yan <ziy@nvidia.com>


—
Best Regards,
Yan Zi
Shakeel Butt March 4, 2021, 6:55 p.m. UTC | #3
On Wed, Mar 3, 2021 at 11:57 PM Zhou Guanghui <zhouguanghui1@huawei.com> wrote:
>
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
>
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
>
> Therefore, the memcg of the tail page needs to be set when split page.
>
> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>
Michal Hocko March 5, 2021, 11:52 a.m. UTC | #4
On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
> 
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
> 
> Therefore, the memcg of the tail page needs to be set when split page.
> 

As already mentioned there are at least two explicit users of
__GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
mention that explicitly and maybe even mention 7efe8ef274024 resp.
c419621873713 so that it is clear this is not just a theoretical issue.

> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>

Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..3ed783e25c3c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3310,6 +3310,7 @@ void split_page(struct page *page, unsigned int order)
>  	for (i = 1; i < (1 << order); i++)
>  		set_page_refcounted(page + i);
>  	split_page_owner(page, 1 << order);
> +	split_page_memcg(page, 1 << order);
>  }
>  EXPORT_SYMBOL_GPL(split_page);
>  
> -- 
> 2.25.0
>
Andrew Morton March 5, 2021, 11:58 p.m. UTC | #5
On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > As described in the split_page function comment, for the non-compound
> > high order page, the sub-pages must be freed individually. If the
> > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > when be freed.
> > 
> > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > 
> > Therefore, the memcg of the tail page needs to be set when split page.
> > 
> 
> As already mentioned there are at least two explicit users of
> __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> mention that explicitly and maybe even mention 7efe8ef274024 resp.
> c419621873713 so that it is clear this is not just a theoretical issue.

I added

: Michel:
: 
: There are at least two explicit users of __GFP_ACCOUNT with
: alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
: Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
: ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
: just a theoretical issue.

And should we cc:stable on this one?
Michal Hocko March 8, 2021, 8:41 a.m. UTC | #6
On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > As described in the split_page function comment, for the non-compound
> > > high order page, the sub-pages must be freed individually. If the
> > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > when be freed.
> > > 
> > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > > 
> > > Therefore, the memcg of the tail page needs to be set when split page.
> > > 
> > 
> > As already mentioned there are at least two explicit users of
> > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > c419621873713 so that it is clear this is not just a theoretical issue.
> 
> I added
> 
> : Michel:
> : 
> : There are at least two explicit users of __GFP_ACCOUNT with
> : alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
> : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> : just a theoretical issue.
> 
> And should we cc:stable on this one?

Somebody more familiar with iommu dma allocation layer should have a
look as well (__iommu_dma_alloc_pages) so that we know whether there are
kernels outside of the above two ones mentioned above that need a fix.
But in general this sounds like a good fit for the stable tree.
Andrew Morton March 8, 2021, 8:42 p.m. UTC | #7
On Mon, 8 Mar 2021 09:41:38 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> > On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <mhocko@suse.com> wrote:
> > 
> > > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > > As described in the split_page function comment, for the non-compound
> > > > high order page, the sub-pages must be freed individually. If the
> > > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > > when be freed.
> > > > 
> > > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > > > 
> > > > Therefore, the memcg of the tail page needs to be set when split page.
> > > > 
> > > 
> > > As already mentioned there are at least two explicit users of
> > > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > > c419621873713 so that it is clear this is not just a theoretical issue.
> > 
> > I added
> > 
> > : Michel:
> > : 
> > : There are at least two explicit users of __GFP_ACCOUNT with
> > : alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
> > : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> > : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> > : just a theoretical issue.
> > 
> > And should we cc:stable on this one?
> 
> Somebody more familiar with iommu dma allocation layer should have a
> look as well (__iommu_dma_alloc_pages) so that we know whether there are
> kernels outside of the above two ones mentioned above that need a fix.
> But in general this sounds like a good fit for the stable tree.

OK.  I reversed the order of these two patches so we don't need to
burden -stable with a cosmetic rename.
Matthew Wilcox (Oracle) March 8, 2021, 8:47 p.m. UTC | #8
On Mon, Mar 08, 2021 at 12:42:27PM -0800, Andrew Morton wrote:
> On Mon, 8 Mar 2021 09:41:38 +0100 Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> > > On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <mhocko@suse.com> wrote:
> > > 
> > > > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > > > As described in the split_page function comment, for the non-compound
> > > > > high order page, the sub-pages must be freed individually. If the
> > > > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > > > when be freed.
> > > > > 
> > > > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > > > > 
> > > > > Therefore, the memcg of the tail page needs to be set when split page.
> > > > > 
> > > > 
> > > > As already mentioned there are at least two explicit users of
> > > > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > > > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > > > c419621873713 so that it is clear this is not just a theoretical issue.
> > > 
> > > I added
> > > 
> > > : Michel:
> > > : 
> > > : There are at least two explicit users of __GFP_ACCOUNT with
> > > : alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
> > > : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> > > : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> > > : just a theoretical issue.
> > > 
> > > And should we cc:stable on this one?
> > 
> > Somebody more familiar with iommu dma allocation layer should have a
> > look as well (__iommu_dma_alloc_pages) so that we know whether there are
> > kernels outside of the above two ones mentioned above that need a fix.
> > But in general this sounds like a good fit for the stable tree.
> 
> OK.  I reversed the order of these two patches so we don't need to
> burden -stable with a cosmetic rename.

Eek, no.

The alloc_pages_exact() is done to pages that _aren't_ compound.
So you have to pass the number of pages to the memcg split function,
because a non-compound page doesn't know the size of its allocation.
Matthew Wilcox (Oracle) March 8, 2021, 9:02 p.m. UTC | #9
On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged
> when be freed.
> 
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
> 
> Therefore, the memcg of the tail page needs to be set when split page.

There's another place we need to do this to ...

+++ b/mm/page_alloc.c
@@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int order)
 {
        if (put_page_testzero(page))
                free_the_page(page, order);
-       else if (!PageHead(page))
-               while (order-- > 0)
-                       free_the_page(page + (1 << order), order);
+       else if (!PageHead(page)) {
+               while (order-- > 0) {
+                       struct page *tail = page + (1 << order);
+#ifdef CONFIG_MEMCG
+                       tail->memcg_data = page->memcg_data;
+#endif
+                       free_the_page(tail, order);
+               }
+       }
 }
 EXPORT_SYMBOL(__free_pages);
 

I wonder if we shouldn't initialise memcg_data on all subsequent pages
of non-compound allocations instead?  Because I'm not sure this is the
only place that needs to be fixed.
Andrew Morton March 9, 2021, 12:10 a.m. UTC | #10
On Mon, 8 Mar 2021 20:47:31 +0000 Matthew Wilcox <willy@infradead.org> wrote:

> On Mon, Mar 08, 2021 at 12:42:27PM -0800, Andrew Morton wrote:
> > On Mon, 8 Mar 2021 09:41:38 +0100 Michal Hocko <mhocko@suse.com> wrote:
> > 
> > > On Fri 05-03-21 15:58:40, Andrew Morton wrote:
> > > > On Fri, 5 Mar 2021 12:52:52 +0100 Michal Hocko <mhocko@suse.com> wrote:
> > > > 
> > > > > On Thu 04-03-21 07:40:53, Zhou Guanghui wrote:
> > > > > > As described in the split_page function comment, for the non-compound
> > > > > > high order page, the sub-pages must be freed individually. If the
> > > > > > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > > > > > when be freed.
> > > > > > 
> > > > > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > > > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > > > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > > > > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > > > > > 
> > > > > > Therefore, the memcg of the tail page needs to be set when split page.
> > > > > > 
> > > > > 
> > > > > As already mentioned there are at least two explicit users of
> > > > > __GFP_ACCOUNT with alloc_exact_pages added recently. It would be good to
> > > > > mention that explicitly and maybe even mention 7efe8ef274024 resp.
> > > > > c419621873713 so that it is clear this is not just a theoretical issue.
> > > > 
> > > > I added
> > > > 
> > > > : Michel:
> > > > : 
> > > > : There are at least two explicit users of __GFP_ACCOUNT with
> > > > : alloc_exact_pages added recently.  See 7efe8ef274024 ("KVM: arm64:
> > > > : Allocate stage-2 pgd pages with GFP_KERNEL_ACCOUNT") and c419621873713
> > > > : ("KVM: s390: Add memcg accounting to KVM allocations"), so this is not
> > > > : just a theoretical issue.
> > > > 
> > > > And should we cc:stable on this one?
> > > 
> > > Somebody more familiar with iommu dma allocation layer should have a
> > > look as well (__iommu_dma_alloc_pages) so that we know whether there are
> > > kernels outside of the above two ones mentioned above that need a fix.
> > > But in general this sounds like a good fit for the stable tree.
> > 
> > OK.  I reversed the order of these two patches so we don't need to
> > burden -stable with a cosmetic rename.
> 
> Eek, no.
> 
> The alloc_pages_exact() is done to pages that _aren't_ compound.
> So you have to pass the number of pages to the memcg split function,
> because a non-compound page doesn't know the size of its allocation.

Ah, OK, the patch title fooled me.

It should have been a three-patch series, really

1: add nr_pages arg to mem_cgroup_split_huge_fixup()
2: call mem_cgroup_split_huge_fixup() when splitting
3: rename mem_cgroup_split_huge_fixup() to split_page_memcg()

That way, the third cosmetic patch could be deferred so we don't feed
the cosmetic renaming into -stable.

But whatever, the rename isn't a big deal so I'll go with the 2-patch
series as sent for -stable.
Michal Hocko March 9, 2021, 9:02 a.m. UTC | #11
On Mon 08-03-21 21:02:25, Matthew Wilcox wrote:
> On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> > As described in the split_page function comment, for the non-compound
> > high order page, the sub-pages must be freed individually. If the
> > memcg of the fisrt page is valid, the tail pages cannot be uncharged
> > when be freed.
> > 
> > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > free the applied 1MB, actually, only 4KB(one page) is uncharged.
> > 
> > Therefore, the memcg of the tail page needs to be set when split page.
> 
> There's another place we need to do this to ...
> 
> +++ b/mm/page_alloc.c
> @@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int order)
>  {
>         if (put_page_testzero(page))
>                 free_the_page(page, order);
> -       else if (!PageHead(page))
> -               while (order-- > 0)
> -                       free_the_page(page + (1 << order), order);
> +       else if (!PageHead(page)) {
> +               while (order-- > 0) {
> +                       struct page *tail = page + (1 << order);
> +#ifdef CONFIG_MEMCG
> +                       tail->memcg_data = page->memcg_data;
> +#endif
> +                       free_the_page(tail, order);
> +               }
> +       }
>  }
>  EXPORT_SYMBOL(__free_pages);

Hmm, I was not aware of this code. This is really a tricky code.

> I wonder if we shouldn't initialise memcg_data on all subsequent pages
> of non-compound allocations instead?  Because I'm not sure this is the
> only place that needs to be fixed.

That would be safer for sure. Do you mean this as a replacement to the
original patch?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 913c2b9e5c72..d44dea2b8d22 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 	if (memcg && !mem_cgroup_is_root(memcg)) {
 		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
 		if (!ret) {
+			int nr_pages = 1 << order;
 			page->memcg_data = (unsigned long)memcg |
 				MEMCG_DATA_KMEM;
+			
+			/*
+			 * Compound pages are normally split or freed
+			 * via their head pages so memcg_data in in the
+			 * head page should be sufficient but there
+			 * are exceptions to the rule (see __free_pages).
+			 * Non compound pages would need to copy memcg anyway.
+			 */
+			for (i = 1; i < nr_pages; i++) {
+				struct page * p = page + i;
+				p->memcg_data = page->memcg_data
+			}
 			return 0;
 		}
 		css_put(&memcg->css);
Matthew Wilcox (Oracle) March 9, 2021, 12:32 p.m. UTC | #12
On Tue, Mar 09, 2021 at 10:02:00AM +0100, Michal Hocko wrote:
> On Mon 08-03-21 21:02:25, Matthew Wilcox wrote:
> > On Thu, Mar 04, 2021 at 07:40:53AM +0000, Zhou Guanghui wrote:
> > > For example, when alloc_pages_exact is used to allocate 1MB continuous
> > > physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> > > set). When make_alloc_exact free the unused 1MB and free_pages_exact
> > > free the applied 1MB, actually, only 4KB(one page) is uncharged.

> > @@ -5081,9 +5081,15 @@ void __free_pages(struct page *page, unsigned int order)
> >  {
> >         if (put_page_testzero(page))
> >                 free_the_page(page, order);
> > -       else if (!PageHead(page))
> > -               while (order-- > 0)
> > -                       free_the_page(page + (1 << order), order);
> > +       else if (!PageHead(page)) {
> > +               while (order-- > 0) {
> > +                       struct page *tail = page + (1 << order);
> > +#ifdef CONFIG_MEMCG
> > +                       tail->memcg_data = page->memcg_data;
> > +#endif
> > +                       free_the_page(tail, order);
> > +               }
> > +       }
> >  }
> >  EXPORT_SYMBOL(__free_pages);
> 
> Hmm, I was not aware of this code. This is really a tricky code.

Yes.  I only added it recently.  I don't see a better way to solve this
problem.  We could turn the non-compound page into a compound page at
this point, but I'm not sure that's really less tricky.

> > I wonder if we shouldn't initialise memcg_data on all subsequent pages
> > of non-compound allocations instead?  Because I'm not sure this is the
> > only place that needs to be fixed.
> 
> That would be safer for sure. Do you mean this as a replacement to the
> original patch?
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 913c2b9e5c72..d44dea2b8d22 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>  	if (memcg && !mem_cgroup_is_root(memcg)) {
>  		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
>  		if (!ret) {
> +			int nr_pages = 1 << order;
>  			page->memcg_data = (unsigned long)memcg |
>  				MEMCG_DATA_KMEM;
> +			
> +			/*
> +			 * Compound pages are normally split or freed
> +			 * via their head pages so memcg_data in in the
> +			 * head page should be sufficient but there
> +			 * are exceptions to the rule (see __free_pages).
> +			 * Non compound pages would need to copy memcg anyway.
> +			 */
> +			for (i = 1; i < nr_pages; i++) {
> +				struct page * p = page + i;
> +				p->memcg_data = page->memcg_data
> +			}
>  			return 0;

I would condition this loop on if (!(gfp & __GFP_COMP)), but yes, something
along these lines.  I might phrase the comment a little differently ...

			/*
			 * Compound pages are treated as a single unit,
			 * but non-compound pages can be freed individually
			 * so each page needs to have its memcg set to get
			 * the accounting right.
			 */
Michal Hocko March 9, 2021, 1:03 p.m. UTC | #13
On Tue 09-03-21 12:32:55, Matthew Wilcox wrote:
> On Tue, Mar 09, 2021 at 10:02:00AM +0100, Michal Hocko wrote:
[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 913c2b9e5c72..d44dea2b8d22 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> >  	if (memcg && !mem_cgroup_is_root(memcg)) {
> >  		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> >  		if (!ret) {
> > +			int nr_pages = 1 << order;
> >  			page->memcg_data = (unsigned long)memcg |
> >  				MEMCG_DATA_KMEM;
> > +			
> > +			/*
> > +			 * Compound pages are normally split or freed
> > +			 * via their head pages so memcg_data in in the
> > +			 * head page should be sufficient but there
> > +			 * are exceptions to the rule (see __free_pages).
> > +			 * Non compound pages would need to copy memcg anyway.
> > +			 */
> > +			for (i = 1; i < nr_pages; i++) {
> > +				struct page * p = page + i;
> > +				p->memcg_data = page->memcg_data
> > +			}
> >  			return 0;
> 
> I would condition this loop on if (!(gfp & __GFP_COMP)), but yes, something
> along these lines.  I might phrase the comment a little differently ...
> 
> 			/*
> 			 * Compound pages are treated as a single unit,
> 			 * but non-compound pages can be freed individually
> 			 * so each page needs to have its memcg set to get
> 			 * the accounting right.
> 			 */

OK, I must have misunderstood your __free_pages fix then. I thought this
was about compound pages. Btw. again I forgot about css ref counting so
here is an updated version.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 913c2b9e5c72..ec2c705f38fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3133,10 +3133,22 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 
 	memcg = get_mem_cgroup_from_current();
 	if (memcg && !mem_cgroup_is_root(memcg)) {
-		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
+		int nr_pages = 1 << order;
+		ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
 		if (!ret) {
 			page->memcg_data = (unsigned long)memcg |
 				MEMCG_DATA_KMEM;
+			if (nr_pages > 1) {
+				/*
+				 * comment goes here
+				 */
+				for (i = 1; i < nr_pages; i++) {
+					struct page * p = page + i;
+					p->memcg_data = page->memcg_data
+				}
+				/* Head page reference from get_mem_cgroup_from_current */
+				css_get_many(&memcg->css, nr_pages - 1);
+			}
 			return 0;
 		}
 		css_put(&memcg->css);
Michal Hocko March 11, 2021, 8:37 a.m. UTC | #14
Johannes, Hugh,

what do you think about this approach? If we want to stick with
split_page approach then we need to update the missing place Matthew has
pointed out.

On Tue 09-03-21 14:03:36, Michal Hocko wrote:
> On Tue 09-03-21 12:32:55, Matthew Wilcox wrote:
> > On Tue, Mar 09, 2021 at 10:02:00AM +0100, Michal Hocko wrote:
> [...]
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 913c2b9e5c72..d44dea2b8d22 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -3135,8 +3135,21 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> > >  	if (memcg && !mem_cgroup_is_root(memcg)) {
> > >  		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> > >  		if (!ret) {
> > > +			int nr_pages = 1 << order;
> > >  			page->memcg_data = (unsigned long)memcg |
> > >  				MEMCG_DATA_KMEM;
> > > +			
> > > +			/*
> > > +			 * Compound pages are normally split or freed
> > > +			 * via their head pages so memcg_data in in the
> > > +			 * head page should be sufficient but there
> > > +			 * are exceptions to the rule (see __free_pages).
> > > +			 * Non compound pages would need to copy memcg anyway.
> > > +			 */
> > > +			for (i = 1; i < nr_pages; i++) {
> > > +				struct page * p = page + i;
> > > +				p->memcg_data = page->memcg_data
> > > +			}
> > >  			return 0;
> > 
> > I would condition this loop on if (!(gfp & __GFP_COMP)), but yes, something
> > along these lines.  I might phrase the comment a little differently ...
> > 
> > 			/*
> > 			 * Compound pages are treated as a single unit,
> > 			 * but non-compound pages can be freed individually
> > 			 * so each page needs to have its memcg set to get
> > 			 * the accounting right.
> > 			 */
> 
> OK, I must have misunderstood your __free_pages fix then. I thought this
> was about compound pages. Btw. again I forgot about css ref counting so
> here is an updated version.
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 913c2b9e5c72..ec2c705f38fa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3133,10 +3133,22 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>  
>  	memcg = get_mem_cgroup_from_current();
>  	if (memcg && !mem_cgroup_is_root(memcg)) {
> -		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> +		int nr_pages = 1 << order;
> +		ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
>  		if (!ret) {
>  			page->memcg_data = (unsigned long)memcg |
>  				MEMCG_DATA_KMEM;
> +			if (nr_pages > 1) {
> +				/*
> +				 * comment goes here
> +				 */
> +				for (i = 1; i < nr_pages; i++) {
> +					struct page * p = page + i;
> +					p->memcg_data = page->memcg_data
> +				}
> +				/* Head page reference from get_mem_cgroup_from_current */
> +				css_get_many(&memcg->css, nr_pages - 1);
> +			}
>  			return 0;
>  		}
>  		css_put(&memcg->css);
Johannes Weiner March 11, 2021, 3:21 p.m. UTC | #15
On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> Johannes, Hugh,
> 
> what do you think about this approach? If we want to stick with
> split_page approach then we need to update the missing place Matthew has
> pointed out.

I find the __free_pages() code quite tricky as well. But for that
reason I would actually prefer to initiate the splitting in there,
since that's the place where we actually split the page, rather than
spread the handling of this situation further out.

The race condition shouldn't be hot, so I don't think we need to be as
efficient about setting page->memcg_data only on the higher-order
buddies as in Willy's scratch patch. We can call split_page_memcg(),
which IMO should actually help document what's happening to the page.

I think that function could also benefit a bit more from step-by-step
documentation about what's going on. The kerneldoc is helpful, but I
don't think it does justice to how tricky this race condition is.

Something like this?

void __free_pages(struct page *page, unsigned int order)
{
	/*
	 * Drop the base reference from __alloc_pages and free. In
	 * case there is an outstanding speculative reference, from
	 * e.g. the page cache, it will put and free the page later.
	 */
	if (likely(put_page_testzero(page))) {
		free_the_page(page, order);
		return;
	}

	/*
	 * The speculative reference will put and free the page.
	 *
	 * However, if the speculation was into a higher-order page
	 * that isn't marked compound, the other side will know
	 * nothing about our buddy pages and only free the order-0
	 * page at the start of our chunk! We must split off and free
	 * the buddy pages here.
	 *
	 * The buddy pages aren't individually refcounted, so they
	 * can't have any pending speculative references themselves.
	 */
	if (!PageHead(page) && order > 0) {
		split_page_memcg(page, 1 << order);
		while (order-- > 0)
			free_the_page(page + (1 << order), order);
	}
}
Matthew Wilcox (Oracle) March 11, 2021, 4:23 p.m. UTC | #16
On Thu, Mar 11, 2021 at 10:21:39AM -0500, Johannes Weiner wrote:
> On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > Johannes, Hugh,
> > 
> > what do you think about this approach? If we want to stick with
> > split_page approach then we need to update the missing place Matthew has
> > pointed out.
> 
> I find the __free_pages() code quite tricky as well. But for that
> reason I would actually prefer to initiate the splitting in there,
> since that's the place where we actually split the page, rather than
> spread the handling of this situation further out.

Mmm.  The thing is, we don't actually split the page because it was
never compound.  I don't know whether anybody actually does this,
but it's legitimate to write:

	struct page *p = alloc_pages(GFP_KERNEL, 2);

	free_unref_page(p + 1);
	free_unref_page(p + 3);
	free_unref_page(p + 2);
	__free_page(p);

The good news is that I recently made free_unref_page() local to
mm/internal.h, so we don't need to worry about device drivers doing this.
As far as I can tell, we don't have any exposure to this kind of thing
today through functions exported from mm, but I might have missed
something.

I'd really like to get rid of non-compound high-order pages.  Slab,
filesystems and anonymous memory all use compound pages.  I think
it's just crusty old device drivers that don't.  And alloc_pages_exact(),
of course, but that's kind of internal.

> The race condition shouldn't be hot, so I don't think we need to be as
> efficient about setting page->memcg_data only on the higher-order
> buddies as in Willy's scratch patch. We can call split_page_memcg(),
> which IMO should actually help document what's happening to the page.

I'm cool with that.  I agree, this is not a performance case!

> I think that function could also benefit a bit more from step-by-step
> documentation about what's going on. The kerneldoc is helpful, but I
> don't think it does justice to how tricky this race condition is.

Always good to have other people read over your explanation ...
the kernel-doc could probably be simplified as a result.
Michal Hocko March 11, 2021, 4:26 p.m. UTC | #17
On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > Johannes, Hugh,
> > 
> > what do you think about this approach? If we want to stick with
> > split_page approach then we need to update the missing place Matthew has
> > pointed out.
> 
> I find the __free_pages() code quite tricky as well. But for that
> reason I would actually prefer to initiate the splitting in there,
> since that's the place where we actually split the page, rather than
> spread the handling of this situation further out.
> 
> The race condition shouldn't be hot, so I don't think we need to be as
> efficient about setting page->memcg_data only on the higher-order
> buddies as in Willy's scratch patch. We can call split_page_memcg(),
> which IMO should actually help document what's happening to the page.
> 
> I think that function could also benefit a bit more from step-by-step
> documentation about what's going on. The kerneldoc is helpful, but I
> don't think it does justice to how tricky this race condition is.
> 
> Something like this?
> 
> void __free_pages(struct page *page, unsigned int order)
> {
> 	/*
> 	 * Drop the base reference from __alloc_pages and free. In
> 	 * case there is an outstanding speculative reference, from
> 	 * e.g. the page cache, it will put and free the page later.
> 	 */
> 	if (likely(put_page_testzero(page))) {
> 		free_the_page(page, order);
> 		return;
> 	}
> 
> 	/*
> 	 * The speculative reference will put and free the page.
> 	 *
> 	 * However, if the speculation was into a higher-order page
> 	 * that isn't marked compound, the other side will know
> 	 * nothing about our buddy pages and only free the order-0
> 	 * page at the start of our chunk! We must split off and free
> 	 * the buddy pages here.
> 	 *
> 	 * The buddy pages aren't individually refcounted, so they
> 	 * can't have any pending speculative references themselves.
> 	 */
> 	if (!PageHead(page) && order > 0) {
> 		split_page_memcg(page, 1 << order);
> 		while (order-- > 0)
> 			free_the_page(page + (1 << order), order);
> 	}
> }

Fine with me. Mathew was concerned about more places that do something
similar but I would say that if we find out more places we might
reconsider and currently stay with a reasonably clear model that it is
only head patch that carries the memcg information and split_page_memcg
is necessary to break such page into smaller pieces.
Hugh Dickins March 11, 2021, 8:37 p.m. UTC | #18
On Thu, 11 Mar 2021, Michal Hocko wrote:
> On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > Johannes, Hugh,
> > > 
> > > what do you think about this approach? If we want to stick with
> > > split_page approach then we need to update the missing place Matthew has
> > > pointed out.
> > 
> > I find the __free_pages() code quite tricky as well. But for that
> > reason I would actually prefer to initiate the splitting in there,
> > since that's the place where we actually split the page, rather than
> > spread the handling of this situation further out.
> > 
> > The race condition shouldn't be hot, so I don't think we need to be as
> > efficient about setting page->memcg_data only on the higher-order
> > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > which IMO should actually help document what's happening to the page.
> > 
> > I think that function could also benefit a bit more from step-by-step
> > documentation about what's going on. The kerneldoc is helpful, but I
> > don't think it does justice to how tricky this race condition is.
> > 
> > Something like this?
> > 
> > void __free_pages(struct page *page, unsigned int order)
> > {
> > 	/*
> > 	 * Drop the base reference from __alloc_pages and free. In
> > 	 * case there is an outstanding speculative reference, from
> > 	 * e.g. the page cache, it will put and free the page later.
> > 	 */
> > 	if (likely(put_page_testzero(page))) {
> > 		free_the_page(page, order);
> > 		return;
> > 	}
> > 
> > 	/*
> > 	 * The speculative reference will put and free the page.
> > 	 *
> > 	 * However, if the speculation was into a higher-order page
> > 	 * that isn't marked compound, the other side will know
> > 	 * nothing about our buddy pages and only free the order-0
> > 	 * page at the start of our chunk! We must split off and free
> > 	 * the buddy pages here.
> > 	 *
> > 	 * The buddy pages aren't individually refcounted, so they
> > 	 * can't have any pending speculative references themselves.
> > 	 */
> > 	if (!PageHead(page) && order > 0) {
> > 		split_page_memcg(page, 1 << order);
> > 		while (order-- > 0)
> > 			free_the_page(page + (1 << order), order);
> > 	}
> > }
> 
> Fine with me. Mathew was concerned about more places that do something
> similar but I would say that if we find out more places we might
> reconsider and currently stay with a reasonably clear model that it is
> only head patch that carries the memcg information and split_page_memcg
> is necessary to break such page into smaller pieces.

I agree: I do like Johannes' suggestion best, now that we already
have split_page_memcg().  Not too worried about contrived use of
free_unref_page() here; and whether non-compound high-order pages
should be perpetuated is a different discussion.

Hugh
Michal Hocko March 18, 2021, 2:05 p.m. UTC | #19
On Thu 11-03-21 12:37:20, Hugh Dickins wrote:
> On Thu, 11 Mar 2021, Michal Hocko wrote:
> > On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > > Johannes, Hugh,
> > > > 
> > > > what do you think about this approach? If we want to stick with
> > > > split_page approach then we need to update the missing place Matthew has
> > > > pointed out.
> > > 
> > > I find the __free_pages() code quite tricky as well. But for that
> > > reason I would actually prefer to initiate the splitting in there,
> > > since that's the place where we actually split the page, rather than
> > > spread the handling of this situation further out.
> > > 
> > > The race condition shouldn't be hot, so I don't think we need to be as
> > > efficient about setting page->memcg_data only on the higher-order
> > > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > > which IMO should actually help document what's happening to the page.
> > > 
> > > I think that function could also benefit a bit more from step-by-step
> > > documentation about what's going on. The kerneldoc is helpful, but I
> > > don't think it does justice to how tricky this race condition is.
> > > 
> > > Something like this?
> > > 
> > > void __free_pages(struct page *page, unsigned int order)
> > > {
> > > 	/*
> > > 	 * Drop the base reference from __alloc_pages and free. In
> > > 	 * case there is an outstanding speculative reference, from
> > > 	 * e.g. the page cache, it will put and free the page later.
> > > 	 */
> > > 	if (likely(put_page_testzero(page))) {
> > > 		free_the_page(page, order);
> > > 		return;
> > > 	}
> > > 
> > > 	/*
> > > 	 * The speculative reference will put and free the page.
> > > 	 *
> > > 	 * However, if the speculation was into a higher-order page
> > > 	 * that isn't marked compound, the other side will know
> > > 	 * nothing about our buddy pages and only free the order-0
> > > 	 * page at the start of our chunk! We must split off and free
> > > 	 * the buddy pages here.
> > > 	 *
> > > 	 * The buddy pages aren't individually refcounted, so they
> > > 	 * can't have any pending speculative references themselves.
> > > 	 */
> > > 	if (!PageHead(page) && order > 0) {
> > > 		split_page_memcg(page, 1 << order);
> > > 		while (order-- > 0)
> > > 			free_the_page(page + (1 << order), order);
> > > 	}
> > > }
> > 
> > Fine with me. Mathew was concerned about more places that do something
> > similar but I would say that if we find out more places we might
> > reconsider and currently stay with a reasonably clear model that it is
> > only head patch that carries the memcg information and split_page_memcg
> > is necessary to break such page into smaller pieces.
> 
> I agree: I do like Johannes' suggestion best, now that we already
> have split_page_memcg().  Not too worried about contrived use of
> free_unref_page() here; and whether non-compound high-order pages
> should be perpetuated is a different discussion.

Matthew, are you planning to post a patch with suggested changes or
should I do it?
Matthew Wilcox (Oracle) March 18, 2021, 3:02 p.m. UTC | #20
On Thu, Mar 18, 2021 at 03:05:00PM +0100, Michal Hocko wrote:
> On Thu 11-03-21 12:37:20, Hugh Dickins wrote:
> > On Thu, 11 Mar 2021, Michal Hocko wrote:
> > > On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > > > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > > > Johannes, Hugh,
> > > > > 
> > > > > what do you think about this approach? If we want to stick with
> > > > > split_page approach then we need to update the missing place Matthew has
> > > > > pointed out.
> > > > 
> > > > I find the __free_pages() code quite tricky as well. But for that
> > > > reason I would actually prefer to initiate the splitting in there,
> > > > since that's the place where we actually split the page, rather than
> > > > spread the handling of this situation further out.
> > > > 
> > > > The race condition shouldn't be hot, so I don't think we need to be as
> > > > efficient about setting page->memcg_data only on the higher-order
> > > > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > > > which IMO should actually help document what's happening to the page.
> > > > 
> > > > I think that function could also benefit a bit more from step-by-step
> > > > documentation about what's going on. The kerneldoc is helpful, but I
> > > > don't think it does justice to how tricky this race condition is.
> > > > 
> > > > Something like this?
> > > > 
> > > > void __free_pages(struct page *page, unsigned int order)
> > > > {
> > > > 	/*
> > > > 	 * Drop the base reference from __alloc_pages and free. In
> > > > 	 * case there is an outstanding speculative reference, from
> > > > 	 * e.g. the page cache, it will put and free the page later.
> > > > 	 */
> > > > 	if (likely(put_page_testzero(page))) {
> > > > 		free_the_page(page, order);
> > > > 		return;
> > > > 	}
> > > > 
> > > > 	/*
> > > > 	 * The speculative reference will put and free the page.
> > > > 	 *
> > > > 	 * However, if the speculation was into a higher-order page
> > > > 	 * that isn't marked compound, the other side will know
> > > > 	 * nothing about our buddy pages and only free the order-0
> > > > 	 * page at the start of our chunk! We must split off and free
> > > > 	 * the buddy pages here.
> > > > 	 *
> > > > 	 * The buddy pages aren't individually refcounted, so they
> > > > 	 * can't have any pending speculative references themselves.
> > > > 	 */
> > > > 	if (!PageHead(page) && order > 0) {
> > > > 		split_page_memcg(page, 1 << order);
> > > > 		while (order-- > 0)
> > > > 			free_the_page(page + (1 << order), order);
> > > > 	}
> > > > }
> > > 
> > > Fine with me. Mathew was concerned about more places that do something
> > > similar but I would say that if we find out more places we might
> > > reconsider and currently stay with a reasonably clear model that it is
> > > only head patch that carries the memcg information and split_page_memcg
> > > is necessary to break such page into smaller pieces.
> > 
> > I agree: I do like Johannes' suggestion best, now that we already
> > have split_page_memcg().  Not too worried about contrived use of
> > free_unref_page() here; and whether non-compound high-order pages
> > should be perpetuated is a different discussion.
> 
> Matthew, are you planning to post a patch with suggested changes or
> should I do it?

I'm busy with the folio work; could you do it please?
Johannes Weiner March 18, 2021, 3:07 p.m. UTC | #21
On Thu, Mar 18, 2021 at 03:05:00PM +0100, Michal Hocko wrote:
> On Thu 11-03-21 12:37:20, Hugh Dickins wrote:
> > On Thu, 11 Mar 2021, Michal Hocko wrote:
> > > On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > > > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > > > Johannes, Hugh,
> > > > > 
> > > > > what do you think about this approach? If we want to stick with
> > > > > split_page approach then we need to update the missing place Matthew has
> > > > > pointed out.
> > > > 
> > > > I find the __free_pages() code quite tricky as well. But for that
> > > > reason I would actually prefer to initiate the splitting in there,
> > > > since that's the place where we actually split the page, rather than
> > > > spread the handling of this situation further out.
> > > > 
> > > > The race condition shouldn't be hot, so I don't think we need to be as
> > > > efficient about setting page->memcg_data only on the higher-order
> > > > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > > > which IMO should actually help document what's happening to the page.
> > > > 
> > > > I think that function could also benefit a bit more from step-by-step
> > > > documentation about what's going on. The kerneldoc is helpful, but I
> > > > don't think it does justice to how tricky this race condition is.
> > > > 
> > > > Something like this?
> > > > 
> > > > void __free_pages(struct page *page, unsigned int order)
> > > > {
> > > > 	/*
> > > > 	 * Drop the base reference from __alloc_pages and free. In
> > > > 	 * case there is an outstanding speculative reference, from
> > > > 	 * e.g. the page cache, it will put and free the page later.
> > > > 	 */
> > > > 	if (likely(put_page_testzero(page))) {
> > > > 		free_the_page(page, order);
> > > > 		return;
> > > > 	}
> > > > 
> > > > 	/*
> > > > 	 * The speculative reference will put and free the page.
> > > > 	 *
> > > > 	 * However, if the speculation was into a higher-order page
> > > > 	 * that isn't marked compound, the other side will know
> > > > 	 * nothing about our buddy pages and only free the order-0
> > > > 	 * page at the start of our chunk! We must split off and free
> > > > 	 * the buddy pages here.
> > > > 	 *
> > > > 	 * The buddy pages aren't individually refcounted, so they
> > > > 	 * can't have any pending speculative references themselves.
> > > > 	 */
> > > > 	if (!PageHead(page) && order > 0) {
> > > > 		split_page_memcg(page, 1 << order);
> > > > 		while (order-- > 0)
> > > > 			free_the_page(page + (1 << order), order);
> > > > 	}
> > > > }
> > > 
> > > Fine with me. Mathew was concerned about more places that do something
> > > similar but I would say that if we find out more places we might
> > > reconsider and currently stay with a reasonably clear model that it is
> > > only head patch that carries the memcg information and split_page_memcg
> > > is necessary to break such page into smaller pieces.
> > 
> > I agree: I do like Johannes' suggestion best, now that we already
> > have split_page_memcg().  Not too worried about contrived use of
> > free_unref_page() here; and whether non-compound high-order pages
> > should be perpetuated is a different discussion.
> 
> Matthew, are you planning to post a patch with suggested changes or
> should I do it?

I'll post a proper patch.

Patch
diff mbox series

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..3ed783e25c3c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3310,6 +3310,7 @@  void split_page(struct page *page, unsigned int order)
 	for (i = 1; i < (1 << order); i++)
 		set_page_refcounted(page + i);
 	split_page_owner(page, 1 << order);
+	split_page_memcg(page, 1 << order);
 }
 EXPORT_SYMBOL_GPL(split_page);