mm/memcg: set memcg when split pages
diff mbox series

Message ID 20210302013451.118701-1-zhouguanghui1@huawei.com
State New, archived
Headers show
Series
  • mm/memcg: set memcg when split pages
Related show

Commit Message

Zhou Guanghui March 2, 2021, 1:34 a.m. UTC
When split page, the memory cgroup info recorded in first page is
not copied to tail pages. In this case, when the tail pages are
freed, the uncharge operation is not performed. As a result, the
usage of this memcg keeps increasing, and the OOM may occur.

So, the copying of first page's memory cgroup info to tail pages
is needed when split page.

Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
---
 include/linux/memcontrol.h | 10 ++++++++++
 mm/page_alloc.c            |  4 +++-
 2 files changed, 13 insertions(+), 1 deletion(-)

Comments

Zi Yan March 2, 2021, 1:59 a.m. UTC | #1
On 1 Mar 2021, at 20:34, Zhou Guanghui wrote:

> When split page, the memory cgroup info recorded in first page is
> not copied to tail pages. In this case, when the tail pages are
> freed, the uncharge operation is not performed. As a result, the
> usage of this memcg keeps increasing, and the OOM may occur.
>
> So, the copying of first page's memory cgroup info to tail pages
> is needed when split page.
>
> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
> ---
>  include/linux/memcontrol.h | 10 ++++++++++
>  mm/page_alloc.c            |  4 +++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..c7e2b4421dc1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -867,6 +867,12 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
>  extern bool cgroup_memory_noswap;
>  #endif
>
> +static inline void copy_page_memcg(struct page *dst, struct page *src)
> +{
> +	if (src->memcg_data)
> +		dst->memcg_data = src->memcg_data;
> +}
> +
>  struct mem_cgroup *lock_page_memcg(struct page *page);
>  void __unlock_page_memcg(struct mem_cgroup *memcg);
>  void unlock_page_memcg(struct page *page);
> @@ -1291,6 +1297,10 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
>  {
>  }
>
> +static inline void copy_page_memcg(struct page *dst, struct page *src)
> +{
> +}
> +
>  static inline struct mem_cgroup *lock_page_memcg(struct page *page)
>  {
>  	return NULL;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..ee0a63dc1c9b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3307,8 +3307,10 @@ void split_page(struct page *page, unsigned int order)
>  	VM_BUG_ON_PAGE(PageCompound(page), page);
>  	VM_BUG_ON_PAGE(!page_count(page), page);
>
> -	for (i = 1; i < (1 << order); i++)
> +	for (i = 1; i < (1 << order); i++) {
>  		set_page_refcounted(page + i);
> +		copy_page_memcg(page + i, page);
> +	}
>  	split_page_owner(page, 1 << order);
>  }
>  EXPORT_SYMBOL_GPL(split_page);
> -- 
> 2.25.0

+memcg maintainers

split_page() is used for non-compound higher-order pages. I am not sure
if there is any such pages monitored by memcg. Please let me know
if I miss anything.

—
Best Regards,
Yan Zi
Zhou Guanghui March 2, 2021, 7:05 a.m. UTC | #2
在 2021/3/2 10:00, Zi Yan 写道:
> On 1 Mar 2021, at 20:34, Zhou Guanghui wrote:
> 
>> When split page, the memory cgroup info recorded in first page is
>> not copied to tail pages. In this case, when the tail pages are
>> freed, the uncharge operation is not performed. As a result, the
>> usage of this memcg keeps increasing, and the OOM may occur.
>>
>> So, the copying of first page's memory cgroup info to tail pages
>> is needed when split page.
>>
>> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
>> ---
>>   include/linux/memcontrol.h | 10 ++++++++++
>>   mm/page_alloc.c            |  4 +++-
>>   2 files changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index e6dc793d587d..c7e2b4421dc1 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -867,6 +867,12 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
>>   extern bool cgroup_memory_noswap;
>>   #endif
>>
>> +static inline void copy_page_memcg(struct page *dst, struct page *src)
>> +{
>> +	if (src->memcg_data)
>> +		dst->memcg_data = src->memcg_data;
>> +}
>> +
>>   struct mem_cgroup *lock_page_memcg(struct page *page);
>>   void __unlock_page_memcg(struct mem_cgroup *memcg);
>>   void unlock_page_memcg(struct page *page);
>> @@ -1291,6 +1297,10 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
>>   {
>>   }
>>
>> +static inline void copy_page_memcg(struct page *dst, struct page *src)
>> +{
>> +}
>> +
>>   static inline struct mem_cgroup *lock_page_memcg(struct page *page)
>>   {
>>   	return NULL;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3e4b29ee2b1e..ee0a63dc1c9b 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3307,8 +3307,10 @@ void split_page(struct page *page, unsigned int order)
>>   	VM_BUG_ON_PAGE(PageCompound(page), page);
>>   	VM_BUG_ON_PAGE(!page_count(page), page);
>>
>> -	for (i = 1; i < (1 << order); i++)
>> +	for (i = 1; i < (1 << order); i++) {
>>   		set_page_refcounted(page + i);
>> +		copy_page_memcg(page + i, page);
>> +	}
>>   	split_page_owner(page, 1 << order);
>>   }
>>   EXPORT_SYMBOL_GPL(split_page);
>> -- 
>> 2.25.0
> 
> +memcg maintainers
> 
> split_page() is used for non-compound higher-order pages. I am not sure
> if there is any such pages monitored by memcg. Please let me know
> if I miss anything.

Thank you for taking time for this.

This should be put in kmemcg, and I'll modify it.

When the kmemcg is enabled and _GFP_ACCOUNT is set, the charged and 
uncharged sizes do not match when alloc/free_pages_exact method is used 
to apply for or free memory with exact size. This is because memcg data 
of the tail page is not set during the split page.
Michal Hocko March 2, 2021, 9:17 a.m. UTC | #3
[Cc Johannes for awareness and fixup Nick's email]

On Tue 02-03-21 01:34:51, Zhou Guanghui wrote:
> When split page, the memory cgroup info recorded in first page is
> not copied to tail pages. In this case, when the tail pages are
> freed, the uncharge operation is not performed. As a result, the
> usage of this memcg keeps increasing, and the OOM may occur.
> 
> So, the copying of first page's memory cgroup info to tail pages
> is needed when split page.

I was not aware that alloc_pages_exact is used for accounted allocations
but git grep told me otherwise so this is not a theoretical one. Both
users (arm64 and s390 kvm) are quite recent AFAICS. split_page is also
used in dma allocator but I got lost in indirection so I have no idea
whether there are any users there.

The page itself looks reasonable to me.

> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Minor nit

> ---
>  include/linux/memcontrol.h | 10 ++++++++++
>  mm/page_alloc.c            |  4 +++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..c7e2b4421dc1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -867,6 +867,12 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
>  extern bool cgroup_memory_noswap;
>  #endif
>  
> +static inline void copy_page_memcg(struct page *dst, struct page *src)
> +{
> +	if (src->memcg_data)
> +		dst->memcg_data = src->memcg_data;

I would just drop the test. The struct page is a single cache line which
is dirty by the reference count so another store will unlikely be
noticeable even when NULL is stored here and you safe a conditional.

> +}
> +
>  struct mem_cgroup *lock_page_memcg(struct page *page);
>  void __unlock_page_memcg(struct mem_cgroup *memcg);
>  void unlock_page_memcg(struct page *page);
> @@ -1291,6 +1297,10 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
>  {
>  }
>  
> +static inline void copy_page_memcg(struct page *dst, struct page *src)
> +{
> +}
> +
>  static inline struct mem_cgroup *lock_page_memcg(struct page *page)
>  {
>  	return NULL;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..ee0a63dc1c9b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3307,8 +3307,10 @@ void split_page(struct page *page, unsigned int order)
>  	VM_BUG_ON_PAGE(PageCompound(page), page);
>  	VM_BUG_ON_PAGE(!page_count(page), page);
>  
> -	for (i = 1; i < (1 << order); i++)
> +	for (i = 1; i < (1 << order); i++) {
>  		set_page_refcounted(page + i);
> +		copy_page_memcg(page + i, page);
> +	}
>  	split_page_owner(page, 1 << order);
>  }
>  EXPORT_SYMBOL_GPL(split_page);
> -- 
> 2.25.0
>
Zi Yan March 2, 2021, 3:37 p.m. UTC | #4
On 2 Mar 2021, at 2:05, Zhouguanghui (OS Kernel) wrote:

> 在 2021/3/2 10:00, Zi Yan 写道:
>> On 1 Mar 2021, at 20:34, Zhou Guanghui wrote:
>>
>>> When split page, the memory cgroup info recorded in first page is
>>> not copied to tail pages. In this case, when the tail pages are
>>> freed, the uncharge operation is not performed. As a result, the
>>> usage of this memcg keeps increasing, and the OOM may occur.
>>>
>>> So, the copying of first page's memory cgroup info to tail pages
>>> is needed when split page.
>>>
>>> Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
>>> ---
>>>   include/linux/memcontrol.h | 10 ++++++++++
>>>   mm/page_alloc.c            |  4 +++-
>>>   2 files changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index e6dc793d587d..c7e2b4421dc1 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -867,6 +867,12 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
>>>   extern bool cgroup_memory_noswap;
>>>   #endif
>>>
>>> +static inline void copy_page_memcg(struct page *dst, struct page *src)
>>> +{
>>> +	if (src->memcg_data)
>>> +		dst->memcg_data = src->memcg_data;
>>> +}
>>> +
>>>   struct mem_cgroup *lock_page_memcg(struct page *page);
>>>   void __unlock_page_memcg(struct mem_cgroup *memcg);
>>>   void unlock_page_memcg(struct page *page);
>>> @@ -1291,6 +1297,10 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
>>>   {
>>>   }
>>>
>>> +static inline void copy_page_memcg(struct page *dst, struct page *src)
>>> +{
>>> +}
>>> +
>>>   static inline struct mem_cgroup *lock_page_memcg(struct page *page)
>>>   {
>>>   	return NULL;
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 3e4b29ee2b1e..ee0a63dc1c9b 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -3307,8 +3307,10 @@ void split_page(struct page *page, unsigned int order)
>>>   	VM_BUG_ON_PAGE(PageCompound(page), page);
>>>   	VM_BUG_ON_PAGE(!page_count(page), page);
>>>
>>> -	for (i = 1; i < (1 << order); i++)
>>> +	for (i = 1; i < (1 << order); i++) {
>>>   		set_page_refcounted(page + i);
>>> +		copy_page_memcg(page + i, page);
>>> +	}
>>>   	split_page_owner(page, 1 << order);
>>>   }
>>>   EXPORT_SYMBOL_GPL(split_page);
>>> -- 
>>> 2.25.0
>>
>> +memcg maintainers
>>
>> split_page() is used for non-compound higher-order pages. I am not sure
>> if there is any such pages monitored by memcg. Please let me know
>> if I miss anything.
>
> Thank you for taking time for this.
>
> This should be put in kmemcg, and I'll modify it.
>
> When the kmemcg is enabled and _GFP_ACCOUNT is set, the charged and
> uncharged sizes do not match when alloc/free_pages_exact method is used
> to apply for or free memory with exact size. This is because memcg data
> of the tail page is not set during the split page.

Thanks for your clarification. I missed kmemcg.

I have a question on copy_page_memcg above. By reading __memcg_kmem_charge_page
and __memcg_kmem_uncharge_page, it seems to me that every single page requires
a css_get(&memcg->css) at charge time and a css_put(&memcg->css) at uncharge time.
But your copy_page_memcg does not do css_get for split subpages. Will it cause
memcg->css underflow when subpages are uncharged?


—
Best Regards,
Yan Zi
Michal Hocko March 2, 2021, 3:42 p.m. UTC | #5
On Tue 02-03-21 10:37:13, Zi Yan wrote:
[...]
> I have a question on copy_page_memcg above. By reading __memcg_kmem_charge_page
> and __memcg_kmem_uncharge_page, it seems to me that every single page requires
> a css_get(&memcg->css) at charge time and a css_put(&memcg->css) at uncharge time.
> But your copy_page_memcg does not do css_get for split subpages. Will it cause
> memcg->css underflow when subpages are uncharged?

yes, well spotted. I have completely missed that. This will also discard
my comment on testing the memcg.
Michal Hocko March 2, 2021, 3:51 p.m. UTC | #6
On Tue 02-03-21 10:17:03, Michal Hocko wrote:
> [Cc Johannes for awareness and fixup Nick's email]
> 
> On Tue 02-03-21 01:34:51, Zhou Guanghui wrote:
> > When split page, the memory cgroup info recorded in first page is
> > not copied to tail pages. In this case, when the tail pages are
> > freed, the uncharge operation is not performed. As a result, the
> > usage of this memcg keeps increasing, and the OOM may occur.
> > 
> > So, the copying of first page's memory cgroup info to tail pages
> > is needed when split page.
> 
> I was not aware that alloc_pages_exact is used for accounted allocations
> but git grep told me otherwise so this is not a theoretical one. Both
> users (arm64 and s390 kvm) are quite recent AFAICS. split_page is also
> used in dma allocator but I got lost in indirection so I have no idea
> whether there are any users there.
> 
> The page itself looks reasonable to me.
> 
> > Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Minor nit
> 
> > ---
> >  include/linux/memcontrol.h | 10 ++++++++++
> >  mm/page_alloc.c            |  4 +++-
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index e6dc793d587d..c7e2b4421dc1 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -867,6 +867,12 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
> >  extern bool cgroup_memory_noswap;
> >  #endif
> >  
> > +static inline void copy_page_memcg(struct page *dst, struct page *src)
> > +{
> > +	if (src->memcg_data)
> > +		dst->memcg_data = src->memcg_data;
> 
> I would just drop the test. The struct page is a single cache line which
> is dirty by the reference count so another store will unlikely be
> noticeable even when NULL is stored here and you safe a conditional.

Disregard this. As Zi Yan mentioned in other reply, we need to keep the
check and take a css reference along with transfering the memcg.
Hugh Dickins March 2, 2021, 8:24 p.m. UTC | #7
On Tue, 2 Mar 2021, Michal Hocko wrote:
> [Cc Johannes for awareness and fixup Nick's email]
> 
> On Tue 02-03-21 01:34:51, Zhou Guanghui wrote:
> > When split page, the memory cgroup info recorded in first page is
> > not copied to tail pages. In this case, when the tail pages are
> > freed, the uncharge operation is not performed. As a result, the
> > usage of this memcg keeps increasing, and the OOM may occur.
> > 
> > So, the copying of first page's memory cgroup info to tail pages
> > is needed when split page.
> 
> I was not aware that alloc_pages_exact is used for accounted allocations
> but git grep told me otherwise so this is not a theoretical one. Both
> users (arm64 and s390 kvm) are quite recent AFAICS. split_page is also
> used in dma allocator but I got lost in indirection so I have no idea
> whether there are any users there.

Yes, it's a bit worrying that such a low-level thing as split_page()
can now get caught up in memcg accounting, but I suppose that's okay.

I feel rather strongly that whichever way it is done, THP splitting
and split_page() should use the same interface to memcg.

And a look at mem_cgroup_split_huge_fixup() suggests that nowadays
there need to be css_get()s too - or better, a css_get_many().

Its #ifdef CONFIG_TRANSPARENT_HUGEPAGE should be removed, rename
it mem_cgroup_split_page_fixup(), and take order from caller.

Though I've never much liked that separate pass: would it be
better page by page, like this copy_page_memcg() does?  Though
mem_cgroup_disabled() and css_getting make that less appealing.

Hugh

> 
> The page itself looks reasonable to me.
> 
> > Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Minor nit
> 
> > ---
> >  include/linux/memcontrol.h | 10 ++++++++++
> >  mm/page_alloc.c            |  4 +++-
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index e6dc793d587d..c7e2b4421dc1 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -867,6 +867,12 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
> >  extern bool cgroup_memory_noswap;
> >  #endif
> >  
> > +static inline void copy_page_memcg(struct page *dst, struct page *src)
> > +{
> > +	if (src->memcg_data)
> > +		dst->memcg_data = src->memcg_data;
> 
> I would just drop the test. The struct page is a single cache line which
> is dirty by the reference count so another store will unlikely be
> noticeable even when NULL is stored here and you safe a conditional.
> 
> > +}
> > +
> >  struct mem_cgroup *lock_page_memcg(struct page *page);
> >  void __unlock_page_memcg(struct mem_cgroup *memcg);
> >  void unlock_page_memcg(struct page *page);
> > @@ -1291,6 +1297,10 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
> >  {
> >  }
> >  
> > +static inline void copy_page_memcg(struct page *dst, struct page *src)
> > +{
> > +}
> > +
> >  static inline struct mem_cgroup *lock_page_memcg(struct page *page)
> >  {
> >  	return NULL;
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3e4b29ee2b1e..ee0a63dc1c9b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3307,8 +3307,10 @@ void split_page(struct page *page, unsigned int order)
> >  	VM_BUG_ON_PAGE(PageCompound(page), page);
> >  	VM_BUG_ON_PAGE(!page_count(page), page);
> >  
> > -	for (i = 1; i < (1 << order); i++)
> > +	for (i = 1; i < (1 << order); i++) {
> >  		set_page_refcounted(page + i);
> > +		copy_page_memcg(page + i, page);
> > +	}
> >  	split_page_owner(page, 1 << order);
> >  }
> >  EXPORT_SYMBOL_GPL(split_page);
> > -- 
> > 2.25.0
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
Johannes Weiner March 2, 2021, 10:56 p.m. UTC | #8
On Tue, Mar 02, 2021 at 12:24:41PM -0800, Hugh Dickins wrote:
> On Tue, 2 Mar 2021, Michal Hocko wrote:
> > [Cc Johannes for awareness and fixup Nick's email]
> > 
> > On Tue 02-03-21 01:34:51, Zhou Guanghui wrote:
> > > When split page, the memory cgroup info recorded in first page is
> > > not copied to tail pages. In this case, when the tail pages are
> > > freed, the uncharge operation is not performed. As a result, the
> > > usage of this memcg keeps increasing, and the OOM may occur.
> > > 
> > > So, the copying of first page's memory cgroup info to tail pages
> > > is needed when split page.
> > 
> > I was not aware that alloc_pages_exact is used for accounted allocations
> > but git grep told me otherwise so this is not a theoretical one. Both
> > users (arm64 and s390 kvm) are quite recent AFAICS. split_page is also
> > used in dma allocator but I got lost in indirection so I have no idea
> > whether there are any users there.
> 
> Yes, it's a bit worrying that such a low-level thing as split_page()
> can now get caught up in memcg accounting, but I suppose that's okay.
> 
> I feel rather strongly that whichever way it is done, THP splitting
> and split_page() should use the same interface to memcg.
> 
> And a look at mem_cgroup_split_huge_fixup() suggests that nowadays
> there need to be css_get()s too - or better, a css_get_many().
> 
> Its #ifdef CONFIG_TRANSPARENT_HUGEPAGE should be removed, rename
> it mem_cgroup_split_page_fixup(), and take order from caller.

+1

There is already a split_page_owner() in both these places as well
which does a similar thing. Mabye we can match that by calling it
split_page_memcg() and having it take a nr of pages?

> Though I've never much liked that separate pass: would it be
> better page by page, like this copy_page_memcg() does?  Though
> mem_cgroup_disabled() and css_getting make that less appealing.

Agreed on both counts. mem_cgroup_disabled() is a jump label and would
be okay, IMO, but the refcounting - though it is (usually) per-cpu -
adds at least two branches and rcu read locking.
Hugh Dickins March 2, 2021, 11:49 p.m. UTC | #9
On Tue, 2 Mar 2021, Johannes Weiner wrote:
> On Tue, Mar 02, 2021 at 12:24:41PM -0800, Hugh Dickins wrote:
> > On Tue, 2 Mar 2021, Michal Hocko wrote:
> > > [Cc Johannes for awareness and fixup Nick's email]
> > > 
> > > On Tue 02-03-21 01:34:51, Zhou Guanghui wrote:
> > > > When split page, the memory cgroup info recorded in first page is
> > > > not copied to tail pages. In this case, when the tail pages are
> > > > freed, the uncharge operation is not performed. As a result, the
> > > > usage of this memcg keeps increasing, and the OOM may occur.
> > > > 
> > > > So, the copying of first page's memory cgroup info to tail pages
> > > > is needed when split page.
> > > 
> > > I was not aware that alloc_pages_exact is used for accounted allocations
> > > but git grep told me otherwise so this is not a theoretical one. Both
> > > users (arm64 and s390 kvm) are quite recent AFAICS. split_page is also
> > > used in dma allocator but I got lost in indirection so I have no idea
> > > whether there are any users there.
> > 
> > Yes, it's a bit worrying that such a low-level thing as split_page()
> > can now get caught up in memcg accounting, but I suppose that's okay.
> > 
> > I feel rather strongly that whichever way it is done, THP splitting
> > and split_page() should use the same interface to memcg.
> > 
> > And a look at mem_cgroup_split_huge_fixup() suggests that nowadays
> > there need to be css_get()s too - or better, a css_get_many().
> > 
> > Its #ifdef CONFIG_TRANSPARENT_HUGEPAGE should be removed, rename
> > it mem_cgroup_split_page_fixup(), and take order from caller.
> 
> +1
> 
> There is already a split_page_owner() in both these places as well
> which does a similar thing. Mabye we can match that by calling it
> split_page_memcg() and having it take a nr of pages?

Agreed on both counts :) "fixup" was not an inspiring name.

> 
> > Though I've never much liked that separate pass: would it be
> > better page by page, like this copy_page_memcg() does?  Though
> > mem_cgroup_disabled() and css_getting make that less appealing.
> 
> Agreed on both counts. mem_cgroup_disabled() is a jump label and would
> be okay, IMO, but the refcounting - though it is (usually) per-cpu -
> adds at least two branches and rcu read locking.
Michal Hocko March 3, 2021, 7:46 a.m. UTC | #10
On Tue 02-03-21 17:56:07, Johannes Weiner wrote:
> On Tue, Mar 02, 2021 at 12:24:41PM -0800, Hugh Dickins wrote:
> > On Tue, 2 Mar 2021, Michal Hocko wrote:
> > > [Cc Johannes for awareness and fixup Nick's email]
> > > 
> > > On Tue 02-03-21 01:34:51, Zhou Guanghui wrote:
> > > > When split page, the memory cgroup info recorded in first page is
> > > > not copied to tail pages. In this case, when the tail pages are
> > > > freed, the uncharge operation is not performed. As a result, the
> > > > usage of this memcg keeps increasing, and the OOM may occur.
> > > > 
> > > > So, the copying of first page's memory cgroup info to tail pages
> > > > is needed when split page.
> > > 
> > > I was not aware that alloc_pages_exact is used for accounted allocations
> > > but git grep told me otherwise so this is not a theoretical one. Both
> > > users (arm64 and s390 kvm) are quite recent AFAICS. split_page is also
> > > used in dma allocator but I got lost in indirection so I have no idea
> > > whether there are any users there.
> > 
> > Yes, it's a bit worrying that such a low-level thing as split_page()
> > can now get caught up in memcg accounting, but I suppose that's okay.
> > 
> > I feel rather strongly that whichever way it is done, THP splitting
> > and split_page() should use the same interface to memcg.
> > 
> > And a look at mem_cgroup_split_huge_fixup() suggests that nowadays
> > there need to be css_get()s too - or better, a css_get_many().
> > 
> > Its #ifdef CONFIG_TRANSPARENT_HUGEPAGE should be removed, rename
> > it mem_cgroup_split_page_fixup(), and take order from caller.
> 
> +1
> 
> There is already a split_page_owner() in both these places as well
> which does a similar thing. Mabye we can match that by calling it
> split_page_memcg() and having it take a nr of pages?

Sounds good to me.
Zhou Guanghui March 3, 2021, 9:15 a.m. UTC | #11
在 2021/3/3 15:46, Michal Hocko 写道:
> On Tue 02-03-21 17:56:07, Johannes Weiner wrote:
>> On Tue, Mar 02, 2021 at 12:24:41PM -0800, Hugh Dickins wrote:
>>> On Tue, 2 Mar 2021, Michal Hocko wrote:
>>>> [Cc Johannes for awareness and fixup Nick's email]
>>>>
>>>> On Tue 02-03-21 01:34:51, Zhou Guanghui wrote:
>>>>> When split page, the memory cgroup info recorded in first page is
>>>>> not copied to tail pages. In this case, when the tail pages are
>>>>> freed, the uncharge operation is not performed. As a result, the
>>>>> usage of this memcg keeps increasing, and the OOM may occur.
>>>>>
>>>>> So, the copying of first page's memory cgroup info to tail pages
>>>>> is needed when split page.
>>>>
>>>> I was not aware that alloc_pages_exact is used for accounted allocations
>>>> but git grep told me otherwise so this is not a theoretical one. Both
>>>> users (arm64 and s390 kvm) are quite recent AFAICS. split_page is also
>>>> used in dma allocator but I got lost in indirection so I have no idea
>>>> whether there are any users there.
>>>
>>> Yes, it's a bit worrying that such a low-level thing as split_page()
>>> can now get caught up in memcg accounting, but I suppose that's okay.
>>>
>>> I feel rather strongly that whichever way it is done, THP splitting
>>> and split_page() should use the same interface to memcg.
>>>
>>> And a look at mem_cgroup_split_huge_fixup() suggests that nowadays
>>> there need to be css_get()s too - or better, a css_get_many().
>>>
>>> Its #ifdef CONFIG_TRANSPARENT_HUGEPAGE should be removed, rename
>>> it mem_cgroup_split_page_fixup(), and take order from caller.
>>
>> +1
>>
>> There is already a split_page_owner() in both these places as well
>> which does a similar thing. Mabye we can match that by calling it
>> split_page_memcg() and having it take a nr of pages?
> 
> Sounds good to me.
> 
  Hi, Michal, Johannes, Hugh, and Zi Yan, thank you for taking time for 
this.

I agree, and will send v2 patches for taking these.

Thanks

Patch
diff mbox series

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e6dc793d587d..c7e2b4421dc1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -867,6 +867,12 @@  void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
 extern bool cgroup_memory_noswap;
 #endif
 
+static inline void copy_page_memcg(struct page *dst, struct page *src)
+{
+	if (src->memcg_data)
+		dst->memcg_data = src->memcg_data;
+}
+
 struct mem_cgroup *lock_page_memcg(struct page *page);
 void __unlock_page_memcg(struct mem_cgroup *memcg);
 void unlock_page_memcg(struct page *page);
@@ -1291,6 +1297,10 @@  mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 {
 }
 
+static inline void copy_page_memcg(struct page *dst, struct page *src)
+{
+}
+
 static inline struct mem_cgroup *lock_page_memcg(struct page *page)
 {
 	return NULL;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..ee0a63dc1c9b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3307,8 +3307,10 @@  void split_page(struct page *page, unsigned int order)
 	VM_BUG_ON_PAGE(PageCompound(page), page);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 
-	for (i = 1; i < (1 << order); i++)
+	for (i = 1; i < (1 << order); i++) {
 		set_page_refcounted(page + i);
+		copy_page_memcg(page + i, page);
+	}
 	split_page_owner(page, 1 << order);
 }
 EXPORT_SYMBOL_GPL(split_page);