linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH v3 1/5] percpu: return number of released bytes from pcpu_free_area()
       [not found] ` <20200623184515.4132564-2-guro@fb.com>
@ 2020-06-24  0:58   ` Shakeel Butt
  0 siblings, 0 replies; 19+ messages in thread
From: Shakeel Butt @ 2020-06-24  0:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML

On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> To implement accounting of percpu memory we need the information about the
> size of the freed object.  Return it from pcpu_free_area().
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 2/5] mm: memcg/percpu: account percpu memory to memory cgroups
       [not found] ` <20200623184515.4132564-3-guro@fb.com>
@ 2020-06-24  1:25   ` Shakeel Butt
  0 siblings, 0 replies; 19+ messages in thread
From: Shakeel Butt @ 2020-06-24  1:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML

On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> Percpu memory is becoming more and more widely used by various subsystems,
> and the total amount of memory controlled by the percpu allocator can make
> a good part of the total memory.
>
> As an example, bpf maps can consume a lot of percpu memory, and they are
> created by a user.  Also, some cgroup internals (e.g.  memory controller
> statistics) can be quite large.  On a machine with many CPUs and big
> number of cgroups they can consume hundreds of megabytes.
>
> So the lack of memcg accounting is creating a breach in the memory
> isolation.  Similar to the slab memory, percpu memory should be accounted
> by default.
>
> To implement the percpu accounting it's possible to take the slab memory
> accounting as a model to follow.  Let's introduce two types of percpu
> chunks: root and memcg.  What makes memcg chunks different is an
> additional space allocated to store memcg membership information.  If
> __GFP_ACCOUNT is passed on allocation, a memcg chunk should be used.
> If it's possible to charge the corresponding size to the target memory
> cgroup, allocation is performed, and the memcg ownership data is recorded.
> System-wide allocations are performed using root chunks, so there is no
> additional memory overhead.
>
> To implement a fast reparenting of percpu memory on memcg removal, we
> don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
> introduced for slab accounting.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

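As an aside for readers who want to picture the chunk split described in the
commit message above: below is a minimal, hedged sketch of the allocation-time
decision, not the patch's actual code. Only __GFP_ACCOUNT is a real flag here;
every sketch_* helper is a hypothetical stand-in for the steps the commit
message names (looking up the current obj_cgroup, charging it, allocating from
a root or memcg chunk, and recording ownership for the free path).

	/* Hedged sketch; all sketch_* helpers are hypothetical. */
	static void __percpu *sketch_pcpu_alloc(size_t size, gfp_t gfp)
	{
		struct obj_cgroup *objcg = NULL;
		void __percpu *ptr;

		/* __GFP_ACCOUNT requests accounting to the current memcg */
		if (gfp & __GFP_ACCOUNT)
			objcg = sketch_get_current_objcg();

		/* charge the target memory cgroup before allocating */
		if (objcg && sketch_charge(objcg, size))
			return NULL;			/* charge failed */

		/* allocate from a chunk of the matching type (root or memcg) */
		ptr = sketch_alloc_from_chunk(objcg != NULL, size);

		if (ptr && objcg)
			/* record ownership so the free path can uncharge later */
			sketch_record_objcg(ptr, objcg, size);
		else if (objcg)
			sketch_uncharge(objcg, size);	/* allocation failed: roll back */

		return ptr;
	}
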
^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
       [not found] ` <20200623184515.4132564-4-guro@fb.com>
@ 2020-06-24  1:35   ` Shakeel Butt
  2020-08-11 15:05   ` Johannes Weiner
  1 sibling, 0 replies; 19+ messages in thread
From: Shakeel Butt @ 2020-06-24  1:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML

On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> Percpu memory can represent a noticeable chunk of the total memory
> consumption, especially on big machines with many CPUs.  Let's track
> percpu memory usage for each memcg and display it in memory.stat.
>
> A percpu allocation is usually scattered over multiple pages (and nodes),
> and can be significantly smaller than a page.  So let's add a byte-sized
> counter on the memcg level: MEMCG_PERCPU_B.  Byte-sized vmstat infra
> created for slabs can be perfectly reused for percpu case.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

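To make the byte-sized counter above a bit more concrete, here is a minimal,
hedged sketch of such an update (not the patch's code). MEMCG_PERCPU_B is the
new item named in the commit message; the assumption here is that it is bumped
through the same mod_memcg_state(memcg, item, delta) style helper used by the
existing byte-sized counters, with a positive delta on allocation and a
negative one on free.

	/* Hedged sketch: bytes > 0 on allocation, bytes < 0 on free. */
	static void sketch_mod_percpu_stat(struct mem_cgroup *memcg, int bytes)
	{
		/* byte-granular counter, later surfaced in memory.stat */
		mod_memcg_state(memcg, MEMCG_PERCPU_B, bytes);
	}
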
^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
       [not found] ` <20200623184515.4132564-5-guro@fb.com>
@ 2020-06-24  1:40   ` Shakeel Butt
  2020-06-24  1:49     ` Roman Gushchin
  2020-07-29 17:10   ` Michal Koutný
  2020-08-11 15:27   ` Johannes Weiner
  2 siblings, 1 reply; 19+ messages in thread
From: Shakeel Butt @ 2020-06-24  1:40 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML

On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
>
> Memory cgroups are using large chunks of percpu memory to store vmstat
> data.  Yet this memory is not accounted at all, so in the case when there
> are many (dying) cgroups, it's not exactly clear where all the memory is.
>
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of object or page which is pinning it in the memory, it's
> not a good idea to simple ignore it.  It actually breaks the isolation

*simply

> between cgroups.
>
> Let's account the consumed percpu memory to the parent cgroup.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-06-24  1:40   ` [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Shakeel Butt
@ 2020-06-24  1:49     ` Roman Gushchin
  0 siblings, 0 replies; 19+ messages in thread
From: Roman Gushchin @ 2020-06-24  1:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Linux MM, Kernel Team, LKML

On Tue, Jun 23, 2020 at 06:40:41PM -0700, Shakeel Butt wrote:
> On Tue, Jun 23, 2020 at 11:47 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > Memory cgroups are using large chunks of percpu memory to store vmstat
> > data.  Yet this memory is not accounted at all, so in the case when there
> > are many (dying) cgroups, it's not exactly clear where all the memory is.
> >
> > Because the size of memory cgroup internal structures can dramatically
> > exceed the size of object or page which is pinning it in the memory, it's
> > not a good idea to simple ignore it.  It actually breaks the isolation
> 
> *simply
> 
> > between cgroups.
> >
> > Let's account the consumed percpu memory to the parent cgroup.
> >
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Dennis Zhou <dennis@kernel.org>
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Hello, Shakeel!

Thank you for the review of this and the previous patchsets!

Btw, I'll be completely offline till the end of the week,
so if any questions arise around these patchsets,
I'll answer them all on Monday, Jun 29th. Thanks!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
       [not found] ` <20200623184515.4132564-5-guro@fb.com>
  2020-06-24  1:40   ` [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Shakeel Butt
@ 2020-07-29 17:10   ` Michal Koutný
  2020-08-07  4:16     ` Andrew Morton
  2020-08-11 15:27   ` Johannes Weiner
  2 siblings, 1 reply; 19+ messages in thread
From: Michal Koutný @ 2020-07-29 17:10 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1349 bytes --]

Hello.

On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of object or page which is pinning it in the memory, it's
> not a good idea to simple ignore it.  It actually breaks the isolation
> between cgroups.
No doubt about accounting the memory if it's a significant amount.

> Let's account the consumed percpu memory to the parent cgroup.
Why did you choose charging to the parent of the created cgroup?

Should the charge go to the cgroup _that is creating_ the new memcg?

One reason is that there are the throttling mechanisms for memory limits
and those are better exercised when the actor and its memory artefact
are the same cgroup, aren't they?

The second reason is based on the Delegation Containment example
(Documentation/admin-guide/cgroup-v2.rst)

> For an example, let's assume cgroups C0 and C1 have been delegated to
> user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> all processes under C0 and C1 belong to U0::
> 
>   ~~~~~~~~~~~~~ - C0 - C00
>   ~ cgroup    ~      \ C01
>   ~ hierarchy ~
>   ~~~~~~~~~~~~~ - C1 - C10

Thanks to permissions a task running in C0 creating a cgroup in C1 would
deplete C1's supply victimizing tasks inside C1.

Thanks,
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-07-29 17:10   ` Michal Koutný
@ 2020-08-07  4:16     ` Andrew Morton
  2020-08-07  4:37       ` Roman Gushchin
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2020-08-07  4:16 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Roman Gushchin, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <mkoutny@suse.com> wrote:

> Hello.
> 
> On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > Because the size of memory cgroup internal structures can dramatically
> > exceed the size of object or page which is pinning it in the memory, it's
> > not a good idea to simple ignore it.  It actually breaks the isolation
> > between cgroups.
> No doubt about accounting the memory if it's a significant amount.
> 
> > Let's account the consumed percpu memory to the parent cgroup.
> Why did you choose charging to the parent of the created cgroup?
> 
> Should the charge go to the cgroup _that is creating_ the new memcg?
> 
> One reason is that there are the throttling mechanisms for memory limits
> and those are better exercised when the actor and its memory artefact
> are the same cgroup, aren't they?
> 
> The second reason is based on the Delegation Containment example
> (Documentation/admin-guide/cgroup-v2.rst)
> 
> > For an example, let's assume cgroups C0 and C1 have been delegated to
> > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > all processes under C0 and C1 belong to U0::
> > 
> >   ~~~~~~~~~~~~~ - C0 - C00
> >   ~ cgroup    ~      \ C01
> >   ~ hierarchy ~
> >   ~~~~~~~~~~~~~ - C1 - C10
> 
> Thanks to permissions a task running in C0 creating a cgroup in C1 would
> deplete C1's supply victimizing tasks inside C1.

These week-old issues appear to be significant.  Roman?  Or someone
else?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-07  4:16     ` Andrew Morton
@ 2020-08-07  4:37       ` Roman Gushchin
  2020-08-10 19:33         ` Roman Gushchin
  2020-08-11 14:47         ` Michal Koutný
  0 siblings, 2 replies; 19+ messages in thread
From: Roman Gushchin @ 2020-08-07  4:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Koutný,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <mkoutny@suse.com> wrote:
> 
> > Hello.
> > 
> > On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > > Because the size of memory cgroup internal structures can dramatically
> > > exceed the size of object or page which is pinning it in the memory, it's
> > > not a good idea to simple ignore it.  It actually breaks the isolation
> > > between cgroups.
> > No doubt about accounting the memory if it's a significant amount.
> > 
> > > Let's account the consumed percpu memory to the parent cgroup.
> > Why did you choose charging to the parent of the created cgroup?
> > 
> > Should the charge go to the cgroup _that is creating_ the new memcg?
> > 
> > One reason is that there are the throttling mechanisms for memory limits
> > and those are better exercised when the actor and its memory artefact
> > are the same cgroup, aren't they?

Hi!

In general, yes. But in this case I think it wouldn't be a good idea:
most often cgroups are created by a centralized daemon (systemd),
which is usually located in the root cgroup. Even if it's not located in
the root cgroup, limiting its memory will likely affect the whole system,
even if only one specific limit was reached.
If there is a containerized workload, which creates sub-cgroups,
charging its parent cgroup is perfectly effective.

And the opposite, if we charge the cgroup of the process that created
a cgroup, we'll not cover the most common case: systemd creating
cgroups for all services in the system.

> > 
> > The second reason is based on the Delegation Containment example
> > (Documentation/admin-guide/cgroup-v2.rst)
> > 
> > > For an example, let's assume cgroups C0 and C1 have been delegated to
> > > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > > all processes under C0 and C1 belong to U0::
> > > 
> > >   ~~~~~~~~~~~~~ - C0 - C00
> > >   ~ cgroup    ~      \ C01
> > >   ~ hierarchy ~
> > >   ~~~~~~~~~~~~~ - C1 - C10
> > 
> > Thanks to permissions a task running in C0 creating a cgroup in C1 would
> > deplete C1's supply victimizing tasks inside C1.

Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
in a completely different cgroup. In this particular case there are tons of other
ways a task from C00 can hurt C1.

> 
> These week-old issues appear to be significant.  Roman?  Or someone
> else?

Oh, I'm sorry, somehow I missed this email.
Thank you for pointing at it!

Thanks!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-07  4:37       ` Roman Gushchin
@ 2020-08-10 19:33         ` Roman Gushchin
  2020-08-11 14:47         ` Michal Koutný
  1 sibling, 0 replies; 19+ messages in thread
From: Roman Gushchin @ 2020-08-10 19:33 UTC (permalink / raw)
  To: Andrew Morton, Michal Koutný
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Johannes Weiner,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin wrote:
> On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > On Wed, 29 Jul 2020 19:10:39 +0200 Michal Koutný <mkoutny@suse.com> wrote:
> > 
> > > Hello.
> > > 
> > > On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > > > Because the size of memory cgroup internal structures can dramatically
> > > > exceed the size of object or page which is pinning it in the memory, it's
> > > > not a good idea to simple ignore it.  It actually breaks the isolation
> > > > between cgroups.
> > > No doubt about accounting the memory if it's a significant amount.
> > > 
> > > > Let's account the consumed percpu memory to the parent cgroup.
> > > Why did you choose charging to the parent of the created cgroup?
> > > 
> > > Should the charge go to the cgroup _that is creating_ the new memcg?
> > > 
> > > One reason is that there are the throttling mechanisms for memory limits
> > > and those are better exercised when the actor and its memory artefact
> > > are the same cgroup, aren't they?
> 
> Hi!
> 
> In general, yes. But in this case I think it wouldn't be a good idea:
> most often cgroups are created by a centralized daemon (systemd),
> which is usually located in the root cgroup. Even if it's not located in
> the root cgroup, limiting its memory will likely affect the whole system,
> even if only one specific limit was reached.
> If there is a containerized workload, which creates sub-cgroups,
> charging its parent cgroup is perfectly effective.
> 
> And the opposite, if we charge the cgroup of the process that created
> a cgroup, we'll not cover the most common case: systemd creating
> cgroups for all services in the system.
> 
> > > 
> > > The second reason is based on the Delegation Containment example
> > > (Documentation/admin-guide/cgroup-v2.rst)
> > > 
> > > > For an example, let's assume cgroups C0 and C1 have been delegated to
> > > > user U0 who created C00, C01 under C0 and C10 under C1 as follows and
> > > > all processes under C0 and C1 belong to U0::
> > > > 
> > > >   ~~~~~~~~~~~~~ - C0 - C00
> > > >   ~ cgroup    ~      \ C01
> > > >   ~ hierarchy ~
> > > >   ~~~~~~~~~~~~~ - C1 - C10
> > > 
> > > Thanks to permissions a task running in C0 creating a cgroup in C1 would
> > > deplete C1's supply victimizing tasks inside C1.
> 
> Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> in a completely different cgroup. In this particular case there are tons of other
> ways a task from C00 can hurt C1.
> 
> > 
> > These week-old issues appear to be significant.  Roman?  Or someone
> > else?
> 
> Oh, I'm sorry, somehow I missed this email.
> Thank you for pointing at it!

Hello, Michal!

Do you have any remaining concerns here, or is it good to go?

It seems that this is blocking the whole percpu accounting patchset from being merged,
and I still hope it can be squeezed into 5.9.

Thank you!

Roman

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-07  4:37       ` Roman Gushchin
  2020-08-10 19:33         ` Roman Gushchin
@ 2020-08-11 14:47         ` Michal Koutný
  2020-08-11 16:55           ` Roman Gushchin
  1 sibling, 1 reply; 19+ messages in thread
From: Michal Koutný @ 2020-08-11 14:47 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2436 bytes --]

On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin <guro@fb.com> wrote:
> In general, yes. But in this case I think it wouldn't be a good idea:
> most often cgroups are created by a centralized daemon (systemd),
> which is usually located in the root cgroup. Even if it's not located in
> the root cgroup, limiting its memory will likely affect the whole system,
> even if only one specific limit was reached.
The generic scheme would be (assuming the no internal process
constraint, in the root too)

` root or delegated root
  ` manager-cgroup (systemd, docker, ...)
  ` [aggregation group(s)]
    ` job-group-1
    ` ...
    ` job-group-n

> If there is a containerized workload, which creates sub-cgroups,
> charging its parent cgroup is perfectly effective.
No dispute about this in either approaches.

> And the opposite, if we charge the cgroup of the process that created
> a cgroup, we'll not cover the most common case: systemd creating
> cgroups for all services in the system.
What I mean is that systemd should be charged for the cgroup creation.
Or more generally, any container/cgroup manager should be charged.
Consider a leak where it doesn't remove spent cgroups; IMO the effect
(throttling, reclaim) should be exercised on such a culprit.

I don't think the existing workload (job-group-i above) should
unnecessarily suffer when only manager is acting up. Is that different
from your idea?

> Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> in a completely different cgroup. In this particular case there are tons of other
> ways a task from C00 can hurt C1.
I agree with that.


If I haven't overlooked anything, this should be the first case where
cgroup-related structures are accounted (please correct me).
So this is setting a precedent, if others prove useful to be accounted
in the future too. I'm thinking about cpu_cgroup_css_alloc(), which can
also allocate a lot (with a big CPU count). The current approach would lead to
situations where matching cpu and memory csses need not exist, and that
would need special handling.


> On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > These week-old issues appear to be significant.  Roman?  Or someone
> > else?
Despite my concerns, I don't think this is fundamental or impossible to
change later, so it doesn't prevent the inclusion in 5.9 RC1.

Regards,
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics
       [not found] ` <20200623184515.4132564-4-guro@fb.com>
  2020-06-24  1:35   ` [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Shakeel Butt
@ 2020-08-11 15:05   ` Johannes Weiner
  1 sibling, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2020-08-11 15:05 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Tue, Jun 23, 2020 at 11:45:13AM -0700, Roman Gushchin wrote:
> Percpu memory can represent a noticeable chunk of the total memory
> consumption, especially on big machines with many CPUs.  Let's track
> percpu memory usage for each memcg and display it in memory.stat.
> 
> A percpu allocation is usually scattered over multiple pages (and nodes),
> and can be significantly smaller than a page.  So let's add a byte-sized
> counter on the memcg level: MEMCG_PERCPU_B.  Byte-sized vmstat infra
> created for slabs can be perfectly reused for percpu case.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
       [not found] ` <20200623184515.4132564-5-guro@fb.com>
  2020-06-24  1:40   ` [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Shakeel Butt
  2020-07-29 17:10   ` Michal Koutný
@ 2020-08-11 15:27   ` Johannes Weiner
  2020-08-11 17:06     ` Roman Gushchin
  2 siblings, 1 reply; 19+ messages in thread
From: Johannes Weiner @ 2020-08-11 15:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin wrote:
> Memory cgroups are using large chunks of percpu memory to store vmstat
> data.  Yet this memory is not accounted at all, so in the case when there
> are many (dying) cgroups, it's not exactly clear where all the memory is.
> 
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of object or page which is pinning it in the memory, it's
> not a good idea to simple ignore it.  It actually breaks the isolation
> between cgroups.
> 
> Let's account the consumed percpu memory to the parent cgroup.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Acked-by: Dennis Zhou <dennis@kernel.org>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

This makes sense, and the accounting is in line with how we track and
distribute child creation quotas (cgroup.max.descendants and
cgroup.max.depth) up the cgroup tree.

I have one minor comment that isn't a dealbreaker for me:

> @@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>  	if (!pn)
>  		return 1;
>  
> -	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_local) {
>  		kfree(pn);
>  		return 1;
>  	}
>  
> -	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
> +					       GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_cpu) {
>  		free_percpu(pn->lruvec_stat_local);
>  		kfree(pn);
> @@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  		goto fail;
>  	}
>  
> -	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_local)
>  		goto fail;
>  
> -	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_percpu)
>  		goto fail;
>  
> @@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	struct mem_cgroup *memcg;
>  	long error = -ENOMEM;
>  
> +	memalloc_use_memcg(parent);
>  	memcg = mem_cgroup_alloc();
> +	memalloc_unuse_memcg();

The disconnect between 1) requesting accounting and 2) which cgroup to
charge is making me uneasy. It makes mem_cgroup_alloc() a bit of a
hand grenade, because accounting to the current task is almost
guaranteed to be wrong if the use_memcg() annotation were to get lost
in a refactor or not make it to a new caller of the function.

The saving grace is that mem_cgroup_alloc() is pretty unlikely to be
used elsewhere. And pretending it's an independent interface would be
overengineering. But how about the following in mem_cgroup_alloc() and
alloc_mem_cgroup_per_node_info() to document that caller relationship:

	/* We charge the parent cgroup, never the current task */
	WARN_ON_ONCE(!current->active_memcg);

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 14:47         ` Michal Koutný
@ 2020-08-11 16:55           ` Roman Gushchin
  2020-08-11 18:32             ` Michal Koutný
  0 siblings, 1 reply; 19+ messages in thread
From: Roman Gushchin @ 2020-08-11 16:55 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

On Tue, Aug 11, 2020 at 04:47:54PM +0200, Michal Koutny wrote:
> On Thu, Aug 06, 2020 at 09:37:17PM -0700, Roman Gushchin <guro@fb.com> wrote:
> > In general, yes. But in this case I think it wouldn't be a good idea:
> > most often cgroups are created by a centralized daemon (systemd),
> > which is usually located in the root cgroup. Even if it's not located in
> > the root cgroup, limiting its memory will likely affect the whole system,
> > even if only one specific limit was reached.
> The generic scheme would be (assuming the no internal process
> constraint, in the root too)
> 
> ` root or delegated root
>   ` manager-cgroup (systemd, docker, ...)
>   ` [aggregation group(s)]
>     ` job-group-1
>     ` ...
>     ` job-group-n
> 
> > If there is a containerized workload, which creates sub-cgroups,
> > charging its parent cgroup is perfectly effective.
> No dispute about this in either approaches.
> 
> > And the opposite, if we charge the cgroup of the process that created
> > a cgroup, we'll not cover the most common case: systemd creating
> > cgroups for all services in the system.
> What I mean is that systemd should be charged for the cgroup creation.
> Or more generally, any container/cgroup manager should be charged.
> Consider a leak where it doesn't remove spent cgroups; IMO the effect
> (throttling, reclaim) should be exercised on such a culprit.

As I said, there are 2 problems with charging systemd (or a similar daemon):
1) It often belongs to the root cgroup.
2) OOMing or failing some random memory allocations is a bad way
   to "communicate" a memory shortage to the daemon.
   What we really want is to prevent creating a huge number of cgroups
   (including dying cgroups) in some specific sub-tree(s).
   OOMing the daemon or returning -ENOMEM to some random syscalls
   will not help us to reach the goal and likely will bring a bad
   experience to a user.

In a generic case I don't see how we can charge the cgroup which
creates cgroups without solving these problems first.

And if there is a very special case where we have to limit it,
we can just add an additional layer:

` root or delegated root
   ` manager-parent-cgroup-with-a-limit
     ` manager-cgroup (systemd, docker, ...)
   ` [aggregation group(s)]
     ` job-group-1
     ` ...
     ` job-group-n

> 
> I don't think the existing workload (job-group-i above) should
> unnecessarily suffer when only the manager is acting up. Is that different
> from your idea?
> 
> > Right, but it's quite unusual for tasks from one cgroup to create sub-cgroups
> > in a completely different cgroup. In this particular case there are tons of other
> > ways a task from C00 can hurt C1.
> I agree with that.
> 
> 
> If I haven't overlooked anything, this should be the first case where
> cgroup-related structures are accounted (please correct me).
> So this is setting a precedent, if others prove useful to be accounted
> in the future too.

Right.

> I'm thinking about cpu_cgroup_css_alloc(), which can
> also allocate a lot (with a big CPU count). The current approach would lead to
> situations where matching cpu and memory csses need not exist, and that
> would need special handling.

I'd definitely charge the parent cgroup in all similar cases.

> 
> 
> > On Thu, Aug 06, 2020 at 09:16:03PM -0700, Andrew Morton wrote:
> > > These week-old issues appear to be significant.  Roman?  Or someone
> > > else?
> Despite my concerns, I don't think this is fundamental or impossible to
> change later, so it doesn't prevent the inclusion in 5.9 RC1.

Thank you!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 15:27   ` Johannes Weiner
@ 2020-08-11 17:06     ` Roman Gushchin
  2020-08-13  9:16       ` Naresh Kamboju
  0 siblings, 1 reply; 19+ messages in thread
From: Roman Gushchin @ 2020-08-11 17:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Michal Hocko, Shakeel Butt, linux-mm, kernel-team, linux-kernel

On Tue, Aug 11, 2020 at 11:27:37AM -0400, Johannes Weiner wrote:
> On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin wrote:
> > Memory cgroups are using large chunks of percpu memory to store vmstat
> > data.  Yet this memory is not accounted at all, so in the case when there
> > are many (dying) cgroups, it's not exactly clear where all the memory is.
> > 
> > Because the size of memory cgroup internal structures can dramatically
> > exceed the size of object or page which is pinning it in the memory, it's
> > not a good idea to simple ignore it.  It actually breaks the isolation
> > between cgroups.
> > 
> > Let's account the consumed percpu memory to the parent cgroup.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > Acked-by: Dennis Zhou <dennis@kernel.org>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thank you!

> 
> This makes sense, and the accounting is in line with how we track and
> distribute child creation quotas (cgroup.max.descendants and
> cgroup.max.depth) up the cgroup tree.
> 
> I have one minor comment that isn't a dealbreaker for me:
> 
> > @@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> >  	if (!pn)
> >  		return 1;
> >  
> > -	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> > +	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
> > +						 GFP_KERNEL_ACCOUNT);
> >  	if (!pn->lruvec_stat_local) {
> >  		kfree(pn);
> >  		return 1;
> >  	}
> >  
> > -	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> > +	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
> > +					       GFP_KERNEL_ACCOUNT);
> >  	if (!pn->lruvec_stat_cpu) {
> >  		free_percpu(pn->lruvec_stat_local);
> >  		kfree(pn);
> > @@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> >  		goto fail;
> >  	}
> >  
> > -	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> > +	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> > +						GFP_KERNEL_ACCOUNT);
> >  	if (!memcg->vmstats_local)
> >  		goto fail;
> >  
> > -	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> > +	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> > +						 GFP_KERNEL_ACCOUNT);
> >  	if (!memcg->vmstats_percpu)
> >  		goto fail;
> >  
> > @@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> >  	struct mem_cgroup *memcg;
> >  	long error = -ENOMEM;
> >  
> > +	memalloc_use_memcg(parent);
> >  	memcg = mem_cgroup_alloc();
> > +	memalloc_unuse_memcg();
> 
> The disconnect between 1) requesting accounting and 2) which cgroup to
> charge is making me uneasy. It makes mem_cgroup_alloc() a bit of a
> hand grenade, because accounting to the current task is almost
> guaranteed to be wrong if the use_memcg() annotation were to get lost
> in a refactor or not make it to a new caller of the function.
> 
> The saving grace is that mem_cgroup_alloc() is pretty unlikely to be
> used elsewhere. And pretending it's an independent interface would be
> overengineering. But how about the following in mem_cgroup_alloc() and
> alloc_mem_cgroup_per_node_info() to document that caller relationship:
> 
> 	/* We charge the parent cgroup, never the current task */
> 	WARN_ON_ONCE(!current->active_memcg);

I have nothing against it.

Andrew, can you please squash the following diff into the patch?

Thanks!

--

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 130093bdf74b..e25f2db7e61c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5137,6 +5137,9 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
        if (!pn)
                return 1;
 
+       /* We charge the parent cgroup, never the current task */
+       WARN_ON_ONCE(!current->active_memcg);
+
        pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
                                                 GFP_KERNEL_ACCOUNT);
        if (!pn->lruvec_stat_local) {
@@ -5219,6 +5222,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
                goto fail;
        }
 
+       /* We charge the parent cgroup, never the current task */
+       WARN_ON_ONCE(!current->active_memcg);
+
        memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
                                                GFP_KERNEL_ACCOUNT);
        if (!memcg->vmstats_local)

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 16:55           ` Roman Gushchin
@ 2020-08-11 18:32             ` Michal Koutný
  2020-08-11 19:32               ` Roman Gushchin
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Koutný @ 2020-08-11 18:32 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2118 bytes --]

On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin <guro@fb.com> wrote:
> As I said, there are 2 problems with charging systemd (or a similar daemon):
> 1) It often belongs to the root cgroup.
This doesn't hold for systemd (if we agree that systemd is the most
common case).

> 2) OOMing or failing some random memory allocations is a bad way
>    to "communicate" a memory shortage to the daemon.
>    What we really want is to prevent creating a huge number of cgroups
There's cgroup.max.descendants for that...

>    (including dying cgroups) in some specific sub-tree(s).
...oh, so is this limiting the number of cgroups or limiting resources
used?

>    OOMing the daemon or returning -ENOMEM to some random syscalls
>    will not help us to reach the goal and likely will bring a bad
>    experience to a user.
If we reach the situation when memory for cgroup operations is tight,
it'll disappoint the user either way.
My premise is that a running workload is more valuable than the
accompanying manager.

> In a generic case I don't see how we can charge the cgroup which
> creates cgroups without solving these problems first.
In my understanding, "onbehalveness" is a concept useful for various
kernel threads doing deferred work. Here it's promoted to user processes
managing cgroups.

> And if there is a very special case where we have to limit it,
> we can just add an additional layer:
> 
> ` root or delegated root
>    ` manager-parent-cgroup-with-a-limit
>      ` manager-cgroup (systemd, docker, ...)
>    ` [aggregation group(s)]
>      ` job-group-1
>      ` ...
>      ` job-group-n
If the charge goes to the parent of the created cgroup (job-group-i here),
then the layer adds nothing. Am I missing something?

> I'd definitely charge the parent cgroup in all similar cases.
(This would mandate the controllers on the unified hierarchy, which is
fine IMO.) Then the order of enabling controllers on a subtree (e.g.
cpu,memory vs memory,cpu) by the manager would yield different charging.
This seems wrong^W confusing to me. 


Thanks,
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 18:32             ` Michal Koutný
@ 2020-08-11 19:32               ` Roman Gushchin
  2020-08-12 16:28                 ` Michal Koutný
  0 siblings, 1 reply; 19+ messages in thread
From: Roman Gushchin @ 2020-08-11 19:32 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

On Tue, Aug 11, 2020 at 08:32:25PM +0200, Michal Koutny wrote:
> On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin <guro@fb.com> wrote:
> > As I said, there are 2 problems with charging systemd (or a similar daemon):
> > 1) It often belongs to the root cgroup.
> This doesn't hold for systemd (if we agree that systemd is the most
> common case).

Ok, it's better.

> 
> > 2) OOMing or failing some random memory allocations is a bad way
> >    to "communicate" a memory shortage to the daemon.
> >    What we really want is to prevent creating a huge number of cgroups
> There's cgroup.max.descendants for that...

cgroup.max.descendants limits the number of live cgroups; it can't limit
the number of dying cgroups.

> 
> >    (including dying cgroups) in some specific sub-tree(s).
> ...oh, so is this limiting the number of cgroups or limiting resources
> used?

My scenario is simple: there is a large machine, which has no memory
pressure for some time (e.g. is idle or running a workload with small
working set). Periodically running services create a lot of cgroups,
usually in system.slice. After some time a significant part of the whole
memory is getting consumed by dying cgroups and their percpu data.
Getting rid of it and reclaiming all memory is not always possible
(percpu gets fragmented relatively easily) and is time-consuming.

If we set memory.high on system.slice, it will create artificial
memory pressure once we're getting close to the limit. It will trigger
the reclaim of user pages and slab objects, so eventually we'll be able
to release dying cgroups as well.

You might say that it would work even without charging memcg internal
structures. The problem is that a small slab object can indirectly pin
a lot of (percpu) memory. If we don't take the indirectly pinned memory
into account, we likely won't apply enough memory pressure.

If we limit init.slice (where systemd seems to reside), as you suggest,
we'll eventually create thrashing in init.slice, followed by OOM.
I struggle to see how that makes a user's life better.

> 
> >    OOMing the daemon or returning -ENOMEM to some random syscalls
> >    will not help us to reach the goal and likely will bring a bad
> >    experience to a user.
> If we reach the situation when memory for cgroup operations is tight,
> it'll disappoint the user either way.
> My premise is that a running workload is more valuable than the
> accompanying manager.

The problem is that OOM-killing the accompanying manager won't release
resources or help to get rid of accumulated cgroups. So in the very
best case it will prevent new cgroups from being created (as well
as some other random operations from being performed). Most likely
the only way to "fix" this for a user will be to reboot the machine.

> 
> > In a generic case I don't see how we can charge the cgroup which
> > creates cgroups without solving these problems first.
> In my understanding, "onbehalveness" is a concept useful for various
> kernel threads doing deferred work. Here it's promoted to user processes
> managing cgroups.
> 
> > And if there is a very special case where we have to limit it,
> > we can just add an additional layer:
> > 
> > ` root or delegated root
> >    ` manager-parent-cgroup-with-a-limit
> >      ` manager-cgroup (systemd, docker, ...)
> >    ` [aggregation group(s)]
> >      ` job-group-1
> >      ` ...
> >      ` job-group-n
> If the charge goes to the parent of the created cgroup (job-group-i here),
> then the layer adds nothing. Am I missing something?

Sorry, I was wrong here, please ignore this part.

> 
> > I'd definitely charge the parent cgroup in all similar cases.
> (This would mandate the controllers on the unified hierarchy, which is
> fine IMO.) Then the order of enabling controllers on a subtree (e.g.
> cpu,memory vs memory,cpu) by the manager would yield different charging.
> This seems wrong^W confusing to me.

I agree it's confusing.

Thanks!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 19:32               ` Roman Gushchin
@ 2020-08-12 16:28                 ` Michal Koutný
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Koutný @ 2020-08-12 16:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Johannes Weiner, Michal Hocko, Shakeel Butt, linux-mm,
	kernel-team, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1026 bytes --]

On Tue, Aug 11, 2020 at 12:32:28PM -0700, Roman Gushchin <guro@fb.com> wrote:
> If we limit init.slice (where systemd seems to reside), as you suggest,
> we'll eventually create thrashing in init.slice, followed by OOM.
> I struggle to see how that makes a user's life better.
> [...]
> The problem is that OOM-killing the accompanying manager won't release
> resources or help to get rid of accumulated cgroups.
I see your point now. I focused on the creator because of the live
memcgs.

When the issue is the dying memcgs (c), they were effectively released
by their creator but are pinned by whatever remained after their life
(LRU pages, slab->obj_cgroups). Since these pins were created _from
within_ such a child (c), they're most readily removable by reclaiming
(hierarchically) close to c. (It'd be achievable by limiting the lowest
common ancestor of manager and its product (typically root) but that is
more clumsy and less effective.)

This is the reasoning that justifies the remote charge.

Thanks!
Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-11 17:06     ` Roman Gushchin
@ 2020-08-13  9:16       ` Naresh Kamboju
  2020-08-13 23:27         ` Stephen Rothwell
  0 siblings, 1 reply; 19+ messages in thread
From: Naresh Kamboju @ 2020-08-13  9:16 UTC (permalink / raw)
  To: Roman Gushchin, Linux-Next Mailing List, open list, linux-mm, Cgroups
  Cc: Johannes Weiner, Andrew Morton, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Michal Hocko, Shakeel Butt, Kernel Team,
	lkft-triage

The kernel warnings were noticed on linux-next 20200813 while booting
on arm64, arm, x86_64 and i386.

metadata:
  git branch: master
  git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
  git commit: e6d113aca646fb6a92b237340109237fd7a9c770
  git describe: next-20200813
  make_kernelversion: 5.8.0
  kernel-config:
https://builds.tuxbuild.com/YQHc_PpEV-DF8rU7N9tlIQ/kernel.config

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 130093bdf74b..e25f2db7e61c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5137,6 +5137,9 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>         if (!pn)
>                 return 1;
>
> +       /* We charge the parent cgroup, never the current task */
> +       WARN_ON_ONCE(!current->active_memcg);
> +
>         pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
>                                                  GFP_KERNEL_ACCOUNT);
>         if (!pn->lruvec_stat_local) {
> @@ -5219,6 +5222,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>                 goto fail;
>         }
>
> +       /* We charge the parent cgroup, never the current task */
> +       WARN_ON_ONCE(!current->active_memcg);

[    0.217404] ------------[ cut here ]------------
[    0.218038] WARNING: CPU: 0 PID: 0 at mm/memcontrol.c:5226 mem_cgroup_css_alloc+0x680/0x740
[    0.219188] Modules linked in:
[    0.219597] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.8.0-next-20200813 #1
[    0.220187] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-1 04/01/2014
[    0.221190] EIP: mem_cgroup_css_alloc+0x680/0x740
[    0.222190] Code: d6 17 5d ff 8d 65 f4 89 d8 5b 5e 5f 5d c3 8d 74
26 00 b8 58 39 6a d1 e8 fe 94 55 ff 8d 65 f4 89 d8 5b 5e 5f 5d c3 8d
74 26 00 <0f> 0b e9 01 fa ff ff 8d b4 26 00 00 00 00 66 90 bb f4 ff ff
ff ba
[    0.223188] EAX: 00000000 EBX: d13666c0 ECX: 00000cc0 EDX: 0000ffff
[    0.224187] ESI: 00000000 EDI: f4c11000 EBP: d1361f50 ESP: d1361f40
[    0.225188] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[    0.226190] CR0: 80050033 CR2: ffd19000 CR3: 115f8000 CR4: 00040690
[    0.227195] Call Trace:
[    0.227882]  ? _cond_resched+0x17/0x30
[    0.228195]  cgroup_init_subsys+0x66/0x12a
[    0.229193]  cgroup_init+0x118/0x323
[    0.230194]  start_kernel+0x43c/0x47d
[    0.231193]  i386_start_kernel+0x48/0x4a
[    0.232194]  startup_32_smp+0x164/0x168
[    0.233195] ---[ end trace dfcf9be7b40caf05 ]---
[    0.234208] ------------[ cut here ]------------
[    0.235192] WARNING: CPU: 0 PID: 0 at mm/memcontrol.c:5141 mem_cgroup_css_alloc+0x718/0x740
[    0.236187] Modules linked in:
[    0.236590] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W
  5.8.0-next-20200813 #1
[    0.237190] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-1 04/01/2014
[    0.238194] EIP: mem_cgroup_css_alloc+0x718/0x740
[    0.239191] Code: 48 ff e9 7c fd ff ff 8d 76 00 a1 b0 14 40 d1 e9
53 fc ff ff 8d b6 00 00 00 00 0f 0b 8d b6 00 00 00 00 0f 0b 8d b6 00
00 00 00 <0f> 0b e9 df f9 ff ff 90 89 f8 e8 29 0c 5c ff 89 f2 b8 10 f4
40 d1
[    0.240190] EAX: 00000000 EBX: f4c0c800 ECX: 00000000 EDX: d0eab660
[    0.241189] ESI: 00000000 EDI: f4c11000 EBP: d1361f50 ESP: d1361f40
[    0.242189] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[    0.243190] CR0: 80050033 CR2: ffd19000 CR3: 115f8000 CR4: 00040690
[    0.244188] Call Trace:
[    0.245191]  ? _cond_resched+0x17/0x30
[    0.245686]  cgroup_init_subsys+0x66/0x12a
[    0.246189]  cgroup_init+0x118/0x323
[    0.246654]  start_kernel+0x43c/0x47d
[    0.247189]  i386_start_kernel+0x48/0x4a
[    0.247697]  startup_32_smp+0x164/0x168
[    0.248188] ---[ end trace dfcf9be7b40caf06 ]---
[    0.248990] Last level iTLB entries: 4KB 512, 2MB 255, 4MB 127
[    0.249187] Last level dTLB entries: 4KB 512, 2MB 255, 4MB 127, 1GB 0


Full test log,
https://qa-reports.linaro.org/lkft/linux-next-oe/build/next-20200813/testrun/3061112/suite/linux-log-parser/test/check-kernel-warning-1665815/log

-- 
Linaro LKFT
https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup
  2020-08-13  9:16       ` Naresh Kamboju
@ 2020-08-13 23:27         ` Stephen Rothwell
  0 siblings, 0 replies; 19+ messages in thread
From: Stephen Rothwell @ 2020-08-13 23:27 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Roman Gushchin, Linux-Next Mailing List, open list, linux-mm,
	Cgroups, Johannes Weiner, Andrew Morton, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Michal Hocko, Shakeel Butt, Kernel Team,
	lkft-triage

[-- Attachment #1: Type: text/plain, Size: 659 bytes --]

Hi Naresh,

On Thu, 13 Aug 2020 14:46:51 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> The kernel warnings were noticed on linux-next 20200813 while booting
> on arm64, arm, x86_64 and i386.
> 
> metadata:
>   git branch: master
>   git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>   git commit: e6d113aca646fb6a92b237340109237fd7a9c770
>   git describe: next-20200813
>   make_kernelversion: 5.8.0
>   kernel-config:
> https://builds.tuxbuild.com/YQHc_PpEV-DF8rU7N9tlIQ/kernel.config

Actually, that commit is in Linus' tree.

It has been fixed today.  Thanks for reporting.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-08-13 23:27 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200623184515.4132564-1-guro@fb.com>
     [not found] ` <20200623184515.4132564-2-guro@fb.com>
2020-06-24  0:58   ` [PATCH v3 1/5] percpu: return number of released bytes from pcpu_free_area() Shakeel Butt
     [not found] ` <20200623184515.4132564-3-guro@fb.com>
2020-06-24  1:25   ` [PATCH v3 2/5] mm: memcg/percpu: account percpu memory to memory cgroups Shakeel Butt
     [not found] ` <20200623184515.4132564-4-guro@fb.com>
2020-06-24  1:35   ` [PATCH v3 3/5] mm: memcg/percpu: per-memcg percpu memory statistics Shakeel Butt
2020-08-11 15:05   ` Johannes Weiner
     [not found] ` <20200623184515.4132564-5-guro@fb.com>
2020-06-24  1:40   ` [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup Shakeel Butt
2020-06-24  1:49     ` Roman Gushchin
2020-07-29 17:10   ` Michal Koutný
2020-08-07  4:16     ` Andrew Morton
2020-08-07  4:37       ` Roman Gushchin
2020-08-10 19:33         ` Roman Gushchin
2020-08-11 14:47         ` Michal Koutný
2020-08-11 16:55           ` Roman Gushchin
2020-08-11 18:32             ` Michal Koutný
2020-08-11 19:32               ` Roman Gushchin
2020-08-12 16:28                 ` Michal Koutný
2020-08-11 15:27   ` Johannes Weiner
2020-08-11 17:06     ` Roman Gushchin
2020-08-13  9:16       ` Naresh Kamboju
2020-08-13 23:27         ` Stephen Rothwell
