* [PATCH 0/9] memcg accounting from OpenVZ
@ 2021-03-09  8:03 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-03-09  8:03 UTC (permalink / raw)
  To: cgroups, linux-mm; +Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov

For many years OpenVZ has accounted the memory of several kernel objects;
this helps us prevent host memory abuse from inside a memcg-limited container.

Vasily Averin (9):
  memcg: accounting for allocations called with disabled BH
  memcg: accounting for fib6_nodes cache
  memcg: accounting for ip6_dst_cache
  memcg: accounting for fib_rules
  memcg: accounting for ip_fib caches
  memcg: accounting for fasync_cache
  memcg: accounting for mnt_cache entries
  memcg: accounting for tty_struct objects
  memcg: accounting for ldt_struct objects

 arch/x86/kernel/ldt.c | 7 ++++---
 drivers/tty/tty_io.c  | 4 ++--
 fs/fcntl.c            | 3 ++-
 fs/namespace.c        | 5 +++--
 mm/memcontrol.c       | 2 +-
 net/core/fib_rules.c  | 4 ++--
 net/ipv4/fib_trie.c   | 4 ++--
 net/ipv6/ip6_fib.c    | 2 +-
 net/ipv6/route.c      | 2 +-
 9 files changed, 18 insertions(+), 15 deletions(-)

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 305+ messages in thread


* Re: [PATCH 0/9] memcg accounting from OpenVZ
@ 2021-03-09 21:12   ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-03-09 21:12 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Cgroups, Linux MM, Johannes Weiner, Michal Hocko, Vladimir Davydov

On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> For many years OpenVZ has accounted the memory of several kernel objects;
> this helps us prevent host memory abuse from inside a memcg-limited container.
>

The text is cryptic but I am assuming you wanted to say that OpenVZ
has remained on a kernel which was still on opt-out kmem accounting
i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
patches are needed, right?




* Re: [PATCH 0/9] memcg accounting from OpenVZ
@ 2021-03-10 10:17     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-03-10 10:17 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Cgroups, Linux MM, Johannes Weiner, Michal Hocko, Vladimir Davydov

On 3/10/21 12:12 AM, Shakeel Butt wrote:
> On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> For many years OpenVZ has accounted the memory of several kernel objects;
>> this helps us prevent host memory abuse from inside a memcg-limited container.
>>
> 
> The text is cryptic but I am assuming you wanted to say that OpenVZ
> has remained on a kernel which was still on opt-out kmem accounting
> i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
> patches are needed, right?

Something like this.
Frankly speaking, I don't fully understand which arguments I should provide to upstream
to enable accounting for some new kind of objects.

OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x linux kernels)
and we have accounted all required kernel objects by using our own patches.
When memcg was added to upstream, Vladimir Davydov added accounting of some objects
to upstream but skipped other ones.
Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
the "skipped" objects with our own patches just because we accounted such objects before.
We're working on a rebase to new kernels, and we prefer to push our old patches upstream.

Thank you,
	Vasily Averin




* Re: [PATCH 0/9] memcg accounting from OpenVZ
@ 2021-03-10 10:41       ` Michal Hocko
  0 siblings, 0 replies; 305+ messages in thread
From: Michal Hocko @ 2021-03-10 10:41 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Cgroups, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed 10-03-21 13:17:19, Vasily Averin wrote:
> On 3/10/21 12:12 AM, Shakeel Butt wrote:
> > On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >> For many years OpenVZ has accounted the memory of several kernel objects;
> >> this helps us prevent host memory abuse from inside a memcg-limited container.
> >>
> > 
> > The text is cryptic but I am assuming you wanted to say that OpenVZ
> > has remained on a kernel which was still on opt-out kmem accounting
> > i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
> > patches are needed, right?
> 
> Something like this.
> Frankly speaking, I don't fully understand which arguments I should provide to upstream
> to enable accounting for some new kind of objects.
> 
> OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x linux kernels)
> and we have accounted all required kernel objects by using our own patches.
> When memcg was added to upstream, Vladimir Davydov added accounting of some objects
> to upstream but skipped other ones.
> Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
> the "skipped" objects with our own patches just because we accounted such objects before.
> We're working on a rebase to new kernels, and we prefer to push our old patches upstream.

That is certainly interesting information. But for a changelog it
would be more appropriate to provide information about how much memory
a user can induce and whether there is any way to limit that memory by
other means, how practical those other means are, and which usecases
would benefit from the containment.

-- 
Michal Hocko
SUSE Labs




* Re: [PATCH 0/9] memcg accounting from OpenVZ
@ 2021-03-11  7:00         ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-03-11  7:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Cgroups, Linux MM, Johannes Weiner, Vladimir Davydov

On 3/10/21 1:41 PM, Michal Hocko wrote:
> On Wed 10-03-21 13:17:19, Vasily Averin wrote:
>> On 3/10/21 12:12 AM, Shakeel Butt wrote:
>>> On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>>>
>>>> For many years OpenVZ has accounted the memory of several kernel objects;
>>>> this helps us prevent host memory abuse from inside a memcg-limited container.
>>>
>>> The text is cryptic but I am assuming you wanted to say that OpenVZ
>>> has remained on a kernel which was still on opt-out kmem accounting
>>> i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
>>> patches are needed, right?
>>
>> Something like this.
>> Frankly speaking, I don't fully understand which arguments I should provide to upstream
>> to enable accounting for some new kind of objects.
>>
>> OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x linux kernels)
>> and we have accounted all required kernel objects by using our own patches.
>> When memcg was added to upstream, Vladimir Davydov added accounting of some objects
>> to upstream but skipped other ones.
>> Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
>> the "skipped" objects with our own patches just because we accounted such objects before.
>> We're working on a rebase to new kernels, and we prefer to push our old patches upstream.
> 
> That is certainly interesting information. But for a changelog it
> would be more appropriate to provide information about how much memory
> a user can induce and whether there is any way to limit that memory by
> other means, how practical those other means are, and which usecases
> would benefit from the containment.

Right now I would like to understand how I should justify my requests for
accounting of new kinds of objects.

Which description is enough to enable object accounting?
Could you please specify some edge rules?
Should I push such patches through this list?
Or is it better to send them to the mailing lists of the corresponding subsystems?
Should I notify them somehow at least?

"untrusted netadmin inside memcg-limited container can create unlimited number of routing entries, trigger OOM on host that will be unable to find the reason of memory shortage and kill huge"

"each mount inside memcg-limited container creates non-accounted mount object,
 but new mount namespace creation consumes huge piece of non-accounted memory for cloned mounts"

"unprivileged user inside memcg-limited container can create non-accounted multi-page per-thread kernel objects for LDT"

"non-accounted multi-page tty objects can be created from inside memcg-limited container"

"unprivileged user inside memcg-limited container can trigger creation of huge number of non-accounted fasync_struct objects"
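For illustration, the first scenario boils down to a loop like the following sketch (hypothetical device name, documentation-only 2001:db8::/32 prefix; the commands are only printed here, since actually adding routes needs CAP_NET_ADMIN):

```shell
# Print the 'ip -6 route add' commands an untrusted netadmin could issue
# inside a container; each added route allocates fib6_node/ip6_dst_cache
# objects that the memcg limit cannot see without kmem accounting.
n=1000
for i in $(seq 1 "$n"); do
	printf 'ip -6 route add 2001:db8:%x::/64 dev eth0\n' "$i"
done | wc -l	# one command per prospective routing entry
```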

Thank you,
	Vasily Averin




* Re: [PATCH 0/9] memcg accounting from OpenVZ
@ 2021-03-11  8:35           ` Michal Hocko
  0 siblings, 0 replies; 305+ messages in thread
From: Michal Hocko @ 2021-03-11  8:35 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Cgroups, Linux MM, Johannes Weiner, Vladimir Davydov

On Thu 11-03-21 10:00:17, Vasily Averin wrote:
> On 3/10/21 1:41 PM, Michal Hocko wrote:
[...]
> > That is certainly interesting information. But for a changelog it
> > would be more appropriate to provide information about how much memory
> > a user can induce and whether there is any way to limit that memory by
> > other means, how practical those other means are, and which usecases
> > would benefit from the containment.
> 
> Right now I would like to understand how I should justify my requests for
> accounting of new kinds of objects.
> 
> Which description is enough to enable object accounting?

Doesn't the above paragraph give you a hint?

> Could you please specify some edge rules?

There are no strong rules AFAIK. I would say that what is important is
that the user can trigger a lot of objects, or an unbounded amount of them.

> Should I push such patches through this list?

Yes, linux-mm with the memcg maintainers CCed is the proper way. It would
be great to CC the maintainers of the affected subsystems as well.

> Or is it better to send them to the mailing lists of the corresponding subsystems?

> Should I notify them somehow at least?
> 
> "untrusted netadmin inside memcg-limited container can create unlimited number of routing entries, trigger OOM on host that will be unable to find the reason of memory shortage and kill huge"
> 
> "each mount inside memcg-limited container creates non-accounted mount object,
>  but new mount namespace creation consumes huge piece of non-accounted memory for cloned mounts"
> 
> "unprivileged user inside memcg-limited container can create non-accounted multi-page per-thread kernel objects for LDT"
> 
> "non-accounted multi-page tty objects can be created from inside memcg-limited container"
> 
> "unprivileged user inside memcg-limited container can trigger creation of huge number of non-accounted fasync_struct objects"

OK, that sounds better to me. It would also be great if you could mention
whether there are any other means to limit those objects.

Thanks!
-- 
Michal Hocko
SUSE Labs




* Re: [PATCH 0/9] memcg accounting from OpenVZ
@ 2021-03-11 15:14           ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-03-11 15:14 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, Cgroups, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Mar 10, 2021 at 11:00 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 3/10/21 1:41 PM, Michal Hocko wrote:
> > On Wed 10-03-21 13:17:19, Vasily Averin wrote:
> >> On 3/10/21 12:12 AM, Shakeel Butt wrote:
> >>> On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>>>
> >>>> For many years OpenVZ has accounted the memory of several kernel objects;
> >>>> this helps us prevent host memory abuse from inside a memcg-limited container.
> >>>
> >>> The text is cryptic but I am assuming you wanted to say that OpenVZ
> >>> has remained on a kernel which was still on opt-out kmem accounting
> >>> i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
> >>> patches are needed, right?
> >>
> >> Something like this.
> >> Frankly speaking, I don't fully understand which arguments I should provide to upstream
> >> to enable accounting for some new kind of objects.
> >>
> >> OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x linux kernels)
> >> and we have accounted all required kernel objects by using our own patches.
> >> When memcg was added to upstream, Vladimir Davydov added accounting of some objects
> >> to upstream but skipped other ones.
> >> Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
> >> the "skipped" objects with our own patches just because we accounted such objects before.
> >> We're working on a rebase to new kernels, and we prefer to push our old patches upstream.
> >
> > That is certainly interesting information. But for a changelog it
> > would be more appropriate to provide information about how much memory
> > a user can induce and whether there is any way to limit that memory by
> > other means, how practical those other means are, and which usecases
> > would benefit from the containment.
>
> Right now I would like to understand how I should justify my requests for
> accounting of new kinds of objects.
>
> Which description is enough to enable object accounting?
> Could you please specify some edge rules?
> Should I push such patches through this list?
> Or is it better to send them to the mailing lists of the corresponding subsystems?
> Should I notify them somehow at least?
>
> "untrusted netadmin inside memcg-limited container can create unlimited number of routing entries, trigger OOM on host that will be unable to find the reason of memory shortage and kill huge"
>
> "each mount inside memcg-limited container creates non-accounted mount object,
>  but new mount namespace creation consumes huge piece of non-accounted memory for cloned mounts"
>
> "unprivileged user inside memcg-limited container can create non-accounted multi-page per-thread kernel objects for LDT"
>
> "non-accounted multi-page tty objects can be created from inside memcg-limited container"
>
> "unprivileged user inside memcg-limited container can trigger creation of huge number of non-accounted fasync_struct objects"
>

I think the above reasoning is good enough. Just resend your patches
with the corresponding details.




* [PATCH v2 0/8] memcg accounting from OpenVZ
       [not found]           ` <YEnWUrYOArju66ym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-03-15 12:22             ` Vasily Averin
  2021-04-22 10:36                 ` Vasily Averin
                                 ` (6 more replies)
  2021-03-15 12:23             ` [PATCH v2 1/8] memcg: accounting for fib6_nodes cache Vasily Averin
                               ` (7 subsequent siblings)
  8 siblings, 7 replies; 305+ messages in thread
From: Vasily Averin @ 2021-03-15 12:22 UTC (permalink / raw)
  To: cgroups, Michal Hocko
  Cc: linux-mm, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt

OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x linux kernels)
and we have accounted all required kernel objects by using our own patches.
When memcg was added to upstream, Vladimir Davydov added accounting of some objects
to upstream but skipped other ones.
Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
the "skipped" objects with our own patches just because we accounted such objects before.
We're working on a rebase to new kernels, and we prefer to push our old patches upstream.

v2:
- squashed old patch 1, "accounting for allocations called with disabled BH",
  into old patch 2, "accounting for fib6_nodes cache", which uses this kind
  of memory allocation
- improved patch descriptions
- subsystem maintainers added to cc:

Vasily Averin (8):
  memcg: accounting for fib6_nodes cache
  memcg: accounting for ip6_dst_cache
  memcg: accounting for fib_rules
  memcg: accounting for ip_fib caches
  memcg: accounting for fasync_cache
  memcg: accounting for mnt_cache entries
  memcg: accounting for tty_struct objects
  memcg: accounting for ldt_struct objects

 arch/x86/kernel/ldt.c | 7 ++++---
 drivers/tty/tty_io.c  | 4 ++--
 fs/fcntl.c            | 3 ++-
 fs/namespace.c        | 5 +++--
 mm/memcontrol.c       | 2 +-
 net/core/fib_rules.c  | 4 ++--
 net/ipv4/fib_trie.c   | 4 ++--
 net/ipv6/ip6_fib.c    | 2 +-
 net/ipv6/route.c      | 2 +-
 9 files changed, 18 insertions(+), 15 deletions(-)

-- 
1.8.3.1



* [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
       [not found]           ` <YEnWUrYOArju66ym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2021-03-15 12:22             ` [PATCH v2 0/8] " Vasily Averin
@ 2021-03-15 12:23             ` Vasily Averin
  2021-03-15 15:13                 ` David Ahern
                                 ` (2 more replies)
  2021-03-15 12:23             ` [PATCH v2 2/8] memcg: accounting for ip6_dst_cache Vasily Averin
                               ` (6 subsequent siblings)
  8 siblings, 3 replies; 305+ messages in thread
From: Vasily Averin @ 2021-03-15 12:23 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, David S. Miller,
	Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski

An untrusted netadmin inside a memcg-limited container can create a
huge number of routing entries. Currently, the allocated kernel objects
are not accounted to the proper memcg, so this can lead to global memory
shortage on the host and cause a lot of OOM kills.

One such object is the 'struct fib6_node', mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to the
corresponding kmem cache: the proper memory cgroup still cannot be found
due to the incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
To verify that the caller is really not executing in process context,
the '!in_task()' check should be used instead.
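
For reference, a simplified sketch of the preempt_count logic involved
(based on include/linux/preempt.h; the exact macros differ slightly
between kernel versions):

```c
/*
 * Why in_interrupt() is the wrong test inside a *_bh() section:
 *
 * local_bh_disable() (and thus write_lock_bh()) raises the SOFTIRQ
 * part of preempt_count, so under the lock:
 *
 *   in_interrupt() -> true, because it checks the whole softirq
 *                     count, including the bh-disable offset;
 *   in_task()      -> still true, because it excludes only hardirq,
 *                     NMI and *serving*-softirq contexts.
 *
 * So with in_interrupt(), memcg_kmem_bypass() skips charging even
 * though we run in process context with a valid current->mm, while
 * !in_task() bypasses only genuine irq/softirq context.
 */
```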
---
 mm/memcontrol.c    | 2 +-
 net/ipv6/ip6_fib.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 845eec0..568f2cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1076,7 +1076,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index ef9d022..fa92ed1 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2445,7 +2445,7 @@ int __init fib6_init(void)
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
 					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   0, SLAB_HWCACHE_ALIGN|SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v2 2/8] memcg: accounting for ip6_dst_cache
       [not found]           ` <YEnWUrYOArju66ym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2021-03-15 12:22             ` [PATCH v2 0/8] " Vasily Averin
  2021-03-15 12:23             ` [PATCH v2 1/8] memcg: accounting for fib6_nodes cache Vasily Averin
@ 2021-03-15 12:23             ` Vasily Averin
  2021-03-15 15:14                 ` David Ahern
  2021-03-15 12:23             ` [PATCH v2 3/8] memcg: accounting for fib_rules Vasily Averin
                               ` (5 subsequent siblings)
  8 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-03-15 12:23 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, David S. Miller,
	Hideaki YOSHIFUJI, Jakub Kicinski, David Ahern

An untrusted netadmin inside a memcg-limited container can create a
huge number of routing entries. Currently, the allocated kernel objects
are not accounted to the proper memcg, so this can lead to global memory
shortage on the host and cause a lot of OOM kills.

This patch enables accounting for 'struct rt6_info' allocations.
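
As a rough illustration of the abuse vector (a sketch only: the
2001:db8::/32 prefix, the count, and 'dev lo' are arbitrary
placeholders), a container netadmin can mass-create such routes; the
loop below is a dry run that prints the commands instead of running them:

```shell
#!/bin/sh
# Dry-run sketch: each real 'ip -6 route add' would allocate an
# rt6_info (ip6_dst_cache) plus fib6_node objects in the kernel.
# Replace 'echo' with the real command (needs CAP_NET_ADMIN).
i=0
n=4        # use thousands for an actual reproducer
while [ "$i" -lt "$n" ]; do
    echo ip -6 route add "2001:db8:$(printf '%x' "$i")::/64" dev lo
    i=$((i + 1))
done
```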
---
 net/ipv6/route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1536f49..d1d7cdf 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6526,7 +6526,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v2 3/8] memcg: accounting for fib_rules
       [not found]           ` <YEnWUrYOArju66ym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
                               ` (2 preceding siblings ...)
  2021-03-15 12:23             ` [PATCH v2 2/8] memcg: accounting for ip6_dst_cache Vasily Averin
@ 2021-03-15 12:23             ` Vasily Averin
  2021-03-15 15:14                 ` David Ahern
  2021-03-15 12:23             ` [PATCH v2 4/8] memcg: accounting for ip_fib caches Vasily Averin
                               ` (4 subsequent siblings)
  8 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-03-15 12:23 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, David S. Miller, David Ahern,
	Jakub Kicinski, Hideaki YOSHIFUJI

An untrusted netadmin inside a memcg-limited container can create a
huge number of routing entries. Currently, the allocated kernel objects
are not accounted to the proper memcg, so this can lead to global memory
shortage on the host and cause a lot of OOM kills.

This patch enables accounting for 'struct fib_rule' objects.
---
 net/core/fib_rules.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index cd80ffe..65d8b1d 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v2 4/8] memcg: accounting for ip_fib caches
       [not found]           ` <YEnWUrYOArju66ym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
                               ` (3 preceding siblings ...)
  2021-03-15 12:23             ` [PATCH v2 3/8] memcg: accounting for fib_rules Vasily Averin
@ 2021-03-15 12:23             ` Vasily Averin
  2021-03-15 15:15                 ` David Ahern
  2021-03-15 12:23             ` [PATCH v2 5/8] memcg: accounting for fasync_cache Vasily Averin
                               ` (3 subsequent siblings)
  8 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-03-15 12:23 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, David S. Miller, David Ahern,
	Jakub Kicinski, Hideaki YOSHIFUJI

An untrusted netadmin inside a memcg-limited container can create a
huge number of routing entries. Currently, the allocated kernel objects
are not accounted to the proper memcg, so this can lead to global memory
shortage on the host and cause a lot of OOM kills.

This patch enables accounting for the ip_fib_alias and ip_fib_trie caches.
---
 net/ipv4/fib_trie.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v2 5/8] memcg: accounting for fasync_cache
       [not found]           ` <YEnWUrYOArju66ym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
                               ` (4 preceding siblings ...)
  2021-03-15 12:23             ` [PATCH v2 4/8] memcg: accounting for ip_fib caches Vasily Averin
@ 2021-03-15 12:23             ` Vasily Averin
  2021-03-15 15:56                 ` Shakeel Butt
  2021-03-15 12:23             ` [PATCH v2 6/8] memcg: accounting for mnt_cache entries Vasily Averin
                               ` (2 subsequent siblings)
  8 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-03-15 12:23 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, Jeff Layton, J. Bruce Fields,
	Alexander Viro

An unprivileged user inside a memcg-limited container can trigger the
creation of a huge number of non-accounted fasync_struct objects.
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index dfc72f1..7941559 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v2 6/8] memcg: accounting for mnt_cache entries
       [not found]           ` <YEnWUrYOArju66ym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
                               ` (5 preceding siblings ...)
  2021-03-15 12:23             ` [PATCH v2 5/8] memcg: accounting for fasync_cache Vasily Averin
@ 2021-03-15 12:23             ` Vasily Averin
  2021-03-15 12:23             ` [PATCH v2 7/8] memcg: accounting for tty_struct objects Vasily Averin
  2021-03-15 12:24             ` [PATCH v2 8/8] memcg: accounting for ldt_struct objects Vasily Averin
  8 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-03-15 12:23 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, Alexander Viro

Each mount inside a memcg-limited container creates a non-accounted mount
object, and each new mount namespace allocates a lot of them for cloned mounts.
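
For scale, a back-of-the-envelope sketch (both counts and the ~320-byte
sizeof(struct mount) estimate are assumptions, not measured values):
every CLONE_NEWNS copies the caller's mount tree, so the objects multiply:

```shell
#!/bin/sh
# Rough arithmetic for unaccounted 'struct mount' objects created by
# cloned mount namespaces; all numbers are illustrative estimates.
mounts=100        # mounts visible inside the container
namespaces=1000   # mount namespaces an unprivileged user can create
size=320          # assumed sizeof(struct mount), bytes
objects=$((mounts * namespaces))
echo "$objects mount objects"
echo "~$((objects * size / 1024)) KiB unaccounted"
```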
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 56bb5a5..d550c2a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4213,7 +4214,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v2 7/8] memcg: accounting for tty_struct objects
       [not found]           ` <YEnWUrYOArju66ym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
                               ` (6 preceding siblings ...)
  2021-03-15 12:23             ` [PATCH v2 6/8] memcg: accounting for mnt_cache entries Vasily Averin
@ 2021-03-15 12:23             ` Vasily Averin
       [not found]               ` <61134897-703e-a2a8-6f0b-0bf6e1b79dda-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  2021-03-15 12:24             ` [PATCH v2 8/8] memcg: accounting for ldt_struct objects Vasily Averin
  8 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-03-15 12:23 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, Greg Kroah-Hartman, Jiri Slaby

Non-accounted multi-page tty-related kernel objects can be created
from inside a memcg-limited container.
---
 drivers/tty/tty_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 74733ec..a3b881b 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1503,7 +1503,7 @@ void tty_save_termios(struct tty_struct *tty)
 	/* Stash the termios data */
 	tp = tty->driver->termios[idx];
 	if (tp == NULL) {
-		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
+		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
 		if (tp == NULL)
 			return;
 		tty->driver->termios[idx] = tp;
@@ -3128,7 +3128,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
 {
 	struct tty_struct *tty;
 
-	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
+	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
 	if (!tty)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v2 8/8] memcg: accounting for ldt_struct objects
       [not found]           ` <YEnWUrYOArju66ym-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
                               ` (7 preceding siblings ...)
  2021-03-15 12:23             ` [PATCH v2 7/8] memcg: accounting for tty_struct objects Vasily Averin
@ 2021-03-15 12:24             ` Vasily Averin
  2021-03-15 13:27                 ` Borislav Petkov
  8 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-03-15 12:24 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86-DgEjT+Ai2ygdnm+yROfE0A

An unprivileged user inside a memcg-limited container can create
non-accounted multi-page per-thread kernel objects for the LDT.
---
 arch/x86/kernel/ldt.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index aa15132..a1889a0 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -154,7 +154,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
 
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);
 	if (!new_ldt)
 		return NULL;
 
@@ -168,9 +168,10 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	 * than PAGE_SIZE.
 	 */
 	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
+		new_ldt->entries = __vmalloc(alloc_size,
+					     GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
+		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 
 	if (!new_ldt->entries) {
 		kfree(new_ldt);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* Re: [PATCH v2 7/8] memcg: accounting for tty_struct objects
       [not found]               ` <61134897-703e-a2a8-6f0b-0bf6e1b79dda-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-03-15 12:40                 ` Greg Kroah-Hartman
  0 siblings, 0 replies; 305+ messages in thread
From: Greg Kroah-Hartman @ 2021-03-15 12:40 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, Jiri Slaby

On Mon, Mar 15, 2021 at 03:23:53PM +0300, Vasily Averin wrote:
> Non-accounted multi-page tty-related kenrel objects can be created
> from inside memcg-limited container.
> ---
>  drivers/tty/tty_io.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
> index 74733ec..a3b881b 100644
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -1503,7 +1503,7 @@ void tty_save_termios(struct tty_struct *tty)
>  	/* Stash the termios data */
>  	tp = tty->driver->termios[idx];
>  	if (tp == NULL) {
> -		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
> +		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
>  		if (tp == NULL)
>  			return;
>  		tty->driver->termios[idx] = tp;
> @@ -3128,7 +3128,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
>  {
>  	struct tty_struct *tty;
>  
> -	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
> +	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
>  	if (!tty)
>  		return NULL;
>  
> -- 
> 1.8.3.1
> 


Hi,

This is the friendly patch-bot of Greg Kroah-Hartman.  You have sent him
a patch that has triggered this response.  He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created.  Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.

You are receiving this message because of the following common error(s)
as indicated below:

- Your patch does not have a Signed-off-by: line.  Please read the
  kernel file, Documentation/SubmittingPatches and resend it after
  adding that line.  Note, the line needs to be in the body of the
  email, before the patch, not at the bottom of the patch or in the
  email signature.

- You did not specify a description of why the patch is needed, or
  possibly, any description at all, in the email body.  Please read the
  section entitled "The canonical patch format" in the kernel file,
  Documentation/SubmittingPatches for what is needed in order to
  properly describe the change.

If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.

thanks,

greg k-h's patch email bot

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v2 8/8] memcg: accounting for ldt_struct objects
@ 2021-03-15 13:27                 ` Borislav Petkov
  0 siblings, 0 replies; 305+ messages in thread
From: Borislav Petkov @ 2021-03-15 13:27 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, Michal Hocko, linux-mm, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, Thomas Gleixner, Ingo Molnar,
	x86

On Mon, Mar 15, 2021 at 03:24:01PM +0300, Vasily Averin wrote:
> Unprivileged user inside memcg-limited container can create
> non-accounted multi-page per-thread kernel objects for LDT

I have a hard time parsing this commit message.

And I'm CCed only on patch 8 of what looks like a patchset.

And that patchset is not on lkml so I can't find the rest to read about
it, perhaps linux-mm.

/me goes and finds it on lore

I can see some bits and pieces, this, for example:

https://lore.kernel.org/linux-mm/05c448c7-d992-8d80-b423-b80bf5446d7c@virtuozzo.com/

 ( Btw, that version has your SOB and this patch doesn't even have a
   Signed-off-by. Next time, run your whole set through checkpatch please
   before sending. )

Now, this URL above talks about OOM, ok, that gets me close to the "why"
this patch.

From a quick look at the ldt.c code, we allow a single LDT struct per
mm. Manpage says so too:

DESCRIPTION
       modify_ldt()  reads  or  writes  the local descriptor table (LDT) for a process.
       The LDT is an array of segment descriptors that can be referenced by user  code.
       Linux  allows  processes  to configure a per-process (actually per-mm) LDT.

We allow

/* Maximum number of LDT entries supported. */
#define LDT_ENTRIES     8192

so there's an upper limit per mm.

Now, please explain what is this accounting for?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette




^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
@ 2021-03-15 15:13                 ` David Ahern
  0 siblings, 0 replies; 305+ messages in thread
From: David Ahern @ 2021-03-15 15:13 UTC (permalink / raw)
  To: Vasily Averin, cgroups, Michal Hocko
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Shakeel Butt,
	David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski

On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to proper memcg, so this can lead to global memory
> shortage on the host and cause lot of OOM kiils.
> 
> One such object is the 'struct fib6_node' mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> 
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
> 
> It this case is not enough to simply add SLAB_ACCOUNT to corresponding
> kmem cache. The proper memory cgroup still cannot be found due to the
> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> To be sure that caller is not executed in process contxt
> '!in_task()' check should be used instead
> ---
>  mm/memcontrol.c    | 2 +-
>  net/ipv6/ip6_fib.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>




^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v2 2/8] memcg: accounting for ip6_dst_cache
@ 2021-03-15 15:14                 ` David Ahern
  0 siblings, 0 replies; 305+ messages in thread
From: David Ahern @ 2021-03-15 15:14 UTC (permalink / raw)
  To: Vasily Averin, cgroups, Michal Hocko
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Shakeel Butt,
	David S. Miller, Hideaki YOSHIFUJI, Jakub Kicinski, David Ahern

On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to proper memcg, so this can lead to global memory
> shortage on the host and cause lot of OOM kiils.
> 
> This patches enables accounting for 'struct rt6_info' allocations.
> ---
>  net/ipv6/route.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>




^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v2 3/8] memcg: accounting for fib_rules
@ 2021-03-15 15:14                 ` David Ahern
  0 siblings, 0 replies; 305+ messages in thread
From: David Ahern @ 2021-03-15 15:14 UTC (permalink / raw)
  To: Vasily Averin, cgroups, Michal Hocko
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Shakeel Butt,
	David S. Miller, David Ahern, Jakub Kicinski, Hideaki YOSHIFUJI

On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to proper memcg, so this can lead to global memory
> shortage on the host and cause lot of OOM kiils.
> 
> This patch enables accounting for 'struct fib_rules'
> ---
>  net/core/fib_rules.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v2 4/8] memcg: accounting for ip_fib caches
@ 2021-03-15 15:15                 ` David Ahern
  0 siblings, 0 replies; 305+ messages in thread
From: David Ahern @ 2021-03-15 15:15 UTC (permalink / raw)
  To: Vasily Averin, cgroups, Michal Hocko
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Shakeel Butt,
	David S. Miller, David Ahern, Jakub Kicinski, Hideaki YOSHIFUJI

On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to proper memcg, so this can lead to global memory
> shortage on the host and cause lot of OOM kiils.
> 
> This patch enables accounting for ip_fib_alias and ip_fib_trie caches
> ---
>  net/ipv4/fib_trie.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>




^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
@ 2021-03-15 15:23                 ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-03-15 15:23 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Cgroups, Michal Hocko, Linux MM, Johannes Weiner,
	Vladimir Davydov, David S. Miller, Hideaki YOSHIFUJI,
	David Ahern, Jakub Kicinski

On Mon, Mar 15, 2021 at 5:23 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to proper memcg, so this can lead to global memory
> shortage on the host and cause lot of OOM kiils.
>
> One such object is the 'struct fib6_node' mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
>
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
>
> It this case is

'In this case it is'

> not enough to simply add SLAB_ACCOUNT to corresponding
> kmem cache. The proper memory cgroup still cannot be found due to the
> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> To be sure that caller is not executed in process contxt

'context'

> '!in_task()' check should be used instead

You missed the signoff and it seems like the whole series is missing
it as well. Please run scripts/checkpatch.pl on the patches before
sending again.

> ---
>  mm/memcontrol.c    | 2 +-
>  net/ipv6/ip6_fib.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 845eec0..568f2cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1076,7 +1076,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>                 return false;
>
>         /* Memcg to charge can't be determined. */
> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))

Can you please also add some explanation in the commit message on the
differences between in_interrupt() and in_task()? Why is
in_interrupt() not correct here but !in_task() is? What about kernels
with or without PREEMPT_COUNT?

>                 return true;
>
>         return false;
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index ef9d022..fa92ed1 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -2445,7 +2445,7 @@ int __init fib6_init(void)
>
>         fib6_node_kmem = kmem_cache_create("fib6_nodes",
>                                            sizeof(struct fib6_node),
> -                                          0, SLAB_HWCACHE_ALIGN,
> +                                          0, SLAB_HWCACHE_ALIGN|SLAB_ACCOUNT,
>                                            NULL);
>         if (!fib6_node_kmem)
>                 goto out;
> --
> 1.8.3.1
>



* Re: [PATCH v2 8/8] memcg: accounting for ldt_struct objects
@ 2021-03-15 15:48                   ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-03-15 15:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Vasily Averin, Cgroups, Michal Hocko, Linux MM, Johannes Weiner,
	Vladimir Davydov, Thomas Gleixner, Ingo Molnar, x86

On Mon, Mar 15, 2021 at 6:27 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Mon, Mar 15, 2021 at 03:24:01PM +0300, Vasily Averin wrote:
> > Unprivileged user inside memcg-limited container can create
> > non-accounted multi-page per-thread kernel objects for LDT
>
> I have a hard time parsing this commit message.
>
> And I'm CCed only on patch 8 of what looks like a patchset.
>
> And that patchset is not on lkml so I can't find the rest to read about
> it, perhaps linux-mm.
>
> /me goes and finds it on lore
>
> I can see some bits and pieces, this, for example:
>
> https://lore.kernel.org/linux-mm/05c448c7-d992-8d80-b423-b80bf5446d7c@virtuozzo.com/
>
>  ( Btw, that version has your SOB and this patch doesn't even have a
>    Signed-off-by. Next time, run your whole set through checkpatch please
>    before sending. )
>
> Now, this URL above talks about OOM, ok, that gets me close to the "why"
> this patch.
>
> From a quick look at the ldt.c code, we allow a single LDT struct per
> mm. Manpage says so too:
>
> DESCRIPTION
>        modify_ldt()  reads  or  writes  the local descriptor table (LDT) for a process.
>        The LDT is an array of segment descriptors that can be referenced by user  code.
>        Linux  allows  processes  to configure a per-process (actually per-mm) LDT.
>
> We allow
>
> /* Maximum number of LDT entries supported. */
> #define LDT_ENTRIES     8192
>
> so there's an upper limit per mm.
>
> Now, please explain what is this accounting for?
>

Let me try to provide the reasoning at least from my perspective.
There are legitimate workloads with hundreds of processes and there
can be hundreds of workloads running on large machines. The
unaccounted memory can cause isolation issues between the workloads
particularly on highly utilized machines.



* Re: [PATCH v2 5/8] memcg: accounting for fasync_cache
@ 2021-03-15 15:56                 ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-03-15 15:56 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Cgroups, Michal Hocko, Linux MM, Johannes Weiner,
	Vladimir Davydov, Jeff Layton, J. Bruce Fields, Alexander Viro

On Mon, Mar 15, 2021 at 5:23 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> An unprivileged user inside a memcg-limited container can trigger
> the creation of a huge number of non-accounted fasync_struct objects

You need to make each patch of this series self-contained by including
the motivation behind the series (just one or two sentences). For
example, for this patch include what's the potential impact of these
huge numbers of unaccounted fasync_struct objects?

> ---
>  fs/fcntl.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index dfc72f1..7941559 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
>                         __FMODE_EXEC | __FMODE_NONOTIFY));
>
>         fasync_cache = kmem_cache_create("fasync_cache",
> -               sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
> +                                        sizeof(struct fasync_struct), 0,
> +                                        SLAB_PANIC | SLAB_ACCOUNT, NULL);
>         return 0;
>  }
>
> --
> 1.8.3.1
>



* Re: [PATCH v2 8/8] memcg: accounting for ldt_struct objects
@ 2021-03-15 15:58                     ` Michal Hocko
  0 siblings, 0 replies; 305+ messages in thread
From: Michal Hocko @ 2021-03-15 15:58 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Borislav Petkov, Vasily Averin, Cgroups, Linux MM,
	Johannes Weiner, Vladimir Davydov, Thomas Gleixner, Ingo Molnar,
	x86

On Mon 15-03-21 08:48:26, Shakeel Butt wrote:
> On Mon, Mar 15, 2021 at 6:27 AM Borislav Petkov <bp@alien8.de> wrote:
> >
> > On Mon, Mar 15, 2021 at 03:24:01PM +0300, Vasily Averin wrote:
> > > Unprivileged user inside memcg-limited container can create
> > > non-accounted multi-page per-thread kernel objects for LDT
> >
> > I have a hard time parsing this commit message.
> >
> > And I'm CCed only on patch 8 of what looks like a patchset.
> >
> > And that patchset is not on lkml so I can't find the rest to read about
> > it, perhaps linux-mm.
> >
> > /me goes and finds it on lore
> >
> > I can see some bits and pieces, this, for example:
> >
> > https://lore.kernel.org/linux-mm/05c448c7-d992-8d80-b423-b80bf5446d7c@virtuozzo.com/
> >
> >  ( Btw, that version has your SOB and this patch doesn't even have a
> >    Signed-off-by. Next time, run your whole set through checkpatch please
> >    before sending. )
> >
> > Now, this URL above talks about OOM, ok, that gets me close to the "why"
> > this patch.
> >
> > From a quick look at the ldt.c code, we allow a single LDT struct per
> > mm. Manpage says so too:
> >
> > DESCRIPTION
> >        modify_ldt()  reads  or  writes  the local descriptor table (LDT) for a process.
> >        The LDT is an array of segment descriptors that can be referenced by user  code.
> >        Linux  allows  processes  to configure a per-process (actually per-mm) LDT.
> >
> > We allow
> >
> > /* Maximum number of LDT entries supported. */
> > #define LDT_ENTRIES     8192
> >
> > so there's an upper limit per mm.
> >
> > Now, please explain what is this accounting for?
> >
> 
> Let me try to provide the reasoning at least from my perspective.
> There are legitimate workloads with hundreds of processes and there
> can be hundreds of workloads running on large machines. The
> unaccounted memory can cause isolation issues between the workloads
> particularly on highly utilized machines.

It would be better to be explicit:

8192 * 8 = 64kB * number_of_tasks

so realistically this is in the range of lower megabytes. Is this worth
the memcg accounting overhead? Maybe yes, but what kind of workloads
really care?

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v2 8/8] memcg: accounting for ldt_struct objects
@ 2021-03-15 15:59                     ` Borislav Petkov
  0 siblings, 0 replies; 305+ messages in thread
From: Borislav Petkov @ 2021-03-15 15:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vasily Averin, Cgroups, Michal Hocko, Linux MM, Johannes Weiner,
	Vladimir Davydov, Thomas Gleixner, Ingo Molnar, x86

On Mon, Mar 15, 2021 at 08:48:26AM -0700, Shakeel Butt wrote:
> Let me try to provide the reasoning at least from my perspective.
> There are legitimate workloads with hundreds of processes and there
> can be hundreds of workloads running on large machines. The
> unaccounted memory can cause isolation issues between the workloads
> particularly on highly utilized machines.

Good enough for me, as long as that is part of the commit message.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
@ 2021-03-15 17:09                 ` Jakub Kicinski
  0 siblings, 0 replies; 305+ messages in thread
From: Jakub Kicinski @ 2021-03-15 17:09 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, Michal Hocko, linux-mm, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, David S. Miller,
	Hideaki YOSHIFUJI, David Ahern

On Mon, 15 Mar 2021 15:23:00 +0300 Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to proper memcg, so this can lead to global memory
> shortage on the host and cause a lot of OOM kills.
> 
> One such object is the 'struct fib6_node' mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> 
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
> 
> In this case it is not enough to simply add SLAB_ACCOUNT to the corresponding
> kmem cache. The proper memory cgroup still cannot be found due to the
> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> To be sure that the caller is not executed in process context,
> the '!in_task()' check should be used instead.

Sorry for a random question, I didn't get the cover letter. 

What's the overhead of adding SLAB_ACCOUNT?

Please make sure you CC netdev on series which may impact networking.



* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
@ 2021-03-15 19:24                   ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-03-15 19:24 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Vasily Averin, Cgroups, Michal Hocko, Linux MM, Johannes Weiner,
	Vladimir Davydov, David S. Miller, Hideaki YOSHIFUJI,
	David Ahern

On Mon, Mar 15, 2021 at 10:09 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 15 Mar 2021 15:23:00 +0300 Vasily Averin wrote:
> > An untrusted netadmin inside a memcg-limited container can create a
> > huge number of routing entries. Currently, allocated kernel objects
> > are not accounted to proper memcg, so this can lead to global memory
> > shortage on the host and cause a lot of OOM kills.
> >
> > One such object is the 'struct fib6_node' mostly allocated in
> > net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> >
> >  write_lock_bh(&table->tb6_lock);
> >  err = fib6_add(&table->tb6_root, rt, info, mxc);
> >  write_unlock_bh(&table->tb6_lock);
> >
> > In this case it is not enough to simply add SLAB_ACCOUNT to the corresponding
> > kmem cache. The proper memory cgroup still cannot be found due to the
> > incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> > To be sure that the caller is not executed in process context,
> > the '!in_task()' check should be used instead.
>
> Sorry for a random question, I didn't get the cover letter.
>
> What's the overhead of adding SLAB_ACCOUNT?
>

The potential overhead is for MEMCG users where we need to
charge/account each allocation from SLAB_ACCOUNT kmem caches. However
charging is done in batches, so the cost is amortized. If there is a
concern about a specific workload then it would be good to see the
impact of this patch for that workload.

> Please make sure you CC netdev on series which may impact networking.



* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
@ 2021-03-15 19:32                     ` Roman Gushchin
  0 siblings, 0 replies; 305+ messages in thread
From: Roman Gushchin @ 2021-03-15 19:32 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jakub Kicinski, Vasily Averin, Cgroups, Michal Hocko, Linux MM,
	Johannes Weiner, Vladimir Davydov, David S. Miller,
	Hideaki YOSHIFUJI, David Ahern

On Mon, Mar 15, 2021 at 12:24:31PM -0700, Shakeel Butt wrote:
> On Mon, Mar 15, 2021 at 10:09 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 15 Mar 2021 15:23:00 +0300 Vasily Averin wrote:
> > > An untrusted netadmin inside a memcg-limited container can create a
> > > huge number of routing entries. Currently, allocated kernel objects
> > > are not accounted to proper memcg, so this can lead to global memory
> > > shortage on the host and cause a lot of OOM kills.
> > >
> > > One such object is the 'struct fib6_node' mostly allocated in
> > > net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> > >
> > >  write_lock_bh(&table->tb6_lock);
> > >  err = fib6_add(&table->tb6_root, rt, info, mxc);
> > >  write_unlock_bh(&table->tb6_lock);
> > >
> > > In this case it is not enough to simply add SLAB_ACCOUNT to the corresponding
> > > kmem cache. The proper memory cgroup still cannot be found due to the
> > > incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> > > To be sure that the caller is not executed in process context,
> > > the '!in_task()' check should be used instead.
> >
> > Sorry for a random question, I didn't get the cover letter.
> >
> > What's the overhead of adding SLAB_ACCOUNT?
> >
> 
> The potential overhead is for MEMCG users where we need to
> charge/account each allocation from SLAB_ACCOUNT kmem caches. However
> charging is done in batches, so the cost is amortized. If there is a
> concern about a specific workload then it would be good to see the
> impact of this patch for that workload.
> 
> > Please make sure you CC netdev on series which may impact networking.

In general the overhead is not that big, so I don't think we should argue
too much about every new case where we want to enable the accounting and
rather focus on those few examples (if any?) where it actually hurts
the performance in a meaningful way.

Thanks!



* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
@ 2021-03-15 19:35                       ` Jakub Kicinski
  0 siblings, 0 replies; 305+ messages in thread
From: Jakub Kicinski @ 2021-03-15 19:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Vasily Averin, Cgroups, Michal Hocko, Linux MM,
	Johannes Weiner, Vladimir Davydov, David S. Miller,
	Hideaki YOSHIFUJI, David Ahern

On Mon, 15 Mar 2021 12:32:07 -0700 Roman Gushchin wrote:
> On Mon, Mar 15, 2021 at 12:24:31PM -0700, Shakeel Butt wrote:
> > On Mon, Mar 15, 2021 at 10:09 AM Jakub Kicinski <kuba@kernel.org> wrote:  
> > > Sorry for a random question, I didn't get the cover letter.
> > >
> > > What's the overhead of adding SLAB_ACCOUNT?
> > 
> > The potential overhead is for MEMCG users where we need to
> > charge/account each allocation from SLAB_ACCOUNT kmem caches. However
> > charging is done in batches, so the cost is amortized. If there is a
> > concern about a specific workload then it would be good to see the
> > impact of this patch for that workload.
> >   
> > > Please make sure you CC netdev on series which may impact networking.  
> 
> In general the overhead is not that big, so I don't think we should argue
> too much about every new case where we want to enable the accounting and
> rather focus on those few examples (if any?) where it actually hurts
> the performance in a meaningful way.

Ack, no serious concerns about this particular case.

I was expecting you'd have micro benchmark numbers handy so I was
curious to learn what they are, but that appears not to be the case.


^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v2 0/8] memcg accounting from OpenVZ
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-03-16  7:15                 ` Vasily Averin
  2021-04-22 10:35                 ` [PATCH v3 00/16] " Vasily Averin
                                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-03-16  7:15 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Roman Gushchin

Michal, Shakeel, Roman,
thank you very much for your help.

On 3/15/21 3:22 PM, Vasily Averin wrote:
> OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x Linux kernels)
> and we have accounted all required kernel objects by using our own patches.
> When memcg was added upstream, Vladimir Davydov added accounting of some objects
> upstream but skipped some others.
> Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
> the "skipped" objects with our own patches, simply because we accounted such objects before.
> We're working on rebasing to newer kernels and prefer to push our old patches upstream.
> 
> v2:
> - squashed old patch 1, "accounting for allocations called with disabled BH",
>    with old patch 2, "accounting for fib6_nodes cache", which used that kind of memory allocation
> - improved patch description
> - subsystem maintainers added to cc:
> 
> Vasily Averin (8):
>   memcg: accounting for fib6_nodes cache
>   memcg: accounting for ip6_dst_cache
>   memcg: accounting for fib_rules
>   memcg: accounting for ip_fib caches
>   memcg: accounting for fasync_cache
>   memcg: accounting for mnt_cache entries
>   memcg: accounting for tty_struct objects
>   memcg: accounting for ldt_struct objects
> 
>  arch/x86/kernel/ldt.c | 7 ++++---
>  drivers/tty/tty_io.c  | 4 ++--
>  fs/fcntl.c            | 3 ++-
>  fs/namespace.c        | 5 +++--
>  mm/memcontrol.c       | 2 +-
>  net/core/fib_rules.c  | 4 ++--
>  net/ipv4/fib_trie.c   | 4 ++--
>  net/ipv6/ip6_fib.c    | 2 +-
>  net/ipv6/route.c      | 2 +-
>  9 files changed, 18 insertions(+), 15 deletions(-)
> 


^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v3 00/16] memcg accounting from OpenVZ
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  2021-03-16  7:15                 ` [PATCH v2 0/8] memcg accounting from OpenVZ Vasily Averin
@ 2021-04-22 10:35                 ` Vasily Averin
  2021-04-28  6:51                     ` Vasily Averin
                                     ` (16 more replies)
  2021-04-22 10:36                 ` [PATCH v3 07/16] memcg: enable accounting for new namespaces and struct nsproxy Vasily Averin
                                   ` (9 subsequent siblings)
  11 siblings, 17 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:35 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro, Alexey Dobriyan, Andrei Vagin,
	Andrew Morton, Borislav Petkov, Christian Brauner, David Ahern,
	David S. Miller, Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially committed
it upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory 
allocation triggered by untrusted users inside the containers.

This patch set is addressed mostly to the cgroups maintainers and the
cgroups@ mailing list, though I would be very grateful for any comments
from maintainers of the affected subsystems or other people added in cc:

Compared to the upstream, we additionally account for the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets 
- nsproxy and namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and driver's fasync queues 
- tty objects
- per-mm LDT

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped entirely.

We are also going to add accounting for nft, however it is not ready yet.

We have not measured performance on upstream kernels; however, our
performance team compared our current RHEL7-based production kernel with
the original RHEL7 kernel and reports that it performs at least as well.

v3:
- added new patches for other kind of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1, "accounting for allocations called with disabled BH",
   with old patch 2, "accounting for fib6_nodes cache", which used that kind of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bind_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for new namespaces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      |  7 ++++---
 drivers/tty/tty_io.c       |  4 ++--
 fs/fcntl.c                 |  3 ++-
 fs/locks.c                 |  6 ++++--
 fs/namespace.c             |  7 ++++---
 fs/select.c                |  4 ++--
 ipc/msg.c                  |  2 +-
 ipc/namespace.c            |  2 +-
 ipc/sem.c                  | 10 ++++++----
 ipc/shm.c                  |  2 +-
 kernel/cgroup/namespace.c  |  2 +-
 kernel/nsproxy.c           |  2 +-
 kernel/pid_namespace.c     |  2 +-
 kernel/signal.c            |  2 +-
 kernel/time/namespace.c    |  4 ++--
 kernel/time/posix-timers.c |  4 ++--
 kernel/user_namespace.c    |  2 +-
 mm/memcontrol.c            |  2 +-
 net/8021q/vlan.c           |  2 +-
 net/core/dev.c             |  6 +++---
 net/core/fib_rules.c       |  4 ++--
 net/core/scm.c             |  4 ++--
 net/dccp/proto.c           |  2 +-
 net/ipv4/devinet.c         |  2 +-
 net/ipv4/fib_trie.c        |  4 ++--
 net/ipv4/tcp.c             |  4 +++-
 net/ipv6/addrconf.c        |  2 +-
 net/ipv6/ip6_fib.c         |  4 ++--
 net/ipv6/route.c           |  2 +-
 net/ipv6/sit.c             |  5 +++--
 30 files changed, 59 insertions(+), 49 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v3 01/16] memcg: enable accounting for net_device and Tx/Rx queues
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-22 10:36                 ` Vasily Averin
  2021-04-22 10:35                 ` [PATCH v3 00/16] " Vasily Averin
                                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:36 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev

A container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat it again and again.
A net device can request the creation of up to 4096 tx and rx queues,
forcing the kernel to allocate up to several tens of megabytes of memory
per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
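
For scale, a rough back-of-the-envelope calculation (a sketch only: the struct sizes below are illustrative assumptions, not values from any specific kernel build, and the queue arrays are only part of the per-queue cost):

```python
# Rough estimate of queue-array memory for one net device requesting the
# maximum number of queues. Struct sizes are ASSUMED for illustration;
# real sizes vary by kernel version/config, and per-queue sysfs kobjects
# and other attached state add to this, which is how the total per device
# can reach tens of MiB.

NUM_QUEUES = 4096            # max tx (and rx) queues a device may request
SZ_TX_QUEUE = 320            # assumed sizeof(struct netdev_queue), bytes
SZ_RX_QUEUE = 128            # assumed sizeof(struct netdev_rx_queue), bytes

array_bytes = NUM_QUEUES * (SZ_TX_QUEUE + SZ_RX_QUEUE)
print(f"queue arrays alone: {array_bytes / 2**20:.2f} MiB per device")
```

Even under these conservative assumptions, a container looping over device creation multiplies this without any charge to its memcg before this patch.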

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1f79b9a..87b1e80 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9994,7 +9994,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	rx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -10061,7 +10061,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	tx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -10693,7 +10693,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	p = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 02/16] memcg: enable accounting for IP address and routing-related objects
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-22 10:36                 ` Vasily Averin
  2021-04-22 10:35                 ` [PATCH v3 00/16] " Vasily Averin
                                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:36 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and the ip_fib caches.

These objects can be removed manually, though usually they live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One such object is the 'struct fib6_node', mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
kmem cache. The proper memory cgroup still cannot be found due to the
incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

The obsolete in_interrupt() macro does not describe the real execution context properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, the new macro should be used instead:
 in_task()	- We're in task context
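
The difference between the two checks can be sketched with a simplified model of the preempt_count bit layout (an illustration based on include/linux/preempt.h around v5.12; not kernel code):

```python
# Simplified model of preempt_count, showing why in_interrupt() is the
# wrong check here: write_lock_bh() disables BH, which makes
# in_interrupt() true even though we are still in task context and the
# memcg *can* be determined. Constants mirror include/linux/preempt.h.

SOFTIRQ_SHIFT, HARDIRQ_SHIFT, NMI_SHIFT = 8, 16, 20
SOFTIRQ_MASK = 0xff << SOFTIRQ_SHIFT
HARDIRQ_MASK = 0xf << HARDIRQ_SHIFT
NMI_MASK     = 0xf << NMI_SHIFT
SOFTIRQ_OFFSET = 1 << SOFTIRQ_SHIFT          # actually serving a softirq
SOFTIRQ_DISABLE_OFFSET = 2 * SOFTIRQ_OFFSET  # added by local_bh_disable()

def in_interrupt(pc):
    # True in NMI, hardirq, softirq context -- or merely with BH disabled.
    return bool(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_MASK))

def in_task(pc):
    # True in process context, even if BH is currently disabled.
    return not (pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET))

pc = SOFTIRQ_DISABLE_OFFSET   # task context inside write_lock_bh()
print(in_interrupt(pc))       # True  -> old check bypasses the charge
print(in_task(pc))            # True  -> new check allows charging
```

So inside the write_lock_bh()/write_unlock_bh() section the old bypass test wrongly skipped the charge, while '!in_task()' correctly identifies only genuine non-task contexts.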

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d..15108ad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1076,7 +1076,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index cd80ffe..65d8b1d 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 2e35f68da..9b90413 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a9e53f5..d56a15a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 679699e..0982b7c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2444,8 +2444,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 373d480..5dc5c68 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6510,7 +6510,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 03/16] memcg: enable accounting for inet_bind_bucket cache
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-22 10:36                 ` Vasily Averin
  2021-04-22 10:35                 ` [PATCH v3 00/16] " Vasily Averin
                                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:36 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev

A net namespace can create up to 64K TCP and DCCP ports, forcing the
kernel to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
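
As a rough sketch of the scale involved (the bucket size below is an assumption for illustration; the real sizeof(struct inet_bind_bucket) depends on kernel version and config):

```python
# Estimate of per-netns inet_bind_bucket memory when the full 16-bit
# port range is bound. SZ_BUCKET is an assumed size, not measured from
# any specific kernel build.

PORTS = 64 * 1024        # a netns can bind every 16-bit port
SZ_BUCKET = 80           # assumed sizeof(struct inet_bind_bucket), bytes

total = PORTS * SZ_BUCKET
print(f"{total / 2**20:.0f} MiB per netns")  # 5 MiB
```

Multiplied across many namespaces created from inside a container, this becomes a meaningful unaccounted host-memory cost.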

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/dccp/proto.c | 2 +-
 net/ipv4/tcp.c   | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 6d705d9..f90d1e8 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1126,7 +1126,7 @@ static int __init dccp_init(void)
 	dccp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("dccp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index de7cc84..5817a86b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4498,7 +4498,9 @@ void __init tcp_init(void)
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 04/16] memcg: enable accounting for VLAN group array
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-22 10:36                 ` Vasily Averin
  2021-04-22 10:35                 ` [PATCH v3 00/16] " Vasily Averin
                                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:36 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev

A vlan group array consumes up to 8 pages of memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
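
The 8-page figure follows from simple arithmetic (assuming a 64-bit kernel with 4 KiB pages; the 4096-entry VLAN ID space is per VLAN protocol):

```python
# Arithmetic behind "up to 8 pages": a fully populated VLAN group holds
# one net_device pointer per possible VLAN ID, and on a 64-bit machine
# with 4 KiB pages that is exactly 8 pages per protocol.

VLAN_N_VID = 4096        # possible VLAN IDs (12-bit VID space)
PTR_SIZE = 8             # bytes per net_device pointer, 64-bit kernel
PAGE_SIZE = 4096

pages = VLAN_N_VID * PTR_SIZE / PAGE_SIZE
print(pages)             # 8.0
```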

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/8021q/vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 8b644113..d0a579d4 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -67,7 +67,7 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 		return 0;
 
 	size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN;
-	array = kzalloc(size, GFP_KERNEL);
+	array = kzalloc(size, GFP_KERNEL_ACCOUNT);
 	if (array == NULL)
 		return -ENOBUFS;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-22 10:36                 ` Vasily Averin
  2021-04-22 10:35                 ` [PATCH v3 00/16] " Vasily Averin
                                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:36 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev

Author: Andrey Ryabinin <aryabinin@virtuozzo.com>

The size of the ip_tunnel_prl structs allocation is controllable from
user space, so it is better to avoid dmesg spam if the allocation fails.
Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
accounting. The allocation is temporary and limited to 4GB.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/sit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 9fdccf0..2ba147c 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -320,7 +320,7 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 	 * we try harder to allocate.
 	 */
 	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
-		kcalloc(cmax, sizeof(*kp), GFP_KERNEL | __GFP_NOWARN) :
+		kcalloc(cmax, sizeof(*kp), GFP_KERNEL_ACCOUNT | __GFP_NOWARN) :
 		NULL;
 
 	rcu_read_lock();
@@ -333,7 +333,8 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 		 * For root users, retry allocating enough memory for
 		 * the answer.
 		 */
-		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC);
+		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC | __GFP_ACCOUNT |
+					      __GFP_NOWARN);
 		if (!kp) {
 			ret = -ENOMEM;
 			goto out;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 06/16] memcg: enable accounting for scm_fp_list objects
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-22 10:36                 ` Vasily Averin
  2021-04-22 10:35                 ` [PATCH v3 00/16] " Vasily Averin
                                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:36 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev

Unix sockets allow sending file descriptors via SCM_RIGHTS messages.
Each such send call forces the kernel to allocate up to 2KB of memory
for a struct scm_fp_list.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/scm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index 8156d4f..e837e4f 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -79,7 +79,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 
 	if (!fpl)
 	{
-		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL);
+		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL_ACCOUNT);
 		if (!fpl)
 			return -ENOMEM;
 		*fplp = fpl;
@@ -348,7 +348,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		return NULL;
 
 	new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),
-			  GFP_KERNEL);
+			  GFP_KERNEL_ACCOUNT);
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 07/16] memcg: enable accounting for new namespaces and struct nsproxy
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  2021-03-16  7:15                 ` [PATCH v2 0/8] memcg accounting from OpenVZ Vasily Averin
  2021-04-22 10:35                 ` [PATCH v3 00/16] " Vasily Averin
@ 2021-04-22 10:36                 ` Vasily Averin
  2021-04-22 10:37                 ` [PATCH v3 08/16] memcg: enable accounting of ipc resources Vasily Averin
                                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:36 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro, Tejun Heo, Andrew Morton,
	Zefan Li, Thomas Gleixner, Christian Brauner, Kirill Tkhai,
	Serge Hallyn, Andrei Vagin

A container admin can create new namespaces and force the kernel to
allocate up to several pages of memory for the namespaces and their
associated structures.
Net and uts namespaces already have accounting enabled for such
allocations. It makes sense to account for the remaining ones to
restrict the host's memory consumption from inside the memcg-limited
container.

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 fs/namespace.c            | 2 +-
 ipc/namespace.c           | 2 +-
 kernel/cgroup/namespace.c | 2 +-
 kernel/nsproxy.c          | 2 +-
 kernel/pid_namespace.c    | 2 +-
 kernel/time/namespace.c   | 4 ++--
 kernel/user_namespace.c   | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 56bb5a5..5ecfa349 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3286,7 +3286,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
 	if (!ucounts)
 		return ERR_PTR(-ENOSPC);
 
-	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns) {
 		dec_mnt_namespaces(ucounts);
 		return ERR_PTR(-ENOMEM);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7bd0766..ae83f0f 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
+	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
 	if (ns == NULL)
 		goto fail_dec;
 
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index f5e8828..0d5c298 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
 	struct cgroup_namespace *new_ns;
 	int ret;
 
-	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns)
 		return ERR_PTR(-ENOMEM);
 	ret = ns_alloc_inum(&new_ns->ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index abc01fc..eec72ca 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
 
 int __init nsproxy_cache_init(void)
 {
-	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
+	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
 	return 0;
 }
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index ca43239..6cd6715 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
 
 static __init int pid_namespaces_init(void)
 {
-	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	register_sysctl_paths(kern_path, pid_ns_ctl_table);
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 12eab0d..aec8328 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
 	if (!ns)
 		goto fail_dec;
 
 	refcount_set(&ns->ns.count, 1);
 
-	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!ns->vvar_page)
 		goto fail_free;
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af61294..886d6f9 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1319,7 +1319,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
 
 static __init int user_namespaces_init(void)
 {
-	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
+	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 	return 0;
 }
 subsys_initcall(user_namespaces_init);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 08/16] memcg: enable accounting of ipc resources
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
                                   ` (2 preceding siblings ...)
  2021-04-22 10:36                 ` [PATCH v3 07/16] memcg: enable accounting for new namespaces and struct nsproxy Vasily Averin
@ 2021-04-22 10:37                 ` Vasily Averin
       [not found]                   ` <4ed65beb-bda3-1c93-fadf-296b760a32b2-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  2021-04-22 10:37                 ` [PATCH v3 09/16] memcg: enable accounting for mnt_cache entries Vasily Averin
                                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:37 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Andrew Morton, Alexey Dobriyan, Dmitry Safonov

When a user creates IPC objects, it forces the kernel to allocate
memory for these long-living objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, messages,
semaphores and semaphore undo lists.

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 ipc/msg.c |  2 +-
 ipc/sem.c | 10 ++++++----
 ipc/shm.c |  2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index acd1bc7..87898cb 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kvmalloc(sizeof(*msq), GFP_KERNEL);
+	msq = kvmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
diff --git a/ipc/sem.c b/ipc/sem.c
index f6c30a8..52a6599 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -511,7 +511,7 @@ static struct sem_array *sem_alloc(size_t nsems)
 	if (nsems > (INT_MAX - sizeof(*sma)) / sizeof(sma->sems[0]))
 		return NULL;
 
-	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL);
+	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!sma))
 		return NULL;
 
@@ -1850,7 +1850,7 @@ static inline int get_undo_list(struct sem_undo_list **undo_listp)
 
 	undo_list = current->sysvsem.undo_list;
 	if (!undo_list) {
-		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL);
+		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL_ACCOUNT);
 		if (undo_list == NULL)
 			return -ENOMEM;
 		spin_lock_init(&undo_list->lock);
@@ -1935,7 +1935,8 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 	rcu_read_unlock();
 
 	/* step 2: allocate new undo structure */
-	new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, GFP_KERNEL);
+	new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
+		      GFP_KERNEL_ACCOUNT);
 	if (!new) {
 		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
 		return ERR_PTR(-ENOMEM);
@@ -1999,7 +2000,8 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 	if (nsops > ns->sc_semopm)
 		return -E2BIG;
 	if (nsops > SEMOPM_FAST) {
-		sops = kvmalloc_array(nsops, sizeof(*sops), GFP_KERNEL);
+		sops = kvmalloc_array(nsops, sizeof(*sops),
+				      GFP_KERNEL_ACCOUNT);
 		if (sops == NULL)
 			return -ENOMEM;
 	}
diff --git a/ipc/shm.c b/ipc/shm.c
index febd88d..7632d72 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 			ns->shm_tot + numpages > ns->shm_ctlall)
 		return -ENOSPC;
 
-	shp = kvmalloc(sizeof(*shp), GFP_KERNEL);
+	shp = kvmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!shp))
 		return -ENOMEM;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 09/16] memcg: enable accounting for mnt_cache entries
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
                                   ` (3 preceding siblings ...)
  2021-04-22 10:37                 ` [PATCH v3 08/16] memcg: enable accounting of ipc resources Vasily Averin
@ 2021-04-22 10:37                 ` Vasily Averin
  2021-04-22 10:37                 ` [PATCH v3 10/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
                                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:37 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro

The kernel allocates ~400 bytes of 'struct mount' for each new mount.
Creating a new mount namespace clones most of the parent's mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 5ecfa349..fc1b50d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4213,7 +4214,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 10/16] memcg: enable accounting for pollfd and select bits arrays
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
                                   ` (4 preceding siblings ...)
  2021-04-22 10:37                 ` [PATCH v3 09/16] memcg: enable accounting for mnt_cache entries Vasily Averin
@ 2021-04-22 10:37                 ` Vasily Averin
  2021-04-22 10:37                 ` [PATCH v3 11/16] memcg: enable accounting for signals Vasily Averin
                                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:37 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro

A user can call the select/poll system calls with a large number of
file descriptors, forcing the kernel to allocate up to several pages of
memory that are held until these sleeping system calls complete. These
are long-living, unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 11/16] memcg: enable accounting for signals
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
                                   ` (5 preceding siblings ...)
  2021-04-22 10:37                 ` [PATCH v3 10/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
@ 2021-04-22 10:37                 ` Vasily Averin
  2021-04-22 10:37                 ` [PATCH v3 12/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
                                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:37 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Jens Axboe, Eric W. Biederman, Oleg Nesterov

When a user sends a signal to another process, it forces the kernel
to allocate memory for 'struct sigqueue' objects. The number of signals
is limited by the RLIMIT_SIGPENDING resource limit, but even the default
settings allow each user to consume up to several megabytes of memory.
Moreover, an untrusted admin inside a container can increase the limit
or create new fake users and force them to send signals.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f271835..a7fa849 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4639,7 +4639,7 @@ void __init signals_init(void)
 {
 	siginfo_buildtime_checks();
 
-	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
+	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
 }
 
 #ifdef CONFIG_KGDB_KDB
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 12/16] memcg: enable accounting for posix_timers_cache slab
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
                                   ` (6 preceding siblings ...)
  2021-04-22 10:37                 ` [PATCH v3 11/16] memcg: enable accounting for signals Vasily Averin
@ 2021-04-22 10:37                 ` Vasily Averin
  2021-04-22 10:37                 ` [PATCH v3 13/16] memcg: enable accounting for file lock caches Vasily Averin
                                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:37 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Thomas Gleixner

A program may create multiple interval timers using timer_create().
For each timer the kernel preallocates a "queued real-time signal";
consequently, the number of timers is limited by the RLIMIT_SIGPENDING
resource limit. The allocated object is quite small, ~250 bytes,
but even the default signal limits allow consuming up to 100 megabytes
per user.

It makes sense to account for them to limit the host's memory consumption
from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 kernel/time/posix-timers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index bf540f5a..2eee615 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -273,8 +273,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
 static __init int init_posix_timers(void)
 {
 	posix_timers_cache = kmem_cache_create("posix_timers_cache",
-					sizeof (struct k_itimer), 0, SLAB_PANIC,
-					NULL);
+					sizeof(struct k_itimer), 0,
+					SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 13/16] memcg: enable accounting for file lock caches
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
                                   ` (7 preceding siblings ...)
  2021-04-22 10:37                 ` [PATCH v3 12/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
@ 2021-04-22 10:37                 ` Vasily Averin
  2021-04-22 10:37                 ` [PATCH v3 14/16] memcg: enable accounting for fasync_cache Vasily Averin
                                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:37 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Jeff Layton, J. Bruce Fields, Alexander Viro

A user can create file locks for each open file, forcing the kernel to
allocate small but long-living objects for each of them.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 6125d2d..fb751f3 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3007,10 +3007,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 14/16] memcg: enable accounting for fasync_cache
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
                                   ` (8 preceding siblings ...)
  2021-04-22 10:37                 ` [PATCH v3 13/16] memcg: enable accounting for file lock caches Vasily Averin
@ 2021-04-22 10:37                 ` Vasily Averin
  2021-04-22 10:37                 ` [PATCH v3 15/16] memcg: enable accounting for tty-related objects Vasily Averin
  2021-04-22 10:38                 ` [PATCH v3 16/16] memcg: enable accounting for ldt_struct objects Vasily Averin
  11 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:37 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Jeff Layton, J. Bruce Fields, Alexander Viro

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and for regular files by the file lease code.
This structure is quite small but long-living, and one can be allocated
for any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index dfc72f1..7941559 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 15/16] memcg: enable accounting for tty-related objects
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
                                   ` (9 preceding siblings ...)
  2021-04-22 10:37                 ` [PATCH v3 14/16] memcg: enable accounting for fasync_cache Vasily Averin
@ 2021-04-22 10:37                 ` Vasily Averin
       [not found]                   ` <da450388-2fbc-1bb8-0839-b6480cb0eead-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  2021-04-22 10:38                 ` [PATCH v3 16/16] memcg: enable accounting for ldt_struct objects Vasily Averin
  11 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:37 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Greg Kroah-Hartman, Jiri Slaby

At each login the user forces the kernel to create a new terminal and
allocate up to ~1KB of memory for the tty-related structures.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 drivers/tty/tty_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 391bada..e613b8e 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1502,7 +1502,7 @@ void tty_save_termios(struct tty_struct *tty)
 	/* Stash the termios data */
 	tp = tty->driver->termios[idx];
 	if (tp == NULL) {
-		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
+		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
 		if (tp == NULL)
 			return;
 		tty->driver->termios[idx] = tp;
@@ -3127,7 +3127,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
 {
 	struct tty_struct *tty;
 
-	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
+	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
 	if (!tty)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v3 16/16] memcg: enable accounting for ldt_struct objects
       [not found]               ` <dddf6b29-debd-dcb5-62d0-74909d610edb-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
                                   ` (10 preceding siblings ...)
  2021-04-22 10:37                 ` [PATCH v3 15/16] memcg: enable accounting for tty-related objects Vasily Averin
@ 2021-04-22 10:38                 ` Vasily Averin
       [not found]                   ` <94dd36cb-3abb-53fc-0f23-26c02094ddf4-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  11 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 10:38 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin

Each task can request its own LDT and force the kernel to allocate up to
64KB of memory per mm.

There are legitimate workloads with hundreds of processes, and there
can be hundreds of workloads running on large machines.
The unaccounted memory can cause isolation issues between the workloads,
particularly on highly utilized machines.

It makes sense to account for these objects to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 arch/x86/kernel/ldt.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index aa15132..a1889a0 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -154,7 +154,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
 
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);
 	if (!new_ldt)
 		return NULL;
 
@@ -168,9 +168,10 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	 * than PAGE_SIZE.
 	 */
 	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
+		new_ldt->entries = __vmalloc(alloc_size,
+					     GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
+		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 
 	if (!new_ldt->entries) {
 		kfree(new_ldt);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 15/16] memcg: enable accounting for tty-related objects
       [not found]                   ` <da450388-2fbc-1bb8-0839-b6480cb0eead-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-22 11:23                     ` Greg Kroah-Hartman
       [not found]                       ` <YIFcqcd4dCiNcILj-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Greg Kroah-Hartman @ 2021-04-22 11:23 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jiri Slaby

On Thu, Apr 22, 2021 at 01:37:53PM +0300, Vasily Averin wrote:
> At each login the user forces the kernel to create a new terminal and
> allocate up to ~1Kb memory for the tty-related structures.

Does this tiny amount of memory actually matter?  This feels like it
would be lost in the noise, and not really be an issue for any real
system as it's hard to abuse (i.e. if a user creates lots of tty
structures, what can they do???)

So no, I do not think this patch is needed, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 15/16] memcg: enable accounting for tty-related objects
       [not found]                       ` <YIFcqcd4dCiNcILj-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2021-04-22 11:44                         ` Michal Hocko
       [not found]                           ` <YIFhuwlXKaAaY3IU-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Michal Hocko @ 2021-04-22 11:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Vasily Averin, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jiri Slaby

On Thu 22-04-21 13:23:21, Greg KH wrote:
> On Thu, Apr 22, 2021 at 01:37:53PM +0300, Vasily Averin wrote:
> > At each login the user forces the kernel to create a new terminal and
> > allocate up to ~1Kb memory for the tty-related structures.
> 
> Does this tiny amount of memory actually matter?

The primary question is whether an untrusted user can trigger an
unbounded amount of these allocations.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 15/16] memcg: enable accounting for tty-related objects
       [not found]                           ` <YIFhuwlXKaAaY3IU-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-04-22 11:50                             ` Greg Kroah-Hartman
       [not found]                               ` <YIFjI3zHVQr4BjHc-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Greg Kroah-Hartman @ 2021-04-22 11:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vasily Averin, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jiri Slaby

On Thu, Apr 22, 2021 at 01:44:59PM +0200, Michal Hocko wrote:
> On Thu 22-04-21 13:23:21, Greg KH wrote:
> > On Thu, Apr 22, 2021 at 01:37:53PM +0300, Vasily Averin wrote:
> > > At each login the user forces the kernel to create a new terminal and
> > > allocate up to ~1Kb memory for the tty-related structures.
> > 
> > Does this tiny amount of memory actually matter?
> 
> The primary question is whether an untrusted user can trigger an
> unbounded amount of these allocations.

Can they?  They are not bounded by some other resource limit?

greg k-h

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 15/16] memcg: enable accounting for tty-related objects
       [not found]                               ` <YIFjI3zHVQr4BjHc-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
@ 2021-04-22 12:22                                 ` Michal Hocko
  2021-04-22 13:59                                 ` Vasily Averin
  1 sibling, 0 replies; 305+ messages in thread
From: Michal Hocko @ 2021-04-22 12:22 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Vasily Averin, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jiri Slaby

On Thu 22-04-21 13:50:59, Greg KH wrote:
> On Thu, Apr 22, 2021 at 01:44:59PM +0200, Michal Hocko wrote:
> > On Thu 22-04-21 13:23:21, Greg KH wrote:
> > > On Thu, Apr 22, 2021 at 01:37:53PM +0300, Vasily Averin wrote:
> > > > At each login the user forces the kernel to create a new terminal and
> > > > allocate up to ~1Kb memory for the tty-related structures.
> > > 
> > > Does this tiny amount of memory actually matter?
> > 
> > The primary question is whether an untrusted user can trigger an
> > unbounded amount of these allocations.
> 
> Can they?  They are not bounded by some other resource limit?

I dunno, this is not my area. I am not aware of any direct rlimit (maybe
RLIMIT_NPROC), and maybe the pid controller would help. But the changelog
should definitely mention that. Other patches tend to mention the
scenario they protect from; this one should be more specific.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 16/16] memcg: enable accounting for ldt_struct objects
       [not found]                   ` <94dd36cb-3abb-53fc-0f23-26c02094ddf4-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-22 12:26                     ` Borislav Petkov
       [not found]                       ` <20210422122615.GA7021-Jj63ApZU6fQ@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Borislav Petkov @ 2021-04-22 12:26 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Thu, Apr 22, 2021 at 01:38:01PM +0300, Vasily Averin wrote:

You have forgotten to Cc LKML on your submission.

> Each task can request own LDT and force the kernel to allocate up to
> 64Kb memory per-mm.
> 
> There are legitimate workloads with hundreds of processes and there
> can be hundreds of workloads running on large machines.
> The unaccounted memory can cause isolation issues between the workloads
> particularly on highly utilized machines.
> 
> It makes sense to account for this objects to restrict the host's memory
> consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
> ---
>  arch/x86/kernel/ldt.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index aa15132..a1889a0 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -154,7 +154,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
>  	if (num_entries > LDT_ENTRIES)
>  		return NULL;
>  
> -	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
> +	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);
>  	if (!new_ldt)
>  		return NULL;
>  
> @@ -168,9 +168,10 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
>  	 * than PAGE_SIZE.
>  	 */
>  	if (alloc_size > PAGE_SIZE)
> -		new_ldt->entries = vzalloc(alloc_size);
> +		new_ldt->entries = __vmalloc(alloc_size,
> +					     GFP_KERNEL_ACCOUNT | __GFP_ZERO);

You don't have to break that line - just let it stick out.

>  	else
> -		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
> +		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
>  
>  	if (!new_ldt->entries) {
>  		kfree(new_ldt);
> -- 

In any case:

Acked-by: Borislav Petkov <bp-l3A5Bk7waGM@public.gmane.org>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 15/16] memcg: enable accounting for tty-related objects
       [not found]                               ` <YIFjI3zHVQr4BjHc-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>
  2021-04-22 12:22                                 ` Michal Hocko
@ 2021-04-22 13:59                                 ` Vasily Averin
       [not found]                                   ` <6e697a1f-936d-5ffe-d29f-e4dcbe099799-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  1 sibling, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-04-22 13:59 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Michal Hocko
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Jiri Slaby

On 4/22/21 2:50 PM, Greg Kroah-Hartman wrote:
> On Thu, Apr 22, 2021 at 01:44:59PM +0200, Michal Hocko wrote:
>> On Thu 22-04-21 13:23:21, Greg KH wrote:
>>> On Thu, Apr 22, 2021 at 01:37:53PM +0300, Vasily Averin wrote:
>>>> At each login the user forces the kernel to create a new terminal and
>>>> allocate up to ~1Kb memory for the tty-related structures.
>>>
>>> Does this tiny amount of memory actually matter?
>>
>> The primary question is whether an untrusted user can trigger an
>> unbounded amount of these allocations.
> 
> Can they?  They are not bounded by some other resource limit?

I'm not ready to provide a use case right now,
but on the other hand I do not see any related limits.
Let me take some time to dig into this question.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 16/16] memcg: enable accounting for ldt_struct objects
       [not found]                       ` <20210422122615.GA7021-Jj63ApZU6fQ@public.gmane.org>
@ 2021-04-23  3:13                         ` Vasily Averin
       [not found]                           ` <29fe6b29-d56a-6ea1-2fe7-2b015f6b74ef-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-04-23  3:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Michal Hocko

On 4/22/21 3:26 PM, Borislav Petkov wrote:
> On Thu, Apr 22, 2021 at 01:38:01PM +0300, Vasily Averin wrote:
> 
> You have forgotten to Cc LKML on your submission.
I think it's OK; the patch set is addressed to the cgroups subsystem mailing list.
Have I missed something? Should such patches be sent to LKML anyway?

>> @@ -168,9 +168,10 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
>>  	 * than PAGE_SIZE.
>>  	 */
>>  	if (alloc_size > PAGE_SIZE)
>> -		new_ldt->entries = vzalloc(alloc_size);
>> +		new_ldt->entries = __vmalloc(alloc_size,
>> +					     GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> 
> You don't have to break that line - just let it stick out.
Hmm. I missed that the allowed line limit was increased up to 100 characters.
Thank you, I will fix it in the next patch version.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 16/16] memcg: enable accounting for ldt_struct objects
       [not found]                           ` <29fe6b29-d56a-6ea1-2fe7-2b015f6b74ef-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-23  6:20                             ` Michal Hocko
  0 siblings, 0 replies; 305+ messages in thread
From: Michal Hocko @ 2021-04-23  6:20 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Borislav Petkov, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Fri 23-04-21 06:13:30, Vasily Averin wrote:
> On 4/22/21 3:26 PM, Borislav Petkov wrote:
> > On Thu, Apr 22, 2021 at 01:38:01PM +0300, Vasily Averin wrote:
> > 
> > You have forgotten to Cc LKML on your submission.
> I think it's OK, patch set is addressed to cgroups subsystem amiling list.
> Am I missed something and such patches should be sent to LKML anyway?

Yes, it is preferable to CC all patches to LKML. These patches are not
that cgroup-specific. Specific subsystem knowledge is required to judge
them, and most people are not subscribed to the cgroup ML.

> >> @@ -168,9 +168,10 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
> >>  	 * than PAGE_SIZE.
> >>  	 */
> >>  	if (alloc_size > PAGE_SIZE)
> >> -		new_ldt->entries = vzalloc(alloc_size);
> >> +		new_ldt->entries = __vmalloc(alloc_size,
> >> +					     GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> > 
> > You don't have to break that line - just let it stick out.
> Hmm. I missed that allowed line limit was increased up to 100 bytes,
> Thank you, I will fix it in next patch version. 

Line limits are more of a guidance than a hard rule. Also please note
that different subsystems' maintainers insist on this guidance
differently.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 15/16] memcg: enable accounting for tty-related objects
       [not found]                                   ` <6e697a1f-936d-5ffe-d29f-e4dcbe099799-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-23  7:53                                     ` Vasily Averin
       [not found]                                       ` <03cb1ce9-143a-1cd0-f34b-d608c3bbc66c-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-04-23  7:53 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Michal Hocko
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Jiri Slaby

On 4/22/21 4:59 PM, Vasily Averin wrote:
> On 4/22/21 2:50 PM, Greg Kroah-Hartman wrote:
>> On Thu, Apr 22, 2021 at 01:44:59PM +0200, Michal Hocko wrote:
>>> On Thu 22-04-21 13:23:21, Greg KH wrote:
>>>> On Thu, Apr 22, 2021 at 01:37:53PM +0300, Vasily Averin wrote:
>>>>> At each login the user forces the kernel to create a new terminal and
>>>>> allocate up to ~1Kb memory for the tty-related structures.
>>>>
>>>> Does this tiny amount of memory actually matter?
>>>
>>> The primary question is whether an untrusted user can trigger an
>>> unbounded amount of these allocations.
>>
>> Can they?  They are not bounded by some other resource limit?
> 
> I'm not ready to provide usecase right now,
> but on the other hand I do not see any related limits.
> Let me take time out to dig this question.

By default it is allowed to create up to 4096 ptys, with a reserve of 1024 for the
init namespace only, and the settings are controlled by the host admin. That is fine.
However, this default is not enough for hosters with thousands of containers per node,
so the host admin can be forced to increase it up to NR_UNIX98_PTY_MAX = 1<<20.

By default a container is restricted by the pty mount option max = 1024, but the admin
inside the container can change it via remount. As a result, one container can consume
almost all allowed ptys and allocate up to 1GB of unaccounted memory.

This is not enough per se to trigger an OOM on the host, but it still allows a container
to significantly exceed its assigned memcg limit and leads to trouble on over-committed
nodes. So I still think it makes sense to account for this memory.

Btw, OpenVZ has per-container pty accounting and limits, but upstream does not.

Thank you,
	Vasily Averin.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 15/16] memcg: enable accounting for tty-related objects
       [not found]                                       ` <03cb1ce9-143a-1cd0-f34b-d608c3bbc66c-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-23  8:58                                         ` Michal Hocko
       [not found]                                           ` <YIKMMSf1uPrWmT2V-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Michal Hocko @ 2021-04-23  8:58 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Greg Kroah-Hartman, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jiri Slaby

On Fri 23-04-21 10:53:55, Vasily Averin wrote:
> On 4/22/21 4:59 PM, Vasily Averin wrote:
> > On 4/22/21 2:50 PM, Greg Kroah-Hartman wrote:
> >> On Thu, Apr 22, 2021 at 01:44:59PM +0200, Michal Hocko wrote:
> >>> On Thu 22-04-21 13:23:21, Greg KH wrote:
> >>>> On Thu, Apr 22, 2021 at 01:37:53PM +0300, Vasily Averin wrote:
> >>>>> At each login the user forces the kernel to create a new terminal and
> >>>>> allocate up to ~1Kb memory for the tty-related structures.
> >>>>
> >>>> Does this tiny amount of memory actually matter?
> >>>
> >>> The primary question is whether an untrusted user can trigger an
> >>> unbounded amount of these allocations.
> >>
> >> Can they?  They are not bounded by some other resource limit?
> > 
> > I'm not ready to provide usecase right now,
> > but on the other hand I do not see any related limits.
> > Let me take time out to dig this question.
> 
> By default it's allowed to create up to 4096 ptys with 1024 reserve for initns only
> and the settings are controlled by host admin. It's OK.
> Though this default is not enough for hosters with thousands of containers per node.
> Host admin can be forced to increase it up to NR_UNIX98_PTY_MAX = 1<<20.
> 
> By default container is restricted by pty mount_opt.max = 1024, but admin inside container 
> can change it via remount. In result one container can consume almost all allowed ptys 
> and allocate up to 1Gb of unaccounted memory.
> 
> It is not enough per-se to trigger OOM on host, however anyway, it allows to significantly
> exceed the assigned memcg limit and leads to troubles on the over-committed node.
> So I still think it makes sense to account this memory.

This is very valuable information to have in the changelog. It is not
my call, but if all the above is correct then the accounting is
worthwhile IMO.

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 15/16] memcg: enable accounting for tty-related objects
       [not found]                                           ` <YIKMMSf1uPrWmT2V-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-04-23 10:29                                             ` Vasily Averin
       [not found]                                               ` <31c49c60-44db-0363-3d07-5febe0048e86-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-04-23 10:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg Kroah-Hartman, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jiri Slaby

On 4/23/21 11:58 AM, Michal Hocko wrote:
> On Fri 23-04-21 10:53:55, Vasily Averin wrote:
>> On 4/22/21 4:59 PM, Vasily Averin wrote:
>>> On 4/22/21 2:50 PM, Greg Kroah-Hartman wrote:
>>>> On Thu, Apr 22, 2021 at 01:44:59PM +0200, Michal Hocko wrote:
>>>>> On Thu 22-04-21 13:23:21, Greg KH wrote:
>>>>>> On Thu, Apr 22, 2021 at 01:37:53PM +0300, Vasily Averin wrote:
>>>>>>> At each login the user forces the kernel to create a new terminal and
>>>>>>> allocate up to ~1Kb memory for the tty-related structures.
>>>>>>
>>>>>> Does this tiny amount of memory actually matter?
>>>>>
>>>>> The primary question is whether an untrusted user can trigger an
>>>>> unbounded amount of these allocations.
>>>>
>>>> Can they?  They are not bounded by some other resource limit?
>>>
>>> I'm not ready to provide usecase right now,
>>> but on the other hand I do not see any related limits.
>>> Let me take time out to dig this question.
>>
>> By default it's allowed to create up to 4096 ptys with 1024 reserve for initns only
>> and the settings are controlled by host admin. It's OK.
>> Though this default is not enough for hosters with thousands of containers per node.
>> Host admin can be forced to increase it up to NR_UNIX98_PTY_MAX = 1<<20.
>>
>> By default container is restricted by pty mount_opt.max = 1024, but admin inside container 
>> can change it via remount. In result one container can consume almost all allowed ptys 
>> and allocate up to 1Gb of unaccounted memory.
>>
>> It is not enough per-se to trigger OOM on host, however anyway, it allows to significantly
>> exceed the assigned memcg limit and leads to troubles on the over-committed node.
>> So I still think it makes sense to account this memory.
> 
> This is a very valuable information to have in the changelog. It is not
> my call but if all the above is correct then the accounting is worth
> IMO.

If Greg doesn't have any objections, I'll add this explanation to the next version of the patch.


^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 15/16] memcg: enable accounting for tty-related objects
       [not found]                                               ` <31c49c60-44db-0363-3d07-5febe0048e86-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-23 10:57                                                 ` Greg Kroah-Hartman
  0 siblings, 0 replies; 305+ messages in thread
From: Greg Kroah-Hartman @ 2021-04-23 10:57 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jiri Slaby

On Fri, Apr 23, 2021 at 01:29:53PM +0300, Vasily Averin wrote:
> On 4/23/21 11:58 AM, Michal Hocko wrote:
> > On Fri 23-04-21 10:53:55, Vasily Averin wrote:
> >> On 4/22/21 4:59 PM, Vasily Averin wrote:
> >>> On 4/22/21 2:50 PM, Greg Kroah-Hartman wrote:
> >>>> On Thu, Apr 22, 2021 at 01:44:59PM +0200, Michal Hocko wrote:
> >>>>> On Thu 22-04-21 13:23:21, Greg KH wrote:
> >>>>>> On Thu, Apr 22, 2021 at 01:37:53PM +0300, Vasily Averin wrote:
> >>>>>>> At each login the user forces the kernel to create a new terminal and
> >>>>>>> allocate up to ~1Kb memory for the tty-related structures.
> >>>>>>
> >>>>>> Does this tiny amount of memory actually matter?
> >>>>>
> >>>>> The primary question is whether an untrusted user can trigger an
> >>>>> unbounded amount of these allocations.
> >>>>
> >>>> Can they?  They are not bounded by some other resource limit?
> >>>
> >>> I'm not ready to provide usecase right now,
> >>> but on the other hand I do not see any related limits.
> >>> Let me take time out to dig this question.
> >>
> >> By default it's allowed to create up to 4096 ptys with 1024 reserve for initns only
> >> and the settings are controlled by host admin. It's OK.
> >> Though this default is not enough for hosters with thousands of containers per node.
> >> Host admin can be forced to increase it up to NR_UNIX98_PTY_MAX = 1<<20.
> >>
> >> By default container is restricted by pty mount_opt.max = 1024, but admin inside container 
> >> can change it via remount. In result one container can consume almost all allowed ptys 
> >> and allocate up to 1Gb of unaccounted memory.
> >>
> >> It is not enough per-se to trigger OOM on host, however anyway, it allows to significantly
> >> exceed the assigned memcg limit and leads to troubles on the over-committed node.
> >> So I still think it makes sense to account this memory.
> > 
> > This is a very valuable information to have in the changelog. It is not
> > my call but if all the above is correct then the accounting is worth
> > IMO.
> 
> If Greg doesn't have any objections, I'll add this explanation to the next version of the patch.
> 

I object to the current text you submitted, so something has to change
in order for me to be able to accept the patch :)

Seriously, yes, the above information is great, please include that and
all should be fine.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 08/16] memcg: enable accounting of ipc resources
       [not found]                   ` <4ed65beb-bda3-1c93-fadf-296b760a32b2-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-23 12:16                     ` Alexey Dobriyan
       [not found]                       ` <YIK6ttdnfjOo6XCN-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Alexey Dobriyan @ 2021-04-23 12:16 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Andrew Morton,
	Dmitry Safonov

On Thu, Apr 22, 2021 at 01:37:02PM +0300, Vasily Averin wrote:
> When user creates IPC objects it forces kernel to allocate memory for
> these long-living objects.
> 
> It makes sense to account them to restrict the host's memory consumption
> from inside the memcg-limited container.
> 
> This patch enables accounting for IPC shared memory segments, messages,
> semaphores and semaphore undo lists.

> --- a/ipc/msg.c
> +++ b/ipc/msg.c
> @@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
>  	key_t key = params->key;
>  	int msgflg = params->flg;
>  
> -	msq = kvmalloc(sizeof(*msq), GFP_KERNEL);
> +	msq = kvmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);

Why this requires vmalloc? struct msg_queue is not big at all.

> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>  			ns->shm_tot + numpages > ns->shm_ctlall)
>  		return -ENOSPC;
>  
> -	shp = kvmalloc(sizeof(*shp), GFP_KERNEL);
> +	shp = kvmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);

Same question.
Kmem caches can be GFP_ACCOUNT by default.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 08/16] memcg: enable accounting of ipc resources
       [not found]                       ` <YIK6ttdnfjOo6XCN-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2021-04-23 12:32                         ` Vasily Averin
       [not found]                           ` <dd9b1767-55e0-6754-3ac5-7e01de12f16e-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-04-23 12:32 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Andrew Morton,
	Dmitry Safonov

On 4/23/21 3:16 PM, Alexey Dobriyan wrote:
> On Thu, Apr 22, 2021 at 01:37:02PM +0300, Vasily Averin wrote:
>> When user creates IPC objects it forces kernel to allocate memory for
>> these long-living objects.
>>
>> It makes sense to account them to restrict the host's memory consumption
>> from inside the memcg-limited container.
>>
>> This patch enables accounting for IPC shared memory segments, messages
>> semaphores and semaphore's undo lists.
> 
>> --- a/ipc/msg.c
>> +++ b/ipc/msg.c
>> @@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
>>  	key_t key = params->key;
>>  	int msgflg = params->flg;
>>  
>> -	msq = kvmalloc(sizeof(*msq), GFP_KERNEL);
>> +	msq = kvmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
> 
> Why this requires vmalloc? struct msg_queue is not big at all.
> 
>> --- a/ipc/shm.c
>> +++ b/ipc/shm.c
>> @@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>>  			ns->shm_tot + numpages > ns->shm_ctlall)
>>  		return -ENOSPC;
>>  
>> -	shp = kvmalloc(sizeof(*shp), GFP_KERNEL);
>> +	shp = kvmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
> 
> Same question.
> Kmem caches can be GFP_ACCOUNT by default.

It is a side effect: previously all these objects were allocated via the
ipc_alloc/ipc_alloc_rcu functions, which called kvmalloc internally.

Should I replace it with kmalloc right in this patch?

Thank you,
	Vasily Averin


^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 08/16] memcg: enable accounting of ipc resources
       [not found]                           ` <dd9b1767-55e0-6754-3ac5-7e01de12f16e-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
@ 2021-04-23 13:40                             ` Michal Hocko
       [not found]                               ` <YILOab0/h83egjUw-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Michal Hocko @ 2021-04-23 13:40 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Alexey Dobriyan, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Andrew Morton,
	Dmitry Safonov

On Fri 23-04-21 15:32:01, Vasily Averin wrote:
> On 4/23/21 3:16 PM, Alexey Dobriyan wrote:
> > On Thu, Apr 22, 2021 at 01:37:02PM +0300, Vasily Averin wrote:
> >> When user creates IPC objects it forces kernel to allocate memory for
> >> these long-living objects.
> >>
> >> It makes sense to account them to restrict the host's memory consumption
> >> from inside the memcg-limited container.
> >>
> >> This patch enables accounting for IPC shared memory segments, messages
> >> semaphores and semaphore's undo lists.
> > 
> >> --- a/ipc/msg.c
> >> +++ b/ipc/msg.c
> >> @@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
> >>  	key_t key = params->key;
> >>  	int msgflg = params->flg;
> >>  
> >> -	msq = kvmalloc(sizeof(*msq), GFP_KERNEL);
> >> +	msq = kvmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
> > 
> > Why this requires vmalloc? struct msg_queue is not big at all.
> > 
> >> --- a/ipc/shm.c
> >> +++ b/ipc/shm.c
> >> @@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> >>  			ns->shm_tot + numpages > ns->shm_ctlall)
> >>  		return -ENOSPC;
> >>  
> >> -	shp = kvmalloc(sizeof(*shp), GFP_KERNEL);
> >> +	shp = kvmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
> > 
> > Same question.
> > Kmem caches can be GFP_ACCOUNT by default.
> 
> It is side effect: previously all these objects was allocated via ipc_alloc/ipc_alloc_rcu
> function called kvmalloc inside.
> 
> Should I replace it to kmalloc right in this patch?

I would say those are two independent things. I would agree that
kvmalloc is bogus here. The allocation would try the SLAB allocator
first, but if that fails (as kvmalloc doesn't try really hard) it would
fragment memory without a good reason, which looks like a bug to me.
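The fallback behaviour being described can be sketched in plain userspace C. This is a hypothetical simplification, not the kernel's actual mm/util.c implementation: `malloc` stands in for both the kmalloc and vmalloc paths, and `sketch_kvmalloc` is an illustrative name.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Illustrative sketch of the kvmalloc() decision logic discussed here:
 * try the physically contiguous (kmalloc-style) path first, without
 * trying too hard, and fall back to the vmalloc-style path on failure.
 * For sub-page objects like msg_queue the first path practically always
 * succeeds, so the fallback only matters for large allocations.
 */
static void *sketch_kvmalloc(size_t size)
{
	/* First attempt: stands in for kmalloc(size, flags | __GFP_NORETRY). */
	void *p = malloc(size);

	if (p)
		return p;

	/* Fallback: stands in for vmalloc(size). */
	return malloc(size);
}
```

For a ~256-byte object like msg_queue the first attempt essentially never fails, which is why a plain kmalloc() is the better fit there.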

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 08/16] memcg: enable accounting of ipc resources
       [not found]                               ` <YILOab0/h83egjUw-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-04-23 13:49                                 ` Michal Hocko
       [not found]                                   ` <YILQa1qas7veJaCq-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 305+ messages in thread
From: Michal Hocko @ 2021-04-23 13:49 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Alexey Dobriyan, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Andrew Morton,
	Dmitry Safonov

On Fri 23-04-21 15:40:58, Michal Hocko wrote:
> On Fri 23-04-21 15:32:01, Vasily Averin wrote:
> > On 4/23/21 3:16 PM, Alexey Dobriyan wrote:
> > > On Thu, Apr 22, 2021 at 01:37:02PM +0300, Vasily Averin wrote:
> > >> When user creates IPC objects it forces kernel to allocate memory for
> > >> these long-living objects.
> > >>
> > >> It makes sense to account them to restrict the host's memory consumption
> > >> from inside the memcg-limited container.
> > >>
> > >> This patch enables accounting for IPC shared memory segments, messages
> > >> semaphores and semaphore's undo lists.
> > > 
> > >> --- a/ipc/msg.c
> > >> +++ b/ipc/msg.c
> > >> @@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
> > >>  	key_t key = params->key;
> > >>  	int msgflg = params->flg;
> > >>  
> > >> -	msq = kvmalloc(sizeof(*msq), GFP_KERNEL);
> > >> +	msq = kvmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
> > > 
> > > Why this requires vmalloc? struct msg_queue is not big at all.
> > > 
> > >> --- a/ipc/shm.c
> > >> +++ b/ipc/shm.c
> > >> @@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> > >>  			ns->shm_tot + numpages > ns->shm_ctlall)
> > >>  		return -ENOSPC;
> > >>  
> > >> -	shp = kvmalloc(sizeof(*shp), GFP_KERNEL);
> > >> +	shp = kvmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
> > > 
> > > Same question.
> > > Kmem caches can be GFP_ACCOUNT by default.
> > 
> > It is side effect: previously all these objects was allocated via ipc_alloc/ipc_alloc_rcu
> > function called kvmalloc inside.
> > 
> > Should I replace it to kmalloc right in this patch?
> 
> I would say those are two independent things. I would agree that
> kvmalloc is bogus here. The allocation would try SLAB allocator first
> but if it fails (as kvmalloc doesn't try really hard) then it would
> fragment memory without a good reason which looks like a bug to me.

I have dug into the history and this all seems to be just a copy&paste
code pattern. In the past it was ipc_alloc_rcu, a common helper for all
IPCs, and that code was conditional on rcu_use_vmalloc
(HDRLEN_KMALLOC + size > PAGE_SIZE). Later it was changed to kvmalloc,
then the helper was removed and all users converted to call kvmalloc
directly. It seems that only semaphores can grow large enough to care
about kvmalloc though.
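A quick back-of-the-envelope check supports this. The 32000 value is the maximum nsems mentioned later in the thread; the 64-byte header is an assumed, illustrative size for the fixed part of struct sem_undo, not taken from kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum semaphores per set, as discussed in this thread. */
enum { SEM_UNDO_NSEMS_MAX = 32000 };

/*
 * Approximate allocation size of a sem_undo object: a fixed header
 * plus one short (semadj entry) per semaphore in the set.
 */
static size_t sem_undo_alloc_size(size_t header_size, int nsems)
{
	return header_size + sizeof(short) * (size_t)nsems;
}
```

With an assumed 64-byte header, sem_undo_alloc_size(64, 32000) is 64064 bytes, roughly 64KB and far past a 4KB page, while msg_queue and shmid_kernel stay in the couple-hundred-byte range.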
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v3 08/16] memcg: enable accounting of ipc resources
       [not found]                                   ` <YILQa1qas7veJaCq-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2021-04-24 11:17                                     ` Vasily Averin
  2021-04-26 10:18                                         ` Vasily Averin
                                                         ` (2 more replies)
  0 siblings, 3 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-24 11:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexey Dobriyan, cgroups-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Andrew Morton,
	Dmitry Safonov

On 4/23/21 4:49 PM, Michal Hocko wrote:
> On Fri 23-04-21 15:40:58, Michal Hocko wrote:
>> On Fri 23-04-21 15:32:01, Vasily Averin wrote:
>>> On 4/23/21 3:16 PM, Alexey Dobriyan wrote:
>>>> On Thu, Apr 22, 2021 at 01:37:02PM +0300, Vasily Averin wrote:
>>>>> When user creates IPC objects it forces kernel to allocate memory for
>>>>> these long-living objects.
>>>>>
>>>>> It makes sense to account them to restrict the host's memory consumption
>>>>> from inside the memcg-limited container.
>>>>>
>>>>> This patch enables accounting for IPC shared memory segments, messages
>>>>> semaphores and semaphore's undo lists.
>>>>
>>>>> --- a/ipc/msg.c
>>>>> +++ b/ipc/msg.c
>>>>> @@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
>>>>>  	key_t key = params->key;
>>>>>  	int msgflg = params->flg;
>>>>>  
>>>>> -	msq = kvmalloc(sizeof(*msq), GFP_KERNEL);
>>>>> +	msq = kvmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
>>>>
>>>> Why this requires vmalloc? struct msg_queue is not big at all.
>>>>
>>>>> --- a/ipc/shm.c
>>>>> +++ b/ipc/shm.c
>>>>> @@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>>>>>  			ns->shm_tot + numpages > ns->shm_ctlall)
>>>>>  		return -ENOSPC;
>>>>>  
>>>>> -	shp = kvmalloc(sizeof(*shp), GFP_KERNEL);
>>>>> +	shp = kvmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
>>>>
>>>> Same question.
>>>> Kmem caches can be GFP_ACCOUNT by default.
>>>
>>> It is side effect: previously all these objects was allocated via ipc_alloc/ipc_alloc_rcu
>>> function called kvmalloc inside.
>>>
>>> Should I replace it to kmalloc right in this patch?
>>
>> I would say those are two independent things. I would agree that
>> kvmalloc is bogus here. The allocation would try SLAB allocator first
>> but if it fails (as kvmalloc doesn't try really hard) then it would
>> fragment memory without a good reason which looks like a bug to me.
> 
> I have dug into history and this all seems to be just code pattern
> copy&paste thing. In the past it was ipc_alloc_rcu which was a common
> code for all IPCs and that code was conditional on rcu_use_vmalloc
> (HDRLEN_KMALLOC + size > PAGE_SIZE). Later changed to kvmalloc and then
> helper removed and all users to use kvmalloc. It seems that only
> semaphores can grow large enough to care about kvmalloc though.

I have one more cleanup for ipc, so if no one objects I'll fix this case,
add my patch, and send them together as a separate patch set a bit later.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH 0/2] ipc: allocations cleanup
@ 2021-04-26 10:18                                         ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-26 10:18 UTC (permalink / raw)
  To: Michal Hocko, cgroups
  Cc: linux-kernel, Alexey Dobriyan, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Andrew Morton, Dmitry Safonov

Some ipc objects use the wrong allocation functions: small objects can use kmalloc(),
while potentially large objects should use kvmalloc().

I think it's better to handle these patches via cgroups@ to avoid merge conflicts
with my patches that include accounting for ipc objects.

Vasily Averin (2):
  ipc sem: use kvmalloc for sem_undo allocation
  ipc: use kmalloc for msg_queue and shmid_kernel

 ipc/msg.c |  6 +++---
 ipc/sem.c | 10 +++++-----
 ipc/shm.c |  6 +++---
 3 files changed, 11 insertions(+), 11 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 305+ messages in thread


* [PATCH 1/2] ipc sem: use kvmalloc for sem_undo allocation
@ 2021-04-26 10:18                                         ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-26 10:18 UTC (permalink / raw)
  To: Michal Hocko, cgroups
  Cc: linux-kernel, Alexey Dobriyan, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Andrew Morton, Dmitry Safonov

The size of struct sem_undo can exceed one page: with the maximum possible
nsems = 32000 it can grow up to 64KB. Let's switch its allocation to
kvmalloc to avoid user-triggered disruptive actions like the OOM killer
in case of high-order memory shortage.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 ipc/sem.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 52a6599..93088d6 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1152,7 +1152,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 		un->semid = -1;
 		list_del_rcu(&un->list_proc);
 		spin_unlock(&un->ulp->lock);
-		kfree_rcu(un, rcu);
+		kvfree_rcu(un, rcu);
 	}
 
 	/* Wake up all pending processes and let them fail with EIDRM. */
@@ -1935,7 +1935,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 	rcu_read_unlock();
 
 	/* step 2: allocate new undo structure */
-	new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
+	new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
 		      GFP_KERNEL_ACCOUNT);
 	if (!new) {
 		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
@@ -1948,7 +1948,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 	if (!ipc_valid_object(&sma->sem_perm)) {
 		sem_unlock(sma, -1);
 		rcu_read_unlock();
-		kfree(new);
+		kvfree(new);
 		un = ERR_PTR(-EIDRM);
 		goto out;
 	}
@@ -1959,7 +1959,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 	 */
 	un = lookup_undo(ulp, semid);
 	if (un) {
-		kfree(new);
+		kvfree(new);
 		goto success;
 	}
 	/* step 5: initialize & link new undo structure */
@@ -2420,7 +2420,7 @@ void exit_sem(struct task_struct *tsk)
 		rcu_read_unlock();
 		wake_up_q(&wake_q);
 
-		kfree_rcu(un, rcu);
+		kvfree_rcu(un, rcu);
 	}
 	kfree(ulp);
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH 2/2] ipc: use kmalloc for msg_queue and shmid_kernel
@ 2021-04-26 10:18                                         ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-26 10:18 UTC (permalink / raw)
  To: Michal Hocko, cgroups
  Cc: linux-kernel, Alexey Dobriyan, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Andrew Morton, Dmitry Safonov

msg_queue and shmid_kernel are quite small objects, so there is no need
to use kvmalloc for them.
Previously these objects were allocated via ipc_alloc()/ipc_rcu_alloc(),
a common function for several ipc objects, which called kvmalloc internally.
Later this function went away and was finally replaced by a direct
kvmalloc call, and now we can use the more suitable kmalloc/kfree for them.

Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 ipc/msg.c | 6 +++---
 ipc/shm.c | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 87898cb..79c6625 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -130,7 +130,7 @@ static void msg_rcu_free(struct rcu_head *head)
 	struct msg_queue *msq = container_of(p, struct msg_queue, q_perm);
 
 	security_msg_queue_free(&msq->q_perm);
-	kvfree(msq);
+	kfree(msq);
 }
 
 /**
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kvmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
+	msq = kmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
@@ -157,7 +157,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	msq->q_perm.security = NULL;
 	retval = security_msg_queue_alloc(&msq->q_perm);
 	if (retval) {
-		kvfree(msq);
+		kfree(msq);
 		return retval;
 	}
 
diff --git a/ipc/shm.c b/ipc/shm.c
index 7632d72..85da060 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -222,7 +222,7 @@ static void shm_rcu_free(struct rcu_head *head)
 	struct shmid_kernel *shp = container_of(ptr, struct shmid_kernel,
 							shm_perm);
 	security_shm_free(&shp->shm_perm);
-	kvfree(shp);
+	kfree(shp);
 }
 
 static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
@@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 			ns->shm_tot + numpages > ns->shm_ctlall)
 		return -ENOSPC;
 
-	shp = kvmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
+	shp = kmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!shp))
 		return -ENOMEM;
 
@@ -630,7 +630,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	shp->shm_perm.security = NULL;
 	error = security_shm_alloc(&shp->shm_perm);
 	if (error) {
-		kvfree(shp);
+		kfree(shp);
 		return error;
 	}
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* Re: [PATCH 2/2] ipc: use kmalloc for msg_queue and shmid_kernel
@ 2021-04-26 10:25                                           ` Michal Hocko
  0 siblings, 0 replies; 305+ messages in thread
From: Michal Hocko @ 2021-04-26 10:25 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, linux-kernel, Alexey Dobriyan, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Andrew Morton, Dmitry Safonov

On Mon 26-04-21 13:18:14, Vasily Averin wrote:
> msg_queue and shmid_kernel are quite small objects, no need to use
> kvmalloc for them.

Both of them are 256B on most 64b systems.

> Previously these objects was allocated via ipc_alloc/ipc_rcu_alloc(),
> common function for several ipc objects. It had kvmalloc call inside().
> Later, this function went away and was finally replaced by direct
> kvmalloc call, and now we can use more suitable kmalloc/kfree for them.

This describes the history of the code but it doesn't really explain why
kvmalloc is a bad decision here. I would go with the following:
"
Using kvmalloc for sub-page-size objects is suboptimal because kvmalloc
can easily fall back to vmalloc under memory pressure, and smaller
objects would fragment memory. Therefore replace kvmalloc by a simple
kmalloc.
"
> 
> Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

With a clarified changelog, feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  ipc/msg.c | 6 +++---
>  ipc/shm.c | 6 +++---
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/ipc/msg.c b/ipc/msg.c
> index 87898cb..79c6625 100644
> --- a/ipc/msg.c
> +++ b/ipc/msg.c
> @@ -130,7 +130,7 @@ static void msg_rcu_free(struct rcu_head *head)
>  	struct msg_queue *msq = container_of(p, struct msg_queue, q_perm);
>  
>  	security_msg_queue_free(&msq->q_perm);
> -	kvfree(msq);
> +	kfree(msq);
>  }
>  
>  /**
> @@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
>  	key_t key = params->key;
>  	int msgflg = params->flg;
>  
> -	msq = kvmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
> +	msq = kmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
>  	if (unlikely(!msq))
>  		return -ENOMEM;
>  
> @@ -157,7 +157,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
>  	msq->q_perm.security = NULL;
>  	retval = security_msg_queue_alloc(&msq->q_perm);
>  	if (retval) {
> -		kvfree(msq);
> +		kfree(msq);
>  		return retval;
>  	}
>  
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 7632d72..85da060 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -222,7 +222,7 @@ static void shm_rcu_free(struct rcu_head *head)
>  	struct shmid_kernel *shp = container_of(ptr, struct shmid_kernel,
>  							shm_perm);
>  	security_shm_free(&shp->shm_perm);
> -	kvfree(shp);
> +	kfree(shp);
>  }
>  
>  static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
> @@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>  			ns->shm_tot + numpages > ns->shm_ctlall)
>  		return -ENOSPC;
>  
> -	shp = kvmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
> +	shp = kmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
>  	if (unlikely(!shp))
>  		return -ENOMEM;
>  
> @@ -630,7 +630,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>  	shp->shm_perm.security = NULL;
>  	error = security_shm_alloc(&shp->shm_perm);
>  	if (error) {
> -		kvfree(shp);
> +		kfree(shp);
>  		return error;
>  	}
>  
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 305+ messages in thread


* Re: [PATCH 1/2] ipc sem: use kvmalloc for sem_undo allocation
  2021-04-26 10:18                                         ` Vasily Averin
  (?)
@ 2021-04-26 10:28                                         ` Michal Hocko
  -1 siblings, 0 replies; 305+ messages in thread
From: Michal Hocko @ 2021-04-26 10:28 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, linux-kernel, Alexey Dobriyan, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Andrew Morton, Dmitry Safonov

On Mon 26-04-21 13:18:09, Vasily Averin wrote:
> size of sem_undo can exceed one page and with the maximum possible
> nsems = 32000 it can grow up to 64Kb. Let's switch its allocation
> to kvmalloc to avoid user-triggered disruptive actions like OOM killer
> in case of high-order memory shortage.

User-triggerable high-order allocations are quite a problem on heavily
fragmented systems. They can be a DoS vector.
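To see why such a request is "high order": with the usual 4KB pages, a physically contiguous 64KB allocation needs 16 contiguous pages, i.e. an order-4 request from the buddy allocator. The constants below are illustrative, not taken from kernel headers.

```c
#include <assert.h>
#include <stddef.h>

enum { SKETCH_PAGE_SIZE = 4096 };	/* assumed page size */

/*
 * Buddy-allocator order needed for a physically contiguous allocation
 * of `size` bytes: the smallest power-of-two number of pages that
 * covers the request.
 */
static int sketch_alloc_order(size_t size)
{
	size_t pages = (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
	int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}
```

A 64KB sem_undo would thus be an order-4 kmalloc, exactly the kind of user-triggerable high-order request that fragmented systems struggle with, whereas kvmalloc can fall back to satisfying it from order-0 pages via vmalloc.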

> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  ipc/sem.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/ipc/sem.c b/ipc/sem.c
> index 52a6599..93088d6 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -1152,7 +1152,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
>  		un->semid = -1;
>  		list_del_rcu(&un->list_proc);
>  		spin_unlock(&un->ulp->lock);
> -		kfree_rcu(un, rcu);
> +		kvfree_rcu(un, rcu);
>  	}
>  
>  	/* Wake up all pending processes and let them fail with EIDRM. */
> @@ -1935,7 +1935,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
>  	rcu_read_unlock();
>  
>  	/* step 2: allocate new undo structure */
> -	new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
> +	new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
>  		      GFP_KERNEL_ACCOUNT);
>  	if (!new) {
>  		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
> @@ -1948,7 +1948,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
>  	if (!ipc_valid_object(&sma->sem_perm)) {
>  		sem_unlock(sma, -1);
>  		rcu_read_unlock();
> -		kfree(new);
> +		kvfree(new);
>  		un = ERR_PTR(-EIDRM);
>  		goto out;
>  	}
> @@ -1959,7 +1959,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
>  	 */
>  	un = lookup_undo(ulp, semid);
>  	if (un) {
> -		kfree(new);
> +		kvfree(new);
>  		goto success;
>  	}
>  	/* step 5: initialize & link new undo structure */
> @@ -2420,7 +2420,7 @@ void exit_sem(struct task_struct *tsk)
>  		rcu_read_unlock();
>  		wake_up_q(&wake_q);
>  
> -		kfree_rcu(un, rcu);
> +		kvfree_rcu(un, rcu);
>  	}
>  	kfree(ulp);
>  }
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 1/2] ipc sem: use kvmalloc for sem_undo allocation
@ 2021-04-26 16:22                                           ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-04-26 16:22 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, Cgroups, LKML, Alexey Dobriyan, Johannes Weiner,
	Vladimir Davydov, Andrew Morton, Dmitry Safonov

On Mon, Apr 26, 2021 at 3:18 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> size of sem_undo can exceed one page and with the maximum possible
> nsems = 32000 it can grow up to 64Kb. Let's switch its allocation
> to kvmalloc to avoid user-triggered disruptive actions like OOM killer
> in case of high-order memory shortage.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 305+ messages in thread


* Re: [PATCH 2/2] ipc: use kmalloc for msg_queue and shmid_kernel
@ 2021-04-26 16:23                                           ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-04-26 16:23 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, Cgroups, LKML, Alexey Dobriyan, Johannes Weiner,
	Vladimir Davydov, Andrew Morton, Dmitry Safonov

On Mon, Apr 26, 2021 at 3:18 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> msg_queue and shmid_kernel are quite small objects, no need to use
> kvmalloc for them.
> Previously these objects was allocated via ipc_alloc/ipc_rcu_alloc(),
> common function for several ipc objects. It had kvmalloc call inside().
> Later, this function went away and was finally replaced by direct
> kvmalloc call, and now we can use more suitable kmalloc/kfree for them.
>
> Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 1/2] ipc sem: use kvmalloc for sem_undo allocation
@ 2021-04-26 20:29                                           ` Roman Gushchin
  0 siblings, 0 replies; 305+ messages in thread
From: Roman Gushchin @ 2021-04-26 20:29 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, cgroups, linux-kernel, Alexey Dobriyan,
	Shakeel Butt, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Dmitry Safonov

On Mon, Apr 26, 2021 at 01:18:09PM +0300, Vasily Averin wrote:
> The size of sem_undo can exceed one page, and with the maximum possible
> nsems = 32000 it can grow up to 64Kb. Let's switch its allocation
> to kvmalloc to avoid user-triggered disruptive actions like the OOM
> killer in case of high-order memory shortage.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Acked-by: Roman Gushchin <guro@fb.com>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/2] ipc: use kmalloc for msg_queue and shmid_kernel
@ 2021-04-26 20:29                                           ` Roman Gushchin
  0 siblings, 0 replies; 305+ messages in thread
From: Roman Gushchin @ 2021-04-26 20:29 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, cgroups, linux-kernel, Alexey Dobriyan,
	Shakeel Butt, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Dmitry Safonov

On Mon, Apr 26, 2021 at 01:18:14PM +0300, Vasily Averin wrote:
> msg_queue and shmid_kernel are quite small objects; there is no need
> to use kvmalloc for them.
> Previously these objects were allocated via ipc_alloc()/ipc_rcu_alloc(),
> a common function for several ipc objects, which had a kvmalloc call
> inside. Later this function went away and was eventually replaced by a
> direct kvmalloc call, and now we can use the more suitable kmalloc/kfree
> for them.
> 
> Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/2] ipc: use kmalloc for msg_queue and shmid_kernel
@ 2021-04-28  5:15                                             ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  5:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgroups, linux-kernel, Alexey Dobriyan, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Andrew Morton, Dmitry Safonov

On 4/26/21 1:25 PM, Michal Hocko wrote:
> Using kvmalloc for sub page size objects is suboptimal because kmalloc
> can easily fallback into vmalloc under memory pressure and smaller
> objects would fragment memory. Therefore replace kvmalloc by a simple
> kmalloc.

I think you're wrong here:
kvmalloc can fall back to vmalloc only for size > PAGE_SIZE.

Please take a look at mm/util.c::kvmalloc_node():

        if (size > PAGE_SIZE) {
                kmalloc_flags |= __GFP_NOWARN;

                if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
                        kmalloc_flags |= __GFP_NORETRY;
        }

        ret = kmalloc_node(size, kmalloc_flags, node);

        /*
         * It doesn't really make sense to fallback to vmalloc for sub page
         * requests
         */
        if (ret || size <= PAGE_SIZE)
                return ret;

        return __vmalloc_node(size, 1, flags, node,
                        __builtin_return_address(0));

For small objects kvmalloc is not much different from kmalloc,
so the patch is mostly cosmetic.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/2] ipc: use kmalloc for msg_queue and shmid_kernel
@ 2021-04-28  6:33                                               ` Michal Hocko
  0 siblings, 0 replies; 305+ messages in thread
From: Michal Hocko @ 2021-04-28  6:33 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, linux-kernel, Alexey Dobriyan, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Andrew Morton, Dmitry Safonov

On Wed 28-04-21 08:15:10, Vasily Averin wrote:
> On 4/26/21 1:25 PM, Michal Hocko wrote:
> > Using kvmalloc for sub page size objects is suboptimal because kmalloc
> > can easily fallback into vmalloc under memory pressure and smaller
> > objects would fragment memory. Therefore replace kvmalloc by a simple
> > kmalloc.
> 
> I think you're wrong here:
> kvmalloc can failback to vmalloc for size > PAGE_SIZE only

You are right. My bad. My memory failed on me. Sorry about the
confusion.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v4 00/16] memcg accounting from OpenVZ
  2021-04-22 10:35                 ` [PATCH v3 00/16] " Vasily Averin
@ 2021-04-28  6:51                     ` Vasily Averin
  2021-04-28  6:51                     ` Vasily Averin
                                       ` (15 subsequent siblings)
  16 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro, Alexey Dobriyan, Andrei Vagin,
	Andrew Morton, Borislav Petkov, Christian Brauner, David Ahern,
	David S. Miller, Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Tejun Heo, Thomas Gleixner, Zefan Li, netdev,
	linux-kernel

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux
kernels. Initially we used our own accounting subsystem, then partially
committed it upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

This patch set is addressed mostly to the cgroups maintainers and the
cgroups@ mailing list, though I would be very grateful for any comments
from maintainers of the affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and drivers' fasync queues
- tty objects
- per-mm LDT

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped entirely.

We are also going to add accounting for nft; however, it is not ready yet.

We have not tested performance on upstream; however, our performance team
compared our current RHEL7-based production kernel and reports that it is
at least no worse than the corresponding original RHEL7 kernel.

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kind of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which used this kind
   of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      |  6 +++---
 drivers/tty/tty_io.c       |  4 ++--
 fs/fcntl.c                 |  3 ++-
 fs/locks.c                 |  6 ++++--
 fs/namespace.c             |  7 ++++---
 fs/select.c                |  4 ++--
 ipc/msg.c                  |  2 +-
 ipc/namespace.c            |  2 +-
 ipc/sem.c                  | 10 ++++++----
 ipc/shm.c                  |  2 +-
 kernel/cgroup/namespace.c  |  2 +-
 kernel/nsproxy.c           |  2 +-
 kernel/pid_namespace.c     |  2 +-
 kernel/signal.c            |  2 +-
 kernel/time/namespace.c    |  4 ++--
 kernel/time/posix-timers.c |  4 ++--
 kernel/user_namespace.c    |  2 +-
 mm/memcontrol.c            |  2 +-
 net/8021q/vlan.c           |  2 +-
 net/core/dev.c             |  6 +++---
 net/core/fib_rules.c       |  4 ++--
 net/core/scm.c             |  4 ++--
 net/dccp/proto.c           |  2 +-
 net/ipv4/devinet.c         |  2 +-
 net/ipv4/fib_trie.c        |  4 ++--
 net/ipv4/tcp.c             |  4 +++-
 net/ipv6/addrconf.c        |  2 +-
 net/ipv6/ip6_fib.c         |  4 ++--
 net/ipv6/route.c           |  2 +-
 net/ipv6/sit.c             |  5 +++--
 30 files changed, 58 insertions(+), 49 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v4 01/16] memcg: enable accounting for net_device and Tx/Rx queues
@ 2021-04-28  6:51                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev, linux-kernel

A container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat it again and again.
A net device can request the creation of up to 4096 tx and rx queues,
forcing the kernel to allocate up to several tens of megabytes of memory
per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1f79b9a..87b1e80 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9994,7 +9994,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	rx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -10061,7 +10061,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	tx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -10693,7 +10693,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	p = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 02/16] memcg: enable accounting for IP address and routing-related objects
@ 2021-04-28  6:51                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev, linux-kernel

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries,
forcing the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and the ip_fib caches.

These objects can be removed manually, though usually they live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One of such objects is the 'struct fib6_node' mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
kmem cache. The proper memory cgroup still cannot be found due to the
incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

The obsolete in_interrupt() does not describe the real execution context properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, a new macro should be used instead:
 in_task()	- We're in task context

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d..15108ad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1076,7 +1076,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index cd80ffe..65d8b1d 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 2e35f68da..9b90413 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a9e53f5..d56a15a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 679699e..0982b7c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2444,8 +2444,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 373d480..5dc5c68 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6510,7 +6510,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 02/16] memcg: enable accounting for IP address and routing-related objects
@ 2021-04-28  6:51                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries,
forcing the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and the ip_fib caches.

These objects can be removed manually, though usually they live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One of such objects is the 'struct fib6_node' mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to the
corresponding kmem cache. The proper memory cgroup still cannot be found
due to the incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

The obsolete in_interrupt() macro does not describe the real execution
context properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, a new macro should be used instead:
 in_task()	- We're in task context

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d..15108ad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1076,7 +1076,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index cd80ffe..65d8b1d 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 2e35f68da..9b90413 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a9e53f5..d56a15a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 679699e..0982b7c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2444,8 +2444,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 373d480..5dc5c68 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6510,7 +6510,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 03/16] memcg: enable accounting for inet_bin_bucket cache
  2021-04-22 10:35                 ` [PATCH v3 00/16] " Vasily Averin
                                     ` (2 preceding siblings ...)
  2021-04-28  6:51                     ` Vasily Averin
@ 2021-04-28  6:51                   ` Vasily Averin
  2021-04-28  6:52                     ` Vasily Averin
                                     ` (12 subsequent siblings)
  16 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev, linux-kernel

A net namespace can create up to 64K tcp and dccp ports, forcing the
kernel to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/dccp/proto.c | 2 +-
 net/ipv4/tcp.c   | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 6d705d9..f90d1e8 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1126,7 +1126,7 @@ static int __init dccp_init(void)
 	dccp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("dccp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index de7cc84..5817a86b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4498,7 +4498,9 @@ void __init tcp_init(void)
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 04/16] memcg: enable accounting for VLAN group array
@ 2021-04-28  6:52                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev, linux-kernel

A VLAN group array consumes up to 8 pages of memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/8021q/vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 8b644113..d0a579d4 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -67,7 +67,7 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 		return 0;
 
 	size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN;
-	array = kzalloc(size, GFP_KERNEL);
+	array = kzalloc(size, GFP_KERNEL_ACCOUNT);
 	if (array == NULL)
 		return -ENOBUFS;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
@ 2021-04-28  6:52                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev, linux-kernel

Author: Andrey Ryabinin <aryabinin@virtuozzo.com>

The size of the ip_tunnel_prl structs allocation is controllable from
user-space, thus it's better to avoid spam in dmesg if the allocation
fails. Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
accounting. The allocation is temporary and limited to 4GB.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/sit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 9fdccf0..2ba147c 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -320,7 +320,7 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 	 * we try harder to allocate.
 	 */
 	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
-		kcalloc(cmax, sizeof(*kp), GFP_KERNEL | __GFP_NOWARN) :
+		kcalloc(cmax, sizeof(*kp), GFP_KERNEL_ACCOUNT | __GFP_NOWARN) :
 		NULL;
 
 	rcu_read_lock();
@@ -333,7 +333,8 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 		 * For root users, retry allocating enough memory for
 		 * the answer.
 		 */
-		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC);
+		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC | __GFP_ACCOUNT |
+					      __GFP_NOWARN);
 		if (!kp) {
 			ret = -ENOMEM;
 			goto out;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 06/16] memcg: enable accounting for scm_fp_list objects
@ 2021-04-28  6:52                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev, linux-kernel

Unix sockets allow sending file descriptors via SCM_RIGHTS type
messages. Each such send call forces the kernel to allocate up to 2Kb
of memory for a struct scm_fp_list.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/scm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index 8156d4f..e837e4f 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -79,7 +79,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 
 	if (!fpl)
 	{
-		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL);
+		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL_ACCOUNT);
 		if (!fpl)
 			return -ENOMEM;
 		*fplp = fpl;
@@ -348,7 +348,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		return NULL;
 
 	new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),
-			  GFP_KERNEL);
+			  GFP_KERNEL_ACCOUNT);
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 07/16] memcg: enable accounting for new namespaces and struct nsproxy
@ 2021-04-28  6:52                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro, Tejun Heo, Andrew Morton,
	Zefan Li, Thomas Gleixner, Christian Brauner, Kirill Tkhai,
	Serge Hallyn, Andrei Vagin, linux-kernel

A container admin can create new namespaces, forcing the kernel to
allocate up to several pages of memory for each namespace and its
associated structures.
Accounting for such allocations is already enabled for net and uts
namespaces. It makes sense to account for the remaining ones to restrict
the host's memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c            | 2 +-
 ipc/namespace.c           | 2 +-
 kernel/cgroup/namespace.c | 2 +-
 kernel/nsproxy.c          | 2 +-
 kernel/pid_namespace.c    | 2 +-
 kernel/time/namespace.c   | 4 ++--
 kernel/user_namespace.c   | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 56bb5a5..5ecfa349 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3286,7 +3286,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
 	if (!ucounts)
 		return ERR_PTR(-ENOSPC);
 
-	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns) {
 		dec_mnt_namespaces(ucounts);
 		return ERR_PTR(-ENOMEM);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7bd0766..ae83f0f 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
+	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
 	if (ns == NULL)
 		goto fail_dec;
 
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index f5e8828..0d5c298 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
 	struct cgroup_namespace *new_ns;
 	int ret;
 
-	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns)
 		return ERR_PTR(-ENOMEM);
 	ret = ns_alloc_inum(&new_ns->ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index abc01fc..eec72ca 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
 
 int __init nsproxy_cache_init(void)
 {
-	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
+	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
 	return 0;
 }
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index ca43239..6cd6715 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
 
 static __init int pid_namespaces_init(void)
 {
-	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	register_sysctl_paths(kern_path, pid_ns_ctl_table);
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 12eab0d..aec8328 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
 	if (!ns)
 		goto fail_dec;
 
 	refcount_set(&ns->ns.count, 1);
 
-	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!ns->vvar_page)
 		goto fail_free;
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 9a4b980..9c6a42b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1378,7 +1378,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
 
 static __init int user_namespaces_init(void)
 {
-	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
+	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 	return 0;
 }
 subsys_initcall(user_namespaces_init);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 08/16] memcg: enable accounting of ipc resources
@ 2021-04-28  6:52                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Andrew Morton, Alexey Dobriyan, Dmitry Safonov,
	linux-kernel

When a user creates IPC objects, the kernel is forced to allocate memory
for these long-lived objects.

It makes sense to account them to restrict the host's memory consumption
from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, message
queues, semaphores and semaphore undo lists.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 ipc/msg.c |  2 +-
 ipc/sem.c | 10 ++++++----
 ipc/shm.c |  2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index acd1bc7..87898cb 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kvmalloc(sizeof(*msq), GFP_KERNEL);
+	msq = kvmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
diff --git a/ipc/sem.c b/ipc/sem.c
index f6c30a8..52a6599 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -511,7 +511,7 @@ static struct sem_array *sem_alloc(size_t nsems)
 	if (nsems > (INT_MAX - sizeof(*sma)) / sizeof(sma->sems[0]))
 		return NULL;
 
-	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL);
+	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!sma))
 		return NULL;
 
@@ -1850,7 +1850,7 @@ static inline int get_undo_list(struct sem_undo_list **undo_listp)
 
 	undo_list = current->sysvsem.undo_list;
 	if (!undo_list) {
-		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL);
+		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL_ACCOUNT);
 		if (undo_list == NULL)
 			return -ENOMEM;
 		spin_lock_init(&undo_list->lock);
@@ -1935,7 +1935,8 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 	rcu_read_unlock();
 
 	/* step 2: allocate new undo structure */
-	new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, GFP_KERNEL);
+	new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
+		      GFP_KERNEL_ACCOUNT);
 	if (!new) {
 		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
 		return ERR_PTR(-ENOMEM);
@@ -1999,7 +2000,8 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 	if (nsops > ns->sc_semopm)
 		return -E2BIG;
 	if (nsops > SEMOPM_FAST) {
-		sops = kvmalloc_array(nsops, sizeof(*sops), GFP_KERNEL);
+		sops = kvmalloc_array(nsops, sizeof(*sops),
+				      GFP_KERNEL_ACCOUNT);
 		if (sops == NULL)
 			return -ENOMEM;
 	}
diff --git a/ipc/shm.c b/ipc/shm.c
index febd88d..7632d72 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 			ns->shm_tot + numpages > ns->shm_ctlall)
 		return -ENOSPC;
 
-	shp = kvmalloc(sizeof(*shp), GFP_KERNEL);
+	shp = kvmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!shp))
 		return -ENOMEM;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 09/16] memcg: enable accounting for mnt_cache entries
@ 2021-04-28  6:53                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:53 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro, linux-kernel

The kernel allocates ~400 bytes of 'struct mount' for each new mount.
Creating a new mount namespace clones most of the parent mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 5ecfa349..fc1b50d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4213,7 +4214,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 10/16] memcg: enable accounting for pollfd and select bits arrays
@ 2021-04-28  6:53                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:53 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro, linux-kernel

A user can call the select/poll system calls with a large number of assigned
file descriptors, forcing the kernel to allocate up to several pages of memory
until these sleeping system calls end. These are long-living,
unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 11/16] memcg: enable accounting for signals
@ 2021-04-28  6:53                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:53 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Jens Axboe, Eric W. Biederman, Oleg Nesterov,
	linux-kernel

When a user sends a signal to another process, the kernel is forced
to allocate memory for 'struct sigqueue' objects. The number of signals
is limited by the RLIMIT_SIGPENDING resource limit, but even the default
settings allow each user to consume up to several megabytes of memory.
Moreover, an untrusted admin inside a container can increase the limit or
create new fake users and force them to send signals.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f271835..a7fa849 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4639,7 +4639,7 @@ void __init signals_init(void)
 {
 	siginfo_buildtime_checks();
 
-	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
+	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
 }
 
 #ifdef CONFIG_KGDB_KDB
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 12/16] memcg: enable accounting for posix_timers_cache slab
@ 2021-04-28  6:53                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:53 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Thomas Gleixner, linux-kernel

A program may create multiple interval timers using timer_create().
For each timer the kernel preallocates a "queued real-time signal",
so the number of timers is limited by the RLIMIT_SIGPENDING
resource limit. The allocated object is quite small, ~250 bytes,
but even the default signal limits allow consuming up to 100 megabytes
per user.

It makes sense to account for them to limit the host's memory consumption
from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 kernel/time/posix-timers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index bf540f5a..2eee615 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -273,8 +273,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
 static __init int init_posix_timers(void)
 {
 	posix_timers_cache = kmem_cache_create("posix_timers_cache",
-					sizeof (struct k_itimer), 0, SLAB_PANIC,
-					NULL);
+					sizeof(struct k_itimer), 0,
+					SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 13/16] memcg: enable accounting for file lock caches
@ 2021-04-28  6:53                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:53 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Jeff Layton, J. Bruce Fields, Alexander Viro,
	linux-kernel

A user can create a file lock for each open file, forcing the kernel
to allocate a small but long-living object per lock.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 6125d2d..fb751f3 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3007,10 +3007,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 14/16] memcg: enable accounting for fasync_cache
@ 2021-04-28  6:54                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:54 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Jeff Layton, J. Bruce Fields, Alexander Viro,
	linux-kernel

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and for regular files by the file lease code.
This structure is quite small but long-living, and it can be assigned
to any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index dfc72f1..7941559 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 15/16] memcg: enable accounting for tty-related objects
@ 2021-04-28  6:54                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:54 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Greg Kroah-Hartman, Jiri Slaby, linux-kernel

At each login the user forces the kernel to create a new terminal and
allocate up to ~1Kb of memory for the tty-related structures.

By default, up to 4096 ptys can be created, with 1024 reserved for the
initial mount namespace only, and these settings are controlled by the
host admin.

However, this default is not enough for hosters with thousands of
containers per node, and the host admin can be forced to increase it
up to NR_UNIX98_PTY_MAX = 1<<20.

By default a container is restricted by pty mount_opt.max = 1024,
but an admin inside the container can change it via remount. As a result,
one container can consume almost all allowed ptys
and allocate up to 1Gb of unaccounted memory.

This is not enough per se to trigger OOM on the host, but it allows the
container to significantly exceed its assigned memcg limit and leads to
trouble on over-committed nodes.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 drivers/tty/tty_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 391bada..e613b8e 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1502,7 +1502,7 @@ void tty_save_termios(struct tty_struct *tty)
 	/* Stash the termios data */
 	tp = tty->driver->termios[idx];
 	if (tp == NULL) {
-		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
+		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
 		if (tp == NULL)
 			return;
 		tty->driver->termios[idx] = tp;
@@ -3127,7 +3127,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
 {
 	struct tty_struct *tty;
 
-	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
+	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
 	if (!tty)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v4 16/16] memcg: enable accounting for ldt_struct objects
@ 2021-04-28  6:54                     ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  6:54 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, linux-kernel

Each task can request its own LDT, forcing the kernel to allocate up to
64Kb of memory per mm.

There are legitimate workloads with hundreds of processes, and there
can be hundreds of workloads running on large machines.
The unaccounted memory can cause isolation issues between the workloads,
particularly on highly utilized machines.

It makes sense to account for these objects to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/ldt.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index aa15132..525876e 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -154,7 +154,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
 
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);
 	if (!new_ldt)
 		return NULL;
 
@@ -168,9 +168,9 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	 * than PAGE_SIZE.
 	 */
 	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
+		new_ldt->entries = __vmalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
+		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 
 	if (!new_ldt->entries) {
 		kfree(new_ldt);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v2 0/2] ipc: allocations cleanup
@ 2021-04-28  7:35                                           ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  7:35 UTC (permalink / raw)
  To: Michal Hocko, cgroups
  Cc: linux-kernel, Alexey Dobriyan, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Andrew Morton, Dmitry Safonov

Some ipc objects use the wrong allocation functions: small objects can use kmalloc(),
and, vice versa, potentially large objects can use kvmalloc().

I think it's better to handle these patches via cgroups@ to avoid merge conflicts
with the "memcg: enable accounting of ipc resources" patch included in the
"memcg accounting from OpenVZ" patch set.

v2:
- improved patch description

Vasily Averin (2):
  ipc sem: use kvmalloc for sem_undo allocation
  ipc: use kmalloc for msg_queue and shmid_kernel

 ipc/msg.c |  6 +++---
 ipc/sem.c | 10 +++++-----
 ipc/shm.c |  6 +++---
 3 files changed, 11 insertions(+), 11 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v2 1/2] ipc sem: use kvmalloc for sem_undo allocation
  2021-04-26 10:18                                         ` Vasily Averin
@ 2021-04-28  7:35                                         ` Vasily Averin
  -1 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  7:35 UTC (permalink / raw)
  To: Michal Hocko, cgroups
  Cc: linux-kernel, Alexey Dobriyan, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Andrew Morton, Dmitry Safonov

The size of sem_undo can exceed one page, and with the maximum possible
nsems = 32000 it can grow up to 64Kb. Let's switch its allocation to
kvmalloc to avoid user-triggered disruptive actions like the OOM killer
in case of high-order memory shortage.

User-triggerable high-order allocations are quite a problem on heavily
fragmented systems; they can be a DoS vector.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 ipc/sem.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 52a6599..93088d6 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1152,7 +1152,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 		un->semid = -1;
 		list_del_rcu(&un->list_proc);
 		spin_unlock(&un->ulp->lock);
-		kfree_rcu(un, rcu);
+		kvfree_rcu(un, rcu);
 	}
 
 	/* Wake up all pending processes and let them fail with EIDRM. */
@@ -1935,7 +1935,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 	rcu_read_unlock();
 
 	/* step 2: allocate new undo structure */
-	new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
+	new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
 		      GFP_KERNEL_ACCOUNT);
 	if (!new) {
 		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
@@ -1948,7 +1948,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 	if (!ipc_valid_object(&sma->sem_perm)) {
 		sem_unlock(sma, -1);
 		rcu_read_unlock();
-		kfree(new);
+		kvfree(new);
 		un = ERR_PTR(-EIDRM);
 		goto out;
 	}
@@ -1959,7 +1959,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 	 */
 	un = lookup_undo(ulp, semid);
 	if (un) {
-		kfree(new);
+		kvfree(new);
 		goto success;
 	}
 	/* step 5: initialize & link new undo structure */
@@ -2420,7 +2420,7 @@ void exit_sem(struct task_struct *tsk)
 		rcu_read_unlock();
 		wake_up_q(&wake_q);
 
-		kfree_rcu(un, rcu);
+		kvfree_rcu(un, rcu);
 	}
 	kfree(ulp);
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v2 2/2] ipc: use kmalloc for msg_queue and shmid_kernel
@ 2021-04-28  7:35                                           ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-04-28  7:35 UTC (permalink / raw)
  To: Michal Hocko, cgroups
  Cc: linux-kernel, Alexey Dobriyan, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Andrew Morton, Dmitry Safonov

msg_queue and shmid_kernel are quite small objects, so there is no need to
use kvmalloc for them.
mhocko@: "Both of them are 256B on most 64b systems."

Previously these objects were allocated via ipc_alloc()/ipc_rcu_alloc(), a
common function for several ipc objects that had a kvmalloc call inside.
Later, this function went away and was finally replaced by a direct
kvmalloc call, and now we can use the more suitable kmalloc/kfree for them.

Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 ipc/msg.c | 6 +++---
 ipc/shm.c | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 87898cb..79c6625 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -130,7 +130,7 @@ static void msg_rcu_free(struct rcu_head *head)
 	struct msg_queue *msq = container_of(p, struct msg_queue, q_perm);
 
 	security_msg_queue_free(&msq->q_perm);
-	kvfree(msq);
+	kfree(msq);
 }
 
 /**
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kvmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
+	msq = kmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
@@ -157,7 +157,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	msq->q_perm.security = NULL;
 	retval = security_msg_queue_alloc(&msq->q_perm);
 	if (retval) {
-		kvfree(msq);
+		kfree(msq);
 		return retval;
 	}
 
diff --git a/ipc/shm.c b/ipc/shm.c
index 7632d72..85da060 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -222,7 +222,7 @@ static void shm_rcu_free(struct rcu_head *head)
 	struct shmid_kernel *shp = container_of(ptr, struct shmid_kernel,
 							shm_perm);
 	security_shm_free(&shp->shm_perm);
-	kvfree(shp);
+	kfree(shp);
 }
 
 static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
@@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 			ns->shm_tot + numpages > ns->shm_ctlall)
 		return -ENOSPC;
 
-	shp = kvmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
+	shp = kmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!shp))
 		return -ENOMEM;
 
@@ -630,7 +630,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	shp->shm_perm.security = NULL;
 	error = security_shm_alloc(&shp->shm_perm);
 	if (error) {
-		kvfree(shp);
+		kfree(shp);
 		return error;
 	}
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* Re: [PATCH v4 15/16] memcg: enable accounting for tty-related objects
@ 2021-04-28  7:38                       ` Greg Kroah-Hartman
  0 siblings, 0 replies; 305+ messages in thread
From: Greg Kroah-Hartman @ 2021-04-28  7:38 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Jiri Slaby, linux-kernel

On Wed, Apr 28, 2021 at 09:54:16AM +0300, Vasily Averin wrote:
> At each login the user forces the kernel to create a new terminal and
> allocate up to ~1Kb of memory for the tty-related structures.
> 
> By default it's allowed to create up to 4096 ptys, with a reserve of 1024
> for the initial mount namespace only, and the settings are controlled by the host admin.
> 
> However, this default is not enough for hosters with thousands
> of containers per node, and the host admin can be forced to increase it
> up to NR_UNIX98_PTY_MAX = 1<<20.
> 
> By default a container is restricted by the pty mount_opt.max = 1024 limit,
> but the admin inside the container can change it via remount. As a result,
> one container can consume almost all allowed ptys
> and allocate up to 1Gb of unaccounted memory.
> 
> This is not enough per se to trigger OOM on the host, but it allows
> a container to significantly exceed its assigned memcg limit and leads to trouble
> on an over-committed node.
> 
> It makes sense to account for them to restrict the host's memory
> consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v4 07/16] memcg: enable accounting for new namespaces and struct nsproxy
@ 2021-05-07 13:45                       ` Serge E. Hallyn
  0 siblings, 0 replies; 305+ messages in thread
From: Serge E. Hallyn @ 2021-05-07 13:45 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Tejun Heo,
	Andrew Morton, Zefan Li, Thomas Gleixner, Christian Brauner,
	Kirill Tkhai, Serge Hallyn, Andrei Vagin, linux-kernel

On Wed, Apr 28, 2021 at 09:52:43AM +0300, Vasily Averin wrote:
> A container admin can create new namespaces and force the kernel to allocate
> up to several pages of memory for the namespaces and their associated
> structures.
> Net and uts namespaces already have accounting enabled for such allocations.
> It makes sense to account for the rest of them to restrict the host's memory
> consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

makes sense.

Acked-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/namespace.c            | 2 +-
>  ipc/namespace.c           | 2 +-
>  kernel/cgroup/namespace.c | 2 +-
>  kernel/nsproxy.c          | 2 +-
>  kernel/pid_namespace.c    | 2 +-
>  kernel/time/namespace.c   | 4 ++--
>  kernel/user_namespace.c   | 2 +-
>  7 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 56bb5a5..5ecfa349 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3286,7 +3286,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
>  	if (!ucounts)
>  		return ERR_PTR(-ENOSPC);
>  
> -	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
> +	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
>  	if (!new_ns) {
>  		dec_mnt_namespaces(ucounts);
>  		return ERR_PTR(-ENOMEM);
> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index 7bd0766..ae83f0f 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
>  		goto fail;
>  
>  	err = -ENOMEM;
> -	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
> +	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
>  	if (ns == NULL)
>  		goto fail_dec;
>  
> diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
> index f5e8828..0d5c298 100644
> --- a/kernel/cgroup/namespace.c
> +++ b/kernel/cgroup/namespace.c
> @@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
>  	struct cgroup_namespace *new_ns;
>  	int ret;
>  
> -	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> +	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
>  	if (!new_ns)
>  		return ERR_PTR(-ENOMEM);
>  	ret = ns_alloc_inum(&new_ns->ns);
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index abc01fc..eec72ca 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
>  
>  int __init nsproxy_cache_init(void)
>  {
> -	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
> +	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
>  	return 0;
>  }
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index ca43239..6cd6715 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
>  
>  static __init int pid_namespaces_init(void)
>  {
> -	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> +	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>  
>  #ifdef CONFIG_CHECKPOINT_RESTORE
>  	register_sysctl_paths(kern_path, pid_ns_ctl_table);
> diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
> index 12eab0d..aec8328 100644
> --- a/kernel/time/namespace.c
> +++ b/kernel/time/namespace.c
> @@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
>  		goto fail;
>  
>  	err = -ENOMEM;
> -	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
> +	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
>  	if (!ns)
>  		goto fail_dec;
>  
>  	refcount_set(&ns->ns.count, 1);
>  
> -	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>  	if (!ns->vvar_page)
>  		goto fail_free;
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 9a4b980..9c6a42b 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1378,7 +1378,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
>  
>  static __init int user_namespaces_init(void)
>  {
> -	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
> +	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>  	return 0;
>  }
>  subsys_initcall(user_namespaces_init);
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v4 07/16] memcg: enable accounting for new namespaces and struct nsproxy
@ 2021-05-07 15:03                       ` Christian Brauner
  0 siblings, 0 replies; 305+ messages in thread
From: Christian Brauner @ 2021-05-07 15:03 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Tejun Heo,
	Andrew Morton, Zefan Li, Thomas Gleixner, Kirill Tkhai,
	Serge Hallyn, Andrei Vagin, linux-kernel

On Wed, Apr 28, 2021 at 09:52:43AM +0300, Vasily Averin wrote:
> A container admin can create new namespaces and force the kernel to allocate
> up to several pages of memory for the namespaces and their associated
> structures.
> Net and uts namespaces already have accounting enabled for such allocations.
> It makes sense to account for the rest of them to restrict the host's memory
> consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---

Serge's ack reminded me of this. Looks good if the mm folks are fine
with this too,
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v4 12/16] memcg: enable accounting for posix_timers_cache slab
@ 2021-05-07 15:48                       ` Thomas Gleixner
  0 siblings, 0 replies; 305+ messages in thread
From: Thomas Gleixner @ 2021-05-07 15:48 UTC (permalink / raw)
  To: Vasily Averin, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, linux-kernel

On Wed, Apr 28 2021 at 09:53, Vasily Averin wrote:

> A program may create multiple interval timers using timer_create().
> For each timer the kernel preallocates a "queued real-time signal";
> consequently, the number of timers is limited by the RLIMIT_SIGPENDING
> resource limit. The allocated object is quite small, ~250 bytes,
> but even the default signal limits allow consuming up to 100 megabytes
> per user.
>
> It makes sense to account for them to limit the host's memory consumption
> from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v4 00/16] memcg accounting from OpenVZ
@ 2021-07-15 17:11                       ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-07-15 17:11 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Cgroups, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Roman Gushchin, Alexander Viro, Alexey Dobriyan, Andrei Vagin,
	Andrew Morton, Borislav Petkov, Christian Brauner, David Ahern,
	David S. Miller, Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Tejun Heo, Thomas Gleixner, Zefan Li, netdev, LKML

On Tue, Apr 27, 2021 at 11:51 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
> Initially we used our own accounting subsystem, then partially committed
> it to upstream, and a few years ago switched to cgroups v1.
> Now we're rebasing again, revising our old patches and trying to push
> them upstream.
>
> We try to protect the host system from any misuse of kernel memory
> allocation triggered by untrusted users inside the containers.
>
> The patch set is addressed mostly to cgroups maintainers and the cgroups@ mailing
> list, though I would be very grateful for any comments from maintainers
> of affected subsystems or other people added in cc:
>
> Compared to the upstream, we additionally account the following kernel objects:
> - network devices and their Tx/Rx queues
> - ipv4/v6 addresses and routing-related objects
> - inet_bind_bucket cache objects
> - VLAN group arrays
> - ipv6/sit: ip_tunnel_prl
> - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
> - nsproxy and namespace objects itself
> - IPC objects: semaphores, message queues and shared memory segments
> - mounts
> - pollfd and select bits arrays
> - signals and posix timers
> - file lock
> - fasync_struct used by the file lease code and driver's fasync queues
> - tty objects
> - per-mm LDT
>
> We have incorrect/incomplete/obsolete accounting for a few other kernel
> objects: sk_filter, af_packets, netlink and xt_counters for iptables.
> They require rework and will probably be dropped altogether.
>
> Also we're going to add an accounting for nft, however it is not ready yet.
>
> We have not tested performance on upstream; however, our performance team
> has compared our current RHEL7-based production kernel and reports that
> it is at least no worse than the corresponding original RHEL7 kernel.
>

Hi Vasily,

What's the status of this series? I see a couple patches did get
acked/reviewed. Can you please re-send the series with updated ack
tags?

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v4 00/16] memcg accounting from OpenVZ
@ 2021-07-16  4:11                         ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-16  4:11 UTC (permalink / raw)
  To: Shakeel Butt, Tejun Heo
  Cc: Cgroups, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Roman Gushchin, Alexander Viro, Alexey Dobriyan, Andrei Vagin,
	Andrew Morton, Borislav Petkov, Christian Brauner, David Ahern,
	David S. Miller, Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Tejun Heo, Thomas Gleixner, Zefan Li, netdev, LKML

On 7/15/21 8:11 PM, Shakeel Butt wrote:
> On Tue, Apr 27, 2021 at 11:51 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
>> Initially we used our own accounting subsystem, then partially committed
>> it to upstream, and a few years ago switched to cgroups v1.
>> Now we're rebasing again, revising our old patches and trying to push
>> them upstream.
>>
>> We try to protect the host system from any misuse of kernel memory
>> allocation triggered by untrusted users inside the containers.
>>
>> The patch set is addressed mostly to cgroups maintainers and the cgroups@ mailing
>> list, though I would be very grateful for any comments from maintainers
>> of affected subsystems or other people added in cc:
>>
>> Compared to the upstream, we additionally account the following kernel objects:
>> - network devices and their Tx/Rx queues
>> - ipv4/v6 addresses and routing-related objects
>> - inet_bind_bucket cache objects
>> - VLAN group arrays
>> - ipv6/sit: ip_tunnel_prl
>> - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
>> - nsproxy and namespace objects themselves
>> - IPC objects: semaphores, message queues and shared memory segments
>> - mounts
>> - pollfd and select bits arrays
>> - signals and posix timers
>> - file locks
>> - fasync_struct used by the file lease code and driver's fasync queues
>> - tty objects
>> - per-mm LDT
>>
>> We have incorrect/incomplete/obsolete accounting for a few other kernel
>> objects: sk_filter, af_packets, netlink and xt_counters for iptables.
>> They require rework and will probably be dropped entirely.
>>
>> We are also going to add accounting for nft; however, it is not ready yet.
>>
>> We have not tested performance on upstream; however, our performance team
>> compared our current RHEL7-based production kernel and reports that
>> it is at least no worse than the corresponding original RHEL7 kernel.
> 
> Hi Vasily,
> 
> What's the status of this series? I see a couple patches did get
> acked/reviewed. Can you please re-send the series with updated ack
> tags?

Technically, my patches do not have any NAKs. Practically, they are still not merged.
I expected Michal would push them, but he advised me to approach the subsystem maintainers.
I've asked Tejun to pick up the whole patch set and I'm waiting for his feedback right now.

I can resend the patch set once again, with the collected ack tags and rebased to v5.14-rc1.
However, I do not understand how that helps to get them merged if the patches should be
processed through subsystem maintainers. As far as I understand, I'll need to split this
patch set into per-subsystem pieces and send them to the corresponding maintainers.

Thank you,
	Vasily Averin.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v4 00/16] memcg accounting from OpenVZ
@ 2021-07-16 12:55                           ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-07-16 12:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro,
	Alexey Dobriyan, Andrei Vagin, Andrew Morton, Borislav Petkov,
	Christian Brauner, David Ahern, David S. Miller, Dmitry Safonov,
	Eric Dumazet, Eric W. Biederman, Greg Kroah-Hartman,
	Hideaki YOSHIFUJI, H. Peter Anvin, Ingo Molnar, Jakub Kicinski,
	J. Bruce Fields, Jeff Layton, Jens Axboe, Jiri Slaby,
	Kirill Tkhai, Oleg Nesterov, Serge Hallyn, Thomas Gleixner,
	Zefan Li, netdev, LKML

On Thu, Jul 15, 2021 at 9:11 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 7/15/21 8:11 PM, Shakeel Butt wrote:
> > On Tue, Apr 27, 2021 at 11:51 PM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >> OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
> >> Initially we used our own accounting subsystem, then partially committed
> >> it to upstream, and a few years ago switched to cgroups v1.
> >> Now we're rebasing again, revising our old patches and trying to push
> >> them upstream.
> >>
> >> We try to protect the host system from any misuse of kernel memory
> >> allocation triggered by untrusted users inside the containers.
> >>
> >> The patch set is addressed mostly to cgroups maintainers and the cgroups@
> >> mailing list, though I would be very grateful for any comments from
> >> maintainers of affected subsystems or other people added in cc:
> >>
> >> Compared to the upstream, we additionally account the following kernel objects:
> >> - network devices and their Tx/Rx queues
> >> - ipv4/v6 addresses and routing-related objects
> >> - inet_bind_bucket cache objects
> >> - VLAN group arrays
> >> - ipv6/sit: ip_tunnel_prl
> >> - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
> >> - nsproxy and namespace objects themselves
> >> - IPC objects: semaphores, message queues and shared memory segments
> >> - mounts
> >> - pollfd and select bits arrays
> >> - signals and posix timers
> >> - file locks
> >> - fasync_struct used by the file lease code and driver's fasync queues
> >> - tty objects
> >> - per-mm LDT
> >>
> >> We have incorrect/incomplete/obsolete accounting for a few other kernel
> >> objects: sk_filter, af_packets, netlink and xt_counters for iptables.
> >> They require rework and will probably be dropped entirely.
> >>
> >> We are also going to add accounting for nft; however, it is not ready yet.
> >>
> >> We have not tested performance on upstream; however, our performance team
> >> compared our current RHEL7-based production kernel and reports that
> >> it is at least no worse than the corresponding original RHEL7 kernel.
> >
> > Hi Vasily,
> >
> > What's the status of this series? I see a couple patches did get
> > acked/reviewed. Can you please re-send the series with updated ack
> > tags?
>
> Technically, my patches do not have any NAKs. Practically, they are still not merged.
> I expected Michal would push them, but he advised me to approach the subsystem maintainers.
> I've asked Tejun to pick up the whole patch set and I'm waiting for his feedback right now.
>
> I can resend the patch set once again, with the collected ack tags and rebased to v5.14-rc1.
> However, I do not understand how that helps to get them merged if the patches should be
> processed through subsystem maintainers. As far as I understand, I'll need to split this
> patch set into per-subsystem pieces and send them to the corresponding maintainers.
>

Usually these kinds of patches (adding memcg accounting) go through mm
tree but if there are no dependencies between the patches and a
consensus that each subsystem maintainer picks the corresponding patch
then that is fine too.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v5 00/16] memcg accounting from OpenVZ
  2021-07-16 12:55                           ` Shakeel Butt
@ 2021-07-19 10:44                             ` Vasily Averin
  -1 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Andrew Morton,
	Borislav Petkov, Christian Brauner, David Ahern, David S. Miller,
	Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Thomas Gleixner, Zefan Li, netdev, linux-fsdevel,
	LKML

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially committed
it to upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory 
allocation triggered by untrusted users inside the containers.

The patch set is addressed mostly to cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from
maintainers of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets 
- nsproxy and namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and driver's fasync queues 
- tty objects
- per-mm LDT
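
In each case the series relies on the same two idioms already used elsewhere
in the kernel: per-allocation accounting via __GFP_ACCOUNT (usually spelled
GFP_KERNEL_ACCOUNT) and per-cache accounting via SLAB_ACCOUNT. A minimal
kernel-code sketch (the cache and struct names below are illustrative, not
taken from any specific patch):

```
/* Per-allocation: charge this object to the current task's memcg. */
obj = kzalloc(sizeof(*obj), GFP_KERNEL_ACCOUNT);

/* Per-cache: every object allocated from this kmem cache is charged. */
cache = kmem_cache_create("example_cache", sizeof(struct example),
			  0, SLAB_ACCOUNT, NULL);
```

The individual patches simply switch the relevant allocation sites to one of
these two forms.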

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped entirely.

We are also going to add accounting for nft; however, it is not ready yet.

We have not tested performance on upstream; however, our performance team
compared our current RHEL7-based production kernel and reports that
it is at least no worse than the corresponding original RHEL7 kernel.

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kind of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   into old patch 2 "accounting for fib6_nodes cache", which uses this kind of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namespaces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 mm/memcontrol.c            | 2 +-
 net/8021q/vlan.c           | 2 +-
 net/core/dev.c             | 6 +++---
 net/core/fib_rules.c       | 4 ++--
 net/core/scm.c             | 4 ++--
 net/dccp/proto.c           | 2 +-
 net/ipv4/devinet.c         | 2 +-
 net/ipv4/fib_trie.c        | 4 ++--
 net/ipv4/tcp.c             | 4 +++-
 net/ipv6/addrconf.c        | 2 +-
 net/ipv6/ip6_fib.c         | 4 ++--
 net/ipv6/route.c           | 2 +-
 net/ipv6/sit.c             | 5 +++--
 30 files changed, 57 insertions(+), 49 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v5 01/16] memcg: enable accounting for net_device and Tx/Rx queues
@ 2021-07-19 10:44                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

A container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat this again and again.
A net device can request the creation of up to 4096 tx and rx queues,
forcing the kernel to allocate up to several tens of megabytes of memory
per net device.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside a memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c253c2a..e9aa1e4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10100,7 +10100,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	rx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -10167,7 +10167,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	tx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -10807,7 +10807,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	p = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
@ 2021-07-19 10:44                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and the ip_fib caches.

These objects can be removed manually, though usually they live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One such object is 'struct fib6_node', mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside a lock_bh()/unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to the corresponding
kmem cache. The proper memory cgroup still cannot be found due to the
incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

The obsolete in_interrupt() does not describe the real execution context properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, a new macro should be used instead:
 in_task()	- We're in task context

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae1f5d0..1bbf239 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index a9f9379..79df7cd 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 73721a4..d38124b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3bf685f..8eaeade 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 2d650dc..a8f118e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2449,8 +2449,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7b756a7..5f7286a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6638,7 +6638,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
@ 2021-07-19 10:44                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and the ip_fib caches.

These objects can be removed manually, though usually they live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One such object is 'struct fib6_node', mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to the
corresponding kmem cache: the proper memory cgroup still cannot be found
due to the incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

The obsolete in_interrupt() does not describe the real execution context properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, the new macro should be used instead:
 in_task()	- We're in task context

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae1f5d0..1bbf239 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index a9f9379..79df7cd 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 73721a4..d38124b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3bf685f..8eaeade 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 2d650dc..a8f118e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2449,8 +2449,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7b756a7..5f7286a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6638,7 +6638,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 03/16] memcg: enable accounting for inet_bin_bucket cache
       [not found]                           ` <cover.1626688654.git.vvs@virtuozzo.com>
  2021-07-19 10:44                               ` Vasily Averin
  2021-07-19 10:44                               ` Vasily Averin
@ 2021-07-19 10:44                             ` Vasily Averin
  2021-07-19 10:44                             ` [PATCH v5 04/16] memcg: enable accounting for VLAN group array Vasily Averin
                                               ` (12 subsequent siblings)
  15 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A net namespace can create up to 64K tcp and dccp ports, forcing the
kernel to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/dccp/proto.c | 2 +-
 net/ipv4/tcp.c   | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 7eb0fb2..abb5c59 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1126,7 +1126,7 @@ static int __init dccp_init(void)
 	dccp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("dccp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d5ab5f2..5c0605e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4509,7 +4509,9 @@ void __init tcp_init(void)
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 04/16] memcg: enable accounting for VLAN group array
       [not found]                           ` <cover.1626688654.git.vvs@virtuozzo.com>
                                               ` (2 preceding siblings ...)
  2021-07-19 10:44                             ` [PATCH v5 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
@ 2021-07-19 10:44                             ` Vasily Averin
  2021-07-19 10:44                               ` Vasily Averin
                                               ` (11 subsequent siblings)
  15 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

A vlan group array consumes up to 8 pages of memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/8021q/vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 4cdf841..55275ef 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -67,7 +67,7 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 		return 0;
 
 	size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN;
-	array = kzalloc(size, GFP_KERNEL);
+	array = kzalloc(size, GFP_KERNEL_ACCOUNT);
 	if (array == NULL)
 		return -ENOBUFS;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
@ 2021-07-19 10:44                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

Author: Andrey Ryabinin <aryabinin@virtuozzo.com>

The size of the ip_tunnel_prl structs allocation is controllable from
user-space, so it's better to avoid spamming dmesg if the allocation fails.
Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
accounting. The allocation is temporary and limited to 4GB.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/sit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index df5bea8..33adc12 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -321,7 +321,7 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 	 * we try harder to allocate.
 	 */
 	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
-		kcalloc(cmax, sizeof(*kp), GFP_KERNEL | __GFP_NOWARN) :
+		kcalloc(cmax, sizeof(*kp), GFP_KERNEL_ACCOUNT | __GFP_NOWARN) :
 		NULL;
 
 	rcu_read_lock();
@@ -334,7 +334,8 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 		 * For root users, retry allocating enough memory for
 		 * the answer.
 		 */
-		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC);
+		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC | __GFP_ACCOUNT |
+					      __GFP_NOWARN);
 		if (!kp) {
 			ret = -ENOMEM;
 			goto out;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 06/16] memcg: enable accounting for scm_fp_list objects
@ 2021-07-19 10:44                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, netdev, linux-kernel

Unix sockets allow sending file descriptors via SCM_RIGHTS-type messages.
Each such send call forces the kernel to allocate up to 2KB of memory for
a struct scm_fp_list.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/scm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index ae3085d..5c356f0 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -79,7 +79,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 
 	if (!fpl)
 	{
-		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL);
+		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL_ACCOUNT);
 		if (!fpl)
 			return -ENOMEM;
 		*fplp = fpl;
@@ -355,7 +355,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		return NULL;
 
 	new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),
-			  GFP_KERNEL);
+			  GFP_KERNEL_ACCOUNT);
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 07/16] memcg: enable accounting for mnt_cache entries
@ 2021-07-19 10:45                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

The kernel allocates ~400 bytes of 'struct mount' for each new mount.
Creating a new mount namespace clones most of the parent mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab4174a..c6a74e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4222,7 +4223,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 08/16] memcg: enable accounting for pollfd and select bits arrays
@ 2021-07-19 10:45                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

A user can call the select/poll system calls with a large number of
file descriptors and force the kernel to allocate up to several pages
of memory that remain in use until these sleeping system calls return.
These are long-lived, unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 09/16] memcg: enable accounting for file lock caches
@ 2021-07-19 10:45                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

A user can create file locks for each open file and force the kernel
to allocate small but long-lived objects for each of them.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 74b2a1d..1bc7ede 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3056,10 +3056,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 10/16] memcg: enable accounting for fasync_cache
@ 2021-07-19 10:45                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and for regular files by the file lease code.
This structure is quite small but long-lived, and it can be assigned
to any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index dfc72f1..7941559 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 11/16] memcg: enable accounting for new namespaces and struct nsproxy
@ 2021-07-19 10:45                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Tejun Heo, Andrew Morton,
	Zefan Li, Thomas Gleixner, Christian Brauner, Kirill Tkhai,
	Serge Hallyn, Andrei Vagin, linux-kernel

A container admin can create new namespaces and force the kernel to
allocate up to several pages of memory for the namespaces and their
associated structures.
Net and uts namespaces already have accounting enabled for such allocations.
It makes sense to account for the remaining ones to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namespace.c            | 2 +-
 ipc/namespace.c           | 2 +-
 kernel/cgroup/namespace.c | 2 +-
 kernel/nsproxy.c          | 2 +-
 kernel/pid_namespace.c    | 2 +-
 kernel/time/namespace.c   | 4 ++--
 kernel/user_namespace.c   | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c6a74e5..e443ee6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3289,7 +3289,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
 	if (!ucounts)
 		return ERR_PTR(-ENOSPC);
 
-	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns) {
 		dec_mnt_namespaces(ucounts);
 		return ERR_PTR(-ENOMEM);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7bd0766..ae83f0f 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
+	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
 	if (ns == NULL)
 		goto fail_dec;
 
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index f5e8828..0d5c298 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
 	struct cgroup_namespace *new_ns;
 	int ret;
 
-	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns)
 		return ERR_PTR(-ENOMEM);
 	ret = ns_alloc_inum(&new_ns->ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index abc01fc..eec72ca 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
 
 int __init nsproxy_cache_init(void)
 {
-	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
+	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
 	return 0;
 }
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index ca43239..6cd6715 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
 
 static __init int pid_namespaces_init(void)
 {
-	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	register_sysctl_paths(kern_path, pid_ns_ctl_table);
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 12eab0d..aec8328 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
 	if (!ns)
 		goto fail_dec;
 
 	refcount_set(&ns->ns.count, 1);
 
-	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!ns->vvar_page)
 		goto fail_free;
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index ef82d40..6b2e3ca 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1385,7 +1385,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
 
 static __init int user_namespaces_init(void)
 {
-	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
+	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 	return 0;
 }
 subsys_initcall(user_namespaces_init);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH v5 12/16] memcg: enable accounting of ipc resources
@ 2021-07-19 10:45                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexey Dobriyan,
	Dmitry Safonov, Yutian Yang, linux-kernel

When a user creates IPC objects, the kernel is forced to allocate memory
for these long-lived objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, message
queues, semaphores and semaphore undo lists.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 ipc/msg.c | 2 +-
 ipc/sem.c | 9 +++++----
 ipc/shm.c | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 6810276..a0d0577 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kmalloc(sizeof(*msq), GFP_KERNEL);
+	msq = kmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
diff --git a/ipc/sem.c b/ipc/sem.c
index 971e75d..1a8b9f0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -514,7 +514,7 @@ static struct sem_array *sem_alloc(size_t nsems)
 	if (nsems > (INT_MAX - sizeof(*sma)) / sizeof(sma->sems[0]))
 		return NULL;
 
-	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL);
+	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!sma))
 		return NULL;
 
@@ -1855,7 +1855,7 @@ static inline int get_undo_list(struct sem_undo_list **undo_listp)
 
 	undo_list = current->sysvsem.undo_list;
 	if (!undo_list) {
-		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL);
+		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL_ACCOUNT);
 		if (undo_list == NULL)
 			return -ENOMEM;
 		spin_lock_init(&undo_list->lock);
@@ -1941,7 +1941,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 
 	/* step 2: allocate new undo structure */
 	new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!new) {
 		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
 		return ERR_PTR(-ENOMEM);
@@ -2005,7 +2005,8 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 	if (nsops > ns->sc_semopm)
 		return -E2BIG;
 	if (nsops > SEMOPM_FAST) {
-		sops = kvmalloc_array(nsops, sizeof(*sops), GFP_KERNEL);
+		sops = kvmalloc_array(nsops, sizeof(*sops),
+				      GFP_KERNEL_ACCOUNT);
 		if (sops == NULL)
 			return -ENOMEM;
 	}
diff --git a/ipc/shm.c b/ipc/shm.c
index 748933e..ab749be 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 			ns->shm_tot + numpages > ns->shm_ctlall)
 		return -ENOSPC;
 
-	shp = kmalloc(sizeof(*shp), GFP_KERNEL);
+	shp = kmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!shp))
 		return -ENOMEM;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH v5 13/16] memcg: enable accounting for signals
@ 2021-07-19 10:45                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Jens Axboe, Eric W. Biederman,
	Oleg Nesterov, linux-kernel

When a user sends a signal to another process, the kernel is forced to
allocate memory for a 'struct sigqueue' object. The number of signals
is limited by the RLIMIT_SIGPENDING resource limit, but even the default
settings allow each user to consume up to several megabytes of memory.
Moreover, an untrusted admin inside a container can increase the limit
or create new fake users and force them to send signals.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index a3229ad..8921c4a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4663,7 +4663,7 @@ void __init signals_init(void)
 {
 	siginfo_buildtime_checks();
 
-	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
+	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
 }
 
 #ifdef CONFIG_KGDB_KDB
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH v5 14/16] memcg: enable accounting for posix_timers_cache slab
       [not found]                           ` <cover.1626688654.git.vvs@virtuozzo.com>
                                               ` (12 preceding siblings ...)
  2021-07-19 10:45                               ` Vasily Averin
@ 2021-07-19 10:45                             ` Vasily Averin
  2021-07-19 10:45                               ` Vasily Averin
  2021-07-19 10:46                               ` Vasily Averin
  15 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, linux-kernel

A program may create multiple interval timers using timer_create().
For each timer the kernel preallocates a "queued real-time signal";
consequently, the number of timers is limited by the RLIMIT_SIGPENDING
resource limit. The allocated object is quite small, ~250 bytes,
but even the default signal limits allow up to 100 megabytes of memory
to be consumed per user.

It makes sense to account for them to limit the host's memory consumption
from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/posix-timers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index dd5697d..7363f81 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -273,8 +273,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
 static __init int init_posix_timers(void)
 {
 	posix_timers_cache = kmem_cache_create("posix_timers_cache",
-					sizeof (struct k_itimer), 0, SLAB_PANIC,
-					NULL);
+					sizeof(struct k_itimer), 0,
+					SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v5 15/16] memcg: enable accounting for tty-related objects
@ 2021-07-19 10:45                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Greg Kroah-Hartman, Jiri Slaby,
	linux-kernel

At each login the user forces the kernel to create a new terminal and
allocate up to ~1KB of memory for the tty-related structures.

By default up to 4096 ptys can be created, with 1024 reserved for the
initial mount namespace only, and the settings are controlled by the
host admin.

However, this default is not enough for hosters with thousands of
containers per node, and the host admin can be forced to increase it
up to NR_UNIX98_PTY_MAX = 1<<20.

By default a container is restricted by pty mount_opt.max = 1024,
but an admin inside the container can change it via remount. As a
result, one container can consume almost all allowed ptys and allocate
up to 1GB of unaccounted memory.

This is not enough per se to trigger an OOM on the host, but it allows
a container to significantly exceed its assigned memcg limit and leads
to trouble on over-committed nodes.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/tty/tty_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 26debec..e787f6f 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1493,7 +1493,7 @@ void tty_save_termios(struct tty_struct *tty)
 	/* Stash the termios data */
 	tp = tty->driver->termios[idx];
 	if (tp == NULL) {
-		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
+		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
 		if (tp == NULL)
 			return;
 		tty->driver->termios[idx] = tp;
@@ -3119,7 +3119,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
 {
 	struct tty_struct *tty;
 
-	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
+	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
 	if (!tty)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH v5 16/16] memcg: enable accounting for ldt_struct objects
@ 2021-07-19 10:46                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-19 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, linux-kernel

Each task can request its own LDT and force the kernel to allocate up to
64KB of memory per mm.

There are legitimate workloads with hundreds of processes and there
can be hundreds of workloads running on large machines.
The unaccounted memory can cause isolation issues between the workloads
particularly on highly utilized machines.

It makes sense to account for these objects to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/ldt.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index aa15132..525876e 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -154,7 +154,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
 
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);
 	if (!new_ldt)
 		return NULL;
 
@@ -168,9 +168,9 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	 * than PAGE_SIZE.
 	 */
 	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
+		new_ldt->entries = __vmalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
+		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 
 	if (!new_ldt->entries) {
 		kfree(new_ldt);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
@ 2021-07-19 14:00                                 ` Dmitry Safonov
  0 siblings, 0 replies; 305+ messages in thread
From: Dmitry Safonov @ 2021-07-19 14:00 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Network Development, open list

Hi Vasily,

On Mon, 19 Jul 2021 at 11:45, Vasily Averin <vvs@virtuozzo.com> wrote:
[..]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae1f5d0..1bbf239 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>                 return false;
>
>         /* Memcg to charge can't be determined. */
> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>                 return true;

This seems to do two separate things in one patch.
Probably, it's better to separate them.
(I may be missing how the route changes relate to the more generic
__alloc_pages() change.)

Thanks,
           Dmitry

^ permalink raw reply	[flat|nested] 305+ messages in thread


* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
@ 2021-07-19 14:22                                   ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-07-19 14:22 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Vasily Averin, Andrew Morton, Cgroups, Michal Hocko,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Network Development, open list

On Mon, Jul 19, 2021 at 7:00 AM Dmitry Safonov <0x7f454c46@gmail.com> wrote:
>
> Hi Vasily,
>
> On Mon, 19 Jul 2021 at 11:45, Vasily Averin <vvs@virtuozzo.com> wrote:
> [..]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ae1f5d0..1bbf239 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
> >                 return false;
> >
> >         /* Memcg to charge can't be determined. */
> > -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> > +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
> >                 return true;
>
> This seems to do two separate things in one patch.
> Probably, it's better to separate them.
> (I may miss how route changes are related to more generic
> __alloc_pages() change)
>

It was requested to squash them together in some previous versions.
https://lore.kernel.org/linux-mm/YEiUIf0old+AZssa@dhcp22.suse.cz/

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
@ 2021-07-19 14:24                                     ` Dmitry Safonov
  0 siblings, 0 replies; 305+ messages in thread
From: Dmitry Safonov @ 2021-07-19 14:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vasily Averin, Andrew Morton, Cgroups, Michal Hocko,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Network Development, open list

On 7/19/21 3:22 PM, Shakeel Butt wrote:
> On Mon, Jul 19, 2021 at 7:00 AM Dmitry Safonov <0x7f454c46@gmail.com> wrote:
>>
>> Hi Vasily,
>>
>> On Mon, 19 Jul 2021 at 11:45, Vasily Averin <vvs@virtuozzo.com> wrote:
>> [..]
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index ae1f5d0..1bbf239 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>>>                 return false;
>>>
>>>         /* Memcg to charge can't be determined. */
>>> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
>>> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>>>                 return true;
>>
>> This seems to do two separate things in one patch.
>> Probably, it's better to separate them.
>> (I may miss how route changes are related to more generic
>> __alloc_pages() change)
>>
> 
> It was requested to squash them together in some previous versions.
> https://lore.kernel.org/linux-mm/YEiUIf0old+AZssa@dhcp22.suse.cz/
> 

Ah, alright, never mind then.

Thanks,
           Dmitry

* Re: [PATCH v5 13/16] memcg: enable accounting for signals
@ 2021-07-19 17:32                                 ` Eric W. Biederman
  0 siblings, 0 replies; 305+ messages in thread
From: Eric W. Biederman @ 2021-07-19 17:32 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jens Axboe,
	Oleg Nesterov, linux-kernel

Vasily Averin <vvs@virtuozzo.com> writes:

> When a user sends a signal to another process, it forces the kernel
> to allocate memory for 'struct sigqueue' objects. The number of signals
> is limited by the RLIMIT_SIGPENDING resource limit, but even the default
> settings allow each user to consume up to several megabytes of memory.
> Moreover, an untrusted admin inside a container can increase the limit or
> create new fake users and force them to send signals.

Not any more.  Currently the number of sigqueue objects is limited
by the rlimit of the creator of the user namespace of the container.

> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.

Does it?  Why?  The given justification appears to have bit-rotted
since -rc1.

I know a lot of these things only really need a limit just to catch a
program that starts malfunctioning.  If that is indeed the case
reasonable per-resource limits are probably better than some great big
group limit that can be exhausted with any single resource in the group.

Is there a reason I am not aware of that where it makes sense to group
all of the resources together and only count the number of bytes
consumed?

Eric


> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  kernel/signal.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a3229ad..8921c4a 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -4663,7 +4663,7 @@ void __init signals_init(void)
>  {
>  	siginfo_buildtime_checks();
>  
> -	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
> +	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
>  }
>  
>  #ifdef CONFIG_KGDB_KDB

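The one-line change quoted above is the entire mechanism of this series: allocations from a cache created with SLAB_ACCOUNT are charged to the allocating task's memory cgroup and fail (or trigger reclaim) once the cgroup's limit is reached. A rough user-space sketch of that accounting decision (flag values, names, and sizes are made up for illustration; this is not the kernel API):

```c
#include <stdbool.h>
#include <stddef.h>

#define SLAB_PANIC   (1u << 0)  /* illustrative flag values */
#define SLAB_ACCOUNT (1u << 1)

struct memcg {
    size_t charged;  /* bytes currently charged to this cgroup */
    size_t limit;    /* the cgroup's memory limit              */
};

struct kmem_cache {
    const char  *name;
    size_t       object_size;
    unsigned int flags;
};

/* Charge one object to the cgroup if and only if the cache is marked
 * SLAB_ACCOUNT; an over-limit charge fails here, which in the kernel
 * would fail the allocation or push the cgroup into reclaim/OOM. */
static bool charge_object(const struct kmem_cache *s, struct memcg *mg)
{
    if (!(s->flags & SLAB_ACCOUNT))
        return true;  /* unaccounted cache: nothing is charged */
    if (mg->charged + s->object_size > mg->limit)
        return false; /* cgroup limit hit */
    mg->charged += s->object_size;
    return true;
}
```

This is why the per-subsystem patches in the series are all one- or two-line flag additions: the charging policy itself lives in the slab allocator and memcg core.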
* Re: [PATCH v5 13/16] memcg: enable accounting for signals
  2021-07-19 17:32                                 ` Eric W. Biederman
@ 2021-07-20  8:35                                   ` Vasily Averin
  -1 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-20  8:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jens Axboe,
	Oleg Nesterov, linux-kernel

On 7/19/21 8:32 PM, Eric W. Biederman wrote:
> Vasily Averin <vvs@virtuozzo.com> writes:
> 
>> When a user sends a signal to another process, it forces the kernel
>> to allocate memory for 'struct sigqueue' objects. The number of signals
>> is limited by the RLIMIT_SIGPENDING resource limit, but even the default
>> settings allow each user to consume up to several megabytes of memory.
>> Moreover, an untrusted admin inside a container can increase the limit or
>> create new fake users and force them to send signals.
> 
> Not any more.  Currently the number of sigqueue objects is limited
> by the rlimit of the creator of the user namespace of the container.
> 
>> It makes sense to account for these allocations to restrict the host's
>> memory consumption from inside the memcg-limited container.
> 
> Does it?  Why?  The given justification appears to have bit-rotted
> since -rc1.

Could you please explain what changed in -rc1?
From my POV, accounting is required to help the OOM killer select a proper target.

> I know a lot of these things only really need a limit just to catch a
> program that starts malfunctioning.  If that is indeed the case
> reasonable per-resource limits are probably better than some great big
> group limit that can be exhausted with any single resource in the group.
> 
> Is there a reason I am not aware of that where it makes sense to group
> all of the resources together and only count the number of bytes
> consumed?

Any new limits:
a) should be set properly, depending on a huge number of incoming parameters;
b) should properly notify about hits;
c) should be updated properly after b);
d) should do a)-c) automatically if possible.

In the past OpenVZ had its own accounting subsystem, user beancounters (UBC).
It accounted for and limited 20+ resources per container: numfiles, file locks,
signals, netfilter rules, socket buffers and so on.
I assume you want to do something similar, so let me share our experience.

We had a lot of problems with UBC:
- It's quite hard to set up the limit.
  Why is it fine to consume N entities of some resource but bad to consume N+1?
  Per-process? Per-user? Per-thread? Per-task? Per-namespace? If nested? Per-container? Per-host?
  To answer these questions the host admin needs additional knowledge and skills.

- OK, we have set all the limits. Some application hits one and fails.
  It's quite hard to understand that the application hit a limit and failed for that reason.
  From the user's point of view, if some application does not work (stably enough)
  inside a container => containers are guilty.

- It's quite hard to understand that a failed application just wants limit X increased up to N entities.

As a result, both host admins and container users were unhappy.
So after years of such fights we decided just to limit accounted memory instead.

Anyway, the OOM killer must know who consumed memory to select a proper target.

Thank you,
	Vasily Averin
 

* Re: [PATCH v5 13/16] memcg: enable accounting for signals
  2021-07-20  8:35                                   ` Vasily Averin
  (?)
@ 2021-07-20 14:37                                   ` Shakeel Butt
  -1 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-07-20 14:37 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Eric W. Biederman, Andrew Morton, Cgroups, Michal Hocko,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jens Axboe,
	Oleg Nesterov, LKML

On Tue, Jul 20, 2021 at 1:35 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 7/19/21 8:32 PM, Eric W. Biederman wrote:
> > Vasily Averin <vvs@virtuozzo.com> writes:
> >
> >> When a user sends a signal to another process, it forces the kernel
> >> to allocate memory for 'struct sigqueue' objects. The number of signals
> >> is limited by the RLIMIT_SIGPENDING resource limit, but even the default
> >> settings allow each user to consume up to several megabytes of memory.
> >> Moreover, an untrusted admin inside a container can increase the limit or
> >> create new fake users and force them to send signals.
> >
> > Not any more.  Currently the number of sigqueue objects is limited
> > by the rlimit of the creator of the user namespace of the container.
> >
> >> It makes sense to account for these allocations to restrict the host's
> >> memory consumption from inside the memcg-limited container.
> >
> > Does it?  Why?  The given justification appears to have bit-rotted
> > since -rc1.
>
> Could you please explain what changed in -rc1?
> From my POV, accounting is required to help the OOM killer select a proper target.
>
> > I know a lot of these things only really need a limit just to catch a
> > program that starts malfunctioning.  If that is indeed the case
> > reasonable per-resource limits are probably better than some great big
> > group limit that can be exhausted with any single resource in the group.
> >
> > Is there a reason I am not aware of that where it makes sense to group
> > all of the resources together and only count the number of bytes
> > consumed?
>
> Any new limits:
> a) should be set properly, depending on a huge number of incoming parameters;
> b) should properly notify about hits;
> c) should be updated properly after b);
> d) should do a)-c) automatically if possible.
>
> In the past OpenVZ had its own accounting subsystem, user beancounters (UBC).
> It accounted for and limited 20+ resources per container: numfiles, file locks,
> signals, netfilter rules, socket buffers and so on.
> I assume you want to do something similar, so let me share our experience.
>
> We had a lot of problems with UBC:
> - It's quite hard to set up the limit.
>   Why is it fine to consume N entities of some resource but bad to consume N+1?
>   Per-process? Per-user? Per-thread? Per-task? Per-namespace? If nested? Per-container? Per-host?
>   To answer these questions the host admin needs additional knowledge and skills.
>
> - OK, we have set all the limits. Some application hits one and fails.
>   It's quite hard to understand that the application hit a limit and failed for that reason.
>   From the user's point of view, if some application does not work (stably enough)
>   inside a container => containers are guilty.
>
> - It's quite hard to understand that a failed application just wants limit X increased up to N entities.
>
> As a result, both host admins and container users were unhappy.
> So after years of such fights we decided just to limit accounted memory instead.
>
> Anyway, the OOM killer must know who consumed memory to select a proper target.
>

Just to support Vasily's point further, for systems running multiple
workloads, it is much more preferred to be able to set one limit for
each workload than to set many different limits.

One concrete example is described in commit ac7b79fd190b ("inotify,
memcg: account inotify instances to kmemcg"). The inotify instances
can be limited through the fs sysctl inotify/max_user_instances and
further partitioned among users through a per-user-namespace-specific
sysctl, but there is no sensible way to set a limit and partition it on
a system that runs different workloads.

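The trade-off Vasily and Shakeel describe — one byte-denominated budget per workload versus a separate, pre-sized count knob per resource class — can be reduced to a toy model (purely illustrative structures, not kernel code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-resource model: every object class needs its own count limit
 * (inotify instances, pending signals, file locks, ...), and the admin
 * must guess each N in advance. */
struct count_limits {
    unsigned int inotify_max;
    unsigned int sigpending_max;
    unsigned int flock_max;
};

/* Memcg-style model: a single byte budget covers every accounted
 * object class, so only one number has to be sized per workload. */
struct byte_budget {
    size_t used;
    size_t max;
};

/* Charge an allocation of any accounted class against the one budget. */
static bool charge_bytes(struct byte_budget *b, size_t size)
{
    if (b->used + size > b->max)
        return false;  /* workload limit hit: candidate for memcg OOM */
    b->used += size;
    return true;
}
```

In the second model a workload can spend its whole budget on whichever resource it actually needs, which is exactly the property Eric questions below and Vasily defends from operational experience.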
* Re: [PATCH v5 13/16] memcg: enable accounting for signals
@ 2021-07-20 16:42                                     ` Eric W. Biederman
  0 siblings, 0 replies; 305+ messages in thread
From: Eric W. Biederman @ 2021-07-20 16:42 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jens Axboe,
	Oleg Nesterov, linux-kernel

Vasily Averin <vvs@virtuozzo.com> writes:

> On 7/19/21 8:32 PM, Eric W. Biederman wrote:
>> Vasily Averin <vvs@virtuozzo.com> writes:
>> 
>>> When a user sends a signal to another process, it forces the kernel
>>> to allocate memory for 'struct sigqueue' objects. The number of signals
>>> is limited by the RLIMIT_SIGPENDING resource limit, but even the default
>>> settings allow each user to consume up to several megabytes of memory.
>>> Moreover, an untrusted admin inside a container can increase the limit or
>>> create new fake users and force them to send signals.
>> 
>> Not any more.  Currently the number of sigqueue objects is limited
>> by the rlimit of the creator of the user namespace of the container.
>> 
>>> It makes sense to account for these allocations to restrict the host's
>>> memory consumption from inside the memcg-limited container.
>> 
>> Does it?  Why?  The given justification appears to have bit-rotted
>> since -rc1.
>
> Could you please explain what changed in -rc1?
> From my POV, accounting is required to help the OOM killer select a proper target.

You can no longer escape the rlimit of the creator of the user
namespace by creating multiple users inside of the user namespace.  The
users (in the user namespace) will have their individual rlimits but
are jointly bounded by the limit of the creator of the user namespace at the
time the user namespace was created.

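For context, the rlimit under discussion is a real, inspectable per-user cap on queued signals. A minimal user-space reader (standard getrlimit(2); RLIMIT_SIGPENDING itself is Linux-specific):

```c
#include <sys/resource.h>

/* Read the soft and hard caps on the number of queued signals for the
 * current user. Returns 0 on success, -1 on error, like getrlimit(). */
static int read_sigpending_limit(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_SIGPENDING, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}
```

Eric's point above is that the kernel now charges queued signals against the namespace creator's instance of this limit, so creating fake users inside the namespace no longer escapes it.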
>> I know a lot of these things only really need a limit just to catch a
>> program that starts malfunctioning.  If that is indeed the case
>> reasonable per-resource limits are probably better than some great big
>> group limit that can be exhausted with any single resource in the group.
>> 
>> Is there a reason I am not aware of that where it makes sense to group
>> all of the resources together and only count the number of bytes
>> consumed?
>
> Any new limits:
> a) should be set properly, depending on a huge number of incoming parameters;
> b) should properly notify about hits;
> c) should be updated properly after b);
> d) should do a)-c) automatically if possible.
>
> In the past OpenVZ had its own accounting subsystem, user beancounters (UBC).
> It accounted for and limited 20+ resources per container: numfiles, file locks,
> signals, netfilter rules, socket buffers and so on.
> I assume you want to do something similar, so let me share our experience.
>
> We had a lot of problems with UBC:
> - It's quite hard to set up the limit.
>   Why is it fine to consume N entities of some resource but bad to consume N+1?
>   Per-process? Per-user? Per-thread? Per-task? Per-namespace? If nested? Per-container? Per-host?
>   To answer these questions the host admin needs additional knowledge and skills.
>
> - OK, we have set all the limits. Some application hits one and fails.
>   It's quite hard to understand that the application hit a limit and failed for that reason.
>   From the user's point of view, if some application does not work (stably enough)
>   inside a container => containers are guilty.
>
> - It's quite hard to understand that a failed application just wants limit X increased up to N entities.
>
> As a result, both host admins and container users were unhappy.
> So after years of such fights we decided just to limit accounted
> memory instead.


Which is a perfectly fine justification.  However the justification
presented in the change log was that there is no existing limit, and
that is what is factually wrong.

Different kinds of limits serve different purposes.  An accounting of how
many instances of a resource have been used serves the purpose of detecting
when an application has gone completely out of spec.

Limits like you are implementing here are much better for just sharing and
managing system resources.

> Anyway, the OOM killer must know who consumed memory to select a proper
> target.

The limits I am talking about are not useful to the OOM killer, as
usually the amounts of resources that indicate something has gone wrong
are too small to be a system-wide problem.

So please just correct the justification in the commit message and I
will be happy.

Eric  

* Re: [PATCH v5 13/16] memcg: enable accounting for signals
@ 2021-07-20 19:15                                 ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-07-20 19:15 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Jens Axboe, Eric W. Biederman,
	Oleg Nesterov, LKML

On Mon, Jul 19, 2021 at 3:46 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> When a user sends a signal to another process, it forces the kernel
> to allocate memory for 'struct sigqueue' objects. The number of signals
> is limited by the RLIMIT_SIGPENDING resource limit, but even the default
> settings allow each user to consume up to several megabytes of memory.
> Moreover, an untrusted admin inside a container can increase the limit or
> create new fake users and force them to send signals.
>
> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

It seems like there is an agreement on this patch with the updated
commit message. In the next version you can add:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
@ 2021-07-20 19:26                                 ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-07-20 19:26 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev, LKML

On Mon, Jul 19, 2021 at 3:44 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> A netadmin inside a container can use 'ip a a' and 'ip r a'
> to assign a large number of ipv4/ipv6 addresses and routing entries
> and force the kernel to allocate megabytes of unaccounted memory
> for long-lived per-netdevice kernel objects:
> 'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
> 'struct rt6_info', 'struct fib_rules' and ip_fib caches.
>
> These objects can be manually removed, though usually they live
> in memory until their net namespace is destroyed.
>
> It makes sense to account for them to restrict the host's memory
> consumption from inside the memcg-limited container.
>
> One such object is the 'struct fib6_node', mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
>
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
>
> In this case it is not enough to simply add SLAB_ACCOUNT to the
> corresponding kmem cache. The proper memory cgroup still cannot be found
> due to the incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
>
> The obsolete in_interrupt() does not describe the real execution context
> properly. From include/linux/preempt.h:
>
>  The following macros are deprecated and should not be used in new code:
>  in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
>
> To verify the current execution context, the new macro should be used
> instead:
>  in_task()      - We're in task context
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/memcontrol.c      | 2 +-
>  net/core/fib_rules.c | 4 ++--
>  net/ipv4/devinet.c   | 2 +-
>  net/ipv4/fib_trie.c  | 4 ++--
>  net/ipv6/addrconf.c  | 2 +-
>  net/ipv6/ip6_fib.c   | 4 ++--
>  net/ipv6/route.c     | 2 +-
>  7 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae1f5d0..1bbf239 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>                 return false;
>
>         /* Memcg to charge can't be determined. */
> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>                 return true;
>
>         return false;

Can you please also change in_interrupt() in active_memcg() as well?
There are other unrelated in_interrupt() in that file but the one in
active_memcg() should be coupled with this change.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
@ 2021-07-26 10:23                                   ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 10:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev, LKML

On 7/20/21 10:26 PM, Shakeel Butt wrote:
> On Mon, Jul 19, 2021 at 3:44 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> A netadmin inside a container can use 'ip a a' and 'ip r a'
>> to assign a large number of ipv4/ipv6 addresses and routing entries
>> and force the kernel to allocate megabytes of unaccounted memory
>> for long-lived per-netdevice kernel objects:
>> 'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
>> 'struct rt6_info', 'struct fib_rules' and ip_fib caches.
>>
>> These objects can be manually removed, though usually they live
>> in memory until their net namespace is destroyed.
>>
>> It makes sense to account for them to restrict the host's memory
>> consumption from inside the memcg-limited container.
>>
>> One such object is the 'struct fib6_node', mostly allocated in
>> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
>>
>>  write_lock_bh(&table->tb6_lock);
>>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>>  write_unlock_bh(&table->tb6_lock);
>>
>> In this case it is not enough to simply add SLAB_ACCOUNT to the
>> corresponding kmem cache. The proper memory cgroup still cannot be found
>> due to the incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
>>
>> The obsolete in_interrupt() does not describe the real execution context
>> properly. From include/linux/preempt.h:
>>
>>  The following macros are deprecated and should not be used in new code:
>>  in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
>>
>> To verify the current execution context, the new macro should be used
>> instead:
>>  in_task()      - We're in task context
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  mm/memcontrol.c      | 2 +-
>>  net/core/fib_rules.c | 4 ++--
>>  net/ipv4/devinet.c   | 2 +-
>>  net/ipv4/fib_trie.c  | 4 ++--
>>  net/ipv6/addrconf.c  | 2 +-
>>  net/ipv6/ip6_fib.c   | 4 ++--
>>  net/ipv6/route.c     | 2 +-
>>  7 files changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ae1f5d0..1bbf239 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>>                 return false;
>>
>>         /* Memcg to charge can't be determined. */
>> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
>> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>>                 return true;
>>
>>         return false;
> 
> Can you please also change in_interrupt() in active_memcg() as well?
> There are other unrelated in_interrupt() in that file but the one in
> active_memcg() should be coupled with this change.

Could you please elaborate?
From my point of view, active_memcg() is paired with set_active_memcg() and is not related to this case.
active_memcg() uses the memcg that was set by set_active_memcg(), either from the int_active_memcg per-cpu pointer
or from the current->active_memcg pointer.
I agree that in the case of disabled BH it is incorrect to use int_active_memcg,
but we can still use current->active_memcg. However, it isn't a problem:
the memcg will be properly provided in both cases.

I think it's better to fix set_active_memcg()/active_memcg() in a separate patch.

Have I missed something, perhaps?

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-26 10:23                                   ` Vasily Averin
  (?)
@ 2021-07-26 13:48                                   ` Shakeel Butt
  2021-07-26 16:53                                     ` [PATCH] memcg: replace in_interrupt() by !in_task() in active_memcg() Vasily Averin
  -1 siblings, 1 reply; 305+ messages in thread
From: Shakeel Butt @ 2021-07-26 13:48 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev, LKML

On Mon, Jul 26, 2021 at 3:23 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
[...]
> >
> > Can you please also change in_interrupt() in active_memcg() as well?
> > There are other unrelated in_interrupt() in that file but the one in
> > active_memcg() should be coupled with this change.
>
> Could you please elaborate?
> From my point of view, active_memcg() is paired with set_active_memcg() and is not related to this case.
> active_memcg() uses the memcg that was set by set_active_memcg(), either from the int_active_memcg per-cpu pointer
> or from the current->active_memcg pointer.
> I agree that in the case of disabled BH it is incorrect to use int_active_memcg,
> but we can still use current->active_memcg. However, it isn't a problem:
> the memcg will be properly provided in both cases.
>
> I think it's better to fix set_active_memcg()/active_memcg() in a separate patch.
>
> Have I missed something, perhaps?
>

No, you are right. That should be a separate patch.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH] memcg: replace in_interrupt() by !in_task() in active_memcg()
  2021-07-26 13:48                                   ` Shakeel Butt
@ 2021-07-26 16:53                                     ` Vasily Averin
  2021-07-26 16:57                                         ` Shakeel Butt
  0 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 16:53 UTC (permalink / raw)
  To: Shakeel Butt, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Andrew Morton
  Cc: cgroups, linux-kernel, linux-mm

set_active_memcg() uses an in_interrupt() check to select the proper storage
for the cgroup: a pointer on the task struct or a per-cpu pointer.

It isn't fully correct: the obsolete in_interrupt() also includes tasks with disabled BH.
It's better to use '!in_task()' instead.

Link: https://lkml.org/lkml/2021/7/26/487
Fixes: 37d5985c003d ("mm: kmem: prepare remote memcg charging infra for interrupt contexts")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 include/linux/sched/mm.h | 2 +-
 mm/memcontrol.c          | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e24b1fe348e3..9dd071f78dba 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -306,7 +306,7 @@ set_active_memcg(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *old;
 
-	if (in_interrupt()) {
+	if (!in_task()) {
 		old = this_cpu_read(int_active_memcg);
 		this_cpu_write(int_active_memcg, memcg);
 	} else {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae1f5d0cb581..3ebf792ef2c0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -905,7 +905,7 @@ EXPORT_SYMBOL(mem_cgroup_from_task);
 
 static __always_inline struct mem_cgroup *active_memcg(void)
 {
-	if (in_interrupt())
+	if (!in_task())
 		return this_cpu_read(int_active_memcg);
 	else
 		return current->active_memcg;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* Re: [PATCH] memcg: replace in_interrupt() by !in_task() in active_memcg()
  2021-07-26 16:53                                     ` [PATCH] memcg: replace in_interrupt() by !in_task() in active_memcg() Vasily Averin
  2021-07-26 16:57                                         ` Shakeel Butt
@ 2021-07-26 16:57                                         ` Shakeel Butt
  0 siblings, 0 replies; 305+ messages in thread
From: Shakeel Butt @ 2021-07-26 16:57 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Roman Gushchin,
	Andrew Morton, Cgroups, LKML, Linux MM

On Mon, Jul 26, 2021 at 9:53 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> set_active_memcg() uses an in_interrupt() check to select the proper storage
> for the cgroup: a pointer on the task struct or a per-cpu pointer.
>
> It isn't fully correct: the obsolete in_interrupt() also includes tasks with disabled BH.
> It's better to use '!in_task()' instead.
>
> Link: https://lkml.org/lkml/2021/7/26/487
> Fixes: 37d5985c003d ("mm: kmem: prepare remote memcg charging infra for interrupt contexts")
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Thanks.

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v6 00/16] memcg accounting from
@ 2021-07-26 18:59                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 18:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Andrew Morton,
	Borislav Petkov, Christian Brauner, David Ahern, David S. Miller,
	Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Thomas Gleixner, Zefan Li, netdev, linux-fsdevel,
	LKML

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux
kernels. Initially we used our own accounting subsystem, then partially
committed it to upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

This patch set is addressed mostly to the cgroups maintainers and the
cgroups@ mailing list, though I would be very grateful for any comments
from maintainers of the affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and drivers' fasync queues

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped entirely.

Also, we're going to add accounting for nft; however, it is not ready yet.

We have not tested performance on upstream; however, our performance team
has compared our current RHEL7-based production kernel and reports that
it is at least no worse than the corresponding original RHEL7 kernel.

v6:
- improved description of "memcg: enable accounting for signals"
  according to Eric Biederman's wishes
- added Reviewed-by tag from Shakeel Butt on the same patch

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which used that kind of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 mm/memcontrol.c            | 2 +-
 net/8021q/vlan.c           | 2 +-
 net/core/dev.c             | 6 +++---
 net/core/fib_rules.c       | 4 ++--
 net/core/scm.c             | 4 ++--
 net/dccp/proto.c           | 2 +-
 net/ipv4/devinet.c         | 2 +-
 net/ipv4/fib_trie.c        | 4 ++--
 net/ipv4/tcp.c             | 4 +++-
 net/ipv6/addrconf.c        | 2 +-
 net/ipv6/ip6_fib.c         | 4 ++--
 net/ipv6/route.c           | 2 +-
 net/ipv6/sit.c             | 5 +++--
 30 files changed, 57 insertions(+), 49 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v6 00/16] memcg accounting from
@ 2021-07-26 18:59                               ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 18:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Andrew Morton,
	Borislav Petkov, Christian Brauner, David Ahern, David S. Miller,
	Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman

OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels. 
Initially we used our own accounting subsystem, then partially committed
it to upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory 
allocation triggered by untrusted users inside the containers.

Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
list, though I would be very grateful for any comments from maintainersi
of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and its Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets 
- nsproxy and namespace objects itself
- IPC objects: semaphores, message queues and share memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file lock
- fasync_struct used by the file lease code and driver's fasync queues 
- tty objects
- per-mm LDT

We have an incorrect/incomplete/obsoleted accounting for few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and probably will be dropped at all.

Also we're going to add an accounting for nft, however it is not ready yet.

We have not tested performance on upstream, however, our performance team
compares our current RHEL7-based production kernel and reports that
they are at least not worse as the according original RHEL7 kernel.

v6:
- improved description of "memcg: enable accounting for signals"
  according to Eric Biderman's wishes
- added Reviewed-by tag from Shakeel Butt on the same patch

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kind of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache" used such kind of memory allocation 
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namespaces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 mm/memcontrol.c            | 2 +-
 net/8021q/vlan.c           | 2 +-
 net/core/dev.c             | 6 +++---
 net/core/fib_rules.c       | 4 ++--
 net/core/scm.c             | 4 ++--
 net/dccp/proto.c           | 2 +-
 net/ipv4/devinet.c         | 2 +-
 net/ipv4/fib_trie.c        | 4 ++--
 net/ipv4/tcp.c             | 4 +++-
 net/ipv6/addrconf.c        | 2 +-
 net/ipv6/ip6_fib.c         | 4 ++--
 net/ipv6/route.c           | 2 +-
 net/ipv6/sit.c             | 5 +++--
 30 files changed, 57 insertions(+), 49 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v6 01/16] memcg: enable accounting for net_device and Tx/Rx queues
@ 2021-07-26 18:59                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 18:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

A container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat this again and again.
A net device can request the creation of up to 4096 tx and rx queues,
forcing the kernel to allocate up to several tens of megabytes of memory
per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c253c2a..e9aa1e4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10100,7 +10100,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	rx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -10167,7 +10167,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	tx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -10807,7 +10807,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	p = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v6 02/16] memcg: enable accounting for IP address and routing-related objects
       [not found]                             ` <cover.1627321321.git.vvs@virtuozzo.com>
  2021-07-26 18:59                                 ` Vasily Averin
@ 2021-07-26 19:00                               ` Vasily Averin
  2021-07-26 19:00                                 ` Vasily Averin
                                                 ` (13 subsequent siblings)
  15 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and ip_fib caches.

These objects can be removed manually, though usually they live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One such object is 'struct fib6_node', mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside the
write_lock_bh()/write_unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
kmem cache. The proper memory cgroup still cannot be found due to the
incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

The obsolete in_interrupt() does not describe the real execution context properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, the new macro should be used instead:
 in_task()	- We're in task context

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae1f5d0..1bbf239 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index a9f9379..79df7cd 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 73721a4..d38124b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3bf685f..8eaeade 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 2d650dc..a8f118e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2449,8 +2449,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7b756a7..5f7286a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6638,7 +6638,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v6 03/16] memcg: enable accounting for inet_bind_bucket cache
@ 2021-07-26 19:00                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A net namespace can create up to 64K tcp and dccp ports, forcing the
kernel to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/dccp/proto.c | 2 +-
 net/ipv4/tcp.c   | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 7eb0fb2..abb5c59 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1126,7 +1126,7 @@ static int __init dccp_init(void)
 	dccp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("dccp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d5ab5f2..5c0605e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4509,7 +4509,9 @@ void __init tcp_init(void)
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v6 04/16] memcg: enable accounting for VLAN group array
@ 2021-07-26 19:00                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

The vlan group array consumes up to 8 pages of memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/8021q/vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 4cdf841..55275ef 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -67,7 +67,7 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 		return 0;
 
 	size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN;
-	array = kzalloc(size, GFP_KERNEL);
+	array = kzalloc(size, GFP_KERNEL_ACCOUNT);
 	if (array == NULL)
 		return -ENOBUFS;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v6 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
@ 2021-07-26 19:00                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

Author: Andrey Ryabinin <aryabinin@virtuozzo.com>

The size of the ip_tunnel_prl structs allocation is controllable from
user-space, thus it's better to avoid spamming dmesg if the allocation fails.
Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
accounting. The allocation is temporary and limited to 4GB.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/sit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index df5bea8..33adc12 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -321,7 +321,7 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 	 * we try harder to allocate.
 	 */
 	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
-		kcalloc(cmax, sizeof(*kp), GFP_KERNEL | __GFP_NOWARN) :
+		kcalloc(cmax, sizeof(*kp), GFP_KERNEL_ACCOUNT | __GFP_NOWARN) :
 		NULL;
 
 	rcu_read_lock();
@@ -334,7 +334,8 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 		 * For root users, retry allocating enough memory for
 		 * the answer.
 		 */
-		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC);
+		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC | __GFP_ACCOUNT |
+					      __GFP_NOWARN);
 		if (!kp) {
 			ret = -ENOMEM;
 			goto out;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v6 06/16] memcg: enable accounting for scm_fp_list objects
@ 2021-07-26 19:00                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, netdev, linux-kernel

Unix sockets allow sending file descriptors via SCM_RIGHTS type messages.
Each such send call forces the kernel to allocate up to 2KB of memory
for a struct scm_fp_list.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/scm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index ae3085d..5c356f0 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -79,7 +79,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 
 	if (!fpl)
 	{
-		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL);
+		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL_ACCOUNT);
 		if (!fpl)
 			return -ENOMEM;
 		*fplp = fpl;
@@ -355,7 +355,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		return NULL;
 
 	new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),
-			  GFP_KERNEL);
+			  GFP_KERNEL_ACCOUNT);
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v6 07/16] memcg: enable accounting for mnt_cache entries
@ 2021-07-26 19:00                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

The kernel allocates ~400 bytes of 'struct mount' for each new mount.
Creating a new mount namespace clones most of the parent mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab4174a..c6a74e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4222,7 +4223,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v6 08/16] memcg: enable accounting for pollfd and select bits arrays
@ 2021-07-26 19:00                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

A user can call the select/poll system calls with a large number of assigned
file descriptors, forcing the kernel to allocate up to several pages of memory
until these sleeping system calls end. These are long-lived,
unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v6 09/16] memcg: enable accounting for file lock caches
@ 2021-07-26 19:01                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

A user can create file locks for each open file, forcing the kernel
to allocate small but long-lived objects for each open file.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.
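For illustration (not part of the patch; `lock_tmp_file` is a hypothetical name), taking a POSIX record lock from userspace allocates a `struct file_lock` and a per-inode `file_lock_context` from the slab caches this patch marks SLAB_ACCOUNT:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Take a whole-file write lock on a temporary file, then drop it. */
static int lock_tmp_file(void)
{
	FILE *f = tmpfile();
	struct flock fl = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0,	/* 0 means "to end of file": whole file */
	};
	int ret;

	if (!f)
		return 0;	/* no writable tmpdir here: skip */
	ret = fcntl(fileno(f), F_SETLK, &fl);
	fclose(f);		/* closing the file drops the lock */
	return ret;
}
```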

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 74b2a1d..1bc7ede 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3056,10 +3056,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH v6 10/16] memcg: enable accounting for fasync_cache
@ 2021-07-26 19:01                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and for regular files by the file lease code.
This structure is quite small but long-lived, and one can be assigned
to any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index dfc72f1..7941559 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH v6 11/16] memcg: enable accounting for new namespaces and struct nsproxy
@ 2021-07-26 19:01                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Tejun Heo, Andrew Morton,
	Zefan Li, Thomas Gleixner, Christian Brauner, Kirill Tkhai,
	Serge Hallyn, Andrei Vagin, linux-kernel

A container admin can create new namespaces and force the kernel to
allocate up to several pages of memory for the namespaces and their
associated structures.
Net and uts namespaces already have accounting enabled for such
allocations. It makes sense to account for the remaining ones to
restrict the host's memory consumption from inside the memcg-limited
container.
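For illustration (not part of the patch; `try_unshare_ipc` is a hypothetical name), a single unshare() call is enough to trigger the namespace and nsproxy allocations this patch charges to the caller's memcg. Creating an IPC namespace normally requires CAP_SYS_ADMIN, so the sketch treats failure as a graceful skip:

```c
#define _GNU_SOURCE		/* for unshare() and CLONE_NEWIPC */
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Try to create a fresh IPC namespace for the calling process. */
static int try_unshare_ipc(void)
{
	if (unshare(CLONE_NEWIPC) == 0)
		return 0;	/* ipc_namespace + nsproxy allocated and charged */
	/* EPERM without CAP_SYS_ADMIN, EINVAL without CONFIG_IPC_NS,
	 * ENOSPC/EUSERS on ucount limits: all graceful skips here. */
	return 0;
}
```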

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namespace.c            | 2 +-
 ipc/namespace.c           | 2 +-
 kernel/cgroup/namespace.c | 2 +-
 kernel/nsproxy.c          | 2 +-
 kernel/pid_namespace.c    | 2 +-
 kernel/time/namespace.c   | 4 ++--
 kernel/user_namespace.c   | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c6a74e5..e443ee6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3289,7 +3289,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
 	if (!ucounts)
 		return ERR_PTR(-ENOSPC);
 
-	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns) {
 		dec_mnt_namespaces(ucounts);
 		return ERR_PTR(-ENOMEM);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7bd0766..ae83f0f 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
+	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
 	if (ns == NULL)
 		goto fail_dec;
 
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index f5e8828..0d5c298 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
 	struct cgroup_namespace *new_ns;
 	int ret;
 
-	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns)
 		return ERR_PTR(-ENOMEM);
 	ret = ns_alloc_inum(&new_ns->ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index abc01fc..eec72ca 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
 
 int __init nsproxy_cache_init(void)
 {
-	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
+	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
 	return 0;
 }
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index ca43239..6cd6715 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
 
 static __init int pid_namespaces_init(void)
 {
-	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	register_sysctl_paths(kern_path, pid_ns_ctl_table);
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 12eab0d..aec8328 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
 	if (!ns)
 		goto fail_dec;
 
 	refcount_set(&ns->ns.count, 1);
 
-	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!ns->vvar_page)
 		goto fail_free;
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index ef82d40..6b2e3ca 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1385,7 +1385,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
 
 static __init int user_namespaces_init(void)
 {
-	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
+	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 	return 0;
 }
 subsys_initcall(user_namespaces_init);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH v6 12/16] memcg: enable accounting of ipc resources
@ 2021-07-26 19:01                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexey Dobriyan,
	Dmitry Safonov, Yutian Yang, linux-kernel

When a user creates IPC objects, the kernel is forced to allocate
memory for these long-lived objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, messages,
semaphores, and semaphore undo lists.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 ipc/msg.c | 2 +-
 ipc/sem.c | 9 +++++----
 ipc/shm.c | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 6810276..a0d0577 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kmalloc(sizeof(*msq), GFP_KERNEL);
+	msq = kmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
diff --git a/ipc/sem.c b/ipc/sem.c
index 971e75d..1a8b9f0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -514,7 +514,7 @@ static struct sem_array *sem_alloc(size_t nsems)
 	if (nsems > (INT_MAX - sizeof(*sma)) / sizeof(sma->sems[0]))
 		return NULL;
 
-	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL);
+	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!sma))
 		return NULL;
 
@@ -1855,7 +1855,7 @@ static inline int get_undo_list(struct sem_undo_list **undo_listp)
 
 	undo_list = current->sysvsem.undo_list;
 	if (!undo_list) {
-		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL);
+		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL_ACCOUNT);
 		if (undo_list == NULL)
 			return -ENOMEM;
 		spin_lock_init(&undo_list->lock);
@@ -1941,7 +1941,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 
 	/* step 2: allocate new undo structure */
 	new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!new) {
 		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
 		return ERR_PTR(-ENOMEM);
@@ -2005,7 +2005,8 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 	if (nsops > ns->sc_semopm)
 		return -E2BIG;
 	if (nsops > SEMOPM_FAST) {
-		sops = kvmalloc_array(nsops, sizeof(*sops), GFP_KERNEL);
+		sops = kvmalloc_array(nsops, sizeof(*sops),
+				      GFP_KERNEL_ACCOUNT);
 		if (sops == NULL)
 			return -ENOMEM;
 	}
diff --git a/ipc/shm.c b/ipc/shm.c
index 748933e..ab749be 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 			ns->shm_tot + numpages > ns->shm_ctlall)
 		return -ENOSPC;
 
-	shp = kmalloc(sizeof(*shp), GFP_KERNEL);
+	shp = kmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!shp))
 		return -ENOMEM;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH v6 13/16] memcg: enable accounting for signals
@ 2021-07-26 19:01                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Jens Axboe, Eric W. Biederman,
	Oleg Nesterov, linux-kernel

When a user sends a signal to another process, the kernel is forced
to allocate memory for 'struct sigqueue' objects. The number of signals
is limited by the RLIMIT_SIGPENDING resource limit, but even the default
settings allow each user to consume up to several megabytes of memory.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index a3229ad..8921c4a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4663,7 +4663,7 @@ void __init signals_init(void)
 {
 	siginfo_buildtime_checks();
 
-	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
+	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
 }
 
 #ifdef CONFIG_KGDB_KDB
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH v6 14/16] memcg: enable accounting for posix_timers_cache slab
@ 2021-07-26 19:01                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, linux-kernel

A program may create multiple interval timers using timer_create().
For each timer the kernel preallocates a "queued real-time signal";
consequently, the number of timers is limited by the RLIMIT_SIGPENDING
resource limit. The allocated object is quite small, ~250 bytes,
but even the default signal limits allow up to 100 megabytes to be
consumed per user.

It makes sense to account for them to limit the host's memory consumption
from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/posix-timers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index dd5697d..7363f81 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -273,8 +273,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
 static __init int init_posix_timers(void)
 {
 	posix_timers_cache = kmem_cache_create("posix_timers_cache",
-					sizeof (struct k_itimer), 0, SLAB_PANIC,
-					NULL);
+					sizeof(struct k_itimer), 0,
+					SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH v6 15/16] memcg: enable accounting for tty-related objects
@ 2021-07-26 19:01                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Greg Kroah-Hartman, Jiri Slaby,
	linux-kernel

At each login the user forces the kernel to create a new terminal and
allocate up to ~1Kb of memory for the tty-related structures.

By default it is allowed to create up to 4096 ptys, with a reserve of
1024 for the initial mount namespace only, and the settings are
controlled by the host admin.

However, this default is not enough for hosting providers with
thousands of containers per node, and the host admin can be forced to
increase it up to NR_UNIX98_PTY_MAX = 1<<20.

By default a container is restricted to mount_opt.max = 1024 ptys,
but the admin inside the container can change this via remount.
As a result, one container can consume almost all allowed ptys
and allocate up to 1Gb of unaccounted memory.

This is not enough per se to trigger an OOM on the host, but it allows
the container to significantly exceed its assigned memcg limit and
leads to trouble on an over-committed node.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
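For illustration (not part of the patch; `pty_roundtrip` is a hypothetical name), opening a pty master allocates a tty_struct and termios buffers, which this patch makes memcg-charged:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate a pty master via /dev/ptmx, then release it. Environments
 * without a mounted devpts are treated as a graceful skip. */
static int pty_roundtrip(void)
{
	int m = posix_openpt(O_RDWR | O_NOCTTY);

	if (m < 0)
		return 0;	/* no /dev/ptmx here: skip */
	/* best effort: prepare the slave side, then drop everything */
	grantpt(m);
	unlockpt(m);
	close(m);		/* releases the tty_struct */
	return 0;
}
```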

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/tty/tty_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 26debec..e787f6f 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1493,7 +1493,7 @@ void tty_save_termios(struct tty_struct *tty)
 	/* Stash the termios data */
 	tp = tty->driver->termios[idx];
 	if (tp == NULL) {
-		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
+		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
 		if (tp == NULL)
 			return;
 		tty->driver->termios[idx] = tp;
@@ -3119,7 +3119,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
 {
 	struct tty_struct *tty;
 
-	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
+	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
 	if (!tty)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread


* [PATCH v6 16/16] memcg: enable accounting for ldt_struct objects
@ 2021-07-26 19:01                                 ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, linux-kernel

Each task can request its own LDT and force the kernel to allocate up to
64Kb of memory per mm.

There are legitimate workloads with hundreds of processes and there
can be hundreds of workloads running on large machines.
The unaccounted memory can cause isolation issues between the workloads
particularly on highly utilized machines.

It makes sense to account for these objects to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/ldt.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index aa15132..525876e 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -154,7 +154,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
 
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);
 	if (!new_ldt)
 		return NULL;
 
@@ -168,9 +168,9 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	 * than PAGE_SIZE.
 	 */
 	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
+		new_ldt->entries = __vmalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
+		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 
 	if (!new_ldt->entries) {
 		kfree(new_ldt);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* Re: [PATCH v6 11/16] memcg: enable accounting for new namespaces and struct nsproxy
@ 2021-07-26 19:58                                   ` Kirill Tkhai
  0 siblings, 0 replies; 305+ messages in thread
From: Kirill Tkhai @ 2021-07-26 19:58 UTC (permalink / raw)
  To: Vasily Averin, Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Tejun Heo, Zefan Li,
	Thomas Gleixner, Christian Brauner, Serge Hallyn, Andrei Vagin,
	linux-kernel

On 26.07.2021 22:01, Vasily Averin wrote:
> A container admin can create new namespaces and force the kernel to
> allocate up to several pages of memory for the namespaces and their
> associated structures.
> Net and uts namespaces already have accounting enabled for such
> allocations. It makes sense to account for the rest of them to restrict
> the host's memory consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>

> ---
>  fs/namespace.c            | 2 +-
>  ipc/namespace.c           | 2 +-
>  kernel/cgroup/namespace.c | 2 +-
>  kernel/nsproxy.c          | 2 +-
>  kernel/pid_namespace.c    | 2 +-
>  kernel/time/namespace.c   | 4 ++--
>  kernel/user_namespace.c   | 2 +-
>  7 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index c6a74e5..e443ee6 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3289,7 +3289,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
>  	if (!ucounts)
>  		return ERR_PTR(-ENOSPC);
>  
> -	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
> +	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
>  	if (!new_ns) {
>  		dec_mnt_namespaces(ucounts);
>  		return ERR_PTR(-ENOMEM);
> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index 7bd0766..ae83f0f 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
>  		goto fail;
>  
>  	err = -ENOMEM;
> -	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
> +	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
>  	if (ns == NULL)
>  		goto fail_dec;
>  
> diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
> index f5e8828..0d5c298 100644
> --- a/kernel/cgroup/namespace.c
> +++ b/kernel/cgroup/namespace.c
> @@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
>  	struct cgroup_namespace *new_ns;
>  	int ret;
>  
> -	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> +	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
>  	if (!new_ns)
>  		return ERR_PTR(-ENOMEM);
>  	ret = ns_alloc_inum(&new_ns->ns);
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index abc01fc..eec72ca 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
>  
>  int __init nsproxy_cache_init(void)
>  {
> -	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
> +	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
>  	return 0;
>  }
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index ca43239..6cd6715 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
>  
>  static __init int pid_namespaces_init(void)
>  {
> -	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> +	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>  
>  #ifdef CONFIG_CHECKPOINT_RESTORE
>  	register_sysctl_paths(kern_path, pid_ns_ctl_table);
> diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
> index 12eab0d..aec8328 100644
> --- a/kernel/time/namespace.c
> +++ b/kernel/time/namespace.c
> @@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
>  		goto fail;
>  
>  	err = -ENOMEM;
> -	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
> +	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
>  	if (!ns)
>  		goto fail_dec;
>  
>  	refcount_set(&ns->ns.count, 1);
>  
> -	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>  	if (!ns->vvar_page)
>  		goto fail_free;
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index ef82d40..6b2e3ca 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1385,7 +1385,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
>  
>  static __init int user_namespaces_init(void)
>  {
> -	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
> +	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>  	return 0;
>  }
>  subsys_initcall(user_namespaces_init);
> 


^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v6 00/16] memcg accounting from OpenVZ
  2021-07-26 18:59                               ` Vasily Averin
  (?)
@ 2021-07-26 21:59                               ` David Miller
  2021-07-27  4:44                                   ` Vasily Averin
  -1 siblings, 1 reply; 305+ messages in thread
From: David Miller @ 2021-07-26 21:59 UTC (permalink / raw)
  To: vvs
  Cc: akpm, tj, cgroups, mhocko, hannes, vdavydov.dev, guro, shakeelb,
	nglaive, viro, adobriyan, avagin, bp, christian.brauner, dsahern,
	0x7f454c46, edumazet, ebiederm, gregkh, yoshfuji, hpa, mingo,
	kuba, bfields, jlayton, axboe, jirislaby, ktkhai, oleg, serge,
	tglx, lizefan.x, netdev, linux-fsdevel, linux-kernel


This series does not apply cleanly to net-next, please respin.

Thank you.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH v6 00/16] memcg accounting from OpenVZ
@ 2021-07-27  4:44                                   ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-27  4:44 UTC (permalink / raw)
  To: David Miller
  Cc: akpm, tj, cgroups, mhocko, hannes, vdavydov.dev, guro, shakeelb,
	nglaive, viro, adobriyan, avagin, bp, christian.brauner, dsahern,
	0x7f454c46, edumazet, ebiederm, gregkh, yoshfuji, hpa, mingo,
	kuba, bfields, jlayton, axboe, jirislaby, ktkhai, oleg, serge,
	tglx, lizefan.x, netdev, linux-fsdevel, linux-kernel

On 7/27/21 12:59 AM, David Miller wrote:
> 
> This series does not apply cleanly to net-next, please respin.

Dear David,
I found that you have already approved the net-related patches of this series and included them in net-next.
So I'll respin v7 without these patches.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v7 00/10] memcg accounting from OpenVZ
  2021-07-27  4:44                                   ` Vasily Averin
@ 2021-07-27  5:33                                     ` Vasily Averin
  -1 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Borislav Petkov,
	Christian Brauner, Dmitry Safonov, Eric W. Biederman,
	Greg Kroah-Hartman, H. Peter Anvin, Ingo Molnar, J. Bruce Fields,
	Jeff Layton, Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Thomas Gleixner, Zefan Li, linux-fsdevel, LKML

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially committed
it to upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory 
allocation triggered by untrusted users inside the containers.

The patch set is addressed mostly to the cgroups maintainers and the
cgroups@ mailing list, though I would be very grateful for any comments
from maintainers of the affected subsystems or other people added in cc:

Compared to upstream, we additionally account for the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets 
- nsproxy and the namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and driver's fasync queues 
- tty objects
- per-mm LDT

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We are also going to add accounting for nft; however, it is not ready yet.

We have not tested performance on upstream; however, our performance team
compared our current RHEL7-based production kernel against the original
RHEL7 kernel and reports that it is at least no worse.

v7:
- net-related patches were approved and included into the net-next git
- rebase to v5.14-rc3
- added Acked-by tag from Kirill Tkhai on "memcg: enable accounting for
  new namespaces and struct nsproxy"

v6:
- improved description of "memcg: enable accounting for signals"
  according to Eric Biederman's wishes
- added Reviewed-by tag from Shakeel Butt on the same patch

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kind of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which used that kind of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (10):
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namespaces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 17 files changed, 34 insertions(+), 29 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries
@ 2021-07-27  5:33                                       ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

The kernel allocates ~400 bytes of 'struct mount' for each new mount.
Creating a new mount namespace clones most of the parent mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab4174a..c6a74e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4222,7 +4223,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 305+ messages in thread

* [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries
@ 2021-07-27  5:33                                       ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	Alexander Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

The kernel allocates ~400 bytes of 'strcut mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab4174a..c6a74e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4222,7 +4223,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1



* [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

A user can call the select/poll system calls with a large number of
file descriptors, forcing the kernel to allocate up to several pages of
memory that stay in use until these sleeping system calls return.
These are long-lived, unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1



* [PATCH v7 03/10] memcg: enable accounting for file lock caches
  9 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

A user can create file locks on each open file, forcing the kernel
to allocate small but long-lived objects for every open file.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 74b2a1d..1bc7ede 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3056,10 +3056,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1



* [PATCH v7 04/10] memcg: enable accounting for fasync_cache
  9 siblings, 1 reply; 305+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and by the file lease code for regular files.
This structure is quite small but long-lived, and one can be allocated
for any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index f946bec..714e7c9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1



* [PATCH v7 05/10] memcg: enable accounting for new namespaces and struct nsproxy
@ 2021-07-27  5:33                                       ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Tejun Heo, Andrew Morton,
	Zefan Li, Thomas Gleixner, Christian Brauner, Kirill Tkhai,
	Serge Hallyn, Andrei Vagin, linux-kernel

A container admin can create new namespaces, forcing the kernel to
allocate up to several pages of memory for the namespaces and their
associated structures.
Net and uts namespaces already have accounting enabled for such
allocations. It makes sense to account for the remaining ones to
restrict the host's memory consumption from inside the memcg-limited
container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/namespace.c            | 2 +-
 ipc/namespace.c           | 2 +-
 kernel/cgroup/namespace.c | 2 +-
 kernel/nsproxy.c          | 2 +-
 kernel/pid_namespace.c    | 2 +-
 kernel/time/namespace.c   | 4 ++--
 kernel/user_namespace.c   | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c6a74e5..e443ee6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3289,7 +3289,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
 	if (!ucounts)
 		return ERR_PTR(-ENOSPC);
 
-	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns) {
 		dec_mnt_namespaces(ucounts);
 		return ERR_PTR(-ENOMEM);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7bd0766..ae83f0f 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
+	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
 	if (ns == NULL)
 		goto fail_dec;
 
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index f5e8828..0d5c298 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
 	struct cgroup_namespace *new_ns;
 	int ret;
 
-	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns)
 		return ERR_PTR(-ENOMEM);
 	ret = ns_alloc_inum(&new_ns->ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index abc01fc..eec72ca 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
 
 int __init nsproxy_cache_init(void)
 {
-	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
+	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
 	return 0;
 }
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index ca43239..6cd6715 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
 
 static __init int pid_namespaces_init(void)
 {
-	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	register_sysctl_paths(kern_path, pid_ns_ctl_table);
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 12eab0d..aec8328 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
 	if (!ns)
 		goto fail_dec;
 
 	refcount_set(&ns->ns.count, 1);
 
-	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!ns->vvar_page)
 		goto fail_free;
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index ef82d40..6b2e3ca 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1385,7 +1385,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
 
 static __init int user_namespaces_init(void)
 {
-	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
+	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 	return 0;
 }
 subsys_initcall(user_namespaces_init);
-- 
1.8.3.1




* [PATCH v7 06/10] memcg: enable accounting of ipc resources
@ 2021-07-27  5:33                                       ` Vasily Averin
  0 siblings, 0 replies; 305+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexey Dobriyan,
	Dmitry Safonov, Yutian Yang, linux-kernel

When a user creates IPC objects, the kernel is forced to allocate
memory for these long-living objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, message
queues, semaphores, and semaphore undo lists.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 ipc/msg.c | 2 +-
 ipc/sem.c | 9 +++++----
 ipc/shm.c | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 6810276..a0d0577 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kmalloc(sizeof(*msq), GFP_KERNEL);
+	msq = kmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
diff --git a/ipc/sem.c b/ipc/sem.c
index 971e75d..1a8b9f0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -514,7 +514,7 @@ static struct sem_array *se