linux-mm.kvack.org archive mirror
* [PATCH 0/9] memcg accounting from OpenVZ
@ 2021-03-09  8:03 Vasily Averin
  2021-03-09 21:12 ` Shakeel Butt
  0 siblings, 1 reply; 21+ messages in thread
From: Vasily Averin @ 2021-03-09  8:03 UTC (permalink / raw)
  To: cgroups, linux-mm; +Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov

OpenVZ many years accounted memory of few kernel objects,
this helps us to prevent host memory abuse from inside memcg-limited container.

Vasily Averin (9):
  memcg: accounting for allocations called with disabled BH
  memcg: accounting for fib6_nodes cache
  memcg: accounting for ip6_dst_cache
  memcg: accounting for fib_rules
  memcg: accounting for ip_fib caches
  memcg: accounting for fasync_cache
  memcg: accounting for mnt_cache entries
  memcg: accounting for tty_struct objects
  memcg: accounting for ldt_struct objects

 arch/x86/kernel/ldt.c | 7 ++++---
 drivers/tty/tty_io.c  | 4 ++--
 fs/fcntl.c            | 3 ++-
 fs/namespace.c        | 5 +++--
 mm/memcontrol.c       | 2 +-
 net/core/fib_rules.c  | 4 ++--
 net/ipv4/fib_trie.c   | 4 ++--
 net/ipv6/ip6_fib.c    | 2 +-
 net/ipv6/route.c      | 2 +-
 9 files changed, 18 insertions(+), 15 deletions(-)
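
For context, the per-patch changes appear to follow the kernel's usual opt-in kmem
accounting pattern: slab caches gain the SLAB_ACCOUNT flag and direct allocations
gain __GFP_ACCOUNT. A minimal, illustrative sketch (generic names, not actual hunks
from this series):

  /* slab-cache-backed objects: every object from this cache is charged
   * to the allocating task's memory cgroup */
  example_cache = kmem_cache_create("example_cache",
                                    sizeof(struct example_obj), 0,
                                    SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);

  /* direct allocations: add __GFP_ACCOUNT (or use GFP_KERNEL_ACCOUNT) */
  obj = kzalloc(sizeof(*obj), GFP_KERNEL_ACCOUNT);

The mm/memcontrol.c hunk belongs to the first patch, which makes sure such charges
are not bypassed when the allocation happens with bottom halves disabled (see the
fib6_nodes discussion further down the thread).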

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] memcg accounting from OpenVZ
  2021-03-09  8:03 [PATCH 0/9] memcg accounting from OpenVZ Vasily Averin
@ 2021-03-09 21:12 ` Shakeel Butt
  2021-03-10 10:17   ` Vasily Averin
  0 siblings, 1 reply; 21+ messages in thread
From: Shakeel Butt @ 2021-03-09 21:12 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Cgroups, Linux MM, Johannes Weiner, Michal Hocko, Vladimir Davydov

On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> OpenVZ many years accounted memory of few kernel objects,
> this helps us to prevent host memory abuse from inside memcg-limited container.
>

The text is cryptic but I am assuming you wanted to say that OpenVZ
has remained on a kernel which was still on opt-out kmem accounting
i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
patches are needed, right?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] memcg accounting from OpenVZ
  2021-03-09 21:12 ` Shakeel Butt
@ 2021-03-10 10:17   ` Vasily Averin
  2021-03-10 10:41     ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Vasily Averin @ 2021-03-10 10:17 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Cgroups, Linux MM, Johannes Weiner, Michal Hocko, Vladimir Davydov

On 3/10/21 12:12 AM, Shakeel Butt wrote:
> On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> OpenVZ many years accounted memory of few kernel objects,
>> this helps us to prevent host memory abuse from inside memcg-limited container.
>>
> 
> The text is cryptic but I am assuming you wanted to say that OpenVZ
> has remained on a kernel which was still on opt-out kmem accounting
> i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
> patches are needed, right?

Something like this.
Frankly speaking, I do not quite understand which arguments I should provide to upstream
to enable accounting for some new kind of objects.

OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x Linux kernels)
and we have accounted all required kernel objects by using our own patches.
When memcg was added to upstream, Vladimir Davydov added accounting of some objects
to upstream but skipped some other ones.
Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
"skipped" objects by our own patches just because we accounted such objects before.
We're working on a rebase to newer kernels and we prefer to push our old patches upstream.

Thank you,
	Vasily Averin


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] memcg accounting from OpenVZ
  2021-03-10 10:17   ` Vasily Averin
@ 2021-03-10 10:41     ` Michal Hocko
  2021-03-11  7:00       ` Vasily Averin
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2021-03-10 10:41 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Cgroups, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed 10-03-21 13:17:19, Vasily Averin wrote:
> On 3/10/21 12:12 AM, Shakeel Butt wrote:
> > On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >> OpenVZ many years accounted memory of few kernel objects,
> >> this helps us to prevent host memory abuse from inside memcg-limited container.
> >>
> > 
> > The text is cryptic but I am assuming you wanted to say that OpenVZ
> > has remained on a kernel which was still on opt-out kmem accounting
> > i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
> > patches are needed, right?
> 
> Something like this.
> Frankly speaking, I do not quite understand which arguments I should provide to upstream
> to enable accounting for some new kind of objects.
>
> OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x Linux kernels)
> and we have accounted all required kernel objects by using our own patches.
> When memcg was added to upstream, Vladimir Davydov added accounting of some objects
> to upstream but skipped some other ones.
> Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
> "skipped" objects by our own patches just because we accounted such objects before.
> We're working on a rebase to newer kernels and we prefer to push our old patches upstream.

That is certainly interesting information. But for a changelog it
would be more appropriate to provide information about how much memory
a user can induce and whether there is any way to limit that memory by
other means, how practical those other means are, and which use cases
will benefit from the containment.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] memcg accounting from OpenVZ
  2021-03-10 10:41     ` Michal Hocko
@ 2021-03-11  7:00       ` Vasily Averin
  2021-03-11  8:35         ` Michal Hocko
  2021-03-11 15:14         ` [PATCH 0/9] memcg accounting from OpenVZ Shakeel Butt
  0 siblings, 2 replies; 21+ messages in thread
From: Vasily Averin @ 2021-03-11  7:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Cgroups, Linux MM, Johannes Weiner, Vladimir Davydov

On 3/10/21 1:41 PM, Michal Hocko wrote:
> On Wed 10-03-21 13:17:19, Vasily Averin wrote:
>> On 3/10/21 12:12 AM, Shakeel Butt wrote:
>>> On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>>>
>>>> OpenVZ many years accounted memory of few kernel objects,
>>>> this helps us to prevent host memory abuse from inside memcg-limited container.
>>>
>>> The text is cryptic but I am assuming you wanted to say that OpenVZ
>>> has remained on a kernel which was still on opt-out kmem accounting
>>> i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
>>> patches are needed, right?
>>
>> Something like this.
>> Frankly speaking, I do not quite understand which arguments I should provide to upstream
>> to enable accounting for some new kind of objects.
>>
>> OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x Linux kernels)
>> and we have accounted all required kernel objects by using our own patches.
>> When memcg was added to upstream, Vladimir Davydov added accounting of some objects
>> to upstream but skipped some other ones.
>> Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
>> "skipped" objects by our own patches just because we accounted such objects before.
>> We're working on a rebase to newer kernels and we prefer to push our old patches upstream.
> 
> That is certainly interesting information. But for a changelog it
> would be more appropriate to provide information about how much memory
> a user can induce and whether there is any way to limit that memory by
> other means, how practical those other means are, and which use cases
> will benefit from the containment.

Right now I would like to understand how I should argue for my requests about
accounting of new kinds of objects.

What kind of description is enough to enable object accounting?
Could you please specify some edge rules?
Should I push such patches through this list?
Or would it be better to send them to the mailing lists of the corresponding subsystems?
Should I notify them somehow at least?

"untrusted netadmin inside memcg-limited container can create unlimited number of routing entries, trigger OOM on host that will be unable to find the reason of memory  shortage and  kill huge"

"each mount inside memcg-limited container creates non-accounted mount object,
 but new mount namespace creation consumes huge piece of non-accounted memory for cloned mounts"

"unprivileged user inside memcg-limited container can create non-accounted multi-page per-thread kernel objects for LDT"

"non-accounted multi-page tty objects can be created from inside memcg-limited container"

"unprivileged user inside memcg-limited container can trigger creation of huge number of non-accounted fasync_struct objects"

Thank you,
	Vasily Averin


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] memcg accounting from OpenVZ
  2021-03-11  7:00       ` Vasily Averin
@ 2021-03-11  8:35         ` Michal Hocko
       [not found]           ` <360b4c94-8713-f621-1049-6bc0865c1867@virtuozzo.com>
                             ` (5 more replies)
  2021-03-11 15:14         ` [PATCH 0/9] memcg accounting from OpenVZ Shakeel Butt
  1 sibling, 6 replies; 21+ messages in thread
From: Michal Hocko @ 2021-03-11  8:35 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Cgroups, Linux MM, Johannes Weiner, Vladimir Davydov

On Thu 11-03-21 10:00:17, Vasily Averin wrote:
> On 3/10/21 1:41 PM, Michal Hocko wrote:
[...]
> > That is certainly interesting information. But for a changelog it
> > would be more appropriate to provide information about how much memory
> > a user can induce and whether there is any way to limit that memory by
> > other means, how practical those other means are, and which use cases
> > will benefit from the containment.
> 
> Right now I would like to understand how I should argue for my requests about
> accounting of new kinds of objects.
>
> What kind of description is enough to enable object accounting?

Doesn't the above paragraph give you a hint?

> Could you please specify some edge rules?

There are no strong rules AFAIK. I would say that what is important is
that the user can trigger a large or unbounded number of objects.

> Should I push such patches through this list?

Yes, linux-mm and CCing the memcg maintainers is the proper way. It would be
great to CC the maintainers of the affected subsystem as well.

> Or would it be better to send them to the mailing lists of the corresponding subsystems?

> Should I notify them somehow at least?
> 
> "untrusted netadmin inside memcg-limited container can create unlimited number of routing entries, trigger OOM on host that will be unable to find the reason of memory  shortage and  kill huge"
> 
> "each mount inside memcg-limited container creates non-accounted mount object,
>  but new mount namespace creation consumes huge piece of non-accounted memory for cloned mounts"
> 
> "unprivileged user inside memcg-limited container can create non-accounted multi-page per-thread kernel objects for LDT"
> 
> "non-accounted multi-page tty objects can be created from inside memcg-limited container"
> 
> "unprivileged user inside memcg-limited container can trigger creation of huge number of non-accounted fasync_struct objects"

OK, that sounds better to me. It would also be great if you could mention
whether there are any other means to limit those objects.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/9] memcg accounting from OpenVZ
  2021-03-11  7:00       ` Vasily Averin
  2021-03-11  8:35         ` Michal Hocko
@ 2021-03-11 15:14         ` Shakeel Butt
  1 sibling, 0 replies; 21+ messages in thread
From: Shakeel Butt @ 2021-03-11 15:14 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, Cgroups, Linux MM, Johannes Weiner, Vladimir Davydov

On Wed, Mar 10, 2021 at 11:00 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 3/10/21 1:41 PM, Michal Hocko wrote:
> > On Wed 10-03-21 13:17:19, Vasily Averin wrote:
> >> On 3/10/21 12:12 AM, Shakeel Butt wrote:
> >>> On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>>>
> >>>> OpenVZ many years accounted memory of few kernel objects,
> >>>> this helps us to prevent host memory abuse from inside memcg-limited container.
> >>>
> >>> The text is cryptic but I am assuming you wanted to say that OpenVZ
> >>> has remained on a kernel which was still on opt-out kmem accounting
> >>> i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
> >>> patches are needed, right?
> >>
> >> Something like this.
> >> Frankly speaking, I do not quite understand which arguments I should provide to upstream
> >> to enable accounting for some new kind of objects.
> >>
> >> OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x Linux kernels)
> >> and we have accounted all required kernel objects by using our own patches.
> >> When memcg was added to upstream, Vladimir Davydov added accounting of some objects
> >> to upstream but skipped some other ones.
> >> Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
> >> "skipped" objects by our own patches just because we accounted such objects before.
> >> We're working on a rebase to newer kernels and we prefer to push our old patches upstream.
> >
> > That is certainly interesting information. But for a changelog it
> > would be more appropriate to provide information about how much memory
> > a user can induce and whether there is any way to limit that memory by
> > other means, how practical those other means are, and which use cases
> > will benefit from the containment.
>
> Right now I would like to understand how I should argue for my requests about
> accounting of new kinds of objects.
>
> What kind of description is enough to enable object accounting?
> Could you please specify some edge rules?
> Should I push such patches through this list?
> Or would it be better to send them to the mailing lists of the corresponding subsystems?
> Should I notify them somehow at least?
>
> "untrusted netadmin inside memcg-limited container can create unlimited number of routing entries, trigger OOM on host that will be unable to find the reason of memory  shortage and  kill huge"
>
> "each mount inside memcg-limited container creates non-accounted mount object,
>  but new mount namespace creation consumes huge piece of non-accounted memory for cloned mounts"
>
> "unprivileged user inside memcg-limited container can create non-accounted multi-page per-thread kernel objects for LDT"
>
> "non-accounted multi-page tty objects can be created from inside memcg-limited container"
>
> "unprivileged user inside memcg-limited container can trigger creation of huge number of non-accounted fasync_struct objects"
>

I think the above reasoning is good enough. Just resend your patches
with the corresponding details.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 8/8] memcg: accounting for ldt_struct objects
       [not found]           ` <360b4c94-8713-f621-1049-6bc0865c1867@virtuozzo.com>
@ 2021-03-15 13:27             ` Borislav Petkov
  2021-03-15 15:48               ` Shakeel Butt
  0 siblings, 1 reply; 21+ messages in thread
From: Borislav Petkov @ 2021-03-15 13:27 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, Michal Hocko, linux-mm, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, Thomas Gleixner, Ingo Molnar,
	x86

On Mon, Mar 15, 2021 at 03:24:01PM +0300, Vasily Averin wrote:
> Unprivileged user inside memcg-limited container can create
> non-accounted multi-page per-thread kernel objects for LDT

I have a hard time parsing this commit message.

And I'm CCed only on patch 8 of what looks like a patchset.

And that patchset is not on lkml so I can't find the rest to read about
it, perhaps linux-mm.

/me goes and finds it on lore

I can see some bits and pieces, this, for example:

https://lore.kernel.org/linux-mm/05c448c7-d992-8d80-b423-b80bf5446d7c@virtuozzo.com/

 ( Btw, that version has your SOB and this patch doesn't even have a
   Signed-off-by. Next time, run your whole set through checkpatch please
   before sending. )

Now, this URL above talks about OOM, ok, that gets me close to the "why"
of this patch.

From a quick look at the ldt.c code, we allow a single LDT struct per
mm. Manpage says so too:

DESCRIPTION
       modify_ldt()  reads  or  writes  the local descriptor table (LDT) for a process.
       The LDT is an array of segment descriptors that can be referenced by user  code.
       Linux  allows  processes  to configure a per-process (actually per-mm) LDT.

We allow

/* Maximum number of LDT entries supported. */
#define LDT_ENTRIES     8192

so there's an upper limit per mm.

Now, please explain what is this accounting for?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
       [not found]           ` <85b5f428-294b-af57-f496-5be5fddeeeea@virtuozzo.com>
@ 2021-03-15 15:13             ` David Ahern
  2021-03-15 15:23             ` Shakeel Butt
  2021-03-15 17:09             ` Jakub Kicinski
  2 siblings, 0 replies; 21+ messages in thread
From: David Ahern @ 2021-03-15 15:13 UTC (permalink / raw)
  To: Vasily Averin, cgroups, Michal Hocko
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Shakeel Butt,
	David S. Miller, Hideaki YOSHIFUJI, David Ahern, Jakub Kicinski

On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
> 
> One such object is the 'struct fib6_node' mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> 
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
> 
> It this case is not enough to simply add SLAB_ACCOUNT to corresponding
> kmem cache. The proper memory cgroup still cannot be found due to the
> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> To be sure that caller is not executed in process contxt
> '!in_task()' check should be used instead
> ---
>  mm/memcontrol.c    | 2 +-
>  net/ipv6/ip6_fib.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/8] memcg: accounting for ip6_dst_cache
       [not found]           ` <8196f732-718e-0465-a39c-62668cc12c2b@virtuozzo.com>
@ 2021-03-15 15:14             ` David Ahern
  0 siblings, 0 replies; 21+ messages in thread
From: David Ahern @ 2021-03-15 15:14 UTC (permalink / raw)
  To: Vasily Averin, cgroups, Michal Hocko
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Shakeel Butt,
	David S. Miller, Hideaki YOSHIFUJI, Jakub Kicinski, David Ahern

On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
> 
> This patch enables accounting for 'struct rt6_info' allocations.
> ---
>  net/ipv6/route.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/8] memcg: accounting for fib_rules
       [not found]           ` <cb893761-cf6e-fa92-3219-712e485259b4@virtuozzo.com>
@ 2021-03-15 15:14             ` David Ahern
  0 siblings, 0 replies; 21+ messages in thread
From: David Ahern @ 2021-03-15 15:14 UTC (permalink / raw)
  To: Vasily Averin, cgroups, Michal Hocko
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Shakeel Butt,
	David S. Miller, David Ahern, Jakub Kicinski, Hideaki YOSHIFUJI

On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
> 
> This patch enables accounting for 'struct fib_rules'
> ---
>  net/core/fib_rules.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/8] memcg: accounting for ip_fib caches
       [not found]           ` <d569bf43-b30a-02af-f7ad-ccc794a50589@virtuozzo.com>
@ 2021-03-15 15:15             ` David Ahern
  0 siblings, 0 replies; 21+ messages in thread
From: David Ahern @ 2021-03-15 15:15 UTC (permalink / raw)
  To: Vasily Averin, cgroups, Michal Hocko
  Cc: linux-mm, Johannes Weiner, Vladimir Davydov, Shakeel Butt,
	David S. Miller, David Ahern, Jakub Kicinski, Hideaki YOSHIFUJI

On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
> 
> This patch enables accounting for ip_fib_alias and ip_fib_trie caches
> ---
>  net/ipv4/fib_trie.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
       [not found]           ` <85b5f428-294b-af57-f496-5be5fddeeeea@virtuozzo.com>
  2021-03-15 15:13             ` [PATCH v2 1/8] memcg: accounting for fib6_nodes cache David Ahern
@ 2021-03-15 15:23             ` Shakeel Butt
  2021-03-15 17:09             ` Jakub Kicinski
  2 siblings, 0 replies; 21+ messages in thread
From: Shakeel Butt @ 2021-03-15 15:23 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Cgroups, Michal Hocko, Linux MM, Johannes Weiner,
	Vladimir Davydov, David S. Miller, Hideaki YOSHIFUJI,
	David Ahern, Jakub Kicinski

On Mon, Mar 15, 2021 at 5:23 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
>
> One such object is the 'struct fib6_node' mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
>
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
>
> It this case is

'In this case it is'

> not enough to simply add SLAB_ACCOUNT to corresponding
> kmem cache. The proper memory cgroup still cannot be found due to the
> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> To be sure that caller is not executed in process contxt

'context'

> '!in_task()' check should be used instead

You missed the signoff and it seems like the whole series is missing
it as well. Please run scripts/checkpatch.pl on the patches before
sending again.

> ---
>  mm/memcontrol.c    | 2 +-
>  net/ipv6/ip6_fib.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 845eec0..568f2cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1076,7 +1076,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>                 return false;
>
>         /* Memcg to charge can't be determined. */
> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))

Can you please also add some explanation in the commit message on the
differences between in_interrupt() and in_task()? Why is
in_interrupt() not correct here but !in_task() is? What about kernels
with or without PREEMPT_COUNT?

>                 return true;
>
>         return false;
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index ef9d022..fa92ed1 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -2445,7 +2445,7 @@ int __init fib6_init(void)
>
>         fib6_node_kmem = kmem_cache_create("fib6_nodes",
>                                            sizeof(struct fib6_node),
> -                                          0, SLAB_HWCACHE_ALIGN,
> +                                          0, SLAB_HWCACHE_ALIGN|SLAB_ACCOUNT,
>                                            NULL);
>         if (!fib6_node_kmem)
>                 goto out;
> --
> 1.8.3.1
>
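
For reference, the in_interrupt()/in_task() distinction asked about above comes down
to which preempt_count bits each macro tests. Roughly, paraphrasing
include/linux/preempt.h of that era (simplified; exact definitions vary by version):

  #define in_interrupt()  (preempt_count() & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_MASK))
  #define in_task()       (!(preempt_count() & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))

  /* local_bh_disable() -- and therefore write_lock_bh() -- adds an offset that
   * lies inside SOFTIRQ_MASK but does not set SOFTIRQ_OFFSET, so plain process
   * context running under a _bh lock makes in_interrupt() return true while
   * in_task() stays true. With the old check, memcg_kmem_bypass() skipped the
   * charge for exactly the fib6_add() path quoted above; the !in_task() form
   * keeps charging in that case while still bypassing real hardirq/softirq/NMI
   * context. */

The hardirq/softirq parts of preempt_count are maintained regardless of
CONFIG_PREEMPT_COUNT, so this behaviour should not depend on the preemption model.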


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 8/8] memcg: accounting for ldt_struct objects
  2021-03-15 13:27             ` [PATCH v2 8/8] memcg: accounting for ldt_struct objects Borislav Petkov
@ 2021-03-15 15:48               ` Shakeel Butt
  2021-03-15 15:58                 ` Michal Hocko
  2021-03-15 15:59                 ` Borislav Petkov
  0 siblings, 2 replies; 21+ messages in thread
From: Shakeel Butt @ 2021-03-15 15:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Vasily Averin, Cgroups, Michal Hocko, Linux MM, Johannes Weiner,
	Vladimir Davydov, Thomas Gleixner, Ingo Molnar, x86

On Mon, Mar 15, 2021 at 6:27 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Mon, Mar 15, 2021 at 03:24:01PM +0300, Vasily Averin wrote:
> > Unprivileged user inside memcg-limited container can create
> > non-accounted multi-page per-thread kernel objects for LDT
>
> I have a hard time parsing this commit message.
>
> And I'm CCed only on patch 8 of what looks like a patchset.
>
> And that patchset is not on lkml so I can't find the rest to read about
> it, perhaps linux-mm.
>
> /me goes and finds it on lore
>
> I can see some bits and pieces, this, for example:
>
> https://lore.kernel.org/linux-mm/05c448c7-d992-8d80-b423-b80bf5446d7c@virtuozzo.com/
>
>  ( Btw, that version has your SOB and this patch doesn't even have a
>    Signed-off-by. Next time, run your whole set through checkpatch please
>    before sending. )
>
> Now, this URL above talks about OOM, ok, that gets me close to the "why"
> of this patch.
>
> From a quick look at the ldt.c code, we allow a single LDT struct per
> mm. Manpage says so too:
>
> DESCRIPTION
>        modify_ldt()  reads  or  writes  the local descriptor table (LDT) for a process.
>        The LDT is an array of segment descriptors that can be referenced by user  code.
>        Linux  allows  processes  to configure a per-process (actually per-mm) LDT.
>
> We allow
>
> /* Maximum number of LDT entries supported. */
> #define LDT_ENTRIES     8192
>
> so there's an upper limit per mm.
>
> Now, please explain what is this accounting for?
>

Let me try to provide the reasoning at least from my perspective.
There are legitimate workloads with hundreds of processes and there
can be hundreds of workloads running on large machines. The
unaccounted memory can cause isolation issues between the workloads
particularly on highly utilized machines.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 5/8] memcg: accounting for fasync_cache
       [not found]           ` <4eb97c88-b87c-6f6e-3960-b1a61b46d380@virtuozzo.com>
@ 2021-03-15 15:56             ` Shakeel Butt
  0 siblings, 0 replies; 21+ messages in thread
From: Shakeel Butt @ 2021-03-15 15:56 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Cgroups, Michal Hocko, Linux MM, Johannes Weiner,
	Vladimir Davydov, Jeff Layton, J. Bruce Fields, Alexander Viro

On Mon, Mar 15, 2021 at 5:23 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> unprivileged user inside memcg-limited container can trigger
> creation of huge number of non-accounted fasync_struct objects

You need to make each patch of this series self-contained by including
the motivation behind the series (just one or two sentences). For
example, for this patch include what's the potential impact of these
huge numbers of unaccounted fasync_struct objects?

> ---
>  fs/fcntl.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index dfc72f1..7941559 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
>                         __FMODE_EXEC | __FMODE_NONOTIFY));
>
>         fasync_cache = kmem_cache_create("fasync_cache",
> -               sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
> +                                        sizeof(struct fasync_struct), 0,
> +                                        SLAB_PANIC | SLAB_ACCOUNT, NULL);
>         return 0;
>  }
>
> --
> 1.8.3.1
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 8/8] memcg: accounting for ldt_struct objects
  2021-03-15 15:48               ` Shakeel Butt
@ 2021-03-15 15:58                 ` Michal Hocko
  2021-03-15 15:59                 ` Borislav Petkov
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2021-03-15 15:58 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Borislav Petkov, Vasily Averin, Cgroups, Linux MM,
	Johannes Weiner, Vladimir Davydov, Thomas Gleixner, Ingo Molnar,
	x86

On Mon 15-03-21 08:48:26, Shakeel Butt wrote:
> On Mon, Mar 15, 2021 at 6:27 AM Borislav Petkov <bp@alien8.de> wrote:
> >
> > On Mon, Mar 15, 2021 at 03:24:01PM +0300, Vasily Averin wrote:
> > > Unprivileged user inside memcg-limited container can create
> > > non-accounted multi-page per-thread kernel objects for LDT
> >
> > I have a hard time parsing this commit message.
> >
> > And I'm CCed only on patch 8 of what looks like a patchset.
> >
> > And that patchset is not on lkml so I can't find the rest to read about
> > it, perhaps linux-mm.
> >
> > /me goes and finds it on lore
> >
> > I can see some bits and pieces, this, for example:
> >
> > https://lore.kernel.org/linux-mm/05c448c7-d992-8d80-b423-b80bf5446d7c@virtuozzo.com/
> >
> >  ( Btw, that version has your SOB and this patch doesn't even have a
> >    Signed-off-by. Next time, run your whole set through checkpatch please
> >    before sending. )
> >
> > Now, this URL above talks about OOM, ok, that gets me close to the "why"
> > of this patch.
> >
> > From a quick look at the ldt.c code, we allow a single LDT struct per
> > mm. Manpage says so too:
> >
> > DESCRIPTION
> >        modify_ldt()  reads  or  writes  the local descriptor table (LDT) for a process.
> >        The LDT is an array of segment descriptors that can be referenced by user  code.
> >        Linux  allows  processes  to configure a per-process (actually per-mm) LDT.
> >
> > We allow
> >
> > /* Maximum number of LDT entries supported. */
> > #define LDT_ENTRIES     8192
> >
> > so there's an upper limit per mm.
> >
> > Now, please explain what is this accounting for?
> >
> 
> Let me try to provide the reasoning at least from my perspective.
> There are legitimate workloads with hundreds of processes and there
> can be hundreds of workloads running on large machines. The
> unaccounted memory can cause isolation issues between the workloads
> particularly on highly utilized machines.

It would be better to be explicit

8192 * 8 = 64kB * number_of_tasks

so realistically this is in the range of low megabytes. Is this worth the
memcg accounting overhead? Maybe yes, but what kind of workloads really
care?

-- 
Michal Hocko
SUSE Labs
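
For completeness, the allocation behind the 64kB figure looks roughly like this,
paraphrasing the pre-series arch/x86/kernel/ldt.c (simplified):

  alloc_size = num_entries * LDT_ENTRY_SIZE;      /* at most 8192 * 8 = 64 KiB per mm */

  if (alloc_size > PAGE_SIZE)
          new_ldt->entries = vzalloc(alloc_size);                 /* multi-page, unaccounted */
  else
          new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL); /* unaccounted */

The patch presumably just switches these allocations to accounted variants
(GFP_KERNEL_ACCOUNT / __GFP_ACCOUNT), so the up-to-64 KiB per mm is charged to the
caller's memcg instead of remaining invisible to it.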


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 8/8] memcg: accounting for ldt_struct objects
  2021-03-15 15:48               ` Shakeel Butt
  2021-03-15 15:58                 ` Michal Hocko
@ 2021-03-15 15:59                 ` Borislav Petkov
  1 sibling, 0 replies; 21+ messages in thread
From: Borislav Petkov @ 2021-03-15 15:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vasily Averin, Cgroups, Michal Hocko, Linux MM, Johannes Weiner,
	Vladimir Davydov, Thomas Gleixner, Ingo Molnar, x86

On Mon, Mar 15, 2021 at 08:48:26AM -0700, Shakeel Butt wrote:
> Let me try to provide the reasoning at least from my perspective.
> There are legitimate workloads with hundreds of processes and there
> can be hundreds of workloads running on large machines. The
> unaccounted memory can cause isolation issues between the workloads
> particularly on highly utilized machines.

Good enough for me, as long as that is part of the commit message.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
       [not found]           ` <85b5f428-294b-af57-f496-5be5fddeeeea@virtuozzo.com>
  2021-03-15 15:13             ` [PATCH v2 1/8] memcg: accounting for fib6_nodes cache David Ahern
  2021-03-15 15:23             ` Shakeel Butt
@ 2021-03-15 17:09             ` Jakub Kicinski
  2021-03-15 19:24               ` Shakeel Butt
  2 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2021-03-15 17:09 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, Michal Hocko, linux-mm, Johannes Weiner,
	Vladimir Davydov, Shakeel Butt, David S. Miller,
	Hideaki YOSHIFUJI, David Ahern

On Mon, 15 Mar 2021 15:23:00 +0300 Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
> 
> One such object is the 'struct fib6_node' mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> 
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
> 
> It this case is not enough to simply add SLAB_ACCOUNT to corresponding
> kmem cache. The proper memory cgroup still cannot be found due to the
> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> To be sure that caller is not executed in process contxt
> '!in_task()' check should be used instead

Sorry for a random question, I didn't get the cover letter. 

What's the overhead of adding SLAB_ACCOUNT?

Please make sure you CC netdev on series which may impact networking.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
  2021-03-15 17:09             ` Jakub Kicinski
@ 2021-03-15 19:24               ` Shakeel Butt
  2021-03-15 19:32                 ` Roman Gushchin
  0 siblings, 1 reply; 21+ messages in thread
From: Shakeel Butt @ 2021-03-15 19:24 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Vasily Averin, Cgroups, Michal Hocko, Linux MM, Johannes Weiner,
	Vladimir Davydov, David S. Miller, Hideaki YOSHIFUJI,
	David Ahern

On Mon, Mar 15, 2021 at 10:09 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 15 Mar 2021 15:23:00 +0300 Vasily Averin wrote:
> > An untrusted netadmin inside a memcg-limited container can create a
> > huge number of routing entries. Currently, allocated kernel objects
> > are not accounted to the proper memcg, so this can lead to a global memory
> > shortage on the host and cause a lot of OOM kills.
> >
> > One such object is the 'struct fib6_node' mostly allocated in
> > net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> >
> >  write_lock_bh(&table->tb6_lock);
> >  err = fib6_add(&table->tb6_root, rt, info, mxc);
> >  write_unlock_bh(&table->tb6_lock);
> >
> > It this case is not enough to simply add SLAB_ACCOUNT to corresponding
> > kmem cache. The proper memory cgroup still cannot be found due to the
> > incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> > To be sure that caller is not executed in process contxt
> > '!in_task()' check should be used instead
>
> Sorry for a random question, I didn't get the cover letter.
>
> What's the overhead of adding SLAB_ACCOUNT?
>

The potential overhead is for MEMCG users where we need to
charge/account each allocation from SLAB_ACCOUNT kmem caches. However
charging is done in batches, so the cost is amortized. If there is a
concern about a specific workload then it would be good to see the
impact of this patch for that workload.
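
To make "charging is done in batches" a bit more concrete, the object-cgroup charge
path looks roughly like this (paraphrased and heavily simplified from mm/memcontrol.c
of that era; exact code varies by version):

  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
  {
          unsigned int nr_pages, nr_bytes;
          int ret;

          if (consume_obj_stock(objcg, size))
                  return 0;       /* common case: only a per-cpu byte budget is decremented */

          /* slow path: charge whole pages against the memcg hierarchy, then park
           * the unused remainder in the per-cpu stock for later allocations */
          nr_pages = size >> PAGE_SHIFT;
          nr_bytes = size & (PAGE_SIZE - 1);
          if (nr_bytes)
                  nr_pages += 1;

          ret = __memcg_kmem_charge(obj_cgroup_memcg(objcg), gfp, nr_pages);
          if (!ret && nr_bytes)
                  refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
          return ret;
  }

So the per-object hot path is a few per-cpu operations; the more expensive
page_counter updates are hit roughly once per page worth of objects.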

> Please make sure you CC netdev on series which may impact networking.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
  2021-03-15 19:24               ` Shakeel Butt
@ 2021-03-15 19:32                 ` Roman Gushchin
  2021-03-15 19:35                   ` Jakub Kicinski
  0 siblings, 1 reply; 21+ messages in thread
From: Roman Gushchin @ 2021-03-15 19:32 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jakub Kicinski, Vasily Averin, Cgroups, Michal Hocko, Linux MM,
	Johannes Weiner, Vladimir Davydov, David S. Miller,
	Hideaki YOSHIFUJI, David Ahern

On Mon, Mar 15, 2021 at 12:24:31PM -0700, Shakeel Butt wrote:
> On Mon, Mar 15, 2021 at 10:09 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 15 Mar 2021 15:23:00 +0300 Vasily Averin wrote:
> > > An untrusted netadmin inside a memcg-limited container can create a
> > > huge number of routing entries. Currently, allocated kernel objects
> > > are not accounted to the proper memcg, so this can lead to a global memory
> > > shortage on the host and cause a lot of OOM kills.
> > >
> > > One such object is the 'struct fib6_node' mostly allocated in
> > > net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> > >
> > >  write_lock_bh(&table->tb6_lock);
> > >  err = fib6_add(&table->tb6_root, rt, info, mxc);
> > >  write_unlock_bh(&table->tb6_lock);
> > >
> > > It this case is not enough to simply add SLAB_ACCOUNT to corresponding
> > > kmem cache. The proper memory cgroup still cannot be found due to the
> > > incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> > > To be sure that caller is not executed in process contxt
> > > '!in_task()' check should be used instead
> >
> > Sorry for a random question, I didn't get the cover letter.
> >
> > What's the overhead of adding SLAB_ACCOUNT?
> >
> 
> The potential overhead is for MEMCG users where we need to
> charge/account each allocation from SLAB_ACCOUNT kmem caches. However
> charging is done in batches, so the cost is amortized. If there is a
> concern about a specific workload then it would be good to see the
> impact of this patch for that workload.
> 
> > Please make sure you CC netdev on series which may impact networking.

In general the overhead is not that big, so I don't think we should argue
too much about every new case where we want to enable the accounting, but
rather focus on those few examples (if any?) where it actually hurts
performance in a meaningful way.

Thanks!


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/8] memcg: accounting for fib6_nodes cache
  2021-03-15 19:32                 ` Roman Gushchin
@ 2021-03-15 19:35                   ` Jakub Kicinski
  0 siblings, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2021-03-15 19:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Vasily Averin, Cgroups, Michal Hocko, Linux MM,
	Johannes Weiner, Vladimir Davydov, David S. Miller,
	Hideaki YOSHIFUJI, David Ahern

On Mon, 15 Mar 2021 12:32:07 -0700 Roman Gushchin wrote:
> On Mon, Mar 15, 2021 at 12:24:31PM -0700, Shakeel Butt wrote:
> > On Mon, Mar 15, 2021 at 10:09 AM Jakub Kicinski <kuba@kernel.org> wrote:  
> > > Sorry for a random question, I didn't get the cover letter.
> > >
> > > What's the overhead of adding SLAB_ACCOUNT?
> > 
> > The potential overhead is for MEMCG users where we need to
> > charge/account each allocation from SLAB_ACCOUNT kmem caches. However
> > charging is done in batches, so the cost is amortized. If there is a
> > concern about a specific workload then it would be good to see the
> > impact of this patch for that workload.
> >   
> > > Please make sure you CC netdev on series which may impact networking.  
> 
> In general the overhead is not that big, so I don't think we should argue
> too much about every new case where we want to enable the accounting, but
> rather focus on those few examples (if any?) where it actually hurts
> performance in a meaningful way.

Ack, no serious concerns about this particular case.

I was expecting you'd have micro benchmark numbers handy so I was
curious to learn what they are, but that appears not to be the case.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread

Thread overview: 21+ messages
2021-03-09  8:03 [PATCH 0/9] memcg accounting from OpenVZ Vasily Averin
2021-03-09 21:12 ` Shakeel Butt
2021-03-10 10:17   ` Vasily Averin
2021-03-10 10:41     ` Michal Hocko
2021-03-11  7:00       ` Vasily Averin
2021-03-11  8:35         ` Michal Hocko
     [not found]           ` <360b4c94-8713-f621-1049-6bc0865c1867@virtuozzo.com>
2021-03-15 13:27             ` [PATCH v2 8/8] memcg: accounting for ldt_struct objects Borislav Petkov
2021-03-15 15:48               ` Shakeel Butt
2021-03-15 15:58                 ` Michal Hocko
2021-03-15 15:59                 ` Borislav Petkov
     [not found]           ` <8196f732-718e-0465-a39c-62668cc12c2b@virtuozzo.com>
2021-03-15 15:14             ` [PATCH v2 2/8] memcg: accounting for ip6_dst_cache David Ahern
     [not found]           ` <cb893761-cf6e-fa92-3219-712e485259b4@virtuozzo.com>
2021-03-15 15:14             ` [PATCH v2 3/8] memcg: accounting for fib_rules David Ahern
     [not found]           ` <d569bf43-b30a-02af-f7ad-ccc794a50589@virtuozzo.com>
2021-03-15 15:15             ` [PATCH v2 4/8] memcg: accounting for ip_fib caches David Ahern
     [not found]           ` <85b5f428-294b-af57-f496-5be5fddeeeea@virtuozzo.com>
2021-03-15 15:13             ` [PATCH v2 1/8] memcg: accounting for fib6_nodes cache David Ahern
2021-03-15 15:23             ` Shakeel Butt
2021-03-15 17:09             ` Jakub Kicinski
2021-03-15 19:24               ` Shakeel Butt
2021-03-15 19:32                 ` Roman Gushchin
2021-03-15 19:35                   ` Jakub Kicinski
     [not found]           ` <4eb97c88-b87c-6f6e-3960-b1a61b46d380@virtuozzo.com>
2021-03-15 15:56             ` [PATCH v2 5/8] memcg: accounting for fasync_cache Shakeel Butt
2021-03-11 15:14         ` [PATCH 0/9] memcg accounting from OpenVZ Shakeel Butt
