* [PATCH] memcg: accounting for objects allocated for new netdevice
@ 2022-04-27 10:37 Vasily Averin
  2022-04-27 14:01 ` Michal Koutný
  0 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-04-27 10:37 UTC (permalink / raw)
  To: Roman Gushchin, Vlastimil Babka, Shakeel Butt
  Cc: kernel, Florian Westphal, linux-kernel, Michal Hocko, cgroups,
	netdev, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Greg Kroah-Hartman, Tejun Heo, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-fsdevel

Creating a new netdevice allocates at least ~50Kb of memory for various
kernel objects, but only ~5Kb of it is accounted to memcg. As a result,
creating an unlimited number of netdevices inside a memcg-limited container
does not fall within memcg restrictions, consumes a significant part
of the host's memory, can cause global OOM and lead to random kills of
host processes.

The main consumers of non-accounted memory are:
 ~10Kb   80+ kernfs nodes
 ~6Kb    ipv6_add_dev() allocations
  6Kb    __register_sysctl_table() allocations
  4Kb    neigh_sysctl_register() allocations
  4Kb    __devinet_sysctl_register() allocations
  4Kb    __addrconf_sysctl_register() allocations

Accounting for these objects raises the share of memcg-accounted memory
to 60-70% (~38Kb accounted vs ~54Kb total for a dummy netdevice on a
typical VM with the default Fedora 35 kernel), which should be enough
to give the host some protection from misuse inside a container.

Other related objects are quite small and can be left unaccounted
to minimize the expected performance degradation.

The ~300 bytes of percpu allocation of struct ipstats_mib in
snmp6_alloc_dev() deserve a separate mention: on huge multi-cpu nodes
this allocation can become the main consumer of memory.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
RFC was discussed here:
https://lore.kernel.org/all/a5e09e93-106d-0527-5b1e-48dbf3b48b4e@virtuozzo.com/
---
 fs/kernfs/mount.c     | 2 +-
 fs/proc/proc_sysctl.c | 2 +-
 net/core/neighbour.c  | 2 +-
 net/ipv4/devinet.c    | 2 +-
 net/ipv6/addrconf.c   | 8 ++++----
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..2881aeeaa880 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,7 +391,7 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 7d9cfc730bd4..df4604fea4f8 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -1333,7 +1333,7 @@ struct ctl_table_header *__register_sysctl_table(
 		nr_entries++;
 
 	header = kzalloc(sizeof(struct ctl_table_header) +
-			 sizeof(struct ctl_node)*nr_entries, GFP_KERNEL);
+			 sizeof(struct ctl_node)*nr_entries, GFP_KERNEL_ACCOUNT);
 	if (!header)
 		return NULL;
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index ec0bf737b076..3dcda2a54f86 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -3728,7 +3728,7 @@ int neigh_sysctl_register(struct net_device *dev, struct neigh_parms *p,
 	char neigh_path[ sizeof("net//neigh/") + IFNAMSIZ + IFNAMSIZ ];
 	char *p_name;
 
-	t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL);
+	t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL_ACCOUNT);
 	if (!t)
 		goto err;
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index fba2bffd65f7..47523fe5b891 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2566,7 +2566,7 @@ static int __devinet_sysctl_register(struct net *net, char *dev_name,
 	struct devinet_sysctl_table *t;
 	char path[sizeof("net/ipv4/conf/") + IFNAMSIZ];
 
-	t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL);
+	t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL_ACCOUNT);
 	if (!t)
 		goto out;
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f908e2fd30b2..e79621ee4a0a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -342,7 +342,7 @@ static int snmp6_alloc_dev(struct inet6_dev *idev)
 {
 	int i;
 
-	idev->stats.ipv6 = alloc_percpu(struct ipstats_mib);
+	idev->stats.ipv6 = alloc_percpu_gfp(struct ipstats_mib, GFP_KERNEL_ACCOUNT);
 	if (!idev->stats.ipv6)
 		goto err_ip;
 
@@ -358,7 +358,7 @@ static int snmp6_alloc_dev(struct inet6_dev *idev)
 	if (!idev->stats.icmpv6dev)
 		goto err_icmp;
 	idev->stats.icmpv6msgdev = kzalloc(sizeof(struct icmpv6msg_mib_device),
-					   GFP_KERNEL);
+					   GFP_KERNEL_ACCOUNT);
 	if (!idev->stats.icmpv6msgdev)
 		goto err_icmpmsg;
 
@@ -382,7 +382,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 	if (dev->mtu < IPV6_MIN_MTU)
 		return ERR_PTR(-EINVAL);
 
-	ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL);
+	ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL_ACCOUNT);
 	if (!ndev)
 		return ERR_PTR(err);
 
@@ -7029,7 +7029,7 @@ static int __addrconf_sysctl_register(struct net *net, char *dev_name,
 	struct ctl_table *table;
 	char path[sizeof("net/ipv6/conf/") + IFNAMSIZ];
 
-	table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL);
+	table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL_ACCOUNT);
 	if (!table)
 		goto out;
 
-- 
2.31.1
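
As a rough cross-check of the figures in the description above, the listed
consumers and the resulting accounted share can be recomputed. This is only
a sketch: the sizes are the approximate values quoted in the message, and the
function names are illustrative, not kernel code.

```c
#include <stddef.h>

/* Approximate sizes (in Kb) of the main non-accounted consumers,
 * copied from the list in the patch description. */
static const unsigned int consumers_kb[] = { 10, 6, 6, 4, 4, 4 };

/* Sum of the listed consumers. */
unsigned int listed_consumers_kb(void)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < sizeof(consumers_kb) / sizeof(consumers_kb[0]); i++)
		sum += consumers_kb[i];
	return sum;
}

/* Accounted share in percent, rounded down; 0 if total_kb is 0. */
unsigned int accounted_share_percent(unsigned int accounted_kb,
				     unsigned int total_kb)
{
	return total_kb ? accounted_kb * 100 / total_kb : 0;
}
```

With the quoted numbers, accounted_share_percent(5, 50) gives 10 and
accounted_share_percent(38, 54) gives 70, the upper end of the 60-70% range.
listed_consumers_kb() gives 34, which together with the ~5Kb that is already
accounted comes close to the ~38Kb figure.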



* Re: [PATCH] memcg: accounting for objects allocated for new netdevice
  2022-04-27 10:37 [PATCH] memcg: accounting for objects allocated for new netdevice Vasily Averin
@ 2022-04-27 14:01 ` Michal Koutný
  2022-04-27 16:52   ` Shakeel Butt
  2022-05-02 19:37   ` kernfs memcg accounting Vasily Averin
  0 siblings, 2 replies; 139+ messages in thread
From: Michal Koutný @ 2022-04-27 14:01 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Vlastimil Babka, Shakeel Butt, kernel,
	Florian Westphal, linux-kernel, Michal Hocko, cgroups, netdev,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Greg Kroah-Hartman,
	Tejun Heo, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	linux-fsdevel

Hello Vasily.

On Wed, Apr 27, 2022 at 01:37:50PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index cfa79715fc1a..2881aeeaa880 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -391,7 +391,7 @@ void __init kernfs_init(void)
>  {
>  	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
>  					      sizeof(struct kernfs_node),
> -					      0, SLAB_PANIC, NULL);
> +					      0, SLAB_PANIC | SLAB_ACCOUNT, NULL);

kernfs accounting, you say?
kernfs also backs cgroups, so the parent-child accounting comes to my
mind.
See the temporary switch to parent memcg in mem_cgroup_css_alloc().

(I mean this makes some sense, but I'd suggest splitting the kernfs change
out into a separate patch for possible discussion of its
not-only-netdevice effects.)

Thanks,
Michal


* Re: [PATCH] memcg: accounting for objects allocated for new netdevice
  2022-04-27 14:01 ` Michal Koutný
@ 2022-04-27 16:52   ` Shakeel Butt
  2022-04-27 22:35     ` Vasily Averin
  2022-05-02 19:37   ` kernfs memcg accounting Vasily Averin
  1 sibling, 1 reply; 139+ messages in thread
From: Shakeel Butt @ 2022-04-27 16:52 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Vasily Averin, Roman Gushchin, Vlastimil Babka, kernel,
	Florian Westphal, LKML, Michal Hocko, Cgroups, netdev,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Greg Kroah-Hartman,
	Tejun Heo, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	linux-fsdevel

On Wed, Apr 27, 2022 at 7:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello Vasily.
>
> On Wed, Apr 27, 2022 at 01:37:50PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index cfa79715fc1a..2881aeeaa880 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -391,7 +391,7 @@ void __init kernfs_init(void)
> >  {
> >       kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
> >                                             sizeof(struct kernfs_node),
> > -                                           0, SLAB_PANIC, NULL);
> > +                                           0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
>
> kernfs accounting, you say?
> kernfs also backs cgroups, so the parent-child accounting comes to my
> mind.
> See the temporary switch to parent memcg in mem_cgroup_css_alloc().
>
> (I mean this makes some sense, but I'd suggest splitting the kernfs change
> out into a separate patch for possible discussion of its
> not-only-netdevice effects.)
>

I agree with Michal that kernfs accounting should be its own patch.
Internally at Google, we actually have enabled the memcg accounting of
kernfs nodes. We have workloads which create 100s of subcontainers and
without memcg accounting of kernfs we see high system overhead.


* Re: [PATCH] memcg: accounting for objects allocated for new netdevice
  2022-04-27 16:52   ` Shakeel Butt
@ 2022-04-27 22:35     ` Vasily Averin
  2022-05-02 12:15       ` [PATCH memcg v2] " Vasily Averin
  0 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-04-27 22:35 UTC (permalink / raw)
  To: Shakeel Butt, Michal Koutný
  Cc: Roman Gushchin, Vlastimil Babka, kernel, Florian Westphal, LKML,
	Michal Hocko, Cgroups, netdev, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Greg Kroah-Hartman, Tejun Heo, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, linux-fsdevel

On 4/27/22 19:52, Shakeel Butt wrote:
> On Wed, Apr 27, 2022 at 7:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>>
>> Hello Vasily.
>>
>> On Wed, Apr 27, 2022 at 01:37:50PM +0300, Vasily Averin <vvs@openvz.org> wrote:
>>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>>> index cfa79715fc1a..2881aeeaa880 100644
>>> --- a/fs/kernfs/mount.c
>>> +++ b/fs/kernfs/mount.c
>>> @@ -391,7 +391,7 @@ void __init kernfs_init(void)
>>>  {
>>>       kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
>>>                                             sizeof(struct kernfs_node),
>>> -                                           0, SLAB_PANIC, NULL);
>>> +                                           0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
>>
>> kernfs accounting, you say?
>> kernfs also backs cgroups, so the parent-child accounting comes to my
>> mind.
>> See the temporary switch to parent memcg in mem_cgroup_css_alloc().
>>
>> (I mean this makes some sense, but I'd suggest splitting the kernfs change
>> out into a separate patch for possible discussion of its
>> not-only-netdevice effects.)
> 
> I agree with Michal that kernfs accounting should be its own patch.
> Internally at Google, we actually have enabled the memcg accounting of
> kernfs nodes. We have workloads which create 100s of subcontainers and
> without memcg accounting of kernfs we see high system overhead.

I had this idea too (i.e. moving kernfs accounting into a separate patch),
but finally decided to include it in the current patch.

Kernfs accounting is critical for the described scenario. Without it, a
typical netdevice creation charges only ~50% of the allocated memory, and
the rest of the patch is not enough to protect the host properly.

Now I'm going to follow your recommendation and split the patch.

Thank you,
	Vasily Averin


* [PATCH memcg v2] memcg: accounting for objects allocated for new netdevice
  2022-04-27 22:35     ` Vasily Averin
@ 2022-05-02 12:15       ` Vasily Averin
  2022-05-04 20:50         ` Luis Chamberlain
                           ` (2 more replies)
  0 siblings, 3 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-02 12:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: kernel, Florian Westphal, linux-kernel, Roman Gushchin,
	Vlastimil Babka, Michal Hocko, cgroups, netdev, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-fsdevel

Creating a new netdevice allocates at least ~50Kb of memory for various
kernel objects, but only ~5Kb of it is accounted to memcg. As a result,
creating an unlimited number of netdevices inside a memcg-limited container
does not fall within memcg restrictions, consumes a significant part
of the host's memory, can cause global OOM and lead to random kills of
host processes.

The main consumers of non-accounted memory are:
 ~10Kb   80+ kernfs nodes
 ~6Kb    ipv6_add_dev() allocations
  6Kb    __register_sysctl_table() allocations
  4Kb    neigh_sysctl_register() allocations
  4Kb    __devinet_sysctl_register() allocations
  4Kb    __addrconf_sysctl_register() allocations

Accounting for these objects raises the share of memcg-accounted memory
to 60-70% (~38Kb accounted vs ~54Kb total for a dummy netdevice on a
typical VM with the default Fedora 35 kernel), which should be enough
to give the host some protection from misuse inside a container.

Other related objects are quite small and can be left unaccounted
to minimize the expected performance degradation.

The ~300 bytes of percpu allocation of struct ipstats_mib in
snmp6_alloc_dev() deserve a separate mention: on huge multi-cpu nodes
this allocation can become the main consumer of memory.

This patch does not enable kernfs accounting, as that affects
other parts of the kernel and should be discussed separately.
However, even without kernfs, this patch significantly improves the
current situation and accounts for more than half
of all netdevice allocations.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
v2: 1) kernfs accounting moved into separate patch, suggested by
    Shakeel and mkoutny@.
    2) in ipv6_add_dev() changed original "sizeof(struct inet6_dev)"
    to "sizeof(*ndev)", according to checkpatch.pl recommendation:
      CHECK: Prefer kzalloc(sizeof(*ndev)...) over kzalloc(sizeof
        (struct inet6_dev)...)
---
 fs/proc/proc_sysctl.c | 2 +-
 net/core/neighbour.c  | 2 +-
 net/ipv4/devinet.c    | 2 +-
 net/ipv6/addrconf.c   | 8 ++++----
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 7d9cfc730bd4..df4604fea4f8 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -1333,7 +1333,7 @@ struct ctl_table_header *__register_sysctl_table(
 		nr_entries++;
 
 	header = kzalloc(sizeof(struct ctl_table_header) +
-			 sizeof(struct ctl_node)*nr_entries, GFP_KERNEL);
+			 sizeof(struct ctl_node)*nr_entries, GFP_KERNEL_ACCOUNT);
 	if (!header)
 		return NULL;
 
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index ec0bf737b076..3dcda2a54f86 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -3728,7 +3728,7 @@ int neigh_sysctl_register(struct net_device *dev, struct neigh_parms *p,
 	char neigh_path[ sizeof("net//neigh/") + IFNAMSIZ + IFNAMSIZ ];
 	char *p_name;
 
-	t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL);
+	t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL_ACCOUNT);
 	if (!t)
 		goto err;
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index fba2bffd65f7..47523fe5b891 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2566,7 +2566,7 @@ static int __devinet_sysctl_register(struct net *net, char *dev_name,
 	struct devinet_sysctl_table *t;
 	char path[sizeof("net/ipv4/conf/") + IFNAMSIZ];
 
-	t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL);
+	t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL_ACCOUNT);
 	if (!t)
 		goto out;
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f908e2fd30b2..290e5e671774 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -342,7 +342,7 @@ static int snmp6_alloc_dev(struct inet6_dev *idev)
 {
 	int i;
 
-	idev->stats.ipv6 = alloc_percpu(struct ipstats_mib);
+	idev->stats.ipv6 = alloc_percpu_gfp(struct ipstats_mib, GFP_KERNEL_ACCOUNT);
 	if (!idev->stats.ipv6)
 		goto err_ip;
 
@@ -358,7 +358,7 @@ static int snmp6_alloc_dev(struct inet6_dev *idev)
 	if (!idev->stats.icmpv6dev)
 		goto err_icmp;
 	idev->stats.icmpv6msgdev = kzalloc(sizeof(struct icmpv6msg_mib_device),
-					   GFP_KERNEL);
+					   GFP_KERNEL_ACCOUNT);
 	if (!idev->stats.icmpv6msgdev)
 		goto err_icmpmsg;
 
@@ -382,7 +382,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 	if (dev->mtu < IPV6_MIN_MTU)
 		return ERR_PTR(-EINVAL);
 
-	ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL);
+	ndev = kzalloc(sizeof(*ndev), GFP_KERNEL_ACCOUNT);
 	if (!ndev)
 		return ERR_PTR(err);
 
@@ -7029,7 +7029,7 @@ static int __addrconf_sysctl_register(struct net *net, char *dev_name,
 	struct ctl_table *table;
 	char path[sizeof("net/ipv6/conf/") + IFNAMSIZ];
 
-	table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL);
+	table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL_ACCOUNT);
 	if (!table)
 		goto out;
 
-- 
2.31.1
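
The claim that more than half of the allocations are covered even without
kernfs can be checked with simple arithmetic, as can the growth of the percpu
struct ipstats_mib allocation with the CPU count. A sketch under assumptions:
it combines the ~5Kb baseline with the non-kernfs entries of the list, which
is one reading of the description rather than a measurement, and it takes
~300 bytes as the per-CPU size. The helper names are hypothetical.

```c
/* Share (percent, rounded down) accounted without kernfs nodes:
 * ~5Kb already accounted plus the non-kernfs consumers from the list
 * (6 + 6 + 4 + 4 + 4 Kb), against ~54Kb total for a dummy netdevice. */
unsigned int share_without_kernfs_percent(void)
{
	const unsigned int already_accounted_kb = 5;
	const unsigned int newly_accounted_kb = 6 + 6 + 4 + 4 + 4;
	const unsigned int total_kb = 54;

	return (already_accounted_kb + newly_accounted_kb) * 100 / total_kb;
}

/* Total bytes of a percpu object: one copy per CPU. */
unsigned long percpu_total_bytes(unsigned long bytes_per_cpu,
				 unsigned int nr_cpus)
{
	return bytes_per_cpu * nr_cpus;
}
```

share_without_kernfs_percent() gives 53, in line with "more than half".
percpu_total_bytes(300, 256) gives 76800 bytes, which on a 256-CPU machine
would already exceed the ~54Kb quoted for the rest of a dummy netdevice.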



* kernfs memcg accounting
  2022-04-27 14:01 ` Michal Koutný
  2022-04-27 16:52   ` Shakeel Butt
@ 2022-05-02 19:37   ` Vasily Averin
  2022-05-02 21:22     ` Michal Koutný
  1 sibling, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-02 19:37 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Roman Gushchin, Vlastimil Babka, Shakeel Butt, kernel,
	Florian Westphal, linux-kernel, Michal Hocko, cgroups,
	Greg Kroah-Hartman, Tejun Heo

On 4/27/22 17:01, Michal Koutný wrote:
> Hello Vasily.
> 
> On Wed, Apr 27, 2022 at 01:37:50PM +0300, Vasily Averin <vvs@openvz.org> wrote:
>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> index cfa79715fc1a..2881aeeaa880 100644
>> --- a/fs/kernfs/mount.c
>> +++ b/fs/kernfs/mount.c
>> @@ -391,7 +391,7 @@ void __init kernfs_init(void)
>>  {
>>  	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
>>  					      sizeof(struct kernfs_node),
>> -					      0, SLAB_PANIC, NULL);
>> +					      0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
> 
> kernfs accounting, you say?
> kernfs also backs cgroups, so the parent-child accounting comes to my
> mind.
> See the temporary switch to parent memcg in mem_cgroup_css_alloc().

Dear Michal,
I did not understand your statement. Could you please explain it in more detail?

I see that cgroup_mkdir()->cgroup_create() creates a new kernfs node for the
new sub-directory, and with my patch the memory of this kernfs node is
accounted to the memcg of the current process.
Do you think this is incorrect, and the new kernfs node should be accounted
to the memcg of the parent cgroup, as mem_cgroup_css_alloc()->mem_cgroup_alloc() does?
Or perhaps you mean that in this case kernfs should not be accounted at all,
like almost all neighboring allocations?

Thank you,
	Vasily Averin
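
The two charging targets discussed here (the memcg of the current process
versus the memcg of the parent cgroup) can be illustrated with a small
userspace model. This is a toy sketch, not kernel code: all toy_* names are
hypothetical, the override only mimics the idea behind the temporary switch
in mem_cgroup_css_alloc(), and real memcg charging is hierarchical and far
more involved.

```c
#include <stddef.h>

/* Toy stand-in for a memory cgroup: just a byte counter. */
struct toy_memcg {
	size_t charged;
};

/* Toy stand-in for a task: its own memcg plus an optional override. */
struct toy_task {
	struct toy_memcg *memcg;        /* memcg the task belongs to */
	struct toy_memcg *active_memcg; /* temporary override, or NULL */
};

/* Install an override and return the previous one. */
struct toy_memcg *toy_set_active(struct toy_task *t, struct toy_memcg *m)
{
	struct toy_memcg *old = t->active_memcg;

	t->active_memcg = m;
	return old;
}

/* An "accounted allocation": charge the override if one is set,
 * otherwise the task's own memcg. */
void toy_charge(struct toy_task *t, size_t bytes)
{
	struct toy_memcg *target = t->active_memcg ? t->active_memcg : t->memcg;

	target->charged += bytes;
}

/* Charge the parent of the new cgroup by switching the active memcg
 * around the allocation and restoring it afterwards. */
void toy_charge_parent(struct toy_task *t, struct toy_memcg *parent,
		       size_t bytes)
{
	struct toy_memcg *old = toy_set_active(t, parent);

	toy_charge(t, bytes);
	toy_set_active(t, old);
}
```

Calling toy_charge() directly models charging the creator; wrapping it in
toy_charge_parent() models charging the parent while leaving the task's own
counter untouched.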


* Re: kernfs memcg accounting
  2022-05-02 19:37   ` kernfs memcg accounting Vasily Averin
@ 2022-05-02 21:22     ` Michal Koutný
  2022-05-04  9:00       ` Vasily Averin
  0 siblings, 1 reply; 139+ messages in thread
From: Michal Koutný @ 2022-05-02 21:22 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Vlastimil Babka, Shakeel Butt, kernel,
	Florian Westphal, linux-kernel, Michal Hocko, cgroups,
	Greg Kroah-Hartman, Tejun Heo

Hello.

On Mon, May 02, 2022 at 10:37:49PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> I did not understand your statement. Could you please explain it in more detail?

Sure, let me expand my perhaps ambiguous and indefinite sentence.

> I see that cgroup_mkdir()->cgroup_create() creates a new kernfs node for the
> new sub-directory, and with my patch the memory of this kernfs node is
> accounted to the memcg of the current process.

Indeed. The variants I'm comparing here are: a) charge to the creator's
memcg, b) charge to the parent (memcg ancestor) of created cgroup.

When struct mem_cgroup charging was introduced, there was a similar
discussion [1].

I can see the following aspects here:
1) absolute size of kernfs_objects,
2) practical difference between a) and b),
3) consistency with memcg,
4) v1 vs v2 behavior.

Ad 1) -- normally, I'd treat this as negligible (~120B struct
kernfs_node * ~10 of them per subsys * ~10 subsystems ~ 12
KB/cgroup). But I guess the point of this change is exploitative users
where this doesn't hold [2], so absolute size is not so important.

Ad 2) -- in typical workloads, only top-level cgroups are created by
some management entity and lower levels are managed from within, i.e.
there is little difference in whom to charge for the created objects.

Ad 3) -- struct mem_cgroup objects are charged to their hierarchical
parent, so that dying memcgs can be associated with a subtree, which is
where the reclaim can deal with them (in contrast with the creator's cgroup).

Now, if I'm looking correctly, the kernfs_node objects are not pinned by
any residual state (subsystems kill_css()->css_clear_dir() synchronously
from rmdir, cgroup itself may be RCU delayed). So the memcg argument
remains purely for consistency (but no practical reason).

Ad 4) -- the variant b) becomes slightly awkward when mkdir'ing a cgroup
in a non-memcg hierarchy (bubbles up to root, despite creator in a
non-root memcg).

How does this reasoning align with your original intention of net
devices accounting? (Are the creators of net devices inside the
container?)


> Do you think this is incorrect, and the new kernfs node should be accounted
> to the memcg of the parent cgroup, as mem_cgroup_css_alloc()->mem_cgroup_alloc() does?

I don't think either variant is incorrect. I'd very much prefer the
consistency with memcg behavior (variant b)) but given the arguments
listed above, it seems such consistency can't be easily justified.


> Or perhaps you mean that in this case kernfs should not be accounted at all,
> like almost all neighboring allocations?

No, I think it wouldn't help here [2]. (Or which neighboring allocations
do you mean? There must be at least nr_cgroups of them.)

(Of course, then there's the traditional performance argument: cgroup's
kernfs_node objects shouldn't be problematic, and although I can't judge
the others (sysfs), in my eyes that's nothing to prevent any form of
kernfs_node accounting going forward.)

HTH,
Michal

[1] https://lore.kernel.org/all/20200729171039.GA22229@blackbody.suse.cz/
[2] Unless this could be constrained by something even bigger and
accounted. But only struct mem_cgroup (recursively its percpu stats)
comes to my mind.
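
The estimate in "Ad 1)" can be written out as a small calculation. This is a
sketch using the approximate figures from the message (~120 bytes per
kernfs_node, ~10 nodes per subsystem, ~10 subsystems); the function names are
hypothetical and the result is an order-of-magnitude figure only.

```c
/* Estimated kernfs_node bytes for one cgroup directory. */
unsigned long long kernfs_bytes_per_cgroup(unsigned long long node_bytes,
					   unsigned int nodes_per_subsys,
					   unsigned int nr_subsys)
{
	return node_bytes * nodes_per_subsys * nr_subsys;
}

/* Number of cgroups needed to consume at least target_bytes
 * (rounded up); returns 0 if per_cgroup_bytes is 0. */
unsigned long long cgroups_to_consume(unsigned long long target_bytes,
				      unsigned long long per_cgroup_bytes)
{
	if (!per_cgroup_bytes)
		return 0;
	return (target_bytes + per_cgroup_bytes - 1) / per_cgroup_bytes;
}
```

kernfs_bytes_per_cgroup(120, 10, 10) gives 12000 bytes, the ~12 KB/cgroup
mentioned above. At that rate it takes on the order of 90 thousand cgroups to
pin 1 GiB in kernfs nodes alone, which is why the absolute size matters only
for users who create cgroups at that scale.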



* Re: kernfs memcg accounting
  2022-05-02 21:22     ` Michal Koutný
@ 2022-05-04  9:00       ` Vasily Averin
  2022-05-04 14:10         ` Michal Koutný
  2022-05-11  3:06         ` Roman Gushchin
  0 siblings, 2 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-04  9:00 UTC (permalink / raw)
  To: Michal Koutný, Roman Gushchin
  Cc: Vlastimil Babka, Shakeel Butt, kernel, Florian Westphal,
	linux-kernel, Michal Hocko, cgroups, Greg Kroah-Hartman,
	Tejun Heo

On 5/3/22 00:22, Michal Koutný wrote:
> When struct mem_cgroup charging was introduced, there was a similar
> discussion [1].

Thank you, I had missed this patch; it was very interesting and useful.
I would note though, that OpenVZ and LXC have another use case:
we have separate and independent systemd instances inside OS containers.
So a container's cgroups are created not in the host's root memcg but
inside the accountable container's root memcg.

> I can see the following aspects here:
> 1) absolute size of kernfs_objects,
> 2) practical difference between a) and b),
> 3) consistency with memcg,
> 4) v1 vs v2 behavior.
...
> How does this reasoning align with your original intention of net
> devices accounting? (Are the creators of net devices inside the
> container?)

It is possible to create netdevices in one namespace/container
and then move them to another one, and this possibility is widely used.
With my patch, the memory allocated for these devices will not be accounted
to the new memcg; however, I do not think it is a problem.
My patches mostly protect the host from misuse, when someone creates
a huge number of netdevices inside a container.

>> Do you think this is incorrect, and the new kernfs node should be accounted
>> to the memcg of the parent cgroup, as mem_cgroup_css_alloc()->mem_cgroup_alloc() does?
> 
> I don't think either variant is incorrect. I'd very much prefer the
> consistency with memcg behavior (variant b)) but given the arguments
> listed above, it seems such consistency can't be easily justified.

From my point of view, the most important thing is to account the allocated
memory to some cgroup inside the container. Selecting the proper memcg is a
secondary goal here.
Frankly speaking, I do not see a big difference between the memcg of the
current process, the memcg of the newly created child and the memcg of its parent.

As far as I understand, Roman chose the parent memcg because it was a special
case of creating a new memory group. He temporarily changed the active memcg
in mem_cgroup_css_alloc() and properly accounted all required memcg-specific
allocations.
However, he did not account the rather large struct mem_cgroup itself,
so I think we need not worry about the 128 bytes of a kernfs node.
Yes, it will be accounted to some other memcg, but the same thing
happens with kernfs nodes of other groups.
I don't think that's a problem.

>> Or perhaps you mean that in this case kernfs should not be accounted at all,
>> like almost all neighboring allocations?
> 
> No, I think it wouldn't help here [2]. (Or which neighboring allocations
> do you mean? There must be at least nr_cgroups of them.)

Primarily I mean the struct mem_cgroup allocation in mem_cgroup_alloc().
However, I think we need to take into account all other allocations made
inside cgroup_mkdir(): struct cgroup and the kernfs node in the common part, and
any cgroup-specific allocations in the other .css_alloc functions.
They can all be called from inside a container, allocate non-accounted
memory and in this way can theoretically be misused.
So I'm going to check this scenario a bit later.

Thank you,
	Vasily Averin


* Re: kernfs memcg accounting
  2022-05-04  9:00       ` Vasily Averin
@ 2022-05-04 14:10         ` Michal Koutný
  2022-05-04 21:16           ` Vasily Averin
  2022-05-11  3:06         ` Roman Gushchin
  1 sibling, 1 reply; 139+ messages in thread
From: Michal Koutný @ 2022-05-04 14:10 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Vlastimil Babka, Shakeel Butt, kernel,
	Florian Westphal, linux-kernel, Michal Hocko, cgroups,
	Greg Kroah-Hartman, Tejun Heo

On Wed, May 04, 2022 at 12:00:18PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> My patches mostly protect the host from misuse, when someone creates
> a huge number of netdevices inside a container.

Understood.

> Frankly speaking, I do not see a big difference between the memcg of the
> current process, the memcg of the newly created child and the memcg of its parent.

I agree that's not a substantial difference. It's relevant for outer
entities "injecting" something into a subtree.
As I wrote previously, charging to the creator in the generic case is
sensible.

> As far as I understand, Roman chose the parent memcg because it was a special
> case of creating a new memory group. He temporarily changed the active memcg
> in mem_cgroup_css_alloc() and properly accounted all required memcg-specific
> allocations.

> However, he did not account the rather large struct mem_cgroup itself,
> so I think we need not worry about the 128 bytes of a kernfs node.

Are you referring to the current code (>= v5.18-rc2)? All big structs
related to mem_cgroup should be accounted. What is ignored?

> Primarily I mean the struct mem_cgroup allocation in mem_cgroup_alloc().

Just note that the memory controller may not always be enabled, so
cgroup_mkdir != mem_cgroup_alloc().

> However, I think we need to take into account all other allocations made
> inside cgroup_mkdir(): struct cgroup and the kernfs node in the common part, and
> any cgroup-specific allocations in the other .css_alloc functions.
> They can all be called from inside a container, allocate non-accounted
> memory and in this way can theoretically be misused.

Also note that (if you're purely on the unified hierarchy) you can protect
against that with cgroup.max.descendants and cgroup.max.depth.

Thanks,
Michal
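
The two cgroup v2 control files named above take a plain decimal value, so
setting them amounts to writing a number into a file in the cgroup's
directory. A minimal userspace sketch, assuming a unified hierarchy mounted
at /sys/fs/cgroup and a hypothetical container cgroup named "ct1"; the helper
names are made up for illustration.

```c
#include <stdio.h>

/* Build "<cgroup_dir>/<file>" into buf.
 * Returns 0 on success, -1 on error or truncation. */
int cgroup_file_path(char *buf, size_t len,
		     const char *cgroup_dir, const char *file)
{
	int n = snprintf(buf, len, "%s/%s", cgroup_dir, file);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a decimal limit into a cgroup control file such as
 * cgroup.max.descendants or cgroup.max.depth.
 * Returns 0 on success, -1 on any failure. */
int cgroup_write_limit(const char *cgroup_dir, const char *file,
		       unsigned long limit)
{
	char path[4096];
	FILE *f;
	int ok;

	if (cgroup_file_path(path, sizeof(path), cgroup_dir, file) < 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%lu\n", limit) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

For example, cgroup_write_limit("/sys/fs/cgroup/ct1", "cgroup.max.descendants",
100) would cap the number of descendant cgroups under the assumed "ct1"
cgroup, provided the caller has write access to that file.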


* Re: [PATCH memcg v2] memcg: accounting for objects allocated for new netdevice
  2022-05-02 12:15       ` [PATCH memcg v2] " Vasily Averin
@ 2022-05-04 20:50         ` Luis Chamberlain
  2022-05-05  3:50         ` patchwork-bot+netdevbpf
  2022-05-11  2:51         ` Roman Gushchin
  2 siblings, 0 replies; 139+ messages in thread
From: Luis Chamberlain @ 2022-05-04 20:50 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, kernel, Florian Westphal, linux-kernel,
	Roman Gushchin, Vlastimil Babka, Michal Hocko, cgroups, netdev,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Kees Cook,
	Iurii Zaikin, linux-fsdevel

On Mon, May 02, 2022 at 03:15:51PM +0300, Vasily Averin wrote:
> Creating a new netdevice allocates at least ~50Kb of memory for various
> kernel objects, but only ~5Kb of it is accounted to memcg. As a result,
> creating an unlimited number of netdevices inside a memcg-limited container
> does not fall within memcg restrictions, consumes a significant part
> of the host's memory, can cause global OOM and lead to random kills of
> host processes.
> 
> The main consumers of non-accounted memory are:
>  ~10Kb   80+ kernfs nodes
>  ~6Kb    ipv6_add_dev() allocations
>   6Kb    __register_sysctl_table() allocations
>   4Kb    neigh_sysctl_register() allocations
>   4Kb    __devinet_sysctl_register() allocations
>   4Kb    __addrconf_sysctl_register() allocations
> 
> Accounting for these objects raises the share of memcg-accounted memory
> to 60-70% (~38Kb accounted vs ~54Kb total for a dummy netdevice on a
> typical VM with the default Fedora 35 kernel), which should be enough
> to give the host some protection from misuse inside a container.
> 
> Other related objects are quite small and can be left unaccounted
> to minimize the expected performance degradation.
> 
> The ~300 bytes of percpu allocation of struct ipstats_mib in
> snmp6_alloc_dev() deserve a separate mention: on huge multi-cpu nodes
> this allocation can become the main consumer of memory.
> 
> This patch does not enable kernfs accounting, as that affects
> other parts of the kernel and should be discussed separately.
> However, even without kernfs, this patch significantly improves the
> current situation and accounts for more than half
> of all netdevice allocations.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> v2: 1) kernfs accounting moved into separate patch, suggested by
>     Shakeel and mkoutny@.
>     2) in ipv6_add_dev() changed original "sizeof(struct inet6_dev)"
>     to "sizeof(*ndev)", according to checkpatch.pl recommendation:
>       CHECK: Prefer kzalloc(sizeof(*ndev)...) over kzalloc(sizeof
>         (struct inet6_dev)...)
> ---
>  fs/proc/proc_sysctl.c | 2 +-

for proc_sysctl:

Acked-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis


* Re: kernfs memcg accounting
  2022-05-04 14:10         ` Michal Koutný
@ 2022-05-04 21:16           ` Vasily Averin
  2022-05-05  9:47             ` Michal Koutný
  0 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-04 21:16 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Roman Gushchin, Vlastimil Babka, Shakeel Butt, kernel,
	Florian Westphal, linux-kernel, Michal Hocko, cgroups,
	Greg Kroah-Hartman, Tejun Heo

On 5/4/22 17:10, Michal Koutný wrote:
> On Wed, May 04, 2022 at 12:00:18PM +0300, Vasily Averin <vvs@openvz.org> wrote:
>> As far as I understand, Roman chose the parent memcg because it was a special
>> case of creating a new memory group. He temporarily changed the active memcg
>> in mem_cgroup_css_alloc() and properly accounted all required memcg-specific
>> allocations.
> 
>> However, he did not account the rather large struct mem_cgroup itself,
>> so I think we need not worry about the 128 bytes of a kernfs node.
> 
> Are you referring to the current code (>= v5.18-rc2)? All big structs
> related to mem_cgroup should be accounted. What is ignored?

mm/memcontrol.c:
static struct mem_cgroup *mem_cgroup_alloc(void)
{
        struct mem_cgroup *memcg;
...
        memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL);

I think it should allocate at least 2 pages.

>> Primary I mean here struct mem_cgroup allocation in mem_cgroup_alloc().
> 
> Just note that memory controller may not be always enabled so
> cgroup_mkdir != mem_cgroup_alloc().

However, if cgroup_mkdir() calls mem_cgroup_alloc(), it correctly accounts the
huge percpu allocations but ignores the neighbouring multi-page allocation.

>> However, I think we need to take into account any other allocations made
>> inside cgroup_mkdir: struct cgroup and the kernfs node in the common part, and
>> any other cgroup-specific allocations in other .css_alloc functions.
>> They can all be called from inside a container, allocate non-accounted
>> memory and in this way can theoretically be misused.
> 
> Also note that (if you're purely on unified hierarchy) you can protect
> against that with cgroup.max.descendants and cgroup.max.depth.

In the past OpenVZ had a lot of limits for various resources (for example, we
had a limit for iptables rules), but it was very hard to configure them properly.
Finally we decided to replace all such custom limits with a common memory limit.
It does not matter how many resources a container tries to use as long as
it does not exceed the memory limit.
Such resource limits can be useful, especially to prevent possible misuse.
However, sooner or later there will be a legitimate user who runs into them.
All you can do in this situation is recommend that they increase the limit,
which makes the limit pointless.
Memory limits look much more reasonable and understandable.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH memcg v2] memcg: accounting for objects allocated for new netdevice
  2022-05-02 12:15       ` [PATCH memcg v2] " Vasily Averin
  2022-05-04 20:50         ` Luis Chamberlain
@ 2022-05-05  3:50         ` patchwork-bot+netdevbpf
  2022-05-11  2:51         ` Roman Gushchin
  2 siblings, 0 replies; 139+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-05-05  3:50 UTC (permalink / raw)
  To: Vasily Averin
  Cc: shakeelb, kernel, fw, linux-kernel, roman.gushchin, vbabka,
	mhocko, cgroups, netdev, davem, kuba, pabeni, mcgrof, keescook,
	yzaikin, linux-fsdevel

Hello:

This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 2 May 2022 15:15:51 +0300 you wrote:
> Creating a new netdevice allocates at least ~50Kb of memory for various
> kernel objects, but only ~5Kb of them are accounted to memcg. As a result,
> creating an unlimited number of netdevice inside a memcg-limited container
> does not fall within memcg restrictions, consumes a significant part
> of the host's memory, can cause global OOM and lead to random kills of
> host processes.
> 
> [...]

Here is the summary with links:
  - [memcg,v2] memcg: accounting for objects allocated for new netdevice
    https://git.kernel.org/netdev/net-next/c/425b9c7f51c9

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: kernfs memcg accounting
  2022-05-04 21:16           ` Vasily Averin
@ 2022-05-05  9:47             ` Michal Koutný
  2022-05-06  8:37               ` Vasily Averin
  0 siblings, 1 reply; 139+ messages in thread
From: Michal Koutný @ 2022-05-05  9:47 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Vlastimil Babka, Shakeel Butt, kernel,
	Florian Westphal, linux-kernel, Michal Hocko, cgroups,
	Greg Kroah-Hartman, Tejun Heo

On Thu, May 05, 2022 at 12:16:12AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> I think it should allocate at least 2 pages.

After decoding kmalloc_type(), I agree this falls into a global
(unaccounted) kmalloc_cache.

> However if cgroup_mkdir() calls mem_cgroup_alloc() it correctly account huge percpu
> allocations but ignores neighbour multipage allocation.

So, the spillover is bounded and proportional to the memcg limit (in the
same ratio as these two sizes).
But it may be better to account it properly, especially if it's
a contribution from an offlined mem_cgroup.

Thanks,
Michal

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: kernfs memcg accounting
  2022-05-05  9:47             ` Michal Koutný
@ 2022-05-06  8:37               ` Vasily Averin
  0 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-06  8:37 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Roman Gushchin, Vlastimil Babka, Shakeel Butt, kernel,
	Florian Westphal, linux-kernel, Michal Hocko, cgroups,
	Greg Kroah-Hartman, Tejun Heo

On 5/5/22 12:47, Michal Koutný wrote:
> On Thu, May 05, 2022 at 12:16:12AM +0300, Vasily Averin <vvs@openvz.org> wrote:
>> I think it should allocate at least 2 pages.
> 
> After decoding kmalloc_type(), I agree this falls into a global
> (unaccounted) kmalloc_cache.
> 
>> However if cgroup_mkdir() calls mem_cgroup_alloc() it correctly account huge percpu
>> allocations but ignores neighbour multipage allocation.
> 
> So, the spillover is bounded and proportional to the memcg limit (in the
> same ratio as these two sizes).
> But it may be better to account it properly, especially if it's
> a contribution from an offlined mem_cgroup.

I've traced mkdir /sys/fs/cgroup/vvs.test on a 4-CPU VM with Fedora
and a self-compiled upstream kernel; see the table with results below.
These calculations are not precise: they depend on kernel config options,
the number of CPUs and the enabled controllers, and they ignore possible
page allocations, etc.
However, I think this is enough to clarify the general situation.

Results:
- Total allocated memory is ~60Kb.
- Only the 2 huge percpu allocations marked '=' are accounted, ~18Kb
  (and this can be 0 without the memory controller).
- kernfs nodes and iattrs are among the main memory consumers;
   they are marked '++' to be accounted first.
- cgroup_mkdir always allocates 4Kb,
   so I think it should be accounted first too.
- mem_cgroup_css_alloc allocations consume 10Kb,
   which is enough to be accounted, especially for VMs with 1-2 CPUs.
- Almost all other allocations are quite small and can be ignored.
  The exceptions are the percpu allocations in alloc_fair_sched_group(),
   which can consume a significant amount of memory on nodes
   with multiple processors;
   they are marked '+' and can be accounted later.
- kernfs nodes consume ~6Kb of memory inside simple_xattr_set()
   and simple_xattr_alloc(). This is quite a high number,
   but it is not critical, and I think we can ignore it at the moment.
- If all the proposed memory is accounted, it gives us ~47Kb,
   or ~75% of all allocated memory.

Any comments are welcome.

Thank you,
	Vasily Averin

number	bytes	$1*$2	sum	note	call_site
of	alloc
allocs
------------------------------------------------------------
1       14448   14448   14448   =       percpu_alloc_percpu:
1       8192    8192    22640   ++      (mem_cgroup_css_alloc+0x54)
49      128     6272    28912   ++      (__kernfs_new_node+0x4e)
49      96      4704    33616   ?       (simple_xattr_alloc+0x2c)
49      88      4312    37928   ++      (__kernfs_iattrs+0x56)
1       4096    4096    42024   ++      (cgroup_mkdir+0xc7)
1       3840    3840    45864   =       percpu_alloc_percpu:
4       512     2048    47912   +       (alloc_fair_sched_group+0x166)
4       512     2048    49960   +       (alloc_fair_sched_group+0x139)
1       2048    2048    52008   ++      (mem_cgroup_css_alloc+0x109)
49      32      1568    53576   ?       (simple_xattr_set+0x5b)
2       584     1168    54744           (radix_tree_node_alloc.constprop.0+0x8d)
1       1024    1024    55768           (cpuset_css_alloc+0x30)
1       1024    1024    56792           (alloc_shrinker_info+0x79)
1       768     768     57560           percpu_alloc_percpu:
1       640     640     58200           (sched_create_group+0x1c)
33      16      528     58728           (__kernfs_new_node+0x31)
1       512     512     59240           (pids_css_alloc+0x1b)
1       512     512     59752           (blkcg_css_alloc+0x39)
9       48      432     60184           percpu_alloc_percpu:
13      32      416     60600           (__kernfs_new_node+0x31)
1       384     384     60984           percpu_alloc_percpu:
1       256     256     61240           (perf_cgroup_css_alloc+0x1c)
1       192     192     61432           percpu_alloc_percpu:
1       64      64      61496           (mem_cgroup_css_alloc+0x363)
1       32      32      61528           (ioprio_alloc_cpd+0x39)
1       32      32      61560           (ioc_cpd_alloc+0x39)
1       32      32      61592           (blkcg_css_alloc+0x6b)
1       32      32      61624           (alloc_fair_sched_group+0x52)
1       32      32      61656           (alloc_fair_sched_group+0x2e)
3       8       24      61680           (__kernfs_new_node+0x31)
3       8       24      61704           (alloc_cpumask_var_node+0x1b)
1       24      24      61728           percpu_alloc_percpu:
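
The per-marker totals quoted in the results can be re-derived from the table. The following is a minimal userspace sketch, not kernel code: the rows are copied from the table above, while the struct and function names (alloc_row, sum_by_note, proposed_accounted) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the trace table: allocation count, size in bytes, note marker. */
struct alloc_row {
	unsigned int count;
	unsigned int bytes;
	const char *note;	/* "=", "++", "+", "?" or "" for unmarked rows */
};

/* Rows copied from the table above, in the same order. */
static const struct alloc_row rows[] = {
	{ 1, 14448, "=" }, { 1, 8192, "++" }, { 49, 128, "++" },
	{ 49, 96, "?" },   { 49, 88, "++" },  { 1, 4096, "++" },
	{ 1, 3840, "=" },  { 4, 512, "+" },   { 4, 512, "+" },
	{ 1, 2048, "++" }, { 49, 32, "?" },   { 2, 584, "" },
	{ 1, 1024, "" },   { 1, 1024, "" },   { 1, 768, "" },
	{ 1, 640, "" },    { 33, 16, "" },    { 1, 512, "" },
	{ 1, 512, "" },    { 9, 48, "" },     { 13, 32, "" },
	{ 1, 384, "" },    { 1, 256, "" },    { 1, 192, "" },
	{ 1, 64, "" },     { 1, 32, "" },     { 1, 32, "" },
	{ 1, 32, "" },     { 1, 32, "" },     { 1, 32, "" },
	{ 3, 8, "" },      { 3, 8, "" },      { 1, 24, "" },
};

/* Sum count * bytes over the rows whose marker equals `note`;
 * a NULL `note` sums every row. */
static unsigned long sum_by_note(const char *note)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		if (!note || strcmp(rows[i].note, note) == 0)
			total += (unsigned long)rows[i].count * rows[i].bytes;
	return total;
}

/* Bytes accounted if every '=', '++' and '+' row is charged. */
static unsigned long proposed_accounted(void)
{
	unsigned long total = sum_by_note("=") + sum_by_note("++") +
			      sum_by_note("+");

	/* The proposal can never exceed what was allocated in total. */
	assert(total <= sum_by_note(NULL));
	return total;
}
```

With these rows, sum_by_note(NULL) is 61728 bytes (the last value of the "sum" column), sum_by_note("=") is 18288 (~18Kb) and proposed_accounted() is 47304 (~47Kb), i.e. roughly three quarters of the total.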

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH memcg v2] memcg: accounting for objects allocated for new netdevice
  2022-05-02 12:15       ` [PATCH memcg v2] " Vasily Averin
  2022-05-04 20:50         ` Luis Chamberlain
  2022-05-05  3:50         ` patchwork-bot+netdevbpf
@ 2022-05-11  2:51         ` Roman Gushchin
  2 siblings, 0 replies; 139+ messages in thread
From: Roman Gushchin @ 2022-05-11  2:51 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, kernel, Florian Westphal, linux-kernel,
	Vlastimil Babka, Michal Hocko, cgroups, netdev, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, linux-fsdevel

On Mon, May 02, 2022 at 03:15:51PM +0300, Vasily Averin wrote:
> Creating a new netdevice allocates at least ~50Kb of memory for various
> kernel objects, but only ~5Kb of them are accounted to memcg. As a result,
> creating an unlimited number of netdevice inside a memcg-limited container
> does not fall within memcg restrictions, consumes a significant part
> of the host's memory, can cause global OOM and lead to random kills of
> host processes.
> 
> The main consumers of non-accounted memory are:
>  ~10Kb   80+ kernfs nodes
>  ~6Kb    ipv6_add_dev() allocations
>   6Kb    __register_sysctl_table() allocations
>   4Kb    neigh_sysctl_register() allocations
>   4Kb    __devinet_sysctl_register() allocations
>   4Kb    __addrconf_sysctl_register() allocations
> 
> Accounting of these objects allows to increase the share of memcg-related
> memory up to 60-70% (~38Kb accounted vs ~54Kb total for dummy netdevice
> on typical VM with default Fedora 35 kernel) and this should be enough
> to somehow protect the host from misuse inside container.
> 
> Other related objects are quite small and may not be taken into account
> to minimize the expected performance degradation.
> 
> Separately, it is worth mentioning the ~300 bytes of percpu allocation
> of struct ipstats_mib in snmp6_alloc_dev(); on huge multi-cpu nodes
> it can become the main consumer of memory.
> 
> This patch does not enable kernfs accounting, as it affects
> other parts of the kernel and should be discussed separately.
> However, even without kernfs, this patch significantly improves the
> current situation and allows more than half of all netdevice
> allocations to be taken into account.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> v2: 1) kernfs accounting moved into separate patch, suggested by
>     Shakeel and mkoutny@.
>     2) in ipv6_add_dev() changed original "sizeof(struct inet6_dev)"
>     to "sizeof(*ndev)", according to checkpatch.pl recommendation:
>       CHECK: Prefer kzalloc(sizeof(*ndev)...) over kzalloc(sizeof
>         (struct inet6_dev)...)

It seems it's a bit too late, but just for the record:

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: kernfs memcg accounting
  2022-05-04  9:00       ` Vasily Averin
  2022-05-04 14:10         ` Michal Koutný
@ 2022-05-11  3:06         ` Roman Gushchin
  2022-05-11  6:01           ` Vasily Averin
  2022-05-11 16:34           ` Michal Koutný
  1 sibling, 2 replies; 139+ messages in thread
From: Roman Gushchin @ 2022-05-11  3:06 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Koutný,
	Vlastimil Babka, Shakeel Butt, kernel, Florian Westphal,
	linux-kernel, Michal Hocko, cgroups, Greg Kroah-Hartman,
	Tejun Heo

On Wed, May 04, 2022 at 12:00:18PM +0300, Vasily Averin wrote:
> On 5/3/22 00:22, Michal Koutný wrote:
> > When struct mem_cgroup charging was introduced, there was a similar
> > discussion [1].
> 
> Thank you, I missed this patch; it was very interesting and useful.
> I would note though, that OpenVZ and LXC have another usecase:
> we have separate and independent systemd instances inside OS containers.
> So container's cgroups are created not in host's root memcg but 
> inside accountable container's root memcg.  
> 
> > I can see following aspects here:
> > 1) absolute size of kernfs_objects,
> > 2) practical difference between a) and b),
> > 3) consistency with memcg,
> > 4) v1 vs v2 behavior.
> ...
> > How do these reasonings align with your original intention of net
> > devices accounting? (Are the creators of net devices inside the
> > container?)
> 
> It is possible to create a netdevice in one namespace/container
> and then move it to another one, and this possibility is widely used.
> With my patch, memory allocated by these devices will not be accounted
> to the new memcg; however, I do not think it is a problem.
> My patches mostly protect the host from misuse, when someone creates
> a huge number of netdevices inside a container.
> 
> >> Do you think it is incorrect and new kernfs node should be accounted
> >> to memcg of parent cgroup, as mem_cgroup_css_alloc()-> mem_cgroup_alloc() does?
> > 
> > I don't think either variant is incorrect. I'd very much prefer the
> > consistency with memcg behavior (variant a)) but as I've listed the
> > arguments above, it seems such a consistency can't be easily justified.
> 
> From my point of view it is most important to account allocated memory
> to any cgroup inside container. Select of proper memcg is a secondary goal here.
> Frankly speaking I do not see a big difference between memcg of current process,
> memcg of newly created child and memcg of its parent.
> 
> As far as I understand, Roman chose the parent memcg because it was a special
> case of creating a new memory group. He temporarily changed active memcg
> in mem_cgroup_css_alloc() and properly accounted all required memcg-specific
> allocations.

My primary goal was to apply the memory pressure on memory cgroups with a lot
of (dying) children cgroups. On a multi-cpu machine a memory cgroup structure
is way larger than a page, so a cgroup which looks small can be really large
if we calculate the amount of memory taken by all children memcg internals.

Applying this pressure to another cgroup (e.g. the one which contains systemd)
doesn't help to reclaim any pages which are pinning the dying cgroups.

For other controllers (maybe blkcg aside, idk) it shouldn't matter, because
there is no such problem there.

For consistency reasons I'd suggest to charge all *large* allocations
(e.g. percpu) to the parent cgroup. Small allocations can be ignored.

Thanks!

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: kernfs memcg accounting
  2022-05-11  3:06         ` Roman Gushchin
@ 2022-05-11  6:01           ` Vasily Averin
  2022-05-11 16:49             ` Michal Koutný
  2022-05-11 17:46             ` Roman Gushchin
  2022-05-11 16:34           ` Michal Koutný
  1 sibling, 2 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-11  6:01 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný,
	Vlastimil Babka, Shakeel Butt, kernel, Florian Westphal,
	linux-kernel, Michal Hocko, cgroups, Greg Kroah-Hartman,
	Tejun Heo

On 5/11/22 06:06, Roman Gushchin wrote:
> On Wed, May 04, 2022 at 12:00:18PM +0300, Vasily Averin wrote:
>> From my point of view it is most important to account allocated memory
>> to any cgroup inside container. Select of proper memcg is a secondary goal here.
>> Frankly speaking I do not see a big difference between memcg of current process,
>> memcg of newly created child and memcg of its parent.
>>
>> As far as I understand, Roman chose the parent memcg because it was a special
>> case of creating a new memory group. He temporarily changed active memcg
>> in mem_cgroup_css_alloc() and properly accounted all required memcg-specific
>> allocations.
> 
> My primary goal was to apply the memory pressure on memory cgroups with a lot
> of (dying) children cgroups. On a multi-cpu machine a memory cgroup structure
> is way larger than a page, so a cgroup which looks small can be really large
> if we calculate the amount of memory taken by all children memcg internals.
> 
> Applying this pressure to another cgroup (e.g. the one which contains systemd)
> doesn't help to reclaim any pages which are pinning the dying cgroups.
> 
> For other controllers (maybe blkcg aside, idk) it shouldn't matter, because
> there is no such problem there.
> 
> For consistency reasons I'd suggest to charge all *large* allocations
> (e.g. percpu) to the parent cgroup. Small allocations can be ignored.

In [1] I showed other large allocations:
"
number	bytes	$1*$2	sum	note	call_site
of	alloc
allocs
------------------------------------------------------------
1       14448   14448   14448   =       percpu_alloc_percpu:
1       8192    8192    22640   ++      (mem_cgroup_css_alloc+0x54)
49      128     6272    28912   ++      (__kernfs_new_node+0x4e)
49      96      4704    33616   ?       (simple_xattr_alloc+0x2c)
49      88      4312    37928   ++      (__kernfs_iattrs+0x56)
1       4096    4096    42024   ++      (cgroup_mkdir+0xc7)
1       3840    3840    45864   =       percpu_alloc_percpu:
4       512     2048    47912   +       (alloc_fair_sched_group+0x166)
4       512     2048    49960   +       (alloc_fair_sched_group+0x139)
1       2048    2048    52008   ++      (mem_cgroup_css_alloc+0x109)
"
[1] https://lore.kernel.org/all/1aa4cd22-fcb6-0e8d-a1c6-23661d618864@openvz.org/
=	already accounted
++	to be accounted first
+	to be accounted a bit later

There are no problems with objects allocated in mem_cgroup_alloc();
they will be accounted to the parent's memcg.
However, I do not understand how to handle the other large objects.

We could move the set_active_memcg(parent) call from mem_cgroup_css_alloc()
to cgroup_apply_control_enable() and handle allocations in all .css_alloc()
callbacks.

However, I need to handle allocations called from cgroup_mkdir() too, and
I do not understand well how to do it properly.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: kernfs memcg accounting
  2022-05-11  3:06         ` Roman Gushchin
  2022-05-11  6:01           ` Vasily Averin
@ 2022-05-11 16:34           ` Michal Koutný
  2022-05-11 18:10             ` Roman Gushchin
  1 sibling, 1 reply; 139+ messages in thread
From: Michal Koutný @ 2022-05-11 16:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vasily Averin, Vlastimil Babka, Shakeel Butt, kernel,
	Florian Westphal, linux-kernel, Michal Hocko, cgroups,
	Greg Kroah-Hartman, Tejun Heo

On Tue, May 10, 2022 at 08:06:24PM -0700, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> My primary goal was to apply the memory pressure on memory cgroups with a lot
> of (dying) children cgroups. On a multi-cpu machine a memory cgroup structure
> is way larger than a page, so a cgroup which looks small can be really large
> if we calculate the amount of memory taken by all children memcg internals.
> 
> Applying this pressure to another cgroup (e.g. the one which contains systemd)
> doesn't help to reclaim any pages which are pinning the dying cgroups.

Just a note -- this is another use case: cgroups created from within the
subtree (e.g. a container). I agree that the cgroup-manager/systemd case is
also valid (as dying memcgs may accumulate after a restart).

memcgs, with their retained state and its memory footprint, are special.

> For other controllers (maybe blkcg aside, idk) it shouldn't matter, because
> there is no such problem there.
> 
> For consistency reasons I'd suggest to charge all *large* allocations
> (e.g. percpu) to the parent cgroup. Small allocations can be ignored.

Strictly speaking, this would mean that any controller would have an
implicit dependency on the memory controller (as the io controller
has).
In the extreme case even a controller-less hierarchy would have such a
requirement (for precise kernfs_node accounting).
Such a dependency is not enforceable on v1 (with various topologies of
different hierarchies).
Although, I initially favored the consistency with memory controller too,
I think it's simpler to charge to the creator's memcg to achieve
consistency across v1 and v2 :-) 

Michal

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: kernfs memcg accounting
  2022-05-11  6:01           ` Vasily Averin
@ 2022-05-11 16:49             ` Michal Koutný
  2022-05-11 17:46             ` Roman Gushchin
  1 sibling, 0 replies; 139+ messages in thread
From: Michal Koutný @ 2022-05-11 16:49 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Vlastimil Babka, Shakeel Butt, kernel,
	Florian Westphal, linux-kernel, Michal Hocko, cgroups,
	Greg Kroah-Hartman, Tejun Heo

On Wed, May 11, 2022 at 09:01:40AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> number	bytes	$1*$2	sum	note	call_site
> of	alloc
> allocs
> ------------------------------------------------------------
> 1       14448   14448   14448   =       percpu_alloc_percpu:
> 1       8192    8192    22640   ++      (mem_cgroup_css_alloc+0x54)

This requires just adding GFP_KERNEL_ACCOUNT (no new active memcg
switch).
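
To illustrate the point, here is a userspace toy model, not kernel code. In the kernel, GFP_KERNEL_ACCOUNT is GFP_KERNEL with __GFP_ACCOUNT added; everything named toy_* or TOY_* below, including the flag bit values, is a hypothetical stand-in used only to show that the charge depends on the flag passed at the call site.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flag bits; the real definitions live in the kernel gfp headers. */
#define TOY_GFP_KERNEL		0x1u
#define TOY_GFP_ACCOUNT		0x2u
#define TOY_GFP_KERNEL_ACCOUNT	(TOY_GFP_KERNEL | TOY_GFP_ACCOUNT)

/* Stand-in for a memory cgroup: only tracks the bytes charged to it. */
struct toy_memcg {
	size_t charged;
};

/* Zeroed allocation that charges `cg` only when the ACCOUNT bit is set. */
static void *toy_kzalloc(struct toy_memcg *cg, size_t size, unsigned int flags)
{
	void *p = calloc(1, size);

	if (p && (flags & TOY_GFP_ACCOUNT))
		cg->charged += size;
	return p;
}

/* Allocate `size` bytes with `flags` against a fresh group, free them,
 * and report how many bytes were charged at allocation time. */
static size_t toy_charge_for(size_t size, unsigned int flags)
{
	struct toy_memcg cg = { 0 };
	void *p = toy_kzalloc(&cg, size, flags);

	assert(p != NULL);
	free(p);
	return cg.charged;
}
```

In this model toy_charge_for(8192, TOY_GFP_KERNEL) charges nothing, while the same call with TOY_GFP_KERNEL_ACCOUNT charges the full 8192 bytes; no switch of the active group is involved.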

> 49      128     6272    28912   ++      (__kernfs_new_node+0x4e)
> 49      96      4704    33616   ?       (simple_xattr_alloc+0x2c)
> 49      88      4312    37928   ++      (__kernfs_iattrs+0x56)
> 1       4096    4096    42024   ++      (cgroup_mkdir+0xc7)
> 1       3840    3840    45864   =       percpu_alloc_percpu:
> 4       512     2048    47912   +       (alloc_fair_sched_group+0x166)
> 4       512     2048    49960   +       (alloc_fair_sched_group+0x139)
> 1       2048    2048    52008   ++      (mem_cgroup_css_alloc+0x109)
> "
> [1] https://lore.kernel.org/all/1aa4cd22-fcb6-0e8d-a1c6-23661d618864@openvz.org/
> =	already accounted
> ++	to be accounted first
> +	to be accounted a bit later
> 
> There is no problems with objects allocated in mem_cgroup_alloc(),
> they will be accounted to parent's memcg.
> However I do not understand how to handle other large objects?
> 
> We could move set_active_memcg(parent) call from mem_cgroup_css_alloc() 
> to cgroup_apply_control_enable() and handle allocation in all .css_alloc()
> 
> However I need to handle allocations called from cgroup_mkdir() too and
> badly understand how to do it properly.

If we consent to charge to the creator, the change would be just passing
GFP_ACCOUNT at fewer (right) places, wouldn't it?

Also, my understanding of memcgs is that they're not hermetically tight,
so I think charging just kernfs_nodes (for dirs and files) provides
a sufficient bound.

The exception is the xattrs; my older notes say: "make kernfs simple_xattr kernel
accounted, up to KERNFS_USER_XATTR_SIZE_LIMIT*KERNFS_MAX_USER_XATTRS =
128k * 128 = 16M / inode".
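
The bound in that note is plain arithmetic. A small userspace sketch follows; it takes the two values (128k and 128) from the note itself rather than from kernel headers, and follows the note in multiplying them. The macro and function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values as quoted in the note: 128k size limit and 128 user xattrs. */
#define NOTE_XATTR_SIZE_LIMIT	((size_t)128 << 10)
#define NOTE_MAX_USER_XATTRS	((size_t)128)

/* Upper bound, in bytes, for `nr_inodes` inodes under the note's formula. */
static size_t xattr_bound_bytes(size_t nr_inodes)
{
	size_t per_inode = NOTE_XATTR_SIZE_LIMIT * NOTE_MAX_USER_XATTRS;

	/* Guard against overflow of the multiplication below. */
	assert(nr_inodes <= SIZE_MAX / per_inode);
	return nr_inodes * per_inode;
}
```

For one inode this gives 16M, matching the note; for the 49 kernfs nodes seen in the mkdir trace the same formula gives 784M.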

HTH,
Michal

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: kernfs memcg accounting
  2022-05-11  6:01           ` Vasily Averin
  2022-05-11 16:49             ` Michal Koutný
@ 2022-05-11 17:46             ` Roman Gushchin
  1 sibling, 0 replies; 139+ messages in thread
From: Roman Gushchin @ 2022-05-11 17:46 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Koutný,
	Vlastimil Babka, Shakeel Butt, kernel, Florian Westphal,
	linux-kernel, Michal Hocko, cgroups, Greg Kroah-Hartman,
	Tejun Heo

On Wed, May 11, 2022 at 09:01:40AM +0300, Vasily Averin wrote:
> On 5/11/22 06:06, Roman Gushchin wrote:
> > On Wed, May 04, 2022 at 12:00:18PM +0300, Vasily Averin wrote:
> >> From my point of view it is most important to account allocated memory
> >> to any cgroup inside container. Select of proper memcg is a secondary goal here.
> >> Frankly speaking I do not see a big difference between memcg of current process,
> >> memcg of newly created child and memcg of its parent.
> >>
> >> As far as I understand, Roman chose the parent memcg because it was a special
> >> case of creating a new memory group. He temporarily changed active memcg
> >> in mem_cgroup_css_alloc() and properly accounted all required memcg-specific
> >> allocations.
> > 
> > My primary goal was to apply the memory pressure on memory cgroups with a lot
> > of (dying) children cgroups. On a multi-cpu machine a memory cgroup structure
> > is way larger than a page, so a cgroup which looks small can be really large
> > if we calculate the amount of memory taken by all children memcg internals.
> > 
> > Applying this pressure to another cgroup (e.g. the one which contains systemd)
> > doesn't help to reclaim any pages which are pinning the dying cgroups.
> > 
> > For other controllers (maybe blkcg aside, idk) it shouldn't matter, because
> > there is no such problem there.
> > 
> > For consistency reasons I'd suggest to charge all *large* allocations
> > (e.g. percpu) to the parent cgroup. Small allocations can be ignored.
> 
> I showed in [1] other large allocation:
> "
> number	bytes	$1*$2	sum	note	call_site
> of	alloc
> allocs
> ------------------------------------------------------------
> 1       14448   14448   14448   =       percpu_alloc_percpu:
> 1       8192    8192    22640   ++      (mem_cgroup_css_alloc+0x54)
> 49      128     6272    28912   ++      (__kernfs_new_node+0x4e)
> 49      96      4704    33616   ?       (simple_xattr_alloc+0x2c)
> 49      88      4312    37928   ++      (__kernfs_iattrs+0x56)
> 1       4096    4096    42024   ++      (cgroup_mkdir+0xc7)
> 1       3840    3840    45864   =       percpu_alloc_percpu:
> 4       512     2048    47912   +       (alloc_fair_sched_group+0x166)
> 4       512     2048    49960   +       (alloc_fair_sched_group+0x139)
> 1       2048    2048    52008   ++      (mem_cgroup_css_alloc+0x109)
> "
> [1] https://lore.kernel.org/all/1aa4cd22-fcb6-0e8d-a1c6-23661d618864@openvz.org/
> =	already accounted
> ++	to be accounted first
> +	to be accounted a bit later
> 
> There is no problems with objects allocated in mem_cgroup_alloc(),
> they will be accounted to parent's memcg.
> However I do not understand how to handle other large objects?
> 
> We could move set_active_memcg(parent) call from mem_cgroup_css_alloc() 
> to cgroup_apply_control_enable() and handle allocation in all .css_alloc()
> 
> However I need to handle allocations called from cgroup_mkdir() too and
> badly understand how to do it properly.

I don't think there is a better alternative than just having several
set_active_memcg(parent);...set_active_memcg(old); places in cgroup.c.

Nesting is fine here, so it shouldn't be a big issue.
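
Why nesting works can be shown with a userspace toy model of the save/restore pattern. It assumes only what the pattern above implies, namely that set_active_memcg() returns the previously active group so the caller can put it back. All toy_* names are hypothetical, and the real kernel state lives elsewhere (not in a single global pointer).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a memory cgroup: only tracks the bytes charged to it. */
struct toy_memcg {
	const char *name;
	size_t charged;
};

/* Stand-in for the "active memcg" that accounted allocations charge to. */
static struct toy_memcg *toy_active;

/* Install `memcg` as the active group and return the previous one,
 * so that the caller can restore it afterwards. */
static struct toy_memcg *toy_set_active_memcg(struct toy_memcg *memcg)
{
	struct toy_memcg *old = toy_active;

	toy_active = memcg;
	return old;
}

/* Charge `size` bytes to whichever group is currently active. */
static void toy_charge(size_t size)
{
	if (toy_active)
		toy_active->charged += size;
}

/* Inner helper: switches to `parent` for its own allocation. */
static void toy_css_alloc(struct toy_memcg *parent)
{
	struct toy_memcg *old = toy_set_active_memcg(parent);

	toy_charge(8192);
	toy_set_active_memcg(old);
}

/* Outer helper: also switches to `parent`, then calls the inner one. */
static void toy_mkdir(struct toy_memcg *parent)
{
	struct toy_memcg *old = toy_set_active_memcg(parent);

	toy_charge(4096);
	toy_css_alloc(parent);		/* nested switch and restore */
	assert(toy_active == parent);	/* inner restore put `parent` back */
	toy_charge(128);
	toy_set_active_memcg(old);
}
```

Because each level restores exactly the value it saved, the inner restore returns to the outer caller's choice instead of clearing it, and the outermost restore returns to whatever was active before.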

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: kernfs memcg accounting
  2022-05-11 16:34           ` Michal Koutný
@ 2022-05-11 18:10             ` Roman Gushchin
  2022-05-13 15:51               ` [PATCH 0/4] memcg: accounting for objects allocated by mkdir cgroup Vasily Averin
                                 ` (4 more replies)
  0 siblings, 5 replies; 139+ messages in thread
From: Roman Gushchin @ 2022-05-11 18:10 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Vasily Averin, Vlastimil Babka, Shakeel Butt, kernel,
	Florian Westphal, linux-kernel, Michal Hocko, cgroups,
	Greg Kroah-Hartman, Tejun Heo

On Wed, May 11, 2022 at 06:34:39PM +0200, Michal Koutny wrote:
> On Tue, May 10, 2022 at 08:06:24PM -0700, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > My primary goal was to apply the memory pressure on memory cgroups with a lot
> > of (dying) children cgroups. On a multi-cpu machine a memory cgroup structure
> > is way larger than a page, so a cgroup which looks small can be really large
> > if we calculate the amount of memory taken by all children memcg internals.
> > 
> > Applying this pressure to another cgroup (e.g. the one which contains systemd)
> > doesn't help to reclaim any pages which are pinning the dying cgroups.
> 
> Just a note -- this another usecase of cgroups created from within the
> subtree (e.g. a container). I agree that cgroup-manager/systemd case is
> also valid (as dying memcgs may accumulate after a restart).
> 
> memcgs with their retained state with footprint are special.
> 
> > For other controllers (maybe blkcg aside, idk) it shouldn't matter, because
> > there is no such problem there.
> > 
> > For consistency reasons I'd suggest to charge all *large* allocations
> > (e.g. percpu) to the parent cgroup. Small allocations can be ignored.
> 
> Strictly speaking, this would mean that any controller would have on
> implicit dependency on the memory controller (such as io controller
> has).
> In the extreme case even controller-less hierarchy would have such a
> requirement (for precise kernfs_node accounting).
> Such a dependency is not enforceable on v1 (with various topologies of
> different hierarchies).
>
> Although, I initially favored the consistency with memory controller too,
> I think it's simpler to charge to the creator's memcg to achieve
> consistency across v1 and v2 :-)

Ok, v1/v2 consistency is a valid point.

As I said, I'm fine with both options, it shouldn't matter that much
for anything except the memory controller: cgroup internal objects are not
that large and the total memory footprint is usually small unless we have
a lot of (dying) sub-cgroups. From my experience no other controllers
should be affected (blkcg was affected due to a cgwb reference, but should
be fine now), so it's not an issue at all.

Thanks!

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH 0/4] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-11 18:10             ` Roman Gushchin
@ 2022-05-13 15:51               ` Vasily Averin
  2022-05-13 17:49                 ` Roman Gushchin
  2022-05-13 15:51               ` [PATCH 1/4] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-13 15:51 UTC (permalink / raw)
  To: Roman Gushchin, Shakeel Butt, Michal Koutný
  Cc: kernel, linux-kernel, Vlastimil Babka, Michal Hocko, cgroups

Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
4-CPU VM with Fedora and a self-compiled upstream kernel. The calculations
are not precise: they depend on kernel config options, the number of CPUs
and the enabled controllers, and they ignore possible page allocations, etc.
However, this is enough to clarify the general situation:
- Total allocated memory is ~60Kb.
- Only the 2 huge percpu allocations marked '=' are accounted, ~18Kb
  (and this can be 0 without the memory controller).
- kernfs nodes and iattrs are among the main memory consumers;
   they are marked '+' to be accounted.
- cgroup_mkdir always allocates 4Kb,
   so I think it should be accounted first too.
- mem_cgroup_css_alloc allocations consume 10Kb,
   which is enough to be accounted, especially for VMs with 1-2 CPUs.
- Almost all other allocations are quite small and can be ignored.
  The exceptions are the percpu allocations in alloc_fair_sched_group(),
   which can consume a significant amount of memory on nodes
   with multiple processors.
- kernfs nodes consume ~6Kb of memory inside simple_xattr_set()
   and simple_xattr_alloc(). This is quite a high number,
   but it is not critical, and I think we can ignore it at the moment.
- If all the proposed memory is accounted, it gives us ~47Kb,
   or ~75% of all allocated memory.

number	bytes	$1*$2	sum	note	call_site
of	alloc
allocs
------------------------------------------------------------
1       14448   14448   14448   =       percpu_alloc_percpu:
1       8192    8192    22640   +       (mem_cgroup_css_alloc+0x54)
49      128     6272    28912   +       (__kernfs_new_node+0x4e)
49      96      4704    33616   ?       (simple_xattr_alloc+0x2c)
49      88      4312    37928   +       (__kernfs_iattrs+0x56)
1       4096    4096    42024   +       (cgroup_mkdir+0xc7)
1       3840    3840    45864   =       percpu_alloc_percpu:
4       512     2048    47912   +       (alloc_fair_sched_group+0x166)
4       512     2048    49960   +       (alloc_fair_sched_group+0x139)
1       2048    2048    52008   +       (mem_cgroup_css_alloc+0x109)
49      32      1568    53576   ?       (simple_xattr_set+0x5b)
2       584     1168    54744		(radix_tree_node_alloc.constprop.0+0x8d)
1       1024    1024    55768           (cpuset_css_alloc+0x30)
1       1024    1024    56792           (alloc_shrinker_info+0x79)
1       768     768     57560           percpu_alloc_percpu:
1       640     640     58200           (sched_create_group+0x1c)
33      16      528     58728           (__kernfs_new_node+0x31)
1       512     512     59240           (pids_css_alloc+0x1b)
1       512     512     59752           (blkcg_css_alloc+0x39)
9       48      432     60184           percpu_alloc_percpu:
13      32      416     60600           (__kernfs_new_node+0x31)
1       384     384     60984           percpu_alloc_percpu:
1       256     256     61240           (perf_cgroup_css_alloc+0x1c)
1       192     192     61432           percpu_alloc_percpu:
1       64      64      61496           (mem_cgroup_css_alloc+0x363)
1       32      32      61528           (ioprio_alloc_cpd+0x39)
1       32      32      61560           (ioc_cpd_alloc+0x39)
1       32      32      61592           (blkcg_css_alloc+0x6b)
1       32      32      61624           (alloc_fair_sched_group+0x52)
1       32      32      61656           (alloc_fair_sched_group+0x2e)
3       8       24      61680           (__kernfs_new_node+0x31)
3       8       24      61704           (alloc_cpumask_var_node+0x1b)
1       24      24      61728           percpu_alloc_percpu:
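
As a rough cross-check of the ~47Kb / ~75% estimate, the table can be
tallied by its note column. The sketch below is plain user-space C, not
kernel code. The names struct alloc_row, mkdir_rows and sum_by_note() are
hypothetical and exist only for this illustration, and every row below the
simple_xattr_set line is folded into one unmarked aggregate row
(61728 - 53576 = 8152 bytes).

```c
#include <assert.h>
#include <stddef.h>

/* One table row: number of allocations, bytes per allocation, note column. */
struct alloc_row {
	unsigned long count;
	unsigned long size;
	char note;	/* '=' accounted, '+' proposed, '?' undecided, ' ' ignored */
};

/* Rows copied from the table; the last entry is an aggregate (assumption). */
const struct alloc_row mkdir_rows[] = {
	{  1, 14448, '=' },	/* percpu_alloc_percpu */
	{  1,  8192, '+' },	/* mem_cgroup_css_alloc+0x54 */
	{ 49,   128, '+' },	/* __kernfs_new_node+0x4e */
	{ 49,    96, '?' },	/* simple_xattr_alloc+0x2c */
	{ 49,    88, '+' },	/* __kernfs_iattrs+0x56 */
	{  1,  4096, '+' },	/* cgroup_mkdir+0xc7 */
	{  1,  3840, '=' },	/* percpu_alloc_percpu */
	{  4,   512, '+' },	/* alloc_fair_sched_group+0x166 */
	{  4,   512, '+' },	/* alloc_fair_sched_group+0x139 */
	{  1,  2048, '+' },	/* mem_cgroup_css_alloc+0x109 */
	{ 49,    32, '?' },	/* simple_xattr_set+0x5b */
	{  1,  8152, ' ' },	/* all smaller rows, folded together */
};
const size_t mkdir_nr_rows = sizeof(mkdir_rows) / sizeof(mkdir_rows[0]);

/* Sum count * size over the rows whose note column matches @note. */
unsigned long sum_by_note(const struct alloc_row *rows, size_t nr, char note)
{
	unsigned long sum = 0;
	size_t i;

	assert(rows != NULL);
	for (i = 0; i < nr; i++)
		if (rows[i].note == note)
			sum += rows[i].count * rows[i].size;
	return sum;
}

/* Bytes covered once every '+' row is accounted on top of the '=' rows. */
unsigned long accounted_after_series(const struct alloc_row *rows, size_t nr)
{
	return sum_by_note(rows, nr, '=') + sum_by_note(rows, nr, '+');
}
```

With these rows, '=' sums to 18288 bytes and '+' to 29016 bytes, so 47304
of the 61728 bytes (roughly three quarters) would be accounted.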

This patch-set enables accounting for the required resources.
I would like to discuss the patches with cgroup developers and maintainers,
then I'm going to re-send the approved patches to the subsystem maintainers.

Vasily Averin (4):
  memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  memcg: enable accounting for kernfs nodes and iattrs
  memcg: enable accounting for struct cgroup
  memcg: enable accounting for allocations in alloc_fair_sched_group

 fs/kernfs/mount.c      | 6 ++++--
 kernel/cgroup/cgroup.c | 2 +-
 kernel/sched/fair.c    | 4 ++--
 mm/memcontrol.c        | 4 ++--
 4 files changed, 9 insertions(+), 7 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH 1/4] memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  2022-05-11 18:10             ` Roman Gushchin
  2022-05-13 15:51               ` [PATCH 0/4] memcg: accounting for objects allocated by mkdir cgroup Vasily Averin
@ 2022-05-13 15:51               ` Vasily Averin
  2022-05-19 16:46                 ` Michal Koutný
  2022-05-20  1:07                 ` Shakeel Butt
  2022-05-13 15:51               ` [PATCH 2/4] memcg: enable accounting for kernfs nodes and iattrs Vasily Averin
                                 ` (2 subsequent siblings)
  4 siblings, 2 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-13 15:51 UTC (permalink / raw)
  To: Roman Gushchin, Shakeel Butt, Michal Koutný
  Cc: kernel, linux-kernel, Vlastimil Babka, Michal Hocko, cgroups

cgroup mkdir can be misused inside a memcg-limited container. It can allocate
a lot of host memory without memcg accounting, cause a global memory shortage
and force the OOM killer to kill random host processes.

Below [1] is the result of tracing mkdir /sys/fs/cgroup/test on a VM with 4 CPUs:

number	bytes	$1*$2	sum	note	call_site
of	alloc
allocs
------------------------------------------------------------
1       14448   14448   14448   =       percpu_alloc_percpu:
1       8192    8192    22640           (mem_cgroup_css_alloc+0x54)
49      128     6272    28912           (__kernfs_new_node+0x4e)
49      96      4704    33616           (simple_xattr_alloc+0x2c)
49      88      4312    37928           (__kernfs_iattrs+0x56)
1       4096    4096    42024           (cgroup_mkdir+0xc7)
1       3840    3840    45864   =       percpu_alloc_percpu:
4       512     2048    47912           (alloc_fair_sched_group+0x166)
4       512     2048    49960           (alloc_fair_sched_group+0x139)
1       2048    2048    52008           (mem_cgroup_css_alloc+0x109)
	[smaller objects skipped]
---
Total			61728

'=' --  accounted allocations

This patch enables accounting for one of the main memory hogs in this
experiment: the allocations made inside mem_cgroup_css_alloc().

Signed-off-by: Vasily Averin <vvs@openvz.org>
Link: [1] https://lore.kernel.org/all/1aa4cd22-fcb6-0e8d-a1c6-23661d618864@openvz.org/
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 598fece89e2b..52c6163ba6dc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5031,7 +5031,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
 
-	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, node);
+	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL_ACCOUNT, node);
 	if (!pn)
 		return 1;
 
@@ -5083,7 +5083,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	int __maybe_unused i;
 	long error = -ENOMEM;
 
-	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL);
+	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL_ACCOUNT);
 	if (!memcg)
 		return ERR_PTR(error);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 2/4] memcg: enable accounting for kernfs nodes and iattrs
  2022-05-11 18:10             ` Roman Gushchin
  2022-05-13 15:51               ` [PATCH 0/4] memcg: accounting for objects allocated by mkdir cgroup Vasily Averin
  2022-05-13 15:51               ` [PATCH 1/4] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
@ 2022-05-13 15:51               ` Vasily Averin
  2022-05-19 16:33                 ` Michal Koutný
  2022-05-20  1:12                 ` Shakeel Butt
  2022-05-13 15:52               ` [PATCH 3/4] memcg: enable accounting for struct cgroup Vasily Averin
  2022-05-13 15:52               ` [PATCH 4/4] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
  4 siblings, 2 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-13 15:51 UTC (permalink / raw)
  To: Roman Gushchin, Shakeel Butt, Michal Koutný
  Cc: kernel, linux-kernel, Vlastimil Babka, Michal Hocko, cgroups

kernfs nodes are quite small kernel objects, however there are a few
scenarios where they consume a significant share of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of which are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

It makes sense to enable accounting for kernfs objects, otherwise their
misuse inside a memcg-limited container can lead to global memory shortage,
OOM and random kills of host processes.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 fs/kernfs/mount.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..40e896c7c86b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,10 +391,12 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
 					      sizeof(struct kernfs_iattrs),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 3/4] memcg: enable accounting for struct cgroup
  2022-05-11 18:10             ` Roman Gushchin
                                 ` (2 preceding siblings ...)
  2022-05-13 15:51               ` [PATCH 2/4] memcg: enable accounting for kernfs nodes and iattrs Vasily Averin
@ 2022-05-13 15:52               ` Vasily Averin
  2022-05-19 16:53                 ` Michal Koutný
  2022-05-20  1:31                 ` Shakeel Butt
  2022-05-13 15:52               ` [PATCH 4/4] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
  4 siblings, 2 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-13 15:52 UTC (permalink / raw)
  To: Roman Gushchin, Shakeel Butt, Michal Koutný
  Cc: kernel, linux-kernel, Vlastimil Babka, Michal Hocko, cgroups

Creating each new cgroup allocates 4Kb for struct cgroup. This is the
largest memory allocation in this scenario and is especially important
for small VMs with 1-2 CPUs.

Accounting of this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..7595127c5b3a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5353,7 +5353,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
 	cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 4/4] memcg: enable accounting for allocations in alloc_fair_sched_group
  2022-05-11 18:10             ` Roman Gushchin
                                 ` (3 preceding siblings ...)
  2022-05-13 15:52               ` [PATCH 3/4] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-05-13 15:52               ` Vasily Averin
  2022-05-19 16:45                 ` Michal Koutný
  2022-05-20  1:18                 ` Shakeel Butt
  4 siblings, 2 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-13 15:52 UTC (permalink / raw)
  To: Roman Gushchin, Shakeel Butt, Michal Koutný
  Cc: kernel, linux-kernel, Vlastimil Babka, Michal Hocko, cgroups

Creating each new cpu cgroup allocates two 512-byte kernel objects
per CPU. This is especially important for cgroups sharing a parent memory
cgroup. In this scenario, on nodes with multiple processors, these
allocations become one of the main memory consumers.

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a68482d66535..46e66acf7475 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11529,12 +11529,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
-				      GFP_KERNEL, cpu_to_node(i));
+				      GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!cfs_rq)
 			goto err;
 
 		se = kzalloc_node(sizeof(struct sched_entity_stats),
-				  GFP_KERNEL, cpu_to_node(i));
+				  GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!se)
 			goto err_free_rq;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* Re: [PATCH 0/4] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-13 15:51               ` [PATCH 0/4] memcg: accounting for objects allocated by mkdir cgroup Vasily Averin
@ 2022-05-13 17:49                 ` Roman Gushchin
  2022-05-21 16:37                   ` [PATCH mm v2 0/9] " Vasily Averin
                                     ` (9 more replies)
  0 siblings, 10 replies; 139+ messages in thread
From: Roman Gushchin @ 2022-05-13 17:49 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Michal Koutný,
	kernel, linux-kernel, Vlastimil Babka, Michal Hocko, cgroups

On Fri, May 13, 2022 at 06:51:30PM +0300, Vasily Averin wrote:
> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on 
> 4cpu VM with Fedora and self-compiled upstream kernel. The calculations
> are not precise, it depends on kernel config options, number of cpus,
> enabled controllers, ignores possible page allocations etc.
> However this is enough to clarify the general situation:
> - Total sum of allocated memory is ~60Kb.
> - Accounted only 2 huge percpu allocation marked '=', ~18Kb.
>   (and can be 0 without memory controller)
> - kernfs nodes and iattrs are among the main memory consumers.
>    they are marked '+' to be accounted.
> - cgroup_mkdir always allocates 4Kb,
>    so I think it should be accounted first too.
> - mem_cgroup_css_alloc allocations consumes 10K,
>    it's enough to be accounted, especially for VMs with 1-2 CPUs
> - Almost all other allocations are quite small and can be ignored.
>   Exceptions are percpu allocations in alloc_fair_sched_group(),
>    this can consume a significant amount of memory on nodes
>    with multiple processors.
> - kernfs nodes consumes ~6Kb memory inside simple_xattr_set() 
>    and simple_xattr_alloc(). This is quite high numbers,
>    but is not critical, and I think we can ignore it at the moment.
> - If all proposed memory will be accounted it gives us ~47Kb, 
>    or ~75% of all allocated memory.
> 
> number	bytes	$1*$2	sum	note	call_site
> of	alloc
> allocs
> ------------------------------------------------------------
> 1       14448   14448   14448   =       percpu_alloc_percpu:
> 1       8192    8192    22640   +       (mem_cgroup_css_alloc+0x54)
> 49      128     6272    28912   +       (__kernfs_new_node+0x4e)
> 49      96      4704    33616   ?       (simple_xattr_alloc+0x2c)
> 49      88      4312    37928   +       (__kernfs_iattrs+0x56)
> 1       4096    4096    42024   +       (cgroup_mkdir+0xc7)
> 1       3840    3840    45864   =       percpu_alloc_percpu:
> 4       512     2048    47912   +       (alloc_fair_sched_group+0x166)
> 4       512     2048    49960   +       (alloc_fair_sched_group+0x139)
> 1       2048    2048    52008   +       (mem_cgroup_css_alloc+0x109)
> 49      32      1568    53576   ?       (simple_xattr_set+0x5b)
> 2       584     1168    54744		(radix_tree_node_alloc.constprop.0+0x8d)
> 1       1024    1024    55768           (cpuset_css_alloc+0x30)
> 1       1024    1024    56792           (alloc_shrinker_info+0x79)
> 1       768     768     57560           percpu_alloc_percpu:
> 1       640     640     58200           (sched_create_group+0x1c)
> 33      16      528     58728           (__kernfs_new_node+0x31)
> 1       512     512     59240           (pids_css_alloc+0x1b)
> 1       512     512     59752           (blkcg_css_alloc+0x39)
> 9       48      432     60184           percpu_alloc_percpu:
> 13      32      416     60600           (__kernfs_new_node+0x31)
> 1       384     384     60984           percpu_alloc_percpu:
> 1       256     256     61240           (perf_cgroup_css_alloc+0x1c)
> 1       192     192     61432           percpu_alloc_percpu:
> 1       64      64      61496           (mem_cgroup_css_alloc+0x363)
> 1       32      32      61528           (ioprio_alloc_cpd+0x39)
> 1       32      32      61560           (ioc_cpd_alloc+0x39)
> 1       32      32      61592           (blkcg_css_alloc+0x6b)
> 1       32      32      61624           (alloc_fair_sched_group+0x52)
> 1       32      32      61656           (alloc_fair_sched_group+0x2e)
> 3       8       24      61680           (__kernfs_new_node+0x31)
> 3       8       24      61704           (alloc_cpumask_var_node+0x1b)
> 1       24      24      61728           percpu_alloc_percpu:
> 
> This patch-set enables accounting for required resources.
> I would like to discuss the patches with cgroup developers and maintainers,
> then I'm going re-send approved patches to subsystem maintainers.

Hi Vasily!

Thank you for the analysis and for the series. It looks really good to me.

Please, feel free to add
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
for the whole series.

Thanks!

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 2/4] memcg: enable accounting for kernfs nodes and iattrs
  2022-05-13 15:51               ` [PATCH 2/4] memcg: enable accounting for kernfs nodes and iattrs Vasily Averin
@ 2022-05-19 16:33                 ` Michal Koutný
  2022-05-20  1:12                 ` Shakeel Butt
  1 sibling, 0 replies; 139+ messages in thread
From: Michal Koutný @ 2022-05-19 16:33 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Shakeel Butt, kernel, linux-kernel,
	Vlastimil Babka, Michal Hocko, cgroups

On Fri, May 13, 2022 at 06:51:55PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index cfa79715fc1a..40e896c7c86b 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -391,10 +391,12 @@ void __init kernfs_init(void)
>  {
>  	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
>  					      sizeof(struct kernfs_node),
> -					      0, SLAB_PANIC, NULL);
> +					      0, SLAB_PANIC | SLAB_ACCOUNT,
> +					      NULL);
>  
>  	/* Creates slab cache for kernfs inode attributes */
>  	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
>  					      sizeof(struct kernfs_iattrs),
> -					      0, SLAB_PANIC, NULL);
> +					      0, SLAB_PANIC | SLAB_ACCOUNT,
> +					      NULL);

IIUC your stats report, struct kernfs_iattrs is a mere 88B (per
kernfs_node). When considering attributes, I actually thought of
something like:

--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -950,7 +950,7 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
        if (len < sizeof(*new_xattr))
                return NULL;

-       new_xattr = kvmalloc(len, GFP_KERNEL);
+       new_xattr = kvmalloc(len, GFP_KERNEL_ACCOUNT);
        if (!new_xattr)
                return NULL;

Again, I'd split the patch into two: kernfs_node and
kernfs_iattrs+simple_xattr for easier bisecting.

Thanks,
Michal

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 4/4] memcg: enable accounting for allocations in alloc_fair_sched_group
  2022-05-13 15:52               ` [PATCH 4/4] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
@ 2022-05-19 16:45                 ` Michal Koutný
  2022-05-20  1:18                 ` Shakeel Butt
  1 sibling, 0 replies; 139+ messages in thread
From: Michal Koutný @ 2022-05-19 16:45 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Shakeel Butt, kernel, linux-kernel,
	Vlastimil Babka, Michal Hocko, cgroups

On Fri, May 13, 2022 at 06:52:20PM +0300, Vasily Averin <vvs@openvz.org> wrote:
>  kernel/sched/fair.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Michal Koutný <mkoutny@suse.com>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 1/4] memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  2022-05-13 15:51               ` [PATCH 1/4] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
@ 2022-05-19 16:46                 ` Michal Koutný
  2022-05-20  1:07                 ` Shakeel Butt
  1 sibling, 0 replies; 139+ messages in thread
From: Michal Koutný @ 2022-05-19 16:46 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Shakeel Butt, kernel, linux-kernel,
	Vlastimil Babka, Michal Hocko, cgroups

On Fri, May 13, 2022 at 06:51:41PM +0300, Vasily Averin <vvs@openvz.org> wrote:
>  mm/memcontrol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Michal Koutný <mkoutny@suse.com>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 3/4] memcg: enable accounting for struct cgroup
  2022-05-13 15:52               ` [PATCH 3/4] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-05-19 16:53                 ` Michal Koutný
  2022-05-20  7:24                   ` Vasily Averin
  2022-05-20  1:31                 ` Shakeel Butt
  1 sibling, 1 reply; 139+ messages in thread
From: Michal Koutný @ 2022-05-19 16:53 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Shakeel Butt, kernel, linux-kernel,
	Vlastimil Babka, Michal Hocko, cgroups

On Fri, May 13, 2022 at 06:52:12PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> Creating each new cgroup allocates 4Kb for struct cgroup. This is the
> largest memory allocation in this scenario and is especially important
> for small VMs with 1-2 CPUs.

What do you mean by this argument?

(On bigger irons, the percpu components become dominant, e.g. struct
cgroup_rstat_cpu.)

Thanks,
Michal

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 1/4] memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  2022-05-13 15:51               ` [PATCH 1/4] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
  2022-05-19 16:46                 ` Michal Koutný
@ 2022-05-20  1:07                 ` Shakeel Butt
  1 sibling, 0 replies; 139+ messages in thread
From: Shakeel Butt @ 2022-05-20  1:07 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Michal Koutný,
	kernel, linux-kernel, Vlastimil Babka, Michal Hocko, cgroups

On Fri, May 13, 2022 at 06:51:41PM +0300, Vasily Averin wrote:
> cgroup mkdir can be misused inside memcg limited container. It can allocate
> a lot of host memory without memcg accounting, cause global memory shortage
> and force OOM to kill random host process.
> 
> Below [1] is result of mkdir /sys/fs/cgroup/test tracing on VM with 4 cpus
> 
> number	bytes	$1*$2	sum	note	call_site
> of	alloc
> allocs
> ------------------------------------------------------------
> 1       14448   14448   14448   =       percpu_alloc_percpu:
> 1       8192    8192    22640           (mem_cgroup_css_alloc+0x54)
> 49      128     6272    28912           (__kernfs_new_node+0x4e)
> 49      96      4704    33616           (simple_xattr_alloc+0x2c)
> 49      88      4312    37928           (__kernfs_iattrs+0x56)
> 1       4096    4096    42024           (cgroup_mkdir+0xc7)
> 1       3840    3840    45864   =       percpu_alloc_percpu:
> 4       512     2048    47912           (alloc_fair_sched_group+0x166)
> 4       512     2048    49960           (alloc_fair_sched_group+0x139)
> 1       2048    2048    52008           (mem_cgroup_css_alloc+0x109)
> 	[smaller objects skipped]
> ---
> Total			61728
> 
> '=' --  accounted allocations
> 
> This patch enabled accounting for one of the main memory hogs in this
> experiment: allocation which are called inside mem_cgroup_css_alloc()
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> Link: [1] https://lore.kernel.org/all/1aa4cd22-fcb6-0e8d-a1c6-23661d618864@openvz.org/
> 

Acked-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 2/4] memcg: enable accounting for kernfs nodes and iattrs
  2022-05-13 15:51               ` [PATCH 2/4] memcg: enable accounting for kernfs nodes and iattrs Vasily Averin
  2022-05-19 16:33                 ` Michal Koutný
@ 2022-05-20  1:12                 ` Shakeel Butt
  1 sibling, 0 replies; 139+ messages in thread
From: Shakeel Butt @ 2022-05-20  1:12 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Michal Koutný,
	kernel, linux-kernel, Vlastimil Babka, Michal Hocko, cgroups

On Fri, May 13, 2022 at 06:51:55PM +0300, Vasily Averin wrote:
> kernfs nodes are quite small kernel objects, however there are few
> scenarios where it consumes significant piece of all allocated memory:
> 
> 1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
>    was allocated for 80+ kernfs nodes.
> 
> 2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
>    structures.
> 
> 3) Shakeel Butt reports that Google has workloads which create 100s
>    of subcontainers and they have observed high system overhead
>    without memcg accounting of kernfs.
> 
> It makes sense to enable accounting for kernfs objects, otherwise its
> misuse inside memcg-limited can lead to global memory shortage,
> OOM and random kills of host processes.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Shakeel Butt <shakeelb@google.com>

You can keep the ack if you decide to include simple_xattr_alloc() in
a following version or in a different patch.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 4/4] memcg: enable accounting for allocations in alloc_fair_sched_group
  2022-05-13 15:52               ` [PATCH 4/4] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
  2022-05-19 16:45                 ` Michal Koutný
@ 2022-05-20  1:18                 ` Shakeel Butt
  1 sibling, 0 replies; 139+ messages in thread
From: Shakeel Butt @ 2022-05-20  1:18 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Michal Koutný,
	kernel, linux-kernel, Vlastimil Babka, Michal Hocko, cgroups

On Fri, May 13, 2022 at 06:52:20PM +0300, Vasily Averin wrote:
> Creating of each new cpu cgroup allocates two 512-bytes kernel objects
> per CPU. This is especially important for cgroups shared parent memory
> cgroup. In this scenario, on nodes with multiple processors, these
> allocations become one of the main memory consumers.
> 
> Accounting for this memory helps to avoid misuse inside memcg-limited
> containers.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 3/4] memcg: enable accounting for struct cgroup
  2022-05-13 15:52               ` [PATCH 3/4] memcg: enable accounting for struct cgroup Vasily Averin
  2022-05-19 16:53                 ` Michal Koutný
@ 2022-05-20  1:31                 ` Shakeel Butt
  1 sibling, 0 replies; 139+ messages in thread
From: Shakeel Butt @ 2022-05-20  1:31 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Michal Koutný,
	kernel, LKML, Vlastimil Babka, Michal Hocko, Cgroups

On Fri, May 13, 2022 at 8:52 AM Vasily Averin <vvs@openvz.org> wrote:
>
> Creating each new cgroup allocates 4Kb for struct cgroup. This is the
> largest memory allocation in this scenario and is especially important
> for small VMs with 1-2 CPUs.
>
> Accounting of this memory helps to avoid misuse inside memcg-limited
> containers.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
>  kernel/cgroup/cgroup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index adb820e98f24..7595127c5b3a 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5353,7 +5353,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
>
>         /* allocate the cgroup and its ID, 0 is reserved for the root */
>         cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
> -                      GFP_KERNEL);
> +                      GFP_KERNEL_ACCOUNT);
>         if (!cgrp)
>                 return ERR_PTR(-ENOMEM);
>

Please include cgroup_rstat_cpu as well.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 3/4] memcg: enable accounting for struct cgroup
  2022-05-19 16:53                 ` Michal Koutný
@ 2022-05-20  7:24                   ` Vasily Averin
  2022-05-20 20:16                     ` Vasily Averin
  0 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-20  7:24 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Roman Gushchin, Shakeel Butt, kernel, linux-kernel,
	Vlastimil Babka, Michal Hocko, cgroups

On 5/19/22 19:53, Michal Koutný wrote:
> On Fri, May 13, 2022 at 06:52:12PM +0300, Vasily Averin <vvs@openvz.org> wrote:
>> Creating each new cgroup allocates 4Kb for struct cgroup. This is the
>> largest memory allocation in this scenario and is especially important
>> for small VMs with 1-2 CPUs.
> 
> What do you mean by this argument?
> 
> (On bigger irons, the percpu components becomes dominant, e.g. struct
> cgroup_rstat_cpu.)

Michal, Shakeel,
thank you very much for your feedback, it helps me understand how to improve
the methodology of my accounting analysis.
I considered the general case and looked for the places with the largest memory allocations.
Now I think it would be better to split all the allocations into:
- a common part, done for any cgroup type (i.e. cgroup_mkdir and cgroup_create),
- per-cgroup parts,
and to focus on 2 corner cases: single-CPU VMs and "big irons".
This helps to clarify which allocations are important for accounting and which ones
can be safely ignored.

So right now I'm going to redo the calculations and hope it doesn't take long.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 3/4] memcg: enable accounting for struct cgroup
  2022-05-20  7:24                   ` Vasily Averin
@ 2022-05-20 20:16                     ` Vasily Averin
  2022-05-21  0:55                       ` Roman Gushchin
  2022-05-23 13:52                       ` Michal Koutný
  0 siblings, 2 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-20 20:16 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Roman Gushchin, Shakeel Butt, kernel, linux-kernel,
	Vlastimil Babka, Michal Hocko, cgroups

On 5/20/22 10:24, Vasily Averin wrote:
> On 5/19/22 19:53, Michal Koutný wrote:
>> On Fri, May 13, 2022 at 06:52:12PM +0300, Vasily Averin <vvs@openvz.org> wrote:
>>> Creating each new cgroup allocates 4Kb for struct cgroup. This is the
>>> largest memory allocation in this scenario and is especially important
>>> for small VMs with 1-2 CPUs.
>>
>> What do you mean by this argument?
>>
>> (On bigger irons, the percpu components becomes dominant, e.g. struct
>> cgroup_rstat_cpu.)
> 
> Michal, Shakeel,
> thank you very much for your feedback, it helps me understand how to improve
> the methodology of my accounting analysis.
> I considered the general case and looked for places of maximum memory allocations.
> Now I think it would be better to split all called allocations into:
> - common part, called for any cgroup type (i.e. cgroup_mkdir and cgroup_create),
> - per-cgroup parts,
> and focus on 2 corner cases: for single CPU VMs and for "big irons".
> It helps to clarify which allocations are accounting-important and which ones
> can be safely ignored.
> 
> So right now I'm going to redo the calculations and hope it doesn't take long.

common part: 	~11Kb	+  318 bytes percpu
memcg: 		~17Kb	+ 4692 bytes percpu
cpu:		~2.5Kb	+ 1036 bytes percpu
cpuset:		~3Kb	+   12 bytes percpu
blkcg:		~3Kb	+   12 bytes percpu
pid:		~1.5Kb	+   12 bytes percpu		
perf:		 ~320b	+   60 bytes percpu
-------------------------------------------
total:		~38Kb	+ 6142 bytes percpu
currently accounted:	  4668 bytes percpu

Results:
a) I'll add accounting for cgroup_rstat_cpu and psi_group_cpu,
they are allocated in the common part and consume 288 bytes percpu.
b) It makes sense to add accounting for simple_xattr, as Michal recommended,
 especially because it can grow over 4Kb.
c) It looks like the rest of the allocations can be ignored.

Details are below
('=' -- already accounted, '+' -- to be accounted, '~' -- see KERNFS, '?' -- perhaps later )

common part:
16  ~   352     5632    5632    KERNFS (*)
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)

1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

(*) KERNFS includes:
1   +  128      (__kernfs_new_node+0x4d)	kernfs node
1   +   88      (__kernfs_iattrs+0x57)		kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)	simple_xattr_alloc() that can grow over 4Kb 
1   ?   32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)
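
The 352 bytes used in the KERNFS rows is simply the sum of the five
allocations listed above. A small sketch of that arithmetic follows; the
helper names are hypothetical, and the sizes are the traced values from
this run, so simple_xattr may be larger in practice:

```c
#include <stddef.h>

/* Traced sizes (bytes) of the objects created for one kernfs node:
 * kernfs node, kernfs iattrs, simple_xattr, simple_xattr_set, name. */
static const unsigned int kernfs_parts[] = { 128, 88, 96, 32, 8 };

/* Hypothetical helper: bytes allocated for a single kernfs node. */
static unsigned int kernfs_bytes_per_node(void)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < sizeof(kernfs_parts) / sizeof(kernfs_parts[0]); i++)
		sum += kernfs_parts[i];
	return sum;
}

/* Hypothetical helper: bytes allocated for 'nodes' kernfs nodes. */
static unsigned int kernfs_bytes(unsigned int nodes)
{
	return nodes * kernfs_bytes_per_node();
}
```

With these numbers, kernfs_bytes(16) gives the 5632 bytes of the common
part and kernfs_bytes(14) gives the 4928 bytes of the memory controller.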


memory:
------
1   +   8192    8192    8192    (mem_cgroup_css_alloc+0x4a)
14  ~   352     4928    13120   KERNFS
1   +   2048    2048    15168   (mem_cgroup_css_alloc+0xdd)
1       1024    1024    16192   (alloc_shrinker_info+0x79)
1       584     584     16776   (radix_tree_node_alloc.constprop.0+0x89)
2       64      128     16904   (percpu_ref_init+0x6a)
1       64      64      16968   (mem_cgroup_css_online+0x32)

1   =   3684    3684    3684    call_site=mem_cgroup_css_alloc+0x9e
1   =   984     984     4668    call_site=mem_cgroup_css_alloc+0xfd
2       12      24      4692    call_site=percpu_ref_init+0x23

cpu:
---
5   ~   352     1760    1760    KERNFS
1       640     640     2400    (sched_create_group+0x1b)
1       64      64      2464    (percpu_ref_init+0x6a)
1       32      32      2496    (alloc_fair_sched_group+0x55)
1       32      32      2528    (alloc_fair_sched_group+0x31)

4   +   512     512      512    (alloc_fair_sched_group+0x16c)
4   +   512     512     1024    (alloc_fair_sched_group+0x13e)
1       12      12      1036    call_site=percpu_ref_init+0x23

cpuset:
------
5   ~   352     1760    1760    KERNFS
1       1024    1024    2784    (cpuset_css_alloc+0x2f)
1       64      64      2848    (percpu_ref_init+0x6a)
3       8       24      2872    (alloc_cpumask_var_node+0x1f)

1       12      12      12      call_site=percpu_ref_init+0x23

blkcg:
-----
6   ~   352     2112    2112    KERNFS
1       512     512     2624    (blkcg_css_alloc+0x37)
1       64      64      2688    (percpu_ref_init+0x6a)
1       32      32      2720    (ioprio_alloc_cpd+0x39)
1       32      32      2752    (ioc_cpd_alloc+0x39)
1       32      32      2784    (blkcg_css_alloc+0x66)

1       12      12      12      call_site=percpu_ref_init+0x23

pid:
---
3   ~   352     1056    1056    KERNFS
1       512     512     1568    (pids_css_alloc+0x1b)
1       64      64      1632    (percpu_ref_init+0x6a)

1       12      12      12      call_site=percpu_ref_init+0x23

perf:
----
1       256     256     256     (perf_cgroup_css_alloc+0x1c)
1       64      64      320     (percpu_ref_init+0x6a)

1       48      48      48      call_site=perf_cgroup_css_alloc+0x33
1       12      12      60      call_site=percpu_ref_init+0x23

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 3/4] memcg: enable accounting for struct cgroup
  2022-05-20 20:16                     ` Vasily Averin
@ 2022-05-21  0:55                       ` Roman Gushchin
  2022-05-21  7:28                         ` Vasily Averin
  2022-05-23 13:52                       ` Michal Koutný
  1 sibling, 1 reply; 139+ messages in thread
From: Roman Gushchin @ 2022-05-21  0:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Koutný,
	Shakeel Butt, kernel, linux-kernel, Vlastimil Babka,
	Michal Hocko, cgroups

On Fri, May 20, 2022 at 11:16:32PM +0300, Vasily Averin wrote:
> On 5/20/22 10:24, Vasily Averin wrote:
> > On 5/19/22 19:53, Michal Koutný wrote:
> >> On Fri, May 13, 2022 at 06:52:12PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> >>> Creating each new cgroup allocates 4Kb for struct cgroup. This is the
> >>> largest memory allocation in this scenario and is epecially important
> >>> for small VMs with 1-2 CPUs.
> >>
> >> What do you mean by this argument?
> >>
> >> (On bigger irons, the percpu components becomes dominant, e.g. struct
> >> cgroup_rstat_cpu.)
> > 
> > Michal, Shakeel,
> > thank you very much for your feedback, it helps me understand how to improve
> > the methodology of my accounting analyze.
> > I considered the general case and looked for places of maximum memory allocations.
> > Now I think it would be better to split all called allocations into:
> > - common part, called for any cgroup type (i.e. cgroup_mkdir and cgroup_create),
> > - per-cgroup parts,
> > and focus on 2 corner cases: for single CPU VMs and for "big irons".
> > It helps to clarify which allocations are accounting-important and which ones
> > can be safely ignored.
> > 
> > So right now I'm going to redo the calculations and hope it doesn't take long.
> 
> common part: 	~11Kb	+  318 bytes percpu
> memcg: 		~17Kb	+ 4692 bytes percpu
> cpu:		~2.5Kb	+ 1036 bytes percpu
> cpuset:		~3Kb	+   12 bytes percpu
> blkcg:		~3Kb	+   12 bytes percpu
> pid:		~1.5Kb	+   12 bytes percpu		
> perf:		 ~320b	+   60 bytes percpu
> -------------------------------------------
> total:		~38Kb	+ 6142 bytes percpu
> currently accounted:	  4668 bytes percpu
> 
> Results:
> a) I'll add accounting for cgroup_rstat_cpu and psi_group_cpu,
> they are called in common part and consumes 288 bytes percpu.
> b) It makes sense to add accounting for simple_xattr(), as Michal recommend,
>  especially because it can grow over 4kb
> c) it looks like the rest of the allocations can be ignored
> 
> Details are below
> ('=' -- already accounted, '+' -- to be accounted, '~' -- see KERNFS, '?' -- perhaps later )
> 
> common part:
> 16  ~   352     5632    5632    KERNFS (*)
> 1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
> 1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
> 1       192     192     10504   (__d_alloc+0x29)
> 2       72      144     10648   (avc_alloc_node+0x27)
> 2       64      128     10776   (percpu_ref_init+0x6a)
> 1       64      64      10840   (memcg_list_lru_alloc+0x21a)
> 
> 1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
> 1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
> 2       12      24      312     call_site=percpu_ref_init+0x23
> 1       6       6       318     call_site=__percpu_counter_init+0x22

I'm curious, how do you generate this data?
Just an idea: it could be a nice tool, placed somewhere in tools/cgroup/...

Thanks!

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 3/4] memcg: enable accounting for struct cgroup
  2022-05-21  0:55                       ` Roman Gushchin
@ 2022-05-21  7:28                         ` Vasily Averin
  0 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-21  7:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný,
	Shakeel Butt, kernel, linux-kernel, Vlastimil Babka,
	Michal Hocko, cgroups

On 5/21/22 03:55, Roman Gushchin wrote:
> On Fri, May 20, 2022 at 11:16:32PM +0300, Vasily Averin wrote:

>> common part:
>> 16  ~   352     5632    5632    KERNFS (*)
>> 1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
>> 1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
>> 1       192     192     10504   (__d_alloc+0x29)
>> 2       72      144     10648   (avc_alloc_node+0x27)
>> 2       64      128     10776   (percpu_ref_init+0x6a)
>> 1       64      64      10840   (memcg_list_lru_alloc+0x21a)
>>
>> 1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
>> 1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
>> 2       12      24      312     call_site=percpu_ref_init+0x23
>> 1       6       6       318     call_site=__percpu_counter_init+0x22
> 
> I'm curios, how do you generate these data?

trace_cmd + awk + /dev/hand

> Just an idea: it could be a nice tool, placed somewhere in tools/cgroup/...

I agree, nice idea. I'll try to implement it.
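
A minimal sketch of the aggregation step such a tool might perform. It
assumes the input is the text form of kmalloc trace events, where each
line carries "call_site=<symbol>" and "bytes_alloc=<n>" fields; the
struct and function names are hypothetical, and the manual
post-processing is not covered:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SITES 64
#define SITE_LEN  64

/* Hypothetical record: one row of the per-call-site summary. */
struct site_stat {
	char site[SITE_LEN];	/* e.g. "cgroup_mkdir+0xe4" */
	unsigned long count;	/* number of allocations seen */
	unsigned long bytes;	/* sum of bytes_alloc */
};

struct site_table {
	struct site_stat rows[MAX_SITES];
	size_t used;
};

/*
 * Parse one trace line and add it to the table.
 * Assumes the line carries "call_site=<symbol>" and "bytes_alloc=<n>".
 * Returns 0 on success, -1 if the line does not match or the table is full.
 */
static int site_table_add_line(struct site_table *t, const char *line)
{
	const char *cs, *ba;
	char site[SITE_LEN];
	unsigned long bytes;
	size_t i;

	assert(t && line);
	cs = strstr(line, "call_site=");
	ba = strstr(line, "bytes_alloc=");
	if (!cs || !ba)
		return -1;
	if (sscanf(cs, "call_site=%63s", site) != 1)
		return -1;
	if (sscanf(ba, "bytes_alloc=%lu", &bytes) != 1)
		return -1;

	/* Look for an existing row with the same call site. */
	for (i = 0; i < t->used; i++)
		if (strcmp(t->rows[i].site, site) == 0)
			break;
	if (i == t->used) {
		if (t->used == MAX_SITES)
			return -1;
		strcpy(t->rows[i].site, site);	/* fits: at most 63 chars + NUL */
		t->rows[i].count = 0;
		t->rows[i].bytes = 0;
		t->used++;
	}
	t->rows[i].count++;
	t->rows[i].bytes += bytes;
	return 0;
}
```

Feeding every line of the trace report through site_table_add_line()
yields the count and total per call site, which roughly corresponds to
the first and third columns of the tables above.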

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH mm v2 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-13 17:49                 ` Roman Gushchin
@ 2022-05-21 16:37                   ` Vasily Averin
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
       [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
  2022-05-21 16:37                   ` [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup Vasily Averin
                                     ` (8 subsequent siblings)
  9 siblings, 2 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-21 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
4-CPU VM with Fedora and a self-compiled upstream kernel. The calculations
are not precise: they depend on kernel config options, the number of CPUs
and the enabled controllers, and they ignore possible page allocations etc.
However, this is enough to clarify the general situation.
All allocations are split into:
- common part, always called for each cgroup type
- per-cgroup allocations

In each group we consider 2 corner cases:
- usual allocations, important for 1-2 CPU nodes/VMs
- percpu allocations, important for 'big irons'

common part: 	~11Kb	+  318 bytes percpu
memcg: 		~17Kb	+ 4692 bytes percpu
cpu:		~2.5Kb	+ 1036 bytes percpu
cpuset:		~3Kb	+   12 bytes percpu
blkcg:		~3Kb	+   12 bytes percpu
pid:		~1.5Kb	+   12 bytes percpu		
perf:		 ~320b	+   60 bytes percpu
-------------------------------------------
total:		~38Kb	+ 6142 bytes percpu
currently accounted:	  4668 bytes percpu

- It is important to account the usual allocations made in the
common part, because almost all cgroup-specific allocations
are small. One exception here is the memory cgroup: it allocates a few
huge objects that should be accounted.
- Percpu allocations made in the common part, in memcg and in cpu cgroups
should be accounted; the rest are small and can be ignored.
- KERNFS objects are allocated both in the common part and in most
cgroups.
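
As a rough illustration of the two corner cases, the per-cgroup footprint
can be modelled as a fixed part plus a part that scales with the number
of CPUs. This is only a sketch: the constants are the traced totals from
the table above (all listed controllers enabled), the helper names are
hypothetical, and real numbers depend on the kernel config:

```c
#include <assert.h>
#include <stddef.h>

/* Traced totals from the table above; approximate and config-dependent. */
#define CGROUP_FIXED_BYTES	(38u * 1024u)	/* ~38Kb of usual allocations */
#define CGROUP_PERCPU_BYTES	6142u		/* percpu bytes, paid once per CPU */

/* Hypothetical helper: estimated bytes allocated by one cgroup mkdir. */
static size_t cgroup_mkdir_bytes(unsigned int ncpus)
{
	return (size_t)CGROUP_FIXED_BYTES + (size_t)CGROUP_PERCPU_BYTES * ncpus;
}

/* Hypothetical helper: integer percentage of the total that is percpu. */
static unsigned int cgroup_percpu_share(unsigned int ncpus)
{
	size_t percpu = (size_t)CGROUP_PERCPU_BYTES * ncpus;

	assert(ncpus > 0);
	return (unsigned int)(percpu * 100 / cgroup_mkdir_bytes(ncpus));
}
```

Under these assumptions the percpu part is a minor share on a single-CPU
VM and becomes the dominant one on a 64-CPU node, which is why both
kinds of allocations are considered separately.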

Details can be found here:
https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/

I checked the other cgroup types and found that they can all be ignored.
Additionally, I found an allocation of struct rt_rq made in the cpu cgroup
if CONFIG_RT_GROUP_SCHED is enabled; it allocates a huge (~1700 bytes)
percpu structure and should be accounted too.

v2:
 1) re-split to simplify possible bisect, re-ordered
 2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
     allocated in common part
 3) added accounting for percpu allocation of struct rt_rq
     (relevant if CONFIG_RT_GROUP_SCHED is enabled)
 4) improved patch descriptions

Vasily Averin (9):
  memcg: enable accounting for struct cgroup
  memcg: enable accounting for kernfs nodes
  memcg: enable accounting for kernfs iattrs
  memcg: enable accounting for struct simple_xattr
  memcg: enable accounting for percpu allocation of struct psi_group_cpu
  memcg: enable accounting for percpu allocation of struct
    cgroup_rstat_cpu
  memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  memcg: enable accounting for allocations in alloc_fair_sched_group
  memcg: enable accounting for percpu allocation of struct rt_rq

 fs/kernfs/mount.c      | 6 ++++--
 fs/xattr.c             | 2 +-
 kernel/cgroup/cgroup.c | 2 +-
 kernel/cgroup/rstat.c  | 3 ++-
 kernel/sched/fair.c    | 4 ++--
 kernel/sched/psi.c     | 3 ++-
 kernel/sched/rt.c      | 2 +-
 mm/memcontrol.c        | 4 ++--
 8 files changed, 15 insertions(+), 11 deletions(-)

-- 
2.36.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup
  2022-05-13 17:49                 ` Roman Gushchin
  2022-05-21 16:37                   ` [PATCH mm v2 0/9] " Vasily Averin
@ 2022-05-21 16:37                   ` Vasily Averin
  2022-05-22  6:37                     ` Muchun Song
  2022-05-21 16:37                   ` [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
                                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-21 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

Creating each new cgroup allocates 4Kb for struct cgroup. This is the
largest memory allocation in this scenario and is especially important
for small VMs with 1-2 CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Accounting of this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..7595127c5b3a 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5353,7 +5353,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
 	cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes
  2022-05-13 17:49                 ` Roman Gushchin
  2022-05-21 16:37                   ` [PATCH mm v2 0/9] " Vasily Averin
  2022-05-21 16:37                   ` [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-05-21 16:37                   ` Vasily Averin
  2022-05-22  6:37                     ` Muchun Song
  2022-05-21 16:37                   ` [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
                                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-21 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

kernfs nodes are quite small kernel objects; however, there are a few
scenarios where they consume a significant part of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
   was allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node also creates a few other objects:

Allocs  Alloc   Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)	kernfs node
1   +   88      (__kernfs_iattrs+0x57)		kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)	simple_xattr, can grow over 4Kb
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for the kernfs_node_cache slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..3ac4191b1c40 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,7 +391,8 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs
  2022-05-13 17:49                 ` Roman Gushchin
                                     ` (2 preceding siblings ...)
  2022-05-21 16:37                   ` [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
@ 2022-05-21 16:37                   ` Vasily Averin
  2022-05-22  6:38                     ` Muchun Song
  2022-05-21 16:38                   ` [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
                                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-21 16:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

kernfs nodes are quite small kernel objects; however, there are a few
scenarios where they consume a significant part of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
   was allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node also creates a few other objects:

Allocs  Alloc    Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)        kernfs node
1   +   88      (__kernfs_iattrs+0x57)          kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)       simple_xattr, can grow over 4Kb
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for the kernfs_iattrs_cache slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 3ac4191b1c40..40e896c7c86b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -397,5 +397,6 @@ void __init kernfs_init(void)
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
 					      sizeof(struct kernfs_iattrs),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 }
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr
  2022-05-13 17:49                 ` Roman Gushchin
                                     ` (3 preceding siblings ...)
  2022-05-21 16:37                   ` [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
@ 2022-05-21 16:38                   ` Vasily Averin
  2022-05-22  6:38                     ` Muchun Song
  2022-05-21 16:38                   ` [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
                                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-21 16:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

kernfs nodes are quite small kernel objects; however, there are a few
scenarios where they consume a significant part of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
   was allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node also creates a few other objects:

Allocs  Alloc   Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)        kernfs node
1   +   88      (__kernfs_iattrs+0x57)          kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)       simple_xattr
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for struct simple_xattr. Size of this
structure depends on userspace and can grow over 4Kb.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
 fs/xattr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 998045165916..31305b941756 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -950,7 +950,7 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
 	if (len < sizeof(*new_xattr))
 		return NULL;
 
-	new_xattr = kvmalloc(len, GFP_KERNEL);
+	new_xattr = kvmalloc(len, GFP_KERNEL_ACCOUNT);
 	if (!new_xattr)
 		return NULL;
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
  2022-05-13 17:49                 ` Roman Gushchin
                                     ` (4 preceding siblings ...)
  2022-05-21 16:38                   ` [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
@ 2022-05-21 16:38                   ` Vasily Averin
  2022-05-21 21:34                     ` Shakeel Butt
                                       ` (2 more replies)
  2022-05-21 16:38                   ` [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
                                     ` (3 subsequent siblings)
  9 siblings, 3 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-21 16:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

struct psi_group_cpu is percpu allocated for each new cgroup and can
consume a significant portion of all allocated memory on nodes with
a large number of CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 kernel/sched/psi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a4fa3aadfcba..f0b25380cb12 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -957,7 +957,8 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	if (static_branch_likely(&psi_disabled))
 		return 0;
 
-	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
+	cgroup->psi.pcpu = alloc_percpu_gfp(struct psi_group_cpu,
+					    GFP_KERNEL_ACCOUNT);
 	if (!cgroup->psi.pcpu)
 		return -ENOMEM;
 	group_init(&cgroup->psi);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu
  2022-05-13 17:49                 ` Roman Gushchin
                                     ` (5 preceding siblings ...)
  2022-05-21 16:38                   ` [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
@ 2022-05-21 16:38                   ` Vasily Averin
  2022-05-21 17:58                     ` Vasily Averin
                                       ` (3 more replies)
  2022-05-21 16:38                   ` [PATCH mm v2 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
                                     ` (2 subsequent siblings)
  9 siblings, 4 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-21 16:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

struct cgroup_rstat_cpu is percpu allocated for each new cgroup and
can consume a significant portion of all allocated memory on nodes
with a large number of CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum	Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 kernel/cgroup/rstat.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 24b5c2ab5598..f76cb63ae2e0 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -257,7 +257,8 @@ int cgroup_rstat_init(struct cgroup *cgrp)
 
 	/* the root cgrp has rstat_cpu preallocated */
 	if (!cgrp->rstat_cpu) {
-		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+		cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu,
+						   GFP_KERNEL_ACCOUNT);
 		if (!cgrp->rstat_cpu)
 			return -ENOMEM;
 	}
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v2 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  2022-05-13 17:49                 ` Roman Gushchin
                                     ` (6 preceding siblings ...)
  2022-05-21 16:38                   ` [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
@ 2022-05-21 16:38                   ` Vasily Averin
  2022-05-22  6:47                     ` Muchun Song
  2022-05-21 16:38                   ` [PATCH mm v2 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
  2022-05-21 16:39                   ` [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq Vasily Averin
  9 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-21 16:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

Creating each memory cgroup allocates a few huge objects in
mem_cgroup_css_alloc(). Their total size exceeds the size of the memory
accounted in the common part of cgroup creation:

common part: 	~11Kb	+  318 bytes percpu
memcg: 		~17Kb	+ 4692 bytes percpu

memory:
------
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
1   +   8192    8192    8192    (mem_cgroup_css_alloc+0x4a) <NB
14  ~   352     4928    13120   KERNFS
1   +   2048    2048    15168   (mem_cgroup_css_alloc+0xdd) <NB
1       1024    1024    16192   (alloc_shrinker_info+0x79)
1       584     584     16776   (radix_tree_node_alloc.constprop.0+0x89)
2       64      128     16904   (percpu_ref_init+0x6a)
1       64      64      16968   (mem_cgroup_css_online+0x32)

1   =   3684    3684    3684    call_site=mem_cgroup_css_alloc+0x9e
1   =   984     984     4668    call_site=mem_cgroup_css_alloc+0xfd
2       12      24      4692    call_site=percpu_ref_init+0x23

     '=' -- already accounted,
     '+' -- to be accounted,
     '~' -- partially accounted

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 598fece89e2b..52c6163ba6dc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5031,7 +5031,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
 
-	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, node);
+	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL_ACCOUNT, node);
 	if (!pn)
 		return 1;
 
@@ -5083,7 +5083,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	int __maybe_unused i;
 	long error = -ENOMEM;
 
-	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL);
+	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL_ACCOUNT);
 	if (!memcg)
 		return ERR_PTR(error);
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v2 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group
  2022-05-13 17:49                 ` Roman Gushchin
                                     ` (7 preceding siblings ...)
  2022-05-21 16:38                   ` [PATCH mm v2 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
@ 2022-05-21 16:38                   ` Vasily Averin
  2022-05-22  6:49                     ` Muchun Song
  2022-05-21 16:39                   ` [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq Vasily Averin
  9 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-21 16:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

Creating each new cpu cgroup allocates two 512-byte kernel objects
per CPU. This is especially important for cgroups sharing a parent memory
cgroup. In this scenario, on nodes with multiple processors, these
allocations become one of the main memory consumers.

Memory allocated during new cpu cgroup creation:
common part: 	~11Kb	+  318 bytes percpu
cpu cgroup:	~2.5Kb	+ 1036 bytes percpu

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a68482d66535..46e66acf7475 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11529,12 +11529,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
-				      GFP_KERNEL, cpu_to_node(i));
+				      GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!cfs_rq)
 			goto err;
 
 		se = kzalloc_node(sizeof(struct sched_entity_stats),
-				  GFP_KERNEL, cpu_to_node(i));
+				  GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!se)
 			goto err_free_rq;
 
-- 
2.36.1
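
Note on scaling: since the two 512-byte objects are allocated once per possible CPU, the cost of one cpu cgroup grows linearly with the CPU count. A minimal userspace sketch of that arithmetic follows. The macro and function names are hypothetical; the byte sizes are the ones quoted in the patch description.

```c
#include <assert.h>

/* Sizes quoted in the patch description; the names are illustrative only. */
#define FAIR_OBJ_BYTES		512UL	/* each of the two per-CPU objects */
#define FAIR_OBJS_PER_CPU	2UL	/* the cfs_rq and se allocations */
#define CPU_CGROUP_PER_CPU	1036UL	/* "1036 bytes percpu" for a cpu cgroup */

/* Bytes this patch starts charging to the memcg for one new cpu cgroup. */
static unsigned long fair_group_bytes(unsigned int ncpus)
{
	assert(ncpus > 0);
	return FAIR_OBJ_BYTES * FAIR_OBJS_PER_CPU * ncpus;
}

/* All per-CPU bytes of one cpu cgroup, as listed in the description. */
static unsigned long cpu_cgroup_percpu_bytes(unsigned int ncpus)
{
	assert(ncpus > 0);
	return CPU_CGROUP_PER_CPU * ncpus;
}
```

For example, fair_group_bytes(64) is 65536, so on a host with 64 possible CPUs each new cpu cgroup carries 64Kb in these two object types alone, well above the ~2.5Kb fixed part.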



* [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq
  2022-05-13 17:49                 ` Roman Gushchin
                                     ` (8 preceding siblings ...)
  2022-05-21 16:38                   ` [PATCH mm v2 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
@ 2022-05-21 16:39                   ` Vasily Averin
  2022-05-21 21:37                     ` Shakeel Butt
  2022-05-25  1:31                     ` Roman Gushchin
  9 siblings, 2 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-21 16:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

If enabled in config, alloc_rt_sched_group() is called for each new
cpu cgroup and allocates a huge (~1700 bytes) percpu struct rt_rq.
This significantly exceeds the size of the percpu allocation in the
common part of cgroup creation.

Memory allocated during new cpu cgroup creation
(with enabled RT_GROUP_SCHED):
common part:    ~11Kb   +   318 bytes percpu
cpu cgroup:     ~2.5Kb  + ~2800 bytes percpu

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 kernel/sched/rt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a32c46889af8..2639d6dcebe6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -206,7 +206,7 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
 
 	for_each_possible_cpu(i) {
 		rt_rq = kzalloc_node(sizeof(struct rt_rq),
-				     GFP_KERNEL, cpu_to_node(i));
+				     GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!rt_rq)
 			goto err;
 
-- 
2.36.1



* Re: [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu
  2022-05-21 16:38                   ` [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
@ 2022-05-21 17:58                     ` Vasily Averin
  2022-05-21 21:35                     ` Shakeel Butt
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-21 17:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

On 5/21/22 19:38, Vasily Averin wrote:
> struct cgroup_rstat_cpu is percpu allocated for each new cgroup and
> can consume a significant portion of all allocated memory on nodes
> with a large number of CPUs.
> 
> Common part of the cgroup creation:
> Allocs  Alloc   $1*$2   Sum	Allocation
> number  size
> --------------------------------------------
> 16  ~   352     5632    5632    KERNFS
> 1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
> 1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
> 1       192     192     10504   (__d_alloc+0x29)
> 2       72      144     10648   (avc_alloc_node+0x27)
> 2       64      128     10776   (percpu_ref_init+0x6a)
> 1       64      64      10840   (memcg_list_lru_alloc+0x21a)
> percpu:
> 1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
> 1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
> 2       12      24      312     call_site=percpu_ref_init+0x23
> 1       6       6       318     call_site=__percpu_counter_init+0x22
> 
>  '+' -- to be accounted,
>  '~' -- partially accounted
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
>  kernel/cgroup/rstat.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 24b5c2ab5598..f76cb63ae2e0 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -257,7 +257,8 @@ int cgroup_rstat_init(struct cgroup *cgrp)
>  
>  	/* the root cgrp has rstat_cpu preallocated */
>  	if (!cgrp->rstat_cpu) {
> -		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
> +		cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu
"," was lost here
> +						   GFP_KERNEL_ACCOUNT);
>  		if (!cgrp->rstat_cpu)
>  			return -ENOMEM;
>  	}



* Re: [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
  2022-05-21 16:38                   ` [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
@ 2022-05-21 21:34                     ` Shakeel Butt
  2022-05-22  6:40                     ` Muchun Song
  2022-05-25  1:30                     ` Roman Gushchin
  2 siblings, 0 replies; 139+ messages in thread
From: Shakeel Butt @ 2022-05-21 21:34 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, LKML, Linux MM, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Cgroups

On Sat, May 21, 2022 at 9:38 AM Vasily Averin <vvs@openvz.org> wrote:
>
> struct psi_group_cpu is percpu allocated for each new cgroup and can
> consume a significant portion of all allocated memory on nodes with
> a large number of CPUs.
>
> Common part of the cgroup creation:
> Allocs  Alloc   $1*$2   Sum     Allocation
> number  size
> --------------------------------------------
> 16  ~   352     5632    5632    KERNFS
> 1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
> 1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
> 1       192     192     10504   (__d_alloc+0x29)
> 2       72      144     10648   (avc_alloc_node+0x27)
> 2       64      128     10776   (percpu_ref_init+0x6a)
> 1       64      64      10840   (memcg_list_lru_alloc+0x21a)
> percpu:
> 1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
> 1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
> 2       12      24      312     call_site=percpu_ref_init+0x23
> 1       6       6       318     call_site=__percpu_counter_init+0x22
>
>  '+' -- to be accounted,
>  '~' -- partially accounted
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu
  2022-05-21 16:38                   ` [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
  2022-05-21 17:58                     ` Vasily Averin
@ 2022-05-21 21:35                     ` Shakeel Butt
  2022-05-21 22:05                     ` kernel test robot
  2022-05-25  1:31                     ` Roman Gushchin
  3 siblings, 0 replies; 139+ messages in thread
From: Shakeel Butt @ 2022-05-21 21:35 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, LKML, Linux MM, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Cgroups

On Sat, May 21, 2022 at 9:38 AM Vasily Averin <vvs@openvz.org> wrote:
>
> struct cgroup_rstat_cpu is percpu allocated for each new cgroup and
> can consume a significant portion of all allocated memory on nodes
> with a large number of CPUs.
>
> Common part of the cgroup creation:
> Allocs  Alloc   $1*$2   Sum     Allocation
> number  size
> --------------------------------------------
> 16  ~   352     5632    5632    KERNFS
> 1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
> 1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
> 1       192     192     10504   (__d_alloc+0x29)
> 2       72      144     10648   (avc_alloc_node+0x27)
> 2       64      128     10776   (percpu_ref_init+0x6a)
> 1       64      64      10840   (memcg_list_lru_alloc+0x21a)
> percpu:
> 1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
> 1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
> 2       12      24      312     call_site=percpu_ref_init+0x23
> 1       6       6       318     call_site=__percpu_counter_init+0x22
>
>  '+' -- to be accounted,
>  '~' -- partially accounted
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq
  2022-05-21 16:39                   ` [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq Vasily Averin
@ 2022-05-21 21:37                     ` Shakeel Butt
  2022-05-25  1:31                     ` Roman Gushchin
  1 sibling, 0 replies; 139+ messages in thread
From: Shakeel Butt @ 2022-05-21 21:37 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, LKML, Linux MM, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Cgroups

On Sat, May 21, 2022 at 9:39 AM Vasily Averin <vvs@openvz.org> wrote:
>
> If enabled in config, alloc_rt_sched_group() is called for each new
> cpu cgroup and allocates a huge (~1700 bytes) percpu struct rt_rq.
> This significantly exceeds the size of the percpu allocation in the
> common part of cgroup creation.
>
> Memory allocated during new cpu cgroup creation
> (with enabled RT_GROUP_SCHED):
> common part:    ~11Kb   +   318 bytes percpu
> cpu cgroup:     ~2.5Kb  + ~2800 bytes percpu
>
> Accounting for this memory helps to avoid misuse inside memcg-limited
> contianers.

*containers

>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu
  2022-05-21 16:38                   ` [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
  2022-05-21 17:58                     ` Vasily Averin
  2022-05-21 21:35                     ` Shakeel Butt
@ 2022-05-21 22:05                     ` kernel test robot
  2022-05-25  1:31                     ` Roman Gushchin
  3 siblings, 0 replies; 139+ messages in thread
From: kernel test robot @ 2022-05-21 22:05 UTC (permalink / raw)
  To: Vasily Averin, Andrew Morton
  Cc: kbuild-all, Linux Memory Management List, kernel, linux-kernel,
	Shakeel Butt, Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

Hi Vasily,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on tj-cgroup/for-next driver-core/driver-core-testing linus/master v5.18-rc7 next-20220520]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Vasily-Averin/memcg-enable-accounting-for-struct-cgroup/20220522-004124
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 991d8d8142cad94f9c5c05db25e67fa83d6f772a
config: arm-imxrt_defconfig (https://download.01.org/0day-ci/archive/20220522/202205220531.AVnBFrgq-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 11.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/c1b7edf1635aaef50d25ba8246a5e5c997a6bf44
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Vasily-Averin/memcg-enable-accounting-for-struct-cgroup/20220522-004124
        git checkout c1b7edf1635aaef50d25ba8246a5e5c997a6bf44
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash kernel/cgroup/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   kernel/cgroup/rstat.c: In function 'cgroup_rstat_init':
>> kernel/cgroup/rstat.c:261:70: error: macro "alloc_percpu_gfp" requires 2 arguments, but only 1 given
     261 |                                                    GFP_KERNEL_ACCOUNT);
         |                                                                      ^
   In file included from include/linux/hrtimer.h:19,
                    from include/linux/sched.h:19,
                    from include/linux/cgroup.h:12,
                    from kernel/cgroup/cgroup-internal.h:5,
                    from kernel/cgroup/rstat.c:2:
   include/linux/percpu.h:133: note: macro "alloc_percpu_gfp" defined here
     133 | #define alloc_percpu_gfp(type, gfp)                                     \
         | 
>> kernel/cgroup/rstat.c:260:35: error: 'alloc_percpu_gfp' undeclared (first use in this function)
     260 |                 cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu
         |                                   ^~~~~~~~~~~~~~~~
   kernel/cgroup/rstat.c:260:35: note: each undeclared identifier is reported only once for each function it appears in


vim +/alloc_percpu_gfp +261 kernel/cgroup/rstat.c

   253	
   254	int cgroup_rstat_init(struct cgroup *cgrp)
   255	{
   256		int cpu;
   257	
   258		/* the root cgrp has rstat_cpu preallocated */
   259		if (!cgrp->rstat_cpu) {
 > 260			cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu
 > 261							   GFP_KERNEL_ACCOUNT);
   262			if (!cgrp->rstat_cpu)
   263				return -ENOMEM;
   264		}
   265	
   266		/* ->updated_children list is self terminated */
   267		for_each_possible_cpu(cpu) {
   268			struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
   269	
   270			rstatc->updated_children = cgrp;
   271			u64_stats_init(&rstatc->bsync);
   272		}
   273	
   274		return 0;
   275	}
   276	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp
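
Note on the error above: alloc_percpu_gfp(type, gfp) is a two-parameter function-like macro, so without the comma the preprocessor sees a single argument and rejects the invocation. The userspace sketch below mimics only that shape. Every demo_* and DEMO_* name is a hypothetical stand-in, not a kernel definition, and calloc() merely takes the place of the real allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the gfp flags; values are arbitrary. */
#define DEMO_GFP_KERNEL		0x1u
#define DEMO_GFP_ACCOUNT	0x2u
#define DEMO_GFP_KERNEL_ACCOUNT	(DEMO_GFP_KERNEL | DEMO_GFP_ACCOUNT)

/* Stand-in for the structure being allocated. */
struct demo_rstat_cpu {
	unsigned long updated;
};

/* Records the flags of the most recent allocation, for inspection. */
static unsigned int demo_last_flags;

/* Zeroing allocator that remembers which flags it was called with. */
static void *demo_alloc(size_t size, unsigned int flags)
{
	assert(size > 0);
	demo_last_flags = flags;
	return calloc(1, size);
}

/* Two-parameter function-like macro, same shape as alloc_percpu_gfp(). */
#define demo_alloc_gfp(type, gfp) ((type *)demo_alloc(sizeof(type), (gfp)))

static struct demo_rstat_cpu *demo_rstat_alloc(void)
{
	/* The comma after the type is what separates the two arguments. */
	return demo_alloc_gfp(struct demo_rstat_cpu,
			      DEMO_GFP_KERNEL_ACCOUNT);
}
```

Dropping the comma in demo_rstat_alloc() should produce the same class of diagnostic as in the report: the macro is not expanded, and the compiler then treats its name as an undeclared identifier.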


* Re: [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup
  2022-05-21 16:37                   ` [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-05-22  6:37                     ` Muchun Song
  0 siblings, 0 replies; 139+ messages in thread
From: Muchun Song @ 2022-05-22  6:37 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:37:36PM +0300, Vasily Averin wrote:
> Creating each new cgroup allocates 4Kb for struct cgroup. This is the
> largest memory allocation in this scenario and is especially important
> for small VMs with 1-2 CPUs.
> 
> Common part of the cgroup creation:
> Allocs  Alloc   $1*$2   Sum     Allocation
> number  size
> --------------------------------------------
> 16  ~   352     5632    5632    KERNFS
> 1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
> 1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
> 1       192     192     10504   (__d_alloc+0x29)
> 2       72      144     10648   (avc_alloc_node+0x27)
> 2       64      128     10776   (percpu_ref_init+0x6a)
> 1       64      64      10840   (memcg_list_lru_alloc+0x21a)
> percpu:
> 1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
> 1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
> 2       12      24      312     call_site=percpu_ref_init+0x23
> 1       6       6       318     call_site=__percpu_counter_init+0x22
> 
>  '+' -- to be accounted,
>  '~' -- partially accounted
> 
> Accounting of this memory helps to avoid misuse inside memcg-limited
> containers.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>


* Re: [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes
  2022-05-21 16:37                   ` [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
@ 2022-05-22  6:37                     ` Muchun Song
  0 siblings, 0 replies; 139+ messages in thread
From: Muchun Song @ 2022-05-22  6:37 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:37:49PM +0300, Vasily Averin wrote:
> kernfs nodes are quite small kernel objects; however, there are a few
> scenarios where they consume a significant share of all allocated memory:
> 
> 1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
>    was allocated for 80+ kernfs nodes.
> 
> 2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
>    structures.
> 
> 3) Shakeel Butt reports that Google has workloads which create 100s
>    of subcontainers and they have observed high system overhead
>    without memcg accounting of kernfs.
> 
> Usually a new kernfs node creates a few other objects:
> 
> Allocs  Alloc   Allocation
> number  size
> --------------------------------------------
> 1   +  128      (__kernfs_new_node+0x4d)	kernfs node
> 1   +   88      (__kernfs_iattrs+0x57)		kernfs iattrs
> 1   +   96      (simple_xattr_alloc+0x28)	simple_xattr, can grow over 4Kb
> 1       32      (simple_xattr_set+0x59)
> 1       8       (__kernfs_new_node+0x30)
> 
> '+' -- to be accounted
> 
> This patch enables accounting for kernfs nodes slab cache.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>


* Re: [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs
  2022-05-21 16:37                   ` [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
@ 2022-05-22  6:38                     ` Muchun Song
  0 siblings, 0 replies; 139+ messages in thread
From: Muchun Song @ 2022-05-22  6:38 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:37:59PM +0300, Vasily Averin wrote:
> kernfs nodes are quite small kernel objects; however, there are a few
> scenarios where they consume a significant share of all allocated memory:
> 
> 1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
>    was allocated for 80+ kernfs nodes.
> 
> 2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
>    structures.
> 
> 3) Shakeel Butt reports that Google has workloads which create 100s
>    of subcontainers and they have observed high system overhead
>    without memcg accounting of kernfs.
> 
> Usually a new kernfs node creates a few other objects:
> 
> Allocs  Alloc    Allocation
> number  size
> --------------------------------------------
> 1   +  128      (__kernfs_new_node+0x4d)        kernfs node
> 1   +   88      (__kernfs_iattrs+0x57)          kernfs iattrs
> 1   +   96      (simple_xattr_alloc+0x28)       simple_xattr, can grow over 4Kb
> 1       32      (simple_xattr_set+0x59)
> 1       8       (__kernfs_new_node+0x30)
> 
> '+' -- to be accounted
> 
> This patch enables accounting for the kernfs_iattrs_cache slab cache.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>


* Re: [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr
  2022-05-21 16:38                   ` [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
@ 2022-05-22  6:38                     ` Muchun Song
  0 siblings, 0 replies; 139+ messages in thread
From: Muchun Song @ 2022-05-22  6:38 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:38:11PM +0300, Vasily Averin wrote:
> kernfs nodes are quite small kernel objects; however, there are a few
> scenarios where they consume a significant share of all allocated memory:
> 
> 1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
>    was allocated for 80+ kernfs nodes.
> 
> 2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
>    structures.
> 
> 3) Shakeel Butt reports that Google has workloads which create 100s
>    of subcontainers and they have observed high system overhead
>    without memcg accounting of kernfs.
> 
> Usually a new kernfs node creates a few other objects:
> 
> Allocs  Alloc   Allocation
> number  size
> --------------------------------------------
> 1   +  128      (__kernfs_new_node+0x4d)        kernfs node
> 1   +   88      (__kernfs_iattrs+0x57)          kernfs iattrs
> 1   +   96      (simple_xattr_alloc+0x28)       simple_xattr
> 1       32      (simple_xattr_set+0x59)
> 1       8       (__kernfs_new_node+0x30)
> 
> '+' -- to be accounted
> 
> This patch enables accounting for struct simple_xattr. Size of this
> structure depends on userspace and can grow over 4Kb.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>


* Re: [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
  2022-05-21 16:38                   ` [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
  2022-05-21 21:34                     ` Shakeel Butt
@ 2022-05-22  6:40                     ` Muchun Song
  2022-05-25  1:30                     ` Roman Gushchin
  2 siblings, 0 replies; 139+ messages in thread
From: Muchun Song @ 2022-05-22  6:40 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:38:21PM +0300, Vasily Averin wrote:
> struct psi_group_cpu is percpu allocated for each new cgroup and can
> consume a significant portion of all allocated memory on nodes with
> a large number of CPUs.
> 
> Common part of the cgroup creation:
> Allocs  Alloc   $1*$2   Sum     Allocation
> number  size
> --------------------------------------------
> 16  ~   352     5632    5632    KERNFS
> 1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
> 1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
> 1       192     192     10504   (__d_alloc+0x29)
> 2       72      144     10648   (avc_alloc_node+0x27)
> 2       64      128     10776   (percpu_ref_init+0x6a)
> 1       64      64      10840   (memcg_list_lru_alloc+0x21a)
> percpu:
> 1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
> 1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
> 2       12      24      312     call_site=percpu_ref_init+0x23
> 1       6       6       318     call_site=__percpu_counter_init+0x22
> 
>  '+' -- to be accounted,
>  '~' -- partially accounted
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>



* Re: [PATCH mm v2 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  2022-05-21 16:38                   ` [PATCH mm v2 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
@ 2022-05-22  6:47                     ` Muchun Song
  0 siblings, 0 replies; 139+ messages in thread
From: Muchun Song @ 2022-05-22  6:47 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, LKML, Linux Memory Management List,
	Shakeel Butt, Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Michal Hocko, Cgroups

On Sun, May 22, 2022 at 12:38 AM Vasily Averin <vvs@openvz.org> wrote:
>
> Creation of each memory cgroup allocates a few huge objects in
> mem_cgroup_css_alloc(). Their size exceeds the size of memory
> accounted in the common part of cgroup creation:
>
> common part:    ~11Kb   +  318 bytes percpu
> memcg:          ~17Kb   + 4692 bytes percpu
>
> memory:
> ------
> Allocs  Alloc   $1*$2   Sum     Allocation
> number  size
> --------------------------------------------
> 1   +   8192    8192    8192    (mem_cgroup_css_alloc+0x4a) <NB
> 14  ~   352     4928    13120   KERNFS
> 1   +   2048    2048    15168   (mem_cgroup_css_alloc+0xdd) <NB
> 1       1024    1024    16192   (alloc_shrinker_info+0x79)
> 1       584     584     16776   (radix_tree_node_alloc.constprop.0+0x89)
> 2       64      128     16904   (percpu_ref_init+0x6a)
> 1       64      64      16968   (mem_cgroup_css_online+0x32)
>
> 1   =   3684    3684    3684    call_site=mem_cgroup_css_alloc+0x9e
> 1   =   984     984     4668    call_site=mem_cgroup_css_alloc+0xfd
> 2       12      24      4692    call_site=percpu_ref_init+0x23
>
>      '=' -- already accounted,
>      '+' -- to be accounted,
>      '~' -- partially accounted
>
> Accounting for this memory helps to avoid misuse inside memcg-limited
> containers.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>


* Re: [PATCH mm v2 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group
  2022-05-21 16:38                   ` [PATCH mm v2 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
@ 2022-05-22  6:49                     ` Muchun Song
  0 siblings, 0 replies; 139+ messages in thread
From: Muchun Song @ 2022-05-22  6:49 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, LKML, Linux Memory Management List,
	Shakeel Butt, Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Michal Hocko, Cgroups

On Sun, May 22, 2022 at 12:39 AM Vasily Averin <vvs@openvz.org> wrote:
>
> Creating each new cpu cgroup allocates two 512-byte kernel objects
> per CPU. This is especially important for cgroups sharing a parent memory
> cgroup. In this scenario, on nodes with multiple processors, these
> allocations become one of the main memory consumers.
>
> Memory allocated during new cpu cgroup creation:
> common part:    ~11Kb   +  318 bytes percpu
> cpu cgroup:     ~2.5Kb  + 1036 bytes percpu
>
> Accounting for this memory helps to avoid misuse inside memcg-limited
> containers.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>


* Re: [PATCH 3/4] memcg: enable accounting for struct cgroup
  2022-05-20 20:16                     ` Vasily Averin
  2022-05-21  0:55                       ` Roman Gushchin
@ 2022-05-23 13:52                       ` Michal Koutný
  1 sibling, 0 replies; 139+ messages in thread
From: Michal Koutný @ 2022-05-23 13:52 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Shakeel Butt, kernel, linux-kernel,
	Vlastimil Babka, Michal Hocko, cgroups

On Fri, May 20, 2022 at 11:16:32PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> common part: 	~11Kb	+  318 bytes percpu
> memcg: 		~17Kb	+ 4692 bytes percpu
> cpu:		~2.5Kb	+ 1036 bytes percpu
> cpuset:		~3Kb	+   12 bytes percpu
> blkcg:		~3Kb	+   12 bytes percpu
> pid:		~1.5Kb	+   12 bytes percpu		
> perf:		 ~320b	+   60 bytes percpu
> -------------------------------------------
> total:		~38Kb	+ 6142 bytes percpu
> currently accounted:	  4668 bytes percpu

Thanks for the breakdown and this overview!

Michal
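
Note on the quoted overview: the percpu column can be checked, and scaled to a given CPU count, with a small userspace sketch. The struct and function names are hypothetical; the per-controller figures and the 4668 "currently accounted" value are copied from the quoted table.

```c
#include <assert.h>
#include <stddef.h>

/* percpu bytes allocated per new cgroup, by controller (from the table). */
struct ctrl_cost {
	const char *name;
	unsigned long percpu;
};

static const struct ctrl_cost cgroup_costs[] = {
	{ "common", 318 }, { "memcg", 4692 }, { "cpu", 1036 },
	{ "cpuset", 12 }, { "blkcg", 12 }, { "pid", 12 }, { "perf", 60 },
};

#define NCOSTS (sizeof(cgroup_costs) / sizeof(cgroup_costs[0]))

/* "currently accounted" percpu bytes, as stated in the table. */
#define ACCOUNTED_PERCPU 4668UL

/* Sum of the percpu column over all controllers. */
static unsigned long percpu_total(void)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < NCOSTS; i++)
		sum += cgroup_costs[i].percpu;
	return sum;
}

/* percpu bytes per cgroup that are not yet charged, on ncpus CPUs. */
static unsigned long percpu_unaccounted(unsigned int ncpus)
{
	assert(ncpus > 0);
	return (percpu_total() - ACCOUNTED_PERCPU) * ncpus;
}
```

percpu_total() returns 6142, matching the "total" row, which leaves 1474 bytes per CPU unaccounted for each new cgroup before the series is applied.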


* Re: [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
  2022-05-21 16:38                   ` [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
  2022-05-21 21:34                     ` Shakeel Butt
  2022-05-22  6:40                     ` Muchun Song
@ 2022-05-25  1:30                     ` Roman Gushchin
  2 siblings, 0 replies; 139+ messages in thread
From: Roman Gushchin @ 2022-05-25  1:30 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:38:21PM +0300, Vasily Averin wrote:
> struct psi_group_cpu is percpu allocated for each new cgroup and can
> consume a significant portion of all allocated memory on nodes with
> a large number of CPUs.
> 
> Common part of the cgroup creation:
> Allocs  Alloc   $1*$2   Sum     Allocation
> number  size
> --------------------------------------------
> 16  ~   352     5632    5632    KERNFS
> 1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
> 1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
> 1       192     192     10504   (__d_alloc+0x29)
> 2       72      144     10648   (avc_alloc_node+0x27)
> 2       64      128     10776   (percpu_ref_init+0x6a)
> 1       64      64      10840   (memcg_list_lru_alloc+0x21a)
> percpu:
> 1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
> 1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
> 2       12      24      312     call_site=percpu_ref_init+0x23
> 1       6       6       318     call_site=__percpu_counter_init+0x22
> 
>  '+' -- to be accounted,
>  '~' -- partially accounted
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


* Re: [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu
  2022-05-21 16:38                   ` [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
                                       ` (2 preceding siblings ...)
  2022-05-21 22:05                     ` kernel test robot
@ 2022-05-25  1:31                     ` Roman Gushchin
  3 siblings, 0 replies; 139+ messages in thread
From: Roman Gushchin @ 2022-05-25  1:31 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:38:31PM +0300, Vasily Averin wrote:
> struct cgroup_rstat_cpu is percpu allocated for each new cgroup and
> can consume a significant portion of all allocated memory on nodes
> with a large number of CPUs.
> 
> Common part of the cgroup creation:
> Allocs  Alloc   $1*$2   Sum	Allocation
> number  size
> --------------------------------------------
> 16  ~   352     5632    5632    KERNFS
> 1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
> 1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
> 1       192     192     10504   (__d_alloc+0x29)
> 2       72      144     10648   (avc_alloc_node+0x27)
> 2       64      128     10776   (percpu_ref_init+0x6a)
> 1       64      64      10840   (memcg_list_lru_alloc+0x21a)
> percpu:
> 1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
> 1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
> 2       12      24      312     call_site=percpu_ref_init+0x23
> 1       6       6       318     call_site=__percpu_counter_init+0x22
> 
>  '+' -- to be accounted,
>  '~' -- partially accounted
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq
  2022-05-21 16:39                   ` [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq Vasily Averin
  2022-05-21 21:37                     ` Shakeel Butt
@ 2022-05-25  1:31                     ` Roman Gushchin
  1 sibling, 0 replies; 139+ messages in thread
From: Roman Gushchin @ 2022-05-25  1:31 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

On Sat, May 21, 2022 at 07:39:03PM +0300, Vasily Averin wrote:
> If enabled in config, alloc_rt_sched_group() is called for each new
> cpu cgroup and allocates a huge (~1700 bytes) percpu struct rt_rq.
> This significantly exceeds the size of the percpu allocation in the
> common part of cgroup creation.
> 
> Memory allocated during new cpu cgroup creation
> (with enabled RT_GROUP_SCHED):
> common part:    ~11Kb   +   318 bytes percpu
> cpu cgroup:     ~2.5Kb  + ~2800 bytes percpu
> 
> Accounting for this memory helps to avoid misuse inside memcg-limited
> containers.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-21 16:37                   ` [PATCH mm v2 0/9] " Vasily Averin
@ 2022-05-30 11:25                     ` Vasily Averin
  2022-05-30 11:55                       ` Michal Hocko
                                         ` (10 more replies)
       [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
  1 sibling, 11 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 11:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
4-CPU VM with Fedora and a self-compiled upstream kernel. The calculations
are not precise: they depend on kernel config options, the number of CPUs
and the enabled controllers, and they ignore possible page allocations etc.
However, this is enough to clarify the general situation.
All allocations are split into:
- common part, always called for each cgroup type
- per-cgroup allocations

In each group we consider 2 corner cases:
- usual allocations, important for 1-2 CPU nodes/VMs
- percpu allocations, important for 'big irons'

common part: 	~11Kb	+  318 bytes percpu
memcg: 		~17Kb	+ 4692 bytes percpu
cpu:		~2.5Kb	+ 1036 bytes percpu
cpuset:		~3Kb	+   12 bytes percpu
blkcg:		~3Kb	+   12 bytes percpu
pid:		~1.5Kb	+   12 bytes percpu		
perf:		 ~320b	+   60 bytes percpu
-------------------------------------------
total:		~38Kb	+ 6142 bytes percpu
currently accounted:	  4668 bytes percpu

- It is important to account the usual allocations made in the common
part, because almost all of the cgroup-specific allocations are small.
The one exception is the memory cgroup: it allocates a few huge objects
that should be accounted.
- Percpu allocations made in the common part and in the memcg and cpu
cgroups should be accounted; the remaining ones are small and can be
ignored.
- KERNFS objects are allocated both in the common part and in most of
the cgroups.
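
As a rough illustration of how the totals above scale, here is a minimal
userspace sketch in C. It assumes that each "bytes percpu" figure is a
per-CPU size that is multiplied by the number of possible CPUs, and it
treats "~38Kb" as exactly 38 * 1024 bytes. The macro and function names
are hypothetical and are not part of this series.

```c
#include <assert.h>
#include <stddef.h>

/* Figures copied from the summary table; REGULAR_BYTES rounds "~38Kb". */
#define REGULAR_BYTES          (38u * 1024u)
#define PERCPU_BYTES_TOTAL     6142u
#define PERCPU_BYTES_ACCOUNTED 4668u

/* Estimated memory consumed by one mkdir with all controllers enabled. */
static size_t cgroup_footprint_bytes(size_t nr_cpus)
{
	assert(nr_cpus > 0);
	return REGULAR_BYTES + PERCPU_BYTES_TOTAL * nr_cpus;
}

/* Percpu bytes that are not accounted to memcg before this series. */
static size_t percpu_unaccounted_bytes(size_t nr_cpus)
{
	assert(nr_cpus > 0);
	return (PERCPU_BYTES_TOTAL - PERCPU_BYTES_ACCOUNTED) * nr_cpus;
}
```

Under these assumptions cgroup_footprint_bytes(4) is 63480 bytes on the
4-CPU VM used for tracing, while cgroup_footprint_bytes(64) is 432000
bytes, most of which is percpu memory.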

Details can be found here:
https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/

I checked the other cgroup types and found that they can all be ignored.
Additionally, I found an allocation of struct rt_rq made in the cpu cgroup
when CONFIG_RT_GROUP_SCHED is enabled; it allocates a huge (~1700 bytes)
percpu structure and should be accounted too.


v3:
 1) re-based to current upstream (v5.18-11267-gb00ed48bb0a7)
 2) fixed a few typos
 3) added received approvals

v2:
 1) re-split to simplify possible bisect, re-ordered
 2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
     allocated in common part
 3) added accounting for percpu allocation of struct rt_rq
     (relevant if CONFIG_RT_GROUP_SCHED is enabled)
 4) improved patch descriptions


Vasily Averin (9):
  memcg: enable accounting for struct cgroup
  memcg: enable accounting for kernfs nodes
  memcg: enable accounting for kernfs iattrs
  memcg: enable accounting for struct simple_xattr
  memcg: enable accounting for percpu allocation of struct psi_group_cpu
  memcg: enable accounting for percpu allocation of struct
    cgroup_rstat_cpu
  memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  memcg: enable accounting for allocations in alloc_fair_sched_group
  memcg: enable accounting for perpu allocation of struct rt_rq

 fs/kernfs/mount.c      | 6 ++++--
 fs/xattr.c             | 2 +-
 kernel/cgroup/cgroup.c | 2 +-
 kernel/cgroup/rstat.c  | 3 ++-
 kernel/sched/fair.c    | 4 ++--
 kernel/sched/psi.c     | 3 ++-
 kernel/sched/rt.c      | 2 +-
 mm/memcontrol.c        | 4 ++--
 8 files changed, 15 insertions(+), 11 deletions(-)

-- 
2.36.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH mm v3 1/9] memcg: enable accounting for struct cgroup
       [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
@ 2022-05-30 11:25                       ` Vasily Averin
  2022-05-30 11:26                       ` [PATCH mm v3 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
                                         ` (7 subsequent siblings)
  8 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 11:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creating each new cgroup allocates 4Kb for struct cgroup. This is the
largest memory allocation in this scenario and is especially important
for small VMs with 1-2 CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Accounting of this memory helps to avoid misuse inside memcg-limited
containers.
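
The "Sum" column of the common-part table can be re-derived with a
small userspace sketch in C. The struct and function names below are
hypothetical; only the row values are taken from the table above.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the allocation table: number of allocations, object size. */
struct alloc_row {
	size_t count;
	size_t size;
};

/* Rows of the "common part" table, in bytes (KERNFS row first). */
static const struct alloc_row common_rows[] = {
	{ 16, 352 }, { 1, 4096 }, { 1, 584 }, { 1, 192 },
	{ 2, 72 }, { 2, 64 }, { 1, 64 },
};

/* Sum count * size over all rows, i.e. the last value of "Sum". */
static size_t table_total(const struct alloc_row *rows, size_t n)
{
	size_t total = 0;

	assert(rows != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		total += rows[i].count * rows[i].size;
	return total;
}
```

Calling table_total(common_rows, 7) gives 10840, matching the table;
the 4096-byte struct cgroup alone is roughly 38% of that total.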

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..1be0f81fe8e1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5353,7 +5353,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
 	cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v3 2/9] memcg: enable accounting for kernfs nodes
       [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
  2022-05-30 11:25                       ` [PATCH mm v3 1/9] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-05-30 11:26                       ` Vasily Averin
  2022-05-30 11:26                       ` [PATCH mm v3 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
                                         ` (6 subsequent siblings)
  8 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects; however, there are a few
scenarios where they consume a significant part of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, of which ~10Kb are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc   Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)	kernfs node
1   +   88      (__kernfs_iattrs+0x57)		kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)	simple_xattr, can grow over 4Kb
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for the kernfs nodes slab cache.
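
The per-node figures above can be cross-checked with a minimal userspace
sketch in C. The enum and function names are hypothetical; the sizes are
the ones listed in the table, and the real sizes depend on the kernel
configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-node object sizes in bytes, copied from the table above. */
enum {
	KERNFS_NODE_BYTES   = 128,	/* __kernfs_new_node */
	KERNFS_IATTRS_BYTES = 88,	/* __kernfs_iattrs */
	SIMPLE_XATTR_BYTES  = 96,	/* simple_xattr_alloc */
	OTHER_BYTES         = 32 + 8,	/* small objects, not accounted */
};

/* Slab memory taken by the kernfs_node objects alone. */
static size_t kernfs_node_cache_bytes(size_t nr_nodes)
{
	assert(nr_nodes <= SIZE_MAX / KERNFS_NODE_BYTES); /* no overflow */
	return nr_nodes * KERNFS_NODE_BYTES;
}

/* Total for one node together with all of its companion objects. */
static size_t kernfs_per_node_total(void)
{
	return KERNFS_NODE_BYTES + KERNFS_IATTRS_BYTES +
	       SIMPLE_XATTR_BYTES + OTHER_BYTES;
}
```

With these numbers kernfs_node_cache_bytes(80) is 10240 bytes, which
corresponds to the ~10Kb quoted for a new netdevice, and
kernfs_per_node_total() is 352 bytes per node.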

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..3ac4191b1c40 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,7 +391,8 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v3 3/9] memcg: enable accounting for kernfs iattrs
       [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
  2022-05-30 11:25                       ` [PATCH mm v3 1/9] memcg: enable accounting for struct cgroup Vasily Averin
  2022-05-30 11:26                       ` [PATCH mm v3 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
@ 2022-05-30 11:26                       ` Vasily Averin
  2022-05-30 11:26                       ` [PATCH mm v3 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
                                         ` (5 subsequent siblings)
  8 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects; however, there are a few
scenarios where they consume a significant part of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, of which ~10Kb are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc    Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)        kernfs node
1   +   88      (__kernfs_iattrs+0x57)          kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)       simple_xattr, can grow over 4Kb
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for the kernfs_iattrs_cache slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 3ac4191b1c40..40e896c7c86b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -397,5 +397,6 @@ void __init kernfs_init(void)
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
 					      sizeof(struct kernfs_iattrs),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 }
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v3 4/9] memcg: enable accounting for struct simple_xattr
       [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
                                         ` (2 preceding siblings ...)
  2022-05-30 11:26                       ` [PATCH mm v3 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
@ 2022-05-30 11:26                       ` Vasily Averin
  2022-05-30 11:26                       ` [PATCH mm v3 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
                                         ` (4 subsequent siblings)
  8 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects; however, there are a few
scenarios where they consume a significant part of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, of which ~10Kb are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc   Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)        kernfs node
1   +   88      (__kernfs_iattrs+0x57)          kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)       simple_xattr
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for struct simple_xattr. The size of this
structure depends on userspace and can grow beyond 4Kb.
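
To illustrate why the size is controlled by userspace, here is a minimal
userspace sketch in C of the length computation. struct demo_xattr and
demo_xattr_len() are hypothetical stand-ins, not the kernel definitions;
the overflow check imitates the "len < sizeof(*new_xattr)" test visible
in the diff context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct simple_xattr: header + inline value. */
struct demo_xattr {
	size_t size;
	char value[];
};

/*
 * Compute the allocation length for a value of value_size bytes.
 * Returns false if the addition wraps around, in which case the
 * allocation must be refused.
 */
static bool demo_xattr_len(size_t value_size, size_t *len)
{
	assert(len != NULL);
	*len = sizeof(struct demo_xattr) + value_size;
	return *len >= sizeof(struct demo_xattr);
}
```

A 4096-byte value already yields a length above 4Kb because the header
is added on top, while a value_size of SIZE_MAX is rejected as an
overflow.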

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/xattr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index e8dd03e4561e..98dcf6600bd9 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -1001,7 +1001,7 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
 	if (len < sizeof(*new_xattr))
 		return NULL;
 
-	new_xattr = kvmalloc(len, GFP_KERNEL);
+	new_xattr = kvmalloc(len, GFP_KERNEL_ACCOUNT);
 	if (!new_xattr)
 		return NULL;
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v3 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
       [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
                                         ` (3 preceding siblings ...)
  2022-05-30 11:26                       ` [PATCH mm v3 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
@ 2022-05-30 11:26                       ` Vasily Averin
  2022-05-30 11:26                       ` [PATCH mm v3 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
                                         ` (3 subsequent siblings)
  8 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

struct psi_group_cpu is percpu allocated for each new cgroup and can
consume a significant portion of all allocated memory on nodes with
a large number of CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 kernel/sched/psi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a337f3e35997..f3ec8553283e 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -957,7 +957,8 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	if (static_branch_likely(&psi_disabled))
 		return 0;
 
-	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
+	cgroup->psi.pcpu = alloc_percpu_gfp(struct psi_group_cpu,
+					    GFP_KERNEL_ACCOUNT);
 	if (!cgroup->psi.pcpu)
 		return -ENOMEM;
 	group_init(&cgroup->psi);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v3 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu
       [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
                                         ` (4 preceding siblings ...)
  2022-05-30 11:26                       ` [PATCH mm v3 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
@ 2022-05-30 11:26                       ` Vasily Averin
  2022-05-30 15:04                         ` Muchun Song
  2022-05-30 11:26                       ` [PATCH mm v3 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
                                         ` (2 subsequent siblings)
  8 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

struct cgroup_rstat_cpu is percpu allocated for each new cgroup and
can consume a significant portion of all allocated memory on nodes
with a large number of CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum	Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 kernel/cgroup/rstat.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 24b5c2ab5598..2904b185b01b 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -257,7 +257,8 @@ int cgroup_rstat_init(struct cgroup *cgrp)
 
 	/* the root cgrp has rstat_cpu preallocated */
 	if (!cgrp->rstat_cpu) {
-		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+		cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu,
+						   GFP_KERNEL_ACCOUNT);
 		if (!cgrp->rstat_cpu)
 			return -ENOMEM;
 	}
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v3 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc
       [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
                                         ` (5 preceding siblings ...)
  2022-05-30 11:26                       ` [PATCH mm v3 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
@ 2022-05-30 11:26                       ` Vasily Averin
  2022-05-30 11:26                       ` [PATCH mm v3 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
  2022-05-30 11:27                       ` [PATCH mm v3 9/9] memcg: enable accounting for perpu allocation of struct rt_rq Vasily Averin
  8 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creating each memory cgroup allocates a few huge objects in
mem_cgroup_css_alloc(). Their size exceeds the size of the memory
accounted in the common part of cgroup creation:

common part: 	~11Kb	+  318 bytes percpu
memcg: 		~17Kb	+ 4692 bytes percpu

memory:
------
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
1   +   8192    8192    8192    (mem_cgroup_css_alloc+0x4a) <NB
14  ~   352     4928    13120   KERNFS
1   +   2048    2048    15168   (mem_cgroup_css_alloc+0xdd) <NB
1       1024    1024    16192   (alloc_shrinker_info+0x79)
1       584     584     16776   (radix_tree_node_alloc.constprop.0+0x89)
2       64      128     16904   (percpu_ref_init+0x6a)
1       64      64      16968   (mem_cgroup_css_online+0x32)

1   =   3684    3684    3684    call_site=mem_cgroup_css_alloc+0x9e
1   =   984     984     4668    call_site=mem_cgroup_css_alloc+0xfd
2       12      24      4692    call_site=percpu_ref_init+0x23

     '=' -- already accounted,
     '+' -- to be accounted,
     '~' -- partially accounted

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index abec50f31fe6..376734af8935 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5064,7 +5064,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
 
-	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, node);
+	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL_ACCOUNT, node);
 	if (!pn)
 		return 1;
 
@@ -5116,7 +5116,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	int __maybe_unused i;
 	long error = -ENOMEM;
 
-	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL);
+	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL_ACCOUNT);
 	if (!memcg)
 		return ERR_PTR(error);
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v3 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group
       [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
                                         ` (6 preceding siblings ...)
  2022-05-30 11:26                       ` [PATCH mm v3 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
@ 2022-05-30 11:26                       ` Vasily Averin
  2022-05-30 11:27                       ` [PATCH mm v3 9/9] memcg: enable accounting for perpu allocation of struct rt_rq Vasily Averin
  8 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 11:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creating each new cpu cgroup allocates two 512-byte kernel objects
per CPU. This is especially important for cgroups sharing a parent memory
cgroup. In this scenario, on nodes with multiple processors, these
allocations become one of the main memory consumers.

Memory allocated during new cpu cgroup creation:
common part: 	~11Kb	+  318 bytes percpu
cpu cgroup:	~2.5Kb	+ 1036 bytes percpu

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.
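
A minimal userspace sketch in C of how these per-CPU objects add up. It
assumes exactly two 512-byte objects per possible CPU, as stated above;
the macro and function names are hypothetical and the real object sizes
depend on the kernel configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per the description: two 512-byte objects (cfs_rq and se) per CPU. */
#define FAIR_OBJS_PER_CPU 2u
#define FAIR_OBJ_BYTES    512u

/* Bytes allocated for one new cpu cgroup across all possible CPUs. */
static size_t fair_group_bytes(size_t nr_possible_cpus)
{
	const size_t per_cpu = FAIR_OBJS_PER_CPU * FAIR_OBJ_BYTES;

	assert(nr_possible_cpus <= SIZE_MAX / per_cpu); /* no overflow */
	return nr_possible_cpus * per_cpu;
}
```

For 64 possible CPUs fair_group_bytes(64) is 65536 bytes, several times
larger than the ~11Kb common part of cgroup creation.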

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c5b74f66bd3..f4fc39d5aa4b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11499,12 +11499,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
-				      GFP_KERNEL, cpu_to_node(i));
+				      GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!cfs_rq)
 			goto err;
 
 		se = kzalloc_node(sizeof(struct sched_entity_stats),
-				  GFP_KERNEL, cpu_to_node(i));
+				  GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!se)
 			goto err_free_rq;
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v3 9/9] memcg: enable accounting for perpu allocation of struct rt_rq
       [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
                                         ` (7 preceding siblings ...)
  2022-05-30 11:26                       ` [PATCH mm v3 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
@ 2022-05-30 11:27                       ` Vasily Averin
  2022-05-30 15:06                         ` Muchun Song
  8 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 11:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

If enabled in config, alloc_rt_sched_group() is called for each new
cpu cgroup and allocates a huge (~1700 bytes) percpu struct rt_rq.
This significantly exceeds the size of the percpu allocation in the
common part of cgroup creation.

Memory allocated during new cpu cgroup creation
(with enabled RT_GROUP_SCHED):
common part:    ~11Kb   +   318 bytes percpu
cpu cgroup:     ~2.5Kb  + ~2800 bytes percpu

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 kernel/sched/rt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8c9ed9664840..44a8fc096e33 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -256,7 +256,7 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
 
 	for_each_possible_cpu(i) {
 		rt_rq = kzalloc_node(sizeof(struct rt_rq),
-				     GFP_KERNEL, cpu_to_node(i));
+				     GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!rt_rq)
 			goto err;
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
@ 2022-05-30 11:55                       ` Michal Hocko
  2022-05-30 13:09                         ` Vasily Averin
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
                                         ` (9 subsequent siblings)
  10 siblings, 1 reply; 139+ messages in thread
From: Michal Hocko @ 2022-05-30 11:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, cgroups

On Mon 30-05-22 14:25:45, Vasily Averin wrote:
> Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
> 4-CPU VM with Fedora and a self-compiled upstream kernel. The calculations
> are not precise: they depend on kernel config options, the number of CPUs
> and the enabled controllers, and they ignore possible page allocations etc.
> However, this is enough to clarify the general situation.
> All allocations are split into:
> - common part, always called for each cgroup type
> - per-cgroup allocations
> 
> In each group we consider 2 corner cases:
> - usual allocations, important for 1-2 CPU nodes/VMs
> - percpu allocations, important for 'big irons'
> 
> common part: 	~11Kb	+  318 bytes percpu
> memcg: 		~17Kb	+ 4692 bytes percpu
> cpu:		~2.5Kb	+ 1036 bytes percpu
> cpuset:		~3Kb	+   12 bytes percpu
> blkcg:		~3Kb	+   12 bytes percpu
> pid:		~1.5Kb	+   12 bytes percpu		
> perf:		 ~320b	+   60 bytes percpu
> -------------------------------------------
> total:		~38Kb	+ 6142 bytes percpu
> currently accounted:	  4668 bytes percpu
> 
> - it's important to account usual allocations called
> in common part, because almost all of cgroup-specific allocations
> are small. One exception here is memory cgroup, it allocates a few
> huge objects that should be accounted.
> - Percpu allocation called in common part, in memcg and cpu cgroups
> should be accounted, rest ones are small an can be ignored.
> - KERNFS objects are allocated both in common part and in most of
> cgroups 
> 
> Details can be found here:
> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
> 
> I checked other cgroups types was found that they all can be ignored.
> Additionally I found allocation of struct rt_rq called in cpu cgroup 
> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
> percpu structure and should be accounted too.

One thing the changelog is missing is an explanation of why we need
to account those objects. Users are usually not empowered to create
cgroups arbitrarily. Or at least they shouldn't be, because we can expect
more problems to happen.

Could you clarify this please?
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 11:55                       ` Michal Hocko
@ 2022-05-30 13:09                         ` Vasily Averin
  2022-05-30 14:22                           ` Michal Hocko
  0 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 13:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, cgroups

On 5/30/22 14:55, Michal Hocko wrote:
> On Mon 30-05-22 14:25:45, Vasily Averin wrote:
>> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on 
>> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
>> are not precise, it depends on kernel config options, number of cpus,
>> enabled controllers, ignores possible page allocations etc.
>> However this is enough to clarify the general situation.
>> All allocations are splited into:
>> - common part, always called for each cgroup type
>> - per-cgroup allocations
>>
>> In each group we consider 2 corner cases:
>> - usual allocations, important for 1-2 CPU nodes/Vms
>> - percpu allocations, important for 'big irons'
>>
>> common part: 	~11Kb	+  318 bytes percpu
>> memcg: 		~17Kb	+ 4692 bytes percpu
>> cpu:		~2.5Kb	+ 1036 bytes percpu
>> cpuset:		~3Kb	+   12 bytes percpu
>> blkcg:		~3Kb	+   12 bytes percpu
>> pid:		~1.5Kb	+   12 bytes percpu		
>> perf:		 ~320b	+   60 bytes percpu
>> -------------------------------------------
>> total:		~38Kb	+ 6142 bytes percpu
>> currently accounted:	  4668 bytes percpu
>>
>> - it's important to account usual allocations called
>> in common part, because almost all of cgroup-specific allocations
>> are small. One exception here is memory cgroup, it allocates a few
>> huge objects that should be accounted.
>> - Percpu allocation called in common part, in memcg and cpu cgroups
>> should be accounted, rest ones are small an can be ignored.
>> - KERNFS objects are allocated both in common part and in most of
>> cgroups 
>>
>> Details can be found here:
>> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
>>
>> I checked other cgroups types was found that they all can be ignored.
>> Additionally I found allocation of struct rt_rq called in cpu cgroup 
>> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
>> percpu structure and should be accounted too.
> 
> One thing that the changelog is missing is an explanation why do we need
> to account those objects. Users are usually not empowered to create
> cgroups arbitrarily. Or at least they shouldn't because we can expect
> more problems to happen.
> 
> Could you clarify this please?

The problem is relevant for OS-level containers such as LXC or OpenVz.
They are widely used for hosting and allow untrusted end users to run
containers. Root inside such a container is able to create cgroups
inside its own container and consume host memory without proper
accounting.

Thank you,
	Vasily Averin

* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 13:09                         ` Vasily Averin
@ 2022-05-30 14:22                           ` Michal Hocko
  2022-05-30 19:58                             ` Vasily Averin
  0 siblings, 1 reply; 139+ messages in thread
From: Michal Hocko @ 2022-05-30 14:22 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, cgroups

On Mon 30-05-22 16:09:00, Vasily Averin wrote:
> On 5/30/22 14:55, Michal Hocko wrote:
> > On Mon 30-05-22 14:25:45, Vasily Averin wrote:
> >> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on 
> >> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
> >> are not precise, it depends on kernel config options, number of cpus,
> >> enabled controllers, ignores possible page allocations etc.
> >> However this is enough to clarify the general situation.
> >> All allocations are splited into:
> >> - common part, always called for each cgroup type
> >> - per-cgroup allocations
> >>
> >> In each group we consider 2 corner cases:
> >> - usual allocations, important for 1-2 CPU nodes/Vms
> >> - percpu allocations, important for 'big irons'
> >>
> >> common part: 	~11Kb	+  318 bytes percpu
> >> memcg: 		~17Kb	+ 4692 bytes percpu
> >> cpu:		~2.5Kb	+ 1036 bytes percpu
> >> cpuset:		~3Kb	+   12 bytes percpu
> >> blkcg:		~3Kb	+   12 bytes percpu
> >> pid:		~1.5Kb	+   12 bytes percpu		
> >> perf:		 ~320b	+   60 bytes percpu
> >> -------------------------------------------
> >> total:		~38Kb	+ 6142 bytes percpu
> >> currently accounted:	  4668 bytes percpu
> >>
> >> - it's important to account usual allocations called
> >> in common part, because almost all of cgroup-specific allocations
> >> are small. One exception here is memory cgroup, it allocates a few
> >> huge objects that should be accounted.
> >> - Percpu allocation called in common part, in memcg and cpu cgroups
> >> should be accounted, rest ones are small an can be ignored.
> >> - KERNFS objects are allocated both in common part and in most of
> >> cgroups 
> >>
> >> Details can be found here:
> >> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
> >>
> >> I checked other cgroups types was found that they all can be ignored.
> >> Additionally I found allocation of struct rt_rq called in cpu cgroup 
> >> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
> >> percpu structure and should be accounted too.
> > 
> > One thing that the changelog is missing is an explanation why do we need
> > to account those objects. Users are usually not empowered to create
> > cgroups arbitrarily. Or at least they shouldn't because we can expect
> > more problems to happen.
> > 
> > Could you clarify this please?
> 
> The problem is actual for OS-level containers: LXC or OpenVz.
> They are widely used for hosting and allow to run containers
> by untrusted end-users. Root inside such containers is able
> to create groups inside own container and consume host memory
> without its proper accounting.

Is the unaccounted memory really the biggest problem here?
IIRC having really huge cgroup trees can hurt quite a few controllers.
E.g. how does the cpu controller deal with too many or too deep
hierarchies?

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH mm v3 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu
  2022-05-30 11:26                       ` [PATCH mm v3 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
@ 2022-05-30 15:04                         ` Muchun Song
  0 siblings, 0 replies; 139+ messages in thread
From: Muchun Song @ 2022-05-30 15:04 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Michal Hocko, cgroups

On Mon, May 30, 2022 at 02:26:36PM +0300, Vasily Averin wrote:
> struct cgroup_rstat_cpu is percpu allocated for each new cgroup and
> can consume a significant portion of all allocated memory on nodes
> with a large number of CPUs.
> 
> Common part of the cgroup creation:
> Allocs  Alloc   $1*$2   Sum	Allocation
> number  size
> --------------------------------------------
> 16  ~   352     5632    5632    KERNFS
> 1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
> 1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
> 1       192     192     10504   (__d_alloc+0x29)
> 2       72      144     10648   (avc_alloc_node+0x27)
> 2       64      128     10776   (percpu_ref_init+0x6a)
> 1       64      64      10840   (memcg_list_lru_alloc+0x21a)
> percpu:
> 1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
> 1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
> 2       12      24      312     call_site=percpu_ref_init+0x23
> 1       6       6       318     call_site=__percpu_counter_init+0x22
> 
>  '+' -- to be accounted,
>  '~' -- partially accounted
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Thanks.

* Re: [PATCH mm v3 9/9] memcg: enable accounting for perpu allocation of struct rt_rq
  2022-05-30 11:27                       ` [PATCH mm v3 9/9] memcg: enable accounting for perpu allocation of struct rt_rq Vasily Averin
@ 2022-05-30 15:06                         ` Muchun Song
  0 siblings, 0 replies; 139+ messages in thread
From: Muchun Song @ 2022-05-30 15:06 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, LKML, Linux Memory Management List,
	Shakeel Butt, Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Michal Hocko, Cgroups

On Mon, May 30, 2022 at 7:27 PM Vasily Averin <vvs@openvz.org> wrote:
>
> If enabled in config, alloc_rt_sched_group() is called for each new
> cpu cgroup and allocates a huge (~1700 bytes) percpu struct rt_rq.
> This significantly exceeds the size of the percpu allocation in the
> common part of cgroup creation.
>
> Memory allocated during new cpu cgroup creation
> (with enabled RT_GROUP_SCHED):
> common part:    ~11Kb   +   318 bytes percpu
> cpu cgroup:     ~2.5Kb  + ~2800 bytes percpu
>
> Accounting for this memory helps to avoid misuse inside memcg-limited
> containers.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Thanks.

* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 14:22                           ` Michal Hocko
@ 2022-05-30 19:58                             ` Vasily Averin
  2022-05-31  7:16                               ` Michal Hocko
  0 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-05-30 19:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, cgroups

On 5/30/22 17:22, Michal Hocko wrote:
> On Mon 30-05-22 16:09:00, Vasily Averin wrote:
>> On 5/30/22 14:55, Michal Hocko wrote:
>>> On Mon 30-05-22 14:25:45, Vasily Averin wrote:
>>>> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on 
>>>> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
>>>> are not precise, it depends on kernel config options, number of cpus,
>>>> enabled controllers, ignores possible page allocations etc.
>>>> However this is enough to clarify the general situation.
>>>> All allocations are splited into:
>>>> - common part, always called for each cgroup type
>>>> - per-cgroup allocations
>>>>
>>>> In each group we consider 2 corner cases:
>>>> - usual allocations, important for 1-2 CPU nodes/Vms
>>>> - percpu allocations, important for 'big irons'
>>>>
>>>> common part: 	~11Kb	+  318 bytes percpu
>>>> memcg: 		~17Kb	+ 4692 bytes percpu
>>>> cpu:		~2.5Kb	+ 1036 bytes percpu
>>>> cpuset:		~3Kb	+   12 bytes percpu
>>>> blkcg:		~3Kb	+   12 bytes percpu
>>>> pid:		~1.5Kb	+   12 bytes percpu		
>>>> perf:		 ~320b	+   60 bytes percpu
>>>> -------------------------------------------
>>>> total:		~38Kb	+ 6142 bytes percpu
>>>> currently accounted:	  4668 bytes percpu
>>>>
>>>> - it's important to account usual allocations called
>>>> in common part, because almost all of cgroup-specific allocations
>>>> are small. One exception here is memory cgroup, it allocates a few
>>>> huge objects that should be accounted.
>>>> - Percpu allocation called in common part, in memcg and cpu cgroups
>>>> should be accounted, rest ones are small an can be ignored.
>>>> - KERNFS objects are allocated both in common part and in most of
>>>> cgroups 
>>>>
>>>> Details can be found here:
>>>> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
>>>>
>>>> I checked other cgroups types was found that they all can be ignored.
>>>> Additionally I found allocation of struct rt_rq called in cpu cgroup 
>>>> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
>>>> percpu structure and should be accounted too.
>>>
>>> One thing that the changelog is missing is an explanation why do we need
>>> to account those objects. Users are usually not empowered to create
>>> cgroups arbitrarily. Or at least they shouldn't because we can expect
>>> more problems to happen.
>>>
>>> Could you clarify this please?
>>
>> The problem is actual for OS-level containers: LXC or OpenVz.
>> They are widely used for hosting and allow to run containers
>> by untrusted end-users. Root inside such containers is able
>> to create groups inside own container and consume host memory
>> without its proper accounting.
> 
> Is the unaccounted memory really the biggest problem here?
> IIRC having really huge cgroup trees can hurt quite some controllers.
> E.g. how does the cpu controller deal with too many or too deep
> hierarchies?

Could you please describe it in more detail?
Maybe it passed me by, or maybe I missed or forgot something,
but I cannot remember any other practical cgroup-related issues.

Maybe deep hierarchies do not work well.
However, I have not heard that the internal configuration of a cgroup
can affect the upper level too.

Please let me know if this can happen; it is very interesting for us.

In our case, the hoster configures only the top level of the cgroup
and does not worry about possible misconfiguration inside containers
as long as it does not affect other containers or the host itself.

Unaccounted memory, on the contrary, can affect both neighboring
containers and the host system. We have seen this many times, and
therefore we pay special attention to such issues.

Thank you,
	Vasily Averin

* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 19:58                             ` Vasily Averin
@ 2022-05-31  7:16                               ` Michal Hocko
  2022-06-01  3:43                                 ` Vasily Averin
  0 siblings, 1 reply; 139+ messages in thread
From: Michal Hocko @ 2022-05-31  7:16 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, cgroups

On Mon 30-05-22 22:58:30, Vasily Averin wrote:
> On 5/30/22 17:22, Michal Hocko wrote:
> > On Mon 30-05-22 16:09:00, Vasily Averin wrote:
> >> On 5/30/22 14:55, Michal Hocko wrote:
> >>> On Mon 30-05-22 14:25:45, Vasily Averin wrote:
> >>>> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on 
> >>>> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
> >>>> are not precise, it depends on kernel config options, number of cpus,
> >>>> enabled controllers, ignores possible page allocations etc.
> >>>> However this is enough to clarify the general situation.
> >>>> All allocations are splited into:
> >>>> - common part, always called for each cgroup type
> >>>> - per-cgroup allocations
> >>>>
> >>>> In each group we consider 2 corner cases:
> >>>> - usual allocations, important for 1-2 CPU nodes/Vms
> >>>> - percpu allocations, important for 'big irons'
> >>>>
> >>>> common part: 	~11Kb	+  318 bytes percpu
> >>>> memcg: 		~17Kb	+ 4692 bytes percpu
> >>>> cpu:		~2.5Kb	+ 1036 bytes percpu
> >>>> cpuset:		~3Kb	+   12 bytes percpu
> >>>> blkcg:		~3Kb	+   12 bytes percpu
> >>>> pid:		~1.5Kb	+   12 bytes percpu		
> >>>> perf:		 ~320b	+   60 bytes percpu
> >>>> -------------------------------------------
> >>>> total:		~38Kb	+ 6142 bytes percpu
> >>>> currently accounted:	  4668 bytes percpu
> >>>>
> >>>> - it's important to account usual allocations called
> >>>> in common part, because almost all of cgroup-specific allocations
> >>>> are small. One exception here is memory cgroup, it allocates a few
> >>>> huge objects that should be accounted.
> >>>> - Percpu allocation called in common part, in memcg and cpu cgroups
> >>>> should be accounted, rest ones are small an can be ignored.
> >>>> - KERNFS objects are allocated both in common part and in most of
> >>>> cgroups 
> >>>>
> >>>> Details can be found here:
> >>>> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
> >>>>
> >>>> I checked other cgroups types was found that they all can be ignored.
> >>>> Additionally I found allocation of struct rt_rq called in cpu cgroup 
> >>>> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
> >>>> percpu structure and should be accounted too.
> >>>
> >>> One thing that the changelog is missing is an explanation why do we need
> >>> to account those objects. Users are usually not empowered to create
> >>> cgroups arbitrarily. Or at least they shouldn't because we can expect
> >>> more problems to happen.
> >>>
> >>> Could you clarify this please?
> >>
> >> The problem is actual for OS-level containers: LXC or OpenVz.
> >> They are widely used for hosting and allow to run containers
> >> by untrusted end-users. Root inside such containers is able
> >> to create groups inside own container and consume host memory
> >> without its proper accounting.
> > 
> > Is the unaccounted memory really the biggest problem here?
> > IIRC having really huge cgroup trees can hurt quite some controllers.
> > E.g. how does the cpu controller deal with too many or too deep
> > hierarchies?
> 
> Could you please describe it in more details?
> Maybe it was passed me by, maybe I messed or forgot something,
> however I cannot remember any other practical cgroup-related issues.
> 
> Maybe deep hierarchies does not work well.
> however, I have not heard that the internal configuration of cgroup
> can affect the upper level too.

My first thought was any controller with fixed math constraints, like
the cpu controller. But I have to admit that I haven't really checked
whether imprecision can accumulate and propagate outside of the
hierarchy.

Another concern I would have is ID space depletion. At least the memory
controller depends on idr ids, which have a rather limited space:
#define MEM_CGROUP_ID_MAX       USHRT_MAX

Also the runtime overhead would increase with a large number of cgroups.
Take global memory reclaim as an example. All the cgroups have to be
iterated. This will have an impact outside of the said hierarchy. One
could argue that limiting untrusted top level cgroups would be a certain
mitigation, but I can imagine this could easily get very non-trivial.

Anyway, let me just be explicit. I am not against these patches. In fact
I cannot really judge their overhead. But right now I am not really sure
they are going to help much against untrusted users.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-31  7:16                               ` Michal Hocko
@ 2022-06-01  3:43                                 ` Vasily Averin
  2022-06-01  9:15                                   ` Michal Koutný
  2022-06-01  9:26                                   ` Michal Hocko
  0 siblings, 2 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-01  3:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, cgroups

On 5/31/22 10:16, Michal Hocko wrote:
> On Mon 30-05-22 22:58:30, Vasily Averin wrote:
>> On 5/30/22 17:22, Michal Hocko wrote:
>>> On Mon 30-05-22 16:09:00, Vasily Averin wrote:
>>>> On 5/30/22 14:55, Michal Hocko wrote:
>>>>> On Mon 30-05-22 14:25:45, Vasily Averin wrote:
>>>>>> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on 
>>>>>> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
>>>>>> are not precise, it depends on kernel config options, number of cpus,
>>>>>> enabled controllers, ignores possible page allocations etc.
>>>>>> However this is enough to clarify the general situation.
>>>>>> All allocations are splited into:
>>>>>> - common part, always called for each cgroup type
>>>>>> - per-cgroup allocations
>>>>>>
>>>>>> In each group we consider 2 corner cases:
>>>>>> - usual allocations, important for 1-2 CPU nodes/Vms
>>>>>> - percpu allocations, important for 'big irons'
>>>>>>
>>>>>> common part: 	~11Kb	+  318 bytes percpu
>>>>>> memcg: 		~17Kb	+ 4692 bytes percpu
>>>>>> cpu:		~2.5Kb	+ 1036 bytes percpu
>>>>>> cpuset:		~3Kb	+   12 bytes percpu
>>>>>> blkcg:		~3Kb	+   12 bytes percpu
>>>>>> pid:		~1.5Kb	+   12 bytes percpu		
>>>>>> perf:		 ~320b	+   60 bytes percpu
>>>>>> -------------------------------------------
>>>>>> total:		~38Kb	+ 6142 bytes percpu
>>>>>> currently accounted:	  4668 bytes percpu
>>>>>>
>>>>>> - it's important to account usual allocations called
>>>>>> in common part, because almost all of cgroup-specific allocations
>>>>>> are small. One exception here is memory cgroup, it allocates a few
>>>>>> huge objects that should be accounted.
>>>>>> - Percpu allocation called in common part, in memcg and cpu cgroups
>>>>>> should be accounted, rest ones are small an can be ignored.
>>>>>> - KERNFS objects are allocated both in common part and in most of
>>>>>> cgroups 
>>>>>>
>>>>>> Details can be found here:
>>>>>> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
>>>>>>
>>>>>> I checked other cgroups types was found that they all can be ignored.
>>>>>> Additionally I found allocation of struct rt_rq called in cpu cgroup 
>>>>>> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
>>>>>> percpu structure and should be accounted too.
>>>>>
>>>>> One thing that the changelog is missing is an explanation why do we need
>>>>> to account those objects. Users are usually not empowered to create
>>>>> cgroups arbitrarily. Or at least they shouldn't because we can expect
>>>>> more problems to happen.
>>>>>
>>>>> Could you clarify this please?
>>>>
>>>> The problem is actual for OS-level containers: LXC or OpenVz.
>>>> They are widely used for hosting and allow to run containers
>>>> by untrusted end-users. Root inside such containers is able
>>>> to create groups inside own container and consume host memory
>>>> without its proper accounting.
>>>
>>> Is the unaccounted memory really the biggest problem here?
>>> IIRC having really huge cgroup trees can hurt quite some controllers.
>>> E.g. how does the cpu controller deal with too many or too deep
>>> hierarchies?
>>
>> Could you please describe it in more details?
>> Maybe it was passed me by, maybe I messed or forgot something,
>> however I cannot remember any other practical cgroup-related issues.
>>
>> Maybe deep hierarchies does not work well.
>> however, I have not heard that the internal configuration of cgroup
>> can affect the upper level too.
> 
> My first thought was any controller with a fixed math constrains like
> cpu controller. But I have to admit that I haven't really checked
> whether imprecision can accumulate and propagate outside of the
> hierarchy.
> 
> Another concern I would have is a id space depletion. At least memory
> controller depends on idr ids which have a space that is rather limited
> #define MEM_CGROUP_ID_MAX       USHRT_MAX
> 
> Also the runtime overhead would increase with a large number of cgroups.
> Take a global memory reclaim as an example. All the cgroups have to be
> iterated. This will have an impact outside of the said hierarchy. One
> could argue that limiting untrusted top level cgroups would be a certain
> mitigation but I can imagine this could get very non trivial easily.
> 
> Anyway, let me just be explicit. I am not against these patches. In fact
> I cannot really judge their overhead. But right now I am not really sure
> they are going to help much against untrusted users.

Thank you very much, this information is very valuable for us.

I understand your scepticism. The problem looks critical
for upstream-based LXC, and I don't understand well how to
properly protect it right now.

However, it isn't critical for OpenVz. Our kernel does not allow
changing cgroup.subgroups_limit from inside containers.

CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit 
512
CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
-bash: echo: write error: Operation not permitted
CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
-bash: echo: write error: Operation not permitted

I doubt this approach can be accepted upstream; however, for OpenVz
something like this is mandatory because it is much better
than nothing.

The number can be adjusted by the host admin. The current default limit
looks too small to me; however, it is not difficult to increase it
to a reasonable 10,000.

My experiments show that ~10000 cgroups consume 0.5 Gb of memory on a 4cpu VM.
On "big irons" it can easily grow up to several Gb. This is too much
for its accounting to be ignored.
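
As a rough cross-check of these figures, here is a minimal Python sketch
that scales the per-cgroup totals quoted earlier in this thread (~38Kb of
regular allocations plus 6142 bytes of percpu data per CPU) to 10000
cgroups. The function name, the 1Kb = 1024 bytes convention and the
128-CPU example are assumptions made for illustration; the real footprint
depends on kernel config, enabled controllers and the number of possible
CPUs.

```python
# Per-cgroup totals taken from the mkdir tracing summary quoted in this thread.
REGULAR_BYTES_PER_CGROUP = 38 * 1024   # "~38Kb", assuming 1Kb = 1024 bytes
PERCPU_BYTES_PER_CGROUP = 6142         # multiplied by the number of CPUs


def cgroup_footprint_bytes(nr_cgroups: int, nr_cpus: int) -> int:
    """Estimate memory consumed by nr_cgroups cgroups on an nr_cpus machine."""
    per_cgroup = REGULAR_BYTES_PER_CGROUP + PERCPU_BYTES_PER_CGROUP * nr_cpus
    return nr_cgroups * per_cgroup


if __name__ == "__main__":
    # 128 CPUs is an arbitrary stand-in for a "big iron" (assumption).
    for cpus in (4, 128):
        gib = cgroup_footprint_bytes(10_000, cpus) / 2**30
        print(f"{cpus:>3} CPUs: ~{gib:.2f} GiB for 10000 cgroups")
```

Under these assumptions the estimate is roughly 0.6 GiB on 4 CPUs and
several GiB on 128 CPUs, the same order of magnitude as the measurements
above.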

I agree, highly qualified people like you can find many other ways
of abuse anyway. However, OpenVz is trying to somehow prevent this,
not in upstream, unfortunately, but at least in our own kernel.

Thank you,
	Vasily Averin

* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01  3:43                                 ` Vasily Averin
@ 2022-06-01  9:15                                   ` Michal Koutný
  2022-06-01  9:32                                     ` Michal Hocko
  2022-06-01  9:26                                   ` Michal Hocko
  1 sibling, 1 reply; 139+ messages in thread
From: Michal Koutný @ 2022-06-01  9:15 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, Andrew Morton, kernel, linux-kernel, linux-mm,
	Shakeel Butt, Roman Gushchin, Vlastimil Babka, Muchun Song,
	cgroups

On Wed, Jun 01, 2022 at 06:43:27AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> 512
> CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> -bash: echo: write error: Operation not permitted
> CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> -bash: echo: write error: Operation not permitted
> 
> I doubt this way can be accepted in upstream, however for OpenVz
> something like this it is mandatory because it much better
> than nothing.

Is this customization of yours something like cgroup.max.descendants on
the unified (v2) hierarchy? (Just curious.)

(It can be made inaccessible from within the subtree either with cgroup
ns or good old FS permissions.)
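
For reference, on the unified hierarchy the limit lives in
cgroup.max.descendants and the current count is reported as
nr_descendants in cgroup.stat. The Python sketch below computes the
remaining headroom from the contents of those two files. The helper
names and the sample values in the usage note are hypothetical, and
reading the real files is left to the caller.

```python
def parse_cgroup_stat(text: str) -> dict:
    """Parse 'key value' lines as found in a cgroup v2 cgroup.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def descendants_headroom(max_text: str, stat_text: str) -> float:
    """Return how many more descendant cgroups may be created.

    max_text is the content of cgroup.max.descendants ("max" or a number),
    stat_text is the content of cgroup.stat.
    """
    used = parse_cgroup_stat(stat_text).get("nr_descendants", 0)
    limit = max_text.strip()
    if limit == "max":
        return float("inf")  # no limit configured
    return max(int(limit) - used, 0)
```

For example, a limit of 512 with nr_descendants at 500 leaves room for
12 more cgroups in that subtree.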

Michal

* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01  3:43                                 ` Vasily Averin
  2022-06-01  9:15                                   ` Michal Koutný
@ 2022-06-01  9:26                                   ` Michal Hocko
  1 sibling, 0 replies; 139+ messages in thread
From: Michal Hocko @ 2022-06-01  9:26 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, kernel, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, cgroups

On Wed 01-06-22 06:43:27, Vasily Averin wrote:
[...]
> However, it isn't critical for OpenVz. Our kernel does not allow
> to change of cgroup.subgroups_limit from inside containers.

What are the semantics of this limit?

> CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> 512
> CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> -bash: echo: write error: Operation not permitted
> CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> -bash: echo: write error: Operation not permitted
> 
> I doubt this way can be accepted in upstream, however for OpenVz
> something like this it is mandatory because it much better
> than nothing.
> 
> The number can be adjusted by host admin. The current default limit
> looks too small for me, however it is not difficult to increase it
> to a reasonable 10,000.
> 
> My experiments show that ~10000 cgroups consumes 0.5 Gb memory on 4cpu VM.
> On "big irons" it can easily grow up to several Gb. This is quite a lot
> to ignore its accounting.

Too many cgroups can certainly have a high memory footprint. I guess
this is quite clear. The question is whether trying to limit them by the
memory footprint is really the right way to go. I would be especially
worried about smaller machines, because their smaller per-cgroup
footprint would allow the id space to be depleted faster.
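
To illustrate the concern about small machines, here is a rough Python
sketch comparing how many cgroups fit under a memory limit with the
16-bit memcg id space (MEM_CGROUP_ID_MAX is USHRT_MAX, i.e. 65535). It
reuses the per-cgroup totals quoted earlier in the thread and assumes,
hypothetically, that all of that memory is accounted to the container.
The 4 GiB limit, the CPU counts and the function names are illustrative
assumptions only.

```python
MEM_CGROUP_ID_MAX = 65535        # USHRT_MAX
REGULAR_BYTES = 38 * 1024        # "~38Kb" per cgroup, assuming 1Kb = 1024 bytes
PERCPU_BYTES = 6142              # per cgroup, per CPU


def cgroups_until_memory_limit(limit_bytes: int, nr_cpus: int) -> int:
    """Number of cgroups that fit under limit_bytes if fully accounted."""
    per_cgroup = REGULAR_BYTES + PERCPU_BYTES * nr_cpus
    return limit_bytes // per_cgroup


def first_exhausted(limit_bytes: int, nr_cpus: int) -> str:
    """Return "ids" if the id space runs out before memory, else "memory"."""
    if cgroups_until_memory_limit(limit_bytes, nr_cpus) >= MEM_CGROUP_ID_MAX:
        return "ids"
    return "memory"


if __name__ == "__main__":
    limit = 4 * 2**30  # hypothetical 4 GiB memory limit for the container
    for cpus in (4, 128):
        print(cpus, cgroups_until_memory_limit(limit, cpus),
              first_exhausted(limit, cpus))
```

With these numbers a 4-CPU machine runs out of ids before the 4 GiB
limit is reached, while a 128-CPU machine hits the memory limit after
about five thousand cgroups.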

Maybe we need some sort of limit on the number of cgroups in a subtree
so that any potential runaway can be prevented regardless of the cgroups'
memory footprint. One potentially big problem with that is that cgroups
can live quite long after being offlined (e.g. memcg), so I can imagine
such a limit could easily trigger.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01  9:15                                   ` Michal Koutný
@ 2022-06-01  9:32                                     ` Michal Hocko
  2022-06-01 13:05                                       ` Michal Hocko
  0 siblings, 1 reply; 139+ messages in thread
From: Michal Hocko @ 2022-06-01  9:32 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Vasily Averin, Andrew Morton, kernel, linux-kernel, linux-mm,
	Shakeel Butt, Roman Gushchin, Vlastimil Babka, Muchun Song,
	cgroups

On Wed 01-06-22 11:15:43, Michal Koutny wrote:
> On Wed, Jun 01, 2022 at 06:43:27AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> > CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > 512
> > CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > -bash: echo: write error: Operation not permitted
> > CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > -bash: echo: write error: Operation not permitted
> > 
> > I doubt this way can be accepted in upstream, however for OpenVz
> > something like this it is mandatory because it much better
> > than nothing.
> 
> Is this customization of yours something like cgroup.max.descendants on
> the unified (v2) hierarchy? (Just curious.)
> 
> (It can be made inaccessible from within the subtree either with cgroup
> ns or good old FS permissions.)

So we already do have a limit to prevent somebody from running away with
the number of cgroups. Nice! I was not aware of that and I guess this
looks like the right thing to do. So do we need more control and
accounting than this?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01  9:32                                     ` Michal Hocko
@ 2022-06-01 13:05                                       ` Michal Hocko
  2022-06-01 14:22                                         ` Roman Gushchin
  0 siblings, 1 reply; 139+ messages in thread
From: Michal Hocko @ 2022-06-01 13:05 UTC (permalink / raw)
  To: Michal Koutný, Roman Gushchin
  Cc: Vasily Averin, Andrew Morton, kernel, linux-kernel, linux-mm,
	Shakeel Butt, Vlastimil Babka, Muchun Song, cgroups

On Wed 01-06-22 11:32:26, Michal Hocko wrote:
> On Wed 01-06-22 11:15:43, Michal Koutny wrote:
> > On Wed, Jun 01, 2022 at 06:43:27AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> > > CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > > 512
> > > CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > > -bash: echo: write error: Operation not permitted
> > > CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > > -bash: echo: write error: Operation not permitted
> > > 
> > > I doubt this way can be accepted in upstream, however for OpenVz
> > > something like this it is mandatory because it much better
> > > than nothing.
> > 
> > Is this customization of yours something like cgroup.max.descendants on
> > the unified (v2) hierarchy? (Just curious.)
> > 
> > (It can be made inaccessible from within the subtree either with cgroup
> > ns or good old FS permissions.)
> 
> So we already do have a limit to prevent somebody from running away with
> the number of cgroups. Nice! I was not aware of that and I guess this
> looks like the right thing to do. So do we need more control and
> accounting that this?

I have checked the actual implementation and noticed that cgroups are
uncharged when offlined (rmdir-ed), which means that an adversary could
still trick the limit and run away while still consuming resources.

Roman, I guess the reason for this implementation was to avoid triggering
the limit on setups with memcgs which can take quite some time to die?
Would it make sense to make the implementation more strict, so that it
really acts as a gate against potential cgroup count runaways?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01 13:05                                       ` Michal Hocko
@ 2022-06-01 14:22                                         ` Roman Gushchin
  2022-06-01 15:24                                           ` Michal Hocko
  0 siblings, 1 reply; 139+ messages in thread
From: Roman Gushchin @ 2022-06-01 14:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michal Koutný,
	Vasily Averin, Andrew Morton, kernel, linux-kernel, linux-mm,
	Shakeel Butt, Vlastimil Babka, Muchun Song, cgroups

On Wed, Jun 01, 2022 at 03:05:34PM +0200, Michal Hocko wrote:
> On Wed 01-06-22 11:32:26, Michal Hocko wrote:
> > On Wed 01-06-22 11:15:43, Michal Koutny wrote:
> > > On Wed, Jun 01, 2022 at 06:43:27AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> > > > CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > > > 512
> > > > CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > > > -bash: echo: write error: Operation not permitted
> > > > CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > > > -bash: echo: write error: Operation not permitted
> > > > 
> > > > I doubt this way can be accepted in upstream, however for OpenVz
> > > > something like this it is mandatory because it much better
> > > > than nothing.
> > > 
> > > Is this customization of yours something like cgroup.max.descendants on
> > > the unified (v2) hierarchy? (Just curious.)
> > > 
> > > (It can be made inaccessible from within the subtree either with cgroup
> > > ns or good old FS permissions.)
> > 
> > So we already do have a limit to prevent somebody from running away with
> > the number of cgroups. Nice!

Yes, we do!

> > I was not aware of that and I guess this
> > looks like the right thing to do. So do we need more control and
> > accounting that this?
> 
> I have checked the actual implementation and noticed that cgroups are
> uncharged when offlined (rmdir-ed) which means that an adversary could
> still trick the limit and runaway while still consuming resources.
> 
> Roman, I guess the reason for this implementation was to avoid limit to
> trigger on setups with memcgs which can take quite some time to die?
> Would it make sense to make the implementation more strict to really act
> as gate against potential cgroups count runways?

The reasoning was that in many cases a user can't do much about dying cgroups,
so it's not clear how they should/would handle getting -EAGAIN on creating a
new cgroup (retrying will not help, obviously). Live cgroups can be easily
deleted, dying cgroups - not always.
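
To make the semantics under discussion concrete, here is a minimal
user-space sketch of a descendant limit that counts only live cgroups.
It is not the kernel implementation: the struct and the cg_model_*
helpers are hypothetical names invented for this illustration. The only
behaviour taken from the thread is that creation fails with -EAGAIN at
the limit, while rmdir-ed (dying) cgroups no longer count against it.

```c
#include <assert.h>
#include <errno.h>

/* Simplified user-space model of a descendant limit; not kernel code. */
struct cg_model {
	int nr_descendants;        /* live cgroups below this one */
	int nr_dying_descendants;  /* rmdir-ed, but possibly still pinning memory */
	int max_descendants;       /* limit, applied to live cgroups only */
};

/* mkdir: refused with -EAGAIN once the live count reaches the limit. */
int cg_model_mkdir(struct cg_model *cg)
{
	if (cg->nr_descendants >= cg->max_descendants)
		return -EAGAIN;
	cg->nr_descendants++;
	return 0;
}

/* rmdir: the cgroup stops counting against the limit and becomes dying. */
int cg_model_rmdir(struct cg_model *cg)
{
	if (cg->nr_descendants == 0)
		return -ENOENT;
	cg->nr_descendants--;
	assert(cg->nr_descendants >= 0);
	cg->nr_dying_descendants++;
	return 0;
}

/* Repeat mkdir + rmdir; returns the number of rounds that succeeded. */
int cg_model_churn(struct cg_model *cg, int rounds)
{
	int done = 0;

	for (int i = 0; i < rounds; i++) {
		if (cg_model_mkdir(cg) != 0)
			break;
		cg_model_rmdir(cg);
		done++;
	}
	return done;
}
```

In this model, with one free slot under the limit, cg_model_churn() never
sees -EAGAIN, yet nr_dying_descendants grows by one per round. That is the
runaway described above, and also why a stricter limit would hand the user
an error they cannot resolve by retrying.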

I'm not sure about switching the semantics. I'd wait until Muchun's lru page
reparenting has landed (could be within 1-2 releases, I guess) and then we
can check whether the whole problem is mostly gone. Honestly, I think we might
need to fix a few other things, but it might not be that hard (in comparison
to what we already did).


* Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-06-01 14:22                                         ` Roman Gushchin
@ 2022-06-01 15:24                                           ` Michal Hocko
  0 siblings, 0 replies; 139+ messages in thread
From: Michal Hocko @ 2022-06-01 15:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Michal Koutný,
	Vasily Averin, Andrew Morton, kernel, linux-kernel, linux-mm,
	Shakeel Butt, Vlastimil Babka, Muchun Song, cgroups

On Wed 01-06-22 07:22:05, Roman Gushchin wrote:
> On Wed, Jun 01, 2022 at 03:05:34PM +0200, Michal Hocko wrote:
> > On Wed 01-06-22 11:32:26, Michal Hocko wrote:
> > > On Wed 01-06-22 11:15:43, Michal Koutny wrote:
> > > > On Wed, Jun 01, 2022 at 06:43:27AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> > > > > CT-901 /# cat /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > > > > 512
> > > > > CT-901 /# echo 3333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > > > > -bash: echo: write error: Operation not permitted
> > > > > CT-901 /# echo 333 > /sys/fs/cgroup/memory/cgroup.subgroups_limit 
> > > > > -bash: echo: write error: Operation not permitted
> > > > > 
> > > > > I doubt this way can be accepted in upstream, however for OpenVz
> > > > > something like this it is mandatory because it much better
> > > > > than nothing.
> > > > 
> > > > Is this customization of yours something like cgroup.max.descendants on
> > > > the unified (v2) hierarchy? (Just curious.)
> > > > 
> > > > (It can be made inaccessible from within the subtree either with cgroup
> > > > ns or good old FS permissions.)
> > > 
> > > So we already do have a limit to prevent somebody from running away with
> > > the number of cgroups. Nice!
> 
> Yes, we do!
> 
> > > I was not aware of that and I guess this
> > > looks like the right thing to do. So do we need more control and
> > > accounting that this?
> > 
> > I have checked the actual implementation and noticed that cgroups are
> > uncharged when offlined (rmdir-ed) which means that an adversary could
> > still trick the limit and runaway while still consuming resources.
> > 
> > Roman, I guess the reason for this implementation was to avoid limit to
> > trigger on setups with memcgs which can take quite some time to die?
> > Would it make sense to make the implementation more strict to really act
> > as gate against potential cgroups count runways?
> 
> The reasoning was that in many cases a user can't do much about dying cgroups,
> so it's not clear how they should/would handle getting -EAGAIN on creating a
> new cgroup (retrying will not help, obviously). Live cgroups can be easily
> deleted, dying cgroups - not always.
> 
> I'm not sure about switching the semantics. I'd wait till Muchun's lru page
> reparenting will be landed (could be within 1-2 releases, I guess) and then we
> can check whether the whole problem is mostly gone. Honestly, I think we might
> need to fix few another things, but it might be not that hard (in comparison
> to what we already did).

OK, thanks for the confirmation! Say we end up mitigating the
long-standing issue of memcgs lingering too easily. Do we still need
extended accounting of cgroup data structures?

-- 
Michal Hocko
SUSE Labs


* [PATCH mm v4 0/9] memcg: accounting for objects allocated by mkdir cgroup
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
  2022-05-30 11:55                       ` Michal Hocko
@ 2022-06-13  5:34                       ` Vasily Averin
  2022-06-23 14:50                         ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Vasily Averin
                                           ` (9 more replies)
  2022-06-13  5:34                       ` [PATCH mm v4 1/9] memcg: enable accounting for struct cgroup Vasily Averin
                                         ` (8 subsequent siblings)
  10 siblings, 10 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-13  5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

In some cases, creating a cgroup allocates a noticeable amount of memory.
This operation can be executed from inside a memory-limited container,
but currently this memory is not accounted to memcg and can be misused.
This allows a container to exceed the assigned memory limit and avoid
memcg OOM. Moreover, in case of global memory shortage on the host,
the OOM-killer may not find the real memory eater and start killing
random processes on the host.

This is especially important for OpenVZ and LXC used for hosting,
where containers are used by untrusted end users.

Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
4-CPU VM with Fedora and a self-compiled upstream kernel. The calculations
are not precise: they depend on kernel config options, number of CPUs and
enabled controllers, and ignore possible page allocations etc.
However, this is enough to clarify the general situation.
All allocations are split into:
- common part, always called for each cgroup type
- per-cgroup allocations

In each group we consider 2 corner cases:
- usual allocations, important for 1-2 CPU nodes/VMs
- percpu allocations, important for 'big irons'

common part: 	~11Kb	+  318 bytes percpu
memcg: 		~17Kb	+ 4692 bytes percpu
cpu:		~2.5Kb	+ 1036 bytes percpu
cpuset:		~3Kb	+   12 bytes percpu
blkcg:		~3Kb	+   12 bytes percpu
pid:		~1.5Kb	+   12 bytes percpu		
perf:		 ~320b	+   60 bytes percpu
-------------------------------------------
total:		~38Kb	+ 6142 bytes percpu
currently accounted:	  4668 bytes percpu

- It's important to account the usual allocations made in the common
part, because almost all cgroup-specific allocations are small. One
exception here is the memory cgroup: it allocates a few huge objects
that should be accounted.
- Percpu allocations made in the common part and in the memcg and cpu
cgroups should be accounted; the rest are small and can be ignored.
- KERNFS objects are allocated both in the common part and in most
cgroups.

Details can be found here:
https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/

I checked the other cgroup types and found that they can all be ignored.
Additionally, I found an allocation of struct rt_rq in the cpu cgroup
when CONFIG_RT_GROUP_SCHED is enabled; it allocates a huge (~1700 bytes)
percpu structure and should be accounted too.
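
As a rough illustration (a sketch, not part of the series), the totals
in the table above can be turned into a per-host estimate. The macro and
function names below are made up for this example. It assumes that "Kb"
means 1024 bytes, that all the listed controllers are enabled, and that
the percpu part is paid once per CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Totals from the table above; "Kb" is assumed to mean 1024 bytes. */
#define CG_USUAL_BYTES       (38u * 1024u) /* ~38Kb of usual allocations */
#define CG_PERCPU_BYTES      6142u         /* percpu bytes, paid once per CPU */
#define CG_PERCPU_ACCOUNTED  4668u         /* percpu bytes accounted today */

/* Estimated bytes allocated by one mkdir on a machine with ncpus CPUs. */
uint64_t cg_bytes_per_cgroup(unsigned int ncpus)
{
	assert(ncpus > 0);
	return (uint64_t)CG_USUAL_BYTES + (uint64_t)CG_PERCPU_BYTES * ncpus;
}

/* Estimated bytes for ncgroups cgroups. */
uint64_t cg_bytes_total(uint64_t ncgroups, unsigned int ncpus)
{
	return ncgroups * cg_bytes_per_cgroup(ncpus);
}

/* Bytes per cgroup that are accounted before this series. */
uint64_t cg_bytes_accounted_now(unsigned int ncpus)
{
	assert(ncpus > 0);
	return (uint64_t)CG_PERCPU_ACCOUNTED * ncpus;
}
```

For 4 CPUs this gives 63480 bytes per cgroup, of which 18672 are
accounted today, and about 635 MB for 10000 cgroups. That is the same
order of magnitude as the ~0.5 Gb reported for ~10000 cgroups on a 4-CPU
VM earlier in this thread. With 64 CPUs the percpu part dominates:
432000 bytes per cgroup.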

v4:
 1) re-based to linux-next (next-20220610)
   now psi_group is not a part of struct cgroup and is allocated on demand
 2) added received approval from Muchun Song
 3) improved cover letter description according to akpm@ request

v3:
 1) re-based to current upstream (v5.18-11267-gb00ed48bb0a7)
 2) fixed few typos
 3) added received approvals

v2:
 1) re-split to simplify possible bisect, re-ordered
 2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
     allocated in common part
 3) added accounting for percpu allocation of struct rt_rq
     (actual if CONFIG_RT_GROUP_SCHED is enabled)
 4) improved patches descriptions 

Vasily Averin (9):
  memcg: enable accounting for struct cgroup
  memcg: enable accounting for kernfs nodes
  memcg: enable accounting for kernfs iattrs
  memcg: enable accounting for struct simple_xattr
  memcg: enable accounting for percpu allocation of struct psi_group_cpu
  memcg: enable accounting for percpu allocation of struct
    cgroup_rstat_cpu
  memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  memcg: enable accounting for allocations in alloc_fair_sched_group
  memcg: enable accounting for perpu allocation of struct rt_rq

 fs/kernfs/mount.c      | 6 ++++--
 fs/xattr.c             | 2 +-
 kernel/cgroup/cgroup.c | 2 +-
 kernel/cgroup/rstat.c  | 3 ++-
 kernel/sched/fair.c    | 4 ++--
 kernel/sched/psi.c     | 5 +++--
 kernel/sched/rt.c      | 2 +-
 mm/memcontrol.c        | 4 ++--
 8 files changed, 16 insertions(+), 12 deletions(-)

-- 
2.36.1



* [PATCH mm v4 1/9] memcg: enable accounting for struct cgroup
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
  2022-05-30 11:55                       ` Michal Hocko
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
@ 2022-06-13  5:34                       ` Vasily Averin
  2022-06-13  5:34                       ` [PATCH mm v4 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
                                         ` (7 subsequent siblings)
  10 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-13  5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creating each new cgroup allocates 4Kb for struct cgroup. This is the
largest memory allocation in this scenario and is especially important
for small VMs with 1-2 CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Accounting of this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 90a654cb8a1e..9adf4ad4b623 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5353,7 +5353,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
 	cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.36.1



* [PATCH mm v4 2/9] memcg: enable accounting for kernfs nodes
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
                                         ` (2 preceding siblings ...)
  2022-06-13  5:34                       ` [PATCH mm v4 1/9] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-06-13  5:34                       ` Vasily Averin
  2022-06-13  5:34                       ` [PATCH mm v4 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
                                         ` (6 subsequent siblings)
  10 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-13  5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects, but there are a few
scenarios where they consume a significant share of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
   was allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc   Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)	kernfs node
1   +   88      (__kernfs_iattrs+0x57)		kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)	simple_xattr, can grow over 4Kb
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted
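
The rows above can be cross-checked against the KERNFS line of the
common-part table (16 allocations of ~352 bytes, 5632 bytes in total).
A small sketch of that arithmetic follows; the struct and function names
are hypothetical, only the call sites and sizes come from the table.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table above: call site, size in bytes, and whether
 * the row is marked '+' (to be accounted). */
struct kn_alloc {
	const char *site;
	unsigned int size;
	int to_be_accounted;
};

static const struct kn_alloc kn_allocs[] = {
	{ "__kernfs_new_node+0x4d",  128, 1 },	/* kernfs node */
	{ "__kernfs_iattrs+0x57",     88, 1 },	/* kernfs iattrs */
	{ "simple_xattr_alloc+0x28",  96, 1 },	/* simple_xattr */
	{ "simple_xattr_set+0x59",    32, 0 },
	{ "__kernfs_new_node+0x30",    8, 0 },
};

#define KN_NR_ALLOCS (sizeof(kn_allocs) / sizeof(kn_allocs[0]))

/* Sum the per-node allocation sizes; with accounted_only set, sum only
 * the rows marked '+'. */
unsigned int kn_bytes(int accounted_only)
{
	unsigned int sum = 0;

	assert(accounted_only == 0 || accounted_only == 1);
	for (size_t i = 0; i < KN_NR_ALLOCS; i++)
		if (!accounted_only || kn_allocs[i].to_be_accounted)
			sum += kn_allocs[i].size;
	return sum;
}
```

kn_bytes(0) is 352 bytes per kernfs node, and 16 such nodes give the
5632 bytes listed for KERNFS in the common part. kn_bytes(1) is 312, so
the rows marked '+' cover most of each node's footprint.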

This patch enables accounting for the kernfs nodes slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..3ac4191b1c40 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,7 +391,8 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
-- 
2.36.1



* [PATCH mm v4 3/9] memcg: enable accounting for kernfs iattrs
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
                                         ` (3 preceding siblings ...)
  2022-06-13  5:34                       ` [PATCH mm v4 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
@ 2022-06-13  5:34                       ` Vasily Averin
  2022-06-13  5:35                       ` [PATCH mm v4 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
                                         ` (5 subsequent siblings)
  10 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-13  5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects, but there are a few
scenarios where they consume a significant share of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
   was allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc    Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)        kernfs node
1   +   88      (__kernfs_iattrs+0x57)          kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)       simple_xattr, can grow over 4Kb
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for the kernfs_iattrs_cache slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 3ac4191b1c40..40e896c7c86b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -397,5 +397,6 @@ void __init kernfs_init(void)
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
 					      sizeof(struct kernfs_iattrs),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 }
-- 
2.36.1



* [PATCH mm v4 4/9] memcg: enable accounting for struct simple_xattr
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
                                         ` (4 preceding siblings ...)
  2022-06-13  5:34                       ` [PATCH mm v4 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
@ 2022-06-13  5:35                       ` Vasily Averin
  2022-06-13  5:35                       ` [PATCH mm v4 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
                                         ` (4 subsequent siblings)
  10 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-13  5:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects, but there are a few
scenarios where they consume a significant share of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, where ~10Kb
   was allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, ~10Kb of them are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc   Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)        kernfs node
1   +   88      (__kernfs_iattrs+0x57)          kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)       simple_xattr
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for struct simple_xattr. The size of this
structure depends on userspace and can grow over 4Kb.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/xattr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index e8dd03e4561e..98dcf6600bd9 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -1001,7 +1001,7 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
 	if (len < sizeof(*new_xattr))
 		return NULL;
 
-	new_xattr = kvmalloc(len, GFP_KERNEL);
+	new_xattr = kvmalloc(len, GFP_KERNEL_ACCOUNT);
 	if (!new_xattr)
 		return NULL;
 
-- 
2.36.1



* [PATCH mm v4 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
                                         ` (5 preceding siblings ...)
  2022-06-13  5:35                       ` [PATCH mm v4 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
@ 2022-06-13  5:35                       ` Vasily Averin
  2022-06-13  5:35                       ` [PATCH mm v4 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
                                         ` (3 subsequent siblings)
  10 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-13  5:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

struct psi_group_cpu is percpu allocated for each new cgroup and can
consume a significant portion of all allocated memory on nodes with
a large number of CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 kernel/sched/psi.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ec66b40bdd40..c95e87269fc5 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -957,11 +957,12 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	if (static_branch_likely(&psi_disabled))
 		return 0;
 
-	cgroup->psi = kmalloc(sizeof(struct psi_group), GFP_KERNEL);
+	cgroup->psi = kmalloc(sizeof(struct psi_group), GFP_KERNEL_ACCOUNT);
 	if (!cgroup->psi)
 		return -ENOMEM;
 
-	cgroup->psi->pcpu = alloc_percpu(struct psi_group_cpu);
+	cgroup->psi->pcpu = alloc_percpu_gfp(struct psi_group_cpu,
+					     GFP_KERNEL_ACCOUNT);
 	if (!cgroup->psi->pcpu) {
 		kfree(cgroup->psi);
 		return -ENOMEM;
-- 
2.36.1



* [PATCH mm v4 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
                                         ` (6 preceding siblings ...)
  2022-06-13  5:35                       ` [PATCH mm v4 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
@ 2022-06-13  5:35                       ` Vasily Averin
  2022-06-13  5:35                       ` [PATCH mm v4 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
                                         ` (2 subsequent siblings)
  10 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-13  5:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

struct cgroup_rstat_cpu is percpu allocated for each new cgroup and
can consume a significant portion of all allocated memory on nodes
with a large number of CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum	Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/cgroup/rstat.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 24b5c2ab5598..2904b185b01b 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -257,7 +257,8 @@ int cgroup_rstat_init(struct cgroup *cgrp)
 
 	/* the root cgrp has rstat_cpu preallocated */
 	if (!cgrp->rstat_cpu) {
-		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+		cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu,
+						   GFP_KERNEL_ACCOUNT);
 		if (!cgrp->rstat_cpu)
 			return -ENOMEM;
 	}
-- 
2.36.1



* [PATCH mm v4 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
                                         ` (7 preceding siblings ...)
  2022-06-13  5:35                       ` [PATCH mm v4 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
@ 2022-06-13  5:35                       ` Vasily Averin
  2022-06-13  5:35                       ` [PATCH mm v4 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
  2022-06-13  5:35                       ` [PATCH mm v4 9/9] memcg: enable accounting for perpu allocation of struct rt_rq Vasily Averin
  10 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-13  5:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creation of each memory cgroup allocates a few huge objects in
mem_cgroup_css_alloc(). Their size exceeds the size of memory
accounted in the common part of cgroup creation:

common part: 	~11Kb	+  318 bytes percpu
memcg: 		~17Kb	+ 4692 bytes percpu

memory:
------
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
1   +   8192    8192    8192    (mem_cgroup_css_alloc+0x4a) <NB
14  ~   352     4928    13120   KERNFS
1   +   2048    2048    15168   (mem_cgroup_css_alloc+0xdd) <NB
1       1024    1024    16192   (alloc_shrinker_info+0x79)
1       584     584     16776   (radix_tree_node_alloc.constprop.0+0x89)
2       64      128     16904   (percpu_ref_init+0x6a)
1       64      64      16968   (mem_cgroup_css_online+0x32)

percpu:
1   =   3684    3684    3684    call_site=mem_cgroup_css_alloc+0x9e
1   =   984     984     4668    call_site=mem_cgroup_css_alloc+0xfd
2       12      24      4692    call_site=percpu_ref_init+0x23

     '=' -- already accounted,
     '+' -- to be accounted,
     '~' -- partially accounted

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 27cebaa53472..a8647a8417e4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5086,7 +5086,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
 
-	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, node);
+	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL_ACCOUNT, node);
 	if (!pn)
 		return 1;
 
@@ -5138,7 +5138,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	int __maybe_unused i;
 	long error = -ENOMEM;
 
-	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL);
+	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL_ACCOUNT);
 	if (!memcg)
 		return ERR_PTR(error);
 
-- 
2.36.1



* [PATCH mm v4 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
                                         ` (8 preceding siblings ...)
  2022-06-13  5:35                       ` [PATCH mm v4 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
@ 2022-06-13  5:35                       ` Vasily Averin
  2022-06-13  5:35                       ` [PATCH mm v4 9/9] memcg: enable accounting for percpu allocation of struct rt_rq Vasily Averin
  10 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-13  5:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creating each new cpu cgroup allocates two 512-byte kernel objects
per CPU. This is especially important for cgroups sharing a parent memory
cgroup. In this scenario, on nodes with multiple processors, these
allocations become one of the main memory consumers.

Memory allocated during new cpu cgroup creation:
common part: 	~11Kb	+  318 bytes percpu
cpu cgroup:	~2.5Kb	+ 1036 bytes percpu

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8202b5cd3d5..71161be1e783 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11503,12 +11503,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
-				      GFP_KERNEL, cpu_to_node(i));
+				      GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!cfs_rq)
 			goto err;
 
 		se = kzalloc_node(sizeof(struct sched_entity_stats),
-				  GFP_KERNEL, cpu_to_node(i));
+				  GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!se)
 			goto err_free_rq;
 
-- 
2.36.1



* [PATCH mm v4 9/9] memcg: enable accounting for percpu allocation of struct rt_rq
  2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
                                         ` (9 preceding siblings ...)
  2022-06-13  5:35                       ` [PATCH mm v4 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
@ 2022-06-13  5:35                       ` Vasily Averin
  10 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-13  5:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

If enabled in config, alloc_rt_sched_group() is called for each new
cpu cgroup and allocates a huge (~1700 bytes) percpu struct rt_rq.
This significantly exceeds the size of the percpu allocation in the
common part of cgroup creation.

Memory allocated during new cpu cgroup creation
(with enabled RT_GROUP_SCHED):
common part:    ~11Kb   +   318 bytes percpu
cpu cgroup:     ~2.5Kb  + ~2800 bytes percpu

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/sched/rt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8c9ed9664840..44a8fc096e33 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -256,7 +256,7 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
 
 	for_each_possible_cpu(i) {
 		rt_rq = kzalloc_node(sizeof(struct rt_rq),
-				     GFP_KERNEL, cpu_to_node(i));
+				     GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!rt_rq)
 			goto err;
 
-- 
2.36.1



* [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
@ 2022-06-23 14:50                         ` Vasily Averin
  2022-06-23 15:03                           ` Vasily Averin
  2022-06-23 14:50                         ` [PATCH mm v5 1/9] memcg: enable accounting for struct cgroup Vasily Averin
                                           ` (8 subsequent siblings)
  9 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-06-23 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

In some cases, creating a cgroup allocates a noticeable amount of memory.
This operation can be executed from inside a memory-limited container,
but currently this memory is not accounted to memcg and can be misused.
This allows a container to exceed its assigned memory limit and avoid
memcg OOM. Moreover, in case of a global memory shortage on the host,
the OOM-killer may not find the real memory eater and may start killing
random processes on the host.

This is especially important for OpenVZ and LXC used in hosting,
where containers are used by untrusted end users.

Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
4-CPU VM with Fedora and a self-compiled upstream kernel. The calculations
are not precise: they depend on kernel config options, the number of CPUs
and the enabled controllers, and they ignore possible page allocations etc.
However, this is enough to clarify the general situation.
All allocations are split into:
- common part, always called for each cgroup type
- per-cgroup allocations

In each group we consider 2 corner cases:
- usual allocations, important for 1-2 CPU nodes/VMs
- percpu allocations, important for 'big irons'

common part: 	~11Kb	+  318 bytes percpu
memcg: 		~17Kb	+ 4692 bytes percpu
cpu:		~2.5Kb	+ 1036 bytes percpu
cpuset:		~3Kb	+   12 bytes percpu
blkcg:		~3Kb	+   12 bytes percpu
pid:		~1.5Kb	+   12 bytes percpu		
perf:		 ~320b	+   60 bytes percpu
-------------------------------------------
total:		~38Kb	+ 6142 bytes percpu
currently accounted:	  4668 bytes percpu

- It's important to account usual allocations called
in the common part, because almost all cgroup-specific allocations
are small. One exception here is the memory cgroup: it allocates a few
huge objects that should be accounted.
- Percpu allocations called in the common part, in memcg and cpu cgroups
should be accounted; the rest are small and can be ignored.
- KERNFS objects are allocated both in the common part and in most
cgroups.
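
To make the trade-off between the two corner cases concrete, here is a
rough userspace sketch (not kernel code) that turns the summary table
above into an estimate for a given number of CPUs. The struct and
function names are hypothetical, the "~" figures are rounded to bytes,
and it assumes that every "bytes percpu" value is allocated once per CPU.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the summary table: bytes allocated once per cgroup and
 * bytes allocated once per CPU. The "~" figures are rounded, so these
 * values are approximations, not measurements. */
struct cgroup_cost {
	const char *name;
	size_t slab_bytes;
	size_t percpu_bytes;
};

static const struct cgroup_cost costs[] = {
	{ "common", 11 * 1024,  318 },
	{ "memcg",  17 * 1024, 4692 },
	{ "cpu",    2560,      1036 },
	{ "cpuset", 3 * 1024,    12 },
	{ "blkcg",  3 * 1024,    12 },
	{ "pid",    1536,        12 },
	{ "perf",   320,         60 },
};

#define NR_COSTS (sizeof(costs) / sizeof(costs[0]))

/* Sum of the usual (non-percpu) allocations over all rows. */
size_t slab_bytes_total(void)
{
	size_t sum = 0;

	for (size_t i = 0; i < NR_COSTS; i++)
		sum += costs[i].slab_bytes;
	return sum;
}

/* Sum of the percpu allocations over all rows, for a single CPU. */
size_t percpu_bytes_per_cpu(void)
{
	size_t sum = 0;

	for (size_t i = 0; i < NR_COSTS; i++)
		sum += costs[i].percpu_bytes;
	return sum;
}

/* Estimated bytes allocated by one mkdir on a machine with ncpus CPUs. */
size_t estimated_mkdir_cost(unsigned int ncpus)
{
	assert(ncpus > 0);
	return slab_bytes_total() + (size_t)ncpus * percpu_bytes_per_cpu();
}
```

With these rounded inputs, estimated_mkdir_cost(4) gives 63800 bytes for
the 4-CPU VM used in the trace, while on a 128-CPU node the percpu part
(128 * 6142 bytes) dominates the total.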

Details can be found here:
https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/

I checked other cgroup types and found that they can all be ignored.
Additionally, I found an allocation of struct rt_rq called in the cpu
cgroup if CONFIG_RT_GROUP_SCHED is enabled; it allocates a huge
(~1700 bytes) percpu structure and should be accounted too.

v5:
 1) re-based to linux-mm (mm-everything-2022-06-22-20-36)

v4:
 1) re-based to linux-next (next-20220610)
   now psi_group is not a part of struct cgroup and is allocated on demand
 2) added received approval from Muchun Song
 3) improved cover letter description according to akpm@ request

v3:
 1) re-based to current upstream (v5.18-11267-gb00ed48bb0a7)
 2) fixed few typos
 3) added received approvals

v2:
 1) re-split to simplify possible bisect, re-ordered
 2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
     allocated in common part
 3) added accounting for percpu allocation of struct rt_rq
     (relevant if CONFIG_RT_GROUP_SCHED is enabled)
 4) improved patch descriptions

Vasily Averin (9):
  memcg: enable accounting for struct cgroup
  memcg: enable accounting for kernfs nodes
  memcg: enable accounting for kernfs iattrs
  memcg: enable accounting for struct simple_xattr
  memcg: enable accounting for percpu allocation of struct psi_group_cpu
  memcg: enable accounting for percpu allocation of struct
    cgroup_rstat_cpu
  memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  memcg: enable accounting for allocations in alloc_fair_sched_group
  memcg: enable accounting for percpu allocation of struct rt_rq

 fs/kernfs/mount.c      | 6 ++++--
 fs/xattr.c             | 2 +-
 kernel/cgroup/cgroup.c | 2 +-
 kernel/cgroup/rstat.c  | 3 ++-
 kernel/sched/fair.c    | 4 ++--
 kernel/sched/psi.c     | 2 +-
 kernel/sched/rt.c      | 2 +-
 mm/memcontrol.c        | 4 ++--
 8 files changed, 14 insertions(+), 11 deletions(-)

-- 
2.36.1



* [PATCH mm v5 1/9] memcg: enable accounting for struct cgroup
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
  2022-06-23 14:50                         ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Vasily Averin
@ 2022-06-23 14:50                         ` Vasily Averin
  2022-06-23 14:50                         ` [PATCH mm v5 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
                                           ` (7 subsequent siblings)
  9 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-23 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creating each new cgroup allocates 4Kb for struct cgroup. This is the
largest memory allocation in this scenario and is especially important
for small VMs with 1-2 CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Accounting of this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..1be0f81fe8e1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5353,7 +5353,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 
 	/* allocate the cgroup and its ID, 0 is reserved for the root */
 	cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!cgrp)
 		return ERR_PTR(-ENOMEM);
 
-- 
2.36.1



* [PATCH mm v5 2/9] memcg: enable accounting for kernfs nodes
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
  2022-06-23 14:50                         ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Vasily Averin
  2022-06-23 14:50                         ` [PATCH mm v5 1/9] memcg: enable accounting for struct cgroup Vasily Averin
@ 2022-06-23 14:50                         ` Vasily Averin
  2022-06-23 14:51                         ` [PATCH mm v5 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
                                           ` (6 subsequent siblings)
  9 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-23 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects; however, there are a few
scenarios where they consume a significant share of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, of which ~10Kb are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc   Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)	kernfs node
1   +   88      (__kernfs_iattrs+0x57)		kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)	simple_xattr, can grow over 4Kb
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted
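
As a cross-check, the per-node sizes in the table above add up to the
352 bytes used for the KERNFS rows elsewhere in this series
(16 x 352 = 5632 in the common part, 14 x 352 = 4928 for memcg). A small
userspace sketch of that arithmetic follows; the array and function names
are hypothetical and the sizes are simply copied from the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-node allocation sizes in bytes, copied from the table above:
 * kernfs node, kernfs iattrs, simple_xattr, and two small helpers. */
static const size_t kernfs_node_allocs[] = { 128, 88, 96, 32, 8 };

#define NR_ALLOCS (sizeof(kernfs_node_allocs) / sizeof(kernfs_node_allocs[0]))

/* Bytes allocated for a single kernfs node and its companion objects. */
size_t kernfs_bytes_per_node(void)
{
	size_t sum = 0;

	for (size_t i = 0; i < NR_ALLOCS; i++)
		sum += kernfs_node_allocs[i];
	return sum;
}

/* Bytes allocated for a given number of kernfs nodes. */
size_t kernfs_bytes(size_t nodes)
{
	size_t per_node = kernfs_bytes_per_node();

	/* Guard against overflow of the multiplication below. */
	assert(nodes <= SIZE_MAX / per_node);
	return nodes * per_node;
}
```

Note that simple_xattr is listed with its minimal size here; since it can
grow with userspace-supplied data, the real per-node figure can be larger.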

This patch enables accounting for kernfs nodes slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index cfa79715fc1a..3ac4191b1c40 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -391,7 +391,8 @@ void __init kernfs_init(void)
 {
 	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
 					      sizeof(struct kernfs_node),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
-- 
2.36.1



* [PATCH mm v5 3/9] memcg: enable accounting for kernfs iattrs
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
                                           ` (2 preceding siblings ...)
  2022-06-23 14:50                         ` [PATCH mm v5 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
@ 2022-06-23 14:51                         ` Vasily Averin
  2022-06-23 14:51                         ` [PATCH mm v5 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
                                           ` (5 subsequent siblings)
  9 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-23 14:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects; however, there are a few
scenarios where they consume a significant share of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, of which ~10Kb are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc    Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)        kernfs node
1   +   88      (__kernfs_iattrs+0x57)          kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)       simple_xattr, can grow over 4Kb
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for the kernfs_iattrs_cache slab cache.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/kernfs/mount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 3ac4191b1c40..40e896c7c86b 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -397,5 +397,6 @@ void __init kernfs_init(void)
 	/* Creates slab cache for kernfs inode attributes */
 	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
 					      sizeof(struct kernfs_iattrs),
-					      0, SLAB_PANIC, NULL);
+					      0, SLAB_PANIC | SLAB_ACCOUNT,
+					      NULL);
 }
-- 
2.36.1



* [PATCH mm v5 4/9] memcg: enable accounting for struct simple_xattr
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
                                           ` (3 preceding siblings ...)
  2022-06-23 14:51                         ` [PATCH mm v5 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
@ 2022-06-23 14:51                         ` Vasily Averin
  2022-06-23 14:51                         ` [PATCH mm v5 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
                                           ` (4 subsequent siblings)
  9 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-23 14:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

kernfs nodes are quite small kernel objects; however, there are a few
scenarios where they consume a significant share of all allocated memory:

1) creating a new netdevice allocates ~50Kb of memory, of which ~10Kb
   is allocated for 80+ kernfs nodes.

2) cgroupv2 mkdir allocates ~60Kb of memory, of which ~10Kb are kernfs
   structures.

3) Shakeel Butt reports that Google has workloads which create 100s
   of subcontainers and they have observed high system overhead
   without memcg accounting of kernfs.

Usually a new kernfs node creates a few other objects:

Allocs  Alloc   Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)        kernfs node
1   +   88      (__kernfs_iattrs+0x57)          kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)       simple_xattr
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

'+' -- to be accounted

This patch enables accounting for struct simple_xattr. The size of this
structure depends on userspace and can exceed 4Kb.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/xattr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index e8dd03e4561e..98dcf6600bd9 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -1001,7 +1001,7 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
 	if (len < sizeof(*new_xattr))
 		return NULL;
 
-	new_xattr = kvmalloc(len, GFP_KERNEL);
+	new_xattr = kvmalloc(len, GFP_KERNEL_ACCOUNT);
 	if (!new_xattr)
 		return NULL;
 
-- 
2.36.1



* [PATCH mm v5 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
                                           ` (4 preceding siblings ...)
  2022-06-23 14:51                         ` [PATCH mm v5 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
@ 2022-06-23 14:51                         ` Vasily Averin
  2022-06-23 14:51                         ` [PATCH mm v5 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
                                           ` (3 subsequent siblings)
  9 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-23 14:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

struct psi_group_cpu is percpu allocated for each new cgroup and can
consume a significant portion of all allocated memory on nodes with
a large number of CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a337f3e35997..0da10159d3d9 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -957,7 +957,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	if (static_branch_likely(&psi_disabled))
 		return 0;
 
-	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
+	cgroup->psi.pcpu = alloc_percpu_gfp(struct psi_group_cpu, GFP_KERNEL_ACCOUNT);
 	if (!cgroup->psi.pcpu)
 		return -ENOMEM;
 	group_init(&cgroup->psi);
-- 
2.36.1



* [PATCH mm v5 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
                                           ` (5 preceding siblings ...)
  2022-06-23 14:51                         ` [PATCH mm v5 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
@ 2022-06-23 14:51                         ` Vasily Averin
  2022-06-23 14:51                         ` [PATCH mm v5 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
                                           ` (2 subsequent siblings)
  9 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-23 14:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

struct cgroup_rstat_cpu is percpu allocated for each new cgroup and
can consume a significant portion of all allocated memory on nodes
with a large number of CPUs.

Common part of the cgroup creation:
Allocs  Alloc   $1*$2   Sum	Allocation
number  size
--------------------------------------------
16  ~   352     5632    5632    KERNFS
1   +   4096    4096    9728    (cgroup_mkdir+0xe4)
1       584     584     10312   (radix_tree_node_alloc.constprop.0+0x89)
1       192     192     10504   (__d_alloc+0x29)
2       72      144     10648   (avc_alloc_node+0x27)
2       64      128     10776   (percpu_ref_init+0x6a)
1       64      64      10840   (memcg_list_lru_alloc+0x21a)
percpu:
1   +   192     192     192     call_site=psi_cgroup_alloc+0x1e
1   +   96      96      288     call_site=cgroup_rstat_init+0x5f
2       12      24      312     call_site=percpu_ref_init+0x23
1       6       6       318     call_site=__percpu_counter_init+0x22

 '+' -- to be accounted,
 '~' -- partially accounted

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/cgroup/rstat.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 24b5c2ab5598..2904b185b01b 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -257,7 +257,8 @@ int cgroup_rstat_init(struct cgroup *cgrp)
 
 	/* the root cgrp has rstat_cpu preallocated */
 	if (!cgrp->rstat_cpu) {
-		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+		cgrp->rstat_cpu = alloc_percpu_gfp(struct cgroup_rstat_cpu,
+						   GFP_KERNEL_ACCOUNT);
 		if (!cgrp->rstat_cpu)
 			return -ENOMEM;
 	}
-- 
2.36.1



* [PATCH mm v5 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
                                           ` (6 preceding siblings ...)
  2022-06-23 14:51                         ` [PATCH mm v5 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
@ 2022-06-23 14:51                         ` Vasily Averin
  2022-06-23 14:51                         ` [PATCH mm v5 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
  2022-06-23 14:52                         ` [PATCH mm v5 9/9] memcg: enable accounting for percpu allocation of struct rt_rq Vasily Averin
  9 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-23 14:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creating each memory cgroup allocates a few huge objects in
mem_cgroup_css_alloc(). Their total size exceeds the size of the memory
accounted in the common part of cgroup creation:

common part: 	~11Kb	+  318 bytes percpu
memcg: 		~17Kb	+ 4692 bytes percpu

memory:
------
Allocs  Alloc   $1*$2   Sum     Allocation
number  size
--------------------------------------------
1   +   8192    8192    8192    (mem_cgroup_css_alloc+0x4a) <NB
14  ~   352     4928    13120   KERNFS
1   +   2048    2048    15168   (mem_cgroup_css_alloc+0xdd) <NB
1       1024    1024    16192   (alloc_shrinker_info+0x79)
1       584     584     16776   (radix_tree_node_alloc.constprop.0+0x89)
2       64      128     16904   (percpu_ref_init+0x6a)
1       64      64      16968   (mem_cgroup_css_online+0x32)

percpu:
1   =   3684    3684    3684    call_site=mem_cgroup_css_alloc+0x9e
1   =   984     984     4668    call_site=mem_cgroup_css_alloc+0xfd
2       12      24      4692    call_site=percpu_ref_init+0x23

     '=' -- already accounted,
     '+' -- to be accounted,
     '~' -- partially accounted

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 85adc43c5a25..275d0c847f05 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5257,7 +5257,7 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
 
-	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, node);
+	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL_ACCOUNT, node);
 	if (!pn)
 		return 1;
 
@@ -5309,7 +5309,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	int __maybe_unused i;
 	long error = -ENOMEM;
 
-	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL);
+	memcg = kzalloc(struct_size(memcg, nodeinfo, nr_node_ids), GFP_KERNEL_ACCOUNT);
 	if (!memcg)
 		return ERR_PTR(error);
 
-- 
2.36.1



* [PATCH mm v5 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
                                           ` (7 preceding siblings ...)
  2022-06-23 14:51                         ` [PATCH mm v5 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
@ 2022-06-23 14:51                         ` Vasily Averin
  2022-06-23 14:52                         ` [PATCH mm v5 9/9] memcg: enable accounting for percpu allocation of struct rt_rq Vasily Averin
  9 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-23 14:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

Creating each new cpu cgroup allocates two 512-byte kernel objects
per CPU. This is especially important for cgroups sharing a parent memory
cgroup. In this scenario, on nodes with multiple processors, these
allocations become one of the main memory consumers.

Memory allocated during new cpu cgroup creation:
common part: 	~11Kb	+  318 bytes percpu
cpu cgroup:	~2.5Kb	+ 1036 bytes percpu

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8202b5cd3d5..71161be1e783 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11503,12 +11503,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	for_each_possible_cpu(i) {
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
-				      GFP_KERNEL, cpu_to_node(i));
+				      GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!cfs_rq)
 			goto err;
 
 		se = kzalloc_node(sizeof(struct sched_entity_stats),
-				  GFP_KERNEL, cpu_to_node(i));
+				  GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!se)
 			goto err_free_rq;
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH mm v5 9/9] memcg: enable accounting for perpu allocation of struct rt_rq
  2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
                                           ` (8 preceding siblings ...)
  2022-06-23 14:51                         ` [PATCH mm v5 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
@ 2022-06-23 14:52                         ` Vasily Averin
  9 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-23 14:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel, linux-kernel, linux-mm, Shakeel Butt, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Michal Hocko, Muchun Song, cgroups

If enabled in config, alloc_rt_sched_group() is called for each new
cpu cgroup and allocates a huge (~1700 bytes) percpu struct rt_rq.
This significantly exceeds the size of the percpu allocation in the
common part of cgroup creation.

Memory allocated during new cpu cgroup creation
(with enabled RT_GROUP_SCHED):
common part:    ~11Kb   +   318 bytes percpu
cpu cgroup:     ~2.5Kb  + ~2800 bytes percpu

Accounting for this memory helps to avoid misuse inside memcg-limited
containers.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
---
 kernel/sched/rt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8c9ed9664840..44a8fc096e33 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -256,7 +256,7 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)
 
 	for_each_possible_cpu(i) {
 		rt_rq = kzalloc_node(sizeof(struct rt_rq),
-				     GFP_KERNEL, cpu_to_node(i));
+				     GFP_KERNEL_ACCOUNT, cpu_to_node(i));
 		if (!rt_rq)
 			goto err;
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-23 14:50                         ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Vasily Averin
@ 2022-06-23 15:03                           ` Vasily Averin
  2022-06-23 16:07                             ` Michal Hocko
  0 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-06-23 15:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kernel, Andrew Morton, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, cgroups

Dear Michal,
do you still have any concerns about this patch set?

Thank you,
	Vasily Averin

On 6/23/22 17:50, Vasily Averin wrote:
> In some cases, creating a cgroup allocates a noticeable amount of memory.
> This operation can be executed from inside a memory-limited container,
> but currently this memory is not accounted to memcg and can be misused.
> This allows a container to exceed its assigned memory limit and avoid
> memcg OOM. Moreover, in case of a global memory shortage on the host,
> the OOM-killer may not find the real memory eater and start killing
> random processes on the host.
> 
> This is especially important for OpenVZ and LXC used in hosting,
> where containers are used by untrusted end users.
> 
> Below are the tracing results of mkdir /sys/fs/cgroup/vvs.test on a
> 4-CPU VM with Fedora and a self-compiled upstream kernel. The calculations
> are not precise: they depend on kernel config options, the number of CPUs
> and the enabled controllers, and they ignore possible page allocations etc.
> However, this is enough to clarify the general situation.
> All allocations are split into:
> - common part, always called for each cgroup type
> - per-cgroup allocations
> 
> In each group we consider 2 corner cases:
> - usual allocations, important for 1-2 CPU nodes/VMs
> - percpu allocations, important for 'big irons'
> 
> common part: 	~11Kb	+  318 bytes percpu
> memcg: 		~17Kb	+ 4692 bytes percpu
> cpu:		~2.5Kb	+ 1036 bytes percpu
> cpuset:		~3Kb	+   12 bytes percpu
> blkcg:		~3Kb	+   12 bytes percpu
> pid:		~1.5Kb	+   12 bytes percpu
> perf:		 ~320b	+   60 bytes percpu
> -------------------------------------------
> total:		~38Kb	+ 6142 bytes percpu
> currently accounted:	  4668 bytes percpu
> 
> - It's important to account usual allocations called
> in the common part, because almost all cgroup-specific allocations
> are small. One exception here is the memory cgroup: it allocates a few
> huge objects that should be accounted.
> - Percpu allocations called in the common part and in the memcg and cpu
> cgroups should be accounted; the rest are small and can be ignored.
> - KERNFS objects are allocated both in the common part and in most
> cgroups.
> 
> Details can be found here:
> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
> 
> I checked the other cgroup types and found that they can all be ignored.
> Additionally, I found an allocation of struct rt_rq called in the cpu cgroup
> if CONFIG_RT_GROUP_SCHED is enabled; it allocates a huge (~1700 bytes)
> percpu structure and should be accounted too.
> 
> v5:
>  1) re-based to linux-mm (mm-everything-2022-06-22-20-36)
> 
> v4:
>  1) re-based to linux-next (next-20220610)
>    now psi_group is not a part of struct cgroup and is allocated on demand
>  2) added received approval from Muchun Song
>  3) improved cover letter description according to akpm@ request
> 
> v3:
>  1) re-based to current upstream (v5.18-11267-gb00ed48bb0a7)
>  2) fixed few typos
>  3) added received approvals
> 
> v2:
>  1) re-split to simplify possible bisect, re-ordered
>  2) added accounting for percpu psi_group_cpu and cgroup_rstat_cpu,
>      allocated in common part
>  3) added accounting for percpu allocation of struct rt_rq
>      (actual if CONFIG_RT_GROUP_SCHED is enabled)
>  4) improved patches descriptions 
> 
> Vasily Averin (9):
>   memcg: enable accounting for struct cgroup
>   memcg: enable accounting for kernfs nodes
>   memcg: enable accounting for kernfs iattrs
>   memcg: enable accounting for struct simple_xattr
>   memcg: enable accounting for percpu allocation of struct psi_group_cpu
>   memcg: enable accounting for percpu allocation of struct
>     cgroup_rstat_cpu
>   memcg: enable accounting for large allocations in mem_cgroup_css_alloc
>   memcg: enable accounting for allocations in alloc_fair_sched_group
>   memcg: enable accounting for perpu allocation of struct rt_rq
> 
>  fs/kernfs/mount.c      | 6 ++++--
>  fs/xattr.c             | 2 +-
>  kernel/cgroup/cgroup.c | 2 +-
>  kernel/cgroup/rstat.c  | 3 ++-
>  kernel/sched/fair.c    | 4 ++--
>  kernel/sched/psi.c     | 2 +-
>  kernel/sched/rt.c      | 2 +-
>  mm/memcontrol.c        | 4 ++--
>  8 files changed, 14 insertions(+), 11 deletions(-)
> 


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-23 15:03                           ` Vasily Averin
@ 2022-06-23 16:07                             ` Michal Hocko
  2022-06-23 16:55                               ` Shakeel Butt
  0 siblings, 1 reply; 139+ messages in thread
From: Michal Hocko @ 2022-06-23 16:07 UTC (permalink / raw)
  To: Vasily Averin
  Cc: kernel, Andrew Morton, linux-kernel, linux-mm, Shakeel Butt,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, cgroups

On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> Dear Michal,
> do you still have any concerns about this patch set?

Yes, I do not think we have concluded this to be really necessary. IIRC
Roman would like to see lingering cgroups addressed in the not-so-distant
future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
have a limit for the number of cgroups in the tree. So why should we
chase after allocations that correspond to the cgroups and somehow try to
cap their number via memory consumption? This looks like something
that will get out of sync eventually and it also doesn't seem like the
best control to me (compared to an explicit limit to prevent runaways).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-23 16:07                             ` Michal Hocko
@ 2022-06-23 16:55                               ` Shakeel Butt
  2022-06-24 10:40                                 ` Vasily Averin
  2022-06-24 13:59                                 ` Michal Hocko
  0 siblings, 2 replies; 139+ messages in thread
From: Shakeel Butt @ 2022-06-23 16:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vasily Averin, kernel, Andrew Morton, LKML, Linux MM,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, Cgroups

On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> > Dear Michal,
> > do you still have any concerns about this patch set?
>
> Yes, I do not think we have concluded this to be really necessary. IIRC
> Roman would like to see lingering cgroups addressed in the not-so-distant
> future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
> have a limit for the number of cgroups in the tree. So why should we
> chase after allocations that correspond to the cgroups and somehow try to
> cap their number via memory consumption? This looks like something
> that will get out of sync eventually and it also doesn't seem like the
> best control to me (compared to an explicit limit to prevent runaways).
> --

Let me give a counter argument to that. On a system running multiple
workloads, how can the admin come up with a sensible limit for the
number of cgroups? There will definitely be jobs that require a much
larger number of sub-cgroups. Asking the admins to dynamically tune
another tuneable is just asking for more complications. In the end all
the users would just set it to max.

I would recommend looking at commit ac7b79fd190b ("inotify, memcg:
account inotify instances to kmemcg"), where there is already a sysctl
(inotify/max_user_instances) to limit the number of instances but
there was no sensible way to set that limit on a multi-tenant system.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-23 16:55                               ` Shakeel Butt
@ 2022-06-24 10:40                                 ` Vasily Averin
  2022-06-24 12:26                                   ` Michal Koutný
  2022-06-24 13:59                                 ` Michal Hocko
  1 sibling, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-06-24 10:40 UTC (permalink / raw)
  To: Shakeel Butt, Michal Hocko
  Cc: kernel, Andrew Morton, LKML, Linux MM, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Muchun Song, Cgroups

On 6/23/22 19:55, Shakeel Butt wrote:
> On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Thu 23-06-22 18:03:31, Vasily Averin wrote:
>>> Dear Michal,
>>> do you still have any concerns about this patch set?
>>
>> Yes, I do not think we have concluded this to be really necessary. IIRC
>> Roman would like to see lingering cgroups addressed in the not-so-distant
>> future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
>> have a limit for the number of cgroups in the tree. So why should we
>> chase after allocations that correspond to the cgroups and somehow try to
>> cap their number via memory consumption? This looks like something
>> that will get out of sync eventually and it also doesn't seem like the
>> best control to me (compared to an explicit limit to prevent runaways).
>> --
> 
> Let me give a counter argument to that. On a system running multiple
> workloads, how can the admin come up with a sensible limit for the
> number of cgroups? There will definitely be jobs that require a much
> larger number of sub-cgroups. Asking the admins to dynamically tune
> another tuneable is just asking for more complications. In the end all
> the users would just set it to max.
> 
> I would recommend looking at commit ac7b79fd190b ("inotify, memcg:
> account inotify instances to kmemcg"), where there is already a sysctl
> (inotify/max_user_instances) to limit the number of instances but
> there was no sensible way to set that limit on a multi-tenant system.

I've found that MEM_CGROUP_ID_MAX limits memory cgroups only. Other types
of cgroups do not have similar restrictions. Yes, we can set some per-container
limit for all cgroups, but to me that looks like a workaround, while
proper memory accounting looks like the real solution.

Btw, could you please explain why memory cgroups have the MEM_CGROUP_ID_MAX
limit? Why is it required at all, and why was it set to USHRT_MAX? I believe
that in the future it may really be reachable:

Let's set the per-container cgroup limit to some small number,
for example 512, as OpenVZ does right now. On a real node with 300
containers we can easily get 100*300 = 30000 cgroups and consume ~3Gb of memory
without any misuse. I think that is too much to leave unaccounted.
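
The arithmetic above can be sketched roughly as follows. This is only an
illustration: it assumes that the totals from the cover letter's tracing
table (~38Kb of regular allocations plus 6142 bytes per CPU) apply to every
cgroup, and the helper name cgroup_mem_estimate() is hypothetical. The CPU
count behind the ~3Gb figure is not stated in this message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Figures taken from the cover letter's tracing table; they depend on
 * the kernel config and on the enabled controllers.
 */
#define CGROUP_BASE_BYTES	(38u * 1024u)	/* "~38Kb" regular allocations */
#define CGROUP_PERCPU_BYTES	6142u		/* "6142 bytes percpu" */

/*
 * Hypothetical helper: estimated memory, in bytes, used by nr_cgroups
 * cgroups on a machine with nr_cpus possible CPUs.
 */
static uint64_t cgroup_mem_estimate(uint64_t nr_cgroups, unsigned int nr_cpus)
{
	uint64_t per_cgroup;

	assert(nr_cpus > 0);
	per_cgroup = CGROUP_BASE_BYTES + (uint64_t)CGROUP_PERCPU_BYTES * nr_cpus;
	return nr_cgroups * per_cgroup;
}
```

Under these assumptions, cgroup_mem_estimate(30000, 10) comes to about
3 * 10^9 bytes, i.e. the same order of magnitude as the ~3Gb quoted above.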

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-24 10:40                                 ` Vasily Averin
@ 2022-06-24 12:26                                   ` Michal Koutný
  0 siblings, 0 replies; 139+ messages in thread
From: Michal Koutný @ 2022-06-24 12:26 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Michal Hocko, kernel, Andrew Morton, LKML,
	Linux MM, Roman Gushchin, Vlastimil Babka, Muchun Song, Cgroups

On Fri, Jun 24, 2022 at 01:40:14PM +0300, Vasily Averin <vvs@openvz.org> wrote:
> Btw, could you please explain why memory cgroups have the MEM_CGROUP_ID_MAX
> limit? Why is it required at all, and why was it set to USHRT_MAX? I believe
> that in the future it may really be reachable:

IIRC, one reason is the 2B * nr_swap_pages of memory overhead (in
swap_cgroup_swapon()); that's ~0.05% of swap space occupied additionally
in RAM (fortunately swap needn't cover the whole RAM).
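
A small sketch of where the ~0.05% comes from, assuming 4096-byte pages and
one ID stored per swap page; the helper names are hypothetical and only
model the arithmetic, not the kernel's swap_cgroup code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of RAM needed to store one memcg ID of
 * id_size bytes for each of nr_swap_pages swap pages.
 */
static uint64_t swap_cgroup_overhead(uint64_t nr_swap_pages, unsigned int id_size)
{
	return nr_swap_pages * id_size;
}

/*
 * Overhead relative to the swap space, in parts per 100000
 * (integer arithmetic, rounded down).
 */
static unsigned int overhead_per_100k(unsigned int id_size, unsigned int page_size)
{
	assert(page_size > 0);
	return (unsigned int)((uint64_t)id_size * 100000u / page_size);
}
```

With a 2-byte ID, overhead_per_100k(2, 4096) gives 48, i.e. roughly 0.05%.
A 4-byte ID would double that to roughly 0.1% of the swap space.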

HTH,
Michal

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-23 16:55                               ` Shakeel Butt
  2022-06-24 10:40                                 ` Vasily Averin
@ 2022-06-24 13:59                                 ` Michal Hocko
  2022-06-25  9:43                                   ` [PATCH RFC] memcg: avoid idr ids space depletion Vasily Averin
                                                     ` (2 more replies)
  1 sibling, 3 replies; 139+ messages in thread
From: Michal Hocko @ 2022-06-24 13:59 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vasily Averin, kernel, Andrew Morton, LKML, Linux MM,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, Cgroups

On Thu 23-06-22 09:55:33, Shakeel Butt wrote:
> On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> > > Dear Michal,
> > > do you still have any concerns about this patch set?
> >
> > Yes, I do not think we have concluded this to be really necessary. IIRC
> > Roman would like to see lingering cgroups addressed in the not-so-distant
> > future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
> > have a limit for the number of cgroups in the tree. So why should we
> > chase after allocations that correspond to the cgroups and somehow try to
> > cap their number via memory consumption? This looks like something
> > that will get out of sync eventually and it also doesn't seem like the
> > best control to me (compared to an explicit limit to prevent runaways).
> > --
> 
> Let me give a counter argument to that. On a system running multiple
> workloads, how can the admin come up with a sensible limit for the
> number of cgroups?

How is that any easier through memory consumption? Something that might
change between kernel versions? Is it even possible to prevent id
depletion through memory consumption? Any medium-sized memcg can easily
consume all the ids AFAICS.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH RFC] memcg: avoid idr ids space depletion
  2022-06-24 13:59                                 ` Michal Hocko
@ 2022-06-25  9:43                                   ` Vasily Averin
  2022-06-25 14:04                                   ` [PATCH RFC] memcg: notify about global mem_cgroup_id " Vasily Averin
  2022-06-27 16:37                                   ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Shakeel Butt
  2 siblings, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-25  9:43 UTC (permalink / raw)
  To: Shakeel Butt, Michal Koutný, Michal Hocko
  Cc: kernel, linux-kernel, Andrew Morton, linux-mm, Roman Gushchin,
	Vlastimil Babka, Muchun Song, cgroups

I tried to increase MEM_CGROUP_ID_MAX to INT_MAX and found no
significant difficulties. What do you think about the following patch?
I have not tested it, only checked that it compiles.
I hope it allows us:
- to avoid memcg id space depletion on normal nodes
- to set the per-container cgroup limit to USHRT_MAX to prevent possible misuse
  and in general to use memcg accounting for allocated resources.

Thank you,
	Vasily Averin
---

Michal Hocko pointed out that the memory controller depends on idr ids,
whose space is rather limited:
 #define MEM_CGROUP_ID_MAX       USHRT_MAX

The limit can be reached on nodes hosting several hundred OS containers
with new distributions running hundreds of services in their own memory
cgroups.

This patch increases the space up to INT_MAX on 64-bit kernels.
---
 include/linux/memcontrol.h  | 15 +++++++++------
 include/linux/swap_cgroup.h | 14 +++++---------
 mm/memcontrol.c             |  6 +++---
 mm/swap_cgroup.c            | 10 ++++------
 4 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 744cde2b2368..e3468550ba20 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -59,10 +59,13 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
-
+#ifdef CONFIG_64BIT
+#define MEM_CGROUP_ID_SHIFT	31
+#define MEM_CGROUP_ID_MAX	(INT_MAX - 1)
+#else
 #define MEM_CGROUP_ID_SHIFT	16
 #define MEM_CGROUP_ID_MAX	USHRT_MAX
-
+#endif
 struct mem_cgroup_id {
 	int id;
 	refcount_t ref;
@@ -852,14 +855,14 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 int mem_cgroup_scan_tasks(struct mem_cgroup *,
 			  int (*)(struct task_struct *, void *), void *);
 
-static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+static inline int mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
 		return 0;
 
 	return memcg->id.id;
 }
-struct mem_cgroup *mem_cgroup_from_id(unsigned short id);
+struct mem_cgroup *mem_cgroup_from_id(int id);
 
 #ifdef CONFIG_SHRINKER_DEBUG
 static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
@@ -1374,12 +1377,12 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return 0;
 }
 
-static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
+static inline int mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	return 0;
 }
 
-static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+static inline struct mem_cgroup *mem_cgroup_from_id(int id)
 {
 	WARN_ON_ONCE(id);
 	/* XXX: This should always return root_mem_cgroup */
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index a12dd1c3966c..711dd18380ed 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -6,25 +6,21 @@
 
 #ifdef CONFIG_MEMCG_SWAP
 
-extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-					unsigned short old, unsigned short new);
-extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-					 unsigned int nr_ents);
-extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
+extern int swap_cgroup_cmpxchg(swp_entry_t ent, int old, int new);
+extern int swap_cgroup_record(swp_entry_t ent, int id, unsigned int nr_ents);
+extern int lookup_swap_cgroup_id(swp_entry_t ent);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
 
 #else
 
 static inline
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-				  unsigned int nr_ents)
+int swap_cgroup_record(swp_entry_t ent, int id, unsigned int nr_ents)
 {
 	return 0;
 }
 
-static inline
-unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
+static inline int lookup_swap_cgroup_id(swp_entry_t ent)
 {
 	return 0;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 275d0c847f05..d4c606a06bcd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5224,7 +5224,7 @@ static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
  *
  * Caller must hold rcu_read_lock().
  */
-struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+struct mem_cgroup *mem_cgroup_from_id(int id)
 {
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	return idr_find(&mem_cgroup_idr, id);
@@ -7021,7 +7021,7 @@ int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
 {
 	struct folio *folio = page_folio(page);
 	struct mem_cgroup *memcg;
-	unsigned short id;
+	int id;
 	int ret;
 
 	if (mem_cgroup_disabled())
@@ -7541,7 +7541,7 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 {
 	struct mem_cgroup *memcg;
-	unsigned short id;
+	int id;
 
 	id = swap_cgroup_record(entry, 0, nr_pages);
 	rcu_read_lock();
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 5a9442979a18..76fa5c42e03f 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -15,7 +15,7 @@ struct swap_cgroup_ctrl {
 static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
 
 struct swap_cgroup {
-	unsigned short		id;
+	int		id;
 };
 #define SC_PER_PAGE	(PAGE_SIZE/sizeof(struct swap_cgroup))
 
@@ -94,8 +94,7 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
  * Returns old id at success, 0 at failure.
  * (There is no mem_cgroup using 0 as its id)
  */
-unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-					unsigned short old, unsigned short new)
+int swap_cgroup_cmpxchg(swp_entry_t ent, int old, int new)
 {
 	struct swap_cgroup_ctrl *ctrl;
 	struct swap_cgroup *sc;
@@ -123,8 +122,7 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
  * Returns old value at success, 0 at failure.
  * (Of course, old value can be 0.)
  */
-unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
-				  unsigned int nr_ents)
+int swap_cgroup_record(swp_entry_t ent, int id, unsigned int nr_ents)
 {
 	struct swap_cgroup_ctrl *ctrl;
 	struct swap_cgroup *sc;
@@ -159,7 +157,7 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
  *
  * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
  */
-unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
+int lookup_swap_cgroup_id(swp_entry_t ent)
 {
 	return lookup_swap_cgroup(ent, NULL)->id;
 }
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH RFC] memcg: notify about global mem_cgroup_id space depletion
  2022-06-24 13:59                                 ` Michal Hocko
  2022-06-25  9:43                                   ` [PATCH RFC] memcg: avoid idr ids space depletion Vasily Averin
@ 2022-06-25 14:04                                   ` Vasily Averin
  2022-06-26  1:56                                     ` Roman Gushchin
  2022-06-27 16:37                                   ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Shakeel Butt
  2 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-06-25 14:04 UTC (permalink / raw)
  To: Shakeel Butt, Michal Koutný, Michal Hocko
  Cc: kernel, linux-kernel, Andrew Morton, linux-mm, Roman Gushchin,
	Vlastimil Babka, Muchun Song, cgroups

Currently the host owner is not informed about the exhaustion of the
global mem_cgroup_id space. When this happens, systemd cannot
start a new service, but nothing points to the real cause of
this failure.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 mm/memcontrol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d4c606a06bcd..5229321636f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5317,6 +5317,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 				 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
 	if (memcg->id.id < 0) {
 		error = memcg->id.id;
+		pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
 		goto fail;
 	}
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* Re: [PATCH RFC] memcg: notify about global mem_cgroup_id space depletion
  2022-06-25 14:04                                   ` [PATCH RFC] memcg: notify about global mem_cgroup_id " Vasily Averin
@ 2022-06-26  1:56                                     ` Roman Gushchin
  2022-06-26  7:11                                       ` Vasily Averin
  2022-06-27  2:11                                       ` [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion Vasily Averin
  0 siblings, 2 replies; 139+ messages in thread
From: Roman Gushchin @ 2022-06-26  1:56 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Michal Koutný,
	Michal Hocko, kernel, linux-kernel, Andrew Morton, linux-mm,
	Vlastimil Babka, Muchun Song, cgroups

On Sat, Jun 25, 2022 at 05:04:27PM +0300, Vasily Averin wrote:
> Currently the host owner is not informed about the exhaustion of the
> global mem_cgroup_id space. When this happens, systemd cannot
> start a new service, but nothing points to the real cause of
> this failure.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
>  mm/memcontrol.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d4c606a06bcd..5229321636f2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5317,6 +5317,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  				 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
>  	if (memcg->id.id < 0) {
>  		error = memcg->id.id;
> +		pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
>  		goto fail;
>  	}

Hm, in this case it should return -ENOSPC, which is a rather unique return code.
If it's not returned from the mkdir() call, we should fix this.
Otherwise it's up to systemd to handle it properly.

I'm not opposed to adding a warning, but parsing dmesg is not how
error handling should be done.
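
A minimal userspace sketch of the handling suggested above, assuming the
caller creates cgroups with a plain mkdir(). Both helper names are
hypothetical, and this is not a description of how systemd actually does it:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: create a cgroup directory, return 0 or -errno. */
static int create_cgroup_dir(const char *path)
{
	assert(path != NULL);
	if (mkdir(path, 0755) == 0)
		return 0;
	return -errno;
}

/* Hypothetical helper: turn the result into a hint for the admin. */
static const char *cgroup_mkdir_hint(int err)
{
	switch (err) {
	case 0:
		return "created";
	case -ENOSPC:
		/* per the discussion above: the ID space is exhausted */
		return "no free cgroup IDs left";
	case -ENOMEM:
		return "out of memory";
	default:
		/* anything else: fall back to the generic description */
		return strerror(-err);
	}
}
```

A caller could then log cgroup_mkdir_hint(create_cgroup_dir(path)) at the
point of failure instead of relying on a message in dmesg.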

Thanks!

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH RFC] memcg: notify about global mem_cgroup_id space depletion
  2022-06-26  1:56                                     ` Roman Gushchin
@ 2022-06-26  7:11                                       ` Vasily Averin
  2022-06-27  2:12                                         ` [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached Vasily Averin
  2022-06-27  2:11                                       ` [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion Vasily Averin
  1 sibling, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-06-26  7:11 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Michal Koutný,
	Michal Hocko, kernel, linux-kernel, Andrew Morton, linux-mm,
	Vlastimil Babka, Muchun Song, cgroups

On 6/26/22 04:56, Roman Gushchin wrote:
> On Sat, Jun 25, 2022 at 05:04:27PM +0300, Vasily Averin wrote:
>> Currently the host owner is not informed about the exhaustion of the
>> global mem_cgroup_id space. When this happens, systemd cannot
>> start a new service, but nothing points to the real cause of
>> this failure.
>>
>> Signed-off-by: Vasily Averin <vvs@openvz.org>
>> ---
>>  mm/memcontrol.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index d4c606a06bcd..5229321636f2 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5317,6 +5317,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>>  				 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
>>  	if (memcg->id.id < 0) {
>>  		error = memcg->id.id;
>> +		pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
>>  		goto fail;
>>  	}
> 
> Hm, in this case it should return -ENOSPC, which is a rather unique return code.
> If it's not returned from the mkdir() call, we should fix this.
> Otherwise it's up to systemd to handle it properly.
> 
> I'm not opposed to adding a warning, but parsing dmesg is not how
> error handling should be done.

I agree, I think it's a good idea. Moreover, I think it makes sense to
use -ENOSPC when the local cgroup's limit is reached as well.
Currently cgroup_mkdir() returns -EAGAIN, which looks strange to me.

        if (!cgroup_check_hierarchy_limits(parent)) {
                ret = -EAGAIN;
                goto out_unlock;
        }

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
  2022-06-26  1:56                                     ` Roman Gushchin
  2022-06-26  7:11                                       ` Vasily Averin
@ 2022-06-27  2:11                                       ` Vasily Averin
  2022-06-27  3:23                                         ` Muchun Song
  1 sibling, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-06-27  2:11 UTC (permalink / raw)
  To: Shakeel Butt, Roman Gushchin, Michal Koutný, Michal Hocko
  Cc: kernel, linux-kernel, Andrew Morton, linux-mm, Vlastimil Babka,
	Muchun Song, cgroups

Currently, the host owner is not informed about the exhaustion of the
global mem_cgroup_id space. When this happens, systemd cannot start a
new service and receives a unique -ENOSPC error code.
However, this can happen inside a container, be recorded only in the log
file of that container, and go unnoticed by the host owner unless they
try to start new services themselves.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
v2: Roman Gushchin pointed out that idr_alloc() should return a unique
    -ENOSPC if no free IDs could be found, but can also return -ENOMEM.
    Therefore an error code check was added before the message output and
    the patch description was adapted.
---
 mm/memcontrol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d4c606a06bcd..ffc6b5d6b95e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5317,6 +5317,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 				 1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
 	if (memcg->id.id < 0) {
 		error = memcg->id.id;
+		if (error == -ENOSPC)
+			pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
 		goto fail;
 	}
 
-- 
2.36.1


* [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
  2022-06-26  7:11                                       ` Vasily Averin
@ 2022-06-27  2:12                                         ` Vasily Averin
  2022-06-27  3:33                                           ` Muchun Song
                                                             ` (2 more replies)
  0 siblings, 3 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-27  2:12 UTC (permalink / raw)
  To: Shakeel Butt, Roman Gushchin, Michal Koutný,
	Michal Hocko, Tejun Heo, Zefan Li, Johannes Weiner
  Cc: kernel, linux-kernel, Andrew Morton, linux-mm, Vlastimil Babka,
	Muchun Song, cgroups

When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
not return -EAGAIN, but instead react similarly to reaching the global
limit.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
 kernel/cgroup/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1be0f81fe8e1..243239553ea3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5495,7 +5495,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 		return -ENODEV;
 
 	if (!cgroup_check_hierarchy_limits(parent)) {
-		ret = -EAGAIN;
+		ret = -ENOSPC;
 		goto out_unlock;
 	}
 
-- 
2.36.1


* Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
  2022-06-27  2:11                                       ` [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion Vasily Averin
@ 2022-06-27  3:23                                         ` Muchun Song
  2022-06-27  6:49                                           ` Vasily Averin
  0 siblings, 1 reply; 139+ messages in thread
From: Muchun Song @ 2022-06-27  3:23 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Roman Gushchin, Michal Koutný,
	Michal Hocko, kernel, LKML, Andrew Morton,
	Linux Memory Management List, Vlastimil Babka, Cgroups

On Mon, Jun 27, 2022 at 10:11 AM Vasily Averin <vvs@openvz.org> wrote:
>
> Currently, the host owner is not informed about the exhaustion of the
> global mem_cgroup_id space. When this happens, systemd cannot start a
> new service and receives a unique -ENOSPC error code.
> However, this can happen inside a container, be recorded only in the log
> file of that container, and go unnoticed by the host owner if he
> did not try to start any new services.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
> v2: Roman Gushchin pointed out that idr_alloc() should return unique -ENOSPC

If the caller can know -ENOSPC is returned by mkdir(), then I
think the user (perhaps systemd) is the best place to throw out the
error message instead of in the kernel log. Right?

Thanks.

>     if no free IDs could be found, but can also return -ENOMEM.
>     Therefore an error code check was added before the message output and
>     the patch description was adapted.
> ---
>  mm/memcontrol.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d4c606a06bcd..ffc6b5d6b95e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5317,6 +5317,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>                                  1, MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
>         if (memcg->id.id < 0) {
>                 error = memcg->id.id;
> +               if (error == -ENOSPC)
> +                       pr_notice_ratelimited("mem_cgroup_id space is exhausted\n");
>                 goto fail;
>         }
>
> --
> 2.36.1
>

* Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
  2022-06-27  2:12                                         ` [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached Vasily Averin
@ 2022-06-27  3:33                                           ` Muchun Song
  2022-06-27  9:07                                           ` Tejun Heo
  2022-06-28  0:44                                           ` Roman Gushchin
  2 siblings, 0 replies; 139+ messages in thread
From: Muchun Song @ 2022-06-27  3:33 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Roman Gushchin, Michal Koutný,
	Michal Hocko, Tejun Heo, Zefan Li, Johannes Weiner, kernel, LKML,
	Andrew Morton, Linux Memory Management List, Vlastimil Babka,
	Cgroups

On Mon, Jun 27, 2022 at 10:12 AM Vasily Averin <vvs@openvz.org> wrote:
>
> When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
> not return -EAGAIN, but instead react similarly to reaching the global
> limit.
>
> Signed-off-by: Vasily Averin <vvs@openvz.org>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.

* Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
  2022-06-27  3:23                                         ` Muchun Song
@ 2022-06-27  6:49                                           ` Vasily Averin
  2022-06-28  1:11                                             ` Roman Gushchin
  0 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-06-27  6:49 UTC (permalink / raw)
  To: Muchun Song
  Cc: Shakeel Butt, Roman Gushchin, Michal Koutný,
	Michal Hocko, kernel, LKML, Andrew Morton,
	Linux Memory Management List, Vlastimil Babka, Cgroups

On 6/27/22 06:23, Muchun Song wrote:
> If the caller can know -ENOSPC is returned by mkdir(), then I
> think the user (perhaps systemd) is the best place to throw out the
> error message instead of in the kernel log. Right?

Such an incident may occur inside the container.
OpenVZ nodes can host 300-400 containers, and the host admin cannot
monitor guest logs. The dmesg message is necessary to inform the host
owner that the global limit has been reached, otherwise he can
continue to believe that there are no problems on the node.

Thank you,
	Vasily Averin

* Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
  2022-06-27  2:12                                         ` [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached Vasily Averin
  2022-06-27  3:33                                           ` Muchun Song
@ 2022-06-27  9:07                                           ` Tejun Heo
  2022-06-28  0:44                                           ` Roman Gushchin
  2 siblings, 0 replies; 139+ messages in thread
From: Tejun Heo @ 2022-06-27  9:07 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Roman Gushchin, Michal Koutný,
	Michal Hocko, Zefan Li, Johannes Weiner, kernel, linux-kernel,
	Andrew Morton, linux-mm, Vlastimil Babka, Muchun Song, cgroups

On Mon, Jun 27, 2022 at 05:12:55AM +0300, Vasily Averin wrote:
> When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
> not return -EAGAIN, but instead react similarly to reaching the global
> limit.

While I'm not necessarily against this change, I find the rationale to
be somewhat lacking. Can you please elaborate why -ENOSPC is the right
one while -EAGAIN is incorrect?

Thanks.

-- 
tejun

* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-24 13:59                                 ` Michal Hocko
  2022-06-25  9:43                                   ` [PATCH RFC] memcg: avoid idr ids space depletion Vasily Averin
  2022-06-25 14:04                                   ` [PATCH RFC] memcg: notify about global mem_cgroup_id " Vasily Averin
@ 2022-06-27 16:37                                   ` Shakeel Butt
  2022-07-01 11:03                                     ` Michal Hocko
  2 siblings, 1 reply; 139+ messages in thread
From: Shakeel Butt @ 2022-06-27 16:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vasily Averin, kernel, Andrew Morton, LKML, Linux MM,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, Cgroups

On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 23-06-22 09:55:33, Shakeel Butt wrote:
> > On Thu, Jun 23, 2022 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 23-06-22 18:03:31, Vasily Averin wrote:
> > > > Dear Michal,
> > > > do you still have any concerns about this patch set?
> > >
> > > Yes, I do not think we have concluded this to be really necessary. IIRC
> > > Roman would like to see lingering cgroups addressed in not-so-distant
> > > future (http://lkml.kernel.org/r/Ypd2DW7id4M3KJJW@carbon) and we already
> > > have a limit for the number of cgroups in the tree. So why should we
> > > > chase after allocations that correspond to the cgroups and somehow try to
> > > cap their number via the memory consumption. This looks like something
> > > that will get out of sync eventually and it also doesn't seem like the
> > > best control to me (comparing to an explicit limit to prevent runaways).
> > > --
> >
> > Let me give a counter argument to that. On a system running multiple
> > workloads, how can the admin come up with a sensible limit for the
> > number of cgroups?
>
> How is that any easier through memory consumption? Something that might
> change between kernel versions?

In v2, we do provide a way for admins to right size the containers
without killing them. Actually we are trying to use memory.high for
right sizing the jobs. (It is not the best but workable and there are
opportunities to improve it).

Similar mechanisms for other types of limits are lacking. Usually the
application would be getting the error for which it can not do
anything most of the time.

> Is it even possible to prevent from id
> depletion by the memory consumption? Any medium sized memcg can easily
> consume all the ids AFAICS.

Though the patch series is pitched as protection against OOMs, I think
it is beneficial irrespective. Protection against an adversarial actor
should not be the aim here. IMO this patch series improves the memory
association to the actual user which is better than unattributed
memory treated as system overhead.

* Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
  2022-06-27  2:12                                         ` [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached Vasily Averin
  2022-06-27  3:33                                           ` Muchun Song
  2022-06-27  9:07                                           ` Tejun Heo
@ 2022-06-28  0:44                                           ` Roman Gushchin
  2022-06-28  3:59                                             ` Vasily Averin
  2 siblings, 1 reply; 139+ messages in thread
From: Roman Gushchin @ 2022-06-28  0:44 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Michal Koutný,
	Michal Hocko, Tejun Heo, Zefan Li, Johannes Weiner, kernel,
	linux-kernel, Andrew Morton, linux-mm, Vlastimil Babka,
	Muchun Song, cgroups

On Mon, Jun 27, 2022 at 05:12:55AM +0300, Vasily Averin wrote:
> When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
> not return -EAGAIN, but instead react similarly to reaching the global
> limit.
> 
> Signed-off-by: Vasily Averin <vvs@openvz.org>
> ---
>  kernel/cgroup/cgroup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1be0f81fe8e1..243239553ea3 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5495,7 +5495,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
>  		return -ENODEV;
>  
>  	if (!cgroup_check_hierarchy_limits(parent)) {
> -		ret = -EAGAIN;
> +		ret = -ENOSPC;

I'd not argue whether ENOSPC is better or worse here, but I don't think we need
to change it now. It's been in this state for a long time and is a part of ABI.
EAGAIN is pretty unique as a mkdir() result, so systemd can handle it well.

Thanks!
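
To illustrate how user space could tell the two failures apart with the
current (unpatched) return codes, here is a minimal sketch. The enum and
function names are hypothetical and are not taken from systemd or LXC. It
assumes, as described in this thread, that -EAGAIN comes from
cgroup_check_hierarchy_limits() and -ENOSPC from mem_cgroup_id exhaustion;
other sources of ENOSPC are not ruled out.

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical labels for illustration; not a kernel or systemd API. */
enum cg_mkdir_result {
	CG_OK,
	CG_HIERARCHY_LIMIT,	/* EAGAIN: per-hierarchy limit hit */
	CG_ID_SPACE_EXHAUSTED,	/* ENOSPC: likely global mem_cgroup_id depletion */
	CG_OTHER_ERROR,
};

/* Map an errno value from mkdir() on cgroupfs to a likely cause. */
enum cg_mkdir_result classify_cgroup_mkdir_errno(int err)
{
	switch (err) {
	case 0:
		return CG_OK;
	case EAGAIN:
		return CG_HIERARCHY_LIMIT;
	case ENOSPC:
		return CG_ID_SPACE_EXHAUSTED;
	default:
		return CG_OTHER_ERROR;
	}
}

/* Try to create a cgroup directory and classify the outcome. */
enum cg_mkdir_result try_cgroup_mkdir(const char *path)
{
	if (mkdir(path, 0755) == 0)
		return CG_OK;
	return classify_cgroup_mkdir_errno(errno);
}
```

A caller would pass the path of the new cgroup directory to
try_cgroup_mkdir() and report CG_ID_SPACE_EXHAUSTED to the host admin
rather than only to the local log.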

* Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
  2022-06-27  6:49                                           ` Vasily Averin
@ 2022-06-28  1:11                                             ` Roman Gushchin
  2022-06-28  3:43                                               ` Vasily Averin
  2022-06-28  9:08                                               ` Michal Koutný
  0 siblings, 2 replies; 139+ messages in thread
From: Roman Gushchin @ 2022-06-28  1:11 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Muchun Song, Shakeel Butt, Michal Koutný,
	Michal Hocko, kernel, LKML, Andrew Morton,
	Linux Memory Management List, Vlastimil Babka, Cgroups

On Mon, Jun 27, 2022 at 09:49:18AM +0300, Vasily Averin wrote:
> On 6/27/22 06:23, Muchun Song wrote:
> > If the caller can know -ENOSPC is returned by mkdir(), then I
> > think the user (perhaps systemd) is the best place to throw out the
> > error message instead of in the kernel log. Right?
> 
> Such an incident may occur inside the container.
> OpenVZ nodes can host 300-400 containers, and the host admin cannot
> monitor guest logs. The dmesg message is necessary to inform the host
> owner that the global limit has been reached, otherwise he can
> continue to believe that there are no problems on the node.

Why is this happening? It's hard to believe someone really needs that
many cgroups. Is this when somebody fails to delete old cgroups?

I wanted to say that it's better to introduce a memcg event, but then
I realized it's probably not worth the wasted space. Is this a common
scenario?

I think a better approach will be to add a cgroup event (displayed via
cgroup.events) about reaching the maximum limit of cgroups. E.g.
cgroup.events::max_nr_reached. Then you can set cgroup.max.descendants
to some value below memcg_id space size. It's more work, but IMO it's
a better way to communicate this event. As a bonus, you can easily
get an idea which cgroup depletes the limit.

Thanks!
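
Until such an event exists, the existing cgroup v2 files already allow a
rough check from user space. The sketch below is only an illustration under
stated assumptions: the helper names are hypothetical, the caller is assumed
to have read cgroup.stat and cgroup.max.descendants into buffers, and only a
single cgroup is checked, whereas the kernel walks all ancestors.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up "key value" in flat-keyed text such as cgroup.stat.
 * Returns 0 and stores the value in *out on success, -1 otherwise.
 */
int flat_keyed_lookup(const char *text, const char *key, long *out)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ') {
			const char *val = line + klen + 1;
			char *end;
			long v = strtol(val, &end, 10);

			if (end == val)
				return -1;	/* no digits after the key */
			*out = v;
			return 0;
		}
		line = strchr(line, '\n');
		if (line)
			line++;		/* step past the newline */
	}
	return -1;
}

/* Parse cgroup.max.descendants; "max" means no limit and yields -1. */
long parse_max_descendants(const char *text)
{
	if (strncmp(text, "max", 3) == 0)
		return -1;
	return strtol(text, NULL, 10);
}

/* Nonzero if one more descendant would exceed the configured limit. */
int descendants_limit_reached(long nr_descendants, long max_descendants)
{
	return max_descendants >= 0 && nr_descendants >= max_descendants;
}
```

Feeding it the "nr_descendants" value from cgroup.stat and the content of
cgroup.max.descendants for each delegated subtree gives an idea of which
subtree is close to its limit.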

* Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
  2022-06-28  1:11                                             ` Roman Gushchin
@ 2022-06-28  3:43                                               ` Vasily Averin
  2022-06-28  9:08                                               ` Michal Koutný
  1 sibling, 0 replies; 139+ messages in thread
From: Vasily Averin @ 2022-06-28  3:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Muchun Song, Shakeel Butt, Michal Koutný,
	Michal Hocko, kernel, LKML, Andrew Morton,
	Linux Memory Management List, Vlastimil Babka, Cgroups

On 6/28/22 04:11, Roman Gushchin wrote:
> On Mon, Jun 27, 2022 at 09:49:18AM +0300, Vasily Averin wrote:
>> On 6/27/22 06:23, Muchun Song wrote:
>>> If the caller can know -ENOSPC is returned by mkdir(), then I
>>> think the user (perhaps systemd) is the best place to throw out the
>>> error message instead of in the kernel log. Right?
>>
>> Such an incident may occur inside the container.
>> OpenVZ nodes can host 300-400 containers, and the host admin cannot
>> monitor guest logs. The dmesg message is necessary to inform the host
>> owner that the global limit has been reached, otherwise he can
>> continue to believe that there are no problems on the node.
> 
> Why is this happening? It's hard to believe someone really needs that
> many cgroups. Is this when somebody fails to delete old cgroups?

I have no direct evidence that any node has really reached this limit;
however, I have seen crash dumps with 30000+ cgroups.

Theoretically OpenVZ/LXC nodes can host up to several thousand containers
per node. Practically, production nodes with 300-400 containers are a
common thing. I assume that each container can easily use up to 100-200
memory cgroups, and I think this is normal consumption.

Therefore, I believe that the 64K limit is quite achievable in real life.
The primary goal of my patch is to confirm this theory.
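
The estimate above can be checked with simple arithmetic. This is a sketch
only: the function names are hypothetical, and MEMCG_ID_SPACE is an
assumption that mirrors the "64K limit" mentioned here (USHRT_MAX), not a
value read from any particular kernel tree.

```c
#include <assert.h>
#include <limits.h>

/* Assumption: the global mem_cgroup_id space is about 64K (USHRT_MAX). */
#define MEMCG_ID_SPACE	USHRT_MAX

/* Rough number of memory cgroups on a node. */
long estimated_memcgs(long containers, long memcgs_per_container)
{
	return containers * memcgs_per_container;
}

/* Nonzero if the estimate does not fit into the assumed ID space. */
int exceeds_memcg_id_space(long containers, long memcgs_per_container)
{
	return estimated_memcgs(containers, memcgs_per_container) > MEMCG_ID_SPACE;
}
```

With 300 containers using 100 memory cgroups each the estimate is 30000,
which fits; with 400 containers using 200 each it is 80000, which does not.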

> I wanted to say that it's better to introduce a memcg event, but then
> I realized it's probably not worth the wasted space. Is this a common
> scenario?
> 
> I think a better approach will be to add a cgroup event (displayed via
> cgroup.events) about reaching the maximum limit of cgroups. E.g.
> cgroup.events::max_nr_reached. Then you can set cgroup.max.descendants
> to some value below memcg_id space size. It's more work, but IMO it's
> a better way to communicate this event. As a bonus, you can easily
> get an idea which cgroup depletes the limit.

For my goal (i.e. just to confirm that the 64K limit was reached) this
functionality is too complicated. This confirmation is important because
it should push us to increase the global limit.

However, I think your idea is great. In the long run it would help OpenVZ,
LXC and possibly Shakeel to understand real memcg usage and set
proper limits for containers.

I'm going to prepare such patches, however I'm not sure I'll have enough
time for this task in the near future.

Thank you,
	Vasily Averin

* Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
  2022-06-28  0:44                                           ` Roman Gushchin
@ 2022-06-28  3:59                                             ` Vasily Averin
  2022-06-28  9:16                                               ` Michal Koutný
  0 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-06-28  3:59 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Michal Koutný,
	Michal Hocko, Tejun Heo, Zefan Li, Johannes Weiner, kernel,
	linux-kernel, Andrew Morton, linux-mm, Vlastimil Babka,
	Muchun Song, cgroups

On 6/28/22 03:44, Roman Gushchin wrote:
> On Mon, Jun 27, 2022 at 05:12:55AM +0300, Vasily Averin wrote:
>> When cgroup_mkdir reaches the limits of the cgroup hierarchy, it should
>> not return -EAGAIN, but instead react similarly to reaching the global
>> limit.
>>
>> Signed-off-by: Vasily Averin <vvs@openvz.org>
>> ---
>>  kernel/cgroup/cgroup.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 1be0f81fe8e1..243239553ea3 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -5495,7 +5495,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
>>  		return -ENODEV;
>>  
>>  	if (!cgroup_check_hierarchy_limits(parent)) {
>> -		ret = -EAGAIN;
>> +		ret = -ENOSPC;
> 
> I'd not argue whether ENOSPC is better or worse here, but I don't think we need
> to change it now. It's been in this state for a long time and is a part of ABI.
> EAGAIN is pretty unique as a mkdir() result, so systemd can handle it well.

I would agree with you; however, as I understand it, EAGAIN is used to restart
an interrupted system call. Thus, I worry that returning it can make user space
loop without any chance of making progress.

However, maybe I'm confusing something?

Thank you,
	Vasily Averin

* Re: [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion
  2022-06-28  1:11                                             ` Roman Gushchin
  2022-06-28  3:43                                               ` Vasily Averin
@ 2022-06-28  9:08                                               ` Michal Koutný
  1 sibling, 0 replies; 139+ messages in thread
From: Michal Koutný @ 2022-06-28  9:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Vasily Averin, Muchun Song, Shakeel Butt, Michal Hocko, kernel,
	LKML, Andrew Morton, Linux Memory Management List,
	Vlastimil Babka, Cgroups

On Mon, Jun 27, 2022 at 06:11:27PM -0700, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> I think a better approach will be to add a cgroup event (displayed via
> cgroup.events) about reaching the maximum limit of cgroups. E.g.
> cgroup.events::max_nr_reached.

This sounds like a good generalization.

> Then you can set cgroup.max.descendants to some value below memcg_id
> space size. It's more work, but IMO it's a better way to communicate
> this event. As a bonus, you can easily get an idea which cgroup
> depletes the limit.

Just mind there's a difference between events: what cgroup's limit was
hit and what cgroup was affected by the limit [1] (the former is more
useful for the calibration if I understand the situation).

Michal

[1] https://lore.kernel.org/all/20200205134426.10570-2-mkoutny@suse.com/

* Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
  2022-06-28  3:59                                             ` Vasily Averin
@ 2022-06-28  9:16                                               ` Michal Koutný
  2022-06-28  9:22                                                 ` Tejun Heo
  0 siblings, 1 reply; 139+ messages in thread
From: Michal Koutný @ 2022-06-28  9:16 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Roman Gushchin, Shakeel Butt, Michal Hocko, Tejun Heo, Zefan Li,
	Johannes Weiner, kernel, linux-kernel, Andrew Morton, linux-mm,
	Vlastimil Babka, Muchun Song, cgroups

On Tue, Jun 28, 2022 at 06:59:06AM +0300, Vasily Averin <vvs@openvz.org> wrote:
> I would agree with you; however, as I understand it, EAGAIN is used to restart
> an interrupted system call. Thus, I worry that returning it can make user space
> loop without any chance of making progress.
> 
> However, maybe I'm confusing something?

The mkdir(2) manpage doesn't list EAGAIN at all. ENOSPC makes better
sense here. (And I suspect the dependency on this particular value won't
be very widespread.)

0.02€
Michal

* Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
  2022-06-28  9:16                                               ` Michal Koutný
@ 2022-06-28  9:22                                                 ` Tejun Heo
  2022-06-29  6:13                                                   ` Vasily Averin
  0 siblings, 1 reply; 139+ messages in thread
From: Tejun Heo @ 2022-06-28  9:22 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Vasily Averin, Roman Gushchin, Shakeel Butt, Michal Hocko,
	Zefan Li, Johannes Weiner, kernel, linux-kernel, Andrew Morton,
	linux-mm, Vlastimil Babka, Muchun Song, cgroups

On Tue, Jun 28, 2022 at 11:16:48AM +0200, Michal Koutný wrote:
> The mkdir(2) manpage doesn't list EAGAIN at all. ENOSPC makes better
> sense here. (And I suspect the dependency on this particular value won't
> be very widespread.)

Given how we use these system calls as triggers for random kernel
operations, I don't think adhering to the POSIX standard is necessary or
possible. Using an error code which isn't listed in the man page isn't
particularly high in the list of discrepancies.

Again, I'm not against changing it but I'd like to see better
rationales. On one side, we have "it's been this way for a long time
and there's nothing particularly broken about it". I'm not sure the
arguments we have for the other side are strong enough yet.

Thanks.

-- 
tejun

* Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
  2022-06-28  9:22                                                 ` Tejun Heo
@ 2022-06-29  6:13                                                   ` Vasily Averin
  2022-06-29 19:25                                                     ` Tejun Heo
  0 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-06-29  6:13 UTC (permalink / raw)
  To: Tejun Heo, Michal Koutný
  Cc: Roman Gushchin, Shakeel Butt, Michal Hocko, Zefan Li,
	Johannes Weiner, kernel, linux-kernel, Andrew Morton, linux-mm,
	Vlastimil Babka, Muchun Song, cgroups

On 6/28/22 12:22, Tejun Heo wrote:
> On Tue, Jun 28, 2022 at 11:16:48AM +0200, Michal Koutný wrote:
>> The mkdir(2) manpage doesn't list EAGAIN at all. ENOSPC makes better
>> sense here. (And I suspect the dependency on this particular value won't
>> be very widespread.)
> 
> Given how we use these system calls as triggers for random kernel
> operations, I don't think adhering to the POSIX standard is necessary or
> possible. Using an error code which isn't listed in the man page isn't
> particularly high in the list of discrepancies.
> 
> Again, I'm not against changing it but I'd like to see better
> rationales. On one side, we have "it's been this way for a long time
> and there's nothing particularly broken about it". I'm not sure the
> arguments we have for the other side are strong enough yet.

I would like to withdraw this patch.

I experimented on a Fedora 36 node with LXC and a CentOS Stream 9 container,
and I did not notice any critical systemd trouble with the original -EAGAIN.
When the cgroup's limit is reached, systemd cannot start new services;
for example, lxc-attach generates the following output:

[root@fc34-vvs ~]# lxc-attach c9s
lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 Resource temporarily unavailable - Failed to create leaf cgroup ".lxc"
lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 Resource temporarily unavailable - Failed to attach to cgroup fd 11
lxc-attach: c9s: attach.c: lxc_attach: 1679 Resource temporarily unavailable - Failed to attach cgroup
lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container

I did not find any loop in userspace caused by EAGAIN.
The messages look unclear; however, the situation with the patched kernel is not much better:

[root@fc34-vvs ~]# lxc-attach c9s
lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 No space left on device - Failed to create leaf cgroup ".lxc"
lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 No space left on device - Failed to attach to cgroup fd 11
lxc-attach: c9s: attach.c: lxc_attach: 1679 No space left on device - Failed to attach cgroup
lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container

Thank you,
	Vasily Averin

* Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
  2022-06-29  6:13                                                   ` Vasily Averin
@ 2022-06-29 19:25                                                     ` Tejun Heo
  2022-07-01  2:42                                                       ` Roman Gushchin
  0 siblings, 1 reply; 139+ messages in thread
From: Tejun Heo @ 2022-06-29 19:25 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Koutný,
	Roman Gushchin, Shakeel Butt, Michal Hocko, Zefan Li,
	Johannes Weiner, kernel, linux-kernel, Andrew Morton, linux-mm,
	Vlastimil Babka, Muchun Song, cgroups

On Wed, Jun 29, 2022 at 09:13:02AM +0300, Vasily Averin wrote:
> I experimented on a Fedora 36 node with LXC and a CentOS Stream 9 container,
> and I did not notice any critical systemd trouble with the original -EAGAIN.
> When the cgroup's limit is reached, systemd cannot start new services;
> for example, lxc-attach generates the following output:
> 
> [root@fc34-vvs ~]# lxc-attach c9s
> lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 Resource temporarily unavailable - Failed to create leaf cgroup ".lxc"
> lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 Resource temporarily unavailable - Failed to attach to cgroup fd 11
> lxc-attach: c9s: attach.c: lxc_attach: 1679 Resource temporarily unavailable - Failed to attach cgroup
> lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
> lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container
> 
> I did not find any loop in userspace caused by EAGAIN.
> The messages look unclear; however, the situation with the patched kernel is not much better:
> 
> [root@fc34-vvs ~]# lxc-attach c9s
> lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 No space left on device - Failed to create leaf cgroup ".lxc"
> lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 No space left on device - Failed to attach to cgroup fd 11
> lxc-attach: c9s: attach.c: lxc_attach: 1679 No space left on device - Failed to attach cgroup
> lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
> lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container

I'd say "resource temporarily unavailable" is better fitting than "no
space left on device" and the syscall restart thing isn't handled by
-EAGAIN return value. Grep restart_block for that.

Thanks.

-- 
tejun
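
A small user-space sketch of the distinction: an interrupted system call
that the kernel did not restart is reported to user space as EINTR, while
EAGAIN is returned as an ordinary error. The helper names are hypothetical;
the expected message texts are the ones visible in the lxc-attach output
quoted in this thread and assume glibc with the C locale.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Only EINTR means "the call was interrupted, issuing it again is safe". */
int is_interrupted_call(int err)
{
	return err == EINTR;
}

/* Human-readable text for an errno value, as printed by lxc-attach. */
const char *errno_text(int err)
{
	return strerror(err);
}

/*
 * mkdir() wrapper that re-issues the call only after EINTR.
 * EAGAIN and ENOSPC are returned to the caller without looping.
 */
int mkdir_retry_on_eintr(const char *path, mode_t mode)
{
	int ret;

	do {
		ret = mkdir(path, mode);
	} while (ret < 0 && is_interrupted_call(errno));

	return ret;
}
```

A caller written this way sees EAGAIN from cgroup_mkdir() as a plain failure
and does not spin on it.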

* Re: [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached
  2022-06-29 19:25                                                     ` Tejun Heo
@ 2022-07-01  2:42                                                       ` Roman Gushchin
  0 siblings, 0 replies; 139+ messages in thread
From: Roman Gushchin @ 2022-07-01  2:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vasily Averin, Michal Koutný,
	Shakeel Butt, Michal Hocko, Zefan Li, Johannes Weiner, kernel,
	linux-kernel, Andrew Morton, linux-mm, Vlastimil Babka,
	Muchun Song, cgroups

On Thu, Jun 30, 2022 at 04:25:57AM +0900, Tejun Heo wrote:
> On Wed, Jun 29, 2022 at 09:13:02AM +0300, Vasily Averin wrote:
> > I experimented on a Fedora 36 node with LXC and a CentOS Stream 9 container,
> > and I did not notice any critical systemd trouble with the original -EAGAIN.
> > When the cgroup's limit is reached, systemd cannot start new services;
> > for example, lxc-attach generates the following output:
> > 
> > [root@fc34-vvs ~]# lxc-attach c9s
> > lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 Resource temporarily unavailable - Failed to create leaf cgroup ".lxc"
> > lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 Resource temporarily unavailable - Failed to attach to cgroup fd 11
> > lxc-attach: c9s: attach.c: lxc_attach: 1679 Resource temporarily unavailable - Failed to attach cgroup
> > lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
> > lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container
> > 
> > I did not find any loop in userspace caused by EAGAIN.
> > The messages look unclear; however, the situation with the patched kernel is not much better:
> > 
> > [root@fc34-vvs ~]# lxc-attach c9s
> > lxc-attach: c9s: cgroups/cgfsng.c: cgroup_attach_leaf: 2084 No space left on device - Failed to create leaf cgroup ".lxc"
> > lxc-attach: c9s: cgroups/cgfsng.c: __cgroup_attach_many: 3517 No space left on device - Failed to attach to cgroup fd 11
> > lxc-attach: c9s: attach.c: lxc_attach: 1679 No space left on device - Failed to attach cgroup
> > lxc-attach: c9s: attach.c: do_attach: 1237 No data available - Failed to receive lsm label fd
> > lxc-attach: c9s: attach.c: do_attach: 1375 Failed to attach to container
> 
> I'd say "resource temporarily unavailable" is better fitting than "no
> space left on device"

+1

Thanks!

* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-06-27 16:37                                   ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Shakeel Butt
@ 2022-07-01 11:03                                     ` Michal Hocko
  2022-07-10 18:53                                       ` Vasily Averin
  0 siblings, 1 reply; 139+ messages in thread
From: Michal Hocko @ 2022-07-01 11:03 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vasily Averin, kernel, Andrew Morton, LKML, Linux MM,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, Cgroups

On Mon 27-06-22 09:37:14, Shakeel Butt wrote:
> On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Is it even possible to prevent from id
> > depletion by the memory consumption? Any medium sized memcg can easily
> > consume all the ids AFAICS.
> 
> Though the patch series is pitched as protection against OOMs, I think
> it is beneficial irrespective. Protection against an adversarial actor
> should not be the aim here. IMO this patch series improves the memory
> association to the actual user which is better than unattributed
> memory treated as system overhead.

Considering the amount of memory and "normal" cgroup usage (I guess we
can agree that delegated subtrees do not count their cgroups in
thousands) is this really something that is worth bothering with?

I mean, these patches are really small and not really disruptive so I do
not really see any problem with them. Except that they clearly add a
maintenance overhead. Not directly with the memory they track but any
future cgroup/memcg metadata related objects would need to be tracked as
well and I am worried this will get quickly out of sync. So we will have
a half-assed solution in place that doesn't really help any containment,
nor does it provide good and robust consumption tracking.

All that being said I find these changes rather without a great value or
use.

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-07-01 11:03                                     ` Michal Hocko
@ 2022-07-10 18:53                                       ` Vasily Averin
  2022-07-11 16:24                                         ` Michal Hocko
  0 siblings, 1 reply; 139+ messages in thread
From: Vasily Averin @ 2022-07-10 18:53 UTC (permalink / raw)
  To: Michal Hocko, Shakeel Butt
  Cc: kernel, Andrew Morton, LKML, Linux MM, Roman Gushchin,
	Michal Koutný,
	Vlastimil Babka, Muchun Song, Cgroups

On 7/1/22 14:03, Michal Hocko wrote:
> On Mon 27-06-22 09:37:14, Shakeel Butt wrote:
>> On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
>>> Is it even possible to prevent from id
>>> depletion by the memory consumption? Any medium sized memcg can easily
>>> consume all the ids AFAICS.
>>
>> Though the patch series is pitched as protection against OOMs, I think
>> it is beneficial irrespective. Protection against an adversarial actor
>> should not be the aim here. IMO this patch series improves the memory
>> association to the actual user which is better than unattributed
>> memory treated as system overhead.
> 
> Considering the amount of memory and "normal" cgroup usage (I guess we
> can agree that delegated subtrees do not count their cgroups in
> thousands) is this really something that is worth bothering with?
> 
> I mean, these patches are really small and not really disruptive so I do
> not really see any problem with them. Except that they clearly add a
> maintenance overhead. Not directly with the memory they track but any
> future cgroup/memcg metadata related objects would need to be tracked as
> well and I am worried this will get quickly out of sync. So we will have
> a half-assed solution in place that doesn't really help any containment,
> nor does it provide good and robust consumption tracking.
> 
> All that being said I find these changes rather without a great value or
> use.

Dear Michal,
I still have 2 questions:
1) if you do not want to account any memory allocated for cgroup objects,
should you perhaps revert commit 3e38e0aaca9e "mm: memcg: charge memcg percpu
memory to the parent cgroup"? Or is it an exception perhaps?
(in fact I hope you will not revert this patch, I just would like to know 
your explanations about this accounting)
2) my patch set includes kernfs accounting required for proper netdevices accounting

Allocs  Alloc   Allocation
number  size
--------------------------------------------
1   +  128      (__kernfs_new_node+0x4d)	kernfs node
1   +   88      (__kernfs_iattrs+0x57)		kernfs iattrs
1   +   96      (simple_xattr_alloc+0x28)	simple_xattr, can grow over 4Kb
1       32      (simple_xattr_set+0x59)
1       8       (__kernfs_new_node+0x30)

 2/9] memcg: enable accounting for kernfs nodes
 3/9] memcg: enable accounting for kernfs iattrs
 4/9] memcg: enable accounting for struct simple_xattr

What do you think about them? Should I resend them as a new separate patch set?

Thank you,
	Vasily Averin

* Re: [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup
  2022-07-10 18:53                                       ` Vasily Averin
@ 2022-07-11 16:24                                         ` Michal Hocko
  0 siblings, 0 replies; 139+ messages in thread
From: Michal Hocko @ 2022-07-11 16:24 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, kernel, Andrew Morton, LKML, Linux MM,
	Roman Gushchin, Michal Koutný,
	Vlastimil Babka, Muchun Song, Cgroups

On Sun 10-07-22 21:53:34, Vasily Averin wrote:
> On 7/1/22 14:03, Michal Hocko wrote:
> > On Mon 27-06-22 09:37:14, Shakeel Butt wrote:
> >> On Fri, Jun 24, 2022 at 6:59 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> >>> Is it even possible to prevent from id
> >>> depletion by the memory consumption? Any medium sized memcg can easily
> >>> consume all the ids AFAICS.
> >>
> >> Though the patch series is pitched as protection against OOMs, I think
> >> it is beneficial irrespective. Protection against an adversarial actor
> >> should not be the aim here. IMO this patch series improves the memory
> >> association to the actual user which is better than unattributed
> >> memory treated as system overhead.
> > 
> > Considering the amount of memory and "normal" cgroup usage (I guess we
> > can agree that delegated subtrees do not count their cgroups in
> > thousands) is this really something that is worth bothering with?
> > 
> > I mean, these patches are really small and not really disruptive so I do
> > not really see any problem with them. Except that they clearly add a
> > maintenance overhead. Not directly with the memory they track but any
> > future cgroup/memcg metadata related objects would need to be tracked as
> > well and I am worried this will get quickly out of sync. So we will have
> > a half-assed solution in place that doesn't really help any containment,
> > nor does it provide good and robust consumption tracking.
> > 
> > All that being said I find these changes rather without a great value or
> > use.
> 
> Dear Michal,
> I still have 2 questions:
> 1) if you do not want to account any memory allocated for cgroup objects,
> should you perhaps revert commit 3e38e0aaca9e "mm: memcg: charge memcg percpu
> memory to the parent cgroup"? Or is it an exception perhaps?
> (in fact I hope you will not revert this patch, I just would like to know 
> your explanations about this accounting)

Well, I have to say I was not a great fan of this patch when it was
proposed but I didn't really have strong arguments against it to nack
it. It was simple enough, rather self contained in a few places. Just to
give you an insight into my thinking here. Your patch series is also not
something I would nack (nor have I done that). I am not a super fan of it
either. I voiced against it because it just hit my internal threshold of
how many different places are patched without any systemic approach. If
we consider that it doesn't really help with the initial intention to
protect against adversaries then what is the point of all the churn?

Others might think differently and if you can get acks by other
maintainers then I won't stand in the way. I have voiced my concerns and
I hope my thinking is clear now.

> 2) my patch set includes kernfs accounting required for proper netdevices accounting
> 
> Allocs  Alloc   Allocation
> number  size
> --------------------------------------------
> 1   +  128      (__kernfs_new_node+0x4d)	kernfs node
> 1   +   88      (__kernfs_iattrs+0x57)		kernfs iattrs
> 1   +   96      (simple_xattr_alloc+0x28)	simple_xattr, can grow over 4Kb
> 1       32      (simple_xattr_set+0x59)
> 1       8       (__kernfs_new_node+0x30)
> 
>  2/9] memcg: enable accounting for kernfs nodes
>  3/9] memcg: enable accounting for kernfs iattrs
>  4/9] memcg: enable accounting for struct simple_xattr
> 
> What do you think about them? Should I resend them as a new separate patch set?

kernfs is not really my area so I cannot really comment on those.

-- 
Michal Hocko
SUSE Labs

Thread overview: 139+ messages
2022-04-27 10:37 [PATCH] memcg: accounting for objects allocated for new netdevice Vasily Averin
2022-04-27 14:01 ` Michal Koutný
2022-04-27 16:52   ` Shakeel Butt
2022-04-27 22:35     ` Vasily Averin
2022-05-02 12:15       ` [PATCH memcg v2] " Vasily Averin
2022-05-04 20:50         ` Luis Chamberlain
2022-05-05  3:50         ` patchwork-bot+netdevbpf
2022-05-11  2:51         ` Roman Gushchin
2022-05-02 19:37   ` kernfs memcg accounting Vasily Averin
2022-05-02 21:22     ` Michal Koutný
2022-05-04  9:00       ` Vasily Averin
2022-05-04 14:10         ` Michal Koutný
2022-05-04 21:16           ` Vasily Averin
2022-05-05  9:47             ` Michal Koutný
2022-05-06  8:37               ` Vasily Averin
2022-05-11  3:06         ` Roman Gushchin
2022-05-11  6:01           ` Vasily Averin
2022-05-11 16:49             ` Michal Koutný
2022-05-11 17:46             ` Roman Gushchin
2022-05-11 16:34           ` Michal Koutný
2022-05-11 18:10             ` Roman Gushchin
2022-05-13 15:51               ` [PATCH 0/4] memcg: accounting for objects allocated by mkdir cgroup Vasily Averin
2022-05-13 17:49                 ` Roman Gushchin
2022-05-21 16:37                   ` [PATCH mm v2 0/9] " Vasily Averin
2022-05-30 11:25                     ` [PATCH mm v3 " Vasily Averin
2022-05-30 11:55                       ` Michal Hocko
2022-05-30 13:09                         ` Vasily Averin
2022-05-30 14:22                           ` Michal Hocko
2022-05-30 19:58                             ` Vasily Averin
2022-05-31  7:16                               ` Michal Hocko
2022-06-01  3:43                                 ` Vasily Averin
2022-06-01  9:15                                   ` Michal Koutný
2022-06-01  9:32                                     ` Michal Hocko
2022-06-01 13:05                                       ` Michal Hocko
2022-06-01 14:22                                         ` Roman Gushchin
2022-06-01 15:24                                           ` Michal Hocko
2022-06-01  9:26                                   ` Michal Hocko
2022-06-13  5:34                       ` [PATCH mm v4 " Vasily Averin
2022-06-23 14:50                         ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Vasily Averin
2022-06-23 15:03                           ` Vasily Averin
2022-06-23 16:07                             ` Michal Hocko
2022-06-23 16:55                               ` Shakeel Butt
2022-06-24 10:40                                 ` Vasily Averin
2022-06-24 12:26                                   ` Michal Koutný
2022-06-24 13:59                                 ` Michal Hocko
2022-06-25  9:43                                   ` [PATCH RFC] memcg: avoid idr ids space depletion Vasily Averin
2022-06-25 14:04                                   ` [PATCH RFC] memcg: notify about global mem_cgroup_id " Vasily Averin
2022-06-26  1:56                                     ` Roman Gushchin
2022-06-26  7:11                                       ` Vasily Averin
2022-06-27  2:12                                         ` [PATCH cgroup] cgroup: set the correct return code if hierarchy limits are reached Vasily Averin
2022-06-27  3:33                                           ` Muchun Song
2022-06-27  9:07                                           ` Tejun Heo
2022-06-28  0:44                                           ` Roman Gushchin
2022-06-28  3:59                                             ` Vasily Averin
2022-06-28  9:16                                               ` Michal Koutný
2022-06-28  9:22                                                 ` Tejun Heo
2022-06-29  6:13                                                   ` Vasily Averin
2022-06-29 19:25                                                     ` Tejun Heo
2022-07-01  2:42                                                       ` Roman Gushchin
2022-06-27  2:11                                       ` [PATCH mm v2] memcg: notify about global mem_cgroup_id space depletion Vasily Averin
2022-06-27  3:23                                         ` Muchun Song
2022-06-27  6:49                                           ` Vasily Averin
2022-06-28  1:11                                             ` Roman Gushchin
2022-06-28  3:43                                               ` Vasily Averin
2022-06-28  9:08                                               ` Michal Koutný
2022-06-27 16:37                                   ` [PATCH mm v5 0/9] memcg: accounting for objects allocated by mkdir, cgroup Shakeel Butt
2022-07-01 11:03                                     ` Michal Hocko
2022-07-10 18:53                                       ` Vasily Averin
2022-07-11 16:24                                         ` Michal Hocko
2022-06-23 14:50                         ` [PATCH mm v5 1/9] memcg: enable accounting for struct cgroup Vasily Averin
2022-06-23 14:50                         ` [PATCH mm v5 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
2022-06-23 14:51                         ` [PATCH mm v5 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
2022-06-23 14:51                         ` [PATCH mm v5 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
2022-06-23 14:51                         ` [PATCH mm v5 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
2022-06-23 14:51                         ` [PATCH mm v5 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
2022-06-23 14:51                         ` [PATCH mm v5 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
2022-06-23 14:51                         ` [PATCH mm v5 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
2022-06-23 14:52                         ` [PATCH mm v5 9/9] memcg: enable accounting for perpu allocation of struct rt_rq Vasily Averin
2022-06-13  5:34                       ` [PATCH mm v4 1/9] memcg: enable accounting for struct cgroup Vasily Averin
2022-06-13  5:34                       ` [PATCH mm v4 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
2022-06-13  5:34                       ` [PATCH mm v4 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
2022-06-13  5:35                       ` [PATCH mm v4 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
2022-06-13  5:35                       ` [PATCH mm v4 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
2022-06-13  5:35                       ` [PATCH mm v4 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
2022-06-13  5:35                       ` [PATCH mm v4 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
2022-06-13  5:35                       ` [PATCH mm v4 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
2022-06-13  5:35                       ` [PATCH mm v4 9/9] memcg: enable accounting for perpu allocation of struct rt_rq Vasily Averin
     [not found]                     ` <cover.1653899364.git.vvs@openvz.org>
2022-05-30 11:25                       ` [PATCH mm v3 1/9] memcg: enable accounting for struct cgroup Vasily Averin
2022-05-30 11:26                       ` [PATCH mm v3 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
2022-05-30 11:26                       ` [PATCH mm v3 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
2022-05-30 11:26                       ` [PATCH mm v3 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
2022-05-30 11:26                       ` [PATCH mm v3 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
2022-05-30 11:26                       ` [PATCH mm v3 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
2022-05-30 15:04                         ` Muchun Song
2022-05-30 11:26                       ` [PATCH mm v3 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
2022-05-30 11:26                       ` [PATCH mm v3 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
2022-05-30 11:27                       ` [PATCH mm v3 9/9] memcg: enable accounting for perpu allocation of struct rt_rq Vasily Averin
2022-05-30 15:06                         ` Muchun Song
2022-05-21 16:37                   ` [PATCH mm v2 1/9] memcg: enable accounting for struct cgroup Vasily Averin
2022-05-22  6:37                     ` Muchun Song
2022-05-21 16:37                   ` [PATCH mm v2 2/9] memcg: enable accounting for kernfs nodes Vasily Averin
2022-05-22  6:37                     ` Muchun Song
2022-05-21 16:37                   ` [PATCH mm v2 3/9] memcg: enable accounting for kernfs iattrs Vasily Averin
2022-05-22  6:38                     ` Muchun Song
2022-05-21 16:38                   ` [PATCH mm v2 4/9] memcg: enable accounting for struct simple_xattr Vasily Averin
2022-05-22  6:38                     ` Muchun Song
2022-05-21 16:38                   ` [PATCH mm v2 5/9] memcg: enable accounting for percpu allocation of struct psi_group_cpu Vasily Averin
2022-05-21 21:34                     ` Shakeel Butt
2022-05-22  6:40                     ` Muchun Song
2022-05-25  1:30                     ` Roman Gushchin
2022-05-21 16:38                   ` [PATCH mm v2 6/9] memcg: enable accounting for percpu allocation of struct cgroup_rstat_cpu Vasily Averin
2022-05-21 17:58                     ` Vasily Averin
2022-05-21 21:35                     ` Shakeel Butt
2022-05-21 22:05                     ` kernel test robot
2022-05-25  1:31                     ` Roman Gushchin
2022-05-21 16:38                   ` [PATCH mm v2 7/9] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
2022-05-22  6:47                     ` Muchun Song
2022-05-21 16:38                   ` [PATCH mm v2 8/9] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
2022-05-22  6:49                     ` Muchun Song
2022-05-21 16:39                   ` [PATCH mm v2 9/9] memcg: enable accounting for percpu allocation of struct rt_rq Vasily Averin
2022-05-21 21:37                     ` Shakeel Butt
2022-05-25  1:31                     ` Roman Gushchin
2022-05-13 15:51               ` [PATCH 1/4] memcg: enable accounting for large allocations in mem_cgroup_css_alloc Vasily Averin
2022-05-19 16:46                 ` Michal Koutný
2022-05-20  1:07                 ` Shakeel Butt
2022-05-13 15:51               ` [PATCH 2/4] memcg: enable accounting for kernfs nodes and iattrs Vasily Averin
2022-05-19 16:33                 ` Michal Koutný
2022-05-20  1:12                 ` Shakeel Butt
2022-05-13 15:52               ` [PATCH 3/4] memcg: enable accounting for struct cgroup Vasily Averin
2022-05-19 16:53                 ` Michal Koutný
2022-05-20  7:24                   ` Vasily Averin
2022-05-20 20:16                     ` Vasily Averin
2022-05-21  0:55                       ` Roman Gushchin
2022-05-21  7:28                         ` Vasily Averin
2022-05-23 13:52                       ` Michal Koutný
2022-05-20  1:31                 ` Shakeel Butt
2022-05-13 15:52               ` [PATCH 4/4] memcg: enable accounting for allocations in alloc_fair_sched_group Vasily Averin
2022-05-19 16:45                 ` Michal Koutný
2022-05-20  1:18                 ` Shakeel Butt
