* [PATCH v4 00/16] memcg accounting from OpenVZ
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
@ 2021-04-28  6:51 ` Vasily Averin
  2021-07-15 17:11   ` Shakeel Butt
  2021-04-28  6:51 ` [PATCH v4 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro, Alexey Dobriyan, Andrei Vagin,
	Andrew Morton, Borislav Petkov, Christian Brauner, David Ahern,
	David S. Miller, Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Tejun Heo, Thomas Gleixner, Zefan Li, netdev,
	linux-kernel

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux
kernels. Initially we used our own accounting subsystem, then partially
committed it upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory 
allocation triggered by untrusted users inside the containers.

The patch set is addressed mostly to the cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from maintainers
of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets 
- nsproxy and the namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and drivers' fasync queues
- tty objects
- per-mm LDT

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We're also going to add accounting for nft; however, it is not ready yet.

We have not tested performance on upstream. However, our performance team
compared our current RHEL7-based production kernel against the corresponding
original RHEL7 kernel and reports that it is at least no worse.

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which uses this kind of
   memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      |  6 +++---
 drivers/tty/tty_io.c       |  4 ++--
 fs/fcntl.c                 |  3 ++-
 fs/locks.c                 |  6 ++++--
 fs/namespace.c             |  7 ++++---
 fs/select.c                |  4 ++--
 ipc/msg.c                  |  2 +-
 ipc/namespace.c            |  2 +-
 ipc/sem.c                  | 10 ++++++----
 ipc/shm.c                  |  2 +-
 kernel/cgroup/namespace.c  |  2 +-
 kernel/nsproxy.c           |  2 +-
 kernel/pid_namespace.c     |  2 +-
 kernel/signal.c            |  2 +-
 kernel/time/namespace.c    |  4 ++--
 kernel/time/posix-timers.c |  4 ++--
 kernel/user_namespace.c    |  2 +-
 mm/memcontrol.c            |  2 +-
 net/8021q/vlan.c           |  2 +-
 net/core/dev.c             |  6 +++---
 net/core/fib_rules.c       |  4 ++--
 net/core/scm.c             |  4 ++--
 net/dccp/proto.c           |  2 +-
 net/ipv4/devinet.c         |  2 +-
 net/ipv4/fib_trie.c        |  4 ++--
 net/ipv4/tcp.c             |  4 +++-
 net/ipv6/addrconf.c        |  2 +-
 net/ipv6/ip6_fib.c         |  4 ++--
 net/ipv6/route.c           |  2 +-
 net/ipv6/sit.c             |  5 +++--
 30 files changed, 58 insertions(+), 49 deletions(-)

-- 
1.8.3.1



* [PATCH v4 01/16] memcg: enable accounting for net_device and Tx/Rx queues
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
  2021-04-28  6:51 ` [PATCH v4 00/16] memcg accounting from OpenVZ Vasily Averin
@ 2021-04-28  6:51 ` Vasily Averin
  2021-04-28  6:51 ` [PATCH v4 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev, linux-kernel

A container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat this again and again.
A net device can request the creation of up to 4096 Tx and Rx queues
and force the kernel to allocate up to several tens of megabytes of
memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1f79b9a..87b1e80 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9994,7 +9994,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	rx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -10061,7 +10061,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	tx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -10693,7 +10693,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	p = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
-- 
1.8.3.1



* [PATCH v4 02/16] memcg: enable accounting for IP address and routing-related objects
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
  2021-04-28  6:51 ` [PATCH v4 00/16] memcg accounting from OpenVZ Vasily Averin
  2021-04-28  6:51 ` [PATCH v4 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
@ 2021-04-28  6:51 ` Vasily Averin
  2021-04-28  6:51 ` [PATCH v4 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev, linux-kernel

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and ip_fib caches.

These objects can be removed manually, though usually they live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One such object is 'struct fib6_node', which is mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside a write_lock_bh()/write_unlock_bh()
section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to the corresponding
kmem cache. The proper memory cgroup still cannot be found due to the
incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

The obsolete in_interrupt() does not describe the real execution context
properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, the new macro should be used instead:
 in_task()	- We're in task context

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d..15108ad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1076,7 +1076,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index cd80ffe..65d8b1d 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 2e35f68da..9b90413 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a9e53f5..d56a15a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 679699e..0982b7c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2444,8 +2444,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 373d480..5dc5c68 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6510,7 +6510,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1



* [PATCH v4 03/16] memcg: enable accounting for inet_bin_bucket cache
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (2 preceding siblings ...)
  2021-04-28  6:51 ` [PATCH v4 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
@ 2021-04-28  6:51 ` Vasily Averin
  2021-04-28  6:52 ` [PATCH v4 04/16] memcg: enable accounting for VLAN group array Vasily Averin
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev, linux-kernel

A net namespace can use up to 64K TCP and DCCP ports and force the kernel
to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/dccp/proto.c | 2 +-
 net/ipv4/tcp.c   | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 6d705d9..f90d1e8 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1126,7 +1126,7 @@ static int __init dccp_init(void)
 	dccp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("dccp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index de7cc84..5817a86b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4498,7 +4498,9 @@ void __init tcp_init(void)
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
-- 
1.8.3.1



* [PATCH v4 04/16] memcg: enable accounting for VLAN group array
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (3 preceding siblings ...)
  2021-04-28  6:51 ` [PATCH v4 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
@ 2021-04-28  6:52 ` Vasily Averin
  2021-04-28  6:52 ` [PATCH v4 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev, linux-kernel

The VLAN group array consumes up to 8 pages of memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/8021q/vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 8b644113..d0a579d4 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -67,7 +67,7 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 		return 0;
 
 	size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN;
-	array = kzalloc(size, GFP_KERNEL);
+	array = kzalloc(size, GFP_KERNEL_ACCOUNT);
 	if (array == NULL)
 		return -ENOBUFS;
 
-- 
1.8.3.1



* [PATCH v4 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (4 preceding siblings ...)
  2021-04-28  6:52 ` [PATCH v4 04/16] memcg: enable accounting for VLAN group array Vasily Averin
@ 2021-04-28  6:52 ` Vasily Averin
  2021-04-28  6:52 ` [PATCH v4 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev, linux-kernel

Author: Andrey Ryabinin <aryabinin@virtuozzo.com>

The size of the ip_tunnel_prl structs allocation is controllable from
user-space, thus it's better to avoid spam in dmesg if the allocation fails.
Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
accounting. The allocation is temporary and limited to 4GB.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/sit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 9fdccf0..2ba147c 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -320,7 +320,7 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 	 * we try harder to allocate.
 	 */
 	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
-		kcalloc(cmax, sizeof(*kp), GFP_KERNEL | __GFP_NOWARN) :
+		kcalloc(cmax, sizeof(*kp), GFP_KERNEL_ACCOUNT | __GFP_NOWARN) :
 		NULL;
 
 	rcu_read_lock();
@@ -333,7 +333,8 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 		 * For root users, retry allocating enough memory for
 		 * the answer.
 		 */
-		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC);
+		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC | __GFP_ACCOUNT |
+					      __GFP_NOWARN);
 		if (!kp) {
 			ret = -ENOMEM;
 			goto out;
-- 
1.8.3.1



* [PATCH v4 06/16] memcg: enable accounting for scm_fp_list objects
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (5 preceding siblings ...)
  2021-04-28  6:52 ` [PATCH v4 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
@ 2021-04-28  6:52 ` Vasily Averin
  2021-04-28  6:52 ` [PATCH v4 07/16] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev, linux-kernel

Unix sockets allow sending file descriptors via SCM_RIGHTS type messages.
Each such send call forces the kernel to allocate up to 2KB of memory for
struct scm_fp_list.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/scm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index 8156d4f..e837e4f 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -79,7 +79,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 
 	if (!fpl)
 	{
-		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL);
+		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL_ACCOUNT);
 		if (!fpl)
 			return -ENOMEM;
 		*fplp = fpl;
@@ -348,7 +348,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		return NULL;
 
 	new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),
-			  GFP_KERNEL);
+			  GFP_KERNEL_ACCOUNT);
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
-- 
1.8.3.1



* [PATCH v4 07/16] memcg: enable accounting for new namesapces and struct nsproxy
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (6 preceding siblings ...)
  2021-04-28  6:52 ` [PATCH v4 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
@ 2021-04-28  6:52 ` Vasily Averin
  2021-05-07 13:45   ` Serge E. Hallyn
  2021-05-07 15:03   ` Christian Brauner
  2021-04-28  6:52 ` [PATCH v4 08/16] memcg: enable accounting of ipc resources Vasily Averin
                   ` (8 subsequent siblings)
  16 siblings, 2 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro, Tejun Heo, Andrew Morton,
	Zefan Li, Thomas Gleixner, Christian Brauner, Kirill Tkhai,
	Serge Hallyn, Andrei Vagin, linux-kernel

A container admin can create new namespaces and force the kernel to allocate
up to several pages of memory for the namespaces and their associated
structures.
Net and uts namespaces already have accounting enabled for such allocations.
It makes sense to account for the remaining ones to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c            | 2 +-
 ipc/namespace.c           | 2 +-
 kernel/cgroup/namespace.c | 2 +-
 kernel/nsproxy.c          | 2 +-
 kernel/pid_namespace.c    | 2 +-
 kernel/time/namespace.c   | 4 ++--
 kernel/user_namespace.c   | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 56bb5a5..5ecfa349 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3286,7 +3286,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
 	if (!ucounts)
 		return ERR_PTR(-ENOSPC);
 
-	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns) {
 		dec_mnt_namespaces(ucounts);
 		return ERR_PTR(-ENOMEM);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7bd0766..ae83f0f 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
+	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
 	if (ns == NULL)
 		goto fail_dec;
 
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index f5e8828..0d5c298 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
 	struct cgroup_namespace *new_ns;
 	int ret;
 
-	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns)
 		return ERR_PTR(-ENOMEM);
 	ret = ns_alloc_inum(&new_ns->ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index abc01fc..eec72ca 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
 
 int __init nsproxy_cache_init(void)
 {
-	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
+	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
 	return 0;
 }
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index ca43239..6cd6715 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
 
 static __init int pid_namespaces_init(void)
 {
-	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	register_sysctl_paths(kern_path, pid_ns_ctl_table);
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 12eab0d..aec8328 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
 	if (!ns)
 		goto fail_dec;
 
 	refcount_set(&ns->ns.count, 1);
 
-	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!ns->vvar_page)
 		goto fail_free;
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 9a4b980..9c6a42b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1378,7 +1378,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
 
 static __init int user_namespaces_init(void)
 {
-	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
+	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 	return 0;
 }
 subsys_initcall(user_namespaces_init);
-- 
1.8.3.1



* [PATCH v4 08/16] memcg: enable accounting of ipc resources
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (7 preceding siblings ...)
  2021-04-28  6:52 ` [PATCH v4 07/16] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
@ 2021-04-28  6:52 ` Vasily Averin
  2021-04-28  6:53 ` [PATCH v4 09/16] memcg: enable accounting for mnt_cache entries Vasily Averin
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Andrew Morton, Alexey Dobriyan, Dmitry Safonov,
	linux-kernel

When a user creates IPC objects, it forces the kernel to allocate memory for
these long-lived objects.

It makes sense to account for them to restrict the host's memory consumption
from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, message
queues, semaphores and semaphore undo lists.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 ipc/msg.c |  2 +-
 ipc/sem.c | 10 ++++++----
 ipc/shm.c |  2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index acd1bc7..87898cb 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kvmalloc(sizeof(*msq), GFP_KERNEL);
+	msq = kvmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
diff --git a/ipc/sem.c b/ipc/sem.c
index f6c30a8..52a6599 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -511,7 +511,7 @@ static struct sem_array *sem_alloc(size_t nsems)
 	if (nsems > (INT_MAX - sizeof(*sma)) / sizeof(sma->sems[0]))
 		return NULL;
 
-	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL);
+	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!sma))
 		return NULL;
 
@@ -1850,7 +1850,7 @@ static inline int get_undo_list(struct sem_undo_list **undo_listp)
 
 	undo_list = current->sysvsem.undo_list;
 	if (!undo_list) {
-		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL);
+		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL_ACCOUNT);
 		if (undo_list == NULL)
 			return -ENOMEM;
 		spin_lock_init(&undo_list->lock);
@@ -1935,7 +1935,8 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 	rcu_read_unlock();
 
 	/* step 2: allocate new undo structure */
-	new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, GFP_KERNEL);
+	new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
+		      GFP_KERNEL_ACCOUNT);
 	if (!new) {
 		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
 		return ERR_PTR(-ENOMEM);
@@ -1999,7 +2000,8 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 	if (nsops > ns->sc_semopm)
 		return -E2BIG;
 	if (nsops > SEMOPM_FAST) {
-		sops = kvmalloc_array(nsops, sizeof(*sops), GFP_KERNEL);
+		sops = kvmalloc_array(nsops, sizeof(*sops),
+				      GFP_KERNEL_ACCOUNT);
 		if (sops == NULL)
 			return -ENOMEM;
 	}
diff --git a/ipc/shm.c b/ipc/shm.c
index febd88d..7632d72 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 			ns->shm_tot + numpages > ns->shm_ctlall)
 		return -ENOSPC;
 
-	shp = kvmalloc(sizeof(*shp), GFP_KERNEL);
+	shp = kvmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!shp))
 		return -ENOMEM;
 
-- 
1.8.3.1



* [PATCH v4 09/16] memcg: enable accounting for mnt_cache entries
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (8 preceding siblings ...)
  2021-04-28  6:52 ` [PATCH v4 08/16] memcg: enable accounting of ipc resources Vasily Averin
@ 2021-04-28  6:53 ` Vasily Averin
  2021-04-28  6:53 ` [PATCH v4 10/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:53 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro, linux-kernel

The kernel allocates ~400 bytes of 'struct mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 5ecfa349..fc1b50d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4213,7 +4214,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v4 10/16] memcg: enable accounting for pollfd and select bits arrays
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (9 preceding siblings ...)
  2021-04-28  6:53 ` [PATCH v4 09/16] memcg: enable accounting for mnt_cache entries Vasily Averin
@ 2021-04-28  6:53 ` Vasily Averin
  2021-04-28  6:53 ` [PATCH v4 11/16] memcg: enable accounting for signals Vasily Averin
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:53 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro, linux-kernel

A user can call the select/poll system calls with a large number of
file descriptors and force the kernel to allocate up to several pages of
memory until these sleeping system calls return. These are long-living,
unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
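
As a sketch of the select() side, the "alloc_size = 6 * size" line visible
in the diff context below can be reproduced in userspace C. The rounding of
one bitmap up to whole longs is an assumption about how 'size' is derived,
and the names fds_bytes() and select_alloc_size() are hypothetical.

```c
#include <stddef.h>

/* Number of bits in one long on the build target. */
#define SKETCH_BITS_PER_LONG (8 * sizeof(long))

/* Bytes needed for one bitmap covering n descriptors, rounded to longs. */
static size_t fds_bytes(size_t n)
{
	size_t longs = (n + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;

	return longs * sizeof(long);
}

/* Six bitmaps: three input sets (in/out/ex) plus three result sets. */
static size_t select_alloc_size(size_t n)
{
	return 6 * fds_bytes(n);
}
```

For 1024 descriptors this gives 768 bytes; for 65536 descriptors it is
already 49152 bytes, i.e. twelve 4096-byte pages held for the duration of
the call.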

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v4 11/16] memcg: enable accounting for signals
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (10 preceding siblings ...)
  2021-04-28  6:53 ` [PATCH v4 10/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
@ 2021-04-28  6:53 ` Vasily Averin
  2021-04-28  6:53 ` [PATCH v4 12/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:53 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Jens Axboe, Eric W. Biederman, Oleg Nesterov,
	linux-kernel

When a user sends a signal to another process, it forces the kernel
to allocate memory for 'struct sigqueue' objects. The number of signals
is limited by the RLIMIT_SIGPENDING resource limit, but even the default
settings allow each user to consume up to several megabytes of memory.
Moreover, an untrusted admin inside a container can increase the limit or
create new fake users and force them to send signals.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f271835..a7fa849 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4639,7 +4639,7 @@ void __init signals_init(void)
 {
 	siginfo_buildtime_checks();
 
-	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
+	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
 }
 
 #ifdef CONFIG_KGDB_KDB
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v4 12/16] memcg: enable accounting for posix_timers_cache slab
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (11 preceding siblings ...)
  2021-04-28  6:53 ` [PATCH v4 11/16] memcg: enable accounting for signals Vasily Averin
@ 2021-04-28  6:53 ` Vasily Averin
  2021-05-07 15:48   ` Thomas Gleixner
  2021-04-28  6:53 ` [PATCH v4 13/16] memcg: enable accounting for file lock caches Vasily Averin
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:53 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Thomas Gleixner, linux-kernel

A program may create multiple interval timers using timer_create().
For each timer the kernel preallocates a "queued real-time signal".
Consequently, the number of timers is limited by the RLIMIT_SIGPENDING
resource limit. The allocated object is quite small, ~250 bytes,
but even the default signal limits allow consuming up to 100 megabytes
per user.

It makes sense to account for them to limit the host's memory consumption
from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 kernel/time/posix-timers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index bf540f5a..2eee615 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -273,8 +273,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
 static __init int init_posix_timers(void)
 {
 	posix_timers_cache = kmem_cache_create("posix_timers_cache",
-					sizeof (struct k_itimer), 0, SLAB_PANIC,
-					NULL);
+					sizeof(struct k_itimer), 0,
+					SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v4 13/16] memcg: enable accounting for file lock caches
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (12 preceding siblings ...)
  2021-04-28  6:53 ` [PATCH v4 12/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
@ 2021-04-28  6:53 ` Vasily Averin
  2021-04-28  6:54 ` [PATCH v4 14/16] memcg: enable accounting for fasync_cache Vasily Averin
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:53 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Jeff Layton, J. Bruce Fields, Alexander Viro,
	linux-kernel

A user can create file locks for each open file and force the kernel
to allocate small but long-living objects for each open file.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 6125d2d..fb751f3 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3007,10 +3007,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v4 14/16] memcg: enable accounting for fasync_cache
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (13 preceding siblings ...)
  2021-04-28  6:53 ` [PATCH v4 13/16] memcg: enable accounting for file lock caches Vasily Averin
@ 2021-04-28  6:54 ` Vasily Averin
  2021-04-28  6:54 ` [PATCH v4 15/16] memcg: enable accounting for tty-related objects Vasily Averin
  2021-04-28  6:54 ` [PATCH v4 16/16] memcg: enable accounting for ldt_struct objects Vasily Averin
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:54 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Jeff Layton, J. Bruce Fields, Alexander Viro,
	linux-kernel

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and for regular files by the file lease code.
This structure is quite small but long-living and it can be assigned
for any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index dfc72f1..7941559 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v4 15/16] memcg: enable accounting for tty-related objects
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (14 preceding siblings ...)
  2021-04-28  6:54 ` [PATCH v4 14/16] memcg: enable accounting for fasync_cache Vasily Averin
@ 2021-04-28  6:54 ` Vasily Averin
  2021-04-28  7:38   ` Greg Kroah-Hartman
  2021-04-28  6:54 ` [PATCH v4 16/16] memcg: enable accounting for ldt_struct objects Vasily Averin
  16 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:54 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Greg Kroah-Hartman, Jiri Slaby, linux-kernel

At each login the user forces the kernel to create a new terminal and
allocate up to ~1Kb of memory for the tty-related structures.

By default, up to 4096 ptys can be created, with 1024 of them reserved
for the initial mount namespace only; these settings are controlled by
the host admin.

However, this default is not enough for hosters with thousands
of containers per node. The host admin can be forced to increase it
up to NR_UNIX98_PTY_MAX = 1<<20.

By default a container is restricted by pty mount_opt.max = 1024,
but an admin inside the container can change it via remount. As a result,
one container can consume almost all allowed ptys
and allocate up to 1Gb of unaccounted memory.

This is not enough per se to trigger OOM on the host, but it allows
a container to significantly exceed its assigned memcg limit and leads
to trouble on an over-committed node.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 drivers/tty/tty_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
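
The 1Gb figure in the message above follows from simple arithmetic, shown
here as a userspace C sketch. Both constants are the values quoted in the
message (the ~1Kb per-pty cost is the author's estimate); the macro and
function names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_NR_UNIX98_PTY_MAX (1u << 20) /* upper limit quoted above */
#define SKETCH_TTY_BYTES_PER_PTY 1024u      /* "~1Kb" estimate quoted above */

/* Hypothetical helper: tty-related bytes pinned by nr_ptys open ptys. */
static uint64_t pty_worst_case_bytes(uint64_t nr_ptys)
{
	return nr_ptys * SKETCH_TTY_BYTES_PER_PTY;
}
```

With the default per-container limit of 1024 ptys this is about 1 MiB;
at the raised global limit of 1<<20 it reaches 1 GiB.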

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 391bada..e613b8e 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1502,7 +1502,7 @@ void tty_save_termios(struct tty_struct *tty)
 	/* Stash the termios data */
 	tp = tty->driver->termios[idx];
 	if (tp == NULL) {
-		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
+		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
 		if (tp == NULL)
 			return;
 		tty->driver->termios[idx] = tp;
@@ -3127,7 +3127,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
 {
 	struct tty_struct *tty;
 
-	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
+	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
 	if (!tty)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v4 16/16] memcg: enable accounting for ldt_struct objects
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (15 preceding siblings ...)
  2021-04-28  6:54 ` [PATCH v4 15/16] memcg: enable accounting for tty-related objects Vasily Averin
@ 2021-04-28  6:54 ` Vasily Averin
  16 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-04-28  6:54 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, linux-kernel

Each task can request its own LDT and force the kernel to allocate up to
64Kb of memory per mm.

There are legitimate workloads with hundreds of processes and there
can be hundreds of workloads running on large machines.
The unaccounted memory can cause isolation issues between the workloads
particularly on highly utilized machines.

It makes sense to account for these objects to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/ldt.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
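
The 64Kb upper bound and the vmalloc/page split in the diff below can be
checked with a small userspace C sketch. The constants are assumptions that
mirror the x86 values as I understand them (8192 entries of 8 bytes each,
4096-byte pages); the SKETCH_ names and both helpers are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_LDT_ENTRIES    8192u /* assumed maximum number of entries */
#define SKETCH_LDT_ENTRY_SIZE 8u    /* assumed bytes per descriptor */
#define SKETCH_PAGE_SIZE      4096u /* assumed page size */

/* Size of the entries array for a given number of LDT entries. */
static size_t ldt_alloc_size(size_t num_entries)
{
	return num_entries * SKETCH_LDT_ENTRY_SIZE;
}

/* Mirrors the branch in the diff: only tables above one page use vmalloc. */
static bool ldt_uses_vmalloc(size_t num_entries)
{
	return ldt_alloc_size(num_entries) > SKETCH_PAGE_SIZE;
}
```

Under these assumptions a full table is 65536 bytes, and any table with
more than 512 entries takes the __vmalloc() path rather than a single
zeroed page.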

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index aa15132..525876e 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -154,7 +154,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
 
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);
 	if (!new_ldt)
 		return NULL;
 
@@ -168,9 +168,9 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	 * than PAGE_SIZE.
 	 */
 	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
+		new_ldt->entries = __vmalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
+		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 
 	if (!new_ldt->entries) {
 		kfree(new_ldt);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 15/16] memcg: enable accounting for tty-related objects
  2021-04-28  6:54 ` [PATCH v4 15/16] memcg: enable accounting for tty-related objects Vasily Averin
@ 2021-04-28  7:38   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 101+ messages in thread
From: Greg Kroah-Hartman @ 2021-04-28  7:38 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Jiri Slaby, linux-kernel

On Wed, Apr 28, 2021 at 09:54:16AM +0300, Vasily Averin wrote:
> At each login the user forces the kernel to create a new terminal and
> allocate up to ~1Kb memory for the tty-related structures.
> 
> By default it's allowed to create up to 4096 ptys with 1024 reserve for
> initial mount namespace only and the settings are controlled by host admin.
> 
> Though this default is not enough for hosters with thousands
> of containers per node. Host admin can be forced to increase it
> up to NR_UNIX98_PTY_MAX = 1<<20.
> 
> By default container is restricted by pty mount_opt.max = 1024,
> but admin inside container can change it via remount. As a result,
> one container can consume almost all allowed ptys
> and allocate up to 1Gb of unaccounted memory.
> 
> It is not enough per-se to trigger OOM on host, however anyway, it allows
> to significantly exceed the assigned memcg limit and leads to troubles
> on the over-committed node.
> 
> It makes sense to account for them to restrict the host's memory
> consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 07/16] memcg: enable accounting for new namesapces and struct nsproxy
  2021-04-28  6:52 ` [PATCH v4 07/16] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
@ 2021-05-07 13:45   ` Serge E. Hallyn
  2021-05-07 15:03   ` Christian Brauner
  1 sibling, 0 replies; 101+ messages in thread
From: Serge E. Hallyn @ 2021-05-07 13:45 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Tejun Heo,
	Andrew Morton, Zefan Li, Thomas Gleixner, Christian Brauner,
	Kirill Tkhai, Serge Hallyn, Andrei Vagin, linux-kernel

On Wed, Apr 28, 2021 at 09:52:43AM +0300, Vasily Averin wrote:
> Container admin can create new namespaces and force kernel to allocate
> up to several pages of memory for the namespaces and its associated
> structures.
> Net and uts namespaces have enabled accounting for such allocations.
> It makes sense to account for rest ones to restrict the host's memory
> consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

makes sense.

Acked-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/namespace.c            | 2 +-
>  ipc/namespace.c           | 2 +-
>  kernel/cgroup/namespace.c | 2 +-
>  kernel/nsproxy.c          | 2 +-
>  kernel/pid_namespace.c    | 2 +-
>  kernel/time/namespace.c   | 4 ++--
>  kernel/user_namespace.c   | 2 +-
>  7 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 56bb5a5..5ecfa349 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3286,7 +3286,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
>  	if (!ucounts)
>  		return ERR_PTR(-ENOSPC);
>  
> -	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
> +	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
>  	if (!new_ns) {
>  		dec_mnt_namespaces(ucounts);
>  		return ERR_PTR(-ENOMEM);
> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index 7bd0766..ae83f0f 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
>  		goto fail;
>  
>  	err = -ENOMEM;
> -	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
> +	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
>  	if (ns == NULL)
>  		goto fail_dec;
>  
> diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
> index f5e8828..0d5c298 100644
> --- a/kernel/cgroup/namespace.c
> +++ b/kernel/cgroup/namespace.c
> @@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
>  	struct cgroup_namespace *new_ns;
>  	int ret;
>  
> -	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> +	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
>  	if (!new_ns)
>  		return ERR_PTR(-ENOMEM);
>  	ret = ns_alloc_inum(&new_ns->ns);
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index abc01fc..eec72ca 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
>  
>  int __init nsproxy_cache_init(void)
>  {
> -	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
> +	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
>  	return 0;
>  }
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index ca43239..6cd6715 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
>  
>  static __init int pid_namespaces_init(void)
>  {
> -	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> +	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>  
>  #ifdef CONFIG_CHECKPOINT_RESTORE
>  	register_sysctl_paths(kern_path, pid_ns_ctl_table);
> diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
> index 12eab0d..aec8328 100644
> --- a/kernel/time/namespace.c
> +++ b/kernel/time/namespace.c
> @@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
>  		goto fail;
>  
>  	err = -ENOMEM;
> -	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
> +	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
>  	if (!ns)
>  		goto fail_dec;
>  
>  	refcount_set(&ns->ns.count, 1);
>  
> -	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>  	if (!ns->vvar_page)
>  		goto fail_free;
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 9a4b980..9c6a42b 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1378,7 +1378,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
>  
>  static __init int user_namespaces_init(void)
>  {
> -	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
> +	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>  	return 0;
>  }
>  subsys_initcall(user_namespaces_init);
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 07/16] memcg: enable accounting for new namesapces and struct nsproxy
  2021-04-28  6:52 ` [PATCH v4 07/16] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
  2021-05-07 13:45   ` Serge E. Hallyn
@ 2021-05-07 15:03   ` Christian Brauner
  1 sibling, 0 replies; 101+ messages in thread
From: Christian Brauner @ 2021-05-07 15:03 UTC (permalink / raw)
  To: Vasily Averin
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Tejun Heo,
	Andrew Morton, Zefan Li, Thomas Gleixner, Kirill Tkhai,
	Serge Hallyn, Andrei Vagin, linux-kernel

On Wed, Apr 28, 2021 at 09:52:43AM +0300, Vasily Averin wrote:
> Container admin can create new namespaces and force kernel to allocate
> up to several pages of memory for the namespaces and its associated
> structures.
> Net and uts namespaces have enabled accounting for such allocations.
> It makes sense to account for rest ones to restrict the host's memory
> consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---

Serge's ack reminded me of this. Looks good if the mm folks are fine
with this too,
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 12/16] memcg: enable accounting for posix_timers_cache slab
  2021-04-28  6:53 ` [PATCH v4 12/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
@ 2021-05-07 15:48   ` Thomas Gleixner
  0 siblings, 0 replies; 101+ messages in thread
From: Thomas Gleixner @ 2021-05-07 15:48 UTC (permalink / raw)
  To: Vasily Averin, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, linux-kernel

On Wed, Apr 28 2021 at 09:53, Vasily Averin wrote:

> A program may create multiple interval timers using timer_create().
> For each timer the kernel preallocates a "queued real-time signal",
> Consequently, the number of timers is limited by the RLIMIT_SIGPENDING
> resource limit. The allocated object is quite small, ~250 bytes,
> but even the default signal limits allow to consume up to 100 megabytes
> per user.
>
> It makes sense to account for them to limit the host's memory consumption
> from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 00/16] memcg accounting from OpenVZ
  2021-04-28  6:51 ` [PATCH v4 00/16] memcg accounting from OpenVZ Vasily Averin
@ 2021-07-15 17:11   ` Shakeel Butt
  2021-07-16  4:11     ` Vasily Averin
  0 siblings, 1 reply; 101+ messages in thread
From: Shakeel Butt @ 2021-07-15 17:11 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Cgroups, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Roman Gushchin, Alexander Viro, Alexey Dobriyan, Andrei Vagin,
	Andrew Morton, Borislav Petkov, Christian Brauner, David Ahern,
	David S. Miller, Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Tejun Heo, Thomas Gleixner, Zefan Li, netdev, LKML

On Tue, Apr 27, 2021 at 11:51 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels.
> Initially we used our own accounting subsystem, then partially committed
> it to upstream, and a few years ago switched to cgroups v1.
> Now we're rebasing again, revising our old patches and trying to push
> them upstream.
>
> We try to protect the host system from any misuse of kernel memory
> allocation triggered by untrusted users inside the containers.
>
> Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
> list, though I would be very grateful for any comments from maintainers
> of affected subsystems or other people added in cc:
>
> Compared to the upstream, we additionally account the following kernel objects:
> - network devices and its Tx/Rx queues
> - ipv4/v6 addresses and routing-related objects
> - inet_bind_bucket cache objects
> - VLAN group arrays
> - ipv6/sit: ip_tunnel_prl
> - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
> - nsproxy and namespace objects itself
> - IPC objects: semaphores, message queues and share memory segments
> - mounts
> - pollfd and select bits arrays
> - signals and posix timers
> - file lock
> - fasync_struct used by the file lease code and driver's fasync queues
> - tty objects
> - per-mm LDT
>
> We have an incorrect/incomplete/obsoleted accounting for few other kernel
> objects: sk_filter, af_packets, netlink and xt_counters for iptables.
> They require rework and probably will be dropped at all.
>
> Also we're going to add an accounting for nft, however it is not ready yet.
>
> We have not tested performance on upstream, however, our performance team
> compares our current RHEL7-based production kernel and reports that
> they are at least not worse as the according original RHEL7 kernel.
>

Hi Vasily,

What's the status of this series? I see a couple patches did get
acked/reviewed. Can you please re-send the series with updated ack
tags?

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 00/16] memcg accounting from OpenVZ
  2021-07-15 17:11   ` Shakeel Butt
@ 2021-07-16  4:11     ` Vasily Averin
  2021-07-16 12:55       ` Shakeel Butt
  0 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-16  4:11 UTC (permalink / raw)
  To: Shakeel Butt, Tejun Heo
  Cc: Cgroups, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Roman Gushchin, Alexander Viro, Alexey Dobriyan, Andrei Vagin,
	Andrew Morton, Borislav Petkov, Christian Brauner, David Ahern,
	David S. Miller, Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Tejun Heo, Thomas Gleixner, Zefan Li, netdev, LKML

On 7/15/21 8:11 PM, Shakeel Butt wrote:
> On Tue, Apr 27, 2021 at 11:51 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels.
>> Initially we used our own accounting subsystem, then partially committed
>> it to upstream, and a few years ago switched to cgroups v1.
>> Now we're rebasing again, revising our old patches and trying to push
>> them upstream.
>>
>> We try to protect the host system from any misuse of kernel memory
>> allocation triggered by untrusted users inside the containers.
>>
>> Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
>> list, though I would be very grateful for any comments from maintainers
>> of affected subsystems or other people added in cc:
>>
>> Compared to the upstream, we additionally account the following kernel objects:
>> - network devices and its Tx/Rx queues
>> - ipv4/v6 addresses and routing-related objects
>> - inet_bind_bucket cache objects
>> - VLAN group arrays
>> - ipv6/sit: ip_tunnel_prl
>> - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
>> - nsproxy and namespace objects itself
>> - IPC objects: semaphores, message queues and share memory segments
>> - mounts
>> - pollfd and select bits arrays
>> - signals and posix timers
>> - file lock
>> - fasync_struct used by the file lease code and driver's fasync queues
>> - tty objects
>> - per-mm LDT
>>
>> We have an incorrect/incomplete/obsoleted accounting for few other kernel
>> objects: sk_filter, af_packets, netlink and xt_counters for iptables.
>> They require rework and probably will be dropped at all.
>>
>> Also we're going to add an accounting for nft, however it is not ready yet.
>>
>> We have not tested performance on upstream, however, our performance team
>> compares our current RHEL7-based production kernel and reports that
>> they are at least not worse as the according original RHEL7 kernel.
> 
> Hi Vasily,
> 
> What's the status of this series? I see a couple patches did get
> acked/reviewed. Can you please re-send the series with updated ack
> tags?

Technically my patches do not have any NAKs. Practically they are still not merged.
I expected Michal to push them, but he advised me to go through the subsystem maintainers.
I've asked Tejun to pick up the whole patch set and I'm waiting for his feedback right now.

I can resend the patch set once again, with the collected acks and rebased onto v5.14-rc1.
However, I do not understand how that helps to push them if the patches should be processed
through subsystem maintainers. As far as I understand, I'll need to split this patch set into
per-subsystem pieces and send them to the corresponding maintainers.

Thank you,
	Vasily Averin.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v4 00/16] memcg accounting from OpenVZ
  2021-07-16  4:11     ` Vasily Averin
@ 2021-07-16 12:55       ` Shakeel Butt
  2021-07-19 10:44         ` [PATCH v5 " Vasily Averin
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
  0 siblings, 2 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-16 12:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro,
	Alexey Dobriyan, Andrei Vagin, Andrew Morton, Borislav Petkov,
	Christian Brauner, David Ahern, David S. Miller, Dmitry Safonov,
	Eric Dumazet, Eric W. Biederman, Greg Kroah-Hartman,
	Hideaki YOSHIFUJI, H. Peter Anvin, Ingo Molnar, Jakub Kicinski,
	J. Bruce Fields, Jeff Layton, Jens Axboe, Jiri Slaby,
	Kirill Tkhai, Oleg Nesterov, Serge Hallyn, Thomas Gleixner,
	Zefan Li, netdev, LKML

On Thu, Jul 15, 2021 at 9:11 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 7/15/21 8:11 PM, Shakeel Butt wrote:
> > On Tue, Apr 27, 2021 at 11:51 PM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >> OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels.
> >> Initially we used our own accounting subsystem, then partially committed
> >> it to upstream, and a few years ago switched to cgroups v1.
> >> Now we're rebasing again, revising our old patches and trying to push
> >> them upstream.
> >>
> >> We try to protect the host system from any misuse of kernel memory
> >> allocation triggered by untrusted users inside the containers.
> >>
> >> Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
> >> list, though I would be very grateful for any comments from maintainers
> >> of affected subsystems or other people added in cc:
> >>
> >> Compared to upstream, we additionally account the following kernel objects:
> >> - network devices and their Tx/Rx queues
> >> - ipv4/v6 addresses and routing-related objects
> >> - inet_bind_bucket cache objects
> >> - VLAN group arrays
> >> - ipv6/sit: ip_tunnel_prl
> >> - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
> >> - nsproxy and the namespace objects themselves
> >> - IPC objects: semaphores, message queues and shared memory segments
> >> - mounts
> >> - pollfd and select bits arrays
> >> - signals and posix timers
> >> - file locks
> >> - fasync_struct used by the file lease code and drivers' fasync queues
> >> - tty objects
> >> - per-mm LDT
> >>
> >> We have incorrect/incomplete/obsolete accounting for a few other kernel
> >> objects: sk_filter, af_packets, netlink and xt_counters for iptables.
> >> They require rework and will probably be dropped altogether.
> >>
> >> We're also going to add accounting for nft, but it is not ready yet.
> >>
> >> We have not tested performance on upstream. However, our performance team
> >> compared our current RHEL7-based production kernel with the corresponding
> >> original RHEL7 kernel and reports that it performs at least no worse.
> >
> > Hi Vasily,
> >
> > What's the status of this series? I see a couple patches did get
> > acked/reviewed. Can you please re-send the series with updated ack
> > tags?
>
> Technically my patches do not have any NAKs. In practice they are still not merged.
> I expected Michal to push them, but he advised me to push the subsystem maintainers.
> I've asked Tejun to pick up the whole patch set and I'm waiting for his feedback right now.
>
> I can resend the patch set once again, with the collected approvals and rebased to v5.14-rc1.
> However, I do not understand how that helps to push them if the patches should go through
> the subsystem maintainers. As far as I understand, I'll need to split this patch set into
> per-subsystem pieces and send them to the corresponding maintainers.
>

Usually these kinds of patches (adding memcg accounting) go through mm
tree but if there are no dependencies between the patches and a
consensus that each subsystem maintainer picks the corresponding patch
then that is fine too.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 00/16] memcg accounting from OpenVZ
  2021-07-16 12:55       ` Shakeel Butt
@ 2021-07-19 10:44         ` Vasily Averin
  2021-07-26 18:59           ` [PATCH v6 00/16] memcg accounting from Vasily Averin
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
  1 sibling, 2 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Andrew Morton,
	Borislav Petkov, Christian Brauner, David Ahern, David S. Miller,
	Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Thomas Gleixner, Zefan Li, netdev, linux-fsdevel,
	LKML

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux
kernels. Initially we used our own accounting subsystem, then partially
committed it to upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

The patch set is addressed mostly to the cgroups maintainers and the
cgroups@ mailing list, though I would be very grateful for any comments
from the maintainers of affected subsystems or other people added in cc:

Compared to upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and the namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and drivers' fasync queues
- tty objects
- per-mm LDT

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We're also going to add accounting for nft, but it is not ready yet.

We have not tested performance on upstream. However, our performance team
compared our current RHEL7-based production kernel with the corresponding
original RHEL7 kernel and reports that it performs at least no worse.

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which uses this kind
   of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 mm/memcontrol.c            | 2 +-
 net/8021q/vlan.c           | 2 +-
 net/core/dev.c             | 6 +++---
 net/core/fib_rules.c       | 4 ++--
 net/core/scm.c             | 4 ++--
 net/dccp/proto.c           | 2 +-
 net/ipv4/devinet.c         | 2 +-
 net/ipv4/fib_trie.c        | 4 ++--
 net/ipv4/tcp.c             | 4 +++-
 net/ipv6/addrconf.c        | 2 +-
 net/ipv6/ip6_fib.c         | 4 ++--
 net/ipv6/route.c           | 2 +-
 net/ipv6/sit.c             | 5 +++--
 30 files changed, 57 insertions(+), 49 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 01/16] memcg: enable accounting for net_device and Tx/Rx queues
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
@ 2021-07-19 10:44           ` Vasily Averin
  2021-07-19 10:44           ` [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
                             ` (14 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

A container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat this again and again.
A net device can request the creation of up to 4096 Tx and Rx queues,
forcing the kernel to allocate up to several tens of megabytes of memory
per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
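For illustration only (not part of the patch): a userspace sketch of what
the flag change in this diff amounts to. The MODEL_* names and bit values
are hypothetical; the only property taken from the kernel is that
GFP_KERNEL_ACCOUNT is GFP_KERNEL with __GFP_ACCOUNT or-ed in, so any extra
flags such as __GFP_RETRY_MAYFAIL are kept as they were.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the real values live in include/linux/gfp.h. */
typedef unsigned int model_gfp_t;

#define MODEL_GFP_KERNEL         0x01u /* stands in for GFP_KERNEL */
#define MODEL_GFP_ACCOUNT        0x02u /* stands in for __GFP_ACCOUNT */
#define MODEL_GFP_RETRY_MAYFAIL  0x04u /* stands in for __GFP_RETRY_MAYFAIL */

/* Mirrors the kernel's composition: GFP_KERNEL | __GFP_ACCOUNT. */
#define MODEL_GFP_KERNEL_ACCOUNT (MODEL_GFP_KERNEL | MODEL_GFP_ACCOUNT)

/* The accounting request must be a separate bit from the base mask. */
static_assert((MODEL_GFP_KERNEL & MODEL_GFP_ACCOUNT) == 0,
	      "accounting bit must not overlap the base mask");

/* True when the allocation asks to be charged to the caller's memcg. */
bool model_is_accounted(model_gfp_t flags)
{
	return (flags & MODEL_GFP_ACCOUNT) != 0;
}
```

In this model the old queue allocation mask is not accounted and the new
one is, while the retry behaviour requested by the caller is unchanged.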
 net/core/dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c253c2a..e9aa1e4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10100,7 +10100,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	rx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -10167,7 +10167,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	tx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -10807,7 +10807,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	p = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
  2021-07-19 10:44           ` [PATCH v5 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
@ 2021-07-19 10:44           ` Vasily Averin
  2021-07-19 14:00             ` Dmitry Safonov
  2021-07-20 19:26             ` Shakeel Butt
  2021-07-19 10:44           ` [PATCH v5 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
                             ` (13 subsequent siblings)
  15 siblings, 2 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice related kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rule' and ip_fib caches.

These objects can be removed manually, though usually they live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One such object is 'struct fib6_node', which is mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside a write_lock_bh()/write_unlock_bh()
section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to the
corresponding kmem cache. The proper memory cgroup still cannot be found
due to the incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

The obsolete in_interrupt() does not describe the real execution context
properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, the newer macro should be used instead:
 in_task()	- We're in task context

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
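For illustration only (not part of the patch): a simplified userspace model
of why in_interrupt() and in_task() disagree inside a BH-disabled section.
The MODEL_* names are hypothetical and the field widths are an assumption
based on include/linux/preempt.h. The point it relies on is that
local_bh_disable() adds twice the softirq offset to the counter, so the
"serving softirq" bit stays clear while the softirq field is non-zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed preempt_count layout: softirq bits 8-15, hardirq 16-19, NMI 20-23. */
#define MODEL_SOFTIRQ_SHIFT          8
#define MODEL_HARDIRQ_SHIFT          16
#define MODEL_NMI_SHIFT              20

#define MODEL_SOFTIRQ_MASK           (0xffu << MODEL_SOFTIRQ_SHIFT)
#define MODEL_HARDIRQ_MASK           (0x0fu << MODEL_HARDIRQ_SHIFT)
#define MODEL_NMI_MASK               (0x0fu << MODEL_NMI_SHIFT)

#define MODEL_SOFTIRQ_OFFSET         (1u << MODEL_SOFTIRQ_SHIFT)
#define MODEL_HARDIRQ_OFFSET         (1u << MODEL_HARDIRQ_SHIFT)
/* What a BH-disabled section adds to the counter in this model. */
#define MODEL_SOFTIRQ_DISABLE_OFFSET (2u * MODEL_SOFTIRQ_OFFSET)

/* Deprecated check: true in NMI/IRQ/softirq context or with BH disabled. */
bool model_in_interrupt(unsigned int count)
{
	return (count & (MODEL_NMI_MASK | MODEL_HARDIRQ_MASK |
			 MODEL_SOFTIRQ_MASK)) != 0;
}

/* Newer check: false only in NMI, hardirq or while serving a softirq. */
bool model_in_task(unsigned int count)
{
	return (count & (MODEL_NMI_MASK | MODEL_HARDIRQ_MASK |
			 MODEL_SOFTIRQ_OFFSET)) == 0;
}
```

With count == MODEL_SOFTIRQ_DISABLE_OFFSET (task context, BH disabled) the
model reports in_interrupt() as true and in_task() as true as well, which is
the case the memcg_kmem_bypass() change is about.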
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae1f5d0..1bbf239 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index a9f9379..79df7cd 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 73721a4..d38124b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3bf685f..8eaeade 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 2d650dc..a8f118e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2449,8 +2449,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7b756a7..5f7286a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6638,7 +6638,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 03/16] memcg: enable accounting for inet_bin_bucket cache
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
  2021-07-19 10:44           ` [PATCH v5 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
  2021-07-19 10:44           ` [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
@ 2021-07-19 10:44           ` Vasily Averin
  2021-07-19 10:44           ` [PATCH v5 04/16] memcg: enable accounting for VLAN group array Vasily Averin
                             ` (12 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A net namespace can create up to 64K TCP and DCCP ports and force the
kernel to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/dccp/proto.c | 2 +-
 net/ipv4/tcp.c   | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 7eb0fb2..abb5c59 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1126,7 +1126,7 @@ static int __init dccp_init(void)
 	dccp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("dccp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d5ab5f2..5c0605e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4509,7 +4509,9 @@ void __init tcp_init(void)
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 04/16] memcg: enable accounting for VLAN group array
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (2 preceding siblings ...)
  2021-07-19 10:44           ` [PATCH v5 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
@ 2021-07-19 10:44           ` Vasily Averin
  2021-07-19 10:44           ` [PATCH v5 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
                             ` (11 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

The VLAN group array consumes up to 8 pages of memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
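For illustration only (not part of the patch): a back-of-the-envelope
sketch of where the "8 pages" figure can come from. The constants are
assumptions (4096 VLAN IDs split into 8 lazily allocated parts, as I read
include/linux/if_vlan.h), as are 8-byte pointers and 4 KiB pages, and the
sketch counts the arrays of a single VLAN protocol only.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values; see include/linux/if_vlan.h for the real definitions. */
#define MODEL_VLAN_N_VID    4096u
#define MODEL_SPLIT_PARTS   8u
#define MODEL_PART_LEN      (MODEL_VLAN_N_VID / MODEL_SPLIT_PARTS)
#define MODEL_PAGE_SIZE     4096u /* assumption: 4 KiB pages */

/* Bytes requested by one kzalloc() call in vlan_group_prealloc_vid(). */
size_t model_vlan_part_bytes(size_t ptr_size)
{
	return ptr_size * MODEL_PART_LEN;
}

/* Pages used when every part of one protocol's array is populated. */
size_t model_vlan_group_pages(size_t ptr_size)
{
	size_t total = model_vlan_part_bytes(ptr_size) * MODEL_SPLIT_PARTS;

	return (total + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}
```

Under these assumptions each part is exactly one page on a 64-bit system,
and a fully populated array takes 8 pages.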
 net/8021q/vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 4cdf841..55275ef 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -67,7 +67,7 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 		return 0;
 
 	size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN;
-	array = kzalloc(size, GFP_KERNEL);
+	array = kzalloc(size, GFP_KERNEL_ACCOUNT);
 	if (array == NULL)
 		return -ENOBUFS;
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (3 preceding siblings ...)
  2021-07-19 10:44           ` [PATCH v5 04/16] memcg: enable accounting for VLAN group array Vasily Averin
@ 2021-07-19 10:44           ` Vasily Averin
  2021-07-19 10:44           ` [PATCH v5 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
                             ` (10 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

Author: Andrey Ryabinin <aryabinin@virtuozzo.com>

The size of the ip_tunnel_prl structs allocation is controllable from
user-space, so it is better to avoid spamming dmesg if the allocation fails.
Also add __GFP_ACCOUNT, as this is a good candidate for per-memcg
accounting. The allocation is temporary and limited to 4GB.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/sit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index df5bea8..33adc12 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -321,7 +321,7 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 	 * we try harder to allocate.
 	 */
 	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
-		kcalloc(cmax, sizeof(*kp), GFP_KERNEL | __GFP_NOWARN) :
+		kcalloc(cmax, sizeof(*kp), GFP_KERNEL_ACCOUNT | __GFP_NOWARN) :
 		NULL;
 
 	rcu_read_lock();
@@ -334,7 +334,8 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 		 * For root users, retry allocating enough memory for
 		 * the answer.
 		 */
-		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC);
+		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC | __GFP_ACCOUNT |
+					      __GFP_NOWARN);
 		if (!kp) {
 			ret = -ENOMEM;
 			goto out;
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 06/16] memcg: enable accounting for scm_fp_list objects
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (4 preceding siblings ...)
  2021-07-19 10:44           ` [PATCH v5 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
@ 2021-07-19 10:44           ` Vasily Averin
  2021-07-19 10:45           ` [PATCH v5 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
                             ` (9 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, netdev, linux-kernel

Unix sockets allow sending file descriptors via SCM_RIGHTS messages.
Each such send call forces the kernel to allocate up to 2KB of memory
for struct scm_fp_list.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
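For illustration only (not part of the patch): a userspace sketch of the
"up to 2KB" estimate. The struct below is a hypothetical model of
struct scm_fp_list with plain void pointers; the field order and the limit
of 253 descriptors per message are assumptions taken from my reading of
include/net/scm.h.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SCM_MAX_FD 253 /* assumed per-message descriptor limit */

/* Model of struct scm_fp_list; pointers stand in for kernel types. */
struct model_scm_fp_list {
	short count;
	short max;
	void *user;                 /* stands in for struct user_struct * */
	void *fp[MODEL_SCM_MAX_FD]; /* stands in for struct file * */
};

/* Size requested by the kmalloc() call in scm_fp_copy(). */
size_t model_scm_fp_list_size(void)
{
	return sizeof(struct model_scm_fp_list);
}

/* Size requested by the kmemdup() call in scm_fp_dup() for 'count' fds. */
size_t model_scm_dup_size(size_t count)
{
	return offsetof(struct model_scm_fp_list, fp) +
	       count * sizeof(void *);
}
```

On a typical 64-bit build the full model struct comes to a little under
2 KiB, while a duplicate carrying only a few descriptors is much smaller.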
 net/core/scm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index ae3085d..5c356f0 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -79,7 +79,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 
 	if (!fpl)
 	{
-		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL);
+		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL_ACCOUNT);
 		if (!fpl)
 			return -ENOMEM;
 		*fplp = fpl;
@@ -355,7 +355,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		return NULL;
 
 	new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),
-			  GFP_KERNEL);
+			  GFP_KERNEL_ACCOUNT);
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 07/16] memcg: enable accounting for mnt_cache entries
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (5 preceding siblings ...)
  2021-07-19 10:44           ` [PATCH v5 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
@ 2021-07-19 10:45           ` Vasily Averin
  2021-07-19 10:45           ` [PATCH v5 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
                             ` (8 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

The kernel allocates ~400 bytes of 'struct mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab4174a..c6a74e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4222,7 +4223,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 08/16] memcg: enable accounting for pollfd and select bits arrays
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (6 preceding siblings ...)
  2021-07-19 10:45           ` [PATCH v5 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
@ 2021-07-19 10:45           ` Vasily Averin
  2021-07-19 10:45           ` [PATCH v5 09/16] memcg: enable accounting for file lock caches Vasily Averin
                             ` (7 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

A user can call the select/poll system calls with a large number of
file descriptors and force the kernel to allocate up to several pages of
memory that stay allocated until these sleeping system calls return.
These are long-lived, unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
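For illustration only (not part of the patch): a userspace sketch of how
the size passed to kvmalloc() in core_sys_select() grows with the number
of descriptors. The model_* helpers are hypothetical; they assume one bit
per descriptor rounded up to whole longs, and six bitmaps per call (three
input sets and three result sets), which matches the "6 * size" in the
hunk below.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for one fd bitmap covering 'nfds' descriptors. */
size_t model_fds_bytes(unsigned int nfds)
{
	size_t bits_per_long = 8 * sizeof(long);
	size_t longs = (nfds + bits_per_long - 1) / bits_per_long;

	return longs * sizeof(long);
}

/* Total buffer size: in/out/ex request sets plus three result sets. */
size_t model_select_alloc_size(unsigned int nfds)
{
	return 6 * model_fds_bytes(nfds);
}
```

As far as I understand, small requests are served from an on-stack buffer,
so only calls with many descriptors reach the accounted allocation.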
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 09/16] memcg: enable accounting for file lock caches
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (7 preceding siblings ...)
  2021-07-19 10:45           ` [PATCH v5 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
@ 2021-07-19 10:45           ` Vasily Averin
  2021-07-19 10:45           ` [PATCH v5 10/16] memcg: enable accounting for fasync_cache Vasily Averin
                             ` (6 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

A user can create file locks for each open file and force the kernel
to allocate small but long-lived objects for each open file.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 74b2a1d..1bc7ede 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3056,10 +3056,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 10/16] memcg: enable accounting for fasync_cache
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (8 preceding siblings ...)
  2021-07-19 10:45           ` [PATCH v5 09/16] memcg: enable accounting for file lock caches Vasily Averin
@ 2021-07-19 10:45           ` Vasily Averin
  2021-07-19 10:45           ` [PATCH v5 11/16] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
                             ` (5 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and for regular files by the file lease code.
This structure is quite small but long-lived, and it can be assigned
to any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index dfc72f1..7941559 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 11/16] memcg: enable accounting for new namesapces and struct nsproxy
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (9 preceding siblings ...)
  2021-07-19 10:45           ` [PATCH v5 10/16] memcg: enable accounting for fasync_cache Vasily Averin
@ 2021-07-19 10:45           ` Vasily Averin
  2021-07-19 10:45           ` [PATCH v5 12/16] memcg: enable accounting of ipc resources Vasily Averin
                             ` (4 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Tejun Heo, Andrew Morton,
	Zefan Li, Thomas Gleixner, Christian Brauner, Kirill Tkhai,
	Serge Hallyn, Andrei Vagin, linux-kernel

A container admin can create new namespaces and force the kernel to
allocate up to several pages of memory for each namespace and its
associated structures.
Net and uts namespaces already have accounting enabled for such
allocations. It makes sense to account for the remaining ones to
restrict the host's memory consumption from inside the memcg-limited
container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namespace.c            | 2 +-
 ipc/namespace.c           | 2 +-
 kernel/cgroup/namespace.c | 2 +-
 kernel/nsproxy.c          | 2 +-
 kernel/pid_namespace.c    | 2 +-
 kernel/time/namespace.c   | 4 ++--
 kernel/user_namespace.c   | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c6a74e5..e443ee6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3289,7 +3289,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
 	if (!ucounts)
 		return ERR_PTR(-ENOSPC);
 
-	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns) {
 		dec_mnt_namespaces(ucounts);
 		return ERR_PTR(-ENOMEM);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7bd0766..ae83f0f 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
+	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
 	if (ns == NULL)
 		goto fail_dec;
 
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index f5e8828..0d5c298 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
 	struct cgroup_namespace *new_ns;
 	int ret;
 
-	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns)
 		return ERR_PTR(-ENOMEM);
 	ret = ns_alloc_inum(&new_ns->ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index abc01fc..eec72ca 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
 
 int __init nsproxy_cache_init(void)
 {
-	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
+	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
 	return 0;
 }
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index ca43239..6cd6715 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
 
 static __init int pid_namespaces_init(void)
 {
-	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	register_sysctl_paths(kern_path, pid_ns_ctl_table);
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 12eab0d..aec8328 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
 	if (!ns)
 		goto fail_dec;
 
 	refcount_set(&ns->ns.count, 1);
 
-	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!ns->vvar_page)
 		goto fail_free;
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index ef82d40..6b2e3ca 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1385,7 +1385,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
 
 static __init int user_namespaces_init(void)
 {
-	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
+	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 	return 0;
 }
 subsys_initcall(user_namespaces_init);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 12/16] memcg: enable accounting of ipc resources
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (10 preceding siblings ...)
  2021-07-19 10:45           ` [PATCH v5 11/16] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
@ 2021-07-19 10:45           ` Vasily Averin
  2021-07-19 10:45           ` [PATCH v5 13/16] memcg: enable accounting for signals Vasily Averin
                             ` (3 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexey Dobriyan,
	Dmitry Safonov, Yutian Yang, linux-kernel

When a user creates IPC objects, it forces the kernel to allocate memory
for these long-lived objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, message
queues, semaphores and semaphore undo lists.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 ipc/msg.c | 2 +-
 ipc/sem.c | 9 +++++----
 ipc/shm.c | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 6810276..a0d0577 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kmalloc(sizeof(*msq), GFP_KERNEL);
+	msq = kmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
diff --git a/ipc/sem.c b/ipc/sem.c
index 971e75d..1a8b9f0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -514,7 +514,7 @@ static struct sem_array *sem_alloc(size_t nsems)
 	if (nsems > (INT_MAX - sizeof(*sma)) / sizeof(sma->sems[0]))
 		return NULL;
 
-	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL);
+	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!sma))
 		return NULL;
 
@@ -1855,7 +1855,7 @@ static inline int get_undo_list(struct sem_undo_list **undo_listp)
 
 	undo_list = current->sysvsem.undo_list;
 	if (!undo_list) {
-		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL);
+		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL_ACCOUNT);
 		if (undo_list == NULL)
 			return -ENOMEM;
 		spin_lock_init(&undo_list->lock);
@@ -1941,7 +1941,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 
 	/* step 2: allocate new undo structure */
 	new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!new) {
 		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
 		return ERR_PTR(-ENOMEM);
@@ -2005,7 +2005,8 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 	if (nsops > ns->sc_semopm)
 		return -E2BIG;
 	if (nsops > SEMOPM_FAST) {
-		sops = kvmalloc_array(nsops, sizeof(*sops), GFP_KERNEL);
+		sops = kvmalloc_array(nsops, sizeof(*sops),
+				      GFP_KERNEL_ACCOUNT);
 		if (sops == NULL)
 			return -ENOMEM;
 	}
diff --git a/ipc/shm.c b/ipc/shm.c
index 748933e..ab749be 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 			ns->shm_tot + numpages > ns->shm_ctlall)
 		return -ENOSPC;
 
-	shp = kmalloc(sizeof(*shp), GFP_KERNEL);
+	shp = kmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!shp))
 		return -ENOMEM;
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 13/16] memcg: enable accounting for signals
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (11 preceding siblings ...)
  2021-07-19 10:45           ` [PATCH v5 12/16] memcg: enable accounting of ipc resources Vasily Averin
@ 2021-07-19 10:45           ` Vasily Averin
  2021-07-19 17:32             ` Eric W. Biederman
  2021-07-20 19:15             ` Shakeel Butt
  2021-07-19 10:45           ` [PATCH v5 14/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
                             ` (2 subsequent siblings)
  15 siblings, 2 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Jens Axboe, Eric W. Biederman,
	Oleg Nesterov, linux-kernel

When a user sends a signal to another process, it forces the kernel
to allocate memory for 'struct sigqueue' objects. The number of signals
is limited by the RLIMIT_SIGPENDING resource limit, but even the default
settings allow each user to consume up to several megabytes of memory.
Moreover, an untrusted admin inside a container can increase the limit
or create new fake users and force them to send signals.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index a3229ad..8921c4a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4663,7 +4663,7 @@ void __init signals_init(void)
 {
 	siginfo_buildtime_checks();
 
-	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
+	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
 }
 
 #ifdef CONFIG_KGDB_KDB
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 14/16] memcg: enable accounting for posix_timers_cache slab
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (12 preceding siblings ...)
  2021-07-19 10:45           ` [PATCH v5 13/16] memcg: enable accounting for signals Vasily Averin
@ 2021-07-19 10:45           ` Vasily Averin
  2021-07-19 10:45           ` [PATCH v5 15/16] memcg: enable accounting for tty-related objects Vasily Averin
  2021-07-19 10:46           ` [PATCH v5 16/16] memcg: enable accounting for ldt_struct objects Vasily Averin
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, linux-kernel

A program may create multiple interval timers using timer_create().
For each timer the kernel preallocates a "queued real-time signal".
Consequently, the number of timers is limited by the RLIMIT_SIGPENDING
resource limit. The allocated object is quite small, ~250 bytes,
but even the default signal limits allow up to 100 megabytes to be
consumed per user.

It makes sense to account for them to limit the host's memory consumption
from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/posix-timers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index dd5697d..7363f81 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -273,8 +273,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
 static __init int init_posix_timers(void)
 {
 	posix_timers_cache = kmem_cache_create("posix_timers_cache",
-					sizeof (struct k_itimer), 0, SLAB_PANIC,
-					NULL);
+					sizeof(struct k_itimer), 0,
+					SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 15/16] memcg: enable accounting for tty-related objects
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (13 preceding siblings ...)
  2021-07-19 10:45           ` [PATCH v5 14/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
@ 2021-07-19 10:45           ` Vasily Averin
  2021-07-19 10:46           ` [PATCH v5 16/16] memcg: enable accounting for ldt_struct objects Vasily Averin
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Greg Kroah-Hartman, Jiri Slaby,
	linux-kernel

At each login the user forces the kernel to create a new terminal and
allocate up to ~1 KB of memory for the tty-related structures.

By default, up to 4096 ptys can be created, with 1024 reserved for the
initial mount namespace only, and the settings are controlled by the
host admin.

However, this default is not enough for hosters with thousands
of containers per node. The host admin can be forced to increase it
up to NR_UNIX98_PTY_MAX = 1<<20.

By default a container is restricted by pty mount_opt.max = 1024,
but an admin inside the container can change it via remount. As a
result, one container can consume almost all allowed ptys
and allocate up to 1 GB of unaccounted memory.

This is not enough by itself to trigger OOM on the host, but it allows
a container to significantly exceed its assigned memcg limit and leads
to trouble on an over-committed node.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/tty/tty_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 26debec..e787f6f 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1493,7 +1493,7 @@ void tty_save_termios(struct tty_struct *tty)
 	/* Stash the termios data */
 	tp = tty->driver->termios[idx];
 	if (tp == NULL) {
-		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
+		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
 		if (tp == NULL)
 			return;
 		tty->driver->termios[idx] = tp;
@@ -3119,7 +3119,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
 {
 	struct tty_struct *tty;
 
-	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
+	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
 	if (!tty)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v5 16/16] memcg: enable accounting for ldt_struct objects
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (14 preceding siblings ...)
  2021-07-19 10:45           ` [PATCH v5 15/16] memcg: enable accounting for tty-related objects Vasily Averin
@ 2021-07-19 10:46           ` Vasily Averin
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-19 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, linux-kernel

Each task can request its own LDT and force the kernel to allocate up to
64 KB of memory per mm.

There are legitimate workloads with hundreds of processes and there
can be hundreds of workloads running on large machines.
The unaccounted memory can cause isolation issues between the workloads,
particularly on highly utilized machines.

It makes sense to account for these objects to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/ldt.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index aa15132..525876e 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -154,7 +154,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
 
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);
 	if (!new_ldt)
 		return NULL;
 
@@ -168,9 +168,9 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	 * than PAGE_SIZE.
 	 */
 	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
+		new_ldt->entries = __vmalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
+		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 
 	if (!new_ldt->entries) {
 		kfree(new_ldt);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-19 10:44           ` [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
@ 2021-07-19 14:00             ` Dmitry Safonov
  2021-07-19 14:22               ` Shakeel Butt
  2021-07-20 19:26             ` Shakeel Butt
  1 sibling, 1 reply; 101+ messages in thread
From: Dmitry Safonov @ 2021-07-19 14:00 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Network Development, open list

Hi Vasily,

On Mon, 19 Jul 2021 at 11:45, Vasily Averin <vvs@virtuozzo.com> wrote:
[..]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae1f5d0..1bbf239 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>                 return false;
>
>         /* Memcg to charge can't be determined. */
> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>                 return true;

This seems to do two separate things in one patch.
Probably, it's better to separate them.
(I may miss how route changes are related to more generic
__alloc_pages() change)

Thanks,
           Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-19 14:00             ` Dmitry Safonov
@ 2021-07-19 14:22               ` Shakeel Butt
  2021-07-19 14:24                 ` Dmitry Safonov
  0 siblings, 1 reply; 101+ messages in thread
From: Shakeel Butt @ 2021-07-19 14:22 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Vasily Averin, Andrew Morton, Cgroups, Michal Hocko,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Network Development, open list

On Mon, Jul 19, 2021 at 7:00 AM Dmitry Safonov <0x7f454c46@gmail.com> wrote:
>
> Hi Vasily,
>
> On Mon, 19 Jul 2021 at 11:45, Vasily Averin <vvs@virtuozzo.com> wrote:
> [..]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ae1f5d0..1bbf239 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
> >                 return false;
> >
> >         /* Memcg to charge can't be determined. */
> > -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> > +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
> >                 return true;
>
> This seems to do two separate things in one patch.
> Probably, it's better to separate them.
> (I may miss how route changes are related to more generic
> __alloc_pages() change)
>

It was requested to squash them together in some previous versions.
https://lore.kernel.org/linux-mm/YEiUIf0old+AZssa@dhcp22.suse.cz/

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-19 14:22               ` Shakeel Butt
@ 2021-07-19 14:24                 ` Dmitry Safonov
  0 siblings, 0 replies; 101+ messages in thread
From: Dmitry Safonov @ 2021-07-19 14:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vasily Averin, Andrew Morton, Cgroups, Michal Hocko,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Network Development, open list

On 7/19/21 3:22 PM, Shakeel Butt wrote:
> On Mon, Jul 19, 2021 at 7:00 AM Dmitry Safonov <0x7f454c46@gmail.com> wrote:
>>
>> Hi Vasily,
>>
>> On Mon, 19 Jul 2021 at 11:45, Vasily Averin <vvs@virtuozzo.com> wrote:
>> [..]
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index ae1f5d0..1bbf239 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>>>                 return false;
>>>
>>>         /* Memcg to charge can't be determined. */
>>> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
>>> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>>>                 return true;
>>
>> This seems to do two separate things in one patch.
>> Probably, it's better to separate them.
>> (I may miss how route changes are related to more generic
>> __alloc_pages() change)
>>
> 
> It was requested to squash them together in some previous versions.
> https://lore.kernel.org/linux-mm/YEiUIf0old+AZssa@dhcp22.suse.cz/
> 

Ah, alright, never mind then.

Thanks,
           Dmitry

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v5 13/16] memcg: enable accounting for signals
  2021-07-19 10:45           ` [PATCH v5 13/16] memcg: enable accounting for signals Vasily Averin
@ 2021-07-19 17:32             ` Eric W. Biederman
  2021-07-20  8:35               ` Vasily Averin
  2021-07-20 19:15             ` Shakeel Butt
  1 sibling, 1 reply; 101+ messages in thread
From: Eric W. Biederman @ 2021-07-19 17:32 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jens Axboe,
	Oleg Nesterov, linux-kernel

Vasily Averin <vvs@virtuozzo.com> writes:

> When a user send a signal to any another processes it forces the kernel
> to allocate memory for 'struct sigqueue' objects. The number of signals
> is limited by RLIMIT_SIGPENDING resource limit, but even the default
> settings allow each user to consume up to several megabytes of memory.
> Moreover, an untrusted admin inside container can increase the limit or
> create new fake users and force them to sent signals.

Not any more.  Currently the number of sigqueue objects is limited
by the rlimit of the creator of the user namespace of the container.

> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.

Does it?  Why?  The given justification appears to have bit-rotted
since -rc1.

I know a lot of these things only really need a limit just to catch a
program that starts malfunctioning.  If that is indeed the case
reasonable per-resource limits are probably better than some great big
group limit that can be exhausted with any single resource in the group.

Is there a reason I am not aware of why it makes sense to group
all of the resources together and only count the number of bytes
consumed?

Eric


> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  kernel/signal.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a3229ad..8921c4a 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -4663,7 +4663,7 @@ void __init signals_init(void)
>  {
>  	siginfo_buildtime_checks();
>  
> -	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
> +	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
>  }
>  
>  #ifdef CONFIG_KGDB_KDB

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v5 13/16] memcg: enable accounting for signals
  2021-07-19 17:32             ` Eric W. Biederman
@ 2021-07-20  8:35               ` Vasily Averin
  2021-07-20 14:37                 ` Shakeel Butt
  2021-07-20 16:42                 ` Eric W. Biederman
  0 siblings, 2 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-20  8:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jens Axboe,
	Oleg Nesterov, linux-kernel

On 7/19/21 8:32 PM, Eric W. Biederman wrote:
> Vasily Averin <vvs@virtuozzo.com> writes:
> 
>> When a user send a signal to any another processes it forces the kernel
>> to allocate memory for 'struct sigqueue' objects. The number of signals
>> is limited by RLIMIT_SIGPENDING resource limit, but even the default
>> settings allow each user to consume up to several megabytes of memory.
>> Moreover, an untrusted admin inside container can increase the limit or
>> create new fake users and force them to sent signals.
> 
> Not any more.  Currently the number of sigqueue objects is limited
> by the rlimit of the creator of the user namespace of the container.
> 
>> It makes sense to account for these allocations to restrict the host's
>> memory consumption from inside the memcg-limited container.
> 
> Does it?  Why?  The given justification appears to have bit-rotted
> since -rc1.

Could you please explain what changed in rc1?
From my POV accounting is required to help the OOM-killer select a proper target.

> I know a lot of these things only really need a limit just to catch a
> program that starts malfunctioning.  If that is indeed the case
> reasonable per-resource limits are probably better than some great big
> group limit that can be exhausted with any single resource in the group.
> 
> Is there a reason I am not aware of that where it makes sense to group
> all of the resources together and only count the number of bytes
> consumed?

Any new limits:
a) should be set properly, depending on a huge number of input parameters;
b) should properly notify about hits;
c) should be updated properly after b);
d) should do a)-c) automatically if possible.

In the past OpenVZ had its own accounting subsystem, user beancounters (UBC).
It accounted and limited 20+ resources per container: numfiles, file locks,
signals, netfilter rules, socket buffers and so on.
I assume you want to do something similar, so let me share our experience.

We had a lot of problems with UBC:
- It's quite hard to set up the limits.
  Why is it fine to consume N entities of some resource but bad to consume N+1?
  Per-process? Per-user? Per-thread? Per-task? Per-namespace? What if nested? Per-container? Per-host?
  To answer these questions the host admin needs additional knowledge and skills.

- OK, we have set all the limits. Some application hits one and fails.
  It's quite hard to understand that the application hit the limit and failed for this reason.
  From the user's point of view, if some application does not work (stably enough)
  inside a container => containers are to blame.

- It's quite hard to understand that the failed application just needs limit X raised to N entities.

As a result both host admins and container users were unhappy.
So after years of such fights we decided just to limit accounted memory instead.

Anyway, the OOM-killer must know who consumed memory to select a proper target.

Thank you,
	Vasily Averin
 

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v5 13/16] memcg: enable accounting for signals
  2021-07-20  8:35               ` Vasily Averin
@ 2021-07-20 14:37                 ` Shakeel Butt
  2021-07-20 16:42                 ` Eric W. Biederman
  1 sibling, 0 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-20 14:37 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Eric W. Biederman, Andrew Morton, Cgroups, Michal Hocko,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jens Axboe,
	Oleg Nesterov, LKML

On Tue, Jul 20, 2021 at 1:35 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 7/19/21 8:32 PM, Eric W. Biederman wrote:
> > Vasily Averin <vvs@virtuozzo.com> writes:
> >
> >> When a user send a signal to any another processes it forces the kernel
> >> to allocate memory for 'struct sigqueue' objects. The number of signals
> >> is limited by RLIMIT_SIGPENDING resource limit, but even the default
> >> settings allow each user to consume up to several megabytes of memory.
> >> Moreover, an untrusted admin inside container can increase the limit or
> >> create new fake users and force them to sent signals.
> >
> > Not any more.  Currently the number of sigqueue objects is limited
> > by the rlimit of the creator of the user namespace of the container.
> >
> >> It makes sense to account for these allocations to restrict the host's
> >> memory consumption from inside the memcg-limited container.
> >
> > Does it?  Why?  The given justification appears to have bit-rotted
> > since -rc1.
>
> Could you please explain what was changed in rc1?
> From my POV accounting is required to help OOM-killer to select proper target.
>
> > I know a lot of these things only really need a limit just to catch a
> > program that starts malfunctioning.  If that is indeed the case
> > reasonable per-resource limits are probably better than some great big
> > group limit that can be exhausted with any single resource in the group.
> >
> > Is there a reason I am not aware of that where it makes sense to group
> > all of the resources together and only count the number of bytes
> > consumed?
>
> Any new limits:
> a) should be set properly depending on huge number of incoming parameters.
> b) should properly notify about hits
> c) should be updated properly after b)
> d) do a)-c) automatically if possible
>
> In past OpenVz had own accounting subsystem, user beancounters (UBC).
> It accounted and limited 20+ resources  per-container: numfiles, file locks,
> signals, netfilter rules, socket buffers and so on.
> I assume you want to do something similar, so let me share our experience.
>
> We had a lot of problems with UBC:
> - it's quite hard to set up the limit.
>   Why it's good to consume N entities of some resource but it's bad to consume N+1 ones?
>   per-process? per-user? per-thread? per-task? per-namespace? if nested? per-container? per-host?
>   To answer the questions host admin should have additional knowledge and skills.
>
> - Ok, we have set all limits. Some application hits it and fails.
>   It's quite hard to understand that application hits the limit, and failed due to this reason.
>   From users point of view, if some application does not work (stable enough)
>   inside container => containers are guilty.
>
> - It's quite hard to understand that failed application just want to increase limit X up to N entities.
>
> As result both host admins and container users was unhappy.
> So after years of such fights we decided just to limit accounted memory instead.
>
> Anyway, OOM-killer must know who consumed memory to select proper target.
>

Just to support Vasily's point further, for systems running multiple
workloads, it is far preferable to be able to set one limit for
each workload than to set many different limits.

One concrete example is described in commit ac7b79fd190b ("inotify,
memcg: account inotify instances to kmemcg"). The inotify instances
can be limited through the fs sysctl inotify/max_user_instances and
further partitioned among users through a per-user-namespace sysctl,
but there is no sensible way to set such a limit and partition it on
a system that runs different workloads.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v5 13/16] memcg: enable accounting for signals
  2021-07-20  8:35               ` Vasily Averin
  2021-07-20 14:37                 ` Shakeel Butt
@ 2021-07-20 16:42                 ` Eric W. Biederman
  1 sibling, 0 replies; 101+ messages in thread
From: Eric W. Biederman @ 2021-07-20 16:42 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jens Axboe,
	Oleg Nesterov, linux-kernel

Vasily Averin <vvs@virtuozzo.com> writes:

> On 7/19/21 8:32 PM, Eric W. Biederman wrote:
>> Vasily Averin <vvs@virtuozzo.com> writes:
>> 
>>> When a user send a signal to any another processes it forces the kernel
>>> to allocate memory for 'struct sigqueue' objects. The number of signals
>>> is limited by RLIMIT_SIGPENDING resource limit, but even the default
>>> settings allow each user to consume up to several megabytes of memory.
>>> Moreover, an untrusted admin inside container can increase the limit or
>>> create new fake users and force them to sent signals.
>> 
>> Not any more.  Currently the number of sigqueue objects is limited
>> by the rlimit of the creator of the user namespace of the container.
>> 
>>> It makes sense to account for these allocations to restrict the host's
>>> memory consumption from inside the memcg-limited container.
>> 
>> Does it?  Why?  The given justification appears to have bit-rotted
>> since -rc1.
>
> Could you please explain what was changed in rc1?
> From my POV accounting is required to help OOM-killer to select proper target.

You can no longer escape the rlimit of the creator of the user
namespace by creating multiple users inside of the user namespace.  The
users (in the user namespace) will have their individual rlimits but
are jointly bounded by the limit of the creator of the user namespace
at the time the user namespace was created.
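
The joint bound described above can be pictured with a small userspace
model. This is only a sketch under assumptions, not the kernel's actual
accounting code: the names ns_level, charge_sigpending() and
uncharge_sigpending() are hypothetical, and each level simply carries
its own counter and maximum. A charge succeeds only if it fits under
every level from the user up to the creator of the user namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one accounting level per user / namespace creator. */
struct ns_level {
	long count;			/* objects currently charged here */
	long max;			/* limit that applies at this level */
	struct ns_level *parent;	/* NULL at the top of the hierarchy */
};

/* Charge one object to the leaf and to every ancestor, or to none. */
static bool charge_sigpending(struct ns_level *leaf)
{
	struct ns_level *iter, *undo;

	assert(leaf != NULL);

	for (iter = leaf; iter; iter = iter->parent) {
		if (iter->count + 1 > iter->max)
			goto rollback;
		iter->count++;
	}
	return true;

rollback:
	/* Undo the levels charged before the one that refused. */
	for (undo = leaf; undo != iter; undo = undo->parent)
		undo->count--;
	return false;
}

/* Release one object from the leaf and from every ancestor. */
static void uncharge_sigpending(struct ns_level *leaf)
{
	struct ns_level *iter;

	for (iter = leaf; iter; iter = iter->parent)
		iter->count--;
}
```

In this model, creating more users below the same creator adds more
leaves, but all of them still share the creator's counter, so the total
cannot exceed the creator's limit.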

>> I know a lot of these things only really need a limit just to catch a
>> program that starts malfunctioning.  If that is indeed the case
>> reasonable per-resource limits are probably better than some great big
>> group limit that can be exhausted with any single resource in the group.
>> 
>> Is there a reason I am not aware of that where it makes sense to group
>> all of the resources together and only count the number of bytes
>> consumed?
>
> Any new limits:
> a) should be set properly depending on huge number of incoming parameters.
> b) should properly notify about hits
> c) should be updated properly after b) 
> d) do a)-c) automatically if possible
>
> In past OpenVz had own accounting subsystem, user beancounters (UBC).
> It accounted and limited 20+ resources  per-container: numfiles, file locks,
> signals, netfilter rules, socket buffers and so on.
> I assume you want to do something similar, so let me share our experience. 
>
> We had a lot of problems with UBC:
> - it's quite hard to set up the limit. 
>   Why it's good to consume N entities of some resource but it's bad to consume N+1 ones? 
>   per-process? per-user? per-thread? per-task? per-namespace? if nested? per-container? per-host?
>   To answer the questions host admin should have additional knowledge and skills.
>
> - Ok, we have set all limits. Some application hits it and fails.
>   It's quite hard to understand that application hits the limit, and failed due to this reason.
>   From users point of view, if some application does not work (stable enough)
>   inside container => containers are guilty.
>
> - It's quite hard to understand that failed application just want to increase limit X up to N entities.
>
> As result both host admins and container users was unhappy.
> So after years of such fights we decided just to limit accounted
> memory instead.


Which is a perfectly fine justification.  However the justification
presented in the change log was that there is no existing limit, and
that is what is factually wrong.

Different kinds of limits serve different purposes.  A count of how
many instances of a resource have been used serves the purpose of
detecting when an application has gone completely out of spec.

Limits like the ones you are implementing here are much better for just
sharing and managing system resources.

> Anyway, OOM-killer must know who consumed memory to select proper
> target.

The limits I am talking about are not useful to the OOM killer, as
usually the amount of resources that indicates something has gone wrong
is too small to be a system-wide problem.

So please just correct the justification in the commit message and I
will be happy.

Eric  

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v5 13/16] memcg: enable accounting for signals
  2021-07-19 10:45           ` [PATCH v5 13/16] memcg: enable accounting for signals Vasily Averin
  2021-07-19 17:32             ` Eric W. Biederman
@ 2021-07-20 19:15             ` Shakeel Butt
  1 sibling, 0 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-20 19:15 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Jens Axboe, Eric W. Biederman,
	Oleg Nesterov, LKML

On Mon, Jul 19, 2021 at 3:46 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> When a user send a signal to any another processes it forces the kernel
> to allocate memory for 'struct sigqueue' objects. The number of signals
> is limited by RLIMIT_SIGPENDING resource limit, but even the default
> settings allow each user to consume up to several megabytes of memory.
> Moreover, an untrusted admin inside container can increase the limit or
> create new fake users and force them to sent signals.
>
> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

It seems like there is agreement on this patch with the updated
commit message. In the next version you can add:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-19 10:44           ` [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
  2021-07-19 14:00             ` Dmitry Safonov
@ 2021-07-20 19:26             ` Shakeel Butt
  2021-07-26 10:23               ` Vasily Averin
  1 sibling, 1 reply; 101+ messages in thread
From: Shakeel Butt @ 2021-07-20 19:26 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev, LKML

On Mon, Jul 19, 2021 at 3:44 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> An netadmin inside container can use 'ip a a' and 'ip r a'
> to assign a large number of ipv4/ipv6 addresses and routing entries
> and force kernel to allocate megabytes of unaccounted memory
> for long-lived per-netdevice related kernel objects:
> 'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
> 'struct rt6_info', 'struct fib_rules' and ip_fib caches.
>
> These objects can be manually removed, though usually they lives
> in memory till destroy of its net namespace.
>
> It makes sense to account for them to restrict the host's memory
> consumption from inside the memcg-limited container.
>
> One of such objects is the 'struct fib6_node' mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
>
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
>
> In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
> kmem cache. The proper memory cgroup still cannot be found due to the
> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
>
> Obsoleted in_interrupt() does not describe real execution context properly.
> From include/linux/preempt.h:
>
>  The following macros are deprecated and should not be used in new code:
>  in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
>
> To verify the current execution context new macro should be used instead:
>  in_task()      - We're in task context
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/memcontrol.c      | 2 +-
>  net/core/fib_rules.c | 4 ++--
>  net/ipv4/devinet.c   | 2 +-
>  net/ipv4/fib_trie.c  | 4 ++--
>  net/ipv6/addrconf.c  | 2 +-
>  net/ipv6/ip6_fib.c   | 4 ++--
>  net/ipv6/route.c     | 2 +-
>  7 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae1f5d0..1bbf239 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>                 return false;
>
>         /* Memcg to charge can't be determined. */
> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>                 return true;
>
>         return false;

Can you please also change in_interrupt() in active_memcg() as well?
There are other unrelated in_interrupt() in that file but the one in
active_memcg() should be coupled with this change.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-20 19:26             ` Shakeel Butt
@ 2021-07-26 10:23               ` Vasily Averin
  2021-07-26 13:48                 ` Shakeel Butt
  0 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 10:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev, LKML

On 7/20/21 10:26 PM, Shakeel Butt wrote:
> On Mon, Jul 19, 2021 at 3:44 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> An netadmin inside container can use 'ip a a' and 'ip r a'
>> to assign a large number of ipv4/ipv6 addresses and routing entries
>> and force kernel to allocate megabytes of unaccounted memory
>> for long-lived per-netdevice related kernel objects:
>> 'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
>> 'struct rt6_info', 'struct fib_rules' and ip_fib caches.
>>
>> These objects can be manually removed, though usually they lives
>> in memory till destroy of its net namespace.
>>
>> It makes sense to account for them to restrict the host's memory
>> consumption from inside the memcg-limited container.
>>
>> One of such objects is the 'struct fib6_node' mostly allocated in
>> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
>>
>>  write_lock_bh(&table->tb6_lock);
>>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>>  write_unlock_bh(&table->tb6_lock);
>>
>> In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
>> kmem cache. The proper memory cgroup still cannot be found due to the
>> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
>>
>> Obsoleted in_interrupt() does not describe real execution context properly.
>> From include/linux/preempt.h:
>>
>>  The following macros are deprecated and should not be used in new code:
>>  in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
>>
>> To verify the current execution context new macro should be used instead:
>>  in_task()      - We're in task context
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  mm/memcontrol.c      | 2 +-
>>  net/core/fib_rules.c | 4 ++--
>>  net/ipv4/devinet.c   | 2 +-
>>  net/ipv4/fib_trie.c  | 4 ++--
>>  net/ipv6/addrconf.c  | 2 +-
>>  net/ipv6/ip6_fib.c   | 4 ++--
>>  net/ipv6/route.c     | 2 +-
>>  7 files changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ae1f5d0..1bbf239 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>>                 return false;
>>
>>         /* Memcg to charge can't be determined. */
>> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
>> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>>                 return true;
>>
>>         return false;
> 
> Can you please also change in_interrupt() in active_memcg() as well?
> There are other unrelated in_interrupt() in that file but the one in
> active_memcg() should be coupled with this change.

Could you please elaborate?
From my point of view active_memcg() is paired with set_active_memcg() and is not related to this case.
active_memcg() uses the memcg that was set by set_active_memcg(), either from the int_active_memcg per-cpu pointer
or from the current->active_memcg pointer.
I agree that with BH disabled it is incorrect to use int_active_memcg;
we can still use current->active_memcg. However, it isn't a problem:
the memcg will be properly provided in both cases.
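
To make the difference between the two checks concrete, here is a
minimal userspace model. It is a sketch under stated assumptions, not
kernel code: the bit layout is modelled on include/linux/preempt.h and
may differ between kernel versions, all model_* names are hypothetical,
and two plain pointers stand in for current->active_memcg and the
per-cpu int_active_memcg. The storage selection uses the '!in_task()'
form of the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed preempt_count layout, modelled on include/linux/preempt.h. */
#define SOFTIRQ_SHIFT		8
#define HARDIRQ_SHIFT		16
#define NMI_SHIFT		20

#define SOFTIRQ_OFFSET		(1U << SOFTIRQ_SHIFT)
#define SOFTIRQ_MASK		(0xffU << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK		(0xfU << HARDIRQ_SHIFT)
#define NMI_MASK		(0xfU << NMI_SHIFT)
/* Disabling BH adds twice SOFTIRQ_OFFSET, so the "serving" bit stays 0. */
#define SOFTIRQ_DISABLE_OFFSET	(2 * SOFTIRQ_OFFSET)

static unsigned int model_preempt_count;

/* True in NMI, IRQ and softirq context, and also when BH is disabled. */
static bool model_in_interrupt(void)
{
	return model_preempt_count & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_MASK);
}

/* True in task context, including a task that has only disabled BH. */
static bool model_in_task(void)
{
	return !(model_preempt_count &
		 (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
}

struct memcg { int id; };

static struct memcg *model_task_active_memcg;	/* current->active_memcg */
static struct memcg *model_cpu_active_memcg;	/* per-cpu int_active_memcg */

/* Store the memcg in the slot that matches the execution context. */
static struct memcg *model_set_active_memcg(struct memcg *memcg)
{
	struct memcg **slot = model_in_task() ? &model_task_active_memcg
					      : &model_cpu_active_memcg;
	struct memcg *old = *slot;

	*slot = memcg;
	return old;
}

/* Read back from the slot that matches the execution context. */
static struct memcg *model_active_memcg(void)
{
	return model_in_task() ? model_task_active_memcg
			       : model_cpu_active_memcg;
}
```

In this model a task that has only disabled BH is reported as "in
interrupt" by the old check, while the in_task() check keeps it on the
per-task pointer.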

I think it's better to fix set_active_memcg()/active_memcg() in a separate patch.

Am I missing something, perhaps?

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-26 10:23               ` Vasily Averin
@ 2021-07-26 13:48                 ` Shakeel Butt
  2021-07-26 16:53                   ` [PATCH] memcg: replace in_interrupt() by !in_task() in active_memcg() Vasily Averin
  0 siblings, 1 reply; 101+ messages in thread
From: Shakeel Butt @ 2021-07-26 13:48 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev, LKML

On Mon, Jul 26, 2021 at 3:23 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
[...]
> >
> > Can you please also change in_interrupt() in active_memcg() as well?
> > There are other unrelated in_interrupt() in that file but the one in
> > active_memcg() should be coupled with this change.
>
> Could you please elaborate?
> From my point of view active_memcg is paired with set_active_memcg() and is not related to this case.
> active_memcg uses memcg that was set by set_active_memcg(), either from int_active_memcg per-cpu pointer
> or from current->active_memcg pointer.
> I'm agree, it in case of disabled BH it is incorrect to use int_active_memcg,
> we still can use current->active_memcg. However it isn't a problem,
> memcg will be properly provided in both cases.
>
> I think it's better to fix set_active_memcg/active_memcg by separate patch.
>
> Am I missed something perhaps?
>

No, you are right. That should be a separate patch.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH] memcg: replace in_interrupt() by !in_task() in active_memcg()
  2021-07-26 13:48                 ` Shakeel Butt
@ 2021-07-26 16:53                   ` Vasily Averin
  2021-07-26 16:57                     ` Shakeel Butt
  0 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 16:53 UTC (permalink / raw)
  To: Shakeel Butt, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Roman Gushchin, Andrew Morton
  Cc: cgroups, linux-kernel, linux-mm

set_active_memcg() uses an in_interrupt() check to select the proper storage
for the cgroup: a pointer in the task struct or a per-cpu pointer.

It isn't fully correct: the obsolete in_interrupt() also covers tasks with BH disabled.
It's better to use '!in_task()' instead.

Link: https://lkml.org/lkml/2021/7/26/487
Fixes: 37d5985c003d ("mm: kmem: prepare remote memcg charging infra for interrupt contexts")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 include/linux/sched/mm.h | 2 +-
 mm/memcontrol.c          | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e24b1fe348e3..9dd071f78dba 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -306,7 +306,7 @@ set_active_memcg(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *old;
 
-	if (in_interrupt()) {
+	if (!in_task()) {
 		old = this_cpu_read(int_active_memcg);
 		this_cpu_write(int_active_memcg, memcg);
 	} else {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae1f5d0cb581..3ebf792ef2c0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -905,7 +905,7 @@ EXPORT_SYMBOL(mem_cgroup_from_task);
 
 static __always_inline struct mem_cgroup *active_memcg(void)
 {
-	if (in_interrupt())
+	if (!in_task())
 		return this_cpu_read(int_active_memcg);
 	else
 		return current->active_memcg;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH] memcg: replace in_interrupt() by !in_task() in active_memcg()
  2021-07-26 16:53                   ` [PATCH] memcg: replace in_interrupt() by !in_task() in active_memcg() Vasily Averin
@ 2021-07-26 16:57                     ` Shakeel Butt
  0 siblings, 0 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-26 16:57 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Roman Gushchin,
	Andrew Morton, Cgroups, LKML, Linux MM

On Mon, Jul 26, 2021 at 9:53 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> set_active_memcg() uses in_interrupt() check to select proper storage for
> cgroup: pointer on task struct or per-cpu pointer.
>
> It isn't fully correct: obsoleted in_interrupt() includes tasks with disabled BH.
> It's better to use '!in_task()' instead.
>
> Link: https://lkml.org/lkml/2021/7/26/487
> Fixes: 37d5985c003d ("mm: kmem: prepare remote memcg charging infra for interrupt contexts")
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Thanks.

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v6 00/16] memcg accounting from
  2021-07-19 10:44         ` [PATCH v5 " Vasily Averin
@ 2021-07-26 18:59           ` Vasily Averin
  2021-07-26 21:59             ` David Miller
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
  1 sibling, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 18:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Andrew Morton,
	Borislav Petkov, Christian Brauner, David Ahern, David S. Miller,
	Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Thomas Gleixner, Zefan Li, netdev, linux-fsdevel,
	LKML

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially merged
it upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

The patch set is addressed mostly to the cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from maintainers
of affected subsystems or other people added in cc:

Compared to upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and the namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We are also going to add accounting for nft, however it is not ready yet.

We have not tested performance on upstream. However, our performance team
compared our current RHEL7-based production kernel with the corresponding
original RHEL7 kernel and reports that it performs at least no worse.

v6:
- improved description of "memcg: enable accounting for signals"
  according to Eric Biederman's wishes
- added Reviewed-by tag from Shakeel Butt on the same patch

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kind of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which uses this kind of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 mm/memcontrol.c            | 2 +-
 net/8021q/vlan.c           | 2 +-
 net/core/dev.c             | 6 +++---
 net/core/fib_rules.c       | 4 ++--
 net/core/scm.c             | 4 ++--
 net/dccp/proto.c           | 2 +-
 net/ipv4/devinet.c         | 2 +-
 net/ipv4/fib_trie.c        | 4 ++--
 net/ipv4/tcp.c             | 4 +++-
 net/ipv6/addrconf.c        | 2 +-
 net/ipv6/ip6_fib.c         | 4 ++--
 net/ipv6/route.c           | 2 +-
 net/ipv6/sit.c             | 5 +++--
 30 files changed, 57 insertions(+), 49 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v6 01/16] memcg: enable accounting for net_device and Tx/Rx queues
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
@ 2021-07-26 18:59             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
                               ` (14 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 18:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

A container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat this again and again.
A net device can request the creation of up to 4096 tx and rx queues,
and force the kernel to allocate up to several tens of megabytes of memory
per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c253c2a..e9aa1e4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10100,7 +10100,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	rx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -10167,7 +10167,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	tx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -10807,7 +10807,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	p = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v6 02/16] memcg: enable accounting for IP address and routing-related objects
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
  2021-07-26 18:59             ` [PATCH v6 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
                               ` (13 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice related kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and ip_fib caches.

These objects can be manually removed, though usually they live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One such object is the 'struct fib6_node', mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside the write_lock_bh()/write_unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to the corresponding
kmem cache. The proper memory cgroup still cannot be found due to the
incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

The obsolete in_interrupt() does not describe the real execution context properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, the new macro should be used instead:
 in_task()	- We're in task context

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae1f5d0..1bbf239 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index a9f9379..79df7cd 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 73721a4..d38124b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3bf685f..8eaeade 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 2d650dc..a8f118e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2449,8 +2449,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7b756a7..5f7286a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6638,7 +6638,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v6 03/16] memcg: enable accounting for inet_bind_bucket cache
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
  2021-07-26 18:59             ` [PATCH v6 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 04/16] memcg: enable accounting for VLAN group array Vasily Averin
                               ` (12 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A net namespace can use up to 64K TCP and DCCP ports and force the kernel
to allocate up to several megabytes of memory per netns for
inet_bind_bucket objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/dccp/proto.c | 2 +-
 net/ipv4/tcp.c   | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 7eb0fb2..abb5c59 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1126,7 +1126,7 @@ static int __init dccp_init(void)
 	dccp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("dccp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d5ab5f2..5c0605e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4509,7 +4509,9 @@ void __init tcp_init(void)
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
-- 
1.8.3.1


* [PATCH v6 04/16] memcg: enable accounting for VLAN group array
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (2 preceding siblings ...)
  2021-07-26 19:00             ` [PATCH v6 03/16] memcg: enable accounting for inet_bind_bucket cache Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
                               ` (11 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

VLAN group arrays consume up to 8 pages of memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/8021q/vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 4cdf841..55275ef 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -67,7 +67,7 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 		return 0;
 
 	size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN;
-	array = kzalloc(size, GFP_KERNEL);
+	array = kzalloc(size, GFP_KERNEL_ACCOUNT);
 	if (array == NULL)
 		return -ENOBUFS;
 
-- 
1.8.3.1


* [PATCH v6 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (3 preceding siblings ...)
  2021-07-26 19:00             ` [PATCH v6 04/16] memcg: enable accounting for VLAN group array Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
                               ` (10 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

Author: Andrey Ryabinin <aryabinin@virtuozzo.com>

The size of the ip_tunnel_prl structs allocation is controllable from
user space, so it is better to avoid spamming dmesg if the allocation fails.
Also add __GFP_ACCOUNT, as this is a good candidate for per-memcg
accounting. The allocation is temporary and limited to 4GB.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/sit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index df5bea8..33adc12 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -321,7 +321,7 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 	 * we try harder to allocate.
 	 */
 	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
-		kcalloc(cmax, sizeof(*kp), GFP_KERNEL | __GFP_NOWARN) :
+		kcalloc(cmax, sizeof(*kp), GFP_KERNEL_ACCOUNT | __GFP_NOWARN) :
 		NULL;
 
 	rcu_read_lock();
@@ -334,7 +334,8 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 		 * For root users, retry allocating enough memory for
 		 * the answer.
 		 */
-		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC);
+		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC | __GFP_ACCOUNT |
+					      __GFP_NOWARN);
 		if (!kp) {
 			ret = -ENOMEM;
 			goto out;
-- 
1.8.3.1


* [PATCH v6 06/16] memcg: enable accounting for scm_fp_list objects
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (4 preceding siblings ...)
  2021-07-26 19:00             ` [PATCH v6 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
                               ` (9 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, netdev, linux-kernel

Unix sockets allow sending file descriptors via SCM_RIGHTS messages.
Each such send call forces the kernel to allocate up to 2KB of memory for
struct scm_fp_list.
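
The 2KB figure can be sanity-checked with a small userspace sketch. The
layout below is a mock of struct scm_fp_list written from memory of
include/net/scm.h (two short counters, a user pointer and an array of
SCM_MAX_FD file pointers, with SCM_MAX_FD assumed to be 253). The mock_
and MOCK_ names are invented for this illustration, and the real
definition should be checked in the tree being patched.

```c
#include <stddef.h>

/* Assumption: SCM_MAX_FD is 253 in include/net/scm.h. */
#define MOCK_SCM_MAX_FD 253

/*
 * Hypothetical userspace mock of struct scm_fp_list. The void pointers
 * stand in for struct user_struct * and struct file *.
 */
struct mock_scm_fp_list {
	short count;
	short max;
	void *user;
	void *fp[MOCK_SCM_MAX_FD];
};

/* Size of one full list, i.e. what a single kmalloc() would request. */
size_t mock_scm_fp_list_bytes(void)
{
	return sizeof(struct mock_scm_fp_list);
}

/* Compile-time check of the "up to 2KB" claim for this mock layout. */
_Static_assert(sizeof(struct mock_scm_fp_list) <= 2048,
	       "mock scm_fp_list should fit in 2KB");
```

With 8-byte pointers the mock comes out just under 2KB, so the
allocation would land in a 2KB kmalloc size class if the assumed layout
matches the kernel's.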

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/scm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index ae3085d..5c356f0 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -79,7 +79,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 
 	if (!fpl)
 	{
-		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL);
+		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL_ACCOUNT);
 		if (!fpl)
 			return -ENOMEM;
 		*fplp = fpl;
@@ -355,7 +355,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		return NULL;
 
 	new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),
-			  GFP_KERNEL);
+			  GFP_KERNEL_ACCOUNT);
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
-- 
1.8.3.1


* [PATCH v6 07/16] memcg: enable accounting for mnt_cache entries
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (5 preceding siblings ...)
  2021-07-26 19:00             ` [PATCH v6 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
                               ` (8 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

The kernel allocates ~400 bytes for a 'struct mount' on every new mount.
Creating a new mount namespace clones most of the parent's mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab4174a..c6a74e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4222,7 +4223,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


* [PATCH v6 08/16] memcg: enable accounting for pollfd and select bits arrays
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (6 preceding siblings ...)
  2021-07-26 19:00             ` [PATCH v6 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  2021-07-26 19:01             ` [PATCH v6 09/16] memcg: enable accounting for file lock caches Vasily Averin
                               ` (7 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

A user can call the select/poll system calls with a large number of
file descriptors and force the kernel to allocate up to several pages of
memory that stay allocated until these sleeping system calls return.
These are long-lived, unaccounted per-task allocations.
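
As a rough illustration of the select() side, the sketch below mirrors
the "alloc_size = 6 * size" line visible in the diff: one bitmap per
fd set, rounded up to whole longs, times six (three input sets and
three result sets). The function names are made up for this example,
and the rounding rule is an assumption about how the kernel sizes each
bitmap. Small requests are expected to be served from an on-stack
buffer instead, which this sketch ignores.

```c
#include <limits.h>
#include <stddef.h>

/* Bytes for one fd bitmap covering nfds descriptors, in whole longs. */
size_t fds_bitmap_bytes(size_t nfds)
{
	size_t bits_per_long = sizeof(long) * CHAR_BIT;
	size_t longs = (nfds + bits_per_long - 1) / bits_per_long;

	return longs * sizeof(long);
}

/* Total for select(): six bitmaps, matching "alloc_size = 6 * size". */
size_t select_bits_bytes(size_t nfds)
{
	return 6 * fds_bitmap_bytes(nfds);
}
```

For example, select_bits_bytes(1024) is 768 bytes, while a process that
has raised its fd limit to around a million descriptors would need
several hundred kilobytes held for the duration of the call.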

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1


* [PATCH v6 09/16] memcg: enable accounting for file lock caches
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (7 preceding siblings ...)
  2021-07-26 19:00             ` [PATCH v6 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
@ 2021-07-26 19:01             ` Vasily Averin
  2021-07-26 19:01             ` [PATCH v6 10/16] memcg: enable accounting for fasync_cache Vasily Averin
                               ` (6 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

A user can create file locks for each open file and force the kernel
to allocate small but long-lived objects for each open file.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 74b2a1d..1bc7ede 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3056,10 +3056,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1


* [PATCH v6 10/16] memcg: enable accounting for fasync_cache
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (8 preceding siblings ...)
  2021-07-26 19:01             ` [PATCH v6 09/16] memcg: enable accounting for file lock caches Vasily Averin
@ 2021-07-26 19:01             ` Vasily Averin
  2021-07-26 19:01             ` [PATCH v6 11/16] memcg: enable accounting for new namespaces and struct nsproxy Vasily Averin
                               ` (5 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and for regular files by the file lease code.
This structure is quite small but long-lived, and it can be assigned
to any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index dfc72f1..7941559 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


* [PATCH v6 11/16] memcg: enable accounting for new namespaces and struct nsproxy
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (9 preceding siblings ...)
  2021-07-26 19:01             ` [PATCH v6 10/16] memcg: enable accounting for fasync_cache Vasily Averin
@ 2021-07-26 19:01             ` Vasily Averin
  2021-07-26 19:58               ` Kirill Tkhai
  2021-07-26 19:01             ` [PATCH v6 12/16] memcg: enable accounting of ipc resources Vasily Averin
                               ` (4 subsequent siblings)
  15 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Tejun Heo, Andrew Morton,
	Zefan Li, Thomas Gleixner, Christian Brauner, Kirill Tkhai,
	Serge Hallyn, Andrei Vagin, linux-kernel

A container admin can create new namespaces and force the kernel to
allocate up to several pages of memory for the namespaces and their
associated structures.
Net and uts namespaces already have accounting enabled for such allocations.
It makes sense to account for the remaining ones to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namespace.c            | 2 +-
 ipc/namespace.c           | 2 +-
 kernel/cgroup/namespace.c | 2 +-
 kernel/nsproxy.c          | 2 +-
 kernel/pid_namespace.c    | 2 +-
 kernel/time/namespace.c   | 4 ++--
 kernel/user_namespace.c   | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c6a74e5..e443ee6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3289,7 +3289,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
 	if (!ucounts)
 		return ERR_PTR(-ENOSPC);
 
-	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns) {
 		dec_mnt_namespaces(ucounts);
 		return ERR_PTR(-ENOMEM);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7bd0766..ae83f0f 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
+	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
 	if (ns == NULL)
 		goto fail_dec;
 
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index f5e8828..0d5c298 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
 	struct cgroup_namespace *new_ns;
 	int ret;
 
-	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns)
 		return ERR_PTR(-ENOMEM);
 	ret = ns_alloc_inum(&new_ns->ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index abc01fc..eec72ca 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
 
 int __init nsproxy_cache_init(void)
 {
-	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
+	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
 	return 0;
 }
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index ca43239..6cd6715 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
 
 static __init int pid_namespaces_init(void)
 {
-	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	register_sysctl_paths(kern_path, pid_ns_ctl_table);
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 12eab0d..aec8328 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
 	if (!ns)
 		goto fail_dec;
 
 	refcount_set(&ns->ns.count, 1);
 
-	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!ns->vvar_page)
 		goto fail_free;
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index ef82d40..6b2e3ca 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1385,7 +1385,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
 
 static __init int user_namespaces_init(void)
 {
-	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
+	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 	return 0;
 }
 subsys_initcall(user_namespaces_init);
-- 
1.8.3.1


* [PATCH v6 12/16] memcg: enable accounting of ipc resources
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (10 preceding siblings ...)
  2021-07-26 19:01             ` [PATCH v6 11/16] memcg: enable accounting for new namespaces and struct nsproxy Vasily Averin
@ 2021-07-26 19:01             ` Vasily Averin
  2021-07-26 19:01             ` [PATCH v6 13/16] memcg: enable accounting for signals Vasily Averin
                               ` (3 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexey Dobriyan,
	Dmitry Safonov, Yutian Yang, linux-kernel

When a user creates IPC objects, it forces the kernel to allocate memory
for these long-lived objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, messages,
semaphores and semaphore undo lists.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 ipc/msg.c | 2 +-
 ipc/sem.c | 9 +++++----
 ipc/shm.c | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 6810276..a0d0577 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kmalloc(sizeof(*msq), GFP_KERNEL);
+	msq = kmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
diff --git a/ipc/sem.c b/ipc/sem.c
index 971e75d..1a8b9f0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -514,7 +514,7 @@ static struct sem_array *sem_alloc(size_t nsems)
 	if (nsems > (INT_MAX - sizeof(*sma)) / sizeof(sma->sems[0]))
 		return NULL;
 
-	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL);
+	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!sma))
 		return NULL;
 
@@ -1855,7 +1855,7 @@ static inline int get_undo_list(struct sem_undo_list **undo_listp)
 
 	undo_list = current->sysvsem.undo_list;
 	if (!undo_list) {
-		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL);
+		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL_ACCOUNT);
 		if (undo_list == NULL)
 			return -ENOMEM;
 		spin_lock_init(&undo_list->lock);
@@ -1941,7 +1941,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 
 	/* step 2: allocate new undo structure */
 	new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!new) {
 		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
 		return ERR_PTR(-ENOMEM);
@@ -2005,7 +2005,8 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 	if (nsops > ns->sc_semopm)
 		return -E2BIG;
 	if (nsops > SEMOPM_FAST) {
-		sops = kvmalloc_array(nsops, sizeof(*sops), GFP_KERNEL);
+		sops = kvmalloc_array(nsops, sizeof(*sops),
+				      GFP_KERNEL_ACCOUNT);
 		if (sops == NULL)
 			return -ENOMEM;
 	}
diff --git a/ipc/shm.c b/ipc/shm.c
index 748933e..ab749be 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 			ns->shm_tot + numpages > ns->shm_ctlall)
 		return -ENOSPC;
 
-	shp = kmalloc(sizeof(*shp), GFP_KERNEL);
+	shp = kmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!shp))
 		return -ENOMEM;
 
-- 
1.8.3.1


* [PATCH v6 13/16] memcg: enable accounting for signals
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (11 preceding siblings ...)
  2021-07-26 19:01             ` [PATCH v6 12/16] memcg: enable accounting of ipc resources Vasily Averin
@ 2021-07-26 19:01             ` Vasily Averin
  2021-07-26 19:01             ` [PATCH v6 14/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
                               ` (2 subsequent siblings)
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Jens Axboe, Eric W. Biederman,
	Oleg Nesterov, linux-kernel

When a user sends a signal to another process, it forces the kernel
to allocate memory for 'struct sigqueue' objects. The number of signals
is limited by the RLIMIT_SIGPENDING resource limit, but even the default
settings allow each user to consume up to several megabytes of memory.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index a3229ad..8921c4a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4663,7 +4663,7 @@ void __init signals_init(void)
 {
 	siginfo_buildtime_checks();
 
-	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
+	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
 }
 
 #ifdef CONFIG_KGDB_KDB
-- 
1.8.3.1


* [PATCH v6 14/16] memcg: enable accounting for posix_timers_cache slab
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (12 preceding siblings ...)
  2021-07-26 19:01             ` [PATCH v6 13/16] memcg: enable accounting for signals Vasily Averin
@ 2021-07-26 19:01             ` Vasily Averin
  2021-07-26 19:01             ` [PATCH v6 15/16] memcg: enable accounting for tty-related objects Vasily Averin
  2021-07-26 19:01             ` [PATCH v6 16/16] memcg: enable accounting for ldt_struct objects Vasily Averin
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, linux-kernel

A program may create multiple interval timers using timer_create().
For each timer the kernel preallocates a "queued real-time signal".
Consequently, the number of timers is limited by the RLIMIT_SIGPENDING
resource limit. The allocated object is quite small, ~250 bytes,
but even the default signal limits allow consuming up to 100 megabytes
per user.

It makes sense to account for them to limit the host's memory consumption
from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/posix-timers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index dd5697d..7363f81 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -273,8 +273,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
 static __init int init_posix_timers(void)
 {
 	posix_timers_cache = kmem_cache_create("posix_timers_cache",
-					sizeof (struct k_itimer), 0, SLAB_PANIC,
-					NULL);
+					sizeof(struct k_itimer), 0,
+					SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);
-- 
1.8.3.1


* [PATCH v6 15/16] memcg: enable accounting for tty-related objects
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (13 preceding siblings ...)
  2021-07-26 19:01             ` [PATCH v6 14/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
@ 2021-07-26 19:01             ` Vasily Averin
  2021-07-26 19:01             ` [PATCH v6 16/16] memcg: enable accounting for ldt_struct objects Vasily Averin
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Greg Kroah-Hartman, Jiri Slaby,
	linux-kernel

At each login the user forces the kernel to create a new terminal and
allocate up to ~1KB of memory for the tty-related structures.

By default, up to 4096 ptys can be created, with 1024 reserved for the
initial mount namespace only; these settings are controlled by the host
admin.

However, this default is not enough for hosters with thousands
of containers per node. The host admin can be forced to increase it
up to NR_UNIX98_PTY_MAX = 1<<20.

By default a container is restricted by pty mount_opt.max = 1024,
but an admin inside the container can change it via remount. As a result,
one container can consume almost all allowed ptys
and allocate up to 1GB of unaccounted memory.
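
The 1GB estimate follows directly from the two numbers quoted above.
A minimal sketch of the arithmetic, where TTY_BYTES_PER_PTY is only the
approximate "~1KB" figure from this message (not a measured structure
size) and pty_worst_case_bytes() is a name made up for illustration:

```c
#include <stdint.h>

#define NR_UNIX98_PTY_MAX (1u << 20)	/* upper bound quoted in the text */
#define TTY_BYTES_PER_PTY 1024u		/* "~1KB" per pty, approximate */

/* Unaccounted memory if one container opens nr_ptys terminals. */
uint64_t pty_worst_case_bytes(uint64_t nr_ptys, uint64_t bytes_per_pty)
{
	return nr_ptys * bytes_per_pty;
}
```

With the default per-mount limit of 1024 ptys this is about 1MB; after
a remount that lifts the limit to NR_UNIX98_PTY_MAX it grows to
(1<<20) * 1024 bytes, i.e. 1GB.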

This is not enough by itself to trigger OOM on the host; however, it allows
a container to significantly exceed its assigned memcg limit and leads to
trouble on an over-committed node.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/tty/tty_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 26debec..e787f6f 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1493,7 +1493,7 @@ void tty_save_termios(struct tty_struct *tty)
 	/* Stash the termios data */
 	tp = tty->driver->termios[idx];
 	if (tp == NULL) {
-		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
+		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
 		if (tp == NULL)
 			return;
 		tty->driver->termios[idx] = tp;
@@ -3119,7 +3119,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
 {
 	struct tty_struct *tty;
 
-	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
+	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
 	if (!tty)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v6 16/16] memcg: enable accounting for ldt_struct objects
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (14 preceding siblings ...)
  2021-07-26 19:01             ` [PATCH v6 15/16] memcg: enable accounting for tty-related objects Vasily Averin
@ 2021-07-26 19:01             ` Vasily Averin
  15 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, linux-kernel

Each task can request its own LDT and force the kernel to allocate up to
64KB of memory per mm.

There are legitimate workloads with hundreds of processes and there
can be hundreds of workloads running on large machines.
The unaccounted memory can cause isolation issues between the workloads
particularly on highly utilized machines.

It makes sense to account for these objects to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/ldt.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index aa15132..525876e 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -154,7 +154,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
 
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);
 	if (!new_ldt)
 		return NULL;
 
@@ -168,9 +168,9 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	 * than PAGE_SIZE.
 	 */
 	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
+		new_ldt->entries = __vmalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
+		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 
 	if (!new_ldt->entries) {
 		kfree(new_ldt);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v6 11/16] memcg: enable accounting for new namesapces and struct nsproxy
  2021-07-26 19:01             ` [PATCH v6 11/16] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
@ 2021-07-26 19:58               ` Kirill Tkhai
  0 siblings, 0 replies; 101+ messages in thread
From: Kirill Tkhai @ 2021-07-26 19:58 UTC (permalink / raw)
  To: Vasily Averin, Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Tejun Heo, Zefan Li,
	Thomas Gleixner, Christian Brauner, Serge Hallyn, Andrei Vagin,
	linux-kernel

On 26.07.2021 22:01, Vasily Averin wrote:
> Container admin can create new namespaces and force kernel to allocate
> up to several pages of memory for the namespaces and its associated
> structures.
> Net and uts namespaces have enabled accounting for such allocations.
> It makes sense to account for rest ones to restrict the host's memory
> consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>

> ---
>  fs/namespace.c            | 2 +-
>  ipc/namespace.c           | 2 +-
>  kernel/cgroup/namespace.c | 2 +-
>  kernel/nsproxy.c          | 2 +-
>  kernel/pid_namespace.c    | 2 +-
>  kernel/time/namespace.c   | 4 ++--
>  kernel/user_namespace.c   | 2 +-
>  7 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index c6a74e5..e443ee6 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3289,7 +3289,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
>  	if (!ucounts)
>  		return ERR_PTR(-ENOSPC);
>  
> -	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
> +	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
>  	if (!new_ns) {
>  		dec_mnt_namespaces(ucounts);
>  		return ERR_PTR(-ENOMEM);
> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index 7bd0766..ae83f0f 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
>  		goto fail;
>  
>  	err = -ENOMEM;
> -	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
> +	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
>  	if (ns == NULL)
>  		goto fail_dec;
>  
> diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
> index f5e8828..0d5c298 100644
> --- a/kernel/cgroup/namespace.c
> +++ b/kernel/cgroup/namespace.c
> @@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
>  	struct cgroup_namespace *new_ns;
>  	int ret;
>  
> -	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> +	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
>  	if (!new_ns)
>  		return ERR_PTR(-ENOMEM);
>  	ret = ns_alloc_inum(&new_ns->ns);
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index abc01fc..eec72ca 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
>  
>  int __init nsproxy_cache_init(void)
>  {
> -	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
> +	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
>  	return 0;
>  }
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index ca43239..6cd6715 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
>  
>  static __init int pid_namespaces_init(void)
>  {
> -	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> +	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>  
>  #ifdef CONFIG_CHECKPOINT_RESTORE
>  	register_sysctl_paths(kern_path, pid_ns_ctl_table);
> diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
> index 12eab0d..aec8328 100644
> --- a/kernel/time/namespace.c
> +++ b/kernel/time/namespace.c
> @@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
>  		goto fail;
>  
>  	err = -ENOMEM;
> -	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
> +	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
>  	if (!ns)
>  		goto fail_dec;
>  
>  	refcount_set(&ns->ns.count, 1);
>  
> -	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>  	if (!ns->vvar_page)
>  		goto fail_free;
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index ef82d40..6b2e3ca 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1385,7 +1385,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
>  
>  static __init int user_namespaces_init(void)
>  {
> -	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
> +	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>  	return 0;
>  }
>  subsys_initcall(user_namespaces_init);
> 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v6 00/16] memcg accounting from
  2021-07-26 18:59           ` [PATCH v6 00/16] memcg accounting from Vasily Averin
@ 2021-07-26 21:59             ` David Miller
  2021-07-27  4:44               ` [PATCH v6 00/16] memcg accounting from OpenVZ Vasily Averin
  0 siblings, 1 reply; 101+ messages in thread
From: David Miller @ 2021-07-26 21:59 UTC (permalink / raw)
  To: vvs
  Cc: akpm, tj, cgroups, mhocko, hannes, vdavydov.dev, guro, shakeelb,
	nglaive, viro, adobriyan, avagin, bp, christian.brauner, dsahern,
	0x7f454c46, edumazet, ebiederm, gregkh, yoshfuji, hpa, mingo,
	kuba, bfields, jlayton, axboe, jirislaby, ktkhai, oleg, serge,
	tglx, lizefan.x, netdev, linux-fsdevel, linux-kernel


This series does not apply cleanly to net-next, please respin.

Thank you.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v6 00/16] memcg accounting from OpenVZ
  2021-07-26 21:59             ` David Miller
@ 2021-07-27  4:44               ` Vasily Averin
  2021-07-27  5:33                 ` [PATCH v7 00/10] " Vasily Averin
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
  0 siblings, 2 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  4:44 UTC (permalink / raw)
  To: David Miller
  Cc: akpm, tj, cgroups, mhocko, hannes, vdavydov.dev, guro, shakeelb,
	nglaive, viro, adobriyan, avagin, bp, christian.brauner, dsahern,
	0x7f454c46, edumazet, ebiederm, gregkh, yoshfuji, hpa, mingo,
	kuba, bfields, jlayton, axboe, jirislaby, ktkhai, oleg, serge,
	tglx, lizefan.x, netdev, linux-fsdevel, linux-kernel

On 7/27/21 12:59 AM, David Miller wrote:
> 
> This series does not apply cleanly to net-next, please respin.

Dear David,
I found that you have already approved net-related patches of this series and included them into net-next.
So I'll respin v7 without these patches.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v7 00/10] memcg accounting from OpenVZ
  2021-07-27  4:44               ` [PATCH v6 00/16] memcg accounting from OpenVZ Vasily Averin
@ 2021-07-27  5:33                 ` Vasily Averin
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
  1 sibling, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Borislav Petkov,
	Christian Brauner, Dmitry Safonov, Eric W. Biederman,
	Greg Kroah-Hartman, H. Peter Anvin, Ingo Molnar, J. Bruce Fields,
	Jeff Layton, Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Thomas Gleixner, Zefan Li, linux-fsdevel, LKML

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux
kernels. Initially we used our own accounting subsystem, then partially
committed it to upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory 
allocation triggered by untrusted users inside the containers.

The patch set is addressed mostly to cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from
maintainers of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and driver's fasync queues
- tty objects
- per-mm LDT

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

Also we're going to add an accounting for nft, however it is not ready yet.

We have not tested performance on upstream; however, our performance team
compared our current RHEL7-based production kernel and reports that
it performs at least no worse than the corresponding original RHEL7 kernel.

v7:
- net-related patches were approved and included into net-next git
- rebased to v5.14-rc3
- added Acked-by tag from Kirill Tkhai on "memcg: enable accounting for
  new namesapces and struct nsproxy"

v6:
- improved description of "memcg: enable accounting for signals"
  according to Eric Biederman's wishes
- added Reviewed-by tag from Shakeel Butt on the same patch

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which used this kind
   of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (10):
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 17 files changed, 34 insertions(+), 29 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
@ 2021-07-27  5:33                   ` Vasily Averin
  2021-07-27  6:44                     ` Shakeel Butt
  2021-07-27  7:21                     ` Christian Brauner
  2021-07-27  5:33                   ` [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
                                     ` (8 subsequent siblings)
  9 siblings, 2 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

The kernel allocates ~400 bytes of 'struct mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab4174a..c6a74e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4222,7 +4223,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
  2021-07-27  5:33                   ` [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries Vasily Averin
@ 2021-07-27  5:33                   ` Vasily Averin
  2021-07-27 21:39                     ` Shakeel Butt
  2021-07-27  5:33                   ` [PATCH v7 03/10] memcg: enable accounting for file lock caches Vasily Averin
                                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

A user can call the select/poll system calls with a large number of
file descriptors and force the kernel to allocate up to several pages of
memory that stay allocated until these sleeping system calls return.
These are long-lived, unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v7 03/10] memcg: enable accounting for file lock caches
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
  2021-07-27  5:33                   ` [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries Vasily Averin
  2021-07-27  5:33                   ` [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
@ 2021-07-27  5:33                   ` Vasily Averin
  2021-07-27 21:41                     ` Shakeel Butt
  2021-07-27  5:33                   ` [PATCH v7 04/10] memcg: enable accounting for fasync_cache Vasily Averin
                                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

A user can create file locks for each open file and force the kernel
to allocate small but long-lived objects for each open file.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 74b2a1d..1bc7ede 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3056,10 +3056,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v7 04/10] memcg: enable accounting for fasync_cache
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
                                     ` (2 preceding siblings ...)
  2021-07-27  5:33                   ` [PATCH v7 03/10] memcg: enable accounting for file lock caches Vasily Averin
@ 2021-07-27  5:33                   ` Vasily Averin
  2021-07-27 21:50                     ` Shakeel Butt
  2021-07-27  5:33                   ` [PATCH v7 05/10] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
                                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and for regular files by the file lease code.
This structure is quite small but long-lived, and it can be allocated
for any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index f946bec..714e7c9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v7 05/10] memcg: enable accounting for new namesapces and struct nsproxy
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
                                     ` (3 preceding siblings ...)
  2021-07-27  5:33                   ` [PATCH v7 04/10] memcg: enable accounting for fasync_cache Vasily Averin
@ 2021-07-27  5:33                   ` Vasily Averin
  2021-07-27 21:51                     ` Shakeel Butt
  2021-07-27  5:33                   ` [PATCH v7 06/10] memcg: enable accounting of ipc resources Vasily Averin
                                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Tejun Heo, Andrew Morton,
	Zefan Li, Thomas Gleixner, Christian Brauner, Kirill Tkhai,
	Serge Hallyn, Andrei Vagin, linux-kernel

A container admin can create new namespaces and force the kernel to
allocate up to several pages of memory for the namespaces and their
associated structures.
Net and uts namespaces already have accounting enabled for such
allocations. It makes sense to account for the remaining ones to restrict
the host's memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/namespace.c            | 2 +-
 ipc/namespace.c           | 2 +-
 kernel/cgroup/namespace.c | 2 +-
 kernel/nsproxy.c          | 2 +-
 kernel/pid_namespace.c    | 2 +-
 kernel/time/namespace.c   | 4 ++--
 kernel/user_namespace.c   | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c6a74e5..e443ee6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3289,7 +3289,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool a
 	if (!ucounts)
 		return ERR_PTR(-ENOSPC);
 
-	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns) {
 		dec_mnt_namespaces(ucounts);
 		return ERR_PTR(-ENOMEM);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 7bd0766..ae83f0f 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -42,7 +42,7 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL);
+	ns = kzalloc(sizeof(struct ipc_namespace), GFP_KERNEL_ACCOUNT);
 	if (ns == NULL)
 		goto fail_dec;
 
diff --git a/kernel/cgroup/namespace.c b/kernel/cgroup/namespace.c
index f5e8828..0d5c298 100644
--- a/kernel/cgroup/namespace.c
+++ b/kernel/cgroup/namespace.c
@@ -24,7 +24,7 @@ static struct cgroup_namespace *alloc_cgroup_ns(void)
 	struct cgroup_namespace *new_ns;
 	int ret;
 
-	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL_ACCOUNT);
 	if (!new_ns)
 		return ERR_PTR(-ENOMEM);
 	ret = ns_alloc_inum(&new_ns->ns);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index abc01fc..eec72ca 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -568,6 +568,6 @@ static void commit_nsset(struct nsset *nsset)
 
 int __init nsproxy_cache_init(void)
 {
-	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
+	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC|SLAB_ACCOUNT);
 	return 0;
 }
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index ca43239..6cd6715 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -449,7 +449,7 @@ static struct user_namespace *pidns_owner(struct ns_common *ns)
 
 static __init int pid_namespaces_init(void)
 {
-	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	register_sysctl_paths(kern_path, pid_ns_ctl_table);
diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c
index 12eab0d..aec8328 100644
--- a/kernel/time/namespace.c
+++ b/kernel/time/namespace.c
@@ -88,13 +88,13 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
 		goto fail;
 
 	err = -ENOMEM;
-	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+	ns = kmalloc(sizeof(*ns), GFP_KERNEL_ACCOUNT);
 	if (!ns)
 		goto fail_dec;
 
 	refcount_set(&ns->ns.count, 1);
 
-	ns->vvar_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	ns->vvar_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!ns->vvar_page)
 		goto fail_free;
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index ef82d40..6b2e3ca 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1385,7 +1385,7 @@ static struct user_namespace *userns_owner(struct ns_common *ns)
 
 static __init int user_namespaces_init(void)
 {
-	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC);
+	user_ns_cachep = KMEM_CACHE(user_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 	return 0;
 }
 subsys_initcall(user_namespaces_init);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v7 06/10] memcg: enable accounting of ipc resources
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
                                     ` (4 preceding siblings ...)
  2021-07-27  5:33                   ` [PATCH v7 05/10] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
@ 2021-07-27  5:33                   ` Vasily Averin
  2021-07-27 22:33                     ` Shakeel Butt
  2021-07-27  5:34                   ` [PATCH v7 07/10] memcg: enable accounting for signals Vasily Averin
                                     ` (3 subsequent siblings)
  9 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexey Dobriyan,
	Dmitry Safonov, Yutian Yang, linux-kernel

When a user creates IPC objects, it forces the kernel to allocate memory
for these long-lived objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

This patch enables accounting for IPC shared memory segments, message
queues, semaphores and semaphore undo lists.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 ipc/msg.c | 2 +-
 ipc/sem.c | 9 +++++----
 ipc/shm.c | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 6810276..a0d0577 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kmalloc(sizeof(*msq), GFP_KERNEL);
+	msq = kmalloc(sizeof(*msq), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
diff --git a/ipc/sem.c b/ipc/sem.c
index 971e75d..1a8b9f0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -514,7 +514,7 @@ static struct sem_array *sem_alloc(size_t nsems)
 	if (nsems > (INT_MAX - sizeof(*sma)) / sizeof(sma->sems[0]))
 		return NULL;
 
-	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL);
+	sma = kvzalloc(struct_size(sma, sems, nsems), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!sma))
 		return NULL;
 
@@ -1855,7 +1855,7 @@ static inline int get_undo_list(struct sem_undo_list **undo_listp)
 
 	undo_list = current->sysvsem.undo_list;
 	if (!undo_list) {
-		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL);
+		undo_list = kzalloc(sizeof(*undo_list), GFP_KERNEL_ACCOUNT);
 		if (undo_list == NULL)
 			return -ENOMEM;
 		spin_lock_init(&undo_list->lock);
@@ -1941,7 +1941,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid)
 
 	/* step 2: allocate new undo structure */
 	new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
-		       GFP_KERNEL);
+		       GFP_KERNEL_ACCOUNT);
 	if (!new) {
 		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
 		return ERR_PTR(-ENOMEM);
@@ -2005,7 +2005,8 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops,
 	if (nsops > ns->sc_semopm)
 		return -E2BIG;
 	if (nsops > SEMOPM_FAST) {
-		sops = kvmalloc_array(nsops, sizeof(*sops), GFP_KERNEL);
+		sops = kvmalloc_array(nsops, sizeof(*sops),
+				      GFP_KERNEL_ACCOUNT);
 		if (sops == NULL)
 			return -ENOMEM;
 	}
diff --git a/ipc/shm.c b/ipc/shm.c
index 748933e..ab749be 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 			ns->shm_tot + numpages > ns->shm_ctlall)
 		return -ENOSPC;
 
-	shp = kmalloc(sizeof(*shp), GFP_KERNEL);
+	shp = kmalloc(sizeof(*shp), GFP_KERNEL_ACCOUNT);
 	if (unlikely(!shp))
 		return -ENOMEM;
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v7 07/10] memcg: enable accounting for signals
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
                                     ` (5 preceding siblings ...)
  2021-07-27  5:33                   ` [PATCH v7 06/10] memcg: enable accounting of ipc resources Vasily Averin
@ 2021-07-27  5:34                   ` Vasily Averin
  2021-07-27  5:34                   ` [PATCH v7 08/10] memcg: enable accounting for posix_timers_cache slab Vasily Averin
                                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Jens Axboe, Eric W. Biederman,
	Oleg Nesterov, linux-kernel

When a user sends a signal to another process, it forces the kernel
to allocate memory for 'struct sigqueue' objects. The number of signals
is limited by the RLIMIT_SIGPENDING resource limit, but even the default
settings allow each user to consume up to several megabytes of memory.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.
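
A back-of-the-envelope estimate of that exposure, as a sketch only: the
80-byte object size is an assumption (the real sizeof(struct sigqueue)
depends on kernel version and architecture), and both helper names are
hypothetical. getrlimit() and RLIMIT_SIGPENDING are the usual Linux
userspace interfaces for reading the limit.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/resource.h>

/* Assumption: rough size of one queued-signal object, in bytes. */
#define SIGQUEUE_BYTES_ASSUMED 80

/* The cast in current_sigpending_limit() relies on this. */
static_assert(sizeof(rlim_t) <= sizeof(uint64_t), "rlim_t fits in uint64_t");

/* Worst-case bytes one user can pin with max_pending queued signals. */
uint64_t sigqueue_worst_case_bytes(uint64_t max_pending, uint64_t obj_bytes)
{
	return max_pending * obj_bytes;
}

/*
 * Soft RLIMIT_SIGPENDING of the calling process.
 * Returns 0 if the limit cannot be read or is unlimited.
 */
uint64_t current_sigpending_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_SIGPENDING, &rl) != 0 ||
	    rl.rlim_cur == RLIM_INFINITY)
		return 0;
	return (uint64_t)rl.rlim_cur;
}
```

For example, a soft limit of 65536 pending signals with the assumed
80-byte objects gives 5 MiB per user.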

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index a3229ad..8921c4a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4663,7 +4663,7 @@ void __init signals_init(void)
 {
 	siginfo_buildtime_checks();
 
-	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
+	sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
 }
 
 #ifdef CONFIG_KGDB_KDB
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v7 08/10] memcg: enable accounting for posix_timers_cache slab
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
                                     ` (6 preceding siblings ...)
  2021-07-27  5:34                   ` [PATCH v7 07/10] memcg: enable accounting for signals Vasily Averin
@ 2021-07-27  5:34                   ` Vasily Averin
  2021-07-27 22:33                     ` Shakeel Butt
  2021-07-27  5:34                   ` [PATCH v7 09/10] memcg: enable accounting for tty-related objects Vasily Averin
  2021-07-27  5:34                   ` [PATCH v7 10/10] memcg: enable accounting for ldt_struct objects Vasily Averin
  9 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, linux-kernel

A program may create multiple interval timers using timer_create().
For each timer the kernel preallocates a "queued real-time signal".
Consequently, the number of timers is limited by the RLIMIT_SIGPENDING
resource limit. The allocated object is quite small, ~250 bytes,
but even the default signal limits allow a user to consume up to
100 megabytes.

It makes sense to account for them to limit the host's memory consumption
from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/posix-timers.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index dd5697d..7363f81 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -273,8 +273,8 @@ static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
 static __init int init_posix_timers(void)
 {
 	posix_timers_cache = kmem_cache_create("posix_timers_cache",
-					sizeof (struct k_itimer), 0, SLAB_PANIC,
-					NULL);
+					sizeof(struct k_itimer), 0,
+					SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 __initcall(init_posix_timers);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v7 09/10] memcg: enable accounting for tty-related objects
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
                                     ` (7 preceding siblings ...)
  2021-07-27  5:34                   ` [PATCH v7 08/10] memcg: enable accounting for posix_timers_cache slab Vasily Averin
@ 2021-07-27  5:34                   ` Vasily Averin
  2021-07-27  6:09                     ` Greg Kroah-Hartman
  2021-07-27  6:54                     ` Jiri Slaby
  2021-07-27  5:34                   ` [PATCH v7 10/10] memcg: enable accounting for ldt_struct objects Vasily Averin
  9 siblings, 2 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Greg Kroah-Hartman, Jiri Slaby,
	linux-kernel

At each login the user forces the kernel to create a new terminal and
allocate up to ~1 KB of memory for the tty-related structures.

By default, up to 4096 ptys can be created, with 1024 reserved for the
initial mount namespace only; these settings are controlled by the host
admin.

However, this default is not enough for hosters with thousands
of containers per node, so the host admin can be forced to raise it
up to NR_UNIX98_PTY_MAX = 1<<20.

By default a container is restricted by pty mount_opt.max = 1024,
but an admin inside the container can change it via remount. As a result,
one container can consume almost all allowed ptys
and allocate up to 1 GB of unaccounted memory.

This is not enough by itself to trigger OOM on the host, but it allows
a container to significantly exceed its assigned memcg limit and leads
to trouble on an over-committed node.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
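
A quick check of the one-gigabyte worst case, as a minimal sketch: it
assumes exactly 1024 bytes of tty structures per pty (the text above only
gives an approximate figure), and the helper name pty_unaccounted_bytes()
is hypothetical, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define NR_UNIX98_PTY_MAX	(1u << 20)	/* upper bound quoted above */
#define PTY_DEFAULT_MAX		4096u		/* default pty limit quoted above */
#define TTY_BYTES_PER_PTY	1024u		/* assumption: roughly 1 KB per pty */

/* 2^20 ptys at 1024 bytes each is exactly 2^30 bytes (1 GiB). */
static_assert((uint64_t)NR_UNIX98_PTY_MAX * TTY_BYTES_PER_PTY == (1ULL << 30),
	      "worst case is 1 GiB");

/* Unaccounted tty memory, in bytes, for nr_ptys open ptys. */
uint64_t pty_unaccounted_bytes(uint64_t nr_ptys)
{
	return nr_ptys * TTY_BYTES_PER_PTY;
}
```

Under the same assumption, the default limit of 4096 ptys corresponds to
only 4 MiB, which is why the problem shows up mainly after the limit has
been raised.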

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 drivers/tty/tty_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 26debec..e787f6f 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1493,7 +1493,7 @@ void tty_save_termios(struct tty_struct *tty)
 	/* Stash the termios data */
 	tp = tty->driver->termios[idx];
 	if (tp == NULL) {
-		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
+		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
 		if (tp == NULL)
 			return;
 		tty->driver->termios[idx] = tp;
@@ -3119,7 +3119,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
 {
 	struct tty_struct *tty;
 
-	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
+	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
 	if (!tty)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH v7 10/10] memcg: enable accounting for ldt_struct objects
       [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
                                     ` (8 preceding siblings ...)
  2021-07-27  5:34                   ` [PATCH v7 09/10] memcg: enable accounting for tty-related objects Vasily Averin
@ 2021-07-27  5:34                   ` Vasily Averin
  2021-07-27 22:36                     ` Shakeel Butt
  9 siblings, 1 reply; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  5:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, linux-kernel

Each task can request its own LDT and force the kernel to allocate up to
64 KB of memory per mm.

There are legitimate workloads with hundreds of processes and there
can be hundreds of workloads running on large machines.
The unaccounted memory can cause isolation issues between the workloads
particularly on highly utilized machines.

It makes sense to account for these objects to restrict the host's memory
consumption from inside the memcg-limited container.
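
The per-mm upper bound follows from the x86 LDT limits. A minimal sketch,
assuming 4 KiB pages and that the entries buffer is sized as
num_entries * LDT_ENTRY_SIZE; the two helper names are hypothetical, while
LDT_ENTRIES and LDT_ENTRY_SIZE mirror the values in the x86 UAPI header
<asm/ldt.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Redefined locally so the sketch builds on any architecture. */
#define LDT_ENTRIES		8192
#define LDT_ENTRY_SIZE		8

/* Assumption: 4 KiB pages, the usual x86 configuration. */
#define SKETCH_PAGE_SIZE	4096

/* A fully populated LDT occupies 64 KiB. */
static_assert(LDT_ENTRIES * LDT_ENTRY_SIZE == 64 * 1024, "full LDT is 64 KiB");

/* Bytes needed for an LDT holding num_entries descriptors. */
size_t ldt_alloc_size(unsigned int num_entries)
{
	return (size_t)num_entries * LDT_ENTRY_SIZE;
}

/*
 * Nonzero when the buffer is larger than one page, i.e. the case where
 * alloc_ldt_struct() uses the vmalloc path instead of a single zeroed page.
 */
int ldt_uses_vmalloc(unsigned int num_entries)
{
	return ldt_alloc_size(num_entries) > SKETCH_PAGE_SIZE;
}
```

Both allocation paths need the accounting flag, because an LDT of up to
512 entries fits in one page and anything larger goes through vmalloc.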

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/ldt.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index aa15132..525876e 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -154,7 +154,7 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
 
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);
 	if (!new_ldt)
 		return NULL;
 
@@ -168,9 +168,9 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 	 * than PAGE_SIZE.
 	 */
 	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
+		new_ldt->entries = __vmalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
+		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 
 	if (!new_ldt->entries) {
 		kfree(new_ldt);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v7 09/10] memcg: enable accounting for tty-related objects
  2021-07-27  5:34                   ` [PATCH v7 09/10] memcg: enable accounting for tty-related objects Vasily Averin
@ 2021-07-27  6:09                     ` Greg Kroah-Hartman
  2021-07-27  6:54                     ` Jiri Slaby
  1 sibling, 0 replies; 101+ messages in thread
From: Greg Kroah-Hartman @ 2021-07-27  6:09 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Jiri Slaby,
	linux-kernel

On Tue, Jul 27, 2021 at 08:34:14AM +0300, Vasily Averin wrote:
> At each login the user forces the kernel to create a new terminal and
> allocate up to ~1Kb memory for the tty-related structures.
> 
> By default it's allowed to create up to 4096 ptys with 1024 reserve for
> initial mount namespace only and the settings are controlled by host admin.
> 
> Though this default is not enough for hosters with thousands
> of containers per node. Host admin can be forced to increase it
> up to NR_UNIX98_PTY_MAX = 1<<20.
> 
> By default container is restricted by pty mount_opt.max = 1024,
> but admin inside container can change it via remount. As a result,
> one container can consume almost all allowed ptys
> and allocate up to 1Gb of unaccounted memory.
> 
> It is not enough per-se to trigger OOM on host, however anyway, it allows
> to significantly exceed the assigned memcg limit and leads to troubles
> on the over-committed node.
> 
> It makes sense to account for them to restrict the host's memory
> consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>  drivers/tty/tty_io.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

As this is independent of all of the rest, I'll just take this through
my tree now so that you do not have to keep resending it.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries
  2021-07-27  5:33                   ` [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries Vasily Averin
@ 2021-07-27  6:44                     ` Shakeel Butt
  2021-07-27  7:21                     ` Christian Brauner
  1 sibling, 0 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-27  6:44 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	LKML

On Mon, Jul 26, 2021 at 10:33 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> The kernel allocates ~400 bytes of 'strcut mount' for any new mount.

*struct mount*

> Creating a new mount namespace clones most of the parent mounts,
> and this can be repeated many times. Additionally, each mount allocates
> up to PATH_MAX=4096 bytes for mnt->mnt_devname.
>
> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v7 09/10] memcg: enable accounting for tty-related objects
  2021-07-27  5:34                   ` [PATCH v7 09/10] memcg: enable accounting for tty-related objects Vasily Averin
  2021-07-27  6:09                     ` Greg Kroah-Hartman
@ 2021-07-27  6:54                     ` Jiri Slaby
  2021-07-27  8:02                       ` Vasily Averin
  1 sibling, 1 reply; 101+ messages in thread
From: Jiri Slaby @ 2021-07-27  6:54 UTC (permalink / raw)
  To: Vasily Averin, Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Greg Kroah-Hartman,
	linux-kernel

On 27. 07. 21, 7:34, Vasily Averin wrote:
> At each login the user forces the kernel to create a new terminal and
> allocate up to ~1Kb memory for the tty-related structures.
> 
> By default it's allowed to create up to 4096 ptys with 1024 reserve for
> initial mount namespace only and the settings are controlled by host admin.
> 
> Though this default is not enough for hosters with thousands
> of containers per node. Host admin can be forced to increase it
> up to NR_UNIX98_PTY_MAX = 1<<20.
> 
> By default container is restricted by pty mount_opt.max = 1024,
> but admin inside container can change it via remount. As a result,
> one container can consume almost all allowed ptys
> and allocate up to 1Gb of unaccounted memory.
> 
> It is not enough per-se to trigger OOM on host, however anyway, it allows
> to significantly exceed the assigned memcg limit and leads to troubles
> on the over-committed node.
> 
> It makes sense to account for them to restrict the host's memory
> consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>   drivers/tty/tty_io.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
> index 26debec..e787f6f 100644
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -1493,7 +1493,7 @@ void tty_save_termios(struct tty_struct *tty)
>   	/* Stash the termios data */
>   	tp = tty->driver->termios[idx];
>   	if (tp == NULL) {
> -		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
> +		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);

termios are not saved for PTYs (TTY_DRIVER_RESET_TERMIOS). Am I missing 
something?

>   		if (tp == NULL)
>   			return;
>   		tty->driver->termios[idx] = tp;
> @@ -3119,7 +3119,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
>   {
>   	struct tty_struct *tty;
>   
> -	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
> +	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
>   	if (!tty)
>   		return NULL;
>   
> 

thanks,
-- 
js
suse labs

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries
  2021-07-27  5:33                   ` [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries Vasily Averin
  2021-07-27  6:44                     ` Shakeel Butt
@ 2021-07-27  7:21                     ` Christian Brauner
  1 sibling, 0 replies; 101+ messages in thread
From: Christian Brauner @ 2021-07-27  7:21 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	Alexander Viro, linux-fsdevel, linux-kernel

On Tue, Jul 27, 2021 at 08:33:12AM +0300, Vasily Averin wrote:
> The kernel allocates ~400 bytes of 'strcut mount' for any new mount.
> Creating a new mount namespace clones most of the parent mounts,
> and this can be repeated many times. Additionally, each mount allocates
> up to PATH_MAX=4096 bytes for mnt->mnt_devname.
> 
> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---

Looks good. Thank you!
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

I wonder how much this increases reported memory consumption when you
boot full system containers that run systemd and a bunch of systemd
services that each use a separate mount namespace.

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v7 09/10] memcg: enable accounting for tty-related objects
  2021-07-27  6:54                     ` Jiri Slaby
@ 2021-07-27  8:02                       ` Vasily Averin
  2021-07-27  9:26                         ` [PATCH TTY] memcg: drop GFP_KERNEL_ACCOUNT use in tty_save_termios() Vasily Averin
  2021-07-27  9:30                         ` [PATCH v7 09/10] " Greg Kroah-Hartman
  0 siblings, 2 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  8:02 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: cgroups, Greg Kroah-Hartman, linux-kernel

On 7/27/21 9:54 AM, Jiri Slaby wrote:
> On 27. 07. 21, 7:34, Vasily Averin wrote:
>> At each login the user forces the kernel to create a new terminal and
>> allocate up to ~1Kb memory for the tty-related structures.
>>
>> By default it's allowed to create up to 4096 ptys with 1024 reserve for
>> initial mount namespace only and the settings are controlled by host admin.
>>
>> Though this default is not enough for hosters with thousands
>> of containers per node. Host admin can be forced to increase it
>> up to NR_UNIX98_PTY_MAX = 1<<20.
>>
>> By default container is restricted by pty mount_opt.max = 1024,
>> but admin inside container can change it via remount. As a result,
>> one container can consume almost all allowed ptys
>> and allocate up to 1Gb of unaccounted memory.
>>
>> It is not enough per-se to trigger OOM on host, however anyway, it allows
>> to significantly exceed the assigned memcg limit and leads to troubles
>> on the over-committed node.
>>
>> It makes sense to account for them to restrict the host's memory
>> consumption from inside the memcg-limited container.
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> ---
>>   drivers/tty/tty_io.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
>> index 26debec..e787f6f 100644
>> --- a/drivers/tty/tty_io.c
>> +++ b/drivers/tty/tty_io.c
>> @@ -1493,7 +1493,7 @@ void tty_save_termios(struct tty_struct *tty)
>>       /* Stash the termios data */
>>       tp = tty->driver->termios[idx];
>>       if (tp == NULL) {
>> -        tp = kmalloc(sizeof(*tp), GFP_KERNEL);
>> +        tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
> 
> termios are not saved for PTYs (TTY_DRIVER_RESET_TERMIOS). Am I missing something?

No, you are right, I missed this.
Typical terminals inside containers use the TTY_DRIVER_RESET_TERMIOS flag
and therefore do not save termios.
So accounting for it has near-zero impact in real life.
I'll prepare a fixup to drop GFP_KERNEL_ACCOUNT here.

Thank you very much,
	Vasily Averin

>>           if (tp == NULL)
>>               return;
>>           tty->driver->termios[idx] = tp;
>> @@ -3119,7 +3119,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
>>   {
>>       struct tty_struct *tty;
>>   -    tty = kzalloc(sizeof(*tty), GFP_KERNEL);
>> +    tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
>>       if (!tty)
>>           return NULL;
>>  
> 
> thanks,


^ permalink raw reply	[flat|nested] 101+ messages in thread

* [PATCH TTY] memcg: drop GFP_KERNEL_ACCOUNT use in tty_save_termios()
  2021-07-27  8:02                       ` Vasily Averin
@ 2021-07-27  9:26                         ` Vasily Averin
  2021-07-27  9:32                           ` Greg Kroah-Hartman
  2022-02-28  9:13                           ` [PATCH v2] memcg: enable accounting for tty-related objects Vasily Averin
  2021-07-27  9:30                         ` [PATCH v7 09/10] " Greg Kroah-Hartman
  1 sibling, 2 replies; 101+ messages in thread
From: Vasily Averin @ 2021-07-27  9:26 UTC (permalink / raw)
  To: Jiri Slaby, Greg Kroah-Hartman; +Cc: cgroups, linux-kernel

Jiri Slaby pointed out that termios are not saved for PTYs or for other
terminals used inside containers. Therefore accounting for saved
termios has near-zero impact in real-life scenarios.

Cc: Jiri Slaby <jirislaby@kernel.org>
Fixes: 854dd8a572a0 ("memcg: enable accounting for tty-related objects")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 drivers/tty/tty_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index e787f6f..a6230b2 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1493,7 +1493,7 @@ void tty_save_termios(struct tty_struct *tty)
 	/* Stash the termios data */
 	tp = tty->driver->termios[idx];
 	if (tp == NULL) {
-		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
+		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
 		if (tp == NULL)
 			return;
 		tty->driver->termios[idx] = tp;
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v7 09/10] memcg: enable accounting for tty-related objects
  2021-07-27  8:02                       ` Vasily Averin
  2021-07-27  9:26                         ` [PATCH TTY] memcg: drop GFP_KERNEL_ACCOUNT use in tty_save_termios() Vasily Averin
@ 2021-07-27  9:30                         ` Greg Kroah-Hartman
  1 sibling, 0 replies; 101+ messages in thread
From: Greg Kroah-Hartman @ 2021-07-27  9:30 UTC (permalink / raw)
  To: Vasily Averin; +Cc: Jiri Slaby, cgroups, linux-kernel

On Tue, Jul 27, 2021 at 11:02:31AM +0300, Vasily Averin wrote:
> On 7/27/21 9:54 AM, Jiri Slaby wrote:
> > On 27. 07. 21, 7:34, Vasily Averin wrote:
> >> At each login the user forces the kernel to create a new terminal and
> >> allocate up to ~1Kb memory for the tty-related structures.
> >>
> >> By default it's allowed to create up to 4096 ptys with 1024 reserve for
> >> initial mount namespace only and the settings are controlled by host admin.
> >>
> >> Though this default is not enough for hosters with thousands
> >> of containers per node. Host admin can be forced to increase it
> >> up to NR_UNIX98_PTY_MAX = 1<<20.
> >>
> >> By default container is restricted by pty mount_opt.max = 1024,
> >> but admin inside container can change it via remount. As a result,
> >> one container can consume almost all allowed ptys
> >> and allocate up to 1Gb of unaccounted memory.
> >>
> >> It is not enough per-se to trigger OOM on host, however anyway, it allows
> >> to significantly exceed the assigned memcg limit and leads to troubles
> >> on the over-committed node.
> >>
> >> It makes sense to account for them to restrict the host's memory
> >> consumption from inside the memcg-limited container.
> >>
> >> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> >> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >> ---
> >>   drivers/tty/tty_io.c | 4 ++--
> >>   1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
> >> index 26debec..e787f6f 100644
> >> --- a/drivers/tty/tty_io.c
> >> +++ b/drivers/tty/tty_io.c
> >> @@ -1493,7 +1493,7 @@ void tty_save_termios(struct tty_struct *tty)
> >>       /* Stash the termios data */
> >>       tp = tty->driver->termios[idx];
> >>       if (tp == NULL) {
> >> -        tp = kmalloc(sizeof(*tp), GFP_KERNEL);
> >> +        tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
> > 
> > termios are not saved for PTYs (TTY_DRIVER_RESET_TERMIOS). Am I missing something?
> 
> No, you are right, I've missed this.
> Typical terminals inside containers use TTY_DRIVER_RESET_TERMIOS flag and therefore do not save termios.
> So its accounting have near-to-zero impact in real life.
> I'll prepare fixup to drop GFP_KERNEL_ACCOUNT here.

I'll go drop this patch from my tree.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH TTY] memcg: drop GFP_KERNEL_ACCOUNT use in tty_save_termios()
  2021-07-27  9:26                         ` [PATCH TTY] memcg: drop GFP_KERNEL_ACCOUNT use in tty_save_termios() Vasily Averin
@ 2021-07-27  9:32                           ` Greg Kroah-Hartman
  2022-02-28  9:13                           ` [PATCH v2] memcg: enable accounting for tty-related objects Vasily Averin
  1 sibling, 0 replies; 101+ messages in thread
From: Greg Kroah-Hartman @ 2021-07-27  9:32 UTC (permalink / raw)
  To: Vasily Averin; +Cc: Jiri Slaby, cgroups, linux-kernel

On Tue, Jul 27, 2021 at 12:26:12PM +0300, Vasily Averin wrote:
> Jiri Slaby pointed that termios are not saved for PTYs and for other
> terminals used inside containers. Therefore accounting for saved
> termios have near to zero impact in real life scenarios.
> 
> Cc: Jiri Slaby <jirislaby@kernel.org>
> Fixes: 854dd8a572a0 ("memcg: enable accounting for tty-related objects")
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  drivers/tty/tty_io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
> index e787f6f..a6230b2 100644
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -1493,7 +1493,7 @@ void tty_save_termios(struct tty_struct *tty)
>  	/* Stash the termios data */
>  	tp = tty->driver->termios[idx];
>  	if (tp == NULL) {
> -		tp = kmalloc(sizeof(*tp), GFP_KERNEL_ACCOUNT);
> +		tp = kmalloc(sizeof(*tp), GFP_KERNEL);
>  		if (tp == NULL)
>  			return;
>  		tty->driver->termios[idx] = tp;
> -- 
> 1.8.3.1
> 

I can just drop the original patch from my tree, it has not gone into my
immutable branch yet.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays
  2021-07-27  5:33                   ` [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
@ 2021-07-27 21:39                     ` Shakeel Butt
  0 siblings, 0 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-27 21:39 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	LKML

On Mon, Jul 26, 2021 at 10:33 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> User can call select/poll system calls with a large number of assigned
> file descriptors and force kernel to allocate up to several pages of memory
> till end of these sleeping system calls. We have here long-living
> unaccounted per-task allocations.
>
> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  fs/select.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/select.c b/fs/select.c
> index 945896d..e83e563 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>                         goto out_nofds;
>
>                 alloc_size = 6 * size;
> -               bits = kvmalloc(alloc_size, GFP_KERNEL);
> +               bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);

What about the similar allocation in compat_core_sys_select()? Also
what about the allocation in poll_get_entry()?

>                 if (!bits)
>                         goto out_nofds;
>         }
> @@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
>
>                 len = min(todo, POLLFD_PER_PAGE);
>                 walk = walk->next = kmalloc(struct_size(walk, entries, len),
> -                                           GFP_KERNEL);
> +                                           GFP_KERNEL_ACCOUNT);
>                 if (!walk) {
>                         err = -ENOMEM;
>                         goto out_fds;
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v7 03/10] memcg: enable accounting for file lock caches
  2021-07-27  5:33                   ` [PATCH v7 03/10] memcg: enable accounting for file lock caches Vasily Averin
@ 2021-07-27 21:41                     ` Shakeel Butt
  0 siblings, 0 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-27 21:41 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, LKML

On Mon, Jul 26, 2021 at 10:33 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> User can create file locks for each open file and force kernel
> to allocate small but long-living objects per each open file.
>
> It makes sense to account for these objects to limit the host's memory
> consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v7 04/10] memcg: enable accounting for fasync_cache
  2021-07-27  5:33                   ` [PATCH v7 04/10] memcg: enable accounting for fasync_cache Vasily Averin
@ 2021-07-27 21:50                     ` Shakeel Butt
  0 siblings, 0 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-27 21:50 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, LKML

On Mon, Jul 26, 2021 at 10:33 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> fasync_struct is used by almost all character device drivers to set up
> the fasync queue, and for regular files by the file lease code.
> This structure is quite small but long-living and it can be assigned
> for any open file.
>
> It makes sense to account for its allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v7 05/10] memcg: enable accounting for new namesapces and struct nsproxy
  2021-07-27  5:33                   ` [PATCH v7 05/10] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
@ 2021-07-27 21:51                     ` Shakeel Butt
  0 siblings, 0 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-27 21:51 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Tejun Heo, Zefan Li,
	Thomas Gleixner, Christian Brauner, Kirill Tkhai, Serge Hallyn,
	Andrei Vagin, LKML

On Mon, Jul 26, 2021 at 10:33 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> Container admin can create new namespaces and force kernel to allocate
> up to several pages of memory for the namespaces and its associated
> structures.
> Net and uts namespaces have enabled accounting for such allocations.
> It makes sense to account for rest ones to restrict the host's memory
> consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v7 06/10] memcg: enable accounting of ipc resources
  2021-07-27  5:33                   ` [PATCH v7 06/10] memcg: enable accounting of ipc resources Vasily Averin
@ 2021-07-27 22:33                     ` Shakeel Butt
  0 siblings, 0 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-27 22:33 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexey Dobriyan,
	Dmitry Safonov, Yutian Yang, LKML

On Mon, Jul 26, 2021 at 10:33 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> When a user creates IPC objects, it forces the kernel to allocate memory
> for these long-living objects.
>
> It makes sense to account them to restrict the host's memory consumption
> from inside the memcg-limited container.
>
> This patch enables accounting for IPC shared memory segments, messages,
> semaphores and semaphore undo lists.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH v7 08/10] memcg: enable accounting for posix_timers_cache slab
  2021-07-27  5:34                   ` [PATCH v7 08/10] memcg: enable accounting for posix_timers_cache slab Vasily Averin
@ 2021-07-27 22:33                     ` Shakeel Butt
  0 siblings, 0 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-27 22:33 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, LKML

On Mon, Jul 26, 2021 at 10:34 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> A program may create multiple interval timers using timer_create().
> For each timer the kernel preallocates a "queued real-time signal".
> Consequently, the number of timers is limited by the RLIMIT_SIGPENDING
> resource limit. The allocated object is quite small, ~250 bytes,
> but even the default signal limits allow consuming up to 100 megabytes
> per user.
>
> It makes sense to account for them to limit the host's memory consumption
> from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Shakeel Butt <shakeelb@google.com>
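
The arithmetic in the quoted commit message can be sketched in userspace C.
This is only an illustration, not part of the patch: the helper names and
the SIGQUEUE_BYTES_SKETCH constant are hypothetical, and the ~250 byte
object size is the rough figure from the commit message, not a measured
value. It multiplies the current RLIMIT_SIGPENDING soft limit by that size.

```c
#include <stdint.h>
#include <sys/resource.h>

/* Assumption: ~250 bytes per preallocated signal, as quoted above. */
#define SIGQUEUE_BYTES_SKETCH 250u

/* Bytes pinned by ntimers timers under the size assumption above. */
static uint64_t timer_bytes(uint64_t ntimers)
{
	return ntimers * SIGQUEUE_BYTES_SKETCH;
}

/*
 * Estimate the worst case for the calling user from the current
 * RLIMIT_SIGPENDING soft limit.
 * Returns 0 and stores the estimate in *bytes, or -1 if the limit
 * cannot be read or is unlimited.
 */
static int sigpending_worst_case(uint64_t *bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_SIGPENDING, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return -1;

	*bytes = timer_bytes((uint64_t)rl.rlim_cur);
	return 0;
}
```

Under these assumptions a limit of 400000 pending signals corresponds to
the 100 megabytes mentioned above; the actual default limit depends on the
system.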


* Re: [PATCH v7 10/10] memcg: enable accounting for ldt_struct objects
  2021-07-27  5:34                   ` [PATCH v7 10/10] memcg: enable accounting for ldt_struct objects Vasily Averin
@ 2021-07-27 22:36                     ` Shakeel Butt
  0 siblings, 0 replies; 101+ messages in thread
From: Shakeel Butt @ 2021-07-27 22:36 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, LKML

On Mon, Jul 26, 2021 at 10:34 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> Each task can request its own LDT and force the kernel to allocate up to
> 64Kb of memory per-mm.
>
> There are legitimate workloads with hundreds of processes and there
> can be hundreds of workloads running on large machines.
> The unaccounted memory can cause isolation issues between the workloads,
> particularly on highly utilized machines.
>
> It makes sense to account for these objects to restrict the host's memory
> consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> Acked-by: Borislav Petkov <bp@suse.de>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


* [PATCH v2] memcg: enable accounting for tty-related objects
  2021-07-27  9:26                         ` [PATCH TTY] memcg: drop GFP_KERNEL_ACCOUNT use in tty_save_termios() Vasily Averin
  2021-07-27  9:32                           ` Greg Kroah-Hartman
@ 2022-02-28  9:13                           ` Vasily Averin
  1 sibling, 0 replies; 101+ messages in thread
From: Vasily Averin @ 2022-02-28  9:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Greg Kroah-Hartman, Jiri Slaby,
	linux-kernel, kernel

At each login the user forces the kernel to create a new terminal and
allocate up to ~1Kb of memory for the tty-related structures.

By default it is allowed to create up to 4096 ptys, with 1024 reserved for
the initial mount namespace only, and the settings are controlled by the
host admin.

However, this default is not enough for hosters with thousands
of containers per node. The host admin can be forced to increase it
up to NR_UNIX98_PTY_MAX = 1<<20.

By default a container is restricted by pty mount_opt.max = 1024,
but an admin inside the container can change it via remount. As a result,
one container can consume almost all allowed ptys
and allocate up to 1Gb of unaccounted memory.

This is not enough per se to trigger OOM on the host; however, it allows
the container to significantly exceed the assigned memcg limit and leads
to trouble on an over-committed node.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

v2: removed the hunk that patched tty_save_termios().
Jiri Slaby pointed out that termios are not saved for PTYs or for other
terminals used inside containers. Therefore accounting for saved
termios has near-zero impact in real-life scenarios.
The v1 patch was dropped because of this issue;
however, the hunk that patches alloc_tty_struct() is still relevant.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
  drivers/tty/tty_io.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 7e8b3bd59c7b..8fec1d8648f5 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -3088,7 +3088,7 @@ struct tty_struct *alloc_tty_struct(struct tty_driver *driver, int idx)
  {
  	struct tty_struct *tty;
  
-	tty = kzalloc(sizeof(*tty), GFP_KERNEL);
+	tty = kzalloc(sizeof(*tty), GFP_KERNEL_ACCOUNT);
  	if (!tty)
  		return NULL;
  
-- 
2.25.1
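
The 1Gb worst case in the commit message follows from multiplying the pty
ceiling by the per-terminal allocation. A minimal userspace sketch of that
arithmetic, not part of the patch: the *_SKETCH macros and the helper name
are hypothetical, and ~1Kb per tty is the rough figure from the commit
message rather than sizeof(struct tty_struct).

```c
#include <assert.h>
#include <stdint.h>

/* Ceiling quoted in the commit message: NR_UNIX98_PTY_MAX = 1<<20. */
#define PTY_MAX_SKETCH (1u << 20)

/* Assumption: ~1Kb of tty-related structures per terminal. */
#define TTY_BYTES_SKETCH 1024u

/* Compile-time check that the ceiling is the expected 1048576 ptys. */
static_assert(PTY_MAX_SKETCH == 1048576u, "1<<20 ptys expected");

/* Worst-case bytes if one container opens npty terminals. */
static uint64_t pty_worst_case_bytes(uint64_t npty, uint64_t bytes_per_tty)
{
	return npty * bytes_per_tty;
}
```

With the default mount_opt.max = 1024 the same helper gives about 1Mb per
container; only after a remount up to the ceiling does the figure reach
1Gb.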



end of thread, other threads:[~2022-02-28  9:14 UTC | newest]

Thread overview: 101+ messages
     [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
2021-04-28  6:51 ` [PATCH v4 00/16] memcg accounting from OpenVZ Vasily Averin
2021-07-15 17:11   ` Shakeel Butt
2021-07-16  4:11     ` Vasily Averin
2021-07-16 12:55       ` Shakeel Butt
2021-07-19 10:44         ` [PATCH v5 " Vasily Averin
2021-07-26 18:59           ` [PATCH v6 00/16] memcg accounting from Vasily Averin
2021-07-26 21:59             ` David Miller
2021-07-27  4:44               ` [PATCH v6 00/16] memcg accounting from OpenVZ Vasily Averin
2021-07-27  5:33                 ` [PATCH v7 00/10] " Vasily Averin
     [not found]                 ` <cover.1627362057.git.vvs@virtuozzo.com>
2021-07-27  5:33                   ` [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries Vasily Averin
2021-07-27  6:44                     ` Shakeel Butt
2021-07-27  7:21                     ` Christian Brauner
2021-07-27  5:33                   ` [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
2021-07-27 21:39                     ` Shakeel Butt
2021-07-27  5:33                   ` [PATCH v7 03/10] memcg: enable accounting for file lock caches Vasily Averin
2021-07-27 21:41                     ` Shakeel Butt
2021-07-27  5:33                   ` [PATCH v7 04/10] memcg: enable accounting for fasync_cache Vasily Averin
2021-07-27 21:50                     ` Shakeel Butt
2021-07-27  5:33                   ` [PATCH v7 05/10] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
2021-07-27 21:51                     ` Shakeel Butt
2021-07-27  5:33                   ` [PATCH v7 06/10] memcg: enable accounting of ipc resources Vasily Averin
2021-07-27 22:33                     ` Shakeel Butt
2021-07-27  5:34                   ` [PATCH v7 07/10] memcg: enable accounting for signals Vasily Averin
2021-07-27  5:34                   ` [PATCH v7 08/10] memcg: enable accounting for posix_timers_cache slab Vasily Averin
2021-07-27 22:33                     ` Shakeel Butt
2021-07-27  5:34                   ` [PATCH v7 09/10] memcg: enable accounting for tty-related objects Vasily Averin
2021-07-27  6:09                     ` Greg Kroah-Hartman
2021-07-27  6:54                     ` Jiri Slaby
2021-07-27  8:02                       ` Vasily Averin
2021-07-27  9:26                         ` [PATCH TTY] memcg: drop GFP_KERNEL_ACCOUNT use in tty_save_termios() Vasily Averin
2021-07-27  9:32                           ` Greg Kroah-Hartman
2022-02-28  9:13                           ` [PATCH v2] memcg: enable accounting for tty-related objects Vasily Averin
2021-07-27  9:30                         ` [PATCH v7 09/10] " Greg Kroah-Hartman
2021-07-27  5:34                   ` [PATCH v7 10/10] memcg: enable accounting for ldt_struct objects Vasily Averin
2021-07-27 22:36                     ` Shakeel Butt
     [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
2021-07-26 18:59             ` [PATCH v6 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
2021-07-26 19:00             ` [PATCH v6 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
2021-07-26 19:00             ` [PATCH v6 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
2021-07-26 19:00             ` [PATCH v6 04/16] memcg: enable accounting for VLAN group array Vasily Averin
2021-07-26 19:00             ` [PATCH v6 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
2021-07-26 19:00             ` [PATCH v6 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
2021-07-26 19:00             ` [PATCH v6 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
2021-07-26 19:00             ` [PATCH v6 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
2021-07-26 19:01             ` [PATCH v6 09/16] memcg: enable accounting for file lock caches Vasily Averin
2021-07-26 19:01             ` [PATCH v6 10/16] memcg: enable accounting for fasync_cache Vasily Averin
2021-07-26 19:01             ` [PATCH v6 11/16] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
2021-07-26 19:58               ` Kirill Tkhai
2021-07-26 19:01             ` [PATCH v6 12/16] memcg: enable accounting of ipc resources Vasily Averin
2021-07-26 19:01             ` [PATCH v6 13/16] memcg: enable accounting for signals Vasily Averin
2021-07-26 19:01             ` [PATCH v6 14/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
2021-07-26 19:01             ` [PATCH v6 15/16] memcg: enable accounting for tty-related objects Vasily Averin
2021-07-26 19:01             ` [PATCH v6 16/16] memcg: enable accounting for ldt_struct objects Vasily Averin
     [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
2021-07-19 10:44           ` [PATCH v5 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
2021-07-19 10:44           ` [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
2021-07-19 14:00             ` Dmitry Safonov
2021-07-19 14:22               ` Shakeel Butt
2021-07-19 14:24                 ` Dmitry Safonov
2021-07-20 19:26             ` Shakeel Butt
2021-07-26 10:23               ` Vasily Averin
2021-07-26 13:48                 ` Shakeel Butt
2021-07-26 16:53                   ` [PATCH] memcg: replace in_interrupt() by !in_task() in active_memcg() Vasily Averin
2021-07-26 16:57                     ` Shakeel Butt
2021-07-19 10:44           ` [PATCH v5 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
2021-07-19 10:44           ` [PATCH v5 04/16] memcg: enable accounting for VLAN group array Vasily Averin
2021-07-19 10:44           ` [PATCH v5 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
2021-07-19 10:44           ` [PATCH v5 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
2021-07-19 10:45           ` [PATCH v5 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
2021-07-19 10:45           ` [PATCH v5 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
2021-07-19 10:45           ` [PATCH v5 09/16] memcg: enable accounting for file lock caches Vasily Averin
2021-07-19 10:45           ` [PATCH v5 10/16] memcg: enable accounting for fasync_cache Vasily Averin
2021-07-19 10:45           ` [PATCH v5 11/16] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
2021-07-19 10:45           ` [PATCH v5 12/16] memcg: enable accounting of ipc resources Vasily Averin
2021-07-19 10:45           ` [PATCH v5 13/16] memcg: enable accounting for signals Vasily Averin
2021-07-19 17:32             ` Eric W. Biederman
2021-07-20  8:35               ` Vasily Averin
2021-07-20 14:37                 ` Shakeel Butt
2021-07-20 16:42                 ` Eric W. Biederman
2021-07-20 19:15             ` Shakeel Butt
2021-07-19 10:45           ` [PATCH v5 14/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
2021-07-19 10:45           ` [PATCH v5 15/16] memcg: enable accounting for tty-related objects Vasily Averin
2021-07-19 10:46           ` [PATCH v5 16/16] memcg: enable accounting for ldt_struct objects Vasily Averin
2021-04-28  6:51 ` [PATCH v4 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
2021-04-28  6:51 ` [PATCH v4 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
2021-04-28  6:51 ` [PATCH v4 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
2021-04-28  6:52 ` [PATCH v4 04/16] memcg: enable accounting for VLAN group array Vasily Averin
2021-04-28  6:52 ` [PATCH v4 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
2021-04-28  6:52 ` [PATCH v4 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
2021-04-28  6:52 ` [PATCH v4 07/16] memcg: enable accounting for new namesapces and struct nsproxy Vasily Averin
2021-05-07 13:45   ` Serge E. Hallyn
2021-05-07 15:03   ` Christian Brauner
2021-04-28  6:52 ` [PATCH v4 08/16] memcg: enable accounting of ipc resources Vasily Averin
2021-04-28  6:53 ` [PATCH v4 09/16] memcg: enable accounting for mnt_cache entries Vasily Averin
2021-04-28  6:53 ` [PATCH v4 10/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
2021-04-28  6:53 ` [PATCH v4 11/16] memcg: enable accounting for signals Vasily Averin
2021-04-28  6:53 ` [PATCH v4 12/16] memcg: enable accounting for posix_timers_cache slab Vasily Averin
2021-05-07 15:48   ` Thomas Gleixner
2021-04-28  6:53 ` [PATCH v4 13/16] memcg: enable accounting for file lock caches Vasily Averin
2021-04-28  6:54 ` [PATCH v4 14/16] memcg: enable accounting for fasync_cache Vasily Averin
2021-04-28  6:54 ` [PATCH v4 15/16] memcg: enable accounting for tty-related objects Vasily Averin
2021-04-28  7:38   ` Greg Kroah-Hartman
2021-04-28  6:54 ` [PATCH v4 16/16] memcg: enable accounting for ldt_struct objects Vasily Averin
