netdev.vger.kernel.org archive mirror
* [PATCH v4 00/16] memcg accounting from OpenVZ
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
@ 2021-04-28  6:51 ` Vasily Averin
  2021-07-15 17:11   ` Shakeel Butt
  2021-04-28  6:51 ` [PATCH v4 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Alexander Viro, Alexey Dobriyan, Andrei Vagin,
	Andrew Morton, Borislav Petkov, Christian Brauner, David Ahern,
	David S. Miller, Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Tejun Heo, Thomas Gleixner, Zefan Li, netdev,
	linux-kernel

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially merged
it upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory 
allocation triggered by untrusted users inside the containers.

The patch set is addressed mostly to the cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from maintainers
of the affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets 
- nsproxy and the namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and driver's fasync queues 
- tty objects
- per-mm LDT

We have incorrect, incomplete or obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We are also going to add accounting for nft; however, it is not ready yet.

We have not tested performance on upstream; however, our performance team
compared our current RHEL7-based production kernel with the corresponding
original RHEL7 kernel and reports that it performs at least no worse.

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which uses this kind of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      |  6 +++---
 drivers/tty/tty_io.c       |  4 ++--
 fs/fcntl.c                 |  3 ++-
 fs/locks.c                 |  6 ++++--
 fs/namespace.c             |  7 ++++---
 fs/select.c                |  4 ++--
 ipc/msg.c                  |  2 +-
 ipc/namespace.c            |  2 +-
 ipc/sem.c                  | 10 ++++++----
 ipc/shm.c                  |  2 +-
 kernel/cgroup/namespace.c  |  2 +-
 kernel/nsproxy.c           |  2 +-
 kernel/pid_namespace.c     |  2 +-
 kernel/signal.c            |  2 +-
 kernel/time/namespace.c    |  4 ++--
 kernel/time/posix-timers.c |  4 ++--
 kernel/user_namespace.c    |  2 +-
 mm/memcontrol.c            |  2 +-
 net/8021q/vlan.c           |  2 +-
 net/core/dev.c             |  6 +++---
 net/core/fib_rules.c       |  4 ++--
 net/core/scm.c             |  4 ++--
 net/dccp/proto.c           |  2 +-
 net/ipv4/devinet.c         |  2 +-
 net/ipv4/fib_trie.c        |  4 ++--
 net/ipv4/tcp.c             |  4 +++-
 net/ipv6/addrconf.c        |  2 +-
 net/ipv6/ip6_fib.c         |  4 ++--
 net/ipv6/route.c           |  2 +-
 net/ipv6/sit.c             |  5 +++--
 30 files changed, 58 insertions(+), 49 deletions(-)

-- 
1.8.3.1



* [PATCH v4 01/16] memcg: enable accounting for net_device and Tx/Rx queues
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
  2021-04-28  6:51 ` [PATCH v4 00/16] memcg accounting from OpenVZ Vasily Averin
@ 2021-04-28  6:51 ` Vasily Averin
  2021-04-28  6:51 ` [PATCH v4 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev, linux-kernel

A container's netadmin can create a lot of fake net devices,
then create a new net namespace and repeat this again and again.
A net device can request the creation of up to 4096 Tx and Rx queues
and force the kernel to allocate up to several tens of megabytes of memory
per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1f79b9a..87b1e80 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9994,7 +9994,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	rx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -10061,7 +10061,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	tx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -10693,7 +10693,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	p = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
-- 
1.8.3.1
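
As a rough userspace illustration of the flag change in the patch above
(not kernel code): in the kernel, GFP_KERNEL_ACCOUNT is defined as
GFP_KERNEL | __GFP_ACCOUNT, so the three kvzalloc() calls keep their
previous behaviour and only gain the accounting bit. The MODEL_* names
and bit values below are hypothetical and chosen only for this sketch;
the real definitions live in include/linux/gfp.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; not the kernel's real ones. */
typedef unsigned int gfp_model_t;

#define MODEL_GFP_KERNEL         0x01u
#define MODEL_GFP_ACCOUNT        0x02u  /* stands in for __GFP_ACCOUNT */
#define MODEL_GFP_RETRY_MAYFAIL  0x04u  /* stands in for __GFP_RETRY_MAYFAIL */

/* Mirrors the kernel relation GFP_KERNEL_ACCOUNT = GFP_KERNEL | __GFP_ACCOUNT. */
#define MODEL_GFP_KERNEL_ACCOUNT (MODEL_GFP_KERNEL | MODEL_GFP_ACCOUNT)

/* An allocation is charged to a memcg only if the accounting bit is set. */
static bool model_is_accounted(gfp_model_t flags)
{
	return (flags & MODEL_GFP_ACCOUNT) != 0;
}

/* Flags passed to kvzalloc() before the patch. */
static gfp_model_t model_flags_before(void)
{
	return MODEL_GFP_KERNEL | MODEL_GFP_RETRY_MAYFAIL;
}

/* Flags passed to kvzalloc() after the patch. */
static gfp_model_t model_flags_after(void)
{
	return MODEL_GFP_KERNEL_ACCOUNT | MODEL_GFP_RETRY_MAYFAIL;
}
```

The only difference between the two flag sets is the accounting bit,
which is why the diff is a one-token change per call site.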



* [PATCH v4 02/16] memcg: enable accounting for IP address and routing-related objects
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
  2021-04-28  6:51 ` [PATCH v4 00/16] memcg accounting from OpenVZ Vasily Averin
  2021-04-28  6:51 ` [PATCH v4 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
@ 2021-04-28  6:51 ` Vasily Averin
  2021-04-28  6:51 ` [PATCH v4 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev, linux-kernel

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rule' and ip_fib caches.

These objects can be removed manually, though usually they live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One such object is 'struct fib6_node', mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside a write_lock_bh()/write_unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to the corresponding
kmem cache. The proper memory cgroup still cannot be found due to the
incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

The obsolete in_interrupt() does not describe the real execution context properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, the newer macro should be used instead:
 in_task()	- We're in task context

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d..15108ad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1076,7 +1076,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index cd80ffe..65d8b1d 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 2e35f68da..9b90413 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a9e53f5..d56a15a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 679699e..0982b7c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2444,8 +2444,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 373d480..5dc5c68 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6510,7 +6510,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1
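
The difference between in_interrupt() and in_task() described in the
patch above can be sketched in userspace with a simplified model of
preempt_count. The bit layout below follows include/linux/preempt.h as
far as I recall it for this kernel generation, but treat the shifts and
widths as assumptions; the model_* helpers are hypothetical, and the
current->mm and PF_KTHREAD tests of memcg_kmem_bypass() are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified preempt_count layout (assumed; may differ between versions). */
#define SOFTIRQ_SHIFT  8
#define HARDIRQ_SHIFT  16
#define NMI_SHIFT      20

#define SOFTIRQ_OFFSET (1u << SOFTIRQ_SHIFT)
#define SOFTIRQ_MASK   (0xffu << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK   (0x0fu << HARDIRQ_SHIFT)
#define NMI_MASK       (0x0fu << NMI_SHIFT)

/* Disabling BH adds twice SOFTIRQ_OFFSET, so the lowest softirq bit
 * stays clear unless a softirq is actually being served. */
#define SOFTIRQ_DISABLE_OFFSET (2u * SOFTIRQ_OFFSET)

/* True in NMI, IRQ, softirq context, and also when BH is merely disabled. */
static bool model_in_interrupt(unsigned int pc)
{
	return (pc & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK)) != 0;
}

/* True in task context, including task context with BH disabled. */
static bool model_in_task(unsigned int pc)
{
	return !(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
}

/* Context part of the bypass decision before and after the patch. */
static bool model_bypass_old(unsigned int pc) { return model_in_interrupt(pc); }
static bool model_bypass_new(unsigned int pc) { return !model_in_task(pc); }
```

With pc == SOFTIRQ_DISABLE_OFFSET (task context inside a
write_lock_bh() section) the old check bypasses accounting while the new
one does not; for a softirq being served both checks still bypass.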



* [PATCH v4 03/16] memcg: enable accounting for inet_bin_bucket cache
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (2 preceding siblings ...)
  2021-04-28  6:51 ` [PATCH v4 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
@ 2021-04-28  6:51 ` Vasily Averin
  2021-04-28  6:52 ` [PATCH v4 04/16] memcg: enable accounting for VLAN group array Vasily Averin
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-04-28  6:51 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, Eric Dumazet, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev, linux-kernel

A net namespace can create up to 64K tcp and dccp ports and force the kernel
to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/dccp/proto.c | 2 +-
 net/ipv4/tcp.c   | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 6d705d9..f90d1e8 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1126,7 +1126,7 @@ static int __init dccp_init(void)
 	dccp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("dccp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index de7cc84..5817a86b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4498,7 +4498,9 @@ void __init tcp_init(void)
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
-- 
1.8.3.1



* [PATCH v4 04/16] memcg: enable accounting for VLAN group array
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (3 preceding siblings ...)
  2021-04-28  6:51 ` [PATCH v4 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
@ 2021-04-28  6:52 ` Vasily Averin
  2021-04-28  6:52 ` [PATCH v4 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
  2021-04-28  6:52 ` [PATCH v4 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
  6 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev, linux-kernel

A VLAN group array consumes up to 8 pages of memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/8021q/vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 8b644113..d0a579d4 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -67,7 +67,7 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 		return 0;
 
 	size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN;
-	array = kzalloc(size, GFP_KERNEL);
+	array = kzalloc(size, GFP_KERNEL_ACCOUNT);
 	if (array == NULL)
 		return -ENOBUFS;
 
-- 
1.8.3.1
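
The "8 pages" figure in the patch above can be reproduced with a short
calculation. The constants below mirror VLAN_N_VID,
VLAN_GROUP_ARRAY_SPLIT_PARTS and VLAN_GROUP_ARRAY_PART_LEN from
include/linux/if_vlan.h as I recall them; the MODEL_ prefix marks them
as assumptions of this sketch, as are the 4 KiB page size and the
8-byte pointers of a typical 64-bit build.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to match the kernel's VLAN constants; verify against if_vlan.h. */
#define MODEL_VLAN_N_VID                    4096
#define MODEL_VLAN_GROUP_ARRAY_SPLIT_PARTS  8
#define MODEL_VLAN_GROUP_ARRAY_PART_LEN \
	(MODEL_VLAN_N_VID / MODEL_VLAN_GROUP_ARRAY_SPLIT_PARTS)

/* Assumed page size; many architectures use 4 KiB. */
#define MODEL_PAGE_SIZE 4096u

/* Size of one array part, as computed in vlan_group_prealloc_vid(). */
static size_t model_vlan_part_bytes(void)
{
	return sizeof(void *) * MODEL_VLAN_GROUP_ARRAY_PART_LEN;
}

/* Size of all parts once every one of them has been allocated. */
static size_t model_vlan_total_bytes(void)
{
	return model_vlan_part_bytes() * MODEL_VLAN_GROUP_ARRAY_SPLIT_PARTS;
}
```

With 8-byte pointers each part is 512 * 8 = 4096 bytes, i.e. one page,
and the eight parts together take 8 pages.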



* [PATCH v4 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (4 preceding siblings ...)
  2021-04-28  6:52 ` [PATCH v4 04/16] memcg: enable accounting for VLAN group array Vasily Averin
@ 2021-04-28  6:52 ` Vasily Averin
  2021-04-28  6:52 ` [PATCH v4 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
  6 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski,
	Hideaki YOSHIFUJI, David Ahern, netdev, linux-kernel

Author: Andrey Ryabinin <aryabinin@virtuozzo.com>

The size of the ip_tunnel_prl structs allocation is controllable from
user-space, thus it's better to avoid spam in dmesg if the allocation fails.
Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
accounting. The allocation is temporary and limited to 4GB.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/sit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 9fdccf0..2ba147c 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -320,7 +320,7 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 	 * we try harder to allocate.
 	 */
 	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
-		kcalloc(cmax, sizeof(*kp), GFP_KERNEL | __GFP_NOWARN) :
+		kcalloc(cmax, sizeof(*kp), GFP_KERNEL_ACCOUNT | __GFP_NOWARN) :
 		NULL;
 
 	rcu_read_lock();
@@ -333,7 +333,8 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 		 * For root users, retry allocating enough memory for
 		 * the answer.
 		 */
-		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC);
+		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC | __GFP_ACCOUNT |
+					      __GFP_NOWARN);
 		if (!kp) {
 			ret = -ENOMEM;
 			goto out;
-- 
1.8.3.1



* [PATCH v4 06/16] memcg: enable accounting for scm_fp_list objects
       [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
                   ` (5 preceding siblings ...)
  2021-04-28  6:52 ` [PATCH v4 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
@ 2021-04-28  6:52 ` Vasily Averin
  6 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-04-28  6:52 UTC (permalink / raw)
  To: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner, Vladimir Davydov
  Cc: Roman Gushchin, David S. Miller, Jakub Kicinski, netdev, linux-kernel

Unix sockets allow sending file descriptors via SCM_RIGHTS messages.
Each such send call forces the kernel to allocate up to 2KB of memory for
struct scm_fp_list.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/scm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index 8156d4f..e837e4f 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -79,7 +79,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 
 	if (!fpl)
 	{
-		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL);
+		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL_ACCOUNT);
 		if (!fpl)
 			return -ENOMEM;
 		*fplp = fpl;
@@ -348,7 +348,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		return NULL;
 
 	new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),
-			  GFP_KERNEL);
+			  GFP_KERNEL_ACCOUNT);
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
-- 
1.8.3.1
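
The "up to 2KB" figure in the patch above can be checked with a
userspace stand-in for struct scm_fp_list. The layout below follows the
kernel structure as I recall it (two shorts, a user pointer, and an
array of SCM_MAX_FD file pointers, with SCM_MAX_FD being 253); the
model_ names are hypothetical and void * replaces the kernel pointer
types.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to match the kernel's SCM_MAX_FD; verify against include/net/scm.h. */
#define MODEL_SCM_MAX_FD 253

/* Userspace stand-in for struct scm_fp_list. */
struct model_scm_fp_list {
	short count;
	short max;
	void *user;                  /* stands in for struct user_struct * */
	void *fp[MODEL_SCM_MAX_FD];  /* stands in for struct file * */
};

/* Bytes requested by the kmalloc() in scm_fp_copy(): the full structure. */
static size_t model_fpl_full_size(void)
{
	return sizeof(struct model_scm_fp_list);
}

/* Bytes copied by the kmemdup() in scm_fp_dup(): the header plus only
 * the used slots, i.e. offsetof(struct scm_fp_list, fp[count]). */
static size_t model_fpl_dup_size(int count)
{
	return offsetof(struct model_scm_fp_list, fp) +
	       (size_t)count * sizeof(void *);
}
```

On a 64-bit build the full structure comes to 2040 bytes, just under
2KB, while a duplicate carrying a single descriptor needs only a few
tens of bytes.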



* Re: [PATCH v4 00/16] memcg accounting from OpenVZ
  2021-04-28  6:51 ` [PATCH v4 00/16] memcg accounting from OpenVZ Vasily Averin
@ 2021-07-15 17:11   ` Shakeel Butt
  2021-07-16  4:11     ` Vasily Averin
  0 siblings, 1 reply; 32+ messages in thread
From: Shakeel Butt @ 2021-07-15 17:11 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Cgroups, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Roman Gushchin, Alexander Viro, Alexey Dobriyan, Andrei Vagin,
	Andrew Morton, Borislav Petkov, Christian Brauner, David Ahern,
	David S. Miller, Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Tejun Heo, Thomas Gleixner, Zefan Li, netdev, LKML

On Tue, Apr 27, 2021 at 11:51 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels.
> Initially we used our own accounting subsystem, then partially committed
> it to upstream, and a few years ago switched to cgroups v1.
> Now we're rebasing again, revising our old patches and trying to push
> them upstream.
>
> We try to protect the host system from any misuse of kernel memory
> allocation triggered by untrusted users inside the containers.
>
> Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
> list, though I would be very grateful for any comments from maintainersi
> of affected subsystems or other people added in cc:
>
> Compared to the upstream, we additionally account the following kernel objects:
> - network devices and its Tx/Rx queues
> - ipv4/v6 addresses and routing-related objects
> - inet_bind_bucket cache objects
> - VLAN group arrays
> - ipv6/sit: ip_tunnel_prl
> - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
> - nsproxy and namespace objects itself
> - IPC objects: semaphores, message queues and share memory segments
> - mounts
> - pollfd and select bits arrays
> - signals and posix timers
> - file lock
> - fasync_struct used by the file lease code and driver's fasync queues
> - tty objects
> - per-mm LDT
>
> We have an incorrect/incomplete/obsoleted accounting for few other kernel
> objects: sk_filter, af_packets, netlink and xt_counters for iptables.
> They require rework and probably will be dropped at all.
>
> Also we're going to add an accounting for nft, however it is not ready yet.
>
> We have not tested performance on upstream, however, our performance team
> compares our current RHEL7-based production kernel and reports that
> they are at least not worse as the according original RHEL7 kernel.
>

Hi Vasily,

What's the status of this series? I see a couple patches did get
acked/reviewed. Can you please re-send the series with updated ack
tags?

thanks,
Shakeel


* Re: [PATCH v4 00/16] memcg accounting from OpenVZ
  2021-07-15 17:11   ` Shakeel Butt
@ 2021-07-16  4:11     ` Vasily Averin
  2021-07-16 12:55       ` Shakeel Butt
  0 siblings, 1 reply; 32+ messages in thread
From: Vasily Averin @ 2021-07-16  4:11 UTC (permalink / raw)
  To: Shakeel Butt, Tejun Heo
  Cc: Cgroups, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Roman Gushchin, Alexander Viro, Alexey Dobriyan, Andrei Vagin,
	Andrew Morton, Borislav Petkov, Christian Brauner, David Ahern,
	David S. Miller, Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Tejun Heo, Thomas Gleixner, Zefan Li, netdev, LKML

On 7/15/21 8:11 PM, Shakeel Butt wrote:
> On Tue, Apr 27, 2021 at 11:51 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels.
>> Initially we used our own accounting subsystem, then partially committed
>> it to upstream, and a few years ago switched to cgroups v1.
>> Now we're rebasing again, revising our old patches and trying to push
>> them upstream.
>>
>> We try to protect the host system from any misuse of kernel memory
>> allocation triggered by untrusted users inside the containers.
>>
>> Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
>> list, though I would be very grateful for any comments from maintainersi
>> of affected subsystems or other people added in cc:
>>
>> Compared to the upstream, we additionally account the following kernel objects:
>> - network devices and its Tx/Rx queues
>> - ipv4/v6 addresses and routing-related objects
>> - inet_bind_bucket cache objects
>> - VLAN group arrays
>> - ipv6/sit: ip_tunnel_prl
>> - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
>> - nsproxy and namespace objects itself
>> - IPC objects: semaphores, message queues and share memory segments
>> - mounts
>> - pollfd and select bits arrays
>> - signals and posix timers
>> - file lock
>> - fasync_struct used by the file lease code and driver's fasync queues
>> - tty objects
>> - per-mm LDT
>>
>> We have an incorrect/incomplete/obsoleted accounting for few other kernel
>> objects: sk_filter, af_packets, netlink and xt_counters for iptables.
>> They require rework and probably will be dropped at all.
>>
>> Also we're going to add an accounting for nft, however it is not ready yet.
>>
>> We have not tested performance on upstream, however, our performance team
>> compares our current RHEL7-based production kernel and reports that
>> they are at least not worse as the according original RHEL7 kernel.
> 
> Hi Vasily,
> 
> What's the status of this series? I see a couple patches did get
> acked/reviewed. Can you please re-send the series with updated ack
> tags?

Technically my patches do not have any NAKs. In practice they are still not merged.
I expected Michal to push them, but he advised me to go through the subsystem maintainers.
I've asked Tejun to pick up the whole patch set and I'm waiting for his feedback right now.

I can resend the patch set once again, with the collected approvals and rebased to v5.14-rc1.
However, I do not understand how that helps to push them if the patches should be processed
through subsystem maintainers. As far as I understand, I'll need to split this patch set into
per-subsystem pieces and send them to the corresponding maintainers.

Thank you,
	Vasily Averin.


* Re: [PATCH v4 00/16] memcg accounting from OpenVZ
  2021-07-16  4:11     ` Vasily Averin
@ 2021-07-16 12:55       ` Shakeel Butt
  2021-07-19 10:44         ` [PATCH v5 " Vasily Averin
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
  0 siblings, 2 replies; 32+ messages in thread
From: Shakeel Butt @ 2021-07-16 12:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro,
	Alexey Dobriyan, Andrei Vagin, Andrew Morton, Borislav Petkov,
	Christian Brauner, David Ahern, David S. Miller, Dmitry Safonov,
	Eric Dumazet, Eric W. Biederman, Greg Kroah-Hartman,
	Hideaki YOSHIFUJI, H. Peter Anvin, Ingo Molnar, Jakub Kicinski,
	J. Bruce Fields, Jeff Layton, Jens Axboe, Jiri Slaby,
	Kirill Tkhai, Oleg Nesterov, Serge Hallyn, Thomas Gleixner,
	Zefan Li, netdev, LKML

On Thu, Jul 15, 2021 at 9:11 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 7/15/21 8:11 PM, Shakeel Butt wrote:
> > On Tue, Apr 27, 2021 at 11:51 PM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >> OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels.
> >> Initially we used our own accounting subsystem, then partially committed
> >> it to upstream, and a few years ago switched to cgroups v1.
> >> Now we're rebasing again, revising our old patches and trying to push
> >> them upstream.
> >>
> >> We try to protect the host system from any misuse of kernel memory
> >> allocation triggered by untrusted users inside the containers.
> >>
> >> Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
> >> list, though I would be very grateful for any comments from maintainersi
> >> of affected subsystems or other people added in cc:
> >>
> >> Compared to the upstream, we additionally account the following kernel objects:
> >> - network devices and its Tx/Rx queues
> >> - ipv4/v6 addresses and routing-related objects
> >> - inet_bind_bucket cache objects
> >> - VLAN group arrays
> >> - ipv6/sit: ip_tunnel_prl
> >> - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
> >> - nsproxy and namespace objects itself
> >> - IPC objects: semaphores, message queues and share memory segments
> >> - mounts
> >> - pollfd and select bits arrays
> >> - signals and posix timers
> >> - file lock
> >> - fasync_struct used by the file lease code and driver's fasync queues
> >> - tty objects
> >> - per-mm LDT
> >>
> >> We have an incorrect/incomplete/obsoleted accounting for few other kernel
> >> objects: sk_filter, af_packets, netlink and xt_counters for iptables.
> >> They require rework and probably will be dropped at all.
> >>
> >> Also we're going to add an accounting for nft, however it is not ready yet.
> >>
> >> We have not tested performance on upstream, however, our performance team
> >> compares our current RHEL7-based production kernel and reports that
> >> they are at least not worse as the according original RHEL7 kernel.
> >
> > Hi Vasily,
> >
> > What's the status of this series? I see a couple patches did get
> > acked/reviewed. Can you please re-send the series with updated ack
> > tags?
>
> Technically my patches does not have any NAKs. Practically they are still them merged.
> I've expected Michal will push it, but he advised me to push subsystem maintainers.
> I've asked Tejun to pick up the whole patch set and I'm waiting for his feedback right now.
>
> I can resend patch set once again, with collected approval and with rebase to v5.14-rc1.
> However I do not understand how it helps to push them if patches should be processed through
> subsystem maintainers. As far as I understand I'll need to split this patch set into
> per-subsystem pieces and sent them to corresponded maintainers.
>

Usually these kinds of patches (adding memcg accounting) go through the mm
tree, but if there are no dependencies between the patches and there is a
consensus that each subsystem maintainer picks the corresponding patch,
then that is fine too.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v5 00/16] memcg accounting from OpenVZ
  2021-07-16 12:55       ` Shakeel Butt
@ 2021-07-19 10:44         ` Vasily Averin
  2021-07-26 18:59           ` [PATCH v6 00/16] memcg accounting from Vasily Averin
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
  1 sibling, 2 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Andrew Morton,
	Borislav Petkov, Christian Brauner, David Ahern, David S. Miller,
	Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Thomas Gleixner, Zefan Li, netdev, linux-fsdevel,
	LKML

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux
kernels. Initially we used our own accounting subsystem, then partially
committed it to upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory 
allocation triggered by untrusted users inside the containers.

The patch set is addressed mostly to the cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from maintainers
of the affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and the namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and drivers' fasync queues
- tty objects
- per-mm LDT

We have incorrect/incomplete/obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We're also going to add accounting for nft, however it is not ready yet.

We have not tested performance on upstream; however, our performance team
compared our current RHEL7-based production kernel against the corresponding
original RHEL7 kernel and reports that it performs at least no worse.

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which uses this kind of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 mm/memcontrol.c            | 2 +-
 net/8021q/vlan.c           | 2 +-
 net/core/dev.c             | 6 +++---
 net/core/fib_rules.c       | 4 ++--
 net/core/scm.c             | 4 ++--
 net/dccp/proto.c           | 2 +-
 net/ipv4/devinet.c         | 2 +-
 net/ipv4/fib_trie.c        | 4 ++--
 net/ipv4/tcp.c             | 4 +++-
 net/ipv6/addrconf.c        | 2 +-
 net/ipv6/ip6_fib.c         | 4 ++--
 net/ipv6/route.c           | 2 +-
 net/ipv6/sit.c             | 5 +++--
 30 files changed, 57 insertions(+), 49 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v5 01/16] memcg: enable accounting for net_device and Tx/Rx queues
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
@ 2021-07-19 10:44           ` Vasily Averin
  2021-07-19 10:44           ` [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

A container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat this again and again.
A net device can request the creation of up to 4096 Tx and Rx queues,
forcing the kernel to allocate up to several tens of megabytes of memory
per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c253c2a..e9aa1e4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10100,7 +10100,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	rx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -10167,7 +10167,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	tx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -10807,7 +10807,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	p = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
  2021-07-19 10:44           ` [PATCH v5 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
@ 2021-07-19 10:44           ` Vasily Averin
  2021-07-19 14:00             ` Dmitry Safonov
  2021-07-20 19:26             ` Shakeel Butt
  2021-07-19 10:44           ` [PATCH v5 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
                             ` (3 subsequent siblings)
  5 siblings, 2 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice related kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and ip_fib caches.

These objects can be manually removed, though they usually live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One such object is the 'struct fib6_node', mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
kmem cache. The proper memory cgroup still cannot be found due to the
incorrect 'in_interrupt()' check used in memcg_kmem_bypass().

The obsolete in_interrupt() does not describe the real execution context properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, the newer macro should be used instead:
 in_task()	- We're in task context

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae1f5d0..1bbf239 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index a9f9379..79df7cd 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 73721a4..d38124b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3bf685f..8eaeade 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 2d650dc..a8f118e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2449,8 +2449,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7b756a7..5f7286a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6638,7 +6638,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 03/16] memcg: enable accounting for inet_bin_bucket cache
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
  2021-07-19 10:44           ` [PATCH v5 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
  2021-07-19 10:44           ` [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
@ 2021-07-19 10:44           ` Vasily Averin
  2021-07-19 10:44           ` [PATCH v5 04/16] memcg: enable accounting for VLAN group array Vasily Averin
                             ` (2 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A net namespace can create up to 64K tcp and dccp ports and force the kernel
to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/dccp/proto.c | 2 +-
 net/ipv4/tcp.c   | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 7eb0fb2..abb5c59 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1126,7 +1126,7 @@ static int __init dccp_init(void)
 	dccp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("dccp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d5ab5f2..5c0605e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4509,7 +4509,9 @@ void __init tcp_init(void)
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 04/16] memcg: enable accounting for VLAN group array
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (2 preceding siblings ...)
  2021-07-19 10:44           ` [PATCH v5 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
@ 2021-07-19 10:44           ` Vasily Averin
  2021-07-19 10:44           ` [PATCH v5 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
  2021-07-19 10:44           ` [PATCH v5 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
  5 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

VLAN group arrays consume up to 8 pages of memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/8021q/vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 4cdf841..55275ef 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -67,7 +67,7 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 		return 0;
 
 	size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN;
-	array = kzalloc(size, GFP_KERNEL);
+	array = kzalloc(size, GFP_KERNEL_ACCOUNT);
 	if (array == NULL)
 		return -ENOBUFS;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (3 preceding siblings ...)
  2021-07-19 10:44           ` [PATCH v5 04/16] memcg: enable accounting for VLAN group array Vasily Averin
@ 2021-07-19 10:44           ` Vasily Averin
  2021-07-19 10:44           ` [PATCH v5 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
  5 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

Author: Andrey Ryabinin <aryabinin@virtuozzo.com>

The size of the ip_tunnel_prl structs allocation is controllable from
user-space, thus it's better to avoid spam in dmesg if the allocation fails.
Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
accounting. The allocation is temporary and limited to 4GB.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/sit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index df5bea8..33adc12 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -321,7 +321,7 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 	 * we try harder to allocate.
 	 */
 	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
-		kcalloc(cmax, sizeof(*kp), GFP_KERNEL | __GFP_NOWARN) :
+		kcalloc(cmax, sizeof(*kp), GFP_KERNEL_ACCOUNT | __GFP_NOWARN) :
 		NULL;
 
 	rcu_read_lock();
@@ -334,7 +334,8 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 		 * For root users, retry allocating enough memory for
 		 * the answer.
 		 */
-		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC);
+		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC | __GFP_ACCOUNT |
+					      __GFP_NOWARN);
 		if (!kp) {
 			ret = -ENOMEM;
 			goto out;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 06/16] memcg: enable accounting for scm_fp_list objects
       [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
                             ` (4 preceding siblings ...)
  2021-07-19 10:44           ` [PATCH v5 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
@ 2021-07-19 10:44           ` Vasily Averin
  5 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, netdev, linux-kernel

Unix sockets allow sending file descriptors via SCM_RIGHTS type messages.
Each such send call forces the kernel to allocate up to 2KB of memory for
struct scm_fp_list.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/scm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index ae3085d..5c356f0 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -79,7 +79,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 
 	if (!fpl)
 	{
-		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL);
+		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL_ACCOUNT);
 		if (!fpl)
 			return -ENOMEM;
 		*fplp = fpl;
@@ -355,7 +355,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		return NULL;
 
 	new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),
-			  GFP_KERNEL);
+			  GFP_KERNEL_ACCOUNT);
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-19 10:44           ` [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
@ 2021-07-19 14:00             ` Dmitry Safonov
  2021-07-19 14:22               ` Shakeel Butt
  2021-07-20 19:26             ` Shakeel Butt
  1 sibling, 1 reply; 32+ messages in thread
From: Dmitry Safonov @ 2021-07-19 14:00 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Network Development, open list

Hi Vasily,

On Mon, 19 Jul 2021 at 11:45, Vasily Averin <vvs@virtuozzo.com> wrote:
[..]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae1f5d0..1bbf239 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>                 return false;
>
>         /* Memcg to charge can't be determined. */
> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>                 return true;

This seems to do two separate things in one patch.
Probably, it's better to separate them.
(I may miss how route changes are related to more generic
__alloc_pages() change)

Thanks,
           Dmitry

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-19 14:00             ` Dmitry Safonov
@ 2021-07-19 14:22               ` Shakeel Butt
  2021-07-19 14:24                 ` Dmitry Safonov
  0 siblings, 1 reply; 32+ messages in thread
From: Shakeel Butt @ 2021-07-19 14:22 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Vasily Averin, Andrew Morton, Cgroups, Michal Hocko,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Network Development, open list

On Mon, Jul 19, 2021 at 7:00 AM Dmitry Safonov <0x7f454c46@gmail.com> wrote:
>
> Hi Vasily,
>
> On Mon, 19 Jul 2021 at 11:45, Vasily Averin <vvs@virtuozzo.com> wrote:
> [..]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ae1f5d0..1bbf239 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
> >                 return false;
> >
> >         /* Memcg to charge can't be determined. */
> > -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> > +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
> >                 return true;
>
> This seems to do two separate things in one patch.
> Probably, it's better to separate them.
> (I may miss how route changes are related to more generic
> __alloc_pages() change)
>

It was requested to squash them together in some previous versions.
https://lore.kernel.org/linux-mm/YEiUIf0old+AZssa@dhcp22.suse.cz/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-19 14:22               ` Shakeel Butt
@ 2021-07-19 14:24                 ` Dmitry Safonov
  0 siblings, 0 replies; 32+ messages in thread
From: Dmitry Safonov @ 2021-07-19 14:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vasily Averin, Andrew Morton, Cgroups, Michal Hocko,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Network Development, open list

On 7/19/21 3:22 PM, Shakeel Butt wrote:
> On Mon, Jul 19, 2021 at 7:00 AM Dmitry Safonov <0x7f454c46@gmail.com> wrote:
>>
>> Hi Vasily,
>>
>> On Mon, 19 Jul 2021 at 11:45, Vasily Averin <vvs@virtuozzo.com> wrote:
>> [..]
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index ae1f5d0..1bbf239 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>>>                 return false;
>>>
>>>         /* Memcg to charge can't be determined. */
>>> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
>>> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>>>                 return true;
>>
>> This seems to do two separate things in one patch.
>> Probably, it's better to separate them.
>> (I may miss how route changes are related to more generic
>> __alloc_pages() change)
>>
> 
> It was requested to squash them together in some previous versions.
> https://lore.kernel.org/linux-mm/YEiUIf0old+AZssa@dhcp22.suse.cz/
> 

Ah, alright, never mind then.

Thanks,
           Dmitry

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-19 10:44           ` [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
  2021-07-19 14:00             ` Dmitry Safonov
@ 2021-07-20 19:26             ` Shakeel Butt
  2021-07-26 10:23               ` Vasily Averin
  1 sibling, 1 reply; 32+ messages in thread
From: Shakeel Butt @ 2021-07-20 19:26 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev, LKML

On Mon, Jul 19, 2021 at 3:44 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> An netadmin inside container can use 'ip a a' and 'ip r a'
> to assign a large number of ipv4/ipv6 addresses and routing entries
> and force kernel to allocate megabytes of unaccounted memory
> for long-lived per-netdevice related kernel objects:
> 'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
> 'struct rt6_info', 'struct fib_rules' and ip_fib caches.
>
> These objects can be manually removed, though usually they lives
> in memory till destroy of its net namespace.
>
> It makes sense to account for them to restrict the host's memory
> consumption from inside the memcg-limited container.
>
> One of such objects is the 'struct fib6_node' mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
>
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
>
> In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
> kmem cache. The proper memory cgroup still cannot be found due to the
> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
>
> Obsoleted in_interrupt() does not describe real execution context properly.
> From include/linux/preempt.h:
>
>  The following macros are deprecated and should not be used in new code:
>  in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
>
> To verify the current execution context new macro should be used instead:
>  in_task()      - We're in task context
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/memcontrol.c      | 2 +-
>  net/core/fib_rules.c | 4 ++--
>  net/ipv4/devinet.c   | 2 +-
>  net/ipv4/fib_trie.c  | 4 ++--
>  net/ipv6/addrconf.c  | 2 +-
>  net/ipv6/ip6_fib.c   | 4 ++--
>  net/ipv6/route.c     | 2 +-
>  7 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae1f5d0..1bbf239 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>                 return false;
>
>         /* Memcg to charge can't be determined. */
> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>                 return true;
>
>         return false;

Can you please also change in_interrupt() in active_memcg() as well?
There are other unrelated in_interrupt() in that file but the one in
active_memcg() should be coupled with this change.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-20 19:26             ` Shakeel Butt
@ 2021-07-26 10:23               ` Vasily Averin
  2021-07-26 13:48                 ` Shakeel Butt
  0 siblings, 1 reply; 32+ messages in thread
From: Vasily Averin @ 2021-07-26 10:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev, LKML

On 7/20/21 10:26 PM, Shakeel Butt wrote:
> On Mon, Jul 19, 2021 at 3:44 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> An netadmin inside container can use 'ip a a' and 'ip r a'
>> to assign a large number of ipv4/ipv6 addresses and routing entries
>> and force kernel to allocate megabytes of unaccounted memory
>> for long-lived per-netdevice related kernel objects:
>> 'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
>> 'struct rt6_info', 'struct fib_rules' and ip_fib caches.
>>
>> These objects can be manually removed, though usually they lives
>> in memory till destroy of its net namespace.
>>
>> It makes sense to account for them to restrict the host's memory
>> consumption from inside the memcg-limited container.
>>
>> One of such objects is the 'struct fib6_node' mostly allocated in
>> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
>>
>>  write_lock_bh(&table->tb6_lock);
>>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>>  write_unlock_bh(&table->tb6_lock);
>>
>> In this case it is not enough to simply add SLAB_ACCOUNT to corresponding
>> kmem cache. The proper memory cgroup still cannot be found due to the
>> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
>>
>> Obsoleted in_interrupt() does not describe real execution context properly.
>> From include/linux/preempt.h:
>>
>>  The following macros are deprecated and should not be used in new code:
>>  in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled
>>
>> To verify the current execution context, the newer macro should be used instead:
>>  in_task()      - We're in task context
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  mm/memcontrol.c      | 2 +-
>>  net/core/fib_rules.c | 4 ++--
>>  net/ipv4/devinet.c   | 2 +-
>>  net/ipv4/fib_trie.c  | 4 ++--
>>  net/ipv6/addrconf.c  | 2 +-
>>  net/ipv6/ip6_fib.c   | 4 ++--
>>  net/ipv6/route.c     | 2 +-
>>  7 files changed, 10 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ae1f5d0..1bbf239 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>>                 return false;
>>
>>         /* Memcg to charge can't be determined. */
>> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
>> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
>>                 return true;
>>
>>         return false;
> 
> Can you please also change in_interrupt() in active_memcg() as well?
> There are other unrelated in_interrupt() in that file but the one in
> active_memcg() should be coupled with this change.

Could you please elaborate?
From my point of view, active_memcg() is paired with set_active_memcg() and is not related to this case.
active_memcg() uses the memcg that was set by set_active_memcg(), taken either from the int_active_memcg
per-cpu pointer or from the current->active_memcg pointer.
I agree that with BH disabled it is incorrect to use int_active_memcg;
we can still use current->active_memcg. However, this is not a problem:
the memcg will be properly provided in both cases.

I think it's better to fix set_active_memcg()/active_memcg() in a separate patch.

Have I missed something, perhaps?

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects
  2021-07-26 10:23               ` Vasily Averin
@ 2021-07-26 13:48                 ` Shakeel Butt
  0 siblings, 0 replies; 32+ messages in thread
From: Shakeel Butt @ 2021-07-26 13:48 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev, LKML

On Mon, Jul 26, 2021 at 3:23 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
[...]
> >
> > Can you please also change in_interrupt() in active_memcg() as well?
> > There are other unrelated in_interrupt() in that file but the one in
> > active_memcg() should be coupled with this change.
>
> Could you please elaborate?
> From my point of view, active_memcg() is paired with set_active_memcg() and is not related to this case.
> active_memcg() uses the memcg that was set by set_active_memcg(), taken either from the int_active_memcg
> per-cpu pointer or from the current->active_memcg pointer.
> I agree that with BH disabled it is incorrect to use int_active_memcg;
> we can still use current->active_memcg. However, this is not a problem:
> the memcg will be properly provided in both cases.
>
> I think it's better to fix set_active_memcg()/active_memcg() in a separate patch.
>
> Have I missed something, perhaps?
>

No, you are right. That should be a separate patch.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v6 00/16] memcg accounting from
  2021-07-19 10:44         ` [PATCH v5 " Vasily Averin
@ 2021-07-26 18:59           ` Vasily Averin
  2021-07-26 21:59             ` David Miller
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
  1 sibling, 1 reply; 32+ messages in thread
From: Vasily Averin @ 2021-07-26 18:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Andrew Morton,
	Borislav Petkov, Christian Brauner, David Ahern, David S. Miller,
	Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Thomas Gleixner, Zefan Li, netdev, linux-fsdevel,
	LKML

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially committed
it to upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory
allocation triggered by untrusted users inside the containers.

This patch set is addressed mostly to the cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from maintainers
of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and the namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and drivers' fasync queues
- tty objects
- per-mm LDT

We have incorrect, incomplete or obsolete accounting for a few other kernel
objects: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We are also going to add accounting for nft; however, it is not ready yet.

We have not tested performance on upstream. However, our performance team
compared our current RHEL7-based production kernel with the corresponding
original RHEL7 kernel and reports that ours is at least not worse.

v6:
- improved description of "memcg: enable accounting for signals"
  according to Eric Biederman's wishes
- added Reviewed-by tag from Shakeel Butt on the same patch

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kind of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which uses this kind of memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 mm/memcontrol.c            | 2 +-
 net/8021q/vlan.c           | 2 +-
 net/core/dev.c             | 6 +++---
 net/core/fib_rules.c       | 4 ++--
 net/core/scm.c             | 4 ++--
 net/dccp/proto.c           | 2 +-
 net/ipv4/devinet.c         | 2 +-
 net/ipv4/fib_trie.c        | 4 ++--
 net/ipv4/tcp.c             | 4 +++-
 net/ipv6/addrconf.c        | 2 +-
 net/ipv6/ip6_fib.c         | 4 ++--
 net/ipv6/route.c           | 2 +-
 net/ipv6/sit.c             | 5 +++--
 30 files changed, 57 insertions(+), 49 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v6 01/16] memcg: enable accounting for net_device and Tx/Rx queues
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
@ 2021-07-26 18:59             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
                               ` (4 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-26 18:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

A container netadmin can create a lot of fake net devices,
then create a new net namespace and repeat this again and again.
A net device can request the creation of up to 4096 Tx and Rx queues,
forcing the kernel to allocate up to several tens of megabytes of memory
per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c253c2a..e9aa1e4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10100,7 +10100,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	rx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -10167,7 +10167,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	tx = kvzalloc(sz, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -10807,7 +10807,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+	p = kvzalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 02/16] memcg: enable accounting for IP address and routing-related objects
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
  2021-07-26 18:59             ` [PATCH v6 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
                               ` (3 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A netadmin inside a container can use 'ip a a' and 'ip r a'
to assign a large number of ipv4/ipv6 addresses and routing entries
and force the kernel to allocate megabytes of unaccounted memory
for long-lived per-netdevice kernel objects:
'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
'struct rt6_info', 'struct fib_rules' and ip_fib caches.

These objects can be removed manually, though usually they live
in memory until their net namespace is destroyed.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

One such object is 'struct fib6_node', mostly allocated in
net/ipv6/route.c::__ip6_ins_rt() inside a
write_lock_bh()/write_unlock_bh() section:

 write_lock_bh(&table->tb6_lock);
 err = fib6_add(&table->tb6_root, rt, info, mxc);
 write_unlock_bh(&table->tb6_lock);

In this case it is not enough to simply add SLAB_ACCOUNT to the
corresponding kmem cache. The proper memory cgroup still cannot be found
because of the incorrect 'in_interrupt()' check used in memcg_kmem_bypass():
in_interrupt() is also true in task context when BH is disabled.

The obsolete in_interrupt() does not describe the real execution context
properly.
From include/linux/preempt.h:

 The following macros are deprecated and should not be used in new code:
 in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled

To verify the current execution context, the newer macro should be used instead:
 in_task()	- We're in task context
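
The difference between the two checks can be sketched with a small
user-space model. This is only an illustration, not kernel code: all
model_* names are hypothetical, and the bit layout (8 preempt bits,
8 softirq bits, 4 hardirq bits, 4 NMI bits, with BH disabling adding
twice the softirq offset) is assumed to match a non-RT kernel of this era.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed preempt_count() field layout; hypothetical model, not kernel code. */
#define MODEL_SOFTIRQ_SHIFT	8
#define MODEL_HARDIRQ_SHIFT	16
#define MODEL_NMI_SHIFT		20

#define MODEL_SOFTIRQ_MASK	(0xffu << MODEL_SOFTIRQ_SHIFT)
#define MODEL_HARDIRQ_MASK	(0x0fu << MODEL_HARDIRQ_SHIFT)
#define MODEL_NMI_MASK		(0x0fu << MODEL_NMI_SHIFT)

#define MODEL_SOFTIRQ_OFFSET		(1u << MODEL_SOFTIRQ_SHIFT)
#define MODEL_SOFTIRQ_DISABLE_OFFSET	(2u * MODEL_SOFTIRQ_OFFSET)

/* Models local_bh_disable(): bumps the softirq field by the "disable" step. */
static inline unsigned int model_bh_disable(unsigned int count)
{
	/* Guard against overflowing the softirq field in this toy model. */
	assert((count & MODEL_SOFTIRQ_MASK) + MODEL_SOFTIRQ_DISABLE_OFFSET
	       <= MODEL_SOFTIRQ_MASK);
	return count + MODEL_SOFTIRQ_DISABLE_OFFSET;
}

/* Models entering softirq processing: sets the lowest softirq bit. */
static inline unsigned int model_softirq_enter(unsigned int count)
{
	return count + MODEL_SOFTIRQ_OFFSET;
}

/* Old check: true in NMI, IRQ, softirq context, or merely with BH disabled. */
static inline bool model_in_interrupt(unsigned int count)
{
	return (count & (MODEL_NMI_MASK | MODEL_HARDIRQ_MASK |
			 MODEL_SOFTIRQ_MASK)) != 0;
}

/* New check: false only when really serving NMI, hard IRQ or softirq. */
static inline bool model_in_task(unsigned int count)
{
	bool in_nmi = (count & MODEL_NMI_MASK) != 0;
	bool in_hardirq = (count & MODEL_HARDIRQ_MASK) != 0;
	bool serving_softirq = (count & MODEL_SOFTIRQ_OFFSET) != 0;

	return !(in_nmi || in_hardirq || serving_softirq);
}

/* The interrupt-context part of the bypass decision, before and after. */
static inline bool model_bypass_old(unsigned int count)
{
	return model_in_interrupt(count);
}

static inline bool model_bypass_new(unsigned int count)
{
	return !model_in_task(count);
}
```

In this model, a task that has only disabled BH is bypassed by the old
check but charged by the new one, while real softirq processing is
bypassed by both.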

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c      | 2 +-
 net/core/fib_rules.c | 4 ++--
 net/ipv4/devinet.c   | 2 +-
 net/ipv4/fib_trie.c  | 4 ++--
 net/ipv6/addrconf.c  | 2 +-
 net/ipv6/ip6_fib.c   | 4 ++--
 net/ipv6/route.c     | 2 +-
 7 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae1f5d0..1bbf239 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -968,7 +968,7 @@ static __always_inline bool memcg_kmem_bypass(void)
 		return false;
 
 	/* Memcg to charge can't be determined. */
-	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+	if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))
 		return true;
 
 	return false;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index a9f9379..79df7cd 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -57,7 +57,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 {
 	struct fib_rule *r;
 
-	r = kzalloc(ops->rule_size, GFP_KERNEL);
+	r = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (r == NULL)
 		return -ENOMEM;
 
@@ -541,7 +541,7 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto errout;
 	}
 
-	nlrule = kzalloc(ops->rule_size, GFP_KERNEL);
+	nlrule = kzalloc(ops->rule_size, GFP_KERNEL_ACCOUNT);
 	if (!nlrule) {
 		err = -ENOMEM;
 		goto errout;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 73721a4..d38124b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -215,7 +215,7 @@ static void devinet_sysctl_unregister(struct in_device *idev)
 
 static struct in_ifaddr *inet_alloc_ifa(void)
 {
-	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL);
+	return kzalloc(sizeof(struct in_ifaddr), GFP_KERNEL_ACCOUNT);
 }
 
 static void inet_rcu_free_ifa(struct rcu_head *head)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 25cf387..8060524 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2380,11 +2380,11 @@ void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
+					  0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
 					   LEAF_SIZE,
-					   0, SLAB_PANIC, NULL);
+					   0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3bf685f..8eaeade 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1080,7 +1080,7 @@ static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
 			goto out;
 	}
 
-	ifa = kzalloc(sizeof(*ifa), gfp_flags);
+	ifa = kzalloc(sizeof(*ifa), gfp_flags | __GFP_ACCOUNT);
 	if (!ifa) {
 		err = -ENOBUFS;
 		goto out;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 2d650dc..a8f118e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -2449,8 +2449,8 @@ int __init fib6_init(void)
 	int ret = -ENOMEM;
 
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
-					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN,
+					   sizeof(struct fib6_node), 0,
+					   SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
 					   NULL);
 	if (!fib6_node_kmem)
 		goto out;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7b756a7..5f7286a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6638,7 +6638,7 @@ int __init ip6_route_init(void)
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 03/16] memcg: enable accounting for inet_bin_bucket cache
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
  2021-07-26 18:59             ` [PATCH v6 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 04/16] memcg: enable accounting for VLAN group array Vasily Averin
                               ` (2 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

A net namespace can create up to 64K TCP and DCCP ports, forcing the kernel
to allocate up to several megabytes of memory per netns
for inet_bind_bucket objects.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/dccp/proto.c | 2 +-
 net/ipv4/tcp.c   | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 7eb0fb2..abb5c59 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1126,7 +1126,7 @@ static int __init dccp_init(void)
 	dccp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("dccp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d5ab5f2..5c0605e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4509,7 +4509,9 @@ void __init tcp_init(void)
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
 				  sizeof(struct inet_bind_bucket), 0,
-				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 04/16] memcg: enable accounting for VLAN group array
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (2 preceding siblings ...)
  2021-07-26 19:00             ` [PATCH v6 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
  5 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, netdev, linux-kernel

The VLAN group array consumes up to 8 pages of memory per net device.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
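
The 8-page figure can be reproduced with simple arithmetic. The sketch
below is a user-space illustration with hypothetical model_* names; it
assumes 4 KiB pages and one 'struct net_device *' slot for each of the
4096 possible VLAN IDs, counted for a single fully populated set.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_VLAN_N_VID	4096u	/* 12-bit VLAN ID space */
#define MODEL_PAGE_SIZE		4096u	/* assumption: 4 KiB pages */

/* Bytes needed for one pointer slot per VLAN ID. */
static inline size_t model_vlan_array_bytes(size_t ptr_size)
{
	assert(ptr_size > 0);
	return (size_t)MODEL_VLAN_N_VID * ptr_size;
}

/* Same amount, rounded up to whole pages. */
static inline size_t model_vlan_array_pages(size_t ptr_size)
{
	return (model_vlan_array_bytes(ptr_size) + MODEL_PAGE_SIZE - 1) /
	       MODEL_PAGE_SIZE;
}
```

With 8-byte pointers, model_vlan_array_pages(sizeof(void *)) gives 8 pages
under these assumptions; with 4-byte pointers it gives 4.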

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/8021q/vlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 4cdf841..55275ef 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -67,7 +67,7 @@ static int vlan_group_prealloc_vid(struct vlan_group *vg,
 		return 0;
 
 	size = sizeof(struct net_device *) * VLAN_GROUP_ARRAY_PART_LEN;
-	array = kzalloc(size, GFP_KERNEL);
+	array = kzalloc(size, GFP_KERNEL_ACCOUNT);
 	if (array == NULL)
 		return -ENOBUFS;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (3 preceding siblings ...)
  2021-07-26 19:00             ` [PATCH v6 04/16] memcg: enable accounting for VLAN group array Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  2021-07-26 19:00             ` [PATCH v6 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
  5 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller,
	Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern, netdev,
	linux-kernel

Author: Andrey Ryabinin <aryabinin@virtuozzo.com>

The size of the ip_tunnel_prl structs allocation is controllable from
user space, so it is better to avoid spamming dmesg if the allocation fails.
Also add __GFP_ACCOUNT, as this is a good candidate for per-memcg
accounting. The allocation is temporary and limited to 4GB.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/ipv6/sit.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index df5bea8..33adc12 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -321,7 +321,7 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 	 * we try harder to allocate.
 	 */
 	kp = (cmax <= 1 || capable(CAP_NET_ADMIN)) ?
-		kcalloc(cmax, sizeof(*kp), GFP_KERNEL | __GFP_NOWARN) :
+		kcalloc(cmax, sizeof(*kp), GFP_KERNEL_ACCOUNT | __GFP_NOWARN) :
 		NULL;
 
 	rcu_read_lock();
@@ -334,7 +334,8 @@ static int ipip6_tunnel_get_prl(struct net_device *dev, struct ifreq *ifr)
 		 * For root users, retry allocating enough memory for
 		 * the answer.
 		 */
-		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC);
+		kp = kcalloc(ca, sizeof(*kp), GFP_ATOMIC | __GFP_ACCOUNT |
+					      __GFP_NOWARN);
 		if (!kp) {
 			ret = -ENOMEM;
 			goto out;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v6 06/16] memcg: enable accounting for scm_fp_list objects
       [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
                               ` (4 preceding siblings ...)
  2021-07-26 19:00             ` [PATCH v6 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
@ 2021-07-26 19:00             ` Vasily Averin
  5 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, David S. Miller, Eric Dumazet,
	Jakub Kicinski, netdev, linux-kernel

Unix sockets allow sending file descriptors via SCM_RIGHTS messages.
Each such send call forces the kernel to allocate up to 2KB of memory for
struct scm_fp_list.

It makes sense to account for them to restrict the host's memory
consumption from inside the memcg-limited container.
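
The "up to 2KB" bound can be checked with a user-space stand-in for the
structure. Everything here is an assumption for illustration: the model_*
names are hypothetical, the field layout is assumed to mirror
struct scm_fp_list, and the 253-descriptor limit is assumed to match the
kernel's SCM_MAX_FD.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SCM_MAX_FD	253	/* assumed per-message descriptor limit */

/* Hypothetical mirror of struct scm_fp_list with opaque pointers. */
struct model_scm_fp_list {
	short count;
	short max;
	void *user;			/* stand-in for struct user_struct * */
	void *fp[MODEL_SCM_MAX_FD];	/* stand-in for struct file * */
};

/* Full-size allocation, as requested by the kmalloc() in scm_fp_copy(). */
static inline size_t model_scm_fp_list_size(void)
{
	return sizeof(struct model_scm_fp_list);
}

/* Size covering only n descriptors, like the offsetof() in scm_fp_dup(). */
static inline size_t model_scm_fp_list_used(size_t n)
{
	assert(n <= MODEL_SCM_MAX_FD);
	return offsetof(struct model_scm_fp_list, fp) + n * sizeof(void *);
}
```

Under these assumptions the full structure stays within 2048 bytes on both
32-bit and 64-bit builds, which is consistent with the figure quoted above.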

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 net/core/scm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index ae3085d..5c356f0 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -79,7 +79,7 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 
 	if (!fpl)
 	{
-		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL);
+		fpl = kmalloc(sizeof(struct scm_fp_list), GFP_KERNEL_ACCOUNT);
 		if (!fpl)
 			return -ENOMEM;
 		*fplp = fpl;
@@ -355,7 +355,7 @@ struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl)
 		return NULL;
 
 	new_fpl = kmemdup(fpl, offsetof(struct scm_fp_list, fp[fpl->count]),
-			  GFP_KERNEL);
+			  GFP_KERNEL_ACCOUNT);
 	if (new_fpl) {
 		for (i = 0; i < fpl->count; i++)
 			get_file(fpl->fp[i]);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 00/16] memcg accounting from
  2021-07-26 18:59           ` [PATCH v6 00/16] memcg accounting from Vasily Averin
@ 2021-07-26 21:59             ` David Miller
  2021-07-27  4:44               ` [PATCH v6 00/16] memcg accounting from OpenVZ Vasily Averin
  0 siblings, 1 reply; 32+ messages in thread
From: David Miller @ 2021-07-26 21:59 UTC (permalink / raw)
  To: vvs
  Cc: akpm, tj, cgroups, mhocko, hannes, vdavydov.dev, guro, shakeelb,
	nglaive, viro, adobriyan, avagin, bp, christian.brauner, dsahern,
	0x7f454c46, edumazet, ebiederm, gregkh, yoshfuji, hpa, mingo,
	kuba, bfields, jlayton, axboe, jirislaby, ktkhai, oleg, serge,
	tglx, lizefan.x, netdev, linux-fsdevel, linux-kernel


This series does not apply cleanly to net-next, please respin.

Thank you.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v6 00/16] memcg accounting from OpenVZ
  2021-07-26 21:59             ` David Miller
@ 2021-07-27  4:44               ` Vasily Averin
  0 siblings, 0 replies; 32+ messages in thread
From: Vasily Averin @ 2021-07-27  4:44 UTC (permalink / raw)
  To: David Miller
  Cc: akpm, tj, cgroups, mhocko, hannes, vdavydov.dev, guro, shakeelb,
	nglaive, viro, adobriyan, avagin, bp, christian.brauner, dsahern,
	0x7f454c46, edumazet, ebiederm, gregkh, yoshfuji, hpa, mingo,
	kuba, bfields, jlayton, axboe, jirislaby, ktkhai, oleg, serge,
	tglx, lizefan.x, netdev, linux-fsdevel, linux-kernel

On 7/27/21 12:59 AM, David Miller wrote:
> 
> This series does not apply cleanly to net-next, please respin.

Dear David,
I found that you have already approved the net-related patches of this series and included them in net-next.
So I'll respin v7 without these patches.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-07-27  4:45 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <8664122a-99d3-7199-869a-781b21b7e712@virtuozzo.com>
2021-04-28  6:51 ` [PATCH v4 00/16] memcg accounting from OpenVZ Vasily Averin
2021-07-15 17:11   ` Shakeel Butt
2021-07-16  4:11     ` Vasily Averin
2021-07-16 12:55       ` Shakeel Butt
2021-07-19 10:44         ` [PATCH v5 " Vasily Averin
2021-07-26 18:59           ` [PATCH v6 00/16] memcg accounting from Vasily Averin
2021-07-26 21:59             ` David Miller
2021-07-27  4:44               ` [PATCH v6 00/16] memcg accounting from OpenVZ Vasily Averin
     [not found]           ` <cover.1627321321.git.vvs@virtuozzo.com>
2021-07-26 18:59             ` [PATCH v6 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
2021-07-26 19:00             ` [PATCH v6 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
2021-07-26 19:00             ` [PATCH v6 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
2021-07-26 19:00             ` [PATCH v6 04/16] memcg: enable accounting for VLAN group array Vasily Averin
2021-07-26 19:00             ` [PATCH v6 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
2021-07-26 19:00             ` [PATCH v6 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
     [not found]         ` <cover.1626688654.git.vvs@virtuozzo.com>
2021-07-19 10:44           ` [PATCH v5 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
2021-07-19 10:44           ` [PATCH v5 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
2021-07-19 14:00             ` Dmitry Safonov
2021-07-19 14:22               ` Shakeel Butt
2021-07-19 14:24                 ` Dmitry Safonov
2021-07-20 19:26             ` Shakeel Butt
2021-07-26 10:23               ` Vasily Averin
2021-07-26 13:48                 ` Shakeel Butt
2021-07-19 10:44           ` [PATCH v5 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
2021-07-19 10:44           ` [PATCH v5 04/16] memcg: enable accounting for VLAN group array Vasily Averin
2021-07-19 10:44           ` [PATCH v5 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
2021-07-19 10:44           ` [PATCH v5 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
2021-04-28  6:51 ` [PATCH v4 01/16] memcg: enable accounting for net_device and Tx/Rx queues Vasily Averin
2021-04-28  6:51 ` [PATCH v4 02/16] memcg: enable accounting for IP address and routing-related objects Vasily Averin
2021-04-28  6:51 ` [PATCH v4 03/16] memcg: enable accounting for inet_bin_bucket cache Vasily Averin
2021-04-28  6:52 ` [PATCH v4 04/16] memcg: enable accounting for VLAN group array Vasily Averin
2021-04-28  6:52 ` [PATCH v4 05/16] memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation Vasily Averin
2021-04-28  6:52 ` [PATCH v4 06/16] memcg: enable accounting for scm_fp_list objects Vasily Averin
