linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
@ 2016-06-01  7:52 Nikolay Borisov
  2016-06-01  7:52 ` [PATCH 1/4] inotify: Add infrastructure to account inotify limits per-namespace Nikolay Borisov
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Nikolay Borisov @ 2016-06-01  7:52 UTC (permalink / raw)
  To: john, eparis, ebiederm
  Cc: jack, linux-kernel, gorcunov, avagin, netdev, operations,
	Nikolay Borisov

Currently the inotify instances/watches are accounted in the
user_struct structure. This means that in setups where multiple
users in unprivileged containers map to the same underlying
real user (i.e. the same user_struct) the inotify limits are
shared as well, which can lead to unpleasant surprises. This is a
problem since any user inside any of the containers can exhaust
the instance/watch limits and thereby prevent services in the
other containers from starting.

The solution I propose is rather simple: instead of accounting the
watches/instances per user_struct, account them in a hashtable
indexed by the hashed pointer of the userns. This way the
administrator needn't set the inotify limits very high, and the
risk of one container breaching the limits and affecting every
other container is alleviated.
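To make the scheme concrete, here is a minimal userspace sketch of the accounting idea (plain-C stand-ins for the kernel's hashtable.h machinery; all names and the bucket count are illustrative, not the patch's actual code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TBL_BITS 6
#define TBL_SIZE (1u << TBL_BITS)

/* Per-namespace counters, chained per bucket. */
struct ns_state {
	struct ns_state *next;
	const void *key;        /* user namespace pointer, used as the key */
	unsigned int instances;
	unsigned int watches;
};

/* Hash the pointer value into one of TBL_SIZE buckets. */
static unsigned int hash_ptr(const void *p)
{
	uintptr_t v = (uintptr_t)p;

	return (unsigned int)((v >> 4) ^ (v >> 10)) & (TBL_SIZE - 1);
}

/* Walk the bucket for this key; return its state or NULL. */
static struct ns_state *find_state(struct ns_state **tbl, const void *key)
{
	struct ns_state *s;

	for (s = tbl[hash_ptr(key)]; s; s = s->next)
		if (s->key == key)
			return s;
	return NULL;
}

static void insert_state(struct ns_state **tbl, struct ns_state *s)
{
	unsigned int b = hash_ptr(s->key);

	s->next = tbl[b];
	tbl[b] = s;
}
```

Each namespace then carries its own counters, so bumping the watch count for one userns leaves every other userns untouched even when they share the same real user.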

I have performed functional testing to validate that limits in
different namespaces are indeed separate, as well as running
multiple inotify stressors from stress-ng to ensure I haven't
introduced any race conditions.

This series is based on 4.7-rc1 (and applies cleanly on 4.4.10) and
consists of the following 4 patches:

Patch 1: Introduces the necessary structures and code changes. Including
hashtable.h in sched.h causes warnings in files which define a HASH_SIZE
macro of their own; patch 3 fixes this with a mechanical rename.

Patch 2: Flips the inotify code over to use the new infrastructure.

Patch 3: A simple mechanical rename of definitions that conflict with
hashtable.h's HASH_SIZE macro. I'd welcome comments on how best to go
about this.

Patch 4: A rather self-contained patch that can go in irrespective of
whether the series is accepted; it's needed so that building the kernel
with !CONFIG_INOTIFY_USER doesn't fail once patch 1 is applied.
fdinfo.c doesn't really need inotify.h anyway.

Nikolay Borisov (4):
  inotify: Add infrastructure to account inotify limits per-namespace
  inotify: Convert inotify limits to be accounted
    per-realuser/per-namespace
  misc: Rename the HASH_SIZE macro
  inotify: Don't include inotify.h when !CONFIG_INOTIFY_USER

 fs/logfs/dir.c                           |  6 +--
 fs/notify/fdinfo.c                       |  3 ++
 fs/notify/inotify/inotify.h              | 68 ++++++++++++++++++++++++++++++++
 fs/notify/inotify/inotify_fsnotify.c     | 14 ++++++-
 fs/notify/inotify/inotify_user.c         | 57 ++++++++++++++++++++++----
 include/linux/fsnotify_backend.h         |  1 +
 include/linux/sched.h                    |  5 ++-
 kernel/user.c                            | 13 ++++++
 net/ipv6/ip6_gre.c                       |  8 ++--
 net/ipv6/ip6_tunnel.c                    | 10 ++---
 net/ipv6/ip6_vti.c                       | 10 ++---
 net/ipv6/sit.c                           | 10 ++---
 security/keys/encrypted-keys/encrypted.c | 32 +++++++--------
 13 files changed, 189 insertions(+), 48 deletions(-)

-- 
2.5.0

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/4] inotify: Add infrastructure to account inotify limits per-namespace
  2016-06-01  7:52 [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Nikolay Borisov
@ 2016-06-01  7:52 ` Nikolay Borisov
  2016-06-06  8:05   ` Cyrill Gorcunov
  2016-06-01  7:52 ` [PATCH 2/4] inotify: Convert inotify limits to be accounted per-realuser/per-namespace Nikolay Borisov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2016-06-01  7:52 UTC (permalink / raw)
  To: john, eparis, ebiederm
  Cc: jack, linux-kernel, gorcunov, avagin, netdev, operations,
	Nikolay Borisov

This patch adds the necessary members to user_struct. The idea behind
the solution is really simple: use the userns pointer as a key into
a hash table which holds the inotify instance/watch counts. This
accounts the limits per userns rather than per real user, preventing
scenarios where a single mapped user in a container depletes the
inotify resources of every other user that maps to the exact same
real user.
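The per-namespace instance limit enforced later in this patch's inotify_init_state can be sketched in userspace like so (locking and allocation elided; -1 stands in for -EMFILE, and the limit value mirrors the in-tree default of the fs.inotify.max_user_instances sysctl):

```c
#include <assert.h>

/* Illustrative stand-in for the fs.inotify.max_user_instances sysctl
 * (the kernel default is 128). */
static int inotify_max_user_instances = 128;

struct inotify_state {
	unsigned int inotify_devs; /* inotify fds open in this userns */
};

/* Bump the per-namespace instance count unless that would exceed the
 * ceiling; on failure the count is left untouched. */
static int account_instance(struct inotify_state *state)
{
	if ((int)(state->inotify_devs + 1) > inotify_max_user_instances)
		return -1; /* -EMFILE in the kernel */
	state->inotify_devs++;
	return 0;
}
```

Because the count lives in the per-userns state rather than in an atomic_t shared by the whole real user, one container hitting this ceiling no longer starves the others.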

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
 fs/notify/inotify/inotify.h      | 68 ++++++++++++++++++++++++++++++++++++++++
 fs/notify/inotify/inotify_user.c | 36 +++++++++++++++++++++
 include/linux/fsnotify_backend.h |  1 +
 include/linux/sched.h            |  3 ++
 kernel/user.c                    | 13 ++++++++
 5 files changed, 121 insertions(+)

diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
index ed855ef6f077..e069e1e4262a 100644
--- a/fs/notify/inotify/inotify.h
+++ b/fs/notify/inotify/inotify.h
@@ -1,6 +1,7 @@
 #include <linux/fsnotify_backend.h>
 #include <linux/inotify.h>
 #include <linux/slab.h> /* struct kmem_cache */
+#include <linux/hashtable.h>
 
 struct inotify_event_info {
 	struct fsnotify_event fse;
@@ -15,6 +16,13 @@ struct inotify_inode_mark {
 	int wd;
 };
 
+struct inotify_state {
+	struct hlist_node node;
+	void *key; /* user_namespace ptr */
+	u32 inotify_watches; /* How many inotify watches does this user have? */
+	u32 inotify_devs;  /* How many inotify devs does this user have opened? */
+};
+
 static inline struct inotify_event_info *INOTIFY_E(struct fsnotify_event *fse)
 {
 	return container_of(fse, struct inotify_event_info, fse);
@@ -30,3 +38,63 @@ extern int inotify_handle_event(struct fsnotify_group *group,
 				const unsigned char *file_name, u32 cookie);
 
 extern const struct fsnotify_ops inotify_fsnotify_ops;
+
+/* Helpers for manipulating various inotify state, stored in user_struct */
+static inline struct inotify_state *__find_inotify_state(struct user_struct *user,
+							  void *key)
+{
+	struct inotify_state *state;
+
+	hash_for_each_possible(user->inotify_tbl, state, node, (unsigned long)key)
+		if (state->key == key)
+			return state;
+
+	return NULL;
+}
+
+static inline void inotify_inc_watches(struct user_struct *user, void *key)
+{
+	struct inotify_state *state;
+
+	spin_lock(&user->inotify_lock);
+	state = __find_inotify_state(user, key);
+	state->inotify_watches++;
+	spin_unlock(&user->inotify_lock);
+}
+
+
+static inline void inotify_dec_watches(struct user_struct *user, void *key)
+{
+	struct inotify_state *state;
+
+	spin_lock(&user->inotify_lock);
+	state = __find_inotify_state(user, key);
+	state->inotify_watches--;
+	spin_unlock(&user->inotify_lock);
+}
+
+static inline int inotify_read_watches(struct user_struct *user, void *key)
+{
+	struct inotify_state *state;
+	int ret;
+
+	spin_lock(&user->inotify_lock);
+	state = __find_inotify_state(user, key);
+	ret = state->inotify_watches;
+	spin_unlock(&user->inotify_lock);
+	return ret;
+}
+
+static inline unsigned long inotify_dec_return_dev(struct user_struct *user,
+						   void *key)
+{
+	struct inotify_state *state;
+	unsigned long ret;
+
+	spin_lock(&user->inotify_lock);
+	state = __find_inotify_state(user, key);
+	ret = --state->inotify_devs;
+	spin_unlock(&user->inotify_lock);
+
+	return ret;
+}
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index b8d08d0d0a4d..ae7ec2414252 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -86,6 +86,42 @@ struct ctl_table inotify_table[] = {
 };
 #endif /* CONFIG_SYSCTL */
 
+
+static int inotify_init_state(struct user_struct *user,
+						  void *key)
+{
+	struct inotify_state *state;
+	int ret = 0;
+
+	spin_lock(&user->inotify_lock);
+	state =  __find_inotify_count(user, key);
+
+	if (!state) {
+		spin_unlock(&user->inotify_lock);
+		state = kzalloc(sizeof(struct inotify_state), GFP_KERNEL);
+		if (!state)
+			return -ENOMEM;
+
+		state->key = current_user_ns();
+		state->inotify_watches = 0;
+		state->inotify_devs = 1;
+
+		spin_lock(&user->inotify_lock);
+		hash_add(user->inotify_tbl, &state->node, (unsigned long)key);
+
+		goto out;
+	} else {
+
+		if (++state->inotify_devs > inotify_max_user_instances) {
+			ret = -EMFILE;
+			goto out;
+		}
+	}
+out:
+	spin_unlock(&user->inotify_lock);
+	return ret;
+}
+
 static inline __u32 inotify_arg_to_mask(u32 arg)
 {
 	__u32 mask;
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 29f917517299..89f7c247b038 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -170,6 +170,7 @@ struct fsnotify_group {
 			spinlock_t	idr_lock;
 			struct idr      idr;
 			struct user_struct      *user;
+			void *userns_ptr;
 		} inotify_data;
 #endif
 #ifdef CONFIG_FANOTIFY
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada26345..0c55d951d0bb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -58,6 +58,7 @@ struct sched_param {
 #include <linux/uidgid.h>
 #include <linux/gfp.h>
 #include <linux/magic.h>
+#include <linux/hashtable.h>
 #include <linux/cgroup-defs.h>
 
 #include <asm/processor.h>
@@ -839,6 +840,8 @@ struct user_struct {
 	atomic_t processes;	/* How many processes does this user have? */
 	atomic_t sigpending;	/* How many pending signals does this user have? */
 #ifdef CONFIG_INOTIFY_USER
+	spinlock_t inotify_lock;
+	DECLARE_HASHTABLE(inotify_tbl, 6);
 	atomic_t inotify_watches; /* How many inotify watches does this user have? */
 	atomic_t inotify_devs;	/* How many inotify devs does this user have opened? */
 #endif
diff --git a/kernel/user.c b/kernel/user.c
index b069ccbfb0b0..0c73b0318806 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -17,6 +17,8 @@
 #include <linux/export.h>
 #include <linux/user_namespace.h>
 #include <linux/proc_ns.h>
+#include <linux/hashtable.h>
+
 
 /*
  * userns count is 1 for root user, 1 for init_uts_ns,
@@ -94,6 +96,9 @@ struct user_struct root_user = {
 	.sigpending	= ATOMIC_INIT(0),
 	.locked_shm     = 0,
 	.uid		= GLOBAL_ROOT_UID,
+#ifdef CONFIG_INOTIFY_USER
+	.inotify_lock	= __SPIN_LOCK_UNLOCKED(root_user.inotify_lock),
+#endif
 };
 
 /*
@@ -184,6 +189,10 @@ struct user_struct *alloc_uid(kuid_t uid)
 
 		new->uid = uid;
 		atomic_set(&new->__count, 1);
+#ifdef CONFIG_INOTIFY_USER
+		spin_lock_init(&new->inotify_lock);
+		hash_init(new->inotify_tbl);
+#endif
 
 		/*
 		 * Before adding this, check whether we raced
@@ -223,6 +232,10 @@ static int __init uid_cache_init(void)
 	uid_hash_insert(&root_user, uidhashentry(GLOBAL_ROOT_UID));
 	spin_unlock_irq(&uidhash_lock);
 
+#ifdef CONFIG_INOTIFY_USER
+	hash_init(root_user.inotify_tbl);
+#endif
+
 	return 0;
 }
 subsys_initcall(uid_cache_init);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/4] inotify: Convert inotify limits to be accounted per-realuser/per-namespace
  2016-06-01  7:52 [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Nikolay Borisov
  2016-06-01  7:52 ` [PATCH 1/4] inotify: Add infrastructure to account inotify limits per-namespace Nikolay Borisov
@ 2016-06-01  7:52 ` Nikolay Borisov
  2016-06-01  7:52 ` [PATCH 3/4] misc: Rename the HASH_SIZE macro Nikolay Borisov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Nikolay Borisov @ 2016-06-01  7:52 UTC (permalink / raw)
  To: john, eparis, ebiederm
  Cc: jack, linux-kernel, gorcunov, avagin, netdev, operations,
	Nikolay Borisov

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
 fs/notify/inotify/inotify_fsnotify.c | 14 +++++++++++++-
 fs/notify/inotify/inotify_user.c     | 23 +++++++++++++++--------
 include/linux/sched.h                |  2 --
 3 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 2cd900c2c737..efaeec3f2e26 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -166,7 +166,19 @@ static void inotify_free_group_priv(struct fsnotify_group *group)
 	idr_for_each(&group->inotify_data.idr, idr_callback, group);
 	idr_destroy(&group->inotify_data.idr);
 	if (group->inotify_data.user) {
-		atomic_dec(&group->inotify_data.user->inotify_devs);
+		struct user_struct *user = group->inotify_data.user;
+		void *key = group->inotify_data.userns_ptr;
+		struct inotify_state *state;
+
+		spin_lock(&user->inotify_lock);
+		state = __find_inotify_state(user, key);
+		if (--state->inotify_devs == 0)
+			hash_del(&state->node);
+		spin_unlock(&user->inotify_lock);
+
+		if (state->inotify_devs == 0)
+			kfree(state);
+
 		free_uid(group->inotify_data.user);
 	}
 }
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index ae7ec2414252..e7cc4eaa838f 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -94,7 +94,7 @@ static int inotify_init_state(struct user_struct *user,
 	int ret = 0;
 
 	spin_lock(&user->inotify_lock);
-	state =  __find_inotify_count(user, key);
+	state =  __find_inotify_state(user, key);
 
 	if (!state) {
 		spin_unlock(&user->inotify_lock);
@@ -536,7 +536,8 @@ void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
 	/* remove this mark from the idr */
 	inotify_remove_from_idr(group, i_mark);
 
-	atomic_dec(&group->inotify_data.user->inotify_watches);
+	inotify_dec_watches(group->inotify_data.user,
+			    group->inotify_data.userns_ptr);
 }
 
 /* ding dong the mark is dead */
@@ -609,6 +610,8 @@ static int inotify_new_watch(struct fsnotify_group *group,
 	int ret;
 	struct idr *idr = &group->inotify_data.idr;
 	spinlock_t *idr_lock = &group->inotify_data.idr_lock;
+	struct user_struct *user = group->inotify_data.user;
+	void *key = group->inotify_data.userns_ptr;
 
 	mask = inotify_arg_to_mask(arg);
 
@@ -621,7 +624,7 @@ static int inotify_new_watch(struct fsnotify_group *group,
 	tmp_i_mark->wd = -1;
 
 	ret = -ENOSPC;
-	if (atomic_read(&group->inotify_data.user->inotify_watches) >= inotify_max_user_watches)
+	if (inotify_read_watches(user, key) >= inotify_max_user_watches)
 		goto out_err;
 
 	ret = inotify_add_to_idr(idr, idr_lock, tmp_i_mark);
@@ -638,7 +641,7 @@ static int inotify_new_watch(struct fsnotify_group *group,
 	}
 
 	/* increment the number of watches the user has */
-	atomic_inc(&group->inotify_data.user->inotify_watches);
+	inotify_inc_watches(user, key);
 
 	/* return the watch descriptor for this new mark */
 	ret = tmp_i_mark->wd;
@@ -669,6 +672,9 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
 {
 	struct fsnotify_group *group;
 	struct inotify_event_info *oevent;
+	struct user_struct *user = get_current_user();
+	void *key = current_user_ns();
+	int ret;
 
 	group = fsnotify_alloc_group(&inotify_fsnotify_ops);
 	if (IS_ERR(group))
@@ -689,12 +695,13 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
 
 	spin_lock_init(&group->inotify_data.idr_lock);
 	idr_init(&group->inotify_data.idr);
-	group->inotify_data.user = get_current_user();
+	group->inotify_data.user = user;
+	group->inotify_data.userns_ptr = key;
 
-	if (atomic_inc_return(&group->inotify_data.user->inotify_devs) >
-	    inotify_max_user_instances) {
+	ret = inotify_init_state(user, key);
+	if (ret < 0) {
 		fsnotify_destroy_group(group);
-		return ERR_PTR(-EMFILE);
+		return ERR_PTR(ret);
 	}
 
 	return group;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0c55d951d0bb..8f589b32ed15 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -842,8 +842,6 @@ struct user_struct {
 #ifdef CONFIG_INOTIFY_USER
 	spinlock_t inotify_lock;
 	DECLARE_HASHTABLE(inotify_tbl, 6);
-	atomic_t inotify_watches; /* How many inotify watches does this user have? */
-	atomic_t inotify_devs;	/* How many inotify devs does this user have opened? */
 #endif
 #ifdef CONFIG_FANOTIFY
 	atomic_t fanotify_listeners;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/4] misc: Rename the HASH_SIZE macro
  2016-06-01  7:52 [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Nikolay Borisov
  2016-06-01  7:52 ` [PATCH 1/4] inotify: Add infrastructure to account inotify limits per-namespace Nikolay Borisov
  2016-06-01  7:52 ` [PATCH 2/4] inotify: Convert inotify limits to be accounted per-realuser/per-namespace Nikolay Borisov
@ 2016-06-01  7:52 ` Nikolay Borisov
  2016-06-01 18:13   ` David Miller
  2016-06-01  7:53 ` [PATCH 4/4] inotify: Don't include inotify.h when !CONFIG_INOTIFY_USER Nikolay Borisov
  2016-06-01 16:00 ` [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Eric W. Biederman
  4 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2016-06-01  7:52 UTC (permalink / raw)
  To: john, eparis, ebiederm
  Cc: jack, linux-kernel, gorcunov, avagin, netdev, operations,
	Nikolay Borisov

This change is required since the inotify-per-namespace code added
hashtable.h to the include list of sched.h. This in turn causes
compiler warnings, since HASH_SIZE ends up being defined in multiple
locations.
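The collision can be reproduced in miniature: linux/hashtable.h defines a function-like HASH_SIZE(name) macro, so any file that also defines an object-like HASH_SIZE of its own gets a redefinition warning once sched.h drags the header in everywhere. A sketch of the prefixed fix, using userspace stand-ins (the macro bodies here are illustrative):

```c
#include <assert.h>

/* hashtable.h-style definition: number of buckets from an array. */
#define HASH_SIZE(name) (sizeof(name) / sizeof((name)[0]))

/* Redefining HASH_SIZE as a plain constant below would trigger a
 * macro-redefinition warning; a subsystem prefix, as this patch
 * applies, keeps both definitions usable in one translation unit. */
#define IP6G_HASH_SIZE_SHIFT 5
#define IP6G_HASH_SIZE (1 << IP6G_HASH_SIZE_SHIFT)

static int buckets[IP6G_HASH_SIZE];
```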

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
 fs/logfs/dir.c                           |  6 +++---
 net/ipv6/ip6_gre.c                       |  8 ++++----
 net/ipv6/ip6_tunnel.c                    | 10 +++++-----
 net/ipv6/ip6_vti.c                       | 10 +++++-----
 net/ipv6/sit.c                           | 10 +++++-----
 security/keys/encrypted-keys/encrypted.c | 32 ++++++++++++++++----------------
 6 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/fs/logfs/dir.c b/fs/logfs/dir.c
index 2d5336bd4efd..bcd754d216bd 100644
--- a/fs/logfs/dir.c
+++ b/fs/logfs/dir.c
@@ -95,7 +95,7 @@ static int beyond_eof(struct inode *inode, loff_t bix)
  * of each character and pick a prime nearby, preferably a bit-sparse
  * one.
  */
-static u32 hash_32(const char *s, int len, u32 seed)
+static u32 logfs_hash_32(const char *s, int len, u32 seed)
 {
 	u32 hash = seed;
 	int i;
@@ -159,7 +159,7 @@ static struct page *logfs_get_dd_page(struct inode *dir, struct dentry *dentry)
 	struct qstr *name = &dentry->d_name;
 	struct page *page;
 	struct logfs_disk_dentry *dd;
-	u32 hash = hash_32(name->name, name->len, 0);
+	u32 hash = logfs_hash_32(name->name, name->len, 0);
 	pgoff_t index;
 	int round;
 
@@ -370,7 +370,7 @@ static int logfs_write_dir(struct inode *dir, struct dentry *dentry,
 {
 	struct page *page;
 	struct logfs_disk_dentry *dd;
-	u32 hash = hash_32(dentry->d_name.name, dentry->d_name.len, 0);
+	u32 hash = logfs_hash_32(dentry->d_name.name, dentry->d_name.len, 0);
 	pgoff_t index;
 	int round, err;
 
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index af503f518278..b73b4dc5c7ad 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -62,11 +62,11 @@ module_param(log_ecn_error, bool, 0644);
 MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
 
 #define HASH_SIZE_SHIFT  5
-#define HASH_SIZE (1 << HASH_SIZE_SHIFT)
+#define IP6G_HASH_SIZE (1 << HASH_SIZE_SHIFT)
 
 static int ip6gre_net_id __read_mostly;
 struct ip6gre_net {
-	struct ip6_tnl __rcu *tunnels[4][HASH_SIZE];
+	struct ip6_tnl __rcu *tunnels[4][IP6G_HASH_SIZE];
 
 	struct net_device *fb_tunnel_dev;
 };
@@ -96,7 +96,7 @@ static void ip6gre_tnl_link_config(struct ip6_tnl *t, int set_mtu);
    will match fallback tunnel.
  */
 
-#define HASH_KEY(key) (((__force u32)key^((__force u32)key>>4))&(HASH_SIZE - 1))
+#define HASH_KEY(key) (((__force u32)key^((__force u32)key>>4))&(IP6G_HASH_SIZE - 1))
 static u32 HASH_ADDR(const struct in6_addr *addr)
 {
 	u32 hash = ipv6_addr_hash(addr);
@@ -1086,7 +1086,7 @@ static void ip6gre_destroy_tunnels(struct net *net, struct list_head *head)
 
 	for (prio = 0; prio < 4; prio++) {
 		int h;
-		for (h = 0; h < HASH_SIZE; h++) {
+		for (h = 0; h < IP6G_HASH_SIZE; h++) {
 			struct ip6_tnl *t;
 
 			t = rtnl_dereference(ign->tunnels[prio][h]);
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 7b0481e3738f..50b57a435f05 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -64,8 +64,8 @@ MODULE_LICENSE("GPL");
 MODULE_ALIAS_RTNL_LINK("ip6tnl");
 MODULE_ALIAS_NETDEV("ip6tnl0");
 
-#define HASH_SIZE_SHIFT  5
-#define HASH_SIZE (1 << HASH_SIZE_SHIFT)
+#define IP6_HASH_SIZE_SHIFT  5
+#define IP6_HASH_SIZE (1 << IP6_HASH_SIZE_SHIFT)
 
 static bool log_ecn_error = true;
 module_param(log_ecn_error, bool, 0644);
@@ -75,7 +75,7 @@ static u32 HASH(const struct in6_addr *addr1, const struct in6_addr *addr2)
 {
 	u32 hash = ipv6_addr_hash(addr1) ^ ipv6_addr_hash(addr2);
 
-	return hash_32(hash, HASH_SIZE_SHIFT);
+	return hash_32(hash, IP6_HASH_SIZE_SHIFT);
 }
 
 static int ip6_tnl_dev_init(struct net_device *dev);
@@ -87,7 +87,7 @@ struct ip6_tnl_net {
 	/* the IPv6 tunnel fallback device */
 	struct net_device *fb_tnl_dev;
 	/* lists for storing tunnels in use */
-	struct ip6_tnl __rcu *tnls_r_l[HASH_SIZE];
+	struct ip6_tnl __rcu *tnls_r_l[IP6_HASH_SIZE];
 	struct ip6_tnl __rcu *tnls_wc[1];
 	struct ip6_tnl __rcu **tnls[2];
 };
@@ -2031,7 +2031,7 @@ static void __net_exit ip6_tnl_destroy_tunnels(struct net *net)
 		if (dev->rtnl_link_ops == &ip6_link_ops)
 			unregister_netdevice_queue(dev, &list);
 
-	for (h = 0; h < HASH_SIZE; h++) {
+	for (h = 0; h < IP6_HASH_SIZE; h++) {
 		t = rtnl_dereference(ip6n->tnls_r_l[h]);
 		while (t) {
 			/* If dev is in the same netns, it has already
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index d90a11f14040..30e242140909 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -50,14 +50,14 @@
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
-#define HASH_SIZE_SHIFT  5
-#define HASH_SIZE (1 << HASH_SIZE_SHIFT)
+#define VTI_HASH_SIZE_SHIFT  5
+#define VTI_HASH_SIZE (1 << VTI_HASH_SIZE_SHIFT)
 
 static u32 HASH(const struct in6_addr *addr1, const struct in6_addr *addr2)
 {
 	u32 hash = ipv6_addr_hash(addr1) ^ ipv6_addr_hash(addr2);
 
-	return hash_32(hash, HASH_SIZE_SHIFT);
+	return hash_32(hash, VTI_HASH_SIZE_SHIFT);
 }
 
 static int vti6_dev_init(struct net_device *dev);
@@ -69,7 +69,7 @@ struct vti6_net {
 	/* the vti6 tunnel fallback device */
 	struct net_device *fb_tnl_dev;
 	/* lists for storing tunnels in use */
-	struct ip6_tnl __rcu *tnls_r_l[HASH_SIZE];
+	struct ip6_tnl __rcu *tnls_r_l[VTI_HASH_SIZE];
 	struct ip6_tnl __rcu *tnls_wc[1];
 	struct ip6_tnl __rcu **tnls[2];
 };
@@ -1040,7 +1040,7 @@ static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n)
 	struct ip6_tnl *t;
 	LIST_HEAD(list);
 
-	for (h = 0; h < HASH_SIZE; h++) {
+	for (h = 0; h < VTI_HASH_SIZE; h++) {
 		t = rtnl_dereference(ip6n->tnls_r_l[h]);
 		while (t) {
 			unregister_netdevice_queue(t->dev, &list);
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 0a5a255277e5..757ec087ce01 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -62,7 +62,7 @@
    For comments look at net/ipv4/ip_gre.c --ANK
  */
 
-#define HASH_SIZE  16
+#define SIT_HASH_SIZE  16
 #define HASH(addr) (((__force u32)addr^((__force u32)addr>>4))&0xF)
 
 static bool log_ecn_error = true;
@@ -78,9 +78,9 @@ static struct rtnl_link_ops sit_link_ops __read_mostly;
 
 static int sit_net_id __read_mostly;
 struct sit_net {
-	struct ip_tunnel __rcu *tunnels_r_l[HASH_SIZE];
-	struct ip_tunnel __rcu *tunnels_r[HASH_SIZE];
-	struct ip_tunnel __rcu *tunnels_l[HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_r_l[SIT_HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_r[SIT_HASH_SIZE];
+	struct ip_tunnel __rcu *tunnels_l[SIT_HASH_SIZE];
 	struct ip_tunnel __rcu *tunnels_wc[1];
 	struct ip_tunnel __rcu **tunnels[4];
 
@@ -1773,7 +1773,7 @@ static void __net_exit sit_destroy_tunnels(struct net *net,
 
 	for (prio = 1; prio < 4; prio++) {
 		int h;
-		for (h = 0; h < HASH_SIZE; h++) {
+		for (h = 0; h < SIT_HASH_SIZE; h++) {
 			struct ip_tunnel *t;
 
 			t = rtnl_dereference(sitn->tunnels[prio][h]);
diff --git a/security/keys/encrypted-keys/encrypted.c b/security/keys/encrypted-keys/encrypted.c
index 5adbfc32242f..1c2271db2918 100644
--- a/security/keys/encrypted-keys/encrypted.c
+++ b/security/keys/encrypted-keys/encrypted.c
@@ -49,7 +49,7 @@ static int blksize;
 #define KEY_TRUSTED_PREFIX_LEN (sizeof (KEY_TRUSTED_PREFIX) - 1)
 #define KEY_USER_PREFIX_LEN (sizeof (KEY_USER_PREFIX) - 1)
 #define KEY_ECRYPTFS_DESC_LEN 16
-#define HASH_SIZE SHA256_DIGEST_SIZE
+#define E_HASH_SIZE SHA256_DIGEST_SIZE
 #define MAX_DATA_SIZE 4096
 #define MIN_DATA_SIZE  20
 
@@ -380,8 +380,8 @@ static int get_derived_key(u8 *derived_key, enum derived_key_type key_type,
 	int ret;
 
 	derived_buf_len = strlen("AUTH_KEY") + 1 + master_keylen;
-	if (derived_buf_len < HASH_SIZE)
-		derived_buf_len = HASH_SIZE;
+	if (derived_buf_len < E_HASH_SIZE)
+		derived_buf_len = E_HASH_SIZE;
 
 	derived_buf = kzalloc(derived_buf_len, GFP_KERNEL);
 	if (!derived_buf) {
@@ -517,7 +517,7 @@ out:
 static int datablob_hmac_append(struct encrypted_key_payload *epayload,
 				const u8 *master_key, size_t master_keylen)
 {
-	u8 derived_key[HASH_SIZE];
+	u8 derived_key[E_HASH_SIZE];
 	u8 *digest;
 	int ret;
 
@@ -529,7 +529,7 @@ static int datablob_hmac_append(struct encrypted_key_payload *epayload,
 	ret = calc_hmac(digest, derived_key, sizeof derived_key,
 			epayload->format, epayload->datablob_len);
 	if (!ret)
-		dump_hmac(NULL, digest, HASH_SIZE);
+		dump_hmac(NULL, digest, E_HASH_SIZE);
 out:
 	return ret;
 }
@@ -539,8 +539,8 @@ static int datablob_hmac_verify(struct encrypted_key_payload *epayload,
 				const u8 *format, const u8 *master_key,
 				size_t master_keylen)
 {
-	u8 derived_key[HASH_SIZE];
-	u8 digest[HASH_SIZE];
+	u8 derived_key[E_HASH_SIZE];
+	u8 digest[E_HASH_SIZE];
 	int ret;
 	char *p;
 	unsigned short len;
@@ -565,8 +565,8 @@ static int datablob_hmac_verify(struct encrypted_key_payload *epayload,
 		ret = -EINVAL;
 		dump_hmac("datablob",
 			  epayload->format + epayload->datablob_len,
-			  HASH_SIZE);
-		dump_hmac("calc", digest, HASH_SIZE);
+			  E_HASH_SIZE);
+		dump_hmac("calc", digest, E_HASH_SIZE);
 	}
 out:
 	return ret;
@@ -651,12 +651,12 @@ static struct encrypted_key_payload *encrypted_key_alloc(struct key *key,
 	    + strlen(datalen) + 1 + ivsize + 1 + encrypted_datalen;
 
 	ret = key_payload_reserve(key, payload_datalen + datablob_len
-				  + HASH_SIZE + 1);
+				  + E_HASH_SIZE + 1);
 	if (ret < 0)
 		return ERR_PTR(ret);
 
 	epayload = kzalloc(sizeof(*epayload) + payload_datalen +
-			   datablob_len + HASH_SIZE + 1, GFP_KERNEL);
+			   datablob_len + E_HASH_SIZE + 1, GFP_KERNEL);
 	if (!epayload)
 		return ERR_PTR(-ENOMEM);
 
@@ -670,7 +670,7 @@ static int encrypted_key_decrypt(struct encrypted_key_payload *epayload,
 				 const char *format, const char *hex_encoded_iv)
 {
 	struct key *mkey;
-	u8 derived_key[HASH_SIZE];
+	u8 derived_key[E_HASH_SIZE];
 	const u8 *master_key;
 	u8 *hmac;
 	const char *hex_encoded_data;
@@ -680,7 +680,7 @@ static int encrypted_key_decrypt(struct encrypted_key_payload *epayload,
 	int ret;
 
 	encrypted_datalen = roundup(epayload->decrypted_datalen, blksize);
-	asciilen = (ivsize + 1 + encrypted_datalen + HASH_SIZE) * 2;
+	asciilen = (ivsize + 1 + encrypted_datalen + E_HASH_SIZE) * 2;
 	if (strlen(hex_encoded_iv) != asciilen)
 		return -EINVAL;
 
@@ -695,7 +695,7 @@ static int encrypted_key_decrypt(struct encrypted_key_payload *epayload,
 
 	hmac = epayload->format + epayload->datablob_len;
 	ret = hex2bin(hmac, hex_encoded_data + (encrypted_datalen * 2),
-		      HASH_SIZE);
+		      E_HASH_SIZE);
 	if (ret < 0)
 		return -EINVAL;
 
@@ -918,7 +918,7 @@ static long encrypted_read(const struct key *key, char __user *buffer,
 	struct key *mkey;
 	const u8 *master_key;
 	size_t master_keylen;
-	char derived_key[HASH_SIZE];
+	char derived_key[E_HASH_SIZE];
 	char *ascii_buf;
 	size_t asciiblob_len;
 	int ret;
@@ -928,7 +928,7 @@ static long encrypted_read(const struct key *key, char __user *buffer,
 	/* returns the hex encoded iv, encrypted-data, and hmac as ascii */
 	asciiblob_len = epayload->datablob_len + ivsize + 1
 	    + roundup(epayload->decrypted_datalen, blksize)
-	    + (HASH_SIZE * 2);
+	    + (E_HASH_SIZE * 2);
 
 	if (!buffer || buflen < asciiblob_len)
 		return asciiblob_len;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/4] inotify: Don't include inotify.h when !CONFIG_INOTIFY_USER
  2016-06-01  7:52 [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Nikolay Borisov
                   ` (2 preceding siblings ...)
  2016-06-01  7:52 ` [PATCH 3/4] misc: Rename the HASH_SIZE macro Nikolay Borisov
@ 2016-06-01  7:53 ` Nikolay Borisov
  2016-06-01 16:00 ` [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Eric W. Biederman
  4 siblings, 0 replies; 17+ messages in thread
From: Nikolay Borisov @ 2016-06-01  7:53 UTC (permalink / raw)
  To: john, eparis, ebiederm
  Cc: jack, linux-kernel, gorcunov, avagin, netdev, operations,
	Nikolay Borisov

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
 fs/notify/fdinfo.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
index fd98e5100cab..62068f89d144 100644
--- a/fs/notify/fdinfo.c
+++ b/fs/notify/fdinfo.c
@@ -13,7 +13,10 @@
 #include <linux/proc_fs.h>
 #include <linux/exportfs.h>
 
+#ifdef CONFIG_INOTIFY_USER
 #include "inotify/inotify.h"
+#endif
+
 #include "../fs/mount.h"
 
 #if defined(CONFIG_PROC_FS)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
  2016-06-01  7:52 [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Nikolay Borisov
                   ` (3 preceding siblings ...)
  2016-06-01  7:53 ` [PATCH 4/4] inotify: Don't include inotify.h when !CONFIG_INOTIFY_USER Nikolay Borisov
@ 2016-06-01 16:00 ` Eric W. Biederman
  2016-06-02  6:27   ` Nikolay Borisov
  2016-06-02  7:49   ` Jan Kara
  4 siblings, 2 replies; 17+ messages in thread
From: Eric W. Biederman @ 2016-06-01 16:00 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: john, eparis, jack, linux-kernel, gorcunov, avagin, netdev,
	operations, Linux Containers

Cc'd the containers list.


Nikolay Borisov <kernel@kyup.com> writes:

> Currently the inotify instances/watches are accounted in the
> user_struct structure. This means that in setups where multiple
> users in unprivileged containers map to the same underlying
> real user (i.e. the same user_struct) the inotify limits are
> shared as well, which can lead to unpleasant surprises. This is a
> problem since any user inside any of the containers can exhaust
> the instance/watch limits and thereby prevent services in the
> other containers from starting.

On a high level this is a bit problematic as it appears to escape the
current limits and allows anyone creating a user namespace to have their
own fresh set of limits.  Given that anyone should be able to create a
user namespace whenever they feel like it, escaping limits is a problem.
That however is solvable.

A practical question.  What kind of limits are we looking at here?

Are these loose limits for detecting buggy programs that have gone
off their rails?

Are these tight limits to ensure multitasking is possible?



For tight limits where something is actively controlling the limits you
probably want a cgroup base solution.

For loose limits of the kind where you set a good default and forget
about it, I think a user namespace based solution is reasonable.

> The solution I propose is rather simple: instead of accounting the
> watches/instances per user_struct, account them in a hashtable
> indexed by the hashed pointer of the userns. This way the
> administrator needn't set the inotify limits very high, and the
> risk of one container breaching the limits and affecting every
> other container is alleviated.

I don't think this is the right data structure for a user namespace
based solution, at least in part because it does not account for users
escaping.

> I have performed functional testing to validate that limits in 
> different namespaces are indeed separate, as well as running 
> multiple inotify stressers from stress-ng to ensure I haven't 
> introduced any race conditions. 
>
> This series is based on 4.7-rc1 (and applies cleanly on 4.4.10) and 
> consists of the following 4 patches: 
>
> Patch 1: This introduces the necessary structure and code changes. Including
> hashtable.h in sched.h causes some warnings in files which define a HASH_SIZE macro; 
> patch 3 fixes this with a mechanical rename. 
>
> Patch 2: This patch flips the inotify code to use the new infrastructure.
>
> Patch 3: This is a simple mechanical rename of definitions that conflict with 
> hashtable.h's HASH_SIZE macro. I'm happy to hear comments on how I should go 
> about this. 
>
> Patch 4: This is a rather self-contained patch and can go in irrespective of 
> whether the series is accepted; it's needed so that building the kernel 
> with !CONFIG_INOTIFY_USER doesn't fail (with patch 1 applied). 
> However, fdinfo.c doesn't really need inotify.h  
>
> Nikolay Borisov (4):
>   inotify: Add infrastructure to account inotify limits per-namespace
>   inotify: Convert inotify limits to be accounted
>     per-realuser/per-namespace
>   misc: Rename the HASH_SIZE macro
>   inotify: Don't include inotify.h when !CONFIG_INOTIFY_USER
>
>  fs/logfs/dir.c                           |  6 +--
>  fs/notify/fdinfo.c                       |  3 ++
>  fs/notify/inotify/inotify.h              | 68 ++++++++++++++++++++++++++++++++
>  fs/notify/inotify/inotify_fsnotify.c     | 14 ++++++-
>  fs/notify/inotify/inotify_user.c         | 57 ++++++++++++++++++++++----
>  include/linux/fsnotify_backend.h         |  1 +
>  include/linux/sched.h                    |  5 ++-
>  kernel/user.c                            | 13 ++++++
>  net/ipv6/ip6_gre.c                       |  8 ++--
>  net/ipv6/ip6_tunnel.c                    | 10 ++---
>  net/ipv6/ip6_vti.c                       | 10 ++---
>  net/ipv6/sit.c                           | 10 ++---
>  security/keys/encrypted-keys/encrypted.c | 32 +++++++--------
>  13 files changed, 189 insertions(+), 48 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] misc: Rename the HASH_SIZE macro
  2016-06-01  7:52 ` [PATCH 3/4] misc: Rename the HASH_SIZE macro Nikolay Borisov
@ 2016-06-01 18:13   ` David Miller
  0 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2016-06-01 18:13 UTC (permalink / raw)
  To: kernel
  Cc: john, eparis, ebiederm, jack, linux-kernel, gorcunov, avagin,
	netdev, operations

From: Nikolay Borisov <kernel@kyup.com>
Date: Wed,  1 Jun 2016 10:52:59 +0300

> This change is required since the inotify-per-namespace code added
> hashtable.h to the include list of sched.h. This in turn causes
> compiler warnings since HASH_SIZE is being defined in multiple
> locations
> 
> Signed-off-by: Nikolay Borisov <kernel@kyup.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
  2016-06-01 16:00 ` [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Eric W. Biederman
@ 2016-06-02  6:27   ` Nikolay Borisov
  2016-06-02 16:19     ` Eric W. Biederman
  2016-06-02  7:49   ` Jan Kara
  1 sibling, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2016-06-02  6:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: john, eparis, jack, linux-kernel, gorcunov, avagin, netdev,
	operations, Linux Containers



On 06/01/2016 07:00 PM, Eric W. Biederman wrote:
> Cc'd the containers list.
> 
> 
> Nikolay Borisov <kernel@kyup.com> writes:
> 
>> Currently the inotify instances/watches are being accounted in the 
>> user_struct structure. This means that in setups where multiple 
>> users in unprivileged containers map to the same underlying 
>> real user (e.g. user_struct) the inotify limits are going to be 
>> shared as well, which can lead to unpleasantries. This is a problem 
>> since any user inside any of the containers can potentially exhaust 
>> the instance/watches limit which in turn might prevent certain 
>> services from other containers from starting.
> 
> On a high level this is a bit problematic as it appears to escape the
> current limits and allows anyone creating a user namespace to have their
> own fresh set of limits.  Given that anyone should be able to create a
> user namespace whenever they feel like it, escaping limits is a problem.
> That however is solvable.

This is indeed a problem and the presented solution is rather dumb in
that regard. I'm happy to work with you on suggestions so that I arrive
at a solution that is upstreamable.

> 
> A practical question.  What kind of limits are we looking at here?
> 
> Are these loose limits for detecting buggy programs that have gone
> off their rails?

Loose limits.

> 
> Are these tight limits to ensure multitasking is possible?
> 
> 
> 
> For tight limits where something is actively controlling the limits you
> probably want a cgroup-based solution.
> 
> For loose limits that are the kind where you set a good default and
> forget about it, I think a user namespace based solution is reasonable.

That's exactly the use case I had in mind.

> 
>> The solution I propose is rather simple, instead of accounting the 
>> watches/instances per user_struct, start accounting them in a hashtable, 
>> where the index used is the hashed pointer of the userns. This way
>> the administrator needn't set the inotify limits very high and also 
>> the risk of one container breaching the limits and affecting every 
>> other container is alleviated.
> 
> I don't think this is the right data structure for a user namespace
> based solution, at least in part because it does not account for users
> escaping.

Admittedly this is a naive solution. What are your ideas on something
which achieves my initial aim of having limits per user, yet doesn't
allow them to just create another namespace and escape those limits? The
current namespace code has a hard-coded limit of 32 for nesting user
namespaces, so in the worst case one can currently escape the limits by
up to 32 * current_limits.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
  2016-06-01 16:00 ` [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Eric W. Biederman
  2016-06-02  6:27   ` Nikolay Borisov
@ 2016-06-02  7:49   ` Jan Kara
  2016-06-02 16:58     ` Eric W. Biederman
  1 sibling, 1 reply; 17+ messages in thread
From: Jan Kara @ 2016-06-02  7:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Nikolay Borisov, john, eparis, jack, linux-kernel, gorcunov,
	avagin, netdev, operations, Linux Containers

On Wed 01-06-16 11:00:06, Eric W. Biederman wrote:
> Cc'd the containers list.
> 
> Nikolay Borisov <kernel@kyup.com> writes:
> 
> > Currently the inotify instances/watches are being accounted in the 
> > user_struct structure. This means that in setups where multiple 
> > users in unprivileged containers map to the same underlying 
> > real user (e.g. user_struct) the inotify limits are going to be 
> > shared as well, which can lead to unpleasantries. This is a problem 
> > since any user inside any of the containers can potentially exhaust 
> > the instance/watches limit which in turn might prevent certain 
> > services from other containers from starting.
> 
> On a high level this is a bit problematic as it appears to escape the
> current limits and allows anyone creating a user namespace to have their
> own fresh set of limits.  Given that anyone should be able to create a
> user namespace whenever they feel like it, escaping limits is a problem.
> That however is solvable.
> 
> A practical question.  What kind of limits are we looking at here?
> 
> Are these loose limits for detecting buggy programs that have gone
> off their rails?
> 
> Are these tight limits to ensure multitasking is possible?

The original motivation for these limits is to limit resource usage.  There
is an in-kernel data structure that is associated with each notification mark
you create and we don't want users to be able to DoS the system by creating
too many of them. Thus we limit number of notification marks for each user.
There is also a limit on the number of notification instances - those are
naturally limited by the number of open file descriptors but admin may want
to limit them more...
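For reference, the per-user limits Jan describes are exposed under /proc/sys/fs/inotify/ (max_user_instances, max_user_watches, max_queued_events) and can be inspected from userspace; a minimal sketch:

```c
#include <stdio.h>

/* Read one of the inotify limit sysctls, e.g.
 * /proc/sys/fs/inotify/max_user_watches (Linux-specific paths).
 * Returns -1 if the file cannot be read. */
static long read_inotify_limit(const char *name)
{
    char path[128];
    FILE *f;
    long val = -1;

    snprintf(path, sizeof(path), "/proc/sys/fs/inotify/%s", name);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
    }
    return val;
}
```

An administrator raises these with e.g. `sysctl fs.inotify.max_user_watches=...`; the typical defaults of the era (128 instances and 8192 watches per user) are exactly the pool that containers sharing a real user end up competing for.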

So cgroups would probably be the best fit for this, but I'm not sure
whether that isn't overkill...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
  2016-06-02  6:27   ` Nikolay Borisov
@ 2016-06-02 16:19     ` Eric W. Biederman
  0 siblings, 0 replies; 17+ messages in thread
From: Eric W. Biederman @ 2016-06-02 16:19 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: john, eparis, jack, linux-kernel, gorcunov, avagin, netdev,
	operations, Linux Containers

Nikolay Borisov <kernel@kyup.com> writes:

> On 06/01/2016 07:00 PM, Eric W. Biederman wrote:
>> Cc'd the containers list.
>> 
>> 
>> Nikolay Borisov <kernel@kyup.com> writes:
>> 
>>> Currently the inotify instances/watches are being accounted in the 
>>> user_struct structure. This means that in setups where multiple 
>>> users in unprivileged containers map to the same underlying 
>>> real user (e.g. user_struct) the inotify limits are going to be 
>>> shared as well, which can lead to unpleasantries. This is a problem 
>>> since any user inside any of the containers can potentially exhaust 
>>> the instance/watches limit which in turn might prevent certain 
>>> services from other containers from starting.
>> 
>> On a high level this is a bit problematic as it appears to escape the
>> current limits and allows anyone creating a user namespace to have their
>> own fresh set of limits.  Given that anyone should be able to create a
>> user namespace whenever they feel like it, escaping limits is a problem.
>> That however is solvable.
>
> This is indeed a problem and the presented solution is rather dumb in
> that regard. I'm happy to work with you on suggestions so that I arrive
> at a solution that is upstreamable.

The one in-kernel solution to hierarchical resource limits that I am
aware of is the current include/linux/page_counter.h, which evolved from
include/linux/res_counter.h
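The charging discipline page_counter implements can be sketched in plain userspace C as follows (a simplified illustration of the idea, not the kernel code; the real implementation also tracks watermarks and failure counts):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* A charge must fit at this level and at every ancestor, so a child's
 * usage can never exceed any limit above it in the hierarchy. */
struct counter {
    atomic_long usage;     /* current number of charged units */
    long limit;            /* maximum allowed at this level */
    struct counter *parent;
};

static bool try_charge(struct counter *c, long n)
{
    struct counter *i, *failed;

    for (i = c; i; i = i->parent) {
        long new = atomic_fetch_add(&i->usage, n) + n;
        if (new > i->limit) {
            atomic_fetch_sub(&i->usage, n);  /* undo this level */
            failed = i;
            goto rollback;
        }
    }
    return true;

rollback:
    for (i = c; i != failed; i = i->parent)  /* undo charged levels */
        atomic_fetch_sub(&i->usage, n);
    return false;
}

static void uncharge(struct counter *c, long n)
{
    for (struct counter *i = c; i; i = i->parent)
        atomic_fetch_sub(&i->usage, n);
}
```

Charging walks from the leaf to the root, so a failure at any ancestor rejects the whole charge and rolls back the levels already charged; uncharge mirrors the walk.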

>> A practical question.  What kind of limits are we looking at here?
>> 
>> Are these loose limits for detecting buggy programs that have gone
>> off their rails?
>
> Loose limits.
>
>> 
>> Are these tight limits to ensure multitasking is possible?
>> 
>> 
>> 
>> For tight limits where something is actively controlling the limits you
>> probably want a cgroup-based solution.
>> 
>> For loose limits that are the kind where you set a good default and
>> forget about it, I think a user namespace based solution is reasonable.
>
> That's exactly the use case I had in mind.
>
>> 
>>> The solution I propose is rather simple, instead of accounting the 
>>> watches/instances per user_struct, start accounting them in a hashtable, 
>>> where the index used is the hashed pointer of the userns. This way
>>> the administrator needn't set the inotify limits very high and also 
>>> the risk of one container breaching the limits and affecting every 
>>> other container is alleviated.
>> 
>> I don't think this is the right data structure for a user namespace
>> based solution, at least in part because it does not account for users
>> escaping.
>
> Admittedly this is a naive solution, what are you ideas on something
> which achieves my initial aim of having limits per users, yet not
> allowing them to just create another namespace and escape them. The
> current namespace code has a hard-coded limit of 32 for nesting user
> namespaces. So currently at the worst case one can escape the limits up
> to 32 * current_limits.

32 is the nesting depth, not the width of the tree.  But see above.

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
  2016-06-02  7:49   ` Jan Kara
@ 2016-06-02 16:58     ` Eric W. Biederman
  2016-06-03 11:14       ` Nikolay Borisov
  0 siblings, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2016-06-02 16:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Nikolay Borisov, john, eparis, linux-kernel, gorcunov, avagin,
	netdev, operations, Linux Containers


Nikolay please see my question for you at the end.

Jan Kara <jack@suse.cz> writes:

> On Wed 01-06-16 11:00:06, Eric W. Biederman wrote:
>> Cc'd the containers list.
>> 
>> Nikolay Borisov <kernel@kyup.com> writes:
>> 
>> > Currently the inotify instances/watches are being accounted in the 
>> > user_struct structure. This means that in setups where multiple 
>> > users in unprivileged containers map to the same underlying 
>> > real user (e.g. user_struct) the inotify limits are going to be 
>> > shared as well, which can lead to unpleasantries. This is a problem 
>> > since any user inside any of the containers can potentially exhaust 
>> > the instance/watches limit which in turn might prevent certain 
>> > services from other containers from starting.
>> 
>> On a high level this is a bit problematic as it appears to escape the
>> current limits and allows anyone creating a user namespace to have their
>> own fresh set of limits.  Given that anyone should be able to create a
>> user namespace whenever they feel like it, escaping limits is a problem.
>> That however is solvable.
>> 
>> A practical question.  What kind of limits are we looking at here?
>> 
>> Are these loose limits for detecting buggy programs that have gone
>> off their rails?
>> 
>> Are these tight limits to ensure multitasking is possible?
>
> The original motivation for these limits is to limit resource usage.  There
> is an in-kernel data structure that is associated with each notification mark
> you create and we don't want users to be able to DoS the system by creating
> too many of them. Thus we limit number of notification marks for each user.
> There is also a limit on the number of notification instances - those are
> naturally limited by the number of open file descriptors but admin may want
> to limit them more...
>
> So cgroups would probably be the best fit for this, but I'm not sure
> whether that isn't overkill...

There is some level of kernel memory accounting in the memory cgroup.

That said my experience with cgroups is that while they are good for
some things the semantics that derive from the userspace API are
problematic.

In the cgroup model objects in the kernel don't belong to a cgroup; they
belong to a task/process.  Those processes belong to a cgroup.
Processes under control of a sufficiently privileged parent are allowed
to switch cgroups.  This causes implementation challenges and a semantic
mismatch in a world where things are typically considered to have an
owner.

Right now fsnotify groups (upon which all of the rest of the inotify
accounting is built) belong to a user.  So there is a semantic
mismatch with cgroups right out of the gate.

Given that cgroups have not chosen to account for individual kernel
objects or give that level of control, I think it is reasonable to look
to other possible solutions, assuming the overhead can be kept under
control.

The implementation of a hierarchical counter in mm/page_counter.c
strongly suggests to me that the overhead can be kept under control.

And yes.  I am thinking of the problem space where you have a limit
based on the problem domain where if an application consumes more than
the limit, the application is likely bonkers.  Which does prevent a DoS
situation in kernel memory, but is different from the problem I have
seen cgroups solve.

The problem I have seen cgroups solve looks like.  Hmm.  I have 8GB of
ram.  I have 3 containers.  Container A can have 4GB, Container B can
have 1GB and container C can have 3GB.  Then I know one container won't
push the other containers into swap.

Perhaps that would tend to be a top-down vs. a bottom-up approach to
coming up with limits.  DoS-prevention limits like the inotify ones
are generally written from the perspective of "if you have more than X
you are crazy", while cgroup limits tend to be thought about top-down,
from a total system management point of view.

So I think there is definitely something to look at.


All of that said there is definitely a practical question that needs to
be asked.  Nikolay how did you get into this situation?  A typical user
namespace configuration will set up uid and gid maps with the help of a
privileged program and not map the uid of the user who created the user
namespace.  Thus avoiding exhausting the limits of the user who created
the container.

Which makes me personally more worried about escaping the existing
limits than exhausting the limits of a particular user.

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
  2016-06-02 16:58     ` Eric W. Biederman
@ 2016-06-03 11:14       ` Nikolay Borisov
  2016-06-03 20:41         ` Eric W. Biederman
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2016-06-03 11:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, john, eparis, linux-kernel, gorcunov, avagin, netdev,
	operations, Linux Containers



On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
> 
> Nikolay please see my question for you at the end.
> 
> Jan Kara <jack@suse.cz> writes:
> 
>> On Wed 01-06-16 11:00:06, Eric W. Biederman wrote:
>>> Cc'd the containers list.
>>>
>>> Nikolay Borisov <kernel@kyup.com> writes:
>>>
>>>> Currently the inotify instances/watches are being accounted in the 
>>>> user_struct structure. This means that in setups where multiple 
>>>> users in unprivileged containers map to the same underlying 
>>>> real user (e.g. user_struct) the inotify limits are going to be 
>>>> shared as well, which can lead to unpleasantries. This is a problem 
>>>> since any user inside any of the containers can potentially exhaust 
>>>> the instance/watches limit which in turn might prevent certain 
>>>> services from other containers from starting.
>>>
>>> On a high level this is a bit problematic as it appears to escape the
>>> current limits and allows anyone creating a user namespace to have their
>>> own fresh set of limits.  Given that anyone should be able to create a
>>> user namespace whenever they feel like it, escaping limits is a problem.
>>> That however is solvable.
>>>
>>> A practical question.  What kind of limits are we looking at here?
>>>
>>> Are these loose limits for detecting buggy programs that have gone
>>> off their rails?
>>>
>>> Are these tight limits to ensure multitasking is possible?
>>
>> The original motivation for these limits is to limit resource usage.  There
>> is an in-kernel data structure that is associated with each notification mark
>> you create and we don't want users to be able to DoS the system by creating
>> too many of them. Thus we limit number of notification marks for each user.
>> There is also a limit on the number of notification instances - those are
>> naturally limited by the number of open file descriptors but admin may want
>> to limit them more...
>>
>> So cgroups would probably be the best fit for this, but I'm not sure
>> whether that isn't overkill...
> 
> There is some level of kernel memory accounting in the memory cgroup.
> 
> That said my experience with cgroups is that while they are good for
> some things the semantics that derive from the userspace API are
> problematic.
> 
> In the cgroup model objects in the kernel don't belong to a cgroup; they
> belong to a task/process.  Those processes belong to a cgroup.
> Processes under control of a sufficiently privileged parent are allowed
> to switch cgroups.  This causes implementation challenges and a semantic
> mismatch in a world where things are typically considered to have an
> owner.
> 
> Right now fsnotify groups (upon which all of the rest of the inotify
> accounting is built) belong to a user.  So there is a semantic
> mismatch with cgroups right out of the gate.
> 
> Given that cgroups have not chosen to account for individual kernel
> objects or give that level of control, I think it is reasonable to look
> to other possible solutions, assuming the overhead can be kept under
> control.
> 
> The implementation of a hierarchical counter in mm/page_counter.c
> strongly suggests to me that the overhead can be kept under control.
> 
> And yes.  I am thinking of the problem space where you have a limit
> based on the problem domain where if an application consumes more than
> the limit, the application is likely bonkers.  Which does prevent a DoS
> situation in kernel memory, but is different from the problem I have
> seen cgroups solve.
> 
> The problem I have seen cgroups solve looks like.  Hmm.  I have 8GB of
> ram.  I have 3 containers.  Container A can have 4GB, Container B can
> have 1GB and container C can have 3GB.  Then I know one container won't
> push the other containers into swap.
> 
> Perhaps that would tend to be a top-down vs. a bottom-up approach to
> coming up with limits.  DoS-prevention limits like the inotify ones
> are generally written from the perspective of "if you have more than X
> you are crazy", while cgroup limits tend to be thought about top-down,
> from a total system management point of view.
> 
> So I think there is definitely something to look at.
> 
> 
> All of that said there is definitely a practical question that needs to
> be asked.  Nikolay how did you get into this situation?  A typical user
> namespace configuration will set up uid and gid maps with the help of a
> privileged program and not map the uid of the user who created the user
> namespace.  Thus avoiding exhausting the limits of the user who created
> the container.

Right, but imagine having multiple containers with identical uid/gid maps.
For LXC-based setups, imagine this:

lxc.id_map = u 0 1337 65536

Now all processes which are running as the same user in different
containers will actually share the underlying user_struct and thus the
inotify limits. In such cases even running multiple instances of 'tail'
in one container will eventually use up all allowed inotify instances/marks.
For this to happen you needn't have complete overlap of the uid
maps; it's enough for at least one UID to overlap between two containers.


So the risk of exhaustion doesn't apply to the privileged user that
created the container and the uid mapping, but rather to the users under
which the various processes in the container are running. Does that make
it clear?
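The exhaustion described above is easy to demonstrate from userspace; a rough sketch (Linux-only, using the real inotify API; note the process's own RLIMIT_NOFILE can also cap the count):

```c
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

#define MAX_TRIES 65536

/* Open inotify instances until the kernel refuses with EMFILE, i.e.
 * until fs.inotify.max_user_instances is exhausted for this real user.
 * Every process running as the same kuid -- in any container whose
 * uid map covers it -- draws from this same per-user pool. */
static int exhaust_inotify_instances(void)
{
    int *fds = malloc(MAX_TRIES * sizeof(*fds));
    int n, count;

    if (!fds)
        return -1;
    for (n = 0; n < MAX_TRIES; n++) {
        int fd = inotify_init1(IN_CLOEXEC);
        if (fd < 0)
            break;              /* EMFILE once a limit is hit */
        fds[n] = fd;
    }
    count = n;
    while (n-- > 0)             /* release the instances again */
        close(fds[n]);
    free(fds);
    return count;
}
```

With the default of 128 instances per user, a handful of `tail -f` processes in containers that share even one kuid is enough to drain the pool for everyone mapping to it.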


> 
> Which makes me personally more worried about escaping the existing
> limits than exhausting the limits of a particular user.

So I thought a bit about it and I guess a solution can be concocted which
utilizes the hierarchical nature of the page counter, where the inotify
limits are set per namespace if you have capable(CAP_SYS_ADMIN). That way
the admin can set one fairly large limit on the init_user_ns and then in
every namespace created one can set smaller limits. That way, for a branch
in the tree (in the nomenclature you used in your previous reply to me),
you will really be upper-bound by the limit set in the namespace which has
->level = 1. For the width of the tree, you will be bound by the
"global" init_user_ns limits. How does that sound?


> 
> Eric
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
  2016-06-03 11:14       ` Nikolay Borisov
@ 2016-06-03 20:41         ` Eric W. Biederman
  2016-06-06  6:41           ` Nikolay Borisov
  0 siblings, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2016-06-03 20:41 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Jan Kara, john, eparis, linux-kernel, gorcunov, avagin, netdev,
	operations, Linux Containers

Nikolay Borisov <kernel@kyup.com> writes:

> On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>> 
>> Nikolay please see my question for you at the end.
[snip] 
>> All of that said there is definitely a practical question that needs to
>> be asked.  Nikolay how did you get into this situation?  A typical user
>> namespace configuration will set up uid and gid maps with the help of a
>> privileged program and not map the uid of the user who created the user
>> namespace.  Thus avoiding exhausting the limits of the user who created
>> the container.
>
> Right, but imagine having multiple containers with identical uid/gid maps.
> For LXC-based setups, imagine this:
>
> lxc.id_map = u 0 1337 65536

So I am only moderately concerned when the containers have overlapping
ids.  Because at some level overlapping ids means they are the same
user.  This is certainly true for file permissions and for other
permissions.  To isolate one container from another it fundamentally
needs to have separate uids and gids on the host system.

> Now all processes which are running as the same user in different
> containers will actually share the underlying user_struct and thus the
> inotify limits. In such cases even running multiple instances of 'tail'
> in one container will eventually use up all allowed inotify instances/marks.
> For this to happen you needn't have complete overlap of the uid
> maps; it's enough for at least one UID to overlap between two containers.
>
>
> So the risk of exhaustion doesn't apply to the privileged user that
> created the container and the uid mapping, but rather to the users under
> which the various processes in the container are running. Does that make
> it clear?

Yes.  That is clear.

>> Which makes me personally more worried about escaping the existing
>> limits than exhausting the limits of a particular user.
>
> So I thought a bit about it and I guess a solution can be concocted which
> utilizes the hierarchical nature of the page counter, where the inotify
> limits are set per namespace if you have capable(CAP_SYS_ADMIN). That way
> the admin can set one fairly large limit on the init_user_ns and then in
> every namespace created one can set smaller limits. That way, for a branch
> in the tree (in the nomenclature you used in your previous reply to me),
> you will really be upper-bound by the limit set in the namespace which has
> ->level = 1. For the width of the tree, you will be bound by the
> "global" init_user_ns limits. How does that sound?

As an addendum to that design, I think there should be an additional
sysctl or two that specifies how much the limit decreases when creating
a new user namespace and when creating a new user in that user
namespace.  That way with a good selection of limits and a limit
decrease people can use the kernel defaults without needing to change
them.

Having default settings that are good enough 99% of the time and that
people don't need to tune, would be my biggest requirement (aside from
being light-weight) for merging something like this.

If things are set-and-forget and even the container case does not need to
be aware then I think we have a design sufficiently robust and different
from what cgroups is doing to make it worth while to have a userns based
solution.

I can see a lot of different limits implemented this way.

Eric

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
  2016-06-03 20:41         ` Eric W. Biederman
@ 2016-06-06  6:41           ` Nikolay Borisov
  2016-06-06 20:00             ` Eric W. Biederman
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2016-06-06  6:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jan Kara, avagin, netdev, Linux Containers, linux-kernel, eparis,
	operations, gorcunov, john



On 06/03/2016 11:41 PM, Eric W. Biederman wrote:
> Nikolay Borisov <kernel@kyup.com> writes:
> 
>> On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>>>
>>> Nikolay please see my question for you at the end.
> [snip] 
>>> All of that said there is definitely a practical question that needs to
>>> be asked.  Nikolay how did you get into this situation?  A typical user
>>> namespace configuration will set up uid and gid maps with the help of a
>>> privileged program and not map the uid of the user who created the user
>>> namespace.  Thus avoiding exhausting the limits of the user who created
>>> the container.
>>
>> Right, but imagine having multiple containers with identical uid/gid maps.
>> For LXC-based setups, imagine this:
>>
>> lxc.id_map = u 0 1337 65536
> 
> So I am only moderately concerned when the containers have overlapping
> ids.  Because at some level overlapping ids means they are the same
> user.  This is certainly true for file permissions and for other
> permissions.  To isolate one container from another it fundamentally
> needs to have separate uids and gids on the host system.
> 
>> Now all processes which are running as the same user in different
>> containers will actually share the underlying user_struct and thus the
>> inotify limits. In such cases even running multiple instances of 'tail'
>> in one container will eventually use up all allowed inotify instances/marks.
>> For this to happen you needn't have complete overlap of the uid
>> maps; it's enough for at least one UID to overlap between two containers.
>>
>>
>> So the risk of exhaustion doesn't apply to the privileged user that
>> created the container and the uid mapping, but rather to the users under
>> which the various processes in the container are running. Does that make
>> it clear?
> 
> Yes.  That is clear.
> 
>>> Which makes me personally more worried about escaping the existing
>>> limits than exhausting the limits of a particular user.
>>
>> So I thought a bit about it and I guess a solution can be concocted which
>> utilizes the hierarchical nature of the page counter, where the inotify
>> limits are set per namespace if you have capable(CAP_SYS_ADMIN). That way
>> the admin can set one fairly large limit on the init_user_ns and then in
>> every namespace created one can set smaller limits. That way, for a branch
>> in the tree (in the nomenclature you used in your previous reply to me),
>> you will really be upper-bound by the limit set in the namespace which has
>> ->level = 1. For the width of the tree, you will be bound by the
>> "global" init_user_ns limits. How does that sound?
> 
> As an addendum to that design, I think there should be an additional
> sysctl or two that specifies how much the limit decreases when creating
> a new user namespace and when creating a new user in that user
> namespace.  That way with a good selection of limits and a limit
> decrease people can use the kernel defaults without needing to change
> them.

I agree that a sysctl which controls how the limits are set for new
namespaces is a good idea. I think it's best if this is a percentage
rather than an absolute value. Also, I'm not sure about the sysctl for
when a user is added to a namespace, since just adding a new user should
fall under the limits of the current userns.

Also, should those sysctls be global or should they be per-namespace? At
this point I'm more inclined to have a global sysctl and maybe refine it
in the future if the need arises.


> 
> Having default settings that are good enough 99% of the time and that
> people don't need to tune, would be my biggest requirement (aside from
> being light-weight) for merging something like this.
> 
> If things are set-and-forget and even the container case does not need
> to be aware, then I think we have a design sufficiently robust and
> different from what cgroups is doing to make it worthwhile to have a
> userns-based solution.

Provided that we agree on the overall design (so far it seems we just
need to iron out the details of the sysctl), I'll be happy to implement
this.


> 
> I can see a lot of different limits implemented this way.
> 
> Eric
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] inotify: Add infrastructure to account inotify limits per-namespace
  2016-06-01  7:52 ` [PATCH 1/4] inotify: Add infrastructure to account inotify limits per-namespace Nikolay Borisov
@ 2016-06-06  8:05   ` Cyrill Gorcunov
  2016-06-06  9:26     ` Nikolay Borisov
  0 siblings, 1 reply; 17+ messages in thread
From: Cyrill Gorcunov @ 2016-06-06  8:05 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: john, eparis, ebiederm, jack, linux-kernel, avagin, netdev, operations

On Wed, Jun 01, 2016 at 10:52:57AM +0300, Nikolay Borisov wrote:
> This patch adds the necessary members to user_struct. The idea behind
> the solution is really simple - use the userns pointers as keys into
> a hash table which holds the inotify instances/watches counts. This
> allows accounting the limits per userns rather than per real user;
> accounting per real user lets a single mapped user in a container
> deplete the inotify resources of all other users which map to the
> exact same real user.
> 
> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
...
> +static inline unsigned long inotify_dec_return_dev(struct user_struct *user,
> +						   void *key)
> +{
> +	struct inotify_state *state;
> +	unsigned long ret;
> +
> +	spin_lock(&user->inotify_lock);
> +	state = __find_inotify_state(user, key);
> +	ret = --state->inotify_devs;
> +	spin_unlock(&user->inotify_lock);
> +
> +	return ret;
> +}

Hi Nikolay! Could you please explain why this new function is not used
anywhere in the other patches, or am I missing something obvious?


* Re: [PATCH 1/4] inotify: Add infrastructure to account inotify limits per-namespace
  2016-06-06  8:05   ` Cyrill Gorcunov
@ 2016-06-06  9:26     ` Nikolay Borisov
  0 siblings, 0 replies; 17+ messages in thread
From: Nikolay Borisov @ 2016-06-06  9:26 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: john, eparis, ebiederm, jack, linux-kernel, avagin, netdev, operations



On 06/06/2016 11:05 AM, Cyrill Gorcunov wrote:
> On Wed, Jun 01, 2016 at 10:52:57AM +0300, Nikolay Borisov wrote:
>> This patch adds the necessary members to user_struct. The idea behind
>> the solution is really simple - use the userns pointers as keys into
>> a hash table which holds the inotify instances/watches counts. This
>> allows accounting the limits per userns rather than per real user;
>> accounting per real user lets a single mapped user in a container
>> deplete the inotify resources of all other users which map to the
>> exact same real user.
>>
>> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ...
>> +static inline unsigned long inotify_dec_return_dev(struct user_struct *user,
>> +						   void *key)
>> +{
>> +	struct inotify_state *state;
>> +	unsigned long ret;
>> +
>> +	spin_lock(&user->inotify_lock);
>> +	state = __find_inotify_state(user, key);
>> +	ret = --state->inotify_devs;
>> +	spin_unlock(&user->inotify_lock);
>> +
>> +	return ret;
>> +}
> 
> Hi Nikolay! Could you please explain why this new function is not used
> anywhere in the other patches, or am I missing something obvious?

Hi Cyrill,

It seems this is a left-over from an earlier, internal version of this
patchset, so you can disregard it. Also, given the direction that the
discussion with Eric took, I think I will be redesigning the solution
entirely. Thanks for taking the time to read the code!


Nikolay


* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
  2016-06-06  6:41           ` Nikolay Borisov
@ 2016-06-06 20:00             ` Eric W. Biederman
  0 siblings, 0 replies; 17+ messages in thread
From: Eric W. Biederman @ 2016-06-06 20:00 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Jan Kara, avagin, netdev, Linux Containers, linux-kernel, eparis,
	operations, gorcunov, john

Nikolay Borisov <kernel@kyup.com> writes:

> On 06/03/2016 11:41 PM, Eric W. Biederman wrote:
>> Nikolay Borisov <kernel@kyup.com> writes:
>> 
>>> On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>>>>
>>>> Nikolay please see my question for you at the end.
>> [snip] 
>>>> All of that said there is definitely a practical question that needs to
>>>> be asked.  Nikolay how did you get into this situation?  A typical user
>>>> namespace configuration will set up uid and gid maps with the help of a
>>>> privileged program and not map the uid of the user who created the user
>>>> namespace.  Thus avoiding exhausting the limits of the user who created
>>>> the container.
>>>
>>> Right but imagine having multiple containers with identical uid/gid maps
>>> for LXC-based setups imagine this:
>>>
>>> lxc.id_map = u 0 1337 65536
>> 
>> So I am only moderately concerned when the containers have overlapping
>> ids.  Because at some level overlapping ids means they are the same
>> user.  This is certainly true for file permissions and for other
>> permissions.  To isolate one container from another it fundamentally
>> needs to have separate uids and gids on the host system.
>> 
>>> Now all processes which are running with the same user on different
>>> containers will actually share the underlying user_struct thus the
>>> inotify limits. In such cases even running multiple instances of 'tail'
>>> in one container will eventually use all allowed inotify/mark instances.
>>> For this to happen you needn't also have complete overlap of the uid
>>> map, it's enough to have at least one UID between 2 containers overlap.
>>>
>>>
>>> So the risk of exhaustion doesn't apply to the privileged user that
>>> created the container and the uid mapping, but rather the users under
>>> which the various processes in the container are running. Does that make
>>> it clear?
>> 
>> Yes.  That is clear.
>> 
>>>> Which makes me personally more worried about escaping the existing
>>>> limits than exhausting the limits of a particular user.
>>>
>>> So I thought a bit about it and I guess a solution can be concocted
>>> which utilizes the hierarchical nature of the page counter, where the
>>> inotify limits are set per namespace if you have capable(CAP_SYS_ADMIN).
>>> That way the admin can set one fairly large limit on the init_user_ns
>>> and then in every namespace created one can set smaller limits. That
>>> way, for a branch in the tree (in the nomenclature you used in your
>>> previous reply to me) you will really be upper-bounded by the limit
>>> set in the namespace which has ->level = 1. For the width of the tree,
>>> you will be bound by the "global" init_user_ns limits. How does that
>>> sound?
>> 
>> As an addendum to that design, I think there should be an additional
>> sysctl or two that specifies how much the limit decreases when creating
>> a new user namespace and when creating a new user in that user
>> namespace.  That way, with a good selection of limits and limit
>> decreases, people can use the kernel defaults without needing to change
>> them.
>
> I agree that a sysctl which controls how the limits are set for new
> namespaces is a good idea. I think it's best if this is expressed as a
> percentage rather than some absolute value. I'm also not sure about the
> sysctl for when a user is added to a namespace, since just adding a new
> user should fall under the limits of the current userns.

My hunch is that a reserve per namespace as an absolute number will be
easier to implement and analyze, but I don't much care.

I meant that we have a tree where we track created inotify things
that looks like:

                    uns0:
      +-------------//\\----------+
     /       /------/  \----\      \
   user1  user2            user3  user4
                   +-------//\\--------+
                  /     /--/  \---\     \
                uns1  uns2       uns3  uns4
              +-------//\\---------+
             /    /---/  \---\      \
          user5 user6       user7  user8


Allowing a hierarchical tracking of things per user and per user
namespace.

The limits programmed with the sysctl would look something like they do
today.
> Also, should those sysctls be global or per-namespace? At this point
> I'm more inclined to have a global sysctl and maybe refine it in the
> future if the need arises.

I think at the end of the day per-namespace is interesting.  We
certainly need to track the values as if they were per namespace.

However, given that this should be a set-up-and-forget kind of
operation, we don't need to worry about how to implement the sysctl
settings as per-namespace until everything else is sorted out.

>> Having default settings that are good enough 99% of the time and that
>> people don't need to tune, would be my biggest requirement (aside from
>> being light-weight) for merging something like this.
>> 
>> If things are set-and-forget and even the container case does not need
>> to be aware, then I think we have a design sufficiently robust and
>> different from what cgroups is doing to make it worthwhile to have a
>> userns-based solution.
>
> Provided that we agree on the overall design (so far it seems we just
> need to iron out the details of the sysctl), I'll be happy to implement
> this.

Thanks.  There are some other limits that need to be implemented in
this style that are more important to me: the maximum number of user
namespaces, max number of pid namespaces, max number of mount
namespaces, etc.  Those limits I will gladly implement, as I can
finally see how to make all of this just work.  Which is to say, the
per-userns, per-user data structures that hold the counts will be
worth creating generically.

There is no need to generalize the code prematurely; I think it makes
sense to sort out the logic on whichever limit we implement first, and
then the rest of the interesting limits can just follow the pattern
that gets laid down.

Eric


end of thread, other threads:[~2016-06-06 20:12 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-01  7:52 [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Nikolay Borisov
2016-06-01  7:52 ` [PATCH 1/4] inotify: Add infrastructure to account inotify limits per-namespace Nikolay Borisov
2016-06-06  8:05   ` Cyrill Gorcunov
2016-06-06  9:26     ` Nikolay Borisov
2016-06-01  7:52 ` [PATCH 2/4] inotify: Convert inotify limits to be accounted per-realuser/per-namespace Nikolay Borisov
2016-06-01  7:52 ` [PATCH 3/4] misc: Rename the HASH_SIZE macro Nikolay Borisov
2016-06-01 18:13   ` David Miller
2016-06-01  7:53 ` [PATCH 4/4] inotify: Don't include inotify.h when !CONFIG_INOTIFY_USER Nikolay Borisov
2016-06-01 16:00 ` [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Eric W. Biederman
2016-06-02  6:27   ` Nikolay Borisov
2016-06-02 16:19     ` Eric W. Biederman
2016-06-02  7:49   ` Jan Kara
2016-06-02 16:58     ` Eric W. Biederman
2016-06-03 11:14       ` Nikolay Borisov
2016-06-03 20:41         ` Eric W. Biederman
2016-06-06  6:41           ` Nikolay Borisov
2016-06-06 20:00             ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).