linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2 v6] user namespace: use union in {g,u}idmap struct
@ 2017-10-24 22:04 Christian Brauner
  2017-10-24 22:04 ` [PATCH 2/2 v6] user namespaces: bump idmap limits to 340 Christian Brauner
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Brauner @ 2017-10-24 22:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: ebiederm, serge, tycho, Christian Brauner

- Add a struct containing two pointers to extents and wrap both the static extent
  array and the struct into a union. This is done in preparation for bumping the
  {g,u}idmap limits for user namespaces.
- Add brackets around anonymous union when using designated initializers to
  initialize members in order to please gcc <= 4.4.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
Changelog 2017-10-24:
* no changes

Changelog 2017-10-19:
* kernel/user.c: Use brackets around anonymous union when using designated
  initializers to initialize members. This is done to please gcc <= 4.4.
---
---
 include/linux/user_namespace.h | 18 +++++++++++++-----
 kernel/user.c                  | 30 ++++++++++++++++++------------
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index c18e01252346..7c83d7f6289b 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -12,13 +12,21 @@
 
 #define UID_GID_MAP_MAX_EXTENTS 5
 
+struct uid_gid_extent {
+	u32 first;
+	u32 lower_first;
+	u32 count;
+};
+
 struct uid_gid_map {	/* 64 bytes -- 1 cache line */
 	u32 nr_extents;
-	struct uid_gid_extent {
-		u32 first;
-		u32 lower_first;
-		u32 count;
-	} extent[UID_GID_MAP_MAX_EXTENTS];
+	union {
+		struct uid_gid_extent extent[UID_GID_MAP_MAX_EXTENTS];
+		struct {
+			struct uid_gid_extent *forward;
+			struct uid_gid_extent *reverse;
+		};
+	};
 };
 
 #define USERNS_SETGROUPS_ALLOWED 1UL
diff --git a/kernel/user.c b/kernel/user.c
index 00281add65b2..9a20acce460d 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -26,26 +26,32 @@
 struct user_namespace init_user_ns = {
 	.uid_map = {
 		.nr_extents = 1,
-		.extent[0] = {
-			.first = 0,
-			.lower_first = 0,
-			.count = 4294967295U,
+		{
+			.extent[0] = {
+				.first = 0,
+				.lower_first = 0,
+				.count = 4294967295U,
+			},
 		},
 	},
 	.gid_map = {
 		.nr_extents = 1,
-		.extent[0] = {
-			.first = 0,
-			.lower_first = 0,
-			.count = 4294967295U,
+		{
+			.extent[0] = {
+				.first = 0,
+				.lower_first = 0,
+				.count = 4294967295U,
+			},
 		},
 	},
 	.projid_map = {
 		.nr_extents = 1,
-		.extent[0] = {
-			.first = 0,
-			.lower_first = 0,
-			.count = 4294967295U,
+		{
+			.extent[0] = {
+				.first = 0,
+				.lower_first = 0,
+				.count = 4294967295U,
+			},
 		},
 	},
 	.count = ATOMIC_INIT(3),
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/2 v6] user namespaces: bump idmap limits to 340
  2017-10-24 22:04 [PATCH 1/2 v6] user namespace: use union in {g,u}idmap struct Christian Brauner
@ 2017-10-24 22:04 ` Christian Brauner
  2017-10-31 23:46   ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Christian Brauner @ 2017-10-24 22:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: ebiederm, serge, tycho, Christian Brauner

There are quite a few use cases where users run into the current limit for
{g,u}id mappings. Consider a user requesting us to map everything but 999 and
1001 for a given range of 1000000000 with a sub{g,u}id layout of:

some-user:100000:1000000000
some-user:999:1
some-user:1000:1
some-user:1001:1
some-user:1002:1

This translates to:

MAPPING-TYPE | CONTAINER |    HOST |     RANGE |
-------------|-----------|---------|-----------|
         uid |       999 |     999 |         1 |
         uid |      1001 |    1001 |         1 |
         uid |         0 | 1000000 |       999 |
         uid |      1000 | 1001000 |         1 |
         uid |      1002 | 1001002 | 999998998 |
------------------------------------------------
         gid |       999 |     999 |         1 |
         gid |      1001 |    1001 |         1 |
         gid |         0 | 1000000 |       999 |
         gid |      1000 | 1001000 |         1 |
         gid |      1002 | 1001002 | 999998998 |

which already hits the current limit of 5 extents per map.

As discussed at LPC, simply bumping the limit is not going to work, since
struct uid_gid_map would then no longer fit into a single cache line,
regressing performance for the base cases. The same problem seems to arise
when using a single pointer. So the idea is to use

struct uid_gid_extent {
	u32 first;
	u32 lower_first;
	u32 count;
};

struct uid_gid_map { /* 64 bytes -- 1 cache line */
	u32 nr_extents;
	union {
		struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
		struct {
			struct uid_gid_extent *forward;
			struct uid_gid_extent *reverse;
		};
	};
};

For the base cases we will only use the struct uid_gid_extent extent member. If
we go over UID_GID_MAP_MAX_BASE_EXTENTS mappings we perform a single 4k
kmalloc() which means we can have a maximum of 340 mappings
(340 * sizeof(struct uid_gid_extent) = 4080). For the latter case we use two
pointers, "forward" and "reverse". The forward pointer points to an array sorted
by "first" and the reverse pointer points to an array sorted by "lower_first".
We can then perform binary search on those arrays.

Performance Testing:
When Eric introduced the extent-based struct uid_gid_map approach he measured
the performance impact of his idmap changes:

> My benchmark consisted of going to single user mode where nothing else was
> running. On an ext4 filesystem opening 1,000,000 files and looping through all
> of the files 1000 times and calling fstat on the individuals files. This was
> to ensure I was benchmarking stat times where the inodes were in the kernels
> cache, but the inode values were not in the processors cache. My results:

> v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)

I used an identical approach on my laptop. Here's a thorough description of what
I did. I built a 4.14.0-rc4 mainline kernel with my new idmap patches applied. I
booted into single user mode and used an ext4 filesystem to open/create
1,000,000 files. Then I looped through all of the files calling fstat() on each
of them 1000 times and calculated the mean fstat() time for a single file. (The
test program can be found below.)

Here are the results. For fun, I compared the first version of my patch which
scaled linearly with the new version of the patch:

|   # MAPPINGS |   PATCH-V1 | PATCH-NEW |
|--------------|------------|-----------|
|   0 mappings |     158 ns |   158 ns  |
|   1 mappings |     164 ns |   157 ns  |
|   2 mappings |     170 ns |   158 ns  |
|   3 mappings |     175 ns |   161 ns  |
|   5 mappings |     187 ns |   165 ns  |
|  10 mappings |     218 ns |   199 ns  |
|  50 mappings |     528 ns |   218 ns  |
| 100 mappings |     980 ns |   229 ns  |
| 200 mappings |    1880 ns |   239 ns  |
| 300 mappings |    2760 ns |   240 ns  |
| 340 mappings | not tested |   248 ns  |

Here's the test program I used. I asked Eric what he did and this is a more
"advanced" implementation of the idea. It's pretty straightforward:

 #define _GNU_SOURCE
 #define __STDC_FORMAT_MACROS
 #include <errno.h>
 #include <dirent.h>
 #include <fcntl.h>
 #include <inttypes.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <sys/types.h>

 int main(int argc, char *argv[])
 {
 	int ret;
 	size_t i, k;
 	int fd[1000000];
 	int times[1000];
 	char pathname[4096];
 	struct stat st;
 	struct timeval t1, t2;
 	uint64_t time_in_mcs;
 	uint64_t sum = 0;

 	if (argc != 2) {
 		fprintf(stderr, "Please specify a directory where to create "
 				"the test files\n");
 		exit(EXIT_FAILURE);
 	}

 	for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
 		sprintf(pathname, "%s/idmap_test_%zu", argv[1], i);
 		fd[i]= open(pathname, O_RDWR | O_CREAT, S_IXUSR | S_IXGRP | S_IXOTH);
 		if (fd[i] < 0) {
 			ssize_t j;
 			for (j = i; j >= 0; j--)
 				close(fd[j]);
 			exit(EXIT_FAILURE);
 		}
 	}

 	for (k = 0; k < 1000; k++) {
 		ret = gettimeofday(&t1, NULL);
 		if (ret < 0)
 			goto close_all;

 		for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
 			ret = fstat(fd[i], &st);
 			if (ret < 0)
 				goto close_all;
 		}

 		ret = gettimeofday(&t2, NULL);
 		if (ret < 0)
 			goto close_all;

 		time_in_mcs = (1000000 * t2.tv_sec + t2.tv_usec) -
 			      (1000000 * t1.tv_sec + t1.tv_usec);
 		printf("Total time in micro seconds:       %" PRIu64 "\n",
 		       time_in_mcs);
 		printf("Total time in nanoseconds:         %" PRIu64 "\n",
 		       time_in_mcs * 1000);
 		printf("Time per file in nanoseconds:      %" PRIu64 "\n",
 		       (time_in_mcs * 1000) / 1000000);
 		times[k] = (time_in_mcs * 1000) / 1000000;
 	}

 close_all:
 	for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++)
 		close(fd[i]);

 	if (ret < 0)
 		exit(EXIT_FAILURE);

 	for (k = 0; k < 1000; k++) {
 		sum += times[k];
 	}

 	printf("Mean time per file in nanoseconds: %" PRIu64 "\n", sum / 1000);

 	exit(EXIT_SUCCESS);
 }

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
CC: Serge Hallyn <serge@hallyn.com>
CC: Eric Biederman <ebiederm@xmission.com>
---
Changelog 2017-10-24:
* kernel/user_namespace.c:insert_extent():
  Check for NULL after kmalloc() instead of IS_ERR().
  Before this codepath did:

  if (IS_ERR(forward))
          return PTR_ERR(forward);

 But kmalloc() returns NULL on failure rather than an ERR_PTR(). So let's
 change this to:

  if (!forward)
          return -ENOMEM;

Changelog 2017-10-22:
* kernel/user_namespace.c:sort_idmaps():
  Check for NULL after kmemdup() instead of IS_ERR().
  Before this codepath did:

  if (IS_ERR(map->reverse)) {
          map->reverse = NULL;
          return PTR_ERR(map->reverse);
 }

 But kmemdup() returns NULL on failure rather than an ERR_PTR(). So let's
 change this to:

 if (!map->reverse)
          return -ENOMEM;

Changelog 2017-10-19:
* kernel/user_namespace.c: Remove unnecessary mutex when freeing extents.
* non-functional: Fix typos in commit message.
* non-functional: Add headers to test program in commit message to make tests I
  performed easily reproducible.
---
---
 include/linux/user_namespace.h |   7 +-
 kernel/user_namespace.c        | 350 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 324 insertions(+), 33 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 7c83d7f6289b..1c1046a60fb4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -10,7 +10,8 @@
 #include <linux/sysctl.h>
 #include <linux/err.h>
 
-#define UID_GID_MAP_MAX_EXTENTS 5
+#define UID_GID_MAP_MAX_BASE_EXTENTS 5
+#define UID_GID_MAP_MAX_EXTENTS 340
 
 struct uid_gid_extent {
 	u32 first;
@@ -18,10 +19,10 @@ struct uid_gid_extent {
 	u32 count;
 };
 
-struct uid_gid_map {	/* 64 bytes -- 1 cache line */
+struct uid_gid_map { /* 64 bytes -- 1 cache line */
 	u32 nr_extents;
 	union {
-		struct uid_gid_extent extent[UID_GID_MAP_MAX_EXTENTS];
+		struct uid_gid_extent extent[UID_GID_MAP_MAX_BASE_EXTENTS];
 		struct {
 			struct uid_gid_extent *forward;
 			struct uid_gid_extent *reverse;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index c490f1e4313b..5fd2d53dbc75 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -23,6 +23,8 @@
 #include <linux/ctype.h>
 #include <linux/projid.h>
 #include <linux/fs_struct.h>
+#include <linux/bsearch.h>
+#include <linux/sort.h>
 
 static struct kmem_cache *user_ns_cachep __read_mostly;
 static DEFINE_MUTEX(userns_state_mutex);
@@ -181,6 +183,18 @@ static void free_user_ns(struct work_struct *work)
 	do {
 		struct ucounts *ucounts = ns->ucounts;
 		parent = ns->parent;
+		if (ns->gid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
+			kfree(ns->gid_map.forward);
+			kfree(ns->gid_map.reverse);
+		}
+		if (ns->uid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
+			kfree(ns->uid_map.forward);
+			kfree(ns->uid_map.reverse);
+		}
+		if (ns->projid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
+			kfree(ns->projid_map.forward);
+			kfree(ns->projid_map.reverse);
+		}
 		retire_userns_sysctls(ns);
 #ifdef CONFIG_PERSISTENT_KEYRINGS
 		key_put(ns->persistent_keyring_register);
@@ -198,7 +212,84 @@ void __put_user_ns(struct user_namespace *ns)
 }
 EXPORT_SYMBOL(__put_user_ns);
 
-static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
+/**
+ * idmap_key struct holds the information necessary to find an idmapping in a
+ * sorted idmap array. It is passed to cmp_map_id() as first argument.
+ */
+struct idmap_key {
+	bool map_up; /* true  -> id from kid; false -> kid from id */
+	u32 id; /* id to find */
+	u32 count; /* == 0 unless used with map_id_range_down() */
+};
+
+/**
+ * cmp_map_id - Function to be passed to bsearch() to find the requested
+ * idmapping. Expects struct idmap_key to be passed via @k.
+ */
+static int cmp_map_id(const void *k, const void *e)
+{
+	u32 first, last, id2;
+	const struct idmap_key *key = k;
+	const struct uid_gid_extent *el = e;
+
+	/* handle map_id_range_down() */
+	if (key->count)
+		id2 = key->id + key->count - 1;
+	else
+		id2 = key->id;
+
+	/* handle map_id_{down,up}() */
+	if (key->map_up)
+		first = el->lower_first;
+	else
+		first = el->first;
+
+	last = first + el->count - 1;
+
+	if (key->id >= first && key->id <= last &&
+	    (id2 >= first && id2 <= last))
+		return 0;
+
+	if (key->id < first || id2 < first)
+		return -1;
+
+	return 1;
+}
+
+/**
+ * map_id_range_down_max - Find idmap via binary search in ordered idmap array.
+ * Can only be called if number of mappings exceeds UID_GID_MAP_MAX_BASE_EXTENTS.
+ */
+static u32 map_id_range_down_max(struct uid_gid_map *map, u32 id, u32 count)
+{
+	u32 extents;
+	struct uid_gid_extent *extent;
+	struct idmap_key key;
+
+	key.map_up = false;
+	key.count = count;
+	key.id = id;
+
+	extents = map->nr_extents;
+	smp_rmb();
+
+	extent = bsearch(&key, map->forward, extents,
+			 sizeof(struct uid_gid_extent), cmp_map_id);
+	/* Map the id or note failure */
+	if (extent)
+		id = (id - extent->first) + extent->lower_first;
+	else
+		id = (u32) -1;
+
+	return id;
+}
+
+/**
+ * map_id_range_down_base - Find idmap via binary search in static extent array.
+ * Can only be called if number of mappings is equal or less than
+ * UID_GID_MAP_MAX_BASE_EXTENTS.
+ */
+static u32 map_id_range_down_base(struct uid_gid_map *map, u32 id, u32 count)
 {
 	unsigned idx, extents;
 	u32 first, last, id2;
@@ -224,7 +315,23 @@ static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
 	return id;
 }
 
-static u32 map_id_down(struct uid_gid_map *map, u32 id)
+static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
+{
+	u32 extents = map->nr_extents;
+	smp_rmb();
+
+	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+		return map_id_range_down_base(map, id, count);
+
+	return map_id_range_down_max(map, id, count);
+}
+
+/**
+ * map_id_down_base - Find idmap via binary search in static extent array.
+ * Can only be called if number of mappings is equal or less than
+ * UID_GID_MAP_MAX_BASE_EXTENTS.
+ */
+static u32 map_id_down_base(struct uid_gid_map *map, u32 id)
 {
 	unsigned idx, extents;
 	u32 first, last;
@@ -247,7 +354,23 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id)
 	return id;
 }
 
-static u32 map_id_up(struct uid_gid_map *map, u32 id)
+static u32 map_id_down(struct uid_gid_map *map, u32 id)
+{
+	u32 extents = map->nr_extents;
+	smp_rmb();
+
+	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+		return map_id_down_base(map, id);
+
+	return map_id_range_down_max(map, id, 0);
+}
+
+/**
+ * map_id_up_base - Find idmap via binary search in static extent array.
+ * Can only be called if number of mappings is equal or less than
+ * UID_GID_MAP_MAX_BASE_EXTENTS.
+ */
+static u32 map_id_up_base(struct uid_gid_map *map, u32 id)
 {
 	unsigned idx, extents;
 	u32 first, last;
@@ -270,6 +393,45 @@ static u32 map_id_up(struct uid_gid_map *map, u32 id)
 	return id;
 }
 
+/**
+ * map_id_up_max - Find idmap via binary search in ordered idmap array.
+ * Can only be called if number of mappings exceeds UID_GID_MAP_MAX_BASE_EXTENTS.
+ */
+static u32 map_id_up_max(struct uid_gid_map *map, u32 id)
+{
+	u32 extents;
+	struct uid_gid_extent *extent;
+	struct idmap_key key;
+
+	key.map_up = true;
+	key.count = 0;
+	key.id = id;
+
+	extents = map->nr_extents;
+	smp_rmb();
+
+	extent = bsearch(&key, map->reverse, extents,
+			 sizeof(struct uid_gid_extent), cmp_map_id);
+	/* Map the id or note failure */
+	if (extent)
+		id = (id - extent->lower_first) + extent->first;
+	else
+		id = (u32) -1;
+
+	return id;
+}
+
+static u32 map_id_up(struct uid_gid_map *map, u32 id)
+{
+	u32 extents = map->nr_extents;
+	smp_rmb();
+
+	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+		return map_id_up_base(map, id);
+
+	return map_id_up_max(map, id);
+}
+
 /**
  *	make_kuid - Map a user-namespace uid pair into a kuid.
  *	@ns:  User namespace that the uid is in
@@ -540,13 +702,15 @@ static int projid_m_show(struct seq_file *seq, void *v)
 static void *m_start(struct seq_file *seq, loff_t *ppos,
 		     struct uid_gid_map *map)
 {
-	struct uid_gid_extent *extent = NULL;
 	loff_t pos = *ppos;
 
-	if (pos < map->nr_extents)
-		extent = &map->extent[pos];
+	if (pos >= map->nr_extents)
+		return NULL;
+
+	if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+		return &map->extent[pos];
 
-	return extent;
+	return &map->forward[pos];
 }
 
 static void *uid_m_start(struct seq_file *seq, loff_t *ppos)
@@ -618,7 +782,10 @@ static bool mappings_overlap(struct uid_gid_map *new_map,
 		u32 prev_upper_last, prev_lower_last;
 		struct uid_gid_extent *prev;
 
-		prev = &new_map->extent[idx];
+		if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+			prev = &new_map->extent[idx];
+		else
+			prev = &new_map->forward[idx];
 
 		prev_upper_first = prev->first;
 		prev_lower_first = prev->lower_first;
@@ -638,6 +805,102 @@ static bool mappings_overlap(struct uid_gid_map *new_map,
 	return false;
 }
 
+/**
+ * insert_extent - Safely insert a new idmap extent into struct uid_gid_map.
+ * Takes care to allocate a 4K block of memory if the number of mappings exceeds
+ * UID_GID_MAP_MAX_BASE_EXTENTS.
+ */
+static int insert_extent(struct uid_gid_map *map, struct uid_gid_extent *extent)
+{
+	if (map->nr_extents < UID_GID_MAP_MAX_BASE_EXTENTS) {
+		map->extent[map->nr_extents].first = extent->first;
+		map->extent[map->nr_extents].lower_first = extent->lower_first;
+		map->extent[map->nr_extents].count = extent->count;
+		return 0;
+	}
+
+	if (map->nr_extents == UID_GID_MAP_MAX_BASE_EXTENTS) {
+		struct uid_gid_extent *forward;
+
+		/* Allocate memory for 340 mappings. */
+		forward = kmalloc(sizeof(struct uid_gid_extent) *
+				 UID_GID_MAP_MAX_EXTENTS, GFP_KERNEL);
+		if (!forward)
+			return -ENOMEM;
+
+		/* Copy over memory. Only set up memory for the forward pointer.
+		 * Defer the memory setup for the reverse pointer.
+		 */
+		memcpy(forward, map->extent,
+		       map->nr_extents * sizeof(map->extent[0]));
+
+		map->forward = forward;
+		map->reverse = NULL;
+	}
+
+	map->forward[map->nr_extents].first = extent->first;
+	map->forward[map->nr_extents].lower_first = extent->lower_first;
+	map->forward[map->nr_extents].count = extent->count;
+	return 0;
+}
+
+/* cmp function to sort() forward mappings */
+static int cmp_extents_forward(const void *a, const void *b)
+{
+	const struct uid_gid_extent *e1 = a;
+	const struct uid_gid_extent *e2 = b;
+
+	if (e1->first < e2->first)
+		return -1;
+
+	if (e1->first > e2->first)
+		return 1;
+
+	return 0;
+}
+
+/* cmp function to sort() reverse mappings */
+static int cmp_extents_reverse(const void *a, const void *b)
+{
+	const struct uid_gid_extent *e1 = a;
+	const struct uid_gid_extent *e2 = b;
+
+	if (e1->lower_first < e2->lower_first)
+		return -1;
+
+	if (e1->lower_first > e2->lower_first)
+		return 1;
+
+	return 0;
+}
+
+/**
+ * sort_idmaps - Sorts an array of idmap entries.
+ * Can only be called if number of mappings exceeds UID_GID_MAP_MAX_BASE_EXTENTS.
+ */
+static int sort_idmaps(struct uid_gid_map *map)
+{
+	if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+		return 0;
+
+	/* Sort forward array. */
+	sort(map->forward, map->nr_extents, sizeof(struct uid_gid_extent),
+	     cmp_extents_forward, NULL);
+
+	/* Only copy the memory from forward we actually need. */
+	map->reverse = kmemdup(map->forward,
+			       map->nr_extents * sizeof(struct uid_gid_extent),
+			       GFP_KERNEL);
+	if (!map->reverse)
+		return -ENOMEM;
+
+	/* Sort reverse array. */
+	sort(map->reverse, map->nr_extents, sizeof(struct uid_gid_extent),
+	     cmp_extents_reverse, NULL);
+
+	return 0;
+}
+
 static ssize_t map_write(struct file *file, const char __user *buf,
 			 size_t count, loff_t *ppos,
 			 int cap_setid,
@@ -648,7 +911,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	struct user_namespace *ns = seq->private;
 	struct uid_gid_map new_map;
 	unsigned idx;
-	struct uid_gid_extent *extent = NULL;
+	struct uid_gid_extent extent;
 	char *kbuf = NULL, *pos, *next_line;
 	ssize_t ret = -EINVAL;
 
@@ -673,6 +936,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	 */
 	mutex_lock(&userns_state_mutex);
 
+	memset(&new_map, 0, sizeof(struct uid_gid_map));
+
 	ret = -EPERM;
 	/* Only allow one successful write to the map */
 	if (map->nr_extents != 0)
@@ -700,9 +965,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	/* Parse the user data */
 	ret = -EINVAL;
 	pos = kbuf;
-	new_map.nr_extents = 0;
 	for (; pos; pos = next_line) {
-		extent = &new_map.extent[new_map.nr_extents];
 
 		/* Find the end of line and ensure I don't look past it */
 		next_line = strchr(pos, '\n');
@@ -714,17 +977,17 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 		}
 
 		pos = skip_spaces(pos);
-		extent->first = simple_strtoul(pos, &pos, 10);
+		extent.first = simple_strtoul(pos, &pos, 10);
 		if (!isspace(*pos))
 			goto out;
 
 		pos = skip_spaces(pos);
-		extent->lower_first = simple_strtoul(pos, &pos, 10);
+		extent.lower_first = simple_strtoul(pos, &pos, 10);
 		if (!isspace(*pos))
 			goto out;
 
 		pos = skip_spaces(pos);
-		extent->count = simple_strtoul(pos, &pos, 10);
+		extent.count = simple_strtoul(pos, &pos, 10);
 		if (*pos && !isspace(*pos))
 			goto out;
 
@@ -734,29 +997,33 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 			goto out;
 
 		/* Verify we have been given valid starting values */
-		if ((extent->first == (u32) -1) ||
-		    (extent->lower_first == (u32) -1))
+		if ((extent.first == (u32) -1) ||
+		    (extent.lower_first == (u32) -1))
 			goto out;
 
 		/* Verify count is not zero and does not cause the
 		 * extent to wrap
 		 */
-		if ((extent->first + extent->count) <= extent->first)
+		if ((extent.first + extent.count) <= extent.first)
 			goto out;
-		if ((extent->lower_first + extent->count) <=
-		     extent->lower_first)
+		if ((extent.lower_first + extent.count) <=
+		     extent.lower_first)
 			goto out;
 
 		/* Do the ranges in extent overlap any previous extents? */
-		if (mappings_overlap(&new_map, extent))
+		if (mappings_overlap(&new_map, &extent))
 			goto out;
 
-		new_map.nr_extents++;
-
-		/* Fail if the file contains too many extents */
-		if ((new_map.nr_extents == UID_GID_MAP_MAX_EXTENTS) &&
+		if ((new_map.nr_extents + 1) == UID_GID_MAP_MAX_EXTENTS &&
 		    (next_line != NULL))
 			goto out;
+
+		ret = insert_extent(&new_map, &extent);
+		if (ret < 0)
+			goto out;
+		ret = -EINVAL;
+
+		new_map.nr_extents++;
 	}
 	/* Be very certaint the new map actually exists */
 	if (new_map.nr_extents == 0)
@@ -767,16 +1034,26 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
 		goto out;
 
+	ret = sort_idmaps(&new_map);
+	if (ret < 0)
+		goto out;
+
+	ret = -EPERM;
 	/* Map the lower ids from the parent user namespace to the
 	 * kernel global id space.
 	 */
 	for (idx = 0; idx < new_map.nr_extents; idx++) {
+		struct uid_gid_extent *e;
 		u32 lower_first;
-		extent = &new_map.extent[idx];
+
+		if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+			e = &new_map.extent[idx];
+		else
+			e = &new_map.forward[idx];
 
 		lower_first = map_id_range_down(parent_map,
-						extent->lower_first,
-						extent->count);
+						e->lower_first,
+						e->count);
 
 		/* Fail if we can not map the specified extent to
 		 * the kernel global id space.
@@ -784,18 +1061,31 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 		if (lower_first == (u32) -1)
 			goto out;
 
-		extent->lower_first = lower_first;
+		e->lower_first = lower_first;
 	}
 
 	/* Install the map */
-	memcpy(map->extent, new_map.extent,
-		new_map.nr_extents*sizeof(new_map.extent[0]));
+	if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
+		memcpy(map->extent, new_map.extent,
+		       new_map.nr_extents * sizeof(new_map.extent[0]));
+	} else {
+		map->forward = new_map.forward;
+		map->reverse = new_map.reverse;
+	}
 	smp_wmb();
 	map->nr_extents = new_map.nr_extents;
 
 	*ppos = count;
 	ret = count;
 out:
+	if (ret < 0 && new_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
+		kfree(new_map.forward);
+		kfree(new_map.reverse);
+		map->forward = NULL;
+		map->reverse = NULL;
+		map->nr_extents = 0;
+	}
+
 	mutex_unlock(&userns_state_mutex);
 	kfree(kbuf);
 	return ret;
-- 
2.14.1


* [PATCH 0/5] userns: bump idmap limits, fixes & tweaks
  2017-10-24 22:04 ` [PATCH 2/2 v6] user namespaces: bump idmap limits to 340 Christian Brauner
@ 2017-10-31 23:46   ` Eric W. Biederman
  2017-10-31 23:47     ` [PATCH 1/5] userns: Don't special case a count of 0 Eric W. Biederman
                       ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Eric W. Biederman @ 2017-10-31 23:46 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-kernel, serge, tycho, Linux Containers


Christian, I have looked through your code and found one real
issue and a couple of things I want to tweak.

The real issue is reading nr_extents multiple times when reading a map.
That can introduce races that will allow walking past the end of the
array, if the first read is 0 but the second read is > 5.

I have also found a couple of tweaks that look like they are worth
implementing.

As all of these are very small and very straightforward, I have
tested them and applied them all to my for-next branch.


Eric W. Biederman (5):
      userns: Don't special case a count of 0
      userns: Simplify the user and group mapping functions
      userns: Don't read extents twice in m_start
      userns: Make map_id_down a wrapper for map_id_range_down
      userns: Simplify insert_extent

 kernel/user_namespace.c | 159 ++++++++++++++++--------------------------------
 1 file changed, 51 insertions(+), 108 deletions(-)


* [PATCH 1/5] userns: Don't special case a count of 0
  2017-10-31 23:46   ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Eric W. Biederman
@ 2017-10-31 23:47     ` Eric W. Biederman
  2017-10-31 23:47     ` [PATCH 2/5] userns: Simplify the user and group mapping functions Eric W. Biederman
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2017-10-31 23:47 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-kernel, serge, tycho, Linux Containers


We can always use a count of 1 so there is no reason to have
a special case of a count of 0.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 kernel/user_namespace.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 5fd2d53dbc75..c9904ee084c4 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -232,11 +232,7 @@ static int cmp_map_id(const void *k, const void *e)
 	const struct idmap_key *key = k;
 	const struct uid_gid_extent *el = e;
 
-	/* handle map_id_range_down() */
-	if (key->count)
-		id2 = key->id + key->count - 1;
-	else
-		id2 = key->id;
+	id2 = key->id + key->count - 1;
 
 	/* handle map_id_{down,up}() */
 	if (key->map_up)
@@ -362,7 +358,7 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id)
 	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
 		return map_id_down_base(map, id);
 
-	return map_id_range_down_max(map, id, 0);
+	return map_id_range_down_max(map, id, 1);
 }
 
 /**
@@ -404,7 +400,7 @@ static u32 map_id_up_max(struct uid_gid_map *map, u32 id)
 	struct idmap_key key;
 
 	key.map_up = true;
-	key.count = 0;
+	key.count = 1;
 	key.id = id;
 
 	extents = map->nr_extents;
-- 
2.14.1


* [PATCH 2/5] userns: Simplify the user and group mapping functions
  2017-10-31 23:46   ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Eric W. Biederman
  2017-10-31 23:47     ` [PATCH 1/5] userns: Don't special case a count of 0 Eric W. Biederman
@ 2017-10-31 23:47     ` Eric W. Biederman
  2017-10-31 23:48     ` [PATCH 3/5] userns: Don't read extents twice in m_start Eric W. Biederman
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2017-10-31 23:47 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-kernel, serge, tycho, Linux Containers


Consolidate reading the number of extents and computing the return
value in map_id_down, map_id_range_down and map_id_up.

This removal of one read of extents makes one smp_rmb unnecessary
and makes the code safe if it is executed during the map write.  Reading
the number of extents twice and depending on the result being the same
is not safe, as it could be 0 the first time and > 5 the second time,
which would lead to misinterpreting the union fields.

The consolidation of the return value just removes a duplicate
calculation, which should make it easier to understand and maintain
the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/user_namespace.c | 132 +++++++++++++++++++++---------------------------
 1 file changed, 58 insertions(+), 74 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index c9904ee084c4..563a2981d7c7 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -256,28 +256,17 @@ static int cmp_map_id(const void *k, const void *e)
  * map_id_range_down_max - Find idmap via binary search in ordered idmap array.
  * Can only be called if number of mappings exceeds UID_GID_MAP_MAX_BASE_EXTENTS.
  */
-static u32 map_id_range_down_max(struct uid_gid_map *map, u32 id, u32 count)
+static struct uid_gid_extent *
+map_id_range_down_max(unsigned extents, struct uid_gid_map *map, u32 id, u32 count)
 {
-	u32 extents;
-	struct uid_gid_extent *extent;
 	struct idmap_key key;
 
 	key.map_up = false;
 	key.count = count;
 	key.id = id;
 
-	extents = map->nr_extents;
-	smp_rmb();
-
-	extent = bsearch(&key, map->forward, extents,
-			 sizeof(struct uid_gid_extent), cmp_map_id);
-	/* Map the id or note failure */
-	if (extent)
-		id = (id - extent->first) + extent->lower_first;
-	else
-		id = (u32) -1;
-
-	return id;
+	return bsearch(&key, map->forward, extents,
+		       sizeof(struct uid_gid_extent), cmp_map_id);
 }
 
 /**
@@ -285,41 +274,43 @@ static u32 map_id_range_down_max(struct uid_gid_map *map, u32 id, u32 count)
  * Can only be called if number of mappings is equal or less than
  * UID_GID_MAP_MAX_BASE_EXTENTS.
  */
-static u32 map_id_range_down_base(struct uid_gid_map *map, u32 id, u32 count)
+static struct uid_gid_extent *
+map_id_range_down_base(unsigned extents, struct uid_gid_map *map, u32 id, u32 count)
 {
-	unsigned idx, extents;
+	unsigned idx;
 	u32 first, last, id2;
 
 	id2 = id + count - 1;
 
 	/* Find the matching extent */
-	extents = map->nr_extents;
-	smp_rmb();
 	for (idx = 0; idx < extents; idx++) {
 		first = map->extent[idx].first;
 		last = first + map->extent[idx].count - 1;
 		if (id >= first && id <= last &&
 		    (id2 >= first && id2 <= last))
-			break;
+			return &map->extent[idx];
 	}
-	/* Map the id or note failure */
-	if (idx < extents)
-		id = (id - first) + map->extent[idx].lower_first;
-	else
-		id = (u32) -1;
-
-	return id;
+	return NULL;
 }
 
 static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
 {
-	u32 extents = map->nr_extents;
+	struct uid_gid_extent *extent;
+	unsigned extents = map->nr_extents;
 	smp_rmb();
 
 	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
-		return map_id_range_down_base(map, id, count);
+		extent = map_id_range_down_base(extents, map, id, count);
+	else
+		extent = map_id_range_down_max(extents, map, id, count);
 
-	return map_id_range_down_max(map, id, count);
+	/* Map the id or note failure */
+	if (extent)
+		id = (id - extent->first) + extent->lower_first;
+	else
+		id = (u32) -1;
+
+	return id;
 }
 
 /**
@@ -327,38 +318,40 @@ static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
  * Can only be called if number of mappings is equal or less than
  * UID_GID_MAP_MAX_BASE_EXTENTS.
  */
-static u32 map_id_down_base(struct uid_gid_map *map, u32 id)
+static struct uid_gid_extent *
+map_id_down_base(unsigned extents, struct uid_gid_map *map, u32 id)
 {
-	unsigned idx, extents;
+	unsigned idx;
 	u32 first, last;
 
 	/* Find the matching extent */
-	extents = map->nr_extents;
-	smp_rmb();
 	for (idx = 0; idx < extents; idx++) {
 		first = map->extent[idx].first;
 		last = first + map->extent[idx].count - 1;
 		if (id >= first && id <= last)
-			break;
+			return &map->extent[idx];
 	}
-	/* Map the id or note failure */
-	if (idx < extents)
-		id = (id - first) + map->extent[idx].lower_first;
-	else
-		id = (u32) -1;
-
-	return id;
+	return NULL;
 }
 
 static u32 map_id_down(struct uid_gid_map *map, u32 id)
 {
-	u32 extents = map->nr_extents;
+	struct uid_gid_extent *extent;
+	unsigned extents = map->nr_extents;
 	smp_rmb();
 
 	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
-		return map_id_down_base(map, id);
+		extent = map_id_down_base(extents, map, id);
+	else
+		extent = map_id_range_down_max(extents, map, id, 1);
 
-	return map_id_range_down_max(map, id, 1);
+	/* Map the id or note failure */
+	if (extent)
+		id = (id - extent->first) + extent->lower_first;
+	else
+		id = (u32) -1;
+
+	return id;
 }
 
 /**
@@ -366,48 +359,50 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id)
  * Can only be called if number of mappings is equal or less than
  * UID_GID_MAP_MAX_BASE_EXTENTS.
  */
-static u32 map_id_up_base(struct uid_gid_map *map, u32 id)
+static struct uid_gid_extent *
+map_id_up_base(unsigned extents, struct uid_gid_map *map, u32 id)
 {
-	unsigned idx, extents;
+	unsigned idx;
 	u32 first, last;
 
 	/* Find the matching extent */
-	extents = map->nr_extents;
-	smp_rmb();
 	for (idx = 0; idx < extents; idx++) {
 		first = map->extent[idx].lower_first;
 		last = first + map->extent[idx].count - 1;
 		if (id >= first && id <= last)
-			break;
+			return &map->extent[idx];
 	}
-	/* Map the id or note failure */
-	if (idx < extents)
-		id = (id - first) + map->extent[idx].first;
-	else
-		id = (u32) -1;
-
-	return id;
+	return NULL;
 }
 
 /**
  * map_id_up_max - Find idmap via binary search in ordered idmap array.
  * Can only be called if number of mappings exceeds UID_GID_MAP_MAX_BASE_EXTENTS.
  */
-static u32 map_id_up_max(struct uid_gid_map *map, u32 id)
+static struct uid_gid_extent *
+map_id_up_max(unsigned extents, struct uid_gid_map *map, u32 id)
 {
-	u32 extents;
-	struct uid_gid_extent *extent;
 	struct idmap_key key;
 
 	key.map_up = true;
 	key.count = 1;
 	key.id = id;
 
-	extents = map->nr_extents;
+	return bsearch(&key, map->reverse, extents,
+		       sizeof(struct uid_gid_extent), cmp_map_id);
+}
+
+static u32 map_id_up(struct uid_gid_map *map, u32 id)
+{
+	struct uid_gid_extent *extent;
+	unsigned extents = map->nr_extents;
 	smp_rmb();
 
-	extent = bsearch(&key, map->reverse, extents,
-			 sizeof(struct uid_gid_extent), cmp_map_id);
+	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+		extent = map_id_up_base(extents, map, id);
+	else
+		extent = map_id_up_max(extents, map, id);
+
 	/* Map the id or note failure */
 	if (extent)
 		id = (id - extent->lower_first) + extent->first;
@@ -417,17 +412,6 @@ static u32 map_id_up_max(struct uid_gid_map *map, u32 id)
 	return id;
 }
 
-static u32 map_id_up(struct uid_gid_map *map, u32 id)
-{
-	u32 extents = map->nr_extents;
-	smp_rmb();
-
-	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
-		return map_id_up_base(map, id);
-
-	return map_id_up_max(map, id);
-}
-
 /**
  *	make_kuid - Map a user-namespace uid pair into a kuid.
  *	@ns:  User namespace that the uid is in
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-10-31 23:46   ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Eric W. Biederman
  2017-10-31 23:47     ` [PATCH 1/5] userns: Don't special case a count of 0 Eric W. Biederman
  2017-10-31 23:47     ` [PATCH 2/5] userns: Simplify the user and group mapping functions Eric W. Biederman
@ 2017-10-31 23:48     ` Eric W. Biederman
  2017-11-01  8:31       ` Nikolay Borisov
  2017-10-31 23:48     ` [PATCH 4/5] userns: Make map_id_down a wrapper for map_id_range_down Eric W. Biederman
                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2017-10-31 23:48 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-kernel, serge, tycho, Linux Containers


This is important so reading /proc/<pid>/{uid_map,gid_map,projid_map} while
the map is being written does not do strange things.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/user_namespace.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 563a2981d7c7..4f7e357ac1e2 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -683,11 +683,13 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
 		     struct uid_gid_map *map)
 {
 	loff_t pos = *ppos;
+	unsigned extents = map->nr_extents;
+	smp_rmb();
 
-	if (pos >= map->nr_extents)
+	if (pos >= extents)
 		return NULL;
 
-	if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
 		return &map->extent[pos];
 
 	return &map->forward[pos];
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/5] userns: Make map_id_down a wrapper for map_id_range_down
  2017-10-31 23:46   ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Eric W. Biederman
                       ` (2 preceding siblings ...)
  2017-10-31 23:48     ` [PATCH 3/5] userns: Don't read extents twice in m_start Eric W. Biederman
@ 2017-10-31 23:48     ` Eric W. Biederman
  2017-10-31 23:49     ` [PATCH 5/5] userns: Simplify insert_extent Eric W. Biederman
  2017-11-01 10:51     ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Christian Brauner
  5 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2017-10-31 23:48 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-kernel, serge, tycho, Linux Containers


There is no good reason for this code duplication, the number of cache
line accesses not the number of instructions are the bottleneck in
this code.

Therefore simplify maintenance by removing unnecessary code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/user_namespace.c | 38 +-------------------------------------
 1 file changed, 1 insertion(+), 37 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 4f7e357ac1e2..1d0298870ee3 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -313,45 +313,9 @@ static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
 	return id;
 }
 
-/**
- * map_id_down_base - Find idmap via binary search in static extent array.
- * Can only be called if number of mappings is equal or less than
- * UID_GID_MAP_MAX_BASE_EXTENTS.
- */
-static struct uid_gid_extent *
-map_id_down_base(unsigned extents, struct uid_gid_map *map, u32 id)
-{
-	unsigned idx;
-	u32 first, last;
-
-	/* Find the matching extent */
-	for (idx = 0; idx < extents; idx++) {
-		first = map->extent[idx].first;
-		last = first + map->extent[idx].count - 1;
-		if (id >= first && id <= last)
-			return &map->extent[idx];
-	}
-	return NULL;
-}
-
 static u32 map_id_down(struct uid_gid_map *map, u32 id)
 {
-	struct uid_gid_extent *extent;
-	unsigned extents = map->nr_extents;
-	smp_rmb();
-
-	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
-		extent = map_id_down_base(extents, map, id);
-	else
-		extent = map_id_range_down_max(extents, map, id, 1);
-
-	/* Map the id or note failure */
-	if (extent)
-		id = (id - extent->first) + extent->lower_first;
-	else
-		id = (u32) -1;
-
-	return id;
+	return map_id_range_down(map, id, 1);
 }
 
 /**
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/5] userns: Simplify insert_extent
  2017-10-31 23:46   ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Eric W. Biederman
                       ` (3 preceding siblings ...)
  2017-10-31 23:48     ` [PATCH 4/5] userns: Make map_id_down a wrapper for map_id_range_down Eric W. Biederman
@ 2017-10-31 23:49     ` Eric W. Biederman
  2017-11-01 10:51     ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Christian Brauner
  5 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2017-10-31 23:49 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-kernel, serge, tycho, Linux Containers


Consolidate the code to write to the new mapping at the end of the
function to remove the duplication.  Move the increase in the number
of mappings into insert_extent, keeping the logic together.

Just a small increase in readability and maintainability.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/user_namespace.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 1d0298870ee3..899c31060ff3 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -758,12 +758,7 @@ static bool mappings_overlap(struct uid_gid_map *new_map,
  */
 static int insert_extent(struct uid_gid_map *map, struct uid_gid_extent *extent)
 {
-	if (map->nr_extents < UID_GID_MAP_MAX_BASE_EXTENTS) {
-		map->extent[map->nr_extents].first = extent->first;
-		map->extent[map->nr_extents].lower_first = extent->lower_first;
-		map->extent[map->nr_extents].count = extent->count;
-		return 0;
-	}
+	struct uid_gid_extent *dest;
 
 	if (map->nr_extents == UID_GID_MAP_MAX_BASE_EXTENTS) {
 		struct uid_gid_extent *forward;
@@ -784,9 +779,13 @@ static int insert_extent(struct uid_gid_map *map, struct uid_gid_extent *extent)
 		map->reverse = NULL;
 	}
 
-	map->forward[map->nr_extents].first = extent->first;
-	map->forward[map->nr_extents].lower_first = extent->lower_first;
-	map->forward[map->nr_extents].count = extent->count;
+	if (map->nr_extents < UID_GID_MAP_MAX_BASE_EXTENTS)
+		dest = &map->extent[map->nr_extents];
+	else
+		dest = &map->forward[map->nr_extents];
+
+	*dest = *extent;
+	map->nr_extents++;
 	return 0;
 }
 
@@ -968,8 +967,6 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 		if (ret < 0)
 			goto out;
 		ret = -EINVAL;
-
-		new_map.nr_extents++;
 	}
 	/* Be very certaint the new map actually exists */
 	if (new_map.nr_extents == 0)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-10-31 23:48     ` [PATCH 3/5] userns: Don't read extents twice in m_start Eric W. Biederman
@ 2017-11-01  8:31       ` Nikolay Borisov
  2017-11-01 11:08         ` Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Nikolay Borisov @ 2017-11-01  8:31 UTC (permalink / raw)
  To: Eric W. Biederman, Christian Brauner
  Cc: Linux Containers, tycho, linux-kernel



On  1.11.2017 01:48, Eric W. Biederman wrote:
> 
> This is important so reading /proc/<pid>/{uid_map,gid_map,projid_map} while
> the map is being written does not do strange things.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/user_namespace.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 563a2981d7c7..4f7e357ac1e2 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -683,11 +683,13 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
>  		     struct uid_gid_map *map)
>  {
>  	loff_t pos = *ppos;
> +	unsigned extents = map->nr_extents;
> +	smp_rmb();

Barriers need to be paired to work correctly as well as have explicit
comments describing the pairing as per kernel coding style. Checkpatch
will actually produce warning for that particular memory barrier.

>  
> -	if (pos >= map->nr_extents)
> +	if (pos >= extents)
>  		return NULL;
>  
> -	if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> +	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
>  		return &map->extent[pos];
>  
>  	return &map->forward[pos];
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] userns: bump idmap limits, fixes & tweaks
  2017-10-31 23:46   ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Eric W. Biederman
                       ` (4 preceding siblings ...)
  2017-10-31 23:49     ` [PATCH 5/5] userns: Simplify insert_extent Eric W. Biederman
@ 2017-11-01 10:51     ` Christian Brauner
  2017-11-01 11:15       ` Eric W. Biederman
  5 siblings, 1 reply; 22+ messages in thread
From: Christian Brauner @ 2017-11-01 10:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Linux Kernel Mailing List, Serge Hallyn,
	Tycho Andersen, Linux Containers

On Tue, Oct 31, 2017 at 06:46:32PM -0500, Eric W. Biederman wrote:
>
> Christian I have looked through your code and I have found one real
> issue and a couple of things I want to tweak

Cool, thanks for taking a close look Eric.

>
> The real issue is reading nr_extents multiple times when reading a map.
> That can introduce races that will allow walking past the end of the
> array, if the first read is 0 but the second read is > 5.
>
> I have also found a couple of tweaks that look like they are worth
> implementing.

Yeah, I saw that you unified some of the functions. I was thinking about this
but wanted to keep the cases distinct even with some amount of code duplication.
But it seems very much worth it from a maintenance perspective. Thanks!

>
> As all of these are very small and very straight forward I have
> tested these and applied them all to my for-next branch

Thanks for the fixes Eric. Really appreciated. If you're too swamped for stuff
like that I'm obviously happy to do such trivial fixes myself. :)

Christian

>
>
> Eric W. Biederman (5):
>       userns: Don't special case a count of 0
>       userns: Simplify the user and group mapping functions
>       userns: Don't read extents twice in m_start
>       userns: Make map_id_down a wrapper for map_id_range_down
>       userns: Simplify insert_extent
>
>  kernel/user_namespace.c | 159 ++++++++++++++++--------------------------------
>  1 file changed, 51 insertions(+), 108 deletions(-)
>
>
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-11-01  8:31       ` Nikolay Borisov
@ 2017-11-01 11:08         ` Eric W. Biederman
  2017-11-01 13:05           ` Nikolay Borisov
                             ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Eric W. Biederman @ 2017-11-01 11:08 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Christian Brauner, Linux Containers, tycho, linux-kernel

Nikolay Borisov <nborisov@suse.com> writes:

> On  1.11.2017 01:48, Eric W. Biederman wrote:
>> 
>> This is important so reading /proc/<pid>/{uid_map,gid_map,projid_map} while
>> the map is being written does not do strange things.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  kernel/user_namespace.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>> 
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index 563a2981d7c7..4f7e357ac1e2 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -683,11 +683,13 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
>>  		     struct uid_gid_map *map)
>>  {
>>  	loff_t pos = *ppos;
>> +	unsigned extents = map->nr_extents;
>> +	smp_rmb();
>
> Barriers need to be paired to work correctly as well as have explicit
> comments describing the pairing as per kernel coding style. Checkpatch
> will actually produce warning for that particular memory barrier.

So please look at the code and read the comment.  The fact the barrier
was not in m_start earlier is strictly speaking a bug.

In practice except for a very narrow window when this data is changing
the one time it can, this code does not matter at all.

As for checkpatch I have sympathy for it, checkpatch has a hard job,
but I won't listen to checkpatch when it is wrong.

If you have additional cleanups you would like to make in this area
please send patches.

Eric

>>  
>> -	if (pos >= map->nr_extents)
>> +	if (pos >= extents)
>>  		return NULL;
>>  
>> -	if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
>> +	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
>>  		return &map->extent[pos];
>>  
>>  	return &map->forward[pos];
>> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] userns: bump idmap limits, fixes & tweaks
  2017-11-01 10:51     ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Christian Brauner
@ 2017-11-01 11:15       ` Eric W. Biederman
  2017-11-01 13:31         ` Christian Brauner
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2017-11-01 11:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christian Brauner, Linux Kernel Mailing List, Serge Hallyn,
	Tycho Andersen, Linux Containers

Christian Brauner <christian.brauner@canonical.com> writes:

> On Tue, Oct 31, 2017 at 06:46:32PM -0500, Eric W. Biederman wrote:
>>
>> Christian I have looked through your code and I have found one real
> > issue and a couple of things I want to tweak
>
> Cool, thanks for taking a close look Eric.
>
>>
>> The real issue is reading nr_extents multiple times when reading a map.
>> That can introduce races that will allow walking past the end of the
>> array, if the first read is 0 but the second read is > 5.
>>
>> I have also found a couple of tweaks that look like they are worth
>> implementing.
>
> Yeah, I saw that you unified some of the functions. I was thinking about this
> but wanted to keep the cases distinct even with some amount of code duplication.
> But it seems very much worth it from a maintenance perspective. Thanks!

Yes.  If we have a performance regression I am willing to remove the
unification of map_id_range_down and map_id_down.  But I can't imagine
that will result in a measurable performance difference.  If it does
make a measurable performance difference we almost certainly need to split
the bsearch case as well.

>> As all of these are very small and very straight forward I have
>> tested these and applied them all to my for-next branch
>
> Thanks for the fixes Eric. Really appreciated. If you're too swamped for stuff
> like that I'm obviously happy to do such trivial fixes myself. :)

If you would test this some more in your setup I would appreciate it,
just in case I missed something.

Given where we are in the development cycle and the correctness concerns
I just applied these as without the fix for reading extents exactly once
the code is dangerously wrong.

Eric

> Christian
>
>>
>>
>> Eric W. Biederman (5):
>>       userns: Don't special case a count of 0
>>       userns: Simplify the user and group mapping functions
>>       userns: Don't read extents twice in m_start
>>       userns: Make map_id_down a wrapper for map_id_range_down
>>       userns: Simplify insert_extent
>>
>>  kernel/user_namespace.c | 159 ++++++++++++++++--------------------------------
>>  1 file changed, 51 insertions(+), 108 deletions(-)
>>
>>
>>
>>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-11-01 11:08         ` Eric W. Biederman
@ 2017-11-01 13:05           ` Nikolay Borisov
  2017-11-01 13:05           ` Peter Zijlstra
  2017-11-01 17:00           ` Joe Perches
  2 siblings, 0 replies; 22+ messages in thread
From: Nikolay Borisov @ 2017-11-01 13:05 UTC (permalink / raw)
  To: Eric W. Biederman, Nikolay Borisov
  Cc: Christian Brauner, tycho, Linux Containers, linux-kernel,
	Joe Azzopardi, joe, apw, Paul E. McKenney, Peter Zijlstra,
	mathieu.desnoyers

[CCing some memory-barriers people + checkpatch maintainers]

On  1.11.2017 13:08, Eric W. Biederman wrote:
> Nikolay Borisov <nborisov@suse.com> writes:
> 
>> On  1.11.2017 01:48, Eric W. Biederman wrote:
>>>
>>> This is important so reading /proc/<pid>/{uid_map,gid_map,projid_map} while
>>> the map is being written does not do strange things.
>>>
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>> ---
>>>  kernel/user_namespace.c | 6 ++++--
>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>>> index 563a2981d7c7..4f7e357ac1e2 100644
>>> --- a/kernel/user_namespace.c
>>> +++ b/kernel/user_namespace.c
>>> @@ -683,11 +683,13 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
>>>  		     struct uid_gid_map *map)
>>>  {
>>>  	loff_t pos = *ppos;
>>> +	unsigned extents = map->nr_extents;
>>> +	smp_rmb();
>>
>> Barriers need to be paired to work correctly as well as have explicit
>> comments describing the pairing as per kernel coding style. Checkpatch
>> will actually produce warning for that particular memory barrier.
> 
> So please look at the code and read the comment.  The fact the barrier
> was not in m_start earlier is strictly speaking a bug.

The comment you are referring to (I guess) is the one in map_write,
right? However it states nothing about which barrier pairs with which
one so it's not possible to reason with 100% certainty about the
intentions of the code. I think the fact there was a missing barrier
here just makes this point stronger.

Additionally, when someone reading the code sees a rmb being added
without a pairing wmb it's only natural to think that something is amiss
in this case.

> 
> In practice except for a very narrow window when this data is changing
> the one time it can, this code does not matter at all.
> 
> As for checkpatch I have sympathy for it, checkpatch has a hard job,
> but I won't listen to checkpatch when it is wrong.

Well, if you deem the checkpatch warning wrong then this rule is better
removed from checkpatch. However, considering how hard it is to debug
and reason about the correctness of barriers without explicit comments,
I'd say you are in the wrong on this one.

> 
> If you have additional cleanups you would like to make in this area
> please send patches.
> 
> Eric
> 
>>>  
>>> -	if (pos >= map->nr_extents)
>>> +	if (pos >= extents)
>>>  		return NULL;
>>>  
>>> -	if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
>>> +	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
>>>  		return &map->extent[pos];
>>>  
>>>  	return &map->forward[pos];
>>>
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-11-01 11:08         ` Eric W. Biederman
  2017-11-01 13:05           ` Nikolay Borisov
@ 2017-11-01 13:05           ` Peter Zijlstra
  2017-11-01 14:01             ` Christian Brauner
  2017-11-01 16:31             ` Christian Brauner
  2017-11-01 17:00           ` Joe Perches
  2 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2017-11-01 13:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Nikolay Borisov, Christian Brauner, Linux Containers, tycho,
	linux-kernel

On Wed, Nov 01, 2017 at 06:08:35AM -0500, Eric W. Biederman wrote:
> Nikolay Borisov <nborisov@suse.com> writes:
> 
> > On  1.11.2017 01:48, Eric W. Biederman wrote:
> >> 
> >> This is important so reading /proc/<pid>/{uid_map,gid_map,projid_map} while
> >> the map is being written does not do strange things.
> >> 
> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> ---
> >>  kernel/user_namespace.c | 6 ++++--
> >>  1 file changed, 4 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> >> index 563a2981d7c7..4f7e357ac1e2 100644
> >> --- a/kernel/user_namespace.c
> >> +++ b/kernel/user_namespace.c
> >> @@ -683,11 +683,13 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
> >>  		     struct uid_gid_map *map)
> >>  {
> >>  	loff_t pos = *ppos;
> >> +	unsigned extents = map->nr_extents;
> >> +	smp_rmb();
> >
> > Barriers need to be paired to work correctly as well as have explicit
> > comments describing the pairing as per kernel coding style. Checkpatch
> > will actually produce warning for that particular memory barrier.
> 
> So please look at the code and read the comment.

What comment, there isn't any, which is what he's complaining about.

> The fact the barrier was not in m_start earlier is strictly speaking a
> bug.

Sure; doesn't excuse you for not writing sensible comments to go with
it.

> In practice except for a very narrow window when this data is changing
> the one time it can, this code does not matter at all.
> 
> As for checkpatch I have sympathy for it, checkpatch has a hard job,
> but I won't listen to checkpatch when it is wrong.

No, undocumented barriers are a royal pain. Memory barriers should come
with a comment that describes the desired ordering and points to the
pairing barrier(s).

Also, you probably want READ_ONCE() here and WRITE_ONCE() in
map_write(), the compiler is free to do unordered byte loads/stores
without it.

And finally, did you want to use smp_store_release() and
smp_load_acquire() instead?

Something like so perhaps?

---
 kernel/user_namespace.c | 73 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 45 insertions(+), 28 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index c490f1e4313b..f758911cabd5 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -25,8 +25,47 @@
 #include <linux/fs_struct.h>
 
 static struct kmem_cache *user_ns_cachep __read_mostly;
+
+/*
+ * The userns_state_mutex serializes all writes to any given map.
+ *
+ * Any map is only ever written once.
+ *
+ * An id map fits within 1 cache line on most architectures.
+ */
 static DEFINE_MUTEX(userns_state_mutex);
 
+/*
+ *
+ * There is a one time data dependency between reading the count of the extents
+ * and the values of the extents.  The desired behavior is to see the values of
+ * the extents that were written before the count of the extents.
+ *
+ * To achieve this smp_store_release() is used to guarantee the write order and
+ * smp_load_acquire() guarantees that we don't have weakly ordered
+ * architectures returning stale data.
+ */
+static inline void map_store_extents(struct uid_gid_map *map, unsigned int extents)
+{
+	/*
+	 * Ensure the map->extent[] stores happen-before we grow map->nr_extents
+	 * to cover it. Matches the load_acquire in map_load_extents().
+	 */
+	smp_store_release(&map->nr_extents, extents);
+}
+
+static inline unsigned int map_load_extents(struct uid_gid_map *map)
+{
+	/*
+	 * Ensure the map->nr_extents load happens-before we try and access
+	 * map->extent[], such that we guarantee the data is in fact there.
+	 *
+	 * Matches the store-release in map_store_extents().
+	 */
+	return smp_load_acquire(&map->nr_extents);
+}
+
+
 static bool new_idmap_permitted(const struct file *file,
 				struct user_namespace *ns, int cap_setid,
 				struct uid_gid_map *map);
@@ -206,8 +245,7 @@ static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
 	id2 = id + count - 1;
 
 	/* Find the matching extent */
-	extents = map->nr_extents;
-	smp_rmb();
+	extents = map_load_extents(map);
 	for (idx = 0; idx < extents; idx++) {
 		first = map->extent[idx].first;
 		last = first + map->extent[idx].count - 1;
@@ -230,8 +268,7 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id)
 	u32 first, last;
 
 	/* Find the matching extent */
-	extents = map->nr_extents;
-	smp_rmb();
+	extents = map_load_extents(map);
 	for (idx = 0; idx < extents; idx++) {
 		first = map->extent[idx].first;
 		last = first + map->extent[idx].count - 1;
@@ -253,8 +290,7 @@ static u32 map_id_up(struct uid_gid_map *map, u32 id)
 	u32 first, last;
 
 	/* Find the matching extent */
-	extents = map->nr_extents;
-	smp_rmb();
+	extents = map_load_extents(map);
 	for (idx = 0; idx < extents; idx++) {
 		first = map->extent[idx].lower_first;
 		last = first + map->extent[idx].count - 1;
@@ -543,7 +579,7 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
 	struct uid_gid_extent *extent = NULL;
 	loff_t pos = *ppos;
 
-	if (pos < map->nr_extents)
+	if (pos < map_load_extents(map))
 		extent = &map->extent[pos];
 
 	return extent;
@@ -652,25 +688,6 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	char *kbuf = NULL, *pos, *next_line;
 	ssize_t ret = -EINVAL;
 
-	/*
-	 * The userns_state_mutex serializes all writes to any given map.
-	 *
-	 * Any map is only ever written once.
-	 *
-	 * An id map fits within 1 cache line on most architectures.
-	 *
-	 * On read nothing needs to be done unless you are on an
-	 * architecture with a crazy cache coherency model like alpha.
-	 *
-	 * There is a one time data dependency between reading the
-	 * count of the extents and the values of the extents.  The
-	 * desired behavior is to see the values of the extents that
-	 * were written before the count of the extents.
-	 *
-	 * To achieve this smp_wmb() is used on guarantee the write
-	 * order and smp_rmb() is guaranteed that we don't have crazy
-	 * architectures returning stale data.
-	 */
 	mutex_lock(&userns_state_mutex);
 
 	ret = -EPERM;
@@ -790,8 +807,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	/* Install the map */
 	memcpy(map->extent, new_map.extent,
 		new_map.nr_extents*sizeof(new_map.extent[0]));
-	smp_wmb();
-	map->nr_extents = new_map.nr_extents;
+
+	map_store_extents(map, new_map.nr_extents);
 
 	*ppos = count;
 	ret = count;

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] userns: bump idmap limits, fixes & tweaks
  2017-11-01 11:15       ` Eric W. Biederman
@ 2017-11-01 13:31         ` Christian Brauner
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2017-11-01 13:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Christian Brauner, Linux Kernel Mailing List, Serge Hallyn,
	Tycho Andersen, Linux Containers

On Wed, Nov 01, 2017 at 06:15:53AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Tue, Oct 31, 2017 at 06:46:32PM -0500, Eric W. Biederman wrote:
> >>
> >> Christian I have looked through your code and I have found one real
> >> issue and a couple of things I want to tweak.
> >
> > Cool, thanks for taking a close look Eric.
> >
> >>
> >> The real issue is reading nr_extents multiple times when reading a map.
> >> That can introduce races that will allow walking past the end of the
> >> array, if the first read is 0 but the second read is > 5.
> >>
> >> I have also found a couple of tweaks that look like they are worth
> >> implementing.
> >
> > Yeah, I saw that you unified some of the functions. I was thinking about this
> > but wanted to keep the cases distinct even with some amount of code duplication.
> > But it seems very much worth it from a maintenance perspective. Thanks!
> 
> Yes.  If we have a performance regression I am willing to remove the
> unification of map_id_range_down and map_id_down.  But I can't imagine
> that will result in a measurable performance difference.  If it does
> make a measurable performance difference we almost certainly need to split
> the bsearch case as well.
> 
> >> As all of these are very small and very straight forward I have
> >> tested these and applied them all to my for-next branch
> >
> > Thanks for the fixes Eric. Really appreciated. If you're too swamped for stuff
> > like that I'm obviously happy to do such trivial fixes myself. :)
> 
> If you would test this some more in your setup I would appreciate it,
> just in case I missed something.

I've been running kernels with this patch for a few weeks now with nested
unprivileged containers and all sorts of weird setups. So far I didn't observe
any problems. Once we've settled the memory barrier discussion I'm going to
recompile and run a few tests.

> 
> Given where we are in the development cycle and the correctness concerns
> I just applied these as without the fix for reading extents exactly once
> the code is dangerously wrong.
> 
> Eric
> 
> > Christian
> >
> >>
> >>
> >> Eric W. Biederman (5):
> >>       userns: Don't special case a count of 0
> >>       userns: Simplify the user and group mapping functions
> >>       userns: Don't read extents twice in m_start
> >>       userns: Make map_id_down a wrapper for map_id_range_down
> >>       userns: Simplify insert_extent
> >>
> >>  kernel/user_namespace.c | 159 ++++++++++++++++--------------------------------
> >>  1 file changed, 51 insertions(+), 108 deletions(-)
> >>
> >>
> >>
> >>


* Re: [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-11-01 13:05           ` Peter Zijlstra
@ 2017-11-01 14:01             ` Christian Brauner
  2017-11-01 14:16               ` Peter Zijlstra
  2017-11-01 16:31             ` Christian Brauner
  1 sibling, 1 reply; 22+ messages in thread
From: Christian Brauner @ 2017-11-01 14:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric W. Biederman, Nikolay Borisov, Christian Brauner,
	Linux Containers, tycho, linux-kernel

On Wed, Nov 01, 2017 at 02:05:39PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 01, 2017 at 06:08:35AM -0500, Eric W. Biederman wrote:
> > Nikolay Borisov <nborisov@suse.com> writes:
> > 
> > > On  1.11.2017 01:48, Eric W. Biederman wrote:
> > >> 
> > >> This is important so reading /proc/<pid>/{uid_map,gid_map,projid_map} while
> > >> the map is being written does not do strange things.
> > >> 
> > >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > >> ---
> > >>  kernel/user_namespace.c | 6 ++++--
> > >>  1 file changed, 4 insertions(+), 2 deletions(-)
> > >> 
> > >> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> > >> index 563a2981d7c7..4f7e357ac1e2 100644
> > >> --- a/kernel/user_namespace.c
> > >> +++ b/kernel/user_namespace.c
> > >> @@ -683,11 +683,13 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
> > >>  		     struct uid_gid_map *map)
> > >>  {
> > >>  	loff_t pos = *ppos;
> > >> +	unsigned extents = map->nr_extents;
> > >> +	smp_rmb();
> > >
> > > Barriers need to be paired to work correctly as well as have explicit
> > > comments describing the pairing as per kernel coding style. Checkpatch
> > > will actually produce warning for that particular memory barrier.
> > 
> > So please look at the code and read the comment.
> 
> What comment, there isn't any, which is what he's complaining about.
> 
> > The fact the barrier was not in m_start earlier is strictly speaking a
> > bug.
> 
> Sure; doesn't excuse you for not writing sensible comments to go with
> it.
> 
> > In practice except for a very narrow window when this data is changing
> > the one time it can, this code does not matter at all.
> > 
> > As for checkpatch I have sympathy for it, checkpatch has a hard job,
> > but I won't listen to checkpatch when it is wrong.
> 
> No, undocumented barriers are a royal pain. Memory barriers should come
> with a comment that describes the desired ordering and points to the
> pairing barrier(s).

Tbf, this isn't solely Eric's fault. I'm to blame here too since I didn't
document the already existing smp_rmb()s and the new one I introduced when
writing the patches. I didn't know that there was a hard-set requirement to
document those. I also didn't see anything in the kernel coding style or the
memory barriers documentation (But it has been some time since I read those.).

> 
> Also, you probably want READ_ONCE() here and WRITE_ONCE() in
> map_write(), the compiler is free to do unordered byte loads/stores
> without it.
> 
> And finally, did you want to use smp_store_release() and
> smp_load_acquire() instead?

Maybe a stupid question but do you suspect this is a real problem in this case
since you're phrasing it as a question? Iirc, *_acquire() operations include
locking operations and might come with a greater performance impact than
smp_{rmb,wmb}(). Given that this is a very performance critical path we should
be sure.

> 
> Something like so perhaps?
> 
> ---
>  kernel/user_namespace.c | 73 ++++++++++++++++++++++++++++++-------------------
>  1 file changed, 45 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index c490f1e4313b..f758911cabd5 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -25,8 +25,47 @@
>  #include <linux/fs_struct.h>
>  
>  static struct kmem_cache *user_ns_cachep __read_mostly;
> +
> +/*
> + * The userns_state_mutex serializes all writes to any given map.
> + *
> + * Any map is only ever written once.
> + *
> + * An id map fits within 1 cache line on most architectures.
> + */
>  static DEFINE_MUTEX(userns_state_mutex);
>  
> +/*
> + *
> + * There is a one time data dependency between reading the count of the extents
> + * and the values of the extents.  The desired behavior is to see the values of
> + * the extents that were written before the count of the extents.
> + *
> + * To achieve this smp_store_release() is used to guarantee the write order
> + * and smp_load_acquire() guarantees that weakly ordered architectures don't
> + * return stale data.
> + */
> +static inline void map_store_extents(struct uid_gid_map *map, unsigned int extents)
> +{
> +	/*
> +	 * Ensure the map->extent[] stores happen-before we grow map->nr_extents
> +	 * to cover it. Matches the load_acquire in map_load_extents().
> +	 */
> +	smp_store_release(&map->nr_extents, extents);
> +}
> +
> +static inline unsigned int map_load_extents(struct uid_gid_map *map)
> +{
> +	/*
> +	 * Ensure the map->nr_extents load happens-before we try and access
> +	 * map->extent[], such that we guarantee the data is in fact there.
> +	 *
> +	 * Matches the store-release in map_store_extents().
> +	 */
> +	return smp_load_acquire(&map->nr_extents);
> +}
> +
> +
>  static bool new_idmap_permitted(const struct file *file,
>  				struct user_namespace *ns, int cap_setid,
>  				struct uid_gid_map *map);
> @@ -206,8 +245,7 @@ static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
>  	id2 = id + count - 1;
>  
>  	/* Find the matching extent */
> -	extents = map->nr_extents;
> -	smp_rmb();
> +	extents = map_load_extents(map);
>  	for (idx = 0; idx < extents; idx++) {
>  		first = map->extent[idx].first;
>  		last = first + map->extent[idx].count - 1;
> @@ -230,8 +268,7 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id)
>  	u32 first, last;
>  
>  	/* Find the matching extent */
> -	extents = map->nr_extents;
> -	smp_rmb();
> +	extents = map_load_extents(map);
>  	for (idx = 0; idx < extents; idx++) {
>  		first = map->extent[idx].first;
>  		last = first + map->extent[idx].count - 1;
> @@ -253,8 +290,7 @@ static u32 map_id_up(struct uid_gid_map *map, u32 id)
>  	u32 first, last;
>  
>  	/* Find the matching extent */
> -	extents = map->nr_extents;
> -	smp_rmb();
> +	extents = map_load_extents(map);
>  	for (idx = 0; idx < extents; idx++) {
>  		first = map->extent[idx].lower_first;
>  		last = first + map->extent[idx].count - 1;
> @@ -543,7 +579,7 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
>  	struct uid_gid_extent *extent = NULL;
>  	loff_t pos = *ppos;
>  
> -	if (pos < map->nr_extents)
> +	if (pos < map_load_extents(map))
>  		extent = &map->extent[pos];
>  
>  	return extent;
> @@ -652,25 +688,6 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>  	char *kbuf = NULL, *pos, *next_line;
>  	ssize_t ret = -EINVAL;
>  
> -	/*
> -	 * The userns_state_mutex serializes all writes to any given map.
> -	 *
> -	 * Any map is only ever written once.
> -	 *
> -	 * An id map fits within 1 cache line on most architectures.
> -	 *
> -	 * On read nothing needs to be done unless you are on an
> -	 * architecture with a crazy cache coherency model like alpha.
> -	 *
> -	 * There is a one time data dependency between reading the
> -	 * count of the extents and the values of the extents.  The
> -	 * desired behavior is to see the values of the extents that
> -	 * were written before the count of the extents.
> -	 *
> -	 * To achieve this smp_wmb() is used on guarantee the write
> -	 * order and smp_rmb() is guaranteed that we don't have crazy
> -	 * architectures returning stale data.
> -	 */
>  	mutex_lock(&userns_state_mutex);
>  
>  	ret = -EPERM;
> @@ -790,8 +807,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>  	/* Install the map */
>  	memcpy(map->extent, new_map.extent,
>  		new_map.nr_extents*sizeof(new_map.extent[0]));
> -	smp_wmb();
> -	map->nr_extents = new_map.nr_extents;
> +
> +	map_store_extents(map, new_map.nr_extents);
>  
>  	*ppos = count;
>  	ret = count;


* Re: [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-11-01 14:01             ` Christian Brauner
@ 2017-11-01 14:16               ` Peter Zijlstra
  2017-11-01 16:29                 ` Christian Brauner
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2017-11-01 14:16 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric W. Biederman, Nikolay Borisov, Christian Brauner,
	Linux Containers, tycho, linux-kernel

On Wed, Nov 01, 2017 at 03:01:45PM +0100, Christian Brauner wrote:
> Tbf, this isn't solely Eric's fault. I'm to blame here too since I didn't
> document the already existing smp_rmb()s and the new one I introduced when
> writing the patches. I didn't know that there was a hard-set requirement to
> document those. I also didn't see anything in the kernel coding style or the
> memory barriers documentation (But it has been some time since I read those.).

There are too many documents to read. I'm not sure we changed
coding-style, and I suspect that'll just end up being another bike-shed
in any case.

We did get checkpatch changed though, which is a strong enough clue that
something needs to happen.

But what Nikolay said; memory ordering is hard enough even when you're clear on
what exactly you intend to do. But if you later try to reconstruct it
without comments, it's nearly impossible.

It gets even better if someone changes the ordering requirements over
time and you grow hidden and non-obvious dependencies :/

> > Also, you probably want READ_ONCE() here and WRITE_ONCE() in
> > map_write(), the compiler is free to do unordered byte loads/stores
> > without it.
> > 
> > And finally, did you want to use smp_store_release() and
> > smp_load_acquire() instead?
> 
> Maybe a stupid question but do you suspect this is a real problem in
> this case since you're phrasing it as a question?

Rhetorical question mostly, I suspect it's just what you meant to do, as
per the proposed patch.

> Iirc, *_acquire() operations include
> > locking operations and might come with a greater performance impact than
> > smp_{rmb,wmb}(). Given that this is a very performance critical path we should
> be sure.

No locking whatsoever. LOAD-ACQUIRE and STORE-RELEASE are memory ordering
flavours that are paired with the respective memory operation.

It is true that locking ops provide these exact orderings, but that
doesn't imply the reverse.

In short, store-release is a store that ensures all prior load _and_
stores happen-before this store. A load-acquire is a load which
happens-before any subsequent load or stores.

But a release does not constrain later loads or stores, and an acquire
does not constrain prior load or stores.


* Re: [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-11-01 14:16               ` Peter Zijlstra
@ 2017-11-01 16:29                 ` Christian Brauner
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2017-11-01 16:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric W. Biederman, Nikolay Borisov, Christian Brauner,
	Linux Containers, tycho, linux-kernel

On Wed, Nov 01, 2017 at 03:16:54PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 01, 2017 at 03:01:45PM +0100, Christian Brauner wrote:
> > Tbf, this isn't solely Eric's fault. I'm to blame here too since I didn't
> > document the already existing smp_rmb()s and the new one I introduced when
> > writing the patches. I didn't know that there was a hard-set requirement to
> > document those. I also didn't see anything in the kernel coding style or the
> > memory barriers documentation (But it has been some time since I read those.).
> 
> There's too many documents to read.. I'm not sure we changed
> coding-style, and I suspect that'll just end up being another bike-shed
> in any case.
> 
> We did get checkpatch changed though, which is a strong enough clue that
> something needs to happen.
> 
> But What Nikolay said; memory ordering is hard enough if you're clear on
> what exactly you intend to do. But if you later try and reconstruct
> without comments, its nearly impossible.

Yeah, agreed. I was happy to see that Eric explained his smp_wmb() in detail.
That was quite helpful in figuring this out!

> 
> It gets even better if someone changes the ordering requirements over
> time and you grow hidden and non-obvious dependencies :/
> 
> > > Also, you probably want READ_ONCE() here and WRITE_ONCE() in
> > > map_write(), the compiler is free to do unordered byte loads/stores
> > > without it.
> > > 
> > > And finally, did you want to use smp_store_release() and
> > > smp_load_acquire() instead?
> > 
> > Maybe a stupid question but do you suspect this is a real problem in
> > this case since you're phrasing it as a question?
> 
> Rhetorical question mostly, I suspect its just what you meant to do, as
> per the proposed patch.
> 
> > Iirc, *_acquire() operations include
> > locking operations and might come with a greater performance impact than
> > smp_{rmb,wmb}(). Given that this is a very performance critical path we should
> > be sure.
> 
> No locking what so ever. LOAD-ACQUIRE and STORE-RELEASE are memory ordering
> flavours that are paired with the respective memory operation.

Ah right, now I remember I was confused by a part of the memory barriers
documentation that referenced locks. Acquire operations include locks and
smp_load_acquire().  Right, should've remembered that. Thanks!

> 
> It is true that locking ops provide these exact orderings, but that
> doesn't imply the reverse.
> 
> In short, store-release is a store that ensures all prior load _and_
> stores happen-before this store. A load-acquire is a load which
> happens-before any subsequent load or stores.
> 
> But a release does not constrain later loads or stores, and an acquire
> does not constrain prior load or stores.
> 
> 


* Re: [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-11-01 13:05           ` Peter Zijlstra
  2017-11-01 14:01             ` Christian Brauner
@ 2017-11-01 16:31             ` Christian Brauner
  1 sibling, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2017-11-01 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric W. Biederman, Nikolay Borisov, Christian Brauner,
	Linux Containers, tycho, linux-kernel

On Wed, Nov 01, 2017 at 02:05:39PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 01, 2017 at 06:08:35AM -0500, Eric W. Biederman wrote:
> > Nikolay Borisov <nborisov@suse.com> writes:
> > 
> > > On  1.11.2017 01:48, Eric W. Biederman wrote:
> > >> 
> > >> This is important so reading /proc/<pid>/{uid_map,gid_map,projid_map} while
> > >> the map is being written does not do strange things.
> > >> 
> > >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > >> ---
> > >>  kernel/user_namespace.c | 6 ++++--
> > >>  1 file changed, 4 insertions(+), 2 deletions(-)
> > >> 
> > >> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> > >> index 563a2981d7c7..4f7e357ac1e2 100644
> > >> --- a/kernel/user_namespace.c
> > >> +++ b/kernel/user_namespace.c
> > >> @@ -683,11 +683,13 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
> > >>  		     struct uid_gid_map *map)
> > >>  {
> > >>  	loff_t pos = *ppos;
> > >> +	unsigned extents = map->nr_extents;
> > >> +	smp_rmb();
> > >
> > > Barriers need to be paired to work correctly as well as have explicit
> > > comments describing the pairing as per kernel coding style. Checkpatch
> > > will actually produce warning for that particular memory barrier.
> > 
> > So please look at the code and read the comment.
> 
> What comment, there isn't any, which is what he's complaining about.
> 
> > The fact the barrier was not in m_start earlier is strictly speaking a
> > bug.
> 
> Sure; doesn't excuse you for not writing sensible comments to go with
> it.
> 
> > In practice except for a very narrow window when this data is changing
> > the one time it can, this code does not matter at all.
> > 
> > As for checkpatch I have sympathy for it, checkpatch has a hard job,
> > but I won't listen to checkpatch when it is wrong.
> 
> No, undocumented barriers are a royal pain. Memory barriers should come
> with a comment that describes the desired ordering and points to the
> pairing barrier(s).
> 
> Also, you probably want READ_ONCE() here and WRITE_ONCE() in
> map_write(), the compiler is free to do unordered byte loads/stores
> without it.
> 
> And finally, did you want to use smp_store_release() and
> smp_load_acquire() instead?
> 
> Something like so perhaps?

Given the discussion this looks fine to me unless Eric spots some issues I'm
too blind to see. So for whatever this is worth this patch would get an ack from
me.

> 
> ---
>  kernel/user_namespace.c | 73 ++++++++++++++++++++++++++++++-------------------
>  1 file changed, 45 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index c490f1e4313b..f758911cabd5 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -25,8 +25,47 @@
>  #include <linux/fs_struct.h>
>  
>  static struct kmem_cache *user_ns_cachep __read_mostly;
> +
> +/*
> + * The userns_state_mutex serializes all writes to any given map.
> + *
> + * Any map is only ever written once.
> + *
> + * An id map fits within 1 cache line on most architectures.
> + */
>  static DEFINE_MUTEX(userns_state_mutex);
>  
> +/*
> + *
> + * There is a one time data dependency between reading the count of the extents
> + * and the values of the extents.  The desired behavior is to see the values of
> + * the extents that were written before the count of the extents.
> + *
> + * To achieve this smp_store_release() is used to guarantee the write order
> + * and smp_load_acquire() guarantees that weakly ordered architectures don't
> + * return stale data.
> + */
> +static inline void map_store_extents(struct uid_gid_map *map, unsigned int extents)
> +{
> +	/*
> +	 * Ensure the map->extent[] stores happen-before we grow map->nr_extents
> +	 * to cover it. Matches the load_acquire in map_load_extents().
> +	 */
> +	smp_store_release(&map->nr_extents, extents);
> +}
> +
> +static inline unsigned int map_load_extents(struct uid_gid_map *map)
> +{
> +	/*
> +	 * Ensure the map->nr_extents load happens-before we try and access
> +	 * map->extent[], such that we guarantee the data is in fact there.
> +	 *
> +	 * Matches the store-release in map_store_extents().
> +	 */
> +	return smp_load_acquire(&map->nr_extents);
> +}
> +
> +
>  static bool new_idmap_permitted(const struct file *file,
>  				struct user_namespace *ns, int cap_setid,
>  				struct uid_gid_map *map);
> @@ -206,8 +245,7 @@ static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
>  	id2 = id + count - 1;
>  
>  	/* Find the matching extent */
> -	extents = map->nr_extents;
> -	smp_rmb();
> +	extents = map_load_extents(map);
>  	for (idx = 0; idx < extents; idx++) {
>  		first = map->extent[idx].first;
>  		last = first + map->extent[idx].count - 1;
> @@ -230,8 +268,7 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id)
>  	u32 first, last;
>  
>  	/* Find the matching extent */
> -	extents = map->nr_extents;
> -	smp_rmb();
> +	extents = map_load_extents(map);
>  	for (idx = 0; idx < extents; idx++) {
>  		first = map->extent[idx].first;
>  		last = first + map->extent[idx].count - 1;
> @@ -253,8 +290,7 @@ static u32 map_id_up(struct uid_gid_map *map, u32 id)
>  	u32 first, last;
>  
>  	/* Find the matching extent */
> -	extents = map->nr_extents;
> -	smp_rmb();
> +	extents = map_load_extents(map);
>  	for (idx = 0; idx < extents; idx++) {
>  		first = map->extent[idx].lower_first;
>  		last = first + map->extent[idx].count - 1;
> @@ -543,7 +579,7 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
>  	struct uid_gid_extent *extent = NULL;
>  	loff_t pos = *ppos;
>  
> -	if (pos < map->nr_extents)
> +	if (pos < map_load_extents(map))
>  		extent = &map->extent[pos];
>  
>  	return extent;
> @@ -652,25 +688,6 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>  	char *kbuf = NULL, *pos, *next_line;
>  	ssize_t ret = -EINVAL;
>  
> -	/*
> -	 * The userns_state_mutex serializes all writes to any given map.
> -	 *
> -	 * Any map is only ever written once.
> -	 *
> -	 * An id map fits within 1 cache line on most architectures.
> -	 *
> -	 * On read nothing needs to be done unless you are on an
> -	 * architecture with a crazy cache coherency model like alpha.
> -	 *
> -	 * There is a one time data dependency between reading the
> -	 * count of the extents and the values of the extents.  The
> -	 * desired behavior is to see the values of the extents that
> -	 * were written before the count of the extents.
> -	 *
> -	 * To achieve this smp_wmb() is used on guarantee the write
> -	 * order and smp_rmb() is guaranteed that we don't have crazy
> -	 * architectures returning stale data.
> -	 */
>  	mutex_lock(&userns_state_mutex);
>  
>  	ret = -EPERM;
> @@ -790,8 +807,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>  	/* Install the map */
>  	memcpy(map->extent, new_map.extent,
>  		new_map.nr_extents*sizeof(new_map.extent[0]));
> -	smp_wmb();
> -	map->nr_extents = new_map.nr_extents;
> +
> +	map_store_extents(map, new_map.nr_extents);
>  
>  	*ppos = count;
>  	ret = count;


* Re: [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-11-01 11:08         ` Eric W. Biederman
  2017-11-01 13:05           ` Nikolay Borisov
  2017-11-01 13:05           ` Peter Zijlstra
@ 2017-11-01 17:00           ` Joe Perches
  2017-11-01 17:20             ` Eric W. Biederman
  2 siblings, 1 reply; 22+ messages in thread
From: Joe Perches @ 2017-11-01 17:00 UTC (permalink / raw)
  To: Eric W. Biederman, Nikolay Borisov
  Cc: Christian Brauner, Linux Containers, tycho, linux-kernel

On Wed, 2017-11-01 at 06:08 -0500, Eric W. Biederman wrote:
> I won't listen to checkpatch when it is wrong.

Always a good idea.

btw: what is checkpatch wrong about this time?


* Re: [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-11-01 17:00           ` Joe Perches
@ 2017-11-01 17:20             ` Eric W. Biederman
  2017-11-01 18:15               ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2017-11-01 17:20 UTC (permalink / raw)
  To: Joe Perches
  Cc: Nikolay Borisov, Christian Brauner, Linux Containers, tycho,
	linux-kernel, Peter Zijlstra

Joe Perches <joe@perches.com> writes:

> On Wed, 2017-11-01 at 06:08 -0500, Eric W. Biederman wrote:
>> I won't listen to checkpatch when it is wrong.
>
> Always a good idea.
>
> btw: what is checkpatch wrong about this time?

Well the way I was hearing the conversation was that there was a patch
that fixed a real bug, but it was wrong because checkpatch complained
about it.

So I don't even know if the warning is a problem.  But blocking bug
fixes because there is a warning certainly is.

If someone wants to change coding style in practice so that every
smp_rmb and every smp_wmb has detailed comments that everyone must
include they need to follow the usual rule and update the entire kernel
when making an interface change.  As that did not happen I don't see
any problems with incremental updates in the style the code is already
in.

Not that I will mind a patch that updates the code, but I am not going
to hold up a perfectly good bug fix waiting for one either.

Eric


* Re: [PATCH 3/5] userns: Don't read extents twice in m_start
  2017-11-01 17:20             ` Eric W. Biederman
@ 2017-11-01 18:15               ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2017-11-01 18:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Joe Perches, Nikolay Borisov, Christian Brauner,
	Linux Containers, tycho, linux-kernel

On Wed, Nov 01, 2017 at 12:20:52PM -0500, Eric W. Biederman wrote:

> Well the way I was hearing the conversation was that there was a patch
> that fixed a real bug, but it was wrong because checkpatch complained
> about it.
> 
> So I don't even know if the warning is a problem.  But blocking bug
> fixes because there is a warning certainly is.
> 
> If someone wants to change coding style in practice so that every
> smp_rmb and every smp_wmb has detailed comments that everyone must
> include they need to follow the usual rule and update the entire kernel
> when making an interface change.  As that did not happen I don't see
> any problems with incremental updates in the style the code is already
> in.

Reverse engineering all the memory barriers in the code is a massive
undertaking. We are looking at it, but it's slow progress. Mostly an
update when touched approach.

Although some searching for dodgy patterns; see for example commit:

  a7af0af32171 ("blk-mq: attempt to fix atomic flag memory ordering")

which was the result of people looking at creative
smp_mb__{before,after}_atomic() usage -- in specific use of those
barriers not immediately adjacent to the respective atomic operation.

But to use the lack of completion of that work to not write comments on
code you understand is complete bollocks though.

> Not that I will mind a patch that updates the code, but I am not going
> to hold up a perfectly good bug fix waiting for one either.

If both you and the submitter actually understand the ordering, asking
them to write the comment isn't onerous and should not hold up anything
much at all.

In fact, if you cannot write that comment you cannot claim it to be a
good fix. And I'm saying the lack of the comment means its not a
perfectly good fix at all, since a perfect fix would indeed explain
memory ordering.


end of thread, other threads:[~2017-11-01 18:16 UTC | newest]

Thread overview: 22+ messages
2017-10-24 22:04 [PATCH 1/2 v6] user namespace: use union in {g,u}idmap struct Christian Brauner
2017-10-24 22:04 ` [PATCH 2/2 v6] user namespaces: bump idmap limits to 340 Christian Brauner
2017-10-31 23:46   ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Eric W. Biederman
2017-10-31 23:47     ` [PATCH 1/5] userns: Don't special case a count of 0 Eric W. Biederman
2017-10-31 23:47     ` [PATCH 2/5] userns: Simplify the user and group mapping functions Eric W. Biederman
2017-10-31 23:48     ` [PATCH 3/5] userns: Don't read extents twice in m_start Eric W. Biederman
2017-11-01  8:31       ` Nikolay Borisov
2017-11-01 11:08         ` Eric W. Biederman
2017-11-01 13:05           ` Nikolay Borisov
2017-11-01 13:05           ` Peter Zijlstra
2017-11-01 14:01             ` Christian Brauner
2017-11-01 14:16               ` Peter Zijlstra
2017-11-01 16:29                 ` Christian Brauner
2017-11-01 16:31             ` Christian Brauner
2017-11-01 17:00           ` Joe Perches
2017-11-01 17:20             ` Eric W. Biederman
2017-11-01 18:15               ` Peter Zijlstra
2017-10-31 23:48     ` [PATCH 4/5] userns: Make map_id_down a wrapper for map_id_range_down Eric W. Biederman
2017-10-31 23:49     ` [PATCH 5/5] userns: Simplify insert_extent Eric W. Biederman
2017-11-01 10:51     ` [PATCH 0/5] userns: bump idmap limits, fixes & tweaks Christian Brauner
2017-11-01 11:15       ` Eric W. Biederman
2017-11-01 13:31         ` Christian Brauner
