linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/6] ipc: Clamp *mni to the real IPCMNI limit
@ 2018-03-12 20:15 Waiman Long
  2018-03-12 20:15 ` [PATCH v4 1/6] sysctl: Add flags to support min/max range clamping Waiman Long
                   ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: Waiman Long @ 2018-03-12 20:15 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox, Waiman Long

v3->v4:
 - Remove v3 patches 1 & 2 as they have been merged into the mm tree.
 - Change flags from uint16_t to unsigned int.
 - Remove CTL_FLAGS_OOR_WARNED and use pr_warn_ratelimited() instead.
 - Simplify the warning message code.
 - Add a new patch to fail the ctl_table registration with invalid flag.
 - Add a test case for range clamping in sysctl selftest.

v2->v3:
 - Fix kdoc comment errors.
 - Incorporate comments and suggestions from Luis R. Rodriguez.
 - Add a patch to fix a typo error in fs/proc/proc_sysctl.c.

v1->v2:
 - Add kdoc comments to the do_proc_do{u}intvec_minmax_conv_param
   structures.
 - Add a new flags field to the ctl_table structure for specifying
   whether range clamping should be activated instead of adding new
   sysctl parameter handlers.
 - Clamp the semmni value embedded in the multi-values sem parameter.

v1 patch: https://lkml.org/lkml/2018/2/19/453
v2 patch: https://lkml.org/lkml/2018/2/27/627

The sysctl parameters msgmni, shmmni and semmni have an inherent limit
of IPC_MNI (32k). However, users may not be aware of that because they
can write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.

Enforcing the limit by failing sysctl parameter write, however, can
break existing user applications. To address this delemma, a new flags
field is introduced into the ctl_table. The value CTL_FLAGS_CLAMP_RANGE
can be added to any ctl_table entries to enable a looser range clamping
without returning any error. For example,

  .flags = CTL_FLAGS_CLAMP_RANGE,

This flags value are now used for the range checking of shmmni,
msgmni and semmni without breaking existing applications. If any out
of range value is written to those sysctl parameters, the following
warning will be printed instead.

  sysctl: "shmmni" was set out of range [0, 32768], clamped to 32768.

Reading the values back will show 32768 instead of some fake values.

Waiman Long (6):
  sysctl: Add flags to support min/max range clamping
  proc/sysctl: Check for invalid flags bits
  sysctl: Warn when a clamped sysctl parameter is set out of range
  ipc: Clamp msgmni and shmmni to the real IPCMNI limit
  ipc: Clamp semmni to the real IPCMNI limit
  test_sysctl: Add range clamping test

 fs/proc/proc_sysctl.c                    | 12 +++++
 include/linux/sysctl.h                   | 15 ++++++
 ipc/ipc_sysctl.c                         | 22 +++++++--
 ipc/sem.c                                | 28 +++++++++++
 ipc/util.h                               |  4 ++
 kernel/sysctl.c                          | 80 ++++++++++++++++++++++++++++----
 tools/testing/selftests/sysctl/sysctl.sh | 43 +++++++++++++++++
 7 files changed, 192 insertions(+), 12 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v4 1/6] sysctl: Add flags to support min/max range clamping
  2018-03-12 20:15 [PATCH v4 0/6] ipc: Clamp *mni to the real IPCMNI limit Waiman Long
@ 2018-03-12 20:15 ` Waiman Long
  2018-03-12 20:44   ` Luis R. Rodriguez
  2018-03-13 17:46   ` Eric W. Biederman
  2018-03-12 20:15 ` [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits Waiman Long
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 42+ messages in thread
From: Waiman Long @ 2018-03-12 20:15 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox, Waiman Long

When minimum/maximum values are specified for a sysctl parameter in
the ctl_table structure with proc_dointvec_minmax() handler, update
to that parameter will fail with error if the given value is outside
of the required range.

There are use cases where it may be better to clamp the value of
the sysctl parameter to the given range without failing the update,
especially if the users are not aware of the actual range limits.
Reading the value back after the update will now be a good practice
to see if the provided value exceeds the range limits.

To provide this less restrictive form of range checking, a new flags
field is added to the ctl_table structure.

When the CTL_FLAGS_CLAMP_RANGE flag is set in the ctl_table
entry, any update from the userspace will be clamped to the given
range without error if either the proc_dointvec_minmax() or the
proc_douintvec_minmax() handlers is used.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/sysctl.h | 15 +++++++++++++++
 kernel/sysctl.c        | 48 +++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index b769ecf..963e363 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -116,6 +116,7 @@ struct ctl_table
 	void *data;
 	int maxlen;
 	umode_t mode;
+	unsigned int flags;
 	struct ctl_table *child;	/* Deprecated */
 	proc_handler *proc_handler;	/* Callback for text formatting */
 	struct ctl_table_poll *poll;
@@ -123,6 +124,20 @@ struct ctl_table
 	void *extra2;
 } __randomize_layout;
 
+/**
+ * enum ctl_table_flags - flags for the ctl table (struct ctl_table.flags)
+ *
+ * @CTL_FLAGS_CLAMP_RANGE: Set to indicate that the entry should be
+ *	flexibly clamped to min/max range in case the user provided
+ *	an incorrect value.
+ */
+enum ctl_table_flags {
+	CTL_FLAGS_CLAMP_RANGE		= BIT(0),
+	__CTL_FLAGS_MAX			= BIT(1),
+};
+
+#define CTL_TABLE_FLAGS_ALL	(__CTL_FLAGS_MAX - 1)
+
 struct ctl_node {
 	struct rb_node node;
 	struct ctl_table_header *header;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d2aa6b4..3d65f41 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2504,6 +2504,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
  * struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax() range checking structure
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
+ * @flags: pointer to flags
  *
  * The do_proc_dointvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2512,6 +2513,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
 struct do_proc_dointvec_minmax_conv_param {
 	int *min;
 	int *max;
+	unsigned int *flags;
 };
 
 static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
@@ -2521,9 +2523,21 @@ static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
 	struct do_proc_dointvec_minmax_conv_param *param = data;
 	if (write) {
 		int val = *negp ? -*lvalp : *lvalp;
-		if ((param->min && *param->min > val) ||
-		    (param->max && *param->max < val))
-			return -EINVAL;
+		bool clamp = param->flags &&
+			   (*param->flags & CTL_FLAGS_CLAMP_RANGE);
+
+		if (param->min && *param->min > val) {
+			if (clamp)
+				val = *param->min;
+			else
+				return -EINVAL;
+		}
+		if (param->max && *param->max < val) {
+			if (clamp)
+				val = *param->max;
+			else
+				return -EINVAL;
+		}
 		*valp = val;
 	} else {
 		int val = *valp;
@@ -2552,7 +2566,8 @@ static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
  * This routine will ensure the values are within the range specified by
  * table->extra1 (min) and table->extra2 (max).
  *
- * Returns 0 on success or -EINVAL on write when the range check fails.
+ * Returns 0 on success or -EINVAL on write when the range check fails
+ * without the CTL_FLAGS_CLAMP_RANGE flag.
  */
 int proc_dointvec_minmax(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos)
@@ -2560,6 +2575,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
 	struct do_proc_dointvec_minmax_conv_param param = {
 		.min = (int *) table->extra1,
 		.max = (int *) table->extra2,
+		.flags = &table->flags,
 	};
 	return do_proc_dointvec(table, write, buffer, lenp, ppos,
 				do_proc_dointvec_minmax_conv, &param);
@@ -2569,6 +2585,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
  * struct do_proc_douintvec_minmax_conv_param - proc_douintvec_minmax() range checking structure
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
+ * @flags: pointer to flags
  *
  * The do_proc_douintvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2577,6 +2594,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
 struct do_proc_douintvec_minmax_conv_param {
 	unsigned int *min;
 	unsigned int *max;
+	unsigned int *flags;
 };
 
 static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
@@ -2587,14 +2605,24 @@ static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
 
 	if (write) {
 		unsigned int val = *lvalp;
+		bool clamp = param->flags &&
+			   (*param->flags & CTL_FLAGS_CLAMP_RANGE);
 
 		if (*lvalp > UINT_MAX)
 			return -EINVAL;
 
-		if ((param->min && *param->min > val) ||
-		    (param->max && *param->max < val))
-			return -ERANGE;
-
+		if (param->min && *param->min > val) {
+			if (clamp)
+				val = *param->min;
+			else
+				return -ERANGE;
+		}
+		if (param->max && *param->max < val) {
+			if (clamp)
+				val = *param->max;
+			else
+				return -ERANGE;
+		}
 		*valp = val;
 	} else {
 		unsigned int val = *valp;
@@ -2621,7 +2649,8 @@ static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
  * check for UINT_MAX to avoid having to support wrap around uses from
  * userspace.
  *
- * Returns 0 on success or -ERANGE on write when the range check fails.
+ * Returns 0 on success or -ERANGE on write when the range check fails
+ * without the CTL_FLAGS_CLAMP_RANGE flag.
  */
 int proc_douintvec_minmax(struct ctl_table *table, int write,
 			  void __user *buffer, size_t *lenp, loff_t *ppos)
@@ -2629,6 +2658,7 @@ int proc_douintvec_minmax(struct ctl_table *table, int write,
 	struct do_proc_douintvec_minmax_conv_param param = {
 		.min = (unsigned int *) table->extra1,
 		.max = (unsigned int *) table->extra2,
+		.flags = &table->flags,
 	};
 	return do_proc_douintvec(table, write, buffer, lenp, ppos,
 				 do_proc_douintvec_minmax_conv, &param);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits
  2018-03-12 20:15 [PATCH v4 0/6] ipc: Clamp *mni to the real IPCMNI limit Waiman Long
  2018-03-12 20:15 ` [PATCH v4 1/6] sysctl: Add flags to support min/max range clamping Waiman Long
@ 2018-03-12 20:15 ` Waiman Long
  2018-03-12 20:46   ` Luis R. Rodriguez
  2018-03-12 20:52   ` Andrew Morton
  2018-03-12 20:15 ` [PATCH v4 3/6] sysctl: Warn when a clamped sysctl parameter is set out of range Waiman Long
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 42+ messages in thread
From: Waiman Long @ 2018-03-12 20:15 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox, Waiman Long

Checking code is added to check for invalid flags in the ctl_table
and return error if an unknown flag is used.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/proc/proc_sysctl.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 493c975..67c0c82 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -1092,6 +1092,16 @@ static int sysctl_check_table_array(const char *path, struct ctl_table *table)
 	return err;
 }
 
+static int sysctl_check_flags(const char *path, struct ctl_table *table)
+{
+	int err = 0;
+
+	if (table->flags & ~CTL_TABLE_FLAGS_ALL)
+		err = sysctl_err(path, table, "invalid flags");
+
+	return err;
+}
+
 static int sysctl_check_table(const char *path, struct ctl_table *table)
 {
 	int err = 0;
@@ -1111,6 +1121,8 @@ static int sysctl_check_table(const char *path, struct ctl_table *table)
 		    (table->proc_handler == proc_doulongvec_ms_jiffies_minmax)) {
 			if (!table->data)
 				err |= sysctl_err(path, table, "No data");
+			if (table->flags)
+				err |= sysctl_check_flags(path, table);
 			if (!table->maxlen)
 				err |= sysctl_err(path, table, "No maxlen");
 			else
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 3/6] sysctl: Warn when a clamped sysctl parameter is set out of range
  2018-03-12 20:15 [PATCH v4 0/6] ipc: Clamp *mni to the real IPCMNI limit Waiman Long
  2018-03-12 20:15 ` [PATCH v4 1/6] sysctl: Add flags to support min/max range clamping Waiman Long
  2018-03-12 20:15 ` [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits Waiman Long
@ 2018-03-12 20:15 ` Waiman Long
  2018-03-12 20:50   ` Luis R. Rodriguez
  2018-03-12 21:00   ` Andrew Morton
  2018-03-12 20:15 ` [PATCH v4 4/6] ipc: Clamp msgmni and shmmni to the real IPCMNI limit Waiman Long
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 42+ messages in thread
From: Waiman Long @ 2018-03-12 20:15 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox, Waiman Long

Even with clamped sysctl parameters, it is still not that straight
forward to figure out the exact range of those parameters. One may
try to write extreme parameter values to see if they get clamped.
To make it easier, a warning with the expected range will now be
printed into the kernel ring buffer when a clamped sysctl parameter
receives an out of range value.

The pr_warn_ratelimited() macro is used to limit the number of warning
messages that can be printed within a given period of time.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/sysctl.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3d65f41..14aca92 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2505,6 +2505,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
  * @flags: pointer to flags
+ * @name: sysctl parameter name
  *
  * The do_proc_dointvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2514,31 +2515,48 @@ struct do_proc_dointvec_minmax_conv_param {
 	int *min;
 	int *max;
 	unsigned int *flags;
+	const char *name;
 };
 
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+#define pr_fmt(fmt)	"sysctl: " fmt
+
 static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
 					int *valp,
 					int write, void *data)
 {
 	struct do_proc_dointvec_minmax_conv_param *param = data;
+
 	if (write) {
 		int val = *negp ? -*lvalp : *lvalp;
+		bool clamped = false;
 		bool clamp = param->flags &&
 			   (*param->flags & CTL_FLAGS_CLAMP_RANGE);
 
 		if (param->min && *param->min > val) {
-			if (clamp)
+			if (clamp) {
 				val = *param->min;
-			else
+				clamped = true;
+			} else {
 				return -EINVAL;
+			}
 		}
 		if (param->max && *param->max < val) {
-			if (clamp)
+			if (clamp) {
 				val = *param->max;
-			else
+				clamped = true;
+			} else {
 				return -EINVAL;
+			}
 		}
 		*valp = val;
+		if (clamped && param->name)
+			pr_warn_ratelimited("\"%s\" was set out of range [%d, %d], clamped to %d.\n",
+				param->name,
+				param->min ? *param->min : -INT_MAX,
+				param->max ? *param->max :  INT_MAX, val);
 	} else {
 		int val = *valp;
 		if (val < 0) {
@@ -2576,6 +2594,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
 		.min = (int *) table->extra1,
 		.max = (int *) table->extra2,
 		.flags = &table->flags,
+		.name  = table->procname,
 	};
 	return do_proc_dointvec(table, write, buffer, lenp, ppos,
 				do_proc_dointvec_minmax_conv, &param);
@@ -2586,6 +2605,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
  * @flags: pointer to flags
+ * @name: sysctl parameter name
  *
  * The do_proc_douintvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2595,6 +2615,7 @@ struct do_proc_douintvec_minmax_conv_param {
 	unsigned int *min;
 	unsigned int *max;
 	unsigned int *flags;
+	const char *name;
 };
 
 static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
@@ -2605,6 +2626,7 @@ static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
 
 	if (write) {
 		unsigned int val = *lvalp;
+		bool clamped = false;
 		bool clamp = param->flags &&
 			   (*param->flags & CTL_FLAGS_CLAMP_RANGE);
 
@@ -2612,18 +2634,27 @@ static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
 			return -EINVAL;
 
 		if (param->min && *param->min > val) {
-			if (clamp)
+			if (clamp) {
 				val = *param->min;
-			else
+				clamped = true;
+			} else {
 				return -ERANGE;
+			}
 		}
 		if (param->max && *param->max < val) {
-			if (clamp)
+			if (clamp) {
 				val = *param->max;
-			else
+				clamped = true;
+			} else {
 				return -ERANGE;
+			}
 		}
 		*valp = val;
+		if (clamped && param->name)
+			pr_warn_ratelimited("\"%s\" was set out of range [%u, %u], clamped to %u.\n",
+				param->name,
+				param->min ? *param->min : 0,
+				param->max ? *param->max : UINT_MAX, val);
 	} else {
 		unsigned int val = *valp;
 		*lvalp = (unsigned long) val;
@@ -2659,6 +2690,7 @@ int proc_douintvec_minmax(struct ctl_table *table, int write,
 		.min = (unsigned int *) table->extra1,
 		.max = (unsigned int *) table->extra2,
 		.flags = &table->flags,
+		.name  = table->procname,
 	};
 	return do_proc_douintvec(table, write, buffer, lenp, ppos,
 				 do_proc_douintvec_minmax_conv, &param);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 4/6] ipc: Clamp msgmni and shmmni to the real IPCMNI limit
  2018-03-12 20:15 [PATCH v4 0/6] ipc: Clamp *mni to the real IPCMNI limit Waiman Long
                   ` (2 preceding siblings ...)
  2018-03-12 20:15 ` [PATCH v4 3/6] sysctl: Warn when a clamped sysctl parameter is set out of range Waiman Long
@ 2018-03-12 20:15 ` Waiman Long
  2018-03-13 18:17   ` Eric W. Biederman
  2018-03-12 20:15 ` [PATCH v4 5/6] ipc: Clamp semmni to the real IPCMNI limit Waiman Long
  2018-03-12 20:15 ` [PATCH v4 6/6] test_sysctl: Add range clamping test Waiman Long
  5 siblings, 1 reply; 42+ messages in thread
From: Waiman Long @ 2018-03-12 20:15 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox, Waiman Long

A user can write arbitrary integer values to msgmni and shmmni sysctl
parameters without getting error, but the actual limit is really
IPCMNI (32k). This can mislead users as they think they can get a
value that is not real.

Enforcing the limit by failing the sysctl parameter write, however,
can break existing user applications. Instead, the range clamping flag
is set to enforce the limit without failing existing user code. Users
can easily figure out if the sysctl parameter value is out of range
by either reading back the parameter value or checking the kernel
ring buffer for warning.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 ipc/ipc_sysctl.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 8ad93c2..1955dd4 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -99,6 +99,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
+static int ipc_mni = IPCMNI;
 
 static struct ctl_table ipc_kern_table[] = {
 	{
@@ -120,7 +121,10 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 		.data		= &init_ipc_ns.shm_ctlmni,
 		.maxlen		= sizeof(init_ipc_ns.shm_ctlmni),
 		.mode		= 0644,
-		.proc_handler	= proc_ipc_dointvec,
+		.proc_handler	= proc_ipc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &ipc_mni,
+		.flags		= CTL_FLAGS_CLAMP_RANGE,
 	},
 	{
 		.procname	= "shm_rmid_forced",
@@ -147,7 +151,8 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 		.mode		= 0644,
 		.proc_handler	= proc_ipc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &int_max,
+		.extra2		= &ipc_mni,
+		.flags		= CTL_FLAGS_CLAMP_RANGE,
 	},
 	{
 		.procname	= "auto_msgmni",
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 5/6] ipc: Clamp semmni to the real IPCMNI limit
  2018-03-12 20:15 [PATCH v4 0/6] ipc: Clamp *mni to the real IPCMNI limit Waiman Long
                   ` (3 preceding siblings ...)
  2018-03-12 20:15 ` [PATCH v4 4/6] ipc: Clamp msgmni and shmmni to the real IPCMNI limit Waiman Long
@ 2018-03-12 20:15 ` Waiman Long
  2018-03-12 20:52   ` Luis R. Rodriguez
  2018-03-12 20:15 ` [PATCH v4 6/6] test_sysctl: Add range clamping test Waiman Long
  5 siblings, 1 reply; 42+ messages in thread
From: Waiman Long @ 2018-03-12 20:15 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox, Waiman Long

This patch clamps the semmni value (fourth element of sem_ctls[]
array) to within the [0, IPCMNI] range and prints a warning message
once when an out-of-range value is being written.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 ipc/ipc_sysctl.c | 13 ++++++++++++-
 ipc/sem.c        | 28 ++++++++++++++++++++++++++++
 ipc/util.h       |  4 ++++
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 1955dd4..49cbc43 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -88,12 +88,22 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 	return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
 }
 
+static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
+	void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret = proc_ipc_dointvec(table, write, buffer, lenp, ppos);
+
+	sem_check_semmni(table, current->nsproxy->ipc_ns);
+	return ret;
+}
+
 #else
 #define proc_ipc_doulongvec_minmax NULL
 #define proc_ipc_dointvec	   NULL
 #define proc_ipc_dointvec_minmax   NULL
 #define proc_ipc_dointvec_minmax_orphans   NULL
 #define proc_ipc_auto_msgmni	   NULL
+#define proc_ipc_sem_dointvec	   NULL
 #endif
 
 static int zero;
@@ -177,7 +187,8 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 		.data		= &init_ipc_ns.sem_ctls,
 		.maxlen		= 4*sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_ipc_dointvec,
+		.proc_handler	= proc_ipc_sem_dointvec,
+		.flags		= CTL_FLAGS_CLAMP_RANGE,
 	},
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	{
diff --git a/ipc/sem.c b/ipc/sem.c
index a4af049..2b47bc3 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2337,3 +2337,31 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 	return 0;
 }
 #endif
+
+#ifdef CONFIG_PROC_SYSCTL
+/*
+ * Check to see if semmni is out of range and clamp it if necessary.
+ */
+void sem_check_semmni(struct ctl_table *table, struct ipc_namespace *ns)
+{
+	bool clamped = false;
+
+	if (!(table->flags & CTL_FLAGS_CLAMP_RANGE))
+		return;
+
+	/*
+	 * Clamp semmni to the range [0, IPCMNI].
+	 */
+	if (ns->sc_semmni < 0) {
+		ns->sc_semmni = 0;
+		clamped = true;
+	}
+	if (ns->sc_semmni > IPCMNI) {
+		ns->sc_semmni = IPCMNI;
+		clamped = true;
+	}
+	if (clamped)
+		pr_warn_once("sysctl: \"sem[3]\" was set out of range [%d, %d], clamped to %d.\n",
+			     0, IPCMNI, ns->sc_semmni);
+}
+#endif
diff --git a/ipc/util.h b/ipc/util.h
index 89b8ec1..af57394 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -206,6 +206,10 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 		void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
 
+#ifdef CONFIG_PROC_SYSCTL
+extern void sem_check_semmni(struct ctl_table *table, struct ipc_namespace *ns);
+#endif
+
 #ifdef CONFIG_COMPAT
 #include <linux/compat.h>
 struct compat_ipc_perm {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH v4 6/6] test_sysctl: Add range clamping test
  2018-03-12 20:15 [PATCH v4 0/6] ipc: Clamp *mni to the real IPCMNI limit Waiman Long
                   ` (4 preceding siblings ...)
  2018-03-12 20:15 ` [PATCH v4 5/6] ipc: Clamp semmni to the real IPCMNI limit Waiman Long
@ 2018-03-12 20:15 ` Waiman Long
  2018-03-12 20:53   ` Luis R. Rodriguez
  5 siblings, 1 reply; 42+ messages in thread
From: Waiman Long @ 2018-03-12 20:15 UTC (permalink / raw)
  To: Luis R. Rodriguez, Kees Cook
  Cc: linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox, Waiman Long

Add a range clamping test for the msgmni sysctl parameter to verify
that the input value will be clamped if it exceeds the builtin maximum
or minimum value.

Below is the expected test run result:

Running test: sysctl_test_0006 - run #0
Checking range minimum clamping ... ok
Checking range maximum clamping ... ok

Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/testing/selftests/sysctl/sysctl.sh | 43 ++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/tools/testing/selftests/sysctl/sysctl.sh b/tools/testing/selftests/sysctl/sysctl.sh
index ec232c3..fbf9d73 100755
--- a/tools/testing/selftests/sysctl/sysctl.sh
+++ b/tools/testing/selftests/sysctl/sysctl.sh
@@ -34,6 +34,7 @@ ALL_TESTS="$ALL_TESTS 0002:1:1"
 ALL_TESTS="$ALL_TESTS 0003:1:1"
 ALL_TESTS="$ALL_TESTS 0004:1:1"
 ALL_TESTS="$ALL_TESTS 0005:3:1"
+ALL_TESTS="$ALL_TESTS 0006:1:1"
 
 test_modprobe()
 {
@@ -62,6 +63,9 @@ function allow_user_defaults()
 	if [ -z $WRITES_STRICT ]; then
 		WRITES_STRICT="${PROD_SYSCTL}/kernel/sysctl_writes_strict"
 	fi
+	if [ -z $MSGMNI ]; then
+		MSGMNI="${PROD_SYSCTL}/kernel/msgmni"
+	fi
 }
 
 function check_production_sysctl_writes_strict()
@@ -543,6 +547,34 @@ run_stringtests()
 	test_rc
 }
 
+# TARGET, BEYOND_MIN & BEYOND_MAX need to be defined before running test.
+run_range_clamping_test()
+{
+	echo -n "Checking range minimum clamping ... "
+	echo $BEYOND_MIN > "$TARGET" > /dev/null 2>&1
+	EXITVAL=$?
+	NEWVAL=$(cat "$TARGET")
+	if [[ $EXITVAL -ne 0 || $NEWVAL -le $BEYOND_MIN ]]; then
+		echo "FAIL" >&2
+		rc=1
+	else
+		echo "ok"
+	fi
+
+	echo -n "Checking range maximum clamping ... "
+	echo $BEYOND_MAX > "$TARGET" > /dev/null 2>&1
+	EXITVAL=$?
+	NEWVAL=$(cat "$TARGET")
+	if [[ $EXITVAL -ne 0 || $NEWVAL -ge $BEYOND_MAX ]]; then
+		echo "FAIL" >&2
+		rc=1
+	else
+		echo "ok"
+	fi
+
+	test_rc
+}
+
 sysctl_test_0001()
 {
 	TARGET="${SYSCTL}/int_0001"
@@ -600,6 +632,17 @@ sysctl_test_0005()
 	run_limit_digit_int_array
 }
 
+sysctl_test_0006()
+{
+	TARGET="${MSGMNI}"
+	ORIG=$(cat "${TARGET}")
+	BEYOND_MIN=-1
+	BEYOND_MAX=1000000000
+
+	run_range_clamping_test
+	set_orig
+}
+
 list_tests()
 {
 	echo "Test ID list:"
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] sysctl: Add flags to support min/max range clamping
  2018-03-12 20:15 ` [PATCH v4 1/6] sysctl: Add flags to support min/max range clamping Waiman Long
@ 2018-03-12 20:44   ` Luis R. Rodriguez
  2018-03-12 20:48     ` Waiman Long
  2018-03-13 17:46   ` Eric W. Biederman
  1 sibling, 1 reply; 42+ messages in thread
From: Luis R. Rodriguez @ 2018-03-12 20:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

On Mon, Mar 12, 2018 at 04:15:39PM -0400, Waiman Long wrote:
> When minimum/maximum values are specified for a sysctl parameter in
> the ctl_table structure with proc_dointvec_minmax() handler, update
> to that parameter will fail with error if the given value is outside
> of the required range.
> 
> There are use cases where it may be better to clamp the value of
> the sysctl parameter to the given range without failing the update,
> especially if the users are not aware of the actual range limits.
> Reading the value back after the update will now be a good practice
> to see if the provided value exceeds the range limits.
> 
> To provide this less restrictive form of range checking, a new flags
> field is added to the ctl_table structure.
> 
> When the CTL_FLAGS_CLAMP_RANGE flag is set in the ctl_table
> entry, any update from the userspace will be clamped to the given
> range without error if either the proc_dointvec_minmax() or the
> proc_douintvec_minmax() handlers is used.

You keep missing to document both on commit log and kdoc which
end on the range is selected if you happen to go over. To be clear
it is unclear from reading the commit log if a default is set if
you go over if you pick another value. In this case it is conditional
if you go over we pick the high range max, and if you go below the
lower range minimum set allowed.

What happens if no low range is set and that is what the issue in
terms of range triggers?

> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/sysctl.h | 15 +++++++++++++++
>  kernel/sysctl.c        | 48 +++++++++++++++++++++++++++++++++++++++---------
>  2 files changed, 54 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index b769ecf..963e363 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -116,6 +116,7 @@ struct ctl_table
>  	void *data;
>  	int maxlen;
>  	umode_t mode;
> +	unsigned int flags;
>  	struct ctl_table *child;	/* Deprecated */
>  	proc_handler *proc_handler;	/* Callback for text formatting */
>  	struct ctl_table_poll *poll;
> @@ -123,6 +124,20 @@ struct ctl_table
>  	void *extra2;
>  } __randomize_layout;
>  
> +/**
> + * enum ctl_table_flags - flags for the ctl table (struct ctl_table.flags)
> + *
> + * @CTL_FLAGS_CLAMP_RANGE: Set to indicate that the entry should be
> + *	flexibly clamped to min/max range in case the user provided
> + *	an incorrect value.

It should be documented clearly here.

So NACK for now.

  Luis

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits
  2018-03-12 20:15 ` [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits Waiman Long
@ 2018-03-12 20:46   ` Luis R. Rodriguez
  2018-03-12 20:54     ` Waiman Long
  2018-03-12 20:52   ` Andrew Morton
  1 sibling, 1 reply; 42+ messages in thread
From: Luis R. Rodriguez @ 2018-03-12 20:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

On Mon, Mar 12, 2018 at 04:15:40PM -0400, Waiman Long wrote:
> Checking code is added to check for invalid flags in the ctl_table
> and return error if an unknown flag is used.

This should be merged with the first patch otherwise there are atomic
points in time on the commit log history where invalid values are allowed
and that makes no sense.

This can probably be expanded to verify semantics further. Details
below.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  fs/proc/proc_sysctl.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index 493c975..67c0c82 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -1092,6 +1092,16 @@ static int sysctl_check_table_array(const char *path, struct ctl_table *table)
>  	return err;
>  }
>  
> +static int sysctl_check_flags(const char *path, struct ctl_table *table)
> +{
> +	int err = 0;
> +
> +	if (table->flags & ~CTL_TABLE_FLAGS_ALL)
> +		err = sysctl_err(path, table, "invalid flags");

What if a range for the upper limit is set but not the lower limit and
the user goes over the lower limit?

How about the inverse?

Do we need both ranges set?

  Luis

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] sysctl: Add flags to support min/max range clamping
  2018-03-12 20:44   ` Luis R. Rodriguez
@ 2018-03-12 20:48     ` Waiman Long
  0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2018-03-12 20:48 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Kees Cook, linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox

On 03/12/2018 04:44 PM, Luis R. Rodriguez wrote:
> On Mon, Mar 12, 2018 at 04:15:39PM -0400, Waiman Long wrote:
>> When minimum/maximum values are specified for a sysctl parameter in
>> the ctl_table structure with proc_dointvec_minmax() handler, update
>> to that parameter will fail with error if the given value is outside
>> of the required range.
>>
>> There are use cases where it may be better to clamp the value of
>> the sysctl parameter to the given range without failing the update,
>> especially if the users are not aware of the actual range limits.
>> Reading the value back after the update will now be a good practice
>> to see if the provided value exceeds the range limits.
>>
>> To provide this less restrictive form of range checking, a new flags
>> field is added to the ctl_table structure.
>>
>> When the CTL_FLAGS_CLAMP_RANGE flag is set in the ctl_table
>> entry, any update from the userspace will be clamped to the given
>> range without error if either the proc_dointvec_minmax() or the
>> proc_douintvec_minmax() handlers is used.
> You keep missing to document both on commit log and kdoc which
> end on the range is selected if you happen to go over. To be clear
> it is unclear from reading the commit log if a default is set if
> you go over if you pick another value. In this case it is conditional
> if you go over we pick the high range max, and if you go below the
> lower range minimum set allowed.
>
> What happens if no low range is set and that is what the issue in
> terms of range triggers?

Sorry for missing that.

If a minimum value is not specified, no minimum checking and clamping
will be done. Similarly for maximum. I will clarify that.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 3/6] sysctl: Warn when a clamped sysctl parameter is set out of range
  2018-03-12 20:15 ` [PATCH v4 3/6] sysctl: Warn when a clamped sysctl parameter is set out of range Waiman Long
@ 2018-03-12 20:50   ` Luis R. Rodriguez
  2018-03-12 21:07     ` Waiman Long
  2018-03-12 21:00   ` Andrew Morton
  1 sibling, 1 reply; 42+ messages in thread
From: Luis R. Rodriguez @ 2018-03-12 20:50 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

On Mon, Mar 12, 2018 at 04:15:41PM -0400, Waiman Long wrote:
> Even with clamped sysctl parameters, it is still not that straight
> forward to figure out the exact range of those parameters. One may
> try to write extreme parameter values to see if they get clamped.
> To make it easier, a warning with the expected range will now be
> printed into the kernel ring buffer when a clamped sysctl parameter
> receives an out of range value.
> 
> The pr_warn_ratelimited() macro is used to limit the number of warning
> messages that can be printed within a given period of time.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/sysctl.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 40 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 3d65f41..14aca92 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2505,6 +2505,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
>   * @min: pointer to minimum allowable value
>   * @max: pointer to maximum allowable value
>   * @flags: pointer to flags
> + * @name: sysctl parameter name
>   *
>   * The do_proc_dointvec_minmax_conv_param structure provides the
>   * minimum and maximum values for doing range checking for those sysctl
> @@ -2514,31 +2515,48 @@ struct do_proc_dointvec_minmax_conv_param {
>  	int *min;
>  	int *max;
>  	unsigned int *flags;
> +	const char *name;
>  };
>  
> +#ifdef pr_fmt
> +#undef pr_fmt
> +#endif
> +#define pr_fmt(fmt)	"sysctl: " fmt

No. This needs to be defined at the top of the file, please look
at other uses on kernel/*.c and just use:

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

The filename already provides the name we want.

> +
>  static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
>  					int *valp,
>  					int write, void *data)
>  {
>  	struct do_proc_dointvec_minmax_conv_param *param = data;
> +
>  	if (write) {
>  		int val = *negp ? -*lvalp : *lvalp;
> +		bool clamped = false;
>  		bool clamp = param->flags &&
>  			   (*param->flags & CTL_FLAGS_CLAMP_RANGE);
>  
>  		if (param->min && *param->min > val) {
> -			if (clamp)
> +			if (clamp) {
>  				val = *param->min;
> -			else
> +				clamped = true;
> +			} else {
>  				return -EINVAL;
> +			}
>  		}
>  		if (param->max && *param->max < val) {
> -			if (clamp)
> +			if (clamp) {
>  				val = *param->max;
> -			else
> +				clamped = true;
> +			} else {
>  				return -EINVAL;
> +			}
>  		}
>  		*valp = val;
> +		if (clamped && param->name)
> +			pr_warn_ratelimited("\"%s\" was set out of range [%d, %d], clamped to %d.\n",
> +				param->name,
> +				param->min ? *param->min : -INT_MAX,
> +				param->max ? *param->max :  INT_MAX, val);

We already warn here but I see you also add yet-another warning for the driver....

  Luis

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits
  2018-03-12 20:15 ` [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits Waiman Long
  2018-03-12 20:46   ` Luis R. Rodriguez
@ 2018-03-12 20:52   ` Andrew Morton
  2018-03-12 22:12     ` Waiman Long
  1 sibling, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2018-03-12 20:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Al Viro, Matthew Wilcox

On Mon, 12 Mar 2018 16:15:40 -0400 Waiman Long <longman@redhat.com> wrote:

> Checking code is added to check for invalid flags in the ctl_table
> and return error if an unknown flag is used.

Why?  What's wrong with the old code, what value does this change add,
etc.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 5/6] ipc: Clamp semmni to the real IPCMNI limit
  2018-03-12 20:15 ` [PATCH v4 5/6] ipc: Clamp semmni to the real IPCMNI limit Waiman Long
@ 2018-03-12 20:52   ` Luis R. Rodriguez
  2018-03-12 20:59     ` Waiman Long
  0 siblings, 1 reply; 42+ messages in thread
From: Luis R. Rodriguez @ 2018-03-12 20:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

On Mon, Mar 12, 2018 at 04:15:43PM -0400, Waiman Long wrote:
> +	if (clamped)
> +		pr_warn_once("sysctl: \"sem[3]\" was set out of range [%d, %d], clamped to %d.\n",
> +			     0, IPCMNI, ns->sc_semmni);

Why warn if the kernel already does that? If we can avoid more code on
drivers the better.

  Luis

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 6/6] test_sysctl: Add range clamping test
  2018-03-12 20:15 ` [PATCH v4 6/6] test_sysctl: Add range clamping test Waiman Long
@ 2018-03-12 20:53   ` Luis R. Rodriguez
  2018-03-12 21:00     ` Waiman Long
  0 siblings, 1 reply; 42+ messages in thread
From: Luis R. Rodriguez @ 2018-03-12 20:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

On Mon, Mar 12, 2018 at 04:15:44PM -0400, Waiman Long wrote:
> Add a range clamping test for the msgmni sysctl parameter to verify

No! We don't want to test production values, please add a new test
entry which is similar but does *not* modify system production values!

  Luis

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits
  2018-03-12 20:46   ` Luis R. Rodriguez
@ 2018-03-12 20:54     ` Waiman Long
  2018-03-12 20:59       ` Luis R. Rodriguez
  0 siblings, 1 reply; 42+ messages in thread
From: Waiman Long @ 2018-03-12 20:54 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Kees Cook, linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox

On 03/12/2018 04:46 PM, Luis R. Rodriguez wrote:
> On Mon, Mar 12, 2018 at 04:15:40PM -0400, Waiman Long wrote:
>> Checking code is added to check for invalid flags in the ctl_table
>> and return error if an unknown flag is used.
> This should be merged with the first patch otherwise there are atomic
> points in time on the commit log history where invalid values are allowed
> and that makes no sense.
>
> This can probably be expanded to verify semantics further. Details
> below.
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  fs/proc/proc_sysctl.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>> index 493c975..67c0c82 100644
>> --- a/fs/proc/proc_sysctl.c
>> +++ b/fs/proc/proc_sysctl.c
>> @@ -1092,6 +1092,16 @@ static int sysctl_check_table_array(const char *path, struct ctl_table *table)
>>  	return err;
>>  }
>>  
>> +static int sysctl_check_flags(const char *path, struct ctl_table *table)
>> +{
>> +	int err = 0;
>> +
>> +	if (table->flags & ~CTL_TABLE_FLAGS_ALL)
>> +		err = sysctl_err(path, table, "invalid flags");
> What if a range for the upper limit is set but not the lower limit and
> the user goes over the lower limit?
>
> How about the inverse?
>
> Do we need both ranges set?
>
>   Luis

This check is just to make sure that no invalid flag bit is set. Range
clamping is just one of flag bits, though this is the only one currently
supported. In fact, it is allowed that the minimum or maximum can be
left unspecified. In this case, no minimum or maximum checking will be
done. So I don't see anything related to range checking should be put here.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits
  2018-03-12 20:54     ` Waiman Long
@ 2018-03-12 20:59       ` Luis R. Rodriguez
  2018-03-12 21:02         ` Waiman Long
  0 siblings, 1 reply; 42+ messages in thread
From: Luis R. Rodriguez @ 2018-03-12 20:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

On Mon, Mar 12, 2018 at 04:54:51PM -0400, Waiman Long wrote:
> On 03/12/2018 04:46 PM, Luis R. Rodriguez wrote:
> > On Mon, Mar 12, 2018 at 04:15:40PM -0400, Waiman Long wrote:
> >> Checking code is added to check for invalid flags in the ctl_table
> >> and return error if an unknown flag is used.
> > This should be merged with the first patch otherwise there are atomic
> > points in time on the commit log history where invalid values are allowed
> > and that makes no sense.
> >
> > This can probably be expanded to verify semantics further. Details
> > below.
> >> Signed-off-by: Waiman Long <longman@redhat.com>
> >> ---
> >>  fs/proc/proc_sysctl.c | 12 ++++++++++++
> >>  1 file changed, 12 insertions(+)
> >>
> >> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> >> index 493c975..67c0c82 100644
> >> --- a/fs/proc/proc_sysctl.c
> >> +++ b/fs/proc/proc_sysctl.c
> >> @@ -1092,6 +1092,16 @@ static int sysctl_check_table_array(const char *path, struct ctl_table *table)
> >>  	return err;
> >>  }
> >>  
> >> +static int sysctl_check_flags(const char *path, struct ctl_table *table)
> >> +{
> >> +	int err = 0;
> >> +
> >> +	if (table->flags & ~CTL_TABLE_FLAGS_ALL)
> >> +		err = sysctl_err(path, table, "invalid flags");
> > What if a range for the upper limit is set but not the lower limit and
> > the user goes over the lower limit?
> >
> > How about the inverse?
> >
> > Do we need both ranges set?
> >
> >   Luis
> 
> This check is just to make sure that no invalid flag bit is set. Range
> clamping is just one of flag bits, though this is the only one currently
> supported. In fact, it is allowed that the minimum or maximum can be
> left unspecified. In this case, no minimum or maximum checking will be
> done. So I don't see anything related to range checking should be put here.

What if minimum is greater than maximum?

  Luis

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 5/6] ipc: Clamp semmni to the real IPCMNI limit
  2018-03-12 20:52   ` Luis R. Rodriguez
@ 2018-03-12 20:59     ` Waiman Long
  0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2018-03-12 20:59 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Kees Cook, linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox

On 03/12/2018 04:52 PM, Luis R. Rodriguez wrote:
> On Mon, Mar 12, 2018 at 04:15:43PM -0400, Waiman Long wrote:
>> +	if (clamped)
>> +		pr_warn_once("sysctl: \"sem[3]\" was set out of range [%d, %d], clamped to %d.\n",
>> +			     0, IPCMNI, ns->sc_semmni);
> Why warn if the kernel already does that? If we can avoid more code on
> drivers the better.
>
>   Luis

As said before, this case is special because "sem" is a collection of 4
different values. The last one is semmni which I tries to apply the same
range clamping here, whereas the other 3 are left as they are. The
standard proc_dointvec_minmax() handler cannot be used here because of
that. I will clarify that in the commit log.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 3/6] sysctl: Warn when a clamped sysctl parameter is set out of range
  2018-03-12 20:15 ` [PATCH v4 3/6] sysctl: Warn when a clamped sysctl parameter is set out of range Waiman Long
  2018-03-12 20:50   ` Luis R. Rodriguez
@ 2018-03-12 21:00   ` Andrew Morton
  2018-03-12 21:04     ` Waiman Long
  1 sibling, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2018-03-12 21:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Al Viro, Matthew Wilcox

On Mon, 12 Mar 2018 16:15:41 -0400 Waiman Long <longman@redhat.com> wrote:

> Even with clamped sysctl parameters, it is still not that straight
> forward to figure out the exact range of those parameters. One may
> try to write extreme parameter values to see if they get clamped.
> To make it easier, a warning with the expected range will now be
> printed into the kernel ring buffer when a clamped sysctl parameter
> receives an out of range value.
> 
> The pr_warn_ratelimited() macro is used to limit the number of warning
> messages that can be printed within a given period of time.
> 
> ...
>
> +#ifdef pr_fmt
> +#undef pr_fmt
> +#endif
> +#define pr_fmt(fmt)	"sysctl: " fmt

Why is it necessary to undefine pr_fmt?  That's a somewhat unusual
thing to do.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 6/6] test_sysctl: Add range clamping test
  2018-03-12 20:53   ` Luis R. Rodriguez
@ 2018-03-12 21:00     ` Waiman Long
  0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2018-03-12 21:00 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Kees Cook, linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox

On 03/12/2018 04:53 PM, Luis R. Rodriguez wrote:
> On Mon, Mar 12, 2018 at 04:15:44PM -0400, Waiman Long wrote:
>> Add a range clamping test for the msgmni sysctl parameter to verify
> No! We don't want to test production values, please add a new test
> entry which is similar but does *not* modify system production values!
>
>   Luis

Oh, I see. Will update the test accordingly.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits
  2018-03-12 20:59       ` Luis R. Rodriguez
@ 2018-03-12 21:02         ` Waiman Long
  0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2018-03-12 21:02 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Kees Cook, linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox

On 03/12/2018 04:59 PM, Luis R. Rodriguez wrote:
> On Mon, Mar 12, 2018 at 04:54:51PM -0400, Waiman Long wrote:
>> On 03/12/2018 04:46 PM, Luis R. Rodriguez wrote:
>>> On Mon, Mar 12, 2018 at 04:15:40PM -0400, Waiman Long wrote:
>>>> Checking code is added to check for invalid flags in the ctl_table
>>>> and return error if an unknown flag is used.
>>> This should be merged with the first patch otherwise there are atomic
>>> points in time on the commit log history where invalid values are allowed
>>> and that makes no sense.
>>>
>>> This can probably be expanded to verify semantics further. Details
>>> below.
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>> ---
>>>>  fs/proc/proc_sysctl.c | 12 ++++++++++++
>>>>  1 file changed, 12 insertions(+)
>>>>
>>>> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
>>>> index 493c975..67c0c82 100644
>>>> --- a/fs/proc/proc_sysctl.c
>>>> +++ b/fs/proc/proc_sysctl.c
>>>> @@ -1092,6 +1092,16 @@ static int sysctl_check_table_array(const char *path, struct ctl_table *table)
>>>>  	return err;
>>>>  }
>>>>  
>>>> +static int sysctl_check_flags(const char *path, struct ctl_table *table)
>>>> +{
>>>> +	int err = 0;
>>>> +
>>>> +	if (table->flags & ~CTL_TABLE_FLAGS_ALL)
>>>> +		err = sysctl_err(path, table, "invalid flags");
>>> What if a range for the upper limit is set but not the lower limit and
>>> the user goes over the lower limit?
>>>
>>> How about the inverse?
>>>
>>> Do we need both ranges set?
>>>
>>>   Luis
>> This check is just to make sure that no invalid flag bit is set. Range
>> clamping is just one of flag bits, though this is the only one currently
>> supported. In fact, it is allowed that the minimum or maximum can be
>> left unspecified. In this case, no minimum or maximum checking will be
>> done. So I don't see anything related to range checking should be put here.
> What if minimum is greater than maximum?
>
>   Luis

Yes, you are right. That is a valid check. I am going to add that in the
next patch.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 3/6] sysctl: Warn when a clamped sysctl parameter is set out of range
  2018-03-12 21:00   ` Andrew Morton
@ 2018-03-12 21:04     ` Waiman Long
  0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2018-03-12 21:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Al Viro, Matthew Wilcox

On 03/12/2018 05:00 PM, Andrew Morton wrote:
> On Mon, 12 Mar 2018 16:15:41 -0400 Waiman Long <longman@redhat.com> wrote:
>
>> Even with clamped sysctl parameters, it is still not that straight
>> forward to figure out the exact range of those parameters. One may
>> try to write extreme parameter values to see if they get clamped.
>> To make it easier, a warning with the expected range will now be
>> printed into the kernel ring buffer when a clamped sysctl parameter
>> receives an out of range value.
>>
>> The pr_warn_ratelimited() macro is used to limit the number of warning
>> messages that can be printed within a given period of time.
>>
>> ...
>>
>> +#ifdef pr_fmt
>> +#undef pr_fmt
>> +#endif
>> +#define pr_fmt(fmt)	"sysctl: " fmt
> Why is it necessary to undefine pr_fmt?  That's a somewhat unusual
> thing to do.
>
>
Because it is defined in the printk.h. I should have put the pr_fmt
definition before the iinclude file to eliminate the need to undefine it.

-Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 3/6] sysctl: Warn when a clamped sysctl parameter is set out of range
  2018-03-12 20:50   ` Luis R. Rodriguez
@ 2018-03-12 21:07     ` Waiman Long
  0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2018-03-12 21:07 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Kees Cook, linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox

On 03/12/2018 04:50 PM, Luis R. Rodriguez wrote:
> On Mon, Mar 12, 2018 at 04:15:41PM -0400, Waiman Long wrote:
>> Even with clamped sysctl parameters, it is still not that straight
>> forward to figure out the exact range of those parameters. One may
>> try to write extreme parameter values to see if they get clamped.
>> To make it easier, a warning with the expected range will now be
>> printed into the kernel ring buffer when a clamped sysctl parameter
>> receives an out of range value.
>>
>> The pr_warn_ratelimited() macro is used to limit the number of warning
>> messages that can be printed within a given period of time.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/sysctl.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 40 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 3d65f41..14aca92 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -2505,6 +2505,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
>>   * @min: pointer to minimum allowable value
>>   * @max: pointer to maximum allowable value
>>   * @flags: pointer to flags
>> + * @name: sysctl parameter name
>>   *
>>   * The do_proc_dointvec_minmax_conv_param structure provides the
>>   * minimum and maximum values for doing range checking for those sysctl
>> @@ -2514,31 +2515,48 @@ struct do_proc_dointvec_minmax_conv_param {
>>  	int *min;
>>  	int *max;
>>  	unsigned int *flags;
>> +	const char *name;
>>  };
>>  
>> +#ifdef pr_fmt
>> +#undef pr_fmt
>> +#endif
>> +#define pr_fmt(fmt)	"sysctl: " fmt
> No. This needs to be defined at the top of the file, please look
> at other uses on kernel/*.c and just use:
>
> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> The filename already provides the name we want.

Right. That is the right way to do it.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits
  2018-03-12 20:52   ` Andrew Morton
@ 2018-03-12 22:12     ` Waiman Long
  2018-03-12 22:42       ` Andrew Morton
  0 siblings, 1 reply; 42+ messages in thread
From: Waiman Long @ 2018-03-12 22:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Al Viro, Matthew Wilcox

On 03/12/2018 04:52 PM, Andrew Morton wrote:
> On Mon, 12 Mar 2018 16:15:40 -0400 Waiman Long <longman@redhat.com> wrote:
>
>> Checking code is added to check for invalid flags in the ctl_table
>> and return error if an unknown flag is used.
> Why?  What's wrong with the old code, what value does this change add,
> etc.

This is an additional checking code requested by Luis.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits
  2018-03-12 22:12     ` Waiman Long
@ 2018-03-12 22:42       ` Andrew Morton
  0 siblings, 0 replies; 42+ messages in thread
From: Andrew Morton @ 2018-03-12 22:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Al Viro, Matthew Wilcox

On Mon, 12 Mar 2018 18:12:47 -0400 Waiman Long <longman@redhat.com> wrote:

> On 03/12/2018 04:52 PM, Andrew Morton wrote:
> > On Mon, 12 Mar 2018 16:15:40 -0400 Waiman Long <longman@redhat.com> wrote:
> >
> >> Checking code is added to check for invalid flags in the ctl_table
> >> and return error if an unknown flag is used.
> > Why?  What's wrong with the old code, what value does this change add,
> > etc.
> 
> This is an additional checking code requested by Luis.

Readers of this patch will wish to know why it exists.  That doesn't
tell us!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] sysctl: Add flags to support min/max range clamping
  2018-03-12 20:15 ` [PATCH v4 1/6] sysctl: Add flags to support min/max range clamping Waiman Long
  2018-03-12 20:44   ` Luis R. Rodriguez
@ 2018-03-13 17:46   ` Eric W. Biederman
  2018-03-13 18:49     ` Waiman Long
  1 sibling, 1 reply; 42+ messages in thread
From: Eric W. Biederman @ 2018-03-13 17:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

Waiman Long <longman@redhat.com> writes:

> When minimum/maximum values are specified for a sysctl parameter in
> the ctl_table structure with proc_dointvec_minmax() handler, update
> to that parameter will fail with error if the given value is outside
> of the required range.
>
> There are use cases where it may be better to clamp the value of
> the sysctl parameter to the given range without failing the update,
> especially if the users are not aware of the actual range limits.
> Reading the value back after the update will now be a good practice
> to see if the provided value exceeds the range limits.

What use cases?  Who will break.  Examples would be good.

> To provide this less restrictive form of range checking, a new flags
> field is added to the ctl_table structure.
>
> When the CTL_FLAGS_CLAMP_RANGE flag is set in the ctl_table
> entry, any update from the userspace will be clamped to the given
> range without error if either the proc_dointvec_minmax() or the
> proc_douintvec_minmax() handlers is used.

As this is constructed it will increase the size of ctl_table for
everyone.  Better would be to add a new function that behaves similary
but differently than to burden struct ctl_table for the rest of time.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  include/linux/sysctl.h | 15 +++++++++++++++
>  kernel/sysctl.c        | 48 +++++++++++++++++++++++++++++++++++++++---------
>  2 files changed, 54 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index b769ecf..963e363 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -116,6 +116,7 @@ struct ctl_table
>  	void *data;
>  	int maxlen;
>  	umode_t mode;
> +	unsigned int flags;
>  	struct ctl_table *child;	/* Deprecated */
>  	proc_handler *proc_handler;	/* Callback for text formatting */
>  	struct ctl_table_poll *poll;
> @@ -123,6 +124,20 @@ struct ctl_table
>  	void *extra2;
>  } __randomize_layout;
>  
> +/**
> + * enum ctl_table_flags - flags for the ctl table (struct ctl_table.flags)
> + *
> + * @CTL_FLAGS_CLAMP_RANGE: Set to indicate that the entry should be
> + *	flexibly clamped to min/max range in case the user provided
> + *	an incorrect value.
> + */
> +enum ctl_table_flags {
> +	CTL_FLAGS_CLAMP_RANGE		= BIT(0),
> +	__CTL_FLAGS_MAX			= BIT(1),
> +};
> +
> +#define CTL_TABLE_FLAGS_ALL	(__CTL_FLAGS_MAX - 1)
> +
>  struct ctl_node {
>  	struct rb_node node;
>  	struct ctl_table_header *header;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index d2aa6b4..3d65f41 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2504,6 +2504,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
>   * struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax() range checking structure
>   * @min: pointer to minimum allowable value
>   * @max: pointer to maximum allowable value
> + * @flags: pointer to flags
>   *
>   * The do_proc_dointvec_minmax_conv_param structure provides the
>   * minimum and maximum values for doing range checking for those sysctl
> @@ -2512,6 +2513,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
>  struct do_proc_dointvec_minmax_conv_param {
>  	int *min;
>  	int *max;
> +	unsigned int *flags;
>  };
>  
>  static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
> @@ -2521,9 +2523,21 @@ static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
>  	struct do_proc_dointvec_minmax_conv_param *param = data;
>  	if (write) {
>  		int val = *negp ? -*lvalp : *lvalp;
> -		if ((param->min && *param->min > val) ||
> -		    (param->max && *param->max < val))
> -			return -EINVAL;
> +		bool clamp = param->flags &&
> +			   (*param->flags & CTL_FLAGS_CLAMP_RANGE);
> +
> +		if (param->min && *param->min > val) {
> +			if (clamp)
> +				val = *param->min;
> +			else
> +				return -EINVAL;
> +		}
> +		if (param->max && *param->max < val) {
> +			if (clamp)
> +				val = *param->max;
> +			else
> +				return -EINVAL;
> +		}
>  		*valp = val;
>  	} else {
>  		int val = *valp;
> @@ -2552,7 +2566,8 @@ static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
>   * This routine will ensure the values are within the range specified by
>   * table->extra1 (min) and table->extra2 (max).
>   *
> - * Returns 0 on success or -EINVAL on write when the range check fails.
> + * Returns 0 on success or -EINVAL on write when the range check fails
> + * without the CTL_FLAGS_CLAMP_RANGE flag.
>   */
>  int proc_dointvec_minmax(struct ctl_table *table, int write,
>  		  void __user *buffer, size_t *lenp, loff_t *ppos)
> @@ -2560,6 +2575,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
>  	struct do_proc_dointvec_minmax_conv_param param = {
>  		.min = (int *) table->extra1,
>  		.max = (int *) table->extra2,
> +		.flags = &table->flags,
>  	};
>  	return do_proc_dointvec(table, write, buffer, lenp, ppos,
>  				do_proc_dointvec_minmax_conv, &param);
> @@ -2569,6 +2585,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
>   * struct do_proc_douintvec_minmax_conv_param - proc_douintvec_minmax() range checking structure
>   * @min: pointer to minimum allowable value
>   * @max: pointer to maximum allowable value
> + * @flags: pointer to flags
>   *
>   * The do_proc_douintvec_minmax_conv_param structure provides the
>   * minimum and maximum values for doing range checking for those sysctl
> @@ -2577,6 +2594,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
>  struct do_proc_douintvec_minmax_conv_param {
>  	unsigned int *min;
>  	unsigned int *max;
> +	unsigned int *flags;
>  };
>  
>  static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
> @@ -2587,14 +2605,24 @@ static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
>  
>  	if (write) {
>  		unsigned int val = *lvalp;
> +		bool clamp = param->flags &&
> +			   (*param->flags & CTL_FLAGS_CLAMP_RANGE);
>  
>  		if (*lvalp > UINT_MAX)
>  			return -EINVAL;
>  
> -		if ((param->min && *param->min > val) ||
> -		    (param->max && *param->max < val))
> -			return -ERANGE;
> -
> +		if (param->min && *param->min > val) {
> +			if (clamp)
> +				val = *param->min;
> +			else
> +				return -ERANGE;
> +		}
> +		if (param->max && *param->max < val) {
> +			if (clamp)
> +				val = *param->max;
> +			else
> +				return -ERANGE;
> +		}
>  		*valp = val;
>  	} else {
>  		unsigned int val = *valp;
> @@ -2621,7 +2649,8 @@ static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
>   * check for UINT_MAX to avoid having to support wrap around uses from
>   * userspace.
>   *
> - * Returns 0 on success or -ERANGE on write when the range check fails.
> + * Returns 0 on success or -ERANGE on write when the range check fails
> + * without the CTL_FLAGS_CLAMP_RANGE flag.
>   */
>  int proc_douintvec_minmax(struct ctl_table *table, int write,
>  			  void __user *buffer, size_t *lenp, loff_t *ppos)
> @@ -2629,6 +2658,7 @@ int proc_douintvec_minmax(struct ctl_table *table, int write,
>  	struct do_proc_douintvec_minmax_conv_param param = {
>  		.min = (unsigned int *) table->extra1,
>  		.max = (unsigned int *) table->extra2,
> +		.flags = &table->flags,
>  	};
>  	return do_proc_douintvec(table, write, buffer, lenp, ppos,
>  				 do_proc_douintvec_minmax_conv, &param);

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 4/6] ipc: Clamp msgmni and shmmni to the real IPCMNI limit
  2018-03-12 20:15 ` [PATCH v4 4/6] ipc: Clamp msgmni and shmmni to the real IPCMNI limit Waiman Long
@ 2018-03-13 18:17   ` Eric W. Biederman
  2018-03-13 18:39     ` Waiman Long
  0 siblings, 1 reply; 42+ messages in thread
From: Eric W. Biederman @ 2018-03-13 18:17 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

Waiman Long <longman@redhat.com> writes:

> A user can write arbitrary integer values to msgmni and shmmni sysctl
> parameters without getting error, but the actual limit is really
> IPCMNI (32k). This can mislead users as they think they can get a
> value that is not real.
>
> Enforcing the limit by failing the sysctl parameter write, however,
> can break existing user applications.

Which applications examples please.

I am seeing this patchset late but it looks like a whole lot of changes
to avoid a theoretical possibility.

Changes that have an impact on more than just the ipc code you are
patching.

That makes me feel very uncomfortable with these changes.

Eric


> Instead, the range clamping flag
> is set to enforce the limit without failing existing user code. Users
> can easily figure out if the sysctl parameter value is out of range
> by either reading back the parameter value or checking the kernel
> ring buffer for warning.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  ipc/ipc_sysctl.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
> index 8ad93c2..1955dd4 100644
> --- a/ipc/ipc_sysctl.c
> +++ b/ipc/ipc_sysctl.c
> @@ -99,6 +99,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
>  static int zero;
>  static int one = 1;
>  static int int_max = INT_MAX;
> +static int ipc_mni = IPCMNI;
>  
>  static struct ctl_table ipc_kern_table[] = {
>  	{
> @@ -120,7 +121,10 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
>  		.data		= &init_ipc_ns.shm_ctlmni,
>  		.maxlen		= sizeof(init_ipc_ns.shm_ctlmni),
>  		.mode		= 0644,
> -		.proc_handler	= proc_ipc_dointvec,
> +		.proc_handler	= proc_ipc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &ipc_mni,
> +		.flags		= CTL_FLAGS_CLAMP_RANGE,
>  	},
>  	{
>  		.procname	= "shm_rmid_forced",
> @@ -147,7 +151,8 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
>  		.mode		= 0644,
>  		.proc_handler	= proc_ipc_dointvec_minmax,
>  		.extra1		= &zero,
> -		.extra2		= &int_max,
> +		.extra2		= &ipc_mni,
> +		.flags		= CTL_FLAGS_CLAMP_RANGE,
>  	},
>  	{
>  		.procname	= "auto_msgmni",

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 4/6] ipc: Clamp msgmni and shmmni to the real IPCMNI limit
  2018-03-13 18:17   ` Eric W. Biederman
@ 2018-03-13 18:39     ` Waiman Long
  2018-03-13 20:29       ` Eric W. Biederman
  0 siblings, 1 reply; 42+ messages in thread
From: Waiman Long @ 2018-03-13 18:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

On 03/13/2018 02:17 PM, Eric W. Biederman wrote:
> Waiman Long <longman@redhat.com> writes:
>
>> A user can write arbitrary integer values to msgmni and shmmni sysctl
>> parameters without getting error, but the actual limit is really
>> IPCMNI (32k). This can mislead users as they think they can get a
>> value that is not real.
>>
>> Enforcing the limit by failing the sysctl parameter write, however,
>> can break existing user applications.
> Which applications examples please.
>
> I am seeing this patchset late but it looks like a whole lot of changes
> to avoid a theoretical possibility.
>
> Changes that have an impact on more than just the ipc code you are
> patching.
>
> That makes me feel very uncomfortable with these changes.
>
> Eric

This patchset is constructed to address a customer request that there is
no easy way to find out the actual usable range of a sysctl parameter.
In this particular case, the customer wants to use more than 32k shared
memory segments. They can put in a large value into shmmni, but the
application didn't work properly because shmmni was internally clamped
to 32k without any visible sign that a smaller limit has been imposed.

Out of a concern that there might be customers out there setting those
sysctl parameters outside of the allowable range without knowing it,
just enforcing the right limits may have the undesirable consequence of
breaking their existing setup scripts. I don't have concrete example of
what customers are doing that, but it  won't look good if we wait until
the complaints come in.

The new code won't affect existing code unless the necessary flag is
set. So would you mind elaborating what other impact do you see that
will affect other non-IPC code in an undesirable way?

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 1/6] sysctl: Add flags to support min/max range clamping
  2018-03-13 17:46   ` Eric W. Biederman
@ 2018-03-13 18:49     ` Waiman Long
  0 siblings, 0 replies; 42+ messages in thread
From: Waiman Long @ 2018-03-13 18:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

On 03/13/2018 01:46 PM, Eric W. Biederman wrote:
> Waiman Long <longman@redhat.com> writes:
>
>> When minimum/maximum values are specified for a sysctl parameter in
>> the ctl_table structure with proc_dointvec_minmax() handler, update
>> to that parameter will fail with error if the given value is outside
>> of the required range.
>>
>> There are use cases where it may be better to clamp the value of
>> the sysctl parameter to the given range without failing the update,
>> especially if the users are not aware of the actual range limits.
>> Reading the value back after the update will now be a good practice
>> to see if the provided value exceeds the range limits.
> What use cases?  Who will break.  Examples would be good.

See my response to the 4/6 patch.

>> To provide this less restrictive form of range checking, a new flags
>> field is added to the ctl_table structure.
>>
>> When the CTL_FLAGS_CLAMP_RANGE flag is set in the ctl_table
>> entry, any update from the userspace will be clamped to the given
>> range without error if either the proc_dointvec_minmax() or the
>> proc_douintvec_minmax() handlers is used.
> As this is constructed it will increase the size of ctl_table for
> everyone.  Better would be to add a new function that behaves similary
> but differently than to burden struct ctl_table for the rest of time.

If you have concern about increasing the size of the ctl_table, I can
revert it back to use unsigned short for flag which will fit into the
hole left by the 16-bit mode field. I thought increasing the size of
ctl_table a bit isn't a big problem, I may be wrong on that.

Adding the new flags field will afford us the ability to better
customize the sysctl parameters to different needs that may arise in the
future.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 4/6] ipc: Clamp msgmni and shmmni to the real IPCMNI limit
  2018-03-13 18:39     ` Waiman Long
@ 2018-03-13 20:29       ` Eric W. Biederman
  2018-03-13 21:06         ` Waiman Long
  0 siblings, 1 reply; 42+ messages in thread
From: Eric W. Biederman @ 2018-03-13 20:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

Waiman Long <longman@redhat.com> writes:

> On 03/13/2018 02:17 PM, Eric W. Biederman wrote:
>> Waiman Long <longman@redhat.com> writes:
>>
>>> A user can write arbitrary integer values to msgmni and shmmni sysctl
>>> parameters without getting error, but the actual limit is really
>>> IPCMNI (32k). This can mislead users as they think they can get a
>>> value that is not real.
>>>
>>> Enforcing the limit by failing the sysctl parameter write, however,
>>> can break existing user applications.
>> Which applications examples please.
>>
>> I am seeing this patchset late but it looks like a whole lot of changes
>> to avoid a theoretical possibility.
>>
>> Changes that have an impact on more than just the ipc code you are
>> patching.
>>
>> That makes me feel very uncomfortable with these changes.
>>
>> Eric
>
> This patchset is constructed to address a customer request that there is
> no easy way to find out the actual usable range of a sysctl parameter.
> In this particular case, the customer wants to use more than 32k shared
> memory segments. They can put in a large value into shmmni, but the
> application didn't work properly because shmmni was internally clamped
> to 32k without any visible sign that a smaller limit has been imposed.
>
> Out of a concern that there might be customers out there setting those
> sysctl parameters outside of the allowable range without knowing it,
> just enforcing the right limits may have the undesirable consequence of
> breaking their existing setup scripts. I don't have concrete example of
> what customers are doing that, but it  won't look good if we wait until
> the complaints come in.
>
> The new code won't affect existing code unless the necessary flag is
> set. So would you mind elaborating what other impact do you see that
> will affect other non-IPC code in an undesirable way?

The increase in size of struct ctl_table.  Every caller is affected.
Plus it increases everyone's cognitive load to figure out what is
this flags field as they fill out ctl_table.

Just introducing a proc_dointvec_minmax_clamped follows the existing
pattern and it makes it easier for everyone who both read the code.



It strikes me as quite peculiar that the response to bug report where
the complaint is an error is not given, is to continue the current
behavior without giving an error.


Arguably the simplest fix here would be to kill IPCMNI entirely.  Assign
the shmids from a sequence counter.  And place those structures in a
rbtree indexed by shmni.  There are 32bit fields but I don't think we
must keep the low 16bits for an index into an array and the high 16bits
as the actual sequence number.

Except for the checkpoint/restart case which is aguably much too
specific about how these ids are assigned that would give much more
freedom and allow people the number of shm segments that they actually
want to use.


For a further complication I don't expect you can get away with changing
the size or the fields in struct ctl_table in the kernel your customers
are running.


So please use a new function not flags it will simplify everyone's life.
If you can please actually fix this so you can have more shmids that
would be the really classy thing todo.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4 4/6] ipc: Clamp msgmni and shmmni to the real IPCMNI limit
  2018-03-13 20:29       ` Eric W. Biederman
@ 2018-03-13 21:06         ` Waiman Long
  2018-03-15  0:49           ` [RFC][PATCH] ipc: Remove IPCMNI Eric W. Biederman
  0 siblings, 1 reply; 42+ messages in thread
From: Waiman Long @ 2018-03-13 21:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox

On 03/13/2018 04:29 PM, Eric W. Biederman wrote:
> Waiman Long <longman@redhat.com> writes:
>
>> On 03/13/2018 02:17 PM, Eric W. Biederman wrote:
>>> Waiman Long <longman@redhat.com> writes:
>>>
>>>> A user can write arbitrary integer values to msgmni and shmmni sysctl
>>>> parameters without getting error, but the actual limit is really
>>>> IPCMNI (32k). This can mislead users as they think they can get a
>>>> value that is not real.
>>>>
>>>> Enforcing the limit by failing the sysctl parameter write, however,
>>>> can break existing user applications.
>>> Which applications examples please.
>>>
>>> I am seeing this patchset late but it looks like a whole lot of changes
>>> to avoid a theoretical possibility.
>>>
>>> Changes that have an impact on more than just the ipc code you are
>>> patching.
>>>
>>> That makes me feel very uncomfortable with these changes.
>>>
>>> Eric
>> This patchset is constructed to address a customer request that there is
>> no easy way to find out the actual usable range of a sysctl parameter.
>> In this particular case, the customer wants to use more than 32k shared
>> memory segments. They can put in a large value into shmmni, but the
>> application didn't work properly because shmmni was internally clamped
>> to 32k without any visible sign that a smaller limit has been imposed.
>>
>> Out of a concern that there might be customers out there setting those
>> sysctl parameters outside of the allowable range without knowing it,
>> just enforcing the right limits may have the undesirable consequence of
>> breaking their existing setup scripts. I don't have concrete example of
>> what customers are doing that, but it  won't look good if we wait until
>> the complaints come in.
>>
>> The new code won't affect existing code unless the necessary flag is
>> set. So would you mind elaborating what other impact do you see that
>> will affect other non-IPC code in an undesirable way?
> The increase in size of struct ctl_table.  Every caller is affected.
> Plus it increases everyone's cognitive load to figure out what is
> this flags field as they fill out ctl_table.

As said earlier, there is a way to add the new flags without increasing
the size of the structure. The flags was originally a uint16_t which
won't increase the size of the structure. It was changed in v4 to
provide more space for future extension. I am going to change it back to
uint16_t. It can certainly be changed again in the future if we really
need more than 16 bits.

> Just introducing a proc_dointvec_minmax_clamped follows the existing
> pattern and it makes it easier for everyone who both read the code.

It was the approach that was taken in my v1 patch. I changed it to use a
flag in v2 as it is more general. It also make it easier to extend the
feature in the future. Adding more proc_handlers also has the same issue
of increasing cognitive load of what proc_handler to use in the
ctl_table. This is the side effect of adding more complexity and there
is no way to work around it.

> It strikes me as quite peculiar that the response to bug report where
> the complaint is an error is not given, is to continue the current
> behavior without giving an error.
The customer actually has 2 requests. First, it is to make it easier to
figure out what are the real ranges of some of the sysctl parameters.
Secondly, they want to increase the IPCMNI limit to more than 32k. This
patch tries to address the first one. I will leave the second one to a
future patch once this one is done. As said before, the reason for
adding a clamping mode is to avoid regression. If you guys strongly
believe that this is not an issue at all, I am fine going along and
enforce the limit by failing invalid value.

I am also working on a follow-up patch to add an iteration API to dump
out the the ranges of relevant sysctl parameters that allow range of
values. That will require using a flag to annotate ctl_table entries
that have ranges.

> Arguably the simplest fix here would be to kill IPCMNI entirely.  Assign
> the shmids from a sequence counter.  And place those structures in a
> rbtree indexed by shmni.  There are 32bit fields but I don't think we
> must keep the low 16bits for an index into an array and the high 16bits
> as the actual sequence number.

Yes, that is on my to-do list.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-13 21:06         ` Waiman Long
@ 2018-03-15  0:49           ` Eric W. Biederman
  2018-03-15 17:02             ` Waiman Long
  2018-03-15 19:45             ` Matthew Wilcox
  0 siblings, 2 replies; 42+ messages in thread
From: Eric W. Biederman @ 2018-03-15  0:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox, Stanislav Kinsbursky,
	Linux Containers


The define IPCMNI was originally the size of a statically sized array in
the kernel and that has long since been removed.  Therefore there is no
fundamental reason for IPCMNI.

The only remaining use IPCMNI serves is as a convoluted way to format
the ipc id to userspace.  It does not appear that anything except for
the CHECKPOINT_RESTORE code even cares about this variety of assignment
and the CHECKPOINT_RESTORE code only cares about this weirdness because
it has to restore these peculiar ids.

Therefore make the assignment of ipc ids match the description in
Advanced Programming in the Unix Environment and assign the next id
until INT_MAX is hit then loop around to the lower ids.

This can be implemented trivially with the current code using idr_alloc_cyclic.

To make it possible to keep checkpoint/restore working I have renamed
the sysctls from xxx_next_id to xxx_nextid.  That is enough change that
a smart CRIU implementation can see that what is exported has changed,
and act accordingly.  New kernels will be able to restore the old id's.

This code still needs some real world testing to verify my assumptions.
And some work with the CRIU implementations to actually add the code
that deals with the new for of id assignment.

Updates: 03f595668017 ("ipc: add sysctl to specify desired next object id")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Waiman please take a look at this and run it through some tests etc,
I am pretty certain something like this patch is all you need to do
to sort out ipc assignment.  Not messing with sysctls needed.

 include/linux/ipc.h           |  2 --
 include/linux/ipc_namespace.h |  1 -
 ipc/ipc_sysctl.c              |  6 ++--
 ipc/namespace.c               | 11 ++----
 ipc/util.c                    | 80 ++++++++++---------------------------------
 ipc/util.h                    | 11 +-----
 6 files changed, 25 insertions(+), 86 deletions(-)

diff --git a/include/linux/ipc.h b/include/linux/ipc.h
index 821b2f260992..6cc2df7f7ac9 100644
--- a/include/linux/ipc.h
+++ b/include/linux/ipc.h
@@ -8,8 +8,6 @@
 #include <uapi/linux/ipc.h>
 #include <linux/refcount.h>
 
-#define IPCMNI 32768  /* <= MAX_INT limit for ipc arrays (including sysctl changes) */
-
 /* used by in-kernel data structures */
 struct kern_ipc_perm {
 	spinlock_t	lock;
diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index b5630c8eb2f3..cab33b6a8236 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -15,7 +15,6 @@ struct user_namespace;
 
 struct ipc_ids {
 	int in_use;
-	unsigned short seq;
 	bool tables_initialized;
 	struct rw_semaphore rwsem;
 	struct idr ipcs_idr;
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 8ad93c29f511..a599963d58bf 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -176,7 +176,7 @@ static struct ctl_table ipc_kern_table[] = {
 	},
 #ifdef CONFIG_CHECKPOINT_RESTORE
 	{
-		.procname	= "sem_next_id",
+		.procname	= "sem_nextid",
 		.data		= &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
 		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
 		.mode		= 0644,
@@ -185,7 +185,7 @@ static struct ctl_table ipc_kern_table[] = {
 		.extra2		= &int_max,
 	},
 	{
-		.procname	= "msg_next_id",
+		.procname	= "msg_nextid",
 		.data		= &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
 		.maxlen		= sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
 		.mode		= 0644,
@@ -194,7 +194,7 @@ static struct ctl_table ipc_kern_table[] = {
 		.extra2		= &int_max,
 	},
 	{
-		.procname	= "shm_next_id",
+		.procname	= "shm_nextid",
 		.data		= &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
 		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
 		.mode		= 0644,
diff --git a/ipc/namespace.c b/ipc/namespace.c
index f59a89966f92..84eaeba9e96c 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -109,20 +109,13 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
 {
 	struct kern_ipc_perm *perm;
 	int next_id;
-	int total, in_use;
 
 	down_write(&ids->rwsem);
-
-	in_use = ids->in_use;
-
-	for (total = 0, next_id = 0; total < in_use; next_id++) {
-		perm = idr_find(&ids->ipcs_idr, next_id);
-		if (perm == NULL)
-			continue;
+	next_id = 0;
+	while ((perm = idr_get_next(&ids->ipcs_idr, &next_id))) {
 		rcu_read_lock();
 		ipc_lock_object(perm);
 		free(ns, perm);
-		total++;
 	}
 	up_write(&ids->rwsem);
 }
diff --git a/ipc/util.c b/ipc/util.c
index 4ed5a17dd06f..ce6bf18e54df 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -118,7 +118,6 @@ int ipc_init_ids(struct ipc_ids *ids)
 {
 	int err;
 	ids->in_use = 0;
-	ids->seq = 0;
 	init_rwsem(&ids->rwsem);
 	err = rhashtable_init(&ids->key_ht, &ipc_kht_params);
 	if (err)
@@ -192,46 +191,18 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
 	return NULL;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-/*
- * Specify desired id for next allocated IPC object.
- */
-#define ipc_idr_alloc(ids, new)						\
-	idr_alloc(&(ids)->ipcs_idr, (new),				\
-		  (ids)->next_id < 0 ? 0 : ipcid_to_idx((ids)->next_id),\
-		  0, GFP_NOWAIT)
 
-static inline int ipc_buildid(int id, struct ipc_ids *ids,
-			      struct kern_ipc_perm *new)
+static int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new)
 {
-	if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
-		new->seq = ids->seq++;
-		if (ids->seq > IPCID_SEQ_MAX)
-			ids->seq = 0;
-	} else {
-		new->seq = ipcid_to_seqx(ids->next_id);
+#ifdef CONFIG_CHECKPOINT_RESTORE
+	if (ids->next_id >= 0) {
+		idr_set_cursor(&ids->ipcs_idr, ids->next_id);
 		ids->next_id = -1;
 	}
-
-	return SEQ_MULTIPLIER * new->seq + id;
-}
-
-#else
-#define ipc_idr_alloc(ids, new)					\
-	idr_alloc(&(ids)->ipcs_idr, (new), 0, 0, GFP_NOWAIT)
-
-static inline int ipc_buildid(int id, struct ipc_ids *ids,
-			      struct kern_ipc_perm *new)
-{
-	new->seq = ids->seq++;
-	if (ids->seq > IPCID_SEQ_MAX)
-		ids->seq = 0;
-
-	return SEQ_MULTIPLIER * new->seq + id;
+#endif
+	return idr_alloc_cyclic(&ids->ipcs_idr, (new), 0, 0, GFP_NOWAIT);
 }
 
-#endif /* CONFIG_CHECKPOINT_RESTORE */
-
 /**
  * ipc_addid - add an ipc identifier
  * @ids: ipc identifier set
@@ -251,9 +222,6 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
 	kgid_t egid;
 	int id, err;
 
-	if (limit > IPCMNI)
-		limit = IPCMNI;
-
 	if (!ids->tables_initialized || ids->in_use >= limit)
 		return -ENOSPC;
 
@@ -290,7 +258,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
 	if (id > ids->max_id)
 		ids->max_id = id;
 
-	new->id = ipc_buildid(id, ids, new);
+	new->id = id;
 
 	return id;
 }
@@ -430,7 +398,7 @@ static void ipc_kht_remove(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
  */
 void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
 {
-	int lid = ipcid_to_idx(ipcp->id);
+	int lid = ipcp->id;
 
 	idr_remove(&ids->ipcs_idr, lid);
 	ipc_kht_remove(ids, ipcp);
@@ -563,7 +531,7 @@ void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out)
 struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids *ids, int id)
 {
 	struct kern_ipc_perm *out;
-	int lid = ipcid_to_idx(id);
+	int lid = id;
 
 	if (unlikely(!ids->tables_initialized))
 		return ERR_PTR(-EINVAL);
@@ -757,30 +725,20 @@ static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids, loff_t pos,
 					      loff_t *new_pos)
 {
 	struct kern_ipc_perm *ipc;
-	int total, id;
-
-	total = 0;
-	for (id = 0; id < pos && total < ids->in_use; id++) {
-		ipc = idr_find(&ids->ipcs_idr, id);
-		if (ipc != NULL)
-			total++;
-	}
+	int id;
 
-	if (total >= ids->in_use)
+	/* Out of range - return NULL to terminate iteration */
+	if (pos > INT_MAX)
 		return NULL;
 
-	for (; pos < IPCMNI; pos++) {
-		ipc = idr_find(&ids->ipcs_idr, pos);
-		if (ipc != NULL) {
-			*new_pos = pos + 1;
-			rcu_read_lock();
-			ipc_lock_object(ipc);
-			return ipc;
-		}
-	}
+	ipc = idr_get_next(&ids->ipcs_idr, &id);
+	if (!ipc)
+		return NULL;
 
-	/* Out of range - return NULL to terminate iteration */
-	return NULL;
+	*new_pos = id + 1;
+	rcu_read_lock();
+	ipc_lock_object(ipc);
+	return ipc;
 }
 
 static void *sysvipc_proc_next(struct seq_file *s, void *it, loff_t *pos)
diff --git a/ipc/util.h b/ipc/util.h
index 89b8ec176fc4..de8e27367f0c 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -15,8 +15,6 @@
 #include <linux/err.h>
 #include <linux/ipc_namespace.h>
 
-#define SEQ_MULTIPLIER	(IPCMNI)
-
 int sem_init(void);
 int msg_init(void);
 void shm_init(void);
@@ -93,10 +91,6 @@ void __init ipc_init_proc_interface(const char *path, const char *header,
 #define IPC_MSG_IDS	1
 #define IPC_SHM_IDS	2
 
-#define ipcid_to_idx(id) ((id) % SEQ_MULTIPLIER)
-#define ipcid_to_seqx(id) ((id) / SEQ_MULTIPLIER)
-#define IPCID_SEQ_MAX min_t(int, INT_MAX/SEQ_MULTIPLIER, USHRT_MAX)
-
 /* must be called with ids->rwsem acquired for writing */
 int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int);
 
@@ -120,9 +114,6 @@ static inline int ipc_get_maxid(struct ipc_ids *ids)
 	if (ids->in_use == 0)
 		return -1;
 
-	if (ids->in_use == IPCMNI)
-		return IPCMNI - 1;
-
 	return ids->max_id;
 }
 
@@ -163,7 +154,7 @@ extern int store_msg(void __user *dest, struct msg_msg *msg, size_t len);
 
 static inline int ipc_checkid(struct kern_ipc_perm *ipcp, int uid)
 {
-	return uid / SEQ_MULTIPLIER != ipcp->seq;
+	return uid != ipcp->seq;
 }
 
 static inline void ipc_lock_object(struct kern_ipc_perm *perm)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-15  0:49           ` [RFC][PATCH] ipc: Remove IPCMNI Eric W. Biederman
@ 2018-03-15 17:02             ` Waiman Long
  2018-03-15 19:00               ` Eric W. Biederman
  2018-03-15 19:45             ` Matthew Wilcox
  1 sibling, 1 reply; 42+ messages in thread
From: Waiman Long @ 2018-03-15 17:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox, Stanislav Kinsbursky,
	Linux Containers

On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
> The define IPCMNI was originally the size of a statically sized array in
> the kernel and that has long since been removed.  Therefore there is no
> fundamental reason for IPCMNI.
>
> The only remaining use IPCMNI serves is as a convoluted way to format
> the ipc id to userspace.  It does not appear that anything except for
> the CHECKPOINT_RESTORE code even cares about this variety of assignment
> and the CHECKPOINT_RESTORE code only cares about this weirdness because
> it has to restore these peculiar ids.
>
> Therefore make the assignment of ipc ids match the description in
> Advanced Programming in the Unix Environment and assign the next id
> until INT_MAX is hit then loop around to the lower ids.
>
> This can be implemented trivially with the current code using idr_alloc_cyclic.
>
> To make it possible to keep checkpoint/restore working I have renamed
> the sysctls from xxx_next_id to xxx_nextid.  That is enough change that
> a smart CRIU implementation can see that what is exported has changed,
> and act accordingly.  New kernels will be able to restore the old id's.
>
> This code still needs some real world testing to verify my assumptions.
> And some work with the CRIU implementations to actually add the code
> that deals with the new for of id assignment.
>
> Updates: 03f595668017 ("ipc: add sysctl to specify desired next object id")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>
> Waiman please take a look at this and run it through some tests etc,
> I am pretty certain something like this patch is all you need to do
> to sort out ipc assignment.  Not messing with sysctls needed.
>
>  include/linux/ipc.h           |  2 --
>  include/linux/ipc_namespace.h |  1 -
>  ipc/ipc_sysctl.c              |  6 ++--
>  ipc/namespace.c               | 11 ++----
>  ipc/util.c                    | 80 ++++++++++---------------------------------
>  ipc/util.h                    | 11 +-----
>  6 files changed, 25 insertions(+), 86 deletions(-)
>
> diff --git a/include/linux/ipc.h b/include/linux/ipc.h
> index 821b2f260992..6cc2df7f7ac9 100644
> --- a/include/linux/ipc.h
> +++ b/include/linux/ipc.h
> @@ -8,8 +8,6 @@
>  #include <uapi/linux/ipc.h>
>  #include <linux/refcount.h>
>  
> -#define IPCMNI 32768  /* <= MAX_INT limit for ipc arrays (including sysctl changes) */
> -
>  /* used by in-kernel data structures */
>  struct kern_ipc_perm {
>  	spinlock_t	lock;
> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
> index b5630c8eb2f3..cab33b6a8236 100644
> --- a/include/linux/ipc_namespace.h
> +++ b/include/linux/ipc_namespace.h
> @@ -15,7 +15,6 @@ struct user_namespace;
>  
>  struct ipc_ids {
>  	int in_use;
> -	unsigned short seq;
>  	bool tables_initialized;
>  	struct rw_semaphore rwsem;
>  	struct idr ipcs_idr;
> diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
> index 8ad93c29f511..a599963d58bf 100644
> --- a/ipc/ipc_sysctl.c
> +++ b/ipc/ipc_sysctl.c
> @@ -176,7 +176,7 @@ static struct ctl_table ipc_kern_table[] = {
>  	},
>  #ifdef CONFIG_CHECKPOINT_RESTORE
>  	{
> -		.procname	= "sem_next_id",
> +		.procname	= "sem_nextid",
>  		.data		= &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
>  		.mode		= 0644,
> @@ -185,7 +185,7 @@ static struct ctl_table ipc_kern_table[] = {
>  		.extra2		= &int_max,
>  	},
>  	{
> -		.procname	= "msg_next_id",
> +		.procname	= "msg_nextid",
>  		.data		= &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
>  		.mode		= 0644,
> @@ -194,7 +194,7 @@ static struct ctl_table ipc_kern_table[] = {
>  		.extra2		= &int_max,
>  	},
>  	{
> -		.procname	= "shm_next_id",
> +		.procname	= "shm_nextid",
>  		.data		= &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
>  		.mode		= 0644,

So you are changing the names of existing sysctl parameters. Will it be
better to add new sysctl to indicate that the rule has changed instead?

> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index f59a89966f92..84eaeba9e96c 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -109,20 +109,13 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
>  {
>  	struct kern_ipc_perm *perm;
>  	int next_id;
> -	int total, in_use;
>  
>  	down_write(&ids->rwsem);
> -
> -	in_use = ids->in_use;
> -
> -	for (total = 0, next_id = 0; total < in_use; next_id++) {
> -		perm = idr_find(&ids->ipcs_idr, next_id);
> -		if (perm == NULL)
> -			continue;
> +	next_id = 0;
> +	while ((perm = idr_get_next(&ids->ipcs_idr, &next_id))) {
>  		rcu_read_lock();
>  		ipc_lock_object(perm);
>  		free(ns, perm);
> -		total++;
>  	}
>  	up_write(&ids->rwsem);
>  }
> diff --git a/ipc/util.c b/ipc/util.c
> index 4ed5a17dd06f..ce6bf18e54df 100644
> --- a/ipc/util.c
> +++ b/ipc/util.c
> @@ -118,7 +118,6 @@ int ipc_init_ids(struct ipc_ids *ids)
>  {
>  	int err;
>  	ids->in_use = 0;
> -	ids->seq = 0;
>  	init_rwsem(&ids->rwsem);
>  	err = rhashtable_init(&ids->key_ht, &ipc_kht_params);
>  	if (err)
> @@ -192,46 +191,18 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
>  	return NULL;
>  }
>  
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> -/*
> - * Specify desired id for next allocated IPC object.
> - */
> -#define ipc_idr_alloc(ids, new)						\
> -	idr_alloc(&(ids)->ipcs_idr, (new),				\
> -		  (ids)->next_id < 0 ? 0 : ipcid_to_idx((ids)->next_id),\
> -		  0, GFP_NOWAIT)
>  
> -static inline int ipc_buildid(int id, struct ipc_ids *ids,
> -			      struct kern_ipc_perm *new)
> +static int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new)
>  {
> -	if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
> -		new->seq = ids->seq++;
> -		if (ids->seq > IPCID_SEQ_MAX)
> -			ids->seq = 0;
> -	} else {
> -		new->seq = ipcid_to_seqx(ids->next_id);
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +	if (ids->next_id >= 0) {
> +		idr_set_cursor(&ids->ipcs_idr, ids->next_id);
>  		ids->next_id = -1;
>  	}
> -
> -	return SEQ_MULTIPLIER * new->seq + id;
> -}
> -
> -#else
> -#define ipc_idr_alloc(ids, new)					\
> -	idr_alloc(&(ids)->ipcs_idr, (new), 0, 0, GFP_NOWAIT)
> -
> -static inline int ipc_buildid(int id, struct ipc_ids *ids,
> -			      struct kern_ipc_perm *new)
> -{
> -	new->seq = ids->seq++;
> -	if (ids->seq > IPCID_SEQ_MAX)
> -		ids->seq = 0;
> -
> -	return SEQ_MULTIPLIER * new->seq + id;
> +#endif
> +	return idr_alloc_cyclic(&ids->ipcs_idr, (new), 0, 0, GFP_NOWAIT);
>  }
>  
> -#endif /* CONFIG_CHECKPOINT_RESTORE */
> -
>  /**
>   * ipc_addid - add an ipc identifier
>   * @ids: ipc identifier set
> @@ -251,9 +222,6 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
>  	kgid_t egid;
>  	int id, err;
>  
> -	if (limit > IPCMNI)
> -		limit = IPCMNI;
> -
>  	if (!ids->tables_initialized || ids->in_use >= limit)
>  		return -ENOSPC;
>  
> @@ -290,7 +258,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
>  	if (id > ids->max_id)
>  		ids->max_id = id;
>  
> -	new->id = ipc_buildid(id, ids, new);
> +	new->id = id;
>  
>  	return id;
>  }
> @@ -430,7 +398,7 @@ static void ipc_kht_remove(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
>   */
>  void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
>  {
> -	int lid = ipcid_to_idx(ipcp->id);
> +	int lid = ipcp->id;
>  
>  	idr_remove(&ids->ipcs_idr, lid);
>  	ipc_kht_remove(ids, ipcp);
> @@ -563,7 +531,7 @@ void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out)
>  struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids *ids, int id)
>  {
>  	struct kern_ipc_perm *out;
> -	int lid = ipcid_to_idx(id);
> +	int lid = id;
>  
>  	if (unlikely(!ids->tables_initialized))
>  		return ERR_PTR(-EINVAL);
> @@ -757,30 +725,20 @@ static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids, loff_t pos,
>  					      loff_t *new_pos)
>  {
>  	struct kern_ipc_perm *ipc;
> -	int total, id;
> -
> -	total = 0;
> -	for (id = 0; id < pos && total < ids->in_use; id++) {
> -		ipc = idr_find(&ids->ipcs_idr, id);
> -		if (ipc != NULL)
> -			total++;
> -	}
> +	int id;

I think you need to initialize id to pos. Right?

>  
> -	if (total >= ids->in_use)
> +	/* Out of range - return NULL to terminate iteration */
> +	if (pos > INT_MAX)
>  		return NULL;
>  
> -	for (; pos < IPCMNI; pos++) {
> -		ipc = idr_find(&ids->ipcs_idr, pos);
> -		if (ipc != NULL) {
> -			*new_pos = pos + 1;
> -			rcu_read_lock();
> -			ipc_lock_object(ipc);
> -			return ipc;
> -		}
> -	}
> +	ipc = idr_get_next(&ids->ipcs_idr, &id);
> +	if (!ipc)
> +		return NULL;
>  
> -	/* Out of range - return NULL to terminate iteration */
> -	return NULL;
> +	*new_pos = id + 1;
> +	rcu_read_lock();
> +	ipc_lock_object(ipc);
> +	return ipc;
>  }
>  
>  static void *sysvipc_proc_next(struct seq_file *s, void *it, loff_t *pos)
> diff --git a/ipc/util.h b/ipc/util.h
> index 89b8ec176fc4..de8e27367f0c 100644
> --- a/ipc/util.h
> +++ b/ipc/util.h
> @@ -15,8 +15,6 @@
>  #include <linux/err.h>
>  #include <linux/ipc_namespace.h>
>  
> -#define SEQ_MULTIPLIER	(IPCMNI)
> -
>  int sem_init(void);
>  int msg_init(void);
>  void shm_init(void);
> @@ -93,10 +91,6 @@ void __init ipc_init_proc_interface(const char *path, const char *header,
>  #define IPC_MSG_IDS	1
>  #define IPC_SHM_IDS	2
>  
> -#define ipcid_to_idx(id) ((id) % SEQ_MULTIPLIER)
> -#define ipcid_to_seqx(id) ((id) / SEQ_MULTIPLIER)
> -#define IPCID_SEQ_MAX min_t(int, INT_MAX/SEQ_MULTIPLIER, USHRT_MAX)
> -
>  /* must be called with ids->rwsem acquired for writing */
>  int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int);
>  
> @@ -120,9 +114,6 @@ static inline int ipc_get_maxid(struct ipc_ids *ids)
>  	if (ids->in_use == 0)
>  		return -1;
>  
> -	if (ids->in_use == IPCMNI)
> -		return IPCMNI - 1;
> -
>  	return ids->max_id;
>  }
>  
> @@ -163,7 +154,7 @@ extern int store_msg(void __user *dest, struct msg_msg *msg, size_t len);
>  
>  static inline int ipc_checkid(struct kern_ipc_perm *ipcp, int uid)
>  {
> -	return uid / SEQ_MULTIPLIER != ipcp->seq;
> +	return uid != ipcp->seq;
>  }
>  
>  static inline void ipc_lock_object(struct kern_ipc_perm *perm)

I don't know the history why the id management of SysV IPC was designed
in such a convoluted way, but the patch does make sense to me.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-15 17:02             ` Waiman Long
@ 2018-03-15 19:00               ` Eric W. Biederman
  2018-03-15 21:46                 ` Waiman Long
  0 siblings, 1 reply; 42+ messages in thread
From: Eric W. Biederman @ 2018-03-15 19:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox, Stanislav Kinsbursky,
	Linux Containers

Waiman Long <longman@redhat.com> writes:

> On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
>> The define IPCMNI was originally the size of a statically sized array in
>> the kernel and that has long since been removed.  Therefore there is no
>> fundamental reason for IPCMNI.
>>
>> The only remaining use IPCMNI serves is as a convoluted way to format
>> the ipc id to userspace.  It does not appear that anything except for
>> the CHECKPOINT_RESTORE code even cares about this variety of assignment
>> and the CHECKPOINT_RESTORE code only cares about this weirdness because
>> it has to restore these peculiar ids.
>>
>> Therefore make the assignment of ipc ids match the description in
>> Advanced Programming in the Unix Environment and assign the next id
>> until INT_MAX is hit then loop around to the lower ids.
>>
>> This can be implemented trivially with the current code using idr_alloc_cyclic.
>>
>> To make it possible to keep checkpoint/restore working I have renamed
>> the sysctls from xxx_next_id to xxx_nextid.  That is enough change that
>> a smart CRIU implementation can see that what is exported has changed,
>> and act accordingly.  New kernels will be able to restore the old id's.
>>
>> This code still needs some real world testing to verify my assumptions.
>> And some work with the CRIU implementations to actually add the code
>> that deals with the new for of id assignment.
>>
>> Updates: 03f595668017 ("ipc: add sysctl to specify desired next object id")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>
>> Waiman please take a look at this and run it through some tests etc,
>> I am pretty certain something like this patch is all you need to do
>> to sort out ipc assignment.  Not messing with sysctls needed.
>>
>>  include/linux/ipc.h           |  2 --
>>  include/linux/ipc_namespace.h |  1 -
>>  ipc/ipc_sysctl.c              |  6 ++--
>>  ipc/namespace.c               | 11 ++----
>>  ipc/util.c                    | 80 ++++++++++---------------------------------
>>  ipc/util.h                    | 11 +-----
>>  6 files changed, 25 insertions(+), 86 deletions(-)
>>
>> diff --git a/include/linux/ipc.h b/include/linux/ipc.h
>> index 821b2f260992..6cc2df7f7ac9 100644
>> --- a/include/linux/ipc.h
>> +++ b/include/linux/ipc.h
>> @@ -8,8 +8,6 @@
>>  #include <uapi/linux/ipc.h>
>>  #include <linux/refcount.h>
>>  
>> -#define IPCMNI 32768  /* <= MAX_INT limit for ipc arrays (including sysctl changes) */
>> -
>>  /* used by in-kernel data structures */
>>  struct kern_ipc_perm {
>>  	spinlock_t	lock;
>> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
>> index b5630c8eb2f3..cab33b6a8236 100644
>> --- a/include/linux/ipc_namespace.h
>> +++ b/include/linux/ipc_namespace.h
>> @@ -15,7 +15,6 @@ struct user_namespace;
>>  
>>  struct ipc_ids {
>>  	int in_use;
>> -	unsigned short seq;
>>  	bool tables_initialized;
>>  	struct rw_semaphore rwsem;
>>  	struct idr ipcs_idr;
>> diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
>> index 8ad93c29f511..a599963d58bf 100644
>> --- a/ipc/ipc_sysctl.c
>> +++ b/ipc/ipc_sysctl.c
>> @@ -176,7 +176,7 @@ static struct ctl_table ipc_kern_table[] = {
>>  	},
>>  #ifdef CONFIG_CHECKPOINT_RESTORE
>>  	{
>> -		.procname	= "sem_next_id",
>> +		.procname	= "sem_nextid",
>>  		.data		= &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
>>  		.mode		= 0644,
>> @@ -185,7 +185,7 @@ static struct ctl_table ipc_kern_table[] = {
>>  		.extra2		= &int_max,
>>  	},
>>  	{
>> -		.procname	= "msg_next_id",
>> +		.procname	= "msg_nextid",
>>  		.data		= &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
>>  		.mode		= 0644,
>> @@ -194,7 +194,7 @@ static struct ctl_table ipc_kern_table[] = {
>>  		.extra2		= &int_max,
>>  	},
>>  	{
>> -		.procname	= "shm_next_id",
>> +		.procname	= "shm_nextid",
>>  		.data		= &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
>>  		.mode		= 0644,
>
> So you are changing the names of existing sysctl parameters. Will it be
> better to add new sysctl to indicate that the rule has changed
> instead?

In practice I am replacing one set of sysctls with another, that work
very similarly but not quite the same.  As we can't keep the existing
semantics removing the old sysctl seems correct.  Likewise adding
a new sysctl with slightly changed semantics seems correct.

This needs an accompanying patch to CRIU to see which sysctls are
available and to change it's behavior based upon that.  The practical
question is what makes it easiest not to confuse CRIU.

Not having the sysctl should be something that CRIU detects today
and the old versions should fail gracefully.  But testing is needed.
Adding a new sysctl to say the behavior has changed and reusing the
old names won't have the same effect of disabling existing versions
of CRIU.

>> diff --git a/ipc/util.c b/ipc/util.c
>> index 4ed5a17dd06f..ce6bf18e54df 100644
>> --- a/ipc/util.c
>> +++ b/ipc/util.c
>> @@ -118,7 +118,6 @@ int ipc_init_ids(struct ipc_ids *ids)
>>  {
>>  	int err;
>>  	ids->in_use = 0;
>> -	ids->seq = 0;
>>  	init_rwsem(&ids->rwsem);
>>  	err = rhashtable_init(&ids->key_ht, &ipc_kht_params);
>>  	if (err)
>> @@ -192,46 +191,18 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
>>  	return NULL;
>>  }
>>  
>> -#ifdef CONFIG_CHECKPOINT_RESTORE
>> -/*
>> - * Specify desired id for next allocated IPC object.
>> - */
>> -#define ipc_idr_alloc(ids, new)						\
>> -	idr_alloc(&(ids)->ipcs_idr, (new),				\
>> -		  (ids)->next_id < 0 ? 0 : ipcid_to_idx((ids)->next_id),\
>> -		  0, GFP_NOWAIT)
>>  
>> -static inline int ipc_buildid(int id, struct ipc_ids *ids,
>> -			      struct kern_ipc_perm *new)
>> +static int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new)
>>  {
>> -	if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
>> -		new->seq = ids->seq++;
>> -		if (ids->seq > IPCID_SEQ_MAX)
>> -			ids->seq = 0;
>> -	} else {
>> -		new->seq = ipcid_to_seqx(ids->next_id);
>> +#ifdef CONFIG_CHECKPOINT_RESTORE
>> +	if (ids->next_id >= 0) {
>> +		idr_set_cursor(&ids->ipcs_idr, ids->next_id);
>>  		ids->next_id = -1;
>>  	}
>> -
>> -	return SEQ_MULTIPLIER * new->seq + id;
>> -}
>> -
>> -#else
>> -#define ipc_idr_alloc(ids, new)					\
>> -	idr_alloc(&(ids)->ipcs_idr, (new), 0, 0, GFP_NOWAIT)
>> -
>> -static inline int ipc_buildid(int id, struct ipc_ids *ids,
>> -			      struct kern_ipc_perm *new)
>> -{
>> -	new->seq = ids->seq++;
>> -	if (ids->seq > IPCID_SEQ_MAX)
>> -		ids->seq = 0;
>> -
>> -	return SEQ_MULTIPLIER * new->seq + id;
>> +#endif
>> +	return idr_alloc_cyclic(&ids->ipcs_idr, (new), 0, 0, GFP_NOWAIT);
>>  }
>>  
>> -#endif /* CONFIG_CHECKPOINT_RESTORE */
>> -
>>  /**
>>   * ipc_addid - add an ipc identifier
>>   * @ids: ipc identifier set
>> @@ -251,9 +222,6 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
>>  	kgid_t egid;
>>  	int id, err;
>>  
>> -	if (limit > IPCMNI)
>> -		limit = IPCMNI;
>> -
>>  	if (!ids->tables_initialized || ids->in_use >= limit)
>>  		return -ENOSPC;
>>  
>> @@ -290,7 +258,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
>>  	if (id > ids->max_id)
>>  		ids->max_id = id;
>>  
>> -	new->id = ipc_buildid(id, ids, new);
>> +	new->id = id;
>>  
>>  	return id;
>>  }
>> @@ -430,7 +398,7 @@ static void ipc_kht_remove(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
>>   */
>>  void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
>>  {
>> -	int lid = ipcid_to_idx(ipcp->id);
>> +	int lid = ipcp->id;
>>  
>>  	idr_remove(&ids->ipcs_idr, lid);
>>  	ipc_kht_remove(ids, ipcp);
>> @@ -563,7 +531,7 @@ void ipc64_perm_to_ipc_perm(struct ipc64_perm *in, struct ipc_perm *out)
>>  struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids *ids, int id)
>>  {
>>  	struct kern_ipc_perm *out;
>> -	int lid = ipcid_to_idx(id);
>> +	int lid = id;
>>  
>>  	if (unlikely(!ids->tables_initialized))
>>  		return ERR_PTR(-EINVAL);
>> @@ -757,30 +725,20 @@ static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids, loff_t pos,
>>  					      loff_t *new_pos)
>>  {
>>  	struct kern_ipc_perm *ipc;
>> -	int total, id;
>> -
>> -	total = 0;
>> -	for (id = 0; id < pos && total < ids->in_use; id++) {
>> -		ipc = idr_find(&ids->ipcs_idr, id);
>> -		if (ipc != NULL)
>> -			total++;
>> -	}
>> +	int id;
>
> I think you need to initialize id to pos. Right?
>
Yes.

I have not done more than compiled tested this.  It would probably
be wise to move this bit to a prepatch.  That just changes the
implementation to something more efficient, and removes a use
of IPCMNI at the same time.
>>  
>> -	if (total >= ids->in_use)
>> +	/* Out of range - return NULL to terminate iteration */
>> +	if (pos > INT_MAX)
>>  		return NULL;
>>  
>> -	for (; pos < IPCMNI; pos++) {
>> -		ipc = idr_find(&ids->ipcs_idr, pos);
>> -		if (ipc != NULL) {
>> -			*new_pos = pos + 1;
>> -			rcu_read_lock();
>> -			ipc_lock_object(ipc);
>> -			return ipc;
>> -		}
>> -	}
>> +	ipc = idr_get_next(&ids->ipcs_idr, &id);
>> +	if (!ipc)
>> +		return NULL;
>>  
>> -	/* Out of range - return NULL to terminate iteration */
>> -	return NULL;
>> +	*new_pos = id + 1;
>> +	rcu_read_lock();
>> +	ipc_lock_object(ipc);
>> +	return ipc;
>>  }
>>  
>>  static void *sysvipc_proc_next(struct seq_file *s, void *it, loff_t *pos)
>> diff --git a/ipc/util.h b/ipc/util.h
>> index 89b8ec176fc4..de8e27367f0c 100644
>> --- a/ipc/util.h
>> +++ b/ipc/util.h
>> @@ -15,8 +15,6 @@
>>  #include <linux/err.h>
>>  #include <linux/ipc_namespace.h>
>>  
>> -#define SEQ_MULTIPLIER	(IPCMNI)
>> -
>>  int sem_init(void);
>>  int msg_init(void);
>>  void shm_init(void);
>> @@ -93,10 +91,6 @@ void __init ipc_init_proc_interface(const char *path, const char *header,
>>  #define IPC_MSG_IDS	1
>>  #define IPC_SHM_IDS	2
>>  
>> -#define ipcid_to_idx(id) ((id) % SEQ_MULTIPLIER)
>> -#define ipcid_to_seqx(id) ((id) / SEQ_MULTIPLIER)
>> -#define IPCID_SEQ_MAX min_t(int, INT_MAX/SEQ_MULTIPLIER, USHRT_MAX)
>> -
>>  /* must be called with ids->rwsem acquired for writing */
>>  int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int);
>>  
>> @@ -120,9 +114,6 @@ static inline int ipc_get_maxid(struct ipc_ids *ids)
>>  	if (ids->in_use == 0)
>>  		return -1;
>>  
>> -	if (ids->in_use == IPCMNI)
>> -		return IPCMNI - 1;
>> -
>>  	return ids->max_id;
>>  }
>>  
>> @@ -163,7 +154,7 @@ extern int store_msg(void __user *dest, struct msg_msg *msg, size_t len);
>>  
>>  static inline int ipc_checkid(struct kern_ipc_perm *ipcp, int uid)
>>  {
>> -	return uid / SEQ_MULTIPLIER != ipcp->seq;
>> +	return uid != ipcp->seq;
>>  }
>>  
>>  static inline void ipc_lock_object(struct kern_ipc_perm *perm)
>
> I don't know the history why the id management of SysV IPC was designed
> in such a convoluted way, but the patch does make sense to me.

I don't have the full history and we might wind up finding more as we
run this patch through it's paces.

The earliest history I know is what I read in Advanced Programming in
the Unix Environment (which predates linux).  It described the ipc ids
as assigned from a counter that wraps.  I thought like my patch
implements. On closer reading it has a counter that increases each time
the slot is used, and then wraps.  Exactly like Linux before my patch.
*Grrr*

The existing structure of the bifurcated is present in Linux 1.0.  At
that time SHMMNI was 256.  SHMMNI was the size of a static array of shm
segments.  The high 24 bits held a sequence number that was incremented
when a segment was removed at the time.  Presumably the upper bits were
incremented to avoid swiftly reusing the same shm ids.

Hmm.  I took a quick look at FreeBSD10 and it has the exact same split
in the id.  So userspace may actually depend upon that split.

Which comes down to the fundamental question what depends upon what.
How do other operating systems like Solaris handle this?

Does any nix flavor support more that 16bits worth of shm segments?

The API has been deprecated for the last 20 years and we are still
keeping it alive.  Sigh.

Still there is fundamentally only one thing the kernel can do if we wish
to increase the number of shm segments.

Please take my patch and test it out and see if you can find anything
that cares about the change. Except for needing id reuse to be
infrequent I can not imagine that there is anything that cares.

It could very reasonably be argued that my when shmmni is < INT_MAX
my patch implements a version of the existing algorithm.  As we go
through all of the possible ids before we reuse any of them.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-15  0:49           ` [RFC][PATCH] ipc: Remove IPCMNI Eric W. Biederman
  2018-03-15 17:02             ` Waiman Long
@ 2018-03-15 19:45             ` Matthew Wilcox
  1 sibling, 0 replies; 42+ messages in thread
From: Matthew Wilcox @ 2018-03-15 19:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Waiman Long, Luis R. Rodriguez, Kees Cook, linux-kernel,
	linux-fsdevel, Andrew Morton, Al Viro, Stanislav Kinsbursky,
	Linux Containers

On Wed, Mar 14, 2018 at 07:49:29PM -0500, Eric W. Biederman wrote:
> @@ -109,20 +109,13 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
>  {
>  	struct kern_ipc_perm *perm;
>  	int next_id;
> -	int total, in_use;
>  
>  	down_write(&ids->rwsem);
> -
> -	in_use = ids->in_use;
> -
> -	for (total = 0, next_id = 0; total < in_use; next_id++) {
> -		perm = idr_find(&ids->ipcs_idr, next_id);
> -		if (perm == NULL)
> -			continue;
> +	next_id = 0;
> +	while ((perm = idr_get_next(&ids->ipcs_idr, &next_id))) {
>  		rcu_read_lock();
>  		ipc_lock_object(perm);
>  		free(ns, perm);
> -		total++;
>  	}
>  	up_write(&ids->rwsem);
>  }

We have a helper for this:

	idr_for_each_entry(&ids->ipcs_idr, perm, next_id) {
		rcu_read_lock();
		ipc_lock_object(perm);
		free(ns, perm);
	}

(using idr_get_next() is tricky because you have to remember to increment
next_id yourself, and you didn't).

> +static int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new)
>  {
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +	if (ids->next_id >= 0) {
> +		idr_set_cursor(&ids->ipcs_idr, ids->next_id);
>  		ids->next_id = -1;
>  	}
> +#endif
> +	return idr_alloc_cyclic(&ids->ipcs_idr, (new), 0, 0, GFP_NOWAIT);
>  }
>  
> -#endif /* CONFIG_CHECKPOINT_RESTORE */

That seems a little convoluted; is there a reason to not call idr_set_cursor()
instead of assigning to ids->next_id?

> @@ -757,30 +725,20 @@ static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids, loff_t pos,
>  					      loff_t *new_pos)
>  {
>  	struct kern_ipc_perm *ipc;
> +	int id;
>  
> +	/* Out of range - return NULL to terminate iteration */
> +	if (pos > INT_MAX)
>  		return NULL;
>  
> +	ipc = idr_get_next(&ids->ipcs_idr, &id);
> +	if (!ipc)
> +		return NULL;
>  
> +	*new_pos = id + 1;
> +	rcu_read_lock();
> +	ipc_lock_object(ipc);
> +	return ipc;
>  }

I'm no expert on the IPC locking, but I would have thought you'd want to
call rcu_read_lock() before calling idr_get_next() to avoid a simultaneous
delete from freeing 'ipc'.

Oh, I see.  proc_start takes the rwsem for read and proc_stop releases it.
The locking here seems quite shabby and in need of renovation.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-15 19:00               ` Eric W. Biederman
@ 2018-03-15 21:46                 ` Waiman Long
  2018-03-29  2:14                   ` Davidlohr Bueso
  0 siblings, 1 reply; 42+ messages in thread
From: Waiman Long @ 2018-03-15 21:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Matthew Wilcox, Stanislav Kinsbursky,
	Linux Containers

On 03/15/2018 03:00 PM, Eric W. Biederman wrote:
> Waiman Long <longman@redhat.com> writes:
>
>> On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
>>> The define IPCMNI was originally the size of a statically sized array in
>>> the kernel and that has long since been removed.  Therefore there is no
>>> fundamental reason for IPCMNI.
>>>
>>> The only remaining use IPCMNI serves is as a convoluted way to format
>>> the ipc id to userspace.  It does not appear that anything except for
>>> the CHECKPOINT_RESTORE code even cares about this variety of assignment
>>> and the CHECKPOINT_RESTORE code only cares about this weirdness because
>>> it has to restore these peculiar ids.
>>>
>>> Therefore make the assignment of ipc ids match the description in
>>> Advanced Programming in the Unix Environment and assign the next id
>>> until INT_MAX is hit then loop around to the lower ids.
>>>
>>> This can be implemented trivially with the current code using idr_alloc_cyclic.
>>>
>>> To make it possible to keep checkpoint/restore working I have renamed
>>> the sysctls from xxx_next_id to xxx_nextid.  That is enough change that
>>> a smart CRIU implementation can see that what is exported has changed,
>>> and act accordingly.  New kernels will be able to restore the old id's.
>>>
>>> This code still needs some real world testing to verify my assumptions.
>>> And some work with the CRIU implementations to actually add the code
>>> that deals with the new for of id assignment.
>>>
>>> Updates: 03f595668017 ("ipc: add sysctl to specify desired next object id")
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>> ---
>>>
>>> Waiman please take a look at this and run it through some tests etc,
>>> I am pretty certain something like this patch is all you need to do
>>> to sort out ipc assignment.  Not messing with sysctls needed.
>>>
>>>  include/linux/ipc.h           |  2 --
>>>  include/linux/ipc_namespace.h |  1 -
>>>  ipc/ipc_sysctl.c              |  6 ++--
>>>  ipc/namespace.c               | 11 ++----
>>>  ipc/util.c                    | 80 ++++++++++---------------------------------
>>>  ipc/util.h                    | 11 +-----
>>>  6 files changed, 25 insertions(+), 86 deletions(-)
>>>
>>> diff --git a/include/linux/ipc.h b/include/linux/ipc.h
>>> index 821b2f260992..6cc2df7f7ac9 100644
>>> --- a/include/linux/ipc.h
>>> +++ b/include/linux/ipc.h
>>> @@ -8,8 +8,6 @@
>>>  #include <uapi/linux/ipc.h>
>>>  #include <linux/refcount.h>
>>>  
>>> -#define IPCMNI 32768  /* <= MAX_INT limit for ipc arrays (including sysctl changes) */
>>> -
>>>  /* used by in-kernel data structures */
>>>  struct kern_ipc_perm {
>>>  	spinlock_t	lock;
>>> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
>>> index b5630c8eb2f3..cab33b6a8236 100644
>>> --- a/include/linux/ipc_namespace.h
>>> +++ b/include/linux/ipc_namespace.h
>>> @@ -15,7 +15,6 @@ struct user_namespace;
>>>  
>>>  struct ipc_ids {
>>>  	int in_use;
>>> -	unsigned short seq;
>>>  	bool tables_initialized;
>>>  	struct rw_semaphore rwsem;
>>>  	struct idr ipcs_idr;
>>> diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
>>> index 8ad93c29f511..a599963d58bf 100644
>>> --- a/ipc/ipc_sysctl.c
>>> +++ b/ipc/ipc_sysctl.c
>>> @@ -176,7 +176,7 @@ static struct ctl_table ipc_kern_table[] = {
>>>  	},
>>>  #ifdef CONFIG_CHECKPOINT_RESTORE
>>>  	{
>>> -		.procname	= "sem_next_id",
>>> +		.procname	= "sem_nextid",
>>>  		.data		= &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
>>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
>>>  		.mode		= 0644,
>>> @@ -185,7 +185,7 @@ static struct ctl_table ipc_kern_table[] = {
>>>  		.extra2		= &int_max,
>>>  	},
>>>  	{
>>> -		.procname	= "msg_next_id",
>>> +		.procname	= "msg_nextid",
>>>  		.data		= &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
>>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
>>>  		.mode		= 0644,
>>> @@ -194,7 +194,7 @@ static struct ctl_table ipc_kern_table[] = {
>>>  		.extra2		= &int_max,
>>>  	},
>>>  	{
>>> -		.procname	= "shm_next_id",
>>> +		.procname	= "shm_nextid",
>>>  		.data		= &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
>>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
>>>  		.mode		= 0644,
>> So you are changing the names of existing sysctl parameters. Will it be
>> better to add new sysctl to indicate that the rule has changed
>> instead?
> In practice I am replacing one set of sysctls with another, that work
> very similarly but not quite the same.  As we can't keep the existing
> semantics removing the old sysctl seems correct.  Likewise adding
> a new sysctl with slightly changed semantics seems correct.
>
> This needs an accompanying patch to CRIU to see which sysctls are
> available and to change it's behavior based upon that.  The practical
> question is what makes it easiest not to confuse CRIU.
>
> Not having the sysctl should be something that CRIU detects today
> and the old versions should fail gracefully.  But testing is needed.
> Adding a new sysctl to say the behavior has changed and reusing the
> old names won't have the same effect of disabling existing versions
> of CRIU.

That is fine as long as CRIU is the only user.

>
>> I don't know the history why the id management of SysV IPC was designed
>> in such a convoluted way, but the patch does make sense to me.
> I don't have the full history and we might wind up finding more as we
> run this patch through it's paces.
>
> The earliest history I know is what I read in Advanced Programming in
> the Unix Environment (which predates linux).  It described the ipc ids
> as assigned from a counter that wraps.  I thought like my patch
> implements. On closer reading it has a counter that increases each time
> the slot is used, and then wraps.  Exactly like Linux before my patch.
> *Grrr*
>
> The existing structure of the bifurcated is present in Linux 1.0.  At
> that time SHMMNI was 256.  SHMMNI was the size of a static array of shm
> segments.  The high 24 bits held a sequence number that was incremented
> when a segment was removed at the time.  Presumably the upper bits were
> incremented to avoid swiftly reusing the same shm ids.
>
> Hmm.  I took a quick look at FreeBSD10 and it has the exact same split
> in the id.  So userspace may actually depend upon that split.

Backward compatibility is the part that I am most worry about this
patch. That is also the reason I asked why the ID is generated in such a
way.

My original thinking was to have an extended mode where the IPCMNI
becomes 8M from 32k. That will reduce the sequence number from 16 bits
to 8 bits. The extended mode is enabled by adding, for example, a boot
option. So this will be an opt-in feature instead of as a default.

>
> Which comes down to the fundamental question what depends upon what.
> How do other operating systems like Solaris handle this?

I don't know how Solaris handle this, but I know they support up to 2^24
shm segments.

>
> Does any nix flavor support more that 16bits worth of shm segments?
>
> The API has been deprecated for the last 20 years and we are still
> keeping it alive.  Sigh.
>
> Still there is fundamentally only one thing the kernel can do if we wish
> to increase the number of shm segments.
>
> Please take my patch and test it out and see if you can find anything
> that cares about the change. Except for needing id reuse to be
> infrequent I can not imagine that there is anything that cares.
>
> It could very reasonably be argued that my when shmmni is < INT_MAX
> my patch implements a version of the existing algorithm.  As we go
> through all of the possible ids before we reuse any of them.
>
> Eric
>
Thanks for the patch, I am still thinking about what is the best way to
handle this.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-15 21:46                 ` Waiman Long
@ 2018-03-29  2:14                   ` Davidlohr Bueso
  2018-03-29  8:47                     ` Manfred Spraul
  0 siblings, 1 reply; 42+ messages in thread
From: Davidlohr Bueso @ 2018-03-29  2:14 UTC (permalink / raw)
  To: Waiman Long, Michael Kerrisk
  Cc: Eric W. Biederman, Manfred Spraul, Luis R. Rodriguez, Kees Cook,
	linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox, Stanislav Kinsbursky, Linux Containers,
	linux-api

Cc'ing mtk, Manfred and linux-api.

See below.

On Thu, 15 Mar 2018, Waiman Long wrote:

>On 03/15/2018 03:00 PM, Eric W. Biederman wrote:
>> Waiman Long <longman@redhat.com> writes:
>>
>>> On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
>>>> The define IPCMNI was originally the size of a statically sized array in
>>>> the kernel and that has long since been removed.  Therefore there is no
>>>> fundamental reason for IPCMNI.
>>>>
>>>> The only remaining use IPCMNI serves is as a convoluted way to format
>>>> the ipc id to userspace.  It does not appear that anything except for
>>>> the CHECKPOINT_RESTORE code even cares about this variety of assignment
>>>> and the CHECKPOINT_RESTORE code only cares about this weirdness because
>>>> it has to restore these peculiar ids.
>>>>
>>>> Therefore make the assignment of ipc ids match the description in
>>>> Advanced Programming in the Unix Environment and assign the next id
>>>> until INT_MAX is hit then loop around to the lower ids.
>>>>
>>>> This can be implemented trivially with the current code using idr_alloc_cyclic.
>>>>
>>>> To make it possible to keep checkpoint/restore working I have renamed
>>>> the sysctls from xxx_next_id to xxx_nextid.  That is enough change that
>>>> a smart CRIU implementation can see that what is exported has changed,
>>>> and act accordingly.  New kernels will be able to restore the old id's.
>>>>
>>>> This code still needs some real world testing to verify my assumptions.
>>>> And some work with the CRIU implementations to actually add the code
>>>> that deals with the new for of id assignment.
>>>>
>>>> Updates: 03f595668017 ("ipc: add sysctl to specify desired next object id")
>>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>> ---
>>>>
>>>> Waiman please take a look at this and run it through some tests etc,
>>>> I am pretty certain something like this patch is all you need to do
>>>> to sort out ipc assignment.  Not messing with sysctls needed.
>>>>
>>>>  include/linux/ipc.h           |  2 --
>>>>  include/linux/ipc_namespace.h |  1 -
>>>>  ipc/ipc_sysctl.c              |  6 ++--
>>>>  ipc/namespace.c               | 11 ++----
>>>>  ipc/util.c                    | 80 ++++++++++---------------------------------
>>>>  ipc/util.h                    | 11 +-----
>>>>  6 files changed, 25 insertions(+), 86 deletions(-)
>>>>
>>>> diff --git a/include/linux/ipc.h b/include/linux/ipc.h
>>>> index 821b2f260992..6cc2df7f7ac9 100644
>>>> --- a/include/linux/ipc.h
>>>> +++ b/include/linux/ipc.h
>>>> @@ -8,8 +8,6 @@
>>>>  #include <uapi/linux/ipc.h>
>>>>  #include <linux/refcount.h>
>>>>
>>>> -#define IPCMNI 32768  /* <= MAX_INT limit for ipc arrays (including sysctl changes) */
>>>> -
>>>>  /* used by in-kernel data structures */
>>>>  struct kern_ipc_perm {
>>>>  	spinlock_t	lock;
>>>> diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
>>>> index b5630c8eb2f3..cab33b6a8236 100644
>>>> --- a/include/linux/ipc_namespace.h
>>>> +++ b/include/linux/ipc_namespace.h
>>>> @@ -15,7 +15,6 @@ struct user_namespace;
>>>>
>>>>  struct ipc_ids {
>>>>  	int in_use;
>>>> -	unsigned short seq;
>>>>  	bool tables_initialized;
>>>>  	struct rw_semaphore rwsem;
>>>>  	struct idr ipcs_idr;
>>>> diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
>>>> index 8ad93c29f511..a599963d58bf 100644
>>>> --- a/ipc/ipc_sysctl.c
>>>> +++ b/ipc/ipc_sysctl.c
>>>> @@ -176,7 +176,7 @@ static struct ctl_table ipc_kern_table[] = {
>>>>  	},
>>>>  #ifdef CONFIG_CHECKPOINT_RESTORE
>>>>  	{
>>>> -		.procname	= "sem_next_id",
>>>> +		.procname	= "sem_nextid",
>>>>  		.data		= &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
>>>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
>>>>  		.mode		= 0644,
>>>> @@ -185,7 +185,7 @@ static struct ctl_table ipc_kern_table[] = {
>>>>  		.extra2		= &int_max,
>>>>  	},
>>>>  	{
>>>> -		.procname	= "msg_next_id",
>>>> +		.procname	= "msg_nextid",
>>>>  		.data		= &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
>>>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
>>>>  		.mode		= 0644,
>>>> @@ -194,7 +194,7 @@ static struct ctl_table ipc_kern_table[] = {
>>>>  		.extra2		= &int_max,
>>>>  	},
>>>>  	{
>>>> -		.procname	= "shm_next_id",
>>>> +		.procname	= "shm_nextid",
>>>>  		.data		= &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
>>>>  		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
>>>>  		.mode		= 0644,
>>> So you are changing the names of existing sysctl parameters. Will it be
>>> better to add new sysctl to indicate that the rule has changed
>>> instead?
>> In practice I am replacing one set of sysctls with another, that work
>> very similarly but not quite the same.  As we can't keep the existing
>> semantics removing the old sysctl seems correct.  Likewise adding
>> a new sysctl with slightly changed semantics seems correct.
>>
>> This needs an accompanying patch to CRIU to see which sysctls are
>> available and to change it's behavior based upon that.  The practical
>> question is what makes it easiest not to confuse CRIU.
>>
>> Not having the sysctl should be something that CRIU detects today
>> and the old versions should fail gracefully.  But testing is needed.
>> Adding a new sysctl to say the behavior has changed and reusing the
>> old names won't have the same effect of disabling existing versions
>> of CRIU.
>
>That is fine as long as CRIU is the only user.
>
>>
>>> I don't know the history why the id management of SysV IPC was designed
>>> in such a convoluted way, but the patch does make sense to me.
>> I don't have the full history and we might wind up finding more as we
>> run this patch through it's paces.
>>
>> The earliest history I know is what I read in Advanced Programming in
>> the Unix Environment (which predates linux).  It described the ipc ids
>> as assigned from a counter that wraps.  I thought like my patch
>> implements. On closer reading it has a counter that increases each time
>> the slot is used, and then wraps.  Exactly like Linux before my patch.
>> *Grrr*
>>
>> The existing structure of the bifurcated is present in Linux 1.0.  At
>> that time SHMMNI was 256.  SHMMNI was the size of a static array of shm
>> segments.  The high 24 bits held a sequence number that was incremented
>> when a segment was removed at the time.  Presumably the upper bits were
>> incremented to avoid swiftly reusing the same shm ids.
>>
>> Hmm.  I took a quick look at FreeBSD10 and it has the exact same split
>> in the id.  So userspace may actually depend upon that split.
>
>Backward compatibility is the part that I am most worry about this
>patch. That is also the reason I asked why the ID is generated in such a
>way.

I share these fears.

Thanks,
Davidlohr

>
>My original thinking was to have an extended mode where the IPCMNI
>becomes 8M from 32k. That will reduce the sequence number from 16 bits
>to 8 bits. The extended mode is enabled by adding, for example, a boot
>option. So this will be an opt-in feature instead of as a default.
>
>>
>> Which comes down to the fundamental question what depends upon what.
>> How do other operating systems like Solaris handle this?
>
>I don't know how Solaris handle this, but I know they support up to 2^24
>shm segments.
>
>>
>> Does any nix flavor support more that 16bits worth of shm segments?
>>
>> The API has been deprecated for the last 20 years and we are still
>> keeping it alive.  Sigh.
>>
>> Still there is fundamentally only one thing the kernel can do if we wish
>> to increase the number of shm segments.
>>
>> Please take my patch and test it out and see if you can find anything
>> that cares about the change. Except for needing id reuse to be
>> infrequent I can not imagine that there is anything that cares.
>>
>> It could very reasonably be argued that my when shmmni is < INT_MAX
>> my patch implements a version of the existing algorithm.  As we go
>> through all of the possible ids before we reuse any of them.
>>
>> Eric
>>
>Thanks for the patch, I am still thinking about what is the best way to
>handle this.
>
>Cheers,
>Longman
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-29  2:14                   ` Davidlohr Bueso
@ 2018-03-29  8:47                     ` Manfred Spraul
  2018-03-29 10:56                       ` Matthew Wilcox
  2018-03-29 20:08                       ` Eric W. Biederman
  0 siblings, 2 replies; 42+ messages in thread
From: Manfred Spraul @ 2018-03-29  8:47 UTC (permalink / raw)
  To: Davidlohr Bueso, Waiman Long, Michael Kerrisk
  Cc: Eric W. Biederman, Luis R. Rodriguez, Kees Cook, linux-kernel,
	linux-fsdevel, Andrew Morton, Al Viro, Matthew Wilcox,
	Stanislav Kinsbursky, Linux Containers, linux-api

Hello together,

On 03/29/2018 04:14 AM, Davidlohr Bueso wrote:
> Cc'ing mtk, Manfred and linux-api.
>
> See below.
>
> On Thu, 15 Mar 2018, Waiman Long wrote:
>
>> On 03/15/2018 03:00 PM, Eric W. Biederman wrote:
>>> Waiman Long <longman@redhat.com> writes:
>>>
>>>> On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
>>>>> The define IPCMNI was originally the size of a statically sized 
>>>>> array in
>>>>> the kernel and that has long since been removed. Therefore there 
>>>>> is no
>>>>> fundamental reason for IPCMNI.
>>>>>
>>>>> The only remaining use IPCMNI serves is as a convoluted way to format
>>>>> the ipc id to userspace.  It does not appear that anything except for
>>>>> the CHECKPOINT_RESTORE code even cares about this variety of 
>>>>> assignment
>>>>> and the CHECKPOINT_RESTORE code only cares about this weirdness 
>>>>> because
>>>>> it has to restore these peculiar ids.
>>>>>
My assumption is that if an array is recreated, it should get a 
different id.
     a=semget(1234,,);
     semctl(a,,IPC_RMID);
     b=semget(1234,,);
now a!=b.

Rational: semop() calls only refer to the array by the id.
If there is a stale process in the system that tries to access the "old" 
array and the new array has the same id, then the locking gets corrupted.
>>>>> Therefore make the assignment of ipc ids match the description in
>>>>> Advanced Programming in the Unix Environment and assign the next id
>>>>> until INT_MAX is hit then loop around to the lower ids.
>>>>>
Ok, sounds good.
That way we really cycle through INT_MAX, right now a==b would happen 
after 128k RMID calls.
>>>>> This can be implemented trivially with the current code using 
>>>>> idr_alloc_cyclic.
>>>>>
Is there a performance impact?
Right now, the idr tree is only large if there are lots of objects.
What happens if we have only 1 object, with id=INT_MAX-1?

semop() that do not sleep are fairly fast.
The same applies for msgsnd/msgrcv, if the message is small enough.

@Davidlohr:
Do you know if there are application that frequently call semop() and it 
doesn't have to sleep?
 From the scalability that was pushed into the kernel, I assume that 
this exists.

I have myself only checked postgresql, and postgresql always sleeps.
(and this was long ago)
>>>>> To make it possible to keep checkpoint/restore working I have renamed
>>>>> the sysctls from xxx_next_id to xxx_nextid.  That is enough change 
>>>>> that
>>>>> a smart CRIU implementation can see that what is exported has 
>>>>> changed,
>>>>> and act accordingly.  New kernels will be able to restore the old 
>>>>> id's.
>>>>>
>>>>> This code still needs some real world testing to verify my 
>>>>> assumptions.
>>>>> And some work with the CRIU implementations to actually add the code
>>>>> that deals with the new for of id assignment.
>>>>>
It means that all existing checkpoint/restore application will not work 
with a new kernel.
Everyone must first update the checkpoint/restore application, then 
update the kernel.

Is this acceptable?

--
     Manfred

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-29  8:47                     ` Manfred Spraul
@ 2018-03-29 10:56                       ` Matthew Wilcox
  2018-03-29 18:07                         ` Manfred Spraul
  2018-03-29 20:08                       ` Eric W. Biederman
  1 sibling, 1 reply; 42+ messages in thread
From: Matthew Wilcox @ 2018-03-29 10:56 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Davidlohr Bueso, Waiman Long, Michael Kerrisk, Eric W. Biederman,
	Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Stanislav Kinsbursky, Linux Containers,
	linux-api

On Thu, Mar 29, 2018 at 10:47:45AM +0200, Manfred Spraul wrote:
> > > > > > This can be implemented trivially with the current code
> > > > > > using idr_alloc_cyclic.
>
> Is there a performance impact?
> Right now, the idr tree is only large if there are lots of objects.
> What happens if we have only 1 object, with id=INT_MAX-1?

The radix tree uses a branching factor of 64 entries (6 bits) per level.
The maximum ID is 31 bits (positive signed 32-bit integer).  So the
worst case for a single object is 6 pointer dereferences to find the
object anywhere in the range (INT_MAX/2 - INT_MAX].  That will read 12
cachelines.  If we were to constrain ourselves to a maximum of INT_MAX/2
(30 bits), we'd reduce that to 5 pointer dereferences and 10 cachelines.

I have plans to make the representation more efficient and bring
the specific case of one element with a high ID down to one pointer
dereference and 2 cachelines, but I have not yet had time to implement
those plans.

>From a memory consumption point of view, 6 layers of tree will consume
6/7 of a page on a 64-bit x86 kernel.  I'm aiming to bring that down to
1/7 of a page.  We get 7 radix_tree_nodes per 4kB page.

(The old IDR tree had 256 entries per level which would have taken only
four layers to get us to 31 bits, but the cost was getting only 3 layers
per 8kB order-1 page, so we'd've taken 2 + 2/3 page to accomplish the
same goal).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-29 10:56                       ` Matthew Wilcox
@ 2018-03-29 18:07                         ` Manfred Spraul
  2018-03-29 18:52                           ` Eric W. Biederman
  2018-03-29 19:32                           ` Matthew Wilcox
  0 siblings, 2 replies; 42+ messages in thread
From: Manfred Spraul @ 2018-03-29 18:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Davidlohr Bueso, Waiman Long, Michael Kerrisk, Eric W. Biederman,
	Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Stanislav Kinsbursky, Linux Containers,
	linux-api

Hello Mathew,

On 03/29/2018 12:56 PM, Matthew Wilcox wrote:
> On Thu, Mar 29, 2018 at 10:47:45AM +0200, Manfred Spraul wrote:
>>>>>>> This can be implemented trivially with the current code
>>>>>>> using idr_alloc_cyclic.
>> Is there a performance impact?
>> Right now, the idr tree is only large if there are lots of objects.
>> What happens if we have only 1 object, with id=INT_MAX-1?
> The radix tree uses a branching factor of 64 entries (6 bits) per level.
> The maximum ID is 31 bits (positive signed 32-bit integer).  So the
> worst case for a single object is 6 pointer dereferences to find the
> object anywhere in the range (INT_MAX/2 - INT_MAX].  That will read 12
> cachelines.  If we were to constrain ourselves to a maximum of INT_MAX/2
> (30 bits), we'd reduce that to 5 pointer dereferences and 10 cachelines.
I'm concerned about the up to 6 branches.
But this is just guessing, we need a test with a realistic workload.

--
     Manfred

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-29 18:07                         ` Manfred Spraul
@ 2018-03-29 18:52                           ` Eric W. Biederman
  2018-03-29 19:32                           ` Matthew Wilcox
  1 sibling, 0 replies; 42+ messages in thread
From: Eric W. Biederman @ 2018-03-29 18:52 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Matthew Wilcox, Davidlohr Bueso, Waiman Long, Michael Kerrisk,
	Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Stanislav Kinsbursky, Linux Containers,
	linux-api

Manfred Spraul <manfred@colorfullife.com> writes:

> Hello Mathew,
>
> On 03/29/2018 12:56 PM, Matthew Wilcox wrote:
>> On Thu, Mar 29, 2018 at 10:47:45AM +0200, Manfred Spraul wrote:
>>>>>>>> This can be implemented trivially with the current code
>>>>>>>> using idr_alloc_cyclic.
>>> Is there a performance impact?
>>> Right now, the idr tree is only large if there are lots of objects.
>>> What happens if we have only 1 object, with id=INT_MAX-1?
>> The radix tree uses a branching factor of 64 entries (6 bits) per level.
>> The maximum ID is 31 bits (positive signed 32-bit integer).  So the
>> worst case for a single object is 6 pointer dereferences to find the
>> object anywhere in the range (INT_MAX/2 - INT_MAX].  That will read 12
>> cachelines.  If we were to constrain ourselves to a maximum of INT_MAX/2
>> (30 bits), we'd reduce that to 5 pointer dereferences and 10 cachelines.
> I'm concerned about the up to 6 branches.
> But this is just guessing, we need a test with a realistic workload.

Yes.

My primary purpose with the patch was to show that the issues with the
current limits could be resolved in a straght forward manner.  I really
don't know if idrs are the appropriate data structure.  It is possible
rbtrees are a better fit.


I think my algorithm I proposed for generating identifiers is as likely
as any to be a good one.  It does need testing on a wide variety of
applications to see what applications actually care about and for that I
think my proposed patch is more than sufficient.

By not keeping a generation counters in the slots themselves linux
already differs substantially from traditional implementations.

Doing something to free us from using a fixed number of bits for the
counter and a fixed number of bits to encode the slot we can support
much larger use of this API.

Eric

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-29 18:07                         ` Manfred Spraul
  2018-03-29 18:52                           ` Eric W. Biederman
@ 2018-03-29 19:32                           ` Matthew Wilcox
  1 sibling, 0 replies; 42+ messages in thread
From: Matthew Wilcox @ 2018-03-29 19:32 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Davidlohr Bueso, Waiman Long, Michael Kerrisk, Eric W. Biederman,
	Luis R. Rodriguez, Kees Cook, linux-kernel, linux-fsdevel,
	Andrew Morton, Al Viro, Stanislav Kinsbursky, Linux Containers,
	linux-api

On Thu, Mar 29, 2018 at 08:07:44PM +0200, Manfred Spraul wrote:
> Hello Mathew,
> 
> On 03/29/2018 12:56 PM, Matthew Wilcox wrote:
> > On Thu, Mar 29, 2018 at 10:47:45AM +0200, Manfred Spraul wrote:
> > > > > > > > This can be implemented trivially with the current code
> > > > > > > > using idr_alloc_cyclic.
> > > Is there a performance impact?
> > > Right now, the idr tree is only large if there are lots of objects.
> > > What happens if we have only 1 object, with id=INT_MAX-1?
> > The radix tree uses a branching factor of 64 entries (6 bits) per level.
> > The maximum ID is 31 bits (positive signed 32-bit integer).  So the
> > worst case for a single object is 6 pointer dereferences to find the
> > object anywhere in the range (INT_MAX/2 - INT_MAX].  That will read 12
> > cachelines.  If we were to constrain ourselves to a maximum of INT_MAX/2
> > (30 bits), we'd reduce that to 5 pointer dereferences and 10 cachelines.
> I'm concerned about the up to 6 branches.
> But this is just guessing, we need a test with a realistic workload.

Yes, and once there's a realistic workload, I'll be happy to prioritise
adapting the data structure to reduce the pointer chases.

FWIW, the plan is this:

There's currently an unused 32-bit field (on 64-bit machines) which I plan
to make the 'base' field.  So at each step down the tree, one subtracts
that field from the index in order to decide which slot to look at next.
Something like this:

index=0x3000'0000
(root) -> order=24
          offset=48 -> order=18
                       offset=0 -> order=12
                                   offset=0 -> order=6
                                               offset=0 -> order=0
                                                           offset=0 -> data

compresses to a single node:
(root) -> order=0
          base=3000'0000
          offset=0 -> data

If one has one entry at 5 and another entry at 0x3000'0000, the tree
looks like this (three nodes):

(root) -> order=24
          base=0
          offset=0  -> order=0
                       base=0
                       offset=5 -> entry1
          offset=48 -> order=0
                       base=0
                       offset=0 -> entry2

The trick is making sure that looking up offset 0x300'1000 returns NULL
and not entry2, but I can make that work.

An alternative is to go to something a little more libJudy and have
mutating internal nodes in the tree that can represent this kind of
situation in a more compact form.  There's a tradeoff to be made between
simplicity of implementation, cost of insertion, cost of lookup and
memory consumption.  I don't know where the right balance point is yet.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] ipc: Remove IPCMNI
  2018-03-29  8:47                     ` Manfred Spraul
  2018-03-29 10:56                       ` Matthew Wilcox
@ 2018-03-29 20:08                       ` Eric W. Biederman
  1 sibling, 0 replies; 42+ messages in thread
From: Eric W. Biederman @ 2018-03-29 20:08 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Davidlohr Bueso, Waiman Long, Michael Kerrisk, Luis R. Rodriguez,
	Kees Cook, linux-kernel, linux-fsdevel, Andrew Morton, Al Viro,
	Matthew Wilcox, Stanislav Kinsbursky, Linux Containers,
	linux-api

Manfred Spraul <manfred@colorfullife.com> writes:

>>>>> On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
>>>>>> To make it possible to keep checkpoint/restore working I have renamed
>>>>>> the sysctls from xxx_next_id to xxx_nextid.  That is enough change that
>>>>>> a smart CRIU implementation can see that what is exported has changed,
>>>>>> and act accordingly.  New kernels will be able to restore the old id's.
>>>>>>
>>>>>> This code still needs some real world testing to verify my assumptions.
>>>>>> And some work with the CRIU implementations to actually add the code
>>>>>> that deals with the new for of id assignment.
>>>>>>
> It means that all existing checkpoint/restore application will not work with a
> new kernel.
> Everyone must first update the checkpoint/restore application, then update the
> kernel.
>
> Is this acceptable?

There is no nead.

I just reread through how next_id is implementated in ipc/util.c and I
had been reading it wrong.  There is no need to change the sysctl.

What criu needs is an interface that specifies the next_id to allocate
both the high and the low bits and that is what this the sysctl
provides.

The implemenation could use a little cleanup so it is easier to
understand something like this perhaps:

diff --git a/ipc/util.c b/ipc/util.c
index 3783b7991cc7..2d4ec6e5b70b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -192,46 +192,32 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
 	return NULL;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-/*
- * Specify desired id for next allocated IPC object.
- */
-#define ipc_idr_alloc(ids, new)						\
-	idr_alloc(&(ids)->ipcs_idr, (new),				\
-		  (ids)->next_id < 0 ? 0 : ipcid_to_idx((ids)->next_id),\
-		  0, GFP_NOWAIT)
-
-static inline int ipc_buildid(int id, struct ipc_ids *ids,
-			      struct kern_ipc_perm *new)
-{
-	if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
-		new->seq = ids->seq++;
-		if (ids->seq > IPCID_SEQ_MAX)
-			ids->seq = 0;
-	} else {
-		new->seq = ipcid_to_seqx(ids->next_id);
-		ids->next_id = -1;
-	}
 
-	return SEQ_MULTIPLIER * new->seq + id;
-}
-
-#else
-#define ipc_idr_alloc(ids, new)					\
-	idr_alloc(&(ids)->ipcs_idr, (new), 0, 0, GFP_NOWAIT)
-
-static inline int ipc_buildid(int id, struct ipc_ids *ids,
-			      struct kern_ipc_perm *new)
+static inline int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new)
 {
+	int id;
+#ifdef CONFIG_CHECKPOINT_RESTORE
+	if (unlikely(new->id >= 0)) {
+		int idx = ipcid_to_idx(new->id);
+		id = idr_alloc(&ids->ipcs_idr, new, idx, 0, GFP_NOWAIT);
+		if (id < 0)
+			return id;
+		if (id != idx) {
+			idr_remove(&ids->ipcs_idr, id);
+			return -EBUSY;
+		}
+		new->seq = ipcid_to_seqx(new->id);
+		return id;
+	}
+#endif
+	id = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT);
 	new->seq = ids->seq++;
 	if (ids->seq > IPCID_SEQ_MAX)
 		ids->seq = 0;
-
-	return SEQ_MULTIPLIER * new->seq + id;
+	new->id = SEQ_MULTIPLIER * new->seq + id;
+	return id;
 }
 
-#endif /* CONFIG_CHECKPOINT_RESTORE */
-
 /**
  * ipc_addid - add an ipc identifier
  * @ids: ipc identifier set
@@ -269,6 +255,10 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
 	new->cuid = new->uid = euid;
 	new->gid = new->cgid = egid;
 
+	new->id = ids->next_id;
+	if (new->id >= 0)
+		ids->next_id = -1;
+
 	id = ipc_idr_alloc(ids, new);
 	idr_preload_end();
 
@@ -290,8 +280,6 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit)
 	if (id > ids->max_id)
 		ids->max_id = id;
 
-	new->id = ipc_buildid(id, ids, new);
-
 	return id;
 }
 

Eric

^ permalink raw reply related	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2018-03-29 20:09 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-12 20:15 [PATCH v4 0/6] ipc: Clamp *mni to the real IPCMNI limit Waiman Long
2018-03-12 20:15 ` [PATCH v4 1/6] sysctl: Add flags to support min/max range clamping Waiman Long
2018-03-12 20:44   ` Luis R. Rodriguez
2018-03-12 20:48     ` Waiman Long
2018-03-13 17:46   ` Eric W. Biederman
2018-03-13 18:49     ` Waiman Long
2018-03-12 20:15 ` [PATCH v4 2/6] proc/sysctl: Check for invalid flags bits Waiman Long
2018-03-12 20:46   ` Luis R. Rodriguez
2018-03-12 20:54     ` Waiman Long
2018-03-12 20:59       ` Luis R. Rodriguez
2018-03-12 21:02         ` Waiman Long
2018-03-12 20:52   ` Andrew Morton
2018-03-12 22:12     ` Waiman Long
2018-03-12 22:42       ` Andrew Morton
2018-03-12 20:15 ` [PATCH v4 3/6] sysctl: Warn when a clamped sysctl parameter is set out of range Waiman Long
2018-03-12 20:50   ` Luis R. Rodriguez
2018-03-12 21:07     ` Waiman Long
2018-03-12 21:00   ` Andrew Morton
2018-03-12 21:04     ` Waiman Long
2018-03-12 20:15 ` [PATCH v4 4/6] ipc: Clamp msgmni and shmmni to the real IPCMNI limit Waiman Long
2018-03-13 18:17   ` Eric W. Biederman
2018-03-13 18:39     ` Waiman Long
2018-03-13 20:29       ` Eric W. Biederman
2018-03-13 21:06         ` Waiman Long
2018-03-15  0:49           ` [RFC][PATCH] ipc: Remove IPCMNI Eric W. Biederman
2018-03-15 17:02             ` Waiman Long
2018-03-15 19:00               ` Eric W. Biederman
2018-03-15 21:46                 ` Waiman Long
2018-03-29  2:14                   ` Davidlohr Bueso
2018-03-29  8:47                     ` Manfred Spraul
2018-03-29 10:56                       ` Matthew Wilcox
2018-03-29 18:07                         ` Manfred Spraul
2018-03-29 18:52                           ` Eric W. Biederman
2018-03-29 19:32                           ` Matthew Wilcox
2018-03-29 20:08                       ` Eric W. Biederman
2018-03-15 19:45             ` Matthew Wilcox
2018-03-12 20:15 ` [PATCH v4 5/6] ipc: Clamp semmni to the real IPCMNI limit Waiman Long
2018-03-12 20:52   ` Luis R. Rodriguez
2018-03-12 20:59     ` Waiman Long
2018-03-12 20:15 ` [PATCH v4 6/6] test_sysctl: Add range clamping test Waiman Long
2018-03-12 20:53   ` Luis R. Rodriguez
2018-03-12 21:00     ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).