linux-kernel.vger.kernel.org archive mirror
* [PATCH] sysctl: Add the kernel.ns_last_pid control
@ 2011-11-28 15:21 Pavel Emelyanov
  2011-11-28 15:53 ` Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Pavel Emelyanov @ 2011-11-28 15:21 UTC
  To: Tejun Heo, Oleg Nesterov, Andrew Morton
  Cc: Linux Kernel Mailing List, Cyrill Gorcunov

The sysctl works on the current task's pid namespace, getting and setting its
last_pid field.

Writing is allowed for CAP_SYS_ADMIN-capable tasks, thus making it possible to
create a task with a desired pid value. This ability is badly needed for
checkpoint/restore in userspace.

This approach suits all the parties for now.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
 Documentation/sysctl/kernel.txt |    8 ++++++++
 kernel/pid.c                    |    4 +++-
 kernel/pid_namespace.c          |   31 +++++++++++++++++++++++++++++++
 3 files changed, 42 insertions(+), 1 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 1f24636..1e9cd67 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -401,6 +401,14 @@ PIDs of value pid_max or larger are not allocated.
 
 ==============================================================
 
+ns_last_pid:
+
+The last pid allocated in the current (the one task using this sysctl
+lives in) pid namespace. When selecting a pid for a next task on fork
+kernel tries to allocate a number starting from this one.
+
+==============================================================
+
 powersave-nap: (PPC only)
 
 If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/kernel/pid.c b/kernel/pid.c
index fa5f722..ce8e00d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -137,7 +137,9 @@ static int pid_before(int base, int a, int b)
 }
 
 /*
- * We might be racing with someone else trying to set pid_ns->last_pid.
+ * We might be racing with someone else trying to set pid_ns->last_pid
+ * at the pid allocation time (there's also a sysctl for this, but racing
+ * with this one is OK, see comment in kernel/pid_namespace.c about it).
  * We want the winner to have the "later" value, because if the
  * "earlier" value prevails, then a pid may get reused immediately.
  *
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index e9c9adc..bcd3f16 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -191,9 +191,40 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	return;
 }
 
+static int pid_ns_ctl_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table tmp = *table;
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/*
+	 * Writing directly to ns' last_pid field is OK, since this field
+	 * is volatile in a living namespace anyway and a code writing to
+	 * it should synchronize its usage with external means.
+	 */
+
+	tmp.data = &current->nsproxy->pid_ns->last_pid;
+	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
+}
+
+static struct ctl_table pid_ns_ctl_table[] = {
+	{
+		.procname = "ns_last_pid",
+		.maxlen = sizeof(int),
+		.mode = 0666, /* permissions are checked in the handler */
+		.proc_handler = pid_ns_ctl_handler,
+	},
+	{ }
+};
+
+static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
+
 static __init int pid_namespaces_init(void)
 {
 	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+	register_sysctl_paths(kern_path, pid_ns_ctl_table);
 	return 0;
 }
 
-- 
1.5.5.6
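
A minimal userspace sketch of how a restorer might use this control. It assumes that the next fork() in the namespace receives last_pid + 1, and that nothing else forks in the namespace between the write and the fork (the "external means" of synchronization the handler comment refers to). The helper names set_ns_last_pid() and fork_with_pid() are hypothetical and not part of the patch; the path is taken as a parameter so the helper can be tried on an ordinary file.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* The file this patch adds; writing it needs CAP_SYS_ADMIN. */
#define NS_LAST_PID_PATH "/proc/sys/kernel/ns_last_pid"

/*
 * Hypothetical helper: write last_pid as decimal text to path.
 * Returns 0 on success, -1 on failure with errno set.
 */
int set_ns_last_pid(const char *path, int last_pid)
{
	char buf[32];
	int fd, len, saved;
	ssize_t ret;

	len = snprintf(buf, sizeof(buf), "%d", last_pid);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	ret = write(fd, buf, (size_t)len);
	if (ret != len) {
		/* Keep the write error (or report a short write) across close(). */
		saved = (ret < 0) ? errno : EIO;
		close(fd);
		errno = saved;
		return -1;
	}
	return close(fd);
}

/*
 * Hypothetical helper: ask for a child with pid 'want'.
 * Assumption: the allocator starts searching at last_pid + 1, so we
 * store want - 1. Returns what fork() returns, or -1 if the write failed.
 */
pid_t fork_with_pid(const char *path, pid_t want)
{
	if (set_ns_last_pid(path, (int)want - 1) < 0)
		return -1;
	return fork();
}
```

In the parent, the caller would compare the value returned by fork_with_pid(NS_LAST_PID_PATH, want) with want; a mismatch means another task in the namespace forked first or the pid was already in use.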


* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
  2011-11-28 15:21 [PATCH] sysctl: Add the kernel.ns_last_pid control Pavel Emelyanov
@ 2011-11-28 15:53 ` Tejun Heo
  2011-11-28 16:04   ` Pavel Emelyanov
  2011-11-29 17:47 ` Oleg Nesterov
  2012-01-12 22:49 ` Andrew Morton
  2 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2011-11-28 15:53 UTC
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov

On Mon, Nov 28, 2011 at 07:21:25PM +0400, Pavel Emelyanov wrote:
> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	struct ctl_table tmp = *table;
> +
> +	if (write && !capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	/*
> +	 * Writing directly to ns' last_pid field is OK, since this field
> +	 * is volatile in a living namespace anyway and a code writing to
> +	 * it should synchronize its usage with external means.
> +	 */

I would still prefer using set_last_pid(), but if you insist on updating
last_pid directly, please note the direct update in the comment on top
of set_last_pid() too.

Other than that,

  Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun
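
For context on the set_last_pid() preference, here is a userspace sketch of the "later value wins" update that the kernel/pid.c comment describes, written with C11 atomics. It is an illustration under assumptions, not the kernel source: only the pid_before(int base, int a, int b) signature appears in this thread, and the function bodies and the name set_last_pid_sketch() are hypothetical. The sysctl handler in the patch bypasses this loop and stores to last_pid directly.

```c
#include <stdatomic.h>

/*
 * Nonzero if 'a' comes before 'b' when both are measured as a
 * wrap-around distance from 'base'.
 */
static int pid_before(int base, int a, int b)
{
	return ((unsigned)a - (unsigned)base) < ((unsigned)b - (unsigned)base);
}

/*
 * Hypothetical sketch: publish 'pid' as the new last_pid, where 'base'
 * is the last_pid value the caller started its search from. If a racing
 * allocator has already stored a value at or past 'pid', leave it alone
 * so that the "later" value prevails.
 */
void set_last_pid_sketch(atomic_int *last_pid, int base, int pid)
{
	int expected = base;

	for (;;) {
		/* On failure, 'expected' is refreshed with the stored value. */
		if (atomic_compare_exchange_strong(last_pid, &expected, pid))
			return;
		/* The stored value is not earlier than ours: the other side wins. */
		if (!pid_before(base, expected, pid))
			return;
	}
}
```

With base 10, a caller publishing pid 11 leaves a concurrently stored 13 untouched, while a caller publishing 12 overwrites a concurrently stored 11.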


* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
  2011-11-28 15:53 ` Tejun Heo
@ 2011-11-28 16:04   ` Pavel Emelyanov
  2011-11-28 16:09     ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: Pavel Emelyanov @ 2011-11-28 16:04 UTC
  To: Tejun Heo
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov

On 11/28/2011 07:53 PM, Tejun Heo wrote:
> On Mon, Nov 28, 2011 at 07:21:25PM +0400, Pavel Emelyanov wrote:
>> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>> +		void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> +	struct ctl_table tmp = *table;
>> +
>> +	if (write && !capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	/*
>> +	 * Writing directly to ns' last_pid field is OK, since this field
>> +	 * is volatile in a living namespace anyway and a code writing to
>> +	 * it should synchronize its usage with external means.
>> +	 */
> 
> I would still prefer using set_last_pid(), but if you insist on updating
> last_pid directly, please note the direct update in the comment on top
> of set_last_pid() too.

It's already there in this patch.

> Other than that,
> 
>   Acked-by: Tejun Heo <tj@kernel.org>

Thanks!

> Thanks.
> 



* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
  2011-11-28 16:04   ` Pavel Emelyanov
@ 2011-11-28 16:09     ` Tejun Heo
  0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2011-11-28 16:09 UTC
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov

On Mon, Nov 28, 2011 at 08:04:40PM +0400, Pavel Emelyanov wrote:
> > I would still prefer using set_last_pid(), but if you insist on updating
> > last_pid directly, please note the direct update in the comment on top
> > of set_last_pid() too.
> 
> It's already there in this patch.

Heh, indeed.  My eyes just skipped over them.  :)

Thanks.

-- 
tejun


* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
  2011-11-28 15:21 [PATCH] sysctl: Add the kernel.ns_last_pid control Pavel Emelyanov
  2011-11-28 15:53 ` Tejun Heo
@ 2011-11-29 17:47 ` Oleg Nesterov
  2011-11-29 18:12   ` Pavel Emelyanov
  2012-01-12 22:49 ` Andrew Morton
  2 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2011-11-29 17:47 UTC
  To: Pavel Emelyanov
  Cc: Tejun Heo, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov

On 11/28, Pavel Emelyanov wrote:
>
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -191,9 +191,40 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	return;
>  }
>
> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	struct ctl_table tmp = *table;
> +
> +	if (write && !capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	/*
> +	 * Writing directly to ns' last_pid field is OK, since this field
> +	 * is volatile in a living namespace anyway and a code writing to
> +	 * it should synchronize its usage with external means.
> +	 */
> +
> +	tmp.data = &current->nsproxy->pid_ns->last_pid;
> +	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
> +}
> +
> +static struct ctl_table pid_ns_ctl_table[] = {
> +	{
> +		.procname = "ns_last_pid",
> +		.maxlen = sizeof(int),
> +		.mode = 0666, /* permissions are checked in the handler */
> +		.proc_handler = pid_ns_ctl_handler,
> +	},
> +	{ }
> +};
> +
> +static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
> +
>  static __init int pid_namespaces_init(void)
>  {
>  	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> +	register_sysctl_paths(kern_path, pid_ns_ctl_table);
>  	return 0;
>  }

Hmm. This way it depends on CONFIG_PID_NS.

Can't we simply add an entry into kern_table[] ? And without ns_, just
/proc/sys/kernel/last_pid.

Oleg.



* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
  2011-11-29 17:47 ` Oleg Nesterov
@ 2011-11-29 18:12   ` Pavel Emelyanov
  2011-11-29 19:22     ` Oleg Nesterov
  0 siblings, 1 reply; 8+ messages in thread
From: Pavel Emelyanov @ 2011-11-29 18:12 UTC
  To: Oleg Nesterov
  Cc: Tejun Heo, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov

On 11/29/2011 09:47 PM, Oleg Nesterov wrote:
> On 11/28, Pavel Emelyanov wrote:
>>
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
>> @@ -191,9 +191,40 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>>  	return;
>>  }
>>
>> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>> +		void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> +	struct ctl_table tmp = *table;
>> +
>> +	if (write && !capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	/*
>> +	 * Writing directly to ns' last_pid field is OK, since this field
>> +	 * is volatile in a living namespace anyway and a code writing to
>> +	 * it should synchronize its usage with external means.
>> +	 */
>> +
>> +	tmp.data = &current->nsproxy->pid_ns->last_pid;
>> +	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
>> +}
>> +
>> +static struct ctl_table pid_ns_ctl_table[] = {
>> +	{
>> +		.procname = "ns_last_pid",
>> +		.maxlen = sizeof(int),
>> +		.mode = 0666, /* permissions are checked in the handler */
>> +		.proc_handler = pid_ns_ctl_handler,
>> +	},
>> +	{ }
>> +};
>> +
>> +static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
>> +
>>  static __init int pid_namespaces_init(void)
>>  {
>>  	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
>> +	register_sysctl_paths(kern_path, pid_ns_ctl_table);
>>  	return 0;
>>  }
> 
> Hmm. This way it depends on CONFIG_PID_NS.

Yes, since this _is_ for namespaces. As we've found out this is close to completely
unusable in the initial namespace in which tasks are just forking without caring
much about what CAP_SYS_ADMIN-s think about this.

> Can't we simply add an entry into kern_table[] ?

And store the .proc_handler function dealing with something that is pid-namespace
specific in the same generic file?

> And without ns_, just /proc/sys/kernel/last_pid.

But that's the namespace's last pid, not just some system-wide last pid.

> Oleg.
> 
> .
> 



* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
  2011-11-29 18:12   ` Pavel Emelyanov
@ 2011-11-29 19:22     ` Oleg Nesterov
  0 siblings, 0 replies; 8+ messages in thread
From: Oleg Nesterov @ 2011-11-29 19:22 UTC
  To: Pavel Emelyanov
  Cc: Tejun Heo, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov

On 11/29, Pavel Emelyanov wrote:
>
> On 11/29/2011 09:47 PM, Oleg Nesterov wrote:
> >> +
> >> +static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
> >> +
> >>  static __init int pid_namespaces_init(void)
> >>  {
> >>  	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> >> +	register_sysctl_paths(kern_path, pid_ns_ctl_table);
> >>  	return 0;
> >>  }
> >
> > Hmm. This way it depends on CONFIG_PID_NS.
>
> Yes, since this _is_ for namespaces. As we've found out this is close to completely
> unusable in the initial namespace in which tasks are just forking without caring
> much about what CAP_SYS_ADMIN-s think about this.

I agree, it is not very usable. Still, I think it can be used.
Say, init can write RESERVED_PIDS to this file. Or you can use it
to test the pid-reuse problems.

> > Can't we simply add an entry into kern_table[] ?
>
> And store the .proc_handler function dealing with something that is pid-namespace
> specific in the same generic file?

Why not? In fact I think that, say, /proc/sys/kernel/pid_max should
act per-namespace too.

> > And without ns_, just /proc/sys/kernel/last_pid.
>
> But that's the namespace's last pid, not just some system-wide last pid.

Sure, it is not system wide. Unless you use it from the root ns.


OK. I do not really care. I think the patch is correct, let's do it
this way.

Oleg.



* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
  2011-11-28 15:21 [PATCH] sysctl: Add the kernel.ns_last_pid control Pavel Emelyanov
  2011-11-28 15:53 ` Tejun Heo
  2011-11-29 17:47 ` Oleg Nesterov
@ 2012-01-12 22:49 ` Andrew Morton
  2 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2012-01-12 22:49 UTC
  To: Pavel Emelyanov
  Cc: Tejun Heo, Oleg Nesterov, Linux Kernel Mailing List, Cyrill Gorcunov

On Mon, 28 Nov 2011 19:21:25 +0400
Pavel Emelyanov <xemul@parallels.com> wrote:

> The sysctl works on the current task's pid namespace, getting and setting its
> last_pid field.
> 
> Writing is allowed for CAP_SYS_ADMIN-capable tasks, thus making it possible to
> create a task with a desired pid value. This ability is badly needed for
> checkpoint/restore in userspace.
> 
> This approach suits all the parties for now.

I'm checking this November patch prior to sending it to Linus...

> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 1f24636..1e9cd67 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -401,6 +401,14 @@ PIDs of value pid_max or larger are not allocated.
>  
>  ==============================================================
>  
> +ns_last_pid:
> +
> +The last pid allocated in the current (the one task using this sysctl
> +lives in) pid namespace. When selecting a pid for a next task on fork
> +kernel tries to allocate a number starting from this one.
> +
> +==============================================================
> +
>  powersave-nap: (PPC only)
>  
>  If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/kernel/pid.c b/kernel/pid.c
> index fa5f722..ce8e00d 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -137,7 +137,9 @@ static int pid_before(int base, int a, int b)
>  }
>  
>  /*
> - * We might be racing with someone else trying to set pid_ns->last_pid.
> + * We might be racing with someone else trying to set pid_ns->last_pid
> + * at the pid allocation time (there's also a sysctl for this, but racing
> + * with this one is OK, see comment in kernel/pid_namespace.c about it).
>   * We want the winner to have the "later" value, because if the
>   * "earlier" value prevails, then a pid may get reused immediately.
>   *
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index e9c9adc..bcd3f16 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -191,9 +191,40 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	return;
>  }
>  
> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	struct ctl_table tmp = *table;
> +
> +	if (write && !capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	/*
> +	 * Writing directly to ns' last_pid field is OK, since this field
> +	 * is volatile in a living namespace anyway and a code writing to
> +	 * it should synchronize its usage with external means.
> +	 */
> +
> +	tmp.data = &current->nsproxy->pid_ns->last_pid;
> +	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
> +}
> +
> +static struct ctl_table pid_ns_ctl_table[] = {
> +	{
> +		.procname = "ns_last_pid",
> +		.maxlen = sizeof(int),
> +		.mode = 0666, /* permissions are checked in the handler */
> +		.proc_handler = pid_ns_ctl_handler,
> +	},
> +	{ }
> +};
> +
> +static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
> +
>  static __init int pid_namespaces_init(void)
>  {
>  	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> +	register_sysctl_paths(kern_path, pid_ns_ctl_table);
>  	return 0;
>  }

I think we should now make this code conditional on the new
CONFIG_CHECKPOINT_RESTORE.  I'll merge the patch as-is and will ask you
or Cyrill to send a followup patch doing this, please?


I'll confess that part of my motivation for wrapping c/r-specific code
inside CONFIG_CHECKPOINT_RESTORE is to make it easy for us to later
delete it all if your c/r project ends up being unsuccessful.  Sorry :)
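
The kind of conditional compilation being requested can be sketched in plain C as follows. This is a userspace illustration only: pid_ns_sysctl_init() is a hypothetical stand-in for the register_sysctl_paths() call in the patch, and the shape of the actual follow-up patch is not shown in this thread.

```c
#include <stdio.h>

/* Build with -DCONFIG_CHECKPOINT_RESTORE to compile the c/r-only path in. */
#ifdef CONFIG_CHECKPOINT_RESTORE
/* Hypothetical stand-in for registering the ns_last_pid table. */
int pid_ns_sysctl_init(void)
{
	puts("registering kernel.ns_last_pid");
	return 1;
}
#else
/* With the option off, the hook does nothing and the control is absent. */
int pid_ns_sysctl_init(void)
{
	return 0;
}
#endif
```

Keeping the handler, the table and the registration call behind one option means the whole feature can later be removed by deleting a single guarded region.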

