* [PATCH] sysctl: Add the kernel.ns_last_pid control
@ 2011-11-28 15:21 Pavel Emelyanov
2011-11-28 15:53 ` Tejun Heo
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Pavel Emelyanov @ 2011-11-28 15:21 UTC
To: Tejun Heo, Oleg Nesterov, Andrew Morton
Cc: Linux Kernel Mailing List, Cyrill Gorcunov
The sysctl works on the current task's pid namespace, getting and setting its
last_pid field.
Writing is allowed for CAP_SYS_ADMIN-capable tasks, which makes it possible to
create a task with a desired pid value. This ability is badly needed for
checkpoint/restore in userspace.
This approach suits all the parties for now.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
Documentation/sysctl/kernel.txt | 8 ++++++++
kernel/pid.c | 4 +++-
kernel/pid_namespace.c | 31 +++++++++++++++++++++++++++++++
3 files changed, 42 insertions(+), 1 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 1f24636..1e9cd67 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -401,6 +401,14 @@ PIDs of value pid_max or larger are not allocated.
==============================================================
+ns_last_pid:
+
+The last pid allocated in the current (the one task using this sysctl
+lives in) pid namespace. When selecting a pid for a next task on fork
+kernel tries to allocate a number starting from this one.
+
+==============================================================
+
powersave-nap: (PPC only)
If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/kernel/pid.c b/kernel/pid.c
index fa5f722..ce8e00d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -137,7 +137,9 @@ static int pid_before(int base, int a, int b)
}
/*
- * We might be racing with someone else trying to set pid_ns->last_pid.
+ * We might be racing with someone else trying to set pid_ns->last_pid
+ * at the pid allocation time (there's also a sysctl for this, but racing
+ * with this one is OK, see comment in kernel/pid_namespace.c about it).
* We want the winner to have the "later" value, because if the
* "earlier" value prevails, then a pid may get reused immediately.
*
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index e9c9adc..bcd3f16 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -191,9 +191,40 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
return;
}
+static int pid_ns_ctl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table tmp = *table;
+
+ if (write && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /*
+ * Writing directly to ns' last_pid field is OK, since this field
+ * is volatile in a living namespace anyway and a code writing to
+ * it should synchronize its usage with external means.
+ */
+
+ tmp.data = &current->nsproxy->pid_ns->last_pid;
+ return proc_dointvec(&tmp, write, buffer, lenp, ppos);
+}
+
+static struct ctl_table pid_ns_ctl_table[] = {
+ {
+ .procname = "ns_last_pid",
+ .maxlen = sizeof(int),
+ .mode = 0666, /* permissions are checked in the handler */
+ .proc_handler = pid_ns_ctl_handler,
+ },
+ { }
+};
+
+static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
+
static __init int pid_namespaces_init(void)
{
pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
+ register_sysctl_paths(kern_path, pid_ns_ctl_table);
return 0;
}
--
1.5.5.6
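For illustration, here is a minimal userspace sketch of how a restorer might use the control described in this patch: take a lock, write the desired pid minus one, then fork. The helper names format_last_pid() and fork_with_pid(), and the use of flock() as the "external means" of synchronization, are assumptions made for this example and are not part of the patch. The sketch also assumes the allocator tries last_pid + 1 first, and that the write fails with EPERM when the caller lacks CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

#define NS_LAST_PID_PATH "/proc/sys/kernel/ns_last_pid"

/*
 * Hypothetical helper: format the value to write into ns_last_pid.
 * Assumption: the next fork tries last_pid + 1, so we store want - 1.
 * Returns the number of characters written to buf.
 */
int format_last_pid(char *buf, size_t len, pid_t want)
{
	assert(want > 1);
	return snprintf(buf, len, "%d", (int)want - 1);
}

/*
 * Hypothetical helper: try to fork a child whose pid is 'want'.
 * Returns the child's pid in the parent, 0 in the child, -1 on error.
 * The result is NOT guaranteed to equal 'want'; the caller must check.
 */
pid_t fork_with_pid(pid_t want)
{
	char buf[32];
	pid_t pid = -1;
	int fd, len;

	fd = open(NS_LAST_PID_PATH, O_RDWR);
	if (fd < 0)
		return -1;

	/* Serialize cooperating writers with an advisory lock on the file. */
	if (flock(fd, LOCK_EX) < 0)
		goto out;

	len = format_last_pid(buf, sizeof(buf), want);
	if (write(fd, buf, len) != len)
		goto out_unlock;	/* e.g. EPERM without CAP_SYS_ADMIN */

	pid = fork();

out_unlock:
	flock(fd, LOCK_UN);
out:
	close(fd);
	return pid;
}
```

A caller would compare the returned pid with the requested one, since an unrelated task in the same namespace may fork in between, or the requested pid may already be in use. The advisory lock only helps among tasks that agree to take it.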
* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
2011-11-28 15:21 [PATCH] sysctl: Add the kernel.ns_last_pid control Pavel Emelyanov
@ 2011-11-28 15:53 ` Tejun Heo
2011-11-28 16:04 ` Pavel Emelyanov
2011-11-29 17:47 ` Oleg Nesterov
2012-01-12 22:49 ` Andrew Morton
2 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2011-11-28 15:53 UTC
To: Pavel Emelyanov
Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov
On Mon, Nov 28, 2011 at 07:21:25PM +0400, Pavel Emelyanov wrote:
> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + struct ctl_table tmp = *table;
> +
> + if (write && !capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + /*
> + * Writing directly to ns' last_pid field is OK, since this field
> + * is volatile in a living namespace anyway and a code writing to
> + * it should synchronize its usage with external means.
> + */
I would still prefer using set_last_pid(), but if you insist on updating
last_pid directly, please note the direct update in the comment on top
of set_last_pid() too.
Other than that,
Acked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
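For context on the set_last_pid() preference above, here is a userspace model of the compare-and-swap retry loop that the comment in kernel/pid.c describes, written with C11 atomics. It is a sketch under the assumption that the helper retries only while the stored value is "earlier" than the new pid relative to the allocation base. set_last_pid_model() is a hypothetical name, and this is not the kernel source.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Is 'a' before 'b' when counting upward from 'base'?
 * Unsigned subtraction keeps the ordering correct across wraparound.
 */
int pid_before(int base, int a, int b)
{
	return ((unsigned)a - (unsigned)base) < ((unsigned)b - (unsigned)base);
}

/*
 * Model of a racing last_pid update: store 'pid' unless a concurrent
 * writer has already stored a value that is "later" than 'pid'.
 * 'base' is the last_pid value this allocator started from.
 */
void set_last_pid_model(_Atomic int *last_pid, int base, int pid)
{
	int prev;
	int last_write = base;

	assert(last_pid);

	do {
		prev = last_write;
		/*
		 * On success last_write keeps the expected value (== prev);
		 * on failure it is overwritten with the value actually found.
		 */
		atomic_compare_exchange_strong(last_pid, &last_write, pid);
	} while (prev != last_write && pid_before(base, last_write, pid));
}
```

The direct write done by the sysctl handler skips this ordering check, which is why the patch documents it in the comment above set_last_pid() and leaves synchronization to the writer.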
* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
2011-11-28 15:53 ` Tejun Heo
@ 2011-11-28 16:04 ` Pavel Emelyanov
2011-11-28 16:09 ` Tejun Heo
0 siblings, 1 reply; 8+ messages in thread
From: Pavel Emelyanov @ 2011-11-28 16:04 UTC
To: Tejun Heo
Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov
On 11/28/2011 07:53 PM, Tejun Heo wrote:
> On Mon, Nov 28, 2011 at 07:21:25PM +0400, Pavel Emelyanov wrote:
>> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>> + void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> + struct ctl_table tmp = *table;
>> +
>> + if (write && !capable(CAP_SYS_ADMIN))
>> + return -EPERM;
>> +
>> + /*
>> + * Writing directly to ns' last_pid field is OK, since this field
>> + * is volatile in a living namespace anyway and a code writing to
>> + * it should synchronize its usage with external means.
>> + */
>
> I would still prefer using set_last_pid(), but if you insist on updating
> last_pid directly, please note the direct update in the comment on top
> of set_last_pid() too.
It's already there in this patch.
> Other than that,
>
> Acked-by: Tejun Heo <tj@kernel.org>
Thanks!
> Thanks.
>
* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
2011-11-28 16:04 ` Pavel Emelyanov
@ 2011-11-28 16:09 ` Tejun Heo
0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2011-11-28 16:09 UTC
To: Pavel Emelyanov
Cc: Oleg Nesterov, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov
On Mon, Nov 28, 2011 at 08:04:40PM +0400, Pavel Emelyanov wrote:
> > I would still prefer using set_last_pid(), but if you insist on updating
> > last_pid directly, please note the direct update in the comment on top
> > of set_last_pid() too.
>
> It's already there in this patch.
Heh, indeed. My eyes just skipped over them. :)
Thanks.
--
tejun
* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
2011-11-28 15:21 [PATCH] sysctl: Add the kernel.ns_last_pid control Pavel Emelyanov
2011-11-28 15:53 ` Tejun Heo
@ 2011-11-29 17:47 ` Oleg Nesterov
2011-11-29 18:12 ` Pavel Emelyanov
2012-01-12 22:49 ` Andrew Morton
2 siblings, 1 reply; 8+ messages in thread
From: Oleg Nesterov @ 2011-11-29 17:47 UTC
To: Pavel Emelyanov
Cc: Tejun Heo, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov
On 11/28, Pavel Emelyanov wrote:
>
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -191,9 +191,40 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> return;
> }
>
> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + struct ctl_table tmp = *table;
> +
> + if (write && !capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + /*
> + * Writing directly to ns' last_pid field is OK, since this field
> + * is volatile in a living namespace anyway and a code writing to
> + * it should synchronize its usage with external means.
> + */
> +
> + tmp.data = &current->nsproxy->pid_ns->last_pid;
> + return proc_dointvec(&tmp, write, buffer, lenp, ppos);
> +}
> +
> +static struct ctl_table pid_ns_ctl_table[] = {
> + {
> + .procname = "ns_last_pid",
> + .maxlen = sizeof(int),
> + .mode = 0666, /* permissions are checked in the handler */
> + .proc_handler = pid_ns_ctl_handler,
> + },
> + { }
> +};
> +
> +static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
> +
> static __init int pid_namespaces_init(void)
> {
> pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> + register_sysctl_paths(kern_path, pid_ns_ctl_table);
> return 0;
> }
Hmm. This way it depends on CONFIG_PID_NS.
Can't we simply add an entry into kern_table[] ? And without ns_, just
/proc/sys/kernel/last_pid.
Oleg.
* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
2011-11-29 17:47 ` Oleg Nesterov
@ 2011-11-29 18:12 ` Pavel Emelyanov
2011-11-29 19:22 ` Oleg Nesterov
0 siblings, 1 reply; 8+ messages in thread
From: Pavel Emelyanov @ 2011-11-29 18:12 UTC
To: Oleg Nesterov
Cc: Tejun Heo, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov
On 11/29/2011 09:47 PM, Oleg Nesterov wrote:
> On 11/28, Pavel Emelyanov wrote:
>>
>> --- a/kernel/pid_namespace.c
>> +++ b/kernel/pid_namespace.c
>> @@ -191,9 +191,40 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>> return;
>> }
>>
>> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>> + void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> + struct ctl_table tmp = *table;
>> +
>> + if (write && !capable(CAP_SYS_ADMIN))
>> + return -EPERM;
>> +
>> + /*
>> + * Writing directly to ns' last_pid field is OK, since this field
>> + * is volatile in a living namespace anyway and a code writing to
>> + * it should synchronize its usage with external means.
>> + */
>> +
>> + tmp.data = &current->nsproxy->pid_ns->last_pid;
>> + return proc_dointvec(&tmp, write, buffer, lenp, ppos);
>> +}
>> +
>> +static struct ctl_table pid_ns_ctl_table[] = {
>> + {
>> + .procname = "ns_last_pid",
>> + .maxlen = sizeof(int),
>> + .mode = 0666, /* permissions are checked in the handler */
>> + .proc_handler = pid_ns_ctl_handler,
>> + },
>> + { }
>> +};
>> +
>> +static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
>> +
>> static __init int pid_namespaces_init(void)
>> {
>> pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
>> + register_sysctl_paths(kern_path, pid_ns_ctl_table);
>> return 0;
>> }
>
> Hmm. This way it depends on CONFIG_PID_NS.
Yes, since this _is_ for namespaces. As we've found out, this is close to completely
unusable in the initial namespace, in which tasks just fork without caring
much about what CAP_SYS_ADMIN-s think about it.
> Can't we simply add an entry into kern_table[] ?
And store the .proc_handler function, which deals with something that is pid-namespace
specific, in the same generic file?
> And without ns_, just /proc/sys/kernel/last_pid.
But that's the namespace's last pid, not just some system-wide last pid.
> Oleg.
>
> .
>
* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
2011-11-29 18:12 ` Pavel Emelyanov
@ 2011-11-29 19:22 ` Oleg Nesterov
0 siblings, 0 replies; 8+ messages in thread
From: Oleg Nesterov @ 2011-11-29 19:22 UTC
To: Pavel Emelyanov
Cc: Tejun Heo, Andrew Morton, Linux Kernel Mailing List, Cyrill Gorcunov
On 11/29, Pavel Emelyanov wrote:
>
> On 11/29/2011 09:47 PM, Oleg Nesterov wrote:
> >> +
> >> +static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
> >> +
> >> static __init int pid_namespaces_init(void)
> >> {
> >> pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> >> + register_sysctl_paths(kern_path, pid_ns_ctl_table);
> >> return 0;
> >> }
> >
> > Hmm. This way it depends on CONFIG_PID_NS.
>
> Yes, since this _is_ for namespaces. As we've found out, this is close to completely
> unusable in the initial namespace, in which tasks just fork without caring
> much about what CAP_SYS_ADMIN-s think about it.
I agree, it is not very usable. Still, I think it can be used.
Say, init can write RESERVED_PIDS to this file. Or you can use it
to test the pid-reuse problems.
> > Can't we simply add an entry into kern_table[] ?
>
> And store the .proc_handler function, which deals with something that is pid-namespace
> specific, in the same generic file?
Why not? In fact I think that, say, /proc/sys/kernel/pid_max should
act per-namespace too.
> > And without ns_, just /proc/sys/kernel/last_pid.
>
> But that's the namespace's last pid, not just some system-wide last pid.
Sure, it is not system wide. Unless you use it from the root ns.
OK. I do not really care. I think the patch is correct, let's do it
this way.
Oleg.
* Re: [PATCH] sysctl: Add the kernel.ns_last_pid control
2011-11-28 15:21 [PATCH] sysctl: Add the kernel.ns_last_pid control Pavel Emelyanov
2011-11-28 15:53 ` Tejun Heo
2011-11-29 17:47 ` Oleg Nesterov
@ 2012-01-12 22:49 ` Andrew Morton
2 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2012-01-12 22:49 UTC
To: Pavel Emelyanov
Cc: Tejun Heo, Oleg Nesterov, Linux Kernel Mailing List, Cyrill Gorcunov
On Mon, 28 Nov 2011 19:21:25 +0400
Pavel Emelyanov <xemul@parallels.com> wrote:
> The sysctl works on the current task's pid namespace, getting and setting its
> last_pid field.
>
> Writing is allowed for CAP_SYS_ADMIN-capable tasks, which makes it possible to
> create a task with a desired pid value. This ability is badly needed for
> checkpoint/restore in userspace.
>
> This approach suits all the parties for now.
I'm checking this November patch prior to sending it to Linus...
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 1f24636..1e9cd67 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -401,6 +401,14 @@ PIDs of value pid_max or larger are not allocated.
>
> ==============================================================
>
> +ns_last_pid:
> +
> +The last pid allocated in the current (the one task using this sysctl
> +lives in) pid namespace. When selecting a pid for a next task on fork
> +kernel tries to allocate a number starting from this one.
> +
> +==============================================================
> +
> powersave-nap: (PPC only)
>
> If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/kernel/pid.c b/kernel/pid.c
> index fa5f722..ce8e00d 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -137,7 +137,9 @@ static int pid_before(int base, int a, int b)
> }
>
> /*
> - * We might be racing with someone else trying to set pid_ns->last_pid.
> + * We might be racing with someone else trying to set pid_ns->last_pid
> + * at the pid allocation time (there's also a sysctl for this, but racing
> + * with this one is OK, see comment in kernel/pid_namespace.c about it).
> * We want the winner to have the "later" value, because if the
> * "earlier" value prevails, then a pid may get reused immediately.
> *
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index e9c9adc..bcd3f16 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -191,9 +191,40 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> return;
> }
>
> +static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + struct ctl_table tmp = *table;
> +
> + if (write && !capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + /*
> + * Writing directly to ns' last_pid field is OK, since this field
> + * is volatile in a living namespace anyway and a code writing to
> + * it should synchronize its usage with external means.
> + */
> +
> + tmp.data = &current->nsproxy->pid_ns->last_pid;
> + return proc_dointvec(&tmp, write, buffer, lenp, ppos);
> +}
> +
> +static struct ctl_table pid_ns_ctl_table[] = {
> + {
> + .procname = "ns_last_pid",
> + .maxlen = sizeof(int),
> + .mode = 0666, /* permissions are checked in the handler */
> + .proc_handler = pid_ns_ctl_handler,
> + },
> + { }
> +};
> +
> +static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
> +
> static __init int pid_namespaces_init(void)
> {
> pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> + register_sysctl_paths(kern_path, pid_ns_ctl_table);
> return 0;
> }
I think we should now make this code conditional on the new
CONFIG_CHECKPOINT_RESTORE. I'll merge the patch as-is and will ask you
or Cyrill to send a followup patch doing this, please?
I'll confess that part of my motivation for wrapping c/r-specific code
inside CONFIG_CHECKPOINT_RESTORE is to make it easy for us to later
delete it all if your c/r project ends up being unsuccessful. Sorry :)