All of lore.kernel.org
 help / color / mirror / Atom feed
* [rfc 0/3] A small bundle in a sake of checkpoint/restore
@ 2011-11-29 19:12 Cyrill Gorcunov
  2011-11-29 19:12 ` [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat Cyrill Gorcunov
                   ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-29 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Tejun Heo, Andrew Vagin, Serge Hallyn,
	Pavel Emelyanov, Vasiliy Kulikov

Hi,

this series introduce a few changes we need for checkpoint,
restore procedures. Mostly harmless I think, but complains
and comments are really appreciated. My main concern is
new prctl codes.

Thanks,
 Cyrill

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat
  2011-11-29 19:12 [rfc 0/3] A small bundle in a sake of checkpoint/restore Cyrill Gorcunov
@ 2011-11-29 19:12 ` Cyrill Gorcunov
  2011-11-29 20:06   ` Kees Cook
                     ` (2 more replies)
  2011-11-29 19:12 ` [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status Cyrill Gorcunov
  2011-11-29 19:12 ` [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires Cyrill Gorcunov
  2 siblings, 3 replies; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-29 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Tejun Heo, Andrew Vagin, Serge Hallyn,
	Pavel Emelyanov, Vasiliy Kulikov, Cyrill Gorcunov

[-- Attachment #1: fs-proc-Add-start_data-end_data-start_brk-members --]
[-- Type: text/plain, Size: 1137 bytes --]

It helps to dump and restore this mm_struct members.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 fs/proc/array.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -464,7 +464,7 @@ static int do_task_stat(struct seq_file
 
 	seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
 		pid_nr_ns(pid, ns),
 		tcomm,
 		state,
@@ -511,7 +511,10 @@ static int do_task_stat(struct seq_file
 		task->policy,
 		(unsigned long long)delayacct_blkio_ticks(task),
 		cputime_to_clock_t(gtime),
-		cputime_to_clock_t(cgtime));
+		cputime_to_clock_t(cgtime),
+		mm ? (permitted ? mm->start_data : 1) : 0,
+		mm ? (permitted ? mm->end_data : 1) : 0,
+		mm ? (permitted ? mm->start_brk : 1) : 0);
 	if (mm)
 		mmput(mm);
 	return 0;


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-11-29 19:12 [rfc 0/3] A small bundle in a sake of checkpoint/restore Cyrill Gorcunov
  2011-11-29 19:12 ` [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat Cyrill Gorcunov
@ 2011-11-29 19:12 ` Cyrill Gorcunov
  2011-11-30  5:00   ` KAMEZAWA Hiroyuki
  2011-11-29 19:12 ` [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires Cyrill Gorcunov
  2 siblings, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-29 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Tejun Heo, Andrew Vagin, Serge Hallyn,
	Pavel Emelyanov, Vasiliy Kulikov, Cyrill Gorcunov

[-- Attachment #1: fs-proc-Introduce-the-Children-line-in-proc-pid-stat --]
[-- Type: text/plain, Size: 1592 bytes --]

From: Pavel Emelyanov <xemul@parallels.com>

There is no easy way to make a reverse parent->children chain
from the task status, in turn children->parent provided with "PPid"
field.

So instead of walking over all pids in system to figure out what
children the task have -- we add explicit "Children" member to
/proc/<pid>/status since kernel already knows this kind of information
but it was not yet exported.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 fs/proc/array.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -158,6 +158,18 @@ static inline const char *get_task_state(struct task_struct *tsk)
 	return *p;
 }
 
+static void task_children(struct seq_file *m, struct task_struct *p, struct pid_namespace *ns)
+{
+	struct task_struct *c;
+
+	seq_printf(m, "Children:");
+	read_lock(&tasklist_lock);
+	list_for_each_entry(c, &p->children, sibling)
+		seq_printf(m, " %d", pid_nr_ns(task_pid(c), ns));
+	read_unlock(&tasklist_lock);
+	seq_putc(m, '\n');
+}
+
 static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 				struct pid *pid, struct task_struct *p)
 {
@@ -192,6 +204,8 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		cred->uid, cred->euid, cred->suid, cred->fsuid,
 		cred->gid, cred->egid, cred->sgid, cred->fsgid);
 
+	task_children(m, p, ns);
+
 	task_lock(p);
 	if (p->files)
 		fdt = files_fdtable(p->files);
-- 
1.7.7.3



^ permalink raw reply	[flat|nested] 51+ messages in thread

* [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-29 19:12 [rfc 0/3] A small bundle in a sake of checkpoint/restore Cyrill Gorcunov
  2011-11-29 19:12 ` [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat Cyrill Gorcunov
  2011-11-29 19:12 ` [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status Cyrill Gorcunov
@ 2011-11-29 19:12 ` Cyrill Gorcunov
  2011-11-29 20:19   ` Kees Cook
  2 siblings, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-29 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Tejun Heo, Andrew Vagin, Serge Hallyn,
	Pavel Emelyanov, Vasiliy Kulikov, Cyrill Gorcunov

[-- Attachment #1: prctl-tune-up-mm_struct-members --]
[-- Type: text/plain, Size: 2634 bytes --]

A few members of mm_struct such as start_code, end_code,
start_data, end_data, start_stack, start_brk, brk provided
by the kernel via /proc/$pid/stat and we use it at checkpoint
time.

At restore time we need a mechanism to restore those values
back and for this sake PR_SET_MM prctl code is introduced.

Note at moment this inteface is allowed for CAP_SYS_ADMIN
only.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---

Actually I'm not sure if CAP_SYS_ADMIN restriction is
really needed here. Opinions?

 include/linux/prctl.h |   12 +++++++++++
 kernel/sys.c          |   52 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

Index: linux-2.6.git/include/linux/prctl.h
===================================================================
--- linux-2.6.git.orig/include/linux/prctl.h
+++ linux-2.6.git/include/linux/prctl.h
@@ -102,4 +102,16 @@
 
 #define PR_MCE_KILL_GET 34
 
+/*
+ * Tune up process memory map specifics.
+ */
+#define PR_SET_MM		35
+# define PR_SET_MM_START_CODE		1
+# define PR_SET_MM_END_CODE		2
+# define PR_SET_MM_START_DATA		3
+# define PR_SET_MM_END_DATA		4
+# define PR_SET_MM_START_STACK		5
+# define PR_SET_MM_START_BRK		6
+# define PR_SET_MM_BRK			7
+
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6.git/kernel/sys.c
===================================================================
--- linux-2.6.git.orig/kernel/sys.c
+++ linux-2.6.git/kernel/sys.c
@@ -1841,6 +1841,58 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_SET_MM: {
+			struct mm_struct *mm;
+			struct vm_area_struct *vma;
+
+			if (arg4 | arg5)
+				return -EINVAL;
+
+			if (!capable(CAP_SYS_ADMIN))
+				return -EPERM;
+
+			error = -ENOENT;
+			mm = get_task_mm(current);
+			if (!mm)
+				return error;
+
+			down_read(&mm->mmap_sem);
+			vma = find_vma(mm, arg3);
+			if (!vma)
+				goto out;
+
+			error = 0;
+			switch (arg2) {
+			case PR_SET_MM_START_CODE:
+				current->mm->start_code = arg3;
+				break;
+			case PR_SET_MM_END_CODE:
+				current->mm->end_code = arg3;
+				break;
+			case PR_SET_MM_START_DATA:
+				current->mm->start_data = arg3;
+				break;
+			case PR_SET_MM_END_DATA:
+				current->mm->end_data = arg3;
+				break;
+			case PR_SET_MM_START_STACK:
+				current->mm->start_stack = arg3;
+				break;
+			case PR_SET_MM_START_BRK:
+				current->mm->start_brk = arg3;
+				break;
+			case PR_SET_MM_BRK:
+				current->mm->brk = arg3;
+				break;
+			default:
+				error = -EINVAL;
+				break;
+			}
+out:
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			break;
+		}
 		default:
 			error = -EINVAL;
 			break;


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat
  2011-11-29 19:12 ` [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat Cyrill Gorcunov
@ 2011-11-29 20:06   ` Kees Cook
  2011-12-02  0:24     ` Alexey Dobriyan
  2011-11-29 20:32   ` Serge Hallyn
  2011-11-30  5:04   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 51+ messages in thread
From: Kees Cook @ 2011-11-29 20:06 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Tue, Nov 29, 2011 at 11:12:53PM +0400, Cyrill Gorcunov wrote:
> It helps to dump and restore this mm_struct members.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> ---
>  fs/proc/array.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.git/fs/proc/array.c
> ===================================================================
> --- linux-2.6.git.orig/fs/proc/array.c
> +++ linux-2.6.git/fs/proc/array.c
> @@ -464,7 +464,7 @@ static int do_task_stat(struct seq_file
>  
>  	seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
>  %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
> -%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
> +%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
>  		pid_nr_ns(pid, ns),
>  		tcomm,
>  		state,
> @@ -511,7 +511,10 @@ static int do_task_stat(struct seq_file
>  		task->policy,
>  		(unsigned long long)delayacct_blkio_ticks(task),
>  		cputime_to_clock_t(gtime),
> -		cputime_to_clock_t(cgtime));
> +		cputime_to_clock_t(cgtime),
> +		mm ? (permitted ? mm->start_data : 1) : 0,
> +		mm ? (permitted ? mm->end_data : 1) : 0,
> +		mm ? (permitted ? mm->start_brk : 1) : 0);
>  	if (mm)
>  		mmput(mm);
>  	return 0;

Thanks for using "permitted" here. :)

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-29 19:12 ` [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires Cyrill Gorcunov
@ 2011-11-29 20:19   ` Kees Cook
  2011-11-29 20:29     ` Cyrill Gorcunov
  0 siblings, 1 reply; 51+ messages in thread
From: Kees Cook @ 2011-11-29 20:19 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Tue, Nov 29, 2011 at 11:12:55PM +0400, Cyrill Gorcunov wrote:
> At restore time we need a mechanism to restore those values
> back and for this sake PR_SET_MM prctl code is introduced.
> 
> Note at moment this inteface is allowed for CAP_SYS_ADMIN
> only.

NAK from me; this needs more bounds checking. Though, yes, it absolutely
must be a privileged action since this is potentially very dangerous. Can
we invent something stronger than CAP_SYS_ADMIN? ;)

> @@ -1841,6 +1841,58 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
>  			else
>  				error = PR_MCE_KILL_DEFAULT;
>  			break;
> +		case PR_SET_MM: {
> +			struct mm_struct *mm;
> +			struct vm_area_struct *vma;
> +
> +			if (arg4 | arg5)
> +				return -EINVAL;
> +
> +			if (!capable(CAP_SYS_ADMIN))
> +				return -EPERM;
> +
> +			error = -ENOENT;
> +			mm = get_task_mm(current);
> +			if (!mm)
> +				return error;
> +
> +			down_read(&mm->mmap_sem);
> +			vma = find_vma(mm, arg3);
> +			if (!vma)
> +				goto out;

arg3 needs to be significantly more carefully validated. find_vma() doesn't
say that vm_start <= addr, only that vm_end > addr. This effectively
bypasses all the vma checks (mmap_min_addr, max process size, etc), with
some pretty crazy side-effects, I think.

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-29 20:19   ` Kees Cook
@ 2011-11-29 20:29     ` Cyrill Gorcunov
  2011-11-29 20:37       ` Cyrill Gorcunov
                         ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-29 20:29 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Tue, Nov 29, 2011 at 12:19:38PM -0800, Kees Cook wrote:
> On Tue, Nov 29, 2011 at 11:12:55PM +0400, Cyrill Gorcunov wrote:
> > At restore time we need a mechanism to restore those values
> > back and for this sake PR_SET_MM prctl code is introduced.
> > 
> > Note at moment this inteface is allowed for CAP_SYS_ADMIN
> > only.
> 
> NAK from me; this needs more bounds checking. Though, yes, it absolutely
> must be a privileged action since this is potentially very dangerous. Can
> we invent something stronger than CAP_SYS_ADMIN? ;)

Heh.

> 
> > @@ -1841,6 +1841,58 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
> >  			else
> >  				error = PR_MCE_KILL_DEFAULT;
> >  			break;
> > +		case PR_SET_MM: {
> > +			struct mm_struct *mm;
> > +			struct vm_area_struct *vma;
> > +
> > +			if (arg4 | arg5)
> > +				return -EINVAL;
> > +
> > +			if (!capable(CAP_SYS_ADMIN))
> > +				return -EPERM;
> > +
> > +			error = -ENOENT;
> > +			mm = get_task_mm(current);
> > +			if (!mm)
> > +				return error;
> > +
> > +			down_read(&mm->mmap_sem);
> > +			vma = find_vma(mm, arg3);
> > +			if (!vma)
> > +				goto out;
> 
> arg3 needs to be significantly more carefully validated. find_vma() doesn't
> say that vm_start <= addr, only that vm_end > addr. This effectively
> bypasses all the vma checks (mmap_min_addr, max process size, etc), with
> some pretty crazy side-effects, I think.
> 

Yes, I know it needs some more testing, but apart from vma bounds (yup,
good point with find_vma, I'll fix) I thought about what else should be
checked? I think VMA prototype should be checked to fit "code", "data"
templates, ie code should be at least readable and execytable, but what
about data and stack and brk, should stack be executable? That is the
point where I've got a bit confused and though putting RFC out might be
a good idea to collect opinions.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat
  2011-11-29 19:12 ` [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat Cyrill Gorcunov
  2011-11-29 20:06   ` Kees Cook
@ 2011-11-29 20:32   ` Serge Hallyn
  2011-11-30  5:04   ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 51+ messages in thread
From: Serge Hallyn @ 2011-11-29 20:32 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Pavel Emelyanov, Vasiliy Kulikov

On 11/29/2011 01:12 PM, Cyrill Gorcunov wrote:
> It helps to dump and restore this mm_struct members.
>
> Signed-off-by: Cyrill Gorcunov<gorcunov@openvz.org>

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

> ---
>   fs/proc/array.c |    7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> Index: linux-2.6.git/fs/proc/array.c
> ===================================================================
> --- linux-2.6.git.orig/fs/proc/array.c
> +++ linux-2.6.git/fs/proc/array.c
> @@ -464,7 +464,7 @@ static int do_task_stat(struct seq_file
>
>   	seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
>   %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
> -%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
> +%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
>   		pid_nr_ns(pid, ns),
>   		tcomm,
>   		state,
> @@ -511,7 +511,10 @@ static int do_task_stat(struct seq_file
>   		task->policy,
>   		(unsigned long long)delayacct_blkio_ticks(task),
>   		cputime_to_clock_t(gtime),
> -		cputime_to_clock_t(cgtime));
> +		cputime_to_clock_t(cgtime),
> +		mm ? (permitted ? mm->start_data : 1) : 0,
> +		mm ? (permitted ? mm->end_data : 1) : 0,
> +		mm ? (permitted ? mm->start_brk : 1) : 0);
>   	if (mm)
>   		mmput(mm);
>   	return 0;
>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-29 20:29     ` Cyrill Gorcunov
@ 2011-11-29 20:37       ` Cyrill Gorcunov
  2011-11-29 20:40         ` Kees Cook
  2011-11-29 20:37       ` Kees Cook
  2011-11-29 20:49       ` Serge Hallyn
  2 siblings, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-29 20:37 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Wed, Nov 30, 2011 at 12:29:51AM +0400, Cyrill Gorcunov wrote:
...
> > 
> > arg3 needs to be significantly more carefully validated. find_vma() doesn't
> > say that vm_start <= addr, only that vm_end > addr. This effectively
> > bypasses all the vma checks (mmap_min_addr, max process size, etc), with
> > some pretty crazy side-effects, I think.
> > 
> 
> Yes, I know it needs some more testing, but apart from vma bounds (yup,
> good point with find_vma, I'll fix) I thought about what else should be
> checked? I think VMA prototype should be checked to fit "code", "data"
> templates, ie code should be at least readable and execytable, but what
> about data and stack and brk, should stack be executable? That is the
> point where I've got a bit confused and though putting RFC out might be
> a good idea to collect opinions.

On the other hands these fields are set up by elf hanlder code, which
does mmap these areas, so we have to check that particular member
belongs to existing VMA and never cross user-space area, and together
with root-only approach would not it be enough? I'm sure missing something
that is why I'm asking.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-29 20:29     ` Cyrill Gorcunov
  2011-11-29 20:37       ` Cyrill Gorcunov
@ 2011-11-29 20:37       ` Kees Cook
  2011-11-29 20:49       ` Serge Hallyn
  2 siblings, 0 replies; 51+ messages in thread
From: Kees Cook @ 2011-11-29 20:37 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Tue, Nov 29, 2011 at 12:29 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Tue, Nov 29, 2011 at 12:19:38PM -0800, Kees Cook wrote:
>> On Tue, Nov 29, 2011 at 11:12:55PM +0400, Cyrill Gorcunov wrote:
>> > At restore time we need a mechanism to restore those values
>> > back and for this sake PR_SET_MM prctl code is introduced.
>>
>> arg3 needs to be significantly more carefully validated. find_vma() doesn't
>> say that vm_start <= addr, only that vm_end > addr. This effectively
>> bypasses all the vma checks (mmap_min_addr, max process size, etc), with
>> some pretty crazy side-effects, I think.
>
> Yes, I know it needs some more testing, but apart from vma bounds (yup,
> good point with find_vma, I'll fix) I thought about what else should be
> checked? I think VMA prototype should be checked to fit "code", "data"
> templates, ie code should be at least readable and execytable, but what
> about data and stack and brk, should stack be executable? That is the
> point where I've got a bit confused and though putting RFC out might be
> a good idea to collect opinions.

In the most basic sense, it shouldn't be possible to create an mm
state that can't be reached through the normal ELF loader, calls to
mmap(), mprotect(), stack growth, etc. I think it might be worth
examining those paths to see how they're being bounds-checked. (And
note the especially scary handling of the hidden stack guard page.)

-Kees

-- 
Kees Cook
ChromeOS Security

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-29 20:37       ` Cyrill Gorcunov
@ 2011-11-29 20:40         ` Kees Cook
  2011-11-29 20:47           ` Cyrill Gorcunov
  2011-11-30 17:37           ` Cyrill Gorcunov
  0 siblings, 2 replies; 51+ messages in thread
From: Kees Cook @ 2011-11-29 20:40 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Tue, Nov 29, 2011 at 12:37 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Wed, Nov 30, 2011 at 12:29:51AM +0400, Cyrill Gorcunov wrote:
> ...
>> >
>> > arg3 needs to be significantly more carefully validated. find_vma() doesn't
>> > say that vm_start <= addr, only that vm_end > addr. This effectively
>> > bypasses all the vma checks (mmap_min_addr, max process size, etc), with
>> > some pretty crazy side-effects, I think.
>> >
>>
>> Yes, I know it needs some more testing, but apart from vma bounds (yup,
>> good point with find_vma, I'll fix) I thought about what else should be
>> checked? I think VMA prototype should be checked to fit "code", "data"
>> templates, ie code should be at least readable and execytable, but what
>> about data and stack and brk, should stack be executable? That is the
>> point where I've got a bit confused and though putting RFC out might be
>> a good idea to collect opinions.
>
> On the other hands these fields are set up by elf hanlder code, which
> does mmap these areas, so we have to check that particular member
> belongs to existing VMA and never cross user-space area, and together
> with root-only approach would not it be enough? I'm sure missing something
> that is why I'm asking.

Right, if you verify that the addresses are actually inside valid
userspace vmas, that is likely to be right, though there are probably
other things I haven't thought of. The trouble is avoiding vdso, stack
guard page, vsyscall, and anything else that isn't meant for the mm to
have direct access to.

-- 
Kees Cook
ChromeOS Security

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-29 20:40         ` Kees Cook
@ 2011-11-29 20:47           ` Cyrill Gorcunov
  2011-11-30 17:37           ` Cyrill Gorcunov
  1 sibling, 0 replies; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-29 20:47 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Tue, Nov 29, 2011 at 12:40:57PM -0800, Kees Cook wrote:
> 
> Right, if you verify that the addresses are actually inside valid
> userspace vmas, that is likely to be right, though there are probably
> other things I haven't thought of. The trouble is avoiding vdso, stack
> guard page, vsyscall, and anything else that isn't meant for the mm to
> have direct access to.

Yeah, these service area should not be messed up, I'll add such tests.
Thanks a lot for comments, Kees!

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-29 20:29     ` Cyrill Gorcunov
  2011-11-29 20:37       ` Cyrill Gorcunov
  2011-11-29 20:37       ` Kees Cook
@ 2011-11-29 20:49       ` Serge Hallyn
  2011-11-29 20:55         ` Cyrill Gorcunov
  2 siblings, 1 reply; 51+ messages in thread
From: Serge Hallyn @ 2011-11-29 20:49 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kees Cook, linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Pavel Emelyanov, Vasiliy Kulikov

On 11/29/2011 02:29 PM, Cyrill Gorcunov wrote:
> On Tue, Nov 29, 2011 at 12:19:38PM -0800, Kees Cook wrote:
>> On Tue, Nov 29, 2011 at 11:12:55PM +0400, Cyrill Gorcunov wrote:
>>> At restore time we need a mechanism to restore those values
>>> back and for this sake PR_SET_MM prctl code is introduced.
>>>
>>> Note at moment this inteface is allowed for CAP_SYS_ADMIN
>>> only.
>>
>> NAK from me; this needs more bounds checking. Though, yes, it absolutely
>> must be a privileged action since this is potentially very dangerous. Can
>> we invent something stronger than CAP_SYS_ADMIN? ;)
>
> Heh.
>
>>
>>> @@ -1841,6 +1841,58 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
>>>   			else
>>>   				error = PR_MCE_KILL_DEFAULT;
>>>   			break;
>>> +		case PR_SET_MM: {
>>> +			struct mm_struct *mm;
>>> +			struct vm_area_struct *vma;
>>> +
>>> +			if (arg4 | arg5)
>>> +				return -EINVAL;
>>> +
>>> +			if (!capable(CAP_SYS_ADMIN))
>>> +				return -EPERM;
>>> +
>>> +			error = -ENOENT;
>>> +			mm = get_task_mm(current);
>>> +			if (!mm)
>>> +				return error;
>>> +
>>> +			down_read(&mm->mmap_sem);
>>> +			vma = find_vma(mm, arg3);
>>> +			if (!vma)
>>> +				goto out;
>>
>> arg3 needs to be significantly more carefully validated. find_vma() doesn't
>> say that vm_start<= addr, only that vm_end>  addr. This effectively
>> bypasses all the vma checks (mmap_min_addr, max process size, etc), with
>> some pretty crazy side-effects, I think.
>>
>
> Yes, I know it needs some more testing, but apart from vma bounds (yup,
> good point with find_vma, I'll fix) I thought about what else should be
> checked? I think VMA prototype should be checked to fit "code", "data"
> templates, ie code should be at least readable and execytable, but what
> about data and stack and brk, should stack be executable? That is the
> point where I've got a bit confused and though putting RFC out might be
> a good idea to collect opinions.

My memory is a bit hazy here, but cryo 
(http://git.sr71.net/?p=cryo-forhallyn.git;a=summary) did also do this 
from userspace.  As I recall the one problem we had was ... that we 
couldn't lower the mm_start of the first segment?  I think.  But I bring 
it up only because the advantage of doing it this way was that all of 
the ptrace protections automatically applied.

-serge

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-29 20:49       ` Serge Hallyn
@ 2011-11-29 20:55         ` Cyrill Gorcunov
  0 siblings, 0 replies; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-29 20:55 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Kees Cook, linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Pavel Emelyanov, Vasiliy Kulikov

On Tue, Nov 29, 2011 at 02:49:11PM -0600, Serge Hallyn wrote:
> 
> My memory is a bit hazy here, but cryo
> (http://git.sr71.net/?p=cryo-forhallyn.git;a=summary) did also do
> this from userspace.  As I recall the one problem we had was ...
> that we couldn't lower the mm_start of the first segment?  I think.
> But I bring it up only because the advantage of doing it this way
> was that all of the ptrace protections automatically applied.
> 

Thanks for the link Serge, I'll take a look. I think for start_code
we might stick for exactly Elf path -- ie it should be vma with
at least "r-xp" prototipe as mapped by elf path (and restrictions
from Kees's mail should be applied on it as well). Other members
are not that clear for me.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-11-29 19:12 ` [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status Cyrill Gorcunov
@ 2011-11-30  5:00   ` KAMEZAWA Hiroyuki
  2011-11-30  6:05     ` Cyrill Gorcunov
  0 siblings, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-30  5:00 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Tue, 29 Nov 2011 23:12:54 +0400
Cyrill Gorcunov <gorcunov@openvz.org> wrote:

> From: Pavel Emelyanov <xemul@parallels.com>
> 
> There is no easy way to make a reverse parent->children chain
> from the task status, in turn children->parent provided with "PPid"
> field.
> 
> So instead of walking over all pids in system to figure out what
> children the task have -- we add explicit "Children" member to
> /proc/<pid>/status since kernel already knows this kind of information
> but it was not yet exported.
> 
> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>

I may be too pessimistic but what amount of overhead will this add to
ps -elf/ top ? Assuming an environment 'ps -elf' is called once per a sec,
if there are 2000 processes, task_list lock is taken 2000 times by this patch.

Isn't it better to add /proc/<pid>/children file or dir (as task)?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat
  2011-11-29 19:12 ` [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat Cyrill Gorcunov
  2011-11-29 20:06   ` Kees Cook
  2011-11-29 20:32   ` Serge Hallyn
@ 2011-11-30  5:04   ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-30  5:04 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Tue, 29 Nov 2011 23:12:53 +0400
Cyrill Gorcunov <gorcunov@openvz.org> wrote:

> It helps to dump and restore this mm_struct members.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-11-30  5:00   ` KAMEZAWA Hiroyuki
@ 2011-11-30  6:05     ` Cyrill Gorcunov
  2011-12-01  9:54       ` Cyrill Gorcunov
  0 siblings, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-30  6:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Wed, Nov 30, 2011 at 02:00:09PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 29 Nov 2011 23:12:54 +0400
> Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> 
> > From: Pavel Emelyanov <xemul@parallels.com>
> > 
> > There is no easy way to make a reverse parent->children chain
> > from the task status, in turn children->parent provided with "PPid"
> > field.
> > 
> > So instead of walking over all pids in system to figure out what
> > children the task have -- we add explicit "Children" member to
> > /proc/<pid>/status since kernel already knows this kind of information
> > but it was not yet exported.
> > 
> > Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> > Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
> > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> 
> I may be too pessimistic but what amount of overhead will this add to
> ps -elf/ top ? Assuming an environment 'ps -elf' is called once per a sec,
> if there are 2000 processes, task_list lock is taken 2000 times by this patch.
> 

Hi Kame, good point! Yes, it introduces latency on high loaded systems.
I must admit I tested this patch on a regular system, where not that much
processes were launched but technically I think more correct would be to
switch to children file. I'll tune up the patch. Thanks!

> Isn't it better to add /proc/<pid>/children file or dir (as task)?
> 
> Thanks,
> -Kame
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-29 20:40         ` Kees Cook
  2011-11-29 20:47           ` Cyrill Gorcunov
@ 2011-11-30 17:37           ` Cyrill Gorcunov
  2011-11-30 18:10             ` Kees Cook
  1 sibling, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-30 17:37 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Tue, Nov 29, 2011 at 12:40:57PM -0800, Kees Cook wrote:
> >
> > On the other hands these fields are set up by elf hanlder code, which
> > does mmap these areas, so we have to check that particular member
> > belongs to existing VMA and never cross user-space area, and together
> > with root-only approach would not it be enough? I'm sure missing something
> > that is why I'm asking.
> 
> Right, if you verify that the addresses are actually inside valid
> userspace vmas, that is likely to be right, though there are probably
> other things I haven't thought of. The trouble is avoiding vdso, stack
> guard page, vsyscall, and anything else that isn't meant for the mm to
> have direct access to.
> 

Hi Kees,

what about this one? Note that these mm_struct members don't affect
kernel much (at least as far as I see, except maybe brk,start_brk and
start_stack values), so I've added some sanity checks here, hope they
would fit. Still main protection is root-only access only. The kernel
itself uses vma_area::start/end members for overlows tests internally
so I think even passing crazy data here won't crash the kernel itself.
What do you think?

	Cyrill
---
prctl: Add PR_SET_MM codes to tune up mm_struct entires v2

A few members of mm_struct such as start_code, end_code,
start_data, end_data, start_stack, start_brk, brk provided
by the kernel via /proc/$pid/stat and we use it at checkpoint
time.

At restore time we need a mechanism to restore those values
back and for this sake PR_SET_MM prctl code is introduced.

Note because of being a dangerous operation this inteface
is allowed for CAP_SYS_ADMIN only.

v2:
 - Add a check for vma start address, testing for vma ending
   address is not enough. From Kees Cook.

 - Add some sanity tests for assigned addresses.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Kees Cook <keescook@chromium.org>
---
 include/linux/prctl.h |   12 +++++
 kernel/sys.c          |  118 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 130 insertions(+)

Index: linux-2.6.git/include/linux/prctl.h
===================================================================
--- linux-2.6.git.orig/include/linux/prctl.h
+++ linux-2.6.git/include/linux/prctl.h
@@ -102,4 +102,16 @@
 
 #define PR_MCE_KILL_GET 34
 
+/*
+ * Tune up process memory map specifics.
+ */
+#define PR_SET_MM		35
+# define PR_SET_MM_START_CODE		1
+# define PR_SET_MM_END_CODE		2
+# define PR_SET_MM_START_DATA		3
+# define PR_SET_MM_END_DATA		4
+# define PR_SET_MM_START_STACK		5
+# define PR_SET_MM_START_BRK		6
+# define PR_SET_MM_BRK			7
+
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6.git/kernel/sys.c
===================================================================
--- linux-2.6.git.orig/kernel/sys.c
+++ linux-2.6.git/kernel/sys.c
@@ -1692,6 +1692,118 @@ SYSCALL_DEFINE1(umask, int, mask)
 	return mask;
 }
 
+static int prctl_set_mm(int opt, unsigned long addr)
+{
+	unsigned long rlim = rlimit(RLIMIT_DATA);
+	unsigned long vm_req_flags;
+	unsigned long vm_bad_flags;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int error = 0;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (addr >= TASK_SIZE)
+		return -EINVAL;
+
+	mm = get_task_mm(current);
+	if (!mm)
+		return -ENOENT;
+
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, addr);
+
+	if (opt != PR_SET_MM_START_BRK &&
+	    opt != PR_SET_MM_BRK) {
+		/* It must be existing VMA */
+		if (!vma || vma->vm_start > addr)
+			goto out;
+	}
+
+	error = -EINVAL;
+	switch (opt) {
+	case PR_SET_MM_START_CODE:
+	case PR_SET_MM_END_CODE:
+
+		vm_req_flags = VM_READ | VM_EXEC;
+		vm_bad_flags = VM_WRITE | VM_MAYSHARE;
+
+		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
+		    (vma->vm_flags & vm_bad_flags))
+			goto out;
+
+		if (opt == PR_SET_MM_START_CODE)
+			current->mm->start_code = addr;
+		else
+			current->mm->end_code = addr;
+		break;
+
+	case PR_SET_MM_START_DATA:
+	case PR_SET_MM_END_DATA:
+
+		vm_req_flags = VM_READ | VM_WRITE;
+		vm_bad_flags = VM_EXEC | VM_MAYSHARE;
+
+		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
+		    (vma->vm_flags & vm_bad_flags))
+			goto out;
+
+		if (opt == PR_SET_MM_START_DATA)
+			current->mm->start_data = addr;
+		else
+			current->mm->end_data = addr;
+		break;
+
+	case PR_SET_MM_START_STACK:
+
+#ifdef CONFIG_STACK_GROWSUP
+		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSUP;
+#else
+		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
+#endif
+		if ((vma->vm_flags & vm_req_flags) != vm_req_flags)
+			goto out;
+
+		current->mm->start_stack = addr;
+		break;
+
+	case PR_SET_MM_START_BRK:
+		if (addr <= mm->end_data)
+			goto out;
+
+		if (rlim < RLIM_INFINITY &&
+		    (mm->brk - addr) + (mm->end_data - mm->start_data) > rlim)
+			goto out;
+
+		current->mm->start_brk = addr;
+		break;
+
+	case PR_SET_MM_BRK:
+		if (addr <= mm->end_data)
+			goto out;
+
+		if (rlim < RLIM_INFINITY &&
+		    (addr - mm->start_brk) + (mm->end_data - mm->start_data) > rlim)
+			goto out;
+
+		current->mm->brk = addr;
+		break;
+
+	default:
+		error = -EINVAL;
+		goto out;
+	}
+
+	error = 0;
+
+out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+
+	return error;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -1841,6 +1953,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_SET_MM: {
+			if (arg4 | arg5)
+				return -EINVAL;
+			error = prctl_set_mm(arg2, arg3);
+			break;
+		}
 		default:
 			error = -EINVAL;
 			break;

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-30 17:37           ` Cyrill Gorcunov
@ 2011-11-30 18:10             ` Kees Cook
  2011-11-30 18:23               ` Cyrill Gorcunov
  0 siblings, 1 reply; 51+ messages in thread
From: Kees Cook @ 2011-11-30 18:10 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Wed, Nov 30, 2011 at 9:37 AM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Tue, Nov 29, 2011 at 12:40:57PM -0800, Kees Cook wrote:
>> >
>> > On the other hands these fields are set up by elf hanlder code, which
>> > does mmap these areas, so we have to check that particular member
>> > belongs to existing VMA and never cross user-space area, and together
>> > with root-only approach would not it be enough? I'm sure missing something
>> > that is why I'm asking.
>>
>> Right, if you verify that the addresses are actually inside valid
>> userspace vmas, that is likely to be right, though there are probably
>> other things I haven't thought of. The trouble is avoiding vdso, stack
>> guard page, vsyscall, and anything else that isn't meant for the mm to
>> have direct access to.
>>
>
> Hi Kees,
>
> what about this one? Note that these mm_struct members don't affect
> kernel much (at least as far as I see, except maybe brk,start_brk and
> start_stack values), so I've added some sanity checks here, hope they
> would fit. Still main protection is root-only access only. The kernel
> itself uses vma_area::start/end members for overlows tests internally
> so I think even passing crazy data here won't crash the kernel itself.

Right, though besides just crashing the kernel, I'm trying to look at
this from the perspective of a paranoid admin that doesn't even trust
the root user. Is there some way this new interface could be used to
provide the building blocks for gaining kernel execute control?
(Imagine a system running with modules disabled, STRICT_DEVMEM
enabled, etc.)

> What do you think?

This looks way better, yes. I have this feeling like these validations
should be more centralized or tied to the mm code more directly to
avoid drift, but I don't have any constructive suggestions
unfortunately. Maybe other folks do?

> +       switch (opt) {
> +       case PR_SET_MM_START_CODE:
> +       case PR_SET_MM_END_CODE:
> +
> +               vm_req_flags = VM_READ | VM_EXEC;
> +               vm_bad_flags = VM_WRITE | VM_MAYSHARE;
> +
> +               if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
> +                   (vma->vm_flags & vm_bad_flags))
> +                       goto out;

Another random thought: given this very regular set of checks you're
doing, perhaps the flags should be part of a data structure instead,
just to reduce the size of this routine?

struct mm_flags {
  int req_flags;
  int bad_flags;
};

struct mm_flags opt_flags[] = {
...
 { VM_READ | VM_EXEC, VM_WRITE | VM_MAYSHARE }, /* PR_SET_MM_START_CODE */
 { VM_READ | VM_EXEC, VM_WRITE | VM_MAYSHARE }, /* PR_SET_MM_END_CODE */
...

then do validation before the switch statement all in one place, and
leave the switch for more programmatic checks?

-Kees

-- 
Kees Cook
ChromeOS Security

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-30 18:10             ` Kees Cook
@ 2011-11-30 18:23               ` Cyrill Gorcunov
  2011-11-30 21:06                 ` Cyrill Gorcunov
  0 siblings, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-30 18:23 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Wed, Nov 30, 2011 at 10:10:35AM -0800, Kees Cook wrote:
...
> >
> > Hi Kees,
> >
> > what about this one? Note that these mm_struct members don't affect
> > kernel much (at least as far as I see, except maybe brk,start_brk and
> > start_stack values), so I've added some sanity checks here, hope they
> > would fit. Still main protection is root-only access only. The kernel
> > itself uses vma_area::start/end members for overlows tests internally
> > so I think even passing crazy data here won't crash the kernel itself.
> 
> Right, though besides just crashing the kernel, I'm trying to look at
> this from the perspective of a paranoid admin that doesn't even trust
> the root user. Is there some way this new interface could be used to
> provide the building blocks for gaining kernel execute control?
> (Imagine a system running with modules disabled, STRICT_DEVMEM
>  enabled, etc.)
> 

Well, I don't see a way to use it for gaining some control execution at the
moment but (!) kernel evolves and I fear one might forget about this new
interface when he will be doing some tossing over this mm_struct members.

I though about adding some kind of mark to kernel log on first usage of
this prctl codes (so if some crash happened for any reason, the log would
point someone used this interface).

> > What do you think?
> 
> This looks way better, yes. I have this feeling like these validations
> should be more centralized or tied to the mm code more directly to
> avoid drift, but I don't have any constructive suggestions
> unfortunately. Maybe other folks do?
> 

Yeah, good point, but on the other hands have these tests in one place
is easier to read I think.

> > +       switch (opt) {
> > +       case PR_SET_MM_START_CODE:
> > +       case PR_SET_MM_END_CODE:
> > +
> > +               vm_req_flags = VM_READ | VM_EXEC;
> > +               vm_bad_flags = VM_WRITE | VM_MAYSHARE;
> > +
> > +               if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
> > +                   (vma->vm_flags & vm_bad_flags))
> > +                       goto out;
> 
> Another random thought: given this very regular set of checks you're
> doing, perhaps the flags should be part of a data structure instead,
> just to reduce the size of this routine?
> 
> struct mm_flags {
>   int req_flags;
>   int bad_flags;
> };
> 
> struct mm_flags opt_flags[] = {
> ...
>  { VM_READ | VM_EXEC, VM_WRITE | VM_MAYSHARE }, /* PR_SET_MM_START_CODE */
>  { VM_READ | VM_EXEC, VM_WRITE | VM_MAYSHARE }, /* PR_SET_MM_END_CODE */
> ...
> 
> then do validation before the switch statement all in one place, and
> leave the switch for more programmatic checks?
> 
> -Kees
> 

Nod! I'll update, thanks!

	Cyrill

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-30 18:23               ` Cyrill Gorcunov
@ 2011-11-30 21:06                 ` Cyrill Gorcunov
  2011-12-07 12:27                   ` Cyrill Gorcunov
  0 siblings, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-11-30 21:06 UTC (permalink / raw)
  To: Kees Cook, linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Wed, Nov 30, 2011 at 10:23:10PM +0400, Cyrill Gorcunov wrote:
...
> 
> > > +       switch (opt) {
> > > +       case PR_SET_MM_START_CODE:
> > > +       case PR_SET_MM_END_CODE:
> > > +
> > > +               vm_req_flags = VM_READ | VM_EXEC;
> > > +               vm_bad_flags = VM_WRITE | VM_MAYSHARE;
> > > +
> > > +               if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
> > > +                   (vma->vm_flags & vm_bad_flags))
> > > +                       goto out;
> > 
> > Another random thought: given this very regular set of checks you're
> > doing, perhaps the flags should be part of a data structure instead,
> > just to reduce the size of this routine?
> > 
> > struct mm_flags {
> >   int req_flags;
> >   int bad_flags;
> > };
> > 
> > struct mm_flags opt_flags[] = {
> > ...
> >  { VM_READ | VM_EXEC, VM_WRITE | VM_MAYSHARE }, /* PR_SET_MM_START_CODE */
> >  { VM_READ | VM_EXEC, VM_WRITE | VM_MAYSHARE }, /* PR_SET_MM_END_CODE */
> > ...
> > 
> > then do validation before the switch statement all in one place, and
> > leave the switch for more programmatic checks?
> > 
> > -Kees
> > 
> 
> Nod! I'll update, thanks!
> 

You know Kees, I tried it, and finally I think it's overheaded, so I prefer
to stick with original version (no need to duplicate same data in two differen
memory places as it'll be in case of arrays, and since the VM_ flags are
constant the former code bloats kernel lesser. Thanks anyway!

	Cyrill

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-11-30  6:05     ` Cyrill Gorcunov
@ 2011-12-01  9:54       ` Cyrill Gorcunov
  2011-12-01 15:43         ` Tejun Heo
                           ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-12-01  9:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Tejun Heo,
	Andrew Vagin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Wed, Nov 30, 2011 at 10:05:37AM +0400, Cyrill Gorcunov wrote:
> On Wed, Nov 30, 2011 at 02:00:09PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 29 Nov 2011 23:12:54 +0400
> > Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> > 
> > > From: Pavel Emelyanov <xemul@parallels.com>
> > > 
> > > There is no easy way to make a reverse parent->children chain
> > > from the task status, in turn children->parent provided with "PPid"
> > > field.
> > > 
> > > So instead of walking over all pids in system to figure out what
> > > children the task have -- we add explicit "Children" member to
> > > /proc/<pid>/status since kernel already knows this kind of information
> > > but it was not yet exported.
> > > 
> > > Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> > > Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
> > > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> > 
> > I may be too pessimistic but what amount of overhead will this add to
> > ps -elf/ top ? Assuming an environment 'ps -elf' is called once per a sec,
> > if there are 2000 processes, task_list lock is taken 2000 times by this patch.
> > 
> 
> Hi Kame, good point! Yes, it introduces latency on high loaded systems.
> I must admit I tested this patch on a regular system, where not that much
> processes were launched but technically I think more correct would be to
> switch to children file. I'll tune up the patch. Thanks!
> 
> > Isn't it better to add /proc/<pid>/children file or dir (as task)?
> > 
> > Thanks,
> > -Kame
> > 

What about this one?
---
fs, proc: Introduce the /proc/<pid>/children entry

There is no easy way to make a reverse parent->children chain
from the task status, in turn children->parent provided with "PPid"
field.

So instead of walking over all pids in system to figure out what
children the task have -- we add explicit /proc/<pid>/children entry,
since kernel already knows this kind of information but it was not
yet exported.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/proc/array.c    |   14 ++++++++++++++
 fs/proc/base.c     |    1 +
 fs/proc/internal.h |    3 +++
 3 files changed, 18 insertions(+)

Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -547,3 +547,17 @@ int proc_pid_statm(struct seq_file *m, s
 
 	return 0;
 }
+
+int proc_pid_children(struct seq_file *m, struct pid_namespace *ns,
+		      struct pid *pid, struct task_struct *task)
+{
+	struct task_struct *c;
+
+	read_lock(&tasklist_lock);
+	list_for_each_entry(c, &task->children, sibling)
+		seq_printf(m, " %d", pid_nr_ns(task_pid(c), ns));
+	read_unlock(&tasklist_lock);
+	seq_putc(m, '\n');
+
+	return 0;
+}
Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3204,6 +3204,7 @@ static const struct pid_entry tgid_base_
 	INF("cmdline",    S_IRUGO, proc_pid_cmdline),
 	ONE("stat",       S_IRUGO, proc_tgid_stat),
 	ONE("statm",      S_IRUGO, proc_pid_statm),
+	ONE("children",   S_IRUGO, proc_pid_children),
 	REG("maps",       S_IRUGO, proc_maps_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, proc_numa_maps_operations),
Index: linux-2.6.git/fs/proc/internal.h
===================================================================
--- linux-2.6.git.orig/fs/proc/internal.h
+++ linux-2.6.git/fs/proc/internal.h
@@ -51,6 +51,9 @@ extern int proc_pid_status(struct seq_fi
 				struct pid *pid, struct task_struct *task);
 extern int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
 				struct pid *pid, struct task_struct *task);
+extern int proc_pid_children(struct seq_file *m, struct pid_namespace *ns,
+			     struct pid *pid, struct task_struct *task);
+
 extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);
 
 extern const struct file_operations proc_maps_operations;

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-01  9:54       ` Cyrill Gorcunov
@ 2011-12-01 15:43         ` Tejun Heo
  2011-12-01 15:53           ` Cyrill Gorcunov
  2011-12-01 21:29         ` Andrew Morton
  2011-12-02  0:40         ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 51+ messages in thread
From: Tejun Heo @ 2011-12-01 15:43 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

Hello, Cyrill.

On Thu, Dec 01, 2011 at 01:54:34PM +0400, Cyrill Gorcunov wrote:
> Index: linux-2.6.git/fs/proc/array.c
> ===================================================================
> --- linux-2.6.git.orig/fs/proc/array.c
> +++ linux-2.6.git/fs/proc/array.c
> @@ -547,3 +547,17 @@ int proc_pid_statm(struct seq_file *m, s
>  
>  	return 0;
>  }
> +
> +int proc_pid_children(struct seq_file *m, struct pid_namespace *ns,
> +		      struct pid *pid, struct task_struct *task)
> +{
> +	struct task_struct *c;
> +
> +	read_lock(&tasklist_lock);
> +	list_for_each_entry(c, &task->children, sibling)
> +		seq_printf(m, " %d", pid_nr_ns(task_pid(c), ns));
> +	read_unlock(&tasklist_lock);
> +	seq_putc(m, '\n');
> +
> +	return 0;
> +}

I don't think using non-seekable single seqfile is a good idea here.
It works if the whole list fits in PAGE_SIZE but assuming five digit
pid, that's only ~680 pids.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-01 15:43         ` Tejun Heo
@ 2011-12-01 15:53           ` Cyrill Gorcunov
  2011-12-01 16:07             ` Tejun Heo
  0 siblings, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-12-01 15:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Thu, Dec 01, 2011 at 07:43:57AM -0800, Tejun Heo wrote:
> > +
> > +int proc_pid_children(struct seq_file *m, struct pid_namespace *ns,
> > +		      struct pid *pid, struct task_struct *task)
> > +{
> > +	struct task_struct *c;
> > +
> > +	read_lock(&tasklist_lock);
> > +	list_for_each_entry(c, &task->children, sibling)
> > +		seq_printf(m, " %d", pid_nr_ns(task_pid(c), ns));
> > +	read_unlock(&tasklist_lock);
> > +	seq_putc(m, '\n');
> > +
> > +	return 0;
> > +}
> 
> I don't think using non-seekable single seqfile is a good idea here.
> It works if the whole list fits in PAGE_SIZE but assuming five digit
> pid, that's only ~680 pids.
> 

Hi Tejun, indeed this is not pleasant limit, thanks for the point,
I'll update! But other than this technical detail, I assume there is
no objection from introducing such feature?

	Cyrill

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-01 15:53           ` Cyrill Gorcunov
@ 2011-12-01 16:07             ` Tejun Heo
  0 siblings, 0 replies; 51+ messages in thread
From: Tejun Heo @ 2011-12-01 16:07 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

Hello,

On Thu, Dec 1, 2011 at 7:53 AM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> Hi Tejun, indeed this is not pleasant limit, thanks for the point,
> I'll update! But other than this technical detail, I assume there is
> no objection from introducing such feature?

Nope, I don't have any objection. Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-01  9:54       ` Cyrill Gorcunov
  2011-12-01 15:43         ` Tejun Heo
@ 2011-12-01 21:29         ` Andrew Morton
  2011-12-01 21:38           ` Cyrill Gorcunov
  2011-12-02  0:40         ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-01 21:29 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Thu, 1 Dec 2011 13:54:34 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> +int proc_pid_children(struct seq_file *m, struct pid_namespace *ns,
> +		      struct pid *pid, struct task_struct *task)
> +{
> +	struct task_struct *c;
> +
> +	read_lock(&tasklist_lock);
> +	list_for_each_entry(c, &task->children, sibling)
> +		seq_printf(m, " %d", pid_nr_ns(task_pid(c), ns));
> +	read_unlock(&tasklist_lock);
> +	seq_putc(m, '\n');
> +
> +	return 0;
> +}

That's a potentially very long hold time for tasklist_lock, and
userspace can invoke this at a high frequency.  Not good.

I think this can all be done under rcu_read_lock() with
list_for_each_entry_rcu().


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-01 21:29         ` Andrew Morton
@ 2011-12-01 21:38           ` Cyrill Gorcunov
  0 siblings, 0 replies; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-12-01 21:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, linux-kernel, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Thu, Dec 01, 2011 at 01:29:11PM -0800, Andrew Morton wrote:
> 
> That's a potentially very long hold time for tasklist_lock, and
> userspace can invoke this at a high frequency.  Not good.
> 
> I think this can all be done under rcu_read_lock() with
> list_for_each_entry_rcu().
> 

Yeah, good point, thanks Andrew!

	Cyrill

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat
  2011-11-29 20:06   ` Kees Cook
@ 2011-12-02  0:24     ` Alexey Dobriyan
  2011-12-02  7:28       ` Cyrill Gorcunov
  0 siblings, 1 reply; 51+ messages in thread
From: Alexey Dobriyan @ 2011-12-02  0:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: Cyrill Gorcunov, linux-kernel, Andrew Morton, Tejun Heo,
	Andrew Vagin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Tue, Nov 29, 2011 at 12:06:27PM -0800, Kees Cook wrote:
> On Tue, Nov 29, 2011 at 11:12:53PM +0400, Cyrill Gorcunov wrote:
> > It helps to dump and restore this mm_struct members.
> > 
> > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> > ---
> >  fs/proc/array.c |    7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6.git/fs/proc/array.c
> > ===================================================================
> > --- linux-2.6.git.orig/fs/proc/array.c
> > +++ linux-2.6.git/fs/proc/array.c
> > @@ -464,7 +464,7 @@ static int do_task_stat(struct seq_file
> >  
> >  	seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
> >  %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
> > -%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
> > +%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
> >  		pid_nr_ns(pid, ns),
> >  		tcomm,
> >  		state,
> > @@ -511,7 +511,10 @@ static int do_task_stat(struct seq_file
> >  		task->policy,
> >  		(unsigned long long)delayacct_blkio_ticks(task),
> >  		cputime_to_clock_t(gtime),
> > -		cputime_to_clock_t(cgtime));
> > +		cputime_to_clock_t(cgtime),
> > +		mm ? (permitted ? mm->start_data : 1) : 0,
> > +		mm ? (permitted ? mm->end_data : 1) : 0,
> > +		mm ? (permitted ? mm->start_brk : 1) : 0);
> >  	if (mm)
> >  		mmput(mm);
> >  	return 0;
> 
> Thanks for using "permitted" here. :)

And these are new fields, so "1" hack is unnecessary.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-01  9:54       ` Cyrill Gorcunov
  2011-12-01 15:43         ` Tejun Heo
  2011-12-01 21:29         ` Andrew Morton
@ 2011-12-02  0:40         ` KAMEZAWA Hiroyuki
  2011-12-02 12:41           ` Pedro Alves
  2 siblings, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-02  0:40 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Thu, 1 Dec 2011 13:54:34 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> On Wed, Nov 30, 2011 at 10:05:37AM +0400, Cyrill Gorcunov wrote:
> > On Wed, Nov 30, 2011 at 02:00:09PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 29 Nov 2011 23:12:54 +0400
> > > Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> > > 
> > > > From: Pavel Emelyanov <xemul@parallels.com>
> > > > 
> > > > There is no easy way to make a reverse parent->children chain
> > > > from the task status, in turn children->parent provided with "PPid"
> > > > field.
> > > > 
> > > > So instead of walking over all pids in system to figure out what
> > > > children the task have -- we add explicit "Children" member to
> > > > /proc/<pid>/status since kernel already knows this kind of information
> > > > but it was not yet exported.
> > > > 
> > > > Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> > > > Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
> > > > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> > > 
> > > I may be too pessimistic but what amount of overhead will this add to
> > > ps -elf/ top ? Assuming an environment 'ps -elf' is called once per a sec,
> > > if there are 2000 processes, task_list lock is taken 2000 times by this patch.
> > > 
> > 
> > Hi Kame, good point! Yes, it introduces latency on high loaded systems.
> > I must admit I tested this patch on a regular system, where not that much
> > processes were launched but technically I think more correct would be to
> > switch to children file. I'll tune up the patch. Thanks!
> > 
> > > Isn't it better to add /proc/<pid>/children file or dir (as task)?
> > > 
> > > Thanks,
> > > -Kame
> > > 
> 
> What about this one?
> ---
> fs, proc: Introduce the /proc/<pid>/children entry
> 
> There is no easy way to make a reverse parent->children chain
> from the task status, in turn children->parent provided with "PPid"
> field.
> 
> So instead of walking over all pids in system to figure out what
> children the task have -- we add explicit /proc/<pid>/children entry,
> since kernel already knows this kind of information but it was not
> yet exported.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Serge Hallyn <serge.hallyn@canonical.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Yes, I like /children file. other points seems to be pointed out by other
reviewers. 

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat
  2011-12-02  0:24     ` Alexey Dobriyan
@ 2011-12-02  7:28       ` Cyrill Gorcunov
  2011-12-02 19:23         ` Kees Cook
  0 siblings, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-12-02  7:28 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Kees Cook, linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Fri, Dec 02, 2011 at 03:24:06AM +0300, Alexey Dobriyan wrote:
> > > +		cputime_to_clock_t(cgtime),
> > > +		mm ? (permitted ? mm->start_data : 1) : 0,
> > > +		mm ? (permitted ? mm->end_data : 1) : 0,
> > > +		mm ? (permitted ? mm->start_brk : 1) : 0);
> > >  	if (mm)
> > >  		mmput(mm);
> > >  	return 0;
> > 
> > Thanks for using "permitted" here. :)
> 
> And these are new fields, so "1" hack is unnecessary.
> 

I perhaps miss some history here, you mean just print 0
instead of 1, right?

	Cyrill

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02  0:40         ` KAMEZAWA Hiroyuki
@ 2011-12-02 12:41           ` Pedro Alves
  2011-12-02 12:43             ` Pavel Emelyanov
  0 siblings, 1 reply; 51+ messages in thread
From: Pedro Alves @ 2011-12-02 12:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Cyrill Gorcunov, linux-kernel, Andrew Morton, Tejun Heo,
	Andrew Vagin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Friday 02 December 2011 00:40:10, KAMEZAWA Hiroyuki wrote:
> > What about this one?
> > ---
> > fs, proc: Introduce the /proc/<pid>/children entry
> > 
> > There is no easy way to make a reverse parent->children chain
> > from the task status, in turn children->parent provided with "PPid"
> > field.
> > 
> > So instead of walking over all pids in system to figure out what
> > children the task have -- we add explicit /proc/<pid>/children entry,
> > since kernel already knows this kind of information but it was not
> > yet exported.
> > 
> > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> > Cc: Pavel Emelyanov <xemul@parallels.com>
> > Cc: Serge Hallyn <serge.hallyn@canonical.com>
> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Yes, I like /children file. other points seems to be pointed out by other
> reviewers. 

Any reason this is a file instead of a directory like /proc/PID/task/ ?

$ sudo ls /proc/8167/task/
8167  854  855  856  857  858  859
$ sudo ls /proc/8167/task/855/
attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall

Much easier to follow the chain from the command line this way.

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 12:41           ` Pedro Alves
@ 2011-12-02 12:43             ` Pavel Emelyanov
  2011-12-02 12:45               ` Cyrill Gorcunov
  2011-12-02 12:58               ` Pedro Alves
  0 siblings, 2 replies; 51+ messages in thread
From: Pavel Emelyanov @ 2011-12-02 12:43 UTC (permalink / raw)
  To: Pedro Alves
  Cc: KAMEZAWA Hiroyuki, Cyrill Gorcunov, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On 12/02/2011 04:41 PM, Pedro Alves wrote:
> On Friday 02 December 2011 00:40:10, KAMEZAWA Hiroyuki wrote:
>>> What about this one?
>>> ---
>>> fs, proc: Introduce the /proc/<pid>/children entry
>>>
>>> There is no easy way to make a reverse parent->children chain
>>> from the task status, in turn children->parent provided with "PPid"
>>> field.
>>>
>>> So instead of walking over all pids in system to figure out what
>>> children the task have -- we add explicit /proc/<pid>/children entry,
>>> since kernel already knows this kind of information but it was not
>>> yet exported.
>>>
>>> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
>>> Cc: Pavel Emelyanov <xemul@parallels.com>
>>> Cc: Serge Hallyn <serge.hallyn@canonical.com>
>>> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> Yes, I like /children file. other points seems to be pointed out by other
>> reviewers. 
> 
> Any reason this is a file instead of a directory like /proc/PID/task/ ?
> 
> $ sudo ls /proc/8167/task/
> 8167  854  855  856  857  858  859
> $ sudo ls /proc/8167/task/855/
> attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
> auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
> cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall
> 
> Much easier to follow the chain from the command line this way.

What do you propose to put into these directories? Another directories named with
children pid-s?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 12:43             ` Pavel Emelyanov
@ 2011-12-02 12:45               ` Cyrill Gorcunov
  2011-12-02 13:10                 ` Pedro Alves
  2011-12-02 12:58               ` Pedro Alves
  1 sibling, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-12-02 12:45 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Pedro Alves, KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On Fri, Dec 02, 2011 at 04:43:10PM +0400, Pavel Emelyanov wrote:
...
> >>
> >> Yes, I like /children file. other points seems to be pointed out by other
> >> reviewers. 
> > 
> > Any reason this is a file instead of a directory like /proc/PID/task/ ?
> > 
> > $ sudo ls /proc/8167/task/
> > 8167  854  855  856  857  858  859
> > $ sudo ls /proc/8167/task/855/
> > attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
> > auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
> > cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall
> > 
> > Much easier to follow the chain from the command line this way.
> 
> What do you propose to put into these directories? Another directories named with
> children pid-s?
> 

Yes, I suppose additional directory/links and whatever would be just
noneeded overhead.

	Cyrill

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 12:43             ` Pavel Emelyanov
  2011-12-02 12:45               ` Cyrill Gorcunov
@ 2011-12-02 12:58               ` Pedro Alves
  2011-12-02 13:16                 ` Pavel Emelyanov
  1 sibling, 1 reply; 51+ messages in thread
From: Pedro Alves @ 2011-12-02 12:58 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: KAMEZAWA Hiroyuki, Cyrill Gorcunov, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On Friday 02 December 2011 12:43:10, Pavel Emelyanov wrote:

> >> Yes, I like /children file. other points seems to be pointed out by other
> >> reviewers. 
> > 
> > Any reason this is a file instead of a directory like /proc/PID/task/ ?
> > 
> > $ sudo ls /proc/8167/task/
> > 8167  854  855  856  857  858  859
> > $ sudo ls /proc/8167/task/855/
> > attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
> > auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
> > cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall
> > 
> > Much easier to follow the chain from the command line this way.
> 
> What do you propose to put into these directories? Another directories named with
> children pid-s?

Yes, just like the task/ dir gives you directories named with the
processes's thread ids.  Opening /proc/PID/children/PID-CHILD1/ would get
you the same as opening /proc/PID-CHILD1/.  Just like
opening /proc/PID/task/PID-CHILD1/ gets you (almost) the same as opening
/proc/PID-CHILD1/.

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 12:45               ` Cyrill Gorcunov
@ 2011-12-02 13:10                 ` Pedro Alves
  2011-12-02 13:40                   ` Pedro Alves
  0 siblings, 1 reply; 51+ messages in thread
From: Pedro Alves @ 2011-12-02 13:10 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Emelyanov, KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On Friday 02 December 2011 12:45:51, Cyrill Gorcunov wrote:
> On Fri, Dec 02, 2011 at 04:43:10PM +0400, Pavel Emelyanov wrote:
> ...
> > >>
> > >> Yes, I like /children file. other points seems to be pointed out by other
> > >> reviewers. 
> > > 
> > > Any reason this is a file instead of a directory like /proc/PID/task/ ?
> > > 
> > > $ sudo ls /proc/8167/task/
> > > 8167  854  855  856  857  858  859
> > > $ sudo ls /proc/8167/task/855/
> > > attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
> > > auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
> > > cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall
> > > 
> > > Much easier to follow the chain from the command line this way.
> > 
> > What do you propose to put into these directories? Another directories named with
> > children pid-s?
> > 
> 
> Yes, I suppose additional directory/links and whatever would be just
> noneeded overhead.

/proc/8167/task/ being a directory could also be claimed
"noneeded overhead", yet it exists, and is very useful (I use
it a lot).  Why diverge instead of being consistent?

(Note that listing /proc only shows thread group ids, but
you can still open /proc/THREAD-ID/, so a /proc/PID/task file
with lists of pids would have been "good enough".)

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 12:58               ` Pedro Alves
@ 2011-12-02 13:16                 ` Pavel Emelyanov
  2011-12-02 13:44                   ` Pedro Alves
  0 siblings, 1 reply; 51+ messages in thread
From: Pavel Emelyanov @ 2011-12-02 13:16 UTC (permalink / raw)
  To: Pedro Alves
  Cc: KAMEZAWA Hiroyuki, Cyrill Gorcunov, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On 12/02/2011 04:58 PM, Pedro Alves wrote:
> On Friday 02 December 2011 12:43:10, Pavel Emelyanov wrote:
> 
>>>> Yes, I like /children file. other points seems to be pointed out by other
>>>> reviewers. 
>>>
>>> Any reason this is a file instead of a directory like /proc/PID/task/ ?
>>>
>>> $ sudo ls /proc/8167/task/
>>> 8167  854  855  856  857  858  859
>>> $ sudo ls /proc/8167/task/855/
>>> attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
>>> auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
>>> cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall
>>>
>>> Much easier to follow the chain from the command line this way.
>>
>> What do you propose to put into these directories? Another directories named with
>> children pid-s?
> 
> Yes, just like the task/ dir gives you directories named with the
> processes's thread ids.  Opening /proc/PID/children/PID-CHILD1/ would get
> you the same as opening /proc/PID-CHILD1/.  Just like
> opening /proc/PID/task/PID-CHILD1/ gets you (almost) the same as opening
> /proc/PID-CHILD1/.

You cannot make the dentry named /proc/<pid1>/children/<pid2> be a hardlink on
the /proc/<pid2>. Thus you have to make arbitrary amount of inodes to point to
a single task. This brings unnecessary complexity and memory usage (by dentries
and proc inodes).

I'd accept the symbolic links, but how would they look like? Like this:
   # ls -l /proc/123/children
            234 -> ../../234
?

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 13:10                 ` Pedro Alves
@ 2011-12-02 13:40                   ` Pedro Alves
  0 siblings, 0 replies; 51+ messages in thread
From: Pedro Alves @ 2011-12-02 13:40 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Emelyanov, KAMEZAWA Hiroyuki, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On Friday 02 December 2011 13:10:59, Pedro Alves wrote:
> On Friday 02 December 2011 12:45:51, Cyrill Gorcunov wrote:
> > On Fri, Dec 02, 2011 at 04:43:10PM +0400, Pavel Emelyanov wrote:
> > ...
> > > >>
> > > >> Yes, I like /children file. other points seems to be pointed out by other
> > > >> reviewers. 
> > > > 
> > > > Any reason this is a file instead of a directory like /proc/PID/task/ ?
> > > > 
> > > > $ sudo ls /proc/8167/task/
> > > > 8167  854  855  856  857  858  859
> > > > $ sudo ls /proc/8167/task/855/
> > > > attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
> > > > auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
> > > > cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall
> > > > 
> > > > Much easier to follow the chain from the command line this way.
> > > 
> > > What do you propose to put into these directories? Another directories named with
> > > children pid-s?
> > > 
> > 
> > Yes, I suppose additional directory/links and whatever would be just
> > noneeded overhead.
> 
> /proc/8167/task/ being a directory could also be claimed
> "noneeded overhead", yet it exists, and is very useful (I use
> it a lot).  Why diverge instead of being consistent?

Guess I should give an example.  Here's one.  While debugging
gdb stuff, I often do:

$ cat /proc/8167/task/*/status | grep State
State:  S (sleeping)
State:  S (sleeping)
State:  S (sleeping)
State:  S (sleeping)
State:  S (sleeping)
State:  S (sleeping)
State:  S (sleeping)

If children were a directory (or "child", for
consistency with singular "task"), you could do child
things in the same natural way, like for instance:

$ ls -als /proc/8167/child/*/exe 

It's like interactive pstree...

> (Note that listing /proc only shows thread group ids, but
> you can still open /proc/THREAD-ID/, so a /proc/PID/task file
> with lists of pids would have been "good enough".)

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 13:16                 ` Pavel Emelyanov
@ 2011-12-02 13:44                   ` Pedro Alves
  2011-12-02 13:52                     ` Pavel Emelyanov
  0 siblings, 1 reply; 51+ messages in thread
From: Pedro Alves @ 2011-12-02 13:44 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: KAMEZAWA Hiroyuki, Cyrill Gorcunov, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On Friday 02 December 2011 13:16:52, Pavel Emelyanov wrote:
> On 12/02/2011 04:58 PM, Pedro Alves wrote:
> > On Friday 02 December 2011 12:43:10, Pavel Emelyanov wrote:
> > 
> >>>> Yes, I like /children file. other points seems to be pointed out by other
> >>>> reviewers. 
> >>>
> >>> Any reason this is a file instead of a directory like /proc/PID/task/ ?
> >>>
> >>> $ sudo ls /proc/8167/task/
> >>> 8167  854  855  856  857  858  859
> >>> $ sudo ls /proc/8167/task/855/
> >>> attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
> >>> auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
> >>> cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall
> >>>
> >>> Much easier to follow the chain from the command line this way.
> >>
> >> What do you propose to put into these directories? Another directories named with
> >> children pid-s?
> > 
> > Yes, just like the task/ dir gives you directories named with the
> > processes's thread ids.  Opening /proc/PID/children/PID-CHILD1/ would get
> > you the same as opening /proc/PID-CHILD1/.  Just like
> > opening /proc/PID/task/PID-CHILD1/ gets you (almost) the same as opening
> > /proc/PID-CHILD1/.
> 
> You cannot make the dentry named /proc/<pid1>/children/<pid2> be a hardlink on
> the /proc/<pid2>. Thus you have to make arbitrary amount of inodes to point to
> a single task. This brings unnecessary complexity and memory usage (by dentries
> and proc inodes).

How is this different from the _already existing_ /proc/<pid1>/task/ directory?
I can imagine that 98% of the code would be shared even?  It's "just" a matter of
listing thread group children (child/), instead of clone children (task/),
isn't it?

They are not symbolic links under task/.  /proc/<pid1>/task/<pid2>/ does not
have a task/ subdir, only /proc/<pid1>/ does, I guess to avoid the memory usage
issue you raise.

> I'd accept the symbolic links, but how would they look like? Like this:
>    # ls -l /proc/123/children
>             234 -> ../../234
> ?

That'd work for me...  but really, why not reuse tasks/'s code and
behave the same?

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 13:44                   ` Pedro Alves
@ 2011-12-02 13:52                     ` Pavel Emelyanov
  2011-12-02 14:00                       ` Pedro Alves
  0 siblings, 1 reply; 51+ messages in thread
From: Pavel Emelyanov @ 2011-12-02 13:52 UTC (permalink / raw)
  To: Pedro Alves
  Cc: KAMEZAWA Hiroyuki, Cyrill Gorcunov, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On 12/02/2011 05:44 PM, Pedro Alves wrote:
> On Friday 02 December 2011 13:16:52, Pavel Emelyanov wrote:
>> On 12/02/2011 04:58 PM, Pedro Alves wrote:
>>> On Friday 02 December 2011 12:43:10, Pavel Emelyanov wrote:
>>>
>>>>>> Yes, I like /children file. other points seems to be pointed out by other
>>>>>> reviewers. 
>>>>>
>>>>> Any reason this is a file instead of a directory like /proc/PID/task/ ?
>>>>>
>>>>> $ sudo ls /proc/8167/task/
>>>>> 8167  854  855  856  857  858  859
>>>>> $ sudo ls /proc/8167/task/855/
>>>>> attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
>>>>> auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
>>>>> cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall
>>>>>
>>>>> Much easier to follow the chain from the command line this way.
>>>>
>>>> What do you propose to put into these directories? Another directories named with
>>>> children pid-s?
>>>
>>> Yes, just like the task/ dir gives you directories named with the
>>> processes's thread ids.  Opening /proc/PID/children/PID-CHILD1/ would get
>>> you the same as opening /proc/PID-CHILD1/.  Just like
>>> opening /proc/PID/task/PID-CHILD1/ gets you (almost) the same as opening
>>> /proc/PID-CHILD1/.
>>
>> You cannot make the dentry named /proc/<pid1>/children/<pid2> be a hardlink on
>> the /proc/<pid2>. Thus you have to make arbitrary amount of inodes to point to
>> a single task. This brings unnecessary complexity and memory usage (by dentries
>> and proc inodes).
> 
> How is this different from the _already existing_ /proc/<pid1>/task/ directory?

Those living in /proc/<pid1>/task do not live in /proc. At all. This explains
everything below.

> I can imagine that 98% of the code would be shared even?  It's "just" a matter of
> listing thread group children (child/), instead of clone children (task/),
> isn't it?
> 
> They are not symbolic links under task/.  /proc/<pid1>/task/<pid2>/ does not
> have a task/ subdir, only /proc/<pid1>/ does, I guess to avoid the memory usage
> issue you raise.
> 
>> I'd accept the symbolic links, but how would they look like? Like this:
>>    # ls -l /proc/123/children
>>             234 -> ../../234
>> ?
> 
> That'd work for me...  but really, why not reuse tasks/'s code and
> behave the same?

Why wouldn't /proc/<pid>/children work for you? This is as simple as

  # ls -l $(cat /proc/<pid>/children)

:)

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 13:52                     ` Pavel Emelyanov
@ 2011-12-02 14:00                       ` Pedro Alves
  2011-12-02 14:17                         ` Pavel Emelyanov
  0 siblings, 1 reply; 51+ messages in thread
From: Pedro Alves @ 2011-12-02 14:00 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: KAMEZAWA Hiroyuki, Cyrill Gorcunov, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On Friday 02 December 2011 13:52:43, Pavel Emelyanov wrote:
> On 12/02/2011 05:44 PM, Pedro Alves wrote:
> > On Friday 02 December 2011 13:16:52, Pavel Emelyanov wrote:
> >> On 12/02/2011 04:58 PM, Pedro Alves wrote:
> >>> On Friday 02 December 2011 12:43:10, Pavel Emelyanov wrote:
> >>>
> >>>>>> Yes, I like /children file. other points seems to be pointed out by other
> >>>>>> reviewers. 
> >>>>>
> >>>>> Any reason this is a file instead of a directory like /proc/PID/task/ ?
> >>>>>
> >>>>> $ sudo ls /proc/8167/task/
> >>>>> 8167  854  855  856  857  858  859
> >>>>> $ sudo ls /proc/8167/task/855/
> >>>>> attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
> >>>>> auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
> >>>>> cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall
> >>>>>
> >>>>> Much easier to follow the chain from the command line this way.
> >>>>
> >>>> What do you propose to put into these directories? Another directories named with
> >>>> children pid-s?
> >>>
> >>> Yes, just like the task/ dir gives you directories named with the
> >>> processes's thread ids.  Opening /proc/PID/children/PID-CHILD1/ would get
> >>> you the same as opening /proc/PID-CHILD1/.  Just like
> >>> opening /proc/PID/task/PID-CHILD1/ gets you (almost) the same as opening
> >>> /proc/PID-CHILD1/.
> >>
> >> You cannot make the dentry named /proc/<pid1>/children/<pid2> be a hardlink on
> >> the /proc/<pid2>. Thus you have to make arbitrary amount of inodes to point to
> >> a single task. This brings unnecessary complexity and memory usage (by dentries
> >> and proc inodes).
> > 
> > How is this different from the _already existing_ /proc/<pid1>/task/ directory?
> 
> Those living in /proc/<pid1>/task do not live in /proc. At all. This explains
> everything below.

Well, except they do, and it doesn't.  The're not visible when
listing /proc/, but they're there.  Try it.

$ls /proc/ | grep 854

(empty)

$ ls /proc/8167/task/854
attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall

$ls /proc/854/
attr       cgroup      comm             cwd      fd      latency   maps       mounts      numa_maps  oom_score_adj  root       sessionid  stat    syscall
autogroup  clear_refs  coredump_filter  environ  fdinfo  limits    mem        mountstats  oom_adj    pagemap        sched      smaps      statm   task
auxv       cmdline     cpuset           exe      io      loginuid  mountinfo  net         oom_score  personality    schedstat  stack      status  wchan

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 14:00                       ` Pedro Alves
@ 2011-12-02 14:17                         ` Pavel Emelyanov
  2011-12-02 14:25                           ` Pedro Alves
  0 siblings, 1 reply; 51+ messages in thread
From: Pavel Emelyanov @ 2011-12-02 14:17 UTC (permalink / raw)
  To: Pedro Alves
  Cc: KAMEZAWA Hiroyuki, Cyrill Gorcunov, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

>>>>> Yes, just like the task/ dir gives you directories named with the
>>>>> processes's thread ids.  Opening /proc/PID/children/PID-CHILD1/ would get
>>>>> you the same as opening /proc/PID-CHILD1/.  Just like
>>>>> opening /proc/PID/task/PID-CHILD1/ gets you (almost) the same as opening
>>>>> /proc/PID-CHILD1/.
>>>>
>>>> You cannot make the dentry named /proc/<pid1>/children/<pid2> be a hardlink on
>>>> the /proc/<pid2>. Thus you have to make arbitrary amount of inodes to point to
>>>> a single task. This brings unnecessary complexity and memory usage (by dentries
>>>> and proc inodes).
>>>
>>> How is this different from the _already existing_ /proc/<pid1>/task/ directory?
>>
>> Those living in /proc/<pid1>/task do not live in /proc. At all. This explains
>> everything below.
> 
> Well, except they do, and it doesn't.  The're not visible when
> listing /proc/, but they're there.  Try it.
> 
> $ls /proc/ | grep 854
> 
> (empty)
> 
> $ ls /proc/8167/task/854
> attr    clear_refs  cpuset   exe     io       loginuid  mountinfo  oom_adj        pagemap      sched      smaps  statm    wchan
> auxv    cmdline     cwd      fd      latency  maps      mounts     oom_score      personality  schedstat  stack  status
> cgroup  comm        environ  fdinfo  limits   mem       numa_maps  oom_score_adj  root         sessionid  stat   syscall
> 
> $ls /proc/854/
> attr       cgroup      comm             cwd      fd      latency   maps       mounts      numa_maps  oom_score_adj  root       sessionid  stat    syscall
> autogroup  clear_refs  coredump_filter  environ  fdinfo  limits    mem        mountstats  oom_adj    pagemap        sched      smaps      statm   task
> auxv       cmdline     cpuset           exe      io      loginuid  mountinfo  net         oom_score  personality    schedstat  stack      status  wchan
> 

O_O   OK, I was wrong, they do live there. But I consider this as bug.

Anyway -- my concern about unneeded memory overhead still stands.
Even a simple find /proc will result in smth like

/proc/1
/proc/1/children/2
/proc/1/children/2/children/4
/proc/1/children/3
/proc/1/children/3/children/5
/proc/2
/proc/2/children/4
/proc/3
/proc/3/children/5
/proc/4
/proc/5

Instead of

/proc/1
/proc/2
/proc/3
/proc/4
/proc/5

I.e. each task will be shown multiple times, which is not very fun, but memory exhaustive from my POV.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 14:17                         ` Pavel Emelyanov
@ 2011-12-02 14:25                           ` Pedro Alves
  2011-12-02 14:37                             ` Pavel Emelyanov
  0 siblings, 1 reply; 51+ messages in thread
From: Pedro Alves @ 2011-12-02 14:25 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: KAMEZAWA Hiroyuki, Cyrill Gorcunov, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On Friday 02 December 2011 14:17:08, Pavel Emelyanov wrote:

> O_O   OK, I was wrong, they do live there. But I consider this as bug.

You can't change that.  It'd break current gdb at least.

> Anyway -- my concern about unneeded memory overhead still stands.
> Even a simple find /proc will result in smth like
> 
> /proc/1
> /proc/1/children/2
> /proc/1/children/2/children/4
> /proc/1/children/2/children/4
> /proc/1/children/3
> /proc/1/children/3/children/5
> /proc/2
> /proc/2/children/4
> /proc/3
> /proc/3/children/5
> /proc/4
> /proc/5
> 
> Instead of
> 
> /proc/1
> /proc/2
> /proc/3
> /proc/4
> /proc/5
> 
> I.e. each task will be shown multiple times, which is not very fun, but memory exhaustive from my POV.

Now that is a good argument against hard linking.  But not if you make
the entries under children/ symlinks.  Then find doesn't recurse.  And
then 

$ find -L /proc/PID/

does recurse and give you the whole tree.  Which I'd say is
actually useful...

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 14:25                           ` Pedro Alves
@ 2011-12-02 14:37                             ` Pavel Emelyanov
  2011-12-02 14:45                               ` Pedro Alves
  0 siblings, 1 reply; 51+ messages in thread
From: Pavel Emelyanov @ 2011-12-02 14:37 UTC (permalink / raw)
  To: Pedro Alves
  Cc: KAMEZAWA Hiroyuki, Cyrill Gorcunov, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On 12/02/2011 06:25 PM, Pedro Alves wrote:
> On Friday 02 December 2011 14:17:08, Pavel Emelyanov wrote:
> 
>> O_O   OK, I was wrong, they do live there. But I consider this as bug.
> 
> You can't change that.  It'd break current gdb at least.

OMG!

>> I.e. each task will be shown multiple times, which is not very fun, but memory exhaustive from my POV.
> 
> Now that is a good argument against hard linking.  But not if you make
> the entries under children/ symlinks.  Then find doesn't recurse.  And
> then 
> 
> $ find -L /proc/PID/
> 
> does recurse and give you the whole tree.  Which I'd say is
> actually useful...

It is useful, but the /proc/pid/children file solves the same problem in a much
more simple way. The memory usage by proc (one file vs one dir and a set of files)
is less and time to lookup a child is also less (read + lookup vs readdir + lookup 
(symlink itself) + lookup (symlink resolve)).

Yes, it doesn't allow you to have fun with find, but frankly, do you really need
this? Even if we're talking about gdb -- reading /proc/pid/children is not harder
and not easier than readdir-ing it.

IOW - what's the real benefit of a directory with symlinks against a file except
for a fun?

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status
  2011-12-02 14:37                             ` Pavel Emelyanov
@ 2011-12-02 14:45                               ` Pedro Alves
  0 siblings, 0 replies; 51+ messages in thread
From: Pedro Alves @ 2011-12-02 14:45 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: KAMEZAWA Hiroyuki, Cyrill Gorcunov, linux-kernel, Andrew Morton,
	Tejun Heo, Andrew Vagin, Serge Hallyn, Vasiliy Kulikov

On Friday 02 December 2011 14:37:10, Pavel Emelyanov wrote:
> On 12/02/2011 06:25 PM, Pedro Alves wrote:
> > On Friday 02 December 2011 14:17:08, Pavel Emelyanov wrote:
> > 
> >> O_O   OK, I was wrong, they do live there. But I consider this as bug.
> > 
> > You can't change that.  It'd break current gdb at least.
> 
> OMG!

You know, NPTL didn't exist on earlier kernels.  This preserved
backwards compatibility.

> >> I.e. each task will be shown multiple times, which is not very fun, but memory exhaustive from my POV.
> > 
> > Now that is a good argument against hard linking.  But not if you make
> > the entries under children/ symlinks.  Then find doesn't recurse.  And
> > then 
> > 
> > $ find -L /proc/PID/
> > 
> > does recurse and give you the whole tree.  Which I'd say is
> > actually useful...
> 
> It is useful, but the /proc/pid/children file solves the same problem in a much
> more simple way. The memory usage by proc (one file vs one dir and a set of files)
> is less and time to lookup a child is also less (read + lookup vs readdir + lookup 
> (symlink itself) + lookup (symlink resolve)).
> 
> Yes, it doesn't allow you to have fun with find, but frankly, do you really need
> this? Even if we're talking about gdb -- reading /proc/pid/children is not harder
> and not easier than readdir-ing it.
> 
> IOW - what's the real benefit of a directory with symlinks against a file except
> for a fun?

As I said on the first message, it's easier on the command line, likely for
quick scripting too.  And for consistency.  gdb or whatever other software
can of course do whatever programatically.  But if I can't persuade you guys
that's a good thing, fine.

-- 
Pedro Alves

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat
  2011-12-02  7:28       ` Cyrill Gorcunov
@ 2011-12-02 19:23         ` Kees Cook
  2011-12-02 19:28           ` Cyrill Gorcunov
  0 siblings, 1 reply; 51+ messages in thread
From: Kees Cook @ 2011-12-02 19:23 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Alexey Dobriyan, linux-kernel, Andrew Morton, Tejun Heo,
	Andrew Vagin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Thu, Dec 1, 2011 at 11:28 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> On Fri, Dec 02, 2011 at 03:24:06AM +0300, Alexey Dobriyan wrote:
>> > > +         cputime_to_clock_t(cgtime),
>> > > +         mm ? (permitted ? mm->start_data : 1) : 0,
>> > > +         mm ? (permitted ? mm->end_data : 1) : 0,
>> > > +         mm ? (permitted ? mm->start_brk : 1) : 0);
>> > >   if (mm)
>> > >           mmput(mm);
>> > >   return 0;
>> >
>> > Thanks for using "permitted" here. :)
>>
>> And these are new fields, so "1" hack is unnecessary.
>
> I perhaps miss some history here, you mean just print 0
> instead of 1, right?

The history is that when "0" is seen for start_code, tools like "ps"
assumed the process was a kernel thread, so "1" was used as a hack to
not break userspace reporting. So yeah, since this is a new field,
it's fine to just do:

+         (mm && permitted) ? mm->start_data : 0,
+         (mm && permitted) ? mm->end_data : 0,
+         (mm && permitted) ? mm->start_brk : 0,

-Kees

-- 
Kees Cook
ChromeOS Security

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat
  2011-12-02 19:23         ` Kees Cook
@ 2011-12-02 19:28           ` Cyrill Gorcunov
  0 siblings, 0 replies; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-12-02 19:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: Alexey Dobriyan, linux-kernel, Andrew Morton, Tejun Heo,
	Andrew Vagin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov

On Fri, Dec 02, 2011 at 11:23:32AM -0800, Kees Cook wrote:
> >
> > I perhaps miss some history here, you mean just print 0
> > instead of 1, right?
> 
> The history is that when "0" is seen for start_code, tools like "ps"
> assumed the process was a kernel thread, so "1" was used as a hack to
> not break userspace reporting. So yeah, since this is a new field,
> it's fine to just do:
> 
> +         (mm && permitted) ? mm->start_data : 0,
> +         (mm && permitted) ? mm->end_data : 0,
> +         (mm && permitted) ? mm->start_brk : 0,
> 

Yup, thanks a lot Kees, I'll update.

	Cyrill

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-11-30 21:06                 ` Cyrill Gorcunov
@ 2011-12-07 12:27                   ` Cyrill Gorcunov
  2011-12-07 22:43                     ` Andrew Morton
  0 siblings, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-12-07 12:27 UTC (permalink / raw)
  To: Kees Cook, linux-kernel, Andrew Morton, Tejun Heo, Andrew Vagin,
	Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov
  Cc: KAMEZAWA Hiroyuki

On Thu, Dec 01, 2011 at 01:06:22AM +0400, Cyrill Gorcunov wrote:
...
> 
> You know Kees, I tried it, and finally I think it's overheaded, so I prefer
> to stick with original version (no need to duplicate same data in two differen
> memory places as it'll be in case of arrays, and since the VM_ flags are
> constant the former code bloats kernel lesser. Thanks anyway!
> 

Ping. Kees, Andrew, are there some other objections I'm not yet addressed?
I've updated changelog a bit more. Please review.

(Kame CC'ed)

	Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: [PATCH] prctl: Add PR_SET_MM codes to tune up mm_struct entires v2

At process of task restoration we need a way to tune up
a few members of mm_struct structure such as start_code,
end_code, start_data, end_data, start_stack, start_brk, brk.

While most of them have a statistical nature (their values
are involved into calculation of /proc/<pid>/statm output)
the start_brk and brk values are used to compute an allowed
size of program data segment expansion. Which means an arbitrary
changes of this value might be a bit dangerous operation.

To restrict access to this facility the following requirements
applied to prctl users:

 - The process has to have CAP_SYS_ADMIN capability granted.
 - For all opcodes except start_brk/brk members an appropriate
   VMA area must be existing and should fit certain VMA flags,
   such as:
   - code segment must be executable but not writable;
   - data segment must not be executable.

start_brk/brk values must not intersect with data segment
and must not exceed RLIMIT_DATA resource limit.

Still the main guard is CAP_SYS_ADMIN capability check.

v2:
 - Add a check for vma start address, testing for vma ending
   address is not enough. From Kees Cook.

 - Add some sanity tests for assigned addresses.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Kees Cook <keescook@chromium.org>
---
 include/linux/prctl.h |   12 +++++
 kernel/sys.c          |  118 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 130 insertions(+)

Index: linux-2.6.git/include/linux/prctl.h
===================================================================
--- linux-2.6.git.orig/include/linux/prctl.h
+++ linux-2.6.git/include/linux/prctl.h
@@ -102,4 +102,16 @@
 
 #define PR_MCE_KILL_GET 34
 
+/*
+ * Tune up process memory map specifics.
+ */
+#define PR_SET_MM		35
+# define PR_SET_MM_START_CODE		1
+# define PR_SET_MM_END_CODE		2
+# define PR_SET_MM_START_DATA		3
+# define PR_SET_MM_END_DATA		4
+# define PR_SET_MM_START_STACK		5
+# define PR_SET_MM_START_BRK		6
+# define PR_SET_MM_BRK			7
+
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6.git/kernel/sys.c
===================================================================
--- linux-2.6.git.orig/kernel/sys.c
+++ linux-2.6.git/kernel/sys.c
@@ -1692,6 +1692,118 @@ SYSCALL_DEFINE1(umask, int, mask)
 	return mask;
 }
 
+static int prctl_set_mm(int opt, unsigned long addr)
+{
+	unsigned long rlim = rlimit(RLIMIT_DATA);
+	unsigned long vm_req_flags;
+	unsigned long vm_bad_flags;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	int error = 0;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (addr >= TASK_SIZE)
+		return -EINVAL;
+
+	mm = get_task_mm(current);
+	if (!mm)
+		return -ENOENT;
+
+	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, addr);
+
+	if (opt != PR_SET_MM_START_BRK &&
+	    opt != PR_SET_MM_BRK) {
+		/* It must be existing VMA */
+		if (!vma || vma->vm_start > addr)
+			goto out;
+	}
+
+	error = -EINVAL;
+	switch (opt) {
+	case PR_SET_MM_START_CODE:
+	case PR_SET_MM_END_CODE:
+
+		vm_req_flags = VM_READ | VM_EXEC;
+		vm_bad_flags = VM_WRITE | VM_MAYSHARE;
+
+		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
+		    (vma->vm_flags & vm_bad_flags))
+			goto out;
+
+		if (opt == PR_SET_MM_START_CODE)
+			current->mm->start_code = addr;
+		else
+			current->mm->end_code = addr;
+		break;
+
+	case PR_SET_MM_START_DATA:
+	case PR_SET_MM_END_DATA:
+
+		vm_req_flags = VM_READ | VM_WRITE;
+		vm_bad_flags = VM_EXEC | VM_MAYSHARE;
+
+		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
+		    (vma->vm_flags & vm_bad_flags))
+			goto out;
+
+		if (opt == PR_SET_MM_START_DATA)
+			current->mm->start_data = addr;
+		else
+			current->mm->end_data = addr;
+		break;
+
+	case PR_SET_MM_START_STACK:
+
+#ifdef CONFIG_STACK_GROWSUP
+		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSUP;
+#else
+		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
+#endif
+		if ((vma->vm_flags & vm_req_flags) != vm_req_flags)
+			goto out;
+
+		current->mm->start_stack = addr;
+		break;
+
+	case PR_SET_MM_START_BRK:
+		if (addr <= mm->end_data)
+			goto out;
+
+		if (rlim < RLIM_INFINITY &&
+		    (mm->brk - addr) + (mm->end_data - mm->start_data) > rlim)
+			goto out;
+
+		current->mm->start_brk = addr;
+		break;
+
+	case PR_SET_MM_BRK:
+		if (addr <= mm->end_data)
+			goto out;
+
+		if (rlim < RLIM_INFINITY &&
+		    (addr - mm->start_brk) + (mm->end_data - mm->start_data) > rlim)
+			goto out;
+
+		current->mm->brk = addr;
+		break;
+
+	default:
+		error = -EINVAL;
+		goto out;
+	}
+
+	error = 0;
+
+out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+
+	return error;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -1841,6 +1953,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 			else
 				error = PR_MCE_KILL_DEFAULT;
 			break;
+		case PR_SET_MM: {
+			if (arg4 | arg5)
+				return -EINVAL;
+			error = prctl_set_mm(arg2, arg3);
+			break;
+		}
 		default:
 			error = -EINVAL;
 			break;

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-12-07 12:27                   ` Cyrill Gorcunov
@ 2011-12-07 22:43                     ` Andrew Morton
  2011-12-08  7:07                       ` Cyrill Gorcunov
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-07 22:43 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kees Cook, linux-kernel, Tejun Heo, Andrew Vagin, Serge Hallyn,
	Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki,
	Michael Kerrisk

On Wed, 7 Dec 2011 16:27:18 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> At process of task restoration we need a way to tune up
> a few members of mm_struct structure such as start_code,
> end_code, start_data, end_data, start_stack, start_brk, brk.

I don't really know what "tune up" means in this context.  Can we
please be more specific and detailed here?  It appears that the patch
permits userspace to directly modify these fields.

> While most of them have a statistical nature (their values
> are involved into calculation of /proc/<pid>/statm output)
> the start_brk and brk values are used to compute an allowed
> size of program data segment expansion. Which means an arbitrary
> changes of this value might be a bit dangerous operation.
> 
> To restrict access to this facility the following requirements
> applied to prctl users:
> 
>  - The process has to have CAP_SYS_ADMIN capability granted.
>  - For all opcodes except start_brk/brk members an appropriate
>    VMA area must be existing and should fit certain VMA flags,
>    such as:
>    - code segment must be executable but not writable;
>    - data segment must not be executable.
> 
> start_brk/brk values must not intersect with data segment
> and must not exceed RLIMIT_DATA resource limit.
> 
> Still the main guard is CAP_SYS_ADMIN capability check.
>
> ...
>
>  include/linux/prctl.h |   12 +++++
>  kernel/sys.c          |  118 ++++++++++++++++++++++++++++++++++++++++++++++++++

The prctl(2) manpage will need to be updated.  Please Cc Michael on all
such changes.

> 
> Index: linux-2.6.git/include/linux/prctl.h
> ===================================================================
> --- linux-2.6.git.orig/include/linux/prctl.h
> +++ linux-2.6.git/include/linux/prctl.h
> @@ -102,4 +102,16 @@
>  
>  #define PR_MCE_KILL_GET 34
>  
> +/*
> + * Tune up process memory map specifics.
> + */
> +#define PR_SET_MM		35
> +# define PR_SET_MM_START_CODE		1
> +# define PR_SET_MM_END_CODE		2
> +# define PR_SET_MM_START_DATA		3
> +# define PR_SET_MM_END_DATA		4
> +# define PR_SET_MM_START_STACK		5
> +# define PR_SET_MM_START_BRK		6
> +# define PR_SET_MM_BRK			7
> +
>  #endif /* _LINUX_PRCTL_H */
> Index: linux-2.6.git/kernel/sys.c
> ===================================================================
> --- linux-2.6.git.orig/kernel/sys.c
> +++ linux-2.6.git/kernel/sys.c
> @@ -1692,6 +1692,118 @@ SYSCALL_DEFINE1(umask, int, mask)
>  	return mask;
>  }
>  
> +static int prctl_set_mm(int opt, unsigned long addr)
> +{
> +	unsigned long rlim = rlimit(RLIMIT_DATA);
> +	unsigned long vm_req_flags;
> +	unsigned long vm_bad_flags;
> +	struct vm_area_struct *vma;
> +	struct mm_struct *mm;
> +	int error = 0;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	if (addr >= TASK_SIZE)
> +		return -EINVAL;
> +
> +	mm = get_task_mm(current);

Is it necessaary to run the expensive get_task_mm() for `current'? 
`current' is known to be running and you have control of it here -
nobody will be taking our mm away.  Simply use current->mm?  The
function actually uses current->mm later on in several places.

> +	if (!mm)
> +		return -ENOENT;
> +
> +	down_read(&mm->mmap_sem);
> +	vma = find_vma(mm, addr);
> +
> +	if (opt != PR_SET_MM_START_BRK &&
> +	    opt != PR_SET_MM_BRK) {

80 columns, not 40 :)

> +		/* It must be existing VMA */
> +		if (!vma || vma->vm_start > addr)
> +			goto out;
> +	}
> +
> +	error = -EINVAL;
> +	switch (opt) {
> +	case PR_SET_MM_START_CODE:
> +	case PR_SET_MM_END_CODE:
> +

You're adding unneeded and unconventional newlines after the `case'
statements.

> +		vm_req_flags = VM_READ | VM_EXEC;
> +		vm_bad_flags = VM_WRITE | VM_MAYSHARE;
> +
> +		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
> +		    (vma->vm_flags & vm_bad_flags))
> +			goto out;
> +
> +		if (opt == PR_SET_MM_START_CODE)
> +			current->mm->start_code = addr;
> +		else
> +			current->mm->end_code = addr;
> +		break;
> +
> +	case PR_SET_MM_START_DATA:
> +	case PR_SET_MM_END_DATA:
> +
> +		vm_req_flags = VM_READ | VM_WRITE;
> +		vm_bad_flags = VM_EXEC | VM_MAYSHARE;
> +
> +		if ((vma->vm_flags & vm_req_flags) != vm_req_flags ||
> +		    (vma->vm_flags & vm_bad_flags))
> +			goto out;
> +
> +		if (opt == PR_SET_MM_START_DATA)
> +			current->mm->start_data = addr;
> +		else
> +			current->mm->end_data = addr;
> +		break;
> +
> +	case PR_SET_MM_START_STACK:
> +
> +#ifdef CONFIG_STACK_GROWSUP
> +		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSUP;
> +#else
> +		vm_req_flags = VM_READ | VM_WRITE | VM_GROWSDOWN;
> +#endif
> +		if ((vma->vm_flags & vm_req_flags) != vm_req_flags)
> +			goto out;
> +
> +		current->mm->start_stack = addr;
> +		break;
> +
> +	case PR_SET_MM_START_BRK:
> +		if (addr <= mm->end_data)
> +			goto out;
> +
> +		if (rlim < RLIM_INFINITY &&
> +		    (mm->brk - addr) + (mm->end_data - mm->start_data) > rlim)
> +			goto out;
> +
> +		current->mm->start_brk = addr;
> +		break;
> +
> +	case PR_SET_MM_BRK:
> +		if (addr <= mm->end_data)
> +			goto out;
> +
> +		if (rlim < RLIM_INFINITY &&
> +		    (addr - mm->start_brk) + (mm->end_data - mm->start_data) > rlim)
> +			goto out;
> +
> +		current->mm->brk = addr;
> +		break;
> +
> +	default:
> +		error = -EINVAL;
> +		goto out;
> +	}
> +
> +	error = 0;
> +
> +out:
> +	up_read(&mm->mmap_sem);
> +	mmput(mm);
> +
> +	return error;
> +}

This is starting to add a non-trivial amount of code.  Perhaps we need
to introduce a Kconfig variable to control such things as this, to
prevent bloating up kernels which aren't require to support c/r?

>
> ...
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-12-07 22:43                     ` Andrew Morton
@ 2011-12-08  7:07                       ` Cyrill Gorcunov
  2011-12-08  7:15                         ` Andrew Morton
  0 siblings, 1 reply; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-12-08  7:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kees Cook, linux-kernel, Tejun Heo, Andrew Vagin, Serge Hallyn,
	Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki,
	Michael Kerrisk

On Wed, Dec 07, 2011 at 02:43:55PM -0800, Andrew Morton wrote:
> On Wed, 7 Dec 2011 16:27:18 +0400
> Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> > At process of task restoration we need a way to tune up
> > a few members of mm_struct structure such as start_code,
> > end_code, start_data, end_data, start_stack, start_brk, brk.
> 
> I don't really know what "tune up" means in this context.  Can we
> please be more specific and detailed here?  It appears that the patch
> permits userspace to directly modify these fields.
> 

ok

> 
> The prctl(2) manpage will need to be updated.  Please Cc Michael on all
> such changes.
>

you mean -- Michael Kerrisk, mtk AT man7.org, right?

...
> > +
> > +	mm = get_task_mm(current);
> 
> Is it necessaary to run the expensive get_task_mm() for `current'? 
> `current' is known to be running and you have control of it here -
> nobody will be taking our mm away.  Simply use current->mm?  The
> function actually uses current->mm later on in several places.

hmm, indeed, i'll update, thanks!

> 
> > +	if (!mm)
> > +		return -ENOENT;
> > +
> > +	down_read(&mm->mmap_sem);
> > +	vma = find_vma(mm, addr);
> > +
> > +	if (opt != PR_SET_MM_START_BRK &&
> > +	    opt != PR_SET_MM_BRK) {
> 
> 80 columns, not 40 :)
> 
> > +		/* It must be existing VMA */
> > +		if (!vma || vma->vm_start > addr)
> > +			goto out;
> > +	}
> > +
> > +	error = -EINVAL;
> > +	switch (opt) {
> > +	case PR_SET_MM_START_CODE:
> > +	case PR_SET_MM_END_CODE:
> > +
> 
> You're adding unneeded and unconventional newlines after the `case'
> statements.
> 

no, I added them by a purpose -- it's a way easier to read these
assignments, but fine -- I'll drop this nits.

> 
> This is starting to add a non-trivial amount of code.  Perhaps we need
> to introduce a Kconfig variable to control such things as this, to
> prevent bloating up kernels which aren't require to support c/r?
> 

Dunno, Andrew. Actually I agreed that these snippets are mostly
needed for c/r only, but the initial idea over all changes was
to add levers into kernel which might be helpful not only
for c/r but for someone else as well.

	Cyrill

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-12-08  7:07                       ` Cyrill Gorcunov
@ 2011-12-08  7:15                         ` Andrew Morton
  2011-12-08  7:30                           ` Cyrill Gorcunov
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Morton @ 2011-12-08  7:15 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kees Cook, linux-kernel, Tejun Heo, Andrew Vagin, Serge Hallyn,
	Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki,
	Michael Kerrisk

On Thu, 8 Dec 2011 11:07:24 +0400 Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> > 
> > The prctl(2) manpage will need to be updated.  Please Cc Michael on all
> > such changes.
> >
> 
> you mean -- Michael Kerrisk, mtk AT man7.org, right?

That isn't what ./MAINTAINERS says?

> > 
> > This is starting to add a non-trivial amount of code.  Perhaps we need
> > to introduce a Kconfig variable to control such things as this, to
> > prevent bloating up kernels which aren't require to support c/r?
> > 
> 
> Dunno, Andrew. Actually I agreed that these snippets are mostly
> needed for c/r only, but the initial idea over all changes was
> to add levers into kernel which might be helpful not only
> for c/r but for someone else as well.

In each instance, a case would need to be made for that.  Plus if
people later come along who want access to these things for other
purposes, they can send a patch!

There will always be people who would *prefer* that this material not
be present in their vmlinux.  For those people, this patch is a
regression!



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires
  2011-12-08  7:15                         ` Andrew Morton
@ 2011-12-08  7:30                           ` Cyrill Gorcunov
  0 siblings, 0 replies; 51+ messages in thread
From: Cyrill Gorcunov @ 2011-12-08  7:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kees Cook, linux-kernel, Tejun Heo, Andrew Vagin, Serge Hallyn,
	Pavel Emelyanov, Vasiliy Kulikov, KAMEZAWA Hiroyuki,
	Michael Kerrisk

On Wed, Dec 07, 2011 at 11:15:12PM -0800, Andrew Morton wrote:
> On Thu, 8 Dec 2011 11:07:24 +0400 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> 
> > > 
> > > The prctl(2) manpage will need to be updated.  Please Cc Michael on all
> > > such changes.
> > >
> > 
> > you mean -- Michael Kerrisk, mtk AT man7.org, right?
> 
> That isn't what ./MAINTAINERS says?
> 

OK, I've just found him in MAINTAINERS.

> > > 
> > > This is starting to add a non-trivial amount of code.  Perhaps we need
> > > to introduce a Kconfig variable to control such things as this, to
> > > prevent bloating up kernels which aren't require to support c/r?
> > > 
> > 
> > Dunno, Andrew. Actually I agreed that these snippets are mostly
> > needed for c/r only, but the initial idea over all changes was
> > to add levers into kernel which might be helpful not only
> > for c/r but for someone else as well.
> 
> In each instance, a case would need to be made for that.  Plus if
> people later come along who want access to these things for other
> purposes, they can send a patch!
> 
> There will always be people who would *prefer* that this material not
> be present in their vmlinux.  For those people, this patch is a
> regression!
> 

OK, I'll add some Kconfing variable for that. Lets see how it goes.

	Cyrill

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2011-12-08  7:30 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-29 19:12 [rfc 0/3] A small bundle in a sake of checkpoint/restore Cyrill Gorcunov
2011-11-29 19:12 ` [rfc 1/3] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat Cyrill Gorcunov
2011-11-29 20:06   ` Kees Cook
2011-12-02  0:24     ` Alexey Dobriyan
2011-12-02  7:28       ` Cyrill Gorcunov
2011-12-02 19:23         ` Kees Cook
2011-12-02 19:28           ` Cyrill Gorcunov
2011-11-29 20:32   ` Serge Hallyn
2011-11-30  5:04   ` KAMEZAWA Hiroyuki
2011-11-29 19:12 ` [rfc 2/3] fs, proc: Introduce the Children: line in /proc/<pid>/status Cyrill Gorcunov
2011-11-30  5:00   ` KAMEZAWA Hiroyuki
2011-11-30  6:05     ` Cyrill Gorcunov
2011-12-01  9:54       ` Cyrill Gorcunov
2011-12-01 15:43         ` Tejun Heo
2011-12-01 15:53           ` Cyrill Gorcunov
2011-12-01 16:07             ` Tejun Heo
2011-12-01 21:29         ` Andrew Morton
2011-12-01 21:38           ` Cyrill Gorcunov
2011-12-02  0:40         ` KAMEZAWA Hiroyuki
2011-12-02 12:41           ` Pedro Alves
2011-12-02 12:43             ` Pavel Emelyanov
2011-12-02 12:45               ` Cyrill Gorcunov
2011-12-02 13:10                 ` Pedro Alves
2011-12-02 13:40                   ` Pedro Alves
2011-12-02 12:58               ` Pedro Alves
2011-12-02 13:16                 ` Pavel Emelyanov
2011-12-02 13:44                   ` Pedro Alves
2011-12-02 13:52                     ` Pavel Emelyanov
2011-12-02 14:00                       ` Pedro Alves
2011-12-02 14:17                         ` Pavel Emelyanov
2011-12-02 14:25                           ` Pedro Alves
2011-12-02 14:37                             ` Pavel Emelyanov
2011-12-02 14:45                               ` Pedro Alves
2011-11-29 19:12 ` [rfc 3/3] prctl: Add PR_SET_MM codes to tune up mm_struct entires Cyrill Gorcunov
2011-11-29 20:19   ` Kees Cook
2011-11-29 20:29     ` Cyrill Gorcunov
2011-11-29 20:37       ` Cyrill Gorcunov
2011-11-29 20:40         ` Kees Cook
2011-11-29 20:47           ` Cyrill Gorcunov
2011-11-30 17:37           ` Cyrill Gorcunov
2011-11-30 18:10             ` Kees Cook
2011-11-30 18:23               ` Cyrill Gorcunov
2011-11-30 21:06                 ` Cyrill Gorcunov
2011-12-07 12:27                   ` Cyrill Gorcunov
2011-12-07 22:43                     ` Andrew Morton
2011-12-08  7:07                       ` Cyrill Gorcunov
2011-12-08  7:15                         ` Andrew Morton
2011-12-08  7:30                           ` Cyrill Gorcunov
2011-11-29 20:37       ` Kees Cook
2011-11-29 20:49       ` Serge Hallyn
2011-11-29 20:55         ` Cyrill Gorcunov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.