All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
@ 2012-02-15 14:36 Oleg Nesterov
  2012-02-15 15:10 ` Cyrill Gorcunov
  2012-02-15 16:06 ` Oleg Nesterov
  0 siblings, 2 replies; 49+ messages in thread
From: Oleg Nesterov @ 2012-02-15 14:36 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

> +/* The caller must have pinned the task */
> +static struct file *
> +get_file_raw_ptr(struct task_struct *task, unsigned int idx)
> +{
> +	struct fdtable *fdt;
> +	struct file *file;
> +
> +	spin_lock(&task->files->file_lock);

task->files can be NULL, we can race with exit_files().

> +	fdt = files_fdtable(task->files);
> +	if (idx < fdt->max_fds)
> +		file = fdt->fd[idx];

You can probably rely on rcu instead of ->file_lock, but this is minor.

> +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
> +		unsigned long, idx1, unsigned long, idx2)
> +{
> +	struct task_struct *task1, *task2;
> +	int ret;
> +
> +	rcu_read_lock();
> +
> +	/*
> +	 * Tasks are looked up in caller's PID namespace only.
> +	 */
> +	task1 = find_task_by_vpid(pid1);
> +	task2 = find_task_by_vpid(pid2);
> +	if (!task1 || !task2)
> +		goto err_no_task;
> +
> +	get_task_struct(task1);
> +	get_task_struct(task2);
> +
> +	rcu_read_unlock();
> +
> +	/*
> +	 * One should have enough rights to inspect task details.
> +	 */
> +	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> +	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> +		ret = -EACCES;

Well, probably this is fine... but may be you can add a comment.
The task can change its credentials right after ptrace_may_access()
succeeds. This _looks_ wrong, perhaps it makes sense to  add the
"we do not care" note.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 14:36 + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree Oleg Nesterov
@ 2012-02-15 15:10 ` Cyrill Gorcunov
  2012-02-15 15:38   ` Oleg Nesterov
  2012-02-15 18:32   ` Cyrill Gorcunov
  2012-02-15 16:06 ` Oleg Nesterov
  1 sibling, 2 replies; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-15 15:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote:
> 
> task->files can be NULL, we can race with exit_files().

So I need to call get_files_struct. Thanks Oleg!

> 
> > +	fdt = files_fdtable(task->files);
> > +	if (idx < fdt->max_fds)
> > +		file = fdt->fd[idx];
> 
> You can probably rely on rcu instead of ->file_lock, but this is minor.
> 

I think so. I'll be updating the patch anyway (on top of current
one) so I'll address this comment too. Thanks!

> > +
> > +	/*
> > +	 * One should have enough rights to inspect task details.
> > +	 */
> > +	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> > +	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> > +		ret = -EACCES;
> 
> Well, probably this is fine... but may be you can add a comment.
> The task can change its credentials right after ptrace_may_access()
> succeeds. This _looks_ wrong, perhaps it makes sense to  add the
> "we do not care" note.
> 

Wait, how it's differ from other ptrace_may_access calls all over
the kernel? I suppose I'm missing something obvious?

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 15:10 ` Cyrill Gorcunov
@ 2012-02-15 15:38   ` Oleg Nesterov
  2012-02-15 16:13     ` Cyrill Gorcunov
  2012-02-15 18:32   ` Cyrill Gorcunov
  1 sibling, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2012-02-15 15:38 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

On 02/15, Cyrill Gorcunov wrote:
>
> On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote:
> >
> > > +
> > > +	/*
> > > +	 * One should have enough rights to inspect task details.
> > > +	 */
> > > +	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> > > +	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> > > +		ret = -EACCES;
> >
> > Well, probably this is fine... but may be you can add a comment.
> > The task can change its credentials right after ptrace_may_access()
> > succeeds. This _looks_ wrong, perhaps it makes sense to  add the
> > "we do not care" note.
> >
>
> Wait, how it's differ from other ptrace_may_access calls all over
> the kernel? I suppose I'm missing something obvious?

For example? Say, mm_access() is fine because it returns ->mm
which we are going to play with.

But map_files_d_revalidate/proc_map_files_get_link looks wrong,
there are obviously racy and should use mm_access().

Probably something else is wrong too.

Once again, I am not saying that this code really has the security
problems. I simply do not know. But it looks wrong without the
comment. I do not really understand why do we need ptrace_may_access(),
but whatever reason we have how we can trust it? Say, when KCMP_VM
checks ->mm, all we know is that PTRACE_MODE_READ succeed in the
past. This looks confusing, imho.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 14:36 + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree Oleg Nesterov
  2012-02-15 15:10 ` Cyrill Gorcunov
@ 2012-02-15 16:06 ` Oleg Nesterov
  2012-02-15 16:27   ` Cyrill Gorcunov
  1 sibling, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2012-02-15 16:06 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

Not a comment, but the question. I am just curious...

> +/*
> + * We don't expose real in-memory order of objects for security
> + * reasons, still the comparison results should be suitable for
> + * sorting. Thus, we obfuscate kernel pointers values and compare
> + * the production instead.
> + */
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> +
> +static long kptr_obfuscate(long v, int type)
> +{
> +       return (v ^ cookies[type][0]) * cookies[type][1];
> +}

OK, but why do we need this per type? Just to add more obfuscation
or there is another reason?

> +static __init int kcmp_cookies_init(void)
> +{
> +       int i;
> +
> +       get_random_bytes(cookies, sizeof(cookies));
> +
> +       for (i = 0; i < KCMP_TYPES; i++)
> +               cookies[i][1] |= (~(~0UL >>  1) | 1);

I am puzzled, help ;) this is equal to

		cookies[i][1] |= -LONG_MAX;
or
		cookies[i][1] |= (LONG_MIN | 1);

for what? why do we want to set these 2 bits (MSB and LSB) ?

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 15:38   ` Oleg Nesterov
@ 2012-02-15 16:13     ` Cyrill Gorcunov
  2012-02-15 16:22       ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-15 16:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

On Wed, Feb 15, 2012 at 04:38:16PM +0100, Oleg Nesterov wrote:
...
> >
> > Wait, how it's differ from other ptrace_may_access calls all over
> > the kernel? I suppose I'm missing something obvious?
> 
> For example? Say, mm_access() is fine because it returns ->mm
> which we are going to play with.

So, say we have

environ_read
  mm_for_maps
   mm_access
    success, and first reader continue here

then say task change own credentials and all
this sequence fails because access is not granted
anymore (say for second reader), but first reader
still able to continue reading because access was
graned earlier.

So I don't understand how it's different from what
is provided in this patch. What I'm missing?

> Once again, I am not saying that this code really has the security
> problems. I simply do not know. But it looks wrong without the
> comment. I do not really understand why do we need ptrace_may_access(),
> but whatever reason we have how we can trust it? Say, when KCMP_VM
> checks ->mm, all we know is that PTRACE_MODE_READ succeed in the
> past. This looks confusing, imho.

Adding the comment is not a problem. The problem is that I
dont understand if there error in patch or not, can we stick
with ptrace_may_access or need something different here?
The idea was exactly like -- if you have enough rights to
proceed ptrace_may_access then syscall should continue
executing and return comparision result.

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 16:13     ` Cyrill Gorcunov
@ 2012-02-15 16:22       ` Oleg Nesterov
  2012-02-15 17:53         ` Cyrill Gorcunov
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2012-02-15 16:22 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

On 02/15, Cyrill Gorcunov wrote:
>
> On Wed, Feb 15, 2012 at 04:38:16PM +0100, Oleg Nesterov wrote:
> ...
> > >
> > > Wait, how it's differ from other ptrace_may_access calls all over
> > > the kernel? I suppose I'm missing something obvious?
> >
> > For example? Say, mm_access() is fine because it returns ->mm
> > which we are going to play with.
>
> So, say we have
>
> environ_read
>   mm_for_maps
>    mm_access
>     success, and first reader continue here
>
> then say task change own credentials and all
> this sequence fails because access is not granted
> anymore (say for second reader), but first reader
> still able to continue reading because access was
> graned earlier.

Can't understand... Yes, environ_read() can succeed for the
first reader, and then later it can fail for the same/another
reader. And?

> So I don't understand how it's different from what
> is provided in this patch. What I'm missing?

environ_read() does

	mm = mm_access(task);
	if (mm)
		do_something(mm);

even if it races with, say, execve(setuid_app) we can't read the
new ->mm.

while your code (very roughly) does something like

	mm = mm_access(task);

	if (mm)
		do_something(task->mm);

while it is quite possible that mm != task->mm.

> > Once again, I am not saying that this code really has the security
> > problems. I simply do not know. But it looks wrong without the
> > comment. I do not really understand why do we need ptrace_may_access(),
> > but whatever reason we have how we can trust it? Say, when KCMP_VM
> > checks ->mm, all we know is that PTRACE_MODE_READ succeed in the
> > past. This looks confusing, imho.
>
> Adding the comment is not a problem. The problem is that I
> dont understand if there error in patch or not, can we stick
> with ptrace_may_access or need something different here?
> The idea was exactly like -- if you have enough rights to
> proceed ptrace_may_access then syscall should continue
> executing and return comparision result.

My only point is: this check is obviously racy, and thus it looks
confusing. Whether this is fine or not, I do not know. Personally
I see no reason for ptrace_may_access(), but I am not security
expert.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 16:06 ` Oleg Nesterov
@ 2012-02-15 16:27   ` Cyrill Gorcunov
  2012-04-09 22:10     ` Andrew Morton
  0 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-15 16:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

On Wed, Feb 15, 2012 at 05:06:52PM +0100, Oleg Nesterov wrote:
> Not a comment, but the question. I am just curious...
> 
> > +/*
> > + * We don't expose real in-memory order of objects for security
> > + * reasons, still the comparison results should be suitable for
> > + * sorting. Thus, we obfuscate kernel pointers values and compare
> > + * the production instead.
> > + */
> > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> > +
> > +static long kptr_obfuscate(long v, int type)
> > +{
> > +       return (v ^ cookies[type][0]) * cookies[type][1];
> > +}
> 
> OK, but why do we need this per type? Just to add more obfuscation
> or there is another reason?

Just to add more obfuscation.

> 
> > +static __init int kcmp_cookies_init(void)
> > +{
> > +       int i;
> > +
> > +       get_random_bytes(cookies, sizeof(cookies));
> > +
> > +       for (i = 0; i < KCMP_TYPES; i++)
> > +               cookies[i][1] |= (~(~0UL >>  1) | 1);
> 
> I am puzzled, help ;) this is equal to
> 
> 		cookies[i][1] |= -LONG_MAX;
> or
> 		cookies[i][1] |= (LONG_MIN | 1);
> 
> for what? why do we want to set these 2 bits (MSB and LSB) ?

Letme quote hpa@ here :)

 | This code is wrong.  You will have a zero cookie, legitimately, once in
 | 2^32 or 2^64 attempts, depending on the bitness.
 |
 | The other thing is that for the multiplicative cookie you should OR in
 | the value (~(~0UL >> 1) | 1) in order to make sure that the value is (a)
 | large and (b) odd.

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 16:22       ` Oleg Nesterov
@ 2012-02-15 17:53         ` Cyrill Gorcunov
  2012-02-15 18:43           ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-15 17:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

On Wed, Feb 15, 2012 at 05:22:22PM +0100, Oleg Nesterov wrote:
> 
> > So I don't understand how it's different from what
> > is provided in this patch. What I'm missing?
> 
> environ_read() does
> 
> 	mm = mm_access(task);
> 	if (mm)
> 		do_something(mm);
> 
> even if it races with, say, execve(setuid_app) we can't read the
> new ->mm.

Wait, I'm confused

	process 1 (reader)	process 2 ("task" itself)
	mm = mm_access(task);
				task changes own credentials
				so reader can't access on next
				read if it would try, but since
				access already granted... it
				continues do_something(mm)
	if (mm)
		do_something(mm);

So in the patch I tried the same, once access is granted it
belongs to a caller.

> 
> while your code (very roughly) does something like
> 
> 	mm = mm_access(task);
> 	if (mm)
> 		do_something(task->mm);
> 
> while it is quite possible that mm != task->mm.

Oleg, could you please explain me where it happens
that task->mm (I've got access to) will be changed
to some new -mm while I'm inspecting it.

If permission changed while the caller inside syscall,
it's the same situation as with mm_access above. No?

> 
> My only point is: this check is obviously racy, and thus it looks
> confusing. Whether this is fine or not, I do not know. Personally
> I see no reason for ptrace_may_access(), but I am not security
> expert.

The idea was -- non-privilege caller should not have access
to privileged tasks.

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 15:10 ` Cyrill Gorcunov
  2012-02-15 15:38   ` Oleg Nesterov
@ 2012-02-15 18:32   ` Cyrill Gorcunov
  2012-02-15 19:06     ` Oleg Nesterov
  1 sibling, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-15 18:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

On Wed, Feb 15, 2012 at 07:10:08PM +0400, Cyrill Gorcunov wrote:
> On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote:
> > 
> > task->files can be NULL, we can race with exit_files().
> 
> So I need to call get_files_struct. Thanks Oleg!
> 

Something like below I think, right? (it seems this
patc doesn't hit lenux-next yet, so here is interdiff
output)

	Cyrill
---
diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
--- linux-2.6.git/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -44,16 +44,25 @@
 static struct file *
 get_file_raw_ptr(struct task_struct *task, unsigned int idx)
 {
+	struct files_struct *files;
 	struct fdtable *fdt;
 	struct file *file;
 
-	spin_lock(&task->files->file_lock);
-	fdt = files_fdtable(task->files);
+	files = get_files_struct(task);
+	if (!files)
+		return NULL;
+
+	rcu_read_lock();
+
+	fdt = files_fdtable(files);
 	if (idx < fdt->max_fds)
 		file = fdt->fd[idx];
 	else
 		file = NULL;
-	spin_unlock(&task->files->file_lock);
+
+	rcu_read_unlock();
+
+	put_files_struct(files);
 
 	return file;
 }

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 17:53         ` Cyrill Gorcunov
@ 2012-02-15 18:43           ` Oleg Nesterov
  2012-02-15 19:56             ` Cyrill Gorcunov
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2012-02-15 18:43 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

On 02/15, Cyrill Gorcunov wrote:
>
> On Wed, Feb 15, 2012 at 05:22:22PM +0100, Oleg Nesterov wrote:
> >
> > > So I don't understand how it's different from what
> > > is provided in this patch. What I'm missing?
> >
> > environ_read() does
> >
> > 	mm = mm_access(task);
> > 	if (mm)
> > 		do_something(mm);
> >
> > even if it races with, say, execve(setuid_app) we can't read the
> > new ->mm.
>
> Wait, I'm confused
>
> 	process 1 (reader)	process 2 ("task" itself)
> 	mm = mm_access(task);
> 				task changes own credentials
> 				so reader can't access on next
> 				read if it would try, but since
> 				access already granted... it
> 				continues do_something(mm)
> 	if (mm)
> 		do_something(mm);
>
> So in the patch I tried the same, once access is granted it
> belongs to a caller.

See the "execve(setuid_app)", this is what I meant. Even if we
race with execve() and the task raises its privileges we can't
read the new ->mm, we will read the old mm for which we have
(had) the rights to access.

> > while your code (very roughly) does something like
> >
> > 	mm = mm_access(task);
> > 	if (mm)
> > 		do_something(task->mm);
> >
> > while it is quite possible that mm != task->mm.
>
> Oleg, could you please explain me where it happens
> that task->mm (I've got access to) will be changed
> to some new -mm while I'm inspecting it.

Cough... this is question I am trying to ask ;)

Let me try again. To simplify, lets discuss the KCMP_VM case
only.

I do not really understand why do we need ptrace_may_access().
I do not see any security problems with kcmp_ptr(task->mm), but
I am not expert.

However, you added this check so I assume you have some reason.
But this can race with execve(setuid_app) and KCMP_VM can play
with task->mm after this task raises its caps. If this is fine,
then why do we need ptrace_may_access?

OK, please ignore. I sent the initial email just becase KCMP_FILE
is buggy.

> > > +       for (i = 0; i < KCMP_TYPES; i++)
> > > +               cookies[i][1] |= (~(~0UL >>  1) | 1);
> >
> > I am puzzled, help ;) this is equal to
> >
> > 		cookies[i][1] |= -LONG_MAX;
> > or
> > 		cookies[i][1] |= (LONG_MIN | 1);
> >
> > for what? why do we want to set these 2 bits (MSB and LSB) ?
>
> Letme quote hpa@ here :)
>
>  | This code is wrong.  You will have a zero cookie, legitimately, once in
>  | 2^32 or 2^64 attempts, depending on the bitness.
>  |
>  | The other thing is that for the multiplicative cookie you should OR in
>  | the value (~(~0UL >> 1) | 1) in order to make sure that the value is (a)
>  | large and (b) odd.

OK, thanks.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 18:32   ` Cyrill Gorcunov
@ 2012-02-15 19:06     ` Oleg Nesterov
  2012-02-15 19:18       ` Cyrill Gorcunov
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2012-02-15 19:06 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

On 02/15, Cyrill Gorcunov wrote:
>
> On Wed, Feb 15, 2012 at 07:10:08PM +0400, Cyrill Gorcunov wrote:
> > On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote:
> > >
> > > task->files can be NULL, we can race with exit_files().
> >
> > So I need to call get_files_struct. Thanks Oleg!
> >
>
> Something like below I think, right? (it seems this
> patc doesn't hit lenux-next yet, so here is interdiff
> output)
>
> 	Cyrill
> ---
> diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
> --- linux-2.6.git/kernel/kcmp.c
> +++ linux-2.6.git/kernel/kcmp.c
> @@ -44,16 +44,25 @@
>  static struct file *
>  get_file_raw_ptr(struct task_struct *task, unsigned int idx)
>  {
> +	struct files_struct *files;
>  	struct fdtable *fdt;
>  	struct file *file;
>
> -	spin_lock(&task->files->file_lock);
> -	fdt = files_fdtable(task->files);
> +	files = get_files_struct(task);
> +	if (!files)
> +		return NULL;
> +
> +	rcu_read_lock();
> +
> +	fdt = files_fdtable(files);
>  	if (idx < fdt->max_fds)
>  		file = fdt->fd[idx];
>  	else
>  		file = NULL;
> -	spin_unlock(&task->files->file_lock);
> +
> +	rcu_read_unlock();
> +
> +	put_files_struct(files);

I think this should work. Or you can simply do

	struct file *file = NULL;

	task_lock(task);
	rcu_read_lock();
	if (task->files)
		file = fcheck_files(task->files, idx);
	rcu_read_unlock();
	task_unlock(task);

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 19:06     ` Oleg Nesterov
@ 2012-02-15 19:18       ` Cyrill Gorcunov
  0 siblings, 0 replies; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-15 19:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

On Wed, Feb 15, 2012 at 08:06:22PM +0100, Oleg Nesterov wrote:
> 
> I think this should work. Or you can simply do
> 
> 	struct file *file = NULL;
> 
> 	task_lock(task);
> 	rcu_read_lock();
> 	if (task->files)
> 		file = fcheck_files(task->files, idx);
> 	rcu_read_unlock();
> 	task_unlock(task);
> 

Yeah, thanks, this one is better.

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 18:43           ` Oleg Nesterov
@ 2012-02-15 19:56             ` Cyrill Gorcunov
  2012-02-15 19:57               ` Vasiliy Kulikov
  0 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-15 19:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel

On Wed, Feb 15, 2012 at 07:43:36PM +0100, Oleg Nesterov wrote:
...
> 
> Cough... this is question I am trying to ask ;)
> 
> Let me try again. To simplify, lets discuss the KCMP_VM case
> only.
> 
> I do not really understand why do we need ptrace_may_access().
> I do not see any security problems with kcmp_ptr(task->mm), but
> I am not expert.
> 
> However, you added this check so I assume you have some reason.
> But this can race with execve(setuid_app) and KCMP_VM can play
> with task->mm after this task raises its caps. If this is fine,
> then why do we need ptrace_may_access?
> 

This makes me scratch the head ;) I think ptrace_may_access (or
some other security test) should remain since it's somehow weird
if non-root task will be able to find objects order from privileged
task. Thus I need to find a way how to handle execve(setuid_app).
Need to think...

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 19:56             ` Cyrill Gorcunov
@ 2012-02-15 19:57               ` Vasiliy Kulikov
  2012-02-15 20:05                 ` Cyrill Gorcunov
  0 siblings, 1 reply; 49+ messages in thread
From: Vasiliy Kulikov @ 2012-02-15 19:57 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, Andrew Morton, linux-kernel

On Wed, Feb 15, 2012 at 23:56 +0400, Cyrill Gorcunov wrote:
> On Wed, Feb 15, 2012 at 07:43:36PM +0100, Oleg Nesterov wrote:
> ...
> > 
> > Cough... this is question I am trying to ask ;)
> > 
> > Let me try again. To simplify, lets discuss the KCMP_VM case
> > only.
> > 
> > I do not really understand why do we need ptrace_may_access().
> > I do not see any security problems with kcmp_ptr(task->mm), but
> > I am not expert.
> > 
> > However, you added this check so I assume you have some reason.
> > But this can race with execve(setuid_app) and KCMP_VM can play
> > with task->mm after this task raises its caps. If this is fine,
> > then why do we need ptrace_may_access?
> > 
> 
> This makes me scratch the head ;) I think ptrace_may_access (or
> some other security test) should remain since it's somehow weird
> if non-root task will be able to find objects order from privileged
> task. Thus I need to find a way how to handle execve(setuid_app).
> Need to think...

Look at fs/proc/base.c:lock_trace() - it locks ->cred_guard_mutex
for the whole period of time when it uses a resource.

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 19:57               ` Vasiliy Kulikov
@ 2012-02-15 20:05                 ` Cyrill Gorcunov
  2012-02-15 20:25                   ` Cyrill Gorcunov
  0 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-15 20:05 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, Andrew Morton, linux-kernel

On Wed, Feb 15, 2012 at 11:57:33PM +0400, Vasiliy Kulikov wrote:
> > 
> > This makes me scratch the head ;) I think ptrace_may_access (or
> > some other security test) should remain since it's somehow weird
> > if non-root task will be able to find objects order from privileged
> > task. Thus I need to find a way how to handle execve(setuid_app).
> > Need to think...
> 
> Look at fs/proc/base.c:lock_trace() - it locks ->cred_guard_mutex
> for the whole period of time when it uses a resource.

Yup, thanks Vasiliy! I've just found cred_guard_mutex in
install_exec_creds. Now I'm thinking if this is what we
need here ;)

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 20:05                 ` Cyrill Gorcunov
@ 2012-02-15 20:25                   ` Cyrill Gorcunov
  2012-02-15 21:09                     ` Cyrill Gorcunov
  0 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-15 20:25 UTC (permalink / raw)
  To: Vasiliy Kulikov, Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, Andrew Morton, linux-kernel

On Thu, Feb 16, 2012 at 12:05:33AM +0400, Cyrill Gorcunov wrote:
> On Wed, Feb 15, 2012 at 11:57:33PM +0400, Vasiliy Kulikov wrote:
> > > 
> > > This makes me scratch the head ;) I think ptrace_may_access (or
> > > some other security test) should remain since it's somehow weird
> > > if non-root task will be able to find objects order from privileged
> > > task. Thus I need to find a way how to handle execve(setuid_app).
> > > Need to think...
> > 
> > Look at fs/proc/base.c:lock_trace() - it locks ->cred_guard_mutex
> > for the whole period of time when it uses a resource.
> 
> Yup, thanks Vasiliy! I've just found cred_guard_mutex in
> install_exec_creds. Now I'm thinking if this is what we
> need here ;)
> 

Something like below I think (not yet tested, overall update).

	Cyrill
---
diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
--- linux-2.6.git/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -44,20 +44,34 @@
 static struct file *
 get_file_raw_ptr(struct task_struct *task, unsigned int idx)
 {
-	struct fdtable *fdt;
-	struct file *file;
+	struct file *file = NULL;
 
-	spin_lock(&task->files->file_lock);
-	fdt = files_fdtable(task->files);
-	if (idx < fdt->max_fds)
-		file = fdt->fd[idx];
-	else
-		file = NULL;
-	spin_unlock(&task->files->file_lock);
+	task_lock(task);
+	rcu_read_lock();
+
+	if (task->files)
+		file = fcheck_files(task->files, idx);
+
+	rcu_read_unlock();
+	task_unlock(task);
 
 	return file;
 }
 
+static int may_access(struct task_struct *task)
+{
+	int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+	if (err)
+		return err;
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		mutex_unlock(&task->signal->cred_guard_mutex);
+		return -EPERM;
+	}
+
+	return 0;
+}
+
 SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
 		unsigned long, idx1, unsigned long, idx2)
 {
@@ -82,11 +96,12 @@
 	/*
 	 * One should have enough rights to inspect task details.
 	 */
-	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
-	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
-		ret = -EACCES;
+	ret = may_access(task1);
+	if (ret)
 		goto err;
-	}
+	ret = may_access(task2);
+	if (ret)
+		goto err_unlock;
 
 	switch (type) {
 	case KCMP_FILE: {
@@ -130,6 +145,9 @@
 		break;
 	}
 
+	mutex_unlock(&task2->signal->cred_guard_mutex);
+err_unlock:
+	mutex_unlock(&task1->signal->cred_guard_mutex);
 err:
 	put_task_struct(task1);
 	put_task_struct(task2);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 20:25                   ` Cyrill Gorcunov
@ 2012-02-15 21:09                     ` Cyrill Gorcunov
  2012-02-15 21:58                       ` Cyrill Gorcunov
  0 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-15 21:09 UTC (permalink / raw)
  To: Vasiliy Kulikov, Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, Andrew Morton, linux-kernel

On Thu, Feb 16, 2012 at 12:25:38AM +0400, Cyrill Gorcunov wrote:
> 
> Something like below I think (not yet tested, overall update).
>

OK, this one I've just tested. Please review, so I won't miss
something obvious (I'm fetching linux-next now, if the patch
is already there I'll cook one on top).

	Cyrill
---
diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
--- linux-2.6.git/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -44,20 +44,38 @@
 static struct file *
 get_file_raw_ptr(struct task_struct *task, unsigned int idx)
 {
-	struct fdtable *fdt;
-	struct file *file;
+	struct file *file = NULL;
 
-	spin_lock(&task->files->file_lock);
-	fdt = files_fdtable(task->files);
-	if (idx < fdt->max_fds)
-		file = fdt->fd[idx];
-	else
-		file = NULL;
-	spin_unlock(&task->files->file_lock);
+	task_lock(task);
+	rcu_read_lock();
+
+	if (task->files)
+		file = fcheck_files(task->files, idx);
+
+	rcu_read_unlock();
+	task_unlock(task);
 
 	return file;
 }
 
+static void access_unlock(struct task_struct *task)
+{
+	mutex_unlock(&task->signal->cred_guard_mutex);
+}
+
+static int access_trylock(struct task_struct *task)
+{
+	if (!mutex_trylock(&task->signal->cred_guard_mutex))
+		return -EBUSY;
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		mutex_unlock(&task->signal->cred_guard_mutex);
+		return -EPERM;
+	}
+
+	return 0;
+}
+
 SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
 		unsigned long, idx1, unsigned long, idx2)
 {
@@ -82,11 +100,12 @@
 	/*
 	 * One should have enough rights to inspect task details.
 	 */
-	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
-	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
-		ret = -EACCES;
+	ret = access_trylock(task1);
+	if (ret)
 		goto err;
-	}
+	ret = access_trylock(task2);
+	if (ret)
+		goto err_unlock;
 
 	switch (type) {
 	case KCMP_FILE: {
@@ -130,6 +149,9 @@
 		break;
 	}
 
+	access_unlock(task2);
+err_unlock:
+	access_unlock(task1);
 err:
 	put_task_struct(task1);
 	put_task_struct(task2);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 21:09                     ` Cyrill Gorcunov
@ 2012-02-15 21:58                       ` Cyrill Gorcunov
  2012-02-16 14:49                         ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-15 21:58 UTC (permalink / raw)
  To: Vasiliy Kulikov, Oleg Nesterov, Andrew Morton
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 619 bytes --]

On Thu, Feb 16, 2012 at 01:09:34AM +0400, Cyrill Gorcunov wrote:
> On Thu, Feb 16, 2012 at 12:25:38AM +0400, Cyrill Gorcunov wrote:
> > 
> > Something like below I think (not yet tested, overall update).
> >
> 
> OK, this one I've just tested. Please review, so I won't miss
> something obvious (I'm fetching linux-next now, if the patch
> is already there I'll cook one on top).
> ---

Unfortunately I dont see this patch in linux-next, so I cooked
it as interdiff'ed. Please review.

Maybe (if this patch is fine) we could drop v8 and I would
squash this changed into v9 (together with update to self
test)?

	Cyrill

[-- Attachment #2: sys-kcmp-interdiff --]
[-- Type: text/plain, Size: 2197 bytes --]

From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: syscalls, x86: Fix __NR_kcmp execve race and potential NULL dereference

Plain ptrace_may_access() check used in kcmp is not safe
against race with execve(setuid_app), so we need to grab
cred_guard_mutex and keep it until kcmp is finished.

Also task->files may be nil, better to use task_lock
and fcheck_files helpers instead of direct file_lock
usage.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
--- linux-2.6.git/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -44,20 +44,38 @@
 static struct file *
 get_file_raw_ptr(struct task_struct *task, unsigned int idx)
 {
-	struct fdtable *fdt;
-	struct file *file;
+	struct file *file = NULL;
 
-	spin_lock(&task->files->file_lock);
-	fdt = files_fdtable(task->files);
-	if (idx < fdt->max_fds)
-		file = fdt->fd[idx];
-	else
-		file = NULL;
-	spin_unlock(&task->files->file_lock);
+	task_lock(task);
+	rcu_read_lock();
+
+	if (task->files)
+		file = fcheck_files(task->files, idx);
+
+	rcu_read_unlock();
+	task_unlock(task);
 
 	return file;
 }
 
+static void access_unlock(struct task_struct *task)
+{
+	mutex_unlock(&task->signal->cred_guard_mutex);
+}
+
+static int access_trylock(struct task_struct *task)
+{
+	if (!mutex_trylock(&task->signal->cred_guard_mutex))
+		return -EBUSY;
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		mutex_unlock(&task->signal->cred_guard_mutex);
+		return -EPERM;
+	}
+
+	return 0;
+}
+
 SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
 		unsigned long, idx1, unsigned long, idx2)
 {
@@ -82,11 +100,12 @@
 	/*
 	 * One should have enough rights to inspect task details.
 	 */
-	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
-	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
-		ret = -EACCES;
+	ret = access_trylock(task1);
+	if (ret)
 		goto err;
-	}
+	ret = access_trylock(task2);
+	if (ret)
+		goto err_unlock;
 
 	switch (type) {
 	case KCMP_FILE: {
@@ -130,6 +149,9 @@
 		break;
 	}
 
+	access_unlock(task2);
+err_unlock:
+	access_unlock(task1);
 err:
 	put_task_struct(task1);
 	put_task_struct(task2);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 21:58                       ` Cyrill Gorcunov
@ 2012-02-16 14:49                         ` Oleg Nesterov
  2012-02-16 15:13                           ` Cyrill Gorcunov
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2012-02-16 14:49 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

On 02/16, Cyrill Gorcunov wrote:
>
> +static int access_trylock(struct task_struct *task)
> +{
> +	if (!mutex_trylock(&task->signal->cred_guard_mutex))
> +		return -EBUSY;
> +
> +	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
> +		mutex_unlock(&task->signal->cred_guard_mutex);
> +		return -EPERM;
> +	}
> +
> +	return 0;
> +}

OK, this looks correct, but I don't really understand _trylock.
This means the caller should always retry if -EBUSY, and
kcmp(pid, pid) can never succeed. Sure, kcmp() doesn't make
a lot of sense if pid1 == pid2, but this looks a bit strange.

You could simply do

	int mutex_double_lock_killable(struct mutex *m1, struct mutex *m2)
	{
		int err;

		if (m2 > m1)
			swap(m1, m2);

		err = mutex_lock_killable(m1);

		if (!err && likely(m1 != m2)) {
			err = mutex_lock_killable_nested(m2);
			if (err)
				mutex_unlock(m1);
		}

		return err;
	}

but I won't unsist.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 14:49                         ` Oleg Nesterov
@ 2012-02-16 15:13                           ` Cyrill Gorcunov
  2012-02-16 16:49                             ` Cyrill Gorcunov
  0 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-16 15:13 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

On Thu, Feb 16, 2012 at 03:49:54PM +0100, Oleg Nesterov wrote:
> On 02/16, Cyrill Gorcunov wrote:
> >
> > +static int access_trylock(struct task_struct *task)
> > +{
> > +	if (!mutex_trylock(&task->signal->cred_guard_mutex))
> > +		return -EBUSY;
> > +
> > +	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
> > +		mutex_unlock(&task->signal->cred_guard_mutex);
> > +		return -EPERM;
> > +	}
> > +
> > +	return 0;
> > +}
> 
> OK, this looks correct, but I don't really understand _trylock.
> This means the caller should always retry if -EBUSY, and
> kcmp(pid, pid) can never succeed. Sure, kcmp() doesn't make
> a lot of sense if pid1 == pid2, but this looks a bit strange.
> 

Hi Oleg, sure I can make it this way, also I think if pid1 == pid2
and idx1 == idx2 I can return 0 immediately.

> You could simply do
> 	int mutex_double_lock_killable(struct mutex *m1, struct mutex *m2)
> 	{
> 		int err;
> 
> 		if (m2 > m1)
> 			swap(m1, m2);
> 
> 		err = mutex_lock_killable(m1);
> 
> 		if (!err && likely(m1 != m2)) {
> 			err = mutex_lock_killable_nested(m2);
> 			if (err)
> 				mutex_unlock(m1);
> 		}
> 
> 		return err;
> 	}
> 
> but I won't insist.

Initially I wanted kcmp would be brining minimum impact
and if mutex is already taken by someone, we would not sleep
but return immediately with -EBUSY and it would be up to caller
to deside if to try again or make some delay first. I simply
not sure what is better here.

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 15:13                           ` Cyrill Gorcunov
@ 2012-02-16 16:49                             ` Cyrill Gorcunov
  2012-02-16 17:40                               ` Oleg Nesterov
  2012-02-16 18:21                               ` Vasiliy Kulikov
  0 siblings, 2 replies; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-16 16:49 UTC (permalink / raw)
  To: Oleg Nesterov, Vasiliy Kulikov, Andrew Morton
  Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

On Thu, Feb 16, 2012 at 07:13:40PM +0400, Cyrill Gorcunov wrote:
...
> 
> > You could simply do
> > 	int mutex_double_lock_killable(struct mutex *m1, struct mutex *m2)
> > 	{
> > 		int err;
> > 
> > 		if (m2 > m1)
> > 			swap(m1, m2);
> > 
> > 		err = mutex_lock_killable(m1);
> > 
> > 		if (!err && likely(m1 != m2)) {
> > 			err = mutex_lock_killable_nested(m2);
> > 			if (err)
> > 				mutex_unlock(m1);
> > 		}
> > 
> > 		return err;
> > 	}
> > 
> > but I won't insist.
> 
> Initially I wanted kcmp would be brining minimum impact
> and if mutex is already taken by someone, we would not sleep
> but return immediately with -EBUSY and it would be up to caller
> to deside if to try again or make some delay first. I simply
> not sure what is better here.
> 
This one should do the trick.

	Cyrill
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: syscalls, x86: Make __NR_kcmp to work with equivalent pids

In case if pid1 is equal to pid2 the kcmp will return -EBUSY,
which makes no sence. Make it able to work with equivalent pids.
Selftest is extended as well.

Repored-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
--- linux-2.6.git/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -58,22 +58,31 @@
 	return file;
 }
 
-static void access_unlock(struct task_struct *task)
+static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
 {
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	if (m2 > m1)
+		swap(m1, m2);
+
+	if (likely(m2 != m1))
+		mutex_unlock(m2);
+	mutex_unlock(m1);
 }
 
-static int access_trylock(struct task_struct *task)
+static int kcmp_lock(struct mutex *m1, struct mutex *m2)
 {
-	if (!mutex_trylock(&task->signal->cred_guard_mutex))
-		return -EBUSY;
+	int err;
+
+	if (m2 > m1)
+		swap(m1, m2);
 
-	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
-		mutex_unlock(&task->signal->cred_guard_mutex);
-		return -EPERM;
+	err = mutex_lock_killable(m1);
+	if (!err && likely(m1 != m2)) {
+		err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
+		if (err)
+			mutex_unlock(m1);
 	}
 
-	return 0;
+	return err;
 }
 
 SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
@@ -100,12 +109,15 @@
 	/*
 	 * One should have enough rights to inspect task details.
 	 */
-	ret = access_trylock(task1);
+	ret = kcmp_lock(&task1->signal->cred_guard_mutex,
+			&task2->signal->cred_guard_mutex);
 	if (ret)
 		goto err;
-	ret = access_trylock(task2);
-	if (ret)
+	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+		ret = -EPERM;
 		goto err_unlock;
+	}
 
 	switch (type) {
 	case KCMP_FILE: {
@@ -149,9 +161,9 @@
 		break;
 	}
 
-	access_unlock(task2);
 err_unlock:
-	access_unlock(task1);
+	kcmp_unlock(&task1->signal->cred_guard_mutex,
+		    &task2->signal->cred_guard_mutex);
 err:
 	put_task_struct(task1);
 	put_task_struct(task2);
diff -u linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
--- linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
+++ linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
@@ -74,6 +74,15 @@
 			ret = -1;
 		} else
 			printf("PASS: 0 returned as expected\n");
+
+		/* Compare with self */
+		ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0);
+		if (ret) {
+			printf("FAIL: 0 expected but %li returned\n", ret);
+			ret = -1;
+		} else
+			printf("PASS: 0 returned as expected\n");
+
 		exit(ret);
 	}
 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 16:49                             ` Cyrill Gorcunov
@ 2012-02-16 17:40                               ` Oleg Nesterov
  2012-02-16 17:58                                 ` Cyrill Gorcunov
  2012-02-16 18:21                               ` Vasiliy Kulikov
  1 sibling, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2012-02-16 17:40 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

On 02/16, Cyrill Gorcunov wrote:
>
> -static void access_unlock(struct task_struct *task)
> +static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
>  {
> -	mutex_unlock(&task->signal->cred_guard_mutex);
> +	if (m2 > m1)
> +		swap(m1, m2);

Well, the order doesn't matter in case of _unlock, you can remove
this part. Not that it really hurts though, I won't argue.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 17:40                               ` Oleg Nesterov
@ 2012-02-16 17:58                                 ` Cyrill Gorcunov
  2012-02-16 19:03                                   ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-16 17:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

On Thu, Feb 16, 2012 at 06:40:47PM +0100, Oleg Nesterov wrote:
> On 02/16, Cyrill Gorcunov wrote:
> >
> > -static void access_unlock(struct task_struct *task)
> > +static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
> >  {
> > -	mutex_unlock(&task->signal->cred_guard_mutex);
> > +	if (m2 > m1)
> > +		swap(m1, m2);
> 
> Well, the order doesn't matter in case of _unlock, you can remove
> this part. Not that it really hurts though, I won't argue.
> 

It drops some instructions so I think it worth removing (still
unlocking not in reverse order is something which always make
me nervious ;)

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 16:49                             ` Cyrill Gorcunov
  2012-02-16 17:40                               ` Oleg Nesterov
@ 2012-02-16 18:21                               ` Vasiliy Kulikov
  2012-02-16 18:34                                 ` Cyrill Gorcunov
  2012-02-16 18:49                                 ` Oleg Nesterov
  1 sibling, 2 replies; 49+ messages in thread
From: Vasiliy Kulikov @ 2012-02-16 18:21 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Oleg Nesterov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo,
	Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan,
	Valdis.Kletnieks, Michal Marek, Frederic Weisbecker,
	linux-kernel

On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote:
> +	err = mutex_lock_killable(m1);
> +	if (!err && likely(m1 != m2)) {
> +		err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);

Doesn't it lead to a deadlock?

mutex_lock_killable(task1)
|					mutex_lock_killable(task2)
mutex_lock_killable_nested(task2)	|
(locked)				|
					mutex_lock_killable_nested(task1)
					(locked)

I suppose you should use some global lock (kcmp_lock) before both locks.

Thanks,

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 18:34                                 ` Cyrill Gorcunov
@ 2012-02-16 18:33                                   ` Vasiliy Kulikov
  0 siblings, 0 replies; 49+ messages in thread
From: Vasiliy Kulikov @ 2012-02-16 18:33 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Oleg Nesterov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo,
	Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan,
	Valdis.Kletnieks, Michal Marek, Frederic Weisbecker,
	linux-kernel

On Thu, Feb 16, 2012 at 22:34 +0400, Cyrill Gorcunov wrote:
> On Thu, Feb 16, 2012 at 10:21:06PM +0400, Vasiliy Kulikov wrote:
> > On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote:
> > > +	err = mutex_lock_killable(m1);
> > > +	if (!err && likely(m1 != m2)) {
> > > +		err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
> > 
> > Doesn't it lead to a deadlock?
> > 
> > mutex_lock_killable(task1)
> > |					mutex_lock_killable(task2)
> > mutex_lock_killable_nested(task2)	|
> > (locked)				|
> > 					mutex_lock_killable_nested(task1)
> > 					(locked)
> > 
> > I suppose you should use some global lock (kcmp_lock) before both locks.
> 
> but here is if (m1 > m2) and swap() do take place.

Ah, ok.  Then this deadlock scenario is impossible, sorry.

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 18:21                               ` Vasiliy Kulikov
@ 2012-02-16 18:34                                 ` Cyrill Gorcunov
  2012-02-16 18:33                                   ` Vasiliy Kulikov
  2012-02-16 18:49                                 ` Oleg Nesterov
  1 sibling, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-16 18:34 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Oleg Nesterov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov,
	Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo,
	Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan,
	Valdis.Kletnieks, Michal Marek, Frederic Weisbecker,
	linux-kernel

On Thu, Feb 16, 2012 at 10:21:06PM +0400, Vasiliy Kulikov wrote:
> On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote:
> > +	err = mutex_lock_killable(m1);
> > +	if (!err && likely(m1 != m2)) {
> > +		err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
> 
> Doesn't it lead to a deadlock?
> 
> mutex_lock_killable(task1)
> |					mutex_lock_killable(task2)
> mutex_lock_killable_nested(task2)	|
> (locked)				|
> 					mutex_lock_killable_nested(task1)
> 					(locked)
> 
> I suppose you should use some global lock (kcmp_lock) before both locks.

but here is if (m1 > m2) and swap() do take place.

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 18:21                               ` Vasiliy Kulikov
  2012-02-16 18:34                                 ` Cyrill Gorcunov
@ 2012-02-16 18:49                                 ` Oleg Nesterov
  1 sibling, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2012-02-16 18:49 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Cyrill Gorcunov, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

On 02/16, Vasiliy Kulikov wrote:
>
> On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote:
> > +	err = mutex_lock_killable(m1);
> > +	if (!err && likely(m1 != m2)) {
> > +		err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
>
> Doesn't it lead to a deadlock?
>
> mutex_lock_killable(task1)
> |					mutex_lock_killable(task2)
> mutex_lock_killable_nested(task2)	|
> (locked)				|
> 					mutex_lock_killable_nested(task1)
> 					(locked)

Please note the

	if (m1 >= m2)
		swap(m1. m2)

at the start.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 17:58                                 ` Cyrill Gorcunov
@ 2012-02-16 19:03                                   ` Oleg Nesterov
  2012-02-16 19:20                                     ` H. Peter Anvin
  2012-02-16 19:29                                     ` Cyrill Gorcunov
  0 siblings, 2 replies; 49+ messages in thread
From: Oleg Nesterov @ 2012-02-16 19:03 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

On 02/16, Cyrill Gorcunov wrote:
>
> On Thu, Feb 16, 2012 at 06:40:47PM +0100, Oleg Nesterov wrote:
> > On 02/16, Cyrill Gorcunov wrote:
> > >
> > > -static void access_unlock(struct task_struct *task)
> > > +static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
> > >  {
> > > -	mutex_unlock(&task->signal->cred_guard_mutex);
> > > +	if (m2 > m1)
> > > +		swap(m1, m2);
> >
> > Well, the order doesn't matter in case of _unlock, you can remove
> > this part. Not that it really hurts though, I won't argue.
>
> It drops some instructions so I think it worth removing

Yes.

> (still
> unlocking not in reverse order is something which always make
> me nervious ;)

Yes ;)

so let me repeat, I am not arguing. But IMHO every piece of code
should be understandable. Personally I do not mind at all, just
I _personally_ think this code _can_ confuse the reader, "damn why
we can't simply unlock in any order???".

If you add the "not really needed" comment above this swap - I agree.
If you simply remove this swap - I agree as well.

But. I won't argue if you prefer to keep this patch as is. You are the
author. If it looks better to _you_ - OK, this is correct (afaics).

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 19:03                                   ` Oleg Nesterov
@ 2012-02-16 19:20                                     ` H. Peter Anvin
  2012-02-16 19:29                                     ` Cyrill Gorcunov
  1 sibling, 0 replies; 49+ messages in thread
From: H. Peter Anvin @ 2012-02-16 19:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Cyrill Gorcunov, Vasiliy Kulikov, Andrew Morton,
	Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa,
	Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

On 02/16/2012 11:03 AM, Oleg Nesterov wrote:
> 
> so let me repeat, I am not arguing. But IMHO every piece of code
> should be understandable.

Amen!

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 19:03                                   ` Oleg Nesterov
  2012-02-16 19:20                                     ` H. Peter Anvin
@ 2012-02-16 19:29                                     ` Cyrill Gorcunov
  2012-02-16 19:52                                       ` Andrew Morton
  1 sibling, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-16 19:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

On Thu, Feb 16, 2012 at 08:03:21PM +0100, Oleg Nesterov wrote:
> On 02/16, Cyrill Gorcunov wrote:
> >
> > On Thu, Feb 16, 2012 at 06:40:47PM +0100, Oleg Nesterov wrote:
> > > On 02/16, Cyrill Gorcunov wrote:
> > > >
> > > > -static void access_unlock(struct task_struct *task)
> > > > +static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
> > > >  {
> > > > -	mutex_unlock(&task->signal->cred_guard_mutex);
> > > > +	if (m2 > m1)
> > > > +		swap(m1, m2);
> > >
> > > Well, the order doesn't matter in case of _unlock, you can remove
> > > this part. Not that it really hurts though, I won't argue.
> >
> > It drops some instructions so I think it worth removing
> 
> Yes.
> 

Final one ;) I agreed on every line of your comment, thanks a lot Oleg!
---
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: syscalls, x86: Make __NR_kcmp to work with equivalent pids

In case if pid1 is equal to pid2 the kcmp will return -EBUSY,
which makes no sence. Make it able to work with equivalent pids.
Selftest is extended as well.

Repored-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
--- linux-2.6.git/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -58,22 +58,28 @@
 	return file;
 }
 
-static void access_unlock(struct task_struct *task)
+static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
 {
-	mutex_unlock(&task->signal->cred_guard_mutex);
+	if (likely(m2 != m1))
+		mutex_unlock(m2);
+	mutex_unlock(m1);
 }
 
-static int access_trylock(struct task_struct *task)
+static int kcmp_lock(struct mutex *m1, struct mutex *m2)
 {
-	if (!mutex_trylock(&task->signal->cred_guard_mutex))
-		return -EBUSY;
+	int err;
 
-	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
-		mutex_unlock(&task->signal->cred_guard_mutex);
-		return -EPERM;
+	if (m2 > m1)
+		swap(m1, m2);
+
+	err = mutex_lock_killable(m1);
+	if (!err && likely(m1 != m2)) {
+		err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
+		if (err)
+			mutex_unlock(m1);
 	}
 
-	return 0;
+	return err;
 }
 
 SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
@@ -100,12 +106,15 @@
 	/*
 	 * One should have enough rights to inspect task details.
 	 */
-	ret = access_trylock(task1);
+	ret = kcmp_lock(&task1->signal->cred_guard_mutex,
+			&task2->signal->cred_guard_mutex);
 	if (ret)
 		goto err;
-	ret = access_trylock(task2);
-	if (ret)
+	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+		ret = -EPERM;
 		goto err_unlock;
+	}
 
 	switch (type) {
 	case KCMP_FILE: {
@@ -149,9 +158,9 @@
 		break;
 	}
 
-	access_unlock(task2);
 err_unlock:
-	access_unlock(task1);
+	kcmp_unlock(&task1->signal->cred_guard_mutex,
+		    &task2->signal->cred_guard_mutex);
 err:
 	put_task_struct(task1);
 	put_task_struct(task2);
diff -u linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
--- linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
+++ linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
@@ -74,6 +74,15 @@
 			ret = -1;
 		} else
 			printf("PASS: 0 returned as expected\n");
+
+		/* Compare with self */
+		ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0);
+		if (ret) {
+			printf("FAIL: 0 expected but %li returned\n", ret);
+			ret = -1;
+		} else
+			printf("PASS: 0 returned as expected\n");
+
 		exit(ret);
 	}
 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 19:29                                     ` Cyrill Gorcunov
@ 2012-02-16 19:52                                       ` Andrew Morton
  2012-02-16 20:01                                         ` Cyrill Gorcunov
  0 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2012-02-16 19:52 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Oleg Nesterov, Vasiliy Kulikov, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

On Thu, 16 Feb 2012 23:29:52 +0400
Cyrill Gorcunov <gorcunov@openvz.org> wrote:

> Final one ;)

syscalls-x86-add-__nr_kcmp-syscall-v8-fix-fix-fix-fix-fix.patch
is a tie for the world record!

I folded everything back into a single patch.  Here's what we currently
have:


From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: syscalls, x86: add __NR_kcmp syscall

While doing the checkpoint-restore in the user space one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are
shared between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the
unshare syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g.  mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparison' syscall might be the best candidate.  So here is it --
__NR_kcmp.

It takes up to 5 arguments - the pids of the two tasks (which
characteristics should be compared), the comparison type and (in case of
comparison of files) two file descriptors.

Lookups for pids are done in the caller's PID namespace only.

At moment only x86 is supported and tested.

[akpm@linux-foundation.org: fix up selftests, warnings]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: Michal Marek <mmarek@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/syscalls/syscall_32.tbl         |    1 
 arch/x86/syscalls/syscall_64.tbl         |    1 
 include/linux/kcmp.h                     |   17 +
 include/linux/syscalls.h                 |    2 
 kernel/Makefile                          |    3 
 kernel/kcmp.c                            |  186 +++++++++++++++++++++
 kernel/sys_ni.c                          |    3 
 tools/testing/selftests/Makefile         |    2 
 tools/testing/selftests/kcmp/Makefile    |   29 +++
 tools/testing/selftests/kcmp/kcmp_test.c |   94 ++++++++++
 10 files changed, 337 insertions(+), 1 deletion(-)

diff -puN arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_32.tbl
--- a/arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/arch/x86/syscalls/syscall_32.tbl
@@ -355,3 +355,4 @@
 346	i386	setns			sys_setns
 347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
+349	i386	kcmp			sys_kcmp
diff -puN arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_64.tbl
--- a/arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/arch/x86/syscalls/syscall_64.tbl
@@ -318,3 +318,4 @@
 309	64	getcpu			sys_getcpu
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
+312	64	kcmp			sys_kcmp
diff -puN /dev/null include/linux/kcmp.h
--- /dev/null
+++ a/include/linux/kcmp.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_KCMP_H
+#define _LINUX_KCMP_H
+
+/* Comparison type */
+enum kcmp_type {
+	KCMP_FILE,
+	KCMP_VM,
+	KCMP_FILES,
+	KCMP_FS,
+	KCMP_SIGHAND,
+	KCMP_IO,
+	KCMP_SYSVSEM,
+
+	KCMP_TYPES,
+};
+
+#endif /* _LINUX_KCMP_H */
diff -puN include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8 include/linux/syscalls.h
--- a/include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/include/linux/syscalls.h
@@ -857,4 +857,6 @@ asmlinkage long sys_process_vm_writev(pi
 				      unsigned long riovcnt,
 				      unsigned long flags);
 
+asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
+			 unsigned long idx1, unsigned long idx2);
 #endif
diff -puN kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/Makefile
--- a/kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/kernel/Makefile
@@ -25,6 +25,9 @@ endif
 obj-y += sched/
 obj-y += power/
 
+ifeq ($(CONFIG_CHECKPOINT_RESTORE),y)
+obj-$(CONFIG_X86) += kcmp.o
+endif
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
diff -puN /dev/null kernel/kcmp.c
--- /dev/null
+++ a/kernel/kcmp.c
@@ -0,0 +1,186 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fdtable.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/cache.h>
+#include <linux/bug.h>
+#include <linux/err.h>
+#include <linux/kcmp.h>
+
+#include <asm/unistd.h>
+
+/*
+ * We don't expose real in-memory order of objects for security
+ * reasons, still the comparison results should be suitable for
+ * sorting. Thus, we obfuscate kernel pointers values and compare
+ * the production instead.
+ */
+static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
+
+static long kptr_obfuscate(long v, int type)
+{
+	return (v ^ cookies[type][0]) * cookies[type][1];
+}
+
+/*
+ * 0 - equal, i.e. v1 = v2
+ * 1 - less than, i.e. v1 < v2
+ * 2 - greater than, i.e. v1 > v2
+ * 3 - not equal but ordering unavailable (reserved for future)
+ */
+static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
+{
+	long ret;
+
+	ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type);
+
+	return (ret < 0) | ((ret > 0) << 1);
+}
+
+/* The caller must have pinned the task */
+static struct file *
+get_file_raw_ptr(struct task_struct *task, unsigned int idx)
+{
+	struct file *file = NULL;
+
+	task_lock(task);
+	rcu_read_lock();
+
+	if (task->files)
+		file = fcheck_files(task->files, idx);
+
+	rcu_read_unlock();
+	task_unlock(task);
+
+	return file;
+}
+
+static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
+{
+	if (likely(m2 != m1))
+		mutex_unlock(m2);
+	mutex_unlock(m1);
+}
+
+static int kcmp_lock(struct mutex *m1, struct mutex *m2)
+{
+	int err;
+
+	if (m2 > m1)
+		swap(m1, m2);
+
+	err = mutex_lock_killable(m1);
+	if (!err && likely(m1 != m2)) {
+		err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
+		if (err)
+			mutex_unlock(m1);
+	}
+
+	return err;
+}
+
+SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
+		unsigned long, idx1, unsigned long, idx2)
+{
+	struct task_struct *task1, *task2;
+	int ret;
+
+	rcu_read_lock();
+
+	/*
+	 * Tasks are looked up in caller's PID namespace only.
+	 */
+	task1 = find_task_by_vpid(pid1);
+	task2 = find_task_by_vpid(pid2);
+	if (!task1 || !task2)
+		goto err_no_task;
+
+	get_task_struct(task1);
+	get_task_struct(task2);
+
+	rcu_read_unlock();
+
+	/*
+	 * One should have enough rights to inspect task details.
+	 */
+	ret = kcmp_lock(&task1->signal->cred_guard_mutex,
+			&task2->signal->cred_guard_mutex);
+	if (ret)
+		goto err;
+	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+		ret = -EPERM;
+		goto err_unlock;
+	}
+
+	switch (type) {
+	case KCMP_FILE: {
+		struct file *filp1, *filp2;
+
+		filp1 = get_file_raw_ptr(task1, idx1);
+		filp2 = get_file_raw_ptr(task2, idx2);
+
+		if (filp1 && filp2)
+			ret = kcmp_ptr(filp1, filp2, KCMP_FILE);
+		else
+			ret = -EBADF;
+		break;
+	}
+	case KCMP_VM:
+		ret = kcmp_ptr(task1->mm, task2->mm, KCMP_VM);
+		break;
+	case KCMP_FILES:
+		ret = kcmp_ptr(task1->files, task2->files, KCMP_FILES);
+		break;
+	case KCMP_FS:
+		ret = kcmp_ptr(task1->fs, task2->fs, KCMP_FS);
+		break;
+	case KCMP_SIGHAND:
+		ret = kcmp_ptr(task1->sighand, task2->sighand, KCMP_SIGHAND);
+		break;
+	case KCMP_IO:
+		ret = kcmp_ptr(task1->io_context, task2->io_context, KCMP_IO);
+		break;
+	case KCMP_SYSVSEM:
+#ifdef CONFIG_SYSVIPC
+		ret = kcmp_ptr(task1->sysvsem.undo_list,
+			       task2->sysvsem.undo_list,
+			       KCMP_SYSVSEM);
+#else
+		ret = -EOPNOTSUP;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+err_unlock:
+	kcmp_unlock(&task1->signal->cred_guard_mutex,
+		    &task2->signal->cred_guard_mutex);
+err:
+	put_task_struct(task1);
+	put_task_struct(task2);
+
+	return ret;
+
+err_no_task:
+	rcu_read_unlock();
+	return -ESRCH;
+}
+
+static __init int kcmp_cookies_init(void)
+{
+	int i;
+
+	get_random_bytes(cookies, sizeof(cookies));
+
+	for (i = 0; i < KCMP_TYPES; i++)
+		cookies[i][1] |= (~(~0UL >>  1) | 1);
+
+	return 0;
+}
+arch_initcall(kcmp_cookies_init);
diff -puN kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/sys_ni.c
--- a/kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/kernel/sys_ni.c
@@ -203,3 +203,6 @@ cond_syscall(sys_fanotify_mark);
 cond_syscall(sys_name_to_handle_at);
 cond_syscall(sys_open_by_handle_at);
 cond_syscall(compat_sys_open_by_handle_at);
+
+/* compare kernel pointers */
+cond_syscall(sys_kcmp);
diff -puN /dev/null tools/testing/selftests/kcmp/Makefile
--- /dev/null
+++ a/tools/testing/selftests/kcmp/Makefile
@@ -0,0 +1,29 @@
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+        ARCH := X86
+	CFLAGS := -DCONFIG_X86_32 -D__i386__
+endif
+ifeq ($(ARCH),x86_64)
+	ARCH := X86
+	CFLAGS := -DCONFIG_X86_64 -D__x86_64__
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+CFLAGS += -I../../../../usr/include/
+CFLAGS += -I../../../../arch/x86/include/
+
+all:
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) kcmp_test.c -o run_test
+else
+	echo "Not an x86 target, can't build kcmp selftest"
+endif
+
+run-tests: all
+	./kcmp_test
+
+clean:
+	rm -fr ./run_test
+	rm -fr ./test-file
diff -puN /dev/null tools/testing/selftests/kcmp/kcmp_test.c
--- /dev/null
+++ a/tools/testing/selftests/kcmp/kcmp_test.c
@@ -0,0 +1,94 @@
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <limits.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <fcntl.h>
+
+#include <linux/unistd.h>
+#include <linux/kcmp.h>
+
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+
+static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2)
+{
+	return syscall(__NR_kcmp, pid1, pid2, type, fd1, fd2);
+}
+
+int main(int argc, char **argv)
+{
+	const char kpath[] = "kcmp-test-file";
+	int pid1, pid2;
+	int fd1, fd2;
+	int status;
+
+	fd1 = open(kpath, O_RDWR | O_CREAT | O_TRUNC, 0644);
+	pid1 = getpid();
+
+	if (fd1 < 0) {
+		perror("Can't create file");
+		exit(1);
+	}
+
+	pid2 = fork();
+	if (pid2 < 0) {
+		perror("fork failed");
+		exit(1);
+	}
+
+	if (!pid2) {
+		int pid2 = getpid();
+		int ret;
+
+		fd2 = open(kpath, O_RDWR, 0644);
+		if (fd2 < 0) {
+			perror("Can't open file");
+			exit(1);
+		}
+
+		/* An example of output and arguments */
+		printf("pid1: %6d pid2: %6d FD: %2ld FILES: %2ld VM: %2ld "
+		       "FS: %2ld SIGHAND: %2ld IO: %2ld SYSVSEM: %2ld "
+		       "INV: %2ld\n",
+		       pid1, pid2,
+		       sys_kcmp(pid1, pid2, KCMP_FILE,		fd1, fd2),
+		       sys_kcmp(pid1, pid2, KCMP_FILES,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_VM,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_FS,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_SIGHAND,	0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_IO,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_SYSVSEM,	0, 0),
+
+			/* This one should fail */
+		       sys_kcmp(pid1, pid2, KCMP_TYPES + 1,	0, 0));
+
+		/* This one should return same fd */
+		ret = sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd1);
+		if (ret) {
+			printf("FAIL: 0 expected but %d returned\n", ret);
+			ret = -1;
+		} else
+			printf("PASS: 0 returned as expected\n");
+
+		/* Compare with self */
+		ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0);
+		if (ret) {
+			printf("FAIL: 0 expected but %li returned\n", ret);
+			ret = -1;
+		} else
+			printf("PASS: 0 returned as expected\n");
+
+		exit(ret);
+	}
+
+	waitpid(pid2, &status, P_ALL);
+
+	return 0;
+}
diff -puN tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 tools/testing/selftests/Makefile
--- a/tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/tools/testing/selftests/Makefile
@@ -1,4 +1,4 @@
-TARGETS = breakpoints vm
+TARGETS = breakpoints vm kcmp
 
 all:
 	for TARGET in $(TARGETS); do \
_


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-16 19:52                                       ` Andrew Morton
@ 2012-02-16 20:01                                         ` Cyrill Gorcunov
  0 siblings, 0 replies; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-02-16 20:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Vasiliy Kulikov, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel

On Thu, Feb 16, 2012 at 11:52:08AM -0800, Andrew Morton wrote:
> On Thu, 16 Feb 2012 23:29:52 +0400
> Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> 
> > Final one ;)
> 
> syscalls-x86-add-__nr_kcmp-syscall-v8-fix-fix-fix-fix-fix.patch
> is a tie for the world record!

It was close to v13 ;)

> 
> I folded everything back into a single patch.  Here's what we currently
> have:
>

Thanks a lot, Andrew!

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-02-15 16:27   ` Cyrill Gorcunov
@ 2012-04-09 22:10     ` Andrew Morton
  2012-04-09 22:24       ` Cyrill Gorcunov
                         ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Andrew Morton @ 2012-04-09 22:10 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, linux-kernel, Oleg Nesterov,
	Jonathan Corbet, H. Peter Anvin

Back on to kcmp.

On Wed, 15 Feb 2012 20:27:52 +0400
Cyrill Gorcunov <gorcunov@openvz.org> wrote:

> On Wed, Feb 15, 2012 at 05:06:52PM +0100, Oleg Nesterov wrote:
> > Not a comment, but the question. I am just curious...
> > 
> > > +/*
> > > + * We don't expose real in-memory order of objects for security
> > > + * reasons, still the comparison results should be suitable for
> > > + * sorting. Thus, we obfuscate kernel pointers values and compare
> > > + * the production instead.
> > > + */
> > > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> > > +
> > > +static long kptr_obfuscate(long v, int type)
> > > +{
> > > +       return (v ^ cookies[type][0]) * cookies[type][1];
> > > +}
> > 
> > OK, but why do we need this per type? Just to add more obfuscation
> > or there is another reason?
> 
> Just to add more obfuscation.

Having re-read most of the (enormous) email discussion on the kcmp()
syscall patch, I'm thinking:

- Nobody seems to understand the obfuscation logic.  Jon sounded
  confused, Oleg sounds confused and it's rather unclear what it does,
  how it does it and why it does it.

- Lots of people have looked at the code and made comments and there
  have been lots of changes.  But we presently have zero Acked-by's and
  Reviewed-by's.

I guess this means that at present nobody is aware of any issues with
the proposal, btu nobody is terribly excisted about it either?

So what do people think?  Any issues?  Any nacks?  Should I sneak it
into Linus this week or do we need to go another round with it all?

I'd like to at least have a fighting chance of understnading what's
going on with that obfuscation code.



From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: syscalls, x86: add __NR_kcmp syscall

While doing the checkpoint-restore in the user space one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are
shared between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the
unshare syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g.  mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparison' syscall might be the best candidate.  So here is it --
__NR_kcmp.

It takes up to 5 arguments - the pids of the two tasks (which
characteristics should be compared), the comparison type and (in case of
comparison of files) two file descriptors.

Lookups for pids are done in the caller's PID namespace only.

At moment only x86 is supported and tested.

[akpm@linux-foundation.org: fix up selftests, warnings]
[akpm@linux-foundation.org: include errno.h]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: Michal Marek <mmarek@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/syscalls/syscall_32.tbl         |    1 
 arch/x86/syscalls/syscall_64.tbl         |    1 
 include/linux/kcmp.h                     |   17 +
 include/linux/syscalls.h                 |    2 
 kernel/Makefile                          |    3 
 kernel/kcmp.c                            |  187 +++++++++++++++++++++
 kernel/sys_ni.c                          |    3 
 tools/testing/selftests/Makefile         |    2 
 tools/testing/selftests/kcmp/Makefile    |   29 +++
 tools/testing/selftests/kcmp/kcmp_test.c |   94 ++++++++++
 10 files changed, 338 insertions(+), 1 deletion(-)

diff -puN arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_32.tbl
--- a/arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/arch/x86/syscalls/syscall_32.tbl
@@ -355,3 +355,4 @@
 346	i386	setns			sys_setns
 347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
+349	i386	kcmp			sys_kcmp
diff -puN arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_64.tbl
--- a/arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/arch/x86/syscalls/syscall_64.tbl
@@ -318,6 +318,7 @@
 309	common	getcpu			sys_getcpu
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
+312	64	kcmp			sys_kcmp
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
 # for native 64-bit operation.
diff -puN /dev/null include/linux/kcmp.h
--- /dev/null
+++ a/include/linux/kcmp.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_KCMP_H
+#define _LINUX_KCMP_H
+
+/* Comparison type */
+enum kcmp_type {
+	KCMP_FILE,
+	KCMP_VM,
+	KCMP_FILES,
+	KCMP_FS,
+	KCMP_SIGHAND,
+	KCMP_IO,
+	KCMP_SYSVSEM,
+
+	KCMP_TYPES,
+};
+
+#endif /* _LINUX_KCMP_H */
diff -puN include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8 include/linux/syscalls.h
--- a/include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/include/linux/syscalls.h
@@ -858,4 +858,6 @@ asmlinkage long sys_process_vm_writev(pi
 				      unsigned long riovcnt,
 				      unsigned long flags);
 
+asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
+			 unsigned long idx1, unsigned long idx2);
 #endif
diff -puN kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/Makefile
--- a/kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/kernel/Makefile
@@ -25,6 +25,9 @@ endif
 obj-y += sched/
 obj-y += power/
 
+ifeq ($(CONFIG_CHECKPOINT_RESTORE),y)
+obj-$(CONFIG_X86) += kcmp.o
+endif
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
diff -puN /dev/null kernel/kcmp.c
--- /dev/null
+++ a/kernel/kcmp.c
@@ -0,0 +1,187 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fdtable.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/cache.h>
+#include <linux/bug.h>
+#include <linux/err.h>
+#include <linux/kcmp.h>
+
+#include <asm/unistd.h>
+
+/*
+ * We don't expose real in-memory order of objects for security
+ * reasons, still the comparison results should be suitable for
+ * sorting. Thus, we obfuscate kernel pointers values and compare
+ * the production instead.
+ */
+static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
+
+static long kptr_obfuscate(long v, int type)
+{
+	return (v ^ cookies[type][0]) * cookies[type][1];
+}
+
+/*
+ * 0 - equal, i.e. v1 = v2
+ * 1 - less than, i.e. v1 < v2
+ * 2 - greater than, i.e. v1 > v2
+ * 3 - not equal but ordering unavailable (reserved for future)
+ */
+static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
+{
+	long ret;
+
+	ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type);
+
+	return (ret < 0) | ((ret > 0) << 1);
+}
+
+/* The caller must have pinned the task */
+static struct file *
+get_file_raw_ptr(struct task_struct *task, unsigned int idx)
+{
+	struct file *file = NULL;
+
+	task_lock(task);
+	rcu_read_lock();
+
+	if (task->files)
+		file = fcheck_files(task->files, idx);
+
+	rcu_read_unlock();
+	task_unlock(task);
+
+	return file;
+}
+
+static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
+{
+	if (likely(m2 != m1))
+		mutex_unlock(m2);
+	mutex_unlock(m1);
+}
+
+static int kcmp_lock(struct mutex *m1, struct mutex *m2)
+{
+	int err;
+
+	if (m2 > m1)
+		swap(m1, m2);
+
+	err = mutex_lock_killable(m1);
+	if (!err && likely(m1 != m2)) {
+		err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
+		if (err)
+			mutex_unlock(m1);
+	}
+
+	return err;
+}
+
+SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
+		unsigned long, idx1, unsigned long, idx2)
+{
+	struct task_struct *task1, *task2;
+	int ret;
+
+	rcu_read_lock();
+
+	/*
+	 * Tasks are looked up in caller's PID namespace only.
+	 */
+	task1 = find_task_by_vpid(pid1);
+	task2 = find_task_by_vpid(pid2);
+	if (!task1 || !task2)
+		goto err_no_task;
+
+	get_task_struct(task1);
+	get_task_struct(task2);
+
+	rcu_read_unlock();
+
+	/*
+	 * One should have enough rights to inspect task details.
+	 */
+	ret = kcmp_lock(&task1->signal->cred_guard_mutex,
+			&task2->signal->cred_guard_mutex);
+	if (ret)
+		goto err;
+	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+		ret = -EPERM;
+		goto err_unlock;
+	}
+
+	switch (type) {
+	case KCMP_FILE: {
+		struct file *filp1, *filp2;
+
+		filp1 = get_file_raw_ptr(task1, idx1);
+		filp2 = get_file_raw_ptr(task2, idx2);
+
+		if (filp1 && filp2)
+			ret = kcmp_ptr(filp1, filp2, KCMP_FILE);
+		else
+			ret = -EBADF;
+		break;
+	}
+	case KCMP_VM:
+		ret = kcmp_ptr(task1->mm, task2->mm, KCMP_VM);
+		break;
+	case KCMP_FILES:
+		ret = kcmp_ptr(task1->files, task2->files, KCMP_FILES);
+		break;
+	case KCMP_FS:
+		ret = kcmp_ptr(task1->fs, task2->fs, KCMP_FS);
+		break;
+	case KCMP_SIGHAND:
+		ret = kcmp_ptr(task1->sighand, task2->sighand, KCMP_SIGHAND);
+		break;
+	case KCMP_IO:
+		ret = kcmp_ptr(task1->io_context, task2->io_context, KCMP_IO);
+		break;
+	case KCMP_SYSVSEM:
+#ifdef CONFIG_SYSVIPC
+		ret = kcmp_ptr(task1->sysvsem.undo_list,
+			       task2->sysvsem.undo_list,
+			       KCMP_SYSVSEM);
+#else
+		ret = -EOPNOTSUPP;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+err_unlock:
+	kcmp_unlock(&task1->signal->cred_guard_mutex,
+		    &task2->signal->cred_guard_mutex);
+err:
+	put_task_struct(task1);
+	put_task_struct(task2);
+
+	return ret;
+
+err_no_task:
+	rcu_read_unlock();
+	return -ESRCH;
+}
+
+static __init int kcmp_cookies_init(void)
+{
+	int i;
+
+	get_random_bytes(cookies, sizeof(cookies));
+
+	for (i = 0; i < KCMP_TYPES; i++)
+		cookies[i][1] |= (~(~0UL >>  1) | 1);
+
+	return 0;
+}
+arch_initcall(kcmp_cookies_init);
diff -puN kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/sys_ni.c
--- a/kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/kernel/sys_ni.c
@@ -203,3 +203,6 @@ cond_syscall(sys_fanotify_mark);
 cond_syscall(sys_name_to_handle_at);
 cond_syscall(sys_open_by_handle_at);
 cond_syscall(compat_sys_open_by_handle_at);
+
+/* compare kernel pointers */
+cond_syscall(sys_kcmp);
diff -puN tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 tools/testing/selftests/Makefile
--- a/tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/tools/testing/selftests/Makefile
@@ -1,4 +1,4 @@
-TARGETS = breakpoints vm
+TARGETS = breakpoints vm kcmp
 
 all:
 	for TARGET in $(TARGETS); do \
diff -puN /dev/null tools/testing/selftests/kcmp/Makefile
--- /dev/null
+++ a/tools/testing/selftests/kcmp/Makefile
@@ -0,0 +1,29 @@
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+        ARCH := X86
+	CFLAGS := -DCONFIG_X86_32 -D__i386__
+endif
+ifeq ($(ARCH),x86_64)
+	ARCH := X86
+	CFLAGS := -DCONFIG_X86_64 -D__x86_64__
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+CFLAGS += -I../../../../usr/include/
+CFLAGS += -I../../../../arch/x86/include/
+
+all:
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) kcmp_test.c -o run_test
+else
+	echo "Not an x86 target, can't build kcmp selftest"
+endif
+
+run-tests: all
+	./kcmp_test
+
+clean:
+	rm -fr ./run_test
+	rm -fr ./test-file
diff -puN /dev/null tools/testing/selftests/kcmp/kcmp_test.c
--- /dev/null
+++ a/tools/testing/selftests/kcmp/kcmp_test.c
@@ -0,0 +1,94 @@
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <limits.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <fcntl.h>
+
+#include <linux/unistd.h>
+#include <linux/kcmp.h>
+
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+
+static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2)
+{
+	return syscall(__NR_kcmp, pid1, pid2, type, fd1, fd2);
+}
+
+int main(int argc, char **argv)
+{
+	const char kpath[] = "kcmp-test-file";
+	int pid1, pid2;
+	int fd1, fd2;
+	int status;
+
+	fd1 = open(kpath, O_RDWR | O_CREAT | O_TRUNC, 0644);
+	pid1 = getpid();
+
+	if (fd1 < 0) {
+		perror("Can't create file");
+		exit(1);
+	}
+
+	pid2 = fork();
+	if (pid2 < 0) {
+		perror("fork failed");
+		exit(1);
+	}
+
+	if (!pid2) {
+		int pid2 = getpid();
+		int ret;
+
+		fd2 = open(kpath, O_RDWR, 0644);
+		if (fd2 < 0) {
+			perror("Can't open file");
+			exit(1);
+		}
+
+		/* An example of output and arguments */
+		printf("pid1: %6d pid2: %6d FD: %2ld FILES: %2ld VM: %2ld "
+		       "FS: %2ld SIGHAND: %2ld IO: %2ld SYSVSEM: %2ld "
+		       "INV: %2ld\n",
+		       pid1, pid2,
+		       sys_kcmp(pid1, pid2, KCMP_FILE,		fd1, fd2),
+		       sys_kcmp(pid1, pid2, KCMP_FILES,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_VM,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_FS,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_SIGHAND,	0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_IO,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_SYSVSEM,	0, 0),
+
+			/* This one should fail */
+		       sys_kcmp(pid1, pid2, KCMP_TYPES + 1,	0, 0));
+
+		/* This one should return same fd */
+		ret = sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd1);
+		if (ret) {
+			printf("FAIL: 0 expected but %d returned\n", ret);
+			ret = -1;
+		} else
+			printf("PASS: 0 returned as expected\n");
+
+		/* Compare with self */
+		ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0);
+		if (ret) {
+			printf("FAIL: 0 expected but %li returned\n", ret);
+			ret = -1;
+		} else
+			printf("PASS: 0 returned as expected\n");
+
+		exit(ret);
+	}
+
+	waitpid(pid2, &status, P_ALL);
+
+	return 0;
+}
_



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-09 22:10     ` Andrew Morton
@ 2012-04-09 22:24       ` Cyrill Gorcunov
  2012-04-09 23:22         ` H. Peter Anvin
  2012-04-10  3:25       ` Eric W. Biederman
  2012-04-10 23:58       ` Valdis.Kletnieks
  2 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-04-09 22:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet

On Mon, Apr 09, 2012 at 03:10:27PM -0700, Andrew Morton wrote:
> Back on to kcmp.
> 
> On Wed, 15 Feb 2012 20:27:52 +0400
> Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> 
> > On Wed, Feb 15, 2012 at 05:06:52PM +0100, Oleg Nesterov wrote:
> > > Not a comment, but the question. I am just curious...
> > > 
> > > > +/*
> > > > + * We don't expose real in-memory order of objects for security
> > > > + * reasons, still the comparison results should be suitable for
> > > > + * sorting. Thus, we obfuscate kernel pointers values and compare
> > > > + * the production instead.
> > > > + */
> > > > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> > > > +
> > > > +static long kptr_obfuscate(long v, int type)
> > > > +{
> > > > +       return (v ^ cookies[type][0]) * cookies[type][1];
> > > > +}
> > > 
> > > OK, but why do we need this per type? Just to add more obfuscation
> > > or there is another reason?
> > 
> > Just to add more obfuscation.
> 
> Having re-read most of the (enormous) email discussion on the kcmp()
> syscall patch, I'm thinking:
> 
> - Nobody seems to understand the obfuscation logic.  Jon sounded
>   confused, Oleg sounds confused and it's rather unclear what it does,
>   how it does it and why it does it.

The obfuscation logic was done with great help from hpa@. And the main
idea was to have ordered results after obfuscation. Per-type noise increase
randomization of results. So Andrew, I actually dont know what to add
here. We don't want to provide kernel order back to user-space in
naked manner.

> 
> - Lots of people have looked at the code and made comments and there
>   have been lots of changes.  But we presently have zero Acked-by's and
>   Reviewed-by's.
> 

I guess I can ask hpa@ and Eric for Reviewed-by or Acked-by tag?

> I guess this means that at present nobody is aware of any issues with
> the proposal, btu nobody is terribly excisted about it either?
> 

I would rather say not much people yet use it.

> So what do people think?  Any issues?  Any nacks?  Should I sneak it
> into Linus this week or do we need to go another round with it all?
> 
> I'd like to at least have a fighting chance of understnading what's
> going on with that obfuscation code.

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-09 22:24       ` Cyrill Gorcunov
@ 2012-04-09 23:22         ` H. Peter Anvin
  2012-04-10 22:37           ` Cyrill Gorcunov
  2012-04-11  0:02           ` Valdis.Kletnieks
  0 siblings, 2 replies; 49+ messages in thread
From: H. Peter Anvin @ 2012-04-09 23:22 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov,
	Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet

On 04/09/2012 03:24 PM, Cyrill Gorcunov wrote:
>>
>> Having re-read most of the (enormous) email discussion on the kcmp()
>> syscall patch, I'm thinking:
>>
>> - Nobody seems to understand the obfuscation logic.  Jon sounded
>>   confused, Oleg sounds confused and it's rather unclear what it does,
>>   how it does it and why it does it.
> 
> The obfuscation logic was done with great help from hpa@. And the main
> idea was to have ordered results after obfuscation. Per-type noise increase
> randomization of results. So Andrew, I actually dont know what to add
> here. We don't want to provide kernel order back to user-space in
> naked manner.
> 

The obfuscation logic is to provide a 1:1 mapping but which doesn't
preserve ordering, thereby avoid leaking information of kernel pointers
to user space.

	-hpa


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-09 22:10     ` Andrew Morton
  2012-04-09 22:24       ` Cyrill Gorcunov
@ 2012-04-10  3:25       ` Eric W. Biederman
  2012-04-10 22:54         ` Cyrill Gorcunov
  2012-04-10 23:58       ` Valdis.Kletnieks
  2 siblings, 1 reply; 49+ messages in thread
From: Eric W. Biederman @ 2012-04-10  3:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Oleg Nesterov, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet

Andrew Morton <akpm@linux-foundation.org> writes:

> Back on to kcmp.
>
> On Wed, 15 Feb 2012 20:27:52 +0400
> Cyrill Gorcunov <gorcunov@openvz.org> wrote:
>
>> On Wed, Feb 15, 2012 at 05:06:52PM +0100, Oleg Nesterov wrote:
>> > Not a comment, but the question. I am just curious...
>> > 
>> > > +/*
>> > > + * We don't expose real in-memory order of objects for security
>> > > + * reasons, still the comparison results should be suitable for
>> > > + * sorting. Thus, we obfuscate kernel pointers values and compare
>> > > + * the production instead.
>> > > + */
>> > > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
>> > > +
>> > > +static long kptr_obfuscate(long v, int type)
>> > > +{
>> > > +       return (v ^ cookies[type][0]) * cookies[type][1];
>> > > +}
>> > 
>> > OK, but why do we need this per type? Just to add more obfuscation
>> > or there is another reason?
>> 
>> Just to add more obfuscation.
>
> Having re-read most of the (enormous) email discussion on the kcmp()
> syscall patch, I'm thinking:
>
> - Nobody seems to understand the obfuscation logic.  Jon sounded
>   confused, Oleg sounds confused and it's rather unclear what it does,
>   how it does it and why it does it.

Peter explained it fairly well earlier.  The xor trivially makes sense
to me.  I don't recall what the multiplication does.

It would be nice if someone would get Peter's comment on why the
multiply into a comment.  But obscuring things makes sense.

> - Lots of people have looked at the code and made comments and there
>   have been lots of changes.  But we presently have zero Acked-by's and
>   Reviewed-by's.
>
> I guess this means that at present nobody is aware of any issues with
> the proposal, btu nobody is terribly excisted about it either?

Having just read through it again the only possible issue I can see is
that we compare file descriptors after dropping all of the locks.

Since rcu_read_lock doesn't participate in ABBA deadlocks. My gut feel
is that we should hold rcu_read_lock across the hole file pointer
comparison to remove the possibility of races as file descriptor
pointers go away.

Still in practice I don't think it matters.  At worst there is the
slightest possibility of returning a value instead of -EBADF.  The
expectation is for all of the tasks we are operating on to be frozen,
and even if the tasks are not frozen it is a very tiny window for a race
to be in.

> So what do people think?  Any issues?  Any nacks?  Should I sneak it
> into Linus this week or do we need to go another round with it all?

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

> I'd like to at least have a fighting chance of understnading what's
> going on with that obfuscation code.

My gut feel is that this code is good enough.  Any and all security
issues have been addressed.  Having the system call would seem
to add momentum to the people working on checkpoint/restart.
So I don't see why this patch should not go forward, unless someone
can point out an outright bug.

Eric

> From: Cyrill Gorcunov <gorcunov@openvz.org>
> Subject: syscalls, x86: add __NR_kcmp syscall
>
> While doing the checkpoint-restore in the user space one need to determine
> whether various kernel objects (like mm_struct-s of file_struct-s) are
> shared between tasks and restore this state.
>
> The 2nd step can be solved by using appropriate CLONE_ flags and the
> unshare syscall, while there's currently no ways for solving the 1st one.
>
> One of the ways for checking whether two tasks share e.g.  mm_struct is to
> provide some mm_struct ID of a task to its proc file, but showing such
> info considered to be not that good for security reasons.
>
> Thus after some debates we end up in conclusion that using that named
> 'comparison' syscall might be the best candidate.  So here is it --
> __NR_kcmp.
>
> It takes up to 5 arguments - the pids of the two tasks (which
> characteristics should be compared), the comparison type and (in case of
> comparison of files) two file descriptors.
>
> Lookups for pids are done in the caller's PID namespace only.
>
> At moment only x86 is supported and tested.
>
> [akpm@linux-foundation.org: fix up selftests, warnings]
> [akpm@linux-foundation.org: include errno.h]
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Andrey Vagin <avagin@openvz.org>
> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Glauber Costa <glommer@parallels.com>
> Cc: Andi Kleen <andi@firstfloor.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Matt Helsley <matthltc@us.ibm.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Vasiliy Kulikov <segoon@openwall.com>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Valdis.Kletnieks@vt.edu
> Cc: Michal Marek <mmarek@suse.cz>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  arch/x86/syscalls/syscall_32.tbl         |    1 
>  arch/x86/syscalls/syscall_64.tbl         |    1 
>  include/linux/kcmp.h                     |   17 +
>  include/linux/syscalls.h                 |    2 
>  kernel/Makefile                          |    3 
>  kernel/kcmp.c                            |  187 +++++++++++++++++++++
>  kernel/sys_ni.c                          |    3 
>  tools/testing/selftests/Makefile         |    2 
>  tools/testing/selftests/kcmp/Makefile    |   29 +++
>  tools/testing/selftests/kcmp/kcmp_test.c |   94 ++++++++++
>  10 files changed, 338 insertions(+), 1 deletion(-)
>
> diff -puN arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_32.tbl
> --- a/arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8
> +++ a/arch/x86/syscalls/syscall_32.tbl
> @@ -355,3 +355,4 @@
>  346	i386	setns			sys_setns
>  347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
>  348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
> +349	i386	kcmp			sys_kcmp
> diff -puN arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_64.tbl
> --- a/arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8
> +++ a/arch/x86/syscalls/syscall_64.tbl
> @@ -318,6 +318,7 @@
>  309	common	getcpu			sys_getcpu
>  310	64	process_vm_readv	sys_process_vm_readv
>  311	64	process_vm_writev	sys_process_vm_writev
> +312	64	kcmp			sys_kcmp
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
>  # for native 64-bit operation.
> diff -puN /dev/null include/linux/kcmp.h
> --- /dev/null
> +++ a/include/linux/kcmp.h
> @@ -0,0 +1,17 @@
> +#ifndef _LINUX_KCMP_H
> +#define _LINUX_KCMP_H
> +
> +/* Comparison type */
> +enum kcmp_type {
> +	KCMP_FILE,
> +	KCMP_VM,
> +	KCMP_FILES,
> +	KCMP_FS,
> +	KCMP_SIGHAND,
> +	KCMP_IO,
> +	KCMP_SYSVSEM,
> +
> +	KCMP_TYPES,
> +};
> +
> +#endif /* _LINUX_KCMP_H */
> diff -puN include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8 include/linux/syscalls.h
> --- a/include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8
> +++ a/include/linux/syscalls.h
> @@ -858,4 +858,6 @@ asmlinkage long sys_process_vm_writev(pi
>  				      unsigned long riovcnt,
>  				      unsigned long flags);
>  
> +asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
> +			 unsigned long idx1, unsigned long idx2);
>  #endif
> diff -puN kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/Makefile
> --- a/kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8
> +++ a/kernel/Makefile
> @@ -25,6 +25,9 @@ endif
>  obj-y += sched/
>  obj-y += power/
>  
> +ifeq ($(CONFIG_CHECKPOINT_RESTORE),y)
> +obj-$(CONFIG_X86) += kcmp.o
> +endif
>  obj-$(CONFIG_FREEZER) += freezer.o
>  obj-$(CONFIG_PROFILING) += profile.o
>  obj-$(CONFIG_STACKTRACE) += stacktrace.o
> diff -puN /dev/null kernel/kcmp.c
> --- /dev/null
> +++ a/kernel/kcmp.c
> @@ -0,0 +1,187 @@
> +#include <linux/kernel.h>
> +#include <linux/syscalls.h>
> +#include <linux/fdtable.h>
> +#include <linux/string.h>
> +#include <linux/random.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/errno.h>
> +#include <linux/cache.h>
> +#include <linux/bug.h>
> +#include <linux/err.h>
> +#include <linux/kcmp.h>
> +
> +#include <asm/unistd.h>
> +
> +/*
> + * We don't expose real in-memory order of objects for security
> + * reasons, still the comparison results should be suitable for
> + * sorting. Thus, we obfuscate kernel pointers values and compare
> + * the production instead.
> + */
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> +
> +static long kptr_obfuscate(long v, int type)
> +{
> +	return (v ^ cookies[type][0]) * cookies[type][1];
> +}
> +
> +/*
> + * 0 - equal, i.e. v1 = v2
> + * 1 - less than, i.e. v1 < v2
> + * 2 - greater than, i.e. v1 > v2
> + * 3 - not equal but ordering unavailable (reserved for future)
> + */
> +static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
> +{
> +	long ret;
> +
> +	ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type);
> +
> +	return (ret < 0) | ((ret > 0) << 1);
> +}
> +
> +/* The caller must have pinned the task */
> +static struct file *
> +get_file_raw_ptr(struct task_struct *task, unsigned int idx)
> +{
> +	struct file *file = NULL;
> +
> +	task_lock(task);
> +	rcu_read_lock();
> +
> +	if (task->files)
> +		file = fcheck_files(task->files, idx);
> +
> +	rcu_read_unlock();
> +	task_unlock(task);
> +
> +	return file;
> +}
> +
> +static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
> +{
> +	if (likely(m2 != m1))
> +		mutex_unlock(m2);
> +	mutex_unlock(m1);
> +}
> +
> +static int kcmp_lock(struct mutex *m1, struct mutex *m2)
> +{
> +	int err;
> +
> +	if (m2 > m1)
> +		swap(m1, m2);
> +
> +	err = mutex_lock_killable(m1);
> +	if (!err && likely(m1 != m2)) {
> +		err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
> +		if (err)
> +			mutex_unlock(m1);
> +	}
> +
> +	return err;
> +}
> +
> +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
> +		unsigned long, idx1, unsigned long, idx2)
> +{
> +	struct task_struct *task1, *task2;
> +	int ret;
> +
> +	rcu_read_lock();
> +
> +	/*
> +	 * Tasks are looked up in caller's PID namespace only.
> +	 */
> +	task1 = find_task_by_vpid(pid1);
> +	task2 = find_task_by_vpid(pid2);
> +	if (!task1 || !task2)
> +		goto err_no_task;
> +
> +	get_task_struct(task1);
> +	get_task_struct(task2);
> +
> +	rcu_read_unlock();
> +
> +	/*
> +	 * One should have enough rights to inspect task details.
> +	 */
> +	ret = kcmp_lock(&task1->signal->cred_guard_mutex,
> +			&task2->signal->cred_guard_mutex);
> +	if (ret)
> +		goto err;
> +	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> +	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> +		ret = -EPERM;
> +		goto err_unlock;
> +	}
> +
> +	switch (type) {
> +	case KCMP_FILE: {
> +		struct file *filp1, *filp2;
> +
> +		filp1 = get_file_raw_ptr(task1, idx1);
> +		filp2 = get_file_raw_ptr(task2, idx2);
> +
> +		if (filp1 && filp2)
> +			ret = kcmp_ptr(filp1, filp2, KCMP_FILE);
> +		else
> +			ret = -EBADF;
> +		break;
> +	}
> +	case KCMP_VM:
> +		ret = kcmp_ptr(task1->mm, task2->mm, KCMP_VM);
> +		break;
> +	case KCMP_FILES:
> +		ret = kcmp_ptr(task1->files, task2->files, KCMP_FILES);
> +		break;
> +	case KCMP_FS:
> +		ret = kcmp_ptr(task1->fs, task2->fs, KCMP_FS);
> +		break;
> +	case KCMP_SIGHAND:
> +		ret = kcmp_ptr(task1->sighand, task2->sighand, KCMP_SIGHAND);
> +		break;
> +	case KCMP_IO:
> +		ret = kcmp_ptr(task1->io_context, task2->io_context, KCMP_IO);
> +		break;
> +	case KCMP_SYSVSEM:
> +#ifdef CONFIG_SYSVIPC
> +		ret = kcmp_ptr(task1->sysvsem.undo_list,
> +			       task2->sysvsem.undo_list,
> +			       KCMP_SYSVSEM);
> +#else
> +		ret = -EOPNOTSUPP;
> +#endif
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +err_unlock:
> +	kcmp_unlock(&task1->signal->cred_guard_mutex,
> +		    &task2->signal->cred_guard_mutex);
> +err:
> +	put_task_struct(task1);
> +	put_task_struct(task2);
> +
> +	return ret;
> +
> +err_no_task:
> +	rcu_read_unlock();
> +	return -ESRCH;
> +}
> +
> +static __init int kcmp_cookies_init(void)
> +{
> +	int i;
> +
> +	get_random_bytes(cookies, sizeof(cookies));
> +
> +	for (i = 0; i < KCMP_TYPES; i++)
> +		cookies[i][1] |= (~(~0UL >>  1) | 1);
> +
> +	return 0;
> +}
> +arch_initcall(kcmp_cookies_init);
> diff -puN kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/sys_ni.c
> --- a/kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8
> +++ a/kernel/sys_ni.c
> @@ -203,3 +203,6 @@ cond_syscall(sys_fanotify_mark);
>  cond_syscall(sys_name_to_handle_at);
>  cond_syscall(sys_open_by_handle_at);
>  cond_syscall(compat_sys_open_by_handle_at);
> +
> +/* compare kernel pointers */
> +cond_syscall(sys_kcmp);
> diff -puN tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 tools/testing/selftests/Makefile
> --- a/tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8
> +++ a/tools/testing/selftests/Makefile
> @@ -1,4 +1,4 @@
> -TARGETS = breakpoints vm
> +TARGETS = breakpoints vm kcmp
>  
>  all:
>  	for TARGET in $(TARGETS); do \
> diff -puN /dev/null tools/testing/selftests/kcmp/Makefile
> --- /dev/null
> +++ a/tools/testing/selftests/kcmp/Makefile
> @@ -0,0 +1,29 @@
> +uname_M := $(shell uname -m 2>/dev/null || echo not)
> +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
> +ifeq ($(ARCH),i386)
> +        ARCH := X86
> +	CFLAGS := -DCONFIG_X86_32 -D__i386__
> +endif
> +ifeq ($(ARCH),x86_64)
> +	ARCH := X86
> +	CFLAGS := -DCONFIG_X86_64 -D__x86_64__
> +endif
> +
> +CFLAGS += -I../../../../arch/x86/include/generated/
> +CFLAGS += -I../../../../include/
> +CFLAGS += -I../../../../usr/include/
> +CFLAGS += -I../../../../arch/x86/include/
> +
> +all:
> +ifeq ($(ARCH),X86)
> +	gcc $(CFLAGS) kcmp_test.c -o run_test
> +else
> +	echo "Not an x86 target, can't build kcmp selftest"
> +endif
> +
> +run-tests: all
> +	./kcmp_test
> +
> +clean:
> +	rm -fr ./run_test
> +	rm -fr ./test-file
> diff -puN /dev/null tools/testing/selftests/kcmp/kcmp_test.c
> --- /dev/null
> +++ a/tools/testing/selftests/kcmp/kcmp_test.c
> @@ -0,0 +1,94 @@
> +#define _GNU_SOURCE
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <signal.h>
> +#include <limits.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <string.h>
> +#include <fcntl.h>
> +
> +#include <linux/unistd.h>
> +#include <linux/kcmp.h>
> +
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <sys/wait.h>
> +
> +static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2)
> +{
> +	return syscall(__NR_kcmp, pid1, pid2, type, fd1, fd2);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	const char kpath[] = "kcmp-test-file";
> +	int pid1, pid2;
> +	int fd1, fd2;
> +	int status;
> +
> +	fd1 = open(kpath, O_RDWR | O_CREAT | O_TRUNC, 0644);
> +	pid1 = getpid();
> +
> +	if (fd1 < 0) {
> +		perror("Can't create file");
> +		exit(1);
> +	}
> +
> +	pid2 = fork();
> +	if (pid2 < 0) {
> +		perror("fork failed");
> +		exit(1);
> +	}
> +
> +	if (!pid2) {
> +		int pid2 = getpid();
> +		int ret;
> +
> +		fd2 = open(kpath, O_RDWR, 0644);
> +		if (fd2 < 0) {
> +			perror("Can't open file");
> +			exit(1);
> +		}
> +
> +		/* An example of output and arguments */
> +		printf("pid1: %6d pid2: %6d FD: %2ld FILES: %2ld VM: %2ld "
> +		       "FS: %2ld SIGHAND: %2ld IO: %2ld SYSVSEM: %2ld "
> +		       "INV: %2ld\n",
> +		       pid1, pid2,
> +		       sys_kcmp(pid1, pid2, KCMP_FILE,		fd1, fd2),
> +		       sys_kcmp(pid1, pid2, KCMP_FILES,		0, 0),
> +		       sys_kcmp(pid1, pid2, KCMP_VM,		0, 0),
> +		       sys_kcmp(pid1, pid2, KCMP_FS,		0, 0),
> +		       sys_kcmp(pid1, pid2, KCMP_SIGHAND,	0, 0),
> +		       sys_kcmp(pid1, pid2, KCMP_IO,		0, 0),
> +		       sys_kcmp(pid1, pid2, KCMP_SYSVSEM,	0, 0),
> +
> +			/* This one should fail */
> +		       sys_kcmp(pid1, pid2, KCMP_TYPES + 1,	0, 0));
> +
> +		/* This one should return same fd */
> +		ret = sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd1);
> +		if (ret) {
> +			printf("FAIL: 0 expected but %d returned\n", ret);
> +			ret = -1;
> +		} else
> +			printf("PASS: 0 returned as expected\n");
> +
> +		/* Compare with self */
> +		ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0);
> +		if (ret) {
> +			printf("FAIL: 0 expected but %li returned\n", ret);
> +			ret = -1;
> +		} else
> +			printf("PASS: 0 returned as expected\n");
> +
> +		exit(ret);
> +	}
> +
> +	waitpid(pid2, &status, P_ALL);
> +
> +	return 0;
> +}
> _

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-09 23:22         ` H. Peter Anvin
@ 2012-04-10 22:37           ` Cyrill Gorcunov
  2012-04-10 22:39             ` H. Peter Anvin
  2012-04-10 23:08             ` Oleg Nesterov
  2012-04-11  0:02           ` Valdis.Kletnieks
  1 sibling, 2 replies; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-04-10 22:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov,
	Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet

On Mon, Apr 09, 2012 at 04:22:38PM -0700, H. Peter Anvin wrote:
> > The obfuscation logic was done with great help from hpa@. And the main
> > idea was to have ordered results after obfuscation. Per-type noise increase
> > randomization of results. So Andrew, I actually dont know what to add
> > here. We don't want to provide kernel order back to user-space in
> > naked manner.
> > 
> 
> The obfuscation logic is to provide a 1:1 mapping but which doesn't
> preserve ordering, thereby avoid leaking information of kernel pointers
> to user space.
> 

OK, Peter, would the following comment bring light on the obfuscation
procedure?
---
Add a comment on kcmp obfuscation method

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 kernel/kcmp.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

Index: linux-2.6.git/kernel/kcmp.c
===================================================================
--- linux-2.6.git.orig/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -17,6 +17,17 @@
  * reasons, still the comparison results should be suitable for
  * sorting. Thus, we obfuscate kernel pointers values and compare
  * the production instead.
+ *
+ * The obfuscation is done in two steps. First -- we use xor on
+ * kernel pointer with random value, which puts pointer into
+ * a new position in reordered space. Second -- we multiply
+ * the xor production with big odd random number to permute
+ * bits even more (the oddity is important here, it allow
+ * us to have meaningful production even if multiplicants
+ * are big numbers).
+ *
+ * Note also the obfuscation itself is invisible to user-space
+ * and if needed it can be changed to any suitable scheme.
  */
 static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-10 22:37           ` Cyrill Gorcunov
@ 2012-04-10 22:39             ` H. Peter Anvin
  2012-04-10 22:48               ` Cyrill Gorcunov
  2012-04-10 23:08             ` Oleg Nesterov
  1 sibling, 1 reply; 49+ messages in thread
From: H. Peter Anvin @ 2012-04-10 22:39 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov,
	Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet

On 04/10/2012 03:37 PM, Cyrill Gorcunov wrote:
> + * bits even more (the oddity is important here, it allow
> + * us to have meaningful production even if multiplicants
> + * are big numbers).

I would be more clear:

"the odd multiplier guarantees that the product is unique ever after the
high bits are truncated, since any odd number is relative prime to 2^n".

	-hpa


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-10 22:39             ` H. Peter Anvin
@ 2012-04-10 22:48               ` Cyrill Gorcunov
  0 siblings, 0 replies; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-04-10 22:48 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov,
	Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet

On Tue, Apr 10, 2012 at 03:39:53PM -0700, H. Peter Anvin wrote:
> On 04/10/2012 03:37 PM, Cyrill Gorcunov wrote:
> > + * bits even more (the oddity is important here, it allow
> > + * us to have meaningful production even if multiplicants
> > + * are big numbers).
> 
> I would be more clear:
> 
> "the odd multiplier guarantees that the product is unique ever after the
> high bits are truncated, since any odd number is relative prime to 2^n".
> 

Yeah, updated, thanks. Andrew, fold it up please.

---
Subject: Add a comment on kcmp obfuscation method

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: "H. Peter Anvin" <hpa@zytor.com>
---
 kernel/kcmp.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

Index: linux-2.6.git/kernel/kcmp.c
===================================================================
--- linux-2.6.git.orig/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -17,6 +17,17 @@
  * reasons, still the comparison results should be suitable for
  * sorting. Thus, we obfuscate kernel pointers values and compare
  * the production instead.
+ *
+ * The obfuscation is done in two steps. First -- we use xor on
+ * kernel pointer with random value, which puts pointer into
+ * a new position in reordered space. Second -- we multiply
+ * the xor production with big odd random number to permute
+ * bits even more (the odd multiplier guarantees that the product
+ * is unique ever after the high bits are truncated, since any odd
+ * number is relative prime to 2^n).
+ *
+ * Note also the obfuscation itself is invisible to user-space
+ * and if needed it can be changed to any suitable scheme.
  */
 static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-10  3:25       ` Eric W. Biederman
@ 2012-04-10 22:54         ` Cyrill Gorcunov
  0 siblings, 0 replies; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-04-10 22:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Oleg Nesterov, Pavel Emelyanov, Andrey Vagin,
	KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks,
	Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet

On Mon, Apr 09, 2012 at 08:25:22PM -0700, Eric W. Biederman wrote:
...
> 
> Having just read through it again the only possible issue I can see is
> that we compare file descriptors after dropping all of the locks.
> 
> Since rcu_read_lock doesn't participate in ABBA deadlocks. My gut feel
> is that we should hold rcu_read_lock across the hole file pointer
> comparison to remove the possibility of races as file descriptor
> pointers go away.
> 
> Still in practice I don't think it matters.  At worst there is the
> slightest possibility of returning a value instead of -EBADF.  The
> expectation is for all of the tasks we are operating on to be frozen,
> and even if the tasks are not frozen it is a very tiny window for a race
> to be in.

yeah, we use this call heavily on stopped tasks atm

> > So what do people think?  Any issues?  Any nacks?  Should I sneak it
> > into Linus this week or do we need to go another round with it all?
> 
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 

Thanks a lot, Eric!

	Cyrill

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-10 22:37           ` Cyrill Gorcunov
  2012-04-10 22:39             ` H. Peter Anvin
@ 2012-04-10 23:08             ` Oleg Nesterov
  2012-04-10 23:32               ` H. Peter Anvin
  1 sibling, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2012-04-10 23:08 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: H. Peter Anvin, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo,
	Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel, Jonathan Corbet

On 04/11, Cyrill Gorcunov wrote:
>
> --- linux-2.6.git.orig/kernel/kcmp.c
> +++ linux-2.6.git/kernel/kcmp.c
> @@ -17,6 +17,17 @@
>   * reasons, still the comparison results should be suitable for
>   * sorting. Thus, we obfuscate kernel pointers values and compare
>   * the production instead.
> + *
> + * The obfuscation is done in two steps. First -- we use xor on
> + * kernel pointer with random value, which puts pointer into
> + * a new position in reordered space. Second -- we multiply
> + * the xor production with big odd random number to permute
> + * bits even more (the oddity is important here, it allow
> + * us to have meaningful production even if multiplicants
> + * are big numbers).
> + *
> + * Note also the obfuscation itself is invisible to user-space
> + * and if needed it can be changed to any suitable scheme.
>   */
>  static unsigned long cookies[KCMP_TYPES][2] __read_mostly;

OK, since this is discussed again...

Can this comment can also explain why do we obfuscate the pointers
by type? I mean, I don't really understand why the one-dimensional
cookies[2] is "not enough" from security pov.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-10 23:08             ` Oleg Nesterov
@ 2012-04-10 23:32               ` H. Peter Anvin
  2012-04-10 23:42                 ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: H. Peter Anvin @ 2012-04-10 23:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Cyrill Gorcunov, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo,
	Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel, Jonathan Corbet

On 04/10/2012 04:08 PM, Oleg Nesterov wrote:
> 
> OK, since this is discussed again...
> 
> Can this comment can also explain why do we obfuscate the pointers
> by type? I mean, I don't really understand why the one-dimensional
> cookies[2] is "not enough" from security pov.
> 

Because it's cheap.  "Just enough" is not what you want to shoot for,
ever, you want to get past the "just enough" point and then consider
"what can I get for cheap at this point"?

	-hpa


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-10 23:32               ` H. Peter Anvin
@ 2012-04-10 23:42                 ` Oleg Nesterov
  2012-04-11  6:39                   ` Cyrill Gorcunov
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2012-04-10 23:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Cyrill Gorcunov, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo,
	Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel, Jonathan Corbet

On 04/10, H. Peter Anvin wrote:
>
> On 04/10/2012 04:08 PM, Oleg Nesterov wrote:
> >
> > OK, since this is discussed again...
> >
> > Can this comment can also explain why do we obfuscate the pointers
> > by type? I mean, I don't really understand why the one-dimensional
> > cookies[2] is "not enough" from security pov.
>
> Because it's cheap.  "Just enough" is not what you want to shoot for,
> ever, you want to get past the "just enough" point and then consider
> "what can I get for cheap at this point"?

OK, I am not arguing. Just I thought that the small note like
"we are doing this per-type to obfuscate even more" can help.
I wouldn't have asked, but Cyrill rewrites this comment anyway.

Perhaps this is just me, but my first (and wrong) impression was
that somehow this is needed for correctness.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-09 22:10     ` Andrew Morton
  2012-04-09 22:24       ` Cyrill Gorcunov
  2012-04-10  3:25       ` Eric W. Biederman
@ 2012-04-10 23:58       ` Valdis.Kletnieks
  2012-04-11  0:06         ` H. Peter Anvin
  2 siblings, 1 reply; 49+ messages in thread
From: Valdis.Kletnieks @ 2012-04-10 23:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Oleg Nesterov, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen,
	Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet,
	Vasiliy Kulikov, Alexey Dobriyan, Michal Marek,
	Frederic Weisbecker, linux-kernel, Jonathan Corbet

[-- Attachment #1: Type: text/plain, Size: 950 bytes --]

On Mon, 09 Apr 2012 15:10:27 -0700, Andrew Morton said:
> Back on to kcmp.

I've totally forgotten what my original comment(s) on this patch were.. :)

> +/*
> + * 0 - equal, i.e. v1 = v2
> + * 1 - less than, i.e. v1 < v2
> + * 2 - greater than, i.e. v1 > v2
> + * 3 - not equal but ordering unavailable (reserved for future)
> + */
> +static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
> +{
> +	long ret;
> +
> +	ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type);

I'm not able to convince myself that "less than" and "greater than" mean
anything - do we have a good proof that for all v1 and v2, the obfuscated
pointers have the same ordering as the original pointers?

Hmm... consider the simplified example  v1 = 5 and v2= 16., and cookies[0] is
also 16. Then obfus(v1) == 21, and obfus(v2) == 0, and the ordering is
different.  So I'm thinking 0 and 3 are the only sane return values?

Or do I need more caffeine?


[-- Attachment #2: Type: application/pgp-signature, Size: 865 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-09 23:22         ` H. Peter Anvin
  2012-04-10 22:37           ` Cyrill Gorcunov
@ 2012-04-11  0:02           ` Valdis.Kletnieks
  1 sibling, 0 replies; 49+ messages in thread
From: Valdis.Kletnieks @ 2012-04-11  0:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Cyrill Gorcunov, Andrew Morton, Oleg Nesterov, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo,
	Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov,
	Alexey Dobriyan, Michal Marek, Frederic Weisbecker, linux-kernel,
	Jonathan Corbet

[-- Attachment #1: Type: text/plain, Size: 1009 bytes --]

On Mon, 09 Apr 2012 16:22:38 -0700, "H. Peter Anvin" said:
> On 04/09/2012 03:24 PM, Cyrill Gorcunov wrote:
> >>
> >> Having re-read most of the (enormous) email discussion on the kcmp()
> >> syscall patch, I'm thinking:
> >>
> >> - Nobody seems to understand the obfuscation logic.  Jon sounded
> >>   confused, Oleg sounds confused and it's rather unclear what it does,
> >>   how it does it and why it does it.
> >
> > The obfuscation logic was done with great help from hpa@. And the main
> > idea was to have ordered results after obfuscation. Per-type noise increase
> > randomization of results. So Andrew, I actually dont know what to add
> > here. We don't want to provide kernel order back to user-space in
> > naked manner.
> >
>
> The obfuscation logic is to provide a 1:1 mapping but which doesn't
> preserve ordering, thereby avoid leaking information of kernel pointers
> to user space.

Oh, OK... Ignore my previous note then.  But we should lose the comment
that implies we have an ordering?

[-- Attachment #2: Type: application/pgp-signature, Size: 865 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-10 23:58       ` Valdis.Kletnieks
@ 2012-04-11  0:06         ` H. Peter Anvin
  0 siblings, 0 replies; 49+ messages in thread
From: H. Peter Anvin @ 2012-04-11  0:06 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Andrew Morton, Cyrill Gorcunov, Oleg Nesterov, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo,
	Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov,
	Alexey Dobriyan, Michal Marek, Frederic Weisbecker, linux-kernel,
	Jonathan Corbet

On 04/10/2012 04:58 PM, Valdis.Kletnieks@vt.edu wrote:
> 
> I'm not able to convince myself that "less than" and "greater than"
> mean anything - do we have a good proof that for all v1 and v2, the
> obfuscated pointers have the same ordering as the original
> pointers?
> 

No, of course they don't.

That's the point.

The whole point is that kcmp() exports an ordered set, but the
ordering is explicitly and intentionally different than the actual
ordering in memory.  It is still valid for sorting, in particular.

> Hmm... consider the simplified example  v1 = 5 and v2= 16., and
> cookies[0] is also 16. Then obfus(v1) == 21, and obfus(v2) == 0,
> and the ordering is different.  So I'm thinking 0 and 3 are the
> only sane return values?
> 
> Or do I need more caffeine?

You need more caffeine.  You're thinking about the problem wrong.

	-hpa

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-10 23:42                 ` Oleg Nesterov
@ 2012-04-11  6:39                   ` Cyrill Gorcunov
  2012-04-11 18:31                     ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Cyrill Gorcunov @ 2012-04-11  6:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: H. Peter Anvin, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo,
	Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel, Jonathan Corbet

On Wed, Apr 11, 2012 at 01:42:06AM +0200, Oleg Nesterov wrote:
> On 04/10, H. Peter Anvin wrote:
> >
> > On 04/10/2012 04:08 PM, Oleg Nesterov wrote:
> > >
> > > OK, since this is discussed again...
> > >
> > > Can this comment can also explain why do we obfuscate the pointers
> > > by type? I mean, I don't really understand why the one-dimensional
> > > cookies[2] is "not enough" from security pov.
> >
> > Because it's cheap.  "Just enough" is not what you want to shoot for,
> > ever, you want to get past the "just enough" point and then consider
> > "what can I get for cheap at this point"?
> 
> OK, I am not arguing. Just I thought that the small note like
> "we are doing this per-type to obfuscate even more" can help.
> I wouldn't have asked, but Cyrill rewrites this comment anyway.
> 
> Perhaps this is just me, but my first (and wrong) impression was
> that somehow this is needed for correctness.

Hi Oleg, would the form below sounds good?

---
Subject: Add a comment on kcmp obfuscation method

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 kernel/kcmp.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

Index: linux-2.6.git/kernel/kcmp.c
===================================================================
--- linux-2.6.git.orig/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -17,6 +17,20 @@
  * reasons, still the comparison results should be suitable for
  * sorting. Thus, we obfuscate kernel pointers values and compare
  * the production instead.
+ *
+ * The obfuscation is done in two steps. First -- we use xor on
+ * kernel pointer with random value, which puts pointer into
+ * a new position in reordered space. Second -- we multiply
+ * the xor production with big odd random number to permute
+ * bits even more (the odd multiplier guarantees that the product
+ * is unique ever after the high bits are truncated, since any odd
+ * number is relative prime to 2^n).
+ *
+ * The obfuscation is done with per-type cookie values to increase
+ * initial entropy of results.
+ *
+ * Note also the obfuscation itself is invisible to user-space
+ * and if needed it can be changed to any suitable scheme.
  */
 static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
  2012-04-11  6:39                   ` Cyrill Gorcunov
@ 2012-04-11 18:31                     ` Oleg Nesterov
  0 siblings, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2012-04-11 18:31 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: H. Peter Anvin, Andrew Morton, Eric W. Biederman,
	Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar,
	Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo,
	Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov,
	Alexey Dobriyan, Valdis.Kletnieks, Michal Marek,
	Frederic Weisbecker, linux-kernel, Jonathan Corbet

On 04/11, Cyrill Gorcunov wrote:
>
> On Wed, Apr 11, 2012 at 01:42:06AM +0200, Oleg Nesterov wrote:
> > OK, I am not arguing. Just I thought that the small note like
> > "we are doing this per-type to obfuscate even more" can help.
> > I wouldn't have asked, but Cyrill rewrites this comment anyway.
> >
> > Perhaps this is just me, but my first (and wrong) impression was
> > that somehow this is needed for correctness.
>
> Hi Oleg, would the form below sounds good?

Sure, thanks.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree
@ 2012-02-14 23:15 akpm
  0 siblings, 0 replies; 49+ messages in thread
From: akpm @ 2012-02-14 23:15 UTC (permalink / raw)
  To: mm-commits
  Cc: gorcunov, adobriyan, andi, avagin, ebiederm, eric.dumazet,
	fweisbec, glommer, hpa, kosaki.motohiro, matthltc, mingo, mmarek,
	penberg, segoon, tglx, tj, xemul


The patch titled
     Subject: syscalls, x86: add __NR_kcmp syscall
has been added to the -mm tree.  Its filename is
     syscalls-x86-add-__nr_kcmp-syscall-v8.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: syscalls, x86: add __NR_kcmp syscall

While doing the checkpoint-restore in the user space one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are
shared between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the
unshare syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g.  mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparison' syscall might be the best candidate.  So here is it --
__NR_kcmp.

It takes up to 5 arguments - the pids of the two tasks (which
characteristics should be compared), the comparison type and (in case of
comparison of files) two file descriptors.

Lookups for pids are done in the caller's PID namespace only.

At moment only x86 is supported and tested.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: Michal Marek <mmarek@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/syscalls/syscall_32.tbl         |    1 
 arch/x86/syscalls/syscall_64.tbl         |    1 
 include/linux/kcmp.h                     |   17 ++
 include/linux/syscalls.h                 |    2 
 kernel/Makefile                          |    3 
 kernel/kcmp.c                            |  155 +++++++++++++++++++++
 kernel/sys_ni.c                          |    3 
 tools/testing/selftests/kcmp/Makefile    |   36 ++++
 tools/testing/selftests/kcmp/kcmp_test.c |   84 +++++++++++
 9 files changed, 302 insertions(+)

diff -puN arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_32.tbl
--- a/arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/arch/x86/syscalls/syscall_32.tbl
@@ -355,3 +355,4 @@
 346	i386	setns			sys_setns
 347	i386	process_vm_readv	sys_process_vm_readv		compat_sys_process_vm_readv
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
+349	i386	kcmp			sys_kcmp
diff -puN arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_64.tbl
--- a/arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/arch/x86/syscalls/syscall_64.tbl
@@ -318,3 +318,4 @@
 309	64	getcpu			sys_getcpu
 310	64	process_vm_readv	sys_process_vm_readv
 311	64	process_vm_writev	sys_process_vm_writev
+312	64	kcmp			sys_kcmp
diff -puN /dev/null include/linux/kcmp.h
--- /dev/null
+++ a/include/linux/kcmp.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_KCMP_H
+#define _LINUX_KCMP_H
+
+/* Comparison type */
+enum kcmp_type {
+	KCMP_FILE,
+	KCMP_VM,
+	KCMP_FILES,
+	KCMP_FS,
+	KCMP_SIGHAND,
+	KCMP_IO,
+	KCMP_SYSVSEM,
+
+	KCMP_TYPES,
+};
+
+#endif /* _LINUX_KCMP_H */
diff -puN include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8 include/linux/syscalls.h
--- a/include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/include/linux/syscalls.h
@@ -857,4 +857,6 @@ asmlinkage long sys_process_vm_writev(pi
 				      unsigned long riovcnt,
 				      unsigned long flags);
 
+asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
+			 unsigned long idx1, unsigned long idx2);
 #endif
diff -puN kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/Makefile
--- a/kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/kernel/Makefile
@@ -25,6 +25,9 @@ endif
 obj-y += sched/
 obj-y += power/
 
+ifeq ($(CONFIG_CHECKPOINT_RESTORE),y)
+obj-$(CONFIG_X86) += kcmp.o
+endif
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
diff -puN /dev/null kernel/kcmp.c
--- /dev/null
+++ a/kernel/kcmp.c
@@ -0,0 +1,155 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fdtable.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/cache.h>
+#include <linux/bug.h>
+#include <linux/err.h>
+#include <linux/kcmp.h>
+
+#include <asm/unistd.h>
+
+/*
+ * We don't expose real in-memory order of objects for security
+ * reasons, still the comparison results should be suitable for
+ * sorting. Thus, we obfuscate kernel pointers values and compare
+ * the production instead.
+ */
+static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
+
+static long kptr_obfuscate(long v, int type)
+{
+	return (v ^ cookies[type][0]) * cookies[type][1];
+}
+
+/*
+ * 0 - equal, i.e. v1 = v2
+ * 1 - less than, i.e. v1 < v2
+ * 2 - greater than, i.e. v1 > v2
+ * 3 - not equal but ordering unavailable (reserved for future)
+ */
+static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
+{
+	long ret;
+
+	ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type);
+
+	return (ret < 0) | ((ret > 0) << 1);
+}
+
+/* The caller must have pinned the task */
+static struct file *
+get_file_raw_ptr(struct task_struct *task, unsigned int idx)
+{
+	struct fdtable *fdt;
+	struct file *file;
+
+	spin_lock(&task->files->file_lock);
+	fdt = files_fdtable(task->files);
+	if (idx < fdt->max_fds)
+		file = fdt->fd[idx];
+	else
+		file = NULL;
+	spin_unlock(&task->files->file_lock);
+
+	return file;
+}
+
+SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
+		unsigned long, idx1, unsigned long, idx2)
+{
+	struct task_struct *task1, *task2;
+	int ret;
+
+	rcu_read_lock();
+
+	/*
+	 * Tasks are looked up in caller's PID namespace only.
+	 */
+	task1 = find_task_by_vpid(pid1);
+	task2 = find_task_by_vpid(pid2);
+	if (!task1 || !task2)
+		goto err_no_task;
+
+	get_task_struct(task1);
+	get_task_struct(task2);
+
+	rcu_read_unlock();
+
+	/*
+	 * One should have enough rights to inspect task details.
+	 */
+	if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+	    !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+		ret = -EACCES;
+		goto err;
+	}
+
+	switch (type) {
+	case KCMP_FILE: {
+		struct file *filp1, *filp2;
+
+		filp1 = get_file_raw_ptr(task1, idx1);
+		filp2 = get_file_raw_ptr(task2, idx2);
+
+		if (filp1 && filp2)
+			ret = kcmp_ptr(filp1, filp2, KCMP_FILE);
+		else
+			ret = -EBADF;
+		break;
+	}
+	case KCMP_VM:
+		ret = kcmp_ptr(task1->mm, task2->mm, KCMP_VM);
+		break;
+	case KCMP_FILES:
+		ret = kcmp_ptr(task1->files, task2->files, KCMP_FILES);
+		break;
+	case KCMP_FS:
+		ret = kcmp_ptr(task1->fs, task2->fs, KCMP_FS);
+		break;
+	case KCMP_SIGHAND:
+		ret = kcmp_ptr(task1->sighand, task2->sighand, KCMP_SIGHAND);
+		break;
+	case KCMP_IO:
+		ret = kcmp_ptr(task1->io_context, task2->io_context, KCMP_IO);
+		break;
+	case KCMP_SYSVSEM:
+#ifdef CONFIG_SYSVIPC
+		ret = kcmp_ptr(task1->sysvsem.undo_list,
+			       task2->sysvsem.undo_list,
+			       KCMP_SYSVSEM);
+#else
+		ret = -EOPNOTSUP;
+#endif
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+err:
+	put_task_struct(task1);
+	put_task_struct(task2);
+
+	return ret;
+
+err_no_task:
+	rcu_read_unlock();
+	return -ESRCH;
+}
+
+static __init int kcmp_cookies_init(void)
+{
+	int i;
+
+	get_random_bytes(cookies, sizeof(cookies));
+
+	for (i = 0; i < KCMP_TYPES; i++)
+		cookies[i][1] |= (~(~0UL >>  1) | 1);
+
+	return 0;
+}
+arch_initcall(kcmp_cookies_init);
diff -puN kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/sys_ni.c
--- a/kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/kernel/sys_ni.c
@@ -203,3 +203,6 @@ cond_syscall(sys_fanotify_mark);
 cond_syscall(sys_name_to_handle_at);
 cond_syscall(sys_open_by_handle_at);
 cond_syscall(compat_sys_open_by_handle_at);
+
+/* compare kernel pointers */
+cond_syscall(sys_kcmp);
diff -puN /dev/null tools/testing/selftests/kcmp/Makefile
--- /dev/null
+++ a/tools/testing/selftests/kcmp/Makefile
@@ -0,0 +1,36 @@
+ifeq ($(strip $(V)),)
+	E = @echo
+	Q = @
+else
+	E = @\#
+	Q =
+endif
+export E Q
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+        ARCH := X86
+	CFLAGS := -DCONFIG_X86_32 -D__i386__
+endif
+ifeq ($(ARCH),x86_64)
+	ARCH := X86
+	CFLAGS := -DCONFIG_X86_64 -D__x86_64__
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+CFLAGS += -I../../../../usr/include/
+
+all:
+ifeq ($(ARCH),X86)
+	$(E) "  CC run_test"
+	$(Q) gcc $(CFLAGS) kcmp_test.c -o run_test
+else
+	$(E) "Not an x86 target, can't build kcmp selftest"
+endif
+
+clean:
+	$(E) "  CLEAN"
+	$(Q) rm -fr ./run_test
+	$(Q) rm -fr ./test-file
diff -puN /dev/null tools/testing/selftests/kcmp/kcmp_test.c
--- /dev/null
+++ a/tools/testing/selftests/kcmp/kcmp_test.c
@@ -0,0 +1,84 @@
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <limits.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <fcntl.h>
+
+#include <linux/unistd.h>
+#include <linux/kcmp.h>
+
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+
+static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2)
+{
+	return syscall(__NR_kcmp, pid1, pid2, type, fd1, fd2);
+}
+
+int main(int argc, char **argv)
+{
+	const char kpath[] = "kcmp-test-file";
+	int pid1, pid2;
+	int fd1, fd2;
+	int status;
+
+	fd1 = open(kpath, O_RDWR | O_CREAT | O_TRUNC, 0644);
+	pid1 = getpid();
+
+	if (fd1 < 0) {
+		perror("Can't create file");
+		exit(1);
+	}
+
+	pid2 = fork();
+	if (pid2 < 0) {
+		perror("fork failed");
+		exit(1);
+	}
+
+	if (!pid2) {
+		int pid2 = getpid();
+		int ret;
+
+		fd2 = open(kpath, O_RDWR, 0644);
+		if (fd2 < 0) {
+			perror("Can't open file");
+			exit(1);
+		}
+
+		/* An example of output and arguments */
+		printf("pid1: %6d pid2: %6d FD: %2d FILES: %2d VM: %2d FS: %2d "
+		       "SIGHAND: %2d IO: %2d SYSVSEM: %2d INV: %2d\n",
+		       pid1, pid2,
+		       sys_kcmp(pid1, pid2, KCMP_FILE,		fd1, fd2),
+		       sys_kcmp(pid1, pid2, KCMP_FILES,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_VM,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_FS,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_SIGHAND,	0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_IO,		0, 0),
+		       sys_kcmp(pid1, pid2, KCMP_SYSVSEM,	0, 0),
+
+			/* This one should fail */
+		       sys_kcmp(pid1, pid2, KCMP_TYPES + 1,	0, 0));
+
+		/* This one should return same fd */
+		ret = sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd1);
+		if (ret) {
+			printf("FAIL: 0 expected but %d returned\n", ret);
+			ret = -1;
+		} else
+			printf("PASS: 0 returned as expected\n");
+		exit(ret);
+	}
+
+	waitpid(pid2, &status, P_ALL);
+
+	return 0;
+}
_
Subject: Subject: syscalls, x86: add __NR_kcmp syscall

Patches currently in -mm which might be from gorcunov@openvz.org are

linux-next.patch
sysctl-make-kernelns_last_pid-control-being-checkpoint_restore-dependent.patch
fs-proc-introduce-proc-pid-task-tid-children-entry-v9.patch
syscalls-x86-add-__nr_kcmp-syscall-v8.patch
syscalls-x86-add-__nr_kcmp-syscall-v8-fix.patch
c-r-procfs-add-arg_start-end-env_start-end-and-exit_code-members-to-proc-pid-stat.patch
c-r-prctl-extend-pr_set_mm-to-set-up-more-mm_struct-entries-v2.patch


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2012-04-11 18:38 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-15 14:36 + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree Oleg Nesterov
2012-02-15 15:10 ` Cyrill Gorcunov
2012-02-15 15:38   ` Oleg Nesterov
2012-02-15 16:13     ` Cyrill Gorcunov
2012-02-15 16:22       ` Oleg Nesterov
2012-02-15 17:53         ` Cyrill Gorcunov
2012-02-15 18:43           ` Oleg Nesterov
2012-02-15 19:56             ` Cyrill Gorcunov
2012-02-15 19:57               ` Vasiliy Kulikov
2012-02-15 20:05                 ` Cyrill Gorcunov
2012-02-15 20:25                   ` Cyrill Gorcunov
2012-02-15 21:09                     ` Cyrill Gorcunov
2012-02-15 21:58                       ` Cyrill Gorcunov
2012-02-16 14:49                         ` Oleg Nesterov
2012-02-16 15:13                           ` Cyrill Gorcunov
2012-02-16 16:49                             ` Cyrill Gorcunov
2012-02-16 17:40                               ` Oleg Nesterov
2012-02-16 17:58                                 ` Cyrill Gorcunov
2012-02-16 19:03                                   ` Oleg Nesterov
2012-02-16 19:20                                     ` H. Peter Anvin
2012-02-16 19:29                                     ` Cyrill Gorcunov
2012-02-16 19:52                                       ` Andrew Morton
2012-02-16 20:01                                         ` Cyrill Gorcunov
2012-02-16 18:21                               ` Vasiliy Kulikov
2012-02-16 18:34                                 ` Cyrill Gorcunov
2012-02-16 18:33                                   ` Vasiliy Kulikov
2012-02-16 18:49                                 ` Oleg Nesterov
2012-02-15 18:32   ` Cyrill Gorcunov
2012-02-15 19:06     ` Oleg Nesterov
2012-02-15 19:18       ` Cyrill Gorcunov
2012-02-15 16:06 ` Oleg Nesterov
2012-02-15 16:27   ` Cyrill Gorcunov
2012-04-09 22:10     ` Andrew Morton
2012-04-09 22:24       ` Cyrill Gorcunov
2012-04-09 23:22         ` H. Peter Anvin
2012-04-10 22:37           ` Cyrill Gorcunov
2012-04-10 22:39             ` H. Peter Anvin
2012-04-10 22:48               ` Cyrill Gorcunov
2012-04-10 23:08             ` Oleg Nesterov
2012-04-10 23:32               ` H. Peter Anvin
2012-04-10 23:42                 ` Oleg Nesterov
2012-04-11  6:39                   ` Cyrill Gorcunov
2012-04-11 18:31                     ` Oleg Nesterov
2012-04-11  0:02           ` Valdis.Kletnieks
2012-04-10  3:25       ` Eric W. Biederman
2012-04-10 22:54         ` Cyrill Gorcunov
2012-04-10 23:58       ` Valdis.Kletnieks
2012-04-11  0:06         ` H. Peter Anvin
  -- strict thread matches above, loose matches on Subject: below --
2012-02-14 23:15 akpm

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.