* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree @ 2012-02-15 14:36 Oleg Nesterov 2012-02-15 15:10 ` Cyrill Gorcunov 2012-02-15 16:06 ` Oleg Nesterov 0 siblings, 2 replies; 48+ messages in thread From: Oleg Nesterov @ 2012-02-15 14:36 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel > +/* The caller must have pinned the task */ > +static struct file * > +get_file_raw_ptr(struct task_struct *task, unsigned int idx) > +{ > + struct fdtable *fdt; > + struct file *file; > + > + spin_lock(&task->files->file_lock); task->files can be NULL, we can race with exit_files(). > + fdt = files_fdtable(task->files); > + if (idx < fdt->max_fds) > + file = fdt->fd[idx]; You can probably rely on rcu instead of ->file_lock, but this is minor. > +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, > + unsigned long, idx1, unsigned long, idx2) > +{ > + struct task_struct *task1, *task2; > + int ret; > + > + rcu_read_lock(); > + > + /* > + * Tasks are looked up in caller's PID namespace only. > + */ > + task1 = find_task_by_vpid(pid1); > + task2 = find_task_by_vpid(pid2); > + if (!task1 || !task2) > + goto err_no_task; > + > + get_task_struct(task1); > + get_task_struct(task2); > + > + rcu_read_unlock(); > + > + /* > + * One should have enough rights to inspect task details. > + */ > + if (!ptrace_may_access(task1, PTRACE_MODE_READ) || > + !ptrace_may_access(task2, PTRACE_MODE_READ)) { > + ret = -EACCES; Well, probably this is fine... but may be you can add a comment. The task can change its credentials right after ptrace_may_access() succeeds. This _looks_ wrong, perhaps it makes sense to add the "we do not care" note. Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 14:36 + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree Oleg Nesterov @ 2012-02-15 15:10 ` Cyrill Gorcunov 2012-02-15 15:38 ` Oleg Nesterov 2012-02-15 18:32 ` Cyrill Gorcunov 2012-02-15 16:06 ` Oleg Nesterov 1 sibling, 2 replies; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-15 15:10 UTC (permalink / raw) To: Oleg Nesterov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote: > > task->files can be NULL, we can race with exit_files(). So I need to call get_files_struct. Thanks Oleg! > > > + fdt = files_fdtable(task->files); > > + if (idx < fdt->max_fds) > > + file = fdt->fd[idx]; > > You can probably rely on rcu instead of ->file_lock, but this is minor. > I think so. I'll be updating the patch anyway (on top of current one) so I'll address this comment too. Thanks! > > + > > + /* > > + * One should have enough rights to inspect task details. > > + */ > > + if (!ptrace_may_access(task1, PTRACE_MODE_READ) || > > + !ptrace_may_access(task2, PTRACE_MODE_READ)) { > > + ret = -EACCES; > > Well, probably this is fine... but may be you can add a comment. > The task can change its credentials right after ptrace_may_access() > succeeds. This _looks_ wrong, perhaps it makes sense to add the > "we do not care" note. > Wait, how it's differ from other ptrace_may_access calls all over the kernel? I suppose I'm missing something obvious? Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 15:10 ` Cyrill Gorcunov @ 2012-02-15 15:38 ` Oleg Nesterov 2012-02-15 16:13 ` Cyrill Gorcunov 2012-02-15 18:32 ` Cyrill Gorcunov 1 sibling, 1 reply; 48+ messages in thread From: Oleg Nesterov @ 2012-02-15 15:38 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On 02/15, Cyrill Gorcunov wrote: > > On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote: > > > > > + > > > + /* > > > + * One should have enough rights to inspect task details. > > > + */ > > > + if (!ptrace_may_access(task1, PTRACE_MODE_READ) || > > > + !ptrace_may_access(task2, PTRACE_MODE_READ)) { > > > + ret = -EACCES; > > > > Well, probably this is fine... but may be you can add a comment. > > The task can change its credentials right after ptrace_may_access() > > succeeds. This _looks_ wrong, perhaps it makes sense to add the > > "we do not care" note. > > > > Wait, how it's differ from other ptrace_may_access calls all over > the kernel? I suppose I'm missing something obvious? For example? Say, mm_access() is fine because it returns ->mm which we are going to play with. But map_files_d_revalidate/proc_map_files_get_link looks wrong, there are obviously racy and should use mm_access(). Probably something else is wrong too. Once again, I am not saying that this code really has the security problems. I simply do not know. But it looks wrong without the comment. I do not really understand why do we need ptrace_may_access(), but whatever reason we have how we can trust it? Say, when KCMP_VM checks ->mm, all we know is that PTRACE_MODE_READ succeed in the past. This looks confusing, imho. Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 15:38 ` Oleg Nesterov @ 2012-02-15 16:13 ` Cyrill Gorcunov 2012-02-15 16:22 ` Oleg Nesterov 0 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-15 16:13 UTC (permalink / raw) To: Oleg Nesterov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On Wed, Feb 15, 2012 at 04:38:16PM +0100, Oleg Nesterov wrote: ... > > > > Wait, how it's differ from other ptrace_may_access calls all over > > the kernel? I suppose I'm missing something obvious? > > For example? Say, mm_access() is fine because it returns ->mm > which we are going to play with. So, say we have environ_read mm_for_maps mm_access success, and first reader continue here then say task change own credentials and all this sequence fails because access is not granted anymore (say for second reader), but first reader still able to continue reading because access was graned earlier. So I don't understand how it's different from what is provided in this patch. What I'm missing? > Once again, I am not saying that this code really has the security > problems. I simply do not know. But it looks wrong without the > comment. I do not really understand why do we need ptrace_may_access(), > but whatever reason we have how we can trust it? Say, when KCMP_VM > checks ->mm, all we know is that PTRACE_MODE_READ succeed in the > past. This looks confusing, imho. Adding the comment is not a problem. The problem is that I dont understand if there error in patch or not, can we stick with ptrace_may_access or need something different here? The idea was exactly like -- if you have enough rights to proceed ptrace_may_access then syscall should continue executing and return comparision result. Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 16:13 ` Cyrill Gorcunov @ 2012-02-15 16:22 ` Oleg Nesterov 2012-02-15 17:53 ` Cyrill Gorcunov 0 siblings, 1 reply; 48+ messages in thread From: Oleg Nesterov @ 2012-02-15 16:22 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On 02/15, Cyrill Gorcunov wrote: > > On Wed, Feb 15, 2012 at 04:38:16PM +0100, Oleg Nesterov wrote: > ... > > > > > > Wait, how it's differ from other ptrace_may_access calls all over > > > the kernel? I suppose I'm missing something obvious? > > > > For example? Say, mm_access() is fine because it returns ->mm > > which we are going to play with. > > So, say we have > > environ_read > mm_for_maps > mm_access > success, and first reader continue here > > then say task change own credentials and all > this sequence fails because access is not granted > anymore (say for second reader), but first reader > still able to continue reading because access was > graned earlier. Can't understand... Yes, environ_read() can succeed for the first reader, and then later it can fail for the same/another reader. And? > So I don't understand how it's different from what > is provided in this patch. What I'm missing? environ_read() does mm = mm_access(task); if (mm) do_something(mm); even if it races with, say, execve(setuid_app) we can't read the new ->mm. while your code (very roughly) does something like mm = mm_access(task); if (mm) do_something(task->mm); while it is quite possible that mm != task->mm. > > Once again, I am not saying that this code really has the security > > problems. I simply do not know. But it looks wrong without the > > comment. I do not really understand why do we need ptrace_may_access(), > > but whatever reason we have how we can trust it? Say, when KCMP_VM > > checks ->mm, all we know is that PTRACE_MODE_READ succeed in the > > past. This looks confusing, imho. > > Adding the comment is not a problem. The problem is that I > dont understand if there error in patch or not, can we stick > with ptrace_may_access or need something different here? > The idea was exactly like -- if you have enough rights to > proceed ptrace_may_access then syscall should continue > executing and return comparision result. My only point is: this check is obviously racy, and thus it looks confusing. Whether this is fine or not, I do not know. Personally I see no reason for ptrace_may_access(), but I am not security expert. Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 16:22 ` Oleg Nesterov @ 2012-02-15 17:53 ` Cyrill Gorcunov 2012-02-15 18:43 ` Oleg Nesterov 0 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-15 17:53 UTC (permalink / raw) To: Oleg Nesterov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On Wed, Feb 15, 2012 at 05:22:22PM +0100, Oleg Nesterov wrote: > > > So I don't understand how it's different from what > > is provided in this patch. What I'm missing? > > environ_read() does > > mm = mm_access(task); > if (mm) > do_something(mm); > > even if it races with, say, execve(setuid_app) we can't read the > new ->mm. Wait, I'm confused process 1 (reader) process 2 ("task" itself) mm = mm_access(task); task changes own credentials so reader can't access on next read if it would try, but since access already granted... it continues do_something(mm) if (mm) do_something(mm); So in the patch I tried the same, once access is granted it belongs to a caller. > > while your code (very roughly) does something like > > mm = mm_access(task); > if (mm) > do_something(task->mm); > > while it is quite possible that mm != task->mm. Oleg, could you please explain me where it happens that task->mm (I've got access to) will be changed to some new -mm while I'm inspecting it. If permission changed while the caller inside syscall, it's the same situation as with mm_access above. No? > > My only point is: this check is obviously racy, and thus it looks > confusing. Whether this is fine or not, I do not know. Personally > I see no reason for ptrace_may_access(), but I am not security > expert. The idea was -- non-privilege caller should not have access to privileged tasks. Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 17:53 ` Cyrill Gorcunov @ 2012-02-15 18:43 ` Oleg Nesterov 2012-02-15 19:56 ` Cyrill Gorcunov 0 siblings, 1 reply; 48+ messages in thread From: Oleg Nesterov @ 2012-02-15 18:43 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On 02/15, Cyrill Gorcunov wrote: > > On Wed, Feb 15, 2012 at 05:22:22PM +0100, Oleg Nesterov wrote: > > > > > So I don't understand how it's different from what > > > is provided in this patch. What I'm missing? > > > > environ_read() does > > > > mm = mm_access(task); > > if (mm) > > do_something(mm); > > > > even if it races with, say, execve(setuid_app) we can't read the > > new ->mm. > > Wait, I'm confused > > process 1 (reader) process 2 ("task" itself) > mm = mm_access(task); > task changes own credentials > so reader can't access on next > read if it would try, but since > access already granted... it > continues do_something(mm) > if (mm) > do_something(mm); > > So in the patch I tried the same, once access is granted it > belongs to a caller. See the "execve(setuid_app)", this is what I meant. Even if we race with execve() and the task raises its privileges we can't read the new ->mm, we will read the old mm for which we have (had) the rights to access. > > while your code (very roughly) does something like > > > > mm = mm_access(task); > > if (mm) > > do_something(task->mm); > > > > while it is quite possible that mm != task->mm. > > Oleg, could you please explain me where it happens > that task->mm (I've got access to) will be changed > to some new -mm while I'm inspecting it. Cough... this is question I am trying to ask ;) Let me try again. To simplify, lets discuss the KCMP_VM case only. I do not really understand why do we need ptrace_may_access(). I do not see any security problems with kcmp_ptr(task->mm), but I am not expert. However, you added this check so I assume you have some reason. But this can race with execve(setuid_app) and KCMP_VM can play with task->mm after this task raises its caps. If this is fine, then why do we need ptrace_may_access? OK, please ignore. I sent the initial email just becase KCMP_FILE is buggy. > > > + for (i = 0; i < KCMP_TYPES; i++) > > > + cookies[i][1] |= (~(~0UL >> 1) | 1); > > > > I am puzzled, help ;) this is equal to > > > > cookies[i][1] |= -LONG_MAX; > > or > > cookies[i][1] |= (LONG_MIN | 1); > > > > for what? why do we want to set these 2 bits (MSB and LSB) ? > > Letme quote hpa@ here :) > > | This code is wrong. You will have a zero cookie, legitimately, once in > | 2^32 or 2^64 attempts, depending on the bitness. > | > | The other thing is that for the multiplicative cookie you should OR in > | the value (~(~0UL >> 1) | 1) in order to make sure that the value is (a) > | large and (b) odd. OK, thanks. Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 18:43 ` Oleg Nesterov @ 2012-02-15 19:56 ` Cyrill Gorcunov 2012-02-15 19:57 ` Vasiliy Kulikov 0 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-15 19:56 UTC (permalink / raw) To: Oleg Nesterov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On Wed, Feb 15, 2012 at 07:43:36PM +0100, Oleg Nesterov wrote: ... > > Cough... this is question I am trying to ask ;) > > Let me try again. To simplify, lets discuss the KCMP_VM case > only. > > I do not really understand why do we need ptrace_may_access(). > I do not see any security problems with kcmp_ptr(task->mm), but > I am not expert. > > However, you added this check so I assume you have some reason. > But this can race with execve(setuid_app) and KCMP_VM can play > with task->mm after this task raises its caps. If this is fine, > then why do we need ptrace_may_access? > This makes me scratch the head ;) I think ptrace_may_access (or some other security test) should remain since it's somehow weird if non-root task will be able to find objects order from privileged task. Thus I need to find a way how to handle execve(setuid_app). Need to think... Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 19:56 ` Cyrill Gorcunov @ 2012-02-15 19:57 ` Vasiliy Kulikov 2012-02-15 20:05 ` Cyrill Gorcunov 0 siblings, 1 reply; 48+ messages in thread From: Vasiliy Kulikov @ 2012-02-15 19:57 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On Wed, Feb 15, 2012 at 23:56 +0400, Cyrill Gorcunov wrote: > On Wed, Feb 15, 2012 at 07:43:36PM +0100, Oleg Nesterov wrote: > ... > > > > Cough... this is question I am trying to ask ;) > > > > Let me try again. To simplify, lets discuss the KCMP_VM case > > only. > > > > I do not really understand why do we need ptrace_may_access(). > > I do not see any security problems with kcmp_ptr(task->mm), but > > I am not expert. > > > > However, you added this check so I assume you have some reason. > > But this can race with execve(setuid_app) and KCMP_VM can play > > with task->mm after this task raises its caps. If this is fine, > > then why do we need ptrace_may_access? > > > > This makes me scratch the head ;) I think ptrace_may_access (or > some other security test) should remain since it's somehow weird > if non-root task will be able to find objects order from privileged > task. Thus I need to find a way how to handle execve(setuid_app). > Need to think... Look at fs/proc/base.c:lock_trace() - it locks ->cred_guard_mutex for the whole period of time when it uses a resource. -- Vasiliy Kulikov http://www.openwall.com - bringing security into open computing environments ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 19:57 ` Vasiliy Kulikov @ 2012-02-15 20:05 ` Cyrill Gorcunov 2012-02-15 20:25 ` Cyrill Gorcunov 0 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-15 20:05 UTC (permalink / raw) To: Vasiliy Kulikov Cc: Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On Wed, Feb 15, 2012 at 11:57:33PM +0400, Vasiliy Kulikov wrote: > > > > This makes me scratch the head ;) I think ptrace_may_access (or > > some other security test) should remain since it's somehow weird > > if non-root task will be able to find objects order from privileged > > task. Thus I need to find a way how to handle execve(setuid_app). > > Need to think... > > Look at fs/proc/base.c:lock_trace() - it locks ->cred_guard_mutex > for the whole period of time when it uses a resource. Yup, thanks Vasiliy! I've just found cred_guard_mutex in install_exec_creds. Now I'm thinking if this is what we need here ;) Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 20:05 ` Cyrill Gorcunov @ 2012-02-15 20:25 ` Cyrill Gorcunov 2012-02-15 21:09 ` Cyrill Gorcunov 0 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-15 20:25 UTC (permalink / raw) To: Vasiliy Kulikov, Oleg Nesterov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On Thu, Feb 16, 2012 at 12:05:33AM +0400, Cyrill Gorcunov wrote: > On Wed, Feb 15, 2012 at 11:57:33PM +0400, Vasiliy Kulikov wrote: > > > > > > This makes me scratch the head ;) I think ptrace_may_access (or > > > some other security test) should remain since it's somehow weird > > > if non-root task will be able to find objects order from privileged > > > task. Thus I need to find a way how to handle execve(setuid_app). > > > Need to think... > > > > Look at fs/proc/base.c:lock_trace() - it locks ->cred_guard_mutex > > for the whole period of time when it uses a resource. > > Yup, thanks Vasiliy! I've just found cred_guard_mutex in > install_exec_creds. Now I'm thinking if this is what we > need here ;) > Something like below I think (not yet tested, overall update). Cyrill --- diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c --- linux-2.6.git/kernel/kcmp.c +++ linux-2.6.git/kernel/kcmp.c @@ -44,20 +44,34 @@ static struct file * get_file_raw_ptr(struct task_struct *task, unsigned int idx) { - struct fdtable *fdt; - struct file *file; + struct file *file = NULL; - spin_lock(&task->files->file_lock); - fdt = files_fdtable(task->files); - if (idx < fdt->max_fds) - file = fdt->fd[idx]; - else - file = NULL; - spin_unlock(&task->files->file_lock); + task_lock(task); + rcu_read_lock(); + + if (task->files) + file = fcheck_files(task->files, idx); + + rcu_read_unlock(); + task_unlock(task); return file; } +static int may_access(struct task_struct *task) +{ + int err = mutex_lock_killable(&task->signal->cred_guard_mutex); + if (err) + return err; + + if (!ptrace_may_access(task, PTRACE_MODE_READ)) { + mutex_unlock(&task->signal->cred_guard_mutex); + return -EPERM; + } + + return 0; +} + SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, unsigned long, idx1, unsigned long, idx2) { @@ -82,11 +96,12 @@ /* * One should have enough rights to inspect task details. */ - if (!ptrace_may_access(task1, PTRACE_MODE_READ) || - !ptrace_may_access(task2, PTRACE_MODE_READ)) { - ret = -EACCES; + ret = may_access(task1); + if (ret) goto err; - } + ret = may_access(task2); + if (ret) + goto err_unlock; switch (type) { case KCMP_FILE: { @@ -130,6 +145,9 @@ break; } + mutex_unlock(&task2->signal->cred_guard_mutex); +err_unlock: + mutex_unlock(&task1->signal->cred_guard_mutex); err: put_task_struct(task1); put_task_struct(task2); ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 20:25 ` Cyrill Gorcunov @ 2012-02-15 21:09 ` Cyrill Gorcunov 2012-02-15 21:58 ` Cyrill Gorcunov 0 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-15 21:09 UTC (permalink / raw) To: Vasiliy Kulikov, Oleg Nesterov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On Thu, Feb 16, 2012 at 12:25:38AM +0400, Cyrill Gorcunov wrote: > > Something like below I think (not yet tested, overall update). > OK, this one I've just tested. Please review, so I won't miss something obvious (I'm fetching linux-next now, if the patch is already there I'll cook one on top). Cyrill --- diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c --- linux-2.6.git/kernel/kcmp.c +++ linux-2.6.git/kernel/kcmp.c @@ -44,20 +44,38 @@ static struct file * get_file_raw_ptr(struct task_struct *task, unsigned int idx) { - struct fdtable *fdt; - struct file *file; + struct file *file = NULL; - spin_lock(&task->files->file_lock); - fdt = files_fdtable(task->files); - if (idx < fdt->max_fds) - file = fdt->fd[idx]; - else - file = NULL; - spin_unlock(&task->files->file_lock); + task_lock(task); + rcu_read_lock(); + + if (task->files) + file = fcheck_files(task->files, idx); + + rcu_read_unlock(); + task_unlock(task); return file; } +static void access_unlock(struct task_struct *task) +{ + mutex_unlock(&task->signal->cred_guard_mutex); +} + +static int access_trylock(struct task_struct *task) +{ + if (!mutex_trylock(&task->signal->cred_guard_mutex)) + return -EBUSY; + + if (!ptrace_may_access(task, PTRACE_MODE_READ)) { + mutex_unlock(&task->signal->cred_guard_mutex); + return -EPERM; + } + + return 0; +} + SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, unsigned long, idx1, unsigned long, idx2) { @@ -82,11 +100,12 @@ /* * One should have enough rights to inspect task details. */ - if (!ptrace_may_access(task1, PTRACE_MODE_READ) || - !ptrace_may_access(task2, PTRACE_MODE_READ)) { - ret = -EACCES; + ret = access_trylock(task1); + if (ret) goto err; - } + ret = access_trylock(task2); + if (ret) + goto err_unlock; switch (type) { case KCMP_FILE: { @@ -130,6 +149,9 @@ break; } + access_unlock(task2); +err_unlock: + access_unlock(task1); err: put_task_struct(task1); put_task_struct(task2); ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 21:09 ` Cyrill Gorcunov @ 2012-02-15 21:58 ` Cyrill Gorcunov 2012-02-16 14:49 ` Oleg Nesterov 0 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-15 21:58 UTC (permalink / raw) To: Vasiliy Kulikov, Oleg Nesterov, Andrew Morton Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel [-- Attachment #1: Type: text/plain, Size: 619 bytes --] On Thu, Feb 16, 2012 at 01:09:34AM +0400, Cyrill Gorcunov wrote: > On Thu, Feb 16, 2012 at 12:25:38AM +0400, Cyrill Gorcunov wrote: > > > > Something like below I think (not yet tested, overall update). > > > > OK, this one I've just tested. Please review, so I won't miss > something obvious (I'm fetching linux-next now, if the patch > is already there I'll cook one on top). > --- Unfortunately I dont see this patch in linux-next, so I cooked it as interdiff'ed. Please review. Maybe (if this patch is fine) we could drop v8 and I would squash this changed into v9 (together with update to self test)? Cyrill [-- Attachment #2: sys-kcmp-interdiff --] [-- Type: text/plain, Size: 2197 bytes --] From: Cyrill Gorcunov <gorcunov@openvz.org> Subject: syscalls, x86: Fix __NR_kcmp execve race and potential NULL dereference Plain ptrace_may_access() check used in kcmp is not safe against race with execve(setuid_app), so we need to grab cred_guard_mutex and keep it until kcmp is finished. Also task->files may be nil, better to use task_lock and fcheck_files helpers instead of direct file_lock usage. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> --- diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c --- linux-2.6.git/kernel/kcmp.c +++ linux-2.6.git/kernel/kcmp.c @@ -44,20 +44,38 @@ static struct file * get_file_raw_ptr(struct task_struct *task, unsigned int idx) { - struct fdtable *fdt; - struct file *file; + struct file *file = NULL; - spin_lock(&task->files->file_lock); - fdt = files_fdtable(task->files); - if (idx < fdt->max_fds) - file = fdt->fd[idx]; - else - file = NULL; - spin_unlock(&task->files->file_lock); + task_lock(task); + rcu_read_lock(); + + if (task->files) + file = fcheck_files(task->files, idx); + + rcu_read_unlock(); + task_unlock(task); return file; } +static void access_unlock(struct task_struct *task) +{ + mutex_unlock(&task->signal->cred_guard_mutex); +} + +static int access_trylock(struct task_struct *task) +{ + if (!mutex_trylock(&task->signal->cred_guard_mutex)) + return -EBUSY; + + if (!ptrace_may_access(task, PTRACE_MODE_READ)) { + mutex_unlock(&task->signal->cred_guard_mutex); + return -EPERM; + } + + return 0; +} + SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, unsigned long, idx1, unsigned long, idx2) { @@ -82,11 +100,12 @@ /* * One should have enough rights to inspect task details. */ - if (!ptrace_may_access(task1, PTRACE_MODE_READ) || - !ptrace_may_access(task2, PTRACE_MODE_READ)) { - ret = -EACCES; + ret = access_trylock(task1); + if (ret) goto err; - } + ret = access_trylock(task2); + if (ret) + goto err_unlock; switch (type) { case KCMP_FILE: { @@ -130,6 +149,9 @@ break; } + access_unlock(task2); +err_unlock: + access_unlock(task1); err: put_task_struct(task1); put_task_struct(task2); ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 21:58 ` Cyrill Gorcunov @ 2012-02-16 14:49 ` Oleg Nesterov 2012-02-16 15:13 ` Cyrill Gorcunov 0 siblings, 1 reply; 48+ messages in thread From: Oleg Nesterov @ 2012-02-16 14:49 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On 02/16, Cyrill Gorcunov wrote: > > +static int access_trylock(struct task_struct *task) > +{ > + if (!mutex_trylock(&task->signal->cred_guard_mutex)) > + return -EBUSY; > + > + if (!ptrace_may_access(task, PTRACE_MODE_READ)) { > + mutex_unlock(&task->signal->cred_guard_mutex); > + return -EPERM; > + } > + > + return 0; > +} OK, this looks correct, but I don't really understand _trylock. This means the caller should always retry if -EBUSY, and kcmp(pid, pid) can never succeed. Sure, kcmp() doesn't make a lot of sense if pid1 == pid2, but this looks a bit strange. You could simply do int mutex_double_lock_killable(struct mutex *m1, struct mutex *m2) { int err; if (m2 > m1) swap(m1, m2); err = mutex_lock_killable(m1); if (!err && likely(m1 != m2)) { err = mutex_lock_killable_nested(m2); if (err) mutex_unlock(m1); } return err; } but I won't unsist. Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 14:49 ` Oleg Nesterov @ 2012-02-16 15:13 ` Cyrill Gorcunov 2012-02-16 16:49 ` Cyrill Gorcunov 0 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-16 15:13 UTC (permalink / raw) To: Oleg Nesterov Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On Thu, Feb 16, 2012 at 03:49:54PM +0100, Oleg Nesterov wrote: > On 02/16, Cyrill Gorcunov wrote: > > > > +static int access_trylock(struct task_struct *task) > > +{ > > + if (!mutex_trylock(&task->signal->cred_guard_mutex)) > > + return -EBUSY; > > + > > + if (!ptrace_may_access(task, PTRACE_MODE_READ)) { > > + mutex_unlock(&task->signal->cred_guard_mutex); > > + return -EPERM; > > + } > > + > > + return 0; > > +} > > OK, this looks correct, but I don't really understand _trylock. > This means the caller should always retry if -EBUSY, and > kcmp(pid, pid) can never succeed. Sure, kcmp() doesn't make > a lot of sense if pid1 == pid2, but this looks a bit strange. > Hi Oleg, sure I can make it this way, also I think if pid1 == pid2 and idx1 == idx2 I can return 0 immediately. > You could simply do > int mutex_double_lock_killable(struct mutex *m1, struct mutex *m2) > { > int err; > > if (m2 > m1) > swap(m1, m2); > > err = mutex_lock_killable(m1); > > if (!err && likely(m1 != m2)) { > err = mutex_lock_killable_nested(m2); > if (err) > mutex_unlock(m1); > } > > return err; > } > > but I won't insist. Initially I wanted kcmp would be brining minimum impact and if mutex is already taken by someone, we would not sleep but return immediately with -EBUSY and it would be up to caller to deside if to try again or make some delay first. I simply not sure what is better here. Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 15:13 ` Cyrill Gorcunov @ 2012-02-16 16:49 ` Cyrill Gorcunov 2012-02-16 17:40 ` Oleg Nesterov 2012-02-16 18:21 ` Vasiliy Kulikov 0 siblings, 2 replies; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-16 16:49 UTC (permalink / raw) To: Oleg Nesterov, Vasiliy Kulikov, Andrew Morton Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On Thu, Feb 16, 2012 at 07:13:40PM +0400, Cyrill Gorcunov wrote: ... > > > You could simply do > > int mutex_double_lock_killable(struct mutex *m1, struct mutex *m2) > > { > > int err; > > > > if (m2 > m1) > > swap(m1, m2); > > > > err = mutex_lock_killable(m1); > > > > if (!err && likely(m1 != m2)) { > > err = mutex_lock_killable_nested(m2); > > if (err) > > mutex_unlock(m1); > > } > > > > return err; > > } > > > > but I won't insist. > > Initially I wanted kcmp would be brining minimum impact > and if mutex is already taken by someone, we would not sleep > but return immediately with -EBUSY and it would be up to caller > to deside if to try again or make some delay first. I simply > not sure what is better here. > This one should do the trick. Cyrill --- From: Cyrill Gorcunov <gorcunov@openvz.org> Subject: syscalls, x86: Make __NR_kcmp to work with equivalent pids In case if pid1 is equal to pid2 the kcmp will return -EBUSY, which makes no sence. Make it able to work with equivalent pids. Selftest is extended as well. Repored-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> --- diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c --- linux-2.6.git/kernel/kcmp.c +++ linux-2.6.git/kernel/kcmp.c @@ -58,22 +58,31 @@ return file; } -static void access_unlock(struct task_struct *task) +static void kcmp_unlock(struct mutex *m1, struct mutex *m2) { - mutex_unlock(&task->signal->cred_guard_mutex); + if (m2 > m1) + swap(m1, m2); + + if (likely(m2 != m1)) + mutex_unlock(m2); + mutex_unlock(m1); } -static int access_trylock(struct task_struct *task) +static int kcmp_lock(struct mutex *m1, struct mutex *m2) { - if (!mutex_trylock(&task->signal->cred_guard_mutex)) - return -EBUSY; + int err; + + if (m2 > m1) + swap(m1, m2); - if (!ptrace_may_access(task, PTRACE_MODE_READ)) { - mutex_unlock(&task->signal->cred_guard_mutex); - return -EPERM; + err = mutex_lock_killable(m1); + if (!err && likely(m1 != m2)) { + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING); + if (err) + mutex_unlock(m1); } - return 0; + return err; } SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, @@ -100,12 +109,15 @@ /* * One should have enough rights to inspect task details. */ - ret = access_trylock(task1); + ret = kcmp_lock(&task1->signal->cred_guard_mutex, + &task2->signal->cred_guard_mutex); if (ret) goto err; - ret = access_trylock(task2); - if (ret) + if (!ptrace_may_access(task1, PTRACE_MODE_READ) || + !ptrace_may_access(task2, PTRACE_MODE_READ)) { + ret = -EPERM; goto err_unlock; + } switch (type) { case KCMP_FILE: { @@ -149,9 +161,9 @@ break; } - access_unlock(task2); err_unlock: - access_unlock(task1); + kcmp_unlock(&task1->signal->cred_guard_mutex, + &task2->signal->cred_guard_mutex); err: put_task_struct(task1); put_task_struct(task2); diff -u linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c --- linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c +++ linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c @@ -74,6 +74,15 @@ ret = -1; } else printf("PASS: 0 returned as expected\n"); + + /* Compare with self */ + ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0); + if (ret) { + printf("FAIL: 0 expected but %li returned\n", ret); + ret = -1; + } else + printf("PASS: 0 returned as expected\n"); + exit(ret); } ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 16:49 ` Cyrill Gorcunov @ 2012-02-16 17:40 ` Oleg Nesterov 2012-02-16 17:58 ` Cyrill Gorcunov 2012-02-16 18:21 ` Vasiliy Kulikov 1 sibling, 1 reply; 48+ messages in thread From: Oleg Nesterov @ 2012-02-16 17:40 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On 02/16, Cyrill Gorcunov wrote: > > -static void access_unlock(struct task_struct *task) > +static void kcmp_unlock(struct mutex *m1, struct mutex *m2) > { > - mutex_unlock(&task->signal->cred_guard_mutex); > + if (m2 > m1) > + swap(m1, m2); Well, the order doesn't matter in case of _unlock, you can remove this part. Not that it really hurts though, I won't argue. Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 17:40 ` Oleg Nesterov @ 2012-02-16 17:58 ` Cyrill Gorcunov 2012-02-16 19:03 ` Oleg Nesterov 0 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-16 17:58 UTC (permalink / raw) To: Oleg Nesterov Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On Thu, Feb 16, 2012 at 06:40:47PM +0100, Oleg Nesterov wrote: > On 02/16, Cyrill Gorcunov wrote: > > > > -static void access_unlock(struct task_struct *task) > > +static void kcmp_unlock(struct mutex *m1, struct mutex *m2) > > { > > - mutex_unlock(&task->signal->cred_guard_mutex); > > + if (m2 > m1) > > + swap(m1, m2); > > Well, the order doesn't matter in case of _unlock, you can remove > this part. Not that it really hurts though, I won't argue. > It drops some instructions so I think it worth removing (still unlocking not in reverse order is something which always make me nervious ;) Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 17:58 ` Cyrill Gorcunov @ 2012-02-16 19:03 ` Oleg Nesterov 2012-02-16 19:20 ` H. Peter Anvin 2012-02-16 19:29 ` Cyrill Gorcunov 0 siblings, 2 replies; 48+ messages in thread From: Oleg Nesterov @ 2012-02-16 19:03 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On 02/16, Cyrill Gorcunov wrote: > > On Thu, Feb 16, 2012 at 06:40:47PM +0100, Oleg Nesterov wrote: > > On 02/16, Cyrill Gorcunov wrote: > > > > > > -static void access_unlock(struct task_struct *task) > > > +static void kcmp_unlock(struct mutex *m1, struct mutex *m2) > > > { > > > - mutex_unlock(&task->signal->cred_guard_mutex); > > > + if (m2 > m1) > > > + swap(m1, m2); > > > > Well, the order doesn't matter in case of _unlock, you can remove > > this part. Not that it really hurts though, I won't argue. > > It drops some instructions so I think it worth removing Yes. > (still > unlocking not in reverse order is something which always make > me nervious ;) Yes ;) so let me repeat, I am not arguing. But IMHO every piece of code should be understandable. Personally I do not mind at all, just I _personally_ think this code _can_ confuse the reader, "damn why we can't simply unlock in any order???". If you add the "not really needed" comment above this swap - I agree. If you simply remove this swap - I agree as well. But. I won't argue if you prefer to keep this patch as is. You are the author. If it looks better to _you_ - OK, this is correct (afaics). Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 19:03 ` Oleg Nesterov @ 2012-02-16 19:20 ` H. Peter Anvin 2012-02-16 19:29 ` Cyrill Gorcunov 1 sibling, 0 replies; 48+ messages in thread From: H. Peter Anvin @ 2012-02-16 19:20 UTC (permalink / raw) To: Oleg Nesterov Cc: Cyrill Gorcunov, Vasiliy Kulikov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On 02/16/2012 11:03 AM, Oleg Nesterov wrote: > > so let me repeat, I am not arguing. But IMHO every piece of code > should be understandable. Amen! -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 19:03 ` Oleg Nesterov 2012-02-16 19:20 ` H. Peter Anvin @ 2012-02-16 19:29 ` Cyrill Gorcunov 2012-02-16 19:52 ` Andrew Morton 1 sibling, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-16 19:29 UTC (permalink / raw) To: Oleg Nesterov Cc: Vasiliy Kulikov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On Thu, Feb 16, 2012 at 08:03:21PM +0100, Oleg Nesterov wrote: > On 02/16, Cyrill Gorcunov wrote: > > > > On Thu, Feb 16, 2012 at 06:40:47PM +0100, Oleg Nesterov wrote: > > > On 02/16, Cyrill Gorcunov wrote: > > > > > > > > -static void access_unlock(struct task_struct *task) > > > > +static void kcmp_unlock(struct mutex *m1, struct mutex *m2) > > > > { > > > > - mutex_unlock(&task->signal->cred_guard_mutex); > > > > + if (m2 > m1) > > > > + swap(m1, m2); > > > > > > Well, the order doesn't matter in case of _unlock, you can remove > > > this part. Not that it really hurts though, I won't argue. > > > > It drops some instructions so I think it worth removing > > Yes. > Final one ;) I agreed on every line of your comment, thanks a lot Oleg! --- From: Cyrill Gorcunov <gorcunov@openvz.org> Subject: syscalls, x86: Make __NR_kcmp to work with equivalent pids In case if pid1 is equal to pid2 the kcmp will return -EBUSY, which makes no sence. Make it able to work with equivalent pids. Selftest is extended as well. Repored-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> --- diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c --- linux-2.6.git/kernel/kcmp.c +++ linux-2.6.git/kernel/kcmp.c @@ -58,22 +58,28 @@ return file; } -static void access_unlock(struct task_struct *task) +static void kcmp_unlock(struct mutex *m1, struct mutex *m2) { - mutex_unlock(&task->signal->cred_guard_mutex); + if (likely(m2 != m1)) + mutex_unlock(m2); + mutex_unlock(m1); } -static int access_trylock(struct task_struct *task) +static int kcmp_lock(struct mutex *m1, struct mutex *m2) { - if (!mutex_trylock(&task->signal->cred_guard_mutex)) - return -EBUSY; + int err; - if (!ptrace_may_access(task, PTRACE_MODE_READ)) { - mutex_unlock(&task->signal->cred_guard_mutex); - return -EPERM; + if (m2 > m1) + swap(m1, m2); + + err = mutex_lock_killable(m1); + if (!err && likely(m1 != m2)) { + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING); + if (err) + mutex_unlock(m1); } - return 0; + return err; } SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, @@ -100,12 +106,15 @@ /* * One should have enough rights to inspect task details. */ - ret = access_trylock(task1); + ret = kcmp_lock(&task1->signal->cred_guard_mutex, + &task2->signal->cred_guard_mutex); if (ret) goto err; - ret = access_trylock(task2); - if (ret) + if (!ptrace_may_access(task1, PTRACE_MODE_READ) || + !ptrace_may_access(task2, PTRACE_MODE_READ)) { + ret = -EPERM; goto err_unlock; + } switch (type) { case KCMP_FILE: { @@ -149,9 +158,9 @@ break; } - access_unlock(task2); err_unlock: - access_unlock(task1); + kcmp_unlock(&task1->signal->cred_guard_mutex, + &task2->signal->cred_guard_mutex); err: put_task_struct(task1); put_task_struct(task2); diff -u linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c --- linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c +++ linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c @@ -74,6 +74,15 @@ ret = -1; } else printf("PASS: 0 returned as expected\n"); + + /* Compare with self */ + ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0); + if (ret) { + printf("FAIL: 0 expected but %li returned\n", ret); + ret = -1; + } else + printf("PASS: 0 returned as expected\n"); + exit(ret); } ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 19:29 ` Cyrill Gorcunov @ 2012-02-16 19:52 ` Andrew Morton 2012-02-16 20:01 ` Cyrill Gorcunov 0 siblings, 1 reply; 48+ messages in thread From: Andrew Morton @ 2012-02-16 19:52 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Oleg Nesterov, Vasiliy Kulikov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On Thu, 16 Feb 2012 23:29:52 +0400 Cyrill Gorcunov <gorcunov@openvz.org> wrote: > Final one ;) syscalls-x86-add-__nr_kcmp-syscall-v8-fix-fix-fix-fix-fix.patch is a tie for the world record! I folded everything back into a single patch. Here's what we currently have: From: Cyrill Gorcunov <gorcunov@openvz.org> Subject: syscalls, x86: add __NR_kcmp syscall While doing the checkpoint-restore in the user space one need to determine whether various kernel objects (like mm_struct-s of file_struct-s) are shared between tasks and restore this state. The 2nd step can be solved by using appropriate CLONE_ flags and the unshare syscall, while there's currently no ways for solving the 1st one. One of the ways for checking whether two tasks share e.g. mm_struct is to provide some mm_struct ID of a task to its proc file, but showing such info considered to be not that good for security reasons. Thus after some debates we end up in conclusion that using that named 'comparison' syscall might be the best candidate. So here is it -- __NR_kcmp. It takes up to 5 arguments - the pids of the two tasks (which characteristics should be compared), the comparison type and (in case of comparison of files) two file descriptors. Lookups for pids are done in the caller's PID namespace only. At moment only x86 is supported and tested. [akpm@linux-foundation.org: fix up selftests, warnings] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Glauber Costa <glommer@parallels.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tejun Heo <tj@kernel.org> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Valdis.Kletnieks@vt.edu Cc: Michal Marek <mmarek@suse.cz> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- arch/x86/syscalls/syscall_32.tbl | 1 arch/x86/syscalls/syscall_64.tbl | 1 include/linux/kcmp.h | 17 + include/linux/syscalls.h | 2 kernel/Makefile | 3 kernel/kcmp.c | 186 +++++++++++++++++++++ kernel/sys_ni.c | 3 tools/testing/selftests/Makefile | 2 tools/testing/selftests/kcmp/Makefile | 29 +++ tools/testing/selftests/kcmp/kcmp_test.c | 94 ++++++++++ 10 files changed, 337 insertions(+), 1 deletion(-) diff -puN arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_32.tbl --- a/arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/arch/x86/syscalls/syscall_32.tbl @@ -355,3 +355,4 @@ 346 i386 setns sys_setns 347 i386 process_vm_readv sys_process_vm_readv compat_sys_process_vm_readv 348 i386 process_vm_writev sys_process_vm_writev compat_sys_process_vm_writev +349 i386 kcmp sys_kcmp diff -puN arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_64.tbl --- a/arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/arch/x86/syscalls/syscall_64.tbl @@ -318,3 +318,4 @@ 309 64 getcpu sys_getcpu 310 64 process_vm_readv sys_process_vm_readv 311 64 process_vm_writev sys_process_vm_writev +312 64 kcmp sys_kcmp diff -puN /dev/null include/linux/kcmp.h --- /dev/null +++ a/include/linux/kcmp.h @@ -0,0 +1,17 @@ +#ifndef _LINUX_KCMP_H +#define _LINUX_KCMP_H + +/* Comparison type */ +enum kcmp_type { + KCMP_FILE, + KCMP_VM, + KCMP_FILES, + KCMP_FS, + KCMP_SIGHAND, + KCMP_IO, + KCMP_SYSVSEM, + + KCMP_TYPES, +}; + +#endif /* _LINUX_KCMP_H */ diff -puN include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8 include/linux/syscalls.h --- a/include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/include/linux/syscalls.h @@ -857,4 +857,6 @@ asmlinkage long sys_process_vm_writev(pi unsigned long riovcnt, unsigned long flags); +asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type, + unsigned long idx1, unsigned long idx2); #endif diff -puN kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/Makefile --- a/kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/kernel/Makefile @@ -25,6 +25,9 @@ endif obj-y += sched/ obj-y += power/ +ifeq ($(CONFIG_CHECKPOINT_RESTORE),y) +obj-$(CONFIG_X86) += kcmp.o +endif obj-$(CONFIG_FREEZER) += freezer.o obj-$(CONFIG_PROFILING) += profile.o obj-$(CONFIG_STACKTRACE) += stacktrace.o diff -puN /dev/null kernel/kcmp.c --- /dev/null +++ a/kernel/kcmp.c @@ -0,0 +1,186 @@ +#include <linux/kernel.h> +#include <linux/syscalls.h> +#include <linux/fdtable.h> +#include <linux/string.h> +#include <linux/random.h> +#include <linux/module.h> +#include <linux/init.h> +#include <linux/cache.h> +#include <linux/bug.h> +#include <linux/err.h> +#include <linux/kcmp.h> + +#include <asm/unistd.h> + +/* + * We don't expose real in-memory order of objects for security + * reasons, still the comparison results should be suitable for + * sorting. Thus, we obfuscate kernel pointers values and compare + * the production instead. + */ +static unsigned long cookies[KCMP_TYPES][2] __read_mostly; + +static long kptr_obfuscate(long v, int type) +{ + return (v ^ cookies[type][0]) * cookies[type][1]; +} + +/* + * 0 - equal, i.e. v1 = v2 + * 1 - less than, i.e. v1 < v2 + * 2 - greater than, i.e. v1 > v2 + * 3 - not equal but ordering unavailable (reserved for future) + */ +static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type) +{ + long ret; + + ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type); + + return (ret < 0) | ((ret > 0) << 1); +} + +/* The caller must have pinned the task */ +static struct file * +get_file_raw_ptr(struct task_struct *task, unsigned int idx) +{ + struct file *file = NULL; + + task_lock(task); + rcu_read_lock(); + + if (task->files) + file = fcheck_files(task->files, idx); + + rcu_read_unlock(); + task_unlock(task); + + return file; +} + +static void kcmp_unlock(struct mutex *m1, struct mutex *m2) +{ + if (likely(m2 != m1)) + mutex_unlock(m2); + mutex_unlock(m1); +} + +static int kcmp_lock(struct mutex *m1, struct mutex *m2) +{ + int err; + + if (m2 > m1) + swap(m1, m2); + + err = mutex_lock_killable(m1); + if (!err && likely(m1 != m2)) { + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING); + if (err) + mutex_unlock(m1); + } + + return err; +} + +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, + unsigned long, idx1, unsigned long, idx2) +{ + struct task_struct *task1, *task2; + int ret; + + rcu_read_lock(); + + /* + * Tasks are looked up in caller's PID namespace only. + */ + task1 = find_task_by_vpid(pid1); + task2 = find_task_by_vpid(pid2); + if (!task1 || !task2) + goto err_no_task; + + get_task_struct(task1); + get_task_struct(task2); + + rcu_read_unlock(); + + /* + * One should have enough rights to inspect task details. + */ + ret = kcmp_lock(&task1->signal->cred_guard_mutex, + &task2->signal->cred_guard_mutex); + if (ret) + goto err; + if (!ptrace_may_access(task1, PTRACE_MODE_READ) || + !ptrace_may_access(task2, PTRACE_MODE_READ)) { + ret = -EPERM; + goto err_unlock; + } + + switch (type) { + case KCMP_FILE: { + struct file *filp1, *filp2; + + filp1 = get_file_raw_ptr(task1, idx1); + filp2 = get_file_raw_ptr(task2, idx2); + + if (filp1 && filp2) + ret = kcmp_ptr(filp1, filp2, KCMP_FILE); + else + ret = -EBADF; + break; + } + case KCMP_VM: + ret = kcmp_ptr(task1->mm, task2->mm, KCMP_VM); + break; + case KCMP_FILES: + ret = kcmp_ptr(task1->files, task2->files, KCMP_FILES); + break; + case KCMP_FS: + ret = kcmp_ptr(task1->fs, task2->fs, KCMP_FS); + break; + case KCMP_SIGHAND: + ret = kcmp_ptr(task1->sighand, task2->sighand, KCMP_SIGHAND); + break; + case KCMP_IO: + ret = kcmp_ptr(task1->io_context, task2->io_context, KCMP_IO); + break; + case KCMP_SYSVSEM: +#ifdef CONFIG_SYSVIPC + ret = kcmp_ptr(task1->sysvsem.undo_list, + task2->sysvsem.undo_list, + KCMP_SYSVSEM); +#else + ret = -EOPNOTSUP; +#endif + break; + default: + ret = -EINVAL; + break; + } + +err_unlock: + kcmp_unlock(&task1->signal->cred_guard_mutex, + &task2->signal->cred_guard_mutex); +err: + put_task_struct(task1); + put_task_struct(task2); + + return ret; + +err_no_task: + rcu_read_unlock(); + return -ESRCH; +} + +static __init int kcmp_cookies_init(void) +{ + int i; + + get_random_bytes(cookies, sizeof(cookies)); + + for (i = 0; i < KCMP_TYPES; i++) + cookies[i][1] |= (~(~0UL >> 1) | 1); + + return 0; +} +arch_initcall(kcmp_cookies_init); diff -puN kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/sys_ni.c --- a/kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/kernel/sys_ni.c @@ -203,3 +203,6 @@ cond_syscall(sys_fanotify_mark); cond_syscall(sys_name_to_handle_at); cond_syscall(sys_open_by_handle_at); cond_syscall(compat_sys_open_by_handle_at); + +/* compare kernel pointers */ +cond_syscall(sys_kcmp); diff -puN /dev/null tools/testing/selftests/kcmp/Makefile --- /dev/null +++ a/tools/testing/selftests/kcmp/Makefile @@ -0,0 +1,29 @@ +uname_M := $(shell uname -m 2>/dev/null || echo not) +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/) +ifeq ($(ARCH),i386) + ARCH := X86 + CFLAGS := -DCONFIG_X86_32 -D__i386__ +endif +ifeq ($(ARCH),x86_64) + ARCH := X86 + CFLAGS := -DCONFIG_X86_64 -D__x86_64__ +endif + +CFLAGS += -I../../../../arch/x86/include/generated/ +CFLAGS += -I../../../../include/ +CFLAGS += -I../../../../usr/include/ +CFLAGS += -I../../../../arch/x86/include/ + +all: +ifeq ($(ARCH),X86) + gcc $(CFLAGS) kcmp_test.c -o run_test +else + echo "Not an x86 target, can't build kcmp selftest" +endif + +run-tests: all + ./kcmp_test + +clean: + rm -fr ./run_test + rm -fr ./test-file diff -puN /dev/null tools/testing/selftests/kcmp/kcmp_test.c --- /dev/null +++ a/tools/testing/selftests/kcmp/kcmp_test.c @@ -0,0 +1,94 @@ +#define _GNU_SOURCE + +#include <stdio.h> +#include <stdlib.h> +#include <signal.h> +#include <limits.h> +#include <unistd.h> +#include <errno.h> +#include <string.h> +#include <fcntl.h> + +#include <linux/unistd.h> +#include <linux/kcmp.h> + +#include <sys/syscall.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/wait.h> + +static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2) +{ + return syscall(__NR_kcmp, pid1, pid2, type, fd1, fd2); +} + +int main(int argc, char **argv) +{ + const char kpath[] = "kcmp-test-file"; + int pid1, pid2; + int fd1, fd2; + int status; + + fd1 = open(kpath, O_RDWR | O_CREAT | O_TRUNC, 0644); + pid1 = getpid(); + + if (fd1 < 0) { + perror("Can't create file"); + exit(1); + } + + pid2 = fork(); + if (pid2 < 0) { + perror("fork failed"); + exit(1); + } + + if (!pid2) { + int pid2 = getpid(); + int ret; + + fd2 = open(kpath, O_RDWR, 0644); + if (fd2 < 0) { + perror("Can't open file"); + exit(1); + } + + /* An example of output and arguments */ + printf("pid1: %6d pid2: %6d FD: %2ld FILES: %2ld VM: %2ld " + "FS: %2ld SIGHAND: %2ld IO: %2ld SYSVSEM: %2ld " + "INV: %2ld\n", + pid1, pid2, + sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd2), + sys_kcmp(pid1, pid2, KCMP_FILES, 0, 0), + sys_kcmp(pid1, pid2, KCMP_VM, 0, 0), + sys_kcmp(pid1, pid2, KCMP_FS, 0, 0), + sys_kcmp(pid1, pid2, KCMP_SIGHAND, 0, 0), + sys_kcmp(pid1, pid2, KCMP_IO, 0, 0), + sys_kcmp(pid1, pid2, KCMP_SYSVSEM, 0, 0), + + /* This one should fail */ + sys_kcmp(pid1, pid2, KCMP_TYPES + 1, 0, 0)); + + /* This one should return same fd */ + ret = sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd1); + if (ret) { + printf("FAIL: 0 expected but %d returned\n", ret); + ret = -1; + } else + printf("PASS: 0 returned as expected\n"); + + /* Compare with self */ + ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0); + if (ret) { + printf("FAIL: 0 expected but %li returned\n", ret); + ret = -1; + } else + printf("PASS: 0 returned as expected\n"); + + exit(ret); + } + + waitpid(pid2, &status, P_ALL); + + return 0; +} diff -puN tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 tools/testing/selftests/Makefile --- a/tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/tools/testing/selftests/Makefile @@ -1,4 +1,4 @@ -TARGETS = breakpoints vm +TARGETS = breakpoints vm kcmp all: for TARGET in $(TARGETS); do \ _ ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 19:52 ` Andrew Morton @ 2012-02-16 20:01 ` Cyrill Gorcunov 0 siblings, 0 replies; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-16 20:01 UTC (permalink / raw) To: Andrew Morton Cc: Oleg Nesterov, Vasiliy Kulikov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On Thu, Feb 16, 2012 at 11:52:08AM -0800, Andrew Morton wrote: > On Thu, 16 Feb 2012 23:29:52 +0400 > Cyrill Gorcunov <gorcunov@openvz.org> wrote: > > > Final one ;) > > syscalls-x86-add-__nr_kcmp-syscall-v8-fix-fix-fix-fix-fix.patch > is a tie for the world record! It was close to v13 ;) > > I folded everything back into a single patch. Here's what we currently > have: > Thanks a lot, Andrew! Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 16:49 ` Cyrill Gorcunov 2012-02-16 17:40 ` Oleg Nesterov @ 2012-02-16 18:21 ` Vasiliy Kulikov 2012-02-16 18:34 ` Cyrill Gorcunov 2012-02-16 18:49 ` Oleg Nesterov 1 sibling, 2 replies; 48+ messages in thread From: Vasiliy Kulikov @ 2012-02-16 18:21 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Oleg Nesterov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote: > + err = mutex_lock_killable(m1); > + if (!err && likely(m1 != m2)) { > + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING); Doesn't it lead to a deadlock? mutex_lock_killable(task1) | mutex_lock_killable(task2) mutex_lock_killable_nested(task2) | (locked) | mutex_lock_killable_nested(task1) (locked) I suppose you should use some global lock (kcmp_lock) before both locks. Thanks, -- Vasiliy Kulikov http://www.openwall.com - bringing security into open computing environments ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 18:21 ` Vasiliy Kulikov @ 2012-02-16 18:34 ` Cyrill Gorcunov 2012-02-16 18:33 ` Vasiliy Kulikov 2012-02-16 18:49 ` Oleg Nesterov 1 sibling, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-16 18:34 UTC (permalink / raw) To: Vasiliy Kulikov Cc: Oleg Nesterov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On Thu, Feb 16, 2012 at 10:21:06PM +0400, Vasiliy Kulikov wrote: > On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote: > > + err = mutex_lock_killable(m1); > > + if (!err && likely(m1 != m2)) { > > + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING); > > Doesn't it lead to a deadlock? > > mutex_lock_killable(task1) > | mutex_lock_killable(task2) > mutex_lock_killable_nested(task2) | > (locked) | > mutex_lock_killable_nested(task1) > (locked) > > I suppose you should use some global lock (kcmp_lock) before both locks. but here is if (m1 > m2) and swap() do take place. Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 18:34 ` Cyrill Gorcunov @ 2012-02-16 18:33 ` Vasiliy Kulikov 0 siblings, 0 replies; 48+ messages in thread From: Vasiliy Kulikov @ 2012-02-16 18:33 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Oleg Nesterov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On Thu, Feb 16, 2012 at 22:34 +0400, Cyrill Gorcunov wrote: > On Thu, Feb 16, 2012 at 10:21:06PM +0400, Vasiliy Kulikov wrote: > > On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote: > > > + err = mutex_lock_killable(m1); > > > + if (!err && likely(m1 != m2)) { > > > + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING); > > > > Doesn't it lead to a deadlock? > > > > mutex_lock_killable(task1) > > | mutex_lock_killable(task2) > > mutex_lock_killable_nested(task2) | > > (locked) | > > mutex_lock_killable_nested(task1) > > (locked) > > > > I suppose you should use some global lock (kcmp_lock) before both locks. > > but here is if (m1 > m2) and swap() do take place. Ah, ok. Then this deadlock scenario is impossible, sorry. -- Vasiliy Kulikov http://www.openwall.com - bringing security into open computing environments ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-16 18:21 ` Vasiliy Kulikov 2012-02-16 18:34 ` Cyrill Gorcunov @ 2012-02-16 18:49 ` Oleg Nesterov 1 sibling, 0 replies; 48+ messages in thread From: Oleg Nesterov @ 2012-02-16 18:49 UTC (permalink / raw) To: Vasiliy Kulikov Cc: Cyrill Gorcunov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel On 02/16, Vasiliy Kulikov wrote: > > On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote: > > + err = mutex_lock_killable(m1); > > + if (!err && likely(m1 != m2)) { > > + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING); > > Doesn't it lead to a deadlock? > > mutex_lock_killable(task1) > | mutex_lock_killable(task2) > mutex_lock_killable_nested(task2) | > (locked) | > mutex_lock_killable_nested(task1) > (locked) Please note the if (m1 >= m2) swap(m1. m2) at the start. Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 15:10 ` Cyrill Gorcunov 2012-02-15 15:38 ` Oleg Nesterov @ 2012-02-15 18:32 ` Cyrill Gorcunov 2012-02-15 19:06 ` Oleg Nesterov 1 sibling, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-15 18:32 UTC (permalink / raw) To: Oleg Nesterov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On Wed, Feb 15, 2012 at 07:10:08PM +0400, Cyrill Gorcunov wrote: > On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote: > > > > task->files can be NULL, we can race with exit_files(). > > So I need to call get_files_struct. Thanks Oleg! > Something like below I think, right? (it seems this patc doesn't hit lenux-next yet, so here is interdiff output) Cyrill --- diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c --- linux-2.6.git/kernel/kcmp.c +++ linux-2.6.git/kernel/kcmp.c @@ -44,16 +44,25 @@ static struct file * get_file_raw_ptr(struct task_struct *task, unsigned int idx) { + struct files_struct *files; struct fdtable *fdt; struct file *file; - spin_lock(&task->files->file_lock); - fdt = files_fdtable(task->files); + files = get_files_struct(task); + if (!files) + return NULL; + + rcu_read_lock(); + + fdt = files_fdtable(files); if (idx < fdt->max_fds) file = fdt->fd[idx]; else file = NULL; - spin_unlock(&task->files->file_lock); + + rcu_read_unlock(); + + put_files_struct(files); return file; } ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 18:32 ` Cyrill Gorcunov @ 2012-02-15 19:06 ` Oleg Nesterov 2012-02-15 19:18 ` Cyrill Gorcunov 0 siblings, 1 reply; 48+ messages in thread From: Oleg Nesterov @ 2012-02-15 19:06 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On 02/15, Cyrill Gorcunov wrote: > > On Wed, Feb 15, 2012 at 07:10:08PM +0400, Cyrill Gorcunov wrote: > > On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote: > > > > > > task->files can be NULL, we can race with exit_files(). > > > > So I need to call get_files_struct. Thanks Oleg! > > > > Something like below I think, right? (it seems this > patc doesn't hit lenux-next yet, so here is interdiff > output) > > Cyrill > --- > diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c > --- linux-2.6.git/kernel/kcmp.c > +++ linux-2.6.git/kernel/kcmp.c > @@ -44,16 +44,25 @@ > static struct file * > get_file_raw_ptr(struct task_struct *task, unsigned int idx) > { > + struct files_struct *files; > struct fdtable *fdt; > struct file *file; > > - spin_lock(&task->files->file_lock); > - fdt = files_fdtable(task->files); > + files = get_files_struct(task); > + if (!files) > + return NULL; > + > + rcu_read_lock(); > + > + fdt = files_fdtable(files); > if (idx < fdt->max_fds) > file = fdt->fd[idx]; > else > file = NULL; > - spin_unlock(&task->files->file_lock); > + > + rcu_read_unlock(); > + > + put_files_struct(files); I think this should work. Or you can simply do struct file *file = NULL; task_lock(task); rcu_read_lock(); if (task->files) file = fcheck_files(task->files, idx); rcu_read_unlock(); task_unlock(task); Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 19:06 ` Oleg Nesterov @ 2012-02-15 19:18 ` Cyrill Gorcunov 0 siblings, 0 replies; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-15 19:18 UTC (permalink / raw) To: Oleg Nesterov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On Wed, Feb 15, 2012 at 08:06:22PM +0100, Oleg Nesterov wrote: > > I think this should work. Or you can simply do > > struct file *file = NULL; > > task_lock(task); > rcu_read_lock(); > if (task->files) > file = fcheck_files(task->files, idx); > rcu_read_unlock(); > task_unlock(task); > Yeah, thanks, this one is better. Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 14:36 + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree Oleg Nesterov 2012-02-15 15:10 ` Cyrill Gorcunov @ 2012-02-15 16:06 ` Oleg Nesterov 2012-02-15 16:27 ` Cyrill Gorcunov 1 sibling, 1 reply; 48+ messages in thread From: Oleg Nesterov @ 2012-02-15 16:06 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel Not a comment, but the question. I am just curious... > +/* > + * We don't expose real in-memory order of objects for security > + * reasons, still the comparison results should be suitable for > + * sorting. Thus, we obfuscate kernel pointers values and compare > + * the production instead. > + */ > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly; > + > +static long kptr_obfuscate(long v, int type) > +{ > + return (v ^ cookies[type][0]) * cookies[type][1]; > +} OK, but why do we need this per type? Just to add more obfuscation or there is another reason? > +static __init int kcmp_cookies_init(void) > +{ > + int i; > + > + get_random_bytes(cookies, sizeof(cookies)); > + > + for (i = 0; i < KCMP_TYPES; i++) > + cookies[i][1] |= (~(~0UL >> 1) | 1); I am puzzled, help ;) this is equal to cookies[i][1] |= -LONG_MAX; or cookies[i][1] |= (LONG_MIN | 1); for what? why do we want to set these 2 bits (MSB and LSB) ? Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 16:06 ` Oleg Nesterov @ 2012-02-15 16:27 ` Cyrill Gorcunov 2012-04-09 22:10 ` Andrew Morton 0 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-02-15 16:27 UTC (permalink / raw) To: Oleg Nesterov Cc: Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, Andrew Morton, linux-kernel On Wed, Feb 15, 2012 at 05:06:52PM +0100, Oleg Nesterov wrote: > Not a comment, but the question. I am just curious... > > > +/* > > + * We don't expose real in-memory order of objects for security > > + * reasons, still the comparison results should be suitable for > > + * sorting. Thus, we obfuscate kernel pointers values and compare > > + * the production instead. > > + */ > > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly; > > + > > +static long kptr_obfuscate(long v, int type) > > +{ > > + return (v ^ cookies[type][0]) * cookies[type][1]; > > +} > > OK, but why do we need this per type? Just to add more obfuscation > or there is another reason? Just to add more obfuscation. > > > +static __init int kcmp_cookies_init(void) > > +{ > > + int i; > > + > > + get_random_bytes(cookies, sizeof(cookies)); > > + > > + for (i = 0; i < KCMP_TYPES; i++) > > + cookies[i][1] |= (~(~0UL >> 1) | 1); > > I am puzzled, help ;) this is equal to > > cookies[i][1] |= -LONG_MAX; > or > cookies[i][1] |= (LONG_MIN | 1); > > for what? why do we want to set these 2 bits (MSB and LSB) ? Letme quote hpa@ here :) | This code is wrong. You will have a zero cookie, legitimately, once in | 2^32 or 2^64 attempts, depending on the bitness. | | The other thing is that for the multiplicative cookie you should OR in | the value (~(~0UL >> 1) | 1) in order to make sure that the value is (a) | large and (b) odd. Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-02-15 16:27 ` Cyrill Gorcunov @ 2012-04-09 22:10 ` Andrew Morton 2012-04-09 22:24 ` Cyrill Gorcunov ` (2 more replies) 0 siblings, 3 replies; 48+ messages in thread From: Andrew Morton @ 2012-04-09 22:10 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Oleg Nesterov, Jonathan Corbet, H. Peter Anvin Back on to kcmp. On Wed, 15 Feb 2012 20:27:52 +0400 Cyrill Gorcunov <gorcunov@openvz.org> wrote: > On Wed, Feb 15, 2012 at 05:06:52PM +0100, Oleg Nesterov wrote: > > Not a comment, but the question. I am just curious... > > > > > +/* > > > + * We don't expose real in-memory order of objects for security > > > + * reasons, still the comparison results should be suitable for > > > + * sorting. Thus, we obfuscate kernel pointers values and compare > > > + * the production instead. > > > + */ > > > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly; > > > + > > > +static long kptr_obfuscate(long v, int type) > > > +{ > > > + return (v ^ cookies[type][0]) * cookies[type][1]; > > > +} > > > > OK, but why do we need this per type? Just to add more obfuscation > > or there is another reason? > > Just to add more obfuscation. Having re-read most of the (enormous) email discussion on the kcmp() syscall patch, I'm thinking: - Nobody seems to understand the obfuscation logic. Jon sounded confused, Oleg sounds confused and it's rather unclear what it does, how it does it and why it does it. - Lots of people have looked at the code and made comments and there have been lots of changes. But we presently have zero Acked-by's and Reviewed-by's. I guess this means that at present nobody is aware of any issues with the proposal, btu nobody is terribly excisted about it either? So what do people think? Any issues? Any nacks? Should I sneak it into Linus this week or do we need to go another round with it all? I'd like to at least have a fighting chance of understnading what's going on with that obfuscation code. From: Cyrill Gorcunov <gorcunov@openvz.org> Subject: syscalls, x86: add __NR_kcmp syscall While doing the checkpoint-restore in the user space one need to determine whether various kernel objects (like mm_struct-s of file_struct-s) are shared between tasks and restore this state. The 2nd step can be solved by using appropriate CLONE_ flags and the unshare syscall, while there's currently no ways for solving the 1st one. One of the ways for checking whether two tasks share e.g. mm_struct is to provide some mm_struct ID of a task to its proc file, but showing such info considered to be not that good for security reasons. Thus after some debates we end up in conclusion that using that named 'comparison' syscall might be the best candidate. So here is it -- __NR_kcmp. It takes up to 5 arguments - the pids of the two tasks (which characteristics should be compared), the comparison type and (in case of comparison of files) two file descriptors. Lookups for pids are done in the caller's PID namespace only. At moment only x86 is supported and tested. [akpm@linux-foundation.org: fix up selftests, warnings] [akpm@linux-foundation.org: include errno.h] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Glauber Costa <glommer@parallels.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tejun Heo <tj@kernel.org> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Valdis.Kletnieks@vt.edu Cc: Michal Marek <mmarek@suse.cz> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- arch/x86/syscalls/syscall_32.tbl | 1 arch/x86/syscalls/syscall_64.tbl | 1 include/linux/kcmp.h | 17 + include/linux/syscalls.h | 2 kernel/Makefile | 3 kernel/kcmp.c | 187 +++++++++++++++++++++ kernel/sys_ni.c | 3 tools/testing/selftests/Makefile | 2 tools/testing/selftests/kcmp/Makefile | 29 +++ tools/testing/selftests/kcmp/kcmp_test.c | 94 ++++++++++ 10 files changed, 338 insertions(+), 1 deletion(-) diff -puN arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_32.tbl --- a/arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/arch/x86/syscalls/syscall_32.tbl @@ -355,3 +355,4 @@ 346 i386 setns sys_setns 347 i386 process_vm_readv sys_process_vm_readv compat_sys_process_vm_readv 348 i386 process_vm_writev sys_process_vm_writev compat_sys_process_vm_writev +349 i386 kcmp sys_kcmp diff -puN arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_64.tbl --- a/arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/arch/x86/syscalls/syscall_64.tbl @@ -318,6 +318,7 @@ 309 common getcpu sys_getcpu 310 64 process_vm_readv sys_process_vm_readv 311 64 process_vm_writev sys_process_vm_writev +312 64 kcmp sys_kcmp # # x32-specific system call numbers start at 512 to avoid cache impact # for native 64-bit operation. diff -puN /dev/null include/linux/kcmp.h --- /dev/null +++ a/include/linux/kcmp.h @@ -0,0 +1,17 @@ +#ifndef _LINUX_KCMP_H +#define _LINUX_KCMP_H + +/* Comparison type */ +enum kcmp_type { + KCMP_FILE, + KCMP_VM, + KCMP_FILES, + KCMP_FS, + KCMP_SIGHAND, + KCMP_IO, + KCMP_SYSVSEM, + + KCMP_TYPES, +}; + +#endif /* _LINUX_KCMP_H */ diff -puN include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8 include/linux/syscalls.h --- a/include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/include/linux/syscalls.h @@ -858,4 +858,6 @@ asmlinkage long sys_process_vm_writev(pi unsigned long riovcnt, unsigned long flags); +asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type, + unsigned long idx1, unsigned long idx2); #endif diff -puN kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/Makefile --- a/kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/kernel/Makefile @@ -25,6 +25,9 @@ endif obj-y += sched/ obj-y += power/ +ifeq ($(CONFIG_CHECKPOINT_RESTORE),y) +obj-$(CONFIG_X86) += kcmp.o +endif obj-$(CONFIG_FREEZER) += freezer.o obj-$(CONFIG_PROFILING) += profile.o obj-$(CONFIG_STACKTRACE) += stacktrace.o diff -puN /dev/null kernel/kcmp.c --- /dev/null +++ a/kernel/kcmp.c @@ -0,0 +1,187 @@ +#include <linux/kernel.h> +#include <linux/syscalls.h> +#include <linux/fdtable.h> +#include <linux/string.h> +#include <linux/random.h> +#include <linux/module.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/cache.h> +#include <linux/bug.h> +#include <linux/err.h> +#include <linux/kcmp.h> + +#include <asm/unistd.h> + +/* + * We don't expose real in-memory order of objects for security + * reasons, still the comparison results should be suitable for + * sorting. Thus, we obfuscate kernel pointers values and compare + * the production instead. + */ +static unsigned long cookies[KCMP_TYPES][2] __read_mostly; + +static long kptr_obfuscate(long v, int type) +{ + return (v ^ cookies[type][0]) * cookies[type][1]; +} + +/* + * 0 - equal, i.e. v1 = v2 + * 1 - less than, i.e. v1 < v2 + * 2 - greater than, i.e. v1 > v2 + * 3 - not equal but ordering unavailable (reserved for future) + */ +static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type) +{ + long ret; + + ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type); + + return (ret < 0) | ((ret > 0) << 1); +} + +/* The caller must have pinned the task */ +static struct file * +get_file_raw_ptr(struct task_struct *task, unsigned int idx) +{ + struct file *file = NULL; + + task_lock(task); + rcu_read_lock(); + + if (task->files) + file = fcheck_files(task->files, idx); + + rcu_read_unlock(); + task_unlock(task); + + return file; +} + +static void kcmp_unlock(struct mutex *m1, struct mutex *m2) +{ + if (likely(m2 != m1)) + mutex_unlock(m2); + mutex_unlock(m1); +} + +static int kcmp_lock(struct mutex *m1, struct mutex *m2) +{ + int err; + + if (m2 > m1) + swap(m1, m2); + + err = mutex_lock_killable(m1); + if (!err && likely(m1 != m2)) { + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING); + if (err) + mutex_unlock(m1); + } + + return err; +} + +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, + unsigned long, idx1, unsigned long, idx2) +{ + struct task_struct *task1, *task2; + int ret; + + rcu_read_lock(); + + /* + * Tasks are looked up in caller's PID namespace only. + */ + task1 = find_task_by_vpid(pid1); + task2 = find_task_by_vpid(pid2); + if (!task1 || !task2) + goto err_no_task; + + get_task_struct(task1); + get_task_struct(task2); + + rcu_read_unlock(); + + /* + * One should have enough rights to inspect task details. + */ + ret = kcmp_lock(&task1->signal->cred_guard_mutex, + &task2->signal->cred_guard_mutex); + if (ret) + goto err; + if (!ptrace_may_access(task1, PTRACE_MODE_READ) || + !ptrace_may_access(task2, PTRACE_MODE_READ)) { + ret = -EPERM; + goto err_unlock; + } + + switch (type) { + case KCMP_FILE: { + struct file *filp1, *filp2; + + filp1 = get_file_raw_ptr(task1, idx1); + filp2 = get_file_raw_ptr(task2, idx2); + + if (filp1 && filp2) + ret = kcmp_ptr(filp1, filp2, KCMP_FILE); + else + ret = -EBADF; + break; + } + case KCMP_VM: + ret = kcmp_ptr(task1->mm, task2->mm, KCMP_VM); + break; + case KCMP_FILES: + ret = kcmp_ptr(task1->files, task2->files, KCMP_FILES); + break; + case KCMP_FS: + ret = kcmp_ptr(task1->fs, task2->fs, KCMP_FS); + break; + case KCMP_SIGHAND: + ret = kcmp_ptr(task1->sighand, task2->sighand, KCMP_SIGHAND); + break; + case KCMP_IO: + ret = kcmp_ptr(task1->io_context, task2->io_context, KCMP_IO); + break; + case KCMP_SYSVSEM: +#ifdef CONFIG_SYSVIPC + ret = kcmp_ptr(task1->sysvsem.undo_list, + task2->sysvsem.undo_list, + KCMP_SYSVSEM); +#else + ret = -EOPNOTSUPP; +#endif + break; + default: + ret = -EINVAL; + break; + } + +err_unlock: + kcmp_unlock(&task1->signal->cred_guard_mutex, + &task2->signal->cred_guard_mutex); +err: + put_task_struct(task1); + put_task_struct(task2); + + return ret; + +err_no_task: + rcu_read_unlock(); + return -ESRCH; +} + +static __init int kcmp_cookies_init(void) +{ + int i; + + get_random_bytes(cookies, sizeof(cookies)); + + for (i = 0; i < KCMP_TYPES; i++) + cookies[i][1] |= (~(~0UL >> 1) | 1); + + return 0; +} +arch_initcall(kcmp_cookies_init); diff -puN kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/sys_ni.c --- a/kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/kernel/sys_ni.c @@ -203,3 +203,6 @@ cond_syscall(sys_fanotify_mark); cond_syscall(sys_name_to_handle_at); cond_syscall(sys_open_by_handle_at); cond_syscall(compat_sys_open_by_handle_at); + +/* compare kernel pointers */ +cond_syscall(sys_kcmp); diff -puN tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 tools/testing/selftests/Makefile --- a/tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 +++ a/tools/testing/selftests/Makefile @@ -1,4 +1,4 @@ -TARGETS = breakpoints vm +TARGETS = breakpoints vm kcmp all: for TARGET in $(TARGETS); do \ diff -puN /dev/null tools/testing/selftests/kcmp/Makefile --- /dev/null +++ a/tools/testing/selftests/kcmp/Makefile @@ -0,0 +1,29 @@ +uname_M := $(shell uname -m 2>/dev/null || echo not) +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/) +ifeq ($(ARCH),i386) + ARCH := X86 + CFLAGS := -DCONFIG_X86_32 -D__i386__ +endif +ifeq ($(ARCH),x86_64) + ARCH := X86 + CFLAGS := -DCONFIG_X86_64 -D__x86_64__ +endif + +CFLAGS += -I../../../../arch/x86/include/generated/ +CFLAGS += -I../../../../include/ +CFLAGS += -I../../../../usr/include/ +CFLAGS += -I../../../../arch/x86/include/ + +all: +ifeq ($(ARCH),X86) + gcc $(CFLAGS) kcmp_test.c -o run_test +else + echo "Not an x86 target, can't build kcmp selftest" +endif + +run-tests: all + ./kcmp_test + +clean: + rm -fr ./run_test + rm -fr ./test-file diff -puN /dev/null tools/testing/selftests/kcmp/kcmp_test.c --- /dev/null +++ a/tools/testing/selftests/kcmp/kcmp_test.c @@ -0,0 +1,94 @@ +#define _GNU_SOURCE + +#include <stdio.h> +#include <stdlib.h> +#include <signal.h> +#include <limits.h> +#include <unistd.h> +#include <errno.h> +#include <string.h> +#include <fcntl.h> + +#include <linux/unistd.h> +#include <linux/kcmp.h> + +#include <sys/syscall.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/wait.h> + +static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2) +{ + return syscall(__NR_kcmp, pid1, pid2, type, fd1, fd2); +} + +int main(int argc, char **argv) +{ + const char kpath[] = "kcmp-test-file"; + int pid1, pid2; + int fd1, fd2; + int status; + + fd1 = open(kpath, O_RDWR | O_CREAT | O_TRUNC, 0644); + pid1 = getpid(); + + if (fd1 < 0) { + perror("Can't create file"); + exit(1); + } + + pid2 = fork(); + if (pid2 < 0) { + perror("fork failed"); + exit(1); + } + + if (!pid2) { + int pid2 = getpid(); + int ret; + + fd2 = open(kpath, O_RDWR, 0644); + if (fd2 < 0) { + perror("Can't open file"); + exit(1); + } + + /* An example of output and arguments */ + printf("pid1: %6d pid2: %6d FD: %2ld FILES: %2ld VM: %2ld " + "FS: %2ld SIGHAND: %2ld IO: %2ld SYSVSEM: %2ld " + "INV: %2ld\n", + pid1, pid2, + sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd2), + sys_kcmp(pid1, pid2, KCMP_FILES, 0, 0), + sys_kcmp(pid1, pid2, KCMP_VM, 0, 0), + sys_kcmp(pid1, pid2, KCMP_FS, 0, 0), + sys_kcmp(pid1, pid2, KCMP_SIGHAND, 0, 0), + sys_kcmp(pid1, pid2, KCMP_IO, 0, 0), + sys_kcmp(pid1, pid2, KCMP_SYSVSEM, 0, 0), + + /* This one should fail */ + sys_kcmp(pid1, pid2, KCMP_TYPES + 1, 0, 0)); + + /* This one should return same fd */ + ret = sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd1); + if (ret) { + printf("FAIL: 0 expected but %d returned\n", ret); + ret = -1; + } else + printf("PASS: 0 returned as expected\n"); + + /* Compare with self */ + ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0); + if (ret) { + printf("FAIL: 0 expected but %li returned\n", ret); + ret = -1; + } else + printf("PASS: 0 returned as expected\n"); + + exit(ret); + } + + waitpid(pid2, &status, P_ALL); + + return 0; +} _ ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-09 22:10 ` Andrew Morton @ 2012-04-09 22:24 ` Cyrill Gorcunov 2012-04-09 23:22 ` H. Peter Anvin 2012-04-10 3:25 ` Eric W. Biederman 2012-04-10 23:58 ` Valdis.Kletnieks 2 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-04-09 22:24 UTC (permalink / raw) To: Andrew Morton Cc: Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On Mon, Apr 09, 2012 at 03:10:27PM -0700, Andrew Morton wrote: > Back on to kcmp. > > On Wed, 15 Feb 2012 20:27:52 +0400 > Cyrill Gorcunov <gorcunov@openvz.org> wrote: > > > On Wed, Feb 15, 2012 at 05:06:52PM +0100, Oleg Nesterov wrote: > > > Not a comment, but the question. I am just curious... > > > > > > > +/* > > > > + * We don't expose real in-memory order of objects for security > > > > + * reasons, still the comparison results should be suitable for > > > > + * sorting. Thus, we obfuscate kernel pointers values and compare > > > > + * the production instead. > > > > + */ > > > > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly; > > > > + > > > > +static long kptr_obfuscate(long v, int type) > > > > +{ > > > > + return (v ^ cookies[type][0]) * cookies[type][1]; > > > > +} > > > > > > OK, but why do we need this per type? Just to add more obfuscation > > > or there is another reason? > > > > Just to add more obfuscation. > > Having re-read most of the (enormous) email discussion on the kcmp() > syscall patch, I'm thinking: > > - Nobody seems to understand the obfuscation logic. Jon sounded > confused, Oleg sounds confused and it's rather unclear what it does, > how it does it and why it does it. The obfuscation logic was done with great help from hpa@. And the main idea was to have ordered results after obfuscation. Per-type noise increase randomization of results. So Andrew, I actually dont know what to add here. We don't want to provide kernel order back to user-space in naked manner. > > - Lots of people have looked at the code and made comments and there > have been lots of changes. But we presently have zero Acked-by's and > Reviewed-by's. > I guess I can ask hpa@ and Eric for Reviewed-by or Acked-by tag? > I guess this means that at present nobody is aware of any issues with > the proposal, btu nobody is terribly excisted about it either? > I would rather say not much people yet use it. > So what do people think? Any issues? Any nacks? Should I sneak it > into Linus this week or do we need to go another round with it all? > > I'd like to at least have a fighting chance of understnading what's > going on with that obfuscation code. Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-09 22:24 ` Cyrill Gorcunov @ 2012-04-09 23:22 ` H. Peter Anvin 2012-04-10 22:37 ` Cyrill Gorcunov 2012-04-11 0:02 ` Valdis.Kletnieks 0 siblings, 2 replies; 48+ messages in thread From: H. Peter Anvin @ 2012-04-09 23:22 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On 04/09/2012 03:24 PM, Cyrill Gorcunov wrote: >> >> Having re-read most of the (enormous) email discussion on the kcmp() >> syscall patch, I'm thinking: >> >> - Nobody seems to understand the obfuscation logic. Jon sounded >> confused, Oleg sounds confused and it's rather unclear what it does, >> how it does it and why it does it. > > The obfuscation logic was done with great help from hpa@. And the main > idea was to have ordered results after obfuscation. Per-type noise increase > randomization of results. So Andrew, I actually dont know what to add > here. We don't want to provide kernel order back to user-space in > naked manner. > The obfuscation logic is to provide a 1:1 mapping but which doesn't preserve ordering, thereby avoid leaking information of kernel pointers to user space. -hpa ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-09 23:22 ` H. Peter Anvin @ 2012-04-10 22:37 ` Cyrill Gorcunov 2012-04-10 22:39 ` H. Peter Anvin 2012-04-10 23:08 ` Oleg Nesterov 2012-04-11 0:02 ` Valdis.Kletnieks 1 sibling, 2 replies; 48+ messages in thread From: Cyrill Gorcunov @ 2012-04-10 22:37 UTC (permalink / raw) To: H. Peter Anvin Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On Mon, Apr 09, 2012 at 04:22:38PM -0700, H. Peter Anvin wrote: > > The obfuscation logic was done with great help from hpa@. And the main > > idea was to have ordered results after obfuscation. Per-type noise increase > > randomization of results. So Andrew, I actually dont know what to add > > here. We don't want to provide kernel order back to user-space in > > naked manner. > > > > The obfuscation logic is to provide a 1:1 mapping but which doesn't > preserve ordering, thereby avoid leaking information of kernel pointers > to user space. > OK, Peter, would the following comment bring light on the obfuscation procedure? --- Add a comment on kcmp obfuscation method Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> --- kernel/kcmp.c | 11 +++++++++++ 1 file changed, 11 insertions(+) Index: linux-2.6.git/kernel/kcmp.c =================================================================== --- linux-2.6.git.orig/kernel/kcmp.c +++ linux-2.6.git/kernel/kcmp.c @@ -17,6 +17,17 @@ * reasons, still the comparison results should be suitable for * sorting. Thus, we obfuscate kernel pointers values and compare * the production instead. + * + * The obfuscation is done in two steps. First -- we use xor on + * kernel pointer with random value, which puts pointer into + * a new position in reordered space. Second -- we multiply + * the xor production with big odd random number to permute + * bits even more (the oddity is important here, it allow + * us to have meaningful production even if multiplicants + * are big numbers). + * + * Note also the obfuscation itself is invisible to user-space + * and if needed it can be changed to any suitable scheme. */ static unsigned long cookies[KCMP_TYPES][2] __read_mostly; ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-10 22:37 ` Cyrill Gorcunov @ 2012-04-10 22:39 ` H. Peter Anvin 2012-04-10 22:48 ` Cyrill Gorcunov 2012-04-10 23:08 ` Oleg Nesterov 1 sibling, 1 reply; 48+ messages in thread From: H. Peter Anvin @ 2012-04-10 22:39 UTC (permalink / raw) To: Cyrill Gorcunov Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On 04/10/2012 03:37 PM, Cyrill Gorcunov wrote: > + * bits even more (the oddity is important here, it allow > + * us to have meaningful production even if multiplicants > + * are big numbers). I would be more clear: "the odd multiplier guarantees that the product is unique ever after the high bits are truncated, since any odd number is relative prime to 2^n". -hpa ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-10 22:39 ` H. Peter Anvin @ 2012-04-10 22:48 ` Cyrill Gorcunov 0 siblings, 0 replies; 48+ messages in thread From: Cyrill Gorcunov @ 2012-04-10 22:48 UTC (permalink / raw) To: H. Peter Anvin Cc: Andrew Morton, Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On Tue, Apr 10, 2012 at 03:39:53PM -0700, H. Peter Anvin wrote: > On 04/10/2012 03:37 PM, Cyrill Gorcunov wrote: > > + * bits even more (the oddity is important here, it allow > > + * us to have meaningful production even if multiplicants > > + * are big numbers). > > I would be more clear: > > "the odd multiplier guarantees that the product is unique ever after the > high bits are truncated, since any odd number is relative prime to 2^n". > Yeah, updated, thanks. Andrew, fold it up please. --- Subject: Add a comment on kcmp obfuscation method Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> CC: "H. Peter Anvin" <hpa@zytor.com> --- kernel/kcmp.c | 11 +++++++++++ 1 file changed, 11 insertions(+) Index: linux-2.6.git/kernel/kcmp.c =================================================================== --- linux-2.6.git.orig/kernel/kcmp.c +++ linux-2.6.git/kernel/kcmp.c @@ -17,6 +17,17 @@ * reasons, still the comparison results should be suitable for * sorting. Thus, we obfuscate kernel pointers values and compare * the production instead. + * + * The obfuscation is done in two steps. First -- we use xor on + * kernel pointer with random value, which puts pointer into + * a new position in reordered space. Second -- we multiply + * the xor production with big odd random number to permute + * bits even more (the odd multiplier guarantees that the product + * is unique ever after the high bits are truncated, since any odd + * number is relative prime to 2^n). + * + * Note also the obfuscation itself is invisible to user-space + * and if needed it can be changed to any suitable scheme. */ static unsigned long cookies[KCMP_TYPES][2] __read_mostly; ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-10 22:37 ` Cyrill Gorcunov 2012-04-10 22:39 ` H. Peter Anvin @ 2012-04-10 23:08 ` Oleg Nesterov 2012-04-10 23:32 ` H. Peter Anvin 1 sibling, 1 reply; 48+ messages in thread From: Oleg Nesterov @ 2012-04-10 23:08 UTC (permalink / raw) To: Cyrill Gorcunov Cc: H. Peter Anvin, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On 04/11, Cyrill Gorcunov wrote: > > --- linux-2.6.git.orig/kernel/kcmp.c > +++ linux-2.6.git/kernel/kcmp.c > @@ -17,6 +17,17 @@ > * reasons, still the comparison results should be suitable for > * sorting. Thus, we obfuscate kernel pointers values and compare > * the production instead. > + * > + * The obfuscation is done in two steps. First -- we use xor on > + * kernel pointer with random value, which puts pointer into > + * a new position in reordered space. Second -- we multiply > + * the xor production with big odd random number to permute > + * bits even more (the oddity is important here, it allow > + * us to have meaningful production even if multiplicants > + * are big numbers). > + * > + * Note also the obfuscation itself is invisible to user-space > + * and if needed it can be changed to any suitable scheme. > */ > static unsigned long cookies[KCMP_TYPES][2] __read_mostly; OK, since this is discussed again... Can this comment can also explain why do we obfuscate the pointers by type? I mean, I don't really understand why the one-dimensional cookies[2] is "not enough" from security pov. Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-10 23:08 ` Oleg Nesterov @ 2012-04-10 23:32 ` H. Peter Anvin 2012-04-10 23:42 ` Oleg Nesterov 0 siblings, 1 reply; 48+ messages in thread From: H. Peter Anvin @ 2012-04-10 23:32 UTC (permalink / raw) To: Oleg Nesterov Cc: Cyrill Gorcunov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On 04/10/2012 04:08 PM, Oleg Nesterov wrote: > > OK, since this is discussed again... > > Can this comment can also explain why do we obfuscate the pointers > by type? I mean, I don't really understand why the one-dimensional > cookies[2] is "not enough" from security pov. > Because it's cheap. "Just enough" is not what you want to shoot for, ever, you want to get past the "just enough" point and then consider "what can I get for cheap at this point"? -hpa ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-10 23:32 ` H. Peter Anvin @ 2012-04-10 23:42 ` Oleg Nesterov 2012-04-11 6:39 ` Cyrill Gorcunov 0 siblings, 1 reply; 48+ messages in thread From: Oleg Nesterov @ 2012-04-10 23:42 UTC (permalink / raw) To: H. Peter Anvin Cc: Cyrill Gorcunov, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On 04/10, H. Peter Anvin wrote: > > On 04/10/2012 04:08 PM, Oleg Nesterov wrote: > > > > OK, since this is discussed again... > > > > Can this comment can also explain why do we obfuscate the pointers > > by type? I mean, I don't really understand why the one-dimensional > > cookies[2] is "not enough" from security pov. > > Because it's cheap. "Just enough" is not what you want to shoot for, > ever, you want to get past the "just enough" point and then consider > "what can I get for cheap at this point"? OK, I am not arguing. Just I thought that the small note like "we are doing this per-type to obfuscate even more" can help. I wouldn't have asked, but Cyrill rewrites this comment anyway. Perhaps this is just me, but my first (and wrong) impression was that somehow this is needed for correctness. Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-10 23:42 ` Oleg Nesterov @ 2012-04-11 6:39 ` Cyrill Gorcunov 2012-04-11 18:31 ` Oleg Nesterov 0 siblings, 1 reply; 48+ messages in thread From: Cyrill Gorcunov @ 2012-04-11 6:39 UTC (permalink / raw) To: Oleg Nesterov Cc: H. Peter Anvin, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On Wed, Apr 11, 2012 at 01:42:06AM +0200, Oleg Nesterov wrote: > On 04/10, H. Peter Anvin wrote: > > > > On 04/10/2012 04:08 PM, Oleg Nesterov wrote: > > > > > > OK, since this is discussed again... > > > > > > Can this comment can also explain why do we obfuscate the pointers > > > by type? I mean, I don't really understand why the one-dimensional > > > cookies[2] is "not enough" from security pov. > > > > Because it's cheap. "Just enough" is not what you want to shoot for, > > ever, you want to get past the "just enough" point and then consider > > "what can I get for cheap at this point"? > > OK, I am not arguing. Just I thought that the small note like > "we are doing this per-type to obfuscate even more" can help. > I wouldn't have asked, but Cyrill rewrites this comment anyway. > > Perhaps this is just me, but my first (and wrong) impression was > that somehow this is needed for correctness. Hi Oleg, would the form below sounds good? --- Subject: Add a comment on kcmp obfuscation method Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> --- kernel/kcmp.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) Index: linux-2.6.git/kernel/kcmp.c =================================================================== --- linux-2.6.git.orig/kernel/kcmp.c +++ linux-2.6.git/kernel/kcmp.c @@ -17,6 +17,20 @@ * reasons, still the comparison results should be suitable for * sorting. Thus, we obfuscate kernel pointers values and compare * the production instead. + * + * The obfuscation is done in two steps. First -- we use xor on + * kernel pointer with random value, which puts pointer into + * a new position in reordered space. Second -- we multiply + * the xor production with big odd random number to permute + * bits even more (the odd multiplier guarantees that the product + * is unique ever after the high bits are truncated, since any odd + * number is relative prime to 2^n). + * + * The obfuscation is done with per-type cookie values to increase + * initial entropy of results. + * + * Note also the obfuscation itself is invisible to user-space + * and if needed it can be changed to any suitable scheme. */ static unsigned long cookies[KCMP_TYPES][2] __read_mostly; ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-11 6:39 ` Cyrill Gorcunov @ 2012-04-11 18:31 ` Oleg Nesterov 0 siblings, 0 replies; 48+ messages in thread From: Oleg Nesterov @ 2012-04-11 18:31 UTC (permalink / raw) To: Cyrill Gorcunov Cc: H. Peter Anvin, Andrew Morton, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On 04/11, Cyrill Gorcunov wrote: > > On Wed, Apr 11, 2012 at 01:42:06AM +0200, Oleg Nesterov wrote: > > OK, I am not arguing. Just I thought that the small note like > > "we are doing this per-type to obfuscate even more" can help. > > I wouldn't have asked, but Cyrill rewrites this comment anyway. > > > > Perhaps this is just me, but my first (and wrong) impression was > > that somehow this is needed for correctness. > > Hi Oleg, would the form below sounds good? Sure, thanks. Oleg. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-09 23:22 ` H. Peter Anvin 2012-04-10 22:37 ` Cyrill Gorcunov @ 2012-04-11 0:02 ` Valdis.Kletnieks 1 sibling, 0 replies; 48+ messages in thread From: Valdis.Kletnieks @ 2012-04-11 0:02 UTC (permalink / raw) To: H. Peter Anvin Cc: Cyrill Gorcunov, Andrew Morton, Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet [-- Attachment #1: Type: text/plain, Size: 1009 bytes --] On Mon, 09 Apr 2012 16:22:38 -0700, "H. Peter Anvin" said: > On 04/09/2012 03:24 PM, Cyrill Gorcunov wrote: > >> > >> Having re-read most of the (enormous) email discussion on the kcmp() > >> syscall patch, I'm thinking: > >> > >> - Nobody seems to understand the obfuscation logic. Jon sounded > >> confused, Oleg sounds confused and it's rather unclear what it does, > >> how it does it and why it does it. > > > > The obfuscation logic was done with great help from hpa@. And the main > > idea was to have ordered results after obfuscation. Per-type noise increase > > randomization of results. So Andrew, I actually dont know what to add > > here. We don't want to provide kernel order back to user-space in > > naked manner. > > > > The obfuscation logic is to provide a 1:1 mapping but which doesn't > preserve ordering, thereby avoid leaking information of kernel pointers > to user space. Oh, OK... Ignore my previous note then. But we should lose the comment that implies we have an ordering? [-- Attachment #2: Type: application/pgp-signature, Size: 865 bytes --] ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-09 22:10 ` Andrew Morton 2012-04-09 22:24 ` Cyrill Gorcunov @ 2012-04-10 3:25 ` Eric W. Biederman 2012-04-10 22:54 ` Cyrill Gorcunov 2012-04-10 23:58 ` Valdis.Kletnieks 2 siblings, 1 reply; 48+ messages in thread From: Eric W. Biederman @ 2012-04-10 3:25 UTC (permalink / raw) To: Andrew Morton Cc: Cyrill Gorcunov, Oleg Nesterov, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet Andrew Morton <akpm@linux-foundation.org> writes: > Back on to kcmp. > > On Wed, 15 Feb 2012 20:27:52 +0400 > Cyrill Gorcunov <gorcunov@openvz.org> wrote: > >> On Wed, Feb 15, 2012 at 05:06:52PM +0100, Oleg Nesterov wrote: >> > Not a comment, but the question. I am just curious... >> > >> > > +/* >> > > + * We don't expose real in-memory order of objects for security >> > > + * reasons, still the comparison results should be suitable for >> > > + * sorting. Thus, we obfuscate kernel pointers values and compare >> > > + * the production instead. >> > > + */ >> > > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly; >> > > + >> > > +static long kptr_obfuscate(long v, int type) >> > > +{ >> > > + return (v ^ cookies[type][0]) * cookies[type][1]; >> > > +} >> > >> > OK, but why do we need this per type? Just to add more obfuscation >> > or there is another reason? >> >> Just to add more obfuscation. > > Having re-read most of the (enormous) email discussion on the kcmp() > syscall patch, I'm thinking: > > - Nobody seems to understand the obfuscation logic. Jon sounded > confused, Oleg sounds confused and it's rather unclear what it does, > how it does it and why it does it. Peter explained it fairly well earlier. The xor trivially makes sense to me. I don't recall what the multiplication does. It would be nice if someone would get Peter's comment on why the multiply into a comment. But obscuring things makes sense. > - Lots of people have looked at the code and made comments and there > have been lots of changes. But we presently have zero Acked-by's and > Reviewed-by's. > > I guess this means that at present nobody is aware of any issues with > the proposal, btu nobody is terribly excisted about it either? Having just read through it again the only possible issue I can see is that we compare file descriptors after dropping all of the locks. Since rcu_read_lock doesn't participate in ABBA deadlocks. My gut feel is that we should hold rcu_read_lock across the hole file pointer comparison to remove the possibility of races as file descriptor pointers go away. Still in practice I don't think it matters. At worst there is the slightest possibility of returning a value instead of -EBADF. The expectation is for all of the tasks we are operating on to be frozen, and even if the tasks are not frozen it is a very tiny window for a race to be in. > So what do people think? Any issues? Any nacks? Should I sneak it > into Linus this week or do we need to go another round with it all? Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > I'd like to at least have a fighting chance of understnading what's > going on with that obfuscation code. My gut feel is that this code is good enough. Any and all security issues have been addressed. Having the system call would seem to add momentum to the people working on checkpoint/restart. So I don't see why this patch should not go forward, unless someone can point out an outright bug. Eric > From: Cyrill Gorcunov <gorcunov@openvz.org> > Subject: syscalls, x86: add __NR_kcmp syscall > > While doing the checkpoint-restore in the user space one need to determine > whether various kernel objects (like mm_struct-s of file_struct-s) are > shared between tasks and restore this state. > > The 2nd step can be solved by using appropriate CLONE_ flags and the > unshare syscall, while there's currently no ways for solving the 1st one. > > One of the ways for checking whether two tasks share e.g. mm_struct is to > provide some mm_struct ID of a task to its proc file, but showing such > info considered to be not that good for security reasons. > > Thus after some debates we end up in conclusion that using that named > 'comparison' syscall might be the best candidate. So here is it -- > __NR_kcmp. > > It takes up to 5 arguments - the pids of the two tasks (which > characteristics should be compared), the comparison type and (in case of > comparison of files) two file descriptors. > > Lookups for pids are done in the caller's PID namespace only. > > At moment only x86 is supported and tested. > > [akpm@linux-foundation.org: fix up selftests, warnings] > [akpm@linux-foundation.org: include errno.h] > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> > Cc: "Eric W. Biederman" <ebiederm@xmission.com> > Cc: Pavel Emelyanov <xemul@parallels.com> > Cc: Andrey Vagin <avagin@openvz.org> > Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> > Cc: Ingo Molnar <mingo@elte.hu> > Cc: H. Peter Anvin <hpa@zytor.com> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Glauber Costa <glommer@parallels.com> > Cc: Andi Kleen <andi@firstfloor.org> > Cc: Tejun Heo <tj@kernel.org> > Cc: Matt Helsley <matthltc@us.ibm.com> > Cc: Pekka Enberg <penberg@kernel.org> > Cc: Eric Dumazet <eric.dumazet@gmail.com> > Cc: Vasiliy Kulikov <segoon@openwall.com> > Cc: Alexey Dobriyan <adobriyan@gmail.com> > Cc: Valdis.Kletnieks@vt.edu > Cc: Michal Marek <mmarek@suse.cz> > Cc: Frederic Weisbecker <fweisbec@gmail.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > arch/x86/syscalls/syscall_32.tbl | 1 > arch/x86/syscalls/syscall_64.tbl | 1 > include/linux/kcmp.h | 17 + > include/linux/syscalls.h | 2 > kernel/Makefile | 3 > kernel/kcmp.c | 187 +++++++++++++++++++++ > kernel/sys_ni.c | 3 > tools/testing/selftests/Makefile | 2 > tools/testing/selftests/kcmp/Makefile | 29 +++ > tools/testing/selftests/kcmp/kcmp_test.c | 94 ++++++++++ > 10 files changed, 338 insertions(+), 1 deletion(-) > > diff -puN arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_32.tbl > --- a/arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 > +++ a/arch/x86/syscalls/syscall_32.tbl > @@ -355,3 +355,4 @@ > 346 i386 setns sys_setns > 347 i386 process_vm_readv sys_process_vm_readv compat_sys_process_vm_readv > 348 i386 process_vm_writev sys_process_vm_writev compat_sys_process_vm_writev > +349 i386 kcmp sys_kcmp > diff -puN arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_64.tbl > --- a/arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 > +++ a/arch/x86/syscalls/syscall_64.tbl > @@ -318,6 +318,7 @@ > 309 common getcpu sys_getcpu > 310 64 process_vm_readv sys_process_vm_readv > 311 64 process_vm_writev sys_process_vm_writev > +312 64 kcmp sys_kcmp > # > # x32-specific system call numbers start at 512 to avoid cache impact > # for native 64-bit operation. > diff -puN /dev/null include/linux/kcmp.h > --- /dev/null > +++ a/include/linux/kcmp.h > @@ -0,0 +1,17 @@ > +#ifndef _LINUX_KCMP_H > +#define _LINUX_KCMP_H > + > +/* Comparison type */ > +enum kcmp_type { > + KCMP_FILE, > + KCMP_VM, > + KCMP_FILES, > + KCMP_FS, > + KCMP_SIGHAND, > + KCMP_IO, > + KCMP_SYSVSEM, > + > + KCMP_TYPES, > +}; > + > +#endif /* _LINUX_KCMP_H */ > diff -puN include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8 include/linux/syscalls.h > --- a/include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8 > +++ a/include/linux/syscalls.h > @@ -858,4 +858,6 @@ asmlinkage long sys_process_vm_writev(pi > unsigned long riovcnt, > unsigned long flags); > > +asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type, > + unsigned long idx1, unsigned long idx2); > #endif > diff -puN kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/Makefile > --- a/kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 > +++ a/kernel/Makefile > @@ -25,6 +25,9 @@ endif > obj-y += sched/ > obj-y += power/ > > +ifeq ($(CONFIG_CHECKPOINT_RESTORE),y) > +obj-$(CONFIG_X86) += kcmp.o > +endif > obj-$(CONFIG_FREEZER) += freezer.o > obj-$(CONFIG_PROFILING) += profile.o > obj-$(CONFIG_STACKTRACE) += stacktrace.o > diff -puN /dev/null kernel/kcmp.c > --- /dev/null > +++ a/kernel/kcmp.c > @@ -0,0 +1,187 @@ > +#include <linux/kernel.h> > +#include <linux/syscalls.h> > +#include <linux/fdtable.h> > +#include <linux/string.h> > +#include <linux/random.h> > +#include <linux/module.h> > +#include <linux/init.h> > +#include <linux/errno.h> > +#include <linux/cache.h> > +#include <linux/bug.h> > +#include <linux/err.h> > +#include <linux/kcmp.h> > + > +#include <asm/unistd.h> > + > +/* > + * We don't expose real in-memory order of objects for security > + * reasons, still the comparison results should be suitable for > + * sorting. Thus, we obfuscate kernel pointers values and compare > + * the production instead. > + */ > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly; > + > +static long kptr_obfuscate(long v, int type) > +{ > + return (v ^ cookies[type][0]) * cookies[type][1]; > +} > + > +/* > + * 0 - equal, i.e. v1 = v2 > + * 1 - less than, i.e. v1 < v2 > + * 2 - greater than, i.e. v1 > v2 > + * 3 - not equal but ordering unavailable (reserved for future) > + */ > +static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type) > +{ > + long ret; > + > + ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type); > + > + return (ret < 0) | ((ret > 0) << 1); > +} > + > +/* The caller must have pinned the task */ > +static struct file * > +get_file_raw_ptr(struct task_struct *task, unsigned int idx) > +{ > + struct file *file = NULL; > + > + task_lock(task); > + rcu_read_lock(); > + > + if (task->files) > + file = fcheck_files(task->files, idx); > + > + rcu_read_unlock(); > + task_unlock(task); > + > + return file; > +} > + > +static void kcmp_unlock(struct mutex *m1, struct mutex *m2) > +{ > + if (likely(m2 != m1)) > + mutex_unlock(m2); > + mutex_unlock(m1); > +} > + > +static int kcmp_lock(struct mutex *m1, struct mutex *m2) > +{ > + int err; > + > + if (m2 > m1) > + swap(m1, m2); > + > + err = mutex_lock_killable(m1); > + if (!err && likely(m1 != m2)) { > + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING); > + if (err) > + mutex_unlock(m1); > + } > + > + return err; > +} > + > +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, > + unsigned long, idx1, unsigned long, idx2) > +{ > + struct task_struct *task1, *task2; > + int ret; > + > + rcu_read_lock(); > + > + /* > + * Tasks are looked up in caller's PID namespace only. > + */ > + task1 = find_task_by_vpid(pid1); > + task2 = find_task_by_vpid(pid2); > + if (!task1 || !task2) > + goto err_no_task; > + > + get_task_struct(task1); > + get_task_struct(task2); > + > + rcu_read_unlock(); > + > + /* > + * One should have enough rights to inspect task details. > + */ > + ret = kcmp_lock(&task1->signal->cred_guard_mutex, > + &task2->signal->cred_guard_mutex); > + if (ret) > + goto err; > + if (!ptrace_may_access(task1, PTRACE_MODE_READ) || > + !ptrace_may_access(task2, PTRACE_MODE_READ)) { > + ret = -EPERM; > + goto err_unlock; > + } > + > + switch (type) { > + case KCMP_FILE: { > + struct file *filp1, *filp2; > + > + filp1 = get_file_raw_ptr(task1, idx1); > + filp2 = get_file_raw_ptr(task2, idx2); > + > + if (filp1 && filp2) > + ret = kcmp_ptr(filp1, filp2, KCMP_FILE); > + else > + ret = -EBADF; > + break; > + } > + case KCMP_VM: > + ret = kcmp_ptr(task1->mm, task2->mm, KCMP_VM); > + break; > + case KCMP_FILES: > + ret = kcmp_ptr(task1->files, task2->files, KCMP_FILES); > + break; > + case KCMP_FS: > + ret = kcmp_ptr(task1->fs, task2->fs, KCMP_FS); > + break; > + case KCMP_SIGHAND: > + ret = kcmp_ptr(task1->sighand, task2->sighand, KCMP_SIGHAND); > + break; > + case KCMP_IO: > + ret = kcmp_ptr(task1->io_context, task2->io_context, KCMP_IO); > + break; > + case KCMP_SYSVSEM: > +#ifdef CONFIG_SYSVIPC > + ret = kcmp_ptr(task1->sysvsem.undo_list, > + task2->sysvsem.undo_list, > + KCMP_SYSVSEM); > +#else > + ret = -EOPNOTSUPP; > +#endif > + break; > + default: > + ret = -EINVAL; > + break; > + } > + > +err_unlock: > + kcmp_unlock(&task1->signal->cred_guard_mutex, > + &task2->signal->cred_guard_mutex); > +err: > + put_task_struct(task1); > + put_task_struct(task2); > + > + return ret; > + > +err_no_task: > + rcu_read_unlock(); > + return -ESRCH; > +} > + > +static __init int kcmp_cookies_init(void) > +{ > + int i; > + > + get_random_bytes(cookies, sizeof(cookies)); > + > + for (i = 0; i < KCMP_TYPES; i++) > + cookies[i][1] |= (~(~0UL >> 1) | 1); > + > + return 0; > +} > +arch_initcall(kcmp_cookies_init); > diff -puN kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/sys_ni.c > --- a/kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8 > +++ a/kernel/sys_ni.c > @@ -203,3 +203,6 @@ cond_syscall(sys_fanotify_mark); > cond_syscall(sys_name_to_handle_at); > cond_syscall(sys_open_by_handle_at); > cond_syscall(compat_sys_open_by_handle_at); > + > +/* compare kernel pointers */ > +cond_syscall(sys_kcmp); > diff -puN tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 tools/testing/selftests/Makefile > --- a/tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 > +++ a/tools/testing/selftests/Makefile > @@ -1,4 +1,4 @@ > -TARGETS = breakpoints vm > +TARGETS = breakpoints vm kcmp > > all: > for TARGET in $(TARGETS); do \ > diff -puN /dev/null tools/testing/selftests/kcmp/Makefile > --- /dev/null > +++ a/tools/testing/selftests/kcmp/Makefile > @@ -0,0 +1,29 @@ > +uname_M := $(shell uname -m 2>/dev/null || echo not) > +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/) > +ifeq ($(ARCH),i386) > + ARCH := X86 > + CFLAGS := -DCONFIG_X86_32 -D__i386__ > +endif > +ifeq ($(ARCH),x86_64) > + ARCH := X86 > + CFLAGS := -DCONFIG_X86_64 -D__x86_64__ > +endif > + > +CFLAGS += -I../../../../arch/x86/include/generated/ > +CFLAGS += -I../../../../include/ > +CFLAGS += -I../../../../usr/include/ > +CFLAGS += -I../../../../arch/x86/include/ > + > +all: > +ifeq ($(ARCH),X86) > + gcc $(CFLAGS) kcmp_test.c -o run_test > +else > + echo "Not an x86 target, can't build kcmp selftest" > +endif > + > +run-tests: all > + ./kcmp_test > + > +clean: > + rm -fr ./run_test > + rm -fr ./test-file > diff -puN /dev/null tools/testing/selftests/kcmp/kcmp_test.c > --- /dev/null > +++ a/tools/testing/selftests/kcmp/kcmp_test.c > @@ -0,0 +1,94 @@ > +#define _GNU_SOURCE > + > +#include <stdio.h> > +#include <stdlib.h> > +#include <signal.h> > +#include <limits.h> > +#include <unistd.h> > +#include <errno.h> > +#include <string.h> > +#include <fcntl.h> > + > +#include <linux/unistd.h> > +#include <linux/kcmp.h> > + > +#include <sys/syscall.h> > +#include <sys/types.h> > +#include <sys/stat.h> > +#include <sys/wait.h> > + > +static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2) > +{ > + return syscall(__NR_kcmp, pid1, pid2, type, fd1, fd2); > +} > + > +int main(int argc, char **argv) > +{ > + const char kpath[] = "kcmp-test-file"; > + int pid1, pid2; > + int fd1, fd2; > + int status; > + > + fd1 = open(kpath, O_RDWR | O_CREAT | O_TRUNC, 0644); > + pid1 = getpid(); > + > + if (fd1 < 0) { > + perror("Can't create file"); > + exit(1); > + } > + > + pid2 = fork(); > + if (pid2 < 0) { > + perror("fork failed"); > + exit(1); > + } > + > + if (!pid2) { > + int pid2 = getpid(); > + int ret; > + > + fd2 = open(kpath, O_RDWR, 0644); > + if (fd2 < 0) { > + perror("Can't open file"); > + exit(1); > + } > + > + /* An example of output and arguments */ > + printf("pid1: %6d pid2: %6d FD: %2ld FILES: %2ld VM: %2ld " > + "FS: %2ld SIGHAND: %2ld IO: %2ld SYSVSEM: %2ld " > + "INV: %2ld\n", > + pid1, pid2, > + sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd2), > + sys_kcmp(pid1, pid2, KCMP_FILES, 0, 0), > + sys_kcmp(pid1, pid2, KCMP_VM, 0, 0), > + sys_kcmp(pid1, pid2, KCMP_FS, 0, 0), > + sys_kcmp(pid1, pid2, KCMP_SIGHAND, 0, 0), > + sys_kcmp(pid1, pid2, KCMP_IO, 0, 0), > + sys_kcmp(pid1, pid2, KCMP_SYSVSEM, 0, 0), > + > + /* This one should fail */ > + sys_kcmp(pid1, pid2, KCMP_TYPES + 1, 0, 0)); > + > + /* This one should return same fd */ > + ret = sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd1); > + if (ret) { > + printf("FAIL: 0 expected but %d returned\n", ret); > + ret = -1; > + } else > + printf("PASS: 0 returned as expected\n"); > + > + /* Compare with self */ > + ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0); > + if (ret) { > + printf("FAIL: 0 expected but %li returned\n", ret); > + ret = -1; > + } else > + printf("PASS: 0 returned as expected\n"); > + > + exit(ret); > + } > + > + waitpid(pid2, &status, P_ALL); > + > + return 0; > +} > _ ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-10 3:25 ` Eric W. Biederman @ 2012-04-10 22:54 ` Cyrill Gorcunov 0 siblings, 0 replies; 48+ messages in thread From: Cyrill Gorcunov @ 2012-04-10 22:54 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Oleg Nesterov, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Valdis.Kletnieks, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On Mon, Apr 09, 2012 at 08:25:22PM -0700, Eric W. Biederman wrote: ... > > Having just read through it again the only possible issue I can see is > that we compare file descriptors after dropping all of the locks. > > Since rcu_read_lock doesn't participate in ABBA deadlocks. My gut feel > is that we should hold rcu_read_lock across the hole file pointer > comparison to remove the possibility of races as file descriptor > pointers go away. > > Still in practice I don't think it matters. At worst there is the > slightest possibility of returning a value instead of -EBADF. The > expectation is for all of the tasks we are operating on to be frozen, > and even if the tasks are not frozen it is a very tiny window for a race > to be in. yeah, we use this call heavily on stopped tasks atm > > So what do people think? Any issues? Any nacks? Should I sneak it > > into Linus this week or do we need to go another round with it all? > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > Thanks a lot, Eric! Cyrill ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-09 22:10 ` Andrew Morton 2012-04-09 22:24 ` Cyrill Gorcunov 2012-04-10 3:25 ` Eric W. Biederman @ 2012-04-10 23:58 ` Valdis.Kletnieks 2012-04-11 0:06 ` H. Peter Anvin 2 siblings, 1 reply; 48+ messages in thread From: Valdis.Kletnieks @ 2012-04-10 23:58 UTC (permalink / raw) To: Andrew Morton Cc: Cyrill Gorcunov, Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet [-- Attachment #1: Type: text/plain, Size: 950 bytes --] On Mon, 09 Apr 2012 15:10:27 -0700, Andrew Morton said: > Back on to kcmp. I've totally forgotten what my original comment(s) on this patch were.. :) > +/* > + * 0 - equal, i.e. v1 = v2 > + * 1 - less than, i.e. v1 < v2 > + * 2 - greater than, i.e. v1 > v2 > + * 3 - not equal but ordering unavailable (reserved for future) > + */ > +static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type) > +{ > + long ret; > + > + ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type); I'm not able to convince myself that "less than" and "greater than" mean anything - do we have a good proof that for all v1 and v2, the obfuscated pointers have the same ordering as the original pointers? Hmm... consider the simplified example v1 = 5 and v2= 16., and cookies[0] is also 16. Then obfus(v1) == 21, and obfus(v2) == 0, and the ordering is different. So I'm thinking 0 and 3 are the only sane return values? Or do I need more caffeine? [-- Attachment #2: Type: application/pgp-signature, Size: 865 bytes --] ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree 2012-04-10 23:58 ` Valdis.Kletnieks @ 2012-04-11 0:06 ` H. Peter Anvin 0 siblings, 0 replies; 48+ messages in thread From: H. Peter Anvin @ 2012-04-11 0:06 UTC (permalink / raw) To: Valdis.Kletnieks Cc: Andrew Morton, Cyrill Gorcunov, Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, Andrey Vagin, KOSAKI Motohiro, Ingo Molnar, Thomas Gleixner, Glauber Costa, Andi Kleen, Tejun Heo, Matt Helsley, Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Michal Marek, Frederic Weisbecker, linux-kernel, Jonathan Corbet On 04/10/2012 04:58 PM, Valdis.Kletnieks@vt.edu wrote: > > I'm not able to convince myself that "less than" and "greater than" > mean anything - do we have a good proof that for all v1 and v2, the > obfuscated pointers have the same ordering as the original > pointers? > No, of course they don't. That's the point. The whole point is that kcmp() exports an ordered set, but the ordering is explicitly and intentionally different than the actual ordering in memory. It is still valid for sorting, in particular. > Hmm... consider the simplified example v1 = 5 and v2= 16., and > cookies[0] is also 16. Then obfus(v1) == 21, and obfus(v2) == 0, > and the ordering is different. So I'm thinking 0 and 3 are the > only sane return values? > > Or do I need more caffeine? You need more caffeine. You're thinking about the problem wrong. -hpa ^ permalink raw reply [flat|nested] 48+ messages in thread
end of thread, other threads:[~2012-04-11 18:38 UTC | newest] Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-02-15 14:36 + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree Oleg Nesterov 2012-02-15 15:10 ` Cyrill Gorcunov 2012-02-15 15:38 ` Oleg Nesterov 2012-02-15 16:13 ` Cyrill Gorcunov 2012-02-15 16:22 ` Oleg Nesterov 2012-02-15 17:53 ` Cyrill Gorcunov 2012-02-15 18:43 ` Oleg Nesterov 2012-02-15 19:56 ` Cyrill Gorcunov 2012-02-15 19:57 ` Vasiliy Kulikov 2012-02-15 20:05 ` Cyrill Gorcunov 2012-02-15 20:25 ` Cyrill Gorcunov 2012-02-15 21:09 ` Cyrill Gorcunov 2012-02-15 21:58 ` Cyrill Gorcunov 2012-02-16 14:49 ` Oleg Nesterov 2012-02-16 15:13 ` Cyrill Gorcunov 2012-02-16 16:49 ` Cyrill Gorcunov 2012-02-16 17:40 ` Oleg Nesterov 2012-02-16 17:58 ` Cyrill Gorcunov 2012-02-16 19:03 ` Oleg Nesterov 2012-02-16 19:20 ` H. Peter Anvin 2012-02-16 19:29 ` Cyrill Gorcunov 2012-02-16 19:52 ` Andrew Morton 2012-02-16 20:01 ` Cyrill Gorcunov 2012-02-16 18:21 ` Vasiliy Kulikov 2012-02-16 18:34 ` Cyrill Gorcunov 2012-02-16 18:33 ` Vasiliy Kulikov 2012-02-16 18:49 ` Oleg Nesterov 2012-02-15 18:32 ` Cyrill Gorcunov 2012-02-15 19:06 ` Oleg Nesterov 2012-02-15 19:18 ` Cyrill Gorcunov 2012-02-15 16:06 ` Oleg Nesterov 2012-02-15 16:27 ` Cyrill Gorcunov 2012-04-09 22:10 ` Andrew Morton 2012-04-09 22:24 ` Cyrill Gorcunov 2012-04-09 23:22 ` H. Peter Anvin 2012-04-10 22:37 ` Cyrill Gorcunov 2012-04-10 22:39 ` H. Peter Anvin 2012-04-10 22:48 ` Cyrill Gorcunov 2012-04-10 23:08 ` Oleg Nesterov 2012-04-10 23:32 ` H. Peter Anvin 2012-04-10 23:42 ` Oleg Nesterov 2012-04-11 6:39 ` Cyrill Gorcunov 2012-04-11 18:31 ` Oleg Nesterov 2012-04-11 0:02 ` Valdis.Kletnieks 2012-04-10 3:25 ` Eric W. Biederman 2012-04-10 22:54 ` Cyrill Gorcunov 2012-04-10 23:58 ` Valdis.Kletnieks 2012-04-11 0:06 ` H. Peter Anvin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).