All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] arch_check_bp_in_kernelspace: fix the range check
@ 2012-11-09 18:29 Oleg Nesterov
  2012-11-09 18:30 ` Oleg Nesterov
  0 siblings, 1 reply; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-09 18:29 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Peter Zijlstra
  Cc: Amnon Shiloh, linux-kernel

arch_check_bp_in_kernelspace() tries to avoid the overflow and does 2
TASK_SIZE checks but it needs OR, not AND. Consider va = TASK_SIZE -1
and len = 2 case.

Note: TASK_SIZE doesn't look right at least on x86, I think it should
be replaced by TASK_SIZE_MAX.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>

--- x/arch/arm64/kernel/hw_breakpoint.c
+++ x/arch/arm64/kernel/hw_breakpoint.c
@@ -293,7 +293,7 @@ int arch_check_bp_in_kernelspace(struct 
 	va = info->address;
 	len = get_hbp_len(info->ctrl.len);
 
-	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
 }
 
 /*
--- x/arch/arm/kernel/hw_breakpoint.c
+++ x/arch/arm/kernel/hw_breakpoint.c
@@ -464,7 +464,7 @@ int arch_check_bp_in_kernelspace(struct 
 	va = info->address;
 	len = get_hbp_len(info->ctrl.len);
 
-	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
 }
 
 /*
--- x/arch/sh/kernel/hw_breakpoint.c
+++ x/arch/sh/kernel/hw_breakpoint.c
@@ -132,7 +132,7 @@ int arch_check_bp_in_kernelspace(struct 
 	va = info->address;
 	len = get_hbp_len(info->len);
 
-	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
 }
 
 int arch_bp_generic_fields(int sh_len, int sh_type,
--- x/arch/x86/kernel/hw_breakpoint.c
+++ x/arch/x86/kernel/hw_breakpoint.c
@@ -200,7 +200,7 @@ int arch_check_bp_in_kernelspace(struct 
 	va = info->address;
 	len = get_hbp_len(info->len);
 
-	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
 }
 
 int arch_bp_generic_fields(int x86_len, int x86_type,


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check
  2012-11-09 18:29 [PATCH] arch_check_bp_in_kernelspace: fix the range check Oleg Nesterov
@ 2012-11-09 18:30 ` Oleg Nesterov
  2012-11-19 17:47   ` Oleg Nesterov
  0 siblings, 1 reply; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-09 18:30 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Peter Zijlstra
  Cc: Amnon Shiloh, linux-kernel

On 11/09, Oleg Nesterov wrote:
>
> Note: TASK_SIZE doesn't look right at least on x86, I think it should
> be replaced by TASK_SIZE_MAX.
> ...
> --- x/arch/x86/kernel/hw_breakpoint.c
> +++ x/arch/x86/kernel/hw_breakpoint.c
> @@ -200,7 +200,7 @@ int arch_check_bp_in_kernelspace(struct
>  	va = info->address;
>  	len = get_hbp_len(info->len);
>
> -	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
> +	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);

But actully I'd like to ask another question.

Can't we add the additional !in_gate_area_no_mm() check to allow
the hw breakpoints in vsyscall?

Quoting Amnon:

	My service needs to catch the system-calls of its traced son.
	Almost all system-calls are caught with PTRACE_SYSCALL, but not those
	using the vsyscall page - especially "gettimeofday()" and "time()".

	...

	However, the code in "arch/x86/kernel/ptrace.c" forbids catching kernel
	addresses.

I tend to agree with Amnon...

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check
  2012-11-09 18:30 ` Oleg Nesterov
@ 2012-11-19 17:47   ` Oleg Nesterov
  2012-11-19 18:25     ` Steven Rostedt
  0 siblings, 1 reply; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-19 17:47 UTC (permalink / raw)
  To: Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, Steven Rostedt
  Cc: Amnon Shiloh, linux-kernel

On 11/09, Oleg Nesterov wrote:
>
> On 11/09, Oleg Nesterov wrote:
> >
> > Note: TASK_SIZE doesn't look right at least on x86, I think it should
> > be replaced by TASK_SIZE_MAX.
> > ...
> > --- x/arch/x86/kernel/hw_breakpoint.c
> > +++ x/arch/x86/kernel/hw_breakpoint.c
> > @@ -200,7 +200,7 @@ int arch_check_bp_in_kernelspace(struct
> >  	va = info->address;
> >  	len = get_hbp_len(info->len);
> >
> > -	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
> > +	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
>
> But actully I'd like to ask another question.
>
> Can't we add the additional !in_gate_area_no_mm() check to allow
> the hw breakpoints in vsyscall?
>
> Quoting Amnon:
>
> 	My service needs to catch the system-calls of its traced son.
> 	Almost all system-calls are caught with PTRACE_SYSCALL, but not those
> 	using the vsyscall page - especially "gettimeofday()" and "time()".
>
> 	...
>
> 	However, the code in "arch/x86/kernel/ptrace.c" forbids catching kernel
> 	addresses.
>
> I tend to agree with Amnon...
>
> What do you think?

ping ;)

I agree the patch I sent is very minor (although it looks like the trivial
bugfix to me).

Just I think Amnon has a point, shouldn't we change this change like below?
(on top of this fixlet).

Oleg.

--- x/arch/x86/kernel/hw_breakpoint.c
+++ x/arch/x86/kernel/hw_breakpoint.c
@@ -188,6 +188,11 @@ static int get_hbp_len(u8 hbp_len)
 	return len_in_bytes;
 }
 
+static inline bool bp_in_gate(unsigned long start, unsigned long end)
+{
+	return in_gate_area_no_mm(start) && in_gate_area_no_mm(end);
+}
+
 /*
  * Check for virtual address in kernel space.
  */
@@ -200,7 +205,8 @@ int arch_check_bp_in_kernelspace(struct 
 	va = info->address;
 	len = get_hbp_len(info->len);
 
-	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
+	return ((va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE)) &&
+		!bp_in_gate(va, va + len - 1);
 }
 
 int arch_bp_generic_fields(int x86_len, int x86_type,


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check
  2012-11-19 17:47   ` Oleg Nesterov
@ 2012-11-19 18:25     ` Steven Rostedt
  2012-11-20 10:33       ` u3557
  2012-11-20 15:48       ` Oleg Nesterov
  0 siblings, 2 replies; 62+ messages in thread
From: Steven Rostedt @ 2012-11-19 18:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, Amnon Shiloh,
	linux-kernel

On Mon, 2012-11-19 at 18:47 +0100, Oleg Nesterov wrote:
> On 11/09, Oleg Nesterov wrote:
> >
> > On 11/09, Oleg Nesterov wrote:
> > >
> > > Note: TASK_SIZE doesn't look right at least on x86, I think it should
> > > be replaced by TASK_SIZE_MAX.
> > > ...
> > > --- x/arch/x86/kernel/hw_breakpoint.c
> > > +++ x/arch/x86/kernel/hw_breakpoint.c
> > > @@ -200,7 +200,7 @@ int arch_check_bp_in_kernelspace(struct
> > >  	va = info->address;
> > >  	len = get_hbp_len(info->len);
> > >
> > > -	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
> > > +	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
> >
> > But actully I'd like to ask another question.
> >
> > Can't we add the additional !in_gate_area_no_mm() check to allow
> > the hw breakpoints in vsyscall?
> >
> > Quoting Amnon:
> >
> > 	My service needs to catch the system-calls of its traced son.
> > 	Almost all system-calls are caught with PTRACE_SYSCALL, but not those
> > 	using the vsyscall page - especially "gettimeofday()" and "time()".
> >
> > 	...
> >
> > 	However, the code in "arch/x86/kernel/ptrace.c" forbids catching kernel
> > 	addresses.
> >
> > I tend to agree with Amnon...
> >
> > What do you think?
> 
> ping ;)
> 
> I agree the patch I sent is very minor (although it looks like the trivial
> bugfix to me).
> 
> Just I think Amnon has a point, shouldn't we change this change like below?
> (on top of this fixlet).
> 
> Oleg.
> 
> --- x/arch/x86/kernel/hw_breakpoint.c
> +++ x/arch/x86/kernel/hw_breakpoint.c
> @@ -188,6 +188,11 @@ static int get_hbp_len(u8 hbp_len)
>  	return len_in_bytes;
>  }
>  
> +static inline bool bp_in_gate(unsigned long start, unsigned long end)
> +{
> +	return in_gate_area_no_mm(start) && in_gate_area_no_mm(end);
> +}
> +
>  /*
>   * Check for virtual address in kernel space.
>   */
> @@ -200,7 +205,8 @@ int arch_check_bp_in_kernelspace(struct 
>  	va = info->address;
>  	len = get_hbp_len(info->len);
>  
> -	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
> +	return ((va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE)) &&
> +		!bp_in_gate(va, va + len - 1);

So we want to allow non privileged to add a breakpoint in a vsyscall
area even if it happens to lay in kernel space?

Is this really what we want? I mean, isn't syscall tracing in the kernel
done by a flag set to the task's structure. If the task has a flag set,
then it does the slow (ptrace) path which handles the breakpoint for the
user.

But here, there's no prejudice between tasks. All tasks will now hit the
breakpoint regardless of if it is being traced or not.

-- Steve


>  }
>  
>  int arch_bp_generic_fields(int x86_len, int x86_type,



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check
  2012-11-19 18:25     ` Steven Rostedt
@ 2012-11-20 10:33       ` u3557
  2012-11-20 15:48       ` Oleg Nesterov
  1 sibling, 0 replies; 62+ messages in thread
From: u3557 @ 2012-11-20 10:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Oleg Nesterov, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Amnon Shiloh, linux-kernel

Dear Steve,

> But here, there's no prejudice between tasks. All tasks will now hit the
> breakpoint regardless of if it is being traced or not.

Just to clarify, there is no intention to allow conventional breakpoints
in the vsyscall page - that would indeed be a disaster affecting all other
processes.

The aim of this patch is to allow ptracers to set the x86 debug-registers
of their traced process, so that it stops when it reaches the entry points
of those (few) system-calls that are in the vsyscall page.  Note that those
debug registers are anyway swapped at context-switch, so no other processes
are affected.

> Is this really what we want? I mean, isn't syscall tracing in the kernel
> done by a flag set to the task's structure. If the task has a flag set,
> then it does the slow (ptrace) path which handles the breakpoint for the
> user.

Most system-calls can be trapped by the PTRACE_SYSCALL option (and the
later PTRACE_SYSEMU), but those few system-calls on the vsyscall page
defy the intention to trap ALL system-calls.  They also cannot be
checkpointed by inserting a trap-instruction (as that, if allowed, would
break all other processes), hence the only alternative left is to have
this patch that fixes the oversight in the design of
PTRACE_SYSCALL/PTRACE_SYSEMU in the presence of a vsyscall page.

Best Regards,
Amnon.

> On Mon, 2012-11-19 at 18:47 +0100, Oleg Nesterov wrote:
>> On 11/09, Oleg Nesterov wrote:
>> >
>> > On 11/09, Oleg Nesterov wrote:
>> > >
>> > > Note: TASK_SIZE doesn't look right at least on x86, I think it
>> should
>> > > be replaced by TASK_SIZE_MAX.
>> > > ...
>> > > --- x/arch/x86/kernel/hw_breakpoint.c
>> > > +++ x/arch/x86/kernel/hw_breakpoint.c
>> > > @@ -200,7 +200,7 @@ int arch_check_bp_in_kernelspace(struct
>> > >  	va = info->address;
>> > >  	len = get_hbp_len(info->len);
>> > >
>> > > -	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
>> > > +	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
>> >
>> > But actully I'd like to ask another question.
>> >
>> > Can't we add the additional !in_gate_area_no_mm() check to allow
>> > the hw breakpoints in vsyscall?
>> >
>> > Quoting Amnon:
>> >
>> > 	My service needs to catch the system-calls of its traced son.
>> > 	Almost all system-calls are caught with PTRACE_SYSCALL, but not those
>> > 	using the vsyscall page - especially "gettimeofday()" and "time()".
>> >
>> > 	...> Is this really what we want? I mean, isn't syscall tracing in
the kernel
> done by a flag set to the task's structure. If the task has a flag set,
> then it does the slow (ptrace) path which handles the breakpoint for the
> user.
>> >
>> > 	However, the code in "arch/x86/kernel/ptrace.c" forbids catching
>> kernel
>> > 	addresses.
>> >
>> > I tend to agree with Amnon...
>> >
>> > What do you think?
>>
>> ping ;)
>>
>> I agree the patch I sent is very minor (although it looks like the
>> trivial
>> bugfix to me).
>>
>> Just I think Amnon has a point, shouldn't we change this change like
>> below?
>> (on top of this fixlet).
>>
>> Oleg.
>>
>> --- x/arch/x86/kernel/hw_breakpoint.c
>> +++ x/arch/x86/kernel/hw_breakpoint.c
>> @@ -188,6 +188,11 @@ static int get_hbp_len(u8 hbp_len)
>>  	return len_in_bytes;
>>  }
>>
>> +static inline bool bp_in_gate(unsigned long start, unsigned long end)
>> +{
>> +	return in_gate_area_no_mm(start) && in_gate_area_no_mm(end);
>> +}
>> +
>>  /*
>>   * Check for virtual address in kernel space.
>>   */
>> @@ -200,7 +205,8 @@ int arch_check_bp_in_kernelspace(struct
>>  	va = info->address;
>>  	len = get_hbp_len(info->len);
>>
>> -	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
>> +	return ((va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE)) &&
>> +		!bp_in_gate(va, va + len - 1);
>
> So we want to allow non privileged to add a breakpoint in a vsyscall
> area even if it happens to lay in kernel space?
>
> Is this really what we want? I mean, isn't syscall tracing in the kernel
> done by a flag set to the task's structure. If the task has a flag set,
> then it does the slow (ptrace) path which handles the breakpoint for the
> user.
>
> But here, there's no prejudice between tasks. All tasks will now hit the
> breakpoint regardless of if it is being traced or not.
>
> -- Steve
>
>
>>  }
>>
>>  int arch_bp_generic_fields(int x86_len, int x86_type,
>
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check
  2012-11-19 18:25     ` Steven Rostedt
  2012-11-20 10:33       ` u3557
@ 2012-11-20 15:48       ` Oleg Nesterov
  2012-11-20 15:55         ` Steven Rostedt
  2012-11-20 18:32         ` Oleg Nesterov
  1 sibling, 2 replies; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-20 15:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, Amnon Shiloh,
	linux-kernel

On 11/19, Steven Rostedt wrote:
>
> On Mon, 2012-11-19 at 18:47 +0100, Oleg Nesterov wrote:
> >
> > Just I think Amnon has a point, shouldn't we change this change like below?
> > (on top of this fixlet).
> >
> > Oleg.
> >
> > --- x/arch/x86/kernel/hw_breakpoint.c
> > +++ x/arch/x86/kernel/hw_breakpoint.c
> > @@ -188,6 +188,11 @@ static int get_hbp_len(u8 hbp_len)
> >  	return len_in_bytes;
> >  }
> >
> > +static inline bool bp_in_gate(unsigned long start, unsigned long end)
> > +{
> > +	return in_gate_area_no_mm(start) && in_gate_area_no_mm(end);
> > +}
> > +
> >  /*
> >   * Check for virtual address in kernel space.
> >   */
> > @@ -200,7 +205,8 @@ int arch_check_bp_in_kernelspace(struct
> >  	va = info->address;
> >  	len = get_hbp_len(info->len);
> >
> > -	return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
> > +	return ((va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE)) &&
> > +		!bp_in_gate(va, va + len - 1);
>
> So we want to allow non privileged to add a breakpoint in a vsyscall
> area even if it happens to lay in kernel space?

Yes,

> Is this really what we want?

I am not sure, that is why I am asking.

Hmm... I have to admit, I didn't even know this code was completely
rewritten, and a long ago... and in the default mode it works via
emulate_vsyscall() from page-fault, iow not in user_mode().

> I mean, isn't syscall tracing in the kernel
> done by a flag set to the task's structure. If the task has a flag set,
> then it does the slow (ptrace) path which handles the breakpoint for the
> user.

Well yes, with the new implementation __vsyscall_page simply does
syscall, so I guess (say) strace will "just work". But, afaics, only
if vsyscall_mode == NATIVE.

So perhaps bp in vsyscall_page still makes sense...

> But here, there's no prejudice between tasks. All tasks will now hit the
> breakpoint regardless of if it is being traced or not.
  ^^^^^^^^^^

This is hardware breakpoint, it is per-task.

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check
  2012-11-20 15:48       ` Oleg Nesterov
@ 2012-11-20 15:55         ` Steven Rostedt
  2012-11-20 18:32         ` Oleg Nesterov
  1 sibling, 0 replies; 62+ messages in thread
From: Steven Rostedt @ 2012-11-20 15:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, Amnon Shiloh,
	linux-kernel

On Tue, 2012-11-20 at 16:48 +0100, Oleg Nesterov wrote:

> > But here, there's no prejudice between tasks. All tasks will now hit the
> > breakpoint regardless of if it is being traced or not.
>   ^^^^^^^^^^
> 
> This is hardware breakpoint, it is per-task.

I forgot that hw breakpoints are swapped out at context switch (as Amnon
informed me). 

OK, that makes a big difference.

Thanks,

-- Steve



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check
  2012-11-20 15:48       ` Oleg Nesterov
  2012-11-20 15:55         ` Steven Rostedt
@ 2012-11-20 18:32         ` Oleg Nesterov
  2012-11-20 23:16           ` u3557
  1 sibling, 1 reply; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-20 18:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, Amnon Shiloh,
	linux-kernel

On 11/20, Oleg Nesterov wrote:
>
> On 11/19, Steven Rostedt wrote:
> >
> > Is this really what we want?
>
> I am not sure, that is why I am asking.

Yes.

And,

> Well yes, with the new implementation __vsyscall_page simply does
> syscall, so I guess (say) strace will "just work". But, afaics, only
> if vsyscall_mode == NATIVE.
>
> So perhaps bp in vsyscall_page still makes sense...

Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to do


	if (current->ptrace && test_thread_flag(TIF_SYSCALL_TRACE))
		send_sigtrap(TRAP_VSYSCALL, ...);

if it returns true?

This way the debugger doesn't need to play with the debug registers,
it can use ptrace(PTRACE_SYSCALL). The tracee will report SIGTRAP after
vsyscall, debugger can detect this case looking at info->si_code.

Yes, the tracee won't report _before_ it calls the function, but perhaps
this is enough? Amnon?

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check
  2012-11-20 18:32         ` Oleg Nesterov
@ 2012-11-20 23:16           ` u3557
  2012-11-21 14:16             ` Oleg Nesterov
  0 siblings, 1 reply; 62+ messages in thread
From: u3557 @ 2012-11-20 23:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	Amnon Shiloh, linux-kernel

Hi Oleg,

> Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to
> do
>
>
> 	if (current->ptrace && test_thread_flag(TIF_SYSCALL_TRACE))
> 		send_sigtrap(TRAP_VSYSCALL, ...);
>
> if it returns true?
>

I wish it were possible, but the vsyscall page is entered in user-mode,
where it is not possible to read either "current" or the per-thread flags.

One nice alternative, however, is to take away the execute permission from
the vsyscall page of the traced process, so it will incur a SIGSEGV (which
the ptracer can easily handle and cancel).

The drawbacks of this nice-to-have alternative are:
1) It may require a bigger patch.
2) It needs to use a separate ptrace option, because current applications
   that use PTRACE_SYSCALL (or PTRACE_SYSEMU), including "strace", will not
   be aware of this new feature, so they can be broken.

> Yes, the tracee won't report _before_ it calls the function, but perhaps
> this is enough? Amnon?

The vsyscall page was designed in order to avoid user/kernel context
switches, which is possible when a system-call is only reading kernel
global information, not modifying any.  In most cases therefore (with
a few exceptions, such as when the hardware clock is broken or
misconfigured), system-calls in the vsyscall page never invoke the kernel
at all - so no trap would be generated either before or after.

Best Regards,
Amnon.

> On 11/20, Oleg Nesterov wrote:
>>
>> On 11/19, Steven Rostedt wrote:
>> >
>> > Is this really what we want?
>>
>> I am not sure, that is why I am asking.
>
> Yes.
>
> And,
>
>> Well yes, with the new implementation __vsyscall_page simply does
>> syscall, so I guess (say) strace will "just work". But, afaics, only
>> if vsyscall_mode == NATIVE.
>>
>> So perhaps bp in vsyscall_page still makes sense...
>
> Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to
> do
>
>
> 	if (current->ptrace && test_thread_flag(TIF_SYSCALL_TRACE))
> 		send_sigtrap(TRAP_VSYSCALL, ...);
>
> if it returns true?
>
> This way the debugger doesn't need to play with the debug registers,
> it can use ptrace(PTRACE_SYSCALL). The tracee will report SIGTRAP after
> vsyscall, debugger can detect this case looking at info->si_code.
>
> Yes, the tracee won't report _before_ it calls the function, but perhaps
> this is enough? Amnon?
>
> Oleg.
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check
  2012-11-20 23:16           ` u3557
@ 2012-11-21 14:16             ` Oleg Nesterov
  2012-11-21 17:30               ` Amnon Shiloh
  0 siblings, 1 reply; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-21 14:16 UTC (permalink / raw)
  To: u3557
  Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	linux-kernel

Hi Amnon,

Please read my previous email ;)
http://marc.info/?l=linux-kernel&m=135342649119153

On 11/21, u3557@miso.sublimeip.com wrote:
>
> Hi Oleg,
>
> > Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to
> > do
> >
> >
> > 	if (current->ptrace && test_thread_flag(TIF_SYSCALL_TRACE))
> > 		send_sigtrap(TRAP_VSYSCALL, ...);
> >
> > if it returns true?
> >
>
> I wish it were possible, but the vsyscall page is entered in user-mode,

Only in NATIVE mode. emulate_vsyscall() runs in kernel mode.

And in the NATIVE mode PTRACE_SYSCALL should work just fine, because:

> The vsyscall page was designed in order to avoid user/kernel context
> switches,

True, it was. But not today. Please look at __vsyscall_page:

	__vsyscall_page:

		mov $__NR_gettimeofday, %rax
		syscall
		ret

If you want the "fast" sys_time() without entering the kernel, you can
use __vdso_time(). And since vdso has the user-space mapping you can
insert "int3" or use hw breakpoints.

At least this is my understanding after I glanced at the new implementation.


However. It is not that I think that TRAP_VSYSCALL is really good idea.
At least it needs another option...

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check
  2012-11-21 14:16             ` Oleg Nesterov
@ 2012-11-21 17:30               ` Amnon Shiloh
  2012-11-22 16:12                 ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
  0 siblings, 1 reply; 62+ messages in thread
From: Amnon Shiloh @ 2012-11-21 17:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	linux-kernel

Hi Oleg,

Yes, I can see that "arch/x86/kernel/vsyscall_64.c"
has changed dramatically since I last looked at it.

Since this is the case, I no longer need to trap the vsyscall page.

Now however, that "vsyscall" was effectively replaced by vdso, it
creates a new problem for me and probably for anyone else who uses
some form of checkpoint/restore:

Suppose a process is checkpointed because the system needs to reboot
for a kernel-upgrade, then restored on the new and different kernel.
The new VDSO page may no longer match the new kernel - it could for
example fetch data from addresses in the vsyscall page that now
contain different things; or in case the hardware also was changed,
it may use machine-instructions that are now illegal.

As I don't mind to forego the "fast" sys_time(), my obvious solution
is to disable the vdso for traced processes that may be checkpointed.

One way to do it would be by brute-force: straight after "execve"
unmap the tracee's vdso page, then manipulate the ELF tables in
its memory so the VDSO entry is gone and the library will not go
looking for it.  Alternately, the function-table within the VDSO
page can be erased.

I just wonder whether you know of an easier and more standard way
to disable the vdso in user-mode - ideally on a per-process basis,
or otherwise, if it's too hard, on the whole computer.  I searched
the web and found references to "/proc/sys/vm/vdso_enable", but I
have no such file or "sysctl" option on my system.

Best Regards,
Amnon.


> 
> Hi Amnon,
> 
> Please read my previous email ;)
> http://marc.info/?l=linux-kernel&m=135342649119153
> 
> On 11/21, u3557@miso.sublimeip.com wrote:
> >
> > Hi Oleg,
> >
> > > Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to
> > > do
> > >
> > >
> > > 	if (current->ptrace && test_thread_flag(TIF_SYSCALL_TRACE))
> > > 		send_sigtrap(TRAP_VSYSCALL, ...);
> > >
> > > if it returns true?
> > >
> >
> > I wish it were possible, but the vsyscall page is entered in user-mode,
> 
> Only in NATIVE mode. emulate_vsyscall() runs in kernel mode.
> 
> And in the NATIVE mode PTRACE_SYSCALL should work just fine, because:
> 
> > The vsyscall page was designed in order to avoid user/kernel context
> > switches,
> 
> True, it was. But not today. Please look at __vsyscall_page:
> 
> 	__vsyscall_page:
> 
> 		mov $__NR_gettimeofday, %rax
> 		syscall
> 		ret
> 
> If you want the "fast" sys_time() without entering the kernel, you can
> use __vdso_time(). And since vdso has the user-space mapping you can
> insert "int3" or use hw breakpoints.
> 
> At least this is my understanding after I glanced at the new implementation.
> 
> 
> However. It is not that I think that TRAP_VSYSCALL is really good idea.
> At least it needs another option...
> 
> Oleg.
> 
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-11-21 17:30               ` Amnon Shiloh
@ 2012-11-22 16:12                 ` Oleg Nesterov
  2012-11-22 20:57                   ` Pavel Emelyanov
                                     ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-22 16:12 UTC (permalink / raw)
  To: Amnon Shiloh, Cyrill Gorcunov, Pavel Emelyanov
  Cc: Steven Rostedt, Frederic Weisbecker, Ingo Molnar, Peter Zijlstra,
	linux-kernel

On 11/22, Amnon Shiloh wrote:
>
> Now however, that "vsyscall" was effectively replaced by vdso, it
> creates a new problem for me and probably for anyone else who uses
> some form of checkpoint/restore:

Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
seem to enjoy trying to solve the c/r problems.

> Suppose a process is checkpointed because the system needs to reboot
> for a kernel-upgrade, then restored on the new and different kernel.
> The new VDSO page may no longer match the new kernel - it could for
> example fetch data from addresses in the vsyscall page that now
> contain different things; or in case the hardware also was changed,
> it may use machine-instructions that are now illegal.

Sure. You shouldn't try to save/restore this page(s) directly. But
I do not really understand why do you need. IOW, I don't really
understand the problem, it depends on what c/r actually does.

> As I don't mind to forego the "fast" sys_time(), my obvious solution
> is to disable the vdso for traced processes that may be checkpointed.
>
> One way to do it would be by brute-force: straight after "execve"
> unmap the tracee's vdso page,

Not sure this will be always possible. For example, my (old) glibc
assumes that vsyscall() must work, I won't be surprised if some time
later it won't work without vdso. But again, I do not know.

> then manipulate the ELF tables in
> its memory so the VDSO entry is gone and the library will not go
> looking for it.

Probably it would be enough to simply erase AT_SYSINFO_EHDR note,
but again, I can be easily wrong.

> I just wonder whether you know of an easier and more standard way
> to disable the vdso in user-mode

Only the kernel parameter, afaics. vdso=0

> - ideally on a per-process basis,
> or otherwise, if it's too hard, on the whole computer.  I searched
> the web and found references to "/proc/sys/vm/vdso_enable", but I
> have no such file or "sysctl" option on my system.

sys/vm/vdso_enabled, but only if CONFIG_X86_32 for some reason. See
kernel/sysctl.c

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-11-22 16:12                 ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
@ 2012-11-22 20:57                   ` Pavel Emelyanov
  2012-11-23  0:20                     ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range Amnon Shiloh
  2012-11-23 17:42                     ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
  2012-11-23  9:14                   ` arch_check_bp_in_kernelspace: fix the range check Amnon Shiloh
  2012-11-26  9:44                   ` vdso && cr " Cyrill Gorcunov
  2 siblings, 2 replies; 62+ messages in thread
From: Pavel Emelyanov @ 2012-11-22 20:57 UTC (permalink / raw)
  To: Oleg Nesterov, Amnon Shiloh
  Cc: Cyrill Gorcunov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

On 11/22/2012 08:12 PM, Oleg Nesterov wrote:
> On 11/22, Amnon Shiloh wrote:
>>
>> Now however, that "vsyscall" was effectively replaced by vdso, it
>> creates a new problem for me and probably for anyone else who uses
>> some form of checkpoint/restore:
> 
> Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
> seem to enjoy trying to solve the c/r problems.

Thank you :)

>> Suppose a process is checkpointed because the system needs to reboot
>> for a kernel-upgrade, then restored on the new and different kernel.
>> The new VDSO page may no longer match the new kernel - it could for
>> example fetch data from addresses in the vsyscall page that now
>> contain different things; or in case the hardware also was changed,
>> it may use machine-instructions that are now illegal.

If we could make VDSO entry points not move across the kernels (iow, make
them looks as yet another syscall table) this would help, I suppose.

> Sure. You shouldn't try to save/restore this page(s) directly. But
> I do not really understand why do you need. IOW, I don't really
> understand the problem, it depends on what c/r actually does.

Think about it like this -- you stop a process, then change the kern^w VDSO
page. Everything should work as it used to be :)

>> As I don't mind to forego the "fast" sys_time(), my obvious solution
>> is to disable the vdso for traced processes that may be checkpointed.

This is very poor solution from my POV. Nobody wants to have their applications
work fast only until it's checkpointed.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range
  2012-11-22 20:57                   ` Pavel Emelyanov
@ 2012-11-23  0:20                     ` Amnon Shiloh
  2012-11-23 17:45                       ` Oleg Nesterov
  2012-11-23 17:42                     ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
  1 sibling, 1 reply; 62+ messages in thread
From: Amnon Shiloh @ 2012-11-23  0:20 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Oleg Nesterov, Cyrill Gorcunov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Hi Pavel,

> >>
> >> Now however, that "vsyscall" was effectively replaced by vdso, it
> >> creates a new problem for me and probably for anyone else who uses
> >> some form of checkpoint/restore:
> > 
> > Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
> > seem to enjoy trying to solve the c/r problems.
> 
> Thank you :)

Thank YOU for joining!

> >> Suppose a process is checkpointed because the system needs to reboot
> >> for a kernel-upgrade, then restored on the new and different kernel.
> >> The new VDSO page may no longer match the new kernel - it could for
> >> example fetch data from addresses in the vsyscall page that now
> >> contain different things; or in case the hardware also was changed,
> >> it may use machine-instructions that are now illegal.
> 
> If we could make VDSO entry points not move across the kernels (iow, make
> them looks as yet another syscall table) this would help, I suppose.

It will indeed solve PART of the problem, but there is one more issue:

One obviously cannot c/r a process while it runs in the VDSO page
without c/r'ing that page itself, but this can probably be handled
by single-stepping the process until it is out of that page (assuming
there are no sleeps, pauses or extremely long loops on that page) -
but suppose a catched signal interrupts the VDSO code and the process
needs to be checkpointed within that interrupt code - eventually it
will return ("sigreturn") to the VDSO page... a different page...
and probably fall on the wrong machine-instruction (or even between
machine-instructions), with all registers scrambled anyway.

The solution can be to hold all catched signals while in the VDSO page.
This is not something the application (or library) can reasonably do
due to the prohibitive cost of "sigprocmask()" before and after, defying
the whole purpose of the VDSO page, but could be achieved by some new
'prctl' option (or perhaps even be the default).

In my specific case, because the checkpointed process is ptraced,
and assuming VDSO entry points are fixed, the ptracer can postpone
all catched signals that occur within the VDSO page, but for others
who write/maintain a c/r package, that's probably not an option.

> 
> > Sure. You shouldn't try to save/restore this page(s) directly. But
> > I do not really understand why do you need. IOW, I don't really
> > understand the problem, it depends on what c/r actually does.
> 
> Think about it like this -- you stop a process, then change the kern^w VDSO
> page. Everything should work as it used to be :)

There are two reasons one may need to save/restore this page:
1) Entry points are not fixed (yet).
2) In case the process needs to return to it back from an interrupt.

> 
> >> As I don't mind to forego the "fast" sys_time(), my obvious solution
> >> is to disable the vdso for traced processes that may be checkpointed.
> 
> This is very poor solution from my POV. Nobody wants to have their applications
> work fast only until it's checkpointed.

I know, but it's a price I must and am willing to pay until a solution
is found that prevents catching signals within the VDSO page.

I made a small experiment and just zeroed out the whole VDSO page
straight after "execve" (brute force, easier than having to study
the internal format of the VDSO page).  The program worked, using
the glibc version of "gettimeofday()" instead (which used "vsyscall",
but probably for not much longer).

So consider my immediate personal problem solved - what I'll do next
is to compile a special temporary kernel with all vdso functions
(__vdso_gettimeofday, __vdso_time, __vdso_clock_gettime, __vdso_getcpu)
reduced to system-calls, so they become kernel/hardware-independent,
then I'll save and set aside the resulting VDSO page and always replace
original VDSO pages with "my-vdso" after "execve".

However, this doesn't solve the problem for other c/r packages that do
not ptrace their processes all the time, and therefore unable to replace
the VDSO page immediately after each "execve".
For them you will need to either:

1) fix the VDSO entry points + introduce a kernel feature to prevent
   catching signals within the VDSO page (probably a new prctl,
   or make it the default) ; or
2) Introduce a kernel feature (probably a new prctl, so long as
   it is not reset across fork/clone/exec) for those programs who
   request it to load a "slow-but-sure", kernel/hardware-independent
   version of the VDSO page.

> 
> Thanks,
> Pavel
> 


Thank you and Best Regards,
Amnon.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: arch_check_bp_in_kernelspace: fix the range check
  2012-11-22 16:12                 ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
  2012-11-22 20:57                   ` Pavel Emelyanov
@ 2012-11-23  9:14                   ` Amnon Shiloh
  2012-11-23 16:33                     ` Oleg Nesterov
  2012-11-26  9:44                   ` vdso && cr " Cyrill Gorcunov
  2 siblings, 1 reply; 62+ messages in thread
From: Amnon Shiloh @ 2012-11-23  9:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Dear Oleg,

After the detour into the VDSO page, which is really a separate issue,
but one which I was able to work around in user-mode (for my particular
needs, not yet for all others who write checkpoint/restore pacakges),
I am sad to report that despite my previous post, the "vsyscall" problem
was not solved after all.

I made some further tests, without the fix that allows a hardware
breakpoint in the kernel, hoping that fixing the VDSO issue solves
it all, but alas, current libraries still call the vsyscall page even
when the VDSO page is present.

My dynamic glibc library calls call the VDSO version of "gettimeofday()",
but still uses the vsyscall version of "time()", while the static library
uses vsyscall for both.

What I discovered now, is that PTRACE_SYSCALL (also PTRACE_SINGLESTEP)
does not work within the vsyscall page, so I cannot trap the kernel-calls
there (this is very simple to verify using "gdb" or "strace").

I suspect that statically-linked executables will be around for a while,
long after the rest of the glibc library moves to VDSO, hence I still need
to place hardware breakpoints in the vsyscall page.

The necessary patch was already discussed and is very simple.

Or, there is an alternative: if only I (the ptracer or the traced process)
was allowed to munmap the vsyscall page, just get rid of that page
altogether, or at least make it non-executable, that would be fine
too for me, because then the process will get a SIGSEGV signal, which
the ptracer can easily handle.

Best Regards,
Amnon.

> On 11/22, Amnon Shiloh wrote:
> >
> > Now however, that "vsyscall" was effectively replaced by vdso, it
> > creates a new problem for me and probably for anyone else who uses
> > some form of checkpoint/restore:
> 
> Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
> seem to enjoy trying to solve the c/r problems.
> 
> > Suppose a process is checkpointed because the system needs to reboot
> > for a kernel-upgrade, then restored on the new and different kernel.
> > The new VDSO page may no longer match the new kernel - it could for
> > example fetch data from addresses in the vsyscall page that now
> > contain different things; or in case the hardware also was changed,
> > it may use machine-instructions that are now illegal.
> 
> Sure. You shouldn't try to save/restore this page(s) directly. But
> I do not really understand why do you need. IOW, I don't really
> understand the problem, it depends on what c/r actually does.
> 
> > As I don't mind to forego the "fast" sys_time(), my obvious solution
> > is to disable the vdso for traced processes that may be checkpointed.
> >
> > One way to do it would be by brute-force: straight after "execve"
> > unmap the tracee's vdso page,
> 
> Not sure this will be always possible. For example, my (old) glibc
> assumes that vsyscall() must work, I won't be surprised if some time
> later it won't work without vdso. But again, I do not know.
> 
> > then manipulate the ELF tables in
> > its memory so the VDSO entry is gone and the library will not go
> > looking for it.
> 
> Probably it would be enough to simply erase AT_SYSINFO_EHDR note,
> but again, I can be easily wrong.
> 
> > I just wonder whether you know of an easier and more standard way
> > to disable the vdso in user-mode
> 
> Only the kernel parameter, afaics. vdso=0
> 
> > - ideally on a per-process basis,
> > or otherwise, if it's too hard, on the whole computer.  I searched
> > the web and found references to "/proc/sys/vm/vdso_enable", but I
> > have no such file or "sysctl" option on my system.
> 
> sys/vm/vdso_enabled, but only if CONFIG_X86_32 for some reason. See
> kernel/sysctl.c
> 
> Oleg.
> 
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: arch_check_bp_in_kernelspace: fix the range check
  2012-11-23  9:14                   ` arch_check_bp_in_kernelspace: fix the range check Amnon Shiloh
@ 2012-11-23 16:33                     ` Oleg Nesterov
  2012-11-23 17:05                       ` Oleg Nesterov
  2012-11-24 13:45                       ` Amnon Shiloh
  0 siblings, 2 replies; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-23 16:33 UTC (permalink / raw)
  To: Amnon Shiloh
  Cc: Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Hello Amnon,

I am a bit confused,

On 11/23, Amnon Shiloh wrote:
>
> What I discovered now, is that PTRACE_SYSCALL (also PTRACE_SINGLESTEP)
> does not work within the vsyscall page, so I cannot trap the kernel-calls
> there (this is very simple to verify using "gdb" or "strace").

Sure, but we alredy discussed this?

Once again, PTRACE_SYSCALL should work in the NATIVE mode. Obviously it
won't work in EMULATE mode but we can change emulate_vsyscall() to report
TRAP_VSYSCALL or even introduce PTRACE_EVENT_VSYSCALL.

> The necessary patch was already discussed and is very simple.

Do you mean TRAP_VSYSCALL/PTRACE_EVENT_VSYSCALL above or additional
in_gate_area_no_mm() check to allow the hw bp?

> Or, there is an alternative: if only I (the ptracer or the traced process)
> was allowed to munmap the vsyscall page,

It is not possible to unmap it. The kernel (swapper_pg_dir) has this
mapping, not the process. Unlike vdso. IOW, you can only "unmap" it
globally and obviously you can't do this from the userspace.

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: arch_check_bp_in_kernelspace: fix the range check
  2012-11-23 16:33                     ` Oleg Nesterov
@ 2012-11-23 17:05                       ` Oleg Nesterov
  2012-11-24 14:14                         ` Amnon Shiloh
  2012-11-24 13:45                       ` Amnon Shiloh
  1 sibling, 1 reply; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-23 17:05 UTC (permalink / raw)
  To: Amnon Shiloh
  Cc: Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

forgot to mention...

On 11/23, Oleg Nesterov wrote:
>
> On 11/23, Amnon Shiloh wrote:
> >
> > Or, there is an alternative: if only I (the ptracer or the traced process)
> > was allowed to munmap the vsyscall page,
>
> It is not possible to unmap it. The kernel (swapper_pg_dir) has this
> mapping, not the process. Unlike vdso. IOW, you can only "unmap" it
> globally and obviously you can't do this from the userspace.

And even if this were possible, this can't help. Please look at
__bad_area_nosemaphore()->emulate_vsyscall(), the process won't get
SIGSEGV. IOW, in fact EMULATE already "unmaps" this page (sets _NX)
to trigger the fault.

Sure, we can do something like below, but it doesn't look very nice
too.

Oleg.

--- x/arch/x86/mm/fault.c
+++ x/arch/x86/mm/fault.c
@@ -744,7 +744,8 @@ __bad_area_nosemaphore(struct pt_regs *r
 		 */
 		if (unlikely((error_code & PF_INSTR) &&
 			     ((address & ~0xfff) == VSYSCALL_START))) {
-			if (emulate_vsyscall(regs, address))
+			if (!(tsk->ptrace & PTRACE_O_DONTEMULATE) &&
+			    emulate_vsyscall(regs, address))
 				return;
 		}
 #endif


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-11-22 20:57                   ` Pavel Emelyanov
  2012-11-23  0:20                     ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range Amnon Shiloh
@ 2012-11-23 17:42                     ` Oleg Nesterov
  1 sibling, 0 replies; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-23 17:42 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Amnon Shiloh, Cyrill Gorcunov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

On 11/23, Pavel Emelyanov wrote:
>
> On 11/22/2012 08:12 PM, Oleg Nesterov wrote:
>
> > Sure. You shouldn't try to save/restore this page(s) directly. But
> > I do not really understand why do you need. IOW, I don't really
> > understand the problem, it depends on what c/r actually does.
>
> Think about it like this -- you stop a process, then change the kern^w VDSO
> page. Everything should work as it used to be :)

I understand. But again, this depends on how you restore. For example,
if you do not "transfer" libc/ld code, then probably vdso is not a
problem. Just you need to ensure the task was not stopped "inside" vdso.
Yes, yes, I understand that this is too limited, you probably want to
transfer "everything".

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range
  2012-11-23  0:20                     ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range Amnon Shiloh
@ 2012-11-23 17:45                       ` Oleg Nesterov
  2012-11-24 12:47                         ` Amnon Shiloh
  0 siblings, 1 reply; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-23 17:45 UTC (permalink / raw)
  To: Amnon Shiloh
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Amnon,

I am going to "ignore" this thread because this is not my area and
I can't help anyway. Just one note:

On 11/23, Amnon Shiloh wrote:
>
> The solution can be to hold all catched signals while in the VDSO page.
> ...
>
> 1) + introduce a kernel feature to prevent
>    catching signals within the VDSO page (probably a new prctl,
>    or make it the default)

Sorry, never ;)

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range
  2012-11-23 17:45                       ` Oleg Nesterov
@ 2012-11-24 12:47                         ` Amnon Shiloh
  0 siblings, 0 replies; 62+ messages in thread
From: Amnon Shiloh @ 2012-11-24 12:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Pavel Emelyanov, Cyrill Gorcunov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Hi Oleg,

> Amnon,
> 
> I am going to "ignore" this thread because this is not my area and
> I can't help anyway. Just one note:
> 
> On 11/23, Amnon Shiloh wrote:
> >
> > The solution can be to hold all catched signals while in the VDSO page.
> > ...
> >
> > 1) + introduce a kernel feature to prevent
> >    catching signals within the VDSO page (probably a new prctl,
> >    or make it the default)
> 
> Sorry, never ;)
> 
> Oleg.

It's OK with me because I already found a way to work around this that
works for me, but I suspect that other people who write checkpoint/restore
packages may not be able to use my soltion and so they will have a problem
with interrupts occuring within the VDSO page.

I therefore suggested an alternate solution, for all such systems where
applications can be checkpointed on one kernel and restarted on another:
to allow the user to ask for an ultra-compatible VDSO version, which would
be exactly the same on all kernels (from a given point in time) and all
kernel configurations, even if it means a loss of performance.  This is
needed for systems where applications can be checkpointed on one kernel
and restarted on another.

It could even be a kernel configuration option: CONFIG_ULTRA_COMPAT_VDSO,
but ideally it should be the user's choice.

Best Regards,
Amnon.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: arch_check_bp_in_kernelspace: fix the range check
  2012-11-23 16:33                     ` Oleg Nesterov
  2012-11-23 17:05                       ` Oleg Nesterov
@ 2012-11-24 13:45                       ` Amnon Shiloh
  2012-11-25 22:55                         ` Oleg Nesterov
  1 sibling, 1 reply; 62+ messages in thread
From: Amnon Shiloh @ 2012-11-24 13:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Hi Oleg,

> Hello Amnon,
> 
> I am a bit confused,

So let's get things in order.

1) I asked for the ability to set hardware breakpoints on the vsyscall
   page (x86 debug registers), so that the ptracer can stop the process
   whenever it attempts to jump there, then the ptracer can emulate those
   system calls instead (gettimeofday, time, getcpu).

   That would solve all my problems, because the traced process will
   never even enter the vsyscall page (the ptracer will adjust its
   program-counter).

2) I was then told (in my own words): "oh, don't worry, the vsyscall page
   has now been minimized, all it contains now is *real* system calls,
   and it always calls them".

   [as a side-issue I was introduced to the new VDSO, had some issues there
    and solved them separately, so we are back on the original topic]

3) I was thinking to myself - well, that's fine, if the vsyscall now
   always invokes a *real* system-call (and nothing else), then the
   ptracer can catch it just like any other system-call using
   PTRACE_SYSCALL (or PTRACE_SYSEMU), and emulate it as usual,
   vsyscall-or-no-vsyscall.

4) I made some tests and found that I was wrong in my assumption:
   PTRACE_SYSCALL does NOT work within the vsyscall page (nor does
   PTRACE_SINGLESTEP).  Ptracers are not even aware that their tracee
   ever issued a system call there (despite using PTRACE_SYSCALL or
   PTRACE_SYSEMU), so they are unable to emulate it (or even to report
   it, in the case of "strace").

5) Therefore, I still need the original feature - to relax
   "arch_check_bp_in_kernelspace()", or whatever else will allow me
   to set the x86 debug-registers to trap all attempts to enter the
   vsyscall page.

6) I just suggested an alternative: to have the whole vsyscall page
   removed on a per-process basis. I accept your reply that this is
   not possible.

7) I suggested a third alternative: to have the vsyscall page be
   unexecutable on a per-process basis, so attempts to use it will
   incur SIGSEGV.   I understand that this option is still under
   discussion.

8) Any solution that allows a ptracer to prevent its traced process
   from entering the vsyscall page and execute there system-calls
   unchecked (thus in effect escape its jailer), would do for me.

Best Regards,
Amnon.


> 
> On 11/23, Amnon Shiloh wrote:
> >
> > What I discovered now, is that PTRACE_SYSCALL (also PTRACE_SINGLESTEP)
> > does not work within the vsyscall page, so I cannot trap the kernel-calls
> > there (this is very simple to verify using "gdb" or "strace").
> 
> Sure, but we alredy discussed this?
> 
> Once again, PTRACE_SYSCALL should work in the NATIVE mode. Obviously it
> won't work in EMULATE mode but we can change emulate_vsyscall() to report
> TRAP_VSYSCALL or even introduce PTRACE_EVENT_VSYSCALL.
> 
> > The necessary patch was already discussed and is very simple.
> 
> Do you mean TRAP_VSYSCALL/PTRACE_EVENT_VSYSCALL above or additional
> in_gate_area_no_mm() check to allow the hw bp?
> 
> > Or, there is an alternative: if only I (the ptracer or the traced process)
> > was allowed to munmap the vsyscall page,
> 
> It is not possible to unmap it. The kernel (swapper_pg_dir) has this
> mapping, not the process. Unlike vdso. IOW, you can only "unmap" it
> globally and obviously you can't do this from the userspace.
> 
> Oleg.
> 
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: arch_check_bp_in_kernelspace: fix the range check
  2012-11-23 17:05                       ` Oleg Nesterov
@ 2012-11-24 14:14                         ` Amnon Shiloh
  0 siblings, 0 replies; 62+ messages in thread
From: Amnon Shiloh @ 2012-11-24 14:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Hi Oleg,

This patch may look ugly, but it is one way to solve my problem.

This way, "strace" too, which is broken since the introduction of
the vsyscall page, will again be able to report when the program
calls "time()" or "gettimeofday()" - currently it cannot!

I think that allowing to set the x86 debug-registers to the
vsyscall page is more elegant - but do whatever you prefer.

Best Regards,
Amnon.



> forgot to mention...
> 
> On 11/23, Oleg Nesterov wrote:
> >
> > On 11/23, Amnon Shiloh wrote:
> > >
> > > Or, there is an alternative: if only I (the ptracer or the traced process)
> > > was allowed to munmap the vsyscall page,
> >
> > It is not possible to unmap it. The kernel (swapper_pg_dir) has this
> > mapping, not the process. Unlike vdso. IOW, you can only "unmap" it
> > globally and obviously you can't do this from the userspace.
> 
> And even if this were possible, this can't help. Please look at
> __bad_area_nosemaphore()->emulate_vsyscall(), the process won't get
> SIGSEGV. IOW, in fact EMULATE already "unmaps" this page (sets _NX)
> to trigger the fault.
> 
> Sure, we can do something like below, but it doesn't look very nice
> too.
> 
> Oleg.
> 
> --- x/arch/x86/mm/fault.c
> +++ x/arch/x86/mm/fault.c
> @@ -744,7 +744,8 @@ __bad_area_nosemaphore(struct pt_regs *r
>  		 */
>  		if (unlikely((error_code & PF_INSTR) &&
>  			     ((address & ~0xfff) == VSYSCALL_START))) {
> -			if (emulate_vsyscall(regs, address))
> +			if (!(tsk->ptrace & PTRACE_O_DONTEMULATE) &&
> +			    emulate_vsyscall(regs, address))
>  				return;
>  		}
>  #endif
> 
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: arch_check_bp_in_kernelspace: fix the range check
  2012-11-24 13:45                       ` Amnon Shiloh
@ 2012-11-25 22:55                         ` Oleg Nesterov
  2012-11-25 23:48                           ` Amnon Shiloh
  0 siblings, 1 reply; 62+ messages in thread
From: Oleg Nesterov @ 2012-11-25 22:55 UTC (permalink / raw)
  To: Amnon Shiloh
  Cc: Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

On 11/25, Amnon Shiloh wrote:
>
> 2) I was then told (in my own words): "oh, don't worry, the vsyscall page
>    has now been minimized, all it contains now is *real* system calls,
>    and it always calls them".

Not sure where did you get this idea ;) From the very beginning you were
told that EMULATE mode doesn't do this.

The NATIVE mode should be fine, yes.

> 6) I just suggested an alternative: to have the whole vsyscall page
>    removed on a per-process basis. I accept your reply that this is
>    not possible.

Yes, this is not possible.

> 7) I suggested a third alternative: to have the vsyscall page be
>    unexecutable on a per-process basis,

Like above, this is simply not possible. And at the same time the
vsyscall page is already unexecutable in EMULATE mode, but globally.

> 8) Any solution that allows a ptracer to prevent its traced process
>    from entering the vsyscall page and execute there system-calls
>    unchecked (thus in effect escape its jailer), would do for me.

Well. I am even more confused... probably this was already discussed
and I missed this, but.

Why do you need to _prevent_, say, sys_gettimeofday()? Why we can't
change emulate_vsyscall() to respect PTRACE_SYSCALL and report
TRAP_VSYSCALL or PTRACE_EVENT_VSYSCALL as I tried to suggest in
http://marc.info/?l=linux-kernel&m=135343635523715 ?

You previously replied that this can not work. Now that you see that
this _can_ work, could you please explain why this is not enough?

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: arch_check_bp_in_kernelspace: fix the range check
  2012-11-25 22:55                         ` Oleg Nesterov
@ 2012-11-25 23:48                           ` Amnon Shiloh
  2012-12-02 19:30                             ` PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
  0 siblings, 1 reply; 62+ messages in thread
From: Amnon Shiloh @ 2012-11-25 23:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Hi Oleg,

> > 2) I was then told (in my own words): "oh, don't worry, the vsyscall page
> >    has now been minimized, all it contains now is *real* system calls,
> >    and it always calls them".
> 
> Not sure where did you get this idea ;) From the very beginning you were
> told that EMULATE mode doesn't do this.

Sorry, I was not aware of the existence of "EMULATE" at the time,
or that it was the default, so I lived in a "NATIVE" world... and
was content that yesterday's problem was solved...  I just looked
at the vsyscall page itself, found the system-calls there and was
"happy" with it, that I could now catch them like anywhere else.

> > 8) Any solution that allows a ptracer to prevent its traced process
> >    from entering the vsyscall page and execute there system-calls
> >    unchecked (thus in effect escape its jailer), would do for me.
> 
> Well. I am even more confused... probably this was already discussed
> and I missed this, but.
> 
> Why do you need to _prevent_, say, sys_gettimeofday()? Why we can't
> change emulate_vsyscall() to respect PTRACE_SYSCALL and report
> TRAP_VSYSCALL or PTRACE_EVENT_VSYSCALL as I tried to suggest in
> http://marc.info/?l=linux-kernel&m=135343635523715 ?
> 
> Oleg.
> 

For my own application, I would be happy with this.

But I suspect it might break current versions of "strace",
or similar programs that expect to find the program-counter
pointing at a "syscall" instruction.

At present "strace" fails to report "gettimeofday()", but at
least it does not crash.  Surely "strace" can and should be
enhanced to handle this, but existing versions may suffer.

> 
> You previously replied that this can not work. Now that you see that
> this _can_ work, could you please explain why this is not enough?

I think it COULD work, but not based on PTRACE_SYSCALL
(or PTRACE_SYSEMU) alone.  A new ptrace option will be needed, saying:
"Yes, I am aware of TRAP_VSYSCALL and I know how to handle it."

While for my own application, just fixing the range-check in
arch_check_bp_in_kernelspace will do, requiring a smaller patch,
I agree that fixing this properly by adding a new ptrace option
can help other programmers, so they need not bother with the x86
debug-registers (or perhaps they may need them for other purposes).

Best Regards,
Amnon.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-11-22 16:12                 ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
  2012-11-22 20:57                   ` Pavel Emelyanov
  2012-11-23  9:14                   ` arch_check_bp_in_kernelspace: fix the range check Amnon Shiloh
@ 2012-11-26  9:44                   ` Cyrill Gorcunov
  2012-11-26 12:27                     ` Andrey Wagin
  2 siblings, 1 reply; 62+ messages in thread
From: Cyrill Gorcunov @ 2012-11-26  9:44 UTC (permalink / raw)
  To: Oleg Nesterov, Amnon Shiloh
  Cc: Pavel Emelyanov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, Andrey Vagin, linux-kernel

On Thu, Nov 22, 2012 at 05:12:38PM +0100, Oleg Nesterov wrote:
> On 11/22, Amnon Shiloh wrote:
> >
> > Now however, that "vsyscall" was effectively replaced by vdso, it
> > creates a new problem for me and probably for anyone else who uses
> > some form of checkpoint/restore:
> 
> Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
> seem to enjoy trying to solve the c/r problems.

Hi Oleg, Amnon, sorry for delay on this message. First of all, I'm
not vDSO specialist but as to c/r aspect of it I've had in mind the
the following scenario: we dump vDSO area either complete euther simply
remember the functions addresses it provides, then on restore we replace
the functions prologue with jmp instruction which would point to the
vDSO entries provided us by a kernel. But again, I didn't spend much time
for vDSO yet.

	Cyrill

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-11-26  9:44                   ` vdso && cr " Cyrill Gorcunov
@ 2012-11-26 12:27                     ` Andrey Wagin
  2012-11-26 12:55                       ` Amnon Shiloh
  0 siblings, 1 reply; 62+ messages in thread
From: Andrey Wagin @ 2012-11-26 12:27 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Oleg Nesterov, Amnon Shiloh, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, LKML

2012/11/26 Cyrill Gorcunov <gorcunov@openvz.org>:
> On Thu, Nov 22, 2012 at 05:12:38PM +0100, Oleg Nesterov wrote:
>> On 11/22, Amnon Shiloh wrote:
>> >
>> > Now however, that "vsyscall" was effectively replaced by vdso, it
>> > creates a new problem for me and probably for anyone else who uses
>> > some form of checkpoint/restore:
>>
>> Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
>> seem to enjoy trying to solve the c/r problems.
>
> Hi Oleg, Amnon, sorry for delay on this message. First of all, I'm
> not vDSO specialist but as to c/r aspect of it I've had in mind the
> the following scenario: we dump vDSO area either complete euther simply
> remember the functions addresses it provides, then on restore we replace
> the functions prologue with jmp instruction which would point to the
> vDSO entries provided us by a kernel. But again, I didn't spend much time
> for vDSO yet.

I think this approach should work. Here is a bit more details:
1. A task will be not dumped in vdso code. If a task is in vdso code,
we will try to catch it again.
2. Only addresses of vdso symbols will be dumped.
3. On restore we will compare addresses of symbols. If addresses are
not changes, a new vdso will be mapped instead of old one. If they are
changed, a new vdso will be mapped in a free space and the old vdso
will be replaced on a proxy library, where all functions are replaced
on jumps to suitable functions in the new vdso.

Looks simple and does not required to hack a kernel.

>
>         Cyrill

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-11-26 12:27                     ` Andrey Wagin
@ 2012-11-26 12:55                       ` Amnon Shiloh
  2012-11-26 14:18                         ` Cyrill Gorcunov
  0 siblings, 1 reply; 62+ messages in thread
From: Amnon Shiloh @ 2012-11-26 12:55 UTC (permalink / raw)
  To: Andrey Wagin
  Cc: Cyrill Gorcunov, Oleg Nesterov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, LKML

Hi Andrey,

So what do you do if a catched interrupt occured inside vdso code?

You can easily detect that the task is currently in vdso code,
then delay its dump, but how do you detect whether a sigreturn()
or even a series of sigreturn()s won't bring it back to the middle
of some old vdso code?

You could of course keep that old code and modify only the very
first instruction of each routine into a jump instruction, but then
the code to which the process returns may not be compatible with
the new kernel and/or hardware configuration.

You cannot even rely on a vdso enter/exit counter, because interrupt
code can use "longjmp()".

My idea was to introduce a kernel option saying "don't send me any
catched signals while the program-counter is in this range (eg. the
vdso page), but I was told that it is not feasible to implement it.

Regards,
Amnon.

> 2012/11/26 Cyrill Gorcunov <gorcunov@openvz.org>:
> > On Thu, Nov 22, 2012 at 05:12:38PM +0100, Oleg Nesterov wrote:
> >> On 11/22, Amnon Shiloh wrote:
> >> >
> >> > Now however, that "vsyscall" was effectively replaced by vdso, it
> >> > creates a new problem for me and probably for anyone else who uses
> >> > some form of checkpoint/restore:
> >>
> >> Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
> >> seem to enjoy trying to solve the c/r problems.
> >
> > Hi Oleg, Amnon, sorry for delay on this message. First of all, I'm
> > not vDSO specialist but as to c/r aspect of it I've had in mind the
> > the following scenario: we dump vDSO area either complete euther simply
> > remember the functions addresses it provides, then on restore we replace
> > the functions prologue with jmp instruction which would point to the
> > vDSO entries provided us by a kernel. But again, I didn't spend much time
> > for vDSO yet.
> 
> I think this approach should work. Here is a bit more details:
> 1. A task will be not dumped in vdso code. If a task is in vdso code,
> we will try to catch it again.
> 2. Only addresses of vdso symbols will be dumped.
> 3. On restore we will compare addresses of symbols. If addresses are
> not changes, a new vdso will be mapped instead of old one. If they are
> changed, a new vdso will be mapped in a free space and the old vdso
> will be replaced on a proxy library, where all functions are replaced
> on jumps to suitable functions in the new vdso.
> 
> Looks simple and does not required to hack a kernel.
> 
> >
> >         Cyrill
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-11-26 12:55                       ` Amnon Shiloh
@ 2012-11-26 14:18                         ` Cyrill Gorcunov
  2012-11-26 14:26                           ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range Amnon Shiloh
  0 siblings, 1 reply; 62+ messages in thread
From: Cyrill Gorcunov @ 2012-11-26 14:18 UTC (permalink / raw)
  To: Amnon Shiloh
  Cc: Andrey Wagin, Oleg Nesterov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, LKML

On Mon, Nov 26, 2012 at 11:55:01PM +1100, Amnon Shiloh wrote:
> 
> You could of course keep that old code and modify only the very
> first instruction of each routine into a jump instruction, but then
> the code to which the process returns may not be compatible with
> the new kernel and/or hardware configuration.

For sure there will be some limitations but I fear we can't do
that much with it. I don't expect the regular program to use
sigreturn for jumping into vdso code, but I could be wrong.

	Cyrill

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range
  2012-11-26 14:18                         ` Cyrill Gorcunov
@ 2012-11-26 14:26                           ` Amnon Shiloh
  2012-11-26 14:41                             ` vdso && cr Cyrill Gorcunov
  0 siblings, 1 reply; 62+ messages in thread
From: Amnon Shiloh @ 2012-11-26 14:26 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrey Wagin, Oleg Nesterov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, LKML

Hi Cyrill,

Programmers don't (and the manual-page says they shouldn't even try)
call "sigreturn" directly.

If an interrupt happens, by bad luck, to occur while the process
is running vdso code, then eventually, once signal-processing is
complete, "sigreturn" (issued by  glibc) will take the process back
to where it was before the interrupt happend, inside the vdso page.

Best Regards,
Amnon.

> On Mon, Nov 26, 2012 at 11:55:01PM +1100, Amnon Shiloh wrote:
> > 
> > You could of course keep that old code and modify only the very
> > first instruction of each routine into a jump instruction, but then
> > the code to which the process returns may not be compatible with
> > the new kernel and/or hardware configuration.
> 
> For sure there will be some limitations but I fear we can't do
> that much with it. I don't expect the regular program to use
> sigreturn for jumping into vdso code, but I could be wrong.
> 
> 	Cyrill
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: vdso && cr
  2012-11-26 14:26                           ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range Amnon Shiloh
@ 2012-11-26 14:41                             ` Cyrill Gorcunov
  0 siblings, 0 replies; 62+ messages in thread
From: Cyrill Gorcunov @ 2012-11-26 14:41 UTC (permalink / raw)
  To: Amnon Shiloh
  Cc: Andrey Wagin, Oleg Nesterov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, LKML

On Tue, Nov 27, 2012 at 01:26:55AM +1100, Amnon Shiloh wrote:
> Programmers don't (and the manual-page says they shouldn't even try)
> call "sigreturn" directly.
> 
> If an interrupt happens, by bad luck, to occur while the process
> is running vdso code, then eventually, once signal-processing is
> complete, "sigreturn" (issued by  glibc) will take the process back
> to where it was before the interrupt happend, inside the vdso page.

Hi Amnon, good point, need to think. Frankly I would like to escape
kernel patching as long as possible ;)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-11-25 23:48                           ` Amnon Shiloh
@ 2012-12-02 19:30                             ` Oleg Nesterov
  2012-12-02 23:54                               ` u3557
  2012-12-05  9:29                               ` PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check) Jan Kratochvil
  0 siblings, 2 replies; 62+ messages in thread
From: Oleg Nesterov @ 2012-12-02 19:30 UTC (permalink / raw)
  To: Amnon Shiloh, Denys Vlasenko, Pedro Alves, Jan Kratochvil
  Cc: Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Amnon, sorry for delay...

On 11/26, Amnon Shiloh wrote:
>
> > Why do you need to _prevent_, say, sys_gettimeofday()? Why we can't
> > change emulate_vsyscall() to respect PTRACE_SYSCALL and report
> > TRAP_VSYSCALL or PTRACE_EVENT_VSYSCALL as I tried to suggest in
> > http://marc.info/?l=linux-kernel&m=135343635523715 ?
> >
> > Oleg.
> >
>
> For my own application, I would be happy with this.

OK, good.

> But I suspect it might break current versions of "strace",
> ...
> I think it COULD work, but not based on PTRACE_SYSCALL
> (or PTRACE_SYSEMU) alone.  A new ptrace option will be needed, saying:
> "Yes, I am aware of TRAP_VSYSCALL and I know how to handle it."

Yes, that is why I said this needs the new option.

However. Of course it would be nice to avoid the new option. IMO it
would be better to do nothing ;) vsyscall is deprecated, and EMULATE
is x86-specific.


May be we could simply do something like the patch below? (Just in
case, this hack is only for illustration, it is not complete).

If the tracer does PTRACE_SYSCALL the tracee reports syscall exit
_after_ gettimeofday/etc. The tracer can look at regs->orig_ax == -1
and detect that this is not syscall but vsyscall, it can look at
regs->ip then (not with the patch below).

Denys, Jan, Pedro. Do you think this change can break/confuse
gdb/strace ?


> While for my own application, just fixing the range-check in
> arch_check_bp_in_kernelspace will do,

You forgot again that EMULATE does not execute the code in the
vsyscall page.

Oleg.

--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -186,6 +186,8 @@ static bool write_ok_or_segv(unsigned long ptr, size_t size)
 	}
 }
 
+#include <linux/tracehook.h>
+
 bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
 {
 	struct task_struct *tsk;
@@ -312,6 +314,8 @@ do_ret:
 	regs->ip = caller;
 	regs->sp += 8;
 done:
+	if (test_thread_flag(TIF_SYSCALL_TRACE))
+		ptrace_report_syscall(regs);
 	return true;
 
 sigsegv:


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-12-02 19:30                             ` PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
@ 2012-12-02 23:54                               ` u3557
  2012-12-04 17:59                                 ` Oleg Nesterov
  2012-12-05  9:29                               ` PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check) Jan Kratochvil
  1 sibling, 1 reply; 62+ messages in thread
From: u3557 @ 2012-12-02 23:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Amnon Shiloh, Denys Vlasenko, Pedro Alves, Jan Kratochvil,
	Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Hi Oleg,

> However. Of course it would be nice to avoid the new option. IMO it
> would be better to do nothing ;) vsyscall is deprecated, and EMULATE
> is x86-specific.

The problem is that the current static glibc invokes the vsyscall page,
so statically-linked 3rd-party executables that were distributed only
in binary form are not going to stop using vsyscall even in 100 years.

Note also that even now, while the dynamic glibc library uses vDSO
for "gettimeofday()" it still uses the vsyscall page for "time()",
so even when fixed, it would still take years in any case until all
Linux distributions are free of references to the vsyscall page.

> You forgot again that EMULATE does not execute the code in the
> vsyscall page.

The beauty of using the x86 debug-registers, is that they do not
trap the instruction, but rather the fact that the program-counter
has a given value.  They work like single-stepping, so the condition
of attempting to access the vsyscall page is detected in hardware
BEFORE attempting to access the vsyscall page itself, BEFORE even
discovering that the vsyscall page is inaccessible (in EMULATE mode).

Once trapped of course, the program-counter will be changed by the ptracer,
so neither NATIVE nor EMULATE will ever be invoked.

Current versions of "strace" and "gdb" will not automatically benefit
from this, but wouldn't be harmed either.  Future versions can then
be written to make use of the debug-registers to detect a vsyscall.

Best Regards,
Amnon.

> Amnon, sorry for delay...
>
> On 11/26, Amnon Shiloh wrote:
>>
>> > Why do you need to _prevent_, say, sys_gettimeofday()? Why we can't
>> > change emulate_vsyscall() to respect PTRACE_SYSCALL and report
>> > TRAP_VSYSCALL or PTRACE_EVENT_VSYSCALL as I tried to suggest in
>> > http://marc.info/?l=linux-kernel&m=135343635523715 ?
>> >
>> > Oleg.
>> >
>>
>> For my own application, I would be happy with this.
>
> OK, good.
>
>> But I suspect it might break current versions of "strace",
>> ...
>> I think it COULD work, but not based on PTRACE_SYSCALL
>> (or PTRACE_SYSEMU) alone.  A new ptrace option will be needed, saying:
>> "Yes, I am aware of TRAP_VSYSCALL and I know how to handle it."
>
> Yes, that is why I said this needs the new option.
>
> However. Of course it would be nice to avoid the new option. IMO it
> would be better to do nothing ;) vsyscall is deprecated, and EMULATE
> is x86-specific.
>
>
> May be we could simply do something like the patch below? (Just in
> case, this hack is only for illustration, it is not complete).
>
> If the tracer does PTRACE_SYSCALL the tracee reports syscall exit
> _after_ gettimeofday/etc. The tracer can look at regs->orig_ax == -1
> and detect that this is not syscall but vsyscall, it can look at
> regs->ip then (not with the patch below).
>
> Denys, Jan, Pedro. Do you think this change can break/confuse
> gdb/strace ?
>
>
>> While for my own application, just fixing the range-check in
>> arch_check_bp_in_kernelspace will do,
>
> You forgot again that EMULATE does not execute the code in the
> vsyscall page.
>
> Oleg.
>
> --- a/arch/x86/kernel/vsyscall_64.c
> +++ b/arch/x86/kernel/vsyscall_64.c
> @@ -186,6 +186,8 @@ static bool write_ok_or_segv(unsigned long ptr, size_t
> size)
>  	}
>  }
>
> +#include <linux/tracehook.h>
> +
>  bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
>  {
>  	struct task_struct *tsk;
> @@ -312,6 +314,8 @@ do_ret:
>  	regs->ip = caller;
>  	regs->sp += 8;
>  done:
> +	if (test_thread_flag(TIF_SYSCALL_TRACE))
> +		ptrace_report_syscall(regs);
>  	return true;
>
>  sigsegv:
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-12-02 23:54                               ` u3557
@ 2012-12-04 17:59                                 ` Oleg Nesterov
  2012-12-04 22:44                                   ` u3557
  2013-01-08 17:08                                   ` Pedro Alves
  0 siblings, 2 replies; 62+ messages in thread
From: Oleg Nesterov @ 2012-12-04 17:59 UTC (permalink / raw)
  To: u3557
  Cc: Denys Vlasenko, Pedro Alves, Jan Kratochvil, Cyrill Gorcunov,
	Pavel Emelyanov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

On 12/03, u3557@miso.sublimeip.com wrote:
>
> > However. Of course it would be nice to avoid the new option. IMO it
> > would be better to do nothing ;) vsyscall is deprecated, and EMULATE
> > is x86-specific.
>
> The problem is that the current static glibc invokes the vsyscall page,

Yes I know.

Still I'd like to avoid to change the ptrace API, even if the change is
simple. This emulate_vsyscall() is too "exotic" imho.

> > You forgot again that EMULATE does not execute the code in the
> > vsyscall page.
>
> The beauty of using the x86 debug-registers, is that they do not
> trap the instruction, but rather the fact that the program-counter
> has a given value.

Yes, I understand, so DR_RW_EXECUTE should probably work. And I even
sent the patch (untested/uncompiled). But given that even the simple
bugfix which started this thread was ignored by maintainers, I am
not sure how we can convince them this change makes sense ;)

However. This looks like a hack to me, because this code is never
executed. But this is sudjective and I am not saying this can't work.
And yes, this doesn't add new ptrace hacks.



But If we want to allow to trace vsyscall's, hw bp doesn't look very
nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.

And what about strace? It won't be easy to change it to use hwbp.


That is why I think PTRACE_SYSCALL should "simply work" somehow. And
so far I think that "just report syscall_exit with orig_ax = -1" is
the best (and simple) solution.

OK. We can do more. We can report both syscall_enter/exit and we can
change orig_ax/ax temporary to "fool" the tracer, so that everything
will look as a "normal" syscall. Like vsyscall_seccomp() does.

But this needs much more changes.

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-12-04 17:59                                 ` Oleg Nesterov
@ 2012-12-04 22:44                                   ` u3557
  2013-01-08 17:08                                   ` Pedro Alves
  1 sibling, 0 replies; 62+ messages in thread
From: u3557 @ 2012-12-04 22:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: u3557, Denys Vlasenko, Pedro Alves, Jan Kratochvil,
	Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Dear Oleg,

> Yes, I understand, so DR_RW_EXECUTE should probably work. And I even
> sent the patch (untested/uncompiled). But given that even the simple
> bugfix which started this thread was ignored by maintainers, I am
> not sure how we can convince them this change makes sense ;)

Just to confirm, DR_RW_EXECUTE won't only "probably" work - it DOES work,
I have tested it.

A fully super-duper automatic and transparent emulation of PTRACE_SYSCALL,
including the faking of user-registers, would undoubtedly be great, but
it's complex and will require a large patch, while here is a trivial 1-2
line fix which doesn't harm anyone and allows ptracers to trap a vsyscall
in no time.

> But If we want to allow to trace vsyscall's, hw bp doesn't look very
> nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.

The next solution in line, in terms of patch-size, if we don't want to
waste debug registers, is to have the execute permission of the vsyscall
page changed on context switches when a process/ptracer requests so, in a
manner similar to prctl(PR_SET_TSC).

Best Regards,
Amnon.

>>
>> > However. Of course it would be nice to avoid the new option. IMO it
>> > would be better to do nothing ;) vsyscall is deprecated, and EMULATE
>> > is x86-specific.
>>
>> The problem is that the current static glibc invokes the vsyscall page,
>
> Yes I know.
>
> Still I'd like to avoid to change the ptrace API, even if the change is
> simple. This emulate_vsyscall() is too "exotic" imho.
>
>> > You forgot again that EMULATE does not execute the code in the
>> > vsyscall page.
>>
>> The beauty of using the x86 debug-registers, is that they do not
>> trap the instruction, but rather the fact that the program-counter
>> has a given value.
>
> Yes, I understand, so DR_RW_EXECUTE should probably work. And I even
> sent the patch (untested/uncompiled). But given that even the simple
> bugfix which started this thread was ignored by maintainers, I am
> not sure how we can convince them this change makes sense ;)
>
> However. This looks like a hack to me, because this code is never
> executed. But this is sudjective and I am not saying this can't work.
> And yes, this doesn't add new ptrace hacks.
>
>
>
> But If we want to allow to trace vsyscall's, hw bp doesn't look very
> nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.
>
> And what about strace? It won't be easy to change it to use hwbp.
>
>
> That is why I think PTRACE_SYSCALL should "simply work" somehow. And
> so far I think that "just report syscall_exit with orig_ax = -1" is
> the best (and simple) solution.
>
> OK. We can do more. We can report both syscall_enter/exit and we can
> change orig_ax/ax temporary to "fool" the tracer, so that everything
> will look as a "normal" syscall. Like vsyscall_seccomp() does.
>
> But this needs much more changes.
>
> Oleg.
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-12-02 19:30                             ` PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
  2012-12-02 23:54                               ` u3557
@ 2012-12-05  9:29                               ` Jan Kratochvil
  2012-12-05 13:14                                 ` u3557
  1 sibling, 1 reply; 62+ messages in thread
From: Jan Kratochvil @ 2012-12-05  9:29 UTC (permalink / raw)
  To: Oleg Nesterov, mosix
  Cc: Amnon Shiloh, Denys Vlasenko, Pedro Alves, Cyrill Gorcunov,
	Pavel Emelyanov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

On Sun, 02 Dec 2012 20:30:58 +0100, Oleg Nesterov wrote:
> Yes, that is why I said this needs the new option.

I do not mind new options although personally I do not find them meaningful
for an already deprecated ABI compatibility-only issue.


> If the tracer does PTRACE_SYSCALL the tracee reports syscall exit
> _after_ gettimeofday/etc. The tracer can look at regs->orig_ax == -1
> and detect that this is not syscall but vsyscall, it can look at
> regs->ip then (not with the patch below).

I believe applications just call PTRACE_SYSCALL twice, without checking
orig_eax.  At least strace and its TCB_INSYSCALL looks so.


On Mon, 03 Dec 2012 00:54:58 +0100, u3557@miso.sublimeip.com wrote:
> The beauty of using the x86 debug-registers,

x86 debug registers are already very scarce.  Besides that userland
applications know they have 4 of them available so it would also break them.


Regards,
Jan

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-12-05  9:29                               ` PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check) Jan Kratochvil
@ 2012-12-05 13:14                                 ` u3557
  0 siblings, 0 replies; 62+ messages in thread
From: u3557 @ 2012-12-05 13:14 UTC (permalink / raw)
  To: Jan Kratochvil
  Cc: Oleg Nesterov, Amnon Shiloh, Denys Vlasenko, Pedro Alves,
	Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Dear Jan,

> x86 debug registers are already very scarce.  Besides that userland
> applications know they have 4 of them available so it would also break
> them.

If a userland application wants to cheat, then it has no need to bypass
the debug registers: even if there were 4096 of them, covering the whole
vsyscall page, it could simply copy the whole vsyscall page somewhere
else, then run (or emulate) it, or look directly at the raw data within
the vsyscall page.  The only way to overcome that would be to make the
vsyscall page either non-existent or unreadable.

Personally, allowing userland applications to read the vsyscall page
can't hurt me and my applications, but if someone else is concerned with
such malicious programs (does anyone?), if they need to construct the
strictest-of-strict jail, where jailed applications cannot glimpse
any information from the kernel they run on no matter how hard they try,
then they must at least make the vsyscall page unreadable, then rely
either on kernel emulation or a SIGSEGV (the later would be quite
sufficient for my own needs as a substitute for debug-registers,
but unfortunately not for the current version of "strace").

If, as I was told, it's too hard to remove the vsyscall page on a
per-process basis, then it's sufficient to make it unreadable on
context-switch.

My concern, however, is not with the bad guys, but with good honest
programs that would run incorrectly if allowed to call "time()" or
"gettimeofday()" unsupervised.  No good program or library jumps into
the vsyscall page except into its 3 official entry points.

In summary, it should be decided:

If it is important enough for Linux to support paranoidically strict
jails, then full kernel emulation of PTRACE_SYSCALL (and PTRACE_SYSEMU)
is inescapable.

If, on the other hand, there is only a need to allow applications such as
mine and "strace"/"gdb" to trap system-calls that occur via the vsyscall
page, then in principle a variety of options are possible:

1. Allow setting the x86 hardware-debug registers into the vsyscall page.
2. Optional (per-process) removal of execute-permission from the vsyscall
   page.
3. Optional (per-process) removal of both read and execute permissions
   from the vsyscall page.
4. Optional (per-process) elimination of the vsyscall page altogether.
5. Kernel vsyscall emulation code to send some signal or event to traced
   processes if the ptracer asked for it (using a new ptrace option).
6. Complete and transparent emulation of PTRACE_SYSCALL/PTRACE_SYSEMU.

Option #1 requires the least effort (a 2-line fix).
Option #6 requires the most effort.

Best Regards,
Amnon.

> On Sun, 02 Dec 2012 20:30:58 +0100, Oleg Nesterov wrote:
>> Yes, that is why I said this needs the new option.
>
> I do not mind new options although personally I do not find them
> meaningful
> for an already deprecated ABI compatibility-only issue.
>
>
>> If the tracer does PTRACE_SYSCALL the tracee reports syscall exit
>> _after_ gettimeofday/etc. The tracer can look at regs->orig_ax == -1
>> and detect that this is not syscall but vsyscall, it can look at
>> regs->ip then (not with the patch below).
>
> I believe applications just call PTRACE_SYSCALL twice, without checking
> orig_eax.  At least strace and its TCB_INSYSCALL looks so.
>
>
> On Mon, 03 Dec 2012 00:54:58 +0100, u3557@miso.sublimeip.com wrote:
>> The beauty of using the x86 debug-registers,
>
> x86 debug registers are already very scarce.  Besides that userland
> applications know they have 4 of them available so it would also break
> them.
>
>
> Regards,
> Jan
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2012-12-04 17:59                                 ` Oleg Nesterov
  2012-12-04 22:44                                   ` u3557
@ 2013-01-08 17:08                                   ` Pedro Alves
  2013-01-09 17:52                                     ` Oleg Nesterov
  1 sibling, 1 reply; 62+ messages in thread
From: Pedro Alves @ 2013-01-08 17:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: u3557, Denys Vlasenko, Jan Kratochvil, Cyrill Gorcunov,
	Pavel Emelyanov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

On 12/04/2012 05:59 PM, Oleg Nesterov wrote:

> But If we want to allow to trace vsyscall's, hw bp doesn't look very
> nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.

Irrespective of the whole syscall tracing issue, allowing HW bkpts in
the vsyscall just seems like a bug fix to me.

> That is why I think PTRACE_SYSCALL should "simply work" somehow. And
> so far I think that "just report syscall_exit with orig_ax = -1" is
> the best (and simple) solution.

If you report exit alone, you'll confuse current GDB into mistaking
it for an enter, and all following enter/exits end up swapped/confused.
GDB assumes trap/sysgood alternates between enter/exit, and always
starts with enter.

(Mildly related, GDB has an old comment in the code (linux-nat.c)
that says:

 "However, most architectures can't handle a syscall
 being traced on the way out if it wasn't traced on
 the way in."

I've no clue if that's still true nowadays.)

> OK. We can do more. We can report both syscall_enter/exit and we can
> change orig_ax/ax temporary to "fool" the tracer, so that everything
> will look as a "normal" syscall. Like vsyscall_seccomp() does.
> 
> But this needs much more changes.

I'd just like to add, that if any new syscall related option is
to be added, can we please just go all the way and add
PTRACE_EVENT_SYSCALL_ENTER|PTRACE_EVENT_SYSCALL_EXIT instead?

http://sourceware.org/gdb/wiki/LinuxKernelWishList

-- 
Pedro Alves


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2013-01-08 17:08                                   ` Pedro Alves
@ 2013-01-09 17:52                                     ` Oleg Nesterov
  2013-01-10  6:54                                       ` u3557
  0 siblings, 1 reply; 62+ messages in thread
From: Oleg Nesterov @ 2013-01-09 17:52 UTC (permalink / raw)
  To: Pedro Alves
  Cc: u3557, Denys Vlasenko, Jan Kratochvil, Cyrill Gorcunov,
	Pavel Emelyanov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

On 01/08, Pedro Alves wrote:
>
> On 12/04/2012 05:59 PM, Oleg Nesterov wrote:
>
> > But If we want to allow to trace vsyscall's, hw bp doesn't look very
> > nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.
>
> Irrespective of the whole syscall tracing issue, allowing HW bkpts in
> the vsyscall just seems like a bug fix to me.

And I never argued. I sent the patch iirc ;)

> > That is why I think PTRACE_SYSCALL should "simply work" somehow. And
> > so far I think that "just report syscall_exit with orig_ax = -1" is
> > the best (and simple) solution.
>
> If you report exit alone, you'll confuse current GDB into mistaking
> it for an enter,

Sure. That is why I asked Jan.

> > OK. We can do more. We can report both syscall_enter/exit and we can
> > change orig_ax/ax temporary to "fool" the tracer, so that everything
> > will look as a "normal" syscall. Like vsyscall_seccomp() does.
> >
> > But this needs much more changes.
>
> I'd just like to add, that if any new syscall related option is
> to be added, can we please just go all the way and add
> PTRACE_EVENT_SYSCALL_ENTER|PTRACE_EVENT_SYSCALL_EXIT instead?

Oh yes, this was suggested many times.

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2013-01-09 17:52                                     ` Oleg Nesterov
@ 2013-01-10  6:54                                       ` u3557
  2013-01-12 18:12                                         ` Oleg Nesterov
  0 siblings, 1 reply; 62+ messages in thread
From: u3557 @ 2013-01-10  6:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Pedro Alves, u3557, Denys Vlasenko, Jan Kratochvil,
	Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Hi Everyone,

> On 01/08, Pedro Alves wrote:
>>
>> On 12/04/2012 05:59 PM, Oleg Nesterov wrote:
>>
>> > But If we want to allow to trace vsyscall's, hw bp doesn't look very
>> > nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.
>>
>> Irrespective of the whole syscall tracing issue, allowing HW bkpts in
>> the vsyscall just seems like a bug fix to me.
>
> And I never argued. I sent the patch iirc ;)

Exactly, it is a bug and I am still waiting for it to be fixed in the
Linux kernel.

Fully emulating PTRACE_SYSCALL could also provide a suitable way to
fix my problem, and it may also help others by saving them the need
to program and waste x86 debug registers, but it doesn't change the
fact that my problem is caused by a bug in the first place, which
should be fixed in any case.

Best Regards,
Amnon.


>
>> > That is why I think PTRACE_SYSCALL should "simply work" somehow. And
>> > so far I think that "just report syscall_exit with orig_ax = -1" is
>> > the best (and simple) solution.
>>
>> If you report exit alone, you'll confuse current GDB into mistaking
>> it for an enter,
>
> Sure. That is why I asked Jan.
>
>> > OK. We can do more. We can report both syscall_enter/exit and we can
>> > change orig_ax/ax temporary to "fool" the tracer, so that everything
>> > will look as a "normal" syscall. Like vsyscall_seccomp() does.
>> >
>> > But this needs much more changes.
>>
>> I'd just like to add, that if any new syscall related option is
>> to be added, can we please just go all the way and add
>> PTRACE_EVENT_SYSCALL_ENTER|PTRACE_EVENT_SYSCALL_EXIT instead?
>
> Oh yes, this was suggested many times.
>
> Oleg.
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2013-01-10  6:54                                       ` u3557
@ 2013-01-12 18:12                                         ` Oleg Nesterov
  2013-01-14  2:31                                           ` u3557
  0 siblings, 1 reply; 62+ messages in thread
From: Oleg Nesterov @ 2013-01-12 18:12 UTC (permalink / raw)
  To: u3557
  Cc: Pedro Alves, Denys Vlasenko, Jan Kratochvil, Cyrill Gorcunov,
	Pavel Emelyanov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

On 01/10, u3557@miso.sublimeip.com wrote:
>
> Hi Everyone,
>
> > On 01/08, Pedro Alves wrote:
> >>
> >> On 12/04/2012 05:59 PM, Oleg Nesterov wrote:
> >>
> >> > But If we want to allow to trace vsyscall's, hw bp doesn't look very
> >> > nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.
> >>
> >> Irrespective of the whole syscall tracing issue, allowing HW bkpts in
> >> the vsyscall just seems like a bug fix to me.
> >
> > And I never argued. I sent the patch iirc ;)
>
> Exactly, it is a bug and I am still waiting for it to be fixed in the
> Linux kernel.

I would not say this is a bug but let me repeat, no need to convince me.

Please feel free to re-send the patch(es) I sent to maintainers. Sorry,
I can't push these changes into Linus's tree.

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2013-01-12 18:12                                         ` Oleg Nesterov
@ 2013-01-14  2:31                                           ` u3557
  2013-01-14 16:01                                             ` Oleg Nesterov
  0 siblings, 1 reply; 62+ messages in thread
From: u3557 @ 2013-01-14  2:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: u3557, Pedro Alves, Denys Vlasenko, Jan Kratochvil,
	Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1866 bytes --]

Hi,

> I would not say this is a bug but let me repeat, no need to convince me.
>
> Please feel free to re-send the patch(es) I sent to maintainers. Sorry,
> I can't push these changes into Linus's tree.

So here again is the patch that I need so badly - clearly it fixes a bug
and harms nobody:

-----------------------------------------------------------------------
diff -Naur before/arch/x86/kernel/hw_breakpoint.c
after/arch/x86/kernel/hw_breakpoint.c
--- before/arch/x86/kernel/hw_breakpoint.c	2013-01-14 12:45:20.000000000
+1030
+++ after/arch/x86/kernel/hw_breakpoint.c	2013-01-14 12:46:24.000000000 +1030
@@ -200,7 +200,8 @@
 	va = info->address;
 	len = get_hbp_len(info->len);

-	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE) &&
+		!((va >= VSYSCALL_START) && ((va + len - 1) <= VSYSCALL_END));
 }

 int arch_bp_generic_fields(int x86_len, int x86_type,
-----------------------------------------------------------------------

Where else can I send it?
Amnon.

> On 01/10, u3557@miso.sublimeip.com wrote:
>>
>> Hi Everyone,
>>
>> > On 01/08, Pedro Alves wrote:
>> >>
>> >> On 12/04/2012 05:59 PM, Oleg Nesterov wrote:
>> >>
>> >> > But If we want to allow to trace vsyscall's, hw bp doesn't look
>> very
>> >> > nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them
>> all.
>> >>
>> >> Irrespective of the whole syscall tracing issue, allowing HW bkpts in
>> >> the vsyscall just seems like a bug fix to me.
>> >
>> > And I never argued. I sent the patch iirc ;)
>>
>> Exactly, it is a bug and I am still waiting for it to be fixed in the
>> Linux kernel.
>
> I would not say this is a bug but let me repeat, no need to convince me.
>
> Please feel free to re-send the patch(es) I sent to maintainers. Sorry,
> I can't push these changes into Linus's tree.
>
> Oleg.
>
>

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 571 bytes --]

diff -Naur before/arch/x86/kernel/hw_breakpoint.c after/arch/x86/kernel/hw_breakpoint.c
--- before/arch/x86/kernel/hw_breakpoint.c	2013-01-14 12:45:20.000000000 +1030
+++ after/arch/x86/kernel/hw_breakpoint.c	2013-01-14 12:46:24.000000000 +1030
@@ -200,7 +200,8 @@
 	va = info->address;
 	len = get_hbp_len(info->len);
 
-	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE) &&
+		!((va >= VSYSCALL_START) && ((va + len - 1) <= VSYSCALL_END));
 }
 
 int arch_bp_generic_fields(int x86_len, int x86_type,

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
  2013-01-14  2:31                                           ` u3557
@ 2013-01-14 16:01                                             ` Oleg Nesterov
  2013-02-18  1:39                                               ` prctl(PR_SET_MM) Amnon Shiloh
  0 siblings, 1 reply; 62+ messages in thread
From: Oleg Nesterov @ 2013-01-14 16:01 UTC (permalink / raw)
  To: u3557
  Cc: Pedro Alves, Denys Vlasenko, Jan Kratochvil, Cyrill Gorcunov,
	Pavel Emelyanov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

On 01/14, u3557@miso.sublimeip.com wrote:
>
> So here again is the patch that I need so badly - clearly it fixes a bug
> and harms nobody:
>
> -----------------------------------------------------------------------
> diff -Naur before/arch/x86/kernel/hw_breakpoint.c
> after/arch/x86/kernel/hw_breakpoint.c
> --- before/arch/x86/kernel/hw_breakpoint.c	2013-01-14 12:45:20.000000000
> +1030
> +++ after/arch/x86/kernel/hw_breakpoint.c	2013-01-14 12:46:24.000000000 +1030
> @@ -200,7 +200,8 @@
>  	va = info->address;
>  	len = get_hbp_len(info->len);
> 
> -	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
> +	return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE) &&
> +		!((va >= VSYSCALL_START) && ((va + len - 1) <= VSYSCALL_END));
>  }

I meant this one: http://marc.info/?l=linux-kernel&m=135336050319266
on top of http://marc.info/?l=linux-kernel&m=135248575426474

But nobody bothers to take even the trivial bugfix I sent ;)

Oleg.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* prctl(PR_SET_MM)
  2013-01-14 16:01                                             ` Oleg Nesterov
@ 2013-02-18  1:39                                               ` Amnon Shiloh
  2013-02-18  5:44                                                 ` prctl(PR_SET_MM) Randy Dunlap
  2013-02-18 15:21                                                 ` prctl(PR_SET_MM) Steven Rostedt
  0 siblings, 2 replies; 62+ messages in thread
From: Amnon Shiloh @ 2013-02-18  1:39 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Pedro Alves, Denys Vlasenko, Jan Kratochvil, Cyrill Gorcunov,
	Pavel Emelyanov, Steven Rostedt, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

Hello,

The code in "kernel/sys.c" provides the "prctl(PR_SET_MM)" function,
which is the only way a process can set or modify the following 11
per-process fields:

 	start_code, end_code, start_data, end_data, start_brk, brk,
 	start_stack, arg_start, arg_end, env_start, env_end.

Being able to set those fields is important, even crucial,
for any conceivable user-level checkpointing software, as
well as for migrating processes between different computers.

Unfortunately, this code (essentially "prctl_set_mm()") is presently
enclosed in "#ifdef CONFIG_CHECKPOINT_RESTORE" which is configured
as "default n" in "init/Kconfig".  Many system-administrators who
may like to have a checkpoint/restore or process-migration facility,
but use standard pre-packaged kernels, find the requirement to
configure and compile their own non-standard kernel difficult or
too prohibitive.

Would it be possible to have this code enabled by default?

This could be done in one of 4 ways:
1) Having CONFIG_CHECKPOINT_RESTORE enabled by default; or
2) Releasing this code from the "#ifdef CONFIG_CHECK_RESTORE"; or
3) Placing this code within a different kernel-configuration option
   (say "CONFIG_BASIC_CHECKPOINTING") that is enabled by default; or
4) Placing this code under a dual #if, so instead of:
   #ifdef CONFIG_CHECKPOINT_RESTORE
	   have:
   #if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_BASIC_CHECKPOINTING)


Thank you,
Amnon.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-18  1:39                                               ` prctl(PR_SET_MM) Amnon Shiloh
@ 2013-02-18  5:44                                                 ` Randy Dunlap
  2013-02-18 15:21                                                 ` prctl(PR_SET_MM) Steven Rostedt
  1 sibling, 0 replies; 62+ messages in thread
From: Randy Dunlap @ 2013-02-18  5:44 UTC (permalink / raw)
  To: u3557
  Cc: Amnon Shiloh, Oleg Nesterov, Pedro Alves, Denys Vlasenko,
	Jan Kratochvil, Cyrill Gorcunov, Pavel Emelyanov, Steven Rostedt,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

On 02/17/13 17:39, Amnon Shiloh wrote:
> Hello,
> 
> The code in "kernel/sys.c" provides the "prctl(PR_SET_MM)" function,
> which is the only way a process can set or modify the following 11
> per-process fields:
> 
>  	start_code, end_code, start_data, end_data, start_brk, brk,
>  	start_stack, arg_start, arg_end, env_start, env_end.
> 
> Being able to set those fields is important, even crucial,
> for any conceivable user-level checkpointing software, as
> well as for migrating processes between different computers.
> 
> Unfortunately, this code (essentially "prctl_set_mm()") is presently
> enclosed in "#ifdef CONFIG_CHECKPOINT_RESTORE" which is configured
> as "default n" in "init/Kconfig".  Many system-administrators who
> may like to have a checkpoint/restore or process-migration facility,
> but use standard pre-packaged kernels, find the requirement to
> configure and compile their own non-standard kernel difficult or
> too prohibitive.
> 
> Would it be possible to have this code enabled by default?
> 
> This could be done in one of 4 ways:
> 1) Having CONFIG_CHECKPOINT_RESTORE enabled by default; or
> 2) Releasing this code from the "#ifdef CONFIG_CHECK_RESTORE"; or
> 3) Placing this code within a different kernel-configuration option
>    (say "CONFIG_BASIC_CHECKPOINTING") that is enabled by default; or
> 4) Placing this code under a dual #if, so instead of:
>    #ifdef CONFIG_CHECKPOINT_RESTORE
> 	   have:
>    #if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_BASIC_CHECKPOINTING)


This is basically a distro issue.  Distros can choose to enable
this code by default, but the Linux kernel that Linus maintains does
not need to enable it.


-- 
~Randy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-18  1:39                                               ` prctl(PR_SET_MM) Amnon Shiloh
  2013-02-18  5:44                                                 ` prctl(PR_SET_MM) Randy Dunlap
@ 2013-02-18 15:21                                                 ` Steven Rostedt
  2013-02-18 16:33                                                   ` prctl(PR_SET_MM) Amnon Shiloh
  1 sibling, 1 reply; 62+ messages in thread
From: Steven Rostedt @ 2013-02-18 15:21 UTC (permalink / raw)
  To: u3557
  Cc: Oleg Nesterov, Pedro Alves, Denys Vlasenko, Jan Kratochvil,
	Cyrill Gorcunov, Pavel Emelyanov, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

On Mon, 2013-02-18 at 12:39 +1100, Amnon Shiloh wrote:
> Hello,
> 
> The code in "kernel/sys.c" provides the "prctl(PR_SET_MM)" function,
> which is the only way a process can set or modify the following 11
> per-process fields:
> 
>  	start_code, end_code, start_data, end_data, start_brk, brk,
>  	start_stack, arg_start, arg_end, env_start, env_end.
> 
> Being able to set those fields is important, even crucial,
> for any conceivable user-level checkpointing software, as
> well as for migrating processes between different computers.

You're saying that this is useful for code not needing a kernel with
CHECKPOINT_RESTORE enabled. Correct?

> 
> Unfortunately, this code (essentially "prctl_set_mm()") is presently
> enclosed in "#ifdef CONFIG_CHECKPOINT_RESTORE" which is configured
> as "default n" in "init/Kconfig".  Many system-administrators who
> may like to have a checkpoint/restore or process-migration facility,
> but use standard pre-packaged kernels, find the requirement to
> configure and compile their own non-standard kernel difficult or
> too prohibitive.
> 
> Would it be possible to have this code enabled by default?
> 
> This could be done in one of 4 ways:
> 1) Having CONFIG_CHECKPOINT_RESTORE enabled by default; or

Nope, that wont due. Kernel policy is to have things default n. Have an
issue with a config, talk with the distribution you are dealing with.
They set the policy of what configs get set for their kernels.

> 2) Releasing this code from the "#ifdef CONFIG_CHECK_RESTORE"; or
> 3) Placing this code within a different kernel-configuration option
>    (say "CONFIG_BASIC_CHECKPOINTING") that is enabled by default; or
> 4) Placing this code under a dual #if, so instead of:
>    #ifdef CONFIG_CHECKPOINT_RESTORE
> 	   have:
>    #if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_BASIC_CHECKPOINTING)

One of the above 3 can probably be worked out.

-- Steve



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-18 15:21                                                 ` prctl(PR_SET_MM) Steven Rostedt
@ 2013-02-18 16:33                                                   ` Amnon Shiloh
  2013-02-18 19:49                                                     ` prctl(PR_SET_MM) Steven Rostedt
  0 siblings, 1 reply; 62+ messages in thread
From: Amnon Shiloh @ 2013-02-18 16:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: u3557, Oleg Nesterov, Pedro Alves, Denys Vlasenko,
	Jan Kratochvil, Cyrill Gorcunov, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Steven Rostedt wrote:

> On Mon, 2013-02-18 at 12:39 +1100, Amnon Shiloh wrote:
> > Hello,
> > 
> > The code in "kernel/sys.c" provides the "prctl(PR_SET_MM)" function,
> > which is the only way a process can set or modify the following 11
> > per-process fields:
> > 
> >  	start_code, end_code, start_data, end_data, start_brk, brk,
> >  	start_stack, arg_start, arg_end, env_start, env_end.
> > 
> > Being able to set those fields is important, even crucial,
> > for any conceivable user-level checkpointing software, as
> > well as for migrating processes between different computers.
> 
> You're saying that this is useful for code not needing a kernel with
> CHECKPOINT_RESTORE enabled. Correct?

Correct, this is an important feature that is useful for a whole
general class of applications, not only those needing CHECKPOINT_RESTORE.

Had this not been done as part of the CHECKPOINT_RESTORE project, it
would have certainly been done, sooner or later, by some other developers:
it just so happened that the CHECKPOINT_RESTORE people were the first to
(publically) fill this gap, but in fact this code in "kernel/sys.c" should
be general kernel code, not part of CHECKPOINT_RESTORE.

> 
> > 
> > Unfortunately, this code (essentially "prctl_set_mm()") is presently
> > enclosed in "#ifdef CONFIG_CHECKPOINT_RESTORE" which is configured
> > as "default n" in "init/Kconfig".  Many system-administrators who
> > may like to have a checkpoint/restore or process-migration facility,
> > but use standard pre-packaged kernels, find the requirement to
> > configure and compile their own non-standard kernel difficult or
> > too prohibitive.
> > 
> > Would it be possible to have this code enabled by default?
> > 
> > This could be done in one of 4 ways:
> > 1) Having CONFIG_CHECKPOINT_RESTORE enabled by default; or
> 
> Nope, that wont due. Kernel policy is to have things default n. Have an
> issue with a config, talk with the distribution you are dealing with.
> They set the policy of what configs get set for their kernels.

Yes, Randy Dunlap already raised this point, but I have no dealings with
any particular Linux distribution or the right connections to chase them
all, one by one - I develop generic software for the general Linux community,
that is intended to work distribution-independently.  Even if I had access
to all distributions, it may be hard to convince them to configure
CONFIG_CHECKPOINT_RESTORE as a whole since it contains so much other code.

BTW, Can anyone explain this policy of "have things default n"?
When I go over "init/Kconfig" or most other Kconfig's, I can
actually see lots of "default y".

> 
> > 2) Releasing this code from the "#ifdef CONFIG_CHECK_RESTORE"; or
> > 3) Placing this code within a different kernel-configuration option
> >    (say "CONFIG_BASIC_CHECKPOINTING") that is enabled by default; or
> > 4) Placing this code under a dual #if, so instead of:
> >    #ifdef CONFIG_CHECKPOINT_RESTORE
> > 	   have:
> >    #if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_BASIC_CHECKPOINTING)
> 
> One of the above 3 can probably be worked out.
> 
> -- Steve

Great!

Naturally I prefer option 2 (but the other two will do as well).

Thanks,
Amnon.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-18 16:33                                                   ` prctl(PR_SET_MM) Amnon Shiloh
@ 2013-02-18 19:49                                                     ` Steven Rostedt
  2013-02-19  6:25                                                       ` prctl(PR_SET_MM) Amnon Shiloh
  0 siblings, 1 reply; 62+ messages in thread
From: Steven Rostedt @ 2013-02-18 19:49 UTC (permalink / raw)
  To: u3557
  Cc: Oleg Nesterov, Pedro Alves, Denys Vlasenko, Jan Kratochvil,
	Cyrill Gorcunov, Pavel Emelyanov, Frederic Weisbecker,
	Ingo Molnar, Peter Zijlstra, linux-kernel

On Tue, 2013-02-19 at 03:33 +1100, Amnon Shiloh wrote:

> Yes, Randy Dunlap already raised this point, but I have no dealings with
> any particular Linux distribution or the right connections to chase them
> all, one by one - I develop generic software for the general Linux community,
> that is intended to work distribution-independently.  Even if I had access
> to all distributions, it may be hard to convince them to configure
> CONFIG_CHECKPOINT_RESTORE as a whole since it contains so much other code.

If only you, or a few people are using it (ie. distros don't see a
need), then it will be up to you to make the changes.

> 
> BTW, Can anyone explain this policy of "have things default n"?
> When I go over "init/Kconfig" or most other Kconfig's, I can
> actually see lots of "default y".

Linus stated that he doesn't want any new feature default on, unless it
gives better performance to the general populace. Because someone's old
config should not add new features when they use it for a new kernel.
Basically, if something is default on, then it probably shouldn't have a
config for it.

Some are on because they were always on (although several were changed
to 'n'). Also it's fine to have a default y that depends on something
that is default n, if it makes sense.

For example, the function graph tracer is default y, but requires the
function tracer to be enabled which is default n. This is because I
assumed that you would want the function graph tracer if you enabled
function tracing, as the graph tracer gives much more information.

> 
> > 
> > > 2) Releasing this code from the "#ifdef CONFIG_CHECK_RESTORE"; or
> > > 3) Placing this code within a different kernel-configuration option
> > >    (say "CONFIG_BASIC_CHECKPOINTING") that is enabled by default; or
> > > 4) Placing this code under a dual #if, so instead of:
> > >    #ifdef CONFIG_CHECKPOINT_RESTORE
> > > 	   have:
> > >    #if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_BASIC_CHECKPOINTING)
> > 
> > One of the above 3 can probably be worked out.
> > 
> > -- Steve
> 
> Great!
> 
> Naturally I prefer option 2 (but the other two will do as well).

I have no problems with this out of line.

-- Steve



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-18 19:49                                                     ` prctl(PR_SET_MM) Steven Rostedt
@ 2013-02-19  6:25                                                       ` Amnon Shiloh
  2013-02-20  8:39                                                         ` prctl(PR_SET_MM) Cyrill Gorcunov
  2013-02-22 14:23                                                         ` prctl(PR_SET_MM) Denys Vlasenko
  0 siblings, 2 replies; 62+ messages in thread
From: Amnon Shiloh @ 2013-02-19  6:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: u3557, Oleg Nesterov, Pedro Alves, Denys Vlasenko,
	Jan Kratochvil, Cyrill Gorcunov, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Steven Rostedt wrote:

> If only you, or a few people are using it (ie. distros don't see a
> need), then it will be up to you to make the changes.

I believe that this functionality is of a general nature and is needed
by many, not only by myself and by the CRIU group, but by all user-level
software packages, past present and future, that provide some form or
another of reconstructing a Linux process.

For example:
1) Checkpoint-restore
   - the ability to resume a computation after computer crashes.
2) Process-migration
   - the ability to have a running computation change computers.
     (for numerous reasons, including planned shutdown; overheating;
     load-balancing; increased memory demands; the original computer
     being required by higher-priority users, etc. etc.)
3) Process-duplication
   - the ability to have a running computation continue in parallel from
     a point in time on two or more computers, so that if one of them
     fails, at least one copy will still be running (in time-critical
     missions, a periodic checkpointing may not be sufficient).

To reconstruct Linux process(es), one must be able to restore all their
memory contents, states and registers to their original and consistent
values.    

Other than the code currently enclosed in "CONFIG_CHECKPOINT_RESTORE",
nothing else in the kernel provides the ability to set those 11 fields:

     start_code, end_code, start_data, end_data, start_brk, brk,
     start_stack, arg_start, arg_end, env_start, env_end.

There are many possible ways to implement this functionality and if you
believe that it would be a good idea, I could instead send a simple,
independent kernel patch that sets those 11 fields.  It would be shorter
than the current and perhaps implemented as a new "ptrace" function,
but given that the CRIU group already wrote this code and that it is
already in the kernel, I am concerned that trying to duplicate it might
interfere with their work, so I first ask for the relevant code in
"kernel/sys.c" to no longer be enclosed by an #ifdef.  What do you think?

Best Regards,
Amnon.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-19  6:25                                                       ` prctl(PR_SET_MM) Amnon Shiloh
@ 2013-02-20  8:39                                                         ` Cyrill Gorcunov
  2013-02-20  9:38                                                           ` prctl(PR_SET_MM) Amnon Shiloh
  2013-02-22 14:23                                                         ` prctl(PR_SET_MM) Denys Vlasenko
  1 sibling, 1 reply; 62+ messages in thread
From: Cyrill Gorcunov @ 2013-02-20  8:39 UTC (permalink / raw)
  To: Amnon Shiloh
  Cc: Steven Rostedt, u3557, Oleg Nesterov, Pedro Alves,
	Denys Vlasenko, Jan Kratochvil, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

On Tue, Feb 19, 2013 at 05:25:31PM +1100, Amnon Shiloh wrote:
> 
> To reconstruct Linux process(es), one must be able to restore all their
> memory contents, states and registers to their original and consistent
> values.

I personally don't mind if this come become y by default, because it will
work for us. Still I guess if you need to reconstruct Linux process(es)
plain prctl extension is not enough, you still need a functionality which
is enclosed in config-checkpoint-restore (say /proc/pid/map_files, kcmp),
no?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-20  8:39                                                         ` prctl(PR_SET_MM) Cyrill Gorcunov
@ 2013-02-20  9:38                                                           ` Amnon Shiloh
  2013-02-20 10:51                                                             ` prctl(PR_SET_MM) Cyrill Gorcunov
  0 siblings, 1 reply; 62+ messages in thread
From: Amnon Shiloh @ 2013-02-20  9:38 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Steven Rostedt, u3557, Oleg Nesterov, Pedro Alves,
	Denys Vlasenko, Jan Kratochvil, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Hi Cyrill,

Cyrill Gorcunov wrote:

> On Tue, Feb 19, 2013 at 05:25:31PM +1100, Amnon Shiloh wrote:
> > 
> > To reconstruct Linux process(es), one must be able to restore all their
> > memory contents, states and registers to their original and consistent
> > values.
> 
> I personally don't mind if this come become y by default, because it will
> work for us.

I don't mind either, to say the least, to have CONFIG_CHECKPOINT_RESTORE
have a "default y" in "init/Kconfig" - that would solve my problem and make
me happy, as well as your group and others, but we are told here that Linus
has a policy vetoing such changes and I don't believe either of us can make
him change his mind.

As such, we must look at other options, such as having the code in
"kernel/sys.c" out in the open, not enclosed by any #ifdef's altogether
- surely you would like that!

> Still I guess if you need to reconstruct Linux process(es)
> plain prctl extension is not enough, you still need a functionality which
> is enclosed in config-checkpoint-restore (say /proc/pid/map_files, kcmp),
> no?

My process-migration package only reconstructs Linux processes partially:
as it's a bit different than any classical checkpoint-restore, the critical
code in "kernel/sys.c" is all I need at the moment (from within
CONFIG_CHECKPOINT_RESTORE).  I do appreciate that anyone attempting to
perform complete, classical, checkpoint-restore operations, needs more
than that.

Best Regards,
Amnon.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-20  9:38                                                           ` prctl(PR_SET_MM) Amnon Shiloh
@ 2013-02-20 10:51                                                             ` Cyrill Gorcunov
  2013-02-20 11:16                                                               ` prctl(PR_SET_MM) Amnon Shiloh
  2013-02-21  7:46                                                               ` prctl(PR_SET_MM) Amnon Shiloh
  0 siblings, 2 replies; 62+ messages in thread
From: Cyrill Gorcunov @ 2013-02-20 10:51 UTC (permalink / raw)
  To: Amnon Shiloh
  Cc: Steven Rostedt, u3557, Oleg Nesterov, Pedro Alves,
	Denys Vlasenko, Jan Kratochvil, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

On Wed, Feb 20, 2013 at 08:38:01PM +1100, Amnon Shiloh wrote:
> > I personally don't mind if this come become y by default, because it will
> > work for us.
> 
> I don't mind either, to say the least, to have CONFIG_CHECKPOINT_RESTORE
> have a "default y" in "init/Kconfig" - that would solve my problem and make
> me happy, as well as your group and others, but we are told here that Linus
> has a policy vetoing such changes and I don't believe either of us can make
> him change his mind.

That's perfectly fine for all new features :-)

> As such, we must look at other options, such as having the code in
> "kernel/sys.c" out in the open, not enclosed by any #ifdef's altogether
> - surely you would like that!
> 
> > Still I guess if you need to reconstruct Linux process(es)
> > plain prctl extension is not enough, you still need a functionality which
> > is enclosed in config-checkpoint-restore (say /proc/pid/map_files, kcmp),
> > no?
> 
> My process-migration package only reconstructs Linux processes partially:
> as it's a bit different than any classical checkpoint-restore, the critical
> code in "kernel/sys.c" is all I need at the moment (from within
> CONFIG_CHECKPOINT_RESTORE).  I do appreciate that anyone attempting to
> perform complete, classical, checkpoint-restore operations, needs more
> than that.

I see. Thanks for explanation! Thus we need some new config option which would
enable this prctl opcodes (y by default), in turn config-checkpoint-restore
kconfig option need to select this feature if set. Sounds reasonable?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-20 10:51                                                             ` prctl(PR_SET_MM) Cyrill Gorcunov
@ 2013-02-20 11:16                                                               ` Amnon Shiloh
  2013-02-21  7:46                                                               ` prctl(PR_SET_MM) Amnon Shiloh
  1 sibling, 0 replies; 62+ messages in thread
From: Amnon Shiloh @ 2013-02-20 11:16 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Steven Rostedt, u3557, Oleg Nesterov, Pedro Alves,
	Denys Vlasenko, Jan Kratochvil, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Hi Cyrill,

Cyrill Gorcunov wrote:

> I see. Thanks for explanation! Thus we need some new config option which would
> enable this prctl opcodes (y by default), in turn config-checkpoint-restore
> kconfig option need to select this feature if set. Sounds reasonable?

Yes, This would be one possible way to do it.
Another possibility is to not have an #ifdef at all around this code.
Another possibility is to have a dual #if:

#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_MM_FIELDS_SETTING)

Best Regards,
Amnon.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-20 10:51                                                             ` prctl(PR_SET_MM) Cyrill Gorcunov
  2013-02-20 11:16                                                               ` prctl(PR_SET_MM) Amnon Shiloh
@ 2013-02-21  7:46                                                               ` Amnon Shiloh
  2013-02-21  8:00                                                                 ` prctl(PR_SET_MM) Cyrill Gorcunov
  1 sibling, 1 reply; 62+ messages in thread
From: Amnon Shiloh @ 2013-02-21  7:46 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Steven Rostedt, u3557, Oleg Nesterov, Pedro Alves,
	Denys Vlasenko, Jan Kratochvil, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel,
	Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 368 bytes --]

Cyrill Gorcunov wrote:

>> Another possibility is to have a dual #if:
>>
>> #if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_MM_FIELDS_SETTING)
>
> Thus this approach looks preferred. And MM_FIELDS_SETTING will be y by default.
> Mind to cook a patch and lets see if community accept it? Don't forget to
> CC Andrew Morton.

Very well, patch attached.

Amnon.

[-- Attachment #2: unified diff output, ASCII text --]
[-- Type: text/plain, Size: 2040 bytes --]

diff -Naur before/init/Kconfig after/init/Kconfig
--- before/init/Kconfig	2013-02-19 10:28:34.000000000 +1030
+++ after/init/Kconfig	2013-02-21 18:03:48.000000000 +1030
@@ -999,6 +999,22 @@
 
 	  If unsure, say N here.
 
+config MM_FIELDS_SETTING
+	bool "Allow modifying per-process memory-region fields"
+	default y
+	help
+	   Support "prctl(PR_SET_MM)" which allows applications to modify
+	   the following in their "mm_struct":
+
+	      start_code, end_code, start_data, end_data, start_brk, brk,
+	      start_stack, arg_start, arg_end, env_start, env_end.
+
+	    Also to modify their executable file ("/proc/self/exe").
+
+	    This option is needed for reconstructing processes (such as when
+	    restoring a process from a checkpoint; duplicating a process;
+	    or migrating it to another computer).
+
 menuconfig NAMESPACES
 	bool "Namespaces support" if EXPERT
 	default !EXPERT
diff -Naur before/kernel/sys.c after/kernel/sys.c
--- before/kernel/sys.c	2013-02-19 10:28:34.000000000 +1030
+++ after/kernel/sys.c	2013-02-21 17:19:10.000000000 +1030
@@ -1788,7 +1788,7 @@
 	return mask;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_MM_FIELDS_SETTING)
 static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 {
 	struct fd exe;
@@ -1981,18 +1981,22 @@
 	up_read(&mm->mmap_sem);
 	return error;
 }
+#else /* CONFIG_CHECKPOINT_RESTORE || CONFIG_MM_FIELDS_SETTING */
 
-static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
-{
-	return put_user(me->clear_child_tid, tid_addr);
-}
-
-#else /* CONFIG_CHECKPOINT_RESTORE */
 static int prctl_set_mm(int opt, unsigned long addr,
 			unsigned long arg4, unsigned long arg5)
 {
 	return -EINVAL;
 }
+#endif
+
+#ifdef CONFIG_CHECKPOINT_RESTORE
+static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
+{
+	return put_user(me->clear_child_tid, tid_addr);
+}
+
+#else
 static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 {
 	return -EINVAL;

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-21  7:46                                                               ` prctl(PR_SET_MM) Amnon Shiloh
@ 2013-02-21  8:00                                                                 ` Cyrill Gorcunov
  2013-02-21  8:03                                                                   ` prctl(PR_SET_MM) Amnon Shiloh
  2013-02-21 22:18                                                                   ` prctl(PR_SET_MM) Andrew Morton
  0 siblings, 2 replies; 62+ messages in thread
From: Cyrill Gorcunov @ 2013-02-21  8:00 UTC (permalink / raw)
  To: Amnon Shiloh
  Cc: Steven Rostedt, u3557, Oleg Nesterov, Pedro Alves,
	Denys Vlasenko, Jan Kratochvil, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel,
	Andrew Morton

On Thu, Feb 21, 2013 at 06:46:39PM +1100, Amnon Shiloh wrote:
> Cyrill Gorcunov wrote:
> 
> >> Another possibility is to have a dual #if:
> >>
> >> #if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_MM_FIELDS_SETTING)
> >
> > Thus this approach looks preferred. And MM_FIELDS_SETTING will be y by default.
> > Mind to cook a patch and lets see if community accept it? Don't forget to
> > CC Andrew Morton.
> 
> Very well, patch attached.

Wouldn't the below do the same trick but eliminate OR in preproc code?
---
From: Amnon Shiloh <u3557@miso.sublimeip.com>
Subject: prctl: Make PR_SET_MM being depend on own CONFIG_MM_FIELDS_SETTING

...

Signed-off-by: Amnon Shiloh <u3557@miso.sublimeip.com>
---
 init/Kconfig |   17 +++++++++++++++++
 kernel/sys.c |   16 ++++++++--------
 2 files changed, 25 insertions(+), 8 deletions(-)

Index: linux-2.6.git/init/Kconfig
===================================================================
--- linux-2.6.git.orig/init/Kconfig
+++ linux-2.6.git/init/Kconfig
@@ -991,6 +991,7 @@ endif # CGROUPS
 config CHECKPOINT_RESTORE
 	bool "Checkpoint/restore support" if EXPERT
 	default n
+	select MM_FIELDS_SETTING
 	help
 	  Enables additional kernel features in a sake of checkpoint/restore.
 	  In particular it adds auxiliary prctl codes to setup process text,
@@ -999,6 +1000,22 @@ config CHECKPOINT_RESTORE
 
 	  If unsure, say N here.
 
+config MM_FIELDS_SETTING
+	bool "Allow modifying per-process memory-region fields"
+	default y
+	help
+	   Support "prctl(PR_SET_MM)" which allows applications to modify
+	   the following in their "mm_struct":
+
+	      start_code, end_code, start_data, end_data, start_brk, brk,
+	      start_stack, arg_start, arg_end, env_start, env_end.
+
+	    Also to modify their executable file ("/proc/self/exe").
+
+	    This option is needed for reconstructing processes (such as when
+	    restoring a process from a checkpoint; duplicating a process;
+	    or migrating it to another computer).
+
 menuconfig NAMESPACES
 	bool "Namespaces support" if EXPERT
 	default !EXPERT
Index: linux-2.6.git/kernel/sys.c
===================================================================
--- linux-2.6.git.orig/kernel/sys.c
+++ linux-2.6.git/kernel/sys.c
@@ -1788,7 +1788,7 @@ SYSCALL_DEFINE1(umask, int, mask)
 	return mask;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
+#ifdef CONFIG_MM_FIELDS_SETTING
 static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 {
 	struct fd exe;
@@ -1981,23 +1981,23 @@ out:
 	up_read(&mm->mmap_sem);
 	return error;
 }
+#else /* CONFIG_MM_FIELDS_SETTING */
 
-static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
-{
-	return put_user(me->clear_child_tid, tid_addr);
-}
-
-#else /* CONFIG_CHECKPOINT_RESTORE */
 static int prctl_set_mm(int opt, unsigned long addr,
 			unsigned long arg4, unsigned long arg5)
 {
 	return -EINVAL;
 }
+#endif
+
 static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 {
+#ifdef CONFIG_CHECKPOINT_RESTORE
+	return put_user(me->clear_child_tid, tid_addr);
+#else
 	return -EINVAL;
-}
 #endif
+}
 
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-21  8:00                                                                 ` prctl(PR_SET_MM) Cyrill Gorcunov
@ 2013-02-21  8:03                                                                   ` Amnon Shiloh
  2013-02-21  8:09                                                                     ` prctl(PR_SET_MM) Cyrill Gorcunov
  2013-02-21 22:18                                                                   ` prctl(PR_SET_MM) Andrew Morton
  1 sibling, 1 reply; 62+ messages in thread
From: Amnon Shiloh @ 2013-02-21  8:03 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Steven Rostedt, u3557, Oleg Nesterov, Pedro Alves,
	Denys Vlasenko, Jan Kratochvil, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel,
	Andrew Morton

Hi,

Cyrill Gorcunov wrote:

> Wouldn't the below do the same trick but eliminate OR in preproc code?

Yes it would.  I don't mind having it either way.

Best Regards,
Amnon.

> ---
> From: Amnon Shiloh <u3557@miso.sublimeip.com>
> Subject: prctl: Make PR_SET_MM being depend on own CONFIG_MM_FIELDS_SETTING
> 
> ...
> 
> Signed-off-by: Amnon Shiloh <u3557@miso.sublimeip.com>
> ---
>  init/Kconfig |   17 +++++++++++++++++
>  kernel/sys.c |   16 ++++++++--------
>  2 files changed, 25 insertions(+), 8 deletions(-)
> 
> Index: linux-2.6.git/init/Kconfig
> ===================================================================
> --- linux-2.6.git.orig/init/Kconfig
> +++ linux-2.6.git/init/Kconfig
> @@ -991,6 +991,7 @@ endif # CGROUPS
>  config CHECKPOINT_RESTORE
>  	bool "Checkpoint/restore support" if EXPERT
>  	default n
> +	select MM_FIELDS_SETTING
>  	help
>  	  Enables additional kernel features in a sake of checkpoint/restore.
>  	  In particular it adds auxiliary prctl codes to setup process text,
> @@ -999,6 +1000,22 @@ config CHECKPOINT_RESTORE
>  
>  	  If unsure, say N here.
>  
> +config MM_FIELDS_SETTING
> +	bool "Allow modifying per-process memory-region fields"
> +	default y
> +	help
> +	   Support "prctl(PR_SET_MM)" which allows applications to modify
> +	   the following in their "mm_struct":
> +
> +	      start_code, end_code, start_data, end_data, start_brk, brk,
> +	      start_stack, arg_start, arg_end, env_start, env_end.
> +
> +	    Also to modify their executable file ("/proc/self/exe").
> +
> +	    This option is needed for reconstructing processes (such as when
> +	    restoring a process from a checkpoint; duplicating a process;
> +	    or migrating it to another computer).
> +
>  menuconfig NAMESPACES
>  	bool "Namespaces support" if EXPERT
>  	default !EXPERT
> Index: linux-2.6.git/kernel/sys.c
> ===================================================================
> --- linux-2.6.git.orig/kernel/sys.c
> +++ linux-2.6.git/kernel/sys.c
> @@ -1788,7 +1788,7 @@ SYSCALL_DEFINE1(umask, int, mask)
>  	return mask;
>  }
>  
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> +#ifdef CONFIG_MM_FIELDS_SETTING
>  static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
>  {
>  	struct fd exe;
> @@ -1981,23 +1981,23 @@ out:
>  	up_read(&mm->mmap_sem);
>  	return error;
>  }
> +#else /* CONFIG_MM_FIELDS_SETTING */
>  
> -static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
> -{
> -	return put_user(me->clear_child_tid, tid_addr);
> -}
> -
> -#else /* CONFIG_CHECKPOINT_RESTORE */
>  static int prctl_set_mm(int opt, unsigned long addr,
>  			unsigned long arg4, unsigned long arg5)
>  {
>  	return -EINVAL;
>  }
> +#endif
> +
>  static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
>  {
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +	return put_user(me->clear_child_tid, tid_addr);
> +#else
>  	return -EINVAL;
> -}
>  #endif
> +}
>  
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-21  8:03                                                                   ` prctl(PR_SET_MM) Amnon Shiloh
@ 2013-02-21  8:09                                                                     ` Cyrill Gorcunov
  0 siblings, 0 replies; 62+ messages in thread
From: Cyrill Gorcunov @ 2013-02-21  8:09 UTC (permalink / raw)
  To: Amnon Shiloh
  Cc: Steven Rostedt, u3557, Oleg Nesterov, Pedro Alves,
	Denys Vlasenko, Jan Kratochvil, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel,
	Andrew Morton

On Thu, Feb 21, 2013 at 07:03:58PM +1100, Amnon Shiloh wrote:
> Hi,
> 
> Cyrill Gorcunov wrote:
> 
> > Wouldn't the below do the same trick but eliminate OR in preproc code?
> 
> Yes it would.  I don't mind having it either way.

OK, lets wait for opinions and see if this approach is acceptable.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-21  8:00                                                                 ` prctl(PR_SET_MM) Cyrill Gorcunov
  2013-02-21  8:03                                                                   ` prctl(PR_SET_MM) Amnon Shiloh
@ 2013-02-21 22:18                                                                   ` Andrew Morton
  2013-02-21 22:42                                                                     ` prctl(PR_SET_MM) Cyrill Gorcunov
  2013-02-22  1:18                                                                     ` prctl(PR_SET_MM) Amnon Shiloh
  1 sibling, 2 replies; 62+ messages in thread
From: Andrew Morton @ 2013-02-21 22:18 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Amnon Shiloh, Steven Rostedt, u3557, Oleg Nesterov, Pedro Alves,
	Denys Vlasenko, Jan Kratochvil, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

On Thu, 21 Feb 2013 12:00:28 +0400
Cyrill Gorcunov <gorcunov@openvz.org> wrote:

> From: Amnon Shiloh <u3557@miso.sublimeip.com>
> Subject: prctl: Make PR_SET_MM being depend on own CONFIG_MM_FIELDS_SETTING
> 
> ...
> 
> Signed-off-by: Amnon Shiloh <u3557@miso.sublimeip.com>

The "..." makes me sad.

If/when this patch gets sent for real, please explain the reasons? 
AFAICT it permits the enabling of prctl(PR_SET_MM) in
CONFIG_CHECKPOINT_RESTORE=n kernels.  If that was indeed the intent,
why?


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-21 22:18                                                                   ` prctl(PR_SET_MM) Andrew Morton
@ 2013-02-21 22:42                                                                     ` Cyrill Gorcunov
  2013-02-22  1:18                                                                     ` prctl(PR_SET_MM) Amnon Shiloh
  1 sibling, 0 replies; 62+ messages in thread
From: Cyrill Gorcunov @ 2013-02-21 22:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Amnon Shiloh, Steven Rostedt, u3557, Oleg Nesterov, Pedro Alves,
	Denys Vlasenko, Jan Kratochvil, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

On Thu, Feb 21, 2013 at 02:18:41PM -0800, Andrew Morton wrote:
> On Thu, 21 Feb 2013 12:00:28 +0400
> Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> 
> > From: Amnon Shiloh <u3557@miso.sublimeip.com>
> > Subject: prctl: Make PR_SET_MM being depend on own CONFIG_MM_FIELDS_SETTING
> > 
> > ...
> > 
> > Signed-off-by: Amnon Shiloh <u3557@miso.sublimeip.com>
> 
> The "..." makes me sad.
> 
> If/when this patch gets sent for real, please explain the reasons? 
> AFAICT it permits the enabling of prctl(PR_SET_MM) in
> CONFIG_CHECKPOINT_RESTORE=n kernels.  If that was indeed the intent,
> why?

Sorry for this "...", it was a draft version for Amnon, not for inclusion.
As far as I understand Amnon needs these prctl opcodes to be enabled by
default (but still turnable off in Kconfig if needed) for his minimal
c/r software, he do not need the whole c/r functionality (procfs map-files,
get-tid-address,kcmp and such). That is the idea if I understand correctly.

Quoting Amnon
|
| Correct, this is an important feature that is useful for a whole 
| general class of applications, not only those needing CHECKPOINT_RESTORE. 
|
| Had this not been done as part of the CHECKPOINT_RESTORE project, it 
| would have certainly been done, sooner or later, by some other developers: 
| it just so happened that the CHECKPOINT_RESTORE people were the first to 
| (publically) fill this gap, but in fact this code in "kernel/sys.c" should 
| be general kernel code, not part of CHECKPOINT_RESTORE.
|

I personally don't mind if this code become y by default (it requires
cap-sys-resource capability granted anyway), but for normal c/r this
prctl opcodes only is not enough and CHECKPOINT_RESTORE should be set.
Thus, if people agree with enabling prctl extension by default I certainly
won't object.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-21 22:18                                                                   ` prctl(PR_SET_MM) Andrew Morton
  2013-02-21 22:42                                                                     ` prctl(PR_SET_MM) Cyrill Gorcunov
@ 2013-02-22  1:18                                                                     ` Amnon Shiloh
  1 sibling, 0 replies; 62+ messages in thread
From: Amnon Shiloh @ 2013-02-22  1:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, Steven Rostedt, u3557, Oleg Nesterov,
	Pedro Alves, Denys Vlasenko, Jan Kratochvil, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

Dear Andrew,

The code in "kernel/sys.c" that is currently within
CONFIG_CHECKPOINT_RESTORE is in fact, as I explain below,
one possible solution to a general issue, required by a wide
class of applications.  It just so happened that the CRIU group
were the first to place this, or an equivalent code, in the kernel,
that allows a privileged process to set its 11 per-process memory-region
fields:
     start_code, end_code, start_data, end_data, start_brk, brk,
     start_stack, arg_start, arg_end, env_start, env_end.


Contrary to the rest of the CHECKPOINT_RESTORE code, which is specific
to the CRIU package, the code in "kernel/sys.c" (or its equivalent) is
needed by ANY application or package that needs to reconstruct Linux
processes, that means, starting them from the middle rather than from
an executable file.

That includes user-level checkpointing (any, not just CRIU's),
process-migration (to other computers, as my own package does)
and process duplication (for high-availability/reliability) -
in fact even for starting a process from an executable format
that is not supported by Linux, thus requiring a "manual execve"
by a user-level utility.

My first preference is to remove that "#ifdef CONFIG_CHECKPOINT_RESTORE"
altogether.  Note that there are no security issues because this code
is already restricted to "capable(CAP_SYS_RESOURCE)".
Short of that is the proposed patch.

Best Regards,
Amnon.


Andrew Morton wrote:
> 
> On Thu, 21 Feb 2013 12:00:28 +0400
> Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> 
> > From: Amnon Shiloh <u3557@miso.sublimeip.com>
> > Subject: prctl: Make PR_SET_MM being depend on own CONFIG_MM_FIELDS_SETTING
> > 
> > ...
> > 
> > Signed-off-by: Amnon Shiloh <u3557@miso.sublimeip.com>
> 
> The "..." makes me sad.
> 
> If/when this patch gets sent for real, please explain the reasons? 
> AFAICT it permits the enabling of prctl(PR_SET_MM) in
> CONFIG_CHECKPOINT_RESTORE=n kernels.  If that was indeed the intent,
> why?
> 
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
  2013-02-19  6:25                                                       ` prctl(PR_SET_MM) Amnon Shiloh
  2013-02-20  8:39                                                         ` prctl(PR_SET_MM) Cyrill Gorcunov
@ 2013-02-22 14:23                                                         ` Denys Vlasenko
  1 sibling, 0 replies; 62+ messages in thread
From: Denys Vlasenko @ 2013-02-22 14:23 UTC (permalink / raw)
  To: u3557
  Cc: Amnon Shiloh, Steven Rostedt, Oleg Nesterov, Pedro Alves,
	Jan Kratochvil, Cyrill Gorcunov, Pavel Emelyanov,
	Frederic Weisbecker, Ingo Molnar, Peter Zijlstra, linux-kernel

On 02/19/2013 07:25 AM, Amnon Shiloh wrote:
> Steven Rostedt wrote:
> 
>> If only you, or a few people are using it (ie. distros don't see a
>> need), then it will be up to you to make the changes.
> 
> I believe that this functionality is of a general nature and is needed
> by many, not only by myself and by the CRIU group, but by all user-level
> software packages, past present and future, that provide some form or
> another of reconstructing a Linux process.

That's what "RESTORE" part of CONFIG_CHECKPOINT_RESTORE refers to:
the ability to restore (reconstruct) a process.

If you want to be able to restore a process, you need RESTORE
feature. It's that simple.

Why do you want yet another config option for it?
What's the problem if you simply enable CONFIG_CHECKPOINT_RESTORE?

The only problem I can imagine is "CONFIG_CHECKPOINT_RESTORE
enables too many things I don't need".

Frankly, I find it not very likely, unless you are planning
on working on resource-constrained machines (like mobile phone).

-- 
vda



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
       [not found] <20130222142603.987c6e3c.akpm@linux-foundation.org>
  2013-02-24  6:24 ` prctl(PR_SET_MM) Amnon Shiloh
@ 2013-02-24  6:28 ` Amnon Shiloh
  1 sibling, 0 replies; 62+ messages in thread
From: Amnon Shiloh @ 2013-02-24  6:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: rostedt, oleg, palves, oleg, palves, dvlasenk, jan.kratochvil,
	xemul, fweisbec, mingo, a.p.zijlstra, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3560 bytes --]

Dear Andrew,

Andrew Morton <akpm@linux-foundation.org> Wrote:
> Well OK.  Put all that on top of a patch, add suitable signoffs and
> cc's and send it along?

The purpose of this patch is to allow privileged processes to set
their own per-memory memory-region fields:

      start_code, end_code, start_data, end_data, start_brk, brk,
      start_stack, arg_start, arg_end, env_start, env_end.

This functionality is needed by any application or package that
needs to reconstruct Linux processes, that is, to start them in
any way other than by means of an "execve()" from an executable
file.  This includes:

1. Restoring processes from a checkpoint-file (by all potential
   user-level checkpointing packages, not only CRIU's).
2. Restarting processes on another node after process migration.
3. Starting duplicated copies of a running process (for reliability
   and high-availablity).
4. Starting a process from an executable format that is not supported
   by Linux, thus requiring a "manual execve" by a user-level utility.
5. Similarly, starting a process from a networked and/or crypted
   executable that, for confidentiality, licensing or other reasons,
   may not be written to the local file-systems.

The code that does that was already included in the Linux kernel by
the CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this
was enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE",
which is normally disabled.

It was not clear from your answer, Andrew, whether you prefer to
remove the "#ifdef CONFIG_CHECKPOINT_RESTORE" altogether from the
said code, or to enclose it in a new configuration option that is
enabled by default.   I therefore attach two alternative patches
to choose from: the first removes the #ifdef altogether while the
second introduces a new option.

Signed-off-by: Amnon Shiloh.

Best Regards,
Amnon.


> On Fri, 22 Feb 2013 12:18:01 +1100 (EST)
> u3557@miso.sublimeip.com (Amnon Shiloh) wrote:
> 
> > The code in "kernel/sys.c" that is currently within
> > CONFIG_CHECKPOINT_RESTORE is in fact, as I explain below,
> > one possible solution to a general issue, required by a wide
> > class of applications.  It just so happened that the CRIU group
> > were the first to place this, or an equivalent code, in the kernel,
> > that allows a privileged process to set its 11 per-process memory-region
> > fields:
> >      start_code, end_code, start_data, end_data, start_brk, brk,
> >      start_stack, arg_start, arg_end, env_start, env_end.
> > 
> > 
> > Contrary to the rest of the CHECKPOINT_RESTORE code, which is specific
> > to the CRIU package, the code in "kernel/sys.c" (or its equivalent) is
> > needed by ANY application or package that needs to reconstruct Linux
> > processes, that means, starting them from the middle rather than from
> > an executable file.
> > 
> > That includes user-level checkpointing (any, not just CRIU's),
> > process-migration (to other computers, as my own package does)
> > and process duplication (for high-availability/reliability) -
> > in fact even for starting a process from an executable format
> > that is not supported by Linux, thus requiring a "manual execve"
> > by a user-level utility.
> > 
> > My first preference is to remove that "#ifdef CONFIG_CHECKPOINT_RESTORE"
> > altogether.  Note that there are no security issues because this code
> > is already restricted to "capable(CAP_SYS_RESOURCE)".
> > Short of that is the proposed patch.
> 
> Well OK.  Put all that on top of a patch, add suitable signoffs and
> cc's and send it along?
> 

[-- Attachment #2: unified diff output, ASCII text --]
[-- Type: text/plain, Size: 2270 bytes --]

diff -Naur linux-3.8/init/Kconfig option2/init/Kconfig
--- linux-3.8/init/Kconfig	2013-02-19 10:28:34.000000000 +1030
+++ option2/init/Kconfig	2013-02-24 13:57:02.000000000 +1030
@@ -991,6 +991,7 @@
 config CHECKPOINT_RESTORE
 	bool "Checkpoint/restore support" if EXPERT
 	default n
+	select MM_FIELDS_SETTING
 	help
 	  Enables additional kernel features in a sake of checkpoint/restore.
 	  In particular it adds auxiliary prctl codes to setup process text,
@@ -999,6 +1000,22 @@
 
 	  If unsure, say N here.
 
+config MM_FIELDS_SETTING
+	bool "Allow modifying per-process memory-region fields"
+	default y
+	help
+	   Support "prctl(PR_SET_MM)" which allows applications to modify
+	   the following in their "mm_struct":
+
+	      start_code, end_code, start_data, end_data, start_brk, brk,
+	      start_stack, arg_start, arg_end, env_start, env_end.
+
+	    Also to modify their executable file ("/proc/self/exe").
+
+	    This option is needed for reconstructing processes (such as when
+	    restoring a process from a checkpoint; duplicating a process;
+	    or migrating it to another computer).
+
 menuconfig NAMESPACES
 	bool "Namespaces support" if EXPERT
 	default !EXPERT
diff -Naur linux-3.8/kernel/sys.c option2/kernel/sys.c
--- linux-3.8/kernel/sys.c	2013-02-19 10:28:34.000000000 +1030
+++ option2/kernel/sys.c	2013-02-24 10:37:08.000000000 +1030
@@ -1788,7 +1788,7 @@
 	return mask;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
+#ifdef CONFIG_MM_FIELDS_SETTING
 static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 {
 	struct fd exe;
@@ -1981,18 +1981,22 @@
 	up_read(&mm->mmap_sem);
 	return error;
 }
+#else /* CONFIG_MM_FIELDS_SETTING */
 
-static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
-{
-	return put_user(me->clear_child_tid, tid_addr);
-}
-
-#else /* CONFIG_CHECKPOINT_RESTORE */
 static int prctl_set_mm(int opt, unsigned long addr,
 			unsigned long arg4, unsigned long arg5)
 {
 	return -EINVAL;
 }
+#endif
+
+#ifdef CONFIG_CHECKPOINT_RESTORE
+static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
+{
+	return put_user(me->clear_child_tid, tid_addr);
+}
+
+#else
 static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 {
 	return -EINVAL;

[-- Attachment #3: unified diff output, ASCII text --]
[-- Type: text/plain, Size: 836 bytes --]

diff -Naur linux-3.8/kernel/sys.c option1/kernel/sys.c
--- linux-3.8/kernel/sys.c	2013-02-19 10:28:34.000000000 +1030
+++ option1/kernel/sys.c	2013-02-24 10:47:45.000000000 +1030
@@ -1788,7 +1788,6 @@
 	return mask;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
 static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 {
 	struct fd exe;
@@ -1982,17 +1981,12 @@
 	return error;
 }
 
+#ifdef CONFIG_CHECKPOINT_RESTORE
 static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 {
 	return put_user(me->clear_child_tid, tid_addr);
 }
-
-#else /* CONFIG_CHECKPOINT_RESTORE */
-static int prctl_set_mm(int opt, unsigned long addr,
-			unsigned long arg4, unsigned long arg5)
-{
-	return -EINVAL;
-}
+#else
 static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 {
 	return -EINVAL;

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: prctl(PR_SET_MM)
       [not found] <20130222142603.987c6e3c.akpm@linux-foundation.org>
@ 2013-02-24  6:24 ` Amnon Shiloh
  2013-02-24  6:28 ` prctl(PR_SET_MM) Amnon Shiloh
  1 sibling, 0 replies; 62+ messages in thread
From: Amnon Shiloh @ 2013-02-24  6:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Steven, Rostedt, rostedt, Oleg, Nesterov, oleg, Pedro, Alves,
	palves, Denys, Vlasenko, dvlasenk, Jan, Kratochvil,
	jan.kratochvil, Pavel, Emelyanov, xemul, Frederic, Weisbecker,
	fweisbec, Ingo, Molnar, mingo, Peter, Zijlstra, a.p.zijlstra,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3560 bytes --]

Dear Andrew,

Andrew Morton <akpm@linux-foundation.org> Wrote:
> Well OK.  Put all that on top of a patch, add suitable signoffs and
> cc's and send it along?

The purpose of this patch is to allow privileged processes to set
their own per-memory memory-region fields:

      start_code, end_code, start_data, end_data, start_brk, brk,
      start_stack, arg_start, arg_end, env_start, env_end.

This functionality is needed by any application or package that
needs to reconstruct Linux processes, that is, to start them in
any way other than by means of an "execve()" from an executable
file.  This includes:

1. Restoring processes from a checkpoint-file (by all potential
   user-level checkpointing packages, not only CRIU's).
2. Restarting processes on another node after process migration.
3. Starting duplicated copies of a running process (for reliability
   and high-availablity).
4. Starting a process from an executable format that is not supported
   by Linux, thus requiring a "manual execve" by a user-level utility.
5. Similarly, starting a process from a networked and/or crypted
   executable that, for confidentiality, licensing or other reasons,
   may not be written to the local file-systems.

The code that does that was already included in the Linux kernel by
the CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this
was enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE",
which is normally disabled.

It was not clear from your answer, Andrew, whether you prefer to
remove the "#ifdef CONFIG_CHECKPOINT_RESTORE" altogether from the
said code, or to enclose it in a new configuration option that is
enabled by default.   I therefore attach two alternative patches
to choose from: the first removes the #ifdef altogether while the
second introduces a new option.

Signed-off-by: Amnon Shiloh.

Best Regards,
Amnon.


> On Fri, 22 Feb 2013 12:18:01 +1100 (EST)
> u3557@miso.sublimeip.com (Amnon Shiloh) wrote:
> 
> > The code in "kernel/sys.c" that is currently within
> > CONFIG_CHECKPOINT_RESTORE is in fact, as I explain below,
> > one possible solution to a general issue, required by a wide
> > class of applications.  It just so happened that the CRIU group
> > were the first to place this, or an equivalent code, in the kernel,
> > that allows a privileged process to set its 11 per-process memory-region
> > fields:
> >      start_code, end_code, start_data, end_data, start_brk, brk,
> >      start_stack, arg_start, arg_end, env_start, env_end.
> > 
> > 
> > Contrary to the rest of the CHECKPOINT_RESTORE code, which is specific
> > to the CRIU package, the code in "kernel/sys.c" (or its equivalent) is
> > needed by ANY application or package that needs to reconstruct Linux
> > processes, that means, starting them from the middle rather than from
> > an executable file.
> > 
> > That includes user-level checkpointing (any, not just CRIU's),
> > process-migration (to other computers, as my own package does)
> > and process duplication (for high-availability/reliability) -
> > in fact even for starting a process from an executable format
> > that is not supported by Linux, thus requiring a "manual execve"
> > by a user-level utility.
> > 
> > My first preference is to remove that "#ifdef CONFIG_CHECKPOINT_RESTORE"
> > altogether.  Note that there are no security issues because this code
> > is already restricted to "capable(CAP_SYS_RESOURCE)".
> > Short of that is the proposed patch.
> 
> Well OK.  Put all that on top of a patch, add suitable signoffs and
> cc's and send it along?
> 

[-- Attachment #2: unified diff output, ASCII text --]
[-- Type: text/plain, Size: 2270 bytes --]

diff -Naur linux-3.8/init/Kconfig option2/init/Kconfig
--- linux-3.8/init/Kconfig	2013-02-19 10:28:34.000000000 +1030
+++ option2/init/Kconfig	2013-02-24 13:57:02.000000000 +1030
@@ -991,6 +991,7 @@
 config CHECKPOINT_RESTORE
 	bool "Checkpoint/restore support" if EXPERT
 	default n
+	select MM_FIELDS_SETTING
 	help
 	  Enables additional kernel features in a sake of checkpoint/restore.
 	  In particular it adds auxiliary prctl codes to setup process text,
@@ -999,6 +1000,22 @@
 
 	  If unsure, say N here.
 
+config MM_FIELDS_SETTING
+	bool "Allow modifying per-process memory-region fields"
+	default y
+	help
+	   Support "prctl(PR_SET_MM)" which allows applications to modify
+	   the following in their "mm_struct":
+
+	      start_code, end_code, start_data, end_data, start_brk, brk,
+	      start_stack, arg_start, arg_end, env_start, env_end.
+
+	    Also to modify their executable file ("/proc/self/exe").
+
+	    This option is needed for reconstructing processes (such as when
+	    restoring a process from a checkpoint; duplicating a process;
+	    or migrating it to another computer).
+
 menuconfig NAMESPACES
 	bool "Namespaces support" if EXPERT
 	default !EXPERT
diff -Naur linux-3.8/kernel/sys.c option2/kernel/sys.c
--- linux-3.8/kernel/sys.c	2013-02-19 10:28:34.000000000 +1030
+++ option2/kernel/sys.c	2013-02-24 10:37:08.000000000 +1030
@@ -1788,7 +1788,7 @@
 	return mask;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
+#ifdef CONFIG_MM_FIELDS_SETTING
 static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 {
 	struct fd exe;
@@ -1981,18 +1981,22 @@
 	up_read(&mm->mmap_sem);
 	return error;
 }
+#else /* CONFIG_MM_FIELDS_SETTING */
 
-static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
-{
-	return put_user(me->clear_child_tid, tid_addr);
-}
-
-#else /* CONFIG_CHECKPOINT_RESTORE */
 static int prctl_set_mm(int opt, unsigned long addr,
 			unsigned long arg4, unsigned long arg5)
 {
 	return -EINVAL;
 }
+#endif
+
+#ifdef CONFIG_CHECKPOINT_RESTORE
+static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
+{
+	return put_user(me->clear_child_tid, tid_addr);
+}
+
+#else
 static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 {
 	return -EINVAL;

[-- Attachment #3: unified diff output, ASCII text --]
[-- Type: text/plain, Size: 836 bytes --]

diff -Naur linux-3.8/kernel/sys.c option1/kernel/sys.c
--- linux-3.8/kernel/sys.c	2013-02-19 10:28:34.000000000 +1030
+++ option1/kernel/sys.c	2013-02-24 10:47:45.000000000 +1030
@@ -1788,7 +1788,6 @@
 	return mask;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
 static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 {
 	struct fd exe;
@@ -1982,17 +1981,12 @@
 	return error;
 }
 
+#ifdef CONFIG_CHECKPOINT_RESTORE
 static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 {
 	return put_user(me->clear_child_tid, tid_addr);
 }
-
-#else /* CONFIG_CHECKPOINT_RESTORE */
-static int prctl_set_mm(int opt, unsigned long addr,
-			unsigned long arg4, unsigned long arg5)
-{
-	return -EINVAL;
-}
+#else
 static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
 {
 	return -EINVAL;

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2013-02-24  6:28 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-09 18:29 [PATCH] arch_check_bp_in_kernelspace: fix the range check Oleg Nesterov
2012-11-09 18:30 ` Oleg Nesterov
2012-11-19 17:47   ` Oleg Nesterov
2012-11-19 18:25     ` Steven Rostedt
2012-11-20 10:33       ` u3557
2012-11-20 15:48       ` Oleg Nesterov
2012-11-20 15:55         ` Steven Rostedt
2012-11-20 18:32         ` Oleg Nesterov
2012-11-20 23:16           ` u3557
2012-11-21 14:16             ` Oleg Nesterov
2012-11-21 17:30               ` Amnon Shiloh
2012-11-22 16:12                 ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
2012-11-22 20:57                   ` Pavel Emelyanov
2012-11-23  0:20                     ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range Amnon Shiloh
2012-11-23 17:45                       ` Oleg Nesterov
2012-11-24 12:47                         ` Amnon Shiloh
2012-11-23 17:42                     ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
2012-11-23  9:14                   ` arch_check_bp_in_kernelspace: fix the range check Amnon Shiloh
2012-11-23 16:33                     ` Oleg Nesterov
2012-11-23 17:05                       ` Oleg Nesterov
2012-11-24 14:14                         ` Amnon Shiloh
2012-11-24 13:45                       ` Amnon Shiloh
2012-11-25 22:55                         ` Oleg Nesterov
2012-11-25 23:48                           ` Amnon Shiloh
2012-12-02 19:30                             ` PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check) Oleg Nesterov
2012-12-02 23:54                               ` u3557
2012-12-04 17:59                                 ` Oleg Nesterov
2012-12-04 22:44                                   ` u3557
2013-01-08 17:08                                   ` Pedro Alves
2013-01-09 17:52                                     ` Oleg Nesterov
2013-01-10  6:54                                       ` u3557
2013-01-12 18:12                                         ` Oleg Nesterov
2013-01-14  2:31                                           ` u3557
2013-01-14 16:01                                             ` Oleg Nesterov
2013-02-18  1:39                                               ` prctl(PR_SET_MM) Amnon Shiloh
2013-02-18  5:44                                                 ` prctl(PR_SET_MM) Randy Dunlap
2013-02-18 15:21                                                 ` prctl(PR_SET_MM) Steven Rostedt
2013-02-18 16:33                                                   ` prctl(PR_SET_MM) Amnon Shiloh
2013-02-18 19:49                                                     ` prctl(PR_SET_MM) Steven Rostedt
2013-02-19  6:25                                                       ` prctl(PR_SET_MM) Amnon Shiloh
2013-02-20  8:39                                                         ` prctl(PR_SET_MM) Cyrill Gorcunov
2013-02-20  9:38                                                           ` prctl(PR_SET_MM) Amnon Shiloh
2013-02-20 10:51                                                             ` prctl(PR_SET_MM) Cyrill Gorcunov
2013-02-20 11:16                                                               ` prctl(PR_SET_MM) Amnon Shiloh
2013-02-21  7:46                                                               ` prctl(PR_SET_MM) Amnon Shiloh
2013-02-21  8:00                                                                 ` prctl(PR_SET_MM) Cyrill Gorcunov
2013-02-21  8:03                                                                   ` prctl(PR_SET_MM) Amnon Shiloh
2013-02-21  8:09                                                                     ` prctl(PR_SET_MM) Cyrill Gorcunov
2013-02-21 22:18                                                                   ` prctl(PR_SET_MM) Andrew Morton
2013-02-21 22:42                                                                     ` prctl(PR_SET_MM) Cyrill Gorcunov
2013-02-22  1:18                                                                     ` prctl(PR_SET_MM) Amnon Shiloh
2013-02-22 14:23                                                         ` prctl(PR_SET_MM) Denys Vlasenko
2012-12-05  9:29                               ` PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check) Jan Kratochvil
2012-12-05 13:14                                 ` u3557
2012-11-26  9:44                   ` vdso && cr " Cyrill Gorcunov
2012-11-26 12:27                     ` Andrey Wagin
2012-11-26 12:55                       ` Amnon Shiloh
2012-11-26 14:18                         ` Cyrill Gorcunov
2012-11-26 14:26                           ` vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range Amnon Shiloh
2012-11-26 14:41                             ` vdso && cr Cyrill Gorcunov
     [not found] <20130222142603.987c6e3c.akpm@linux-foundation.org>
2013-02-24  6:24 ` prctl(PR_SET_MM) Amnon Shiloh
2013-02-24  6:28 ` prctl(PR_SET_MM) Amnon Shiloh

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.