* [PATCH v6] prctl.2: Document Syscall User Dispatch
From: Gabriel Krisman Bertazi @ 2020-12-28 17:38 UTC
  To: alx.manpages, mtk.manpages; +Cc: linux-man, kernel, Gabriel Krisman Bertazi

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>

---
Changes since v5:
(suggested by Michael Kerrisk)
  - Change () punctuation
  - fix grammar
  - Add information about interception, return and return value

Changes since v4:
(suggested by Michael Kerrisk)
  - Modify explanation of what dispatch to user space means.
  - Drop references to emulation.
  - Document suggestion about placing libc in allowed-region.
  - Comment about avoiding syscall cost.
Changes since v3:
(suggested by Michael Kerrisk)
  - Explain what dispatch to user space means.
  - Document the fact that the memory region is a single consecutive
  range.
  - Explain failure if *arg5 is set to a bad value.
  - fix english typo.
  - Define what 'invalid memory region' means.

Changes since v2:
(suggested by Alejandro Colomar)
  - selective -> selectively
  - Add missing oxford comma.

Changes since v1:
(suggested by Alejandro Colomar)
  - Use semantic lines
  - Fix usage of .{B|I}R and .{B|I}
  - Don't format literals
  - Fix preferred spelling of userspace
  - Fix case of word
---
 man2/prctl.2 | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 159 insertions(+)

diff --git a/man2/prctl.2 b/man2/prctl.2
index f25f05fdb593..0a0abfb78055 100644
--- a/man2/prctl.2
+++ b/man2/prctl.2
@@ -1533,6 +1533,135 @@ For more information, see the kernel source file
 (or
 .I Documentation/arm64/sve.txt
 before Linux 5.3).
+.TP
+.\" prctl PR_SET_SYSCALL_USER_DISPATCH
+.\" commit 1446e1df9eb183fdf81c3f0715402f1d7595d4
+.BR PR_SET_SYSCALL_USER_DISPATCH " (since Linux 5.11, x86 only)"
+.IP
+Configure the Syscall User Dispatch mechanism
+for the calling thread.
+This mechanism allows an application
+to selectively intercept system calls
+so that they can be handled within the application itself.
+Interception takes the form of a thread-directed
+.B SIGSYS
+signal that is delivered to the thread
+when it makes a system call.
+If intercepted,
+the system call is not executed by the kernel.
+.IP
+The current Syscall User Dispatch mode is selected via
+.IR arg2 ,
+which can either be set to
+.B PR_SYS_DISPATCH_ON
+to enable the feature,
+or to
+.B PR_SYS_DISPATCH_OFF
+to turn it off.
+.IP
+When
+.I arg2
+is set to
+.BR PR_SYS_DISPATCH_ON ,
+.I arg3
+and
+.I arg4
+respectively identify the
+.I offset
+and
+.I length
+of a single contiguous memory region in the process map
+from where system calls are always allowed to be executed,
+regardless of the switch variable
+(Typically, this area would include the area of memory
+containing the C library.)
+.I arg5
+points to a char-sized variable
+that is a fast switch to enable/disable the mechanism
+without the overhead of doing a system call.
+The variable pointed by
+.I arg5
+can either be set to
+.B PR_SYS_DISPATCH_ON
+to enable the mechanism
+or to
+.B PR_SYS_DISPATCH_OFF
+to temporarily disable it.
+This value is checked by the kernel
+on every system call entry,
+and any unexpected value will raise
+an uncatchable
+.B SIGSYS
+at that time,
+killing the application.
+.IP
+When a system call is intercepted,
+the kernel sends a thread-directed
+.B SIGSYS
+signal to the triggering thread.
+Various fields will be set in the
+.I siginfo_t
+structure (see
+.BR sigaction (2))
+associated with the signal:
+.RS
+.IP * 3
+.I si_signo
+will contain
+.BR SIGSYS .
+.IP *
+.IR si_call_addr
+will show the address of the system call instruction.
+.IP *
+.IR si_syscall
+and
+.IR si_arch
+will indicate which system call was attempted.
+.IP *
+.I si_code
+will contain
+.BR SYS_USER_DISPATCH .
+.IP *
+.I si_errno
+will be set to 0.
+.RE
+.IP
+The program counter will be as though the system call happened
+(i.e., the program counter will not point to the system call instruction).
+.IP
+When the signal handler returns to the kernel,
+the system call completes immediately
+and returns to the calling thread,
+without actually being executed.
+If necessary
+(i.e., when emulating the system call in user space),
+the signal handler should set the system call return value
+to a sane value,
+by modifying the register context stored in the
+.I ucontext
+argument of the signal handler.
+See
+.BR sigaction (2),
+.BR sigreturn (2),
+and
+.BR getcontext (3)
+for more information.
+.IP
+When
+.I arg2
+is set to
+.BR PR_SYS_DISPATCH_OFF ,
+the remaining arguments must be set to 0.
+.IP
+The setting is not preserved across
+.BR fork (2),
+.BR clone (2),
+or
+.BR execve (2).
+.IP
+For more information,
+see the kernel source file
+.IR Documentation/admin-guide/syscall-user-dispatch.rst
 .\" prctl PR_SET_TAGGED_ADDR_CTRL
 .\" commit 63f0c60379650d82250f22e4cf4137ef3dc4f43d
 .TP
@@ -2000,6 +2129,14 @@ and
 .I arg3
 is an invalid address.
 .TP
+.B EFAULT
+.I option
+is
+.B PR_SET_SYSCALL_USER_DISPATCH
+and
+.I arg5
+has an invalid address.
+.TP
 .B EINVAL
 The value of
 .I option
@@ -2231,6 +2368,28 @@ and SVE is not available on this platform.
 .B EINVAL
 .I option
 is
+.B PR_SET_SYSCALL_USER_DISPATCH
+and one of the following is true:
+.RS
+.IP * 3
+.I arg2
+is
+.B PR_SYS_DISPATCH_OFF
+and the remaining arguments are not 0;
+.IP * 3
+.I arg2
+is
+.B PR_SYS_DISPATCH_ON
+and the memory range specified is outside the
+address space of the process.
+.IP * 3
+.I arg2
+is invalid.
+.RE
+.TP
+.B EINVAL
+.I option
+is
 .BR PR_SET_TAGGED_ADDR_CTRL
 and the arguments are invalid or unsupported.
 See the description of
-- 
2.29.2



* Re: [PATCH v6] prctl.2: Document Syscall User Dispatch
From: Michael Kerrisk (man-pages) @ 2020-12-30 10:24 UTC
  To: Gabriel Krisman Bertazi, alx.manpages; +Cc: mtk.manpages, linux-man, kernel

Hello Gabriel

This is looking much better. Thank you! I have a few more
comments still.

On 12/28/20 6:38 PM, Gabriel Krisman Bertazi wrote:
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> 
> ---
> Changes since v5:
> (suggested by Michael Kerrisk)
>   - Change () punctuation
>   - fix grammar
>   - Add information about interception, return and return value
> 
> Changes since v4:
> (suggested by Michael Kerrisk)
>   - Modify explanation of what dispatch to user space means.
>   - Drop references to emulation.
>   - Document suggestion about placing libc in allowed-region.
>   - Comment about avoiding syscall cost.
> Changes since v3:
> (suggested by Michael Kerrisk)
>   - Explain what dispatch to user space means.
>   - Document the fact that the memory region is a single consecutive
>   range.
>   - Explain failure if *arg5 is set to a bad value.
>   - fix english typo.
>   - Define what 'invalid memory region' means.
> 
> Changes since v2:
> (suggested by Alejandro Colomar)
>   - selective -> selectively
>   - Add missing oxford comma.
> 
> Changes since v1:
> (suggested by Alejandro Colomar)
>   - Use semantic lines
>   - Fix usage of .{B|I}R and .{B|I}
>   - Don't format literals
>   - Fix preferred spelling of userspace
>   - Fix case of word
> ---
>  man2/prctl.2 | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 159 insertions(+)
> 
> diff --git a/man2/prctl.2 b/man2/prctl.2
> index f25f05fdb593..0a0abfb78055 100644
> --- a/man2/prctl.2
> +++ b/man2/prctl.2
> @@ -1533,6 +1533,135 @@ For more information, see the kernel source file
>  (or
>  .I Documentation/arm64/sve.txt
>  before Linux 5.3).
> +.TP
> +.\" prctl PR_SET_SYSCALL_USER_DISPATCH
> +.\" commit 1446e1df9eb183fdf81c3f0715402f1d7595d4
> +.BR PR_SET_SYSCALL_USER_DISPATCH " (since Linux 5.11, x86 only)"
> +.IP
> +Configure the Syscall User Dispatch mechanism
> +for the calling thread.
> +This mechanism allows an application
> +to selectively intercept system calls
> +so that they can be handled within the application itself.
> +Interception takes the form of a thread-directed
> +.B SIGSYS
> +signal that is delivered to the thread
> +when it makes a system call.
> +If intercepted,
> +the system call is not executed by the kernel.
> +.IP
> +The current Syscall User Dispatch mode is selected via
> +.IR arg2 ,
> +which can either be set to
> +.B PR_SYS_DISPATCH_ON
> +to enable the feature,
> +or to
> +.B PR_SYS_DISPATCH_OFF
> +to turn it off.

So, I realize now that I'm slightly confused.

The value of arg2 can be either PR_SYS_DISPATCH_ON or
PR_SYS_DISPATCH_OFF. The value of the selector pointed to by
arg5 can likewise be PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.
What is the relationship between these two attributes? For example,
what does it mean if arg2 is PR_SYS_DISPATCH_ON and, at the time of
the prctl() call, the selector has the value PR_SYS_DISPATCH_OFF?

> +.IP
> +When
> +.I arg2
> +is set to
> +.BR PR_SYS_DISPATCH_ON ,
> +.I arg3
> +and
> +.I arg4
> +respectively identify the
> +.I offset
> +and
> +.I length
> +of a single contiguous memory region in the process map

Better: s/map/address space/ ?

> +from where system calls are always allowed to be executed,
> +regardless of the switch variable

s/variable/variable./

> +(Typically, this area would include the area of memory
> +containing the C library.)

I think just to ease readability (smaller paragraphs), insert
.IP
here.

> +.I arg5
> +points to a char-sized variable
> +that is a fast switch to enable/disable the mechanism
> +without the overhead of doing a system call.
> +The variable pointed by
> +.I arg5
> +can either be set to
> +.B PR_SYS_DISPATCH_ON
> +to enable the mechanism
> +or to
> +.B PR_SYS_DISPATCH_OFF
> +to temporarily disable it.
> +This value is checked by the kernel
> +on every system call entry,
> +and any unexpected value will raise
> +an uncatchable
> +.B SIGSYS
> +at that time,
> +killing the application.
> +.IP
> +When a system call is intercepted,
> +the kernel sends a thread-directed
> +.B SIGSYS
> +signal to the triggering thread.
> +Various fields will be set in the
> +.I siginfo_t
> +structure (see
> +.BR sigaction (2))
> +associated with the signal:
> +.RS
> +.IP * 3
> +.I si_signo
> +will contain
> +.BR SIGSYS .
> +.IP *
> +.IR si_call_addr
> +will show the address of the system call instruction.
> +.IP *
> +.IR si_syscall
> +and
> +.IR si_arch
> +will indicate which system call was attempted.
> +.IP *
> +.I si_code
> +will contain
> +.BR SYS_USER_DISPATCH .
> +.IP *
> +.I si_errno
> +will be set to 0.
> +.RE
> +.IP
> +The program counter will be as though the system call happened
> +(i.e., the program counter will not point to the system call instruction).
> +.IP
> +When the signal handler returns to the kernel,
> +the system call completes immediately
> +and returns to the calling thread,
> +without actually being executed.
> +If necessary
> +(i.e., when emulating the system call in user space),
> +the signal handler should set the system call return value
> +to a sane value,
> +by modifying the register context stored in the
> +.I ucontext
> +argument of the signal handler.

Just for my own education, do you have any example code somewhere
that demonstrates setting the syscall return value?

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCH v6] prctl.2: Document Syscall User Dispatch
From: Gabriel Krisman Bertazi @ 2020-12-30 16:51 UTC
  To: Michael Kerrisk (man-pages); +Cc: alx.manpages, linux-man, kernel

"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hello Gabriel
>
> This is looking much better. Thank you! I have a few more
> comments still.
>
> On 12/28/20 6:38 PM, Gabriel Krisman Bertazi wrote:
>> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
>> 
>> ---
>> Changes since v5:
>> (suggested by Michael Kerrisk)
>>   - Change () punctuation
>>   - fix grammar
>>   - Add information about interception, return and return value
>> 
>> Changes since v4:
>> (suggested by Michael Kerrisk)
>>   - Modify explanation of what dispatch to user space means.
>>   - Drop references to emulation.
>>   - Document suggestion about placing libc in allowed-region.
>>   - Comment about avoiding syscall cost.
>> Changes since v3:
>> (suggested by Michael Kerrisk)
>>   - Explain what dispatch to user space means.
>>   - Document the fact that the memory region is a single consecutive
>>   range.
>>   - Explain failure if *arg5 is set to a bad value.
>>   - fix english typo.
>>   - Define what 'invalid memory region' means.
>> 
>> Changes since v2:
>> (suggested by Alejandro Colomar)
>>   - selective -> selectively
>>   - Add missing oxford comma.
>> 
>> Changes since v1:
>> (suggested by Alejandro Colomar)
>>   - Use semantic lines
>>   - Fix usage of .{B|I}R and .{B|I}
>>   - Don't format literals
>>   - Fix preferred spelling of userspace
>>   - Fix case of word
>> ---
>>  man2/prctl.2 | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 159 insertions(+)
>> 
>> diff --git a/man2/prctl.2 b/man2/prctl.2
>> index f25f05fdb593..0a0abfb78055 100644
>> --- a/man2/prctl.2
>> +++ b/man2/prctl.2
>> @@ -1533,6 +1533,135 @@ For more information, see the kernel source file
>>  (or
>>  .I Documentation/arm64/sve.txt
>>  before Linux 5.3).
>> +.TP
>> +.\" prctl PR_SET_SYSCALL_USER_DISPATCH
>> +.\" commit 1446e1df9eb183fdf81c3f0715402f1d7595d4
>> +.BR PR_SET_SYSCALL_USER_DISPATCH " (since Linux 5.11, x86 only)"
>> +.IP
>> +Configure the Syscall User Dispatch mechanism
>> +for the calling thread.
>> +This mechanism allows an application
>> +to selectively intercept system calls
>> +so that they can be handled within the application itself.
>> +Interception takes the form of a thread-directed
>> +.B SIGSYS
>> +signal that is delivered to the thread
>> +when it makes a system call.
>> +If intercepted,
>> +the system call is not executed by the kernel.
>> +.IP
>> +The current Syscall User Dispatch mode is selected via
>> +.IR arg2 ,
>> +which can either be set to
>> +.B PR_SYS_DISPATCH_ON
>> +to enable the feature,
>> +or to
>> +.B PR_SYS_DISPATCH_OFF
>> +to turn it off.
>
> So, I realize now that I'm slightly confused.
>
> The value of arg2 can be either PR_SYS_DISPATCH_ON or
> PR_SYS_DISPATCH_OFF. The value of the selector pointed to by
> arg5 can likewise be PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.
> What is the relationship between these two attributes? For example,
> what does it mean if arg2 is PR_SYS_DISPATCH_ON and, at the time of
> the prctl() call, the selector has the value PR_SYS_DISPATCH_OFF?

Hi Michael,

arg2 turns SUD on and off, thread-wide.  arg5 points to the selector that
is used to filter syscalls when SUD is on.  Syscalls are intercepted
based on the value of that selector, and only if SUD is enabled for the
thread.

The selector is checked only upon syscall entry, so its value doesn't have
any effect on the prctl() call itself, since SUD was still off at that time.
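
To make the relationship concrete, here is a minimal sketch (an
illustration only, not taken from the kselftests).  It assumes a C
library that exposes prctl() and syscall(), and falls back to the numeric
values from the 5.11 UAPI headers when the macros are missing.  The names
selector and sud_on_with_selector_off() are made up for this example.
SUD is turned on through arg2 while the selector byte still holds
PR_SYS_DISPATCH_OFF, so the system call that follows runs normally:

```c
#include <assert.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for pre-5.11 headers; values as in the 5.11 UAPI prctl.h. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* Hypothetical selector byte; its address is what gets passed as arg5. */
static volatile char selector = PR_SYS_DISPATCH_OFF;

/*
 * Returns 1 if SUD was enabled and a system call still ran normally,
 * 0 if the kernel rejected the prctl() (e.g. no SUD support),
 * and -1 on anything unexpected.
 */
static int sud_on_with_selector_off(void)
{
	long pid;

	/* arg2 = ON: SUD is now active for this thread, no allowed region. */
	if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
		  0L, 0L, &selector) != 0)
		return 0;

	/* The selector is OFF, so this system call is not intercepted. */
	pid = syscall(SYS_getpid);

	/* arg2 = OFF: the remaining arguments must be 0. */
	if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
		  0L, 0L, 0L) != 0)
		return -1;

	/* The kernel only reads the selector; it never writes to it. */
	assert(selector == PR_SYS_DISPATCH_OFF);

	return pid == (long)getpid() ? 1 : -1;
}
```

Calling sud_on_with_selector_off() from main() should give 1 on an x86
kernel with SUD and 0 elsewhere.  No SIGSYS handler is needed here
because the selector never leaves PR_SYS_DISPATCH_OFF.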

I'm not sure how to add this information to the man page, but I will try
to clarify it further.

>> +.IP
>> +When
>> +.I arg2
>> +is set to
>> +.BR PR_SYS_DISPATCH_ON ,
>> +.I arg3
>> +and
>> +.I arg4
>> +respectively identify the
>> +.I offset
>> +and
>> +.I length
>> +of a single contiguous memory region in the process map
>
> Better: s/map/address space/ ?
>
>> +from where system calls are always allowed to be executed,
>> +regardless of the switch variable
>
> s/variable/variable./
>
>> +(Typically, this area would include the area of memory
>> +containing the C library.)
>
> I think just to ease readability (smaller paragraphs), insert
> .IP
> here.
>
>> +.I arg5
>> +points to a char-sized variable
>> +that is a fast switch to enable/disable the mechanism
>> +without the overhead of doing a system call.
>> +The variable pointed by
>> +.I arg5
>> +can either be set to
>> +.B PR_SYS_DISPATCH_ON
>> +to enable the mechanism
>> +or to
>> +.B PR_SYS_DISPATCH_OFF
>> +to temporarily disable it.
>> +This value is checked by the kernel
>> +on every system call entry,
>> +and any unexpected value will raise
>> +an uncatchable
>> +.B SIGSYS
>> +at that time,
>> +killing the application.
>> +.IP
>> +When a system call is intercepted,
>> +the kernel sends a thread-directed
>> +.B SIGSYS
>> +signal to the triggering thread.
>> +Various fields will be set in the
>> +.I siginfo_t
>> +structure (see
>> +.BR sigaction (2))
>> +associated with the signal:
>> +.RS
>> +.IP * 3
>> +.I si_signo
>> +will contain
>> +.BR SIGSYS .
>> +.IP *
>> +.IR si_call_addr
>> +will show the address of the system call instruction.
>> +.IP *
>> +.IR si_syscall
>> +and
>> +.IR si_arch
>> +will indicate which system call was attempted.
>> +.IP *
>> +.I si_code
>> +will contain
>> +.BR SYS_USER_DISPATCH .
>> +.IP *
>> +.I si_errno
>> +will be set to 0.
>> +.RE
>> +.IP
>> +The program counter will be as though the system call happened
>> +(i.e., the program counter will not point to the system call instruction).
>> +.IP
>> +When the signal handler returns to the kernel,
>> +the system call completes immediately
>> +and returns to the calling thread,
>> +without actually being executed.
>> +If necessary
>> +(i.e., when emulating the system call in user space),
>> +the signal handler should set the system call return value
>> +to a sane value,
>> +by modifying the register context stored in the
>> +.I ucontext
>> +argument of the signal handler.
>
> Just for my own education, do you have any example code somewhere
> that demonstrates setting the syscall return value?

I thought I had this exemplified in the kselftests, but I just saw that
is not the case.  Something to fix there...

In the following signal handler code (untested, adapted from an old test
case of mine), SUD is used as a poor man's strace (x86 specific):

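#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <ucontext.h>
#include <unistd.h>

/* Points to the selector byte that was passed to prctl() as arg5. */
static volatile char *arg5;

static void handle_sigsys(int sig, siginfo_t *info, void *ucontext)
{
	ucontext_t *ctx = ucontext;
	mcontext_t *mctx = &ctx->uc_mcontext;
	char buf[1024];
	long r;
	int len;

	/* Let the system calls made by the handler itself go through. */
	*arg5 = PR_SYS_DISPATCH_OFF;

	/* greg_t is a long long on x86-64, hence %lld. */
	len = snprintf(buf, sizeof(buf),
		       "sys_%lld(%lld,%lld,%lld,%lld,%lld,%lld)\n",
		       mctx->gregs[REG_RAX], mctx->gregs[REG_RDI],
		       mctx->gregs[REG_RSI], mctx->gregs[REG_RDX],
		       mctx->gregs[REG_R10], mctx->gregs[REG_R8],
		       mctx->gregs[REG_R9]);

	write(2, buf, len);

	/* Re-issue the intercepted system call with its original arguments. */
	r = syscall(mctx->gregs[REG_RAX], mctx->gregs[REG_RDI],
		    mctx->gregs[REG_RSI], mctx->gregs[REG_RDX],
		    mctx->gregs[REG_R10], mctx->gregs[REG_R8],
		    mctx->gregs[REG_R9]);

	/*
	 * Set the return value.  syscall() reports failure as -1 plus errno,
	 * while the interrupted code expects the raw -errno in RAX.
	 */
	mctx->gregs[REG_RAX] = (r == -1) ? -errno : r;

	*arg5 = PR_SYS_DISPATCH_ON;

	/* Assumes the return trampoline is in the allowed region */
}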
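
For something that can be compiled on its own, a smaller sketch along the
same lines follows.  It is only an illustration under stated assumptions:
x86-64 with glibc or musl, no allowed region (arg3 and arg4 are 0), and
made-up names (selector, on_sigsys, intercept_one_syscall, FAKE_RESULT).
Instead of re-issuing the system call, the handler fakes its result by
writing to RAX in the saved context, and it leaves the selector at
PR_SYS_DISPATCH_OFF so that the rt_sigreturn() issued by the return
trampoline is not intercepted as well:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for REG_RAX */
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <ucontext.h>
#include <unistd.h>

/* Fallbacks for pre-5.11 headers; values as in the 5.11 UAPI prctl.h. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

#define FAKE_RESULT 4242L	/* arbitrary value chosen for the demo */

/* Hypothetical selector byte; its address is passed as arg5. */
static volatile char selector = PR_SYS_DISPATCH_OFF;

#if defined(__x86_64__)
#ifndef REG_RAX
#define REG_RAX 13		/* index of RAX in gregs[] on x86-64 */
#endif

static void on_sigsys(int sig, siginfo_t *info, void *uc_void)
{
	ucontext_t *uc = uc_void;

	(void)sig;
	(void)info;

	/* Plain store, no system call: lets rt_sigreturn() through. */
	selector = PR_SYS_DISPATCH_OFF;

	/* This becomes the return value seen by the interrupted code. */
	uc->uc_mcontext.gregs[REG_RAX] = FAKE_RESULT;
}
#endif

/* Returns FAKE_RESULT if a getpid() was intercepted, -1 if SUD is unusable. */
static long intercept_one_syscall(void)
{
#if defined(__x86_64__)
	struct sigaction sa;
	long ret;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigsys;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSYS, &sa, NULL) != 0)
		return -1;

	/* Warm-up call, so lazy symbol binding is done before SUD is armed. */
	(void)syscall(SYS_getpid);

	if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
		  0L, 0L, &selector) != 0)
		return -1;

	selector = PR_SYS_DISPATCH_ON;
	ret = syscall(SYS_getpid);	/* intercepted, never reaches the kernel */

	/* The handler switched the selector back off. */
	assert(selector == PR_SYS_DISPATCH_OFF);

	prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0L, 0L, 0L);
	return ret;
#else
	return -1;
#endif
}
```

On a kernel with SUD, intercept_one_syscall() should return FAKE_RESULT
rather than the real PID.  Leaving the selector off in the handler trades
continuous interception for not having to place the trampoline in the
allowed region.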

-- 
Gabriel Krisman Bertazi


* Re: [PATCH v6] prctl.2: Document Syscall User Dispatch
From: Michael Kerrisk (man-pages) @ 2020-12-30 19:50 UTC
  To: Gabriel Krisman Bertazi
  Cc: mtk.manpages, alx.manpages, linux-man, kernel, Linux API

[CC += linux-api]

Hi Gabriel,

On 12/30/20 5:51 PM, Gabriel Krisman Bertazi wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> 
>> Hello Gabriel
>>
>> This is looking much better. Thank you! I have a few more
>> comments still.
>>
>> On 12/28/20 6:38 PM, Gabriel Krisman Bertazi wrote:
>>> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
>>>
>>> ---
>>> Changes since v5:
>>> (suggested by Michael Kerrisk)
>>>   - Change () punctuation
>>>   - fix grammar
>>>   - Add information about interception, return and return value
>>>
>>> Changes since v4:
>>> (suggested by Michael Kerrisk)
>>>   - Modify explanation of what dispatch to user space means.
>>>   - Drop references to emulation.
>>>   - Document suggestion about placing libc in allowed-region.
>>>   - Comment about avoiding syscall cost.
>>> Changes since v3:
>>> (suggested by Michael Kerrisk)
>>>   - Explain what dispatch to user space means.
>>>   - Document the fact that the memory region is a single consecutive
>>>   range.
>>>   - Explain failure if *arg5 is set to a bad value.
>>>   - fix english typo.
>>>   - Define what 'invalid memory region' means.
>>>
>>> Changes since v2:
>>> (suggested by Alejandro Colomar)
>>>   - selective -> selectively
>>>   - Add missing oxford comma.
>>>
>>> Changes since v1:
>>> (suggested by Alejandro Colomar)
>>>   - Use semantic lines
>>>   - Fix usage of .{B|I}R and .{B|I}
>>>   - Don't format literals
>>>   - Fix preferred spelling of userspace
>>>   - Fix case of word
>>> ---
>>>  man2/prctl.2 | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 159 insertions(+)
>>>
>>> diff --git a/man2/prctl.2 b/man2/prctl.2
>>> index f25f05fdb593..0a0abfb78055 100644
>>> --- a/man2/prctl.2
>>> +++ b/man2/prctl.2
>>> @@ -1533,6 +1533,135 @@ For more information, see the kernel source file
>>>  (or
>>>  .I Documentation/arm64/sve.txt
>>>  before Linux 5.3).
>>> +.TP
>>> +.\" prctl PR_SET_SYSCALL_USER_DISPATCH
>>> +.\" commit 1446e1df9eb183fdf81c3f0715402f1d7595d4
>>> +.BR PR_SET_SYSCALL_USER_DISPATCH " (since Linux 5.11, x86 only)"
>>> +.IP
>>> +Configure the Syscall User Dispatch mechanism
>>> +for the calling thread.
>>> +This mechanism allows an application
>>> +to selectively intercept system calls
>>> +so that they can be handled within the application itself.
>>> +Interception takes the form of a thread-directed
>>> +.B SIGSYS
>>> +signal that is delivered to the thread
>>> +when it makes a system call.
>>> +If intercepted,
>>> +the system call is not executed by the kernel.
>>> +.IP
>>> +The current Syscall User Dispatch mode is selected via
>>> +.IR arg2 ,
>>> +which can either be set to
>>> +.B PR_SYS_DISPATCH_ON
>>> +to enable the feature,
>>> +or to
>>> +.B PR_SYS_DISPATCH_OFF
>>> +to turn it off.
>>
>> So, I realize now that I'm slightly confused.
>>
>> The value of arg2 can be either PR_SYS_DISPATCH_ON or
>> PR_SYS_DISPATCH_OFF. The value of the selector pointed to by
>> arg5 can likewise be PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.
>> What is the relationship between these two attributes? For example,
>> what does it mean if arg2 is PR_SYS_DISPATCH_ON and, at the time of
>> the prctl() call, the selector has the value PR_SYS_DISPATCH_OFF?
> 
> Hi Michael,
> 
> arg2 turns SUD on and off, thread-wide.  arg5 points to the selector that
> is used to filter syscalls when SUD is on.  Syscalls are intercepted
> based on the value of that selector, and only if SUD is enabled for the
> thread.

Okay -- I suspected as much, but I wanted to confirm.

So, I want to say that in API design terms this feels wrong. You 
are using the same name(s) to mean two different things:
1. Define/clear SUD/the non-SUD memory region
2. Enable/disable SUD filtering in the SUD memory region (i.e., the 
part of the virtual address space outside the region defined in 1).

My terms here may not be the best, but I guess the distinction that
I make is clear.

My point is that these are different things, and I think they 
really should have different names (though of course the numbers
could stay the same). Otherwise, people fall into the confusion
that I ran into. Given that this feature is not yet released, and
that the numbers don't need to change, I'd like to suggest that
the names should change before 5.11 is released.

For arg2, maybe:
    PR_SYS_DISPATCH_SET_REGION / PR_SYS_DISPATCH_CLEAR_REGION
or
    PR_SYS_DISPATCH_SET / PR_SYS_DISPATCH_CLEAR
or
    PR_SYS_DISPATCH_ON / PR_SYS_DISPATCH_OFF

For the selector (pointed to by arg5), maybe stay with what you
have:

    PR_SYS_DISPATCH_ON / PR_SYS_DISPATCH_OFF

or (maybe better, and would also fit well with keeping
arg2 as PR_SYS_DISPATCH_ON / PR_SYS_DISPATCH_OFF):

    PR_SYS_DISPATCH_FILTER_ON / PR_SYS_DISPATCH_FILTER_OFF

Do you see what I mean? We live with APIs for a long time, and I
think it serves us to make them as clear as possible.
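
As a sketch of how the last option might read in application code: the
two FILTER names below are only the proposal from this mail, not
something that exists in the kernel headers, and the helper functions are
likewise invented for illustration.  The numbers stay the same as for
PR_SYS_DISPATCH_OFF (0) and PR_SYS_DISPATCH_ON (1):

```c
#include <assert.h>

/* Hypothetical selector names from the proposal above; same numbers. */
#define PR_SYS_DISPATCH_FILTER_OFF 0
#define PR_SYS_DISPATCH_FILTER_ON  1

/* Flip the selector byte that arg5 points to; no system call involved. */
static inline void sud_filter_set(volatile char *sel, int on)
{
	assert(sel);
	*sel = on ? PR_SYS_DISPATCH_FILTER_ON : PR_SYS_DISPATCH_FILTER_OFF;
}

/* Report whether filtering is currently requested via the selector. */
static inline int sud_filter_is_on(const volatile char *sel)
{
	assert(sel);
	return *sel == PR_SYS_DISPATCH_FILTER_ON;
}
```

With names like these, a call such as sud_filter_set(&sel, 1) can no
longer be mistaken for the arg2 operation that sets up or tears down SUD
for the thread.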

Thanks,

Michael



> 
> The selector is checked only upon syscall entry, so its value doesn't have
> any effect on the prctl() call itself, since SUD was still off at that time.
> 
> I'm not sure how to add this information to the man page, but I will try
> to clarify it further.
> 
>>> +.IP
>>> +When
>>> +.I arg2
>>> +is set to
>>> +.BR PR_SYS_DISPATCH_ON ,
>>> +.I arg3
>>> +and
>>> +.I arg4
>>> +respectively identify the
>>> +.I offset
>>> +and
>>> +.I length
>>> +of a single contiguous memory region in the process map
>>
>> Better: s/map/address space/ ?
>>
>>> +from where system calls are always allowed to be executed,
>>> +regardless of the switch variable
>>
>> s/variable/variable./
>>
>>> +(Typically, this area would include the area of memory
>>> +containing the C library.)
>>
>> I think just to ease readability (smaller paragraphs), insert
>> .IP
>> here.
>>
>>> +.I arg5
>>> +points to a char-sized variable
>>> +that is a fast switch to enable/disable the mechanism
>>> +without the overhead of doing a system call.
>>> +The variable pointed by
>>> +.I arg5
>>> +can either be set to
>>> +.B PR_SYS_DISPATCH_ON
>>> +to enable the mechanism
>>> +or to
>>> +.B PR_SYS_DISPATCH_OFF
>>> +to temporarily disable it.
>>> +This value is checked by the kernel
>>> +on every system call entry,
>>> +and any unexpected value will raise
>>> +an uncatchable
>>> +.B SIGSYS
>>> +at that time,
>>> +killing the application.
>>> +.IP
>>> +When a system call is intercepted,
>>> +the kernel sends a thread-directed
>>> +.B SIGSYS
>>> +signal to the triggering thread.
>>> +Various fields will be set in the
>>> +.I siginfo_t
>>> +structure (see
>>> +.BR sigaction (2))
>>> +associated with the signal:
>>> +.RS
>>> +.IP * 3
>>> +.I si_signo
>>> +will contain
>>> +.BR SIGSYS .
>>> +.IP *
>>> +.IR si_call_addr
>>> +will show the address of the system call instruction.
>>> +.IP *
>>> +.IR si_syscall
>>> +and
>>> +.IR si_arch
>>> +will indicate which system call was attempted.
>>> +.IP *
>>> +.I si_code
>>> +will contain
>>> +.BR SYS_USER_DISPATCH .
>>> +.IP *
>>> +.I si_errno
>>> +will be set to 0.
>>> +.RE
>>> +.IP
>>> +The program counter will be as though the system call happened
>>> +(i.e., the program counter will not point to the system call instruction).
>>> +.IP
>>> +When the signal handler returns to the kernel,
>>> +the system call completes immediately
>>> +and returns to the calling thread,
>>> +without actually being executed.
>>> +If necessary
>>> +(i.e., when emulating the system call in user space),
>>> +the signal handler should set the system call return value
>>> +to a sane value,
>>> +by modifying the register context stored in the
>>> +.I ucontext
>>> +argument of the signal handler.
>>
>> Just for my own education, do you have any example code somewhere
>> that demonstrates setting the syscall return value?
> 
> I thought I had this exemplified in the kselftests, but I just saw that
> is not the case.  Something to fix there...
> 
> In the following signal handler code (untested, adapted from an old test
> case of mine), SUD is used as a poor man's strace (x86 specific):
> 
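> #define _GNU_SOURCE
> #include <errno.h>
> #include <signal.h>
> #include <stdio.h>
> #include <sys/prctl.h>
> #include <ucontext.h>
> #include <unistd.h>
> 
> /* Points to the selector byte that was passed to prctl() as arg5. */
> static volatile char *arg5;
> 
> static void handle_sigsys(int sig, siginfo_t *info, void *ucontext)
> {
> 	ucontext_t *ctx = ucontext;
> 	mcontext_t *mctx = &ctx->uc_mcontext;
> 	char buf[1024];
> 	long r;
> 	int len;
> 
> 	/* Let the system calls made by the handler itself go through. */
> 	*arg5 = PR_SYS_DISPATCH_OFF;
> 
> 	/* greg_t is a long long on x86-64, hence %lld. */
> 	len = snprintf(buf, sizeof(buf),
> 		       "sys_%lld(%lld,%lld,%lld,%lld,%lld,%lld)\n",
> 		       mctx->gregs[REG_RAX], mctx->gregs[REG_RDI],
> 		       mctx->gregs[REG_RSI], mctx->gregs[REG_RDX],
> 		       mctx->gregs[REG_R10], mctx->gregs[REG_R8],
> 		       mctx->gregs[REG_R9]);
> 
> 	write(2, buf, len);
> 
> 	/* Re-issue the intercepted system call with its original arguments. */
> 	r = syscall(mctx->gregs[REG_RAX], mctx->gregs[REG_RDI],
> 		    mctx->gregs[REG_RSI], mctx->gregs[REG_RDX],
> 		    mctx->gregs[REG_R10], mctx->gregs[REG_R8],
> 		    mctx->gregs[REG_R9]);
> 
> 	/*
> 	 * Set the return value.  syscall() reports failure as -1 plus errno,
> 	 * while the interrupted code expects the raw -errno in RAX.
> 	 */
> 	mctx->gregs[REG_RAX] = (r == -1) ? -errno : r;
> 
> 	*arg5 = PR_SYS_DISPATCH_ON;
> 
> 	/* Assumes the return trampoline is in the allowed region */
> }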
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
