* [PATCH man-pages] Add rseq manpage
From: Mathieu Desnoyers @ 2018-09-19 14:40 UTC
  To: Michael Kerrisk
  Cc: linux-kernel, linux-api, Peter Zijlstra, Paul E . McKenney,
	Boqun Feng, Andy Lutomirski, Dave Watson, Paul Turner,
	Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Mathieu Desnoyers

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 man2/rseq.2 | 291 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 291 insertions(+)
 create mode 100644 man2/rseq.2

diff --git a/man2/rseq.2 b/man2/rseq.2
new file mode 100644
index 000000000..a381963ba
--- /dev/null
+++ b/man2/rseq.2
@@ -0,0 +1,291 @@
+.\" Copyright 2015-2018 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH RSEQ 2 2018-09-19 "Linux" "Linux Programmer's Manual"
+.SH NAME
+rseq \- Restartable sequences and CPU number cache
+.SH SYNOPSIS
+.nf
+.B #include <linux/rseq.h>
+.sp
+.BI "int rseq(struct rseq * " rseq ", uint32_t " rseq_len ", int " flags ", uint32_t " sig ");
+.sp
+.SH DESCRIPTION
+The
+.BR rseq ()
+ABI accelerates user-space operations on per-cpu data by defining a
+shared data structure ABI between each user-space thread and the kernel.
+
+It allows user-space to perform update operations on per-cpu data
+without requiring heavy-weight atomic operations.
+
+The term CPU used in this documentation refers to a hardware execution
+context.
+
+Restartable sequences are atomic with respect to preemption (making them
+atomic with respect to other threads running on the same CPU), as well
+as signal delivery (user-space execution contexts nested over the same
+thread). They either complete atomically with respect to preemption on
+the current CPU and signal delivery, or they are aborted.
+
+Restartable sequences are suited for update operations on per-cpu data.
+
+Restartable sequences can be used on data structures shared between
+threads within a process, and on data structures shared between threads
+across different processes.
+
+.PP
+Some examples of operations that can be accelerated or improved
+by this ABI:
+.IP \[bu] 2
+Memory allocator per-cpu free-lists,
+.IP \[bu] 2
+Querying the current CPU number,
+.IP \[bu] 2
+Incrementing per-CPU counters,
+.IP \[bu] 2
+Modifying data protected by per-CPU spinlocks,
+.IP \[bu] 2
+Inserting/removing elements in per-CPU linked-lists,
+.IP \[bu] 2
+Writing/reading per-CPU ring buffers content,
+.IP \[bu] 2
+Accurately reading performance monitoring unit counters
+with respect to thread migration.
+
+.PP
+Restartable sequences must not perform system calls. Doing so may result
+in termination of the process by a segmentation fault.
+
+.PP
+The
+.I rseq
+argument is a pointer to the thread-local rseq structure to be shared
+between kernel and user-space.
+
+.PP
+The layout of
+.B struct rseq
+is as follows:
+.TP
+.B Structure alignment
+This structure is aligned on a 32-byte boundary.
+.TP
+.B Structure size
+This structure is extensible. Its size is passed as parameter to the
+rseq system call.
+.TP
+.B Fields
+
+.TP
+.in +4n
+.I cpu_id_start
+Optimistic cache of the CPU number on which the current thread is
+running. Its value is guaranteed to always be a possible CPU number,
+even when rseq is not initialized. The value it contains should always
+be confirmed by reading the cpu_id field.
+
+This field is an optimistic cache in the sense that it is always
+guaranteed to hold a valid CPU number in the range [ 0 ..
+nr_possible_cpus - 1 ]. It can therefore be loaded by user-space and
+used as an offset in per-cpu data structures without having to
+check whether its value is within the valid bounds compared to the
+number of possible CPUs in the system.
+
+For user-space applications executed on a kernel without rseq support,
+the cpu_id_start field stays initialized at 0, which is indeed a valid
+CPU number. It is therefore valid to use it as an offset in per-cpu data
+structures, and only validate whether it's actually the current CPU
+number by comparing it with the cpu_id field within the rseq critical
+section. If the kernel does not provide rseq support, that cpu_id field
+stays initialized at -1, so the comparison always fails, as intended.
+
+It is then up to user-space to use a fall-back mechanism, considering
+that rseq is not available.
+
+.in
+.TP
+.in +4n
+.I cpu_id
+Cache of the CPU number on which the current thread is running.
+-1 if uninitialized.
+.in
+.TP
+.in +4n
+.I rseq_cs
+The rseq_cs field is a pointer to a struct rseq_cs. It is NULL when no
+rseq assembly block critical section is active for the current thread.
+Setting it to point to a critical section descriptor (struct rseq_cs)
+marks the beginning of the critical section.
+.in
+.TP
+.in +4n
+.I flags
+Flags indicating the restart behavior for the current thread. This is
+mainly used for debugging purposes. Can be either:
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+.in
+
+.PP
+The layout of
+.B struct rseq_cs
+version 0 is as follows:
+.TP
+.B Structure alignment
+This structure is aligned on a 32-byte boundary.
+.TP
+.B Structure size
+This structure has a fixed size of 32 bytes.
+.TP
+.B Fields
+
+.TP
+.in +4n
+.I version
+Version of this structure.
+.in
+.TP
+.in +4n
+.I flags
+Flags indicating the restart behavior of this structure. Can be
+a combination of:
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+.TP
+.in +4n
+.I start_ip
+Instruction pointer address of the first instruction of the sequence of
+consecutive assembly instructions.
+.in
+.TP
+.in +4n
+.I post_commit_offset
+Offset (from start_ip address) of the address after the last instruction
+of the sequence of consecutive assembly instructions.
+.in
+.TP
+.in +4n
+.I abort_ip
+Instruction pointer address where to move the execution flow in case of
+abort of the sequence of consecutive assembly instructions.
+.in
+
+.PP
+The
+.I rseq_len
+argument is the size of the
+.I struct rseq
+to register.
+
+.PP
+The
+.I flags
+argument is 0 for registration, and
+.IR RSEQ_FLAG_UNREGISTER
+for unregistration.
+
+.PP
+The
+.I sig
+argument is the 32-bit signature to be expected before the abort
+handler code.
+
+.PP
+A single library per process should keep the rseq structure in a
+thread-local storage variable.
+The
+.I cpu_id
+field should be initialized to -1, and the
+.I cpu_id_start
+field should be initialized to a possible CPU value (typically 0).
+
+.PP
+Each thread is responsible for registering and unregistering its rseq
+structure. No more than one rseq structure address can be registered
+per thread at a given time.
+
+.PP
+In a typical usage scenario, the thread registering the rseq
+structure will be performing loads and stores from/to that structure. It
+is however also allowed to read that structure from other threads.
+The rseq field updates performed by the kernel provide relaxed atomicity
+semantics, which guarantee that other threads performing relaxed atomic
+reads of the cpu number cache will always observe a consistent value.
+
+.SH RETURN VALUE
+A return value of 0 indicates success. On error, \-1 is returned, and
+.I errno
+is set appropriately.
+
+.SH ERRORS
+.TP
+.B EINVAL
+Either
+.I flags
+contains an invalid value, or
+.I rseq
+contains an address which is not appropriately aligned, or
+.I rseq_len
+contains a size that does not match the size received on registration.
+.TP
+.B ENOSYS
+The
+.BR rseq ()
+system call is not implemented by this kernel.
+.TP
+.B EFAULT
+.I rseq
+is an invalid address.
+.TP
+.B EBUSY
+Restartable sequence is already registered for this thread.
+.TP
+.B EPERM
+The
+.I sig
+argument on unregistration does not match the signature received
+on registration.
+
+.SH VERSIONS
+The
+.BR rseq ()
+system call was added in Linux 4.18.
+
+.SH CONFORMING TO
+.BR rseq ()
+is Linux-specific.
+
+.in
+.SH SEE ALSO
+.BR sched_getcpu (3) ,
+.BR membarrier (2)
-- 
2.11.0

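The registration flow described in the page above can be sketched as follows. This is only an illustration under stated assumptions: the structure `rseq_area`, the helper names, the flag macro `RSEQ_UNREGISTER_FLAG`, and the signature value `RSEQ_SIG_SKETCH` are hypothetical and merely mirror the documented fields; a real program would normally take `struct rseq` and `RSEQ_FLAG_UNREGISTER` from `<linux/rseq.h>`. It also assumes the system headers define `SYS_rseq`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical mirror of the documented struct rseq layout. */
struct rseq_area {
    uint32_t cpu_id_start;  /* optimistic CPU number cache */
    uint32_t cpu_id;        /* CPU number cache, -1 if uninitialized */
    uint64_t rseq_cs;       /* pointer to the active critical section */
    uint32_t flags;         /* restart behavior flags */
} __attribute__((aligned(32)));

static_assert(sizeof(struct rseq_area) == 32, "expected a 32-byte area");

/* Assumed values: an arbitrary 32-bit signature and the unregister flag. */
#define RSEQ_SIG_SKETCH      0x53053053u
#define RSEQ_UNREGISTER_FLAG 1

/* One area per thread, initialized as the page recommends. */
static __thread struct rseq_area rseq_area_tls = {
    .cpu_id_start = 0,
    .cpu_id = (uint32_t)-1,
};

/* Issue the raw system call; returns 0 on success, -1 with errno set. */
static int rseq_call(int flags)
{
#ifdef SYS_rseq
    return (int)syscall(SYS_rseq, &rseq_area_tls,
                        (uint32_t)sizeof(rseq_area_tls), flags,
                        RSEQ_SIG_SKETCH);
#else
    (void)flags;
    errno = ENOSYS;  /* headers too old to know about rseq */
    return -1;
#endif
}

int rseq_register_current_thread(void)
{
    return rseq_call(0);
}

int rseq_unregister_current_thread(void)
{
    return rseq_call(RSEQ_UNREGISTER_FLAG);
}
```

Each thread would call `rseq_register_current_thread()` once and fall back to another mechanism on failure. If something else in the process (for instance the C library) has already registered an area for the thread, the call should fail rather than replace that registration.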

* Re: [PATCH man-pages] Add rseq manpage
From: Mathieu Desnoyers @ 2020-04-27 15:15 UTC
  To: Michael Kerrisk
  Cc: linux-kernel, linux-api, Peter Zijlstra, Paul, Boqun Feng,
	Andy Lutomirski, Dave Watson, Paul Turner, Andrew Morton,
	Russell King, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Will Deacon, carlos,
	Florian Weimer

----- On Mar 4, 2019, at 1:02 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Feb 28, 2019, at 3:42 AM, Michael Kerrisk mtk.manpages@gmail.com wrote:
> 
>> On 12/6/18 3:42 PM, Mathieu Desnoyers wrote:
>>> [ Michael, rseq(2) was merged into 4.18. Can you have a look at this
>>>   patch which adds rseq documentation to the man-pages project ? ]
>> Hi Mathieu
>> 
>> Sorry for the long delay. I've merged this page into a private
>> branch and have done quite a lot of editing. I have many
>> questions :-).
> 
> No worries, thanks for looking into it!
> 
>> 
>> In the first instance, I think it is probably best to have
>> a free-form text discussion rather than firing patches
>> back and forward. Could you take a look at the questions below
>> and respond?
> 
> Sure,

Hi Michael,

Gentle bump of this email in your inbox, since I suspect you might have
forgotten about it altogether. A year ago you had a heavily edited
man page for rseq(2). I provided the requested feedback, but I have not
heard back from you since then.

We are now close to integrating rseq into glibc, and having an official
man page would be useful.

Thanks,

Mathieu


> 
>> 
>> Thanks,
>> 
>> Michael
>> 
>> 
>> RSEQ(2)                    Linux Programmer's Manual                   RSEQ(2)
>> 
>> NAME
>>       rseq - Restartable sequences and CPU number cache
>> 
>> SYNOPSIS
>>       #include <linux/rseq.h>
>> 
>>       int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);
>> 
>> DESCRIPTION
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Imagine  you  are  someone who is pretty new to this │
>>       │idea...  What is notably lacking from this  page  is │
>>       │an overview explaining:                              │
>>       │                                                     │
>>       │    * What a restartable sequence actually is.       │
>>       │                                                     │
>>       │    * An outline of the steps to perform when using  │
>>       │    restartable sequences / rseq(2).                 │
>>       │                                                     │
>>       │I.e.,  something  along  the  lines  of Jon Corbet's │
>>       │https://lwn.net/Articles/697979/.  Can you  come  up │
>>       │with something? (Part of it might be at the start of │
>>       │this page, and the rest in NOTES; it need not be all │
>>       │in one place.)                                       │
>>       └─────────────────────────────────────────────────────┘
> 
> We recently published a blog post about rseq, which might contain just the
> right level of information we are looking for here:
> 
> https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
> 
> Could something along the following lines work ?
> 
> "A restartable sequence is a sequence of instructions guaranteed to be
> executed atomically with respect to other threads and signal handlers on the
> current CPU. If its execution does not complete atomically, the kernel changes
> the execution flow by jumping to an abort handler defined by user-space for
> that restartable sequence.
> 
> Using restartable sequences requires registering a __rseq_abi thread-local
> storage data structure (struct rseq) through the rseq(2) system call. Only
> one __rseq_abi can be registered per thread, so user-space libraries and
> applications must follow a user-space ABI defining how to share this
> resource. The ABI defining how to share this resource between applications
> and libraries is defined by the C library.
> 
> The __rseq_abi contains a rseq_cs field which points to the currently executing
> critical section. For each thread, a single rseq critical section can run at any
> given point. Each critical section needs to be implemented in assembly."
> 
> 
>>       The  rseq()  ABI  accelerates  user-space operations on per-CPU data by
>>       defining a shared data structure ABI between each user-space thread and
>>       the kernel.
>> 
>>       It allows user-space to perform update operations on per-CPU data with‐
>>       out requiring heavy-weight atomic operations.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In the following para: "a  hardware  execution  con‐ │
>>       │text"?   What  is  the contrast being drawn here? It │
>>       │would be good to state it more explicitly.           │
>>       └─────────────────────────────────────────────────────┘
> 
> Here I'm trying to clarify what we mean by "CPU" in this document. We define
> a CPU as having its own number returned by sched_getcpu(), which I think is
> sometimes referred to as "logical cpu". This is the current hyperthread on
> the current core, on the current "physical CPU", in the current socket.
> 
> 
>>       The term CPU used in this documentation refers to a hardware  execution
>>       context.
>> 
>>       Restartable  sequences are atomic with respect to preemption (making it
>>       atomic with respect to other threads running on the same CPU), as  well
>>       as  signal delivery (user-space execution contexts nested over the same
>>       thread).  They either complete atomically with respect to preemption on
>>       the current CPU and signal delivery, or they are aborted.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  preceding sentence, we need a definition of │
>>       │"current CPU".                                       │
>>       └─────────────────────────────────────────────────────┘
> 
> Not sure how to word it. If a thread or signal handler execution context can
> possibly run and issue, for instance, "sched_getcpu()" between the beginning
> and the end of the critical section and get the same logical CPU number as the
> current thread, then we are guaranteed to abort. Of course, sched_getcpu() is
> just one way to get the CPU number, considering that we can also read it
> from the __rseq_abi cpu_id and cpu_id_start fields.
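
As an illustration of the two ways of obtaining the CPU number mentioned above, here is a minimal sketch that prefers the cached `cpu_id` field and falls back to asking the kernel. The structure `rseq_area` and the function `current_cpu()` are hypothetical names; the raw `getcpu` system call stands in for `sched_getcpu()`, and the sketch assumes a Linux system.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical mirror of the documented struct rseq layout. */
struct rseq_area {
    uint32_t cpu_id_start;
    uint32_t cpu_id;        /* -1 while rseq is not registered */
    uint64_t rseq_cs;
    uint32_t flags;
} __attribute__((aligned(32)));

/*
 * Return the logical CPU number of the calling thread, or -1 if it
 * cannot be determined. The value may already be stale by the time the
 * caller uses it, since the thread can migrate at any point.
 */
int current_cpu(const volatile struct rseq_area *area)
{
    int32_t cached = (int32_t)area->cpu_id;

    if (cached >= 0)
        return cached;      /* rseq registered: use the kernel-updated cache */

#ifdef SYS_getcpu
    unsigned int cpu = 0;

    if (syscall(SYS_getcpu, &cpu, NULL, NULL) == 0)
        return (int)cpu;    /* slow path: ask the kernel directly */
#endif
    return -1;
}
```

Either path only reports where the thread was running at the time of the read; the abort guarantee discussed above is what makes the number usable inside a critical section.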
> 
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In the following, does "It  is"  mean   "Restartable │
>>       │sequences are"?                                      │
>>       └─────────────────────────────────────────────────────┘
>>       It is suited for update operations on per-CPU data.
> 
> Yes.
> 
> 
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  following,  does "It is" mean  "Restartable │
>>       │sequences are"?                                      │
>>       └─────────────────────────────────────────────────────┘
> 
> "Restartable sequences can be..."
> 
>>       It can be used on data  structures  shared  between  threads  within  a
>>       process, and on data structures shared between threads across different
>>       processes.
>> 
>>       Some examples of operations that can be accelerated or improved by this
>>       ABI:
>> 
>>       · Memory allocator per-CPU free-lists
>> 
>>       · Querying the current CPU number
>> 
>>       · Incrementing per-CPU counters
>> 
>>       · Modifying data protected by per-CPU spinlocks
>> 
>>       · Inserting/removing elements in per-CPU linked-lists
>> 
>>       · Writing/reading per-CPU ring buffers content
>> 
>>       · Accurately  reading performance monitoring unit counters with respect
>>         to thread migration
>> 
>>       Restartable sequences must not perform  system  calls.   Doing  so  may
>>       result in termination of the process by a segmentation fault.
>> 
>>       The rseq argument is a pointer to the thread-local rseq structure to be
>>       shared between kernel and user-space.  The layout of this structure  is
>>       shown below.
>> 
>>       The rseq_len argument is the size of the struct rseq to register.
>> 
>>       The  flags  argument is 0 for registration, or RSEQ_FLAG_UNREGISTER for
>>       unregistration.
>> 
>>       The sig argument is the 32-bit signature  to  be  expected  before  the
>>       abort handler code.
>> 
>>   The rseq structure
>>       The  struct  rseq  is aligned on a 32-byte boundary.  This structure is
>>       extensible.  Its size is passed as parameter to the rseq() system call.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Below, I added the structure definition (in abbrevi‐ │
>>       │ated form).  Is there any reason not to do this?     │
>>       └─────────────────────────────────────────────────────┘
> 
> It seems appropriate.
> 
>> 
>>           struct rseq {
>>               __u32             cpu_id_start;
>>               __u32             cpu_id;
>>               union {
>>                   __u64 ptr64;
>>           #ifdef __LP64__
>>                   __u64 ptr;
>>           #else
>>                   ....
>>           #endif
>>               }                 rseq_cs;
>>               __u32             flags;
>>           } __attribute__((aligned(4 * sizeof(__u64))));
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  text  below, I think it would be helpful to │
>>       │explicitly note which of these fields are set by the │
>>       │kernel  (on  return  from the rseq() call) and which │
>>       │are set by the caller (before  calling  rseq()).  Is │
>>       │the following correct:                               │
>>       │                                                     │
>>       │    cpu_id_start - initialized by caller to possible │
>>       │    CPU number (e.g., 0), updated by kernel          │
>>       │    on return                                        │
> 
> "initialized by caller to possible CPU number (e.g., 0), updated
> by the kernel on return, and updated by the kernel on return after
> thread migration to a different CPU"
> 
>>       │                                                     │
>>       │    cpu_id - initialized to -1 by caller,            │
>>       │    updated by kernel on return                      │
> 
> "initialized to -1 by caller, updated by the kernel on return, and
> updated by the kernel on return after thread migration to a different
> CPU"
> 
>>       │                                                     │
>>       │    rseq_cs - initialized by caller, either to NULL  │
>>       │    or a pointer to an 'rseq_cs' structure           │
>>       │    that is initialized by the caller                │
> 
> "initialized by caller to NULL, then, after returning from successful
> registration, updated to a pointer to an "rseq_cs" structure by user-space.
> Set to NULL by the kernel when it restarts a rseq critical section,
> when it preempts or delivers a signal outside of the range targeted by the
> rseq_cs. Set to NULL by user-space before reclaiming memory that
> contains the targeted struct rseq_cs."
> 
> 
>>       │                                                     │
>>       │    flags - initialized by caller, used by kernel    │
>>       └─────────────────────────────────────────────────────┘
>> 
>>       The structure fields are as follows:
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  following paragraph, and in later places, I │
>>       │changed "current thread" to "calling thread". Okay?  │
>>       └─────────────────────────────────────────────────────┘
> 
> Yes.
> 
>> 
>>       cpu_id_start
>>              Optimistic cache of the CPU number on which the  calling  thread
>>              is  running.  The value in this field is guaranteed to always be
>>              a possible CPU number, even when rseq is not  initialized.   The
>>              value  it  contains  should  always  be confirmed by reading the
>>              cpu_id field.
>> 
>>              ┌─────────────────────────────────────────────────────┐
>>              │FIXME                                                │
>>              ├─────────────────────────────────────────────────────┤
>>              │What does the last sentence mean?                    │
>>              └─────────────────────────────────────────────────────┘
> 
> It means the caller thread can always use __rseq_abi.cpu_id_start to index an
> array of per-cpu data and this won't cause an out-of-bounds access on load, but
> it does not mean it really contains the current CPU number. For instance, if
> rseq registration failed, it will contain "0".
> 
> Therefore, it's fine to use cpu_id_start to fetch per-cpu data, but the
> cpu_id_start value should then be compared against the cpu_id field, so the
> case where rseq is not registered is caught. In that case, cpu_id_start=0, but
> cpu_id=-1 or -2, which differ, and therefore the critical section needs to jump
> to the abort handler.
> 
>> 
>>              This field is an optimistic cache in the sense that it is always
>>              guaranteed  to hold a valid CPU number in the range [0..(nr_pos‐
>>              sible_cpus - 1)].  It can therefore be loaded by user-space  and
>>              used  as  an offset in per-CPU data structures without having to
>>              check whether its value is within the valid bounds  compared  to
>>              the number of possible CPUs in the system.
>> 
>>              For  user-space  applications  executed on a kernel without rseq
>>              support, the cpu_id_start field stays initialized at 0, which is
>>              indeed  a  valid CPU number.  It is therefore valid to use it as
>>              an offset in per-CPU data structures, and only validate  whether
>>              it's  actually  the  current CPU number by comparing it with the
>>              cpu_id field within the rseq critical section.
>> 
>>              If the kernel does not provide rseq support, that  cpu_id  field
>>              stays  initialized  at  -1,  so  the comparison always fails, as
>>              intended.  It is then up to user-space to use a fall-back mecha‐
>>              nism, considering that rseq is not available.
>> 
>>              ┌─────────────────────────────────────────────────────┐
>>              │FIXME                                                │
>>              ├─────────────────────────────────────────────────────┤
>>              │The  last  sentence is rather difficult to grok. Can │
>>              │we say some more here?                               │
>>              └─────────────────────────────────────────────────────┘
> 
> Perhaps we could use the explanation I've written above in my reply ?
> 
>> 
>>       cpu_id Cache of the CPU number on which the calling thread is  running.
>>              -1 if uninitialized.
>> 
>>       rseq_cs
>>              The  rseq_cs  field  is a pointer to a struct rseq_cs (described
>>              below).  It is NULL when no rseq assembly block critical section
>>              is  active  for  the  calling  thread.  Setting it to point to a
>>              critical section descriptor (struct rseq_cs) marks the beginning
>>              of the critical section.
>> 
>>       flags  Flags  indicating  the  restart behavior for the calling thread.
>>              This is mainly used for debugging purposes.  Can be either:
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
> 
> Inhibit instruction sequence block restart on preemption for this thread.
> 
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
> 
> Inhibit instruction sequence block restart on signal delivery for this thread.
> Restart on signal can only be inhibited when restart on preemption and restart
> on migration are inhibited too, else it will terminate the offending process
> with a segmentation fault.
> 
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
> 
> Inhibit instruction sequence block restart on migration for this thread.
> 
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Each of the above values needs an explanation.       │
>>       │                                                     │
>>       │Is it correct that only one of  the  values  may  be │
>>       │specified in 'flags'? I ask because in the 'rseq_cs' │
>>       │structure below, the 'flags' field  is  a  bit  mask │
>>       │where  any  combination  of  these flags may be ORed │
>>       │together.                                            │
>>       │                                                     │
>>       └─────────────────────────────────────────────────────┘
> 
> Those are also masks and can be ORed.
> 
> 
>> 
>>   The rseq_cs structure
>>       The struct rseq_cs is aligned on a 32-byte boundary  and  has  a  fixed
>>       size of 32 bytes.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Below, I added the structure definition (in abbrevi‐ │
>>       │ated form).  Is there any reason not to do this?     │
>>       └─────────────────────────────────────────────────────┘
> 
> It's fine.
> 
>> 
>>           struct rseq_cs {
>>               __u32   version;
>>               __u32   flags;
>>               __u64   start_ip;
>>               __u64   post_commit_offset;
>>               __u64   abort_ip;
>>           } __attribute__((aligned(4 * sizeof(__u64))));
>> 
>>       The structure fields are as follows:
>> 
>>       version
>>              Version of this structure.
>> 
>>              ┌─────────────────────────────────────────────────────┐
>>              │FIXME                                                │
>>              ├─────────────────────────────────────────────────────┤
>>              │What does 'version' need to be initialized to?       │
>>              └─────────────────────────────────────────────────────┘
> 
> Currently version needs to be 0. Eventually, if we implement support for new
> flags to rseq(), we could add feature flags which register support for newer
> versions of struct rseq_cs.
> 
>> 
>>       flags  Flags indicating the restart behavior of this structure.  Can be
>>              a combination of:
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
> 
> Inhibit instruction sequence block restart on preemption for this thread.
> 
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
> 
> Inhibit instruction sequence block restart on signal delivery for this thread.
> Restart on signal can only be inhibited when restart on preemption and restart
> on migration are inhibited too, else it will terminate the offending process
> with a segmentation fault.
> 
>> 
>>              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
> 
> Inhibit instruction sequence block restart on migration for this thread.
> 
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Each of the above values needs an explanation.       │
>>       └─────────────────────────────────────────────────────┘
>> 
>>       start_ip
>>              Instruction  pointer  address  of  the  first instruction of the
>>              sequence of consecutive assembly instructions.
>> 
>>       post_commit_offset
>>              Offset (from start_ip address) of the  address  after  the  last
>>              instruction  of  the  sequence  of consecutive assembly instruc‐
>>              tions.
>> 
>>       abort_ip
>>              Instruction pointer address where to move the execution flow  in
>>              case  of  abort of the sequence of consecutive assembly instruc‐
>>              tions.
>> 
>> NOTES
>>       A single library per process  should  keep  the  rseq  structure  in  a
>>       thread-local  storage variable.  The cpu_id field should be initialized
>>       to -1, and the cpu_id_start field should be initialized to  a  possible
>>       CPU value (typically 0).
> 
> The part above is not quite right. All applications/libraries wishing to
> register rseq must follow the ABI specified by the C library. The rseq TLS
> symbol can be defined within more than a single application/library, but in
> the end only one symbol will be chosen for the process's global symbol table.
> 
>> 
>>       Each  thread  is responsible for registering and unregistering its rseq
>>       structure.  No more than one rseq structure address can  be  registered
>>       per thread at a given time.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  following paragraph, what is the difference │
>>       │between "freed" and "reclaim"?  I'm  supposing  they │
>>       │mean the same thing, but it's not clear. And if they │
>>       │do mean the same thing, then the first two sentences │
>>       │appear to contain contradictory information.         │
>>       └─────────────────────────────────────────────────────┘
> 
> They mean the same thing, and they are subtly not contradictory.
> 
> The first states that memory of a _registered_ rseq object must not
> be freed before the thread exits.
> 
> The second states that memory of a rseq object must not be freed before
> it is unregistered or the thread exits.
> 
> Do you have an alternative wording in mind to make this clearer ?
> 
>> 
>>       Memory  of a registered rseq object must not be freed before the thread
>>       exits.  Reclaim of rseq object's memory must only be done after  either
>>       an explicit rseq unregistration is performed or after the thread exits.
>>       Keep in mind that the implementation of  the  Thread-Local  Storage  (C
>>       language  __thread)  lifetime  does  not guarantee existence of the TLS
>>       area up until the thread exits.
>> 
>>       In a typical usage scenario, the thread registering the rseq  structure
>>       will be performing loads and stores from/to that structure.  It is how‐
>>       ever also allowed to read that structure from other threads.  The  rseq
>>       field  updates performed by the kernel provide relaxed atomicity seman‐
>>       tics, which guarantee that  other  threads  performing  relaxed  atomic
>>       reads of the CPU number cache will always observe a consistent value.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │In  the  preceding  paragraph, can we reasonably add │
>>       │some words to explain "relaxed atomicity  semantics" │
>>       │and "relaxed atomic reads"?                          │
>>       └─────────────────────────────────────────────────────┘
> 
> Not sure how to word this exactly, but here it means the stores and loads need
> to be done atomically, but don't require nor provide any ordering guarantees
> with respect to other loads/stores (no memory barriers).
> 
>> 
>> RETURN VALUE
>>       A  return  value of 0 indicates success.  On error, -1 is returned, and
>>       errno is set appropriately.
>> 
>> ERRORS
>>       EBUSY  Restartable sequence is already registered for this thread.
>> 
>>       EFAULT rseq is an invalid address.
>> 
>>       EINVAL Either flags contains an invalid  value,  or  rseq  contains  an
>>              address which is not appropriately aligned, or rseq_len contains
>>              a size that does not match the size received on registration.
>> 
>>              ┌─────────────────────────────────────────────────────┐
>>              │FIXME                                                │
>>              ├─────────────────────────────────────────────────────┤
>>              │The last case "rseq_len contains a  size  that  does │
>>              │not  match  the  size  received on registration" can │
>>              │occur only on RSEQ_FLAG_UNREGISTER, right?           │
>>              └─────────────────────────────────────────────────────┘
>> 
>>       ENOSYS The rseq() system call is not implemented by this kernel.
>> 
>>       EPERM  The sig argument on unregistration does not match the  signature
>>              received on registration.
>> 
>> VERSIONS
>>       The rseq() system call was added in Linux 4.18.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │What is the current state of library support?        │
>>       └─────────────────────────────────────────────────────┘
> 
> After going through a few RFC rounds, it's been posted as non-rfc a
> few weeks ago. It is pending review from glibc maintainers. I currently
> aim for inclusion of the rseq TLS registration by glibc for glibc 2.30:
> 
> https://sourceware.org/ml/libc-alpha/2019-02/msg00317.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00320.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00319.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00318.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00321.html
> 
> Note that the C library will define a user-space ABI which states how
> applications/libraries wishing to register the rseq TLS need to behave so they
> are compatible with the C library when it gets updated to a new version
> providing rseq registration support. It seems like an important point to
> document, perhaps even here in the rseq(2) man page.
> 
> 
>> 
>> CONFORMING TO
>>       rseq() is Linux-specific.
>> 
>>       ┌─────────────────────────────────────────────────────┐
>>       │FIXME                                                │
>>       ├─────────────────────────────────────────────────────┤
>>       │Is  there  any  example  code that can reasonably be │
>>       │included in this manual page? Or some  example  code │
>>       │that can be referred to?                             │
>>       └─────────────────────────────────────────────────────┘
>> 
> 
> The per-cpu counter example we have here seems compact enough:
> 
> https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
> 
> Thanks,
> 
> Mathieu
> 
> 
>> SEE ALSO
>>       sched_getcpu(3), membarrier(2)
>> 
>> --
>> Michael Kerrisk
>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>> Linux/UNIX System Programming Training: http://man7.org/training/
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH man-pages] Add rseq manpage
  2019-02-28  8:42   ` Michael Kerrisk (man-pages)
@ 2019-03-04 18:02     ` Mathieu Desnoyers
From: Mathieu Desnoyers @ 2019-03-04 18:02 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: linux-kernel, linux-api, Peter Zijlstra, Paul E . McKenney,
	Boqun Feng, Andy Lutomirski, Dave Watson, Paul Turner,
	Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon


----- On Feb 28, 2019, at 3:42 AM, Michael Kerrisk mtk.manpages@gmail.com wrote:

> On 12/6/18 3:42 PM, Mathieu Desnoyers wrote:
>> [ Michael, rseq(2) was merged into 4.18. Can you have a look at this
>>   patch which adds rseq documentation to the man-pages project ? ]
> Hi Mathieu
> 
> Sorry for the long delay. I've merged this page into a private
> branch and have done quite a lot of editing. I have many
> questions :-).

No worries, thanks for looking into it!

> 
> In the first instance, I think it is probably best to have
> a free-form text discussion rather than firing patches
> back and forward. Could you take a look at the questions below
> and respond?

Sure,

> 
> Thanks,
> 
> Michael
> 
> 
> RSEQ(2)                    Linux Programmer's Manual                   RSEQ(2)
> 
> NAME
>       rseq - Restartable sequences and CPU number cache
> 
> SYNOPSIS
>       #include <linux/rseq.h>
> 
>       int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);
> 
> DESCRIPTION
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Imagine  you  are  someone who is pretty new to this │
>       │idea...  What is notably lacking from this  page  is │
>       │an overview explaining:                              │
>       │                                                     │
>       │    * What a restartable sequence actually is.       │
>       │                                                     │
>       │    * An outline of the steps to perform when using  │
>       │    restartable sequences / rseq(2).                 │
>       │                                                     │
>       │I.e.,  something  along  the  lines  of Jon Corbet's │
>       │https://lwn.net/Articles/697979/.  Can you  come  up │
>       │with something? (Part of it might be at the start of │
>       │this page, and the rest in NOTES; it need not be all │
>       │in one place.)                                       │
>       └─────────────────────────────────────────────────────┘

We recently published a blog post about rseq, which might contain just the
right level of information we are looking for here:

https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/

Could something along the following lines work ?

"A restartable sequence is a sequence of instructions guaranteed to be
executed atomically with respect to other threads and signal handlers on the
current CPU. If its execution does not complete atomically, the kernel changes
the execution flow by jumping to an abort handler defined by user-space for
that restartable sequence.

Using restartable sequences requires registering a __rseq_abi thread-local
storage data structure (struct rseq) through the rseq(2) system call. Only one
__rseq_abi can be registered per thread, so user-space libraries and
applications must follow a user-space ABI, defined by the C library, which
specifies how to share this resource.

The __rseq_abi contains a rseq_cs field which points to the currently executing
critical section. For each thread, a single rseq critical section can run at any
given point. Each critical section needs to be implemented in assembly."


>       The  rseq()  ABI  accelerates  user-space operations on per-CPU data by
>       defining a shared data structure ABI between each user-space thread and
>       the kernel.
> 
>       It allows user-space to perform update operations on per-CPU data with‐
>       out requiring heavy-weight atomic operations.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In the following para: "a  hardware  execution  con‐ │
>       │text"?   What  is  the contrast being drawn here? It │
>       │would be good to state it more explicitly.           │
>       └─────────────────────────────────────────────────────┘

Here I'm trying to clarify what we mean by "CPU" in this document. We define
a CPU as having its own number returned by sched_getcpu(), which I think is
sometimes referred to as "logical cpu". This is the current hyperthread on
the current core, on the current "physical CPU", in the current socket.


>       The term CPU used in this documentation refers to a hardware  execution
>       context.
> 
>       Restartable  sequences are atomic with respect to preemption (making it
>       atomic with respect to other threads running on the same CPU), as  well
>       as  signal delivery (user-space execution contexts nested over the same
>       thread).  They either complete atomically with respect to preemption on
>       the current CPU and signal delivery, or they are aborted.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  preceding sentence, we need a definition of │
>       │"current CPU".                                       │
>       └─────────────────────────────────────────────────────┘

Not sure how to word it. If a thread or signal handler execution context can
possibly run and issue, for instance, "sched_getcpu()" between the beginning
and the end of the critical section and get the same logical CPU number as the
current thread, then we are guaranteed to abort. Of course, sched_getcpu() is
just one way to get the CPU number, considering that we can also read it
from the __rseq_abi cpu_id and cpu_id_start fields.

> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In the following, does "It  is"  mean   "Restartable │
>       │sequences are"?                                      │
>       └─────────────────────────────────────────────────────┘
>       It is suited for update operations on per-CPU data.

Yes.


> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  following,  does "It is" mean  "Restartable │
>       │sequences are"?                                      │
>       └─────────────────────────────────────────────────────┘

"Restartable sequences can be..."

>       It can be used on data  structures  shared  between  threads  within  a
>       process, and on data structures shared between threads across different
>       processes.
> 
>       Some examples of operations that can be accelerated or improved by this
>       ABI:
> 
>       · Memory allocator per-CPU free-lists
> 
>       · Querying the current CPU number
> 
>       · Incrementing per-CPU counters
> 
>       · Modifying data protected by per-CPU spinlocks
> 
>       · Inserting/removing elements in per-CPU linked-lists
> 
>       · Writing/reading per-CPU ring buffers content
> 
>       · Accurately  reading performance monitoring unit counters with respect
>         to thread migration
> 
>       Restartable sequences must not perform  system  calls.   Doing  so  may
>       result in termination of the process by a segmentation fault.
> 
>       The rseq argument is a pointer to the thread-local rseq structure to be
>       shared between kernel and user-space.  The layout of this structure  is
>       shown below.
> 
>       The rseq_len argument is the size of the struct rseq to register.
> 
>       The  flags  argument is 0 for registration, or RSEQ_FLAG_UNREGISTER for
>       unregistration.
> 
>       The sig argument is the 32-bit signature  to  be  expected  before  the
>       abort handler code.
> 
>   The rseq structure
>       The  struct  rseq  is aligned on a 32-byte boundary.  This structure is
>       extensible.  Its size is passed as parameter to the rseq() system call.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Below, I added the structure definition (in abbrevi‐ │
>       │ated form).  Is there any reason not to do this?     │
>       └─────────────────────────────────────────────────────┘

It seems appropriate.

> 
>           struct rseq {
>               __u32             cpu_id_start;
>               __u32             cpu_id;
>               union {
>                   __u64 ptr64;
>           #ifdef __LP64__
>                   __u64 ptr;
>           #else
>                   ....
>           #endif
>               }                 rseq_cs;
>               __u32             flags;
>           } __attribute__((aligned(4 * sizeof(__u64))));
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  text  below, I think it would be helpful to │
>       │explicitly note which of these fields are set by the │
>       │kernel  (on  return from the rseq()  call) and which │
>       │are set by the caller (before  calling  rseq()).  Is │
>       │the following correct:                               │
>       │                                                     │
>       │    cpu_id_start - initialized by caller to possible │
>       │    CPU number (e.g., 0), updated by kernel          │
>       │    on return                                        │

"initialized by caller to possible CPU number (e.g., 0), updated
by the kernel on return, and updated by the kernel on return after
thread migration to a different CPU"

>       │                                                     │
>       │    cpu_id - initialized to -1 by caller,            │
>       │    updated by kernel on return                      │

"initialized to -1 by caller, updated by the kernel on return, and
updated by the kernel on return after thread migration to a different
CPU"

>       │                                                     │
>       │    rseq_cs - initialized by caller, either to NULL  │
>       │    or a pointer to an 'rseq_cs' structure           │
>       │    that is initialized by the caller                │

"initialized by caller to NULL, then, after returning from successful
registration, updated to a pointer to an "rseq_cs" structure by user-space.
Set to NULL by the kernel when it restarts a rseq critical section,
when it preempts or delivers a signal outside of the range targeted by the
rseq_cs. Set to NULL by user-space before reclaiming memory that
contains the targeted struct rseq_cs."
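
The caller-side initialization described in the answers above could be
sketched roughly as follows. This is only an illustration under assumptions:
struct rseq_sketch and rseq_sketch_init() are hypothetical names, the struct
is a local mirror of the abbreviated definition quoted above (with the
rseq_cs union collapsed to a single 64-bit value), and real code would use
the definition from <linux/rseq.h> instead.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical local mirror of the abbreviated struct rseq shown above. */
struct rseq_sketch {
    uint32_t cpu_id_start;  /* caller: possible CPU number; kernel updates */
    uint32_t cpu_id;        /* caller: -1; kernel updates */
    uint64_t rseq_cs;       /* caller: NULL; user-space sets it later */
    uint32_t flags;         /* caller: initialized; kernel reads */
} __attribute__((aligned(4 * sizeof(uint64_t))));

/* The draft states that struct rseq is aligned on a 32-byte boundary. */
static_assert(alignof(struct rseq_sketch) == 32, "expected 32-byte alignment");
static_assert(sizeof(struct rseq_sketch) == 32, "expected padding to 32 bytes");

/* Set every field to the value the caller is expected to provide
 * before registration. */
static void rseq_sketch_init(struct rseq_sketch *rs)
{
    rs->cpu_id_start = 0;          /* any possible CPU number, e.g. 0 */
    rs->cpu_id = (uint32_t)-1;     /* "uninitialized" marker */
    rs->rseq_cs = 0;               /* NULL: no critical section active */
    rs->flags = 0;                 /* no restart inhibition */
}
```

After a successful registration the kernel, not the caller, keeps
cpu_id_start and cpu_id up to date; the sketch covers only the state before
the rseq() call.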


>       │                                                     │
>       │    flags - initialized by caller, used by kernel    │
>       └─────────────────────────────────────────────────────┘
> 
>       The structure fields are as follows:
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  following paragraph, and in later places, I │
>       │changed "current thread" to "calling thread". Okay?  │
>       └─────────────────────────────────────────────────────┘

Yes.

> 
>       cpu_id_start
>              Optimistic cache of the CPU number on which the  calling  thread
>              is  running.  The value in this field is guaranteed to always be
>              a possible CPU number, even when rseq is not  initialized.   The
>              value  it  contains  should  always  be confirmed by reading the
>              cpu_id field.
> 
>              ┌─────────────────────────────────────────────────────┐
>              │FIXME                                                │
>              ├─────────────────────────────────────────────────────┤
>              │What does the last sentence mean?                    │
>              └─────────────────────────────────────────────────────┘

It means the caller thread can always use __rseq_abi.cpu_id_start to index an
array of per-cpu data and this won't cause an out-of-bounds access on load, but
it does not mean it really contains the current CPU number. For instance, if
rseq registration failed, it will contain "0".

Therefore, it's fine to use cpu_id_start to fetch per-cpu data, but the
cpu_id_start value should then be compared against the cpu_id field, so the
case where rseq is not registered is caught. In that case, cpu_id_start=0, but
cpu_id=-1 or -2, which differ, and therefore the critical section needs to jump
to the abort handler.
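
To make the index-then-compare logic concrete, here is a minimal C model of
it. It is a sketch under assumptions, not how a real critical section is
written: struct cpu_cache, NR_POSSIBLE_CPUS and pick_slot() are hypothetical
names, and in actual rseq code the comparison against cpu_id happens inside
the assembly critical section, with a mismatch branching to the abort
handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two CPU-number fields of struct rseq. */
struct cpu_cache {
    uint32_t cpu_id_start;  /* always a possible CPU number (0 if unregistered) */
    uint32_t cpu_id;        /* current CPU, or -1 / -2 when rseq is unusable */
};

#define NR_POSSIBLE_CPUS 4  /* assumption made for this sketch only */

static_assert(NR_POSSIBLE_CPUS > 0, "CPU 0 must be a valid index");

/*
 * Select the per-CPU slot using cpu_id_start, which is safe to use as an
 * index without a bounds check, then report whether cpu_id confirms it.
 * A false return models the jump to the abort handler.
 */
static bool pick_slot(const struct cpu_cache *c, long *percpu, long **slot)
{
    uint32_t cpu = c->cpu_id_start;

    *slot = &percpu[cpu];        /* in bounds even if rseq is unregistered */
    return cpu == c->cpu_id;     /* fails when cpu_id is -1 or -2 */
}
```

With cpu_id_start=0 and cpu_id=-1 the slot lookup still lands on a valid
element, but the comparison fails, which is the case where user-space has to
fall back to another mechanism.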

> 
>              This field is an optimistic cache in the sense that it is always
>              guaranteed  to hold a valid CPU number in the range [0..(nr_pos‐
>              sible_cpus - 1)].  It can therefore be loaded by user-space  and
>              used  as  an offset in per-CPU data structures without having to
>              check whether its value is within the valid bounds  compared  to
>              the number of possible CPUs in the system.
> 
>              For  user-space  applications  executed on a kernel without rseq
>              support, the cpu_id_start field stays initialized at 0, which is
>              indeed  a  valid CPU number.  It is therefore valid to use it as
>              an offset in per-CPU data structures, and only validate  whether
>              it's  actually  the  current CPU number by comparing it with the
>              cpu_id field within the rseq critical section.
> 
>              If the kernel does not provide rseq support, that  cpu_id  field
>              stays  initialized  at  -1,  so  the comparison always fails, as
>              intended.  It is then up to user-space to use a fall-back mecha‐
>              nism, considering that rseq is not available.
> 
>              ┌─────────────────────────────────────────────────────┐
>              │FIXME                                                │
>              ├─────────────────────────────────────────────────────┤
>              │The  last  sentence is rather difficult to grok. Can │
>              │we say some more here?                               │
>              └─────────────────────────────────────────────────────┘

Perhaps we could use the explanation I've written above in my reply ?

> 
>       cpu_id Cache of the CPU number on which the calling thread is  running.
>              -1 if uninitialized.
> 
>       rseq_cs
>              The  rseq_cs  field  is a pointer to a struct rseq_cs (described
>              below).  It is NULL when no rseq assembly block critical section
>              is  active  for  the  calling  thread.  Setting it to point to a
>              critical section descriptor (struct rseq_cs) marks the beginning
>              of the critical section.
> 
>       flags  Flags  indicating  the  restart behavior for the calling thread.
>              This is mainly used for debugging purposes.  Can be either:
> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT

Inhibit instruction sequence block restart on preemption for this thread.

> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL

Inhibit instruction sequence block restart on signal delivery for this thread.
Restart on signal can only be inhibited when restart on preemption and restart
on migration are inhibited too; otherwise the kernel terminates the offending
process with a segmentation fault.

> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE

Inhibit instruction sequence block restart on migration for this thread.

> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Each of the above values needs an explanation.       │
>       │                                                     │
>       │Is it correct that only one of  the  values  may  be │
>       │specified in 'flags'? I ask because in the 'rseq_cs' │
>       │structure below, the 'flags' field  is  a  bit  mask │
>       │where  any  combination  of  these flags may be ORed │
>       │together.                                            │
>       │                                                     │
>       └─────────────────────────────────────────────────────┘

Those are also masks and can be ORed.
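
A small C sketch of what ORing the masks looks like, together with the
constraint on the signal flag described above. The constants are redefined
under local, hypothetical names so the sketch builds without kernel headers;
the bit positions are what I understand <linux/rseq.h> to use and should be
checked against the header. rseq_flags_ok() is not a kernel or libc function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the flag bits (assumed to match <linux/rseq.h>). */
enum {
    NO_RESTART_ON_PREEMPT = 1U << 0,
    NO_RESTART_ON_SIGNAL  = 1U << 1,
    NO_RESTART_ON_MIGRATE = 1U << 2,
};

/* Restart on signal may be inhibited only together with preempt and migrate. */
static bool rseq_flags_ok(uint32_t flags)
{
    const uint32_t both = NO_RESTART_ON_PREEMPT | NO_RESTART_ON_MIGRATE;

    if (flags & NO_RESTART_ON_SIGNAL)
        return (flags & both) == both;
    return true;
}

/* Usage: the masks combine with bitwise OR. */
static void rseq_flags_examples(void)
{
    assert(rseq_flags_ok(NO_RESTART_ON_PREEMPT | NO_RESTART_ON_MIGRATE));
    assert(!rseq_flags_ok(NO_RESTART_ON_SIGNAL));
}
```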


> 
>   The rseq_cs structure
>       The struct rseq_cs is aligned on a 32-byte boundary  and  has  a  fixed
>       size of 32 bytes.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Below, I added the structure definition (in abbrevi‐ │
>       │ated form).  Is there any reason not to do this?     │
>       └─────────────────────────────────────────────────────┘

It's fine.

> 
>           struct rseq_cs {
>               __u32   version;
>               __u32   flags;
>               __u64   start_ip;
>               __u64   post_commit_offset;
>               __u64   abort_ip;
>           } __attribute__((aligned(4 * sizeof(__u64))));
> 
>       The structure fields are as follows:
> 
>       version
>              Version of this structure.
> 
>              ┌─────────────────────────────────────────────────────┐
>              │FIXME                                                │
>              ├─────────────────────────────────────────────────────┤
>              │What does 'version' need to be initialized to?       │
>              └─────────────────────────────────────────────────────┘

Currently version needs to be 0. Eventually, if we implement support for new flags to rseq(),
we could add feature flags which register support for newer versions of struct rseq_cs.
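
As an illustration of the version-0 requirement, a descriptor could be filled
in as sketched below. struct rseq_cs_mirror is a local copy of the layout
quoted above, and make_rseq_cs() is a hypothetical helper; real code typically
emits the descriptor from the assembly block itself rather than building it at
run time.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the struct rseq_cs layout (hypothetical name). */
struct rseq_cs_mirror {
    uint32_t version;
    uint32_t flags;
    uint64_t start_ip;
    uint64_t post_commit_offset;
    uint64_t abort_ip;
} __attribute__((aligned(4 * sizeof(uint64_t))));

static_assert(sizeof(struct rseq_cs_mirror) == 32, "fixed 32-byte size");
static_assert(_Alignof(struct rseq_cs_mirror) == 32, "32-byte alignment");

/*
 * Build a descriptor from three code addresses: the first instruction,
 * the address right after the last instruction, and the abort handler.
 */
static struct rseq_cs_mirror make_rseq_cs(uint64_t start, uint64_t post_commit,
                                          uint64_t abort_ip)
{
    struct rseq_cs_mirror cs = {
        .version = 0,                               /* must currently be 0 */
        .flags = 0,
        .start_ip = start,
        .post_commit_offset = post_commit - start,  /* offset, not address */
        .abort_ip = abort_ip,
    };
    return cs;
}
```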

> 
>       flags  Flags indicating the restart behavior of this structure.  Can be
>              a combination of:
> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT

Inhibit instruction sequence block restart on preemption for this thread.

> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL

Inhibit instruction sequence block restart on signal delivery for this thread.
Restart on signal can only be inhibited when restart on preemption and restart
on migration are inhibited too; otherwise the kernel terminates the offending
process with a segmentation fault.

> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE

Inhibit instruction sequence block restart on migration for this thread.

> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Each of the above values needs an explanation.       │
>       └─────────────────────────────────────────────────────┘
> 
>       start_ip
>              Instruction  pointer  address  of  the  first instruction of the
>              sequence of consecutive assembly instructions.
> 
>       post_commit_offset
>              Offset (from start_ip address) of the  address  after  the  last
>              instruction  of  the  sequence  of consecutive assembly instruc‐
>              tions.
> 
>       abort_ip
>              Instruction pointer address where to move the execution flow  in
>              case  of  abort of the sequence of consecutive assembly instruc‐
>              tions.
> 
> NOTES
>       A single library per process  should  keep  the  rseq  structure  in  a
>       thread-local  storage variable.  The cpu_id field should be initialized
>       to -1, and the cpu_id_start field should be initialized to  a  possible
>       CPU value (typically 0).

The part above is not quite right. All applications/libraries wishing to register
rseq must follow the ABI specified by the C library. The rseq TLS symbol can be
defined by more than one application/library, but in the end only one symbol will
be chosen for the process's global symbol table.

> 
>       Each  thread  is responsible for registering and unregistering its rseq
>       structure.  No more than one rseq structure address can  be  registered
>       per thread at a given time.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  following paragraph, what is the difference │
>       │between "freed" and "reclaim"?  I'm  supposing  they │
>       │mean the same thing, but it's not clear. And if they │
>       │do mean the same thing, then the first two sentences │
>       │appear to contain contradictory information.         │
>       └─────────────────────────────────────────────────────┘

They mean the same thing, and the two sentences are subtly not contradictory.

The first states that memory of a _registered_ rseq object must not
be freed before the thread exits.

The second states that memory of a rseq object must not be freed before
it is unregistered or the thread exits.

Do you have an alternative wording in mind to make this clearer ?

> 
>       Memory  of a registered rseq object must not be freed before the thread
>       exits.  Reclaim of rseq object's memory must only be done after  either
>       an explicit rseq unregistration is performed or after the thread exits.
>       Keep in mind that the implementation of  the  Thread-Local  Storage  (C
>       language  __thread)  lifetime  does  not guarantee existence of the TLS
>       area up until the thread exits.
> 
>       In a typical usage scenario, the thread registering the rseq  structure
>       will be performing loads and stores from/to that structure.  It is how‐
>       ever also allowed to read that structure from other threads.  The  rseq
>       field  updates performed by the kernel provide relaxed atomicity seman‐
>       tics, which guarantee that  other  threads  performing  relaxed  atomic
>       reads of the CPU number cache will always observe a consistent value.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  preceding  paragraph, can we reasonably add │
>       │some words to explain "relaxed atomicity  semantics" │
>       │and "relaxed atomic reads"?                          │
>       └─────────────────────────────────────────────────────┘

Not sure how to word this exactly, but here it means that each store and load
must be performed atomically (as a single, untorn access), while neither
requiring nor providing any ordering guarantees with respect to other
loads/stores (no memory barriers).
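
In C11 terms this corresponds to memory_order_relaxed accesses. The sketch
below is an analogy only: remote_cpu_id is a hypothetical stand-in for the
cpu_id field of another thread's struct rseq, and publish_cpu_id() plays the
role of the kernel's update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for another thread's cpu_id field. */
static _Atomic uint32_t remote_cpu_id = (uint32_t)-1;

/* Writer side (the kernel, in reality): one untorn store, no barrier. */
static void publish_cpu_id(uint32_t cpu)
{
    atomic_store_explicit(&remote_cpu_id, cpu, memory_order_relaxed);
}

/* Reader side: one untorn load, unordered against other memory accesses. */
static uint32_t peek_cpu_id(void)
{
    return atomic_load_explicit(&remote_cpu_id, memory_order_relaxed);
}

/* Usage: within a single thread, a load observes that thread's last store. */
static void relaxed_example(void)
{
    publish_cpu_id(3);
    assert(peek_cpu_id() == 3);
}
```

A reader on another thread gets either the old or the new CPU number, never a
mix of the two, but may not infer anything about other memory from it.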

> 
> RETURN VALUE
>       A  return  value of 0 indicates success.  On error, -1 is returned, and
>       errno is set appropriately.
> 
> ERRORS
>       EBUSY  Restartable sequence is already registered for this thread.
> 
>       EFAULT rseq is an invalid address.
> 
>       EINVAL Either flags contains an invalid  value,  or  rseq  contains  an
>              address which is not appropriately aligned, or rseq_len contains
>              a size that does not match the size received on registration.
> 
>              ┌─────────────────────────────────────────────────────┐
>              │FIXME                                                │
>              ├─────────────────────────────────────────────────────┤
>              │The last case "rseq_len contains a  size  that  does │
>              │not  match  the  size  received on registration" can │
>              │occur only on RSEQ_FLAG_UNREGISTER, right?           │
>              └─────────────────────────────────────────────────────┘
> 
>       ENOSYS The rseq() system call is not implemented by this kernel.
> 
>       EPERM  The sig argument on unregistration does not match the  signature
>              received on registration.
> 
> VERSIONS
>       The rseq() system call was added in Linux 4.18.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │What is the current state of library support?        │
>       └─────────────────────────────────────────────────────┘

After going through a few RFC rounds, it was posted as a non-RFC series a
few weeks ago. It is pending review from glibc maintainers. I currently
aim for inclusion of the rseq TLS registration in glibc 2.30:

https://sourceware.org/ml/libc-alpha/2019-02/msg00317.html
https://sourceware.org/ml/libc-alpha/2019-02/msg00320.html
https://sourceware.org/ml/libc-alpha/2019-02/msg00319.html
https://sourceware.org/ml/libc-alpha/2019-02/msg00318.html
https://sourceware.org/ml/libc-alpha/2019-02/msg00321.html

Note that the C library will define a user-space ABI which states how
applications/libraries wishing to register the rseq TLS need to behave so they
are compatible with the C library when it gets updated to a new version providing
rseq registration support. It seems like an important point to document,
perhaps even here in the rseq(2) man page.


> 
> CONFORMING TO
>       rseq() is Linux-specific.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Is  there  any  example  code that can reasonably be │
>       │included in this manual page? Or some  example  code │
>       │that can be referred to?                             │
>       └─────────────────────────────────────────────────────┘
> 

The per-cpu counter example we have here seems compact enough:

https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/

Thanks,

Mathieu


> SEE ALSO
>       sched_getcpu(3), membarrier(2)
> 
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH man-pages] Add rseq manpage
@ 2019-03-04 18:02     ` Mathieu Desnoyers
  0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2019-03-04 18:02 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: linux-kernel, linux-api, Peter Zijlstra, Paul E . McKenney,
	Boqun Feng, Andy Lutomirski, Dave Watson, Paul Turner,
	Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will


----- On Feb 28, 2019, at 3:42 AM, Michael Kerrisk mtk.manpages@gmail.com wrote:

> On 12/6/18 3:42 PM, Mathieu Desnoyers wrote:
>> [ Michael, rseq(2) was merged into 4.18. Can you have a look at this
>>   patch which adds rseq documentation to the man-pages project ? ]
> Hi Matthieu
> 
> Sorry for the long delay. I've merged this page into a private
> branch and have done quite a lot of editing. I have many
> questions :-).

No worries, thanks for looking into it!

> 
> In the first instance, I think it is probably best to have
> a free-form text discussion rather than firing patches
> back and forward. Could you take a look at the questions below
> and respond?

Sure,

> 
> Thanks,
> 
> Michael
> 
> 
> RSEQ(2)                    Linux Programmer's Manual                   RSEQ(2)
> 
> NAME
>       rseq - Restartable sequences and CPU number cache
> 
> SYNOPSIS
>       #include <linux/rseq.h>
> 
>       int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);
> 
> DESCRIPTION
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Imagine  you  are  someone who is pretty new to this │
>       │idea...  What is notably lacking from this  page  is │
>       │an overview explaining:                              │
>       │                                                     │
>       │    * What a restartable sequence actually is.       │
>       │                                                     │
>       │    * An outline of the steps to perform when using  │
>       │    restartable sequences / rseq(2).                 │
>       │                                                     │
>       │I.e.,  something  along  the  lines  of Jon Corbet's │
>       │https://lwn.net/Articles/697979/.  Can you  come  up │
>       │with something? (Part of it might be at the start of │
>       │this page, and the rest in NOTES; it need not be all │
>       │in one place.)                                       │
>       └─────────────────────────────────────────────────────┘

We recently published a blog post about rseq, which might contain just the
right level of information we are looking for here:

https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/

Could something along the following lines work ?

"A restartable sequence is a sequence of instructions guaranteed to be
executed atomically with respect to other threads and signal handlers on the
current CPU. If its execution does not complete atomically, the kernel changes
the execution flow by jumping to an abort handler defined by user-space for
that restartable sequence.

Using restartable sequences requires registering a __rseq_abi thread-local storage
data structure (struct rseq) through the rseq(2) system call. Only one __rseq_abi
can be registered per thread, so user-space libraries and applications must follow
a user-space ABI, defined by the C library, which specifies how to share this
resource.

The __rseq_abi contains a rseq_cs field which points to the currently executing
critical section. For each thread, a single rseq critical section can run at any
given point. Each critical section needs to be implemented in assembly."
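
The registration step could look roughly like the C sketch below. It is a
sketch under assumptions, not the C library's ABI: struct rseq_area,
example_rseq, EXAMPLE_RSEQ_SIG and example_rseq_register() are hypothetical
names, the structure is a local mirror of the original 32-byte layout, and the
signature value is simply one picked for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical local mirror of the original 32-byte struct rseq. */
struct rseq_area {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    uint64_t rseq_cs;
    uint32_t flags;
} __attribute__((aligned(32)));

static_assert(sizeof(struct rseq_area) == 32, "expected 32-byte area");

#define EXAMPLE_RSEQ_SIG 0x53053053U    /* value chosen for this sketch */

/* One area per thread; cpu_id starts at -1, cpu_id_start at 0. */
static __thread struct rseq_area example_rseq = {
    .cpu_id_start = 0,
    .cpu_id = (uint32_t)-1,
};

/* Returns 0 on success, or the errno value explaining why rseq is unusable. */
static int example_rseq_register(void)
{
#ifdef __NR_rseq
    if (syscall(__NR_rseq, &example_rseq, (uint32_t)sizeof(example_rseq),
                0, EXAMPLE_RSEQ_SIG) == 0)
        return 0;
    return errno;
#else
    return ENOSYS;      /* headers too old to know the system call */
#endif
}
```

If something else in the process has already registered an area for the
thread, the call is expected to fail rather than replace it, which is why the
sharing ABI matters; the caller should then use its fall-back path.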


>       The  rseq()  ABI  accelerates  user-space operations on per-CPU data by
>       defining a shared data structure ABI between each user-space thread and
>       the kernel.
> 
>       It allows user-space to perform update operations on per-CPU data with‐
>       out requiring heavy-weight atomic operations.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In the following para: "a  hardware  execution  con‐ │
>       │text"?   What  is  the contrast being drawn here? It │
>       │would be good to state it more explicitly.           │
>       └─────────────────────────────────────────────────────┘

Here I'm trying to clarify what we mean by "CPU" in this document. We define
a CPU as having its own number returned by sched_getcpu(), which I think is
sometimes referred to as "logical cpu". This is the current hyperthread on
the current core, on the current "physical CPU", in the current socket.


>       The term CPU used in this documentation refers to a hardware  execution
>       context.
> 
>       Restartable  sequences are atomic with respect to preemption (making it
>       atomic with respect to other threads running on the same CPU), as  well
>       as  signal delivery (user-space execution contexts nested over the same
>       thread).  They either complete atomically with respect to preemption on
>       the current CPU and signal delivery, or they are aborted.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  preceding sentence, we need a definition of │
>       │"current CPU".                                       │
>       └─────────────────────────────────────────────────────┘

Not sure how to word it. If a thread or signal handler execution context can
possibly run and issue, for instance, "sched_getcpu()" between the beginning
and the end of the critical section and get the same logical CPU number as the
current thread, then we are guaranteed to abort. Of course, sched_getcpu() is
just one way to get the CPU number, considering that we can also read it
from the __rseq_abi cpu_id and cpu_id_start fields.

> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In the following, does "It  is"  mean   "Restartable │
>       │sequences are"?                                      │
>       └─────────────────────────────────────────────────────┘
>       It is suited for update operations on per-CPU data.

Yes.


> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  following,  does "It is" mean  "Restartable │
>       │sequences are"?                                      │
>       └─────────────────────────────────────────────────────┘

"Restartable sequences can be..."

>       It can be used on data  structures  shared  between  threads  within  a
>       process, and on data structures shared between threads across different
>       processes.
> 
>       Some examples of operations that can be accelerated or improved by this
>       ABI:
> 
>       · Memory allocator per-CPU free-lists
> 
>       · Querying the current CPU number
> 
>       · Incrementing per-CPU counters
> 
>       · Modifying data protected by per-CPU spinlocks
> 
>       · Inserting/removing elements in per-CPU linked-lists
> 
>       · Writing/reading per-CPU ring buffers content
> 
>       · Accurately  reading performance monitoring unit counters with respect
>         to thread migration
> 
>       Restartable sequences must not perform  system  calls.   Doing  so  may
>       result in termination of the process by a segmentation fault.
> 
>       The rseq argument is a pointer to the thread-local rseq structure to be
>       shared between kernel and user-space.  The layout of this structure  is
>       shown below.
> 
>       The rseq_len argument is the size of the struct rseq to register.
> 
>       The  flags  argument is 0 for registration, or RSEQ_FLAG_UNREGISTER for
>       unregistration.
> 
>       The sig argument is the 32-bit signature  to  be  expected  before  the
>       abort handler code.
> 
>   The rseq structure
>       The  struct  rseq  is aligned on a 32-byte boundary.  This structure is
>       extensible.  Its size is passed as parameter to the rseq() system call.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Below, I added the structure definition (in abbrevi‐ │
>       │ated form).  Is there any reason not to do this?     │
>       └─────────────────────────────────────────────────────┘

It seems appropriate.

> 
>           struct rseq {
>               __u32             cpu_id_start;
>               __u32             cpu_id;
>               union {
>                   __u64 ptr64;
>           #ifdef __LP64__
>                   __u64 ptr;
>           #else
>                   ....
>           #endif
>               }                 rseq_cs;
>               __u32             flags;
>           } __attribute__((aligned(4 * sizeof(__u64))));
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  text  below, I think it would be helpful to │
>       │explicitly note which of these fields are set by the │
>       │kernel  (on  return from the rseq()  call) and which │
>       │are set by the caller (before  calling  rseq()).  Is │
>       │the following correct:                               │
>       │                                                     │
>       │    cpu_id_start - initialized by caller to possible │
>       │    CPU number (e.g., 0), updated by kernel          │
>       │    on return                                        │

"initialized by caller to possible CPU number (e.g., 0), updated
by the kernel on return, and updated by the kernel on return after
thread migration to a different CPU"

>       │                                                     │
>       │    cpu_id - initialized to -1 by caller,            │
>       │    updated by kernel on return                      │

"initialized to -1 by caller, updated by the kernel on return, and
updated by the kernel on return after thread migration to a different
CPU"

>       │                                                     │
>       │    rseq_cs - initialized by caller, either to NULL  │
>       │    or a pointer to an 'rseq_cs' structure           │
>       │    that is initialized by the caller                │

"initialized by caller to NULL, then, after returning from successful
registration, updated to a pointer to an "rseq_cs" structure by user-space.
Set to NULL by the kernel when it restarts a rseq critical section, or
when it preempts the thread or delivers a signal outside of the range targeted
by the rseq_cs. Set to NULL by user-space before reclaiming memory that
contains the targeted struct rseq_cs."
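
The user-space side of that life cycle can be sketched in C as follows. The
types and helpers (struct cs_desc, struct rseq_fields, arm_cs(), release_cs())
are hypothetical stand-ins; in real code the store that arms the critical
section is part of the assembly block.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins holding only what the illustration needs. */
struct cs_desc {
    uint32_t version, flags;
    uint64_t start_ip, post_commit_offset, abort_ip;
};
struct rseq_fields {
    uint64_t rseq_cs;   /* 0 (NULL) or the address of a descriptor */
};

/* User-space arms a critical section by publishing the descriptor address. */
static void arm_cs(struct rseq_fields *r, const struct cs_desc *cs)
{
    assert(cs != NULL);
    r->rseq_cs = (uint64_t)(uintptr_t)cs;
}

/* Clear the pointer first, and only then reclaim the descriptor's memory. */
static void release_cs(struct rseq_fields *r, struct cs_desc *cs)
{
    if (r->rseq_cs == (uint64_t)(uintptr_t)cs)
        r->rseq_cs = 0;
    free(cs);
}
```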


>       │                                                     │
>       │    flags - initialized by caller, used by kernel    │
>       └─────────────────────────────────────────────────────┘
> 
>       The structure fields are as follows:
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  following paragraph, and in later places, I │
>       │changed "current thread" to "calling thread". Okay?  │
>       └─────────────────────────────────────────────────────┘

Yes.

> 
>       cpu_id_start
>              Optimistic cache of the CPU number on which the  calling  thread
>              is  running.  The value in this field is guaranteed to always be
>              a possible CPU number, even when rseq is not  initialized.   The
>              value  it  contains  should  always  be confirmed by reading the
>              cpu_id field.
> 
>              ┌─────────────────────────────────────────────────────┐
>              │FIXME                                                │
>              ├─────────────────────────────────────────────────────┤
>              │What does the last sentence mean?                    │
>              └─────────────────────────────────────────────────────┘

It means the caller thread can always use __rseq_abi.cpu_id_start to index an
array of per-cpu data and this won't cause an out-of-bound access on load, but it
does not mean it really contains the current CPU number. For instance, if rseq
registration failed, it will contain "0".

Therefore, it's fine to use cpu_id_start to fetch per-cpu data, but its value
must then be compared against the cpu_id field, so that the case where rseq
is not registered is caught. In that case, cpu_id_start=0 but cpu_id=-1 or -2,
which differ, and therefore the critical section needs to jump to the abort
handler.

> 
>              This field is an optimistic cache in the sense that it is always
>              guaranteed  to hold a valid CPU number in the range [0..(nr_pos‐
>              sible_cpus - 1)].  It can therefore be loaded by user-space  and
>              used  as  an offset in per-CPU data structures without having to
>              check whether its value is within the valid bounds  compared  to
>              the number of possible CPUs in the system.
> 
>              For  user-space  applications  executed on a kernel without rseq
>              support, the cpu_id_start field stays initialized at 0, which is
>              indeed  a  valid CPU number.  It is therefore valid to use it as
>              an offset in per-CPU data structures, and only validate  whether
>              it's  actually  the  current CPU number by comparing it with the
>              cpu_id field within the rseq critical section.
> 
>              If the kernel does not provide rseq support, that  cpu_id  field
>              stays  initialized  at  -1,  so  the comparison always fails, as
>              intended.  It is then up to user-space to use a fall-back mecha‐
>              nism, considering that rseq is not available.
> 
>              ┌─────────────────────────────────────────────────────┐
>              │FIXME                                                │
>              ├─────────────────────────────────────────────────────┤
>              │The  last  sentence is rather difficult to grok. Can │
>              │we say some more here?                               │
>              └─────────────────────────────────────────────────────┘

Perhaps we could use the explanation I've written above in my reply ?

> 
>       cpu_id Cache of the CPU number on which the calling thread is  running.
>              -1 if uninitialized.
> 
>       rseq_cs
>              The  rseq_cs  field  is a pointer to a struct rseq_cs (described
>              below).  It is NULL when no rseq assembly block critical section
>              is  active  for  the  calling  thread.  Setting it to point to a
>              critical section descriptor (struct rseq_cs) marks the beginning
>              of the critical section.
> 
>       flags  Flags  indicating  the  restart behavior for the calling thread.
>              This is mainly used for debugging purposes.  Can be either:
> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT

Inhibit instruction sequence block restart on preemption for this thread.

> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL

Inhibit instruction sequence block restart on signal delivery for this thread.
Restart on signal can only be inhibited when restart on preemption and restart
on migration are inhibited too; otherwise the kernel terminates the offending
process with a segmentation fault.

> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE

Inhibit instruction sequence block restart on migration for this thread.

> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Each of the above values needs an explanation.       │
>       │                                                     │
>       │Is it correct that only one of  the  values  may  be │
>       │specified in 'flags'? I ask because in the 'rseq_cs' │
>       │structure below, the 'flags' field  is  a  bit  mask │
>       │where  any  combination  of  these flags may be ORed │
>       │together.                                            │
>       │                                                     │
>       └─────────────────────────────────────────────────────┘

Those are also masks and can be ORed.


> 
>   The rseq_cs structure
>       The struct rseq_cs is aligned on a 32-byte boundary  and  has  a  fixed
>       size of 32 bytes.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Below, I added the structure definition (in abbrevi‐ │
>       │ated form).  Is there any reason not to do this?     │
>       └─────────────────────────────────────────────────────┘

It's fine.

> 
>           struct rseq_cs {
>               __u32   version;
>               __u32   flags;
>               __u64   start_ip;
>               __u64   post_commit_offset;
>               __u64   abort_ip;
>           } __attribute__((aligned(4 * sizeof(__u64))));
> 
>       The structure fields are as follows:
> 
>       version
>              Version of this structure.
> 
>              ┌─────────────────────────────────────────────────────┐
>              │FIXME                                                │
>              ├─────────────────────────────────────────────────────┤
>              │What does 'version' need to be initialized to?       │
>              └─────────────────────────────────────────────────────┘

Currently version needs to be 0. Eventually, if we implement support for new flags to rseq(),
we could add feature flags which register support for newer versions of struct rseq_cs.

> 
>       flags  Flags indicating the restart behavior of this structure.  Can be
>              a combination of:
> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT

Inhibit instruction sequence block restart on preemption for this thread.

> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL

Inhibit instruction sequence block restart on signal delivery for this thread.
Restart on signal can only be inhibited when restart on preemption and restart
on migration are inhibited too; otherwise the kernel terminates the offending
process with a segmentation fault.

> 
>              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE

Inhibit instruction sequence block restart on migration for this thread.

> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Each of the above values needs an explanation.       │
>       └─────────────────────────────────────────────────────┘
> 
>       start_ip
>              Instruction  pointer  address  of  the  first instruction of the
>              sequence of consecutive assembly instructions.
> 
>       post_commit_offset
>              Offset (from start_ip address) of the  address  after  the  last
>              instruction  of  the  sequence  of consecutive assembly instruc‐
>              tions.
> 
>       abort_ip
>              Instruction pointer address where to move the execution flow  in
>              case  of  abort of the sequence of consecutive assembly instruc‐
>              tions.
> 
> NOTES
>       A single library per process  should  keep  the  rseq  structure  in  a
>       thread-local  storage variable.  The cpu_id field should be initialized
>       to -1, and the cpu_id_start field should be initialized to  a  possible
>       CPU value (typically 0).

The part above is not quite right. All applications/libraries wishing to register
rseq must follow the ABI specified by the C library. The rseq TLS symbol can be
defined in more than one application/library, but in the end only one symbol will
be chosen for the process's global symbol table.

> 
>       Each  thread  is responsible for registering and unregistering its rseq
>       structure.  No more than one rseq structure address can  be  registered
>       per thread at a given time.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  following paragraph, what is the difference │
>       │between "freed" and "reclaim"?  I'm  supposing  they │
>       │mean the same thing, but it's not clear. And if they │
>       │do mean the same thing, then the first two sentences │
>       │appear to contain contradictory information.         │
>       └─────────────────────────────────────────────────────┘

They mean the same thing, and they are subtly not contradictory.

The first states that memory of a _registered_ rseq object must not
be freed before the thread exits.

The second states that memory of a rseq object must not be freed before
it is unregistered or the thread exits.

Do you have an alternative wording in mind to make this clearer ?

> 
>       Memory  of a registered rseq object must not be freed before the thread
>       exits.  Reclaim of rseq object's memory must only be done after  either
>       an explicit rseq unregistration is performed or after the thread exits.
>       Keep in mind that the implementation of  the  Thread-Local  Storage  (C
>       language  __thread)  lifetime  does  not guarantee existence of the TLS
>       area up until the thread exits.
> 
>       In a typical usage scenario, the thread registering the rseq  structure
>       will be performing loads and stores from/to that structure.  It is how‐
>       ever also allowed to read that structure from other threads.  The  rseq
>       field  updates performed by the kernel provide relaxed atomicity seman‐
>       tics, which guarantee that  other  threads  performing  relaxed  atomic
>       reads of the CPU number cache will always observe a consistent value.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │In  the  preceding  paragraph, can we reasonably add │
>       │some words to explain "relaxed atomicity  semantics" │
>       │and "relaxed atomic reads"?                          │
>       └─────────────────────────────────────────────────────┘

Not sure how to word this exactly, but here it means the stores and loads need
to be done atomically, but neither require nor provide any ordering guarantees
with respect to other loads/stores (no memory barriers).
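
As a rough illustration in C11 terms (the variable and function names below
are hypothetical, and the atomic variable merely stands in for the cpu_id
field that the kernel updates), a relaxed store and a relaxed load look like
this:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the cpu_id field of some thread's rseq area.
 * It starts at -1, the "uninitialized" value. */
static _Atomic uint32_t cpu_id_cache = (uint32_t)-1;

/* Writer side. In the real ABI the kernel performs this update; it is
 * modeled here as a relaxed store: atomic, but with no ordering against
 * other memory accesses. */
static inline void publish_cpu_id(uint32_t cpu)
{
	atomic_store_explicit(&cpu_id_cache, cpu, memory_order_relaxed);
}

/* Reader side, possibly running on another thread: a relaxed load never
 * observes a torn value, but implies no memory barrier. */
static inline uint32_t read_cpu_id(void)
{
	return atomic_load_explicit(&cpu_id_cache, memory_order_relaxed);
}
```

The reader always gets either the old or the new value in full; what it does
not get is any guarantee about the order in which other stores become visible.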

> 
> RETURN VALUE
>       A  return  value of 0 indicates success.  On error, -1 is returned, and
>       errno is set appropriately.
> 
> ERRORS
>       EBUSY  Restartable sequence is already registered for this thread.
> 
>       EFAULT rseq is an invalid address.
> 
>       EINVAL Either flags contains an invalid  value,  or  rseq  contains  an
>              address which is not appropriately aligned, or rseq_len contains
>              a size that does not match the size received on registration.
> 
>              ┌─────────────────────────────────────────────────────┐
>              │FIXME                                                │
>              ├─────────────────────────────────────────────────────┤
>              │The last case "rseq_len contains a  size  that  does │
>              │not  match  the  size  received on registration" can │
>              │occur only on RSEQ_FLAG_UNREGISTER, right?           │
>              └─────────────────────────────────────────────────────┘
> 
>       ENOSYS The rseq() system call is not implemented by this kernel.
> 
>       EPERM  The sig argument on unregistration does not match the  signature
>              received on registration.
> 
> VERSIONS
>       The rseq() system call was added in Linux 4.18.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │What is the current state of library support?        │
>       └─────────────────────────────────────────────────────┘

After going through a few RFC rounds, it was posted as a non-RFC series a
few weeks ago. It is pending review from the glibc maintainers. I am currently
aiming to get the rseq TLS registration included in glibc 2.30:

https://sourceware.org/ml/libc-alpha/2019-02/msg00317.html
https://sourceware.org/ml/libc-alpha/2019-02/msg00320.html
https://sourceware.org/ml/libc-alpha/2019-02/msg00319.html
https://sourceware.org/ml/libc-alpha/2019-02/msg00318.html
https://sourceware.org/ml/libc-alpha/2019-02/msg00321.html

Note that the C library will define a user-space ABI which states how
applications/libraries wishing to register the rseq TLS need to behave so they
are compatible with the C library when it gets updated to a new version providing
rseq registration support. It seems like an important point to document,
perhaps even here in the rseq(2) man page.


> 
> CONFORMING TO
>       rseq() is Linux-specific.
> 
>       ┌─────────────────────────────────────────────────────┐
>       │FIXME                                                │
>       ├─────────────────────────────────────────────────────┤
>       │Is  there  any  example  code that can reasonably be │
>       │included in this manual page? Or some  example  code │
>       │that can be referred to?                             │
>       └─────────────────────────────────────────────────────┘
> 

The per-cpu counter example we have here seems compact enough:

https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/

Thanks,

Mathieu


> SEE ALSO
>       sched_getcpu(3), membarrier(2)
> 
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [PATCH man-pages] Add rseq manpage
  2018-12-06 14:42 ` Mathieu Desnoyers
@ 2019-02-28  8:42   ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-02-28  8:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: mtk.manpages, linux-kernel, linux-api, Peter Zijlstra,
	Paul E . McKenney, Boqun Feng, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon

On 12/6/18 3:42 PM, Mathieu Desnoyers wrote:
> [ Michael, rseq(2) was merged into 4.18. Can you have a look at this
>   patch which adds rseq documentation to the man-pages project ? ]
Hi Mathieu,

Sorry for the long delay. I've merged this page into a private
branch and have done quite a lot of editing. I have many
questions :-).

In the first instance, I think it is probably best to have
a free-form text discussion rather than firing patches
back and forth. Could you take a look at the questions below
and respond?

Thanks,

Michael


RSEQ(2)                    Linux Programmer's Manual                   RSEQ(2)

NAME
       rseq - Restartable sequences and CPU number cache

SYNOPSIS
       #include <linux/rseq.h>

       int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);

DESCRIPTION
       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Imagine  you  are  someone who is pretty new to this │
       │idea...  What is notably lacking from this  page  is │
       │an overview explaining:                              │
       │                                                     │
       │    * What a restartable sequence actually is.       │
       │                                                     │
       │    * An outline of the steps to perform when using  │
       │    restartable sequences / rseq(2).                 │
       │                                                     │
       │I.e.,  something  along  the  lines  of Jon Corbet's │
       │https://lwn.net/Articles/697979/.  Can you  come  up │
       │with something? (Part of it might be at the start of │
       │this page, and the rest in NOTES; it need not be all │
       │in one place.)                                       │
       └─────────────────────────────────────────────────────┘
       The  rseq()  ABI  accelerates  user-space operations on per-CPU data by
       defining a shared data structure ABI between each user-space thread and
       the kernel.

       It allows user-space to perform update operations on per-CPU data with‐
       out requiring heavy-weight atomic operations.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In the following para: "a  hardware  execution  con‐ │
       │text"?   What  is  the contrast being drawn here? It │
       │would be good to state it more explicitly.           │
       └─────────────────────────────────────────────────────┘
       The term CPU used in this documentation refers to a hardware  execution
       context.

       Restartable sequences are atomic with respect to preemption (making them
       atomic with respect to other threads running on the same CPU), as  well
       as  signal delivery (user-space execution contexts nested over the same
       thread).  They either complete atomically with respect to preemption on
       the current CPU and signal delivery, or they are aborted.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  preceding sentence, we need a definition of │
       │"current CPU".                                       │
       └─────────────────────────────────────────────────────┘

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In the following, does "It  is"  mean  "Restartable  │
       │sequences are"?                                      │
       └─────────────────────────────────────────────────────┘
       It is suited for update operations on per-CPU data.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  following,  does "It is" mean  "Restartable │
       │sequences are"?                                      │
       └─────────────────────────────────────────────────────┘
       It can be used on data  structures  shared  between  threads  within  a
       process, and on data structures shared between threads across different
       processes.

       Some examples of operations that can be accelerated or improved by this
       ABI:

       · Memory allocator per-CPU free-lists

       · Querying the current CPU number

       · Incrementing per-CPU counters

       · Modifying data protected by per-CPU spinlocks

       · Inserting/removing elements in per-CPU linked-lists

       · Writing/reading per-CPU ring buffers content

       · Accurately  reading performance monitoring unit counters with respect
         to thread migration

       Restartable sequences must not perform  system  calls.   Doing  so  may
       result in termination of the process by a segmentation fault.

       The rseq argument is a pointer to the thread-local rseq structure to be
       shared between kernel and user-space.  The layout of this structure  is
       shown below.

       The rseq_len argument is the size of the struct rseq to register.

       The  flags  argument is 0 for registration, or RSEQ_FLAG_UNREGISTER for
       unregistration.

       The sig argument is the 32-bit signature  to  be  expected  before  the
       abort handler code.

   The rseq structure
       The  struct  rseq  is aligned on a 32-byte boundary.  This structure is
       extensible.  Its size is passed as parameter to the rseq() system call.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Below, I added the structure definition (in abbrevi‐ │
       │ated form).  Is there any reason not to do this?     │
       └─────────────────────────────────────────────────────┘

           struct rseq {
               __u32             cpu_id_start;
               __u32             cpu_id;
               union {
                   __u64 ptr64;
           #ifdef __LP64__
                   __u64 ptr;
           #else
                   ....
           #endif
               }                 rseq_cs;
               __u32             flags;
           } __attribute__((aligned(4 * sizeof(__u64))));

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  text  below, I think it would be helpful to │
       │explicitly note which of these fields are set by the │
       │kernel  (on  return from the rseq()  call) and which │
       │are set by the caller (before  calling  rseq()).  Is │
       │the following correct:                               │
       │                                                     │
       │    cpu_id_start - initialized by caller to possible │
       │    CPU number (e.g., 0), updated by kernel          │
       │    on return                                        │
       │                                                     │
       │    cpu_id - initialized to -1 by caller,            │
       │    updated by kernel on return                      │
       │                                                     │
       │    rseq_cs - initialized by caller, either to NULL  │
       │    or a pointer to an 'rseq_cs' structure           │
       │    that is initialized by the caller                │
       │                                                     │
       │    flags - initialized by caller, used by kernel    │
       └─────────────────────────────────────────────────────┘

       The structure fields are as follows:

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  following paragraph, and in later places, I │
       │changed "current thread" to "calling thread". Okay?  │
       └─────────────────────────────────────────────────────┘

       cpu_id_start
              Optimistic cache of the CPU number on which the  calling  thread
              is  running.  The value in this field is guaranteed to always be
              a possible CPU number, even when rseq is not  initialized.   The
              value  it  contains  should  always  be confirmed by reading the
              cpu_id field.

              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │What does the last sentence mean?                    │
              └─────────────────────────────────────────────────────┘

              This field is an optimistic cache in the sense that it is always
              guaranteed  to hold a valid CPU number in the range [0..(nr_pos‐
              sible_cpus - 1)].  It can therefore be loaded by user-space  and
              used  as  an offset in per-CPU data structures without having to
              check whether its value is within the valid bounds  compared  to
              the number of possible CPUs in the system.

              For  user-space  applications  executed on a kernel without rseq
              support, the cpu_id_start field stays initialized at 0, which is
              indeed  a  valid CPU number.  It is therefore valid to use it as
              an offset in per-CPU data structures, and only validate  whether
              it's  actually  the  current CPU number by comparing it with the
              cpu_id field within the rseq critical section.

              If the kernel does not provide rseq support, that  cpu_id  field
              stays  initialized  at  -1,  so  the comparison always fails, as
              intended.  It is then up to user-space to use a fall-back mecha‐
              nism, considering that rseq is not available.

              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │The  last  sentence is rather difficult to grok. Can │
              │we say some more here?                               │
              └─────────────────────────────────────────────────────┘
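
The cpu_id_start / cpu_id comparison described in the field text above can be
sketched in plain C as follows. This only shows the control flow: in real use
the comparison and the update run inside an rseq critical section written in
assembly, and the fall-back path would use an atomic operation or a lock.
struct rseq_mirror, NR_POSSIBLE_CPUS and count_event() are names invented for
the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_POSSIBLE_CPUS 4	/* hypothetical; a real program queries this */

/* Local mirror of the first two fields of struct rseq. */
struct rseq_mirror {
	uint32_t cpu_id_start;	/* always a possible CPU number */
	uint32_t cpu_id;	/* (uint32_t)-1 while rseq is unavailable */
};

static long percpu_count[NR_POSSIBLE_CPUS];
static long fallback_count;	/* stands in for a slower shared path */

/* Returns true when the per-CPU fast path was taken. */
static inline bool count_event(const struct rseq_mirror *r)
{
	/* cpu_id_start can index the array without a bounds check. */
	uint32_t cpu = r->cpu_id_start;
	long *slot = &percpu_count[cpu];

	/* Without rseq support cpu_id stays at -1, so this never matches
	 * and the fall-back path is taken, as intended. */
	if (cpu != r->cpu_id) {
		fallback_count++;
		return false;
	}
	(*slot)++;
	return true;
}
```

With cpu_id_start left at 0 and cpu_id at -1, every call ends up on the
fall-back path; once both fields hold the same CPU number, the per-CPU slot
is used.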

       cpu_id Cache of the CPU number on which the calling thread is  running.
              -1 if uninitialized.

       rseq_cs
              The  rseq_cs  field  is a pointer to a struct rseq_cs (described
              below).  It is NULL when no rseq assembly block critical section
              is  active  for  the  calling  thread.  Setting it to point to a
              critical section descriptor (struct rseq_cs) marks the beginning
              of the critical section.

       flags  Flags  indicating  the  restart behavior for the calling thread.
              This is mainly used for debugging purposes.  Can be either:

              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT

              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL

              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Each of the above values needs an explanation.       │
       │                                                     │
       │Is it correct that only one of  the  values  may  be │
       │specified in 'flags'? I ask because in the 'rseq_cs' │
       │structure below, the 'flags' field  is  a  bit  mask │
       │where  any  combination  of  these flags may be ORed │
       │together.                                            │
       │                                                     │
       └─────────────────────────────────────────────────────┘

   The rseq_cs structure
       The struct rseq_cs is aligned on a 32-byte boundary  and  has  a  fixed
       size of 32 bytes.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Below, I added the structure definition (in abbrevi‐ │
       │ated form).  Is there any reason not to do this?     │
       └─────────────────────────────────────────────────────┘

           struct rseq_cs {
               __u32   version;
               __u32   flags;
               __u64   start_ip;
               __u64   post_commit_offset;
               __u64   abort_ip;
           } __attribute__((aligned(4 * sizeof(__u64))));

       The structure fields are as follows:

       version
              Version of this structure.

              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │What does 'version' need to be initialized to?       │
              └─────────────────────────────────────────────────────┘

       flags  Flags indicating the restart behavior of this structure.  Can be
              a combination of:

              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT

              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL

              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Each of the above values needs an explanation.       │
       └─────────────────────────────────────────────────────┘

       start_ip
              Instruction  pointer  address  of  the  first instruction of the
              sequence of consecutive assembly instructions.

       post_commit_offset
              Offset (from start_ip address) of the  address  after  the  last
              instruction  of  the  sequence  of consecutive assembly instruc‐
              tions.

       abort_ip
              Instruction pointer address where to move the execution flow  in
              case  of  abort of the sequence of consecutive assembly instruc‐
              tions.
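
To make the relation between start_ip and post_commit_offset concrete, here is
a small sketch of a range test over these fields. struct rseq_cs_mirror and
ip_in_rseq_cs() are illustrative names, and the test reflects my understanding
of how the fields are meant to be interpreted rather than a quote of kernel
code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of struct rseq_cs (alignment attribute omitted). */
struct rseq_cs_mirror {
	uint32_t version;
	uint32_t flags;
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

/* True when ip lies in [start_ip, start_ip + post_commit_offset).
 * The unsigned subtraction wraps around for ip < start_ip, so a single
 * comparison covers both the lower and the upper bound. */
static inline bool ip_in_rseq_cs(uint64_t ip, const struct rseq_cs_mirror *cs)
{
	return ip - cs->start_ip < cs->post_commit_offset;
}
```

An instruction pointer at start_ip + post_commit_offset is already outside
the sequence: the offset designates the address after the last instruction.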

NOTES
       A single library per process  should  keep  the  rseq  structure  in  a
       thread-local  storage variable.  The cpu_id field should be initialized
       to -1, and the cpu_id_start field should be initialized to  a  possible
       CPU value (typically 0).

       Each  thread  is responsible for registering and unregistering its rseq
       structure.  No more than one rseq structure address can  be  registered
       per thread at a given time.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  following paragraph, what is the difference │
       │between "freed" and "reclaim"?  I'm  supposing  they │
       │mean the same thing, but it's not clear. And if they │
       │do mean the same thing, then the first two sentences │
       │appear to contain contradictory information.         │
       └─────────────────────────────────────────────────────┘

       Memory  of a registered rseq object must not be freed before the thread
       exits.  Reclaim of rseq object's memory must only be done after  either
       an explicit rseq unregistration is performed or after the thread exits.
       Keep in mind that the implementation of  the  Thread-Local  Storage  (C
       language  __thread)  lifetime  does  not guarantee existence of the TLS
       area up until the thread exits.

       In a typical usage scenario, the thread registering the rseq  structure
       will be performing loads and stores from/to that structure.  It is how‐
       ever also allowed to read that structure from other threads.  The  rseq
       field  updates performed by the kernel provide relaxed atomicity seman‐
       tics, which guarantee that  other  threads  performing  relaxed  atomic
       reads of the CPU number cache will always observe a consistent value.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  preceding  paragraph, can we reasonably add │
       │some words to explain "relaxed atomicity  semantics" │
       │and "relaxed atomic reads"?                          │
       └─────────────────────────────────────────────────────┘

RETURN VALUE
       A  return  value of 0 indicates success.  On error, -1 is returned, and
       errno is set appropriately.

ERRORS
       EBUSY  Restartable sequence is already registered for this thread.

       EFAULT rseq is an invalid address.

       EINVAL Either flags contains an invalid  value,  or  rseq  contains  an
              address which is not appropriately aligned, or rseq_len contains
              a size that does not match the size received on registration.

              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │The last case "rseq_len contains a  size  that  does │
              │not  match  the  size  received on registration" can │
              │occur only on RSEQ_FLAG_UNREGISTER, right?           │
              └─────────────────────────────────────────────────────┘

       ENOSYS The rseq() system call is not implemented by this kernel.

       EPERM  The sig argument on unregistration does not match the  signature
              received on registration.

VERSIONS
       The rseq() system call was added in Linux 4.18.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │What is the current state of library support?        │
       └─────────────────────────────────────────────────────┘

CONFORMING TO
       rseq() is Linux-specific.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Is  there  any  example  code that can reasonably be │
       │included in this manual page? Or some  example  code │
       │that can be referred to?                             │
       └─────────────────────────────────────────────────────┘

SEE ALSO
       sched_getcpu(3), membarrier(2)

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [PATCH man-pages] Add rseq manpage
@ 2019-02-28  8:42   ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2019-02-28  8:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: mtk.manpages, linux-kernel, linux-api, Peter Zijlstra,
	Paul E . McKenney, Boqun Feng, Andy Lutomirski, Dave Watson,
	Paul Turner, Andrew Morton, Russell King, Thomas Gleixner,
	Ingo Molnar, H . Peter Anvin, Andi Kleen, Chris Lameter,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will

On 12/6/18 3:42 PM, Mathieu Desnoyers wrote:
> [ Michael, rseq(2) was merged into 4.18. Can you have a look at this
>   patch which adds rseq documentation to the man-pages project ? ]
Hi Matthieu

Sorry for the long delay. I've merged this page into a private
branch and have done quite a lot of editing. I have many
questions :-).

In the first instance, I think it is probably best to have
a free-form text discussion rather than firing patches
back and forward. Could you take a look at the questions below
and respond?

Thanks,

Michael


RSEQ(2)                    Linux Programmer's Manual                   RSEQ(2)

NAME
       rseq - Restartable sequences and CPU number cache

SYNOPSIS
       #include <linux/rseq.h>

       int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);

DESCRIPTION
       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Imagine  you  are  someone who is pretty new to this │
       │idea...  What is notably lacking from this  page  is │
       │an overview explaining:                              │
       │                                                     │
       │    * What a restartable sequence actually is.       │
       │                                                     │
       │    * An outline of the steps to perform when using  │
       │    restartable sequences / rseq(2).                 │
       │                                                     │
       │I.e.,  something  along  the  lines  of Jon Corbet's │
       │https://lwn.net/Articles/697979/.  Can you  come  up │
       │with something? (Part of it might be at the start of │
       │this page, and the rest in NOTES; it need not be all │
       │in one place.)                                       │
       └─────────────────────────────────────────────────────┘
       The  rseq()  ABI  accelerates  user-space operations on per-CPU data by
       defining a shared data structure ABI between each user-space thread and
       the kernel.

       It allows user-space to perform update operations on per-CPU data with‐
       out requiring heavy-weight atomic operations.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In the following para: "a  hardware  execution  con‐ │
       │text"?   What  is  the contrast being drawn here? It │
       │would be good to state it more explicitly.           │
       └─────────────────────────────────────────────────────┘
       The term CPU used in this documentation refers to a hardware  execution
       context.

       Restartable  sequences are atomic with respect to preemption (making it
       atomic with respect to other threads running on the same CPU), as  well
       as  signal delivery (user-space execution contexts nested over the same
       thread).  They either complete atomically with respect to preemption on
       the current CPU and signal delivery, or they are aborted.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  preceding sentence, we need a definition of │
       │"current CPU".                                       │
       └─────────────────────────────────────────────────────┘

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In the following, does "It  is"  means  "Restartable │
       │sequences are"?                                      │
       └─────────────────────────────────────────────────────┘
       It is suited for update operations on per-CPU data.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  following,  does "It is" means "Restartable │
       │sequences are"?                                      │
       └─────────────────────────────────────────────────────┘
       It can be used on data  structures  shared  between  threads  within  a
       process, and on data structures shared between threads across different
       processes.

       Some examples of operations that can be accelerated or improved by this
       ABI:

       · Memory allocator per-CPU free-lists

       · Querying the current CPU number

       · Incrementing per-CPU counters

       · Modifying data protected by per-CPU spinlocks

       · Inserting/removing elements in per-CPU linked-lists

       · Writing/reading per-CPU ring buffers content

       · Accurately  reading performance monitoring unit counters with respect
         to thread migration

       Restartable sequences must not perform  system  calls.   Doing  so  may
       result in termination of the process by a segmentation fault.

       The rseq argument is a pointer to the thread-local rseq structure to be
       shared between kernel and user-space.  The layout of this structure  is
       shown below.

       The rseq_len argument is the size of the struct rseq to register.

       The  flags  argument is 0 for registration, or RSEQ_FLAG_UNREGISTER for
       unregistration.

       The sig argument is the 32-bit signature  to  be  expected  before  the
       abort handler code.
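
       The four arguments described above can be exercised with a thin
       wrapper around syscall(2).  This is a minimal sketch, not the way
       any particular library does it: the names example_rseq_area,
       example_rseq_register(), example_rseq_unregister() and the
       signature value EXAMPLE_RSEQ_SIG are hypothetical, and it assumes
       the installed kernel headers provide <linux/rseq.h> and __NR_rseq.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/rseq.h>

/* Hypothetical signature chosen for this sketch; registration and
 * unregistration only have to agree on the same 32-bit value. */
#define EXAMPLE_RSEQ_SIG 0x53053053u

/* One area per thread, initialized as the NOTES section recommends:
 * cpu_id_start = 0 (a possible CPU), cpu_id = -1 (uninitialized). */
static __thread struct rseq example_rseq_area = {
    .cpu_id_start = 0,
    .cpu_id = (uint32_t)-1,
};

/* Register 'area' for the calling thread: flags is 0. */
static int example_rseq_register(struct rseq *area, uint32_t sig)
{
    assert(area != NULL);
#ifdef __NR_rseq
    return (int)syscall(__NR_rseq, area, (uint32_t)sizeof(*area), 0, sig);
#else
    errno = ENOSYS;   /* headers predate rseq: behave like an old kernel */
    return -1;
#endif
}

/* Unregister 'area': same address, size and signature as on registration. */
static int example_rseq_unregister(struct rseq *area, uint32_t sig)
{
    assert(area != NULL);
#ifdef __NR_rseq
    return (int)syscall(__NR_rseq, area, (uint32_t)sizeof(*area),
                        RSEQ_FLAG_UNREGISTER, sig);
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

       A caller would invoke example_rseq_register(&example_rseq_area,
       EXAMPLE_RSEQ_SIG) once per thread and fall back to another
       mechanism when it returns -1.  Failure is an expected outcome, for
       instance on kernels older than 4.18 (ENOSYS) or, as an assumption
       about newer C libraries, when the C library has already registered
       its own area for the thread.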

   The rseq structure
       The struct rseq is aligned on a 32-byte boundary.  This structure is
       extensible.  Its size is passed as a parameter to the rseq() system
       call.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Below, I added the structure definition (in abbrevi‐ │
       │ated form).  Is there any reason not to do this?     │
       └─────────────────────────────────────────────────────┘

           struct rseq {
               __u32             cpu_id_start;
               __u32             cpu_id;
               union {
                   __u64 ptr64;
           #ifdef __LP64__
                   __u64 ptr;
           #else
                   ....
           #endif
               }                 rseq_cs;
               __u32             flags;
           } __attribute__((aligned(4 * sizeof(__u64))));
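
       The layout above can be checked against the installed header at
       compile time.  The following sketch assumes <linux/rseq.h> is
       available; the offsets follow from the field order shown in the
       abbreviated definition, and the helper name
       example_rseq_area_is_aligned() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/rseq.h>

/* Field offsets implied by the abbreviated definition above. */
_Static_assert(offsetof(struct rseq, cpu_id_start) == 0, "cpu_id_start");
_Static_assert(offsetof(struct rseq, cpu_id) == 4, "cpu_id");
_Static_assert(offsetof(struct rseq, rseq_cs) == 8, "rseq_cs");
_Static_assert(offsetof(struct rseq, flags) == 16, "flags");

/* aligned(4 * sizeof(__u64)) means a 32-byte boundary. */
_Static_assert(_Alignof(struct rseq) == 32, "alignment");
_Static_assert(sizeof(struct rseq) % 32 == 0, "size is a multiple of 32");

/* Returns 1 if 'p' satisfies the 32-byte alignment of struct rseq.
 * A misaligned address is one of the documented causes of EINVAL. */
static int example_rseq_area_is_aligned(const void *p)
{
    return ((uintptr_t)p % _Alignof(struct rseq)) == 0;
}
```

       Any object declared with type struct rseq already satisfies this
       check; it matters mainly when the area is carved out of
       hand-managed memory.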

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  text  below, I think it would be helpful to │
       │explicitly note which of these fields are set by the │
       │kernel  (on  return  from the rseq() call) and which │
       │are set by the caller (before  calling  rseq()).  Is │
       │the following correct:                               │
       │                                                     │
       │    cpu_id_start - initialized by caller to possible │
       │    CPU number (e.g., 0), updated by kernel          │
       │    on return                                        │
       │                                                     │
       │    cpu_id - initialized to -1 by caller,            │
       │    updated by kernel on return                      │
       │                                                     │
       │    rseq_cs - initialized by caller, either to NULL  │
       │    or a pointer to an 'rseq_cs' structure           │
       │    that is initialized by the caller                │
       │                                                     │
       │    flags - initialized by caller, used by kernel    │
       └─────────────────────────────────────────────────────┘

       The structure fields are as follows:

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  following paragraph, and in later places, I │
       │changed "current thread" to "calling thread". Okay?  │
       └─────────────────────────────────────────────────────┘

       cpu_id_start
              Optimistic cache of the CPU number on which the  calling  thread
              is  running.  The value in this field is guaranteed to always be
              a possible CPU number, even when rseq is not  initialized.   The
              value  it  contains  should  always  be confirmed by reading the
              cpu_id field.

              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │What does the last sentence mean?                    │
              └─────────────────────────────────────────────────────┘

              This field is an optimistic cache in the sense that it is always
              guaranteed  to hold a valid CPU number in the range [0..(nr_pos‐
              sible_cpus - 1)].  It can therefore be loaded by user-space  and
              used  as  an offset in per-CPU data structures without having to
              check whether its value is within the valid bounds  compared  to
              the number of possible CPUs in the system.

              For  user-space  applications  executed on a kernel without rseq
              support, the cpu_id_start field stays initialized at 0, which is
              indeed  a  valid CPU number.  It is therefore valid to use it as
              an offset in per-CPU data structures, and only validate  whether
              it's  actually  the  current CPU number by comparing it with the
              cpu_id field within the rseq critical section.

              If the kernel does not provide rseq support, that  cpu_id  field
              stays  initialized  at  -1,  so  the comparison always fails, as
              intended.  It is then up to user-space to use a fall-back mecha‐
              nism, considering that rseq is not available.

              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │The  last  sentence is rather difficult to grok. Can │
              │we say some more here?                               │
              └─────────────────────────────────────────────────────┘

       cpu_id Cache of the CPU number on which the calling thread is  running.
              -1 if uninitialized.

       rseq_cs
              The  rseq_cs  field  is a pointer to a struct rseq_cs (described
              below).  It is NULL when no rseq assembly block critical section
              is  active  for  the  calling  thread.  Setting it to point to a
              critical section descriptor (struct rseq_cs) marks the beginning
              of the critical section.

       flags  Flags  indicating  the  restart behavior for the calling thread.
              This is mainly used for debugging purposes.  Can be either:

              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT

              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL

              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Each of the above values needs an explanation.       │
       │                                                     │
       │Is it correct that only one of  the  values  may  be │
       │specified in 'flags'? I ask because in the 'rseq_cs' │
       │structure below, the 'flags' field  is  a  bit  mask │
       │where  any  combination  of  these flags may be ORed │
       │together.                                            │
       │                                                     │
       └─────────────────────────────────────────────────────┘
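
       The relationship between cpu_id_start and cpu_id described above
       can be sketched in plain C.  This only illustrates the index and
       confirmation steps: in a real restartable sequence the comparison
       is performed inside the assembly critical section.  The helper
       names are hypothetical, and falling back to sched_getcpu(3) is
       one possible choice of fall-back mechanism, not the only one.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdint.h>
#include <linux/rseq.h>

/* Return the slot for the CPU named by cpu_id_start.  No bounds check is
 * needed because cpu_id_start always holds a possible CPU number; 'slots'
 * must have one entry per possible CPU. */
static long *example_percpu_slot(const struct rseq *area, long *slots)
{
    uint32_t idx = __atomic_load_n(&area->cpu_id_start, __ATOMIC_RELAXED);
    return &slots[idx];
}

/* Confirm cpu_id_start against cpu_id.  Returns the CPU number when both
 * agree, or -1 when they differ (for example cpu_id is still -1 because
 * rseq is unavailable or the area is not registered). */
static int example_confirmed_cpu(const struct rseq *area)
{
    uint32_t start = __atomic_load_n(&area->cpu_id_start, __ATOMIC_RELAXED);
    uint32_t cpu = __atomic_load_n(&area->cpu_id, __ATOMIC_RELAXED);
    return start == cpu ? (int)start : -1;
}

/* Current CPU from the rseq area, else from sched_getcpu(). */
static int example_current_cpu(const struct rseq *area)
{
    int cpu = example_confirmed_cpu(area);
    return cpu >= 0 ? cpu : sched_getcpu();
}
```

       With an unregistered area (cpu_id_start 0, cpu_id -1) the slot
       lookup still yields a valid slot, but the confirmation fails, so
       the caller takes the fall-back path.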

   The rseq_cs structure
       The struct rseq_cs is aligned on a 32-byte boundary  and  has  a  fixed
       size of 32 bytes.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Below, I added the structure definition (in abbrevi‐ │
       │ated form).  Is there any reason not to do this?     │
       └─────────────────────────────────────────────────────┘

           struct rseq_cs {
               __u32   version;
               __u32   flags;
               __u64   start_ip;
               __u64   post_commit_offset;
               __u64   abort_ip;
           } __attribute__((aligned(4 * sizeof(__u64))));

       The structure fields are as follows:

       version
              Version of this structure.

              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │What does 'version' need to be initialized to?       │
              └─────────────────────────────────────────────────────┘

       flags  Flags indicating the restart behavior of this structure.  Can be
              a combination of:

              RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT

              RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL

              RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Each of the above values needs an explanation.       │
       └─────────────────────────────────────────────────────┘

       start_ip
              Instruction  pointer  address  of  the  first instruction of the
              sequence of consecutive assembly instructions.

       post_commit_offset
              Offset (from start_ip address) of the  address  after  the  last
              instruction  of  the  sequence  of consecutive assembly instruc‐
              tions.

       abort_ip
              Instruction pointer address where to move the execution flow  in
              case  of  abort of the sequence of consecutive assembly instruc‐
              tions.
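
       The three address fields can be illustrated with a sketch that
       fills in a descriptor and tests whether an instruction pointer
       lies inside the critical section.  The helper names and the
       addresses used with them are hypothetical.  Setting version to 0
       is an assumption taken from the patch text ("struct rseq_cs
       version 0"), and the range test reflects the half-open interval
       [start_ip, start_ip + post_commit_offset) implied by the field
       descriptions above.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/rseq.h>

/* Build a descriptor from three addresses: the first instruction, the
 * address just after the last instruction, and the abort handler. */
static struct rseq_cs example_make_cs(uint64_t start_ip,
                                      uint64_t post_commit_ip,
                                      uint64_t abort_ip)
{
    struct rseq_cs cs = {
        .version = 0,   /* assumption: only version 0 is defined */
        .flags = 0,     /* default restart behavior */
        .start_ip = start_ip,
        .post_commit_offset = post_commit_ip - start_ip,
        .abort_ip = abort_ip,
    };
    return cs;
}

/* Returns 1 if 'ip' is inside the critical section.  Unsigned
 * subtraction makes addresses below start_ip wrap to a large value,
 * so a single comparison covers both bounds. */
static int example_ip_in_cs(const struct rseq_cs *cs, uint64_t ip)
{
    return ip - cs->start_ip < cs->post_commit_offset;
}
```

       In real use the descriptor is typically emitted by the same
       assembly block that contains the sequence, so the three addresses
       come from assembler labels rather than from run-time values.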

NOTES
       A single library per process  should  keep  the  rseq  structure  in  a
       thread-local  storage variable.  The cpu_id field should be initialized
       to -1, and the cpu_id_start field should be initialized to  a  possible
       CPU value (typically 0).

       Each  thread  is responsible for registering and unregistering its rseq
       structure.  No more than one rseq structure address can  be  registered
       per thread at a given time.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  following paragraph, what is the difference │
       │between "freed" and "reclaim"?  I'm  supposing  they │
       │mean the same thing, but it's not clear. And if they │
       │do mean the same thing, then the first two sentences │
       │appear to contain contradictory information.         │
       └─────────────────────────────────────────────────────┘

       Memory  of a registered rseq object must not be freed before the thread
       exits.  Reclaim of rseq object's memory must only be done after  either
       an explicit rseq unregistration is performed or after the thread exits.
       Keep in mind that the implementation of  the  Thread-Local  Storage  (C
       language  __thread)  lifetime  does  not guarantee existence of the TLS
       area up until the thread exits.

       In a typical usage scenario, the thread registering the rseq  structure
       will be performing loads and stores from/to that structure.  It is how‐
       ever also allowed to read that structure from other threads.  The  rseq
       field  updates performed by the kernel provide relaxed atomicity seman‐
       tics, which guarantee that  other  threads  performing  relaxed  atomic
       reads of the CPU number cache will always observe a consistent value.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │In  the  preceding  paragraph, can we reasonably add │
       │some words to explain "relaxed atomicity  semantics" │
       │and "relaxed atomic reads"?                          │
       └─────────────────────────────────────────────────────┘

RETURN VALUE
       A  return  value of 0 indicates success.  On error, -1 is returned, and
       errno is set appropriately.

ERRORS
       EBUSY  Restartable sequence is already registered for this thread.

       EFAULT rseq is an invalid address.

       EINVAL Either flags contains an invalid  value,  or  rseq  contains  an
              address which is not appropriately aligned, or rseq_len contains
              a size that does not match the size received on registration.

              ┌─────────────────────────────────────────────────────┐
              │FIXME                                                │
              ├─────────────────────────────────────────────────────┤
              │The last case "rseq_len contains a  size  that  does │
              │not  match  the  size  received on registration" can │
              │occur only on RSEQ_FLAG_UNREGISTER, right?           │
              └─────────────────────────────────────────────────────┘

       ENOSYS The rseq() system call is not implemented by this kernel.

       EPERM  The sig argument on unregistration does not match the  signature
              received on registration.

VERSIONS
       The rseq() system call was added in Linux 4.18.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │What is the current state of library support?        │
       └─────────────────────────────────────────────────────┘

CONFORMING TO
       rseq() is Linux-specific.

       ┌─────────────────────────────────────────────────────┐
       │FIXME                                                │
       ├─────────────────────────────────────────────────────┤
       │Is  there  any  example  code that can reasonably be │
       │included in this manual page? Or some  example  code │
       │that can be referred to?                             │
       └─────────────────────────────────────────────────────┘

SEE ALSO
       sched_getcpu(3), membarrier(2)

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH man-pages] Add rseq manpage
@ 2018-12-06 14:42 ` Mathieu Desnoyers
  0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2018-12-06 14:42 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: linux-kernel, linux-api, Peter Zijlstra, Paul E . McKenney,
	Boqun Feng, Andy Lutomirski, Dave Watson, Paul Turner,
	Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Mathieu Desnoyers

[ Michael, rseq(2) was merged into 4.18. Can you have a look at this
  patch which adds rseq documentation to the man-pages project ? ]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 man2/rseq.2 | 299 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 299 insertions(+)
 create mode 100644 man2/rseq.2

diff --git a/man2/rseq.2 b/man2/rseq.2
new file mode 100644
index 000000000..005c1cee4
--- /dev/null
+++ b/man2/rseq.2
@@ -0,0 +1,299 @@
+.\" Copyright 2015-2018 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH RSEQ 2 2018-09-19 "Linux" "Linux Programmer's Manual"
+.SH NAME
+rseq \- Restartable sequences and cpu number cache
+.SH SYNOPSIS
+.nf
+.B #include <linux/rseq.h>
+.sp
+.BI "int rseq(struct rseq * " rseq ", uint32_t " rseq_len ", int " flags ", uint32_t " sig ");
+.sp
+.SH DESCRIPTION
+The
+.BR rseq ()
+ABI accelerates user-space operations on per-cpu data by defining a
+shared data structure ABI between each user-space thread and the kernel.
+
+It allows user-space to perform update operations on per-cpu data
+without requiring heavy-weight atomic operations.
+
+The term CPU used in this documentation refers to a hardware execution
+context.
+
+Restartable sequences are atomic with respect to preemption (making them
+atomic with respect to other threads running on the same CPU), as well
+as signal delivery (user-space execution contexts nested over the same
+thread). They either complete atomically with respect to preemption on
+the current CPU and signal delivery, or they are aborted.
+
+It is suited for update operations on per-cpu data.
+
+It can be used on data structures shared between threads within a
+process, and on data structures shared between threads across different
+processes.
+
+.PP
+Some examples of operations that can be accelerated or improved
+by this ABI:
+.IP \[bu] 2
+Memory allocator per-cpu free-lists,
+.IP \[bu] 2
+Querying the current CPU number,
+.IP \[bu] 2
+Incrementing per-CPU counters,
+.IP \[bu] 2
+Modifying data protected by per-CPU spinlocks,
+.IP \[bu] 2
+Inserting/removing elements in per-CPU linked-lists,
+.IP \[bu] 2
+Writing/reading per-CPU ring buffers content,
+.IP \[bu] 2
+Accurately reading performance monitoring unit counters
+with respect to thread migration.
+
+.PP
+Restartable sequences must not perform system calls. Doing so may result
+in termination of the process by a segmentation fault.
+
+.PP
+The
+.I rseq
+argument is a pointer to the thread-local rseq structure to be shared
+between kernel and user-space.
+
+.PP
+The layout of
+.B struct rseq
+is as follows:
+.TP
+.B Structure alignment
+This structure is aligned on a 32-byte boundary.
+.TP
+.B Structure size
+This structure is extensible. Its size is passed as a parameter to the
+rseq system call.
+.TP
+.B Fields
+
+.TP
+.in +4n
+.I cpu_id_start
+Optimistic cache of the CPU number on which the current thread is
+running. Its value is guaranteed to always be a possible CPU number,
+even when rseq is not initialized. The value it contains should always
+be confirmed by reading the cpu_id field.
+
+This field is an optimistic cache in the sense that it is always
+guaranteed to hold a valid CPU number in the range [ 0 ..
+nr_possible_cpus - 1 ]. It can therefore be loaded by user-space and
+used as an offset in per-cpu data structures without having to
+check whether its value is within the valid bounds compared to the
+number of possible CPUs in the system.
+
+For user-space applications executed on a kernel without rseq support,
+the cpu_id_start field stays initialized at 0, which is indeed a valid
+CPU number. It is therefore valid to use it as an offset in per-cpu data
+structures, and only validate whether it's actually the current CPU
+number by comparing it with the cpu_id field within the rseq critical
+section. If the kernel does not provide rseq support, that cpu_id field
+stays initialized at -1, so the comparison always fails, as intended.
+
+It is then up to user-space to use a fall-back mechanism, considering
+that rseq is not available.
+
+.in
+.TP
+.in +4n
+.I cpu_id
+Cache of the CPU number on which the current thread is running.
+-1 if uninitialized.
+.in
+.TP
+.in +4n
+.I rseq_cs
+The rseq_cs field is a pointer to a struct rseq_cs. It is NULL when no
+rseq assembly block critical section is active for the current thread.
+Setting it to point to a critical section descriptor (struct rseq_cs)
+marks the beginning of the critical section.
+.in
+.TP
+.in +4n
+.I flags
+Flags indicating the restart behavior for the current thread. This is
+mainly used for debugging purposes. Can be either:
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+.in
+
+.PP
+The layout of
+.B struct rseq_cs
+version 0 is as follows:
+.TP
+.B Structure alignment
+This structure is aligned on a 32-byte boundary.
+.TP
+.B Structure size
+This structure has a fixed size of 32 bytes.
+.TP
+.B Fields
+
+.TP
+.in +4n
+.I version
+Version of this structure.
+.in
+.TP
+.in +4n
+.I flags
+Flags indicating the restart behavior of this structure. Can be
+a combination of:
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+.TP
+.in +4n
+.I start_ip
+Instruction pointer address of the first instruction of the sequence of
+consecutive assembly instructions.
+.in
+.TP
+.in +4n
+.I post_commit_offset
+Offset (from start_ip address) of the address after the last instruction
+of the sequence of consecutive assembly instructions.
+.in
+.TP
+.in +4n
+.I abort_ip
+Instruction pointer address where to move the execution flow in case of
+abort of the sequence of consecutive assembly instructions.
+.in
+
+.PP
+The
+.I rseq_len
+argument is the size of the
+.I struct rseq
+to register.
+
+.PP
+The
+.I flags
+argument is 0 for registration, and
+.IR RSEQ_FLAG_UNREGISTER
+for unregistration.
+
+.PP
+The
+.I sig
+argument is the 32-bit signature to be expected before the abort
+handler code.
+
+.PP
+A single library per process should keep the rseq structure in a
+thread-local storage variable.
+The
+.I cpu_id
+field should be initialized to -1, and the
+.I cpu_id_start
+field should be initialized to a possible CPU value (typically 0).
+
+.PP
+Each thread is responsible for registering and unregistering its rseq
+structure. No more than one rseq structure address can be registered
+per thread at a given time.
+
+.PP
+Memory of a registered rseq object must not be freed before the thread
+exits. Reclaim of rseq object's memory must only be done after either an
+explicit rseq unregistration is performed or after the thread exits. Keep
+in mind that the implementation of the Thread-Local Storage (C language
+__thread) lifetime does not guarantee existence of the TLS area up until
+the thread exits.
+
+.PP
+In a typical usage scenario, the thread registering the rseq
+structure will be performing loads and stores from/to that structure. It
+is however also allowed to read that structure from other threads.
+The rseq field updates performed by the kernel provide relaxed atomicity
+semantics, which guarantee that other threads performing relaxed atomic
+reads of the cpu number cache will always observe a consistent value.
+
+.SH RETURN VALUE
+A return value of 0 indicates success. On error, \-1 is returned, and
+.I errno
+is set appropriately.
+
+.SH ERRORS
+.TP
+.B EINVAL
+Either
+.I flags
+contains an invalid value, or
+.I rseq
+contains an address which is not appropriately aligned, or
+.I rseq_len
+contains a size that does not match the size received on registration.
+.TP
+.B ENOSYS
+The
+.BR rseq ()
+system call is not implemented by this kernel.
+.TP
+.B EFAULT
+.I rseq
+is an invalid address.
+.TP
+.B EBUSY
+Restartable sequence is already registered for this thread.
+.TP
+.B EPERM
+The
+.I sig
+argument on unregistration does not match the signature received
+on registration.
+
+.SH VERSIONS
+The
+.BR rseq ()
+system call was added in Linux 4.18.
+
+.SH CONFORMING TO
+.BR rseq ()
+is Linux-specific.
+
+.in
+.SH SEE ALSO
+.BR sched_getcpu (3) ,
+.BR membarrier (2)
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH man-pages] Add rseq manpage
@ 2018-12-06 14:42 ` Mathieu Desnoyers
  0 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2018-12-06 14:42 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: linux-kernel, linux-api, Peter Zijlstra, Paul E . McKenney,
	Boqun Feng, Andy Lutomirski, Dave Watson, Paul Turner,
	Andrew Morton, Russell King, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon

[ Michael, rseq(2) was merged into 4.18. Can you have a look at this
  patch which adds rseq documentation to the man-pages project ? ]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Paul Turner <pjt@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: Dave Watson <davejwatson@fb.com>
CC: Chris Lameter <cl@linux.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ben Maurer <bmaurer@fb.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Russell King <linux@arm.linux.org.uk>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: linux-api@vger.kernel.org
---
 man2/rseq.2 | 299 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 299 insertions(+)
 create mode 100644 man2/rseq.2

diff --git a/man2/rseq.2 b/man2/rseq.2
new file mode 100644
index 000000000..005c1cee4
--- /dev/null
+++ b/man2/rseq.2
@@ -0,0 +1,299 @@
+.\" Copyright 2015-2018 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH RSEQ 2 2018-09-19 "Linux" "Linux Programmer's Manual"
+.SH NAME
+rseq \- Restartable sequences and cpu number cache
+.SH SYNOPSIS
+.nf
+.B #include <linux/rseq.h>
+.sp
+.BI "int rseq(struct rseq * " rseq ", uint32_t " rseq_len ", int " flags ", uint32_t " sig ");
+.sp
+.SH DESCRIPTION
+The
+.BR rseq ()
+ABI accelerates user-space operations on per-cpu data by defining a
+shared data structure ABI between each user-space thread and the kernel.
+
+It allows user-space to perform update operations on per-cpu data
+without requiring heavy-weight atomic operations.
+
+The term CPU used in this documentation refers to a hardware execution
+context.
+
+Restartable sequences are atomic with respect to preemption (making them
+atomic with respect to other threads running on the same CPU), as well
+as signal delivery (user-space execution contexts nested over the same
+thread). They either complete atomically with respect to preemption on
+the current CPU and signal delivery, or they are aborted.
+
+It is suited for update operations on per-cpu data.
+
+It can be used on data structures shared between threads within a
+process, and on data structures shared between threads across different
+processes.
+
+.PP
+Some examples of operations that can be accelerated or improved
+by this ABI:
+.IP \[bu] 2
+Memory allocator per-cpu free-lists,
+.IP \[bu] 2
+Querying the current CPU number,
+.IP \[bu] 2
+Incrementing per-CPU counters,
+.IP \[bu] 2
+Modifying data protected by per-CPU spinlocks,
+.IP \[bu] 2
+Inserting/removing elements in per-CPU linked-lists,
+.IP \[bu] 2
+Writing/reading per-CPU ring buffers content,
+.IP \[bu] 2
+Accurately reading performance monitoring unit counters
+with respect to thread migration.
+
+.PP
+Restartable sequences must not perform system calls. Doing so may result
+in termination of the process by a segmentation fault.
+
+.PP
+The
+.I rseq
+argument is a pointer to the thread-local rseq structure to be shared
+between kernel and user-space.
+
+.PP
+The layout of
+.B struct rseq
+is as follows:
+.TP
+.B Structure alignment
+This structure is aligned on a 32-byte boundary.
+.TP
+.B Structure size
+This structure is extensible. Its size is passed as a parameter to the
+rseq system call.
+.TP
+.B Fields
+
+.TP
+.in +4n
+.I cpu_id_start
+Optimistic cache of the CPU number on which the current thread is
+running. Its value is guaranteed to always be a possible CPU number,
+even when rseq is not initialized. The value it contains should always
+be confirmed by reading the cpu_id field.
+
+This field is an optimistic cache in the sense that it is always
+guaranteed to hold a valid CPU number in the range [ 0 ..
+nr_possible_cpus - 1 ]. It can therefore be loaded by user-space and
+used as an offset in per-cpu data structures without having to
+check whether its value is within the valid bounds compared to the
+number of possible CPUs in the system.
+
+For user-space applications executed on a kernel without rseq support,
+the cpu_id_start field stays initialized at 0, which is indeed a valid
+CPU number. It is therefore valid to use it as an offset in per-cpu data
+structures, and only validate whether it's actually the current CPU
+number by comparing it with the cpu_id field within the rseq critical
+section. If the kernel does not provide rseq support, that cpu_id field
+stays initialized at -1, so the comparison always fails, as intended.
+
+It is then up to user-space to use a fall-back mechanism, considering
+that rseq is not available.
+
+.in
+.TP
+.in +4n
+.I cpu_id
+Cache of the CPU number on which the current thread is running.
+-1 if uninitialized.
+.in
+.TP
+.in +4n
+.I rseq_cs
+The rseq_cs field is a pointer to a struct rseq_cs. It is NULL when no
+rseq assembly block critical section is active for the current thread.
+Setting it to point to a critical section descriptor (struct rseq_cs)
+marks the beginning of the critical section.
+.in
+.TP
+.in +4n
+.I flags
+Flags indicating the restart behavior for the current thread. This is
+mainly used for debugging purposes. Can be a combination of:
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+.in
+
+.PP
+The layout of
+.B struct rseq_cs
+version 0 is as follows:
+.TP
+.B Structure alignment
+This structure is aligned on a 32-byte boundary.
+.TP
+.B Structure size
+This structure has a fixed size of 32 bytes.
+.TP
+.B Fields
+
+.TP
+.in +4n
+.I version
+Version of this structure.
+.in
+.TP
+.in +4n
+.I flags
+Flags indicating the restart behavior of this structure. Can be
+a combination of:
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+.IP \[bu]
+RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+.TP
+.in +4n
+.I start_ip
+Instruction pointer address of the first instruction of the sequence of
+consecutive assembly instructions.
+.in
+.TP
+.in +4n
+.I post_commit_offset
+Offset (from start_ip address) of the address after the last instruction
+of the sequence of consecutive assembly instructions.
+.in
+.TP
+.in +4n
+.I abort_ip
+Instruction pointer address to which the execution flow is moved when
+the sequence of consecutive assembly instructions is aborted.
+.in
+
+.PP
+The
+.I rseq_len
+argument is the size of the
+.I struct rseq
+to register.
+
+.PP
+The
+.I flags
+argument is 0 for registration, and
+.B RSEQ_FLAG_UNREGISTER
+for unregistration.
+
+.PP
+The
+.I sig
+argument is the 32-bit signature expected in the 4 bytes immediately
+preceding the abort handler code.
+
+.PP
+A single library per process should keep the rseq structure in a
+thread-local storage variable.
+The
+.I cpu_id
+field should be initialized to -1, and the
+.I cpu_id_start
+field should be initialized to a possible CPU value (typically 0).
+
+.PP
+Each thread is responsible for registering and unregistering its rseq
+structure. No more than one rseq structure address can be registered
+per thread at a given time.
+
+.PP
+The memory of a registered rseq object must not be freed while it is
+registered. It may be reclaimed only after an explicit rseq
+unregistration is performed or after the thread exits. Keep in mind
+that the implementation of Thread-Local Storage (C language __thread)
+does not guarantee that the TLS area exists up until the thread
+exits.
+
+.PP
+In a typical usage scenario, the thread registering the rseq
+structure will be performing loads and stores from/to that structure. It
+is however also allowed to read that structure from other threads.
+The rseq field updates performed by the kernel provide relaxed atomicity
+semantics, which guarantee that other threads performing relaxed atomic
+reads of the CPU number cache will always observe a consistent value.
+
+.SH RETURN VALUE
+A return value of 0 indicates success. On error, \-1 is returned, and
+.I errno
+is set appropriately.
+
+.SH ERRORS
+.TP
+.B EINVAL
+Either
+.I flags
+contains an invalid value, or
+.I rseq
+contains an address which is not appropriately aligned, or
+.I rseq_len
+contains a size that does not match the size received on registration.
+.TP
+.B ENOSYS
+The
+.BR rseq ()
+system call is not implemented by this kernel.
+.TP
+.B EFAULT
+.I rseq
+is an invalid address.
+.TP
+.B EBUSY
+A restartable sequence is already registered for this thread.
+.TP
+.B EPERM
+The
+.I sig
+argument on unregistration does not match the signature received
+on registration.
+
+.SH VERSIONS
+The
+.BR rseq ()
+system call was added in Linux 4.18.
+
+.SH CONFORMING TO
+.BR rseq ()
+is Linux-specific.
+
+.in
+.SH SEE ALSO
+.BR sched_getcpu (3),
+.BR membarrier (2)
-- 
2.11.0
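
The registration, CPU number query, and unregistration steps described in
the man page above can be sketched in C roughly as follows. This is only an
illustration under stated assumptions, not code from the patch: struct
demo_rseq is a hypothetical local mirror of the 32-byte layout (real code
should use <linux/rseq.h>), DEMO_RSEQ_SIG is an arbitrary signature picked
for the example, and every demo_* name is invented here. The sketch falls
back to sched_getcpu(3) whenever registration fails, for instance with
ENOSYS on a kernel without rseq, or when the C library has already
registered an rseq area for the thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical local mirror of the 32-byte struct rseq described above.
 * Real code should take the definition from <linux/rseq.h>. */
struct demo_rseq {
	uint32_t cpu_id_start;	/* always a possible CPU number */
	uint32_t cpu_id;	/* current CPU, negative when not registered */
	uint64_t rseq_cs;	/* critical section descriptor, 0 when none */
	uint32_t flags;
} __attribute__((aligned(32)));

_Static_assert(sizeof(struct demo_rseq) == 32, "unexpected rseq size");

#define DEMO_RSEQ_SIG		0x53053053U	/* arbitrary, for this sketch */
#define DEMO_FLAG_UNREGISTER	1		/* mirrors RSEQ_FLAG_UNREGISTER */

/* One area per thread: cpu_id starts at -1, cpu_id_start at 0. */
static __thread struct demo_rseq demo_rseq_area = {
	.cpu_id_start = 0,
	.cpu_id = (uint32_t)-1,
};
static __thread int demo_registered;

/* Thin wrapper around the raw system call (no libc wrapper is assumed). */
static int demo_rseq_call(int flags)
{
#ifdef SYS_rseq
	return (int)syscall(SYS_rseq, &demo_rseq_area,
			    sizeof(demo_rseq_area), flags, DEMO_RSEQ_SIG);
#else
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/* Returns 0 on success, -1 with errno set on failure. */
static int demo_rseq_register(void)
{
	/* The area must sit on a 32-byte boundary. */
	assert((uintptr_t)&demo_rseq_area % 32 == 0);
	if (demo_rseq_call(0) != 0)
		return -1;
	demo_registered = 1;
	return 0;
}

/* Must run before the memory backing the area is reclaimed. */
static int demo_rseq_unregister(void)
{
	if (!demo_registered) {
		errno = EINVAL;
		return -1;
	}
	if (demo_rseq_call(DEMO_FLAG_UNREGISTER) != 0)
		return -1;
	demo_registered = 0;
	return 0;
}

/* Relaxed read of the CPU number cache, with a libc fallback. */
static int demo_current_cpu(void)
{
	int32_t cpu = (int32_t)__atomic_load_n(&demo_rseq_area.cpu_id,
					       __ATOMIC_RELAXED);
	return cpu >= 0 ? cpu : sched_getcpu();
}
```

A thread would call demo_rseq_register() once when it starts, use
demo_current_cpu() on its hot paths, and call demo_rseq_unregister() before
its thread-local storage goes away. The value returned is only a hint: the
thread may migrate right after the read, which is why per-CPU updates still
need an rseq critical section or another fall-back mechanism.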
