linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Randy Dunlap <rdunlap@infradead.org>
To: "André Almeida" <andrealmeid@collabora.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Darren Hart" <dvhart@infradead.org>,
	linux-kernel@vger.kernel.org,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>
Cc: kernel@collabora.com, krisman@collabora.com,
	pgriffais@valvesoftware.com, z.figura12@gmail.com,
	joel@joelfernandes.org, malteskarupke@fastmail.fm,
	linux-api@vger.kernel.org, fweimer@redhat.com,
	libc-alpha@sourceware.org, linux-kselftest@vger.kernel.org,
	shuah@kernel.org, acme@kernel.org, corbet@lwn.net
Subject: Re: [RFC PATCH 06/13] docs: locking: futex2: Add documentation
Date: Tue, 16 Feb 2021 10:34:05 -0800	[thread overview]
Message-ID: <4e70780e-1a4d-79a9-e183-44a3454e0298@infradead.org> (raw)
In-Reply-To: <20210215152404.250281-7-andrealmeid@collabora.com>

On 2/15/21 7:23 AM, André Almeida wrote:
> Add a new documentation file specifying both userspace API and internal
> implementation details of futex2 syscalls.
> 
> Signed-off-by: André Almeida <andrealmeid@collabora.com>
> ---
>  Documentation/locking/futex2.rst | 198 +++++++++++++++++++++++++++++++
>  Documentation/locking/index.rst  |   1 +
>  2 files changed, 199 insertions(+)
>  create mode 100644 Documentation/locking/futex2.rst
> 
> diff --git a/Documentation/locking/futex2.rst b/Documentation/locking/futex2.rst
> new file mode 100644
> index 000000000000..edd47c22f2df
> --- /dev/null
> +++ b/Documentation/locking/futex2.rst
> @@ -0,0 +1,198 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +======
> +futex2
> +======
> +
> +:Author: André Almeida <andrealmeid@collabora.com>
> +
> +futex, or fast user mutex, is a set of syscalls to allow the userspace to create

                                                       drop ^^^ "the"

> +performant synchronization mechanisms, such as mutexes, semaphores and
> +conditional variables in userspace. C standard libraries, like glibc, uses it
> +as means to implements more high level interfaces like pthreads.

   as a means to implement

> +
> +The interface
> +=============
> +
> +uAPI functions
> +--------------
> +
> +.. kernel-doc:: kernel/futex2.c
> +   :identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_futex_requeue		
> +
> +uAPI structures
> +---------------
> +
> +.. kernel-doc:: include/uapi/linux/futex.h
> +
> +The ``flag`` argument
> +---------------------
> +
> +The flag is used to specify the size of the futex word
> +(FUTEX_[8, 16, 32]). It's mandatory to define one, since there's no
> +default size.
> +
> +By default, the timeout uses a monotonic clock, but can be used as a realtime
> +one by using the FUTEX_REALTIME_CLOCK flag.
> +
> +By default, futexes are of the private type, that means that this user address
> +will be accessed by threads that shares the same memory region. This allows for

                                    share

> +some internal optimizations, so they are faster. However, if the address needs
> +to be shared with different processes (like using ``mmap()`` or ``shm()``), they
> +need to be defined as shared and the flag FUTEX_SHARED_FLAG is used to set that.
> +
> +By default, the operation has no NUMA-awareness, meaning that the user can't
> +choose the memory node where the kernel side futex data will be stored. The
> +user can choose the node where it wants to operate by setting the
> +FUTEX_NUMA_FLAG and using the following structure (where X can be 8, 16, or
> +32)::
> +
> + struct futexX_numa {
> +         __uX value;
> +         __sX hint;
> + };
> +
> +This structure should be passed at the ``void *uaddr`` of futex functions. The
> +address of the structure will be used to be waited on/waken on, and the
> +``value`` will be compared to ``val`` as usual. The ``hint`` member is used to
> +defined which node the futex will use. When waiting, the futex will be

   define

> +registered on a kernel-side table stored on that node; when waking, the futex
> +will be searched for on that given table. That means that there's no redundancy
> +between tables, and the wrong ``hint`` value will led to undesired behavior.

                                                     lead

> +Userspace is responsible for dealing with node migrations issues that may
> +occur. ``hint`` can range from [0, MAX_NUMA_NODES], for specifying a node, or

                             from [0, MAX_NUMA_NODES)
i.e., what does hint == MAX_NUMA_NODES mean?

> +-1, to use the same node the current process is using.
> +
> +When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be stored on a
> +global table on some node, defined at compilation time.

   defined how at compilation time?

> +
> +The ``timo`` argument
> +---------------------
> +
> +As per the Y2038 work done in the kernel, new interfaces shouldn't add timeout
> +options known to be buggy. Given that, ``timo`` should be a 64bit timeout at

                                                               64-bit

> +all platforms, using an absolute timeout value.
> +
> +Implementation
> +==============
> +
> +The internal implementation follows a similar design to the original futex.
> +Given that we want to replicate the same external behavior of current futex,
> +this should be somewhat expected.
> +
> +Waiting
> +-------
> +
> +For the wait operations, they are all treated as if you want to wait on N
> +futexes, so the path for futex_wait and futex_waitv is the basically the same.
> +For both syscalls, the first step is to prepare an internal list for the list
> +of futexes to wait for (using struct futexv_head). For futex_wait() calls, this
> +list will have a single object.
> +
> +We have a hash table, were waiters register themselves before sleeping.  Then,

                         where                                              Then <no comma>

> +the wake function checks this table looking for waiters at uaddr.  The hash
> +bucket to be used is determined by a struct futex_key, that stores information
> +to uniquely identify an address from a given process. Given the huge address
> +space, there'll be hash collisions, so we store information to be later used on
> +collision treatment.
> +
> +First, for every futex we want to wait on, we check if (``*uaddr == val``).
> +This check is done holding the bucket lock, so we are correctly serialized with
> +any futex_wake() calls. If any waiter fails the check above, we dequeue all
> +futexes. The check (``*uaddr == val``) can fail for two reasons:
> +
> +- The values are different, and we return -EAGAIN. However, if while
> +  dequeueing we found that some futex were awakened, we prioritize this

                                   futexes

> +  and return success.
> +
> +- When trying to access the user address, we do so with page faults
> +  disabled because we are holding a bucket's spin lock (and can't sleep
> +  while holding a spin lock). If there's an error, it might be a page
> +  fault, or an invalid address. We release the lock, dequeue everyone
> +  (because it's illegal to sleep while there are futexes enqueued, we
> +  could lose wakeups) and try again with page fault enabled. If we
> +  succeeded, this means that the address is valid, but we need to do

     succeed,

> +  all the work again. For serialization reasons, we need to have the
> +  spin lock when getting the user value. Additionally, for shared
> +  futexes, we also need to recalculate the hash, since the underlying
> +  mapping mechanisms could have changed when dealing with page fault.
> +  If, even with page fault enabled, we can't access the address, it
> +  means it's an invalid user address, and we return -EFAULT. For this
> +  case, we prioritize the error, even if some futex were awaken.

                                                 futexes
> +
> +If the check is OK, they are enqueued on a linked list in our bucket, and
> +proceed to the next one. If all waiters succeed, we put the thread to sleep
> +until a futex_wake() call, timeout expires or we get a signal. After waking up,
> +we dequeue everyone, and check if some futex was awaken. This dequeue is done by

                                                was awakened.

> +iteratively walking at each element of struct futex_head list.
> +
> +All enqueuing/dequeuing operations requires to hold the bucket lock, to avoid
> +racing while modifying the list.
> +
> +Waking
> +------
> +
> +We get the bucket that's storing the waiters at uaddr, and wake the required
> +number of waiters, checking for hash collision.
> +
> +There's an optimization that makes futex_wake() not taking the bucket lock if

                                                       take

> +there's no one to be wake on that bucket. It checks an atomic counter that each

                        woken / awakened ?

> +bucket has, if it says 0, than the syscall exits. In order to this work, the

                             then                             for this to work, the

> +waiter thread increases it before taking the lock, so the wake thread will
> +correctly see that there's someone waiting and will continue the path to take
> +the bucket lock. To get the correct serialization, the waiter issues a memory
> +barrier after increasing the bucket counter and the waker issues a memory
> +barrier before checking it.
> +
> +Requeuing
> +---------
> +
> +The requeue path first checks for each struct futex_requeue and their flags.
> +Then, it will compare the excepted value with the one at uaddr1::uaddr.

                             accepted ?

> +Following the same serialization explained at Waking_, we increase the atomic
> +counter for the bucket of uaddr2 before taking the lock. We need to have both
> +buckets locks at same time so we don't race with others futexes operations. To

                                                    other futex operations.

> +ensure the locks are taken in the same order for all threads (and thus avoiding
> +deadlocks), every requeue operation takes the "smaller" bucket first, when
> +comparing both addresses.
> +
> +If the compare with user value succeeds, we proceed by waking ``nr_wake``
> +futexes, and then requeuing ``nr_requeue`` from bucket of uaddr1 to the uaddr2.
> +This consists in a simple list deletion/addition and replacing the old futex key
> +for the new one.

   with the new one.
(I think.)

> +
> +Futex keys
> +----------
> +
> +There are two types of futexes: private and shared ones. The private are futexes
> +meant to be used by threads that shares the same memory space, are easier to be

                                    share

> +uniquely identified an thus can have some performance optimization. The elements

                       and

> +for identifying one are: the start address of the page where the address is,
> +the address offset within the page and the current->mm pointer.
> +
> +Now, for uniquely identifying shared futex:

                                 a shared futex:

> +
> +- If the page containing the user address is an anonymous page, we can
> +  just use the same data used for private futexes (the start address of
> +  the page, the address offset within the page and the current->mm
> +  pointer) that will be enough for uniquely identifying such futex. We

     pointer); that

> +  also set one bit at the key to differentiate if a private futex is
> +  used on the same address (mixing shared and private calls do not

                                                               does not

> +  work).
> +
> +- If the page is file-backed, current->mm maybe isn't the same one for
> +  every user of this futex, so we need to use other data: the
> +  page->index, an UUID for the struct inode and the offset within the

                  a UUID

> +  page.
> +
> +Note that members of futex_key doesn't have any particular meaning after they

                                  don't

> +are part of the struct - they are just bytes to identify a futex.  Given that,
> +we don't need to use a particular name or type that matches the original data,
> +we only need to care about the bitsize of each component and make both private
> +and shared fit in the same memory space.
> +
> +Source code documentation
> +=========================
> +
> +.. kernel-doc:: kernel/futex2.c
> +   :no-identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_futex_requeue


-- 
~Randy


  reply	other threads:[~2021-02-16 18:35 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-15 15:23 [RFC PATCH 00/13] Add futex2 syscalls André Almeida
2021-02-15 15:23 ` [RFC PATCH 01/13] futex2: Implement wait and wake functions André Almeida
2021-02-15 19:59   ` Gabriel Krisman Bertazi
2021-02-18 13:29     ` André Almeida
2021-02-18 15:48       ` Gabriel Krisman Bertazi
2021-02-16  9:02   ` Peter Zijlstra
2021-02-18 20:09     ` André Almeida
2021-02-16  9:35   ` Peter Zijlstra
2021-02-16  9:56   ` Peter Zijlstra
2021-02-16 10:20     ` Sebastian Andrzej Siewior
2021-02-16 12:42       ` Peter Zijlstra
2021-02-16 22:12     ` Gabriel Krisman Bertazi
2021-02-15 15:23 ` [RFC PATCH 02/13] futex2: Add support for shared futexes André Almeida
2021-02-15 15:23 ` [RFC PATCH 03/13] futex2: Implement vectorized wait André Almeida
2021-02-15 20:03   ` Gabriel Krisman Bertazi
2021-02-15 20:06     ` Zebediah Figura
2021-02-15 20:08   ` Gabriel Krisman Bertazi
2021-02-15 15:23 ` [RFC PATCH 04/13] futex2: Implement requeue operation André Almeida
2021-02-15 15:23 ` [RFC PATCH 05/13] futex2: Add compatibility entry point for x86_x32 ABI André Almeida
2021-02-15 15:23 ` [RFC PATCH 06/13] docs: locking: futex2: Add documentation André Almeida
2021-02-16 18:34   ` Randy Dunlap [this message]
2021-02-18 19:12     ` André Almeida
2021-02-15 15:23 ` [RFC PATCH 07/13] selftests: futex2: Add wake/wait test André Almeida
2021-02-15 15:23 ` [RFC PATCH 08/13] selftests: futex2: Add timeout test André Almeida
2021-02-15 15:24 ` [RFC PATCH 09/13] selftests: futex2: Add wouldblock test André Almeida
2021-02-15 15:24 ` [RFC PATCH 10/13] selftests: futex2: Add waitv test André Almeida
2021-02-15 15:24 ` [RFC PATCH 11/13] selftests: futex2: Add requeue test André Almeida
2021-02-15 15:24 ` [RFC PATCH 12/13] perf bench: Add futex2 benchmark tests André Almeida
2021-02-15 15:24 ` [RFC PATCH 13/13] kernel: Enable waitpid() for futex2 André Almeida

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4e70780e-1a4d-79a9-e183-44a3454e0298@infradead.org \
    --to=rdunlap@infradead.org \
    --cc=acme@kernel.org \
    --cc=andrealmeid@collabora.com \
    --cc=bigeasy@linutronix.de \
    --cc=corbet@lwn.net \
    --cc=dvhart@infradead.org \
    --cc=fweimer@redhat.com \
    --cc=joel@joelfernandes.org \
    --cc=kernel@collabora.com \
    --cc=krisman@collabora.com \
    --cc=libc-alpha@sourceware.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=malteskarupke@fastmail.fm \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pgriffais@valvesoftware.com \
    --cc=rostedt@goodmis.org \
    --cc=shuah@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=z.figura12@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).