linux-kernel.vger.kernel.org archive mirror
* Re: [RFC PATCH 00/13] Add futex2 syscalls
@ 2021-02-16 12:13 Andrey Semashev
  2021-02-16 12:17 ` Andrey Semashev
  0 siblings, 1 reply; 3+ messages in thread
From: Andrey Semashev @ 2021-02-16 12:13 UTC (permalink / raw)
  To: LKML

Sorry for posting out of thread; I just subscribed to the list to reply to
a post that was already sent.

André Almeida wrote:

> ** "And what about FUTEX_64?"
> 
>  By supporting 64 bit futexes, the kernel structure for futex would
>  need to have a 64 bit field for the value, and that could defeat one of
>  the purposes of having different sized futexes in the first place:
>  supporting smaller ones to decrease memory usage. This might be
>  something that could be disabled for 32bit archs (and even for
>  CONFIG_BASE_SMALL).
> 
>  Which use case would benefit from FUTEX_64? Is it worth the trade-offs?

I strongly believe that 64-bit futex must be supported. I have a few use 
cases in mind:

1. Cooperative robust futexes.

I have a real-world case where multiple processes need to communicate 
via shared memory and synchronize via a futex. The processes run under a 
supervisor parent process, which can detect termination of its children 
and also has access to the shared memory. In order to make the 
communication more or less safe in the face of one of the child processes
crashing, the futex currently contains a portion of the pid of the process
that locked it. The parent supervisor is then able to tell that the
crashed child was holding the futex locked, mark the futex as
"broken" and notify any other threads blocked on it.

Given that a pid can be up to 32 bits in size, and we also need some bits
in the futex to implement its logic (i.e. at least "locked" and "broken"
bits, some bits for the ABA counter, etc.), the pid can be truncated and
the above logic may break. In the real application, only 15 bits are
left for the pid, which is already less than the actual pid range on the
system.
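
To make the truncation concrete, here is a minimal sketch of the two
layouts. The bit assignment is an assumption made up for illustration
(1 "locked" bit, 1 "broken" bit, 15 ABA bits, 15 pid bits), not the layout
of the real application, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit layout (illustration only):
 * bit 0 = locked, bit 1 = broken, bits 2..16 = ABA counter,
 * bits 17..31 = the low 15 bits of the owner pid. */
#define LOCKED_BIT UINT32_C(1)
#define BROKEN_BIT UINT32_C(2)
#define ABA_SHIFT  2
#define ABA_MASK   UINT32_C(0x7fff)
#define PID_SHIFT  17
#define PID_MASK   UINT32_C(0x7fff)

static uint32_t pack32(uint32_t pid, uint32_t aba, int locked)
{
    return ((pid & PID_MASK) << PID_SHIFT) |
           ((aba & ABA_MASK) << ABA_SHIFT) |
           (locked ? LOCKED_BIT : 0);
}

/* Only 15 bits of the pid survive in the 32-bit word. */
static uint32_t owner32(uint32_t word)
{
    return (word >> PID_SHIFT) & PID_MASK;
}

/* Hypothetical 64-bit layout: the high 32 bits hold the full pid,
 * the low 32 bits hold the same flags plus a wider ABA counter. */
static uint64_t pack64(uint32_t pid, uint32_t aba, int locked)
{
    uint32_t ctl = ((aba & UINT32_C(0x3fffffff)) << ABA_SHIFT) |
                   (locked ? LOCKED_BIT : 0);
    return ((uint64_t)pid << 32) | ctl;
}

static uint32_t owner64(uint64_t word)
{
    return (uint32_t)(word >> 32);
}
```

With this layout, pids 1000 and 33768 (1000 + 32768) produce the same
owner field in the 32-bit word, so the supervisor cannot tell which of
the two held the lock. The 64-bit word keeps them distinct.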

Note: We're not using the proper pthread robust mutexes because we also
need a condition variable, and condition variables contain a non-robust
mutex internally, which basically nullifies robustness. One could argue
for fixing pthread instead, but I view that as a more difficult task as
the pthread interface is standardized. We would rather use futex directly
anyway because of more flexibility and less performance overhead.

2. Parity with WaitOnAddress[1] on Windows.

WaitOnAddress is explicitly documented to support 8-byte states, and its
interface allows for further extension. I'm not a Wine developer, but I
would guess that having 8-byte futex support to match would be useful
there.

Besides Wine, having a 64-bit futex would be important for
std::atomic[2] and Boost.Atomic in C++, which support waiting and
notifying operations (for std::atomic, introduced in C++20). Waiting and
notifying operations are normally implemented using the futex API on Linux
and WaitOnAddress on Windows, and can be emulated with a process-wide
global mutex pool if such an API is unavailable for a given atomic size on
the target platform. This means that waiting and notifying on 64-bit
atomics on Linux currently must be implemented with a lock and therefore
cannot be used in process-shared memory, while there is no such
limitation on Windows.
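
For reference, a minimal sketch of how wait and notify map onto the
existing futex() syscall for a 32-bit word. The wrapper names wait32()
and wake32() are hypothetical; FUTEX_PRIVATE_FLAG is deliberately left
out so the word may live in process-shared memory. The kernel only
compares 32 bits here, which is why the same mapping is not available
for an 8-byte state.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep while *addr == expected. Returns 0 when woken, or -1 with errno
 * set (EAGAIN if *addr already differs from expected). */
static long wait32(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake up to count waiters blocked on addr. Returns the number woken. */
static long wake32(uint32_t *addr, int count)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, count, NULL, NULL, 0);
}
```

A caller would loop on wait32() while the atomic still holds the old
value, since spurious wakeups are possible.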


I'm not sure how much memory is saved by not having 64-bit state in the
kernel futex structures, but this doesn't look like a huge deal on
modern systems - server, desktop or mobile. Omitting it may make sense for
extremely low memory embedded systems, and for those targets the support
may be disabled with a switch. In fact, such systems would probably not
support 64-bit atomics anyway. For any other targets I would prefer
64-bit futex to be available by default.

My main issue with 64-bit being optional though is that applications and
libraries like Boost.Atomic would like (or even require) to know if the
feature is available at compile time rather than run time. std::atomic,
for example, is supposed to be a thin abstraction over atomic
instructions and OS primitives like futex, so performing runtime
detection of the available features in the kernel would be detrimental
there. I'm not sure if this is possible in the current kernel
infrastructure, but it would be best if the lack of 64-bit futexes in
the kernel was detectable through kernel headers (e.g. by a macro for
64-bit futexes not being defined or something like that), which means
the headers must be generated at kernel configuration time.
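
A minimal sketch of what such compile-time detection could look like on
the library side. FUTEX_HAS_64BIT is a made-up macro name used only for
illustration; no kernel header defines it, and the helper name is
hypothetical as well.

```c
#include <assert.h>
#include <linux/futex.h>

/* FUTEX_HAS_64BIT is hypothetical: it stands in for whatever macro the
 * kernel headers would define only when 64-bit futexes are configured. */
#if defined(FUTEX_HAS_64BIT)
#define HAVE_NATIVE_WAIT64 1
#else
#define HAVE_NATIVE_WAIT64 0
#endif

/* Resolved at compile time: 1 selects the futex-based 64-bit wait,
 * 0 selects the lock-pool emulation, with no runtime probing. */
static int native_wait64_available(void)
{
    return HAVE_NATIVE_WAIT64;
}
```

The point of the design is that the choice is fixed by the preprocessor,
so the fast path carries no feature check.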

[1]: 
https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitonaddress
[2]: https://en.cppreference.com/w/cpp/atomic/atomic


* Re: [RFC PATCH 00/13] Add futex2 syscalls
  2021-02-16 12:13 [RFC PATCH 00/13] Add futex2 syscalls Andrey Semashev
@ 2021-02-16 12:17 ` Andrey Semashev
  0 siblings, 0 replies; 3+ messages in thread
From: Andrey Semashev @ 2021-02-16 12:17 UTC (permalink / raw)
  To: LKML; +Cc: andrealmeid

Adding André Almeida to CC.



* [RFC PATCH 00/13] Add futex2 syscalls
@ 2021-02-15 15:23 André Almeida
  0 siblings, 0 replies; 3+ messages in thread
From: André Almeida @ 2021-02-15 15:23 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior
  Cc: kernel, krisman, pgriffais, z.figura12, joel, malteskarupke,
	linux-api, fweimer, libc-alpha, linux-kselftest, shuah, acme,
	corbet, André Almeida

Hi,

This patch series introduces the futex2 syscalls.

* What happened to the current futex()?

For some years now, developers have been trying to add new features to
futex, but maintainers have been reluctant to accept them, given that the
multiplexed interface is full of legacy features and makes big changes
tricky. Some problems that people tried to address with patchsets are:
NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2].
NUMA, for instance, just doesn't fit the current API in a reasonable
way. Considering that, it's not possible to merge new features into the
current futex.

 ** The NUMA problem

 In the current implementation, all kernel-side futex infrastructure is
 stored on a single node. Given that, all futex() calls issued by
 processors that aren't located on that node incur a memory access
 penalty.

 ** The 32bit sized futex problem

 Embedded systems or anything with memory constraints would benefit from
 using smaller sizes for the futex userspace integer. Also, a mutex
 implementation can be done using just three values, so 8 bits is enough
 for various scenarios.
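
 A minimal sketch of such a three-value mutex on an 8-bit word, using C11
 atomics. The state names and mutex8_* helpers are hypothetical and the
 sleeping part is omitted, since it would need the 8-bit futex wait this
 series proposes. Only the state transitions are shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* The three values: free, held with no waiters, held with waiters. */
enum { UNLOCKED = 0, LOCKED = 1, CONTENDED = 2 };

typedef struct {
    _Atomic uint8_t state;
} mutex8;

/* Fast path: UNLOCKED -> LOCKED. Returns true on success. */
static bool mutex8_try_lock(mutex8 *m)
{
    uint8_t expected = UNLOCKED;
    return atomic_compare_exchange_strong(&m->state, &expected, LOCKED);
}

/* Slow path step: mark the lock contended before sleeping. Returns true
 * if the previous value was UNLOCKED, i.e. the caller now owns the lock
 * and must not sleep. */
static bool mutex8_lock_contended(mutex8 *m)
{
    return atomic_exchange(&m->state, CONTENDED) == UNLOCKED;
}

/* Release the lock. Returns true if a waiter has to be woken. */
static bool mutex8_unlock(mutex8 *m)
{
    return atomic_exchange(&m->state, UNLOCKED) == CONTENDED;
}
```

 A full lock() would loop on mutex8_lock_contended() and sleep on the
 word between attempts; unlock() would issue a wake only when it
 returns true.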

 ** The wait on multiple problem

 The use case lies in the Wine implementation of the Windows NT interface
 WaitForMultipleObjects. This Windows API function allows a thread to sleep
 waiting on the first of a set of event sources (mutexes, timers, signals,
 console input, etc.) to signal. Considering this is a primitive
 synchronization operation for Windows applications, being able to quickly
 signal events on the producer side, and quickly go to sleep on the
 consumer side, is essential for good performance of applications running
 over Wine.

[0] https://lore.kernel.org/lkml/20160505204230.932454245@linutronix.de/
[1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@web.de/
[2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@collabora.com/

* The solution

As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
is required to solve this, which must be designed with those features in
mind. futex2() is that interface. As opposed to the current multiplexed
interface, the new one should have one syscall per operation. This will
keep the API maintainable if it gets extended, and will help
users with type checking of arguments.

In particular, the new interface is extended to support the ability to
wait on any of a list of futexes at a time, which could be seen as a
vectored extension of the FUTEX_WAIT semantics.
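
As a rough userspace model of that vectored check (not the kernel code
from this series): FUTEX_WAIT only sleeps if the word still holds the
expected value, and the sketch below applies the same test to every
entry of a list. struct wait_entry and waitv_precheck() are hypothetical
names, and the real futex_waitv() argument layout is defined in the
patches, not here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* One (address, expected value) pair, as in a single FUTEX_WAIT. */
struct wait_entry {
    _Atomic uint32_t *uaddr;
    uint32_t expected;
};

/* Returns the index of the first entry whose current value differs from
 * its expected value (the wait would not sleep), or -1 if every entry
 * still matches (the wait would sleep on all of them). */
static int waitv_precheck(const struct wait_entry *v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (atomic_load(v[i].uaddr) != v[i].expected)
            return (int)i;
    }
    return -1;
}
```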

[3] https://lore.kernel.org/lkml/20200303120050.GC2596@hirez.programming.kicks-ass.net/

* The interface

The new interface can be seen in detail in the following patches, but
this is a high-level summary of what the interface can do:

 - Supports wake/wait semantics, as in futex()
 - Supports requeue operations, similar to FUTEX_CMP_REQUEUE, but with
   individual flags for each address
 - Supports waiting for a vector of futexes, using a new syscall named
   futex_waitv()
 - Supports variable sized futexes (8 bits, 16 bits and 32 bits)
 - Supports NUMA-aware operations, where the user can specify which
   memory node to operate on

* Implementation

The internal implementation follows a similar design to the original futex.
Given that we want to replicate the same external behavior of current
futex, this should be somewhat expected. For some functions, like the
init and the code to get a shared key, I literally copied code and
comments from kernel/futex.c. I decided to do so instead of exposing the
original function as a public function since in that way we can freely
modify our implementation if required, without any impact on old futex.
Also, the comments precisely describe the details and corner cases of
the implementation.

Each patch contains a brief description of implementation, but patch 6
"docs: locking: futex2: Add documentation" adds a more complete document
about it.

* The patchset

This patchset can also be found in my git tree:

https://gitlab.collabora.com/tonyk/linux/-/tree/futex2

  - Patch 1: Implements wait/wake, and the basic foundations of futex2

  - Patches 2-4: Implement the remaining features (shared, waitv, requeue).

  - Patch 5: Adds the x86_x32 ABI handling. I kept it in a separate
    patch since I'm not sure if x86_x32 is still a thing, or if it should
    return -ENOSYS.

  - Patch 6: Adds a documentation file which details the interface and
    the internal implementation.

  - Patches 7-12: Selftests for all operations along with perf
    support for futex2.

  - Patch 13: While working on porting glibc to futex2, I found out
    that there's a futex_wake() call in the user thread exit path, if
    that thread was created with clone(..., CLONE_CHILD_CLEARTID, ...). In
    order to make pthreads work with futex2, it was required to add
    this patch. Note that this is more a proof-of-concept of what we
    will need to do in future, rather than part of the interface, and
    shouldn't be merged as it is.

* Testing:

This patchset provides selftests for each operation and their flags.
Along with that, the following work was done:

 ** Stability

 To stress the interface in "real world scenarios":

 - glibc[4]: nptl's low level locking was modified to use futex2 API
   (except for robust and PI things). All relevant nptl/ tests passed.

 - Wine[5]: Proton/Wine was modified in order to use futex2() for the
   emulation of Windows NT sync mechanisms based on futex, called "fsync".
   Triple-A games with heavy CPU loads and tons of parallel jobs worked
   as expected when compared with the previous FUTEX_WAIT_MULTIPLE
   implementation in futex(). Some games issue 42k futex2() calls
   per second.

 - Full GNU/Linux distro: I installed the modified glibc on my host
   machine, so all pthread programs would use futex2(). After tweaking
   systemd[6] to allow futex2() calls in seccomp, everything worked as
   expected (web browsers do some syscall sandboxing and need some
   configuration as well).

 - perf: The perf benchmark tests can also be used to stress the
   interface, and they can be found in this patchset.

 ** Performance

 - For comparing futex() and futex2() performance, I used the artificial
   benchmarks implemented in perf (wake, wake-parallel, hash and
   requeue). The setup was 200 runs for each test, using 8, 80, 800 and
   8000 for the number of threads. Note that for this test, I'm not using
   patch 13 ("kernel: Enable waitpid() for futex2"), for reasons explained
   in "The patchset" section.

 - For the first three, I measured an average 4% gain in
   performance. This is not a big step, but it shows that the new
   interface is at least comparable in performance with the current one.

 - For requeue, I measured an average 21% decrease in performance
   compared to the original futex implementation. This is expected given
   the new design with individual flags. The performance trade-offs are
   explained in patch 4 ("futex2: Implement requeue operation").

[4] https://gitlab.collabora.com/tonyk/glibc/-/tree/futex2
[5] https://gitlab.collabora.com/tonyk/wine/-/tree/proton_5.13
[6] https://gitlab.collabora.com/tonyk/systemd

* FAQ

 ** "Where's the code for NUMA and FUTEX_8/16?"

 The current code is already complex enough to take some time to
 review, so I believe it's better to split that work out to a future
 iteration of this patchset. Besides that, this RFC is the core part of the
 infrastructure, and the following features will not require big design
 changes to it; the work will be more about wiring up the flags and
 modifying some functions.

 ** "And what about FUTEX_64?"

 By supporting 64 bit futexes, the kernel structure for futex would
 need to have a 64 bit field for the value, and that could defeat one of
 the purposes of having different sized futexes in the first place:
 supporting smaller ones to decrease memory usage. This might be
 something that could be disabled for 32bit archs (and even for
 CONFIG_BASE_SMALL).

 Which use case would benefit from FUTEX_64? Is it worth the trade-offs?

 ** "Where's the PI/robust stuff?"

 As said by Peter Zijlstra in [3], all those new features are related to
 the "simple" futex interface, which doesn't use PI or robust. Do we want
 to have this complexity in futex2() and if so, should it be part of
 this patchset or can it be future work?

Thanks,
	André

André Almeida (13):
  futex2: Implement wait and wake functions
  futex2: Add support for shared futexes
  futex2: Implement vectorized wait
  futex2: Implement requeue operation
  futex2: Add compatibility entry point for x86_x32 ABI
  docs: locking: futex2: Add documentation
  selftests: futex2: Add wake/wait test
  selftests: futex2: Add timeout test
  selftests: futex2: Add wouldblock test
  selftests: futex2: Add waitv test
  selftests: futex2: Add requeue test
  perf bench: Add futex2 benchmark tests
  kernel: Enable waitpid() for futex2

 Documentation/locking/futex2.rst              |  198 +++
 Documentation/locking/index.rst               |    1 +
 MAINTAINERS                                   |    2 +-
 arch/arm/tools/syscall.tbl                    |    4 +
 arch/arm64/include/asm/unistd.h               |    2 +-
 arch/arm64/include/asm/unistd32.h             |    4 +
 arch/x86/entry/syscalls/syscall_32.tbl        |    4 +
 arch/x86/entry/syscalls/syscall_64.tbl        |    4 +
 fs/inode.c                                    |    1 +
 include/linux/compat.h                        |   23 +
 include/linux/fs.h                            |    1 +
 include/linux/syscalls.h                      |   18 +
 include/uapi/asm-generic/unistd.h             |   14 +-
 include/uapi/linux/futex.h                    |   56 +
 init/Kconfig                                  |    7 +
 kernel/Makefile                               |    1 +
 kernel/fork.c                                 |    2 +
 kernel/futex2.c                               | 1255 +++++++++++++++++
 kernel/sys_ni.c                               |    6 +
 tools/arch/x86/include/asm/unistd_64.h        |   12 +
 tools/include/uapi/asm-generic/unistd.h       |   11 +-
 .../arch/x86/entry/syscalls/syscall_64.tbl    |    3 +
 tools/perf/bench/bench.h                      |    4 +
 tools/perf/bench/futex-hash.c                 |   24 +-
 tools/perf/bench/futex-requeue.c              |   57 +-
 tools/perf/bench/futex-wake-parallel.c        |   41 +-
 tools/perf/bench/futex-wake.c                 |   37 +-
 tools/perf/bench/futex.h                      |   47 +
 tools/perf/builtin-bench.c                    |   18 +-
 .../selftests/futex/functional/.gitignore     |    3 +
 .../selftests/futex/functional/Makefile       |    8 +-
 .../futex/functional/futex2_requeue.c         |  164 +++
 .../selftests/futex/functional/futex2_wait.c  |  209 +++
 .../selftests/futex/functional/futex2_waitv.c |  157 +++
 .../futex/functional/futex_wait_timeout.c     |   58 +-
 .../futex/functional/futex_wait_wouldblock.c  |   33 +-
 .../testing/selftests/futex/functional/run.sh |    6 +
 .../selftests/futex/include/futex2test.h      |  121 ++
 38 files changed, 2563 insertions(+), 53 deletions(-)
 create mode 100644 Documentation/locking/futex2.rst
 create mode 100644 kernel/futex2.c
 create mode 100644 tools/testing/selftests/futex/functional/futex2_requeue.c
 create mode 100644 tools/testing/selftests/futex/functional/futex2_wait.c
 create mode 100644 tools/testing/selftests/futex/functional/futex2_waitv.c
 create mode 100644 tools/testing/selftests/futex/include/futex2test.h

-- 
2.30.1

