linux-kernel.vger.kernel.org archive mirror
* Re: [RFC PATCH 0/3 v3] futex/sched: introduce FUTEX_SWAP operation
From: Jim Newsome @ 2021-03-17 17:57 UTC
  To: Peter Oskolkov; +Cc: linux-kernel, Rob Jansen, Ryan Wails

I'm not well versed in this part of the kernel (ok, any part, really),
but I wanted to chime in from a user perspective that I'm very
interested in this functionality.

We (Rob + Ryan + I, cc'd) are currently developing the second generation
of the Shadow simulator <https://shadow.github.io/>, which is used by
various researchers and the Tor Project. In this new architecture,
simulated network-application processes (such as tor, browsers, and web
servers) are each run as a native OS process, started by forking and
exec'ing its unmodified binary. We are interested in supporting large
simulations (e.g. 50k+ processes), and expect them to take on the order
of hours or even days to execute, so scalability and performance matter.

We've prototyped two mechanisms for controlling these simulated
processes, and a third hybrid mechanism that combines the two. I've
mentioned one of these (ptrace) in another thread ("do_wait: make
PIDTYPE_PID case O(1) instead of O(n)"). The other mechanism is to use
an LD_PRELOAD'd shim that implements the libc interface, and
communicates with Shadow via a syscall-like API over IPC.

So far the most performant version we've tried of this IPC is with a bit
of shared memory and a pair of semaphores. It looks much like the
example in Peter's proposal:

> a. T1: futex-wake T2, futex-wait
> b. T2: wakes, does what it has been woken to do
> c. T2: futex-wake T1, futex-wait
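
For illustration only, a handoff of this shape can be sketched with two
POSIX semaphores placed in an anonymous shared mapping. This is not
Shadow's actual IPC code: struct channel, sem_wait_retry and sem_pingpong
are hypothetical names, and the "message" is reduced to a single counter.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical channel layout, shared between parent and child. */
struct channel {
    sem_t to_child;   /* posted by the parent when a request is ready */
    sem_t to_parent;  /* posted by the child when the reply is ready */
    long  value;      /* stand-in for the request/reply payload */
};

/* sem_wait() can be interrupted by a signal; retry in that case. */
static void sem_wait_retry(sem_t *s)
{
    while (sem_wait(s) == -1 && errno == EINTR)
        ;
}

/*
 * Run `rounds` request/reply round trips between the caller and a forked
 * child. The child increments `value` once per request, so the function
 * returns `rounds` on success and -1 on a setup error.
 */
long sem_pingpong(int rounds)
{
    struct channel *ch = mmap(NULL, sizeof(*ch), PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (ch == MAP_FAILED)
        return -1;

    /* pshared = 1: the semaphores are used across processes. */
    if (sem_init(&ch->to_child, 1, 0) != 0 ||
        sem_init(&ch->to_parent, 1, 0) != 0) {
        munmap(ch, sizeof(*ch));
        return -1;
    }
    ch->value = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(ch, sizeof(*ch));
        return -1;
    }
    if (pid == 0) {
        /* Child: wait for a request, handle it, signal completion. */
        for (int i = 0; i < rounds; i++) {
            sem_wait_retry(&ch->to_child);
            ch->value += 1;
            sem_post(&ch->to_parent);
        }
        _exit(0);
    }

    /* Parent: wake the child, then block until it replies. */
    for (int i = 0; i < rounds; i++) {
        sem_post(&ch->to_child);
        sem_wait_retry(&ch->to_parent);
    }
    waitpid(pid, NULL, 0);

    long result = ch->value;
    sem_destroy(&ch->to_child);
    sem_destroy(&ch->to_parent);
    munmap(ch, sizeof(*ch));
    return result;
}
```

Each round trip costs a post plus a wait on both sides, which is the
wake + wait pair from steps a and c above.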

We've been able to get the switching costs down using CPU pinning and
SCHED_FIFO. Each physical CPU spends most of its time swapping back and
forth between a Shadow worker thread and an emulated process. Even so,
the new architecture is so far slower than the first generation of
Shadow, which multiplexes the simulated processes into its own handful
of OS processes (but is complex and fragile).
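
As a rough sketch (not Shadow's code), the pinning and SCHED_FIFO setup
described above might look like this on Linux. The helper names are
hypothetical. Requesting SCHED_FIFO normally needs CAP_SYS_NICE or a
nonzero RLIMIT_RTPRIO, so callers should expect it to fail in
unprivileged environments.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for cpu_set_t, CPU_* macros, sched_setaffinity() */
#endif
#include <sched.h>

/* Return the lowest-numbered CPU the caller may run on, or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Pin the calling thread to a single CPU. Returns 0, or -1 with errno set. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Ask for SCHED_FIFO at the lowest real-time priority.
 * Returns 0 on success, or -1 with errno set (often EPERM).
 */
int request_fifo(void)
{
    struct sched_param sp = {
        .sched_priority = sched_get_priority_min(SCHED_FIFO),
    };

    return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

A worker thread and the emulated process it drives would both be pinned
to the same CPU, so that a handoff between them never has to migrate.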

> With FUTEX_SWAP, steps a and c above can be reduced to one futex
> operation that runs 5-10 times faster.

IIUC the proposed primitives could let us further improve performance,
and perhaps drop some of the complexity of attempting to control the
scheduler via pinning and SCHED_FIFO.

* Re: [RFC PATCH 0/3 v3] futex/sched: introduce FUTEX_SWAP operation
From: Peter Oskolkov @ 2021-03-17 18:43 UTC
  To: Jim Newsome
  Cc: Peter Oskolkov, Linux Kernel Mailing List, Rob Jansen,
	Ryan Wails, Paul Turner, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Ben Segall

Hi Jim, thank you for your interest!

While FUTEX_SWAP seems to be a nonstarter, there is a discussion
off-list on how to approach the larger problem of userspace
scheduling. A full userspace scheduling patchset will likely need
some time to take shape, but the "core" patches of wait/wake/swap are
more or less ready, so I'll probably post an early RFC version here in
the next week or two.

CC-ing the maintainers.

Thanks,
Peter

* Re: [RFC PATCH 0/3 v3] futex/sched: introduce FUTEX_SWAP operation
From: Peter Oskolkov @ 2020-06-29 16:44 UTC
  To: Thomas Gleixner, Ingo Molnar
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Darren Hart,
	Peter Oskolkov, Vincent Guittot, Andrei Vagin, Paul Turner,
	Ben Segall, Aaron Lu

Hi Thomas, Ingo!

Do you have any comments/suggestions/objections here? FUTEX_SWAP seems
to be quite useful for fast task context switching, and several teams
at Google would like to see this capability upstreamed.

Thanks,
Peter

* [RFC PATCH 0/3 v3] futex/sched: introduce FUTEX_SWAP operation
From: Peter Oskolkov @ 2020-06-24 18:52 UTC
  To: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Darren Hart, Vincent Guittot
  Cc: Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu

From: Peter Oskolkov <posk@google.com>

This is an RFC!

As Paul Turner presented at LPC in 2013 ...
- pdf: http://pdxplumbers.osuosl.org/2013/ocw//system/presentations/1653/original/LPC%20-%20User%20Threading.pdf
- video: https://www.youtube.com/watch?v=KXuZi9aeGTw

... Google has developed an M:N userspace threading subsystem backed
by a Google-private SwitchTo Linux kernel API (page 17 in the pdf referenced
above). This subsystem gives latency-sensitive services at Google
fine-grained user-space control over scheduling, i.e. over what is running
when, and it is widely used internally (where it is called schedulers or
fibers).

This RFC patchset is the first step toward open-sourcing this work. As
explained in the linked pdf and video, the SwitchTo API has three core
operations: wait, resume, and swap (=switch). This patchset therefore adds
a FUTEX_SWAP operation that, together with FUTEX_WAIT and FUTEX_WAKE, will
provide a foundation on top of which user-space threading libraries can be
built.

Another common use case for FUTEX_SWAP is message passing à la RPC
between tasks: task/thread T1 prepares a message, wakes T2 to work on
it, and waits for the results; when T2 is done, it wakes T1 and waits
for more work to arrive. Currently the simplest way to implement this is:

a. T1: futex-wake T2, futex-wait
b. T2: wakes, does what it has been woken to do
c. T2: futex-wake T1, futex-wait

With FUTEX_SWAP, steps a and c above can be reduced to one futex operation
that runs 5-10 times faster.
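
To make steps a-c concrete, here is a minimal sketch of the existing
wake + wait pattern using only FUTEX_WAIT and FUTEX_WAKE, with one futex
word per thread. It is not taken from the patches: futex(), park(),
unpark(), struct pair and futex_pingpong() are hypothetical helpers, and
no FUTEX_SWAP call is shown since that operation is not in mainline.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has no futex() wrapper, so go through syscall(2). */
static long futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Sleep until *f is nonzero, then consume the token by resetting it to 0. */
static void park(_Atomic uint32_t *f)
{
    /* FUTEX_WAIT sleeps only if *f still equals 0; the loop also
     * absorbs spurious wakeups and EINTR. */
    while (atomic_exchange(f, 0) == 0)
        futex(f, FUTEX_WAIT_PRIVATE, 0);
}

/* Publish a token on *f and wake at most one waiter. */
static void unpark(_Atomic uint32_t *f)
{
    atomic_store(f, 1);
    futex(f, FUTEX_WAKE_PRIVATE, 1);
}

struct pair {
    _Atomic uint32_t t1_futex;  /* T1 parks here */
    _Atomic uint32_t t2_futex;  /* T2 parks here */
    int rounds;
    long work_done;             /* written by T2 only, read after join */
};

static void *t2_main(void *arg)
{
    struct pair *p = arg;

    for (int i = 0; i < p->rounds; i++) {
        park(&p->t2_futex);     /* wait for T1's request */
        p->work_done++;         /* step b: do the work */
        unpark(&p->t1_futex);   /* step c: wake T1, then loop back to wait */
    }
    return NULL;
}

/* Returns the number of completed round trips, or -1 on error. */
long futex_pingpong(int rounds)
{
    struct pair p = { .rounds = rounds };
    pthread_t t2;

    if (pthread_create(&t2, NULL, t2_main, &p) != 0)
        return -1;
    for (int i = 0; i < rounds; i++) {
        unpark(&p.t2_futex);    /* step a: wake T2 ... */
        park(&p.t1_futex);      /* ... and wait for the result */
    }
    pthread_join(t2, NULL);
    return p.work_done;
}
```

Every handoff here is an unpark() followed by a park(), i.e. two futex
system calls; that pair is what the proposed single operation would
replace.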

Patches in this patchset:

Patch 1: introduce FUTEX_SWAP futex operation that,
         internally, does wake + wait. The purpose of this patch is
         to work out the API.
Patch 2: a first rough attempt to make FUTEX_SWAP faster than
         what wake + wait can do.
Patch 3: a selftest that can also be used to benchmark FUTEX_SWAP vs
         FUTEX_WAKE + FUTEX_WAIT.

v2: fix undefined-symbol error when CONFIG_SMP is not set.
v3: rebased onto the latest tip/locking/core.

Peter Oskolkov (3):
  futex: introduce FUTEX_SWAP operation
  futex/sched: add wake_up_process_prefer_current_cpu, use in FUTEX_SWAP
  selftests/futex: add futex_swap selftest

 include/linux/sched.h                         |   1 +
 include/uapi/linux/futex.h                    |   2 +
 kernel/futex.c                                |  96 ++++++--
 kernel/sched/core.c                           |   5 +
 kernel/sched/fair.c                           |   3 +
 kernel/sched/sched.h                          |   1 +
 .../selftests/futex/functional/.gitignore     |   1 +
 .../selftests/futex/functional/Makefile       |   1 +
 .../selftests/futex/functional/futex_swap.c   | 209 ++++++++++++++++++
 .../selftests/futex/include/futextest.h       |  19 ++
 10 files changed, 322 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/futex/functional/futex_swap.c

--
2.25.1

