All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ben Maurer <bmaurer@fb.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	David Goldblatt <davidgoldblatt@fb.com>, Qi Wang <qiwang@fb.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Paul Turner <pjt@google.com>, Andrew Hunter <ahh@google.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Dave Watson <davejwatson@fb.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Will Deacon <will.deacon@arm.com>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andi Kleen <andi@firstfloor.org>, Chris Lameter <cl@linux.com>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Russell King <linux@arm.linux.org.uk>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Linux API <linux-api@vger.kernel.org>
Subject: Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
Date: Fri, 13 Oct 2017 09:35:32 +0000	[thread overview]
Message-ID: <DM5PR15MB1690DA99E4AA74FBE54CF7F9CF480@DM5PR15MB1690.namprd15.prod.outlook.com> (raw)
In-Reply-To: <CA+55aFy54Sk63VkS3HFnd4MPiSbgV6==NaYwr+D9sjG6KZGmHQ@mail.gmail.com>

Hey,

I'm really excited to hear that you're open to this patch set and totally understand the desire for some more numbers. I have a few thoughts and questions -- hopefully ones that could help better understand where you'd like to see more data (potentially from myself and other Facebook folks)

> A "increment percpu value" simply isn't relevant.

While I understand it seems trivial, my experience has been that this type of operation can actually be important in many server workloads. In applications with 1000s of threads, keeping a counter like this can pose a challenge. One can use a per-thread variable, but the size overhead here can be very large (8 bytes per counter per thread adds up very quickly). And atomic instructions can cause contention quickly. Server programs tend to have many statistical counters, being able to implement them efficiently without size bloat is a real world win.

This type of per-cpu counter can also very quickly be used to implement other abstractions in common use -- eg an asymmetric reader-writer lock or a reference counted object that is changed infrequently. While thread local storage can also be used in these cases this can be a substantial size overhead and can also require cooperation between the application and the library to manage thread lifecycle.

At least from what I've seen of our usage of these types of abstractions within Facebook, if rseq met these use cases and did absolutely nothing else it would still be a feature that our applications would benefit from. Hopefully we can find evidence that it can do even more than this, but I think that this "trivial" use case is actually addressing a real world problem.

> When I asked last time, people pointed me to potential uses, including
> malloc libraries that could get per-thread performance with just
> per-cpu (not per thread) data structure overhead. I see that you once
> more point to the slides from 2013 that again talks about it.
>
> But people didn't post code, people didn't post numbers, and people
> didn't point to actual real uses, just "this could happen".

At Facebook we did some work to experiment with rseq and jemalloc Qi and David (cc'd) may be able to provide more context on the current state.

> Because without real-world uses, it's not obvious that there won't be
> somebody who goes "oh, this isn't quite enough for us, the semantics
> are subtly incompatible with our real-world use case".

Is your concern mainly this question (is this patchset a good way to bring per-cpu algorithms to userspace)? I'm hoping that given the variety of ways that per-cpu data structures are used in the kernel the concerns around this patch set are mainly around what approach we should take rather than if per-cpu algorithms are a good idea at all. If this is your main concern perhaps our focus should be around demonstrating that a number of useful per-cpu algorithms can be implemented using restartable sequences.

Ultimately I'm worried there's a chicken and egg problem here. It's hard to get people to commit to investing in rseq without a clear path to the patchset seeing the light of day. It's also easy to understand why you'd be reluctant to merge such a substantial and unfamiliar API without extensive data. If we're still not able to get compelling data, I'm wondering if there are other approaches that could get us unstuck, eg

(1) Could we merge enough of this patchset (eg things like the faster getcpu() operation, which seems like a good improvement over the current approach). If we make the remaining patches small enough it may be easier for sophisticated users to apply the remaining patches, maintain them, and provide real-world operational experience with this abstraction.

(2) Could we implement restartable sequences in the kernel but only allow the vdso to make use of them? We could have the vdso export a number of high-level operations (like the ones suggested in Paul Turner's original presentation -- per-cpu spin lock, per-cpu atomic increment/decrement, per-cpu list push/pop). This would allow us to get real-world data about how these primitives are used without committing to a complex ABI -- only committing to support the specific operations. If the whole idea flops we could eventually create a slow/naive implementation of the vdso functions and kill restartable sequences entirely.

-b

WARNING: multiple messages have this Message-ID (diff)
From: Ben Maurer <bmaurer@fb.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	David Goldblatt <davidgoldblatt@fb.com>, Qi Wang <qiwang@fb.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Paul Turner <pjt@google.com>, Andrew Hunter <ahh@google.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Dave Watson <davejwatson@fb.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Will Deacon <will.deacon@arm.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andi Kleen <andi@firstfloor.org>, Chris Lameter <cl@linux.com>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Russell King <linux@arm.linux.org.uk>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Michael Kerrisk <mtk.manpages@gmail>
Subject: Re: [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call
Date: Fri, 13 Oct 2017 09:35:32 +0000	[thread overview]
Message-ID: <DM5PR15MB1690DA99E4AA74FBE54CF7F9CF480@DM5PR15MB1690.namprd15.prod.outlook.com> (raw)
In-Reply-To: <CA+55aFy54Sk63VkS3HFnd4MPiSbgV6==NaYwr+D9sjG6KZGmHQ@mail.gmail.com>

Hey,

I'm really excited to hear that you're open to this patch set and totally understand the desire for some more numbers. I have a few thoughts and questions -- hopefully ones that could help better understand where you'd like to see more data (potentially from myself and other Facebook folks)

> A "increment percpu value" simply isn't relevant.

While I understand it seems trivial, my experience has been that this type of operation can actually be important in many server workloads. In applications with 1000s of threads, keeping a counter like this can pose a challenge. One can use a per-thread variable, but the size overhead here can be very large (8 bytes per counter per thread adds up very quickly). And atomic instructions can cause contention quickly. Server programs tend to have many statistical counters, being able to implement them efficiently without size bloat is a real world win.

This type of per-cpu counter can also very quickly be used to implement other abstractions in common use -- eg an asymmetric reader-writer lock or a reference counted object that is changed infrequently. While thread local storage can also be used in these cases this can be a substantial size overhead and can also require cooperation between the application and the library to manage thread lifecycle.

At least from what I've seen of our usage of these types of abstractions within Facebook, if rseq met these use cases and did absolutely nothing else it would still be a feature that our applications would benefit from. Hopefully we can find evidence that it can do even more than this, but I think that this "trivial" use case is actually addressing a real world problem.

> When I asked last time, people pointed me to potential uses, including
> malloc libraries that could get per-thread performance with just
> per-cpu (not per thread) data structure overhead. I see that you once
> more point to the slides from 2013 that again talks about it.
>
> But people didn't post code, people didn't post numbers, and people
> didn't point to actual real uses, just "this could happen".

At Facebook we did some work to experiment with rseq and jemalloc Qi and David (cc'd) may be able to provide more context on the current state.

> Because without real-world uses, it's not obvious that there won't be
> somebody who goes "oh, this isn't quite enough for us, the semantics
> are subtly incompatible with our real-world use case".

Is your concern mainly this question (is this patchset a good way to bring per-cpu algorithms to userspace)? I'm hoping that given the variety of ways that per-cpu data structures are used in the kernel the concerns around this patch set are mainly around what approach we should take rather than if per-cpu algorithms are a good idea at all. If this is your main concern perhaps our focus should be around demonstrating that a number of useful per-cpu algorithms can be implemented using restartable sequences.

Ultimately I'm worried there's a chicken and egg problem here. It's hard to get people to commit to investing in rseq without a clear path to the patchset seeing the light of day. It's also easy to understand why you'd be reluctant to merge such a substantial and unfamiliar API without extensive data. If we're still not able to get compelling data, I'm wondering if there are other approaches that could get us unstuck, eg

(1) Could we merge enough of this patchset (eg things like the faster getcpu() operation, which seems like a good improvement over the current approach). If we make the remaining patches small enough it may be easier for sophisticated users to apply the remaining patches, maintain them, and provide real-world operational experience with this abstraction.

(2) Could we implement restartable sequences in the kernel but only allow the vdso to make use of them? We could have the vdso export a number of high-level operations (like the ones suggested in Paul Turner's original presentation -- per-cpu spin lock, per-cpu atomic increment/decrement, per-cpu list push/pop). This would allow us to get real-world data about how these primitives are used without committing to a complex ABI -- only committing to support the specific operations. If the whole idea flops we could eventually create a slow/naive implementation of the vdso functions and kill restartable sequences entirely.

-b

  reply	other threads:[~2017-10-13  9:36 UTC|newest]

Thread overview: 113+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-12 23:03 [RFC PATCH v9 for 4.15 00/14] Restartable sequences and CPU op vector system calls Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH v9 for 4.15 01/14] Restartable sequences system call Mathieu Desnoyers
2017-10-13  0:36   ` Linus Torvalds
2017-10-13  0:36     ` Linus Torvalds
2017-10-13  9:35     ` Ben Maurer [this message]
2017-10-13  9:35       ` Ben Maurer
2017-10-13 18:30       ` Linus Torvalds
2017-10-13 18:30         ` Linus Torvalds
2017-10-13 20:54         ` Paul E. McKenney
2017-10-13 20:54           ` Paul E. McKenney
2017-10-13 21:05           ` Linus Torvalds
2017-10-13 21:05             ` Linus Torvalds
2017-10-13 21:21             ` Paul E. McKenney
2017-10-13 21:21               ` Paul E. McKenney
2017-10-13 21:36             ` Mathieu Desnoyers
2017-10-13 21:36               ` Mathieu Desnoyers
2017-10-16 16:04               ` Carlos O'Donell
2017-10-16 16:04                 ` Carlos O'Donell
2017-10-16 16:46                 ` Andi Kleen
2017-10-16 16:46                   ` Andi Kleen
2017-10-16 22:17                   ` Mathieu Desnoyers
2017-10-16 22:17                     ` Mathieu Desnoyers
2017-10-17 16:19                     ` Ben Maurer
2017-10-17 16:19                       ` Ben Maurer
2017-10-17 16:33                       ` Mathieu Desnoyers
2017-10-17 16:33                         ` Mathieu Desnoyers
2017-10-17 16:41                         ` Ben Maurer
2017-10-17 16:41                           ` Ben Maurer
2017-10-17 17:48                           ` Mathieu Desnoyers
2017-10-17 17:48                             ` Mathieu Desnoyers
2017-10-18  6:22                       ` Greg KH
2017-10-18  6:22                         ` Greg KH
2017-10-18 16:28                         ` Mathieu Desnoyers
2017-10-18 16:28                           ` Mathieu Desnoyers
2017-10-14  3:01         ` Andi Kleen
2017-10-14  3:01           ` Andi Kleen
2017-10-14  4:05           ` Linus Torvalds
2017-10-14  4:05             ` Linus Torvalds
2017-10-14 11:37             ` Mathieu Desnoyers
2017-10-14 11:37               ` Mathieu Desnoyers
2017-10-13 12:50   ` Florian Weimer
2017-10-13 13:40     ` Mathieu Desnoyers
2017-10-13 13:40       ` Mathieu Desnoyers
2017-10-13 13:56       ` Florian Weimer
2017-10-13 13:56         ` Florian Weimer
2017-10-13 14:27         ` Mathieu Desnoyers
2017-10-13 14:27           ` Mathieu Desnoyers
2017-10-13 17:24           ` Andy Lutomirski
2017-10-13 17:24             ` Andy Lutomirski
2017-10-13 17:53             ` Florian Weimer
2017-10-13 17:53               ` Florian Weimer
2017-10-13 18:17               ` Andy Lutomirski
2017-10-13 18:17                 ` Andy Lutomirski
2017-10-14 11:53                 ` Mathieu Desnoyers
2017-10-14 11:53                   ` Mathieu Desnoyers
2017-10-18 16:41   ` Ben Maurer
2017-10-18 18:11     ` Mathieu Desnoyers
2017-10-18 18:11       ` Mathieu Desnoyers
2017-10-19 11:35       ` Mathieu Desnoyers
2017-10-19 11:35         ` Mathieu Desnoyers
2017-10-19 17:01         ` Florian Weimer
2017-10-19 17:01           ` Florian Weimer
2017-10-23 17:30       ` Ben Maurer
2017-10-23 17:30         ` Ben Maurer
2017-10-23 20:44         ` Mathieu Desnoyers
2017-10-23 20:44           ` Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 02/14] tracing: instrument restartable sequences Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 03/14] Restartable sequences: ARM 32 architecture support Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 04/14] Restartable sequences: wire up ARM 32 system call Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 05/14] Restartable sequences: x86 32/64 architecture support Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 06/14] Restartable sequences: wire up x86 32/64 system call Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 07/14] Restartable sequences: powerpc architecture support Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 08/14] Restartable sequences: Wire up powerpc system call Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 09/14] Provide cpu_opv " Mathieu Desnoyers
2017-10-13 13:57   ` Alan Cox
2017-10-13 13:57     ` Alan Cox
2017-10-13 14:50     ` Mathieu Desnoyers
2017-10-13 14:50       ` Mathieu Desnoyers
2017-10-14 14:22       ` Mathieu Desnoyers
2017-10-14 14:22         ` Mathieu Desnoyers
2017-10-13 17:20   ` Andy Lutomirski
2017-10-13 17:20     ` Andy Lutomirski
2017-10-14  2:50   ` Andi Kleen
2017-10-14  2:50     ` Andi Kleen
2017-10-14 13:35     ` Mathieu Desnoyers
2017-10-14 13:35       ` Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 10/14] cpu_opv: Wire up x86 32/64 " Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 11/14] cpu_opv: Wire up powerpc " Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 12/14] cpu_opv: Wire up ARM32 " Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 13/14] cpu_opv: Implement selftests Mathieu Desnoyers
2017-10-12 23:03 ` [RFC PATCH for 4.15 14/14] Restartable sequences: Provide self-tests Mathieu Desnoyers
2017-10-16  2:51   ` Michael Ellerman
2017-10-16  2:51     ` Michael Ellerman
2017-10-16 14:23     ` Mathieu Desnoyers
2017-10-16 14:23       ` Mathieu Desnoyers
2017-10-17 10:38       ` Michael Ellerman
2017-10-17 10:38         ` Michael Ellerman
2017-10-17 13:50         ` Mathieu Desnoyers
2017-10-17 13:50           ` Mathieu Desnoyers
2017-10-16 18:50     ` Mathieu Desnoyers
2017-10-16 18:50       ` Mathieu Desnoyers
2017-10-17 10:36       ` Michael Ellerman
2017-10-17 10:36         ` Michael Ellerman
2017-10-17 13:50         ` Mathieu Desnoyers
2017-10-17 13:50           ` Mathieu Desnoyers
2017-10-18  5:45           ` Michael Ellerman
2017-10-18  5:45             ` Michael Ellerman
2017-10-16  3:00   ` Michael Ellerman
2017-10-16  3:00     ` Michael Ellerman
2017-10-16  3:48     ` Boqun Feng
2017-10-16  3:48       ` Boqun Feng
2017-10-16 11:48       ` Michael Ellerman
2017-10-16 11:48         ` Michael Ellerman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DM5PR15MB1690DA99E4AA74FBE54CF7F9CF480@DM5PR15MB1690.namprd15.prod.outlook.com \
    --to=bmaurer@fb.com \
    --cc=ahh@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=boqun.feng@gmail.com \
    --cc=catalin.marinas@arm.com \
    --cc=cl@linux.com \
    --cc=davejwatson@fb.com \
    --cc=davidgoldblatt@fb.com \
    --cc=hpa@zytor.com \
    --cc=josh@joshtriplett.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@arm.linux.org.uk \
    --cc=luto@amacapital.net \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@redhat.com \
    --cc=mtk.manpages@gmail.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=qiwang@fb.com \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.