* Re: rseq event_counter field
       [not found]   ` <606828370.51261.1509183329197.JavaMail.zimbra@efficios.com>
@ 2017-10-28 20:20     ` Andy Lutomirski
  2017-10-29  1:07       ` Mathieu Desnoyers
  2017-10-30 18:08       ` Kyle Huey
  0 siblings, 2 replies; 4+ messages in thread
From: Andy Lutomirski @ 2017-10-28 20:20 UTC (permalink / raw)
  To: Mathieu Desnoyers, Kyle Huey, Ben Maurer
  Cc: Andy Lutomirski, Boqun Feng, Will Deacon, Peter Zijlstra,
	Paul Turner, linux-kernel, Linus Torvalds

Answering both emails here.

Also, welcome Kyle.  Kyle, how badly does rseq's proposed
event_counter break rr?  For that matter, how badly does rseq without
an event_counter break rr?

(Linus, if you care, I'm proposing that rseq is probably fine for
4.15, but that it should be merged without the event_counter.  If
there is a bona fide use for event_counter along with a benchmark
showing that it makes a meaningful difference, then event_counter can
be easily added later.)

On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer <bmaurer@fb.com> wrote:
> My understanding is that one of the motivations for the event counter was to
> make it possible to express things like "increment a per-cpu value" in C.
> Without this abstraction it wasn't possible to ensure that the strict
> requirements of the rseq ABI were met (contiguous restartable block, single
> instruction for the final write). If you write your entire restartable
> sequence in assembly, there's no need for the event counter. Comparing all
> input values as an alternative way of having non-assembly sequences seems
> like it might have a risk of ABA-style race conditions.
>

My opinion after looking at a few examples is that event_counter
sounds useful but that it's really more of a crutch that enables
mediocre userspace code.  With event_counter, you can do roughly this:

struct rseq_state rseq = begin();

int cpu = rseq_getcpu();
Calculate some stuff.  Be very, very careful when you chase pointers,
because it's easy to do so in a very fragile way, as Mathieu's
example below does.

commit(&rseq, some pointer you found, some value to write, some
double-checks to check);
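
To make that concrete, here's a minimal hedged sketch of this style for
the per-cpu counter increment Ben mentions.  begin(), commit(),
rseq_getcpu(), struct rseq_state, and counters[] are all hypothetical
names, not the proposed ABI:

#include <stdbool.h>
#include <stdint.h>

struct rseq_state { uint32_t event_counter; int32_t cpu_id; };
struct rseq_state begin(void);
int rseq_getcpu(void);
bool commit(struct rseq_state *start, uint64_t *loc, uint64_t newval);
extern uint64_t counters[];        /* one slot per CPU, for example */

static void percpu_inc(void)
{
        for (;;) {
                struct rseq_state rseq = begin(); /* snapshot event_counter */
                int cpu = rseq_getcpu();
                uint64_t *slot = &counters[cpu];
                uint64_t newval = *slot + 1;      /* the C "calculate" step */
                /* commit() re-checks event_counter in asm and does the
                 * single final store only if it hasn't changed. */
                if (commit(&rseq, slot, newval))
                        return;
        }
}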

With a real database, this style of programming works well.  You
probably have a progress guarantee, and the database gives you the
semantics you want.  With rseq, you don't get any of this.  If the C
code is too slow or does something unexpected (an accidental
interaction with userfaultfd, perhaps?), then you lose forward
progress.  You can also be a bit slow in the C code, in which case
you'll get more conflicts detected than actually exist.

Without event_counter, you can use a different abstraction.  In your
percpu data, you have something like:

struct rseq_lock lock;
struct freelist_entry *head;

Then you can do:

int cpu = rseq_getcpu();
struct my_data *data = ... cpu ...;
struct rseq_state rseq = rseq_acquire(data->lock);

Calculate some stuff, being careful not to touch anything that isn't
protected by the rseq_lock you're using.

commit(&rseq, cpu, some pointer you found, some value to write, some
double-checks to check);
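
In the same hedged-sketch spirit (reusing struct rseq_state and
rseq_getcpu() from the sketch above; rseq_lock, rseq_acquire(), and
this five-argument commit() are likewise illustrative names, not a
real API), a per-cpu freelist pop in this style might look like:

struct freelist_entry { struct freelist_entry *next; };

struct rseq_lock { uint32_t state; /* placeholder layout */ };
struct rseq_state rseq_acquire(struct rseq_lock *lock);
bool commit(struct rseq_state *rseq, int cpu, struct freelist_entry **headp,
            struct freelist_entry *newv, struct freelist_entry *oldv);

struct percpu_freelist {
        struct rseq_lock lock;
        struct freelist_entry *head;
};

static struct freelist_entry *percpu_pop(struct percpu_freelist *lists)
{
        for (;;) {
                int cpu = rseq_getcpu();
                struct percpu_freelist *data = &lists[cpu];
                struct rseq_state rseq = rseq_acquire(&data->lock);

                struct freelist_entry *head = data->head;
                if (!head)
                        return NULL;
                /* CAVEAT: this C-side head->next load is only safe on
                 * type-safe memory; see the use-after-free discussion
                 * later in the thread. */
                struct freelist_entry *next = head->next;
                /* commit() re-checks cpu, data->head, and next in the
                 * short asm sequence before storing next as the new head. */
                if (commit(&rseq, cpu, &data->head, next, head))
                        return head;
        }
}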

This variant is indeed a tiny bit slower *if you measure time by
number of instructions* -- there are some very clever optimizations
that save some instructions in the event_counter variant.  But, in the
cases where you actually have C code in here, I suspect that the
difference won't matter in real code.  And the counter-less variant
can have arbitrarily slow C code or even syscalls without being
spuriously aborted until the very short asm commit sequence starts.

BUT, I think that the event_counter-free version is likely to be
faster in real code if the library is done well.  The version with
event_counter has to load the event_counter from the rseq cacheline
right at the beginning, and that value is used, so unless the CPU is
really quite aggressive about speculating through a branch that hasn't
gotten its dependencies into cache yet, you're likely to stall a bit
if the per-thread rseq cacheline isn't in L1.

The variant without event_counter can avoid this problem as long as
the architecture gives a cheap way to find the CPU number.  On x86,
this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
any CPU as long as the kernel supports per-cpu FSBASE.  (That's a
feature that isn't terribly hard to add and would IMO be quite nifty.)
If this is done, then you still have to do:

__this_thread_rseq->rseq_cs = &the_rseq_cs_here;

but that's a *store* to the rseq cacheline.  It'll go in the store
buffer and, in the fast case, nothing ever depends on it.  Hence it's
unlikely to stall.  Of course, there's still a load from the
rseq_lock, but that shares a cacheline with the data that it protects,
so it's basically free.

IOW, I think that adding event_counter encourages user code that has a
weaker progress guarantee, is slower in a very highly optimized
implementation, and allows users to think less about what they're doing, to
their own detriment, for the gain of saving a couple of instructions.
I could be wrong, but I think it would be nice to see evidence that
I'm wrong before event_counter becomes enshrined in the ABI.  Also, I
suspect that neat tools like rr will have a much easier time dealing
with rseq if event_counter isn't involved.

>
> RDPID is interesting, but it seems like given the current ABI (which
> requires setting the current restartable range) you'd still need some sort
> of TLS value. Have any benchmarks of RDPID perf vs TLS?

That's the wrong comparison.  For a good userspace implementation
(which is very awkward without libc's help), storing to the rseq
pointer is:

leaq the_rseq_cs(%rip), %rax
movq %rax, %gs:__tls_rseq + offsetof(struct rseq, rseq_cs)

loading the cpu number is:

movq %gs:__tls_rseq + offsetof(struct rseq, cpu), %rax

The latter could be rdpid %rax; movl %eax, %eax (if the ABI gets
fixed) instead.  This avoids reading from the rseq cacheline.
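
As a hedged sketch, those snippets could be emitted from C with GCC
extended asm along these lines -- __tls_rseq, the field offsets, and
the TLS relocation details are all assumptions here:

#define RSEQ_CS_OFFSET  8  /* stand-in for offsetof(struct rseq, rseq_cs) */
#define RSEQ_CPU_OFFSET 4  /* stand-in for offsetof(struct rseq, cpu) */

static inline void rseq_set_cs(const void *cs)
{
        asm volatile("movq %0, %%gs:__tls_rseq+%c1"
                     : : "r"(cs), "i"(RSEQ_CS_OFFSET) : "memory");
}

static inline int rseq_current_cpu(void)
{
        int cpu;

        asm volatile("movl %%gs:__tls_rseq+%c1, %0"
                     : "=r"(cpu) : "i"(RSEQ_CPU_OFFSET));
        return cpu;
}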


On Sat, Oct 28, 2017 at 2:35 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- On Oct 27, 2017, at 6:00 PM, Andy Lutomirski luto@kernel.org wrote:
>
>> On Thu, Oct 26, 2017 at 9:54 AM, Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>
> Let me step back a bit, and try to understand the fundamental
> motivation for not letting user-space detect that it has been
> preempted.
>
> My understanding (please let me know where I am mistaken) is that
> the motivation for not letting user-space know if preempted is to
> ensure malware cannot cloak (hide) itself by changing its
> behavior when it detects that it is heavily preempted. This
> is effectively a way to detect things like single-stepping, strace,
> ltrace, and so on.

That's about a third of my objection.  The next third is that it seems
messy to me to intentionally expose what should be an invisible
detail, and the final third is that I still haven't seen convincing evidence
that anything genuinely needs an event counter.  See below and my
other email.

>
> One can also pretty much trivially use rdtsc (x86), tb register
> (powerpc) or ARM ccnt register (if user access is enabled) to
> detect whether it is running under observation.
> (this would be your timing-and-guessing approach)
>

Mostly irrelevant, but rdtsc is trappable.

> I hope that very *few* users will completely open-code their asm
> sequences. For instance, if you do the whole list pop in asm, you
> need to:
>
> 1) store current rseq_cs pointer
> 2) load CPU number from struct rseq,
> 3) pointer-chase the right per-cpu data that you want to act on,
> 4) load head
> 5) compare head against NULL
> 6) conditional branch
> 7) load head->next
> 8) store head->next to head (commit)
>
> step 3) is the interesting bit here:
> It heavily depends on the application's per-cpu data layout. It can be
> a large array of per-cpu integers, or an array of per-cpu structures
> at a specific offset, or an array of per-cpu struct + offset + pointer
> to some other structure (dynamically allocated), and so on. I do not
> want to impose any restriction on the data structure layout of
> user-space applications. This means the asm-specialized-stuff will
> have to be specialized for each per-cpu data structure layout, which
> is a pain.

To my point: I don't think this is a particularly good example.  I
think your examples are buggy (see below), and I would implement this
quite differently if I were trying to use C (or even asm, perhaps),
like this:

1) load CPU number.
2) pointer-chase the right per-cpu data, in C if desired.
3) store rseq_cs (which pretty much has to be in asm using the current ABI)
4) load head, compare against null, load head->next
5) commit

I don't see how the event counter is at all useful here.
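
Roughly, as a hedged sketch of that split (rseq_current_cpu() and the
asm commit helper are assumed names, and the asm itself is only
summarized in a comment since the exact sequence depends on the final
ABI):

struct freelist_entry { struct freelist_entry *next; };

extern int rseq_current_cpu(void);
/* Asm helper for steps 3-5: stores rseq_cs, re-checks cpu, loads head,
 * bails out on NULL, loads head->next, and commits it as the new head.
 * Because preemption aborts the sequence before the commit, the
 * head->next load cannot see freed memory in here -- which is the
 * point of doing these loads inside the critical section.  Returns the
 * popped entry, NULL for an empty list, or (void *)-1 on abort. */
extern struct freelist_entry *rseq_pop_commit(struct freelist_entry **headp,
                                              int cpu);

struct freelist_entry *list_pop(struct freelist_entry **percpu_heads)
{
        for (;;) {
                int cpu = rseq_current_cpu();                    /* step 1 */
                struct freelist_entry **headp = &percpu_heads[cpu]; /* step 2 */
                struct freelist_entry *e = rseq_pop_commit(headp, cpu);

                if (e != (struct freelist_entry *)-1)
                        return e;        /* popped an entry, or list empty */
                /* aborted (preempted or migrated): retry */
        }
}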

FWIW, I can easily imagine alternative ABIs in which abort_ip is
treated as a *call* instead of a jump to enable much better C support.

> Having the event_counter in place allows users to code the
> pointer-chasing in pure C, which we can expect will help adoption
> and ensure people use rseq through a utility library header file
> rather than open-coding everything (with the associated testing
> problems).

True, but my variant works just as well, might be faster, and doesn't
need event_counter.

>
> If we look at the list pop alternative using C for pointer-chasing,
> with event_counter:
>
> <C>
> 1) 64-bit load of both event_counter and cpu_id
> 2) Use cpu_id to pointer-chase the per-cpu data
> 3) Load head
> 4) compare head with NULL
> 5) conditional branch
> 6) Load head->next

^^^ Potential segfault here.  head might have been freed and unmapped.

> <ASM>
> 7) store current rseq_cs
> 8) Load event counter
> 9) compare current event_counter with event_counter loaded in 1)
> 10) conditional branch
> 11) store new value to @loc
>
> For a total of 2 stores, 4 loads, one or more loads for pointer-chasing,
> 2 compares, and 2 conditional branches. Again, we can expect the load in 3)
> to stall due to pointer-chasing.
>
> And here is the resulting list pop where the algorithm is in C,
> without the sequence counter. Let's make it a simple case
> where the application is not changing the layout of the data
> structures being pointer-chased concurrently from a book-keeping
> algorithm (because that would add an extra compare in the asm).
>
> <C>
> 1) 32-bit load of cpu_id
> 2) Use cpu_id to pointer-chase the per-cpu data
> 3) Load head
> 4) compare head with NULL
> 5) conditional branch
> 6) Load head->next

^^^ ditto: potential segfault here

> <ASM>
> 7) store current rseq_cs
> 8) 32-bit load of cpu_id
> 9) compare cpu_id with prior cpu_id read in 1)
> 10) conditional branch
> 11) Load head
> 12) compare head against head loaded in 3)
> 13) conditional branch
> 14) Load head->next
> 15) compare head->next against head->next loaded in 6) --> this prevents ABA
> 16) conditional branch
> 17) store head->next as new head
>
>
> Would it be required to change the kernel-user ABI when switching to
> RDPID? Can this co-exist with TLS users?

RDPID gives you the CPU number without reading memory, and that's
pretty much all it does.  The format in which it returns it sucks,
unfortunately, but I'm going to try to fix that by the time rseq is
merged.

--Andy


* Re: rseq event_counter field
  2017-10-28 20:20     ` rseq event_counter field Andy Lutomirski
@ 2017-10-29  1:07       ` Mathieu Desnoyers
  2017-10-30 18:08       ` Kyle Huey
  1 sibling, 0 replies; 4+ messages in thread
From: Mathieu Desnoyers @ 2017-10-29  1:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kyle Huey, Ben Maurer, Boqun Feng, Will Deacon, Peter Zijlstra,
	Paul Turner, linux-kernel, Linus Torvalds

----- On Oct 28, 2017, at 10:20 PM, Andy Lutomirski luto@kernel.org wrote:

> [...]
> 
> To my point: I don't think this is a particularly good example.  I
> think your examples are buggy (see below),

Good catch! The list pop I currently have is indeed only meant to be
used on type-safe memory (or objects otherwise guaranteed to exist by
some other means, such as RCU). That kind of constraint can be very
unexpected and tricky for users. So I'm starting to better understand
what you are getting at...


> and I would implement this
> quite differently if I were trying to use C (or even asm, perhaps),
> like this:
> 
> 1) load CPU number.
> 2) pointer-chase the right per-cpu data, in C if desired.
> 3) store rseq_cs (which pretty much has to be in asm using the current ABI)
> 4) load head, compare against null, load head->next
> 5) commit
> 
> I don't see how the event counter is at all useful here.

OK, so there is a "use-after-free" on non-type-safe memory for list pop
in both my rseq fast path and cpu_opv slow path currently. We'll indeed
need to have the load head, compare != NULL, load head->next all in asm
(and in the cpu_opv preempt-off section) to prevent use-after-free in the
general case.

This makes a very strong case for removing the event_counter, since its
use does not protect against use-after-free, and would therefore be
quite error-prone.

> 
> FWIW, I can easily imagine alternative ABIs in which abort_ip is
> treated as a *call* instead of a jump to enable much better C support.
> 
>> Having the event_counter in place allows users to code the
>> pointer-chasing in pure C, which we can expect will help adoption
>> and ensure people use rseq through a utility library header file
>> rather than open-coding everything (with the associated testing
>> problems).
> 
> True, but my variant works just as well, might be faster, and doesn't
> need event_counter.
> 
>>
>> If we look at the list pop alternative using C for pointer-chasing,
>> with event_counter:
>>
>> <C>
>> 1) 64-bit load of both event_counter and cpu_id
>> 2) Use cpu_id to pointer-chase the per-cpu data
>> 3) Load head
>> 4) compare head with NULL
>> 5) conditional branch
>> 6) Load head->next
> 
> ^^^ Potential segfault here.  head might have been freed and unmapped.
> 
>> <ASM>
>> 7) store current rseq_cs
>> 8) Load event counter
>> 9) compare current event_counter with event_counter loaded in 1)
>> 10) conditional branch
>> 11) store new value to @loc
>>
>> For a total of 2 stores, 4 loads, one or more loads for pointer-chasing,
>> 2 compares, and 2 conditional branches. Again, we can expect the load in 3)
>> to stall due to pointer-chasing.
>>
>> And here is the resulting list pop where the algorithm is in C,
>> without the sequence counter. Let's make it a simple case
>> where the application is not changing the layout of the data
>> structures being pointer-chased concurrently from a book-keeping
>> algorithm (because that would add an extra compare in the asm).
>>
>> <C>
>> 1) 32-bit load of cpu_id
>> 2) Use cpu_id to pointer-chase the per-cpu data
>> 3) Load head
>> 4) compare head with NULL
>> 5) conditional branch
>> 6) Load head->next
> 
> ^^^ ditto: potential segfault here
> 

Yep, good catch! Thanks for showing me how you envision the
userspace C vs ASM split. I'll try to come up with improved
user-space fast/slow paths based on your suggestions, and
therefore remove the event counter.

>> <ASM>
>> 7) store current rseq_cs
>> 8) 32-bit load of cpu_id
>> 9) compare cpu_id with prior cpu_id read in 1)
>> 10) conditional branch
>> 11) Load head
>> 12) compare head against head loaded in 3)
>> 13) conditional branch
>> 14) Load head->next
>> 15) compare head->next against head->next loaded in 6) --> this prevents ABA
>> 16) conditional branch
>> 17) store head->next as new head
>>
>>
>> Would it be required to change the kernel-user ABI when switching to
>> RDPID? Can this co-exist with TLS users?
> 
> RDPID gives you the CPU number without reading memory, and that's
> pretty much all it does.  The format in which it returns it sucks,
> unfortunately, but I'm going to try to fix that by the time rseq is
> merged.

OK, so it is entirely independent of rseq, and can simply be used as
an alternative way of reading the cpu_id.

Thanks for the thorough explanation, I think I know what to aim for now.

Mathieu

> 
> --Andy

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: rseq event_counter field
  2017-10-28 20:20     ` rseq event_counter field Andy Lutomirski
  2017-10-29  1:07       ` Mathieu Desnoyers
@ 2017-10-30 18:08       ` Kyle Huey
  2017-10-30 19:20         ` Mathieu Desnoyers
  1 sibling, 1 reply; 4+ messages in thread
From: Kyle Huey @ 2017-10-30 18:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathieu Desnoyers, Ben Maurer, Boqun Feng, Will Deacon,
	Peter Zijlstra, Paul Turner, linux-kernel, Linus Torvalds,
	Robert O'Callahan

On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski <luto@kernel.org> wrote:
> Answering both emails here.
>
> Also, welcome Kyle.  Kyle, how badly does rseq's proposed
> event_counter break rr?  For that matter, how badly does rseq without
> an event_counter break rr?
> [...]

(+roc)

Thanks for roping me in.  I'm not sure I fully understand the proposal
(are there docs somewhere outside of this email thread?) but I don't
think this will be a huge problem for rr because:

1) Since rseq is initiated through a syscall, we can always make that
fail and deal with not recording programs that use it.  As long as it
doesn't become widely used in glibc or something, that's probably
bearable.
2) We ought to be able to emulate it if we have to, since everything
is mediated by the kernel.  rr might need to be able to see when a
thread migrates from one CPU to another (via a perf event or
something, idk if this functionality already exists) and when an rseq
restart happens (via ptrace or perf events or something).

RDPID sounds annoying.  My copy of the SDM doesn't seem to indicate
whether it obeys the RDTSC(P) faulting mechanism.  Does anyone know
what microarchitecture introduced that instruction?  We could always
mask off the CPUID bit indicating it's supported.

A similar concept on ARM (the load-linked/store-conditional pattern of
the LDREX/STREX instructions, sketched below) doomed our effort to port
rr to ARM a few years back.  The key problems there for us were:

1. Hardware interrupts in the critical section trigger a failed store
because the kernel will grab the per-CPU exclusive lock.
2. There's no way to observe when a store fails so as to induce the
same failures during replay, and
3. That pattern is used all over the place for every atomic operation on
ARM, so blacklisting or emulating it wasn't feasible.
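
For reference, here's a hedged ARM32 inline-asm sketch of that pattern,
an atomic increment.  Any interrupt or context switch between ldrex and
strex clears the exclusive monitor, so the strex fails and the loop
retries -- invisibly to a recorder:

static inline void atomic_inc(int *p)
{
        int tmp, failed;

        asm volatile(
                "1:     ldrex   %0, [%2]\n"     /* load-exclusive *p */
                "       add     %0, %0, #1\n"
                "       strex   %1, %0, [%2]\n" /* %1 = 0 on success */
                "       cmp     %1, #0\n"
                "       bne     1b\n"           /* lost exclusivity: retry */
                : "=&r"(tmp), "=&r"(failed)
                : "r"(p)
                : "cc", "memory");
}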

If everything is mediated by the kernel and the kernel exposes enough
information through ptrace/perf events/whatever, I think we'll be ok.

Thanks,

- Kyle


* Re: rseq event_counter field
  2017-10-30 18:08       ` Kyle Huey
@ 2017-10-30 19:20         ` Mathieu Desnoyers
  0 siblings, 0 replies; 4+ messages in thread
From: Mathieu Desnoyers @ 2017-10-30 19:20 UTC (permalink / raw)
  To: Kyle Huey
  Cc: Andy Lutomirski, Ben Maurer, Boqun Feng, Will Deacon,
	Peter Zijlstra, Paul Turner, linux-kernel, Linus Torvalds,
	Robert O'Callahan

----- On Oct 30, 2017, at 2:08 PM, Kyle Huey me@kylehuey.com wrote:

> On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> [...]
> 
> (+roc)
> 
> Thanks for roping me in.  I'm not sure I fully understand the proposal
> (are there docs somewhere outside of this email thread?)

The tree with my current implementation is at
https://git.kernel.org/pub/scm/linux/kernel/git/rseq/linux-rseq.git/

The commit messages and implementation contain comments describing the
architecture of the rseq proposal.

However, the last bits about event_counter removal were discussed between
Andy and me just last week, so it's a bit soon to have this documented. ;)

> but I don't
> think this will be a huge problem for rr because:
> 
> 1) Since rseq is initiated through a syscall, we can always make that
> fail and deal with not recording programs that use it.  As long as it
> doesn't become widely used in glibc or something, that's probably
> bearable.

One goal is to make glibc use it for its memory allocator. However, I
would expect glibc to fall back to another allocator if it detects that
rseq is not available, so making the system call return -ENOSYS should
trigger the fallback behavior.
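
A hedged sketch of such a check -- the __NR_rseq number and the
argument list here are assumptions based on the current patch set, not
a settled ABI:

#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

static bool rseq_register(void *rseq_tls, size_t rseq_len)
{
        /* Assumed form: sys_rseq(rseq, rseq_len, flags). */
        if (syscall(__NR_rseq, rseq_tls, rseq_len, 0) == 0)
                return true;
        /* -ENOSYS (or a recorder like rr failing the call on purpose)
         * means the caller should take its non-rseq fallback path. */
        return false;
}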

> 2) We ought to be able to emulate it if we have to, since everything
> is mediated by the kernel.  rr might need to be able to see when a
> thread migrates from one CPU to another (via a perf event or
> something, idk if this functionality already exists) and when an rseq
> restart happens (via ptrace or perf events or something).

rseq users export a __rseq_table section entry for each rseq critical
section, which contains the start_ip, post_commit_offset, and abort_ip
fields. This should allow a debugger/single-stepping emulator to
understand that it should expect the kernel to "jump" to the abort_ip,
and thus set breakpoints at the abort_ip to plan for these "unexpected"
branches. It might become useful for architectures that perform
single-stepping by placing explicit breakpoints on the next instruction
(and on jump targets).
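
The descriptor shape, as a hedged sketch -- the field names follow the
description above, but the exact layout is whatever the tree linked
above defines:

#include <stdint.h>

struct rseq_cs {
        uint64_t start_ip;            /* first insn of the critical section */
        uint64_t post_commit_offset;  /* length up to the final commit store */
        uint64_t abort_ip;            /* where the kernel redirects on abort */
};

/* Each critical section emits one descriptor into the section, e.g.:
 *
 *   __attribute__((section("__rseq_table"), used))
 *   static const struct rseq_cs my_cs = { ... };
 *
 * so a debugger can walk __rseq_table and breakpoint every abort_ip. */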

Another feature provided by rseq is the in-TLS "cpu_id" field: the kernel
stores the current cpu_id value into that TLS field whenever it returns
to user-space. If you care about trapping RDPID or the sched_getcpu()
vDSO, you will probably care about that field too.

> 
> RDPID sounds annoying.  My copy of the SDM doesn't seem to indicate
> whether it obeys the RDTSC(P) faulting mechanism.  Does anyone know
> what microarchitecture introduced that instruction?  We could always
> mask off the CPUID bit indicating it's supported.
> 
> A similar concept on ARM (the load-linked/store-conditional pattern of
> the LDREX/STREX instructions) doomed our effort to port rr to ARM a few
> years back.  The key problems there for us were:
> 
> 1. Hardware interrupts in the critical section trigger a failed store
> because the kernel will grab the per-CPU exclusive lock.
> 2. There's no way to observe when a store fails so as to induce the
> same failures during replay, and
> 3. That pattern is used all over the place for every atomic operation on
> ARM, so blacklisting or emulating it wasn't feasible.

I'm with you there. I've looked at how gdb deals with ll/sc stuff, and
god it's ugly. It really expects specific patterns, and anything out of
the ordinary would fool it. That's why I decided to expose the
__rseq_table section so a debugger would not have to "guess" rseq asm
sequence patterns.

> 
> If everything is mediated by the kernel and the kernel exposes enough
> information through ptrace/perf events/whatever, I think we'll be ok.

The simple solution would be to -ENOSYS sys_rseq(). A more elaborate
solution would be to add breakpoints at each rseq_table abort_ip target,
so you can catch whenever the kernel aborts an rseq critical section.
You would probably have to tweak the rseq TLS cpu_id value presented to
user-space, if you care about that too.

I'm currently working on removing the "event_counter" from rseq.

Let me know if you need clarifications or more explanation on specific
points.

Thanks,

Mathieu

> 
> Thanks,
> 
> - Kyle

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

