* Re: rseq event_counter field
       [not found]   ` <606828370.51261.1509183329197.JavaMail.zimbra@efficios.com>
@ 2017-10-28 20:20     ` Andy Lutomirski
  2017-10-29  1:07       ` Mathieu Desnoyers
  2017-10-30 18:08       ` Kyle Huey
  0 siblings, 2 replies; 4+ messages in thread
From: Andy Lutomirski @ 2017-10-28 20:20 UTC (permalink / raw)
  To: Mathieu Desnoyers, Kyle Huey, Ben Maurer
  Cc: Andy Lutomirski, Boqun Feng, Will Deacon, Peter Zijlstra,
	Paul Turner, linux-kernel, Linus Torvalds

Answering both emails here.

Also, welcome Kyle.  Kyle, how badly does rseq's proposed
event_counter break rr?  For that matter, how badly does rseq without
an event_counter break rr?

(Linus, if you care, I'm proposing that rseq is probably fine for
4.15, but that it should be merged without the event_counter.  If
there is a bona fide use for event_counter along with a benchmark
showing that it makes a meaningful difference, then event_counter can
be easily added later.)

On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer <bmaurer@fb.com> wrote:
> My understanding is that one of the motivations for the event counter was to
> make it possible to express things like "increment a per-cpu value" in C.
> Without this abstraction it wasn't possible to ensure that the strict
> requirements of the rseq ABI were met (contiguous restartable block, single
> instruction for the final write). If you write your entire restartable
> sequence in assembly, there's no need for the event counter. Comparing all
> input values as an alternative way of having non-assembly sequences seems
> like it might have a risk of ABA-style race conditions.
>

My opinion after looking at a few examples is that event_counter
sounds useful but that it's really more of a crutch that enables
mediocre userspace code.  With event_counter, you can do roughly this:

struct rseq_state rseq = begin();

int cpu = rseq_getcpu();
Calculate some stuff.  Be very, very careful when you chase pointers,
because it's easy to do so in a very fragile way, as Mathieu's
example below does.

commit(&rseq, some pointer you found, some value to write, some
double-checks to check);
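
To make that concrete, here's a minimal hedged sketch of this style for
the per-cpu counter increment Ben mentions.  begin(), commit(),
rseq_getcpu(), struct rseq_state, and counters[] are all hypothetical
names, not the proposed ABI:

#include <stdbool.h>
#include <stdint.h>

struct rseq_state { uint32_t event_counter; int32_t cpu_id; };
struct rseq_state begin(void);
int rseq_getcpu(void);
bool commit(struct rseq_state *start, uint64_t *loc, uint64_t newval);
extern uint64_t counters[];        /* one slot per CPU, for example */

static void percpu_inc(void)
{
        for (;;) {
                struct rseq_state rseq = begin(); /* snapshot event_counter */
                int cpu = rseq_getcpu();
                uint64_t *slot = &counters[cpu];
                uint64_t newval = *slot + 1;      /* the C "calculate" step */
                /* commit() re-checks event_counter in asm and does the
                 * single final store only if it hasn't changed. */
                if (commit(&rseq, slot, newval))
                        return;
        }
}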

With a real database, this style of programming works well.  You
probably have a progress guarantee, and the database gives you the
semantics you want.  With rseq, you don't get any of this.  If the C
code is too slow or does something unexpected (an accidental
interaction with userfaultfd, perhaps?), then you lose forward
progress.  You can also be a bit slow in the C code, in which case
you'll get more conflicts detected than actually exist.

Without event_counter, you can use a different abstraction.  In your
percpu data, you have something like:

struct rseq_lock lock;
struct freelist_entry *head;

Then you can do:

int cpu = rseq_getcpu();
struct my_data *data = ... cpu ...;
struct rseq_state rseq = rseq_acquire(data->lock);

Calculate some stuff, being careful not to touch anything that isn't
protected by the rseq_lock you're using.

commit(&rseq, cpu, some pointer you found, some value to write, some
double-checks to check);
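
In the same hedged-sketch spirit (reusing struct rseq_state and
rseq_getcpu() from the sketch above; rseq_lock, rseq_acquire(), and
this five-argument commit() are likewise illustrative names, not a
real API), a per-cpu freelist pop in this style might look like:

struct freelist_entry { struct freelist_entry *next; };

struct rseq_lock { uint32_t state; /* placeholder layout */ };
struct rseq_state rseq_acquire(struct rseq_lock *lock);
bool commit(struct rseq_state *rseq, int cpu, struct freelist_entry **headp,
            struct freelist_entry *newv, struct freelist_entry *oldv);

struct percpu_freelist {
        struct rseq_lock lock;
        struct freelist_entry *head;
};

static struct freelist_entry *percpu_pop(struct percpu_freelist *lists)
{
        for (;;) {
                int cpu = rseq_getcpu();
                struct percpu_freelist *data = &lists[cpu];
                struct rseq_state rseq = rseq_acquire(&data->lock);

                struct freelist_entry *head = data->head;
                if (!head)
                        return NULL;
                /* CAVEAT: this C-side head->next load is only safe on
                 * type-safe memory; see the use-after-free discussion
                 * later in the thread. */
                struct freelist_entry *next = head->next;
                /* commit() re-checks cpu, data->head, and next in the
                 * short asm sequence before storing next as the new head. */
                if (commit(&rseq, cpu, &data->head, next, head))
                        return head;
        }
}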

This variant is indeed a tiny bit slower *if you measure time by
number of instructions* -- there are some very clever optimizations
that save some instructions in the event_counter variant.  But, in the
cases where you actually have C code in here, I suspect that the
difference won't matter in real code.  And the counter-less variant
can have arbitrarily slow C code or even syscalls without being
spuriously aborted until the very short asm commit sequence starts.

BUT, I think that the event_counter-free version is likely to be
faster in real code if the library is done well.  The version with
event_counter has to load the event_counter from the rseq cacheline
right at the beginning, and that value is used, so unless the CPU is
really quite aggressive about speculating through a branch that hasn't
gotten its dependencies into cache yet, you're likely to stall a bit
if the per-thread rseq cacheline isn't in L1.

The variant without event_counter can avoid this problem as long as
the architecture gives a cheap way to find the CPU number.  On x86,
this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
any CPU as long as the kernel supports per-cpu FSBASE.  (That's a
feature that isn't terribly hard to add and would IMO be quite nifty.)
If this is done, then you still have to do:

__this_thread_rseq->rseq_cs = &the_rseq_cs_here;

but that's a *store* to the rseq cacheline.  It'll go in the store
buffer and, in the fast case, nothing ever depends on it.  Hence it's
unlikely to stall.  Of course, there's still a load from the
rseq_lock, but that shares a cacheline with the data that it protects,
so it's basically free.

IOW, I think that adding event_counter encourages user code that has a
weaker progress guarantee, is slower in a very highly optimized
implementation, and allows users to think less about what they're doing, to
their own detriment, for the gain of saving a couple of instructions.
I could be wrong, but I think it would be nice to see evidence that
I'm wrong before event_counter becomes enshrined in the ABI.  Also, I
suspect that neat tools like rr will have a much easier time dealing
with rseq if event_counter isn't involved.

>
> RDPID is interesting, but it seems like given the current ABI (which
> requires setting the current restartable range) you'd still need some sort
> of TLS value. Have any benchmarks of RDPID perf vs TLS?

That's the wrong comparison.  For a good userspace implementation
(which is very awkward without libc's help), storing to the rseq
pointer is:

leaq the_rseq_cs(%rip), %rax
movq %rax, %gs:__tls_rseq + offsetof(struct rseq, rseq_cs)

loading the cpu number is:

movq %gs:__tls_rseq + offsetof(struct rseq, cpu), %rax

The latter could be rdpid %rax; movl %eax, %eax (if the ABI gets
fixed) instead.  This avoids reading from the rseq cacheline.
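
As a hedged sketch, those snippets could be emitted from C with GCC
extended asm along these lines -- __tls_rseq, the field offsets, and
the TLS relocation details are all assumptions here:

#define RSEQ_CS_OFFSET  8  /* stand-in for offsetof(struct rseq, rseq_cs) */
#define RSEQ_CPU_OFFSET 4  /* stand-in for offsetof(struct rseq, cpu) */

static inline void rseq_set_cs(const void *cs)
{
        asm volatile("movq %0, %%gs:__tls_rseq+%c1"
                     : : "r"(cs), "i"(RSEQ_CS_OFFSET) : "memory");
}

static inline int rseq_current_cpu(void)
{
        int cpu;

        asm volatile("movl %%gs:__tls_rseq+%c1, %0"
                     : "=r"(cpu) : "i"(RSEQ_CPU_OFFSET));
        return cpu;
}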


On Sat, Oct 28, 2017 at 2:35 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> ----- On Oct 27, 2017, at 6:00 PM, Andy Lutomirski luto@kernel.org wrote:
>
>> On Thu, Oct 26, 2017 at 9:54 AM, Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>
> Let me step back a bit, and try to understand the fundamental
> motivation for not letting user-space detect that it has been
> preempted.
>
> My understanding (please let me know where I am mistaken) is that
> the motivation for not letting user-space know if preempted is to
> ensure malware cannot cloak (hide) itself by changing its
> behavior when it detects that it is heavily preempted. This
> is effectively a way to detect things like single-stepping, strace,
> ltrace, and so on.

That's about a third of my objection.  The next third is that it seems
messy to me to intentionally expose what should be an invisible
detail, and the final third is that I still haven't seen convincing evidence
that anything genuinely needs an event counter.  See below and my
other email.

>
> One can also pretty much trivially use rdtsc (x86), tb register
> (powerpc) or ARM ccnt register (if user access is enabled) to
> detect whether it is running under observation.
> (this would be your timing-and-guessing approach)
>

Mostly irrelevant, but rdtsc is trappable.

> I hope that very *few* users will completely open-code their asm
> sequences. For instance, if you do the whole list pop in asm, you
> need to:
>
> 1) store current rseq_cs pointer
> 2) load CPU number from struct rseq,
> 3) pointer-chase the right per-cpu data that you want to act on,
> 4) load head
> 5) compare head against NULL
> 6) conditional branch
> 7) load head->next
> 8) store head->next to head (commit)
>
> step 3) is the interesting bit here:
> It heavily depends on the application's per-cpu data layout. It can be
> a large array of per-cpu integers, or an array of per-cpu structures
> at a specific offset, or an array of per-cpu struct + offset + pointer
> to some other structure (dynamically allocated), and so on. I do not
> want to impose any restriction on the data structure layout of
> user-space applications. This means the asm-specialized-stuff will
> have to be specialized for each per-cpu data structure layout, which
> is a pain.

To my point: I don't think this is a particularly good example.  I
think your examples are buggy (see below), and I would implement this
quite differently if I were trying to use C (or even asm, perhaps),
like this:

1) load CPU number.
2) pointer-chase the right per-cpu data, in C if desired.
3) store rseq_cs (which pretty much has to be in asm using the current ABI)
4) load head, compare against null, load head->next
5) commit

I don't see how the event counter is at all useful here.
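
Roughly, as a hedged sketch of that split (rseq_current_cpu() and the
asm commit helper are assumed names, and the asm itself is only
summarized in a comment since the exact sequence depends on the final
ABI):

struct freelist_entry { struct freelist_entry *next; };

extern int rseq_current_cpu(void);
/* Asm helper for steps 3-5: stores rseq_cs, re-checks cpu, loads head,
 * bails out on NULL, loads head->next, and commits it as the new head.
 * Because preemption aborts the sequence before the commit, the
 * head->next load cannot see freed memory in here -- which is the
 * point of doing these loads inside the critical section.  Returns the
 * popped entry, NULL for an empty list, or (void *)-1 on abort. */
extern struct freelist_entry *rseq_pop_commit(struct freelist_entry **headp,
                                              int cpu);

struct freelist_entry *list_pop(struct freelist_entry **percpu_heads)
{
        for (;;) {
                int cpu = rseq_current_cpu();                    /* step 1 */
                struct freelist_entry **headp = &percpu_heads[cpu]; /* step 2 */
                struct freelist_entry *e = rseq_pop_commit(headp, cpu);

                if (e != (struct freelist_entry *)-1)
                        return e;        /* popped an entry, or list empty */
                /* aborted (preempted or migrated): retry */
        }
}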

FWIW, I can easily imagine alternative ABIs in which abort_ip is
treated as a *call* instead of a jump to enable much better C support.

> Having the event_counter in place allows users to code the
> pointer-chasing in pure C, which we can expect will help adoption
> and ensure people use rseq through a utility library header file
> rather than open-coding everything (with the associated testing
> problems).

True, but my variant works just as well, might be faster, and doesn't
need event_counter.

>
> If we look at the list pop alternative using C for pointer-chasing,
> with event_counter:
>
> <C>
> 1) 64-bit load of both event_counter and cpu_id
> 2) Use cpu_id to pointer-chase the per-cpu data
> 3) Load head
> 4) compare head with NULL
> 5) conditional branch
> 6) Load head->next

^^^ Potential segfault here.  head might have been freed and unmapped.

> <ASM>
> 7) store current rseq_cs
> 8) Load event counter
> 9) compare current event_counter with event_counter loaded in 1)
> 10) conditional branch
> 11) store new value to @loc
>
> For a total of 2 stores, 4 loads, one or more loads for pointer-chasing,
> 2 compares, and 2 conditional branches. Again, we can expect the load in 3)
> to stall due to pointer-chasing.
>
> And here is the resulting list pop where the algorithm is in C,
> without the sequence counter. Let's make it a simple case
> where the application is not changing the layout of the data
> structures being pointer-chased concurrently from a book-keeping
> algorithm (because that would add an extra compare in the asm).
>
> <C>
> 1) 32-bit load of cpu_id
> 2) Use cpu_id to pointer-chase the per-cpu data
> 3) Load head
> 4) compare head with NULL
> 5) conditional branch
> 6) Load head->next

^^^ ditto: potential segfault here

> <ASM>
> 7) store current rseq_cs
> 8) 32-bit load of cpu_id
> 9) compare cpu_id with prior cpu_id read in 1)
> 10) conditional branch
> 11) Load head
> 12) compare head against head loaded in 3)
> 13) conditional branch
> 14) Load head->next
> 15) compare head->next against head->next loaded in 6) --> this prevents ABA
> 16) conditional branch
> 17) store head->next as new head
>
>
> Would it be required to change the kernel-user ABI when switching to
> RDPID? Can this co-exist with TLS users?

RDPID gives you the CPU number without reading memory, and that's
pretty much all it does.  The format in which it returns it sucks,
unfortunately, but I'm going to try to fix that by the time rseq is
merged.

--Andy


* Re: rseq event_counter field
  2017-10-28 20:20     ` rseq event_counter field Andy Lutomirski
@ 2017-10-29  1:07       ` Mathieu Desnoyers
  2017-10-30 18:08       ` Kyle Huey
  1 sibling, 0 replies; 4+ messages in thread
From: Mathieu Desnoyers @ 2017-10-29  1:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kyle Huey, Ben Maurer, Boqun Feng, Will Deacon, Peter Zijlstra,
	Paul Turner, linux-kernel, Linus Torvalds

----- On Oct 28, 2017, at 10:20 PM, Andy Lutomirski luto@kernel.org wrote:

> [...]
> 
> To my point: I don't think this is a particularly good example.  I
> think your examples are buggy (see below),

Good catch! The list pop I currently have is indeed only meant to be
used on type-safe memory (or objects otherwise guaranteed to exist by
some other means, such as RCU). That kind of constraint can be very
unexpected and tricky for users. So I'm starting to better understand
what you are getting at...


> and I would implement this
> quite differently if I were trying to use C (or even asm, perhaps),
> like this:
> 
> 1) load CPU number.
> 2) pointer-chase the right per-cpu data, in C if desired.
> 3) store rseq_cs (which pretty much has to be in asm using the current ABI)
> 4) load head, compare against null, load head->next
> 5) commit
> 
> I don't see how the event counter is at all useful here.

OK, so there is a "use-after-free" on non-type-safe memory for list pop
in both my rseq fast path and cpu_opv slow path currently. We'll indeed
need to have the load head, compare != NULL, load head->next all in asm
(and in the cpu_opv preempt-off section) to prevent use-after-free in the
general case.

This makes a very strong case for removing the event_counter, since its
use does not protect against use-after-free, and would therefore be
quite error-prone.

> 
> FWIW, I can easily imagine alternative ABIs in which abort_ip is
> treated as a *call* instead of a jump to enable much better C support.
> 
>> Having the event_counter in place allows users to code the
>> pointer-chasing in pure C, which we can expect will help adoption
>> and ensure people use rseq through a utility library header file
>> rather than open-coding everything (with the associated testing
>> problems).
> 
> True, but my variant works just as well, might be faster, and doesn't
> need event_counter.
> 
>>
>> If we look at the list pop alternative using C for pointer-chasing,
>> with event_counter:
>>
>> <C>
>> 1) 64-bit load of both event_counter and cpu_id
>> 2) Use cpu_id to pointer-chase the per-cpu data
>> 3) Load head
>> 4) compare head with NULL
>> 5) conditional branch
>> 6) Load head->next
> 
> ^^^ Potential segfault here.  head might have been freed and unmapped.
> 
>> <ASM>
>> 7) store current rseq_cs
>> 8) Load event counter
>> 9) compare current event_counter with event_counter loaded in 1)
>> 10) conditional branch
>> 11) store new value to @loc
>>
>> For a total of 2 stores, 4 loads, one or more loads for pointer-chasing,
>> 2 compares, and 2 conditional branches. Again, we can expect the load in 3)
>> to stall due to pointer-chasing.
>>
>> And here is the resulting list pop where the algorithm is in C,
>> without the sequence counter. Let's make it a simple case
>> where the application is not changing the layout of the data
>> structures being pointer-chased concurrently from a book-keeping
>> algorithm (because that would add an extra compare in the asm).
>>
>> <C>
>> 1) 32-bit load of cpu_id
>> 2) Use cpu_id to pointer-chase the per-cpu data
>> 3) Load head
>> 4) compare head with NULL
>> 5) conditional branch
>> 6) Load head->next
> 
> ^^^ ditto: potential segfault here
> 

Yep, good catch! Thanks for showing me how you envision the
userspace C vs ASM split. I'll try to come up with improved
user-space fast/slow paths based on your suggestions, and
therefore remove the event counter.

>> <ASM>
>> 7) store current rseq_cs
>> 8) 32-bit load of cpu_id
>> 9) compare cpu_id with prior cpu_id read in 1)
>> 10) conditional branch
>> 11) Load head
>> 12) compare head against head loaded in 3)
>> 13) conditional branch
>> 14) Load head->next
>> 15) compare head->next against head->next loaded in 6) --> this prevents ABA
>> 16) conditional branch
>> 17) store head->next as new head
>>
>>
>> Would it be required to change the kernel-user ABI when switching to
>> RDPID? Can this co-exist with TLS users?
> 
> RDPID gives you the CPU number without reading memory, and that's
> pretty much all it does.  The format in which it returns it sucks,
> unfortunately, but I'm going to try to fix that by the time rseq is
> merged.

OK, so it is entirely independent of rseq, and can simply be used as
an alternative way of reading the cpu_id.

Thanks for the thorough explanation, I think I know what to aim for now.

Mathieu

> 
> --Andy

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: rseq event_counter field
  2017-10-28 20:20     ` rseq event_counter field Andy Lutomirski
  2017-10-29  1:07       ` Mathieu Desnoyers
@ 2017-10-30 18:08       ` Kyle Huey
  2017-10-30 19:20         ` Mathieu Desnoyers
  1 sibling, 1 reply; 4+ messages in thread
From: Kyle Huey @ 2017-10-30 18:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mathieu Desnoyers, Ben Maurer, Boqun Feng, Will Deacon,
	Peter Zijlstra, Paul Turner, linux-kernel, Linus Torvalds,
	Robert O'Callahan

On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski <luto@kernel.org> wrote:
> Answering both emails here.
>
> Also, welcome Kyle.  Kyle, how badly does rseq's proposed
> event_counter break rr?  For that matter, how badly does rseq without
> an event_counter break rr?
> [...]

(+roc)

Thanks for roping me in.  I'm not sure I fully understand the proposal
(are there docs somewhere outside of this email thread?) but I don't
think this will be a huge problem for rr because:

1) Since rseq is initiated through a syscall, we can always make that
fail and deal with not recording programs that use it.  As long as it
doesn't become widely used in glibc or something, that's probably
bearable.
2) We ought to be able to emulate it if we have to, since everything
is mediated by the kernel.  rr might need to be able to see when a
thread migrates from one CPU to another (via a perf event or
something, idk if this functionality already exists) and when an rseq
restart happens (via ptrace or perf events or something).

RDPID sounds annoying.  My copy of the SDM doesn't seem to indicate
whether it obeys the RDTSC(P) faulting mechanism.  Does anyone know
what microarchitecture introduced that instruction?  We could always
mask off the CPUID bit indicating it's supported.

A similar concept on ARM (the load-linked/store-conditional pattern of
the LDREX/STREX instructions, sketched below) doomed our effort to port
rr to ARM a few years back.  The key problems there for us were:

1. Hardware interrupts in the critical section trigger a failed store
because the kernel will grab the per-CPU exclusive lock.
2. There's no way to observe when a store fails so as to induce the
same failures during replay, and
3. That pattern is used all over the place for every atomic operation on
ARM, so blacklisting or emulating it wasn't feasible.
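
For reference, here's a hedged ARM32 inline-asm sketch of that pattern,
an atomic increment.  Any interrupt or context switch between ldrex and
strex clears the exclusive monitor, so the strex fails and the loop
retries -- invisibly to a recorder:

static inline void atomic_inc(int *p)
{
        int tmp, failed;

        asm volatile(
                "1:     ldrex   %0, [%2]\n"     /* load-exclusive *p */
                "       add     %0, %0, #1\n"
                "       strex   %1, %0, [%2]\n" /* %1 = 0 on success */
                "       cmp     %1, #0\n"
                "       bne     1b\n"           /* lost exclusivity: retry */
                : "=&r"(tmp), "=&r"(failed)
                : "r"(p)
                : "cc", "memory");
}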

If everything is mediated by the kernel and the kernel exposes enough
information through ptrace/perf events/whatever, I think we'll be ok.

Thanks,

- Kyle


* Re: rseq event_counter field
  2017-10-30 18:08       ` Kyle Huey
@ 2017-10-30 19:20         ` Mathieu Desnoyers
  0 siblings, 0 replies; 4+ messages in thread
From: Mathieu Desnoyers @ 2017-10-30 19:20 UTC (permalink / raw)
  To: Kyle Huey
  Cc: Andy Lutomirski, Ben Maurer, Boqun Feng, Will Deacon,
	Peter Zijlstra, Paul Turner, linux-kernel, Linus Torvalds,
	Robert O'Callahan

----- On Oct 30, 2017, at 2:08 PM, Kyle Huey me@kylehuey.com wrote:

> On Sat, Oct 28, 2017 at 1:20 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> [...]
> 
> (+roc)
> 
> Thanks for roping me in.  I'm not sure I fully understand the proposal
> (are there docs somewhere outside of this email thread?)

The tree with my current implementation is at
https://git.kernel.org/pub/scm/linux/kernel/git/rseq/linux-rseq.git/

The commit messages and implementation contain comments describing the
architecture of the rseq proposal.

However, the last bits about event_counter removal were discussed between
Andy and me just last week, so it's a bit soon to have this documented. ;)

> but I don't
> think this will be a huge problem for rr because:
> 
> 1) Since rseq is initiated through a syscall, we can always make that
> fail and deal with not recording programs that use it.  As long as it
> doesn't become widely used in glibc or something, that's probably
> bearable.

One goal is to make glibc use it for its memory allocator. However, I
would expect glibc to fall back to another allocator if it detects that
rseq is not available, so making the system call return -ENOSYS should
trigger the fallback behavior.
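
A hedged sketch of such a check -- the __NR_rseq number and the
argument list here are assumptions based on the current patch set, not
a settled ABI:

#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

static bool rseq_register(void *rseq_tls, size_t rseq_len)
{
        /* Assumed form: sys_rseq(rseq, rseq_len, flags). */
        if (syscall(__NR_rseq, rseq_tls, rseq_len, 0) == 0)
                return true;
        /* -ENOSYS (or a recorder like rr failing the call on purpose)
         * means the caller should take its non-rseq fallback path. */
        return false;
}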

> 2) We ought to be able to emulate it if we have to, since everything
> is mediated by the kernel.  rr might need to be able to see when a
> thread migrates from one CPU to another (via a perf event or
> something, idk if this functionality already exists) and when an rseq
> restart happens (via ptrace or perf events or something).

rseq users export a __rseq_table section entry for each rseq critical
section, which contains the start_ip, post_commit_offset, and abort_ip
fields. This should allow a debugger/single-stepping emulator to
understand that it should expect the kernel to "jump" to the abort_ip,
and thus set breakpoints at the abort_ip to plan for these "unexpected"
branches. It might become useful for architectures that perform
single-stepping by placing explicit breakpoints on the next instruction
(and on jump targets).
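
The descriptor shape, as a hedged sketch -- the field names follow the
description above, but the exact layout is whatever the tree linked
above defines:

#include <stdint.h>

struct rseq_cs {
        uint64_t start_ip;            /* first insn of the critical section */
        uint64_t post_commit_offset;  /* length up to the final commit store */
        uint64_t abort_ip;            /* where the kernel redirects on abort */
};

/* Each critical section emits one descriptor into the section, e.g.:
 *
 *   __attribute__((section("__rseq_table"), used))
 *   static const struct rseq_cs my_cs = { ... };
 *
 * so a debugger can walk __rseq_table and breakpoint every abort_ip. */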

Another feature provided by rseq is the in-TLS "cpu_id" field: the kernel
stores the current cpu_id value into that TLS field whenever it returns
to user-space. If you care about trapping RDPID or the sched_getcpu()
vDSO, you will probably care about that field too.

> 
> RDPID sounds annoying.  My copy of the SDM doesn't seem to indicate
> whether it obeys the RDTSC(P) faulting mechanism.  Does anyone know
> what microarchitecture introduced that instruction?  We could always
> mask off the CPUID bit indicating it's supported.
> 
> A similar concept on ARM (the load-linked/store-conditional pattern of
> the LDREX/STREX instructions) doomed our effort to port rr to ARM a few
> years back.  The key problems there for us were:
> 
> 1. Hardware interrupts in the critical section trigger a failed store
> because the kernel will grab the per-CPU exclusive lock.
> 2. There's no way to observe when a store fails so as to induce the
> same failures during replay, and
> 3. That pattern is used all over the place for every atomic operation on
> ARM, so blacklisting or emulating it wasn't feasible.

I'm with you there. I've looked at how gdb deals with ll/sc stuff, and
god it's ugly. It really expects specific patterns, and anything out of
the ordinary would fool it. That's why I decided to expose the
__rseq_table section so a debugger would not have to "guess" rseq asm
sequence patterns.

> 
> If everything is mediated by the kernel and the kernel exposes enough
> information through ptrace/perf events/whatever, I think we'll be ok.

The simple solution would be to -ENOSYS sys_rseq(). A more elaborate
solution would be to add breakpoints at each rseq_table abort_ip target,
so you can catch whenever the kernel aborts an rseq critical section.
You would probably have to tweak the rseq TLS cpu_id value presented to
user-space, if you care about that too.

I'm currently working on removing the "event_counter" from rseq.

Let me know if you need clarifications or more explanation on specific
points.

Thanks,

Mathieu

> 
> Thanks,
> 
> - Kyle

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

