On Wed, Oct 05, 2016 at 08:39:31AM -0700, Dave Hansen wrote:
> On 10/04/2016 05:41 AM, Jann Horn wrote:
> > $ time ./atomic_user_test 2 1 1000000000 # multi-threaded, no protection
> > real	0m9.550s
> > user	0m18.988s
> > sys	0m0.000s
> > $ time ./atomic_user_test 2 2 1000000000 # multi-threaded, racy protection
> > real	0m9.249s
> > user	0m18.430s
> > sys	0m0.004s
> > $ time ./atomic_user_test 2 3 1000000000 # multi-threaded, cmpxchg protection
> > real	1m47.331s
> > user	3m34.390s
> > sys	0m0.024s
> 
> Yikes, that does get a ton worse.

Yeah, but as Kees said, that's an absolute worst case, and while there
might be some performance impact with some very syscall-heavy real-world
usage, I think it's very unlikely to be that bad in practice.

It probably doesn't matter much here, but out of curiosity: Do you know
what makes this so slow? I'm not familiar with the details of how
processors work, and you're at Intel, so maybe you know more about this
or can ask someone who does? The causes I can imagine are:

1. Pipeline flushes because of branch prediction failures caused by
   more-or-less random cmpxchg retries? Pipeline flushes are pretty
   expensive, right?
2. Repeated back-and-forth bouncing of the cacheline because an increment
   via cmpxchg needs at least two accesses instead of one, and the
   cacheline could be "stolen" by the other thread between the READ_ONCE
   and the cmpxchg.
3. Simply the cost of retrying if the value has changed in the meantime.
4. Maybe if two CPUs try increments at the same time, with exactly the
   same timing, they get stuck in a tiny livelock where every cmpxchg
   fails because the value was just updated by the other core? And then
   something slightly disturbs the timing (interrupt / clock speed
   change / ...), allowing one task to win the race?
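
To make the hypotheses above a bit more concrete, here is a minimal
userspace sketch of an overflow-checked increment built on cmpxchg, with a
retry counter bolted on. It is not taken from atomic_user_test; the name
inc_cmpxchg_counted and the retries parameter are made up for
illustration, and C11 atomics stand in for the kernel's READ_ONCE() and
cmpxchg().

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical helper: overflow-checked increment via a cmpxchg loop.
 * *retries is bumped once for every failed cmpxchg attempt.
 * Returns false (and leaves *v untouched) if *v is already INT_MAX.
 */
static inline bool inc_cmpxchg_counted(atomic_int *v, unsigned long *retries)
{
	/* Access 1: plain load, the rough equivalent of READ_ONCE(). */
	int old = atomic_load_explicit(v, memory_order_relaxed);

	for (;;) {
		if (old == INT_MAX)
			return false;	/* refuse to overflow */

		/*
		 * Access 2: the cmpxchg itself. On failure, old is
		 * overwritten with the value currently in *v, so the
		 * loop retries without a separate reload.
		 */
		if (atomic_compare_exchange_strong_explicit(v, &old, old + 1,
							    memory_order_relaxed,
							    memory_order_relaxed))
			return true;

		(*retries)++;
	}
}
```

If each thread keeps its own retries counter (so that counting adds no
extra sharing), comparing the total against the iteration count might help
tell the cases apart: many retries would point towards 3 or 4, while few
retries with an equally bad runtime would point more towards 2.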

> But, I guess it's good to know we
> have a few choices between performant and absolutely "correct".

Hrm. My opinion is that the racy protection is unlikely to help much with
panic_on_oops=0. So IMO, on x86, it's more like a choice between:

 - performant, but pretty useless
 - performant, but technically unreliable and with panic on overflow
 - doing it properly, with a performance hit
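
As a rough illustration of why the racy variant is only "technically
unreliable", here is a sketch that assumes "racy protection" means
"increment unconditionally, then check for a wrap and undo it". That
reading is an assumption, and the two helper names are invented. The
increment and the undo are split into separate functions so that the
intermediate state can be looked at from a single thread.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Step 1 (hypothetical): unconditional atomic increment.
 * Returns true if this increment wrapped the counter past INT_MAX.
 * Atomic arithmetic on signed types wraps silently in C11.
 */
static inline bool racy_inc_step(atomic_int *v)
{
	return atomic_fetch_add_explicit(v, 1, memory_order_relaxed) == INT_MAX;
}

/*
 * Step 2 (hypothetical): undo the increment after a wrap was detected.
 * Between step 1 and step 2, other threads can observe the wrapped value.
 */
static inline void racy_undo_step(atomic_int *v)
{
	atomic_fetch_sub_explicit(v, 1, memory_order_relaxed);
}
```

Starting from INT_MAX, the counter reads INT_MIN after step 1 and only
returns to INT_MAX after step 2. A cmpxchg loop has no such window, because
it never stores the wrapped value in the first place.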

> Do you have any explanation for "racy protection" going faster than no
> protection?

My guess is "I'm not measuring well enough and random stuff is going on".
Re-running these on the same box, I get the following numbers:

$ time ./atomic_user_test 2 1 1000000000 # multi-threaded, no protection
real	0m9.549s
user	0m19.023s
sys	0m0.000s
$ time ./atomic_user_test 2 2 1000000000 # multi-threaded, racy protection
real	0m9.586s
user	0m19.154s
sys	0m0.001s

(This might be because I'm using the ondemand governor, because my CPU has
the 4GHz boost thing, because stuff in the background is randomly
interfering... no idea.)