On Wed, Oct 05, 2016 at 08:39:31AM -0700, Dave Hansen wrote:
> On 10/04/2016 05:41 AM, Jann Horn wrote:
> > $ time ./atomic_user_test 2 1 1000000000 # multi-threaded, no protection
> > real 0m9.550s
> > user 0m18.988s
> > sys 0m0.000s
> > $ time ./atomic_user_test 2 2 1000000000 # multi-threaded, racy protection
> > real 0m9.249s
> > user 0m18.430s
> > sys 0m0.004s
> > $ time ./atomic_user_test 2 3 1000000000 # multi-threaded, cmpxchg protection
> > real 1m47.331s
> > user 3m34.390s
> > sys 0m0.024s
>
> Yikes, that does get a ton worse.

Yeah, but as Kees said, that's an absolute worst case, and while there
might be some performance impact with some very syscall-heavy real-world
usage, I think it's very unlikely to be that bad in practice.

It probably doesn't matter much here, but out of curiosity: Do you know
what makes this so slow? I'm not familiar with details of how processors
work - and you're at Intel, so maybe you know more about this or can ask
someone who knows? The causes I can imagine are:

1. Pipeline flushes because of branch prediction failures caused by
   more-or-less random cmpxchg retries? Pipeline flushes are pretty
   expensive, right?
2. Repeated back-and-forth bouncing of the cacheline because an increment
   via cmpxchg needs at least two accesses instead of one, and the
   cacheline could be "stolen" by the other thread between the READ_ONCE
   and the cmpxchg.
3. Simply the cost of retrying if the value has changed in the meantime.
4. Maybe if two CPUs try increments at the same time, with exactly the
   same timing, they get stuck in a tiny livelock where every cmpxchg
   fails because the value was just updated by the other core? And then
   something slightly disturbs the timing (interrupt / clock speed
   change / ...), allowing one task to win the race?

> But, I guess it's good to know we
> have a few choices between performant and absolutely "correct".

Hrm. My opinion is that the racy protection is unlikely to help much with
panic_on_oops=0.
So IMO, on x86, it's more like a choice between:

 - performant, but pretty useless
 - performant, but technically unreliable and with panic on overflow
 - doing it properly, with a performance hit

> Do you have any explanation for "racy protection" going faster than no
> protection?

My guess is "I'm not measuring well enough and random stuff is going on".
Re-running these on the same box, I get the following numbers:

$ time ./atomic_user_test 2 1 1000000000 # multi-threaded, no protection
real 0m9.549s
user 0m19.023s
sys 0m0.000s
$ time ./atomic_user_test 2 2 1000000000 # multi-threaded, racy protection
real 0m9.586s
user 0m19.154s
sys 0m0.001s

(This might be because I'm using the ondemand governor, because my CPU has
the 4GHz boost thing, because stuff in the background is randomly
interfering... no idea.)
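
For readers without the test program at hand, here is a rough userspace
sketch of what the three modes compared above could look like. This is not
the source of atomic_user_test; the function names are made up for
illustration, it uses C11 <stdatomic.h> rather than kernel primitives, and
the shape of the "racy" variant (increment, then undo on overflow) is an
assumption. It is only meant to show why the cmpxchg variant needs a
separate load plus a retry loop while the other two are a single
read-modify-write in the common case.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* No protection: one atomic read-modify-write; wraps silently at INT_MAX. */
static void inc_unprotected(atomic_int *v)
{
    atomic_fetch_add(v, 1);
}

/*
 * Racy protection (assumed shape): increment first, then undo on overflow.
 * Another thread can observe the wrapped value between the two operations.
 * Returns false if the increment overflowed and was reverted.
 */
static bool inc_racy(atomic_int *v)
{
    int old = atomic_fetch_add(v, 1);

    if (old == INT_MAX) {
        atomic_fetch_sub(v, 1);
        return false;
    }
    return true;
}

/*
 * cmpxchg protection: load, check, then compare-and-swap. If another thread
 * changed the value between the load and the swap, the swap fails, `old` is
 * refreshed with the current value, and the loop retries.
 */
static bool inc_cmpxchg(atomic_int *v)
{
    int old = atomic_load(v);

    do {
        if (old == INT_MAX)
            return false; /* refuse to overflow */
    } while (!atomic_compare_exchange_weak(v, &old, old + 1));
    return true;
}

/* Hypothetical usage: both protected variants leave INT_MAX in place. */
static void demo_overflow_refused(void)
{
    atomic_int v = INT_MAX;
    bool ok;

    ok = inc_racy(&v);
    assert(!ok && atomic_load(&v) == INT_MAX);
    ok = inc_cmpxchg(&v);
    assert(!ok && atomic_load(&v) == INT_MAX);
}
```

Under contention, only inc_cmpxchg() can fail and loop, which is where
causes 1 to 4 above would come into play.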