From: "Paul E. McKenney" <paulmck@kernel.org>
To: Willy Tarreau <w@1wt.eu>
Cc: rcu@vger.kernel.org
Subject: Re: Making races happen more often
Date: Mon, 27 Sep 2021 21:01:47 -0700
Message-ID: <20210928040147.GA880162@paulmck-ThinkPad-P17-Gen-1>
In-Reply-To: <20210927064915.GA20117@1wt.eu>

On Mon, Sep 27, 2021 at 08:49:15AM +0200, Willy Tarreau wrote:
> Hi Paul,
> 
> On Sun, Sep 26, 2021 at 08:25:17PM -0700, Paul E. McKenney wrote:
> > > I wrote a program to measure inter-CPU atomic latencies and the success/failure
> > > rate of a CAS for each thread combination. Even on a
> > > Ryzen 2700X (8 cores in 2*4 blocks) I've measured up to ~4k consecutive
> > > failures. When trying to modify my code to try to enforce some fairness, I
> > > managed to go down to less than 4 failures. I was happy until I discovered
> > > that there was a bug in my program making it not do anything except check
> > > the shared variable that I'm using to enforce the fairness. So it seems
> > > that interleaving multiple independent accesses might sometimes help
> > > provide some fairness in all of this, which is what I'm going to work on
> > > today.
> > 
> > The textbook approach is to partition the algorithm, for example, with
> > a combining tree or similar.
> 
> That's what I'm working on for the long term. For the short term I'm
> trying to measure the average time spent trying and failing, and setting
> a threshold above which other threads do not start to work and let others
> finish first. This gives very good results, with peak CAS failures going
> down from ~300 to ~16 in my tests, and average time being even lower than
> before. However it's a bit addictive, and playing a bit too much with
> tuning tends to favor some archs at the expense of others. My experiments
> are here for the curious, though probably not very interesting at the moment:
> 
>    https://github.com/wtarreau/atomic-tests  

I took only a quick glance, but the results below do look interesting.
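
If I understand the scheme correctly, it is roughly of the following shape (all
names hypothetical and grossly simplified; this is not the code in the
repository above, just the general idea of "fail too long, then make newcomers
hold off"):

#include <stdatomic.h>

static _Atomic unsigned long shared_val;   /* the contended variable */
static atomic_int pressure;                /* "someone has failed too long" */

#define FAIL_THRESHOLD 64                  /* arbitrary, for illustration */

static unsigned long add_with_holdoff(unsigned long delta)
{
    unsigned long old, newval;
    int fails = 0;

    /* Newcomers hold off while another thread reports trouble, giving
     * the unlucky one a chance to finally get its CAS in. */
    while (atomic_load_explicit(&pressure, memory_order_relaxed))
        ;                                  /* ideally cpu_relax()/pause here */

    old = atomic_load_explicit(&shared_val, memory_order_relaxed);
    for (;;) {
        newval = old + delta;
        if (atomic_compare_exchange_weak_explicit(&shared_val, &old, newval,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
            break;
        /* Only write the shared "pressure" word once clearly unlucky, so
         * the common case adds no extra write traffic. */
        if (++fails == FAIL_THRESHOLD)
            atomic_store_explicit(&pressure, 1, memory_order_relaxed);
    }
    if (fails >= FAIL_THRESHOLD)
        atomic_store_explicit(&pressure, 0, memory_order_relaxed);
    return newval;
}

A real version would of course bound the initial spin and cope with several
threads being over the threshold at once.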

> > Otherwise, you are quite right, there are
> > all sorts of hardware-specific things that can happen at high levels of
> > memory contention.
> 
> There's not just memory contention; I found severe contention inside the
> L3 caches themselves. For example, AMD Ryzen CPUs have their caches arranged
> as "core complexes" of 4 cores connected to the same block of L3 cache, itself
> connected to the other "core complexes". Look at the measured inter-CPU
> latency (in nanoseconds) for a compare-and-swap between each of them:
> 
>          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>      0   -   8  24  24  25  24  26  26  69  68  68  69  69  69  69  69
>      1   8   -  24  24  24  24  26  26  69  69  69  69  69  69  69  69
>      2  24  24   -   8  23  23  24  24  69  69  68  68  69  69  68  69
>      3  24  24   8   -  23  23  24  24  69  68  68  69  69  69  68  71
>      4  25  25  23  23   -   8  25  25  69  69  69  69  70  70  69  70
>      5  25  24  23  23   8   -  25  25  69  69  70  69  70  69  70  69
>      6  26  26  24  24  25  25   -   8  69  69  69  69  69  69  69  69
>      7  26  26  24  24  25  25   8   -  68  69  69  70  70  70  70  69
>      8  69  69  68  68  69  69  69  69   -   8  24  24  25  25  26  26
>      9  69  69  68  68  69  69  69  69   8   -  24  24  25  25  26  26
>     10  69  69  69  69  69  70  69  68  24  24   -   8  23  23  24  24
>     11  69  69  69  67  68  69  68  68  24  24   8   -  23  23  24  24
>     12  69  69  69  68  70  70  69  69  25  25  23  23   -   8  25  25
>     13  69  69  69  68  70  70  69  69  25  24  23  23   8   -  25  25
>     14  69  69  68  69  70  69  69  69  26  26  24  24  25  25   -   8
>     15  69  69  68  68  70  70  69  69  26  26  24  24  25  25   8   -

Nice results!

I have been following the hardware-architecture approach, so that
the "memory" in "memory contention" includes the caches.

> And if you perform a load before the CAS (like for example when computing
> a max or updating a pointer), it gets better within the same core complex
> and much worse across complexes:
> 
>          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>      0   -  18  21  21  23  24  22  22 124 124 123 122 125 125 124 124
>      1  18   -  22  22  24  24  22  22 124 124 122 122 124 124 124 123
>      2  21  22   -  18  23  23  21  56 122 122 122 122 125 125 122 124
>      3  21  21  18   -  23  23  21  20 123 123 122 122 124 124 122 122
>      4  24  23  24  23   -  18  24  23 124 124 123 124 126 126 124 124
>      5  24  24  23  23  18   -  23  23 124 124 124 123 126 126 125 126
>      6  22  22  21  20  23  23   -  18 124 124 124 123 125 124 124 124
>      7  22  22  20  21  23  24  18   - 125 124 122 122 126 124 125 125
>      8 123 123 122 122 124 124 122 122   -  18  20  20  24  23  21  21
>      9 124 124 123 123 125 125 124 124  18   -  20  21  24  24  58  21
>     10 123 123 122 122 124 122 122 122  21  21   -  18  22  23  20  20
>     11 123 122 122 122 123 124 124 123  20  20  18   -  23  23  23  20
>     12 125 124 124 123 126 126 125 124  23  23  23  23   -  18  24  23
>     13 126 125 124 123 126 126 124 124  23  23  24  22  18   -  23  23
>     14 124 124 123 122 124 124 124 124  20  20  20  21  23  25   -  18
>     15 124 124 123 123 125 124 124 124  22  21  20  56  24  24  18   -
> 
> See? There's basically a factor of 6 in latency depending on which cores
> are supposed to be connected to the same L3. And looking closer you can
> even observe that within the same core complex, things are not uniform at
> all either:
>   - core 1 always experiences fastest communication with other cores
>   - core 2 always experiences slowest communication with other cores
>   - discrepancies between pairs vary by up to 50%

So what rcutorture would want to do here is to assign the first four-CPU
guest OS to CPUs 0, 1, 8, and 9, right?  It would actually do that if
the /sys files indicate that CPUs 0-7 are on one node and CPUs 8-15
on another.
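
For what it's worth, the L3 grouping itself is visible from user space on
machines that expose cache topology, so no guessing is needed. Something like
this (sketch only; a robust version would check each index's "level" file
rather than assuming index3 is the L3):

#include <stdio.h>

int main(void)
{
    char path[128], line[256];

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                       /* no such CPU (or no L3 info) */
        if (fgets(line, sizeof(line), f))
            printf("cpu%d shares L3 with: %s", cpu, line);
        fclose(f);
    }
    return 0;
}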

> However when it comes to the success/failure of a CAS, it's the opposite:
>   - after a load, core 2 almost always succeeds a CAS (~50 tries)
>   - after a load, core 1 almost always fails a CAS (~500 tries)
>   - after a load, cores 0 and 3 are average (~100 tries)
> 
> This indicates to me that there's probably some arbitration cost to the
> cores that access the cache, and that cache lines assigned to one core
> or the other tend to take more time for some cores than others to be
> reclaimed, and that cores experiencing the highest latencies are also
> the ones keeping the lines longer and observing the highest success
> rate on CAS.

Cute!

> In your case it could mean that certain races will never happen on certain
> cores but could on other ones, or at least above certain frequencies,
> because if the cache line sticks long enough in an L2 to perform a series
> of operations undisturbed, whatever you'll do on other cores might have no
> effect.

In the cases exposed by my differential-latency testing, the issue was
pretty clearly the added delay.  This is because the issue involved
normal memory references for the RCU Tasks Trace bugs, read-side access
to seqlock-protected memory for the 2,199.0-second RCU CPU stall warning
(normal memory accesses plus memory barriers), and a lockdep false
positive in the Tiny SRCU case (though in lockdep's defense, this was
at best an accident waiting to happen).
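
For anyone unfamiliar with the pattern, the read side in question is
essentially the usual seqlock retry loop, which is why it amounts to normal
memory accesses plus memory barriers. A user-space analog, purely for
illustration (not the kernel's read_seqbegin()/read_seqretry() code itself):

#include <stdatomic.h>

struct seq_demo {
    atomic_uint seq;          /* odd while a writer is updating the data */
    atomic_int a, b;          /* the protected data */
};

static void read_pair(struct seq_demo *s, int *a, int *b)
{
    unsigned int start;

    do {
        start = atomic_load_explicit(&s->seq, memory_order_acquire);
        *a = atomic_load_explicit(&s->a, memory_order_relaxed);
        *b = atomic_load_explicit(&s->b, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        /* Retry if a writer was active or snuck in during the reads. */
    } while ((start & 1) ||
             atomic_load_explicit(&s->seq, memory_order_relaxed) != start);
}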

But I completely agree that cmpxchg operations on highly contended
memory locations can expose all sorts of other issues.

> I suspect that working at much lower frequencies could at some point allow
> a cache line to be released between two operations. I don't seem to be observing
> this on the AMD for now though, as the differences increase when lowering
> the frequency from 4.0 to 2.2G for a single CAS:
> 
>          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>      0   -  19  55  55  48  48  62  62 103 103 105 105 103 103 103 103
>      1  19   -  55  55  48  48  62  62 103 103 105 105 103 103 103 103
>      2  55  55   -  19  40  40  55  55 104 104 106 106 107 107 104 103
>      3  55  55  19   -  40  40  55  55 104 104 105 106 107 107 104 104
>      4  48  48  40  40   -  19  47  47 109 109 107 106 105 105 112 111
>      5  48  48  40  40  19   -  47  47 109 109 107 106 105 105 111 111
>      6  62  62  55  55  47  47   -  19 103 103 105 105 103 103 103 103
>      7  62  62  55  55  47  47  19   - 103 103 105 105 103 103 103 103
>      8 103 103 104 104 110 109 103 103   -  19  55  55  48  48  62  62
>      9 103 103 104 104 110 110 103 103  19   -  55  55  48  48  62  62
>     10 105 105 105 106 106 106 105 105  55  55   -  19  40  40  55  55
>     11 105 105 105 106 107 107 105 105  55  55  19   -  40  40  55  55
>     12 103 103 108 107 105 105 103 103  48  48  40  40   -  19  47  47
>     13 103 103 107 107 105 105 103 103  48  48  40  40  19   -  47  47
>     14 103 103 105 104 110 111 103 103  62  62  55  55  47  47   -  19
>     15 103 103 104 104 111 111 103 103  62  62  55  55  47  47  19   -
> 
> Where the same core complex would see variations from 22 to 26ns, now we're
> seeing 40 to 62!

Huh.  Are the fast CPUs seeing higher latencies due to delays from the
slower CPUs?  Or is the clock-frequency difference causing increased
delays in and of itself?  (Crossing clock-frequency domains can be much
harder than it might seem to us software folks.)

So adjacent pairs of CPUs are electrically close as well?  I am looking
at the 18s and 19s along the diagonal above.  There appears to be
additional structure above the CPU-pair level within an 8-CPU group.
A linear arrangement of four pairs of CPUs?

> > Other than that, there are a lot of heuristics involving per-node counters
> > so that if a given node has had more than N consecutive shots and some
> > other node wants in, it holds off until the other node gets in. Complex
> > and tricky, so partitioning is better where it can be done.
> 
> I tried a bit with this but it didn't yield very interesting results
> for now; I think the reason is that I'm focusing on max latency, and
> with this, the max latency experienced by any CPU becomes the number of
> CPUs times the max granted to a single CPU. Playing with roughly averaged
> wait times instead tends to favor the threads that have waited longest. The only
> thing is that I try to avoid updating this when the wait time is low
> enough, in order to avoid writes to a shared area. One of the side effects
> of writing to such a shared area is that some operations may be delayed
> due to write ordering. I.e. writing that you're waiting for the access,
> then performing a CAS will first require that the line containing that
> "waiting" flag be flushed before trying the CAS, which may increase the
> risk that another CPU in the mean time updates it and causes the CAS to
> fail. And if the "waiting" flag is shared with the area being touched by
> the CAS, then touching it to say "I'm here" is enough to slow down the
> other ones. Thus I prefer to watch a shared entry that is not supposed to
> be updated often, except when CPUs are fed up with waiting. Under normal
> operation it should perform better.

All points in favor of your long-term plan of reducing contention of
the variable being cmpxchg()ed.
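
For completeness, the simplest form of that partitioning is the classic split
counter, sketched below with hypothetical names: updates only touch a per-node
slot (the node index could come from the L3 grouping discussed earlier), and
only the rarer readers sum across nodes. This of course only works when the
operation can be split that way; pointer updates or maxima need the harder
combining-tree style approach instead.

#include <stdatomic.h>

#define NR_NODES  4                     /* e.g. one slot per core complex */
#define CACHELINE 64

/* One counter per "node", each on its own cache line to avoid false sharing. */
struct node_counter {
    _Alignas(CACHELINE) atomic_long val;
};

static struct node_counter counters[NR_NODES];

/* Updates only touch the local node's cache line... */
static void counter_add(int node, long delta)
{
    atomic_fetch_add_explicit(&counters[node].val, delta,
                              memory_order_relaxed);
}

/* ...and the (rarer, possibly slightly stale) read sums all of them. */
static long counter_read(void)
{
    long sum = 0;

    for (int i = 0; i < NR_NODES; i++)
        sum += atomic_load_explicit(&counters[i].val, memory_order_relaxed);
    return sum;
}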

> Another point I'm quite interested in is trying to group operations from
> close neighbors. In the example above on the Ryzen CPU, I would love to
> see CPUs from the same core complex try their luck together. One could
> think about using a CPU number to distribute accesses so that they're
> grouped, but it's less convenient from userland (or even from a VM). One
> element, however, knows better than the software: the cache controller.
> That's why I'm trying to keep *some* contention, so that once a core gets
> its access to the cache, the other ones on the same side of the L3 have a
> better chance to access it immediately after. And I think that's how I've
> reached an average of 45ns per operation (stats included) in my last test
> across all CPUs despite inter-CPU times ranging from 18 to 125ns. But that
> is quite unstable for now.

And very very dependent on the exact hardware.  I have seen hardware
where prefetching the line passed to cmpxchg doubled the overhead,
and some where it made no measurable difference.
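
The variant I have in mind is simply something like the sketch below; whether
the write-intent prefetch hides the coherence miss or just adds an extra
ownership transfer is exactly what differs from one part to another:

#include <stdatomic.h>

static int cas_with_prefetch(_Atomic unsigned long *p,
                             unsigned long oldv, unsigned long newv)
{
    /* Hint that we intend to write the line, then attempt the CAS. */
    __builtin_prefetch((const void *)p, 1 /* write */, 3 /* high locality */);
    return atomic_compare_exchange_strong_explicit(p, &oldv, newv,
                                                   memory_order_acq_rel,
                                                   memory_order_relaxed);
}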

> > > > All that aside, any advice on portably and usefully getting 2-3x clock
> > > > frequency differences into testing would be quite welcome.
> > > 
> > > I think that playing with scaling_max_freq and randomly setting it
> > > to either cpuinfo_min_freq or cpuinfo_max_freq could be a good start.
> > > However it will likely not work inside KVM, but for bare-metal tests
> > > it should work where cpufreq is supported.
> > 
> > OK, that explains at least some of my difficulties.  I almost always
> > run under qemu/KVM.
> 
> But you could disturb the real machine by changing its frequency.

Fair point, and that should fit into the memory-latency disturbing
that I am currently doing.
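
Something along these lines, for example, using the documented cpufreq sysfs
files (sketch only: needs root and bare metal, and scaling_max_freq should be
restored afterwards):

#include <stdio.h>
#include <stdlib.h>

/* Randomly clamp one CPU's max frequency to either its hardware minimum or
 * maximum, to create persistent cross-CPU speed differences. */
static void set_random_max_freq(int cpu)
{
    char path[128], freq[64];
    const char *which = (rand() & 1) ? "cpuinfo_min_freq" : "cpuinfo_max_freq";
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/%s", cpu, which);
    f = fopen(path, "r");
    if (!f)
        return;
    if (!fgets(freq, sizeof(freq), f)) {
        fclose(f);
        return;
    }
    fclose(f);

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);
    f = fopen(path, "w");
    if (f) {
        fputs(freq, f);
        fclose(f);
    }
}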

> I've been wondering if it's possible to change the tick rate inside KVM,
> or even program a dummy device to deliver very fast IRQs but I haven't
> found such a thing for now. I've been thinking about NMIs as well. The
> idea would be that there could be a higher chance of interrupting
> processing at any random place using fast interrupts (>100k/s) than by
> using context switching caused by scheduler slots being depleted. Maybe
> one only needs to hack an existing device emulator to flood an IRQ :-/

Or have a bug in the timekeeping code!  ;-)

							Thanx, Paul
