From: "Paul E. McKenney" <paulmck@kernel.org>
To: Willy Tarreau <w@1wt.eu>
Cc: rcu@vger.kernel.org
Subject: Re: Making races happen more often
Date: Thu, 30 Sep 2021 11:46:23 -0700
Message-ID: <20210930184623.GI880162@paulmck-ThinkPad-P17-Gen-1>
In-Reply-To: <20210929195908.GA9405@1wt.eu>

On Wed, Sep 29, 2021 at 09:59:08PM +0200, Willy Tarreau wrote:
> On Mon, Sep 27, 2021 at 09:01:47PM -0700, Paul E. McKenney wrote:
> > > And if you perform a load before the CAS (for example when computing
> > > a max or updating a pointer), it gets better within the same core complex
> > > and much worse across others:
> > > 
> > >          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> > >      0   -  18  21  21  23  24  22  22 124 124 123 122 125 125 124 124
> > >      1  18   -  22  22  24  24  22  22 124 124 122 122 124 124 124 123
> > >      2  21  22   -  18  23  23  21  56 122 122 122 122 125 125 122 124
> > >      3  21  21  18   -  23  23  21  20 123 123 122 122 124 124 122 122
> > >      4  24  23  24  23   -  18  24  23 124 124 123 124 126 126 124 124
> > >      5  24  24  23  23  18   -  23  23 124 124 124 123 126 126 125 126
> > >      6  22  22  21  20  23  23   -  18 124 124 124 123 125 124 124 124
> > >      7  22  22  20  21  23  24  18   - 125 124 122 122 126 124 125 125
> > >      8 123 123 122 122 124 124 122 122   -  18  20  20  24  23  21  21
> > >      9 124 124 123 123 125 125 124 124  18   -  20  21  24  24  58  21
> > >     10 123 123 122 122 124 122 122 122  21  21   -  18  22  23  20  20
> > >     11 123 122 122 122 123 124 124 123  20  20  18   -  23  23  23  20
> > >     12 125 124 124 123 126 126 125 124  23  23  23  23   -  18  24  23
> > >     13 126 125 124 123 126 126 124 124  23  23  24  22  18   -  23  23
> > >     14 124 124 123 122 124 124 124 124  20  20  20  21  23  25   -  18
> > >     15 124 124 123 123 125 124 124 124  22  21  20  56  24  24  18   -
> > > 
> > > See? There's basically a factor of 6 in latency depending on which cores
> > > are supposed to be connected to the same L3. And looking closer, you can
> > > even observe that within the same core complex, things are not uniform at
> > > all either:
> > >   - core 1 always experiences the fastest communication with other cores
> > >   - core 2 always experiences the slowest communication with other cores
> > >   - discrepancies between pairs vary by up to 50%
> > 
> > So what rcutorture would want to do here is to assign the first four-CPU
> > guest OS to CPUs 0, 1, 8, and 9, right?
> 
> Interesting idea, I hadn't thought about that. With 4 CPUs I would go even
> further. Here, as can be seen from the low latency between adjacent CPUs of
> the same pair, those are the two threads of the same core. So 0,1 are core 0
> of die 0, and 8,9 are core 0 of die 1. Thus I would assign two threads of the
> same average core (either core 0 or 3 of one die), plus one thread of a fast
> core and one thread of a slow core of the other die. That would give for
> example CPUs 0,1,10,12. That's how I imagine getting the highest unfairness
> between threads for your use case. A variant could be to use CPU 2 instead
> of 10 (the low-latency core).

Do CPUs 0-7 and 8-15 show up in different nodes in sysfs?  If so, this
case is already covered within rcutorture.  ;-)
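
For the record, a minimal sketch of such a check, assuming the usual
/sys/devices/system/node/nodeN/cpulist layout:

#include <stdio.h>

int main(void)
{
	char path[64], buf[256];

	/* Print the CPUs that sysfs assigns to each NUMA node. */
	for (int node = 0; ; node++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/cpulist", node);
		FILE *f = fopen(path, "r");

		if (!f)
			break;	/* no more nodes */
		if (fgets(buf, sizeof(buf), f))
			printf("node%d: %s", node, buf);
		fclose(f);
	}
	return 0;
}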

> > It would actually do that if
> > the /sys files indicate that CPUs 0-7 are on one node and CPUs 8-15
> > on another.
> 
> Yep. Also, please note that my tests on a Xeon W2145 (8 cores / 16 threads
> as well, just organized differently) didn't show any difference at all
> between the cache access patterns of any of the cores. Thus at best you
> could hope for sharing two threads of one core plus two other cores, but
> that's all. I haven't tested on an EPYC yet, but I'd be interested in seeing
> whether recent AMDs exhibit the same imbalance between their cores as their
> ancestors.

I would expect quite a bit of variety.

> > > In your case it could mean that certain races will never happen on certain
> > > cores but could on other ones, or at least above certain frequencies,
> > > because if the cache line sticks in an L2 long enough to perform a series
> > > of operations undisturbed, whatever you do on other cores might have no
> > > effect.
> > 
> > In the cases exposed by my differential-latency testing, the issue was
> > pretty clearly the added delay.  This is because the issue involved
> > normal memory references for the RCU Tasks Trace bugs, read-side access
> > to seqlock-protected memory for the 2,199.0-second RCU CPU stall warning
> > (normal memory accesses plus memory barriers), and a lockdep false
> > positive in the Tiny SRCU case (though in lockdep's defense, this was
> > at best an accident waiting to happen).
> 
> These really don't look like easy patterns to reproduce!

I could probably spend full time tweaking things and forcing different
classes of bugs to happen.  :-/

> > But I completely agree that cmpxchg operations on highly contended
> > memory locations can expose all sorts of other issues.
> 
> Yep, they're "interesting" because the operation is fundamentally flawed in
> itself, and is used as the most inefficient building block for plenty of
> other operations that emphasize bad patterns, starting with the opposite of
> cmpxchg, which would replace a value only if it's greater or lower, for
> example, or only if it's not NULL. Due to the way it works, slowdowns spiral
> down into a complete stall very quickly, and the exponential backoff that
> one might want to rely on usually degrades the situation further, since it
> requires a re-read before retrying.

Indeed!  That line of reasoning led to the Itanium ld.bias instruction
being my fault.  ;-)

And that is exactly why good concurrent algorithms maintain low
contention.  And I agree that this is more of an issue with cmpxchg than
(say) xchg, xadd, and so on.  Assuming competent hardware, anyway.  :-/
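
As a concrete illustration, here is a rough sketch of the "replace only if
greater" pattern built on compare-and-swap described above (C11 atomics,
hypothetical helper name); each failed CAS reloads the line and retries,
which is exactly what snowballs under contention:

#include <stdatomic.h>

static void update_max(_Atomic unsigned long *max, unsigned long v)
{
	unsigned long cur = atomic_load_explicit(max, memory_order_relaxed);

	/*
	 * Stop if the current value is already >= v; otherwise retry the
	 * CAS, which reloads cur into the local variable on each failure.
	 */
	while (cur < v &&
	       !atomic_compare_exchange_weak_explicit(max, &cur, v,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;
}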

> > > I suspect that working at much lower frequencies could at some point allow
> > > a cache line to be released between two operations. I don't seem to be
> > > observing this on the AMD for now though, as the differences increase when
> > > lowering the frequency from 4.0 to 2.2 GHz for a single CAS:
> > > 
> > >          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> > >      0   -  19  55  55  48  48  62  62 103 103 105 105 103 103 103 103
> > >      1  19   -  55  55  48  48  62  62 103 103 105 105 103 103 103 103
> > >      2  55  55   -  19  40  40  55  55 104 104 106 106 107 107 104 103
> > >      3  55  55  19   -  40  40  55  55 104 104 105 106 107 107 104 104
> > >      4  48  48  40  40   -  19  47  47 109 109 107 106 105 105 112 111
> > >      5  48  48  40  40  19   -  47  47 109 109 107 106 105 105 111 111
> > >      6  62  62  55  55  47  47   -  19 103 103 105 105 103 103 103 103
> > >      7  62  62  55  55  47  47  19   - 103 103 105 105 103 103 103 103
> > >      8 103 103 104 104 110 109 103 103   -  19  55  55  48  48  62  62
> > >      9 103 103 104 104 110 110 103 103  19   -  55  55  48  48  62  62
> > >     10 105 105 105 106 106 106 105 105  55  55   -  19  40  40  55  55
> > >     11 105 105 105 106 107 107 105 105  55  55  19   -  40  40  55  55
> > >     12 103 103 108 107 105 105 103 103  48  48  40  40   -  19  47  47
> > >     13 103 103 107 107 105 105 103 103  48  48  40  40  19   -  47  47
> > >     14 103 103 105 104 110 111 103 103  62  62  55  55  47  47   -  19
> > >     15 103 103 104 104 111 111 103 103  62  62  55  55  47  47  19   -
> > > 
> > > Where the same core complex would previously see variations from 22 to
> > > 26 ns, now we're seeing 40 to 62!
> > 
> > Huh.  Are the fast CPUs seeing higher latencies due to delays from the
> > slower CPUs?  Or is the clock-frequency difference causing increased
> > delays in and of itself?  (Crossing clock-frequency domains can be much
> > harder than it might seem to us software folks.)
> 
> Note, they're all running at the same frequency here (2.2 GHz in this
> test). I do think that all of this could be caused by non-integral
> frequency ratios between the cores and the caches, which would miss the
> end of a cycle and have to wait for yet another one. But I could be
> completely wrong. It could also be that the caches hold a line for at
> least X of their cycles in order to optimize for such test-and-set
> operations, and that below a certain core frequency the cache doesn't
> hold the line long enough for the operation to complete, causing repeated
> failures due to contention with other cores. Note that for the
> measurements above, I'm running only two threads for each measurement,
> and each thread waits for the other one's value in the shared counter
> before replacing it (i.e. one replaces odd values with even ones and the
> other replaces even values with odd ones).

OK, good to know!
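
A stripped-down sketch of that odd/even ping-pong might look as follows
(C11 atomics plus pthreads; the pinning that selects which pair of CPUs
gets measured is omitted, and all names are made up):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS	1000000UL

static _Atomic unsigned long counter;

/*
 * Each thread waits for the other one's parity, then bumps the counter,
 * so every successful CAS implies a cache-line transfer between the two
 * CPUs the threads run on (pinning via sched_setaffinity() not shown).
 */
static void *pingpong(void *arg)
{
	unsigned long parity = (unsigned long)arg;

	for (unsigned long i = 0; i < ROUNDS; i++) {
		unsigned long old;

		do {
			old = atomic_load(&counter);
		} while ((old & 1UL) != parity ||
			 !atomic_compare_exchange_weak(&counter, &old, old + 1));
	}
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	pthread_create(&t0, NULL, pingpong, (void *)0UL);
	pthread_create(&t1, NULL, pingpong, (void *)1UL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	clock_gettime(CLOCK_MONOTONIC, &b);

	double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
	printf("%.1f ns per hand-off\n", ns / (2.0 * ROUNDS));
	return 0;
}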

> > So adjacent pairs of CPUs are electrically close as well?  I am looking
> > at the 18s and 19s along the diagonal above.
> 
> These are sibling threads of the same core. I hadn't noticed that their
> time doesn't change with the frequency: it's still 18-19 ns at both 2.2
> and 4.0 GHz! This makes me think that the L1 is not shared at all and
> that exchanges between them go through the L2, which runs at a constant
> frequency rather than at the core frequency.
> 
> > There appears to be
> > additional structure above the CPU-pair level within an 8-CPU group.
> > A linear arrangement of four pairs of CPUs?
> 
> Yes, that's exactly what I meant by the core complexes. My understanding
> is that they're connected like this:
> 
>       CPU    CPU    CPU    CPU        CPU    CPU    CPU    CPU
>       0,1    2,3    4,5    6,7        8,9   10,11  12,13  14,15
>     +----+ +----+ +----+ +----+     +----+ +----+ +----+ +----+
>     |core| |core| |core| |core|     |core| |core| |core| |core|
>     |  0 | |  1 | |  2 | |  3 |     |  4 | |  5 | |  6 | |  7 |
>     +----+ +----+ +----+ +----+     +----+ +----+ +----+ +----+
>        |      |      |      |          |      |      |      |
>      +--+   +--+   +--+   +--+       +--+   +--+   +--+   +--+
>      |L2|   |L2|   |L2|   |L2|       |L2|   |L2|   |L2|   |L2|
>      +--+   +--+   +--+   +--+       +--+   +--+   +--+   +--+
>        |      |      |      |          |      |      |      |
>     +--------------------------+   +---------------------------+
>   #=|          L3 1/2          |===|           L3 2/2          |=#
>   # +--------------------------+   +---------------------------+ #
>   #                                                              #
>   #==============================================================#

Aha!  You cannot completely hide your hardware structure because your
latencies will tell on you!  ;-)

> That's not cool for their users (I imagine those running VMs with
> automatic CPU allocation...) but when trying to emphasize horrible
> patterns as you and I do, they're particularly interesting :-)

;-) ;-) ;-)

> My suspicion is that the L3 blocks in fact have 6-7 ports in total with
> some rotary arbitration, that the cores are connected to them in a
> specific order, and that the ring connection to the other half just
> uses extra ports located at certain places that favor certain cores
> more than others. Maybe it's even 7 ports (for the diagonal in 16+ core
> CPUs) with the 7th port left unused here, which could also explain the
> asymmetry between the threads. For example it could look like this for
> each block, with all 7 ports being round-robined:
> 
>                 to L3 cores 4-7
>                 east port
>                 //
>             +------+
>             |      |
>    core 0 ==|  L3  |== core 1
>             |block |
>             |cores |
>    core 2 ==| 0-3  |== core 3
>             |      |
>             +------+
>              //  \\
>          unused   to L3 cores 4-7
>                   west port
> 
> It's then easy to imagine an extra latency for cores 1 and 2 there,
> with 2 being more impacted by the two slots before it.

Indeed, beyond a certain point, it is necessary to talk to someone
who actually knows the hardware structure.

> > All points in favor of your long-term plan of reducing contention of
> > the variable being cmpxchg()ed.
> 
> Oh yes definitely! I know it's not the subject here but I'm currently
> modifying the architecture to support thread groups to partition the
> contention. It's way more complex than I imagined, and as usual the
> devil is in the details.

Understood!  And that is all too often the case.

> > > Another point I'm quite interested in is trying to group operations from
> > > close neighbors. In the example above on the Ryzen CPU, I would love to
> > > see CPUs from the same core complex try their luck together. One could
> > > think about using the CPU number to distribute accesses so that they're
> > > grouped, but it's less convenient from userland (or even from a VM). One
> > > element, however, knows better than the software: the cache controller.
> > > That's why I'm trying to keep *some* contention, so that once a core gets
> > > its access to the cache, the other ones on the same side of the L3 have a
> > > better chance of accessing it immediately after. And I think that's how
> > > I've reached an average of 45 ns per operation (stats included) in my last
> > > test across all CPUs, despite inter-CPU times ranging from 18 to 125 ns.
> > > But that is quite unstable for now.
> > 
> > And very very dependent on the exact hardware.
> 
> Yep! For the record this one is an AMD Ryzen 2700X.
> 
> > I have seen hardware where prefetching the line passed to cmpxchg
> > doubled the overhead, and some where it made no measurable difference.
> 
> I noticed exactly the same. Prefetching on the Xeon W2145 costs twice
> as much :-(

Ouch!
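
For reference, the prefetch-before-CAS variant being compared here is
something along these lines (hypothetical helper; __builtin_prefetch is
the GCC/Clang hint):

#include <stdatomic.h>

static int try_update(_Atomic unsigned long *p,
		      unsigned long oldv, unsigned long newv)
{
	/* Request the line for writing before attempting the CAS. */
	__builtin_prefetch((const void *)p, 1, 3);
	return atomic_compare_exchange_strong(p, &oldv, newv);
}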

> Another thing I discovered while preparing a machine for a test was a BIOS
> option proposing to automatically prefetch the next cache line when reading
> one. I wasn't aware that some CPUs had this option, and it scared me: here
> I am trying to align some critical structures on 64 bytes to avoid false
> sharing, while some machines might still suffer from sharing between
> adjacent lines!

I have no idea whether to suggest turning off that option on the one
hand or aligning to 128 bytes on the other...
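
If one does go the 128-byte route, a sketch could be as simple as this
(structure name made up), so that even an adjacent-line prefetch stays
inside the padded object:

struct hot_counter {
	unsigned long value;
	/*
	 * Pad and align to two cache lines so the adjacent-line
	 * prefetcher cannot drag in another thread's data.
	 */
} __attribute__((aligned(128)));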

> > > > > > All that aside, any advice on portably and usefully getting 2-3x clock
> > > > > > frequency differences into testing would be quite welcome.
> > > > > 
> > > > > I think that playing with scaling_max_freq and randomly setting it
> > > > > to either cpuinfo_min_freq or cpuinfo_max_freq could be a good start.
> > > > > However it will likely not work inside KVM, but for bare-metal tests
> > > > > it should work where cpufreq is supported.
> > > > 
> > > > OK, that explains at least some of my difficulties.  I almost always
> > > > run under qemu/KVM.
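
A sketch of the scaling_max_freq toggling suggested above might look like
this (standard cpufreq sysfs files; needs root and bare metal, and the
helper name is made up):

#include <stdio.h>
#include <stdlib.h>

/* Copy cpuinfo_min_freq or cpuinfo_max_freq into scaling_max_freq. */
static void set_max_freq(int cpu, const char *which)	/* "min" or "max" */
{
	char path[128], buf[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_%s_freq",
		 cpu, which);
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f))
		exit(1);
	fclose(f);

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq",
		 cpu);
	f = fopen(path, "w");
	if (!f)
		exit(1);
	fputs(buf, f);
	fclose(f);
}
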
> > > 
> > > But you could disturb the real machine by changing its frequency.
> > 
> > Fair point, and that should fit into the memory-latency disturbing
> > that I am currently doing.
> 
> Yes, very likely. I'm wondering if you couldn't just prefetch random
> lines from a very large memory area in the disturbing threads. You might
> manage to waste a lot of memory bandwidth and cause high latencies. Maybe
> using AVX and friends to perform large batches of writes could help
> further increase latency by filling write buffers.

If I knew what data was involved in the bug, then yes, the occasional
atomic add of zero to that data might be quite useful.

This sort of thing is what formal verification techniques are good at.
Pity about their inability to handle anything but tiny snippets of code...
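
A blind variant of the disturber suggested above might look something like
the following sketch (buffer size and PRNG choice are arbitrary): it
dirties random cache lines of a buffer much larger than the L3 to burn
memory bandwidth and keep the store buffers busy.

#include <stdint.h>
#include <stdlib.h>

#define DISTURB_SIZE	(256UL << 20)	/* 256 MiB, far beyond any L3 */

/* Endless pthread body: dirty one random cache line per iteration. */
static void *disturber(void *arg)
{
	volatile char *buf = malloc(DISTURB_SIZE);
	uint64_t seed = (uint64_t)(uintptr_t)arg | 1;

	if (!buf)
		return NULL;
	for (;;) {
		/* xorshift64: cheap enough to keep the loop memory-bound. */
		seed ^= seed << 13;
		seed ^= seed >> 7;
		seed ^= seed << 17;
		buf[(seed & (DISTURB_SIZE - 1)) & ~63UL] = (char)seed;
	}
}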

> > > I've been wondering if it's possible to change the tick rate inside KVM,
> > > or even program a dummy device to deliver very fast IRQs, but I haven't
> > > found such a thing so far. I've been thinking about NMIs as well. The
> > > idea is that there could be a higher chance of interrupting processing
> > > at any random place using fast interrupts (>100k/s) than via context
> > > switches caused by scheduler slots being depleted. Maybe one only needs
> > > to hack an existing device emulator to flood an IRQ :-/
> > 
> > Or have a bug in the timekeeping code!  ;-)
> 
> Not sure how to achieve that, but maybe :-)

Given that the 2,199.0-second stall came as quite a surprise to the
timekeeping folks, you are not the only one unsure as to how to achieve
that.  In their defense, it took me quite some time to accept what was
happening.

							Thanx, Paul
