From: "Paul E. McKenney" <paulmck@kernel.org>
To: Willy Tarreau <w@1wt.eu>
Cc: rcu@vger.kernel.org
Subject: Re: Making races happen more often
Date: Mon, 27 Sep 2021 21:01:47 -0700
Message-ID: <20210928040147.GA880162@paulmck-ThinkPad-P17-Gen-1>
In-Reply-To: <20210927064915.GA20117@1wt.eu>

On Mon, Sep 27, 2021 at 08:49:15AM +0200, Willy Tarreau wrote:
> Hi Paul,
> 
> On Sun, Sep 26, 2021 at 08:25:17PM -0700, Paul E. McKenney wrote:
> > > I wrote a program to measure inter-CPU atomic latencies and the success/failure
> > > rate of a CAS for each thread combination. Even on a
> > > Ryzen 2700X (8 cores in 2*4 blocks) I've measured up to ~4k consecutive
> > > failures. When trying to modify my code to try to enforce some fairness, I
> > > managed to go down to less than 4 failures. I was happy until I discovered
> > > that there was a bug in my program making it not do anything except check
> > > the shared variable that I'm using to enforce the fairness. So it seems
> > > that interleaving multiple independent accesses might sometimes help
> > > provide some fairness in all of this, which is what I'm going to work on
> > > today.
> > 
> > The textbook approach is to partition the algorithm, for example, with
> > a combining tree or similar.
> 
> That's what I'm working on for the long term. For the short term I'm
> trying to measure the average time spent trying and failing, and setting
> a threshold above which other threads do not start to work and let others
> finish first. This gives very good results, with peak CAS failures going
> down from ~300 to ~16 in my tests, and average time being even lower than
> before. However it's a bit addictive, and playing a bit too much with
> tuning tends to favor some archs at the expense of others. My experiments
> are here for the curious, though probably not very interesting at the moment:
> 
>    https://github.com/wtarreau/atomic-tests  

I took only a quick glance, but the results below do look interesting.
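
If I understand the scheme correctly, it is roughly of the following shape (all
names hypothetical and grossly simplified; this is not the code in the
repository above, just the general idea of "fail too long, then make newcomers
hold off"):

#include <stdatomic.h>

static _Atomic unsigned long shared_val;   /* the contended variable */
static atomic_int pressure;                /* "someone has failed too long" */

#define FAIL_THRESHOLD 64                  /* arbitrary, for illustration */

static unsigned long add_with_holdoff(unsigned long delta)
{
    unsigned long old, newval;
    int fails = 0;

    /* Newcomers hold off while another thread reports trouble, giving
     * the unlucky one a chance to finally get its CAS in. */
    while (atomic_load_explicit(&pressure, memory_order_relaxed))
        ;                                  /* ideally cpu_relax()/pause here */

    old = atomic_load_explicit(&shared_val, memory_order_relaxed);
    for (;;) {
        newval = old + delta;
        if (atomic_compare_exchange_weak_explicit(&shared_val, &old, newval,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
            break;
        /* Only write the shared "pressure" word once clearly unlucky, so
         * the common case adds no extra write traffic. */
        if (++fails == FAIL_THRESHOLD)
            atomic_store_explicit(&pressure, 1, memory_order_relaxed);
    }
    if (fails >= FAIL_THRESHOLD)
        atomic_store_explicit(&pressure, 0, memory_order_relaxed);
    return newval;
}

A real version would of course bound the initial spin and cope with several
threads being over the threshold at once.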

> > Otherwise, you are quite right, there are
> > all sorts of hardware-specific things that can happen at high levels of
> > memory contention.
> 
> There's not just memory contention; I found severe contention inside the
> L3 caches themselves. For example, AMD Ryzen CPUs have their caches arranged
> as "core complexes" of 4 cores connected to the same block of L3 cache, itself
> connected to the other "core complexes". Look at the measured inter-CPU
> latency (in nanoseconds) for a compare-and-swap between each of them:
> 
>          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>      0   -   8  24  24  25  24  26  26  69  68  68  69  69  69  69  69
>      1   8   -  24  24  24  24  26  26  69  69  69  69  69  69  69  69
>      2  24  24   -   8  23  23  24  24  69  69  68  68  69  69  68  69
>      3  24  24   8   -  23  23  24  24  69  68  68  69  69  69  68  71
>      4  25  25  23  23   -   8  25  25  69  69  69  69  70  70  69  70
>      5  25  24  23  23   8   -  25  25  69  69  70  69  70  69  70  69
>      6  26  26  24  24  25  25   -   8  69  69  69  69  69  69  69  69
>      7  26  26  24  24  25  25   8   -  68  69  69  70  70  70  70  69
>      8  69  69  68  68  69  69  69  69   -   8  24  24  25  25  26  26
>      9  69  69  68  68  69  69  69  69   8   -  24  24  25  25  26  26
>     10  69  69  69  69  69  70  69  68  24  24   -   8  23  23  24  24
>     11  69  69  69  67  68  69  68  68  24  24   8   -  23  23  24  24
>     12  69  69  69  68  70  70  69  69  25  25  23  23   -   8  25  25
>     13  69  69  69  68  70  70  69  69  25  24  23  23   8   -  25  25
>     14  69  69  68  69  70  69  69  69  26  26  24  24  25  25   -   8
>     15  69  69  68  68  70  70  69  69  26  26  24  24  25  25   8   -

Nice results!

I have been following the hardware-architecture approach, so that
the "memory" in "memory contention" includes the caches.

> And if you perform a load before the CAS (like for example when computing
> a max or updating a pointer), it gets better within the same core complex
> and much worse across complexes:
> 
>          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>      0   -  18  21  21  23  24  22  22 124 124 123 122 125 125 124 124
>      1  18   -  22  22  24  24  22  22 124 124 122 122 124 124 124 123
>      2  21  22   -  18  23  23  21  56 122 122 122 122 125 125 122 124
>      3  21  21  18   -  23  23  21  20 123 123 122 122 124 124 122 122
>      4  24  23  24  23   -  18  24  23 124 124 123 124 126 126 124 124
>      5  24  24  23  23  18   -  23  23 124 124 124 123 126 126 125 126
>      6  22  22  21  20  23  23   -  18 124 124 124 123 125 124 124 124
>      7  22  22  20  21  23  24  18   - 125 124 122 122 126 124 125 125
>      8 123 123 122 122 124 124 122 122   -  18  20  20  24  23  21  21
>      9 124 124 123 123 125 125 124 124  18   -  20  21  24  24  58  21
>     10 123 123 122 122 124 122 122 122  21  21   -  18  22  23  20  20
>     11 123 122 122 122 123 124 124 123  20  20  18   -  23  23  23  20
>     12 125 124 124 123 126 126 125 124  23  23  23  23   -  18  24  23
>     13 126 125 124 123 126 126 124 124  23  23  24  22  18   -  23  23
>     14 124 124 123 122 124 124 124 124  20  20  20  21  23  25   -  18
>     15 124 124 123 123 125 124 124 124  22  21  20  56  24  24  18   -
> 
> See? There's basically a factor of 6 in latency depending on which cores
> are supposed to be connected to the same L3. And looking closer you can
> even observe that within the same core complex, things are not uniform at
> all either:
>   - core 1 always experiences fastest communication with other cores
>   - core 2 always experiences slowest communication with other cores
>   - discrepancies between pairs vary by up to 50%

So what rcutorture would want to do here is to assign the first four-CPU
guest OS to CPUs 0, 1, 8, and 9, right?  It would actually do that if
the /sys files indicate that CPUs 0-7 are on one node and CPUs 8-15
on another.
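
For what it's worth, the L3 grouping itself is visible from user space on
machines that expose cache topology, so no guessing is needed. Something like
this (sketch only; a robust version would check each index's "level" file
rather than assuming index3 is the L3):

#include <stdio.h>

int main(void)
{
    char path[128], line[256];

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                       /* no such CPU (or no L3 info) */
        if (fgets(line, sizeof(line), f))
            printf("cpu%d shares L3 with: %s", cpu, line);
        fclose(f);
    }
    return 0;
}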

> However when it comes to the success/failure of a CAS, it's the opposite:
>   - after a load, core 2 almost always succeeds a CAS (~50 tries)
>   - after a load, core 1 almost always fails a CAS (~500 tries)
>   - after a load, cores 0 and 3 are average (~100 tries)
> 
> This indicates to me that there's probably some arbitration cost to the
> cores that access the cache, and that cache lines assigned to one core
> or the other tend to take more time for some cores than others to be
> reclaimed, and that cores experiencing the highest latencies are also
> the ones keeping the lines longer and observing the highest success
> rate on CAS.

Cute!

> In your case it could mean that certain races will never happen on certain
> cores but could on other ones, or at least above certain frequencies,
> because if the cache line sticks long enough in an L2 to perform a series
> of operations undisturbed, whatever you'll do on other cores might have no
> effect.

In the cases exposed by my differential-latency testing, the issue was
pretty clearly the added delay.  This is because the issue involved
normal memory references for the RCU Tasks Trace bugs, read-side access
to seqlock-protected memory for the 2,199.0-second RCU CPU stall warning
(normal memory accesses plus memory barriers), and a lockdep false
positive in the Tiny SRCU case (though in lockdep's defense, this was
at best an accident waiting to happen).
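
For anyone unfamiliar with the pattern, the read side in question is
essentially the usual seqlock retry loop, which is why it amounts to normal
memory accesses plus memory barriers. A user-space analog, purely for
illustration (not the kernel's read_seqbegin()/read_seqretry() code itself):

#include <stdatomic.h>

struct seq_demo {
    atomic_uint seq;          /* odd while a writer is updating the data */
    atomic_int a, b;          /* the protected data */
};

static void read_pair(struct seq_demo *s, int *a, int *b)
{
    unsigned int start;

    do {
        start = atomic_load_explicit(&s->seq, memory_order_acquire);
        *a = atomic_load_explicit(&s->a, memory_order_relaxed);
        *b = atomic_load_explicit(&s->b, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        /* Retry if a writer was active or snuck in during the reads. */
    } while ((start & 1) ||
             atomic_load_explicit(&s->seq, memory_order_relaxed) != start);
}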

But I completely agree that cmpxchg operations on highly contended
memory locations can expose all sorts of other issues.

> I suspect that working at much lower frequencies could at some point allow
> a cache line to be released between two operations. I don't seem to be observing
> this on the AMD for now though, as the differences increase when lowering
> the frequency from 4.0 to 2.2G for a single CAS:
> 
>          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>      0   -  19  55  55  48  48  62  62 103 103 105 105 103 103 103 103
>      1  19   -  55  55  48  48  62  62 103 103 105 105 103 103 103 103
>      2  55  55   -  19  40  40  55  55 104 104 106 106 107 107 104 103
>      3  55  55  19   -  40  40  55  55 104 104 105 106 107 107 104 104
>      4  48  48  40  40   -  19  47  47 109 109 107 106 105 105 112 111
>      5  48  48  40  40  19   -  47  47 109 109 107 106 105 105 111 111
>      6  62  62  55  55  47  47   -  19 103 103 105 105 103 103 103 103
>      7  62  62  55  55  47  47  19   - 103 103 105 105 103 103 103 103
>      8 103 103 104 104 110 109 103 103   -  19  55  55  48  48  62  62
>      9 103 103 104 104 110 110 103 103  19   -  55  55  48  48  62  62
>     10 105 105 105 106 106 106 105 105  55  55   -  19  40  40  55  55
>     11 105 105 105 106 107 107 105 105  55  55  19   -  40  40  55  55
>     12 103 103 108 107 105 105 103 103  48  48  40  40   -  19  47  47
>     13 103 103 107 107 105 105 103 103  48  48  40  40  19   -  47  47
>     14 103 103 105 104 110 111 103 103  62  62  55  55  47  47   -  19
>     15 103 103 104 104 111 111 103 103  62  62  55  55  47  47  19   -
> 
> Where the same core complex would see variations from 22 to 26ns, now we're
> seeing 40 to 62!

Huh.  Are the fast CPUs seeing higher latencies due to delays from the
slower CPUs?  Or is the clock-frequency difference causing increased
delays in and of itself?  (Crossing clock-frequency domains can be much
harder than it might seem to us software folks.)

So adjacent pairs of CPUs are electrically close as well?  I am looking
at the 18s and 19s along the diagonal above.  There appears to be
additional structure above the CPU-pair level within an 8-CPU group.
A linear arrangement of four pairs of CPUs?

> > Other than that, there are a lot of heuristics involving per-node counters
> > so that if a given node has had more than N consecutive shots and some
> > other node wants in, it holds off until the other node gets in. Complex
> > and tricky, so partitioning is better where it can be done.
> 
> I tried a bit with this but it didn't yield very interesting results
> for now; I think the reason is that I'm focusing on max latency, and
> with this, the max latency experienced by any CPU becomes the number of
> CPUs times the max granted to a single CPU. Playing with roughly averaged
> wait times instead tends to favor the threads that have waited longest. The only
> thing is that I try to avoid updating this when the wait time is low
> enough, in order to avoid writes to a shared area. One of the side effects
> of writing to such a shared area is that some operations may be delayed
> due to write ordering. I.e. writing that you're waiting for the access,
> then performing a CAS will first require that the line containing that
> "waiting" flag be flushed before trying the CAS, which may increase the
> risk that another CPU in the mean time updates it and causes the CAS to
> fail. And if the "waiting" flag is shared with the area being touched by
> the CAS, then touching it to say "I'm here" is enough to slow down the
> other ones. Thus I prefer to watch a shared entry that is not supposed to
> be updated often, except when CPUs are fed up with waiting. Under normal
> operation it should perform better.

All points in favor of your long-term plan of reducing contention of
the variable being cmpxchg()ed.
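
For completeness, the simplest form of that partitioning is the classic split
counter, sketched below with hypothetical names: updates only touch a per-node
slot (the node index could come from the L3 grouping discussed earlier), and
only the rarer readers sum across nodes. This of course only works when the
operation can be split that way; pointer updates or maxima need the harder
combining-tree style approach instead.

#include <stdatomic.h>

#define NR_NODES  4                     /* e.g. one slot per core complex */
#define CACHELINE 64

/* One counter per "node", each on its own cache line to avoid false sharing. */
struct node_counter {
    _Alignas(CACHELINE) atomic_long val;
};

static struct node_counter counters[NR_NODES];

/* Updates only touch the local node's cache line... */
static void counter_add(int node, long delta)
{
    atomic_fetch_add_explicit(&counters[node].val, delta,
                              memory_order_relaxed);
}

/* ...and the (rarer, possibly slightly stale) read sums all of them. */
static long counter_read(void)
{
    long sum = 0;

    for (int i = 0; i < NR_NODES; i++)
        sum += atomic_load_explicit(&counters[i].val, memory_order_relaxed);
    return sum;
}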

> Another point I'm quite interested in is trying to group operations from
> close neighbors. In the example above on the Ryzen CPU, I would love to
> see CPUs from the same core complex try their luck together. One could
> think about using a CPU number to distribute accesses so that they're
> grouped, but it's less convenient from userland (or even from a VM). One
> element, however, knows better than the software: the cache controller.
> That's why I'm trying to keep *some* contention, so that once a core gets
> its access to the cache, the other ones on the same side of the L3 have a
> better chance to access it immediately after. And I think that's how I've
> reached an average of 45ns per operation (stats included) in my last test
> across all CPUs despite inter-CPU times ranging from 18 to 125ns. But that
> is quite unstable for now.

And very very dependent on the exact hardware.  I have seen hardware
where prefetching the line passed to cmpxchg doubled the overhead,
and some where it made no measurable difference.
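
The variant I have in mind is simply something like the sketch below; whether
the write-intent prefetch hides the coherence miss or just adds an extra
ownership transfer is exactly what differs from one part to another:

#include <stdatomic.h>

static int cas_with_prefetch(_Atomic unsigned long *p,
                             unsigned long oldv, unsigned long newv)
{
    /* Hint that we intend to write the line, then attempt the CAS. */
    __builtin_prefetch((const void *)p, 1 /* write */, 3 /* high locality */);
    return atomic_compare_exchange_strong_explicit(p, &oldv, newv,
                                                   memory_order_acq_rel,
                                                   memory_order_relaxed);
}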

> > > > All that aside, any advice on portably and usefully getting 2-3x clock
> > > > frequency differences into testing would be quite welcome.
> > > 
> > > I think that playing with scaling_max_freq and randomly setting it
> > > to either cpuinfo_min_freq or cpuinfo_max_freq could be a good start.
> > > However it will likely not work inside KVM, but for bare-metal tests
> > > it should work where cpufreq is supported.
> > 
> > OK, that explains at least some of my difficulties.  I almost always
> > run under qemu/KVM.
> 
> But you could disturb the real machine by changing its frequency.

Fair point, and that should fit into the memory-latency disturbing
that I am currently doing.
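
Something along these lines, for example, using the documented cpufreq sysfs
files (sketch only: needs root and bare metal, and scaling_max_freq should be
restored afterwards):

#include <stdio.h>
#include <stdlib.h>

/* Randomly clamp one CPU's max frequency to either its hardware minimum or
 * maximum, to create persistent cross-CPU speed differences. */
static void set_random_max_freq(int cpu)
{
    char path[128], freq[64];
    const char *which = (rand() & 1) ? "cpuinfo_min_freq" : "cpuinfo_max_freq";
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/%s", cpu, which);
    f = fopen(path, "r");
    if (!f)
        return;
    if (!fgets(freq, sizeof(freq), f)) {
        fclose(f);
        return;
    }
    fclose(f);

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);
    f = fopen(path, "w");
    if (f) {
        fputs(freq, f);
        fclose(f);
    }
}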

> I've been wondering if it's possible to change the tick rate inside KVM,
> or even program a dummy device to deliver very fast IRQs but I haven't
> found such a thing for now. I've been thinking about NMIs as well. The
> idea would be that there could be a higher chance of interrupting
> processing at any random place using fast interrupts (>100k/s) than by
> using context switching caused by scheduler slots being depleted. Maybe
> one only needs to hack an existing device emulator to flood an IRQ :-/

Or have a bug in the timekeeping code!  ;-)

							Thanx, Paul
