* Making races happen more often
@ 2021-09-26  3:51 Paul E. McKenney
  2021-09-26  4:41 ` Willy Tarreau
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2021-09-26  3:51 UTC (permalink / raw)
  To: w; +Cc: rcu

Hello, Willy!

Continuing from linkedin:

> Maybe this doesn't work as well as expected because of the common L3 cache
> that runs at a single frequency and that imposes discrete timings. Also,
> I noticed that on modern CPUs, cache lines tend to "stick" at least a few
> cycles once they're in a cache, which helps the corresponding CPU chain
> a few atomic ops undisturbed. For example on an 8-core Ryzen I'm seeing a
> minimum of 8ns between two threads of the same core (L1 probably split in
> two halves), 25ns between two L2 and 60ns between the two halves (CCX)
> of the L3. This certainly makes it much harder to trigger concurrency
> issues. Well let's continue by e-mail, it's a real pain to type in this
> awful interface.

Indeed, I get best (worst?) results from memory latency on multi-socket
systems.  And these results were not subtle:

	https://paulmck.livejournal.com/62071.html

All that aside, any advice on portably and usefully getting 2-3x clock
frequency differences into testing would be quite welcome.

							Thanx, Paul


* Re: Making races happen more often
  2021-09-26  3:51 Making races happen more often Paul E. McKenney
@ 2021-09-26  4:41 ` Willy Tarreau
  2021-09-27  3:25   ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Willy Tarreau @ 2021-09-26  4:41 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

Hi Paul,

On Sat, Sep 25, 2021 at 08:51:03PM -0700, Paul E. McKenney wrote:
> Hello, Willy!
> 
> Continuing from linkedin:
> 
> > Maybe this doesn't work as well as expected because of the common L3 cache
> > that runs at a single frequency and that imposes discrete timings. Also,
> > I noticed that on modern CPUs, cache lines tend to "stick" at least a few
> > cycles once they're in a cache, which helps the corresponding CPU chain
> > a few atomic ops undisturbed. For example on an 8-core Ryzen I'm seeing a
> > minimum of 8ns between two threads of the same core (L1 probably split in
> > two halves), 25ns between two L2 and 60ns between the two halves (CCX)
> > of the L3. This certainly makes it much harder to trigger concurrency
> > issues. Well let's continue by e-mail, it's a real pain to type in this
> > awful interface.
> 
> Indeed, I get best (worst?) results from memory latency on multi-socket
> systems.  And these results were not subtle:
> 
> 	https://paulmck.livejournal.com/62071.html

Interesting. I've been working for a few months on trying to efficiently
break compare-and-swap loops that tend to favor local nodes and to stall
other ones. A bit of background first: in haproxy we have a 2-second
watchdog that kills the process if a thread is stuck that long (that's
an eternity on an event-driven program). We've recently got reports of
the watchdog triggering on EPYC systems with blocks of 16 consecutive
threads not being able to make any progress. It turns out that such 16
consecutive threads seem to always be in the same core-complex, i.e.
same L3, so it seems that sometimes it's hard to reclaim a line that's
stressed by other nodes.

I wrote a program to measure inter-CPU atomic latencies and success/failure
rates to perform a CAS depending on each thread combination. Even on a
Ryzen 2700X (8 cores in 2*4 blocks) I've measured up to ~4k consecutive
failures. When trying to modify my code to try to enforce some fairness, I
managed to go down to less than 4 failures. I was happy until I discovered
that there was a bug in my program making it not do anything except check
the shared variable that I'm using to enforce the fairness. So it seems
that interleaving multiple independent accesses might sometimes help
provide some fairness in all of this, which is what I'm going to work on
today.
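
For illustration only (this is not the actual atomic-tests code, and the
CPU numbers, iteration count and build line are arbitrary), a minimal
sketch of that kind of measurement could look like the following: two
threads are pinned to chosen CPUs and ping-pong a CAS over a shared
counter, each thread only replacing values of its own parity so that the
two are forced to take turns, while recording the worst run of
consecutive failed attempts:

   /* cas_pingpong.c - build with: gcc -O2 -pthread cas_pingpong.c */
   #define _GNU_SOURCE
   #include <pthread.h>
   #include <sched.h>
   #include <stdatomic.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <time.h>

   #define ITERATIONS 1000000UL

   static _Atomic unsigned long counter;

   struct arg { int cpu; unsigned long parity, max_fail; };

   static void *worker(void *p)
   {
       struct arg *a = p;
       unsigned long fails = 0;
       cpu_set_t set;

       /* pin this thread to the requested CPU */
       CPU_ZERO(&set);
       CPU_SET(a->cpu, &set);
       pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

       /* one thread turns even values odd, the other turns odd values even */
       for (unsigned long i = 0; i < ITERATIONS; ) {
           unsigned long old = atomic_load(&counter);

           if ((old & 1) != a->parity ||
               !atomic_compare_exchange_weak(&counter, &old, old + 1)) {
               if (++fails > a->max_fail)
                   a->max_fail = fails;
               continue;
           }
           fails = 0;
           i++;
       }
       return NULL;
   }

   int main(int argc, char **argv)
   {
       /* the two CPUs to compare are taken from the command line */
       struct arg a0 = { argc > 1 ? atoi(argv[1]) : 0, 0, 0 };
       struct arg a1 = { argc > 2 ? atoi(argv[2]) : 1, 1, 0 };
       struct timespec t0, t1;
       pthread_t th0, th1;

       clock_gettime(CLOCK_MONOTONIC, &t0);
       pthread_create(&th0, NULL, worker, &a0);
       pthread_create(&th1, NULL, worker, &a1);
       pthread_join(th0, NULL);
       pthread_join(th1, NULL);
       clock_gettime(CLOCK_MONOTONIC, &t1);

       printf("%.1f ns/op, worst failure runs: cpu%d=%lu cpu%d=%lu\n",
              ((t1.tv_sec - t0.tv_sec) * 1e9 + t1.tv_nsec - t0.tv_nsec) /
              (2.0 * ITERATIONS),
              a0.cpu, a0.max_fail, a1.cpu, a1.max_fail);
       return 0;
   }

The pinning and the forced alternation are what make the per-CPU-pair
numbers comparable; a real tool would of course also vary the operation
mix and the number of threads.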

> All that aside, any advice on portably and usefully getting 2-3x clock
> frequency differences into testing would be quite welcome.

I think that playing with scaling_max_freq and randomly setting it
to either cpuinfo_min_freq or cpuinfo_max_freq could be a good start.
However it will likely not work inside KVM, but for bare-metal tests
it should work where cpufreq is supported.

Something like the untested script below could be a good start, and maybe it
could even be periodically changed while the test is running:

   for p in /sys/devices/system/cpu/cpufreq/policy*; do
      min=$(cat $p/cpuinfo_min_freq)
      max=$(cat $p/cpuinfo_max_freq)
      if ((RANDOM & 1)); then
        echo $max > $p/scaling_max_freq
      else
        echo $min > $p/scaling_max_freq
      fi
   done

Then you can undo it like this:

   for p in /sys/devices/system/cpu/cpufreq/policy*; do
      cat $p/cpuinfo_max_freq > $p/scaling_max_freq
   done

On my i7-8650U laptop, the min/max freqs are 400/4200 MHz. On the previous
one (i5-3320M) it's 1200/3300. On a Ryzen 2700X it's 2200/4000. I'm seeing
800/5000 on a core i9-9900K. So I guess that it could be a good start. If
you need the ratios to be kept tighter, we could probably improve the
script above to try intermediate values.

Willy


* Re: Making races happen more often
  2021-09-26  4:41 ` Willy Tarreau
@ 2021-09-27  3:25   ` Paul E. McKenney
  2021-09-27  6:49     ` Willy Tarreau
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2021-09-27  3:25 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: rcu

On Sun, Sep 26, 2021 at 06:41:27AM +0200, Willy Tarreau wrote:
> Hi Paul,
> 
> On Sat, Sep 25, 2021 at 08:51:03PM -0700, Paul E. McKenney wrote:
> > Hello, Willy!
> > 
> > Continuing from linkedin:
> > 
> > > Maybe this doesn't work as well as expected because of the common L3 cache
> > > that runs at a single frequency and that imposes discrete timings. Also,
> > > I noticed that on modern CPUs, cache lines tend to "stick" at least a few
> > > cycles once they're in a cache, which helps the corresponding CPU chain
> > > a few atomic ops undisturbed. For example on an 8-core Ryzen I'm seeing a
> > > minimum of 8ns between two threads of the same core (L1 probably split in
> > > two halves), 25ns between two L2 and 60ns between the two halves (CCX)
> > > of the L3. This certainly makes it much harder to trigger concurrency
> > > issues. Well let's continue by e-mail, it's a real pain to type in this
> > > awful interface.
> > 
> > Indeed, I get best (worst?) results from memory latency on multi-socket
> > systems.  And these results were not subtle:
> > 
> > 	https://paulmck.livejournal.com/62071.html
> 
> Interesting. I've been working for a few months on trying to efficiently
> break compare-and-swap loops that tend to favor local nodes and to stall
> other ones. A bit of background first: in haproxy we have a 2-second
> watchdog that kills the process if a thread is stuck that long (that's
> an eternity on an event-driven program). We've recently got reports of
> the watchdog triggering on EPYC systems with blocks of 16 consecutive
> threads not being able to make any progress. It turns out that such 16
> consecutive threads seem to always be in the same core-complex, i.e.
> same L3, so it seems that sometimes it's hard to reclaim a line that's
> stressed by other nodes.
> 
> I wrote a program to measure inter-CPU atomic latencies and success/failure
> rates to perform a CAS depending on each thread combination. Even on a
> Ryzen 2700X (8 cores in 2*4 blocks) I've measured up to ~4k consecutive
> failures. When trying to modify my code to try to enforce some fairness, I
> managed to go down to less than 4 failures. I was happy until I discovered
> that there was a bug in my program making it not do anything except check
> the shared variable that I'm using to enforce the fairness. So it seems
> that interleaving multiple independent accesses might sometimes help
> provide some fairness in all of this, which is what I'm going to work on
> today.

The textbook approach is to partition the algorithm, for example, with
a combining tree or similar.  Otherwise, you are quite right, there are
all sorts of hardware-specific things that can happen at high levels of
memory contention.

Other than that, there are a lot of heuristics involving per-node counters
so that if a given node has had more than N consecutive shots and some
other node wants in, it holds off until the other node gets in.  Complex
and tricky, so partitioning is better where it can be done.
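
Purely as an illustration of that kind of heuristic (the names, the
two-node assumption, and the policy constants below are invented, not
taken from any existing kernel code), a throttled CAS-increment might be
sketched along these lines:

   #include <stdatomic.h>

   #define MAX_STREAK 8                  /* shots a node may take in a row */
   #define NR_NODES   2

   static _Atomic unsigned long shared_val;
   static _Atomic int last_node = -1;    /* node that won the last CAS */
   static _Atomic int streak;            /* consecutive wins by last_node */
   static _Atomic int waiters[NR_NODES]; /* CPUs of each node wanting in */

   static unsigned long fair_inc(int node)
   {
       unsigned long old;

       atomic_fetch_add(&waiters[node], 1);
       for (;;) {
           /* our node had its N shots and the other node wants in: back off */
           if (atomic_load(&waiters[!node]) &&
               atomic_load(&last_node) == node &&
               atomic_load(&streak) >= MAX_STREAK) {
               while (atomic_load(&last_node) == node)
                   ;                     /* spin until the other node got in */
           }
           old = atomic_load(&shared_val);
           if (atomic_compare_exchange_weak(&shared_val, &old, old + 1))
               break;
       }
       /* bookkeeping is racy on purpose: a rough hint is good enough here */
       if (atomic_load(&last_node) == node) {
           atomic_fetch_add(&streak, 1);
       } else {
           atomic_store(&streak, 1);
           atomic_store(&last_node, node);
       }
       atomic_fetch_sub(&waiters[node], 1);
       return old + 1;
   }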

> > All that aside, any advice on portably and usefully getting 2-3x clock
> > frequency differences into testing would be quite welcome.
> 
> I think that playing with scaling_max_freq and randomly setting it
> to either cpuinfo_min_freq or cpuinfo_max_freq could be a good start.
> However it will likely not work inside KVM, but for bare-metal tests
> it should work where cpufreq is supported.

OK, that explains at least some of my difficulties.  I almost always
run under qemu/KVM.

> Something like the untested script below could be a good start, and maybe it
> could even be periodically changed while the test is running:
> 
>    for p in /sys/devices/system/cpu/cpufreq/policy*; do
>       min=$(cat $p/cpuinfo_min_freq)
>       max=$(cat $p/cpuinfo_max_freq)
>       if ((RANDOM & 1)); then
>         echo $max > $p/scaling_max_freq
>       else
>         echo $min > $p/scaling_max_freq
>       fi
>    done
> 
> Then you can undo it like this:
> 
>    for p in /sys/devices/system/cpu/cpufreq/policy*; do
>       cat $p/cpuinfo_max_freq > $p/scaling_max_freq
>    done
> 
> On my i7-8650U laptop, the min/max freqs are 400/4200 MHz. On the previous
> one (i5-3320M) it's 1200/3300. On a Ryzen 2700X it's 2200/4000. I'm seeing
> 800/5000 on a core i9-9900K. So I guess that it could be a good start. If
> you need the ratios to be kept tighter, we could probably improve the
> script above to try intermediate values.

But yes, on bare-metal setups, those ratios should be able to do some
useful damage.  ;-)

Thank you!

							Thanx, Paul


* Re: Making races happen more often
  2021-09-27  3:25   ` Paul E. McKenney
@ 2021-09-27  6:49     ` Willy Tarreau
  2021-09-28  4:01       ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Willy Tarreau @ 2021-09-27  6:49 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

Hi Paul,

On Sun, Sep 26, 2021 at 08:25:17PM -0700, Paul E. McKenney wrote:
> > I wrote a program to measure inter-CPU atomic latencies and success/failure
> > rates to perform a CAS depending on each thread combination. Even on a
> > Ryzen 2700X (8 cores in 2*4 blocks) I've measured up to ~4k consecutive
> > failures. When trying to modify my code to try to enforce some fairness, I
> > managed to go down to less than 4 failures. I was happy until I discovered
> > that there was a bug in my program making it not do anything except check
> > the shared variable that I'm using to enforce the fairness. So it seems
> > that interleaving multiple independent accesses might sometimes help
> > provide some fairness in all of this, which is what I'm going to work on
> > today.
> 
> The textbook approach is to partition the algorithm, for example, with
> a combining tree or similar.

That's what I'm working on for the long term. For the short term I'm
trying to measure the average time spent trying and failing, and setting
a threshold above which other threads do not start to work and let others
finish first. This gives very good results, with peak CAS failures going
down from ~300 to ~16 in my tests, and average time being even lower than
before. However it's a bit addictive, and playing a bit too much with
tuning tends to favor some archs at the expense of others. My experiments
are here for the curious, though probably not very interesting at the moment:

   https://github.com/wtarreau/atomic-tests  
    
> Otherwise, you are quite right, there are
> all sorts of hardware-specific things that can happen at high levels of
> memory contention.

There's not just memory contention, I found some severe contention inside
L3 caches themselves. For example, AMD Ryzen CPUs have their caches arranged
as "core complexes" of 4 cores connected to a same block of L3 cache, itself
connected to the other "core complexes". Look at the measured inter-CPU
latency (in nanoseconds) for a compare-and-swap between each of them:

         0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
     0   -   8  24  24  25  24  26  26  69  68  68  69  69  69  69  69
     1   8   -  24  24  24  24  26  26  69  69  69  69  69  69  69  69
     2  24  24   -   8  23  23  24  24  69  69  68  68  69  69  68  69
     3  24  24   8   -  23  23  24  24  69  68  68  69  69  69  68  71
     4  25  25  23  23   -   8  25  25  69  69  69  69  70  70  69  70
     5  25  24  23  23   8   -  25  25  69  69  70  69  70  69  70  69
     6  26  26  24  24  25  25   -   8  69  69  69  69  69  69  69  69
     7  26  26  24  24  25  25   8   -  68  69  69  70  70  70  70  69
     8  69  69  68  68  69  69  69  69   -   8  24  24  25  25  26  26
     9  69  69  68  68  69  69  69  69   8   -  24  24  25  25  26  26
    10  69  69  69  69  69  70  69  68  24  24   -   8  23  23  24  24
    11  69  69  69  67  68  69  68  68  24  24   8   -  23  23  24  24
    12  69  69  69  68  70  70  69  69  25  25  23  23   -   8  25  25
    13  69  69  69  68  70  70  69  69  25  24  23  23   8   -  25  25
    14  69  69  68  69  70  69  69  69  26  26  24  24  25  25   -   8
    15  69  69  68  68  70  70  69  69  26  26  24  24  25  25   8   -

And if you perform a load before the CAS (like for example when computing
a max or updating a pointer), it gets better for same-core-complex and
much worse for others:

         0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
     0   -  18  21  21  23  24  22  22 124 124 123 122 125 125 124 124
     1  18   -  22  22  24  24  22  22 124 124 122 122 124 124 124 123
     2  21  22   -  18  23  23  21  56 122 122 122 122 125 125 122 124
     3  21  21  18   -  23  23  21  20 123 123 122 122 124 124 122 122
     4  24  23  24  23   -  18  24  23 124 124 123 124 126 126 124 124
     5  24  24  23  23  18   -  23  23 124 124 124 123 126 126 125 126
     6  22  22  21  20  23  23   -  18 124 124 124 123 125 124 124 124
     7  22  22  20  21  23  24  18   - 125 124 122 122 126 124 125 125
     8 123 123 122 122 124 124 122 122   -  18  20  20  24  23  21  21
     9 124 124 123 123 125 125 124 124  18   -  20  21  24  24  58  21
    10 123 123 122 122 124 122 122 122  21  21   -  18  22  23  20  20
    11 123 122 122 122 123 124 124 123  20  20  18   -  23  23  23  20
    12 125 124 124 123 126 126 125 124  23  23  23  23   -  18  24  23
    13 126 125 124 123 126 126 124 124  23  23  24  22  18   -  23  23
    14 124 124 123 122 124 124 124 124  20  20  20  21  23  25   -  18
    15 124 124 123 123 125 124 124 124  22  21  20  56  24  24  18   -

See ? There's basically a factor of 6 in latency depending on the cores
that are supposed to be connected to the same L3. And looking closer you
even observe that within a same core-complex, things are not uniform at
all either:
  - core 1 always experiences fastest communication with other cores
  - core 2 always experiences slowest communication with other cores
  - discrepancies between pairs vary by up to 50%

However when it comes to success/failures of a CAS, it's the opposite:
  - after a load, core 2 almost always succeeds a CAS (~50 tries)
  - after a load, core 1 almost always fails a CAS (~500 tries)
  - after a load, cores 0 and 3 are average (~100 tries)

This indicates to me that there's probably some arbitration cost to the
cores that access the cache, and that cache lines assigned to one core
or the other tend to take more time for some cores than others to be
reclaimed, and that cores experiencing the highest latencies are also
the ones keeping the lines longer and observing the highest success
rate on CAS.

In your case it could mean that certain races will never happen on certain
cores but could on others ones, or at least above certain frequencies,
because if the cache line sticks long enough in an L2 to perform a series
of operations undisturbed, whatever you'll do on other cores might have no
effect.

I suspect that working at much lower frequencies could at some point allow
a cache line to be released between two operations. I don't seem to be observing
this on the AMD for now though, as the differences increase when lowering
the frequency from 4.0 to 2.2G for a single CAS:

         0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
     0   -  19  55  55  48  48  62  62 103 103 105 105 103 103 103 103
     1  19   -  55  55  48  48  62  62 103 103 105 105 103 103 103 103
     2  55  55   -  19  40  40  55  55 104 104 106 106 107 107 104 103
     3  55  55  19   -  40  40  55  55 104 104 105 106 107 107 104 104
     4  48  48  40  40   -  19  47  47 109 109 107 106 105 105 112 111
     5  48  48  40  40  19   -  47  47 109 109 107 106 105 105 111 111
     6  62  62  55  55  47  47   -  19 103 103 105 105 103 103 103 103
     7  62  62  55  55  47  47  19   - 103 103 105 105 103 103 103 103
     8 103 103 104 104 110 109 103 103   -  19  55  55  48  48  62  62
     9 103 103 104 104 110 110 103 103  19   -  55  55  48  48  62  62
    10 105 105 105 106 106 106 105 105  55  55   -  19  40  40  55  55
    11 105 105 105 106 107 107 105 105  55  55  19   -  40  40  55  55
    12 103 103 108 107 105 105 103 103  48  48  40  40   -  19  47  47
    13 103 103 107 107 105 105 103 103  48  48  40  40  19   -  47  47
    14 103 103 105 104 110 111 103 103  62  62  55  55  47  47   -  19
    15 103 103 104 104 111 111 103 103  62  62  55  55  47  47  19   -

Where a same core-complex would see variations from 22 to 26ns now we're
seeing 40 to 62!

> Other than that, there are a lot of heuristics involving per-node counters
> so that if a given node has had more than N consecutive shots and some
> other node wants in, it holds off until the other node gets in. Complex
> and tricky, so partitioning is better where it can be done.

I tried a bit with this but it didn't yield very interesting results
so far. I think the reason is that I'm focusing on max latency and that
with this, the max latency experienced by any CPU becomes the number of
CPUs times the max granted to a single CPU. Playing with roughly averaged
wait times instead tends to favor threads waiting too long first. The only
thing is that I try to avoid updating this when the wait time is low
enough, in order to avoid writes to a shared area. One of the side effects
of writing to such a shared area is that some operations may be delayed
due to write ordering. I.e. writing that you're waiting for the access,
then performing a CAS will first require that the line containing that
"waiting" flag be flushed before trying the CAS, which may increase the
risk that another CPU in the mean time updates it and causes the CAS to
fail. And if the "waiting" flag is shared with the area being touched by
the CAS, then touching it to say "I'm here" is enough to slow down the
other ones. Thus I prefer to watch a shared entry that is not supposed to
be updated often, except when CPUs are fed up with waiting. Under normal
operation it should perform better.

Another point I'm quite interested in is trying to group operations from
close neighbors. In the example above on the Ryzen CPU, I would love to
see CPUs from the same core complex try their luck together. One could
think about using a CPU number to distribute accesses so that they're
grouped, but it's less convenient from userland (or even from a VM). One
element however knows better than the software, it's the cache controller.
That's why I'm trying to keep *some* contention, so that once a core gets
its access to the cache, the other ones on the same side of the L3 have a
better chance to access it immediately after. And I think that's how I've
reached an average of 45ns per operation (stats included) in my last test
across all CPUs despite inter-CPU times ranging from 18 to 125ns. But that
is quite unstable for now.

> > > All that aside, any advice on portably and usefully getting 2-3x clock
> > > frequency differences into testing would be quite welcome.
> > 
> > I think that playing with scaling_max_freq and randomly setting it
> > to either cpuinfo_min_freq or cpuinfo_max_freq could be a good start.
> > However it will likely not work inside KVM, but for bare-metal tests
> > it should work where cpufreq is supported.
> 
> OK, that explains at least some of my difficulties.  I almost always
> run under qemu/KVM.

But you could disturb the real machine by changing its frequency.

I've been wondering if it's possible to change the tick rate inside KVM,
or even program a dummy device to deliver very fast IRQs but I haven't
found such a thing for now. I've been thinking about NMIs as well. The
idea would be that there could be a higher chance of interrupting
processing at any random place using fast interrupts (>100k/s) than by
using context switching caused by scheduler slots being depleted. Maybe
one only needs to hack an existing device emulator to flood an IRQ :-/

Cheers,
Willy


* Re: Making races happen more often
  2021-09-27  6:49     ` Willy Tarreau
@ 2021-09-28  4:01       ` Paul E. McKenney
  2021-09-29 19:59         ` Willy Tarreau
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2021-09-28  4:01 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: rcu

On Mon, Sep 27, 2021 at 08:49:15AM +0200, Willy Tarreau wrote:
> Hi Paul,
> 
> On Sun, Sep 26, 2021 at 08:25:17PM -0700, Paul E. McKenney wrote:
> > > I wrote a program to measure inter-CPU atomic latencies and success/failure
> > > rates to perform a CAS depending on each thread combination. Even on a
> > > Ryzen 2700X (8 cores in 2*4 blocks) I've measured up to ~4k consecutive
> > > failures. When trying to modify my code to try to enforce some fairness, I
> > > managed to go down to less than 4 failures. I was happy until I discovered
> > > that there was a bug in my program making it not do anything except check
> > > the shared variable that I'm using to enforce the fairness. So it seems
> > > that interleaving multiple independent accesses might sometimes help
> > > provide some fairness in all of this, which is what I'm going to work on
> > > today.
> > 
> > The textbook approach is to partition the algorithm, for example, with
> > a combining tree or similar.
> 
> That's what I'm working on for the long term. For the short term I'm
> trying to measure the average time spent trying and failing, and setting
> a threshold above which other threads do not start to work and let others
> finish first. This gives very good results, with peak CAS failures going
> down from ~300 to ~16 in my tests, and average time being even lower than
> before. However it's a bit addictive, and playing a bit too much with
> tuning tends to favor some archs at the expense of others. My experiments
> are here for the curious, though probably not very interesting at the moment:
> 
>    https://github.com/wtarreau/atomic-tests  

I took only a quick glance, but the results below do look interesting.

> > Otherwise, you are quite right, there are
> > all sorts of hardware-specific things that can happen at high levels of
> > memory contention.
> 
> There's not just memory contention, I found some severe contention inside
> L3 caches themselves. For example, AMD Ryzen CPUs have their caches arranged
> as "core complexes" of 4 cores connected to a same block of L3 cache, itself
> connected to the other "core complexes". Look at the measured inter-CPU
> latency (in nanoseconds) for a compare-and-swap between each of them:
> 
>          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>      0   -   8  24  24  25  24  26  26  69  68  68  69  69  69  69  69
>      1   8   -  24  24  24  24  26  26  69  69  69  69  69  69  69  69
>      2  24  24   -   8  23  23  24  24  69  69  68  68  69  69  68  69
>      3  24  24   8   -  23  23  24  24  69  68  68  69  69  69  68  71
>      4  25  25  23  23   -   8  25  25  69  69  69  69  70  70  69  70
>      5  25  24  23  23   8   -  25  25  69  69  70  69  70  69  70  69
>      6  26  26  24  24  25  25   -   8  69  69  69  69  69  69  69  69
>      7  26  26  24  24  25  25   8   -  68  69  69  70  70  70  70  69
>      8  69  69  68  68  69  69  69  69   -   8  24  24  25  25  26  26
>      9  69  69  68  68  69  69  69  69   8   -  24  24  25  25  26  26
>     10  69  69  69  69  69  70  69  68  24  24   -   8  23  23  24  24
>     11  69  69  69  67  68  69  68  68  24  24   8   -  23  23  24  24
>     12  69  69  69  68  70  70  69  69  25  25  23  23   -   8  25  25
>     13  69  69  69  68  70  70  69  69  25  24  23  23   8   -  25  25
>     14  69  69  68  69  70  69  69  69  26  26  24  24  25  25   -   8
>     15  69  69  68  68  70  70  69  69  26  26  24  24  25  25   8   -

Nice results!

I have been following the hardware-architecture approach, so that
the "memory" in "memory contention" includes the caches.

> And if you perform a load before the CAS (like for example when computing
> a max or updating a pointer), it gets better for same-core-complex and
> much worse for others:
> 
>          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>      0   -  18  21  21  23  24  22  22 124 124 123 122 125 125 124 124
>      1  18   -  22  22  24  24  22  22 124 124 122 122 124 124 124 123
>      2  21  22   -  18  23  23  21  56 122 122 122 122 125 125 122 124
>      3  21  21  18   -  23  23  21  20 123 123 122 122 124 124 122 122
>      4  24  23  24  23   -  18  24  23 124 124 123 124 126 126 124 124
>      5  24  24  23  23  18   -  23  23 124 124 124 123 126 126 125 126
>      6  22  22  21  20  23  23   -  18 124 124 124 123 125 124 124 124
>      7  22  22  20  21  23  24  18   - 125 124 122 122 126 124 125 125
>      8 123 123 122 122 124 124 122 122   -  18  20  20  24  23  21  21
>      9 124 124 123 123 125 125 124 124  18   -  20  21  24  24  58  21
>     10 123 123 122 122 124 122 122 122  21  21   -  18  22  23  20  20
>     11 123 122 122 122 123 124 124 123  20  20  18   -  23  23  23  20
>     12 125 124 124 123 126 126 125 124  23  23  23  23   -  18  24  23
>     13 126 125 124 123 126 126 124 124  23  23  24  22  18   -  23  23
>     14 124 124 123 122 124 124 124 124  20  20  20  21  23  25   -  18
>     15 124 124 123 123 125 124 124 124  22  21  20  56  24  24  18   -
> 
> See ? There's basically a factor of 6 in latency depending on the cores
> that are supposed to be connected to the same L3. And looking closer you
> even observe that within a same core-complex, things are not uniform at
> all either:
>   - core 1 always experiences fastest communication with other cores
>   - core 2 always experiences slowest communication with other cores
>   - discrepancies between pairs vary by up to 50%

So what rcutorture would want to do here is to assign the first four-CPU
guest OS to CPUs 0, 1, 8, and 9, right?  It would actually do that if
the /sys files indicate that CPUs 0-7 are on one node and CPUs 8-15
on another.

> However when it comes to success/failures of a CAS, it's the opposite:
>   - after a load, core 2 almost always succeeds a CAS (~50 tries)
>   - after a load, core 1 almost always fails a CAS (~500 tries)
>   - after a load, cores 0 and 3 are average (~100 tries)
> 
> This indicates to me that there's probably some arbitration cost to the
> cores that access the cache, and that cache lines assigned to one core
> or the other tend to take more time for some cores than others to be
> reclaimed, and that cores experiencing the highest latencies are also
> the ones keeping the lines longer and observing the highest success
> rate on CAS.

Cute!

> In your case it could mean that certain races will never happen on certain
> cores but could on others ones, or at least above certain frequencies,
> because if the cache line sticks long enough in an L2 to perform a series
> of operations undisturbed, whatever you'll do on other cores might have no
> effect.

In the cases exposed by my differential-latency testing, the issue was
pretty clearly the added delay.  This is because the issue involved
normal memory references for the RCU Tasks Trace bugs, read-side access
to seqlock-protected memory for the 2,199.0-second RCU CPU stall warning
(normal memory accesses plus memory barriers), and a lockdep false
positive in the Tiny SRCU case (though in lockdep's defense, this was
at best an accident waiting to happen).

But I completely agree that cmpxchg operations on highly contended
memory locations can expose all sorts of other issues.

> I suspect that working at much lower frequencies could at some point allow
> a cache line to be released between two operations. I don't seem to be observing
> this on the AMD for now though, as the differences increase when lowering
> the frequency from 4.0 to 2.2G for a single CAS:
> 
>          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>      0   -  19  55  55  48  48  62  62 103 103 105 105 103 103 103 103
>      1  19   -  55  55  48  48  62  62 103 103 105 105 103 103 103 103
>      2  55  55   -  19  40  40  55  55 104 104 106 106 107 107 104 103
>      3  55  55  19   -  40  40  55  55 104 104 105 106 107 107 104 104
>      4  48  48  40  40   -  19  47  47 109 109 107 106 105 105 112 111
>      5  48  48  40  40  19   -  47  47 109 109 107 106 105 105 111 111
>      6  62  62  55  55  47  47   -  19 103 103 105 105 103 103 103 103
>      7  62  62  55  55  47  47  19   - 103 103 105 105 103 103 103 103
>      8 103 103 104 104 110 109 103 103   -  19  55  55  48  48  62  62
>      9 103 103 104 104 110 110 103 103  19   -  55  55  48  48  62  62
>     10 105 105 105 106 106 106 105 105  55  55   -  19  40  40  55  55
>     11 105 105 105 106 107 107 105 105  55  55  19   -  40  40  55  55
>     12 103 103 108 107 105 105 103 103  48  48  40  40   -  19  47  47
>     13 103 103 107 107 105 105 103 103  48  48  40  40  19   -  47  47
>     14 103 103 105 104 110 111 103 103  62  62  55  55  47  47   -  19
>     15 103 103 104 104 111 111 103 103  62  62  55  55  47  47  19   -
> 
> Where a same core-complex would see variations from 22 to 26ns now we're
> seeing 40 to 62!

Huh.  Are the fast CPUs seeing higher latencies due to delays from the
slower CPUs?  Or is the clock-frequency difference causing increased
delays in and of itself?  (Crossing clock-frequency domains can be much
harder than it might seem to us software folks.)

So adjacent pairs of CPUs are electrically close as well?  I am looking
at the 18s and 19s along the diagonal above.  There appears to be
additional structure above the CPU-pair level within an 8-CPU group.
A linear arrangement of four pairs of CPUs?

> > Other than that, there are a lot of heuristics involving per-node counters
> > so that if a given node has had more than N consecutive shots and some
> > other node wants in, it holds off until the other node gets in. Complex
> > and tricky, so partitioning is better where it can be done.
> 
> I tried a bit with this but it didn't yield very interesting results
> so far. I think the reason is that I'm focusing on max latency and that
> with this, the max latency experienced by any CPU becomes the number of
> CPUs times the max granted to a single CPU. Playing with roughly averaged
> wait times instead tends to favor threads waiting too long first. The only
> thing is that I try to avoid updating this when the wait time is low
> enough, in order to avoid writes to a shared area. One of the side effects
> of writing to such a shared area is that some operations may be delayed
> due to write ordering. I.e. writing that you're waiting for the access,
> then performing a CAS will first require that the line containing that
> "waiting" flag be flushed before trying the CAS, which may increase the
> risk that another CPU in the mean time updates it and causes the CAS to
> fail. And if the "waiting" flag is shared with the area being touched by
> the CAS, then touching it to say "I'm here" is enough to slow down the
> other ones. Thus I prefer to watch a shared entry that is not supposed to
> be updated often, except when CPUs are fed up with waiting. Under normal
> operation it should perform better.

All points in favor of your long-term plan of reducing contention of
the variable being cmpxchg()ed.

> Another point I'm quite interested in is trying to group operations from
> close neighbors. In the example above on the Ryzen CPU, I would love to
> see CPUs from the same core complex try their luck together. One could
> think about using a CPU number to distribute accesses so that they're
> grouped, but it's less convenient from userland (or even from a VM). One
> element however knows better than the software, it's the cache controller.
> That's why I'm trying to keep *some* contention, so that once a core gets
> its access to the cache, the other ones on the same side of the L3 have a
> better chance to access it immediately after. And I think that's how I've
> reached an average of 45ns per operation (stats included) in my last test
> across all CPUs despite inter-CPU times ranging from 18 to 125ns. But that
> is quite unstable for now.

And very very dependent on the exact hardware.  I have seen hardware
where prefetching the line passed to cmpxchg doubled the overhead,
and some where it made no measurable difference.

> > > > All that aside, any advice on portably and usefully getting 2-3x clock
> > > > frequency differences into testing would be quite welcome.
> > > 
> > > I think that playing with scaling_max_freq and randomly setting it
> > > to either cpuinfo_min_freq or cpuinfo_max_freq could be a good start.
> > > However it will likely not work inside KVM, but for bare-metal tests
> > > it should work where cpufreq is supported.
> > 
> > OK, that explains at least some of my difficulties.  I almost always
> > run under qemu/KVM.
> 
> But you could disturb the real machine by changing its frequency.

Fair point, and that should fit into the memory-latency disturbing
that I am currently doing.

> I've been wondering if it's possible to change the tick rate inside KVM,
> or even program a dummy device to deliver very fast IRQs but I haven't
> found such a thing for now. I've been thinking about NMIs as well. The
> idea would be that there could be a higher chance of interrupting
> processing at any random place using fast interrupts (>100k/s) than by
> using context switching caused by scheduler slots being depleted. Maybe
> one only needs to hack an existing device emulator to flood an IRQ :-/

Or have a bug in the timekeeping code!  ;-)

							Thanx, Paul


* Re: Making races happen more often
  2021-09-28  4:01       ` Paul E. McKenney
@ 2021-09-29 19:59         ` Willy Tarreau
  2021-09-30 18:46           ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Willy Tarreau @ 2021-09-29 19:59 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

On Mon, Sep 27, 2021 at 09:01:47PM -0700, Paul E. McKenney wrote:
> > And if you perform a load before the CAS (like for example when computing
> > a max or updating a pointer), it gets better for same-core-complex and
> > much worse for others:
> > 
> >          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> >      0   -  18  21  21  23  24  22  22 124 124 123 122 125 125 124 124
> >      1  18   -  22  22  24  24  22  22 124 124 122 122 124 124 124 123
> >      2  21  22   -  18  23  23  21  56 122 122 122 122 125 125 122 124
> >      3  21  21  18   -  23  23  21  20 123 123 122 122 124 124 122 122
> >      4  24  23  24  23   -  18  24  23 124 124 123 124 126 126 124 124
> >      5  24  24  23  23  18   -  23  23 124 124 124 123 126 126 125 126
> >      6  22  22  21  20  23  23   -  18 124 124 124 123 125 124 124 124
> >      7  22  22  20  21  23  24  18   - 125 124 122 122 126 124 125 125
> >      8 123 123 122 122 124 124 122 122   -  18  20  20  24  23  21  21
> >      9 124 124 123 123 125 125 124 124  18   -  20  21  24  24  58  21
> >     10 123 123 122 122 124 122 122 122  21  21   -  18  22  23  20  20
> >     11 123 122 122 122 123 124 124 123  20  20  18   -  23  23  23  20
> >     12 125 124 124 123 126 126 125 124  23  23  23  23   -  18  24  23
> >     13 126 125 124 123 126 126 124 124  23  23  24  22  18   -  23  23
> >     14 124 124 123 122 124 124 124 124  20  20  20  21  23  25   -  18
> >     15 124 124 123 123 125 124 124 124  22  21  20  56  24  24  18   -
> > 
> > See ? There's basically a factor of 6 in latency depending on the cores
> > that are supposed to be connected to the same L3. And looking closer you
> > even observe that within a same core-complex, things are not uniform at
> > all either:
> >   - core 1 always experiences fastest communication with other cores
> >   - core 2 always experiences slowest communication with other cores
> >   - discrepancies between pairs vary by up to 50%
> 
> So what rcutorture would want to do here is to assign the first four-CPU
> guest OS to CPUs 0, 1, 8, and 9, right?

Interesting idea, I didn't think about that. I would then do even better
if you use 4 CPUs. Here, as can be seen in the low latency numbers between
adjacent CPUs of the same pair, they're both threads of the same core. So
0,1 are core 0 of die 0, and 8,9 are core 0 of die 1. Thus I would assign
two threads from the same core on an average core (either core 0 or 3 of
one die), and one thread of a fast core and one thread of a slow core of
the other die. That would give for example CPUs 0,1,10,12. This is how I
can imagine the highest unfairness between threads for your use case. A
variant could be to use CPU 2 instead of 10 (the low latency core).

> It would actually do that if
> the /sys files indicate that CPUs 0-7 are on one node and CPUs 8-15
> on another.

Yep. Also, please note that my tests on a Xeon W2145 (8 cores 16 threads
as well, just organized differently) didn't show any single difference
between the cache access patterns of any core. Thus at best you could
hope for sharing two threads of a core and two other cores, but that's
all. I haven't tested on an EPYC yet but I'd be interested in seeing if
recent AMDs exhibit similar imbalance between their cores as their
ancestors.

> > In your case it could mean that certain races will never happen on certain
> > cores but could on others ones, or at least above certain frequencies,
> > because if the cache line sticks long enough in an L2 to perform a series
> > of operations undisturbed, whatever you'll do on other cores might have no
> > effect.
> 
> In the cases exposed by my differential-latency testing, the issue was
> pretty clearly the added delay.  This is because the issue involved
> normal memory references for the RCU Tasks Trace bugs, read-side access
> to seqlock-protected memory for the 2,199.0-second RCU CPU stall warning
> (normal memory accesses plus memory barriers), and a lockdep false
> positive in the Tiny SRCU case (though in lockdep's defense, this was
> at best an accident waiting to happen).

These really don't look like easy patterns to reproduce!

> But I completely agree that cmpxchg operations on highly contended
> memory locations can expose all sorts of other issues.

Yep, they're "interesting" because the operation in itself is fundamentally
flawed and is used as the most inefficient building block for plenty of
other ones that emphasize bad patterns, starting with the opposite of
cmpxchg which would only replace a value if it's greater or lower for
example, or only replace it if not NULL. Due to the way it works,
slowdowns spiral down into a complete stall very quickly, and exponential
backoff that one might want to rely on usually further degrades the
situation since it requires a re-read before retrying.
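
As a concrete example of such a derived operation (a hedged sketch, the
name is made up): an "only replace if greater" helper built on top of a
CAS inevitably turns into a re-read/retry loop, and every failed attempt
is one more full trip to the contended cache line:

   #include <stdatomic.h>
   #include <stdbool.h>

   /* Replace *max with val only if val is greater.  A CAS failure reloads
    * 'old' with the current value, so the loop re-checks and retries. */
   static bool update_max(_Atomic unsigned long *max, unsigned long val)
   {
       unsigned long old = atomic_load(max);

       while (val > old) {
           if (atomic_compare_exchange_weak(max, &old, val))
               return true;
       }
       return false;    /* the current maximum is already >= val */
   }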

> > I suspect that working at much lower frequencies could at some point allow
> > a cache line to be released between two operations. I don't seem to be observing
> > this on the AMD for now though, as the differences increase when lowering
> > the frequency from 4.0 to 2.2G for a single CAS:
> > 
> >          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> >      0   -  19  55  55  48  48  62  62 103 103 105 105 103 103 103 103
> >      1  19   -  55  55  48  48  62  62 103 103 105 105 103 103 103 103
> >      2  55  55   -  19  40  40  55  55 104 104 106 106 107 107 104 103
> >      3  55  55  19   -  40  40  55  55 104 104 105 106 107 107 104 104
> >      4  48  48  40  40   -  19  47  47 109 109 107 106 105 105 112 111
> >      5  48  48  40  40  19   -  47  47 109 109 107 106 105 105 111 111
> >      6  62  62  55  55  47  47   -  19 103 103 105 105 103 103 103 103
> >      7  62  62  55  55  47  47  19   - 103 103 105 105 103 103 103 103
> >      8 103 103 104 104 110 109 103 103   -  19  55  55  48  48  62  62
> >      9 103 103 104 104 110 110 103 103  19   -  55  55  48  48  62  62
> >     10 105 105 105 106 106 106 105 105  55  55   -  19  40  40  55  55
> >     11 105 105 105 106 107 107 105 105  55  55  19   -  40  40  55  55
> >     12 103 103 108 107 105 105 103 103  48  48  40  40   -  19  47  47
> >     13 103 103 107 107 105 105 103 103  48  48  40  40  19   -  47  47
> >     14 103 103 105 104 110 111 103 103  62  62  55  55  47  47   -  19
> >     15 103 103 104 104 111 111 103 103  62  62  55  55  47  47  19   -
> > 
> > Where a same core-complex would see variations from 22 to 26ns now we're
> > seeing 40 to 62!
> 
> Huh.  Are the fast CPUs seeing higher latencies due to delays from the
> slower CPUs?  Or is the clock-frequency difference causing increased
> delays in and of itself?  (Crossing clock-frequency domains can be much
> harder than it might seem to us software folks.)

Note, they're all running at the same frequency here (2.2 GHz in this
test). I do think that all of this can be caused by non-integral
frequency ratios between the cores and the caches that would miss the
end of a cycle and have to wait for yet another one. But I could be
completely wrong. It could also be that the caches hold a line at least
X of their cycles in order to optimize for such test-and-set operations,
and that below a certain core frequency the cache doesn't hold the line
long enough for the operation to complete, causing repeated failures
due to contention with other cores. Note, for the measures above, I'm
running only with two threads for each measure, and each thread waits
for the other one's value in the shared counter to replace it (i.e. one
replaces odd values with even ones and the other one replaces even
values with odd ones).

> So adjacent pairs of CPUs are electrically close as well?  I am looking
> at the 18s and 19s along the diagonal above.

These ones are sibling threads of the same core. I hadn't noticed that
their time doesn't change with the frequency: it's still 18-19 ns for
both 2.2 and 4.0 GHz! So this makes me think that the L1 is not shared
at all and exchanges between them pass via the L2 which runs at a
constant frequency and not at the core frequency.

> There appears to be
> additional structure above the CPU-pair level within an 8-CPU group.
> A linear arrangement of four pairs of CPUs?

Yes that's exactly what I meant with the core complexes. My understanding
is that they're connected like this:

      CPU    CPU    CPU    CPU        CPU    CPU    CPU    CPU
      0,1    2,3    4,5    6,7        8,9   10,11  12,13  14,15
    +----+ +----+ +----+ +----+     +----+ +----+ +----+ +----+
    |core| |core| |core| |core|     |core| |core| |core| |core|
    |  0 | |  1 | |  2 | |  3 |     |  4 | |  5 | |  6 | |  7 |
    +----+ +----+ +----+ +----+     +----+ +----+ +----+ +----+
       |      |      |      |          |      |      |      |
     +--+   +--+   +--+   +--+       +--+   +--+   +--+   +--+
     |L2|   |L2|   |L2|   |L2|       |L2|   |L2|   |L2|   |L2|
     +--+   +--+   +--+   +--+       +--+   +--+   +--+   +--+
       |      |      |      |          |      |      |      |
    +--------------------------+   +---------------------------+
  #=|          L3 1/2          |===|           L3 2/2          |=#
  # +--------------------------+   +---------------------------+ #
  #                                                              #
  #==============================================================#

That's not cool for their users (I imagine those running VMs with
automatic CPU allocations...) but when trying to emphasize horrible
patterns like you and I do, they're particularly interesting :-)

My suspicion is that the L3 blocks have in fact 6-7 total ports with
some rotary arbitration and that the cores are connected there in a
specific order, and that the ring connection to the other half just
uses extra ports that are located at certain places that favor more
or less certain cores. Maybe it's even 7 ports (for the
diagonal in 16+ core CPUs) and that the 7th port is left unused
here, which could also explain the asymmetry between all threads.
For example it could look like this for each block with all 7 ports
being round-robinned:

                to L3 cores 4-7
                east port
                //
            +------+
            |      |
   core 0 ==|  L3  |== core 1
            |block |
            |cores |
   core 2 ==| 0-3  |== core 3
            |      |
            +------+
             //  \\
         unused   to L3 cores 4-7
                  west port

It's then easy to imagine an extra latency for cores 1 and 2 there,
with 2 being more impacted by two slots before it.

> All points in favor of your long-term plan of reducing contention of
> the variable being cmpxchg()ed.

Oh yes definitely! I know it's not the subject here but I'm currently
modifying the architecture to support thread groups to partition the
contention. It's way more complex than I imagined, and as usual the
devil is in the details.

> > Another point I'm quite interested in is trying to group operations from
> > close neighbors. In the example above on the Ryzen CPU, I would love to
> > see CPUs from the same core complex try their luck together. One could
> > think about using a CPU number to distribute accesses so that they're
> > grouped, but it's less convenient from userland (or even from a VM). One
> > element however knows better than the software, it's the cache controller.
> > That's why I'm trying to keep *some* contention, so that once a core gets
> > its access to the cache, the other ones on the same side of the L3 have a
> > better chance to access it immediately after. And I think that's how I've
> > reached an average of 45ns per operation (stats included) in my last test
> > across all CPUs despite inter-CPU times ranging from 18 to 125ns. But that
> > is quite unstable for now.
> 
> And very very dependent on the exact hardware.

Yep! For the record this one is an AMD Ryzen 2700X.

> I have seen hardware where prefetching the line passed to cmpxchg
> doubled the overhead, and some where it made no measurable difference.

I noticed exactly the same. Prefetching on the Xeon W2145 costs twice
as much :-(

Another thing I discovered while preparing a machine for a test was that
there was an option in the BIOS proposing to automatically prefetch the
next cache line when reading one. I wasn't aware that some CPUs had this
option but this scared me, thinking that I'm trying to align some critical
structures on 64 bytes to avoid false sharing and that some machines
might be capable of still suffering from sharing between adjacent lines!
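
One defensive (if wasteful) option against such an adjacent-line
prefetcher, sketched here with an assumed 64-byte line size rather than
anything detected at run time, is to pad and align the hot structures to
two lines instead of one:

   #include <stdalign.h>
   #include <stdatomic.h>

   #define CACHE_LINE   64                 /* assumed, not detected */
   #define ADJACENT_PAD (2 * CACHE_LINE)   /* defeat next-line prefetch */

   struct hot_counter {
       alignas(ADJACENT_PAD) _Atomic unsigned long value;
       char pad[ADJACENT_PAD - sizeof(_Atomic unsigned long)];
   };

   _Static_assert(sizeof(struct hot_counter) == ADJACENT_PAD,
                  "hot_counter must cover exactly two cache lines");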

> > > > > All that aside, any advice on portably and usefully getting 2-3x clock
> > > > > frequency differences into testing would be quite welcome.
> > > > 
> > > > I think that playing with scaling_max_freq and randomly setting it
> > > > to either cpuinfo_min_freq or cpuinfo_max_freq could be a good start.
> > > > However it will likely not work inside KVM, but for bare-metal tests
> > > > it should work where cpufreq is supported.
> > > 
> > > OK, that explains at least some of my difficulties.  I almost always
> > > run under qemu/KVM.
> > 
> > But you could disturb the real machine by changing its frequency.
> 
> Fair point, and that should fit into the memory-latency disturbing
> that I am currently doing.

Yes very likely. I'm wondering if you couldn't just try to prefetch
random lines from a very large memory location in the disturbing
threads. You might manage to waste a lot of memory bandwidth and to
cause high latencies. Maybe using AVX and friends to perform large
batches of writes could help further increase latency by filling
write buffers.
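
A hedged sketch of such a disturber (the buffer size, the cheap xorshift
PRNG and the plain byte store are arbitrary choices; AVX streaming
stores such as _mm256_stream_si256() could replace the store to also
fill the write buffers), meant to run in one or more threads pinned away
from the threads under test:

   #include <stdint.h>
   #include <stdlib.h>
   #include <string.h>

   #define DISTURB_SIZE (256UL << 20)    /* 256 MB, far larger than any L3 */
   #define CACHE_LINE   64

   /* Endlessly dirty random cache lines of a large buffer, purely to burn
    * memory bandwidth and inflate the latency seen by the test threads. */
   static void disturb_memory(void)
   {
       unsigned char *buf = malloc(DISTURB_SIZE);
       uint64_t state = 0x9e3779b97f4a7c15ULL;

       if (!buf)
           return;
       memset(buf, 0, DISTURB_SIZE);     /* fault all the pages in */
       for (;;) {
           state ^= state << 13;
           state ^= state >> 7;
           state ^= state << 17;
           /* touch one byte per randomly chosen cache line */
           buf[(state % DISTURB_SIZE) & ~(uint64_t)(CACHE_LINE - 1)] ^= 1;
       }
   }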

> > I've been wondering if it's possible to change the tick rate inside KVM,
> > or even program a dummy device to deliver very fast IRQs but I haven't
> > found such a thing for now. I've been thinking about NMIs as well. The
> > idea would be that there could be a higher chance of interrupting
> > processing at any random place using fast interrupts (>100k/s) than by
> > using context switching caused by scheduler slots being depleted. Maybe
> > one only needs to hack an existing device emulator to flood an IRQ :-/
> 
> Or have a bug in the timekeeping code!  ;-)

Not sure how to achieve that, but maybe :-)

Willy


* Re: Making races happen more often
  2021-09-29 19:59         ` Willy Tarreau
@ 2021-09-30 18:46           ` Paul E. McKenney
  2021-09-30 21:12             ` Willy Tarreau
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2021-09-30 18:46 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: rcu

On Wed, Sep 29, 2021 at 09:59:08PM +0200, Willy Tarreau wrote:
> On Mon, Sep 27, 2021 at 09:01:47PM -0700, Paul E. McKenney wrote:
> > > And if you perform a load before the CAS (like for example when computing
> > > a max or updating a pointer), it gets better for same-core-complex and
> > > much worse for others:
> > > 
> > >          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> > >      0   -  18  21  21  23  24  22  22 124 124 123 122 125 125 124 124
> > >      1  18   -  22  22  24  24  22  22 124 124 122 122 124 124 124 123
> > >      2  21  22   -  18  23  23  21  56 122 122 122 122 125 125 122 124
> > >      3  21  21  18   -  23  23  21  20 123 123 122 122 124 124 122 122
> > >      4  24  23  24  23   -  18  24  23 124 124 123 124 126 126 124 124
> > >      5  24  24  23  23  18   -  23  23 124 124 124 123 126 126 125 126
> > >      6  22  22  21  20  23  23   -  18 124 124 124 123 125 124 124 124
> > >      7  22  22  20  21  23  24  18   - 125 124 122 122 126 124 125 125
> > >      8 123 123 122 122 124 124 122 122   -  18  20  20  24  23  21  21
> > >      9 124 124 123 123 125 125 124 124  18   -  20  21  24  24  58  21
> > >     10 123 123 122 122 124 122 122 122  21  21   -  18  22  23  20  20
> > >     11 123 122 122 122 123 124 124 123  20  20  18   -  23  23  23  20
> > >     12 125 124 124 123 126 126 125 124  23  23  23  23   -  18  24  23
> > >     13 126 125 124 123 126 126 124 124  23  23  24  22  18   -  23  23
> > >     14 124 124 123 122 124 124 124 124  20  20  20  21  23  25   -  18
> > >     15 124 124 123 123 125 124 124 124  22  21  20  56  24  24  18   -
> > > 
> > > See ? There's basically a factor of 6 in latency depending on the cores
> > > that are supposed to be connected to the same L3. And looking closer you
> > > even observe that within a same core-complex, things are not uniform at
> > > all either:
> > >   - core 1 always experiences fastest communication with other cores
> > >   - core 2 always experiences slowest communication with other cores
> > >   - discrepancies between pairs vary by up to 50%
> > 
> > So what rcutorture would want to do here is to assign the first four-CPU
> > guest OS to CPUs 0, 1, 8, and 9, right?
> 
> Interesting idea, I didn't think about that. I would then do even better
> if you use 4 CPUs. Here, as can be seen in the low latency numbers between
> adjacent CPUs of the same pair, they're both threads of the same core. So
> 0,1 are core 0 of die 0, and 8,9 are core 0 of die 1. Thus I would assign
> two threads from the same core on an average core (either core 0 or 3 of
> one die), and one thread of a fast core and one thread of a slow core of
> the other die. That would give for example CPUs 0,1,10,12. This is how I
> can imagine the highest unfairness between threads for your use case. A
> variant could be to use CPU 2 instead of 10 (the low latency core).

Do CPUs 0-7 and 8-15 show up in different nodes in sysfs?  If so, this
case is already covered within rcutorture.  ;-)

> > It would actually do that if
> > the /sys files indicate that CPUs 0-7 are on one node and CPUs 8-15
> > on another.
> 
> Yep. Also, please note that my tests on a Xeon W2145 (8 cores 16 threads
> as well, just organized differently) didn't show any single difference
> between the cache access patterns of any core. Thus at best you could
> hope for sharing two threads of a core and two other cores, but that's
> all. I haven't tested on an EPYC yet but I'd be interested in seeing if
> recent AMDs exhibit similar imbalance between their cores as their
> ancestors.

I would expect quite a bit of variety.

> > > In your case it could mean that certain races will never happen on certain
> > > cores but could on others ones, or at least above certain frequencies,
> > > because if the cache line sticks long enough in an L2 to perform a series
> > > of operations undisturbed, whatever you'll do on other cores might have no
> > > effect.
> > 
> > In the cases exposed by my differential-latency testing, the issue was
> > pretty clearly the added delay.  This is because the issue involved
> > normal memory references for the RCU Tasks Trace bugs, read-side access
> > to seqlock-protected memory for the 2,199.0-second RCU CPU stall warning
> > (normal memory accesses plus memory barriers), and a lockdep false
> > positive in the Tiny SRCU case (though in lockdep's defense, this was
> > at best an accident waiting to happen).
> 
> These really don't look like easy patterns to reproduce!

I could probably spend full time tweaking things and forcing different
classes of bugs to happen.  :-/

> > But I completely agree that cmpxchg operations on highly contended
> > memory locations can expose all sorts of other issues.
> 
> Yep, they're "interesting" because the operation in itself is fundamentally
> flawed and is used as the most inefficient building block for plenty of
> other ones that emphasize bad patterns, starting with the opposite of
> cmpxchg which would only replace a value if it's greater or lower for
> example, or only replace it if not NULL. Due to the way it works,
> slowdowns spiral down into a complete stall very quickly, and exponential
> backoff that one might want to rely on usually further degrades the
> situation since it requires a re-read before retrying.

Indeed!  That line of reasoning led to the Itanium ld.bias instruction
being my fault.  ;-)

And that is exactly why good concurrent algorithms maintain low
contention.  And I agree that this is more of an issue with cmpxchg than
(say) xchg, xadd, and so on.  Assuming competent hardware, anyway.  :-/

> > > I suspect that working at much lower frequencies could at some point allow
> > > a cache line to be released between two operations. I don't seem to be observing
> > > this on the AMD for now though, as the differences increase when lowering
> > > the frequency from 4.0 to 2.2G for a single CAS:
> > > 
> > >          0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> > >      0   -  19  55  55  48  48  62  62 103 103 105 105 103 103 103 103
> > >      1  19   -  55  55  48  48  62  62 103 103 105 105 103 103 103 103
> > >      2  55  55   -  19  40  40  55  55 104 104 106 106 107 107 104 103
> > >      3  55  55  19   -  40  40  55  55 104 104 105 106 107 107 104 104
> > >      4  48  48  40  40   -  19  47  47 109 109 107 106 105 105 112 111
> > >      5  48  48  40  40  19   -  47  47 109 109 107 106 105 105 111 111
> > >      6  62  62  55  55  47  47   -  19 103 103 105 105 103 103 103 103
> > >      7  62  62  55  55  47  47  19   - 103 103 105 105 103 103 103 103
> > >      8 103 103 104 104 110 109 103 103   -  19  55  55  48  48  62  62
> > >      9 103 103 104 104 110 110 103 103  19   -  55  55  48  48  62  62
> > >     10 105 105 105 106 106 106 105 105  55  55   -  19  40  40  55  55
> > >     11 105 105 105 106 107 107 105 105  55  55  19   -  40  40  55  55
> > >     12 103 103 108 107 105 105 103 103  48  48  40  40   -  19  47  47
> > >     13 103 103 107 107 105 105 103 103  48  48  40  40  19   -  47  47
> > >     14 103 103 105 104 110 111 103 103  62  62  55  55  47  47   -  19
> > >     15 103 103 104 104 111 111 103 103  62  62  55  55  47  47  19   -
> > > 
> > > Where a same core-complex would see variations from 22 to 26ns now we're
> > > seeing 40 to 62!
> > 
> > Huh.  Are the fast CPUs seeing higher latencies due to delays from the
> > slower CPUs?  Or is the clock-frequency difference causing increased
> > delays in and of itself?  (Crossing clock-frequency domains can be much
> > harder than it might seem to us software folks.)
> 
> Note, they're all running at the same frequency here (2.2 GHz in this
> test). I do think that all of this can be caused by non-integral
> frequency ratios between the cores and the caches that would miss the
> end of a cycle and have to wait for yet another one. But I could be
> completely wrong. It could also be that the caches hold a line at least
> X of their cycles in order to optimize for such test-and-set operations,
> and that below a certain core frequency the cache doesn't hold the line
> long enough for the operation to complete, causing repeated failures
> due to contention with other cores. Note, for the measures above, I'm
> running only with two threads for each measure, and each thread waits
> for the other one's value in the shared counter to replace it (i.e. one
> replaces odd values with even ones and the other one replaces even
> values with odd ones).

OK, good to know!
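
For reference, a minimal sketch of the odd/even CAS ping-pong described
above, using C11 atomics and pthreads; all names are made up, and the
CPU pinning that makes the per-pair numbers meaningful (e.g. via
pthread_setaffinity_np()) is omitted for brevity:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ROUNDS 1000000UL

    static _Atomic unsigned long counter;

    /* parity 0: consume even values and store odd ones; parity 1 does
     * the opposite, so the two threads strictly alternate on 'counter'.
     */
    static void *pingpong(void *arg)
    {
        unsigned long parity = (unsigned long)arg;

        for (unsigned long i = 0; i < ROUNDS; i++) {
            unsigned long old = atomic_load(&counter);

            while ((old & 1UL) != parity ||
                   !atomic_compare_exchange_weak(&counter, &old, old + 1))
                old = atomic_load(&counter);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        struct timespec a, b;

        clock_gettime(CLOCK_MONOTONIC, &a);
        pthread_create(&t[0], NULL, pingpong, (void *)0UL);
        pthread_create(&t[1], NULL, pingpong, (void *)1UL);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);

        printf("%.1f ns per exchange\n",
               ((b.tv_sec - a.tv_sec) * 1e9 + b.tv_nsec - a.tv_nsec) /
               (2.0 * ROUNDS));
        return 0;
    }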

> > So adjacent pairs of CPUs are electrically close as well?  I am looking
> > at the 18s and 19s along the diagonal above.
> 
> These ones are sibling threads of the same core. I hadn't noticed that
> their time doesn't change with the frequency: it's still 18-19 ns for
> both 2.2 and 4.0 GHz! So this makes me think that the L1 is not shared
> at all and exchanges between them pass via the L2 which runs at a
> constant frequency and not at the core frequency.
> 
> > There appears to be
> > additional structure above the CPU-pair level within an 8-CPU group.
> > A linear arrangement of four pairs of CPUs?
> 
> Yes that's exactly what I meant with the core complexes. My understanding
> is that they're connected like this:
> 
>       CPU    CPU    CPU    CPU        CPU    CPU    CPU    CPU
>       0,1    2,3    4,5    6,7        8,9   10,11  12,13  14,15
>     +----+ +----+ +----+ +----+     +----+ +----+ +----+ +----+
>     |core| |core| |core| |core|     |core| |core| |core| |core|
>     |  0 | |  1 | |  2 | |  3 |     |  4 | |  5 | |  6 | |  7 |
>     +----+ +----+ +----+ +----+     +----+ +----+ +----+ +----+
>        |      |      |      |          |      |      |      |
>      +--+   +--+   +--+   +--+       +--+   +--+   +--+   +--+
>      |L2|   |L2|   |L2|   |L2|       |L2|   |L2|   |L2|   |L2|
>      +--+   +--+   +--+   +--+       +--+   +--+   +--+   +--+
>        |      |      |      |          |      |      |      |
>     +--------------------------+   +---------------------------+
>   #=|          L3 1/2          |===|           L3 2/2          |=#
>   # +--------------------------+   +---------------------------+ #
>   #                                                              #
>   #==============================================================#

Aha!  You cannot completely hide your hardware structure because your
latencies will tell on you!  ;-)

> That's not cool for their users (I imagine those running VMs with
> automatic CPU allocations...) but when trying to emphasize horrible
> patterns like you and me do, they're particularly interesting :-)

;-) ;-) ;-)

> My suspicion is that the L3 blocks have in fact 6-7 total ports with
> some rotary arbitration and that the cores are connected there in a
> specific order, and that the ring connection to the other half just
> uses extra ports that are located at certain places that favor more
> or less certain cores. Maybe it's even 7 ports (for the
> diagonal in 16+ core CPUs) and that the 7th port is left unused
> here, which could also explain the asymmetry between all threads.
> For example it could look like this for each block with all 7 ports
> being round-robinned:
> 
>                 to L3 cores 4-7
>                 east port
>                 //
>             +------+
>             |      |
>    core 0 ==|  L3  |== core 1
>             |block |
>             |cores |
>    core 2 ==| 0-3  |== core 3
>             |      |
>             +------+
>              //  \\
>          unused   to L3 cores 4-7
>                   west port
> 
> It's then easy to imagine an extra latency for cores 1 and 2 there,
> with 2 being more impacted by two slots before it.

Indeed, beyond a certain point, it is necessary to talk to someone
who actually knows the hardware structure.

> > All points in favor of your long-term plan of reducing contention of
> > the variable being cmpxchg()ed.
> 
> Oh yes definitely! I know it's not the subject here but I'm currently
> modifying the architecture to support thread groups to partition the
> contention. It's way more complex than I imagined, and as usual the
> devil is in the details.

Understood!  And that is all too often the case.

> > > Another point I'm quite interested in is trying to group operations from
> > > close neighbors. In the example above on the Ryzen CPU, I would love to
> > > see CPUs from the same core complex try their luck together. One could
> > > think about using a CPU number to distribute accesses so that they're
> > > grouped, but it's less convenient from userland (or even from a VM). One
> > > element however knows better than the software, it's the cache controller.
> > > That's why I'm trying to keep *some* contention, so that once a core gets
> > > its access to the cache, the other ones on the same side of the L3 have a
> > > better chance to access it immediately after. And I think that's how I've
> > > reached an average of 45ns per operation (stats included) in my last test
> > > across all CPUs despite inter-CPU times ranging from 18 to 125ns. But that
> > > is quite unstable for now.
> > 
> > And very very dependent on the exact hardware.
> 
> Yep! For the record this one is an AMD Ryzen 2700X.
> 
> > I have seen hardware where prefetching the line passed to cmpxchg
> > doubled the overhead, and some where it made no measurable difference.
> 
> I noticed exactly the same. Prefetching on the Xeon W2145 costs twice
> as much :-(

Ouch!

> Another thing I discovered while preparing a machine for a test was that
> there was an option in the BIOS proposing to automatically prefetch the
> next cache line when reading one. I wasn't aware that some CPUs had this
> option but this scared me, thinking that I'm trying to align some critical
> structures on 64 bytes to avoid false sharing and that some machines
> might be capable of still suffering from sharing between adjacent lines!

I have no idea whether to suggest turning off that option on the one
hand or aligning to 128 bytes on the other...
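
For concreteness, a minimal sketch of the 128-byte option in C (the
structure name is hypothetical, and whether it actually helps depends
on the prefetcher behaviour discussed further down in the thread):

    /* One hot, contended word per 128-byte block, so that two such
     * structures never sit in adjacent 64-byte lines that a next-line
     * prefetcher could drag in together.
     */
    struct hot_counter {
        unsigned long value;               /* the contended word */
    } __attribute__((aligned(128)));       /* sizeof rounds up to 128 */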

> > > > > > All that aside, any advice on portably and usefully getting 2-3x clock
> > > > > > frequency differences into testing would be quite welcome.
> > > > > 
> > > > > I think that playing with scaling_max_freq and randomly setting it
> > > > > to either cpuinfo_min_freq or cpuinfo_max_freq could be a good start.
> > > > > However it will likely not work inside KVM, but for bare-metal tests
> > > > > it should work where cpufreq is supported.
> > > > 
> > > > OK, that explains at least some of my difficulties.  I almost always
> > > > run under qemu/KVM.
> > > 
> > > But you could disturb the real machine with changing its frequency.
> > 
> > Fair point, and that should fit into the memory-latency disturbing
> > that I am currently doing.
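
A minimal sketch of that scaling_max_freq jitter (hypothetical code,
bare metal only, run as root): each call caps one CPU at either its
minimum or its maximum frequency, chosen at random, using the standard
cpufreq sysfs files named above:

    #include <stdio.h>
    #include <stdlib.h>

    /* Copy cpuinfo_min_freq or cpuinfo_max_freq (picked at random) into
     * scaling_max_freq for the given CPU.  Error handling is minimal.
     */
    static void set_random_freq_cap(int cpu)
    {
        char minp[96], maxp[96], outp[96], val[64];
        FILE *in, *out;

        snprintf(minp, sizeof(minp),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_min_freq", cpu);
        snprintf(maxp, sizeof(maxp),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);
        snprintf(outp, sizeof(outp),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);

        in = fopen(rand() & 1 ? minp : maxp, "r");
        out = fopen(outp, "w");
        if (in && out && fgets(val, sizeof(val), in))
            fputs(val, out);
        if (in)
            fclose(in);
        if (out)
            fclose(out);
    }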
> 
> Yes very likely. I'm wondering if you couldn't just try to prefetch
> random lines from a very large memory location in the disturbing
> threads. You might manage to waste a lot of memory bandwidth and to
> cause high latencies. Maybe using AVX and friends to perform large
> batches of writes could help further increase latency by filling
> write buffers.

If I knew what data was involved in the bug, then yes, the occasional
atomic add of zero to that data might be quite useful.

This sort of thing is what formal verification techniques are good at.
Pity about their inability to handle anything but tiny snippets of code...

> > > I've been wondering if it's possible to change the tick rate inside KVM,
> > > or even program a dummy device to deliver very fast IRQs but I haven't
> > > found such a thing for now. I've been thinking about NMIs as well. The
> > > idea would be that there could be a higher chance of interrupting
> > > processing at any random place using fast interrupts (>100k/s) than by
> > > using context switching caused by scheduler slots being depleted. Maybe
> > > one only needs to hack an existing device emulator to flood an IRQ :-/
> > 
> > Or have a bug in the timekeeping code!  ;-)
> 
> Not sure how to achieve that, but maybe :-)

Given that the 2,199.0-second stall came as quite a surprise to the
timekeeping folks, you are not the only one unsure as to how to achieve
that.  In their defense, it took me quite some time to accept what was
happening.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making races happen more often
  2021-09-30 18:46           ` Paul E. McKenney
@ 2021-09-30 21:12             ` Willy Tarreau
  2021-10-07  0:44               ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Willy Tarreau @ 2021-09-30 21:12 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

On Thu, Sep 30, 2021 at 11:46:23AM -0700, Paul E. McKenney wrote:
> > > So what rcutorture would want to do here is to assign the first four-CPU
> > > guest OS to CPUs 0, 1, 8, and 9, right?
> > 
> > Interesting idea, I didn't think about that. I would then do even better
> > if you use 4 CPUs. Here, as can be seen in the low latency numbers between
> > adjacent CPUs of the same pair, they're both threads of the same core. So
> > 0,1 are core 0 of die 0, and 8,9 are core 0 of die 1. Thus I would assign
> > two threads from the same core on an average core (either core 0 or 3 of
> > one die), and one thread of a fast core and one thread of a slow core of
> > the other die. That would give for example CPUs 0,1,10,12. This is how I
> > can imagine the highest unfairness between threads for your use case. A
> > variant could be to use CPU 2 instead of 10 (the low latency core).
> 
> Do CPUs 1-7 and 8-15 show up in different nodes in sysfs?  If so, this
> case is already covered within rcutorture.  ;-)

No, they don't:

# lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ    MINMHZ
  0    0      0    0 0:0:0:0          yes 4000.0000 2200.0000
  1    0      0    0 0:0:0:0          yes 4000.0000 2200.0000
  2    0      0    1 1:1:1:0          yes 4000.0000 2200.0000
  3    0      0    1 1:1:1:0          yes 4000.0000 2200.0000
  4    0      0    2 2:2:2:0          yes 4000.0000 2200.0000
  5    0      0    2 2:2:2:0          yes 4000.0000 2200.0000
  6    0      0    3 3:3:3:0          yes 4000.0000 2200.0000
  7    0      0    3 3:3:3:0          yes 4000.0000 2200.0000
  8    0      0    4 4:4:4:1          yes 4000.0000 2200.0000
  9    0      0    4 4:4:4:1          yes 4000.0000 2200.0000
 10    0      0    5 5:5:5:1          yes 4000.0000 2200.0000
 11    0      0    5 5:5:5:1          yes 4000.0000 2200.0000
 12    0      0    6 6:6:6:1          yes 4000.0000 2200.0000
 13    0      0    6 6:6:6:1          yes 4000.0000 2200.0000
 14    0      0    7 7:7:7:1          yes 4000.0000 2200.0000
 15    0      0    7 7:7:7:1          yes 4000.0000 2200.0000

The only difference you can see is the L3 appearing as different. Maybe
you could modify rcutorture to consider node+l3 as a distinct node ?

> > > It would actually do that if
> > > the /sys files indicate that CPUs 0-7 are on one node and CPUs 8-15
> > > on another.
> > 
> > Yep. Also, please note that my tests on a Xeon W2145 (8 cores 16 threads
> > as well, just organized differently) didn't show any single difference
> > between the cache access patterns of any core. Thus at best you could
> > hope for sharing two threads of a core and two other cores, but that's
> > all. I haven't tested on an EPYC yet but I'd be interested in seeing if
> > recent AMDs exhibit similar imbalance between their cores as their
> > ancestors.
> 
> I would expect quite a bit of variety.

We should get one in a few days-weeks so I expect to discover more stuff
and hopefully more bugs in my code.

> > > But I completely agree that cmpxchg operations on highly contended
> > > memory locations can expose all sorts of other issues.
> > 
> > Yep, they're "interesting" because the operation in itself is fundamentally
> > flawed and is used as the most inefficient building block for plenty of
> > other ones that emphasize bad patterns, starting with the opposite of
> > cmpxchg which would only replace a value if it's greater or lower for
> > example, or only replace it if not NULL. Due to the way it works,
> > slowdowns spiral down into a complete stall very quickly, and exponential
> > backoff that one might want to rely on usually further degrades the
> > situation since it requires to re-read before retrying.
> 
> Indeed!  That line of reasoning led to the Itanium ld.bias instruction
> being my fault.  ;-)

I didn't know this feature. I suspect you could achieve the same using
cmpxchg(ptr,x,x), which will perform a write only if the area contains
magic value "x" to replace it with itself (thus put an unlikely one),
while other values will only perform the bus first cycle and will return
the value in x. On x86 (at least) it's documented that the load phase of
cmpxchg uses a bus write cycle, and arguably it's understandable that
nobody wants to start a cmpxchg without having exclusive ownership of
the cache line. So that should achieve your ld.bias on modern CPUs in
case you miss it, with a single bus cycle vs two for an atomic add :-)

Do you think there could be some value in using this after a backoff
to reread an old value before retrying a CAS ? I'm just thinking that
if a cache controller holds the line a bit longer when it has exclusive
access there might be a chance of greater success, but I never tested
because I never thought about this. Now I feel like I'll have something
to play with this week-end.
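
A minimal sketch of that cmpxchg-as-biased-load trick (illustrative
only; the "impossible" value is an assumption about the data, and it
uses the GCC/Clang __atomic builtins):

    /* Load *p while requesting the line in exclusive state: compare
     * against a value the field can never hold, so the CAS always
     * fails, writes nothing, and leaves the current contents in
     * 'magic'.
     */
    static inline unsigned long load_exclusive(unsigned long *p)
    {
        unsigned long magic = ~0UL;   /* assumed never stored in *p */

        __atomic_compare_exchange_n(p, &magic, magic, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
        return magic;                 /* updated with *p on failure */
    }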

> And that is exactly why good concurrent algorithms maintain low
> contention.  And I agree that this is more of an issue with cmpxchg than
> (say) xchg, xadd, and so on.  Assuming competent hardware, anyway.  :-/

That's exactly why I designed my upgradable locks around xadd a few years
ago! A single operation changes an unknown state to a known set of possible
states and returns the previous one. You can even rely on the wrapping
overflow to drop certain bits if needed.
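
Not Willy's actual lock design, but the classic illustration of that
xadd property is the ticket lock: a single xadd both advances the state
and hands back the previous state (your ticket), and the counters are
free to wrap:

    #include <stdatomic.h>

    struct ticket_lock {
        _Atomic unsigned int next;      /* next ticket to hand out */
        _Atomic unsigned int serving;   /* ticket currently served */
    };

    static void ticket_lock(struct ticket_lock *l)
    {
        unsigned int me = atomic_fetch_add(&l->next, 1);   /* the xadd */

        while (atomic_load(&l->serving) != me)
            ;   /* spin; a real lock would pause/relax here */
    }

    static void ticket_unlock(struct ticket_lock *l)
    {
        atomic_fetch_add(&l->serving, 1);
    }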

> > My understanding is that they're connected like this:
> > 
> >       CPU    CPU    CPU    CPU        CPU    CPU    CPU    CPU
> >       0,1    2,3    4,5    6,7        8,9   10,11  12,13  14,15
> >     +----+ +----+ +----+ +----+     +----+ +----+ +----+ +----+
> >     |core| |core| |core| |core|     |core| |core| |core| |core|
> >     |  0 | |  1 | |  2 | |  3 |     |  4 | |  5 | |  6 | |  7 |
> >     +----+ +----+ +----+ +----+     +----+ +----+ +----+ +----+
> >        |      |      |      |          |      |      |      |
> >      +--+   +--+   +--+   +--+       +--+   +--+   +--+   +--+
> >      |L2|   |L2|   |L2|   |L2|       |L2|   |L2|   |L2|   |L2|
> >      +--+   +--+   +--+   +--+       +--+   +--+   +--+   +--+
> >        |      |      |      |          |      |      |      |
> >     +--------------------------+   +---------------------------+
> >   #=|          L3 1/2          |===|           L3 2/2          |=#
> >   # +--------------------------+   +---------------------------+ #
> >   #                                                              #
> >   #==============================================================#
> 
> Aha!  You cannot completely hide your hardware structure because your
> latencies will tell on you!  ;-)

:-)  That reminds me that when I was a kid I used to distinguish
serial port controller brands by monitoring their status bits during
a soft reset, and I noticed they would not all come up in the exact
same sequence depending on the manufacturer so I could proudly tell
my friends without opening their PC "that's intel or NS or Winbond"
(they wouldn't open to verify anyway). Fun times.

> > It's then easy to imagine an extra latency for cores 1 and 2 there,
> > with 2 being more impacted by two slots before it.
> 
> Indeed, beyond a certain point, it is necessary to talk to someone
> who actually knows the hardware structure.

Sure that can help but by the time you get your info, your CPU is
obsolete and you need to focus on another one. What matters to me
are two goals:
  - identify such awful patterns to make sure my code's performance
    will not degrade too much when running on these, or at least
    architect some options to work around them (e.g. thread groups)

  - easily detect where they exist so that I can use those machines
    to try to reproduce issues instead of using too slick machines
    that will never trigger issues (very similar to what you're
    trying to do in fact)

> > Another thing I discovered while preparing a machine for a test was that
> > there was an option in the BIOS proposing to automatically prefetch the
> > next cache line when reading one. I wasn't aware that some CPUs had this
> > option but this scared me, thinking that I'm trying to align some critical
> > structures on 64 bytes to avoid false sharing and that some machines
> > might be capable of still suffering from sharing between adjacent lines!
> 
> I have no idea whether to suggest turning off that option on the one
> hand or aligning to 128 bytes on the other...

That was my first idea when I read that option. Then I reread and noticed
this word "NEXT cache line", which is different from "prefetch cache line
PAIRS" or "adjacent cachelines" etc. I suspect it's used to optimise
memcpy() and it doesn't care about parity, so even if you align on 128,
if you hit the second cache line first you're sure to provoke the load
of the one that follows :-/  I have not tried to play with that yet and
the machines are currently in use so I won't know if these have any
impact for a while.

> > I'm wondering if you couldn't just try to prefetch
> > random lines from a very large memory location in the disturbing
> > threads. You might manage to waste a lot of memory bandwidth and to
> > cause high latencies. Maybe using AVX and friends to perform large
> > batches of writes could help further increase latency by filling
> > write buffers.
> 
> If I knew what data was involved in the bug, then yes, the occasional
> atomic add of zero to that data might be quite useful.

I meant doing it at random places in large batches to flush massive
amounts of cache lines and force irregular latencies that might disturb
communications between some cores.

> > > > I've been wondering if it's possible to change the tick rate inside KVM,
> > > > or even program a dummy device to deliver very fast IRQs but I haven't
> > > > found such a thing for now. I've been thinking about NMIs as well. The
> > > > idea would be that there could be a higher chance of interrupting
> > > > processing at any random place using fast interrupts (>100k/s) than by
> > > > using context switching caused by scheduler slots being depleted. Maybe
> > > > one only needs to hack an existing device emulator to flood an IRQ :-/
> > > 
> > > Or have a bug in the timekeeping code!  ;-)
> > 
> > Not sure how to achieve that, but maybe :-)
> 
> Given that the 2,199.0-second stall came as quite a surprise to the
> timekeeping folks, you are not the only one unsure as to how to achieve
> that.  In their defense, it took me quite some time to accept what was
> happening.

I guess you already noticed, but just in case you didn't I dare to
mention it anyway, that 2199.0 seconds is almost exactly 2^31 * 1024 ns
(or 2^41 ns). That sounds incredibly suspicious for a discrete value
when it can remind an integer overflow somewhere. Just saying, as I
don't know the code involved :-)

Willy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making races happen more often
  2021-09-30 21:12             ` Willy Tarreau
@ 2021-10-07  0:44               ` Paul E. McKenney
  2021-10-07 17:53                 ` Willy Tarreau
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2021-10-07  0:44 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: rcu

On Thu, Sep 30, 2021 at 11:12:18PM +0200, Willy Tarreau wrote:

Apologies for the delay!  I was involved in a bit of a time sink:
https://paulmck.livejournal.com/62436.html

> On Thu, Sep 30, 2021 at 11:46:23AM -0700, Paul E. McKenney wrote:
> > > > So what rcutorture would want to do here is to assign the first four-CPU
> > > > guest OS to CPUs 0, 1, 8, and 9, right?
> > > 
> > > Interesting idea, I didn't think about that. I would then do even better
> > > if you use 4 CPUs. Here, as can be seen in the low latency numbers between
> > > adjacent CPUs of the same pair, they're both threads of the same core. So
> > > 0,1 are core 0 of die 0, and 8,9 are core 0 of die 1. Thus I would assign
> > > two threads from the same core on an average core (either core 0 or 3 of
> > > one die), and one thread of a fast core and one thread of a slow core of
> > > the other die. That would give for example CPUs 0,1,10,12. This is how I
> > > can imagine the highest unfairness between threads for your use case. A
> > > variant could be to use CPU 2 instead of 10 (the low latency core).
> > 
> > Do CPUs 1-7 and 8-15 show up in different nodes in sysfs?  If so, this
> > case is already covered within rcutorture.  ;-)
> 
> No, they don't:
> 
> # lscpu -e
> CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ    MINMHZ
>   0    0      0    0 0:0:0:0          yes 4000.0000 2200.0000
>   1    0      0    0 0:0:0:0          yes 4000.0000 2200.0000
>   2    0      0    1 1:1:1:0          yes 4000.0000 2200.0000
>   3    0      0    1 1:1:1:0          yes 4000.0000 2200.0000
>   4    0      0    2 2:2:2:0          yes 4000.0000 2200.0000
>   5    0      0    2 2:2:2:0          yes 4000.0000 2200.0000
>   6    0      0    3 3:3:3:0          yes 4000.0000 2200.0000
>   7    0      0    3 3:3:3:0          yes 4000.0000 2200.0000
>   8    0      0    4 4:4:4:1          yes 4000.0000 2200.0000
>   9    0      0    4 4:4:4:1          yes 4000.0000 2200.0000
>  10    0      0    5 5:5:5:1          yes 4000.0000 2200.0000
>  11    0      0    5 5:5:5:1          yes 4000.0000 2200.0000
>  12    0      0    6 6:6:6:1          yes 4000.0000 2200.0000
>  13    0      0    6 6:6:6:1          yes 4000.0000 2200.0000
>  14    0      0    7 7:7:7:1          yes 4000.0000 2200.0000
>  15    0      0    7 7:7:7:1          yes 4000.0000 2200.0000
> 
> The only difference you can see is the L3 appearing as different. Maybe
> you could modify rcutorture to consider node+l3 as a distinct node ?

You know, it might be easier to parse "lscpu -e" output than doing
it the way I currently do.  It should not be all that hard to take
the first of NODE, SOCKET, and L3 that is non-constant.

> > > > It would actually do that if
> > > > the /sys files indicate that CPUs 0-7 are on one node and CPUs 8-15
> > > > on another.
> > > 
> > > Yep. Also, please note that my tests on a Xeon W2145 (8 cores 16 threads
> > > as well, just organized differently) didn't show any single difference
> > > between the cache access patterns of any core. Thus at best you could
> > > hope for sharing two threads of a core and two other cores, but that's
> > > all. I haven't tested on an EPYC yet but I'd be interested in seeing if
> > > recent AMDs exhibit similar imbalance between their cores as their
> > > ancestors.
> > 
> > I would expect quite a bit of variety.
> 
> We should get one in a few days-weeks so I expect to discover more stuff
> and hopefully more bugs in my code.

;-) ;-) ;-)

> > > > But I completely agree that cmpxchg operations on highly contended
> > > > memory locations can expose all sorts of other issues.
> > > 
> > > Yep, they're "interesting" because the operation in itself is fundamentally
> > > flawed and is used as the most inefficient building block for plenty of
> > > other ones that emphasize bad patterns, starting with the opposite of
> > > cmpxchg which would only replace a value if it's greater or lower for
> > > example, or only replace it if not NULL. Due to the way it works,
> > > slowdowns spiral down into a complete stall very quickly, and exponential
> > > backoff that one might want to rely on usually further degrades the
> > > situation since it requires to re-read before retrying.
> > 
> > Indeed!  That line of reasoning led to the Itanium ld.bias instruction
> > being my fault.  ;-)
> 
> I didn't know this feature. I suspect you could achieve the same using
> cmpxchg(ptr,x,x), which will perform a write only if the area contains
> magic value "x" to replace it with itself (thus put an unlikely one),
> while other values will only perform the bus first cycle and will return
> the value in x. On x86 (at least) it's documented that the load phase of
> cmpxchg uses a bus write cycle, and arguably it's understandable that
> nobody wants to start a cmpxchg without having exclusive ownership of
> the cache line. So that should achieve your ld.bias on modern CPUs in
> case you miss it, with a single bus cycle vs two for an atomic add :-)

Me, I prefer to work hard to reduce memory contention.  ;-)

> Do you think there could be some value in using this after a backoff
> to reread an old value before retrying a CAS ? I'm just thinking that
> if a cache controller holds the line a bit longer when it has exclusive
> access there might be a chance of greater success, but I never tested
> because I never thought about this. Now I feel like I'll have something
> to play with this week-end.

There are many heuristics, and their effectiveness depends on the
hardware, workload, and who knows what all else.

> > And that is exactly why good concurrent algorithms maintain low
> > contention.  And I agree that this is more of an issue with cmpxchg than
> > (say) xchg, xadd, and so on.  Assuming competent hardware, anyway.  :-/
> 
> That's exactly why I designed my upgradable locks around xadd a few years
> ago! A single operation changes an unknown state to a known set of possible
> states and returns the previous one. You can even rely on the wrapping
> overflow to drop certain bits if needed.

Nice!

> > > My understanding is that they're connected like this:
> > > 
> > >       CPU    CPU    CPU    CPU        CPU    CPU    CPU    CPU
> > >       0,1    2,3    4,5    6,7        8,9   10,11  12,13  14,15
> > >     +----+ +----+ +----+ +----+     +----+ +----+ +----+ +----+
> > >     |core| |core| |core| |core|     |core| |core| |core| |core|
> > >     |  0 | |  1 | |  2 | |  3 |     |  4 | |  5 | |  6 | |  7 |
> > >     +----+ +----+ +----+ +----+     +----+ +----+ +----+ +----+
> > >        |      |      |      |          |      |      |      |
> > >      +--+   +--+   +--+   +--+       +--+   +--+   +--+   +--+
> > >      |L2|   |L2|   |L2|   |L2|       |L2|   |L2|   |L2|   |L2|
> > >      +--+   +--+   +--+   +--+       +--+   +--+   +--+   +--+
> > >        |      |      |      |          |      |      |      |
> > >     +--------------------------+   +---------------------------+
> > >   #=|          L3 1/2          |===|           L3 2/2          |=#
> > >   # +--------------------------+   +---------------------------+ #
> > >   #                                                              #
> > >   #==============================================================#
> > 
> > Aha!  You cannot completely hide your hardware structure because your
> > latencies will tell on you!  ;-)
> 
> :-)  That reminds me that when I was a kid I used to distinguish
> serial port controller brands by monitoring their status bits during
> a soft reset, and I noticed they would not all come up in the exact
> same sequence depending on the manufacturer so I could proudly tell
> my friends without opening their PC "that's intel or NS or Winbond"
> (they wouldn't open to verify anyway). Fun times.

;-) ;-) ;-)

> > > It's then easy to imagine an extra latency for cores 1 and 2 there,
> > > with 2 being more impacted by two slots before it.
> > 
> > Indeed, beyond a certain point, it is necessary to talk to someone
> > who actually knows the hardware structure.
> 
> Sure that can help but by the time you get your info, your CPU is
> obsolete and you need to focus on another one. What matters to me
> are two goals:
>   - identify such awful patterns to make sure my code's performance
>     will not degrade too much when running on these, or at least
>     architect some options to work around them (e.g. thread groups)
> 
>   - easily detect where they exist so that I can use those machines
>     to try to reproduce issues instead of using too slick machines
>     that will never trigger issues (very similar to what you're
>     trying to do in fact)

Fair point on actually getting in touch with a hardware expert.  :-/

And good goals as well.

> > > Another thing I discovered while preparing a machine for a test was that
> > > there was an option in the BIOS proposing to automatically prefetch the
> > > next cache line when reading one. I wasn't aware that some CPUs had this
> > > option but this scared me, thinking that I'm trying to align some critical
> > > structures on 64 bytes to avoid false sharing and that some machines
> > > might be capable of still suffering from sharing between adjacent lines!
> > 
> > I have no idea whether to suggest turning off that option on the one
> > hand or aligning to 128 bytes on the other...
> 
> That was my first idea when I read that option. Then I reread and noticed
> this word "NEXT cache line", which is different from "prefetch cache line
> PAIRS" or "adjacent cachelines" etc. I suspect it's used to optimise
> memcpy() and it doesn't care about parity, so even if you align on 128,
> if you hit the second cache line first you're sure to provoke the load
> of the one that follows :-/  I have not tried to play with that yet and
> the machines are currently in use so I won't know if these have any
> impact for a while.

Ouch!

> > > I'm wondering if you couldn't just try to prefetch
> > > random lines from a very large memory location in the disturbing
> > > threads. You might manage to waste a lot of memory bandwidth and to
> > > cause high latencies. Maybe using AVX and friends to perform large
> > > batches of writes could help further increase latency by filling
> > > write buffers.
> > 
> > If I knew what data was involved in the bug, then yes, the occasional
> > atomic add of zero to that data might be quite useful.
> 
> I meant doing it at random places in large batches to flush massive
> amounts of cache lines and force irregular latencies that might disturb
> communications between some cores.

That is worth some thought!
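
A minimal sketch of such a disturber (hypothetical code, plain C rather
than AVX): each call dirties a batch of random cache lines spread over
a region much larger than any last-level cache, which burns memory
bandwidth and makes latencies irregular for whatever else shares the
machine:

    #include <stdint.h>

    #define DISTURB_SIZE (512UL << 20)   /* 512 MB, well beyond any LLC */
    #define BATCH        4096

    /* 'mem' points to a DISTURB_SIZE allocation; 'state' is a per-thread,
     * non-zero xorshift64 seed.  The read-modify-write forces each
     * touched line into exclusive state and dirties it.
     */
    static void disturb_once(volatile unsigned char *mem, uint64_t *state)
    {
        for (int i = 0; i < BATCH; i++) {
            *state ^= *state << 13;
            *state ^= *state >> 7;
            *state ^= *state << 17;
            mem[(*state % DISTURB_SIZE) & ~(uint64_t)63]++;
        }
    }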

> > > > > I've been wondering if it's possible to change the tick rate inside KVM,
> > > > > or even program a dummy device to deliver very fast IRQs but I haven't
> > > > > found such a thing for now. I've been thinking about NMIs as well. The
> > > > > idea would be that there could be a higher chance of interrupting
> > > > > processing at any random place using fast interrupts (>100k/s) than by
> > > > > using context switching caused by scheduler slots being depleted. Maybe
> > > > > one only needs to hack an existing device emulator to flood an IRQ :-/
> > > > 
> > > > Or have a bug in the timekeeping code!  ;-)
> > > 
> > > Not sure how to achieve that, but maybe :-)
> > 
> > Given that the 2,199.0-second stall came as quite a surprise to the
> > timekeeping folks, you are not the only one unsure as to how to achieve
> > that.  In their defense, it took me quite some time to accept what was
> > happening.
> 
> I guess you already noticed, but just in case you didn't I dare to
> mention it anyway, that 2199.0 seconds is almost exactly 2^31 * 1024 ns
> (or 2^41 ns). That sounds incredibly suspicious for a discrete value
> when it can remind an integer overflow somewhere. Just saying, as I
> don't know the code involved :-)

Actually, I hadn't thought that through.  I knew that it had to be an
overflow somewhere in timekeeping, and that once I pinned down the
location of the failure, that I would be talking to someone in that
area.  ;-)

But the stall times were all a bit over 2,199.02 seconds, so I do
believe that you are quite correct!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Making races happen more often
  2021-10-07  0:44               ` Paul E. McKenney
@ 2021-10-07 17:53                 ` Willy Tarreau
  0 siblings, 0 replies; 10+ messages in thread
From: Willy Tarreau @ 2021-10-07 17:53 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

Hi Paul!

On Wed, Oct 06, 2021 at 05:44:37PM -0700, Paul E. McKenney wrote:
> On Thu, Sep 30, 2021 at 11:12:18PM +0200, Willy Tarreau wrote:
> 
> Apologies for the delay!

No problem, that lets me think and analyse between exchanges :-)

>  I was involved in a bit of a time sink:
> https://paulmck.livejournal.com/62436.html

Very interesting and detailed. I'm far from having read it all yet
but that will certainly be useful whatever the outcome of Rust in the
kernel is.

> > > Do CPUs 1-7 and 8-15 show up in different nodes in sysfs?  If so, this
> > > case is already covered within rcutorture.  ;-)
> > 
> > No, they don't:
> > 
> > # lscpu -e
> > CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ    MINMHZ
> >   0    0      0    0 0:0:0:0          yes 4000.0000 2200.0000
> >   1    0      0    0 0:0:0:0          yes 4000.0000 2200.0000
> >   2    0      0    1 1:1:1:0          yes 4000.0000 2200.0000
> >   3    0      0    1 1:1:1:0          yes 4000.0000 2200.0000
> >   4    0      0    2 2:2:2:0          yes 4000.0000 2200.0000
> >   5    0      0    2 2:2:2:0          yes 4000.0000 2200.0000
> >   6    0      0    3 3:3:3:0          yes 4000.0000 2200.0000
> >   7    0      0    3 3:3:3:0          yes 4000.0000 2200.0000
> >   8    0      0    4 4:4:4:1          yes 4000.0000 2200.0000
> >   9    0      0    4 4:4:4:1          yes 4000.0000 2200.0000
> >  10    0      0    5 5:5:5:1          yes 4000.0000 2200.0000
> >  11    0      0    5 5:5:5:1          yes 4000.0000 2200.0000
> >  12    0      0    6 6:6:6:1          yes 4000.0000 2200.0000
> >  13    0      0    6 6:6:6:1          yes 4000.0000 2200.0000
> >  14    0      0    7 7:7:7:1          yes 4000.0000 2200.0000
> >  15    0      0    7 7:7:7:1          yes 4000.0000 2200.0000
> > 
> > The only difference you can see is the L3 appearing as different. Maybe
> > you could modify rcutorture to consider node+l3 as a distinct node ?
> 
> You know, it might be easier to parse "lscpu -e" output than doing
> it the way I currently do.  It should not be all that hard to take
> the first of NODE, SOCKET, and L3 that is non-constant.

I'm not that sure, because the output format will depend on the
machine it's run on and the entries found in /sys. For example
on an ARMv8 (S922X, 4*A73+2*A53) :

  $ lscpu -e
  CPU SOCKET CORE ONLINE    MAXMHZ   MINMHZ
    0      0    0    yes 2016.0000 500.0000
    1      0    1    yes 2016.0000 500.0000
    2      1    2    yes 2400.0000 500.0000
    3      1    3    yes 2400.0000 500.0000
    4      1    4    yes 2400.0000 500.0000
    5      1    5    yes 2400.0000 500.0000

  $ grep '' /sys/devices/system/cpu/cpu0/topology/*
  /sys/devices/system/cpu/cpu0/topology/core_id:0
  /sys/devices/system/cpu/cpu0/topology/core_siblings:03
  /sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-1
  /sys/devices/system/cpu/cpu0/topology/physical_package_id:0
  /sys/devices/system/cpu/cpu0/topology/thread_siblings:01
  /sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0

Here there's no cpu*/cache subdir.


On another one (S905D3 = 4xA55):

  $ lscpu -e
  CPU SOCKET CORE L1d:L1i:L2 ONLINE    MAXMHZ   MINMHZ
    0      0    0 0:0:0         yes 2208.0000 100.0000
    1      0    1 1:1:0         yes 2208.0000 100.0000
    2      0    2 2:2:0         yes 2208.0000 100.0000
    3      0    3 3:3:0         yes 2208.0000 100.0000

  $ grep '' /sys/devices/system/cpu/cpu0/topology/*
  /sys/devices/system/cpu/cpu0/topology/core_cpus:1
  /sys/devices/system/cpu/cpu0/topology/core_cpus_list:0
  /sys/devices/system/cpu/cpu0/topology/core_id:0
  /sys/devices/system/cpu/cpu0/topology/core_siblings:f
  /sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-3
  /sys/devices/system/cpu/cpu0/topology/die_cpus:1
  /sys/devices/system/cpu/cpu0/topology/die_cpus_list:0
  /sys/devices/system/cpu/cpu0/topology/die_id:-1
  /sys/devices/system/cpu/cpu0/topology/package_cpus:f
  /sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-3
  /sys/devices/system/cpu/cpu0/topology/physical_package_id:0
  /sys/devices/system/cpu/cpu0/topology/thread_siblings:1
  /sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0

  $ grep '' /sys/devices/system/cpu/cpu0/cache/index*/*
  /sys/devices/system/cpu/cpu0/cache/index0/level:1
  /sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
  /sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map:1
  /sys/devices/system/cpu/cpu0/cache/index0/type:Data
  /sys/devices/system/cpu/cpu0/cache/index1/level:1
  /sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
  /sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map:1
  /sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
  /sys/devices/system/cpu/cpu0/cache/index2/level:2
  /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-3
  /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:f
  /sys/devices/system/cpu/cpu0/cache/index2/type:Unified

Thus I think it may remain easier to deal directly with /sys,
especially since you've already done most of it.
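
For what it's worth, a small sketch of the "package + L3" grouping idea
(written in C here even though the rcutorture scripts are shell): it
reads only the /sys files shown in this thread plus cache/index3, which
is where the L3 usually shows up when one exists; everything else about
the layout is an assumption:

    #include <stdio.h>
    #include <string.h>

    /* Read one small sysfs file for a given CPU; leaves "?" on failure,
     * e.g. when there is no index3 as on the A55 box above.
     */
    static void read_cpu_file(int cpu, const char *fmt, char *buf, size_t len)
    {
        char path[256];
        FILE *f;

        strcpy(buf, "?");
        snprintf(path, sizeof(path), fmt, cpu);
        f = fopen(path, "r");
        if (f) {
            if (fgets(buf, (int)len, f))
                buf[strcspn(buf, "\n")] = '\0';
            fclose(f);
        }
    }

    /* Print the (package, L3 sharing set) pair that would serve as the
     * effective "node" for CPU placement purposes.
     */
    static void print_effective_node(int cpu)
    {
        char pkg[32], l3[256];

        read_cpu_file(cpu,
                      "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                      pkg, sizeof(pkg));
        read_cpu_file(cpu,
                      "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
                      l3, sizeof(l3));
        printf("cpu%d: package %s, L3 shared with %s\n", cpu, pkg, l3);
    }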

> > > Indeed!  That line of reasoning led to the Itanium ld.bias instruction
> > > being my fault.  ;-)
> > 
> > I didn't know this feature. I suspect you could achieve the same using
> > cmpxchg(ptr,x,x), which will perform a write only if the area contains
> > magic value "x" to replace it with itself (thus put an unlikely one),
> > while other values will only perform the bus first cycle and will return
> > the value in x. On x86 (at least) it's documented that the load phase of
> > cmpxchg uses a bus write cycle, and arguably it's understandable that
> > nobody wants to start a cmpxchg without having exclusive ownership of
> > the cache line. So that should achieve your ld.bias on modern CPUs in
> > case you miss it, with a single bus cycle vs two for an atomic add :-)
> 
> Me, I prefer to work hard to reduce memory contention.  ;-)

For sure but I mean that if you *need* that, it's possible. By the
way I tested and I can confirm that cmpxchg is twice as fast as xadd
to perform a load from an exclusive line. It tends to significantly
reduce the amount of failed subsequent cmpxchg as I expected (I don't
have the numbers in mind anymore, something like 1/5 or so on avg and
peaks) but is quite slower as well on some machines (+30% on Ryzen vs
+50% for xadd, +10% on i7). It can provide a benefit but the failure
rate remains way higher than zero so it's not an awesome solution
either, I'll continue to work with my failure counters.

> > > > Another thing I discovered while preparing a machine for a test was that
> > > > there was an option in the BIOS proposing to automatically prefetch the
> > > > next cache line when reading one. I wasn't aware that some CPUs had this
> > > > option but this scared me, thinking that I'm trying to align some critical
> > > > structures on 64 bytes to avoid false sharing and that some machines
> > > > might be capable of still suffering from sharing between adjacent lines!
> > > 
> > > I have no idea whether to suggest turning off that option on the one
> > > hand or aligning to 128 bytes on the other...
> > 
> > That was my first idea when I read that option. Then I reread and noticed
> > this word "NEXT cache line", which is different from "prefetch cache line
> > PAIRS" or "adjacent cachelines" etc. I suspect it's used to optimise
> > memcpy() and it doesn't care about parity, so even if you align on 128,
> > if you hit the second cache line first you're sure to provoke the load
> > of the one that follows :-/  I have not tried to play with that yet and
> > the machines are currently in use so I won't know if these have any
> > impact for a while.
> 
> Ouch!

Yeah, that's why I'm keeping that one on my mid-term todo list.

> > > > I'm wondering if you couldn't just try to prefetch
> > > > random lines from a very large memory location in the disturbing
> > > > threads. You might manage to waste a lot of memory bandwidth and to
> > > > cause high latencies. Maybe using AVX and friends to perform large
> > > > batches of writes could help further increase latency by filling
> > > > write buffers.
> > > 
> > > If I knew what data was involved in the bug, then yes, the occasional
> > > atomic add of zero to that data might be quite useful.
> > 
> > I meant doing it at random places in large batches to flush massive
> > amounts of cache lines and force irregular latencies that might disturb
> > communications between some cores.
> 
> That is worth some thought!

On that subject, I've been wondering about something that bothers me.

Please just keep in mind that I'm really incompetent on the various
memory models and quickly get lost in the terminology, but for having
designed write buffers for multi-core cpus a long time ago and having
seen diagrams of various models that look like what I know, I think I
have the background to understand the model at the hardware level.

My understanding of the x86-TSO model is that it guarantees that another
CPU will see the writes in the same order they were performed (or more
precisely will not see them reversed). But what if we modify a shared
variable (lock, counter, etc) inside a function call like this:

   int take_ref(int *ptr)
   {
	return xadd(ptr, 1);
   }

   int foo()
   {
        ...
        ref = take_ref(&counter);
   }

The calling function will push the return address onto the local stack,
thus performing a memory write to the L1 cache, but a memory write
nonetheless. Then it will atomically increment the shared counter, then
return.

If another processor competes with this one on the counter, requiring
an access to its cache line should cause the first processor's cache to
first flush the cache line that contains the part of the stack where the
return address was written, no? If so, that would mean that performing
certain operations on shared data inside non-inlined functions might
occasionally bring a big overhead consisting of flushing a totally
unrelated cache line at the same time. The same could be said for local
variables that the compiler spills onto the stack when it runs out of
registers, by the way. Or are there any special attributes on the SS
segment to indicate that writes to the stack are totally relaxed?

I'm a bit confused on this subject because I'm torn between "it's not
possible that it's that ugly, I must be missing something obvious" and
"except via a dedicated segment register how could that be implemented
given that there's no such thing as a relaxed write on x86?". Any
hint or pointer to my mistake would be much appreciated.

> > I guess you already noticed, but just in case you didn't I dare to
> > mention it anyway, that 2199.0 seconds is almost exactly 2^31 * 1024 ns
> > (or 2^41 ns). That sounds incredibly suspicious for a discrete value
> > when it can remind an integer overflow somewhere. Just saying, as I
> > don't know the code involved :-)
> 
> Actually, I hadn't thought that through.  I knew that it had to be an
> overflow somewhere in timekeeping, and that once I pinned down the
> location of the failure, that I would be talking to someone in that
> area.  ;-)
> 
> But the stall times were all a bit over 2,199.02 seconds, so I do
> believe that you are quite correct!

Were the stalls really observed, or could it be a measurement error
that wrongly reports a stall based on the invalid calculation? We're
talking about orders of magnitude where everything seems possible :-/

Willy

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-10-07 18:28 UTC | newest]

Thread overview: 10+ messages
2021-09-26  3:51 Making races happen more often Paul E. McKenney
2021-09-26  4:41 ` Willy Tarreau
2021-09-27  3:25   ` Paul E. McKenney
2021-09-27  6:49     ` Willy Tarreau
2021-09-28  4:01       ` Paul E. McKenney
2021-09-29 19:59         ` Willy Tarreau
2021-09-30 18:46           ` Paul E. McKenney
2021-09-30 21:12             ` Willy Tarreau
2021-10-07  0:44               ` Paul E. McKenney
2021-10-07 17:53                 ` Willy Tarreau
