Date: Wed, 6 Oct 2021 17:44:37 -0700
From: "Paul E. McKenney"
To: Willy Tarreau
Cc: rcu@vger.kernel.org
Subject: Re: Making races happen more often
Message-ID: <20211007004437.GG880162@paulmck-ThinkPad-P17-Gen-1>
Reply-To: paulmck@kernel.org
References: <20210926035103.GA1953486@paulmck-ThinkPad-P17-Gen-1>
 <20210926044127.GA11326@1wt.eu>
 <20210927032517.GR880162@paulmck-ThinkPad-P17-Gen-1>
 <20210927064915.GA20117@1wt.eu>
 <20210928040147.GA880162@paulmck-ThinkPad-P17-Gen-1>
 <20210929195908.GA9405@1wt.eu>
 <20210930184623.GI880162@paulmck-ThinkPad-P17-Gen-1>
 <20210930211218.GA18669@1wt.eu>
In-Reply-To: <20210930211218.GA18669@1wt.eu>

On Thu, Sep 30, 2021 at 11:12:18PM +0200, Willy Tarreau wrote:

Apologies for the delay!  I was involved in a bit of a time sink:

https://paulmck.livejournal.com/62436.html

> On Thu, Sep 30, 2021 at 11:46:23AM -0700, Paul E. McKenney wrote:
> > > > So what rcutorture would want to do here is to assign the first four-CPU
> > > > guest OS to CPUs 0, 1, 8, and 9, right?
> > >
> > > Interesting idea, I didn't think about that. I would then do even better
> > > if you use 4 CPUs. Here, as can be seen in the low latency numbers between
> > > adjacent CPUs of the same pair, they're both threads of the same core. So
> > > 0,1 are core 0 of die 0, and 8,9 are core 0 of die 1. Thus I would assign
> > > two threads from the same core on an average core (either core 0 or 3 of
> > > one die), and one thread of a fast core and one thread of a slow core of
> > > the other die. That would give for example CPUs 0,1,10,12. This is how I
> > > can imagine the highest unfairness between threads for your use case. A
> > > variant could be to use CPU 2 instead of 10 (the low latency core).
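For illustration only, a minimal userspace sketch of pinning a process
onto that kind of deliberately unfair CPU set might look as follows.
The CPU numbers 0, 1, 10, and 12 are just the example above, and
rcutorture itself places its guest CPUs by other means:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch: pin the current process to two SMT siblings of one core
 * plus one thread each from a fast and a slow core of the other die.
 */
int main(void)
{
        static const int cpus[] = { 0, 1, 10, 12 };
        cpu_set_t set;
        size_t i;

        CPU_ZERO(&set);
        for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
                CPU_SET(cpus[i], &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return EXIT_FAILURE;
        }
        printf("pinned to CPUs 0,1,10,12\n");
        return EXIT_SUCCESS;
}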
> >
> > Do CPUs 1-7 and 8-15 show up in different nodes in sysfs?  If so, this
> > case is already covered within rcutorture.  ;-)
>
> No, they don't:
>
> # lscpu -e
> CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ    MINMHZ
>   0    0      0    0 0:0:0:0          yes 4000.0000 2200.0000
>   1    0      0    0 0:0:0:0          yes 4000.0000 2200.0000
>   2    0      0    1 1:1:1:0          yes 4000.0000 2200.0000
>   3    0      0    1 1:1:1:0          yes 4000.0000 2200.0000
>   4    0      0    2 2:2:2:0          yes 4000.0000 2200.0000
>   5    0      0    2 2:2:2:0          yes 4000.0000 2200.0000
>   6    0      0    3 3:3:3:0          yes 4000.0000 2200.0000
>   7    0      0    3 3:3:3:0          yes 4000.0000 2200.0000
>   8    0      0    4 4:4:4:1          yes 4000.0000 2200.0000
>   9    0      0    4 4:4:4:1          yes 4000.0000 2200.0000
>  10    0      0    5 5:5:5:1          yes 4000.0000 2200.0000
>  11    0      0    5 5:5:5:1          yes 4000.0000 2200.0000
>  12    0      0    6 6:6:6:1          yes 4000.0000 2200.0000
>  13    0      0    6 6:6:6:1          yes 4000.0000 2200.0000
>  14    0      0    7 7:7:7:1          yes 4000.0000 2200.0000
>  15    0      0    7 7:7:7:1          yes 4000.0000 2200.0000
>
> The only difference you can see is the L3 appearing as different. Maybe
> you could modify rcutorture to consider node+l3 as a distinct node ?

You know, it might be easier to parse "lscpu -e" output than doing it
the way I currently do.  It should not be all that hard to take the
first of NODE, SOCKET, and L3 that is non-constant.

> > > > It would actually do that if
> > > > the /sys files indicate that CPUs 0-7 are on one node and CPUs 8-15
> > > > on another.
> > >
> > > Yep. Also, please note that my tests on a Xeon W2145 (8 cores 16 threads
> > > as well, just organized differently) didn't show any single difference
> > > between the cache access patterns of any core. Thus at best you could
> > > hope for sharing two threads of a core and two other cores, but that's
> > > all. I haven't tested on an EPYC yet but I'd be interested in seeing if
> > > recent AMDs exhibit similar imbalance between their cores as their
> > > ancestors.
> >
> > I would expect quite a bit of variety.
>
> We should get one in a few days-weeks so I expect to discover more stuff
> and hopefully more bugs in my code. ;-)

;-) ;-) ;-)

> > > > But I completely agree that cmpxchg operations on highly contended
> > > > memory locations can expose all sorts of other issues.
> > >
> > > Yep, they're "interesting" because the operation in itself is fundamentally
> > > flawed and is used as the most inefficient building block for plenty of
> > > other ones that emphasize bad patterns, starting with the opposite of
> > > cmpxchg which would only replace a value if it's greater or lower for
> > > example, or only replace it if not NULL. Due to the way it works,
> > > slowdowns spiral down into a complete stall very quickly, and exponential
> > > backoff that one might want to rely on usually further degrades the
> > > situation since it requires a re-read before retrying.
> >
> > Indeed!  That line of reasoning led to the Itanium ld.bias instruction
> > being my fault.  ;-)
>
> I didn't know this feature. I suspect you could achieve the same using
> cmpxchg(ptr,x,x), which will perform a write only if the area contains
> magic value "x" to replace it with itself (thus put an unlikely one),
> while other values will only perform the first bus cycle and will return
> the value in x. On x86 (at least) it's documented that the load phase of
> cmpxchg uses a bus write cycle, and arguably it's understandable that
> nobody wants to start a cmpxchg without having exclusive ownership of
> the cache line. So that should achieve your ld.bias on modern CPUs in
> case you miss it, with a single bus cycle vs two for an atomic add :-)

Me, I prefer to work hard to reduce memory contention.  ;-)
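To make the trick above concrete, a minimal sketch using GCC's __atomic
builtins rather than the kernel's cmpxchg() might look like this; the
"unlikely" magic value is purely an assumption of the example:

#include <stdint.h>

/*
 * Sketch of the cmpxchg(ptr, x, x) "biased load" described above: pick
 * a value the location is unlikely to hold and try to replace it with
 * itself.  In the common (failure) case the CAS returns the current
 * value after having requested the line; in the unlikely success case
 * it rewrites the same value, so memory never changes either way.
 */
static inline uint64_t read_with_exclusive_hint(uint64_t *ptr)
{
        uint64_t x = 0xdeadbeefcafef00dULL;     /* assumed-unlikely magic */

        __atomic_compare_exchange_n(ptr, &x, x, 0,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
        return x;       /* on failure, x now holds the observed value */
}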
> Do you think there could be some value in using this after a backoff
> to reread an old value before retrying a CAS ? I'm just thinking that
> if a cache controller holds the line a bit longer when it has exclusive
> access there might be a chance of greater success, but I never tested
> because I never thought about this. Now I feel like I'll have something
> to play with this week-end.

There are many heuristics, and their effectiveness depends on the
hardware, workload, and who knows what all else.

> > And that is exactly why good concurrent algorithms maintain low
> > contention. And I agree that this is more of an issue with cmpxchg than
> > (say) xchg, xadd, and so on. Assuming competent hardware, anyway. :-/
>
> That's exactly why I designed my upgradable locks around xadd a few years
> ago! A single operation changes an unknown state to a known set of possible
> states and returns the previous one. You can even rely on the wrapping
> overflow to drop certain bits if needed.

Nice!

> > > My understanding is that they're connected like this:
> > >
> > >      CPU    CPU    CPU    CPU    CPU    CPU    CPU    CPU
> > >      0,1    2,3    4,5    6,7    8,9   10,11  12,13  14,15
> > >     +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
> > >     |core| |core| |core| |core| |core| |core| |core| |core|
> > >     | 0  | | 1  | | 2  | | 3  | | 4  | | 5  | | 6  | | 7  |
> > >     +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
> > >        |      |      |      |      |      |      |      |
> > >      +--+   +--+   +--+   +--+   +--+   +--+   +--+   +--+
> > >      |L2|   |L2|   |L2|   |L2|   |L2|   |L2|   |L2|   |L2|
> > >      +--+   +--+   +--+   +--+   +--+   +--+   +--+   +--+
> > >        |      |      |      |      |      |      |      |
> > >   +--------------------------+   +---------------------------+
> > > #=|          L3 1/2          |===|           L3 2/2           |=#
> > > # +--------------------------+   +---------------------------+ #
> > > #                                                              #
> > > #==============================================================#
> >
> > Aha!  You cannot completely hide your hardware structure because your
> > latencies will tell on you!  ;-)
>
> :-) That reminds me that when I was a kid I used to distinguish
> serial port controller brands by monitoring their status bits during
> a soft reset, and I noticed they would not all come up in the exact
> same sequence depending on the manufacturer so I could proudly tell
> my friends without opening their PC "that's intel or NS or Winbond"
> (they wouldn't open to verify anyway). Fun times.

;-) ;-) ;-)

> > > It's then easy to imagine an extra latency for cores 1 and 2 there,
> > > with 2 being more impacted by two slots before it.
> >
> > Indeed, beyond a certain point, it is necessary to talk to someone
> > who actually knows the hardware structure.
>
> Sure that can help but by the time you get your info, your CPU is
> obsolete and you need to focus on another one. What matters to me
> are two goals:
> - identify such awful patterns to make sure my code's performance
>   will not degrade too much when running on these, or at least
>   architect some options to work around them (e.g. thread groups)
>
> - easily detect where they exist so that I can use those machines
>   to try to reproduce issues instead of using too slick machines
>   that will never trigger issues (very similar to what you're
>   trying to do in fact)

Fair point on actually getting in touch with a hardware expert.  :-/

And good goals as well.

> > > Another thing I discovered while preparing a machine for a test was that
> > > there was an option in the BIOS proposing to automatically prefetch the
> > > next cache line when reading one. I wasn't aware that some CPUs had this
> > > option but this scared me, thinking that I'm trying to align some critical
> > > structures on 64 bytes to avoid false sharing and that some machines
> > > might be capable of still suffering from sharing between adjacent lines!
> >
> > I have no idea whether to suggest turning off that option on the one
> > hand or aligning to 128 bytes on the other...
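If one did go the 128-byte route, a minimal sketch is below.  The
structure name is made up for the example, and whether doubling the
padding actually defeats a "next line" prefetcher is exactly the open
question here:

#include <stdatomic.h>

/*
 * Sketch: pad each independently-updated hot counter out to 128 bytes,
 * i.e. two 64-byte lines, so that a prefetch of the line adjacent to
 * the one being written does not pull in another CPU's counter.
 */
struct padded_counter {
        _Atomic unsigned long value;
        char pad[128 - sizeof(_Atomic unsigned long)];
} __attribute__((aligned(128)));

static struct padded_counter per_cpu_counter[16];  /* one per CPU 0-15 */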
I wasn't aware that some CPUs had this > > > option but this scared me, thinking that I'm trying to align some critical > > > structures on 64 bytes to avoid false sharing and that some machines > > > might be capable of still suffering from sharing between adjacent lines! > > > > I have no idea whether to suggest turning off that option on the one > > hand or aligning to 128 bytes on the other... > > That was my first idea when I read that option. Then I reread and noticed > this word "NEXT cache line", which is different from "prefetch cache line > PAIRS" or "adjacent cachelines" etc. I suspect it's used to optimise > memcpy() and it doesn't care about parity, so even if you align on 128, > if you hit the second cache line first you're sure to provoke the load > of the one that follows :-/ I have not tried to play with that yet and > the machines are currently in use so I won't know if these have any > impact for a while. Ouch! > > > I'm wondering if you couldn't just try to prefetch > > > random lines from a very large memory location in the disturbing > > > threads. You might manage to waste a lot of memory bandwidth and to > > > cause high latencies. Maybe using AVX and friends to perform large > > > batches of writes could help further increase latency by filling > > > write buffers. > > > > If I knew what data was involved in the bug, then yes, the occasional > > atomic add of zero to that data might be quite useful. > > I meant doing it at random places in large batches to flush massive > amounts of cache lines and force irregular latencies that might disturb > communications between some cores. That is worth some thought! > > > > > I've been wondering if it's possible to change the tick rate inside KVM, > > > > > or even program a dummy device to deliver very fast IRQs but I haven't > > > > > found such a thing for now. I've been thinking about NMIs as well. The > > > > > idea would be that there could be a higher chance of interrupting > > > > > processing at any random place using fast interrupts (>100k/s) than by > > > > > using context switching caused by scheduler slots being depleted. Maybe > > > > > one only needs to hack an existing device emulator to flood an IRQ :-/ > > > > > > > > Or have a bug in the timekeeping code! ;-) > > > > > > Not sure how to achieve that, but maybe :-) > > > > Given that the 2,199.0-second stall came as quite a surprise to the > > timekeeping folks, you are not the only one unsure as to how to achieve > > that. In their defense, it took me quite some time to accept what was > > happening. > > I guess you already noticed, but just in case you didn't I dare to > mention it anyway, that 2199.0 seconds is almost exactly 2^31 * 1024 ns > (or 2^41 ns). That sounds incredibly suspicious for a discrete value > when it can remind an integer overflow somewhere. Just saying, as I > don't know the code involved :-) Actually, I hadn't thought that through. I knew that it had to be an overflow somewhere in timekeeping, and that once I pinned down the location of the failure, that I would be talking to someone in that area. ;-) But the stall times were all a bit over 2,199.02 seconds, so I do believe that you are quite correct! Thanx, Paul