On Thu, 21 Feb 2019 09:49:13 PST (-0800), paul@pwsan.com wrote:
>
> On Thu, 21 Feb 2019, Björn Töpel wrote:
>
>> On Thu, 21 Feb 2019 at 17:28, Paul Walmsley wrote:
>> >
>> > On Wed, 20 Feb 2019, Björn Töpel wrote:
>> >
>> > > Christopher mentioned "per-cpu atomicity" in his Plumbers RISC-V HPC
>> > > talk [1]. Typically, this would be for the per-cpu implementation in
>> > > Linux, so I had a look at the current implementation.
>> > >
>> > > I noticed that RISC-V currently uses the generic one. Has there been
>> > > any work on moving to a RISC-V-specific implementation, based on the
>> > > A extension, similar to what aarch64 does? Would that make sense?
>> >
>> > Yes, doing a RISC-V implementation with the AMO instructions would make
>> > sense. Care to take that on?
>> >
>>
>> Yeah, I'll take a stab at it! I guess AMO is in general favored over
>> the LL/SC path, correct?
>
> Yes. I think our current plan is to only support cores with the
> A extension in upstream Linux. Between LR/SC and atomics, the atomics
> would be preferred. As Christoph wrote, using an atomic add avoids the
> fail/retry risk inherent in an LR/SC sequence, and is easier to optimize
> in microarchitecture.

We already mandate the A extension in Linux, and there are already a
handful of places where the kernel won't build without it. We used to
have support for SMP=n systems without the A extension, but as part of
the glibc upstream port submission we removed that support because it
wasn't done the right way. We've always sort of left the door open to
bringing that back, but nobody has expressed interest in doing the
legwork yet. That said, it's all a moot point here: the LR/SC
instructions are part of the A extension, so if you have those you have
the AMOs as well.

The general policy is that if there is an AMO that efficiently
implements what you're looking for, it should be preferred over an LR/SC
sequence. This specific case is actually one of the places where AMOs
have a big advantage over LR/SC sequences, with the AMO implementation
looking roughly like

    li       t0, 1
    la       t1, counter
    amoadd.w zero, t0, 0(t1)

The amoadd increments the counter without writing the old value back
into a register (rd is zero), and since the AQ and RL bits aren't set it
enforces the minimum possible memory ordering constraints: the update is
still atomic and coherent, but it doesn't order the surrounding loads
and stores.

Like Paul said, this is way easier for a microarchitecture to deal with
than the equivalent LR/SC sequence, which would look something like:

        la   t0, counter
    1:
        lr.w t1, 0(t0)
        addi t1, t1, 1
        sc.w t1, t1, 0(t0)
        bnez t1, 1b

Specifically, the LR/SC version requires materializing the old value of
the counter in t1 at some point during execution, even if it's later
overwritten. Converting that sequence into a fire-and-forget cache
operation seems like it would be much harder to do.

Splitting the counters out onto separate cache lines should be standard
stuff; I don't think there's anything RISC-V specific there. The ISA
doesn't mandate a cache line size, but IIRC we have a macro floating
around that matches the current implementations, which is good enough
for now.

Thanks for looking into this!
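
For what it's worth, here's a rough sketch of the shape I'd expect the
arch hook to take, loosely modelled on what arm64 does in its
asm/percpu.h and on our existing amoadd-based atomics in asm/atomic.h.
The helper name is made up and this isn't a worked-out patch, just an
illustration of the idea:

    #include <linux/percpu-defs.h>
    #include <linux/preempt.h>
    #include <linux/types.h>

    /* Illustrative helper: add 'val' to a 32-bit per-cpu slot with a
     * relaxed AMO.  rd = zero discards the old value, and leaving the
     * AQ/RL bits clear keeps the ordering relaxed, just like the
     * amoadd example above. */
    static inline void __riscv_percpu_add_32(u32 *ptr, u32 val)
    {
            __asm__ __volatile__ (
                    "amoadd.w zero, %1, %0"
                    : "+A" (*ptr)
                    : "r" (val)
                    : "memory");
    }

    /* Hypothetical wiring into the override hook that
     * asm-generic/percpu.h leaves open for this_cpu_add_4(). */
    #define this_cpu_add_4(pcp, val)                                    \
    do {                                                                \
            preempt_disable();                                          \
            __riscv_percpu_add_32((u32 *)raw_cpu_ptr(&(pcp)),           \
                                  (u32)(val));                          \
            preempt_enable();                                           \
    } while (0)

Because the read-modify-write is a single atomic instruction, the
generic irq-save/restore dance isn't needed; disabling preemption is
enough to keep "this CPU" meaningful for the duration of the operation,
which is essentially the same trick arm64 relies on.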
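
And on splitting the counters across cache lines, the generic
annotations should be all that's needed; the alignment bottoms out in
the L1_CACHE_BYTES definition in arch/riscv/include/asm/cache.h.
Something like this (again just an illustration, the struct name is
made up):

    #include <linux/cache.h>

    /* Pad each counter out to its own cache line so writers on
     * different CPUs don't false-share.  ____cacheline_aligned_in_smp
     * aligns the type to the cache line size on SMP builds. */
    struct hot_counter {
            unsigned long count;
    } ____cacheline_aligned_in_smp;

If these end up as real per-cpu variables, there's also
DEFINE_PER_CPU_SHARED_ALIGNED, which places the variable in a
cacheline-aligned per-cpu section.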