On Thu, 21 Feb 2019 09:49:13 PST (-0800), paul@pwsan.com wrote:
>
> On Thu, 21 Feb 2019, Björn Töpel wrote:
>
>> On Thu, 21 Feb 2019 at 17:28, Paul Walmsley wrote:
>> >
>> > On Wed, 20 Feb 2019, Björn Töpel wrote:
>> >
>> > > Christopher mentioned "per-cpu atomicity" in his Plumbers RISC-V HPC
>> > > talk [1]. Typically, this would be for the per-cpu implementation in
>> > > Linux, so I had a look at the current implementation.
>> > >
>> > > I noticed that RISC-V currently uses the generic one. Has there been
>> > > any work on moving to a RISC-V-specific implementation, based on the
>> > > A extension, similar to what aarch64 does? Would that make sense?
>> >
>> > Yes, doing a RISC-V implementation with the AMO instructions would make
>> > sense. Care to take that on?
>> >
>>
>> Yeah, I'll take a stab at it! I guess AMO is in general favored over
>> the LL/SC path, correct?
>
> Yes. I think our current plan is to only support cores with the
> A extension in upstream Linux. Between LR/SC and atomics, the atomics
> would be preferred. As Christoph wrote, using an atomic add avoids the
> fail/retry risk inherent in an LR/SC sequence, and is easier to optimize
> in microarchitecture.

We already mandate the A extension in Linux, and there are already a
handful of places where the kernel won't build without it. We used to
have support for SMP=n systems without the A extension, but as part of
the glibc upstream port submission we removed that support because it
wasn't done the right way. We've always sort of left the door open to
bringing that back, but nobody has expressed interest in doing the
legwork yet. That said, it's all a moot point here: the LR/SC
instructions are part of the A extension, so if you have those you have
the AMOs as well.

The general policy is that if there is an AMO that efficiently
implements what you're looking for, it should be preferred over an LR/SC
sequence. This specific case is actually one of the places where AMOs
have a big advantage over LR/SC sequences, with the AMO implementation
looking roughly like

    li       t0, 1
    la       t1, counter
    amoadd.w zero, t0, 0(t1)

The amoadd increments the counter without writing the old value back
into a register (rd is zero), and since the AQ and RL bits aren't set it
enforces the minimum possible memory ordering constraints: the update is
still atomic and coherent, but it doesn't order the surrounding loads
and stores.

Like Paul said, this is way easier for a microarchitecture to deal with
than the equivalent LR/SC sequence, which would look something like:

        la   t0, counter
    1:
        lr.w t1, 0(t0)
        addi t1, t1, 1
        sc.w t1, t1, 0(t0)
        bnez t1, 1b

Specifically, the LR/SC version requires materializing the old value of
the counter in t1 at some point during execution, even if it's later
overwritten. Converting that sequence into a fire-and-forget cache
operation seems like it would be much harder to do.

Splitting the counters out onto separate cache lines should be standard
stuff; I don't think there's anything RISC-V specific there. The ISA
doesn't mandate a cache line size, but IIRC we have a macro floating
around that matches the current implementations, which is good enough
for now.

Thanks for looking into this!
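
For what it's worth, here's a rough sketch of the shape I'd expect the
arch hook to take, loosely modelled on what arm64 does in its
asm/percpu.h and on our existing amoadd-based atomics in asm/atomic.h.
The helper name is made up and this isn't a worked-out patch, just an
illustration of the idea:

    #include <linux/percpu-defs.h>
    #include <linux/preempt.h>
    #include <linux/types.h>

    /* Illustrative helper: add 'val' to a 32-bit per-cpu slot with a
     * relaxed AMO.  rd = zero discards the old value, and leaving the
     * AQ/RL bits clear keeps the ordering relaxed, just like the
     * amoadd example above. */
    static inline void __riscv_percpu_add_32(u32 *ptr, u32 val)
    {
            __asm__ __volatile__ (
                    "amoadd.w zero, %1, %0"
                    : "+A" (*ptr)
                    : "r" (val)
                    : "memory");
    }

    /* Hypothetical wiring into the override hook that
     * asm-generic/percpu.h leaves open for this_cpu_add_4(). */
    #define this_cpu_add_4(pcp, val)                                    \
    do {                                                                \
            preempt_disable();                                          \
            __riscv_percpu_add_32((u32 *)raw_cpu_ptr(&(pcp)),           \
                                  (u32)(val));                          \
            preempt_enable();                                           \
    } while (0)

Because the read-modify-write is a single atomic instruction, the
generic irq-save/restore dance isn't needed; disabling preemption is
enough to keep "this CPU" meaningful for the duration of the operation,
which is essentially the same trick arm64 relies on.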
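
And on splitting the counters across cache lines, the generic
annotations should be all that's needed; the alignment bottoms out in
the L1_CACHE_BYTES definition in arch/riscv/include/asm/cache.h.
Something like this (again just an illustration, the struct name is
made up):

    #include <linux/cache.h>

    /* Pad each counter out to its own cache line so writers on
     * different CPUs don't false-share.  ____cacheline_aligned_in_smp
     * aligns the type to the cache line size on SMP builds. */
    struct hot_counter {
            unsigned long count;
    } ____cacheline_aligned_in_smp;

If these end up as real per-cpu variables, there's also
DEFINE_PER_CPU_SHARED_ALIGNED, which places the variable in a
cacheline-aligned per-cpu section.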