* per-cpu thoughts
@ 2019-02-20 19:57 Björn Töpel
  2019-02-21 15:57 ` Christopher Lameter
  2019-02-21 16:28 ` Paul Walmsley
  0 siblings, 2 replies; 41+ messages in thread

From: Björn Töpel @ 2019-02-20 19:57 UTC (permalink / raw)
To: linux-riscv; +Cc: Christoph Lameter

Hi,

Christopher mentioned "per-cpu atomicity" in his Plumbers RISC-V HPC
talk [1]. Typically, this would be for the per-cpu implementation in
Linux, so I had a look at the current implementation.

I noticed that RISC-V currently uses the generic one. Has there been
any work on moving to a RISC-V specific impl, based on the
A-extensions, similar to what aarch64 does? Would that make sense?

Finally: has there been any further discussion of Christopher's input?

Thanks,
Björn

[1] https://linuxplumbersconf.org/event/2/contributions/199/attachments/117/149/2018_Plumbers_HPC_in_Risc.pdf

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv
* Re: per-cpu thoughts
  2019-02-20 19:57 per-cpu thoughts Björn Töpel
@ 2019-02-21 15:57 ` Christopher Lameter
  2019-02-21 16:28 ` Paul Walmsley
  1 sibling, 0 replies; 41+ messages in thread

From: Christopher Lameter @ 2019-02-21 15:57 UTC (permalink / raw)
To: Björn Töpel; +Cc: linux-riscv

On Wed, 20 Feb 2019, Björn Töpel wrote:

> I noticed that RISC-V currently uses the generic one. Has there been
> any work on moving to a RISC-V specific impl, based on the
> A-extensions, similar to what aarch64 does? Would that make sense?

arm64 does not really handle that effectively. Intel has a single
instruction to increment a counter without requiring full consistency
guarantees like full-blown atomics. The single instruction is important
to guarantee that the processing cannot be interrupted, and thus there
is no need to disable preemption or interrupts.

What you want is some special instruction that can be executed in a
cycle or so, and to dedicate a register to the per-cpu base address
(comparable to the segment register on Intel) so that you can do

	inc [per cpu register + offset]

in one cycle on an arbitrary hardware thread. That way you can keep
statistics effectively and without contention. The vmstat subsystem is
based on that.
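The vmstat-style scheme Christopher describes — each hardware thread increments its own slot, readers sum the slots — can be sketched in portable C11 atomics (a hypothetical illustration, not kernel code; `pcpu_inc`, `pcpu_read`, and the fixed `NR_CPUS` are our names):

```c
#include <stdatomic.h>

#define NR_CPUS 4

/* One counter slot per CPU; in the kernel these live in the per-cpu
 * area, so each writer touches only memory it owns. */
static atomic_long pcpu_counters[NR_CPUS];

/* Writer side: a single relaxed RMW on this CPU's own slot -- the
 * moral equivalent of "inc %gs:offset" on x86, or of the single
 * per-cpu increment instruction Christopher asks for. */
static void pcpu_inc(int cpu)
{
    atomic_fetch_add_explicit(&pcpu_counters[cpu], 1,
                              memory_order_relaxed);
}

/* Reader side: statistics readers tolerate slightly stale values, so
 * they just sum all the slots without taking any lock. */
static long pcpu_read(void)
{
    long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += atomic_load_explicit(&pcpu_counters[cpu],
                                    memory_order_relaxed);
    return sum;
}
```

The contention-free part is the writer: each CPU only ever performs an RMW on its own slot, so writers never share a cache line.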
* Re: per-cpu thoughts
  2019-02-20 19:57 per-cpu thoughts Björn Töpel
  2019-02-21 15:57 ` Christopher Lameter
@ 2019-02-21 16:28 ` Paul Walmsley
  2019-02-21 17:24   ` Björn Töpel
  1 sibling, 1 reply; 41+ messages in thread

From: Paul Walmsley @ 2019-02-21 16:28 UTC (permalink / raw)
To: Björn Töpel; +Cc: linux-riscv, Christoph Lameter

On Wed, 20 Feb 2019, Björn Töpel wrote:

> Christopher mentioned "per-cpu atomicity" in his Plumbers RISC-V HPC
> talk [1]. Typically, this would be for the per-cpu implementation in
> Linux, so I had a look at the current implementation.
>
> I noticed that RISC-V currently uses the generic one. Has there been
> any work on moving to a RISC-V specific impl, based on the
> A-extensions, similar to what aarch64 does? Would that make sense?

Yes, doing a RISC-V implementation with the AMO instructions would make
sense. Care to take that on?


- Paul
* Re: per-cpu thoughts
  2019-02-21 16:28 ` Paul Walmsley
@ 2019-02-21 17:24 ` Björn Töpel
  2019-02-21 17:49   ` Paul Walmsley
  0 siblings, 1 reply; 41+ messages in thread

From: Björn Töpel @ 2019-02-21 17:24 UTC (permalink / raw)
To: Paul Walmsley; +Cc: linux-riscv, Christoph Lameter

On Thu, 21 Feb 2019 at 17:28, Paul Walmsley <paul@pwsan.com> wrote:
>
> On Wed, 20 Feb 2019, Björn Töpel wrote:
>
> > Christopher mentioned "per-cpu atomicity" in his Plumbers RISC-V HPC
> > talk [1]. Typically, this would be for the per-cpu implementation in
> > Linux, so I had a look at the current implementation.
> >
> > I noticed that RISC-V currently uses the generic one. Has there been
> > any work on moving to a RISC-V specific impl, based on the
> > A-extensions, similar to what aarch64 does? Would that make sense?
>
> Yes, doing a RISC-V implementation with the AMO instructions would make
> sense. Care to take that on?
>

Yeah, I'll take a stab at it! I guess AMO is in general favored over
the LL/SC path, correct?


Björn

>
> - Paul
* Re: per-cpu thoughts
  2019-02-21 17:24 ` Björn Töpel
@ 2019-02-21 17:49 ` Paul Walmsley
  2019-02-21 19:40   ` Palmer Dabbelt
  0 siblings, 1 reply; 41+ messages in thread

From: Paul Walmsley @ 2019-02-21 17:49 UTC (permalink / raw)
To: Björn Töpel; +Cc: linux-riscv, Christoph Lameter

On Thu, 21 Feb 2019, Björn Töpel wrote:

> On Thu, 21 Feb 2019 at 17:28, Paul Walmsley <paul@pwsan.com> wrote:
> >
> > On Wed, 20 Feb 2019, Björn Töpel wrote:
> >
> > > Christopher mentioned "per-cpu atomicity" in his Plumbers RISC-V HPC
> > > talk [1]. Typically, this would be for the per-cpu implementation in
> > > Linux, so I had a look at the current implementation.
> > >
> > > I noticed that RISC-V currently uses the generic one. Has there been
> > > any work on moving to a RISC-V specific impl, based on the
> > > A-extensions, similar to what aarch64 does? Would that make sense?
> >
> > Yes, doing a RISC-V implementation with the AMO instructions would make
> > sense. Care to take that on?
> >
>
> Yeah, I'll take a stab at it! I guess AMO is in general favored over
> the LL/SC path, correct?

Yes. I think our current plan is to only support cores with the
A-extension in upstream Linux. Between LR/SC and atomics, the atomics
would be preferred. As Christoph wrote, using an atomic add avoids the
fail/retry risk inherent in an LR/SC sequence, and is easier to
optimize in microarchitecture.


- Paul
* Re: per-cpu thoughts
  2019-02-21 17:49 ` Paul Walmsley
@ 2019-02-21 19:40 ` Palmer Dabbelt
  2019-02-22 15:04   ` Christopher Lameter
  0 siblings, 1 reply; 41+ messages in thread

From: Palmer Dabbelt @ 2019-02-21 19:40 UTC (permalink / raw)
To: paul; +Cc: bjorn.topel, linux-riscv, cl

On Thu, 21 Feb 2019 09:49:13 PST (-0800), paul@pwsan.com wrote:
>
> On Thu, 21 Feb 2019, Björn Töpel wrote:
>
>> On Thu, 21 Feb 2019 at 17:28, Paul Walmsley <paul@pwsan.com> wrote:
>> >
>> > On Wed, 20 Feb 2019, Björn Töpel wrote:
>> >
>> > > Christopher mentioned "per-cpu atomicity" in his Plumbers RISC-V HPC
>> > > talk [1]. Typically, this would be for the per-cpu implementation in
>> > > Linux, so I had a look at the current implementation.
>> > >
>> > > I noticed that RISC-V currently uses the generic one. Has there been
>> > > any work on moving to a RISC-V specific impl, based on the
>> > > A-extensions, similar to what aarch64 does? Would that make sense?
>> >
>> > Yes, doing a RISC-V implementation with the AMO instructions would make
>> > sense. Care to take that on?
>> >
>>
>> Yeah, I'll take a stab at it! I guess AMO is in general favored over
>> the LL/SC path, correct?
>
> Yes. I think our current plan is to only support cores with the
> A-extension in upstream Linux. Between LR/SC and atomics, the atomics
> would be preferred. As Christoph wrote, using an atomic add avoids the
> fail/retry risk inherent in an LR/SC sequence, and is easier to optimize
> in microarchitecture.

We already mandate the A extension in Linux, and there's already a
handful of places where the kernel won't build without A. We used to
have support for SMP=n systems that didn't have the A extension, but as
part of the glibc upstream port submission we removed that support
because it wasn't done the right way.
We've always sort of left the door open for this, but nobody has
expressed interest in doing the legwork yet.

That said, it's all a moot point here: our LR/SC instructions are part
of the A extension, so if you have those you have AMOs. The general
policy here is that if there is an AMO that efficiently implements what
you're looking for then it should be preferred to an LR/SC sequence.
This specific case is actually one of the places where AMOs have a big
benefit as opposed to LR/SC sequences, with the AMO implementation
looking roughly like

	li       t0, 1
	la       t1, counter
	amoadd.w zero, t0, 0(t1)

The amoadd will increment that counter without writing the old value
back to a register, and since the AQ and RL bits aren't set it enforces
the minimum possible memory ordering constraints (i.e., it's still
correct with respect to subsequent loads/stores, but that's about it).
Like Paul said, this is way easier for a microarchitecture to deal with
than the equivalent LR/SC sequence, which would look something like:

	la       t0, counter
1:	lr.w     t1, 0(t0)
	addi     t1, t1, 1
	sc.w     t1, t1, 0(t0)
	bnez     t1, 1b

Specifically, that requires materializing the old value of the counter
in t1 at some point during execution, even if it's later overwritten.
Converting this sequence to a fire-and-forget cache operation seems
like it'd be way harder to do.

Splitting out the counters to cache lines should be standard stuff, I
don't think there's anything RISC-V specific there. The ISA doesn't
mandate a cache line size, but IIRC we have some macro floating around
that matches the current implementations, which is good enough for now.

Thanks for looking in to this!
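Palmer's two sequences have direct C11 counterparts, which is a convenient way to experiment with them (a sketch with our own names; whether a given toolchain actually lowers the first function to a single `amoadd.w zero, ...` on RISC-V depends on the compiler):

```c
#include <stdatomic.h>

static atomic_int amo_counter;

/* Fire-and-forget increment: the fetch-add's return value is discarded
 * and the ordering is relaxed, so on RISC-V this can be lowered to a
 * single "amoadd.w zero, t0, 0(t1)" with neither AQ nor RL set -- no
 * old value ever needs to be materialized in a register. */
static void counter_inc_amo(void)
{
    atomic_fetch_add_explicit(&amo_counter, 1, memory_order_relaxed);
}

/* The LR/SC shape, spelled as the equivalent CAS retry loop: the old
 * value must be read and kept live, and the store-conditional can fail
 * and force a retry -- exactly the costs the AMO avoids. */
static void counter_inc_llsc(void)
{
    int old = atomic_load_explicit(&amo_counter, memory_order_relaxed);

    while (!atomic_compare_exchange_weak_explicit(&amo_counter, &old,
                                                  old + 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
        ;   /* sc.w failed: branch back to the lr.w */
}

static int counter_read(void)
{
    return atomic_load_explicit(&amo_counter, memory_order_relaxed);
}
```

Both produce the same final value; the difference is purely in what the hardware has to track along the way.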
* Re: per-cpu thoughts
  2019-02-21 19:40 ` Palmer Dabbelt
@ 2019-02-22 15:04 ` Christopher Lameter
  2019-02-22 15:36   ` Nick Kossifidis
  0 siblings, 1 reply; 41+ messages in thread

From: Christopher Lameter @ 2019-02-22 15:04 UTC (permalink / raw)
To: Palmer Dabbelt; +Cc: bjorn.topel, paul, linux-riscv

On Thu, 21 Feb 2019, Palmer Dabbelt wrote:

> That said, it's all a moot point here: our LR/SC instructions are part of the
> A extension, so if you have those you have AMOs. The general policy here is
> that if there is an AMO that efficiently implements what you're looking for
> then it should be preferred to an LR/SC sequence. This specific case is
> actually one of the places where AMOs have a big benefit as opposed to LR/SC
> sequences, with the AMO implementation looking roughly like
>
>	li       t0, 1
>	la       t1, counter
>	amoadd.w zero, t0, 0(t1)

Can you add a register that contains the per-cpu area address for the
current hardware thread to the address in amoadd?

If we do not disable interrupts and preemption then the instruction
could be executed on any processor, since the OS may preempt and
reschedule the kernel thread arbitrarily. We need a global register
that contains the per-cpu area address (which is different for every
hardware thread). If the instruction is rescheduled on a different
hardware thread then that thread's per-cpu base needs to be used, so
the instruction needs to specify that the address is relative to that
global register.

PowerPC uses such a global register. On Intel the GS segment register
is abused for that purpose.
* Re: per-cpu thoughts
  2019-02-22 15:04 ` Christopher Lameter
@ 2019-02-22 15:36 ` Nick Kossifidis
  2019-02-22 15:56   ` Christopher Lameter
  2019-02-22 19:48   ` Björn Töpel
  0 siblings, 2 replies; 41+ messages in thread

From: Nick Kossifidis @ 2019-02-22 15:36 UTC (permalink / raw)
To: Christopher Lameter; +Cc: bjorn.topel, paul, Palmer Dabbelt, linux-riscv

On 2019-02-22 17:04, Christopher Lameter wrote:
> On Thu, 21 Feb 2019, Palmer Dabbelt wrote:
>
>> That said, it's all a moot point here: our LR/SC instructions are part
>> of the A extension, so if you have those you have AMOs. The general
>> policy here is that if there is an AMO that efficiently implements
>> what you're looking for then it should be preferred to an LR/SC
>> sequence. This specific case is actually one of the places where AMOs
>> have a big benefit as opposed to LR/SC sequences, with the AMO
>> implementation looking roughly like
>>
>> li       t0, 1
>> la       t1, counter
>> amoadd.w zero, t0, 0(t1)
>
> Can you add a register that contains the per cpu area address for the
> current hardware thread in amoadd?
>
> If we do not disable interrupts and preemption then the instruction
> could be executed on any processor since the OS may preempt and
> reschedule the kernel thread arbitrarily. We need a global register
> that contains the per cpu area address (which is different for every
> hardware thread). If the instruction is rescheduled on a different
> thread then the other offset needs to be used so the instruction needs
> to specify that the address is relative to that global register.
>
> PowerPC uses such a global register. On Intel the GS segment register
> is abused for that purpose.

How about the scratch registers defined in the priv spec?
We already use sscratch on Linux for keeping per-hart kernel data
structures (check out entry.S) and mscratch on OpenSBI for keeping
per-hart state etc.
* Re: per-cpu thoughts
  2019-02-22 15:36 ` Nick Kossifidis
@ 2019-02-22 15:56 ` Christopher Lameter
  2019-02-22 19:47   ` Björn Töpel
  1 sibling, 1 reply; 41+ messages in thread

From: Christopher Lameter @ 2019-02-22 15:56 UTC (permalink / raw)
To: Nick Kossifidis; +Cc: bjorn.topel, paul, Palmer Dabbelt, linux-riscv

On Fri, 22 Feb 2019, Nick Kossifidis wrote:

> How about the scratch registers defined on priv spec ? We already use sscratch
> on Linux for keeping per-hart kernel data structures (check out entry.S) and
> mscratch on OpenSBI for keeping per-hart state etc.

The problem is that the register needs to be referenced by the amo
instruction. Can amoadd increment an address relative to a scratch
register?
* Re: per-cpu thoughts
  2019-02-22 15:56 ` Christopher Lameter
@ 2019-02-22 19:47 ` Björn Töpel
  2019-02-22 19:56   ` Christopher Lameter
  0 siblings, 1 reply; 41+ messages in thread

From: Björn Töpel @ 2019-02-22 19:47 UTC (permalink / raw)
To: Christopher Lameter
Cc: Nick Kossifidis, Paul Walmsley, Palmer Dabbelt, linux-riscv

On Fri, 22 Feb 2019 at 16:56, Christopher Lameter <cl@linux.com> wrote:
>
> On Fri, 22 Feb 2019, Nick Kossifidis wrote:
>
> > How about the scratch registers defined on priv spec ? We already use sscratch
> > on Linux for keeping per-hart kernel data structures (check out entry.S) and
> > mscratch on OpenSBI for keeping per-hart state etc.
>
> The problem is that the register needs to be referenced by the amo
> instruction. Can amoadd increment an address relative to a scratch
> register?
>

It cannot. So, we'll end up with (at least) a preempt disabled region...
* Re: per-cpu thoughts
  2019-02-22 19:47 ` Björn Töpel
@ 2019-02-22 19:56 ` Christopher Lameter
  2019-02-28 12:20   ` Paul Walmsley
  2019-03-08  7:17   ` Björn Töpel
  0 siblings, 2 replies; 41+ messages in thread

From: Christopher Lameter @ 2019-02-22 19:56 UTC (permalink / raw)
To: Björn Töpel
Cc: Nick Kossifidis, Paul Walmsley, Palmer Dabbelt, linux-riscv

On Fri, 22 Feb 2019, Björn Töpel wrote:

> > The problem is that the register needs to be referenced by the amo
> > instruction. Can amoadd increment an address relative to a scratch
> > register?
> >
>
> It cannot. So, we'll end up with (at least) a preempt disabled region...

If we need to do that then we already have so much overhead due to that
processing that it may not be worthwhile. LR/SC overhead is minor
compared to switching preempt on and off, I think.
* Re: per-cpu thoughts
  2019-02-22 19:56 ` Christopher Lameter
@ 2019-02-28 12:20 ` Paul Walmsley
  2019-02-28 17:58   ` Christopher Lameter
  1 sibling, 1 reply; 41+ messages in thread

From: Paul Walmsley @ 2019-02-28 12:20 UTC (permalink / raw)
To: Christopher Lameter
Cc: Nick Kossifidis, Björn Töpel, Paul Walmsley, Palmer Dabbelt, linux-riscv

On Fri, 22 Feb 2019, Christopher Lameter wrote:

> On Fri, 22 Feb 2019, Björn Töpel wrote:
>
> > > The problem is that the register needs to be referenced by the amo
> > > instruction. Can amoadd increment an address relative to a scratch
> > > register?
> > >
> >
> > It cannot. So, we'll end up with (at least) a preempt disabled region...
>
> If we need to do that then we already have so much overhead due to that
> processing that it may not be worthwhile. LR/SC overhead is minor compared
> to switching preempt on and off I think.

Is it strictly necessary to disable and re-enable preemption?

On a UMA system, it might be preferable to risk the cost of the cache
line bounce than to flip preemption off and on - assuming there's no
other impact.


- Paul
* Re: per-cpu thoughts
  2019-02-28 12:20 ` Paul Walmsley
@ 2019-02-28 17:58 ` Christopher Lameter
  2019-02-28 18:42   ` Paul Walmsley
  0 siblings, 1 reply; 41+ messages in thread

From: Christopher Lameter @ 2019-02-28 17:58 UTC (permalink / raw)
To: Paul Walmsley
Cc: Nick Kossifidis, Björn Töpel, Paul Walmsley, Palmer Dabbelt, linux-riscv

On Thu, 28 Feb 2019, Paul Walmsley wrote:

> On Fri, 22 Feb 2019, Christopher Lameter wrote:
>
> > On Fri, 22 Feb 2019, Björn Töpel wrote:
> >
> > > > The problem is that the register needs to be referenced by the amo
> > > > instruction. Can amoadd increment an address relative to a scratch
> > > > register?
> > > >
> > >
> > > It cannot. So, we'll end up with (at least) a preempt disabled region...
> >
> > If we need to do that then we already have so much overhead due to that
> > processing that it may not be worthwhile. LR/SC overhead is minor compared
> > to switching preempt on and off I think.
>
> Is it strictly necessary to disable and re-enable preemption?

It is necessary if CONFIG_PREEMPT is set, because then the execution
can be interrupted at any time and rescheduled on another hardware
thread.

> On a UMA system, it might be preferable to risk the cost of the cache line
> bounce than to flip preemption off and on - assuming there's no other
> impact.

It's not the cost of a cache line bounce. The counter data will be
corrupted, since the RMW operation is not properly serialized.
The counter data will be read from one processor, then the execution
context may change, and the writeback may occur either to the new
execution context's per-cpu area (where we overwrite the old value) or
to the old execution context's per-cpu area (where the counter may
already have been incremented by other code that ran in that context).
* Re: per-cpu thoughts
  2019-02-28 17:58 ` Christopher Lameter
@ 2019-02-28 18:42 ` Paul Walmsley
  2019-02-28 19:09   ` Christopher Lameter
  0 siblings, 1 reply; 41+ messages in thread

From: Paul Walmsley @ 2019-02-28 18:42 UTC (permalink / raw)
To: Christopher Lameter
Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt, Paul Walmsley, Nick Kossifidis, linux-riscv

On Thu, 28 Feb 2019, Christopher Lameter wrote:

> On Thu, 28 Feb 2019, Paul Walmsley wrote:
>
> > On Fri, 22 Feb 2019, Christopher Lameter wrote:
> >
> > > On Fri, 22 Feb 2019, Björn Töpel wrote:
> > >
> > > > > The problem is that the register needs to be referenced by the amo
> > > > > instruction. Can amoadd increment an address relative to a scratch
> > > > > register?
> > > > >
> > > >
> > > > It cannot. So, we'll end up with (at least) a preempt disabled region...
> > >
> > > If we need to do that then we already have so much overhead due to that
> > > processing that it may not be worthwhile. LR/SC overhead is minor compared
> > > to switching preempt on and off I think.
> >
> > Is it strictly necessary to disable and re-enable preemption?
>
> It is necessary if CONFIG_PREEMPT is set because then the execution can
> be interrupted at any time and rescheduled on another hardware thread.
>
> > On a UMA system, it might be preferable to risk the cost of the cache line
> > bounce than to flip preemption off and on - assuming there's no other
> > impact.
>
> It's not the cost of a cache line bounce. The counter data will be
> corrupted, since the RMW operation is not properly serialized.
The counter > data will be read from one processor, then the execution context may > change and the writeback may occur either to the new execution context > per cpu area (where we overwrite the old value) or to the old exeuction context > per cpu are (where the counter may already have been incremented by other > code that ran in that context) This is the AMO-based sequence that Palmer wrote in his E-mail: li t0, 1 la t1, counter amoadd.w zero, t0, 0(t1) The RMW takes place in the amoadd, and executes atomically from the point of view of all cores in the cache coherency domain. Or am I misunderstanding you? - Paul [-- Attachment #2: Type: text/plain, Size: 161 bytes --] _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: per-cpu thoughts
  2019-02-28 18:42 ` Paul Walmsley
@ 2019-02-28 19:09 ` Christopher Lameter
  2019-02-28 20:21   ` Paul Walmsley
  0 siblings, 1 reply; 41+ messages in thread

From: Christopher Lameter @ 2019-02-28 19:09 UTC (permalink / raw)
To: Paul Walmsley
Cc: Nick Kossifidis, Björn Töpel, Paul Walmsley, Palmer Dabbelt, linux-riscv

On Thu, 28 Feb 2019, Paul Walmsley wrote:

> This is the AMO-based sequence that Palmer wrote in his e-mail:
>
>	li       t0, 1
>	la       t1, counter
>	amoadd.w zero, t0, 0(t1)
>
> The RMW takes place in the amoadd, and executes atomically from the point
> of view of all cores in the cache coherency domain.
>
> Or am I misunderstanding you?

Yes, Palmer's example missed having the address of the per-cpu space
somewhere in a register (let's call it "tPerCpu"). Then you can do

	li       t0, 1
	la       t1, offset of counter in percpu area
	amoadd.w zero, t0, 0(t1 + tPerCpu)

But as far as we know we do not have an amoadd that can do this. If you
add the per-cpu space address earlier:

	li       t0, 1
	la       t1, offset of counter in percpu area
	add      t1, t1, tPerCpu
	amoadd.w zero, t0, 0(t1)

then you have the danger of the amoadd being done on the wrong per-cpu
address, because the scheduler may move the execution thread to a
different processor between the "add" and the amoadd.

If you do it without the amoadd, with just LR/SC, then even more
potential issues exist.
* Re: per-cpu thoughts
  2019-02-28 19:09 ` Christopher Lameter
@ 2019-02-28 20:21 ` Paul Walmsley
  2019-03-01  1:13   ` Christopher Lameter
  0 siblings, 1 reply; 41+ messages in thread

From: Paul Walmsley @ 2019-02-28 20:21 UTC (permalink / raw)
To: Christopher Lameter
Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt, Paul Walmsley, Nick Kossifidis, linux-riscv

On Thu, 28 Feb 2019, Christopher Lameter wrote:

> On Thu, 28 Feb 2019, Paul Walmsley wrote:
>
> > This is the AMO-based sequence that Palmer wrote in his e-mail:
> >
> >	li       t0, 1
> >	la       t1, counter
> >	amoadd.w zero, t0, 0(t1)
> >
> > The RMW takes place in the amoadd, and executes atomically from the point
> > of view of all cores in the cache coherency domain.
> >
> > Or am I misunderstanding you?
>
> Yes, Palmer's example missed having the address of the per-cpu space
> somewhere in a register (let's call it "tPerCpu"). Then you can do
>
>	li       t0, 1
>	la       t1, offset of counter in percpu area
>	amoadd.w zero, t0, 0(t1 + tPerCpu)
>
> But as far as we know we do not have an amoadd that can do this. If you
> add the per-cpu space address earlier:
>
>	li       t0, 1
>	la       t1, offset of counter in percpu area
>	add      t1, t1, tPerCpu
>	amoadd.w zero, t0, 0(t1)
>
> then you have the danger of the amoadd being done on the wrong per-cpu
> address, because the scheduler may move the execution thread to a
> different processor between the "add" and the amoadd.
>
> If you do it without the amoadd, with just LR/SC, then even more
> potential issues exist.

Thanks for explaining that.

Is the offset of the counter from the start of my per-cpu area
guaranteed to be the same across all CPUs? If so, wouldn't there simply
be a potential performance risk if preemption was left enabled, rather
than a correctness risk?


- Paul
* Re: per-cpu thoughts
  2019-02-28 20:21 ` Paul Walmsley
@ 2019-03-01  1:13 ` Christopher Lameter
  0 siblings, 0 replies; 41+ messages in thread

From: Christopher Lameter @ 2019-03-01  1:13 UTC (permalink / raw)
To: Paul Walmsley
Cc: Nick Kossifidis, Björn Töpel, Paul Walmsley, Palmer Dabbelt, linux-riscv

On Thu, 28 Feb 2019, Paul Walmsley wrote:

> Is the offset of the counter from the start of my per-cpu area guaranteed
> to be the same across all CPUs? If so, wouldn't there simply be a

Yes, of course; otherwise it would not work. The per-cpu area is
replicated across all processors on bootup.

> potential performance risk if preemption was left enabled, rather than a
> correctness risk?

An unlocked RMW write operation can corrupt data, because it may race
with another unlocked RMW write operation on another processor. Writes
to the per-cpu data must always occur from the processor that owns the
area.
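The corruption Christopher describes is the classic lost-update interleaving, and it can be replayed deterministically in C by hand-ordering the reads and writes of two simulated CPUs (a sketch; the names are ours):

```c
/* Replay the race on an unlocked read-modify-write: both "CPUs" read
 * the counter before either writes back, so one increment is lost. */
static long rmw_slot;

static long unlocked_rmw_race(void)
{
    rmw_slot = 0;

    long a = rmw_slot;  /* CPU A reads 0                          */
    long b = rmw_slot;  /* CPU B reads 0                          */
    rmw_slot = a + 1;   /* CPU A writes back 1                    */
    rmw_slot = b + 1;   /* CPU B writes back 1: A's update is lost */

    return rmw_slot;    /* 1, although two increments ran         */
}
```

A locked RMW (or a single AMO) makes the read and write indivisible, so this interleaving cannot occur.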
* Re: per-cpu thoughts
  2019-02-22 19:56 ` Christopher Lameter
  2019-02-28 12:20 ` Paul Walmsley
@ 2019-03-08  7:17 ` Björn Töpel
  2019-03-11 13:22   ` Palmer Dabbelt
  1 sibling, 1 reply; 41+ messages in thread

From: Björn Töpel @ 2019-03-08 7:17 UTC (permalink / raw)
To: Christopher Lameter
Cc: Nick Kossifidis, Paul Walmsley, Palmer Dabbelt, linux-riscv

On Fri, 22 Feb 2019 at 20:56, Christopher Lameter <cl@linux.com> wrote:
>
> On Fri, 22 Feb 2019, Björn Töpel wrote:
>
> > > The problem is that the register needs to be referenced by the amo
> > > instruction. Can amoadd increment an address relative to a scratch
> > > register?
> > >
> >
> > It cannot. So, we'll end up with (at least) a preempt disabled region...
>
> If we need to do that then we already have so much overhead due to that
> processing that it may not be worthwhile. LR/SC overhead is minor compared
> to switching preempt on and off I think.
>

I'll do some measurements. The generic one disables interrupts, so a
preempt_{dis,en}able and AMO dance might (well, we'll see :-)) be
better still. I definitely see your points from the presentation now.

Similar to ARM, RISC-V cannot use preempt_enable_no_resched :-(, so
we'll have to see where we end up performance-wise...

Thanks for the input!
Björn
* Re: per-cpu thoughts
  2019-03-08  7:17 ` Björn Töpel
@ 2019-03-11 13:22 ` Palmer Dabbelt
  2019-03-11 14:48   ` Björn Töpel
  0 siblings, 1 reply; 41+ messages in thread

From: Palmer Dabbelt @ 2019-03-11 13:22 UTC (permalink / raw)
To: bjorn.topel; +Cc: mick, cl, paul, linux-riscv

On Thu, 07 Mar 2019 23:17:52 PST (-0800), bjorn.topel@gmail.com wrote:
> On Fri, 22 Feb 2019 at 20:56, Christopher Lameter <cl@linux.com> wrote:
>>
>> On Fri, 22 Feb 2019, Björn Töpel wrote:
>>
>> > > The problem is that the register needs to be referenced by the amo
>> > > instruction. Can amoadd increment an address relative to a scratch
>> > > register?
>> > >
>> >
>> > It cannot. So, we'll end up with (at least) a preempt disabled region...
>>
>> If we need to do that then we already have so much overhead due to that
>> processing that it may not be worthwhile. LR/SC overhead is minor compared
>> to switching preempt on and off I think.
>>
>
> I'll do some measurements. The generic one disables interrupts, so a
> preempt_{dis,en}able and AMO dance might (well, we'll see :-)) be
> better still. I definitely see your points from the presentation now.

Thanks a bunch! I feel like the best option here is to just use the
AMOs without disabling preemption and to ensure that all other accesses
are atomic (via AMOs or LR/SC). The only reason I can see that wouldn't
be the way to go would be if it requires non-arch modifications, as if
we go down that path we'll be able to handle the performance edge cases
in the implementation.

> Similar to ARM, RISC-V cannot use preempt_enable_no_resched :-(, so
> we'll have to see where we end up performance-wise...

I'm having no luck figuring out why we can't use that (or even where
we'd use it). I've lost much of the context of the thread, so sorry if
this is meant to be obvious, but do you mind explaining?
If there's a trick then I'd at least like to know about it, even if we can't take advantage of it :) _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: per-cpu thoughts 2019-03-11 13:22 ` Palmer Dabbelt @ 2019-03-11 14:48 ` Björn Töpel 2019-03-11 14:56 ` Christopher Lameter 0 siblings, 1 reply; 41+ messages in thread From: Björn Töpel @ 2019-03-11 14:48 UTC (permalink / raw) To: Palmer Dabbelt Cc: Nick Kossifidis, Christoph Lameter, Paul Walmsley, linux-riscv On Mon, 11 Mar 2019 at 14:22, Palmer Dabbelt <palmer@sifive.com> wrote: > > On Thu, 07 Mar 2019 23:17:52 PST (-0800), bjorn.topel@gmail.com wrote: > > On Fri, 22 Feb 2019 at 20:56, Christopher Lameter <cl@linux.com> wrote: > >> > >> On Fri, 22 Feb 2019, Björn Töpel wrote: > >> > >> > > The problem is that the register needs to be referenced by the amo > >> > > instruction. Can amoadd increment an address relative to a scratch > >> > > register? > >> > > > >> > > >> > It cannot. So, we'll end up with (at least) a preempt disabled region... > >> > >> If we need to do that then we already have so much overhead due to that > >> processing that it may not be worthwhile. lc/sc overhead is minor compared > >> to switching preempt on and off I think. > >> > > > > I'll do some measurements. The generic one disables interrupts, so a > > preempt_{dis,en}able and AMO dance might (well, we'll see :-)) be > > better still. I definitely see your points from the presentation now. > > Thanks a bunch! I feel like the best option here is to just use the AMOs > without disabling preemption and ensuring that all other accesses are atomic > (via AMOs or LR/SC). The only reason I can see that wouldn't be the way to go > would be if it requires non-arch modifications, as if we go down that path > we'll be able to handle the performance edge cases in the implementation. > Hmm, you mean AMO *with* preempt_{dis,en}able, right? (No, interrupt disable, only preemption.) > > Similar to ARM, RISC-V cannot use preempt_enable_no_resched :-(, so > > we'll have to see where we end up performance-wise... 
> > I'm having no luck figuring out why we can't use that (or even where we'd use > it). I've lost much of the context of the thread, so sorry if this is meant to > be obvious, but do you mind explaining? If there's a trick then I'd at least > like to know about it, even if we can't take advantage of it :) It's not allowed for module code to call preempt_enable_no_resched, and this_cpu_* can be from a module context. That means that we need (similar to ARM) to use the vanilla preempt_enable, which might trigger a reschedule and is more expensive. So, I'll take a stab at a percpu impl that uses AMO and preempt_disable/enable regions, and compares that to the generic one. My gut feeling says that it *should* perform better. ...and now I have new shiny HW to try it on as well. :-) Björn _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: per-cpu thoughts 2019-03-11 14:48 ` Björn Töpel @ 2019-03-11 14:56 ` Christopher Lameter 2019-03-11 15:05 ` Björn Töpel 0 siblings, 1 reply; 41+ messages in thread From: Christopher Lameter @ 2019-03-11 14:56 UTC (permalink / raw) To: Björn Töpel Cc: Nick Kossifidis, Paul Walmsley, Palmer Dabbelt, linux-riscv [-- Attachment #1: Type: text/plain, Size: 725 bytes --] On Mon, 11 Mar 2019, Björn Töpel wrote: > > Thanks a bunch! I feel like the best option here is to just use the AMOs > > without disabling preemption and ensuring that all other accesses are atomic > > (via AMOs or LR/SC). The only reason I can see that wouldn't be the way to go > > would be if it requires non-arch modifications, as if we go down that path > > we'll be able to handle the performance edge cases in the implementation. > > > > Hmm, you mean AMO *with* preempt_{dis,en}able, right? (No, interrupt > disable, only preemption.) If you disable preemption then you don't need AMO anymore. In fact that is the default fallback generic implementation. Just don't do anything and you already have that solution. [-- Attachment #2: Type: text/plain, Size: 161 bytes --] _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: per-cpu thoughts 2019-03-11 14:56 ` Christopher Lameter @ 2019-03-11 15:05 ` Björn Töpel 2019-03-11 15:26 ` Paul Walmsley 2019-03-11 15:51 ` Christopher Lameter 0 siblings, 2 replies; 41+ messages in thread From: Björn Töpel @ 2019-03-11 15:05 UTC (permalink / raw) To: Christopher Lameter Cc: Nick Kossifidis, Paul Walmsley, Palmer Dabbelt, linux-riscv On Mon, 11 Mar 2019 at 15:56, Christopher Lameter <cl@linux.com> wrote: > > On Mon, 11 Mar 2019, Björn Töpel wrote: > > > > Thanks a bunch! I feel like the best option here is to just use the AMOs > > > without disabling preemption and ensuring that all other accesses are atomic > > > (via AMOs or LR/SC). The only reason I can see that wouldn't be the way to go > > > would be if it requires non-arch modifications, as if we go down that path > > > we'll be able to handle the performance edge cases in the implementation. > > > > > > > Hmm, you mean AMO *with* preempt_{dis,en}able, right? (No, interrupt > > disable, only preemption.) > > If you disable preemption then you don't need AMO anymore. In fact that is > the default fallback generic implementation. Just don't do anything and you > already have that solution. But the generic one disables interrupts, right? I believe the rationale for RV is similar to ARM's; AMO+preemption disable regions are *slightly* better than the generic, but not as good as the IA one. Or am I missing something? _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: per-cpu thoughts 2019-03-11 15:05 ` Björn Töpel @ 2019-03-11 15:26 ` Paul Walmsley 2019-03-11 16:48 ` Mark Rutland ` (2 more replies) 2019-03-11 15:51 ` Christopher Lameter 1 sibling, 3 replies; 41+ messages in thread From: Paul Walmsley @ 2019-03-11 15:26 UTC (permalink / raw) To: Björn Töpel, Christopher Lameter Cc: Paul Walmsley, catalin.marinas, Palmer Dabbelt, will.deacon, Nick Kossifidis, linux-riscv, linux-arm-kernel [-- Attachment #1: Type: text/plain, Size: 2498 bytes --] + the ARM64 guys and lakml On Mon, 11 Mar 2019, Björn Töpel wrote: > On Mon, 11 Mar 2019 at 15:56, Christopher Lameter <cl@linux.com> wrote: > > > > On Mon, 11 Mar 2019, Björn Töpel wrote: > > > > > > Thanks a bunch! I feel like the best option here is to just use the AMOs > > > > without disabling preemption and ensuring that all other accesses are atomic > > > > (via AMOs or LR/SC). The only reason I can see that wouldn't be the way to go > > > > would be if it requires non-arch modifications, as if we go down that path > > > > we'll be able to handle the performance edge cases in the implementation. > > > > > > > > > > Hmm, you mean AMO *with* preempt_{dis,en}able, right? (No, interrupt > > > disable, only preemption.) > > > > If you disable preemption then you dont need AMO anymore. In fact that is > > the default fallback generic implementation. Just dont do anything and you > > already have that solution. > > But the generic one disables interrupts, right? > > I believe the rational for RV is similar to ARM's; AMO+preemption > disable regions is *slightly* better than the generic, but not as good > as the IA one. Or am I missing something? There's been a discussion going on in a private thread about this that I unfortunately didn't add you to. The discussion is still ongoing, but I think Christoph and myself and a few other folks have agreed that the preempt_disable/enable is not needed for the amoadd approach. 
This is since the apparent intention of the preemption disable/enable is to ensure the correctness of the counter increment; however there is no risk of incorrectness in an amoadd sequence since the atomic add is locked across all of the cache coherency domain. Christoph, would you disagree with that characterization? There are a few outstanding points that we're trying to talk through, but it should be fine for an initial implementation to start with the amoadd-based approach. As far as the ARM LSE atomic implementation goes, I'm not an expert on those instructions. If those instructions are locked across all of the caches for the cores in the Linux system also, then they probably don't need the preempt_disable/enable either - assuming our collective understanding of the purpose of the preempt_disable/enable is correct. All this is, of course, assuming there is no secondary purpose to the preempt_disable/enable that we haven't managed to elicit yet. - Paul [-- Attachment #2: Type: text/plain, Size: 161 bytes --] _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: per-cpu thoughts 2019-03-11 15:26 ` Paul Walmsley @ 2019-03-11 16:48 ` Mark Rutland 2019-03-11 18:39 ` Paul Walmsley 2019-03-12 4:26 ` Christopher Lameter 2019-03-22 14:51 ` Nick Kossifidis 2 siblings, 1 reply; 41+ messages in thread From: Mark Rutland @ 2019-03-11 16:48 UTC (permalink / raw) To: Paul Walmsley Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt, will.deacon, catalin.marinas, Nick Kossifidis, Christopher Lameter, linux-riscv, linux-arm-kernel Hi Paul, On Mon, Mar 11, 2019 at 08:26:45AM -0700, Paul Walmsley wrote: > + the ARM64 guys and lakml > > On Mon, 11 Mar 2019, Björn Töpel wrote: > > > On Mon, 11 Mar 2019 at 15:56, Christopher Lameter <cl@linux.com> wrote: > > > > > > On Mon, 11 Mar 2019, Björn Töpel wrote: > > > > > > > > Thanks a bunch! I feel like the best option here is to just use the AMOs > > > > > without disabling preemption and ensuring that all other accesses are atomic > > > > > (via AMOs or LR/SC). The only reason I can see that wouldn't be the way to go > > > > > would be if it requires non-arch modifications, as if we go down that path > > > > > we'll be able to handle the performance edge cases in the implementation. > > > > > > > > Hmm, you mean AMO *with* preempt_{dis,en}able, right? (No, interrupt > > > > disable, only preemption.) > > > > > > If you disable preemption then you dont need AMO anymore. In fact that is > > > the default fallback generic implementation. Just dont do anything and you > > > already have that solution. > > > > But the generic one disables interrupts, right? > > > > I believe the rational for RV is similar to ARM's; AMO+preemption > > disable regions is *slightly* better than the generic, but not as good > > as the IA one. Or am I missing something? > > There's been a discussion going on in a private thread about this that I > unfortunately didn't add you to. 
The discussion is still ongoing, but I > think Christoph and myself and a few other folks have agreed that the > preempt_disable/enable is not needed for the amoadd approach. This is > since the apparent intention of the preemption disable/enable is to ensure > the correctness of the counter increment; however there is no risk of > incorrectness in an amoadd sequence since the atomic add is locked across > all of the cache coherency domain. We also thought that initially, but there's a subtle race that can occur, and so we added code to disable preemption in commit: f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe") The problem on arm64 is that our atomics take a single base register, and we have to generate the percpu address with separate instructions from the atomic itself. That means we can get preempted between address generation and the atomic, which is problematic for sequences like:

  // Thread-A                    // Thread-B

  this_cpu_add(var)
                                 local_irq_disable(flags)
                                 ...
                                 v = __this_cpu_read(var);
                                 v = some_function(v);
                                 __this_cpu_write(var, v);
                                 ...
                                 local_irq_restore(flags)

... which can unexpectedly race as:

  // Thread-A                    // Thread-B

  < generate CPU X addr >
  < preempted >
                                 < scheduled on CPU X >
                                 local_irq_disable(flags);
                                 v = __this_cpu_read(var);
  < scheduled on CPU Y >
  < add to CPU X's var >
                                 v = some_function(v);
                                 __this_cpu_write(var, v);
                                 local_irq_restore(flags);

... and hence we lose an update to a percpu variable. I suspect RISC-V would have the same problem, unless its AMOs can generate the percpu address and perform the update in a single instruction. > There are a few outstanding points that we're trying to talk through, but > it should be fine for an initial implementation to start with the > amoadd-based approach. > > As far as the ARM LSE atomic implementation goes, I'm not an expert on > those instructions. 
If those instructions are locked across all of the > caches for the cores in the Linux system also, then they probably don't > need the preempt_disable/enable either - assuming our collective > understanding of the purpose of the preempt_disable/enable is correct. > > All this is, of course, assuming there is no secondary purpose to the > preempt_disable/enable that we haven't managed to elicit yet. Sorry to be the bearer of bad news! :) FWIW, I had a go at building percpu ops that didn't need to disable preemption, but that required LL/SC atomics, reserving a GPR for the percpu offset, and didn't result in a measurable difference in practice. The patches are at: https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/this-cpu-reg&id=84ee5f23f93d4a650e828f831da9ed29c54623c5 Thanks, Mark. _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: per-cpu thoughts 2019-03-11 16:48 ` Mark Rutland @ 2019-03-11 18:39 ` Paul Walmsley 2019-03-12 11:23 ` Mark Rutland 0 siblings, 1 reply; 41+ messages in thread From: Paul Walmsley @ 2019-03-11 18:39 UTC (permalink / raw) To: Mark Rutland, Christopher Lameter Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt, will.deacon, Paul Walmsley, catalin.marinas, Nick Kossifidis, linux-riscv, linux-arm-kernel [-- Attachment #1: Type: text/plain, Size: 3776 bytes --] Hi Mark, On Mon, 11 Mar 2019, Mark Rutland wrote: > On Mon, Mar 11, 2019 at 08:26:45AM -0700, Paul Walmsley wrote: > > On Mon, 11 Mar 2019, Björn Töpel wrote: > > > > > > But the generic one disables interrupts, right? > > > > > > I believe the rational for RV is similar to ARM's; AMO+preemption > > > disable regions is *slightly* better than the generic, but not as good > > > as the IA one. Or am I missing something? > > > > There's been a discussion going on in a private thread about this that I > > unfortunately didn't add you to. The discussion is still ongoing, but I > > think Christoph and myself and a few other folks have agreed that the > > preempt_disable/enable is not needed for the amoadd approach. This is > > since the apparent intention of the preemption disable/enable is to ensure > > the correctness of the counter increment; however there is no risk of > > incorrectness in an amoadd sequence since the atomic add is locked across > > all of the cache coherency domain. > > We also thought that initially, but there's a sbutle race that can > occur, and so we added code to disable preemption in commit: > > f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe") > > The problem on arm64 is that our atomics take a single base register, > and we have to generate the percpu address with separate instructions > from the atomic itself. 
That means we can get preempted between address > generation and the atomic, which is problematic for sequences like: > > // Thread-A // Thread-B > > this_cpu_add(var) > local_irq_disable(flags) > ... > v = __this_cpu_read(var); > v = some_function(v); > __this_cpu_write(var, v); > ... > local_irq_restore(flags) > > ... which can unexpectedly race as: > > > // Thread-A // Thread-B > > < generate CPU X addr > > < preempted > > > < scheduled on CPU X > > local_irq_disable(flags); > v = __this_cpu_read(var); > > < scheduled on CPU Y > > < add to CPU X's var > > v = some_function(v); > __this_cpu_write(var, v); > local_irq_restore(flags); > > ... and hence we lose an update to a percpu variable. Makes sense, and thank you very much for the detailed sequences. Open-coded per-cpu code sequences would also cause RISC-V to be exposed to the same issue. Christoph mentioned this, but at the time my attention was focused on per-cpu counters, and not the broader range of per-cpu code sequences. > I suspect RISC-V would have the same problem, unless its AMOs can > generate the percpu address and perform the update in a single > instruction. My understanding is that many of Christoph's per-cpu performance concerns revolve around counters in the VM code, such as: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/vmstat.c#n355 and probably elsewhere by now. It may be worth creating a distinct API for those counters. If only increment, decrement, and read operations are needed, there shouldn't be a need to disable or re-enable preemption in those code paths - assuming that one is either able to tolerate the occasional cache line bounce or retries in a long LL/SC sequence. Any opinions on that? > FWIW, I had a go at building percpu ops that didn't need to disable > preemption, but that required LL/SC atomics, reserving a GPR for the > percpu offset, and didn't result in a measurable difference in practice. 
> The patches are at: > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/this-cpu-reg&id=84ee5f23f93d4a650e828f831da9ed29c54623c5 Very interesting indeed. Thank you for sharing that, - Paul [-- Attachment #2: Type: text/plain, Size: 161 bytes --] _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: per-cpu thoughts 2019-03-11 18:39 ` Paul Walmsley @ 2019-03-12 11:23 ` Mark Rutland 2019-03-12 16:01 ` Paul Walmsley 0 siblings, 1 reply; 41+ messages in thread From: Mark Rutland @ 2019-03-12 11:23 UTC (permalink / raw) To: Paul Walmsley Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt, will.deacon, catalin.marinas, Nick Kossifidis, Christopher Lameter, linux-riscv, linux-arm-kernel On Mon, Mar 11, 2019 at 11:39:56AM -0700, Paul Walmsley wrote: > On Mon, 11 Mar 2019, Mark Rutland wrote: > > On Mon, Mar 11, 2019 at 08:26:45AM -0700, Paul Walmsley wrote: > > > On Mon, 11 Mar 2019, Björn Töpel wrote: > > > > > > > > But the generic one disables interrupts, right? > > > > > > > > I believe the rational for RV is similar to ARM's; AMO+preemption > > > > disable regions is *slightly* better than the generic, but not as good > > > > as the IA one. Or am I missing something? > > > > > > There's been a discussion going on in a private thread about this that I > > > unfortunately didn't add you to. The discussion is still ongoing, but I > > > think Christoph and myself and a few other folks have agreed that the > > > preempt_disable/enable is not needed for the amoadd approach. This is > > > since the apparent intention of the preemption disable/enable is to ensure > > > the correctness of the counter increment; however there is no risk of > > > incorrectness in an amoadd sequence since the atomic add is locked across > > > all of the cache coherency domain. > > > > We also thought that initially, but there's a sbutle race that can > > occur, and so we added code to disable preemption in commit: > > > > f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe") > > > > The problem on arm64 is that our atomics take a single base register, > > and we have to generate the percpu address with separate instructions > > from the atomic itself. 
That means we can get preempted between address > > generation and the atomic, which is problematic for sequences like: > > > > // Thread-A // Thread-B > > > > this_cpu_add(var) > > local_irq_disable(flags) > > ... > > v = __this_cpu_read(var); > > v = some_function(v); > > __this_cpu_write(var, v); > > ... > > local_irq_restore(flags) > > > > ... which can unexpectedly race as: > > > > > > // Thread-A // Thread-B > > > > < generate CPU X addr > > > < preempted > > > > > < scheduled on CPU X > > > local_irq_disable(flags); > > v = __this_cpu_read(var); > > > > < scheduled on CPU Y > > > < add to CPU X's var > > > v = some_function(v); > > __this_cpu_write(var, v); > > local_irq_restore(flags); > > > > ... and hence we lose an update to a percpu variable. > > Makes sense, and thank you very much for the detailed sequences. > Open-coded per-cpu code sequences would also cause RISC-V to be exposed to > the same issue. Christoph mentioned this, but at the time my attention > was focused on per-cpu counters, and not the broader range of per-cpu code > sequences. > > > I suspect RISC-V would have the same problem, unless its AMOs can > > generate the percpu address and perform the update in a single > > instruction. > > My understanding is that many of Christoph's per-cpu performance concerns > revolve around counters in the VM code, such as: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/vmstat.c#n355 The mod_*_state() functions are the only ones which mess with preemption, and that should only mandate a few locally-visible modifications of preempt_count. Similar cases apply within SLUB, and I'd hoped to improve that with my this-cpu-reg branch, but I didn't see a measurable improvement on workloads I tried. Have you seen a measurable performance problem here? > and probably elsewhere by now. 
If only increment, decrement, and read operations are > needed, there shouldn't be a need to disable or re-enable > preemption in those code paths - assuming that one is either able to > tolerate the occasional cache line bounce or retries in a long LL/SC > sequence. Any opinions on that? I'm afraid I don't understand this code well enough to say whether that would be safe. It's not clear to me whether there would be a measurable performance difference, as I'd expect fiddling with preempt_count to be relatively cheap. The AMOs themselves don't need to enforce ordering here, and only a few compiler barriers are necessary. Thanks, Mark. > > FWIW, I had a go at building percpu ops that didn't need to disable > > preemption, but that required LL/SC atomics, reserving a GPR for the > > percpu offset, and didn't result in a measurable difference in practice. > > The patches are at: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/this-cpu-reg&id=84ee5f23f93d4a650e828f831da9ed29c54623c5 > > Very interesting indeed. Thank you for sharing that, > > - Paul _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: per-cpu thoughts 2019-03-12 11:23 ` Mark Rutland @ 2019-03-12 16:01 ` Paul Walmsley 2019-03-12 17:34 ` Christopher Lameter 0 siblings, 1 reply; 41+ messages in thread From: Paul Walmsley @ 2019-03-12 16:01 UTC (permalink / raw) To: Mark Rutland, Christopher Lameter Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt, will.deacon, Paul Walmsley, catalin.marinas, Nick Kossifidis, linux-riscv, linux-arm-kernel Hi Mark, On Tue, 12 Mar 2019, Mark Rutland wrote: > On Mon, Mar 11, 2019 at 11:39:56AM -0700, Paul Walmsley wrote: > > > My understanding is that many of Christoph's per-cpu performance concerns > > revolve around counters in the VM code, such as: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/vmstat.c#n355 > > The mod_*_state() functions are the only ones which mess with > preemption, and that should only mandate a few locally-visible > modifications of preempt_count. Also __{inc,dec}_*_state() calls __this_cpu_{inc,dec}_return() which tweaks the preemption count. > Similar cases apply within SLUB, and I'd hoped to improve that with my > this-cpu-reg branch, but I didn't see a measurable improvement on > workloads I tried. That certainly suggests that all of this could be much ado about nothing, or at least very little. One observation is that some of the performance concerns that Christoph is expressing here may be about ensuring predictable and minimal latency bounds, rather than raw throughput. > Have you seen a measurable performance problem here? Not yet. The two motivations at the moment are: 1. to determine how our initial per-arch implementation for percpu.h should look, and 2. to get a high-level view of whether unlocked base + offset increment instructions are worthwhile, from people who know more than I do about them. So far the counters look like a distinct use-case - one that might have relaxed requirements wrt preemption changes. > > and probably elsewhere by now. 
It may be worth creating a distinct API > > for those counters. If only increment, decrement, and read operations are > > needed, there shouldn't be a need to disable or re-enable > > preemption in those code paths - assuming that one is either able to > > tolerate the occasional cache line bounce or retries in a long LL/SC > > sequence. Any opinions on that? > > I'm afraid I don't understand this code well enough to say whether that > would be safe. That makes two of us. Have followed up with Christoph in a separate thread with lakml cc'ed. > It's not clear to me whether there would be a measurable performance > difference, as I'd expect fiddling with preempt_count to be relatively > cheap. The AMOs themselves don't need to enforce ordering here, and only > a few compiler barriers are necessary. OK. I have been assuming that the risk of a scheduler call in preempt_enable() is what Christoph is concerned about here: https://lore.kernel.org/linux-riscv/b0653f7a6f1bc0c9329d37de690d3bed@mailhost.ics.forth.gr/T/#m6e609e26a9e5405c4a7e2dbd5ca8c969cada5c36 If it is possible to eliminate the latency risk from a 'simple' counter increment/decrement by creating a restricted API, that may be worthwhile. Christoph has also been concerned that the AMO operations will carry an unacceptable performance overhead. 
* Re: per-cpu thoughts 2019-03-12 16:01 ` Paul Walmsley @ 2019-03-12 17:34 ` Christopher Lameter 0 siblings, 0 replies; 41+ messages in thread From: Christopher Lameter @ 2019-03-12 17:34 UTC (permalink / raw) To: Paul Walmsley Cc: Mark Rutland, Paul Walmsley, Björn Töpel, Palmer Dabbelt, will.deacon, catalin.marinas, Nick Kossifidis, linux-riscv, linux-arm-kernel On Tue, 12 Mar 2019, Paul Walmsley wrote: > > Similar cases apply within SLUB, and I'd hoped to improve that with my > > this-cpu-reg branch, but I didn't see a measurable improvement on > > workloads I tried. > > That certainly suggests that all of this could be much ado about > nothing, or at least very little. One observation is that some of the > performance concerns that Christoph is expressing here may be about > ensuring predictable and minimal latency bounds, rather than raw > throughput. The performance concerns are mainly when scaling RISC V to many cores which will create contention for counter handling. The scalable counter system (ZVCs) was developed to address these issues, and later the this_cpu operations were added to optimize that performance further. At this point on RISC V with just a couple of cores you may not see too much of an effect. In fact a UP system would be running faster if it does not use these schemes since there is no contention. Scalability of counter operations becomes a challenge as core counts increase. > OK. I have been assuming that the risk of a scheduler call in > preempt_enable() is what Christoph is concerned about here: > > https://lore.kernel.org/linux-riscv/b0653f7a6f1bc0c9329d37de690d3bed@mailhost.ics.forth.gr/T/#m6e609e26a9e5405c4a7e2dbd5ca8c969cada5c36 > > If it is possible to eliminate the latency risk from a 'simple' counter > increment/decrement by creating a restricted API, that may be worthwhile. > > Christoph has also been concerned that the AMO operations will carry an > unacceptable performance overhead. 
But the RISC-V AMO operations can be > written such that they don't have the ordering restrictions that the Intel > LOCK-prefixed operations do, and thus those concerns may not apply -- at > least not to the same extent. Perhaps this is also true for the ARM LSE > atomics. My main concern at this point is to ensure that RISC V has the proper setup for the future and that decisions are made so that scaling up of RISC V to hundreds of cores (if not more) does not become a bottleneck. One of the use cases here for us is likely to have extreme parallel operations for HPC style compute. The main issue for the core VM is to limit the overhead of the statistics and counter operations. Introducing atomic operations in key fault paths has caused performance regressions in the past. _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: per-cpu thoughts 2019-03-11 15:26 ` Paul Walmsley 2019-03-11 16:48 ` Mark Rutland @ 2019-03-12 4:26 ` Christopher Lameter 2019-03-12 14:21 ` Paul Walmsley 2019-03-22 14:51 ` Nick Kossifidis 2 siblings, 1 reply; 41+ messages in thread From: Christopher Lameter @ 2019-03-12 4:26 UTC (permalink / raw) To: Paul Walmsley Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt, will.deacon, catalin.marinas, Nick Kossifidis, linux-riscv, linux-arm-kernel On Mon, 11 Mar 2019, Paul Walmsley wrote: > There's been a discussion going on in a private thread about this that I > unfortunately didn't add you to. The discussion is still ongoing, but I > think Christoph and myself and a few other folks have agreed that the > preempt_disable/enable is not needed for the amoadd approach. This is > since the apparent intention of the preemption disable/enable is to ensure > the correctness of the counter increment; however there is no risk of > incorrectness in an amoadd sequence since the atomic add is locked across > all of the cache coherency domain. Christoph, would you disagree with > that characterization? No. As I said before all write operations have to happen from the local cpu to a per cpu segment. > There are a few outstanding points that we're trying to talk through, but > it should be fine for an initial implementation to start with the > amoadd-based approach. > > As far as the ARM LSE atomic implementation goes, I'm not an expert on > those instructions. If those instructions are locked across all of the > caches for the cores in the Linux system also, then they probably don't > need the preempt_disable/enable either - assuming our collective > understanding of the purpose of the preempt_disable/enable is correct. > > All this is, of course, assuming there is no secondary purpose to the > preempt_disable/enable that we haven't managed to elicit yet. 
The intention is that the write occurs to the local per cpu segment and that any per cpu RMW activities (like this_cpu_inc()) appear to be "atomic" in regards to other processes running on the same cpu.
* Re: per-cpu thoughts 2019-03-12 4:26 ` Christopher Lameter @ 2019-03-12 14:21 ` Paul Walmsley 2019-03-12 17:42 ` Christopher Lameter 0 siblings, 1 reply; 41+ messages in thread From: Paul Walmsley @ 2019-03-12 14:21 UTC (permalink / raw) To: Christopher Lameter Cc: Paul Walmsley, catalin.marinas, Björn Töpel, Palmer Dabbelt, will.deacon, Paul Walmsley, Nick Kossifidis, linux-riscv, linux-arm-kernel

On Tue, 12 Mar 2019, Christopher Lameter wrote:

> On Mon, 11 Mar 2019, Paul Walmsley wrote:
>
> > There's been a discussion going on in a private thread about this that I
> > unfortunately didn't add you to. The discussion is still ongoing, but I
> > think Christoph and myself and a few other folks have agreed that the
> > preempt_disable/enable is not needed for the amoadd approach. This is
> > since the apparent intention of the preemption disable/enable is to ensure
> > the correctness of the counter increment; however there is no risk of
> > incorrectness in an amoadd sequence since the atomic add is locked across
> > all of the cache coherency domain. Christoph, would you disagree with
> > that characterization?
>
> No. As I said before all write operations have to happen from the local
> cpu to a per cpu segment.

In light of the thread that's going on with Mark, given the open-coded __this_cpu_* operations, it's clear we don't have a choice - we'd need to disable preemption for the basic operations.

The counters, though, may not need the preemption disable/reenable. Christoph, you expressed earlier that you think the overhead of the preempt_disable/enable is quite high. Do you think it's worth creating a separate, restricted implementation for per-cpu counters?

- Paul
* Re: per-cpu thoughts 2019-03-12 14:21 ` Paul Walmsley @ 2019-03-12 17:42 ` Christopher Lameter 2019-03-12 17:59 ` Gary Guo 2019-03-13 20:15 ` Paul Walmsley 0 siblings, 2 replies; 41+ messages in thread From: Christopher Lameter @ 2019-03-12 17:42 UTC (permalink / raw) To: Paul Walmsley Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt, will.deacon, catalin.marinas, Nick Kossifidis, linux-riscv, linux-arm-kernel

On Tue, 12 Mar 2019, Paul Walmsley wrote:

> The counters, though, may not need the preemption disable/reenable.
> Christoph, you expressed earlier that you think the overhead of the
> preempt_disable/enable is quite high. Do you think it's worth creating a
> separate, restricted implementation for per-cpu counters?

As I have always said: I would like to have per cpu atomic instructions added on RISC-V that work like those on Intel. A single instruction, operating on a counter addressed relative to a per cpu base, please.

I think the attempt to reengineer the core counter mechanisms in Linux is not that realistic, and it would require you to first know the cross-platform issues that have driven the development of these things in the first place. Sorry that I have been only superficially involved in these discussions, but I have a hard time seeing this going anywhere. There are already established core implementations, and the various arches' takes on this issue have been around for longer than a decade. It will be hard to come up with something better.

Can we focus on the RISC-V instruction support? I also do not think that this is a currently pressing issue, but it will be when you scale RISC-V up to many cores (especially hundreds or thousands of concurrent hardware threads, which is where our interest is likely going to be in coming years, and likely also when RISC-V is used for enterprise / cloud data services).
* Re: per-cpu thoughts 2019-03-12 17:42 ` Christopher Lameter @ 2019-03-12 17:59 ` Gary Guo 2019-03-13 18:58 ` Christopher Lameter 2019-03-13 20:15 ` Paul Walmsley 1 sibling, 1 reply; 41+ messages in thread From: Gary Guo @ 2019-03-12 17:59 UTC (permalink / raw) To: Christopher Lameter, Paul Walmsley Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt, will.deacon, catalin.marinas, Nick Kossifidis, linux-riscv, linux-arm-kernel

The added instruction would require three operands (base, offset, value), which is clearly an issue on how to encode it. Also, as RISC-V's base ISA is already frozen, even if we have this instruction, it will be an extension, and it is not going to be implemented by all hardware. Linux will still need to support hardware without this instruction.

On 12/03/2019 17:42, Christopher Lameter wrote:
> On Tue, 12 Mar 2019, Paul Walmsley wrote:
>
>> The counters, though, may not need the preemption disable/reenable.
>> Christoph, you expressed earlier that you think the overhead of the
>> preempt_disable/enable is quite high. Do you think it's worth creating a
>> separate, restricted implementation for per-cpu counters?
>
> As I have always said: I would like to have per cpu atomic instructions
> added on RISCV V that works like those on Intel. Single instruction and
> relative to a per cpu based addressable counter operations please.
>
> I think the attempt to reengineer the core counter mechanisms on Linux is
> not that realistic and would require you to first know the cross platform
> issues that have driven the development of these things in the first
> place. Sorry that I have been just superficially involved in these
> discussions but I have a hard time seeing this going anywere. There are
> already established core implementations and various arches take on this
> issue and those have been around for longer than a decade. It will be hard
> to come up with something better.
>
> Can we focus on the RISC V instruction support?
> I also do not think that
> this is a currently pressing issue but it will be when you scale up RISC V
> to many cores (especiall hundreds or thousands of concurrent hardware
> threads like what our interest is likely going to be in coming years and
> likely also when RISC V is going to be used for enterprise / cloud data
> services).
* Re: per-cpu thoughts 2019-03-12 17:59 ` Gary Guo @ 2019-03-13 18:58 ` Christopher Lameter 0 siblings, 0 replies; 41+ messages in thread From: Christopher Lameter @ 2019-03-13 18:58 UTC (permalink / raw) To: Gary Guo Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt, will.deacon, Paul Walmsley, catalin.marinas, Nick Kossifidis, linux-riscv, linux-arm-kernel

On Tue, 12 Mar 2019, Gary Guo wrote:

> The added instruction would require three operands (base, offset,
> value), which is clearly an issue on how to encode it. Also, as RISC-V's
> base ISA is already frozen, even if we have this instruction, it will be
> an extension, and is not going to be implemented by all hardware. Linux
> will still need to support those hardware without this instruction.

We have the ability to dynamically patch the kernel on bootup as the features of the architecture are detected. That won't be an issue, and we do this on x86 for numerous variances in the instruction set.

The per cpu atomic instructions require a base register that contains the pointer to the base of the per cpu area of the currently executing processor, and then an offset that may be immediate or also in a register. And yes, then the value of the increment for the most general case of counter handling. We also need RMW operations like cmpxchg.
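The boot-time patching Christopher refers to (the kernel's "alternatives" mechanism) can be approximated in plain C with a function pointer selected once at feature-detection time; the real mechanism rewrites the instruction stream in place instead, avoiding the indirect call. All names below are illustrative, and the atomic variant uses a GCC/Clang builtin as a stand-in for the hypothetical per-cpu AMO:

```c
#include <assert.h>

/* Fallback increment: plain RMW; the caller handles preemption. */
static void pcp_inc_generic(long *p)
{
    *p += 1;
}

/* Variant for hardware with the (hypothetical) per-cpu atomic
 * extension; approximated here with a compiler atomic builtin. */
static void pcp_inc_atomic(long *p)
{
    __atomic_fetch_add(p, 1, __ATOMIC_RELAXED);
}

/* Chosen once during boot, after ISA feature detection. A real kernel
 * patches the call sites instead of paying for an indirect call. */
static void (*pcp_inc)(long *) = pcp_inc_generic;

static void detect_isa_features(int has_percpu_atomics)
{
    if (has_percpu_atomics)
        pcp_inc = pcp_inc_atomic;
}
```

This is why an optional extension is not a blocker for Linux support: the fallback stays available, and capable hardware gets the faster sequence selected at boot.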
* Re: per-cpu thoughts 2019-03-12 17:42 ` Christopher Lameter 2019-03-12 17:59 ` Gary Guo @ 2019-03-13 20:15 ` Paul Walmsley 1 sibling, 0 replies; 41+ messages in thread From: Paul Walmsley @ 2019-03-13 20:15 UTC (permalink / raw) To: Christopher Lameter Cc: Paul Walmsley, catalin.marinas, Björn Töpel, Palmer Dabbelt, will.deacon, Paul Walmsley, Nick Kossifidis, linux-riscv, linux-arm-kernel

Hi Christoph

On Tue, 12 Mar 2019, Christopher Lameter wrote:

> On Tue, 12 Mar 2019, Paul Walmsley wrote:
>
> > The counters, though, may not need the preemption disable/reenable.
> > Christoph, you expressed earlier that you think the overhead of the
> > preempt_disable/enable is quite high. Do you think it's worth creating a
> > separate, restricted implementation for per-cpu counters?
>
> As I have always said: I would like to have per cpu atomic instructions
> added on RISCV V that works like those on Intel. Single instruction and
> relative to a per cpu based addressable counter operations please.

Speaking from a RISC-V point of view, you could propose those kinds of instructions as a RISC-V extension. One good way to start might be to open a discussion in the isa-dev Google Group first. Others might speak up who agree with you and may be willing to help with the technical details: https://groups.google.com/a/groups.riscv.org/forum/#!forum/isa-dev

Organizations that are members of the RISC-V Foundation can propose a new RISC-V Foundation working group to produce new instructions. There are a few of these extension task groups in progress now.

> I think the attempt to reengineer the core counter mechanisms on Linux is
> not that realistic and would require you to first know the cross platform
> issues that have driven the development of these things in the first
> place. Sorry that I have been just superficially involved in these
> discussions but I have a hard time seeing this going anywere.
> There are
> already established core implementations and various arches take on this
> issue and those have been around for longer than a decade. It will be hard
> to come up with something better.

You might be right and you are certainly more of an expert in the per-cpu code than most of us. But if the existing implementation was designed around x86 instruction set constraints which might not apply to RISC-V (or ARM), there might be room for improvement.

> Can we focus on the RISC V instruction support? I also do not think that
> this is a currently pressing issue but it will be when you scale up RISC V
> to many cores (especiall hundreds or thousands of concurrent hardware
> threads like what our interest is likely going to be in coming years and
> likely also when RISC V is going to be used for enterprise / cloud data
> services).

If the software discussion isn't productive, you could always take your concern directly to the architecture and microarchitecture mailing lists involved with the RISC-V Foundation. That way there's less need to talk it through with those of us who are primarily software focused. I would expect those engineers will probably go through a similar process:

1. they would want to become familiar with the minimal code sequences
2. if they think that an alternate code sequence would perform better, or just as well, they would probably expect someone to try it first
3. if they think that a given microarchitectural optimization would resolve the issue, without needing to add new instructions, they would probably be less interested in adding them
4. they might want to see traces that illustrate that there is an underlying problem

This is more or less the approach that we're trying to take here, although with more of a software focus. We know you as a serious contributor and maintainer from the Linux community, which the hardware engineers might not, so it's easier to justify spending more time initially trying to understand the issue.
Also, if it turns out that software optimizations are possible, it may help other architectures like ARM.

- Paul
* Re: per-cpu thoughts 2019-03-11 15:26 ` Paul Walmsley 2019-03-11 16:48 ` Mark Rutland 2019-03-12 4:26 ` Christopher Lameter @ 2019-03-22 14:51 ` Nick Kossifidis 2019-03-22 17:57 ` Christopher Lameter 2 siblings, 1 reply; 41+ messages in thread From: Nick Kossifidis @ 2019-03-22 14:51 UTC (permalink / raw) To: Paul Walmsley Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt, will.deacon, catalin.marinas, Nick Kossifidis, Christopher Lameter, linux-riscv, linux-arm-kernel

On 2019-03-11 17:26, Paul Walmsley wrote:
> + the ARM64 guys and lakml
>
> On Mon, 11 Mar 2019, Björn Töpel wrote:
>
>> On Mon, 11 Mar 2019 at 15:56, Christopher Lameter <cl@linux.com>
>> wrote:
>> >
>> > On Mon, 11 Mar 2019, Björn Töpel wrote:
>> >
>> > > > Thanks a bunch! I feel like the best option here is to just use the AMOs
>> > > > without disabling preemption and ensuring that all other accesses are atomic
>> > > > (via AMOs or LR/SC). The only reason I can see that wouldn't be the way to go
>> > > > would be if it requires non-arch modifications, as if we go down that path
>> > > > we'll be able to handle the performance edge cases in the implementation.
>> > > >
>> > >
>> > > Hmm, you mean AMO *with* preempt_{dis,en}able, right? (No, interrupt
>> > > disable, only preemption.)
>> >
>> > If you disable preemption then you dont need AMO anymore. In fact that is
>> > the default fallback generic implementation. Just dont do anything and you
>> > already have that solution.
>>
>> But the generic one disables interrupts, right?
>>
>> I believe the rational for RV is similar to ARM's; AMO+preemption
>> disable regions is *slightly* better than the generic, but not as good
>> as the IA one. Or am I missing something?
>
> There's been a discussion going on in a private thread about this that I
> unfortunately didn't add you to.
> The discussion is still ongoing, but I
> think Christoph and myself and a few other folks have agreed that the
> preempt_disable/enable is not needed for the amoadd approach. This is
> since the apparent intention of the preemption disable/enable is to ensure
> the correctness of the counter increment; however there is no risk of
> incorrectness in an amoadd sequence since the atomic add is locked across
> all of the cache coherency domain. Christoph, would you disagree with
> that characterization?

Some further input on this...

https://www.kernel.org/doc/Documentation/preempt-locking.txt

"Note that you do not need to explicitly prevent preemption if you are holding any locks or interrupts are disabled, since preemption is implicitly disabled in those cases.

But keep in mind that 'irqs disabled' is a fundamentally unsafe way of disabling preemption - any cond_resched() or cond_resched_lock() might trigger a reschedule if the preempt count is 0. A simple printk() might trigger a reschedule. So use this implicit preemption-disabling property only if you know that the affected codepath does not do any of this. Best policy is to use this only for small, atomic code that you wrote and which calls no complex functions."
* Re: per-cpu thoughts 2019-03-22 14:51 ` Nick Kossifidis @ 2019-03-22 17:57 ` Christopher Lameter 0 siblings, 0 replies; 41+ messages in thread From: Christopher Lameter @ 2019-03-22 17:57 UTC (permalink / raw) To: Nick Kossifidis Cc: Paul Walmsley, catalin.marinas, Björn Töpel, Palmer Dabbelt, will.deacon, Paul Walmsley, linux-riscv, linux-arm-kernel

On Fri, 22 Mar 2019, Nick Kossifidis wrote:

> But keep in mind that 'irqs disabled' is a fundamentally unsafe way of
> disabling preemption - any cond_resched() or cond_resched_lock() might trigger
> a reschedule if the preempt count is 0. A simple printk() might trigger a
> reschedule. So use this implicit preemption-disabling property only if you
> know that the affected codepath does not do any of this. Best policy is to use
> this only for small, atomic code that you wrote and which calls no complex
> functions."

Really? Preemption tracks hardirq status if you use the kernel-provided way of disabling and enabling interrupts. See linux/include/preempt.h:

/*
 * We put the hardirq and softirq counter into the preemption
 * counter. The bitmask has the following meaning:
 *
 * - bits 0-7 are the preemption count (max preemption depth: 256)
 * - bits 8-15 are the softirq count (max # of softirqs: 256)
 *
 * The hardirq count could in theory be the same as the number of
 * interrupts in the system, but we run all interrupt handlers with
 * interrupts disabled, so we cannot have nesting interrupts. Though
 * there are a few palaeontologic drivers which reenable interrupts in
 * the handler, so we need more than one bit here.
 *
 *         PREEMPT_MASK:        0x000000ff
 *         SOFTIRQ_MASK:        0x0000ff00
 *         HARDIRQ_MASK:        0x000f0000
 *             NMI_MASK:        0x00100000
 * PREEMPT_NEED_RESCHED:        0x80000000
 */
#define PREEMPT_BITS    8
#define SOFTIRQ_BITS    8
#define HARDIRQ_BITS    4
#define NMI_BITS        1
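For reference, the masks in that comment follow directly from the quoted bit counts; a standalone check makes the field boundaries explicit (the bit widths are the ones from include/linux/preempt.h; the `FIELD_MASK` helper name is illustrative):

```c
#include <assert.h>

/* Bit widths as quoted from include/linux/preempt.h. */
#define PREEMPT_BITS    8
#define SOFTIRQ_BITS    8
#define HARDIRQ_BITS    4
#define NMI_BITS        1

/* Each field starts where the previous one ends. */
#define PREEMPT_SHIFT   0
#define SOFTIRQ_SHIFT   (PREEMPT_SHIFT + PREEMPT_BITS)
#define HARDIRQ_SHIFT   (SOFTIRQ_SHIFT + SOFTIRQ_BITS)
#define NMI_SHIFT       (HARDIRQ_SHIFT + HARDIRQ_BITS)

/* Mask covering `bits` bits, shifted into the field's position. */
#define FIELD_MASK(bits, shift)  (((1UL << (bits)) - 1) << (shift))

#define PREEMPT_MASK    FIELD_MASK(PREEMPT_BITS, PREEMPT_SHIFT)
#define SOFTIRQ_MASK    FIELD_MASK(SOFTIRQ_BITS, SOFTIRQ_SHIFT)
#define HARDIRQ_MASK    FIELD_MASK(HARDIRQ_BITS, HARDIRQ_SHIFT)
#define NMI_MASK        FIELD_MASK(NMI_BITS,     NMI_SHIFT)
```

The resulting mask values match the table in the quoted comment, which is why holding a hardirq-off section shows up in preempt_count and is visible to the preemption machinery.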
* Re: per-cpu thoughts 2019-03-11 15:05 ` Björn Töpel 2019-03-11 15:26 ` Paul Walmsley @ 2019-03-11 15:51 ` Christopher Lameter 2019-03-11 16:35 ` Björn Töpel 1 sibling, 1 reply; 41+ messages in thread From: Christopher Lameter @ 2019-03-11 15:51 UTC (permalink / raw) To: Björn Töpel Cc: Nick Kossifidis, Paul Walmsley, Palmer Dabbelt, linux-riscv

On Mon, 11 Mar 2019, Björn Töpel wrote:

> > > Hmm, you mean AMO *with* preempt_{dis,en}able, right? (No, interrupt
> > > disable, only preemption.)
> >
> > If you disable preemption then you dont need AMO anymore. In fact that is
> > the default fallback generic implementation. Just dont do anything and you
> > already have that solution.
>
> But the generic one disables interrupts, right?

No it doesnt.

> I believe the rational for RV is similar to ARM's; AMO+preemption
> disable regions is *slightly* better than the generic, but not as good
> as the IA one. Or am I missing something?

Looks to me that the better solution for ARM may be to drop its implementation and revert to the generic one. The overhead of AMOs is likely too high to justify using those, and the LR/SC overhead is also generating useless overhead.

If you benchmark those then please be aware that the kernel uses __this_xxx ops in places where the preempt disable is not needed. The savings come in the form of this_xxx ops where we avoid the overhead of enabling and disabling preemption (which can be significant especially given the scheduler actions).
* Re: per-cpu thoughts 2019-03-11 15:51 ` Christopher Lameter @ 2019-03-11 16:35 ` Björn Töpel 2019-03-12 4:22 ` Christopher Lameter 0 siblings, 1 reply; 41+ messages in thread From: Björn Töpel @ 2019-03-11 16:35 UTC (permalink / raw) To: Christopher Lameter Cc: Nick Kossifidis, Paul Walmsley, Palmer Dabbelt, linux-riscv

On Mon, 11 Mar 2019 at 16:51, Christopher Lameter <cl@linux.com> wrote:
>
> On Mon, 11 Mar 2019, Björn Töpel wrote:
>
> > > > Hmm, you mean AMO *with* preempt_{dis,en}able, right? (No, interrupt
> > > > disable, only preemption.)
> > >
> > > If you disable preemption then you dont need AMO anymore. In fact that is
> > > the default fallback generic implementation. Just dont do anything and you
> > > already have that solution.
> >
> > But the generic one disables interrupts, right?
>
> No it doesnt.

Ok, please educate me what I'm missing here. From asm-generic/percpu.h:

#define this_cpu_generic_to_op(pcp, val, op)		\
do {							\
	unsigned long __flags;				\
	raw_local_irq_save(__flags);			\
	raw_cpu_generic_to_op(pcp, val, op);		\
	raw_local_irq_restore(__flags);			\
} while (0)

And the disassembly from my build does indeed have interrupt disabled regions. Sorry for being a bit slow here. :-)

Björn

> > I believe the rational for RV is similar to ARM's; AMO+preemption
> > disable regions is *slightly* better than the generic, but not as good
> > as the IA one. Or am I missing something?
>
> Looks to me that the better solution for ARM may be to drop its
> implementation and revert to the generic one.
>
> The overhead of AMOs is likely too high to justify using those. And the
> LC/SC overhead is also generating useless overhead.
>
> If you benchmark those then please be aware that the kernel uses
> __this_xxx ops in places where the preempt disable is not needed. The
> savings come in the form of this_xxx ops where we avoid the overhead of
> enabling and disabling preemption (which can be significant especialy
> given the scheduler actions).
* Re: per-cpu thoughts 2019-03-11 16:35 ` Björn Töpel @ 2019-03-12 4:22 ` Christopher Lameter 0 siblings, 0 replies; 41+ messages in thread From: Christopher Lameter @ 2019-03-12 4:22 UTC (permalink / raw) To: Björn Töpel Cc: Nick Kossifidis, Paul Walmsley, Palmer Dabbelt, linux-riscv

On Mon, 11 Mar 2019, Björn Töpel wrote:

> From asm-generic/percpu.h:
>
> #define this_cpu_generic_to_op(pcp, val, op)		\
> do {						\
> 	unsigned long __flags;			\
> 	raw_local_irq_save(__flags);		\
> 	raw_cpu_generic_to_op(pcp, val, op);	\
> 	raw_local_irq_restore(__flags);		\
> } while (0)
>
> And the disassembly from my build does indeed have interrupt disabled regions.

Well there is widespread agreement here that preemption off *should be* enough and that no counters *should be* incremented during an irqoff section. Most per cpu counters obey that restriction, and the major VM counters are incremented in preempt off sections without disabling interrupts.

Sorry, I thought we already had that finished. Guess I jumped the gun on that one. Ok, so we have a lot more overhead to work with currently.
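The change Christopher argues for amounts to swapping the raw_local_irq_save()/restore() pair in the generic macro for preemption control. A userspace sketch with stand-ins for the kernel primitives (all names assumed; the real `raw_cpu_generic_to_op()` also resolves the per-cpu address):

```c
#include <assert.h>

/* Stand-ins for the kernel primitives so the macro shape can be
 * shown and exercised in userspace. */
static int preempt_depth;
static void preempt_disable(void) { preempt_depth++; }
static void preempt_enable(void)  { preempt_depth--; }

/* Stand-in for raw_cpu_generic_to_op(): apply `op val` to the target. */
#define raw_cpu_generic_to_op(pcp, val, op)  \
do { *(pcp) op val; } while (0)

/* Preemption off around the raw op, in place of the
 * raw_local_irq_save()/restore() pair of the generic fallback. */
#define this_cpu_preempt_to_op(pcp, val, op) \
do {                                         \
    preempt_disable();                       \
    raw_cpu_generic_to_op(pcp, val, op);     \
    preempt_enable();                        \
} while (0)
```

The caveat from earlier in the thread still applies: this variant is only correct if no counter touched through it is also updated from interrupt context, which is the "no counters incremented during an irqoff section" agreement being discussed here.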
* Re: per-cpu thoughts 2019-02-22 15:36 ` Nick Kossifidis 2019-02-22 15:56 ` Christopher Lameter @ 2019-02-22 19:48 ` Björn Töpel 2019-02-22 20:53 ` Nick Kossifidis 1 sibling, 1 reply; 41+ messages in thread From: Björn Töpel @ 2019-02-22 19:48 UTC (permalink / raw) To: Nick Kossifidis Cc: Christopher Lameter, Palmer Dabbelt, Paul Walmsley, linux-riscv

On Fri, 22 Feb 2019 at 16:36, Nick Kossifidis <mick@ics.forth.gr> wrote:
>
[...]
>
> How about the scratch registers defined on priv spec ? We already use
> sscratch on Linux for keeping per-hart kernel data structures (check out
> entry.S) and mscratch on OpenSBI for keeping per-hart state etc.

Hmm, what is the "tp" register used for in kernel space today?
* Re: per-cpu thoughts 2019-02-22 19:48 ` Björn Töpel @ 2019-02-22 20:53 ` Nick Kossifidis 0 siblings, 0 replies; 41+ messages in thread From: Nick Kossifidis @ 2019-02-22 20:53 UTC (permalink / raw) To: Björn Töpel Cc: Nick Kossifidis, Christopher Lameter, Palmer Dabbelt, Paul Walmsley, linux-riscv

On 2019-02-22 21:48, Björn Töpel wrote:
> On Fri, 22 Feb 2019 at 16:36, Nick Kossifidis <mick@ics.forth.gr>
> wrote:
>
> [...]
>
>> How about the scratch registers defined on priv spec ? We already use
>> sscratch on Linux for keeping per-hart kernel data structures (check out
>> entry.S) and mscratch on OpenSBI for keeping per-hart state etc.
>
> Hmm, what is the "tp" register used for in kernel space today?

In this case it's used for task_struct.

https://elixir.bootlin.com/linux/v5.0-rc7/source/include/linux/sched.h#L592

Note that its first element is thread_info; that's why you see the TI_* macros being used for getting offsets on tp.
end of thread, other threads:[~2019-03-22 17:57 UTC | newest]

Thread overview: 41+ messages
2019-02-20 19:57 per-cpu thoughts Björn Töpel
2019-02-21 15:57 ` Christopher Lameter
2019-02-21 16:28 ` Paul Walmsley
2019-02-21 17:24 ` Björn Töpel
2019-02-21 17:49 ` Paul Walmsley
2019-02-21 19:40 ` Palmer Dabbelt
2019-02-22 15:04 ` Christopher Lameter
2019-02-22 15:36 ` Nick Kossifidis
2019-02-22 15:56 ` Christopher Lameter
2019-02-22 19:47 ` Björn Töpel
2019-02-22 19:56 ` Christopher Lameter
2019-02-28 12:20 ` Paul Walmsley
2019-02-28 17:58 ` Christopher Lameter
2019-02-28 18:42 ` Paul Walmsley
2019-02-28 19:09 ` Christopher Lameter
2019-02-28 20:21 ` Paul Walmsley
2019-03-01 1:13 ` Christopher Lameter
2019-03-08 7:17 ` Björn Töpel
2019-03-11 13:22 ` Palmer Dabbelt
2019-03-11 14:48 ` Björn Töpel
2019-03-11 14:56 ` Christopher Lameter
2019-03-11 15:05 ` Björn Töpel
2019-03-11 15:26 ` Paul Walmsley
2019-03-11 16:48 ` Mark Rutland
2019-03-11 18:39 ` Paul Walmsley
2019-03-12 11:23 ` Mark Rutland
2019-03-12 16:01 ` Paul Walmsley
2019-03-12 17:34 ` Christopher Lameter
2019-03-12 4:26 ` Christopher Lameter
2019-03-12 14:21 ` Paul Walmsley
2019-03-12 17:42 ` Christopher Lameter
2019-03-12 17:59 ` Gary Guo
2019-03-13 18:58 ` Christopher Lameter
2019-03-13 20:15 ` Paul Walmsley
2019-03-22 14:51 ` Nick Kossifidis
2019-03-22 17:57 ` Christopher Lameter
2019-03-11 15:51 ` Christopher Lameter
2019-03-11 16:35 ` Björn Töpel
2019-03-12 4:22 ` Christopher Lameter
2019-02-22 19:48 ` Björn Töpel
2019-02-22 20:53 ` Nick Kossifidis