On 21.03.23 16:35, Jan Beulich wrote: > On 21.03.2023 15:57, Roger Pau Monné wrote: >> On Mon, Mar 20, 2023 at 10:48:45AM +0100, Jan Beulich wrote: >>> On 17.03.2023 13:26, Andrew Cooper wrote: >>>> On 17/03/2023 11:22 am, Roger Pau Monné wrote: >>>>> On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote: >>>>>> This is faster than using the software implementation, and the insn is >>>>>> available on all half-way recent hardware. Therefore convert >>>>>> generic_hweight() to out-of-line functions (without affecting Arm) >>>>>> and use alternatives patching to replace the function calls. >>>>>> >>>>>> Note that the approach doesn#t work for clang, due to it not recognizing >>>>>> -ffixed-*. >>>>> I've been giving this a look, and I wonder if it would be fine to >>>>> simply push and pop the scratch registers in the 'call' path of the >>>>> alternative, as that won't require any specific compiler option. >>> >>> Hmm, ... >>> >>>> It's been a long while, and in that time I've learnt a lot more about >>>> performance, but my root objection to the approach taken here still >>>> stands - it is penalising the common case to optimise some pointless >>>> corner cases. >>>> >>>> Yes - on the call path, an extra push/pop pair (or few) to get temp >>>> registers is basically free. >>> >>> ... what is "a few"? We'd need to push/pop all call-clobbered registers >>> except %rax, i.e. a total of eight. I consider this too much. Unless, >>> as you suggest further down, we wrote the fallback in assembly. Which I >>> have to admit I'm surprised you propose when we strive to reduce the >>> amount of assembly we have to maintain. >> >> AMD added popcnt in 2007 and Intel in 2008. While we shouldn't >> mandate popcnt support, I think we also shouldn't overly worry about >> the non-popcnt path. > > We agree here. > >> Also, how can you assert that the code generated without the scratch >> registers being usable won't be worse than the penalty of pushing and >> popping such registers on the stack and letting the routines use all >> registers freely? > > Irrespective of the -ffixed-* the compiler can make itself sufficiently > many registers available to use, by preserving just the ones it actually > uses. So that's pretty certainly not more PUSH/POP than when we would > blindly preserve all which may need preserving (no matter whether > they're actually used). > > Yet then both this question and ... > >> I very much prefer to have a non-optimal non-popcnt path, but have >> popcnt support for both gcc and clang, and without requiring any >> compiler options. > > ... this makes me wonder whether we're thinking of fundamentally > different generated code that we would end up with. Since the > (apparently agreed upon) goal is to optimize for the "popcnt > available" case, mainline code should be not significantly longer that > what's needed for the single instruction itself, or alternatively a > CALL insn. Adding pushes/pops of all call clobbered registers around > the CALL would grow mainline code too much (for my taste), i.e. > irrespective of these PUSH/POP all getting NOP-ed out by alternatives > patching. (As an aside I consider fiddling with the stack pointer in > inline asm() risky, and hence I would prefer to avoid such whenever > possible. Yes, it can be written so it's independent of what the > compiler thinks the stack pointer points at, but such constructs are > fragile as soon as one wants to change them a little later on.) > > Are you perhaps thinking of indeed having the PUSH/POP in mainline > code? Or some entirely different model? > > What I could see us doing to accommodate Clang is to have wrappers > around the actual functions which do as you suggest and preserve all > relevant registers, no matter whether they're used. You could just copy the PV_CALLEE_SAVE_REGS_THUNK() macro from the Linux kernel, which is doing exactly that. Juergen