On Mon, 16 Oct 2023 at 12:29, Sean Christopherson wrote:
>
> > Are we certain that ucode on modern x86 CPUs check CR4 for every affected
> > instruction?
>
> Not certain at all. I agree the CR4.FSGSBASE thing could be a complete non-issue
> and was just me speculating.

Note that my timings on two fairly different arches do put the cost of
'rdgsbase' at 2 cycles, so it's not microcoded in the sense of jumping
off to some microcode sequence that has a noticeable overhead.

So it's almost certainly what Intel calls a "complex decoder" case
that generates up to 4 uops inline and only decodes in the first
decode slot.

One of the uops could easily be a cr4 check, that's not an uncommon
thing for those kinds of instructions.

If somebody wants to try my truly atrocious test program on other
machines, go right ahead. It's attached. I'm not proud of it. It's a
hack.

Do something like this:

    $ gcc -O2 t.c
    $ ./a.out
    "nop"=0l: 0.380925
    "nop"=0l: 0.380640
    "nop"=0l: 0.380373
    "mov %1,%0":"=r"(base):"m"(zero)=0l: 0.787984
    "rdgsbase %0":"=r"(base)=0l: 2.626625

and you'll see that a no-op takes about a third of a cycle on my Zen 2
core (according to this truly stupid benchmark). With some small
overhead.

And a "mov memory to register" shows up as ~3/4 cycle, but it's
probably really that the core can do two of them per cycle, and then
the chain of adds (see how that benchmark makes sure the result is
"used") adds some more overhead etc.

And the 'rdgsbase' is about two cycles, and presumably is fully
serialized, so all the loop overhead and adding results then shows up
as that extra .6 of a cycle on average.

But doing cycle estimations on OoO machines is "guess rough patterns",
so take all the above with a big pinch of salt. And feel free to test
it on other cores than the ones I did (Intel Skylake and AMD Zen 2).
You might want to put your machine into "performance" mode or other
things to actually make it run at the highest frequency to get more
repeatable numbers.

The Skylake core does better on the nops (I think Intel gets rid of
them earlier in the decode stages and they basically disappear in the
uop cache), and can do three loads per cycle. So rdgsbase looks
relatively slower on my Skylake at about 3 cycles per op, but when you
look at an individual instruction, that's a fairly artificial thing.
You don't run these things in the uop cache in reality.

              Linus