Hello,

Intel analysed my Coffee Lake performance numbers, and put the discrepancy down to the microcode being an early alpha build. Since my first email, microcode for Skylake-SP has become available, and is in its production form.

As before, these are raw TSC cycles collected with RDTSCP, and are not comparable with the raw Coffee Lake numbers, because I haven't scaled them by the TSC frequency.

Curiously, with Skylake-SP I see a difference in instruction latency depending on whether the CPU is operating in root mode or non-root mode. I didn't observe this difference with Coffee Lake. (For ease of setup in my test environment, MSR_FLUSH_CMD is only measured in non-root mode.)

Pre microcode:
 * VERW of NUL   => Root: 70-74,   Non-Root: 82-86 cycles
 * VERW of %ds   => Root: 36-40,   Non-Root: 44-48 cycles
 * MSR_FLUSH_CMD =>                Non-Root: 1070-1078 cycles

Post microcode:
 * VERW of NUL   => Root: 394-406, Non-Root: 384-390 cycles
 * VERW of %ds   => Root: 362-370, Non-Root: 352-360 cycles
 * MSR_FLUSH_CMD =>                Non-Root: 1280-1288 cycles

So, in comparison to the very early alpha Coffee Lake microcode, the numbers now favour VERW of %ds in all cases, and the absolute hit of the extra flushing has reduced (by far more than the delta between the raw values). Both of these are consistent with this being better-optimised microcode, and mean that the aforementioned guidance is accurate.

Thanks,

~Andrew