Hello,

Intel analysed my Coffee Lake performance numbers, and put the discrepancy down to the microcode being an early alpha build. Since my first email, microcode for Skylake-SP has become available, and is in its production form.

As before, these are raw TSC cycles collected with RDTSCP, and are not comparable with the raw Coffee Lake numbers, because I haven't scaled them by the TSC frequency.

Curiously, with Skylake-SP I see a difference in instruction latency depending on whether the CPU is operating in root mode or non-root mode. I didn't observe this difference with Coffee Lake. (For ease of setup in my test environment, MSR_FLUSH_CMD is only measured in non-root mode.)

Pre microcode:
 * VERW of NUL   => Root: 70-74,   Non-Root: 82-86 cycles
 * VERW of %ds   => Root: 36-40,   Non-Root: 44-48 cycles
 * MSR_FLUSH_CMD =>                Non-Root: 1070-1078 cycles

Post microcode:
 * VERW of NUL   => Root: 394-406, Non-Root: 384-390 cycles
 * VERW of %ds   => Root: 362-370, Non-Root: 352-360 cycles
 * MSR_FLUSH_CMD =>                Non-Root: 1280-1288 cycles

So, in comparison to the very early alpha Coffee Lake microcode, the numbers now favour VERW of %ds in all cases, and the absolute hit of the extra flushing has reduced (by far more than the delta between the raw values). Both of these are consistent with this being better-optimised microcode, and mean that the aforementioned guidance is accurate.

Thanks,

~Andrew