* [MODERATED] Some microperf tests
@ 2019-02-23 18:26 Andrew Cooper
  2019-02-23 19:30 ` [MODERATED] " Linus Torvalds
  2019-03-07 14:26 ` [MODERATED] Updated " Andrew Cooper
  0 siblings, 2 replies; 5+ messages in thread
From: Andrew Cooper @ 2019-02-23 18:26 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1080 bytes --]

Hello,

So I've finally got my Coffee Lake system and alpha microcode working.

All numbers are the deltas between two RDTSCP instructions, with the
single instruction under test and just enough compiler-inserted mov's to
preserve the output of the first RDTSCP for later calculations.

(Insert some disclaimer about these not being statistically rigorous,
but they do at least give a rough ballpark.)
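
A minimal userspace sketch of the measurement described above, not the
harness actually used for these numbers. The helper names (time_verw,
read_ss, min_verw_cycles) are made up for illustration. It assumes an
x86-64 GCC/Clang toolchain. On 64-bit Linux, userspace %ds is typically
the NUL selector, so the sketch reads %ss as a stand-in for a real
writable data selector.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp() */

/* Cycles between two RDTSCP reads around one VERW of selector `sel`.
 * VERW takes the selector from a 16-bit register or memory operand;
 * the memory form is used here. */
static inline uint64_t time_verw(uint16_t sel)
{
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    __asm__ volatile("verw %0" : : "m"(sel) : "cc", "memory");
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

/* Read the current stack-segment selector (assumed stand-in for %ds). */
static inline uint16_t read_ss(void)
{
    uint16_t sel;
    __asm__ volatile("mov %%ss, %0" : "=r"(sel));
    return sel;
}

/* Smallest delta seen over `iters` runs, to filter out interrupts
 * and cache misses.  Includes the RDTSCP overhead itself. */
static uint64_t min_verw_cycles(uint16_t sel, int iters)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < iters; i++) {
        uint64_t d = time_verw(sel);
        if (d < best)
            best = d;
    }
    return best;
}
```

Calling min_verw_cycles(0, 1000) and min_verw_cycles(read_ss(), 1000)
gives the NUL and real-selector cases respectively. Taking the minimum
is a choice made for this sketch; the figures below are quoted as ranges.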

Pre microcode:
* VERW of NUL   => 65-69 cycles
* VERW of %ds   => 33-37 cycles
* MSR_FLUSH_CMD => 925-980 cycles

Post microcode:
* VERW of NUL   => 512-520 cycles
* VERW of %ds   => 520-540 cycles
* MSR_FLUSH_CMD => 1300-1500 cycles


So, MSR_FLUSH_CMD has got slower, but not by as much as VERW has.
Pre microcode, the "use %ds" advice is clearly a win, but post
microcode, it appears to be fractionally worse.

I've raised the selector question with Intel - it's possible it is a side
effect of this piece of alpha ucode being an early prototype, or that
this particular system is different to most older parts.

Thanks,

~Andrew



* [MODERATED] Re: Some microperf tests
  2019-02-23 18:26 [MODERATED] Some microperf tests Andrew Cooper
@ 2019-02-23 19:30 ` Linus Torvalds
  2019-02-23 20:42   ` Andrew Cooper
  2019-02-24 14:23   ` Andi Kleen
  2019-03-07 14:26 ` [MODERATED] Updated " Andrew Cooper
  1 sibling, 2 replies; 5+ messages in thread
From: Linus Torvalds @ 2019-02-23 19:30 UTC (permalink / raw)
  To: speck

On Sat, Feb 23, 2019 at 10:27 AM speck for Andrew Cooper
<speck@linutronix.de> wrote:
>
>
> Pre microcode:
> * VERW of NUL   => 65-69 cycles
> * VERW of %ds   => 33-37 cycles
>
> Post microcode:
> * VERW of NUL   => 512-520 cycles
> * VERW of %ds   => 520-540 cycles

Ok, those numbers actually make sense to me.

Before the whole "let's use verw for state flushing" issue, the
*normal* use of verw would have been for an actual used segment, and
making the microcode optimize the branches for that case would have
made sense.

Admittedly nobody really uses verw for that reason any more, but from
a legacy standpoint it would seem to be sensible. Maybe old Windows
models really did use verw regularly on real loads.

After the microcode changes, that's no longer true, and using verw on
a real descriptor is pointless, because the only real expected use of
that instruction is flushing, and avoiding the load of the actual
segment value from the LDT/GDT should be the fast case.

So those numbers are actually sensible.

Of course, the fact that verw on a NUL descriptor is so much slower in
the old case is very inconvenient for the "we should do verw even if
the CPU says it doesn't have the microcode update, for vmware reasons",
aka vmverw.

So it would be good to verify that

 (a) yes, this is the intended performance profile from intel

 (b) we probably should give a NUL descriptor for the workaround

 (c) it hurts the vmverw case, but maybe we can only do vmverw when we
notice we are actually running under vmware.

Is there any way to do that vmware detection?

                 Linus


* [MODERATED] Re: Some microperf tests
  2019-02-23 19:30 ` [MODERATED] " Linus Torvalds
@ 2019-02-23 20:42   ` Andrew Cooper
  2019-02-24 14:23   ` Andi Kleen
  1 sibling, 0 replies; 5+ messages in thread
From: Andrew Cooper @ 2019-02-23 20:42 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1283 bytes --]

On 23/02/2019 19:30, speck for Linus Torvalds wrote:
> So those numbers are actually sensible.

The old behaviour has always made sense in terms of how legacy software
was expected to use it.

However, given the recommendation, I wasn't certain that the new
behaviour is deliberate.  My first thought was whether it was an
unintended consequence of the NUL selector not resulting in a memory
access, while the pipeline buffers are being frobbed.

> Of course, the fact that verw on a NUL descriptor is so much slower in
> the old case is very inconvenient for the "we should do verw even if
> the CPU says it doesn't have the microcode update, for vmware reasons",
> aka vmverw.
>
> So it would be good to verify that
>
>  (a) yes, this is the intended performance profile from intel
>
>  (b) we probably should give a NUL descriptor for the workaround

As I said - I've currently got these questions open with the relevant
people.  I'll feed back the reply.

>  (c) it hurts the vmverw case, but maybe we can only do vmverw when we
> notice we are actually running under vmware.
>
> Is there any way to do that vmware detection?

CPUID leaves at 0x40000000?  Not necessarily perfect, but will cover the
overwhelming majority of cases.
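
As a rough userspace illustration of that check (a sketch, assuming
GCC/Clang's <cpuid.h>; the function names are hypothetical and this is
not kernel code): test the hypervisor bit in CPUID.1:ECX[31], then read
the 12-byte vendor string from EBX/ECX/EDX of leaf 0x40000000.

```c
#include <cpuid.h>    /* __cpuid() macro from GCC/Clang */
#include <string.h>

/* Returns 1 and fills `vendor` (12 chars + NUL) if CPUID advertises a
 * hypervisor, else returns 0 and leaves `vendor` empty. */
static int hypervisor_vendor(char vendor[13])
{
    unsigned int eax, ebx, ecx, edx;

    vendor[0] = '\0';

    /* CPUID.1:ECX bit 31 is reserved for hypervisors to set. */
    __cpuid(1, eax, ebx, ecx, edx);
    if (!(ecx & (1u << 31)))
        return 0;

    /* Leaf 0x40000000: vendor signature in EBX, ECX, EDX. */
    __cpuid(0x40000000, eax, ebx, ecx, edx);
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &ecx, 4);
    memcpy(vendor + 8, &edx, 4);
    vendor[12] = '\0';
    return 1;
}

/* VMware reports the signature "VMwareVMware". */
static int is_vmware(void)
{
    char vendor[13];
    return hypervisor_vendor(vendor) &&
           strcmp(vendor, "VMwareVMware") == 0;
}
```

A hypervisor that hides the CPUID bit or the leaf will not be detected
this way, which is the "not necessarily perfect" caveat.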

~Andrew



* [MODERATED] Re: Some microperf tests
  2019-02-23 19:30 ` [MODERATED] " Linus Torvalds
  2019-02-23 20:42   ` Andrew Cooper
@ 2019-02-24 14:23   ` Andi Kleen
  1 sibling, 0 replies; 5+ messages in thread
From: Andi Kleen @ 2019-02-24 14:23 UTC (permalink / raw)
  To: speck

> So it would be good to verify that
> 
>  (a) yes, this is the intended performance profile from intel
> 
>  (b) we probably should give a NUL descriptor for the workaround

AFAIK %ds was expected to be faster than 0. I can ask about why
this benchmark disagrees. 

The Intel recommendation was to use (%rsp).

There are also other CPUs with slightly different microcode
than what was tested here.

> 
>  (c) it hurts the vmverw case, but maybe we can only do vmverw when we
> notice we are actually running under vmware.
> 
> Is there any way to do that vmware detection?

We already have VMware detection, I thought:

arch/x86/kernel/cpu/vmware.c:
static uint32_t __init vmware_platform(void)
{
        if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
                unsigned int eax;
                unsigned int hyper_vendor_id[3];

                cpuid(CPUID_VMWARE_INFO_LEAF, &eax, &hyper_vendor_id[0],
                      &hyper_vendor_id[1], &hyper_vendor_id[2]);
                if (!memcmp(hyper_vendor_id, "VMwareVMware", 12))
                        return CPUID_VMWARE_INFO_LEAF;
        } else if (dmi_available && dmi_name_in_serial("VMware") &&
                   __vmware_platform())
                return 1;

        return 0;
}

BTW it could also be some other hypervisor which doesn't export
the CPUID. Even in KVM/qemu it will need a configuration change
and update.

I suppose it could just be made conditional on X86_FEATURE_HYPERVISOR.

-Andi


* [MODERATED] Updated microperf tests
  2019-02-23 18:26 [MODERATED] Some microperf tests Andrew Cooper
  2019-02-23 19:30 ` [MODERATED] " Linus Torvalds
@ 2019-03-07 14:26 ` Andrew Cooper
  1 sibling, 0 replies; 5+ messages in thread
From: Andrew Cooper @ 2019-03-07 14:26 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1428 bytes --]

Hello,

Intel analysed my Coffee Lake performance numbers, and put the
discrepancy down to being an early alpha build.  Since my first email,
microcode for Skylake-SP has become available, and is in its production
form.

Like before, these are raw TSC cycles collected with RDTSCP, and are not
comparable with the raw Coffee Lake numbers, because I haven't scaled
them by the TSC frequency.
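
For illustration, the scaling that would make two machines' numbers
comparable is a division by the TSC frequency. A minimal sketch follows.
The helper name is hypothetical, and the TSC frequency has to be supplied
by the caller because neither system's frequency is given in this thread.

```c
#include <stdint.h>

/* Convert a raw TSC cycle delta to nanoseconds.
 * tsc_hz is the TSC frequency in Hz; it is an input to this sketch,
 * not a value taken from either test system. */
static double tsc_cycles_to_ns(uint64_t cycles, double tsc_hz)
{
    return (double)cycles * 1e9 / tsc_hz;
}
```

With a purely hypothetical 3 GHz TSC, 3000 cycles would come out as
1000 ns.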

Curiously, with Skylake-SP I see a difference in instruction latency
depending on whether it is operating in root mode or non-root mode.  I
didn't observe this difference with Coffee Lake.  (For reasons of ease of
my test environment, MSR_FLUSH_CMD is only measured in Non-Root mode.)

Pre microcode:
* VERW of NUL   => Root: 70-74, Non-Root: 82-86 cycles
* VERW of %ds   => Root: 36-40, Non-Root: 44-48 cycles
* MSR_FLUSH_CMD => Non-Root: 1070-1078 cycles

Post microcode:
* VERW of NUL   => Root: 394-406, Non-Root: 384-390 cycles
* VERW of %ds   => Root: 362-370, Non-Root: 352-360 cycles
* MSR_FLUSH_CMD => Non-Root: 1280-1288 cycles


So, in comparison to the very early Coffee Lake alpha ucode, the numbers
now favour VERW of %ds in all cases, and the absolute hit of the extra
flushing has reduced (by far more than the delta between raw values).

Both of these are consistent with this being better optimised microcode,
and mean that the aforementioned guidance is accurate.
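
As a worked example of the "absolute hit" arithmetic, here is a small
sketch that takes the midpoint of each quoted range and subtracts pre
from post. Using midpoints is an assumption made for illustration, and
the results are still raw TSC cycles, so the cross-system caveat above
applies.

```c
/* A measured range of raw TSC cycles, as quoted in the lists above. */
struct range { double lo, hi; };

static double midpoint(struct range r)
{
    return (r.lo + r.hi) / 2.0;
}

/* Extra cycles added by the microcode, midpoint to midpoint. */
static double added_cost(struct range pre, struct range post)
{
    return midpoint(post) - midpoint(pre);
}
```

For VERW of %ds this gives 530 - 35 = 495 cycles on the Coffee Lake
alpha ucode, against 366 - 38 = 328 cycles on Skylake-SP in root mode.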

Thanks,

~Andrew

