linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
@ 2018-11-18 14:03 Samuel Neves
  2018-11-20 13:20 ` Li, Aubrey
  2018-11-26  3:36 ` Li, Aubrey
  0 siblings, 2 replies; 10+ messages in thread
From: Samuel Neves @ 2018-11-18 14:03 UTC (permalink / raw)
  To: aubrey.li, dave.hansen, aubrey.li, Thomas Gleixner, Ingo Molnar,
	peterz, H. Peter Anvin
  Cc: ak, tim.c.chen, arjan, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3978 bytes --]

On 11/17/18 12:36 AM, Li, Aubrey wrote:
> On 2018/11/17 7:10, Dave Hansen wrote:
>> Just to be clear: there are 3 AVX-512 XSAVE states:
>>
>>          XFEATURE_OPMASK,
>>          XFEATURE_ZMM_Hi256,
>>          XFEATURE_Hi16_ZMM,
>>
>> I honestly don't know what XFEATURE_OPMASK does.  It does not appear to
>> be affected by VZEROUPPER (although VZEROUPPER's SDM documentation isn't
>> looking too great).

XFEATURE_OPMASK refers to the additional 8 mask registers used in
AVX512. These are more similar to general purpose registers than
vector registers, and should not be too relevant here.
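To make the three components concrete, here is a rough sketch of the register state each one covers. The struct and function names are hypothetical (they are not kernel identifiers); the register counts and widths are the architectural ones.

```c
#include <stddef.h>

/* Hypothetical descriptor for illustration; not a kernel structure. */
struct avx512_component {
  const char *name;
  unsigned    nregs;       /* number of registers covered */
  unsigned    bytes_each;  /* bytes of state saved per register */
};

static const struct avx512_component avx512_components[] = {
  { "OPMASK",     8,  8 },  /* k0-k7, 64 bits each */
  { "ZMM_Hi256", 16, 32 },  /* upper 256 bits of zmm0-zmm15 */
  { "Hi16_ZMM",  16, 64 },  /* all 512 bits of zmm16-zmm31 */
};

/* Total bytes of register state described by the table above. */
static size_t avx512_state_bytes(void)
{
  size_t total = 0;
  size_t n = sizeof(avx512_components) / sizeof(avx512_components[0]);

  for (size_t i = 0; i < n; i++)
    total += (size_t)avx512_components[i].nregs * avx512_components[i].bytes_each;
  return total;
}
```

The opmask state is tiny compared to the two ZMM components, which fits its role as a set of predicate registers rather than vector data.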

>>
>> But, XFEATURE_ZMM_Hi256 is used for the upper 256 bits of the
>> registers ZMM0-ZMM15.  Those are AVX-512-only registers.  The only way
>> to get data into XFEATURE_ZMM_Hi256 state is by using AVX512 instructions.
>>
>> XFEATURE_Hi16_ZMM is the same.  The only way to get state in there is
>> with AVX512 instructions.
>>
>> So, first of all, I think you *MUST* check XFEATURE_ZMM_Hi256 and
>> XFEATURE_Hi16_ZMM.  That's without question.
>
> No, XFEATURE_ZMM_Hi256 does not request turbo license 2, so it's of less
> interest to us.
>

I think Dave is right, and it's easy enough to check this. See the
attached program. For the "high current" instruction vpmuludq
operating on zmm0--zmm3 registers, we have (on a Skylake-SP Xeon Gold
5120)

           175,097      core_power.lvl0_turbo_license:u
                     ( +-  2.18% )
            41,185      core_power.lvl1_turbo_license:u
                     ( +-  1.55% )
        83,928,648      core_power.lvl2_turbo_license:u
                     ( +-  0.00% )

while for the same code operating on zmm28--zmm31 registers, we have

           163,507      core_power.lvl0_turbo_license:u
                     ( +-  6.85% )
            47,390      core_power.lvl1_turbo_license:u
                     ( +- 12.25% )
        83,927,735      core_power.lvl2_turbo_license:u
                     ( +-  0.00% )

In other words, the register index does not seem to matter at all for
turbo license purposes (this makes sense, considering these chips have
168 vector registers internally; zmm16--zmm31 are simply newly exposed
architectural registers).
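To compare the two runs more directly, the counts can be turned into the fraction of cycles spent at each license level. This is only a sketch; the struct and helper names are made up for illustration, and the numbers are the ones reported above.

```c
#include <stdint.h>

/* Hypothetical container for the three core_power.lvlN_turbo_license counts. */
struct license_cycles {
  uint64_t lvl0, lvl1, lvl2;
};

/* Fraction of the counted cycles spent at the given license level (0, 1 or 2). */
static double license_share(const struct license_cycles *c, int level)
{
  uint64_t total = c->lvl0 + c->lvl1 + c->lvl2;
  uint64_t n = level == 0 ? c->lvl0 : level == 1 ? c->lvl1 : c->lvl2;

  return total ? (double)n / (double)total : 0.0;
}

/* Counts reported above for vpmuludq on zmm0--zmm3 and on zmm28--zmm31. */
static const struct license_cycles zmm_lo = { 175097, 41185, 83928648 };
static const struct license_cycles zmm_hi = { 163507, 47390, 83927735 };
```

With these inputs, license_share(&zmm_lo, 2) and license_share(&zmm_hi, 2) both come out above 99.7%, so the two runs are indistinguishable as far as license level 2 is concerned.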

We can also see that XFEATURE_Hi16_ZMM does not imply license 1 or 2;
we may be using xmm16--xmm31 purely for the convenient extra register
space. For example, cases 4 and 5 of the sample program:

        84,064,239      core_power.lvl0_turbo_license:u
                     ( +-  0.00% )
                 0      core_power.lvl1_turbo_license:u
                 0      core_power.lvl2_turbo_license:u

        84,060,625      core_power.lvl0_turbo_license:u
                     ( +-  0.00% )
                 0      core_power.lvl1_turbo_license:u
                 0      core_power.lvl2_turbo_license:u

So what's most important is the width of the vectors being used, not
the instruction set or the register index. Second to that is the
instruction type, namely whether those are "heavy" instructions.
Neither of these things can be accurately captured by the XSAVE state.

>>
>> It's probably *possible* to run AVX512 instructions by loading state
>> into the YMM register and then executing AVX512 instructions that only
>> write to memory and never to register state.  That *might* allow
>> XFEATURE_Hi16_ZMM and XFEATURE_ZMM_Hi256 to stay in the init state, but
>> for the frequency to be affected since AVX512 instructions _are_
>> executing.  But, there's no way to detect this situation from XSAVE
>> states themselves.
>>
>
> Andi should have more details on this. FWICT, not all AVX512 instructions
> have high current; those that only touch memory do not cause a notable
> frequency drop.

According to section 15.26 of the Intel optimization reference manual,
"heavy" instructions consist of floating-point and integer
multiplication. Moves, adds, logical operations, etc, will request at
most turbo license 1 when operating on zmm registers.
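The rule above can be written down as a small lookup. This is only a sketch: the enum and function names are hypothetical, and the 256-bit row is an assumption taken from the same manual section, not something measured in this thread.

```c
/* Hypothetical classification for illustration only. */
enum insn_class {
  INSN_LIGHT,  /* moves, adds, logical operations, ... */
  INSN_HEAVY   /* floating-point and integer multiplication */
};

/* Highest turbo license a loop of such instructions is expected to request. */
static int max_turbo_license(unsigned vector_bits, enum insn_class cls)
{
  if (vector_bits >= 512)
    return cls == INSN_HEAVY ? 2 : 1;
  if (vector_bits == 256)
    return cls == INSN_HEAVY ? 1 : 0;  /* assumption, not measured here */
  return 0;                            /* 128-bit and scalar code */
}
```

Note that neither input is a register index, which is why the XSAVE state components cannot stand in for it.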

>
> Thanks,
> -Aubrey
>

[-- Attachment #2: turbo.c --]
[-- Type: text/x-csrc, Size: 2716 bytes --]

#include <stdlib.h>

#define INSN_LOOP_LO(insn, reg) do {                                   \
  asm volatile(                                                        \
    "mov $1<<24,%%rcx;"                                                \
    ".align 32;"                                                       \
    "1:"                                                               \
    #insn " " "%%" #reg "0" "," "%%" #reg "0" "," "%%" #reg "0" ";"    \
    #insn " " "%%" #reg "1" "," "%%" #reg "1" "," "%%" #reg "1" ";"    \
    #insn " " "%%" #reg "2" "," "%%" #reg "2" "," "%%" #reg "2" ";"    \
    #insn " " "%%" #reg "3" "," "%%" #reg "3" "," "%%" #reg "3" ";"    \
    "dec %%rcx;"                                                       \
    "jnz 1b;"                                                          \
    ::: "rcx"                                                          \
  );                                                                   \
} while(0)

#define INSN_LOOP_HI(insn, reg) do {                                   \
  asm volatile(                                                        \
    "mov $1<<24,%%rcx;"                                                \
    ".align 32;"                                                       \
    "1:"                                                               \
    #insn " " "%%" #reg "31" "," "%%" #reg "31" "," "%%" #reg "31" ";" \
    #insn " " "%%" #reg "30" "," "%%" #reg "30" "," "%%" #reg "30" ";" \
    #insn " " "%%" #reg "29" "," "%%" #reg "29" "," "%%" #reg "29" ";" \
    #insn " " "%%" #reg "28" "," "%%" #reg "28" "," "%%" #reg "28" ";" \
    "dec %%rcx;"                                                       \
    "jnz 1b;"                                                          \
    ::: "rcx"                                                          \
  );                                                                   \
} while(0)



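int main(int argc, char ** argv) {
  /* The test case (0-11) is selected by the first command-line argument. */
  if (argc < 2)
    return 1;
  int x = strtoul(argv[1], 0, 10);
  /* vzeroall zeroes zmm0-zmm15 before the selected loop runs. */
  asm volatile("vzeroall");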
  switch(x) {
  case 0:
    INSN_LOOP_LO(vpmuludq, zmm);
    break;
  case 1:
    INSN_LOOP_HI(vpmuludq, zmm);
    break;
  case 2:
    INSN_LOOP_LO(vpmuludq, ymm);
    break;
  case 3:
    INSN_LOOP_HI(vpmuludq, ymm);
    break;
  case 4:
    INSN_LOOP_LO(vpmuludq, xmm);
    break;
  case 5:
    INSN_LOOP_HI(vpmuludq, xmm);
    break;
  case 6:
    INSN_LOOP_LO(vpaddq, zmm);
    break;
  case 7:
    INSN_LOOP_HI(vpaddq, zmm);
    break;
  case 8:
    INSN_LOOP_LO(vpaddq, ymm);
    break;
  case 9:
    INSN_LOOP_HI(vpaddq, ymm);
    break;
  case 10:
    INSN_LOOP_LO(vpaddq, xmm);
    break;
  case 11:
    INSN_LOOP_HI(vpaddq, xmm);
    break;
  }
  return 0;
}



* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-18 14:03 [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks Samuel Neves
@ 2018-11-20 13:20 ` Li, Aubrey
  2018-11-20 14:47   ` Samuel Neves
  2018-11-26  3:36 ` Li, Aubrey
  1 sibling, 1 reply; 10+ messages in thread
From: Li, Aubrey @ 2018-11-20 13:20 UTC (permalink / raw)
  To: Samuel Neves, dave.hansen, aubrey.li, Thomas Gleixner,
	Ingo Molnar, peterz, H. Peter Anvin
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 2018/11/18 22:03, Samuel Neves wrote:
> On 11/17/18 12:36 AM, Li, Aubrey wrote:
>> On 2018/11/17 7:10, Dave Hansen wrote:
>>> Just to be clear: there are 3 AVX-512 XSAVE states:
>>>
>>>          XFEATURE_OPMASK,
>>>          XFEATURE_ZMM_Hi256,
>>>          XFEATURE_Hi16_ZMM,
>>>
>>> I honestly don't know what XFEATURE_OPMASK does.  It does not appear to
>>> be affected by VZEROUPPER (although VZEROUPPER's SDM documentation isn't
>>> looking too great).
> 
> XFEATURE_OPMASK refers to the additional 8 mask registers used in
> AVX512. These are more similar to general purpose registers than
> vector registers, and should not be too relevant here.
> 
>>>
>>> But, XFEATURE_ZMM_Hi256 is used for the upper 256 bits of the
>>> registers ZMM0-ZMM15.  Those are AVX-512-only registers.  The only way
>>> to get data into XFEATURE_ZMM_Hi256 state is by using AVX512 instructions.
>>>
>>> XFEATURE_Hi16_ZMM is the same.  The only way to get state in there is
>>> with AVX512 instructions.
>>>
>>> So, first of all, I think you *MUST* check XFEATURE_ZMM_Hi256 and
>>> XFEATURE_Hi16_ZMM.  That's without question.
>>
>> No, XFEATURE_ZMM_Hi256 does not request turbo license 2, so it's of less
>> interest to us.
>>
> 
> I think Dave is right, and it's easy enough to check this. See the
> attached program. For the "high current" instruction vpmuludq
> operating on zmm0--zmm3 registers, we have (on a Skylake-SP Xeon Gold
> 5120)
> 
>            175,097      core_power.lvl0_turbo_license:u
>                      ( +-  2.18% )
>             41,185      core_power.lvl1_turbo_license:u
>                      ( +-  1.55% )
>         83,928,648      core_power.lvl2_turbo_license:u
>                      ( +-  0.00% )
> 
> while for the same code operating on zmm28--zmm31 registers, we have
> 
>            163,507      core_power.lvl0_turbo_license:u
>                      ( +-  6.85% )
>             47,390      core_power.lvl1_turbo_license:u
>                      ( +- 12.25% )
>         83,927,735      core_power.lvl2_turbo_license:u
>                      ( +-  0.00% )
> 
> In other words, the register index does not seem to matter at all for
> turbo license purposes (this makes sense, considering these chips have
> 168 vector registers internally; zmm16--zmm31 are simply newly exposed
> architectural registers).
> 
> We can also see that XFEATURE_Hi16_ZMM does not imply license 1 or 2;
> we may be using xmm16--xmm31 purely for the convenient extra register
> space. For example, cases 4 and 5 of the sample program:
> 
>         84,064,239      core_power.lvl0_turbo_license:u
>                      ( +-  0.00% )
>                  0      core_power.lvl1_turbo_license:u
>                  0      core_power.lvl2_turbo_license:u
> 
>         84,060,625      core_power.lvl0_turbo_license:u
>                      ( +-  0.00% )
>                  0      core_power.lvl1_turbo_license:u
>                  0      core_power.lvl2_turbo_license:u
> 

Thanks for your program, Samuel; it's very helpful. But I saw different
output on my side. May I have your glibc version?

Thanks,
-Aubrey

> So what's most important is the width of the vectors being used, not
> the instruction set or the register index. Second to that is the
> instruction type, namely whether those are "heavy" instructions.
> Neither of these things can be accurately captured by the XSAVE state.
> 
>>>
>>> It's probably *possible* to run AVX512 instructions by loading state
>>> into the YMM register and then executing AVX512 instructions that only
>>> write to memory and never to register state.  That *might* allow
>>> XFEATURE_Hi16_ZMM and XFEATURE_ZMM_Hi256 to stay in the init state, but
>>> for the frequency to be affected since AVX512 instructions _are_
>>> executing.  But, there's no way to detect this situation from XSAVE
>>> states themselves.
>>>
>>
>> Andi should have more details on this. FWICT, not all AVX512 instructions
>> have high current; those that only touch memory do not cause a notable
>> frequency drop.
> 
> According to section 15.26 of the Intel optimization reference manual,
> "heavy" instructions consist of floating-point and integer
> multiplication. Moves, adds, logical operations, etc, will request at
> most turbo license 1 when operating on zmm registers.
> 
>>
>> Thanks,
>> -Aubrey
>>



* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-20 13:20 ` Li, Aubrey
@ 2018-11-20 14:47   ` Samuel Neves
  0 siblings, 0 replies; 10+ messages in thread
From: Samuel Neves @ 2018-11-20 14:47 UTC (permalink / raw)
  To: aubrey.li
  Cc: dave.hansen, aubrey.li, Thomas Gleixner, Ingo Molnar, peterz,
	H. Peter Anvin, ak, tim.c.chen, Arjan van de Ven, linux-kernel

On Tue, Nov 20, 2018 at 1:32 PM Li, Aubrey <aubrey.li@linux.intel.com> wrote:
> Thanks for your program, Samuel; it's very helpful. But I saw different
> output on my side. May I have your glibc version?
>

This system is running glibc 2.27, and kernel 4.18.7. The Xeon Gold
5120 also happens to be one of the Skylake-SP models with a single
512-bit FMA unit, instead of 2.

Samuel.


* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-18 14:03 [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks Samuel Neves
  2018-11-20 13:20 ` Li, Aubrey
@ 2018-11-26  3:36 ` Li, Aubrey
  1 sibling, 0 replies; 10+ messages in thread
From: Li, Aubrey @ 2018-11-26  3:36 UTC (permalink / raw)
  To: Samuel Neves, dave.hansen, aubrey.li, Thomas Gleixner,
	Ingo Molnar, peterz, H. Peter Anvin
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 2018/11/18 22:03, Samuel Neves wrote:
> On 11/17/18 12:36 AM, Li, Aubrey wrote:
>> On 2018/11/17 7:10, Dave Hansen wrote:
>>> Just to be clear: there are 3 AVX-512 XSAVE states:
>>>
>>>          XFEATURE_OPMASK,
>>>          XFEATURE_ZMM_Hi256,
>>>          XFEATURE_Hi16_ZMM,
>>>
>>> I honestly don't know what XFEATURE_OPMASK does.  It does not appear to
>>> be affected by VZEROUPPER (although VZEROUPPER's SDM documentation isn't
>>> looking too great).
> 
> XFEATURE_OPMASK refers to the additional 8 mask registers used in
> AVX512. These are more similar to general purpose registers than
> vector registers, and should not be too relevant here.
> 
>>>
>>> But, XFEATURE_ZMM_Hi256 is used for the upper 256 bits of the
>>> registers ZMM0-ZMM15.  Those are AVX-512-only registers.  The only way
>>> to get data into XFEATURE_ZMM_Hi256 state is by using AVX512 instructions.
>>>
>>> XFEATURE_Hi16_ZMM is the same.  The only way to get state in there is
>>> with AVX512 instructions.
>>>
>>> So, first of all, I think you *MUST* check XFEATURE_ZMM_Hi256 and
>>> XFEATURE_Hi16_ZMM.  That's without question.
>>
>> No, XFEATURE_ZMM_Hi256 does not request turbo license 2, so it's of less
>> interest to us.
>>
> 
> I think Dave is right, and it's easy enough to check this. See the
> attached program. For the "high current" instruction vpmuludq
> operating on zmm0--zmm3 registers, we have (on a Skylake-SP Xeon Gold
> 5120)
> 
>            175,097      core_power.lvl0_turbo_license:u
>                      ( +-  2.18% )
>             41,185      core_power.lvl1_turbo_license:u
>                      ( +-  1.55% )
>         83,928,648      core_power.lvl2_turbo_license:u
>                      ( +-  0.00% )
> 
> while for the same code operating on zmm28--zmm31 registers, we have
> 
>            163,507      core_power.lvl0_turbo_license:u
>                      ( +-  6.85% )
>             47,390      core_power.lvl1_turbo_license:u
>                      ( +- 12.25% )
>         83,927,735      core_power.lvl2_turbo_license:u
>                      ( +-  0.00% )
> 
> In other words, the register index does not seem to matter at all for
> turbo license purposes (this makes sense, considering these chips have
> 168 vector registers internally; zmm16--zmm31 are simply newly exposed
> architectural registers).
> 
> We can also see that XFEATURE_Hi16_ZMM does not imply license 1 or 2;
> we may be using xmm16--xmm31 purely for the convenient extra register
> space. For example, cases 4 and 5 of the sample program:
> 
>         84,064,239      core_power.lvl0_turbo_license:u
>                      ( +-  0.00% )
>                  0      core_power.lvl1_turbo_license:u
>                  0      core_power.lvl2_turbo_license:u
> 
>         84,060,625      core_power.lvl0_turbo_license:u
>                      ( +-  0.00% )
>                  0      core_power.lvl1_turbo_license:u
>                  0      core_power.lvl2_turbo_license:u
> 
> So what's most important is the width of the vectors being used, not
> the instruction set or the register index. Second to that is the
> instruction type, namely whether those are "heavy" instructions.
> Neither of these things can be accurately captured by the XSAVE state.
> 

OK, since license 2 is the only level we care about, AVX512 is a requirement.
Is there any case where non-AVX512 code produces license 2? If not, I'm
going to use

#define XFEATURE_MASK_AVX512            (XFEATURE_MASK_OPMASK \
                                         | XFEATURE_MASK_ZMM_Hi256 \
                                         | XFEATURE_MASK_Hi16_ZMM)

to expose AVX512 component usage. Although AVX512 usage is not a sufficient
condition for license 2, it could be a useful hint for a user tool to go on
and check the PMU counters. (That is, if the hint is zero, there is no need
for a further check.)
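A user-space sketch of how such a hint could be derived from an xfeatures bitmap. The helper name avx512_usage_hint() is made up for illustration, and the mask values are spelled out here (bits 5, 6 and 7 of the state-component bitmap, as assigned in the SDM) only so that the example compiles outside the kernel.

```c
#include <stdint.h>

/* Bit positions of the AVX-512 state components in the XSAVE bitmap. */
#define XFEATURE_MASK_OPMASK     (1ULL << 5)
#define XFEATURE_MASK_ZMM_Hi256  (1ULL << 6)
#define XFEATURE_MASK_Hi16_ZMM   (1ULL << 7)

#define XFEATURE_MASK_AVX512     (XFEATURE_MASK_OPMASK \
                                  | XFEATURE_MASK_ZMM_Hi256 \
                                  | XFEATURE_MASK_Hi16_ZMM)

/*
 * Hypothetical helper: returns 1 if any AVX-512 component is marked as
 * in use (not in its init state), 0 otherwise.
 */
static int avx512_usage_hint(uint64_t xfeatures)
{
  return (xfeatures & XFEATURE_MASK_AVX512) != 0;
}
```

A zero hint means none of the three components was in use, so there is nothing to check; a nonzero hint only says that the PMU counters are worth a look.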

What do you think?

Thanks,
-Aubrey


* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-16 23:10     ` Dave Hansen
@ 2018-11-17  0:36       ` Li, Aubrey
  0 siblings, 0 replies; 10+ messages in thread
From: Li, Aubrey @ 2018-11-17  0:36 UTC (permalink / raw)
  To: Dave Hansen, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 2018/11/17 7:10, Dave Hansen wrote:
> On 11/15/18 4:21 PM, Li, Aubrey wrote:
>> "Core cycles where the core was running with power delivery for license
>> level 2 (introduced in Skylake Server microarchitecture). This includes
>> high current AVX 512-bit instructions."
>>
>> I translated license level 2 to frequency drop.
> 
> BTW, the "high" in that text: "high-current AVX 512-bit instructions" is
> talking about high-current, not "high ... instructions" or high-numbered
> registers.  I think that might be the source of some of the confusion
> about which XSAVE state needs to be examined.
> 
> Just to be clear: there are 3 AVX-512 XSAVE states:
> 
>         XFEATURE_OPMASK,
>         XFEATURE_ZMM_Hi256,
>         XFEATURE_Hi16_ZMM,
> 
> I honestly don't know what XFEATURE_OPMASK does.  It does not appear to
> be affected by VZEROUPPER (although VZEROUPPER's SDM documentation isn't
> looking too great).
> 
> But, XFEATURE_ZMM_Hi256 is used for the upper 256 bits of the
> registers ZMM0-ZMM15.  Those are AVX-512-only registers.  The only way
> to get data into XFEATURE_ZMM_Hi256 state is by using AVX512 instructions.
> 
> XFEATURE_Hi16_ZMM is the same.  The only way to get state in there is
> with AVX512 instructions.
> 
> So, first of all, I think you *MUST* check XFEATURE_ZMM_Hi256 and
> XFEATURE_Hi16_ZMM.  That's without question.

No, XFEATURE_ZMM_Hi256 does not request turbo license 2, so it's of less
interest to us.

> 
> It's probably *possible* to run AVX512 instructions by loading state
> into the YMM register and then executing AVX512 instructions that only
> write to memory and never to register state.  That *might* allow
> XFEATURE_Hi16_ZMM and XFEATURE_ZMM_Hi256 to stay in the init state, but
> for the frequency to be affected since AVX512 instructions _are_
> executing.  But, there's no way to detect this situation from XSAVE
> states themselves.
> 

Andi should have more details on this. FWICT, not all AVX512 instructions
have high current; those that only touch memory do not cause a notable
frequency drop.

Thanks,
-Aubrey


* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-16  0:21   ` Li, Aubrey
  2018-11-16  1:04     ` Dave Hansen
@ 2018-11-16 23:10     ` Dave Hansen
  2018-11-17  0:36       ` Li, Aubrey
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Hansen @ 2018-11-16 23:10 UTC (permalink / raw)
  To: Li, Aubrey, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 11/15/18 4:21 PM, Li, Aubrey wrote:
> "Core cycles where the core was running with power delivery for license
> level 2 (introduced in Skylake Server microarchitecture). This includes
> high current AVX 512-bit instructions."
> 
> I translated license level 2 to frequency drop.

BTW, the "high" in that text: "high-current AVX 512-bit instructions" is
talking about high-current, not "high ... instructions" or high-numbered
registers.  I think that might be the source of some of the confusion
about which XSAVE state needs to be examined.

Just to be clear: there are 3 AVX-512 XSAVE states:

        XFEATURE_OPMASK,
        XFEATURE_ZMM_Hi256,
        XFEATURE_Hi16_ZMM,

I honestly don't know what XFEATURE_OPMASK does.  It does not appear to
be affected by VZEROUPPER (although VZEROUPPER's SDM documentation isn't
looking too great).

But, XFEATURE_ZMM_Hi256 is used for the upper 256 bits of the
registers ZMM0-ZMM15.  Those are AVX-512-only registers.  The only way
to get data into XFEATURE_ZMM_Hi256 state is by using AVX512 instructions.

XFEATURE_Hi16_ZMM is the same.  The only way to get state in there is
with AVX512 instructions.

So, first of all, I think you *MUST* check XFEATURE_ZMM_Hi256 and
XFEATURE_Hi16_ZMM.  That's without question.

It's probably *possible* to run AVX512 instructions by loading state
into the YMM register and then executing AVX512 instructions that only
write to memory and never to register state.  That *might* allow
XFEATURE_Hi16_ZMM and XFEATURE_ZMM_Hi256 to stay in the init state, but
for the frequency to be affected since AVX512 instructions _are_
executing.  But, there's no way to detect this situation from XSAVE
states themselves.


* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-16  0:21   ` Li, Aubrey
@ 2018-11-16  1:04     ` Dave Hansen
  2018-11-16 23:10     ` Dave Hansen
  1 sibling, 0 replies; 10+ messages in thread
From: Dave Hansen @ 2018-11-16  1:04 UTC (permalink / raw)
  To: Li, Aubrey, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 11/15/18 4:21 PM, Li, Aubrey wrote:
> On 2018/11/15 23:40, Dave Hansen wrote:
>> On 11/14/18 3:00 PM, Aubrey Li wrote:
>>> AVX-512 component has 3 states, only Hi16_ZMM state causes notable
>>> frequency drop. Add per task Hi16_ZMM state tracking to context switch.
>>
>> Just curious, but is there any public documentation of this?  It seems
>> really odd to me that something using the same AVX-512 instructions on
>> some low-numbered registers would behave differently than the same
>> instructions on some high-numbered registers.  I'm not saying this is
>> wrong, but it's certainly counter-intuitive and I think that begs for
>> some more explanation.
> 
> Yes, the Intel 64 and IA-32 Architectures Software Developer's Manual
> mentions this under the performance event CORE_POWER.LVL2_TURBO_LICENSE.
> 
> "Core cycles where the core was running with power delivery for license
> level 2 (introduced in Skylake Server microarchitecture). This includes
> high current AVX 512-bit instructions."
> 
> I translated license level 2 to frequency drop.

OK, but that talks about AVX 512 and not specifically about Hi16_ZMM's
impact which is what this patch measures.  Are the Hi16_ZMM intricacies
documented anywhere?


* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-15 15:40 ` Dave Hansen
@ 2018-11-16  0:21   ` Li, Aubrey
  2018-11-16  1:04     ` Dave Hansen
  2018-11-16 23:10     ` Dave Hansen
  0 siblings, 2 replies; 10+ messages in thread
From: Li, Aubrey @ 2018-11-16  0:21 UTC (permalink / raw)
  To: Dave Hansen, Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel

On 2018/11/15 23:40, Dave Hansen wrote:
> On 11/14/18 3:00 PM, Aubrey Li wrote:
>> AVX-512 component has 3 states, only Hi16_ZMM state causes notable
>> frequency drop. Add per task Hi16_ZMM state tracking to context switch.
> 
> Just curious, but is there any public documentation of this?  It seems
> really odd to me that something using the same AVX-512 instructions on
> some low-numbered registers would behave differently than the same
> instructions on some high-numbered registers.  I'm not saying this is
> wrong, but it's certainly counter-intuitive and I think that begs for
> some more explanation.

Yes, the Intel 64 and IA-32 Architectures Software Developer's Manual
mentions this under the performance event CORE_POWER.LVL2_TURBO_LICENSE.

"Core cycles where the core was running with power delivery for license
level 2 (introduced in Skylake Server microarchitecture). This includes
high current AVX 512-bit instructions."

I translated license level 2 to frequency drop.

Thanks,
-Aubrey


* Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
  2018-11-14 23:00 Aubrey Li
@ 2018-11-15 15:40 ` Dave Hansen
  2018-11-16  0:21   ` Li, Aubrey
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Hansen @ 2018-11-15 15:40 UTC (permalink / raw)
  To: Aubrey Li, tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, arjan, linux-kernel, Aubrey Li

On 11/14/18 3:00 PM, Aubrey Li wrote:
> AVX-512 component has 3 states, only Hi16_ZMM state causes notable
> frequency drop. Add per task Hi16_ZMM state tracking to context switch.

Just curious, but is there any public documentation of this?  It seems
really odd to me that something using the same AVX-512 instructions on
some low-numbered registers would behave differently than the same
instructions on some high-numbered registers.  I'm not saying this is
wrong, but it's certainly counter-intuitive and I think that begs for
some more explanation.

> The tracking turns on the usage flag immediately, but requires 3
> consecutive context switches with no usage to clear it. This decay is
> required because AVX-512-using tasks could set Hi16_ZMM state back
> to the init state themselves.

It would be nice to not assume that folks reading this changelog know
what XSAVE 'init states' are.  In fact, that comment you have in the
function below would be great here, but probably shouldn't be in the
comment.

I would say it even more strongly than that:  Part of the best practices
for using AVX-512 is to use the VZEROUPPER instruction to zero out some
state when the AVX-512 operation is finished.  Unlike all the other FPU
and AVX state before it,  this means that the Hi16_ZMM is expected to
frequently transition back to the "init state" in normal use.  This
might cause this detection mechanism to frequently miss tasks that
actually use AVX-512.  To fix that, add a decay.

> Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  arch/x86/include/asm/fpu/internal.h | 26 ++++++++++++++++++++++++++
>  arch/x86/include/asm/fpu/types.h    |  9 +++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index a38bf5a..f382449 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -275,6 +275,31 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
>  		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
>  		     : "memory")
>  
> +#define	HI16ZMM_STATE_DECAY_COUNT	3
> +/*
> + * This function is called during context switch to update Hi16_ZMM state
> + */
> +static inline void update_hi16zmm_state(struct fpu *fpu)
> +{
> +	/*
> +	 * XSAVE header contains a state-component bitmap(xfeatures),
> +	 * which allows software to discover the state of the init
> +	 * optimization used by XSAVEOPT and XSAVES.

I don't think we need the XSAVE background here.  Can you put this in
the changelog?

> +	 * Hi16_ZMM state(one state of AVX-512 component) is tracked here
> +	 * because its usage could cause notable core turbo frequency drop.

I'd leave just this part of the comment.

> +	 * AVX512-using tasks could set Hi16_ZMM state back to the init
> +	 * state themselves. Thus, this tracking mechanism can miss.

Can you make this a stronger statement, just like the changelog?

> +	 * The decay usage ensures that false-negatives do not immediately
> +	 * make a task be considered as not using Hi16_ZMM registers.
> +	 */

To ensure that false-negatives do not immediately show up, decay the
usage count over time.


> +	 *
> +	 * Records the usage of the upper 16 AVX512 registers: ZMM16-ZMM31.
> +	 * A value of non-zero is used to indicate whether there is valid
> +	 * state in these AVX512 registers.
> +	 */

> 
>  	/*
> +	 * @hi16zmm_usage:
> +	 *
> +	 * Records the usage of the upper 16 AVX512 registers: ZMM16-ZMM31.
> +	 * A value of non-zero is used to indicate whether there is valid
> +	 * state in these AVX512 registers.
> +	 */
> +	unsigned char			hi16zmm_usage;
> +

Nit: With the decay, this does not indicate register state.  It
indicates whether the registers recently had state.
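The decay behaviour can be tried out in user space. The sketch below mirrors the update logic from the patch; struct usage_tracker and replay() are hypothetical stand-ins for illustration, not kernel code.

```c
#define HI16ZMM_STATE_DECAY_COUNT 3

/* Stand-in for the hi16zmm_usage field; not the kernel's struct fpu. */
struct usage_tracker {
  unsigned char hi16zmm_usage;
};

/* Same logic as the patch: called once per context switch. */
static void update_usage(struct usage_tracker *t, int hi16zmm_in_use)
{
  if (hi16zmm_in_use)
    t->hi16zmm_usage = HI16ZMM_STATE_DECAY_COUNT;
  else if (t->hi16zmm_usage)
    t->hi16zmm_usage--;
}

/* Replays n per-switch observations and returns the final counter value. */
static unsigned replay(const int *in_use, int n)
{
  struct usage_tracker t = { 0 };

  for (int i = 0; i < n; i++)
    update_usage(&t, in_use[i]);
  return t.hi16zmm_usage;
}
```

Replaying { 1, 0, 0 } leaves the counter nonzero even though the last two switches saw the init state, while { 1, 0, 0, 0 } brings it back to zero. So a nonzero value means the state was seen within the last few context switches, not that it is present right now.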




* [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
@ 2018-11-14 23:00 Aubrey Li
  2018-11-15 15:40 ` Dave Hansen
  0 siblings, 1 reply; 10+ messages in thread
From: Aubrey Li @ 2018-11-14 23:00 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, aubrey.li, linux-kernel, Aubrey Li

User space tools which do automated task placement need information
about AVX-512 usage of tasks, because AVX-512 usage could cause core
turbo frequency drop and impact the running task on the sibling CPU.

The XSAVE header contains a state-component bitmap, which allows software
to discover the state of the init optimization used by XSAVEOPT and
XSAVES. Set bits in the bitmap denote the usage of the components.

The AVX-512 component has 3 states; only the Hi16_ZMM state causes a
notable frequency drop. Add per-task Hi16_ZMM state tracking to context
switch.

The tracking turns on the usage flag immediately, but requires 3
consecutive context switches with no usage to clear it. This decay is
required because AVX-512-using tasks could set Hi16_ZMM state back
to the init state themselves.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86/include/asm/fpu/internal.h | 26 ++++++++++++++++++++++++++
 arch/x86/include/asm/fpu/types.h    |  9 +++++++++
 2 files changed, 35 insertions(+)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index a38bf5a..f382449 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -275,6 +275,31 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
 		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
 		     : "memory")
 
+#define	HI16ZMM_STATE_DECAY_COUNT	3
+/*
+ * This function is called during context switch to update Hi16_ZMM state
+ */
+static inline void update_hi16zmm_state(struct fpu *fpu)
+{
+	/*
+	 * XSAVE header contains a state-component bitmap(xfeatures),
+	 * which allows software to discover the state of the init
+	 * optimization used by XSAVEOPT and XSAVES.
+	 *
+	 * Hi16_ZMM state(one state of AVX-512 component) is tracked here
+	 * because its usage could cause notable core turbo frequency drop.
+	 *
+	 * AVX512-using tasks could set Hi16_ZMM state back to the init
+	 * state themselves. Thus, this tracking mechanism can miss.
+	 * The decay usage ensures that false-negatives do not immediately
+	 * make a task be considered as not using Hi16_ZMM registers.
+	 */
+	if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_Hi16_ZMM)
+		fpu->hi16zmm_usage = HI16ZMM_STATE_DECAY_COUNT;
+	else if (fpu->hi16zmm_usage)
+		fpu->hi16zmm_usage--;
+}
+
 /*
  * This function is called only during boot time when x86 caps are not set
  * up and alternative can not be used yet.
@@ -411,6 +436,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
 {
 	if (likely(use_xsave())) {
 		copy_xregs_to_kernel(&fpu->state.xsave);
+		update_hi16zmm_state(fpu);
 		return 1;
 	}
 
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 202c539..c0c7577 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -303,6 +303,15 @@ struct fpu {
 	unsigned char			initialized;
 
 	/*
+	 * @hi16zmm_usage:
+	 *
+	 * Records the usage of the upper 16 AVX512 registers: ZMM16-ZMM31.
+	 * A value of non-zero is used to indicate whether there is valid
+	 * state in these AVX512 registers.
+	 */
+	unsigned char			hi16zmm_usage;
+
+	/*
 	 * @state:
 	 *
 	 * In-memory copy of all FPU registers that we save/restore
-- 
2.7.4


