Message ID  20191113124654.181224ggherdovich@suse.cz 

State  New, archived 
On Wed, Nov 13, 2019 at 01:46:51PM +0100, Giovanni Gherdovich wrote:
> The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
> accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency
> reported by the CPU.
>
> Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either
> one or two turbo frequencies; in the former case that's 100 MHz above the base
> frequency, in the latter case the two levels are 100 MHz and 200 MHz above
> base frequency.
>
> We set freq_max to the second-highest frequency reported by the CPU. This
> could be the base frequency (if only one turbo level is available) or the first
> turbo level (if two levels are available). The rationale is to compromise
> between power efficiency and performance: going straight to max turbo would
> favor efficiency and blindly using base freq would favor performance.
>
> For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi
> to get the available frequencies (taken from a comment in turbostat's sources):
>
>     [0]     Reserved
>     [7:1]   Base value of number of active cores of bucket 1.
>     [15:8]  Base value of freq ratio of bucket 1.
>     [20:16] +ve delta of number of active cores of bucket 2.
>             i.e. active cores of bucket 2 =
>             active cores of bucket 1 + delta
>     [23:21] Negative delta of freq ratio of bucket 2.
>             i.e. freq ratio of bucket 2 =
>             freq ratio of bucket 1 - delta
>     [28:24] +ve delta of number of active cores of bucket 3.
>     [31:29] -ve delta of freq ratio of bucket 3.
>     [36:32] +ve delta of number of active cores of bucket 4.
>     [39:37] -ve delta of freq ratio of bucket 4.
>     [44:40] +ve delta of number of active cores of bucket 5.
>     [47:45] -ve delta of freq ratio of bucket 5.
>     [52:48] +ve delta of number of active cores of bucket 6.
>     [55:53] -ve delta of freq ratio of bucket 6.
>     [60:56] +ve delta of number of active cores of bucket 7.
>     [63:61] -ve delta of freq ratio of bucket 7.

Does it make sense to write a complete decoder and pass a @size
parameter just like the skx/glm case?

(I've no idea on the 4 I passed in, probably wants to be something else)

---
Index: linux-2.6/arch/x86/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smpboot.c
+++ linux-2.6/arch/x86/kernel/smpboot.c
@@ -1863,36 +1863,12 @@ static const struct x86_cpu_id has_glm_t
 	{}
 };
 
-static int get_knl_turbo_ratio(u64 *turbo_ratio)
+static bool knl_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio, int size)
 {
+	int delta_cores, delta_fratio;
+	int cores, fratio;
+	int err, i;
 	u64 msr;
-	u32 ratio, delta_ratio;
-	int err, i, found = 0;
-
-	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
-	if (err)
-		return err;
-
-	ratio = (msr >> 8) & 0xFF;
-
-	for (i = 16; i < 64; i += 8) {
-		delta_ratio = (msr >> (i + 5)) & 0x7;
-		if (delta_ratio) {
-			*turbo_ratio = ratio - delta_ratio;
-			found = 1;
-			break;
-		}
-	}
-
-	if (!found)
-		return 1;
-
-	return 0;
-}
-
-static bool knl_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio)
-{
-	int err;
 
 	if (!x86_match_cpu(has_knl_turbo_ratio_limits))
 		return false;
@@ -1901,15 +1877,32 @@ static bool knl_set_cpu_max_freq(u64 *ra
 	if (err)
 		return false;
 
-	/* second highest turbo ratio */
-	err = get_knl_turbo_ratio(turbo_ratio);
+	*ratio = (*ratio >> 8) & 0xFF;	/* max P state ratio */
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
 	if (err)
 		return false;
 
-	/* max P state ratio */
-	*ratio = (*ratio >> 8) & 0xFF;
+	cores = (msr >> 1) & 0x7F;
+	fratio = (msr >> 8) & 0xFF;
 
-	return true;
+	i = 16;
+	do {
+		if (cores >= size) {
+			*turbo_ratio = fratio;
+			return true;
+		}
+
+		delta_cores = (msr >> i) & 0x1F;
+		delta_fratio = (msr >> (i + 5)) & 0x07;
+
+		cores += delta_cores;
+		fratio -= delta_fratio;
+
+		i += 8;
+	} while (i < 64);
+
+	return false;
 }
 
 static bool skx_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio, int size)
@@ -1975,7 +1968,7 @@ static void intel_set_cpu_max_freq(void)
 	    skx_set_cpu_max_freq(&ratio, &turbo_ratio, 1))
 		goto set_value;
 
-	if (knl_set_cpu_max_freq(&ratio, &turbo_ratio))
+	if (knl_set_cpu_max_freq(&ratio, &turbo_ratio, 4))
 		goto set_value;
 
 	if (x86_match_cpu(has_skx_turbo_ratio_limits) &&

On Wed, 2019-12-18 at 21:22 +0100, Peter Zijlstra wrote:
> On Wed, Nov 13, 2019 at 01:46:51PM +0100, Giovanni Gherdovich wrote:
> > The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
> > accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency
> > reported by the CPU.
> >
> > Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either
> > one or two turbo frequencies; in the former case that's 100 MHz above the base
> > frequency, in the latter case the two levels are 100 MHz and 200 MHz above
> > base frequency.
> >
> > We set freq_max to the second-highest frequency reported by the CPU. This
> > could be the base frequency (if only one turbo level is available) or the first
> > turbo level (if two levels are available). The rationale is to compromise
> > between power efficiency and performance: going straight to max turbo would
> > favor efficiency and blindly using base freq would favor performance.
> >
> > For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi
> > to get the available frequencies (taken from a comment in turbostat's sources):
> >
> >     [0]     Reserved
> >     [7:1]   Base value of number of active cores of bucket 1.
> >     [15:8]  Base value of freq ratio of bucket 1.
> >     [20:16] +ve delta of number of active cores of bucket 2.
> >             i.e. active cores of bucket 2 =
> >             active cores of bucket 1 + delta
> >     [23:21] Negative delta of freq ratio of bucket 2.
> >             i.e. freq ratio of bucket 2 =
> >             freq ratio of bucket 1 - delta
> >     [28:24] +ve delta of number of active cores of bucket 3.
> >     [31:29] -ve delta of freq ratio of bucket 3.
> >     [36:32] +ve delta of number of active cores of bucket 4.
> >     [39:37] -ve delta of freq ratio of bucket 4.
> >     [44:40] +ve delta of number of active cores of bucket 5.
> >     [47:45] -ve delta of freq ratio of bucket 5.
> >     [52:48] +ve delta of number of active cores of bucket 6.
> >     [55:53] -ve delta of freq ratio of bucket 6.
> >     [60:56] +ve delta of number of active cores of bucket 7.
> >     [63:61] -ve delta of freq ratio of bucket 7.
>
> Does it make sense to write a complete decoder and pass a @size
> parameter just like the skx/glm case?
>
> (I've no idea on the 4 I passed in, probably wants to be something else)

I see your point: it's better to have a @size parameter so that if there is a
better value, we can easily change the number in the future. It also conforms
to how the others are handled.

But from the little I've learned on Xeon Phi, the best parameter to
characterize the choice is not the @size of the buckets, but the number of
non-zero @delta's of freq ratio that you encounter while parsing the MSR. The
number of those non-zero freq ratio @delta's corresponds to how many freq
ratios you can have, and the documentation says this can be either 1 or 2. My
"documentation" is actually the 4-page leaflet at [1], but I briefly asked Len
Brown and he said it made sense. So that's how I would parametrize the code.

If I use your function, in order to extract the max_freq value I want from the
Xeon Phi machine I have, @size should be a number between 31 and 68 cores (see
example below). Not that there is anything wrong with 31 <= size <= 68, it's
just that I can make an assumption on how many freq ratios there are and I'd
like to do it.

My Xeon Phi test machine:

    Xeon Phi CPU 7255 (Knights Mill)
    Max Efficiency:   1000 MHz
    Base Frequency:   1100 MHz
    68 cores turbo:   1100 MHz
    30 cores turbo:   1200 MHz

You can see the above has only 1 non-zero delta freq ratio; the base freq is
1100 MHz and max turbo is 1100+100 MHz.

If I could at least check a Knights Landing (which I don't have) and confirm
that 31 <= size gives me the second-highest freq ratio, I'd use your function,
but at the moment using @delta as a parameter seems safer.

[1] https://www.intel.com/content/dam/www/public/us/en/documents/productbriefs/xeonphiprocessorproductbrief.pdf
    footnote #3, last page: "Frequency listed is nominal (non-AVX) TDP
    frequency. For all-tile turbo frequency, add 100 MHz. For single-tile
    turbo frequency, add 200 MHz. For high-AVX instruction frequency,
    subtract 200 MHz."

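To make the @size versus @delta discussion concrete, here is a minimal userspace sketch that decodes the bucket layout quoted above and applies both selection strategies. It is not the kernel code from either patch: the names `knl_bucket`, `knl_decode`, `knl_ratio_by_size` and `knl_ratio_by_delta` are hypothetical, and `KNM_7255_MSR_EXAMPLE` is an invented register value chosen only to reproduce the 30-core/1200 MHz and 68-core/1100 MHz levels listed for the test machine; the real MSR contents of that machine are not given in this thread. Unlike the do/while loop proposed above, the sketch also checks the seventh bucket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One decoded bucket: up to `cores` active cores may run at `ratio` x 100 MHz. */
struct knl_bucket {
	unsigned int cores;
	unsigned int ratio;
};

/*
 * Hypothetical MSR_TURBO_RATIO_LIMIT value (an assumption, not read from
 * hardware): bucket 1 = 30 cores at ratio 12, bucket 2 = +31 cores and
 * -1 ratio, bucket 3 = +7 cores and -0 ratio, remaining deltas zero.
 */
#define KNM_7255_MSR_EXAMPLE 0x073F0C3CULL

/* Decode all 7 buckets following the bit layout quoted in the thread. */
static void knl_decode(uint64_t msr, struct knl_bucket b[7])
{
	assert(b);

	b[0].cores = (msr >> 1) & 0x7F;		/* bits [7:1]  */
	b[0].ratio = (msr >> 8) & 0xFF;		/* bits [15:8] */

	for (int i = 1; i < 7; i++) {
		int shift = 8 + 8 * i;		/* 16, 24, ..., 56 */

		/* 5-bit positive core delta, 3-bit negative ratio delta */
		b[i].cores = b[i - 1].cores + ((msr >> shift) & 0x1F);
		b[i].ratio = b[i - 1].ratio - ((msr >> (shift + 5)) & 0x07);
	}
}

/* @size strategy: ratio of the first bucket covering at least `size` cores. */
static bool knl_ratio_by_size(uint64_t msr, unsigned int size,
			      unsigned int *ratio)
{
	struct knl_bucket b[7];

	knl_decode(msr, b);
	for (int i = 0; i < 7; i++) {
		if (b[i].cores >= size) {
			*ratio = b[i].ratio;
			return true;
		}
	}
	return false;
}

/* @delta strategy: ratio right after the first non-zero freq-ratio delta. */
static bool knl_ratio_by_delta(uint64_t msr, unsigned int *ratio)
{
	struct knl_bucket b[7];

	knl_decode(msr, b);
	for (int i = 1; i < 7; i++) {
		if (b[i].ratio != b[i - 1].ratio) {
			*ratio = b[i].ratio;
			return true;
		}
	}
	return false;
}
```

With the assumed value, a @size of 4 selects ratio 12 (max turbo), any @size from 31 to 68 selects ratio 11, and the @delta strategy selects ratio 11 regardless of core counts, which is the behaviour the reply above argues for.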
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 11d57d741584..0e79dcc03ae4 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1841,6 +1841,55 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
 	{}
 };
 
+static int get_knl_turbo_ratio(u64 *turbo_ratio)
+{
+	u64 msr;
+	u32 ratio, delta_ratio;
+	int err, i, found = 0;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+	if (err)
+		return err;
+
+	ratio = (msr >> 8) & 0xFF;
+
+	for (i = 16; i < 64; i += 8) {
+		delta_ratio = (msr >> (i + 5)) & 0x7;
+		if (delta_ratio) {
+			*turbo_ratio = ratio - delta_ratio;
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found)
+		return 1;
+
+	return 0;
+}
+
+static bool knl_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio)
+{
+	int err;
+
+	if (!x86_match_cpu(has_knl_turbo_ratio_limits))
+		return false;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, ratio);
+	if (err)
+		return false;
+
+	/* second highest turbo ratio */
+	err = get_knl_turbo_ratio(turbo_ratio);
+	if (err)
+		return false;
+
+	/* max P state ratio */
+	*ratio = (*ratio >> 8) & 0xFF;
+
+	return true;
+}
+
 static int get_turbo_ratio_group(u64 *turbo_ratio)
 {
 	u64 ratio, core_counts;
@@ -1913,7 +1962,6 @@ static void intel_set_cpu_max_freq(void)
 	/*
 	 * TODO: add support for:
 	 *
-	 * - Xeon Phi (KNM, KNL)
 	 * - Atom Goldmont
 	 * - Atom Silvermont
 	 *
@@ -1923,10 +1971,12 @@ static void intel_set_cpu_max_freq(void)
 	u64 ratio = 1, turbo_ratio = 1;
 
 	if (turbo_disabled() ||
-		x86_match_cpu(has_knl_turbo_ratio_limits) ||
 		x86_match_cpu(has_glm_turbo_ratio_limits))
 		return;
 
+	if (knl_set_cpu_max_freq(&ratio, &turbo_ratio))
+		goto set_value;
+
 	if (skx_set_cpu_max_freq(&ratio, &turbo_ratio))
 		goto set_value;
 

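The hunk above stops at `goto set_value`, so how `ratio` and `turbo_ratio` are consumed is not visible here. The sketch below is only an illustration of the arithmetic implied by "the scheduler needs the ratio freq_curr/freq_max", under the assumption that the ratio is kept in fixed point with 1024 as the unit. `CAPACITY_SCALE`, `max_freq_ratio` and `freq_scale` are hypothetical names, not symbols taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point unit: 1024 represents a ratio of 1.0. */
#define CAPACITY_SCALE 1024ULL

/*
 * freq_max expressed relative to the base frequency, in fixed point.
 * base_ratio and turbo_ratio are multiples of 100 MHz, as in the MSRs.
 */
static uint64_t max_freq_ratio(uint64_t base_ratio, uint64_t turbo_ratio)
{
	assert(base_ratio != 0);
	return turbo_ratio * CAPACITY_SCALE / base_ratio;
}

/*
 * freq_curr / freq_max in fixed point, clamped to 1.0 so that running
 * above the chosen freq_max (e.g. at max turbo) never reports more
 * than full scale.
 */
static uint64_t freq_scale(uint64_t curr_ratio, uint64_t max_ratio)
{
	uint64_t scale;

	assert(max_ratio != 0);
	scale = curr_ratio * CAPACITY_SCALE / max_ratio;
	return scale > CAPACITY_SCALE ? CAPACITY_SCALE : scale;
}
```

On the test machine described in this thread (base 1100 MHz, second-highest level 1100 MHz), `max_freq_ratio(11, 11)` is 1024, i.e. freq_max equals the base frequency. Running at the 1000 MHz max-efficiency point gives `freq_scale(10, 11)` = 930, and running at the 1200 MHz turbo level is clamped to 1024.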
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency
reported by the CPU.

Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either
one or two turbo frequencies; in the former case that's 100 MHz above the base
frequency, in the latter case the two levels are 100 MHz and 200 MHz above
base frequency.

We set freq_max to the second-highest frequency reported by the CPU. This
could be the base frequency (if only one turbo level is available) or the first
turbo level (if two levels are available). The rationale is to compromise
between power efficiency and performance: going straight to max turbo would
favor efficiency and blindly using base freq would favor performance.

For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi
to get the available frequencies (taken from a comment in turbostat's sources):

    [0]     Reserved
    [7:1]   Base value of number of active cores of bucket 1.
    [15:8]  Base value of freq ratio of bucket 1.
    [20:16] +ve delta of number of active cores of bucket 2.
            i.e. active cores of bucket 2 =
            active cores of bucket 1 + delta
    [23:21] Negative delta of freq ratio of bucket 2.
            i.e. freq ratio of bucket 2 =
            freq ratio of bucket 1 - delta
    [28:24] +ve delta of number of active cores of bucket 3.
    [31:29] -ve delta of freq ratio of bucket 3.
    [36:32] +ve delta of number of active cores of bucket 4.
    [39:37] -ve delta of freq ratio of bucket 4.
    [44:40] +ve delta of number of active cores of bucket 5.
    [47:45] -ve delta of freq ratio of bucket 5.
    [52:48] +ve delta of number of active cores of bucket 6.
    [55:53] -ve delta of freq ratio of bucket 6.
    [60:56] +ve delta of number of active cores of bucket 7.
    [63:61] -ve delta of freq ratio of bucket 7.

1. PERFORMANCE EVALUATION: TBENCH +5%
2. NEUTRAL BENCHMARKS (ALL OTHERS)
3. TEST SETUP

1. PERFORMANCE EVALUATION: TBENCH +5%
-------------------------------------

A performance evaluation was conducted on a Knights Mill machine (see "Test
Setup" below), where the frequency-invariance patch (on schedutil) is compared
to both non-invariant schedutil and active intel_pstate with powersave: all
three tested kernels behave the same performance-wise and with regard to power
consumption (performance per watt). The only notable difference is tbench:

comparison ratio of performance with baseline; 1.00 means neutral,
higher is better:

                     I_PSTATE      FREQ-INV
    ----------------------------------------
    tbench               1.04          1.05

performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:

                     I_PSTATE      FREQ-INV
    ----------------------------------------
    tbench               1.03          1.04

which essentially means that frequency-invariant schedutil is 5% better than
baseline, the same as intel_pstate+powersave. As the results above are averaged
over the varying parameter, here is the detailed table.

Varying parameter : number of clients
Unit              : MB/sec (higher is better)

              5.2.0 vanilla (BASELINE)     5.2.0 intel_pstate             5.2.0 freq-inv
----------------------------------------------------------------------------------------------
Hmean  1        49.06 +- 2.12% (       )     51.66 +- 1.52% (  5.30%)     52.87 +- 0.88% (  7.76%)
Hmean  2        93.82 +- 0.45% (       )    103.24 +- 0.70% ( 10.05%)    105.90 +- 0.70% ( 12.88%)
Hmean  4       192.46 +- 1.15% (       )    215.95 +- 0.60% ( 12.21%)    215.78 +- 1.43% ( 12.12%)
Hmean  8       406.74 +- 2.58% (       )    438.58 +- 0.36% (  7.83%)    437.61 +- 0.97% (  7.59%)
Hmean  16      857.70 +- 1.22% (       )    890.26 +- 0.72% (  3.80%)    889.11 +- 0.73% (  3.66%)
Hmean  32     1760.10 +- 0.92% (       )   1791.70 +- 0.44% (  1.79%)   1787.95 +- 0.44% (  1.58%)
Hmean  64     3183.50 +- 0.34% (       )   3183.19 +- 0.36% ( -0.01%)   3187.53 +- 0.36% (  0.13%)
Hmean  128    4830.96 +- 0.31% (       )   4846.53 +- 0.30% (  0.32%)   4855.86 +- 0.30% (  0.52%)
Hmean  256    5467.98 +- 0.38% (       )   5793.80 +- 0.28% (  5.96%)   5821.94 +- 0.17% (  6.47%)
Hmean  512    5398.10 +- 0.06% (       )   5745.56 +- 0.08% (  6.44%)   5503.68 +- 0.07% (  1.96%)
Hmean  1024   5290.43 +- 0.63% (       )   5221.07 +- 0.47% ( -1.31%)   5277.22 +- 0.80% ( -0.25%)
Hmean  1088   5139.71 +- 0.57% (       )   5236.02 +- 0.71% (  1.87%)   5190.57 +- 0.41% (  0.99%)

2. NEUTRAL BENCHMARKS (ALL OTHERS)
----------------------------------

* pgbench (both read/write and read-only)
* NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf
* dbench
* kernbench
* gitsource (git unit test suite)

3. TEST SETUP
-------------

Test machine:

CPU Model   : Intel Xeon Phi CPU 7255 @ 1.10GHz (a.k.a. Knights Mill)
Fam/Mod/Ste : 6:133:0
Topology    : 1 socket, 68 cores / 272 threads
Memory      : 96G
Storage     : rotary, XFS filesystem

Max EFFICiency, BASE frequency and available turbo levels (MHz):

    EFFIC    1000  **********
    BASE     1100  ***********
    68C      1100  ***********
    30C      1200  ************

Tested kernels:

Baseline      : v5.2, intel_pstate passive, schedutil
Comparison #1 : v5.2, intel_pstate active , powersave
Comparison #2 : v5.2, this patch, intel_pstate passive, schedutil

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 arch/x86/kernel/smpboot.c | 54 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 52 insertions(+), 2 deletions(-)