On Sunday, March 19, 2017 02:34:32 PM Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki
>
> The PELT metric used by the schedutil governor underestimates the
> CPU utilization in some cases. The reason for that may be time spent
> in interrupt handlers and similar which is not accounted for by PELT.
>
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register. Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again. That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.

In case you are wondering about the actual numbers, attached are two
turbostat log files from two runs of the same workload, without
(before.txt) and with (after.txt) the patch applied.

The workload is essentially "make -j 5" in the kernel source tree and
the machine has SSD storage and a quad-core Intel Sandy Bridge
processor. The P-states available for each core are between 8 and 31
(0x1f), corresponding to 800 MHz and 3.1 GHz, respectively. All cores
can run sustainably at 2.9 GHz at the same time, although that is not a
guaranteed sustainable frequency (it may be dropped occasionally for
thermal reasons, for example).
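For reference, the two endpoints quoted above (8 for 800 MHz and 31 for
3.1 GHz) suggest that one P-state bin corresponds to 100 MHz. A minimal
Python sketch of that mapping follows. The constant and function names
are illustrative only, and the 100 MHz step is an assumption inferred
from those two endpoints, not a value read from the hardware.

```python
# Assumption: one P-state bin corresponds to 100 MHz, inferred from the
# endpoints in the text (8 -> 800 MHz, 31 -> 3100 MHz).
MHZ_PER_PSTATE = 100


def pstate_to_mhz(pstate: int) -> int:
    """Convert a P-state number to its nominal frequency in MHz."""
    return pstate * MHZ_PER_PSTATE


min_mhz = pstate_to_mhz(8)     # lowest available P-state
max_mhz = pstate_to_mhz(0x1F)  # highest available P-state (31)
```

Under the same assumption, the 2.9 GHz all-core frequency mentioned
above would correspond to P-state 29.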

The interesting columns are Bzy_MHz (specifically the rows with "-"
under CPU, which correspond to the entire processor), which is the
average frequency between iterations based on the numbers read from
feedback registers, and the rightmost one, which holds the values
written to the P-state request register (the 3rd and 4th hex digits
from the right represent the requested P-state).

The turbostat data collection ran every 2 seconds and I looked at the
last 30 iterations in each case, corresponding to about 1 minute of the
workload run during which all of the cores were around 100% busy.

Now, if you look at after.txt (the run with the patch applied), you'll
notice that during those last 30 iterations P-state 31 (0x1f) had been
requested on all cores pretty much 100% of the time (meaning: as
expected in that case) and the average processor frequency (computed by
taking the average from all of the 30 "-" rows) was 2899.33 MHz
(apparently, the hardware decided to drop it from 2.9 GHz occasionally).

In the before.txt case (without the patch) the average frequency over
the last 30 iterations was 2896.90 MHz, which is about 0.08% slower
than with the patch applied (on average). That already is quite a
measurable difference, but it would have been much worse if the
processor had not coordinated P-states in hardware (such that if any
core requested 31, the processor would pick that one or close to it for
the entire package regardless of the requests from the other cores).

Namely, if you look at the P-states requested for different cores
(during the last 30 iterations of the before.txt run), which
essentially is what should be used according to the governor, the
average of them is 27.25 (almost 4 bins lower than the maximum) and the
standard deviation is 6, so it is not like they are a little off
occasionally. At least some of them are way off most of the time.
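
To make the "3rd and 4th hex digits from the right" rule concrete, here
is a small Python sketch of how the requested P-state could be pulled
out of a value written to the P-state request register, and how an
average and standard deviation could then be taken over the results.
The sample values are made up for illustration and are not taken from
before.txt or after.txt, the function name is hypothetical, and the use
of the population standard deviation is an assumption (it is not stated
above which variant the figure of 6 refers to).

```python
from statistics import mean, pstdev


def requested_pstate(perf_ctl: int) -> int:
    """Return the requested P-state encoded in a request register value.

    The 3rd and 4th hex digits from the right are bits 15:8, so shift
    the low byte out and mask the next one.
    """
    return (perf_ctl >> 8) & 0xFF


# Hypothetical register values, NOT data from the attached logs.
samples = [0x1F00, 0x1F00, 0x1200, 0x0800]

pstates = [requested_pstate(value) for value in samples]
average = mean(pstates)   # arithmetic mean of the requested P-states
spread = pstdev(pstates)  # population standard deviation (assumption)

print(pstates, average, round(spread, 2))
```

Applying the same extraction to the rightmost turbostat column for
every core and iteration is one way to arrive at per-run figures like
the ones quoted above.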

Honestly, if the processor had been capable of doing per-core P-states,
that would have been a disaster and there are customers who wouldn't
look at schedutil again after being confronted with these numbers.

So this is rather serious.

BTW, both intel_pstate in the active mode and ondemand request 0x1f on
all cores for that workload, like in the after.txt case.

Thanks,
Rafael