Francisco Jerez <currojerez@riseup.net> writes:

> Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:
>[...]
>> Some time ago we entertained the idea of GPU "load average", where that 
>> was defined as a count of runnable requests (so batch buffers). How 
>> that, more generic metric, would behave here if used as an input signal 
>> really intrigues me. Sadly I don't have a patch ready to give to you and 
>> ask to please test it.
>>
>> Or maybe the key is count of runnable contexts as opposed to requests, 
>> which would more match the ELSP[1] idea.
>>
>[..]
> This patch takes the rather conservative approach of limiting the
> application of the response frequency PM QoS request to the more
> restrictive set of cases where we are most certain that CPU latency
> shouldn't be an issue, in order to avoid regressions.  But it might be
> that you find the additional energy efficiency benefit from the more
> aggressive approach to be worth the cost to a few execlists submission
> latency-sensitive applications.  I'm trying to get some numbers
> comparing the two approaches now, will post them here once I have
> results so we can make a more informed trade-off.
>

I got some results from the promised comparison between the dual-ELSP
utilization approach used in this series and the more obvious
alternative of keeping track of the time that any request (or context)
is in flight.  As expected, the alternative gives quite a few
performance improvements (numbers relative to the approach in this
series).  However, most of them are in either synthetic benchmarks or
off-screen variants of benchmarks (the corresponding on-screen variant
of each benchmark below doesn't show a significant improvement):

 synmark/OglCSDof:                                                                      XXX ±0.15% x18 ->   XXX ±0.22% x12          d=1.15% ±0.18%       p=0.00%
 synmark/OglDeferred:                                                                   XXX ±0.31% x18 ->   XXX ±0.15% x12          d=1.16% ±0.26%       p=0.00%
 synmark/OglTexFilterAniso:                                                             XXX ±0.18% x18 ->   XXX ±0.21% x12          d=1.25% ±0.19%       p=0.00%
 synmark/OglPSPhong:                                                                    XXX ±0.43% x18 ->   XXX ±0.29% x12          d=1.28% ±0.38%       p=0.00%
 synmark/OglBatch0:                                                                     XXX ±0.40% x18 ->   XXX ±0.53% x12          d=1.29% ±0.46%       p=0.00%
 synmark/OglVSDiffuse8:                                                                 XXX ±0.49% x17 ->   XXX ±0.25% x12          d=1.30% ±0.41%       p=0.00%
 synmark/OglVSTangent:                                                                  XXX ±0.53% x18 ->   XXX ±0.31% x12          d=1.31% ±0.46%       p=0.00%
 synmark/OglGeomPoint:                                                                  XXX ±0.56% x18 ->   XXX ±0.15% x12          d=1.48% ±0.44%       p=0.00%
 gputest/plot3d:                                                                        XXX ±0.16% x18 ->   XXX ±0.11% x12          d=1.50% ±0.14%       p=0.00%
 gputest/tess_x32:                                                                      XXX ±0.15% x18 ->   XXX ±0.06% x12          d=1.59% ±0.13%       p=0.00%
 synmark/OglTexFilterTri:                                                               XXX ±0.15% x18 ->   XXX ±0.19% x12          d=1.62% ±0.17%       p=0.00%
 synmark/OglBatch3:                                                                     XXX ±0.57% x18 ->   XXX ±0.33% x12          d=1.70% ±0.49%       p=0.00%
 synmark/OglBatch1:                                                                     XXX ±0.41% x18 ->   XXX ±0.34% x12          d=1.81% ±0.38%       p=0.00%
 synmark/OglShMapVsm:                                                                   XXX ±0.53% x18 ->   XXX ±0.38% x12          d=1.81% ±0.48%       p=0.00%
 synmark/OglTexMem128:                                                                  XXX ±0.62% x18 ->   XXX ±0.29% x12          d=1.87% ±0.52%       p=0.00%
 phoronix/x11perf/test=Scrolling 500 x 500 px:                                           XXX ±0.35% x6 ->   XXX ±0.56% x12          d=2.23% ±0.52%       p=0.00%
 phoronix/x11perf/test=500px Copy From Window To Window:                                 XXX ±0.00% x3 ->   XXX ±0.74% x12          d=2.41% ±0.70%       p=0.01%
 gfxbench/gl_trex_off:                                                                   XXX ±0.04% x3 ->   XXX ±0.34% x12          d=2.59% ±0.32%       p=0.00%
 synmark/OglBatch2:                                                                     XXX ±0.85% x18 ->   XXX ±0.21% x12          d=2.87% ±0.67%       p=0.00%
 glbenchmark/GLB27_EgyptHD_inherited_C24Z16_FixedTime_Offscreen:                         XXX ±0.35% x3 ->   XXX ±0.84% x12          d=3.03% ±0.81%       p=0.01%
 glbenchmark/GLB27_TRex_C24Z16_Offscreen:                                                XXX ±0.23% x3 ->   XXX ±0.32% x12          d=3.09% ±0.32%       p=0.00%
 synmark/OglCSCloth:                                                                    XXX ±0.60% x18 ->   XXX ±0.29% x12          d=3.76% ±0.50%       p=0.00%
 phoronix/x11perf/test=Copy 500x500 From Pixmap To Pixmap:                               XXX ±0.44% x3 ->   XXX ±0.70% x12          d=4.31% ±0.69%       p=0.00%

There aren't as many regressions (numbers relative to upstream
linux-next kernel) and they're mostly 2D test-cases, but they are
substantially worse in absolute value:

 phoronix/jxrendermark/rendering-test=12pt Text LCD/rendering-size=128x128:              XXX ±0.30% x26 ->  XXX ±5.71% x26        d=-23.15% ±3.11%       p=0.00%
 phoronix/jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128:      XXX ±0.30% x26 ->  XXX ±4.32% x26        d=-21.34% ±2.41%       p=0.00%
 phoronix/x11perf/test=500px Compositing From Pixmap To Window:                         XXX ±15.46% x26 -> XXX ±12.76% x26       d=-19.05% ±13.15%       p=0.00%
 phoronix/jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128:  XXX ±0.20% x26 ->  XXX ±3.82% x27         d=-5.07% ±2.57%       p=0.00%
 phoronix/gtkperf/gtk-test=GtkDrawingArea - Pixbufs:                                     XXX ±2.81% x26 ->  XXX ±2.10% x26         d=-3.59% ±2.45%       p=0.00%
 warsow/benchsow:                                                                        XXX ±0.61% x26 ->  XXX ±1.41% x27         d=-2.45% ±1.07%       p=0.00%
 synmark/OglTerrainFlyInst:                                                              XXX ±0.44% x25 ->  XXX ±0.74% x25         d=-1.24% ±0.60%       p=0.00%

There are some things we might be able to do to get some of the
additional improvement we can see above without hurting
latency-sensitive workloads, but it's going to take more effort.  The
present approach of using the dual-ELSP utilization seems like a good
compromise to me for starters.
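
To make the difference between the two input signals concrete, here is
a toy model, not code from this series.  It assumes that "dual-ELSP
utilization" means the fraction of time during which both ELSP ports
are occupied, and that the alternative counts any time with at least
one port occupied.  The Sample type, the function names and the trace
values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One interval of a hypothetical execlists occupancy trace."""
    duration_ns: int  # length of the interval
    ports_busy: int   # ELSP ports occupied during it: 0, 1 or 2


def utilization(samples, min_ports):
    """Fraction of total time with at least `min_ports` ports occupied."""
    total = sum(s.duration_ns for s in samples)
    busy = sum(s.duration_ns for s in samples if s.ports_busy >= min_ports)
    return busy / total if total else 0.0


def dual_elsp_utilization(samples):
    # Assumed reading of the signal used in this series:
    # only time with both ports occupied counts.
    return utilization(samples, 2)


def any_in_flight_utilization(samples):
    # Assumed reading of the alternative:
    # any time with at least one request (or context) in flight counts.
    return utilization(samples, 1)


# Made-up trace: 4 ms with both ports busy, 3 ms with one, 3 ms idle.
trace = [
    Sample(4_000_000, 2),
    Sample(3_000_000, 1),
    Sample(3_000_000, 0),
]

print(f"dual-ELSP:     {dual_elsp_utilization(trace):.2f}")
print(f"any in flight: {any_in_flight_utilization(trace):.2f}")
```

Under these assumptions the second signal can never be lower than the
first, so a heuristic driven by it would apply the PM QoS request in
more cases, which matches the "more aggressive" description above.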

>[...]