* Re: intel graphics performance thought
       [not found] <87r5fe6os9.fsf@pollan.anholt.net>
@ 2010-10-26  3:35 ` Peter Clifton
  2010-10-26 16:45   ` Eric Anholt
  0 siblings, 1 reply; 2+ messages in thread
From: Peter Clifton @ 2010-10-26  3:35 UTC (permalink / raw)
  To: Eric Anholt, intel-gfx

On Mon, 2010-10-25 at 12:44 -0700, Eric Anholt wrote:

> So, what if the problem is that our URB allocations aren't big enough?
> I would expect that to look kind of like what I'm seeing.  One
> experiment would be to go double the preferred size of each stage in
> brw_urb.c one by one -- is one stage's URB allocation a limit?  Or, am I
> on the right track at all (go reduce all the preferred sizes to 1/2 and
> see if that hurts)?

I've tinkered with the URB allocations a little, and couldn't see any
discernible performance impact. I think I'll have to retry with the CPU
forced into C0 and at full operating frequency, so the system is really
taxed and I can be confident in comparing numbers.

I did notice that the mesa code appears to enforce a minimum of 4 URB
entries for a GS thread, where the PRM suggests you could potentially
get stalls due to the CLIP operation unless you have 5 URB entries
(Vol2 GM45 page 56). Obviously we're not seeing GPU hangs all the time,
otherwise people would have complained! But it might be something worth
adjusting if I'm correct in my assessment.

> If this is the problem, I'd think URB allocation should actually look
> something like dividing up the whole space according to some weighting,
> with minimums per unit to prevent deadlock.  Right now, we're just using
> a fixed division for preferred if it fits, and a nasty minimum set for
> the fallback case.

It would seem that the different operations need to take about the same
amount of time, otherwise they will stall anyway. There is no point
processing vertices faster than the WM can absorb them, right? I guess
this should be reflected in the thread scheduling in the GPU EUs, though.

I would half expect various FF units to spend time "idle" / waiting for
data to move. The real curiosity is whether the threads being dispatched
can keep all the EUs at 100%. Something has to be the bottleneck in a
pipeline; I'm hoping it will eventually be the GPU EUs.

I've also tinkered with the maximum thread counts for the GS and CLIP
units (which looked to be running at only 1 or 2 threads). That didn't
appear to have any (positive) impact either. Browsing the code, it would
appear that up to 50 WM and 32 VS threads are dispatched. I've not
checked the docs for a maximum yet, but I'm assuming the numbers in Mesa
may reflect a hardware limit there.


I think what we really need for better understanding is a per-frame
profile of when different execution units are busy. It sounds like you
have something like this in development for Ironlake (unfortunately I'm
only on GM45 here).


Thanks for your help in debugging this. I understand performance will
necessarily come second to chipset support and stability, so I
appreciate you taking the time to think about this a bit.

One thing which might be interesting (albeit hard for me to do) is to
compare some benchmarks against the Win32 drivers for the chip, to
sanity-check whether the Linux drivers + Mesa are in the same ballpark
or not. If they are, I guess there won't be a silver-bullet fix anywhere.

Best regards,

-- 
Peter Clifton

Electrical Engineering Division,
Engineering Department,
University of Cambridge,
9, JJ Thomson Avenue,
Cambridge
CB3 0FA

Tel: +44 (0)7729 980173 - (No signal in the lab!)
Tel: +44 (0)1223 748328 - (Shared lab phone, ask for me)

* Re: intel graphics performance thought
  2010-10-26  3:35 ` intel graphics performance thought Peter Clifton
@ 2010-10-26 16:45   ` Eric Anholt
  0 siblings, 0 replies; 2+ messages in thread
From: Eric Anholt @ 2010-10-26 16:45 UTC (permalink / raw)
  To: Peter Clifton, intel-gfx

On Tue, 26 Oct 2010 04:35:33 +0100, Peter Clifton <pcjc2@cam.ac.uk> wrote:
> On Mon, 2010-10-25 at 12:44 -0700, Eric Anholt wrote:
> 
> > So, what if the problem is that our URB allocations aren't big enough?
> > I would expect that to look kind of like what I'm seeing.  One
> > experiment would be to go double the preferred size of each stage in
> > brw_urb.c one by one -- is one stage's URB allocation a limit?  Or, am I
> > on the right track at all (go reduce all the preferred sizes to 1/2 and
> > see if that hurts)?
>
> I think what we really need for better understanding is a per-frame
> profile of when different execution units are busy. It sounds like you
> have something like this in development for Ironlake (unfortunately I'm
> only on GM45 here).

Sadly, it's nothing that awesome. Just like intel_gpu_top, but a
different set of bits.


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

