* [PATCH] [RFC] drm/i915: read-read semaphore optimization
@ 2011-12-13  3:52 Ben Widawsky
  2011-12-13  9:49 ` Chris Wilson
  2011-12-13 17:22 ` Eric Anholt
  0 siblings, 2 replies; 16+ messages in thread
From: Ben Widawsky @ 2011-12-13  3:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Ben Widawsky

Since we don't differentiate on the different GPU read domains, it
should be safe to allow back to back reads to occur without issuing a
wait (or flush in the non-semaphore case).

This has the unfortunate side effect that we need to keep track of all
the outstanding buffer reads so that we can synchronize on a write, to
another ring (since we don't know which read finishes first). In other
words, the code is quite simple for two rings, but gets more tricky for
more than two rings.

Here is a picture of the solution to the above problem

Ring 0            Ring 1             Ring 2
batch 0           batch 1            batch 2
  read buffer A     read buffer A      wait batch 0
                                       wait batch 1
                                       write buffer A

This code is really untested. I'm hoping for some feedback on whether
this is worth cleaning up and testing more thoroughly.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h            |    1 +
 drivers/gpu/drm/i915/i915_gem.c            |    1 +
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   32 ++++++++++++++++++++++-----
 3 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 4a9c1b9..d59bf20 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -852,6 +852,7 @@ struct drm_i915_gem_object {
 
 	/** Breadcrumb of last rendering to the buffer. */
 	uint32_t last_rendering_seqno;
+	uint32_t lrs_ro[I915_NUM_RINGS];
 	struct intel_ring_buffer *ring;
 
 	/** Breadcrumb of last fenced GPU access to the buffer. */
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index ed0b68f..64cc804 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1561,6 +1561,7 @@ i915_gem_object_move_off_active(struct drm_i915_gem_object *obj)
 {
 	list_del_init(&obj->ring_list);
 	obj->last_rendering_seqno = 0;
+	memset(obj->lrs_ro, 0, sizeof(obj->lrs_ro));
 }
 
 static void
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 42a6529..78090a3 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -752,21 +752,20 @@ i915_gem_execbuffer_sync_rings(struct drm_i915_gem_object *obj,
 {
 	struct intel_ring_buffer *from = obj->ring;
 	u32 seqno;
-	int ret, idx;
+	int ret, idx, i;
 
 	if (from == NULL || to == from)
 		return 0;
 
-	/* If the object isn't being written to, and the object isn't moving
-	 * into a GPU write domain, it doesn't need to sync.
-	 */
 	if (obj->pending_gpu_write == 0 &&
-	    obj->base.pending_write_domain == 0)
+	    obj->base.pending_write_domain == 0) {
+		obj->lrs_ro[ffs(to->id) - 1] = obj->last_rendering_seqno;
 		return 0;
+	}
 
 	/* XXX gpu semaphores are implicated in various hard hangs on SNB */
 	if (INTEL_INFO(obj->base.dev)->gen < 6 || !i915_semaphores)
-		return i915_gem_object_wait_rendering(obj);
+		ret = i915_gem_object_wait_rendering(obj);
 
 	idx = intel_ring_sync_index(from, to);
 
@@ -792,6 +791,27 @@ i915_gem_execbuffer_sync_rings(struct drm_i915_gem_object *obj,
 
 	from->sync_seqno[idx] = seqno;
 
+	/* We must be doing a write to an object which may be queued to be
+	 * read by some other command. If we skipped synchronizing because of
+	 * a read-only request, we must check for and add a sync point now.
+	 * The reason we must sync on the write is that we have more than two
+	 * rings: if two rings have outstanding reads (without sync), we can't
+	 * let the write occur until both of the others have completed.
+	 */
+	for (i = 0; i < I915_NUM_RINGS; i++) {
+		uint32_t ro_seqno = obj->lrs_ro[i];
+		if (ro_seqno != 0) {
+			obj->lrs_ro[i] = 0;
+
+			if (i915_seqno_passed(from->get_seqno(from), ro_seqno))
+				continue;
+
+			ret = to->sync_to(to, from, ro_seqno - 1);
+			if (ret)
+				return ret;
+		}
+	}
+
 	return to->sync_to(to, from, seqno - 1);
 }
 
-- 
1.7.8
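
For readers skimming the thread, the following is a minimal stand-alone
sketch of the rule the patch implements: a read-only request merely records
its seqno per ring, while a write must synchronize against the recorded
reads of every other ring. All names below are hypothetical; this models
the idea rather than the i915 code itself.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_RINGS 3

struct toy_object {
	/* analogue of lrs_ro[]: last read-only seqno seen per ring */
	uint32_t last_read_seqno[NUM_RINGS];
};

/* Read-only access: record the seqno, emit no cross-ring wait. */
static void toy_read(struct toy_object *obj, int ring, uint32_t seqno)
{
	obj->last_read_seqno[ring] = seqno;
	printf("ring %d: read at seqno %u (no sync)\n", ring, (unsigned)seqno);
}

/* Write access: wait on the last recorded read of every other ring. */
static void toy_write(struct toy_object *obj, int ring, uint32_t seqno)
{
	int i;

	for (i = 0; i < NUM_RINGS; i++) {
		uint32_t ro = obj->last_read_seqno[i];

		if (i == ring || ro == 0)
			continue;
		obj->last_read_seqno[i] = 0;
		printf("ring %d: wait for ring %d to pass seqno %u\n",
		       ring, i, (unsigned)ro);
	}
	printf("ring %d: write at seqno %u\n", ring, (unsigned)seqno);
}

int main(void)
{
	struct toy_object buf;

	memset(&buf, 0, sizeof(buf));
	toy_read(&buf, 0, 1);	/* batch 0: read buffer A on ring 0 */
	toy_read(&buf, 1, 1);	/* batch 1: read buffer A on ring 1 */
	toy_write(&buf, 2, 1);	/* batch 2: write buffer A on ring 2 */
	return 0;
}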

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2011-12-13  3:52 [PATCH] [RFC] drm/i915: read-read semaphore optimization Ben Widawsky
@ 2011-12-13  9:49 ` Chris Wilson
  2011-12-13 16:01   ` Daniel Vetter
  2011-12-13 17:22 ` Eric Anholt
  1 sibling, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2011-12-13  9:49 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Ben Widawsky

On Mon, 12 Dec 2011 19:52:08 -0800, Ben Widawsky <ben@bwidawsk.net> wrote:
> Since we don't differentiate on the different GPU read domains, it
> should be safe to allow back to back reads to occur without issuing a
> wait (or flush in the non-semaphore case).
> 
> This has the unfortunate side effect that we need to keep track of all
> the outstanding buffer reads so that we can synchronize on a write, to
> another ring (since we don't know which read finishes first). In other
> words, the code is quite simple for two rings, but gets more tricky for
> > 2 rings.
> 
> Here is a picture of the solution to the above problem
> 
> Ring 0            Ring 1             Ring 2
> batch 0           batch 1            batch 2
>   read buffer A     read buffer A      wait batch 0
>                                        wait batch 1
>                                        write buffer A
> 
> This code is really untested. I'm hoping for some feedback if this is
> worth cleaning up, and testing more thoroughly.

Yes, that race is quite valid and the reason why I thought I hadn't made
that optimisation. Darn. :(

To go a step further, we can split the obj->ring_list into
(obj->ring_read_list[NUM_RINGS], obj->num_readers, obj->last_read_seqno) and
(obj->ring_write_list, obj->last_write_seqno). At which point Daniel
complains about bloating every i915_gem_object, and we probably should
kmem_cache_alloc an i915_gem_object_seqno on demand. This allows us to track
objects in multiple rings and implement read-write locking, albeit at
significantly more complexity in managing the active lists.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
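
A rough sketch of the split described above, with stand-in types so it
compiles on its own; every name below is hypothetical rather than an
existing i915 structure.

#include <stdint.h>

#define NUM_RINGS 3

struct list_head { struct list_head *next, *prev; };	/* stand-in for the kernel type */
struct intel_ring_buffer;				/* opaque here */

/* Allocated from a kmem_cache only while the object is active on a
 * ring, so idle objects carry no extra weight. */
struct i915_gem_object_seqno {
	struct list_head link;		/* node on that ring's read (or write) list */
	struct intel_ring_buffer *ring;
	uint32_t seqno;
};

/* One possible per-object layout: readers tracked per ring, at most one
 * writer tracked separately -- i.e. read-write locking of GPU access. */
struct gem_object_tracking {
	struct list_head ring_read_list[NUM_RINGS];
	unsigned int num_readers;
	uint32_t last_read_seqno[NUM_RINGS];

	struct list_head ring_write_list;
	uint32_t last_write_seqno;
};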

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2011-12-13  9:49 ` Chris Wilson
@ 2011-12-13 16:01   ` Daniel Vetter
  2011-12-13 16:36     ` Keith Packard
  2011-12-13 16:59     ` Chris Wilson
  0 siblings, 2 replies; 16+ messages in thread
From: Daniel Vetter @ 2011-12-13 16:01 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Daniel Vetter, Ben Widawsky, intel-gfx

On Tue, Dec 13, 2011 at 09:49:34AM +0000, Chris Wilson wrote:
> On Mon, 12 Dec 2011 19:52:08 -0800, Ben Widawsky <ben@bwidawsk.net> wrote:
> > Since we don't differentiate on the different GPU read domains, it
> > should be safe to allow back to back reads to occur without issuing a
> > wait (or flush in the non-semaphore case).
> > 
> > This has the unfortunate side effect that we need to keep track of all
> > the outstanding buffer reads so that we can synchronize on a write, to
> > another ring (since we don't know which read finishes first). In other
> > words, the code is quite simple for two rings, but gets more tricky for
> > > 2 rings.
> > 
> > Here is a picture of the solution to the above problem
> > 
> > Ring 0            Ring 1             Ring 2
> > batch 0           batch 1            batch 2
> >   read buffer A     read buffer A      wait batch 0
> >                                        wait batch 1
> >                                        write buffer A
> > 
> > This code is really untested. I'm hoping for some feedback if this is
> > worth cleaning up, and testing more thoroughly.
> 
> Yes, that race is quite valid and the reason why I thought I hadn't made
> that optimisation. Darn. :(
> 
> To go a step further, we can split the obj->ring_list into
> (obj->ring_read_list[NUM_RINGS], obj->num_readers, obj->last_read_seqno) and
> (obj->ring_write_list, obj->last_write_seqno). At which point Daniel
> complains about bloating every i915_gem_object, and we probably should
> kmem_cache_alloc a i915_gem_object_seqno on demand. This allows us to track
> objects in multiple rings and implement read-write locking, albeit at
> significantly more complexity in managing the active lists.

I think the i915_gem_object bloat can be fought by stealing a few bits
from the various seqnos and storing the ring id in there. The thing that
makes me more uneasy is that I don't trust our gpu domain tracking
(especially since it's not per-ring). So either
- extend it to be per-ring
- or remove it all and invalidate/flush unconditionally.
In light of all the complexity, and given our various workarounds, I
prefer the latter.

Afaik the only use-case for parallel reads is video decode with
post-processing on the render ring. The decode ring needs read-only access
to reference frames to decode the next frame and the render ring read-only
access to past frames for post-processing (e.g. deinterlacing). But given
the general state of perf optimizations in libva I think we have lower
hanging fruit to chase if we actually miss a performance target for this
use-case.

Cheers, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48
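
The "steal a few bits" idea could look roughly like the following; the
field width, shift and helper names are invented purely for illustration
and do not reflect the real seqno layout.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SHIFT	30u			/* 2 bits cover 3 rings */
#define SEQNO_MASK	((1u << RING_SHIFT) - 1)

/* Pack a ring id into the top bits of a 32-bit seqno slot. */
static uint32_t pack_seqno(uint32_t seqno, unsigned int ring_id)
{
	assert(seqno <= SEQNO_MASK && ring_id < 4);
	return ((uint32_t)ring_id << RING_SHIFT) | seqno;
}

static unsigned int packed_ring(uint32_t packed)
{
	return packed >> RING_SHIFT;
}

static uint32_t packed_seqno(uint32_t packed)
{
	return packed & SEQNO_MASK;
}

int main(void)
{
	uint32_t p = pack_seqno(0x1234, 2);

	printf("ring %u, seqno 0x%x\n", packed_ring(p), (unsigned)packed_seqno(p));
	return 0;
}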

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2011-12-13 16:01   ` Daniel Vetter
@ 2011-12-13 16:36     ` Keith Packard
  2011-12-13 22:49       ` Ben Widawsky
  2011-12-13 16:59     ` Chris Wilson
  1 sibling, 1 reply; 16+ messages in thread
From: Keith Packard @ 2011-12-13 16:36 UTC (permalink / raw)
  To: Daniel Vetter, Chris Wilson; +Cc: Daniel Vetter, Ben Widawsky, intel-gfx


On Tue, 13 Dec 2011 17:01:33 +0100, Daniel Vetter <daniel@ffwll.ch> wrote:

> - or remove it all and invalidate/flush unconditionally.

Eric and I were chatting yesterday about trying this -- it seems like
we'd be able to dramatically simplify the kernel module by doing this,
and given how much flushing already occurs, I doubt we'd see any
significant performance difference, and we'd save a pile of CPU time,
which might actually improve performance.

-- 
keith.packard@intel.com

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2011-12-13 16:01   ` Daniel Vetter
  2011-12-13 16:36     ` Keith Packard
@ 2011-12-13 16:59     ` Chris Wilson
  1 sibling, 0 replies; 16+ messages in thread
From: Chris Wilson @ 2011-12-13 16:59 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, Ben Widawsky, intel-gfx

On Tue, 13 Dec 2011 17:01:33 +0100, Daniel Vetter <daniel@ffwll.ch> wrote:
> Afaik the only use-case for parallel reads is video decode with
> post-processing on the render ring. The decode ring needs read-only access
> to reference frames to decode the next frame and the render ring read-only
> access to past frames for post-processing (e.g. deinterlacing). But given
> the general state of perf optimizations in libva I think we have lower
> hanging fruit to chase if we actually miss a performance target for this
> use-case.

One in the near future will be: render to backbuffer (RCS),
pageflip to scanout (BCS), read from front (RCS).

And in its current form UXA will do the back-to-front blit on the BCS.
But that is async and so not a large race window, whereas the pageflip
may take ~16ms to process. I don't think it is entirely unfeasible that
we see some form of this whilst running compositors or games. Or at
least would if we enabled semaphores for pageflips. Except in the
pageflip scenario we know we are protected by the fb ref, so consider
the hypothetical scenario where we have a working vsync'ed blit...

The real question, in any event, is whether we have enough instrumentation
to diagnose GPU stalls upon buffer migration. Then we can replace the read
optimisation with a tracepoint and wait for a test case.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2011-12-13  3:52 [PATCH] [RFC] drm/i915: read-read semaphore optimization Ben Widawsky
  2011-12-13  9:49 ` Chris Wilson
@ 2011-12-13 17:22 ` Eric Anholt
  2011-12-13 18:36   ` Ben Widawsky
  1 sibling, 1 reply; 16+ messages in thread
From: Eric Anholt @ 2011-12-13 17:22 UTC (permalink / raw)
  To: intel-gfx; +Cc: Daniel Vetter, Ben Widawsky


On Mon, 12 Dec 2011 19:52:08 -0800, Ben Widawsky <ben@bwidawsk.net> wrote:
> Since we don't differentiate on the different GPU read domains, it
> should be safe to allow back to back reads to occur without issuing a
> wait (or flush in the non-semaphore case).
> 
> This has the unfortunate side effect that we need to keep track of all
> the outstanding buffer reads so that we can synchronize on a write, to
> another ring (since we don't know which read finishes first). In other
> words, the code is quite simple for two rings, but gets more tricky for
> > 2 rings.
> 
> Here is a picture of the solution to the above problem
> 
> Ring 0            Ring 1             Ring 2
> batch 0           batch 1            batch 2
>   read buffer A     read buffer A      wait batch 0
>                                        wait batch 1
>                                        write buffer A
> 
> This code is really untested. I'm hoping for some feedback if this is
> worth cleaning up, and testing more thoroughly.

You say it's an optimization -- do you have performance numbers?

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2011-12-13 17:22 ` Eric Anholt
@ 2011-12-13 18:36   ` Ben Widawsky
  2012-01-16 21:50     ` Daniel Vetter
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Widawsky @ 2011-12-13 18:36 UTC (permalink / raw)
  To: Eric Anholt; +Cc: Daniel Vetter, intel-gfx

On 12/13/2011 09:22 AM, Eric Anholt wrote:
> On Mon, 12 Dec 2011 19:52:08 -0800, Ben Widawsky<ben@bwidawsk.net>  wrote:
>> Since we don't differentiate on the different GPU read domains, it
>> should be safe to allow back to back reads to occur without issuing a
>> wait (or flush in the non-semaphore case).
>>
>> This has the unfortunate side effect that we need to keep track of all
>> the outstanding buffer reads so that we can synchronize on a write, to
>> another ring (since we don't know which read finishes first). In other
>> words, the code is quite simple for two rings, but gets more tricky for
>>> 2 rings.
>>
>> Here is a picture of the solution to the above problem
>>
>> Ring 0            Ring 1             Ring 2
>> batch 0           batch 1            batch 2
>>    read buffer A     read buffer A      wait batch 0
>>                                         wait batch 1
>>                                         write buffer A
>>
>> This code is really untested. I'm hoping for some feedback if this is
>> worth cleaning up, and testing more thoroughly.
>
> You say it's an optimization -- do you have performance numbers?

A 33% improvement on a hacked version of gem_ring_sync_loop.

It's not really a valid test as it's not coherent, but this is 
approximately the best case improvement.

Oddly, semaphores don't make much difference in this test, which was
surprising.

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2011-12-13 16:36     ` Keith Packard
@ 2011-12-13 22:49       ` Ben Widawsky
  2011-12-13 23:22         ` Keith Packard
  2011-12-14  1:09         ` Eric Anholt
  0 siblings, 2 replies; 16+ messages in thread
From: Ben Widawsky @ 2011-12-13 22:49 UTC (permalink / raw)
  To: Keith Packard; +Cc: Daniel Vetter, intel-gfx

On 12/13/2011 08:36 AM, Keith Packard wrote:
> On Tue, 13 Dec 2011 17:01:33 +0100, Daniel Vetter <daniel@ffwll.ch> wrote:
> 
>> - or remove it all and invalidate/flush unconditionally.
> 
> Eric and I were chatting yesterday about trying this -- it seems like
> we'd be able to dramatically simplify the kernel module by doing this,
> and given how much flushing already occurs, I doubt we'd see any
> significant performance difference, and we'd save a pile of CPU time,
> which might actually improve performance.


Would we want to keep domain tracking if the HW worked correctly and we
didn't always have to flush? It seems like a shame to just gut the code
if it could actually offer a benefit on future generations.

I know Daniel has the same idea about gutting it...

Ben

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2011-12-13 22:49       ` Ben Widawsky
@ 2011-12-13 23:22         ` Keith Packard
  2011-12-14  1:09         ` Eric Anholt
  1 sibling, 0 replies; 16+ messages in thread
From: Keith Packard @ 2011-12-13 23:22 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Daniel Vetter, intel-gfx


On Tue, 13 Dec 2011 14:49:49 -0800, Ben Widawsky <ben@bwidawsk.net> wrote:

> Would we want to keep domain tracking if the HW worked correctly and we
> didn't have to always flush. It seems like a shame to just gut the code
> if it actually could offer a benefit on future generations.

That sounds like premature optimization to me. If we want something
similar on future hardware, we can resurrect the old code and see what
pieces are useful. For now, we're fighting correctness and stability
issues, and given the limited (zero? negative?) performance benefits, we
just need to get to code which works reliably and provides good
performance.

The current code has gotten to the 'piles of kludges on kludges' stage,
which makes it very fragile -- see the regression caused by changing
flushing orders in the VT-d work-around.

-- 
keith.packard@intel.com

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2011-12-13 22:49       ` Ben Widawsky
  2011-12-13 23:22         ` Keith Packard
@ 2011-12-14  1:09         ` Eric Anholt
  2011-12-14  3:25           ` Keith Packard
  1 sibling, 1 reply; 16+ messages in thread
From: Eric Anholt @ 2011-12-14  1:09 UTC (permalink / raw)
  To: Ben Widawsky, Keith Packard; +Cc: Daniel Vetter, intel-gfx


On Tue, 13 Dec 2011 14:49:49 -0800, Ben Widawsky <ben@bwidawsk.net> wrote:
> On 12/13/2011 08:36 AM, Keith Packard wrote:
> > On Tue, 13 Dec 2011 17:01:33 +0100, Daniel Vetter <daniel@ffwll.ch> wrote:
> > 
> >> - or remove it all and invalidate/flush unconditionally.
> > 
> > Eric and I were chatting yesterday about trying this -- it seems like
> > we'd be able to dramatically simplify the kernel module by doing this,
> > and given how much flushing already occurs, I doubt we'd see any
> > significant performance difference, and we'd save a pile of CPU time,
> > which might actually improve performance.
> 
> 
> Would we want to keep domain tracking if the HW worked correctly and we
> didn't have to always flush. It seems like a shame to just gut the code
> if it actually could offer a benefit on future generations.

It's not a matter of hardware behavior.  It's a matter of always needing
to flush the caches anyway, because you're emitting new commands to the
GPU and you want the results completed on the screen at the end.  So
we go to all this fragile, expensive CPU work for no benefit.

We introduced this complexity with no evidence that it would help, just
because we thought, like you, that "avoiding cache flushes should be
good, right?".  Experiments so far say we were wrong.

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2011-12-14  1:09         ` Eric Anholt
@ 2011-12-14  3:25           ` Keith Packard
  0 siblings, 0 replies; 16+ messages in thread
From: Keith Packard @ 2011-12-14  3:25 UTC (permalink / raw)
  To: Eric Anholt, Ben Widawsky; +Cc: Daniel Vetter, intel-gfx


On Tue, 13 Dec 2011 17:09:45 -0800, Eric Anholt <eric@anholt.net> wrote:

> We introduced this complexity with no evidence that it would help, just
> because we thought, like you, that "avoiding cache flushes should be
> good, right?".  Experiments so far say we were wrong.

Right, you'd think we'd have learned to not optimize in advance of
data. Someday maybe we'll know better...

And, of course, future hardware may require different code for optimal
performance. Who would have guessed that?

-- 
keith.packard@intel.com

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2011-12-13 18:36   ` Ben Widawsky
@ 2012-01-16 21:50     ` Daniel Vetter
  2012-01-16 22:20       ` Ben Widawsky
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2012-01-16 21:50 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: intel-gfx, Daniel Vetter

On Tue, Dec 13, 2011 at 10:36:15AM -0800, Ben Widawsky wrote:
> On 12/13/2011 09:22 AM, Eric Anholt wrote:
> >On Mon, 12 Dec 2011 19:52:08 -0800, Ben Widawsky<ben@bwidawsk.net>  wrote:
> >>Since we don't differentiate on the different GPU read domains, it
> >>should be safe to allow back to back reads to occur without issuing a
> >>wait (or flush in the non-semaphore case).
> >>
> >>This has the unfortunate side effect that we need to keep track of all
> >>the outstanding buffer reads so that we can synchronize on a write, to
> >>another ring (since we don't know which read finishes first). In other
> >>words, the code is quite simple for two rings, but gets more tricky for
> >>>2 rings.
> >>
> >>Here is a picture of the solution to the above problem
> >>
> >>Ring 0            Ring 1             Ring 2
> >>batch 0           batch 1            batch 2
> >>   read buffer A     read buffer A      wait batch 0
> >>                                        wait batch 1
> >>                                        write buffer A
> >>
> >>This code is really untested. I'm hoping for some feedback if this is
> >>worth cleaning up, and testing more thoroughly.
> >
> >You say it's an optimization -- do you have performance numbers?
> 
> 33% improvement on a hacked version of gem_ring_sync_loop with.
> 
> It's not really a valid test as it's not coherent, but this is
> approximately the best case improvement.
> 
> Oddly semaphores doesn't make much difference in this test, which
> was surprising.

Our domain tracking is already complicated in unfunny ways. And (at least
without a use-case showing gains with hard numbers in either perf or power
usage) I think this patch is the kind of "this looks cool" stuff that
added a lot to the current problem.

So before adding more complexity on top I'd like to remove some of the
superfluous stuff we already have, i.e. all the flushing_list stuff and
maybe other things ...

Cheers, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2012-01-16 21:50     ` Daniel Vetter
@ 2012-01-16 22:20       ` Ben Widawsky
  2012-01-17  3:41         ` Eric Anholt
  2012-01-17 10:15         ` Daniel Vetter
  0 siblings, 2 replies; 16+ messages in thread
From: Ben Widawsky @ 2012-01-16 22:20 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, intel-gfx

On 01/16/2012 01:50 PM, Daniel Vetter wrote:
> On Tue, Dec 13, 2011 at 10:36:15AM -0800, Ben Widawsky wrote:
>> On 12/13/2011 09:22 AM, Eric Anholt wrote:
>>> On Mon, 12 Dec 2011 19:52:08 -0800, Ben Widawsky<ben@bwidawsk.net>  wrote:
>>>> Since we don't differentiate on the different GPU read domains, it
>>>> should be safe to allow back to back reads to occur without issuing a
>>>> wait (or flush in the non-semaphore case).
>>>>
>>>> This has the unfortunate side effect that we need to keep track of all
>>>> the outstanding buffer reads so that we can synchronize on a write, to
>>>> another ring (since we don't know which read finishes first). In other
>>>> words, the code is quite simple for two rings, but gets more tricky for
>>>>> 2 rings.
>>>>
>>>> Here is a picture of the solution to the above problem
>>>>
>>>> Ring 0            Ring 1             Ring 2
>>>> batch 0           batch 1            batch 2
>>>>   read buffer A     read buffer A      wait batch 0
>>>>                                        wait batch 1
>>>>                                        write buffer A
>>>>
>>>> This code is really untested. I'm hoping for some feedback if this is
>>>> worth cleaning up, and testing more thoroughly.
>>>
>>> You say it's an optimization -- do you have performance numbers?
>>
>> 33% improvement on a hacked version of gem_ring_sync_loop with.
>>
>> It's not really a valid test as it's not coherent, but this is
>> approximately the best case improvement.
>>
>> Oddly semaphores doesn't make much difference in this test, which
>> was surprising.
> 
> Our domain tracking is already complicated in unfunny ways. And (at least
> without a use-case showing gains with hard numbers in either perf or power
> usage) I think this patch is the kind of "this looks cool" stuff that
> added a lot to the current problem.
> 
> So before adding more complexity on top I'd like to remove some of the
> superflous stuff we already have. I.e. all the flushing_list stuff and
> maybe other things ...

Can you be more clear on what exactly you want done before taking a
patch like this? Maybe I can work on it during some down time.

> 
> Cheers, Daniel

~Ben

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2012-01-16 22:20       ` Ben Widawsky
@ 2012-01-17  3:41         ` Eric Anholt
  2012-01-17 10:15         ` Daniel Vetter
  1 sibling, 0 replies; 16+ messages in thread
From: Eric Anholt @ 2012-01-17  3:41 UTC (permalink / raw)
  To: Ben Widawsky, Daniel Vetter; +Cc: Daniel Vetter, intel-gfx


On Mon, 16 Jan 2012 14:20:55 -0800, Ben Widawsky <ben@bwidawsk.net> wrote:
> On 01/16/2012 01:50 PM, Daniel Vetter wrote:
> > On Tue, Dec 13, 2011 at 10:36:15AM -0800, Ben Widawsky wrote:
> >> On 12/13/2011 09:22 AM, Eric Anholt wrote:
> >>> On Mon, 12 Dec 2011 19:52:08 -0800, Ben Widawsky<ben@bwidawsk.net>  wrote:
> >>>> Since we don't differentiate on the different GPU read domains, it
> >>>> should be safe to allow back to back reads to occur without issuing a
> >>>> wait (or flush in the non-semaphore case).
> >>>>
> >>>> This has the unfortunate side effect that we need to keep track of all
> >>>> the outstanding buffer reads so that we can synchronize on a write, to
> >>>> another ring (since we don't know which read finishes first). In other
> >>>> words, the code is quite simple for two rings, but gets more tricky for
> >>>>> 2 rings.
> >>>>
> >>>> Here is a picture of the solution to the above problem
> >>>>
> >>>> Ring 0            Ring 1             Ring 2
> >>>> batch 0           batch 1            batch 2
> >>>>   read buffer A     read buffer A      wait batch 0
> >>>>                                        wait batch 1
> >>>>                                        write buffer A
> >>>>
> >>>> This code is really untested. I'm hoping for some feedback if this is
> >>>> worth cleaning up, and testing more thoroughly.
> >>>
> >>> You say it's an optimization -- do you have performance numbers?
> >>
> >> 33% improvement on a hacked version of gem_ring_sync_loop with.
> >>
> >> It's not really a valid test as it's not coherent, but this is
> >> approximately the best case improvement.
> >>
> >> Oddly semaphores doesn't make much difference in this test, which
> >> was surprising.
> > 
> > Our domain tracking is already complicated in unfunny ways. And (at least
> > without a use-case showing gains with hard numbers in either perf or power
> > usage) I think this patch is the kind of "this looks cool" stuff that
> > added a lot to the current problem.
> > 
> > So before adding more complexity on top I'd like to remove some of the
> > superflous stuff we already have. I.e. all the flushing_list stuff and
> > maybe other things ...
> 
> Can you be more clear on what exactly you want done before taking a
> patch like this? Maybe I can work on it during some down time.

If it claims to be an optimization, at a minimum the patch should
include performance numbers.

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2012-01-16 22:20       ` Ben Widawsky
  2012-01-17  3:41         ` Eric Anholt
@ 2012-01-17 10:15         ` Daniel Vetter
  2012-01-17 17:55           ` Eric Anholt
  1 sibling, 1 reply; 16+ messages in thread
From: Daniel Vetter @ 2012-01-17 10:15 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: Daniel Vetter, intel-gfx

On Mon, Jan 16, 2012 at 02:20:55PM -0800, Ben Widawsky wrote:
> On 01/16/2012 01:50 PM, Daniel Vetter wrote:
> > On Tue, Dec 13, 2011 at 10:36:15AM -0800, Ben Widawsky wrote:
> >> On 12/13/2011 09:22 AM, Eric Anholt wrote:
> >>> On Mon, 12 Dec 2011 19:52:08 -0800, Ben Widawsky<ben@bwidawsk.net>  wrote:
> >>>> Since we don't differentiate on the different GPU read domains, it
> >>>> should be safe to allow back to back reads to occur without issuing a
> >>>> wait (or flush in the non-semaphore case).
> >>>>
> >>>> This has the unfortunate side effect that we need to keep track of all
> >>>> the outstanding buffer reads so that we can synchronize on a write, to
> >>>> another ring (since we don't know which read finishes first). In other
> >>>> words, the code is quite simple for two rings, but gets more tricky for
> >>>>> 2 rings.
> >>>>
> >>>> Here is a picture of the solution to the above problem
> >>>>
> >>>> Ring 0            Ring 1             Ring 2
> >>>> batch 0           batch 1            batch 2
> >>>>   read buffer A     read buffer A      wait batch 0
> >>>>                                        wait batch 1
> >>>>                                        write buffer A
> >>>>
> >>>> This code is really untested. I'm hoping for some feedback if this is
> >>>> worth cleaning up, and testing more thoroughly.
> >>>
> >>> You say it's an optimization -- do you have performance numbers?
> >>
> >> 33% improvement on a hacked version of gem_ring_sync_loop with.
> >>
> >> It's not really a valid test as it's not coherent, but this is
> >> approximately the best case improvement.
> >>
> >> Oddly semaphores doesn't make much difference in this test, which
> >> was surprising.
> > 
> > Our domain tracking is already complicated in unfunny ways. And (at least
> > without a use-case showing gains with hard numbers in either perf or power
> > usage) I think this patch is the kind of "this looks cool" stuff that
> > added a lot to the current problem.
> > 
> > So before adding more complexity on top I'd like to remove some of the
> > superflous stuff we already have. I.e. all the flushing_list stuff and
> > maybe other things ...
> 
> Can you be more clear on what exactly you want done before taking a
> patch like this? Maybe I can work on it during some down time.

I was thinking about Eric's no-more-domains stuff specifically, which has
tons of natural split-points - and we want to exploit these for merging.
Imo step 1 would be to just rework the batch dispatch in
intel_ringbuffer.c so that we unconditionally invalidate before and flush
afterwards. The ring->flush would become a no-op. No changes to the core
flushing_list tracking for step 1.

If we can get this in for 3.4 we could (in the -next merge cycle) walk all
the callchains from their ends and rip out everything which is a no-op,
starting from ring->flush. I think that's the safest way to attack this.
Eric?

Cheers, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48
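
A toy model of "unconditionally invalidate before and flush afterwards"
for the dispatch path; every name below is hypothetical and this is not
the actual intel_ringbuffer.c code.

#include <stdio.h>

static void emit_cache_invalidate(void)
{
	puts("emit: invalidate GPU caches");
}

static void emit_cache_flush(void)
{
	puts("emit: flush GPU caches to memory");
}

static void emit_batch(unsigned int offset, unsigned int len)
{
	printf("emit: batch buffer start at 0x%x, %u bytes\n", offset, len);
}

/* Unconditional bracketing: correctness no longer depends on tracking
 * which domains an object was last used in, so ring->flush (and the
 * bookkeeping feeding it) can eventually go away. */
static void dispatch_batch(unsigned int offset, unsigned int len)
{
	emit_cache_invalidate();
	emit_batch(offset, len);
	emit_cache_flush();
}

int main(void)
{
	dispatch_batch(0x10000, 4096);
	return 0;
}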

* Re: [PATCH] [RFC] drm/i915: read-read semaphore optimization
  2012-01-17 10:15         ` Daniel Vetter
@ 2012-01-17 17:55           ` Eric Anholt
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Anholt @ 2012-01-17 17:55 UTC (permalink / raw)
  To: Daniel Vetter, Ben Widawsky; +Cc: Daniel Vetter, intel-gfx


On Tue, 17 Jan 2012 11:15:31 +0100, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Mon, Jan 16, 2012 at 02:20:55PM -0800, Ben Widawsky wrote:
> > On 01/16/2012 01:50 PM, Daniel Vetter wrote:
> > > On Tue, Dec 13, 2011 at 10:36:15AM -0800, Ben Widawsky wrote:
> > >> On 12/13/2011 09:22 AM, Eric Anholt wrote:
> > >>> On Mon, 12 Dec 2011 19:52:08 -0800, Ben Widawsky<ben@bwidawsk.net>  wrote:
> > >>>> Since we don't differentiate on the different GPU read domains, it
> > >>>> should be safe to allow back to back reads to occur without issuing a
> > >>>> wait (or flush in the non-semaphore case).
> > >>>>
> > >>>> This has the unfortunate side effect that we need to keep track of all
> > >>>> the outstanding buffer reads so that we can synchronize on a write, to
> > >>>> another ring (since we don't know which read finishes first). In other
> > >>>> words, the code is quite simple for two rings, but gets more tricky for
> > >>>>> 2 rings.
> > >>>>
> > >>>> Here is a picture of the solution to the above problem
> > >>>>
> > >>>> Ring 0            Ring 1             Ring 2
> > >>>> batch 0           batch 1            batch 2
> > >>>>   read buffer A     read buffer A      wait batch 0
> > >>>>                                        wait batch 1
> > >>>>                                        write buffer A
> > >>>>
> > >>>> This code is really untested. I'm hoping for some feedback if this is
> > >>>> worth cleaning up, and testing more thoroughly.
> > >>>
> > >>> You say it's an optimization -- do you have performance numbers?
> > >>
> > >> 33% improvement on a hacked version of gem_ring_sync_loop with.
> > >>
> > >> It's not really a valid test as it's not coherent, but this is
> > >> approximately the best case improvement.
> > >>
> > >> Oddly semaphores doesn't make much difference in this test, which
> > >> was surprising.
> > > 
> > > Our domain tracking is already complicated in unfunny ways. And (at least
> > > without a use-case showing gains with hard numbers in either perf or power
> > > usage) I think this patch is the kind of "this looks cool" stuff that
> > > added a lot to the current problem.
> > > 
> > > So before adding more complexity on top I'd like to remove some of the
> > > superflous stuff we already have. I.e. all the flushing_list stuff and
> > > maybe other things ...
> > 
> > Can you be more clear on what exactly you want done before taking a
> > patch like this? Maybe I can work on it during some down time.
> 
> I was thinking about Eric's no-more-domains stuff specifically, which has
> tons of natural split-points - and we want to exploit these for merging.
> Imo step 1 would be to just rework the batch dispatch in
> intel_ringbuffer.c so that we unconditionally invalidate before and flush
> afterwards. The ring->flush would become a no-op. No changes to the core
> flushing_list tracking for step 1.
> 
> If we can get this in for 3.4 we could (in the -next merge cycle) walk all
> the callchains from their ends and rip out everything which is a no-op,
> starting from ring->flush. I think that's the safest way to attack this.
> Eric?

I was stuck with a mysterious regression in my previous attempt at the
patch series.  I've recently realized that I can do the major code
deletion with fewer + lines in the diff, which I hope will avoid that.
I'm working on that again now.
