Re: [Intel-gfx] [PATCH] drm/i915: Unconditionally flush writes before execbuffer

From: Daniel Vetter <daniel@ffwll.ch>
To: Chris Wilson <chris@chris-wilson.co.uk>,
	Daniel Vetter <daniel@ffwll.ch>,
	intel-gfx@lists.freedesktop.org,
	Akash Goel <akash.goel@intel.com>,
	stable@vger.kernel.org
Subject: Re: [Intel-gfx] [PATCH] drm/i915: Unconditionally flush writes before execbuffer
Date: Thu, 21 May 2015 16:21:46 +0200	[thread overview]
Message-ID: <20150521142146.GF15256@phenom.ffwll.local> (raw)
In-Reply-To: <20150521131301.GQ17761@nuc-i3427.alporthouse.com>

On Thu, May 21, 2015 at 02:13:01PM +0100, Chris Wilson wrote:
> On Thu, May 21, 2015 at 03:07:54PM +0200, Daniel Vetter wrote:
> > On Thu, May 21, 2015 at 02:00:34PM +0100, Chris Wilson wrote:
> > > On Tue, May 19, 2015 at 03:41:48PM +0100, Chris Wilson wrote:
> > > > On Mon, May 11, 2015 at 04:25:52PM +0100, Chris Wilson wrote:
> > > > > On Mon, May 11, 2015 at 12:34:37PM +0200, Daniel Vetter wrote:
> > > > > > On Mon, May 11, 2015 at 08:51:36AM +0100, Chris Wilson wrote:
> > > > > > > With the advent of mmap(wc), we have a path to write directly into
> > > > > > > active GPU buffers. When combined with async updates (i.e. avoiding the
> > > > > > > explicit domain management along with the memory barriers and GPU
> > > > > > > stalls) we start to see the GPU read the wrong values from memory - i.e.
> > > > > > > we have insufficient memory barriers along the execbuffer path. Writes
> > > > > > > through the GTT should have been naturally serialised with execution
> > > > > > > through the GTT as well and so the impact only seems to be from the WC
> > > > > > > paths.
> > > > > > > 
> > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > > Cc: Akash Goel <akash.goel@intel.com>
> > > > > > > Cc: stable@vger.kernel.org
> > > > > > 
> > > > > > Do we have a nasty igt for this? Bugzilla?
> > > > > 
> > > > > I've added igt/gem_streaming_writes.
> > > > > 
> > > > > That wmb() is not enough for !llc. Since the wmb() made piglit happy it
> > > > > is quite possible I haven't hit the same path exactly, but it's going to
> > > > > take some investigation to see if igt/gem_streaming_writes can possibly
> > > > > work on !llc.
> > > > 
> > > > Humbug.
> > > > 
> > > > Found the bug in gem_streaming_writes, even though I still think the
> > > > wmb() is strictly required, it runs fine without (presumably I haven't
> > > > managed to avoid all barriers in the execbuffer path yet). However, I
> > > > think can improve the stress by inserting extra gpu load -- that should
> > > > help make the CPU writes / GPU reads of the buffer concurrent?
> > > 
> > > Just a small update. I haven't found a way to reproduce this in igt yet,
> > > but I can still observe the effect using vbo-map-unsync and the fix
> > > there is the above patch to make the wmb() unconditional.
> > > 
> > > We need to put this into stable@ reasonably quickly (I suspect some of
> > > the 4.0 mmap(wc) regressions are due to this as well).
> > 
> > What about
> > 
> > 	if (flush_domains & (GTT | CPU))
> > 		wmb();
> > 
> > instead? That would imo explain things a lot better, since cpu wc is
> > treated as if in the CPU domains. Hm, looking at the igt that's not quite
> > the case, we still put it into the GTT domain for wc mmaps afaict.
> 
> No. flush_domains is 0. We are talking about async writes which means
> that userspace is not telling the kernel about susbsequent writes into
> the inactive portions of the bo, and trusting that the buffer is
> coherent and the writes are flushed. Putting the wmb() in the kernel is
> not the only solution, but the most convenient (and allows us to just
> emit one wmb() - but given the large number of other potential barriers
> in this path, I am surprised that is required. Empirical evidence to the
> contrary!)

Hm right. What about emphasising this a bit more in the comment:

	/*
	 * Empirical evidence indicates that we need a write barrier to
	 * make sure write-combined writes (both to the gtt, but also to
	 * the cpu mmaps). But userspace also uses wc mmaps as
	 * unsynchronized upload paths where it inform the kernel about
	 * domain changes (to avoid the stalls). Hence we must do this
	 * barrier unconditinally.
	 */

Mostly just rewording, unsing unsynchronized as used by gl/libdrm and
clarification why we need to have the barrier unconditionally. With that

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

And I guess also

Cc: stable@vger.kernel.org
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch