Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno

From: Imre Deak <imre.deak@intel.com>
To: Dave Gordon <david.s.gordon@intel.com>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/2] drm/i915/bxt: work around HW coherency issue when accessing GPU seqno
Date: Wed, 10 Jun 2015 13:59:33 +0300	[thread overview]
Message-ID: <1433933973.25216.48.camel@intel.com> (raw)
In-Reply-To: <5575EDF9.9070303@intel.com>

On ma, 2015-06-08 at 20:33 +0100, Dave Gordon wrote:
> On 08/06/15 19:40, Ville Syrjälä wrote:
> > On Mon, Jun 08, 2015 at 07:00:49PM +0100, Chris Wilson wrote:
> >> On Mon, Jun 08, 2015 at 08:34:51PM +0300, Ville Syrjälä wrote:
> >>> On Mon, Jun 08, 2015 at 06:12:47PM +0100, Chris Wilson wrote:
> >>>> On Mon, Jun 08, 2015 at 06:08:00PM +0100, Dave Gordon wrote:
> >>>>> On 08/06/15 17:28, Imre Deak wrote:
> >>>>>> By running igt/store_dword_loop_render on BXT we can hit a coherency
> >>>>>> problem where the seqno written at GPU command completion time is not
> >>>>>> seen by the CPU. This results in __i915_wait_request seeing the stale
> >>>>>> seqno and not completing the request (not considering the lost
> >>>>>> interrupt/GPU reset mechanism). I also verified that this isn't a case
> >>>>>> of a lost interrupt, or that the command didn't complete somehow: when
> >>>>>> the coherency issue occured I read the seqno via an uncached GTT mapping
> >>>>>> too. While the cached version of the seqno still showed the stale value
> >>>>>> the one read via the uncached mapping was the correct one.
> >>>>>>
> >>>>>> Work around this issue by clflushing the corresponding CPU cacheline
> >>>>>> following any store of the seqno and preceding any reading of it. When
> >>>>>> reading it do this only when the caller expects a coherent view.
> >>>>>>
> >>>>>> Testcase: igt/store_dword_loop_render
> >>>>>> Signed-off-by: Imre Deak <imre.deak@intel.com>
> >>>>>
> >>>>> Not necessarily a cure for this, but BSpec says of MI_STORE_DATA_IMM
> >>>>> (and MI_STORE_DATA_INDEX):
> >>>>>
> >>>>> 	This command simply initiates the write operation with
> >>>>> 	command execution proceeding normally. Although the write
> >>>>> 	operation is guaranteed to complete "eventually", there is
> >>>>> 	no mechanism to synchronize command execution with the
> >>>>> 	completion (or even initiation) of these operations.
> >>>>>
> >>>>> So shouldn't we use MI_FLUSH_DW or PIPE_CONTROL to update the seqno in
> >>>>> the HWSP instead?
> >>>>
> >>>> iirc there is also no guarrantee for when the post-sync write op is
> >>>> completed for a FLUSH_DW/PIPE_CONTROL either. I'd be happy to be
> >>>> corrected!
> >>>
> >>> My reading of the spec suggests that something like this could work:
> >>> PIPE_CONTROL w/ PIPE_CONTROL_QW_WRITE
> >>> PIPE_CONTROL w/ PIPE_CONTROL_NOTIFY | PIPE_CONTROL_FLUSH_ENABLE
> >>
> >> Absolutely sure? The issue is not the completion of the PIPE_CONTROL,
> >> but of ensure that the value has been written to memory and the CPU
> >> cache snooped. I don't remember there being anything as clear as the
> >> gen2-5 statement that all writes are coherent before the interrupt is
> >> raised.
> >>
> >> We can hit the issue that the seqno writes aren't coherent before the
> >> interrupt with our current method - I have seen it with hsw, so this is
> >> definitely something worth improving.
> > 
> > What I get from the spec is:
> > - The post-sync operation is started after previous and current flushes
> >   have completed
> > - The flush enable bit causes the CS to wait until all previous
> >   post-sync operations have completed, which hopefully means the
> >   store is visible to everyone
> > - The notify interrupt is signalled after the current sync operation
> >   has completed, which I hope means the flush enable stall has also
> >   finished (and if not a three PIPE_CONTROL seqence could be used
> >   instead)
> > 
> > So no, I'm not absolutely sure by any means.
> 
> Here's the commentary for MI_FLUSH_DW:
> 
> 	The MI_FLUSH_DW command is used to perform an internal "flush"
> 	operation. The parser pauses on an internal flush until all
> 	drawing engines have completed any pending operations. In
> 	addition, this command can also be used to: Flush any dirty
> 	data to memory. Invalidate the TLB cache inside the hardware
> 
> 	Usage note: After this command is completed with a Store DWord
> 	enabled, CPU access to graphics memory will be coherent
> 	(assuming the Render Cache flush is not inhibited).
> 
> 	...
> 
> 	NOTIFY: If ENABLED, a Sync Completion Interrupt will be
> 	generated (if enabled by the MI Interrupt Control registers)
> 	once the sync operation is complete. See Interrupt Control
> 	Registers in Memory Interface Registers for details.
> 
> So that's likewise unclear about whether the interrupt comes after just
> the "sync" or after the "post sync operation" as well.
> 
> Then there's the DC_FLUSH bit in the PIPE_CONTROL instruction:
> 
> 	DC Flush (L3 Flush) by default doesn’t result in
> 	flushing/invalidating the IA Coherent lines from L3$, however
> 	this can be achieved by setting control bit “Pipe line flush
> 	Coherent lines” in “L3SQCREG4” register.

I think the discussion here is about two separate things:
1. Possible ordering issue between the seqno store and the completion
interrupt
2. Coherency issue that leaves the CPU with a stale view of the seqno
indefinitely, which this patch works around

I'm confident that in my case the problem is not due to ordering. If it
was "only" ordering then the value would show up eventually. This is not
the case though, __wait_for_request will see the stale value
indefinitely even though it gets woken up periodically afterwards by the
lost IRQ logic (with hangcheck disabled).

I have tried to replace the MI_STORE_MEMORY_IMM+MI_USER_INTERRUPT on the
render ring with a single PIPE_CONTROL to store the seqno and send a
notify interrupt, setting the CS_STALL, PIPE_CONTROL_FLUSH_ENABLE,
DC_FLUSH_ENABLE bits. This didn't get rid of the problem, I had the same
behavior as before. I also tried setting L3SQCREG4 bit 21 as you
suggested, but it didn't have an effect.

Similarly I tried now to replace the MI_STORE_MEMORY_IMM
+MI_USER_INTERRUPT on the BSD/BLT/VEBOX rings with a MI_FLUSH_DW storing
the seqno and sending a sync interrupt. As above, this didn't solve the
problem.

I wonder if stores by MI_STORE_MEMORY_IMM even go through the L3$.
According to Ville they are not and so I'm not sure how the above flush
operations would solve this problem.

> And is the GPU's MI_CLFLUSH instruction of any use?

According to the spec this command is RCS specific, so it wouldn't help
in case of the other rings (where I also see this problem without this
patch).

--Imre

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx