From mboxrd@z Thu Jan 1 00:00:00 1970 From: Francisco Jerez Subject: Re: [PATCH v2] drm/i915: Restore inhibiting the load of the default context Date: Thu, 10 Dec 2015 15:57:30 +0200 Message-ID: <87r3iuqud1.fsf@riseup.net> References: <87zixz4r0k.fsf@gaia.fi.intel.com> <1448630935-27377-1-git-send-email-chris@chris-wilson.co.uk> <87y4d2y5am.fsf@gaia.fi.intel.com> <87wpsmqvvf.fsf@riseup.net> <20151210133756.GN29974@nuc-i3427.alporthouse.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============0130812288==" Return-path: Received: from mx1.riseup.net (mx1.riseup.net [198.252.153.129]) by gabe.freedesktop.org (Postfix) with ESMTPS id 6B1DB6E9FA for ; Thu, 10 Dec 2015 05:57:52 -0800 (PST) In-Reply-To: <20151210133756.GN29974@nuc-i3427.alporthouse.com> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" To: Chris Wilson Cc: intel-gfx@lists.freedesktop.org, Ben Widawsky , Daniel Vetter List-Id: intel-gfx@lists.freedesktop.org --===============0130812288== Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Chris Wilson writes: > On Thu, Dec 10, 2015 at 03:24:52PM +0200, Francisco Jerez wrote: >> Mika Kuoppala writes: >>=20 >> > Chris Wilson writes: >> > >> >> Following a GPU reset, we may leave the context in a poorly defined >> >> state, and reloading from that context will leave the GPU flummoxed. = For >> >> secondary contexts, this will lead to that context being banned - but >> >> currently it is also causing the default context to become banned, >> >> leading to turmoil in the shared state. >> >> >> >> This is a regression from >> >> >> >> commit 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b [v4.1] >> >> Author: Ben Widawsky >> >> Date: Mon Mar 16 16:00:58 2015 +0000 >> >> >> >> drm/i915: Initialize all contexts >> >> >> >> which quietly introduced the removal of the MI_RESTORE_INHIBIT on the >> >> default context. >> >> >>=20 >> AFAICT the removal of MI_RESTORE_INHIBIT in that commit seemed >> justified. Ben explained that it was needed to fix a pagefault in the >> default context under certain conditions. I don't know the details of >> the original change (Ben CC'ed), but it seems like this would be trading >> one bug for another? > > A bug in a feature that never worked and isn't enabled? >=20=20 >> Other than that this opens the door again to subtle state leaks between >> processes, as I learned recently while implementing L3 partitioning in >> Mesa. Mesa now changes the L3 configuration in ways that will break >> assumptions from processes that use the default context (the DDX). The >> DDX assumes, for instance, that the URB size is set according to the >> hardware defaults, but it doesn't actually program the URB size itself, >> so in a way the DDX relies on MI_RESTORE_INHIBIT *not* to be set for the >> default context for correct operation. This commit breaks that >> assumption and the kernel ABI. > > Wrong the ABI is the other way around and has been for 10 years. Note > the kernel isn't also ensuring that the default context has the default > hardware values either. The "golden renderstate" also doesn't set all > defaults and is also a recent introduction. > > It changes the ABI back to what it was.... Sounds like a change fully unrelated to fixing the bug on GPU reset. The ABI change done in 6702cf16e0ba8b0129f5aa1b6609d4e9c70bc13b seemed legitimate because it made the context state on switch more well-defined than it used to be, IOW it was a backwards-compatible ABI change because it couldn't possibly break userspace assumptions. This change OTOH breaks an assumption about the kernel ABI done by the DDX now. >=20=20 >> Mesa has a workaround in place to reduce the likelihood of such leaks, >> but the solution is far from ideal because it relies on userspace >> cooperation and had a measurable impact in performance (because it >> requires userspace to assume the worst-case scenario that the following >> batch is going to be from a different context with MI_RESTORE_INHIBIT >> set, so we have to restore the hardware default L3 configuration at the >> end of every batch even if that's actually not the case), for that >> reason we would like to drop the userspace workaround in the future at >> least on v4.1 kernels and newer. > > Mesa has workarounds for leaks between hardware contexts, i.e. state > that is not stored in the context itself. Mesa also has to assume a > clean slate when setting up the context for the first time, and doesn't > use the default context itself. >=20=20 No, the workaround involves restoring state which *is* part of the hardware context, it just won't get saved and restored with this commit because of the MI_RESTORE_INHIBIT flag when switching to the DDX' default context. >> One more question below. >>=20 >> >> v2: Mark the global default context as uninitialized on GPU reset so >> >> that the context-local workarounds are reloaded upon re-enabling. >> >> >> >> Signed-off-by: Chris Wilson >> > >> > Reviewed-by: Mika Kuoppala >> > >> >> Cc: Michel Thierry >> >> Cc: Mika Kuoppala >> >> Cc: Daniel Vetter >> >> --- >> >> drivers/gpu/drm/i915/i915_gem_context.c | 6 +++++- >> >> 1 file changed, 5 insertions(+), 1 deletion(-) >> >> >> >> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/dr= m/i915/i915_gem_context.c >> >> index 43761c5bcaca..f024d5d2c746 100644 >> >> --- a/drivers/gpu/drm/i915/i915_gem_context.c >> >> +++ b/drivers/gpu/drm/i915/i915_gem_context.c >> >> @@ -340,6 +340,10 @@ void i915_gem_context_reset(struct drm_device *d= ev) >> >> i915_gem_context_unreference(lctx); >> >> ring->last_context =3D NULL; >> >> } >> >> + >> >> + /* Force the GPU state to be reinitialised on enabling */ >> >> + if (ring->default_context) >> >> + ring->default_context->legacy_hw_ctx.initialized =3D false; >> >> } >> >> } >> >>=20=20 >> >> @@ -708,7 +712,7 @@ static int do_switch(struct drm_i915_gem_request = *req) >> >> if (ret) >> >> goto unpin_out; >> >>=20=20 >> >> - if (!to->legacy_hw_ctx.initialized) { >> >> + if (!to->legacy_hw_ctx.initialized || i915_gem_context_is_default(t= o)) { >>=20 >> This hunk causes MI_RESTORE_INHIBIT to be set again for the default >> context regardless of whether a reset happened or not, so it seems >> unrelated to the rest of your change. Maybe I'm understanding this >> wrong but doesn't the !initialized check together with the hunk above >> already guarantee that MI_RESTORE_INHIBIT will be set after GPU reset, >> which is what you wanted to achieve? > > This is the change to *restore* the kernel ABI. The flag in reset is to > force the WA LRI to be re-emitted (not for gen6 ofc) as they will not be > preserved. Again you don't justify why changing the kernel ABI is required to fix a GPU reset bug. > -Chris > > --=20 > Chris Wilson, Intel Open Source Technology Centre --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iF4EAREIAAYFAlZphMoACgkQg5k4nX1Sv1teqQEAj8hPWwLeRU/LZNeO9OPw0jLm nDkU6YyRJbbr1cjLgUEA+QECtxz8UiLIER4CSXNQNAxgQyzf7Iedte069UAXCDVW =yFWd -----END PGP SIGNATURE----- --==-=-=-- --===============0130812288== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Content-Disposition: inline X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KSW50ZWwtZ2Z4 IG1haWxpbmcgbGlzdApJbnRlbC1nZnhAbGlzdHMuZnJlZWRlc2t0b3Aub3JnCmh0dHA6Ly9saXN0 cy5mcmVlZGVza3RvcC5vcmcvbWFpbG1hbi9saXN0aW5mby9pbnRlbC1nZngK --===============0130812288==--