All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Ekstrand <jason@jlekstrand.net>
To: Francisco Jerez <currojerez@riseup.net>
Cc: "Intel GFX" <intel-gfx@lists.freedesktop.org>,
	"Marcin Ślusarz" <marcin.slusarz@gmail.com>,
	"Chris Wilson" <chris@chris-wilson.co.uk>
Subject: Re: [Intel-gfx] [PATCH] drm/i915: Disable atomics in L3 for gen9
Date: Mon, 9 Nov 2020 13:52:26 -0600	[thread overview]
Message-ID: <CAOFGe96Hs+7GXT=vNmB14oXRnNjLna7D67Pm1G9jrkc3ws3=+w@mail.gmail.com> (raw)
In-Reply-To: <87a7d3qjzx.fsf@riseup.net>

We need to land this patch.  The number of bugs we have piling up in
Mesa gitlab related to this is getting a lot larger than I'd like.
I've gone back and forth with various HW and SW people internally for
countless e-mail threads and there is no other good workaround.  Yes,
the perf hit to atomics sucks but, fortunately, most games don't use
them heavily enough for it to make a significant impact.  We should
just eat the perf hit and fix the hangs.

Reviewed-by: Jason Ekstrand <jason@jlesktrand.net>

--Jason

On Wed, Jul 24, 2019 at 3:02 PM Francisco Jerez <currojerez@riseup.net> wrote:
>
> Chris Wilson <chris@chris-wilson.co.uk> writes:
>
> > Quoting Francisco Jerez (2019-07-23 23:19:13)
> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
> >>
> >> > Quoting Tvrtko Ursulin (2019-07-22 12:41:36)
> >> >>
> >> >> On 20/07/2019 15:31, Chris Wilson wrote:
> >> >> > Enabling atomic operations in L3 leads to unrecoverable GPU hangs, as
> >> >> > the machine stops responding milliseconds after receipt of the reset
> >> >> > request [GDRT]. By disabling the cached atomics, the hang do not occur
> >> >> > and we presume the GPU would reset normally for similar hangs.
> >> >> >
> >> >> > Reported-by: Jason Ekstrand <jason@jlekstrand.net>
> >> >> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110998
> >> >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >> >> > Cc: Jason Ekstrand <jason@jlekstrand.net>
> >> >> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >> >> > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >> >> > ---
> >> >> > Jason reports that Windows is not clearing L3SQCREG4:22 and does not
> >> >> > suffer the same GPU hang so it is likely some other w/a that interacts
> >> >> > badly. Fwiw, these 3 are the only registers I could find that mention
> >> >> > atomic ops (and appear to be part of the same chain for memory access).
> >> >>
> >> >> Bit-toggling itself looks fine to me and matches what I could find in
> >> >> the docs. (All three bits across three registers should be equal.)
> >> >>
> >> >> What I am curious about is what are the other consequences of disabling
> >> >> L3 atomics? Performance drop somewhere?
> >> >
> >> > The test I have where it goes from dead to passing, that's a considerable
> >> > performance improvement ;)
> >> >
> >> > I imagine not being able to use L3 for atomics is pretty dire, whether that
> >> > has any impact, I have no clue.
> >> >
> >> > It is still very likely that we see this because we are doing something
> >> > wrong elsewhere.
> >>
> >> This reminds me of f3fc4884ebe6ae649d3723be14b219230d3b7fd2 followed by
> >> d351f6d94893f3ba98b1b20c5ef44c35fc1da124 due to the massive impact (of
> >> the order of 20x IIRC) using the L3 turned out to have on the
> >> performance of HDC atomics, on at least that platform.  It seems
> >> unfortunate that we're going to lose L3 atomics on Gen9 now, even though
> >> it's only buffer atomics which are broken IIUC, and even though the
> >> Windows driver is somehow getting away without disabling them.  Some of
> >> our setup must be wrong either in the kernel or in userspace...  Are
> >> these registers at least whitelisted so userspace can re-enable L3
> >> atomics once the problem is addressed?  Wouldn't it be a more specific
> >> workaround for userspace to simply use a non-L3-cacheable MOCS for
> >> (rarely used) buffer surfaces, so it could benefit from L3 atomics
> >> elsewhere?
> >
> > If it was the case that disabling L3 atomics was the only way to prevent
> > the machine lockup under this scenario, then I think it is
> > unquestionably the right thing to do, and we could not leave it to
> > userspace to dtrt. We should never add non-context saved unsafe
> > registers to the whitelist (if setting a register may cause data
> > corruption or worse in another context/process, that is bad) despite our
> > repeated transgressions. However, there's no evidence to say that it does
> > prevent the machine lockup as it prevents the GPU hang that lead to the
> > lockup on reset.
> >
> > Other than GPGPU requiring a flush around every sneeze, I did not see
> > anything in the gen9 w/a list that seemed like a match. Nevertheless, I
> > expect there is a more precise w/a than a blanket disable.
> > -Chris
>
> Supposedly there is a more precise one (setting the surface state MOCS
> to UC for buffer images), but it relies on userspace doing the right
> thing for the machine not to lock up.  There is a good chance that the
> reason why L3 atomics hang on such buffers is ultimately under userspace
> control, in which case we'll eventually have to undo the programming
> done in this patch in order to re-enable L3 atomics once the problem is
> addressed.  That means that userspace will have the freedom to hang the
> machine hard once again, which sounds really bad, but it's no real news
> for us (*cough* HSW *cough*), and it might be the only way to match the
> performance of the Windows driver.
>
> What can we do here?  Add an i915 option to enable performance features
> that can lead to the system hanging hard under malicious (or
> incompetent) userspace programming?  Probably only the user can tell
> whether the trade-off between performance and security of the system is
> acceptable...
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2020-11-09 19:52 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-20 14:31 [PATCH] drm/i915: Disable atomics in L3 for gen9 Chris Wilson
2019-07-20 15:10 ` ✓ Fi.CI.BAT: success for " Patchwork
2019-07-20 16:23 ` ✓ Fi.CI.IGT: " Patchwork
2019-07-22 11:41 ` [PATCH] " Tvrtko Ursulin
2019-07-23 11:55   ` Chris Wilson
2019-07-23 22:19     ` Francisco Jerez
2019-07-24 14:34       ` Chris Wilson
2019-07-24 20:02         ` Francisco Jerez
2020-11-09 19:52           ` Jason Ekstrand [this message]
2020-11-09 20:15             ` [Intel-gfx] " Chris Wilson
2020-11-09 20:48               ` Ville Syrjälä
2021-01-25 21:52 [Intel-gfx] [CI] " Chris Wilson
2021-01-25 22:01 ` [Intel-gfx] [PATCH] " Chris Wilson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAOFGe96Hs+7GXT=vNmB14oXRnNjLna7D67Pm1G9jrkc3ws3=+w@mail.gmail.com' \
    --to=jason@jlekstrand.net \
    --cc=chris@chris-wilson.co.uk \
    --cc=currojerez@riseup.net \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=marcin.slusarz@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.