From mboxrd@z Thu Jan  1 00:00:00 1970
From: Daniel Vetter <daniel@ffwll.ch>
Subject: Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO
 from intel_ring_begin
Date: Mon, 2 Jul 2012 18:47:48 +0200
Message-ID: <20120702164748.GC5064@phenom.ffwll.local>
References: <1340744933-11835-1-git-send-email-daniel.vetter@ffwll.ch>
	<20120630200936.5a802fe1@bwidawsk.net>
	<CAKMK7uFiFL7CQ5Cg8y0qSkBOZH+CrwcWLLowC=zOnnK-EAG6xg@mail.gmail.com>
	<20120702090448.119b02cc@bwidawsk.net>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org>
Received: from mail-ee0-f49.google.com (mail-ee0-f49.google.com [74.125.83.49])
	by gabe.freedesktop.org (Postfix) with ESMTP id 5A9569E7D9
	for <intel-gfx@lists.freedesktop.org>;
	Mon,  2 Jul 2012 09:47:51 -0700 (PDT)
Received: by eekd17 with SMTP id d17so2212257eek.36
	for <intel-gfx@lists.freedesktop.org>;
	Mon, 02 Jul 2012 09:47:50 -0700 (PDT)
Content-Disposition: inline
In-Reply-To: <20120702090448.119b02cc@bwidawsk.net>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
Sender: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org
Errors-To: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org
To: Ben Widawsky <ben@bwidawsk.net>
Cc: intel-gfx@lists.freedesktop.org
List-Id: intel-gfx@lists.freedesktop.org

On Mon, Jul 02, 2012 at 09:04:48AM -0700, Ben Widawsky wrote:
> On Sun, 1 Jul 2012 12:41:19 +0200
> Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky <ben@bwidawsk.net> wrote:
> > > On Tue, 26 Jun 2012 23:08:50 +0200
> > > Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > >> Having this early check in intel_ring_begin doesn't buy us anything,
> > >> since we'll be calling into wait_request in the usual case already
> > >> anyway. In the corner case of not waiting for free space using the
> > >> last_retired_head we simply need to do the same check, too.
> > >>
> > >> With these changes we'll only ever get an -EIO from intel_ring_begin
> > >> if the gpu has truely been declared dead.
> > >>
> > >> v2: Don't conflate the change to prevent intel_ring_begin from returning
> > >> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
> > >> just prep work to avoid returning -EAGAIN in callsites that can't handle
> > >> syscall restarting.
> > >
> > > I'm really scared by this change. It's diffuclt to review because there
> > > are SO many callers of intel_ring_begin, and figuring out if they all
> > > work in the wedged case is simply too difficult. I've yet to review the
> > > rest of the series, but it may make more sense to put this change last
> > > perhaps?
> > 
> > Well, this patch doesn't really affect much what error codes the
> > callers get - we'll still throw both -EGAIN and -EIO at them (later on
> > patches will fix this).
> > 
> > What this patch does though is prevent us from returning too many
> > -EIO. Imagine the gpu died and we've noticed already (hence
> > dev_priv->mm.wedged is set), but some process is stuck churning
> > through the execbuf ioctl, holding dev->struct_mutex. While processing
> > the execbuf we call intel_ring_begin to emit a few commands. Now
> > usually, even when the gpu is dead, there is enough space in the ring
> > to do so, which allows us to complete the execbuf ioctl and then later
> > on we can block properly when trying to grab the mutex in the next
> > ioctl for the gpu reset work handler to complete.
> 
> That in itself is a pretty big change, don't you think? It seems rather
> strange and dangerous to modify HW (which we will if we allow execbuf to
> continue when we write the tail pointer). I think we want some way to
> not write the tail ptr in such a case. Now you might respond, well, who
> cares about writes? But this imposes a pretty large restriction on any
> code that can't work well after the GPU is hung.
> 
> I see the irony. I'm complaining that you can make GPU hangs unstable,
> and the patch series is fixing GPU reset. Call it paranoia, it still
> seems unsafe to me, and makes us have to think a bit more whenever
> adding any code. Am I over-thinking this?

I think so. It takes us a few seconds before we notice that the gpu died -
only when the hangcheck timer expired at least twice we'll declare the hw
wedged. Imo a few seconds is plenty of time to sneak in a few ringbuffer
emissions (and hence tail pointer writes) ;-)

So I think trying to avoid touching the gpu for another few usecs _is_
overthinking the problem, because fundamentally we can't avoid touching a
dead gpu - we will only ever notice its demise after the fact.

Yours, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48