From mboxrd@z Thu Jan 1 00:00:00 1970 From: Daniel Vetter Subject: Re: [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin Date: Mon, 2 Jul 2012 18:47:48 +0200 Message-ID: <20120702164748.GC5064@phenom.ffwll.local> References: <1340744933-11835-1-git-send-email-daniel.vetter@ffwll.ch> <20120630200936.5a802fe1@bwidawsk.net> <20120702090448.119b02cc@bwidawsk.net> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Received: from mail-ee0-f49.google.com (mail-ee0-f49.google.com [74.125.83.49]) by gabe.freedesktop.org (Postfix) with ESMTP id 5A9569E7D9 for ; Mon, 2 Jul 2012 09:47:51 -0700 (PDT) Received: by eekd17 with SMTP id d17so2212257eek.36 for ; Mon, 02 Jul 2012 09:47:50 -0700 (PDT) Content-Disposition: inline In-Reply-To: <20120702090448.119b02cc@bwidawsk.net> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org Errors-To: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org To: Ben Widawsky Cc: intel-gfx@lists.freedesktop.org List-Id: intel-gfx@lists.freedesktop.org On Mon, Jul 02, 2012 at 09:04:48AM -0700, Ben Widawsky wrote: > On Sun, 1 Jul 2012 12:41:19 +0200 > Daniel Vetter wrote: > > > On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky wrote: > > > On Tue, 26 Jun 2012 23:08:50 +0200 > > > Daniel Vetter wrote: > > > > > >> Having this early check in intel_ring_begin doesn't buy us anything, > > >> since we'll be calling into wait_request in the usual case already > > >> anyway. In the corner case of not waiting for free space using the > > >> last_retired_head we simply need to do the same check, too. > > >> > > >> With these changes we'll only ever get an -EIO from intel_ring_begin > > >> if the gpu has truely been declared dead. > > >> > > >> v2: Don't conflate the change to prevent intel_ring_begin from returning > > >> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is > > >> just prep work to avoid returning -EAGAIN in callsites that can't handle > > >> syscall restarting. > > > > > > I'm really scared by this change. It's diffuclt to review because there > > > are SO many callers of intel_ring_begin, and figuring out if they all > > > work in the wedged case is simply too difficult. I've yet to review the > > > rest of the series, but it may make more sense to put this change last > > > perhaps? > > > > Well, this patch doesn't really affect much what error codes the > > callers get - we'll still throw both -EGAIN and -EIO at them (later on > > patches will fix this). > > > > What this patch does though is prevent us from returning too many > > -EIO. Imagine the gpu died and we've noticed already (hence > > dev_priv->mm.wedged is set), but some process is stuck churning > > through the execbuf ioctl, holding dev->struct_mutex. While processing > > the execbuf we call intel_ring_begin to emit a few commands. Now > > usually, even when the gpu is dead, there is enough space in the ring > > to do so, which allows us to complete the execbuf ioctl and then later > > on we can block properly when trying to grab the mutex in the next > > ioctl for the gpu reset work handler to complete. > > That in itself is a pretty big change, don't you think? It seems rather > strange and dangerous to modify HW (which we will if we allow execbuf to > continue when we write the tail pointer). I think we want some way to > not write the tail ptr in such a case. Now you might respond, well, who > cares about writes? But this imposes a pretty large restriction on any > code that can't work well after the GPU is hung. > > I see the irony. I'm complaining that you can make GPU hangs unstable, > and the patch series is fixing GPU reset. Call it paranoia, it still > seems unsafe to me, and makes us have to think a bit more whenever > adding any code. Am I over-thinking this? I think so. It takes us a few seconds before we notice that the gpu died - only when the hangcheck timer expired at least twice we'll declare the hw wedged. Imo a few seconds is plenty of time to sneak in a few ringbuffer emissions (and hence tail pointer writes) ;-) So I think trying to avoid touching the gpu for another few usecs _is_ overthinking the problem, because fundamentally we can't avoid touching a dead gpu - we will only ever notice its demise after the fact. Yours, Daniel -- Daniel Vetter Mail: daniel@ffwll.ch Mobile: +41 (0)79 365 57 48