From mboxrd@z Thu Jan  1 00:00:00 1970
From: Chris Wilson <chris@chris-wilson.co.uk>
Subject: Re: [PATCH 5/7] drm/i915: queue hangcheck on reset
Date: Tue, 16 Jul 2013 10:16:32 +0100
Message-ID: <20130716091632.GL2823@cantiga.alporthouse.com>
References: <1372861332-6308-1-git-send-email-mika.kuoppala@intel.com>
	<1372861332-6308-6-git-send-email-mika.kuoppala@intel.com>
	<20130716084929.GP5784@phenom.ffwll.local>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
In-Reply-To: <20130716084929.GP5784@phenom.ffwll.local>
To: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
List-Id: intel-gfx@lists.freedesktop.org

On Tue, Jul 16, 2013 at 10:49:29AM +0200, Daniel Vetter wrote:
> On Wed, Jul 03, 2013 at 05:22:10PM +0300, Mika Kuoppala wrote:
> > From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > 
> > Upon resetting the GPU, we begin processing batches once more, so
> > reset the hangcheck timer.
> > 
> > v2: kicking inside reset instead of hangcheck_elapsed and
> >     sane commit message by Chris Wilson
> > 
> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_irq.c |    2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > index b0fec7f..1b0e903 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1452,6 +1452,8 @@ static void i915_error_work_func(struct work_struct *work)
> >  
> >  			kobject_uevent_env(&dev->primary->kdev.kobj,
> >  					   KOBJ_CHANGE, reset_done_event);
> > +
> > +			i915_queue_hangcheck(dev);
> 
> Hm, what exactly is this for? After reset we don't have any batches
> running right now (since we reset all batches), so I don't understand why
> we need this. And the commit message also doesn't give a reason.

Because our code is a little snafu after reset. We do still have batches
queued, but the rings are reset incorrectly. Instead, each ring should just
be advanced past the failed batch and processing restarted. If we want to
get extremely fancy, we can no-op out any remaining requests from the hung
!default context to prevent incorrect state leakage. This also likely
explains why we end up with active bos stuck after becoming wedged.
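
To make the intended recovery flow concrete, here is a rough sketch in plain
userspace C. It is a toy model only: toy_request, toy_ring, toy_ring_recover,
DEFAULT_CTX and the hangcheck_armed flag are all made-up names for
illustration, not the real i915 structures or functions, and it assumes the
first incomplete request on the ring is the one that hung.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_CTX 0	/* hypothetical id for the default context */

/* Toy stand-in for a request queued on a ring (not an i915 structure). */
struct toy_request {
	int ctx_id;	/* context that submitted the batch */
	bool completed;	/* finished before the hang */
	bool skipped;	/* turned into a no-op during recovery */
};

struct toy_ring {
	struct toy_request *reqs;
	size_t count;
	size_t head;		/* next request to execute */
	bool hangcheck_armed;	/* stands in for queueing hangcheck */
};

/*
 * Recover a ring after a hang: step past the guilty request, no-op later
 * requests from the same non-default context, and re-arm hangcheck if
 * anything is still queued. Returns the number of requests that will
 * still do real work.
 */
static size_t toy_ring_recover(struct toy_ring *ring)
{
	size_t i, runnable = 0;
	int hung_ctx;

	assert(ring != NULL);

	/* Retire everything that completed before the hang. */
	while (ring->head < ring->count && ring->reqs[ring->head].completed)
		ring->head++;

	if (ring->head == ring->count) {
		ring->hangcheck_armed = false;
		return 0;	/* nothing was in flight */
	}

	/* Assumption: the first incomplete request is the one that hung. */
	hung_ctx = ring->reqs[ring->head].ctx_id;
	ring->reqs[ring->head].skipped = true;
	ring->head++;

	for (i = ring->head; i < ring->count; i++) {
		struct toy_request *rq = &ring->reqs[i];

		/* Optional: drop follow-up work from the hung !default context. */
		if (hung_ctx != DEFAULT_CTX && rq->ctx_id == hung_ctx)
			rq->skipped = true;
		else
			runnable++;
	}

	/* The ring is about to be processed again, so watch it again. */
	ring->hangcheck_armed = ring->head < ring->count;
	return runnable;
}
```

The point of the last assignment is the same as the hunk under discussion:
once work remains queued behind the failed batch, the hang detector has to be
running again, otherwise a second hang in that leftover work goes unnoticed.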
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre