From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <stable-owner@vger.kernel.org>
Received: from mail.fireflyinternet.com ([87.106.93.118]:61854 "EHLO
	fireflyinternet.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org
	with ESMTP id S1752213AbcAMUGc (ORCPT
	<rfc822;stable@vger.kernel.org>); Wed, 13 Jan 2016 15:06:32 -0500
Date: Wed, 13 Jan 2016 20:06:17 +0000
From: Chris Wilson <chris@chris-wilson.co.uk>
To: Dave Gordon <david.s.gordon@intel.com>
Cc: intel-gfx@lists.freedesktop.org,
	Daniel Vetter <daniel.vetter@ffwll.ch>, stable@vger.kernel.org
Subject: Re: [Intel-gfx] [PATCH 004/190] drm/i915: Fix some invalid requests
 cancellations
Message-ID: <20160113200617.GE29867@nuc-i3427.alporthouse.com>
References: <1452503961-14837-1-git-send-email-chris@chris-wilson.co.uk>
 <1452503961-14837-4-git-send-email-chris@chris-wilson.co.uk>
 <569542F5.7040704@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <569542F5.7040704@intel.com>
Sender: stable-owner@vger.kernel.org
List-ID: <stable.vger.kernel.org>

On Tue, Jan 12, 2016 at 06:16:21PM +0000, Dave Gordon wrote:
> On 11/01/16 09:16, Chris Wilson wrote:
> >As we add the VMA to the request early, it may be cancelled during
> >execbuf reservation. This will leave the context object pointing to a
> >dangling request; i915_wait_request() simply skips the wait and so we
> >may unbind the object whilst it is still active.
> 
> I don't understand this; context objects don't point to requests,
> it's vice versa. The request has a pointer to the engine, and to the
> context, and adds +1 to the refcount on the latter.

The context object has an active reference to prevent it being freed
whilst still in use. That is combined with the pin on last_context until
after the switch is complete to prevent the hardware writing to stale
pages. The golden render state also uses the active reference to prevent
being reaped whilst it is still in the ringbuffer, again we can return
those pages back to the system and then execute them on the GPU.

> >However, if at any point we make a change to the hardware (and equally
> >importantly our bookkeeping in the driver), we cannot cancel the request
> >as what has already been written must be submitted. Submitting a partial
> >request is far easier than trying to unwind the incomplete change.
> 
> What change could be made to the hardware? Engine reset, perhaps,
> but even that doesn't necessarily invalidate a request in
> preparation for sending to the hardware.

Sloppy terminology, but we don't unwind the request when we cancel it,
so the command we have written into the ringbuffer persist and will be
executed by the next request. We would also need to unwind the changes
we made to our state tracker.
 
> Submitting a partial change seems likely to leave the h/w in an
> undefined state. Cancelling a request is (or ought to be) trivial;
> just reset the ringbuffer's tail pointer to where it was when the
> request was allocated. The engine doesn't read anything past the
> tail of the previous request until the new request is submitted.

The opposite, submitting the changes that we have made so far ensures
that the hardware state matches our bookkeeping.
 
> >Unfortunately this patch undoes the excess breadcrumb usage that olr
> >prevented, e.g. if we interrupt batchbuffer submission then we submit
> >the requests along with the memory writes and interrupt (even though we
> >do no real work). Disassociating requests from breadcrumbs (and
> >semaphores) is a topic for a past/future series, but now much more
> >important.
> 
> Another incomprehensible comment? OLR went away a long time ago now.

OLR was introduced to avoid exactly this scenario - where we would emit
commands into the ring and then have to cancel the operation. Rather
than emit the breadcrumb and complete the seqno/request, we kept the
request open and reused it for the newcomer.

> And AFAIK batchbuffer submission cannot be interrupted by anything
> (or more accurately, anything that interrupts submission won't be
> able to write to the ringbuffer or submit new work).

You mean after the request is constructed. Whilst the request is being
constructed it is very easily to interrupt, and apparently fragile.

> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >Cc: stable@vger.kernel.org

Do I need to mention that igt finds these bugs?

> >---
> >  drivers/gpu/drm/i915/i915_drv.h            |  1 -
> >  drivers/gpu/drm/i915/i915_gem.c            |  7 ++-----
> >  drivers/gpu/drm/i915/i915_gem_context.c    | 21 +++++++++------------
> >  drivers/gpu/drm/i915/i915_gem_execbuffer.c | 16 +++++-----------
> >  drivers/gpu/drm/i915/intel_display.c       |  2 +-
> >  drivers/gpu/drm/i915/intel_lrc.c           |  1 -
> >  6 files changed, 17 insertions(+), 31 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> >index 747d2d84a18c..ec20814adb0c 100644
> >--- a/drivers/gpu/drm/i915/i915_drv.h
> >+++ b/drivers/gpu/drm/i915/i915_drv.h
> >@@ -2813,7 +2813,6 @@ int i915_gem_sw_finish_ioctl(struct drm_device *dev, void *data,
> >  			     struct drm_file *file_priv);
> >  void i915_gem_execbuffer_move_to_active(struct list_head *vmas,
> >  					struct drm_i915_gem_request *req);
> >-void i915_gem_execbuffer_retire_commands(struct i915_execbuffer_params *params);
> >  int i915_gem_ringbuffer_submission(struct i915_execbuffer_params *params,
> >  				   struct drm_i915_gem_execbuffer2 *args,
> >  				   struct list_head *vmas);
> >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >index 3ab529669448..fd24877eb0a0 100644
> >--- a/drivers/gpu/drm/i915/i915_gem.c
> >+++ b/drivers/gpu/drm/i915/i915_gem.c
> >@@ -3384,12 +3384,9 @@ int i915_gpu_idle(struct drm_device *dev)
> >  				return ret;
> >
> >  			ret = i915_switch_context(req);
> >-			if (ret) {
> >-				i915_gem_request_cancel(req);
> >-				return ret;
> >-			}
> >-
> >  			i915_add_request_no_flush(req);
> >+			if (ret)
> >+				return ret;
> 
> This seems like a bad idea. Looking at how we could get here (i.e.
> how could switch_context() return an error), we see things such as
> "failed to pin" or i915_gem_object_get_pages() failed.

The point is that we can hit the error path after having done part of
the request setup and have objects pointing to the request. It cannot be
cancelled at this point.
 
> With no real idea of what the GPU and/or CPU address spaces contain
> at this point, it seems unwise to charge ahead regardless.

? We know the state. We know that objects are using this request to
track their changes. Throwing those away means that we throw away
storage being used by the HW.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre