From mboxrd@z Thu Jan  1 00:00:00 1970
From: Chris Wilson <chris@chris-wilson.co.uk>
Subject: Re: [RFC] GPU reset notification interface
Date: Wed, 18 Jul 2012 08:06:10 +0100
Message-ID: <1342595191_10959@CP5-2952>
References: <5005E433.40200@freedesktop.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org>
Received: from fireflyinternet.com (smtp.fireflyinternet.com [109.228.6.236])
	by gabe.freedesktop.org (Postfix) with ESMTP id 252FDA0DE6
	for <intel-gfx@lists.freedesktop.org>;
	Wed, 18 Jul 2012 00:06:37 -0700 (PDT)
In-Reply-To: <5005E433.40200@freedesktop.org>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
Sender: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org
Errors-To: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org
To: Ian Romanick <idr@freedesktop.org>, "intel-gfx@lists.freedesktop.org" <intel-gfx@lists.freedesktop.org>
List-Id: intel-gfx@lists.freedesktop.org

On Tue, 17 Jul 2012 15:16:19 -0700, Ian Romanick <idr@freedesktop.org> wrote:
> I'm getting ready to implement the reset notification part of 
> GL_ARB_robustness in the i965 driver.  There are a bunch of quirky bits 
> of the extension that are causing some grief in designing the kernel / 
> user interface.  I think I've settled on an interface that should meet 
> all of the requirements, but I want to bounce it off people before I 
> start writing code.

What's the rationale for reset-in-progress? The actual delay between a
hang being noticed and the reset worker kicked off is going to be
very, very small -- in theory there is no window of opportunity for
another client to sneek and perform more work.

If we replace 'index' with breadcrumb, I'm much happier. Then you
have

struct drm_i915_reset_counts {
	u32 num_resets;
	u32 last_reset; /* using the seqno breadcrumb */
	u32 flags; /* 0x1 if in progress */

	u32 num_contexts;
	struct drm_i915_context_reset_counts contexts[0];
};

struct drm_i915_context_reset_counts {
	u32 ctx_id; /* per-fd namespace */

	u32 num_resets;
	u32 last_reset; // seqno

	u32 num_guilty;
	u32 last_guilty; // seqno
/*
	u32 num_innocent;
	u32 last_innocent;

	u32 num_unproven;
	u32 last_unproven;
*/
};;

So what do the terms mean as far as we are aware?

guilty - this context owned the batch we determine the GPU is executing

innocent - this context owns buffer that were on the active list at the
           time of execution and presumed trashed with the reset, i..e
           the context itself was on the active list.

unproven - a hang occurred in the driver without any indication of a
           guilty party (e.g. a semaphore hang due to bad olr), and
           this context was on the active list.

If we only increment the reset counter for a context if it is active at
the time of reset, then
  num_resets = num_guiltly + num_innocent + num_unproven;
and we can drop num_unproven. Since there is then no difference in
effect between unproven/innocence, we need only report the last time it
was found guilty and the total number of reset events involving this
context.

Similarly, the per-fd global counter is only incremented if one of its
contexts was active at the time of an event. The presumption being that
userspace does the broadcast across shared buffers/contexts to other
processes and updates the shared buffers as necessary.

We only keep 'monotonically' increasing state, so userspace is
responsible for computing deltas to catch events.

For the breadcrumb I'd just to use the seqno. It matters not to
userspace that the event breadcrumb jumps by more than one between
events, as it only sees a compressed view anyway (i.e. there may be
external resets that it knows nothing about).

The largest uncertainity for me is the guilty/innocence/unknown
counters and the value of providing both innocent and unproven. Then
there is the question whether we track/report resets that are nothing to
do with the fd. And whether contexts or buffers is more meaningful. From
a tracking perspective, contexts wins both for the kernel, and
presumably for userspace. And given a queriable last_write_seqno for
userspace, it can easily convert a context reset seqno into a trash
list.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre