From mboxrd@z Thu Jan 1 00:00:00 1970 From: Chris Wilson Subject: Re: [RFC] GPU reset notification interface Date: Wed, 18 Jul 2012 08:06:10 +0100 Message-ID: <1342595191_10959@CP5-2952> References: <5005E433.40200@freedesktop.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Received: from fireflyinternet.com (smtp.fireflyinternet.com [109.228.6.236]) by gabe.freedesktop.org (Postfix) with ESMTP id 252FDA0DE6 for ; Wed, 18 Jul 2012 00:06:37 -0700 (PDT) In-Reply-To: <5005E433.40200@freedesktop.org> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org Errors-To: intel-gfx-bounces+gcfxdi-intel-gfx=m.gmane.org@lists.freedesktop.org To: Ian Romanick , "intel-gfx@lists.freedesktop.org" List-Id: intel-gfx@lists.freedesktop.org On Tue, 17 Jul 2012 15:16:19 -0700, Ian Romanick wrote: > I'm getting ready to implement the reset notification part of > GL_ARB_robustness in the i965 driver. There are a bunch of quirky bits > of the extension that are causing some grief in designing the kernel / > user interface. I think I've settled on an interface that should meet > all of the requirements, but I want to bounce it off people before I > start writing code. What's the rationale for reset-in-progress? The actual delay between a hang being noticed and the reset worker kicked off is going to be very, very small -- in theory there is no window of opportunity for another client to sneek and perform more work. If we replace 'index' with breadcrumb, I'm much happier. Then you have struct drm_i915_reset_counts { u32 num_resets; u32 last_reset; /* using the seqno breadcrumb */ u32 flags; /* 0x1 if in progress */ u32 num_contexts; struct drm_i915_context_reset_counts contexts[0]; }; struct drm_i915_context_reset_counts { u32 ctx_id; /* per-fd namespace */ u32 num_resets; u32 last_reset; // seqno u32 num_guilty; u32 last_guilty; // seqno /* u32 num_innocent; u32 last_innocent; u32 num_unproven; u32 last_unproven; */ };; So what do the terms mean as far as we are aware? guilty - this context owned the batch we determine the GPU is executing innocent - this context owns buffer that were on the active list at the time of execution and presumed trashed with the reset, i..e the context itself was on the active list. unproven - a hang occurred in the driver without any indication of a guilty party (e.g. a semaphore hang due to bad olr), and this context was on the active list. If we only increment the reset counter for a context if it is active at the time of reset, then num_resets = num_guiltly + num_innocent + num_unproven; and we can drop num_unproven. Since there is then no difference in effect between unproven/innocence, we need only report the last time it was found guilty and the total number of reset events involving this context. Similarly, the per-fd global counter is only incremented if one of its contexts was active at the time of an event. The presumption being that userspace does the broadcast across shared buffers/contexts to other processes and updates the shared buffers as necessary. We only keep 'monotonically' increasing state, so userspace is responsible for computing deltas to catch events. For the breadcrumb I'd just to use the seqno. It matters not to userspace that the event breadcrumb jumps by more than one between events, as it only sees a compressed view anyway (i.e. there may be external resets that it knows nothing about). The largest uncertainity for me is the guilty/innocence/unknown counters and the value of providing both innocent and unproven. Then there is the question whether we track/report resets that are nothing to do with the fd. And whether contexts or buffers is more meaningful. From a tracking perspective, contexts wins both for the kernel, and presumably for userspace. And given a queriable last_write_seqno for userspace, it can easily convert a context reset seqno into a trash list. -Chris -- Chris Wilson, Intel Open Source Technology Centre