Date: Tue, 28 Sep 2021 17:22:57 -0700
From: Matthew Brost
To: John Harrison
Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
 daniel.vetter@ffwll.ch, tony.ye@intel.com, zhengguo.xu@intel.com
Subject: Re: [Intel-gfx] [PATCH 23/27] drm/i915/guc: Implement no mid batch preemption for multi-lrc
Message-ID: <20210929002257.GA10045@jons-linux-dev-box>
References: <20210820224446.30620-1-matthew.brost@intel.com>
 <20210820224446.30620-24-matthew.brost@intel.com>
 <20210928223353.GA32806@jons-linux-dev-box>
 <2281cc6d-2e6e-f78c-a823-32f0a3d455ba@intel.com>
In-Reply-To: <2281cc6d-2e6e-f78c-a823-32f0a3d455ba@intel.com>

On Tue, Sep 28, 2021 at 04:33:24PM -0700, John Harrison wrote:
> On 9/28/2021 15:33, Matthew Brost wrote:
> > On Tue, Sep 28, 2021 at 03:20:42PM -0700, John Harrison wrote:
> > > On 8/20/2021 15:44, Matthew Brost wrote:
> > > > For some users of multi-lrc, e.g. split frame, it isn't safe to preempt
> > > > mid BB. To safely enable preemption at the BB boundary, a handshake
> > > > between the parent and child is needed. This is implemented via custom
> > > > emit_bb_start & emit_fini_breadcrumb functions and enabled via by
> > > via by -> by
> > >
> > > > default if a context is configured by set parallel extension.
> > > I tend to agree with Tvrtko that this should probably be an opt in change.
> > > Is there a flags word passed in when creating the context?
> > >
> >
> > I don't disagree but the uAPI in this series is where we landed. It has
> > been acked by all the relevant parties in the RFC, ported to our
> > internal tree, and the media UMD has been updated / posted. Concerns
> > with the uAPI should've been raised in the RFC phase, not now. I really
> > don't feel like changing this uAPI another time.
> The counter argument is that once a UAPI has been merged, it cannot be
> changed. Ever. So it is worth taking the trouble to get it right first
> time.
>
> The proposal isn't a major re-write of the interface. It is simply a
> request to set an extra flag when creating the context.
>

We are basically just talking about the polarity of a flag at this point.
Either by default you can't be preempted mid batch (current GPU / UMD
requirement) or by default you can be preempted mid batch (no current GPU /
UMD can do this yet but add flags that everyone opts into). I think
Daniel's opinion was just to default to what the current GPU / UMD wants
and if future requirements arise we add flags to the interface. I
understand both points of view for flag / no flag but in the end it
doesn't really matter. Either way the interface works now and will in the
future too.

> > > Also, it's not just a change in pre-emption behaviour but a change in
> > > synchronisation too, right? Previously, if you had a whole bunch of
> > > back to back submissions then one child could run ahead of another
> > > and/or the parent. After this change, there is a forced regroup at the
> > > end of each batch. So while one could end sooner/later than the
> > > others, they can't ever get an entire batch (or more) ahead or behind.
> > > Or was that synchronisation already in there through other means
> > > anyway?
> > >
> >
> > Yes, each parent / child sync at the end of each batch - this is the
> > only way to safely insert preemption points. Without this the GuC could
> > attempt a preemption and hang the batches.
> To be clear, I'm not saying that this is wrong. I'm just saying that this
> appears to be new behaviour with this patch but it is not explicitly
> called out in the description of the patch.
>

Will add some comments explaining this behavior (unless I already have
them).
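For reference on the flag-polarity question above, the opt-in John suggests
would be a small uAPI delta. A minimal sketch, assuming the
i915_context_engines_parallel_submit extension from this series and an
invented I915_PARALLEL_NO_PREEMPT_MID_BATCH flag name (not part of the
posted uAPI):

#include <drm/i915_drm.h>

/*
 * Sketch only: this flag does not exist in the posted series, which
 * instead makes no-preempt-mid-batch the default for parallel contexts.
 * It illustrates the opt-in polarity discussed above.
 */
#define I915_PARALLEL_NO_PREEMPT_MID_BATCH	(1ull << 0)

static struct i915_context_engines_parallel_submit pext = {
	.base.name = I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT,
	.engine_index = 0,	/* slot of the parallel engine */
	.width = 2,		/* parent + one child per submission */
	.num_siblings = 1,	/* no engine load balancing */
	/* Userspace explicitly opts in to BB-boundary-only preemption */
	.flags = I915_PARALLEL_NO_PREEMPT_MID_BATCH,
};

With the default flipped this way, contexts created without the flag would
keep today's mid-batch preemption behaviour and only opted-in contexts
would get the handshake breadcrumbs below.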
> > > >
> > > > Signed-off-by: Matthew Brost
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/intel_context.c       |   2 +-
> > > >  drivers/gpu/drm/i915/gt/intel_context_types.h |   3 +
> > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   2 +-
> > > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 283 +++++++++++++++++-
> > > >  4 files changed, 287 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > index 5615be32879c..2de62649e275 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context.c
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> > > > @@ -561,7 +561,7 @@ void intel_context_bind_parent_child(struct intel_context *parent,
> > > >  	GEM_BUG_ON(intel_context_is_child(child));
> > > >  	GEM_BUG_ON(intel_context_is_parent(child));
> > > >
> > > > -	parent->guc_number_children++;
> > > > +	child->guc_child_index = parent->guc_number_children++;
> > > >  	list_add_tail(&child->guc_child_link,
> > > >  		      &parent->guc_child_list);
> > > >  	child->parent = parent;
> > > > diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > index 713d85b0b364..727f91e7f7c2 100644
> > > > --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
> > > > @@ -246,6 +246,9 @@ struct intel_context {
> > > >  	/** @guc_number_children: number of children if parent */
> > > >  	u8 guc_number_children;
> > > >
> > > > +	/** @guc_child_index: index into guc_child_list if child */
> > > > +	u8 guc_child_index;
> > > > +
> > > >  	/**
> > > >  	 * @parent_page: page in context used by parent for work queue,
> > > >  	 * work queue descriptor
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > index 6cd26dc060d1..9f61cfa5566a 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
> > > > @@ -188,7 +188,7 @@ struct guc_process_desc {
> > > >  	u32 wq_status;
> > > >  	u32 engine_presence;
> > > >  	u32 priority;
> > > > -	u32 reserved[30];
> > > > +	u32 reserved[36];
> > > What is this extra space for? All the extra storage is grabbed from
> > > after the end of this structure, isn't it?
> > >
> >
> > This is the size of the process descriptor in the GuC spec. Even though
> > this is unused space we really don't want the child go / join memory to
> > use anything within the process descriptor.
> Okay. So it's more that the code was previously broken and we just hadn't
> hit a problem because of it? Again, worth adding a comment in the
> description to call it out as a bug fix.
>

Sure.

> > > >
> > > >  } __packed;
> > > >
> > > >  #define CONTEXT_REGISTRATION_FLAG_KMD BIT(0)
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > index 91330525330d..1a18f99bf12a 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > > @@ -11,6 +11,7 @@
> > > >  #include "gt/intel_context.h"
> > > >  #include "gt/intel_engine_pm.h"
> > > >  #include "gt/intel_engine_heartbeat.h"
> > > > +#include "gt/intel_gpu_commands.h"
> > > >  #include "gt/intel_gt.h"
> > > >  #include "gt/intel_gt_irq.h"
> > > >  #include "gt/intel_gt_pm.h"
> > > > @@ -366,10 +367,14 @@ static struct i915_priolist *to_priolist(struct rb_node *rb)
> > > >
> > > >  /*
> > > >   * When using multi-lrc submission an extra page in the context state is
> > > > - * reserved for the process descriptor and work queue.
> > > > + * reserved for the process descriptor, work queue, and preempt BB boundary
> > > > + * handshake between the parent + children contexts.
> > > >   *
> > > >   * The layout of this page is below:
> > > >   * 0						guc_process_desc
> > > > + * + sizeof(struct guc_process_desc)		child go
> > > > + * + CACHELINE_BYTES				child join ...
> > > > + * + CACHELINE_BYTES				...
> > > Would be better written as '[num_children]' instead of '...' to make it
> > > clear it is a per child array.
> > >
> >
> > I think this description is pretty clear.
> Evidently not because it confused me for a moment.
>

Ok, let me see if I can make this a bit more clear.

> > > Also, maybe create a struct for this to get rid of the magic '+1's and
> > > 'BYTES / sizeof' constructs in the functions below.
> > >
> >
> > Let me see if I can create a struct that describes the layout.
> That would definitely make the code a lot clearer.
>
> > > >   * ... unused
> > > >   * PAGE_SIZE / 2				work queue start
> > > >   * ...						work queue
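A minimal sketch of the kind of layout struct being suggested, assuming it
sits in intel_guc_submission.c where guc_process_desc, CACHELINE_BYTES and
MAX_ENGINE_INSTANCE are already in scope; the struct and field names here
are invented for illustration, not taken from the series:

/*
 * Hypothetical layout of the extra multi-lrc page described in the
 * comment above. Each semaphore gets its own cacheline so the parent's
 * go write and each child's join write do not false-share.
 */
struct parent_page_layout {
	/* Process descriptor occupies the start of the page */
	struct guc_process_desc pdesc;

	/* Parent -> children "go" semaphore, one cacheline */
	u32 go_semaphore[CACHELINE_BYTES / sizeof(u32)];

	/* Child[i] -> parent "join" semaphores, one cacheline each */
	u32 join_semaphore[MAX_ENGINE_INSTANCE][CACHELINE_BYTES / sizeof(u32)];

	/* Remainder of the first half of the page is unused */
	u8 unused[PAGE_SIZE / 2 - sizeof(struct guc_process_desc) -
		  (MAX_ENGINE_INSTANCE + 1) * CACHELINE_BYTES];

	/* Second half of the page holds the work queue */
	u32 wq[(PAGE_SIZE / 2) / sizeof(u32)];
};

With something like this, get_children_join_value() below could index
join_semaphore[child_index][0] directly and the magic '+1's and
'CACHELINE_BYTES / sizeof(u32)' arithmetic would fall away.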
> > > > @@ -1785,6 +1790,30 @@ static int deregister_context(struct intel_context *ce, u32 guc_id, bool loop)
> > > >  	return __guc_action_deregister_context(guc, guc_id, loop);
> > > >  }
> > > >
> > > > +static inline void clear_children_join_go_memory(struct intel_context *ce)
> > > > +{
> > > > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > > > +	u8 i;
> > > > +
> > > > +	for (i = 0; i < ce->guc_number_children + 1; ++i)
> > > > +		mem[i * (CACHELINE_BYTES / sizeof(u32))] = 0;
> > > > +}
> > > > +
> > > > +static inline u32 get_children_go_value(struct intel_context *ce)
> > > > +{
> > > > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > > > +
> > > > +	return mem[0];
> > > > +}
> > > > +
> > > > +static inline u32 get_children_join_value(struct intel_context *ce,
> > > > +					  u8 child_index)
> > > > +{
> > > > +	u32 *mem = (u32 *)(__get_process_desc(ce) + 1);
> > > > +
> > > > +	return mem[(child_index + 1) * (CACHELINE_BYTES / sizeof(u32))];
> > > > +}
> > > > +
> > > >  static void guc_context_policy_init(struct intel_engine_cs *engine,
> > > >  				    struct guc_lrc_desc *desc)
> > > >  {
> > > > @@ -1867,6 +1896,8 @@ static int guc_lrc_desc_pin(struct intel_context *ce, bool loop)
> > > >  		desc->context_flags = CONTEXT_REGISTRATION_FLAG_KMD;
> > > >  		guc_context_policy_init(engine, desc);
> > > >  	}
> > > > +
> > > > +	clear_children_join_go_memory(ce);
> > > >  }
> > > >
> > > >  /*
> > > > @@ -2943,6 +2974,31 @@ static const struct intel_context_ops virtual_child_context_ops = {
> > > >  	.get_sibling = guc_virtual_get_sibling,
> > > >  };
> > > >
> > > > +/*
> > > > + * The below override of the breadcrumbs is enabled when the user configures a
> > > > + * context for parallel submission (multi-lrc, parent-child).
> > > > + *
> > > > + * The overridden breadcrumbs implements an algorithm which allows the GuC to
> > > > + * safely preempt all the hw contexts configured for parallel submission
> > > > + * between each BB. The contract between the i915 and GuC is if the parent
> > > > + * context can be preempted, all the children can be preempted, and the GuC will
> > > > + * always try to preempt the parent before the children. A handshake between the
> > > > + * parent / children breadcrumbs ensures the i915 holds up its end of the deal
> > > > + * creating a window to preempt between each set of BBs.
> > > > + */
> > > > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						     u64 offset, u32 len,
> > > > +						     const unsigned int flags);
> > > > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						    u64 offset, u32 len,
> > > > +						    const unsigned int flags);
> > > > +static u32 *
> > > > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						 u32 *cs);
> > > > +static u32 *
> > > > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						u32 *cs);
> > > > +
> > > >  static struct intel_context *
> > > >  guc_create_parallel(struct intel_engine_cs **engines,
> > > >  		    unsigned int num_siblings,
> > > > @@ -2978,6 +3034,20 @@ guc_create_parallel(struct intel_engine_cs **engines,
> > > >  		}
> > > >  	}
> > > >
> > > > +	parent->engine->emit_bb_start =
> > > > +		emit_bb_start_parent_no_preempt_mid_batch;
> > > > +	parent->engine->emit_fini_breadcrumb =
> > > > +		emit_fini_breadcrumb_parent_no_preempt_mid_batch;
> > > > +	parent->engine->emit_fini_breadcrumb_dw =
> > > > +		12 + 4 * parent->guc_number_children;
> > > > +	for_each_child(parent, ce) {
> > > > +		ce->engine->emit_bb_start =
> > > > +			emit_bb_start_child_no_preempt_mid_batch;
> > > > +		ce->engine->emit_fini_breadcrumb =
> > > > +			emit_fini_breadcrumb_child_no_preempt_mid_batch;
> > > > +		ce->engine->emit_fini_breadcrumb_dw = 16;
> > > > +	}
> > > > +
> > > >  	kfree(siblings);
> > > >
> > > >  	return parent;
> > > > @@ -3362,6 +3432,204 @@ void intel_guc_submission_init_early(struct intel_guc *guc)
> > > >  	guc->submission_selected = __guc_submission_selected(guc);
> > > >  }
> > > >
> > > > +static inline u32 get_children_go_addr(struct intel_context *ce)
> > > > +{
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +
> > > > +	return i915_ggtt_offset(ce->state) +
> > > > +		__get_process_desc_offset(ce) +
> > > > +		sizeof(struct guc_process_desc);
> > > > +}
> > > > +
> > > > +static inline u32 get_children_join_addr(struct intel_context *ce,
> > > > +					 u8 child_index)
> > > > +{
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +
> > > > +	return get_children_go_addr(ce) + (child_index + 1) * CACHELINE_BYTES;
> > > > +}
> > > > +
> > > > +#define PARENT_GO_BB			1
> > > > +#define PARENT_GO_FINI_BREADCRUMB	0
> > > > +#define CHILD_GO_BB			1
> > > > +#define CHILD_GO_FINI_BREADCRUMB	0
> > > > +static int emit_bb_start_parent_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						     u64 offset, u32 len,
> > > > +						     const unsigned int flags)
> > > > +{
> > > > +	struct intel_context *ce = rq->context;
> > > > +	u32 *cs;
> > > > +	u8 i;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +
> > > > +	cs = intel_ring_begin(rq, 10 + 4 * ce->guc_number_children);
> > > > +	if (IS_ERR(cs))
> > > > +		return PTR_ERR(cs);
> > > > +
> > > > +	/* Wait on chidlren */
> > > chidlren -> children
> > >
> >
> > Yep.
> > > > +	for (i = 0; i < ce->guc_number_children; ++i) {
> > > > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > > > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > > > +			 MI_SEMAPHORE_POLL |
> > > > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > > > +		*cs++ = PARENT_GO_BB;
> > > > +		*cs++ = get_children_join_addr(ce, i);
> > > > +		*cs++ = 0;
> > > > +	}
> > > > +
> > > > +	/* Turn off preemption */
> > > > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	/* Tell children go */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  CHILD_GO_BB,
> > > > +				  get_children_go_addr(ce),
> > > > +				  0);
> > > > +
> > > > +	/* Jump to batch */
> > > > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > > > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > > > +	*cs++ = lower_32_bits(offset);
> > > > +	*cs++ = upper_32_bits(offset);
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	intel_ring_advance(rq, cs);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static int emit_bb_start_child_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						    u64 offset, u32 len,
> > > > +						    const unsigned int flags)
> > > > +{
> > > > +	struct intel_context *ce = rq->context;
> > > > +	u32 *cs;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > > > +
> > > > +	cs = intel_ring_begin(rq, 12);
> > > > +	if (IS_ERR(cs))
> > > > +		return PTR_ERR(cs);
> > > > +
> > > > +	/* Signal parent */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  PARENT_GO_BB,
> > > > +				  get_children_join_addr(ce->parent,
> > > > +							 ce->guc_child_index),
> > > > +				  0);
> > > > +
> > > > +	/* Wait parent on for go */
> > > parent on -> on parent
> > >
> >
> > Yep.
> > > > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > > > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > > > +		 MI_SEMAPHORE_POLL |
> > > > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > > > +	*cs++ = CHILD_GO_BB;
> > > > +	*cs++ = get_children_go_addr(ce->parent);
> > > > +	*cs++ = 0;
> > > > +
> > > > +	/* Turn off preemption */
> > > > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > > > +
> > > > +	/* Jump to batch */
> > > > +	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
> > > > +		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8));
> > > > +	*cs++ = lower_32_bits(offset);
> > > > +	*cs++ = upper_32_bits(offset);
> > > > +
> > > > +	intel_ring_advance(rq, cs);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static u32 *
> > > > +emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
> > > > +						 u32 *cs)
> > > > +{
> > > > +	struct intel_context *ce = rq->context;
> > > > +	u8 i;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_parent(ce));
> > > > +
> > > > +	/* Wait on children */
> > > > +	for (i = 0; i < ce->guc_number_children; ++i) {
> > > > +		*cs++ = (MI_SEMAPHORE_WAIT |
> > > > +			 MI_SEMAPHORE_GLOBAL_GTT |
> > > > +			 MI_SEMAPHORE_POLL |
> > > > +			 MI_SEMAPHORE_SAD_EQ_SDD);
> > > > +		*cs++ = PARENT_GO_FINI_BREADCRUMB;
> > > > +		*cs++ = get_children_join_addr(ce, i);
> > > > +		*cs++ = 0;
> > > > +	}
> > > > +
> > > > +	/* Turn on preemption */
> > > > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	/* Tell children go */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  CHILD_GO_FINI_BREADCRUMB,
> > > > +				  get_children_go_addr(ce),
> > > > +				  0);
> > > > +
> > > > +	/* Emit fini breadcrumb */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  rq->fence.seqno,
> > > > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > > > +				  0);
> > > > +
> > > > +	/* User interrupt */
> > > > +	*cs++ = MI_USER_INTERRUPT;
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	rq->tail = intel_ring_offset(rq, cs);
> > > > +
> > > > +	return cs;
> > > > +}
> > > > +
> > > > +static u32 *
> > > > +emit_fini_breadcrumb_child_no_preempt_mid_batch(struct i915_request *rq, u32 *cs)
> > > > +{
> > > > +	struct intel_context *ce = rq->context;
> > > > +
> > > > +	GEM_BUG_ON(!intel_context_is_child(ce));
> > > > +
> > > > +	/* Turn on preemption */
> > > > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	/* Signal parent */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  PARENT_GO_FINI_BREADCRUMB,
> > > > +				  get_children_join_addr(ce->parent,
> > > > +							 ce->guc_child_index),
> > > > +				  0);
> > > > +
> > > This is backwards compared to the parent?
> > >
> > > Parent: wait on children then enable pre-emption
> > > Child: enable pre-emption then signal parent
> > >
> > > Makes for a window where the parent is waiting in atomic context for a
> > > signal from a child that has been pre-empted and might not get to run
> > > for some time?
> > >
> >
> > No, this is correct. The rule is if the parent can be preempted all the
> > children can be preempted, thus we can't enable preemption on the parent
> > until all the children have preemption enabled, thus the parent waits
> > for all the children to join before enabling its preemption.
> >
> > Matt
> But,
>
> The write to PARENT_GO_FINI can't fail or stall, right? So if it happens
> before the ARB_ON then the child is guaranteed to execute the ARB_ON once
> it has signalled the parent. Indeed, by the time the parent context gets
> to see the updated memory value, the child is practically certain to have
> passed the ARB_ON. So, by the time the parent becomes pre-emptible, the
> children will all be pre-emptible. Even if the parent is superfast, the
> children are guaranteed to become pre-emptible immediately - certainly
> before any fail-to-preempt timeout could occur.
>
> Whereas, with the current ordering, it is possible for the child to be
> preempted before it has issued the signal to the parent. So now you have
> a non-preemptible parent hogging the hardware, waiting for a signal that
> isn't

To be clear, the parent is always preempted first by the GuC. The parent
can't be running if the child preempt is attempted.

> going to come for an entire execution quantum. Indeed, it is actually
> quite likely the child would be preempted before it can signal the parent
> because any pre-emption request that was issued at any time during the
> child's execution will take effect immediately on the ARB_ON instruction.
>

Looking at the code, I do think I have a bug though. I think I'm missing an
MI_ARB_CHECK in the parent after turning on preemption and before releasing
the children, right? This covers the case where the GuC issues a preemption
to the parent while it is waiting on the children: all the children join,
the parent turns on preemption and is preempted at the added MI_ARB_CHECK
instruction, and the children can all be preempted while waiting on the
parent go semaphore. Does that sound correct?

Matt

> John.
>
> > > John.
> > >
> > > >
> > > > +	/* Wait parent on for go */
> > > > +	*cs++ = (MI_SEMAPHORE_WAIT |
> > > > +		 MI_SEMAPHORE_GLOBAL_GTT |
> > > > +		 MI_SEMAPHORE_POLL |
> > > > +		 MI_SEMAPHORE_SAD_EQ_SDD);
> > > > +	*cs++ = CHILD_GO_FINI_BREADCRUMB;
> > > > +	*cs++ = get_children_go_addr(ce->parent);
> > > > +	*cs++ = 0;
> > > > +
> > > > +	/* Emit fini breadcrumb */
> > > > +	cs = gen8_emit_ggtt_write(cs,
> > > > +				  rq->fence.seqno,
> > > > +				  i915_request_active_timeline(rq)->hwsp_offset,
> > > > +				  0);
> > > > +
> > > > +	/* User interrupt */
> > > > +	*cs++ = MI_USER_INTERRUPT;
> > > > +	*cs++ = MI_NOOP;
> > > > +
> > > > +	rq->tail = intel_ring_offset(rq, cs);
> > > > +
> > > > +	return cs;
> > > > +}
> > > > +
> > > >  static struct intel_context *
> > > >  g2h_context_lookup(struct intel_guc *guc, u32 desc_idx)
> > > >  {
> > > > @@ -3807,6 +4075,19 @@ void intel_guc_submission_print_context_info(struct intel_guc *guc,
> > > >  		drm_printf(p, "\t\tWQI Status: %u\n\n",
> > > >  			   READ_ONCE(desc->wq_status));
> > > >
> > > > +		drm_printf(p, "\t\tNumber Children: %u\n\n",
> > > > +			   ce->guc_number_children);
> > > > +		if (ce->engine->emit_bb_start ==
> > > > +		    emit_bb_start_parent_no_preempt_mid_batch) {
> > > > +			u8 i;
> > > > +
> > > > +			drm_printf(p, "\t\tChildren Go: %u\n\n",
> > > > +				   get_children_go_value(ce));
> > > > +			for (i = 0; i < ce->guc_number_children; ++i)
> > > > +				drm_printf(p, "\t\tChildren Join: %u\n",
> > > > +					   get_children_join_value(ce, i));
> > > > +		}
> > > > +
> > > >  		for_each_child(ce, child)
> > > >  			guc_log_context(p, child);
> > > >  }
>
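To make the proposed fix concrete: a sketch of the parent fini breadcrumb
with the extra arbitration point, assuming the MI_ARB_CHECK simply replaces
the padding MI_NOOP after MI_ARB_ENABLE so emit_fini_breadcrumb_dw stays
unchanged. Illustration only, not the actual follow-up patch:

static u32 *
emit_fini_breadcrumb_parent_no_preempt_mid_batch(struct i915_request *rq,
						 u32 *cs)
{
	struct intel_context *ce = rq->context;
	u8 i;

	GEM_BUG_ON(!intel_context_is_parent(ce));

	/* Wait for all children to join (preemption still disabled) */
	for (i = 0; i < ce->guc_number_children; ++i) {
		*cs++ = (MI_SEMAPHORE_WAIT |
			 MI_SEMAPHORE_GLOBAL_GTT |
			 MI_SEMAPHORE_POLL |
			 MI_SEMAPHORE_SAD_EQ_SDD);
		*cs++ = PARENT_GO_FINI_BREADCRUMB;
		*cs++ = get_children_join_addr(ce, i);
		*cs++ = 0;
	}

	/* Turn on preemption */
	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;

	/*
	 * Proposed fix: an explicit arbitration point so a preemption
	 * requested while the parent was waiting on the children can
	 * take effect here, while the children are still parked on the
	 * preemptable CHILD_GO_FINI_BREADCRUMB semaphore wait.
	 */
	*cs++ = MI_ARB_CHECK;

	/* Tell children go */
	cs = gen8_emit_ggtt_write(cs,
				  CHILD_GO_FINI_BREADCRUMB,
				  get_children_go_addr(ce),
				  0);

	/* Emit fini breadcrumb */
	cs = gen8_emit_ggtt_write(cs,
				  rq->fence.seqno,
				  i915_request_active_timeline(rq)->hwsp_offset,
				  0);

	/* User interrupt */
	*cs++ = MI_USER_INTERRUPT;
	*cs++ = MI_NOOP;

	rq->tail = intel_ring_offset(rq, cs);

	return cs;
}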