From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Matthew Brost <matthew.brost@intel.com>,
	intel-gfx@lists.freedesktop.org,
	 dri-devel@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH 4/7] drm/i915/guc: Don't hog IRQs when destroying contexts
Date: Fri, 17 Dec 2021 11:14:07 +0000
Message-ID: <35bc4a2a-9a50-9651-5c17-65f788817f64@linux.intel.com>
In-Reply-To: <7cc85926-75e8-0368-1684-62ae5f341807@linux.intel.com>


On 17/12/2021 11:06, Tvrtko Ursulin wrote:
> On 14/12/2021 17:04, Matthew Brost wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> While attempting to debug a CT deadlock issue in various CI failures
>> (most easily reproduced with gem_ctx_create/basic-files), I was seeing
>> CPU deadlock errors being reported. This was because the context
>> destroy loop was blocking waiting on H2G space from inside an IRQ
>> spinlock. There was no deadlock as such, it's just that the H2G queue
>> was full of context destroy commands and GuC was taking a long time to
>> process them. However, the kernel was seeing the large amount of time
>> spent inside the IRQ lock as a dead CPU. Various Bad Things(tm) would
>> then happen (heartbeat failures, CT deadlock errors, outstanding H2G
>> WARNs, etc.).
>>
>> Re-working the loop to only acquire the spinlock around the list
>> management (which is all it is meant to protect) rather than the
>> entire destroy operation seems to fix all the above issues.
>>
>> v2:
>>   (John Harrison)
>>    - Fix typo in comment message
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> Reviewed-by: Matthew Brost <matthew.brost@intel.com>
>> ---
>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 45 ++++++++++++-------
>>   1 file changed, 28 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 36c2965db49b..96fcf869e3ff 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -2644,7 +2644,6 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>>       unsigned long flags;
>>       bool disabled;
>> -    lockdep_assert_held(&guc->submission_state.lock);
>>       GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
>>       GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
>>       GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));
>> @@ -2660,7 +2659,7 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>>       }
>>       spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>>       if (unlikely(disabled)) {
>> -        __release_guc_id(guc, ce);
>> +        release_guc_id(guc, ce);
>>           __guc_context_destroy(ce);
>>           return;
>>       }
>> @@ -2694,36 +2693,48 @@ static void __guc_context_destroy(struct intel_context *ce)
>>   static void guc_flush_destroyed_contexts(struct intel_guc *guc)
>>   {
>> -    struct intel_context *ce, *cn;
>> +    struct intel_context *ce;
>>       unsigned long flags;
>>       GEM_BUG_ON(!submission_disabled(guc) &&
>>              guc_submission_initialized(guc));
>> -    spin_lock_irqsave(&guc->submission_state.lock, flags);
>> -    list_for_each_entry_safe(ce, cn,
>> -                 &guc->submission_state.destroyed_contexts,
>> -                 destroyed_link) {
>> -        list_del_init(&ce->destroyed_link);
>> -        __release_guc_id(guc, ce);
>> +    while (!list_empty(&guc->submission_state.destroyed_contexts)) {
> 
> Are lockless false negatives a concern here? I mean this thread not 
> seeing that something just got added to the list.
> 
>> +        spin_lock_irqsave(&guc->submission_state.lock, flags);
>> +        ce = list_first_entry_or_null(&guc->submission_state.destroyed_contexts,
>> +                          struct intel_context,
>> +                          destroyed_link);
>> +        if (ce)
>> +            list_del_init(&ce->destroyed_link);
>> +        spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>> +
>> +        if (!ce)
>> +            break;
>> +
>> +        release_guc_id(guc, ce);
> 
> This looks suboptimal and in conflict with this part of the commit message:
> 
> """
>   Re-working the loop to only acquire the spinlock around the list
>   management (which is all it is meant to protect) rather than the
>   entire destroy operation seems to fix all the above issues.
> """
> 
> Because you end up doing:
> 
> ... loop ...
>    spin_lock_irqsave(&guc->submission_state.lock, flags);
>    list_del_init(&ce->destroyed_link);
>    spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> 
>    release_guc_id, which calls:
>      spin_lock_irqsave(&guc->submission_state.lock, flags);
>      __release_guc_id(guc, ce);
>      spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> 
> So a) the lock seems to be protecting more than just list management, or 
> release_guc_id is wrong, and b) the loop ends up with highly 
> questionable hammering on the lock.
> 
> Is there any point to this part of the patch? Or is the only business 
> end of the patch below:
> 
>>           __guc_context_destroy(ce);
>>       }
>> -    spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>   }
>>   static void deregister_destroyed_contexts(struct intel_guc *guc)
>>   {
>> -    struct intel_context *ce, *cn;
>> +    struct intel_context *ce;
>>       unsigned long flags;
>> -    spin_lock_irqsave(&guc->submission_state.lock, flags);
>> -    list_for_each_entry_safe(ce, cn,
>> -                 &guc->submission_state.destroyed_contexts,
>> -                 destroyed_link) {
>> -        list_del_init(&ce->destroyed_link);
>> +    while (!list_empty(&guc->submission_state.destroyed_contexts)) {
>> +        spin_lock_irqsave(&guc->submission_state.lock, flags);
>> +        ce = list_first_entry_or_null(&guc->submission_state.destroyed_contexts,
>> +                          struct intel_context,
>> +                          destroyed_link);
>> +        if (ce)
>> +            list_del_init(&ce->destroyed_link);
>> +        spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>> +
>> +        if (!ce)
>> +            break;
>> +
>>           guc_lrc_desc_unpin(ce);
> 
> Here?
> 
> Not wanting/needing to nest ce->guc_state.lock under 
> guc->submission_state.lock, and call the CPU cycle expensive 
> deregister_context?
> 
> 1)
> Could you unlink en masse, under the assumption destroyed contexts are 
> not reachable from anywhere else at this point, so under a single lock 
> hold?
> 
> 2)
> But then you also end up with guc_lrc_desc_unpin calling 
> __release_guc_id, which when called by release_guc_id does take 
> guc->submission_state.lock and here it does not. Is it then clear which 
> operations inside __release_guc_id need the lock? Bitmap or IDA?

Ah no, with the 2nd point I missed that you changed guc_lrc_desc_unpin to 
call release_guc_id.

The question on the merit of the change in guc_flush_destroyed_contexts 
remains, and also whether at both places you could do a group unlink (one 
lock hold), put the contexts on a private list, and then unpin/deregister.

Regards,

Tvrtko
