* [PATCH diagnostic qspinlock] Diagnostics for excessive lock-drop wait loop time
@ 2023-01-12  0:36 Paul E. McKenney
  2023-01-12 20:51 ` Jonas Oberhauser
From: Paul E. McKenney @ 2023-01-12  0:36 UTC (permalink / raw)
  To: riel, davej; +Cc: linux-kernel, kernel-team

We see systems stuck in the queued_spin_lock_slowpath() loop that waits
for the lock to become unlocked in the case where the current CPU has
set pending state.  Therefore, this not-for-mainline commit gives a warning
that includes the lock word state if the loop has been spinning for more
than 10 seconds.  It also adds a WARN_ON_ONCE() that complains if the
lock is not in pending state.

If this is to be placed in production, some reporting mechanism not
involving spinlocks is likely needed, for example, BPF, trace events,
or some combination thereof.
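
As a rough illustration only (nothing below is part of the patch), such a
lockless reporting path could take the form of a trace event; the event
name qspinlock_pending_stall and its fields are made up for this sketch:

/*
 * Hypothetical sketch of a trace-events-based report.  The event name,
 * fields, and file layout are illustrative only and are not proposed
 * for mainline.
 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM qspinlock

#if !defined(_TRACE_QSPINLOCK_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_QSPINLOCK_H

#include <linux/tracepoint.h>

TRACE_EVENT(qspinlock_pending_stall,

	TP_PROTO(u32 val, unsigned long waited_jiffies),

	TP_ARGS(val, waited_jiffies),

	TP_STRUCT__entry(
		__field(u32, val)
		__field(unsigned long, waited_jiffies)
	),

	TP_fast_assign(
		__entry->val = val;
		__entry->waited_jiffies = waited_jiffies;
	),

	TP_printk("lock word %#x after %lu jiffies in pending-wait loop",
		  __entry->val, __entry->waited_jiffies)
);

#endif /* _TRACE_QSPINLOCK_H */

/* This part must be outside protection */
#include <trace/define_trace.h>

The wait loop would then call trace_qspinlock_pending_stall() with the
current lock word and the time spent spinning, and the event could be read
from tracefs or attached to with BPF without the reporting path itself
taking any locks.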

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ac5a3e6d3b564..be1440782c4b3 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -379,8 +379,22 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * clear_pending_set_locked() implementations imply full
 	 * barriers.
 	 */
-	if (val & _Q_LOCKED_MASK)
-		atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
+	if (val & _Q_LOCKED_MASK) {
+		int cnt = _Q_PENDING_LOOPS;
+		unsigned long j = jiffies + 10 * HZ;
+		struct qspinlock qval;
+		int val;
+
+		for (;;) {
+			val = atomic_read_acquire(&lock->val);
+			atomic_set(&qval.val, val);
+			WARN_ON_ONCE(!(val & _Q_PENDING_VAL));
+			if (!(val & _Q_LOCKED_MASK))
+				break;
+			if (!--cnt && !WARN(time_after(jiffies, j), "%s: Still pending and locked: %#x (%c%c%#x)\n", __func__, val, ".L"[!!qval.locked], ".P"[!!qval.pending], qval.tail))
+				cnt = _Q_PENDING_LOOPS;
+		}
+	}
 
 	/*
 	 * take ownership and clear the pending bit.

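For readers decoding the "(%c%c%#x)" annotation in the WARN() above: it
prints 'L' when the locked byte is nonzero, 'P' when the pending byte is
nonzero, followed by the tail field.  A stand-alone userspace sketch of
the same decoding, assuming the common NR_CPUS < 16K qspinlock layout
(locked in bits 0-7, pending in bits 8-15, tail in bits 16-31):

/*
 * Stand-alone sketch, not kernel code: format a raw qspinlock word the
 * same way the WARN() above does.  The bit layout is assumed here, not
 * pulled from a kernel header.
 */
#include <stdio.h>
#include <stdint.h>

static void decode_qspinlock_val(uint32_t val)
{
	unsigned int locked  = val & 0xff;		/* _Q_LOCKED_MASK */
	unsigned int pending = (val >> 8) & 0xff;	/* pending byte */
	unsigned int tail    = val >> 16;		/* MCS tail (idx + CPU) */

	printf("%#x (%c%c%#x)\n", (unsigned int)val,
	       ".L"[!!locked], ".P"[!!pending], tail);
}

int main(void)
{
	decode_qspinlock_val(0x00000101);	/* locked + pending, no queued waiter */
	decode_qspinlock_val(0x00040101);	/* locked + pending, with a queued waiter */
	return 0;
}

A lock word that has been "still pending and locked" for ten seconds should
always show both 'L' and 'P'; a clear pending byte (which the WARN_ON_ONCE()
above also catches) would instead suggest the lock word was modified
unexpectedly, which is one of the possibilities this diagnostic is meant to
expose.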

* RE: [PATCH diagnostic qspinlock] Diagnostics for excessive lock-drop wait loop time
  2023-01-12  0:36 [PATCH diagnostic qspinlock] Diagnostics for excessive lock-drop wait loop time Paul E. McKenney
@ 2023-01-12 20:51 ` Jonas Oberhauser
  2023-01-12 23:49   ` Paul E. McKenney
From: Jonas Oberhauser @ 2023-01-12 20:51 UTC (permalink / raw)
  To: paulmck, riel, davej; +Cc: linux-kernel, kernel-team

Hi Paul,

-----Original Message-----
From: Paul E. McKenney [mailto:paulmck@kernel.org] 
> We see systems stuck in the queued_spin_lock_slowpath() loop that waits for the lock to become unlocked in the case where the current CPU has set pending state.

Interesting!
Do you know if the hangs started with a recent patch? What codepaths are active (virtualization/arch/...)? Does it happen extremely rarely? Do you have any additional information?

I saw a similar situation a few years ago in a proprietary kernel, but it only happened once ever and I gave up on looking for the reason after a few days (including some time combing through the compiler-generated assembler).

Have fun,
jonas


* Re: [PATCH diagnostic qspinlock] Diagnostics for excessive lock-drop wait loop time
  2023-01-12 20:51 ` Jonas Oberhauser
@ 2023-01-12 23:49   ` Paul E. McKenney
From: Paul E. McKenney @ 2023-01-12 23:49 UTC (permalink / raw)
  To: Jonas Oberhauser; +Cc: riel, davej, linux-kernel, kernel-team

On Thu, Jan 12, 2023 at 08:51:04PM +0000, Jonas Oberhauser wrote:
> Hi Paul,
> 
> -----Original Message-----
> From: Paul E. McKenney [mailto:paulmck@kernel.org] 
> > We see systems stuck in the queued_spin_lock_slowpath() loop that waits for the lock to become unlocked in the case where the current CPU has set pending state.
> 
> Interesting!
> Do you know if the hangs started with a recent patch? What codepaths are active (virtualization/arch/...)? Does it happen extremely rarely? Do you have any additional information?

As best we can tell right now, we see it about three times per day per
million x86 systems running v5.12 plus backports.  It is
entirely possible that it is a hardware/firmware problem, but normally
that would cause the failure to cluster on a specific piece of hardware
or specific type of hardware, and we are not seeing that.

But we are in very early days investigating this.  In particular,
everything in the previous paragraph is subject to change.  For example,
we have not yet eliminated the possibility that the lockword is being
corrupted by unrelated kernel software, which is part of the motivation
for the patch in my earlier email.

> I saw a similar situation a few years ago in a proprietary kernel, but it only happened once ever and I gave up on looking for the reason after a few days (including some time combing through the compiler-generated assembler).

If it makes you feel better, yesterday I was sure that I had found the
bug by inspection.  But no, just confusion on my part!  ;-)

But thank you very much for the possible corroborating information.
You never know!

							Thanx, Paul
