linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/9] Lock ordering documentation and annotation for lockdep
@ 2020-03-13 17:46 Sebastian Andrzej Siewior
  2020-03-13 17:46 ` [PATCH 1/9] Documentation: Add lock ordering and nesting documentation Sebastian Andrzej Siewior
                   ` (8 more replies)
  0 siblings, 9 replies; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-13 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner

The first patch introduces the long asked locking documentation
regarding spinlock_t vs raw_spinlock_t (and other locking primitives).
Followed by updates to obey the rules including a change to completions
to use struct swait_queue_head instead of wait_queue_head_t.

It ends with lockdep updates to recognize the "bad" lock nesting. This
new lockdep feature is activated by CONFIG_PROVE_RAW_LOCK_NESTING. Once
enabled it will report for instance printk, the memory allocator or the
workqueue implementation. This is known and it is worked on.

Sebastian



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/9] Documentation: Add lock ordering and nesting documentation
  2020-03-13 17:46 [PATCH 0/9] Lock ordering documentation and annotation for lockdep Sebastian Andrzej Siewior
@ 2020-03-13 17:46 ` Sebastian Andrzej Siewior
  2020-03-14 22:57   ` Randy Dunlap
  2020-03-13 17:46 ` [PATCH 2/9] timekeeping: Split jiffies seqlock Sebastian Andrzej Siewior
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-13 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Sebastian Andrzej Siewior

From: Thomas Gleixner <tglx@linutronix.de>

The kernel provides a variety of locking primitives. The nesting of these
lock types and the implications of them on RT enabled kernels is nowhere
documented.

Add initial documentation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 Documentation/locking/index.rst     |   1 +
 Documentation/locking/locktypes.rst | 298 ++++++++++++++++++++++++++++
 2 files changed, 299 insertions(+)
 create mode 100644 Documentation/locking/locktypes.rst

diff --git a/Documentation/locking/index.rst b/Documentation/locking/index.rst
index 626a463f7e42e..5d6800a723dc6 100644
--- a/Documentation/locking/index.rst
+++ b/Documentation/locking/index.rst
@@ -7,6 +7,7 @@ locking
 .. toctree::
     :maxdepth: 1
 
+    locktypes
     lockdep-design
     lockstat
     locktorture
diff --git a/Documentation/locking/locktypes.rst b/Documentation/locking/locktypes.rst
new file mode 100644
index 0000000000000..d4c3f2094ad20
--- /dev/null
+++ b/Documentation/locking/locktypes.rst
@@ -0,0 +1,298 @@
+.. _kernel_hacking_locktypes:
+
+==========================
+Lock types and their rules
+==========================
+
+Introduction
+============
+
+The kernel provides a variety of locking primitives which can be divided
+into two categories:
+
+ - Sleeping locks
+ - Spinning locks
+
+This document describes the lock types at least at the conceptual level and
+provides rules for nesting of lock types also under the aspect of PREEMPT_RT.
+
+Lock categories
+===============
+
+Sleeping locks
+--------------
+
+Sleeping locks can only be acquired in preemptible task context.
+
+Some of the implementations allow try_lock() attempts from other contexts,
+but that has to be really evaluated carefully including the question
+whether the unlock can be done from that context safely as well.
+
+Note, that some lock types change their implementation details when
+debugging is enabled, so this should be really only considered if there is
+no other option.
+
+Sleeping lock types:
+
+ - mutex
+ - rt_mutex
+ - semaphore
+ - rw_semaphore
+ - ww_mutex
+ - percpu_rw_semaphore
+
+On a PREEMPT_RT enabled kernel the following lock types are converted to
+sleeping locks:
+
+ - spinlock_t
+ - rwlock_t
+
+Spinning locks
+--------------
+
+ - raw_spinlock_t
+ - bit spinlocks
+
+On a non PREEMPT_RT enabled kernel the following lock types are spinning
+locks as well:
+
+ - spinlock_t
+ - rwlock_t
+
+Spinning locks implicitly disable preemption and the lock / unlock functions
+can have suffixes which apply further protections:
+
+ ===================  ====================================================
+ _bh()                Disable / enable bottom halfs (soft interrupts)
+ _irq()               Disable / enable interrupts
+ _irqsave/restore()   Save and disable / restore interrupt disabled state
+ ===================  ====================================================
+
+
+rtmutex
+=======
+
+RT-mutexes are mutexes with support for priority inheritance (PI).
+
+PI has limitations on non PREEMPT_RT enabled kernels due to preemption and
+interrupt disabled sections.
+
+On a PREEMPT_RT enabled kernel most of these sections are fully
+preemptible. This is possible because PREEMPT_RT forces most executions
+into task context, especially interrupt handlers and soft interrupts, which
+allows to substitute spinlock_t and rwlock_t with RT-mutex based
+implementations.
+
+
+raw_spinlock_t and spinlock_t
+=============================
+
+raw_spinlock_t
+--------------
+
+raw_spinlock_t is a strict spinning lock implementation regardless of the
+kernel configuration including PREEMPT_RT enabled kernels.
+
+raw_spinlock_t is to be used only in real critical core code, low level
+interrupt handling and places where protecting (hardware) state is required
+to be safe against preemption and eventually interrupts.
+
+Another reason to use raw_spinlock_t is when the critical section is tiny
+to avoid the overhead of spinlock_t on a PREEMPT_RT enabled kernel in the
+contended case.
+
+spinlock_t
+----------
+
+The semantics of spinlock_t change with the state of CONFIG_PREEMPT_RT.
+
+On a non PREEMPT_RT enabled kernel spinlock_t is mapped to raw_spinlock_t
+and has exactly the same semantics.
+
+spinlock_t and PREEMPT_RT
+-------------------------
+
+On a PREEMPT_RT enabled kernel spinlock_t is mapped to a separate
+implementation based on rt_mutex which changes the semantics:
+
+ - Preemption is not disabled
+
+ - The hard interrupt related suffixes for spin_lock / spin_unlock
+   operations (_irq, _irqsave / _irqrestore) do not affect the CPUs
+   interrupt disabled state
+
+ - The soft interrupt related suffix (_bh()) is still disabling the
+   execution of soft interrupts, but contrary to a non PREEMPT_RT enabled
+   kernel, which utilizes the preemption count, this is achieved by a per
+   CPU bottom half locking mechanism.
+
+All other semantics of spinlock_t are preserved:
+
+ - Migration of tasks which hold a spinlock_t is prevented. On a non
+   PREEMPT_RT enabled kernel this is implicit due to preemption disable.
+   PREEMPT_RT has a separate mechanism to achieve this. This ensures that
+   pointers to per CPU variables stay valid even if the task is preempted.
+
+ - Task state preservation. The task state is not affected when a lock is
+   contended and the task has to schedule out and wait for the lock to
+   become available. The lock wake up restores the task state unless there
+   was a regular (not lock related) wake up on the task. This ensures that
+   the task state rules are always correct independent of the kernel
+   configuration.
+
+rwlock_t
+========
+
+rwlock_t is a multiple readers and single writers lock mechanism.
+
+On a non PREEMPT_RT enabled kernel rwlock_t is implemented as a spinning
+lock and the suffix rules of spinlock_t apply accordingly. The
+implementation is fair and prevents writer starvation.
+
+rwlock_t and PREEMPT_RT
+-----------------------
+
+On a PREEMPT_RT enabled kernel rwlock_t is mapped to a separate
+implementation based on rt_mutex which changes the semantics:
+
+ - Same changes as for spinlock_t
+
+ - The implementation is not fair and can cause writer starvation under
+   certain circumstances. The reason for this is that a writer cannot
+   inherit its priority to multiple readers. Readers which are blocked
+   on a writer fully support the priority inheritance protocol.
+
+
+PREEMPT_RT caveats
+==================
+
+spinlock_t and rwlock_t
+-----------------------
+
+The substitution of spinlock_t and rwlock_t on PREEMPT_RT enabled kernels
+with RT-mutex based implementations has a few implications.
+
+On a non PREEMPT_RT enabled kernel the following code construct is
+perfectly fine::
+
+   local_irq_disable();
+   spin_lock(&lock);
+
+and fully equivalent to::
+
+   spin_lock_irq(&lock);
+
+Same applies to rwlock_t and the _irqsave() suffix variant.
+
+On a PREEMPT_RT enabled kernel this breaks because the RT-mutex
+substitution expects a fully preemptible context.
+
+The preferred solution is to use :c:func:`spin_lock_irq()` or
+:c:func:`spin_lock_irqsave()` and their unlock counterparts.
+
+PREEMPT_RT also offers a local_lock mechanism to substitute the
+local_irq_disable/save() constructs in cases where a separation of the
+interrupt disabling and the locking is really unavoidable. This should be
+restricted to very rare cases.
+
+
+raw_spinlock_t
+--------------
+
+As raw_spinlock_t locking disables preemption and eventually interrupts the
+code inside the critical region has to be careful to avoid calls into code
+which takes regular spinlock_t or rwlock_t. A prime example is memory
+allocation.
+
+On a non PREEMPT_RT enabled kernel the following code construct is
+perfectly fine code::
+
+  raw_spin_lock(&lock);
+  p = kmalloc(sizeof(*p), GFP_ATOMIC);
+
+On a PREEMPT_RT enabled kernel this breaks because the memory allocator is
+fully preemptible and therefore does not support allocations from truly
+atomic contexts.
+
+Contrary to that the following code construct is perfectly fine on
+PREEMPT_RT as spin_lock() does not disable preemption::
+
+  spin_lock(&lock);
+  p = kmalloc(sizeof(*p), GFP_ATOMIC);
+
+Most places which use GFP_ATOMIC allocations are safe on PREEMPT_RT as the
+execution is forced into thread context and the lock substitution is
+ensuring preemptability.
+
+
+bit spinlocks
+-------------
+
+Bit spinlocks are problematic for PREEMPT_RT as they cannot be easily
+substituted by a RT-mutex based implementation for obvious reasons.
+
+The semantics of bit spinlocks are preserved on a PREEMPT_RT enabled kernel
+and the caveats vs. raw_spinlock_t apply.
+
+Some bit spinlocks are substituted by regular spinlock_t for PREEMPT_RT but
+this requires conditional (#ifdef'ed) code changes at the usage side while
+the spinlock_t substitution is simply done by the compiler and the
+conditionals are restricted to header files and core implementation of the
+locking primitives and the usage sites do not require any changes.
+
+
+Lock type nesting rules
+=======================
+
+The most basic rules are:
+
+  - Lock types of the same lock category (sleeping, spinning) can nest
+    arbitrarily as long as they respect the general lock ordering rules to
+    prevent deadlocks.
+
+  - Sleeping lock types cannot nest inside spinning lock types.
+
+  - Spinning lock types can nest inside sleeping lock types.
+
+These rules apply in general independent of CONFIG_PREEMPT_RT.
+
+As PREEMPT_RT changes the lock category of spinlock_t and rwlock_t from
+spinning to sleeping this has obviously restrictions how they can nest with
+raw_spinlock_t.
+
+This results in the following nest ordering:
+
+  1) Sleeping locks
+  2) spinlock_t and rwlock_t
+  3) raw_spinlock_t and bit spinlocks
+
+Lockdep is aware of these constraints to ensure that they are respected.
+
+
+Owner semantics
+===============
+
+Most lock types in the Linux kernel have strict owner semantics, i.e. the
+context (task) which acquires a lock has to release it.
+
+There are two exceptions:
+
+  - semaphores
+  - rwsem
+
+semaphores have no strict owner semantics for historical reasons. They are
+often used for both serialization and waiting purposes. That's generally
+discouraged and should be replaced by separate serialization and wait
+mechanisms.
+
+rwsem have grown interfaces which allow non owner release for special
+purposes. This usage is problematic on PREEMPT_RT because PREEMPT_RT
+substitutes all locking primitives except semaphores with RT-mutex based
+implementation to provide priority inheritance for all lock types except
+the truly spinning ones. Priority inheritance on ownerless locks is
+obviously impossible.
+
+For now the rwsem non-owner release excludes code which utilizes it from
+being used on PREEMPT_RT enabled kernels. In same cases this can be
+mitigated by disabling portions of the code, in other cases the complete
+functionality has to be disabled until a workable solution has been found.
-- 
2.25.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 2/9] timekeeping: Split jiffies seqlock
  2020-03-13 17:46 [PATCH 0/9] Lock ordering documentation and annotation for lockdep Sebastian Andrzej Siewior
  2020-03-13 17:46 ` [PATCH 1/9] Documentation: Add lock ordering and nesting documentation Sebastian Andrzej Siewior
@ 2020-03-13 17:46 ` Sebastian Andrzej Siewior
  2020-03-13 17:46 ` [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll Sebastian Andrzej Siewior
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-13 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Sebastian Andrzej Siewior

From: Thomas Gleixner <tglx@linutronix.de>

seqlock consists of a sequence counter and a spinlock_t which is used to
serialize the writers. spinlock_t is substituted by a "sleeping" spinlock
on PREEMPT_RT enabled kernels which breaks the usage in the timekeeping
code as the writers are executed in hard interrupt and therefore
non-preemptible context even on PREEMPT_RT.

The spinlock in seqlock cannot be unconditionally replaced by a
raw_spinlock_t as many seqlock users have nesting spinlock sections or
other code which is not suitable to run in truly atomic context on RT.

Instead of providing a raw_seqlock API for a single use case, open code the
seqlock for the jiffies use case and implement it with a raw_spinlock_t and
a sequence counter.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/time/jiffies.c     |  7 ++++---
 kernel/time/tick-common.c | 10 ++++++----
 kernel/time/tick-sched.c  | 19 ++++++++++++-------
 kernel/time/timekeeping.c |  6 ++++--
 kernel/time/timekeeping.h |  3 ++-
 5 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/kernel/time/jiffies.c b/kernel/time/jiffies.c
index d23b434c2ca7b..eddcf49704445 100644
--- a/kernel/time/jiffies.c
+++ b/kernel/time/jiffies.c
@@ -58,7 +58,8 @@ static struct clocksource clocksource_jiffies = {
 	.max_cycles	= 10,
 };
 
-__cacheline_aligned_in_smp DEFINE_SEQLOCK(jiffies_lock);
+__cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(jiffies_lock);
+__cacheline_aligned_in_smp seqcount_t jiffies_seq;
 
 #if (BITS_PER_LONG < 64)
 u64 get_jiffies_64(void)
@@ -67,9 +68,9 @@ u64 get_jiffies_64(void)
 	u64 ret;
 
 	do {
-		seq = read_seqbegin(&jiffies_lock);
+		seq = read_seqcount_begin(&jiffies_seq);
 		ret = jiffies_64;
-	} while (read_seqretry(&jiffies_lock, seq));
+	} while (read_seqcount_retry(&jiffies_seq, seq));
 	return ret;
 }
 EXPORT_SYMBOL(get_jiffies_64);
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 7e5d3524e924d..6c9c342dd0e53 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -84,13 +84,15 @@ int tick_is_oneshot_available(void)
 static void tick_periodic(int cpu)
 {
 	if (tick_do_timer_cpu == cpu) {
-		write_seqlock(&jiffies_lock);
+		raw_spin_lock(&jiffies_lock);
+		write_seqcount_begin(&jiffies_seq);
 
 		/* Keep track of the next tick event */
 		tick_next_period = ktime_add(tick_next_period, tick_period);
 
 		do_timer(1);
-		write_sequnlock(&jiffies_lock);
+		write_seqcount_end(&jiffies_seq);
+		raw_spin_unlock(&jiffies_lock);
 		update_wall_time();
 	}
 
@@ -162,9 +164,9 @@ void tick_setup_periodic(struct clock_event_device *dev, int broadcast)
 		ktime_t next;
 
 		do {
-			seq = read_seqbegin(&jiffies_lock);
+			seq = read_seqcount_begin(&jiffies_seq);
 			next = tick_next_period;
-		} while (read_seqretry(&jiffies_lock, seq));
+		} while (read_seqcount_retry(&jiffies_seq, seq));
 
 		clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
 
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a792d21cac645..4be756b88a48e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -65,7 +65,8 @@ static void tick_do_update_jiffies64(ktime_t now)
 		return;
 
 	/* Reevaluate with jiffies_lock held */
-	write_seqlock(&jiffies_lock);
+	raw_spin_lock(&jiffies_lock);
+	write_seqcount_begin(&jiffies_seq);
 
 	delta = ktime_sub(now, last_jiffies_update);
 	if (delta >= tick_period) {
@@ -91,10 +92,12 @@ static void tick_do_update_jiffies64(ktime_t now)
 		/* Keep the tick_next_period variable up to date */
 		tick_next_period = ktime_add(last_jiffies_update, tick_period);
 	} else {
-		write_sequnlock(&jiffies_lock);
+		write_seqcount_end(&jiffies_seq);
+		raw_spin_unlock(&jiffies_lock);
 		return;
 	}
-	write_sequnlock(&jiffies_lock);
+	write_seqcount_end(&jiffies_seq);
+	raw_spin_unlock(&jiffies_lock);
 	update_wall_time();
 }
 
@@ -105,12 +108,14 @@ static ktime_t tick_init_jiffy_update(void)
 {
 	ktime_t period;
 
-	write_seqlock(&jiffies_lock);
+	raw_spin_lock(&jiffies_lock);
+	write_seqcount_begin(&jiffies_seq);
 	/* Did we start the jiffies update yet ? */
 	if (last_jiffies_update == 0)
 		last_jiffies_update = tick_next_period;
 	period = last_jiffies_update;
-	write_sequnlock(&jiffies_lock);
+	write_seqcount_end(&jiffies_seq);
+	raw_spin_unlock(&jiffies_lock);
 	return period;
 }
 
@@ -676,10 +681,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 
 	/* Read jiffies and the time when jiffies were updated last */
 	do {
-		seq = read_seqbegin(&jiffies_lock);
+		seq = read_seqcount_begin(&jiffies_seq);
 		basemono = last_jiffies_update;
 		basejiff = jiffies;
-	} while (read_seqretry(&jiffies_lock, seq));
+	} while (read_seqcount_retry(&jiffies_seq, seq));
 	ts->last_jiffies = basejiff;
 	ts->timer_expires_base = basemono;
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index ca69290bee2a3..856280d2cbd4c 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2397,8 +2397,10 @@ EXPORT_SYMBOL(hardpps);
  */
 void xtime_update(unsigned long ticks)
 {
-	write_seqlock(&jiffies_lock);
+	raw_spin_lock(&jiffies_lock);
+	write_seqcount_begin(&jiffies_seq);
 	do_timer(ticks);
-	write_sequnlock(&jiffies_lock);
+	write_seqcount_end(&jiffies_seq);
+	raw_spin_unlock(&jiffies_lock);
 	update_wall_time();
 }
diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h
index 141ab3ab0354f..099737f6f10c7 100644
--- a/kernel/time/timekeeping.h
+++ b/kernel/time/timekeeping.h
@@ -25,7 +25,8 @@ static inline void sched_clock_resume(void) { }
 extern void do_timer(unsigned long ticks);
 extern void update_wall_time(void);
 
-extern seqlock_t jiffies_lock;
+extern raw_spinlock_t jiffies_lock;
+extern seqcount_t jiffies_seq;
 
 #define CS_NAME_LEN	32
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll
  2020-03-13 17:46 [PATCH 0/9] Lock ordering documentation and annotation for lockdep Sebastian Andrzej Siewior
  2020-03-13 17:46 ` [PATCH 1/9] Documentation: Add lock ordering and nesting documentation Sebastian Andrzej Siewior
  2020-03-13 17:46 ` [PATCH 2/9] timekeeping: Split jiffies seqlock Sebastian Andrzej Siewior
@ 2020-03-13 17:46 ` Sebastian Andrzej Siewior
  2020-03-13 18:11   ` Logan Gunthorpe
  2020-03-13 19:31   ` Peter Zijlstra
  2020-03-13 17:46 ` [PATCH 4/9] sched/swait: Prepare usage in completions Sebastian Andrzej Siewior
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-13 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Sebastian Andrzej Siewior, Kurt Schwemmer, Logan Gunthorpe,
	Bjorn Helgaas, linux-pci

The poll callback is abusing the completion wait queue and sticks it into
poll_wait() to wake up pollers after a command has completed.

First of all it's a layering violation as it imposes restrictions on the
inner workings of completions. Just because C allows to do so does not
justify that in any way. The proper way to do such things is to post
patches which extend the core infrastructure and not by silently abusing
it.

Aside of that the implementation is seriously broken:

 1) It cannot work with EPOLLEXCLUSIVE

 2) It's racy:

  poll()	      	  	 write()
   switchtec_dev_poll()		   switchtec_dev_write()
    poll_wait(&s->comp.wait);        mrpc_queue_cmd()
    				       init_completion(&s->comp)
					 init_waitqueue_head(&s->comp.wait)

Replace it with a regular wait queue which removes the completion abuse and
cures #1 and #2 above.

Cc: Kurt Schwemmer <kurt.schwemmer@microsemi.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/pci/switch/switchtec.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/switch/switchtec.c b/drivers/pci/switch/switchtec.c
index a823b4b8ef8a9..e69cac84b605f 100644
--- a/drivers/pci/switch/switchtec.c
+++ b/drivers/pci/switch/switchtec.c
@@ -52,10 +52,11 @@ struct switchtec_user {
 
 	enum mrpc_state state;
 
-	struct completion comp;
+	wait_queue_head_t cmd_comp;
 	struct kref kref;
 	struct list_head list;
 
+	bool cmd_done;
 	u32 cmd;
 	u32 status;
 	u32 return_code;
@@ -77,7 +78,7 @@ static struct switchtec_user *stuser_create(struct switchtec_dev *stdev)
 	stuser->stdev = stdev;
 	kref_init(&stuser->kref);
 	INIT_LIST_HEAD(&stuser->list);
-	init_completion(&stuser->comp);
+	init_waitqueue_head(&stuser->cmd_comp);
 	stuser->event_cnt = atomic_read(&stdev->event_cnt);
 
 	dev_dbg(&stdev->dev, "%s: %p\n", __func__, stuser);
@@ -175,7 +176,7 @@ static int mrpc_queue_cmd(struct switchtec_user *stuser)
 	kref_get(&stuser->kref);
 	stuser->read_len = sizeof(stuser->data);
 	stuser_set_state(stuser, MRPC_QUEUED);
-	init_completion(&stuser->comp);
+	stuser->cmd_done = false;
 	list_add_tail(&stuser->list, &stdev->mrpc_queue);
 
 	mrpc_cmd_submit(stdev);
@@ -222,7 +223,8 @@ static void mrpc_complete_cmd(struct switchtec_dev *stdev)
 		memcpy_fromio(stuser->data, &stdev->mmio_mrpc->output_data,
 			      stuser->read_len);
 out:
-	complete_all(&stuser->comp);
+	stuser->cmd_done = true;
+	wake_up_interruptible(&stuser->cmd_comp);
 	list_del_init(&stuser->list);
 	stuser_put(stuser);
 	stdev->mrpc_busy = 0;
@@ -529,10 +531,11 @@ static ssize_t switchtec_dev_read(struct file *filp, char __user *data,
 	mutex_unlock(&stdev->mrpc_mutex);
 
 	if (filp->f_flags & O_NONBLOCK) {
-		if (!try_wait_for_completion(&stuser->comp))
+		if (!stuser->cmd_done)
 			return -EAGAIN;
 	} else {
-		rc = wait_for_completion_interruptible(&stuser->comp);
+		rc = wait_event_interruptible(stuser->cmd_comp,
+					      stuser->cmd_done);
 		if (rc < 0)
 			return rc;
 	}
@@ -580,7 +583,7 @@ static __poll_t switchtec_dev_poll(struct file *filp, poll_table *wait)
 	struct switchtec_dev *stdev = stuser->stdev;
 	__poll_t ret = 0;
 
-	poll_wait(filp, &stuser->comp.wait, wait);
+	poll_wait(filp, &stuser->cmd_comp, wait);
 	poll_wait(filp, &stdev->event_wq, wait);
 
 	if (lock_mutex_and_test_alive(stdev))
@@ -588,7 +591,7 @@ static __poll_t switchtec_dev_poll(struct file *filp, poll_table *wait)
 
 	mutex_unlock(&stdev->mrpc_mutex);
 
-	if (try_wait_for_completion(&stuser->comp))
+	if (stuser->cmd_done)
 		ret |= EPOLLIN | EPOLLRDNORM;
 
 	if (stuser->event_cnt != atomic_read(&stdev->event_cnt))
@@ -1272,7 +1275,8 @@ static void stdev_kill(struct switchtec_dev *stdev)
 
 	/* Wake up and kill any users waiting on an MRPC request */
 	list_for_each_entry_safe(stuser, tmpuser, &stdev->mrpc_queue, list) {
-		complete_all(&stuser->comp);
+		stuser->cmd_done = true;
+		wake_up_interruptible(&stuser->cmd_comp);
 		list_del_init(&stuser->list);
 		stuser_put(stuser);
 	}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 4/9] sched/swait: Prepare usage in completions
  2020-03-13 17:46 [PATCH 0/9] Lock ordering documentation and annotation for lockdep Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2020-03-13 17:46 ` [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll Sebastian Andrzej Siewior
@ 2020-03-13 17:46 ` Sebastian Andrzej Siewior
  2020-03-14  0:26   ` Thomas Gleixner
  2020-03-13 17:46 ` [PATCH 5/9] completion: Use simple wait queues Sebastian Andrzej Siewior
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-13 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Sebastian Andrzej Siewior

From: Thomas Gleixner <tglx@linutronix.de>

As a preparation to use simple wait queues for completions:

  - Provide swake_up_all_locked() to support complete_all()
  - Make __prepare_to_swait() public available

This is done to enable the usage of complete() within truly atomic contexts
on a PREEMPT_RT enabled kernel.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/sched/sched.h | 3 +++
 kernel/sched/swait.c | 8 +++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9ea647835fd6f..fdc77e7963242 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2492,3 +2492,6 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 	return true;
 }
 #endif
+
+void swake_up_all_locked(struct swait_queue_head *q);
+void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index e83a3f8449f65..c528f238c68d6 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -32,6 +32,12 @@ void swake_up_locked(struct swait_queue_head *q)
 }
 EXPORT_SYMBOL(swake_up_locked);
 
+void swake_up_all_locked(struct swait_queue_head *q)
+{
+	while (!list_empty(&q->task_list))
+		swake_up_locked(q);
+}
+
 void swake_up_one(struct swait_queue_head *q)
 {
 	unsigned long flags;
@@ -69,7 +75,7 @@ void swake_up_all(struct swait_queue_head *q)
 }
 EXPORT_SYMBOL(swake_up_all);
 
-static void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait)
+void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait)
 {
 	wait->task = current;
 	if (list_empty(&wait->task_list))
-- 
2.25.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 5/9] completion: Use simple wait queues
  2020-03-13 17:46 [PATCH 0/9] Lock ordering documentation and annotation for lockdep Sebastian Andrzej Siewior
                   ` (3 preceding siblings ...)
  2020-03-13 17:46 ` [PATCH 4/9] sched/swait: Prepare usage in completions Sebastian Andrzej Siewior
@ 2020-03-13 17:46 ` Sebastian Andrzej Siewior
  2020-03-14 15:40   ` Linus Torvalds
  2020-03-13 17:46 ` [PATCH 6/9] lockdep: Introduce wait-type checks Sebastian Andrzej Siewior
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-13 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Sebastian Andrzej Siewior

From: Thomas Gleixner <tglx@linutronix.de>

completion uses a wait_queue_head_t to enqueue waiters.

wait_queue_head_t contains a spinlock_t to protect the list of waiters
which excludes it from being used in truly atomic context on a PREEMPT_RT
enabled kernel.

The spinlock in the wait queue head cannot be replaced by a raw_spinlock
because:

  - wait queues can have custom wakeup callbacks, which acquire other
    spinlock_t locks and have potentially long execution times

  - wake_up() walks an unbounded number of list entries during the wake up
    and may wake an unbounded number of waiters.

For simplicity and performance reasons complete() should be usable on
PREEMPT_RT enabled kernels.

completions do not use custom wakeup callbacks and are usually single
waiter, except for a few corner cases.

Replace the wait queue in the completion with a simple wait queue (swait),
which uses a raw_spinlock_t for protecting the waiter list and therefore is
safe to use inside truly atomic regions on PREEMPT_RT.

complete_all() might cause unbound latencies with a large number of waiters
being woken at once, but most complete_all() usage sites are either in
testing or initialization code or have only a really small number of
concurrent waiters which for now does not cause a latency problem. Keep it
simple for now.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 arch/powerpc/platforms/ps3/device-init.c      |  4 +--
 .../wireless/intersil/orinoco/orinoco_usb.c   |  4 +--
 drivers/usb/gadget/function/f_fs.c            |  2 +-
 drivers/usb/gadget/legacy/inode.c             |  4 +--
 include/linux/completion.h                    |  8 ++---
 kernel/sched/completion.c                     | 36 ++++++++++---------
 6 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/platforms/ps3/device-init.c b/arch/powerpc/platforms/ps3/device-init.c
index 2735ec90414dc..359231cea8cd7 100644
--- a/arch/powerpc/platforms/ps3/device-init.c
+++ b/arch/powerpc/platforms/ps3/device-init.c
@@ -738,8 +738,8 @@ static int ps3_notification_read_write(struct ps3_notification_device *dev,
 	}
 	pr_debug("%s:%u: notification %s issued\n", __func__, __LINE__, op);
 
-	res = wait_event_interruptible(dev->done.wait,
-				       dev->done.done || kthread_should_stop());
+	res = swait_event_interruptible_exclusive(dev->done.wait,
+						  dev->done.done || kthread_should_stop());
 	if (kthread_should_stop())
 		res = -EINTR;
 	if (res) {
diff --git a/drivers/net/wireless/intersil/orinoco/orinoco_usb.c b/drivers/net/wireless/intersil/orinoco/orinoco_usb.c
index e753f43e0162f..7e53a0ba57762 100644
--- a/drivers/net/wireless/intersil/orinoco/orinoco_usb.c
+++ b/drivers/net/wireless/intersil/orinoco/orinoco_usb.c
@@ -693,8 +693,8 @@ static void ezusb_req_ctx_wait(struct ezusb_priv *upriv,
 			while (!ctx->done.done && msecs--)
 				udelay(1000);
 		} else {
-			wait_event_interruptible(ctx->done.wait,
-						 ctx->done.done);
+			swait_event_interruptible_exclusive(ctx->done.wait,
+							    ctx->done.done);
 		}
 		break;
 	default:
diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index 571917677d358..234177dd1ed57 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -1703,7 +1703,7 @@ static void ffs_data_put(struct ffs_data *ffs)
 		pr_info("%s(): freeing\n", __func__);
 		ffs_data_clear(ffs);
 		BUG_ON(waitqueue_active(&ffs->ev.waitq) ||
-		       waitqueue_active(&ffs->ep0req_completion.wait) ||
+		       swait_active(&ffs->ep0req_completion.wait) ||
 		       waitqueue_active(&ffs->wait));
 		destroy_workqueue(ffs->io_completion_wq);
 		kfree(ffs->dev_name);
diff --git a/drivers/usb/gadget/legacy/inode.c b/drivers/usb/gadget/legacy/inode.c
index b47938dff1a22..e5141c1136e47 100644
--- a/drivers/usb/gadget/legacy/inode.c
+++ b/drivers/usb/gadget/legacy/inode.c
@@ -344,7 +344,7 @@ ep_io (struct ep_data *epdata, void *buf, unsigned len)
 	spin_unlock_irq (&epdata->dev->lock);
 
 	if (likely (value == 0)) {
-		value = wait_event_interruptible (done.wait, done.done);
+		value = swait_event_interruptible_exclusive(done.wait, done.done);
 		if (value != 0) {
 			spin_lock_irq (&epdata->dev->lock);
 			if (likely (epdata->ep != NULL)) {
@@ -353,7 +353,7 @@ ep_io (struct ep_data *epdata, void *buf, unsigned len)
 				usb_ep_dequeue (epdata->ep, epdata->req);
 				spin_unlock_irq (&epdata->dev->lock);
 
-				wait_event (done.wait, done.done);
+				swait_event_exclusive(done.wait, done.done);
 				if (epdata->status == -ECONNRESET)
 					epdata->status = -EINTR;
 			} else {
diff --git a/include/linux/completion.h b/include/linux/completion.h
index 519e94915d185..bf8e77001f18f 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -9,7 +9,7 @@
  * See kernel/sched/completion.c for details.
  */
 
-#include <linux/wait.h>
+#include <linux/swait.h>
 
 /*
  * struct completion - structure used to maintain state for a "completion"
@@ -25,7 +25,7 @@
  */
 struct completion {
 	unsigned int done;
-	wait_queue_head_t wait;
+	struct swait_queue_head wait;
 };
 
 #define init_completion_map(x, m) __init_completion(x)
@@ -34,7 +34,7 @@ static inline void complete_acquire(struct completion *x) {}
 static inline void complete_release(struct completion *x) {}
 
 #define COMPLETION_INITIALIZER(work) \
-	{ 0, __WAIT_QUEUE_HEAD_INITIALIZER((work).wait) }
+	{ 0, __SWAIT_QUEUE_HEAD_INITIALIZER((work).wait) }
 
 #define COMPLETION_INITIALIZER_ONSTACK_MAP(work, map) \
 	(*({ init_completion_map(&(work), &(map)); &(work); }))
@@ -85,7 +85,7 @@ static inline void complete_release(struct completion *x) {}
 static inline void __init_completion(struct completion *x)
 {
 	x->done = 0;
-	init_waitqueue_head(&x->wait);
+	init_swait_queue_head(&x->wait);
 }
 
 /**
diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index a1ad5b7d5521b..f15e96164ff1e 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -29,12 +29,12 @@ void complete(struct completion *x)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&x->wait.lock, flags);
+	raw_spin_lock_irqsave(&x->wait.lock, flags);
 
 	if (x->done != UINT_MAX)
 		x->done++;
-	__wake_up_locked(&x->wait, TASK_NORMAL, 1);
-	spin_unlock_irqrestore(&x->wait.lock, flags);
+	swake_up_locked(&x->wait);
+	raw_spin_unlock_irqrestore(&x->wait.lock, flags);
 }
 EXPORT_SYMBOL(complete);
 
@@ -58,10 +58,12 @@ void complete_all(struct completion *x)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&x->wait.lock, flags);
+	WARN_ON(irqs_disabled());
+
+	raw_spin_lock_irqsave(&x->wait.lock, flags);
 	x->done = UINT_MAX;
-	__wake_up_locked(&x->wait, TASK_NORMAL, 0);
-	spin_unlock_irqrestore(&x->wait.lock, flags);
+	swake_up_all_locked(&x->wait);
+	raw_spin_unlock_irqrestore(&x->wait.lock, flags);
 }
 EXPORT_SYMBOL(complete_all);
 
@@ -70,20 +72,20 @@ do_wait_for_common(struct completion *x,
 		   long (*action)(long), long timeout, int state)
 {
 	if (!x->done) {
-		DECLARE_WAITQUEUE(wait, current);
+		DECLARE_SWAITQUEUE(wait);
 
-		__add_wait_queue_entry_tail_exclusive(&x->wait, &wait);
 		do {
 			if (signal_pending_state(state, current)) {
 				timeout = -ERESTARTSYS;
 				break;
 			}
+			__prepare_to_swait(&x->wait, &wait);
 			__set_current_state(state);
-			spin_unlock_irq(&x->wait.lock);
+			raw_spin_unlock_irq(&x->wait.lock);
 			timeout = action(timeout);
-			spin_lock_irq(&x->wait.lock);
+			raw_spin_lock_irq(&x->wait.lock);
 		} while (!x->done && timeout);
-		__remove_wait_queue(&x->wait, &wait);
+		__finish_swait(&x->wait, &wait);
 		if (!x->done)
 			return timeout;
 	}
@@ -100,9 +102,9 @@ __wait_for_common(struct completion *x,
 
 	complete_acquire(x);
 
-	spin_lock_irq(&x->wait.lock);
+	raw_spin_lock_irq(&x->wait.lock);
 	timeout = do_wait_for_common(x, action, timeout, state);
-	spin_unlock_irq(&x->wait.lock);
+	raw_spin_unlock_irq(&x->wait.lock);
 
 	complete_release(x);
 
@@ -291,12 +293,12 @@ bool try_wait_for_completion(struct completion *x)
 	if (!READ_ONCE(x->done))
 		return false;
 
-	spin_lock_irqsave(&x->wait.lock, flags);
+	raw_spin_lock_irqsave(&x->wait.lock, flags);
 	if (!x->done)
 		ret = false;
 	else if (x->done != UINT_MAX)
 		x->done--;
-	spin_unlock_irqrestore(&x->wait.lock, flags);
+	raw_spin_unlock_irqrestore(&x->wait.lock, flags);
 	return ret;
 }
 EXPORT_SYMBOL(try_wait_for_completion);
@@ -322,8 +324,8 @@ bool completion_done(struct completion *x)
 	 * otherwise we can end up freeing the completion before complete()
 	 * is done referencing it.
 	 */
-	spin_lock_irqsave(&x->wait.lock, flags);
-	spin_unlock_irqrestore(&x->wait.lock, flags);
+	raw_spin_lock_irqsave(&x->wait.lock, flags);
+	raw_spin_unlock_irqrestore(&x->wait.lock, flags);
 	return true;
 }
 EXPORT_SYMBOL(completion_done);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 6/9] lockdep: Introduce wait-type checks
  2020-03-13 17:46 [PATCH 0/9] Lock ordering documentation and annotation for lockdep Sebastian Andrzej Siewior
                   ` (4 preceding siblings ...)
  2020-03-13 17:46 ` [PATCH 5/9] completion: Use simple wait queues Sebastian Andrzej Siewior
@ 2020-03-13 17:46 ` Sebastian Andrzej Siewior
  2020-03-31 13:25   ` Geert Uytterhoeven
  2020-03-13 17:46 ` [PATCH 7/9] lockdep: Add hrtimer context tracing bits Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-13 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Sebastian Andrzej Siewior

From: Peter Zijlstra <peterz@infradead.org>

Extend lockdep to validate lock wait-type context.

The current wait-types are:

	LD_WAIT_FREE,		/* wait free, rcu etc.. */
	LD_WAIT_SPIN,		/* spin loops, raw_spinlock_t etc.. */
	LD_WAIT_CONFIG,		/* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
	LD_WAIT_SLEEP,		/* sleeping locks, mutex_t etc.. */

Where lockdep validates that the current lock (the one being acquired)
fits in the current wait-context (as generated by the held stack).

This ensures that there is no attempt to acquire mutexes while holding
spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In
other words, its a more fancy might_sleep().

Obviously RCU made the entire ordeal more complex than a simple single
value test because RCU can be acquired in (pretty much) any context and
while it presents a context to nested locks it is not the same as it
got acquired in.

Therefore its necessary to split the wait_type into two values, one
representing the acquire (outer) and one representing the nested context
(inner). For most 'normal' locks these two are the same.

[ To make static initialization easier we have the rule that:
  .outer == INV means .outer == .inner; because INV == 0. ]

It further means that its required to find the minimal .inner of the held
stack to compare against the outer of the new lock; because while 'normal'
RCU presents a CONFIG type to nested locks, if it is taken while already
holding a SPIN type it obviously doesn't relax the rules.

Below is an example output generated by the trivial test code:

  raw_spin_lock(&foo);
  spin_lock(&bar);
  spin_unlock(&bar);
  raw_spin_unlock(&foo);

 [ BUG: Invalid wait context ]
 -----------------------------
 swapper/0/1 is trying to lock:
 ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187
 other info that might help us debug this:
 1 lock held by swapper/0/1:
  #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187

The way to read it is to look at the new -{n,m} part in the lock
description; -{3:3} for the attempted lock, and try and match that up to
the held locks, which in this case is the one: -{2,2}.

This tells that the acquiring lock requires a more relaxed environment than
presented by the lock stack.

Currently only the normal locks and RCU are converted, the rest of the
lockdep users defaults to .inner = INV which is ignored. More conversions
can be done when desired.

The check for spinlock_t nesting is not enabled by default. It's a separate
config option for now as there are known problems which are currently
addressed. The config option allows to identify these problems and to
verify that the solutions found are indeed solving them.

The config switch will be removed and the checks will permanently enabled
once the vast majority of issues has been addressed.

[ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile
	   failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP]
[ tglx: Add the config option ]

Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/irqflags.h        |   8 +-
 include/linux/lockdep.h         |  71 ++++++++++++++---
 include/linux/mutex.h           |   7 +-
 include/linux/rwlock_types.h    |   6 +-
 include/linux/rwsem.h           |   6 +-
 include/linux/sched.h           |   1 +
 include/linux/spinlock.h        |  35 ++++++---
 include/linux/spinlock_types.h  |  24 +++++-
 kernel/irq/handle.c             |   7 ++
 kernel/locking/lockdep.c        | 132 ++++++++++++++++++++++++++++++--
 kernel/locking/mutex-debug.c    |   2 +-
 kernel/locking/rwsem.c          |   2 +-
 kernel/locking/spinlock_debug.c |   6 +-
 kernel/rcu/update.c             |  24 ++++--
 lib/Kconfig.debug               |  17 ++++
 15 files changed, 301 insertions(+), 47 deletions(-)

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 21619c92c3770..fdaf28601cbe1 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -37,7 +37,12 @@
 # define trace_softirqs_enabled(p)	((p)->softirqs_enabled)
 # define trace_hardirq_enter()			\
 do {						\
-	current->hardirq_context++;		\
+	if (!current->hardirq_context++)	\
+		current->hardirq_threaded = 0;	\
+} while (0)
+# define trace_hardirq_threaded()		\
+do {						\
+	current->hardirq_threaded = 1;		\
 } while (0)
 # define trace_hardirq_exit()			\
 do {						\
@@ -59,6 +64,7 @@ do {						\
 # define trace_hardirqs_enabled(p)	0
 # define trace_softirqs_enabled(p)	0
 # define trace_hardirq_enter()		do { } while (0)
+# define trace_hardirq_threaded()	do { } while (0)
 # define trace_hardirq_exit()		do { } while (0)
 # define lockdep_softirq_enter()	do { } while (0)
 # define lockdep_softirq_exit()		do { } while (0)
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 664f52c6dd4c1..425b4ceb7cd07 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -21,6 +21,22 @@ extern int lock_stat;
 
 #include <linux/types.h>
 
+enum lockdep_wait_type {
+	LD_WAIT_INV = 0,	/* not checked, catch all */
+
+	LD_WAIT_FREE,		/* wait free, rcu etc.. */
+	LD_WAIT_SPIN,		/* spin loops, raw_spinlock_t etc.. */
+
+#ifdef CONFIG_PROVE_RAW_LOCK_NESTING
+	LD_WAIT_CONFIG,		/* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
+#else
+	LD_WAIT_CONFIG = LD_WAIT_SPIN,
+#endif
+	LD_WAIT_SLEEP,		/* sleeping locks, mutex_t etc.. */
+
+	LD_WAIT_MAX,		/* must be last */
+};
+
 #ifdef CONFIG_LOCKDEP
 
 #include <linux/linkage.h>
@@ -111,6 +127,9 @@ struct lock_class {
 	int				name_version;
 	const char			*name;
 
+	short				wait_type_inner;
+	short				wait_type_outer;
+
 #ifdef CONFIG_LOCK_STAT
 	unsigned long			contention_point[LOCKSTAT_POINTS];
 	unsigned long			contending_point[LOCKSTAT_POINTS];
@@ -158,6 +177,8 @@ struct lockdep_map {
 	struct lock_class_key		*key;
 	struct lock_class		*class_cache[NR_LOCKDEP_CACHING_CLASSES];
 	const char			*name;
+	short				wait_type_outer; /* can be taken in this context */
+	short				wait_type_inner; /* presents this context */
 #ifdef CONFIG_LOCK_STAT
 	int				cpu;
 	unsigned long			ip;
@@ -299,8 +320,21 @@ extern void lockdep_unregister_key(struct lock_class_key *key);
  * to lockdep:
  */
 
-extern void lockdep_init_map(struct lockdep_map *lock, const char *name,
-			     struct lock_class_key *key, int subclass);
+extern void lockdep_init_map_waits(struct lockdep_map *lock, const char *name,
+	struct lock_class_key *key, int subclass, short inner, short outer);
+
+static inline void
+lockdep_init_map_wait(struct lockdep_map *lock, const char *name,
+		      struct lock_class_key *key, int subclass, short inner)
+{
+	lockdep_init_map_waits(lock, name, key, subclass, inner, LD_WAIT_INV);
+}
+
+static inline void lockdep_init_map(struct lockdep_map *lock, const char *name,
+			     struct lock_class_key *key, int subclass)
+{
+	lockdep_init_map_wait(lock, name, key, subclass, LD_WAIT_INV);
+}
 
 /*
  * Reinitialize a lock key - for cases where there is special locking or
@@ -308,18 +342,29 @@ extern void lockdep_init_map(struct lockdep_map *lock, const char *name,
  * of dependencies wrong: they are either too broad (they need a class-split)
  * or they are too narrow (they suffer from a false class-split):
  */
-#define lockdep_set_class(lock, key) \
-		lockdep_init_map(&(lock)->dep_map, #key, key, 0)
-#define lockdep_set_class_and_name(lock, key, name) \
-		lockdep_init_map(&(lock)->dep_map, name, key, 0)
-#define lockdep_set_class_and_subclass(lock, key, sub) \
-		lockdep_init_map(&(lock)->dep_map, #key, key, sub)
-#define lockdep_set_subclass(lock, sub)	\
-		lockdep_init_map(&(lock)->dep_map, #lock, \
-				 (lock)->dep_map.key, sub)
+#define lockdep_set_class(lock, key)				\
+	lockdep_init_map_waits(&(lock)->dep_map, #key, key, 0,	\
+			       (lock)->dep_map.wait_type_inner,	\
+			       (lock)->dep_map.wait_type_outer)
+
+#define lockdep_set_class_and_name(lock, key, name)		\
+	lockdep_init_map_waits(&(lock)->dep_map, name, key, 0,	\
+			       (lock)->dep_map.wait_type_inner,	\
+			       (lock)->dep_map.wait_type_outer)
+
+#define lockdep_set_class_and_subclass(lock, key, sub)		\
+	lockdep_init_map_waits(&(lock)->dep_map, #key, key, sub,\
+			       (lock)->dep_map.wait_type_inner,	\
+			       (lock)->dep_map.wait_type_outer)
+
+#define lockdep_set_subclass(lock, sub)					\
+	lockdep_init_map_waits(&(lock)->dep_map, #lock, (lock)->dep_map.key, sub,\
+			       (lock)->dep_map.wait_type_inner,		\
+			       (lock)->dep_map.wait_type_outer)
 
 #define lockdep_set_novalidate_class(lock) \
 	lockdep_set_class_and_name(lock, &__lockdep_no_validate__, #lock)
+
 /*
  * Compare locking classes
  */
@@ -432,6 +477,10 @@ static inline void lockdep_set_selftest_task(struct task_struct *task)
 # define lock_set_class(l, n, k, s, i)		do { } while (0)
 # define lock_set_subclass(l, s, i)		do { } while (0)
 # define lockdep_init()				do { } while (0)
+# define lockdep_init_map_waits(lock, name, key, sub, inner, outer) \
+		do { (void)(name); (void)(key); } while (0)
+# define lockdep_init_map_wait(lock, name, key, sub, inner) \
+		do { (void)(name); (void)(key); } while (0)
 # define lockdep_init_map(lock, name, key, sub) \
 		do { (void)(name); (void)(key); } while (0)
 # define lockdep_set_class(lock, key)		do { (void)(key); } while (0)
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index aca8f36dfac9a..ae197cc00cc87 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -109,8 +109,11 @@ do {									\
 } while (0)
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-# define __DEP_MAP_MUTEX_INITIALIZER(lockname) \
-		, .dep_map = { .name = #lockname }
+# define __DEP_MAP_MUTEX_INITIALIZER(lockname)			\
+		, .dep_map = {					\
+			.name = #lockname,			\
+			.wait_type_inner = LD_WAIT_SLEEP,	\
+		}
 #else
 # define __DEP_MAP_MUTEX_INITIALIZER(lockname)
 #endif
diff --git a/include/linux/rwlock_types.h b/include/linux/rwlock_types.h
index 857a72ceb7942..3bd03e18061c1 100644
--- a/include/linux/rwlock_types.h
+++ b/include/linux/rwlock_types.h
@@ -22,7 +22,11 @@ typedef struct {
 #define RWLOCK_MAGIC		0xdeaf1eed
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-# define RW_DEP_MAP_INIT(lockname)	.dep_map = { .name = #lockname }
+# define RW_DEP_MAP_INIT(lockname)					\
+	.dep_map = {							\
+		.name = #lockname,					\
+		.wait_type_inner = LD_WAIT_CONFIG,			\
+	}
 #else
 # define RW_DEP_MAP_INIT(lockname)
 #endif
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 00d6054687dd2..18b010846bef0 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -71,7 +71,11 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 /* Common initializer macros and functions */
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
+# define __RWSEM_DEP_MAP_INIT(lockname)			\
+	, .dep_map = {					\
+		.name = #lockname,			\
+		.wait_type_inner = LD_WAIT_SLEEP,	\
+	}
 #else
 # define __RWSEM_DEP_MAP_INIT(lockname)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15b..4d3b9ecce0742 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -970,6 +970,7 @@ struct task_struct {
 
 #ifdef CONFIG_TRACE_IRQFLAGS
 	unsigned int			irq_events;
+	unsigned int			hardirq_threaded;
 	unsigned long			hardirq_enable_ip;
 	unsigned long			hardirq_disable_ip;
 	unsigned int			hardirq_enable_event;
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 031ce8617df8f..d3770b3f9d9a9 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -93,12 +93,13 @@
 
 #ifdef CONFIG_DEBUG_SPINLOCK
   extern void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name,
-				   struct lock_class_key *key);
-# define raw_spin_lock_init(lock)				\
-do {								\
-	static struct lock_class_key __key;			\
-								\
-	__raw_spin_lock_init((lock), #lock, &__key);		\
+				   struct lock_class_key *key, short inner);
+
+# define raw_spin_lock_init(lock)					\
+do {									\
+	static struct lock_class_key __key;				\
+									\
+	__raw_spin_lock_init((lock), #lock, &__key, LD_WAIT_SPIN);	\
 } while (0)
 
 #else
@@ -327,12 +328,26 @@ static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
 	return &lock->rlock;
 }
 
-#define spin_lock_init(_lock)				\
-do {							\
-	spinlock_check(_lock);				\
-	raw_spin_lock_init(&(_lock)->rlock);		\
+#ifdef CONFIG_DEBUG_SPINLOCK
+
+# define spin_lock_init(lock)					\
+do {								\
+	static struct lock_class_key __key;			\
+								\
+	__raw_spin_lock_init(spinlock_check(lock),		\
+			     #lock, &__key, LD_WAIT_CONFIG);	\
 } while (0)
 
+#else
+
+# define spin_lock_init(_lock)			\
+do {						\
+	spinlock_check(_lock);			\
+	*(_lock) = __SPIN_LOCK_UNLOCKED(_lock);	\
+} while (0)
+
+#endif
+
 static __always_inline void spin_lock(spinlock_t *lock)
 {
 	raw_spin_lock(&lock->rlock);
diff --git a/include/linux/spinlock_types.h b/include/linux/spinlock_types.h
index 24b4e6f2c1a22..6102e6bff3aeb 100644
--- a/include/linux/spinlock_types.h
+++ b/include/linux/spinlock_types.h
@@ -33,8 +33,18 @@ typedef struct raw_spinlock {
 #define SPINLOCK_OWNER_INIT	((void *)-1L)
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-# define SPIN_DEP_MAP_INIT(lockname)	.dep_map = { .name = #lockname }
+# define RAW_SPIN_DEP_MAP_INIT(lockname)		\
+	.dep_map = {					\
+		.name = #lockname,			\
+		.wait_type_inner = LD_WAIT_SPIN,	\
+	}
+# define SPIN_DEP_MAP_INIT(lockname)			\
+	.dep_map = {					\
+		.name = #lockname,			\
+		.wait_type_inner = LD_WAIT_CONFIG,	\
+	}
 #else
+# define RAW_SPIN_DEP_MAP_INIT(lockname)
 # define SPIN_DEP_MAP_INIT(lockname)
 #endif
 
@@ -51,7 +61,7 @@ typedef struct raw_spinlock {
 	{					\
 	.raw_lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
 	SPIN_DEBUG_INIT(lockname)		\
-	SPIN_DEP_MAP_INIT(lockname) }
+	RAW_SPIN_DEP_MAP_INIT(lockname) }
 
 #define __RAW_SPIN_LOCK_UNLOCKED(lockname)	\
 	(raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(lockname)
@@ -72,11 +82,17 @@ typedef struct spinlock {
 	};
 } spinlock_t;
 
+#define ___SPIN_LOCK_INITIALIZER(lockname)	\
+	{					\
+	.raw_lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
+	SPIN_DEBUG_INIT(lockname)		\
+	SPIN_DEP_MAP_INIT(lockname) }
+
 #define __SPIN_LOCK_INITIALIZER(lockname) \
-	{ { .rlock = __RAW_SPIN_LOCK_INITIALIZER(lockname) } }
+	{ { .rlock = ___SPIN_LOCK_INITIALIZER(lockname) } }
 
 #define __SPIN_LOCK_UNLOCKED(lockname) \
-	(spinlock_t ) __SPIN_LOCK_INITIALIZER(lockname)
+	(spinlock_t) __SPIN_LOCK_INITIALIZER(lockname)
 
 #define DEFINE_SPINLOCK(x)	spinlock_t x = __SPIN_LOCK_UNLOCKED(x)
 
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index a4ace611f47fe..16ee716e82917 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -145,6 +145,13 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags
 	for_each_action_of_desc(desc, action) {
 		irqreturn_t res;
 
+		/*
+		 * If this IRQ would be threaded under force_irqthreads, mark it so.
+		 */
+		if (irq_settings_can_thread(desc) &&
+		    !(action->flags & (IRQF_NO_THREAD | IRQF_PERCPU | IRQF_ONESHOT)))
+			trace_hardirq_threaded();
+
 		trace_irq_handler_entry(irq, action);
 		res = action->handler(irq, action->dev_id);
 		trace_irq_handler_exit(irq, action, res);
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 32406ef0d6a2d..4a9abf8bd9d15 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -653,7 +653,9 @@ static void print_lock_name(struct lock_class *class)
 
 	printk(KERN_CONT " (");
 	__print_lock_name(class);
-	printk(KERN_CONT "){%s}", usage);
+	printk(KERN_CONT "){%s}-{%hd:%hd}", usage,
+			class->wait_type_outer ?: class->wait_type_inner,
+			class->wait_type_inner);
 }
 
 static void print_lockdep_cache(struct lockdep_map *lock)
@@ -1230,6 +1232,8 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 	WARN_ON_ONCE(!list_empty(&class->locks_before));
 	WARN_ON_ONCE(!list_empty(&class->locks_after));
 	class->name_version = count_matching_names(class);
+	class->wait_type_inner = lock->wait_type_inner;
+	class->wait_type_outer = lock->wait_type_outer;
 	/*
 	 * We use RCU's safe list-add method to make
 	 * parallel walking of the hash-list safe:
@@ -3706,8 +3710,9 @@ static inline int separate_irq_context(struct task_struct *curr,
 /*
  * Initialize a lock instance's lock-class mapping info:
  */
-void lockdep_init_map(struct lockdep_map *lock, const char *name,
-		      struct lock_class_key *key, int subclass)
+void lockdep_init_map_waits(struct lockdep_map *lock, const char *name,
+			    struct lock_class_key *key, int subclass,
+			    short inner, short outer)
 {
 	int i;
 
@@ -3728,6 +3733,9 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name,
 
 	lock->name = name;
 
+	lock->wait_type_outer = outer;
+	lock->wait_type_inner = inner;
+
 	/*
 	 * No key, no joy, we need to hash something.
 	 */
@@ -3761,7 +3769,7 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name,
 		raw_local_irq_restore(flags);
 	}
 }
-EXPORT_SYMBOL_GPL(lockdep_init_map);
+EXPORT_SYMBOL_GPL(lockdep_init_map_waits);
 
 struct lock_class_key __lockdep_no_validate__;
 EXPORT_SYMBOL_GPL(__lockdep_no_validate__);
@@ -3798,6 +3806,113 @@ print_lock_nested_lock_not_held(struct task_struct *curr,
 	dump_stack();
 }
 
+static int
+print_lock_invalid_wait_context(struct task_struct *curr,
+				struct held_lock *hlock)
+{
+	if (!debug_locks_off())
+		return 0;
+	if (debug_locks_silent)
+		return 0;
+
+	pr_warn("\n");
+	pr_warn("=============================\n");
+	pr_warn("[ BUG: Invalid wait context ]\n");
+	print_kernel_ident();
+	pr_warn("-----------------------------\n");
+
+	pr_warn("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
+	print_lock(hlock);
+
+	pr_warn("other info that might help us debug this:\n");
+	lockdep_print_held_locks(curr);
+
+	pr_warn("stack backtrace:\n");
+	dump_stack();
+
+	return 0;
+}
+
+/*
+ * Verify the wait_type context.
+ *
+ * This check validates we takes locks in the right wait-type order; that is it
+ * ensures that we do not take mutexes inside spinlocks and do not attempt to
+ * acquire spinlocks inside raw_spinlocks and the sort.
+ *
+ * The entire thing is slightly more complex because of RCU, RCU is a lock that
+ * can be taken from (pretty much) any context but also has constraints.
+ * However when taken in a stricter environment the RCU lock does not loosen
+ * the constraints.
+ *
+ * Therefore we must look for the strictest environment in the lock stack and
+ * compare that to the lock we're trying to acquire.
+ */
+static int check_wait_context(struct task_struct *curr, struct held_lock *next)
+{
+	short next_inner = hlock_class(next)->wait_type_inner;
+	short next_outer = hlock_class(next)->wait_type_outer;
+	short curr_inner;
+	int depth;
+
+	if (!curr->lockdep_depth || !next_inner || next->trylock)
+		return 0;
+
+	if (!next_outer)
+		next_outer = next_inner;
+
+	/*
+	 * Find start of current irq_context..
+	 */
+	for (depth = curr->lockdep_depth - 1; depth >= 0; depth--) {
+		struct held_lock *prev = curr->held_locks + depth;
+		if (prev->irq_context != next->irq_context)
+			break;
+	}
+	depth++;
+
+	/*
+	 * Set appropriate wait type for the context; for IRQs we have to take
+	 * into account force_irqthread as that is implied by PREEMPT_RT.
+	 */
+	if (curr->hardirq_context) {
+		/*
+		 * Check if force_irqthreads will run us threaded.
+		 */
+		if (curr->hardirq_threaded)
+			curr_inner = LD_WAIT_CONFIG;
+		else
+			curr_inner = LD_WAIT_SPIN;
+	} else if (curr->softirq_context) {
+		/*
+		 * Softirqs are always threaded.
+		 */
+		curr_inner = LD_WAIT_CONFIG;
+	} else {
+		curr_inner = LD_WAIT_MAX;
+	}
+
+	for (; depth < curr->lockdep_depth; depth++) {
+		struct held_lock *prev = curr->held_locks + depth;
+		short prev_inner = hlock_class(prev)->wait_type_inner;
+
+		if (prev_inner) {
+			/*
+			 * We can have a bigger inner than a previous one
+			 * when outer is smaller than inner, as with RCU.
+			 *
+			 * Also due to trylocks.
+			 */
+			curr_inner = min(curr_inner, prev_inner);
+		}
+	}
+
+	if (next_outer > curr_inner)
+		return print_lock_invalid_wait_context(curr, next);
+
+	return 0;
+}
+
 static int __lock_is_held(const struct lockdep_map *lock, int read);
 
 /*
@@ -3862,7 +3977,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 
 	class_idx = class - lock_classes;
 
-	if (depth) {
+	if (depth) { /* we're holding locks */
 		hlock = curr->held_locks + depth - 1;
 		if (hlock->class_idx == class_idx && nest_lock) {
 			if (!references)
@@ -3904,6 +4019,9 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 #endif
 	hlock->pin_count = pin_count;
 
+	if (check_wait_context(curr, hlock))
+		return 0;
+
 	/* Initialize the lock usage bit */
 	if (!mark_usage(curr, hlock, check))
 		return 0;
@@ -4139,7 +4257,9 @@ __lock_set_class(struct lockdep_map *lock, const char *name,
 		return 0;
 	}
 
-	lockdep_init_map(lock, name, key, 0);
+	lockdep_init_map_waits(lock, name, key, 0,
+			       lock->wait_type_inner,
+			       lock->wait_type_outer);
 	class = register_lock_class(lock, subclass, 0);
 	hlock->class_idx = class - lock_classes;
 
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 771d4ca96dda7..a7276aaf2abc0 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -85,7 +85,7 @@ void debug_mutex_init(struct mutex *lock, const char *name,
 	 * Make sure we are not reinitializing a held lock:
 	 */
 	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
-	lockdep_init_map(&lock->dep_map, name, key, 0);
+	lockdep_init_map_wait(&lock->dep_map, name, key, 0, LD_WAIT_SLEEP);
 #endif
 	lock->magic = lock;
 }
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 0d9b6be9ecc88..cf1425452eed2 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -329,7 +329,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
 	 * Make sure we are not reinitializing a held semaphore:
 	 */
 	debug_check_no_locks_freed((void *)sem, sizeof(*sem));
-	lockdep_init_map(&sem->dep_map, name, key, 0);
+	lockdep_init_map_wait(&sem->dep_map, name, key, 0, LD_WAIT_SLEEP);
 #endif
 #ifdef CONFIG_DEBUG_RWSEMS
 	sem->magic = sem;
diff --git a/kernel/locking/spinlock_debug.c b/kernel/locking/spinlock_debug.c
index 472dd462a40ca..b9d93087ee669 100644
--- a/kernel/locking/spinlock_debug.c
+++ b/kernel/locking/spinlock_debug.c
@@ -14,14 +14,14 @@
 #include <linux/export.h>
 
 void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name,
-			  struct lock_class_key *key)
+			  struct lock_class_key *key, short inner)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	/*
 	 * Make sure we are not reinitializing a held lock:
 	 */
 	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
-	lockdep_init_map(&lock->dep_map, name, key, 0);
+	lockdep_init_map_wait(&lock->dep_map, name, key, 0, inner);
 #endif
 	lock->raw_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
 	lock->magic = SPINLOCK_MAGIC;
@@ -39,7 +39,7 @@ void __rwlock_init(rwlock_t *lock, const char *name,
 	 * Make sure we are not reinitializing a held lock:
 	 */
 	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
-	lockdep_init_map(&lock->dep_map, name, key, 0);
+	lockdep_init_map_wait(&lock->dep_map, name, key, 0, LD_WAIT_CONFIG);
 #endif
 	lock->raw_lock = (arch_rwlock_t) __ARCH_RW_LOCK_UNLOCKED;
 	lock->magic = RWLOCK_MAGIC;
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 6c4b862f57d6f..8d3eb2fe20ae3 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -227,18 +227,30 @@ core_initcall(rcu_set_runtime_mode);
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 static struct lock_class_key rcu_lock_key;
-struct lockdep_map rcu_lock_map =
-	STATIC_LOCKDEP_MAP_INIT("rcu_read_lock", &rcu_lock_key);
+struct lockdep_map rcu_lock_map = {
+	.name = "rcu_read_lock",
+	.key = &rcu_lock_key,
+	.wait_type_outer = LD_WAIT_FREE,
+	.wait_type_inner = LD_WAIT_CONFIG, /* XXX PREEMPT_RCU ? */
+};
 EXPORT_SYMBOL_GPL(rcu_lock_map);
 
 static struct lock_class_key rcu_bh_lock_key;
-struct lockdep_map rcu_bh_lock_map =
-	STATIC_LOCKDEP_MAP_INIT("rcu_read_lock_bh", &rcu_bh_lock_key);
+struct lockdep_map rcu_bh_lock_map = {
+	.name = "rcu_read_lock_bh",
+	.key = &rcu_bh_lock_key,
+	.wait_type_outer = LD_WAIT_FREE,
+	.wait_type_inner = LD_WAIT_CONFIG, /* PREEMPT_LOCK also makes BH preemptible */
+};
 EXPORT_SYMBOL_GPL(rcu_bh_lock_map);
 
 static struct lock_class_key rcu_sched_lock_key;
-struct lockdep_map rcu_sched_lock_map =
-	STATIC_LOCKDEP_MAP_INIT("rcu_read_lock_sched", &rcu_sched_lock_key);
+struct lockdep_map rcu_sched_lock_map = {
+	.name = "rcu_read_lock_sched",
+	.key = &rcu_sched_lock_key,
+	.wait_type_outer = LD_WAIT_FREE,
+	.wait_type_inner = LD_WAIT_SPIN,
+};
 EXPORT_SYMBOL_GPL(rcu_sched_lock_map);
 
 static struct lock_class_key rcu_callback_key;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 69def4a9df009..70813e39f911c 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1086,6 +1086,23 @@ config PROVE_LOCKING
 
 	 For more details, see Documentation/locking/lockdep-design.rst.
 
+config PROVE_RAW_LOCK_NESTING
+	bool "Enable raw_spinlock - spinlock nesting checks"
+	depends on PROVE_LOCKING
+	default n
+	help
+	 Enable the raw_spinlock vs. spinlock nesting checks which ensure
+	 that the lock nesting rules for PREEMPT_RT enabled kernels are
+	 not violated.
+
+	 NOTE: There are known nesting problems. So if you enable this
+	 option expect lockdep splats until these problems have been fully
+	 addressed which is work in progress. This config switch allows to
+	 identify and analyze these problems. It will be removed and the
+	 check permanentely enabled once the main issues have been fixed.
+
+	 If unsure, select N.
+
 config LOCK_STAT
 	bool "Lock usage statistics"
 	depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
-- 
2.25.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 7/9] lockdep: Add hrtimer context tracing bits
  2020-03-13 17:46 [PATCH 0/9] Lock ordering documentation and annotation for lockdep Sebastian Andrzej Siewior
                   ` (5 preceding siblings ...)
  2020-03-13 17:46 ` [PATCH 6/9] lockdep: Introduce wait-type checks Sebastian Andrzej Siewior
@ 2020-03-13 17:46 ` Sebastian Andrzej Siewior
  2020-03-13 17:47 ` [PATCH 8/9] lockdep: Annotate irq_work Sebastian Andrzej Siewior
  2020-03-13 17:47 ` [PATCH 9/9] lockdep: Add posixtimer context tracing bits Sebastian Andrzej Siewior
  8 siblings, 0 replies; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-13 17:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Sebastian Andrzej Siewior

Set current->irq_config = 1 for hrtimers which are not marked to expire in
hard interrupt context during hrtimer_init(). These timers will expire in
softirq context on PREEMPT_RT.

Setting this allows lockdep to differentiate these timers. If a timer is
marked to expire in hard interrupt context then the timer callback is not
supposed to acquire a regular spinlock instead of a raw_spinlock in the
expiry callback.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/irqflags.h | 15 +++++++++++++++
 include/linux/sched.h    |  1 +
 kernel/locking/lockdep.c |  2 +-
 kernel/time/hrtimer.c    |  6 +++++-
 4 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index fdaf28601cbe1..9c17f9c827aac 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -56,6 +56,19 @@ do {						\
 do {						\
 	current->softirq_context--;		\
 } while (0)
+
+# define lockdep_hrtimer_enter(__hrtimer)		\
+	  do {						\
+		  if (!__hrtimer->is_hard)		\
+			current->irq_config = 1;	\
+	  } while (0)
+
+# define lockdep_hrtimer_exit(__hrtimer)		\
+	  do {						\
+		  if (!__hrtimer->is_hard)		\
+			current->irq_config = 0;	\
+	  } while (0)
+
 #else
 # define trace_hardirqs_on()		do { } while (0)
 # define trace_hardirqs_off()		do { } while (0)
@@ -68,6 +81,8 @@ do {						\
 # define trace_hardirq_exit()		do { } while (0)
 # define lockdep_softirq_enter()	do { } while (0)
 # define lockdep_softirq_exit()		do { } while (0)
+# define lockdep_hrtimer_enter(__hrtimer)		do { } while (0)
+# define lockdep_hrtimer_exit(__hrtimer)		do { } while (0)
 #endif
 
 #if defined(CONFIG_IRQSOFF_TRACER) || \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4d3b9ecce0742..933914cdf2197 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -983,6 +983,7 @@ struct task_struct {
 	unsigned int			softirq_enable_event;
 	int				softirqs_enabled;
 	int				softirq_context;
+	int				irq_config;
 #endif
 
 #ifdef CONFIG_LOCKDEP
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 4a9abf8bd9d15..df74531fd9f85 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3879,7 +3879,7 @@ static int check_wait_context(struct task_struct *curr, struct held_lock *next)
 		/*
 		 * Check if force_irqthreads will run us threaded.
 		 */
-		if (curr->hardirq_threaded)
+		if (curr->hardirq_threaded || curr->irq_config)
 			curr_inner = LD_WAIT_CONFIG;
 		else
 			curr_inner = LD_WAIT_SPIN;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3a609e7344f3d..8cce72501aea5 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1404,7 +1404,7 @@ static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
 	base = softtimer ? HRTIMER_MAX_CLOCK_BASES / 2 : 0;
 	base += hrtimer_clockid_to_base(clock_id);
 	timer->is_soft = softtimer;
-	timer->is_hard = !softtimer;
+	timer->is_hard = !!(mode & HRTIMER_MODE_HARD);
 	timer->base = &cpu_base->clock_base[base];
 	timerqueue_init(&timer->node);
 }
@@ -1514,7 +1514,11 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 	 */
 	raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
 	trace_hrtimer_expire_entry(timer, now);
+	lockdep_hrtimer_enter(timer);
+
 	restart = fn(timer);
+
+	lockdep_hrtimer_exit(timer);
 	trace_hrtimer_expire_exit(timer);
 	raw_spin_lock_irq(&cpu_base->lock);
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 8/9] lockdep: Annotate irq_work
  2020-03-13 17:46 [PATCH 0/9] Lock ordering documentation and annotation for lockdep Sebastian Andrzej Siewior
                   ` (6 preceding siblings ...)
  2020-03-13 17:46 ` [PATCH 7/9] lockdep: Add hrtimer context tracing bits Sebastian Andrzej Siewior
@ 2020-03-13 17:47 ` Sebastian Andrzej Siewior
  2020-03-13 17:47 ` [PATCH 9/9] lockdep: Add posixtimer context tracing bits Sebastian Andrzej Siewior
  8 siblings, 0 replies; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-13 17:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Sebastian Andrzej Siewior

Mark irq_work items with IRQ_WORK_HARD_IRQ which should be invoked in
hardirq context even on PREEMPT_RT. IRQ_WORK without this flag will be
invoked in softirq context on PREEMPT_RT.

Set ->irq_config to 1 for the IRQ_WORK items which are invoked in softirq
context so lockdep knows that these can safely acquire a spinlock_t.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/irq_work.h |  2 ++
 include/linux/irqflags.h | 13 +++++++++++++
 kernel/irq_work.c        |  2 ++
 kernel/rcu/tree.c        |  1 +
 kernel/time/tick-sched.c |  1 +
 5 files changed, 19 insertions(+)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 02da997ad12ce..3b752e80c017d 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -18,6 +18,8 @@
 
 /* Doesn't want IPI, wait for tick: */
 #define IRQ_WORK_LAZY		BIT(2)
+/* Run hard IRQ context, even on RT */
+#define IRQ_WORK_HARD_IRQ	BIT(3)
 
 #define IRQ_WORK_CLAIMED	(IRQ_WORK_PENDING | IRQ_WORK_BUSY)
 
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 9c17f9c827aac..f23f540e0ebba 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -69,6 +69,17 @@ do {						\
 			current->irq_config = 0;	\
 	  } while (0)
 
+# define lockdep_irq_work_enter(__work)					\
+	  do {								\
+		  if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+			current->irq_config = 1;			\
+	  } while (0)
+# define lockdep_irq_work_exit(__work)					\
+	  do {								\
+		  if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
+			current->irq_config = 0;			\
+	  } while (0)
+
 #else
 # define trace_hardirqs_on()		do { } while (0)
 # define trace_hardirqs_off()		do { } while (0)
@@ -83,6 +94,8 @@ do {						\
 # define lockdep_softirq_exit()		do { } while (0)
 # define lockdep_hrtimer_enter(__hrtimer)		do { } while (0)
 # define lockdep_hrtimer_exit(__hrtimer)		do { } while (0)
+# define lockdep_irq_work_enter(__work)		do { } while (0)
+# define lockdep_irq_work_exit(__work)		do { } while (0)
 #endif
 
 #if defined(CONFIG_IRQSOFF_TRACER) || \
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 828cc30774bc4..48b5d1b6af4d3 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -153,7 +153,9 @@ static void irq_work_run_list(struct llist_head *list)
 		 */
 		flags = atomic_fetch_andnot(IRQ_WORK_PENDING, &work->flags);
 
+		lockdep_irq_work_enter(work);
 		work->func(work);
+		lockdep_irq_work_exit(work);
 		/*
 		 * Clear the BUSY bit and return to the free state if
 		 * no-one else claimed it meanwhile.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d91c9156fab2e..5066d1dd30777 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1113,6 +1113,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 		    !rdp->rcu_iw_pending && rdp->rcu_iw_gp_seq != rnp->gp_seq &&
 		    (rnp->ffmask & rdp->grpmask)) {
 			init_irq_work(&rdp->rcu_iw, rcu_iw_handler);
+			atomic_set(&rdp->rcu_iw.flags, IRQ_WORK_HARD_IRQ);
 			rdp->rcu_iw_pending = true;
 			rdp->rcu_iw_gp_seq = rnp->gp_seq;
 			irq_work_queue_on(&rdp->rcu_iw, rdp->cpu);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 4be756b88a48e..3e2dc9b8858c7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -245,6 +245,7 @@ static void nohz_full_kick_func(struct irq_work *work)
 
 static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = {
 	.func = nohz_full_kick_func,
+	.flags = ATOMIC_INIT(IRQ_WORK_HARD_IRQ),
 };
 
 /*
-- 
2.25.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 9/9] lockdep: Add posixtimer context tracing bits
  2020-03-13 17:46 [PATCH 0/9] Lock ordering documentation and annotation for lockdep Sebastian Andrzej Siewior
                   ` (7 preceding siblings ...)
  2020-03-13 17:47 ` [PATCH 8/9] lockdep: Annotate irq_work Sebastian Andrzej Siewior
@ 2020-03-13 17:47 ` Sebastian Andrzej Siewior
  8 siblings, 0 replies; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-13 17:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Sebastian Andrzej Siewior

Splitting run_posix_cpu_timers() into two parts is work in progress which
is stuck on other entry code related problems. The heavy lifting which
involves locking of sighand lock will be moved into task context so the
necessary execution time is burdened on the task and not on interrupt
context.

Until this work completes lockdep with the spinlock nesting rules enabled
would emit warnings for this known context.

Prevent it by setting "->irq_config = 1" for the invocation of
run_posix_cpu_timers() so lockdep does not complain when sighand lock is
acquried. This will be removed once the split is completed.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/irqflags.h       | 12 ++++++++++++
 kernel/time/posix-cpu-timers.c |  6 +++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index f23f540e0ebba..a16adbb58f66a 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -69,6 +69,16 @@ do {						\
 			current->irq_config = 0;	\
 	  } while (0)
 
+# define lockdep_posixtimer_enter()				\
+	  do {							\
+		  current->irq_config = 1;			\
+	  } while (0)
+
+# define lockdep_posixtimer_exit()				\
+	  do {							\
+		  current->irq_config = 0;			\
+	  } while (0)
+
 # define lockdep_irq_work_enter(__work)					\
 	  do {								\
 		  if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\
@@ -94,6 +104,8 @@ do {						\
 # define lockdep_softirq_exit()		do { } while (0)
 # define lockdep_hrtimer_enter(__hrtimer)		do { } while (0)
 # define lockdep_hrtimer_exit(__hrtimer)		do { } while (0)
+# define lockdep_posixtimer_enter()		do { } while (0)
+# define lockdep_posixtimer_exit()		do { } while (0)
 # define lockdep_irq_work_enter(__work)		do { } while (0)
 # define lockdep_irq_work_exit(__work)		do { } while (0)
 #endif
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 8ff6da77a01fd..2c48a7233b196 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -1126,8 +1126,11 @@ void run_posix_cpu_timers(void)
 	if (!fastpath_timer_check(tsk))
 		return;
 
-	if (!lock_task_sighand(tsk, &flags))
+	lockdep_posixtimer_enter();
+	if (!lock_task_sighand(tsk, &flags)) {
+		lockdep_posixtimer_exit();
 		return;
+	}
 	/*
 	 * Here we take off tsk->signal->cpu_timers[N] and
 	 * tsk->cpu_timers[N] all the timers that are firing, and
@@ -1169,6 +1172,7 @@ void run_posix_cpu_timers(void)
 			cpu_timer_fire(timer);
 		spin_unlock(&timer->it_lock);
 	}
+	lockdep_posixtimer_exit();
 }
 
 /*
-- 
2.25.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll
  2020-03-13 17:46 ` [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll Sebastian Andrzej Siewior
@ 2020-03-13 18:11   ` Logan Gunthorpe
  2020-03-14  0:23     ` Thomas Gleixner
  2020-03-16 19:34     ` Thomas Gleixner
  2020-03-13 19:31   ` Peter Zijlstra
  1 sibling, 2 replies; 31+ messages in thread
From: Logan Gunthorpe @ 2020-03-13 18:11 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Kurt Schwemmer, Bjorn Helgaas, linux-pci



On 2020-03-13 11:46 a.m., Sebastian Andrzej Siewior wrote:
> The poll callback is abusing the completion wait queue and sticks it into
> poll_wait() to wake up pollers after a command has completed.
> 
> First of all it's a layering violation as it imposes restrictions on the
> inner workings of completions. Just because C allows to do so does not
> justify that in any way. The proper way to do such things is to post
> patches which extend the core infrastructure and not by silently abusing
> it.

As I've said previously, I disagree with this approach. Open coding
standard primitives sweeps issues under the rug and is a step backwards
for code quality. Calling it a layering violation is just one opinion
and if it is, the better solution would be to create an interface you
find appropriate so that it isn't one.

> Aside of that the implementation is seriously broken:
> 
>  1) It cannot work with EPOLLEXCLUSIVE

Why? You don't explain this. And I don't see how this patch would change
anything to do with the call to poll_wait(). All you've done is
open-code the completion.

Not that it matters that much, having multiple waiters poll on this
interface can pretty much never happen. It only makes sense for the
process who submitted the write to poll on the interface.

>  2) It's racy:
> 
>   poll()	      	  	 write()
>    switchtec_dev_poll()		   switchtec_dev_write()
>     poll_wait(&s->comp.wait);        mrpc_queue_cmd()
>     				       init_completion(&s->comp)
> 					 init_waitqueue_head(&s->comp.wait)

That's a nice catch! But wouldn't an easier solution be to change the
code to use reinit_completion() instead of using the bug to justify a
different change?

Thanks,

Logan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll
  2020-03-13 17:46 ` [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll Sebastian Andrzej Siewior
  2020-03-13 18:11   ` Logan Gunthorpe
@ 2020-03-13 19:31   ` Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2020-03-13 19:31 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner,
	Kurt Schwemmer, Logan Gunthorpe, Bjorn Helgaas, linux-pci

On Fri, Mar 13, 2020 at 06:46:55PM +0100, Sebastian Andrzej Siewior wrote:
> The poll callback is abusing the completion wait queue and sticks it into
> poll_wait() to wake up pollers after a command has completed.
> 
> First of all it's a layering violation as it imposes restrictions on the
> inner workings of completions. Just because C allows to do so does not
> justify that in any way. The proper way to do such things is to post
> patches which extend the core infrastructure and not by silently abusing
> it.
> 
> Aside of that the implementation is seriously broken:
> 
>  1) It cannot work with EPOLLEXCLUSIVE
> 
>  2) It's racy:
> 
>   poll()	      	  	 write()
>    switchtec_dev_poll()		   switchtec_dev_write()
>     poll_wait(&s->comp.wait);        mrpc_queue_cmd()
>     				       init_completion(&s->comp)
> 					 init_waitqueue_head(&s->comp.wait)
> 
> Replace it with a regular wait queue which removes the completion abuse and
> cures #1 and #2 above.
> 
> Cc: Kurt Schwemmer <kurt.schwemmer@microsemi.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Relying on implementation details of locking primitives like that is
yuck.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll
  2020-03-13 18:11   ` Logan Gunthorpe
@ 2020-03-14  0:23     ` Thomas Gleixner
  2020-03-14  6:01       ` Logan Gunthorpe
  2020-03-16 19:34     ` Thomas Gleixner
  1 sibling, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2020-03-14  0:23 UTC (permalink / raw)
  To: Logan Gunthorpe, Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Kurt Schwemmer,
	Bjorn Helgaas, linux-pci

Logan Gunthorpe <logang@deltatee.com> writes:
> On 2020-03-13 11:46 a.m., Sebastian Andrzej Siewior wrote:
>> The poll callback is abusing the completion wait queue and sticks it into
>> poll_wait() to wake up pollers after a command has completed.
>> 
>> First of all it's a layering violation as it imposes restrictions on the
>> inner workings of completions. Just because C allows to do so does not
>> justify that in any way. The proper way to do such things is to post
>> patches which extend the core infrastructure and not by silently abusing
>> it.
>
> As I've said previously, I disagree with this approach.

Feel free to do s.

> Open coding standard primitives sweeps issues under the rug and is a
> step backwards for code quality. Calling it a layering violation is
> just one opinion and if it is, the better solution would be to create
> an interface you find appropriate so that it isn't one.

There is no standard primitive which allows to poll on a completion.

You decided that this is smart to do and just because C does not
allow to hide implementation details this is not a justification for
this at all.

Due to the limitations of C, the kernel has to rely on the assumption
that developers know and respect the difference between API and
implementation.

Relying on implementation details of an interface and then arguing that
this is a standard primitive for the chosen purpose is just backwards.

What's even more hillarious is that you now request that we give you a
replacement interface for something which was not an interface to use in
the first place.

>>  1) It cannot work with EPOLLEXCLUSIVE
>
> Why? You don't explain this. And I don't see how this patch would change
> anything to do with the call to poll_wait(). All you've done is
> open-code the completion.
>
> Not that it matters that much, having multiple waiters poll on this
> interface can pretty much never happen. It only makes sense for the
> process who submitted the write to poll on the interface.

It does not matter whether your envisioned usage implies that it cannot
happen. Fact is that there is no restriction. That means using this with
the well documented semantics of poll(2) will result in failure. 

>>  2) It's racy:
>> 
>>   poll()	      	  	 write()
>>    switchtec_dev_poll()		   switchtec_dev_write()
>>     poll_wait(&s->comp.wait);        mrpc_queue_cmd()
>>     				       init_completion(&s->comp)
>> 					 init_waitqueue_head(&s->comp.wait)
>
> That's a nice catch! But wouldn't an easier solution be to change the
> code to use reinit_completion() instead of using the bug to justify a
> different change?

Sure taht can be cured by changing it to reinit, but that does not cure
the abuse of implementation details. As Peter, who maintains the
completion code says:

  Relying on implementation details of locking primitives like that is
  yuck.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/9] sched/swait: Prepare usage in completions
  2020-03-13 17:46 ` [PATCH 4/9] sched/swait: Prepare usage in completions Sebastian Andrzej Siewior
@ 2020-03-14  0:26   ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2020-03-14  0:26 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds,
	Sebastian Andrzej Siewior

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
>  
> +void swake_up_all_locked(struct swait_queue_head *q)
> +{
> +	while (!list_empty(&q->task_list))
> +		swake_up_locked(q);
> +}

This one wants a big fat comment along with the usage site in the
completions.

Yes, it's my fault that I forgot to double check this. Will send a delta
patch when brain is more awake.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll
  2020-03-14  0:23     ` Thomas Gleixner
@ 2020-03-14  6:01       ` Logan Gunthorpe
  2020-03-16 18:52         ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Logan Gunthorpe @ 2020-03-14  6:01 UTC (permalink / raw)
  To: Thomas Gleixner, Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Kurt Schwemmer,
	Bjorn Helgaas, linux-pci



On 2020-03-13 6:23 p.m., Thomas Gleixner wrote:
> Logan Gunthorpe <logang@deltatee.com> writes:
>> On 2020-03-13 11:46 a.m., Sebastian Andrzej Siewior wrote:
>>> The poll callback is abusing the completion wait queue and sticks it into
>>> poll_wait() to wake up pollers after a command has completed.
>>>
>>> First of all it's a layering violation as it imposes restrictions on the
>>> inner workings of completions. Just because C allows to do so does not
>>> justify that in any way. The proper way to do such things is to post
>>> patches which extend the core infrastructure and not by silently abusing
>>> it.
>>
>> As I've said previously, I disagree with this approach.
> 
> Feel free to do s.
> 
>> Open coding standard primitives sweeps issues under the rug and is a
>> step backwards for code quality. Calling it a layering violation is
>> just one opinion and if it is, the better solution would be to create
>> an interface you find appropriate so that it isn't one.
> 
> There is no standard primitive which allows to poll on a completion.
> 
> You decided that this is smart to do and just because C does not
> allow to hide implementation details this is not a justification for
> this at all.
> 
> Due to the limitations of C, the kernel has to rely on the assumption
> that developers know and respect the difference between API and
> implementation.
> 
> Relying on implementation details of an interface and then arguing that
> this is a standard primitive for the chosen purpose is just backwards.
> 
> What's even more hillarious is that you now request that we give you a
> replacement interface for something which was not an interface to use in
> the first place.

I'm in awe at the lack of professionalism in your emails. If you
bothered to edit out the ad hominems, you might have noticed that nobody
has yet described how the poll interface fails here (with
EPOLLEXCLUSIVE) or how replacing one wait queue for another fixes the
purported problem.

We clearly disagree on what's considered appropriate usage of the
completion helper (calling it a "locking primitive" is a bit
disingenuous) and it doesn't sound like that's going to change. I hold
no power here, but you aren't going to bully me into giving this patch
an Ack or into silencing my opinion on the matter.

I'd prefer it if you submit a patch that's honest about what it's trying
to accomplish and why (ie. it doesn't masquerade as being necessary to
fix a bug). I also ask that you accept that I'm within my right to voice
my dissent. If, after that, Bjorn chooses to take your patch, then I
will respect his decision. I trust that he's able to read behind the
personal attacks and look only at the technical issues.

Logan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/9] completion: Use simple wait queues
  2020-03-13 17:46 ` [PATCH 5/9] completion: Use simple wait queues Sebastian Andrzej Siewior
@ 2020-03-14 15:40   ` Linus Torvalds
  2020-03-16 18:55     ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2020-03-14 15:40 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Paul E . McKenney, Joel Fernandes, Steven Rostedt,
	Thomas Gleixner

On Fri, Mar 13, 2020 at 10:47 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> Replace the wait queue in the completion with a simple wait queue (swait),

This is almost certainly completely and utterly wrong.

Every time somebody uses those horrible swait queues, the end result is buggy.

Don't do it.

And most definitely, don't do it like this, which seems to be entirely
mindlessly just randomly changing things by some brute force in
multiple places.

The swait semantics are completely different from the normal
wait-queue semantics, and generally not in good ways.

There's a *REASON* why the comment at the top of <linux/swait.h> starts with

 * BROKEN wait-queues.
 *
 * These "simple" wait-queues are broken garbage, and should never be
 * used. The comments below claim that they are "similar" to regular
 * wait-queues, but the semantics are actually completely different, and
 * every single user we have ever had has been buggy (or pointless).

and before you do a conversion, you need to spend a _lot_ of time
thinking about why that is the case.

And _after_ you do the conversion, you damn well need to explain why
it's safe. Not just state that it's a good idea.

For example, this patch just randomly changes wait events to the swait
event _exclusive_ waits. With not a single explanation of why that
would be ok.

I want an explanation for EVERY SINGLE CASE. Because people have done
this kind of conversion before, and it's been buggy garbage before. I
want to see that people actually thought about what the semantic
differences were, and _documented_ that thinking process.

            Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/9] Documentation: Add lock ordering and nesting documentation
  2020-03-13 17:46 ` [PATCH 1/9] Documentation: Add lock ordering and nesting documentation Sebastian Andrzej Siewior
@ 2020-03-14 22:57   ` Randy Dunlap
  2020-03-16 10:34     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 31+ messages in thread
From: Randy Dunlap @ 2020-03-14 22:57 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Thomas Gleixner

Hi,

A few comments for your consideration:

On 3/13/20 10:46 AM, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The kernel provides a variety of locking primitives. The nesting of these
> lock types and the implications of them on RT enabled kernels is nowhere
> documented.
> 
> Add initial documentation.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  Documentation/locking/index.rst     |   1 +
>  Documentation/locking/locktypes.rst | 298 ++++++++++++++++++++++++++++
>  2 files changed, 299 insertions(+)
>  create mode 100644 Documentation/locking/locktypes.rst

> diff --git a/Documentation/locking/locktypes.rst b/Documentation/locking/locktypes.rst
> new file mode 100644
> index 0000000000000..d4c3f2094ad20
> --- /dev/null
> +++ b/Documentation/locking/locktypes.rst
> @@ -0,0 +1,298 @@
> +.. _kernel_hacking_locktypes:
> +
> +==========================
> +Lock types and their rules
> +==========================
> +
> +Introduction
> +============
> +
> +The kernel provides a variety of locking primitives which can be divided
> +into two categories:
> +
> + - Sleeping locks
> + - Spinning locks
> +
> +This document describes the lock types at least at the conceptual level and
> +provides rules for nesting of lock types also under the aspect of PREEMPT_RT.
> +
> +Lock categories
> +===============
> +
> +Sleeping locks
> +--------------
> +
> +Sleeping locks can only be acquired in preemptible task context.
> +
> +Some of the implementations allow try_lock() attempts from other contexts,
> +but that has to be really evaluated carefully including the question
> +whether the unlock can be done from that context safely as well.
> +
> +Note, that some lock types change their implementation details when

Drop comma.

> +debugging is enabled, so this should be really only considered if there is
> +no other option.
> +
> +Sleeping lock types:
> +
> + - mutex
> + - rt_mutex
> + - semaphore
> + - rw_semaphore
> + - ww_mutex
> + - percpu_rw_semaphore
> +
> +On a PREEMPT_RT enabled kernel the following lock types are converted to
> +sleeping locks:
> +
> + - spinlock_t
> + - rwlock_t
> +
> +Spinning locks
> +--------------
> +
> + - raw_spinlock_t
> + - bit spinlocks
> +
> +On a non PREEMPT_RT enabled kernel the following lock types are spinning
> +locks as well:
> +
> + - spinlock_t
> + - rwlock_t
> +
> +Spinning locks implicitly disable preemption and the lock / unlock functions
> +can have suffixes which apply further protections:
> +
> + ===================  ====================================================
> + _bh()                Disable / enable bottom halfs (soft interrupts)

                                                 halves

> + _irq()               Disable / enable interrupts
> + _irqsave/restore()   Save and disable / restore interrupt disabled state
> + ===================  ====================================================
> +
> +
> +rtmutex
> +=======
> +
> +RT-mutexes are mutexes with support for priority inheritance (PI).
> +
> +PI has limitations on non PREEMPT_RT enabled kernels due to preemption and
> +interrupt disabled sections.
> +
> +On a PREEMPT_RT enabled kernel most of these sections are fully
> +preemptible. This is possible because PREEMPT_RT forces most executions
> +into task context, especially interrupt handlers and soft interrupts, which
> +allows to substitute spinlock_t and rwlock_t with RT-mutex based
> +implementations.
> +
> +
> +raw_spinlock_t and spinlock_t
> +=============================
> +
> +raw_spinlock_t
> +--------------
> +
> +raw_spinlock_t is a strict spinning lock implementation regardless of the
> +kernel configuration including PREEMPT_RT enabled kernels.
> +
> +raw_spinlock_t is to be used only in real critical core code, low level
> +interrupt handling and places where protecting (hardware) state is required
> +to be safe against preemption and eventually interrupts.
> +
> +Another reason to use raw_spinlock_t is when the critical section is tiny
> +to avoid the overhead of spinlock_t on a PREEMPT_RT enabled kernel in the
> +contended case.
> +
> +spinlock_t
> +----------
> +
> +The semantics of spinlock_t change with the state of CONFIG_PREEMPT_RT.
> +
> +On a non PREEMPT_RT enabled kernel spinlock_t is mapped to raw_spinlock_t
> +and has exactly the same semantics.
> +
> +spinlock_t and PREEMPT_RT
> +-------------------------
> +
> +On a PREEMPT_RT enabled kernel spinlock_t is mapped to a separate
> +implementation based on rt_mutex which changes the semantics:
> +
> + - Preemption is not disabled
> +
> + - The hard interrupt related suffixes for spin_lock / spin_unlock
> +   operations (_irq, _irqsave / _irqrestore) do not affect the CPUs
> +   interrupt disabled state
> +
> + - The soft interrupt related suffix (_bh()) is still disabling the
> +   execution of soft interrupts, but contrary to a non PREEMPT_RT enabled
> +   kernel, which utilizes the preemption count, this is achieved by a per
> +   CPU bottom half locking mechanism.
> +
> +All other semantics of spinlock_t are preserved:
> +
> + - Migration of tasks which hold a spinlock_t is prevented. On a non
> +   PREEMPT_RT enabled kernel this is implicit due to preemption disable.
> +   PREEMPT_RT has a separate mechanism to achieve this. This ensures that
> +   pointers to per CPU variables stay valid even if the task is preempted.
> +
> + - Task state preservation. The task state is not affected when a lock is
> +   contended and the task has to schedule out and wait for the lock to
> +   become available. The lock wake up restores the task state unless there
> +   was a regular (not lock related) wake up on the task. This ensures that
> +   the task state rules are always correct independent of the kernel
> +   configuration.
> +
> +rwlock_t
> +========
> +
> +rwlock_t is a multiple readers and single writers lock mechanism.

                                             writer

> +
> +On a non PREEMPT_RT enabled kernel rwlock_t is implemented as a spinning
> +lock and the suffix rules of spinlock_t apply accordingly. The
> +implementation is fair and prevents writer starvation.
> +
> +rwlock_t and PREEMPT_RT
> +-----------------------
> +
> +On a PREEMPT_RT enabled kernel rwlock_t is mapped to a separate
> +implementation based on rt_mutex which changes the semantics:
> +
> + - Same changes as for spinlock_t
> +
> + - The implementation is not fair and can cause writer starvation under
> +   certain circumstances. The reason for this is that a writer cannot
> +   inherit its priority to multiple readers. Readers which are blocked

      ^^^^^^^ I think this is backwards. Maybe more like so:
                                                         a writer cannot
      bequeath or grant or bestow or pass down    ...    its priority to

> +   on a writer fully support the priority inheritance protocol.
> +
> +
> +PREEMPT_RT caveats
> +==================
> +
> +spinlock_t and rwlock_t
> +-----------------------
> +
> +The substitution of spinlock_t and rwlock_t on PREEMPT_RT enabled kernels
> +with RT-mutex based implementations has a few implications.
> +
> +On a non PREEMPT_RT enabled kernel the following code construct is
> +perfectly fine::
> +
> +   local_irq_disable();
> +   spin_lock(&lock);
> +
> +and fully equivalent to::
> +
> +   spin_lock_irq(&lock);
> +
> +Same applies to rwlock_t and the _irqsave() suffix variant.
> +
> +On a PREEMPT_RT enabled kernel this breaks because the RT-mutex
> +substitution expects a fully preemptible context.
> +
> +The preferred solution is to use :c:func:`spin_lock_irq()` or
> +:c:func:`spin_lock_irqsave()` and their unlock counterparts.
> +
> +PREEMPT_RT also offers a local_lock mechanism to substitute the
> +local_irq_disable/save() constructs in cases where a separation of the
> +interrupt disabling and the locking is really unavoidable. This should be
> +restricted to very rare cases.
> +
> +
> +raw_spinlock_t
> +--------------
> +
> +As raw_spinlock_t locking disables preemption and eventually interrupts the
> +code inside the critical region has to be careful to avoid calls into code

Can I buy a comma in there somewhere, please?
I don't get it as is.

> +which takes regular spinlock_t or rwlock_t. A prime example is memory
> +allocation.
> +
> +On a non PREEMPT_RT enabled kernel the following code construct is
> +perfectly fine code::
> +
> +  raw_spin_lock(&lock);
> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
> +
> +On a PREEMPT_RT enabled kernel this breaks because the memory allocator is
> +fully preemptible and therefore does not support allocations from truly
> +atomic contexts.
> +
> +Contrary to that the following code construct is perfectly fine on
> +PREEMPT_RT as spin_lock() does not disable preemption::
> +
> +  spin_lock(&lock);
> +  p = kmalloc(sizeof(*p), GFP_ATOMIC);
> +
> +Most places which use GFP_ATOMIC allocations are safe on PREEMPT_RT as the
> +execution is forced into thread context and the lock substitution is
> +ensuring preemptability.

            preemptibility           would go along with preemptible

> +
> +
> +bit spinlocks
> +-------------
> +
> +Bit spinlocks are problematic for PREEMPT_RT as they cannot be easily
> +substituted by a RT-mutex based implementation for obvious reasons.

               by an

(IMO; depends on how you pronounce it)

> +
> +The semantics of bit spinlocks are preserved on a PREEMPT_RT enabled kernel
> +and the caveats vs. raw_spinlock_t apply.
> +
> +Some bit spinlocks are substituted by regular spinlock_t for PREEMPT_RT but
> +this requires conditional (#ifdef'ed) code changes at the usage side while
> +the spinlock_t substitution is simply done by the compiler and the
> +conditionals are restricted to header files and core implementation of the
> +locking primitives and the usage sites do not require any changes.
> +
> +
> +Lock type nesting rules
> +=======================
> +
> +The most basic rules are:
> +
> +  - Lock types of the same lock category (sleeping, spinning) can nest
> +    arbitrarily as long as they respect the general lock ordering rules to
> +    prevent deadlocks.
> +
> +  - Sleeping lock types cannot nest inside spinning lock types.
> +
> +  - Spinning lock types can nest inside sleeping lock types.
> +
> +These rules apply in general independent of CONFIG_PREEMPT_RT.

                                independently

> +
> +As PREEMPT_RT changes the lock category of spinlock_t and rwlock_t from
> +spinning to sleeping this has obviously restrictions how they can nest with
> +raw_spinlock_t.
> +
> +This results in the following nest ordering:
> +
> +  1) Sleeping locks
> +  2) spinlock_t and rwlock_t
> +  3) raw_spinlock_t and bit spinlocks
> +
> +Lockdep is aware of these constraints to ensure that they are respected.
> +
> +
> +Owner semantics
> +===============
> +
> +Most lock types in the Linux kernel have strict owner semantics, i.e. the
> +context (task) which acquires a lock has to release it.
> +
> +There are two exceptions:
> +
> +  - semaphores
> +  - rwsem

       rwsems
(plural, like semaphores)

> +
> +semaphores have no strict owner semantics for historical reasons. They are
> +often used for both serialization and waiting purposes. That's generally
> +discouraged and should be replaced by separate serialization and wait
> +mechanisms.
> +
> +rwsem have grown interfaces which allow non owner release for special

   rwsems                                  non-owner

> +purposes. This usage is problematic on PREEMPT_RT because PREEMPT_RT
> +substitutes all locking primitives except semaphores with RT-mutex based
> +implementation to provide priority inheritance for all lock types except
> +the truly spinning ones. Priority inheritance on ownerless locks is
> +obviously impossible.
> +
> +For now the rwsem non-owner release excludes code which utilizes it from
> +being used on PREEMPT_RT enabled kernels. In same cases this can be
> +mitigated by disabling portions of the code, in other cases the complete
> +functionality has to be disabled until a workable solution has been found.
> 

cheers.
-- 
~Randy


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/9] Documentation: Add lock ordering and nesting documentation
  2020-03-14 22:57   ` Randy Dunlap
@ 2020-03-16 10:34     ` Sebastian Andrzej Siewior
  2020-03-16 15:13       ` Randy Dunlap
  0 siblings, 1 reply; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-16 10:34 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Paul E . McKenney, Joel Fernandes, Steven Rostedt,
	Linus Torvalds, Thomas Gleixner

On 2020-03-14 15:57:24 [-0700], Randy Dunlap wrote:
> Hi,
Hi Randy,

> A few comments for your consideration:

I merged all of you comments but two:

> On 3/13/20 10:46 AM, Sebastian Andrzej Siewior wrote:
> > +rwlock_t and PREEMPT_RT
> > +-----------------------
> > +
> > +On a PREEMPT_RT enabled kernel rwlock_t is mapped to a separate
> > +implementation based on rt_mutex which changes the semantics:
> > +
> > + - Same changes as for spinlock_t
> > +
> > + - The implementation is not fair and can cause writer starvation under
> > +   certain circumstances. The reason for this is that a writer cannot
> > +   inherit its priority to multiple readers. Readers which are blocked
> 
>       ^^^^^^^ I think this is backwards. Maybe more like so:
>                                                          a writer cannot
>       bequeath or grant or bestow or pass down    ...    its priority to

So the term "inherit" is the problem. The protocol is officially called
PI which is short for Priority Inheritance. Other documentation,
RT-mutex for instance, is also using this term when it is referring to
altering the priority of a task. For that reason I prefer to keep using
this term.

> > +   on a writer fully support the priority inheritance protocol.
> > +raw_spinlock_t
> > +--------------
> > +
> > +As raw_spinlock_t locking disables preemption and eventually interrupts the
> > +code inside the critical region has to be careful to avoid calls into code
> 
> Can I buy a comma in there somewhere, please?
> I don't get it as is.

What about

| As raw_spinlock_t locking disables preemption, and eventually interrupts, the
| code inside the critical region has to be careful to avoid calls into code

any better?

…

Sebastian

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/9] Documentation: Add lock ordering and nesting documentation
  2020-03-16 10:34     ` Sebastian Andrzej Siewior
@ 2020-03-16 15:13       ` Randy Dunlap
  0 siblings, 0 replies; 31+ messages in thread
From: Randy Dunlap @ 2020-03-16 15:13 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Paul E . McKenney, Joel Fernandes, Steven Rostedt,
	Linus Torvalds, Thomas Gleixner

On 3/16/20 3:34 AM, Sebastian Andrzej Siewior wrote:
> On 2020-03-14 15:57:24 [-0700], Randy Dunlap wrote:
>> Hi,
> Hi Randy,
> 
>> A few comments for your consideration:
> 
> I merged all of you comments but two:
> 
>> On 3/13/20 10:46 AM, Sebastian Andrzej Siewior wrote:
> …
>>> +rwlock_t and PREEMPT_RT
>>> +-----------------------
>>> +
>>> +On a PREEMPT_RT enabled kernel rwlock_t is mapped to a separate
>>> +implementation based on rt_mutex which changes the semantics:
>>> +
>>> + - Same changes as for spinlock_t
>>> +
>>> + - The implementation is not fair and can cause writer starvation under
>>> +   certain circumstances. The reason for this is that a writer cannot
>>> +   inherit its priority to multiple readers. Readers which are blocked
>>
>>       ^^^^^^^ I think this is backwards. Maybe more like so:
>>                                                          a writer cannot
>>       bequeath or grant or bestow or pass down    ...    its priority to
> 
> So the term "inherit" is the problem. The protocol is officially called
> PI which is short for Priority Inheritance. Other documentation,
> RT-mutex for instance, is also using this term when it is referring to
> altering the priority of a task. For that reason I prefer to keep using
> this term.

OK, I get it.

>>> +   on a writer fully support the priority inheritance protocol.
> …
>>> +raw_spinlock_t
>>> +--------------
>>> +
>>> +As raw_spinlock_t locking disables preemption and eventually interrupts the
>>> +code inside the critical region has to be careful to avoid calls into code
>>
>> Can I buy a comma in there somewhere, please?
>> I don't get it as is.
> 
> What about
> 
> | As raw_spinlock_t locking disables preemption, and eventually interrupts, the
> | code inside the critical region has to be careful to avoid calls into code
> 
> any better?

Yes.

thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll
  2020-03-14  6:01       ` Logan Gunthorpe
@ 2020-03-16 18:52         ` Thomas Gleixner
  2020-03-16 19:24           ` Logan Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2020-03-16 18:52 UTC (permalink / raw)
  To: Logan Gunthorpe, Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Kurt Schwemmer,
	Bjorn Helgaas, linux-pci

Logan Gunthorpe <logang@deltatee.com> writes:
> On 2020-03-13 6:23 p.m., Thomas Gleixner wrote:
> I'm in awe at the lack of professionalism in your emails. If you
> bothered to edit out the ad hominems, you might have noticed that nobody
> has yet described how the poll interface fails here (with
> EPOLLEXCLUSIVE) or how replacing one wait queue for another fixes the
> purported problem.

I merily stated an opinion, but if you consider this an ad hominem
attack, then let me ensure you this wasn't my intention and accept my
apology.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/9] completion: Use simple wait queues
  2020-03-14 15:40   ` Linus Torvalds
@ 2020-03-16 18:55     ` Thomas Gleixner
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Gleixner @ 2020-03-16 18:55 UTC (permalink / raw)
  To: Linus Torvalds, Sebastian Andrzej Siewior
  Cc: Linux Kernel Mailing List, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Paul E . McKenney, Joel Fernandes, Steven Rostedt

Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Fri, Mar 13, 2020 at 10:47 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> and before you do a conversion, you need to spend a _lot_ of time
> thinking about why that is the case.
>
> And _after_ you do the conversion, you damn well need to explain why
> it's safe. Not just state that it's a good idea.
>
> For example, this patch just randomly changes wait events to the swait
> event _exclusive_ waits. With not a single explanation of why that
> would be ok.
>
> I want an explanation for EVERY SINGLE CASE. Because people have done
> this kind of conversion before, and it's been buggy garbage before. I
> want to see that people actually thought about what the semantic
> differences were, and _documented_ that thinking process.

My bad. I'll rework the changelog so it contains the proof that the
result is semantical and functional equivalent.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll
  2020-03-16 18:52         ` Thomas Gleixner
@ 2020-03-16 19:24           ` Logan Gunthorpe
  0 siblings, 0 replies; 31+ messages in thread
From: Logan Gunthorpe @ 2020-03-16 19:24 UTC (permalink / raw)
  To: Thomas Gleixner, Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Kurt Schwemmer,
	Bjorn Helgaas, linux-pci



On 2020-03-16 12:52 p.m., Thomas Gleixner wrote:
> Logan Gunthorpe <logang@deltatee.com> writes:
>> On 2020-03-13 6:23 p.m., Thomas Gleixner wrote:
>> I'm in awe at the lack of professionalism in your emails. If you
>> bothered to edit out the ad hominems, you might have noticed that nobody
>> has yet described how the poll interface fails here (with
>> EPOLLEXCLUSIVE) or how replacing one wait queue for another fixes the
>> purported problem.
> 
> I merily stated an opinion, but if you consider this an ad hominem
> attack, then let me ensure you this wasn't my intention and accept my
> apology.

A technical opinion, and a valid argument, does *not* involve telling me
what my decision process was ("You decided this was smart to do"), or
mocking it as "hillarious" (sic).

Your actual opinion was drowned out by these attacks. And, while valid,
your opinion is very much subjective and I, personally, disagree with it.

I accept your apology and hope this doesn't happen again.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll
  2020-03-13 18:11   ` Logan Gunthorpe
  2020-03-14  0:23     ` Thomas Gleixner
@ 2020-03-16 19:34     ` Thomas Gleixner
  2020-03-16 21:53       ` Logan Gunthorpe
  1 sibling, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2020-03-16 19:34 UTC (permalink / raw)
  To: Logan Gunthorpe, Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Kurt Schwemmer,
	Bjorn Helgaas, linux-pci

Logan Gunthorpe <logang@deltatee.com> writes:
> On 2020-03-13 11:46 a.m., Sebastian Andrzej Siewior wrote:
>>  1) It cannot work with EPOLLEXCLUSIVE
>
> Why? You don't explain this.

man epoll_ctt(2)

EPOLLEXCLUSIVE (since Linux 4.5)

  Sets an exclusive wakeup mode for the epoll file descriptor that is
  being attached to the target file descriptor, fd.  When a wakeup event
  occurs and multiple epoll file descriptors are attached to the same
  target file using EPOLLEXCLUSIVE, one or more of the epoll file
  descriptors will receive an event with epoll_wait(2).

As this uses complete_all() there is no distinction possible, because
complete_all() wakes up everything.

> And I don't see how this patch would change anything to do with the
> call to poll_wait(). All you've done is open-code the completion.

wake_up_interruptible(x) resolves to: 

     __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)

which wakes exactly 1 exclusive waiter.

Also the other way round is just working because the waker side uses
complete_all(). Why? Because completion internally defaults to exclusive
mode and complete() wakes exactly one exlusive waiter.

There is a conceptual difference and while it works for that particular
purpose to some extent it's not suitable as a general wait notification
construct.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll
  2020-03-16 19:34     ` Thomas Gleixner
@ 2020-03-16 21:53       ` Logan Gunthorpe
  2020-03-17  0:17         ` Thomas Gleixner
  0 siblings, 1 reply; 31+ messages in thread
From: Logan Gunthorpe @ 2020-03-16 21:53 UTC (permalink / raw)
  To: Thomas Gleixner, Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Kurt Schwemmer,
	Bjorn Helgaas, linux-pci



On 2020-03-16 1:34 p.m., Thomas Gleixner wrote:
> Logan Gunthorpe <logang@deltatee.com> writes:
>> On 2020-03-13 11:46 a.m., Sebastian Andrzej Siewior wrote:
>>>  1) It cannot work with EPOLLEXCLUSIVE
>>
>> Why? You don't explain this.
> 
> man epoll_ctt(2)
> 
> EPOLLEXCLUSIVE (since Linux 4.5)
> 
>   Sets an exclusive wakeup mode for the epoll file descriptor that is
>   being attached to the target file descriptor, fd.  When a wakeup event
>   occurs and multiple epoll file descriptors are attached to the same
>   target file using EPOLLEXCLUSIVE, one or more of the epoll file
>   descriptors will receive an event with epoll_wait(2).
> 
> As this uses complete_all() there is no distinction possible, because
> complete_all() wakes up everything.
> 
>> And I don't see how this patch would change anything to do with the
>> call to poll_wait(). All you've done is open-code the completion.
> 
> wake_up_interruptible(x) resolves to: 
> 
>      __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)
> 
> which wakes exactly 1 exclusive waiter.
> 
> Also the other way round is just working because the waker side uses
> complete_all(). Why? Because completion internally defaults to exclusive
> mode and complete() wakes exactly one exlusive waiter.
> 
> There is a conceptual difference and while it works for that particular
> purpose to some extent it's not suitable as a general wait notification
> construct.

Ok, I now understand this point. That's exceedingly subtle.

I certainly would not agree that this qualifies as "seriously broken",
and I'm not even sure I'd agree that this actually violates the
semantics of poll() seeing the man page clearly states that with
EPOLLEXCLUSIVE set, "one or more" pollers will be woken up. So waking up
all of them is still allowed. Ensuring fewer pollers wake up is just an
optimization to avoid the thundering herd problem which users of this
interface are very unlikely to ever have (I can confidently tell you
that none have this problem now).

If we do want to say that all poll_wait() users *must* respect
EPOLLEXCLUSIVE, we should at least have some documentation saying that
combining poll_wait() with wake_up_all() (or similar) is not allowed. A
*very* quick check finds there's at least a few drivers doing this:

  drivers/char/ipmi/ipmb_dev_int.c
  drivers/dma-buf/sync_file.c
  drivers/gpu/vga/vgaarb.c

(That's just looking at the drivers tree, up to "G".)

Finally, since we seem to back to more reasonable discussion, I will
make this point: it's fairly common for wait queue users to directly use
the spinlock from within wait_queue_head_t without an interface (even
completion.c does it). How are developers supposed to know when an
interface is required and when it's not? Sometimes using
"implementation" details interface-free is standard practice, but other
times it's "yuck" and will illicit ire from other developers? Is it
valid to use completion.wait.lock? Where's the line?

Logan



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll
  2020-03-16 21:53       ` Logan Gunthorpe
@ 2020-03-17  0:17         ` Thomas Gleixner
  2020-03-17  1:15           ` Logan Gunthorpe
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Gleixner @ 2020-03-17  0:17 UTC (permalink / raw)
  To: Logan Gunthorpe, Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Kurt Schwemmer,
	Bjorn Helgaas, linux-pci

Logan,

Logan Gunthorpe <logang@deltatee.com> writes:
> On 2020-03-16 1:34 p.m., Thomas Gleixner wrote:
>> Logan Gunthorpe <logang@deltatee.com> writes:
> I certainly would not agree that this qualifies as "seriously broken",

I concede that this is the wrong wording, but chasing things like this
and constantly mopping up code is not necessarily keeping my mood up all
the time.

> and I'm not even sure I'd agree that this actually violates the
> semantics of poll() seeing the man page clearly states that with
> EPOLLEXCLUSIVE set, "one or more" pollers will be woken up. So waking up
> all of them is still allowed. Ensuring fewer pollers wake up is just an
> optimization to avoid the thundering herd problem which users of this
> interface are very unlikely to ever have (I can confidently tell you
> that none have this problem now).

I can see the point, but in general we should just strive for keeping
the interfaces consistent and not subject to interpretation by
individual developers. Maybe in your case the probability that someone
wants to use it in that way is close to zero, but we've been surprised
often enough by the creativity of user space developers who came up with
use cases nobody ever expected.

> If we do want to say that all poll_wait() users *must* respect
> EPOLLEXCLUSIVE, we should at least have some documentation saying that
> combining poll_wait() with wake_up_all() (or similar) is not allowed. A
> *very* quick check finds there's at least a few drivers doing this:
>
>   drivers/char/ipmi/ipmb_dev_int.c
>   drivers/dma-buf/sync_file.c
>   drivers/gpu/vga/vgaarb.c

Right. There is surely quite some stuff in drivers which either predates
these things or slipped through review.

> Finally, since we seem to back to more reasonable discussion, I will
> make this point: it's fairly common for wait queue users to directly use
> the spinlock from within wait_queue_head_t without an interface (even
> completion.c does it).

completions and waitqueues are both core code. Core primitives build
obviously on other primitives so dependencies on the inner workings are
expected to some degree, especially for real and valuable optimization
reasons. Solving these dependencies when one primitive changes has
limited scope.

> How are developers supposed to know when an interface is required and
> when it's not? Sometimes using "implementation" details interface-free
> is standard practice, but other times it's "yuck" and will illicit ire
> from other developers? Is it valid to use completion.wait.lock?
> Where's the line?

That's a good question, but that question simply arises due to the fact
that C does not provide proper privatizing or if you want to work around
that it forces you to come up with really horrible constructs.

The waitqueue lock usage has a very long history. It goes back to 2002
when the semaphore implementation was optimized. That exposed some of
the waitqueue internals which got consequently used elsewhere as
well. But looking at the actual usage sites, that's a pretty limited
amount. Most of them are core infrastrucure. Of course there are drivers
as well which use it for questionable reasons.

In general I think that silently using implementation details just to
scratch an itch or "optimizing" code is pretty much bad practice.

Especially as this has a tendency to spread out in creative ways. And
this happens simply because developers often copy the next thing which
looks suitable and then expand on it. As this is not necessarily caught
in reviews, this can spread to the point where the core infrastructure
cannot be changed anymore.

That's not a made up nightmare scenario. This happened in reality and
caused me to mop up 50+ interrupt chip implementations in order to be
able to make an urgently needed 10 line change to the core interrupt
infrastructure. Guess what, the vast majority of instances fiddling with
the core internals were either voodoo programming or plain bugs. There
were a few legitimate issues, but they had been better solved in the
core code upfront.  Even after that cleanup a driver got merged which
had #include "../../../../kernel/irg/internal.h" inside just because the
code which was developed out of tree before this change had be made to
"work".

That's just not a workable solution with the ever expanding code base of
the kernel.

I really prefer when people come along and show the problem they want to
solve - I'm completely fine with the POC hack which uses internals for
that purpose - so that this can be discussed and eventually integrated
into core infrastructure in one way or the other or better suitable
solutions can be found.

There are other aspects to this:

 - Using existing interfaces allows a reviewer to do the quick check on
   init, run time usage and tear down instead of wrapping his head
   around the special case

 - Verification tooling of all sorts just works

 - Improvements to the core implementation including new features are
   just coming for free.

I hope that clarifies where I'm coming from.

This has nothing to do with you personally, you just triggered a few
sensible fuses while understandably defending your admittedly smart
solution.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll
  2020-03-17  0:17         ` Thomas Gleixner
@ 2020-03-17  1:15           ` Logan Gunthorpe
  0 siblings, 0 replies; 31+ messages in thread
From: Logan Gunthorpe @ 2020-03-17  1:15 UTC (permalink / raw)
  To: Thomas Gleixner, Sebastian Andrzej Siewior, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Paul E . McKenney,
	Joel Fernandes, Steven Rostedt, Linus Torvalds, Kurt Schwemmer,
	Bjorn Helgaas, linux-pci



On 2020-03-16 6:17 p.m., Thomas Gleixner wrote:
> That's a good question, but that question simply arises due to the fact
> that C does not provide proper privatizing or if you want to work around
> that it forces you to come up with really horrible constructs.

Well, we do have the underscore convention. Though, I concede this code
could potentially predate that. Had there been a preceding underscore, I
definitely would have thought twice before touching it.

> That's not a made up nightmare scenario. This happened in reality and
> caused me to mop up 50+ interrupt chip implementations in order to be
> able to make an urgently needed 10 line change to the core interrupt
> infrastructure. Guess what, the vast majority of instances fiddling with
> the core internals were either voodoo programming or plain bugs. There
> were a few legitimate issues, but they had been better solved in the
> core code upfront.  Even after that cleanup a driver got merged which
> had #include "../../../../kernel/irg/internal.h" inside just because the
> code which was developed out of tree before this change had be made to
> "work".

I get where your coming from, and it sucks having to clean up so many
instances in an urgent situation. But I see this kind of cleanup work as
routine, a necessary thing that happens all the time. I've done it
myself a couple times before in the kernel. The hard trick is to get
developers to do more of it, before it becomes a problem like the one
you faced.

In my experience, what makes clean up work even harder is where
developers see an interface, notice it's not a perfect fit and so open
code the whole thing themselves. Then you have random buggy primitives
open coded all over the place that are impossible to find. And the
primitives themselves never improve or grow new interfaces because
nobody knows there's a bunch of instances that require the improvement.
That's a much bigger mess.

> I really prefer when people come along and show the problem they want to
> solve - I'm completely fine with the POC hack which uses internals for
> that purpose - so that this can be discussed and eventually integrated
> into core infrastructure in one way or the other or better suitable
> solutions can be found.

Yes, and this code was a prototype at one point and went through review
from a number of people in the community, and nobody complained about
this. I've also been in the situation where I submitted a POC and
somebody pointed out a better way (though with a few swears thrown in
for good measure); but in that case, I was actually changing a primitive
so it got their attention more easily.

It's impossible for the maintainer of a primitive to review all the use
cases of every primitive when new code gets merged. But at least if new
code uses/abuses the primitive they will eventually come to light and
can be cleaned up as appropriate with a holistic view.

> I hope that clarifies where I'm coming from.

Yes, I understood your point. And I concede that a completion is a
pretty trivial primitive to open code and the change is not really worth
any further fight. If the patch gets merged (preferably with a reworked
commit message), I will not complain.

> This has nothing to do with you personally, you just triggered a few
> sensible fuses while understandably defending your admittedly smart
> solution.

Thank you.

Logan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/9] lockdep: Introduce wait-type checks
  2020-03-13 17:46 ` [PATCH 6/9] lockdep: Introduce wait-type checks Sebastian Andrzej Siewior
@ 2020-03-31 13:25   ` Geert Uytterhoeven
  2020-03-31 13:42     ` Sebastian Andrzej Siewior
  2020-03-31 14:55     ` Peter Zijlstra
  0 siblings, 2 replies; 31+ messages in thread
From: Geert Uytterhoeven @ 2020-03-31 13:25 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Peter Zijlstra
  Cc: Linux Kernel Mailing List, Ingo Molnar, Will Deacon,
	Paul E . McKenney, Joel Fernandes, Steven Rostedt,
	Linus Torvalds, Thomas Gleixner, Linux ARM, Russell King,
	Catalin Marinas

Hi all,

On Fri, Mar 13, 2020 at 6:48 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Extend lockdep to validate lock wait-type context.
>
> The current wait-types are:
>
>         LD_WAIT_FREE,           /* wait free, rcu etc.. */
>         LD_WAIT_SPIN,           /* spin loops, raw_spinlock_t etc.. */
>         LD_WAIT_CONFIG,         /* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
>         LD_WAIT_SLEEP,          /* sleeping locks, mutex_t etc.. */
>
> Where lockdep validates that the current lock (the one being acquired)
> fits in the current wait-context (as generated by the held stack).
>
> This ensures that there is no attempt to acquire mutexes while holding
> spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In
> other words, its a more fancy might_sleep().
>
> Obviously RCU made the entire ordeal more complex than a simple single
> value test because RCU can be acquired in (pretty much) any context and
> while it presents a context to nested locks it is not the same as it
> got acquired in.
>
> Therefore its necessary to split the wait_type into two values, one
> representing the acquire (outer) and one representing the nested context
> (inner). For most 'normal' locks these two are the same.
>
> [ To make static initialization easier we have the rule that:
>   .outer == INV means .outer == .inner; because INV == 0. ]
>
> It further means that its required to find the minimal .inner of the held
> stack to compare against the outer of the new lock; because while 'normal'
> RCU presents a CONFIG type to nested locks, if it is taken while already
> holding a SPIN type it obviously doesn't relax the rules.
>
> Below is an example output generated by the trivial test code:
>
>   raw_spin_lock(&foo);
>   spin_lock(&bar);
>   spin_unlock(&bar);
>   raw_spin_unlock(&foo);
>
>  [ BUG: Invalid wait context ]
>  -----------------------------
>  swapper/0/1 is trying to lock:
>  ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187
>  other info that might help us debug this:
>  1 lock held by swapper/0/1:
>   #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187
>
> The way to read it is to look at the new -{n,m} part in the lock
> description; -{3:3} for the attempted lock, and try and match that up to
> the held locks, which in this case is the one: -{2,2}.
>
> This tells that the acquiring lock requires a more relaxed environment than
> presented by the lock stack.
>
> Currently only the normal locks and RCU are converted, the rest of the
> lockdep users defaults to .inner = INV which is ignored. More conversions
> can be done when desired.
>
> The check for spinlock_t nesting is not enabled by default. It's a separate
> config option for now as there are known problems which are currently
> addressed. The config option allows to identify these problems and to
> verify that the solutions found are indeed solving them.
>
> The config switch will be removed and the checks will permanently enabled
> once the vast majority of issues has been addressed.
>
> [ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile
>            failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP]
> [ tglx: Add the config option ]
>
> Requested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

On arm64 (e.g. R-Car H3 ES2.0):

+=============================
+[ BUG: Invalid wait context ]
+5.6.0-salvator-x-09423-gb29514ba13a9c459-dirty #679 Not tainted
+-----------------------------
+swapper/5/0 is trying to lock:
+ffffff86ff76f398 (&pool->lock){..-.}-{3:3}, at: __queue_work+0x134/0x430
+other info that might help us debug this:
+1 lock held by swapper/5/0:
+ #0: ffffffc01103a4a0 (rcu_read_lock){....}-{1:3}, at:
rcu_lock_acquire.constprop.59+0x0/0x38
+stack backtrace:
+CPU: 5 PID: 0 Comm: swapper/5 Not tainted
5.6.0-salvator-x-09423-gb29514ba13a9c459-dirty #679
+Hardware name: Renesas Salvator-X 2nd version board based on r8a77951 (DT)
+Call trace:
+ dump_backtrace+0x0/0x180
+ show_stack+0x14/0x1c
+ dump_stack+0xdc/0x12c
+ __lock_acquire+0x37c/0xf9c
+ lock_acquire+0x258/0x288
+ _raw_spin_lock+0x34/0x48
+ __queue_work+0x134/0x430
+ queue_work_on+0x48/0x8c
+ timers_update_nohz+0x24/0x2c
+ tick_nohz_activate.isra.15.part.16+0x5c/0x80
+ tick_setup_sched_timer+0xe0/0xf0
+ hrtimer_run_queues+0x88/0xf8
+ run_local_timers+0x20/0x58
+ update_process_times+0x24/0x50
+ tick_periodic+0xd8/0xe8
+ tick_handle_periodic+0x30/0x98
+ arch_timer_handler_phys+0x28/0x3c
+ handle_percpu_devid_irq+0x64/0x110
+ generic_handle_irq+0x20/0x34
+ __handle_domain_irq+0x94/0x98
+ gic_handle_irq+0x78/0xbc
+ el1_irq+0xf4/0x1c0
+ arch_cpu_idle+0x38/0x54
+ default_idle_call+0x2c/0x30
+ do_idle+0x13c/0x244
+ cpu_startup_entry+0x20/0x24
+ secondary_start_kernel+0x1c4/0x1d8

On arm32 (e.g SH-Mobile AG5, using Cortex-A9 TWD):

+=============================
+[ BUG: Invalid wait context ]
+5.6.0-kzm9g-09424-g2698719b4c4f35db-dirty #253 Not tainted
+-----------------------------
+swapper/0/0 is trying to lock:
+dfbc5250 (&pool->lock){..-.}-{3:3}, at: __queue_work+0x14c/0x4d0
+other info that might help us debug this:
+1 lock held by swapper/0/0:
+ #0: c0a17074 (rcu_read_lock){....}-{1:3}, at:
rcu_lock_acquire.constprop.33+0x0/0x38
+stack backtrace:
+CPU: 0 PID: 0 Comm: swapper/0 Not tainted
5.6.0-kzm9g-09424-g2698719b4c4f35db-dirty #253
+Hardware name: Generic SH73A0 (Flattened Device Tree)
+[<c010f948>] (unwind_backtrace) from [<c010bbf0>] (show_stack+0x10/0x14)
+[<c010bbf0>] (show_stack) from [<c05c3e84>] (dump_stack+0xa8/0xe0)
+[<c05c3e84>] (dump_stack) from [<c017036c>] (__lock_acquire+0x398/0x1568)
+[<c017036c>] (__lock_acquire) from [<c0171e90>] (lock_acquire+0x274/0x2ac)
+[<c0171e90>] (lock_acquire) from [<c05dfb90>] (_raw_spin_lock+0x40/0x50)
+[<c05dfb90>] (_raw_spin_lock) from [<c013adc8>] (__queue_work+0x14c/0x4d0)
+[<c013adc8>] (__queue_work) from [<c013b194>] (queue_work_on+0x48/0x6c)
+[<c013b194>] (queue_work_on) from [<c01a89e8>]
(tick_setup_sched_timer+0x148/0x188)
+[<c01a89e8>] (tick_setup_sched_timer) from [<c0197ad8>]
(hrtimer_run_queues+0x74/0x114)
+[<c0197ad8>] (hrtimer_run_queues) from [<c0195e38>]
(run_local_timers+0x14/0x54)
+[<c0195e38>] (run_local_timers) from [<c0195e9c>]
(update_process_times+0x24/0x54)
+[<c0195e9c>] (update_process_times) from [<c01a5520>]
(tick_handle_periodic+0x28/0xa0)
+[<c01a5520>] (tick_handle_periodic) from [<c010eaf8>] (twd_handler+0x2c/0x38)
+[<c010eaf8>] (twd_handler) from [<c01829c0>]
(handle_percpu_devid_irq+0x58/0xfc)
+[<c01829c0>] (handle_percpu_devid_irq) from [<c017ccc4>]
(generic_handle_irq+0x28/0x38)
+[<c017ccc4>] (generic_handle_irq) from [<c017d2fc>]
(__handle_domain_irq+0x90/0xa0)
+[<c017d2fc>] (__handle_domain_irq) from [<c0392d70>] (gic_handle_irq+0x58/0x90)
+[<c0392d70>] (gic_handle_irq) from [<c0101ab0>] (__irq_svc+0x70/0x98)
+Exception stack(0xc0a01f40 to 0xc0a01f88)
+1f40: 00000001 00000006 c0a0bac0 00000000 00000001 ffffe000 c0a08ea8 c0a94532
+1f60: c0a08eec 00000001 c09278e0 00000000 ffffe000 c0a01f90 c0108448 c010844c
+1f80: 60000013 ffffffff
+[<c0101ab0>] (__irq_svc) from [<c010844c>] (arch_cpu_idle+0x20/0x3c)
+[<c010844c>] (arch_cpu_idle) from [<c0152350>] (do_idle+0xe8/0x13c)
+[<c0152350>] (do_idle) from [<c0152718>] (cpu_startup_entry+0x18/0x1c)
+[<c0152718>] (cpu_startup_entry) from [<c0900d54>] (start_kernel+0x3f0/0x49c)
+[<c0900d54>] (start_kernel) from [<00000000>] (0x0)

I also see it on other arm32 platforms using Renesas-specific timers,
but I'll ignore those until the issues with "standard" ARM timers have
been resolved ;-)

Thanks!

Gr{oetje,eeting}s,

                        Geert


--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/9] lockdep: Introduce wait-type checks
  2020-03-31 13:25   ` Geert Uytterhoeven
@ 2020-03-31 13:42     ` Sebastian Andrzej Siewior
  2020-03-31 14:55     ` Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-03-31 13:42 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Peter Zijlstra, Linux Kernel Mailing List, Ingo Molnar,
	Will Deacon, Paul E . McKenney, Joel Fernandes, Steven Rostedt,
	Linus Torvalds, Thomas Gleixner, Linux ARM, Russell King,
	Catalin Marinas

On 2020-03-31 15:25:21 [+0200], Geert Uytterhoeven wrote:
> Hi all,
Hi Geert,

> On arm64 (e.g. R-Car H3 ES2.0):
> 
> +=============================
> +[ BUG: Invalid wait context ]
> +5.6.0-salvator-x-09423-gb29514ba13a9c459-dirty #679 Not tainted
> +-----------------------------
> +swapper/5/0 is trying to lock:
> +ffffff86ff76f398 (&pool->lock){..-.}-{3:3}, at: __queue_work+0x134/0x430

The pool->lock is not yet converted.
…
> On arm32 (e.g SH-Mobile AG5, using Cortex-A9 TWD):
> 
> +=============================
> +[ BUG: Invalid wait context ]
> +5.6.0-kzm9g-09424-g2698719b4c4f35db-dirty #253 Not tainted
> +-----------------------------
> +swapper/0/0 is trying to lock:
> +dfbc5250 (&pool->lock){..-.}-{3:3}, at: __queue_work+0x14c/0x4d0

same issue.

> I also see it on other arm32 platforms using Renesas-specific timers,
> but I'll ignore those until the issues with "standard" ARM timers have
> been resolved ;-)

There are some known culprits, like the work infrastructure. The printk
is another one. From Kconfig:

|   NOTE: There are known nesting problems. So if you enable this
|   option expect lockdep splats until these problems have been fully
|   addressed which is work in progress. This config switch allows to
|   identify and analyze these problems. It will be removed and the
|   check permanentely enabled once the main issues have been fixed.


> Thanks!
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> 

Sebastian

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/9] lockdep: Introduce wait-type checks
  2020-03-31 13:25   ` Geert Uytterhoeven
  2020-03-31 13:42     ` Sebastian Andrzej Siewior
@ 2020-03-31 14:55     ` Peter Zijlstra
  2020-03-31 15:28       ` Peter Zijlstra
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2020-03-31 14:55 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Sebastian Andrzej Siewior, Linux Kernel Mailing List,
	Ingo Molnar, Will Deacon, Paul E . McKenney, Joel Fernandes,
	Steven Rostedt, Linus Torvalds, Thomas Gleixner, Linux ARM,
	Russell King, Catalin Marinas, Frederic Weisbecker

On Tue, Mar 31, 2020 at 03:25:21PM +0200, Geert Uytterhoeven wrote:
> On arm64 (e.g. R-Car H3 ES2.0):
> 
> +=============================
> +[ BUG: Invalid wait context ]
> +5.6.0-salvator-x-09423-gb29514ba13a9c459-dirty #679 Not tainted
> +-----------------------------
> +swapper/5/0 is trying to lock:
> +ffffff86ff76f398 (&pool->lock){..-.}-{3:3}, at: __queue_work+0x134/0x430
> +other info that might help us debug this:
> +1 lock held by swapper/5/0:
> + #0: ffffffc01103a4a0 (rcu_read_lock){....}-{1:3}, at:
> rcu_lock_acquire.constprop.59+0x0/0x38
> +stack backtrace:
> +CPU: 5 PID: 0 Comm: swapper/5 Not tainted
> 5.6.0-salvator-x-09423-gb29514ba13a9c459-dirty #679
> +Hardware name: Renesas Salvator-X 2nd version board based on r8a77951 (DT)
> +Call trace:
> + dump_backtrace+0x0/0x180
> + show_stack+0x14/0x1c
> + dump_stack+0xdc/0x12c
> + __lock_acquire+0x37c/0xf9c
> + lock_acquire+0x258/0x288
> + _raw_spin_lock+0x34/0x48
> + __queue_work+0x134/0x430
> + queue_work_on+0x48/0x8c
> + timers_update_nohz+0x24/0x2c
> + tick_nohz_activate.isra.15.part.16+0x5c/0x80
> + tick_setup_sched_timer+0xe0/0xf0
> + hrtimer_run_queues+0x88/0xf8

So this is complaining that it cannot take pool->lock, which is
WAIT_CONFIG while holding RCU, which presents a WAIT_CONFIG context.

This seems to implicate something is amiss, because that should be
allowed. The thing it doesn't print is the context, which in the above
case is a (hrtimer) interrupt.

I suspect this really is a hardirq context and the next patch won't cure
things. It looks nohz (full?) related.

Frederic, can you untangle this?

---

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 1511690e4de7..ac10db66cc63 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3952,10 +3952,36 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 	return ret;
 }
 
+static inline short task_wait_context(struct task_struct *curr)
+{
+	/*
+	 * Set appropriate wait type for the context; for IRQs we have to take
+	 * into account force_irqthread as that is implied by PREEMPT_RT.
+	 */
+	if (curr->hardirq_context) {
+		/*
+		 * Check if force_irqthreads will run us threaded.
+		 */
+		if (curr->hardirq_threaded || curr->irq_config)
+			return LD_WAIT_CONFIG;
+
+		return LD_WAIT_SPIN;
+	} else if (curr->softirq_context) {
+		/*
+		 * Softirqs are always threaded.
+		 */
+		return LD_WAIT_CONFIG;
+	}
+
+	return LD_WAIT_MAX;
+}
+
 static int
 print_lock_invalid_wait_context(struct task_struct *curr,
 				struct held_lock *hlock)
 {
+	short curr_inner;
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
@@ -3971,6 +3997,10 @@ print_lock_invalid_wait_context(struct task_struct *curr,
 	print_lock(hlock);
 
 	pr_warn("other info that might help us debug this:\n");
+
+	curr_inner = task_wait_context(curr);
+	pr_warn("context-{%d:%d}\n", curr_inner, curr_inner);
+
 	lockdep_print_held_locks(curr);
 
 	pr_warn("stack backtrace:\n");
@@ -4017,26 +4047,7 @@ static int check_wait_context(struct task_struct *curr, struct held_lock *next)
 	}
 	depth++;
 
-	/*
-	 * Set appropriate wait type for the context; for IRQs we have to take
-	 * into account force_irqthread as that is implied by PREEMPT_RT.
-	 */
-	if (curr->hardirq_context) {
-		/*
-		 * Check if force_irqthreads will run us threaded.
-		 */
-		if (curr->hardirq_threaded || curr->irq_config)
-			curr_inner = LD_WAIT_CONFIG;
-		else
-			curr_inner = LD_WAIT_SPIN;
-	} else if (curr->softirq_context) {
-		/*
-		 * Softirqs are always threaded.
-		 */
-		curr_inner = LD_WAIT_CONFIG;
-	} else {
-		curr_inner = LD_WAIT_MAX;
-	}
+	curr_inner = task_wait_context(curr);
 
 	for (; depth < curr->lockdep_depth; depth++) {
 		struct held_lock *prev = curr->held_locks + depth;

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/9] lockdep: Introduce wait-type checks
  2020-03-31 14:55     ` Peter Zijlstra
@ 2020-03-31 15:28       ` Peter Zijlstra
  2020-03-31 15:37         ` Frederic Weisbecker
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2020-03-31 15:28 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Sebastian Andrzej Siewior, Linux Kernel Mailing List,
	Ingo Molnar, Will Deacon, Paul E . McKenney, Joel Fernandes,
	Steven Rostedt, Linus Torvalds, Thomas Gleixner, Linux ARM,
	Russell King, Catalin Marinas, Frederic Weisbecker

On Tue, Mar 31, 2020 at 04:55:15PM +0200, Peter Zijlstra wrote:
> On Tue, Mar 31, 2020 at 03:25:21PM +0200, Geert Uytterhoeven wrote:
> > On arm64 (e.g. R-Car H3 ES2.0):
> > 
> > +=============================
> > +[ BUG: Invalid wait context ]
> > +5.6.0-salvator-x-09423-gb29514ba13a9c459-dirty #679 Not tainted
> > +-----------------------------
> > +swapper/5/0 is trying to lock:
> > +ffffff86ff76f398 (&pool->lock){..-.}-{3:3}, at: __queue_work+0x134/0x430
> > +other info that might help us debug this:
> > +1 lock held by swapper/5/0:
> > + #0: ffffffc01103a4a0 (rcu_read_lock){....}-{1:3}, at:
> > rcu_lock_acquire.constprop.59+0x0/0x38
> > +stack backtrace:
> > +CPU: 5 PID: 0 Comm: swapper/5 Not tainted
> > 5.6.0-salvator-x-09423-gb29514ba13a9c459-dirty #679
> > +Hardware name: Renesas Salvator-X 2nd version board based on r8a77951 (DT)
> > +Call trace:
> > + dump_backtrace+0x0/0x180
> > + show_stack+0x14/0x1c
> > + dump_stack+0xdc/0x12c
> > + __lock_acquire+0x37c/0xf9c
> > + lock_acquire+0x258/0x288
> > + _raw_spin_lock+0x34/0x48
> > + __queue_work+0x134/0x430
> > + queue_work_on+0x48/0x8c
> > + timers_update_nohz+0x24/0x2c
> > + tick_nohz_activate.isra.15.part.16+0x5c/0x80
> > + tick_setup_sched_timer+0xe0/0xf0
> > + hrtimer_run_queues+0x88/0xf8
> 
> So this is complaining that it cannot take pool->lock, which is
> WAIT_CONFIG while holding RCU, which presents a WAIT_CONFIG context.
> 
> This seems to implicate something is amiss, because that should be
> allowed. The thing it doesn't print is the context, which in the above
> case is a (hrtimer) interrupt.
> 
> I suspect this really is a hardirq context and the next patch won't cure
> things. It looks nohz (full?) related.
> 
> Frederic, can you untangle this?

Sebastian is right; I completely forgot the workqueue thing was still
pending.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/9] lockdep: Introduce wait-type checks
  2020-03-31 15:28       ` Peter Zijlstra
@ 2020-03-31 15:37         ` Frederic Weisbecker
  0 siblings, 0 replies; 31+ messages in thread
From: Frederic Weisbecker @ 2020-03-31 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Geert Uytterhoeven, Sebastian Andrzej Siewior,
	Linux Kernel Mailing List, Ingo Molnar, Will Deacon,
	Paul E . McKenney, Joel Fernandes, Steven Rostedt,
	Linus Torvalds, Thomas Gleixner, Linux ARM, Russell King,
	Catalin Marinas, Frederic Weisbecker

On Tue, Mar 31, 2020 at 05:28:50PM +0200, Peter Zijlstra wrote:
> On Tue, Mar 31, 2020 at 04:55:15PM +0200, Peter Zijlstra wrote:
> > On Tue, Mar 31, 2020 at 03:25:21PM +0200, Geert Uytterhoeven wrote:
> > > On arm64 (e.g. R-Car H3 ES2.0):
> > > 
> > > +=============================
> > > +[ BUG: Invalid wait context ]
> > > +5.6.0-salvator-x-09423-gb29514ba13a9c459-dirty #679 Not tainted
> > > +-----------------------------
> > > +swapper/5/0 is trying to lock:
> > > +ffffff86ff76f398 (&pool->lock){..-.}-{3:3}, at: __queue_work+0x134/0x430
> > > +other info that might help us debug this:
> > > +1 lock held by swapper/5/0:
> > > + #0: ffffffc01103a4a0 (rcu_read_lock){....}-{1:3}, at:
> > > rcu_lock_acquire.constprop.59+0x0/0x38
> > > +stack backtrace:
> > > +CPU: 5 PID: 0 Comm: swapper/5 Not tainted
> > > 5.6.0-salvator-x-09423-gb29514ba13a9c459-dirty #679
> > > +Hardware name: Renesas Salvator-X 2nd version board based on r8a77951 (DT)
> > > +Call trace:
> > > + dump_backtrace+0x0/0x180
> > > + show_stack+0x14/0x1c
> > > + dump_stack+0xdc/0x12c
> > > + __lock_acquire+0x37c/0xf9c
> > > + lock_acquire+0x258/0x288
> > > + _raw_spin_lock+0x34/0x48
> > > + __queue_work+0x134/0x430
> > > + queue_work_on+0x48/0x8c
> > > + timers_update_nohz+0x24/0x2c
> > > + tick_nohz_activate.isra.15.part.16+0x5c/0x80
> > > + tick_setup_sched_timer+0xe0/0xf0
> > > + hrtimer_run_queues+0x88/0xf8
> > 
> > So this is complaining that it cannot take pool->lock, which is
> > WAIT_CONFIG while holding RCU, which presents a WAIT_CONFIG context.
> > 
> > This seems to implicate something is amiss, because that should be
> > allowed. The thing it doesn't print is the context, which in the above
> > case is a (hrtimer) interrupt.
> > 
> > I suspect this really is a hardirq context and the next patch won't cure
> > things. It looks nohz (full?) related.
> > 
> > Frederic, can you untangle this?
> 
> Sebastian is right; I completely forgot the workqueue thing was still
> pending.
> 

Ok good, because I had no better answer :)

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2020-03-31 15:37 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-13 17:46 [PATCH 0/9] Lock ordering documentation and annotation for lockdep Sebastian Andrzej Siewior
2020-03-13 17:46 ` [PATCH 1/9] Documentation: Add lock ordering and nesting documentation Sebastian Andrzej Siewior
2020-03-14 22:57   ` Randy Dunlap
2020-03-16 10:34     ` Sebastian Andrzej Siewior
2020-03-16 15:13       ` Randy Dunlap
2020-03-13 17:46 ` [PATCH 2/9] timekeeping: Split jiffies seqlock Sebastian Andrzej Siewior
2020-03-13 17:46 ` [PATCH 3/9] pci/switchtec: Don't abuse completion wait queue for poll Sebastian Andrzej Siewior
2020-03-13 18:11   ` Logan Gunthorpe
2020-03-14  0:23     ` Thomas Gleixner
2020-03-14  6:01       ` Logan Gunthorpe
2020-03-16 18:52         ` Thomas Gleixner
2020-03-16 19:24           ` Logan Gunthorpe
2020-03-16 19:34     ` Thomas Gleixner
2020-03-16 21:53       ` Logan Gunthorpe
2020-03-17  0:17         ` Thomas Gleixner
2020-03-17  1:15           ` Logan Gunthorpe
2020-03-13 19:31   ` Peter Zijlstra
2020-03-13 17:46 ` [PATCH 4/9] sched/swait: Prepare usage in completions Sebastian Andrzej Siewior
2020-03-14  0:26   ` Thomas Gleixner
2020-03-13 17:46 ` [PATCH 5/9] completion: Use simple wait queues Sebastian Andrzej Siewior
2020-03-14 15:40   ` Linus Torvalds
2020-03-16 18:55     ` Thomas Gleixner
2020-03-13 17:46 ` [PATCH 6/9] lockdep: Introduce wait-type checks Sebastian Andrzej Siewior
2020-03-31 13:25   ` Geert Uytterhoeven
2020-03-31 13:42     ` Sebastian Andrzej Siewior
2020-03-31 14:55     ` Peter Zijlstra
2020-03-31 15:28       ` Peter Zijlstra
2020-03-31 15:37         ` Frederic Weisbecker
2020-03-13 17:46 ` [PATCH 7/9] lockdep: Add hrtimer context tracing bits Sebastian Andrzej Siewior
2020-03-13 17:47 ` [PATCH 8/9] lockdep: Annotate irq_work Sebastian Andrzej Siewior
2020-03-13 17:47 ` [PATCH 9/9] lockdep: Add posixtimer context tracing bits Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).