linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH 0/7] Fix the scheduler-IPI mess.
@ 2020-05-26 16:10 Peter Zijlstra
  2020-05-26 16:10 ` [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB Peter Zijlstra
                   ` (6 more replies)
  0 siblings, 7 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-26 16:10 UTC (permalink / raw)
  To: tglx, frederic; +Cc: linux-kernel, x86, cai, mgorman, peterz

Hi!

So commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()") is pretty buggered, and
I'm afraid this is the 'prettiest' pile of puke I could come up with.

It boots and builds a kernel, but it already did that for me.

I do think there's definite room for further integration of smp_function_call
and irq_work, but figured this was quite enough for now.

Please look hard, there be dragons here, I'm sure.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB
  2020-05-26 16:10 [RFC][PATCH 0/7] Fix the scheduler-IPI mess Peter Zijlstra
@ 2020-05-26 16:10 ` Peter Zijlstra
  2020-05-26 23:56   ` Frederic Weisbecker
                     ` (3 more replies)
  2020-05-26 16:10 ` [RFC][PATCH 2/7] smp: Optimize flush_smp_call_function_queue() Peter Zijlstra
                   ` (5 subsequent siblings)
  6 siblings, 4 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-26 16:10 UTC (permalink / raw)
  To: tglx, frederic; +Cc: linux-kernel, x86, cai, mgorman, peterz

The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
got smp_call_function_single_async() subtly wrong. Even though it will
return -EBUSY when trying to re-use a csd, that condition is not
atomic and still requires external serialization.

The change in kick_ilb() got this wrong.

While on first reading kick_ilb() has an atomic test-and-set that
appears to serialize the use, the matching 'release' is not in the
right place to actually guarantee this serialization.

Rework the nohz_idle_balance() trigger so that the release is in the
IPI callback and thus guarantees the required serialization for the
CSD.

Fixes: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h      |    4 +
 include/linux/sched/idle.h |   55 ++++++++++++++++++
 kernel/sched/core.c        |  132 +++++++++------------------------------------
 kernel/sched/fair.c        |   18 ++----
 kernel/sched/idle.c        |    3 -
 kernel/sched/sched.h       |    2 
 kernel/smp.c               |    7 ++
 7 files changed, 102 insertions(+), 119 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -637,41 +568,25 @@ void wake_up_nohz_cpu(int cpu)
 		wake_up_idle_cpu(cpu);
 }
 
-static inline bool got_nohz_idle_kick(void)
+static void nohz_csd_func(void *info)
 {
-	int cpu = smp_processor_id();
-
-	if (!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK))
-		return false;
-
-	if (idle_cpu(cpu) && !need_resched())
-		return true;
+	struct rq *rq = info;
+	int cpu = cpu_of(rq);
+	unsigned int flags;
 
 	/*
-	 * We can't run Idle Load Balance on this CPU for this time so we
-	 * cancel it and clear NOHZ_BALANCE_KICK
+	 * Release the rq::nohz_csd.
 	 */
-	atomic_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
-	return false;
-}
-
-static void nohz_csd_func(void *info)
-{
-	struct rq *rq = info;
+	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
+	WARN_ON(!(flags & NOHZ_KICK_MASK));
 
-	if (got_nohz_idle_kick()) {
-		rq->idle_balance = 1;
+	rq->idle_balance = idle_cpu(cpu);
+	if (rq->idle_balance && !need_resched()) {
+		rq->nohz_idle_balance = flags;
 		raise_softirq_irqoff(SCHED_SOFTIRQ);
 	}
 }
 
-#else /* CONFIG_NO_HZ_COMMON */
-
-static inline bool got_nohz_idle_kick(void)
-{
-	return false;
-}
-
 #endif /* CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10024,6 +10024,10 @@ static void kick_ilb(unsigned int flags)
 	if (ilb_cpu >= nr_cpu_ids)
 		return;
 
+	/*
+	 * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
+	 * the first flag owns it; cleared by nohz_csd_func().
+	 */
 	flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
 	if (flags & NOHZ_KICK_MASK)
 		return;
@@ -10371,20 +10375,14 @@ static bool _nohz_idle_balance(struct rq
  */
 static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 {
-	int this_cpu = this_rq->cpu;
-	unsigned int flags;
+	unsigned int flags = this_rq->nohz_idle_balance;
 
-	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
+	if (!flags)
 		return false;
 
-	if (idle != CPU_IDLE) {
-		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
-		return false;
-	}
+	this_rq->nohz_idle_balance = 0;
 
-	/* could be _relaxed() */
-	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
-	if (!(flags & NOHZ_KICK_MASK))
+	if (idle != CPU_IDLE)
 		return false;
 
 	_nohz_idle_balance(this_rq, flags, idle);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -951,6 +951,7 @@ struct rq {
 
 	struct callback_head	*balance_callback;
 
+	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
 
 	unsigned long		misfit_task_load;




* [RFC][PATCH 2/7] smp: Optimize flush_smp_call_function_queue()
  2020-05-26 16:10 [RFC][PATCH 0/7] Fix the scheduler-IPI mess Peter Zijlstra
  2020-05-26 16:10 ` [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB Peter Zijlstra
@ 2020-05-26 16:10 ` Peter Zijlstra
  2020-05-28 12:28   ` Frederic Weisbecker
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2020-05-26 16:11 ` [RFC][PATCH 3/7] smp: Move irq_work_run() out of flush_smp_call_function_queue() Peter Zijlstra
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-26 16:10 UTC (permalink / raw)
  To: tglx, frederic; +Cc: linux-kernel, x86, cai, mgorman, peterz

The call_single_queue can contain (two) different callbacks,
synchronous and asynchronous. The current interrupt handler runs them
in-order, which means that remote CPUs that are waiting for their
synchronous call can be delayed by running asynchronous callbacks.

Rework the interrupt handler to first run the synchronous callbacks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/smp.c |   27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -209,9 +209,9 @@ void generic_smp_call_function_single_in
  */
 static void flush_smp_call_function_queue(bool warn_cpu_offline)
 {
-	struct llist_head *head;
-	struct llist_node *entry;
 	call_single_data_t *csd, *csd_next;
+	struct llist_node *entry, *prev;
+	struct llist_head *head;
 	static bool warned;
 
 	lockdep_assert_irqs_disabled();
@@ -235,21 +235,40 @@ static void flush_smp_call_function_queu
 				csd->func);
 	}
 
+	/*
+	 * First; run all SYNC callbacks, people are waiting for us.
+	 */
+	prev = NULL;
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
 		smp_call_func_t func = csd->func;
 		void *info = csd->info;
 
 		/* Do we wait until *after* callback? */
 		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
+			if (prev) {
+				prev->next = &csd_next->llist;
+			} else {
+				entry = &csd_next->llist;
+			}
 			func(info);
 			csd_unlock(csd);
 		} else {
-			csd_unlock(csd);
-			func(info);
+			prev = &csd->llist;
 		}
 	}
 
 	/*
+	 * Second; run all !SYNC callbacks.
+	 */
+	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
+		smp_call_func_t func = csd->func;
+		void *info = csd->info;
+
+		csd_unlock(csd);
+		func(info);
+	}
+
+	/*
 	 * Handle irq works queued remotely by irq_work_queue_on().
 	 * Smp functions above are typically synchronous so they
 	 * better run first since some other CPUs may be busy waiting




* [RFC][PATCH 3/7] smp: Move irq_work_run() out of flush_smp_call_function_queue()
  2020-05-26 16:10 [RFC][PATCH 0/7] Fix the scheduler-IPI mess Peter Zijlstra
  2020-05-26 16:10 ` [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB Peter Zijlstra
  2020-05-26 16:10 ` [RFC][PATCH 2/7] smp: Optimize flush_smp_call_function_queue() Peter Zijlstra
@ 2020-05-26 16:11 ` Peter Zijlstra
  2020-05-29 13:04   ` Frederic Weisbecker
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2020-05-26 16:11 ` [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi() Peter Zijlstra
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-26 16:11 UTC (permalink / raw)
  To: tglx, frederic; +Cc: linux-kernel, x86, cai, mgorman, peterz

This ensures flush_smp_call_function_queue() is strictly about
call_single_queue.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/smp.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -84,6 +84,7 @@ int smpcfd_dying_cpu(unsigned int cpu)
 	 * still pending.
 	 */
 	flush_smp_call_function_queue(false);
+	irq_work_run();
 	return 0;
 }
 
@@ -191,6 +192,14 @@ static int generic_exec_single(int cpu,
 void generic_smp_call_function_single_interrupt(void)
 {
 	flush_smp_call_function_queue(true);
+
+	/*
+	 * Handle irq works queued remotely by irq_work_queue_on().
+	 * Smp functions above are typically synchronous so they
+	 * better run first since some other CPUs may be busy waiting
+	 * for them.
+	 */
+	irq_work_run();
 }
 
 /**
@@ -267,14 +276,6 @@ static void flush_smp_call_function_queu
 		csd_unlock(csd);
 		func(info);
 	}
-
-	/*
-	 * Handle irq works queued remotely by irq_work_queue_on().
-	 * Smp functions above are typically synchronous so they
-	 * better run first since some other CPUs may be busy waiting
-	 * for them.
-	 */
-	irq_work_run();
 }
 
 /*




* [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2020-05-26 16:10 [RFC][PATCH 0/7] Fix the scheduler-IPI mess Peter Zijlstra
                   ` (2 preceding siblings ...)
  2020-05-26 16:11 ` [RFC][PATCH 3/7] smp: Move irq_work_run() out of flush_smp_call_function_queue() Peter Zijlstra
@ 2020-05-26 16:11 ` Peter Zijlstra
  2020-05-27  9:56   ` Peter Zijlstra
                     ` (2 more replies)
  2020-05-26 16:11 ` [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue Peter Zijlstra
                   ` (2 subsequent siblings)
  6 siblings, 3 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-26 16:11 UTC (permalink / raw)
  To: tglx, frederic; +Cc: linux-kernel, x86, cai, mgorman, peterz

Just like the ttwu_queue_remote() IPI, make use of _TIF_POLLING_NRFLAG
to avoid sending IPIs to idle CPUs.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   10 ++++++++++
 kernel/sched/idle.c  |    1 +
 kernel/sched/sched.h |    2 ++
 kernel/smp.c         |   16 +++++++++++++++-
 4 files changed, 28 insertions(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2296,6 +2296,16 @@ static void wake_csd_func(void *info)
 	sched_ttwu_pending();
 }
 
+void send_call_function_single_ipi(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	if (!set_nr_if_polling(rq->idle))
+		arch_send_call_function_single_ipi(cpu);
+	else
+		trace_sched_wake_idle_without_ipi(cpu);
+}
+
 /*
  * Queue a task on the target CPUs wake_list and wake the CPU via IPI if
  * necessary. The wakee CPU on receipt of the IPI will queue the task
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -289,6 +289,7 @@ static void do_idle(void)
 	 */
 	smp_mb__after_atomic();
 
+	flush_smp_call_function_from_idle();
 	sched_ttwu_pending();
 	schedule_idle();
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1506,6 +1506,8 @@ static inline void unregister_sched_doma
 }
 #endif
 
+extern void flush_smp_call_function_from_idle(void);
+
 #else
 
 static inline void sched_ttwu_pending(void) { }
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -135,6 +135,8 @@ static __always_inline void csd_unlock(c
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
 
+extern void send_call_function_single_ipi(int cpu);
+
 /*
  * Insert a previously allocated call_single_data_t element
  * for execution on the given CPU. data must already have
@@ -178,7 +180,7 @@ static int generic_exec_single(int cpu,
 	 * equipped to do the right thing...
 	 */
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
-		arch_send_call_function_single_ipi(cpu);
+		send_call_function_single_ipi(cpu);
 
 	return 0;
 }
@@ -278,6 +280,18 @@ static void flush_smp_call_function_queu
 	}
 }
 
+void flush_smp_call_function_from_idle(void)
+{
+	unsigned long flags;
+
+	if (llist_empty(this_cpu_ptr(&call_single_queue)))
+		return;
+
+	local_irq_save(flags);
+	flush_smp_call_function_queue(true);
+	local_irq_restore(flags);
+}
+
 /*
  * smp_call_function_single - Run a function on a specific CPU
  * @func: The function to run. This must be fast and non-blocking.




* [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue
  2020-05-26 16:10 [RFC][PATCH 0/7] Fix the scheduler-IPI mess Peter Zijlstra
                   ` (3 preceding siblings ...)
  2020-05-26 16:11 ` [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi() Peter Zijlstra
@ 2020-05-26 16:11 ` Peter Zijlstra
  2020-05-28 23:40   ` Frederic Weisbecker
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2020-05-26 16:11 ` [RFC][PATCH 6/7] sched: Add rq::ttwu_pending Peter Zijlstra
  2020-05-26 16:11 ` [RFC][PATCH 7/7] sched: Replace rq::wake_list Peter Zijlstra
  6 siblings, 2 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-26 16:11 UTC (permalink / raw)
  To: tglx, frederic; +Cc: linux-kernel, x86, cai, mgorman, peterz

Currently irq_work_queue_on() will issue an unconditional
arch_send_call_function_single_ipi() and has the handler do
irq_work_run().

This is unfortunate in that it makes the IPI handler look at a second
cacheline and it misses the opportunity to avoid the IPI. Instead note
that struct irq_work and struct __call_single_data are very similar in
layout, so use a few bits in the flags word to encode a type and stick
the irq_work on the call_single_queue list.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/irq_work.h |    7 ++
 include/linux/smp.h      |   23 ++++++++-
 kernel/irq_work.c        |   53 +++++++++++---------
 kernel/smp.c             |  119 ++++++++++++++++++++++++++++-------------------
 4 files changed, 130 insertions(+), 72 deletions(-)

--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -13,6 +13,8 @@
  * busy      NULL, 2 -> {free, claimed} : callback in progress, can be claimed
  */
 
+/* flags share CSD_FLAG_ space */
+
 #define IRQ_WORK_PENDING	BIT(0)
 #define IRQ_WORK_BUSY		BIT(1)
 
@@ -23,9 +25,12 @@
 
 #define IRQ_WORK_CLAIMED	(IRQ_WORK_PENDING | IRQ_WORK_BUSY)
 
+/*
+ * structure shares layout with call_single_data_t.
+ */
 struct irq_work {
-	atomic_t flags;
 	struct llist_node llnode;
+	atomic_t flags;
 	void (*func)(struct irq_work *);
 };
 
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -16,17 +16,38 @@
 
 typedef void (*smp_call_func_t)(void *info);
 typedef bool (*smp_cond_func_t)(int cpu, void *info);
+
+enum {
+	CSD_FLAG_LOCK		= 0x01,
+
+	/* IRQ_WORK_flags */
+
+	CSD_TYPE_ASYNC		= 0x00,
+	CSD_TYPE_SYNC		= 0x10,
+	CSD_TYPE_IRQ_WORK	= 0x20,
+	CSD_FLAG_TYPE_MASK	= 0xF0,
+};
+
+/*
+ * structure shares (partial) layout with struct irq_work
+ */
 struct __call_single_data {
 	struct llist_node llist;
+	unsigned int flags;
 	smp_call_func_t func;
 	void *info;
-	unsigned int flags;
 };
 
 /* Use __aligned() to avoid to use 2 cache lines for 1 csd */
 typedef struct __call_single_data call_single_data_t
 	__aligned(sizeof(struct __call_single_data));
 
+/*
+ * Enqueue a llist_node on the call_single_queue; be very careful, read
+ * flush_smp_call_function_queue() in detail.
+ */
+extern void __smp_call_single_queue(int cpu, struct llist_node *node);
+
 /* total number of cpus in this system (may exceed NR_CPUS) */
 extern unsigned int total_cpus;
 
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -31,7 +31,7 @@ static bool irq_work_claim(struct irq_wo
 {
 	int oflags;
 
-	oflags = atomic_fetch_or(IRQ_WORK_CLAIMED, &work->flags);
+	oflags = atomic_fetch_or(IRQ_WORK_CLAIMED | CSD_TYPE_IRQ_WORK, &work->flags);
 	/*
 	 * If the work is already pending, no need to raise the IPI.
 	 * The pairing atomic_fetch_andnot() in irq_work_run() makes sure
@@ -102,8 +102,7 @@ bool irq_work_queue_on(struct irq_work *
 	if (cpu != smp_processor_id()) {
 		/* Arch remote IPI send/receive backend aren't NMI safe */
 		WARN_ON_ONCE(in_nmi());
-		if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
-			arch_send_call_function_single_ipi(cpu);
+		__smp_call_single_queue(cpu, &work->llnode);
 	} else {
 		__irq_work_queue_local(work);
 	}
@@ -131,6 +130,31 @@ bool irq_work_needs_cpu(void)
 	return true;
 }
 
+void irq_work_single(void *arg)
+{
+	struct irq_work *work = arg;
+	int flags;
+
+	/*
+	 * Clear the PENDING bit, after this point the @work
+	 * can be re-used.
+	 * Make it immediately visible so that other CPUs trying
+	 * to claim that work don't rely on us to handle their data
+	 * while we are in the middle of the func.
+	 */
+	flags = atomic_fetch_andnot(IRQ_WORK_PENDING, &work->flags);
+
+	lockdep_irq_work_enter(work);
+	work->func(work);
+	lockdep_irq_work_exit(work);
+	/*
+	 * Clear the BUSY bit and return to the free state if
+	 * no-one else claimed it meanwhile.
+	 */
+	flags &= ~IRQ_WORK_PENDING;
+	(void)atomic_cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);
+}
+
 static void irq_work_run_list(struct llist_head *list)
 {
 	struct irq_work *work, *tmp;
@@ -142,27 +166,8 @@ static void irq_work_run_list(struct lli
 		return;
 
 	llnode = llist_del_all(list);
-	llist_for_each_entry_safe(work, tmp, llnode, llnode) {
-		int flags;
-		/*
-		 * Clear the PENDING bit, after this point the @work
-		 * can be re-used.
-		 * Make it immediately visible so that other CPUs trying
-		 * to claim that work don't rely on us to handle their data
-		 * while we are in the middle of the func.
-		 */
-		flags = atomic_fetch_andnot(IRQ_WORK_PENDING, &work->flags);
-
-		lockdep_irq_work_enter(work);
-		work->func(work);
-		lockdep_irq_work_exit(work);
-		/*
-		 * Clear the BUSY bit and return to the free state if
-		 * no-one else claimed it meanwhile.
-		 */
-		flags &= ~IRQ_WORK_PENDING;
-		(void)atomic_cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);
-	}
+	llist_for_each_entry_safe(work, tmp, llnode, llnode)
+		irq_work_single(work);
 }
 
 /*
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -23,10 +23,8 @@
 
 #include "smpboot.h"
 
-enum {
-	CSD_FLAG_LOCK		= 0x01,
-	CSD_FLAG_SYNCHRONOUS	= 0x02,
-};
+
+#define CSD_TYPE(_csd)	((_csd)->flags & CSD_FLAG_TYPE_MASK)
 
 struct call_function_data {
 	call_single_data_t	__percpu *csd;
@@ -137,15 +135,33 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(cal
 
 extern void send_call_function_single_ipi(int cpu);
 
+void __smp_call_single_queue(int cpu, struct llist_node *node)
+{
+	/*
+	 * The list addition should be visible before sending the IPI
+	 * handler locks the list to pull the entry off it because of
+	 * normal cache coherency rules implied by spinlocks.
+	 *
+	 * If IPIs can go out of order to the cache coherency protocol
+	 * in an architecture, sufficient synchronisation should be added
+	 * to arch code to make it appear to obey cache coherency WRT
+	 * locking and barrier primitives. Generic code isn't really
+	 * equipped to do the right thing...
+	 */
+	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
+		send_call_function_single_ipi(cpu);
+}
+
 /*
  * Insert a previously allocated call_single_data_t element
  * for execution on the given CPU. data must already have
  * ->func, ->info, and ->flags set.
  */
-static int generic_exec_single(int cpu, call_single_data_t *csd,
-			       smp_call_func_t func, void *info)
+static int generic_exec_single(int cpu, call_single_data_t *csd)
 {
 	if (cpu == smp_processor_id()) {
+		smp_call_func_t func = csd->func;
+		void *info = csd->info;
 		unsigned long flags;
 
 		/*
@@ -159,28 +175,12 @@ static int generic_exec_single(int cpu,
 		return 0;
 	}
 
-
 	if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu)) {
 		csd_unlock(csd);
 		return -ENXIO;
 	}
 
-	csd->func = func;
-	csd->info = info;
-
-	/*
-	 * The list addition should be visible before sending the IPI
-	 * handler locks the list to pull the entry off it because of
-	 * normal cache coherency rules implied by spinlocks.
-	 *
-	 * If IPIs can go out of order to the cache coherency protocol
-	 * in an architecture, sufficient synchronisation should be added
-	 * to arch code to make it appear to obey cache coherency WRT
-	 * locking and barrier primitives. Generic code isn't really
-	 * equipped to do the right thing...
-	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
-		send_call_function_single_ipi(cpu);
+	__smp_call_single_queue(cpu, &csd->llist);
 
 	return 0;
 }
@@ -194,16 +194,10 @@ static int generic_exec_single(int cpu,
 void generic_smp_call_function_single_interrupt(void)
 {
 	flush_smp_call_function_queue(true);
-
-	/*
-	 * Handle irq works queued remotely by irq_work_queue_on().
-	 * Smp functions above are typically synchronous so they
-	 * better run first since some other CPUs may be busy waiting
-	 * for them.
-	 */
-	irq_work_run();
 }
 
+extern void irq_work_single(void *);
+
 /**
  * flush_smp_call_function_queue - Flush pending smp-call-function callbacks
  *
@@ -241,9 +235,21 @@ static void flush_smp_call_function_queu
 		 * We don't have to use the _safe() variant here
 		 * because we are not invoking the IPI handlers yet.
 		 */
-		llist_for_each_entry(csd, entry, llist)
-			pr_warn("IPI callback %pS sent to offline CPU\n",
-				csd->func);
+		llist_for_each_entry(csd, entry, llist) {
+			switch (CSD_TYPE(csd)) {
+			case CSD_TYPE_ASYNC:
+			case CSD_TYPE_SYNC:
+			case CSD_TYPE_IRQ_WORK:
+				pr_warn("IPI callback %pS sent to offline CPU\n",
+					csd->func);
+				break;
+
+			default:
+				pr_warn("IPI callback, unknown type %d, sent to offline CPU\n",
+					CSD_TYPE(csd));
+				break;
+			}
+		}
 	}
 
 	/*
@@ -251,16 +257,17 @@ static void flush_smp_call_function_queu
 	 */
 	prev = NULL;
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
-		smp_call_func_t func = csd->func;
-		void *info = csd->info;
-
 		/* Do we wait until *after* callback? */
-		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
+		if (CSD_TYPE(csd) == CSD_TYPE_SYNC) {
+			smp_call_func_t func = csd->func;
+			void *info = csd->info;
+
 			if (prev) {
 				prev->next = &csd_next->llist;
 			} else {
 				entry = &csd_next->llist;
 			}
+
 			func(info);
 			csd_unlock(csd);
 		} else {
@@ -272,11 +279,17 @@ static void flush_smp_call_function_queu
 	 * Second; run all !SYNC callbacks.
 	 */
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
-		smp_call_func_t func = csd->func;
-		void *info = csd->info;
+		int type = CSD_TYPE(csd);
 
-		csd_unlock(csd);
-		func(info);
+		if (type == CSD_TYPE_ASYNC) {
+			smp_call_func_t func = csd->func;
+			void *info = csd->info;
+
+			csd_unlock(csd);
+			func(info);
+		} else if (type == CSD_TYPE_IRQ_WORK) {
+			irq_work_single(csd);
+		}
 	}
 }
 
@@ -305,7 +318,7 @@ int smp_call_function_single(int cpu, sm
 {
 	call_single_data_t *csd;
 	call_single_data_t csd_stack = {
-		.flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS,
+		.flags = CSD_FLAG_LOCK | CSD_TYPE_SYNC,
 	};
 	int this_cpu;
 	int err;
@@ -339,7 +352,10 @@ int smp_call_function_single(int cpu, sm
 		csd_lock(csd);
 	}
 
-	err = generic_exec_single(cpu, csd, func, info);
+	csd->func = func;
+	csd->info = info;
+
+	err = generic_exec_single(cpu, csd);
 
 	if (wait)
 		csd_lock_wait(csd);
@@ -385,7 +401,7 @@ int smp_call_function_single_async(int c
 	csd->flags = CSD_FLAG_LOCK;
 	smp_wmb();
 
-	err = generic_exec_single(cpu, csd, csd->func, csd->info);
+	err = generic_exec_single(cpu, csd);
 
 out:
 	preempt_enable();
@@ -500,7 +516,7 @@ static void smp_call_function_many_cond(
 
 		csd_lock(csd);
 		if (wait)
-			csd->flags |= CSD_FLAG_SYNCHRONOUS;
+			csd->flags |= CSD_TYPE_SYNC;
 		csd->func = func;
 		csd->info = info;
 		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
@@ -632,6 +648,17 @@ void __init smp_init(void)
 {
 	int num_nodes, num_cpus;
 
+	/*
+	 * Ensure struct irq_work layout matches so that
+	 * flush_smp_call_function_queue() can do horrible things.
+	 */
+	BUILD_BUG_ON(offsetof(struct irq_work, llnode) !=
+		     offsetof(struct __call_single_data, llist));
+	BUILD_BUG_ON(offsetof(struct irq_work, func) !=
+		     offsetof(struct __call_single_data, func));
+	BUILD_BUG_ON(offsetof(struct irq_work, flags) !=
+		     offsetof(struct __call_single_data, flags));
+
 	idle_threads_init();
 	cpuhp_threads_init();
 




* [RFC][PATCH 6/7] sched: Add rq::ttwu_pending
  2020-05-26 16:10 [RFC][PATCH 0/7] Fix the scheduler-IPI mess Peter Zijlstra
                   ` (4 preceding siblings ...)
  2020-05-26 16:11 ` [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue Peter Zijlstra
@ 2020-05-26 16:11 ` Peter Zijlstra
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2020-05-26 16:11 ` [RFC][PATCH 7/7] sched: Replace rq::wake_list Peter Zijlstra
  6 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-26 16:11 UTC (permalink / raw)
  To: tglx, frederic; +Cc: linux-kernel, x86, cai, mgorman, peterz

In preparation of removing rq->wake_list, replace the
!llist_empty(&rq->wake_list) test with rq->ttwu_pending. This is not fully
equivalent as this new variable is racy.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   13 +++++++++++--
 kernel/sched/debug.c |    1 -
 kernel/sched/fair.c  |    2 +-
 kernel/sched/sched.h |    4 +++-
 4 files changed, 15 insertions(+), 5 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2275,13 +2275,21 @@ static int ttwu_remote(struct task_struc
 void sched_ttwu_pending(void)
 {
 	struct rq *rq = this_rq();
-	struct llist_node *llist = llist_del_all(&rq->wake_list);
+	struct llist_node *llist;
 	struct task_struct *p, *t;
 	struct rq_flags rf;
 
+	llist = llist_del_all(&rq->wake_list);
 	if (!llist)
 		return;
 
+	/*
+	 * rq::ttwu_pending is a racy indication of outstanding wakeups.
+	 * Races are such that false-negatives are possible, since they
+	 * are shorter-lived than false-positives would be.
+	 */
+	WRITE_ONCE(rq->ttwu_pending, 0);
+
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
@@ -2312,6 +2320,7 @@ static void ttwu_queue_remote(struct tas
 
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
+	WRITE_ONCE(rq->ttwu_pending, 1);
 	if (llist_add(&p->wake_entry, &rq->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))
 			smp_call_function_single_async(cpu, &rq->wake_csd);
@@ -4668,7 +4677,7 @@ int idle_cpu(int cpu)
 		return 0;
 
 #ifdef CONFIG_SMP
-	if (!llist_empty(&rq->wake_list))
+	if (rq->ttwu_pending)
 		return 0;
 #endif
 
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -638,7 +638,6 @@ do {									\
 
 	P(nr_running);
 	P(nr_switches);
-	P(nr_load_updates);
 	P(nr_uninterruptible);
 	PN(next_balance);
 	SEQ_printf(m, "  .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq->curr)));
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8590,7 +8590,7 @@ static int idle_cpu_without(int cpu, str
 	 */
 
 #ifdef CONFIG_SMP
-	if (!llist_empty(&rq->wake_list))
+	if (rq->ttwu_pending)
 		return 0;
 #endif
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -895,7 +895,9 @@ struct rq {
 	atomic_t		nohz_flags;
 #endif /* CONFIG_NO_HZ_COMMON */
 
-	unsigned long		nr_load_updates;
+#ifdef CONFIG_SMP
+	unsigned int		ttwu_pending;
+#endif
 	u64			nr_switches;
 
 #ifdef CONFIG_UCLAMP_TASK




* [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-05-26 16:10 [RFC][PATCH 0/7] Fix the scheduler-IPI mess Peter Zijlstra
                   ` (5 preceding siblings ...)
  2020-05-26 16:11 ` [RFC][PATCH 6/7] sched: Add rq::ttwu_pending Peter Zijlstra
@ 2020-05-26 16:11 ` Peter Zijlstra
  2020-05-29 15:10   ` Valdis Klētnieks
                     ` (2 more replies)
  6 siblings, 3 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-26 16:11 UTC (permalink / raw)
  To: tglx, frederic; +Cc: linux-kernel, x86, cai, mgorman, peterz

The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
got smp_call_function_single_async() subtly wrong. Even though it will
return -EBUSY when trying to re-use a csd, that condition is not
atomic and still requires external serialization.

The change in ttwu_queue_remote() got this wrong.

While on first reading ttwu_queue_remote() has an atomic test-and-set
that appears to serialize the use, the matching 'release' is not in
the right place to actually guarantee this serialization.

The actual race is vs the sched_ttwu_pending() call in the idle loop;
that can run the wakeup-list without consuming the CSD.

Instead of trying to chain the lists, merge them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |    1 +
 include/linux/smp.h   |    1 +
 kernel/sched/core.c   |   25 +++++++------------------
 kernel/sched/idle.c   |    1 -
 kernel/sched/sched.h  |   11 -----------
 kernel/smp.c          |   47 ++++++++++++++++++++++++++++++++++++++++-------
 6 files changed, 49 insertions(+), 37 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -655,6 +655,7 @@ struct task_struct {
 
 #ifdef CONFIG_SMP
 	struct llist_node		wake_entry;
+	unsigned int			wake_entry_type;
 	int				on_cpu;
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/* Current CPU: */
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -25,6 +25,7 @@ enum {
 	CSD_TYPE_ASYNC		= 0x00,
 	CSD_TYPE_SYNC		= 0x10,
 	CSD_TYPE_IRQ_WORK	= 0x20,
+	CSD_TYPE_TTWU		= 0x30,
 	CSD_FLAG_TYPE_MASK	= 0xF0,
 };
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1538,7 +1538,7 @@ static int migration_cpu_stop(void *data
 	 * __migrate_task() such that we will not miss enforcing cpus_ptr
 	 * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
 	 */
-	sched_ttwu_pending();
+	flush_smp_call_function_from_idle();
 
 	raw_spin_lock(&p->pi_lock);
 	rq_lock(rq, &rf);
@@ -2272,14 +2272,13 @@ static int ttwu_remote(struct task_struc
 }
 
 #ifdef CONFIG_SMP
-void sched_ttwu_pending(void)
+void sched_ttwu_pending(void *arg)
 {
+	struct llist_node *llist = arg;
 	struct rq *rq = this_rq();
-	struct llist_node *llist;
 	struct task_struct *p, *t;
 	struct rq_flags rf;
 
-	llist = llist_del_all(&rq->wake_list);
 	if (!llist)
 		return;
 
@@ -2299,11 +2298,6 @@ void sched_ttwu_pending(void)
 	rq_unlock_irqrestore(rq, &rf);
 }
 
-static void wake_csd_func(void *info)
-{
-	sched_ttwu_pending();
-}
-
 void send_call_function_single_ipi(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -2321,12 +2315,7 @@ static void ttwu_queue_remote(struct tas
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
 	WRITE_ONCE(rq->ttwu_pending, 1);
-	if (llist_add(&p->wake_entry, &rq->wake_list)) {
-		if (!set_nr_if_polling(rq->idle))
-			smp_call_function_single_async(cpu, &rq->wake_csd);
-		else
-			trace_sched_wake_idle_without_ipi(cpu);
-	}
+	__smp_call_single_queue(cpu, &p->wake_entry);
 }
 
 void wake_up_if_idle(int cpu)
@@ -2735,6 +2724,9 @@ static void __sched_fork(unsigned long c
 	p->capture_control = NULL;
 #endif
 	init_numa_balancing(clone_flags, p);
+#ifdef CONFIG_SMP
+	p->wake_entry_type = CSD_TYPE_TTWU;
+#endif
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6527,7 +6519,6 @@ int sched_cpu_dying(unsigned int cpu)
 	struct rq_flags rf;
 
 	/* Handle pending wakeups and then migrate everything off */
-	sched_ttwu_pending();
 	sched_tick_stop(cpu);
 
 	rq_lock_irqsave(rq, &rf);
@@ -6726,8 +6717,6 @@ void __init sched_init(void)
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
-		rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
-
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
 		rq_attach_root(rq, &def_root_domain);
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -290,7 +290,6 @@ static void do_idle(void)
 	smp_mb__after_atomic();
 
 	flush_smp_call_function_from_idle();
-	sched_ttwu_pending();
 	schedule_idle();
 
 	if (unlikely(klp_patch_pending(current)))
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1023,11 +1023,6 @@ struct rq {
 	unsigned int		ttwu_local;
 #endif
 
-#ifdef CONFIG_SMP
-	call_single_data_t	wake_csd;
-	struct llist_head	wake_list;
-#endif
-
 #ifdef CONFIG_CPU_IDLE
 	/* Must be inspected within a rcu lock section */
 	struct cpuidle_state	*idle_state;
@@ -1371,8 +1366,6 @@ queue_balance_callback(struct rq *rq,
 	rq->balance_callback = head;
 }
 
-extern void sched_ttwu_pending(void);
-
 #define rcu_dereference_check_sched_domain(p) \
 	rcu_dereference_check((p), \
 			      lockdep_is_held(&sched_domains_mutex))
@@ -1510,10 +1503,6 @@ static inline void unregister_sched_doma
 
 extern void flush_smp_call_function_from_idle(void);
 
-#else
-
-static inline void sched_ttwu_pending(void) { }
-
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -196,6 +196,7 @@ void generic_smp_call_function_single_in
 	flush_smp_call_function_queue(true);
 }
 
+extern void sched_ttwu_pending(void *);
 extern void irq_work_single(void *);
 
 /**
@@ -244,6 +245,10 @@ static void flush_smp_call_function_queu
 					csd->func);
 				break;
 
+			case CSD_TYPE_TTWU:
+				pr_warn("IPI task-wakeup sent to offline CPU\n");
+				break;
+
 			default:
 				pr_warn("IPI callback, unknown type %d, sent to offline CPU\n",
 					CSD_TYPE(csd));
@@ -275,22 +280,43 @@ static void flush_smp_call_function_queu
 		}
 	}
 
+	if (!entry)
+		return;
+
 	/*
 	 * Second; run all !SYNC callbacks.
 	 */
+	prev = NULL;
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
 		int type = CSD_TYPE(csd);
 
-		if (type == CSD_TYPE_ASYNC) {
-			smp_call_func_t func = csd->func;
-			void *info = csd->info;
+		if (type != CSD_TYPE_TTWU) {
+			if (prev) {
+				prev->next = &csd_next->llist;
+			} else {
+				entry = &csd_next->llist;
+			}
 
-			csd_unlock(csd);
-			func(info);
-		} else if (type == CSD_TYPE_IRQ_WORK) {
-			irq_work_single(csd);
+			if (type == CSD_TYPE_ASYNC) {
+				smp_call_func_t func = csd->func;
+				void *info = csd->info;
+
+				csd_unlock(csd);
+				func(info);
+			} else if (type == CSD_TYPE_IRQ_WORK) {
+				irq_work_single(csd);
+			}
+
+		} else {
+			prev = &csd->llist;
 		}
 	}
+
+	/*
+	 * Third; only CSD_TYPE_TTWU is left, issue those.
+	 */
+	if (entry)
+		sched_ttwu_pending(entry);
 }
 
 void flush_smp_call_function_from_idle(void)
@@ -659,6 +685,13 @@ void __init smp_init(void)
 	BUILD_BUG_ON(offsetof(struct irq_work, flags) !=
 		     offsetof(struct __call_single_data, flags));
 
+	/*
+	 * Assert the CSD_TYPE_TTWU layout is similar enough
+	 * for task_struct to be on the @call_single_queue.
+	 */
+	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
+		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
+
 	idle_threads_init();
 	cpuhp_threads_init();
 



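The BUILD_BUG_ON added above enforces that a task_struct can masquerade as a call_single_data entry on the call_single_queue. The rule it encodes can be sketched in plain userspace C (hypothetical simplified structs for illustration, not the real kernel layouts):

```c
#include <assert.h>
#include <stddef.h>

struct llist_node { struct llist_node *next; };

/* Simplified stand-in for struct __call_single_data. */
struct call_single_data {
	struct llist_node llist;	/* queue linkage */
	unsigned int flags;		/* CSD_TYPE_* lives in here */
	void (*func)(void *);
	void *info;
};

/* Simplified stand-in for the relevant head of struct task_struct. */
struct task {
	struct llist_node wake_entry;
	unsigned int wake_entry_type;	/* set to CSD_TYPE_TTWU at fork */
	int on_cpu;
};

/*
 * The invariant smp_init() asserts at build time: the type word must sit
 * at the same offset from the llist node in both layouts, so that
 * flush_smp_call_function_queue() can read CSD_TYPE() through either a
 * csd or a queued task.
 */
static int wake_entry_layout_compatible(void)
{
	return (offsetof(struct task, wake_entry_type) -
		offsetof(struct task, wake_entry)) ==
	       (offsetof(struct call_single_data, flags) -
		offsetof(struct call_single_data, llist));
}
```

With these stand-ins both differences equal sizeof(struct llist_node), so the check holds, mirroring what the kernel verifies at compile time.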
^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB
  2020-05-26 16:10 ` [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB Peter Zijlstra
@ 2020-05-26 23:56   ` Frederic Weisbecker
  2020-05-27 10:23   ` Vincent Guittot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 62+ messages in thread
From: Frederic Weisbecker @ 2020-05-26 23:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, linux-kernel, x86, cai, mgorman

On Tue, May 26, 2020 at 06:10:58PM +0200, Peter Zijlstra wrote:
> The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> got smp_call_function_single_async() subtly wrong. Even though it will
> return -EBUSY when trying to re-use a csd, that condition is not
> atomic and still requires external serialization.
> 
> The change in kick_ilb() got this wrong.
> 
> While on first reading kick_ilb() has an atomic test-and-set that
> appears to serialize the use, the matching 'release' is not in the
> right place to actually guarantee this serialization.
> 
> Rework the nohz_idle_balance() trigger so that the release is in the
> IPI callback and thus guarantees the required serialization for the
> CSD.
> 
> Fixes: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> Reported-by: Qian Cai <cai@lca.pw>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Looks very good! Thanks!

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

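To make the commit message's point concrete: the -EBUSY check inside smp_call_function_single_async() is a plain load followed by a store, not an atomic operation, so it cannot serialize concurrent callers by itself. A minimal userspace sketch of that check-then-set pattern (hypothetical names; the real csd locking also involves memory barriers):

```c
#include <assert.h>

#define CSD_FLAG_LOCK 0x01

static unsigned int csd_flags;	/* models csd->flags; deliberately non-atomic */

static int try_use_csd(void)
{
	if (csd_flags & CSD_FLAG_LOCK)	/* check ... */
		return -1;		/* stands in for -EBUSY */
	/*
	 * ... then set: another CPU can run the check above between these
	 * two steps and also claim the csd, which is why callers still
	 * need their own serialization around the csd.
	 */
	csd_flags |= CSD_FLAG_LOCK;
	return 0;
}

static void csd_unlock(void)
{
	csd_flags &= ~CSD_FLAG_LOCK;
}
```

A single caller sees the expected lock/busy/unlock behavior; the bug is precisely that two concurrent callers can both pass the check before either sets the flag.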
^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2020-05-26 16:11 ` [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi() Peter Zijlstra
@ 2020-05-27  9:56   ` Peter Zijlstra
  2020-05-27 10:15     ` Peter Zijlstra
  2020-05-29 13:01   ` Frederic Weisbecker
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-27  9:56 UTC (permalink / raw)
  To: tglx, frederic; +Cc: linux-kernel, x86, cai, mgorman, paulmck

On Tue, May 26, 2020 at 06:11:01PM +0200, Peter Zijlstra wrote:
> Just like the ttwu_queue_remote() IPI, make use of _TIF_POLLING_NRFLAG
> to avoid sending IPIs to idle CPUs.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c  |   10 ++++++++++
>  kernel/sched/idle.c  |    1 +
>  kernel/sched/sched.h |    2 ++
>  kernel/smp.c         |   16 +++++++++++++++-
>  4 files changed, 28 insertions(+), 1 deletion(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2296,6 +2296,16 @@ static void wake_csd_func(void *info)
>  	sched_ttwu_pending();
>  }
>  
> +void send_call_function_single_ipi(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +
> +	if (!set_nr_if_polling(rq->idle))
> +		arch_send_call_function_single_ipi(cpu);
> +	else
> +		trace_sched_wake_idle_without_ipi(cpu);
> +}
> +
>  /*
>   * Queue a task on the target CPUs wake_list and wake the CPU via IPI if
>   * necessary. The wakee CPU on receipt of the IPI will queue the task
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -289,6 +289,7 @@ static void do_idle(void)
>  	 */
>  	smp_mb__after_atomic();
>  
> +	flush_smp_call_function_from_idle();
>  	sched_ttwu_pending();
>  	schedule_idle();
>  

Paul; the above patch basically allows smp_call_function_single() to run
from the idle context (with IRQs disabled, obviously) instead of from an
actual IRQ context.

This makes RCU unhappy (as reported by mingo):

 [ ] ------------[ cut here ]------------
 [ ] Not in hardirq as expected
 [ ] WARNING: CPU: 4 PID: 0 at kernel/rcu/tree.c:430 rcu_is_cpu_rrupt_from_idle+0xed/0x110
 [ ] Modules linked in:
 [ ] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.7.0-rc7-00840-ga61d572-dirty #1
 [ ] Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 2.0b       03/01/2012
 [ ] RIP: 0010:rcu_is_cpu_rrupt_from_idle+0xed/0x110
 [ ] Call Trace:
 [ ]  rcu_exp_handler+0x38/0x90
 [ ]  flush_smp_call_function_queue+0xce/0x230
 [ ]  flush_smp_call_function_from_idle+0x2f/0x60
 [ ]  do_idle+0x163/0x260
 [ ]  cpu_startup_entry+0x19/0x20
 [ ]  start_secondary+0x14f/0x1a0
 [ ] irq event stamp: 189300
 [ ] hardirqs last  enabled at (189299): [<ffffffff811d3e25>] tick_nohz_idle_exit+0x55/0xb0
 [ ] hardirqs last disabled at (189300): [<ffffffff811da5f5>] flush_smp_call_function_from_idle+0x25/0x60
 [ ] softirqs last  enabled at (189284): [<ffffffff811280a0>] irq_enter_rcu+0x70/0x80
 [ ] softirqs last disabled at (189283): [<ffffffff81128085>] irq_enter_rcu+0x55/0x80

This is rcu_is_cpu_rrupt_from_idle()'s lockdep_assert_in_irq() tripping
up (its comment is obviously a bit antiquated).

Now, if I read that code correctly, it actually relies on
rcu_irq_enter() and thus really wants to be in an interrupt. Is there
any way this code can be made to work from the new context too?

After all, all that really is different is not having gone through the
bother of setting up the IRQ context, nothing else changed -- it just so
happens you actually relied on that ;/


* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2020-05-27  9:56   ` Peter Zijlstra
@ 2020-05-27 10:15     ` Peter Zijlstra
  2020-05-27 15:56       ` Paul E. McKenney
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-27 10:15 UTC (permalink / raw)
  To: tglx, frederic; +Cc: linux-kernel, x86, cai, mgorman, paulmck, joel

On Wed, May 27, 2020 at 11:56:45AM +0200, Peter Zijlstra wrote:

> This is rcu_is_cpu_rrupt_from_idle()'s lockdep_assert_in_irq() tripping
> up (its comment is obviously a bit antiquated).
> 
> Now, if I read that code correctly, it actually relies on
> rcu_irq_enter() and thus really wants to be in an interrupt. Is there
> any way this code can be made to work from the new context too?
> 
> After all, all that really is different is not having gone through the
> bother of setting up the IRQ context, nothing else changed -- it just so
> happens you actually relied on that ;/

At first glance, something like the below could work. But obviously I
might have overlooked something more subtle than a brick :-)

---

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 90c8be22d57a..0792c032a972 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -426,8 +426,11 @@ EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
  */
 static int rcu_is_cpu_rrupt_from_idle(void)
 {
-	/* Called only from within the scheduling-clock interrupt */
-	lockdep_assert_in_irq();
+	/*
+	 * Usually called from the tick; but also used from smp_call_function()
+	 * for expedited grace periods.
+	 */
+	lockdep_assert_irqs_disabled();
 
 	/* Check for counter underflows */
 	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
@@ -435,8 +438,11 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nmi_nesting) <= 0,
 			 "RCU dynticks_nmi_nesting counter underflow/zero!");
 
-	/* Are we at first interrupt nesting level? */
-	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) != 1)
+	/*
+	 * Are we at first interrupt nesting level? -- or below, when running
+	 * directly from the idle loop itself.
+	 */
+	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) > 1)
 		return false;
 
 	/* Does CPU appear to be idle from an RCU standpoint? */


* Re: [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB
  2020-05-26 16:10 ` [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB Peter Zijlstra
  2020-05-26 23:56   ` Frederic Weisbecker
@ 2020-05-27 10:23   ` Vincent Guittot
  2020-05-27 11:28     ` Frederic Weisbecker
  2020-05-29 15:26   ` Valentin Schneider
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  3 siblings, 1 reply; 62+ messages in thread
From: Vincent Guittot @ 2020-05-27 10:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, frederic, linux-kernel, x86, Qian Cai, Mel Gorman

On Tue, 26 May 2020 at 18:19, Peter Zijlstra <peterz@infradead.org> wrote:
>
> The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> got smp_call_function_single_async() subtly wrong. Even though it will
> return -EBUSY when trying to re-use a csd, that condition is not
> atomic and still requires external serialization.
>
> The change in kick_ilb() got this wrong.
>
> While on first reading kick_ilb() has an atomic test-and-set that
> appears to serialize the use, the matching 'release' is not in the
> right place to actually guarantee this serialization.
>
> Rework the nohz_idle_balance() trigger so that the release is in the
> IPI callback and thus guarantees the required serialization for the
> CSD.
>
> Fixes: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> Reported-by: Qian Cai <cai@lca.pw>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  include/linux/sched.h      |    4 +
>  include/linux/sched/idle.h |   55 ++++++++++++++++++
>  kernel/sched/core.c        |  132 +++++++++------------------------------------
>  kernel/sched/fair.c        |   18 ++----
>  kernel/sched/idle.c        |    3 -
>  kernel/sched/sched.h       |    2
>  kernel/smp.c               |    7 ++
>  7 files changed, 102 insertions(+), 119 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -637,41 +568,25 @@ void wake_up_nohz_cpu(int cpu)
>                 wake_up_idle_cpu(cpu);
>  }
>
> -static inline bool got_nohz_idle_kick(void)
> +static void nohz_csd_func(void *info)
>  {
> -       int cpu = smp_processor_id();
> -
> -       if (!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK))
> -               return false;
> -
> -       if (idle_cpu(cpu) && !need_resched())
> -               return true;
> +       struct rq *rq = info;
> +       int cpu = cpu_of(rq);
> +       unsigned int flags;
>
>         /*
> -        * We can't run Idle Load Balance on this CPU for this time so we
> -        * cancel it and clear NOHZ_BALANCE_KICK
> +        * Release the rq::nohz_csd.
>          */
> -       atomic_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
> -       return false;
> -}
> -
> -static void nohz_csd_func(void *info)
> -{
> -       struct rq *rq = info;
> +       flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));

Why can't this be done in nohz_idle_balance() instead?

You are not using flags in nohz_csd_func(), and SCHED_SOFTIRQ, which
calls nohz_idle_balance(), happens after nohz_csd_func(), doesn't it?

In this case, you don't have to use the intermediate variable
this_rq->nohz_idle_balance


> +       WARN_ON(!(flags & NOHZ_KICK_MASK));
>
> -       if (got_nohz_idle_kick()) {
> -               rq->idle_balance = 1;
> +       rq->idle_balance = idle_cpu(cpu);
> +       if (rq->idle_balance && !need_resched()) {
> +               rq->nohz_idle_balance = flags;
>                 raise_softirq_irqoff(SCHED_SOFTIRQ);
>         }
>  }
>
> -#else /* CONFIG_NO_HZ_COMMON */
> -
> -static inline bool got_nohz_idle_kick(void)
> -{
> -       return false;
> -}
> -
>  #endif /* CONFIG_NO_HZ_COMMON */
>
>  #ifdef CONFIG_NO_HZ_FULL
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10024,6 +10024,10 @@ static void kick_ilb(unsigned int flags)
>         if (ilb_cpu >= nr_cpu_ids)
>                 return;
>
> +       /*
> +        * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
> +        * the first flag owns it; cleared by nohz_csd_func().
> +        */
>         flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
>         if (flags & NOHZ_KICK_MASK)
>                 return;
> @@ -10371,20 +10375,14 @@ static bool _nohz_idle_balance(struct rq
>   */
>  static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  {
> -       int this_cpu = this_rq->cpu;
> -       unsigned int flags;
> +       unsigned int flags = this_rq->nohz_idle_balance;
>
> -       if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
> +       if (!flags)
>                 return false;
>
> -       if (idle != CPU_IDLE) {
> -               atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> -               return false;
> -       }
> +       this_rq->nohz_idle_balance = 0;
>
> -       /* could be _relaxed() */
> -       flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> -       if (!(flags & NOHZ_KICK_MASK))
> +       if (idle != CPU_IDLE)
>                 return false;
>
>         _nohz_idle_balance(this_rq, flags, idle);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -951,6 +951,7 @@ struct rq {
>
>         struct callback_head    *balance_callback;
>
> +       unsigned char           nohz_idle_balance;
>         unsigned char           idle_balance;
>
>         unsigned long           misfit_task_load;
>
>


* Re: [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB
  2020-05-27 10:23   ` Vincent Guittot
@ 2020-05-27 11:28     ` Frederic Weisbecker
  2020-05-27 12:07       ` Vincent Guittot
  0 siblings, 1 reply; 62+ messages in thread
From: Frederic Weisbecker @ 2020-05-27 11:28 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Thomas Gleixner, linux-kernel, x86, Qian Cai, Mel Gorman

On Wed, May 27, 2020 at 12:23:23PM +0200, Vincent Guittot wrote:
> > -static void nohz_csd_func(void *info)
> > -{
> > -       struct rq *rq = info;
> > +       flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
> 
> Why can't this be done in nohz_idle_balance() instead?
> 
> You are not using flags in nohz_csd_func(), and SCHED_SOFTIRQ, which
> calls nohz_idle_balance(), happens after nohz_csd_func(), doesn't it?
> 
> In this case, you don't have to use the intermediate variable
> this_rq->nohz_idle_balance

That's in fact there to fix the original issue. The softirq was clearing
the nohz_flags, but the softirq could be issued from two sources:
the tick and the IPI. The tick-sourced softirq could then clear the
flags set by the IPI sender before the IPI itself ran, resulting
in races such as the one described here: https://lore.kernel.org/lkml/20200521004035.GA15455@lenoir/
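The serialization the fix establishes -- the first setter of a kick bit owns the csd, and the bits are cleared only in the IPI callback -- can be modeled in userspace C (illustrative bit values; the kernel uses per-CPU atomic nohz flags):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NOHZ_BALANCE_KICK 0x1	/* illustrative bit values */
#define NOHZ_STATS_KICK   0x2
#define NOHZ_KICK_MASK    (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)

static atomic_uint nohz_flags;	/* per-CPU in the kernel; one CPU modeled */

/*
 * Sender side, as in kick_ilb(): whoever sets the first kick bit owns
 * the csd and may call smp_call_function_single_async(); later senders
 * see a bit already set and back off.
 */
static bool kick_owns_csd(unsigned int kick)
{
	unsigned int prev = atomic_fetch_or(&nohz_flags, kick);

	return !(prev & NOHZ_KICK_MASK);
}

/*
 * Receiver side, as in nohz_csd_func(): clearing the bits only once the
 * IPI has landed is the 'release' that makes the csd safe to reuse --
 * the tick-driven softirq can no longer clear them early. Returns the
 * kick bits that were pending.
 */
static unsigned int nohz_csd_release(void)
{
	return atomic_fetch_and(&nohz_flags, ~NOHZ_KICK_MASK) & NOHZ_KICK_MASK;
}
```

One ownership round trip: the first kick wins, a second kick merely ORs in its bit, and after the release a new kick can claim the csd again.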

Thanks.


* Re: [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB
  2020-05-27 11:28     ` Frederic Weisbecker
@ 2020-05-27 12:07       ` Vincent Guittot
  0 siblings, 0 replies; 62+ messages in thread
From: Vincent Guittot @ 2020-05-27 12:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Thomas Gleixner, linux-kernel, x86, Qian Cai, Mel Gorman

On Wed, 27 May 2020 at 13:28, Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Wed, May 27, 2020 at 12:23:23PM +0200, Vincent Guittot wrote:
> > > -static void nohz_csd_func(void *info)
> > > -{
> > > -       struct rq *rq = info;
> > > +       flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
> >
> > Why can't this be done in nohz_idle_balance() instead?
> >
> > You are not using flags in nohz_csd_func(), and SCHED_SOFTIRQ, which
> > calls nohz_idle_balance(), happens after nohz_csd_func(), doesn't it?
> >
> > In this case, you don't have to use the intermediate variable
> > this_rq->nohz_idle_balance
>
> That's in fact to fix the original issue. The softirq was clearing
> the nohz_flags but the softirq could be issued from two sources:
> the tick and the IPI. And the tick source softirq could then clear
> the flags set from the IPI sender before the IPI itself, resulting
> in races such as described there: https://lore.kernel.org/lkml/20200521004035.GA15455@lenoir/

Ah yes, even if the CPU is idle, the tick can fire and clear it.

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> Thanks.


* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2020-05-27 10:15     ` Peter Zijlstra
@ 2020-05-27 15:56       ` Paul E. McKenney
  2020-05-27 16:35         ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: Paul E. McKenney @ 2020-05-27 15:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, frederic, linux-kernel, x86, cai, mgorman, joel

On Wed, May 27, 2020 at 12:15:13PM +0200, Peter Zijlstra wrote:
> On Wed, May 27, 2020 at 11:56:45AM +0200, Peter Zijlstra wrote:
> 
> > This is rcu_is_cpu_rrupt_from_idle()'s lockdep_assert_in_irq() tripping
> > up (its comment is obviously a bit antiquated).
> > 
> > Now, if I read that code correctly, it actually relies on
> > rcu_irq_enter() and thus really wants to be in an interrupt. Is there
> > any way this code can be made to work from the new context too?
> > 
> > After all, all that really is different is not having gone through the
> > bother of setting up the IRQ context, nothing else changed -- it just so
> > happens you actually relied on that ;/
> 
> At first glance, something like the below could work. But obviously I
> might have overlooked something more subtle than a brick :-)

This can work, but only if the call from the idle loop is a place where
either RCU isn't watching on the one hand or that cannot be in an RCU
read-side critical section on the other.  Because rcu_exp_handler()
assumes that if this function returns true, we are not in an RCU read-side
critical section.  (I would expect this to be the case, but I figured
that I should make it explicit.)

> ---
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 90c8be22d57a..0792c032a972 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -426,8 +426,11 @@ EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
>   */

Could we please have a comment noting the change in semantics and
the reason?

>  static int rcu_is_cpu_rrupt_from_idle(void)
>  {
> -	/* Called only from within the scheduling-clock interrupt */
> -	lockdep_assert_in_irq();
> +	/*
> +	 * Usually called from the tick; but also used from smp_call_function()
> +	 * for expedited grace periods.
> +	 */
> +	lockdep_assert_irqs_disabled();
>  
>  	/* Check for counter underflows */
>  	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
> @@ -435,8 +438,11 @@ static int rcu_is_cpu_rrupt_from_idle(void)
>  	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nmi_nesting) <= 0,
>  			 "RCU dynticks_nmi_nesting counter underflow/zero!");
>  
> -	/* Are we at first interrupt nesting level? */
> -	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) != 1)
> +	/*
> +	 * Are we at first interrupt nesting level? -- or below, when running
> +	 * directly from the idle loop itself.
> +	 */
> +	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) > 1)

Wouldn't it also be a good idea to check that we are in the context of
an idle thread?  Just in case some idiot like me drops a call to this
function in the wrong place, for example, if I were to mistakenly remember
the old semantics where it would return false from process context?

Maybe something like this?

	nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting);
	if (nesting > 1)
		return false;
	WARN_ON_ONCE(!nesting && !is_idle_task(current));

>  		return false;
>  
>  	/* Does CPU appear to be idle from an RCU standpoint? */

And let's check the other callers:

rcu_sched_clock_irq():  This will always be called from IRQ (right?), so
	no problem.

rcu_pending():  Only called from rcu_sched_clock_irq(), so still no problem.

rcu_flavor_sched_clock_irq(): Ditto for both definitions.

							Thanx, Paul


* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2020-05-27 15:56       ` Paul E. McKenney
@ 2020-05-27 16:35         ` Peter Zijlstra
  2020-05-27 17:12           ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-27 16:35 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: tglx, frederic, linux-kernel, x86, cai, mgorman, joel

On Wed, May 27, 2020 at 08:56:56AM -0700, Paul E. McKenney wrote:
> On Wed, May 27, 2020 at 12:15:13PM +0200, Peter Zijlstra wrote:

> > At first glance, something like the below could work. But obviously I
> > might have overlooked something more subtle than a brick :-)
> 
> This can work, but only if the call from the idle loop is a place where
> either RCU isn't watching on the one hand or that cannot be in an RCU
> read-side critical section on the other. 

Guaranteed no RCU read side, although the call is in a place where RCU
is active again -- is that a problem? I think with a bit of work I can
move it to where RCU is still idle.

> Because rcu_exp_handler() assumes that if this function returns true,
> we are not in an RCU read-side critical section.  (I would expect this
> to be the case, but I figured that I should make it explicit.)

Indeed, I shall put a comment in the idle loop to make sure it stays that way.

> > ---
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 90c8be22d57a..0792c032a972 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -426,8 +426,11 @@ EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
> >   */
> 
> Could we please have a comment noting the change in semantics and
> the reason?

A Changelog you mean? Sure, I can do that, but I wasn't anywhere near
confident enough in the change to even bother trying to write one.

> >  static int rcu_is_cpu_rrupt_from_idle(void)
> >  {
> > -	/* Called only from within the scheduling-clock interrupt */
> > -	lockdep_assert_in_irq();
> > +	/*
> > +	 * Usually called from the tick; but also used from smp_call_function()
> > +	 * for expedited grace periods.
> > +	 */
> > +	lockdep_assert_irqs_disabled();
> >  
> >  	/* Check for counter underflows */
> >  	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
> > @@ -435,8 +438,11 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> >  	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nmi_nesting) <= 0,
> >  			 "RCU dynticks_nmi_nesting counter underflow/zero!");
> >  
> > -	/* Are we at first interrupt nesting level? */
> > -	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) != 1)
> > +	/*
> > +	 * Are we at first interrupt nesting level? -- or below, when running
> > +	 * directly from the idle loop itself.
> > +	 */
> > +	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) > 1)
> 
> Wouldn't it also be a good idea to check that we are in the context of
> an idle thread?  Just in case some idiot like me drops a call to this
> function in the wrong place, for example, if I were to mistakenly remember
> the old semantics where it would return false from process context?
> 
> Maybe something like this?
> 
> 	nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting);
> 	if (nesting > 1)
> 		return false;
> 	WARN_ON_ONCE(!nesting && !is_idle_task(current));

Yep, that should do.

> >  		return false;
> >  
> >  	/* Does CPU appear to be idle from an RCU standpoint? */
> 
> And let's check the other callers:
> 
> rcu_sched_clock_irq():  This will always be called from IRQ (right?), so
> 	no problem.
> 
> rcu_pending():  Only called from rcu_sched_clock_irq(), so still no problem.
> 
> rcu_flavor_sched_clock_irq(): Ditto for both definitions.

Right, I went through them, didn't find anything obvious amiss. OK, let
me do a nicer patch.


* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2020-05-27 16:35         ` Peter Zijlstra
@ 2020-05-27 17:12           ` Peter Zijlstra
  2020-05-27 19:39             ` Paul E. McKenney
                               ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-27 17:12 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: tglx, frederic, linux-kernel, x86, cai, mgorman, joel

On Wed, May 27, 2020 at 06:35:43PM +0200, Peter Zijlstra wrote:
> > Right, I went through them, didn't find anything obvious amiss. OK, let
> me do a nicer patch.

something like so then?

---
Subject: rcu: Allow for smp_call_function() running callbacks from idle

RCU currently relies hard on smp_call_function() callbacks running from
interrupt context. A pending optimization is going to break that: it
will allow idle CPUs to run the callbacks from the idle loop. This
avoids raising the IPI on the requesting CPU and avoids handling an
exception on the receiving CPU.

Change rcu_is_cpu_rrupt_from_idle() to also accept task context,
provided it is the idle task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/rcu/tree.c   | 25 +++++++++++++++++++------
 kernel/sched/idle.c |  4 ++++
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d8e9dbbefcfa..c716eadc7617 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -418,16 +418,23 @@ void rcu_momentary_dyntick_idle(void)
 EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
 
 /**
- * rcu_is_cpu_rrupt_from_idle - see if interrupted from idle
+ * rcu_is_cpu_rrupt_from_idle - see if 'interrupted' from idle
  *
  * If the current CPU is idle and running at a first-level (not nested)
- * interrupt from idle, return true.  The caller must have at least
- * disabled preemption.
+ * interrupt, or directly, from idle, return true.
+ *
+ * The caller must have at least disabled IRQs.
  */
 static int rcu_is_cpu_rrupt_from_idle(void)
 {
-	/* Called only from within the scheduling-clock interrupt */
-	lockdep_assert_in_irq();
+	long nesting;
+
+	/*
+	 * Usually called from the tick; but also used from smp_call_function()
+	 * for expedited grace periods. This latter can result in running from
+	 * the idle task, instead of an actual IPI.
+	 */
+	lockdep_assert_irqs_disabled();
 
 	/* Check for counter underflows */
 	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
@@ -436,9 +443,15 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 			 "RCU dynticks_nmi_nesting counter underflow/zero!");
 
 	/* Are we at first interrupt nesting level? */
-	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) != 1)
+	nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting);
+	if (nesting > 1)
 		return false;
 
+	/*
+	 * If we're not in an interrupt, we must be in the idle task!
+	 */
+	WARN_ON_ONCE(!nesting && !is_idle_task(current));
+
 	/* Does CPU appear to be idle from an RCU standpoint? */
 	return __this_cpu_read(rcu_data.dynticks_nesting) == 0;
 }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index e9cef84c2b70..05deb81bb3e3 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -289,6 +289,10 @@ static void do_idle(void)
 	 */
 	smp_mb__after_atomic();
 
+	/*
+	 * RCU relies on this call to be done outside of an RCU read-side
+	 * critical section.
+	 */
 	flush_smp_call_function_from_idle();
 	schedule_idle();
 

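The nesting rule the patch above encodes can be restated as a small userspace model (a simplification for illustration; the real function also checks counter underflows, and warns rather than failing when called from a non-idle task):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the check in rcu_is_cpu_rrupt_from_idle() after the change:
 * dynticks_nmi_nesting is 0 when running directly from the idle loop,
 * 1 in a first-level interrupt, and > 1 when interrupts nest;
 * dynticks_nesting == 0 means the CPU is idle from RCU's viewpoint.
 */
static bool rrupt_from_idle(long nmi_nesting, long dynticks_nesting,
			    bool is_idle_task)
{
	if (nmi_nesting > 1)
		return false;	/* nested interrupt: not "from idle" */
	if (nmi_nesting == 0 && !is_idle_task)
		return false;	/* the WARN_ON_ONCE() case, modeled as false */
	return dynticks_nesting == 0;
}
```

The old code accepted only nmi_nesting == 1 (a first-level interrupt); the relaxed test additionally accepts nmi_nesting == 0, the flush_smp_call_function_from_idle() case, provided the caller really is the idle task.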

* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2020-05-27 17:12           ` Peter Zijlstra
@ 2020-05-27 19:39             ` Paul E. McKenney
  2020-05-28  1:35               ` Joel Fernandes
  2020-05-28  8:59             ` [tip: core/rcu] rcu: Allow for smp_call_function() running callbacks from idle tip-bot2 for Peter Zijlstra
  2021-01-21 16:56             ` [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi() Peter Zijlstra
  2 siblings, 1 reply; 62+ messages in thread
From: Paul E. McKenney @ 2020-05-27 19:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, frederic, linux-kernel, x86, cai, mgorman, joel

On Wed, May 27, 2020 at 07:12:36PM +0200, Peter Zijlstra wrote:
> On Wed, May 27, 2020 at 06:35:43PM +0200, Peter Zijlstra wrote:
> > Right, I went through them, didn't find anything obvious amiss. OK, let
> > me do a nicer patch.
> 
> something like so then?
> 
> ---
> Subject: rcu: Allow for smp_call_function() running callbacks from idle
> 
> RCU currently relies hard on smp_call_function() callbacks running from
> interrupt context. A pending optimization is going to break that: it
> will allow idle CPUs to run the callbacks from the idle loop. This
> avoids raising the IPI on the requesting CPU and avoids handling an
> exception on the receiving CPU.
> 
> Change rcu_is_cpu_rrupt_from_idle() to also accept task context,
> provided it is the idle task.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Looks good to me!

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  kernel/rcu/tree.c   | 25 +++++++++++++++++++------
>  kernel/sched/idle.c |  4 ++++
>  2 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index d8e9dbbefcfa..c716eadc7617 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -418,16 +418,23 @@ void rcu_momentary_dyntick_idle(void)
>  EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
>  
>  /**
> - * rcu_is_cpu_rrupt_from_idle - see if interrupted from idle
> + * rcu_is_cpu_rrupt_from_idle - see if 'interrupted' from idle
>   *
>   * If the current CPU is idle and running at a first-level (not nested)
> - * interrupt from idle, return true.  The caller must have at least
> - * disabled preemption.
> + * interrupt, or directly, from idle, return true.
> + *
> + * The caller must have at least disabled IRQs.
>   */
>  static int rcu_is_cpu_rrupt_from_idle(void)
>  {
> -	/* Called only from within the scheduling-clock interrupt */
> -	lockdep_assert_in_irq();
> +	long nesting;
> +
> +	/*
> +	 * Usually called from the tick; but also used from smp_function_call()
> +	 * for expedited grace periods. This latter can result in running from
> +	 * the idle task, instead of an actual IPI.
> +	 */
> +	lockdep_assert_irqs_disabled();
>  
>  	/* Check for counter underflows */
>  	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
> @@ -436,9 +443,15 @@ static int rcu_is_cpu_rrupt_from_idle(void)
>  			 "RCU dynticks_nmi_nesting counter underflow/zero!");
>  
>  	/* Are we at first interrupt nesting level? */
> -	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) != 1)
> +	nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting);
> +	if (nesting > 1)
>  		return false;
>  
> +	/*
> +	 * If we're not in an interrupt, we must be in the idle task!
> +	 */
> +	WARN_ON_ONCE(!nesting && !is_idle_task(current));
> +
>  	/* Does CPU appear to be idle from an RCU standpoint? */
>  	return __this_cpu_read(rcu_data.dynticks_nesting) == 0;
>  }
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index e9cef84c2b70..05deb81bb3e3 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -289,6 +289,10 @@ static void do_idle(void)
>  	 */
>  	smp_mb__after_atomic();
>  
> +	/*
> +	 * RCU relies on this call to be done outside of an RCU read-side
> +	 * critical section.
> +	 */
>  	flush_smp_call_function_from_idle();
>  	schedule_idle();
>  


* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2020-05-27 19:39             ` Paul E. McKenney
@ 2020-05-28  1:35               ` Joel Fernandes
  0 siblings, 0 replies; 62+ messages in thread
From: Joel Fernandes @ 2020-05-28  1:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, tglx, frederic, linux-kernel, x86, cai, mgorman

On Wed, May 27, 2020 at 12:39:14PM -0700, Paul E. McKenney wrote:
> On Wed, May 27, 2020 at 07:12:36PM +0200, Peter Zijlstra wrote:
> > On Wed, May 27, 2020 at 06:35:43PM +0200, Peter Zijlstra wrote:
> > > Right, I went through them, didn't find anything obvious amiss. OK, let
> > > me do a nicer patch.
> > 
> > something like so then?
> > 
> > ---
> > Subject: rcu: Allow for smp_call_function() running callbacks from idle
> > 
> > Current RCU hard relies on smp_call_function() callbacks running from
> > interrupt context. A pending optimization is going to break that, it
> > will allow idle CPUs to run the callbacks from the idle loop. This
> > avoids raising the IPI on the requesting CPU and avoids handling an
> > exception on the receiving CPU.
> > 
> > Change rcu_is_cpu_rrupt_from_idle() to also accept task context,
> > provided it is the idle task.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Looks good to me!
> 
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel


> 
> > ---
> >  kernel/rcu/tree.c   | 25 +++++++++++++++++++------
> >  kernel/sched/idle.c |  4 ++++
> >  2 files changed, 23 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index d8e9dbbefcfa..c716eadc7617 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -418,16 +418,23 @@ void rcu_momentary_dyntick_idle(void)
> >  EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
> >  
> >  /**
> > - * rcu_is_cpu_rrupt_from_idle - see if interrupted from idle
> > + * rcu_is_cpu_rrupt_from_idle - see if 'interrupted' from idle
> >   *
> >   * If the current CPU is idle and running at a first-level (not nested)
> > - * interrupt from idle, return true.  The caller must have at least
> > - * disabled preemption.
> > + * interrupt, or directly, from idle, return true.
> > + *
> > + * The caller must have at least disabled IRQs.
> >   */
> >  static int rcu_is_cpu_rrupt_from_idle(void)
> >  {
> > -	/* Called only from within the scheduling-clock interrupt */
> > -	lockdep_assert_in_irq();
> > +	long nesting;
> > +
> > +	/*
> > +	 * Usually called from the tick; but also used from smp_function_call()
> > +	 * for expedited grace periods. This latter can result in running from
> > +	 * the idle task, instead of an actual IPI.
> > +	 */
> > +	lockdep_assert_irqs_disabled();
> >  
> >  	/* Check for counter underflows */
> >  	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
> > @@ -436,9 +443,15 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> >  			 "RCU dynticks_nmi_nesting counter underflow/zero!");
> >  
> >  	/* Are we at first interrupt nesting level? */
> > -	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) != 1)
> > +	nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting);
> > +	if (nesting > 1)
> >  		return false;
> >  
> > +	/*
> > +	 * If we're not in an interrupt, we must be in the idle task!
> > +	 */
> > +	WARN_ON_ONCE(!nesting && !is_idle_task(current));
> > +
> >  	/* Does CPU appear to be idle from an RCU standpoint? */
> >  	return __this_cpu_read(rcu_data.dynticks_nesting) == 0;
> >  }
> > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > index e9cef84c2b70..05deb81bb3e3 100644
> > --- a/kernel/sched/idle.c
> > +++ b/kernel/sched/idle.c
> > @@ -289,6 +289,10 @@ static void do_idle(void)
> >  	 */
> >  	smp_mb__after_atomic();
> >  
> > +	/*
> > +	 * RCU relies on this call to be done outside of an RCU read-side
> > +	 * critical section.
> > +	 */
> >  	flush_smp_call_function_from_idle();
> >  	schedule_idle();
> >  


* [tip: core/rcu] rcu: Allow for smp_call_function() running callbacks from idle
  2020-05-27 17:12           ` Peter Zijlstra
  2020-05-27 19:39             ` Paul E. McKenney
@ 2020-05-28  8:59             ` tip-bot2 for Peter Zijlstra
  2021-01-21 16:56             ` [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi() Peter Zijlstra
  2 siblings, 0 replies; 62+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-05-28  8:59 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel),
	Ingo Molnar, Paul E. McKenney, Joel Fernandes (Google),
	x86, LKML

The following commit has been merged into the core/rcu branch of tip:

Commit-ID:     806f04e9fd2c6ad1e39bc2dba77155be0e4becde
Gitweb:        https://git.kernel.org/tip/806f04e9fd2c6ad1e39bc2dba77155be0e4becde
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Wed, 27 May 2020 19:12:36 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 May 2020 10:50:12 +02:00

rcu: Allow for smp_call_function() running callbacks from idle

Current RCU hard relies on smp_call_function() callbacks running from
interrupt context. A pending optimization is going to break that, it
will allow idle CPUs to run the callbacks from the idle loop. This
avoids raising the IPI on the requesting CPU and avoids handling an
exception on the receiving CPU.

Change rcu_is_cpu_rrupt_from_idle() to also accept task context,
provided it is the idle task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20200527171236.GC706495@hirez.programming.kicks-ass.net
---
 kernel/rcu/tree.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 90c8be2..f51385b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -418,16 +418,23 @@ void rcu_momentary_dyntick_idle(void)
 EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
 
 /**
- * rcu_is_cpu_rrupt_from_idle - see if interrupted from idle
+ * rcu_is_cpu_rrupt_from_idle - see if 'interrupted' from idle
  *
  * If the current CPU is idle and running at a first-level (not nested)
- * interrupt from idle, return true.  The caller must have at least
- * disabled preemption.
+ * interrupt, or directly, from idle, return true.
+ *
+ * The caller must have at least disabled IRQs.
  */
 static int rcu_is_cpu_rrupt_from_idle(void)
 {
-	/* Called only from within the scheduling-clock interrupt */
-	lockdep_assert_in_irq();
+	long nesting;
+
+	/*
+	 * Usually called from the tick; but also used from smp_function_call()
+	 * for expedited grace periods. This latter can result in running from
+	 * the idle task, instead of an actual IPI.
+	 */
+	lockdep_assert_irqs_disabled();
 
 	/* Check for counter underflows */
 	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
@@ -436,9 +443,15 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 			 "RCU dynticks_nmi_nesting counter underflow/zero!");
 
 	/* Are we at first interrupt nesting level? */
-	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) != 1)
+	nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting);
+	if (nesting > 1)
 		return false;
 
+	/*
+	 * If we're not in an interrupt, we must be in the idle task!
+	 */
+	WARN_ON_ONCE(!nesting && !is_idle_task(current));
+
 	/* Does CPU appear to be idle from an RCU standpoint? */
 	return __this_cpu_read(rcu_data.dynticks_nesting) == 0;
 }


* Re: [RFC][PATCH 2/7] smp: Optimize flush_smp_call_function_queue()
  2020-05-26 16:10 ` [RFC][PATCH 2/7] smp: Optimize flush_smp_call_function_queue() Peter Zijlstra
@ 2020-05-28 12:28   ` Frederic Weisbecker
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 62+ messages in thread
From: Frederic Weisbecker @ 2020-05-28 12:28 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, linux-kernel, x86, cai, mgorman

On Tue, May 26, 2020 at 06:10:59PM +0200, Peter Zijlstra wrote:
> The call_single_queue can contain (two) different callbacks,
> synchronous and asynchronous. The current interrupt handler runs them
> in-order, which means that remote CPUs that are waiting for their
> synchronous call can be delayed by running asynchronous callbacks.
> 
> Rework the interrupt handler to first run the synchronous callbacks.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I hope we are not unlucky enough to have dependencies between
some async and sync callbacks that require the whole execution
in order.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

(Sweet llist dance, I think I need fresh air and coffee now).


* Re: [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue
  2020-05-26 16:11 ` [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue Peter Zijlstra
@ 2020-05-28 23:40   ` Frederic Weisbecker
  2020-05-29 13:36     ` Peter Zijlstra
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 62+ messages in thread
From: Frederic Weisbecker @ 2020-05-28 23:40 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, linux-kernel, x86, cai, mgorman

On Tue, May 26, 2020 at 06:11:02PM +0200, Peter Zijlstra wrote:
> Currently irq_work_queue_on() will issue an unconditional
> arch_send_call_function_single_ipi() and has the handler do
> irq_work_run().
> 
> This is unfortunate in that it makes the IPI handler look at a second
> cacheline and it misses the opportunity to avoid the IPI. Instead note
> that struct irq_work and struct __call_single_data are very similar in
> layout, so use a few bits in the flags word to encode a type and stick
> the irq_work on the call_single_queue list.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/irq_work.h |    7 ++
>  include/linux/smp.h      |   23 ++++++++-
>  kernel/irq_work.c        |   53 +++++++++++---------
>  kernel/smp.c             |  119 ++++++++++++++++++++++++++++-------------------
>  4 files changed, 130 insertions(+), 72 deletions(-)
> 
> --- a/include/linux/irq_work.h
> +++ b/include/linux/irq_work.h
> @@ -13,6 +13,8 @@
>   * busy      NULL, 2 -> {free, claimed} : callback in progress, can be claimed
>   */
>  
> +/* flags share CSD_FLAG_ space */
> +
>  #define IRQ_WORK_PENDING	BIT(0)
>  #define IRQ_WORK_BUSY		BIT(1)
>  
> @@ -23,9 +25,12 @@
>  
>  #define IRQ_WORK_CLAIMED	(IRQ_WORK_PENDING | IRQ_WORK_BUSY)
>  
> +/*
> + * structure shares layout with single_call_data_t.
> + */
>  struct irq_work {
> -	atomic_t flags;
>  	struct llist_node llnode;
> +	atomic_t flags;


We should probably have:

struct csd_node {
       atomic_t flags;
       struct llist_node llnode;
};

embed inside struct irq_work and struct __call_single_data. Relying on
structure layout for things to work doesn't really clarify things :-)

Thanks.


* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2020-05-26 16:11 ` [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi() Peter Zijlstra
  2020-05-27  9:56   ` Peter Zijlstra
@ 2020-05-29 13:01   ` Frederic Weisbecker
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 62+ messages in thread
From: Frederic Weisbecker @ 2020-05-29 13:01 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, linux-kernel, x86, cai, mgorman

On Tue, May 26, 2020 at 06:11:01PM +0200, Peter Zijlstra wrote:
> +void flush_smp_call_function_from_idle(void)
> +{
> +	unsigned long flags;
> +
> +	if (llist_empty(this_cpu_ptr(&call_single_queue)))
> +		return;

Now it seems weird that sched_ttwu_pending() didn't have that
llist_empty() optimization. The ordering should allow it.

Anyway,

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>


* Re: [RFC][PATCH 3/7] smp: Move irq_work_run() out of flush_smp_call_function_queue()
  2020-05-26 16:11 ` [RFC][PATCH 3/7] smp: Move irq_work_run() out of flush_smp_call_function_queue() Peter Zijlstra
@ 2020-05-29 13:04   ` Frederic Weisbecker
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 62+ messages in thread
From: Frederic Weisbecker @ 2020-05-29 13:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, linux-kernel, x86, cai, mgorman

On Tue, May 26, 2020 at 06:11:00PM +0200, Peter Zijlstra wrote:
> This ensures flush_smp_call_function_queue() is strictly about
> call_single_queue.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/smp.c |   17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -84,6 +84,7 @@ int smpcfd_dying_cpu(unsigned int cpu)
>  	 * still pending.
>  	 */
>  	flush_smp_call_function_queue(false);
> +	irq_work_run();
>  	return 0;
>  }
>  
> @@ -191,6 +192,14 @@ static int generic_exec_single(int cpu,
>  void generic_smp_call_function_single_interrupt(void)
>  {
>  	flush_smp_call_function_queue(true);
> +
> +	/*
> +	 * Handle irq works queued remotely by irq_work_queue_on().
> +	 * Smp functions above are typically synchronous so they
> +	 * better run first since some other CPUs may be busy waiting
> +	 * for them.
> +	 */

You may want to update that comment once you merge remote irq_work and csd.

Thanks.

> +	irq_work_run();
>  }
>  
>  /**
> @@ -267,14 +276,6 @@ static void flush_smp_call_function_queu
>  		csd_unlock(csd);
>  		func(info);
>  	}
> -
> -	/*
> -	 * Handle irq works queued remotely by irq_work_queue_on().
> -	 * Smp functions above are typically synchronous so they
> -	 * better run first since some other CPUs may be busy waiting
> -	 * for them.
> -	 */
> -	irq_work_run();
>  }
>  
>  /*
> 
> 


* Re: [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue
  2020-05-28 23:40   ` Frederic Weisbecker
@ 2020-05-29 13:36     ` Peter Zijlstra
  2020-06-05  9:37       ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2020-05-29 13:36 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: tglx, linux-kernel, x86, cai, mgorman

On Fri, May 29, 2020 at 01:40:32AM +0200, Frederic Weisbecker wrote:
> On Tue, May 26, 2020 at 06:11:02PM +0200, Peter Zijlstra wrote:

> > +/*
> > + * structure shares layout with single_call_data_t.
> > + */
> >  struct irq_work {
> > -	atomic_t flags;
> >  	struct llist_node llnode;
> > +	atomic_t flags;
> 
> 
> We should probably have:
> 
> struct csd_node {
>        atomic_t flags;
> >        struct llist_node llnode;
> > };
> 
> embed inside struct irq_work and struct __call_single_data. Relying on
> structure layout for things to work doesn't really clarify things :-)

Yes I know, but changing those structures is going to cause an awful
lot of churn, and I didn't want to do that just now.. :-(

Also, there's more fun..

  CSD_TYPE_SYNC/ASYNC:

	struct {
		struct llist_node node;
		unsigned int flags;
		smp_call_func_t func;
		void *info;
	};

  CSD_TYPE_IRQ_WORK:

	struct {
		struct llist_node node;
		atomic_t flags;
		void (*func)(struct irq_work *);
	};

  CSD_TYPE_TTWU:

	struct {
		struct llist_node node;
		unsigned int flags;
	};

So while they all have a 'u32' sized @flags, irq_work wants it atomic.
Also, if we were to actually have the struct csd_node {}, you get a 4
byte hole when you embed it in task_struct.

This is all entirely fugly. No doubt about it.

But I failed to find a 'sane' way to express it and needed to get these
patches out because things were broken.

Maybe I can anonymous-union my way around it, dunno. I'll think about
it. I'm certainly not proud of this. But at least the BUILD_BUG_ON()s
should catch the more blatant breakage here.


* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-05-26 16:11 ` [RFC][PATCH 7/7] sched: Replace rq::wake_list Peter Zijlstra
@ 2020-05-29 15:10   ` Valdis Klētnieks
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  2020-06-04 14:18   ` [RFC][PATCH 7/7] " Guenter Roeck
  2 siblings, 0 replies; 62+ messages in thread
From: Valdis Klētnieks @ 2020-05-29 15:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, frederic, linux-kernel, x86, cai, mgorman


On Tue, 26 May 2020 18:11:04 +0200, Peter Zijlstra said:
> The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> got smp_call_function_single_async() subtly wrong. Even though it will
> return -EBUSY when trying to re-use a csd, that condition is not
> atomic and still requires external serialization.

>  kernel/smp.c          |   47 ++++++++++++++++++++++++++++++++++++++++-------

> --- a/kernel/smp.c
> +++ b/kernel/smp.c

> @@ -659,6 +685,13 @@ void __init smp_init(void)
>  	BUILD_BUG_ON(offsetof(struct irq_work, flags) !=
>  		     offsetof(struct __call_single_data, flags));
>
> +	/*
> +	 * Assert the CSD_TYPE_TTWU layout is similar enough
> +	 * for task_struct to be on the @call_single_queue.
> +	 */
> +	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
> +		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
> +
>  	idle_threads_init();
>  	cpuhp_threads_init();

This blows up on a 32-bit ARM allmodconfig:

  CC      kernel/smp.o
kernel/smp.c:319:6: warning: no previous prototype for 'flush_smp_call_function_from_idle' [-Wmissing-prototypes]
 void flush_smp_call_function_from_idle(void)
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from ./arch/arm/include/asm/atomic.h:11,
                 from ./include/linux/atomic.h:7,
                 from ./include/linux/llist.h:51,
                 from ./include/linux/irq_work.h:5,
                 from kernel/smp.c:10:
kernel/smp.c: In function 'smp_init':
./include/linux/compiler.h:392:38: error: call to '__compiletime_assert_152' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) != offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist)
  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                                      ^
./include/linux/compiler.h:373:4: note: in definition of macro '__compiletime_assert'
    prefix ## suffix();    \
    ^~~~~~
./include/linux/compiler.h:392:2: note: in expansion of macro '_compiletime_assert'
  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
  ^~~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
 #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                     ^~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:50:2: note: in expansion of macro 'BUILD_BUG_ON_MSG'
  BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
  ^~~~~~~~~~~~~~~~
kernel/smp.c:689:2: note: in expansion of macro 'BUILD_BUG_ON'
  BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
  ^~~~~~~~~~~~



* Re: [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB
  2020-05-26 16:10 ` [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB Peter Zijlstra
  2020-05-26 23:56   ` Frederic Weisbecker
  2020-05-27 10:23   ` Vincent Guittot
@ 2020-05-29 15:26   ` Valentin Schneider
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  3 siblings, 0 replies; 62+ messages in thread
From: Valentin Schneider @ 2020-05-29 15:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, frederic, linux-kernel, x86, cai, mgorman


On 26/05/20 17:10, Peter Zijlstra wrote:
> The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> got smp_call_function_single_async() subtly wrong. Even though it will
> return -EBUSY when trying to re-use a csd, that condition is not
> atomic and still requires external serialization.
>
> The change in kick_ilb() got this wrong.
>
> While on first reading kick_ilb() has an atomic test-and-set that
> appears to serialize the use, the matching 'release' is not in the
> right place to actually guarantee this serialization.
>

I got confused the first time I read through that, and had the same
thoughts as Vincent; reading Frederic's reply and the thread
helped. Could we mention the tick softirq vs ilb kick race
thing in the changelog?

Otherwise the change looks okay, I couldn't convince myself there were
more gaps left to fill.

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

> Rework the nohz_idle_balance() trigger so that the release is in the
> IPI callback and thus guarantees the required serialization for the
> CSD.
>
> Fixes: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> Reported-by: Qian Cai <cai@lca.pw>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>


* [tip: sched/core] sched: Replace rq::wake_list
  2020-05-26 16:11 ` [RFC][PATCH 7/7] sched: Replace rq::wake_list Peter Zijlstra
  2020-05-29 15:10   ` Valdis Klētnieks
@ 2020-06-01  9:52   ` tip-bot2 for Peter Zijlstra
  2020-06-02 15:16     ` Frederic Weisbecker
  2020-06-04 14:18   ` [RFC][PATCH 7/7] " Guenter Roeck
  2 siblings, 1 reply; 62+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-06-01  9:52 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     a148866489fbe243c936fe43e4525d8dbfa0318f
Gitweb:        https://git.kernel.org/tip/a148866489fbe243c936fe43e4525d8dbfa0318f
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 26 May 2020 18:11:04 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 May 2020 10:54:16 +02:00

sched: Replace rq::wake_list

The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
got smp_call_function_single_async() subtly wrong. Even though it will
return -EBUSY when trying to re-use a csd, that condition is not
atomic and still requires external serialization.

The change in ttwu_queue_remote() got this wrong.

While on first reading ttwu_queue_remote() has an atomic test-and-set
that appears to serialize the use, the matching 'release' is not in
the right place to actually guarantee this serialization.

The actual race is vs the sched_ttwu_pending() call in the idle loop;
that can run the wakeup-list without consuming the CSD.

Instead of trying to chain the lists, merge them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161908.129371594@infradead.org
---
 include/linux/sched.h |  1 +-
 include/linux/smp.h   |  1 +-
 kernel/sched/core.c   | 25 ++++++----------------
 kernel/sched/idle.c   |  1 +-
 kernel/sched/sched.h  |  8 +-------
 kernel/smp.c          | 47 +++++++++++++++++++++++++++++++++++-------
 6 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ebc6870..e0f5f41 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -654,6 +654,7 @@ struct task_struct {
 
 #ifdef CONFIG_SMP
 	struct llist_node		wake_entry;
+	unsigned int			wake_entry_type;
 	int				on_cpu;
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/* Current CPU: */
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 45ad6e3..84f90e2 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -25,6 +25,7 @@ enum {
 	CSD_TYPE_ASYNC		= 0x00,
 	CSD_TYPE_SYNC		= 0x10,
 	CSD_TYPE_IRQ_WORK	= 0x20,
+	CSD_TYPE_TTWU		= 0x30,
 	CSD_FLAG_TYPE_MASK	= 0xF0,
 };
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b71ed5e..b3c64c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1538,7 +1538,7 @@ static int migration_cpu_stop(void *data)
 	 * __migrate_task() such that we will not miss enforcing cpus_ptr
 	 * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
 	 */
-	sched_ttwu_pending();
+	flush_smp_call_function_from_idle();
 
 	raw_spin_lock(&p->pi_lock);
 	rq_lock(rq, &rf);
@@ -2272,14 +2272,13 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-void sched_ttwu_pending(void)
+void sched_ttwu_pending(void *arg)
 {
+	struct llist_node *llist = arg;
 	struct rq *rq = this_rq();
-	struct llist_node *llist;
 	struct task_struct *p, *t;
 	struct rq_flags rf;
 
-	llist = llist_del_all(&rq->wake_list);
 	if (!llist)
 		return;
 
@@ -2299,11 +2298,6 @@ void sched_ttwu_pending(void)
 	rq_unlock_irqrestore(rq, &rf);
 }
 
-static void wake_csd_func(void *info)
-{
-	sched_ttwu_pending();
-}
-
 void send_call_function_single_ipi(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -2327,12 +2321,7 @@ static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
 	WRITE_ONCE(rq->ttwu_pending, 1);
-	if (llist_add(&p->wake_entry, &rq->wake_list)) {
-		if (!set_nr_if_polling(rq->idle))
-			smp_call_function_single_async(cpu, &rq->wake_csd);
-		else
-			trace_sched_wake_idle_without_ipi(cpu);
-	}
+	__smp_call_single_queue(cpu, &p->wake_entry);
 }
 
 void wake_up_if_idle(int cpu)
@@ -2772,6 +2761,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->capture_control = NULL;
 #endif
 	init_numa_balancing(clone_flags, p);
+#ifdef CONFIG_SMP
+	p->wake_entry_type = CSD_TYPE_TTWU;
+#endif
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6564,7 +6556,6 @@ int sched_cpu_dying(unsigned int cpu)
 	struct rq_flags rf;
 
 	/* Handle pending wakeups and then migrate everything off */
-	sched_ttwu_pending();
 	sched_tick_stop(cpu);
 
 	rq_lock_irqsave(rq, &rf);
@@ -6763,8 +6754,6 @@ void __init sched_init(void)
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
-		rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
-
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
 		rq_attach_root(rq, &def_root_domain);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 387fd75..05deb81 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -294,7 +294,6 @@ static void do_idle(void)
 	 * critical section.
 	 */
 	flush_smp_call_function_from_idle();
-	sched_ttwu_pending();
 	schedule_idle();
 
 	if (unlikely(klp_patch_pending(current)))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c86fc94..1d4e94c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1023,11 +1023,6 @@ struct rq {
 	unsigned int		ttwu_local;
 #endif
 
-#ifdef CONFIG_SMP
-	call_single_data_t	wake_csd;
-	struct llist_head	wake_list;
-#endif
-
 #ifdef CONFIG_CPU_IDLE
 	/* Must be inspected within a rcu lock section */
 	struct cpuidle_state	*idle_state;
@@ -1371,8 +1366,6 @@ queue_balance_callback(struct rq *rq,
 	rq->balance_callback = head;
 }
 
-extern void sched_ttwu_pending(void);
-
 #define rcu_dereference_check_sched_domain(p) \
 	rcu_dereference_check((p), \
 			      lockdep_is_held(&sched_domains_mutex))
@@ -1512,7 +1505,6 @@ extern void flush_smp_call_function_from_idle(void);
 
 #else /* !CONFIG_SMP: */
 static inline void flush_smp_call_function_from_idle(void) { }
-static inline void sched_ttwu_pending(void) { }
 #endif
 
 #include "stats.h"
diff --git a/kernel/smp.c b/kernel/smp.c
index 856562b..0d61dc0 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -196,6 +196,7 @@ void generic_smp_call_function_single_interrupt(void)
 	flush_smp_call_function_queue(true);
 }
 
+extern void sched_ttwu_pending(void *);
 extern void irq_work_single(void *);
 
 /**
@@ -244,6 +245,10 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 					csd->func);
 				break;
 
+			case CSD_TYPE_TTWU:
+				pr_warn("IPI task-wakeup sent to offline CPU\n");
+				break;
+
 			default:
 				pr_warn("IPI callback, unknown type %d, sent to offline CPU\n",
 					CSD_TYPE(csd));
@@ -275,22 +280,43 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 		}
 	}
 
+	if (!entry)
+		return;
+
 	/*
 	 * Second; run all !SYNC callbacks.
 	 */
+	prev = NULL;
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
 		int type = CSD_TYPE(csd);
 
-		if (type == CSD_TYPE_ASYNC) {
-			smp_call_func_t func = csd->func;
-			void *info = csd->info;
+		if (type != CSD_TYPE_TTWU) {
+			if (prev) {
+				prev->next = &csd_next->llist;
+			} else {
+				entry = &csd_next->llist;
+			}
 
-			csd_unlock(csd);
-			func(info);
-		} else if (type == CSD_TYPE_IRQ_WORK) {
-			irq_work_single(csd);
+			if (type == CSD_TYPE_ASYNC) {
+				smp_call_func_t func = csd->func;
+				void *info = csd->info;
+
+				csd_unlock(csd);
+				func(info);
+			} else if (type == CSD_TYPE_IRQ_WORK) {
+				irq_work_single(csd);
+			}
+
+		} else {
+			prev = &csd->llist;
 		}
 	}
+
+	/*
+	 * Third; only CSD_TYPE_TTWU is left, issue those.
+	 */
+	if (entry)
+		sched_ttwu_pending(entry);
 }
 
 void flush_smp_call_function_from_idle(void)
@@ -659,6 +685,13 @@ void __init smp_init(void)
 	BUILD_BUG_ON(offsetof(struct irq_work, flags) !=
 		     offsetof(struct __call_single_data, flags));
 
+	/*
+	 * Assert the CSD_TYPE_TTWU layout is similar enough
+	 * for task_struct to be on the @call_single_queue.
+	 */
+	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
+		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
+
 	idle_threads_init();
 	cpuhp_threads_init();
 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [tip: sched/core] irq_work, smp: Allow irq_work on call_single_queue
  2020-05-26 16:11 ` [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue Peter Zijlstra
  2020-05-28 23:40   ` Frederic Weisbecker
@ 2020-06-01  9:52   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 62+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-06-01  9:52 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     4b44a21dd640b692d4e9b12d3e37c24825f90baa
Gitweb:        https://git.kernel.org/tip/4b44a21dd640b692d4e9b12d3e37c24825f90baa
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 26 May 2020 18:11:02 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 May 2020 10:54:15 +02:00

irq_work, smp: Allow irq_work on call_single_queue

Currently irq_work_queue_on() will issue an unconditional
arch_send_call_function_single_ipi() and has the handler do
irq_work_run().

This is unfortunate in that it makes the IPI handler look at a second
cacheline and it misses the opportunity to avoid the IPI. Instead note
that struct irq_work and struct __call_single_data are very similar in
layout, so use a few bits in the flags word to encode a type and stick
the irq_work on the call_single_queue list.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161908.011635912@infradead.org
---
 include/linux/irq_work.h |   7 +-
 include/linux/smp.h      |  23 ++++++-
 kernel/irq_work.c        |  53 +++++++++--------
 kernel/smp.c             | 119 +++++++++++++++++++++++---------------
 4 files changed, 130 insertions(+), 72 deletions(-)

diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h
index 3b752e8..f23a359 100644
--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -13,6 +13,8 @@
  * busy      NULL, 2 -> {free, claimed} : callback in progress, can be claimed
  */
 
+/* flags share CSD_FLAG_ space */
+
 #define IRQ_WORK_PENDING	BIT(0)
 #define IRQ_WORK_BUSY		BIT(1)
 
@@ -23,9 +25,12 @@
 
 #define IRQ_WORK_CLAIMED	(IRQ_WORK_PENDING | IRQ_WORK_BUSY)
 
+/*
+ * structure shares layout with call_single_data_t.
+ */
 struct irq_work {
-	atomic_t flags;
 	struct llist_node llnode;
+	atomic_t flags;
 	void (*func)(struct irq_work *);
 };
 
diff --git a/include/linux/smp.h b/include/linux/smp.h
index cbc9162..45ad6e3 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -16,17 +16,38 @@
 
 typedef void (*smp_call_func_t)(void *info);
 typedef bool (*smp_cond_func_t)(int cpu, void *info);
+
+enum {
+	CSD_FLAG_LOCK		= 0x01,
+
+	/* IRQ_WORK_flags */
+
+	CSD_TYPE_ASYNC		= 0x00,
+	CSD_TYPE_SYNC		= 0x10,
+	CSD_TYPE_IRQ_WORK	= 0x20,
+	CSD_FLAG_TYPE_MASK	= 0xF0,
+};
+
+/*
+ * structure shares (partial) layout with struct irq_work
+ */
 struct __call_single_data {
 	struct llist_node llist;
+	unsigned int flags;
 	smp_call_func_t func;
 	void *info;
-	unsigned int flags;
 };
 
 /* Use __aligned() to avoid to use 2 cache lines for 1 csd */
 typedef struct __call_single_data call_single_data_t
 	__aligned(sizeof(struct __call_single_data));
 
+/*
+ * Enqueue a llist_node on the call_single_queue; be very careful, read
+ * flush_smp_call_function_queue() in detail.
+ */
+extern void __smp_call_single_queue(int cpu, struct llist_node *node);
+
 /* total number of cpus in this system (may exceed NR_CPUS) */
 extern unsigned int total_cpus;
 
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 48b5d1b..eca8396 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -31,7 +31,7 @@ static bool irq_work_claim(struct irq_work *work)
 {
 	int oflags;
 
-	oflags = atomic_fetch_or(IRQ_WORK_CLAIMED, &work->flags);
+	oflags = atomic_fetch_or(IRQ_WORK_CLAIMED | CSD_TYPE_IRQ_WORK, &work->flags);
 	/*
 	 * If the work is already pending, no need to raise the IPI.
 	 * The pairing atomic_fetch_andnot() in irq_work_run() makes sure
@@ -102,8 +102,7 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
 	if (cpu != smp_processor_id()) {
 		/* Arch remote IPI send/receive backend aren't NMI safe */
 		WARN_ON_ONCE(in_nmi());
-		if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
-			arch_send_call_function_single_ipi(cpu);
+		__smp_call_single_queue(cpu, &work->llnode);
 	} else {
 		__irq_work_queue_local(work);
 	}
@@ -131,6 +130,31 @@ bool irq_work_needs_cpu(void)
 	return true;
 }
 
+void irq_work_single(void *arg)
+{
+	struct irq_work *work = arg;
+	int flags;
+
+	/*
+	 * Clear the PENDING bit, after this point the @work
+	 * can be re-used.
+	 * Make it immediately visible so that other CPUs trying
+	 * to claim that work don't rely on us to handle their data
+	 * while we are in the middle of the func.
+	 */
+	flags = atomic_fetch_andnot(IRQ_WORK_PENDING, &work->flags);
+
+	lockdep_irq_work_enter(work);
+	work->func(work);
+	lockdep_irq_work_exit(work);
+	/*
+	 * Clear the BUSY bit and return to the free state if
+	 * no-one else claimed it meanwhile.
+	 */
+	flags &= ~IRQ_WORK_PENDING;
+	(void)atomic_cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);
+}
+
 static void irq_work_run_list(struct llist_head *list)
 {
 	struct irq_work *work, *tmp;
@@ -142,27 +166,8 @@ static void irq_work_run_list(struct llist_head *list)
 		return;
 
 	llnode = llist_del_all(list);
-	llist_for_each_entry_safe(work, tmp, llnode, llnode) {
-		int flags;
-		/*
-		 * Clear the PENDING bit, after this point the @work
-		 * can be re-used.
-		 * Make it immediately visible so that other CPUs trying
-		 * to claim that work don't rely on us to handle their data
-		 * while we are in the middle of the func.
-		 */
-		flags = atomic_fetch_andnot(IRQ_WORK_PENDING, &work->flags);
-
-		lockdep_irq_work_enter(work);
-		work->func(work);
-		lockdep_irq_work_exit(work);
-		/*
-		 * Clear the BUSY bit and return to the free state if
-		 * no-one else claimed it meanwhile.
-		 */
-		flags &= ~IRQ_WORK_PENDING;
-		(void)atomic_cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);
-	}
+	llist_for_each_entry_safe(work, tmp, llnode, llnode)
+		irq_work_single(work);
 }
 
 /*
diff --git a/kernel/smp.c b/kernel/smp.c
index 9f11813..856562b 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -23,10 +23,8 @@
 
 #include "smpboot.h"
 
-enum {
-	CSD_FLAG_LOCK		= 0x01,
-	CSD_FLAG_SYNCHRONOUS	= 0x02,
-};
+
+#define CSD_TYPE(_csd)	((_csd)->flags & CSD_FLAG_TYPE_MASK)
 
 struct call_function_data {
 	call_single_data_t	__percpu *csd;
@@ -137,15 +135,33 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
 
 extern void send_call_function_single_ipi(int cpu);
 
+void __smp_call_single_queue(int cpu, struct llist_node *node)
+{
+	/*
+	 * The list addition should be visible before sending the IPI
+	 * handler locks the list to pull the entry off it because of
+	 * normal cache coherency rules implied by spinlocks.
+	 *
+	 * If IPIs can go out of order to the cache coherency protocol
+	 * in an architecture, sufficient synchronisation should be added
+	 * to arch code to make it appear to obey cache coherency WRT
+	 * locking and barrier primitives. Generic code isn't really
+	 * equipped to do the right thing...
+	 */
+	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
+		send_call_function_single_ipi(cpu);
+}
+
 /*
  * Insert a previously allocated call_single_data_t element
  * for execution on the given CPU. data must already have
  * ->func, ->info, and ->flags set.
  */
-static int generic_exec_single(int cpu, call_single_data_t *csd,
-			       smp_call_func_t func, void *info)
+static int generic_exec_single(int cpu, call_single_data_t *csd)
 {
 	if (cpu == smp_processor_id()) {
+		smp_call_func_t func = csd->func;
+		void *info = csd->info;
 		unsigned long flags;
 
 		/*
@@ -159,28 +175,12 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
 		return 0;
 	}
 
-
 	if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu)) {
 		csd_unlock(csd);
 		return -ENXIO;
 	}
 
-	csd->func = func;
-	csd->info = info;
-
-	/*
-	 * The list addition should be visible before sending the IPI
-	 * handler locks the list to pull the entry off it because of
-	 * normal cache coherency rules implied by spinlocks.
-	 *
-	 * If IPIs can go out of order to the cache coherency protocol
-	 * in an architecture, sufficient synchronisation should be added
-	 * to arch code to make it appear to obey cache coherency WRT
-	 * locking and barrier primitives. Generic code isn't really
-	 * equipped to do the right thing...
-	 */
-	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
-		send_call_function_single_ipi(cpu);
+	__smp_call_single_queue(cpu, &csd->llist);
 
 	return 0;
 }
@@ -194,16 +194,10 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
 void generic_smp_call_function_single_interrupt(void)
 {
 	flush_smp_call_function_queue(true);
-
-	/*
-	 * Handle irq works queued remotely by irq_work_queue_on().
-	 * Smp functions above are typically synchronous so they
-	 * better run first since some other CPUs may be busy waiting
-	 * for them.
-	 */
-	irq_work_run();
 }
 
+extern void irq_work_single(void *);
+
 /**
  * flush_smp_call_function_queue - Flush pending smp-call-function callbacks
  *
@@ -241,9 +235,21 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 		 * We don't have to use the _safe() variant here
 		 * because we are not invoking the IPI handlers yet.
 		 */
-		llist_for_each_entry(csd, entry, llist)
-			pr_warn("IPI callback %pS sent to offline CPU\n",
-				csd->func);
+		llist_for_each_entry(csd, entry, llist) {
+			switch (CSD_TYPE(csd)) {
+			case CSD_TYPE_ASYNC:
+			case CSD_TYPE_SYNC:
+			case CSD_TYPE_IRQ_WORK:
+				pr_warn("IPI callback %pS sent to offline CPU\n",
+					csd->func);
+				break;
+
+			default:
+				pr_warn("IPI callback, unknown type %d, sent to offline CPU\n",
+					CSD_TYPE(csd));
+				break;
+			}
+		}
 	}
 
 	/*
@@ -251,16 +257,17 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 	 */
 	prev = NULL;
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
-		smp_call_func_t func = csd->func;
-		void *info = csd->info;
-
 		/* Do we wait until *after* callback? */
-		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
+		if (CSD_TYPE(csd) == CSD_TYPE_SYNC) {
+			smp_call_func_t func = csd->func;
+			void *info = csd->info;
+
 			if (prev) {
 				prev->next = &csd_next->llist;
 			} else {
 				entry = &csd_next->llist;
 			}
+
 			func(info);
 			csd_unlock(csd);
 		} else {
@@ -272,11 +279,17 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 	 * Second; run all !SYNC callbacks.
 	 */
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
-		smp_call_func_t func = csd->func;
-		void *info = csd->info;
+		int type = CSD_TYPE(csd);
 
-		csd_unlock(csd);
-		func(info);
+		if (type == CSD_TYPE_ASYNC) {
+			smp_call_func_t func = csd->func;
+			void *info = csd->info;
+
+			csd_unlock(csd);
+			func(info);
+		} else if (type == CSD_TYPE_IRQ_WORK) {
+			irq_work_single(csd);
+		}
 	}
 }
 
@@ -305,7 +318,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 {
 	call_single_data_t *csd;
 	call_single_data_t csd_stack = {
-		.flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS,
+		.flags = CSD_FLAG_LOCK | CSD_TYPE_SYNC,
 	};
 	int this_cpu;
 	int err;
@@ -339,7 +352,10 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 		csd_lock(csd);
 	}
 
-	err = generic_exec_single(cpu, csd, func, info);
+	csd->func = func;
+	csd->info = info;
+
+	err = generic_exec_single(cpu, csd);
 
 	if (wait)
 		csd_lock_wait(csd);
@@ -385,7 +401,7 @@ int smp_call_function_single_async(int cpu, call_single_data_t *csd)
 	csd->flags = CSD_FLAG_LOCK;
 	smp_wmb();
 
-	err = generic_exec_single(cpu, csd, csd->func, csd->info);
+	err = generic_exec_single(cpu, csd);
 
 out:
 	preempt_enable();
@@ -500,7 +516,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 
 		csd_lock(csd);
 		if (wait)
-			csd->flags |= CSD_FLAG_SYNCHRONOUS;
+			csd->flags |= CSD_TYPE_SYNC;
 		csd->func = func;
 		csd->info = info;
 		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
@@ -632,6 +648,17 @@ void __init smp_init(void)
 {
 	int num_nodes, num_cpus;
 
+	/*
+	 * Ensure struct irq_work layout matches so that
+	 * flush_smp_call_function_queue() can do horrible things.
+	 */
+	BUILD_BUG_ON(offsetof(struct irq_work, llnode) !=
+		     offsetof(struct __call_single_data, llist));
+	BUILD_BUG_ON(offsetof(struct irq_work, func) !=
+		     offsetof(struct __call_single_data, func));
+	BUILD_BUG_ON(offsetof(struct irq_work, flags) !=
+		     offsetof(struct __call_single_data, flags));
+
 	idle_threads_init();
 	cpuhp_threads_init();
 


* [tip: sched/core] sched: Add rq::ttwu_pending
  2020-05-26 16:11 ` [RFC][PATCH 6/7] sched: Add rq::ttwu_pending Peter Zijlstra
@ 2020-06-01  9:52   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 62+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-06-01  9:52 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     126c2092e5c8b28623cb890cd2930aa292410676
Gitweb:        https://git.kernel.org/tip/126c2092e5c8b28623cb890cd2930aa292410676
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 26 May 2020 18:11:03 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 May 2020 10:54:16 +02:00

sched: Add rq::ttwu_pending

In preparation of removing rq->wake_list, replace the
!list_empty(rq->wake_list) with rq->ttwu_pending. This is not fully
equivalent as this new variable is racy.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161908.070399698@infradead.org
---
 kernel/sched/core.c  | 13 +++++++++++--
 kernel/sched/debug.c |  1 -
 kernel/sched/fair.c  |  2 +-
 kernel/sched/sched.h |  4 +++-
 4 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fa0d499..b71ed5e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2275,13 +2275,21 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 void sched_ttwu_pending(void)
 {
 	struct rq *rq = this_rq();
-	struct llist_node *llist = llist_del_all(&rq->wake_list);
+	struct llist_node *llist;
 	struct task_struct *p, *t;
 	struct rq_flags rf;
 
+	llist = llist_del_all(&rq->wake_list);
 	if (!llist)
 		return;
 
+	/*
+	 * rq::ttwu_pending is a racy indication of out-standing wakeups.
+	 * It races such that false-negatives are possible, since they
+	 * are shorter lived than false-positives would be.
+	 */
+	WRITE_ONCE(rq->ttwu_pending, 0);
+
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
@@ -2318,6 +2326,7 @@ static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags
 
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
+	WRITE_ONCE(rq->ttwu_pending, 1);
 	if (llist_add(&p->wake_entry, &rq->wake_list)) {
 		if (!set_nr_if_polling(rq->idle))
 			smp_call_function_single_async(cpu, &rq->wake_csd);
@@ -4705,7 +4714,7 @@ int idle_cpu(int cpu)
 		return 0;
 
 #ifdef CONFIG_SMP
-	if (!llist_empty(&rq->wake_list))
+	if (rq->ttwu_pending)
 		return 0;
 #endif
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1c24a6b..36c5426 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -638,7 +638,6 @@ do {									\
 
 	P(nr_running);
 	P(nr_switches);
-	P(nr_load_updates);
 	P(nr_uninterruptible);
 	PN(next_balance);
 	SEQ_printf(m, "  .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq->curr)));
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2890bd5..0ed04d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8590,7 +8590,7 @@ static int idle_cpu_without(int cpu, struct task_struct *p)
 	 */
 
 #ifdef CONFIG_SMP
-	if (!llist_empty(&rq->wake_list))
+	if (rq->ttwu_pending)
 		return 0;
 #endif
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 75b0629..c86fc94 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -895,7 +895,9 @@ struct rq {
 	atomic_t		nohz_flags;
 #endif /* CONFIG_NO_HZ_COMMON */
 
-	unsigned long		nr_load_updates;
+#ifdef CONFIG_SMP
+	unsigned int		ttwu_pending;
+#endif
 	u64			nr_switches;
 
 #ifdef CONFIG_UCLAMP_TASK


* [tip: sched/core] smp: Move irq_work_run() out of flush_smp_call_function_queue()
  2020-05-26 16:11 ` [RFC][PATCH 3/7] smp: Move irq_work_run() out of flush_smp_call_function_queue() Peter Zijlstra
  2020-05-29 13:04   ` Frederic Weisbecker
@ 2020-06-01  9:52   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 62+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-06-01  9:52 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     afaa653c564da38c0b34c4baba31e88c46a8764c
Gitweb:        https://git.kernel.org/tip/afaa653c564da38c0b34c4baba31e88c46a8764c
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 26 May 2020 18:11:00 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 May 2020 10:54:15 +02:00

smp: Move irq_work_run() out of flush_smp_call_function_queue()

This ensures flush_smp_call_function_queue() is strictly about
call_single_queue.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161907.895109676@infradead.org
---
 kernel/smp.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index db2f738..f720e38 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -84,6 +84,7 @@ int smpcfd_dying_cpu(unsigned int cpu)
 	 * still pending.
 	 */
 	flush_smp_call_function_queue(false);
+	irq_work_run();
 	return 0;
 }
 
@@ -191,6 +192,14 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
 void generic_smp_call_function_single_interrupt(void)
 {
 	flush_smp_call_function_queue(true);
+
+	/*
+	 * Handle irq works queued remotely by irq_work_queue_on().
+	 * Smp functions above are typically synchronous so they
+	 * better run first since some other CPUs may be busy waiting
+	 * for them.
+	 */
+	irq_work_run();
 }
 
 /**
@@ -267,14 +276,6 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 		csd_unlock(csd);
 		func(info);
 	}
-
-	/*
-	 * Handle irq works queued remotely by irq_work_queue_on().
-	 * Smp functions above are typically synchronous so they
-	 * better run first since some other CPUs may be busy waiting
-	 * for them.
-	 */
-	irq_work_run();
 }
 
 /*


* [tip: sched/core] smp: Optimize send_call_function_single_ipi()
  2020-05-26 16:11 ` [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi() Peter Zijlstra
  2020-05-27  9:56   ` Peter Zijlstra
  2020-05-29 13:01   ` Frederic Weisbecker
@ 2020-06-01  9:52   ` tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 62+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-06-01  9:52 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     b2a02fc43a1f40ef4eb2fb2b06357382608d4d84
Gitweb:        https://git.kernel.org/tip/b2a02fc43a1f40ef4eb2fb2b06357382608d4d84
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 26 May 2020 18:11:01 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 May 2020 10:54:15 +02:00

smp: Optimize send_call_function_single_ipi()

Just like the ttwu_queue_remote() IPI, make use of _TIF_POLLING_NRFLAG
to avoid sending IPIs to idle CPUs.

[ mingo: Fix UP build bug. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161907.953304789@infradead.org
---
 kernel/sched/core.c  | 10 ++++++++++
 kernel/sched/idle.c  |  5 +++++
 kernel/sched/sched.h |  7 ++++---
 kernel/smp.c         | 16 +++++++++++++++-
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2cacc1e..fa0d499 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2296,6 +2296,16 @@ static void wake_csd_func(void *info)
 	sched_ttwu_pending();
 }
 
+void send_call_function_single_ipi(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	if (!set_nr_if_polling(rq->idle))
+		arch_send_call_function_single_ipi(cpu);
+	else
+		trace_sched_wake_idle_without_ipi(cpu);
+}
+
 /*
  * Queue a task on the target CPUs wake_list and wake the CPU via IPI if
  * necessary. The wakee CPU on receipt of the IPI will queue the task
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b743bf3..387fd75 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -289,6 +289,11 @@ static void do_idle(void)
 	 */
 	smp_mb__after_atomic();
 
+	/*
+	 * RCU relies on this call to be done outside of an RCU read-side
+	 * critical section.
+	 */
+	flush_smp_call_function_from_idle();
 	sched_ttwu_pending();
 	schedule_idle();
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3c163cb..75b0629 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1506,11 +1506,12 @@ static inline void unregister_sched_domain_sysctl(void)
 }
 #endif
 
-#else
+extern void flush_smp_call_function_from_idle(void);
 
+#else /* !CONFIG_SMP: */
+static inline void flush_smp_call_function_from_idle(void) { }
 static inline void sched_ttwu_pending(void) { }
-
-#endif /* CONFIG_SMP */
+#endif
 
 #include "stats.h"
 #include "autogroup.h"
diff --git a/kernel/smp.c b/kernel/smp.c
index f720e38..9f11813 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -135,6 +135,8 @@ static __always_inline void csd_unlock(call_single_data_t *csd)
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
 
+extern void send_call_function_single_ipi(int cpu);
+
 /*
  * Insert a previously allocated call_single_data_t element
  * for execution on the given CPU. data must already have
@@ -178,7 +180,7 @@ static int generic_exec_single(int cpu, call_single_data_t *csd,
 	 * equipped to do the right thing...
 	 */
 	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
-		arch_send_call_function_single_ipi(cpu);
+		send_call_function_single_ipi(cpu);
 
 	return 0;
 }
@@ -278,6 +280,18 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 	}
 }
 
+void flush_smp_call_function_from_idle(void)
+{
+	unsigned long flags;
+
+	if (llist_empty(this_cpu_ptr(&call_single_queue)))
+		return;
+
+	local_irq_save(flags);
+	flush_smp_call_function_queue(true);
+	local_irq_restore(flags);
+}
+
 /*
  * smp_call_function_single - Run a function on a specific CPU
  * @func: The function to run. This must be fast and non-blocking.


* [tip: sched/core] sched: Fix smp_call_function_single_async() usage for ILB
  2020-05-26 16:10 ` [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB Peter Zijlstra
                     ` (2 preceding siblings ...)
  2020-05-29 15:26   ` Valentin Schneider
@ 2020-06-01  9:52   ` tip-bot2 for Peter Zijlstra
  2020-06-01 11:40     ` Frederic Weisbecker
  3 siblings, 1 reply; 62+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-06-01  9:52 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Qian Cai, Peter Zijlstra (Intel),
	Ingo Molnar, Frederic Weisbecker, mgorman, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     19a1f5ec699954d21be10f74ff71c2a7079e99ad
Gitweb:        https://git.kernel.org/tip/19a1f5ec699954d21be10f74ff71c2a7079e99ad
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 26 May 2020 18:10:58 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 May 2020 10:54:15 +02:00

sched: Fix smp_call_function_single_async() usage for ILB

The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
got smp_call_function_single_async() subtly wrong. Even though it will
return -EBUSY when trying to re-use a csd, that condition is not
atomic and still requires external serialization.

The change in kick_ilb() got this wrong.

While on first reading kick_ilb() has an atomic test-and-set that
appears to serialize the use, the matching 'release' is not in the
right place to actually guarantee this serialization.

Rework the nohz_idle_balance() trigger so that the release is in the
IPI callback and thus guarantees the required serialization for the
CSD.

Fixes: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: mgorman@techsingularity.net
Link: https://lore.kernel.org/r/20200526161907.778543557@infradead.org
---
 kernel/sched/core.c  | 36 ++++++++++--------------------------
 kernel/sched/fair.c  | 18 ++++++++----------
 kernel/sched/sched.h |  1 +
 3 files changed, 19 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95e457d..2cacc1e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -637,41 +637,25 @@ void wake_up_nohz_cpu(int cpu)
 		wake_up_idle_cpu(cpu);
 }
 
-static inline bool got_nohz_idle_kick(void)
+static void nohz_csd_func(void *info)
 {
-	int cpu = smp_processor_id();
-
-	if (!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK))
-		return false;
-
-	if (idle_cpu(cpu) && !need_resched())
-		return true;
+	struct rq *rq = info;
+	int cpu = cpu_of(rq);
+	unsigned int flags;
 
 	/*
-	 * We can't run Idle Load Balance on this CPU for this time so we
-	 * cancel it and clear NOHZ_BALANCE_KICK
+	 * Release the rq::nohz_csd.
 	 */
-	atomic_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
-	return false;
-}
+	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
+	WARN_ON(!(flags & NOHZ_KICK_MASK));
 
-static void nohz_csd_func(void *info)
-{
-	struct rq *rq = info;
-
-	if (got_nohz_idle_kick()) {
-		rq->idle_balance = 1;
+	rq->idle_balance = idle_cpu(cpu);
+	if (rq->idle_balance && !need_resched()) {
+		rq->nohz_idle_balance = flags;
 		raise_softirq_irqoff(SCHED_SOFTIRQ);
 	}
 }
 
-#else /* CONFIG_NO_HZ_COMMON */
-
-static inline bool got_nohz_idle_kick(void)
-{
-	return false;
-}
-
 #endif /* CONFIG_NO_HZ_COMMON */
 
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dda9b19..2890bd5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10024,6 +10024,10 @@ static void kick_ilb(unsigned int flags)
 	if (ilb_cpu >= nr_cpu_ids)
 		return;
 
+	/*
+	 * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
+	 * the first flag owns it; cleared by nohz_csd_func().
+	 */
 	flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
 	if (flags & NOHZ_KICK_MASK)
 		return;
@@ -10371,20 +10375,14 @@ abort:
  */
 static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 {
-	int this_cpu = this_rq->cpu;
-	unsigned int flags;
+	unsigned int flags = this_rq->nohz_idle_balance;
 
-	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
+	if (!flags)
 		return false;
 
-	if (idle != CPU_IDLE) {
-		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
-		return false;
-	}
+	this_rq->nohz_idle_balance = 0;
 
-	/* could be _relaxed() */
-	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
-	if (!(flags & NOHZ_KICK_MASK))
+	if (idle != CPU_IDLE)
 		return false;
 
 	_nohz_idle_balance(this_rq, flags, idle);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4b32cff..3c163cb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -951,6 +951,7 @@ struct rq {
 
 	struct callback_head	*balance_callback;
 
+	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
 
 	unsigned long		misfit_task_load;


* [tip: sched/core] smp: Optimize flush_smp_call_function_queue()
  2020-05-26 16:10 ` [RFC][PATCH 2/7] smp: Optimize flush_smp_call_function_queue() Peter Zijlstra
  2020-05-28 12:28   ` Frederic Weisbecker
@ 2020-06-01  9:52   ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 62+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-06-01  9:52 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     52103be07d8b08311955f8c30e535c2dda290cf4
Gitweb:        https://git.kernel.org/tip/52103be07d8b08311955f8c30e535c2dda290cf4
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 26 May 2020 18:10:59 +02:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 28 May 2020 10:54:15 +02:00

smp: Optimize flush_smp_call_function_queue()

The call_single_queue can contain (two) different callbacks,
synchronous and asynchronous. The current interrupt handler runs them
in-order, which means that remote CPUs that are waiting for their
synchronous call can be delayed by running asynchronous callbacks.

Rework the interrupt handler to first run the synchronous callbacks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161907.836818381@infradead.org
---
 kernel/smp.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 786092a..db2f738 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -209,9 +209,9 @@ void generic_smp_call_function_single_interrupt(void)
  */
 static void flush_smp_call_function_queue(bool warn_cpu_offline)
 {
-	struct llist_head *head;
-	struct llist_node *entry;
 	call_single_data_t *csd, *csd_next;
+	struct llist_node *entry, *prev;
+	struct llist_head *head;
 	static bool warned;
 
 	lockdep_assert_irqs_disabled();
@@ -235,21 +235,40 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 				csd->func);
 	}
 
+	/*
+	 * First; run all SYNC callbacks, people are waiting for us.
+	 */
+	prev = NULL;
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
 		smp_call_func_t func = csd->func;
 		void *info = csd->info;
 
 		/* Do we wait until *after* callback? */
 		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
+			if (prev) {
+				prev->next = &csd_next->llist;
+			} else {
+				entry = &csd_next->llist;
+			}
 			func(info);
 			csd_unlock(csd);
 		} else {
-			csd_unlock(csd);
-			func(info);
+			prev = &csd->llist;
 		}
 	}
 
 	/*
+	 * Second; run all !SYNC callbacks.
+	 */
+	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
+		smp_call_func_t func = csd->func;
+		void *info = csd->info;
+
+		csd_unlock(csd);
+		func(info);
+	}
+
+	/*
 	 * Handle irq works queued remotely by irq_work_queue_on().
 	 * Smp functions above are typically synchronous so they
 	 * better run first since some other CPUs may be busy waiting

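[Editor's note: the two-pass split in the patch above can be sketched in plain userspace C. This is an illustration of the technique only, not kernel code; `struct node` and `flush_queue()` are simplified stand-ins for `llist_node`/`call_single_data_t` and `flush_smp_call_function_queue()`.]

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a call_single_queue entry. */
struct node {
	struct node *next;
	int sync;		/* analogue of CSD_FLAG_SYNCHRONOUS */
	void (*func)(int *);	/* callback */
	int *info;
};

/*
 * First pass: run and unlink every sync entry, keeping async entries
 * chained together via 'prev', exactly as the patch does with
 * prev->next / entry. Second pass: run the remaining async entries.
 */
static void flush_queue(struct node *head)
{
	struct node *prev = NULL, *n, *next;

	for (n = head; n; n = next) {
		next = n->next;
		if (n->sync) {
			if (prev)
				prev->next = next;	/* unlink sync entry */
			else
				head = next;
			n->func(n->info);		/* waiter released */
		} else {
			prev = n;			/* leave async in place */
		}
	}

	for (n = head; n; n = next) {	/* only async entries remain */
		next = n->next;
		n->func(n->info);
	}
}
```

With a callback that records its invocation order, the sync-before-async property is easy to verify.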
^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [tip: sched/core] sched: Fix smp_call_function_single_async() usage for ILB
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2020-06-01 11:40     ` Frederic Weisbecker
  0 siblings, 0 replies; 62+ messages in thread
From: Frederic Weisbecker @ 2020-06-01 11:40 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar
  Cc: linux-tip-commits, Qian Cai, Peter Zijlstra (Intel), mgorman, x86

On Mon, Jun 01, 2020 at 09:52:21AM -0000, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     19a1f5ec699954d21be10f74ff71c2a7079e99ad
> Gitweb:        https://git.kernel.org/tip/19a1f5ec699954d21be10f74ff71c2a7079e99ad
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Tue, 26 May 2020 18:10:58 +02:00
> Committer:     Ingo Molnar <mingo@kernel.org>
> CommitterDate: Thu, 28 May 2020 10:54:15 +02:00
> 
> sched: Fix smp_call_function_single_async() usage for ILB
> 
> The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> got smp_call_function_single_async() subtly wrong. Even though it will
> return -EBUSY when trying to re-use a csd, that condition is not
> atomic and still requires external serialization.
> 
> The change in kick_ilb() got this wrong.
> 
> While on first reading kick_ilb() has an atomic test-and-set that
> appears to serialize the use, the matching 'release' is not in the
> right place to actually guarantee this serialization.
> 
> Rework the nohz_idle_balance() trigger so that the release is in the
> IPI callback and thus guarantees the required serialization for the
> CSD.
> 
> Fixes: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> Reported-by: Qian Cai <cai@lca.pw>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> Cc: mgorman@techsingularity.net
> Link: https://lore.kernel.org/r/20200526161907.778543557@infradead.org

Many patches in the series lack some Reviewed-by tags.

Thanks.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [tip: sched/core] sched: Replace rq::wake_list
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2020-06-02 15:16     ` Frederic Weisbecker
  0 siblings, 0 replies; 62+ messages in thread
From: Frederic Weisbecker @ 2020-06-02 15:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-tip-commits, Peter Zijlstra (Intel), Ingo Molnar, x86

On Mon, Jun 01, 2020 at 09:52:18AM -0000, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     a148866489fbe243c936fe43e4525d8dbfa0318f
> Gitweb:        https://git.kernel.org/tip/a148866489fbe243c936fe43e4525d8dbfa0318f
> Author:        Peter Zijlstra <peterz@infradead.org>
> AuthorDate:    Tue, 26 May 2020 18:11:04 +02:00
> Committer:     Ingo Molnar <mingo@kernel.org>
> CommitterDate: Thu, 28 May 2020 10:54:16 +02:00
> 
> sched: Replace rq::wake_list
> 
> The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> got smp_call_function_single_async() subtly wrong. Even though it will
> return -EBUSY when trying to re-use a csd, that condition is not
> atomic and still requires external serialization.
> 
> The change in ttwu_queue_remote() got this wrong.
> 
> While on first reading ttwu_queue_remote() has an atomic test-and-set
> that appears to serialize the use, the matching 'release' is not in
> the right place to actually guarantee this serialization.
> 
> The actual race is vs the sched_ttwu_pending() call in the idle loop;
> that can run the wakeup-list without consuming the CSD.
> 
> Instead of trying to chain the lists, merge them.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Link: https://lore.kernel.org/r/20200526161908.129371594@infradead.org

Looks good, thanks :)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-05-26 16:11 ` [RFC][PATCH 7/7] sched: Replace rq::wake_list Peter Zijlstra
  2020-05-29 15:10   ` Valdis Klētnieks
  2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2020-06-04 14:18   ` Guenter Roeck
  2020-06-05  0:24     ` Eric Biggers
  2020-06-05  8:10     ` Peter Zijlstra
  2 siblings, 2 replies; 62+ messages in thread
From: Guenter Roeck @ 2020-06-04 14:18 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, frederic, linux-kernel, x86, cai, mgorman

On Tue, May 26, 2020 at 06:11:04PM +0200, Peter Zijlstra wrote:
> The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> got smp_call_function_single_async() subtly wrong. Even though it will
> return -EBUSY when trying to re-use a csd, that condition is not
> atomic and still requires external serialization.
> 
> The change in ttwu_queue_remote() got this wrong.
> 
> While on first reading ttwu_queue_remote() has an atomic test-and-set
> that appears to serialize the use, the matching 'release' is not in
> the right place to actually guarantee this serialization.
> 
> The actual race is vs the sched_ttwu_pending() call in the idle loop;
> that can run the wakeup-list without consuming the CSD.
> 
> Instead of trying to chain the lists, merge them.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
...
> +	/*
> +	 * Assert the CSD_TYPE_TTWU layout is similar enough
> +	 * for task_struct to be on the @call_single_queue.
> +	 */
> +	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
> +		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
> +

There is no guarantee in C that

	type1 a;
	type2 b;

in two different data structures means that offsetof(b) - offsetof(a)
is the same in both data structures unless attributes such as
__attribute__((__packed__)) are used.

As a result, this does and will cause a variety of build errors depending
on the compiler version and compile flags.

Guenter
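
[Editor's note: the layout assumption under discussion can be checked at compile time. A standalone sketch using C11 `_Static_assert` in the spirit of the kernel's BUILD_BUG_ON(); the struct names here are hypothetical, not the kernel types.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Two unrelated structs that happen to start with the same two member
 * types. In practice compilers lay these out identically (same types,
 * same alignment requirements), but the C standard only guarantees
 * layout compatibility for a common initial sequence of structs that
 * are members of the same union - hence Guenter's objection, and hence
 * the compile-time check below rather than a silent assumption.
 */
struct a { void *link; unsigned int flags; void (*fn)(void); };
struct b { void *link; unsigned int flags; };

_Static_assert(offsetof(struct a, flags) - offsetof(struct a, link) ==
	       offsetof(struct b, flags) - offsetof(struct b, link),
	       "flags must sit at the same distance from link");
```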

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-04 14:18   ` [RFC][PATCH 7/7] " Guenter Roeck
@ 2020-06-05  0:24     ` Eric Biggers
  2020-06-05  7:41       ` Peter Zijlstra
  2020-06-05  8:10     ` Peter Zijlstra
  1 sibling, 1 reply; 62+ messages in thread
From: Eric Biggers @ 2020-06-05  0:24 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Peter Zijlstra, tglx, frederic, linux-kernel, x86, cai, mgorman

On Thu, Jun 04, 2020 at 07:18:37AM -0700, Guenter Roeck wrote:
> On Tue, May 26, 2020 at 06:11:04PM +0200, Peter Zijlstra wrote:
> > The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> > got smp_call_function_single_async() subtly wrong. Even though it will
> > return -EBUSY when trying to re-use a csd, that condition is not
> > atomic and still requires external serialization.
> > 
> > The change in ttwu_queue_remote() got this wrong.
> > 
> > While on first reading ttwu_queue_remote() has an atomic test-and-set
> > that appears to serialize the use, the matching 'release' is not in
> > the right place to actually guarantee this serialization.
> > 
> > The actual race is vs the sched_ttwu_pending() call in the idle loop;
> > that can run the wakeup-list without consuming the CSD.
> > 
> > Instead of trying to chain the lists, merge them.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> ...
> > +	/*
> > +	 * Assert the CSD_TYPE_TTWU layout is similar enough
> > +	 * for task_struct to be on the @call_single_queue.
> > +	 */
> > +	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
> > +		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
> > +
> 
> There is no guarantee in C that
> 
> 	type1 a;
> 	type2 b;
> 
> in two different data structures means that offsetof(b) - offsetof(a)
> is the same in both data structures unless attributes such as
> __attribute__((__packed__)) are used.
> 
> As result, this does and will cause a variety of build errors depending
> on the compiler version and compile flags.
> 
> Guenter

Yep, this breaks the build for me.

  CC      kernel/smp.o
In file included from ./arch/x86/include/asm/atomic.h:5,
                 from ./include/linux/atomic.h:7,
                 from ./include/linux/llist.h:51,
                 from ./include/linux/irq_work.h:5,
                 from kernel/smp.c:10:
kernel/smp.c: In function ‘smp_init’:
./include/linux/compiler.h:403:38: error: call to ‘__compiletime_assert_68’ declared with attribute error: BUILD_BUG_ON failed: offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) != offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist)
  403 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
./include/linux/compiler.h:384:4: note: in definition of macro ‘__compiletime_assert’
  384 |    prefix ## suffix();    \
      |    ^~~~~~
./include/linux/compiler.h:403:2: note: in expansion of macro ‘_compiletime_assert’
  403 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:50:2: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
   50 |  BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
      |  ^~~~~~~~~~~~~~~~
kernel/smp.c:687:2: note: in expansion of macro ‘BUILD_BUG_ON’
  687 |  BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
      |  ^~~~~~~~~~~~
make[1]: *** [scripts/Makefile.build:267: kernel/smp.o] Error 1
make: *** [Makefile:1735: kernel] Error 2

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-05  0:24     ` Eric Biggers
@ 2020-06-05  7:41       ` Peter Zijlstra
  2020-06-05 16:15         ` Eric Biggers
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2020-06-05  7:41 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Guenter Roeck, tglx, frederic, linux-kernel, x86, cai, mgorman

On Thu, Jun 04, 2020 at 05:24:33PM -0700, Eric Biggers wrote:
> On Thu, Jun 04, 2020 at 07:18:37AM -0700, Guenter Roeck wrote:
> > On Tue, May 26, 2020 at 06:11:04PM +0200, Peter Zijlstra wrote:
> > > The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> > > got smp_call_function_single_async() subtly wrong. Even though it will
> > > return -EBUSY when trying to re-use a csd, that condition is not
> > > atomic and still requires external serialization.
> > > 
> > > The change in ttwu_queue_remote() got this wrong.
> > > 
> > > While on first reading ttwu_queue_remote() has an atomic test-and-set
> > > that appears to serialize the use, the matching 'release' is not in
> > > the right place to actually guarantee this serialization.
> > > 
> > > The actual race is vs the sched_ttwu_pending() call in the idle loop;
> > > that can run the wakeup-list without consuming the CSD.
> > > 
> > > Instead of trying to chain the lists, merge them.
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > ...
> > > +	/*
> > > +	 * Assert the CSD_TYPE_TTWU layout is similar enough
> > > +	 * for task_struct to be on the @call_single_queue.
> > > +	 */
> > > +	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
> > > +		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
> > > +
> > 
> > There is no guarantee in C that
> > 
> > 	type1 a;
> > 	type2 b;
> > 
> > in two different data structures means that offsetof(b) - offsetof(a)
> > is the same in both data structures unless attributes such as
> > __attribute__((__packed__)) are used.
> > 
> > As result, this does and will cause a variety of build errors depending
> > on the compiler version and compile flags.
> > 
> > Guenter
> 
> Yep, this breaks the build for me.

-ENOCONFIG

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-04 14:18   ` [RFC][PATCH 7/7] " Guenter Roeck
  2020-06-05  0:24     ` Eric Biggers
@ 2020-06-05  8:10     ` Peter Zijlstra
  2020-06-05 13:33       ` Guenter Roeck
  1 sibling, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2020-06-05  8:10 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: tglx, frederic, linux-kernel, x86, cai, mgorman

On Thu, Jun 04, 2020 at 07:18:37AM -0700, Guenter Roeck wrote:
> On Tue, May 26, 2020 at 06:11:04PM +0200, Peter Zijlstra wrote:

> > +	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
> > +		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
> > +
> 
> There is no guarantee in C that
> 
> 	type1 a;
> 	type2 b;
> 
> in two different data structures means that offsetof(b) - offsetof(a)
> is the same in both data structures unless attributes such as
> __attribute__((__packed__)) are used.

Do tell more; the alignment requirements and sizes of the types remain
the same, so this resulting in a different layout is unlikely.

I found this excellent quote on Hacker News this morning:

 "I think the attitude of compiler writers is a good reason to fix the
  spec so they can't keep ratfucking developers trying to get work done."

> As result, this does and will cause a variety of build errors depending
> on the compiler version and compile flags.

The only thing I can think of that's actually a problem is that retarded
struct randomization stuff.

Anyway, I'll move cleaning it up a little higher on the todo list.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue
  2020-05-29 13:36     ` Peter Zijlstra
@ 2020-06-05  9:37       ` Peter Zijlstra
  2020-06-05 15:02         ` Frederic Weisbecker
                           ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-06-05  9:37 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: tglx, linux-kernel, x86, cai, mgorman, sfr, linux

On Fri, May 29, 2020 at 03:36:41PM +0200, Peter Zijlstra wrote:
> Maybe I can anonymous-union my way around it, dunno. I'll think about
> it. I'm certainly not proud of this. But at least the BUILD_BUG_ON()s
> should catch the more blatant breakage here.

How's this then? Differently ugly, but at least it compiles with that
horrible struct randomization junk enabled.

---
 include/linux/irq_work.h  |   28 ++++++-------------
 include/linux/sched.h     |    4 +-
 include/linux/smp.h       |   25 ++++++-----------
 include/linux/smp_types.h |   66 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c       |    6 ++--
 kernel/smp.c              |   18 ------------
 6 files changed, 89 insertions(+), 58 deletions(-)

--- a/include/linux/irq_work.h
+++ b/include/linux/irq_work.h
@@ -2,7 +2,7 @@
 #ifndef _LINUX_IRQ_WORK_H
 #define _LINUX_IRQ_WORK_H
 
-#include <linux/llist.h>
+#include <linux/smp_types.h>
 
 /*
  * An entry can be in one of four states:
@@ -13,26 +13,16 @@
  * busy      NULL, 2 -> {free, claimed} : callback in progress, can be claimed
  */
 
-/* flags share CSD_FLAG_ space */
-
-#define IRQ_WORK_PENDING	BIT(0)
-#define IRQ_WORK_BUSY		BIT(1)
-
-/* Doesn't want IPI, wait for tick: */
-#define IRQ_WORK_LAZY		BIT(2)
-/* Run hard IRQ context, even on RT */
-#define IRQ_WORK_HARD_IRQ	BIT(3)
-
-#define IRQ_WORK_CLAIMED	(IRQ_WORK_PENDING | IRQ_WORK_BUSY)
-
-/*
- * structure shares layout with single_call_data_t.
- */
 struct irq_work {
-	struct llist_node llnode;
-	atomic_t flags;
+	union {
+		struct __call_single_node node;
+		struct {
+			struct llist_node llnode;
+			atomic_t flags;
+		};
+	};
 	void (*func)(struct irq_work *);
-};
+} __no_randomize_layout;
 
 static inline
 void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *))
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -32,6 +32,7 @@
 #include <linux/posix-timers.h>
 #include <linux/rseq.h>
 #include <linux/kcsan.h>
+#include <linux/smp_types.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -654,9 +655,8 @@ struct task_struct {
 	unsigned int			ptrace;
 
 #ifdef CONFIG_SMP
-	struct llist_node		wake_entry;
-	unsigned int			wake_entry_type;
 	int				on_cpu;
+	struct __call_single_node	wake_entry;
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/* Current CPU: */
 	unsigned int			cpu;
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -12,32 +12,25 @@
 #include <linux/list.h>
 #include <linux/cpumask.h>
 #include <linux/init.h>
-#include <linux/llist.h>
+#include <linux/smp_types.h>
 
 typedef void (*smp_call_func_t)(void *info);
 typedef bool (*smp_cond_func_t)(int cpu, void *info);
 
-enum {
-	CSD_FLAG_LOCK		= 0x01,
-
-	/* IRQ_WORK_flags */
-
-	CSD_TYPE_ASYNC		= 0x00,
-	CSD_TYPE_SYNC		= 0x10,
-	CSD_TYPE_IRQ_WORK	= 0x20,
-	CSD_TYPE_TTWU		= 0x30,
-	CSD_FLAG_TYPE_MASK	= 0xF0,
-};
-
 /*
  * structure shares (partial) layout with struct irq_work
  */
 struct __call_single_data {
-	struct llist_node llist;
-	unsigned int flags;
+	union {
+		struct __call_single_node node;
+		struct {
+			struct llist_node llist;
+			unsigned int flags;
+		};
+	};
 	smp_call_func_t func;
 	void *info;
-};
+} __no_randomize_layout;
 
 /* Use __aligned() to avoid to use 2 cache lines for 1 csd */
 typedef struct __call_single_data call_single_data_t
--- /dev/null
+++ b/include/linux/smp_types.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_SMP_TYPES_H
+#define __LINUX_SMP_TYPES_H
+
+#include <linux/llist.h>
+
+enum {
+	CSD_FLAG_LOCK		= 0x01,
+
+	IRQ_WORK_PENDING	= 0x01,
+	IRQ_WORK_BUSY		= 0x02,
+	IRQ_WORK_LAZY		= 0x04, /* No IPI, wait for tick */
+	IRQ_WORK_HARD_IRQ	= 0x08, /* IRQ context on PREEMPT_RT */
+
+	IRQ_WORK_CLAIMED	= (IRQ_WORK_PENDING | IRQ_WORK_BUSY),
+
+	CSD_TYPE_ASYNC		= 0x00,
+	CSD_TYPE_SYNC		= 0x10,
+	CSD_TYPE_IRQ_WORK	= 0x20,
+	CSD_TYPE_TTWU		= 0x30,
+
+	CSD_FLAG_TYPE_MASK	= 0xF0,
+};
+
+/*
+ * struct __call_single_node is the primary type on
+ * smp.c:call_single_queue.
+ *
+ * flush_smp_call_function_queue() only reads the type from
+ * __call_single_node::u_flags as a regular load, the above
+ * (anonymous) enum defines all the bits of this word.
+ *
+ * Other bits are not modified until the type is known.
+ *
+ * CSD_TYPE_SYNC/ASYNC:
+ *	struct {
+ *		struct llist_node node;
+ *		unsigned int flags;
+ *		smp_call_func_t func;
+ *		void *info;
+ *	};
+ *
+ * CSD_TYPE_IRQ_WORK:
+ *	struct {
+ *		struct llist_node node;
+ *		atomic_t flags;
+ *		void (*func)(struct irq_work *);
+ *	};
+ *
+ * CSD_TYPE_TTWU:
+ *	struct {
+ *		struct llist_node node;
+ *		unsigned int flags;
+ *	};
+ *
+ */
+
+struct __call_single_node {
+	struct llist_node	llist;
+	union {
+		unsigned int	u_flags;
+		atomic_t	a_flags;
+	};
+} __no_randomize_layout;
+
+#endif /* __LINUX_SMP_TYPES_H */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2293,7 +2293,7 @@ void sched_ttwu_pending(void *arg)
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
-	llist_for_each_entry_safe(p, t, llist, wake_entry)
+	llist_for_each_entry_safe(p, t, llist, wake_entry.llist)
 		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
 
 	rq_unlock_irqrestore(rq, &rf);
@@ -2322,7 +2322,7 @@ static void __ttwu_queue_wakelist(struct
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
 	WRITE_ONCE(rq->ttwu_pending, 1);
-	__smp_call_single_queue(cpu, &p->wake_entry);
+	__smp_call_single_queue(cpu, &p->wake_entry.llist);
 }
 
 void wake_up_if_idle(int cpu)
@@ -2763,7 +2763,7 @@ static void __sched_fork(unsigned long c
 #endif
 	init_numa_balancing(clone_flags, p);
 #ifdef CONFIG_SMP
-	p->wake_entry_type = CSD_TYPE_TTWU;
+	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 #endif
 }
 
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -669,24 +669,6 @@ void __init smp_init(void)
 {
 	int num_nodes, num_cpus;
 
-	/*
-	 * Ensure struct irq_work layout matches so that
-	 * flush_smp_call_function_queue() can do horrible things.
-	 */
-	BUILD_BUG_ON(offsetof(struct irq_work, llnode) !=
-		     offsetof(struct __call_single_data, llist));
-	BUILD_BUG_ON(offsetof(struct irq_work, func) !=
-		     offsetof(struct __call_single_data, func));
-	BUILD_BUG_ON(offsetof(struct irq_work, flags) !=
-		     offsetof(struct __call_single_data, flags));
-
-	/*
-	 * Assert the CSD_TYPE_TTWU layout is similar enough
-	 * for task_struct to be on the @call_single_queue.
-	 */
-	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
-		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
-
 	idle_threads_init();
 	cpuhp_threads_init();
 

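[Editor's note: the shape of this fix - an explicitly shared head type plus an anonymous union so that existing member names keep working - can be sketched in isolation. Userspace C11, hypothetical names; `head_node` plays the role of `__call_single_node`.]

```c
#include <assert.h>
#include <stddef.h>

/* Shared head: every entry on the queue starts with this. */
struct head_node {
	struct head_node *next;
	unsigned int flags;
};

/*
 * Two "users" of the queue embed the head via an anonymous union, so
 * old code can still say w.flags while generic code says w.node.flags.
 */
struct work_a {
	union {
		struct head_node node;
		struct { struct head_node *next; unsigned int flags; };
	};
	void (*fn)(struct work_a *);
};

struct work_b {
	union {
		struct head_node node;
		struct { struct head_node *next; unsigned int flags; };
	};
	int payload;
};

/*
 * Because the union forces both views onto the same storage, the
 * offsets now agree by construction rather than by accident, which is
 * what the deleted BUILD_BUG_ON()s were trying to assert.
 */
_Static_assert(offsetof(struct work_a, flags) ==
	       offsetof(struct work_a, node.flags), "aliased view");
_Static_assert(offsetof(struct work_a, flags) ==
	       offsetof(struct work_b, flags), "shared prefix");
```

Writes through one view are visible through the other, so per-type code and the generic queue-flushing code can use whichever name fits.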
^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-05  8:10     ` Peter Zijlstra
@ 2020-06-05 13:33       ` Guenter Roeck
  2020-06-05 14:09         ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: Guenter Roeck @ 2020-06-05 13:33 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, frederic, linux-kernel, x86, cai, mgorman

On 6/5/20 1:10 AM, Peter Zijlstra wrote:
> On Thu, Jun 04, 2020 at 07:18:37AM -0700, Guenter Roeck wrote:
>> On Tue, May 26, 2020 at 06:11:04PM +0200, Peter Zijlstra wrote:
> 
>>> +	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
>>> +		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
>>> +
>>
>> There is no guarantee in C that
>>
>> 	type1 a;
>> 	type2 b;
>>
>> in two different data structures means that offsetof(b) - offsetof(a)
>> is the same in both data structures unless attributes such as
>> __attribute__((__packed__)) are used.
> 
> Do tell more; the alignment requirements and size of the types remains
> the same, this resulting in different layout is unlikely.
> 

I have not made the C standard. You point out yourself a possible explicit
culprit: struct randomization. That by itself shows that you can not rely
on two elements of different structures having the same alignment,
which is pretty much exactly what I said (and may explain why observing
the problem seemed to at least somewhat depend on the weather).

> I found this excellent quote on Hacker News this morning:
> 
>  "I think the attitude of compiler writers is a good reason to fix the
>   spec so they can't keep ratfucking developers trying to get work done."
> 

Qed.

Guenter

>> As result, this does and will cause a variety of build errors depending
>> on the compiler version and compile flags.
> 
> The only thing I can think of that's actually a problem is that retarded
> struct randomization stuff.
> 
> Anyway, I'll move cleaning it up a little higher on the todo list.
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-05 13:33       ` Guenter Roeck
@ 2020-06-05 14:09         ` Peter Zijlstra
  0 siblings, 0 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-06-05 14:09 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: tglx, frederic, linux-kernel, x86, cai, mgorman

On Fri, Jun 05, 2020 at 06:33:38AM -0700, Guenter Roeck wrote:

> I have not made the C standard. You point out yourself a possible explicit
> culprit: struct randomization. 

The randomization crud is very much outside the C spec.

> That by itself shows that you can not rely
> on two elements of different structures having the same alignment,

Randomization does not change the alignment; if it were to do that it
would break all sorts of things. All it does is change the order of elements.

> which is pretty much exactly what I said (and may explain why observing
> the problem seemed to at least somewhat depend on the weather).

A normal C compiler will have deterministic layout; otherwise unions and
union-based type punning would not be a thing.
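
[Editor's note: the union-based punning referred to here is the `u_flags`/`a_flags` pair in the patch upthread. In plain C11 it reduces to two views of one word; a userspace sketch with `_Atomic` standing in for the kernel's `atomic_t`.]

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * One word, two access disciplines: a plain load to inspect the type
 * bits, an atomic view for claim/release. Mirrors the u_flags/a_flags
 * union in struct __call_single_node.
 */
union flags_word {
	unsigned int u_flags;		/* plain loads/stores */
	_Atomic unsigned int a_flags;	/* atomic RMW */
};

enum {
	CSD_FLAG_LOCK		= 0x01,
	CSD_TYPE_TTWU		= 0x30,
	CSD_FLAG_TYPE_MASK	= 0xF0,
};

static unsigned int read_type(union flags_word *w)
{
	/* A regular load is enough to read the type bits. */
	return w->u_flags & CSD_FLAG_TYPE_MASK;
}
```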

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue
  2020-06-05  9:37       ` Peter Zijlstra
@ 2020-06-05 15:02         ` Frederic Weisbecker
  2020-06-05 16:17           ` Peter Zijlstra
  2020-06-05 15:24         ` Kees Cook
  2020-06-10 13:24         ` Frederic Weisbecker
  2 siblings, 1 reply; 62+ messages in thread
From: Frederic Weisbecker @ 2020-06-05 15:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, linux-kernel, x86, cai, mgorman, sfr, linux

On Fri, Jun 05, 2020 at 11:37:04AM +0200, Peter Zijlstra wrote:
> On Fri, May 29, 2020 at 03:36:41PM +0200, Peter Zijlstra wrote:
> > Maybe I can anonymous-union my way around it, dunno. I'll think about
> > it. I'm certainly not proud of this. But at least the BUILD_BUG_ON()s
> > should catch the more blatant breakage here.
> 
> How's this then? Differently ugly, but at least it compiles with that
> horrible struct randomization junk enabled.
> 
> ---
>  include/linux/irq_work.h  |   28 ++++++-------------
>  include/linux/sched.h     |    4 +-
>  include/linux/smp.h       |   25 ++++++-----------
>  include/linux/smp_types.h |   66 ++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/core.c       |    6 ++--
>  kernel/smp.c              |   18 ------------
>  6 files changed, 89 insertions(+), 58 deletions(-)
> 
> --- a/include/linux/irq_work.h
> +++ b/include/linux/irq_work.h
> @@ -2,7 +2,7 @@
>  #ifndef _LINUX_IRQ_WORK_H
>  #define _LINUX_IRQ_WORK_H
>  
> -#include <linux/llist.h>
> +#include <linux/smp_types.h>
>  
>  /*
>   * An entry can be in one of four states:
> @@ -13,26 +13,16 @@
>   * busy      NULL, 2 -> {free, claimed} : callback in progress, can be claimed
>   */
>  
> -/* flags share CSD_FLAG_ space */
> -
> -#define IRQ_WORK_PENDING	BIT(0)
> -#define IRQ_WORK_BUSY		BIT(1)
> -
> -/* Doesn't want IPI, wait for tick: */
> -#define IRQ_WORK_LAZY		BIT(2)
> -/* Run hard IRQ context, even on RT */
> -#define IRQ_WORK_HARD_IRQ	BIT(3)
> -
> -#define IRQ_WORK_CLAIMED	(IRQ_WORK_PENDING | IRQ_WORK_BUSY)
> -
> -/*
> - * structure shares layout with single_call_data_t.
> - */
>  struct irq_work {
> -	struct llist_node llnode;
> -	atomic_t flags;
> +	union {
> +		struct __call_single_node node;
> +		struct {
> +			struct llist_node llnode;
> +			atomic_t flags;
> +		};
> +	};

So why not just embed struct __call_single_node in
struct irq_work and struct __call_single_data ?

Is the point of that anonymous second union layer to
shorten the lines while accessing members?

Thanks.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue
  2020-06-05  9:37       ` Peter Zijlstra
  2020-06-05 15:02         ` Frederic Weisbecker
@ 2020-06-05 15:24         ` Kees Cook
  2020-06-10 13:24         ` Frederic Weisbecker
  2 siblings, 0 replies; 62+ messages in thread
From: Kees Cook @ 2020-06-05 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, tglx, linux-kernel, x86, cai, mgorman, sfr, linux

On Fri, Jun 05, 2020 at 11:37:04AM +0200, Peter Zijlstra wrote:
> On Fri, May 29, 2020 at 03:36:41PM +0200, Peter Zijlstra wrote:
> > Maybe I can anonymous-union my way around it, dunno. I'll think about
> > it. I'm certainly not proud of this. But at least the BUILD_BUG_ON()s
> > should catch the more blatant breakage here.
> 
> How's this then? Differently ugly, but at least it compiles with that
> horrible struct randomization junk enabled.
> 
> ---
>  include/linux/irq_work.h  |   28 ++++++-------------
>  include/linux/sched.h     |    4 +-
>  include/linux/smp.h       |   25 ++++++-----------
>  include/linux/smp_types.h |   66 ++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/core.c       |    6 ++--
>  kernel/smp.c              |   18 ------------
>  6 files changed, 89 insertions(+), 58 deletions(-)
> 
> --- a/include/linux/irq_work.h
> +++ b/include/linux/irq_work.h
> @@ -2,7 +2,7 @@
>  #ifndef _LINUX_IRQ_WORK_H
>  #define _LINUX_IRQ_WORK_H
>  
> -#include <linux/llist.h>
> +#include <linux/smp_types.h>
>  
>  /*
>   * An entry can be in one of four states:
> @@ -13,26 +13,16 @@
>   * busy      NULL, 2 -> {free, claimed} : callback in progress, can be claimed
>   */
>  
> -/* flags share CSD_FLAG_ space */
> -
> -#define IRQ_WORK_PENDING	BIT(0)
> -#define IRQ_WORK_BUSY		BIT(1)
> -
> -/* Doesn't want IPI, wait for tick: */
> -#define IRQ_WORK_LAZY		BIT(2)
> -/* Run hard IRQ context, even on RT */
> -#define IRQ_WORK_HARD_IRQ	BIT(3)
> -
> -#define IRQ_WORK_CLAIMED	(IRQ_WORK_PENDING | IRQ_WORK_BUSY)
> -
> -/*
> - * structure shares layout with single_call_data_t.
> - */
>  struct irq_work {
> -	struct llist_node llnode;
> -	atomic_t flags;
> +	union {
> +		struct __call_single_node node;
> +		struct {
> +			struct llist_node llnode;
> +			atomic_t flags;
> +		};
> +	};
>  	void (*func)(struct irq_work *);
> -};
> +} __no_randomize_layout;

The "__no_randomize_layout" isn't needed here. The only automatically
randomized structs are those entirely consisting of function pointers.

>  static inline
>  void init_irq_work(struct irq_work *work, void (*func)(struct irq_work *))
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -32,6 +32,7 @@
>  #include <linux/posix-timers.h>
>  #include <linux/rseq.h>
>  #include <linux/kcsan.h>
> +#include <linux/smp_types.h>
>  
>  /* task_struct member predeclarations (sorted alphabetically): */
>  struct audit_context;
> @@ -654,9 +655,8 @@ struct task_struct {
>  	unsigned int			ptrace;
>  
>  #ifdef CONFIG_SMP
> -	struct llist_node		wake_entry;
> -	unsigned int			wake_entry_type;
>  	int				on_cpu;
> +	struct __call_single_node	wake_entry;
>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>  	/* Current CPU: */
>  	unsigned int			cpu;
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -12,32 +12,25 @@
>  #include <linux/list.h>
>  #include <linux/cpumask.h>
>  #include <linux/init.h>
> -#include <linux/llist.h>
> +#include <linux/smp_types.h>
>  
>  typedef void (*smp_call_func_t)(void *info);
>  typedef bool (*smp_cond_func_t)(int cpu, void *info);
>  
> -enum {
> -	CSD_FLAG_LOCK		= 0x01,
> -
> -	/* IRQ_WORK_flags */
> -
> -	CSD_TYPE_ASYNC		= 0x00,
> -	CSD_TYPE_SYNC		= 0x10,
> -	CSD_TYPE_IRQ_WORK	= 0x20,
> -	CSD_TYPE_TTWU		= 0x30,
> -	CSD_FLAG_TYPE_MASK	= 0xF0,
> -};
> -
>  /*
>   * structure shares (partial) layout with struct irq_work
>   */
>  struct __call_single_data {
> -	struct llist_node llist;
> -	unsigned int flags;
> +	union {
> +		struct __call_single_node node;
> +		struct {
> +			struct llist_node llist;
> +			unsigned int flags;
> +		};
> +	};
>  	smp_call_func_t func;
>  	void *info;
> -};
> +} __no_randomize_layout;

Same here.

>  
>  /* Use __aligned() to avoid to use 2 cache lines for 1 csd */
>  typedef struct __call_single_data call_single_data_t
> --- /dev/null
> +++ b/include/linux/smp_types.h
> @@ -0,0 +1,66 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __LINUX_SMP_TYPES_H
> +#define __LINUX_SMP_TYPES_H
> +
> +#include <linux/llist.h>
> +
> +enum {
> +	CSD_FLAG_LOCK		= 0x01,
> +
> +	IRQ_WORK_PENDING	= 0x01,
> +	IRQ_WORK_BUSY		= 0x02,
> +	IRQ_WORK_LAZY		= 0x04, /* No IPI, wait for tick */
> +	IRQ_WORK_HARD_IRQ	= 0x08, /* IRQ context on PREEMPT_RT */
> +
> +	IRQ_WORK_CLAIMED	= (IRQ_WORK_PENDING | IRQ_WORK_BUSY),
> +
> +	CSD_TYPE_ASYNC		= 0x00,
> +	CSD_TYPE_SYNC		= 0x10,
> +	CSD_TYPE_IRQ_WORK	= 0x20,
> +	CSD_TYPE_TTWU		= 0x30,
> +
> +	CSD_FLAG_TYPE_MASK	= 0xF0,
> +};
> +
> +/*
> + * struct __call_single_node is the primary type on
> + * smp.c:call_single_queue.
> + *
> + * flush_smp_call_function_queue() only reads the type from
> + * __call_single_node::u_flags as a regular load, the above
> + * (anonymous) enum defines all the bits of this word.
> + *
> + * Other bits are not modified until the type is known.
> + *
> + * CSD_TYPE_SYNC/ASYNC:
> + *	struct {
> + *		struct llist_node node;
> + *		unsigned int flags;
> + *		smp_call_func_t func;
> + *		void *info;
> + *	};
> + *
> + * CSD_TYPE_IRQ_WORK:
> + *	struct {
> + *		struct llist_node node;
> + *		atomic_t flags;
> + *		void (*func)(struct irq_work *);
> + *	};
> + *
> + * CSD_TYPE_TTWU:
> + *	struct {
> + *		struct llist_node node;
> + *		unsigned int flags;
> + *	};
> + *
> + */
> +
> +struct __call_single_node {
> +	struct llist_node	llist;
> +	union {
> +		unsigned int	u_flags;
> +		atomic_t	a_flags;
> +	};
> +} __no_randomize_layout;

Same.

> +
> +#endif /* __LINUX_SMP_TYPES_H */
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2293,7 +2293,7 @@ void sched_ttwu_pending(void *arg)
>  	rq_lock_irqsave(rq, &rf);
>  	update_rq_clock(rq);
>  
> -	llist_for_each_entry_safe(p, t, llist, wake_entry)
> +	llist_for_each_entry_safe(p, t, llist, wake_entry.llist)
>  		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
>  
>  	rq_unlock_irqrestore(rq, &rf);
> @@ -2322,7 +2322,7 @@ static void __ttwu_queue_wakelist(struct
>  	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
>  
>  	WRITE_ONCE(rq->ttwu_pending, 1);
> -	__smp_call_single_queue(cpu, &p->wake_entry);
> +	__smp_call_single_queue(cpu, &p->wake_entry.llist);
>  }
>  
>  void wake_up_if_idle(int cpu)
> @@ -2763,7 +2763,7 @@ static void __sched_fork(unsigned long c
>  #endif
>  	init_numa_balancing(clone_flags, p);
>  #ifdef CONFIG_SMP
> -	p->wake_entry_type = CSD_TYPE_TTWU;
> +	p->wake_entry.u_flags = CSD_TYPE_TTWU;
>  #endif
>  }
>  
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -669,24 +669,6 @@ void __init smp_init(void)
>  {
>  	int num_nodes, num_cpus;
>  
> -	/*
> -	 * Ensure struct irq_work layout matches so that
> -	 * flush_smp_call_function_queue() can do horrible things.
> -	 */
> -	BUILD_BUG_ON(offsetof(struct irq_work, llnode) !=
> -		     offsetof(struct __call_single_data, llist));
> -	BUILD_BUG_ON(offsetof(struct irq_work, func) !=
> -		     offsetof(struct __call_single_data, func));
> -	BUILD_BUG_ON(offsetof(struct irq_work, flags) !=
> -		     offsetof(struct __call_single_data, flags));
> -
> -	/*
> -	 * Assert the CSD_TYPE_TTWU layout is similar enough
> -	 * for task_struct to be on the @call_single_queue.
> -	 */
> -	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
> -		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
> -

Do you want to validate that the individual members of the union struct
still have their fields lining up with __call_single_node's members?
Or better yet, I have the same question as Frederic about the need for
the union. Why not just switch callers from "flags" to "node.u_flags"
and "node.a_flags"? (Or could that be cleaned up in a later patch to
avoid putting too much churn in one patch?)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-05  7:41       ` Peter Zijlstra
@ 2020-06-05 16:15         ` Eric Biggers
  2020-06-06 23:13           ` Guenter Roeck
  0 siblings, 1 reply; 62+ messages in thread
From: Eric Biggers @ 2020-06-05 16:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Guenter Roeck, tglx, frederic, linux-kernel, x86, cai, mgorman

On Fri, Jun 05, 2020 at 09:41:54AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 04, 2020 at 05:24:33PM -0700, Eric Biggers wrote:
> > On Thu, Jun 04, 2020 at 07:18:37AM -0700, Guenter Roeck wrote:
> > > On Tue, May 26, 2020 at 06:11:04PM +0200, Peter Zijlstra wrote:
> > > > The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> > > > got smp_call_function_single_async() subtly wrong. Even though it will
> > > > return -EBUSY when trying to re-use a csd, that condition is not
> > > > atomic and still requires external serialization.
> > > > 
> > > > The change in ttwu_queue_remote() got this wrong.
> > > > 
> > > > While on first reading ttwu_queue_remote() has an atomic test-and-set
> > > > that appears to serialize the use, the matching 'release' is not in
> > > > the right place to actually guarantee this serialization.
> > > > 
> > > > The actual race is vs the sched_ttwu_pending() call in the idle loop;
> > > > that can run the wakeup-list without consuming the CSD.
> > > > 
> > > > Instead of trying to chain the lists, merge them.
> > > > 
> > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > ---
> > > ...
> > > > +	/*
> > > > +	 * Assert the CSD_TYPE_TTWU layout is similar enough
> > > > +	 * for task_struct to be on the @call_single_queue.
> > > > +	 */
> > > > +	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
> > > > +		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
> > > > +
> > > 
> > > There is no guarantee in C that
> > > 
> > > 	type1 a;
> > > 	type2 b;
> > > 
> > > in two different data structures means that offsetof(b) - offsetof(a)
> > > is the same in both data structures unless attributes such as
> > > __attribute__((__packed__)) are used.
> > > 
> > > As result, this does and will cause a variety of build errors depending
> > > on the compiler version and compile flags.
> > > 
> > > Guenter
> > 
> > Yep, this breaks the build for me.
> 
> -ENOCONFIG

For me, the problem seems to be randstruct.  To reproduce, you can use
(on x86_64):

	make defconfig
	echo CONFIG_GCC_PLUGIN_RANDSTRUCT=y >> .config
	make olddefconfig
	make kernel/smp.o

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue
  2020-06-05 15:02         ` Frederic Weisbecker
@ 2020-06-05 16:17           ` Peter Zijlstra
  0 siblings, 0 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-06-05 16:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: tglx, linux-kernel, x86, cai, mgorman, sfr, linux, keescook

On Fri, Jun 05, 2020 at 05:02:08PM +0200, Frederic Weisbecker wrote:
> On Fri, Jun 05, 2020 at 11:37:04AM +0200, Peter Zijlstra wrote:
> >  struct irq_work {
> > -	struct llist_node llnode;
> > -	atomic_t flags;
> > +	union {
> > +		struct __call_single_node node;
> > +		struct {
> > +			struct llist_node llnode;
> > +			atomic_t flags;
> > +		};
> > +	};
> 
> So why not just embed struct __call_single_node in
> struct irq_work and struct __call_single_data ?
> 
> Is the point of that anonymous second union layer to
> shorten the lines while accessing members?

To cut down on all the churn. irq_work is small and could easily be
fixed up, but csd is used in quite a few places.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-05 16:15         ` Eric Biggers
@ 2020-06-06 23:13           ` Guenter Roeck
  2020-06-09 20:21             ` Eric Biggers
  0 siblings, 1 reply; 62+ messages in thread
From: Guenter Roeck @ 2020-06-06 23:13 UTC (permalink / raw)
  To: Eric Biggers, Peter Zijlstra
  Cc: tglx, frederic, linux-kernel, x86, cai, mgorman

On 6/5/20 9:15 AM, Eric Biggers wrote:
> On Fri, Jun 05, 2020 at 09:41:54AM +0200, Peter Zijlstra wrote:
>> On Thu, Jun 04, 2020 at 05:24:33PM -0700, Eric Biggers wrote:
>>> On Thu, Jun 04, 2020 at 07:18:37AM -0700, Guenter Roeck wrote:
>>>> On Tue, May 26, 2020 at 06:11:04PM +0200, Peter Zijlstra wrote:
>>>>> The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
>>>>> got smp_call_function_single_async() subtly wrong. Even though it will
>>>>> return -EBUSY when trying to re-use a csd, that condition is not
>>>>> atomic and still requires external serialization.
>>>>>
>>>>> The change in ttwu_queue_remote() got this wrong.
>>>>>
>>>>> While on first reading ttwu_queue_remote() has an atomic test-and-set
>>>>> that appears to serialize the use, the matching 'release' is not in
>>>>> the right place to actually guarantee this serialization.
>>>>>
>>>>> The actual race is vs the sched_ttwu_pending() call in the idle loop;
>>>>> that can run the wakeup-list without consuming the CSD.
>>>>>
>>>>> Instead of trying to chain the lists, merge them.
>>>>>
>>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>>> ---
>>>> ...
>>>>> +	/*
>>>>> +	 * Assert the CSD_TYPE_TTWU layout is similar enough
>>>>> +	 * for task_struct to be on the @call_single_queue.
>>>>> +	 */
>>>>> +	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
>>>>> +		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
>>>>> +
>>>>
>>>> There is no guarantee in C that
>>>>
>>>> 	type1 a;
>>>> 	type2 b;
>>>>
>>>> in two different data structures means that offsetof(b) - offsetof(a)
>>>> is the same in both data structures unless attributes such as
>>>> __attribute__((__packed__)) are used.
>>>>
> > > > As a result, this does and will cause a variety of build errors depending
>>>> on the compiler version and compile flags.
>>>>
>>>> Guenter
>>>
>>> Yep, this breaks the build for me.
>>
>> -ENOCONFIG
> 
> For me, the problem seems to be randstruct.  To reproduce, you can use
> (on x86_64):
> 
> 	make defconfig
> 	echo CONFIG_GCC_PLUGIN_RANDSTRUCT=y >> .config
> 	make olddefconfig
> 	make kernel/smp.o
> 

I confirmed that disabling CONFIG_GCC_PLUGIN_RANDSTRUCT "fixes" the problem
in my test builds. Maybe it would make sense to mark that configuration option
for the time being as BROKEN.

Guenter

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-06 23:13           ` Guenter Roeck
@ 2020-06-09 20:21             ` Eric Biggers
  2020-06-09 21:25               ` Guenter Roeck
  2020-06-09 22:07               ` Peter Zijlstra
  0 siblings, 2 replies; 62+ messages in thread
From: Eric Biggers @ 2020-06-09 20:21 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Peter Zijlstra, tglx, frederic, linux-kernel, x86, cai, mgorman

On Sat, Jun 06, 2020 at 04:13:33PM -0700, Guenter Roeck wrote:
> On 6/5/20 9:15 AM, Eric Biggers wrote:
> > On Fri, Jun 05, 2020 at 09:41:54AM +0200, Peter Zijlstra wrote:
> >> On Thu, Jun 04, 2020 at 05:24:33PM -0700, Eric Biggers wrote:
> >>> On Thu, Jun 04, 2020 at 07:18:37AM -0700, Guenter Roeck wrote:
> >>>> On Tue, May 26, 2020 at 06:11:04PM +0200, Peter Zijlstra wrote:
> >>>>> The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> >>>>> got smp_call_function_single_async() subtly wrong. Even though it will
> >>>>> return -EBUSY when trying to re-use a csd, that condition is not
> >>>>> atomic and still requires external serialization.
> >>>>>
> >>>>> The change in ttwu_queue_remote() got this wrong.
> >>>>>
> >>>>> While on first reading ttwu_queue_remote() has an atomic test-and-set
> >>>>> that appears to serialize the use, the matching 'release' is not in
> >>>>> the right place to actually guarantee this serialization.
> >>>>>
> >>>>> The actual race is vs the sched_ttwu_pending() call in the idle loop;
> >>>>> that can run the wakeup-list without consuming the CSD.
> >>>>>
> >>>>> Instead of trying to chain the lists, merge them.
> >>>>>
> >>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> >>>>> ---
> >>>> ...
> >>>>> +	/*
> >>>>> +	 * Assert the CSD_TYPE_TTWU layout is similar enough
> >>>>> +	 * for task_struct to be on the @call_single_queue.
> >>>>> +	 */
> >>>>> +	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
> >>>>> +		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
> >>>>> +
> >>>>
> >>>> There is no guarantee in C that
> >>>>
> >>>> 	type1 a;
> >>>> 	type2 b;
> >>>>
> >>>> in two different data structures means that offsetof(b) - offsetof(a)
> >>>> is the same in both data structures unless attributes such as
> >>>> __attribute__((__packed__)) are used.
> >>>>
> > >>>> As a result, this does and will cause a variety of build errors depending
> >>>> on the compiler version and compile flags.
> >>>>
> >>>> Guenter
> >>>
> >>> Yep, this breaks the build for me.
> >>
> >> -ENOCONFIG
> > 
> > For me, the problem seems to be randstruct.  To reproduce, you can use
> > (on x86_64):
> > 
> > 	make defconfig
> > 	echo CONFIG_GCC_PLUGIN_RANDSTRUCT=y >> .config
> > 	make olddefconfig
> > 	make kernel/smp.o
> > 
> 
> I confirmed that disabling CONFIG_GCC_PLUGIN_RANDSTRUCT "fixes" the problem
> in my test builds. Maybe it would make sense to mark that configuration option
> for the time being as BROKEN.
> 

Still occurring on Linus' tree.  This needs to be fixed.  (And not by removing
support for randstruct; that's not a "fix"...)

Shouldn't the kbuild test robot have caught this?

- Eric

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-09 20:21             ` Eric Biggers
@ 2020-06-09 21:25               ` Guenter Roeck
  2020-06-09 21:38                 ` Eric Biggers
  2020-06-18 17:57                 ` Steven Rostedt
  2020-06-09 22:07               ` Peter Zijlstra
  1 sibling, 2 replies; 62+ messages in thread
From: Guenter Roeck @ 2020-06-09 21:25 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Peter Zijlstra, tglx, frederic, linux-kernel, x86, cai, mgorman

On Tue, Jun 09, 2020 at 01:21:34PM -0700, Eric Biggers wrote:
> On Sat, Jun 06, 2020 at 04:13:33PM -0700, Guenter Roeck wrote:
> > On 6/5/20 9:15 AM, Eric Biggers wrote:
> > > On Fri, Jun 05, 2020 at 09:41:54AM +0200, Peter Zijlstra wrote:
> > >> On Thu, Jun 04, 2020 at 05:24:33PM -0700, Eric Biggers wrote:
> > >>> On Thu, Jun 04, 2020 at 07:18:37AM -0700, Guenter Roeck wrote:
> > >>>> On Tue, May 26, 2020 at 06:11:04PM +0200, Peter Zijlstra wrote:
> > >>>>> The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
> > >>>>> got smp_call_function_single_async() subtly wrong. Even though it will
> > >>>>> return -EBUSY when trying to re-use a csd, that condition is not
> > >>>>> atomic and still requires external serialization.
> > >>>>>
> > >>>>> The change in ttwu_queue_remote() got this wrong.
> > >>>>>
> > >>>>> While on first reading ttwu_queue_remote() has an atomic test-and-set
> > >>>>> that appears to serialize the use, the matching 'release' is not in
> > >>>>> the right place to actually guarantee this serialization.
> > >>>>>
> > >>>>> The actual race is vs the sched_ttwu_pending() call in the idle loop;
> > >>>>> that can run the wakeup-list without consuming the CSD.
> > >>>>>
> > >>>>> Instead of trying to chain the lists, merge them.
> > >>>>>
> > >>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > >>>>> ---
> > >>>> ...
> > >>>>> +	/*
> > >>>>> +	 * Assert the CSD_TYPE_TTWU layout is similar enough
> > >>>>> +	 * for task_struct to be on the @call_single_queue.
> > >>>>> +	 */
> > >>>>> +	BUILD_BUG_ON(offsetof(struct task_struct, wake_entry_type) - offsetof(struct task_struct, wake_entry) !=
> > >>>>> +		     offsetof(struct __call_single_data, flags) - offsetof(struct __call_single_data, llist));
> > >>>>> +
> > >>>>
> > >>>> There is no guarantee in C that
> > >>>>
> > >>>> 	type1 a;
> > >>>> 	type2 b;
> > >>>>
> > >>>> in two different data structures means that offsetof(b) - offsetof(a)
> > >>>> is the same in both data structures unless attributes such as
> > >>>> __attribute__((__packed__)) are used.
> > >>>>
> > > >>>> As a result, this does and will cause a variety of build errors depending
> > >>>> on the compiler version and compile flags.
> > >>>>
> > >>>> Guenter
> > >>>
> > >>> Yep, this breaks the build for me.
> > >>
> > >> -ENOCONFIG
> > > 
> > > For me, the problem seems to be randstruct.  To reproduce, you can use
> > > (on x86_64):
> > > 
> > > 	make defconfig
> > > 	echo CONFIG_GCC_PLUGIN_RANDSTRUCT=y >> .config
> > > 	make olddefconfig
> > > 	make kernel/smp.o
> > > 
> > 
> > I confirmed that disabling CONFIG_GCC_PLUGIN_RANDSTRUCT "fixes" the problem
> > in my test builds. Maybe it would make sense to mark that configuration option
> > for the time being as BROKEN.
> > 
> 
> Still occurring on Linus' tree.  This needs to be fixed.  (And not by removing
> support for randstruct; that's not a "fix"...)
> 

How about the hack below ?

Guenter

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5d96e3e7fff..df1cbb04f9b3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -629,6 +629,15 @@ struct wake_q_node {
 	struct wake_q_node *next;
 };
 
+/*
+ * Hack around assumption that wake_entry_type follows wake_entry even with
+ * CONFIG_GCC_PLUGIN_RANDSTRUCT=y.
+ */
+struct _wake_entry {
+	struct llist_node	wake_entry;
+	unsigned int		wake_entry_type;
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -653,8 +662,9 @@ struct task_struct {
 	unsigned int			ptrace;
 
 #ifdef CONFIG_SMP
-	struct llist_node		wake_entry;
-	unsigned int			wake_entry_type;
+	struct _wake_entry		_we;
+#define wake_entry		_we.wake_entry
+#define wake_entry_type		_we.wake_entry_type
 	int				on_cpu;
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/* Current CPU: */

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-09 21:25               ` Guenter Roeck
@ 2020-06-09 21:38                 ` Eric Biggers
  2020-06-09 22:06                   ` Peter Zijlstra
  2020-06-18 17:57                 ` Steven Rostedt
  1 sibling, 1 reply; 62+ messages in thread
From: Eric Biggers @ 2020-06-09 21:38 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Peter Zijlstra, tglx, frederic, linux-kernel, x86, cai, mgorman

On Tue, Jun 09, 2020 at 02:25:09PM -0700, Guenter Roeck wrote:
> > 
> > Still occurring on Linus' tree.  This needs to be fixed.  (And not by removing
> > support for randstruct; that's not a "fix"...)
> > 
> 
> How about the hack below ?
> 
> Guenter
> 
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c5d96e3e7fff..df1cbb04f9b3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -629,6 +629,15 @@ struct wake_q_node {
>  	struct wake_q_node *next;
>  };
>  
> +/*
> + * Hack around assumption that wake_entry_type follows wake_entry even with
> + * CONFIG_GCC_PLUGIN_RANDSTRUCT=y.
> + */
> +struct _wake_entry {
> +	struct llist_node	wake_entry;
> +	unsigned int		wake_entry_type;
> +};
> +
>  struct task_struct {
>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>  	/*
> @@ -653,8 +662,9 @@ struct task_struct {
>  	unsigned int			ptrace;
>  
>  #ifdef CONFIG_SMP
> -	struct llist_node		wake_entry;
> -	unsigned int			wake_entry_type;
> +	struct _wake_entry		_we;
> +#define wake_entry		_we.wake_entry
> +#define wake_entry_type		_we.wake_entry_type
>  	int				on_cpu;
>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>  	/* Current CPU: */

Does the struct actually have to be named?  How about:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5d96e3e7fff42..14ca25cda19150 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -653,8 +653,14 @@ struct task_struct {
 	unsigned int			ptrace;
 
 #ifdef CONFIG_SMP
-	struct llist_node		wake_entry;
-	unsigned int			wake_entry_type;
+	/*
+	 * wake_entry_type must follow wake_entry, even when
+	 * CONFIG_GCC_PLUGIN_RANDSTRUCT=y.
+	 */
+	struct {
+		struct llist_node	wake_entry;
+		unsigned int		wake_entry_type;
+	};
 	int				on_cpu;
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/* Current CPU: */


However, it would be preferable to not rely on different structs sharing the
same field order, but rather write proper C code that uses the same struct
everywhere to encapsulate these 2 fields...

- Eric

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-09 21:38                 ` Eric Biggers
@ 2020-06-09 22:06                   ` Peter Zijlstra
  2020-06-09 23:03                     ` Guenter Roeck
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2020-06-09 22:06 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Guenter Roeck, tglx, frederic, linux-kernel, x86, cai, mgorman

On Tue, Jun 09, 2020 at 02:38:29PM -0700, Eric Biggers wrote:
> Does the struct actually have to be named?  How about:
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c5d96e3e7fff42..14ca25cda19150 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -653,8 +653,14 @@ struct task_struct {
>  	unsigned int			ptrace;
>  
>  #ifdef CONFIG_SMP
> -	struct llist_node		wake_entry;
> -	unsigned int			wake_entry_type;
> +	/*
> +	 * wake_entry_type must follow wake_entry, even when
> +	 * CONFIG_GCC_PLUGIN_RANDSTRUCT=y.
> +	 */
> +	struct {
> +		struct llist_node	wake_entry;
> +		unsigned int		wake_entry_type;
> +	};
>  	int				on_cpu;
>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>  	/* Current CPU: */
> 
> 
> However, it would be preferable to not rely on different structs sharing the
> same field order, but rather write proper C code that uses the same struct
> everywhere to encapsulate these 2 fields...

https://lkml.kernel.org/r/20200605093704.GB2948@hirez.programming.kicks-ass.net

And I have more patches on top to clean up some of the anonymous union
stuff, but that's quite a lot of frobbing.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-09 20:21             ` Eric Biggers
  2020-06-09 21:25               ` Guenter Roeck
@ 2020-06-09 22:07               ` Peter Zijlstra
  1 sibling, 0 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-06-09 22:07 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Guenter Roeck, tglx, frederic, linux-kernel, x86, cai, mgorman

On Tue, Jun 09, 2020 at 01:21:34PM -0700, Eric Biggers wrote:
> Still occurring on Linus' tree.  This needs to be fixed.  (And not by removing
> support for randstruct; that's not a "fix"...)
> 
> Shouldn't the kbuild test robot have caught this?

It probably should have, but the thing has been rather spotty of late.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-09 22:06                   ` Peter Zijlstra
@ 2020-06-09 23:03                     ` Guenter Roeck
  2020-06-10  9:09                       ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: Guenter Roeck @ 2020-06-09 23:03 UTC (permalink / raw)
  To: Peter Zijlstra, Eric Biggers
  Cc: tglx, frederic, linux-kernel, x86, cai, mgorman

On 6/9/20 3:06 PM, Peter Zijlstra wrote:
> On Tue, Jun 09, 2020 at 02:38:29PM -0700, Eric Biggers wrote:
>> Does the struct actually have to be named?  How about:
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index c5d96e3e7fff42..14ca25cda19150 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -653,8 +653,14 @@ struct task_struct {
>>  	unsigned int			ptrace;
>>  
>>  #ifdef CONFIG_SMP
>> -	struct llist_node		wake_entry;
>> -	unsigned int			wake_entry_type;
>> +	/*
>> +	 * wake_entry_type must follow wake_entry, even when
>> +	 * CONFIG_GCC_PLUGIN_RANDSTRUCT=y.
>> +	 */
>> +	struct {
>> +		struct llist_node	wake_entry;
>> +		unsigned int		wake_entry_type;
>> +	};
>>  	int				on_cpu;
>>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>>  	/* Current CPU: */
>>
>>
>> However, it would be preferable to not rely on different structs sharing the
>> same field order, but rather write proper C code that uses the same struct
>> everywhere to encapsulate these 2 fields...
> 
> https://lkml.kernel.org/r/20200605093704.GB2948@hirez.programming.kicks-ass.net
> 
> And I have more patches on top to clean up some of the anonymous union
> stuff, but that's quite a lot of frobbing.
> 

That is why I tried to keep it simple as hackish fixup patch.

Guenter

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-09 23:03                     ` Guenter Roeck
@ 2020-06-10  9:09                       ` Peter Zijlstra
  0 siblings, 0 replies; 62+ messages in thread
From: Peter Zijlstra @ 2020-06-10  9:09 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Eric Biggers, tglx, frederic, linux-kernel, x86, cai, mgorman

On Tue, Jun 09, 2020 at 04:03:19PM -0700, Guenter Roeck wrote:
> On 6/9/20 3:06 PM, Peter Zijlstra wrote:
> > On Tue, Jun 09, 2020 at 02:38:29PM -0700, Eric Biggers wrote:
> >> Does the struct actually have to be named?  How about:
> >>
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index c5d96e3e7fff42..14ca25cda19150 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -653,8 +653,14 @@ struct task_struct {
> >>  	unsigned int			ptrace;
> >>  
> >>  #ifdef CONFIG_SMP
> >> -	struct llist_node		wake_entry;
> >> -	unsigned int			wake_entry_type;
> >> +	/*
> >> +	 * wake_entry_type must follow wake_entry, even when
> >> +	 * CONFIG_GCC_PLUGIN_RANDSTRUCT=y.
> >> +	 */
> >> +	struct {
> >> +		struct llist_node	wake_entry;
> >> +		unsigned int		wake_entry_type;
> >> +	};
> >>  	int				on_cpu;
> >>  #ifdef CONFIG_THREAD_INFO_IN_TASK
> >>  	/* Current CPU: */
> >>
> >>
> >> However, it would be preferable to not rely on different structs sharing the
> >> same field order, but rather write proper C code that uses the same struct
> >> everywhere to encapsulate these 2 fields...
> > 
> > https://lkml.kernel.org/r/20200605093704.GB2948@hirez.programming.kicks-ass.net
> > 
> > And I have more patches on top to clean up some of the anonymous union
> > stuff, but that's quite a lot of frobbing.
> > 
> 
> That is why I tried to keep it simple as hackish fixup patch.

Fair enough; I'll try and get the above variant merged to address the
build fail. Then I can chase down Paul's bug and finish the cleanup.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue
  2020-06-05  9:37       ` Peter Zijlstra
  2020-06-05 15:02         ` Frederic Weisbecker
  2020-06-05 15:24         ` Kees Cook
@ 2020-06-10 13:24         ` Frederic Weisbecker
  2 siblings, 0 replies; 62+ messages in thread
From: Frederic Weisbecker @ 2020-06-10 13:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, linux-kernel, x86, cai, mgorman, sfr, linux

On Fri, Jun 05, 2020 at 11:37:04AM +0200, Peter Zijlstra wrote:
> On Fri, May 29, 2020 at 03:36:41PM +0200, Peter Zijlstra wrote:
> > Maybe I can anonymous-union my way around it, dunno. I'll think about
> > it. I'm certainly not proud of this. But at least the BUILD_BUG_ON()s
> > should catch the more blatant breakage here.
> 
> How's this then? Differently ugly, but at least it compiles with that
> horrible struct randomization junk enabled.
> 
> ---
>  include/linux/irq_work.h  |   28 ++++++-------------
>  include/linux/sched.h     |    4 +-
>  include/linux/smp.h       |   25 ++++++-----------
>  include/linux/smp_types.h |   66 ++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/core.c       |    6 ++--
>  kernel/smp.c              |   18 ------------
>  6 files changed, 89 insertions(+), 58 deletions(-)


Looks good. I don't have a better idea.

Thanks!

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-09 21:25               ` Guenter Roeck
  2020-06-09 21:38                 ` Eric Biggers
@ 2020-06-18 17:57                 ` Steven Rostedt
  2020-06-18 19:06                   ` Guenter Roeck
  1 sibling, 1 reply; 62+ messages in thread
From: Steven Rostedt @ 2020-06-18 17:57 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Eric Biggers, Peter Zijlstra, tglx, frederic, linux-kernel, x86,
	cai, mgorman

On Tue, Jun 09, 2020 at 02:25:09PM -0700, Guenter Roeck wrote:
> > 
> > Still occurring on Linus' tree.  This needs to be fixed.  (And not by removing
> > support for randstruct; that's not a "fix"...)
> > 
> 
> How about the hack below ?

My test suite failed due to this bug (on my allmodconfig test).

Your hack appears to fix it. I've applied it to my "fixes" patches applied to
my test suite before building my kernels.

Thanks!

-- Steve


> 
> Guenter
> 
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c5d96e3e7fff..df1cbb04f9b3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -629,6 +629,15 @@ struct wake_q_node {
>  	struct wake_q_node *next;
>  };
>  
> +/*
> + * Hack around assumption that wake_entry_type follows wake_entry even with
> + * CONFIG_GCC_PLUGIN_RANDSTRUCT=y.
> + */
> +struct _wake_entry {
> +	struct llist_node	wake_entry;
> +	unsigned int		wake_entry_type;
> +};
> +
>  struct task_struct {
>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>  	/*
> @@ -653,8 +662,9 @@ struct task_struct {
>  	unsigned int			ptrace;
>  
>  #ifdef CONFIG_SMP
> -	struct llist_node		wake_entry;
> -	unsigned int			wake_entry_type;
> +	struct _wake_entry		_we;
> +#define wake_entry		_we.wake_entry
> +#define wake_entry_type		_we.wake_entry_type
>  	int				on_cpu;
>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>  	/* Current CPU: */

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 7/7] sched: Replace rq::wake_list
  2020-06-18 17:57                 ` Steven Rostedt
@ 2020-06-18 19:06                   ` Guenter Roeck
  0 siblings, 0 replies; 62+ messages in thread
From: Guenter Roeck @ 2020-06-18 19:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Eric Biggers, Peter Zijlstra, tglx, frederic, linux-kernel, x86,
	cai, mgorman

On Thu, Jun 18, 2020 at 01:57:33PM -0400, Steven Rostedt wrote:
> On Tue, Jun 09, 2020 at 02:25:09PM -0700, Guenter Roeck wrote:
> > > 
> > > Still occurring on Linus' tree.  This needs to be fixed.  (And not by removing
> > > support for randstruct; that's not a "fix"...)
> > > 
> > 
> > How about the hack below ?
> 
> My test suite failed due to this bug (on my allmodconfig test).
> 
> Your hack appears to fix it. I've applied it to my "fixes" patches applied to
> my test suite before building my kernels.
> 
Ah, I would have hoped that this was by now fixed upstream. Apparently that
is not the case. Guess Linus is using an old version of gcc (or maybe clang)
for his test builds, and everyone doing test builds is now using this or a
similar hack. For my part, I disabled CONFIG_GCC_PLUGIN_RANDSTRUCT.

Guenter

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2020-05-27 17:12           ` Peter Zijlstra
  2020-05-27 19:39             ` Paul E. McKenney
  2020-05-28  8:59             ` [tip: core/rcu] rcu: Allow for smp_call_function() running callbacks from idle tip-bot2 for Peter Zijlstra
@ 2021-01-21 16:56             ` Peter Zijlstra
  2021-01-22  0:20               ` Paul E. McKenney
  2 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2021-01-21 16:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: tglx, frederic, linux-kernel, x86, cai, mgorman, joel,
	valentin.schneider

On Wed, May 27, 2020 at 07:12:36PM +0200, Peter Zijlstra wrote:
> Subject: rcu: Allow for smp_call_function() running callbacks from idle
> 
> RCU currently relies hard on smp_call_function() callbacks running from
> interrupt context. A pending optimization is going to break that: it
> will allow idle CPUs to run the callbacks from the idle loop. This
> avoids raising the IPI on the requesting CPU and avoids handling an
> exception on the receiving CPU.
> 
> Change rcu_is_cpu_rrupt_from_idle() to also accept task context,
> provided it is the idle task.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/rcu/tree.c   | 25 +++++++++++++++++++------
>  kernel/sched/idle.c |  4 ++++
>  2 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index d8e9dbbefcfa..c716eadc7617 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -418,16 +418,23 @@ void rcu_momentary_dyntick_idle(void)
>  EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
>  
>  /**
> - * rcu_is_cpu_rrupt_from_idle - see if interrupted from idle
> + * rcu_is_cpu_rrupt_from_idle - see if 'interrupted' from idle
>   *
>   * If the current CPU is idle and running at a first-level (not nested)
> - * interrupt from idle, return true.  The caller must have at least
> - * disabled preemption.
> + * interrupt, or directly, from idle, return true.
> + *
> + * The caller must have at least disabled IRQs.
>   */
>  static int rcu_is_cpu_rrupt_from_idle(void)
>  {
> -	/* Called only from within the scheduling-clock interrupt */
> -	lockdep_assert_in_irq();
> +	long nesting;
> +
> +	/*
> +	 * Usually called from the tick; but also used from smp_function_call()
> +	 * for expedited grace periods. This latter can result in running from
> +	 * the idle task, instead of an actual IPI.
> +	 */
> +	lockdep_assert_irqs_disabled();
>  
>  	/* Check for counter underflows */
>  	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
> @@ -436,9 +443,15 @@ static int rcu_is_cpu_rrupt_from_idle(void)
>  			 "RCU dynticks_nmi_nesting counter underflow/zero!");
>  
>  	/* Are we at first interrupt nesting level? */
> -	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) != 1)
> +	nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting);
> +	if (nesting > 1)
>  		return false;
>  
> +	/*
> +	 * If we're not in an interrupt, we must be in the idle task!
> +	 */
> +	WARN_ON_ONCE(!nesting && !is_idle_task(current));
> +
>  	/* Does CPU appear to be idle from an RCU standpoint? */
>  	return __this_cpu_read(rcu_data.dynticks_nesting) == 0;
>  }

Let me revive this thread after yesterday's IRC conversation.

As said; it might be _extremely_ unlikely, but somewhat possible for us
to send the IPI concurrent with hot-unplug, not yet observing
rcutree_offline_cpu() or thereabout.

Then have the IPI 'delayed' enough to not happen until smpcfd_dying()
and getting run there.

This would then run the function from the stopper thread instead of the
idle thread and trigger the warning, even though we're not holding
rcu_read_lock() (which, IIRC, was the only constraint).

So would something like the below be acceptable?

---
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 368749008ae8..2c8d4c3e341e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -445,7 +445,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 	/*
 	 * Usually called from the tick; but also used from smp_function_call()
 	 * for expedited grace periods. This latter can result in running from
-	 * the idle task, instead of an actual IPI.
+	 * a (usually the idle) task, instead of an actual IPI.
 	 */
 	lockdep_assert_irqs_disabled();
 
@@ -461,9 +461,14 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 		return false;
 
 	/*
-	 * If we're not in an interrupt, we must be in the idle task!
+	 * If we're not in an interrupt, we must be in task context.
+	 *
+	 * This will typically be the idle task through:
+	 *   flush_smp_call_function_from_idle(),
+	 *
+	 * but can also be in CPU HotPlug through smpcfd_dying().
 	 */
-	WARN_ON_ONCE(!nesting && !is_idle_task(current));
+	WARN_ON_ONCE(!nesting && !in_task(current));
 
 	/* Does CPU appear to be idle from an RCU standpoint? */
 	return __this_cpu_read(rcu_data.dynticks_nesting) == 0;

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2021-01-21 16:56             ` [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi() Peter Zijlstra
@ 2021-01-22  0:20               ` Paul E. McKenney
  2021-01-22  8:31                 ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: Paul E. McKenney @ 2021-01-22  0:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, frederic, linux-kernel, x86, cai, mgorman, joel,
	valentin.schneider

On Thu, Jan 21, 2021 at 05:56:53PM +0100, Peter Zijlstra wrote:
> On Wed, May 27, 2020 at 07:12:36PM +0200, Peter Zijlstra wrote:
> > Subject: rcu: Allow for smp_call_function() running callbacks from idle
> > 
> > Current RCU hard relies on smp_call_function() callbacks running from
> > interrupt context. A pending optimization is going to break that, it
> > will allow idle CPUs to run the callbacks from the idle loop. This
> > avoids raising the IPI on the requesting CPU and avoids handling an
> > exception on the receiving CPU.
> > 
> > Change rcu_is_cpu_rrupt_from_idle() to also accept task context,
> > provided it is the idle task.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  kernel/rcu/tree.c   | 25 +++++++++++++++++++------
> >  kernel/sched/idle.c |  4 ++++
> >  2 files changed, 23 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index d8e9dbbefcfa..c716eadc7617 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -418,16 +418,23 @@ void rcu_momentary_dyntick_idle(void)
> >  EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
> >  
> >  /**
> > - * rcu_is_cpu_rrupt_from_idle - see if interrupted from idle
> > + * rcu_is_cpu_rrupt_from_idle - see if 'interrupted' from idle
> >   *
> >   * If the current CPU is idle and running at a first-level (not nested)
> > - * interrupt from idle, return true.  The caller must have at least
> > - * disabled preemption.
> > + * interrupt, or directly, from idle, return true.
> > + *
> > + * The caller must have at least disabled IRQs.
> >   */
> >  static int rcu_is_cpu_rrupt_from_idle(void)
> >  {
> > -	/* Called only from within the scheduling-clock interrupt */
> > -	lockdep_assert_in_irq();
> > +	long nesting;
> > +
> > +	/*
> > +	 * Usually called from the tick; but also used from smp_function_call()
> > +	 * for expedited grace periods. This latter can result in running from
> > +	 * the idle task, instead of an actual IPI.
> > +	 */
> > +	lockdep_assert_irqs_disabled();
> >  
> >  	/* Check for counter underflows */
> >  	RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0,
> > @@ -436,9 +443,15 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> >  			 "RCU dynticks_nmi_nesting counter underflow/zero!");
> >  
> >  	/* Are we at first interrupt nesting level? */
> > -	if (__this_cpu_read(rcu_data.dynticks_nmi_nesting) != 1)
> > +	nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting);
> > +	if (nesting > 1)
> >  		return false;
> >  
> > +	/*
> > +	 * If we're not in an interrupt, we must be in the idle task!
> > +	 */
> > +	WARN_ON_ONCE(!nesting && !is_idle_task(current));
> > +
> >  	/* Does CPU appear to be idle from an RCU standpoint? */
> >  	return __this_cpu_read(rcu_data.dynticks_nesting) == 0;
> >  }
> 
> Let me revive this thread after yesterday's IRC conversation.
> 
> As said; it might be _extremely_ unlikely, but somewhat possible for us
> to send the IPI concurrent with hot-unplug, not yet observing
> rcutree_offline_cpu() or thereabout.
> 
> Then have the IPI 'delayed' enough to not happen until smpcfd_dying()
> and getting run there.
> 
> This would then run the function from the stopper thread instead of the
> idle thread and trigger the warning, even though we're not holding
> rcu_read_lock() (which, IIRC, was the only constraint).
> 
> So would something like the below be acceptable?
> 
> ---
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 368749008ae8..2c8d4c3e341e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -445,7 +445,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
>  	/*
>  	 * Usually called from the tick; but also used from smp_function_call()
>  	 * for expedited grace periods. This latter can result in running from
> -	 * the idle task, instead of an actual IPI.
> +	 * a (usually the idle) task, instead of an actual IPI.

The story is growing enough hair that we should tell it only once.
So here, just say where it is called from:

	/*
	 * Usually called from the tick; but also used from smp_function_call()
	 * for expedited grace periods.
	 */

>  	lockdep_assert_irqs_disabled();
>  
> @@ -461,9 +461,14 @@ static int rcu_is_cpu_rrupt_from_idle(void)
>  		return false;
>  
>  	/*
> -	 * If we're not in an interrupt, we must be in the idle task!
> +	 * If we're not in an interrupt, we must be in task context.
> +	 *
> +	 * This will typically be the idle task through:
> +	 *   flush_smp_call_function_from_idle(),
> +	 *
> +	 * but can also be in CPU HotPlug through smpcfd_dying().
>  	 */

Good, but how about like this?

	/*
	 * If we are not in an interrupt handler, we must be in
	 * an smp_call_function() handler.
	 *
	 * Normally, smp_call_function() handlers are invoked from
	 * the idle task via flush_smp_call_function_from_idle().
	 * However, they can also be invoked from CPU hotplug
	 * operations via smpcfd_dying().
	 */

> -	WARN_ON_ONCE(!nesting && !is_idle_task(current));
> +	WARN_ON_ONCE(!nesting && !in_task(current));

This is used in time-critical contexts, so why not RCU_LOCKDEP_WARN()?
That should also allow checking more closely.  Would something like the
following work?

	RCU_LOCKDEP_WARN(!nesting && !is_idle_task(current) && (!in_task(current) || !lockdep_cpus_write_held()));

Where lockdep_cpus_write_held() is defined in kernel/cpu.c:

bool lockdep_cpus_write_held(void)
{
#ifdef CONFIG_PROVE_LOCKING
	if (system_state < SYSTEM_RUNNING)
		return false;
	return lockdep_is_held_type(&cpu_hotplug_lock, 0);
#else
	return false;
#endif
}

Seem reasonable?

							Thanx, Paul

>  	/* Does CPU appear to be idle from an RCU standpoint? */
>  	return __this_cpu_read(rcu_data.dynticks_nesting) == 0;

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2021-01-22  0:20               ` Paul E. McKenney
@ 2021-01-22  8:31                 ` Peter Zijlstra
  2021-01-22 15:35                   ` Paul E. McKenney
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2021-01-22  8:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: tglx, frederic, linux-kernel, x86, cai, mgorman, joel,
	valentin.schneider

On Thu, Jan 21, 2021 at 04:20:12PM -0800, Paul E. McKenney wrote:

> > ---
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 368749008ae8..2c8d4c3e341e 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -445,7 +445,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> >  	/*
> >  	 * Usually called from the tick; but also used from smp_function_call()
> >  	 * for expedited grace periods. This latter can result in running from
> > -	 * the idle task, instead of an actual IPI.
> > +	 * a (usually the idle) task, instead of an actual IPI.
> 
> The story is growing enough hair that we should tell it only once.
> So here just where it is called from:
> 
> 	/*
> 	 * Usually called from the tick; but also used from smp_function_call()
> 	 * for expedited grace periods.
> 	 */
> 
> >  	lockdep_assert_irqs_disabled();
> >  
> > @@ -461,9 +461,14 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> >  		return false;
> >  
> >  	/*
> > -	 * If we're not in an interrupt, we must be in the idle task!
> > +	 * If we're not in an interrupt, we must be in task context.
> > +	 *
> > +	 * This will typically be the idle task through:
> > +	 *   flush_smp_call_function_from_idle(),
> > +	 *
> > +	 * but can also be in CPU HotPlug through smpcfd_dying().
> >  	 */
> 
> Good, but how about like this?
> 
> 	/*
> 	 * If we are not in an interrupt handler, we must be in
> 	 * smp_call_function() handler.
> 	 *
> 	 * Normally, smp_call_function() handlers are invoked from
> 	 * the idle task via flush_smp_call_function_from_idle().
> 	 * However, they can also be invoked from CPU hotplug
> 	 * operations via smpcfd_dying().
> 	 */
> 
> > -	WARN_ON_ONCE(!nesting && !is_idle_task(current));
> > +	WARN_ON_ONCE(!nesting && !in_task(current));
> 
> This is used in time-critical contexts, so why not RCU_LOCKDEP_WARN()?
> That should also allow checking more closely.  Would something like the
> following work?
> 
> 	RCU_LOCKDEP_WARN(!nesting && !is_idle_task(current) && (!in_task(current) || !lockdep_cpus_write_held()));
> 
> Where lockdep_cpus_write_held is defined in kernel/cpu.c:

Works for me, except s/in_task(current)/in_task()/ compiles a lot
better.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi()
  2021-01-22  8:31                 ` Peter Zijlstra
@ 2021-01-22 15:35                   ` Paul E. McKenney
  0 siblings, 0 replies; 62+ messages in thread
From: Paul E. McKenney @ 2021-01-22 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, frederic, linux-kernel, x86, cai, mgorman, joel,
	valentin.schneider

On Fri, Jan 22, 2021 at 09:31:37AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 21, 2021 at 04:20:12PM -0800, Paul E. McKenney wrote:
> 
> > > ---
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 368749008ae8..2c8d4c3e341e 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -445,7 +445,7 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > >  	/*
> > >  	 * Usually called from the tick; but also used from smp_function_call()
> > >  	 * for expedited grace periods. This latter can result in running from
> > > -	 * the idle task, instead of an actual IPI.
> > > +	 * a (usually the idle) task, instead of an actual IPI.
> > 
> > The story is growing enough hair that we should tell it only once.
> > So here just where it is called from:
> > 
> > 	/*
> > 	 * Usually called from the tick; but also used from smp_function_call()
> > 	 * for expedited grace periods.
> > 	 */
> > 
> > >  	lockdep_assert_irqs_disabled();
> > >  
> > > @@ -461,9 +461,14 @@ static int rcu_is_cpu_rrupt_from_idle(void)
> > >  		return false;
> > >  
> > >  	/*
> > > -	 * If we're not in an interrupt, we must be in the idle task!
> > > +	 * If we're not in an interrupt, we must be in task context.
> > > +	 *
> > > +	 * This will typically be the idle task through:
> > > +	 *   flush_smp_call_function_from_idle(),
> > > +	 *
> > > +	 * but can also be in CPU HotPlug through smpcfd_dying().
> > >  	 */
> > 
> > Good, but how about like this?
> > 
> > 	/*
> > 	 * If we are not in an interrupt handler, we must be in
> > 	 * smp_call_function() handler.
> > 	 *
> > 	 * Normally, smp_call_function() handlers are invoked from
> > 	 * the idle task via flush_smp_call_function_from_idle().
> > 	 * However, they can also be invoked from CPU hotplug
> > 	 * operations via smpcfd_dying().
> > 	 */
> > 
> > > -	WARN_ON_ONCE(!nesting && !is_idle_task(current));
> > > +	WARN_ON_ONCE(!nesting && !in_task(current));
> > 
> > This is used in time-critical contexts, so why not RCU_LOCKDEP_WARN()?
> > That should also allow checking more closely.  Would something like the
> > following work?
> > 
> > 	RCU_LOCKDEP_WARN(!nesting && !is_idle_task(current) && (!in_task(current) || !lockdep_cpus_write_held()));
> > 
> > Where lockdep_cpus_write_held is defined in kernel/cpu.c:
> 
> Works for me, except s/in_task(current)/in_task()/ compiles a lot
> better.

These compilers sure constrain my creativity!  ;-)

Might be a good thing, though...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2021-01-22 15:38 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-26 16:10 [RFC][PATCH 0/7] Fix the scheduler-IPI mess Peter Zijlstra
2020-05-26 16:10 ` [RFC][PATCH 1/7] sched: Fix smp_call_function_single_async() usage for ILB Peter Zijlstra
2020-05-26 23:56   ` Frederic Weisbecker
2020-05-27 10:23   ` Vincent Guittot
2020-05-27 11:28     ` Frederic Weisbecker
2020-05-27 12:07       ` Vincent Guittot
2020-05-29 15:26   ` Valentin Schneider
2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2020-06-01 11:40     ` Frederic Weisbecker
2020-05-26 16:10 ` [RFC][PATCH 2/7] smp: Optimize flush_smp_call_function_queue() Peter Zijlstra
2020-05-28 12:28   ` Frederic Weisbecker
2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2020-05-26 16:11 ` [RFC][PATCH 3/7] smp: Move irq_work_run() out of flush_smp_call_function_queue() Peter Zijlstra
2020-05-29 13:04   ` Frederic Weisbecker
2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2020-05-26 16:11 ` [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi() Peter Zijlstra
2020-05-27  9:56   ` Peter Zijlstra
2020-05-27 10:15     ` Peter Zijlstra
2020-05-27 15:56       ` Paul E. McKenney
2020-05-27 16:35         ` Peter Zijlstra
2020-05-27 17:12           ` Peter Zijlstra
2020-05-27 19:39             ` Paul E. McKenney
2020-05-28  1:35               ` Joel Fernandes
2020-05-28  8:59             ` [tip: core/rcu] rcu: Allow for smp_call_function() running callbacks from idle tip-bot2 for Peter Zijlstra
2021-01-21 16:56             ` [RFC][PATCH 4/7] smp: Optimize send_call_function_single_ipi() Peter Zijlstra
2021-01-22  0:20               ` Paul E. McKenney
2021-01-22  8:31                 ` Peter Zijlstra
2021-01-22 15:35                   ` Paul E. McKenney
2020-05-29 13:01   ` Frederic Weisbecker
2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2020-05-26 16:11 ` [RFC][PATCH 5/7] irq_work, smp: Allow irq_work on call_single_queue Peter Zijlstra
2020-05-28 23:40   ` Frederic Weisbecker
2020-05-29 13:36     ` Peter Zijlstra
2020-06-05  9:37       ` Peter Zijlstra
2020-06-05 15:02         ` Frederic Weisbecker
2020-06-05 16:17           ` Peter Zijlstra
2020-06-05 15:24         ` Kees Cook
2020-06-10 13:24         ` Frederic Weisbecker
2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2020-05-26 16:11 ` [RFC][PATCH 6/7] sched: Add rq::ttwu_pending Peter Zijlstra
2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2020-05-26 16:11 ` [RFC][PATCH 7/7] sched: Replace rq::wake_list Peter Zijlstra
2020-05-29 15:10   ` Valdis Klētnieks
2020-06-01  9:52   ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2020-06-02 15:16     ` Frederic Weisbecker
2020-06-04 14:18   ` [RFC][PATCH 7/7] " Guenter Roeck
2020-06-05  0:24     ` Eric Biggers
2020-06-05  7:41       ` Peter Zijlstra
2020-06-05 16:15         ` Eric Biggers
2020-06-06 23:13           ` Guenter Roeck
2020-06-09 20:21             ` Eric Biggers
2020-06-09 21:25               ` Guenter Roeck
2020-06-09 21:38                 ` Eric Biggers
2020-06-09 22:06                   ` Peter Zijlstra
2020-06-09 23:03                     ` Guenter Roeck
2020-06-10  9:09                       ` Peter Zijlstra
2020-06-18 17:57                 ` Steven Rostedt
2020-06-18 19:06                   ` Guenter Roeck
2020-06-09 22:07               ` Peter Zijlstra
2020-06-05  8:10     ` Peter Zijlstra
2020-06-05 13:33       ` Guenter Roeck
2020-06-05 14:09         ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).