All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/4] softirq: Per vector threading v3
@ 2018-01-19 15:46 Frederic Weisbecker
  2018-01-19 15:46 ` [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail Frederic Weisbecker
                   ` (6 more replies)
  0 siblings, 7 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2018-01-19 15:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Levin Alexander, Peter Zijlstra,
	Mauro Carvalho Chehab, Linus Torvalds, Hannes Frederic Sowa,
	Paul E . McKenney, Wanpeng Li, Dmitry Safonov, Thomas Gleixner,
	Andrew Morton, Paolo Abeni, Radu Rendec, Ingo Molnar,
	Stanislaw Gruszka, Rik van Riel, Eric Dumazet, David Miller

As per Linus suggestion, this take doesn't limit the number of occurences
per jiffy anymore but instead defers a vector to workqueues as soon as
it gets re-enqueued on IRQ tail.

No tunable here, so testing should be easier.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	softirq/thread-v3

HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4

Thanks,
	Frederic
---

Frederic Weisbecker (4):
      softirq: Limit vector to a single iteration on IRQ tail
      softirq: Per vector deferment to workqueue
      softirq: Defer to workqueue when rescheduling is needed
      softirq: Replace ksoftirqd with workqueues entirely


 Documentation/RCU/stallwarn.txt |   4 +-
 include/linux/interrupt.h       |   7 +-
 kernel/sched/cputime.c          |  12 +--
 kernel/sched/sched.h            |   4 +-
 kernel/softirq.c                | 223 +++++++++++++++++++++++-----------------
 net/ipv4/tcp_output.c           |   5 +-
 6 files changed, 144 insertions(+), 111 deletions(-)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail
  2018-01-19 15:46 [RFC PATCH 0/4] softirq: Per vector threading v3 Frederic Weisbecker
@ 2018-01-19 15:46 ` Frederic Weisbecker
  2018-01-19 16:16   ` David Miller
  2018-01-19 15:46 ` [RFC PATCH 2/4] softirq: Per vector deferment to workqueue Frederic Weisbecker
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2018-01-19 15:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Levin Alexander, Peter Zijlstra,
	Mauro Carvalho Chehab, Linus Torvalds, Hannes Frederic Sowa,
	Paul E . McKenney, Wanpeng Li, Dmitry Safonov, Thomas Gleixner,
	Andrew Morton, Paolo Abeni, Radu Rendec, Ingo Molnar,
	Stanislaw Gruszka, Rik van Riel, Eric Dumazet, David Miller

Softirqs are currently allowed to execute as much pending work as needed
as long as it doesn't overrun 10 iterations or 2 ms. The rest of the
pending work is deferred to ksoftirqd beyond that limit.

In fact 2 ms is a very high threshold that shouldn't be reachable after
only 10 iterations. And just counting iterations doesn't help diagnose
which vector forces softirqs to be deferred to a thread at the expense
of other vectors that could need quick service. Instead of blindly
kicking ksoftirqd when the softirq load becomes too high, we rather
want to offload the culprit vector only and let the other vectors to
be handled on IRQ tail as usual.

To achieve this, we rather keep track of which vectors have run on
past iterations and kick them off to threading if they get re-enqueued.
We assume that a vector only needs one iteration on normal loads while
re-enqueuing is a matter of stress, so this should be a simple yet
reliable way to detect which vector needs to be queued off in a task.

For now, vectors that get re-enqueued trigger ksoftirqd but they are
going to be handled by per-vector workqueues on subsequent patches.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Levin Alexander <alexander.levin@verizon.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Radu Rendec <rrendec@arista.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com>
---
 kernel/softirq.c | 30 +++++++-----------------------
 1 file changed, 7 insertions(+), 23 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f..c8c6841 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -190,22 +190,6 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 }
 EXPORT_SYMBOL(__local_bh_enable_ip);
 
-/*
- * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
- * but break the loop if need_resched() is set or after 2 ms.
- * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
- * certain cases, such as stop_machine(), jiffies may cease to
- * increment and so we need the MAX_SOFTIRQ_RESTART limit as
- * well to make sure we eventually return from this method.
- *
- * These limits have been established via experimentation.
- * The two things to balance is latency against fairness -
- * we want to handle softirqs as soon as possible, but they
- * should not be able to lock up the box.
- */
-#define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
-#define MAX_SOFTIRQ_RESTART 10
-
 #ifdef CONFIG_TRACE_IRQFLAGS
 /*
  * When we run softirqs from irq_exit() and thus on the hardirq stack we need
@@ -241,12 +225,10 @@ static inline void lockdep_softirq_end(bool in_hardirq) { }
 
 asmlinkage __visible void __softirq_entry __do_softirq(void)
 {
-	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
 	unsigned long old_flags = current->flags;
-	int max_restart = MAX_SOFTIRQ_RESTART;
 	struct softirq_action *h;
 	bool in_hardirq;
-	__u32 pending;
+	__u32 pending, executed = 0;
 	int softirq_bit;
 
 	/*
@@ -284,6 +266,9 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 		trace_softirq_entry(vec_nr);
 		h->action(h);
 		trace_softirq_exit(vec_nr);
+
+		executed |= 1 << vec_nr;
+
 		if (unlikely(prev_count != preempt_count())) {
 			pr_err("huh, entered softirq %u %s %p with preempt_count %08x, exited with %08x?\n",
 			       vec_nr, softirq_to_name[vec_nr], h->action,
@@ -299,11 +284,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 
 	pending = local_softirq_pending();
 	if (pending) {
-		if (time_before(jiffies, end) && !need_resched() &&
-		    --max_restart)
+		if (pending & executed || need_resched())
+			wakeup_softirqd();
+		else
 			goto restart;
-
-		wakeup_softirqd();
 	}
 
 	lockdep_softirq_end(in_hardirq);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-01-19 15:46 [RFC PATCH 0/4] softirq: Per vector threading v3 Frederic Weisbecker
  2018-01-19 15:46 ` [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail Frederic Weisbecker
@ 2018-01-19 15:46 ` Frederic Weisbecker
  2018-01-20  8:41   ` Pavan Kondeti
  2018-02-08 17:44   ` Sebastian Andrzej Siewior
  2018-01-19 15:46 ` [RFC PATCH 3/4] softirq: Defer to workqueue when rescheduling is needed Frederic Weisbecker
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2018-01-19 15:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Levin Alexander, Peter Zijlstra,
	Mauro Carvalho Chehab, Linus Torvalds, Hannes Frederic Sowa,
	Paul E . McKenney, Wanpeng Li, Dmitry Safonov, Thomas Gleixner,
	Andrew Morton, Paolo Abeni, Radu Rendec, Ingo Molnar,
	Stanislaw Gruszka, Rik van Riel, Eric Dumazet, David Miller

Some softirq vectors can be more CPU hungry than others. Especially
networking may sometimes deal with packet storm and need more CPU than
IRQ tail can offer without inducing scheduler latencies. In this case
the current code defers to ksoftirqd that behaves nicer. Now this nice
behaviour can be bad for other IRQ vectors that usually need quick
processing.

To solve this we only defer to threading the vectors that got
re-enqueued on IRQ tail processing and leave the others inline on IRQ
tail service. This is achieved using workqueues with
per-CPU/per-vector worklets.

Note ksoftirqd is not yet removed as it is still needed for threaded IRQs
mode.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Levin Alexander <alexander.levin@verizon.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Radu Rendec <rrendec@arista.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com>
---
 include/linux/interrupt.h |   2 +
 kernel/sched/cputime.c    |   5 +-
 kernel/softirq.c          | 120 ++++++++++++++++++++++++++++++++++++++++++----
 net/ipv4/tcp_output.c     |   3 +-
 4 files changed, 119 insertions(+), 11 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 69c2382..92d044d 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -514,6 +514,8 @@ static inline struct task_struct *this_cpu_ksoftirqd(void)
 	return this_cpu_read(ksoftirqd);
 }
 
+extern int softirq_serving_workqueue(void);
+
 /* Tasklets --- multithreaded analogue of BHs.
 
    Main feature differing them of generic softirqs: tasklet
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index bac6ac9..30f70e5 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -71,7 +71,8 @@ void irqtime_account_irq(struct task_struct *curr)
 	 */
 	if (hardirq_count())
 		irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
-	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
+	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd() &&
+		 !softirq_serving_workqueue())
 		irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
@@ -375,7 +376,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 
 	cputime -= other;
 
-	if (this_cpu_ksoftirqd() == p) {
+	if (this_cpu_ksoftirqd() == p || softirq_serving_workqueue()) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
diff --git a/kernel/softirq.c b/kernel/softirq.c
index c8c6841..becb1d9 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -62,6 +62,19 @@ const char * const softirq_to_name[NR_SOFTIRQS] = {
 	"TASKLET", "SCHED", "HRTIMER", "RCU"
 };
 
+struct vector {
+	int nr;
+	struct work_struct work;
+};
+
+struct softirq {
+	unsigned int pending_work_mask;
+	int work_running;
+	struct vector vector[NR_SOFTIRQS];
+};
+
+static DEFINE_PER_CPU(struct softirq, softirq_cpu);
+
 /*
  * we cannot loop indefinitely here to avoid userspace starvation,
  * but we also don't want to introduce a worst case 1/HZ latency
@@ -223,8 +236,77 @@ static inline bool lockdep_softirq_start(void) { return false; }
 static inline void lockdep_softirq_end(bool in_hardirq) { }
 #endif
 
+int softirq_serving_workqueue(void)
+{
+	return __this_cpu_read(softirq_cpu.work_running);
+}
+
+static void vector_work_func(struct work_struct *work)
+{
+	struct vector *vector = container_of(work, struct vector, work);
+	struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
+	int vec_nr = vector->nr;
+	int vec_bit = BIT(vec_nr);
+	u32 pending;
+
+	local_irq_disable();
+	pending = local_softirq_pending();
+	account_irq_enter_time(current);
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
+	lockdep_softirq_enter();
+	set_softirq_pending(pending & ~vec_bit);
+	local_irq_enable();
+
+	if (pending & vec_bit) {
+		struct softirq_action *sa = &softirq_vec[vec_nr];
+
+		kstat_incr_softirqs_this_cpu(vec_nr);
+		softirq->work_running = 1;
+		trace_softirq_entry(vec_nr);
+		sa->action(sa);
+		trace_softirq_exit(vec_nr);
+		softirq->work_running = 0;
+	}
+
+	local_irq_disable();
+
+	pending = local_softirq_pending();
+	if (pending & vec_bit)
+		schedule_work_on(smp_processor_id(), &vector->work);
+	else
+		softirq->pending_work_mask &= ~vec_bit;
+
+	lockdep_softirq_exit();
+	account_irq_exit_time(current);
+	__local_bh_enable(SOFTIRQ_OFFSET);
+	local_irq_enable();
+}
+
+static void do_softirq_workqueue(u32 pending)
+{
+	struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
+	struct softirq_action *h = softirq_vec;
+	int softirq_bit;
+
+	pending &= ~softirq->pending_work_mask;
+
+	while ((softirq_bit = ffs(pending))) {
+		struct vector *vector;
+		unsigned int vec_nr;
+
+		h += softirq_bit - 1;
+		vec_nr = h - softirq_vec;
+		softirq->pending_work_mask |= BIT(vec_nr);
+		vector = &softirq->vector[vec_nr];
+		schedule_work_on(smp_processor_id(), &vector->work);
+		h++;
+		pending >>= softirq_bit;
+	}
+}
+
 asmlinkage __visible void __softirq_entry __do_softirq(void)
 {
+	struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
 	unsigned long old_flags = current->flags;
 	struct softirq_action *h;
 	bool in_hardirq;
@@ -238,15 +320,18 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	 */
 	current->flags &= ~PF_MEMALLOC;
 
-	pending = local_softirq_pending();
+	/* Ignore vectors pending on workqueues, they have been punished */
+	pending = local_softirq_pending() & ~softirq->pending_work_mask;
 	account_irq_enter_time(current);
 
 	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
 	in_hardirq = lockdep_softirq_start();
-
 restart:
-	/* Reset the pending bitmask before enabling irqs */
-	set_softirq_pending(0);
+	/*
+	 * Reset the pending bitmask before enabling irqs but keep
+	 * those pending on workqueues so they get properly handled there.
+	 */
+	set_softirq_pending(softirq->pending_work_mask);
 
 	local_irq_enable();
 
@@ -282,12 +367,18 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	rcu_bh_qs();
 	local_irq_disable();
 
-	pending = local_softirq_pending();
+	pending = local_softirq_pending() & ~softirq->pending_work_mask;
 	if (pending) {
-		if (pending & executed || need_resched())
+		if (need_resched()) {
 			wakeup_softirqd();
-		else
-			goto restart;
+		} else {
+			/* Vectors that got re-enqueued are threaded */
+			if (executed & pending)
+				do_softirq_workqueue(executed & pending);
+			pending &= ~executed;
+			if (pending)
+				goto restart;
+		}
 	}
 
 	lockdep_softirq_end(in_hardirq);
@@ -624,10 +715,23 @@ void __init softirq_init(void)
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
+		struct softirq *softirq;
+		int i;
+
 		per_cpu(tasklet_vec, cpu).tail =
 			&per_cpu(tasklet_vec, cpu).head;
 		per_cpu(tasklet_hi_vec, cpu).tail =
 			&per_cpu(tasklet_hi_vec, cpu).head;
+
+		softirq = &per_cpu(softirq_cpu, cpu);
+
+		for (i = 0; i < NR_SOFTIRQS; i++) {
+			struct vector *vector;
+
+			vector = &softirq->vector[i];
+			vector->nr = i;
+			INIT_WORK(&vector->work, vector_work_func);
+		}
 	}
 
 	open_softirq(TASKLET_SOFTIRQ, tasklet_action);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a4d214c..b4e4160 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -919,7 +919,8 @@ void tcp_wfree(struct sk_buff *skb)
 	 * - chance for incoming ACK (processed by another cpu maybe)
 	 *   to migrate this flow (skb->ooo_okay will be eventually set)
 	 */
-	if (refcount_read(&sk->sk_wmem_alloc) >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
+	if (refcount_read(&sk->sk_wmem_alloc) >= SKB_TRUESIZE(1) &&
+	    (this_cpu_ksoftirqd() == current || softirq_serving_workqueue()))
 		goto out;
 
 	for (oval = READ_ONCE(sk->sk_tsq_flags);; oval = nval) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 3/4] softirq: Defer to workqueue when rescheduling is needed
  2018-01-19 15:46 [RFC PATCH 0/4] softirq: Per vector threading v3 Frederic Weisbecker
  2018-01-19 15:46 ` [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail Frederic Weisbecker
  2018-01-19 15:46 ` [RFC PATCH 2/4] softirq: Per vector deferment to workqueue Frederic Weisbecker
@ 2018-01-19 15:46 ` Frederic Weisbecker
  2018-01-19 15:46 ` [RFC PATCH 4/4] softirq: Replace ksoftirqd with workqueues entirely Frederic Weisbecker
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2018-01-19 15:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Levin Alexander, Peter Zijlstra,
	Mauro Carvalho Chehab, Linus Torvalds, Hannes Frederic Sowa,
	Paul E . McKenney, Wanpeng Li, Dmitry Safonov, Thomas Gleixner,
	Andrew Morton, Paolo Abeni, Radu Rendec, Ingo Molnar,
	Stanislaw Gruszka, Rik van Riel, Eric Dumazet, David Miller

One more step toward converting ksoftirqd to per vector workqueues.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Levin Alexander <alexander.levin@verizon.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Radu Rendec <rrendec@arista.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com>
---
 kernel/softirq.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index becb1d9..bb0cffa 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -369,16 +369,14 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 
 	pending = local_softirq_pending() & ~softirq->pending_work_mask;
 	if (pending) {
-		if (need_resched()) {
-			wakeup_softirqd();
-		} else {
-			/* Vectors that got re-enqueued are threaded */
-			if (executed & pending)
-				do_softirq_workqueue(executed & pending);
-			pending &= ~executed;
-			if (pending)
-				goto restart;
-		}
+		if (need_resched())
+			executed = pending;
+		/* Vectors that got re-enqueued are threaded */
+		if (executed & pending)
+			do_softirq_workqueue(executed & pending);
+		pending &= ~executed;
+		if (pending)
+			goto restart;
 	}
 
 	lockdep_softirq_end(in_hardirq);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 4/4] softirq: Replace ksoftirqd with workqueues entirely
  2018-01-19 15:46 [RFC PATCH 0/4] softirq: Per vector threading v3 Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2018-01-19 15:46 ` [RFC PATCH 3/4] softirq: Defer to workqueue when rescheduling is needed Frederic Weisbecker
@ 2018-01-19 15:46 ` Frederic Weisbecker
  2018-01-22 19:58 ` [RFC PATCH 0/4] softirq: Per vector threading v3 Mauro Carvalho Chehab
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2018-01-19 15:46 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Levin Alexander, Peter Zijlstra,
	Mauro Carvalho Chehab, Linus Torvalds, Hannes Frederic Sowa,
	Paul E . McKenney, Wanpeng Li, Dmitry Safonov, Thomas Gleixner,
	Andrew Morton, Paolo Abeni, Radu Rendec, Ingo Molnar,
	Stanislaw Gruszka, Rik van Riel, Eric Dumazet, David Miller

Ksoftirqd only remains to implement threaded IRQs. Convert it to
existing per-vector workqueues to avoid code duplication.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Levin Alexander <alexander.levin@verizon.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Radu Rendec <rrendec@arista.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com>
---
 Documentation/RCU/stallwarn.txt |  4 +-
 include/linux/interrupt.h       |  7 ----
 kernel/sched/cputime.c          | 13 +++---
 kernel/sched/sched.h            |  4 +-
 kernel/softirq.c                | 87 +++++++++--------------------------------
 net/ipv4/tcp_output.c           |  4 +-
 6 files changed, 31 insertions(+), 88 deletions(-)

diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index a08f928..ea3a8de 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -17,8 +17,8 @@ o	A CPU looping in an RCU read-side critical section.
 o	A CPU looping with interrupts disabled.
 
 o	A CPU looping with preemption disabled.  This condition can
-	result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
-	stalls.
+	result in RCU-sched stalls and, if softirq workqueue is in use,
+	RCU-bh stalls.
 
 o	A CPU looping with bottom halves disabled.  This condition can
 	result in RCU-sched and RCU-bh stalls.
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 92d044d..680f620 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -507,13 +507,6 @@ extern void __raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq(unsigned int nr);
 
-DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
-
-static inline struct task_struct *this_cpu_ksoftirqd(void)
-{
-	return this_cpu_read(ksoftirqd);
-}
-
 extern int softirq_serving_workqueue(void);
 
 /* Tasklets --- multithreaded analogue of BHs.
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 30f70e5..c5b8dbd 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -64,15 +64,14 @@ void irqtime_account_irq(struct task_struct *curr)
 	irqtime->irq_start_time += delta;
 
 	/*
-	 * We do not account for softirq time from ksoftirqd here.
-	 * We want to continue accounting softirq time to ksoftirqd thread
+	 * We do not account for softirq time from workqueue here.
+	 * We want to continue accounting softirq time to workqueue thread
 	 * in that case, so as not to confuse scheduler with a special task
 	 * that do not consume any time, but still wants to run.
 	 */
 	if (hardirq_count())
 		irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
-	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd() &&
-		 !softirq_serving_workqueue())
+	else if (in_serving_softirq() && !softirq_serving_workqueue())
 		irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
 }
 EXPORT_SYMBOL_GPL(irqtime_account_irq);
@@ -376,11 +375,11 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 
 	cputime -= other;
 
-	if (this_cpu_ksoftirqd() == p || softirq_serving_workqueue()) {
+	if (softirq_serving_workqueue()) {
 		/*
-		 * ksoftirqd time do not get accounted in cpu_softirq_time.
+		 * Softirq wq time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
-		 * Also, p->stime needs to be updated for ksoftirqd.
+		 * Also, p->stime needs to be updated for workqueue.
 		 */
 		account_system_index_time(p, cputime, CPUTIME_SOFTIRQ);
 	} else if (user_tick) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a2..5d481f1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2061,8 +2061,8 @@ struct irqtime {
 DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
 
 /*
- * Returns the irqtime minus the softirq time computed by ksoftirqd.
- * Otherwise ksoftirqd's sum_exec_runtime is substracted its own runtime
+ * Returns the irqtime minus the softirq time computed by workqueue.
+ * Otherwise workqueue's sum_exec_runtime is substracted its own runtime
  * and never move forward.
  */
 static inline u64 irq_time_read(int cpu)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bb0cffa..cf43a8d 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -55,8 +55,6 @@ EXPORT_SYMBOL(irq_stat);
 
 static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
 
-DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
-
 const char * const softirq_to_name[NR_SOFTIRQS] = {
 	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL",
 	"TASKLET", "SCHED", "HRTIMER", "RCU"
@@ -76,32 +74,6 @@ struct softirq {
 static DEFINE_PER_CPU(struct softirq, softirq_cpu);
 
 /*
- * we cannot loop indefinitely here to avoid userspace starvation,
- * but we also don't want to introduce a worst case 1/HZ latency
- * to the pending events, so lets the scheduler to balance
- * the softirq load for us.
- */
-static void wakeup_softirqd(void)
-{
-	/* Interrupts are disabled: no need to stop preemption */
-	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
-
-	if (tsk && tsk->state != TASK_RUNNING)
-		wake_up_process(tsk);
-}
-
-/*
- * If ksoftirqd is scheduled, we do not want to process pending softirqs
- * right now. Let ksoftirqd handle this at its own rate, to get fairness.
- */
-static bool ksoftirqd_running(void)
-{
-	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
-
-	return tsk && (tsk->state == TASK_RUNNING);
-}
-
-/*
  * preempt_count and SOFTIRQ_OFFSET usage:
  * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
  *   softirq processing.
@@ -388,7 +360,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 
 asmlinkage __visible void do_softirq(void)
 {
-	__u32 pending;
+	__u32 pending, pending_work;
 	unsigned long flags;
 
 	if (in_interrupt())
@@ -397,8 +369,9 @@ asmlinkage __visible void do_softirq(void)
 	local_irq_save(flags);
 
 	pending = local_softirq_pending();
+	pending_work = __this_cpu_read(softirq_cpu.pending_work_mask);
 
-	if (pending && !ksoftirqd_running())
+	if (pending & ~pending_work)
 		do_softirq_own_stack();
 
 	local_irq_restore(flags);
@@ -412,7 +385,7 @@ void irq_enter(void)
 	rcu_irq_enter();
 	if (is_idle_task(current) && !in_interrupt()) {
 		/*
-		 * Prevent raise_softirq from needlessly waking up ksoftirqd
+		 * Prevent raise_softirq from needlessly waking up workqueue
 		 * here, as softirq will be serviced on return from interrupt.
 		 */
 		local_bh_disable();
@@ -425,7 +398,15 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
-	if (ksoftirqd_running())
+	unsigned int pending_work, pending = local_softirq_pending();
+
+	if (!pending)
+		return;
+
+	pending_work = __this_cpu_read(softirq_cpu.pending_work_mask);
+	pending &= ~pending_work;
+
+	if (!pending)
 		return;
 
 	if (!force_irqthreads) {
@@ -445,7 +426,7 @@ static inline void invoke_softirq(void)
 		do_softirq_own_stack();
 #endif
 	} else {
-		wakeup_softirqd();
+		do_softirq_workqueue(pending);
 	}
 }
 
@@ -474,7 +455,7 @@ void irq_exit(void)
 #endif
 	account_irq_exit_time(current);
 	preempt_count_sub(HARDIRQ_OFFSET);
-	if (!in_interrupt() && local_softirq_pending())
+	if (!in_interrupt())
 		invoke_softirq();
 
 	tick_irq_exit();
@@ -495,11 +476,11 @@ inline void raise_softirq_irqoff(unsigned int nr)
 	 * actually run the softirq once we return from
 	 * the irq or softirq.
 	 *
-	 * Otherwise we wake up ksoftirqd to make sure we
+	 * Otherwise we wake up workqueue to make sure we
 	 * schedule the softirq soon.
 	 */
 	if (!in_interrupt())
-		wakeup_softirqd();
+		do_softirq_workqueue(BIT(nr));
 }
 
 void raise_softirq(unsigned int nr)
@@ -736,27 +717,6 @@ void __init softirq_init(void)
 	open_softirq(HI_SOFTIRQ, tasklet_hi_action);
 }
 
-static int ksoftirqd_should_run(unsigned int cpu)
-{
-	return local_softirq_pending();
-}
-
-static void run_ksoftirqd(unsigned int cpu)
-{
-	local_irq_disable();
-	if (local_softirq_pending()) {
-		/*
-		 * We can safely run softirq on inline stack, as we are not deep
-		 * in the task stack here.
-		 */
-		__do_softirq();
-		local_irq_enable();
-		cond_resched_rcu_qs();
-		return;
-	}
-	local_irq_enable();
-}
-
 #ifdef CONFIG_HOTPLUG_CPU
 /*
  * tasklet_kill_immediate is called to remove a tasklet which can already be
@@ -819,22 +779,13 @@ static int takeover_tasklets(unsigned int cpu)
 #define takeover_tasklets	NULL
 #endif /* CONFIG_HOTPLUG_CPU */
 
-static struct smp_hotplug_thread softirq_threads = {
-	.store			= &ksoftirqd,
-	.thread_should_run	= ksoftirqd_should_run,
-	.thread_fn		= run_ksoftirqd,
-	.thread_comm		= "ksoftirqd/%u",
-};
-
-static __init int spawn_ksoftirqd(void)
+static __init int tasklet_set_takeover(void)
 {
 	cpuhp_setup_state_nocalls(CPUHP_SOFTIRQ_DEAD, "softirq:dead", NULL,
 				  takeover_tasklets);
-	BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
-
 	return 0;
 }
-early_initcall(spawn_ksoftirqd);
+early_initcall(tasklet_set_takeover);
 
 /*
  * [ These __weak aliases are kept in a separate compilation unit, so that
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b4e4160..3b4811e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -912,7 +912,7 @@ void tcp_wfree(struct sk_buff *skb)
 	 */
 	WARN_ON(refcount_sub_and_test(skb->truesize - 1, &sk->sk_wmem_alloc));
 
-	/* If this softirq is serviced by ksoftirqd, we are likely under stress.
+	/* If this softirq is serviced by workqueue, we are likely under stress.
 	 * Wait until our queues (qdisc + devices) are drained.
 	 * This gives :
 	 * - less callbacks to tcp_write_xmit(), reducing stress (batches)
@@ -920,7 +920,7 @@ void tcp_wfree(struct sk_buff *skb)
 	 *   to migrate this flow (skb->ooo_okay will be eventually set)
 	 */
 	if (refcount_read(&sk->sk_wmem_alloc) >= SKB_TRUESIZE(1) &&
-	    (this_cpu_ksoftirqd() == current || softirq_serving_workqueue()))
+	    softirq_serving_workqueue())
 		goto out;
 
 	for (oval = READ_ONCE(sk->sk_tsq_flags);; oval = nval) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail
  2018-01-19 15:46 ` [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail Frederic Weisbecker
@ 2018-01-19 16:16   ` David Miller
  2018-01-19 18:25     ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2018-01-19 16:16 UTC (permalink / raw)
  To: frederic
  Cc: linux-kernel, alexander.levin, peterz, mchehab, torvalds, hannes,
	paulmck, wanpeng.li, dima, tglx, akpm, pabeni, rrendec, mingo,
	sgruszka, riel, edumazet

From: Frederic Weisbecker <frederic@kernel.org>
Date: Fri, 19 Jan 2018 16:46:11 +0100

> For now, vectors that get re-enqueued trigger ksoftirqd but they are
> going to be handled by per-vector workqueues on subsequent patches.

Frederic, first of all, thanks for doing all of this work.

So this "get requeued" condition I think will trigger always for
networking tunnel decapsulation.

Each decap will (eventually) do a:

	__raise_softirq_irqoff(NET_RX_SOFTIRQ);

via ____napi_schedule() in net/core/dev.c

Example code path:

	ip_tunnel_rcv()
		gro_cells_receive()
			napi_schedule()
			__napi_schedule()
				____napi_schedule();
				__raise_softirq_irqoff(NET_RX_SOFTIRQ);

Anyways, just FYI...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail
  2018-01-19 16:16   ` David Miller
@ 2018-01-19 18:25     ` Linus Torvalds
  2018-01-19 18:47       ` David Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2018-01-19 18:25 UTC (permalink / raw)
  To: David Miller
  Cc: Frederic Weisbecker, Linux Kernel Mailing List, Sasha Levin,
	Peter Zijlstra, Mauro Carvalho Chehab, Hannes Frederic Sowa,
	Paul McKenney, Wanpeng Li, Dmitry Safonov, Thomas Gleixner,
	Andrew Morton, Paolo Abeni, Radu Rendec, Ingo Molnar,
	Stanislaw Gruszka, Rik van Riel, Eric Dumazet

On Fri, Jan 19, 2018 at 8:16 AM, David Miller <davem@davemloft.net> wrote:
>
> So this "get requeued" condition I think will trigger always for
> networking tunnel decapsulation.

Hmm. Interesting and a perhaps bit discouraging.

Will it always be just a _single_ level of indirection, or will double
tunnels (I assume some people do that, just because the universe is
out to get us) then result in this perhaps repeating several times?

Because it would be not very complicated to just have two bitmasks
(and still no per-softirq counting), and have that "switch to
threading only on the _second_ time this happens".

But let's have people test the behavior of this simpler model first?

                      Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail
  2018-01-19 18:25     ` Linus Torvalds
@ 2018-01-19 18:47       ` David Miller
  2018-01-21 16:30         ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2018-01-19 18:47 UTC (permalink / raw)
  To: torvalds
  Cc: frederic, linux-kernel, alexander.levin, peterz, mchehab, hannes,
	paulmck, wanpeng.li, dima, tglx, akpm, pabeni, rrendec, mingo,
	sgruszka, riel, edumazet

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 19 Jan 2018 10:25:03 -0800

> On Fri, Jan 19, 2018 at 8:16 AM, David Miller <davem@davemloft.net> wrote:
>>
>> So this "get requeued" condition I think will trigger always for
>> networking tunnel decapsulation.
> 
> Hmm. Interesting and a perhaps bit discouraging.
> 
> Will it always be just a _single_ level of indirection, or will double
> tunnels (I assume some people do that, just because the universe is
> out to get us) then result in this perhaps repeating several times?

Every level of tunnel encapsulation will trigger a new softirq.

So if you have an IP tunnel inside of an IP tunnel that will trigger
twice.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-01-19 15:46 ` [RFC PATCH 2/4] softirq: Per vector deferment to workqueue Frederic Weisbecker
@ 2018-01-20  8:41   ` Pavan Kondeti
  2018-01-21 16:11     ` Frederic Weisbecker
  2018-02-08 17:44   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 39+ messages in thread
From: Pavan Kondeti @ 2018-01-20  8:41 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Levin Alexander, Peter Zijlstra, Mauro Carvalho Chehab,
	Linus Torvalds, Hannes Frederic Sowa, Paul E . McKenney,
	Wanpeng Li, Dmitry Safonov, Thomas Gleixner, Andrew Morton,
	Paolo Abeni, Radu Rendec, Ingo Molnar, Stanislaw Gruszka,
	Rik van Riel, Eric Dumazet, David Miller

Hi Frederic,

On Fri, Jan 19, 2018 at 04:46:12PM +0100, Frederic Weisbecker wrote:
> Some softirq vectors can be more CPU hungry than others. Especially
> networking may sometimes deal with packet storm and need more CPU than
> IRQ tail can offer without inducing scheduler latencies. In this case
> the current code defers to ksoftirqd that behaves nicer. Now this nice
> behaviour can be bad for other IRQ vectors that usually need quick
> processing.
> 
> To solve this we only defer to threading the vectors that got
> re-enqueued on IRQ tail processing and leave the others inline on IRQ
> tail service. This is achieved using workqueues with
> per-CPU/per-vector worklets.
> 
> Note ksoftirqd is not yet removed as it is still needed for threaded IRQs
> mode.
> 

<snip>

> +static void do_softirq_workqueue(u32 pending)
> +{
> +	struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
> +	struct softirq_action *h = softirq_vec;
> +	int softirq_bit;
> +
> +	pending &= ~softirq->pending_work_mask;
> +
> +	while ((softirq_bit = ffs(pending))) {
> +		struct vector *vector;
> +		unsigned int vec_nr;
> +
> +		h += softirq_bit - 1;
> +		vec_nr = h - softirq_vec;
> +		softirq->pending_work_mask |= BIT(vec_nr);
> +		vector = &softirq->vector[vec_nr];
> +		schedule_work_on(smp_processor_id(), &vector->work);
> +		h++;
> +		pending >>= softirq_bit;
> +	}
> +}
> +

I have couple questions/comments.

(1) Since the work is queued on a bounded per-cpu worker, we may run
into a deadlock if a TASKLET is killed from another work running on
the same bounded per-cpu worker.

For example,

(1) Schedule a TASKLET on CPU#0 from IRQ.
(2) Another IRQ comes on the same CPU and we queue a work to kill
the TASKLET. 
(3) The TASKLET vector is deferred to workqueue.
(4) We run the TASKLET kill work and wait for the TASKLET to finish,
which won't happen.

We can fix this by queueing the TASKLET kill work on an unbounded
workqueue so that this runs in parallel with TASKLET vector work.

Just wanted to know if we have to be aware of this *condition*.

(2) Ksoftirqd thread gets parked when a CPU is hotplugged out. So
there is a gaurantee that the softirq handling never happens on
another CPU. Where as a bounded worker gets detached and the queued
work can run on another CPU. I guess, some special handling is
needed to handle hotplug.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-01-20  8:41   ` Pavan Kondeti
@ 2018-01-21 16:11     ` Frederic Weisbecker
  2018-01-21 17:50       ` Pavan Kondeti
  0 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2018-01-21 16:11 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: LKML, Levin Alexander, Peter Zijlstra, Mauro Carvalho Chehab,
	Linus Torvalds, Hannes Frederic Sowa, Paul E . McKenney,
	Wanpeng Li, Dmitry Safonov, Thomas Gleixner, Andrew Morton,
	Paolo Abeni, Radu Rendec, Ingo Molnar, Stanislaw Gruszka,
	Rik van Riel, Eric Dumazet, David Miller

On Sat, Jan 20, 2018 at 02:11:39PM +0530, Pavan Kondeti wrote:

Hi Pavan,


> I have couple questions/comments.
> 
> (1) Since the work is queued on a bounded per-cpu worker, we may run
> into a deadlock if a TASKLET is killed from another work running on
> the same bounded per-cpu worker.
> 
> For example,
> 
> (1) Schedule a TASKLET on CPU#0 from IRQ.
> (2) Another IRQ comes on the same CPU and we queue a work to kill
> the TASKLET. 
> (3) The TASKLET vector is deferred to workqueue.
> (4) We run the TASKLET kill work and wait for the TASKLET to finish,
> which won't happen.
> 
> We can fix this by queueing the TASKLET kill work on an unbounded
> workqueue so that this runs in parallel with TASKLET vector work.
> 
> Just wanted to know if we have to be aware of this *condition*.

But IIRC the workqueues have several workers per CPU so the tasklet to
be killed can run while the tasklet killer yields.

> 
> (2) Ksoftirqd thread gets parked when a CPU is hotplugged out. So
> there is a gaurantee that the softirq handling never happens on
> another CPU. Where as a bounded worker gets detached and the queued
> work can run on another CPU. I guess, some special handling is
> needed to handle hotplug.

Good catch. Funny, I worried a bit about CPU hotplug but I assumed
the pending CPU-bound worklets would be simply sync'ed before CPU gets down.

Breaking their CPU-bound properties doesn't look sane to me.

Anyway, I'll need to make a CPU hotplug hook.

Thanks for reporting that!

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail
  2018-01-19 18:47       ` David Miller
@ 2018-01-21 16:30         ` Frederic Weisbecker
  2018-01-21 16:57           ` David Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2018-01-21 16:30 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, linux-kernel, alexander.levin, peterz, mchehab, hannes,
	paulmck, wanpeng.li, dima, tglx, akpm, pabeni, rrendec, mingo,
	sgruszka, riel, edumazet

On Fri, Jan 19, 2018 at 01:47:27PM -0500, David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Fri, 19 Jan 2018 10:25:03 -0800
> 
> > On Fri, Jan 19, 2018 at 8:16 AM, David Miller <davem@davemloft.net> wrote:
> >>
> >> So this "get requeued" condition I think will trigger always for
> >> networking tunnel decapsulation.
> > 
> > Hmm. Interesting and a perhaps bit discouraging.
> > 
> > Will it always be just a _single_ level of indirection, or will double
> > tunnels (I assume some people do that, just because the universe is
> > out to get us) then result in this perhaps repeating several times?
> 
> Every level of tunnel encapsulation will trigger a new softirq.
> 
> So if you have an IP tunnel inside of an IP tunnel that will trigger
> twice.

So we may likely need to come back to a call counter based limit :-s

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail
  2018-01-21 16:30         ` Frederic Weisbecker
@ 2018-01-21 16:57           ` David Miller
  0 siblings, 0 replies; 39+ messages in thread
From: David Miller @ 2018-01-21 16:57 UTC (permalink / raw)
  To: frederic
  Cc: torvalds, linux-kernel, alexander.levin, peterz, mchehab, hannes,
	paulmck, wanpeng.li, dima, tglx, akpm, pabeni, rrendec, mingo,
	sgruszka, riel, edumazet

From: Frederic Weisbecker <frederic@kernel.org>
Date: Sun, 21 Jan 2018 17:30:09 +0100

> On Fri, Jan 19, 2018 at 01:47:27PM -0500, David Miller wrote:
>> From: Linus Torvalds <torvalds@linux-foundation.org>
>> Date: Fri, 19 Jan 2018 10:25:03 -0800
>> 
>> > On Fri, Jan 19, 2018 at 8:16 AM, David Miller <davem@davemloft.net> wrote:
>> >>
>> >> So this "get requeued" condition I think will trigger always for
>> >> networking tunnel decapsulation.
>> > 
>> > Hmm. Interesting and a perhaps bit discouraging.
>> > 
>> > Will it always be just a _single_ level of indirection, or will double
>> > tunnels (I assume some people do that, just because the universe is
>> > out to get us) then result in this perhaps repeating several times?
>> 
>> Every level of tunnel encapsulation will trigger a new softirq.
>> 
>> So if you have an IP tunnel inside of an IP tunnel that will trigger
>> twice.
> 
> So we may likely need to come back to a call counter based limit :-s

I'm not so sure exactly to what extent we should try to handle that,
and if so exactly how.

The only reason we do this is to control stack usage.  The re-softirq
on tunnel decapsulation is functioning as a continuation of sorts.
If we are already running in the net_rx_action() softirq it would
be so much nicer and efficient to just make sure net_rx_action()
runs the rest of the packet processing.

Right now that doesn't happen because net_rx_action() runs only one
round of NAPI polling.  It doesn't re-snapshot the list and try
again before returning.

net_rx_action() has it's own logic like do_softirq() does for timing
out in the middle of it's work which may or may not have some further
influence upon fairness to other softirqs.

Basically, it runs a snapshot the NAPI poll list for this CPU until
either usecs_to_jiffies(netdev_budget_usecs) jiffies have elapsed or
the list snapshot has been fully processed.

The default netdev_budget_usecs is 2000, which is my math isn't broken
is 2 jiffies when HZ=1000.  I know why we use 2000 instead of 1000,
it's to handle the case where we are invoked very close to the end
of a jiffy.  That situation does happen often enough in practice to
cause performance problems.

It would seem that all of these issues are why the tendency is to deal
with measuring cost using time rather than a simpler heuristic such as
whether softirqs were retriggered during a softirq run.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-01-21 16:11     ` Frederic Weisbecker
@ 2018-01-21 17:50       ` Pavan Kondeti
  2018-01-21 20:48         ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Pavan Kondeti @ 2018-01-21 17:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Levin Alexander, Peter Zijlstra, Mauro Carvalho Chehab,
	Linus Torvalds, Hannes Frederic Sowa, Paul E . McKenney,
	Wanpeng Li, Dmitry Safonov, Thomas Gleixner, Andrew Morton,
	Paolo Abeni, Radu Rendec, Ingo Molnar, Stanislaw Gruszka,
	Rik van Riel, Eric Dumazet, David Miller

On Sun, Jan 21, 2018 at 05:11:17PM +0100, Frederic Weisbecker wrote:
> On Sat, Jan 20, 2018 at 02:11:39PM +0530, Pavan Kondeti wrote:
> 
> Hi Pavan,
> 
> 
> > I have couple questions/comments.
> > 
> > (1) Since the work is queued on a bounded per-cpu worker, we may run
> > into a deadlock if a TASKLET is killed from another work running on
> > the same bounded per-cpu worker.
> > 
> > For example,
> > 
> > (1) Schedule a TASKLET on CPU#0 from IRQ.
> > (2) Another IRQ comes on the same CPU and we queue a work to kill
> > the TASKLET. 
> > (3) The TASKLET vector is deferred to workqueue.
> > (4) We run the TASKLET kill work and wait for the TASKLET to finish,
> > which won't happen.
> > 
> > We can fix this by queueing the TASKLET kill work on an unbounded
> > workqueue so that this runs in parallel with TASKLET vector work.
> > 
> > Just wanted to know if we have to be aware of this *condition*.
> 
> But IIRC the workqueues have several workers per CPU so the tasklet to
> be killed can run while the tasklet killer yields.
> 
AFAIK, the work items queued via schedule_work_on() goes to the system_wq
which is bounded to a CPU with concurrency restrictions. If any work
item (in this case tasklet kill) is getting executed on this bounded
worker, the next items have to wait. The forward progress happens only
when the current work is finished or enters sleep.

This also makes me wonder what happens if a CPU hogging work gets executed
on the system_wq while softirq work is pending? The softirq work gets
starved which won't happen now with ksoftirqd design. Ideally the CPU
hogging work should not be queued on the system_wq and instead should
be queued on CPU intenstive workqueue (WQ_CPU_INTENSIVE) to exempt
from concurrency management. May be we need some special workqueue
which is bounded but not subjected to concurrency management.

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-01-21 17:50       ` Pavan Kondeti
@ 2018-01-21 20:48         ` Frederic Weisbecker
  0 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2018-01-21 20:48 UTC (permalink / raw)
  To: Pavan Kondeti
  Cc: LKML, Levin Alexander, Peter Zijlstra, Mauro Carvalho Chehab,
	Linus Torvalds, Hannes Frederic Sowa, Paul E . McKenney,
	Wanpeng Li, Dmitry Safonov, Thomas Gleixner, Andrew Morton,
	Paolo Abeni, Radu Rendec, Ingo Molnar, Stanislaw Gruszka,
	Rik van Riel, Eric Dumazet, David Miller

On Sun, Jan 21, 2018 at 11:20:40PM +0530, Pavan Kondeti wrote:
> On Sun, Jan 21, 2018 at 05:11:17PM +0100, Frederic Weisbecker wrote:
> > On Sat, Jan 20, 2018 at 02:11:39PM +0530, Pavan Kondeti wrote:
> > 
> > Hi Pavan,
> > 
> > 
> > > I have couple questions/comments.
> > > 
> > > (1) Since the work is queued on a bounded per-cpu worker, we may run
> > > into a deadlock if a TASKLET is killed from another work running on
> > > the same bounded per-cpu worker.
> > > 
> > > For example,
> > > 
> > > (1) Schedule a TASKLET on CPU#0 from IRQ.
> > > (2) Another IRQ comes on the same CPU and we queue a work to kill
> > > the TASKLET. 
> > > (3) The TASKLET vector is deferred to workqueue.
> > > (4) We run the TASKLET kill work and wait for the TASKLET to finish,
> > > which won't happen.
> > > 
> > > We can fix this by queueing the TASKLET kill work on an unbounded
> > > workqueue so that this runs in parallel with TASKLET vector work.
> > > 
> > > Just wanted to know if we have to be aware of this *condition*.
> > 
> > But IIRC the workqueues have several workers per CPU so the tasklet to
> > be killed can run while the tasklet killer yields.
> > 
> AFAIK, the work items queued via schedule_work_on() goes to the system_wq
> which is bounded to a CPU with concurrency restrictions. If any work
> item (in this case tasklet kill) is getting executed on this bounded
> worker, the next items have to wait. The forward progress happens only
> when the current work is finished or enters sleep.

Workqueues have multiple workers to handle the pending works, unless they
are __WQ_ORDERED.

The worker manager in a pool is supposed to create workers on-demand
when necessary to make sure that no work is blocking the others.

Thanks.

> 
> This also makes me wonder what happens if a CPU hogging work gets executed
> on the system_wq while softirq work is pending? The softirq work gets
> starved which won't happen now with ksoftirqd design. Ideally the CPU
> hogging work should not be queued on the system_wq and instead should
> be queued on CPU intenstive workqueue (WQ_CPU_INTENSIVE) to exempt
> from concurrency management. May be we need some special workqueue
> which is bounded but not subjected to concurrency management.
> 
> -- 
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-19 15:46 [RFC PATCH 0/4] softirq: Per vector threading v3 Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2018-01-19 15:46 ` [RFC PATCH 4/4] softirq: Replace ksoftirqd with workqueues entirely Frederic Weisbecker
@ 2018-01-22 19:58 ` Mauro Carvalho Chehab
  2018-01-23 10:13 ` Paolo Abeni
  2018-02-07 14:18 ` Mauro Carvalho Chehab
  6 siblings, 0 replies; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2018-01-22 19:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Levin Alexander, Peter Zijlstra, Linus Torvalds,
	Hannes Frederic Sowa, Paul E . McKenney, Wanpeng Li,
	Dmitry Safonov, Thomas Gleixner, Andrew Morton, Paolo Abeni,
	Radu Rendec, Ingo Molnar, Stanislaw Gruszka, Rik van Riel,
	Eric Dumazet, David Miller

Em Fri, 19 Jan 2018 16:46:10 +0100
Frederic Weisbecker <frederic@kernel.org> escreveu:

> As per Linus suggestion, this take doesn't limit the number of occurences
> per jiffy anymore but instead defers a vector to workqueues as soon as
> it gets re-enqueued on IRQ tail.
> 
> No tunable here, so testing should be easier.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> 	softirq/thread-v3

Frederic,

FYI, I'm out of town for two weeks. I'll try to setup a test environment
while traveling, but I'm unsure if it will work, as I won't have access to
my DVB-S2 provider from here.

> 
> HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4
> 
> Thanks,
> 	Frederic
> ---
> 
> Frederic Weisbecker (4):
>       softirq: Limit vector to a single iteration on IRQ tail
>       softirq: Per vector deferment to workqueue
>       softirq: Defer to workqueue when rescheduling is needed
>       softirq: Replace ksoftirqd with workqueues entirely
> 
> 
>  Documentation/RCU/stallwarn.txt |   4 +-
>  include/linux/interrupt.h       |   7 +-
>  kernel/sched/cputime.c          |  12 +--
>  kernel/sched/sched.h            |   4 +-
>  kernel/softirq.c                | 223 +++++++++++++++++++++++-----------------
>  net/ipv4/tcp_output.c           |   5 +-
>  6 files changed, 144 insertions(+), 111 deletions(-)




Cheers,
Mauro

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-19 15:46 [RFC PATCH 0/4] softirq: Per vector threading v3 Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2018-01-22 19:58 ` [RFC PATCH 0/4] softirq: Per vector threading v3 Mauro Carvalho Chehab
@ 2018-01-23 10:13 ` Paolo Abeni
  2018-01-23 12:32   ` Dmitry Safonov
  2018-01-23 16:22   ` David Miller
  2018-02-07 14:18 ` Mauro Carvalho Chehab
  6 siblings, 2 replies; 39+ messages in thread
From: Paolo Abeni @ 2018-01-23 10:13 UTC (permalink / raw)
  To: Frederic Weisbecker, LKML
  Cc: Levin Alexander, Peter Zijlstra, Mauro Carvalho Chehab,
	Linus Torvalds, Hannes Frederic Sowa, Paul E . McKenney,
	Wanpeng Li, Dmitry Safonov, Thomas Gleixner, Andrew Morton,
	Radu Rendec, Ingo Molnar, Stanislaw Gruszka, Rik van Riel,
	Eric Dumazet, David Miller

Hi,

On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
> As per Linus suggestion, this take doesn't limit the number of occurences
> per jiffy anymore but instead defers a vector to workqueues as soon as
> it gets re-enqueued on IRQ tail.
> 
> No tunable here, so testing should be easier.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> 	softirq/thread-v3
> 
> HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4

I tested this series in the UDP flood scenario, binding the user space
process receiving the packets on the same CPU processing the related
IRQ, and the tput sinks nearly to 0, like before Eric's patch.

The perf tool says that almost all the softirq processing is done
inside the workqueue, but the user space process is scheduled very
rarely, while before this series, in this scenario, ksoftirqd and the
user space process got a fair share of the CPU time.

Cheers,

Paolo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-23 10:13 ` Paolo Abeni
@ 2018-01-23 12:32   ` Dmitry Safonov
  2018-01-24  2:12     ` Frederic Weisbecker
  2018-01-23 16:22   ` David Miller
  1 sibling, 1 reply; 39+ messages in thread
From: Dmitry Safonov @ 2018-01-23 12:32 UTC (permalink / raw)
  To: Paolo Abeni, Frederic Weisbecker, LKML
  Cc: Levin Alexander, Peter Zijlstra, Mauro Carvalho Chehab,
	Linus Torvalds, Hannes Frederic Sowa, Paul E . McKenney,
	Wanpeng Li, Thomas Gleixner, Andrew Morton, Radu Rendec,
	Ingo Molnar, Stanislaw Gruszka, Rik van Riel, Eric Dumazet,
	David Miller

On Tue, 2018-01-23 at 11:13 +0100, Paolo Abeni wrote:
> Hi,
> 
> On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
> > As per Linus suggestion, this take doesn't limit the number of
> > occurences
> > per jiffy anymore but instead defers a vector to workqueues as soon
> > as
> > it gets re-enqueued on IRQ tail.
> > 
> > No tunable here, so testing should be easier.
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-
> > dynticks.git
> > 	softirq/thread-v3
> > 
> > HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4
> 
> I tested this series in the UDP flood scenario, binding the user
> space
> process receiving the packets on the same CPU processing the related
> IRQ, and the tput sinks nearly to 0, like before Eric's patch.
> 
> The perf tool says that almost all the softirq processing is done
> inside the workqueue, but the user space process is scheduled very
> rarely, while before this series, in this scenario, ksoftirqd and the
> user space process got a fair share of the CPU time.

It get a fair share of the CPU time only on some lucky platforms (maybe
on the most). On other not-so-lucky platforms it was also unfair - as
I've previously described by the reason of slow raising softirqs :-/

-- 
Thanks,
             Dmitry

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-23 10:13 ` Paolo Abeni
  2018-01-23 12:32   ` Dmitry Safonov
@ 2018-01-23 16:22   ` David Miller
  2018-01-23 16:57     ` Paolo Abeni
  1 sibling, 1 reply; 39+ messages in thread
From: David Miller @ 2018-01-23 16:22 UTC (permalink / raw)
  To: pabeni
  Cc: frederic, linux-kernel, alexander.levin, peterz, mchehab,
	torvalds, hannes, paulmck, wanpeng.li, dima, tglx, akpm, rrendec,
	mingo, sgruszka, riel, edumazet

From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 23 Jan 2018 11:13:52 +0100

> Hi,
> 
> On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
>> As per Linus suggestion, this take doesn't limit the number of occurences
>> per jiffy anymore but instead defers a vector to workqueues as soon as
>> it gets re-enqueued on IRQ tail.
>> 
>> No tunable here, so testing should be easier.
>> 
>> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
>> 	softirq/thread-v3
>> 
>> HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4
> 
> I tested this series in the UDP flood scenario, binding the user space
> process receiving the packets on the same CPU processing the related
> IRQ, and the tput sinks nearly to 0, like before Eric's patch.
> 
> The perf tool says that almost all the softirq processing is done
> inside the workqueue, but the user space process is scheduled very
> rarely, while before this series, in this scenario, ksoftirqd and the
> user space process got a fair share of the CPU time.

Do workqueue threads get a higher scheduling priority than user
processes?  If so, that's going to work against the entire point of
deferring softirq work into a thread.

Or is it that the workqueue execution is simply not yielding for some
reason?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-23 16:22   ` David Miller
@ 2018-01-23 16:57     ` Paolo Abeni
  2018-01-23 17:42       ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Paolo Abeni @ 2018-01-23 16:57 UTC (permalink / raw)
  To: David Miller
  Cc: frederic, linux-kernel, alexander.levin, peterz, mchehab,
	torvalds, hannes, paulmck, wanpeng.li, dima, tglx, akpm, rrendec,
	mingo, sgruszka, riel, edumazet

On Tue, 2018-01-23 at 11:22 -0500, David Miller wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 23 Jan 2018 11:13:52 +0100
> 
> > Hi,
> > 
> > On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
> >> As per Linus suggestion, this take doesn't limit the number of occurences
> >> per jiffy anymore but instead defers a vector to workqueues as soon as
> >> it gets re-enqueued on IRQ tail.
> >> 
> >> No tunable here, so testing should be easier.
> >> 
> >> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> >>      softirq/thread-v3
> >> 
> >> HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4
> > 
> > I tested this series in the UDP flood scenario, binding the user space
> > process receiving the packets on the same CPU processing the related
> > IRQ, and the tput sinks nearly to 0, like before Eric's patch.
> > 
> > The perf tool says that almost all the softirq processing is done
> > inside the workqueue, but the user space process is scheduled very
> > rarely, while before this series, in this scenario, ksoftirqd and the
> > user space process got a fair share of the CPU time.
> 
> Do workqueue threads get a higher scheduling priority than user
> processes?

As far as I can see, no: the workqueue thread has the same priority and
nice level than the user space process.

> Or is it that the workqueue execution is simply not yielding for some
> reason?

It's like that.

I spent little time on it, so I haven't many data point. I'll try to
investigate the scenario later this week.

Cheers,

Paolo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-23 16:57     ` Paolo Abeni
@ 2018-01-23 17:42       ` Linus Torvalds
  2018-01-23 18:01         ` Mike Galbraith
                           ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Linus Torvalds @ 2018-01-23 17:42 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: David Miller, Frederic Weisbecker, Linux Kernel Mailing List,
	Sasha Levin, Peter Zijlstra, Mauro Carvalho Chehab,
	Hannes Frederic Sowa, Paul McKenney, Wanpeng Li, Dmitry Safonov,
	Thomas Gleixner, Andrew Morton, Radu Rendec, Ingo Molnar,
	Stanislaw Gruszka, Rik van Riel, Eric Dumazet

On Tue, Jan 23, 2018 at 8:57 AM, Paolo Abeni <pabeni@redhat.com> wrote:
>
>> Or is it that the workqueue execution is simply not yielding for some
>> reason?
>
> It's like that.
>
> I spent little time on it, so I haven't many data point. I'll try to
> investigate the scenario later this week.

Hmm. workqueues seem to use cond_resched_rcu_qs(), which does a
cond_resched() (and a RCU quiescent note).

But I wonder if the test triggers the "lets run lots of workqueue
threads", and then the single-threaded user space just gets blown out
of the water by many kernel threads. Each thread gets its own "fair"
amount of CPU, but..

                   Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-23 17:42       ` Linus Torvalds
@ 2018-01-23 18:01         ` Mike Galbraith
  2018-01-23 18:24         ` David Miller
  2018-01-24 14:54         ` Paolo Abeni
  2 siblings, 0 replies; 39+ messages in thread
From: Mike Galbraith @ 2018-01-23 18:01 UTC (permalink / raw)
  To: Linus Torvalds, Paolo Abeni
  Cc: David Miller, Frederic Weisbecker, Linux Kernel Mailing List,
	Sasha Levin, Peter Zijlstra, Mauro Carvalho Chehab,
	Hannes Frederic Sowa, Paul McKenney, Wanpeng Li, Dmitry Safonov,
	Thomas Gleixner, Andrew Morton, Radu Rendec, Ingo Molnar,
	Stanislaw Gruszka, Rik van Riel, Eric Dumazet

On Tue, 2018-01-23 at 09:42 -0800, Linus Torvalds wrote:
> On Tue, Jan 23, 2018 at 8:57 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> >
> >> Or is it that the workqueue execution is simply not yielding for some
> >> reason?
> >
> > It's like that.
> >
> > I spent little time on it, so I haven't many data point. I'll try to
> > investigate the scenario later this week.
> 
> Hmm. workqueues seem to use cond_resched_rcu_qs(), which does a
> cond_resched() (and a RCU quiescent note).
> 
> But I wonder if the test triggers the "lets run lots of workqueue
> threads", and then the single-threaded user space just gets blown out
> of the water by many kernel threads. Each thread gets its own "fair"
> amount of CPU, but..

If folks aren't careful with workqueues, they can be a generic
starvation problem.  Like the below in the here and now.

fs/nfs: Add a resched point to nfs_commit_release_pages()

During heavy NFS write, kworkers can do very large amounts of work
without scheduling (82ms traced).  Add a resched point.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Suggested-by: Trond Myklebust <trondmy@primarydata.com>
---
 fs/nfs/write.c |    1 +
 1 file changed, 1 insertion(+)

--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1837,6 +1837,7 @@ static void nfs_commit_release_pages(str
 		set_bit(NFS_CONTEXT_RESEND_WRITES, &req->wb_context->flags);
 	next:
 		nfs_unlock_and_release_request(req);
+		cond_resched();
 	}
 	nfss = NFS_SERVER(data->inode);
 	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-23 17:42       ` Linus Torvalds
  2018-01-23 18:01         ` Mike Galbraith
@ 2018-01-23 18:24         ` David Miller
  2018-01-24  1:57           ` Frederic Weisbecker
  2018-01-24 14:54         ` Paolo Abeni
  2 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2018-01-23 18:24 UTC (permalink / raw)
  To: torvalds
  Cc: pabeni, frederic, linux-kernel, alexander.levin, peterz, mchehab,
	hannes, paulmck, wanpeng.li, dima, tglx, akpm, rrendec, mingo,
	sgruszka, riel, edumazet

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 23 Jan 2018 09:42:32 -0800

> But I wonder if the test triggers the "lets run lots of workqueue
> threads", and then the single-threaded user space just gets blown out
> of the water by many kernel threads. Each thread gets its own "fair"
> amount of CPU, but..

If a single cpu's softirq deferral can end up running on multiple
workqueue threads, indeed that's a serious problem.

So if we're in a workqueue and it does a:

	schedule_work_on(this_cpu, currently_executing_work);

it'll potentially make a new thread?

That's exactly the code path that will get exercised during a UDP
flood the way that vector_work_func() is implemented.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-23 18:24         ` David Miller
@ 2018-01-24  1:57           ` Frederic Weisbecker
  2018-01-24  2:01             ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2018-01-24  1:57 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, pabeni, linux-kernel, alexander.levin, peterz, mchehab,
	hannes, paulmck, wanpeng.li, dima, tglx, akpm, rrendec, mingo,
	sgruszka, riel, edumazet

On Tue, Jan 23, 2018 at 01:24:24PM -0500, David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Tue, 23 Jan 2018 09:42:32 -0800
> 
> > But I wonder if the test triggers the "lets run lots of workqueue
> > threads", and then the single-threaded user space just gets blown out
> > of the water by many kernel threads. Each thread gets its own "fair"
> > amount of CPU, but..
> 
> If a single cpu's softirq deferral can end up running on multiple
> workqueue threads, indeed that's a serious problem.
> 
> So if we're in a workqueue and it does a:
> 
> 	schedule_work_on(this_cpu, currently_executing_work);
> 
> it'll potentially make a new thread?
> 
> That's exactly the code path that will get exercised during a UDP
> flood the way that vector_work_func() is implemented.

It shouldn't create a new thread unless all other workers in the CPU
are voluntarily sleeping while executing a work. Also since softirqs
can't sleep, we shouldn't even have two vectors running concurrently
on workqueues.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-24  1:57           ` Frederic Weisbecker
@ 2018-01-24  2:01             ` Frederic Weisbecker
  0 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2018-01-24  2:01 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, pabeni, linux-kernel, alexander.levin, peterz, mchehab,
	hannes, paulmck, wanpeng.li, dima, tglx, akpm, rrendec, mingo,
	sgruszka, riel, edumazet

2018-01-24 2:57 UTC+01:00, Frederic Weisbecker <frederic@kernel.org>:
> On Tue, Jan 23, 2018 at 01:24:24PM -0500, David Miller wrote:
>> From: Linus Torvalds <torvalds@linux-foundation.org>
>> Date: Tue, 23 Jan 2018 09:42:32 -0800
>>
>> > But I wonder if the test triggers the "lets run lots of workqueue
>> > threads", and then the single-threaded user space just gets blown out
>> > of the water by many kernel threads. Each thread gets its own "fair"
>> > amount of CPU, but..
>>
>> If a single cpu's softirq deferral can end up running on multiple
>> workqueue threads, indeed that's a serious problem.
>>
>> So if we're in a workqueue and it does a:
>>
>> 	schedule_work_on(this_cpu, currently_executing_work);
>>
>> it'll potentially make a new thread?
>>
>> That's exactly the code path that will get exercised during a UDP
>> flood the way that vector_work_func() is implemented.
>
> It shouldn't create a new thread unless all other workers in the CPU
> are voluntarily sleeping while executing a work. Also since softirqs
> can't sleep, we shouldn't even have two vectors running concurrently
> on workqueues.
>

...unless I misunderstood workqueue internals or missed something, in
which case we may try ordered workqueue.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-23 12:32   ` Dmitry Safonov
@ 2018-01-24  2:12     ` Frederic Weisbecker
  0 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2018-01-24  2:12 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: Paolo Abeni, LKML, Levin Alexander, Peter Zijlstra,
	Mauro Carvalho Chehab, Linus Torvalds, Hannes Frederic Sowa,
	Paul E . McKenney, Wanpeng Li, Thomas Gleixner, Andrew Morton,
	Radu Rendec, Ingo Molnar, Stanislaw Gruszka, Rik van Riel,
	Eric Dumazet, David Miller

On Tue, Jan 23, 2018 at 12:32:13PM +0000, Dmitry Safonov wrote:
> On Tue, 2018-01-23 at 11:13 +0100, Paolo Abeni wrote:
> > Hi,
> > 
> > On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
> > > As per Linus suggestion, this take doesn't limit the number of
> > > occurences
> > > per jiffy anymore but instead defers a vector to workqueues as soon
> > > as
> > > it gets re-enqueued on IRQ tail.
> > > 
> > > No tunable here, so testing should be easier.
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-
> > > dynticks.git
> > > 	softirq/thread-v3
> > > 
> > > HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4
> > 
> > I tested this series in the UDP flood scenario, binding the user
> > space
> > process receiving the packets on the same CPU processing the related
> > IRQ, and the tput sinks nearly to 0, like before Eric's patch.
> > 
> > The perf tool says that almost all the softirq processing is done
> > inside the workqueue, but the user space process is scheduled very
> > rarely, while before this series, in this scenario, ksoftirqd and the
> > user space process got a fair share of the CPU time.
> 
> It get a fair share of the CPU time only on some lucky platforms (maybe
> on the most). On other not-so-lucky platforms it was also unfair - as
> I've previously described by the reason of slow raising softirqs :-/

Indeed your case looks pretty special, as IRQs don't get the opportunity to
interrupt softirqs but they fire right after.

I'm wondering if you shouldn't try threaded IRQs. Having softirqs always
running in a thread instead of hardirq tail might help (through threadirqs=
kernel param).

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-23 17:42       ` Linus Torvalds
  2018-01-23 18:01         ` Mike Galbraith
  2018-01-23 18:24         ` David Miller
@ 2018-01-24 14:54         ` Paolo Abeni
  2018-01-24 15:05           ` David Miller
  2 siblings, 1 reply; 39+ messages in thread
From: Paolo Abeni @ 2018-01-24 14:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, Frederic Weisbecker, Linux Kernel Mailing List,
	Sasha Levin, Peter Zijlstra, Mauro Carvalho Chehab,
	Hannes Frederic Sowa, Paul McKenney, Wanpeng Li, Dmitry Safonov,
	Thomas Gleixner, Andrew Morton, Radu Rendec, Ingo Molnar,
	Stanislaw Gruszka, Rik van Riel, Eric Dumazet, Niklas Cassel

On Tue, 2018-01-23 at 09:42 -0800, Linus Torvalds wrote:
> On Tue, Jan 23, 2018 at 8:57 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > > Or is it that the workqueue execution is simply not yielding for some
> > > reason?
> > 
> > It's like that.
> > 
> > I spent little time on it, so I haven't many data point. I'll try to
> > investigate the scenario later this week.
> 
> Hmm. workqueues seem to use cond_resched_rcu_qs(), which does a
> cond_resched() (and a RCU quiescent note).
> 
> But I wonder if the test triggers the "lets run lots of workqueue
> threads", and then the single-threaded user space just gets blown out
> of the water by many kernel threads. Each thread gets its own "fair"
> amount of CPU, but..

Niklas suggested a possible relation with CONFIG_IRQ_TIME_ACCOUNTING=y
and indeed he was right.

The patched kernel under test had CONFIG_IRQ_TIME_ACCOUNTING set, and
very little CPU time was accounted to the kworker:

[2125 is the relevant kworker's pid]
grep sum_exec_runtime /proc/2125/sched; sleep 10; grep sum_exec_runtime /proc/2125/sched
se.sum_exec_runtime                          :         13408.239286
se.sum_exec_runtime                          :         13456.907197

despite such process was processing a lot of packets and basically
burning a CPU.

Switching CONFIG_IRQ_TIME_ACCOUNTING off I see the expected behaviour:
top reports that the user space process and kworker share the CPU
almost fairly and the user space process is able to receive a
reasonable amount of packets.

Paolo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-24 14:54         ` Paolo Abeni
@ 2018-01-24 15:05           ` David Miller
  2018-01-24 16:11             ` Paolo Abeni
  0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2018-01-24 15:05 UTC (permalink / raw)
  To: pabeni
  Cc: torvalds, frederic, linux-kernel, alexander.levin, peterz,
	mchehab, hannes, paulmck, wanpeng.li, dima, tglx, akpm, rrendec,
	mingo, sgruszka, riel, edumazet, nks.gnu

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 24 Jan 2018 15:54:05 +0100

> Niklas suggested a possible relation with CONFIG_IRQ_TIME_ACCOUNTING=y
> and indeed he was right.
> 
> The patched kernel under test had CONFIG_IRQ_TIME_ACCOUNTING set, and
> very little CPU time was accounted to the kworker:
> 
> [2125 is the relevant kworker's pid]
> grep sum_exec_runtime /proc/2125/sched; sleep 10; grep sum_exec_runtime /proc/2125/sched
> se.sum_exec_runtime                          :         13408.239286
> se.sum_exec_runtime                          :         13456.907197
> 
> despite such process was processing a lot of packets and basically
> burning a CPU.

So IRQ_TIME_ACCOUNTING makes the scheduler think that the worker
threads are using nearly no task time at all.

The existing ksoftirqd code should hit the same problem, right?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-24 15:05           ` David Miller
@ 2018-01-24 16:11             ` Paolo Abeni
  0 siblings, 0 replies; 39+ messages in thread
From: Paolo Abeni @ 2018-01-24 16:11 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, frederic, linux-kernel, alexander.levin, peterz,
	mchehab, hannes, paulmck, wanpeng.li, dima, tglx, akpm, rrendec,
	mingo, sgruszka, riel, edumazet, nks.gnu

On Wed, 2018-01-24 at 10:05 -0500, David Miller wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Wed, 24 Jan 2018 15:54:05 +0100
> 
> > Niklas suggested a possible relation with CONFIG_IRQ_TIME_ACCOUNTING=y
> > and indeed he was right.
> > 
> > The patched kernel under test had CONFIG_IRQ_TIME_ACCOUNTING set, and
> > very little CPU time was accounted to the kworker:
> > 
> > [2125 is the relevant kworker's pid]
> > grep sum_exec_runtime /proc/2125/sched; sleep 10; grep sum_exec_runtime /proc/2125/sched
> > se.sum_exec_runtime                          :         13408.239286
> > se.sum_exec_runtime                          :         13456.907197
> > 
> > despite such process was processing a lot of packets and basically
> > burning a CPU.
> 
> So IRQ_TIME_ACCOUNTING makes the scheduler think that the worker
> threads are using nearly no task time at all.

Yes, this is the behavior I observe in the test. But a quick look at
the scheduler code - I'm not very familiar with it - let me think this
is not the intended/expected behaviour for the ksoftirqd (and after
this series, for the kworker serving the softirq).

> The existing ksoftirqd code should hit the same problem, right?

I just tried the vanilla kernel with CONFIG_IRQ_TIME_ACCOUNTING=y, and
in the same test scenario, I observe the 'good' behavior: the user
space process and ksoftirqd share almost fairly the relevant CPU, and
the CPU time spend in softirq processing is accounted to ksoftirqd.

Cheers,

Paolo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-01-19 15:46 [RFC PATCH 0/4] softirq: Per vector threading v3 Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2018-01-23 10:13 ` Paolo Abeni
@ 2018-02-07 14:18 ` Mauro Carvalho Chehab
  2018-03-01 15:21   ` Frederic Weisbecker
  6 siblings, 1 reply; 39+ messages in thread
From: Mauro Carvalho Chehab @ 2018-02-07 14:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Levin Alexander, Peter Zijlstra, Linus Torvalds,
	Hannes Frederic Sowa, Paul E . McKenney, Wanpeng Li,
	Dmitry Safonov, Thomas Gleixner, Andrew Morton, Paolo Abeni,
	Radu Rendec, Ingo Molnar, Stanislaw Gruszka, Rik van Riel,
	Eric Dumazet, David Miller

Em Fri, 19 Jan 2018 16:46:10 +0100
Frederic Weisbecker <frederic@kernel.org> escreveu:

> As per Linus suggestion, this take doesn't limit the number of occurences
> per jiffy anymore but instead defers a vector to workqueues as soon as
> it gets re-enqueued on IRQ tail.
> 
> No tunable here, so testing should be easier.

Sorry for taking so long to test it. Just returned this week from a 2
weeks travel, and had first to submit media patches for the merge window.

Didn't work. With this patch series, it still lose frames with
usb bulk transfers, while recording DVB programs with tvheadend.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-01-19 15:46 ` [RFC PATCH 2/4] softirq: Per vector deferment to workqueue Frederic Weisbecker
  2018-01-20  8:41   ` Pavan Kondeti
@ 2018-02-08 17:44   ` Sebastian Andrzej Siewior
  2018-02-08 18:45     ` David Miller
  2018-02-15 16:13     ` Frederic Weisbecker
  1 sibling, 2 replies; 39+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-02-08 17:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Levin Alexander, Peter Zijlstra, Mauro Carvalho Chehab,
	Linus Torvalds, Hannes Frederic Sowa, Paul E . McKenney,
	Wanpeng Li, Dmitry Safonov, Thomas Gleixner, Andrew Morton,
	Paolo Abeni, Radu Rendec, Ingo Molnar, Stanislaw Gruszka,
	Rik van Riel, Eric Dumazet, David Miller

On 2018-01-19 16:46:12 [+0100], Frederic Weisbecker wrote:
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index c8c6841..becb1d9 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -62,6 +62,19 @@ const char * const softirq_to_name[NR_SOFTIRQS] = {
> +static void vector_work_func(struct work_struct *work)
> +{
> +	struct vector *vector = container_of(work, struct vector, work);
> +	struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
> +	int vec_nr = vector->nr;
> +	int vec_bit = BIT(vec_nr);
> +	u32 pending;
> +
> +	local_irq_disable();
> +	pending = local_softirq_pending();
> +	account_irq_enter_time(current);
> +	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
> +	lockdep_softirq_enter();
> +	set_softirq_pending(pending & ~vec_bit);
> +	local_irq_enable();
> +
> +	if (pending & vec_bit) {
> +		struct softirq_action *sa = &softirq_vec[vec_nr];
> +
> +		kstat_incr_softirqs_this_cpu(vec_nr);
> +		softirq->work_running = 1;
> +		trace_softirq_entry(vec_nr);
> +		sa->action(sa);

You invoke the softirq handler while BH is disabled (not wrong, I just
state the obvious). That means, the scheduler can't preempt/interrupt
the workqueue/BH-handler while it is invoked so it has to wait until it
completes its doing.
In do_softirq_workqueue() you schedule multiple workqueue items (one for
each softirq vector) which is unnecessary because they can't preempt one
another and should be invoked the order they were enqueued. So it would
be enough to enqueue one item because it is serialized after all. So one
work_struct per CPU with a cond_resched_rcu_qs() while switching from one
vector to another should accomplish that what you have now here (not
sure if that cond_resched after each vector is needed). But…

> +		trace_softirq_exit(vec_nr);
> +		softirq->work_running = 0;
> +	}
> +
> +	local_irq_disable();
> +
> +	pending = local_softirq_pending();
> +	if (pending & vec_bit)
> +		schedule_work_on(smp_processor_id(), &vector->work);

… on a system that is using system_wq a lot, it might introduced a certain
latency until your softirq-worker gets its turn. The workqueue will
spawn new workers if the current worker schedules out but until that
happens you have to wait. I am not sure if this is intended or whether
this might be a problem. I think you could argue either way depending on
what you currently think is more important.
Further, schedule_work_on(x, ) does not guarentee that the work item is
invoked on CPU x. It tries that but if CPU x goes down due to
CPU-hotplug then the workitem will be moved to random CPU. For that
reason we have work_on_cpu_safe() but you don't want to use that / flush
that workqueue while in here.

May I instead suggest to stick to ksoftirqd? So you run in softirq
context (after return from IRQ) and if takes too long, you offload the
vector to ksoftirqd instead. You may want to play with the metric on
which you decide when you want switch to ksoftirqd / account how long a
vector runs.

> +	else
> +		softirq->pending_work_mask &= ~vec_bit;
> +
> +	lockdep_softirq_exit();
> +	account_irq_exit_time(current);
> +	__local_bh_enable(SOFTIRQ_OFFSET);
> +	local_irq_enable();
> +}

Sebastian

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-02-08 17:44   ` Sebastian Andrzej Siewior
@ 2018-02-08 18:45     ` David Miller
  2018-02-08 20:14       ` Dmitry Safonov
  2018-02-15 16:13     ` Frederic Weisbecker
  1 sibling, 1 reply; 39+ messages in thread
From: David Miller @ 2018-02-08 18:45 UTC (permalink / raw)
  To: bigeasy
  Cc: frederic, linux-kernel, alexander.levin, peterz, mchehab,
	torvalds, hannes, paulmck, wanpeng.li, dima, tglx, akpm, pabeni,
	rrendec, mingo, sgruszka, riel, edumazet

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Thu, 8 Feb 2018 18:44:52 +0100

> May I instead suggest to stick to ksoftirqd? So you run in softirq
> context (after return from IRQ) and if takes too long, you offload the
> vector to ksoftirqd instead. You may want to play with the metric on
> which you decide when you want switch to ksoftirqd / account how long a
> vector runs.

Having read over this stuff for the past few weeks this is how I feel
as well.  Just make ksofbitrq do what we want (only execute the
overloaded softirq vectors).

The more I look at the workqueue stuff, the more complications and
weird behavioral artifacts we are getting for questionable gain.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-02-08 18:45     ` David Miller
@ 2018-02-08 20:14       ` Dmitry Safonov
  2018-02-08 20:22         ` David Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Dmitry Safonov @ 2018-02-08 20:14 UTC (permalink / raw)
  To: David Miller, bigeasy
  Cc: frederic, linux-kernel, alexander.levin, peterz, mchehab,
	torvalds, hannes, paulmck, wanpeng.li, tglx, akpm, pabeni,
	rrendec, mingo, sgruszka, riel, edumazet

On Thu, 2018-02-08 at 13:45 -0500, David Miller wrote:
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Date: Thu, 8 Feb 2018 18:44:52 +0100
> 
> > May I instead suggest to stick to ksoftirqd? So you run in softirq
> > context (after return from IRQ) and if takes too long, you offload
> the
> > vector to ksoftirqd instead. You may want to play with the metric
> on
> > which you decide when you want switch to ksoftirqd / account how
> long a
> > vector runs.
> 
> Having read over this stuff for the past few weeks this is how I feel
> as well.  Just make ksofbitrq do what we want (only execute the
> overloaded softirq vectors).
> 
> The more I look at the workqueue stuff, the more complications and
> weird behavioral artifacts we are getting for questionable gain.

What about creating several ksoftirqd threads per-cpu?
Like I did with boot parameter to specify how many threads and which
softirqs to serve.

-- 
Thanks,
             Dmitry

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-02-08 20:14       ` Dmitry Safonov
@ 2018-02-08 20:22         ` David Miller
  2018-02-08 20:30           ` Dmitry Safonov
  0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2018-02-08 20:22 UTC (permalink / raw)
  To: dima
  Cc: bigeasy, frederic, linux-kernel, alexander.levin, peterz,
	mchehab, torvalds, hannes, paulmck, wanpeng.li, tglx, akpm,
	pabeni, rrendec, mingo, sgruszka, riel, edumazet

From: Dmitry Safonov <dima@arista.com>
Date: Thu, 08 Feb 2018 20:14:55 +0000

> On Thu, 2018-02-08 at 13:45 -0500, David Miller wrote:
>> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> Date: Thu, 8 Feb 2018 18:44:52 +0100
>> 
>> > May I instead suggest to stick to ksoftirqd? So you run in softirq
>> > context (after return from IRQ) and if takes too long, you offload
>> the
>> > vector to ksoftirqd instead. You may want to play with the metric
>> on
>> > which you decide when you want switch to ksoftirqd / account how
>> long a
>> > vector runs.
>> 
>> Having read over this stuff for the past few weeks this is how I feel
>> as well.  Just make ksofbitrq do what we want (only execute the
>> overloaded softirq vectors).
>> 
>> The more I look at the workqueue stuff, the more complications and
>> weird behavioral artifacts we are getting for questionable gain.
> 
> What about creating several ksoftirqd threads per-cpu?
> Like I did with boot parameter to specify how many threads and which
> softirqs to serve.

Why do we need more than one per cpu?

There is a set of vectors which are "overloaded" and ksoftirqd processes
them one by one.

The only difference with what happens now is that one softirq being
overloaded doesn't defer the processing of all softirqs to ksoftirqd.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-02-08 20:22         ` David Miller
@ 2018-02-08 20:30           ` Dmitry Safonov
  2018-02-09  4:11             ` Mike Galbraith
  0 siblings, 1 reply; 39+ messages in thread
From: Dmitry Safonov @ 2018-02-08 20:30 UTC (permalink / raw)
  To: David Miller
  Cc: bigeasy, frederic, linux-kernel, alexander.levin, peterz,
	mchehab, torvalds, hannes, paulmck, wanpeng.li, tglx, akpm,
	pabeni, rrendec, mingo, sgruszka, riel, edumazet

On Thu, 2018-02-08 at 15:22 -0500, David Miller wrote:
> From: Dmitry Safonov <dima@arista.com>
> Date: Thu, 08 Feb 2018 20:14:55 +0000
> 
> > On Thu, 2018-02-08 at 13:45 -0500, David Miller wrote:
> >> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> >> Date: Thu, 8 Feb 2018 18:44:52 +0100
> >> 
> >> > May I instead suggest to stick to ksoftirqd? So you run in
> softirq
> >> > context (after return from IRQ) and if takes too long, you
> offload
> >> the
> >> > vector to ksoftirqd instead. You may want to play with the
> metric
> >> on
> >> > which you decide when you want switch to ksoftirqd / account how
> >> long a
> >> > vector runs.
> >> 
> >> Having read over this stuff for the past few weeks this is how I
> feel
> >> as well.  Just make ksofbitrq do what we want (only execute the
> >> overloaded softirq vectors).
> >> 
> >> The more I look at the workqueue stuff, the more complications and
> >> weird behavioral artifacts we are getting for questionable gain.
> > 
> > What about creating several ksoftirqd threads per-cpu?
> > Like I did with boot parameter to specify how many threads and
> which
> > softirqs to serve.
> 
> Why do we need more than one per cpu?

Ugh, yeah, I remember why I did it - I tried to reuse scheduler for
each ksoftirqd thread to decide if it need to run now or later.
That would give an admin a way to prioritise softirqs with nice.
Not sure if it's a nice idea at all..

> 
> There is a set of vectors which are "overloaded" and ksoftirqd
> processes
> them one by one.
> 
> The only difference with what happens now is that one softirq being
> overloaded doesn't defer the processing of all softirqs to ksoftirqd.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-02-08 20:30           ` Dmitry Safonov
@ 2018-02-09  4:11             ` Mike Galbraith
  2018-02-09 12:35               ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Galbraith @ 2018-02-09  4:11 UTC (permalink / raw)
  To: Dmitry Safonov, David Miller
  Cc: bigeasy, frederic, linux-kernel, alexander.levin, peterz,
	mchehab, torvalds, hannes, paulmck, wanpeng.li, tglx, akpm,
	pabeni, rrendec, mingo, sgruszka, riel, edumazet

On Thu, 2018-02-08 at 20:30 +0000, Dmitry Safonov wrote:
> On Thu, 2018-02-08 at 15:22 -0500, David Miller wrote:
> > From: Dmitry Safonov <dima@arista.com>
> > Date: Thu, 08 Feb 2018 20:14:55 +0000
> > 
> > > On Thu, 2018-02-08 at 13:45 -0500, David Miller wrote:
> > >> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > >> Date: Thu, 8 Feb 2018 18:44:52 +0100
> > >> 
> > >> > May I instead suggest to stick to ksoftirqd? So you run in
> > softirq
> > >> > context (after return from IRQ) and if takes too long, you
> > offload
> > >> the
> > >> > vector to ksoftirqd instead. You may want to play with the
> > metric
> > >> on
> > >> > which you decide when you want switch to ksoftirqd / account how
> > >> long a
> > >> > vector runs.
> > >> 
> > >> Having read over this stuff for the past few weeks this is how I
> > feel
> > >> as well.  Just make ksofbitrq do what we want (only execute the
> > >> overloaded softirq vectors).
> > >> 
> > >> The more I look at the workqueue stuff, the more complications and
> > >> weird behavioral artifacts we are getting for questionable gain.
> > > 
> > > What about creating several ksoftirqd threads per-cpu?
> > > Like I did with boot parameter to specify how many threads and
> > which
> > > softirqs to serve.
> > 
> > Why do we need more than one per cpu?
> 
> Ugh, yeah, I remember why I did it - I tried to reuse scheduler for
> each ksoftirqd thread to decide if it need to run now or later.
> That would give an admin a way to prioritise softirqs with nice.
> Not sure if it's a nice idea at all..

For RT that can be handy, but for the general case it's a waste of
cycles, so would want to be opt-in.

	-Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-02-09  4:11             ` Mike Galbraith
@ 2018-02-09 12:35               ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 39+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-02-09 12:35 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Dmitry Safonov, David Miller, frederic, linux-kernel,
	alexander.levin, peterz, mchehab, torvalds, hannes, paulmck,
	wanpeng.li, tglx, akpm, pabeni, rrendec, mingo, sgruszka, riel,
	edumazet

On 2018-02-09 05:11:08 [+0100], Mike Galbraith wrote:
> On Thu, 2018-02-08 at 20:30 +0000, Dmitry Safonov wrote:
> > On Thu, 2018-02-08 at 15:22 -0500, David Miller wrote:
> > > Why do we need more than one per cpu?
> > 
> > Ugh, yeah, I remember why I did it - I tried to reuse scheduler for
> > each ksoftirqd thread to decide if it need to run now or later.
> > That would give an admin a way to prioritise softirqs with nice.
> > Not sure if it's a nice idea at all..
> 
> For RT that can be handy, but for the general case it's a waste of
> cycles, so would want to be opt-in.

We used to have it in RT. It got replaced because the context switch from
one thread to the other simply took time.
If you need to prioritise one softirqs over the other I think you would
be better off running with threaded-interrupts. In "normal" mode, if you
have a raised NAPI and say a TASKLET then once both IRQ handler complete
(and interrupts get enabled again) then the softirqs are run. With
threaded interrupts enabled, the NAPI-softirq is invoked once the
threaded-handler completes that raised it - same goes the other threaded
handler which raised the tasklet softirq.
So you could simply change the priority of the threaded interrupt in
order to prefer NAPI over tasklet handling (or the other way around).

> 	-Mike

Sebastian

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-02-08 17:44   ` Sebastian Andrzej Siewior
  2018-02-08 18:45     ` David Miller
@ 2018-02-15 16:13     ` Frederic Weisbecker
  2018-02-15 16:58       ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2018-02-15 16:13 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: LKML, Levin Alexander, Peter Zijlstra, Mauro Carvalho Chehab,
	Linus Torvalds, Hannes Frederic Sowa, Paul E . McKenney,
	Wanpeng Li, Dmitry Safonov, Thomas Gleixner, Andrew Morton,
	Paolo Abeni, Radu Rendec, Ingo Molnar, Stanislaw Gruszka,
	Rik van Riel, Eric Dumazet, David Miller

On Thu, Feb 08, 2018 at 06:44:52PM +0100, Sebastian Andrzej Siewior wrote:
> On 2018-01-19 16:46:12 [+0100], Frederic Weisbecker wrote:
> > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > index c8c6841..becb1d9 100644
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -62,6 +62,19 @@ const char * const softirq_to_name[NR_SOFTIRQS] = {
> …
> > +static void vector_work_func(struct work_struct *work)
> > +{
> > +	struct vector *vector = container_of(work, struct vector, work);
> > +	struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
> > +	int vec_nr = vector->nr;
> > +	int vec_bit = BIT(vec_nr);
> > +	u32 pending;
> > +
> > +	local_irq_disable();
> > +	pending = local_softirq_pending();
> > +	account_irq_enter_time(current);
> > +	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
> > +	lockdep_softirq_enter();
> > +	set_softirq_pending(pending & ~vec_bit);
> > +	local_irq_enable();
> > +
> > +	if (pending & vec_bit) {
> > +		struct softirq_action *sa = &softirq_vec[vec_nr];
> > +
> > +		kstat_incr_softirqs_this_cpu(vec_nr);
> > +		softirq->work_running = 1;
> > +		trace_softirq_entry(vec_nr);
> > +		sa->action(sa);
> 
> You invoke the softirq handler while BH is disabled (not wrong, I just
> state the obvious). That means, the scheduler can't preempt/interrupt
> the workqueue/BH-handler while it is invoked so it has to wait until it
> completes its doing.
> In do_softirq_workqueue() you schedule multiple workqueue items (one for
> each softirq vector) which is unnecessary because they can't preempt one
> another and should be invoked the order they were enqueued. So it would
> be enough to enqueue one item because it is serialized after all. So one
> work_struct per CPU with a cond_resched_rcu_qs() while switching from one
> vector to another should accomplish that what you have now here (not
> sure if that cond_resched after each vector is needed). But…

Makes sense.

> 
> > +		trace_softirq_exit(vec_nr);
> > +		softirq->work_running = 0;
> > +	}
> > +
> > +	local_irq_disable();
> > +
> > +	pending = local_softirq_pending();
> > +	if (pending & vec_bit)
> > +		schedule_work_on(smp_processor_id(), &vector->work);
> 
> … on a system that is using system_wq a lot, it might introduced a certain
> latency until your softirq-worker gets its turn. The workqueue will
> spawn new workers if the current worker schedules out but until that
> happens you have to wait. I am not sure if this is intended or whether
> this might be a problem. I think you could argue either way depending on
> what you currently think is more important.

Indeed :)

> Further, schedule_work_on(x, ) does not guarentee that the work item is
> invoked on CPU x. It tries that but if CPU x goes down due to
> CPU-hotplug then the workitem will be moved to random CPU. For that
> reason we have work_on_cpu_safe() but you don't want to use that / flush
> that workqueue while in here.

Yeah, someone also reported me that hotplug issue. I didn't think workqueue
would break the affinity but here it does. So we would need a hotplug hook
indeed.

> 
> May I instead suggest to stick to ksoftirqd? So you run in softirq
> context (after return from IRQ) and if takes too long, you offload the
> vector to ksoftirqd instead. You may want to play with the metric on
> which you decide when you want switch to ksoftirqd / account how long a
> vector runs.

Yeah that makes sense. These workqueues are too much headaches eventually.
I'm going to try that ksoftirqd thing.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue
  2018-02-15 16:13     ` Frederic Weisbecker
@ 2018-02-15 16:58       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 39+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-02-15 16:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Levin Alexander, Peter Zijlstra, Mauro Carvalho Chehab,
	Linus Torvalds, Hannes Frederic Sowa, Paul E . McKenney,
	Wanpeng Li, Dmitry Safonov, Thomas Gleixner, Andrew Morton,
	Paolo Abeni, Radu Rendec, Ingo Molnar, Stanislaw Gruszka,
	Rik van Riel, Eric Dumazet, David Miller

On 2018-02-15 17:13:52 [+0100], Frederic Weisbecker wrote:
> Yeah that makes sense. These workqueues are too much headaches eventually.
> I'm going to try that ksoftirqd thing.

I may have something for that USB problem, I just need some testing
before posting…

> Thanks.

Sebastian

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/4] softirq: Per vector threading v3
  2018-02-07 14:18 ` Mauro Carvalho Chehab
@ 2018-03-01 15:21   ` Frederic Weisbecker
  0 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2018-03-01 15:21 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: LKML, Levin Alexander, Peter Zijlstra, Linus Torvalds,
	Hannes Frederic Sowa, Paul E . McKenney, Wanpeng Li,
	Dmitry Safonov, Thomas Gleixner, Andrew Morton, Paolo Abeni,
	Radu Rendec, Ingo Molnar, Stanislaw Gruszka, Rik van Riel,
	Eric Dumazet, David Miller

On Wed, Feb 07, 2018 at 12:18:44PM -0200, Mauro Carvalho Chehab wrote:
> Em Fri, 19 Jan 2018 16:46:10 +0100
> Frederic Weisbecker <frederic@kernel.org> escreveu:
> 
> > As per Linus suggestion, this take doesn't limit the number of occurences
> > per jiffy anymore but instead defers a vector to workqueues as soon as
> > it gets re-enqueued on IRQ tail.
> > 
> > No tunable here, so testing should be easier.
> 
> Sorry for taking so long to test it. Just returned this week from a 2
> weeks travel, and had first to submit media patches for the merge window.
> 
> Didn't work. With this patch series, it still lose frames with
> usb bulk transfers, while recording DVB programs with tvheadend.

How can we reproduce this workload?

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2018-03-01 15:21 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-19 15:46 [RFC PATCH 0/4] softirq: Per vector threading v3 Frederic Weisbecker
2018-01-19 15:46 ` [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail Frederic Weisbecker
2018-01-19 16:16   ` David Miller
2018-01-19 18:25     ` Linus Torvalds
2018-01-19 18:47       ` David Miller
2018-01-21 16:30         ` Frederic Weisbecker
2018-01-21 16:57           ` David Miller
2018-01-19 15:46 ` [RFC PATCH 2/4] softirq: Per vector deferment to workqueue Frederic Weisbecker
2018-01-20  8:41   ` Pavan Kondeti
2018-01-21 16:11     ` Frederic Weisbecker
2018-01-21 17:50       ` Pavan Kondeti
2018-01-21 20:48         ` Frederic Weisbecker
2018-02-08 17:44   ` Sebastian Andrzej Siewior
2018-02-08 18:45     ` David Miller
2018-02-08 20:14       ` Dmitry Safonov
2018-02-08 20:22         ` David Miller
2018-02-08 20:30           ` Dmitry Safonov
2018-02-09  4:11             ` Mike Galbraith
2018-02-09 12:35               ` Sebastian Andrzej Siewior
2018-02-15 16:13     ` Frederic Weisbecker
2018-02-15 16:58       ` Sebastian Andrzej Siewior
2018-01-19 15:46 ` [RFC PATCH 3/4] softirq: Defer to workqueue when rescheduling is needed Frederic Weisbecker
2018-01-19 15:46 ` [RFC PATCH 4/4] softirq: Replace ksoftirqd with workqueues entirely Frederic Weisbecker
2018-01-22 19:58 ` [RFC PATCH 0/4] softirq: Per vector threading v3 Mauro Carvalho Chehab
2018-01-23 10:13 ` Paolo Abeni
2018-01-23 12:32   ` Dmitry Safonov
2018-01-24  2:12     ` Frederic Weisbecker
2018-01-23 16:22   ` David Miller
2018-01-23 16:57     ` Paolo Abeni
2018-01-23 17:42       ` Linus Torvalds
2018-01-23 18:01         ` Mike Galbraith
2018-01-23 18:24         ` David Miller
2018-01-24  1:57           ` Frederic Weisbecker
2018-01-24  2:01             ` Frederic Weisbecker
2018-01-24 14:54         ` Paolo Abeni
2018-01-24 15:05           ` David Miller
2018-01-24 16:11             ` Paolo Abeni
2018-02-07 14:18 ` Mauro Carvalho Chehab
2018-03-01 15:21   ` Frederic Weisbecker

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.