* [PATCH RT] sched/core: Avoid __schedule() being called twice, the second in vain
@ 2018-07-30 13:00 Daniel Bristot de Oliveira
  2018-08-01 11:18 ` Sebastian Andrzej Siewior
  2019-04-16 15:33 ` [tip:sched/core] sched/core, workqueues: Distangle worker accounting from rq lock tip-bot for Thomas Gleixner
  0 siblings, 2 replies; 3+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-07-30 13:00 UTC (permalink / raw)
  To: linux-rt-users
  Cc: Clark Williams, Tommaso Cucinotta, Romulo da Silva de Oliveira,
	Sebastian Andrzej Siewior, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, LKML

Before calling __schedule() to cause a context switch, the schedule()
function runs sched_submit_work() to dispatch deferred work that was
postponed until the point where the thread leaves the processor.
The function sched_submit_work() checks for the need to wake up a
kworker; if needed, a kworker thread is awakened. In this context,
the following behavior takes place.

A kworker (let's call it kworker_1) is running and enqueues a postponed
work item (this is possible). kworker_1 then sets its state to "sleepable"
and calls schedule() to leave the processor.

At this point, kworker_1 is at line 1 of the pseudo-code below.

----------------------------- %< ----------------------------------------
  1 void schedule(void)
  2 {
  3	sched_submit_work(){
  4		wq_worker_sleeping() {
  5			preempt_disable();
  6			wakeup(kworker_2);
  7			preempt_enable() {
  8				should_resched() {
  9					__schedule() {
 10						context_switch
 11					}
 12				}
 13			}
 14		}
 15	}
 16	do {
 17		preempt_disable();
 18		__schedule();
 19		sched_preempt_enable_no_resched();
 20	} while (need_resched());
 21	sched_update_worker();
 22 }
----------------------------- >% ----------------------------------------

Then sched_submit_work() is called. During the execution of
sched_submit_work(), preemption is disabled, a second kworker is
awakened (let's call it kworker_2), and then preemption is enabled again.

In preempt_enable(), after re-enabling preemption, the system checks
whether it needs to reschedule. As kworker_1 is in the sleepable state
and kworker_2 was just awakened, this is a case for rescheduling.
__schedule() is then called at line 9, and a context switch happens
at line 10.

While other threads run, kworker_1 is awakened, is scheduled, and returns
to execution at line 11. Continuing the execution, the code returns to
the schedule() body at line 15. Preemption is then disabled and
__schedule() is called again at line 18. As __schedule() just returned
at line 11, there is nothing to schedule, and so the __schedule() at
line 18 is called in vain.

A trace with details of this execution can be found here:
  http://bristot.me/__schedule-being-called-twice-the-second-in-vain/

Knowing that sched_submit_work() is called only from inside schedule(),
and that it does not suspend or block, one way to avoid the first
__schedule() is to disable preemption before calling wq_worker_sleeping(),
and then use sched_preempt_enable_no_resched() to re-enable preemption
after wq_worker_sleeping() returns.
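
To see why this helps: preempt_enable() may reschedule immediately, while
sched_preempt_enable_no_resched() only drops the preemption count. A
simplified sketch of the two macros (paraphrased from the CONFIG_PREEMPT
variants in include/linux/preempt.h, not the verbatim source):

----------------------------- %< ----------------------------------------
#define preempt_enable()					\
do {								\
	barrier();						\
	if (unlikely(preempt_count_dec_and_test()))		\
		__preempt_schedule();  /* may run __schedule() right here */ \
} while (0)

#define sched_preempt_enable_no_resched()			\
do {								\
	barrier();						\
	preempt_count_dec();   /* drop the count, leave need_resched pending */ \
} while (0)
----------------------------- >% ----------------------------------------

With the no_resched variant, the wakeup of kworker_2 no longer triggers the
__schedule() at line 9 of the pseudo-code; the pending need_resched is
handled by the __schedule() at line 18 instead.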

This patch is RT specific because, on non-rt, wq_worker_sleeping() is
already called with preemption disabled; see the -rt patch:
  sched: Distangle worker accounting from rqlock

Note: A "similar" behavior can happen in blk_schedule_flush_plug(), but
that case is different. There, the problem can occur because of
rt_spin_lock(), but we cannot convert those locks to raw_spin_lock()
because that could potentially cause latency spikes.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Cc: Romulo da Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
---
 kernel/sched/core.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 70641e8f3c89..7a48619af3e4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3545,9 +3545,16 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	/*
 	 * If a worker went to sleep, notify and ask workqueue whether
 	 * it wants to wake up a task to maintain concurrency.
+	 *
+	 * As this function is called inside the schedule() context,
+	 * we disable preemption to avoid it calling schedule() again
+	 * in the possible wakeup of a kworker.
 	 */
-	if (tsk->flags & PF_WQ_WORKER)
+	if (tsk->flags & PF_WQ_WORKER) {
+		preempt_disable();
 		wq_worker_sleeping(tsk);
+		preempt_enable_no_resched();
+	}
 
 
 	if (tsk_is_pi_blocked(tsk))
-- 
2.17.1


* Re: [PATCH RT] sched/core: Avoid __schedule() being called twice, the second in vain
  2018-07-30 13:00 [PATCH RT] sched/core: Avoid __schedule() being called twice, the second in vain Daniel Bristot de Oliveira
@ 2018-08-01 11:18 ` Sebastian Andrzej Siewior
  2019-04-16 15:33 ` [tip:sched/core] sched/core, workqueues: Distangle worker accounting from rq lock tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 3+ messages in thread
From: Sebastian Andrzej Siewior @ 2018-08-01 11:18 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-rt-users, Clark Williams, Tommaso Cucinotta,
	Romulo da Silva de Oliveira, Steven Rostedt, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, LKML

On 2018-07-30 15:00:00 [+0200], Daniel Bristot de Oliveira wrote:
> Before calling __schedule() to cause a context switch, the schedule()
> function runs sched_submit_work() to dispatch deferred work that was
> postponed to the point that the thread is leaving the processor.
> The function sched_submit_work() checks for the need to wake up a
> kworker; if needed, a kworker thread is awakened. In this context,
> the following behavior takes place.
> at line 11, there is nothing to schedule, and so the __schedule() at
> line 18 is called in vain.

I read this as an optimisation for the common case, as you put it.
Applied.

Sebastian

* [tip:sched/core] sched/core, workqueues: Distangle worker accounting from rq lock
  2018-07-30 13:00 [PATCH RT] sched/core: Avoid __schedule() being called twice, the second in vain Daniel Bristot de Oliveira
  2018-08-01 11:18 ` Sebastian Andrzej Siewior
@ 2019-04-16 15:33 ` tip-bot for Thomas Gleixner
  1 sibling, 0 replies; 3+ messages in thread
From: tip-bot for Thomas Gleixner @ 2019-04-16 15:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, bristot, tglx, jiangshanlai, tj, linux-kernel, torvalds,
	bigeasy, peterz, mingo

Commit-ID:  6d25be5782e482eb93e3de0c94d0a517879377d0
Gitweb:     https://git.kernel.org/tip/6d25be5782e482eb93e3de0c94d0a517879377d0
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Wed, 13 Mar 2019 17:55:48 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 16 Apr 2019 16:55:15 +0200

sched/core, workqueues: Distangle worker accounting from rq lock

The worker accounting for CPU bound workers is plugged into the core
scheduler code and the wakeup code. This is not a hard requirement and
can be avoided by keeping track of the state in the workqueue code
itself.

Keep track of the sleeping state in the worker itself and call the
notifier before entering the core scheduler. There might be false
positives when the task is woken between that call and actually
scheduling, but that's not really different from scheduling and being
woken immediately after switching away. When nr_running is updated when
the task is returning from schedule(), then it is later compared when it
is done from ttwu().
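
A rough sketch of the resulting call flow (an editorial paraphrase of what
the diff below implements, not the literal code):

----------------------------- %< ----------------------------------------
schedule()
	sched_submit_work(tsk)
		if (tsk->flags & PF_WQ_WORKER) {
			preempt_disable();
			wq_worker_sleeping(tsk);    /* worker->sleeping = 1, may wake an idle worker */
			preempt_enable_no_resched();
		}
	do {
		preempt_disable();
		__schedule(false);                  /* no workqueue hooks in here anymore */
		sched_preempt_enable_no_resched();
	} while (need_resched());
	sched_update_worker(tsk)
		if (tsk->flags & PF_WQ_WORKER)
			wq_worker_running(tsk);     /* worker->sleeping = 0, restore nr_running */
----------------------------- >% ----------------------------------------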

[ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c         | 88 +++++++++++----------------------------------
 kernel/workqueue.c          | 54 +++++++++++++---------------
 kernel/workqueue_internal.h |  5 +--
 3 files changed, 48 insertions(+), 99 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4778c48a7fda..6184a0856aab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1685,10 +1685,6 @@ static inline void ttwu_activate(struct rq *rq, struct task_struct *p, int en_fl
 {
 	activate_task(rq, p, en_flags);
 	p->on_rq = TASK_ON_RQ_QUEUED;
-
-	/* If a worker is waking up, notify the workqueue: */
-	if (p->flags & PF_WQ_WORKER)
-		wq_worker_waking_up(p, cpu_of(rq));
 }
 
 /*
@@ -2106,56 +2102,6 @@ out:
 	return success;
 }
 
-/**
- * try_to_wake_up_local - try to wake up a local task with rq lock held
- * @p: the thread to be awakened
- * @rf: request-queue flags for pinning
- *
- * Put @p on the run-queue if it's not already there. The caller must
- * ensure that this_rq() is locked, @p is bound to this_rq() and not
- * the current task.
- */
-static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
-{
-	struct rq *rq = task_rq(p);
-
-	if (WARN_ON_ONCE(rq != this_rq()) ||
-	    WARN_ON_ONCE(p == current))
-		return;
-
-	lockdep_assert_held(&rq->lock);
-
-	if (!raw_spin_trylock(&p->pi_lock)) {
-		/*
-		 * This is OK, because current is on_cpu, which avoids it being
-		 * picked for load-balance and preemption/IRQs are still
-		 * disabled avoiding further scheduler activity on it and we've
-		 * not yet picked a replacement task.
-		 */
-		rq_unlock(rq, rf);
-		raw_spin_lock(&p->pi_lock);
-		rq_relock(rq, rf);
-	}
-
-	if (!(p->state & TASK_NORMAL))
-		goto out;
-
-	trace_sched_waking(p);
-
-	if (!task_on_rq_queued(p)) {
-		if (p->in_iowait) {
-			delayacct_blkio_end(p);
-			atomic_dec(&rq->nr_iowait);
-		}
-		ttwu_activate(rq, p, ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK);
-	}
-
-	ttwu_do_wakeup(rq, p, 0, rf);
-	ttwu_stat(p, smp_processor_id(), 0);
-out:
-	raw_spin_unlock(&p->pi_lock);
-}
-
 /**
  * wake_up_process - Wake up a specific process
  * @p: The process to be woken up.
@@ -3472,19 +3418,6 @@ static void __sched notrace __schedule(bool preempt)
 				atomic_inc(&rq->nr_iowait);
 				delayacct_blkio_start();
 			}
-
-			/*
-			 * If a worker went to sleep, notify and ask workqueue
-			 * whether it wants to wake up a task to maintain
-			 * concurrency.
-			 */
-			if (prev->flags & PF_WQ_WORKER) {
-				struct task_struct *to_wakeup;
-
-				to_wakeup = wq_worker_sleeping(prev);
-				if (to_wakeup)
-					try_to_wake_up_local(to_wakeup, &rf);
-			}
 		}
 		switch_count = &prev->nvcsw;
 	}
@@ -3544,6 +3477,20 @@ static inline void sched_submit_work(struct task_struct *tsk)
 {
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
+
+	/*
+	 * If a worker went to sleep, notify and ask workqueue whether
+	 * it wants to wake up a task to maintain concurrency.
+	 * As this function is called inside the schedule() context,
+	 * we disable preemption to avoid it calling schedule() again
+	 * in the possible wakeup of a kworker.
+	 */
+	if (tsk->flags & PF_WQ_WORKER) {
+		preempt_disable();
+		wq_worker_sleeping(tsk);
+		preempt_enable_no_resched();
+	}
+
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
@@ -3552,6 +3499,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
 		blk_schedule_flush_plug(tsk);
 }
 
+static void sched_update_worker(struct task_struct *tsk)
+{
+	if (tsk->flags & PF_WQ_WORKER)
+		wq_worker_running(tsk);
+}
+
 asmlinkage __visible void __sched schedule(void)
 {
 	struct task_struct *tsk = current;
@@ -3562,6 +3515,7 @@ asmlinkage __visible void __sched schedule(void)
 		__schedule(false);
 		sched_preempt_enable_no_resched();
 	} while (need_resched());
+	sched_update_worker(tsk);
 }
 EXPORT_SYMBOL(schedule);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ddee541ea97a..56180c9286f5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -841,43 +841,32 @@ static void wake_up_worker(struct worker_pool *pool)
 }
 
 /**
- * wq_worker_waking_up - a worker is waking up
+ * wq_worker_running - a worker is running again
  * @task: task waking up
- * @cpu: CPU @task is waking up to
  *
- * This function is called during try_to_wake_up() when a worker is
- * being awoken.
- *
- * CONTEXT:
- * spin_lock_irq(rq->lock)
+ * This function is called when a worker returns from schedule()
  */
-void wq_worker_waking_up(struct task_struct *task, int cpu)
+void wq_worker_running(struct task_struct *task)
 {
 	struct worker *worker = kthread_data(task);
 
-	if (!(worker->flags & WORKER_NOT_RUNNING)) {
-		WARN_ON_ONCE(worker->pool->cpu != cpu);
+	if (!worker->sleeping)
+		return;
+	if (!(worker->flags & WORKER_NOT_RUNNING))
 		atomic_inc(&worker->pool->nr_running);
-	}
+	worker->sleeping = 0;
 }
 
 /**
  * wq_worker_sleeping - a worker is going to sleep
  * @task: task going to sleep
  *
- * This function is called during schedule() when a busy worker is
- * going to sleep.  Worker on the same cpu can be woken up by
- * returning pointer to its task.
- *
- * CONTEXT:
- * spin_lock_irq(rq->lock)
- *
- * Return:
- * Worker task on @cpu to wake up, %NULL if none.
+ * This function is called from schedule() when a busy worker is
+ * going to sleep.
  */
-struct task_struct *wq_worker_sleeping(struct task_struct *task)
+void wq_worker_sleeping(struct task_struct *task)
 {
-	struct worker *worker = kthread_data(task), *to_wakeup = NULL;
+	struct worker *next, *worker = kthread_data(task);
 	struct worker_pool *pool;
 
 	/*
@@ -886,13 +875,15 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task)
 	 * checking NOT_RUNNING.
 	 */
 	if (worker->flags & WORKER_NOT_RUNNING)
-		return NULL;
+		return;
 
 	pool = worker->pool;
 
-	/* this can only happen on the local cpu */
-	if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id()))
-		return NULL;
+	if (WARN_ON_ONCE(worker->sleeping))
+		return;
+
+	worker->sleeping = 1;
+	spin_lock_irq(&pool->lock);
 
 	/*
 	 * The counterpart of the following dec_and_test, implied mb,
@@ -906,9 +897,12 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task)
 	 * lock is safe.
 	 */
 	if (atomic_dec_and_test(&pool->nr_running) &&
-	    !list_empty(&pool->worklist))
-		to_wakeup = first_idle_worker(pool);
-	return to_wakeup ? to_wakeup->task : NULL;
+	    !list_empty(&pool->worklist)) {
+		next = first_idle_worker(pool);
+		if (next)
+			wake_up_process(next->task);
+	}
+	spin_unlock_irq(&pool->lock);
 }
 
 /**
@@ -4929,7 +4923,7 @@ static void rebind_workers(struct worker_pool *pool)
 		 *
 		 * WRITE_ONCE() is necessary because @worker->flags may be
 		 * tested without holding any lock in
-		 * wq_worker_waking_up().  Without it, NOT_RUNNING test may
+		 * wq_worker_running().  Without it, NOT_RUNNING test may
 		 * fail incorrectly leading to premature concurrency
 		 * management operations.
 		 */
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index cb68b03ca89a..498de0e909a4 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -44,6 +44,7 @@ struct worker {
 	unsigned long		last_active;	/* L: last active timestamp */
 	unsigned int		flags;		/* X: flags */
 	int			id;		/* I: worker id */
+	int			sleeping;	/* None */
 
 	/*
 	 * Opaque string set with work_set_desc().  Printed out with task
@@ -72,8 +73,8 @@ static inline struct worker *current_wq_worker(void)
  * Scheduler hooks for concurrency managed workqueue.  Only to be used from
  * sched/ and workqueue.c.
  */
-void wq_worker_waking_up(struct task_struct *task, int cpu);
-struct task_struct *wq_worker_sleeping(struct task_struct *task);
+void wq_worker_running(struct task_struct *task);
+void wq_worker_sleeping(struct task_struct *task);
 work_func_t wq_worker_last_func(struct task_struct *task);
 
 #endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
